Markup languages and reproducible programming in statistics
R
There are several ‘layers’ in R. Some layers you are allowed to fiddle around in, some are forbidden. In general there is the following distinction:
The global environment can be seen as an Olympic-size swimming pool. Everything you do has its place there.
If you’d like, you may create another, separate environment to work in.
If you create a function, it is positioned in the global environment.
Everything that happens in a function, stays in a function. Unless you specifically tell the function to share the information with the global environment.
Think of a function as a shampoo bottle in a swimming pool to which you add some water. To see the color of the mixture, you have to squeeze the bottle so that it comes out.
Packages have their own space.
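The distinction above can be illustrated with a minimal sketch. All object names (pool_value, mix, mixture_colour, side_pool, share, shared_value) are hypothetical and chosen only to mirror the swimming-pool analogy.

```r
# An object in the global environment (the "swimming pool")
pool_value <- 10

# A function has its own local environment (the "shampoo bottle")
mix <- function(water) {
  mixture_colour <- "blue"   # exists only while the function runs
  pool_value <- water * 2    # a local object; the global pool_value is untouched
  pool_value                 # "squeezing the bottle": return the result
}

result <- mix(3)
result                       # 6
pool_value                   # still 10
exists("mixture_colour")     # FALSE: local objects do not leak out

# A separate environment to work in
side_pool <- new.env()
assign("pool_value", 99, envir = side_pool)
get("pool_value", envir = side_pool)   # 99, independent of the global pool_value

# Explicitly sharing information with the global environment
share <- function() {
  shared_value <<- "visible outside"   # superassignment writes outside the function
}
share()
shared_value

# Packages have their own space: sd() lives in the namespace of stats
environmentName(environment(sd))       # "stats"
```

Returning a value is usually the cleaner way to get information out of a function; the superassignment operator is shown here only to make the "unless you specifically tell the function" case concrete.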
There are two ways to load a package in R:
library(stats)
and
require(stats)
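The two calls differ mainly in how they react to a package that is not installed. The sketch below shows this; the package name in missing_pkg is a hypothetical placeholder assumed not to exist on your system.

```r
# require() reports success or failure as a logical value
has_stats <- require(stats)            # TRUE: stats ships with R

# Hypothetical package name, assumed NOT to be installed
missing_pkg <- "aPackageThatDoesNotExist"

# require() returns FALSE (with a warning) when the package is missing
has_missing <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)

# library() stops with an error instead, which we catch here
library_result <- tryCatch(
  {
    library(missing_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)

has_missing      # FALSE
library_result   # "error"
```

In scripts, library() is often the safer choice because a missing package fails immediately rather than at the first call to one of its functions.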
R in depth
A workspace contains all changes you made to environments, functions and namespaces.
A saved workspace contains everything as it was at the moment of saving.
You do not need to run all the previous code again if you would like to continue working at a later time.
Workspaces are compressed and take up relatively little disk space when stored. Loading a saved workspace is usually faster than re-reading large datasets from raw text.
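Saving and restoring objects can be sketched as follows. The object names and the temporary file are assumptions for illustration; save.image() would store the whole global environment in the same way that save() stores the objects named here.

```r
# Hypothetical objects that we want to keep for a later session
x <- rnorm(10)
fit_data <- data.frame(a = 1:3, b = c("u", "v", "w"))
x_before <- x

# Store the objects in a compressed .RData file (a temporary file here)
workspace_file <- tempfile(fileext = ".RData")
save(x, fit_data, file = workspace_file)
# save.image(workspace_file) would store everything in the global environment

# Simulate a fresh session by removing the objects
rm(x, fit_data)

# Restore them without re-running the code that created them
load(workspace_file)
identical(x, x_before)   # TRUE
```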
R by default saves (part of) the code history and RStudio expands this functionality greatly.
It is often useful to look back at the code history.
There are multiple ways to access the code history.
RStudio
There may be tons of reasons for archiving
In this course we discuss the following
It is important that your workflow manages the process from start to finish.
In between there are many choices to be made (e.g. data throughput processes), and unless every step is documented, the reproducibility of your work may be at stake.
Research results are replicable if there is sufficient information available for independent researchers to make the same findings using the same procedures. (Gary King, 1995)
In computational sciences - such as ours - simply having the data and code means that the results are not only replicable, but fully reproducible.
We would like our results to be as fully reproducible as possible:
A. Reproducibility is one of the pillars of science
B. Reproducibility may greatly benefit you
set.seed(123)
sample(1:18, size = 50, replace = TRUE)
## [1] 15 14 3 10 18 11 5 14 5 9 3 8 7 10 9 4 14 17 11 7 12 15 10 13 7
## [26] 9 9 10 7 6 2 5 8 12 13 18 1 6 15 9 15 16 6 11 8 7 16 17 18 17
set.seed(123)
sample(1:18, size = 50, replace = TRUE)
## [1] 15 14 3 10 18 11 5 14 5 9 3 8 7 10 9 4 14 17 11 7 12 15 10 13 7
## [26] 9 9 10 7 6 2 5 8 12 13 18 1 6 15 9 15 16 6 11 8 7 16 17 18 17
set.seed(123)
sample(1:18, size = 5, replace = TRUE)
## [1] 15 14 3 10 18
sample(1:18, size = 7, replace = TRUE)
## [1] 11 5 14 5 9 3 8
set.seed(123)
sample(1:18, size = 5, replace = TRUE)
## [1] 15 14 3 10 18
sample(1:18, size = 7, replace = TRUE)
## [1] 11 5 14 5 9 3 8
The random seed is a number used to initialize the pseudo-random number generator.
If replication is needed, pseudo-random number generators must be used.
The initial value (the seed) itself does not need to be random.
When an R instance is started, there is initially no seed. In that case, R will create one from the current time and process ID.
If we fix the random seed we can exactly replicate the random process.
If the method has not changed, the results of the process will be identical when using the same seed.
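Both statements can be checked with a short sketch. The helper draw() is hypothetical. The second half assumes R 3.6.0 or later, where the default sampling method changed from "Rounding" to "Rejection"; it is meant only as one example of "the method has changed".

```r
# Hypothetical helper: fix the seed, then draw five values
draw <- function(seed) {
  set.seed(seed)
  sample(1:18, size = 5, replace = TRUE)
}

# Same seed, same method: the draws are identical
first_run  <- draw(123)
second_run <- draw(123)
identical(first_run, second_run)   # TRUE

# Same seed, different method (assumes R >= 3.6.0)
old_kind <- RNGkind()              # remember the current generator settings

suppressWarnings(set.seed(123, sample.kind = "Rounding"))   # pre-3.6.0 sampler
rounding_run <- sample(1:18, size = 5, replace = TRUE)

set.seed(123, sample.kind = "Rejection")                    # current default sampler
rejection_run <- sample(1:18, size = 5, replace = TRUE)

# The two runs are expected to differ even though the seed is the same
identical(rounding_run, rejection_run)

# Restore the original settings
suppressWarnings(
  RNGkind(kind = old_kind[1], normal.kind = old_kind[2], sample.kind = old_kind[3])
)
```

This is why recording the R version alongside the seed matters when results have to be replicated later.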
rmarkdown
rmarkdown in RStudio provides us with a platform that allows R, \(\LaTeX\), html and python to be combined in a single document.
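As a rough sketch of what such a document looks like, the code below writes a small R Markdown file from R. The file contents, the title and the object names are hypothetical; the rendering step is left as a comment because it assumes that the rmarkdown package and pandoc are installed.

```r
# Three backticks, built programmatically so that this example nests cleanly
fence <- strrep("`", 3)

# A minimal R Markdown document: YAML header, prose with LaTeX math, an R chunk
rmd_lines <- c(
  "---",
  "title: \"A reproducible report\"",
  "output: html_document",
  "---",
  "",
  "Some prose with inline math, such as $\\bar{x}$.",
  "",
  paste0(fence, "{r}"),
  "set.seed(123)",
  "mean(rnorm(10))",
  fence
)

# Write the document to a temporary file
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Rendering to html (assumes rmarkdown and pandoc are available):
# rmarkdown::render(rmd_file)
```

In RStudio the same result is obtained by opening the .Rmd file and pressing the Knit button.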
RStudio IDE (git)