The following packages are required for this practical:
library(dplyr)
library(magrittr)
library(mice)
If you would like the same results as I have obtained, you can fix the random seed:
set.seed(123)
Use a pipe to draw 1000 values with rnorm() from a normal distribution with mean = 5 and sd = 1, that is \(N(5, 1)\); store them in a matrix with two columns and plot the two columns against each other:
rnorm(1000, 5) %>%
matrix(ncol = 2) %>%
plot()
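The pipe chain above is just a readable rewrite of the usual nested call. A small sketch to confirm this, assuming magrittr is loaded:

```r
library(magrittr)

# With the same seed, the piped and the nested versions
# produce identical matrices.
set.seed(123)
piped <- rnorm(1000, 5) %>% matrix(ncol = 2)

set.seed(123)
nested <- matrix(rnorm(1000, 5), ncol = 2)

identical(piped, nested)
## [1] TRUE
```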
Calculate the correlation matrix on the anscombe data set:
anscombe %>%
  cor()
## x1 x2 x3 x4 y1 y2
## x1 1.0000000 1.0000000 1.0000000 -0.5000000 0.8164205 0.8162365
## x2 1.0000000 1.0000000 1.0000000 -0.5000000 0.8164205 0.8162365
## x3 1.0000000 1.0000000 1.0000000 -0.5000000 0.8164205 0.8162365
## x4 -0.5000000 -0.5000000 -0.5000000 1.0000000 -0.5290927 -0.7184365
## y1 0.8164205 0.8164205 0.8164205 -0.5290927 1.0000000 0.7500054
## y2 0.8162365 0.8162365 0.8162365 -0.7184365 0.7500054 1.0000000
## y3 0.8162867 0.8162867 0.8162867 -0.3446610 0.4687167 0.5879193
## y4 -0.3140467 -0.3140467 -0.3140467 0.8165214 -0.4891162 -0.4780949
## y3 y4
## x1 0.8162867 -0.3140467
## x2 0.8162867 -0.3140467
## x3 0.8162867 -0.3140467
## x4 -0.3446610 0.8165214
## y1 0.4687167 -0.4891162
## y2 0.5879193 -0.4780949
## y3 1.0000000 -0.1554718
## y4 -0.1554718 1.0000000
Calculate the correlation between x4 and y4 on the anscombe data set. Using the standard %>%
pipe:
anscombe %>%
subset(select = c(x4, y4)) %>%
cor()
## x4 y4
## x4 1.0000000 0.8165214
## y4 0.8165214 1.0000000
Alternatively, we can use the %$%
pipe from package magrittr
to make this process much more efficient.
anscombe %$%
cor(x4, y4)
## [1] 0.8165214
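For readers who know base R: in this use, %$% behaves like with(), exposing the columns of the data set to the expression that follows. A minimal sketch:

```r
library(magrittr)

# %$% exposes the columns of anscombe, just as with() does:
a <- anscombe %$% cor(x4, y4)
b <- with(anscombe, cor(x4, y4))
identical(a, b)
## [1] TRUE
```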
Calculate the correlation between hgt and wgt in the boys data set from package mice. Because boys has missing values for almost all variables, we must first select wgt and hgt and then handle the rows that have missing values (either within cor() or by omitting them beforehand) before we can calculate the correlation. Using the standard %>%
pipe, this would look like:
boys %>%
subset(select = c("wgt", "hgt")) %>%
cor(use = "pairwise.complete.obs")
## wgt hgt
## wgt 1.0000000 0.9428906
## hgt 0.9428906 1.0000000
which is equivalent to
boys %>%
subset(select = c("wgt", "hgt")) %>%
na.omit() %>%
cor()
## wgt hgt
## wgt 1.0000000 0.9428906
## hgt 0.9428906 1.0000000
Alternatively, we can use the %$%
pipe:
boys %$%
cor(hgt, wgt, use = "pairwise.complete.obs")
## [1] 0.9428906
The exposition pipe %$% exposes the columns of the boys data set, such that we can refer to them directly.
In the boys data set, hgt is recorded in centimeters. Use a pipe to transform hgt in the boys data set to height in meters and verify the transformation. Using the standard %>%
and the %$%
pipes:
boys %>%
transform(hgt = hgt / 100) %$%
mean(hgt, na.rm = TRUE)
## [1] 1.321518
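One way to verify the transformation is to check that the mean in meters equals the mean in centimeters divided by 100. A sketch, assuming package mice is installed for the boys data:

```r
library(magrittr)
library(mice)  # provides the boys data

# mean height in centimeters, and in meters after the transformation:
m_cm <- boys %$% mean(hgt, na.rm = TRUE)
m_m  <- boys %>% transform(hgt = hgt / 100) %$% mean(hgt, na.rm = TRUE)

all.equal(m_m, m_cm / 100)
## [1] TRUE
```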
Use a pipe to plot the pair (hgt, wgt) two times: once for hgt in meters and once for hgt in centimeters. Make the points in the ‘centimeter’ plot red and those in the ‘meter’ plot blue.
boys %>%
subset(select = c(hgt, wgt)) %>%
plot(col = "red", main = "Height in centimeters")
boys %>%
subset(select = c(hgt, wgt)) %>%
transform(hgt = hgt / 100) %>%
plot(col = "blue", main = "Height in meters")
In the following experiment we investigate least-squares estimation of the mean. The data for this experiment can be found in the workspace Exercise4_data.RData
. Either download this workspace and manually load()
it into R
, or run the code below to load it through a connection:
con <- url("https://www.gerkovink.com/fundamentals/Wk2/Exercise/Exercise4_data.RData")
load(con)
The workspace contains a vector named x
and a vector named y
. Start by calculating the mean of x
.mean(x)
## [1] 4.165695
The values in x
have been drawn from a population with mean \(\mu = 3\) and standard deviation \(\sigma = 7\). Calculate the sum of squared deviations of x from \(\mu\). A straightforward way:
mu <- 3
deviations <- x - mu
ssd <- sum(deviations^2)
A faster way uses the function apply(), where a function is applied over a margin (1 = rows, 2 = columns) of some data.
ssd2 <- apply(X = outer(x, mu, FUN = "-")^2,
MARGIN = 2,
FUN = sum)
Here X = outer(x, mu, FUN = "-")^2
provides the data to apply(). Remember that we have defined x
and mu
in the global environment. With the outer()
function, we can quickly evaluate a function over all combinations of elements of two vectors. In this case we use subtraction ("-") as that function, so after squaring we obtain a one-column matrix of squared deviations. apply() then takes the column margin (MARGIN = 2
), which in this case is just the vector of squared deviations, and sums (FUN = sum
) all these values into a single number. The result is the sum of squared deviations.
Both solutions are identical
identical(ssd, ssd2)
## [1] TRUE
To see the values, we can simply concatenate them:
c(ssd, ssd2)
## [1] 5843.083 5843.083
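The outer() and apply() combination really pays off when we supply several candidate means at once: each column of the squared-deviation matrix then belongs to one candidate. A toy sketch (the vectors are named xx and mus so they do not overwrite the x and mu used above):

```r
xx  <- c(1, 4, 7)  # small toy data vector
mus <- c(3, 4, 5)  # three candidate means

# One column of squared deviations per candidate mean,
# summed down the columns:
ssd_all <- apply(outer(xx, mus, FUN = "-")^2, MARGIN = 2, FUN = sum)
ssd_all
## [1] 21 18 21

# The same result with an explicit loop over the candidates:
sapply(mus, function(m) sum((xx - m)^2))
## [1] 21 18 21
```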
The apply class of functions is very flexible and lightning fast when compared to manual operations. The only caveat is that you need a function to apply. Many such functions are already available in R
, such as mean()
, mode()
, sum()
, cor()
, and so on.
However, if you need to perform more than a simple calculation, it is often necessary to create your own function. In R
functions take the following form:
myfunction <- function(arguments){
  # your function code here
}
For example,
mean.sd <- function(argument1, argument2){
mean1 <- mean(argument1)
mean2 <- mean(argument2)
sd1 <- sd(argument1)
sd2 <- sd(argument2)
result <- data.frame(mean = c(mean1, mean2),
sd = c(sd1, sd2),
row.names = c("first", "second"))
return(result)
}
The above function calculates the means and standard deviations for two sources of input, then combines these statistics in a simple data frame and returns the data frame. The sources of input are defined in the function arguments argument1
and argument2
. The reason why we have to specify these arguments is simple:
\[\text{EVERYTHING THAT HAPPENS IN A FUNCTION COMES FROM THE}\] \[\text{FUNCTION AND STAYS IN THE FUNCTION!}\]
This is because a function opens a separate environment that only exists for as long as the function operates. This means that anything you want to get out of the function must be explicitly returned, for example with return()
. In general, using return()
makes it explicit what your function’s return value is. For complicated functions this is proper coding procedure, but for simple functions it is not strictly necessary, as R returns the value of the last evaluated expression. To put this example function to the test:
mean.sd(argument1 = 1:10,
argument2 = 3:8)
## mean sd
## first 5.5 3.027650
## second 5.5 1.870829
or, simply:
mean.sd(1:10, 3:8)
## mean sd
## first 5.5 3.027650
## second 5.5 1.870829
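The scoping rule stated above can be demonstrated directly: a variable created inside a function body does not survive outside of it. A minimal sketch (double.it is a hypothetical example function):

```r
double.it <- function(x){
  inside <- x * 2   # 'inside' lives only in the function's environment
  return(inside)
}

double.it(21)
## [1] 42
exists("inside")   # nothing leaked into the global environment
## [1] FALSE
```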
Write a function that takes a mean estimate and returns the sum of squared deviations of x from that estimate. Call it lsfun
, because we are going to identify the least squares estimate and coding is fun! A compact way to write it:
lsfun <- function(meanestimate) apply(outer(x, meanestimate, "-")^2, 2, sum)
or, 100% equivalently, but easier to spot as a function:
lsfun2 <- function(meanestimate){
apply(outer(x, meanestimate, "-")^2,
MARGIN = 2,
FUN = sum)
}
or, with a pipe
lsfun3 <- function(meanestimate){
outer(x, meanestimate, "-")^2 %>%
apply(MARGIN = 2, FUN = sum)
}
curve(lsfun, from = 1, to = 8, ylab = "Sum of the squared deviations")
We can see that the minimum lies somewhere above, but near, the value 4. We already know from exercise 7 that the mean of x is 4.1656949, and that mean is the abscissa (the x
coordinate) of the minimum. To verify this, we can change the range of the plotted curve to zoom in around 4.1656949:
curve(lsfun, from = 4.16, to = 4.17)
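Rather than reading the minimum off the curve, we can also let R find it numerically, for example with optimize(); applied to the practical's x, optimize(lsfun, interval = c(1, 8))$minimum should come out near 4.1657. A self-contained sketch with a toy vector, since x lives in the downloaded workspace:

```r
xx <- c(2, 3, 7, 8)  # toy data; mean(xx) is 5

lsfun.xx <- function(meanestimate){
  apply(outer(xx, meanestimate, "-")^2, MARGIN = 2, FUN = sum)
}

# One-dimensional minimization over the interval (0, 10); the least
# squares estimate of the mean coincides with the sample mean.
opt <- optimize(lsfun.xx, interval = c(0, 10))
all.equal(opt$minimum, mean(xx), tolerance = 1e-3)
## [1] TRUE
```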
Finally, repeat the experiment with the vector y
from Exercise4_data.RData
: calculate the mean of y
and report on the shape of the data (you may want to round()
your answers), and find the meanestimate
that would minimize the least squares function. End of practical.