Exercises


The following packages are required for this practical:

library(dplyr)
library(magrittr)
library(mice)

and if you’d like the same results as I have obtained, you can fix the random seed

set.seed(123)

Exercises 1-5


  1. Use a pipe to do the following:
  • draw 1000 values from a normal distribution with mean = 5 and sd = 1, i.e. \(N(5, 1)\),
    • HINT: use rnorm()
  • create a matrix where the first 500 values are the first column and the second 500 values are the second column
  • make a scatterplot of these two columns
rnorm(1000, 5) %>%        # sd defaults to 1, so this draws from N(5, 1)
  matrix(ncol = 2) %>%    # matrices fill by column: first 500 values -> column 1
  plot()


  2. Use a pipe to calculate the correlation matrix on the anscombe data set
anscombe %>%
  cor()
##            x1         x2         x3         x4         y1         y2
## x1  1.0000000  1.0000000  1.0000000 -0.5000000  0.8164205  0.8162365
## x2  1.0000000  1.0000000  1.0000000 -0.5000000  0.8164205  0.8162365
## x3  1.0000000  1.0000000  1.0000000 -0.5000000  0.8164205  0.8162365
## x4 -0.5000000 -0.5000000 -0.5000000  1.0000000 -0.5290927 -0.7184365
## y1  0.8164205  0.8164205  0.8164205 -0.5290927  1.0000000  0.7500054
## y2  0.8162365  0.8162365  0.8162365 -0.7184365  0.7500054  1.0000000
## y3  0.8162867  0.8162867  0.8162867 -0.3446610  0.4687167  0.5879193
## y4 -0.3140467 -0.3140467 -0.3140467  0.8165214 -0.4891162 -0.4780949
##            y3         y4
## x1  0.8162867 -0.3140467
## x2  0.8162867 -0.3140467
## x3  0.8162867 -0.3140467
## x4 -0.3446610  0.8165214
## y1  0.4687167 -0.4891162
## y2  0.5879193 -0.4780949
## y3  1.0000000 -0.1554718
## y4 -0.1554718  1.0000000
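As an aside, the full correlation matrix hides the famous property of the Anscombe quartet: each of the four pairs (x1, y1) through (x4, y4) has almost exactly the same correlation. A quick sketch to check this with sapply():

```r
# correlation per Anscombe pair; all four come out near 0.816
sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
```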

  3. Now use a pipe to calculate the correlation for the pair (x4, y4) on the anscombe data set

Using the standard %>% pipe:

anscombe %>%
  subset(select = c(x4, y4)) %>%
  cor()
##           x4        y4
## x4 1.0000000 0.8165214
## y4 0.8165214 1.0000000

Alternatively, we can use the %$% pipe from package magrittr to make this process much more concise.

anscombe %$%
  cor(x4, y4)
## [1] 0.8165214

  4. Use a pipe to calculate the correlation between hgt and wgt in the boys data set from package mice.

Because boys has missing values on almost all variables, we must either tell cor() how to handle the missing values or omit the incomplete rows before we calculate the correlation. Using the standard %>% pipe, this would look like:

boys %>%
  subset(select = c("wgt", "hgt")) %>%
  cor(use = "pairwise.complete.obs")
##           wgt       hgt
## wgt 1.0000000 0.9428906
## hgt 0.9428906 1.0000000

which is equivalent to

boys %>%
  subset(select = c("wgt", "hgt")) %>%
  na.omit() %>%
  cor()
##           wgt       hgt
## wgt 1.0000000 0.9428906
## hgt 0.9428906 1.0000000

Alternatively, we can use the %$% pipe:

boys %$% 
  cor(hgt, wgt, use = "pairwise.complete.obs")
## [1] 0.9428906

The exposition pipe %$% exposes the columns of the boys dataset, such that we can refer to them directly by name.
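For those coming from base R: %$% behaves much like with(), which also evaluates an expression with the data's columns in scope:

```r
library(mice)      # boys data
library(magrittr)  # %$% pipe

# these two calls give the same result
boys %$% cor(hgt, wgt, use = "pairwise.complete.obs")
with(boys, cor(hgt, wgt, use = "pairwise.complete.obs"))
```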


  5. In the boys data set, hgt is recorded in centimeters. Use a pipe to transform hgt in the boys dataset to height in meters and verify the transformation

Using the standard %>% and the %$% pipes:

boys %>%
  transform(hgt = hgt / 100) %$%
  mean(hgt, na.rm = TRUE)
## [1] 1.321518
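The mean in meters should equal the mean in centimeters divided by 100, so one way to actually verify the transformation (a sketch, not part of the original solution) is:

```r
library(mice)      # boys data
library(magrittr)  # %>% and %$% pipes

m_cm <- boys %$% mean(hgt, na.rm = TRUE)   # mean height in centimeters
m_m  <- boys %>%
  transform(hgt = hgt / 100) %$%
  mean(hgt, na.rm = TRUE)                  # mean height in meters

all.equal(m_cm / 100, m_m)                 # TRUE up to floating-point tolerance
```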

Exercises 6-8


  6. Use pipes to plot the pair (hgt, wgt) two times: once for hgt in meters and once for hgt in centimeters. Make the points in the ‘centimeter’ plot red and in the ‘meter’ plot blue.
boys %>%
  subset(select = c(hgt, wgt)) %>%
  plot(col = "red", main = "Height in centimeters") 

boys %>%
  subset(select = c(hgt, wgt)) %>%
  transform(hgt = hgt / 100) %>%
  plot(col = "blue", main = "Height in meters")
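To compare the two plots directly, one could place them side by side with par(mfrow = ...) (a sketch; the exercise does not ask for this):

```r
library(mice)      # boys data
library(magrittr)  # %>% pipe

op <- par(mfrow = c(1, 2))  # 1 row, 2 columns of plots

boys %>%
  subset(select = c(hgt, wgt)) %>%
  plot(col = "red", main = "Height in centimeters")

boys %>%
  subset(select = c(hgt, wgt)) %>%
  transform(hgt = hgt / 100) %>%
  plot(col = "blue", main = "Height in meters")

par(op)  # restore the previous plotting layout
```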


In the following experiment we investigate least-squares estimation of the mean. The data for this experiment can be found in the workspace Exercise4_data.RData. Either download this workspace and manually load() it into R, or run the code below, which loads it through a connection:

con <- url("https://www.gerkovink.com/fundamentals/Wk2/Exercise/Exercise4_data.RData")
load(con)

The workspace contains two vectors: one named x and one named y.


  7. Obtain the sample mean of the values in x.
mean(x)
## [1] 4.165695

The values in x have been drawn from a population with mean \(\mu = 3\) and standard deviation \(\sigma = 7\).


  8. Calculate the sample mean’s sum of squared deviations from \(\mu\). The sum of squared deviations from \(\mu\) is defined as: \[ \text{ssd} = \sum_{i=1}^{100} (x_i - \mu)^2.\] There is a slow way:
mu <- 3
deviations <- x - mu
ssd <- sum(deviations^2)

And a fast way with the function apply(), where a function is applied over a margin (1 = rows, 2 = columns) of some data.

ssd2 <- apply(X = outer(x, mu, FUN = "-")^2, 
              MARGIN = 2, 
              FUN = sum)

Here X = outer(x, mu, FUN = "-")^2 provides the data for apply(). Remember that we have defined x and mu in the global environment. The outer() function evaluates a function, here subtraction (FUN = "-"), for every combination of the elements of two vectors; squaring the result yields a one-column matrix of squared deviations. apply() then sums these values (FUN = sum) over the column margin (MARGIN = 2), collapsing that column into a single value: the sum of squared deviations.
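To see what outer() with FUN = "-" produces, here is a small example with made-up vectors: element [i, j] of the result is the i-th element of the first vector minus the j-th element of the second:

```r
outer(c(1, 2, 3), c(0, 10), FUN = "-")
##      [,1] [,2]
## [1,]    1   -9
## [2,]    2   -8
## [3,]    3   -7
```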

Both solutions are identical

identical(ssd, ssd2)
## [1] TRUE

To see the values, we can simply concatenate them:

c(ssd, ssd2)
## [1] 5843.083 5843.083

Functions in R

The apply class of functions is very flexible and fast when compared to manual operations that could easily be defined in terms of functions. The only caveat is that you need a function to apply. Many such functions are already available in R, such as mean(), median(), sum(), cor(), and so on. (Beware that R’s mode() is not one of them: it returns the storage mode of an object, not the statistical mode.)

However, if you need to perform more than a simple calculation, it is often necessary to create your own function. In R, functions take the following form:

myfunction <- function(arguments){
  hereyourfunctioncode
}

For example,

mean.sd <- function(argument1, argument2){
  mean1 <- mean(argument1) 
  mean2 <- mean(argument2)
  sd1 <- sd(argument1)
  sd2 <- sd(argument2)
  result <- data.frame(mean = c(mean1, mean2),
                       sd = c(sd1, sd2), 
                       row.names = c("first", "second"))
  return(result)
}

The above function calculates the means and standard deviations for two sources of input, then combines these statistics in a simple data frame and returns the data frame. The sources of input are defined in the function arguments argument1 and argument2. The reason why we have to specify these arguments is simple:

\[\text{EVERYTHING THAT HAPPENS IN A FUNCTION COMES FROM THE}\] \[\text{FUNCTION AND STAYS IN THE FUNCTION!}\]

This is because a function opens a separate environment that only exists for as long as the function runs. This means:

  1. To get information from the global environment to the function’s environment, we need arguments.
  2. To properly return information to the global environment, we should use return(). In general, using return() makes it explicit what your function returns. For complicated functions this is proper coding practice, but for simple functions it is not strictly necessary.
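A minimal sketch of this scoping behaviour, with a hypothetical function f:

```r
f <- function(x) {
  inside <- x + 1   # 'inside' exists only in the function's environment
  return(inside)
}

f(1)               # returns 2
exists("inside")   # FALSE (in a fresh session): 'inside' vanished with the function
```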

To put this example function to the test:

mean.sd(argument1 = 1:10,
        argument2 = 3:8)
##        mean       sd
## first   5.5 3.027650
## second  5.5 1.870829

or, simply:

mean.sd(1:10, 3:8)
##        mean       sd
## first   5.5 3.027650
## second  5.5 1.870829

Exercises 9-10

  9. Now create a function that automates the calculation of the sum of squares for any given \(\mu\). Call the function lsfun because we are going to identify the least squares estimate and coding is fun!

We can write this as a one-liner:

lsfun <- function(meanestimate) apply(outer(x, meanestimate, "-")^2, 2, sum)

or, equivalently, but written out so it is easier to recognize as a function:

lsfun2 <- function(meanestimate){
  apply(outer(x, meanestimate, "-")^2, 
        MARGIN =  2, 
        FUN = sum)
}

or, with a pipe

lsfun3 <- function(meanestimate){
  outer(x, meanestimate, "-")^2 %>%
    apply(MARGIN = 2, FUN = sum)
}
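As a quick sanity check (assuming x is loaded and ssd was computed as in exercise 8), the function reproduces the earlier sum of squares, and thanks to outer() it is vectorized over several candidate means at once:

```r
lsfun(3)        # equals ssd, the sum of squared deviations around mu = 3
lsfun(c(3, 5))  # one sum of squares per candidate mean
```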

  10. Plot the curve of your least squares function such that you can identify the minimum of the curve (i.e. the location for \(x\) where the sum of the squared deviations is the lowest).
curve(lsfun, from = 1, to = 8, ylab = "Sum of the squared deviations")

We can see that the minimum lies somewhere just above the value 4. We already know from exercise 7 that the sample mean equals 4.1656949, and that is where the minimum should lie on the x-axis. To verify this, we can change the range of the plotted curve to zoom in around 4.1656949:

curve(lsfun, from = 4.16, to = 4.17)
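Instead of eyeballing the minimum, we can also locate it numerically with optimize() (an extra check, not part of the exercise); assuming x and lsfun are defined as above, the minimizing value should coincide with mean(x):

```r
optimize(lsfun, interval = c(1, 8))$minimum  # approximately mean(x)
```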


Hand-in exercise


  11. Repeat the experiment from exercise 10 on object y from Exercise4_data.RData.
  • Plot a histogram of y and report on the shape of the data
    • What do you think the mean is, based on the plot?
    • Calculate the mean (expected value)
    • Calculate the median (the center of the distribution)
    • Calculate the mode (the most frequently observed value) on rounded numbers (HINT: use round()).
    • What do the mean, mode and median tell you about the shape of the data?
  • Plot the least squares curve from values 2 to 10, following the previous exercise
  • Plot a curve that zooms in on the minimum (i.e. the least squared deviations)
  • Report the meanestimate that would minimize the least squares function

End of practical.