We use the following packages:

library(plyr)
library(dplyr)
library(magrittr)
library(ggplot2)

Exercise

  1. Sample 100 samples from a standard normal distribution.
  2. For each of these samples, calculate the following statistics for the mean:
  1. Create a plot that demonstrates the following

    “A replication of the procedure that generates a 95% confidence interval that is centered around the sample mean would cover the population value at least 95 out of 100 times” (Neyman, 1934)

  2. Present a table containing all simulated samples for which the resulting confidence interval does not contain the population value.


SOLUTION

NB: this is just one of many possible solutions!

We start by fixing the random seed. Otherwise your results will likely vary

set.seed(1234)

Then, we draw 100 samples from a standard normal distribution, i.e. a distribution with \(\mu=0\) and \(\sigma^2=1\), such that for any drawn sample \(X\) we could write \(X \sim \mathcal{N}(0, 1)\). No specification about the size of the samples is explicitly requested, so for computational reasons a mere 5000 cases would suffice to obtain a detailed approximation.

samples <- lapply(1:100, function(x) rnorm(5000, mean = 0, sd = 1))

We use the lapply() function to draw the 100 samples and return the resulting output as a list. You could also use a for-loop. Further, we extract the following sources of information for each sample:

info <- function(x){ 
  M <- mean(x)
  DF <- length(x) - 1
  SE <- 1 / sqrt(length(x))
  INT <- qt(.975, DF) * SE
  return(c("Mean"    = M, 
           "Bias"    = M - 0, 
           "Std.Err" = SE, 
           "Lower"   = M - INT, 
           "Upper"   = M + INT))
}

Then we can compute the info on each of the samples

results <- samples %>% sapply(info) %>% t()

Because object samples is a list, we can execute funtion sapply() to obtain a numerical object with the results of function info(). Function \(t()\) is here used to obtain the transpose of sapply()’s return - the resulting object has all information in the columns.

To create an indicator for the inclusion of the population value \(\mu=0\) in the confidence interval, we can add the following coverage column cov to the data:

results <- results %>%
  as.data.frame() %>%
  mutate(Covered = Lower < 0 & 0 < Upper)

Converting the numerical object to an object of class data.frame allows for a more convenient calling of elements. Now we can simply take the column means over dataframe results to obtain the average of the estimates returned by info().

colMeans(results)
##         Mean         Bias      Std.Err        Lower        Upper 
##  0.001341244  0.001341244  0.014142136 -0.026383545  0.029066033 
##      Covered 
##  0.950000000

We can see that 95 out of the 100 samples cover the population value.

To identify the samples for which the population value is not covered, we can use column cov as it is already a logical evaluation.

noncovered <- results[!results$Covered, ]

To create a graph that would serve the purpose of the exercise, one could think about the following graph:

library(ggplot2)

ggplot(results, aes(y = Mean, x = 1:100, ymax = Upper, ymin = Lower, 
                    colour = Covered)) + 
  geom_hline(yintercept = 0, color = "dark grey", size = 2) + 
  geom_pointrange() + 
  xlab("Simulations 1-100") +
  ylab("Means and 95% Confidence Intervals") + 
  theme_minimal()

To just plot the 5 cases that do not overlap with the population parameter:

ggplot(noncovered, aes(y = Mean, x = 1:5, ymax = Upper, ymin = Lower)) + 
  geom_hline(aes(yintercept = 0), color = "dark grey", size = 2) + 
  geom_pointrange(col = "red") + 
  xlab("Simulations 1-100") +
  ylab("Means and 95% Confidence Intervals") + 
  theme_minimal()


End of Practical