In this document we explore the confidence validity of bootstrap estimates. The goal of this investigation is to study the possibility of using the bootstrap to simulate a monte carlo experiment on a single (small) data set.
This section covers the global parameters and the necessary packages for executing the simulation study.
Later on, we will be using random numbers. We fix the seed of the random number generator, so that we can always reproduce the results obtained in this document.
set.seed(1234)The random seed is a number used to initialize the pseudo-random number generator. The initial value (the seed) itself does not need to be random. The resulting process is random because the seed value is not used to generate randomness - it merely forms the starting point of the algorithm for which the results are random.
library(dplyr)    # Data manipulation
library(magrittr) # Pipes
library(purrr)    # Functional programming
library(mice)     # Data imputation
library(psych)    # Descriptives
library(knitr)    # 
library(kableExtra) #Cool tablesIn this document we perform a Monte Carlo experiment. Here we define the number of replications that makes up this experiment.
nsim = 10000We define a completed version of the mice::boys data set as the origin - conventionally we would consider this to be an infinitely large population - against which we compare our estimates.
origin <- mice(boys, m=1, print = FALSE) %>% # impute only once
  mice::complete() # extract completed dataWe define the following model to be evaluated in simulation \[\text{wgt} = \beta_0 + \beta_1\text{hgt} + \beta2\text{age} + \epsilon\] For the origin set, we extract the linear model as follows:
truemodel <- origin %$%
  lm(wgt ~ hgt + age)We use the replicate() function to draw nsim =\(10^4\) bootstrap sets from the 748 rows in the origin.
simdata <- replicate(nsim, 
                     origin[sample(1:748, 748, replace = TRUE), ], 
                     simplify = FALSE)First we evaluate the model of interest on each of the drawn bootstrap samples. We do this with the purrr::map() function, which maps the evaluation on each of the listed elements in the simdata object.
model <- simdata %>% 
  map(~ lm(wgt ~ hgt + age, data = .x))Then, we extract the estimates from the model evaluations.
estimates <- model %>% 
  map(coef) %>% 
  do.call(rbind, args = .) # bind rows into matrixThe obtained average estimates are
estimates %>% 
  describe(quant = c(.025, .975)) %>%
  .[, c(2:4, 8:9, 11:12, 14:15)] %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = F)| n | mean | sd | min | max | skew | kurtosis | Q0.025 | Q0.975 | |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 10000 | -9.1407509 | 2.1468595 | -17.9173939 | -1.6043025 | -0.0164675 | 0.0372357 | -13.3425898 | -4.9293057 | 
| hgt | 10000 | 0.1898218 | 0.0309346 | 0.0767891 | 0.3146185 | -0.0158256 | 0.0555739 | 0.1284226 | 0.2502532 | 
| age | 10000 | 2.3348893 | 0.2108747 | 1.5188210 | 3.1577070 | 0.0707979 | 0.0876091 | 1.9302869 | 2.7626363 | 
We find that the estimates display on average a low bias.
estimates %>% 
  colMeans## (Intercept)         hgt         age 
##  -9.1407509   0.1898218   2.3348893RWe use the confint() function to extract the confidence intervals for each of the linear models in the model object.
ci <- model %>% 
  map(confint)We then calculate the proportion of confidence intervals that cover the truemodel parameters
cov <- ci %>% 
  map(function(x) x[, 1] <= coef(truemodel) & coef(truemodel) <= x[, 2]) %>%
  do.call(rbind, args = .) # bind rows into matrixFor unknown \(\sigma^2_\beta\), the \(1-\alpha\) confidence interval for \(\beta\) is defined as
\[\mu = \hat{\beta} \pm t_{n-1, a/2}\frac{S}{\sqrt{N}}.\]. The following code extracts the estimates and the standard error
manual <- model %>% 
  map(function(x) cbind(vcov(x) %>% diag %>% sqrt(.), coef(x))) %>% # est and resp. vars
  map(function(x) cbind(x[, 2] - qt(.975, 747) * x[, 1],  x[, 2] + qt(.975, 747) * x[, 1])) %>%
  map(function(x) x[, 1] <= coef(truemodel) & coef(truemodel) <= x[, 2]) %>%
  do.call(rbind, args = .) # bind rows into matrixInstead of drawing the standard errors from every model, we can also make use of the bootstrap estimate of the standard error - i.e. the square root of the variance of the vector of estimates. This gives us the empirical standard deviation of the realized bootstrap sampling distribution of the estimates.
manual2 <- model %>% 
  map(function(x) cbind(sqrt(diag(var(estimates))), coef(x))) %>% # sd of 
  map(function(x) cbind(x[, 2] - qt(.975, 747) * x[, 1],  x[, 2] + qt(.975, 747) * x[, 1])) %>%
  map(function(x) x[, 1] <= coef(truemodel) & coef(truemodel) <= x[, 2]) %>%
  do.call(rbind, args = .)Another approach is to calculate the bootstrap CI based on the quantiles of the vectors of estimates. Simply dividing the confidence interval width would then yield an empirical equivalent to \(\pm t_{n-1, a/2}\frac{S}{\sqrt{N}}\).
# confidence intervals and widths
intercept <- data.frame(est = mean(estimates[, 1]), 
                        ciw = diff(quantile(estimates[, 1], probs = c(.025, .975)))) %>%
  mutate(low = est - ciw/2,
         up = est + ciw/2)
hgt <- data.frame(est = mean(estimates[, 2]), 
                  ciw = diff(quantile(estimates[, 2], probs = c(.025, .975)))) %>%
  mutate(low = est - ciw/2,
         up = est + ciw/2)
age <- data.frame(est = mean(estimates[, 3]), 
                  ciw = diff(quantile(estimates[, 3], probs = c(.025, .975)))) %>%
  mutate(low = est - ciw/2,
         up = est + ciw/2)
# coverages
covint <- data.frame(est = estimates[, 1], 
                     low = estimates[, 1] - intercept$ciw/2,
                     up = estimates[, 1] + intercept$ciw/2) %>%
  mutate(cov = low <= coef(truemodel)[1] & coef(truemodel)[1] <= up)
covhgt <- data.frame(est = estimates[, 2], 
                     low = estimates[, 2] - hgt$ciw/2,
                     up = estimates[, 2] + hgt$ciw/2) %>%
  mutate(cov = low <= coef(truemodel)[2] & coef(truemodel)[2] <= up)
covage <- data.frame(est = estimates[, 3], 
                     low = estimates[, 3] - age$ciw/2,
                     up = estimates[, 3] + age$ciw/2) %>%
  mutate(cov = low <= coef(truemodel)[3] & coef(truemodel)[3] <= up)Below are the coverages for each of the investigated scenarios. Results are depicted for the coverage of the confidence intervals over the regression estimates for \(10^4\) simulations.
output <- rbind(colMeans(cov), 
                colMeans(manual), 
                colMeans(manual2), 
                colMeans(cbind(covint$cov, covhgt$cov, covage$cov)))
rownames(output) <- c("confint", 
                      "manual", 
                      "bootstrap SE", 
                      "bootstrap CI")
kable(output, "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = F)| (Intercept) | hgt | age | |
|---|---|---|---|
| confint | 0.9375 | 0.9231 | 0.9214 | 
| manual | 0.9375 | 0.9231 | 0.9214 | 
| bootstrap SE | 0.9503 | 0.9491 | 0.9491 | 
| bootstrap CI | 0.9500 | 0.9502 | 0.9504 | 
END OF DOCUMENT
sessionInfo()## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kableExtra_1.3.4 knitr_1.33       psych_2.1.3      mice_3.13.13    
## [5] purrr_0.3.4      magrittr_2.0.1   dplyr_1.0.7     
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1  xfun_0.23         bslib_0.2.5.1     lattice_0.20-44  
##  [5] colorspace_2.0-1  vctrs_0.3.8       generics_0.1.0    htmltools_0.5.1.1
##  [9] viridisLite_0.4.0 yaml_2.2.1        utf8_1.2.1        rlang_0.4.11     
## [13] jquerylib_0.1.4   pillar_1.6.1      glue_1.4.2        withr_2.4.2      
## [17] DBI_1.1.1         lifecycle_1.0.0   stringr_1.4.0     munsell_0.5.0    
## [21] rvest_1.0.0       codetools_0.2-18  evaluate_0.14     parallel_4.1.0   
## [25] fansi_0.5.0       highr_0.9         broom_0.7.8       Rcpp_1.0.7       
## [29] backports_1.2.1   scales_1.1.1      webshot_0.5.2     jsonlite_1.7.2   
## [33] tmvnsim_1.0-2     systemfonts_1.0.2 mnormt_2.0.2      digest_0.6.27    
## [37] stringi_1.6.2     grid_4.1.0        tools_4.1.0       sass_0.4.0       
## [41] tibble_3.1.2      crayon_1.4.1      tidyr_1.1.3       pkgconfig_2.0.3  
## [45] MASS_7.3-54       ellipsis_0.3.2    xml2_1.3.2        assertthat_0.2.1 
## [49] rmarkdown_2.9     svglite_2.0.0     httr_1.4.2        rstudioapi_0.13  
## [53] R6_2.5.0          nnet_7.3-16       nlme_3.1-152      compiler_4.1.0