This is the third practical.

In this practical we will focus on analyzing the relation between the data and the missingness. For non-R users: In R one can simply call the help function for a any specific function func by typing help(func). E.g. help(mice) directs you to the help page of the mice function.

All the best,

Gerko and the mice team


1. Load R and load the packages mice, ggmice, lattice and dplyr. Set the seed to 123.

library(mice)     # Data imputation
library(dplyr)    # Data manipulation
library(lattice)  # Plotting device
library(ggplot2)  # Plotting device
library(ggmice)   # gg-like plots for mice
set.seed(123)

We choose seed value 123. This is an arbitrary value; any value would be an equally good seed value. Fixing the random seed enables you (and others) to exactly replicate anything that involves random number generators. If you set the seed in your R instance to 123, you will get the exact same results and plots as we present in this document if you follow the order and code of the exercises precisely.


The boys data set


2. The boys dataset is part of mice. It is a subset of a large Dutch dataset containing growth measures from the Fourth Dutch Growth Study. Inspect the help for boys dataset and make yourself familiar with its contents.

To learn more about the contents of the data, use one of the two following help commands:

help(boys)
?boys

3. Get an overview of the data. Find information about the size of the data, the variables measured and the amount of missingness.

The first 10 cases are:

head(boys, n = 10)
##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
## 37 0.068 52.5 3.810 13.82 34.9 <NA> <NA> NA south
## 38 0.071 53.0 3.890 13.84 35.8 <NA> <NA> NA  west
## 39 0.071 55.1 3.880 12.77 36.8 <NA> <NA> NA  west
## 43 0.073 54.5 4.200 14.14 38.0 <NA> <NA> NA  east

The last 10 cases are:

tail(boys, n = 10)
##         age   hgt  wgt   bmi   hc  gen  phb tv   reg
## 7329 20.032 184.0 73.0 21.56 56.0 <NA> <NA> NA north
## 7362 20.117 188.7 89.4 25.10 58.1   G5   P6 25  east
## 7396 20.281 185.1 81.1 23.67 58.8   G5   P6 20 south
## 7405 20.323 182.5 69.0 20.71 59.0 <NA> <NA> NA north
## 7410 20.372 188.7 59.8 16.79 55.2 <NA> <NA> NA  west
## 7418 20.429 181.1 67.2 20.48 56.6 <NA> <NA> NA north
## 7444 20.761 189.1 88.0 24.60   NA <NA> <NA> NA  west
## 7447 20.780 193.5 75.4 20.13   NA <NA> <NA> NA  west
## 7451 20.813 189.0 78.0 21.83 59.9 <NA> <NA> NA north
## 7475 21.177 181.8 76.5 23.14   NA <NA> <NA> NA  east

We now have a clear indication that the data are sorted. A simple evaluation

!is.unsorted(boys$age)
## [1] TRUE

confirms this - !is.unsorted() evaluates the complement of is.unsorted(), so it tests whether the data are sorted. There is no is.sorted function in R.

The dimensions of the boys data set are:

dim(boys)
## [1] 748   9

We see that the boys data set has 748 cases over 9 variables. From those 9 variables

summary(boys)
##       age              hgt              wgt              bmi       
##  Min.   : 0.035   Min.   : 50.00   Min.   :  3.14   Min.   :11.77  
##  1st Qu.: 1.581   1st Qu.: 84.88   1st Qu.: 11.70   1st Qu.:15.90  
##  Median :10.505   Median :147.30   Median : 34.65   Median :17.45  
##  Mean   : 9.159   Mean   :132.15   Mean   : 37.15   Mean   :18.07  
##  3rd Qu.:15.267   3rd Qu.:175.22   3rd Qu.: 59.58   3rd Qu.:19.53  
##  Max.   :21.177   Max.   :198.00   Max.   :117.40   Max.   :31.74  
##                   NA's   :20       NA's   :4        NA's   :21     
##        hc          gen        phb            tv           reg     
##  Min.   :33.70   G1  : 56   P1  : 63   Min.   : 1.00   north: 81  
##  1st Qu.:48.12   G2  : 50   P2  : 40   1st Qu.: 4.00   east :161  
##  Median :53.00   G3  : 22   P3  : 19   Median :12.00   west :239  
##  Mean   :51.51   G4  : 42   P4  : 32   Mean   :11.89   south:191  
##  3rd Qu.:56.00   G5  : 75   P5  : 50   3rd Qu.:20.00   city : 73  
##  Max.   :65.00   NA's:503   P6  : 41   Max.   :25.00   NA's :  3  
##  NA's   :46                 NA's:503   NA's   :522

function summary() informs us that testicular volume tv has the most missings, followed by the genital and pubic hair stages gen and phb, each with 503 missing cells.


4. As we have seen before, the function md.pattern() can be used to display all different missing data patterns. How many different missing data patterns are present in the boys dataframe and which pattern occurs most frequently in the data?

plot_pattern(boys)

There are 13 patterns in total, with the pattern where gen, phb and tv are missing occuring the most.


5. How many patterns occur for which the variable gen (genital Tannerstage) is missing?

mpat <- md.pattern(boys, plot = FALSE)
sum(mpat[, "gen"] == 0)
## [1] 8

Answer: 8 patterns (503 cases)


6. Let us focus more precisely on the missing data patterns. Does the missing data of gen depend on age? One could for example check this by making a histogram of age separately for the cases with known genital stages and for cases with missing genital stages.

To create said histogram in R, a missingness indicator for gen has to be created. A missingness indicator is a dummy variable with value 1 for observed values (in this case genital status) and 0 for missing values. Create a missingness indicator for gen by typing

R <- is.na(boys$gen) 
head(R, n = 100)
##   [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
tail(R, n = 100)
##   [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
##  [13]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
##  [25] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
##  [37]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
##  [49]  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
##  [61] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE
##  [73] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
##  [85] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE
length(R)
## [1] 748

As we can see, the missingness indicator tells us for each of the 748 values in gen whether it is missing (TRUE) or observed (FALSE).

A histogram can be made with the function histogram() from package lattice.

lattice::histogram(boys$gen)

Or with ggmice:

boys %>% 
  ggmice(aes(gen)) + 
  geom_bar()

or, equivalently, one could use

lattice::histogram(~ gen, data = boys)

Writing the latter line of code for plots is more efficient than selecting every part of the boys data with the boys$... command, especially if plots become more advanced. The code for a conditional histogram of age given R is

lattice::histogram(~ age | R, data=boys)

The histogram shows that the missingness in gen is not equally distributed across age; or, equivalently, age seems to be differently distributed for observed and missing gen.

With ggmice:

boys %>% 
  ggmice(aes(age)) + 
  geom_histogram() + 
  facet_wrap(R)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


7. Impute the boys dataset with mice using all default settings and name the mids (multiply imputed data set) object imp.

imp <- mice(boys, print=FALSE)

8. Compare the means of the imputed data with the means of the incomplete data. One can use the function complete() with a mids-object as argument to obtain an imputed dataset. As default, the first imputed dataset will be given by this function.

summary(boys)
##       age              hgt              wgt              bmi       
##  Min.   : 0.035   Min.   : 50.00   Min.   :  3.14   Min.   :11.77  
##  1st Qu.: 1.581   1st Qu.: 84.88   1st Qu.: 11.70   1st Qu.:15.90  
##  Median :10.505   Median :147.30   Median : 34.65   Median :17.45  
##  Mean   : 9.159   Mean   :132.15   Mean   : 37.15   Mean   :18.07  
##  3rd Qu.:15.267   3rd Qu.:175.22   3rd Qu.: 59.58   3rd Qu.:19.53  
##  Max.   :21.177   Max.   :198.00   Max.   :117.40   Max.   :31.74  
##                   NA's   :20       NA's   :4        NA's   :21     
##        hc          gen        phb            tv           reg     
##  Min.   :33.70   G1  : 56   P1  : 63   Min.   : 1.00   north: 81  
##  1st Qu.:48.12   G2  : 50   P2  : 40   1st Qu.: 4.00   east :161  
##  Median :53.00   G3  : 22   P3  : 19   Median :12.00   west :239  
##  Mean   :51.51   G4  : 42   P4  : 32   Mean   :11.89   south:191  
##  3rd Qu.:56.00   G5  : 75   P5  : 50   3rd Qu.:20.00   city : 73  
##  Max.   :65.00   NA's:503   P6  : 41   Max.   :25.00   NA's :  3  
##  NA's   :46                 NA's:503   NA's   :522
summary(complete(imp))
##       age              hgt              wgt              bmi       
##  Min.   : 0.035   Min.   : 50.00   Min.   :  3.14   Min.   :11.77  
##  1st Qu.: 1.581   1st Qu.: 83.53   1st Qu.: 11.85   1st Qu.:15.89  
##  Median :10.505   Median :145.75   Median : 34.55   Median :17.39  
##  Mean   : 9.159   Mean   :131.12   Mean   : 37.11   Mean   :18.03  
##  3rd Qu.:15.267   3rd Qu.:175.00   3rd Qu.: 59.35   3rd Qu.:19.48  
##  Max.   :21.177   Max.   :198.00   Max.   :117.40   Max.   :31.74  
##        hc        gen      phb            tv            reg     
##  Min.   :33.70   G1:305   P1:325   Min.   : 1.000   north: 82  
##  1st Qu.:48.45   G2:149   P2:112   1st Qu.: 2.000   east :161  
##  Median :53.20   G3: 55   P3: 45   Median : 3.000   west :240  
##  Mean   :51.64   G4: 83   P4: 69   Mean   : 8.239   south:192  
##  3rd Qu.:56.00   G5:156   P5:112   3rd Qu.:15.000   city : 73  
##  Max.   :65.00            P6: 85   Max.   :25.000

Most means are roughly equal, except the mean of tv, which is much lower in the first imputed data set, when compared to the incomplete data. This makes sense because most genital measures are unobserved for the lower ages. When imputing these values, the means should decrease.

Investigating univariate properties by using functions such as summary(), may not be ideal in the case of hundreds of variables. To extract just the information you need, for all imputed datasets, we can make use of the with() function. To obtain summaries for each imputed tv only, type

imp %>%
  with(summary(tv)) %>%
  summary()
## # A tibble: 5 × 6
##   minimum    q1 median  mean    q3 maximum
##     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1       1     2      3  8.24    15      25
## 2       1     2      3  8.52    15      25
## 3       1     2      4  8.39    15      25
## 4       1     2      3  8.52    15      25
## 5       1     2      3  8.45    15      25

And to obtain e.g. the means alone, run

imp %>%
  with(mean(tv)) %>%
  summary()
## # A tibble: 5 × 1
##       x
##   <dbl>
## 1  8.24
## 2  8.52
## 3  8.39
## 4  8.52
## 5  8.45

The importance of the imputation model

The mammalsleep dataset is part of mice. It contains the Allison and Cicchetti (1976) data for mammalian species. To learn more about this data, type

help(mammalsleep)

9. Get an overview of the data.

Find information about the size of the data, the variables measured and the amount of missingness.

head(mammalsleep)
##                     species       bw    brw sws  ps   ts  mls  gt pi sei odi
## 1          African elephant 6654.000 5712.0  NA  NA  3.3 38.6 645  3   5   3
## 2 African giant pouched rat    1.000    6.6 6.3 2.0  8.3  4.5  42  3   1   3
## 3                Arctic Fox    3.385   44.5  NA  NA 12.5 14.0  60  1   1   1
## 4    Arctic ground squirrel    0.920    5.7  NA  NA 16.5   NA  25  5   2   3
## 5            Asian elephant 2547.000 4603.0 2.1 1.8  3.9 69.0 624  3   5   4
## 6                    Baboon   10.550  179.5 9.1 0.7  9.8 27.0 180  4   4   4
summary(mammalsleep)
##                       species         bw                brw         
##  African elephant         : 1   Min.   :   0.005   Min.   :   0.14  
##  African giant pouched rat: 1   1st Qu.:   0.600   1st Qu.:   4.25  
##  Arctic Fox               : 1   Median :   3.342   Median :  17.25  
##  Arctic ground squirrel   : 1   Mean   : 198.790   Mean   : 283.13  
##  Asian elephant           : 1   3rd Qu.:  48.202   3rd Qu.: 166.00  
##  Baboon                   : 1   Max.   :6654.000   Max.   :5712.00  
##  (Other)                  :56                                       
##       sws               ps              ts             mls         
##  Min.   : 2.100   Min.   :0.000   Min.   : 2.60   Min.   :  2.000  
##  1st Qu.: 6.250   1st Qu.:0.900   1st Qu.: 8.05   1st Qu.:  6.625  
##  Median : 8.350   Median :1.800   Median :10.45   Median : 15.100  
##  Mean   : 8.673   Mean   :1.972   Mean   :10.53   Mean   : 19.878  
##  3rd Qu.:11.000   3rd Qu.:2.550   3rd Qu.:13.20   3rd Qu.: 27.750  
##  Max.   :17.900   Max.   :6.600   Max.   :19.90   Max.   :100.000  
##  NA's   :14       NA's   :12      NA's   :4       NA's   :4        
##        gt               pi             sei             odi       
##  Min.   : 12.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 35.75   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 79.00   Median :3.000   Median :2.000   Median :2.000  
##  Mean   :142.35   Mean   :2.871   Mean   :2.419   Mean   :2.613  
##  3rd Qu.:207.50   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :645.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :4
str(mammalsleep)
## 'data.frame':    62 obs. of  11 variables:
##  $ species: Factor w/ 62 levels "African elephant",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ bw     : num  6654 1 3.38 0.92 2547 ...
##  $ brw    : num  5712 6.6 44.5 5.7 4603 ...
##  $ sws    : num  NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
##  $ ps     : num  NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
##  $ ts     : num  3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
##  $ mls    : num  38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
##  $ gt     : num  645 42 60 25 624 180 35 392 63 230 ...
##  $ pi     : int  3 3 1 5 3 4 1 4 1 1 ...
##  $ sei    : int  5 1 1 2 5 4 1 5 2 1 ...
##  $ odi    : int  3 3 1 3 4 4 1 4 1 1 ...

As we have seen before, the function md.pattern() can be used to display all different missing data patterns. How many different missing data patterns are present in the mammalsleep dataframe and which pattern occurs most frequently in the data?

plot_pattern(mammalsleep, square = TRUE, rotate = TRUE)

Answer: 8 patterns in total, with the pattern where everything is observed occuring the most (42 times).


10. Generate five imputed datasets with the default method pmm. Give the algorithm 10 iterations.

imp1 <- mice(mammalsleep, 
             maxit = 10, 
             print = FALSE)
## Warning: Number of logged events: 525

We ignore the loggedEvents for now: we’ll consider that in the fifth practical. To inspect the trace lines for assessing algorithmic convergence:

plot_trace(imp1)


11. Perform a regression analysis on the imputed dataset with sws as dependent variable and log10(bw) and odi as independent variables.

fit1 <- with(imp1, lm(sws ~ log10(bw) + odi))

12. Pool the regression analysis and inspect the pooled analysis.

est1 <- pool(fit1)
est1
## Class: mipo    m = 5 
##          term m   estimate       ubar            b          t dfcom       df
## 1 (Intercept) 5  9.6393795 0.70494441 0.0070520548 0.71340687    59 56.30774
## 2   log10(bw) 5 -1.7096792 0.09951661 0.0005674433 0.10019754    59 56.67164
## 3         odi 5 -0.5015282 0.08813811 0.0014276654 0.08985131    59 55.72444
##           riv      lambda        fmi
## 1 0.012004444 0.011862047 0.04518444
## 2 0.006842395 0.006795894 0.04008488
## 3 0.019437659 0.019067041 0.05247504
summary(est1)
##          term   estimate std.error statistic       df      p.value
## 1 (Intercept)  9.6393795 0.8446342 11.412491 56.30774 4.440892e-16
## 2   log10(bw) -1.7096792 0.3165400 -5.401148 56.67164 1.361833e-06
## 3         odi -0.5015282 0.2997521 -1.673143 55.72444 9.990461e-02

The fmi and lambda are much too high. This is due to species being included in the imputation model. Because there are 62 species and mice automatically converts factors (categorical variables) to dummy variables, each species is modeled by its own imputation model.


13. Impute mammalsleep again, but now exclude species from the data.

imp2 <- mice(mammalsleep[ , -1], 
             maxit = 10, 
             print = FALSE)
## Warning: Number of logged events: 19

14. Compute and pool the regression analysis again.

fit2 <- with(imp2, lm(sws ~ log10(bw) + odi))
est2 <- pool(fit2)
est2
## Class: mipo    m = 5 
##          term m   estimate       ubar            b          t dfcom       df
## 1 (Intercept) 5 11.6074211 0.58918970 0.0119315063 0.60350751    59 55.30838
## 2   log10(bw) 5 -1.1498004 0.08317558 0.0049380096 0.08910120    59 50.33327
## 3         odi 5 -0.9041366 0.07366548 0.0005685897 0.07434778    59 56.50548
##           riv      lambda        fmi
## 1 0.024300845 0.023724324 0.05721096
## 2 0.071242199 0.066504287 0.10151042
## 3 0.009262244 0.009177242 0.04247914
summary(est2)
##          term   estimate std.error statistic       df     p.value
## 1 (Intercept) 11.6074211 0.7768575 14.941507 55.30838 0.000000000
## 2   log10(bw) -1.1498004 0.2984982 -3.851951 50.33327 0.000332537
## 3         odi -0.9041366 0.2726679 -3.315890 56.50548 0.001600279

Note that the fmi and lambda have dramatically decreased. The imputation model has been greatly improved.


15. Plot the trace lines for the new imputations

plot_trace(imp2)

Even though the fraction of information missing due to nonresponse (fmi) and the relative increase in variance due to nonresponse (lambda) are nice and low, the convergence turns out to be a real problem. The reason is the structure in the data. Total sleep (ts) is the sum of paradoxical sleep (ps) and short wave sleep (sws). This relation is ignored in the imputations, but it is necessary to take this relation into account. mice offers a routine called passive imputation, which allows users to take transformations, combinations and recoded variables into account when imputing their data.

We explain passive imputation in detail in the next practical.


Conclusion

We have seen that the practical execution of multiple imputation and pooling is straightforward with the R package mice. The package is designed to allow you to assess and control the imputations themselves, the convergence of the algorithm and the distributions and multivariate relations of the observed and imputed data.

It is important to ‘gain’ this control as a user. After all, we are imputing values and we aim to properly adress the uncertainty about the missingness problem.


- End of practical


sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggmice_0.0.1.9000 ggplot2_3.3.6     lattice_0.20-45   dplyr_1.0.9      
## [5] mice_3.14.0      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.9       highr_0.9        bslib_0.3.1      compiler_4.2.0  
##  [5] pillar_1.7.0     jquerylib_0.1.4  tools_4.2.0      digest_0.6.29   
##  [9] gtable_0.3.0     jsonlite_1.8.0   evaluate_0.15    lifecycle_1.0.1 
## [13] tibble_3.1.7     pkgconfig_2.0.3  rlang_1.0.3      cli_3.3.0       
## [17] DBI_1.1.2        rstudioapi_0.13  yaml_2.3.5       xfun_0.31       
## [21] fastmap_1.1.0    withr_2.5.0      stringr_1.4.0    knitr_1.39      
## [25] generics_0.1.3   vctrs_0.4.1      sass_0.4.1       nnet_7.3-17     
## [29] grid_4.2.0       tidyselect_1.1.2 glue_1.6.2       R6_2.5.1        
## [33] fansi_1.0.3      rmarkdown_2.14   farver_2.1.1     tidyr_1.2.0     
## [37] purrr_0.3.4      magrittr_2.0.3   MASS_7.3-56      scales_1.2.0    
## [41] backports_1.4.1  htmltools_0.5.2  ellipsis_0.3.2   assertthat_0.2.1
## [45] colorspace_2.0-3 labeling_0.4.2   utf8_1.2.2       stringi_1.7.6   
## [49] munsell_0.5.0    broom_1.0.0      crayon_1.5.1