mice
: The imputation and
nonresponse modelsmice
teamThis is the third practical.
In this practical we will focus on analyzing the relation between the
data and the missingness. For non-R
users: In
R
one can simply call the help function for a any specific
function func
by typing help(func)
. E.g.
help(mice)
directs you to the help page of the
mice
function.
All the best,
Gerko and the mice team
1. Load R
and load the packages
mice
, ggmice
, lattice
and
dplyr
. Set the seed to 123
.
library(mice) # Data imputation
library(dplyr) # Data manipulation
library(lattice) # Plotting device
library(ggplot2) # Plotting device
library(ggmice) # gg-like plots for mice
set.seed(123)
We choose seed value 123
. This is an arbitrary value;
any value would be an equally good seed value. Fixing the random seed
enables you (and others) to exactly replicate anything that involves
random number generators. If you set the seed in your R
instance to 123
, you will get the exact same results and
plots as we present in this document if you follow the order and code of
the exercises precisely.
boys
data set2. The boys
dataset is part of
mice
. It is a subset of a large Dutch dataset containing
growth measures from the Fourth Dutch Growth Study. Inspect the help for
boys
dataset and make yourself familiar with its
contents.
To learn more about the contents of the data, use one of the two following help commands:
help(boys)
?boys
3. Get an overview of the data. Find information about the size of the data, the variables measured and the amount of missingness.
The first 10 cases are:
head(boys, n = 10)
## age hgt wgt bmi hc gen phb tv reg
## 3 0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4 0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
## 37 0.068 52.5 3.810 13.82 34.9 <NA> <NA> NA south
## 38 0.071 53.0 3.890 13.84 35.8 <NA> <NA> NA west
## 39 0.071 55.1 3.880 12.77 36.8 <NA> <NA> NA west
## 43 0.073 54.5 4.200 14.14 38.0 <NA> <NA> NA east
The last 10 cases are:
tail(boys, n = 10)
## age hgt wgt bmi hc gen phb tv reg
## 7329 20.032 184.0 73.0 21.56 56.0 <NA> <NA> NA north
## 7362 20.117 188.7 89.4 25.10 58.1 G5 P6 25 east
## 7396 20.281 185.1 81.1 23.67 58.8 G5 P6 20 south
## 7405 20.323 182.5 69.0 20.71 59.0 <NA> <NA> NA north
## 7410 20.372 188.7 59.8 16.79 55.2 <NA> <NA> NA west
## 7418 20.429 181.1 67.2 20.48 56.6 <NA> <NA> NA north
## 7444 20.761 189.1 88.0 24.60 NA <NA> <NA> NA west
## 7447 20.780 193.5 75.4 20.13 NA <NA> <NA> NA west
## 7451 20.813 189.0 78.0 21.83 59.9 <NA> <NA> NA north
## 7475 21.177 181.8 76.5 23.14 NA <NA> <NA> NA east
We now have a clear indication that the data are sorted. A simple evaluation
!is.unsorted(boys$age)
## [1] TRUE
confirms this - !is.unsorted()
evaluates the complement
of is.unsorted()
, so it tests whether the data are sorted.
There is no is.sorted
function in R
.
The dimensions of the boys
data set are:
dim(boys)
## [1] 748 9
We see that the boys
data set has 748 cases over 9
variables. From those 9 variables
summary(boys)
## age hgt wgt bmi
## Min. : 0.035 Min. : 50.00 Min. : 3.14 Min. :11.77
## 1st Qu.: 1.581 1st Qu.: 84.88 1st Qu.: 11.70 1st Qu.:15.90
## Median :10.505 Median :147.30 Median : 34.65 Median :17.45
## Mean : 9.159 Mean :132.15 Mean : 37.15 Mean :18.07
## 3rd Qu.:15.267 3rd Qu.:175.22 3rd Qu.: 59.58 3rd Qu.:19.53
## Max. :21.177 Max. :198.00 Max. :117.40 Max. :31.74
## NA's :20 NA's :4 NA's :21
## hc gen phb tv reg
## Min. :33.70 G1 : 56 P1 : 63 Min. : 1.00 north: 81
## 1st Qu.:48.12 G2 : 50 P2 : 40 1st Qu.: 4.00 east :161
## Median :53.00 G3 : 22 P3 : 19 Median :12.00 west :239
## Mean :51.51 G4 : 42 P4 : 32 Mean :11.89 south:191
## 3rd Qu.:56.00 G5 : 75 P5 : 50 3rd Qu.:20.00 city : 73
## Max. :65.00 NA's:503 P6 : 41 Max. :25.00 NA's : 3
## NA's :46 NA's:503 NA's :522
function summary()
informs us that testicular volume
tv
has the most missings, followed by the genital and pubic
hair stages gen
and phb
, each with 503 missing
cells.
4. As we have seen before, the function
md.pattern()
can be used to display all different missing
data patterns. How many different missing data patterns are present in
the boys dataframe and which pattern occurs most frequently in the
data?
plot_pattern(boys)
There are 13 patterns in total, with the pattern where
gen
,
phb
and tv
are missing occuring the most.
5. How many patterns occur for which the variable
gen
(genital Tannerstage) is missing?
mpat <- md.pattern(boys, plot = FALSE)
sum(mpat[, "gen"] == 0)
## [1] 8
Answer: 8 patterns (503 cases)
6. Let us focus more precisely on the missing data patterns.
Does the missing data of gen
depend on age
?
One could for example check this by making a histogram of
age
separately for the cases with known genital stages and
for cases with missing genital stages.
To create said histogram in R
, a missingness indicator
for gen
has to be created. A missingness indicator is a
dummy variable with value 1
for observed values (in this
case genital status) and 0
for missing values. Create a
missingness indicator for gen
by typing
R <- is.na(boys$gen)
head(R, n = 100)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
tail(R, n = 100)
## [1] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## [13] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE
## [25] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
## [37] TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
## [49] TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## [61] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
## [73] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## [85] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
## [97] TRUE TRUE TRUE TRUE
length(R)
## [1] 748
As we can see, the missingness indicator tells us for each of the 748
values in gen
whether it is missing (TRUE
) or
observed (FALSE
).
A histogram can be made with the function histogram()
from package lattice
.
lattice::histogram(boys$gen)
Or with ggmice
:
boys %>%
ggmice(aes(gen)) +
geom_bar()
or, equivalently, one could use
lattice::histogram(~ gen, data = boys)
Writing the latter line of code for plots is more efficient than
selecting every part of the boys
data with the
boys$...
command, especially if plots become more advanced.
The code for a conditional histogram of age
given
R
is
lattice::histogram(~ age | R, data=boys)
The histogram shows that the missingness in gen
is not
equally distributed across age
; or, equivalently,
age
seems to be differently distributed for observed and
missing gen
.
With ggmice
:
boys %>%
ggmice(aes(age)) +
geom_histogram() +
facet_wrap(R)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
7. Impute the boys
dataset with mice using all
default settings and name the mids
(multiply imputed data
set) object imp
.
imp <- mice(boys, print=FALSE)
8. Compare the means of the imputed data with the means of
the incomplete data. One can use the function complete()
with a mids
-object as argument to obtain an imputed
dataset. As default, the first imputed dataset will be given by this
function.
summary(boys)
## age hgt wgt bmi
## Min. : 0.035 Min. : 50.00 Min. : 3.14 Min. :11.77
## 1st Qu.: 1.581 1st Qu.: 84.88 1st Qu.: 11.70 1st Qu.:15.90
## Median :10.505 Median :147.30 Median : 34.65 Median :17.45
## Mean : 9.159 Mean :132.15 Mean : 37.15 Mean :18.07
## 3rd Qu.:15.267 3rd Qu.:175.22 3rd Qu.: 59.58 3rd Qu.:19.53
## Max. :21.177 Max. :198.00 Max. :117.40 Max. :31.74
## NA's :20 NA's :4 NA's :21
## hc gen phb tv reg
## Min. :33.70 G1 : 56 P1 : 63 Min. : 1.00 north: 81
## 1st Qu.:48.12 G2 : 50 P2 : 40 1st Qu.: 4.00 east :161
## Median :53.00 G3 : 22 P3 : 19 Median :12.00 west :239
## Mean :51.51 G4 : 42 P4 : 32 Mean :11.89 south:191
## 3rd Qu.:56.00 G5 : 75 P5 : 50 3rd Qu.:20.00 city : 73
## Max. :65.00 NA's:503 P6 : 41 Max. :25.00 NA's : 3
## NA's :46 NA's:503 NA's :522
summary(complete(imp))
## age hgt wgt bmi
## Min. : 0.035 Min. : 50.00 Min. : 3.14 Min. :11.77
## 1st Qu.: 1.581 1st Qu.: 83.53 1st Qu.: 11.85 1st Qu.:15.89
## Median :10.505 Median :145.75 Median : 34.55 Median :17.39
## Mean : 9.159 Mean :131.12 Mean : 37.11 Mean :18.03
## 3rd Qu.:15.267 3rd Qu.:175.00 3rd Qu.: 59.35 3rd Qu.:19.48
## Max. :21.177 Max. :198.00 Max. :117.40 Max. :31.74
## hc gen phb tv reg
## Min. :33.70 G1:305 P1:325 Min. : 1.000 north: 82
## 1st Qu.:48.45 G2:149 P2:112 1st Qu.: 2.000 east :161
## Median :53.20 G3: 55 P3: 45 Median : 3.000 west :240
## Mean :51.64 G4: 83 P4: 69 Mean : 8.239 south:192
## 3rd Qu.:56.00 G5:156 P5:112 3rd Qu.:15.000 city : 73
## Max. :65.00 P6: 85 Max. :25.000
Most means are roughly equal, except the mean of tv
,
which is much lower in the first imputed data set, when compared to the
incomplete data. This makes sense because most genital measures are
unobserved for the lower ages. When imputing these values, the means
should decrease.
Investigating univariate properties by using functions such as
summary()
, may not be ideal in the case of hundreds of
variables. To extract just the information you need, for all imputed
datasets, we can make use of the with()
function. To obtain
summaries for each imputed tv
only, type
imp %>%
with(summary(tv)) %>%
summary()
## # A tibble: 5 × 6
## minimum q1 median mean q3 maximum
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 3 8.24 15 25
## 2 1 2 3 8.52 15 25
## 3 1 2 4 8.39 15 25
## 4 1 2 3 8.52 15 25
## 5 1 2 3 8.45 15 25
And to obtain e.g. the means alone, run
imp %>%
with(mean(tv)) %>%
summary()
## # A tibble: 5 × 1
## x
## <dbl>
## 1 8.24
## 2 8.52
## 3 8.39
## 4 8.52
## 5 8.45
The mammalsleep
dataset is part of mice
. It
contains the Allison and Cicchetti (1976) data for mammalian species. To
learn more about this data, type
help(mammalsleep)
9. Get an overview of the data.
Find information about the size of the data, the variables measured and the amount of missingness.
head(mammalsleep)
## species bw brw sws ps ts mls gt pi sei odi
## 1 African elephant 6654.000 5712.0 NA NA 3.3 38.6 645 3 5 3
## 2 African giant pouched rat 1.000 6.6 6.3 2.0 8.3 4.5 42 3 1 3
## 3 Arctic Fox 3.385 44.5 NA NA 12.5 14.0 60 1 1 1
## 4 Arctic ground squirrel 0.920 5.7 NA NA 16.5 NA 25 5 2 3
## 5 Asian elephant 2547.000 4603.0 2.1 1.8 3.9 69.0 624 3 5 4
## 6 Baboon 10.550 179.5 9.1 0.7 9.8 27.0 180 4 4 4
summary(mammalsleep)
## species bw brw
## African elephant : 1 Min. : 0.005 Min. : 0.14
## African giant pouched rat: 1 1st Qu.: 0.600 1st Qu.: 4.25
## Arctic Fox : 1 Median : 3.342 Median : 17.25
## Arctic ground squirrel : 1 Mean : 198.790 Mean : 283.13
## Asian elephant : 1 3rd Qu.: 48.202 3rd Qu.: 166.00
## Baboon : 1 Max. :6654.000 Max. :5712.00
## (Other) :56
## sws ps ts mls
## Min. : 2.100 Min. :0.000 Min. : 2.60 Min. : 2.000
## 1st Qu.: 6.250 1st Qu.:0.900 1st Qu.: 8.05 1st Qu.: 6.625
## Median : 8.350 Median :1.800 Median :10.45 Median : 15.100
## Mean : 8.673 Mean :1.972 Mean :10.53 Mean : 19.878
## 3rd Qu.:11.000 3rd Qu.:2.550 3rd Qu.:13.20 3rd Qu.: 27.750
## Max. :17.900 Max. :6.600 Max. :19.90 Max. :100.000
## NA's :14 NA's :12 NA's :4 NA's :4
## gt pi sei odi
## Min. : 12.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 35.75 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median : 79.00 Median :3.000 Median :2.000 Median :2.000
## Mean :142.35 Mean :2.871 Mean :2.419 Mean :2.613
## 3rd Qu.:207.50 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :645.00 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :4
str(mammalsleep)
## 'data.frame': 62 obs. of 11 variables:
## $ species: Factor w/ 62 levels "African elephant",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ bw : num 6654 1 3.38 0.92 2547 ...
## $ brw : num 5712 6.6 44.5 5.7 4603 ...
## $ sws : num NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
## $ ps : num NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
## $ ts : num 3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
## $ mls : num 38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
## $ gt : num 645 42 60 25 624 180 35 392 63 230 ...
## $ pi : int 3 3 1 5 3 4 1 4 1 1 ...
## $ sei : int 5 1 1 2 5 4 1 5 2 1 ...
## $ odi : int 3 3 1 3 4 4 1 4 1 1 ...
As we have seen before, the function md.pattern()
can be
used to display all different missing data patterns. How many different
missing data patterns are present in the mammalsleep
dataframe and which pattern occurs most frequently in the data?
plot_pattern(mammalsleep, square = TRUE, rotate = TRUE)
Answer: 8 patterns in total, with the pattern where everything is observed occuring the most (42 times).
10. Generate five imputed datasets with the default method
pmm
. Give the algorithm 10 iterations.
imp1 <- mice(mammalsleep,
maxit = 10,
print = FALSE)
## Warning: Number of logged events: 525
We ignore the loggedEvents
for now: we’ll consider that
in the fifth practical. To inspect the trace lines for assessing
algorithmic convergence:
plot_trace(imp1)
11. Perform a regression analysis on the imputed dataset with
sws
as dependent variable and log10(bw)
and
odi
as independent variables.
fit1 <- with(imp1, lm(sws ~ log10(bw) + odi))
12. Pool the regression analysis and inspect the pooled analysis.
est1 <- pool(fit1)
est1
## Class: mipo m = 5
## term m estimate ubar b t dfcom df
## 1 (Intercept) 5 9.6393795 0.70494441 0.0070520548 0.71340687 59 56.30774
## 2 log10(bw) 5 -1.7096792 0.09951661 0.0005674433 0.10019754 59 56.67164
## 3 odi 5 -0.5015282 0.08813811 0.0014276654 0.08985131 59 55.72444
## riv lambda fmi
## 1 0.012004444 0.011862047 0.04518444
## 2 0.006842395 0.006795894 0.04008488
## 3 0.019437659 0.019067041 0.05247504
summary(est1)
## term estimate std.error statistic df p.value
## 1 (Intercept) 9.6393795 0.8446342 11.412491 56.30774 4.440892e-16
## 2 log10(bw) -1.7096792 0.3165400 -5.401148 56.67164 1.361833e-06
## 3 odi -0.5015282 0.2997521 -1.673143 55.72444 9.990461e-02
The fmi
and lambda
are much too high. This
is due to species
being included in the imputation model.
Because there are 62 species and mice automatically converts factors
(categorical variables) to dummy variables, each species is modeled by
its own imputation model.
13. Impute mammalsleep
again, but now exclude
species
from the data.
imp2 <- mice(mammalsleep[ , -1],
maxit = 10,
print = FALSE)
## Warning: Number of logged events: 19
14. Compute and pool the regression analysis again.
fit2 <- with(imp2, lm(sws ~ log10(bw) + odi))
est2 <- pool(fit2)
est2
## Class: mipo m = 5
## term m estimate ubar b t dfcom df
## 1 (Intercept) 5 11.6074211 0.58918970 0.0119315063 0.60350751 59 55.30838
## 2 log10(bw) 5 -1.1498004 0.08317558 0.0049380096 0.08910120 59 50.33327
## 3 odi 5 -0.9041366 0.07366548 0.0005685897 0.07434778 59 56.50548
## riv lambda fmi
## 1 0.024300845 0.023724324 0.05721096
## 2 0.071242199 0.066504287 0.10151042
## 3 0.009262244 0.009177242 0.04247914
summary(est2)
## term estimate std.error statistic df p.value
## 1 (Intercept) 11.6074211 0.7768575 14.941507 55.30838 0.000000000
## 2 log10(bw) -1.1498004 0.2984982 -3.851951 50.33327 0.000332537
## 3 odi -0.9041366 0.2726679 -3.315890 56.50548 0.001600279
Note that the fmi
and lambda
have
dramatically decreased. The imputation model has been greatly
improved.
15. Plot the trace lines for the new imputations
plot_trace(imp2)
Even though the fraction of information missing due to nonresponse
(fmi) and the relative increase in variance due to nonresponse (lambda)
are nice and low, the convergence turns out to be a real problem. The
reason is the structure in the data. Total sleep (ts
) is
the sum of paradoxical sleep (ps
) and short wave sleep
(sws
). This relation is ignored in the imputations, but it
is necessary to take this relation into account. mice
offers a routine called passive imputation, which allows users
to take transformations, combinations and recoded variables into account
when imputing their data.
We explain passive imputation in detail in the next practical.
We have seen that the practical execution of multiple imputation and
pooling is straightforward with the R
package
mice
. The package is designed to allow you to assess and
control the imputations themselves, the convergence of the algorithm and
the distributions and multivariate relations of the observed and imputed
data.
It is important to ‘gain’ this control as a user. After all, we are imputing values and we aim to properly adress the uncertainty about the missingness problem.
- End of practical
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggmice_0.0.1.9000 ggplot2_3.3.6 lattice_0.20-45 dplyr_1.0.9
## [5] mice_3.14.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 highr_0.9 bslib_0.3.1 compiler_4.2.0
## [5] pillar_1.7.0 jquerylib_0.1.4 tools_4.2.0 digest_0.6.29
## [9] gtable_0.3.0 jsonlite_1.8.0 evaluate_0.15 lifecycle_1.0.1
## [13] tibble_3.1.7 pkgconfig_2.0.3 rlang_1.0.3 cli_3.3.0
## [17] DBI_1.1.2 rstudioapi_0.13 yaml_2.3.5 xfun_0.31
## [21] fastmap_1.1.0 withr_2.5.0 stringr_1.4.0 knitr_1.39
## [25] generics_0.1.3 vctrs_0.4.1 sass_0.4.1 nnet_7.3-17
## [29] grid_4.2.0 tidyselect_1.1.2 glue_1.6.2 R6_2.5.1
## [33] fansi_1.0.3 rmarkdown_2.14 farver_2.1.1 tidyr_1.2.0
## [37] purrr_0.3.4 magrittr_2.0.3 MASS_7.3-56 scales_1.2.0
## [41] backports_1.4.1 htmltools_0.5.2 ellipsis_0.3.2 assertthat_0.2.1
## [45] colorspace_2.0-3 labeling_0.4.2 utf8_1.2.2 stringi_1.7.6
## [49] munsell_0.5.0 broom_1.0.0 crayon_1.5.1