mice: Algorithmic convergence and inference pooling
This is the second practical.
The aim of this practical is to deepen your understanding of multiple imputation in general. You will learn how to pool the results of analyses performed on multiply imputed data, how to approach different types of data, and how to avoid the pitfalls researchers may fall into. The main objective is to increase your knowledge and understanding of applications of multiple imputation.
Again, we start by loading the necessary packages (with library()) and fixing the random seed so that our outcomes are replicable.
library(mice)
set.seed(123)
We also load the following packages
library(ggmice)
library(ggplot2)
library(dplyr)
library(magrittr)
1. Vary the number of imputations.
The number of imputed data sets can be specified by the m = ... argument. For example, to create just three imputed data sets, specify
imp <- mice(nhanes,
m = 3,
print = FALSE)
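To verify the result, note that the mids object stores the number of imputations, and that complete() extracts any single completed data set (a quick sanity check, not part of the original code):
imp$m                  # number of imputations; here 3
head(complete(imp, 3)) # first rows of the third completed data set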
2. Change the predictor matrix
The predictor matrix is a square matrix that specifies the variables that are used to impute each incomplete variable. Let us have a look at the predictor matrix that was used
imp$pred
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
Each variable in the data has a row and a column in the predictor matrix. A value of 1 indicates that the column variable was used to impute the row variable. For example, the 1 at entry [bmi, age] indicates that variable age was used to impute the incomplete variable bmi. Note that the diagonal is zero, because a variable is not allowed to impute itself. The row for age is redundant, because there are no missing values in age. Even though predictor relations are specified for age, mice will not use these relations, because it never overwrites observed values with imputations. mice gives you complete control over the predictor matrix, enabling you to choose your own predictor relations. This can be very useful, for example, when you have many variables or when you have clear ideas or prior knowledge about relations in the data at hand.
There are two ways in which you can create a predictor matrix in mice:
1. Grab the predictorMatrix from any object returned by mice(). For example, we can use mice() to give us the initial predictor matrix, and change it afterwards, without running the algorithm. This can be done by typing
ini <- mice(nhanes, maxit = 0, print = FALSE)
pred <- ini$pred
pred
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
The object pred contains the predictor matrix from an initial run of mice with zero iterations, as specified by maxit = 0.
2. Use make.predictorMatrix() to generate a predictor matrix from any incomplete data set. For example,
pred <- make.predictorMatrix(nhanes)
pred
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
Altering the predictor matrix and returning it to the mice algorithm is very simple. For example, the following code removes the variable hyp from the set of predictors, but still leaves it to be predicted by the other variables.
pred[, "hyp"] <- 0
pred
## age bmi hyp chl
## age 0 1 0 1
## bmi 1 0 0 1
## hyp 1 1 0 1
## chl 1 1 0 0
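Remember that rows are targets and columns are predictors, so you can also edit a single row. As an illustrative sketch (using a hypothetical copy pred2, so the pred object above stays untouched), imputing bmi from age only would look like:
pred2 <- pred            # work on a copy; pred2 is our own scratch object
pred2["bmi", ] <- 0      # clear all predictor relations for bmi
pred2["bmi", "age"] <- 1 # impute bmi from age only
pred2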
Use your new predictor matrix in mice() as follows:
imp <- mice(nhanes,
predictorMatrix = pred,
print = FALSE)
There is a quickpred() function that applies a quick predictor-selection procedure, which can be handy for data sets containing many variables. See ?quickpred for more info. Selecting predictors according to relations in the data with a minimum correlation of \(\rho = .30\) can be done by
ini <- mice(nhanes,
pred = quickpred(nhanes, mincor = .3),
print = FALSE)
ini$pred
## age bmi hyp chl
## age 0 0 0 0
## bmi 1 0 0 1
## hyp 1 0 0 1
## chl 1 1 1 0
For large predictor matrices, it can be useful to export them to dedicated spreadsheet software such as Microsoft Excel for easier configuration (see e.g. the xlsx package for easy exporting and importing of Excel files). Importing data is straightforward in RStudio through File > Import Dataset.
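If you prefer to stay within base R, a minimal round-trip sketch with csv files works just as well (the file name pred.csv is only an example):
write.csv(pred, "pred.csv")                            # edit the matrix outside R
pred <- as.matrix(read.csv("pred.csv", row.names = 1)) # read the edited version back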
3. Inspect the convergence of the algorithm
The mice() function implements an iterative Markov chain Monte Carlo type of algorithm. Let us have a look at the trace lines generated by the algorithm to study convergence:
imp <- mice(nhanes, print = FALSE)
plot_trace(imp)
The plot shows the mean (left) and standard deviation (right) of the imputed values only. In general, we would like the streams to intermingle and be free of any trends at the later iterations. We inspect trends for the imputed values alone, because the observed data does not change.
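The quantities behind these trace lines are stored in the mids object itself, in the chainMean and chainVar components; a quick numerical inspection (our own sketch) is:
dim(imp$chainMean)       # nvar x maxit x m array of chain means
imp$chainMean["bmi", , ] # chain means for bmi, one column per imputation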
The mice algorithm uses random sampling, and therefore the results will be (perhaps slightly) different if we repeat the imputations with different seeds. In order to get exactly the same result, use the seed argument:
imp <- mice(nhanes,
seed = 123,
print = FALSE)
where 123 is some arbitrary number that you can choose yourself. Rerunning this command will always yield the same imputed values.
4. Change the imputation method
For each column, the algorithm requires a specification of the imputation method. To see which method was used by default:
imp$meth
## age bmi hyp chl
## "" "pmm" "pmm" "pmm"
The variable age is complete and therefore not imputed, denoted by the empty string "". The other variables have method pmm, which stands for predictive mean matching, the default in mice for numerical and integer data.
In reality, the nhanes data are better described as a mix of numerical and categorical data. Let us take a look at the nhanes2 data frame:
summary(nhanes2)
## age bmi hyp chl
## 20-39:12 Min. :20.40 no :13 Min. :113.0
## 40-59: 7 1st Qu.:22.65 yes : 4 1st Qu.:185.0
## 60-99: 6 Median :26.75 NA's: 8 Median :187.0
## Mean :26.56 Mean :191.4
## 3rd Qu.:28.93 3rd Qu.:212.0
## Max. :35.30 Max. :284.0
## NA's :9 NA's :10
and the structure of the data frame
str(nhanes2)
## 'data.frame': 25 obs. of 4 variables:
## $ age: Factor w/ 3 levels "20-39","40-59",..: 1 2 1 3 1 3 1 1 2 2 ...
## $ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
## $ hyp: Factor w/ 2 levels "no","yes": NA 1 1 NA 1 NA 1 1 1 NA ...
## $ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
Variable age consists of 3 age categories, while variable hyp is binary. The mice() function automatically takes these properties into account. Impute the nhanes2 dataset:
imp <- mice(nhanes2,
print = FALSE)
imp$meth
## age bmi hyp chl
## "" "pmm" "logreg" "pmm"
Notice that mice has set the imputation method for variable hyp to logreg, which implements multiple imputation by logistic regression.
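The imputed values themselves can be inspected via the imp component of the mids object; for hyp these are draws of the factor levels no and yes:
imp$imp$hyp # imputed values for hyp, one column per imputation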
An up-to-date overview of the methods in mice can be found by
methods(mice)
## [1] mice.impute.2l.bin mice.impute.2l.lmer
## [3] mice.impute.2l.norm mice.impute.2l.pan
## [5] mice.impute.2lonly.mean mice.impute.2lonly.norm
## [7] mice.impute.2lonly.pmm mice.impute.cart
## [9] mice.impute.jomoImpute mice.impute.lasso.logreg
## [11] mice.impute.lasso.norm mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm mice.impute.lda
## [15] mice.impute.logreg mice.impute.logreg.boot
## [17] mice.impute.mean mice.impute.midastouch
## [19] mice.impute.mnar.logreg mice.impute.mnar.norm
## [21] mice.impute.norm mice.impute.norm.boot
## [23] mice.impute.norm.nob mice.impute.norm.predict
## [25] mice.impute.panImpute mice.impute.passive
## [27] mice.impute.pmm mice.impute.polr
## [29] mice.impute.polyreg mice.impute.quadratic
## [31] mice.impute.rf mice.impute.ri
## [33] mice.impute.sample mice.mids
## [35] mice.theme
## see '?methods' for accessing help and source code
Let us change the imputation method for bmi to Bayesian normal linear regression imputation:
ini <- mice(nhanes2,
maxit = 0)
meth <- ini$meth
meth
## age bmi hyp chl
## "" "pmm" "logreg" "pmm"
meth["bmi"] <- "norm"
meth
## age bmi hyp chl
## "" "norm" "logreg" "pmm"
and run the imputations again.
imp <- mice(nhanes2,
meth = meth,
print = FALSE)
We may now again plot trace lines to study convergence
plot_trace(imp)
5. Extend the number of iterations
Though using just five iterations (the default) often works well in practice, we need to extend the number of iterations of the mice algorithm to confirm that there is no trend and that the trace lines intermingle well. We can increase the number of iterations to 40 by running 35 additional iterations with the mice.mids() function.
imp40 <- mice.mids(imp,
maxit = 35, # add 35 iterations
print = FALSE)
plot_trace(imp40)
6. Further diagnostic checking. Use function stripplot().
Generally, one would prefer the imputed data to be plausible values, i.e. values that could have been observed had they not been missing. In order to form an idea about plausibility, one may check the imputations and compare them against the observed values. If we are willing to assume that the data are missing completely at random (MCAR), then the imputations should have the same distribution as the observed data. In general, distributions may be different, because the missing data are MAR (or even MNAR). However, very large discrepancies should be investigated. Let us plot the observed and imputed data of chl by
ggmice(imp, aes(x = .imp, y = chl)) +
geom_jitter(width = .1)
The convention is to plot observed data in blue and imputed data in red. The figure graphs the data values of chl before and after imputation. Since the PMM method draws imputations from the observed data, imputed values have the same gaps as the observed data, and are always within the range of the observed data. The figure indicates that the distributions of the imputed and the observed values are similar. The observed data have a particular feature: for some reason, the data cluster around the value of 187. The imputations reflect this feature, and are close to the data. Under MCAR, univariate distributions of the observed and imputed data are expected to be identical. Under MAR, they can be different, both in location and spread, but their multivariate distribution is assumed to be identical. There are many other ways to look at the imputed data.
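For instance, mice itself ships lattice-based plotting methods for mids objects, including the stripplot() function mentioned in the heading; an equivalent of the plot above is:
stripplot(imp, chl ~ .imp, pch = 20, cex = 2) # observed (blue) vs. imputed (red) chl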
The same plot can be made for the variable bmi.
ggmice(imp, aes(x = .imp, y = bmi)) +
geom_jitter(width = .1)
Remember that bmi was imputed by Bayesian linear regression, so (the range of) the imputed values may differ from the observed values.
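A quick numerical check makes this visible (a small sketch using the imp$imp component):
range(nhanes2$bmi, na.rm = TRUE) # range of the observed bmi values
range(unlist(imp$imp$bmi))       # imputed bmi values may fall outside it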
7. Perform the following regression analysis on the multiply imputed data and assign the result to object fit.
\[ \text{bmi} = \beta_0 + \beta_1 \text{chl} + \epsilon \]
fit <- with(imp, lm(bmi ~ chl))
fit
## call :
## with.mids(data = imp, expr = lm(bmi ~ chl))
##
## call1 :
## mice(data = nhanes2, method = meth, printFlag = FALSE)
##
## nmis :
## age bmi hyp chl
## 0 9 8 10
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 23.32599 0.02282
##
##
## [[2]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 21.92498 0.02085
##
##
## [[3]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 20.2116 0.0344
##
##
## [[4]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 20.15665 0.03852
##
##
## [[5]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 24.80765 0.01157
The fit object contains the regression summaries for each data set. The new object fit is actually of class mira (multiply imputed repeated analyses).
class(fit)
## [1] "mira" "matrix"
Use the ls() function to find out what is in the object.
ls(fit)
## [1] "analyses" "call" "call1" "nmis"
Suppose we want to find the regression model fitted to the second imputed data set. It can be retrieved from the list by
summary(fit$analyses[[2]])
##
## Call:
## lm(formula = bmi ~ chl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6386 -3.1231 -0.2381 2.8443 8.8307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.92498 3.95950 5.537 1.24e-05 ***
## chl 0.02085 0.01970 1.058 0.301
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.264 on 23 degrees of freedom
## Multiple R-squared: 0.04642, Adjusted R-squared: 0.004965
## F-statistic: 1.12 on 1 and 23 DF, p-value: 0.301
8. Pool the analyses from object fit.
Pooling the repeated regression analyses can be done simply by typing
pool.fit <- pool(fit)
summary(pool.fit)
## term estimate std.error statistic df p.value
## 1 (Intercept) 22.08537216 4.58683836 4.814945 13.40323 0.0003100903
## 2 chl 0.02563088 0.02338968 1.095820 12.46079 0.2938853567
pool.fit
## Class: mipo m = 5
## term m estimate ubar b t dfcom
## 1 (Intercept) 5 22.08537216 16.177078691 4.0516728978 2.103909e+01 23
## 2 chl 5 0.02563088 0.000405596 0.0001179011 5.470773e-04 23
## df riv lambda fmi
## 1 13.40323 0.3005492 0.231094 0.3248446
## 2 12.46079 0.3488233 0.258613 0.3545184
or, equivalently, with a pipe
fit %>%
pool %>%
summary
## term estimate std.error statistic df p.value
## 1 (Intercept) 22.08537216 4.58683836 4.814945 13.40323 0.0003100903
## 2 chl 0.02563088 0.02338968 1.095820 12.46079 0.2938853567
which gives the relevant pooled regression coefficients and parameters, as well as the fraction of information about the coefficients missing due to nonresponse (fmi) and the proportion of the variation attributable to the missing data (lambda). The pooled fit object is of class mipo, which stands for multiply imputed pooled object.
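To see where these numbers come from, Rubin's rules can be reproduced by hand from the quantities printed above; a sketch for the chl coefficient (plugging in the printed ubar and b):
m      <- 5                       # number of imputations
ubar   <- 0.000405596             # average within-imputation variance (printed above)
b      <- 0.0001179011            # between-imputation variance (printed above)
tvar   <- ubar + (1 + 1/m) * b    # total variance; matches the 't' column
riv    <- (1 + 1/m) * b / ubar    # relative increase in variance
lambda <- (1 + 1/m) * b / tvar    # proportion of variance due to missingness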
mice can pool many analyses from a variety of packages for you. For flexibility, and in order to run custom pooling functions, mice also incorporates the function pool.scalar(), which pools univariate estimates of \(m\) repeated complete-data analyses according to Rubin's pooling rules (Rubin, 1987, paragraph 3.1).
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
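As an illustration, a minimal sketch that pools the chl coefficient manually with pool.scalar() (the sapply() extraction is our own scaffolding, not part of the practical):
Q <- sapply(fit$analyses, function(x) coef(x)["chl"])        # m point estimates
U <- sapply(fit$analyses, function(x) vcov(x)["chl", "chl"]) # m sampling variances
pooled <- pool.scalar(Q, U, n = nrow(nhanes2))
pooled$qbar # pooled estimate
pooled$t    # total variance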
- End of practical
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Dutch_Netherlands.utf8 LC_CTYPE=Dutch_Netherlands.utf8
## [3] LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C
## [5] LC_TIME=Dutch_Netherlands.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] magrittr_2.0.3 dplyr_1.0.9 ggplot2_3.3.6 ggmice_0.0.1.9000
## [5] mice_3.14.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.8.3 highr_0.9 bslib_0.3.1 compiler_4.2.1
## [5] pillar_1.7.0 jquerylib_0.1.4 tools_4.2.1 digest_0.6.29
## [9] gtable_0.3.0 lattice_0.20-45 jsonlite_1.8.0 evaluate_0.15
## [13] lifecycle_1.0.1 tibble_3.1.7 pkgconfig_2.0.3 rlang_1.0.3
## [17] cli_3.3.0 DBI_1.1.3 rstudioapi_0.13 yaml_2.3.5
## [21] xfun_0.31 fastmap_1.1.0 withr_2.5.0 stringr_1.4.0
## [25] knitr_1.39 generics_0.1.3 vctrs_0.4.1 sass_0.4.1
## [29] grid_4.2.1 tidyselect_1.1.2 glue_1.6.2 R6_2.5.1
## [33] fansi_1.0.3 rmarkdown_2.14 farver_2.1.1 tidyr_1.2.0
## [37] purrr_0.3.4 scales_1.2.0 backports_1.4.1 htmltools_0.5.2
## [41] ellipsis_0.3.2 assertthat_0.2.1 colorspace_2.0-3 labeling_0.4.2
## [45] utf8_1.2.2 stringi_1.7.6 munsell_0.5.0 broom_1.0.0
## [49] crayon_1.5.1