mice: Algorithmic convergence and inference pooling
This is the second vignette in a series of six.
The aim of this vignette is to enhance your understanding of multiple imputation in general. You will learn how to pool the results of analyses performed on multiply-imputed data, how to approach different types of data, and how to avoid the pitfalls researchers may fall into. The main objective is to increase your knowledge and understanding of applications of multiple imputation.
No previous experience with R is required. Again, we start by loading (with require()) the necessary packages and fixing the random seed to allow for our outcomes to be replicable.
require(mice)
require(lattice)
set.seed(123)
1. Vary the number of imputations.
The number of imputed data sets can be specified by the m = ... argument. For example, to create just three imputed data sets, specify
imp <- mice(nhanes, m = 3, print=F)
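To check what the m argument produced, you can inspect the returned object. The following is a minimal sketch; the object names imp3 and completed2 and the seed value are our own choices for illustration.

```r
library(mice)

# Create three imputed data sets (the seed is an arbitrary choice for this sketch)
imp3 <- mice(nhanes, m = 3, seed = 123, printFlag = FALSE)

# Number of imputations stored in the returned object
imp3$m

# Imputed values for bmi: one row per missing cell, one column per imputation
dim(imp3$imp$bmi)

# Extract the second completed data set and check that no missing values remain
completed2 <- complete(imp3, 2)
anyNA(completed2)
```

Each call to complete() with a different index returns the observed data combined with a different set of imputed values.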
2. Change the predictor matrix
The predictor matrix is a square matrix that specifies the variables that are used to impute each incomplete variable. Let us have a look at the predictor matrix that was used:
imp$pred
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
Each variable in the data has a row and a column in the predictor matrix. A value of 1 indicates that the column variable was used to impute the row variable. For example, the 1 at entry [bmi, age] indicates that variable age was used to impute the incomplete variable bmi. Note that the diagonal is zero because a variable is not allowed to impute itself. The row of age is filled in as well in the matrix shown here, but it has no effect: there are no missing values in age, so nothing is imputed for it. mice gives you complete control over the predictor matrix, enabling you to choose your own predictor relations. This can be very useful, for example, when you have many variables or when you have clear ideas or prior knowledge about relations in the data at hand. You can use mice() to give you the initial predictor matrix, and change it afterwards, without running the algorithm. This can be done by typing
ini <- mice(nhanes, maxit=0, print=F)
pred <- ini$pred
pred
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
The object pred contains the predictor matrix from an initial run of mice with zero iterations, specified by maxit = 0. Altering the predictor matrix and returning it to the mice algorithm is very simple. For example, the following code removes the variable hyp from the set of predictors, but still leaves it to be predicted by the other variables.
pred[ ,"hyp"] <- 0
pred
## age bmi hyp chl
## age 0 1 0 1
## bmi 1 0 0 1
## hyp 1 1 0 1
## chl 1 1 0 0
Use your new predictor matrix in mice() as follows
imp <- mice(nhanes, pred=pred, print=F)
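Individual entries can be switched off in the same way as whole columns. The sketch below, under our own naming (pred_custom, imp_custom) and with an arbitrary seed, stops chl from being used to impute bmi while leaving every other relation intact, and then confirms that the matrix stored in the result reflects the change.

```r
library(mice)

# Start from the default predictor matrix without running any iterations
ini <- mice(nhanes, maxit = 0, printFlag = FALSE)
pred_custom <- ini$predictorMatrix

# Row = variable being imputed, column = predictor:
# do not use chl when imputing bmi
pred_custom["bmi", "chl"] <- 0

imp_custom <- mice(nhanes, predictorMatrix = pred_custom,
                   seed = 123, printFlag = FALSE)

# The stored matrix shows which predictors were actually requested
imp_custom$predictorMatrix["bmi", ]
```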
There is a special function called quickpred() for a quick selection procedure of predictors, which can be handy for datasets containing many variables. See ?quickpred for more info. Selecting predictors according to data relations with a minimum correlation of \(\rho=.30\) can be done by
ini <- mice(nhanes, pred=quickpred(nhanes, mincor=.3), print=F)
ini$pred
## age bmi hyp chl
## age 0 0 0 0
## bmi 1 0 0 1
## hyp 1 0 0 1
## chl 1 1 1 0
For large predictor matrices, it can be useful to export them to Microsoft Excel for easier configuration (e.g. see the xlsx package for easy exporting and importing of Excel files).
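If you prefer not to depend on an extra package, a plain CSV file opens in most spreadsheet programs too. The round trip below is a sketch using only base R functions (write.csv() and read.csv()) instead of xlsx; the temporary file path and the object names are assumptions made for illustration.

```r
library(mice)

# Default predictor matrix, obtained without running the algorithm
ini <- mice(nhanes, maxit = 0, printFlag = FALSE)
pred <- ini$predictorMatrix

# Write the matrix to a CSV file; row names end up in the first column
path <- tempfile(fileext = ".csv")
write.csv(pred, path)

# ... edit the 0/1 entries in a spreadsheet program and save as CSV ...

# Read the file back, restoring row names, and convert to a matrix
pred_back <- as.matrix(read.csv(path, row.names = 1))

# Without any edits, the entries match the original matrix
all(pred == pred_back)
```

The matrix read back in can then be passed to mice() through the predictorMatrix argument.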
3. Inspect the convergence of the algorithm
The mice() function implements an iterative Markov Chain Monte Carlo type of algorithm. Let us have a look at the trace lines generated by the algorithm to study convergence:
imp <- mice(nhanes, print=F)
plot(imp)
The plot shows the mean (left) and standard deviation (right) of the imputed values only. In general, we would like the streams to intermingle and be free of any trends at the later iterations.
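With the default of five iterations, trends can be hard to judge. One way to get a longer view is to continue the existing chains with mice.mids(). This is a sketch; the object names and the choice of ten extra iterations are ours.

```r
library(mice)

imp5 <- mice(nhanes, seed = 123, printFlag = FALSE)

# The plotted means are stored as an array: variable x iteration x imputation
dim(imp5$chainMean)

# Continue the same chains for 10 more iterations (5 + 10 = 15 in total)
imp15 <- mice.mids(imp5, maxit = 10, printFlag = FALSE)
imp15$iteration

# Trace lines now cover all 15 iterations
plot(imp15)
```

Continuing an existing run this way avoids starting the chains from scratch when you only want to see whether they stay free of trends.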
The algorithm uses random sampling, and therefore, the results will be (perhaps slightly) different if we repeat the imputations with different seeds. In order to get exactly the same result, use the seed argument
imp <- mice(nhanes, seed=123, print=F)
where 123 is some arbitrary number that you can choose yourself. Rerunning this command will always yield the same imputed values.
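You can verify this behaviour by comparing the imputed values from repeated runs. A small sketch, with object names and the second seed (456) chosen only for illustration:

```r
library(mice)

imp_a <- mice(nhanes, seed = 123, printFlag = FALSE)
imp_b <- mice(nhanes, seed = 123, printFlag = FALSE)
imp_c <- mice(nhanes, seed = 456, printFlag = FALSE)

# Same seed: the lists of imputed values are identical
identical(imp_a$imp, imp_b$imp)

# Different seed: the imputed values will generally differ
identical(imp_a$imp, imp_c$imp)
```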
4. Change the imputation method
For each column, the algorithm requires a specification of the imputation method. To see which method was used by default:
imp$meth
## age bmi hyp chl
## "" "pmm" "pmm" "pmm"
The variable age is complete and therefore not imputed, denoted by the "" empty string. The other variables have method pmm, which stands for predictive mean matching, the default in mice for numerical and integer data. In reality, the data are better described as a mix of numerical and categorical data. Let us take a look at the nhanes2 data frame
summary(nhanes2)
## age bmi hyp chl
## 20-39:12 Min. :20.40 no :13 Min. :113.0
## 40-59: 7 1st Qu.:22.65 yes : 4 1st Qu.:185.0
## 60-99: 6 Median :26.75 NA's: 8 Median :187.0
## Mean :26.56 Mean :191.4
## 3rd Qu.:28.93 3rd Qu.:212.0
## Max. :35.30 Max. :284.0
## NA's :9 NA's :10
and the structure of the data frame
str(nhanes2)
## 'data.frame': 25 obs. of 4 variables:
## $ age: Factor w/ 3 levels "20-39","40-59",..: 1 2 1 3 1 3 1 1 2 2 ...
## $ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
## $ hyp: Factor w/ 2 levels "no","yes": NA 1 1 NA 1 NA 1 1 1 NA ...
## $ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
Variable age consists of 3 age categories, while variable hyp is binary. The mice() function takes these properties automatically into account. Impute the nhanes2 dataset
imp <- mice(nhanes2, print=F)
imp$meth
## age bmi hyp chl
## "" "pmm" "logreg" "pmm"
Notice that mice has set the imputation method for variable hyp to logreg, which implements multiple imputation by logistic regression.
An up-to-date overview of the methods in mice can be found by
methods(mice)
## [1] mice.impute.2l.bin mice.impute.2l.lmer
## [3] mice.impute.2l.norm mice.impute.2l.pan
## [5] mice.impute.2lonly.mean mice.impute.2lonly.norm
## [7] mice.impute.2lonly.pmm mice.impute.cart
## [9] mice.impute.jomoImpute mice.impute.lda
## [11] mice.impute.logreg mice.impute.logreg.boot
## [13] mice.impute.mean mice.impute.midastouch
## [15] mice.impute.norm mice.impute.norm.boot
## [17] mice.impute.norm.nob mice.impute.norm.predict
## [19] mice.impute.panImpute mice.impute.passive
## [21] mice.impute.pmm mice.impute.polr
## [23] mice.impute.polyreg mice.impute.quadratic
## [25] mice.impute.rf mice.impute.ri
## [27] mice.impute.sample mice.mids
## [29] mice.theme
## see '?methods' for accessing help and source code
Let us change the imputation method for bmi to Bayesian normal linear regression imputation
ini <- mice(nhanes2, maxit = 0)
meth <- ini$meth
meth
## age bmi hyp chl
## "" "pmm" "logreg" "pmm"
meth["bmi"] <- "norm"
meth
## age bmi hyp chl
## "" "norm" "logreg" "pmm"
and run the imputations again.
imp <- mice(nhanes2, meth = meth, print=F)
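As an alternative to a dry run with maxit = 0, the default method vector can also be built with make.method(), which is available in recent versions of mice (3.0 and later, to our knowledge). The sketch below uses our own object names and an arbitrary seed, and checks that the requested method is the one stored in the result.

```r
library(mice)

# Default methods per column, chosen from the variable types in nhanes2
meth_default <- make.method(nhanes2)
meth_default

# Switch bmi to Bayesian normal linear regression, leave the rest unchanged
meth_norm <- meth_default
meth_norm["bmi"] <- "norm"

imp_norm <- mice(nhanes2, method = meth_norm, seed = 123, printFlag = FALSE)

# The methods actually used are stored in the returned object
imp_norm$method
```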
We may now again plot trace lines to study convergence
plot(imp)