mice: Algorithmic convergence and inference pooling
This is the second practical.
The aim of this practical is to deepen your understanding of multiple imputation in general. You will learn how to pool the results of analyses performed on multiply imputed data, how to approach different types of data, and how to avoid the pitfalls researchers may fall into. The main objective is to increase your knowledge and understanding of applications of multiple imputation.
Again, we start by loading the necessary packages (with library()) and fixing the random seed so that our outcomes are replicable.
library(mice)
set.seed(123)
We also load the following packages
library(ggmice)
library(ggplot2)
library(dplyr)
library(magrittr)
1. Vary the number of imputations.
The number of imputed data sets can be specified by the m = ... argument. For example, to create just three imputed data sets, specify
imp <- mice(nhanes,
m = 3,
print = FALSE)
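To verify the result, note that the mids object stores the number of imputations, and that complete() extracts any single completed data set (a quick sanity check, not part of the original code):
imp$m                  # number of imputations; here 3
head(complete(imp, 3)) # first rows of the third completed data set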
2. Change the predictor matrix
The predictor matrix is a square matrix that specifies the variables that are used to impute each incomplete variable. Let us have a look at the predictor matrix that was used
imp$pred
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
Each variable in the data has a row and a column in the predictor matrix. A value of 1 indicates that the column variable was used to impute the row variable. For example, the 1 at entry [bmi, age] indicates that variable age was used to impute the incomplete variable bmi. Note that the diagonal is zero, because a variable is not allowed to impute itself. The row for age is redundant, because there are no missing values in age. Even though predictor relations are specified for age, mice will not use these relations, because it never overwrites observed values with imputations. mice gives you complete control over the predictor matrix, enabling you to choose your own predictor relations. This can be very useful, for example, when you have many variables or when you have clear ideas or prior knowledge about relations in the data at hand.
There are two ways in which you can create a predictor matrix in mice:
1. Grab the predictorMatrix from any object returned by mice(). For example, we can use mice() to give us the initial predictor matrix, and change it afterwards, without running the algorithm. This can be done by typing
ini <- mice(nhanes, maxit = 0, print = FALSE)
pred <- ini$pred
pred
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
The object pred contains the predictor matrix from an initial run of mice with zero iterations, as specified by maxit = 0.
2. Use make.predictorMatrix() to generate a predictor matrix from any incomplete data set. For example,
pred <- make.predictorMatrix(nhanes)
pred
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
Altering the predictor matrix and returning it to the mice algorithm is very simple. For example, the following code removes the variable hyp from the set of predictors, but still leaves it to be predicted by the other variables.
pred[, "hyp"] <- 0
pred
## age bmi hyp chl
## age 0 1 0 1
## bmi 1 0 0 1
## hyp 1 1 0 1
## chl 1 1 0 0
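Remember that rows are targets and columns are predictors, so you can also edit a single row. As an illustrative sketch (using a hypothetical copy pred2, so the pred object above stays untouched), imputing bmi from age only would look like:
pred2 <- pred            # work on a copy; pred2 is our own scratch object
pred2["bmi", ] <- 0      # clear all predictor relations for bmi
pred2["bmi", "age"] <- 1 # impute bmi from age only
pred2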
Use your new predictor matrix in mice() as follows:
imp <- mice(nhanes,
predictorMatrix = pred,
print = FALSE)
There is a quickpred() function that applies a quick predictor-selection procedure, which can be handy for data sets containing many variables. See ?quickpred for more info. Selecting predictors according to relations in the data with a minimum correlation of \(\rho = .30\) can be done by
ini <- mice(nhanes,
pred = quickpred(nhanes, mincor = .3),
print = FALSE)
ini$pred
## age bmi hyp chl
## age 0 0 0 0
## bmi 1 0 0 1
## hyp 1 0 0 1
## chl 1 1 1 0
For large predictor matrices, it can be useful to export them to dedicated spreadsheet software such as Microsoft Excel for easier configuration (see e.g. the xlsx package for easy exporting and importing of Excel files). Importing data is straightforward in RStudio through File > Import Dataset.
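If you prefer to stay within base R, a minimal round-trip sketch with csv files works just as well (the file name pred.csv is only an example):
write.csv(pred, "pred.csv")                            # edit the matrix outside R
pred <- as.matrix(read.csv("pred.csv", row.names = 1)) # read the edited version back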
3. Inspect the convergence of the algorithm
The mice() function implements an iterative Markov chain Monte Carlo type of algorithm. Let us have a look at the trace lines generated by the algorithm to study convergence:
imp <- mice(nhanes, print = FALSE)
plot_trace(imp)
The plot shows the mean (left) and standard deviation (right) of the imputed values only. In general, we would like the streams to intermingle and be free of any trends at the later iterations. We inspect trends for the imputed values alone, because the observed data does not change.
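The quantities behind these trace lines are stored in the mids object itself, in the chainMean and chainVar components; a quick numerical inspection (our own sketch) is:
dim(imp$chainMean)       # nvar x maxit x m array of chain means
imp$chainMean["bmi", , ] # chain means for bmi, one column per imputation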
The mice algorithm uses random sampling, and therefore the results will be (perhaps slightly) different if we repeat the imputations with different seeds. In order to get exactly the same result, use the seed argument:
imp <- mice(nhanes,
seed = 123,
print = FALSE)
where 123 is some arbitrary number that you can choose yourself. Rerunning this command will always yield the same imputed values.
4. Change the imputation method
For each column, the algorithm requires a specification of the imputation method. To see which method was used by default:
imp$meth
## age bmi hyp chl
## "" "pmm" "pmm" "pmm"
The variable age is complete and therefore not imputed, denoted by the empty string "". The other variables have method pmm, which stands for predictive mean matching, the default in mice for numerical and integer data.
In reality, the nhanes data are better described as a mix of numerical and categorical data. Let us take a look at the nhanes2 data frame:
summary(nhanes2)
## age bmi hyp chl
## 20-39:12 Min. :20.40 no :13 Min. :113.0
## 40-59: 7 1st Qu.:22.65 yes : 4 1st Qu.:185.0
## 60-99: 6 Median :26.75 NA's: 8 Median :187.0
## Mean :26.56 Mean :191.4
## 3rd Qu.:28.93 3rd Qu.:212.0
## Max. :35.30 Max. :284.0
## NA's :9 NA's :10
and the structure of the data frame
str(nhanes2)
## 'data.frame': 25 obs. of 4 variables:
## $ age: Factor w/ 3 levels "20-39","40-59",..: 1 2 1 3 1 3 1 1 2 2 ...
## $ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
## $ hyp: Factor w/ 2 levels "no","yes": NA 1 1 NA 1 NA 1 1 1 NA ...
## $ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
Variable age consists of 3 age categories, while variable hyp is binary. The mice() function automatically takes these properties into account. Impute the nhanes2 dataset:
imp <- mice(nhanes2,
print = FALSE)
imp$meth
## age bmi hyp chl
## "" "pmm" "logreg" "pmm"
Notice that mice has set the imputation method for variable hyp to logreg, which implements multiple imputation by logistic regression.
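The imputed values themselves can be inspected via the imp component of the mids object; for hyp these are draws of the factor levels no and yes:
imp$imp$hyp # imputed values for hyp, one column per imputation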
An up-to-date overview of the methods in mice can be found by
methods(mice)
## [1] mice.impute.2l.bin mice.impute.2l.lmer
## [3] mice.impute.2l.norm mice.impute.2l.pan
## [5] mice.impute.2lonly.mean mice.impute.2lonly.norm
## [7] mice.impute.2lonly.pmm mice.impute.cart
## [9] mice.impute.jomoImpute mice.impute.lasso.logreg
## [11] mice.impute.lasso.norm mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm mice.impute.lda
## [15] mice.impute.logreg mice.impute.logreg.boot
## [17] mice.impute.mean mice.impute.midastouch
## [19] mice.impute.mnar.logreg mice.impute.mnar.norm
## [21] mice.impute.norm mice.impute.norm.boot
## [23] mice.impute.norm.nob mice.impute.norm.predict
## [25] mice.impute.panImpute mice.impute.passive
## [27] mice.impute.pmm mice.impute.polr
## [29] mice.impute.polyreg mice.impute.quadratic
## [31] mice.impute.rf mice.impute.ri
## [33] mice.impute.sample mice.mids
## [35] mice.theme
## see '?methods' for accessing help and source code
Let us change the imputation method for bmi to Bayesian normal linear regression imputation:
ini <- mice(nhanes2,
maxit = 0)
meth <- ini$meth
meth
## age bmi hyp chl
## "" "pmm" "logreg" "pmm"
meth["bmi"] <- "norm"
meth
## age bmi hyp chl
## "" "norm" "logreg" "pmm"
and run the imputations again.
imp <- mice(nhanes2,
meth = meth,
print = FALSE)
We may now again plot trace lines to study convergence
plot_trace(imp)
5. Extend the number of iterations
Though using just five iterations (the default) often works well in practice, we need to extend the number of iterations of the mice algorithm to confirm that there is no trend and that the trace lines intermingle well. We can increase the number of iterations to 40 by running 35 additional iterations with the mice.mids() function.
imp40 <- mice.mids(imp,
maxit = 35, # add 35 iterations
print = FALSE)
plot_trace(imp40)
6. Further diagnostic checking. Use function stripplot().
Generally, one would prefer the imputed data to be plausible values, i.e. values that could have been observed had they not been missing. In order to form an idea about plausibility, one may check the imputations and compare them against the observed values. If we are willing to assume that the data are missing completely at random (MCAR), then the imputations should have the same distribution as the observed data. In general, distributions may be different, because the missing data are MAR (or even MNAR). However, very large discrepancies should be investigated. Let us plot the observed and imputed data of chl by
ggmice(imp, aes(x = .imp, y = chl)) +
geom_jitter(width = .1)
The convention is to plot observed data in blue and imputed data in red. The figure graphs the data values of chl before and after imputation. Since the PMM method draws imputations from the observed data, imputed values have the same gaps as the observed data, and are always within the range of the observed data. The figure indicates that the distributions of the imputed and the observed values are similar. The observed data have a particular feature: for some reason, the data cluster around the value of 187. The imputations reflect this feature, and are close to the data. Under MCAR, univariate distributions of the observed and imputed data are expected to be identical. Under MAR, they can be different, both in location and spread, but their multivariate distribution is assumed to be identical. There are many other ways to look at the imputed data.
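For instance, mice itself ships lattice-based plotting methods for mids objects, including the stripplot() function mentioned in the heading; an equivalent of the plot above is:
stripplot(imp, chl ~ .imp, pch = 20, cex = 2) # observed (blue) vs. imputed (red) chl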
The same plot can be made for the variable bmi.
ggmice(imp, aes(x = .imp, y = bmi)) +
geom_jitter(width = .1)
Remember that bmi was imputed by Bayesian linear regression, so (the range of) the imputed values may differ from the observed values.
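A quick numerical check makes this visible (a small sketch using the imp$imp component):
range(nhanes2$bmi, na.rm = TRUE) # range of the observed bmi values
range(unlist(imp$imp$bmi))       # imputed bmi values may fall outside it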
7. Perform the following regression analysis on the multiply imputed data and assign the result to object fit.
\[ \text{bmi} = \beta_0 + \beta_1 \text{chl} + \epsilon \]
fit <- with(imp, lm(bmi ~ chl))
fit
## call :
## with.mids(data = imp, expr = lm(bmi ~ chl))
##
## call1 :
## mice(data = nhanes2, method = meth, printFlag = FALSE)
##
## nmis :
## age bmi hyp chl
## 0 9 8 10
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 23.32599 0.02282
##
##
## [[2]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 21.92498 0.02085
##
##
## [[3]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 20.2116 0.0344
##
##
## [[4]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 20.15665 0.03852
##
##
## [[5]]
##
## Call:
## lm(formula = bmi ~ chl)
##
## Coefficients:
## (Intercept) chl
## 24.80765 0.01157
The fit object contains the regression summaries for each data set. The new object fit is actually of class mira (multiply imputed repeated analyses).
class(fit)
## [1] "mira" "matrix"
Use the ls() function to find out what is in the object.
ls(fit)
## [1] "analyses" "call" "call1" "nmis"
Suppose we want to find the regression model fitted to the second imputed data set. It can be retrieved from the list by
summary(fit$analyses[[2]])
##
## Call:
## lm(formula = bmi ~ chl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6386 -3.1231 -0.2381 2.8443 8.8307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.92498 3.95950 5.537 1.24e-05 ***
## chl 0.02085 0.01970 1.058 0.301
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.264 on 23 degrees of freedom
## Multiple R-squared: 0.04642, Adjusted R-squared: 0.004965
## F-statistic: 1.12 on 1 and 23 DF, p-value: 0.301
8. Pool the analyses from object fit.
Pooling the repeated regression analyses can be done simply by typing
pool.fit <- pool(fit)
summary(pool.fit)
## term estimate std.error statistic df p.value
## 1 (Intercept) 22.08537216 4.58683836 4.814945 13.40323 0.0003100903
## 2 chl 0.02563088 0.02338968 1.095820 12.46079 0.2938853567
pool.fit
## Class: mipo m = 5
## term m estimate ubar b t dfcom
## 1 (Intercept) 5 22.08537216 16.177078691 4.0516728978 2.103909e+01 23
## 2 chl 5 0.02563088 0.000405596 0.0001179011 5.470773e-04 23
## df riv lambda fmi
## 1 13.40323 0.3005492 0.231094 0.3248446
## 2 12.46079 0.3488233 0.258613 0.3545184
or, equivalently, with a pipe
fit %>%
pool %>%
summary
## term estimate std.error statistic df p.value
## 1 (Intercept) 22.08537216 4.58683836 4.814945 13.40323 0.0003100903
## 2 chl 0.02563088 0.02338968 1.095820 12.46079 0.2938853567
which gives the relevant pooled regression coefficients and parameters, as well as the fraction of information about the coefficients missing due to nonresponse (fmi) and the proportion of the variation attributable to the missing data (lambda). The pooled fit object is of class mipo, which stands for multiply imputed pooled object.
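To see where these numbers come from, Rubin's rules can be reproduced by hand from the quantities printed above; a sketch for the chl coefficient (plugging in the printed ubar and b):
m      <- 5                       # number of imputations
ubar   <- 0.000405596             # average within-imputation variance (printed above)
b      <- 0.0001179011            # between-imputation variance (printed above)
tvar   <- ubar + (1 + 1/m) * b    # total variance; matches the 't' column
riv    <- (1 + 1/m) * b / ubar    # relative increase in variance
lambda <- (1 + 1/m) * b / tvar    # proportion of variance due to missingness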
mice can pool many analyses from a variety of packages for you. For flexibility, and in order to run custom pooling functions, mice also incorporates the function pool.scalar(), which pools univariate estimates of \(m\) repeated complete-data analyses according to Rubin's pooling rules (Rubin, 1987, paragraph 3.1).
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
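As an illustration, a minimal sketch that pools the chl coefficient manually with pool.scalar() (the sapply() extraction is our own scaffolding, not part of the practical):
Q <- sapply(fit$analyses, function(x) coef(x)["chl"])        # m point estimates
U <- sapply(fit$analyses, function(x) vcov(x)["chl", "chl"]) # m sampling variances
pooled <- pool.scalar(Q, U, n = nrow(nhanes2))
pooled$qbar # pooled estimate
pooled$t    # total variance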
- End of practical
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Dutch_Netherlands.utf8 LC_CTYPE=Dutch_Netherlands.utf8
## [3] LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C
## [5] LC_TIME=Dutch_Netherlands.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] magrittr_2.0.3 dplyr_1.0.9 ggplot2_3.3.6 ggmice_0.0.1.9000
## [5] mice_3.14.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.8.3 highr_0.9 bslib_0.3.1 compiler_4.2.1
## [5] pillar_1.7.0 jquerylib_0.1.4 tools_4.2.1 digest_0.6.29
## [9] gtable_0.3.0 lattice_0.20-45 jsonlite_1.8.0 evaluate_0.15
## [13] lifecycle_1.0.1 tibble_3.1.7 pkgconfig_2.0.3 rlang_1.0.3
## [17] cli_3.3.0 DBI_1.1.3 rstudioapi_0.13 yaml_2.3.5
## [21] xfun_0.31 fastmap_1.1.0 withr_2.5.0 stringr_1.4.0
## [25] knitr_1.39 generics_0.1.3 vctrs_0.4.1 sass_0.4.1
## [29] grid_4.2.1 tidyselect_1.1.2 glue_1.6.2 R6_2.5.1
## [33] fansi_1.0.3 rmarkdown_2.14 farver_2.1.1 tidyr_1.2.0
## [37] purrr_0.3.4 scales_1.2.0 backports_1.4.1 htmltools_0.5.2
## [41] ellipsis_0.3.2 assertthat_0.2.1 colorspace_2.0-3 labeling_0.4.2
## [45] utf8_1.2.2 stringi_1.7.6 munsell_0.5.0 broom_1.0.0
## [49] crayon_1.5.1