`mice`: Algorithmic convergence and inference pooling

This is the second vignette in a series of six.

The aim of this vignette is to enhance your general understanding of multiple imputation. You will learn how to pool the results of analyses performed on multiply imputed data, how to approach different types of data, and how to avoid the pitfalls researchers may fall into. The main objective is to increase your knowledge and understanding of applications of multiple imputation.

No previous experience with `R` is required. Again, we start by loading (with `require()`) the necessary packages and fixing the random seed so that our outcomes are replicable.

```
require(mice)
require(lattice)
set.seed(123)
```

**1. Vary the number of imputations.**

The number of imputed data sets can be specified by the `m = ...` argument. For example, to create just three imputed data sets, specify

`imp <- mice(nhanes, m = 3, print=F)`
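To check that the `m` argument did what was asked, the resulting object can be inspected. The following is a minimal sketch; the object name `imp3` is only illustrative, and the arguments are written out in full (`printFlag` instead of the abbreviated `print`).

```r
library(mice)

# Create three imputed data sets; the seed makes the run repeatable
imp3 <- mice(nhanes, m = 3, printFlag = FALSE, seed = 123)

imp3$m               # number of imputations stored in the object
dim(imp3$imp$bmi)    # one row per missing bmi value, one column per imputation
head(complete(imp3, 1))  # first completed data set
```

Each column of `imp3$imp$bmi` holds the imputed values for one of the three data sets, and `complete(imp3, 1)` fills them into the original data.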

**2. Change the predictor matrix**

The predictor matrix is a square matrix that specifies the variables that are used to impute each incomplete variable. Let us have a look at the predictor matrix that was used

`imp$pred`

```
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0
```

Each variable in the data has a row and a column in the predictor matrix. A value `1` indicates that the column variable was used to impute the row variable. For example, the `1` at entry `[bmi, age]` indicates that variable `age` was used to impute the incomplete variable `bmi`. Note that the diagonal is zero because a variable is not allowed to impute itself. The row of `age` contains all zeros because there were no missing values in `age`.

`mice` gives you complete control over the predictor matrix, enabling you to choose your own predictor relations. This can be very useful, for example, when you have many variables or when you have clear ideas or prior knowledge about relations in the data at hand. You can use `mice()` to give you the initial predictor matrix, and change it afterwards, without running the algorithm. This can be done by typing

```
ini <- mice(nhanes, maxit=0, print=F)
pred <- ini$pred
pred
```

```
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0
```

The object `pred` contains the predictor matrix from an initial run of `mice` with zero iterations, specified by `maxit = 0`. Altering the predictor matrix and returning it to the `mice` algorithm is very simple. For example, the following code removes the variable `hyp` from the set of predictors, but still leaves it to be predicted by the other variables.

```
pred[ ,"hyp"] <- 0
pred
```

```
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   0   1
## hyp   1   1   0   1
## chl   1   1   0   0
```

Use your new predictor matrix in `mice()` as follows

`imp <- mice(nhanes, pred=pred, print=F)`
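The steps for editing the predictor matrix can be put together and checked in one short script. This is a sketch rather than a prescribed workflow: it uses the full element and argument names (`predictorMatrix`, `printFlag`) where the text relies on abbreviations, and the seed value is arbitrary.

```r
library(mice)

# Zero-iteration run to obtain the default predictor matrix
ini  <- mice(nhanes, maxit = 0, printFlag = FALSE)
pred <- ini$predictorMatrix

# Remove hyp as a predictor (zero its column); its row is left untouched
pred[, "hyp"] <- 0

# Variables still used to impute hyp (read along the hyp row)
colnames(pred)[pred["hyp", ] == 1]

# Impute with the edited matrix and confirm hyp predicts nothing
imp <- mice(nhanes, predictorMatrix = pred, printFlag = FALSE, seed = 123)
all(imp$predictorMatrix[, "hyp"] == 0)
```

Reading along a row shows what is used to impute that variable; reading down a column shows where that variable serves as a predictor.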

There is a special function called `quickpred()` for a quick selection procedure of predictors, which can be handy for datasets containing many variables. See `?quickpred` for more info. Selecting predictors according to data relations with a minimum correlation of \(\rho=.30\) can be done by

```
ini <- mice(nhanes, pred=quickpred(nhanes, mincor=.3), print=F)
ini$pred
```

```
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   0   1
## hyp   1   0   0   1
## chl   1   1   1   0
```
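`quickpred()` can also be called on its own, which makes it easy to inspect or adjust the selected predictors before imputing. The sketch below additionally uses the `exclude` argument of `quickpred()` to drop a variable from all predictor sets; the object names `qp` and `qp_ex` are illustrative.

```r
library(mice)

# Predictor matrix based on a minimum absolute correlation of 0.3
qp <- quickpred(nhanes, mincor = 0.3)
qp
rowSums(qp)   # number of predictors selected for each variable

# Same selection, but never use hyp as a predictor
qp_ex <- quickpred(nhanes, mincor = 0.3, exclude = "hyp")
qp_ex[, "hyp"]
```

The resulting matrix can be passed to `mice()` through the predictor matrix argument, exactly like a hand-edited one.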

For large predictor matrices, it can be useful to export them to Microsoft Excel for easier configuration (e.g. see the `xlsx` package for easy exporting and importing of Excel files).
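As an alternative that needs no extra package, the round trip can be sketched with a CSV file, which Excel opens directly. This uses base R's `write.csv()` and `read.csv()` rather than the `xlsx` package mentioned above, and the temporary file path is only a placeholder.

```r
library(mice)

# Default predictor matrix from a zero-iteration run
pred <- mice(nhanes, maxit = 0, printFlag = FALSE)$predictorMatrix

# Export to CSV (row names are written as the first column)
path <- tempfile(fileext = ".csv")
write.csv(pred, path)

# ... edit the file in a spreadsheet program, save, then read it back ...
edited <- as.matrix(read.csv(path, row.names = 1, check.names = FALSE))

# With no edits made, the matrix read back equals the original
all(edited == pred)
```

The matrix read back in keeps its row and column names, so it can be handed to `mice()` as the predictor matrix.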

**3. Inspect the convergence of the algorithm**

The `mice()` function implements an iterative Markov Chain Monte Carlo type of algorithm. Let us have a look at the trace lines generated by the algorithm to study convergence:

```
imp <- mice(nhanes, print=F)
plot(imp)
```

The plot shows the mean (left) and standard deviation (right) of the imputed values only. In general, we would like the streams to intermingle and be free of any trends at the later iterations.
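When the default number of iterations is too small to judge whether trends are present, the chains can be continued. A minimal sketch, assuming the default of 5 iterations in the first call; it uses `mice.mids()`, which continues the iterations of an existing imputation object, and the name `imp40` is illustrative.

```r
library(mice)

imp <- mice(nhanes, printFlag = FALSE, seed = 123)

# Chain means of the imputed bmi values: iterations in rows, imputations in columns
imp$chainMean["bmi", , ]

# Continue the same chains for 35 more iterations (5 + 35 = 40 in total)
imp40 <- mice.mids(imp, maxit = 35, printFlag = FALSE)
imp40$iteration

# Trace lines over all 40 iterations
plot(imp40)
```

Longer chains make it easier to see whether the streams mix well or drift in a common direction.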

The algorithm uses random sampling, and therefore, the results will be (perhaps slightly) different if we repeat the imputations with different seeds. In order to get exactly the same result, use the `seed` argument

`imp <- mice(nhanes, seed=123, print=F)`

where `123` is some arbitrary number that you can choose yourself. Rerunning this command will always yield the same imputed values.
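The effect of the `seed` argument can be verified directly by running the same call twice and comparing the imputed values. A short sketch, with illustrative object names:

```r
library(mice)

# Two runs with the same seed
imp_a <- mice(nhanes, seed = 123, printFlag = FALSE)
imp_b <- mice(nhanes, seed = 123, printFlag = FALSE)

# The stored imputations are identical
identical(imp_a$imp, imp_b$imp)
```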

**4. Change the imputation method**

For each column, the algorithm requires a specification of the imputation method. To see which method was used by default:

`imp$meth`

```
##   age   bmi   hyp   chl 
##    "" "pmm" "pmm" "pmm"
```

The variable `age` is complete and therefore not imputed, denoted by the `""` empty string. The other variables have method `pmm`, which stands for *predictive mean matching*, the default in `mice` for numerical and integer data. In reality, the data are better described as a mix of numerical and categorical data. Let us take a look at the `nhanes2` data frame

`summary(nhanes2)`

```
##     age          bmi          hyp          chl       
##  20-39:12   Min.   :20.40   no  :13   Min.   :113.0  
##  40-59: 7   1st Qu.:22.65   yes : 4   1st Qu.:185.0  
##  60-99: 6   Median :26.75   NA's: 8   Median :187.0  
##             Mean   :26.56             Mean   :191.4  
##             3rd Qu.:28.93             3rd Qu.:212.0  
##             Max.   :35.30             Max.   :284.0  
##             NA's   :9                 NA's   :10
```

and the structure of the data frame

`str(nhanes2)`

```
## 'data.frame': 25 obs. of 4 variables:
## $ age: Factor w/ 3 levels "20-39","40-59",..: 1 2 1 3 1 3 1 1 2 2 ...
## $ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
## $ hyp: Factor w/ 2 levels "no","yes": NA 1 1 NA 1 NA 1 1 1 NA ...
## $ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
```

Variable `age` consists of 3 age categories, while variable `hyp` is binary. The `mice()` function automatically takes these properties into account. Impute the `nhanes2` dataset

```
imp <- mice(nhanes2, print=F)
imp$meth
```

```
##      age      bmi      hyp      chl 
##       ""    "pmm" "logreg"    "pmm"
```

Notice that `mice` has set the imputation method for variable `hyp` to `logreg`, which implements multiple imputation by *logistic regression*.

An up-to-date overview of the methods in mice can be found by

`methods(mice)`

```
## [1] mice.impute.2l.norm mice.impute.2l.pan
## [3] mice.impute.2lonly.mean mice.impute.2lonly.norm
## [5] mice.impute.2lonly.pmm mice.impute.cart
## [7] mice.impute.fastpmm mice.impute.lda
## [9] mice.impute.logreg mice.impute.logreg.boot
## [11] mice.impute.mean mice.impute.midastouch
## [13] mice.impute.norm mice.impute.norm.boot
## [15] mice.impute.norm.nob mice.impute.norm.predict
## [17] mice.impute.passive mice.impute.pmm
## [19] mice.impute.polr mice.impute.polyreg
## [21] mice.impute.quadratic mice.impute.rf
## [23] mice.impute.ri mice.impute.sample
## [25] mice.mids mice.theme
## see '?methods' for accessing help and source code
```

Let us change the imputation method for `bmi` to Bayesian normal linear regression imputation

```
ini <- mice(nhanes2, maxit = 0)
meth <- ini$meth
meth
```

```
##      age      bmi      hyp      chl 
##       ""    "pmm" "logreg"    "pmm"
```

```
meth["bmi"] <- "norm"
meth
```

```
##      age      bmi      hyp      chl 
##       ""   "norm" "logreg"    "pmm"
```

and run the imputations again.

`imp <- mice(nhanes2, meth = meth, print=F)`
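One practical consequence of switching methods can be seen by comparing the imputed values themselves. Predictive mean matching fills in values borrowed from observed cases, whereas `norm` draws from a fitted normal model and so may produce values that never occur in the data. The sketch below illustrates this; the object names and the seed are arbitrary choices.

```r
library(mice)

obs_bmi <- na.omit(nhanes2$bmi)

# Default methods: bmi is imputed with pmm
imp_pmm <- mice(nhanes2, printFlag = FALSE, seed = 123)
all(unlist(imp_pmm$imp$bmi) %in% obs_bmi)   # pmm reuses observed bmi values

# Switch bmi to Bayesian normal linear regression
meth <- imp_pmm$method
meth["bmi"] <- "norm"
imp_norm <- mice(nhanes2, method = meth, printFlag = FALSE, seed = 123)

# Compare the range of imputed values with the observed range
range(unlist(imp_norm$imp$bmi))
range(obs_bmi)
```

Whether values outside the observed range are acceptable depends on the variable, which is one reason to inspect imputations after changing the method.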

We may now again plot trace lines to study convergence

`plot(imp)`