mice
mice team
Please use a recent mice version for this practical, such as the latest
CRAN release or the latest release on GitHub.
This is the first practical in a series of seven. It will give you an
introduction to the R package mice, an open-source tool for flexible
imputation of incomplete data. Over the last decade, mice has become an
important piece of imputation software, offering a very flexible
environment for dealing with incomplete data. Moreover, the ability to
integrate mice with other packages in R, and vice versa, offers many
options for applied researchers.
The aim of this introduction is to enhance your understanding of multiple imputation in general. You will learn how to multiply impute simple datasets and how to obtain the imputed data for further analysis. The main objective is to increase your knowledge and understanding of applications of multiple imputation.
All the best,
Gerko and the mice team
mice uses R’s random number generator to draw values with a
probabilistic nature. Therefore, each time we use mice we will get
slightly different results. To avoid this, we can fix the seed value of
the random number generator.
set.seed(123)
With this seed you’ll get the exact same results if you follow the steps in this document. If you obtain different results, you have changed the order of the steps either by adding or re-running a step.
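As a small illustration of why fixing the seed helps, the base R sketch below draws the same random numbers twice after resetting the seed. It is not part of the original practical, and the object names a and b are our own. The last line resets the seed once more, so that running this sketch does not shift the random number stream for the steps that follow.

```r
# Draw three random normal values after fixing the seed
set.seed(123)
a <- rnorm(3)

# Resetting the seed to the same value restarts the generator,
# so the next draws repeat the first ones exactly
set.seed(123)
b <- rnorm(3)

# TRUE: both vectors hold exactly the same values
identical(a, b)

# Restore the generator state assumed by the rest of this document
set.seed(123)
```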
mice version 3.14 from CRAN
We can use the function install.packages() to directly grab the latest
stable release version from CRAN.
# To install the latest stable mice release:
install.packages("mice")
1. Open R and load the package mice
library(mice)
The version number of your mice installation can be found by running
version()
## [1] "mice 3.14.0 2021-11-23 C:/Users/5868777/Documents/R/win-library/3.0"
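If you would rather check the version programmatically, the sketch below may help. It assumes nothing beyond base R: packageVersion() comes from the utils package, and the threshold min_version and the object recent_enough are names we introduce here for illustration.

```r
# Check whether an installed mice is at least version 3.14.0.
# requireNamespace() returns FALSE (instead of failing) when mice is absent.
min_version <- "3.14.0"  # threshold chosen to match this practical

if (requireNamespace("mice", quietly = TRUE)) {
  recent_enough <- utils::packageVersion("mice") >= min_version
} else {
  recent_enough <- FALSE
}

recent_enough
```

If the result is FALSE, installing the latest release from CRAN as shown above should resolve it.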
Another package that you will need repeatedly during this course is
ggmice, which can aid the process of imputation by visualization. You
can install ggmice from GitHub as well, and subsequently load the
package with the function library(). The package ggmice relies on the
renowned R package ggplot2, which is commonly used for data
visualization (you can install ggplot2 simply from CRAN).
For this practical it is necessary to download the development version
of ggmice from our GitHub repository:
# If devtools is not installed:
install.packages("devtools")
devtools::install_github("amices/ggmice")
We can download the latest ggplot2 from CRAN:
install.packages("ggplot2")
library(ggmice)
##
## Attaching package: 'ggmice'
## The following objects are masked from 'package:mice':
##
## bwplot, densityplot, stripplot, xyplot
library(ggplot2)
Other packages that are used in this document are:
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
2. Inspect the incomplete data
The mice package contains several datasets. Once the package is loaded,
these datasets can be used. Have a look at the nhanes dataset (Schafer,
1997, Table 6.14) by typing
nhanes
## age bmi hyp chl
## 1 1 NA NA NA
## 2 2 22.7 1 187
## 3 1 NA 1 187
## 4 3 NA NA NA
## 5 1 20.4 1 113
## 6 3 NA NA 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 NA NA NA
## 11 1 NA NA NA
## 12 2 NA NA NA
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 NA
## 16 1 NA NA NA
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 NA
## 21 1 NA NA NA
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 NA
## 25 2 27.4 1 186
The nhanes dataset is a small data set with non-monotone missing
values. It contains 25 observations on four variables: age group, body
mass index, hypertension and cholesterol (mg/dL).
To learn more about the data, use one of the two following help commands:
help(nhanes)
## starting httpd help server ... done
?nhanes
3. Get an overview of the data with the summary() command:
summary(nhanes)
## age bmi hyp chl
## Min. :1.00 Min. :20.40 Min. :1.000 Min. :113.0
## 1st Qu.:1.00 1st Qu.:22.65 1st Qu.:1.000 1st Qu.:185.0
## Median :2.00 Median :26.75 Median :1.000 Median :187.0
## Mean :1.76 Mean :26.56 Mean :1.235 Mean :191.4
## 3rd Qu.:2.00 3rd Qu.:28.93 3rd Qu.:1.000 3rd Qu.:212.0
## Max. :3.00 Max. :35.30 Max. :2.000 Max. :284.0
## NA's :9 NA's :8 NA's :10
Using summary() on data sets is often informative, because the
distributional information (continuous variables) or the frequency
distribution (factors) for every column in your data frame is printed
to the R console. However, if there are too many variables, a
step-by-step approach may be more useful.
4. Inspect the missing data pattern
Check the missingness pattern for the nhanes dataset:
plot_pattern(nhanes)
The missingness pattern shows that there are 27 missing values in
total: 10 for chl, 9 for bmi and 8 for hyp. Moreover, there are
thirteen completely observed rows, four rows with 1 missing value, one
row with 2 missing values and seven rows with 3 missing values. Looking
at the missing data pattern is always useful (but may be difficult for
datasets with many variables). It can give you an indication of how
much information is missing and how the missingness is distributed.
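The counts read from the pattern plot can also be checked numerically. The sketch below uses only base R functions on the nhanes data that ships with mice. The object names n_mis_col, n_mis_total and n_mis_row are our own.

```r
library(mice)  # provides the nhanes data

# Number of missing values per variable
n_mis_col <- colSums(is.na(nhanes))
n_mis_col

# Total number of missing cells in the data set
n_mis_total <- sum(is.na(nhanes))
n_mis_total

# Number of rows with 0, 1, 2 or 3 missing values
n_mis_row <- table(rowSums(is.na(nhanes)))
n_mis_row
```

This kind of tabulation remains readable for data sets with many variables, where the pattern plot can become crowded.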
5. Form a regression model on the nhanes data set where age is predicted from bmi.
We can use the with() family of functions for this. The following
function call
fit <- with(data = nhanes,
expr = lm(age ~ bmi))
evaluates with(data, expression), so it evaluates the linear model
lm(age ~ bmi) on data set nhanes. The resulting object fit contains the
same fitted model as the output from lm(age ~ bmi, data = nhanes). We
introduce the with() function now, because the with() convention
conforms to one of the analysis pipelines that we will use with mice()
later on.
If we ask for the summary of the fitted regression analysis, we obtain:
summary(fit)
##
## Call:
## lm(formula = age ~ bmi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2660 -0.5614 -0.1225 0.4660 1.2344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.76718 1.31945 2.855 0.0127 *
## bmi -0.07359 0.04910 -1.499 0.1561
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8015 on 14 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.1383, Adjusted R-squared: 0.07672
## F-statistic: 2.246 on 1 and 14 DF, p-value: 0.1561
There is no significant effect of bmi when we model the age variable
(p = 0.156).
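To see that with() and a direct lm() call lead to the same model, the two fits can be compared. This is a minimal sketch; the object names fit_with and fit_lm are ours and are chosen so that the object fit used in this document is not overwritten.

```r
library(mice)  # provides the nhanes data

# The same model, specified in two ways
fit_with <- with(data = nhanes, expr = lm(age ~ bmi))
fit_lm   <- lm(age ~ bmi, data = nhanes)

# The estimated coefficients agree; only the stored call differs
all.equal(coef(fit_with), coef(fit_lm))
```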
6. Impute the missing data in the nhanes dataset with mean imputation.
In mice we can use many imputation methods. The call below imputes all
missing values by the arithmetic mean (method = "mean") over the
observed values for every incomplete column in the nhanes data set and
returns a single (m = 1) imputed data set. The algorithm - i.e. the
procedure that generates the imputations - has been given a single
iteration (maxit = 1) to reach convergence. We’ll dive into the
specifics of algorithmic convergence with mice in the next practical.
imp <- mice(nhanes,
method = "mean",
m = 1,
maxit = 1)
##
## iter imp variable
## 1 1 bmi hyp chl
The imputations are now done. Running only a single imputation (m = 1)
is practically efficient, as substituting each missing datum multiple
times with the observed data mean would be redundant (the inference
would be equal, no matter which imputed data set we would analyze).
Likewise, more iterations (maxit > 1) would be computationally
inefficient as the observed data mean does not change based on our
imputations. We named the imputed object imp following the convention
used in mice, but if you wish you can name it anything you’d like.
7. Explore the imputed data with the complete() function. What do you
think the variable means are? What happened to the regression equation
after imputation?
We use the function complete(), which by default returns the first
completed data set. Since we only have a single imputation for every
missing datum, this makes sense and we do not have to change the default
behavior of complete().
complete(imp)
## age bmi hyp chl
## 1 1 26.5625 1.235294 191.4
## 2 2 22.7000 1.000000 187.0
## 3 1 26.5625 1.000000 187.0
## 4 3 26.5625 1.235294 191.4
## 5 1 20.4000 1.000000 113.0
## 6 3 26.5625 1.235294 184.0
## 7 1 22.5000 1.000000 118.0
## 8 1 30.1000 1.000000 187.0
## 9 2 22.0000 1.000000 238.0
## 10 2 26.5625 1.235294 191.4
## 11 1 26.5625 1.235294 191.4
## 12 2 26.5625 1.235294 191.4
## 13 3 21.7000 1.000000 206.0
## 14 2 28.7000 2.000000 204.0
## 15 1 29.6000 1.000000 191.4
## 16 1 26.5625 1.235294 191.4
## 17 3 27.2000 2.000000 284.0
## 18 2 26.3000 2.000000 199.0
## 19 1 35.3000 1.000000 218.0
## 20 3 25.5000 2.000000 191.4
## 21 1 26.5625 1.235294 191.4
## 22 1 33.2000 1.000000 229.0
## 23 1 27.5000 1.000000 131.0
## 24 3 24.9000 1.000000 191.4
## 25 2 27.4000 1.000000 186.0
We see the repetitive numbers 26.5625 for bmi, 1.235294 for hyp, and
191.4 for chl. These can be confirmed as the means of the respective
variables (columns):
colMeans(nhanes, na.rm = TRUE)
## age bmi hyp chl
## 1.760000 26.562500 1.235294 191.400000
We’ve seen during the inspection of the missing data pattern that
variable age has no missing values. Therefore nothing is imputed for
age because we would not want to alter the observed (and bona fide)
values.
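Mean imputation is simple enough to reproduce by hand, which can help to see what method = "mean" does. The sketch below is our own illustration: the names by_hand, imp_mean and same are hypothetical, and printFlag = FALSE only silences the iteration log. It also compares the standard deviation of bmi before and after imputation.

```r
library(mice)

# Mean imputation by hand: replace every NA by the observed column mean
by_hand <- nhanes
for (v in names(by_hand)) {
  miss <- is.na(by_hand[[v]])
  by_hand[[v]][miss] <- mean(by_hand[[v]], na.rm = TRUE)
}

# Mean imputation through mice
imp_mean <- mice(nhanes, method = "mean", m = 1, maxit = 1,
                 printFlag = FALSE)

# Compare all values of both completed data sets
same <- all.equal(unlist(by_hand, use.names = FALSE),
                  unlist(complete(imp_mean), use.names = FALSE))
same

# The mean of bmi is preserved, but its spread shrinks
sd(nhanes$bmi, na.rm = TRUE)
sd(by_hand$bmi)
```

The smaller standard deviation after imputation shows that adding values at the mean leaves the mean untouched but reduces the variability in the completed data.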
To inspect the regression model with the imputed data, run:
fit <- with(data = imp,
expr = lm(age ~ bmi))
summary(fit)
## # A tibble: 2 × 6
## term estimate std.error statistic p.value nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 (Intercept) 3.71 1.33 2.80 0.0103 25
## 2 bmi -0.0736 0.0497 -1.48 0.152 25
It is clear that our inference did not change, but then again this is
not surprising as variable bmi is more-or-less normally distributed and
we are just adding weight to the mean.
ggmice(nhanes, mapping = aes(x = bmi)) +
geom_density(trim = F) +
geom_rug() +
xlim(15, 40)
## Warning: Removed 9 rows containing non-finite values (stat_density).
8. Impute the missing data in the nhanes dataset with regression
imputation.
We can use the same function call as under exercise 6, with
method = "norm.predict", which yields predictions from the normal
linear regression model.
imp <- mice(nhanes,
method = "norm.predict",
m = 1,
maxit = 1)
##
## iter imp variable
## 1 1 bmi hyp chl
The imputations are now done. This code imputes the missing values in
the data set by the regression imputation method. The argument
method = "norm.predict" first fits a regression model for each
incomplete variable on its observed values, using the other variables
as predictors, and then imputes the missing values with the fitted
(predicted) values from the normal linear regression model.
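The idea behind regression imputation can be sketched by hand for a single variable. The code below is a deliberately simplified illustration and not what mice does internally: it imputes bmi from age only, whereas mice also uses hyp and chl as predictors. The names obs, fit_obs and reg_imputed are ours.

```r
library(mice)  # provides the nhanes data

# Rows where bmi is observed
obs <- !is.na(nhanes$bmi)

# Regression model fitted on the observed cases only
fit_obs <- lm(bmi ~ age, data = nhanes[obs, ])

# Fill in the missing bmi values with the model predictions
reg_imputed <- nhanes
reg_imputed$bmi[!obs] <- predict(fit_obs, newdata = nhanes[!obs, ])

# Every imputed value lies exactly on the fitted regression line,
# so rows with the same age receive the same imputed bmi
reg_imputed[!obs, c("age", "bmi")]
```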
9. Again, inspect the completed data and investigate the imputed data regression model.
The completed data:
complete(imp)
## age bmi hyp chl
## 1 1 29.34414 1.068231 178.4069
## 2 2 22.70000 1.000000 187.0000
## 3 1 29.34414 1.000000 187.0000
## 4 3 21.16204 1.468008 213.4246
## 5 1 20.40000 1.000000 113.0000
## 6 3 21.29839 1.462721 184.0000
## 7 1 22.50000 1.000000 118.0000
## 8 1 30.10000 1.000000 187.0000
## 9 2 22.00000 1.000000 238.0000
## 10 2 28.61036 1.376561 220.2994
## 11 1 25.45177 0.932814 150.1886
## 12 2 27.11824 1.328365 209.4622
## 13 3 21.70000 1.000000 206.0000
## 14 2 28.70000 2.000000 204.0000
## 15 1 29.60000 1.000000 180.6739
## 16 1 36.58091 1.301983 230.9672
## 17 3 27.20000 2.000000 284.0000
## 18 2 26.30000 2.000000 199.0000
## 19 1 35.30000 1.000000 218.0000
## 20 3 25.50000 2.000000 242.8373
## 21 1 33.14904 1.191132 206.0417
## 22 1 33.20000 1.000000 229.0000
## 23 1 27.50000 1.000000 131.0000
## 24 3 24.90000 1.000000 243.7188
## 25 2 27.40000 1.000000 186.0000
The repetitive numbering we saw under mean imputation is now gone,
because we impute the conditional mean - i.e. the expectation of bmi
given the other variables. We have now obtained a more natural looking
set of imputations: instead of filling in the same bmi for all ages, we
now take age (as well as hyp and chl) into account when imputing bmi.
To inspect the regression model with the imputed data, run:
fit <- with(imp, lm(age ~ bmi))
summary(fit)
## # A tibble: 2 × 6
## term estimate std.error statistic p.value nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 (Intercept) 4.47 0.893 5.01 0.0000458 25
## 2 bmi -0.100 0.0325 -3.08 0.00534 25
We have now omitted the specification of data and expr in the above
call, as data is always the first argument in with() and expr is always
the second argument.
It is clear that our inference has changed. In fact, we extrapolated
(part of) the regression model for the observed data to missing data in
bmi. In other words, the relation (read: information) gets stronger and
we’ve obtained more observations that conform exactly to the relation
in the observed data. From an inferential statistics viewpoint, this
approach would only be valid if we have definitive proof that the
unobserved values would exactly conform to the relation in the observed
data. If this assumption does not hold, norm.predict creates too little
variation in our data set and we cannot trust the resulting standard
errors and p-values.
10. Impute the missing data in the nhanes dataset with stochastic
regression imputation.
With stochastic regression imputation, an error term is added to the
predicted values, such that the imputations show variation around the
regression line. The errors are normally distributed with mean 0 and
variance equal to the residual variance.
imp <- mice(nhanes,
method = "norm.nob",
m = 1,
maxit = 1)
##
## iter imp variable
## 1 1 bmi hyp chl
The imputations are now done. This code imputes the missing values in the data set by the stochastic regression imputation method. For every missing datum, some noise (a stochastic term) is added to reflect the sampling variance in the observed cases. This imputation method, however, does not consider that we only have a single sample from the true data generating model. In that sense, it does not incorporate the variability of the regression weights and is therefore not ‘proper’ in the sense of Rubin (1987). For small samples, the variability of the imputed data will be underestimated.
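A hand-made version of this idea, again simplified to a single predictor, may clarify where the noise comes from. This sketch is our own and does not reproduce the internals of method = "norm.nob": it predicts bmi from age only and adds a normal error with the residual standard deviation of the fitted model. The names obs, fit_obs, pred, noise and stoch_imputed are hypothetical, and because no seed is set the imputed values will differ between runs.

```r
library(mice)  # provides the nhanes data

# Regression model for bmi, fitted on the observed cases only
obs <- !is.na(nhanes$bmi)
fit_obs <- lm(bmi ~ age, data = nhanes[obs, ])

# Predicted values for the rows with missing bmi
pred <- predict(fit_obs, newdata = nhanes[!obs, ])

# Normal errors with mean 0 and the residual standard deviation
noise <- rnorm(sum(!obs), mean = 0, sd = sigma(fit_obs))

# Imputations now scatter around the regression line
stoch_imputed <- nhanes
stoch_imputed$bmi[!obs] <- pred + noise
```

As in the text above, the regression weights themselves are treated as if they were known exactly, so the uncertainty about those weights is not reflected in the imputations.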
11. Again, inspect the completed data and investigate the imputed data regression model.
complete(imp)
## age bmi hyp chl
## 1 1 27.07564 1.390325 137.0008
## 2 2 22.70000 1.000000 187.0000
## 3 1 27.39885 1.000000 187.0000
## 4 3 15.91262 1.699775 163.1688
## 5 1 20.40000 1.000000 113.0000
## 6 3 23.90580 1.867292 184.0000
## 7 1 22.50000 1.000000 118.0000
## 8 1 30.10000 1.000000 187.0000
## 9 2 22.00000 1.000000 238.0000
## 10 2 29.08769 1.678547 176.6941
## 11 1 21.90959 1.078977 196.6707
## 12 2 31.02043 1.402626 274.4438
## 13 3 21.70000 1.000000 206.0000
## 14 2 28.70000 2.000000 204.0000
## 15 1 29.60000 1.000000 146.5473
## 16 1 37.90818 1.217217 226.8935
## 17 3 27.20000 2.000000 284.0000
## 18 2 26.30000 2.000000 199.0000
## 19 1 35.30000 1.000000 218.0000
## 20 3 25.50000 2.000000 219.9185
## 21 1 32.23070 1.017536 226.9338
## 22 1 33.20000 1.000000 229.0000
## 23 1 27.50000 1.000000 131.0000
## 24 3 24.90000 1.000000 241.3764
## 25 2 27.40000 1.000000 186.0000
We have once more obtained a more natural looking set of imputations,
where instead of filling in the same bmi for all ages, we now take age
(as well as hyp and chl) into account when imputing bmi. We also add a
random error to allow for our imputations to be off the regression line.
To inspect the regression model with the imputed data, run:
fit <- with(imp, lm(age ~ bmi))
summary(fit)
## # A tibble: 2 × 6
## term estimate std.error statistic p.value nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 (Intercept) 3.79 0.847 4.47 0.000175 25
## 2 bmi -0.0754 0.0310 -2.43 0.0233 25
12. Re-run the stochastic imputation model with seed 123 and verify if your results are the same as the ones below
## # A tibble: 2 × 6
## term estimate std.error statistic p.value nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 (Intercept) 3.75 0.736 5.10 0.0000362 25
## 2 bmi -0.0792 0.0287 -2.77 0.0110 25
The imputation procedure uses random sampling, and therefore, the results will be (perhaps slightly) different if we repeat the imputations. In order to get exactly the same result, you can use the seed argument
imp <- mice(nhanes,
method = "norm.nob",
m = 1,
maxit = 1,
seed = 123)
fit <- with(imp, lm(age ~ bmi))
summary(fit)
where 123 is some arbitrary number that you can choose yourself. Re-running this command will always yield the same imputed values. The ability to replicate one’s findings exactly is considered essential in today’s reproducible science.
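The effect of the seed argument can be verified directly by running the same imputation twice and comparing the completed data. In this sketch the object names imp_a and imp_b are ours, and printFlag = FALSE only suppresses the iteration log.

```r
library(mice)

# Two runs of the same imputation model with the same seed
imp_a <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1,
              seed = 123, printFlag = FALSE)
imp_b <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1,
              seed = 123, printFlag = FALSE)

# TRUE: both runs produce exactly the same completed data
identical(complete(imp_a), complete(imp_b))
```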
13. Let us impute the missing data in the nhanes dataset
To do multiple imputation, we can simply call mice() on our data set:
imp <- mice(nhanes)
##
## iter imp variable
## 1 1 bmi hyp chl
## 1 2 bmi hyp chl
## 1 3 bmi hyp chl
## 1 4 bmi hyp chl
## 1 5 bmi hyp chl
## 2 1 bmi hyp chl
## 2 2 bmi hyp chl
## 2 3 bmi hyp chl
## 2 4 bmi hyp chl
## 2 5 bmi hyp chl
## 3 1 bmi hyp chl
## 3 2 bmi hyp chl
## 3 3 bmi hyp chl
## 3 4 bmi hyp chl
## 3 5 bmi hyp chl
## 4 1 bmi hyp chl
## 4 2 bmi hyp chl
## 4 3 bmi hyp chl
## 4 4 bmi hyp chl
## 4 5 bmi hyp chl
## 5 1 bmi hyp chl
## 5 2 bmi hyp chl
## 5 3 bmi hyp chl
## 5 4 bmi hyp chl
## 5 5 bmi hyp chl
The imputations are now done. As you can see, the algorithm ran for 5
iterations (the default) and presented us with 5 imputations for each
missing datum. For the rest of this document we will omit printing of
the iteration cycle when we run mice. We do so by adding print=F to the
mice call.
imp
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## age bmi hyp chl
## "" "pmm" "pmm" "pmm"
## PredictorMatrix:
## age bmi hyp chl
## age 0 1 1 1
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
The object imp contains a multiply imputed data set (of class mids). It
encapsulates all information from imputing the nhanes dataset, such as
the original data, the imputed values, the number of missing values,
number of iterations, and so on.
To obtain an overview of the information stored in the object imp, use
the attributes() function:
attributes(imp)
## $names
## [1] "data" "imp" "m" "where"
## [5] "blocks" "call" "nmis" "method"
## [9] "predictorMatrix" "visitSequence" "formulas" "post"
## [13] "blots" "ignore" "seed" "iteration"
## [17] "lastSeedValue" "chainMean" "chainVar" "loggedEvents"
## [21] "version" "date"
##
## $class
## [1] "mids"
For example, the original data are stored as
imp$data
## age bmi hyp chl
## 1 1 NA NA NA
## 2 2 22.7 1 187
## 3 1 NA 1 187
## 4 3 NA NA NA
## 5 1 20.4 1 113
## 6 3 NA NA 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 NA NA NA
## 11 1 NA NA NA
## 12 2 NA NA NA
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 NA
## 16 1 NA NA NA
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 NA
## 21 1 NA NA NA
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 NA
## 25 2 27.4 1 186
and the imputations are stored as
imp$imp
## $age
## [1] 1 2 3 4 5
## <0 rows> (or 0-length row.names)
##
## $bmi
## 1 2 3 4 5
## 1 29.6 28.7 27.2 22.0 27.5
## 3 30.1 22.0 30.1 30.1 33.2
## 4 22.7 24.9 22.7 25.5 24.9
## 6 27.5 25.5 24.9 24.9 21.7
## 10 35.3 20.4 28.7 30.1 22.5
## 11 27.2 35.3 21.7 22.5 22.7
## 12 22.5 22.7 27.4 27.5 22.7
## 16 35.3 22.0 22.0 33.2 22.7
## 21 33.2 22.0 21.7 29.6 29.6
##
## $hyp
## 1 2 3 4 5
## 1 1 1 1 1 1
## 4 1 1 1 1 1
## 6 1 2 2 2 1
## 10 1 2 2 2 1
## 11 1 1 1 1 1
## 12 1 1 1 2 1
## 16 1 1 1 1 1
## 21 1 1 1 1 2
##
## $chl
## 1 2 3 4 5
## 1 187 204 187 118 187
## 4 284 184 206 184 184
## 10 199 187 184 204 131
## 11 187 186 131 187 113
## 12 187 229 206 199 131
## 15 187 199 187 187 229
## 16 187 131 238 199 187
## 20 218 186 218 186 184
## 21 187 113 113 187 187
## 24 284 184 218 184 204
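Other elements listed by attributes() can be extracted in the same way. The sketch below is our own illustration: it creates a separate object imp_demo (a hypothetical name, with an arbitrarily chosen seed) so that the object imp used in this document is left untouched.

```r
library(mice)

# A fresh multiply imputed data set with default settings
imp_demo <- mice(nhanes, seed = 123, printFlag = FALSE)

imp_demo$m       # number of imputed data sets (5 by default)
imp_demo$nmis    # number of missing values per variable
imp_demo$method  # imputation method per variable

# One row per missing bmi value, one column per imputation
dim(imp_demo$imp$bmi)
```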
14. Extract the completed data
By default, mice() calculates five (m = 5) imputed data sets. In order
to get the third imputed data set, use the complete() function
c3 <- complete(imp, 3)
plot_pattern(c3)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
The collection of the m imputed data sets can be exported by function
complete() in several formats, including long, broad, repeated and list
("all") format. For example,
c.long <- complete(imp, action = "long")
c.long
## .imp .id age bmi hyp chl
## 1 1 1 1 29.6 1 187
## 2 1 2 2 22.7 1 187
## 3 1 3 1 30.1 1 187
## 4 1 4 3 22.7 1 284
## 5 1 5 1 20.4 1 113
## 6 1 6 3 27.5 1 184
## 7 1 7 1 22.5 1 118
## 8 1 8 1 30.1 1 187
## 9 1 9 2 22.0 1 238
## 10 1 10 2 35.3 1 199
## 11 1 11 1 27.2 1 187
## 12 1 12 2 22.5 1 187
## 13 1 13 3 21.7 1 206
## 14 1 14 2 28.7 2 204
## 15 1 15 1 29.6 1 187
## 16 1 16 1 35.3 1 187
## 17 1 17 3 27.2 2 284
## 18 1 18 2 26.3 2 199
## 19 1 19 1 35.3 1 218
## 20 1 20 3 25.5 2 218
## 21 1 21 1 33.2 1 187
## 22 1 22 1 33.2 1 229
## 23 1 23 1 27.5 1 131
## 24 1 24 3 24.9 1 284
## 25 1 25 2 27.4 1 186
## 26 2 1 1 28.7 1 204
## 27 2 2 2 22.7 1 187
## 28 2 3 1 22.0 1 187
## 29 2 4 3 24.9 1 184
## 30 2 5 1 20.4 1 113
## 31 2 6 3 25.5 2 184
## 32 2 7 1 22.5 1 118
## 33 2 8 1 30.1 1 187
## 34 2 9 2 22.0 1 238
## 35 2 10 2 20.4 2 187
## 36 2 11 1 35.3 1 186
## 37 2 12 2 22.7 1 229
## 38 2 13 3 21.7 1 206
## 39 2 14 2 28.7 2 204
## 40 2 15 1 29.6 1 199
## 41 2 16 1 22.0 1 131
## 42 2 17 3 27.2 2 284
## 43 2 18 2 26.3 2 199
## 44 2 19 1 35.3 1 218
## 45 2 20 3 25.5 2 186
## 46 2 21 1 22.0 1 113
## 47 2 22 1 33.2 1 229
## 48 2 23 1 27.5 1 131
## 49 2 24 3 24.9 1 184
## 50 2 25 2 27.4 1 186
## 51 3 1 1 27.2 1 187
## 52 3 2 2 22.7 1 187
## 53 3 3 1 30.1 1 187
## 54 3 4 3 22.7 1 206
## 55 3 5 1 20.4 1 113
## 56 3 6 3 24.9 2 184
## 57 3 7 1 22.5 1 118
## 58 3 8 1 30.1 1 187
## 59 3 9 2 22.0 1 238
## 60 3 10 2 28.7 2 184
## 61 3 11 1 21.7 1 131
## 62 3 12 2 27.4 1 206
## 63 3 13 3 21.7 1 206
## 64 3 14 2 28.7 2 204
## 65 3 15 1 29.6 1 187
## 66 3 16 1 22.0 1 238
## 67 3 17 3 27.2 2 284
## 68 3 18 2 26.3 2 199
## 69 3 19 1 35.3 1 218
## 70 3 20 3 25.5 2 218
## 71 3 21 1 21.7 1 113
## 72 3 22 1 33.2 1 229
## 73 3 23 1 27.5 1 131
## 74 3 24 3 24.9 1 218
## 75 3 25 2 27.4 1 186
## 76 4 1 1 22.0 1 118
## 77 4 2 2 22.7 1 187
## 78 4 3 1 30.1 1 187
## 79 4 4 3 25.5 1 184
## 80 4 5 1 20.4 1 113
## 81 4 6 3 24.9 2 184
## 82 4 7 1 22.5 1 118
## 83 4 8 1 30.1 1 187
## 84 4 9 2 22.0 1 238
## 85 4 10 2 30.1 2 204
## 86 4 11 1 22.5 1 187
## 87 4 12 2 27.5 2 199
## 88 4 13 3 21.7 1 206
## 89 4 14 2 28.7 2 204
## 90 4 15 1 29.6 1 187
## 91 4 16 1 33.2 1 199
## 92 4 17 3 27.2 2 284
## 93 4 18 2 26.3 2 199
## 94 4 19 1 35.3 1 218
## 95 4 20 3 25.5 2 186
## 96 4 21 1 29.6 1 187
## 97 4 22 1 33.2 1 229
## 98 4 23 1 27.5 1 131
## 99 4 24 3 24.9 1 184
## 100 4 25 2 27.4 1 186
## 101 5 1 1 27.5 1 187
## 102 5 2 2 22.7 1 187
## 103 5 3 1 33.2 1 187
## 104 5 4 3 24.9 1 184
## 105 5 5 1 20.4 1 113
## 106 5 6 3 21.7 1 184
## 107 5 7 1 22.5 1 118
## 108 5 8 1 30.1 1 187
## 109 5 9 2 22.0 1 238
## 110 5 10 2 22.5 1 131
## 111 5 11 1 22.7 1 113
## 112 5 12 2 22.7 1 131
## 113 5 13 3 21.7 1 206
## 114 5 14 2 28.7 2 204
## 115 5 15 1 29.6 1 229
## 116 5 16 1 22.7 1 187
## 117 5 17 3 27.2 2 284
## 118 5 18 2 26.3 2 199
## 119 5 19 1 35.3 1 218
## 120 5 20 3 25.5 2 184
## 121 5 21 1 29.6 2 187
## 122 5 22 1 33.2 1 229
## 123 5 23 1 27.5 1 131
## 124 5 24 3 24.9 1 204
## 125 5 25 2 27.4 1 186
and
c.broad <- complete(imp, action = "broad")
## New names:
## • `age` -> `age...1`
## • `bmi` -> `bmi...2`
## • `hyp` -> `hyp...3`
## • `chl` -> `chl...4`
## • `age` -> `age...5`
## • `bmi` -> `bmi...6`
## • `hyp` -> `hyp...7`
## • `chl` -> `chl...8`
## • `age` -> `age...9`
## • `bmi` -> `bmi...10`
## • `hyp` -> `hyp...11`
## • `chl` -> `chl...12`
## • `age` -> `age...13`
## • `bmi` -> `bmi...14`
## • `hyp` -> `hyp...15`
## • `chl` -> `chl...16`
## • `age` -> `age...17`
## • `bmi` -> `bmi...18`
## • `hyp` -> `hyp...19`
## • `chl` -> `chl...20`
c.broad
## age.1 bmi.1 hyp.1 chl.1 age.2 bmi.2 hyp.2 chl.2 age.3 bmi.3 hyp.3 chl.3
## 1 1 29.6 1 187 1 28.7 1 204 1 27.2 1 187
## 2 2 22.7 1 187 2 22.7 1 187 2 22.7 1 187
## 3 1 30.1 1 187 1 22.0 1 187 1 30.1 1 187
## 4 3 22.7 1 284 3 24.9 1 184 3 22.7 1 206
## 5 1 20.4 1 113 1 20.4 1 113 1 20.4 1 113
## 6 3 27.5 1 184 3 25.5 2 184 3 24.9 2 184
## 7 1 22.5 1 118 1 22.5 1 118 1 22.5 1 118
## 8 1 30.1 1 187 1 30.1 1 187 1 30.1 1 187
## 9 2 22.0 1 238 2 22.0 1 238 2 22.0 1 238
## 10 2 35.3 1 199 2 20.4 2 187 2 28.7 2 184
## 11 1 27.2 1 187 1 35.3 1 186 1 21.7 1 131
## 12 2 22.5 1 187 2 22.7 1 229 2 27.4 1 206
## 13 3 21.7 1 206 3 21.7 1 206 3 21.7 1 206
## 14 2 28.7 2 204 2 28.7 2 204 2 28.7 2 204
## 15 1 29.6 1 187 1 29.6 1 199 1 29.6 1 187
## 16 1 35.3 1 187 1 22.0 1 131 1 22.0 1 238
## 17 3 27.2 2 284 3 27.2 2 284 3 27.2 2 284
## 18 2 26.3 2 199 2 26.3 2 199 2 26.3 2 199
## 19 1 35.3 1 218 1 35.3 1 218 1 35.3 1 218
## 20 3 25.5 2 218 3 25.5 2 186 3 25.5 2 218
## 21 1 33.2 1 187 1 22.0 1 113 1 21.7 1 113
## 22 1 33.2 1 229 1 33.2 1 229 1 33.2 1 229
## 23 1 27.5 1 131 1 27.5 1 131 1 27.5 1 131
## 24 3 24.9 1 284 3 24.9 1 184 3 24.9 1 218
## 25 2 27.4 1 186 2 27.4 1 186 2 27.4 1 186
## age.4 bmi.4 hyp.4 chl.4 age.5 bmi.5 hyp.5 chl.5
## 1 1 22.0 1 118 1 27.5 1 187
## 2 2 22.7 1 187 2 22.7 1 187
## 3 1 30.1 1 187 1 33.2 1 187
## 4 3 25.5 1 184 3 24.9 1 184
## 5 1 20.4 1 113 1 20.4 1 113
## 6 3 24.9 2 184 3 21.7 1 184
## 7 1 22.5 1 118 1 22.5 1 118
## 8 1 30.1 1 187 1 30.1 1 187
## 9 2 22.0 1 238 2 22.0 1 238
## 10 2 30.1 2 204 2 22.5 1 131
## 11 1 22.5 1 187 1 22.7 1 113
## 12 2 27.5 2 199 2 22.7 1 131
## 13 3 21.7 1 206 3 21.7 1 206
## 14 2 28.7 2 204 2 28.7 2 204
## 15 1 29.6 1 187 1 29.6 1 229
## 16 1 33.2 1 199 1 22.7 1 187
## 17 3 27.2 2 284 3 27.2 2 284
## 18 2 26.3 2 199 2 26.3 2 199
## 19 1 35.3 1 218 1 35.3 1 218
## 20 3 25.5 2 186 3 25.5 2 184
## 21 1 29.6 1 187 1 29.6 2 187
## 22 1 33.2 1 229 1 33.2 1 229
## 23 1 27.5 1 131 1 27.5 1 131
## 24 3 24.9 1 184 3 24.9 1 204
## 25 2 27.4 1 186 2 27.4 1 186
and
c.all <- complete(imp, action = "all")
c.all
## $`1`
## age bmi hyp chl
## 1 1 29.6 1 187
## 2 2 22.7 1 187
## 3 1 30.1 1 187
## 4 3 22.7 1 284
## 5 1 20.4 1 113
## 6 3 27.5 1 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 35.3 1 199
## 11 1 27.2 1 187
## 12 2 22.5 1 187
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 187
## 16 1 35.3 1 187
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 218
## 21 1 33.2 1 187
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 284
## 25 2 27.4 1 186
##
## $`2`
## age bmi hyp chl
## 1 1 28.7 1 204
## 2 2 22.7 1 187
## 3 1 22.0 1 187
## 4 3 24.9 1 184
## 5 1 20.4 1 113
## 6 3 25.5 2 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 20.4 2 187
## 11 1 35.3 1 186
## 12 2 22.7 1 229
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 199
## 16 1 22.0 1 131
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 186
## 21 1 22.0 1 113
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 184
## 25 2 27.4 1 186
##
## $`3`
## age bmi hyp chl
## 1 1 27.2 1 187
## 2 2 22.7 1 187
## 3 1 30.1 1 187
## 4 3 22.7 1 206
## 5 1 20.4 1 113
## 6 3 24.9 2 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 28.7 2 184
## 11 1 21.7 1 131
## 12 2 27.4 1 206
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 187
## 16 1 22.0 1 238
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 218
## 21 1 21.7 1 113
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 218
## 25 2 27.4 1 186
##
## $`4`
## age bmi hyp chl
## 1 1 22.0 1 118
## 2 2 22.7 1 187
## 3 1 30.1 1 187
## 4 3 25.5 1 184
## 5 1 20.4 1 113
## 6 3 24.9 2 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 30.1 2 204
## 11 1 22.5 1 187
## 12 2 27.5 2 199
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 187
## 16 1 33.2 1 199
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 186
## 21 1 29.6 1 187
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 184
## 25 2 27.4 1 186
##
## $`5`
## age bmi hyp chl
## 1 1 27.5 1 187
## 2 2 22.7 1 187
## 3 1 33.2 1 187
## 4 3 24.9 1 184
## 5 1 20.4 1 113
## 6 3 21.7 1 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 22.5 1 131
## 11 1 22.7 1 113
## 12 2 22.7 1 131
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 229
## 16 1 22.7 1 187
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 184
## 21 1 29.6 2 187
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 204
## 25 2 27.4 1 186
##
## attr(,"class")
## [1] "mild" "list"
are multiply imputed completed data sets in long, broad and list
format, respectively. See ?complete for more detail.
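A quick way to tell the three formats apart is to look at their dimensions. The sketch below is our own illustration and uses a separate object imp_demo (a hypothetical name, with an arbitrarily chosen seed). The names c_long, c_broad and c_all are also ours.

```r
library(mice)

# A fresh multiply imputed data set with default settings (m = 5)
imp_demo <- mice(nhanes, seed = 123, printFlag = FALSE)

c_long  <- complete(imp_demo, action = "long")
c_broad <- complete(imp_demo, action = "broad")
c_all   <- complete(imp_demo, action = "all")

dim(c_long)    # 5 stacked copies of the 25 rows, plus .imp and .id columns
dim(c_broad)   # 25 rows, with the 5 x 4 imputed columns side by side
length(c_all)  # a list with one completed data set per imputation
```

The long format is convenient for grouped summaries per imputation, while the list format lends itself to applying the same function to every completed data set.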
We have seen that (multiple) imputation is straightforward with mice.
However, don’t let the simplicity of the software fool you into
thinking that the problem itself is also straightforward. In the next
practical we will therefore explore how the mice package can flexibly
provide us the tools to assess and control the imputation of missing
data.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall. Table 6.14.
Van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67.
- End of practical
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Dutch_Netherlands.utf8 LC_CTYPE=Dutch_Netherlands.utf8
## [3] LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C
## [5] LC_TIME=Dutch_Netherlands.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.0.9 magrittr_2.0.3 ggplot2_3.3.6 ggmice_0.0.1.9000
## [5] mice_3.14.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.8.3 highr_0.9 bslib_0.3.1 compiler_4.2.1
## [5] pillar_1.7.0 jquerylib_0.1.4 tools_4.2.1 digest_0.6.29
## [9] gtable_0.3.0 lattice_0.20-45 jsonlite_1.8.0 evaluate_0.15
## [13] lifecycle_1.0.1 tibble_3.1.7 pkgconfig_2.0.3 rlang_1.0.3
## [17] cli_3.3.0 DBI_1.1.3 rstudioapi_0.13 yaml_2.3.5
## [21] xfun_0.31 fastmap_1.1.0 withr_2.5.0 stringr_1.4.0
## [25] knitr_1.39 generics_0.1.3 vctrs_0.4.1 sass_0.4.1
## [29] grid_4.2.1 tidyselect_1.1.2 glue_1.6.2 R6_2.5.1
## [33] fansi_1.0.3 rmarkdown_2.14 farver_2.1.1 tidyr_1.2.0
## [37] purrr_0.3.4 scales_1.2.0 backports_1.4.1 htmltools_0.5.2
## [41] ellipsis_0.3.2 assertthat_0.2.1 colorspace_2.0-3 labeling_0.4.2
## [45] utf8_1.2.2 stringi_1.7.6 munsell_0.5.0 broom_1.0.0
## [49] crayon_1.5.1