mice
: Passive imputation and
Post-processingThis is the fourth practical.
In this practical we will walk you through the more advanced features
of mice
, such as post-processing of imputations
and passive imputation.
All the best,
1. Open R
and load the packages
mice
and dplyr
.
library(mice) # Data imputation
library(dplyr) # Data manipulation
library(purrr) # Functional programming
library(lattice) # Plotting device
library(ggplot2) # Plotting device
library(ggmice) # gg-like plots for mice
library(plotly) # interactive plots
# Fix the RNG seed
set.seed(123)
We choose seed value 123
. This is an arbitrary value;
any value would be an equally good seed value. Fixing the random seed
enables you (and others) to exactly replicate anything that involves
random number generators. If you set the seed in your R
instance to 123
, you will get the exact same results and
plots as we present in this document if you follow the order and code of
the exercises precisely.
There is often a need for transformed, combined or recoded versions
of the data. In the case of incomplete data, one could impute the
original, and transform the completed original afterwards, or transform
the incomplete original and impute the transformed version. If, however,
both the original and the transformed version are needed within the
imputation algorithm, neither of these approaches work: One cannot be
sure that the transformation holds between the imputed values of the
original and transformed versions. mice
has a built-in
approach, called passive imputation, to deal with situations as
described above. The goal of passive imputation is to maintain the
consistency among different transformations of the same data. As an
example, consider the following deterministic function in the
boys
data \[\text{BMI} =
\frac{\text{Weight (kg)}}{\text{Height}^2 \text{(m)}}\] or the
compositional relation in the mammalsleep data: \[\text{ts} = \text{ps}+\text{sws}\]
2. Use passive imputation to impute the deterministic sleep
relation in the mammalsleep
data. Name the new multiply
imputed dataset pas.imp
.
First, we create a method
vector:
meth <- make.method(mammalsleep)
meth
## species bw brw sws ps ts mls gt pi sei
## "" "" "" "pmm" "pmm" "pmm" "pmm" "pmm" "" ""
## odi
## ""
and a predictorMatrix
:
pred <- make.predictorMatrix(mammalsleep)
pred
## species bw brw sws ps ts mls gt pi sei odi
## species 0 1 1 1 1 1 1 1 1 1 1
## bw 1 0 1 1 1 1 1 1 1 1 1
## brw 1 1 0 1 1 1 1 1 1 1 1
## sws 1 1 1 0 1 1 1 1 1 1 1
## ps 1 1 1 1 0 1 1 1 1 1 1
## ts 1 1 1 1 1 0 1 1 1 1 1
## mls 1 1 1 1 1 1 0 1 1 1 1
## gt 1 1 1 1 1 1 1 0 1 1 1
## pi 1 1 1 1 1 1 1 1 0 1 1
## sei 1 1 1 1 1 1 1 1 1 0 1
## odi 1 1 1 1 1 1 1 1 1 1 0
We add the call for passive imputation to the ts
element
in the meth
object
meth["ts"]<- "~ I(sws + ps)"
meth
## species bw brw sws ps
## "" "" "" "pmm" "pmm"
## ts mls gt pi sei
## "~ I(sws + ps)" "pmm" "pmm" "" ""
## odi
## ""
and set the predictor relations for ts
with
sws
and ps
to 0
. Also, we have to
exclude Species
as a predictor
pred[c("sws", "ps"), "ts"] <- 0
pred[, "species"] <- 0
pred
## species bw brw sws ps ts mls gt pi sei odi
## species 0 1 1 1 1 1 1 1 1 1 1
## bw 0 0 1 1 1 1 1 1 1 1 1
## brw 0 1 0 1 1 1 1 1 1 1 1
## sws 0 1 1 0 1 0 1 1 1 1 1
## ps 0 1 1 1 0 0 1 1 1 1 1
## ts 0 1 1 1 1 0 1 1 1 1 1
## mls 0 1 1 1 1 1 0 1 1 1 1
## gt 0 1 1 1 1 1 1 0 1 1 1
## pi 0 1 1 1 1 1 1 1 0 1 1
## sei 0 1 1 1 1 1 1 1 1 0 1
## odi 0 1 1 1 1 1 1 1 1 1 0
This avoids circularity problems where ts
would feed
back into sws
and ps
, from which it is
calculated:
We can then run the imputations as
pas.imp <- mice(mammalsleep,
meth = meth,
pred = pred,
maxit = 15,
seed = 123,
print = FALSE)
We used a custom predictor matrix and method vector to tailor our
imputation approach to the passive imputation problem. We made sure to
exclude ts
as a predictor for the imputation of
sws
and ps
to avoid circularity.
We also gave the imputation algorithm 10 iterations to converge and
fixed the seed to 123
for this mice
instance.
This means that even when people do not fix the overall R
seed for a session, exact replication of results can be obtained by
simply fixing the seed
for the random number generator
within mice
. Naturally, the same input (data) is each time
required to yield the same output (mids
-object).
3. Inspect the trace lines for
pas.imp
.
plot_trace(pas.imp)
We can see that the pathological nonconvergence we experienced before has been properly dealt with. The trace lines for the sleep variable look okay now and convergence can be inferred by studying the trace lines.
Remember that we imputed the boys
data previously with
pmm
and with norm
, where we saw that
pmm
preserves the observed data range and norm
allows us to extend outside of the range of the observed data. If we
were to impute e.g. tv
with norm
the problem
arises that there may be negative values among the imputations. Somehow
we should be able to
tv
.The mice()
function has an argument called
post
that takes a vector of strings of R
commands. These commands are parsed and evaluated after the univariate
imputation function returns, and thus provides a way of post-processing
the imputed values while using the processed version in the imputation
algorithm. In other words; the post-processing allows us to manipulate
the imputations for a particular variable that are generated within each
iteration. Such manipulations directly affect the imputated values of
that variable and the imputations for other variables. Naturally, such a
procedure should be handled with care.
4. Post-process the values to constrain them between 1 and
25, use norm
as the imputation method for
tv
.
meth <- make.method(boys)
meth["tv"] <- "norm"
post <- make.post(boys)
post["tv"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(1, 25))"
imp <- mice(boys,
meth = meth,
post = post,
print = FALSE)
In this way the imputed values of tv
are constrained
(squeezed by function squeeze()
) between 1 and 25.
5. Compare the imputed values and histograms of
tv
for the solution obtained by pmm
to the
constrained solution (created with norm
, constrained
between 1 and 25).
First, we recreate the default pmm
solution
imp.pmm <- mice(boys,
print=FALSE)
and we inspect the imputed values for the norm
solution
imp %>%
complete("all") %>%
purrr::map("tv") %>% # map() comes from package purrr
sapply(table, simplify = TRUE) %>%
.[[1]] # take only the first listed dimension because of printing
##
## 1 1.00649453784308 1.00982920079416 1.15269427727004
## 292 1 1 1
## 1.17581071876652 1.18207989603427 1.19942095901173 1.20949839891864
## 1 1 1 1
## 1.23048088086359 1.41073117576312 1.43789286838149 1.5222389495334
## 1 1 1 1
## 1.56041669547988 1.57588689309982 1.72097699584266 1.83756862205572
## 1 1 1 1
## 1.83797495955285 1.97245805483343 1.98851265830229 2
## 1 1 1 26
## 2.01612495185465 2.1047767820243 2.23215867224639 2.30184168112338
## 1 1 1 1
## 2.30532953711583 2.35976228524561 2.40088420781944 2.44378699589746
## 1 1 1 1
## 2.53990154463181 2.55125059733124 2.61996575736638 2.67916086081524
## 1 1 1 1
## 2.87716226910656 2.88117793032014 2.96195361330228 3
## 1 1 1 19
## 3.20930945168391 3.2616386453062 3.35243447298713 3.37463566257075
## 1 1 1 1
## 3.49108692404755 3.54634014255576 3.6598133245376 3.7738729565257
## 1 1 1 1
## 3.85516210109852 3.90021123600448 4 4.09410101416819
## 1 1 17 1
## 4.14998413382255 4.17517346987585 4.1795853974424 4.24993355952502
## 1 1 1 1
## 4.30544759853944 4.46017637617961 4.51253870741541 4.53703357240513
## 1 1 1 1
## 4.58062955848928 4.58457406159914 4.74845076293805 4.91258905677252
## 1 1 1 1
## 5 5.1527744338739 5.45569436133809 5.45611240406588
## 5 1 1 1
## 5.49630947952168 5.75680605798326 5.86546055983343 6
## 1 1 1 10
## 6.20285539185884 6.37115087279455 6.46153979943104 6.61661877016197
## 1 1 1 1
## 6.6545204634103 6.87917567698322 6.96162417527009 7.12473843528905
## 1 1 1 1
## 7.21409919282494 7.47508149240378 7.757906656965 8
## 1 1 1 13
## 8.22389451703838 8.7017053970976 8.81813191087598 8.88635070490706
## 1 1 1 1
## 8.9420514684627 9 9.23366797008576 9.3232737234394
## 1 1 1 1
## 9.62333868663044 9.63597150328143 9.95208884573322 9.98625182558946
## 1 1 1 1
## 10 10.1636982474849 10.1868750750821 10.1911834457696
## 16 1 1 1
## 10.1953591688128 10.4544006230868 10.4584727887885 10.4706044919954
## 1 1 1 1
## 10.8474185779806 11.1805232721364 11.5143215397929 11.5740227738612
## 1 1 1 1
## 11.6631652884435 11.8876674683072 11.9304739173046 12
## 1 1 1 15
## 12.1322799141038 12.3025117544699 12.7106561095844 12.7746248330509
## 1 1 1 1
## 13 13.0936201537131 13.2830801881372 13.2846556998542
## 1 1 1 1
## 13.3223771579154 13.3566046513651 13.4338015413842 13.5365280881557
## 1 1 1 1
## 13.9317234815714 14 14.2024196579873 14.2157519150853
## 1 1 1 1
## 14.4634451920897 14.7582349852311 14.7685876358953 14.7974493930794
## 1 1 1 1
## 14.8086653926169 14.842145826169 15 15.1328918478211
## 1 1 27 1
## 15.2211647718857 15.2982441576676 15.3266057602964 15.4490417556256
## 1 1 1 1
## 15.4725492798951 15.5505770023786 15.7499927368705 15.8636375904038
## 1 1 1 1
## 15.8990497763138 16 16.0026657425697 16.0521913980514
## 1 1 1 1
## 16.0612601917031 16.1362765164314 16.1634476795731 16.1792551582938
## 1 1 1 1
## 16.3748925501372 16.4600779912783 16.5078994231325 16.5514384782586
## 1 1 1 1
## 16.5650824888355 16.9856677646296 17 17.0071074912293
## 1 1 1 1
## 17.059219724912 17.1979570740214 17.23970699111 17.2593487206854
## 1 1 1 1
## 17.4754295006349 17.5595929619069 17.7457605750693 17.9405557702254
## 1 1 1 1
## 17.9573858667122 18 18.083456303857 18.1471603005207
## 1 1 1 1
## 18.3494020772848 18.524717997109 18.5464264030669 18.6228106985796
## 1 1 1 1
## 18.6626384514709 18.8082470191459 18.9159970787405 18.9325365273361
## 1 1 1 1
## 18.9849864031893 18.9851910861529 18.9869613118325 19.0069773475117
## 1 1 1 1
## 19.0440402955858 19.0952406748918 19.1210701337786 19.2428259366155
## 1 1 1 1
## 19.2564964225346 19.2677286575065 19.2931167181757 19.3084804629669
## 1 1 1 1
## 19.5386390942929 19.616820747369 19.7150687158619 19.7364371981962
## 1 1 1 1
## 19.7424471762963 19.7620766562697 19.8332081383525 19.8687823301684
## 1 1 1 1
## 19.8759173182129 19.9252352312519 20 20.057616454172
## 1 1 38 1
## 20.1351347040467 20.1907849735746 20.1927865145999 20.3332398475225
## 1 1 1 1
## 20.3549086435044 20.3947717531481 20.4189455865838 20.633792751586
## 1 1 1 1
## 20.6339019818025 20.6692931756032 20.7518616961018 20.7696683043196
## 1 1 1 1
## 20.9230475615683 20.9758953771529 21.1149527842006 21.2113095059881
## 1 1 1 1
## 21.2251500840523 21.2664439150096 21.3051935147088 21.3435210900071
## 1 1 1 1
## 21.4647264732031 21.587139105416 21.690841206011 21.6969288875524
## 1 1 1 1
## 21.7181891706342 21.7977321027637 21.8114820995156 21.9628589789866
## 1 1 1 1
## 22.1303889884696 22.4889850539823 22.8552078506202 23.1378860428222
## 1 1 1 1
## 23.2195947091257 23.2219995935603 23.4372923006549 23.4657773755636
## 1 1 1 1
## 23.5839172443998 23.6634085261527 23.6681982217235 23.8236081411663
## 1 1 1 1
## 23.9131068803584 24.0087052970175 24.1443090101323 25
## 1 1 1 38
and for the pmm
solution
imp.pmm %>%
complete("all") %>%
map("tv") %>%
sapply(table)
## 1 2 3 4 5
## 1 71 79 64 65 69
## 2 230 207 212 213 238
## 3 89 104 87 98 83
## 4 33 30 29 30 26
## 5 11 6 12 8 5
## 6 15 19 28 21 17
## 8 25 26 36 25 34
## 9 1 1 1 3 1
## 10 29 25 37 37 28
## 12 29 29 32 29 29
## 13 1 2 2 3 2
## 14 3 2 2 1 1
## 15 55 50 67 59 51
## 16 6 2 3 2 3
## 17 3 4 1 3 1
## 18 3 3 2 1 5
## 20 88 89 70 87 89
## 25 56 70 63 63 66
It is clear that the norm solution - as expected - does not give us integer data as imputations. Next, we inspect and compare the density of the incomplete and imputed data for the constrained solution
# default lattice solution from mice
densityplot(imp, ~tv)
## Hint: Did you know, an equivalent figure can be created with `ggmice()`?
## For example, to plot a variable named 'my_vrb' from a mids object called 'my_mids', run:
##
## ggmice(my_mids, ggplot2::aes(x = my_vrb, group = .imp)) +
## ggplot2::geom_density()
##
## See amices.org/ggmice for more info.
We can create the same type plot with ggmice
:
ggmice(imp, aes(x = tv, group = .imp)) +
geom_density()
When we compare this plot to the mice.impute.pmm()
solution
ggmice(imp.pmm, aes(x = tv, group = .imp)) +
geom_density()
we see that the imputed values have a bit more mass towards the lower
end in the mice.impute.pmm()
solution
A nice way of plotting the histograms of both datasets simultaneously
is by creating first the dataframe (here we named it tvm
)
that contains the data in one column and the imputation method in
another column.
tv <- c(complete(imp.pmm)$tv, complete(imp)$tv)
used.method <- rep(c("pmm", "norm"), each = nrow(boys)) %>%
as.factor
tvm <- data.frame(tv = tv, method = used.method)
and then plotting a histogram of tv
conditional on
method.
histogram(~tv | method, data = tvm, nint = 25)
Is there still a difference in distribution between the two different imputation methods? Which imputations are more plausible do you think?
To create the same plot with ggmice
:
tvm %>%
ggmice(aes(x = tv)) +
geom_histogram() +
facet_wrap(used.method)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
6. Make a missing data indicator (name it miss
)
for bmi
and check the relation of bmi
,
wgt
and hgt
for the boys in the
pmm
-imputed data. To do so, plot the imputed values against
their respective calculated values.
miss <- is.na(imp$data$bmi)
xyplot(imp.pmm, bmi ~ I (wgt / (hgt / 100)^2),
na.groups = miss, cex = c(0.8, 1.2), pch = c(1, 20),
ylab = "BMI (kg/m2) Imputed", xlab = "BMI (kg/m2) Calculated")
## Hint: Did you know, an equivalent figure can be created with `ggmice()`?
## For example, to plot 2 variables named 'my_x' and 'my_y' from a mids object called 'my_mids', run:
##
## ggmice(my_mids, ggplot2::aes(x = my_x, y = my_y)) +
## ggplot2::geom_point()
##
## See amices.org/ggmice for more info.
With this plot we show that the relation between hgt
,
wgt
and bmi
is not preserved in the imputed
values. In order to preserve this relation, we should use passive
imputation.
We can generate the same plot with ggmice
:
ggmice(imp.pmm, aes(x = I(wgt / (hgt / 100)^2),
y = bmi,
groups = .imp)) +
geom_point() +
ylab("BMI (kg/m2) Imputed") +
xlab("BMI (kg/m2) Calculated")
7. Use passive imputation to preserve the relation between
imputed bmi
, wgt
and hgt
by
setting the imputation method for bmi
to:
meth["bmi"]<- "~ I(wgt / (hgt / 100)^2)"
but do not solve for circularity.
meth <- make.method(boys)
meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"
imp <- mice(boys,
meth = meth,
print=FALSE)
8. Again, plot the imputed values of bmi
versus
the calculated values and check whether there is convergence for
bmi
.
To inspect the relation:
plot <- ggmice(imp, aes(x = I(wgt / (hgt / 100)^2),
y = bmi,
groups = .imp)) +
geom_point() +
ylab("BMI (kg/m2) Imputed") +
xlab("BMI (kg/m2) Calculated")
ggplotly(plot)
To study convergence for bmi
alone:
plot_trace(imp, c("bmi"))
Although the relation of bmi
is preserved now in the
imputations we get absurd imputations and on top of that we clearly see
there are some problems with the convergence of bmi
. The
problem is that we have circularity in the imputations. We used passive
imputation for bmi
but bmi
is also
automatically used as predictor for wgt
and
hgt
. This can be solved by adjusting the predictor
matrix.
9. Solve the problem of circularity (if you did not already do so) and plot once again the imputed values of bmi versus the calculated values.
First, we remove bmi
as a predictor for hgt
and wgt
to remove circularity.
pred <- make.predictorMatrix(boys)
pred[c("hgt", "wgt"), "bmi"] <- 0
pred
## age hgt wgt bmi hc gen phb tv reg
## age 0 1 1 1 1 1 1 1 1
## hgt 1 0 1 0 1 1 1 1 1
## wgt 1 1 0 0 1 1 1 1 1
## bmi 1 1 1 0 1 1 1 1 1
## hc 1 1 1 1 0 1 1 1 1
## gen 1 1 1 1 1 0 1 1 1
## phb 1 1 1 1 1 1 0 1 1
## tv 1 1 1 1 1 1 1 0 1
## reg 1 1 1 1 1 1 1 1 0
and we run the mice
algorithm again with the new
predictor matrix (we still ‘borrow’ the imputation methods object
meth
from before)
imp <-mice(boys,
meth = meth,
pred = pred,
print = FALSE)
Second, we recreate the plots from Assignment 8. We start with the plot to inspect the relations in the observed and imputed data
ggmice(imp, aes(x = I(wgt / (hgt / 100)^2),
y = bmi,
groups = .imp)) +
geom_point() +
ylab("BMI (kg/m2) Imputed") +
xlab("BMI (kg/m2) Calculated")
and continue with the trace plot to study convergence
plot_trace(imp, c("bmi"))
All is well now! But let’s run a couple of more iterations to be sure.
mice.mids(imp,
maxit = 5,
print = FALSE) %>%
plot_trace()
We have seen that the practical execution of multiple imputation and
pooling is straightforward with the R
package
mice
. The package is designed to allow you to assess and
control the imputations themselves, the convergence of the algorithm and
the distributions and multivariate relations of the observed and imputed
data.
It is important to ‘gain’ this control as a user. After all, we are imputing values and taking their uncertainty properly into account. Being also uncertain about the process that generated those values is therefor not a valid option.
Never set all relations fixed. You will remain with the starting values and waste your computer’s energy (and your own).
meth<- make.method(boys)
pred <- make.predictorMatrix(boys)
meth["bmi"]<- "~ I(wgt / (hgt / 100)^2)"
meth["wgt"]<- "~ I(bmi * (hgt / 100)^2)"
meth["hgt"]<- "~ I(sqrt(wgt / bmi) * 100)"
pred[c("bmi", "wgt", "hgt"), ] <- 0
imp.path <- mice(boys,
meth=meth,
pred=pred,
seed=123)
##
## iter imp variable
## 1 1 hgt wgt bmi hc gen phb tv reg
## 1 2 hgt wgt bmi hc gen phb tv reg
## 1 3 hgt wgt bmi hc gen phb tv reg
## 1 4 hgt wgt bmi hc gen phb tv reg
## 1 5 hgt wgt bmi hc gen phb tv reg
## 2 1 hgt wgt bmi hc gen phb tv reg
## 2 2 hgt wgt bmi hc gen phb tv reg
## 2 3 hgt wgt bmi hc gen phb tv reg
## 2 4 hgt wgt bmi hc gen phb tv reg
## 2 5 hgt wgt bmi hc gen phb tv reg
## 3 1 hgt wgt bmi hc gen phb tv reg
## 3 2 hgt wgt bmi hc gen phb tv reg
## 3 3 hgt wgt bmi hc gen phb tv reg
## 3 4 hgt wgt bmi hc gen phb tv reg
## 3 5 hgt wgt bmi hc gen phb tv reg
## 4 1 hgt wgt bmi hc gen phb tv reg
## 4 2 hgt wgt bmi hc gen phb tv reg
## 4 3 hgt wgt bmi hc gen phb tv reg
## 4 4 hgt wgt bmi hc gen phb tv reg
## 4 5 hgt wgt bmi hc gen phb tv reg
## 5 1 hgt wgt bmi hc gen phb tv reg
## 5 2 hgt wgt bmi hc gen phb tv reg
## 5 3 hgt wgt bmi hc gen phb tv reg
## 5 4 hgt wgt bmi hc gen phb tv reg
## 5 5 hgt wgt bmi hc gen phb tv reg
plot(imp.path, c("hgt", "wgt", "bmi"))
We named the mids
object imp.path
, because
the nonconvergence is pathological in this example!
- End of practical
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.10.0 ggmice_0.0.1.9000 ggplot2_3.3.6 lattice_0.20-45
## [5] purrr_0.3.4 dplyr_1.0.9 mice_3.14.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.2 xfun_0.31 bslib_0.3.1 colorspace_2.0-3
## [5] vctrs_0.4.1 generics_0.1.3 htmltools_0.5.2 viridisLite_0.4.0
## [9] yaml_2.3.5 utf8_1.2.2 rlang_1.0.3 jquerylib_0.1.4
## [13] pillar_1.7.0 glue_1.6.2 withr_2.5.0 DBI_1.1.2
## [17] lifecycle_1.0.1 stringr_1.4.0 munsell_0.5.0 gtable_0.3.0
## [21] htmlwidgets_1.5.4 evaluate_0.15 labeling_0.4.2 knitr_1.39
## [25] fastmap_1.1.0 crosstalk_1.2.0 fansi_1.0.3 highr_0.9
## [29] broom_1.0.0 Rcpp_1.0.9 backports_1.4.1 scales_1.2.0
## [33] jsonlite_1.8.0 farver_2.1.1 digest_0.6.29 stringi_1.7.6
## [37] grid_4.2.0 cli_3.3.0 tools_4.2.0 magrittr_2.0.3
## [41] sass_0.4.1 lazyeval_0.2.2 tibble_3.1.7 crayon_1.5.1
## [45] tidyr_1.2.0 pkgconfig_2.0.3 MASS_7.3-56 ellipsis_0.3.2
## [49] data.table_1.14.2 assertthat_0.2.1 rmarkdown_2.14 httr_1.4.3
## [53] rstudioapi_0.13 R6_2.5.1 nnet_7.3-17 compiler_4.2.0