The future starts today

For large datasets or a high number of imputations, performing multiple imputation with the function mice from package mice (Van Buuren & Groothuis-Oudshoorn, 2011) can take a long time. As a solution, the wrapper function futuremice was created so that the imputation procedure can run in parallel. It divides the imputations over multiple cores (or CPUs), which can speed up the process. The function futuremice is the successor to parlMICE (Schouten & Vink, 2017) and was developed to improve user-friendliness.
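To make the idea of dividing imputations over cores concrete, the sketch below splits a number of imputations into roughly equal batches, one per core. It uses base R only. The names m, n_core, batch and imps_per_core and the values chosen are illustrative assumptions, and this is not necessarily how futuremice allocates the work internally.

```r
# Illustrative only: split m imputations into one batch per core.
m <- 10       # total number of imputations (assumed value)
n_core <- 3   # number of cores to use (assumed value)

# Assign each imputation index to a batch of roughly equal size
batch <- cut(seq_len(m), breaks = n_core,
             labels = paste0("core", seq_len(n_core)))

# Number of imputations each core would generate
imps_per_core <- as.vector(table(batch))
print(imps_per_core)
```

Each core then only has to produce its own share of the imputations, which is where the potential time gain comes from.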

This vignette demonstrates two applications of the futuremice function. The first shows the tradeoff between computing time and an increasing number of imputations ($m$) for a small dataset; the second does the same for a relatively large dataset. We also discuss futuremice’s arguments.

The function futuremice depends on the packages future, furrr and mice. For more information about running functions in futures, see e.g. the future manual or the furrr manual. The function futuremice was inspired by Max’s useful suggestions on parallelizing mice’s chains on Stack Overflow.

Time gain with small datasets

We demonstrate the potential gain in computing efficiency on simulated data. To this end we sample 1,000 cases from a multivariate normal distribution with mean vector

$\mu = \left[\begin{array} {r} 0 \\ 0 \\ 0 \\ 0 \end{array}\right]$

and covariance matrix

$\Sigma = \left[\begin{array} {rrrr} 1&0.5&0.5&0.5 \\ 0.5&1&0.5&0.5 \\ 0.5&0.5&1&0.5 \\ 0.5&0.5&0.5&1 \end{array}\right].$

An MCAR missingness mechanism is imposed on the data such that 80 percent of the cases (i.e. rows) have a missing value on one variable. All variables have missing values. The missingness is randomly generated with the following arguments to the function mice::ampute:

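library(mice)  # provides ampute()

set.seed(123)

# 4 x 4 covariance matrix: unit variances, covariances of 0.5
small_covmat <- diag(4)
small_covmat[small_covmat == 0] <- 0.5
small_data <- MASS::mvrnorm(1000,
                            mu = c(0, 0, 0, 0),
                            Sigma = small_covmat)

# Impose MCAR missingness on 80 percent of the rows
small_data_with_missings <- ampute(small_data, prop = 0.8, mech = "MCAR")$amp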
head(small_data_with_missings)
V1 V2 V3 V4
-0.1667048 0.9165856 0.6389869 NA
-0.4548685 0.4313280 NA 0.5753627
-1.2432777 -0.4162831 -1.9552769 NA
-0.1366822 NA -0.5998099 0.7553689
-1.6633582 -0.7137484 1.8412701 0.1269927
NA -1.3018272 -1.4972105 -1.9058145
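As a quick sanity check on the missingness described above, the following sketch regenerates comparable data and inspects the share of incomplete rows and the number of missing values per variable. It assumes the packages mice and MASS are installed; the object names (covmat, dat, dat_mis, prop_incomplete, na_per_row) are illustrative and not part of the original analysis.

```r
library(mice)

set.seed(123)

# Same design as above: 1000 cases, 4 variables, pairwise covariance 0.5
covmat <- diag(4)
covmat[covmat == 0] <- 0.5
dat <- MASS::mvrnorm(1000, mu = rep(0, 4), Sigma = covmat)
dat_mis <- ampute(dat, prop = 0.8, mech = "MCAR")$amp

# Share of rows with at least one missing value (should be close to 0.8)
prop_incomplete <- mean(!complete.cases(dat_mis))
print(prop_incomplete)

# Number of missing values per row and per variable
na_per_row <- rowSums(is.na(dat_mis))
print(table(na_per_row))
print(colSums(is.na(dat_mis)))
```

With the default missingness patterns of ampute, each incomplete row is expected to miss exactly one variable, and the missing values should be spread over all four columns.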

We compare the default ‘sequential’ function mice with the function futuremice. In both functions we use the default arguments for the mice algorithm, although the user can easily change these if desired. To demonstrate the increased efficiency of putting more than one computing core to work, we repeat the procedure with futuremice for 1, 2, 3 and 4 cores. Figure 1 shows a graphical representation of the results.

Figure 1. Processing time for small datasets. Multiple imputations are performed with mice (conventional) and the wrapper function futuremice (1, 2, 3 and 4 cores, respectively). The dataset has 1000 cases and 4 variables with a correlation of 0.5. 80 percent of the cases have one missing value based on MCAR missingness.

For a small to moderate number of imputations, the conventional mice function is faster than the wrapper function futuremice. This holds until the number of imputations reaches $m = 120$. For higher $m$, futuremice returns the imputations somewhat faster.
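A comparison of this kind can be tried at a small scale with a sketch like the one below, which times one run of mice and one run of futuremice on the same incomplete data. It assumes that mice, MASS, future and furrr are installed and that the machine has several cores available. The values m = 10 and n.core = 2, the seeds, and the object names are illustrative choices, not the settings behind Figure 1, and timings will differ between machines.

```r
library(mice)

set.seed(123)

# Small dataset as in this section: 1000 cases, 4 variables, 80% incomplete rows
covmat <- diag(4)
covmat[covmat == 0] <- 0.5
dat <- MASS::mvrnorm(1000, mu = rep(0, 4), Sigma = covmat)
dat_mis <- ampute(dat, prop = 0.8, mech = "MCAR")$amp

m <- 10  # number of imputations (assumed value for illustration)

# Conventional, sequential imputation
time_mice <- system.time(
  imp_seq <- mice(dat_mis, m = m, printFlag = FALSE)
)[["elapsed"]]

# Parallel imputation; n.core = 2 is an assumption about the machine
time_future <- system.time(
  imp_par <- futuremice(dat_mis, m = m, n.core = 2, parallelseed = 123)
)[["elapsed"]]

# Elapsed wall-clock time in seconds for each approach
print(c(mice = time_mice, futuremice = time_future))
```

For a small m such as this, the overhead of starting parallel workers may well outweigh the gain, which is consistent with the pattern described above.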

Time gain with large datasets

We replicated the simulation setup detailed above with a larger dataset of 10,000 cases and 8 variables. The mean and covariance structure follow the sampling scheme of the smaller dataset. Figure 2 shows the results of this simulation.

V1 V2 V3 V4 V5 V6 V7 V8
0.3177437 NA 0.3578290 -0.7861403 0.0857024 0.2905915 -0.1159348 0.3464402
-0.4895928 0.7905551 0.9676060 NA 0.3915238 1.3301799 0.5672698 0.1748194
NA -1.2294188 0.2485337 -0.2706589 -1.5055993 -0.7062091 0.8060020 -0.4176853
-0.1711396 NA -1.2937757 0.0984493 -0.1351536 0.3613034 -0.5861565 -0.6498191
0.4208610 -0.0102911 0.3268812 NA 0.9371669 0.0886542 1.4311793 -0.1800665
1.8674356 1.9724127 0.3847853 0.3058566 0.3201818 NA -1.2755379 1.3359326
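A dataset of this shape could be generated along the following lines, mirroring the code for the small dataset. This is a sketch under assumptions: the seed and the object names (large_covmat, large_data, large_data_with_missings) are illustrative, so the values will not match the rows printed above.

```r
library(mice)

set.seed(123)  # assumed seed; the original seed for the large dataset is not shown

# 8 x 8 covariance matrix: unit variances, covariances of 0.5
large_covmat <- diag(8)
large_covmat[large_covmat == 0] <- 0.5

# 10,000 cases from a multivariate normal distribution with zero means
large_data <- MASS::mvrnorm(10000, mu = rep(0, 8), Sigma = large_covmat)

# Impose MCAR missingness on 80 percent of the rows
large_data_with_missings <- ampute(large_data, prop = 0.8, mech = "MCAR")$amp
head(large_data_with_missings)
```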