The future starts today


For big datasets or high number of imputations, performing multiple imputation with function mice from package mice (Van Buuren & Groothuis-Oudshoorn, 2011) might take a long time. As a solution, wrapper function futuremice was created to enable the imputation procedure to be run in parallel. This is done by dividing the imputations over multiple cores (or CPUs), thus potentially speeding up the process. The function futuremice is a sequel to parlMICE (Schouten & Vink, 2017), developed to improve user-friendliness.

This vignette demonstrates two applications of the futuremice function. The first application shows the tradeoff between time and increasing number of imputations (\(m\)) for a small dataset; the second application does the same, but for a relatively large dataset. We also discuss futuremice’s arguments.

The function futuremice depends on packages future, furrr and mice. For more information about running functions in futures, see e.g. the future manual or the furrr manual. Function futuremice found its inspiration from Max’s useful suggestions on parallelization of mice’s chains on stackoverflow.


Time gain with small datasets

We demonstrate the potential gain in computing efficiency on simulated data. To this end we sample 1,000 cases from a multivariate normal distribution with mean vector

\[ \mu = \left[\begin{array} {r} 0 \\ 0 \\ 0 \\ 0 \end{array}\right] \]

and covariance matrix

\[ \Sigma = \left[\begin{array} {rrrr} 1&0.5&0.5&0.5 \\ 0.5&1&0.5&0.5 \\ 0.5&0.5&1&0.5 \\ 0.5&0.5&0.5&1 \end{array}\right]. \]

A MCAR missingness mechanism is imposed on the data where 80 percent of the cases (i.e. rows) has missingness on one variable. All variables have missing values. The missingness is randomly generated with the following arguments from function mice::ampute:

set.seed(123)

small_covmat <- diag(4)
small_covmat[small_covmat == 0] <- 0.5
small_data <- MASS::mvrnorm(1000, 
                      mu = c(0, 0, 0, 0),
                      Sigma = small_covmat)

small_data_with_missings <- ampute(small_data, prop = 0.8, mech = "MCAR")$amp
head(small_data_with_missings)
V1 V2 V3 V4
-0.1667048 0.9165856 0.6389869 NA
-0.4548685 0.4313280 NA 0.5753627
-1.2432777 -0.4162831 -1.9552769 NA
-0.1366822 NA -0.5998099 0.7553689
-1.6633582 -0.7137484 1.8412701 0.1269927
NA -1.3018272 -1.4972105 -1.9058145

We compare the default ‘sequential’ function mice with function futuremice. In both functions we use the defaults arguments for the mice algorithm, although these could very easily be changed if desired by the user. To demonstrate the increased efficiency when putting more than one computing core to work, we repeat the procedure with futuremice for 1, 2, 3 and 4 cores. Figure 1 shows a graphical representation of the results.


Figure 1. Processing time for small datasets. Multiple imputations are performed with mice (conventional) and wrapper function futureMICE (1, 2, 3 and 4 cores, respectively). The dataset has 1000 cases and 4 variables with a correlation of 0.5. 80 percent of the cases has one missing value based on MCAR missingness.


It becomes apparent that for a small to moderate number of imputations, the conventional mice function is faster than the wrapper function futuremice. This is the case until the number of imputations \(m = 120\). For higher \(m\), wrapper function futuremice returns the imputations somewhat faster.


Time gain with large datasets

We replicated the above detailed simulation setup with a larger dataset of 10,000 cases and 8 variables. The mean and covariance structure follow the sampling scheme of the smaller data set. We show the results of this simulation in Figure 2.


V1 V2 V3 V4 V5 V6 V7 V8
0.3177437 NA 0.3578290 -0.7861403 0.0857024 0.2905915 -0.1159348 0.3464402
-0.4895928 0.7905551 0.9676060 NA 0.3915238 1.3301799 0.5672698 0.1748194
NA -1.2294188 0.2485337 -0.2706589 -1.5055993 -0.7062091 0.8060020 -0.4176853
-0.1711396 NA -1.2937757 0.0984493 -0.1351536 0.3613034 -0.5861565 -0.6498191
0.4208610 -0.0102911 0.3268812 NA 0.9371669 0.0886542 1.4311793 -0.1800665
1.8674356 1.9724127 0.3847853 0.3058566 0.3201818 NA -1.2755379 1.3359326

Figure 2. Processing time for large datasets. Multiple imputations are performed with mice (conventional) and wrapper function parlMICE (1, 2 and 3 cores respectively). The dataset has 10000 cases and 8 variables with a correlation of 0.5. 80 percent of the cases has one missing value based on MCAR missingness.


When datasets are sufficiently large, function futuremice works faster than mice for all \(m\). In such cases, even for very small numbers of imputations, running mice in parallel with futuremice saves a significant amount of time. This gain in efficiency can increase to more than 50 percent for \(100\) imputations and more.

There is not a large difference between using 2 and 3 cores with wrapper function parlMICE. For all number of imputations, the procedure runs faster with 3 cores, even though the imputations have to be divided over the cores. It might therefore be desirable to use always as many cores as possible, while leaving 1 core out to govern any overhead computing. For example, on a hexacore machine, use only 5 cores to run the mice algorithm in parallel with futuremice.


Default settings

We will now discuss the arguments of function futuremice. Easy imputation of an incomplete dataset (say, nhanes) can be performed with futuremice in the following way.

imp <- futuremice(nhanes)
## Number of cores not specified. Based on your machine a value of n.core = 3 is chosen; the imputations are distributed about equally over the cores.
class(imp)
## [1] "mids"

The function returns a mids object as created by mice. In fact, futuremice makes use of function mice::ibind to combine the mids objects returned by the different cores. Therefore, the call of the mids object has slightly changed.

imp$call
## [[1]]
## mice(data = data, m = x, printFlag = F, seed = seed)
## 
## [[2]]
## ibind(x = imp, y = imps[[i]])
## 
## [[3]]
## ibind(x = imp, y = imps[[i]])

Additionally, futuremice makes use of a parallelseed argument that is stored in imp$parallelseed.

imp$parallelseed
## [1] 198128673

If no seed is specified by the user, a seed will be drawn randomly from a uniform distribution \(U(-999999999,999999999)\), and this seed will be returned, such that the user can reproduce the obtained results even when no seed is specified. See section Argument parallelseed for more information.

All other parts of the mids object are standard.


Using mice arguments

Function futuremice is able to deal with the conventional mice arguments. In order to change the imputation method from its default (predictive mean matching) to, for example, Bayesian linear regression, the method argument can be adjusted. For other possibilities with mice, we refer to the mice manual.

imp <- futuremice(nhanes, method = "norm")
## Number of cores not specified. Based on your machine a value of n.core = 3 is chosen; the imputations are distributed about equally over the cores.
imp$method
##    age    bmi    hyp    chl 
##     "" "norm" "norm" "norm"

In mice, the number of imputations is specified with argument m. In futuremice, the same argument should be used, and futuremice takes care of dividing the imputations equally over the cores. The next section discusses these arguments.


Argument n.core

With n.core, the number of cores (or CPUs) is given, and the number of imputations m is (about) equally distributed over the cores.

As a default, n.core is specified as the number of available, logical cores minus 1. The default number of imputations has been set to m = 5, just as in a regular mice call. Hence, on machines with 4 available, logical cores, \(5\) imputations are divided over 3 cores, leaving 1 core available for any overhead computations. This results in a number of imputations per core of: \(core1, core1, core2, core3, core3\), respectively.

The computer with which this vignette is run, has

parallelly::availableCores(logical = TRUE)
## system 
##      4

available, logical cores. Accordingly, the number of imputations per core equals core1, core1, core2, core3, core3 We can check this by evaluating the \(m\) that is shown in the mids object.

imp$m

Argument parallelseed

In simulation studies, it is often desired to set a seed to make the results reproducible. Similarly to mice, the seed value for futuremice can be defined outside the function. Under the hood, futuremice makes use of the furrr::furrr_options(seed = TRUE) argument, which recognizes that a seed has been specified in the global environment. Hence users can specify the following code to obtain identical results.

library(magrittr)
set.seed(123)
imp1 <- futuremice(nhanes, n.core = 3)
set.seed(123)
imp2 <- futuremice(nhanes, n.core = 3)

imp1 %$% lm(chl ~ bmi) %>% pool %$% pooled
term m estimate ubar b t dfcom df riv lambda fmi
(Intercept) 5 104.166698 3034.488212 661.434261 3828.209325 23 14.25140 0.2615667 0.2073348 0.2992306
bmi 5 3.225462 4.187099 0.844003 5.199902 23 14.71046 0.2418867 0.1947736 0.2857059
imp2 %$% lm(chl ~ bmi) %>% pool %$% pooled
term m estimate ubar b t dfcom df riv lambda fmi
(Intercept) 5 104.166698 3034.488212 661.434261 3828.209325 23 14.25140 0.2615667 0.2073348 0.2992306
bmi 5 3.225462 4.187099 0.844003 5.199902 23 14.71046 0.2418867 0.1947736 0.2857059

A user can also specify a seed within the futuremice call, by specifying the argument parallelseed. This seed is parsed to withr::local_seed(), such that the global environment is not affected by a different seed within the futuremice function. Hence, users can also specify a seed as follows.

imp3 <- futuremice(nhanes, parallelseed = 123, n.core = 3)
imp4 <- futuremice(nhanes, parallelseed = 123, n.core = 3)

imp3 %$% lm(chl ~ bmi) %>% pool %$% pooled
term m estimate ubar b t dfcom df riv lambda fmi
(Intercept) 5 129.350091 3029.97733 798.544710 3988.23098 23 13.08385 0.3162577 0.2402703 0.3347415
bmi 5 2.439893 4.20187 1.086733 5.50595 23 13.20235 0.3103571 0.2368492 0.3310517
imp4 %$% lm(chl ~ bmi) %>% pool %$% pooled
term m estimate ubar b t dfcom df riv lambda fmi
(Intercept) 5 129.350091 3029.97733 798.544710 3988.23098 23 13.08385 0.3162577 0.2402703 0.3347415
bmi 5 2.439893 4.20187 1.086733 5.50595 23 13.20235 0.3103571 0.2368492 0.3310517

This also yields identical results for imp3 and imp4. However, note that this does not result in the exact same imputations as the procedure where the seed is specified in the global environment.

If no seed is specified in the global environment, or in the call itself, the results are still reproducible, because in such circumstances, futuremice randomly draws a seed from a uniform distribution \(U(-999999999,999999999)\). This randomly drawn seed is stored under $parallelseed in the output object, such that reproducible results can be obtained as follows.

imp5 <- futuremice(nhanes, n.core = 3)
parallelseed <- imp5$parallelseed
imp6 <- futuremice(nhanes, parallelseed = parallelseed, n.core = 3)

imp5 %$% lm(chl ~ bmi) %>% pool %$% pooled
term m estimate ubar b t dfcom df riv lambda fmi
(Intercept) 5 119.542243 3207.87525 1597.12735 5124.428072 23 9.073427 0.5974524 0.3740033 0.4777015
bmi 5 2.709899 4.49144 2.41978 7.395175 23 8.613504 0.6465044 0.3926527 0.4972460
imp6 %$% lm(chl ~ bmi) %>% pool %$% pooled
term m estimate ubar b t dfcom df riv lambda fmi
(Intercept) 5 119.542243 3207.87525 1597.12735 5124.428072 23 9.073427 0.5974524 0.3740033 0.4777015
bmi 5 2.709899 4.49144 2.41978 7.395175 23 8.613504 0.6465044 0.3926527 0.4972460

WARNING: Under unique circumstances, users might want to check whether imputations obtained under different streams are identical. This can be done by specifying the regular seed argument in the futuremice call. This seed is parsed to the separate mice calls within all futures, such that the results will be identical over the cores. If users specify the seed argument rather than the parallelseed argument, futuremice will ask if this is intended behavior if the user is in an interactive() session. Otherwise, a warning will be printed.


Systems other than Windows

Function futuremice calls for function future_map with plan("multisession") from the furrr package. Although other options are available, we have chosen for the plan("multisession") because it allows for the use of multiple cores on all computers, including a Windows computer. The user may adjust this by specifying the future.plan argument within futuremice. Other options are for example future.plan = "multicore", which results in plan("multicore") (which is not supported on Windows computers), future.plan = "cluster", resulting in plan("cluster"). For all options regarding plan(), check ?future::plan().


References

Manual R-package future, available at https://cran.r-project.org/web/packages/future/future.pdf

Manual R-package furrr, available at https://cran.r-project.org/web/packages/furrr/furrr.pdf

Manual package MICE, available at https://cran.r-project.org/web/packages/mice/mice.pdf

Schouten, R.M., Lugtig, P.J. and Vink, G. (2016). Multiple amputation using ampute [manual]. Available at https://github.com/RianneSchouten/mice/blob/ampute/vignettes/Vignette_Ampute.pdf

Schouten, R.M. and Vink, G. (2017). parlmice: faster, paraleller, micer. https://www.gerkovink.com/parlMICE/Vignette_parlMICE.html

Van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45 (3), 1-67.


End of Vignette