futuremiceFor big datasets or high number of imputations, performing multiple imputation with function mice from package mice (Van Buuren & Groothuis-Oudshoorn, 2011) might take a long time. As a solution, wrapper function futuremice was created to enable the imputation procedure to be run in parallel. This is done by dividing the imputations over multiple cores (or CPUs), thus potentially speeding up the process. The function futuremice is a sequel to parlMICE (Schouten & Vink, 2017), developed to improve user-friendliness.
This vignette demonstrates two applications of the futuremice function. The first application shows the tradeoff between time and increasing number of imputations (\(m\)) for a small dataset; the second application does the same, but for a relatively large dataset. We also discuss futuremice’s arguments.
The function futuremice depends on packages future, furrr and mice. For more information about running functions in futures, see e.g. the future manual or the furrr manual. Function futuremice found its inspiration from Max’s useful suggestions on parallelization of mice’s chains on stackoverflow.
We demonstrate the potential gain in computing efficiency on simulated data. To this end we sample 1,000 cases from a multivariate normal distribution with mean vector
\[ \mu = \left[\begin{array} {r} 0 \\ 0 \\ 0 \\ 0 \end{array}\right] \]
and covariance matrix
\[ \Sigma = \left[\begin{array} {rrrr} 1&0.5&0.5&0.5 \\ 0.5&1&0.5&0.5 \\ 0.5&0.5&1&0.5 \\ 0.5&0.5&0.5&1 \end{array}\right]. \]
A MCAR missingness mechanism is imposed on the data where 80 percent of the cases (i.e. rows) has missingness on one variable. All variables have missing values. The missingness is randomly generated with the following arguments from function mice::ampute:
set.seed(123)
small_covmat <- diag(4)
small_covmat[small_covmat == 0] <- 0.5
small_data <- MASS::mvrnorm(1000,
mu = c(0, 0, 0, 0),
Sigma = small_covmat)
small_data_with_missings <- ampute(small_data, prop = 0.8, mech = "MCAR")$amp
head(small_data_with_missings)
| V1 | V2 | V3 | V4 |
|---|---|---|---|
| -0.1667048 | 0.9165856 | 0.6389869 | NA |
| -0.4548685 | 0.4313280 | NA | 0.5753627 |
| -1.2432777 | -0.4162831 | -1.9552769 | NA |
| -0.1366822 | NA | -0.5998099 | 0.7553689 |
| -1.6633582 | -0.7137484 | 1.8412701 | 0.1269927 |
| NA | -1.3018272 | -1.4972105 | -1.9058145 |
We compare the default ‘sequential’ function mice with function futuremice. In both functions we use the defaults arguments for the mice algorithm, although these could very easily be changed if desired by the user. To demonstrate the increased efficiency when putting more than one computing core to work, we repeat the procedure with futuremice for 1, 2, 3 and 4 cores. Figure 1 shows a graphical representation of the results.
Figure 1. Processing time for small datasets. Multiple imputations are performed with mice (conventional) and wrapper function futureMICE (1, 2, 3 and 4 cores, respectively). The dataset has 1000 cases and 4 variables with a correlation of 0.5. 80 percent of the cases has one missing value based on MCAR missingness.
It becomes apparent that for a small to moderate number of imputations, the conventional mice function is faster than the wrapper function futuremice. This is the case until the number of imputations \(m = 120\). For higher \(m\), wrapper function futuremice returns the imputations somewhat faster.
We replicated the above detailed simulation setup with a larger dataset of 10,000 cases and 8 variables. The mean and covariance structure follow the sampling scheme of the smaller data set. We show the results of this simulation in Figure 2.
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 |
|---|---|---|---|---|---|---|---|
| 0.3177437 | NA | 0.3578290 | -0.7861403 | 0.0857024 | 0.2905915 | -0.1159348 | 0.3464402 |
| -0.4895928 | 0.7905551 | 0.9676060 | NA | 0.3915238 | 1.3301799 | 0.5672698 | 0.1748194 |
| NA | -1.2294188 | 0.2485337 | -0.2706589 | -1.5055993 | -0.7062091 | 0.8060020 | -0.4176853 |
| -0.1711396 | NA | -1.2937757 | 0.0984493 | -0.1351536 | 0.3613034 | -0.5861565 | -0.6498191 |
| 0.4208610 | -0.0102911 | 0.3268812 | NA | 0.9371669 | 0.0886542 | 1.4311793 | -0.1800665 |
| 1.8674356 | 1.9724127 | 0.3847853 | 0.3058566 | 0.3201818 | NA | -1.2755379 | 1.3359326 |
Figure 2. Processing time for large datasets. Multiple imputations are performed with mice (conventional) and wrapper function parlMICE (1, 2 and 3 cores respectively). The dataset has 10000 cases and 8 variables with a correlation of 0.5. 80 percent of the cases has one missing value based on MCAR missingness.
When datasets are sufficiently large, function futuremice works faster than mice for all \(m\). In such cases, even for very small numbers of imputations, running mice in parallel with futuremice saves a significant amount of time. This gain in efficiency can increase to more than 50 percent for \(100\) imputations and more.
There is not a large difference between using 2 and 3 cores with wrapper function parlMICE. For all number of imputations, the procedure runs faster with 3 cores, even though the imputations have to be divided over the cores. It might therefore be desirable to use always as many cores as possible, while leaving 1 core out to govern any overhead computing. For example, on a hexacore machine, use only 5 cores to run the mice algorithm in parallel with futuremice.
We will now discuss the arguments of function futuremice. Easy imputation of an incomplete dataset (say, nhanes) can be performed with futuremice in the following way.
imp <- futuremice(nhanes)
## Number of cores not specified. Based on your machine a value of n.core = 3 is chosen; the imputations are distributed about equally over the cores.
class(imp)
## [1] "mids"
The function returns a mids object as created by mice. In fact, futuremice makes use of function mice::ibind to combine the mids objects returned by the different cores. Therefore, the call of the mids object has slightly changed.
imp$call
## [[1]]
## mice(data = data, m = x, printFlag = F, seed = seed)
##
## [[2]]
## ibind(x = imp, y = imps[[i]])
##
## [[3]]
## ibind(x = imp, y = imps[[i]])
Additionally, futuremice makes use of a parallelseed argument that is stored in imp$parallelseed.
imp$parallelseed
## [1] 198128673
If no seed is specified by the user, a seed will be drawn randomly from a uniform distribution \(U(-999999999,999999999)\), and this seed will be returned, such that the user can reproduce the obtained results even when no seed is specified. See section Argument parallelseed for more information.
All other parts of the mids object are standard.
mice argumentsFunction futuremice is able to deal with the conventional mice arguments. In order to change the imputation method from its default (predictive mean matching) to, for example, Bayesian linear regression, the method argument can be adjusted. For other possibilities with mice, we refer to the mice manual.
imp <- futuremice(nhanes, method = "norm")
## Number of cores not specified. Based on your machine a value of n.core = 3 is chosen; the imputations are distributed about equally over the cores.
imp$method
## age bmi hyp chl
## "" "norm" "norm" "norm"
In mice, the number of imputations is specified with argument m. In futuremice, the same argument should be used, and futuremice takes care of dividing the imputations equally over the cores. The next section discusses these arguments.
n.coreWith n.core, the number of cores (or CPUs) is given, and the number of imputations m is (about) equally distributed over the cores.
As a default, n.core is specified as the number of available, logical cores minus 1. The default number of imputations has been set to m = 5, just as in a regular mice call. Hence, on machines with 4 available, logical cores, \(5\) imputations are divided over 3 cores, leaving 1 core available for any overhead computations. This results in a number of imputations per core of: \(core1, core1, core2, core3, core3\), respectively.
The computer with which this vignette is run, has
parallelly::availableCores(logical = TRUE)
## system
## 4
available, logical cores. Accordingly, the number of imputations per core equals core1, core1, core2, core3, core3 We can check this by evaluating the \(m\) that is shown in the mids object.
imp$m
parallelseedIn simulation studies, it is often desired to set a seed to make the results reproducible. Similarly to mice, the seed value for futuremice can be defined outside the function. Under the hood, futuremice makes use of the furrr::furrr_options(seed = TRUE) argument, which recognizes that a seed has been specified in the global environment. Hence users can specify the following code to obtain identical results.
library(magrittr)
set.seed(123)
imp1 <- futuremice(nhanes, n.core = 3)
set.seed(123)
imp2 <- futuremice(nhanes, n.core = 3)
imp1 %$% lm(chl ~ bmi) %>% pool %$% pooled
| term | m | estimate | ubar | b | t | dfcom | df | riv | lambda | fmi |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 5 | 104.166698 | 3034.488212 | 661.434261 | 3828.209325 | 23 | 14.25140 | 0.2615667 | 0.2073348 | 0.2992306 |
| bmi | 5 | 3.225462 | 4.187099 | 0.844003 | 5.199902 | 23 | 14.71046 | 0.2418867 | 0.1947736 | 0.2857059 |
imp2 %$% lm(chl ~ bmi) %>% pool %$% pooled
| term | m | estimate | ubar | b | t | dfcom | df | riv | lambda | fmi |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 5 | 104.166698 | 3034.488212 | 661.434261 | 3828.209325 | 23 | 14.25140 | 0.2615667 | 0.2073348 | 0.2992306 |
| bmi | 5 | 3.225462 | 4.187099 | 0.844003 | 5.199902 | 23 | 14.71046 | 0.2418867 | 0.1947736 | 0.2857059 |
A user can also specify a seed within the futuremice call, by specifying the argument parallelseed. This seed is parsed to withr::local_seed(), such that the global environment is not affected by a different seed within the futuremice function. Hence, users can also specify a seed as follows.
imp3 <- futuremice(nhanes, parallelseed = 123, n.core = 3)
imp4 <- futuremice(nhanes, parallelseed = 123, n.core = 3)
imp3 %$% lm(chl ~ bmi) %>% pool %$% pooled
| term | m | estimate | ubar | b | t | dfcom | df | riv | lambda | fmi |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 5 | 129.350091 | 3029.97733 | 798.544710 | 3988.23098 | 23 | 13.08385 | 0.3162577 | 0.2402703 | 0.3347415 |
| bmi | 5 | 2.439893 | 4.20187 | 1.086733 | 5.50595 | 23 | 13.20235 | 0.3103571 | 0.2368492 | 0.3310517 |
imp4 %$% lm(chl ~ bmi) %>% pool %$% pooled
| term | m | estimate | ubar | b | t | dfcom | df | riv | lambda | fmi |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 5 | 129.350091 | 3029.97733 | 798.544710 | 3988.23098 | 23 | 13.08385 | 0.3162577 | 0.2402703 | 0.3347415 |
| bmi | 5 | 2.439893 | 4.20187 | 1.086733 | 5.50595 | 23 | 13.20235 | 0.3103571 | 0.2368492 | 0.3310517 |
This also yields identical results for imp3 and imp4. However, note that this does not result in the exact same imputations as the procedure where the seed is specified in the global environment.
If no seed is specified in the global environment, or in the call itself, the results are still reproducible, because in such circumstances, futuremice randomly draws a seed from a uniform distribution \(U(-999999999,999999999)\). This randomly drawn seed is stored under $parallelseed in the output object, such that reproducible results can be obtained as follows.
imp5 <- futuremice(nhanes, n.core = 3)
parallelseed <- imp5$parallelseed
imp6 <- futuremice(nhanes, parallelseed = parallelseed, n.core = 3)
imp5 %$% lm(chl ~ bmi) %>% pool %$% pooled
| term | m | estimate | ubar | b | t | dfcom | df | riv | lambda | fmi |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 5 | 119.542243 | 3207.87525 | 1597.12735 | 5124.428072 | 23 | 9.073427 | 0.5974524 | 0.3740033 | 0.4777015 |
| bmi | 5 | 2.709899 | 4.49144 | 2.41978 | 7.395175 | 23 | 8.613504 | 0.6465044 | 0.3926527 | 0.4972460 |
imp6 %$% lm(chl ~ bmi) %>% pool %$% pooled
| term | m | estimate | ubar | b | t | dfcom | df | riv | lambda | fmi |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 5 | 119.542243 | 3207.87525 | 1597.12735 | 5124.428072 | 23 | 9.073427 | 0.5974524 | 0.3740033 | 0.4777015 |
| bmi | 5 | 2.709899 | 4.49144 | 2.41978 | 7.395175 | 23 | 8.613504 | 0.6465044 | 0.3926527 | 0.4972460 |
WARNING: Under unique circumstances, users might want to check whether imputations obtained under different streams are identical. This can be done by specifying the regular seed argument in the futuremice call. This seed is parsed to the separate mice calls within all futures, such that the results will be identical over the cores. If users specify the seed argument rather than the parallelseed argument, futuremice will ask if this is intended behavior if the user is in an interactive() session. Otherwise, a warning will be printed.
Function futuremice calls for function future_map with plan("multisession") from the furrr package. Although other options are available, we have chosen for the plan("multisession") because it allows for the use of multiple cores on all computers, including a Windows computer. The user may adjust this by specifying the future.plan argument within futuremice. Other options are for example future.plan = "multicore", which results in plan("multicore") (which is not supported on Windows computers), future.plan = "cluster", resulting in plan("cluster"). For all options regarding plan(), check ?future::plan().
Manual R-package future, available at https://cran.r-project.org/web/packages/future/future.pdf
Manual R-package furrr, available at https://cran.r-project.org/web/packages/furrr/furrr.pdf
Manual package MICE, available at https://cran.r-project.org/web/packages/mice/mice.pdf
Schouten, R.M., Lugtig, P.J. and Vink, G. (2016). Multiple amputation using ampute [manual]. Available at https://github.com/RianneSchouten/mice/blob/ampute/vignettes/Vignette_Ampute.pdf
Schouten, R.M. and Vink, G. (2017). parlmice: faster, paraleller, micer. https://www.gerkovink.com/parlMICE/Vignette_parlMICE.html
Van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45 (3), 1-67.
End of Vignette