Imputation Models in MICE
1 Example
Suppose you created a risk prediction model for a cohort of patients. The dataset contained missing values, which you imputed using MICE. Now, you want to implement the risk prediction model in clinical practice to assist decision-making for new patients. Since missing values are expected in the new patient data, how should you proceed?
Clearly, you cannot retrain the imputation model on the new patients. The number of new patients may be too small to support proper training, and a new imputation model would likely produce different missing value estimates. These differences raise concerns about the consistency and comparability of the risk prediction model.
A better approach is to apply the same imputation model trained on the original cohort to fill in missing values for new patients. Previously, doing so required manual workarounds that were inefficient and difficult to implement. This vignette introduces a new functionality that lets you store the imputation model and apply it seamlessly to new patients, ensuring consistency and reproducibility.
2 Installation
The new functionality is currently experimental and available in the "imputation_models" branch of the mice repository. To install this branch from GitHub, use:
remotes::install_github("amices/mice", ref = "imputation_models")
3 MICE Architecture
The MICE algorithm follows a two-level modular architecture. At the first level, mice(), the core function in the mice package (van Buuren and Groothuis-Oudshoorn 2011; van Buuren 2018), orchestrates the imputation process by managing data preprocessing, variable selection, and iterative imputation steps. At the second level, MICE applies elementary imputation methods, such as normal imputation and PMM, to generate missing values based on model-specific assumptions.
Since storing and reusing imputation models requires both managing the overall imputation process and adjusting the underlying imputation methods, modifications were made at both levels.
We begin by exploring the first level of the MICE architecture, where the new tasks and models arguments modify the imputation workflow. Next, we examine the second level, where elementary imputation functions, such as mice.impute.norm() and mice.impute.pmm(), have been extended to support tasks and models.
4 New tasks and models Arguments
The mice() function orchestrates the imputation process by managing the imputation model and generating imputed values. It first validates the input data and user-specified settings. It then initializes the imputation model, iterates elementary imputation methods, and returns a mids object containing the original dataset, model specifications, and imputed values.
The mids object is central to the multiple imputation workflow, supporting pooling, diagnostics, and visualization.
To enhance imputation flexibility, mice() introduces two new arguments, tasks and models, which we detail in the following sections.
4.1 Tasks
The tasks argument in mice() is a character vector specifying the task to perform for each variable during imputation. The available options are:
tasks = "impute"
: Estimates parameters, generates imputations, and stores the original data along with imputations—but not the imputation model. This corresponds to classic MICE behavior and is the default setting.tasks = "train"
: Estimates parameters, generates imputations, and stores the original data, imputations, and imputation model.tasks = "train"
saves the imputation model along with the data, allowing for both inspection and reuse.tasks = "fill"
: Applies a stored imputation model to new data, generating imputations without re-estimating parameters.
All tasks options produce mids objects, which can be used for pooling, diagnostics, and visualization as in standard MICE workflows.
Next, we show practical examples of how to use the tasks argument.
4.1.1 Example: tasks = "impute"
Setting tasks = "impute"
returns a mids
object identical to the classic MICE implementation: it contains imputed values but does not store the imputation model. The following code generates a mids
object using the built-in nhanes
dataset:
library(mice, warn.conflicts = FALSE)
<- mice(nhanes, tasks = "impute", seed = 1, print = FALSE)
imputed class(imputed)
#> [1] "mids"
As in classic MICE, specifying tasks = "impute" applies this setting to all variables by expanding to:
imputed$tasks
#>      age      bmi      hyp      chl
#> "impute" "impute" "impute" "impute"
tasks = "impute"
behaves exactly like the classic mice()
function, returning a mids
object with imputed values only—without storing the imputation model.
4.1.2 Example: tasks = "train"
To also store the imputation model, use tasks = "train" instead of tasks = "impute":
<- mice(nhanes, tasks = "train", seed = 1, print = FALSE)
trained class(trained)
#> [1] "mids"
This time, the mids object contains both the imputed values and the imputation model. To confirm the difference, inspect the store component:
trained$store
#> [1] "train"
The new models component in the mids object stores the imputation model for each variable and each imputation. A mids object with store == "train" can later be used to generate imputations for new data.
Before applying a stored model, let’s highlight two key differences between store == "impute" and store == "train":
- store == "train": The mids object includes the trained imputation model in the models component.
- store == "impute": No model is stored.
Depending on the imputation method, the stored models may differ slightly. For parametric methods, storing the model requires saving formulas, estimated coefficients, and metadata such as factor levels. For non-parametric methods like PMM, storing the model involves saving observed values instead of estimated parameters.
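To see this difference in practice, one could train with a parametric method for one variable and PMM for another, then compare the names of the stored components (a hedged sketch; the exact component names may vary by method and branch version):
# Sketch: compare stored components across methods (make.method() is part of mice)
meth <- make.method(nhanes)
meth["bmi"] <- "norm"  # parametric: formula, coefficients, metadata
meth["chl"] <- "pmm"   # semi-parametric: additionally stores donor values
tr <- mice(nhanes, method = meth, tasks = "train", seed = 1, print = FALSE)
names(tr$models$bmi[[1]])
names(tr$models$chl[[1]])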
4.1.3 Example: tasks = "fill"
Training creates a transferable representation of the imputation model, allowing it to be reused on new data without altering the original model or re-estimating parameters. To apply a stored model, use tasks = "fill" in mice(), along with the models argument to specify the trained model.
The following code demonstrates how to use a trained model to fill missing values in new data:
newdata <- data.frame(age = c(2, 1), bmi = c(NA, NA), chl = c(NA, 190), hyp = c(NA, 1))
filled <- mice(newdata, tasks = "fill", models = trained$models, seed = 1, print = FALSE)
filled$store
#> [1] "fill"
filled$data
#>   age bmi chl hyp
#> 1   2  NA  NA  NA
#> 2   1  NA 190   1
filled$imp
#> $age
#> [1] 1 2 3 4 5
#> <0 rows> (or 0-length row.names)
#>
#> $bmi
#> 1 2 3 4 5
#> 1 35.3 27.2 33.2 26.3 28.7
#> 2 30.1 33.2 33.2 30.1 26.3
#>
#> $chl
#> 1 2 3 4 5
#> 1 204 229 284 284 206
#>
#> $hyp
#> 1 2 3 4 5
#> 1 1 2 1 1 1
complete(filled, 2)
#> age bmi chl hyp
#> 1 2 27.2 229 2
#> 2 1 33.2 190 1
Note the use of both tasks = "fill" (to apply the stored model) and models = trained$models (to provide the trained imputation model).
- filled$store will be set to "fill" if all missing values are imputed using the stored model.
- filled$data holds the new dataset with missing values, while filled$imp stores the imputations.
The complete() function can be used to extract the multiply-imputed datasets. The example displays the imputed values for the second imputation.
4.2 Compact Representation of the Imputation Model
The compact argument controls how the trained model is stored. When compact = TRUE, the imputation model is stored in a minimized form that retains all parameters needed to generate imputed values but excludes the training data, making it suitable for exchange, distribution, and production.
# Train a compact imputation model (no training data stored)
<- mice(nhanes, tasks = "train", compact = TRUE, seed = 1, print = FALSE)
trained_compact $store
trained_compact#> [1] "train_compact"
# No training data is stored in the compact model
$data
trained_compact#> NULL
$imp
trained_compact#> NULL
# But we can still use the compact model to fill missing values in new data
<- mice(newdata, tasks = "fill", models = trained_compact$models, seed = 1, print = FALSE)
filled_compact $store
filled_compact#> [1] "fill"
complete(filled_compact, 2)
#> age bmi chl hyp
#> 1 2 27.2 229 2
#> 2 1 33.2 190 1
Since compact models do not store training data, the data and imp components remain empty.
# Attempting to complete a compact model without training data will return an error
try(complete(trained_compact))
#> Error in complete.mids(trained_compact) :
#> Cannot complete compact training object.
#> Set 'compact = FALSE' to preserve training data and imputations.
Running complete() requires stored imputations to reconstruct datasets, which are unavailable in compact models. As a result, complete() does not work when store == "train_compact".
4.3 Comparison of tasks = "train", "train_compact", and "fill"
| Feature | tasks = "train" | tasks = "train", compact = TRUE | tasks = "fill" |
|---|---|---|---|
| Stores imputations? | ✅ Yes | ❌ No | ✅ Yes (on new data) |
| Stores imputation model? | ✅ Yes | ✅ Yes | ❌ No |
| Stores training data? | ✅ Yes | ❌ No | ❌ No |
| Can generate new imputations? | ✅ Yes | ✅ Yes | ✅ Yes |
| Requires models argument? | ❌ No | ❌ No | ✅ Yes (to specify trained model) |
| complete() works? | ✅ Yes | ❌ No | ✅ Yes |
| Typical use case | Storing full imputation model and training data | Creating a lightweight imputation model for sharing or production | Applying a trained imputation model to new data |
4.3.1 Explanation of Key Differences
tasks = "train"
: Stores everything, including the original dataset, the imputation model, and all imputations.tasks = "train", compact = TRUE
: Stores only the imputation model, making it lightweight but preventing the use ofcomplete()
.tasks = "fill"
: Uses a previously stored model to generate new imputations but does not store the model itself.
4.4 Train and Fill Subsets of Variables
Imputation models can be trained on a subset of variables rather than the entire dataset. This allows missing values in those variables to be filled using the trained model, while other variables are imputed using standard MICE methods.
The following fragment uses make.tasks() to initialize a tasks vector and sets the entries for bmi and hyp to "train", so that the imputation model is trained only for those two variables:
tasks <- make.tasks(nhanes)
tasks[c("bmi", "hyp")] <- "train"
trained_subset <- mice(nhanes, tasks = tasks, seed = 1, print = FALSE)
names(trained_subset$models)
#> [1] "hyp" "bmi"
Now apply the subset model to fill missing values in newdata:
c("bmi", "hyp")] <- c("fill", "fill")
tasks[<- mice(newdata, tasks = tasks, models = trained_subset$models, seed = 1, print = FALSE)
filled_subset #> Warning: Number of logged events: 1
complete(filled_subset)
#> age bmi chl hyp
#> 1 2 NA NA NA
#> 2 1 27.2 190 1
Row 2 was successfully imputed using the trained model for bmi and hyp. However, imputation fails for row 1 because mice() cannot build an imputation model for chl (which was not part of the trained subset).
mice() checks each variable for constant or collinear values before building an imputation model. In this case, chl is constant (only one observed value exists), so mice() fails to construct an imputation model, and the missing values propagate to bmi and hyp. However, variables explicitly set to "fill" (bmi and hyp) do not require a model to be built, so they bypass this check.
There are two ways to address this issue:
- Include chl in the training subset, effectively training models for all variables.
- Append additional records to newdata so that mice() can generate an on-the-fly imputation model for chl. This approach is shown below:
newdata_append <- rbind(newdata, nhanes[11:20, ])
filled_subset <- mice(newdata_append, tasks = tasks, models = trained_subset$models, seed = 1, print = FALSE)
complete(filled_subset, 2)[1:2, ]
#>   age  bmi chl hyp
#> 1   2 26.3 199   1
#> 2   1 22.0 190   1
Here, 10 records from nhanes are appended to newdata. This allows mice() to train an imputation model for chl on the extra data, while still applying the trained model for bmi and hyp.
Final observations:
- When a combination of training and filling is used, the resulting mids object is stored with store = "train".
- Appendix A contains the full specification of mids components for store modes "impute", "train", and "fill".
5 Elementary Imputation Functions
This section explains how mice.impute.norm(), mice.impute.pmm(), and other mice.impute.*() functions have been adapted to support tasks and models, allowing for reusable imputation models and enhanced flexibility in workflows.
mice.impute.*() functions generate imputed values for individual variables (univariate imputation) or groups of variables (multivariate imputation). mice() automatically calls mice.impute.*() functions to handle variable-wise and block-wise imputation.
This section is primarily relevant for developers who wish to extend the mice package with new imputation methods.
5.1 Normal Imputation
The mice.impute.norm() function is responsible for the following tasks:
- Verify whether an imputation model exists before proceeding.
- For task = "fill": Retrieve the stored imputation model, generate imputed values, and return them.
- Estimate imputation model parameters.
- For task = "train": Store the imputation model for later use.
- Generate and return imputed values.
5.1.1 Example: mice.impute.norm()
Let us examine how these actions are implemented in mice.impute.norm() for normal imputation:
mice::mice.impute.norm
#> function (y, ry, x, wy = NULL, task = "impute", model = NULL,
#> ridge = 1e-05, ...)
#> {
#> check.model.exists(model, task)
#> method <- "norm"
#> if (is.null(wy))
#> wy <- !ry
#> x <- cbind(1, as.matrix(x))
#> if (task == "fill" && check.model.match(model, x, method)) {
#> return(x[wy, ] %*% model$beta.dot + rnorm(sum(wy)) *
#> model$sigma.dot)
#> }
#> parm <- .norm.draw(y, ry, x, ridge = ridge, ...)
#> if (task == "train") {
#> model$setup <- list(method = method, n = sum(ry), task = task,
#> ridge = ridge)
#> model$beta.hat <- drop(parm$coef)
#> model$beta.dot <- drop(parm$beta)
#> model$sigma.hat <- parm$sigma.hat
#> model$sigma.dot <- parm$sigma
#> model$xnames <- colnames(x)
#> }
#> return(x[wy, ] %*% parm$beta + rnorm(sum(wy)) * parm$sigma)
#> }
#> <bytecode: 0x10ffdd650>
#> <environment: namespace:mice>
The function argument list introduces two new arguments:
- model: An environment that contains the imputation model.
- task: Specifies the task to perform ("impute", "train", or "fill").
The function body clearly separates different tasks:
- Check model availability: check.model.exists(model, task).
- Fill missing values (task = "fill"): Retrieve the stored model and generate imputations.
- Estimate parameters: .norm.draw(y, ry, x, ridge = ridge, ...) estimates imputation parameters based on the available data.
- Train and store model (task = "train"): Save the trained imputation model.
- Return imputed values: return(x[wy, ] %*% parm$beta + rnorm(sum(wy)) * parm$sigma).
All mice.impute.*() functions follow this pattern.
Model storage and retrieval: The model object is an environment that stores imputation parameters. During training, each iteration updates the imputation model by overwriting previous estimates with new ones. After the final iteration, the model is stored as a list in the models component of the mids object. When mice() imports a trained model, it reconstructs this list back into a nested environment, one for each variable and each imputation.
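The environment-to-list round trip can be mimicked with base R (a minimal sketch of the mechanism, not the package internals):
# Training side: parameters accumulate in an environment, updated each iteration
model <- new.env()
model$beta.dot <- c(1.2, 0.4)
model$sigma.dot <- 0.9
# Export: convert the environment to a list for storage in the models component
stored <- as.list(model)
# Fill side: reconstruct an environment from the stored list
model2 <- list2env(stored, envir = new.env())
model2$beta.dot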
5.2 Predictive Mean Matching (PMM)
The mice.impute.pmm() function performs the same core tasks as mice.impute.norm(). However, PMM is a semi-parametric imputation method (Little 1988) that requires additional steps to ensure imputations remain consistent with the standard algorithm while enabling imputations without access to the original training dataset.
To store the imputation model, we use percentile binning, which divides the linear predictor into equally sized bins and assigns donor values to each bin.
5.2.1 Steps in mice.impute.pmm()
- Estimate the imputation model parameters.
- Calculate the linear predictor for all cases.
- For each case with a missing outcome:
  - Identify the \(k\) closest observed cases based on the linear predictor.
  - Randomly select one of these \(k\) donor values as the imputed value.
While parameter estimation follows a similar approach to mice.impute.norm(), PMM requires an efficient way to store donor values. Instead of storing the full training dataset, we create equally sized bins based on the linear predictor. At each bin threshold, we store the \(k\) closest donor values.
When imputing a new case, the function:
- Identifies the bin it belongs to by locating its left and right bin thresholds.
- Draws a donor value from the \(k\) closest observed values within the left or right bin.
- Weighs donor selection based on the case’s relative distance to the bin thresholds.
This approach is memory-efficient, handles gaps in the linear predictor, and allows for fast and accurate imputations based on the choice of bins (\(t\)) and donors (\(k\)).
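The following sketch illustrates the idea in plain R (the names yhat_obs, y_obs, nbins, and ndonors are hypothetical; it mirrors the description above, omitting the distance-based weighting between neighboring bins for brevity, and is not the package source):
set.seed(1)
yhat_obs <- rnorm(100)          # linear predictor of observed cases
y_obs <- yhat_obs + rnorm(100)  # observed outcome values
nbins <- 10                     # number of bins (t)
ndonors <- 5                    # donors per threshold (k)
# Thresholds at equally spaced percentiles of the linear predictor
edges <- quantile(yhat_obs, probs = seq(0, 1, length.out = nbins + 1))
# At each threshold, keep the ndonors observed y values closest on yhat
lookup <- t(sapply(edges[-1], function(e) y_obs[order(abs(yhat_obs - e))[1:ndonors]]))
# Impute one new case: locate its bin, then draw a donor from that row
yhat_new <- 0.3
bin <- min(max(findInterval(yhat_new, edges, left.open = TRUE), 1), nbins)
imp <- sample(lookup[bin, ], 1)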
5.2.2 Model Storage for PMM
The mice.impute.pmm() function stores the following components in the model object:
- edges: A vector of length \(t + 1\) defining the bin thresholds.
- lookup: A matrix of size \(t \times k\) containing donor values for each bin.
Appendix B describes four internal helper functions:
- initialize.nbins(): Sets the number of bins \(t\).
- initialize.donors(): Sets the number of donors \(k\).
- bin.yhat(): Divides the linear predictor into bins using edges.
- draw.neighbors.pmm(): Selects imputed values from the lookup table.
During training, all four functions are used. However, filling only needs draw.neighbors.pmm().
5.2.3 Experimental Aspects of \(t\) and \(k\)
The methods for determining \(t\) and \(k\) in initialize.nbins() and initialize.donors() are preliminary and based on limited simulations.
The default values for \(t\) (typically 15–30 bins) and \(k\) (typically 5–15 donors) were selected based on a small simulation study that varied only sample size. The study assumed an imputation model with approximately 10% explained variance. However, because optimal values for \(t\) and \(k\) likely depend on prediction error, these defaults may not be suitable for other models. Although some work is available in adaptive tuning (Schenker and Taylor 1996), optimizing \(t\) and \(k\) for different levels of prediction error remains an open research question.
5.3 Structure of the models Component
When tasks = "train"
, the mice()
function stores the imputation model in the models
component of the mids
object. The models
component is a list containing the setup and estimates of the imputation model for each variable and each repeated imputation. Since there are m
imputations and ncol(data)
variables, there can be up to m * ncol(data)
imputation models in total.
The models component is organized by variable and imputation number, allowing each variable’s imputation process to be stored separately across imputations. This structure ensures that stored models can be reused for generating imputations on new data.
The following diagram visualizes the hierarchical structure of the models component:
graph TD;
  A[mids] --> B[models]
  B --> C[bmi]
  B --> D[hyp]
  B --> E[chl]
  C --> F[Imputation 1]
  C --> G[Imputation 2]
  D --> H[Imputation 1]
  D --> I[Imputation 2]
  E --> J[Imputation 1]
  E --> K[Imputation 2]
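Using the trained object from Section 4.1.2, this hierarchy can be inspected directly (a sketch; the second call assumes the default m = 5):
names(trained$models)       # one entry per trained variable
length(trained$models$bmi)  # one stored model per imputation, i.e. m models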
The names of the parts of the first imputation model for bmi are:
names(trained$models$bmi[[1]])
#> [1] "lookup" "edges" "beta.dot" "xnames" "beta.hat" "factor" "formula"
#> [8] "setup"
In general, the names and types of stored objects depend on the imputation method. The setup object for the first imputation model for bmi can be accessed as follows:
unlist(trained$models$bmi[[1]]$setup)
#>    method         n    donors matchtype  quantify      trim  task.bmi     nbins
#>     "pmm"      "16"       "5"       "1"    "TRUE"       "1"   "train"      "13"
#>     ridge
#>   "1e-05"
It contains settings specific to the imputation model. We can retrieve the formula used in the model as:
trained$models$bmi[[1]]$formula
#> [1] "bmi ~ age + hyp + chl"
which shows that bmi is imputed using a linear combination of age, hyp, and chl. We obtain the least squares estimates for the regression weights by:
trained$models$bmi[[1]]$beta.hat
#>                     age         hyp         chl
#> 19.45854723 -5.55551631  2.02220838  0.07131134
However, the actual imputation model does not use these weights directly; instead, it draws randomly from a distribution that accounts for parameter uncertainty (Rubin 1987). The drawn values can be accessed as:
trained$models$bmi[[1]]$beta.dot
#>                   age        hyp        chl
#> 8.3452747 -5.1967134  4.4862182  0.1070697
For small samples or highly collinear predictors, beta.hat and beta.dot can differ substantially, a phenomenon sometimes called ‘bouncing betas’, where parameter estimates fluctuate across imputations. To ensure “proper imputation”, MICE uses beta draws (beta.dot) rather than fixed estimates (beta.hat).
5.4 Additional Components for PMM
For PMM, two additional objects are stored in models:
- edges: A vector of length \(t + 1\) containing the bin thresholds \(\theta_j\).
- lookup: A \(t \times k\) matrix storing the donor values for each bin.
The bin thresholds in edges are located at:
trained$models$bmi[[1]]$edges
#>        0% 7.692308% 15.38462% 23.07692% 30.76923% 38.46154% 46.15385% 53.84615%
#>  19.50434  23.64460  23.79063  24.14798  24.78715  25.22078  26.48167  26.95094
#> 61.53846% 69.23077% 76.92308% 84.61538% 92.30769%      100%
#>  27.08883  27.18614  28.37493  30.79091  31.47111  32.25554
The lookup object is a matrix, where each row represents a bin threshold \(\theta_j\) that segments the linear predictor. Observations with \(\hat y\) values falling outside the bin range (\(\hat y \leq \theta_1\) or \(\hat y > \theta_t\)) are assigned to the first or last bin, respectively. The lookup table for the first imputation model for bmi can be accessed as:
trained$models$bmi[[1]]$lookup
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,] 21.7 27.4 27.4 27.4 27.4
#> [2,] 22.7 22.7 22.7 22.7 22.7
#> [3,] 20.4 20.4 20.4 20.4 20.4
#> [4,] 22.5 22.5 22.5 22.5 22.5
#> [5,] 24.9 24.9 24.9 24.9 24.9
#> [6,] 27.5 27.5 27.5 27.5 27.5
#> [7,] 28.7 26.3 26.3 26.3 26.3
#> [8,] 35.3 22.7 27.5 22.0 22.5
#> [9,] 27.2 25.5 25.5 25.5 25.5
#> [10,] 22.0 22.0 22.0 22.0 22.0
#> [11,] 30.1 30.1 30.1 30.1 30.1
#> [12,] 22.7 28.7 22.7 22.5 26.3
#> [13,] 33.2 35.3 35.3 35.3 29.6
The lookup object is used in draw.neighbors.pmm() to select donor values for imputations.
6 Methodological Considerations
6.1 Number of Imputations m Used in "train" and "fill"
Since we did not specify \(m_\text{train}\) (the number of imputations for training) and \(m_\text{fill}\) (the number of imputations for filling), both steps default to m = 5. Although the user can set both independently, we recommend \(m_\text{train} = m_\text{fill}\) to ensure consistency.
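For example, to keep \(m_\text{train} = m_\text{fill}\), set the m argument explicitly in both calls (a sketch reusing the objects from Section 4):
trained10 <- mice(nhanes, tasks = "train", m = 10, seed = 1, print = FALSE)
filled10 <- mice(newdata, tasks = "fill", models = trained10$models, m = 10, seed = 1, print = FALSE)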
If \(m_\text{train} < m_\text{fill}\), the mice() function will automatically recycle trained models to generate additional imputations, without warning the user. When recycling, the imputations may exhibit too little variability, particularly in small samples, potentially leading to underestimated downstream variability.
If \(m_\text{train} > m_\text{fill}\), the mice() function will discard the extra imputations. In general, increasing \(m_\text{fill}\) reduces Monte Carlo error and improves the stability of the between-imputation variance estimate.
Despite its statistical drawbacks, many users prefer \(m_\text{fill} = 1\) for its simplicity, as working with a single dataset is often more convenient. However, as Dempster and Rubin (1983, 8) caution:
Imputation is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all.
A single completed dataset may be a convenient fiction, useful for various purposes such as obtaining population estimates, developing an imputation model, or estimating the most likely version of the hidden data values. However, a single imputed dataset fails to account for uncertainty inherent to the missing data, leading to biased estimates and overconfident inferences in downstream decision making.
6.2 Full Fill vs. Partial Fill
The simplest workflow, known as full fill, trains the imputation model on all columns in the dataset. A full fill is useful when data is split by rows (horizontal partitioning) or when standardized imputations are needed across datasets with the same structure. In such cases, a model can be trained on one dataset partition and then applied to new datasets containing different records with the same type of variables. A variation on this approach is to train the model on a subset of rows to generate imputations for the remaining rows to save computation time. Another use case occurs when some variables included in the trained model are missing in a new dataset, and the goal is to impute only those missing variables. Since the model does not need to be retrained, imputations for new data can be generated almost instantly.
However, when data is split into different subsets of columns (vertical partitioning), the trained model may not cover all variables. In these cases, partial fill can be used, where supported variables are imputed along with additional imputations for variables absent from the original model.
A common example occurs in longitudinal studies, where data is collected at multiple time points. A model trained on data from an earlier time point can be used to impute missing values before extending the model to include variables collected at later time points—without requiring retraining on the full dataset. Another example is the integration of data from different sources without directly merging them. Suppose source A contains variables \(X\) and \(Y\), while source B contains \(X\) and \(Z\). Instead of combining these datasets into a single table, an imputation model can be sequentially extended under the assumption of conditional independence \(Y \perp Z \mid X\).
The process proceeds as follows:
- Train a model on source B to impute \(Z\) from \(X\).
- Use this model to impute \(Z\) in source A.
- Extend the model on the completed source A so that it also covers \(Y\).
- Store the final model for \(X\), \(Y\), and \(Z\) for future use.
This approach removes the need to merge sources A and B into a single dataset. Analysts can share trained models instead of exchanging raw data, thereby improving efficiency, adaptability, and data privacy while still utilizing all available information.
A key assumption in this approach is conditional independence \(Y \perp Z \mid X\), meaning that once we account for \(X\), there is no remaining association between \(Y\) and \(Z\). In practice, this assumption may not always hold. If an additional data source C containing both \(Y\) and \(Z\) is available, models trained on source C can be incorporated to capture the direct relationship between \(Y\) and \(Z\), improving the accuracy of imputations.
6.3 Store
The store component of a mids object is automatically determined based on the tasks argument. It indicates which components are saved in the mids object.
If tasks is not specified (or set to "impute"), the default behavior is:
imputed <- mice(nhanes, seed = 1, print = FALSE)
imputed$store
#> [1] "impute"
imputed$tasks
#>      age      bmi      hyp      chl
#> "impute" "impute" "impute" "impute"
imputed$models
#> NULL
Since tasks = "impute"
does not train models, the models component remains NULL. The possible values of store are:
- impute: When all tasks are "impute", the mids object mimics the classic mids structure and does not store models.
- train: If one or more tasks include "train", the mids object stores the models component for the subset of trained variables.
- fill: If all tasks are "fill", the mids object does not store models. The setting store = "fill" assumes a fully trained model for all variables.
Note: If additional variables exist in the new data, mice() will silently impute them. This may produce unintended imputations if the new variables were not part of the training set. The resulting mids object will have store = "train".
Additionally, there is a special fourth store value:
- train_compact: This value is assigned when training (tasks = "train") with compact = TRUE. The train_compact mode stores a minimized version of the imputation model without training data, making it suitable for production and sharing.
7 Conclusion
This vignette introduced new functionality in mice() that enables storing and reusing imputation models, enhancing reproducibility, efficiency, and interoperability. By utilizing the tasks and models arguments, users can now train imputation models, apply them to new data, and share models across studies. The flexibility of full fill and partial fill workflows further supports diverse data structures, including longitudinal data and multi-source datasets. These enhancements streamline missing data workflows, reduce redundancy, and improve data privacy by allowing models to be shared instead of raw datasets. Future developments will refine model selection and optimization for various imputation scenarios.
References
APPENDIX A: Components of the mids object
The following table summarizes the components saved by mice() for different store values. The store value is set to "impute", "train", or "fill" based on the tasks argument. mice() returns store == "impute" by default, and store == "fill" if all variables are filled. In all other cases, store == "train".
The table lists the components of the mids object for each store value.
| Name | Description | I/O | Data Type | Impute | Train | Fill |
|---|---|---|---|---|---|---|
| predictorMatrix | Specifies predictor set | I | matrix | YES | YES | NO |
| formulas | Formulae for imputation models | I | list | YES | YES | NO |
| modeltype | Form of imputation model | I | character | YES | YES | NO |
| post | Commands for post-processing | I | character vector | YES | YES | NO |
| ignore | Logical vector for ignored rows | I | logical vector | YES | YES | NO |
| seed | Seed value for reproducibility | I | integer | YES | YES | NO |
| nmis | Count of missing values per variable | O | numeric vector | YES | YES | NO |
| chainMean | Mean of imputed values | O | array | YES | YES | NO |
| chainVar | Variance of imputed values | O | array | YES | YES | NO |
| loggedEvents | Warnings and corrective tasks | O | data.frame | YES | YES | NO |
| blocks | Blocks of variables for imputation | I | list | YES | YES | NO |
| method | Imputation method per block | I | character vector | YES | YES | NO |
| blots | Extra arguments per block | I | list | YES | YES | NO |
| visitSequence | Order of block visits | I | character vector | YES | YES | NO |
| iteration | Last iteration number | O | integer | YES | YES | NO |
| lastSeedValue | Random number generator state | O | integer | YES | YES | NO |
| tasks | Specifies the imputation tasks | I | character vector | YES | YES | NO |
| models | Stores imputation model estimates | O | list | NO | YES | NO |
| data | Data to be imputed | I | data.frame | YES | YES | YES |
| where | Specifies where imputations occur | I | matrix | YES | YES | YES |
| imp | List of imputations per variable | O | list | YES | YES | YES |
| m | Number of imputations | I | integer | YES | YES | YES |
| store | Storage set | O | character | YES | YES | YES |
| version | Version number of mice package | O | character | YES | YES | YES |
| date | Date when the object was created | O | character | YES | YES | YES |
| call | Call that created the object | O | call | YES | YES | YES |
The mids object store can be minimized from "train" to "train_compact" by setting compact = TRUE. The minimal representation "train_compact" saves the following components: blocks, method, blots, visitSequence, iteration, lastSeedValue, tasks, models, m, store, version, date, and call. This feature is particularly useful if training data cannot be shared, or for production applications where memory efficiency is critical. The minimized model retains only the essential information required for generating imputed values, reducing memory usage while maintaining the core functionality of the imputation model. Note that some downstream functions, like complete() or with(), cannot support store = "train_compact".
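The size reduction can be checked with utils::object.size() (a sketch using the objects from Section 4.2; exact sizes depend on the data and m):
object.size(trained)          # full object: data, imputations, and models
object.size(trained_compact)  # minimized object: models and bookkeeping only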
APPENDIX B: Computational details of binning in PMM
Initialization of the Number of Bins (initialize.nbins())
The function initialize.nbins() determines an appropriate number of bins (nbins) to be used in PMM based on the sample size (n) and the number of unique values (nu) in the predicted or observed data.
The default computation of nbins uses the relation nbins = round(4 * log(n) + 1.5). This formula suggests that the number of bins grows logarithmically with the sample size (n), ensuring a reasonable bin width without excessive granularity. The coefficients (4 and 1.5) are chosen to scale the number of bins appropriately across different n. Since nbins represents a discretization of the data, it cannot exceed the number of unique values (nu) in the predicted or observed variable. If nbins > nu, the function sets nbins = nu to ensure that each unique value has its own bin. This adjustment is necessary to prevent empty bins and ensure that each unique value is represented in the binning process. The function enforces a lower bound of 2 bins to prevent degenerate cases where binning would be ineffective.
For large samples of highly correlated continuous data, the user can set nbins = 100 or higher to improve precision.
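The default rule can be written as a small helper (a reimplementation sketch of the formula stated above, not the package source):
default_nbins <- function(n, nu) {
  nbins <- round(4 * log(n) + 1.5)  # logarithmic growth with sample size
  nbins <- min(nbins, nu)           # at most one bin per unique value
  max(nbins, 2L)                    # lower bound of 2 to avoid degenerate binning
}
default_nbins(n = 16, nu = 16)  # 13, matching the nbins shown in the setup of Section 5.3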
Initialization of the Number of Donors (initialize.donors())
The initialize.donors() function determines the number of donors (donors) for imputation based on the sample size (n). If donors is NULL, it is computed using round(n / 600 + 7), ensuring a gradual increase as n grows. The computed value is then constrained using max(1L, min(donors, n)), which ensures at least one donor while preventing the donor count from exceeding n.
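Expressed as a helper (again a sketch of the stated rule):
default_donors <- function(n, donors = NULL) {
  if (is.null(donors)) donors <- round(n / 600 + 7)  # gradual increase with n
  max(1L, min(donors, n))                            # constrain to [1, n]
}
default_donors(100)   # 7
default_donors(6000)  # 17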
Setting donors = 1 (not recommended) selects the closest donor. Thus, different cases that fall in the same bin will obtain the same imputed value from the left or right edge. Setting donors = n (not recommended) effectively samples from the marginal distribution and weakens the relations between the data. The literature suggests values between 5 and 10 donors (Morris, White, and Royston 2014). Because of binning, the optimal number of donors may actually need to be higher than 5 or 10 in order to reduce repetition of donors from the same bin.
Binning of the Linear Predictor (bin.yhat())
The bin.yhat() function divides the linear predictor into bins using the thresholds edges. It sorts the linear predictor yhat and determines the bin index for each value based on the thresholds, using findInterval() to assign each value to the corresponding bin. The bin index is calculated as bin = findInterval(yhat, edges, left.open = TRUE). The left.open = TRUE argument ensures that values equal to a threshold are assigned to the left bin, consistent with the binning process. The function returns the bin index for each value.
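A quick base R illustration of the left.open = TRUE behavior:
edges <- c(0, 1, 2, 3)
findInterval(c(0.5, 1, 2.5), edges, left.open = TRUE)
#> [1] 1 1 3
Note that the value exactly on a threshold (here 1) falls in the left bin.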
Drawing Imputations (draw.neighbors.pmm())
The draw.neighbors.pmm() function selects imputed values using predefined bins. Given predicted values (yhat), bin edges (edges), and a lookup table (lookup), the function assigns each yhat value to a bin using findInterval(). If yhat is lower than the first bin or higher than the last bin, it selects the first or last bin. To ensure smooth transitions between bins, the function calculates a probability weight based on the distance between yhat and the bin edges. Using a Bernoulli draw, it probabilistically selects the left or right bin. Once the bin is selected, the function samples mlocal observed y values from the corresponding row in the lookup table. The result is an n × mlocal matrix containing imputed values.