Imputation Models in MICE

Author

Stef van Buuren

Published

March 21, 2025

1 Example

Suppose you created a risk prediction model for a cohort of patients. The dataset contained missing values, which you imputed using MICE. Now, you want to implement the risk prediction model in clinical practice to assist decision-making for new patients. Since missing values are expected in the new patient data, how should you proceed?

Clearly, you cannot retrain the imputation model on the new patients. The number of new patients may be too small to support proper training, and a new imputation model would likely produce different missing value estimates. These differences raise concerns about the consistency and comparability of the risk prediction model.

A better approach is to apply the same imputation model trained on the original cohort to fill in missing values for new patients. Previously, doing so required manual workarounds that were inefficient and difficult to implement. This vignette introduces a new functionality that lets you store the imputation model and apply it seamlessly to new patients, ensuring consistency and reproducibility.

2 Installation

The new functionality is currently experimental and available in the "imputation_models" branch of the mice repository. To install this branch from GitHub, use:

remotes::install_github("amices/mice", ref = "imputation_models")

3 MICE Architecture

The MICE algorithm follows a two-level modular architecture. At the first level, mice(), the core function in the mice package (van Buuren and Groothuis-Oudshoorn 2011; van Buuren 2018), orchestrates the imputation process by managing data preprocessing, variable selection, and iterative imputation steps. At the second level, MICE applies elementary imputation methods—such as normal imputation and predictive mean matching (PMM)—to generate missing values based on model-specific assumptions.

Since storing and reusing imputation models requires both managing the overall imputation process and adjusting the underlying imputation methods, modifications were made at both levels.

We begin by exploring the first level of the MICE architecture, where the new tasks and models arguments modify the imputation workflow.

Next, we examine the second level, where elementary imputation functions—such as mice.impute.norm() and mice.impute.pmm()—have been extended to support tasks and models.

4 New tasks and models Arguments

The mice() function orchestrates the imputation process by managing the imputation model and generating imputed values. It first validates the input data and user-specified settings. It then initializes the imputation model, iterates elementary imputation methods, and returns a mids object containing the original dataset, model specifications, and imputed values.

The mids object is central to the multiple imputation workflow, supporting pooling, diagnostics, and visualization.

To enhance imputation flexibility, mice() introduces two new arguments: tasks and models, which we detail in the following sections.

4.1 Tasks

The tasks argument in mice() is a character vector specifying the task to perform for each variable during imputation. The available options are:

  • tasks = "impute": Estimates parameters, generates imputations, and stores the original data along with imputations—but not the imputation model. This corresponds to classic MICE behavior and is the default setting.
  • tasks = "train": Estimates parameters, generates imputations, and stores the original data, imputations, and imputation model. tasks = "train" saves the imputation model along with the data, allowing for both inspection and reuse.
  • tasks = "fill": Applies a stored imputation model to new data, generating imputations without re-estimating parameters.

All tasks options produce mids objects, which can be used for pooling, diagnostics, and visualization as in standard MICE workflows.

Next, we show practical examples of how to use the tasks argument.

4.1.1 Example: tasks = "impute"

Setting tasks = "impute" returns a mids object identical to the classic MICE implementation: it contains imputed values but does not store the imputation model. The following code generates a mids object using the built-in nhanes dataset:

library(mice, warn.conflicts = FALSE)
imputed <- mice(nhanes, tasks = "impute", seed = 1, print = FALSE)
class(imputed)
#> [1] "mids"

As in classic MICE, specifying tasks = "impute" applies this setting to all variables by expanding to:

imputed$tasks
#>      age      bmi      hyp      chl 
#> "impute" "impute" "impute" "impute"

tasks = "impute" behaves exactly like the classic mice() function, returning a mids object with imputed values only—without storing the imputation model.

4.1.2 Example: tasks = "train"

To also store the imputation model, use tasks = "train" instead of tasks = "impute":

trained <- mice(nhanes, tasks = "train", seed = 1, print = FALSE)
class(trained)
#> [1] "mids"

This time, the mids object contains both the imputed values and the imputation model. To confirm the difference, inspect the store component:

trained$store
#> [1] "train"

The new models component in the mids object stores the imputation model for each variable and each imputation. A mids object with store == "train" can later be used to generate imputations for new data.

Before applying a stored model, let’s highlight two key differences between store == "impute" and store == "train":

  • store == "train": The mids object includes the trained imputation model in the models component.
  • store == "impute": No model is stored.

The content of the stored models depends on the imputation method. For parametric methods, storing the model requires saving formulas, estimated coefficients, and metadata such as factor levels. For non-parametric methods like PMM, storing the model involves saving observed values instead of estimated parameters.

4.1.3 Example: tasks = "fill"

Training creates a transferable representation of the imputation model, allowing it to be reused on new data without altering the original model or re-estimating parameters. To apply a stored model, use tasks = "fill" in mice(), along with the models argument to specify the trained model.

The following code demonstrates how to use a trained model to fill missing values in new data:

newdata <- data.frame(age = c(2, 1), bmi = c(NA, NA), chl = c(NA, 190), hyp = c(NA, 1))
filled <- mice(newdata, tasks = "fill", models = trained$models, seed = 1, print = FALSE)
filled$store
#> [1] "fill"
filled$data
#>   age bmi chl hyp
#> 1   2  NA  NA  NA
#> 2   1  NA 190   1
filled$imp
#> $age
#> [1] 1 2 3 4 5
#> <0 rows> (or 0-length row.names)
#> 
#> $bmi
#>      1    2    3    4    5
#> 1 35.3 27.2 33.2 26.3 28.7
#> 2 30.1 33.2 33.2 30.1 26.3
#> 
#> $chl
#>     1   2   3   4   5
#> 1 204 229 284 284 206
#> 
#> $hyp
#>   1 2 3 4 5
#> 1 1 2 1 1 1
complete(filled, 2)
#>   age  bmi chl hyp
#> 1   2 27.2 229   2
#> 2   1 33.2 190   1

Note the use of both tasks = "fill" (to apply the stored model) and models = trained$models (to provide the trained imputation model).

  • filled$store will be set to "fill" if all missing values are imputed using the stored model.
  • filled$data holds the new dataset with missing values, while filled$imp stores the imputations.

The complete() function can be used to extract the multiply-imputed datasets. The example displays the imputed values for the second imputation.

4.2 Compact Representation of the Imputation Model

The compact argument controls how the trained model is stored. When compact = TRUE, the imputation model is stored in a minimized form that retains all parameters needed to generate imputed values but omits the training data, making it suitable for exchange, distribution, and production.

# Train a compact imputation model (no training data stored)
trained_compact <- mice(nhanes, tasks = "train", compact = TRUE, seed = 1, print = FALSE)
trained_compact$store
#> [1] "train_compact"

# No training data is stored in the compact model
trained_compact$data
#> NULL
trained_compact$imp
#> NULL

# But we can still use the compact model to fill missing values in new data
filled_compact <- mice(newdata, tasks = "fill", models = trained_compact$models, seed = 1, print = FALSE)
filled_compact$store
#> [1] "fill"
complete(filled_compact, 2)
#>   age  bmi chl hyp
#> 1   2 27.2 229   2
#> 2   1 33.2 190   1

Since compact models do not store training data, the data and imp components remain empty.
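Because the compact object drops the data and imputations, it is also considerably smaller in memory. A quick sanity check with base R:

# Compare the memory footprint of the full and compact training objects
object.size(trained)
object.size(trained_compact)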

# Attempting to complete a compact model without training data will return an error
try(complete(trained_compact))
#> Error in complete.mids(trained_compact) : 
#>   Cannot complete compact training object.
#>  Set 'compact = FALSE' to preserve training data and imputations.

Running complete() requires stored imputations to reconstruct datasets, which are unavailable in compact models. As a result, complete() does not work when store == "train_compact".

4.3 Comparison of tasks = "train", "train_compact", and "fill"

Feature tasks = "train" tasks = "train", compact = TRUE tasks = "fill"
Stores imputations? ✅ Yes ❌ No ✅ Yes (on new data)
Stores imputation model? ✅ Yes ✅ Yes ❌ No
Stores training data? ✅ Yes ❌ No ❌ No
Can generate new imputations? ✅ Yes ✅ Yes ✅ Yes
Requires models argument? ❌ No ❌ No ✅ Yes (to specify trained model)
complete() works? ✅ Yes ❌ No ✅ Yes
Typical use case Storing full imputation model and training data Creating a lightweight imputation model for sharing or production Applying a trained imputation model to new data

4.3.1 Explanation of Key Differences

  • tasks = "train": Stores everything, including the original dataset, the imputation model, and all imputations.
  • tasks = "train", compact = TRUE: Stores only the imputation model, making it lightweight but preventing the use of complete().
  • tasks = "fill": Uses a previously stored model to generate new imputations but does not store the model itself.

4.4 Train and Fill Subsets of Variables

Imputation models can be trained on a subset of variables rather than the entire dataset. This allows missing values in those variables to be filled using the trained model, while other variables are imputed using standard MICE methods.

The following fragment uses make.tasks() to create a tasks vector and sets the entries for bmi and hyp to "train", so that the imputation model is trained only for these two variables:

tasks <- make.tasks(nhanes)
tasks[c("bmi", "hyp")] <- "train"
trained_subset <- mice(nhanes, tasks = tasks, seed = 1, print = FALSE)
names(trained_subset$models)
#> [1] "hyp" "bmi"

Now apply the subset model to fill missing values in newdata:

tasks[c("bmi", "hyp")] <- c("fill", "fill")
filled_subset <- mice(newdata, tasks = tasks, models = trained_subset$models, seed = 1, print = FALSE)
#> Warning: Number of logged events: 1
complete(filled_subset)
#>   age  bmi chl hyp
#> 1   2   NA  NA  NA
#> 2   1 27.2 190   1

Row 2 was successfully imputed using the trained model for bmi and hyp. However, imputation fails for row 1 because mice() cannot build an imputation model for chl (which was not part of the trained subset).

mice() checks each variable for constant or collinear values before building an imputation model. In this case, chl has only one observed value, so mice() fails to construct an imputation model for it, and the resulting missing values propagate to bmi and hyp in row 1. Variables explicitly set to "fill" (bmi and hyp) do not require a model to be built, so they bypass this check.
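The failed model build is recorded in the loggedEvents component of the mids object, a standard diagnostic slot in mice that you can inspect directly:

# Inspect the logged event explaining why chl could not be imputed
filled_subset$loggedEvents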

There are two ways to address this issue:

  • Include chl in the training subset, effectively training models for all variables.
  • Append additional records to newdata so that mice() can generate an on-the-fly imputation model for chl. This approach is shown below:
newdata_append <- rbind(newdata, nhanes[11:20, ])
filled_subset <- mice(newdata_append, tasks = tasks, models = trained_subset$models, seed = 1, print = FALSE)
complete(filled_subset, 2)[1:2, ]
#>   age  bmi chl hyp
#> 1   2 26.3 199   1
#> 2   1 22.0 190   1

Here, 10 records from nhanes are appended to newdata. This allows mice() to train an imputation model for chl on the extra data, while still applying the trained model for bmi and hyp.

Final observations:

  • When a combination of training and filling is used, the resulting mids object is stored with store = "train".
  • Appendix A contains the full specification of mids components for store modes "impute", "train", and "fill".

5 Elementary Imputation Functions

This section explains how mice.impute.norm(), mice.impute.pmm(), and other mice.impute.*() functions have been adapted to support tasks and models, allowing for reusable imputation models and enhanced flexibility in workflows.

mice.impute.*() functions generate imputed values for individual variables (univariate imputation) or groups of variables (multivariate imputation). mice() automatically calls mice.impute.*() functions to handle variable-wise and block-wise imputation.

This section is primarily relevant for developers who wish to extend the mice package with new imputation methods.

5.1 Normal Imputation

The mice.impute.norm() function is responsible for the following tasks:

  1. Verify whether an imputation model exists before proceeding.
  2. For task = "fill": Retrieve the stored imputation model, generate imputed values, and return them.
  3. Estimate imputation model parameters.
  4. For task = "train": Store the imputation model for later use.
  5. Generate and return imputed values.

5.1.1 Example: mice.impute.norm()

Let us examine how these actions are implemented in mice.impute.norm() for normal imputation:

mice::mice.impute.norm
#> function (y, ry, x, wy = NULL, task = "impute", model = NULL, 
#>     ridge = 1e-05, ...) 
#> {
#>     check.model.exists(model, task)
#>     method <- "norm"
#>     if (is.null(wy)) 
#>         wy <- !ry
#>     x <- cbind(1, as.matrix(x))
#>     if (task == "fill" && check.model.match(model, x, method)) {
#>         return(x[wy, ] %*% model$beta.dot + rnorm(sum(wy)) * 
#>             model$sigma.dot)
#>     }
#>     parm <- .norm.draw(y, ry, x, ridge = ridge, ...)
#>     if (task == "train") {
#>         model$setup <- list(method = method, n = sum(ry), task = task, 
#>             ridge = ridge)
#>         model$beta.hat <- drop(parm$coef)
#>         model$beta.dot <- drop(parm$beta)
#>         model$sigma.hat <- parm$sigma.hat
#>         model$sigma.dot <- parm$sigma
#>         model$xnames <- colnames(x)
#>     }
#>     return(x[wy, ] %*% parm$beta + rnorm(sum(wy)) * parm$sigma)
#> }
#> <bytecode: 0x10ffdd650>
#> <environment: namespace:mice>

The function argument list introduces two new arguments:

  • model: An environment that contains the imputation model.
  • task: Specifies the task to perform ("impute", "train", or "fill").

The function body clearly separates different tasks:

  1. Check model availability: check.model.exists(model, task)
  2. Fill missing values (task = "fill"): Retrieve the stored model and generate imputations.
  3. Estimate parameters: .norm.draw(y, ry, x, ridge = ridge, ...) estimates imputation parameters based on the available data.
  4. Train and store model (task = "train"): Save the trained imputation model.
  5. Return imputed values: return(x[wy, ] %*% parm$beta + rnorm(sum(wy)) * parm$sigma).

All mice.impute.*() functions follow this pattern.

Model storage and retrieval: The model object is an environment that stores imputation parameters. During training, each iteration updates the imputation model by overwriting previous estimates with new ones. After the final iteration, the model is stored as a list in the models component of the mids object. When mice() imports a trained model, it reconstructs this list back into a nested environment—one for each variable and each imputation.
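The following sketch illustrates the general mechanism. Because R environments are mutable and passed by reference, an elementary imputation function can update the model in place; this is a simplified illustration of the pattern, not mice's internal code:

# Environments are passed by reference, so updates inside a function persist
model <- new.env()
update_model <- function(model, parm) {
  model$beta.dot  <- parm$beta    # each iteration overwrites the previous draws
  model$sigma.dot <- parm$sigma
}
update_model(model, parm = list(beta = c(1.2, -0.5), sigma = 0.8))
as.list(model)   # flatten to a plain list for storage in the mids models component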

5.2 Predictive Mean Matching (PMM)

The mice.impute.pmm() function performs the same core tasks as mice.impute.norm(). However, PMM is a semi-parametric imputation method (Little 1988) that requires additional steps to ensure imputations remain consistent with the standard algorithm while enabling imputations without access to the original training dataset.

To store the imputation model, we use percentile binning, which divides the linear predictor into equally sized bins and assigns donor values to each bin.

5.2.1 Steps in mice.impute.pmm()

  1. Estimate the imputation model parameters.
  2. Calculate the linear predictor for all cases.
  3. For each case with a missing outcome:
    • Identify the \(k\) closest observed cases based on the linear predictor.
    • Randomly select one of these \(k\) donor values as the imputed value.

While parameter estimation follows a similar approach to mice.impute.norm(), PMM requires an efficient way to store donor values. Instead of storing the full training dataset, we create equally sized bins based on the linear predictor. At each bin threshold, we store the \(k\) closest donor values.

When imputing a new case, the function:

  1. Identifies the bin it belongs to by locating its left and right bin thresholds.
  2. Draws a donor value from the \(k\) closest observed values within the left or right bin.
  3. Weights donor selection by the case’s relative distance to the bin thresholds.

This approach is memory-efficient, handles gaps in the linear predictor, and allows for fast and accurate imputations, provided the number of bins (\(t\)) and donors (\(k\)) are chosen well.
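The following toy example sketches the percentile-binning idea on simulated data. It is illustrative only; the internal helpers described in Appendix B additionally handle ties, gaps, and out-of-range values:

set.seed(1)
n <- 100
yhat <- rnorm(n)            # linear predictor for observed cases
y    <- yhat + rnorm(n)     # observed outcome values (the donor pool)
nbins  <- 10                # number of bins (t)
donors <- 5                 # donors per threshold (k)
edges <- quantile(yhat, probs = seq(0, 1, length.out = nbins + 1))
# For each upper bin threshold, keep the donors whose linear predictor
# lies closest to that threshold
lookup <- t(sapply(edges[-1], function(theta) y[order(abs(yhat - theta))][1:donors]))
dim(lookup)   # nbins rows, donors columns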

5.2.2 Model Storage for PMM

The mice.impute.pmm() function stores the following components in the model object:

  • edges: A vector of length \(t + 1\) defining the bin thresholds.
  • lookup: A matrix of size \(t \times k\) containing donor values for each bin.

Appendix B describes four internal helper functions:

  • initialize.nbins(): Sets the number of bins \(t\).
  • initialize.donors(): Sets the number of donors \(k\).
  • bin.yhat(): Divides the linear predictor into bins using edges.
  • draw.neighbors.pmm(): Selects imputed values from the lookup table.

During training, all four functions are used. However, filling only needs draw.neighbors.pmm().

5.2.3 Experimental Aspects of \(t\) and \(k\)

The methods for determining \(t\) and \(k\) in initialize.nbins() and initialize.donors() are preliminary and based on limited simulations.

The default values for \(t\) (typically 15–30 bins) and \(k\) (typically 5–15 donors) were selected based on a small simulation study that varied only sample size. The study assumed an imputation model with approximately 10% explained variance. However, because optimal values for \(t\) and \(k\) likely depend on prediction error, these defaults may not be suitable for other models. Although some work is available in adaptive tuning (Schenker and Taylor 1996), optimizing \(t\) and \(k\) for different levels of prediction error remains an open research question.

5.3 Structure of the models Component

When tasks = "train", the mice() function stores the imputation model in the models component of the mids object. The models component is a list containing the setup and estimates of the imputation model for each variable and each repeated imputation. Since there are m imputations and ncol(data) variables, there can be up to m * ncol(data) imputation models in total.

The models component is organized by variable and imputation number, allowing each variable’s imputation process to be stored separately across imputations. This structure ensures that stored models can be reused for generating imputations on new data.

The following diagram visualizes the hierarchical structure of the models component:

graph TD;
    A[mids] --> B[models]
    B --> C[bmi]
    B --> D[hyp]
    B --> E[chl]
    C --> F[Imputation 1]
    C --> G[Imputation 2]
    D --> H[Imputation 1]
    D --> I[Imputation 2]
    E --> J[Imputation 1]
    E --> K[Imputation 2]

The names of the components stored in the first imputation model for bmi are:

names(trained$models$bmi[[1]])
#> [1] "lookup"   "edges"    "beta.dot" "xnames"   "beta.hat" "factor"   "formula" 
#> [8] "setup"

In general, the names and types of stored objects depend on the imputation method. The setup object for the first imputation model for bmi can be accessed as follows:

unlist(trained$models$bmi[[1]]$setup)
#>    method         n    donors matchtype  quantify      trim  task.bmi     nbins 
#>     "pmm"      "16"       "5"       "1"    "TRUE"       "1"   "train"      "13" 
#>     ridge 
#>   "1e-05"

It contains settings specific to the imputation model. We can retrieve the formula used in the model as:

trained$models$bmi[[1]]$formula
#> [1] "bmi ~ age + hyp + chl"

which shows that bmi is imputed using a linear combination of age, hyp, and chl. We obtain the least squares estimates for the regression weights by:

trained$models$bmi[[1]]$beta.hat
#>                     age         hyp         chl 
#> 19.45854723 -5.55551631  2.02220838  0.07131134

However, the actual imputation model does not use these weights directly; instead, it draws randomly from a distribution that accounts for parameter uncertainty (Rubin 1987). The drawn values can be accessed as:

trained$models$bmi[[1]]$beta.dot
#>                   age        hyp        chl 
#>  8.3452747 -5.1967134  4.4862182  0.1070697

For small samples or highly collinear predictors, beta.hat and beta.dot can differ substantially—a phenomenon sometimes called ‘bouncing betas’, where parameter estimates fluctuate across imputations. To ensure “proper imputation”, MICE uses beta draws (beta.dot) rather than fixed estimates (beta.hat).

5.4 Additional Components for PMM

For PMM, two additional objects are stored in models:

  • edges: A vector of length \(t + 1\) containing the bin thresholds \(\theta_j\).
  • lookup: A \(t \times k\) matrix storing the donor values for each bin.

The bin thresholds in edges are located at the following percentiles of the linear predictor:

trained$models$bmi[[1]]$edges
#>        0% 7.692308% 15.38462% 23.07692% 30.76923% 38.46154% 46.15385% 53.84615% 
#>  19.50434  23.64460  23.79063  24.14798  24.78715  25.22078  26.48167  26.95094 
#> 61.53846% 69.23077% 76.92308% 84.61538% 92.30769%      100% 
#>  27.08883  27.18614  28.37493  30.79091  31.47111  32.25554

The lookup object is a matrix, where each row represents a bin threshold \(\theta_j\) that segments the linear predictor. Observations with \(\hat y\) values falling outside the bin range (\(\hat y \leq \theta_1\) or \(\hat y > \theta_t\)) are assigned to the first or last bin, respectively. The lookup table for the first imputation model for bmi can be accessed as:

trained$models$bmi[[1]]$lookup
#>       [,1] [,2] [,3] [,4] [,5]
#>  [1,] 21.7 27.4 27.4 27.4 27.4
#>  [2,] 22.7 22.7 22.7 22.7 22.7
#>  [3,] 20.4 20.4 20.4 20.4 20.4
#>  [4,] 22.5 22.5 22.5 22.5 22.5
#>  [5,] 24.9 24.9 24.9 24.9 24.9
#>  [6,] 27.5 27.5 27.5 27.5 27.5
#>  [7,] 28.7 26.3 26.3 26.3 26.3
#>  [8,] 35.3 22.7 27.5 22.0 22.5
#>  [9,] 27.2 25.5 25.5 25.5 25.5
#> [10,] 22.0 22.0 22.0 22.0 22.0
#> [11,] 30.1 30.1 30.1 30.1 30.1
#> [12,] 22.7 28.7 22.7 22.5 26.3
#> [13,] 33.2 35.3 35.3 35.3 29.6

The lookup object is used in draw.neighbors.pmm() to select donor values for imputations.

5.5 Sharing the models List Component

The models list component in mice() allows users to store trained imputation models and apply them to new datasets, ensuring consistent missing data handling without re-estimating parameters.

Beyond reusability, models enhances transparency into MICE’s mechanics, serving as a diagnostic tool for refining imputation strategies. By inspecting stored models, users can evaluate imputation decisions, identify patterns, and adjust settings accordingly.

Additionally, models encapsulates all essential elements needed to standardize and share imputation workflows across datasets. When combined with a structured codebook, users can create reproducible imputation modules, enabling them to share, apply, and publish standardized missing data solutions in different studies and applications.

6 Methodological Considerations

6.1 Number of Imputations m Used in "train" and "fill"

Since we did not specify \(m_\text{train}\) (the number of imputations for training) and \(m_\text{fill}\) (the number of imputations for filling), both steps default to m = 5. Although the user can set both independently, we recommend \(m_\text{train} = m_\text{fill}\) to ensure consistency.

If \(m_\text{train} < m_\text{fill}\), the mice() function will automatically recycle trained models to generate additional imputations, without warning the user. When recycling, the imputations may exhibit too little variability, particularly in small samples, potentially leading to underestimated downstream variability.

If \(m_\text{train} > m_\text{fill}\), the mice() function will discard the extra imputations. In general, increasing \(m_\text{fill}\) reduces Monte Carlo error and improves the stability of the between-imputation variance estimate.
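For example, the following fragment trains with m = 2 and fills with m = 5, so the two trained models would be recycled to produce the remaining three imputations (a sketch of the behavior described above, reusing newdata from Section 4.1.3):

trained_m2 <- mice(nhanes, tasks = "train", m = 2, seed = 1, print = FALSE)
filled_m5  <- mice(newdata, tasks = "fill", models = trained_m2$models,
                   m = 5, seed = 1, print = FALSE)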

Despite its statistical drawbacks, many users prefer \(m_\text{fill} = 1\) for its simplicity, as working with a single dataset is often more convenient. However, as Dempster and Rubin (1983, 8) caution:

Imputation is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all.

A single completed dataset may be a convenient fiction, useful for various purposes such as obtaining population estimates, developing an imputation model, or estimating the most likely version of the hidden data values. However, a single imputed dataset fails to account for uncertainty inherent to the missing data, leading to biased estimates and overconfident inferences in downstream decision making.

6.2 Full Fill vs. Partial Fill

The simplest workflow, known as full fill, trains the imputation model on all columns in the dataset. A full fill is useful when data is split by rows (horizontal partitioning) or when standardized imputations are needed across datasets with the same structure. In such cases, a model can be trained on one dataset partition and then applied to new datasets containing different records with the same type of variables. A variation on this approach is to train the model on a subset of rows to generate imputations for the remaining rows to save computation time. Another use case occurs when some variables included in the trained model are missing in a new dataset, and the goal is to impute only those missing variables. Since the model does not need to be retrained, imputations for new data can be generated almost instantly.

However, when data is split into different subsets of columns (vertical partitioning), the trained model may not cover all variables. In these cases, partial fill can be used, where supported variables are imputed along with additional imputations for variables absent from the original model.

A common example occurs in longitudinal studies, where data is collected at multiple time points. A model trained on data from an earlier time point can be used to impute missing values before extending the model to include variables collected at later time points—without requiring retraining on the full dataset. Another example is the integration of data from different sources without directly merging them. Suppose source A contains variables \(X\) and \(Y\), while source B contains \(X\) and \(Z\). Instead of combining these datasets into a single table, an imputation model can be sequentially extended under the assumption of conditional independence \(Y \perp Z \mid X\).

The process proceeds as follows:

  1. Train a model on source B to impute \(Z\) from \(X\).
  2. Use this model to impute \(Z\) in source A.
  3. Store the final model for \(X\), \(Y\), and \(Z\) for future use.

This approach removes the need to merge sources A and B into a single dataset. Analysts can share trained models instead of exchanging raw data, thereby improving efficiency, adaptability, and data privacy while still utilizing all available information.

A key assumption in this approach is conditional independence \(Y \perp Z \mid X\), meaning that once we account for \(X\), there is no remaining association between \(Y\) and \(Z\). In practice, this assumption may not always hold. If an additional data source C containing both \(Y\) and \(Z\) is available, models trained on source C can be incorporated to capture the direct relationship between \(Y\) and \(Z\), improving the accuracy of imputations.
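A hypothetical sketch of steps 1 and 2, using column subsets of nhanes as stand-ins for the two sources (age plays \(X\), bmi plays \(Y\), chl plays \(Z\)):

# Source B holds X (age) and Z (chl); source A holds X (age) and Y (bmi)
sourceB <- nhanes[, c("age", "chl")]
sourceA <- nhanes[, c("age", "bmi")]

# Step 1: train a model on source B to impute chl from age
trainedB <- mice(sourceB, tasks = "train", seed = 1, print = FALSE)

# Step 2: add an empty chl column to source A and fill it from the trained model
sourceA$chl <- NA_real_
tasksA <- make.tasks(sourceA)
tasksA["chl"] <- "fill"
filledA <- mice(sourceA, tasks = tasksA, models = trainedB$models,
                seed = 1, print = FALSE)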

6.3 Store

The store component of a mids object is automatically determined based on the tasks argument. It specifies which components are saved in the imputation model.

If tasks is not specified (or set to "impute"), the default behavior is:

imputed <- mice(nhanes, seed = 1, print = FALSE)
imputed$store
#> [1] "impute"
imputed$tasks
#>      age      bmi      hyp      chl 
#> "impute" "impute" "impute" "impute"
imputed$models
#> NULL

Since tasks = "impute" does not train models, the models component remains NULL. The possible values of store are:

  • impute: When all tasks are "impute", the mids object mimics the classic mids structure and does not store models.
  • train: If one or more tasks include "train", the mids object stores the models component for the subset of trained variables.
  • fill: If all tasks are "fill", the mids object does not store models. The setting store = "fill" assumes a fully trained model for all variables.

Note: If the new data contain variables that are not covered by the trained model, mice() will silently train imputation models for them on the fly. This may produce unintended imputations if those variables were not meant to be part of the training set. The resulting mids object will have store = "train".

Additionally, there is a special fourth store value:

  • train_compact: This value is assigned when store = "train" and compact = TRUE. The train_compact mode stores a minimized version of the imputation model without training data, making it suitable for production and sharing.

7 Conclusion

This vignette introduced new functionality in mice() that enables storing and reusing imputation models, enhancing reproducibility, efficiency, and interoperability. By utilizing the tasks and models arguments, users can now train imputation models, apply them to new data, and share models across studies. The flexibility of full fill and partial fill workflows further supports diverse data structures, including longitudinal data and multi-source datasets. These enhancements streamline missing data workflows, reduce redundancy, and improve data privacy by allowing models to be shared instead of raw datasets. Future developments will refine model selection and optimization for various imputation scenarios.

References

Dempster, A. P., and D. B. Rubin. 1983. “Introduction.” In Incomplete Data in Sample Surveys, 2:3–10. New York: Academic Press.
Little, R. J. A. 1988. “Missing-Data Adjustments in Large Surveys (with Discussion).” Journal of Business & Economic Statistics 6 (3): 287–301.
Morris, T. P., I. R. White, and P. Royston. 2014. “Tuning Multiple Imputation by Predictive Mean Matching and Local Residual Draws.” BMC Medical Research Methodology 14: 75.
Rubin, D. B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
Schenker, N., and J. M. G. Taylor. 1996. “Partially Parametric Techniques for Multiple Imputation.” Computational Statistics & Data Analysis 22 (4): 425–46.
van Buuren, S. 2018. Flexible Imputation of Missing Data. Second Edition. Boca Raton, FL: Chapman & Hall/CRC Press. https://stefvanbuuren.name/fimd/.
van Buuren, S., and K. Groothuis-Oudshoorn. 2011. “MICE: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67.

APPENDIX A: Components of the mids object

The following table summarizes the components saved by mice() for different store values. The store value is set to "impute", "train", or "fill" based on the tasks argument: mice() returns store == "impute" by default, store == "fill" if all variables are filled, and store == "train" in all other cases.

The table lists the components of the mids object for each store value.

| Name | Description | I/O | Data Type | Impute | Train | Fill |
|---|---|---|---|---|---|---|
| predictorMatrix | Specifies predictor set | I | matrix | YES | YES | NO |
| formulas | Formulae for imputation models | I | list | YES | YES | NO |
| modeltype | Form of imputation model | I | character | YES | YES | NO |
| post | Commands for post-processing | I | character vector | YES | YES | NO |
| ignore | Logical vector for ignored rows | I | logical vector | YES | YES | NO |
| seed | Seed value for reproducibility | I | integer | YES | YES | NO |
| nmis | Count of missing values per variable | O | numeric vector | YES | YES | NO |
| chainMean | Mean of imputed values | O | array | YES | YES | NO |
| chainVar | Variance of imputed values | O | array | YES | YES | NO |
| loggedEvents | Warnings and corrective tasks | O | data.frame | YES | YES | NO |
| blocks | Blocks of variables for imputation | I | list | YES | YES | NO |
| method | Imputation method per block | I | character vector | YES | YES | NO |
| blots | Extra arguments per block | I | list | YES | YES | NO |
| visitSequence | Order of block visits | I | character vector | YES | YES | NO |
| iteration | Last iteration number | O | integer | YES | YES | NO |
| lastSeedValue | Random number generator state | O | integer | YES | YES | NO |
| tasks | Specifies the imputation tasks | I | character vector | YES | YES | NO |
| models | Stores imputation model estimates | O | list | NO | YES | NO |
| data | Data to be imputed | I | data.frame | YES | YES | YES |
| where | Specifies where imputations occur | I | matrix | YES | YES | YES |
| imp | List of imputations per variable | O | list | YES | YES | YES |
| m | Number of imputations | I | integer | YES | YES | YES |
| store | Storage set | O | character | YES | YES | YES |
| version | Version number of mice package | O | character | YES | YES | YES |
| date | Date when the object was created | O | character | YES | YES | YES |
| call | Call that created the object | O | call | YES | YES | YES |

The store of a mids object can be minimized from "train" to "train_compact" by setting compact = TRUE. The minimal representation "train_compact" saves the following components: blocks, method, blots, visitSequence, iteration, lastSeedValue, tasks, models, m, store, version, date, and call. This feature is particularly useful if training data cannot be shared, or for production applications where memory efficiency is critical. The minimized model retains only the essential information required for generating imputed values, reducing memory usage while maintaining the core functionality of the imputation model. Note that some downstream functions, such as complete() and with(), do not support store = "train_compact".

APPENDIX B: Computational details of binning in PMM

Initialization of the Number of Bins (initialize.nbins())

The function initialize.nbins() determines an appropriate number of bins (nbins) to be used in PMM based on the sample size (n) and the number of unique values (nu) in the predicted or observed data.

The default computation of nbins uses the relation nbins = round(4 * log(n) + 1.5). This formula suggests that the number of bins grows logarithmically with the sample size (n), ensuring a reasonable bin width without excessive granularity. The coefficients (4 and 1.5) are chosen to scale the number of bins appropriately across different n. Since nbins represents a discretization of the data, it cannot exceed the number of unique values (nu) in the predicted or observed variable. If nbins > nu, the function sets nbins = nu to ensure that each unique value has its own bin. This adjustment is necessary to prevent empty bins and ensure that each unique value is represented in the binning process. The function enforces a lower bound of 2 bins to prevent degenerate cases where binning would be ineffective.
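Written as a small function, the rule reads as follows (a sketch of the stated defaults, not necessarily the internal code). With n = 16 it reproduces the value nbins = 13 shown in the bmi setup of Section 5.3, assuming at least 13 unique values:

nbins_default <- function(n, nu) {
  nbins <- round(4 * log(n) + 1.5)   # grows logarithmically with n
  nbins <- min(nbins, nu)            # at most one bin per unique value
  max(nbins, 2L)                     # enforce a lower bound of 2 bins
}
nbins_default(n = 16, nu = 16)
#> [1] 13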

For large samples of highly correlated continuous data, the user can set nbins = 100 or higher to improve precision.

Initialization of the Number of Donors (initialize.donors())

The initialize.donors() function determines the number of donors (donors) for imputation based on the sample size (n). If donors is NULL, it is computed using round(n / 600 + 7), ensuring a gradual increase as n grows. The computed value is then constrained using max(1L, min(donors, n)), which ensures at least one donor while preventing the donor count from exceeding n.
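Written out the same way, again as a sketch of the stated rule:

donors_default <- function(n) {
  donors <- round(n / 600 + 7)   # gradual increase with n
  max(1L, min(donors, n))        # at least 1 donor, at most n
}
donors_default(25)
#> [1] 7
donors_default(6000)
#> [1] 17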

Setting donors = 1 (not recommended) selects the closest donor. Thus, different cases that are in the same bin will obtain the same imputed value from the left or right edge. Setting donors = n (not recommended) effectively samples from the marginal distribution, and weakens the relations between the data. The literature suggests values between 5 and 10 donors (Morris, White, and Royston 2014). Because of binning, the optimal number of donors may actually need to be higher than 5 or 10 in order to reduce repetition of donors from the same bin.

Binning of the Linear Predictor (bin.yhat())

The bin.yhat() function divides the linear predictor yhat into bins using the thresholds edges. It assigns each value to a bin with findInterval(), as bin = findInterval(yhat, edges, left.open = TRUE). The left.open = TRUE argument ensures that values exactly equal to a threshold are assigned to the bin on its left. The function returns the bin index for each value.
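A small demonstration of the left-open convention with illustrative values:

edges <- c(0, 1, 2, 3)
yhat  <- c(0.5, 1.0, 2.7)
findInterval(yhat, edges, left.open = TRUE)
#> [1] 1 1 3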

Drawing Imputations (draw.neighbors.pmm())

The draw.neighbors.pmm() function selects imputed values using the predefined bins. Given predicted values (yhat), bin edges (edges), and a lookup table (lookup), the function assigns each yhat value to a bin using findInterval(). Values below the first threshold or above the last are assigned to the first or last bin, respectively. To ensure smooth transitions between bins, the function calculates a probability weight based on the distance between yhat and the bin edges, and uses a Bernoulli draw to select either the left or the right bin. Once the bin is selected, the function samples mlocal observed y values from the corresponding row of the lookup table. The result is an n × mlocal matrix containing imputed values.
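The left/right weighting can be sketched for a single new case as follows. This is illustrative only; the internal function operates on vectors of cases and also draws the mlocal donor values:

pick_row <- function(yhat_new, edges) {
  nbins <- length(edges) - 1
  j <- findInterval(yhat_new, edges, left.open = TRUE)
  j <- min(max(j, 1L), nbins)                     # out-of-range -> first/last bin
  w <- (yhat_new - edges[j]) / (edges[j + 1] - edges[j])
  w <- min(max(w, 0), 1)                          # relative distance to right edge
  min(j + rbinom(1, 1, w), nbins)                 # Bernoulli pick of left/right row
}
pick_row(0.4, edges = c(0, 1, 2, 3))   # returns row 1 or 2, favoring 1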