mice
: An approach to sensitivity analysisThis is the last vignette in the series.
The focus of this document is on sensitivity analysis in the context of missing data. The goal of sensitivity analysis is to study the influence that violations of the missingness assumptions have on the obtained inference.
The Leiden data set
The Leiden data set is a subset of 956 members of a very old (85+) cohort in Leiden. Multiple imputation of this data set has been described in Boshuizen et al (1998), Van Buuren et al (1999) and Van Buuren (2012), chapter 7.
The main question is how blood pressure affects mortality risk in the oldest old. We have reasons to mistrust the MAR assumption in this case. In particular, we worried whether the imputations of blood pressure under MAR would be low enough. The sensitivity analysis explores the effect of artificially lowering the imputed blood pressure by deducting an amount of δ from the values imputed under MAR. In order to preserve the relations between the variables, this needs to be done during the iterations.
Unfortunately we cannot share the Leiden data set with you. But we detail the approach below.
1. Open R
and load the packages mice
, lattice
and survival
.
set.seed(123)
library("mice")
library("lattice")
library("survival")
2. The leiden data set.
summary(leiden)
## sexe lftanam rrsyst rrdiast
## Min. :0.0000 Min. : 85.48 Min. : 90.0 Min. : 50.00
## 1st Qu.:0.0000 1st Qu.: 87.50 1st Qu.:135.0 1st Qu.: 75.00
## Median :0.0000 Median : 89.07 Median :150.0 Median : 80.00
## Mean :0.2709 Mean : 89.78 Mean :152.9 Mean : 82.78
## 3rd Qu.:1.0000 3rd Qu.: 91.52 3rd Qu.:170.0 3rd Qu.: 90.00
## Max. :1.0000 Max. :103.54 Max. :260.0 Max. :140.00
## NA's :121 NA's :126
## dwa survda alb chol
## Min. :0.0000 Min. : 2.0 Min. :21.00 Min. : 2.700
## 1st Qu.:0.0000 1st Qu.: 534.8 1st Qu.:39.00 1st Qu.: 4.800
## Median :0.0000 Median :1196.5 Median :41.00 Median : 5.700
## Mean :0.2437 Mean :1195.4 Mean :40.77 Mean : 5.704
## 3rd Qu.:0.0000 3rd Qu.:1889.0 3rd Qu.:43.00 3rd Qu.: 6.400
## Max. :1.0000 Max. :2610.0 Max. :52.00 Max. :10.900
## NA's :229 NA's :232
## mmse woon
## Min. : 1.00 Min. :0.000
## 1st Qu.:21.00 1st Qu.:0.000
## Median :26.00 Median :3.000
## Mean :23.67 Mean :1.775
## 3rd Qu.:29.00 3rd Qu.:3.000
## Max. :30.00 Max. :4.000
## NA's :85
str(leiden)
## 'data.frame': 956 obs. of 10 variables:
## $ sexe : num 0 0 0 0 0 0 0 1 1 0 ...
## $ lftanam: num 87.8 87.8 89.1 90.3 87.8 ...
## $ rrsyst : num 160 140 155 155 110 120 180 135 130 160 ...
## $ rrdiast: num 100 70 85 90 60 80 75 80 60 90 ...
## $ dwa : num 0 0 0 0 0 0 0 0 0 0 ...
## $ survda : num 1082 27 1604 528 1100 ...
## $ alb : num 41 NA 41 44 37 NA 42 NA 45 46 ...
## $ chol : num 4.4 NA 4.6 3.9 5.3 NA 7.2 NA 5.1 6.5 ...
## $ mmse : num 12 9 25 27 14 NA 28 26 30 14 ...
## $ woon : num 4 3 0 1 0 3 3 0 4 4 ...
head(leiden)
## sexe lftanam rrsyst rrdiast dwa survda alb chol mmse woon
## 1 0 87.80 160 100 0 1082 41 4.4 12 4
## 2 0 87.75 140 70 0 27 NA NA 9 3
## 3 0 89.08 155 85 0 1604 41 4.6 25 0
## 4 0 90.29 155 90 0 528 44 3.9 27 1
## 5 0 87.76 110 60 0 1100 37 5.3 14 0
## 6 0 91.39 120 80 0 5 NA NA NA 3
tail(leiden)
## sexe lftanam rrsyst rrdiast dwa survda alb chol mmse woon
## 1229 1 93.85 130 70 0 523 40 5.3 28 0
## 1230 0 92.20 190 90 0 1182 44 5.8 26 3
## 1232 0 95.02 150 80 0 861 35 5.0 28 0
## 1233 0 88.30 120 60 0 129 42 8.6 21 0
## 1235 1 89.02 140 80 0 374 40 5.2 23 0
## 1236 0 85.70 130 65 0 1744 36 7.2 27 3
3. Perform a dry run (using maxit = 0
) in mice
. List the number of missing values per variable.
ini <- mice(leiden, maxit = 0)
ini$nmis
## sexe lftanam rrsyst rrdiast dwa survda alb chol mmse
## 0 0 121 126 0 0 229 232 85
## woon
## 0
There are 121 missings (NA
’s) for rrsyst
, 126 missings for rrdiast
, 229 missings for alb
, 232 missings for chol
and 85 missing values for mmse
.
4. Study the missing data pattern in more detail using md.pattern()
and fluxplot()
. The interest here focusses on imputing systolic blood pressure (rrsyst
) and diastolic blood pressure (rrdiast
).
md.pattern(leiden)
## sexe lftanam dwa survda woon mmse rrsyst rrdiast alb chol
## 621 1 1 1 1 1 1 1 1 1 1 0
## 2 1 1 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 1 1 0 1 1
## 149 1 1 1 1 1 1 1 1 0 0 2
## 2 1 1 1 1 1 1 1 0 1 1 1
## 2 1 1 1 1 1 1 1 0 0 0 3
## 72 1 1 1 1 1 1 0 0 1 1 2
## 2 1 1 1 1 1 1 0 0 1 0 3
## 20 1 1 1 1 1 1 0 0 0 0 4
## 21 1 1 1 1 1 0 1 1 1 1 1
## 36 1 1 1 1 1 0 1 1 0 0 3
## 1 1 1 1 1 1 0 1 0 0 0 4
## 7 1 1 1 1 1 0 0 0 1 1 3
## 20 1 1 1 1 1 0 0 0 0 0 5
## 0 0 0 0 0 85 121 126 229 232 793
fx <- fluxplot(leiden)
Variables with higher outflux are (potentially) the more powerful predictors. Variables with higher influx depend stronger on the imputation model. When points are relatively close to the diagonal, it indicates that influx and outflux are balanced.
The variables in the upper left corner have the more complete information, so the number of missing data problems for this group is relatively small. The variables in the middle have an outflux between 0.5 and 0.8, which is small. Missing data problems are thus more severe, but potentially this group could also contain important variables. The lower (bottom) variables have an outflux with 0.5 or lower, so their predictive power is limited. Also, this group has a higher influx, and, thus, depend more highly on the imputation model.
If you’d like this information in tabulated form, you can simply ask
fx
## pobs influx outflux ainb aout fico
## sexe 1.0000000 0.00000000 1.0000000 0.0000000 0.09216643 0.3504184
## lftanam 1.0000000 0.00000000 1.0000000 0.0000000 0.09216643 0.3504184
## rrsyst 0.8734310 0.09798107 0.5573770 0.7887971 0.05881570 0.2562874
## rrdiast 0.8682008 0.10231550 0.5422446 0.7910053 0.05756359 0.2518072
## dwa 1.0000000 0.00000000 1.0000000 0.0000000 0.09216643 0.3504184
## survda 1.0000000 0.00000000 1.0000000 0.0000000 0.09216643 0.3504184
## alb 0.7604603 0.19311053 0.2471627 0.8214459 0.02995568 0.1458047
## chol 0.7573222 0.19573400 0.2383354 0.8218391 0.02900552 0.1422652
## mmse 0.9110879 0.06798221 0.6796974 0.7790850 0.06875877 0.2870264
## woon 1.0000000 0.00000000 1.0000000 0.0000000 0.09216643 0.3504184
5. The cases with and without blood pressure observed have very different survival rates. Show this.
We can see this easily from the Kaplan-Meier plot.
km <- survfit(Surv(survda/365, 1-dwa) ~ is.na(rrsyst), data = leiden)
plot(km,
lty = 1,
lwd = 1.5,
xlab = "Years since intake",
ylab = "K-M Survival probability", las=1,
col = c(mdc(4), mdc(5)),
mark.time = FALSE)
text(4, 0.7, "BP measured")
text(2, 0.3, "BP missing")
In the next steps we are going to impute rrsyst
and rrdiast
under two scenarios: MAR and MNAR. We will use the delta adjustment technique described in paragraph 7.2.3 in Van Buuren (2012)
6. Create a \(\delta\) vector that represent the following adjustment values for mmHg: 0 for MAR, and -5, -10, -15, and -20 for MNAR.
delta <- c(0, -5, -10, -15, -20)
The recipe for creating MNAR imputations for \(\delta \neq 0\) uses the post-processing facility of mice. This allows to change the imputations on the fly by deducting a value of \(\delta\) from the values just imputed.
7. Impute the leiden data using the delta adjustment technique. We only have to deduct from rrsyst
, because rrdiast
will adapt to the changed rrsyst
when it is imputed using rrsyst
as predictor. Store the five imputed scenarios (adjustment) in a list called imp.all
.
imp.all <- vector("list", length(delta))
post <- ini$post
for (i in 1:length(delta)){
d <- delta[i]
cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
post["rrsyst"] <- cmd
imp <- mice(leiden, post = post, maxit = 5, seed = i, print = FALSE)
imp.all[[i]] <- imp
}
8. Inspect the imputations. Compare the imputations for blood pressure under the most extreme scenarios with a box-and-whiskers plot. Is this as expected?
For the scenario where \(\delta = 0\) we can plot the first object from the list. This object is the mids
-object that considers imputations under no adjustment.
bwplot(imp.all[[1]])