This is the last vignette in the series.

The focus of this document is on sensitivity analysis in the context of missing data. The goal of sensitivity analysis is to study the influence that violations of the missingness assumptions have on the obtained inference.


The Leiden data set

The Leiden data set is a subset of 956 members of a very old (85+) cohort in Leiden. Multiple imputation of this data set has been described in Boshuizen et al (1998), Van Buuren et al (1999) and Van Buuren (2012), chapter 7.

The main question is how blood pressure affects mortality risk in the oldest old. We have reasons to mistrust the MAR assumption in this case. In particular, we worried whether the imputations of blood pressure under MAR would be low enough. The sensitivity analysis explores the effect of artificially lowering the imputed blood pressure by deducting an amount of δ from the values imputed under MAR. In order to preserve the relations between the variables, this needs to be done during the iterations.

Unfortunately we cannot share the Leiden data set with you. But we detail the approach below.


1. Open R and load the packages mice, lattice and survival.

set.seed(123)
library("mice")
library("lattice")
library("survival")

2. The leiden data set.

summary(leiden)
##       sexe           lftanam           rrsyst         rrdiast      
##  Min.   :0.0000   Min.   : 85.48   Min.   : 90.0   Min.   : 50.00  
##  1st Qu.:0.0000   1st Qu.: 87.50   1st Qu.:135.0   1st Qu.: 75.00  
##  Median :0.0000   Median : 89.07   Median :150.0   Median : 80.00  
##  Mean   :0.2709   Mean   : 89.78   Mean   :152.9   Mean   : 82.78  
##  3rd Qu.:1.0000   3rd Qu.: 91.52   3rd Qu.:170.0   3rd Qu.: 90.00  
##  Max.   :1.0000   Max.   :103.54   Max.   :260.0   Max.   :140.00  
##                                    NA's   :121     NA's   :126     
##       dwa             survda            alb             chol       
##  Min.   :0.0000   Min.   :   2.0   Min.   :21.00   Min.   : 2.700  
##  1st Qu.:0.0000   1st Qu.: 534.8   1st Qu.:39.00   1st Qu.: 4.800  
##  Median :0.0000   Median :1196.5   Median :41.00   Median : 5.700  
##  Mean   :0.2437   Mean   :1195.4   Mean   :40.77   Mean   : 5.704  
##  3rd Qu.:0.0000   3rd Qu.:1889.0   3rd Qu.:43.00   3rd Qu.: 6.400  
##  Max.   :1.0000   Max.   :2610.0   Max.   :52.00   Max.   :10.900  
##                                    NA's   :229     NA's   :232     
##       mmse            woon      
##  Min.   : 1.00   Min.   :0.000  
##  1st Qu.:21.00   1st Qu.:0.000  
##  Median :26.00   Median :3.000  
##  Mean   :23.67   Mean   :1.775  
##  3rd Qu.:29.00   3rd Qu.:3.000  
##  Max.   :30.00   Max.   :4.000  
##  NA's   :85
str(leiden)
## 'data.frame':    956 obs. of  10 variables:
##  $ sexe   : num  0 0 0 0 0 0 0 1 1 0 ...
##  $ lftanam: num  87.8 87.8 89.1 90.3 87.8 ...
##  $ rrsyst : num  160 140 155 155 110 120 180 135 130 160 ...
##  $ rrdiast: num  100 70 85 90 60 80 75 80 60 90 ...
##  $ dwa    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ survda : num  1082 27 1604 528 1100 ...
##  $ alb    : num  41 NA 41 44 37 NA 42 NA 45 46 ...
##  $ chol   : num  4.4 NA 4.6 3.9 5.3 NA 7.2 NA 5.1 6.5 ...
##  $ mmse   : num  12 9 25 27 14 NA 28 26 30 14 ...
##  $ woon   : num  4 3 0 1 0 3 3 0 4 4 ...
head(leiden)
##   sexe lftanam rrsyst rrdiast dwa survda alb chol mmse woon
## 1    0   87.80    160     100   0   1082  41  4.4   12    4
## 2    0   87.75    140      70   0     27  NA   NA    9    3
## 3    0   89.08    155      85   0   1604  41  4.6   25    0
## 4    0   90.29    155      90   0    528  44  3.9   27    1
## 5    0   87.76    110      60   0   1100  37  5.3   14    0
## 6    0   91.39    120      80   0      5  NA   NA   NA    3
tail(leiden)
##      sexe lftanam rrsyst rrdiast dwa survda alb chol mmse woon
## 1229    1   93.85    130      70   0    523  40  5.3   28    0
## 1230    0   92.20    190      90   0   1182  44  5.8   26    3
## 1232    0   95.02    150      80   0    861  35  5.0   28    0
## 1233    0   88.30    120      60   0    129  42  8.6   21    0
## 1235    1   89.02    140      80   0    374  40  5.2   23    0
## 1236    0   85.70    130      65   0   1744  36  7.2   27    3

3. Perform a dry run (using maxit = 0) in mice. List the number of missing values per variable.

ini <- mice(leiden, maxit = 0)
ini$nmis
##    sexe lftanam  rrsyst rrdiast     dwa  survda     alb    chol    mmse 
##       0       0     121     126       0       0     229     232      85 
##    woon 
##       0

There are 121 missings (NA’s) for rrsyst, 126 missings for rrdiast, 229 missings for alb, 232 missings for chol and 85 missing values for mmse.


4. Study the missing data pattern in more detail using md.pattern() and fluxplot(). The interest here focusses on imputing systolic blood pressure (rrsyst) and diastolic blood pressure (rrdiast).

md.pattern(leiden)
##     sexe lftanam dwa survda woon mmse rrsyst rrdiast alb chol    
## 621    1       1   1      1    1    1      1       1   1    1   0
##   2    1       1   1      1    1    1      1       0   1    1   1
##   1    1       1   1      1    1    1      1       1   0    1   1
##   2    1       1   1      1    1    1      1       1   1    0   1
##  21    1       1   1      1    1    0      1       1   1    1   1
##  72    1       1   1      1    1    1      0       0   1    1   2
## 149    1       1   1      1    1    1      1       1   0    0   2
##   2    1       1   1      1    1    1      0       0   1    0   3
##   2    1       1   1      1    1    1      1       0   0    0   3
##   7    1       1   1      1    1    0      0       0   1    1   3
##  36    1       1   1      1    1    0      1       1   0    0   3
##  20    1       1   1      1    1    1      0       0   0    0   4
##   1    1       1   1      1    1    0      1       0   0    0   4
##  20    1       1   1      1    1    0      0       0   0    0   5
##        0       0   0      0    0   85    121     126 229  232 793
fx <- fluxplot(leiden)

Variables with higher outflux are (potentially) the more powerful predictors. Variables with higher influx depend stronger on the imputation model. When points are relatively close to the diagonal, it indicates that influx and outflux are balanced.

The variables in the upper left corner have the more complete information, so the number of missing data problems for this group is relatively small. The variables in the middle have an outflux between 0.5 and 0.8, which is small. Missing data problems are thus more severe, but potentially this group could also contain important variables. The lower (bottom) variables have an outflux with 0.5 or lower, so their predictive power is limited. Also, this group has a higher influx, and, thus, depend more highly on the imputation model.

If you’d like this information in tabulated form, you can simply ask

fx
##              pobs     influx   outflux      ainb       aout      fico
## sexe    1.0000000 0.00000000 1.0000000       NaN 0.08294979 0.3504184
## lftanam 1.0000000 0.00000000 1.0000000       NaN 0.08294979 0.3504184
## rrsyst  0.8734310 0.09798107 0.5573770 0.7099174 0.05293413 0.2562874
## rrdiast 0.8682008 0.10231550 0.5422446 0.7119048 0.05180723 0.2518072
## dwa     1.0000000 0.00000000 1.0000000       NaN 0.08294979 0.3504184
## survda  1.0000000 0.00000000 1.0000000       NaN 0.08294979 0.3504184
## alb     0.7604603 0.19311053 0.2471627 0.7393013 0.02696011 0.1458047
## chol    0.7573222 0.19573400 0.2383354 0.7396552 0.02610497 0.1422652
## mmse    0.9110879 0.06798221 0.6796974 0.7011765 0.06188289 0.2870264
## woon    1.0000000 0.00000000 1.0000000       NaN 0.08294979 0.3504184

5. The cases with and without blood pressure observed have very different survival rates. Show this.

We can see this easily from the Kaplan-Meier plot.

km <- survfit(Surv(survda/365, 1-dwa) ~ is.na(rrsyst), data = leiden) 
plot(km, 
     lty  = 1, 
     lwd  = 1.5, 
     xlab = "Years since intake",
     ylab = "K-M Survival probability", las=1, 
     col  = c(mdc(4), mdc(5)), 
     mark.time = FALSE)
text(4, 0.7, "BP measured")
text(2, 0.3, "BP missing")

In the next steps we are going to impute rrsyst and rrdiast under two scenarios: MAR and MNAR. We will use the delta adjustment technique described in paragraph 7.2.3 in Van Buuren (2012)


6. Create a \(\delta\) vector that represent the following adjustment values for mmHg: 0 for MAR, and -5, -10, -15, and -20 for MNAR.

delta <- c(0, -5, -10, -15, -20)

The recipe for creating MNAR imputations for \(\delta \neq 0\) uses the post-processing facility of mice. This allows to change the imputations on the fly by deducting a value of \(\delta\) from the values just imputed.


7. Impute the leiden data using the delta adjustment technique. We only have to deduct from rrsyst, because rrdiast will adapt to the changed rrsyst when it is imputed using rrsyst as predictor. Store the five imputed scenarios (adjustment) in a list called imp.all.

imp.all <- vector("list", length(delta))
post <- ini$post
for (i in 1:length(delta)){
  d <- delta[i]
  cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
  post["rrsyst"] <- cmd
  imp <- mice(leiden, post = post, maxit = 5, seed = i, print = FALSE)
  imp.all[[i]] <- imp
}

8. Inspect the imputations. Compare the imputations for blood pressure under the most extreme scenarios with a box-and-whiskers plot. Is this as expected?

For the scenario where \(\delta = 0\) we can plot the first object from the list. This object is the mids-object that considers imputations under no adjustment.

bwplot(imp.all[[1]])