The Dance of the Missingness Mechanisms

Padova - January 28, 2019

This presentation

has a website:

www.gerkovink.com/Padova/

You can find all related materials, links and references there.

About me

Name: Gerko Vink
Identifies as: Statistical scientist
Studied to be: Psychologist
Landed in: Statistics
Enjoys: Putting easter eggs in software
Pet peeve: Excessively certain Data Scientists
Married, with 2 kids
Favorite guitarist: Jimmy Page
Has mortgage

Outline

In order to fully understand missing data theory, we need to focus on the following

Identify the problem
Recreate the problem
Study the problem’s behavior (effect)
Identify if the effect is indeed a problem

Disclaimer

I will use

mice in R for generating imputations
ampute in mice in R
my own research, preferences and opinions

Most of this presentation is joint work with Rianne Schouten.

People who contributed or whose minds have somehow been picked in parts relevant to this presentation Stef van Buuren, Andrew Gelman, Jeroen Pannekoek, Shahab Jolani, Peter Lugtig and Jaap Brand.

Identifying the problem

The scope

Bold statement #1

\[\text{EVERYTHING IS A MISSING DATA PROBLEM}\]

Why? there is always some source of information missing.

data may be missing
assumptions are usually made
some form of sampling is used
a theoretical distribution is used for comparison

The solution?

Since we’re statisticians, we can model the uncertainty.

The obvious?

George Box has been attributed the following aphorism \[\text{ALL MODELS ARE WRONG, BUT SOME ARE USEFUL}\]

He also stated \[\text{HOW WRONG DO THEY NEED TO BE TO NOT BE USEFUL?}\]

These remarks are important; because the implications of being wrong are not obvious to everyone.

Correct model

In general, the only correct model is the true data generating model.

If you can recreate the true data generating model; then the resulting sampled measurements of said model are technically flawless and its estimates are unbiased.

The models involved

The divine

The true data generating model

The erroneous

The sampling model
The nonresponse model
The analysis model

The (un)holy grail

The imputation model

The convention

(Un)known implications

Let $Y$ be an incomplete data set and let $R$ be the matrix that stores the missing data in $Y$. Then, if we define $\psi$ to contain the parameters of the missing data model:

MCAR

Probability to be missing is not related to any data \[P(R = 0 \; | \; Y_\text{obs}, Y_\text{mis}, \psi) = P(R = 0 \; | \;\psi)\]

MAR

Probability to be missing depends on known data \[P(R = 0 \; | \;Y_\text{obs}, Y_\text{mis}, \psi) = P(R = 0 \; | \;Y_\text{obs}, \psi)\]

MNAR

Probability to be missing may depend on unknown data also \[P(R = 0 \; | \;Y_\text{obs}, Y_\text{mis}, \psi)\]

In real life

Consequences of missing data

Less information than initially planned
Is there enough statistical power?
Different analyses result in different $n$
Unable to calculate even simple statistics
Systematic bias in the analyses
Are intervals and P-values appropriate?

Bold statement #2

\[\text{REAL LIFE PROBLEMS ARE NEVER MCAR}\]

So far

In general, missing data can severely complicate interpretation and analysis.

Not treating missing data properly may yield invalid inferences

Ignoring your missing data is never a good idea

Recreating the problem

Why on earth would you ampute?

The quality of a solution obtained by any (multiple) imputation procedure depends on

the statistical properties of the incomplete data
the inference problem with respect to the assumptions
the degree to which an imputation procedure is able to capture these properties when modeling missing values on the given data.

Simulation

When evaluating the statistical properties (and thereby the practical applicability) of new imputation methodology, researchers most often make use of simulation studies.

a complete data set is usually generated from a statistical model
another model is used to induce missingness
a set of evaluation criteria is postulated to evaluate the performance of the imputation method.

Simple methods

Stef van Buuren considers three univariate amputation examples in his book.

The data are generated as 1000 draws from the bivariate normal distribution $P(Y_1, Y_2)$ with means $\mu_1 = \mu_2 =5$, variances $\sigma_1^2 =\sigma_2^2 = 1$, and covariance $\sigma_{12} = 0.6$. We assume that all values generated are positive. Missing data in $Y_2$ can be created in many ways. Let $R_2$ be the response indicator for $Y_2$.

\[\begin{align} \mathrm{MARRIGHT} &:& \mathrm{logit}(\Pr(R_2=0)) = -5 + Y_1 \tag{3.1}\\ \mathrm{MARMID} &:& \mathrm{logit}(\Pr(R_2=0)) = 0.75 - |Y_1-5| \tag{3.2}\\ \mathrm{MARTAIL} &:& \mathrm{logit}(\Pr(R_2=0)) = -0.75 + |Y_1-5| \tag{3.3} \end{align}\]

Simple methods visualized

$Probability that $Y_2$ is missing as a function of the values of $Y_1$ under three models for the missing data.$

Figure 3.2: Probability that $Y_2$ is missing as a function of the values of $Y_1$ under three models for the missing data.

Danger of simplicity

The induced (MAR) mechanism may be very spurious

Proper missingness

An example data set

set.seed(123)
data <- 
  matrix(c(1, 0.5, 0.3, 
           0.5, 1, 0.5,
           0.3, 0.5, 1), 3, 3) %>%
  mvtnorm::rmvnorm(n = 1000, 
                   mean = c(3, 20, 40), 
                   sigma = .) %>%
  as.data.frame %>%
  round(3)

names(data) <- c("Income", "WorkingYears", "Age")

An example data set

datatable(data, options = list(pageLength = 6))

Multivariate amputation with `mice::ampute`

amp <- ampute(data, 
              patterns =  matrix(c(1, 0, 1,
                                   1, 0, 1, 
                                   0, 0, 1), 
                                 nrow = 3, 
                                 byrow = TRUE),
              freq = c(0.6, 0.2, 0.2),
              prop = 0.5, 
              mech = "MAR")

Multivariate amputation with `mice::ampute`

md.pattern(amp$amp)

##     Age Income WorkingYears    
## 483   1      1            1   0
## 415   1      1            0   1
## 102   1      0            0   2
##       0    102          517 619

So far

We have identified the problem
We can recreate the problem

Now we can investigate: is the problem really a problem?

Study the problem’s behavior

MAR vs MNAR

Simple setup

Comparing complete case analysis (CCA) to multiple imputation:

2 variables, $Y$ and $X$.
varying correlation: $\rho = .1$ to $\rho = .9$
all flavors of missingness that ampute serves
- MCAR
- MAR: left, mid, tails
- MNAR: left, mid, tails
three levels of missingness: 10%, 50% and 90%
study means, correlations and variances
evaluate bias, coverage rates and confidence interval widths

Expectations

In general we expected a small problem to have a small effect

high missingness problems are harder to solve than low missingness problems
the stronger the information, the harder the effect
do high information clusters favor imputation and amputation equally?

Mean of $Y$

Results: 10 % missingness

Results: 90 % missingness

Results: 90% missingness

Correlation between $Y$ and $X$

Results: 10 % missingness

Results: 90 % missingness

Results: 90% missingness

To conclude

Implications

Summary

With the highest possible information structures, we get

the least bias
the best coverage
the smallest confidence intervals

BUT ALSO THE STRONGEST MISSINGNESS

To draw inference from incomplete data, we usually assume that the observed part can serve as a proxy for the missing part

With little information (low correlation AND/OR high missingness), we heavily lean on this assumption
With strong information (high correlation AND/OR low missingness), this focus shifts more towards the observed part.

This presentation

has a website:

www.gerkovink.com/Padova/

You can find all related materials, links and references there.

This presentation

has a website:

About me

Outline

Disclaimer

Identifying the problem

The scope

Bold statement #1

The solution?

The obvious?

Correct model

The models involved

The divine

The erroneous

The (un)holy grail

The convention

(Un)known implications

MCAR

MAR

MNAR

In real life

Consequences of missing data

Bold statement #2

So far

Recreating the problem

Why on earth would you ampute?

Simulation

Simple methods

Simple methods visualized

Danger of simplicity

The induced (MAR) mechanism may be very spurious

Proper missingness

An example data set

An example data set

Multivariate amputation with mice::ampute

Multivariate amputation with mice::ampute

So far

Study the problem’s behavior

MAR vs MNAR

Simple setup

Expectations

Mean of \(Y\)

Results: 10 % missingness

Results: 10 % missingness

Results: 90 % missingness

Results: 90% missingness

Correlation between \(Y\) and \(X\)

Results: 10 % missingness

Results: 10 % missingness

Results: 90 % missingness

Results: 90% missingness

To conclude

Implications

Summary

This presentation

has a website:

Multivariate amputation with `mice::ampute`

Multivariate amputation with `mice::ampute`