Padova - January 28, 2019

This presentation

has a website:

About me

  • Name: Gerko Vink
  • Identifies as: Statistical scientist
  • Studied to be: Psychologist
  • Landed in: Statistics
  • Enjoys: Putting easter eggs in software
  • Pet peeve: Excessively certain Data Scientists
  • Married, with 2 kids
  • Favorite guitarist: Jimmy Page
  • Has mortgage

Outline

In order to fully understand missing data theory, we need to focus on the following

  • Identify the problem

  • Recreate the problem

  • Study the problem’s behavior (effect)

  • Identify if the effect is indeed a problem

Disclaimer

I will use

  • mice in R for generating imputations
  • ampute in mice in R
  • my own research, preferences and opinions

Most of this presentation is joint work with Rianne Schouten.

People who contributed or whose minds have somehow been picked in parts relevant to this presentation Stef van Buuren, Andrew Gelman, Jeroen Pannekoek, Shahab Jolani, Peter Lugtig and Jaap Brand.

Identifying the problem

The scope

Bold statement #1

\[\text{EVERYTHING IS A MISSING DATA PROBLEM}\]

Why? there is always some source of information missing.

  • data may be missing
  • assumptions are usually made
  • some form of sampling is used
  • a theoretical distribution is used for comparison

The solution?

Since we’re statisticians, we can model the uncertainty.

The obvious?

George Box has been attributed the following aphorism \[\text{ALL MODELS ARE WRONG, BUT SOME ARE USEFUL}\]

He also stated \[\text{HOW WRONG DO THEY NEED TO BE TO NOT BE USEFUL?}\]

These remarks are important; because the implications of being wrong are not obvious to everyone.

Correct model

In general, the only correct model is the true data generating model.

  • If you can recreate the true data generating model; then the resulting sampled measurements of said model are technically flawless and its estimates are unbiased.

The models involved

The divine

  • The true data generating model

The erroneous

  • The sampling model
  • The nonresponse model
  • The analysis model

The (un)holy grail

  • The imputation model

The convention

(Un)known implications

Let \(Y\) be an incomplete data set and let \(R\) be the matrix that stores the missing data in \(Y\). Then, if we define \(\psi\) to contain the parameters of the missing data model:

MCAR

Probability to be missing is not related to any data \[P(R = 0 \; | \; Y_\text{obs}, Y_\text{mis}, \psi) = P(R = 0 \; | \;\psi)\]

MAR

Probability to be missing depends on known data \[P(R = 0 \; | \;Y_\text{obs}, Y_\text{mis}, \psi) = P(R = 0 \; | \;Y_\text{obs}, \psi)\]

MNAR

Probability to be missing may depend on unknown data also \[P(R = 0 \; | \;Y_\text{obs}, Y_\text{mis}, \psi)\]

In real life

Consequences of missing data

  1. Less information than initially planned
  2. Is there enough statistical power?
  3. Different analyses result in different \(n\)
  4. Unable to calculate even simple statistics
  5. Systematic bias in the analyses
  6. Are intervals and P-values appropriate?

Bold statement #2

\[\text{REAL LIFE PROBLEMS ARE NEVER MCAR}\]

So far

In general, missing data can severely complicate interpretation and analysis.

Not treating missing data properly may yield invalid inferences

Ignoring your missing data is never a good idea

Recreating the problem

Why on earth would you ampute?

The quality of a solution obtained by any (multiple) imputation procedure depends on

  1. the statistical properties of the incomplete data
  2. the inference problem with respect to the assumptions
  3. the degree to which an imputation procedure is able to capture these properties when modeling missing values on the given data.

Simulation

When evaluating the statistical properties (and thereby the practical applicability) of new imputation methodology, researchers most often make use of simulation studies.

  1. a complete data set is usually generated from a statistical model
  2. another model is used to induce missingness
  3. a set of evaluation criteria is postulated to evaluate the performance of the imputation method.

Simple methods

Stef van Buuren considers three univariate amputation examples in his book.

The data are generated as 1000 draws from the bivariate normal distribution \(P(Y_1, Y_2)\) with means \(\mu_1 = \mu_2 =5\), variances \(\sigma_1^2 =\sigma_2^2 = 1\), and covariance \(\sigma_{12} = 0.6\). We assume that all values generated are positive. Missing data in \(Y_2\) can be created in many ways. Let \(R_2\) be the response indicator for \(Y_2\).

\[\begin{align} \mathrm{MARRIGHT} &:& \mathrm{logit}(\Pr(R_2=0)) = -5 + Y_1 \tag{3.1}\\ \mathrm{MARMID} &:& \mathrm{logit}(\Pr(R_2=0)) = 0.75 - |Y_1-5| \tag{3.2}\\ \mathrm{MARTAIL} &:& \mathrm{logit}(\Pr(R_2=0)) = -0.75 + |Y_1-5| \tag{3.3} \end{align}\]

Simple methods visualized

Probability that \(Y_2\) is missing as a function of the values of \(Y_1\) under three models for the missing data.

Figure 3.2: Probability that \(Y_2\) is missing as a function of the values of \(Y_1\) under three models for the missing data.

Danger of simplicity

The induced (MAR) mechanism may be very spurious

HTML5 Icon

Proper missingness

HTML5 Icon

An example data set

set.seed(123)
data <- 
  matrix(c(1, 0.5, 0.3, 
           0.5, 1, 0.5,
           0.3, 0.5, 1), 3, 3) %>%
  mvtnorm::rmvnorm(n = 1000, 
                   mean = c(3, 20, 40), 
                   sigma = .) %>%
  as.data.frame %>%
  round(3)

names(data) <- c("Income", "WorkingYears", "Age")

An example data set

datatable(data, options = list(pageLength = 6))

Multivariate amputation with mice::ampute

amp <- ampute(data, 
              patterns =  matrix(c(1, 0, 1,
                                   1, 0, 1, 
                                   0, 0, 1), 
                                 nrow = 3, 
                                 byrow = TRUE),
              freq = c(0.6, 0.2, 0.2),
              prop = 0.5, 
              mech = "MAR")

Multivariate amputation with mice::ampute

md.pattern(amp$amp)

##     Age Income WorkingYears    
## 483   1      1            1   0
## 415   1      1            0   1
## 102   1      0            0   2
##       0    102          517 619

So far

  1. We have identified the problem
  2. We can recreate the problem

Now we can investigate: is the problem really a problem?

Study the problem’s behavior

MAR vs MNAR

HTML5 Icon

Simple setup

Comparing complete case analysis (CCA) to multiple imputation:

  • 2 variables, \(Y\) and \(X\).
  • varying correlation: \(\rho = .1\) to \(\rho = .9\)
  • all flavors of missingness that ampute serves
    • MCAR
    • MAR: left, mid, tails
    • MNAR: left, mid, tails
  • three levels of missingness: 10%, 50% and 90%
  • study means, correlations and variances
  • evaluate bias, coverage rates and confidence interval widths

Expectations

In general we expected a small problem to have a small effect

  • high missingness problems are harder to solve than low missingness problems
  • the stronger the information, the harder the effect
  • do high information clusters favor imputation and amputation equally?

Mean of \(Y\)

Results: 10 % missingness

HTML5 Icon

Results: 10 % missingness

HTML5 Icon

Results: 90 % missingness

HTML5 Icon

Results: 90% missingness

HTML5 Icon

Correlation between \(Y\) and \(X\)

Results: 10 % missingness

HTML5 Icon

Results: 10 % missingness

HTML5 Icon

Results: 90 % missingness

HTML5 Icon

Results: 90% missingness

HTML5 Icon

To conclude

Implications

HTML5 Icon

Summary

With the highest possible information structures, we get

  • the least bias
  • the best coverage
  • the smallest confidence intervals

BUT ALSO THE STRONGEST MISSINGNESS

To draw inference from incomplete data, we usually assume that the observed part can serve as a proxy for the missing part

  • With little information (low correlation AND/OR high missingness), we heavily lean on this assumption
  • With strong information (high correlation AND/OR low missingness), this focus shifts more towards the observed part.

This presentation

has a website: