You can find all related materials, links and references there.
Padova - January 28, 2019
To fully understand missing data theory, we need to focus on the following steps:
1. Identify the problem
2. Recreate the problem
3. Study the problem’s behavior (effect)
4. Identify whether the effect is indeed a problem
I will use mice in R for generating imputations, and ampute in mice in R for generating missingness.
Most of this presentation is joint work with Rianne Schouten.
People who contributed, or whose minds have somehow been picked, in parts relevant to this presentation: Stef van Buuren, Andrew Gelman, Jeroen Pannekoek, Shahab Jolani, Peter Lugtig and Jaap Brand.
\[\text{EVERYTHING IS A MISSING DATA PROBLEM}\]
Why? There is always some source of information missing.
Since we’re statisticians, we can model the uncertainty.
The following aphorism has been attributed to George Box: \[\text{ALL MODELS ARE WRONG, BUT SOME ARE USEFUL}\]
He also stated: \[\text{HOW WRONG DO THEY NEED TO BE TO NOT BE USEFUL?}\]
These remarks are important because the implications of being wrong are not obvious to everyone.
In general, the only correct model is the true data generating model.
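Let \(Y\) be an incomplete data set and let \(R\) be the response indicator matrix that stores the locations of the missing data in \(Y\), with \(R = 0\) marking a missing value. Write \(Y_\text{obs}\) and \(Y_\text{mis}\) for the observed and missing parts of \(Y\). Then, if we define \(\psi\) to contain the parameters of the missing data model: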
MCAR (missing completely at random): the probability to be missing is not related to any data \[P(R = 0 \; | \; Y_\text{obs}, Y_\text{mis}, \psi) = P(R = 0 \; | \;\psi)\]
MAR (missing at random): the probability to be missing depends on known data only \[P(R = 0 \; | \;Y_\text{obs}, Y_\text{mis}, \psi) = P(R = 0 \; | \;Y_\text{obs}, \psi)\]
MNAR (missing not at random): the probability to be missing may also depend on unknown data, so the expression does not simplify \[P(R = 0 \; | \;Y_\text{obs}, Y_\text{mis}, \psi)\]
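The three cases above, commonly labelled MCAR, MAR and MNAR, can be made concrete with a small simulation. The following is a minimal base R sketch, not part of the original slides: the variable names `y_obs` and `y_mis`, the helper `draw_r`, the seed, and the logistic coefficients are all assumptions chosen for illustration.

```r
set.seed(2019)
n <- 10000

# Hypothetical two-variable example: y_obs is always observed,
# y_mis is the variable that receives missing values.
y_obs <- rnorm(n)
y_mis <- 0.6 * y_obs + rnorm(n, sd = 0.8)

# Probability of being missing, P(R = 0), under each mechanism
p_mcar <- rep(0.3, n)           # depends on no data
p_mar  <- plogis(-1 + y_obs)    # depends on observed data only
p_mnar <- plogis(-1 + y_mis)    # depends on the value that goes missing

# Draw the response indicator R (1 = observed, 0 = missing)
draw_r <- function(p_missing) 1 - rbinom(length(p_missing), 1, p_missing)
r <- data.frame(mcar = draw_r(p_mcar),
                mar  = draw_r(p_mar),
                mnar = draw_r(p_mnar))

# Mean of y_mis among the observed cases, next to the full-data mean
obs_means <- sapply(r, function(ri) mean(y_mis[ri == 1]))
c(full = mean(y_mis), obs_means)
```

In this setup one would expect the observed-case mean to stay close to the full-data mean under MCAR and to drift away from it under MAR and MNAR, because cases with large values are more likely to be missing.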
\[\text{REAL LIFE PROBLEMS ARE NEVER MCAR}\]
In general, missing data can severely complicate interpretation and analysis.
Not treating missing data properly may yield invalid inferences.
Ignoring your missing data is never a good idea.
The quality of a solution obtained by any (multiple) imputation procedure depends on
When evaluating the statistical properties (and thereby the practical applicability) of new imputation methodology, researchers most often make use of simulation studies.
Stef van Buuren considers three univariate amputation examples in his book.
The data are generated as 1000 draws from the bivariate normal distribution \(P(Y_1, Y_2)\) with means \(\mu_1 = \mu_2 =5\), variances \(\sigma_1^2 =\sigma_2^2 = 1\), and covariance \(\sigma_{12} = 0.6\). We assume that all values generated are positive. Missing data in \(Y_2\) can be created in many ways. Let \(R_2\) be the response indicator for \(Y_2\).
\[\begin{align} \mathrm{MARRIGHT} &:& \mathrm{logit}(\Pr(R_2=0)) = -5 + Y_1 \tag{3.1}\\ \mathrm{MARMID} &:& \mathrm{logit}(\Pr(R_2=0)) = 0.75 - |Y_1-5| \tag{3.2}\\ \mathrm{MARTAIL} &:& \mathrm{logit}(\Pr(R_2=0)) = -0.75 + |Y_1-5| \tag{3.3} \end{align}\]
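The three mechanisms in (3.1) to (3.3) can be reproduced in a few lines. This is a minimal base R sketch under stated assumptions: the bivariate normal is built by hand from two independent standard normals instead of a package call, and the seed and the helper `make_missing` are hypothetical names introduced here.

```r
set.seed(1)
n <- 1000

# Bivariate normal with means 5, unit variances and covariance 0.6
z1 <- rnorm(n)
z2 <- rnorm(n)
y1 <- 5 + z1
y2 <- 5 + 0.6 * z1 + sqrt(1 - 0.6^2) * z2

# P(R2 = 0): probability that Y2 is missing, following (3.1)-(3.3)
p_missing <- list(
  MARRIGHT = plogis(-5 + y1),
  MARMID   = plogis(0.75 - abs(y1 - 5)),
  MARTAIL  = plogis(-0.75 + abs(y1 - 5))
)

# Hypothetical helper: set each element of y to NA with probability p
make_missing <- function(y, p) {
  y[rbinom(length(y), 1, p) == 1] <- NA
  y
}
y2_amp <- lapply(p_missing, make_missing, y = y2)

# Proportion missing, and mean of Y1 among the cases where Y2 is missing
prop_missing    <- sapply(y2_amp, function(y) mean(is.na(y)))
mean_y1_missing <- sapply(y2_amp, function(y) mean(y1[is.na(y)]))
rbind(prop_missing, mean_y1_missing)
```

All three mechanisms depend on \(Y_1\) only, so they are MAR; they differ in where along \(Y_1\) the missingness in \(Y_2\) concentrates (right tail, middle, or both tails).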
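library(mice)      # ampute(), md.pattern()
library(magrittr)  # %>%
library(DT)        # datatable()

set.seed(123)
# Covariance matrix for the three variables
data <- matrix(c(1, 0.5, 0.3,
                 0.5, 1, 0.5,
                 0.3, 0.5, 1), 3, 3) %>%
  mvtnorm::rmvnorm(n = 1000, mean = c(3, 20, 40), sigma = .) %>%
  as.data.frame %>%
  round(3)
names(data) <- c("Income", "WorkingYears", "Age")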
datatable(data, options = list(pageLength = 6))
mice::ampute
amp <- ampute(data,
              patterns = matrix(c(1, 0, 1,
                                  1, 0, 1,
                                  0, 0, 1), nrow = 3, byrow = TRUE),
              freq = c(0.6, 0.2, 0.2),
              prop = 0.5,
              mech = "MAR")
mice::ampute
md.pattern(amp$amp)
##     Age Income WorkingYears
## 483   1      1            1   0
## 415   1      1            0   1
## 102   1      0            0   2
##       0    102          517 619
Now we can investigate: is the problem really a problem?
Comparing complete case analysis (CCA) to multiple imputation:
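A comparison of this kind could be set up as follows. This is a hedged sketch rather than the analysis behind the presentation: it assumes the `mice` and `MASS` packages are installed, uses `ampute()` with its default patterns and weights, and picks an arbitrary regression model, imputation method (`"norm"`) and number of imputations.

```r
library(mice)

set.seed(123)
sigma <- matrix(c(1, 0.5, 0.3,
                  0.5, 1, 0.5,
                  0.3, 0.5, 1), 3, 3)
full <- as.data.frame(MASS::mvrnorm(1000, mu = c(3, 20, 40), Sigma = sigma))
names(full) <- c("Income", "WorkingYears", "Age")

# Amputation with default patterns and weights (an assumption of this sketch)
amp <- ampute(full, prop = 0.5, mech = "MAR")$amp

# Reference: the analysis on the data before amputation
truth <- lm(Income ~ WorkingYears + Age, data = full)

# Complete case analysis: lm() drops incomplete rows by default
cca <- lm(Income ~ WorkingYears + Age, data = amp)

# Multiple imputation: impute, analyze each completed data set, pool
imp <- mice(amp, m = 5, method = "norm", printFlag = FALSE)
fit <- with(imp, lm(Income ~ WorkingYears + Age))
mi  <- summary(pool(fit))

# Coefficient estimates side by side
cbind(truth = coef(truth), cca = coef(cca), mi = mi$estimate)
```

A single run like this only illustrates the workflow; judging bias or coverage requires repeating it over many simulated data sets.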
In general, we expected a small problem to have a small effect.
With the highest possible information structures, we get
BUT ALSO THE STRONGEST MISSINGNESS
To draw inference from incomplete data, we usually assume that the observed part can serve as a proxy for the missing part.