In this practical, we will move away from the problem of missingness. Rather than imputing data that we do not have, we want to obscure the data that we do have, to prevent disclosure of identifying or sensitive information in the data set. However, obscuring the actual data should not result in the loss of too much information. That is, we want the synthetic, imputed data to contain almost the same amount of information as the original data set, but without accidentally disclosing information about the sampled individuals. It turns out that regardless of whether we impute to fill in missing values or to overimpute observed data, the procedure is quite similar, and we can use mice (van Buuren and Groothuis-Oudshoorn 2011) to generate the imputations.
In this practical, we will first quickly create a single, complete version of the mice::boys data, and regard this completed data set as the actually observed data. We use this complete data set to create multiple synthetic versions. Again, we start by loading the required packages and setting the seed to ensure reproducibility of the results.
library(mice)
library(ggmice)
library(ggplot2)
library(dplyr)
library(purrr)
library(magrittr)
set.seed(123)
The mice::boys data

By now, you are already quite familiar with the mice::boys data. This data set contains information on the growth of 748 Dutch boys, and contains the following 9 variables:

- age: Decimal age (0-21 years)
- hgt: Height (cm)
- wgt: Weight (kg)
- bmi: Body mass index
- hc: Head circumference (cm)
- gen: Genital Tanner stage (G1-G5)
- phb: Pubic hair (Tanner P1-P6)
- tv: Testicular volume (ml)
- reg: Region (north, east, west, south, city)

As you understand, these data contain quite some sensitive information. Fortunately, all cases are completely anonymous and identification on the basis of the data is impossible. Suppose, however, that the data could be linked to a second data set that does contain identifying information on the sampled individuals. In that case, it would be inappropriate (and presumably illegal) to share the research data openly. Yet, we can create a fake alternative that is nearly as informative as the observed data (Volker and Vink 2021), and that allows data users to answer additional research questions without requiring access to the observed data.
As you know, the boys data suffers from quite some missingness, which we will resolve first, before creating a synthetic version of the data. Although single imputation is generally out of the question, we will, for once, overstep our hard limits and create a single completed version of the boys data. Using a single complete data set allows us to focus on synthesization, rather than spending a lot of time on appropriately dealing with the missingness and taking this into account in the synthesis procedure. For now, we thus simply assume that the boys data is completely observed, by imputing it once.
1. Create a completed version of the boys data by using a single imputation with default imputation settings and 20 iterations.

Hint: Do not forget to use passive imputation on bmi and to adjust the predictor matrix accordingly.
# impute bmi passively, so that it stays consistent with hgt and wgt
meth <- make.method(boys)
meth["bmi"] <- "~I(wgt/(hgt/100)^2)"
# do not use bmi to predict its own components hgt and wgt
pred <- make.predictorMatrix(boys)
pred[c("hgt", "wgt"), "bmi"] <- 0
# a single imputation (m = 1) with 20 iterations
mis_imp <- mice(data = boys,
                m = 1,
                predictorMatrix = pred,
                method = meth,
                maxit = 20,
                print = FALSE)
dat <- complete(mis_imp)   # extract the completed data
2. Test whether you have indeed created a fully observed data set, by checking the missing data pattern.
plot_pattern(dat)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
For now, we assume that this completed data set is the actually observed data, so that we can focus on synthesization. When creating synthetic data, one has to take into account that there are two, generally conflicting, aspects to synthesization. On the one hand, one should protect the privacy and the confidentiality of the observations. On the other hand, one would not like to lose the information that was present in the original data. That is, the synthetic data should contain the same information and the same relationships as present in the original data. Increasing the quality of synthetic data generally comes at the cost of greater disclosure risks. Moreover, assessing the quality of the synthesization procedure is still very much work in progress, without clear evaluation guidelines. Therefore, substantial scrutiny is required when synthetic data are indeed made publicly available, because no one would like to find their personal information somewhere on the internet.
Broadly speaking, two methods for creating synthetic data can be distinguished. The first is based on parametric imputation models, which assume that the structure of the data is fixed, and draw synthetic values from a pre-specified probability distribution. That is, after estimating a statistical model, the synthetic data are generated from a probability distribution, without making any further use of the observed data. In general, this procedure is less likely to result in an accidental release of disclosive information. However, these parametric methods are often less capable of capturing the complex nature of real-world data sets.
The subtleties of real-world data are often better reproduced with non-parametric imputation models. Using this approach, a non-parametric model is estimated, resulting in a donor pool out of which a single donor value per observation and per variable is drawn. These models thus reuse the observed data to serve as synthetic data. Accordingly, many of the values that were in the observed data end up in the synthetic data. However, because these observed values are generally combined in unique ways, it is generally not possible to link this information to the original respondents. The non-parametric procedures often yield better inferences, while still being able to prevent disclosure risk (although more research into measures to quantify the remaining risks is required). Therefore, this practical will showcase how to generate synthetic data using one such non-parametric method: classification and regression trees (CART; Breiman et al. 1984).
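To make this distinction concrete, consider the following minimal sketch of both approaches in mice. The toy subset and the method choices here are purely illustrative, not the settings we use in the remainder of this practical (the where-matrix used here is explained right below).

# illustration only: fully synthesize a complete toy subset of the boys data
toy <- na.omit(boys[, c("age", "hgt")])
all_cells <- matrix(TRUE, nrow(toy), ncol(toy))
# parametric: synthetic values are drawn from a (Bayesian) linear model
syn_param <- mice(toy, method = "norm", where = all_cells,
                  m = 1, maxit = 1, print = FALSE)
# non-parametric: synthetic values are drawn from CART donor pools
syn_cart <- mice(toy, method = "cart", where = all_cells,
                 m = 1, maxit = 1, print = FALSE)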
When creating synthetic data, it is important to specify which of the values are to be replaced by synthetic values. Theoretically, it is possible to only impute a subset of the data, for example those values that bear the highest risk of being disclosive (e.g., the 1% highest incomes). Yet, in general, it is safest to replace all values, because this complicates the linkage of synthetic data with other data sources. In mice, the values that are to be replaced can be specified using the where-matrix.
3. Create a matrix of the same dimensions as the boys data, in which all cells contain the logical value TRUE.
where_syn <- matrix(data = TRUE,
nrow = nrow(boys),
ncol = ncol(boys))
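Alternatively, mice offers the convenience function make.where() for this purpose; if your version of mice provides it, the following one-liner should yield an equivalent matrix.

# equivalent to the matrix() call above: flag every cell for overimputation
where_syn <- make.where(boys, keyword = "all")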
After specifying the where-matrix, we can proceed with specifying cart as the imputation method using the method parameter.
Note, however, that passive imputation does not work when observed values are overimputed. The reason for this is that all imputation methods, including passive imputation, rely on the observed data if these are present. Passive imputation would thus result in bmi values that are determined by the observed wgt and hgt values of each observation, rather than by the synthetic values for these variables, which is not what we want. Therefore, we make use of the possibility to post-process the imputations, such that the relationship between the synthetic hgt and wgt values is preserved in bmi.
4. Create a post-processing object that ensures that imputations for bmi are calculated on the basis of its deterministic relationship with hgt and wgt.
post <- make.post(dat)
# overwrite each imputed bmi value with the value implied by the synthetic
# hgt and wgt; within mice, imp[[j]][, i] holds the imputations for
# variable j in imputation i
post["bmi"] <- "imp[[j]][, i] <- imp[['wgt']][, i] / (imp[['hgt']][, i] / 100)^2"
Using this code, we overwrite the initially imputed bmi-values, to make sure that the relationship with hgt and wgt is preserved.
Now that we have taken care of these issues, we can create a synthetic version of the completed boys data set.
5. Create m = 10 synthetic data sets with mice, using cart as the imputation method, set the complexity parameter to cp = 1e-08, and specify the where-parameter, the post-parameter and the predictorMatrix-parameter as the objects just created.

Note: When creating synthetic data, a single iteration is sufficient.
syn <- mice(data = dat,
            where = where_syn,      # overimpute every cell
            m = 10,
            maxit = 1,              # a single iteration suffices for synthesis
            method = "cart",
            predictorMatrix = pred,
            post = post,            # restore the deterministic bmi relationship
            cp = 1e-08,             # a small cp grows deep trees with refined donor pools
            print = FALSE)
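Before moving on, a quick sanity check (a sketch, not part of the exercise): the post-processing step should have preserved the deterministic bmi relationship in the synthetic data.

# in the synthetic data, bmi should equal wgt / (hgt / 100)^2
complete(syn, 1) %>%
  summarise(max_diff = max(abs(bmi - wgt / (hgt / 100)^2)))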
See, creating the synthetic data is a piece of cake. However, after creating the synthetic data, we must assess its quality in terms of data utility and disclosure risk. Quality control is conveniently performed using visual methods, and can be done using the package ggmice (Oberman 2022).
The quality of synthetic data sets can be assessed on multiple levels and in multiple different ways. Starting on a univariate level, the distributions of the synthetic data sets can be compared with the distribution of the observed data. For the continuous variables, this can be done by comparing the densities of the synthetic data sets with the density of the observed data.
6. Create a density plot for each numeric variable with ggmice(), mapping the variables to the x-axis, and using the function geom_density().
ggmice(data = syn, mapping = aes(x = age)) +
geom_density()
ggmice(data = syn, mapping = aes(x = hgt)) +
geom_density()
ggmice(data = syn, mapping = aes(x = wgt)) +
geom_density()
ggmice(data = syn, mapping = aes(x = bmi)) +
geom_density()
ggmice(data = syn, mapping = aes(x = hc)) +
geom_density()
ggmice(data = syn, mapping = aes(x = tv)) +
geom_density()
You can do the same in one go, using some functional programming and making use of the patchwork::wrap_plots() function to stitch the figures together.
cont_vars <- c('age', 'hgt', 'wgt', 'bmi', 'hc', 'tv')
cont_vars %>%
map(~ ggmice(data = syn,
mapping = aes_string(x = .x)) +
geom_density()) %>%
patchwork::wrap_plots()
Generally speaking, the synthetic data looks quite similar to the observed data, although some deviations occur. These deviations mainly occur because our synthetic data contains 10 times as many observations as the observed data (because we used m = 10). The consequence is that the figures for the synthetic data contain a higher level of detail, and thus appear to be less “smooth”. This can be dealt with in two ways. First, the bandwidth of the densities can be fixed to the bandwidth of the observed data, such that both the observed and synthetic data densities are equally smooth.
# fix the density bandwidths to those estimated on the observed data
bandwidths <- map_dbl(cont_vars, ~ dat[, .x] %>% bw.nrd0)
names(bandwidths) <- cont_vars
cont_vars %>%
map(~ ggmice(data = syn,
mapping = aes_string(x = .x)) +
geom_density(bw = bandwidths[.x])) %>%
patchwork::wrap_plots()
If the densities use the same bandwidth, we see that the observed and the synthetic data are, on average over all 10 imputed sets, nearly identical.
A second solution is to plot the densities for each imputed set separately. This also shows how much variation there is between the imputed data sets.
7. Create a density plot for each numeric variable with ggmice(), mapping these variables to the x-axis, using the function geom_density(), and make sure that each imputed set obtains its own density.

Hint: The code ggmice(syn, mapping = aes(x = VARIABLE, group = .imp)) creates the ggmice object per imputed set.
ggmice(data = syn, mapping = aes(x = age, group = .imp)) +
geom_density()
ggmice(data = syn, mapping = aes(x = hgt, group = .imp)) +
geom_density()
ggmice(data = syn, mapping = aes(x = wgt, group = .imp)) +
geom_density()
ggmice(data = syn, mapping = aes(x = bmi, group = .imp)) +
geom_density()
ggmice(data = syn, mapping = aes(x = hc, group = .imp)) +
geom_density()
ggmice(data = syn, mapping = aes(x = tv, group = .imp)) +
geom_density()
And, doing the same in one go, yields
cont_vars %>%
map(~ ggmice(data = syn,
mapping = aes_string(x = .x, group = '.imp')) +
geom_density()) %>%
patchwork::wrap_plots()
On a univariate level, the synthetic data of the continuous variables looks very similar to the observed data. However, the data also contains several categorical variables. For these variables, we can compare the observed and synthetic data on a univariate level by using bar plots.
8. Create a bar plot using geom_bar() for each categorical variable in the data, with one bar per observed category, and one bar for each category over all synthetic sets combined.

Hint: Within ggmice(), set mapping = aes(x = VARIABLE, group = .where), and within geom_bar(), set mapping = aes(y = ..prop..) and position = position_dodge2() to make sure the bars are comparable.
ggmice(data = syn, mapping = aes(x = gen, group = .where)) +
geom_bar(mapping = aes(y = ..prop..),
position = position_dodge2(),
fill = "transparent")
ggmice(data = syn, mapping = aes(x = phb, group = .where)) +
geom_bar(mapping = aes(y = ..prop..),
position = position_dodge2(),
fill = "transparent")
ggmice(data = syn, mapping = aes(x = reg, group = .where)) +
geom_bar(mapping = aes(y = ..prop..),
position = position_dodge2(),
fill = "transparent")
cat_vars <- c("gen", "phb", "reg")
cat_vars %>%
map(~ ggmice(data = syn,
mapping = aes_string(x = .x, group = '.where')) +
geom_bar(mapping = aes(y = ..prop..),
position = position_dodge2(),
fill = "transparent")) %>%
patchwork::wrap_plots()
Similarly to the continuous variables, the synthetic categorical variables are, on average, almost identical to the observed categorical variables. To get a feeling for the variability between the synthetic data sets, we do the same thing for each synthetic set.
9. Create a bar plot using geom_bar() for each categorical variable in the data, with one bar for the observed category, and one bar per category per synthetic data set.

Hint: Now set the group parameter to .imp instead of .where.
ggmice(data = syn, mapping = aes(x = gen, group = .imp)) +
geom_bar(mapping = aes(y = ..prop..),
position = position_dodge2(),
fill = "transparent")
ggmice(data = syn, mapping = aes(x = phb, group = .imp)) +
geom_bar(mapping = aes(y = ..prop..),
position = position_dodge2(),
fill = "transparent")
ggmice(data = syn, mapping = aes(x = reg, group = .imp)) +
geom_bar(mapping = aes(y = ..prop..),
position = position_dodge2(),
fill = "transparent")
Or in one go:
cat_vars %>%
map(~ ggmice(syn, mapping = aes_string(.x, group = '.imp')) +
geom_bar(mapping = aes(y = ..prop..),
position = position_dodge2(),
fill = "transparent")) %>%
patchwork::wrap_plots()
Similarly to the continuous variables, the synthetic categorical data looks very much like the observed data. This means that, at least on a univariate level, we did a good job in creating the synthetic data.
Now that we know that the synthetic data and the observed data are almost identical on a univariate level, we assess the quality of the synthetic data on the next level, which concerns the bivariate relationships between the variables. We can assess the bivariate relationships by comparing the correlation matrix of the observed data with the correlation matrices of the synthetic data. As you know from Practical J, calculating the correlations between the variables in the synthetic data sets requires some additional steps.
10. Calculate the correlations between the continuous variables in the actual data, and store these in the object obs_cor.
obs_cor <- dat %>%
select(where(is.numeric)) %>%
cor()
The next step is to calculate the correlations between the variables in the synthetic data. Pooling these correlations requires us to first apply Fisher's transformation to the correlations in each synthetic data set. Subsequently, the average of the Fisher-transformed correlation matrices can be calculated, and the result can be transformed back to the original scale using the back-transformation.
11. Calculate the correlations between the continuous variables in each synthetic data set, apply Fisher’s transformation, and average the estimates over all synthetic data sets, before back-transforming the correlations.
# Fisher's z-transformation and its inverse
fisher.trans <- function(x) 1/2 * log((1 + x) / (1 - x))
fisher.backtrans <- function(x) (exp(2 * x) - 1) / (exp(2 * x) + 1)
syn_cor <- syn %>%
  mice::complete("all") %>%            # list with the 10 synthetic sets
  map(select, where(is.numeric)) %>%   # keep the continuous variables
  map(cor) %>%                         # correlation matrix per set
  map(fisher.trans) %>%                # transform before averaging
  {Reduce("+", .) / length(.)} %>%     # element-wise average
  fisher.backtrans()                   # back to the correlation scale
diag(syn_cor) <- 1                     # the transformation sends the unit diagonal to Inf; restore it
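As a side note, Fisher's transformation and its back-transformation are simply the inverse hyperbolic tangent and the hyperbolic tangent, so the two helper functions could equivalently be written with base R functions:

# fisher.trans() is atanh(); fisher.backtrans() is tanh()
all.equal(fisher.trans(0.5), atanh(0.5))
all.equal(fisher.backtrans(0.3), tanh(0.3))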
Now that we have the correct correlations between the numeric variables, we can compare the synthetic correlations with the observed correlations by looking at the difference between the two.
12. Subtract the pooled synthetic data correlations from the observed data correlations, and inspect the differences.
obs_cor - syn_cor
## age hgt wgt bmi hc tv
## age 0.0000000000 0.002418332 0.003203875 0.03559779 -0.0003408397 0.008674584
## hgt 0.0024183323 0.000000000 0.004825015 0.04829597 0.0012989533 0.003506654
## wgt 0.0032038745 0.004825015 0.000000000 0.02977120 0.0023563788 0.004448757
## bmi 0.0355977947 0.048295974 0.029771200 0.00000000 0.0355236753 0.043983937
## hc -0.0003408397 0.001298953 0.002356379 0.03552368 0.0000000000 0.005770587
## tv 0.0086745841 0.003506654 0.004448757 0.04398394 0.0057705869 0.000000000
Although there are some differences between the correlations, these differences are pretty small. The largest differences occur for the relationships with bmi, because the correlations between all variables and bmi in the synthetic data tend to be slightly underestimated. For all practical purposes, these differences seem negligible.
Additionally, it might be the case that you already know which analyses are to be run on the synthetic data. This can be the case if you are not allowed to share the original research data to reproduce the analyses of a paper you wrote, but instead distribute a synthetic version of this data. Consequently, reviewers might verify whether they obtain similar results on a synthetic version of the data. Although this does not ensure exact reproducibility, reviewers can at least verify whether the way of handling the data makes sense, and whether different operationalizations would yield wildly different results. In fact, this approach has already been applied in practice.
13. Run a regression model on the original data, in which testicular volume is regressed on age, height, BMI and genital Tanner stage (i.e., tv ~ age + hgt + bmi + gen), and store the output in the object obs_fit.
obs_fit <- lm(tv ~ age + hgt + bmi + gen, dat)
summary(obs_fit)
##
## Call:
## lm(formula = tv ~ age + hgt + bmi + gen, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6615 -1.6663 0.0662 1.1143 13.8856
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.58191 1.16941 3.918 9.75e-05 ***
## age 0.44203 0.08873 4.982 7.86e-07 ***
## hgt -0.01164 0.01174 -0.991 0.3218
## bmi 0.12201 0.04948 2.466 0.0139 *
## gen.L 10.17909 0.39942 25.484 < 2e-16 ***
## gen.Q 2.49289 0.30265 8.237 7.99e-16 ***
## gen.C -0.44385 0.27909 -1.590 0.1122
## gen^4 0.63280 0.36000 1.758 0.0792 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.087 on 740 degrees of freedom
## Multiple R-squared: 0.8548, Adjusted R-squared: 0.8534
## F-statistic: 622.4 on 7 and 740 DF, p-value: < 2.2e-16
14. Run the same regression model on the synthetic data sets, and pool the results, but use Reiter's (2003) pooling rules by specifying rule = "reiter2003" when pooling the regressions.
syn_fit <- syn %$% lm(tv ~ age + hgt + bmi + gen)
summary(pool(syn_fit, rule = "reiter2003"))
## term estimate std.error statistic df p.value
## 1 (Intercept) 3.937505145 1.35651715 2.9026579 809.7099 3.800579e-03
## 2 age 0.449623962 0.10393173 4.3261474 313.9035 2.041631e-05
## 3 hgt -0.006581607 0.01362608 -0.4830156 383.3484 6.293604e-01
## 4 bmi 0.112137056 0.05236926 2.1412763 3171.1510 3.232777e-02
## 5 gen.L 9.419481185 0.46109362 20.4285655 901.9920 0.000000e+00
## 6 gen.Q 2.207279651 0.35900500 6.1483257 909.4742 1.170964e-09
## 7 gen.C -0.835419953 0.32524449 -2.5685907 2592.9347 1.026680e-02
## 8 gen^4 -0.108278085 0.44336982 -0.2442162 406.1461 8.071867e-01
As you can see, the results are very similar and would lead to the same conclusions, except for the gen.C dummy, which is something that can always happen. In fact, it would be quite remarkable if all coefficients turned out to be exactly equal. Such discrepancies occur every now and then when creating synthetic data, and only indicate that there remains considerable uncertainty with respect to the “true” parameter value.
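To eyeball this more directly, you could place the observed and pooled synthetic estimates side by side. A minimal sketch, assuming the obs_fit and syn_fit objects from the previous exercises:

# compare the coefficients term by term
data.frame(observed  = coef(obs_fit),
           synthetic = summary(pool(syn_fit, rule = "reiter2003"))$estimate)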
As a final check of the quality of the data, it is possible to try to predict whether an observation stems from the observed data set or from the synthetic versions. This can be done using classification methods, such as, for example, a logistic regression model. If the synthetic data differs from the observed data in important respects, a classification method may be able to pick this up, and identify in which respects the synthetic data are flawed.
To do this, we first need to stack the observed and the synthetic data, before we can actually predict which of the two sets an observation stems from.
15. Stack the observed and synthetic data sets, and create an indicator for the synthetic data sets.

Hint: You can use the function complete() from the mice-package to stack the data sets, and mutate() from dplyr to create the indicator.
obs_syn <- complete(syn, action = "long", include = TRUE) %>%
mutate(synthetic = factor(ifelse(.imp == 0, "Observed", "Synthetic")))
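A quick check on the stacked object (a sketch): the observed data should contribute 748 rows, and the ten synthetic sets together 10 × 748 = 7480 rows.

table(obs_syn$synthetic)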
16. Predict whether an observation comes from the observed or the synthetic data on the basis of all variables, using a logistic regression model (glm(formula, family = "binomial", data)).
glm(synthetic ~ age + hgt + wgt + bmi + hc + gen + phb + tv + reg,
family = "binomial",
data = obs_syn) %>%
summary()
##
## Call:
## glm(formula = synthetic ~ age + hgt + wgt + bmi + hc + gen +
## phb + tv + reg, family = "binomial", data = obs_syn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2369 0.4291 0.4353 0.4412 0.4644
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.9215539 0.7529143 2.552 0.0107 *
## age -0.0030162 0.0306279 -0.098 0.9216
## hgt 0.0037530 0.0074303 0.505 0.6135
## wgt -0.0082609 0.0117076 -0.706 0.4804
## bmi 0.0265405 0.0355807 0.746 0.4557
## hc -0.0043669 0.0199890 -0.218 0.8271
## gen.L 0.0268195 0.2082120 0.129 0.8975
## gen.Q 0.0003555 0.1200523 0.003 0.9976
## gen.C 0.0100122 0.1000630 0.100 0.9203
## gen^4 0.0014627 0.1228931 0.012 0.9905
## phb.L 0.0565471 0.2394665 0.236 0.8133
## phb.Q -0.0197587 0.1291751 -0.153 0.8784
## phb.C -0.0273447 0.1124292 -0.243 0.8078
## phb^4 0.0048153 0.1144740 0.042 0.9664
## phb^5 0.0113901 0.1317678 0.086 0.9311
## tv -0.0008432 0.0117165 -0.072 0.9426
## regeast -0.0004291 0.1465205 -0.003 0.9977
## regwest -0.0353635 0.1386503 -0.255 0.7987
## regsouth 0.0060215 0.1427906 0.042 0.9664
## regcity 0.0163561 0.1730244 0.095 0.9247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5013.1 on 8227 degrees of freedom
## Residual deviance: 5012.0 on 8208 degrees of freedom
## AIC: 5052
##
## Number of Fisher Scoring iterations: 5
None of the estimates are statistically different from zero. This indicates that we cannot model the outcome (belonging to a synthetic or original case) successfully, which shows that the observed and synthetic data are very similar.
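As an additional informal check (a sketch, not a formal utility measure), we could inspect the fitted probabilities of this model: if the observed and synthetic data are truly indistinguishable, the fitted probabilities should all hover around the baseline proportion of synthetic rows (7480/8228 ≈ 0.91).

fit <- glm(synthetic ~ age + hgt + wgt + bmi + hc + gen + phb + tv + reg,
           family = "binomial", data = obs_syn)
summary(fitted(fit))                    # fitted probabilities of being synthetic
mean(obs_syn$synthetic == "Synthetic")  # baseline proportion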
Now that we have assessed the quality of the synthetic data in terms of its similarities with the observed data, it is good to do a final check on the privacy protection the synthetic data offers. In fact, a synthetic data set that is exactly the same as the observed data would yield perfect inferential characteristics, but would nevertheless be inappropriate, as it does not provide any protection of the privacy and confidentiality of the participants. One way to check this is to look for observed cases that are duplicated in the synthetic data.
17. Append the original data to the synthetic data, and check whether some of the values in the original data also occur in the synthetic data.

Hint: Start with the synthetic data, remove the variables .imp and .id, and append the original data to it. Subsequently, you can use duplicated() and which() to check whether (and if so, which) observations occur repeatedly.
complete(syn, "long") %>%
  select(-c(.imp, .id)) %>%  # drop mice's bookkeeping columns
  bind_rows(dat) %>%         # append the original data
  duplicated() %>%
  which()
## integer(0)
None of the observations occur repeatedly, so we have not accidentally copied any of the “true” observations into the synthetic sets. This provides some safeguard against accidentally releasing sensitive information. However, if the data contains really sensitive information, this might not be enough, and one could, for example, check whether the synthetic data differs from the observed data along multiple dimensions (i.e., variables), as sketched below. Such additional checks depend on the problem at hand. Additionally, one might want to take additional measures against accidentally disclosing information about observations, for example by drawing some of the variables from a parametric distribution. Even before distributing synthetic data, think wisely about whether there may remain any disclosure risks with respect to the data that will be distributed.
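For instance, a minimal sketch of such a check could count how many observed cases share a combination of values with at least one synthetic row on a subset of quasi-identifying variables. The choice of age, gen and reg, and the coarsening of age to whole years, are purely illustrative assumptions here:

# synthetic rows, with age coarsened to whole years
syn_long <- complete(syn, "long") %>%
  select(-c(.imp, .id)) %>%
  mutate(age = round(age))
# observed cases matching at least one synthetic row on age, gen and reg
dat %>%
  mutate(age = round(age)) %>%
  semi_join(syn_long, by = c("age", "gen", "reg")) %>%
  nrow()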