In this practical, we will move away from the problem of missingness. Rather than imputing data that we do not have, we want to obscure the data that we do have, to prevent disclosure of identifying or sensitive information in the data set. However, obscuring the actual data should not result in the loss of too much information. That is, we want the synthetic, imputed data to contain almost the same amount of information as the original data set, but without accidentally disclosing information about the sampled individuals. It turns out that regardless of whether we impute to solve for missingness or to overimpute observed data, the procedure is quite similar, and we can use mice (van Buuren and Groothuis-Oudshoorn 2011) to generate the imputations.

In this practical, we will first quickly create a single, complete version of the mice::boys data, and regard this completed data set as the actually observed data.
We use this complete data set to create multiple synthetic versions.

Again, we start by loading the required packages, and setting the seed to ensure reproducibility of the results.

library(mice)
library(ggmice)
library(ggplot2)
library(dplyr)
library(purrr)
library(magrittr)

set.seed(123)

The mice::boys data

By now, you are already quite familiar with the mice::boys data. This data set contains information on the growth of 748 Dutch boys, measured on the following 9 variables:

  • age: Decimal age (0-21 years)
  • hgt: Height (cm)
  • wgt: Weight (kg)
  • bmi: Body mass index
  • hc: Head circumference (cm)
  • gen: Genital Tanner stage (G1-G5)
  • phb: Pubic hair (Tanner P1-P6)
  • tv: Testicular volume (ml)
  • reg: Region (north, east, west, south, city)

As you understand, these data contain a good deal of sensitive information. Fortunately, all cases are completely anonymous and identification on the basis of the data is impossible. Suppose, however, that the data could be linked to a second data set that does contain identifying information on the sampled individuals. It would then be inappropriate (and presumably illegal) to share the research data openly. Yet, we can create a fake alternative that can be nearly as informative as the observed data (Volker and Vink 2021), and that allows data users to answer additional research questions without requiring access to the observed data.


As you know, the boys data suffers from a considerable amount of missingness, which we will solve first, before creating a synthetic version of the data. Although single imputation is generally out of the question, we will, for once, overstep our hard limits and create a single completed version of the boys data. Using a single complete data set allows us to focus on synthesization, rather than spending a lot of time on appropriately dealing with the missingness and taking this into account in the synthesis procedure. For now, we thus simply assume that the boys data is completely observed, by imputing it once.


1. Create a completed version of the boys data by using a single imputation with default imputation settings and 20 iterations.

Hint: Do not forget to use passive imputation on bmi and to adjust the predictor matrix accordingly.


# Passive imputation for bmi: derive it from hgt and wgt
meth <- make.method(boys)
meth["bmi"] <- "~I(wgt/(hgt/100)^2)"

# Do not use bmi to predict hgt and wgt, to avoid circularity
pred <- make.predictorMatrix(boys)
pred[c("hgt", "wgt"), "bmi"] <- 0

mis_imp <- mice(data = boys, 
                m = 1, 
                predictorMatrix = pred,
                method = meth, 
                maxit = 20, 
                print = FALSE)

dat <- complete(mis_imp)

2. Test whether you have indeed created a fully observed data set, by checking the missing data pattern.


plot_pattern(dat)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'


For now, we assume that this completed data set is the actually observed data, so that we can focus on synthesization. When creating synthetic data, one has to take into account that there are two, generally conflicting, aspects to synthesization. On the one hand, one should protect the privacy and the confidentiality of the observations. On the other hand, one would not like to lose the information that was present in the original data. That is, the synthetic data should contain the same information and the same relationships as present in the original data. Increasing the quality of synthetic data generally comes at the cost of greater disclosure risks. Moreover, assessing the quality of the synthesization procedure is still very much work in progress, without clear evaluation guidelines. Therefore, substantial scrutiny is required when synthetic data are indeed made publicly available, because no one would like to find their personal information somewhere on the internet.


Broadly speaking, two methods for creating synthetic data can be distinguished. The first is based on parametric imputation models, which assume that the structure of the data is fixed, and draw synthetic values from a pre-specified probability distribution. That is, after estimating a statistical model, the synthetic data are generated from a probability distribution, without making any further use of the observed data. In general, this procedure is less likely to result in an accidental release of disclosive information. However, parametric methods are often less capable of capturing the complex nature of real-world data sets.
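To make the parametric approach concrete, below is a minimal sketch for a single variable, assuming (purely for illustration) a normal linear model for wgt given age. For simplicity, it ignores parameter uncertainty; a fully "proper" method would also draw the model parameters from their posterior.

# Minimal sketch of parametric synthesis for a single variable: fit a model,
# then draw synthetic values from the implied distribution, without further
# use of the observed wgt values (illustration only, not mice's internals)
fit <- lm(wgt ~ age, data = dat)
syn_wgt <- rnorm(n = nrow(dat),
                 mean = predict(fit, newdata = dat),
                 sd = sigma(fit))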

The subtleties of real-world data are often better reproduced with non-parametric imputation models. Under this approach, a non-parametric model is estimated, resulting in a donor pool from which a single value is drawn for every observation on every variable. These models thus reuse the observed values to serve as synthetic values. Accordingly, many of the values that were in the observed data end up in the synthetic data. However, because these observed values are generally recombined in unique ways, it is generally not possible to link them to the original respondents. Non-parametric procedures often yield better inferences, while still being able to limit disclosure risk (although more research into measures to quantify the remaining risks is required). Therefore, this practical will showcase how to generate synthetic data using one such non-parametric method: classification and regression trees (CART; Breiman et al. 1984).
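The donor-pool idea behind CART can be sketched as follows (an illustration using rpart, not the exact procedure mice uses internally): fit a tree, and for each observation draw one donor from the observed values in its leaf.

# Sketch of the CART donor-pool mechanism (illustration only)
library(rpart)
tree <- rpart(wgt ~ age + hgt, data = dat, cp = 1e-08)
leaf <- tree$where                         # terminal node of each observation
syn_wgt <- vapply(leaf, function(l) {
  donors <- dat$wgt[leaf == l]             # observed values in the same leaf
  donors[sample.int(length(donors), 1)]    # draw one donor at random
}, numeric(1))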


When creating synthetic data, it is important to specify which of the values are to be replaced by synthetic values. Theoretically, it is possible to only impute a subset of the data, for example those values that bear the highest risk of being disclosive (e.g., the 1% highest incomes). Yet, in general, it is safest to replace all values, because this complicates the linkage of synthetic data with other data sources. In mice, the values that are to be replaced can be specified using the where-matrix.


3. Create a matrix of the same dimensions as the boys data, where all cells contain the logical value TRUE.


where_syn <- matrix(data = TRUE, 
                    nrow = nrow(boys), 
                    ncol = ncol(boys))
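
Alternatively, the same matrix can be obtained with mice's convenience function make.where():

where_syn <- make.where(dat, keyword = "all")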

After specifying the where-matrix, we can proceed by setting cart as the imputation method via the method parameter.

Note, however, that passive imputation does not work when observed values are overimputed. The reason for this is that all imputation methods, including passive imputation, rely on the observed data if these are present. Passive imputation would thus result in bmi values that are determined by the observed wgt and hgt values of each observation, rather than on the synthetic values for these variables, which is not what we want. Therefore, we make use of the possibility to post-process the imputations, such that the relationship between the synthetic hgt and wgt values is preserved in bmi.


4. Create a post-processing object that ensures that imputations for bmi are calculated on the basis of its deterministic relationship with hgt and wgt.


post <- make.post(dat)
post["bmi"] <- "imp[[j]][, i] <- imp[['wgt']][, i] / (imp[['hgt']][, i] / 100)^2"

Using this code, we overwrite the initially imputed bmi-values, to make sure that the relationship with hgt and wgt is preserved.


Now that we have taken care of these issues, we can create a synthetic version of the completed boys data set.


5. Create m = 10 synthetic data sets with mice, using cart as the imputation method, setting the complexity parameter to cp = 1e-08, and passing the objects we just specified to the where, post, and predictorMatrix parameters.

Note: When creating synthetic data, a single iteration is sufficient, because the imputation models are estimated on the (completely observed) original data, so there is no missingness that has to be resolved iteratively.


syn <- mice(data = dat, 
            where = where_syn,
            m = 10, 
            maxit = 1,
            method = "cart",
            predictorMatrix = pred,
            post = post,
            cp = 1e-08,
            print = F)
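
Before moving on, a quick sanity check (one possible verification) confirms that the post-processing indeed preserved the deterministic relationship between bmi, hgt and wgt in every synthetic set; this should return TRUE.

# Check that bmi equals wgt / (hgt / 100)^2 in all synthetic sets
syn %>%
  mice::complete("all") %>%
  map_lgl(~ all(abs(.x$bmi - .x$wgt / (.x$hgt / 100)^2) < 1e-8)) %>%
  all()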

See, creating synthetic data is a piece of cake. However, after creating the synthetic data, we must assess its quality in terms of data utility and disclosure risk. Quality control is conveniently performed using visual methods, and can be done using the package ggmice (Oberman 2022).

The quality of synthetic data sets can be assessed on multiple levels and in multiple ways. Starting on a univariate level, the distributions of the synthetic data sets can be compared with the distribution of the observed data. For the continuous variables, this can be done by comparing the densities of the synthetic data with those of the observed data.


6. Create a density plot for each numeric variable with ggmice(), mapping the variables to the x-axis, and using the function geom_density().


ggmice(data = syn, mapping = aes(x = age)) +
  geom_density()


ggmice(data = syn, mapping = aes(x = hgt)) +
  geom_density()


ggmice(data = syn, mapping = aes(x = wgt)) +
  geom_density()


ggmice(data = syn, mapping = aes(x = bmi)) +
  geom_density()


ggmice(data = syn, mapping = aes(x = hc)) +
  geom_density()


ggmice(data = syn, mapping = aes(x = tv)) +
  geom_density()

You can do the same in one go, using some functional programming, and making use of the patchwork::wrap_plots() function to stitch the figures together.

cont_vars <- c('age', 'hgt', 'wgt', 'bmi', 'hc', 'tv')

cont_vars %>%
  map(~ ggmice(data = syn, 
               mapping = aes_string(x = .x)) + 
        geom_density()) %>%
  patchwork::wrap_plots()


Generally speaking, the synthetic data look quite similar to the observed data, although some deviations occur. These deviations arise mainly because the synthetic data contain 10 times as many observations as the observed data (because we used m = 10). As a consequence, the figures for the synthetic data show a higher level of detail, and thus appear less “smooth”. This can be dealt with in two ways. First, the bandwidth of the densities can be fixed to the bandwidth of the observed data, so that the observed and synthetic densities are equally smooth.

bandwidths <- map_dbl(cont_vars, ~ dat[, .x] %>% bw.nrd0)
names(bandwidths) <- cont_vars

cont_vars %>%
  map(~ ggmice(data = syn, 
               mapping = aes_string(x = .x)) + 
        geom_density(bw = bandwidths[.x])) %>%
  patchwork::wrap_plots()

If the densities use the same bandwidth, we see that, averaged over all 10 synthetic sets, the observed and synthetic data are nearly identical.


A second solution is to plot the densities for each imputed set separately. This also shows how much variation there is between the imputed data sets.


7. Create a density plot for each numeric variable with ggmice(), mapping these variables to the x-axis, using the function geom_density(), and make sure that each imputed set obtains its own density.

Hint: The code ggmice(syn, mapping = aes(x = VARIABLE, group = .imp)) creates the ggmice object per imputed set.


ggmice(data = syn, mapping = aes(x = age, group = .imp)) +
  geom_density()


ggmice(data = syn, mapping = aes(x = hgt, group = .imp)) +
  geom_density()


ggmice(data = syn, mapping = aes(x = wgt, group = .imp)) +
  geom_density()


ggmice(data = syn, mapping = aes(x = bmi, group = .imp)) +
  geom_density()


ggmice(data = syn, mapping = aes(x = hc, group = .imp)) +
  geom_density()


ggmice(data = syn, mapping = aes(x = tv, group = .imp)) +
  geom_density()

And, doing the same in one go, yields

cont_vars %>%
  map(~ ggmice(data = syn, 
               mapping = aes_string(x = .x, group = '.imp')) +
        geom_density()) %>%
  patchwork::wrap_plots()


On a univariate level, the synthetic data for the continuous variables look very similar to the observed data. However, the data also contain several categorical variables. For these variables, we can compare the observed and synthetic data on a univariate level using bar plots.


8. Create a bar plot using geom_bar() for each categorical variable in the data, with one bar per observed category, and one bar for each category over all synthetic sets combined.

Hint: Within ggmice, set mapping = aes(x = VARIABLE, group = .where), and within geom_bar(), set mapping = aes(y = ..prop..) and position = position_dodge() to make sure the bars are comparable.


ggmice(data = syn, mapping = aes(x = gen, group = .where)) +
  geom_bar(mapping = aes(y = ..prop..), 
           position = position_dodge2(), 
           fill = "transparent")

ggmice(data = syn, mapping = aes(x = phb, group = .where)) +
  geom_bar(mapping = aes(y = ..prop..),
           position = position_dodge2(),
           fill = "transparent")

ggmice(data = syn, mapping = aes(x = reg, group = .where)) +
  geom_bar(mapping = aes(y = ..prop..),
           position = position_dodge2(),
           fill = "transparent")

cat_vars <- c("gen", "phb", "reg")

cat_vars %>% 
  map(~ ggmice(data = syn, 
               mapping = aes_string(x = .x, group = '.where')) +
        geom_bar(mapping = aes(y = ..prop..),
                 position = position_dodge2(),
                 fill = "transparent")) %>%
  patchwork::wrap_plots()


As with the continuous variables, the synthetic categorical variables are, on average, almost identical to the observed categorical variables. To get a feel for the variability between the synthetic data sets, we make the same plot for each synthetic set separately.


9. Create a bar plot using geom_bar() for each categorical variable in the data, with one bar per observed category, and one bar per category per synthetic data set.

Hint: Now set the group parameter to .imp instead of .where.


ggmice(data = syn, mapping = aes(x = gen, group = .imp)) +
  geom_bar(mapping = aes(y = ..prop..),
           position = position_dodge2(),
           fill = "transparent")

ggmice(data = syn, mapping = aes(x = phb, group = .imp)) +
  geom_bar(mapping = aes(y = ..prop..),
           position = position_dodge2(),
           fill = "transparent")

ggmice(data = syn, mapping = aes(x = reg, group = .imp)) +
  geom_bar(mapping = aes(y = ..prop..),
           position = position_dodge2(),
           fill = "transparent")

Or in one go:

cat_vars %>%
  map(~ ggmice(syn, mapping = aes_string(.x, group = '.imp')) +
        geom_bar(mapping = aes(y = ..prop..),
                 position = position_dodge2(),
                 fill = "transparent")) %>% 
  patchwork::wrap_plots()

Similarly to the continuous variables, the synthetic categorical data looks very much like the observed data. This means that, at least on a univariate level, we did a good job in creating the synthetic data.


Now that we know that the synthetic data and the observed data are almost identical on a univariate level, we can assess the quality of the synthetic data on the next level: the bivariate relationships between the variables. We can assess these by comparing the correlation matrix of the observed data with the correlation matrices of the synthetic data. As you know from Practical J, calculating the correlations between the variables in the synthetic data requires some additional steps.


10. Calculate the correlations between the continuous variables in the actual data, and store these in the object obs_cor.


obs_cor <- dat %>%
  select(where(is.numeric)) %>%
  cor()

The next step is to calculate the correlations between the variables in the synthetic data. Pooling these correlations requires us to first apply Fisher’s transformation to the correlations in each synthetic data set. Subsequently, the Fisher-transformed correlation matrices can be averaged and transformed back to the original scale.


11. Calculate the correlations between the continuous variables in each synthetic data set, apply Fisher’s transformation, and average the estimates over all synthetic data sets, before back-transforming the correlations.


fisher.trans <- function(x) 1/2 * log((1 + x) / (1 - x))
fisher.backtrans <- function(x) (exp(2 * x) - 1) / (exp(2 * x) + 1)

syn_cor <- syn %>%
  mice::complete("all") %>% 
  map(select, where(is.numeric)) %>% 
  map(cor) %>%
  map(fisher.trans) %>% 
  {Reduce("+", .) / length(.)} %>% 
  fisher.backtrans()

diag(syn_cor) <- 1
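
Note that Fisher’s transformation and its inverse are simply the inverse hyperbolic tangent and the hyperbolic tangent, so the same pooled correlations can be obtained with base R’s atanh() and tanh():

# Equivalent pooling using base R's hyperbolic tangent functions
syn_cor_alt <- syn %>%
  mice::complete("all") %>%
  map(select, where(is.numeric)) %>%
  map(cor) %>%
  map(atanh) %>%
  {Reduce("+", .) / length(.)} %>%
  tanh()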

Now that we have the correct correlations between the numeric variables, we can compare the synthetic correlations with the observed correlations by looking at the difference between the two.


12. Subtract the pooled synthetic data correlations from the observed data correlations, and inspect the differences.


obs_cor - syn_cor
##               age         hgt         wgt        bmi            hc          tv
## age  0.0000000000 0.002418332 0.003203875 0.03559779 -0.0003408397 0.008674584
## hgt  0.0024183323 0.000000000 0.004825015 0.04829597  0.0012989533 0.003506654
## wgt  0.0032038745 0.004825015 0.000000000 0.02977120  0.0023563788 0.004448757
## bmi  0.0355977947 0.048295974 0.029771200 0.00000000  0.0355236753 0.043983937
## hc  -0.0003408397 0.001298953 0.002356379 0.03552368  0.0000000000 0.005770587
## tv   0.0086745841 0.003506654 0.004448757 0.04398394  0.0057705869 0.000000000

Although there are some differences between the correlations, these differences are pretty small. The largest differences occur for the relationships with bmi, because the correlations between all variables and bmi in the synthetic data tend to be slightly underestimated. For all practical purposes, these differences seem negligible.


Additionally, it might be the case that you already know which analyses will be run on the synthetic data. This can be the case if you are not allowed to share the original research data to reproduce the analyses of a paper you wrote, but instead distribute a synthetic version of the data. Reviewers might then verify whether they obtain similar results on the synthetic version. Although this does not ensure exact reproducibility, reviewers can at least verify whether the way of handling the data makes sense, and whether different operationalizations would yield wildly different results. In fact, this approach has already been applied in practice.


13. Run a regression model on the original data, in which testicular volume is regressed on age, height, BMI and genital Tanner stage (i.e., tv ~ age + hgt + bmi + gen), and store the output in the object obs_fit.


obs_fit <- lm(tv ~ age + hgt + bmi + gen, dat)

summary(obs_fit)
## 
## Call:
## lm(formula = tv ~ age + hgt + bmi + gen, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6615  -1.6663   0.0662   1.1143  13.8856 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.58191    1.16941   3.918 9.75e-05 ***
## age          0.44203    0.08873   4.982 7.86e-07 ***
## hgt         -0.01164    0.01174  -0.991   0.3218    
## bmi          0.12201    0.04948   2.466   0.0139 *  
## gen.L       10.17909    0.39942  25.484  < 2e-16 ***
## gen.Q        2.49289    0.30265   8.237 7.99e-16 ***
## gen.C       -0.44385    0.27909  -1.590   0.1122    
## gen^4        0.63280    0.36000   1.758   0.0792 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.087 on 740 degrees of freedom
## Multiple R-squared:  0.8548, Adjusted R-squared:  0.8534 
## F-statistic: 622.4 on 7 and 740 DF,  p-value: < 2.2e-16

14. Run the same regression model on the synthetic data sets and pool the results, but use Reiter’s (2003) pooling rules by specifying rule = "reiter2003" when pooling the regressions.
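
Note: Reiter’s rules differ from Rubin’s in the variance estimate. For partially synthetic data, the total variance is estimated as T = ū + b/m (the average within-imputation variance plus the between-imputation variance divided by m), rather than Rubin’s T = ū + (1 + 1/m)b.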

syn_fit <- syn %$% lm(tv ~ age + hgt + bmi + gen)

summary(pool(syn_fit, rule = "reiter2003"))
##          term     estimate  std.error  statistic        df      p.value
## 1 (Intercept)  3.937505145 1.35651715  2.9026579  809.7099 3.800579e-03
## 2         age  0.449623962 0.10393173  4.3261474  313.9035 2.041631e-05
## 3         hgt -0.006581607 0.01362608 -0.4830156  383.3484 6.293604e-01
## 4         bmi  0.112137056 0.05236926  2.1412763 3171.1510 3.232777e-02
## 5       gen.L  9.419481185 0.46109362 20.4285655  901.9920 0.000000e+00
## 6       gen.Q  2.207279651 0.35900500  6.1483257  909.4742 1.170964e-09
## 7       gen.C -0.835419953 0.32524449 -2.5685907 2592.9347 1.026680e-02
## 8       gen^4 -0.108278085 0.44336982 -0.2442162  406.1461 8.071867e-01

As you can see, the results are very similar and would lead to the same conclusions, except for the gen.C dummy. In fact, it would be quite remarkable if all coefficients turned out to be exactly equal. Such discrepancies happen every now and then when creating synthetic data, and merely indicate that considerable uncertainty about the “true” parameter value remains.


As a final check of the quality of the data, we can try to predict whether an observation stems from the observed data set or from one of the synthetic versions. This can be done with classification methods, such as a logistic regression model. If the data sets differ in important respects, classification methods may be able to identify in which respects the synthetic data are flawed.

To do this, we first need to stack the observed and the synthetic data, before we can actually predict which of the two sets an observation stems from.


15. Stack the observed and synthetic data sets, and create an indicator for the synthetic data sets.

Hint: You can use the function complete() from the mice-package to stack the data sets, and mutate() from dplyr to create the indicator.


obs_syn <- complete(syn, action = "long", include = TRUE) %>%
  mutate(synthetic = factor(ifelse(.imp == 0, "Observed", "Synthetic")))

16. Predict whether an observation comes from the observed or the synthetic data on the basis of all variables, using a logistic regression model (glm(formula, family = "binomial", data)).


glm(synthetic ~ age + hgt + wgt + bmi + hc + gen + phb + tv + reg,
    family = "binomial",
    data = obs_syn) %>%
  summary()
## 
## Call:
## glm(formula = synthetic ~ age + hgt + wgt + bmi + hc + gen + 
##     phb + tv + reg, family = "binomial", data = obs_syn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2369   0.4291   0.4353   0.4412   0.4644  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.9215539  0.7529143   2.552   0.0107 *
## age         -0.0030162  0.0306279  -0.098   0.9216  
## hgt          0.0037530  0.0074303   0.505   0.6135  
## wgt         -0.0082609  0.0117076  -0.706   0.4804  
## bmi          0.0265405  0.0355807   0.746   0.4557  
## hc          -0.0043669  0.0199890  -0.218   0.8271  
## gen.L        0.0268195  0.2082120   0.129   0.8975  
## gen.Q        0.0003555  0.1200523   0.003   0.9976  
## gen.C        0.0100122  0.1000630   0.100   0.9203  
## gen^4        0.0014627  0.1228931   0.012   0.9905  
## phb.L        0.0565471  0.2394665   0.236   0.8133  
## phb.Q       -0.0197587  0.1291751  -0.153   0.8784  
## phb.C       -0.0273447  0.1124292  -0.243   0.8078  
## phb^4        0.0048153  0.1144740   0.042   0.9664  
## phb^5        0.0113901  0.1317678   0.086   0.9311  
## tv          -0.0008432  0.0117165  -0.072   0.9426  
## regeast     -0.0004291  0.1465205  -0.003   0.9977  
## regwest     -0.0353635  0.1386503  -0.255   0.7987  
## regsouth     0.0060215  0.1427906   0.042   0.9664  
## regcity      0.0163561  0.1730244   0.095   0.9247  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5013.1  on 8227  degrees of freedom
## Residual deviance: 5012.0  on 8208  degrees of freedom
## AIC: 5052
## 
## Number of Fisher Scoring iterations: 5

None of the estimates are statistically different from zero. This indicates that we cannot successfully predict whether a case is synthetic or original, which shows that the observed and synthetic data are very similar.
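
A joint test tells the same story. Using the null and residual deviances from the output above, a likelihood-ratio test of all predictors simultaneously is far from significant:

# Likelihood-ratio test of all predictors jointly, based on the deviances
# and degrees of freedom reported in the output above
pchisq(5013.1 - 5012.0, df = 8227 - 8208, lower.tail = FALSE)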


Now that we have assessed the quality of the synthetic data in terms of its similarity to the observed data, it is good to do a final check on the privacy protection it offers. After all, a synthetic data set that is exactly the same as the observed data would yield perfect inferential characteristics, but would nevertheless be inappropriate, as it provides no protection of the privacy and confidentiality of the participants. One simple check is to look for duplicates between the observed and synthetic data.


17. Append the original data to the synthetic data, and check whether any of the records in the original data also occur in the synthetic data.

Hint: Start with the synthetic data, remove the variables .imp and .id, and append the original data to it. Subsequently, you can use duplicated() and which() to check whether (and if so, which) observations occur repeatedly.


complete(syn, "long") %>%
  select(-c(.imp, .id)) %>%
  bind_rows(dat) %>%
  duplicated %>%
  which()
## integer(0)

None of the observations occur repeatedly, so we have not accidentally copied any of the “true” observations into the synthetic sets. This provides some safeguard against accidentally releasing sensitive information. However, if the data contain really sensitive information, this might not be enough, and one could, for example, check whether the synthetic data differ from the observed data along multiple dimensions (i.e., variables) at once; one such check is sketched below. Such additional checks depend on the problem at hand. Additionally, one might want to take further measures against accidentally disclosing information about observations, for example by drawing some of the variables from a parametric distribution. Before distributing synthetic data, think carefully about whether any disclosure risks remain with respect to the data that will be distributed.
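
For example, one could count how many synthetic records coincide with an observed record on several variables simultaneously. A minimal sketch (the choice of key variables here is purely illustrative):

# Count synthetic records that exactly match an observed record on a set of
# illustrative quasi-identifying variables
key_vars <- c("age", "hgt", "wgt")
complete(syn, "long") %>%
  select(all_of(key_vars)) %>%
  semi_join(dat, by = key_vars) %>%
  nrow()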


References

Breiman, Leo, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and Regression Trees. New York: CRC press. https://doi.org/10.1201/9781315139470.
Oberman, Hanne. 2022. ggmice: Visualizations for ‘mice’ with ‘ggplot2’.
Reiter, Jerome P. 2003. “Inference for Partially Synthetic, Public Use Microdata Sets.” Survey Methodology 29 (2): 181–88.
van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
Volker, Thom Benjamin, and Gerko Vink. 2021. “Anonymiced Shareable Data: Using Mice to Create and Analyze Multiply Imputed Synthetic Datasets.” Psych 3 (4): 703–16. https://doi.org/10.3390/psych3040045.