Introduction

In this practical, we will build upon the topics of the last two weeks, and deepen your understanding of various data wrangling and visualization techniques, using the tidyverse functionalities. Additionally, we briefly introduce the concept of missing data and how to visualize relations between the data and missingness.

Don’t forget to open a project file called 03_Exploratory.Rproj and to create your .Rmd or .R file to work in.

library(tidyverse)
library(magrittr)
library(mice)
library(DAAG)

The boys data

Today, we will work with the boys data from the R package mice, containing information from a sample of 748 boys. Information about this data can be obtained by running the code ?mice::boys.

When working with a new data set, it is always a good idea to familiarize yourself with the structure of the data.

head(boys)
tail(boys)

It seems like the data is sorted on the variable age. To verify this, we can test our intuition explicitly.

!is.unsorted(boys$age)
## [1] TRUE

Indeed, the data is sorted on age. In R, there is no is.sorted function, but we can test whether the data is sorted by testing whether the data is not unsorted.

Additionally, you saw that the data contains 9 variables (age, hgt, wgt, bmi, hc, gen, phb, tv, reg), and that the data suffers from missingness. To get a further understanding of the data, we can ask for a summary of all variables.

summary(boys)
##       age              hgt              wgt              bmi       
##  Min.   : 0.035   Min.   : 50.00   Min.   :  3.14   Min.   :11.77  
##  1st Qu.: 1.581   1st Qu.: 84.88   1st Qu.: 11.70   1st Qu.:15.90  
##  Median :10.505   Median :147.30   Median : 34.65   Median :17.45  
##  Mean   : 9.159   Mean   :132.15   Mean   : 37.15   Mean   :18.07  
##  3rd Qu.:15.267   3rd Qu.:175.22   3rd Qu.: 59.58   3rd Qu.:19.53  
##  Max.   :21.177   Max.   :198.00   Max.   :117.40   Max.   :31.74  
##                   NA's   :20       NA's   :4        NA's   :21     
##        hc          gen        phb            tv           reg     
##  Min.   :33.70   G1  : 56   P1  : 63   Min.   : 1.00   north: 81  
##  1st Qu.:48.12   G2  : 50   P2  : 40   1st Qu.: 4.00   east :161  
##  Median :53.00   G3  : 22   P3  : 19   Median :12.00   west :239  
##  Mean   :51.51   G4  : 42   P4  : 32   Mean   :11.89   south:191  
##  3rd Qu.:56.00   G5  : 75   P5  : 50   3rd Qu.:20.00   city : 73  
##  Max.   :65.00   NA's:503   P6  : 41   Max.   :25.00   NA's :  3  
##  NA's   :46                 NA's:503   NA's   :522

Next to looking at the numbers, it is often insightful to visualize some of the univariate distributions of the data, for example age and gen (genital Tanner stage).


1. Create a histogram of the variable age using the function geom_histogram().


2. Create a bar chart of the variable gen using the function geom_bar().


Assessing missing data

Now we know that there is a substantial amount of missing data, it is time to assess how severe the missingness problem is. One way to do this is by asking for the missing data pattern, using the function md.pattern() from mice.

md.pattern(boys)

##     age reg wgt hgt bmi hc gen phb  tv     
## 223   1   1   1   1   1  1   1   1   1    0
## 19    1   1   1   1   1  1   1   1   0    1
## 1     1   1   1   1   1  1   1   0   1    1
## 1     1   1   1   1   1  1   0   1   0    2
## 437   1   1   1   1   1  1   0   0   0    3
## 43    1   1   1   1   1  0   0   0   0    4
## 16    1   1   1   0   0  1   0   0   0    5
## 1     1   1   1   0   0  0   0   0   0    6
## 1     1   1   0   1   0  1   0   0   0    5
## 1     1   1   0   0   0  1   1   1   1    3
## 1     1   1   0   0   0  0   1   1   1    4
## 1     1   1   0   0   0  0   0   0   0    7
## 3     1   0   1   1   1  1   0   0   0    4
##       0   3   4  20  21 46 503 503 522 1622

The md.pattern shows that there are 223 cases without any missing values, 525 observations with at least one value missing, and that there are 1622 missing values in total in the data, such that 24% of the data is missing.

Also, it becomes clear that the variables gen (genital Tanner stage), phb (pubic hair) and tv (testicular volume) have most missing values.


Now we now that there is a substantial amount of missing information in the boys data, we can more closely focus on the missing data patterns. A first step could be to test whether missingness in the variables gen, phb and tv depends on someones age.


3. Create a missingness indicator for the variables gen, phb and tv.


4. Assess whether missingness in the variables gen, phb and tv is related to someones age.

Hint: Use the group_by() and summarize() functions from the dplyr library.


Although such differences can be useful, they rarely tell the complete story about the missingness. A useful strategy to assess missing data patterns is visualization, as figures generally tell more than just plain numbers.


5. Create a histogram for the variable age, faceted by whether or not someone has a missing value on gen.

Hint: Use the function facet_wrap (see help(facet_wrap)) to create two separate histograms for both groups.


Now we know somewhat more about the missingness problem in the boys dataset, we can continue visualizing other relationships in the data. The next step is to visualize the relationship between age and bmi.


6. Create a scatterplot with age on the x-axis and bmi on the y-axis, using the function geom_point().


Although we already know how missingness in the variable gen is related to age, we do not know whether gen is also related to bmi.


7. Add a colour aesthetic to the previous plot using the missingness indicator of the variable gen.


Usually, we would, at this point, handle the missingness problem accordingly. Yet, the scope of this course does not allow for a thorough module on dealing with missing data. If you are interested in this topic, or if you will need it during future (course)work, there are plenty of online resources. The short list of resources displayed below should get you going.


Description Author & Title Link
Missing data theory Van Buuren - Flexible Imputation of Missing Data Click
Missing data theory (somewhat more technical) Little & Rubin - Statistical Analysis with Missing Data Click
Tutorials on dealing with missing data Vink & Van Buuren Click (see Vignettes).

Visualizing the boys data

In the remainder of this practical, we will ignore the missing data problems, and focus on visualizing (relations in) the data.


8. Visualize the relationship between reg (region) and age using a boxplot.


9. Create a density plot of age, splitting the densities by gen using the fill aesthetic.


10. Create a diverging bar chart for hgt in the boys data set, that displays for every age year that year’s mean height in deviations from the overall average hgt.

Hint: You will have to create a categorical age variable for every year, and center the variable hgt such that it reflects that person’s difference from the mean height in the entire data.


Regression visualization

So far, we have predominantly focused our visualization on variables in the data. We can easily extend this to visualizing models that are constructed from the data. Similarly to looking at visualizations of variables, visualizations of models can help interpreting the results and the fit of the model, by making things obvious that would be obscure when looking at the raw numbers.


For this exercise, we are going to work with the elastic band data sets from the package DAAG, providing information on the stretch of a rubber band. For more information on the data, call help(elastic1) or help(elastic2).


11. Load the data elastic1 and elastic2 and bind the data frames together using the function bind_rows() and add a grouping variable indicating whether an observation comes from elastic1 or from elastic2.


12. Create a scatterplot mapping stretch on the x-axis and distance on the y-axis, and map the just created group indicator as the color aesthetic.


13. Recreate the previous plot, but now assess whether the results of the two data sets appear consistent by adding a linear regression line.

Hint: Use the function geom_smooth().


14. For each of the data sets elastic1 and elastic2, fit a regression model with y = distance on x = stretch using lm(y ~ x, data).


15. For both of the previously created fitted models, determine the fitted values and the standard errors of the fitted values, and the proportion explained variance \(R^2\).

Hint: Check out the predict function in R and notice the se.fit argument, as well as the summary function.


16. Study the residual versus leverage plots for both models.

Hint: Use plot(which = 5) on the fitted objects.


17. Use the elastic2 variable stretch to obtain predictions on the model fitted on elastic1.


18. Now make a scatterplot to investigate similarity between the predicted values and the observed values for elastic2.


Hand-in

When you have finished the practical,

  • enclose all files of the project (i.e. all .R and/or .Rmd files including the one with your answers, and the .Rproj file) in a zip file or folder, and

  • hand in that zip or folder by PR from your fork here. Do so before Lecture 4.