In this practical, we will build upon the topics of the last two weeks, and deepen your understanding of various data wrangling and visualization techniques, using the tidyverse
functionalities. Additionally, we briefly introduce the concept of missing data and how to visualize relations between the data and missingness.
Don’t forget to open a project file called 03_Exploratory.Rproj
and to create your .Rmd
or .R
file to work in.
library(tidyverse)
library(magrittr)
library(mice)
library(DAAG)
boys
dataToday, we will work with the boys
data from the R
package mice
, containing information from a sample of 748 boys. Information about this data can be obtained by running the code ?mice::boys
.
When working with a new data set, it is always a good idea to familiarize yourself with the structure of the data.
head(boys)
tail(boys)
It seems like the data is sorted on the variable age
. To verify this, we can test our intuition explicitly.
!is.unsorted(boys$age)
## [1] TRUE
Indeed, the data is sorted on age
. In R
, there is no is.sorted
function, but we can test whether the data is sorted by testing whether the data is not unsorted.
Additionally, you saw that the data contains 9 variables (age, hgt, wgt, bmi, hc, gen, phb, tv, reg), and that the data suffers from missingness. To get a further understanding of the data, we can ask for a summary of all variables.
summary(boys)
## age hgt wgt bmi
## Min. : 0.035 Min. : 50.00 Min. : 3.14 Min. :11.77
## 1st Qu.: 1.581 1st Qu.: 84.88 1st Qu.: 11.70 1st Qu.:15.90
## Median :10.505 Median :147.30 Median : 34.65 Median :17.45
## Mean : 9.159 Mean :132.15 Mean : 37.15 Mean :18.07
## 3rd Qu.:15.267 3rd Qu.:175.22 3rd Qu.: 59.58 3rd Qu.:19.53
## Max. :21.177 Max. :198.00 Max. :117.40 Max. :31.74
## NA's :20 NA's :4 NA's :21
## hc gen phb tv reg
## Min. :33.70 G1 : 56 P1 : 63 Min. : 1.00 north: 81
## 1st Qu.:48.12 G2 : 50 P2 : 40 1st Qu.: 4.00 east :161
## Median :53.00 G3 : 22 P3 : 19 Median :12.00 west :239
## Mean :51.51 G4 : 42 P4 : 32 Mean :11.89 south:191
## 3rd Qu.:56.00 G5 : 75 P5 : 50 3rd Qu.:20.00 city : 73
## Max. :65.00 NA's:503 P6 : 41 Max. :25.00 NA's : 3
## NA's :46 NA's:503 NA's :522
Next to looking at the numbers, it is often insightful to visualize some of the univariate distributions of the data, for example age
and gen
(genital Tanner stage).
1. Create a histogram of the variable age
using the function geom_histogram()
.
2. Create a bar chart of the variable gen
using the function geom_bar()
.
Now we know that there is a substantial amount of missing data, it is time to assess how severe the missingness problem is. One way to do this is by asking for the missing data pattern, using the function md.pattern()
from mice
.
md.pattern(boys)
## age reg wgt hgt bmi hc gen phb tv
## 223 1 1 1 1 1 1 1 1 1 0
## 19 1 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 1 0 1 1
## 1 1 1 1 1 1 1 0 1 0 2
## 437 1 1 1 1 1 1 0 0 0 3
## 43 1 1 1 1 1 0 0 0 0 4
## 16 1 1 1 0 0 1 0 0 0 5
## 1 1 1 1 0 0 0 0 0 0 6
## 1 1 1 0 1 0 1 0 0 0 5
## 1 1 1 0 0 0 1 1 1 1 3
## 1 1 1 0 0 0 0 1 1 1 4
## 1 1 1 0 0 0 0 0 0 0 7
## 3 1 0 1 1 1 1 0 0 0 4
## 0 3 4 20 21 46 503 503 522 1622
The md.pattern
shows that there are 223 cases without any missing values, 525 observations with at least one value missing, and that there are 1622 missing values in total in the data, such that 24% of the data is missing.
Also, it becomes clear that the variables gen
(genital Tanner stage), phb
(pubic hair) and tv
(testicular volume) have most missing values.
Now we now that there is a substantial amount of missing information in the boys
data, we can more closely focus on the missing data patterns. A first step could be to test whether missingness in the variables gen
, phb
and tv
depends on someones age.
3. Create a missingness indicator for the variables gen
, phb
and tv
.
4. Assess whether missingness in the variables gen
, phb
and tv
is related to someones age.
Hint: Use the group_by()
and summarize()
functions from the dplyr
library.
Although such differences can be useful, they rarely tell the complete story about the missingness. A useful strategy to assess missing data patterns is visualization, as figures generally tell more than just plain numbers.
5. Create a histogram for the variable age
, faceted by whether or not someone has a missing value on gen
.
Hint: Use the function facet_wrap
(see help(facet_wrap)
) to create two separate histograms for both groups.
Now we know somewhat more about the missingness problem in the boys
dataset, we can continue visualizing other relationships in the data. The next step is to visualize the relationship between age
and bmi
.
6. Create a scatterplot with age
on the x-axis and bmi
on the y-axis, using the function geom_point()
.
Although we already know how missingness in the variable gen
is related to age
, we do not know whether gen
is also related to bmi
.
7. Add a colour aesthetic to the previous plot using the missingness indicator of the variable gen
.
Usually, we would, at this point, handle the missingness problem accordingly. Yet, the scope of this course does not allow for a thorough module on dealing with missing data. If you are interested in this topic, or if you will need it during future (course)work, there are plenty of online resources. The short list of resources displayed below should get you going.
Description | Author & Title | Link |
---|---|---|
Missing data theory | Van Buuren - Flexible Imputation of Missing Data | Click |
Missing data theory (somewhat more technical) | Little & Rubin - Statistical Analysis with Missing Data | Click |
Tutorials on dealing with missing data | Vink & Van Buuren | Click (see Vignettes ). |
In the remainder of this practical, we will ignore the missing data problems, and focus on visualizing (relations in) the data.
8. Visualize the relationship between reg
(region) and age
using a boxplot.
9. Create a density plot of age, splitting the densities by gen
using the fill
aesthetic.
10. Create a diverging bar chart for hgt
in the boys
data set, that displays for every age
year that year’s mean height in deviations from the overall average hgt
.
Hint: You will have to create a categorical age variable for every year, and center the variable hgt
such that it reflects that person’s difference from the mean height in the entire data.
So far, we have predominantly focused our visualization on variables in the data. We can easily extend this to visualizing models that are constructed from the data. Similarly to looking at visualizations of variables, visualizations of models can help interpreting the results and the fit of the model, by making things obvious that would be obscure when looking at the raw numbers.
For this exercise, we are going to work with the elastic band data sets from the package DAAG
, providing information on the stretch of a rubber band. For more information on the data, call help(elastic1)
or help(elastic2)
.
11. Load the data elastic1
and elastic2
and bind the data frames together using the function bind_rows()
and add a grouping variable indicating whether an observation comes from elastic1
or from elastic2
.
12. Create a scatterplot mapping stretch
on the x-axis and distance
on the y-axis, and map the just created group indicator as the color aesthetic.
13. Recreate the previous plot, but now assess whether the results of the two data sets appear consistent by adding a linear regression line.
Hint: Use the function geom_smooth()
.
14. For each of the data sets elastic1
and elastic2
, fit a regression model with y = distance
on x = stretch
using lm(y ~ x, data)
.
15. For both of the previously created fitted models, determine the fitted values and the standard errors of the fitted values, and the proportion explained variance \(R^2\).
Hint: Check out the predict
function in R
and notice the se.fit
argument, as well as the summary
function.
16. Study the residual versus leverage plots for both models.
Hint: Use plot(which = 5)
on the fitted objects.
17. Use the elastic2
variable stretch
to obtain predictions on the model fitted on elastic1
.
18. Now make a scatterplot to investigate similarity between the predicted values and the observed values for elastic2
.
When you have finished the practical,
enclose all files of the project (i.e. all .R
and/or .Rmd
files including the one with your answers, and the .Rproj
file) in a zip file or folder, and
hand in that zip or folder by PR from your fork here. Do so before Lecture 4.