In this practical, we will build upon the topics of the last two weeks and deepen your understanding of various data wrangling and visualization techniques using the tidyverse functionality. Additionally, we briefly introduce the concept of missing data and how to visualize relations between the data and missingness.
Don't forget to open a project file called 03_Exploratory.Rproj and to create your .Rmd or .R file to work in.
library(tidyverse)
library(magrittr)
library(mice)
library(DAAG)
Today, we will work with the boys data from the R package mice, containing information from a sample of 748 boys. Information about this data can be obtained by running the code ?mice::boys.
When working with a new data set, it is always a good idea to familiarize yourself with the structure of the data.
head(boys)
tail(boys)
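Besides head() and tail(), a compact overview of the dimensions and variable types can be helpful. A minimal sketch using base R's str() and dplyr's glimpse() (the latter is available because tidyverse is loaded):

```r
# inspect dimensions, column types, and the first values of each column
str(boys)

# tidyverse-style equivalent
glimpse(boys)
```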
It seems like the data is sorted on the variable age. To verify this intuition, we can test it explicitly.
!is.unsorted(boys$age)
## [1] TRUE
Indeed, the data is sorted on age. In R, there is no is.sorted function, but we can test whether the data is sorted by testing whether it is not unsorted.
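To see how this double negation behaves, a small sketch on toy vectors (note that is.unsorted() returns NA when the vector contains missing values, unless na.rm = TRUE is set):

```r
is.unsorted(c(1, 2, 3))                 # FALSE: the vector is sorted
is.unsorted(c(3, 1, 2))                 # TRUE: the vector is not sorted
is.unsorted(c(1, NA, 3), na.rm = TRUE)  # FALSE: NAs are ignored
```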
Additionally, you saw that the data contains 9 variables (age, hgt, wgt, bmi, hc, gen, phb, tv, reg), and that the data suffers from missingness. To get a further understanding of the data, we can ask for a summary of all variables.
summary(boys)
## age hgt wgt bmi
## Min. : 0.035 Min. : 50.00 Min. : 3.14 Min. :11.77
## 1st Qu.: 1.581 1st Qu.: 84.88 1st Qu.: 11.70 1st Qu.:15.90
## Median :10.505 Median :147.30 Median : 34.65 Median :17.45
## Mean : 9.159 Mean :132.15 Mean : 37.15 Mean :18.07
## 3rd Qu.:15.267 3rd Qu.:175.22 3rd Qu.: 59.58 3rd Qu.:19.53
## Max. :21.177 Max. :198.00 Max. :117.40 Max. :31.74
## NA's :20 NA's :4 NA's :21
## hc gen phb tv reg
## Min. :33.70 G1 : 56 P1 : 63 Min. : 1.00 north: 81
## 1st Qu.:48.12 G2 : 50 P2 : 40 1st Qu.: 4.00 east :161
## Median :53.00 G3 : 22 P3 : 19 Median :12.00 west :239
## Mean :51.51 G4 : 42 P4 : 32 Mean :11.89 south:191
## 3rd Qu.:56.00 G5 : 75 P5 : 50 3rd Qu.:20.00 city : 73
## Max. :65.00 NA's:503 P6 : 41 Max. :25.00 NA's : 3
## NA's :46 NA's:503 NA's :522
In addition to looking at the numbers, it is often insightful to visualize some of the univariate distributions of the data, for example age and gen (genital Tanner stage).
1. Create a histogram of the variable age using the function geom_histogram().
boys %>%
  ggplot(aes(x = age)) +
  geom_histogram(fill = "dark green") +
  theme_minimal() +
  labs(title = "Distribution of age")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Most boys are relatively young, while there is
# also a substantial group of boys between the age
# of 10 and 20 years.
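As the warning above suggests, the default of 30 bins is arbitrary and a binwidth can be set explicitly. A sketch with a binwidth of one year (an illustrative choice, not one prescribed by the practical):

```r
boys %>%
  ggplot(aes(x = age)) +
  geom_histogram(binwidth = 1, fill = "dark green") +  # one bin per year of age
  theme_minimal() +
  labs(title = "Distribution of age")
```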
2. Create a bar chart of the variable gen using the function geom_bar().
boys %>%
  ggplot(aes(x = gen)) +
  geom_bar(fill = "dark green") +
  theme_minimal() +
  labs(title = "Distribution of genital Tanner stage")
Now that we know there is a substantial amount of missing data, it is time to assess how severe the missingness problem is. One way to do this is by asking for the missing data pattern, using the function md.pattern() from mice.
md.pattern(boys)
## age reg wgt hgt bmi hc gen phb tv
## 223 1 1 1 1 1 1 1 1 1 0
## 19 1 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 1 0 1 1
## 1 1 1 1 1 1 1 0 1 0 2
## 437 1 1 1 1 1 1 0 0 0 3
## 43 1 1 1 1 1 0 0 0 0 4
## 16 1 1 1 0 0 1 0 0 0 5
## 1 1 1 1 0 0 0 0 0 0 6
## 1 1 1 0 1 0 1 0 0 0 5
## 1 1 1 0 0 0 1 1 1 1 3
## 1 1 1 0 0 0 0 1 1 1 4
## 1 1 1 0 0 0 0 0 0 0 7
## 3 1 0 1 1 1 1 0 0 0 4
## 0 3 4 20 21 46 503 503 522 1622
The md.pattern() output shows that there are 223 cases without any missing values, 525 cases with at least one value missing, and 1622 missing values in total, meaning that about 24% of all cells in the data are missing. It also becomes clear that the variables gen (genital Tanner stage), phb (pubic hair) and tv (testicular volume) have the most missing values.
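The totals reported by md.pattern() can be double-checked with base R; the 24% follows from dividing the 1622 missing cells by the 748 × 9 cells in the data:

```r
sum(is.na(boys))      # total number of missing cells: 1622
mean(is.na(boys))     # proportion missing: 1622 / (748 * 9), about 0.24
colSums(is.na(boys))  # number of missing values per variable
```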
Now that we know there is a substantial amount of missing information in the boys data, we can more closely focus on the missing data patterns. A first step could be to test whether missingness in the variables gen, phb and tv depends on someone's age.
3. Create a missingness indicator for the variables gen, phb and tv.
boys_mis <- boys %>%
  mutate(gen_mis = is.na(gen),
         phb_mis = is.na(phb),
         tv_mis  = is.na(tv))
4. Assess whether missingness in the variables gen, phb and tv is related to someone's age.
Hint: Use the group_by() and summarize() functions from the dplyr library.
boys_mis %>%
  group_by(gen_mis) %>%
  summarize(age = mean(age))

boys_mis %>%
  group_by(phb_mis) %>%
  summarize(age = mean(age))

boys_mis %>%
  group_by(tv_mis) %>%
  summarize(age = mean(age))
# And we see that those with a missing value on the
# variables of interest have a substantially lower age
# than those with an observed value.
Although such differences can be useful, they rarely tell the complete story about the missingness. A useful strategy to assess missing data patterns is visualization, as figures generally tell more than just plain numbers.
5. Create a histogram for the variable age, faceted by whether or not someone has a missing value on gen.
Hint: Use the function facet_wrap() (see help(facet_wrap)) to create two separate histograms for both groups.
boys_mis %>%
  ggplot(aes(x = age)) +
  geom_histogram(fill = "dark green") +
  facet_wrap(~gen_mis) +
  theme_minimal()
# Now we see what we couldn't have seen before. Boys
# with an observed value on gen are all at least seven
# years old, while those with a missing value on gen
# are far more often between 0 and 5 years old.
Now that we know more about the missingness problem in the boys dataset, we can continue visualizing other relationships in the data. The next step is to visualize the relationship between age and bmi.
6. Create a scatterplot with age on the x-axis and bmi on the y-axis, using the function geom_point().
boys_mis %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Scatter plot of age versus bmi")
## Warning: Removed 21 rows containing missing values (geom_point).
# Note that ggplot gives the warning that 21 rows are
# removed due to missingness in bmi.
Although we already know how missingness in the variable gen is related to age, we do not know whether missingness in gen is also related to bmi.
7. Add a colour aesthetic to the previous plot using the missingness indicator of the variable gen.
boys_mis %>%
  ggplot(aes(x = age, y = bmi, col = gen_mis)) +
  geom_point() +
  theme_minimal() +
  scale_color_viridis_d() +
  labs(title = "Scatter plot of age versus bmi")
## Warning: Removed 21 rows containing missing values (geom_point).
# Again, we clearly see that younger boys generally have
# a missing value on gen, while there does not seem to be
# much of a relation between missingness on gen and bmi.
Usually, we would, at this point, handle the missingness problem accordingly. Yet, the scope of this course does not allow for a thorough module on dealing with missing data. If you are interested in this topic, or if you will need it during future (course)work, there are plenty of online resources. The short list of resources displayed below should get you going.
| Description | Author & Title | Link |
|---|---|---|
| Missing data theory | Van Buuren - Flexible Imputation of Missing Data | Click |
| Missing data theory (somewhat more technical) | Little & Rubin - Statistical Analysis with Missing Data | Click |
| Tutorials on dealing with missing data | Vink & Van Buuren | Click (see Vignettes). |
In the remainder of this practical, we will ignore the missing data problems, and focus on visualizing (relations in) the data.
8. Visualize the relationship between reg (region) and age using a boxplot.
boys_mis %>%
  ggplot(aes(x = reg, y = age)) +
  geom_boxplot(fill = "dark green") +
  theme_minimal() +
  labs(title = "Boxplot of age by region.")
# Boys in the northern region seem a little older
# than the rest, boys with an NA on region seem to
# be much younger than the rest, while there does
# not appear to be much of a difference in the
# remaining categories.
9. Create a density plot of age, splitting the densities by gen using the fill aesthetic.
boys_mis %>%
  ggplot(aes(x = age, fill = gen)) +
  geom_density(alpha = 0.7) +
  theme_minimal() +
  scale_fill_brewer() +
  labs(title = "Density of age by genital Tanner stage")
# You'll see a clear relation between gen and age,
# which is not really surprising, because physical
# development is usually driven by aging.
10. Create a diverging bar chart for hgt in the boys data set that displays, for every year of age, that year's mean height as a deviation from the overall average hgt.
Hint: You will have to create a categorical age variable for every year, and center the variable hgt such that it reflects each person's difference from the mean height in the entire data.
boys %>%
  mutate(Age = cut(age, 0:22, labels = paste0(0:21, " years")),
         Height = hgt - mean(hgt, na.rm = TRUE)) %>%
  group_by(Age) %>%
  summarize(Height = mean(Height, na.rm = TRUE)) %>%
  mutate(color = ifelse(Height > 0, "Above average", "Below average")) %>%
  ggplot(aes(x = Height, y = Age, fill = color)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  theme(legend.title = element_blank())
# We can clearly see that the average height in the
# group is reached just before age 7.
So far, we have predominantly focused our visualization on variables in the data. We can easily extend this to visualizing models that are constructed from the data. Just as with visualizations of variables, visualizations of models can aid the interpretation of the results and the fit of the model, by making patterns obvious that would remain obscure in the raw numbers.
For this exercise, we are going to work with the elastic band data sets from the package DAAG, which provide information on the stretch of a rubber band. For more information on the data, call help(elastic1) or help(elastic2).
11. Load the data elastic1 and elastic2 and bind the data frames together using the function bind_rows(), adding a grouping variable that indicates whether an observation comes from elastic1 or from elastic2.
elastic <- bind_rows("Elastic1" = elastic1,
                     "Elastic2" = elastic2,
                     .id = "Set")
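As a quick sanity check on the result: the .id argument stores the names "Elastic1" and "Elastic2" in the new Set column, so tabulating that column shows how many rows each source data frame contributed:

```r
# count the rows contributed by each source data frame
table(elastic$Set)
```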
12. Create a scatterplot mapping stretch on the x-axis and distance on the y-axis, and map the just-created group indicator to the colour aesthetic.
elastic %>%
  ggplot(aes(x = stretch, y = distance, col = Set)) +
  geom_point() +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "Elastic bands data")
13. Recreate the previous plot, but now assess whether the results of the two data sets appear consistent by adding a linear regression line.
Hint: Use the function geom_smooth().
elastic %>%
  ggplot(aes(x = stretch, y = distance, col = Set)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "Elastic bands data")
# The results seem very consistent: Data set elastic2 has
# more observations over a larger range, but both sets
# result in roughly the same regression line. Data set
# elastic1 seems to have an odd-one-out value.
14. For each of the data sets elastic1 and elastic2, fit a regression model with y = distance on x = stretch using lm(y ~ x, data).
fit1 <- lm(distance ~ stretch, elastic1)
fit2 <- lm(distance ~ stretch, elastic2)
15. For both of the previously fitted models, determine the fitted values and the standard errors of the fitted values, as well as the proportion of explained variance \(R^2\).
Hint: Check out the predict() function in R and notice its se.fit argument, as well as the summary() function.
# Check fitted values
fit1 %>% predict(se.fit = TRUE)
## $fit
## 1 2 3 4 5 6 7
## 183.1429 235.7143 196.2857 209.4286 170.0000 156.8571 222.5714
##
## $se.fit
## [1] 6.586938 10.621119 5.891537 6.586938 8.331891 10.621119 8.331891
##
## $df
## [1] 5
##
## $residual.scale
## [1] 15.58754
fit2 %>% predict(se.fit = TRUE)
## $fit
## 1 2 3 4 5 6 7 8
## 77.58333 196.58333 137.08333 166.83333 256.08333 226.33333 107.33333 226.33333
## 9
## 285.83333
##
## $se.fit
## [1] 6.740293 3.520003 4.358744 3.635444 5.060323 4.064550 5.453165 4.064550
## [9] 6.296773
##
## $df
## [1] 7
##
## $residual.scale
## [1] 10.44202
# Note that fit1 (based on elastic1) has a larger
# residual standard deviation (i.e., $residual.scale).
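The residual standard deviation can also be extracted directly with the sigma() generic from the stats package, which for an lm object returns the same value as $residual.scale:

```r
# extract the residual standard deviation of each fit
sigma(fit1)  # 15.58754 (elastic1)
sigma(fit2)  # 10.44202 (elastic2)
```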
# Check R2
fit1 %>% summary()
##
## Call:
## lm(formula = distance ~ stretch, data = elastic1)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.1429 -18.7143 -7.2857 -1.4286 8.0000 -6.8571 26.4286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -119.143 70.943 -1.679 0.15391
## stretch 6.571 1.473 4.462 0.00663 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.59 on 5 degrees of freedom
## Multiple R-squared: 0.7992, Adjusted R-squared: 0.7591
## F-statistic: 19.91 on 1 and 5 DF, p-value: 0.006631
fit2 %>% summary()
##
## Call:
## lm(formula = distance ~ stretch, data = elastic2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0833 -7.0833 -0.5833 5.1667 20.1667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -100.9167 15.6102 -6.465 0.000345 ***
## stretch 5.9500 0.3148 18.899 2.89e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.44 on 7 degrees of freedom
## Multiple R-squared: 0.9808, Adjusted R-squared: 0.978
## F-statistic: 357.2 on 1 and 7 DF, p-value: 2.888e-07
# Or directly
fit1 %>% summary() %$% r.squared
## [1] 0.7992446
fit2 %>% summary() %$% r.squared
## [1] 0.9807775
# Note that the model based on elastic2 has smaller standard
# errors and a much larger R2. This is due to the larger
# range of values in elastic2 and the absence of an outlier.
16. Study the residual versus leverage plots for both models.
Hint: Use plot(which = 5) on the fitted objects.
fit1 %>% plot(which = 5)
fit2 %>% plot(which = 5)
# For elastic1, case 2 has the largest influence on
# estimation. However, it is not the case with the
# largest residual:
fit1$residuals
## 1 2 3 4 5 6
## -0.1428571 -18.7142857 -7.2857143 -1.4285714 8.0000000 -6.8571429
## 7
## 26.4285714
# As we can see, case 7 has the largest residual.
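Instead of eyeballing the residual vector, the case with the largest absolute residual can also be identified programmatically:

```r
# index of the observation with the largest absolute residual
which.max(abs(resid(fit1)))  # case 7
```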
17. Use the elastic2 variable stretch to obtain predictions from the model fitted on elastic1.
pred <- predict(fit1, newdata = elastic2)
18. Now make a scatterplot to investigate the similarity between the predicted values and the observed values for elastic2.
pred_dat <- data.frame(distance = pred,
                       stretch = elastic2$stretch) %>%
  bind_rows(Predicted = .,
            Observed = elastic2,
            .id = "Predicted")

pred_dat %>%
  ggplot(aes(stretch, distance, col = Predicted)) +
geom_point() +
geom_smooth(method = "lm") +
scale_color_brewer(palette = "Set1") +
theme_minimal() +
labs(title = "Predicted and observed distances")
# The predicted values are very similar to the observed
# values. However, they do not strictly follow the straight
# line because there is some modeling error: we use elastic1's
# model to predict elastic2's distance [error source 1] and
# we compare those predictions to elastic2's observed distance
# [error source 2]. Still, given the modeling setup, these
# predictions are very accurate and correlate highly with
# the observed values:
cor(elastic2$distance, pred)
## [1] 0.9903421
When you have finished the practical, enclose all files of the project (i.e., all .R and/or .Rmd files, including the one with your answers, and the .Rproj file) in a zip file or folder, and hand in that zip file or folder by a PR from your fork here. Do so before Lecture 4.