Introduction

In this practical, we will learn how to visualise data after we have cleaned up our datasets using the dplyr verbs from the previous practical: filter(), arrange(), mutate(), select(), and summarise(). For the visualisations, we will be using a package that implements the grammar of graphics: ggplot2. Please review the lecture slides for week 2 beforehand.

Don’t forget to open the project file 02_Data_Visualisation.Rproj and to create your .Rmd or .R file to work in.

library(ISLR)
library(tidyverse)

An excellent reference manual for ggplot can be found on the tidyverse website: https://ggplot2.tidyverse.org/reference/

What is `ggplot`?

Plots can be made in R without the use of ggplot using plot(), hist() or barplot() and related functions. Here is an example of each on the Hitters dataset from ISLR:

# Get an idea of what the Hitters dataset looks like
head(Hitters)

# histogram of the distribution of salary
hist(Hitters$Salary, xlab = "Salary in thousands of dollars")

# barplot of how many members in each league
barplot(table(Hitters$League))

# number of career home runs vs 1986 home runs
plot(x = Hitters$Hits, y = Hitters$HmRun, 
     xlab = "Hits", ylab = "Home runs")

These plots are informative and useful for visually inspecting the dataset, and they each have a specific syntax associated with them. ggplot has a more unified approach to plotting, where you build up a plot layer by layer using the + operator:

homeruns_plot <- 
  ggplot(Hitters, aes(x = Hits, y = HmRun)) +
  geom_point() +
  labs(x = "Hits", y = "Home runs")

homeruns_plot

As introduced in the lectures, a ggplot object is built up in different layers:

input the dataset to a ggplot() function call
construct aesthetic mappings
add (geometric) components to your plot that use these mappings
add labels, themes, visuals.

Because of this layered syntax, it is then easy to add elements like these fancy density lines, a title, and a different theme:

homeruns_plot + 
  geom_density_2d() +
  labs(title = "Cool density and scatter plot of baseball data") +
  theme_minimal()

In conclusion, ggplot objects are easy to manipulate and they force a principled approach to data visualisation. In this practical, we will learn how to construct them.

Name the aesthetics, geoms, scales, and facets of the above visualisation. Also name any statistical transformations or special coordinate systems.

# Aesthetics: 
#   number of hits mapped to x-position
#   number of home runs mapped to y-position
# Geoms: points and contour lines
# Scales:
#   x-axis: continuous
#   y-axis: continuous
# Facets: None
# Statistical transformations: None
# Special coordinate system: None (just cartesian)

Aesthetics and data preparation

The first step in constructing a ggplot is the preparation of your data and the mapping of variables to aesthetics. In the homeruns_plot, we used an existing data frame, the Hitters dataset.

The data frame needs to have proper column names and the types of the variables in the data frame need to be correctly specified. Numbers should be numerics, categories should be factors, and names or identifiers should be character variables. ggplot() always expects a data frame, which may feel awfully strict, but it allows for excellent flexibility in the remaining plotting steps.

Run the code below to generate data. There will be three vectors in your environment. Put them in a data frame for entering it in a ggplot() call using either the data.frame() or the tibble() function. Give informative names and make sure the types are correct (use the as.<type>() functions). Name the result gg_students

set.seed(1234)
student_grade  <- rnorm(32, 7)
student_number <- round(runif(32) * 2e6 + 5e6)
programme      <- sample(c("Science", "Social Science"), 32, replace = TRUE)

gg_students <- tibble(
  number = as.character(student_number), # an identifier
  grade  = student_grade,                # already the correct type.
  prog   = as.factor(programme)          # categories should be factors.
)

head(gg_students)

# note that if you use data.frame(), you need to set the argument 
# stringsAsFactors to FALSE to get student number to be a character.
# tibble() does this by default.

Mapping aesthetics is usually done in the main ggplot() call. Aesthetic mappings are the second argument to the function, after the data frame.

Plot the first homeruns_plot again, but map the Hits to the y-axis and the HmRun to the x-axis instead.

ggplot(Hitters, aes(x = HmRun, y = Hits)) +
  geom_point() +
  labs(y = "Hits", x = "Home runs")

Recreate the same plot once more, but now also map the variable League to the colour aesthetic and the variable Salary to the size aesthetic.

ggplot(Hitters, aes(x = HmRun, y = Hits, colour = League, size = Salary)) +
  geom_point() +
  labs(y = "Hits", x = "Home runs")

## Warning: Removed 59 rows containing missing values (geom_point).

Examples of aesthetics are:

x
y
alpha (transparency)
colour
fill
group
shape
size
stroke

Geoms

Up until now we have used two geoms: contour lines and points. The geoms in ggplot2 are added via the geom_<geomtype>() functions. Each geom has a required aesthetic mapping to work. For example, geom_point() needs at least and x and y position mapping, as you can read here. You can check the required aesthetic mapping for each geom via the ?geom_<geomtype>.

Look at the many different geoms on the reference website.

There are two types of geoms:

geoms which perform a transformation of the data beforehand, such as geom_density_2d() which calculates contour lines from x and y positions.
geoms which do not transform data beforehand, but use the aesthetic mapping directly, such as geom_point().

Visual exploratory data analysis

Several types of plots are useful for exploratory data analysis. In this section, you will construct different plots to get a feel for the two datasets we use in this practical: Hitters and gg_students. One of the most common tasks is to look at the distributions of variables in your dataset.

Histogram

Use geom_histogram() to create a histogram of the grades of the students in the gg_students dataset. Play around with the binwidth argument of the geom_histogram() function.

gg_students %>%
  ggplot(aes(x = grade)) +
  geom_histogram(binwidth = .5)

Density

The continuous equivalent of the histogram is the density estimate.

Use geom_density() to create a density plot of the grades of the students in the gg_students dataset. Add the argument fill = "light seagreen" to geom_density().

gg_students %>% 
  ggplot(aes(x = grade)) +
  geom_density(fill = "light seagreen")

The downside of only looking at the density or histogram is that it is an abstraction from the raw data, thus it might alter interpretations. For example, it could be that a grade between 8.5 and 9 is in fact impossible. We do not see this in the density estimate. To counter this, we can add a raw data display in the form of rug marks.

Add rug marks to the density plot through geom_rug(). You can edit the colour and size of the rug marks using those arguments within the geom_rug() function.

gg_students %>% 
  ggplot(aes(x = grade)) +
  geom_density(fill = "light seagreen") +
  geom_rug(size = 1, colour = "light seagreen")

Increase the data to ink ratio by removing the y axis label, setting the theme to theme_minimal(), and removing the border of the density polygon. Also set the limits of the x-axis to go from 0 to 10 using the xlim() function, because those are the plausible values for a student grade.

gg_students %>% 
  ggplot(aes(x = grade)) +
  geom_density(fill = "light seagreen", colour = NA) +
  geom_rug(size = 1, colour = "light seagreen") +
  theme_minimal() +
  labs(y = "") +
  xlim(0, 10)

Boxplot

Another common task is to compare distributions across groups. A classic example of a visualisation that performs this is the boxplot, accessible via geom_boxplot(). It allows for visual comparison of the distribution of two or more groups through their summary statistics.

Create a boxplot of student grades per programme in the gg_students dataset you made earlier: map the programme variable to the x position and the grade to the y position. For extra visual aid, you can additionally map the programme variable to the fill aesthetic.

gg_students %>% 
  ggplot(aes(x = prog, y = grade, fill = prog)) +
  geom_boxplot() +
  theme_minimal()

What do each of the horizontal lines in the boxplot mean? What do the vertical lines (whiskers) mean?

# From the help file of geom_boxplot:

# The middle line indicates the median, the outer horizontal
# lines are the 25th and 75th percentile.

# The upper whisker extends from the hinge to the largest value 
# no further than 1.5 * IQR from the hinge (where IQR is the 
# inter-quartile range, or distance between the first and third 
# quartiles). The lower whisker extends from the hinge to the 
# smallest value at most 1.5 * IQR of the hinge. Data beyond 
# the end of the whiskers are called "outlying" points and are 
# plotted individually.

Two densities

Comparison of distributions across categories can also be done by adding a fill aesthetic to the density plot you made earlier. Try this out. To take care of the overlap, you might want to add some transparency in the geom_density() function using the alpha argument.

gg_students %>% 
  ggplot(aes(x = grade, fill = prog)) +
  geom_density(alpha = .5, colour = NA) +
  geom_rug(size = 1, colour = "light seagreen") +
  theme_minimal() +
  labs(y = "", fill = "Programme") +
  xlim(0, 10)

Bar plot

We can display amounts or proportions as a bar plot to compare group sizes of a factor.

Create a bar plot of the variable Years from the Hitters dataset.

Hitters %>% 
  ggplot(aes(x = Years)) + 
  geom_bar() +
  theme_minimal()

geom_bar() automatically transforms variables to counts (see ?stat_count), similar to how the function table() works:

table(Hitters$Years)

## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 23 24 
## 22 25 29 36 30 30 21 16 15 14 10 14 12 13  9  7  7  7  1  2  1  1

Line plot

The Smarket dataset contains daily return rates and trade volume for 1250 days on the S&P 500 stock market.

Use geom_line() to make a line plot out of the first 200 observations of the variable Volume (the number of trades made on each day) of the Smarket dataset. You will need to create a Day variable using mutate() to map to the x-position. This variable can simply be the integers from 1 to 200. Remember, you can select the first 200 rows using Smarket[1:200, ].

Smarket[1:200,] %>% 
  mutate(Day = 1:200) %>% 
  ggplot(aes(x = Day, y = Volume)) +
  geom_line() +
  theme_minimal()

We can edit properties of the line by adding additional arguments into the geom_line() function.

Give the line a nice colour and increase its size. Also add points of the same colour on top.

Smarket[1:200, ] %>% 
  mutate(Day = 1:200) %>% 
  ggplot(aes(x = Day, y = Volume)) +
  geom_line(colour = "#00008b", size = 1) +
  geom_point(colour = "#00008b", size = 1) +
  theme_minimal()

Use the function which.max() to find out which of the first 200 days has the highest trade volume and use the function max() to find out how large this volume was.

which.max(Smarket[1:200, ]$Volume)

## [1] 170

max(Smarket[1:200, ]$Volume)

## [1] 2.33083

Use geom_label(aes(x = your_x, y = your_y, label = "Peak volume")) to add a label to this day. You can use either the values or call the functions. Place the label near the peak!

Smarket[1:200, ] %>% 
  mutate(Day = 1:200) %>% 
  ggplot(aes(x = Day, y = Volume)) +
  geom_line(colour = "#00008b", size = 1) +
  geom_label(aes(x = 170, y = 2.5, label = "Peak volume")) +
  theme_minimal()

This exercise shows that aesthetics can also be mapped separately per geom, in addition to globally in the ggplot() function call. Also, the data can be different for different geoms: here the data for geom_label has only a single data point: your chosen location and the “Peak volume” label.

Faceting

Create a data frame called baseball based on the Hitters dataset. In this data frame, create a factor variable which splits players’ salary range into 3 categories. Tip: use the filter() function to remove the missing values, and then use the cut() function and assign nice labels to the categories. In addition, create a variable which indicates the proportion of career hits that was a home run.

baseball <-
  Hitters %>% 
  filter(!is.na(Salary)) %>% 
  mutate(
    Salary_range = cut(Salary, breaks = 3, 
                       labels = c("Low salary", "Mid salary", "High salary")),
    Career_hmrun_proportion = CHmRun/CHits
  )

Create a scatter plot where you map CWalks to the x position and the proportion you calculated in the previous exercise to the y position. Fix the y axis limits to (0, 0.4) and the x axis to (0, 1600) using ylim() and xlim(). Add nice x and y axis titles using the labs() function. Save the plot as the variable baseball_plot.

baseball_plot <-   
  baseball %>% 
  ggplot(aes(x = CWalks, y = Career_hmrun_proportion)) +
  geom_point() +
  ylim(0, 0.4) +
  xlim(0, 1600) + 
  theme_minimal() +
  labs(y = "Proportion of home runs",
       x = "Career number of walks")

baseball_plot

Split up this plot into three parts based on the salary range variable you calculated. Use the facet_wrap() function for this; look at the examples in the help file for tips.

baseball_plot + facet_wrap(~Salary_range)

Faceting can help interpretation. In this case, we can see that high-salary earners are far away from the point (0, 0) on average, but that there are low-salary earners which are even further away. Faceting should preferably be done using a factor variable. The order of the facets is taken from the levels() of the factor. Changing the order of the facets can be done using fct_relevel() if needed.

Final exercise

Create an interesting data visualisation based on the Carseats data from the ISLR package.

# an example answer could be:
Carseats %>% 
  mutate(Competition = Price/CompPrice,
         ShelveLoc   = fct_relevel(ShelveLoc, "Bad", "Medium", "Good")) %>% 
  ggplot(aes(x = Competition, y = Sales, colour = Age)) +
  geom_point() +
  theme_minimal() +
  scale_colour_viridis_c() + # add a custom colour scale
  facet_wrap(vars(ShelveLoc))

Hand-in

When you have finished the practical,

enclose all files of the project 02_Data_Visualization.Rproj (i.e. all .R and/or .Rmd files including the one with your answers, and the .Rproj file) in a zip file, and
hand in the zip by PR from your fork here. Do so before Lecture 3. That way we can iron out issues during the next Q&A in Week 3.

Data Visualisation using ggplot2