Practical I

Exercises

The following packages are required for this practical:

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.3

library(magrittr)

## Warning: package 'magrittr' was built under R version 3.6.3

library(mice)

## Warning: package 'mice' was built under R version 3.6.3

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.3

and if you’d like the same results as I have obtained, you can fix the random seed

set.seed(123)

Function plot() is the core plotting function in R. Find out more about plot(): Try both the help in the help-pane and ?plot in the console. Look at the examples by running example(plot).

The help tells you all about a functions arguments (the input you can specify), as well as the element the function returns to the Global Environment. There are strict rules for publishing packages in R. For your packages to appear on the Comprehensive R Archive Network (CRAN), a rigorous series of checks have to be passed. As a result, all user-level components (functions, datasets, elements) that are published, have an acompanying documentation that elaborates how the function should be used, what can be expected, or what type of information a data set contains. Help files often contain example code that can be run to demonstrate the workings.

?plot

## starting httpd help server ... done

example(plot)

## 
## plot> require(stats) # for lowess, rpois, rnorm
## 
## plot> plot(cars)

## 
## plot> lines(lowess(cars))
## 
## plot> plot(sin, -pi, 2*pi) # see ?plot.function

## 
## plot> ## Discrete Distribution Plot:
## plot> plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10,
## plot+      main = "rpois(100, lambda = 5)")

## 
## plot> ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:
## plot> plot(x <- sort(rnorm(47)), type = "s", main = "plot(x, type = \"s\")")

## 
## plot> points(x, cex = .5, col = "dark red")

There are many more functions that can plot specific types of plots. For example, function hist() plots histograms, but falls back on the basic plot() function. Package ggplot2 is an excellent package to use for more complex plots. Pretty much any type of plot can be made in R. All ggplot2 documentation can be found at ggplot2.tidyverse.org

Create a scatterplot between age and bmi in the mice::boys data set

With the standard plotting device in R:

mice::boys %$% plot(bmi ~ age)

or, with ggplot2:

p <- ggplot(mice::boys, aes(age, bmi))
p + geom_point()

## Warning: Removed 21 rows containing missing values (geom_point).

Package ggplot2 offers far greater flexibility in data visualization than the standard plotting devices in R. However, it has its own language, which allows you to easily expand graphs with additional commands. To make these expansions or layers clearly visible, it is advisable to use the plotting language conventions. For example,

mice::boys %>% 
  ggplot(aes(age, bmi)) +
  geom_point()

would yield the same plot as

ggplot(mice::boys, aes(age, bmi)) + geom_point()

but the latter style may be less informative, especially if more customization takes place and if you share your code with others.

Now recreate the plot with the following specifications:

If bmi < 18.5 use color = "light blue"
If bmi > 18.5 & bmi < 25 use color = "light green"
If bmi > 25 & bmi < 30 use color = "orange"
If bmi > 30 use color = "red"

Hint: it may help to expand the data set with a new variable.

It may be easier to create a new variable that creates the specified categories. We can use the cut() function to do this quickly

boys2 <- 
  boys %>%
  mutate(class = cut(bmi, c(0, 18.5, 25, 30, Inf),
                    labels = c("underweight",
                               "healthy",
                               "overweight",
                               "obese")))

by specifying the boundaries of the intervals. In this case we obtain 4 intervals: 0-18.5, 18.5-25, 25-30 and 30-Inf. We used the %>% pipe to work with bmi directly. Alternatively, we could have done this without a pipe:

boys3 <- boys
boys3$class <- cut(boys$bmi, c(0, 18.5, 25, 30, Inf), 
                   labels = c("underweight",
                              "healthy",
                              "overweight",
                              "obese"))

to obtain the same result.

With the standard plotting device in R we can now specify:

plot(bmi ~ age, subset = class == "underweight", col = "light blue", data = boys2, 
     ylim = c(10, 35), xlim = c(0, 25))
points(bmi ~ age, subset = class == "healthy", col = "light green", data = boys2)
points(bmi ~ age, subset = class == "overweight", col = "orange", data = boys2)
points(bmi ~ age, subset = class == "obese", col = "red", data = boys2)

and with ggplot2 we can call

boys2 %>%
  ggplot() +
  geom_point(aes(age, bmi, col = class))

## Warning: Removed 21 rows containing missing values (geom_point).

Although the different classifications have different colours, the colours are not conform the specifications of this exercise. We can manually override this:

boys2 %>%
  ggplot() +
  geom_point(aes(age, bmi, col = class)) +
  scale_color_manual(values = c("light blue", "light green", "orange", "red"))

## Warning: Removed 21 rows containing missing values (geom_point).

Because there are missing values, ggplot2 displays a warning message. If we would like to not consider the missing values when plotting, we can simply exclude the NAs by using a filter():

boys2 %>% 
  filter(!is.na(class)) %>%
  ggplot() +
  geom_point(aes(age, bmi, col = class)) +
  scale_color_manual(values = c("light blue", "light green", "orange", "red"))

Specifying a filter on the feature class is sufficient: age has no missings and the missings in class directly correspond to missing values on bmi. Filtering on bmi would therefore yield an identical plot.

Create a histogram for age in the boys data set

With the standard plotting device in R:

boys %$%
  hist(age, breaks = 50)

The breaks = 50 overrides the default breaks between the bars. By default the plot would be

boys %$%
  hist(age)

Using a pipe is a nice approach for this plot because it inherits the names of the objects we aim to plot. Without the pipe we might need to adjust the main title for the histogram:

hist(boys$age, breaks = 50)

With ggplot2:

boys %>%
  ggplot() + 
  geom_histogram(aes(age), binwidth = .4)

Please note that the plots from geom_histogram() and hist use different calculations for the bars (bins) and hence may look slightly different.

Create a bar chart for reg in the boys data set With a standard plotting device in R:

boys %$%
  table(reg) %>%
  barplot()

With ggplot2:

boys %>%
  ggplot() + 
  geom_bar(aes(reg))

Note that geom_bar by default plots the NA’s, while barplot() omits the NA’s without warning. If we would not like to plot the NAs, then a simple filter() (see exercise 2) on the boys data is efficient.

Create a box plot for hgt with different boxes for reg in the boys data set With a standard plotting device in R:

boys %$%
  boxplot(hgt ~ reg)

With ggplot2:

boys %>%
  ggplot(aes(reg, hgt)) +
  geom_boxplot()

## Warning: Removed 20 rows containing non-finite values (stat_boxplot).

Create a density plot for age with different curves for boys from the city and boys from rural areas (!city). With a standard plotting device in R:

d1 <- boys %>%
  subset(reg == "city") %$%
  density(age)
d2 <- boys %>%
  subset(reg != "city") %$% 
  density(age)

plot(d1, col = "red", ylim = c(0, .08)) 
lines(d2, col = "blue")

The above plot can also be generated without pipes, but results in an ugly main title. You may edit the title via the main argument in the plot() function.

plot(density(boys$age[!is.na(boys$reg) & boys$reg == "city"]), 
     col = "red", 
     ylim = c(0, .08))
lines(density(boys$age[!is.na(boys$reg) & boys$reg != "city"]), 
      col = "blue")

With ggplot2 everything looks much nicer:

boys %>%
  mutate(area = ifelse(reg == "city", "city", "rural")) %>%
  filter(!is.na(area)) %>%
  ggplot(aes(age, fill = area)) +
  geom_density(alpha = .3) # some transparency

Create a diverging bar chart for hgt in the boys data set, that displays for every age year that year’s mean height in deviations from the overall average hgt

Let’s not make things too complicated and just focus on ggplot2:

boys %>%
  mutate(Hgt = hgt - mean(hgt, na.rm = TRUE),
         Age = cut(age, 0:22, labels = 0:21)) %>%
  group_by(Age) %>%
  summarize(Hgt = mean(Hgt, na.rm = TRUE)) %>% 
  mutate(Diff = cut(Hgt, c(-Inf, 0, Inf),
                    labels = c("Below Average", "Above Average"))) %>%
  ggplot(aes(x = Age, y = Hgt, fill = Diff)) + 
  geom_bar(stat = "identity") +
  coord_flip()

We can clearly see that the average height in the group is reached just before age 7.

The group_by() and summarize() function are advanced dplyr functions used to return the mean() of deviation Hgt for every group in Age. For example, if we would like the mean and sd of height hgt for every region reg in the boys data, we could call:

boys %>%
  group_by(reg) %>% 
  summarize(mean_hgt = mean(hgt, na.rm = TRUE), 
            sd_hgt   = sd(hgt, na.rm = TRUE))

## # A tibble: 6 x 3
##   reg   mean_hgt sd_hgt
##   <fct>    <dbl>  <dbl>
## 1 north    152.    43.8
## 2 east     134.    43.2
## 3 west     130.    48.0
## 4 south    128.    46.3
## 5 city     126.    46.9
## 6 <NA>      73.0   29.3

The na.rm argument ensures that the mean and sd of only the observed values in each category are used.

End of Practical

Practical I

Gerko Vink & Erik-Jan van Kesteren

Statistical Programming in R

Exercises

Useful References