This is the first programming practical. If you haven’t yet done so, open the project file `01_Data_Wrangling.Rproj` in RStudio. You can choose to write the answers to your exercises in either an `.R` file or an `.Rmd` file. Example answer files are provided in the project directory (`example_answers.Rmd` and `example_answers.R`). You can open these from the files pane and use them as a starting point. While working through the exercises, write down your code in one of these files. Use proper style and provide comments so you can read it back later and still understand what is happening.
The practicals always start with the packages we are going to use. Be sure to run these lines in your session to load their functions before you continue. If there are packages that you have not yet installed, first install them with `install.packages()`.
library(ISLR)
library(tidyverse)
library(haven)
library(readxl)
There are several data types in `R`. Here is a table with the most common ones:
Type | Short | Example |
---|---|---|
Integer | int | 0, 1, 2, 3, -4, -5 |
Numeric / Double | dbl | 0.1, -2.5, 123.456 |
Character | chr | “dav is a cool course” |
Logical | lgl | TRUE / FALSE |
Factor | fctr | low, medium, high |
The `class()` function can give you an idea about what type of data each variable contains.
Run the following code in `R` and inspect the data types of the resulting objects using the `class()` function. Try to guess beforehand what their types will be!
object_1 <- 1:5
object_2 <- 1L:5L
object_3 <- "-123.456"
object_4 <- as.numeric(object_2)
object_5 <- letters[object_1]
object_6 <- as.factor(rep(object_5, 2))
object_7 <- c(1, 2, 3, "4", "5", "6")
The factor data type is special to `R` and uncommon in other programming languages. It is used to represent categorical variables with fixed possible values. For example, when there is a multiple choice question with 5 possible choices (a to e) and 10 students answer the question, we may get a result as in `object_6`.
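As a minimal sketch of how a factor behaves, the example below builds one from ten made-up answers (the vector `answers` is hypothetical and not part of the practical's objects). It shows that the possible values are stored as levels and that the data are stored internally as integer codes.

```r
# Hypothetical answers of 10 students to a question with choices a to e
answers <- factor(
  c("a", "c", "c", "e", "b", "c", "a", "d", "c", "b"),
  levels = c("a", "b", "c", "d", "e")
)

class(answers)       # "factor"
levels(answers)      # the fixed set of possible values: "a" "b" "c" "d" "e"
table(answers)       # counts per level: a = 2, b = 2, c = 4, d = 1, e = 1
as.integer(answers)  # the underlying integer codes (position in levels)
```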
Vectors can have only a single data type. Note that the first three elements in `object_7` have been converted to character strings. We can convert to different data types using the `as.<class>()` functions.
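The coercion described above can be sketched with a small made-up vector (`mixed` is a hypothetical name, not one of the exercise objects). Note that converting back to numbers only works for strings that actually look like numbers; everything else becomes `NA`.

```r
# Hypothetical vector mixing numbers and a string
mixed <- c(10, 20, "thirty")
class(mixed)  # "character": the numbers were silently coerced to strings

# Converting back: strings that are not numbers become NA (R warns about this)
converted <- suppressWarnings(as.numeric(mixed))
converted         # 10 20 NA
class(converted)  # "numeric"
```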
Convert `object_7` back to a vector of numbers using the `as.numeric()` function
object_7 <- as.numeric(object_7)
A list is a collection of objects. The elements may have names, but it is not necessary. Each element of a list can have a different data type.
Make a list called `objects` containing object 1 to 7 using the `list()` function.
objects <- list(object_1, object_2, object_3, object_4, object_5, object_6,
                object_7)
You can select elements of a list using its name (`objects$elementname`) or using its index (`objects[[1]]` for the first element).
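A short sketch of both ways of selecting, using a made-up list (`course` and its element names are hypothetical). It also shows the difference between double and single brackets, which is easy to trip over.

```r
# Hypothetical list: two named elements and one unnamed element
course <- list(name = "dav", weeks = 1:8, 3.14)

course$name         # select by name: "dav"
course[["weeks"]]   # select by name with double brackets: 1 2 3 4 5 6 7 8
course[[3]]         # select by index: 3.14

# Single brackets return a (shorter) list, not the element itself
class(course[3])    # "list"
class(course[[3]])  # "numeric"
```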
A special type of list is the `data.frame`. It is the same as a list, but each element is forced to have the same length and a name. The elements of a `data.frame` are the columns of a dataset. In the tidyverse, `data.frame`s are called `tibble`s.
Make a data frame out of `object_1`, `object_2`, and `object_5` using the `data.frame()` function
dat <- data.frame(Var1 = object_1, Var2 = object_2, Var3 = object_5)
dat
Just like a list, the columns in a data frame (the variables in a dataset) can be accessed using their name `df$columnname` or their index `df[[1]]`. Additionally, the tenth row can be selected using `df[10, ]`, the second column using `df[, 2]` and cell number 10, 2 can be accessed using `df[10, 2]`. This is because data frames also behave like the `matrix` data type in addition to the `list` type.
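The indexing styles above can be tried on a small made-up data frame (`df`, `id`, and `score` are hypothetical names chosen for illustration). This is a sketch for a base `data.frame`; a tibble behaves slightly differently for `df[, 2]`, which returns a one-column tibble rather than a vector.

```r
# Hypothetical data frame with 12 rows and 2 columns
df <- data.frame(id = 1:12, score = seq(5, 10.5, by = 0.5))

# List-style access to columns
df$score     # column by name
df[[1]]      # first column by index

# Matrix-style access to rows, columns, and cells
df[10, ]     # tenth row (a data frame with one row)
df[, 2]      # second column (a plain vector for a base data.frame)
df[10, 2]    # cell in row 10, column 2: 9.5
```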
You can check the size of a data frame with `ncol()` and `nrow()`. Try them out!
ncol(dat)
## [1] 3
nrow(dat)
## [1] 5
We are going to use a dataset from Kaggle: the Google Play Store apps data by user `lava18`. We have already downloaded it into the data folder from https://www.kaggle.com/lava18/google-play-store-apps (downloaded on 2018-09-28).
Tidyverse contains many data loading functions, each for their own file type, in the packages `readr` (default file types), `readxl` (excel files), and `haven` (external file types such as from SPSS or Stata). The most common file type is `csv`, which is what we use here.
Use the function `read_csv()` to import the file “data/googleplaystore.csv” and store it in a variable called `apps`.
apps <- read_csv("data/googleplaystore.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## App = col_character(),
## Category = col_character(),
## Rating = col_double(),
## Reviews = col_double(),
## Size = col_character(),
## Installs = col_character(),
## Type = col_character(),
## Price = col_character(),
## `Content Rating` = col_character(),
## Genres = col_character(),
## `Last Updated` = col_character(),
## `Current Ver` = col_character(),
## `Android Ver` = col_character()
## )
If necessary, use the help files. These import functions from the tidyverse are fast and safe: they display informative errors if anything goes wrong. `read_csv()` also displays a message with information on how each column is imported: which variable type each column gets.
# Several columns such as price and number of installs were imported as
# character data types, but they are numbers.
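One way to turn such character columns into numbers is `parse_number()` from `readr`, which drops non-numeric characters around the number and handles the `,` grouping mark. The sketch below uses a few made-up strings (the vectors `installs` and `prices` are hypothetical illustrations, not read from the dataset).

```r
library(readr)

# Hypothetical character values resembling install counts and prices
installs <- c("10,000+", "500,000,000+")
prices   <- c("0", "$4.99")

class(installs)         # "character"
parse_number(installs)  # numeric: 10000 and 500000000
parse_number(prices)    # numeric: 0 and 4.99
```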
Use the function `head()` to look at the first few rows of the `apps` dataset
head(apps)
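Import the file “data/students.xlsx” with `read_xlsx()` from the `readxl` package, store it in a variable called `students`, and look at its first rows. Also try out the functions `tail()` and `View()` (with a capital V).
students <- read_xlsx("data/students.xlsx")
head(students)
tail(students)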
Create a summary of the students dataset using the `summary()` function. What is the range of the grades achieved by the students?
summary(students)
## student_number grade programme
## Min. :4011659 Min. :4.844 Length:37
## 1st Qu.:4862862 1st Qu.:6.390 Class :character
## Median :6000241 Median :7.151 Mode :character
## Mean :5686729 Mean :6.991
## 3rd Qu.:6553913 3rd Qu.:7.573
## Max. :6997130 Max. :9.291
dplyr verbs

The tidyverse package `dplyr` contains functions to transform, rearrange, and filter data frames.
The first verb is `filter()`, which selects rows from a data frame. Chapter 5 of R4DS states that to use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
When you’re starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
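The comparison operators work element by element and return a logical vector, which is what `filter()` uses to decide which rows to keep. A minimal sketch with a made-up vector (`grades` is hypothetical):

```r
# Hypothetical grades
grades <- c(4.8, 5.5, 7.2, 9.1)

grades >= 5.5   # FALSE  TRUE  TRUE  TRUE
grades == 5.5   # FALSE  TRUE FALSE FALSE
grades != 5.5   #  TRUE FALSE  TRUE  TRUE

# A logical vector can be used directly to select elements
grades[grades < 5.5]   # 4.8

# Beware: at the top level, `grades = 5.5` would not compare anything;
# it would overwrite grades with the single value 5.5.
```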
Look at the help pages for `filter()` (especially the examples) and show the students with a grade lower than 5.5
filter(students, grade < 5.5)
If you are unsure how to proceed, read Section 5.2.2 from R4DS.
filter(students, grade > 8, programme == "A")
The second verb is `arrange()`, which sorts a data frame by one or more columns.
arrange(students, programme, -grade)
The third verb is `select()`, which selects columns of interest.
Select only the `student_number` and `programme` columns from the students dataset
select(students, student_number, programme)
# or, equivalently: select(students, -grade)
With `mutate()` you can compute new columns and transform existing columns as functions of the columns in your dataset. For example, we may create a new logical column in the students dataset to indicate whether a student has passed or failed:
students <- mutate(students, pass = grade > 5.5)
students
Now, the students dataset has an extra column named “pass”.
You can also transform existing columns with the `mutate()` function. For example, we may want to transform the programme column to an actual programme name according to this table:
Code | Name |
---|---|
A | Science |
B | Social Science |
Use `mutate()` and `recode()` to change the codes in the programme column of the students dataset to their names. Store the result in a variable called `students_recoded`
students_recoded <- mutate(students,
  programme = recode(programme, "A" = "Science", "B" = "Social Science")
)
Chapter 5 of R4DS neatly summarises the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
- Pick rows by their values (`filter()`).
- Reorder the rows (`arrange()`).
- Pick columns by their names (`select()`).
- Create new columns as functions of existing columns (`mutate()`).

Cleaning data files and extracting the most useful information is essential to any downstream steps such as plotting or analysis. Make sure you know exactly which variable types are in your tibbles / data frames!
A very useful feature in tidyverse is the pipe `%>%`. The pipe inputs the result from the left-hand side as the first argument of the right-hand side function: `filter(students, grade > 5.5)` becomes `students %>% filter(grade > 5.5)`. With the pipe, a set of processing steps becomes a neatly legible data processing pipeline!
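To make the equivalence concrete, the sketch below writes the same two processing steps once as nested function calls and once as a pipeline, using a small made-up tibble (`toy` and its columns are hypothetical). Both forms produce the same result; the piped version simply reads in the order in which the steps happen.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(name = c("x", "y", "z"), grade = c(4, 6, 8))

# Nested calls: read from the inside out
nested <- arrange(filter(toy, grade > 5.5), desc(grade))

# Pipeline: read from top to bottom
piped <- toy %>%
  filter(grade > 5.5) %>%
  arrange(desc(grade))

nested
piped
```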
Different tasks we have performed on the students dataset can be done in one pipeline like so:
students_dataset <-
  read_xlsx("data/students.xlsx") %>%
  mutate(prog = recode(programme, "A" = "Science", "B" = "Social Science")) %>%
  filter(grade > 5.5) %>%
  arrange(programme, -grade) %>%
  select(student_number, prog, grade)
students_dataset
In one statement, we have loaded the dataset from disk, recoded the programme variable, filtered only students that pass, reordered the rows and selected the relevant columns only. We did not need to save intermediate results or nest functions deeply.
Create a data processing pipeline that (a) loads the apps dataset, (b) parses the number of installs into a numeric `Downloads` variable using `mutate()` and `parse_number()`, (c) shows only apps with more than 500 000 000 downloads, (d) orders them by rating (best on top), and (e) shows only the relevant columns (you can choose which are relevant, but select at least the `Rating` and `Category` variables). Save the result under the name `popular_apps`.

If you find duplicates, you may need to use `distinct(App, .keep_all = TRUE)` as the last step in your pipeline to remove duplicate app names. Tip: ctrl/cmd + shift + M inserts a pipe operator in RStudio.
popular_apps <-
  read_csv("data/googleplaystore.csv") %>%
  mutate(Downloads = parse_number(Installs)) %>%
  filter(Downloads > 5e8) %>% # 5e8 is the same as 5 x 10^8
  arrange(-Rating) %>%
  select(App, Rating, Reviews, Downloads, Category) %>%
  distinct(App, .keep_all = TRUE)
popular_apps
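The effect of `.keep_all = TRUE` in `distinct()` can be seen on a tiny made-up table with one duplicated app name (`apps_toy` and its values are hypothetical). Without the argument only the named column is returned; with it, all columns are kept and the first row for each app name survives.

```r
library(dplyr)

# Hypothetical table in which "Maps" appears twice
apps_toy <- tibble(
  App     = c("Maps", "Maps", "Mail"),
  Rating  = c(4.3, 4.3, 4.1),
  Reviews = c(100, 120, 50)
)

only_names <- distinct(apps_toy, App)                    # 2 rows, App column only
deduped    <- distinct(apps_toy, App, .keep_all = TRUE)  # 2 rows, all columns

only_names
deduped  # keeps the first "Maps" row (Reviews = 100)
```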
We have now seen how we can transform and clean our datasets. The next step is to start exploring the dataset by computing relevant summary statistics, such as means, ranges, variances, differences, etc. We have already used the function `summary()` which comes with `R`, but `dplyr` has extra summary functionality in the form of the `summarise()` (or `summarize()`) verb.
An example that computes the mean, variance, minimum, and maximum grade of the `students_dataset` we made earlier is below:
students_dataset %>%
  summarise(
    mean = mean(grade),
    variance = var(grade),
    min = min(grade),
    max = max(grade)
  )
popular_apps %>%
  summarise(
    med = median(Rating),
    min = min(Rating),
    max = max(Rating)
  )
The `summarise()` function works with any function that takes a vector of numbers and outputs a single number. For example, we can create our own Median Absolute Deviation (MAD) function:
# Note: this definition masks stats::mad(), which by default also
# multiplies the result by a scaling constant (1.4826)
mad <- function(x) {
  median(abs(x - median(x)))
}
students_dataset %>% summarise(mad = mad(grade))
popular_apps %>%
  summarise(
    med = median(Rating),
    min = min(Rating),
    max = max(Rating),
    mad = mad(Rating)
  )
By itself, the `summarise()` function is not very useful; we can also simply use the `summary()` function or directly enter the vector we are interested in as an argument to the functions: `mad(students_dataset$grade)` = 0.5908503. The power of `summarise()` is in its combined use with the `group_by()` function, which makes it easy to make grouped summaries:
students_dataset %>%
  group_by(prog) %>%
  summarise(
    mean = mean(grade),
    variance = var(grade),
    min = min(grade),
    max = max(grade)
  )
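To see what `group_by()` adds, here is a minimal sketch on a made-up tibble (`toy` and its values are hypothetical): without grouping, `summarise()` returns a single row for the whole dataset; after `group_by()`, it returns one row per group.

```r
library(dplyr)

# Hypothetical toy data with two programmes
toy <- tibble(
  prog  = c("Science", "Science", "Social Science", "Social Science"),
  grade = c(6, 8, 7, 9)
)

# Ungrouped: one row for the whole dataset (mean = 7.5)
overall <- toy %>% summarise(n = n(), mean = mean(grade))

# Grouped: one row per programme (means 7 and 8)
per_prog <- toy %>%
  group_by(prog) %>%
  summarise(n = n(), mean = mean(grade))

overall
per_prog
```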
popular_apps %>%
  group_by(Category) %>%
  summarise(
    med = median(Rating),
    min = min(Rating),
    max = max(Rating),
    mad = mad(Rating)
  )
read_csv("data/googleplaystore.csv") %>%
filter(Category == "GAME" | Category == "COMMUNICATION") %>%
select(App, Category, Rating) %>%
distinct(App, .keep_all = TRUE) %>%
group_by(Category) %>%
summarise(
mean = mean(Rating, na.rm = TRUE),
median = median(Rating, na.rm = TRUE)
)
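The `na.rm = TRUE` arguments in the pipeline above matter whenever a column contains missing values: by default, `mean()` and `median()` return `NA` as soon as one value is missing. A small sketch with a made-up vector (`ratings` is hypothetical):

```r
# Hypothetical ratings with one missing value
ratings <- c(4.5, NA, 3.9, 4.1)

mean(ratings)                  # NA: the missing value propagates
mean(ratings, na.rm = TRUE)    # mean of the three observed values (about 4.17)
median(ratings, na.rm = TRUE)  # 4.1
```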
When you have finished the practical, enclose all files of the project `01_Data_wrangling.Rproj` (i.e. all `.R` and/or `.Rmd` files including the one with your answers, and the `.Rproj` file) in a zip file, and hand in the zip by PR from your fork here. Do so before Lecture 3. That way we can iron out issues during the next Q&A in Week 2.