Fundamental Techniques in Data Science with R

Very useful

Laura Boeschoten

About me

A small primer

Election data from NY Times

Distribution of votes

Per county

Population density

Time of vote

Absentee votes

Goal of this course

Real-world Goal

We learn to use regression, a technique aimed at figuring out the strength of the relation between an outcome and a set of predictors.

It is basically one of the following three scenarios:

Another example

In general it holds that the more hours you spend on studying, the higher your grade. But this relation is not 1:1.

What is the relation?

We will learn to identify the average conditional relation between outcome and predictors.

Who passed?

We will learn to differentiate our investigation between different groups. For example, is there a difference in the relation between Hours and Grade for different groups?

Who passed?

Finally, we will learn how to estimate the probability of pass or fail, based on the Hours studied. In other words, how many Hours should you study to pass?
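
As a rough preview of where the course is heading, the sketch below fits both kinds of model on simulated data. The variable names, the simulated relation and the pass mark of 5.5 are assumptions made up for illustration; they are not course data.

```r
# Simulated data; all numbers below are made up for illustration only
set.seed(123)
hours <- runif(100, min = 0, max = 20)          # hours studied
grade <- 3 + 0.3 * hours + rnorm(100, sd = 1)   # grade: noisy linear relation
pass  <- as.numeric(grade >= 5.5)               # 1 = pass, 0 = fail (assumed pass mark)

# Linear regression: average conditional relation between grade and hours
fit_lm <- lm(grade ~ hours)
coef(fit_lm)

# Logistic regression: probability of passing, given the hours studied
fit_glm <- glm(pass ~ hours, family = binomial)
p_pass <- predict(fit_glm, newdata = data.frame(hours = 10), type = "response")
p_pass
```

The functions lm() and glm() are covered in detail from week 3 and week 6 onwards.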

Real-world Example

What is the relation between age and bmi?

Real-world Example

Can we infer the relation between age and bmi, but now for different levels of bmi?

Real-world Example

Finally, if we would develop some intervention campaign; which age-group(s) should be targeted?

What else?

Learn to keep your cool

and build the foundation for a successful scripting career in predictive and inferential analytics

Formal Goals

  1. apply and interpret the basic methodological and statistical concepts that are associated with doing predictive and/or inferential research;
  2. apply and interpret important techniques in linear and logistic regression analysis;

This means that you will learn the ins and outs of inferential and predictive research with linear and logistic models.

  • what this all covers will become clear during the course
  • we will learn R to perform our data analysis and visualizations
  • we will learn the math and skills behind these ubiquitous modeling techniques
  • we will also learn the assumptions of (logistic) regression models

Workgroups, Assignments, Exercises and Exam

Course Manual

  • R-exercises: every week
  • Workgroup: every Thursday @ 9am
    • group work
    • manage your expectations
  • Assignments: 2 in total, graded
    • If your name is on the assignment, I assume you have contributed.
    • If a group omits a name, I expect that you have notified that person.
  • Exam: just 1, graded

Overview of this course

Program

Week # | Topic | R-practical | Workgroup
1 | The elemental building blocks of R | Assigning objects and elements; creating vectors, matrices, dataframes and lists | Receive instructions and form groups
2 | Finding the least squares solution; simple linear regression | Subsetting data; using pipes to simplify the workflow | Locate a data set for predictive modeling and formulate a research hypothesis; make sure that the set facilitates continuous and dichotomous outcomes
3 | Linear modeling in R; testing assumptions; standardized residuals, leverage and Cook’s distance | Class lm in R; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met
4 | Inferential modeling; confidence intervals and hypothesis testing, non-constant error variance | Demonstrate confidence validity of the linear model on simulated data with rmarkdown | Test and quantify the effect of the defined model; continue the project in rmarkdown
5 | Model evaluation; cross-validation; categorical variables, non-linear relations, interactions and higher-order polynomials | Cross-validation and model fit in R | Evaluate if the model can be improved; prepare assignment A; evaluate the final linear model on your own data

Program

Week # | Topic | R-practical | Workgroup
6 | Simple logistic regression | Class glm(formula, family = "binomial") in R; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met
7 | Formulating the logistic model and interpreting the parameters; marginal effects | Parameter transformations; scale of the predictor/outcome and prediction and confidence intervals | Test and quantify the effect of the defined model
8 | Logistic regression model evaluation; cross-validation; multiple regression; interactions | Multiple logistic regression and cross-validating the logistic regression in R | Evaluate if the model can be improved; prepare assignment B; evaluate the final logistic model on your own data

We need R. What is it?

Software

The origin of R

  • R is a language and environment for statistical computing and for graphics

  • GNU project (100% free software)

  • Managed by the R Foundation for Statistical Computing, Vienna, Austria.

  • Community-driven

  • Based on the object-oriented language S (1975)

What is RStudio?

Integrated Development Environment

RStudio

  • Aggregates all convenient information and procedures into one single place
  • Allows you to work in projects
  • Manages your code with highlighting
  • Gives extra functionality (Shiny, knitr, markdown, LaTeX)
  • Allows for integration with version control routines, such as Git.

How does R work

Objects and elements

  • R works with objects that consist of elements. The smallest elements are numbers and characters.

    • These elements are assigned to objects.
    • A set of objects can be used to perform calculations
    • Calculations can be presented as functions
    • Functions are used to perform calculations and return new objects, containing calculated (or estimated) elements.
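
A minimal sketch of this idea, using made-up object names (x and x_mean are only illustrations): elements are assigned to an object, and a function then returns a new object.

```r
x <- c(2, 4, 6)      # an object holding three numeric elements
x_mean <- mean(x)    # a function performs a calculation and returns a new object
x_mean
## [1] 4
x * 2                # objects can also be used directly in calculations
## [1]  4  8 12
```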

The help

  • Everything that is published on the Comprehensive R Archive Network (CRAN) and is aimed at R users, must be accompanied by a help file.

  • If you know the name of the function that performs an operation, e.g. anova(), then you just type ?anova or help(anova) in the console.

  • If you do not know the name of the function: type ?? followed by your search criterion. For example ??anova returns a list of all help pages that contain the word ‘anova’

  • Alternatively, the internet will tell you almost everything you’d like to know and sites such as http://www.stackoverflow.com and http://www.stackexchange.com, as well as Google can be of tremendous help.

    • If you google R-related issues, use ‘R:’ as a prefix in your search term
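
The help calls above can be sketched as follows. The calls that open help pages are wrapped in interactive() here, which is a choice made for this example so that the code also runs in a script; args() and apropos() are two further base R functions that may be handy when exploring.

```r
# These calls open help pages, so they are only run in an interactive session
if (interactive()) {
  help("anova")          # same as ?anova
  help.search("anova")   # same as ??anova
}

# Alternatives that print to the console
args(anova)                          # show the arguments of anova()
found <- apropos("anova")            # names of available objects containing "anova"
found
```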

Assigning elements to objects

  • Assigning things in R is very straightforward:

    • you just use <-
  • For example, if you assign the value 100 (an element) to object a, you would type

a <- 100

Calling objects

  • Calling things in R is also very straightforward:

    • you just type the name you have given to the object
  • For example, we assigned the value 100 to object a. To call object a, we would type

a
## [1] 100

Writing code

This is why we use RStudio.

Objects that contain more than one element

More than one element

  • We can assign more than one element to a vector (in this case a 1-dimensional concatenation of numbers 1 through 5)
a <- c(1, 2, 3, 4, 5)
a
## [1] 1 2 3 4 5
b <- 1:5
b
## [1] 1 2 3 4 5

More than one element, with characters

Characters (or character strings) in R are indicated by the double quote identifier.

a.new <- c(a, "A")
a.new
## [1] "1" "2" "3" "4" "5" "A"

Notice the difference with a from the previous slide

a
## [1] 1 2 3 4 5
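
The difference is one of type: adding a single character turns every element into a character. A small sketch to check this, using class() and as.numeric():

```r
a <- c(1, 2, 3, 4, 5)
a.new <- c(a, "A")

class(a)
## [1] "numeric"
class(a.new)
## [1] "character"

# Converting back: "A" has no numeric value and becomes NA (with a warning)
a.back <- suppressWarnings(as.numeric(a.new))
a.back
## [1]  1  2  3  4  5 NA
```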

Quickly identifying elements in vectors

rep(a, 15)
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3
## [39] 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

Calling elements in vectors

If we want just the third element, we type

a[3]
## [1] 3
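
The same brackets also accept more than one position. A short sketch of some common variations:

```r
a <- c(1, 2, 3, 4, 5)
a[c(1, 3)]   # first and third element
## [1] 1 3
a[2:4]       # second through fourth element
## [1] 2 3 4
a[-3]        # everything except the third element
## [1] 1 2 4 5
```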

Multiple vectors in one object

This we would refer to as a matrix

c <- matrix(a, nrow = 5, ncol = 2)
c
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5
matrix(a, nrow = 5, ncol = 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    1
## [4,]    2    3
## [5,]    4    5
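
Note that a holds only 5 elements, while a 5 x 2 matrix has 10 cells: R recycles the vector to fill the matrix. The sketch below checks this; the matrix is named m here (instead of c) purely to keep the example self-contained.

```r
a <- c(1, 2, 3, 4, 5)
m <- matrix(a, nrow = 5, ncol = 2)   # same as matrix c on the slide

dim(m)                      # number of rows and columns
## [1] 5 2
length(m)                   # 10 cells, filled from a vector of length 5
## [1] 10
identical(m[, 1], m[, 2])   # both columns are copies of a
## [1] TRUE
```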

Calling elements in matrices #1

  • The first row is called by
c[1, ]
## [1] 1 1
  • The second column is called by
c[, 2]
## [1] 1 2 3 4 5

Calling elements in matrices #2

  • The intersection of the first row and second column is called by
c[1, 2]
## [1] 1

In short, (square) brackets [] are used to call elements, rows and columns.
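
Two further variations that may be useful, sketched with a matrix named m that is assumed to equal matrix c: selecting several rows at once, and drop = FALSE to keep a single row in matrix form.

```r
m <- matrix(c(1, 2, 3, 4, 5), nrow = 5, ncol = 2)
m[c(1, 3), ]            # rows 1 and 3
##      [,1] [,2]
## [1,]    1    1
## [2,]    3    3
m[1, , drop = FALSE]    # keep the result a 1 x 2 matrix instead of a vector
##      [,1] [,2]
## [1,]    1    1
```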

Matrices with mixed numeric / character data

If we add a character column to matrix c, everything becomes a character:

cbind(c, letters[1:5])
##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a" 
## [2,] "2"  "2"  "b" 
## [3,] "3"  "3"  "c" 
## [4,] "4"  "4"  "d" 
## [5,] "5"  "5"  "e"

Matrices with mixed numeric / character data

Alternatively,

cbind(c, c("a", "b", "c", "d", "e"))
##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a" 
## [2,] "2"  "2"  "b" 
## [3,] "3"  "3"  "c" 
## [4,] "4"  "4"  "d" 
## [5,] "5"  "5"  "e"

Remember, matrices and vectors are numerical OR character objects. They can never contain both and still be used for numerical calculations.
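
A quick sketch of what this means in practice. The names m and m.chr are made up for this example; tryCatch() is used only so that the error can be shown without stopping the script.

```r
m <- matrix(c(1, 2, 3, 4, 5), nrow = 5, ncol = 2)
m.chr <- cbind(m, letters[1:5])

is.numeric(m)
## [1] TRUE
is.character(m.chr)
## [1] TRUE

sum(m)                  # works on the numeric matrix
## [1] 30

# sum() on a character matrix is an error; tryCatch() lets us inspect it
res <- tryCatch(sum(m.chr), error = function(e) "error: cannot sum characters")
res
## [1] "error: cannot sum characters"
```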

Data frames

d <- data.frame("V1" = rnorm(5),
                "V2" = rnorm(5, mean = 5, sd = 2), 
                "V3" = letters[1:5])
d
##           V1       V2 V3
## 1  2.1988103 4.047506  a
## 2  1.3124130 3.422794  b
## 3 -0.2651451 3.810765  c
## 4  0.5431941 8.301815  d
## 5 -0.4143399 4.891944  e

We ‘filled’ a data frame with two randomly generated sets from the normal distribution - where \(V1\) is standard normal and \(V2 \sim N(5, 2)\), with mean 5 and standard deviation 2 - and a character set.
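
Because rnorm() draws random numbers, running this code yourself will give different values than shown on the slide. One way to make the draws reproducible is set.seed(); the seed value 123 below is an arbitrary choice, and d1 and d2 are names made up for this sketch.

```r
set.seed(123)   # any fixed seed works; 123 is arbitrary
d1 <- data.frame(V1 = rnorm(5), V2 = rnorm(5, mean = 5, sd = 2), V3 = letters[1:5])

set.seed(123)
d2 <- data.frame(V1 = rnorm(5), V2 = rnorm(5, mean = 5, sd = 2), V3 = letters[1:5])

identical(d1, d2)   # same seed, same random draws
## [1] TRUE
str(d1)             # shows the type of every column
```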

Data frames (continued)

Data frames can contain both numerical and character elements at the same time, although never in the same column.

You can name the columns and rows in data frames (just like in matrices)

row.names(d) <- c("row 1", "row 2", "row 3", "row 4", "row 5")
d
##               V1       V2 V3
## row 1  2.1988103 4.047506  a
## row 2  1.3124130 3.422794  b
## row 3 -0.2651451 3.810765  c
## row 4  0.5431941 8.301815  d
## row 5 -0.4143399 4.891944  e

Calling row elements in data frames

There are two ways to obtain row 3 from data frame d:

d["row 3", ]
##               V1       V2 V3
## row 3 -0.2651451 3.810765  c

and

d[3, ]
##               V1       V2 V3
## row 3 -0.2651451 3.810765  c

The intersection between row 2 and column 3 can be obtained by

d[2, 3]
## [1] "b"

Calling column elements in data frames

Both

d[, "V2"] # and
## [1] 4.047506 3.422794 3.810765 8.301815 4.891944
d[, 2]
## [1] 4.047506 3.422794 3.810765 8.301815 4.891944

yield the second column. But we can also use $ to call variable names in data frame objects

d$V2
## [1] 4.047506 3.422794 3.810765 8.301815 4.891944
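
A few more ways of selecting from a data frame, as a sketch. The values below are rounded stand-ins for the random numbers on the slides, so that the example runs on its own.

```r
# Stand-in data, rounded from the values shown on the slides
d <- data.frame(V1 = c(2.20, 1.31, -0.27, 0.54, -0.41),
                V2 = c(4.05, 3.42, 3.81, 8.30, 4.89),
                V3 = letters[1:5])

d[["V2"]]             # same column as d$V2 and d[, "V2"]
## [1] 4.05 3.42 3.81 8.30 4.89
d[d$V2 > 4, ]         # rows where V2 is larger than 4 (rows 1, 4 and 5)
d[, c("V1", "V3")]    # several columns at once
```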

Beyond two dimensions: a list

Lists are just what the name says they are: lists. You can have a list of everything mixed with everything. For example, a simple list can be created by

f <- list(a)
f
## [[1]]
## [1] 1 2 3 4 5

Elements or objects within lists can be called by using double square brackets [[]]. For example, the first (and only) element in list f is object a

f[[1]]
## [1] 1 2 3 4 5

Lists (continued)

We can simply add an object or element to an existing list

f[[2]] <- d
f
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
##               V1       V2 V3
## row 1  2.1988103 4.047506  a
## row 2  1.3124130 3.422794  b
## row 3 -0.2651451 3.810765  c
## row 4  0.5431941 8.301815  d
## row 5 -0.4143399 4.891944  e

to obtain a list with a vector and a data frame.

Lists (continued)

We can add names to the list as follows

names(f) <- c("vector", "data frame")
f
## $vector
## [1] 1 2 3 4 5
## 
## $`data frame`
##               V1       V2 V3
## row 1  2.1988103 4.047506  a
## row 2  1.3124130 3.422794  b
## row 3 -0.2651451 3.810765  c
## row 4  0.5431941 8.301815  d
## row 5 -0.4143399 4.891944  e

Calling elements in lists

Calling the vector (a) from the list can be done as follows

f[[1]]
## [1] 1 2 3 4 5
f[["vector"]]
## [1] 1 2 3 4 5
f$vector
## [1] 1 2 3 4 5

Lists in lists

Take the following example

g <- list(f, f)

To call the vector from the second list within the list g, use the following code

g[[2]][[1]]
## [1] 1 2 3 4 5
g[[2]]$vector
## [1] 1 2 3 4 5
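
One detail worth knowing here is the difference between single and double brackets on a list. The sketch below uses a small made-up list (the element named letters is only an illustration) instead of the data frame from the slides.

```r
a <- c(1, 2, 3, 4, 5)
f <- list(vector = a, letters = c("x", "y"))

class(f["vector"])       # single brackets return a (smaller) list
## [1] "list"
class(f[["vector"]])     # double brackets return the element itself
## [1] "numeric"

g <- list(f, f)
g[[2]][["vector"]][3]    # third element of the vector in the second list
## [1] 3
```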

Logical operators

  • Logical operators are signs that evaluate a statement: the comparisons ==, !=, <, >, <=, >=, which can be combined with | (OR) and & (AND). Typing ! before a logical statement takes its complement (TRUE becomes FALSE and vice versa). There are more operators, but these are the most useful.

  • For example, if we would like elements out of matrix c that are larger than 3, we would type:

c[c > 3]
## [1] 4 5 4 5

Why does a logical statement on a matrix return a vector?

c > 3
##       [,1]  [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE

The number of TRUE values may differ per column, so the selected elements would not always fit in a rectangular matrix. A vector as a return is therefore more appropriate.
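
If the positions of the selected elements matter, which() can recover them. A sketch, again with m assumed to equal matrix c:

```r
m <- matrix(c(1, 2, 3, 4, 5), nrow = 5, ncol = 2)
which(m > 3)                          # positions, counted down the columns
## [1]  4  5  9 10
pos <- which(m > 3, arr.ind = TRUE)   # row and column index of every TRUE
pos
```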

Logical operators (cont’d)

  • If we would like the elements that are smaller than 3 OR larger than 3, we could type
c[c < 3 | c > 3] #c smaller than 3 or larger than 3
## [1] 1 2 4 5 1 2 4 5

or

c[c != 3] #c not equal to 3
## [1] 1 2 4 5 1 2 4 5

Logical operators (cont’d)

  • In fact, c != 3 returns a matrix
##       [,1]  [,2]
## [1,]  TRUE  TRUE
## [2,]  TRUE  TRUE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE
  • Remember c?:
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5
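
The operators ! and & work in the same way. A short sketch, with m standing in for matrix c:

```r
m <- matrix(c(1, 2, 3, 4, 5), nrow = 5, ncol = 2)
m[!(m == 3)]          # ! negates the condition: same as m[m != 3]
## [1] 1 2 4 5 1 2 4 5
m[m > 1 & m < 5]      # & requires both conditions to hold
## [1] 2 3 4 2 3 4
```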

Things that cannot be done

  • Things that have no representation in real number space (at least not without tremendous effort)
    • For example, the following code returns “Not a Number”
0 / 0
## [1] NaN
  • Also impossible are calculations based on missing values (NA’s)
mean(c(1, 2, NA, 4, 5))
## [1] NA

Standard solutions for missing values

There are two easy ways to perform “listwise deletion”:

mean(c(1, 2, NA, 4, 5), na.rm = TRUE)
## [1] 3
mean(na.omit(c(1, 2, NA, 4, 5)))
## [1] 3
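
Before removing missing values it can help to check where they are and how many there are. A minimal sketch with is.na():

```r
x <- c(1, 2, NA, 4, 5)
is.na(x)          # which elements are missing?
## [1] FALSE FALSE  TRUE FALSE FALSE
sum(is.na(x))     # how many are missing?
## [1] 1
x[!is.na(x)]      # keep only the observed values
## [1] 1 2 4 5
is.nan(0 / 0)     # NaN has its own test
## [1] TRUE
```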

Floating point example

(3 - 2.9)
## [1] 0.1
(3 - 2.9) == 0.1
## [1] FALSE

Why does R tell us that 3 - 2.9 \(\neq\) 0.1?

(3 - 2.9) - .1
## [1] 8.326673e-17
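
The tiny difference arises because numbers such as 2.9 and 0.1 cannot be stored exactly in binary floating point. A common way around this is to compare up to a small tolerance rather than exactly; the sketch below uses all.equal(), and the tolerance 1e-8 in the last line is an arbitrary choice.

```r
x <- 3 - 2.9
x == 0.1                    # exact comparison fails
## [1] FALSE
isTRUE(all.equal(x, 0.1))   # comparison up to a small numerical tolerance
## [1] TRUE
abs(x - 0.1) < 1e-8         # the same idea written out by hand
## [1] TRUE
```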

Some programming tips:

  • keep your code tidy
  • use comments (text preceded by #) to clarify what you are doing
    • If you look at your code again one month from now, you will not know what you did, unless you use comments
  • when working with functions, use the TAB key to quickly access the help for the function’s components
  • work with logically named R-scripts
    • indicate the sequential nature of your work
  • work with RStudio projects
  • if allowed, place your project folders in some cloud-based environment

Functions

Functions have parentheses (). Names directly followed by parentheses always indicate functions. For example:

  • matrix() is a function
  • c() is a function
  • but (1 - 2) * 5 is a calculation, not a function
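
A calculation can be turned into a function of your own. The function below is a hypothetical example (subtract_and_scale is not part of R) that wraps the calculation above:

```r
# A hypothetical helper function, not part of base R
subtract_and_scale <- function(x, y, factor = 5) {
  (x - y) * factor
}

subtract_and_scale(1, 2)                # same calculation as (1 - 2) * 5
## [1] -5
subtract_and_scale(1, 2, factor = 10)   # arguments can be changed when calling
## [1] -10
```

Typing the name without parentheses, subtract_and_scale, prints the function itself instead of running it.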

Packages

Packages give additional functionality to R.

By default, some packages are included. These packages allow you to do mainstream statistical analyses and data manipulation. Installing additional packages allows you to perform the state of the art in statistical programming and estimation.

The cool thing is that these packages are all developed by users. The throughput process is therefore very timely:

  • newly developed functions and software are readily available
  • this is different from other mainstream software, like SPSS, where new methodology may take years to be implemented.

A list of available packages can be found on CRAN

Loading packages

There are two ways to load a package in R

library(stats)

and

require(stats)
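
The two differ in what happens when the package is not available: library() stops with an error, while require() gives a warning and returns FALSE. The sketch below shows this with a made-up package name that is assumed not to be installed.

```r
pkg <- "notARealPackage123"   # hypothetical name, assumed not to be installed

# require() returns FALSE (with a warning) when the package is unavailable
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
ok
## [1] FALSE

# library() stops with an error instead; tryCatch() catches it here
res <- tryCatch({
  library(pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "error")
res
## [1] "error"
```

In scripts, library() is usually the safer choice, because a missing package is then noticed immediately.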

Installing packages

The easiest way to install e.g. package mice is to use

install.packages("mice")

Alternatively, you can also do it in RStudio through

Tools --> Install Packages

R in depth

Workspaces and why you should sometimes save them

A workspace contains all objects you have created in your R session.

A saved workspace contains everything as it was at the moment of saving.

You do not need to run all the previous code again if you would like to continue working at a later time.

  • You can save the workspace and continue exactly where you left off.

Workspaces are compressed and require relatively little disk space when stored. The compression is very efficient, and loading a workspace is often faster than reloading large datasets.
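
Saving and restoring can also be done from code. The sketch below saves a single object with save() and restores it with load(); the temporary file is used only for illustration, and save.image() would store the complete workspace in the same way.

```r
x <- c(1, 2, 3)
file <- tempfile(fileext = ".RData")   # temporary file, for illustration only

save(x, file = file)   # save.image(file) would store every object in the workspace
rm(x)                  # remove the object from the current session
load(file)             # restores x under its original name
x
## [1] 1 2 3
```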

History and why it is useful

R by default saves (part of) the code history and RStudio expands this functionality greatly.

It is often useful to look back at the code history.

  • There are multiple ways to access the code history.

    1. Use arrow up in the console. This allows you to go back in time, one line of code at a time. Extremely useful to go back to previous lines for minor alterations to the code.
    2. Use the history tab in the environment pane. The complete project history can be found here and the history can be searched. This is particularly convenient when you know what code you are looking for.

Working in projects in RStudio

  • Every project has its own history
  • Every research project has its own project
  • Every project can have its own folder, which also serves as a research archive
  • Every project can have its own version control system
  • RStudio projects can relate to Git (or other online) repositories

In general…

  • Use common sense and BE CONSISTENT.

  • Browse through the tidyverse style guide

    • The point of having style guidelines is to have a common vocabulary of coding
    • so people can concentrate on what you are saying, rather than on how you are saying it.
  • If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers and collaborators out of their rhythm when they go to read it. Try to avoid this.

  • Intentional spacing makes your code easier to interpret

    • a<-c(1,2,3,4,5) vs.
    • a <- c(1, 2, 3, 4, 5)
  • at least put a space after every comma!

To continue

What to do?

  1. Before Thursday: complete exercise 1
  2. Thursday: workgroup.
  3. After the workgroup: complete exercise 3
  4. Next week:
  • R: dplyr, pipes and linear modeling.
  • Statistics: deviations, the use of squares and simple modeling

See you next week