Introduction to R

Fundamental Techniques in Data Science with R

Very useful

You can access the course materials quickly from

www.gerkovink.com/fundamentals

You can reach me at G.Vink@uu.nl or via the MS Teams channels

Laura Boeschoten

About me

A small primer

Election data from NY Times

Distribution of votes

Per county

Population density

Time of vote

Absentee votes

Goal of this course

Real-world Goal

We learn to use regression; a technique that is aimed at figuring out the strength of the relation between an outcome and a set of predictors.

It is basically one of the following three scenarios:

Another example

In general it holds that the more hours you spend on studying, the higher your grade. But this relation is not 1:1.

What is the relation?

We will learn to identify the average conditional relation between outcome and predictors.

Who passed?

We will learn to differentiate our investigation between different groups. For example, is there a difference in the relation between Hours and Grade for different groups?

Who passed?

Finally, we will learn how to estimate the probability of pass or fail, based on the Hours studied. In other words, how many Hours should you study to pass?

Real-world Example

What is the relation between age and bmi?

Real-world Example

Can we infer the relation between age and bmi, but now for different levels of bmi?

Real-world Example

Finally, if we would develop some intervention campaign; which age-group(s) should be targeted?

What else?

Learn to keep your cool

and build the foundation for a successful scripting career in predictive and inferential analytics

Formal Goals

apply and interpret the basic methodological and statistical concepts that are associated with doing predictive and/or inferential research;
apply and interpret important techniques in linear and logistic regression analysis;

This means that you will learn the ins and outs of inferential and predictive research with linear and logistic models.

what this all covers will become clear during the course
we will learn R to perform our data analysis and visualizations
we will learn the math and skills behind these ubiquitous modeling techniques
we will also learn the assumptions of (logistic) regression models

Workgroups, Assignments, Exercises and Exam

Course Manual

R-exercises: every week
Workgroup: every Thursday @ 9am
- group work
- manage your expectations
Assignments: 2 in total, graded
- If your name is on the assignment; I assume you have contributed.
- If a group omits a name; I expect that you have notified that person.
Exam: just 1, graded

Overview of this course

Program

| Week # | Topic | `R`-practical | Workgroup |
|---|---|---|---|
| 1 | The elemental building blocks of `R` | Assigning objects and elements; creating vectors, matrices, dataframes and lists | Receive instructions and form groups |
| 2 | Finding the least squares solution; simple linear regression | Subsetting data; using pipes to simplify the workflow | Locate a data set for predictive modeling and formulate a research hypothesis; make sure that the set facilitates continuous and dichotomous outcomes |
| 3 | Linear modeling in `R`; testing assumptions; standardized residuals, leverage and Cook’s distance | Class `lm` in `R`; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met |
| 4 | Inferential modeling; Confidence intervals and hypothesis testing, non-constant error variance | Demonstrate confidence validity of the linear model on simulated data with `rmarkdown` | Test and quantify the effect of the defined model; continue the project in `rmarkdown` |
| 5 | Model evaluation; cross-validation; categorical variables, non-linear relations, interactions and higher-order polynomials | Cross-validation and model fit in `R` | Evaluate if the model can be improved; Prepare assignment A; evaluate the final linear model on your own data |

Program

| Week # | Topic | `R`-practical | Workgroup |
|---|---|---|---|
| 6 | Simple logistic regression | Class `glm(formula, family = "binomial")` in `R`; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met |
| 7 | Formulating the logistic model and interpreting the parameters; marginal effects | Parameter transformations; scale of the predictor/outcome and prediction and confidence intervals | Test and quantify the effect of the defined model |
| 8 | Logistic regression model evaluation; cross-validation; multiple regression; interactions | Multiple logistic regression and cross-validating the logistic regression in `R` | Evaluate if the model can be improved; Prepare assignment B; evaluate the final logistic model on your own data |

We need R. What is it?

Software

The origin of R

R is a language and environment for statistical computing and for graphics
GNU project (100% free software)
Managed by the R Foundation for Statistical Computing, Vienna, Austria.
Community-driven
Based on the object-oriented language S (1975)

What is RStudio?

Integrated Development Environment

RStudio

Aggregates all convenient information and procedures into one single place
Allows you to work in projects
Manages your code with highlighting
Gives extra functionality (Shiny, knitr, markdown, LaTeX)
Allows for integration with version control routines, such as Git.

How does R work

Objects and elements

R works with objects that consist of elements. The smallest elements are numbers and characters.
- These elements are assigned to objects.
- A set of objects can be used to perform calculations
- Calculations can be presented as functions
- Functions are used to perform calculations and return new objects, containing calculated (or estimated) elements.
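
As a minimal sketch of the chain above (the object names `x`, `label` and `x_mean` are made up for illustration): elements go into objects, and a function turns objects into a new object.

```r
# Assign elements to objects
x <- c(2, 4, 6)      # a numeric object with three elements
label <- "even"      # a character object with one element

# A function takes an object as input and returns a new object
x_mean <- mean(x)
x_mean               # 4
```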

The help

Everything that is published on the Comprehensive R Archive Network (CRAN) and is aimed at R users, must be accompanied by a help file.
If you know the name of the function that performs an operation, e.g. anova(), then you just type ?anova or help(anova) in the console.
If you do not know the name of the function: type ?? followed by your search criterion. For example ??anova returns a list of all help pages that contain the word ‘anova’
Alternatively, the internet will tell you almost everything you’d like to know and sites such as http://www.stackoverflow.com and http://www.stackexchange.com, as well as Google can be of tremendous help.
- If you google R related issues; use ‘R:’ as a prefix in your search term
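
Next to ?anova and ??anova, two base R helpers may come in handy when you only half remember a function. The sketch below uses apropos() to search the names of attached objects and args() to print the arguments of a function; the object name `hits` is just an illustration.

```r
# All attached objects whose name contains "anova"
hits <- apropos("anova")
hits

# The arguments of a function, without opening its help page
args(anova)
```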

Assigning elements to objects

Assigning things in R is very straightforward:
- you just use <-
For example, if you assign the value 100 (an element) to object a, you would type

a <- 100

Calling objects

Calling things in R is also very straightforward:
- you just type the name you have given to the object
For example, we assigned the value 100 to object a. To call object a, we would type

a

## [1] 100

Writing code

This is why we use RStudio.

Objects that contain more than one element

More than one element

We can assign more than one element to a vector (in this case a 1-dimensional concatenation of numbers 1 through 5)

a <- c(1, 2, 3, 4, 5)
a

## [1] 1 2 3 4 5

b <- 1:5
b

## [1] 1 2 3 4 5
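
Although a and b print identically, they are not stored in the same way. A small sketch of how one might check this:

```r
a <- c(1, 2, 3, 4, 5)
b <- 1:5

length(a)         # 5
all(a == b)       # TRUE: the values are equal
class(a)          # "numeric"
class(b)          # "integer"
identical(a, b)   # FALSE: the storage type differs
```

For most calculations this difference does not matter; it mainly shows up when objects are compared with identical().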

More than one element, with characters

Characters (or character strings) in R are indicated by the double quote identifier.

a.new <- c(a, "A")
a.new

## [1] "1" "2" "3" "4" "5" "A"

Notice the difference with a from the previous slide

a

## [1] 1 2 3 4 5
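
The quotes in the output indicate that every element has been converted to a character. The sketch below checks this with class() and converts back with as.numeric(); the name `a.back` is only used for illustration.

```r
a <- c(1, 2, 3, 4, 5)
a.new <- c(a, "A")

class(a)       # "numeric"
class(a.new)   # "character"

# Converting back: elements that are not numbers become NA (with a warning,
# which is suppressed here)
a.back <- suppressWarnings(as.numeric(a.new))
a.back         # 1 2 3 4 5 NA
```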

Quickly identifying elements in vectors

rep(a, 15)

##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3
## [39] 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

Calling elements in vectors

If we would want just the third element, we would type

a[3]

## [1] 3
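
The same brackets accept more than one position. A short sketch:

```r
a <- c(1, 2, 3, 4, 5)

a[c(1, 3)]   # first and third element: 1 3
a[2:4]       # second through fourth element: 2 3 4
a[-1]        # everything except the first element: 2 3 4 5
```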

Multiple vectors in one object

This we would refer to as a matrix

c <- matrix(a, nrow = 5, ncol = 2)
c

##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5

matrix(a, nrow = 5, ncol = 2, byrow = TRUE)

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    1
## [4,]    2    3
## [5,]    4    5
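
Note that a has only 5 elements, while a 5 x 2 matrix has 10 cells: R silently recycles the vector to fill the matrix. The sketch below makes this visible; `m` and `m2` are illustrative names.

```r
a <- c(1, 2, 3, 4, 5)

m <- matrix(a, nrow = 5, ncol = 2)
dim(m)       # 5 2
length(m)    # 10: the five elements of a are used twice

# Supplying ten elements fills the matrix column by column, without recycling
m2 <- matrix(1:10, nrow = 5, ncol = 2)
m2[, 2]      # 6 7 8 9 10
```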

Calling elements in matrices #1

The first row is called by

c[1, ]

## [1] 1 1

The second column is called by

c[, 2]

## [1] 1 2 3 4 5

Calling elements in matrices #2

The intersection of the first row and second column is called by

c[1, 2]

## [1] 1

In short; (square) brackets [] are used to call elements, rows and columns.
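
When a single row or column is called, R returns a vector and drops the matrix structure. If the result should stay a matrix, the drop argument can be set to FALSE, as in this sketch (`row1` and `row1_mat` are illustrative names):

```r
a <- c(1, 2, 3, 4, 5)
c <- matrix(a, nrow = 5, ncol = 2)

row1 <- c[1, ]                     # a vector of length 2
row1_mat <- c[1, , drop = FALSE]   # still a matrix, with 1 row and 2 columns

is.matrix(row1)       # FALSE
dim(row1_mat)         # 1 2
```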

Matrices with mixed numeric / character data

If we add a character column to matrix c; everything becomes a character:

cbind(c, letters[1:5])

##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a" 
## [2,] "2"  "2"  "b" 
## [3,] "3"  "3"  "c" 
## [4,] "4"  "4"  "d" 
## [5,] "5"  "5"  "e"

Matrices with mixed numeric / character data

Alternatively,

cbind(c, c("a", "b", "c", "d", "e"))

##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a" 
## [2,] "2"  "2"  "b" 
## [3,] "3"  "3"  "c" 
## [4,] "4"  "4"  "d" 
## [5,] "5"  "5"  "e"

Remember, matrices and vectors are numerical OR character objects. They can never contain both and still be used for numerical calculations.

Data frames

d <- data.frame("V1" = rnorm(5),
                "V2" = rnorm(5, mean = 5, sd = 2), 
                "V3" = letters[1:5])
d

##           V1       V2 V3
## 1  2.1988103 4.047506  a
## 2  1.3124130 3.422794  b
## 3 -0.2651451 3.810765  c
## 4  0.5431941 8.301815  d
## 5 -0.4143399 4.891944  e

We ‘filled’ a data frame with two randomly generated sets from the normal distribution - where $V1$ is standard normal and $V2 \sim N(5, 2^2)$, i.e. mean 5 and standard deviation 2 - and a character set.
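
Because rnorm() draws random numbers, running the code again gives different values. A sketch of how the draws could be made reproducible with set.seed(); the seed 123 and the names `d1` and `d2` are arbitrary choices for illustration.

```r
set.seed(123)   # fix the random number generator
d1 <- data.frame(V1 = rnorm(5),
                 V2 = rnorm(5, mean = 5, sd = 2),
                 V3 = letters[1:5])

set.seed(123)   # same seed, so the same draws
d2 <- data.frame(V1 = rnorm(5),
                 V2 = rnorm(5, mean = 5, sd = 2),
                 V3 = letters[1:5])

identical(d1, d2)   # TRUE
```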

Data frames (continued)

Data frames can contain both numerical and character elements at the same time, although never in the same column.

You can name the columns and rows in data frames (just like in matrices)

row.names(d) <- c("row 1", "row 2", "row 3", "row 4", "row 5")
d

##               V1       V2 V3
## row 1  2.1988103 4.047506  a
## row 2  1.3124130 3.422794  b
## row 3 -0.2651451 3.810765  c
## row 4  0.5431941 8.301815  d
## row 5 -0.4143399 4.891944  e

Calling row elements in data frames

There are two ways to obtain row 3 from data frame d:

d["row 3", ]

##               V1       V2 V3
## row 3 -0.2651451 3.810765  c

and

d[3, ]

##               V1       V2 V3
## row 3 -0.2651451 3.810765  c

The intersection between row 2 and column 3 can be obtained by

d[2, 3]

## [1] "b"

Calling column elements in data frames

Both

d[, "V2"] # and

## [1] 4.047506 3.422794 3.810765 8.301815 4.891944

d[, 2]

## [1] 4.047506 3.422794 3.810765 8.301815 4.891944

yield the second column. But we can also use $ to call variable names in data frame objects

d$V2

## [1] 4.047506 3.422794 3.810765 8.301815 4.891944
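
Two more variants are worth knowing. In the sketch below, the small data frame with fixed values is made up for illustration: double brackets return the column as a vector, single brackets without a comma return a data frame with one column.

```r
d <- data.frame(V1 = c(0.5, -1.2, 0.3),
                V2 = c(4.1, 6.3, 5.0),
                V3 = letters[1:3])

col_vec <- d[["V2"]]   # a numeric vector, same as d$V2
col_df  <- d["V2"]     # a data frame with a single column

is.numeric(col_vec)     # TRUE
is.data.frame(col_df)   # TRUE
```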

Beyond two dimensions: a list

Lists are just what the name says they are: lists. You can have a list of everything mixed with everything. For example, a simple list can be created by

f <- list(a)
f

## [[1]]
## [1] 1 2 3 4 5

Elements or objects within lists can be called by using double square brackets [[]]. For example, the first (and only) element in list f is object a

f[[1]]

## [1] 1 2 3 4 5
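
Single square brackets also work on lists, but they return a (shorter) list instead of the element itself. A small sketch of the difference:

```r
a <- c(1, 2, 3, 4, 5)
f <- list(a)

class(f[1])     # "list": still a list, containing one element
class(f[[1]])   # "numeric": the vector itself
```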

Lists (continued)

We can simply add an object or element to an existing list

f[[2]] <- d
f

## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
##               V1       V2 V3
## row 1  2.1988103 4.047506  a
## row 2  1.3124130 3.422794  b
## row 3 -0.2651451 3.810765  c
## row 4  0.5431941 8.301815  d
## row 5 -0.4143399 4.891944  e

to obtain a list with a vector and a data frame.

Lists (continued)

We can add names to the list as follows

names(f) <- c("vector", "data frame")
f

## $vector
## [1] 1 2 3 4 5
## 
## $`data frame`
##               V1       V2 V3
## row 1  2.1988103 4.047506  a
## row 2  1.3124130 3.422794  b
## row 3 -0.2651451 3.810765  c
## row 4  0.5431941 8.301815  d
## row 5 -0.4143399 4.891944  e

Calling elements in lists

Calling the vector (a) from the list can be done as follows

f[[1]]

## [1] 1 2 3 4 5

f[["vector"]]

## [1] 1 2 3 4 5

f$vector

## [1] 1 2 3 4 5

Lists in lists

Take the following example

g <- list(f, f)

To call the vector from the second list within the list g, use the following code

g[[2]][[1]]

## [1] 1 2 3 4 5

g[[2]]$vector

## [1] 1 2 3 4 5

Logical operators

Logical operators are signs that evaluate a statement, such as ==, <, >, <=, >=, and | (OR) as well as & (AND). Typing ! before a logical statement negates it: TRUE becomes FALSE and vice versa. There are more operations, but these are the most useful.
For example, if we would like elements out of matrix c that are larger than 3, we would type:

c[c > 3]

## [1] 4 5 4 5

Why does a logical statement on a matrix return a vector?

c > 3

##       [,1]  [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE

The number of TRUE values may differ between columns, so the selected elements do not necessarily fit in a matrix. A vector as a return is therefore more appropriate.
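
If the positions of the selected elements are needed, which() can recover them. A sketch, assuming the same matrix c as on the slides (`pos` is an illustrative name):

```r
a <- c(1, 2, 3, 4, 5)
c <- matrix(a, nrow = 5, ncol = 2)

which(c > 3)                         # positions counted column by column: 4 5 9 10
pos <- which(c > 3, arr.ind = TRUE)  # the same positions as row/column pairs
pos
sum(c > 3)                           # number of elements larger than 3: 4
```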

Logical operators (cont’d)

If we would like the elements that are smaller than 3 OR larger than 3, we could type

c[c < 3 | c > 3] #c smaller than 3 or larger than 3

## [1] 1 2 4 5 1 2 4 5

c[c != 3] #c not equal to 3

## [1] 1 2 4 5 1 2 4 5
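
The operators & (AND) and ! (NOT) work in the same way. A sketch on the same matrix c; the names `between` and `not_large` are made up for illustration.

```r
a <- c(1, 2, 3, 4, 5)
c <- matrix(a, nrow = 5, ncol = 2)

between <- c[c > 1 & c < 5]   # larger than 1 AND smaller than 5
between                       # 2 3 4 2 3 4

not_large <- c[!(c > 3)]      # NOT larger than 3
not_large                     # 1 2 3 1 2 3
```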

Logical operators (cont’d)

In fact, c != 3 returns a matrix

##       [,1]  [,2]
## [1,]  TRUE  TRUE
## [2,]  TRUE  TRUE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE

Remember c?:

##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5

Things that cannot be done

Things that have no representation in real number space (at least not without tremendous effort)
- For example, the following code returns “Not a Number”

0 / 0

## [1] NaN

Also impossible are calculations based on missing values (NA’s)

mean(c(1, 2, NA, 4, 5))

## [1] NA
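
To find out where the problematic values are, is.na() and is.nan() can be used. A short sketch on a made-up vector x:

```r
x <- c(1, 2, NA, 4, 0 / 0)

is.na(x)        # FALSE FALSE  TRUE FALSE  TRUE  (NaN also counts as missing)
is.nan(x)       # FALSE FALSE FALSE FALSE  TRUE  (only the NaN)
sum(is.na(x))   # 2
```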

Standard solves for missing values

There are two easy ways to perform “listwise deletion”:

mean(c(1, 2, NA, 4, 5), na.rm = TRUE)

## [1] 3

mean(na.omit(c(1, 2, NA, 4, 5)))

## [1] 3
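
On a single vector both approaches give the same result. On a data frame they can differ: na.omit() removes every row with at least one missing value, while na.rm = TRUE only ignores the missing values in the variable at hand. A sketch with a made-up data frame `df`:

```r
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(10, NA, 30, 40))

complete <- na.omit(df)     # listwise deletion: only rows 1 and 4 remain
nrow(complete)              # 2

mean(complete$x)            # 2.5, based on the complete rows only
mean(df$x, na.rm = TRUE)    # 2.333333, based on all observed x values
```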

Floating point example

(3 - 2.9)

## [1] 0.1

(3 - 2.9) == 0.1

## [1] FALSE

Why does R tell us that 3 - 2.9 $\neq$ 0.1?

(3 - 2.9) - .1

## [1] 8.326673e-17
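
Numbers such as 2.9 and 0.1 cannot be represented exactly in binary, so a tiny rounding error remains. A sketch of two common ways to compare such numbers with a tolerance instead of ==; the tolerance 1e-8 is an arbitrary choice.

```r
# all.equal() compares with a small default tolerance
isTRUE(all.equal(3 - 2.9, 0.1))   # TRUE

# or check the absolute difference against a tolerance of your own choosing
abs((3 - 2.9) - 0.1) < 1e-8       # TRUE
```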

Some programming tips:

keep your code tidy
use comments (text preceded by #) to clarify what you are doing
- If you look at your code again, one month from now: you will not know what you did, unless you use comments
when working with functions, use the TAB key to quickly access the help for the function’s components
work with logically named R-scripts
- indicate the sequential nature of your work
work with RStudio projects
if allowed, place your project folders in some cloud-based environment

Functions

Functions have parentheses (). Names directly followed by parentheses always indicate functions. For example;

matrix() is a function
c() is a function
but (1 - 2) * 5 is a calculation, not a function
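
You can also write your own functions. A minimal sketch, where the function `add_one` is a made-up example:

```r
# A function that adds 1 to every element of its input
add_one <- function(x) {
  x + 1
}

add_one(1:3)           # 2 3 4
is.function(add_one)   # TRUE
is.function(matrix)    # TRUE
```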

Packages

Packages give additional functionality to R.

By default, some packages are included. These packages allow you to do mainstream statistical analyses and data manipulation. Installing additional packages allows you to perform the state of the art in statistical programming and estimation.

The cool thing is that these packages are all developed by users. The throughput process is therefore very timely:

newly developed functions and software are readily available
this is different from other mainstream software, like SPSS, where new methodology may take years to be implemented.

A list of available packages can be found on CRAN

Loading packages

There are two ways to load a package in R

library(stats)

and

require(stats)
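
The two differ in what happens when a package is not installed: library() stops with an error, whereas require() returns FALSE and gives a warning. The sketch below illustrates this; the package name `notARealPackage123` is made up and assumed not to exist on your machine.

```r
# require() returns TRUE (invisibly) when the package could be loaded
loaded <- require(stats)
loaded   # TRUE

# For a package that is not installed, require() returns FALSE ...
ok <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
ok       # FALSE

# ... while library() throws an error, which we catch here
res <- tryCatch(
  library("notARealPackage123", character.only = TRUE),
  error = function(e) "error"
)
res      # "error"
```

In scripts, library() is usually the safer choice: the script stops immediately if a package is missing.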

Installing packages

The easiest way to install e.g. package mice is to use

install.packages("mice")

Alternatively, you can also do it in RStudio through

Tools --> Install Packages

`R` in depth

Workspaces and why you should sometimes save them

A workspace contains all objects you created or changed during your R session.

A saved workspace contains everything at the time of the state wherein it was saved.

You do not need to run all the previous code again if you would like to continue working at a later time.

You can save the workspace and continue exactly where you left off.

Workspaces are compressed and require relatively little disk space when stored. The compression is very efficient, and loading a saved workspace is often faster than reading large datasets again.
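
A minimal sketch of saving and restoring objects. It writes to a temporary file (the name `path` is illustrative) so that nothing in your project is touched; save.image() works in the same way, but stores the whole workspace instead of selected objects.

```r
x <- 1:10

path <- tempfile(fileext = ".RData")   # a temporary file, for illustration
save(x, file = path)                   # store object x in compressed form

rm(x)                                  # remove x from the workspace
load(path)                             # ... and restore it from the file
x                                      # 1 2 3 4 5 6 7 8 9 10

unlink(path)                           # clean up the temporary file
```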

History and why it is useful

R by default saves (part of) the code history and RStudio expands this functionality greatly.

Most often it may be useful to look back at the code history for various reasons.

There are multiple ways to access the code history.
1. Use arrow up in the console. This allows you to go back in time, one line of code at a time. Extremely useful to go back to previous lines for minor alterations to the code.
2. Use the history tab in the environment pane. The complete project history can be found here and the history can be searched. This is particularly convenient when you know what code you are looking for.

Working in projects in `RStudio`

Every project has its own history
Every research project has its own project
Every project can have its own folder, which also serves as a research archive
Every project can have its own version control system
RStudio projects can relate to Git (or other online) repositories

In general…

Use common sense and BE CONSISTENT.
Browse through the tidyverse style guide
- The point of having style guidelines is to have a common vocabulary of coding
- so people can concentrate on what you are saying, rather than on how you are saying it.
If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers and collaborators out of their rhythm when they go to read it. Try to avoid this.
Intentional spacing makes your code easier to interpret
- a<-c(1,2,3,4,5) vs;
- a <- c(1, 2, 3, 4, 5)
at least put a space after every comma!


To continue

What to do?

See you next week