Fundamental Techniques in Data Science with R

Today

This lecture

  • Issues from last week

  • Data manipulation

  • Basic analysis (correlation & t-test)

  • Pipes

  • Deviations and modeling

Issues from last week

  • View()
  • rmarkdown
  • Navigating RStudio
  • R/RStudio/CRAN ecosystem
  • console vs R-script vs rmarkdown

New packages we use

library(MASS)     # for the cats data
library(dplyr)    # data manipulation
library(haven)    # in/exporting data
library(magrittr) # pipes
library(mice)     # for the nhanes data

New functions

  • transform(): changing and adding columns
  • dplyr::filter(): row-wise selection (of cases)
  • table(): frequency tables
  • class(): object class
  • levels(): levels of a factor
  • order(): the permutation of indices that sorts data into increasing order
  • haven::read_sav(): import SPSS data
  • cor(): bivariate correlation
  • sample(): drawing a sample
  • t.test(): t-test
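
Of these, sample() is the only function not demonstrated later in this lecture; a minimal sketch of its use:

set.seed(123)          # fix the seed so the draw is reproducible
sample(1:10, size = 3) # draw 3 values from 1:10, without replacement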

Data manipulation

The cats data

head(cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6
str(cats)
## 'data.frame':    144 obs. of  3 variables:
##  $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

How to get only female cats?

fem.cats <- cats[cats$Sex == "F", ]
dim(fem.cats)
## [1] 47  3
head(fem.cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6

How to get only heavy cats?

heavy.cats <- cats[cats$Bwt > 3, ]
dim(heavy.cats)
## [1] 36  3
head(heavy.cats)
##     Sex Bwt  Hwt
## 109   M 3.1  9.9
## 110   M 3.1 11.5
## 111   M 3.1 12.1
## 112   M 3.1 12.5
## 113   M 3.1 13.0
## 114   M 3.1 14.3

How to get only heavy cats?

heavy.cats <- subset(cats, Bwt > 3)
dim(heavy.cats)
## [1] 36  3
head(heavy.cats)
##     Sex Bwt  Hwt
## 109   M 3.1  9.9
## 110   M 3.1 11.5
## 111   M 3.1 12.1
## 112   M 3.1 12.5
## 113   M 3.1 13.0
## 114   M 3.1 14.3

Another way: dplyr

filter(cats, Bwt > 2, Bwt < 2.2, Sex == "F")
##   Sex Bwt Hwt
## 1   F 2.1 7.2
## 2   F 2.1 7.3
## 3   F 2.1 7.6
## 4   F 2.1 8.1
## 5   F 2.1 8.2
## 6   F 2.1 8.3
## 7   F 2.1 8.5
## 8   F 2.1 8.7
## 9   F 2.1 9.8

Working with factors

class(cats$Sex)
## [1] "factor"
levels(cats$Sex)
## [1] "F" "M"

Working with factors

levels(cats$Sex) <- c("Female", "Male") # new labels must follow the existing order of levels(): "F", "M"
table(cats$Sex)
## 
## Female   Male 
##     47     97
head(cats)
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.1 7.2
## 5 Female 2.1 7.3
## 6 Female 2.1 7.6

Sorting

sorted.cats <- cats[order(cats$Bwt), ]
head(sorted.cats)
##       Sex Bwt Hwt
## 1  Female 2.0 7.0
## 2  Female 2.0 7.4
## 3  Female 2.0 9.5
## 48   Male 2.0 6.5
## 49   Male 2.0 6.5
## 4  Female 2.1 7.2
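
order() returns the row indices that sort the data, not the sorted values themselves; a minimal variation for a descending sort:

head(cats[order(cats$Bwt, decreasing = TRUE), ]) # heaviest cats first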

Combining matrices or dataframes

cats.numbers <- cbind(Weight = cats$Bwt, HeartWeight = cats$Hwt)
head(cats.numbers)
##      Weight HeartWeight
## [1,]    2.0         7.0
## [2,]    2.0         7.4
## [3,]    2.0         9.5
## [4,]    2.1         7.2
## [5,]    2.1         7.3
## [6,]    2.1         7.6

Combining matrices or dataframes

rbind(cats[1:3, ], cats[1:5, ])
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.0 7.0
## 5 Female 2.0 7.4
## 6 Female 2.0 9.5
## 7 Female 2.1 7.2
## 8 Female 2.1 7.3

Basic analysis

Correlation

cor(cats[, -1])
##           Bwt       Hwt
## Bwt 1.0000000 0.8041274
## Hwt 0.8041274 1.0000000

With [, -1] we exclude the first column (Sex), so cor() receives only the numeric variables
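
Equivalently, the numeric columns can be selected by name, a small variation on the code above:

cor(cats[, c("Bwt", "Hwt")])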

Correlation

cor.test(cats$Bwt, cats$Hwt)
## 
##  Pearson's product-moment correlation
## 
## data:  cats$Bwt and cats$Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375682 0.8552122
## sample estimates:
##       cor 
## 0.8041274

What do we conclude?

Correlation

plot(cats$Bwt, cats$Hwt)

T-test

Test the null hypothesis that the difference in mean heart weight between male and female cats is 0

t.test(formula = Hwt ~ Sex, data = cats)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group Female   mean in group Male 
##             9.202128            11.322680

T-test

plot(formula = Hwt ~ Sex, data = cats)

Pipes

This is a pipe:

boys <- 
  read_sav("boys.sav") %>%
  head()

It effectively replaces head(read_sav("boys.sav")).

Why are pipes useful?

Let’s assume that we want to load data, change a variable, filter cases, and select columns. Without a pipe, this would look like:

boys  <- read_sav("boys.sav")
boys2 <- transform(boys, hgt = hgt / 100)
boys3 <- filter(boys2, age > 15)
boys4 <- subset(boys3, select = c(hgt, wgt, bmi))

With the pipe:

boys <-
  read_sav("boys.sav") %>%
  transform(hgt = hgt/100) %>%
  filter(age > 15) %>%
  subset(select = c(hgt, wgt, bmi))

Benefit: only a single object is kept in memory, and the chain of steps is easy to interpret

With pipes

Your code becomes more readable:

  • data operations read from left to right rather than inside-out
  • nested function calls are avoided
  • local variables and copied objects are avoided
  • easy to add steps in the sequence

What do pipes do:

  • f(x) becomes x %>% f()
rnorm(10) %>% mean()
## [1] 0.2229627
  • f(x, y) becomes x %>% f(y)
boys %>% cor(use = "pairwise.complete.obs")
##           hgt       wgt       bmi
## hgt 1.0000000 0.6100784 0.1758781
## wgt 0.6100784 1.0000000 0.8841304
## bmi 0.1758781 0.8841304 1.0000000
  • h(g(f(x))) becomes x %>% f() %>% g() %>% h()
boys %>% subset(select = wgt) %>% na.omit() %>% max()
## [1] 117.4

Example outlier filtering

nrow(cats)
## [1] 144
cats.outl <- 
  cats %>% 
  filter(Hwt < mean(Hwt) + 3 * sd(Hwt), 
         Hwt > mean(Hwt) - 3 * sd(Hwt))

nrow(cats.outl)
## [1] 143
cats %>%  
  filter(Hwt > mean(Hwt) + 3 * sd(Hwt))
##    Sex Bwt  Hwt
## 1 Male 3.9 20.5
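
Both bounds can be expressed in a single condition with abs(), an equivalent alternative not shown on the original slides:

cats %>% 
  filter(abs(Hwt - mean(Hwt)) < 3 * sd(Hwt))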

Disclaimer: I don’t like outlier filtering!

More pipe stuff

The standard %>% pipe

[figure: schematic of how the %>% pipe forwards its left-hand side into the next function call]

Key combo: Ctrl+Shift+M (Windows/Linux) or Cmd+Shift+M (Mac) inserts a %>% pipe in RStudio

The %$% pipe

[figure: schematic of the %$% pipe, which exposes the columns of its left-hand side to the next call]

The role of . in a pipe

In a %>% b(arg1, arg2, arg3), a becomes arg1 by default. With the placeholder . we can change where a is inserted.

cats %>%
  plot(Hwt ~ Bwt, data = .)

versus

cats %$%
  plot(Hwt ~ Bwt)

The . can be used as a placeholder in the pipe.
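
The placeholder is especially handy for functions whose data argument is not the first one; a minimal sketch with lm(), previewing next week's topic:

cats %>% 
  lm(Hwt ~ Bwt, data = .) # because . appears in the call, cats is piped only into data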

Performing a t-test in a pipe

cats %$%
  t.test(Hwt ~ Sex)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group Female   mean in group Male 
##             9.202128            11.322680

is the same as

t.test(Hwt ~ Sex, data = cats)

Storing a t-test from a pipe

cats.test <- cats %$%
  t.test(Bwt ~ Sex)

cats.test
## 
##  Welch Two Sample t-test
## 
## data:  Bwt by Sex
## t = -8.7095, df = 136.84, p-value = 8.831e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6631268 -0.4177242
## sample estimates:
## mean in group Female   mean in group Male 
##             2.359574             2.900000
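
Because t.test() returns an object of class htest (a named list), the stored result can be queried afterwards; these element names are documented in ?t.test:

cats.test$statistic # the t-value
cats.test$p.value   # the p-value
cats.test$conf.int  # the 95 percent confidence interval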

Squared deviations

We have already met deviations today

  • correlations

\[\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X\sigma_Y} = \frac{\mathrm{E}[(X - \mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y}.\]

  • t-test

\[t = \frac{\bar{X}-\mu}{\hat{\sigma}/\sqrt{n}}.\]

  • variance

\[\sigma^2_X = \mathrm{E}[(X - \mu)^2].\]

Can you identify which parts of these equations are the deviations?

Deviations

Deviations give the distance from each value (observation) to a comparison value.

  • often, the mean is chosen as a comparison value.

Why the mean?

The arithmetic mean is a very informative measure:

  • it is the average
  • it is the mathematical expectation
  • it is the central value of a set of discrete numbers

\[\text{The mean itself is a model: observations are merely a deviation from that model.}\]

The mean as a center

Deviations from the mean

Plotting the deviations

Use of deviations

Deviations summarize the fit of all the points in the data to a single point

The mean is the mathematical expectation. It represents the observed values best for a normally distributed univariate set.

  • deviations from the mean sum to exactly zero, whereas any other comparison value produces a nonzero sum, as the code below demonstrates
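
plotdata is created in a chunk not shown on these slides; a minimal sketch of a comparable object, assuming X holds 100 draws (the -300 in the output below implies n = 100):

set.seed(123)                          # hypothetical seed
plotdata <- data.frame(X = rnorm(100)) # 100 draws from a standard normal
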
plotdata %>%
  mutate("Mean" = X - mean(X), 
         "Mean + 3" = X - (mean(X) + 3)) %>%
  select("Mean", "Mean + 3") %>%
  colSums() %>%
  round(digits = 3)
##     Mean Mean + 3 
##        0     -300

The mean minimizes the deviations

What happens

Plotting the standardized deviations

Plotting the squared deviations

Why squared deviations are useful

Throughout statistics we make extensive use of squaring.

\[\text{What are the useful properties of squaring that statisticians are so fond of?}\]
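
A minimal numeric sketch hinting at the answer, not an exhaustive list of properties:

x <- c(-2, -1, 0, 1, 2) # deviations around a central value of 0
sum(x)                  # raw deviations cancel out: 0
sum(x^2)                # squared deviations cannot cancel: 10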

Deviations from the mean

Least squares solution
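
The identity these slides build towards, stated here because the plots are not reproduced (a standard result): the mean is the value that minimizes the sum of squared deviations,

\[\bar{x} = \underset{c}{\arg\min} \sum_{i=1}^{n}(x_i - c)^2.\]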

What’s next

In this week’s practical exercises we will calculate and explore squared deviations, and you’ll learn how to minimize them.

Next week we’ll jump to regression. We’ll see how deviations still play a role in that framework:

  • We minimize deviations to infer a linear model
  • If our model is not perfect, observations will still deviate from the model