Fundamental Techniques in Data Science with R

Today

This lecture

  • Issues from last week

  • Data manipulation

  • Basic analysis (correlation & t-test)

  • Pipes

  • Deviations and modeling

Issues from last week

  • View()
  • rmarkdown
  • Navigating RStudio
  • R/RStudio/CRAN ecosystem
  • console vs R-script vs rmarkdown

New packages we use

library(MASS)     # for the cats data
library(dplyr)    # data manipulation
library(haven)    # in/exporting data
library(magrittr) # pipes
library(mice)     # for the nhanes data

New functions

  • transform(): changing and adding columns
  • dplyr::filter(): row-wise selection (of cases)
  • table(): frequency tables
  • class(): object class
  • levels(): levels of a factor
  • order(): the permutation of indices that sorts data into increasing order
  • haven::read_sav(): import SPSS data
  • cor(): bivariate correlation
  • sample(): drawing a sample
  • t.test(): t-test
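
Of these, sample() is the only function not demonstrated later in this lecture; a minimal sketch of its use:

set.seed(123)          # fix the seed so the draw is reproducible
sample(1:10, size = 3) # draw 3 values from 1:10, without replacement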

Data manipulation

The cats data

head(cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6
str(cats)
## 'data.frame':    144 obs. of  3 variables:
##  $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

How to get only female cats?

fem.cats <- cats[cats$Sex == "F", ]
dim(fem.cats)
## [1] 47  3
head(fem.cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6

How to get only heavy cats?

heavy.cats <- cats[cats$Bwt > 3, ]
dim(heavy.cats)
## [1] 36  3
head(heavy.cats)
##     Sex Bwt  Hwt
## 109   M 3.1  9.9
## 110   M 3.1 11.5
## 111   M 3.1 12.1
## 112   M 3.1 12.5
## 113   M 3.1 13.0
## 114   M 3.1 14.3

How to get only heavy cats?

heavy.cats <- subset(cats, Bwt > 3)
dim(heavy.cats)
## [1] 36  3
head(heavy.cats)
##     Sex Bwt  Hwt
## 109   M 3.1  9.9
## 110   M 3.1 11.5
## 111   M 3.1 12.1
## 112   M 3.1 12.5
## 113   M 3.1 13.0
## 114   M 3.1 14.3

Another way: dplyr

filter(cats, Bwt > 2, Bwt < 2.2, Sex == "F")
##   Sex Bwt Hwt
## 1   F 2.1 7.2
## 2   F 2.1 7.3
## 3   F 2.1 7.6
## 4   F 2.1 8.1
## 5   F 2.1 8.2
## 6   F 2.1 8.3
## 7   F 2.1 8.5
## 8   F 2.1 8.7
## 9   F 2.1 9.8

Working with factors

class(cats$Sex)
## [1] "factor"
levels(cats$Sex)
## [1] "F" "M"

Working with factors

levels(cats$Sex) <- c("Female", "Male") # new labels must follow the existing order of levels(): "F", "M"
table(cats$Sex)
## 
## Female   Male 
##     47     97
head(cats)
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.1 7.2
## 5 Female 2.1 7.3
## 6 Female 2.1 7.6

Sorting

sorted.cats <- cats[order(cats$Bwt), ]
head(sorted.cats)
##       Sex Bwt Hwt
## 1  Female 2.0 7.0
## 2  Female 2.0 7.4
## 3  Female 2.0 9.5
## 48   Male 2.0 6.5
## 49   Male 2.0 6.5
## 4  Female 2.1 7.2
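
order() returns the row indices that sort the data, not the sorted values themselves; a minimal variation for a descending sort:

head(cats[order(cats$Bwt, decreasing = TRUE), ]) # heaviest cats first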

Combining matrices or dataframes

cats.numbers <- cbind(Weight = cats$Bwt, HeartWeight = cats$Hwt)
head(cats.numbers)
##      Weight HeartWeight
## [1,]    2.0         7.0
## [2,]    2.0         7.4
## [3,]    2.0         9.5
## [4,]    2.1         7.2
## [5,]    2.1         7.3
## [6,]    2.1         7.6

Combining matrices or dataframes

rbind(cats[1:3, ], cats[1:5, ])
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.0 7.0
## 5 Female 2.0 7.4
## 6 Female 2.0 9.5
## 7 Female 2.1 7.2
## 8 Female 2.1 7.3

Basic analysis

Correlation

cor(cats[, -1])
##           Bwt       Hwt
## Bwt 1.0000000 0.8041274
## Hwt 0.8041274 1.0000000

With [, -1] we exclude the first column (Sex), so cor() receives only the numeric variables
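
Equivalently, the numeric columns can be selected by name, a small variation on the code above:

cor(cats[, c("Bwt", "Hwt")])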

Correlation

cor.test(cats$Bwt, cats$Hwt)
## 
##  Pearson's product-moment correlation
## 
## data:  cats$Bwt and cats$Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375682 0.8552122
## sample estimates:
##       cor 
## 0.8041274

What do we conclude?

Correlation

plot(cats$Bwt, cats$Hwt)

T-test

Test the null hypothesis that the difference in mean heart weight between male and female cats is 0

t.test(formula = Hwt ~ Sex, data = cats)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group Female   mean in group Male 
##             9.202128            11.322680

T-test

plot(formula = Hwt ~ Sex, data = cats)

Pipes

This is a pipe:

boys <- 
  read_sav("boys.sav") %>%
  head()

It effectively replaces head(read_sav("boys.sav")).

Why are pipes useful?

Let’s assume that we want to load data, change a variable, filter cases, and select columns. Without a pipe, this would look like:

boys  <- read_sav("boys.sav")
boys2 <- transform(boys, hgt = hgt / 100)
boys3 <- filter(boys2, age > 15)
boys4 <- subset(boys3, select = c(hgt, wgt, bmi))

With the pipe:

boys <-
  read_sav("boys.sav") %>%
  transform(hgt = hgt/100) %>%
  filter(age > 15) %>%
  subset(select = c(hgt, wgt, bmi))

Benefit: only a single object is kept in memory, and the chain of steps is easy to interpret

With pipes

Your code becomes more readable:

  • data operations read from left to right rather than inside-out
  • nested function calls are avoided
  • local variables and copied objects are avoided
  • easy to add steps in the sequence

What do pipes do:

  • f(x) becomes x %>% f()
rnorm(10) %>% mean()
## [1] 0.2229627
  • f(x, y) becomes x %>% f(y)
boys %>% cor(use = "pairwise.complete.obs")
##           hgt       wgt       bmi
## hgt 1.0000000 0.6100784 0.1758781
## wgt 0.6100784 1.0000000 0.8841304
## bmi 0.1758781 0.8841304 1.0000000
  • h(g(f(x))) becomes x %>% f() %>% g() %>% h()
boys %>% subset(select = wgt) %>% na.omit() %>% max()
## [1] 117.4

Example outlier filtering

nrow(cats)
## [1] 144
cats.outl <- 
  cats %>% 
  filter(Hwt < mean(Hwt) + 3 * sd(Hwt), 
         Hwt > mean(Hwt) - 3 * sd(Hwt))

nrow(cats.outl)
## [1] 143
cats %>%  
  filter(Hwt > mean(Hwt) + 3 * sd(Hwt))
##    Sex Bwt  Hwt
## 1 Male 3.9 20.5
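
Both bounds can be expressed in a single condition with abs(), an equivalent alternative not shown on the original slides:

cats %>% 
  filter(abs(Hwt - mean(Hwt)) < 3 * sd(Hwt))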

Disclaimer: I don’t like outlier filtering!

More pipe stuff

The standard %>% pipe

[figure: schematic of how the %>% pipe forwards its left-hand side into the next function call]

Key combo: Ctrl+Shift+M (Windows/Linux) or Cmd+Shift+M (Mac) inserts a %>% pipe in RStudio

The %$% pipe

[figure: schematic of the %$% pipe, which exposes the columns of its left-hand side to the next call]

The role of . in a pipe

In a %>% b(arg1, arg2, arg3), a becomes arg1 by default. With the placeholder . we can change where a is inserted.

cats %>%
  plot(Hwt ~ Bwt, data = .)

versus

cats %$%
  plot(Hwt ~ Bwt)

The . can be used as a placeholder in the pipe.
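
The placeholder is especially handy for functions whose data argument is not the first one; a minimal sketch with lm(), previewing next week's topic:

cats %>% 
  lm(Hwt ~ Bwt, data = .) # because . appears in the call, cats is piped only into data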

Performing a t-test in a pipe

cats %$%
  t.test(Hwt ~ Sex)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group Female   mean in group Male 
##             9.202128            11.322680

is the same as

t.test(Hwt ~ Sex, data = cats)

Storing a t-test from a pipe

cats.test <- cats %$%
  t.test(Bwt ~ Sex)

cats.test
## 
##  Welch Two Sample t-test
## 
## data:  Bwt by Sex
## t = -8.7095, df = 136.84, p-value = 8.831e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6631268 -0.4177242
## sample estimates:
## mean in group Female   mean in group Male 
##             2.359574             2.900000
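
Because t.test() returns an object of class htest (a named list), the stored result can be queried afterwards; these element names are documented in ?t.test:

cats.test$statistic # the t-value
cats.test$p.value   # the p-value
cats.test$conf.int  # the 95 percent confidence interval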

Squared deviations

We have already met deviations today

  • correlations

\[\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X\sigma_Y} = \frac{\mathrm{E}[(X - \mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y}.\]

  • t-test

\[t = \frac{\bar{X}-\mu}{\hat{\sigma}/\sqrt{n}}.\]

  • variance

\[\sigma^2_X = \mathrm{E}[(X - \mu)^2].\]

Can you identify which parts of these equations are the deviations?

Deviations

Deviations give the distance from each value (observation) to a comparison value.

  • often, the mean is chosen as a comparison value.

Why the mean?

The arithmetic mean is a very informative measure:

  • it is the average
  • it is the mathematical expectation
  • it is the central value of a set of discrete numbers

\[\text{The mean itself is a model: observations are merely a deviation from that model.}\]

The mean as a center

Deviations from the mean

Plotting the deviations

Use of deviations

Deviations summarize the fit of all the points in the data to a single point

The mean is the mathematical expectation. It represents the observed values best for a normally distributed univariate set.

  • deviations from the mean sum to exactly zero, whereas any other comparison value produces a nonzero sum, as the code below demonstrates
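
plotdata is created in a chunk not shown on these slides; a minimal sketch of a comparable object, assuming X holds 100 draws (the -300 in the output below implies n = 100):

set.seed(123)                          # hypothetical seed
plotdata <- data.frame(X = rnorm(100)) # 100 draws from a standard normal
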
plotdata %>%
  mutate("Mean" = X - mean(X), 
         "Mean + 3" = X - (mean(X) + 3)) %>%
  select("Mean", "Mean + 3") %>%
  colSums() %>%
  round(digits = 3)
##     Mean Mean + 3 
##        0     -300

The mean minimizes the deviations

What happens

Plotting the standardized deviations

Plotting the squared deviations

Why squared deviations are useful

Throughout statistics we make extensive use of squaring.

\[\text{What are the useful properties of squaring that statisticians are so fond of?}\]
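
A minimal numeric sketch hinting at the answer, not an exhaustive list of properties:

x <- c(-2, -1, 0, 1, 2) # deviations around a central value of 0
sum(x)                  # raw deviations cancel out: 0
sum(x^2)                # squared deviations cannot cancel: 10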

Deviations from the mean

Least squares solution
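
The identity these slides build towards, stated here because the plots are not reproduced (a standard result): the mean is the value that minimizes the sum of squared deviations,

\[\bar{x} = \underset{c}{\arg\min} \sum_{i=1}^{n}(x_i - c)^2.\]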

What’s next

In this week’s practical exercises we will calculate and explore squared deviations, and you’ll learn how to minimize them.

Next week we’ll jump to regression. We’ll see how deviations still play a role in that framework:

  • We minimize deviations to infer a linear model
  • If our model is not perfect, observations will still deviate from the model