Fundamental Techniques in Data Science with R

Last week

  • pipes with package magrittr
    • %>%
    • %$%
  • deviations
    • from the mean
    • standardized
    • squared

Today

  • the linear model
  • assumptions associated with the linear model

We use the following packages

library(dplyr)
library(magrittr)
library(ggplot2)
library(mice)

Statistical models

Statistical model

The mathematical formulation of the relationship between variables can be written as

\[ \mbox{observed}=\mbox{predicted}+\mbox{error} \]

or, in Greek-letter notation, as \[y=\mu+\varepsilon\]

where

  • \(\mu\) (mean) is the part of the outcome that is explained by the model
  • \(\varepsilon\) (residual) is the part of the outcome that is not explained by the model
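
A tiny numeric sketch of this decomposition, taking the mean as the predicted part:

y   <- c(2, 4, 6, 8)   # observed
mu  <- mean(y)         # predicted: here simply the overall mean
eps <- y - mu          # error / residual
all.equal(y, mu + eps) # TRUE: observed = predicted + error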

A simple example

Regression model:

  • Model individual age from weight

\[ \text{age}_i=\alpha+\beta\cdot{\text{weight}}_i+\varepsilon_i \]

where

  • \(i\) indicates the individual in \(i = 1, \dots, n\)
  • \(n\) is the sample size
  • \(\alpha+\beta\cdot{\text{weight}}_i\) is the mean of age, conditional on weight
  • \(\varepsilon_i\) is random variation

The linear model

The function lm() is a base R function that allows you to fit a wide variety of linear models.

args(lm)
## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## NULL

If we want to know what these arguments do we can ask R:

?lm

This will open a help page on the lm() function.

The boys data from mice

head(boys, n = 20)
##       age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3   0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4   0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18  0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23  0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28  0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36  0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
## 37  0.068 52.5 3.810 13.82 34.9 <NA> <NA> NA south
## 38  0.071 53.0 3.890 13.84 35.8 <NA> <NA> NA  west
## 39  0.071 55.1 3.880 12.77 36.8 <NA> <NA> NA  west
## 43  0.073 54.5 4.200 14.14 38.0 <NA> <NA> NA  east
## 53  0.076 58.5 5.920 17.29 40.5 <NA> <NA> NA  west
## 60  0.079 55.0 4.430 14.64 38.0 <NA> <NA> NA  east
## 62  0.079 58.5 5.745 16.78 38.5 <NA> <NA> NA  city
## 75  0.082 59.5 5.100 14.40 39.1 <NA> <NA> NA  west
## 85  0.084 52.5 4.120 14.94 37.3 <NA> <NA> NA  <NA>
## 93  0.084 54.0 4.500 15.43 37.0 <NA> <NA> NA south
## 99  0.087 59.5 5.130 14.49 38.0 <NA> <NA> NA  west
## 103 0.087   NA 4.540    NA 39.0 <NA> <NA> NA  west
## 113 0.090 55.7 4.070 13.11 36.6 <NA> <NA> NA  east
## 127 0.093 56.0 5.410 17.25 40.0 <NA> <NA> NA north

Continuous predictors

To obtain a linear model with a main effect for wgt, we formulate \(age\sim wgt\)

boys %$% 
  lm(age ~ wgt)
## 
## Call:
## lm(formula = age ~ wgt)
## 
## Coefficients:
## (Intercept)          wgt  
##     -0.1924       0.2517
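
Using these estimates, a minimal prediction sketch (the object name fit_wgt is ours):

fit_wgt <- boys %$% lm(age ~ wgt)
predict(fit_wgt, newdata = data.frame(wgt = 40))
# roughly -0.19 + 0.252 * 40, i.e. about 9.9 years for a boy of 40 kg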

Continuous predictors: more detail

boys %$% lm(age ~ wgt) %>% summary()
## 
## Call:
## lm(formula = age ~ wgt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0907  -1.1686  -0.5342   1.4286   5.5143 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.192377   0.136878  -1.405     0.16    
## wgt          0.251673   0.003018  83.395   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.142 on 742 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9036, Adjusted R-squared:  0.9035 
## F-statistic:  6955 on 1 and 742 DF,  p-value: < 2.2e-16

Continuous predictors

To obtain a linear model with just main effects for wgt and hgt, we formulate \(age\sim wgt + hgt\)

boys %$% 
  lm(age ~ wgt + hgt)
## 
## Call:
## lm(formula = age ~ wgt + hgt)
## 
## Coefficients:
## (Intercept)          wgt          hgt  
##    -7.45952      0.07138      0.10660

Continuous predictors: more detail

boys %$% 
  lm(age ~ wgt + hgt) %>%
  summary
## 
## Call:
## lm(formula = age ~ wgt + hgt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1627 -0.8836 -0.1656  0.8765  4.4365 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.459525   0.246782  -30.23   <2e-16 ***
## wgt          0.071376   0.005949   12.00   <2e-16 ***
## hgt          0.106602   0.003336   31.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.391 on 724 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.9592, Adjusted R-squared:  0.9591 
## F-statistic:  8510 on 2 and 724 DF,  p-value: < 2.2e-16

Continuous predictors: interaction effects

To predict \(age\) from \(wgt\), \(hgt\) and the interaction \(wgt*hgt\) we formulate \(age\sim wgt * hgt\)

boys %$% lm(age ~ wgt * hgt) %>% summary
## 
## Call:
## lm(formula = age ~ wgt * hgt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2067 -0.8603 -0.1419  0.8452  4.7526 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.2833480  0.2510218 -29.015  < 2e-16 ***
## wgt         -0.0196161  0.0284921  -0.688  0.49137    
## hgt          0.1120325  0.0037076  30.217  < 2e-16 ***
## wgt:hgt      0.0004142  0.0001269   3.265  0.00115 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.382 on 723 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.9598, Adjusted R-squared:  0.9596 
## F-statistic:  5752 on 3 and 723 DF,  p-value: < 2.2e-16
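
In R's formula syntax, wgt * hgt is shorthand for both main effects plus their interaction. A sketch of the equivalent, fully spelled-out formula:

boys %$% 
  lm(age ~ wgt + hgt + wgt:hgt) %>% 
  coef() # same estimates as age ~ wgt * hgt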

Categorical predictors in the linear model

If a categorical variable is entered into the function lm(), R automatically converts it to a set of dummy variables. The first level is always taken as the reference category. If we want another reference category, we can use the function relevel() to change this.
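
For example, a minimal sketch (not evaluated here) of taking west as the reference category before fitting:

boys %>% 
  mutate(reg = relevel(reg, ref = "west")) %$% 
  lm(age ~ reg)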

fit <- boys %$% 
  lm(age ~ reg)
fit
## 
## Call:
## lm(formula = age ~ reg)
## 
## Coefficients:
## (Intercept)      regeast      regwest     regsouth      regcity  
##      11.898       -2.657       -2.901       -3.307       -3.627
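
With north as the reference category, the intercept (11.898) is the mean age of boys in the north in this sample; each other coefficient is the difference in mean age relative to the north (e.g. boys from the east are on average about 2.66 years younger).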

What are dummies?

fit %>% model.matrix() %>% tail(n = 20)
##     (Intercept) regeast regwest regsouth regcity
## 729           1       1       0        0       0
## 730           1       0       1        0       0
## 731           1       0       0        1       0
## 732           1       0       1        0       0
## 733           1       0       0        0       0
## 734           1       1       0        0       0
## 735           1       0       0        0       1
## 736           1       0       1        0       0
## 737           1       1       0        0       0
## 738           1       0       0        0       1
## 739           1       0       0        0       0
## 740           1       1       0        0       0
## 741           1       0       0        1       0
## 742           1       0       0        0       0
## 743           1       0       1        0       0
## 744           1       0       0        0       0
## 745           1       0       1        0       0
## 746           1       0       1        0       0
## 747           1       0       0        0       0
## 748           1       1       0        0       0

and again with more detail

fit %>% summary
## 
## Call:
## lm(formula = age ~ reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.805  -7.376   1.339   5.955  11.935 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.8984     0.7595  15.666  < 2e-16 ***
## regeast      -2.6568     0.9312  -2.853 0.004449 ** 
## regwest      -2.9008     0.8788  -3.301 0.001011 ** 
## regsouth     -3.3074     0.9064  -3.649 0.000282 ***
## regcity      -3.6266     1.1031  -3.288 0.001058 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.836 on 740 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.02077,    Adjusted R-squared:  0.01548 
## F-statistic: 3.925 on 4 and 740 DF,  p-value: 0.003678

Components of the linear model

fit %>% names() # the names of the list with output objects
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "contrasts"     "xlevels"       "call"         
## [13] "terms"         "model"
fit$coef  # show the estimated coefficients
## (Intercept)     regeast     regwest    regsouth     regcity 
##   11.898420   -2.656786   -2.900792   -3.307388   -3.626625
fit %>% coef # alternative
## (Intercept)     regeast     regwest    regsouth     regcity 
##   11.898420   -2.656786   -2.900792   -3.307388   -3.626625
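
Other components can be pulled out in the same way; a minimal sketch using the extractor functions residuals() and fitted():

fit %>% residuals() %>% head() # first few residuals
fit %>% fitted() %>% head()    # first few fitted (predicted) values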

Fitting a line to data

Linear regression

Linear regression model \[ y_i=\alpha+\beta{x}_i+\varepsilon_i \]

Assumptions:

  • \(y_i\) conditionally normal with mean \(\mu_i=\alpha+\beta{x}_i\)
  • \(\varepsilon_i\) are \(i.i.d.\) with mean 0 and (constant) variance \(\sigma^2\)
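
A small simulation sketch of data that satisfy these assumptions (the names and parameter values below are ours):

set.seed(123)
x <- runif(100, 0, 10)              # predictor
y <- 2 + 0.5 * x + rnorm(100, 0, 1) # y_i = alpha + beta * x_i + e_i
lm(y ~ x) %>% coef()                # estimates close to alpha = 2 and beta = 0.5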

The anscombe data

anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Fitting a line

anscombe %>%
  ggplot(aes(x1, y1)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Fitting a line

Data visualization with ggplot2

Why visualise?

  • We can process a lot of information quickly with our eyes
  • Plots give us information about
    • Distribution / shape
    • Irregularities
    • Assumptions
    • Intuitions
  • Summary statistics, correlations, parameters, model tests, p-values do not tell the whole story

ALWAYS plot your data!

Why visualise?

Source: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.

Why visualise?

What is ggplot2?

Layered plotting based on the book The Grammar of Graphics by Leland Wilkinson.

With ggplot2 you

  1. provide the data
  2. define how to map variables to aesthetics
  3. state which geometric object to display
  4. (optional) edit the overall theme of the plot

ggplot2 then takes care of the details

An example: scatterplot

1: Provide the data

boys %>%
  ggplot()

2: map variable to aesthetics

boys %>%
  ggplot(aes(x = age, y = bmi))

3: state which geometric object to display

boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point()

An example: scatterplot

Why this syntax?

Create the plot

gg <- 
  boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point(col = "dark green")

Add another layer (smooth fit line)

gg <- gg + 
  geom_smooth(col = "dark blue")

Give it some labels and a nice look

gg <- gg + 
  labs(x = "Age", y = "BMI", title = "BMI trend for boys") +
  theme_minimal()

Why this syntax?

plot(gg)

Why this syntax?

Back to lm

Fitting a line

The linear model would take the following form:

fit <- 
  yourdata %>%
  lm(youroutcome ~ yourpredictors)

fit %>% summary() # pipe
summary(fit) # base R

Output:

  • Residuals: minimum, quartiles, median and maximum
  • Coefficients: estimates, standard errors, t-values and \(p\)-values
  • Fit measures
    • Residual standard error (the estimated standard deviation of the errors)
    • Multiple R-squared (proportion of variance explained)
    • F-statistic and \(p\)-value (overall significance test of the model)

anscombe example

fit <- anscombe %$%
  lm(y1 ~ x1)

fit %>% summary
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Checking assumptions

  1. linearity
    • scatterplot of \(y\) against \(x\)
    • include a loess curve when in doubt
    • does adding a squared term improve the fit?
  2. normality of the residuals
    • normal probability plot: qqnorm()
    • if the sample is small: compare with qqnorm plots of simulated normal errors, cf. rnorm(n, 0, s) (see the sketch after this list)
  3. constant error variance
    • residuals vs. fitted plot
    • scale-location plot
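
A minimal sketch of point 2: compare the QQ-plot of the observed residuals with QQ-plots of simulated normal errors of the same size and spread, here for the anscombe fit from above:

res <- residuals(fit)     # residuals of lm(y1 ~ x1)
n   <- length(res)
s   <- summary(fit)$sigma # residual standard error
par(mfrow = c(1, 4))
qqnorm(res, main = "observed"); qqline(res)
for (i in 1:3) {          # three reference plots with freshly simulated errors
  e <- rnorm(n, 0, s)
  qqnorm(e, main = "simulated"); qqline(e)
}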

Linearity

anscombe %>%
  ggplot(aes(x1, y1)) + 
  geom_point() + 
  geom_smooth(method = "loess", col = "blue") + 
  geom_smooth(method = "lm", col = "orange")

Linearity

The loess curve suggests slight non-linearity

Adding a squared term

anscombe %$%
  lm(y1 ~ x1 + I(x1^2)) %>%
  summary()
## 
## Call:
## lm(formula = y1 ~ x1 + I(x1^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8704 -0.3481 -0.2456  0.7129  1.8072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.75507    3.28815   0.230    0.824
## x1           1.06925    0.78982   1.354    0.213
## I(x1^2)     -0.03162    0.04336  -0.729    0.487
## 
## Residual standard error: 1.27 on 8 degrees of freedom
## Multiple R-squared:  0.6873, Adjusted R-squared:  0.6092 
## F-statistic: 8.793 on 2 and 8 DF,  p-value: 0.009558

Constant error variance?

par(mfrow = c(1, 2))
fit %>%
  plot(which = c(1, 3), cex = .6)

No constant error variance!

par(mfrow = c(1, 2))
boys %$%
  lm(bmi ~ age) %>%
  plot(which = c(1, 3), cex = .6)

Normality of errors

fit %>%
  plot(which = 2, cex = .6)

The QQ-plot shows some divergence from normality in the tails

Outliers, influence and robust regression

Outliers and influential cases

Leverage: think of the fitted line as a lever.

  • some points pull/push harder; they have more leverage

Standardized residuals:

  • Observations with more leverage tend to lie closer to the line
  • The line is fitted so as to pass closer to them
  • The residual standard deviation can differ at different points on \(X\) - even if the error standard deviation is constant.
  • Therefore we standardize the residuals so that they have constant variance (assuming homoscedasticity).

Cook’s distance: how far the predicted values would move if your model were fit without the data point in question.

  • it is a function of the leverage and standardized residual associated with each data point
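
All three quantities can be extracted directly from a fitted lm object; a minimal sketch for the anscombe fit from above:

h  <- hatvalues(fit)       # leverage
ez <- rstandard(fit)       # standardized residuals
cd <- cooks.distance(fit)  # Cook's distance
round(cbind(leverage = h, std.residual = ez, cooks.distance = cd), 3)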

Fine

High leverage, low residual

Low leverage, high residual

High leverage, high residual

Outliers and influential cases

Outliers are cases with large \(e_z\) (standardized residuals).

If the model is correct we expect:

  • 5% of \(|e_z|>1.96\)
  • 1% of \(|e_z|>2.58\)
  • 0% of \(|e_z|>3.3\)

Influential cases are cases with large influence on parameter estimates

  • cases with Cook’s Distance \(> 1\), or
  • cases with Cook’s Distance much larger than the rest
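
A quick check of these rules of thumb on the anscombe fit from above (a sketch; the name ez is ours):

ez <- rstandard(fit)     # standardized residuals
mean(abs(ez) > 1.96)     # should be around 0.05 if the model is correct
mean(abs(ez) > 2.58)     # should be around 0.01
max(cooks.distance(fit)) # values near or above 1 signal influential cases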

Outliers and influential cases

par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))

There are no cases with \(|e_z|>2\), so there are no outliers (right-hand plot). There are no cases with Cook’s Distance \(>1\), but case 3 stands out compared to the rest (left-hand plot).

Next week

  • Inferential modeling
    • drawing conclusions from data
    • confidence intervals
  • What if assumptions are violated?