Linear Modeling

Statistical Programming with R

We use the following packages

library(MASS)
library(dplyr)
library(magrittr)
library(ggplot2)
library(mice)
library(DAAG)
library(car)

Fitting a line to data

Linear regression

Linear regression model \[ y_i=\alpha+\beta{x}_i+\varepsilon_i \]

Assumptions:

\(y_i\) conditionally normal with mean \(\mu_i=\alpha+\beta{x}_i\)
\(\varepsilon_i\) are \(i.i.d.\) with mean 0 and (constant) variance \(\sigma^2\)

The `anscombe` data

head(anscombe)

##   x1 x2 x3 x4   y1   y2    y3   y4
## 1 10 10 10  8 8.04 9.14  7.46 6.58
## 2  8  8  8  8 6.95 8.14  6.77 5.76
## 3 13 13 13  8 7.58 8.74 12.74 7.71
## 4  9  9  9  8 8.81 8.77  7.11 8.84
## 5 11 11 11  8 8.33 9.26  7.81 8.47
## 6 14 14 14  8 9.96 8.10  8.84 7.04

Or, alternatively,

anscombe %>%
  head

##   x1 x2 x3 x4   y1   y2    y3   y4
## 1 10 10 10  8 8.04 9.14  7.46 6.58
## 2  8  8  8  8 6.95 8.14  6.77 5.76
## 3 13 13 13  8 7.58 8.74 12.74 7.71
## 4  9  9  9  8 8.81 8.77  7.11 8.84
## 5 11 11 11  8 8.33 9.26  7.81 8.47
## 6 14 14 14  8 9.96 8.10  8.84 7.04

Fitting a line

anscombe %>%
  ggplot(aes(y1, x1)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Fitting a line

The linear model would take the following form:

fit <- 
  yourdata %>%
  lm(youroutcome ~ yourpredictors)

summary(fit)

Output:

Residuals: minimum, maximum and quartiles
Coefficients: estimates, SE’s, t-values and \(p\)-values
Fit measures
- Residuals SE (standard error residuals)
- Multiple R-squared (proportion variance explained)
- F-statistic and \(p\)-value (significance test model)

`anscombe` example

fit <- anscombe %$%
  lm(y1 ~ x1)

summary(fit)

## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Checking assumptions

linearity
- scatterplot \(y\) and \(x\)
- include loess curve when in doubt
- does a squared term improve fit?
normality residuals
- normal probability plots qqnorm()
- if sample is small; qqnorm with simulated errors cf. rnorm(n, 0, s)
constant error variance
- residual plot
- scale-location plot

Linearity

anscombe %>%
  ggplot(aes(x1, y1)) + 
  geom_point() + 
  geom_smooth(method = "loess", col = "blue") + 
  geom_smooth(method = "lm", col = "orange")

Linearity

The loess curve suggests slight non-linearity

Adding a squared term

anscombe %$%
  lm(y1 ~ x1 + I(x1^2)) %>%
  summary()

## 
## Call:
## lm(formula = y1 ~ x1 + I(x1^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8704 -0.3481 -0.2456  0.7129  1.8072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.75507    3.28815   0.230    0.824
## x1           1.06925    0.78982   1.354    0.213
## I(x1^2)     -0.03162    0.04336  -0.729    0.487
## 
## Residual standard error: 1.27 on 8 degrees of freedom
## Multiple R-squared:  0.6873, Adjusted R-squared:  0.6092 
## F-statistic: 8.793 on 2 and 8 DF,  p-value: 0.009558

Constant error variance?

par(mfrow = c(1, 2))
fit %>%
  plot(which = c(1, 3), cex = .6)

No constant error variance!

par(mfrow = c(1, 2))
boys %$%
  lm(bmi ~ age) %>%
  plot(which = c(1, 3), cex = .6)

Normality of errors

fit %>%
  plot(which = 2, cex = .6)

The QQplot shows some divergence from normality at the tails

Outliers, influence and robust regression

Outliers and influential cases

Leverage: see the fit line as a lever.

some points pull/push harder; they have more leverage

Standardized residuals:

The values that have more leverage tend to be closer to the line
The line is fit so as to be closer to them
The residual standard deviation can differ at different points on \(X\) - even if the error standard deviation is constant.
Therefore we standardize the residuals so that they have constant variance (assuming homoscedasticity).

Cook’s distance: how far the predicted values would move if your model were fit without the data point in question.

it is a function of the leverage and standardized residual associated with each data point

Fine

High leverage, low residual

Low leverage, high residual

High leverage, high residual

Outliers and influential cases

Outliers are cases with large \(e_z\) (standardized residuals).

If the model is correct we expect:

5% of \(|e_z|>1.96\)
1% of \(|e_z|>2.58\)
0% of \(|e_z|>3.3\)

Influential cases are cases with large influence on parameter estimates

cases with Cook’s Distance \(> 1\), or
cases with Cook’s Distance much larger than the rest

Outliers and influential cases

par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))

There are no cases with \(|e_z|>2\), so no outliers (right plot). There are no cases with Cook’s Distance \(>1\), but case 3 stands out

Model fit

New data example

boys.fit <- 
  na.omit(boys) %$% # Extremely wasteful
  lm(age ~ reg)

boys %>% 
  head

##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south

Model

boys.fit %>%
  anova()

## Analysis of Variance Table
## 
## Response: age
##            Df Sum Sq Mean Sq F value Pr(>F)
## reg         4   14.7  3.6747  0.3701 0.8298
## Residuals 218 2164.6  9.9293

boys.fit %>%
  summary()

## 
## Call:
## lm(formula = age ~ reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745,   Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298

Model factors

boys.fit %>%
  model.matrix() %>%
  head()

##   (Intercept) regeast regwest regsouth regcity
## 1           1       0       0        0       0
## 2           1       0       0        1       0
## 3           1       0       0        1       0
## 4           1       1       0        0       0
## 5           1       0       1        0       0
## 6           1       1       0        0       0

`aov()`

boys.fit %>% 
  aov()

## Call:
##    aov(formula = .)
## 
## Terms:
##                       reg Residuals
## Sum of Squares    14.6987 2164.5869
## Deg. of Freedom         4       218
## 
## Residual standard error: 3.151079
## Estimated effects may be unbalanced

aov() is for balanced designs.

`anova()`

boys.fit %>% 
  anova()

## Analysis of Variance Table
## 
## Response: age
##            Df Sum Sq Mean Sq F value Pr(>F)
## reg         4   14.7  3.6747  0.3701 0.8298
## Residuals 218 2164.6  9.9293

`Anova()` with type I Sum of Squares

SS(A) for factor A.
SS(B | A) for factor B.
SS(AB | B, A) for interaction AB.

If data are unbalanced, the ordering defines the estimates:
- Different ordering of factors will yield different effects

`Anova()` with type II Sum of Squares

SS(A | B) for factor A.
SS(B | A) for factor B.

No interaction assumed. So test for this first!

only if interaction is not significant continue with interpretation of the main effects.
most powerful if there is no interaction

`Anova()` with type II Sum of Squares

boys.fit %>% 
  car::Anova(type = 2)

## Anova Table (Type II tests)
## 
## Response: age
##           Sum Sq  Df F value Pr(>F)
## reg         14.7   4  0.3701 0.8298
## Residuals 2164.6 218

`Anova()` with type III Sum of Squares

SS(A | B, AB) for factor A.
SS(B | A, AB) for factor B.

Tests if there is a main effect, given the other main effect and the interaction.

however, most often we do not care about interpretation of the main effects if the interaction is significant

When data are balanced, Type I, II and II are identical, because the factors are orthogonal

`Anova()` with type III Sum of Squares

boys.fit %>% 
  car::Anova(type = 3)

## Anova Table (Type III tests)
## 
## Response: age
##             Sum Sq  Df  F value Pr(>F)    
## (Intercept) 6708.8   1 675.6530 <2e-16 ***
## reg           14.7   4   0.3701 0.8298    
## Residuals   2164.6 218                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post hoc comparisons

coef <- 
  boys.fit %>%
  aov() %>%
  summary.lm()
coef

## 
## Call:
## aov(formula = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745,   Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298

Post hoc comparisons

p.val <- coef$coefficients
p.adjust(p.val[, "Pr(>|t|)"], method = "bonferroni")

##  (Intercept)      regeast      regwest     regsouth      regcity 
## 5.077098e-68 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00

AIC and BIC

Akaike’s An Information Criterion

boys.fit %>% 
  AIC()

## [1] 1151.684

and Bayesian Information Criterion

boys.fit %>%
  BIC()

## [1] 1172.127

Model comparison

boys.fit2 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt)

boys.fit %>% AIC()

## [1] 1151.684

boys.fit2 %>% AIC()

## [1] 836.3545

Another model

boys.fit3 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt)

is equivalent to

boys.fit3 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt + hgt:wgt)

Model comparison

boys.fit %>% AIC()

## [1] 1151.684

boys.fit2 %>% AIC()

## [1] 836.3545

boys.fit3 %>% AIC()

## [1] 823.0386

Model comparison

anova(boys.fit, boys.fit2, boys.fit3)

## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt * wgt
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    218 2164.59                                  
## 2    217  521.64  1   1642.94 731.83 < 2.2e-16 ***
## 3    215  482.67  2     38.97   8.68  0.000237 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model fit again

boys.fit3 %>%
  car::Anova(type = 3)

## Anova Table (Type III tests)
## 
## Response: age
##             Sum Sq  Df F value   Pr(>F)    
## (Intercept)  27.04   1 12.0434 0.000628 ***
## reg           5.90   4  0.6569 0.622651    
## hgt         106.79   1 47.5667 5.84e-11 ***
## wgt           7.33   1  3.2654 0.072152 .  
## hgt:wgt       2.99   1  1.3324 0.249655    
## Residuals   482.67 215                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model fit again

boys.fit3 %>%
  car::Anova(type = 2)

## Anova Table (Type II tests)
## 
## Response: age
##           Sum Sq  Df F value    Pr(>F)    
## reg         5.90   4  0.6569    0.6227    
## hgt       170.35   1 75.8810 8.114e-16 ***
## wgt        35.98   1 16.0276 8.595e-05 ***
## hgt:wgt     2.99   1  1.3324    0.2497    
## Residuals 482.67 215                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model fit again

boys.fit3 %>%
  anova()

## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6368    0.1661    
## hgt         1 1642.94 1642.94 731.8311 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0276 8.595e-05 ***
## hgt:wgt     1    2.99    2.99   1.3324    0.2497    
## Residuals 215  482.67    2.24                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Influence of cases

DfBeta calculates the change in coefficients depicted as deviation in SE’s.

boys.fit3 %>%
  dfbeta() %>%
  head(n = 7)

##   (Intercept)      regeast       regwest      regsouth       regcity
## 1  0.30446529 -0.032645200 -0.0326165198 -0.0304499108 -0.0318295197
## 2 -0.12357958  0.003358160  0.0031894645 -0.0355420008  0.0050565041
## 3 -0.17522651  0.001479115  0.0015118575 -0.0134428270  0.0001493751
## 4 -0.08857830 -0.006195476  0.0007649256  0.0002461411 -0.0001545015
## 5  0.31715776  0.003385884 -0.0572382687  0.0032145390  0.0142911153
## 6 -0.17554676 -0.018945547  0.0024334472  0.0015126731  0.0006986407
## 7 -0.08200772  0.013753533  0.0137667118  0.0134385305  0.0131184131
##             hgt          wgt       hgt:wgt
## 1 -0.0018391896 -0.001790011  1.389532e-05
## 2  0.0009885098 -0.002377479  9.303061e-06
## 3  0.0010352437  0.002803606 -1.608245e-05
## 4  0.0004883509  0.001847202 -1.001945e-05
## 5 -0.0009320126 -0.019683940  9.719343e-05
## 6  0.0009077265  0.003892513 -2.022645e-05
## 7  0.0003391486  0.001610804 -8.082354e-06

Prediction

Fitted values:

Let’s use the simpler anscombe data example

y_hat <- 
  fit %>%
  fitted.values()

The residual is then calculated as

y_hat - anscombe$y1

##           1           2           3           4           5           6 
## -0.03900000  0.05081818  1.92127273 -1.30909091  0.17109091  0.04136364 
##           7           8           9          10          11 
## -1.23936364  0.74045455 -1.83881818  1.68072727 -0.17945455

Predict new values

If we introduce new values for the predictor x1, we can generate predicted values from the model

new.x1 <- data.frame(x1 = 1:20)
predict.lm(fit, newdata = new.x1)

##         1         2         3         4         5         6         7 
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727 
##         8         9        10        11        12        13        14 
##  7.000818  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 
##        15        16        17        18        19        20 
## 10.501455 11.001545 11.501636 12.001727 12.501818 13.001909

Predictions are draws from the regression line

pred <- predict.lm(fit, newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients

## (Intercept)   new.x1$x1 
##   3.0000909   0.5000909

fit$coefficients

## (Intercept)          x1 
##   3.0000909   0.5000909

Prediction intervals

predict(fit, interval = "prediction")

##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017

Assessing predictive accuracy

K-fold cross-validation

Divide sample in \(k\) mutually exclusive training sets
Do for all \(j\in\{1,\dots,k\}\) training sets
1. fit model to training set \(j\)
2. obtain predictions for test set \(j\) (remaining cases)
3. compute residual variance (MSE) for test set \(j\)
Compare MSE in cross-validation with MSE in sample
Small difference suggests good predictive accuracy

K-fold cross-validation `anscombe` data

residual variance sample is \(1.24^2 \approx 1.53\)
residual variance cross-validation is 1.93
regression lines in the 3 folds are similar

DAAG::CVlm(anscombe, fit, plotit=T, printit=F)

K-fold cross-validation `boys` data

residual variance sample is 2
residual variance cross-validation is 2.38
regression lines in the 3 folds almost identical

DAAG::CVlm(na.omit(boys), boys.fit3, plotit=T, printit=F)

How many cases are used?

na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt) %>%
  nobs()

## [1] 223

If we would not have used na.omit()

boys %$%
  lm(age ~ reg + hgt * wgt) %>%
  nobs()

## [1] 724

Some other modeling devices in `R`

lm(): linear modeling
glm(): generalized linear modeling
gamlss::gamlss: generalized additive models (for location, scale and shape)
lme4::lmer: linear mixed effect models
lme4::glmer: generalized linear mixed effect models
lme4::nlmer: non-linear mixed effect models

We use the following packages

Fitting a line to data

Linear regression

The anscombe data

Fitting a line

Fitting a line

Fitting a line

anscombe example

Checking assumptions

Linearity

Linearity

Adding a squared term

Constant error variance?

No constant error variance!

Normality of errors

Outliers, influence and robust regression

Outliers and influential cases

Fine

High leverage, low residual

Low leverage, high residual

High leverage, high residual

Outliers and influential cases

Outliers and influential cases

Model fit

New data example

Model

Model factors

aov()

anova()

Anova() with type I Sum of Squares

Anova() with type II Sum of Squares

Anova() with type II Sum of Squares

Anova() with type III Sum of Squares

Anova() with type III Sum of Squares

Post hoc comparisons

Post hoc comparisons

AIC and BIC

Model comparison

Another model

Model comparison

Model comparison

Model fit again

Model fit again

Model fit again

Influence of cases

Prediction

Fitted values:

Predict new values

Predictions are draws from the regression line

Prediction intervals

Assessing predictive accuracy

K-fold cross-validation

K-fold cross-validation anscombe data

K-fold cross-validation boys data

How many cases are used?

Some other modeling devices in R

The `anscombe` data

`anscombe` example

`aov()`

`anova()`

`Anova()` with type I Sum of Squares

`Anova()` with type II Sum of Squares

`Anova()` with type II Sum of Squares

`Anova()` with type III Sum of Squares

`Anova()` with type III Sum of Squares

K-fold cross-validation `anscombe` data

K-fold cross-validation `boys` data

Some other modeling devices in `R`