Statistical Programming with R

We use the following packages

library(MASS)
library(dplyr)
library(magrittr)
library(ggplot2)
library(mice)
library(DAAG)
library(car)

Fitting a line to data

Linear regression

Linear regression model \[ y_i=\alpha+\beta{x}_i+\varepsilon_i \]

Assumptions:

  • \(y_i\) conditionally normal with mean \(\mu_i=\alpha+\beta{x}_i\)
  • \(\varepsilon_i\) are \(i.i.d.\) with mean 0 and (constant) variance \(\sigma^2\)

The anscombe data

head(anscombe)
##   x1 x2 x3 x4   y1   y2    y3   y4
## 1 10 10 10  8 8.04 9.14  7.46 6.58
## 2  8  8  8  8 6.95 8.14  6.77 5.76
## 3 13 13 13  8 7.58 8.74 12.74 7.71
## 4  9  9  9  8 8.81 8.77  7.11 8.84
## 5 11 11 11  8 8.33 9.26  7.81 8.47
## 6 14 14 14  8 9.96 8.10  8.84 7.04

Or, alternatively,

anscombe %>%
  head
##   x1 x2 x3 x4   y1   y2    y3   y4
## 1 10 10 10  8 8.04 9.14  7.46 6.58
## 2  8  8  8  8 6.95 8.14  6.77 5.76
## 3 13 13 13  8 7.58 8.74 12.74 7.71
## 4  9  9  9  8 8.81 8.77  7.11 8.84
## 5 11 11 11  8 8.33 9.26  7.81 8.47
## 6 14 14 14  8 9.96 8.10  8.84 7.04

Fitting a line

anscombe %>%
  ggplot(aes(y1, x1)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Fitting a line

Fitting a line

The linear model would take the following form:

fit <- 
  yourdata %>%
  lm(youroutcome ~ yourpredictors)

summary(fit)

Output:

  • Residuals: minimum, maximum and quartiles
  • Coefficients: estimates, SE’s, t-values and \(p\)-values
  • Fit measures
    • Residuals SE (standard error residuals)
    • Multiple R-squared (proportion variance explained)
    • F-statistic and \(p\)-value (significance test model)

anscombe example

fit <- anscombe %$%
  lm(y1 ~ x1)

summary(fit)
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Checking assumptions

  1. linearity
    • scatterplot \(y\) and \(x\)
    • include loess curve when in doubt
    • does a squared term improve fit?
  2. normality residuals
    • normal probability plots qqnorm()
    • if sample is small; qqnorm with simulated errors cf. rnorm(n, 0, s)
  3. constant error variance
    • residual plot
    • scale-location plot

Linearity

anscombe %>%
  ggplot(aes(x1, y1)) + 
  geom_point() + 
  geom_smooth(method = "loess", col = "blue") + 
  geom_smooth(method = "lm", col = "orange")

Linearity

The loess curve suggests slight non-linearity

Adding a squared term

anscombe %$%
  lm(y1 ~ x1 + I(x1^2)) %>%
  summary()
## 
## Call:
## lm(formula = y1 ~ x1 + I(x1^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8704 -0.3481 -0.2456  0.7129  1.8072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.75507    3.28815   0.230    0.824
## x1           1.06925    0.78982   1.354    0.213
## I(x1^2)     -0.03162    0.04336  -0.729    0.487
## 
## Residual standard error: 1.27 on 8 degrees of freedom
## Multiple R-squared:  0.6873, Adjusted R-squared:  0.6092 
## F-statistic: 8.793 on 2 and 8 DF,  p-value: 0.009558

Constant error variance?

par(mfrow = c(1, 2))
fit %>%
  plot(which = c(1, 3), cex = .6)

No constant error variance!

par(mfrow = c(1, 2))
boys %$%
  lm(bmi ~ age) %>%
  plot(which = c(1, 3), cex = .6)

Normality of errors

fit %>%
  plot(which = 2, cex = .6)

The QQplot shows some divergence from normality at the tails

Outliers, influence and robust regression

Outliers and influential cases

Leverage: see the fit line as a lever.

  • some points pull/push harder; they have more leverage

Standardized residuals:

  • The values that have more leverage tend to be closer to the line
  • The line is fit so as to be closer to them
  • The residual standard deviation can differ at different points on \(X\) - even if the error standard deviation is constant.
  • Therefore we standardize the residuals so that they have constant variance (assuming homoscedasticity).

Cook’s distance: how far the predicted values would move if your model were fit without the data point in question.

  • it is a function of the leverage and standardized residual associated with each data point

Fine

High leverage, low residual

Low leverage, high residual

High leverage, high residual

Outliers and influential cases

Outliers are cases with large \(e_z\) (standardized residuals).

If the model is correct we expect:

  • 5% of \(|e_z|>1.96\)
  • 1% of \(|e_z|>2.58\)
  • 0% of \(|e_z|>3.3\)

Influential cases are cases with large influence on parameter estimates

  • cases with Cook’s Distance \(> 1\), or
  • cases with Cook’s Distance much larger than the rest

Outliers and influential cases

par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))

There are no cases with \(|e_z|>2\), so no outliers (right plot). There are no cases with Cook’s Distance \(>1\), but case 3 stands out

Model fit

New data example

boys.fit <- 
  na.omit(boys) %$% # Extremely wasteful
  lm(age ~ reg)
boys %>% 
  head
##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south

Model

boys.fit %>%
  anova()
## Analysis of Variance Table
## 
## Response: age
##            Df Sum Sq Mean Sq F value Pr(>F)
## reg         4   14.7  3.6747  0.3701 0.8298
## Residuals 218 2164.6  9.9293

boys.fit %>%
  summary()
## 
## Call:
## lm(formula = age ~ reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745,   Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298

Model factors

boys.fit %>%
  model.matrix() %>%
  head()
##   (Intercept) regeast regwest regsouth regcity
## 1           1       0       0        0       0
## 2           1       0       0        1       0
## 3           1       0       0        1       0
## 4           1       1       0        0       0
## 5           1       0       1        0       0
## 6           1       1       0        0       0

aov()

boys.fit %>% 
  aov()
## Call:
##    aov(formula = .)
## 
## Terms:
##                       reg Residuals
## Sum of Squares    14.6987 2164.5869
## Deg. of Freedom         4       218
## 
## Residual standard error: 3.151079
## Estimated effects may be unbalanced

aov() is for balanced designs.

anova()

boys.fit %>% 
  anova()
## Analysis of Variance Table
## 
## Response: age
##            Df Sum Sq Mean Sq F value Pr(>F)
## reg         4   14.7  3.6747  0.3701 0.8298
## Residuals 218 2164.6  9.9293

Anova() with type I Sum of Squares

  1. SS(A) for factor A.
  2. SS(B | A) for factor B.
  3. SS(AB | B, A) for interaction AB.
  • If data are unbalanced, the ordering defines the estimates:
    • Different ordering of factors will yield different effects

Anova() with type II Sum of Squares

  1. SS(A | B) for factor A.
  2. SS(B | A) for factor B.

No interaction assumed. So test for this first!

  • only if interaction is not significant continue with interpretation of the main effects.
  • most powerful if there is no interaction

Anova() with type II Sum of Squares

boys.fit %>% 
  car::Anova(type = 2)
## Anova Table (Type II tests)
## 
## Response: age
##           Sum Sq  Df F value Pr(>F)
## reg         14.7   4  0.3701 0.8298
## Residuals 2164.6 218

Anova() with type III Sum of Squares

  1. SS(A | B, AB) for factor A.
  2. SS(B | A, AB) for factor B.

Tests if there is a main effect, given the other main effect and the interaction.

  • however, most often we do not care about interpretation of the main effects if the interaction is significant

When data are balanced, Type I, II and II are identical, because the factors are orthogonal

Anova() with type III Sum of Squares

boys.fit %>% 
  car::Anova(type = 3)
## Anova Table (Type III tests)
## 
## Response: age
##             Sum Sq  Df  F value Pr(>F)    
## (Intercept) 6708.8   1 675.6530 <2e-16 ***
## reg           14.7   4   0.3701 0.8298    
## Residuals   2164.6 218                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post hoc comparisons

coef <- 
  boys.fit %>%
  aov() %>%
  summary.lm()
coef
## 
## Call:
## aov(formula = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745,   Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298

Post hoc comparisons

p.val <- coef$coefficients
p.adjust(p.val[, "Pr(>|t|)"], method = "bonferroni")
##  (Intercept)      regeast      regwest     regsouth      regcity 
## 5.077098e-68 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00

AIC and BIC

Akaike’s An Information Criterion

boys.fit %>% 
  AIC()
## [1] 1151.684

and Bayesian Information Criterion

boys.fit %>%
  BIC()
## [1] 1172.127

Model comparison

boys.fit2 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt)

boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545

Another model

boys.fit3 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt)

is equivalent to

boys.fit3 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt + hgt:wgt)

Model comparison

boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545
boys.fit3 %>% AIC()
## [1] 823.0386

Model comparison

anova(boys.fit, boys.fit2, boys.fit3)
## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt * wgt
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    218 2164.59                                  
## 2    217  521.64  1   1642.94 731.83 < 2.2e-16 ***
## 3    215  482.67  2     38.97   8.68  0.000237 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model fit again

boys.fit3 %>%
  car::Anova(type = 3)
## Anova Table (Type III tests)
## 
## Response: age
##             Sum Sq  Df F value   Pr(>F)    
## (Intercept)  27.04   1 12.0434 0.000628 ***
## reg           5.90   4  0.6569 0.622651    
## hgt         106.79   1 47.5667 5.84e-11 ***
## wgt           7.33   1  3.2654 0.072152 .  
## hgt:wgt       2.99   1  1.3324 0.249655    
## Residuals   482.67 215                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model fit again

boys.fit3 %>%
  car::Anova(type = 2)
## Anova Table (Type II tests)
## 
## Response: age
##           Sum Sq  Df F value    Pr(>F)    
## reg         5.90   4  0.6569    0.6227    
## hgt       170.35   1 75.8810 8.114e-16 ***
## wgt        35.98   1 16.0276 8.595e-05 ***
## hgt:wgt     2.99   1  1.3324    0.2497    
## Residuals 482.67 215                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model fit again

boys.fit3 %>%
  anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6368    0.1661    
## hgt         1 1642.94 1642.94 731.8311 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0276 8.595e-05 ***
## hgt:wgt     1    2.99    2.99   1.3324    0.2497    
## Residuals 215  482.67    2.24                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Influence of cases

DfBeta calculates the change in coefficients depicted as deviation in SE’s.

boys.fit3 %>%
  dfbeta() %>%
  head(n = 7)
##   (Intercept)      regeast       regwest      regsouth       regcity
## 1  0.30446529 -0.032645200 -0.0326165198 -0.0304499108 -0.0318295197
## 2 -0.12357958  0.003358160  0.0031894645 -0.0355420008  0.0050565041
## 3 -0.17522651  0.001479115  0.0015118575 -0.0134428270  0.0001493751
## 4 -0.08857830 -0.006195476  0.0007649256  0.0002461411 -0.0001545015
## 5  0.31715776  0.003385884 -0.0572382687  0.0032145390  0.0142911153
## 6 -0.17554676 -0.018945547  0.0024334472  0.0015126731  0.0006986407
## 7 -0.08200772  0.013753533  0.0137667118  0.0134385305  0.0131184131
##             hgt          wgt       hgt:wgt
## 1 -0.0018391896 -0.001790011  1.389532e-05
## 2  0.0009885098 -0.002377479  9.303061e-06
## 3  0.0010352437  0.002803606 -1.608245e-05
## 4  0.0004883509  0.001847202 -1.001945e-05
## 5 -0.0009320126 -0.019683940  9.719343e-05
## 6  0.0009077265  0.003892513 -2.022645e-05
## 7  0.0003391486  0.001610804 -8.082354e-06

Prediction

Fitted values:

Let’s use the simpler anscombe data example

y_hat <- 
  fit %>%
  fitted.values()

The residual is then calculated as

y_hat - anscombe$y1
##           1           2           3           4           5           6 
## -0.03900000  0.05081818  1.92127273 -1.30909091  0.17109091  0.04136364 
##           7           8           9          10          11 
## -1.23936364  0.74045455 -1.83881818  1.68072727 -0.17945455

Predict new values

If we introduce new values for the predictor x1, we can generate predicted values from the model

new.x1 <- data.frame(x1 = 1:20)
predict.lm(fit, newdata = new.x1)
##         1         2         3         4         5         6         7 
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727 
##         8         9        10        11        12        13        14 
##  7.000818  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 
##        15        16        17        18        19        20 
## 10.501455 11.001545 11.501636 12.001727 12.501818 13.001909

Predictions are draws from the regression line

pred <- predict.lm(fit, newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept)   new.x1$x1 
##   3.0000909   0.5000909
fit$coefficients
## (Intercept)          x1 
##   3.0000909   0.5000909

Prediction intervals

predict(fit, interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017

Assessing predictive accuracy

K-fold cross-validation

  • Divide sample in \(k\) mutually exclusive training sets
  • Do for all \(j\in\{1,\dots,k\}\) training sets

    1. fit model to training set \(j\)
    2. obtain predictions for test set \(j\) (remaining cases)
    3. compute residual variance (MSE) for test set \(j\)
  • Compare MSE in cross-validation with MSE in sample
  • Small difference suggests good predictive accuracy

K-fold cross-validation anscombe data

  • residual variance sample is \(1.24^2 \approx 1.53\)
  • residual variance cross-validation is 1.93
  • regression lines in the 3 folds are similar
DAAG::CVlm(anscombe, fit, plotit=T, printit=F)

K-fold cross-validation boys data

  • residual variance sample is 2
  • residual variance cross-validation is 2.38
  • regression lines in the 3 folds almost identical
DAAG::CVlm(na.omit(boys), boys.fit3, plotit=T, printit=F)

How many cases are used?

na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt) %>%
  nobs()
## [1] 223

If we would not have used na.omit()

boys %$%
  lm(age ~ reg + hgt * wgt) %>%
  nobs()
## [1] 724

Some other modeling devices in R

  • lm(): linear modeling
  • glm(): generalized linear modeling
  • gamlss::gamlss: generalized additive models (for location, scale and shape)
  • lme4::lmer: linear mixed effect models
  • lme4::glmer: generalized linear mixed effect models
  • lme4::nlmer: non-linear mixed effect models