Fundamental Techniques in Data Science with R

Today

This lecture

  • model fit

  • model complexity

  • cross validation

We use the following packages

library(MASS)
library(dplyr)
library(magrittr)
library(ggplot2)
library(mice)
library(DAAG)
library(car)

set.seed(123)

Model fit

A simple model

boys.fit <- 
  na.omit(boys) %$% # Extremely wasteful
  lm(age ~ reg)
boys.fit
## 
## Call:
## lm(formula = age ~ reg)
## 
## Coefficients:
## (Intercept)      regeast      regwest     regsouth      regcity  
##     14.7109      -0.8168      -0.7044      -0.6913      -0.6663
boys %>% na.omit() %$% aggregate(age, list(reg), mean)
##   Group.1        x
## 1   north 14.71094
## 2    east 13.89410
## 3    west 14.00657
## 4   south 14.01965
## 5    city 14.04460
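
The coefficients and the group means tell the same story: each group mean equals the intercept (the mean for the reference category north) plus that region’s dummy coefficient. A quick check:

# group means reconstructed from the model: intercept + dummy coefficient
# (north is the reference category, so its deviation is 0)
coef(boys.fit)[1] + c(north = 0, coef(boys.fit)[-1])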

Plotting the model

means <- boys %>% na.omit() %>% group_by(reg) %>% summarise(age = mean(age))
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(na.omit(boys), aes(x = reg, y = age)) + 
  geom_point(color = "grey") + 
  geom_point(data = means, stat = "identity", size = 3)
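
Alternatively, ggplot2 can compute the group means on the fly, so the separate means object is not needed; a minimal sketch, assuming ggplot2 ≥ 3.3 (older versions use fun.y instead of fun):

# alternative: let stat_summary() compute the group means within the plot
ggplot(na.omit(boys), aes(x = reg, y = age)) + 
  geom_point(color = "grey") + 
  stat_summary(fun = mean, geom = "point", size = 3)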

Model parameters

boys.fit %>%
  summary()
## 
## Call:
## lm(formula = age ~ reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745,   Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298

Is it a good model?

boys.fit %>%
  anova()
## Analysis of Variance Table
## 
## Response: age
##            Df Sum Sq Mean Sq F value Pr(>F)
## reg         4   14.7  3.6747  0.3701 0.8298
## Residuals 218 2164.6  9.9293

It is not a very informative model. The ANOVA is not significant: the variation explained by reg is small relative to the residual variation.

The outcome age does not differ significantly across the levels of reg.

Model factors

boys.fit %>%
  model.matrix() %>%
  head(n = 10)
##    (Intercept) regeast regwest regsouth regcity
## 1            1       0       0        0       0
## 2            1       0       0        1       0
## 3            1       0       0        1       0
## 4            1       1       0        0       0
## 5            1       0       1        0       0
## 6            1       1       0        0       0
## 7            1       0       0        0       0
## 8            1       0       1        0       0
## 9            1       0       0        1       0
## 10           1       0       1        0       0

R expands the categorical variable for us

  • it dummy-codes the 5 categories into 4 dummies (and an intercept).
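
The coding R uses can be inspected, and the reference category changed, via the factor’s contrasts; a short sketch (boys.relevel is just an illustrative name):

# inspect the dummy (treatment) coding R uses for reg
contrasts(na.omit(boys)$reg)

# use a different reference category, e.g. city, by releveling the factor
boys.relevel <- na.omit(boys) %>% mutate(reg = relevel(reg, ref = "city"))
lm(age ~ reg, data = boys.relevel)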

Post hoc comparisons

coef <- boys.fit %>% aov() %>% summary.lm()
coef
## 
## Call:
## aov(formula = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745,   Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298

Post hoc comparisons

Without p-value adjustment

na.omit(boys) %$% pairwise.t.test(age, reg, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  age and reg 
## 
##       north east west south
## east  0.25  -    -    -    
## west  0.31  0.85 -    -    
## south 0.32  0.83 0.98 -    
## city  0.46  0.86 0.96 0.98 
## 
## P value adjustment method: none

Post hoc comparisons

With p-values adjusted using the Bonferroni correction

na.omit(boys) %$% pairwise.t.test(age, reg, p.adj = "bonf")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  age and reg 
## 
##       north east west south
## east  1     -    -    -    
## west  1     1    -    -    
## south 1     1    1    -    
## city  1     1    1    1    
## 
## P value adjustment method: bonferroni

Post hoc comparisons

Manually calculated

p.val <- coef$coefficients
p.adjust(p.val[, "Pr(>|t|)"], method = "bonferroni")
##  (Intercept)      regeast      regwest     regsouth      regcity 
## 5.077098e-68 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
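
The Bonferroni adjustment simply multiplies each p-value by the number of tests and caps the result at 1, so we can reproduce the adjustment by hand:

# Bonferroni by hand: multiply each p-value by the number of tests, cap at 1
m <- length(p.val[, "Pr(>|t|)"])
pmin(p.val[, "Pr(>|t|)"] * m, 1)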

If you have trouble reading scientific notation, 5.077098e-68 means the following

\[5.077098\text{e-68} = 5.077098 \times 10^{-68} = 5.077098 \times \left(\frac{1}{10}\right)^{68}\]

This indicates that the decimal point should be moved 68 places to the left:

\[5.077098\text{e-68} = 0.\underbrace{000\dots000}_{67\text{ zeros}}5077098\]

AIC

Akaike’s An Information Criterion

boys.fit %>% 
  AIC()
## [1] 1151.684

What is AIC

AIC comes from information theory and can be used for model selection. The AIC estimates the relative amount of information that is lost when the statistical model is used to approximate the true data-generating process. In other words: AIC measures the fit of the model to the data, penalized for model complexity.

  • The better the fit, the less the loss in information
  • AIC works on the log scale:
    • \(\text{log}(0) = -\infty\), \(\text{log}(1) = 0\), etc.
  • the lower the AIC, the better the model
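
For a fitted lm, \(\text{AIC} = 2k - 2\log(\hat{L})\), where \(k\) is the number of estimated parameters (the regression coefficients plus the residual variance) and \(\hat{L}\) is the maximised likelihood. A small sketch that reproduces AIC() from the log-likelihood:

# reproduce AIC() by hand: AIC = 2k - 2 * logLik
ll <- logLik(boys.fit)
2 * attr(ll, "df") - 2 * as.numeric(ll)   # should equal AIC(boys.fit), i.e. 1151.684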

Model comparison

A new model

Let’s add predictor hgt to the model:

boys.fit2 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt)

boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545

Another model

Let’s add wgt to the model

boys.fit3 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt)

And another model

Let’s add wgt and the interaction between wgt and hgt to the model

boys.fit4 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt)

is equivalent to

boys.fit4 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt + hgt:wgt)
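
To verify that the two specifications are indeed equivalent, we can compare their estimated coefficients; a quick sketch (fit.a and fit.b are throwaway names used only here):

# both specifications yield identical coefficients
fit.a <- na.omit(boys) %$% lm(age ~ reg + hgt * wgt)
fit.b <- na.omit(boys) %$% lm(age ~ reg + hgt + wgt + hgt:wgt)
all.equal(coef(fit.a), coef(fit.b))   # should be TRUE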

Model comparison

boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545
boys.fit3 %>% AIC()
## [1] 822.4164
boys.fit4 %>% AIC()
## [1] 823.0386

Another form of model comparison

anova(boys.fit, boys.fit2, boys.fit3, boys.fit4)
## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt + wgt
## Model 4: age ~ reg + hgt * wgt
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    218 2164.59                                    
## 2    217  521.64  1   1642.94 731.8311 < 2.2e-16 ***
## 3    216  485.66  1     35.98  16.0276 8.595e-05 ***
## 4    215  482.67  1      2.99   1.3324    0.2497    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Inspect boys.fit3

boys.fit3 %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6343    0.1667    
## hgt         1 1642.94 1642.94 730.7065 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0029 8.688e-05 ***
## Residuals 216  485.66    2.25                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Inspect boys.fit4

boys.fit4 %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6368    0.1661    
## hgt         1 1642.94 1642.94 731.8311 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0276 8.595e-05 ***
## hgt:wgt     1    2.99    2.99   1.3324    0.2497    
## Residuals 215  482.67    2.24                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that reg and the interaction hgt:wgt are redundant

Remove reg

boys.fit5 <- 
  na.omit(boys) %$%
  lm(age ~ hgt + wgt)

Let’s revisit the comparison

anova(boys.fit, boys.fit2, boys.fit3, boys.fit5)
## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt + wgt
## Model 4: age ~ hgt + wgt
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    218 2164.59                                    
## 2    217  521.64  1   1642.94 730.7065 < 2.2e-16 ***
## 3    216  485.66  1     35.98  16.0029 8.688e-05 ***
## 4    220  492.43 -4     -6.77   0.7529    0.5571    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dropping reg does not significantly worsen the fit, so boys.fit5 is preferable: it fits about as well as boys.fit3 but uses fewer parameters.

Stepwise regression

We start with the full model, which uses all other columns as predictors.

The most straightforward way to go about this is by specifying the following model:

full.model <- lm(age ~ ., data = na.omit(boys))
full.model
## 
## Call:
## lm(formula = age ~ ., data = na.omit(boys))
## 
## Coefficients:
## (Intercept)          hgt          wgt          bmi           hc        gen.L  
##    2.556051     0.059987    -0.009846     0.142162    -0.024086     1.287455  
##       gen.Q        gen.C        gen^4        phb.L        phb.Q        phb.C  
##   -0.006861    -0.187256     0.034186     1.552398     0.499620     0.656277  
##       phb^4        phb^5           tv      regeast      regwest     regsouth  
##   -0.094722    -0.113686     0.074321    -0.222249    -0.233307    -0.258771  
##     regcity  
##    0.188423

Stepwise regression - continued

We can then specify the stepwise model. In this case we choose direction = "both".

step.model <- step(full.model, direction = "both", 
                      trace = FALSE)
step.model
## 
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
## 
## Coefficients:
## (Intercept)          hgt          bmi        phb.L        phb.Q        phb.C  
##     2.07085      0.05482      0.10742      2.70328      0.28877      0.48492  
##       phb^4        phb^5           tv  
##    -0.06225     -0.15167      0.07957

Other options are

  • forward: fit all univariate models, add the best predictor and continue.
  • backward: fit the full model, eliminate the worst predictor and continue.
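
For example, a forward search starts from the intercept-only model and adds predictors from the scope of the full model; a minimal sketch (min.model and forward.model are illustrative names):

# forward selection: start from the intercept-only model and add predictors
# from the scope of the full model, one at a time
min.model <- lm(age ~ 1, data = na.omit(boys))
forward.model <- step(min.model,
                      scope = age ~ hgt + wgt + bmi + hc + gen + phb + tv + reg,
                      direction = "forward", trace = FALSE)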

Summary

step.model %>% summary
## 
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5077 -0.7144 -0.0814  0.6850  3.1724 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.070854   1.576365   1.314  0.19036    
## hgt          0.054822   0.009986   5.490 1.13e-07 ***
## bmi          0.107415   0.030135   3.564  0.00045 ***
## phb.L        2.703275   0.437108   6.184 3.12e-09 ***
## phb.Q        0.288768   0.211836   1.363  0.17426    
## phb.C        0.484921   0.202856   2.390  0.01769 *  
## phb^4       -0.062246   0.208472  -0.299  0.76555    
## phb^5       -0.151667   0.240599  -0.630  0.52912    
## tv           0.079573   0.019600   4.060 6.89e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.172 on 214 degrees of freedom
## Multiple R-squared:  0.8651, Adjusted R-squared:   0.86 
## F-statistic: 171.5 on 8 and 214 DF,  p-value: < 2.2e-16

Stepwise regression - AIC

full.model <- lm(age ~ ., data = na.omit(boys))
step.model <- MASS::stepAIC(full.model, direction = "both", 
                      trace = FALSE)
step.model
## 
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
## 
## Coefficients:
## (Intercept)          hgt          bmi        phb.L        phb.Q        phb.C  
##     2.07085      0.05482      0.10742      2.70328      0.28877      0.48492  
##       phb^4        phb^5           tv  
##    -0.06225     -0.15167      0.07957

Influence of cases

dfbeta() calculates, for each case, the change in the coefficients when that case is left out; the scaled variant dfbetas() expresses this change in standard errors.

step.model %>%
  dfbeta() %>%
  head(n = 7)
##       (Intercept)           hgt           bmi        phb.L         phb.Q
## 3279 -0.142127090  0.0010828575 -0.0021654085 -0.021523825 -0.0006087783
## 3283 -0.005201914  0.0001127452 -0.0014100750  0.009068652 -0.0147293625
## 3296 -0.081439010  0.0004244061  0.0002603379 -0.009247562 -0.0071293767
## 3321 -0.084630826  0.0004186923  0.0008222594 -0.005498486 -0.0059276580
## 3323  0.154386486 -0.0004476521 -0.0052964367  0.042872835 -0.0213761347
## 3327 -0.046095468  0.0001081361  0.0011150546 -0.004047875 -0.0085461280
## 3357 -0.045933397  0.0001345605  0.0009770371 -0.005087053 -0.0062604484
##            phb.C        phb^4        phb^5            tv
## 3279 0.007427067 -0.004226982 -0.001213833  0.0001061106
## 3283 0.011761595 -0.005498411  0.001164307  0.0007051901
## 3296 0.008538302 -0.003688611  0.000842872  0.0002494383
## 3321 0.006839217 -0.003341741  0.000866896 -0.0002569914
## 3323 0.011758694 -0.006659755  0.000493897  0.0012317886
## 3327 0.007807966 -0.002901270  0.001499505  0.0003376220
## 3357 0.006152384 -0.002274777  0.001148552  0.0002329346
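
The scaled counterpart dfbetas() expresses these changes in standard-error units, which allows a rule-of-thumb cutoff; a sketch, using the common (but arbitrary) threshold of \(2/\sqrt{n}\):

# scaled version: change per coefficient in standard-error units
infl <- dfbetas(step.model)

# flag cases that shift any coefficient by more than 2/sqrt(n) standard errors
n <- nobs(step.model)
which(apply(abs(infl) > 2 / sqrt(n), 1, any))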

Prediction

Fitted values

Let’s use the simpler anscombe data as an example

fit <- anscombe %$% lm(y1 ~ x1)

y_hat <- 
  fit %>%
  fitted.values()

The difference between the fitted and the observed values is then calculated as

y_hat - anscombe$y1
##           1           2           3           4           5           6 
## -0.03900000  0.05081818  1.92127273 -1.30909091  0.17109091  0.04136364 
##           7           8           9          10          11 
## -1.23936364  0.74045455 -1.83881818  1.68072727 -0.17945455
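
Note that residuals() returns observed minus fitted values, i.e. the negative of the numbers above; a quick check:

# residuals() gives observed minus fitted: the negative of y_hat - anscombe$y1
all.equal(unname(residuals(fit)), anscombe$y1 - unname(y_hat))   # should be TRUE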

Predict new values

If we introduce new values for the predictor x1, we can generate predicted values from the model

new.x1 <- data.frame(x1 = 1:20)
fit %>% predict(newdata = new.x1)
##         1         2         3         4         5         6         7         8 
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727  7.000818 
##         9        10        11        12        13        14        15        16 
##  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 10.501455 11.001545 
##        17        18        19        20 
## 11.501636 12.001727 12.501818 13.001909

Predicted values lie exactly on the regression line

pred <- fit %>% predict(newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept)   new.x1$x1 
##   3.0000909   0.5000909
fit$coefficients
## (Intercept)          x1 
##   3.0000909   0.5000909

Prediction intervals

fit %>% predict(interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017

A prediction interval reflects the uncertainty around a single (future) observation. A confidence interval reflects the uncertainty around the mean of the predicted values.
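
We can see the difference by computing both intervals for the same data: prediction intervals are wider because they also include the residual variation of individual observations. A small sketch (ci and pi are illustrative names):

# confidence intervals (uncertainty about the mean) vs prediction intervals
# (uncertainty about a single new observation); the latter are wider
ci <- fit %>% predict(interval = "confidence")
pi <- fit %>% predict(interval = "prediction")
head(cbind(ci.width = ci[, "upr"] - ci[, "lwr"],
           pi.width = pi[, "upr"] - pi[, "lwr"]))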

Assessing predictive accuracy

K-fold cross-validation

  • Divide the sample into \(k\) mutually exclusive folds

  • Do for all folds \(j\in\{1,\dots,k\}\)

    1. fit the model to all folds except fold \(j\) (the training set)
    2. obtain predictions for fold \(j\) (the test set)
    3. compute the residual variance (MSE) for fold \(j\)
  • Compare the MSE in cross-validation with the MSE in the sample

  • A small difference suggests good predictive accuracy (a minimal manual sketch follows below)
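
To make the mechanism concrete, here is a minimal manual sketch of 3-fold cross-validation for the anscombe model; the fold assignment is random, so the resulting MSE will vary between runs:

# manual 3-fold cross-validation for lm(y1 ~ x1) on the anscombe data
k <- 3
folds <- sample(rep(1:k, length.out = nrow(anscombe)))
mse <- numeric(k)
for (j in 1:k) {
  train <- anscombe[folds != j, ]   # all folds except fold j
  test  <- anscombe[folds == j, ]   # held-out fold j
  fit.j <- lm(y1 ~ x1, data = train)
  mse[j] <- mean((test$y1 - predict(fit.j, newdata = test))^2)
}
mean(mse)   # cross-validated MSE; compare with the in-sample MSE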

The original model

fit %>% summary()
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

K-fold cross-validation anscombe data

DAAG::CVlm(anscombe, fit, plotit=F, printit=T)
## Analysis of Variance Table
## 
## Response: y1
##           Df Sum Sq Mean Sq F value Pr(>F)   
## x1         1   27.5   27.51      18 0.0022 **
## Residuals  9   13.8    1.53                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## fold 1 
## Observations in test set: 3 
##                  5    7    11
## x1          11.000 6.00 5.000
## cvpred       8.443 5.59 5.014
## y1           8.330 7.24 5.680
## CV residual -0.113 1.65 0.666
## 
## Sum of squares = 3.19    Mean square = 1.06    n = 3 
## 
## fold 2 
## Observations in test set: 4 
##                  2      6     8    10
## x1           8.000 14.000  4.00  7.00
## cvpred       7.572  9.681  6.17  7.22
## y1           6.950  9.960  4.26  4.82
## CV residual -0.622  0.279 -1.91 -2.40
## 
## Sum of squares = 9.86    Mean square = 2.47    n = 4 
## 
## fold 3 
## Observations in test set: 4 
##                 1     3    4     9
## x1          10.00 13.00 9.00 12.00
## cvpred       7.84  9.37 7.33  8.86
## y1           8.04  7.58 8.81 10.84
## CV residual  0.20 -1.79 1.48  1.98
## 
## Sum of squares = 9.35    Mean square = 2.34    n = 4 
## 
## Overall (Sum over all 4 folds) 
##   ms 
## 2.04

K-fold cross-validation anscombe data

  • the residual variance in the sample is \(1.24^2 \approx 1.53\)
  • the residual variance under cross-validation is 2.04
  • the regression lines in the 3 folds are not the same, but similar
DAAG::CVlm(anscombe, fit, plotit="Observed", printit=F)

Plotting the residuals

DAAG::CVlm(anscombe, fit, plotit="Residual", printit=F)

K-fold cross-validation boys data

  • the residual variance in the sample is 1.37
  • the residual variance under cross-validation is 1.46
  • the regression lines in the 3 folds are almost identical
DAAG::CVlm(na.omit(boys), step.model, plotit="Observed", printit=F)

Plotting the residuals

DAAG::CVlm(na.omit(boys), step.model, plotit="Residual", printit=F)

How many cases are used?

na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt) %>%
  nobs()
## [1] 223

If we had not used na.omit(), lm() would only drop cases with missing values on the variables in the model (age, reg, hgt and wgt), so many more cases are retained:

boys %$%
  lm(age ~ reg + hgt * wgt) %>%
  nobs()
## [1] 724

Next week:

  • Logistic regression