Data Science and Predictive Machine Learning

This lecture

  • model fit

  • model complexity

  • peek into cross-validation

We use the following packages

library(MASS)
library(dplyr)
library(magrittr)
library(ggplot2)
library(mice)
library(DAAG)
library(car)

set.seed(123)

Model fit

A simple model

boys.fit <- 
  na.omit(boys) %$% # Extremely wasteful: listwise deletion discards every row with any missing value
  lm(age ~ reg)
boys.fit
## 
## Call:
## lm(formula = age ~ reg)
## 
## Coefficients:
## (Intercept)      regeast      regwest     regsouth      regcity  
##     14.7109      -0.8168      -0.7044      -0.6913      -0.6663
boys %>% na.omit() %$% aggregate(age, list(reg), mean)
##   Group.1        x
## 1   north 14.71094
## 2    east 13.89410
## 3    west 14.00657
## 4   south 14.01965
## 5    city 14.04460

Plotting the model

means <- boys %>% na.omit() %>% group_by(reg) %>% summarise(age = mean(age))
ggplot(na.omit(boys), aes(x = reg, y = age)) + 
  geom_point(color = "grey") + 
  geom_point(data = means, stat = "identity", size = 3)

Model parameters

boys.fit %>%
  summary()
## 
## Call:
## lm(formula = age ~ reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745,   Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298

Is it a good model?

boys.fit %>%
  anova()
## Analysis of Variance Table
## 
## Response: age
##            Df Sum Sq Mean Sq F value Pr(>F)
## reg         4   14.7  3.6747  0.3701 0.8298
## Residuals 218 2164.6  9.9293

It is not a very informative model. The ANOVA is not significant: the variance explained by reg is small relative to the residual variance.

The outcome age does not differ significantly across the levels of reg.

Model factors

boys.fit %>%
  model.matrix() %>%
  head(n = 10)
##    (Intercept) regeast regwest regsouth regcity
## 1            1       0       0        0       0
## 2            1       0       0        1       0
## 3            1       0       0        1       0
## 4            1       1       0        0       0
## 5            1       0       1        0       0
## 6            1       1       0        0       0
## 7            1       0       0        0       0
## 8            1       0       1        0       0
## 9            1       0       0        1       0
## 10           1       0       1        0       0

R expands the categorical variable for us

  • it dummy-codes the 5 categories into 4 dummies (and an intercept).
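
A quick way to see this coding is to inspect the contrasts attached to the factor; a minimal sketch using the same na.omit(boys) data:

na.omit(boys)$reg %>% contrasts() # treatment coding: north is the reference category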

Post hoc comparisons

coef <- boys.fit %>% aov() %>% summary.lm()
coef
## 
## Call:
## aov(formula = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745,   Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298

Post hoc comparisons

Without adjustment of the p-values

na.omit(boys) %$% pairwise.t.test(age, reg, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  age and reg 
## 
##       north east west south
## east  0.25  -    -    -    
## west  0.31  0.85 -    -    
## south 0.32  0.83 0.98 -    
## city  0.46  0.86 0.96 0.98 
## 
## P value adjustment method: none

Post hoc comparisons

With p-values adjusted by the Bonferroni correction

na.omit(boys) %$% pairwise.t.test(age, reg, p.adj = "bonf")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  age and reg 
## 
##       north east west south
## east  1     -    -    -    
## west  1     1    -    -    
## south 1     1    1    -    
## city  1     1    1    1    
## 
## P value adjustment method: bonferroni

Post hoc comparisons

Manually calculated

p.val <- coef$coefficients
p.adjust(p.val[, "Pr(>|t|)"], method = "bonferroni")
##  (Intercept)      regeast      regwest     regsouth      regcity 
## 5.077098e-68 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00

If you have trouble reading scientific notation, 5.077098e-68 means the following

\[5.077098\text{e-68} = 5.077098 \times 10^{-68} = 5.077098 \times \left(\frac{1}{10}\right)^{68}\]

This indicates that the decimal point should be moved 68 places to the left:

\[5.077098\text{e-68} = 0.\underbrace{00\cdots0}_{67\text{ zeros}}5077098\]
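
If you prefer, R can print such a number in full; a small sketch:

format(5.077098e-68, scientific = FALSE) # print without scientific notation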

AIC

Akaike’s An Information Criterion

boys.fit %>% 
  AIC()
## [1] 1151.684

What is AIC

AIC comes from information theory and can be used for model selection. The AIC estimates the relative amount of information that is lost when the model is used to approximate the process that generated the data. In other words: AIC weighs the fit of the model against its complexity.

  • The better the fit, the smaller the loss of information
  • AIC works on the log-likelihood scale:
    • \(\text{log}(0) = -\infty\), \(\text{log}(1) = 0\), etc.
  • the lower the AIC, the better the model
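
For lm() fits, R computes the AIC from the log-likelihood as \(\text{AIC} = -2\log L + 2k\), where \(k\) counts all estimated parameters (including the residual variance). A minimal check against the value reported above:

ll <- logLik(boys.fit)
-2 * as.numeric(ll) + 2 * attr(ll, "df") # reproduces AIC(boys.fit)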

Model comparison

A new model

Let’s add predictor hgt to the model:

boys.fit2 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt)

boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545

Another model

Let’s add wgt to the model

boys.fit3 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt)

And another model

Let’s add wgt and the interaction between wgt and hgt to the model

boys.fit4 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt)

is equivalent to

boys.fit4 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt + hgt:wgt)
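
A quick sketch to verify the equivalence (the names fit.a and fit.b are just for illustration):

fit.a <- na.omit(boys) %$% lm(age ~ reg + hgt * wgt)
fit.b <- na.omit(boys) %$% lm(age ~ reg + hgt + wgt + hgt:wgt)
all.equal(coef(fit.a), coef(fit.b)) # TRUE: the same model, written two ways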

Model comparison

boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545
boys.fit3 %>% AIC()
## [1] 822.4164
boys.fit4 %>% AIC()
## [1] 823.0386

Another form of model comparison

anova(boys.fit, boys.fit2, boys.fit3, boys.fit4)
## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt + wgt
## Model 4: age ~ reg + hgt * wgt
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    218 2164.59                                    
## 2    217  521.64  1   1642.94 731.8311 < 2.2e-16 ***
## 3    216  485.66  1     35.98  16.0276 8.595e-05 ***
## 4    215  482.67  1      2.99   1.3324    0.2497    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Inspect boys.fit3

boys.fit3 %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6343    0.1667    
## hgt         1 1642.94 1642.94 730.7065 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0029 8.688e-05 ***
## Residuals 216  485.66    2.25                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Inspect boys.fit4

boys.fit4 %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6368    0.1661    
## hgt         1 1642.94 1642.94 731.8311 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0276 8.595e-05 ***
## hgt:wgt     1    2.99    2.99   1.3324    0.2497    
## Residuals 215  482.67    2.24                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that reg and the interaction hgt:wgt are redundant

Remove reg

boys.fit5 <- 
  na.omit(boys) %$%
  lm(age ~ hgt + wgt)

Let’s revisit the comparison

anova(boys.fit, boys.fit2, boys.fit3, boys.fit5)
## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt + wgt
## Model 4: age ~ hgt + wgt
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    218 2164.59                                    
## 2    217  521.64  1   1642.94 730.7065 < 2.2e-16 ***
## 3    216  485.66  1     35.98  16.0029 8.688e-05 ***
## 4    220  492.43 -4     -6.77   0.7529    0.5571    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

But boys.fit5 achieves a comparable fit with fewer parameters, which makes it the preferable model.
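
To make this explicit, the two nested models can also be compared directly; a minimal sketch:

anova(boys.fit5, boys.fit3) # does adding reg improve the fit significantly?
AIC(boys.fit3, boys.fit5)   # if reg adds little, the simpler model should have the lower AIC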

Stepwise regression

We start with the full model, which includes all other columns as predictors.

The most straightforward way to go about this is by specifying the following model:

full.model <- lm(age ~ ., data = na.omit(boys))
full.model
## 
## Call:
## lm(formula = age ~ ., data = na.omit(boys))
## 
## Coefficients:
## (Intercept)          hgt          wgt          bmi           hc        gen.L  
##    2.556051     0.059987    -0.009846     0.142162    -0.024086     1.287455  
##       gen.Q        gen.C        gen^4        phb.L        phb.Q        phb.C  
##   -0.006861    -0.187256     0.034186     1.552398     0.499620     0.656277  
##       phb^4        phb^5           tv      regeast      regwest     regsouth  
##   -0.094722    -0.113686     0.074321    -0.222249    -0.233307    -0.258771  
##     regcity  
##    0.188423

Stepwise regression - continued

We can then run the stepwise procedure. In this case we choose direction "both", meaning that predictors may be added or removed at every step.

step.model <- step(full.model, direction = "both", 
                      trace = FALSE)
step.model
## 
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
## 
## Coefficients:
## (Intercept)          hgt          bmi        phb.L        phb.Q        phb.C  
##     2.07085      0.05482      0.10742      2.70328      0.28877      0.48492  
##       phb^4        phb^5           tv  
##    -0.06225     -0.15167      0.07957

Other options are

  • forward: fit all univariate models, add the best predictor and continue (a minimal sketch follows below).
  • backward: fit the full model, eliminate the worst predictor and continue.
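
A forward run needs an explicit null model to start from and a scope to search over; a minimal sketch, reusing full.model from above (the name null.model is just for illustration):

null.model <- lm(age ~ 1, data = na.omit(boys))
step(null.model, scope = formula(full.model), direction = "forward", trace = FALSE)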

Summary

step.model %>% summary
## 
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5077 -0.7144 -0.0814  0.6850  3.1724 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.070854   1.576365   1.314  0.19036    
## hgt          0.054822   0.009986   5.490 1.13e-07 ***
## bmi          0.107415   0.030135   3.564  0.00045 ***
## phb.L        2.703275   0.437108   6.184 3.12e-09 ***
## phb.Q        0.288768   0.211836   1.363  0.17426    
## phb.C        0.484921   0.202856   2.390  0.01769 *  
## phb^4       -0.062246   0.208472  -0.299  0.76555    
## phb^5       -0.151667   0.240599  -0.630  0.52912    
## tv           0.079573   0.019600   4.060 6.89e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.172 on 214 degrees of freedom
## Multiple R-squared:  0.8651, Adjusted R-squared:   0.86 
## F-statistic: 171.5 on 8 and 214 DF,  p-value: < 2.2e-16

Stepwise regression - AIC

full.model <- lm(age ~ ., data = na.omit(boys))
step.model <- MASS::stepAIC(full.model, direction = "both", 
                      trace = FALSE)
step.model
## 
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
## 
## Coefficients:
## (Intercept)          hgt          bmi        phb.L        phb.Q        phb.C  
##     2.07085      0.05482      0.10742      2.70328      0.28877      0.48492  
##       phb^4        phb^5           tv  
##    -0.06225     -0.15167      0.07957

Influence of cases

dfbeta() calculates the change in each coefficient that results from deleting a case. The scaled version, dfbetas(), expresses this change in units of the coefficient's standard error.

step.model %>%
  dfbeta() %>%
  head(n = 7)
##       (Intercept)           hgt           bmi        phb.L         phb.Q
## 3279 -0.142127090  0.0010828575 -0.0021654085 -0.021523825 -0.0006087783
## 3283 -0.005201914  0.0001127452 -0.0014100750  0.009068652 -0.0147293625
## 3296 -0.081439010  0.0004244061  0.0002603379 -0.009247562 -0.0071293767
## 3321 -0.084630826  0.0004186923  0.0008222594 -0.005498486 -0.0059276580
## 3323  0.154386486 -0.0004476521 -0.0052964367  0.042872835 -0.0213761347
## 3327 -0.046095468  0.0001081361  0.0011150546 -0.004047875 -0.0085461280
## 3357 -0.045933397  0.0001345605  0.0009770371 -0.005087053 -0.0062604484
##            phb.C        phb^4        phb^5            tv
## 3279 0.007427067 -0.004226982 -0.001213833  0.0001061106
## 3283 0.011761595 -0.005498411  0.001164307  0.0007051901
## 3296 0.008538302 -0.003688611  0.000842872  0.0002494383
## 3321 0.006839217 -0.003341741  0.000866896 -0.0002569914
## 3323 0.011758694 -0.006659755  0.000493897  0.0012317886
## 3327 0.007807966 -0.002901270  0.001499505  0.0003376220
## 3357 0.006152384 -0.002274777  0.001148552  0.0002329346
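
The scaled counterpart dfbetas() expresses the same changes in units of each coefficient's standard error; a minimal sketch:

step.model %>% dfbetas() %>% head(n = 7)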

Prediction

Fitted values

Let’s use the simpler anscombe data set as an example:

fit <- anscombe %$% lm(y1 ~ x1)

y_hat <- 
  fit %>%
  fitted.values()

The deviation between the fitted and the observed values is then

y_hat - anscombe$y1
##           1           2           3           4           5           6 
## -0.03900000  0.05081818  1.92127273 -1.30909091  0.17109091  0.04136364 
##           7           8           9          10          11 
## -1.23936364  0.74045455 -1.83881818  1.68072727 -0.17945455
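
Note that residuals are conventionally defined as \(y - \hat{y}\), so the built-in extractor returns the values above with the sign flipped:

fit %>% residuals() %>% head() # equal to anscombe$y1 - y_hat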

Predict new values

If we introduce new values for the predictor x1, we can generate predicted values from the model

new.x1 <- data.frame(x1 = 1:20)
fit %>% predict(newdata = new.x1)
##         1         2         3         4         5         6         7         8 
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727  7.000818 
##         9        10        11        12        13        14        15        16 
##  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 10.501455 11.001545 
##        17        18        19        20 
## 11.501636 12.001727 12.501818 13.001909

Predicted values lie exactly on the regression line

pred <- fit %>% predict(newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept)   new.x1$x1 
##   3.0000909   0.5000909
fit$coefficients
## (Intercept)          x1 
##   3.0000909   0.5000909

Prediction intervals

fit %>% predict(interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017

A prediction interval reflects the uncertainty around a single value. The confidence interval reflects the uncertainty around the mean prediction values.
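
For comparison, the confidence intervals around the mean predictions can be requested in the same way; they are narrower than the prediction intervals:

fit %>% predict(interval = "confidence")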

How many cases are used?

na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt) %>%
  nobs()
## [1] 223

If we had not used na.omit(), lm() would only exclude cases with missing values on the variables in the model:

boys %$%
  lm(age ~ reg + hgt * wgt) %>%
  nobs()
## [1] 724

Confidence intervals?

95% confidence interval

If an infinite number of samples were drawn and CI’s computed, then the true population mean \(\mu\) would be in at least 95% of these intervals

\[ 95\%~CI=\bar{x}\pm{t}_{(1-\alpha/2)}\cdot SEM \]

Example

x.bar <- 7.6 # sample mean
SEM   <- 2.1 # standard error of the mean
n     <- 11 # sample size
df    <- n-1 # degrees of freedom
alpha <- .15 # significance level (note: this yields an 85% interval, not a 95% one)
t.crit <- qt(1 - alpha / 2, df) # t(1 - alpha / 2) for df = 10
c(x.bar - t.crit * SEM, x.bar + t.crit * SEM) 
## [1]  4.325605 10.874395

    Neyman, J. (1934). On the Two Different Aspects of the Representative Method:
    The Method of Stratified Sampling and the Method of Purposive Selection.
    Journal of the Royal Statistical Society, 97(4), 558–625.

Misconceptions

Confidence intervals are frequently misunderstood; even well-established researchers sometimes misinterpret them.

  1. A realised 95% CI does not mean:
  • that there is a 95% probability that the population parameter lies within the interval

  • that there is a 95% probability that the interval covers the population parameter

    Once an experiment is done and an interval is calculated, the interval either covers the parameter value or it does not; probability is no longer involved.

    The 95% probability only concerns the estimation procedure.

  2. A 95% confidence interval does not mean that 95% of the sample data lie within the interval.
  3. A confidence interval is not a range of plausible values for the sample mean, though it may be understood as an estimate of plausible values for the population parameter.
  4. A particular 95% confidence interval calculated from an experiment does not mean that there is a 95% probability of a sample mean from a repeat of the experiment falling within this interval.

Confidence intervals

100 simulated samples from a population with \(\mu = 0\) and \(\sigma^2=1\). Out of these 100 samples, only 5 have confidence intervals that do not cover the population mean.
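
A minimal sketch of such a simulation (the per-sample size of 50 and the use of t.test() are assumptions; the exact number of intervals that miss \(\mu = 0\) will vary around 5):

cis <- replicate(100, t.test(rnorm(50, mean = 0, sd = 1))$conf.int)
sum(cis[1, ] > 0 | cis[2, ] < 0) # number of intervals that do not cover the true mean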

For fun