Fundamental Techniques in Data Science with R
- model fit
- model complexity
- cross-validation
library(MASS)
library(dplyr)
library(magrittr)
library(ggplot2)
library(mice)
library(DAAG)
library(car)
set.seed(123)
boys.fit <- na.omit(boys) %$% # Extremely wasteful
  lm(age ~ reg)
boys.fit
##
## Call:
## lm(formula = age ~ reg)
##
## Coefficients:
## (Intercept)      regeast      regwest     regsouth      regcity
##     14.7109      -0.8168      -0.7044      -0.6913      -0.6663
boys %>%
  na.omit() %$%
  aggregate(age, list(reg), mean)
##   Group.1        x
## 1   north 14.71094
## 2    east 13.89410
## 3    west 14.00657
## 4   south 14.01965
## 5    city 14.04460
means <- boys %>%
  na.omit() %>%
  group_by(reg) %>%
  summarise(age = mean(age))
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(na.omit(boys), aes(x = reg, y = age)) +
  geom_point(color = "grey") +
  geom_point(data = means, stat = "identity", size = 3)
boys.fit %>% summary()
##
## Call:
## lm(formula = age ~ reg)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.8519 -2.5301  0.0254  2.2274  6.2614
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255
## regwest      -0.7044     0.6970  -1.011    0.313
## regsouth     -0.6913     0.6970  -0.992    0.322
## regcity      -0.6663     0.9038  -0.737    0.462
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745,  Adjusted R-squared:  -0.01148
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298
boys.fit %>% anova()
## Analysis of Variance Table
##
## Response: age
##            Df  Sum Sq Mean Sq F value Pr(>F)
## reg         4    14.7  3.6747  0.3701 0.8298
## Residuals 218  2164.6  9.9293
This is not a very informative model. The anova is not significant: the variance explained by the model is negligible compared to the residual variance. The outcome age does not change significantly when reg is varied.
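The poor fit can be verified from the anova table above: dividing the model sum of squares by the total sum of squares yields the \(R^2\) that summary() reports.

# R^2 from the anova table: model SS / (model SS + residual SS)
14.7 / (14.7 + 2164.6)  # ~ 0.0067, the Multiple R-squared reported above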
boys.fit %>% model.matrix() %>% head(n = 10)
##    (Intercept) regeast regwest regsouth regcity
## 1            1       0       0        0       0
## 2            1       0       0        1       0
## 3            1       0       0        1       0
## 4            1       1       0        0       0
## 5            1       0       1        0       0
## 6            1       1       0        0       0
## 7            1       0       0        0       0
## 8            1       0       1        0       0
## 9            1       0       0        1       0
## 10           1       0       1        0       0
R expands the categorical variable for us: the 5 categories of reg are coded into 4 dummies (and an intercept).

coef <- boys.fit %>% aov() %>% summary.lm()
coef
##
## Call:
## aov(formula = .)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.8519 -2.5301  0.0254  2.2274  6.2614
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255
## regwest      -0.7044     0.6970  -1.011    0.313
## regsouth     -0.6913     0.6970  -0.992    0.322
## regcity      -0.6663     0.9038  -0.737    0.462
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745,  Adjusted R-squared:  -0.01148
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298
Without adjustment of the p-values:
na.omit(boys) %$% pairwise.t.test(age, reg, p.adj = "none")
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  age and reg
##
##       north east west south
## east  0.25  -    -    -
## west  0.31  0.85 -    -
## south 0.32  0.83 0.98 -
## city  0.46  0.86 0.96 0.98
##
## P value adjustment method: none
With p-values adjusted via the Bonferroni correction:
na.omit(boys) %$% pairwise.t.test(age, reg, p.adj = "bonf")
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  age and reg
##
##       north east west south
## east  1     -    -    -
## west  1     1    -    -
## south 1     1    1    -
## city  1     1    1    1
##
## P value adjustment method: bonferroni
Manually calculated
p.val <- coef$coefficients
p.adjust(p.val[, "Pr(>|t|)"], method = "bonferroni")
##  (Intercept)      regeast      regwest     regsouth      regcity
## 5.077098e-68 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
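The Bonferroni adjustment is easy to verify by hand: every p-value is simply multiplied by the number of tests and capped at 1.

# Bonferroni by hand: multiply each p-value by the number of tests, cap at 1
pmin(p.val[, "Pr(>|t|)"] * nrow(p.val), 1)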
If you have trouble reading scientific notation, 5.077098e-68 means the following:
\[5.077098\text{e-68} = 5.077098 \times 10^{-68} = 5.077098 \times \left(\frac{1}{10}\right)^{68}\]
This indicates that the decimal point should be moved 68 places to the left:
\[5.077098\text{e-68} = 0.\underbrace{000\dots000}_{67\text{ zeros}}5077098\]
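We can also let R print the full decimal expansion for us:

format(5.077098e-68, scientific = FALSE)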
Akaike’s An Information Criterion
boys.fit %>% AIC()
## [1] 1151.684
AIC comes from information theory and can be used for model selection. The AIC estimates the relative amount of information a statistical model loses when it is used to represent the process that generated the data. It balances goodness of fit against model complexity: lower AIC values indicate better models.
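AIC is defined as
\[\text{AIC} = -2\ln\hat{L} + 2k,\]
where \(\hat{L}\) is the maximized likelihood and \(k\) the number of estimated parameters. A minimal check by hand, using the parameter count that logLik() stores for lm objects:

ll <- logLik(boys.fit)
k  <- attr(ll, "df")        # 5 regression coefficients + 1 residual variance
-2 * as.numeric(ll) + 2 * k # matches AIC(boys.fit)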
Let’s add predictor hgt to the model:
boys.fit2 <- na.omit(boys) %$%
  lm(age ~ reg + hgt)
boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545
Let’s add wgt to the model:
boys.fit3 <- na.omit(boys) %$% lm(age ~ reg + hgt + wgt)
Let’s add wgt and the interaction between wgt and hgt to the model:
boys.fit4 <- na.omit(boys) %$% lm(age ~ reg + hgt * wgt)
is equivalent to
boys.fit4 <- na.omit(boys) %$% lm(age ~ reg + hgt + wgt + hgt:wgt)
boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545
boys.fit3 %>% AIC()
## [1] 822.4164
boys.fit4 %>% AIC()
## [1] 823.0386
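The four AIC() calls above can be condensed into a single line, using the models fitted above:

sapply(list(boys.fit, boys.fit2, boys.fit3, boys.fit4), AIC)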
anova(boys.fit, boys.fit2, boys.fit3, boys.fit4)
## Analysis of Variance Table
##
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt + wgt
## Model 4: age ~ reg + hgt * wgt
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)
## 1    218 2164.59
## 2    217  521.64  1   1642.94 731.8311 < 2.2e-16 ***
## 3    216  485.66  1     35.98  16.0276 8.595e-05 ***
## 4    215  482.67  1      2.99   1.3324    0.2497
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova table for boys.fit3:
boys.fit3 %>% anova()
## Analysis of Variance Table
##
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)
## reg         4   14.70    3.67   1.6343    0.1667
## hgt         1 1642.94 1642.94 730.7065 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0029 8.688e-05 ***
## Residuals 216  485.66    2.25
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova table for boys.fit4:
boys.fit4 %>% anova()
## Analysis of Variance Table
##
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)
## reg         4   14.70    3.67   1.6368    0.1661
## hgt         1 1642.94 1642.94 731.8311 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0276 8.595e-05 ***
## hgt:wgt     1    2.99    2.99   1.3324    0.2497
## Residuals 215  482.67    2.24
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It seems that reg and the interaction hgt:wgt are redundant. Let’s fit a model without reg and without the interaction:
boys.fit5 <- na.omit(boys) %$% lm(age ~ hgt + wgt)
Let’s revisit the comparison
anova(boys.fit, boys.fit2, boys.fit3, boys.fit5)
## Analysis of Variance Table
##
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt + wgt
## Model 4: age ~ hgt + wgt
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)
## 1    218 2164.59
## 2    217  521.64  1   1642.94 730.7065 < 2.2e-16 ***
## 3    216  485.66  1     35.98  16.0029 8.688e-05 ***
## 4    220  492.43 -4     -6.77   0.7529    0.5571
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But the boys.fit5 model is preferable: it fits no worse than the previous model (p = 0.5571) while using fewer parameters.
We start with the full model, which uses all other columns in the data as predictors.
The most straightforward way to go about this is by specifying the following model:
full.model <- lm(age ~ ., data = na.omit(boys))
full.model
##
## Call:
## lm(formula = age ~ ., data = na.omit(boys))
##
## Coefficients:
## (Intercept)          hgt          wgt          bmi           hc        gen.L
##    2.556051     0.059987    -0.009846     0.142162    -0.024086     1.287455
##       gen.Q        gen.C        gen^4        phb.L        phb.Q        phb.C
##   -0.006861    -0.187256     0.034186     1.552398     0.499620     0.656277
##       phb^4        phb^5           tv      regeast      regwest     regsouth
##   -0.094722    -0.113686     0.074321    -0.222249    -0.233307    -0.258771
##     regcity
##    0.188423
We can then specify the stepwise model. In this case we choose direction = "both".
step.model <- step(full.model, direction = "both", trace = FALSE)
step.model
##
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
##
## Coefficients:
## (Intercept)          hgt          bmi        phb.L        phb.Q        phb.C
##     2.07085      0.05482      0.10742      2.70328      0.28877      0.48492
##       phb^4        phb^5           tv
##    -0.06225     -0.15167      0.07957
Other options are:

- forward: fit all univariate models, add the best predictor and continue.
- backward: fit the full model, eliminate the worst predictor and continue.

step.model %>% summary()
##
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.5077 -0.7144 -0.0814  0.6850  3.1724
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.070854   1.576365   1.314  0.19036
## hgt          0.054822   0.009986   5.490 1.13e-07 ***
## bmi          0.107415   0.030135   3.564  0.00045 ***
## phb.L        2.703275   0.437108   6.184 3.12e-09 ***
## phb.Q        0.288768   0.211836   1.363  0.17426
## phb.C        0.484921   0.202856   2.390  0.01769 *
## phb^4       -0.062246   0.208472  -0.299  0.76555
## phb^5       -0.151667   0.240599  -0.630  0.52912
## tv           0.079573   0.019600   4.060 6.89e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.172 on 214 degrees of freedom
## Multiple R-squared:  0.8651, Adjusted R-squared:  0.86
## F-statistic: 171.5 on 8 and 214 DF,  p-value: < 2.2e-16
full.model <- lm(age ~ ., data = na.omit(boys))
step.model <- MASS::stepAIC(full.model, direction = "both", trace = FALSE)
step.model
##
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
##
## Coefficients:
## (Intercept)          hgt          bmi        phb.L        phb.Q        phb.C
##     2.07085      0.05482      0.10742      2.70328      0.28877      0.48492
##       phb^4        phb^5           tv
##    -0.06225     -0.15167      0.07957
dfbeta() calculates the change in each estimated coefficient that results from deleting a single observation from the model fit.
step.model %>% dfbeta() %>% head(n = 7)
##      (Intercept)           hgt           bmi        phb.L         phb.Q
## 3279 -0.142127090  0.0010828575 -0.0021654085 -0.021523825 -0.0006087783
## 3283 -0.005201914  0.0001127452 -0.0014100750  0.009068652 -0.0147293625
## 3296 -0.081439010  0.0004244061  0.0002603379 -0.009247562 -0.0071293767
## 3321 -0.084630826  0.0004186923  0.0008222594 -0.005498486 -0.0059276580
## 3323  0.154386486 -0.0004476521 -0.0052964367  0.042872835 -0.0213761347
## 3327 -0.046095468  0.0001081361  0.0011150546 -0.004047875 -0.0085461280
## 3357 -0.045933397  0.0001345605  0.0009770371 -0.005087053 -0.0062604484
##             phb.C         phb^4        phb^5            tv
## 3279  0.007427067 -0.004226982 -0.001213833  0.0001061106
## 3283  0.011761595 -0.005498411  0.001164307  0.0007051901
## 3296  0.008538302 -0.003688611  0.000842872  0.0002494383
## 3321  0.006839217 -0.003341741  0.000866896 -0.0002569914
## 3323  0.011758694 -0.006659755  0.000493897  0.0012317886
## 3327  0.007807966 -0.002901270  0.001499505  0.0003376220
## 3357  0.006152384 -0.002274777  0.001148552  0.0002329346
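The standardized version, dfbetas(), expresses each change in units of the coefficient’s standard error. A common rule of thumb flags cases with \(|\text{DFBETAS}| > 2/\sqrt{n}\) as potentially influential; a minimal sketch:

# flag observations whose removal shifts any coefficient by more than 2/sqrt(n) SEs
n <- nobs(step.model)
which(abs(dfbetas(step.model)) > 2 / sqrt(n), arr.ind = TRUE)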
Let’s use the simpler anscombe data as an example:
fit <- anscombe %$% lm(y1 ~ x1)
y_hat <- fit %>% fitted.values()
The deviations between fitted and observed values (the residuals, with flipped sign) are then calculated as
y_hat - anscombe$y1
##           1           2           3           4           5           6
## -0.03900000  0.05081818  1.92127273 -1.30909091  0.17109091  0.04136364
##           7           8           9          10          11
## -1.23936364  0.74045455 -1.83881818  1.68072727 -0.17945455
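Note the sign convention: residuals() returns observed minus fitted values, the sign-flipped version of the quantity computed above.

# residuals() gives observed minus fitted, i.e. -(y_hat - y1)
all.equal(residuals(fit), anscombe$y1 - y_hat)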
If we introduce new values for the predictor x1, we can generate predicted values from the model:
new.x1 <- data.frame(x1 = 1:20)
fit %>% predict(newdata = new.x1)
##         1         2         3         4         5         6         7         8
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727  7.000818
##         9        10        11        12        13        14        15        16
##  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 10.501455 11.001545
##        17        18        19        20
## 11.501636 12.001727 12.501818 13.001909
pred <- fit %>% predict(newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept)   new.x1$x1
##   3.0000909   0.5000909
fit$coefficients
## (Intercept)          x1
##   3.0000909   0.5000909
fit %>% predict(interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017
A prediction interval reflects the uncertainty around a single (future) observation, while a confidence interval reflects the uncertainty around the mean of the predicted values. Prediction intervals are therefore wider than confidence intervals.
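For comparison, interval = "confidence" yields the interval around the mean prediction; its bounds are narrower than the prediction bounds above because they exclude the residual variability of individual observations.

fit %>% predict(interval = "confidence")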
Cross-validation works as follows (a minimal sketch in R follows this list):

1. Divide the sample into \(k\) mutually exclusive subsets.
2. For every \(j\in\{1,\dots,k\}\): fit the model to all subsets except subset \(j\) and calculate the prediction error (e.g. the MSE) on the held-out subset \(j\).
3. Compare the MSE in cross-validation with the MSE in the sample.
4. A small difference suggests good predictive accuracy.
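A minimal sketch of this procedure by hand, assuming \(k = 3\) folds on the anscombe data:

set.seed(123)
k <- 3
folds <- sample(rep(1:k, length.out = nrow(anscombe))) # random fold assignment
cv.mse <- sapply(1:k, function(j) {
  train <- anscombe[folds != j, ]                      # fit on all other folds
  test  <- anscombe[folds == j, ]
  pred  <- predict(lm(y1 ~ x1, data = train), newdata = test)
  mean((test$y1 - pred)^2)                             # MSE on held-out fold j
})
mean(cv.mse)           # cross-validated MSE
mean(residuals(fit)^2) # in-sample MSE, for comparison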
fit %>% summary()
##
## Call:
## lm(formula = y1 ~ x1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.92127 -0.45577 -0.04136  0.70941  1.83882
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0001     1.1247   2.667  0.02573 *
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
anscombe data

DAAG::CVlm(anscombe, fit, plotit = FALSE, printit = TRUE)
## Analysis of Variance Table
##
## Response: y1
##           Df Sum Sq Mean Sq F value Pr(>F)
## x1         1   27.5   27.51      18 0.0022 **
## Residuals  9   13.8    1.53
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## fold 1
## Observations in test set: 3
##                  5    7    11
## x1          11.000 6.00 5.000
## cvpred       8.443 5.59 5.014
## y1           8.330 7.24 5.680
## CV residual -0.113 1.65 0.666
##
## Sum of squares = 3.19    Mean square = 1.06    n = 3
##
## fold 2
## Observations in test set: 4
##                  2      6     8    10
## x1           8.000 14.000  4.00  7.00
## cvpred       7.572  9.681  6.17  7.22
## y1           6.950  9.960  4.26  4.82
## CV residual -0.622  0.279 -1.91 -2.40
##
## Sum of squares = 9.86    Mean square = 2.47    n = 4
##
## fold 3
## Observations in test set: 4
##                 1     3    4     9
## x1          10.00 13.00 9.00 12.00
## cvpred       7.84  9.37 7.33  8.86
## y1           8.04  7.58 8.81 10.84
## CV residual  0.20 -1.79 1.48  1.98
##
## Sum of squares = 9.35    Mean square = 2.34    n = 4
##
## Overall (Sum over all 4 folds)
##   ms
## 2.04
anscombe data

DAAG::CVlm(anscombe, fit, plotit = "Observed", printit = FALSE)
DAAG::CVlm(anscombe, fit, plotit = "Residual", printit = FALSE)
boys data

DAAG::CVlm(na.omit(boys), step.model, plotit = "Observed", printit = FALSE)
DAAG::CVlm(na.omit(boys), step.model, plotit = "Residual", printit = FALSE)
With nobs() we can request the number of cases that were actually used in the model. With na.omit():

na.omit(boys) %$% lm(age ~ reg + hgt * wgt) %>% nobs()
## [1] 223
If we had not used na.omit():
boys %$% lm(age ~ reg + hgt * wgt) %>% nobs()
## [1] 724
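This is because lm() applies listwise deletion only to the variables that appear in the model formula, which we can verify:

# only rows with missing values on age, reg, hgt or wgt are dropped by lm()
sum(complete.cases(boys[, c("age", "reg", "hgt", "wgt")])) # 724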