Statistical Programming with R
library(MASS)
library(dplyr)
library(magrittr)
library(ggplot2)
library(mice)
library(DAAG)
library(car)
Linear regression model:
\[ y_i = \alpha + \beta x_i + \varepsilon_i \]
Assumptions (see the simulation sketch below):
- the relation between \(y\) and \(x\) is linear
- the errors \(\varepsilon_i\) are independent
- the errors \(\varepsilon_i\) have constant variance \(\sigma^2\)
- the errors \(\varepsilon_i\) are normally distributed with mean zero
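A minimal simulation sketch under these assumptions; the values \(n = 100\), \(\alpha = 3\), \(\beta = 0.5\) and the error SD are arbitrary choices for illustration:
set.seed(42)                       # reproducibility
n <- 100
x <- rnorm(n)
y <- 3 + 0.5 * x + rnorm(n, 0, 1)  # alpha = 3, beta = 0.5, sigma = 1
coef(lm(y ~ x))                    # estimates should lie close to 3 and 0.5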
The anscombe data:
head(anscombe)
##   x1 x2 x3 x4    y1   y2    y3   y4
## 1 10 10 10  8  8.04 9.14  7.46 6.58
## 2  8  8  8  8  6.95 8.14  6.77 5.76
## 3 13 13 13  8  7.58 8.74 12.74 7.71
## 4  9  9  9  8  8.81 8.77  7.11 8.84
## 5 11 11 11  8  8.33 9.26  7.81 8.47
## 6 14 14 14  8  9.96 8.10  8.84 7.04
Or, alternatively,
anscombe %>% head
##   x1 x2 x3 x4    y1   y2    y3   y4
## 1 10 10 10  8  8.04 9.14  7.46 6.58
## 2  8  8  8  8  6.95 8.14  6.77 5.76
## 3 13 13 13  8  7.58 8.74 12.74 7.71
## 4  9  9  9  8  8.81 8.77  7.11 8.84
## 5 11 11 11  8  8.33 9.26  7.81 8.47
## 6 14 14 14  8  9.96 8.10  8.84 7.04
anscombe %>%
  ggplot(aes(x1, y1)) +
  geom_point() +
  geom_smooth(method = "lm")
The linear model would take the following form:
fit <- yourdata %$%
  lm(youroutcome ~ yourpredictors)
summary(fit)
Output for the anscombe data example:
fit <- anscombe %$%
  lm(y1 ~ x1)
summary(fit)
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
We can inspect the normality of the residuals with qqnorm(), which plots the observed residuals against the quantiles of a normal distribution. It is instructive to compare this with simulated errors, cf. rnorm(n, 0, s).
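A minimal sketch, taking \(s\) to be the residual standard error of the fit (an assumption on my part):
par(mfrow = c(1, 2))
res <- residuals(fit)                     # observed residuals
qqnorm(res); qqline(res)                  # observed vs. theoretical quantiles
sim <- rnorm(length(res), 0, sigma(fit))  # simulated errors with the same spread
qqnorm(sim); qqline(sim)                  # what truly normal residuals can look like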
anscombe %>%
  ggplot(aes(x1, y1)) +
  geom_point() +
  geom_smooth(method = "loess", col = "blue") +
  geom_smooth(method = "lm", col = "orange")
The loess curve suggests slight non-linearity. We can investigate this by adding a quadratic term:
anscombe %$%
  lm(y1 ~ x1 + I(x1^2)) %>%
  summary()
## 
## Call:
## lm(formula = y1 ~ x1 + I(x1^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8704 -0.3481 -0.2456  0.7129  1.8072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.75507    3.28815   0.230    0.824
## x1           1.06925    0.78982   1.354    0.213
## I(x1^2)     -0.03162    0.04336  -0.729    0.487
## 
## Residual standard error: 1.27 on 8 degrees of freedom
## Multiple R-squared:  0.6873, Adjusted R-squared:  0.6092 
## F-statistic: 8.793 on 2 and 8 DF,  p-value: 0.009558
par(mfrow = c(1, 2))
fit %>% plot(which = c(1, 3), cex = .6)
par(mfrow = c(1, 2))
boys %$% lm(bmi ~ age) %>% plot(which = c(1, 3), cex = .6)
fit %>% plot(which = 2, cex = .6)
The QQ-plot shows some divergence from normality at the tails.
Leverage: think of the fitted line as a lever; cases far from the mean of the predictor(s) can exert more pull on it.
Standardized residuals: residuals divided by an estimate of their standard deviation, so that they are on a common scale.
Cook’s distance: how far the predicted values would move if the model were fit without the data point in question.
Outliers are cases with large \(e_z\) (standardized residuals).
If the model is correct we expect the standardized residuals to be approximately standard normally distributed: about 5% of cases with \(|e_z|>2\) and about 1% with \(|e_z|>2.58\).
Influential cases are cases with a large influence on the parameter estimates, e.g. a large Cook’s distance (these quantities can be extracted directly; see the sketch below).
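A sketch extracting these diagnostics from the anscombe fit with the standard stats extractors:
h  <- hatvalues(fit)       # leverage
ez <- rstandard(fit)       # standardized residuals
cd <- cooks.distance(fit)  # Cook's distance
round(cbind(h, ez, cd), 3)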
par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))
There are no cases with \(|e_z|>2\), so no outliers (right plot). There are no cases with Cook’s distance \(>1\), but case 3 stands out.
boys.fit <- na.omit(boys) %$% # extremely wasteful: listwise deletion on all variables
  lm(age ~ reg)
boys %>% head
##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
boys.fit %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df Sum Sq Mean Sq F value Pr(>F)
## reg         4   14.7  3.6747  0.3701 0.8298
## Residuals 218 2164.6  9.9293
boys.fit %>% summary()
## 
## Call:
## lm(formula = age ~ reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745, Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298
boys.fit %>% model.matrix() %>% head()
##   (Intercept) regeast regwest regsouth regcity
## 1           1       0       0        0       0
## 2           1       0       0        1       0
## 3           1       0       0        1       0
## 4           1       1       0        0       0
## 5           1       0       1        0       0
## 6           1       1       0        0       0
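The factor reg is dummy-coded: the first factor level (here north, the level absent from the columns above) is absorbed into the intercept as the reference category. A quick check:
na.omit(boys) %$% levels(reg)  # the first level listed is the reference category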
The same model can also be fitted with aov():
boys.fit %>% aov()
## Call:
##    aov(formula = .)
## 
## Terms:
##                      reg Residuals
## Sum of Squares   14.6987 2164.5869
## Deg. of Freedom        4       218
## 
## Residual standard error: 3.151079
## Estimated effects may be unbalanced
aov() is designed for balanced designs; note the message "Estimated effects may be unbalanced" in the output above.
anova() with Type I (sequential) Sum of Squares:
boys.fit %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df Sum Sq Mean Sq F value Pr(>F)
## reg         4   14.7  3.6747  0.3701 0.8298
## Residuals 218 2164.6  9.9293
Type I Sums of Squares are sequential: each term is tested given only the terms listed before it. Anova() (from the car package) with Type II Sum of Squares assumes no interaction. So test for this first!
Anova() with Type II Sum of Squares:
boys.fit %>% car::Anova(type = 2)
## Anova Table (Type II tests)
## 
## Response: age
##           Sum Sq  Df F value Pr(>F)
## reg         14.7   4  0.3701 0.8298
## Residuals 2164.6 218
Anova() with Type III Sum of Squares tests whether there is a main effect, given the other main effects and the interaction.
When the data are balanced, Types I, II and III are identical, because the factors are orthogonal.
Anova() with Type III Sum of Squares:
boys.fit %>% car::Anova(type = 3)
## Anova Table (Type III tests)
## 
## Response: age
##             Sum Sq  Df  F value Pr(>F)    
## (Intercept) 6708.8   1 675.6530 <2e-16 ***
## reg           14.7   4   0.3701 0.8298    
## Residuals   2164.6 218                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef <- boys.fit %>% aov() %>% summary.lm()
coef
## 
## Call:
##    aov(formula = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745, Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298
p.val <- coef$coefficients
p.adjust(p.val[, "Pr(>|t|)"], method = "bonferroni")
##  (Intercept)      regeast      regwest     regsouth      regcity 
## 5.077098e-68 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
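Bonferroni correction simply multiplies each p-value by the number of tests, capped at 1. A quick check with hypothetical p-values:
p <- c(0.01, 0.04, 0.30)            # hypothetical p-values
pmin(p * length(p), 1)              # manual Bonferroni adjustment
p.adjust(p, method = "bonferroni")  # gives the same result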
Akaike’s "An Information Criterion" (AIC):
boys.fit %>% AIC()
## [1] 1151.684
and the Bayesian Information Criterion (BIC):
boys.fit %>% BIC()
## [1] 1172.127
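Both criteria penalize the maximized log-likelihood for model complexity: \(\text{AIC} = -2\log\hat{L} + 2k\) and \(\text{BIC} = -2\log\hat{L} + k\log n\), with \(k\) the number of estimated parameters. A sketch reproducing the values above by hand:
ll <- as.numeric(logLik(boys.fit))  # maximized log-likelihood
k  <- attr(logLik(boys.fit), "df")  # number of estimated parameters
-2 * ll + 2 * k                     # AIC: 1151.684
-2 * ll + k * log(nobs(boys.fit))   # BIC: 1172.127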
boys.fit2 <- na.omit(boys) %$%
  lm(age ~ reg + hgt)
boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545
boys.fit3 <- na.omit(boys) %$% lm(age ~ reg + hgt * wgt)
is equivalent to
boys.fit3 <- na.omit(boys) %$% lm(age ~ reg + hgt + wgt + hgt:wgt)
boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545
boys.fit3 %>% AIC()
## [1] 823.0386
anova(boys.fit, boys.fit2, boys.fit3)
## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt * wgt
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    218 2164.59                                  
## 2    217  521.64  1   1642.94 731.83 < 2.2e-16 ***
## 3    215  482.67  2     38.97   8.68  0.000237 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boys.fit3 %>% car::Anova(type = 3)
## Anova Table (Type III tests)
## 
## Response: age
##             Sum Sq  Df F value   Pr(>F)    
## (Intercept)  27.04   1 12.0434 0.000628 ***
## reg           5.90   4  0.6569 0.622651    
## hgt         106.79   1 47.5667 5.84e-11 ***
## wgt           7.33   1  3.2654 0.072152 .  
## hgt:wgt       2.99   1  1.3324 0.249655    
## Residuals   482.67 215                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boys.fit3 %>% car::Anova(type = 2)
## Anova Table (Type II tests)
## 
## Response: age
##           Sum Sq  Df F value    Pr(>F)    
## reg         5.90   4  0.6569    0.6227    
## hgt       170.35   1 75.8810 8.114e-16 ***
## wgt        35.98   1 16.0276 8.595e-05 ***
## hgt:wgt     2.99   1  1.3324    0.2497    
## Residuals 482.67 215                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boys.fit3 %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6368    0.1661    
## hgt         1 1642.94 1642.94 731.8311 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0276 8.595e-05 ***
## hgt:wgt     1    2.99    2.99   1.3324    0.2497    
## Residuals 215  482.67    2.24                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dfbeta() calculates the change in the coefficients when each case in turn is deleted; dfbetas() depicts this change as a deviation in SEs.
boys.fit3 %>% dfbeta() %>% head(n = 7)
##   (Intercept)      regeast       regwest      regsouth       regcity
## 1  0.30446529 -0.032645200 -0.0326165198 -0.0304499108 -0.0318295197
## 2 -0.12357958  0.003358160  0.0031894645 -0.0355420008  0.0050565041
## 3 -0.17522651  0.001479115  0.0015118575 -0.0134428270  0.0001493751
## 4 -0.08857830 -0.006195476  0.0007649256  0.0002461411 -0.0001545015
## 5  0.31715776  0.003385884 -0.0572382687  0.0032145390  0.0142911153
## 6 -0.17554676 -0.018945547  0.0024334472  0.0015126731  0.0006986407
## 7 -0.08200772  0.013753533  0.0137667118  0.0134385305  0.0131184131
##             hgt          wgt       hgt:wgt
## 1 -0.0018391896 -0.001790011  1.389532e-05
## 2  0.0009885098 -0.002377479  9.303061e-06
## 3  0.0010352437  0.002803606 -1.608245e-05
## 4  0.0004883509  0.001847202 -1.001945e-05
## 5 -0.0009320126 -0.019683940  9.719343e-05
## 6  0.0009077265  0.003892513 -2.022645e-05
## 7  0.0003391486  0.001610804 -8.082354e-06
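The scaled version is a one-line variation; dfbetas() divides each change by the coefficient's standard error, making cases comparable across coefficients:
boys.fit3 %>% dfbetas() %>% head(n = 3)  # the same changes, expressed in SE units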
Let’s use the simpler anscombe data example:
y_hat <- fit %>% fitted.values()
The residuals \(e_i = y_i - \hat{y}_i\) are then calculated as
anscombe$y1 - y_hat
##           1           2           3           4           5           6 
##  0.03900000 -0.05081818 -1.92127273  1.30909091 -0.17109091 -0.04136364 
##           7           8           9          10          11 
##  1.23936364 -0.74045455  1.83881818 -1.68072727  0.17945455
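As a sanity check, residuals() extracts the same values directly from the fitted model:
all.equal(anscombe$y1 - y_hat, residuals(fit))  # should return TRUE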
If we introduce new values for the predictor x1, we can generate predicted values from the model:
new.x1 <- data.frame(x1 = 1:20)
predict.lm(fit, newdata = new.x1)
##         1         2         3         4         5         6         7 
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727 
##         8         9        10        11        12        13        14 
##  7.000818  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 
##        15        16        17        18        19        20 
## 10.501455 11.001545 11.501636 12.001727 12.501818 13.001909
pred <- predict.lm(fit, newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept)   new.x1$x1 
##   3.0000909   0.5000909
fit$coefficients
## (Intercept)          x1 
##   3.0000909   0.5000909
predict(fit, interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017
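Prediction intervals cover new observations and therefore include the residual error; for the uncertainty of the regression line itself, interval = "confidence" gives narrower bounds:
predict(fit, interval = "confidence")  # narrower: only the uncertainty of the mean response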
K-fold cross-validation splits the data into \(k\) folds. Do for all \(j\in\{1,\dots,k\}\) training sets: fit the model on all folds except fold \(j\), then predict the held-out fold \(j\). A small difference between training and test error suggests good predictive accuracy.
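A minimal hand-rolled version for the anscombe fit (the fold count and seed are arbitrary choices):
set.seed(123)
k <- 3
folds <- sample(rep(1:k, length.out = nrow(anscombe)))  # random fold assignment
mse <- sapply(1:k, function(j) {
  train <- anscombe[folds != j, ]
  test  <- anscombe[folds == j, ]
  fit.j <- lm(y1 ~ x1, data = train)                    # fit on the other k-1 folds
  mean((test$y1 - predict(fit.j, newdata = test))^2)    # test error on fold j
})
mean(mse)  # cross-validated mean squared error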
K-fold cross-validation on the anscombe data:
DAAG::CVlm(anscombe, fit, plotit = TRUE, printit = FALSE)
K-fold cross-validation on the boys data:
DAAG::CVlm(na.omit(boys), boys.fit3, plotit = TRUE, printit = FALSE)
na.omit(boys) %$% lm(age ~ reg + hgt * wgt) %>% nobs()
## [1] 223
If we had not used na.omit(), lm() would have dropped only the cases with missing values on the variables in the model:
boys %$% lm(age ~ reg + hgt * wgt) %>% nobs()
## [1] 724
Some of the many modeling functions in R (a short glm() example follows below):
- lm(): linear modeling
- glm(): generalized linear modeling
- gamlss::gamlss(): generalized additive models (for location, scale and shape)
- lme4::lmer(): linear mixed effect models
- lme4::glmer(): generalized linear mixed effect models
- lme4::nlmer(): non-linear mixed effect models
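A sketch of glm() for a binary outcome; the outcome gen_observed (whether the genital Tanner stage was recorded) is a hypothetical construction for illustration only:
boys %>%
  mutate(gen_observed = !is.na(gen)) %$%  # hypothetical binary outcome
  glm(gen_observed ~ age, family = binomial) %>%
  summary()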