- pipes with package magrittr: %>% and %$%
- deviations: from the mean, standardized, squared
Fundamental Techniques in Data Science with R
library(dplyr)
library(magrittr)
library(ggplot2)
library(mice)
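As a reminder, a minimal sketch of the difference between the two pipes (using the boys data from package mice, which we also use below):

boys %>% nrow() # %>% pipes the data set on as the first argument of nrow()
boys %$% cor(wgt, hgt, use = "pairwise.complete.obs") # %$% exposes the columns, so we can refer to them by name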
The mathematical formulation of the relationship between variables can be written as

\[ \mbox{observed}=\mbox{predicted}+\mbox{error} \]

or (in Greek notation) as

\[y=\mu+\varepsilon\]

where \(y\) is the observed value, \(\mu\) the predicted value and \(\varepsilon\) the error.
Regression model:
\[ \text{age}_i=\alpha+\beta\cdot{\text{weight}}_i+\varepsilon_i \]
where \(\text{age}_i\) is the age of person \(i\), modeled conditional on their weight \(\text{weight}_i\); \(\alpha\) is the intercept, \(\beta\) the slope for weight and \(\varepsilon_i\) the error.
The function lm() is a base function in R that allows you to fit a wide variety of linear models.
args(lm)
## function (formula, data, subset, weights, na.action, method = "qr",
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
##     contrasts = NULL, offset, ...)
## NULL
If we want to know what these arguments do, we can ask R:
?lm
This will open a help page on the lm() function.
The boys data from package mice:
head(boys, n = 20)
##       age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3   0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4   0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18  0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23  0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28  0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36  0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
## 37  0.068 52.5 3.810 13.82 34.9 <NA> <NA> NA south
## 38  0.071 53.0 3.890 13.84 35.8 <NA> <NA> NA  west
## 39  0.071 55.1 3.880 12.77 36.8 <NA> <NA> NA  west
## 43  0.073 54.5 4.200 14.14 38.0 <NA> <NA> NA  east
## 53  0.076 58.5 5.920 17.29 40.5 <NA> <NA> NA  west
## 60  0.079 55.0 4.430 14.64 38.0 <NA> <NA> NA  east
## 62  0.079 58.5 5.745 16.78 38.5 <NA> <NA> NA  city
## 75  0.082 59.5 5.100 14.40 39.1 <NA> <NA> NA  west
## 85  0.084 52.5 4.120 14.94 37.3 <NA> <NA> NA  <NA>
## 93  0.084 54.0 4.500 15.43 37.0 <NA> <NA> NA south
## 99  0.087 59.5 5.130 14.49 38.0 <NA> <NA> NA  west
## 103 0.087   NA 4.540    NA 39.0 <NA> <NA> NA  west
## 113 0.090 55.7 4.070 13.11 36.6 <NA> <NA> NA  east
## 127 0.093 56.0 5.410 17.25 40.0 <NA> <NA> NA north
To obtain a linear model with a main effect for wgt, we formulate \(age\sim wgt\):
boys %$% lm(age ~ wgt)
##
## Call:
## lm(formula = age ~ wgt)
##
## Coefficients:
## (Intercept)          wgt
##     -0.1924       0.2517
boys %$% lm(age ~ wgt) %>% summary()
##
## Call:
## lm(formula = age ~ wgt)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.0907  -1.1686  -0.5342   1.4286   5.5143
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.192377   0.136878  -1.405     0.16
## wgt          0.251673   0.003018  83.395   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.142 on 742 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9036, Adjusted R-squared:  0.9035
## F-statistic:  6955 on 1 and 742 DF,  p-value: < 2.2e-16
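With a fitted model we can also generate predictions. A minimal sketch, where the object name fit_wgt and the weight value of 40 kg are just illustrative:

fit_wgt <- boys %$% lm(age ~ wgt)
predict(fit_wgt, newdata = data.frame(wgt = 40), interval = "prediction") # predicted age with a 95% prediction interval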
To obtain a linear model with just main effects for wgt and hgt, we formulate \(age\sim wgt + hgt\):
boys %$% lm(age ~ wgt + hgt)
##
## Call:
## lm(formula = age ~ wgt + hgt)
##
## Coefficients:
## (Intercept)          wgt          hgt
##    -7.45952      0.07138      0.10660
boys %$% lm(age ~ wgt + hgt) %>% summary
##
## Call:
## lm(formula = age ~ wgt + hgt)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.1627 -0.8836 -0.1656  0.8765  4.4365
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.459525   0.246782  -30.23   <2e-16 ***
## wgt          0.071376   0.005949   12.00   <2e-16 ***
## hgt          0.106602   0.003336   31.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.391 on 724 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.9592, Adjusted R-squared:  0.9591
## F-statistic:  8510 on 2 and 724 DF,  p-value: < 2.2e-16
To predict \(age\) from \(wgt\), \(hgt\) and the interaction \(wgt*hgt\), we formulate \(age\sim wgt * hgt\):
boys %$% lm(age ~ wgt * hgt) %>% summary
##
## Call:
## lm(formula = age ~ wgt * hgt)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.2067 -0.8603 -0.1419  0.8452  4.7526
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.2833480  0.2510218 -29.015  < 2e-16 ***
## wgt         -0.0196161  0.0284921  -0.688  0.49137
## hgt          0.1120325  0.0037076  30.217  < 2e-16 ***
## wgt:hgt      0.0004142  0.0001269   3.265  0.00115 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.382 on 723 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.9598, Adjusted R-squared:  0.9596
## F-statistic:  5752 on 3 and 723 DF,  p-value: < 2.2e-16
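Note that R's formula syntax expands wgt * hgt into both main effects plus their interaction, so the following two specifications fit the same model:

boys %$% lm(age ~ wgt * hgt) %>% coef()
boys %$% lm(age ~ wgt + hgt + wgt:hgt) %>% coef() # identical estimates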
If a categorical variable is entered into the function lm(), it is automatically converted to a set of dummy variables in R. The first factor level is always taken as the reference category. If we want another reference category, we can use the function relevel() to change this; see the sketch below.
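For example, a minimal sketch that would make "west" the reference category (the models below keep the default):

boys_rl <- boys
boys_rl$reg <- relevel(boys_rl$reg, ref = "west") # "west" becomes the reference
boys_rl %$% lm(age ~ reg) %>% coef()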
fit <- boys %$% lm(age ~ reg)
fit
##
## Call:
## lm(formula = age ~ reg)
##
## Coefficients:
## (Intercept)      regeast      regwest     regsouth      regcity
##      11.898       -2.657       -2.901       -3.307       -3.627
fit %>% model.matrix() %>% tail(n = 20)
##     (Intercept) regeast regwest regsouth regcity
## 729           1       1       0        0       0
## 730           1       0       1        0       0
## 731           1       0       0        1       0
## 732           1       0       1        0       0
## 733           1       0       0        0       0
## 734           1       1       0        0       0
## 735           1       0       0        0       1
## 736           1       0       1        0       0
## 737           1       1       0        0       0
## 738           1       0       0        0       1
## 739           1       0       0        0       0
## 740           1       1       0        0       0
## 741           1       0       0        1       0
## 742           1       0       0        0       0
## 743           1       0       1        0       0
## 744           1       0       0        0       0
## 745           1       0       1        0       0
## 746           1       0       1        0       0
## 747           1       0       0        0       0
## 748           1       1       0        0       0
fit %>% summary
##
## Call:
## lm(formula = age ~ reg)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -11.805  -7.376   1.339   5.955  11.935
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  11.8984     0.7595  15.666  < 2e-16 ***
## regeast      -2.6568     0.9312  -2.853 0.004449 **
## regwest      -2.9008     0.8788  -3.301 0.001011 **
## regsouth     -3.3074     0.9064  -3.649 0.000282 ***
## regcity      -3.6266     1.1031  -3.288 0.001058 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.836 on 740 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.02077, Adjusted R-squared:  0.01548
## F-statistic: 3.925 on 4 and 740 DF,  p-value: 0.003678
fit %>% names() # the names of the list with output objects
## [1] "coefficients" "residuals" "effects" "rank" ## [5] "fitted.values" "assign" "qr" "df.residual" ## [9] "na.action" "contrasts" "xlevels" "call" ## [13] "terms" "model"
fit$coef # show the estimated coefficients
## (Intercept)     regeast     regwest    regsouth     regcity
##   11.898420   -2.656786   -2.900792   -3.307388   -3.626625
fit %>% coef # alternative
## (Intercept)     regeast     regwest    regsouth     regcity
##   11.898420   -2.656786   -2.900792   -3.307388   -3.626625
Linear regression model \[ y_i=\alpha+\beta{x}_i+\varepsilon_i \]
Assumptions:

- linearity: the relation between the outcome and the predictor(s) is linear
- independence: the errors are independent of each other
- homoscedasticity: the errors have constant variance
- normality: the errors are normally distributed
The anscombe data:

anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
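The anscombe set is famous because the four x–y pairs share nearly identical summary statistics while looking very different when plotted. A quick check of the correlations:

anscombe %$% cor(x1, y1)
anscombe %$% cor(x2, y2)
anscombe %$% cor(x3, y3)
anscombe %$% cor(x4, y4) # each pair correlates at roughly 0.816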
anscombe %>% ggplot(aes(x = x1, y = y1)) + geom_point() + geom_smooth(method = "lm")
Source: Anscombe, F. J. (1973). "Graphs in Statistical Analysis". The American Statistician, 27(1), 17–21.
ggplot2

Layered plotting, based on the book The Grammar of Graphics by Leland Wilkinson.

With ggplot2 you declare what to plot, layer by layer; ggplot2 then takes care of the details.
1: Provide the data
boys %>% ggplot()
2: Map variables to aesthetics
boys %>% ggplot(aes(x = age, y = bmi))
3: State which geometric object to display
boys %>% ggplot(aes(x = age, y = bmi)) + geom_point()
Create the plot
gg <- boys %>% ggplot(aes(x = age, y = bmi)) + geom_point(col = "dark green")
Add another layer (smooth fit line)
gg <- gg + geom_smooth(col = "dark blue")
Give it some labels and a nice look
gg <- gg + labs(x = "Age", y = "BMI", title = "BMI trend for boys") + theme_minimal()
plot(gg)
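If we want to keep the result, ggplot2's ggsave() writes the plot to disk; a minimal sketch with an illustrative file name:

ggsave("bmi_trend_boys.png", plot = gg, width = 6, height = 4) # file name is illustrative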
lm

The linear model would take the following form:

fit <- yourdata %$% lm(youroutcome ~ yourpredictors)
fit %>% summary() # pipe
summary(fit)      # base R
Output:
anscombe example

fit <- anscombe %$% lm(y1 ~ x1)
fit %>% summary
##
## Call:
## lm(formula = y1 ~ x1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.92127 -0.45577 -0.04136  0.70941  1.83882
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0001     1.1247   2.667  0.02573 *
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
To inspect the normality of the residuals we can use qqnorm(), and compare the observed QQ-plot with QQ-plots of simulated errors, cf. rnorm(n, 0, s).
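A minimal sketch of such a comparison for the anscombe fit above; setting the sd of the simulated errors to the residual sd is an assumption for illustration:

res <- fit %>% residuals()
par(mfrow = c(1, 2))
qqnorm(res, main = "Observed residuals"); qqline(res)
sim <- rnorm(length(res), mean = 0, sd = sd(res)) # simulated normal errors
qqnorm(sim, main = "Simulated errors"); qqline(sim)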
anscombe %>% ggplot(aes(x1, y1)) + geom_point() + geom_smooth(method = "loess", col = "blue") + geom_smooth(method = "lm", col = "orange")
The loess curve suggests slight non-linearity. We can investigate this by adding a quadratic term to the model:
anscombe %$% lm(y1 ~ x1 + I(x1^2)) %>% summary()
##
## Call:
## lm(formula = y1 ~ x1 + I(x1^2))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.8704 -0.3481 -0.2456  0.7129  1.8072
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.75507    3.28815   0.230    0.824
## x1           1.06925    0.78982   1.354    0.213
## I(x1^2)     -0.03162    0.04336  -0.729    0.487
##
## Residual standard error: 1.27 on 8 degrees of freedom
## Multiple R-squared:  0.6873, Adjusted R-squared:  0.6092
## F-statistic: 8.793 on 2 and 8 DF,  p-value: 0.009558
The plot() method for lm objects produces diagnostic plots; which = c(1, 3) selects the residuals versus fitted plot and the scale-location plot:

par(mfrow = c(1, 2))
fit %>% plot(which = c(1, 3), cex = .6)
par(mfrow = c(1, 2))
boys %$% lm(bmi ~ age) %>% plot(which = c(1, 3), cex = .6)
fit %>% plot(which = 2, cex = .6)
The QQ-plot shows some divergence from normality at the tails.
Leverage: see the fitted line as a lever; observations far from the bulk of the predictor values exert the most pull on it.
Standardized residuals: the residuals divided by an estimate of their standard deviation, which puts them on a common scale.
Cook’s distance: how far the predicted values would move if your model were fit without the data point in question.
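All three measures can be extracted directly from a fitted lm object; a minimal sketch for the anscombe fit:

fit %>% hatvalues() # leverage
fit %>% rstandard() # standardized residuals
fit %>% cooks.distance() # Cook's distance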
Outliers are cases with large \(e_z\) (standardized residuals).
If the model is correct, we expect the standardized residuals to behave like draws from a standard normal distribution:

- about 5% of cases with \(|e_z| > 1.96\)
- about 1% of cases with \(|e_z| > 2.58\)
Influential cases are cases whose presence or absence substantially changes the parameter estimates.
par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5)) # 4 = Cook's distance, 5 = residuals vs leverage
There are no cases with \(|e_z|>2\), so no outliers (right plot). There are no cases with Cook's distance \(>1\), but case 3 stands out.
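A minimal sketch to check these conclusions numerically:

fit %>% rstandard() %>% abs() %>% max() # largest |e_z|
fit %>% cooks.distance() %>% which.max() # case with the largest Cook's distance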