- squared deviations
- linear modeling
- assumptions associated with the linear model
Data Science and Predictive Machine Learning
library(dplyr)
library(magrittr)
library(ggplot2)
library(mice)
\[\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X\sigma_Y} = \frac{\mathrm{E}[(X - \mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y}.\]
\[t = \frac{\bar{X}-\mu}{\hat{\sigma}/\sqrt{n}}.\]
\[\sigma^2_X = \mathrm{E}[(X - \mu_X)^2].\]
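As a minimal sketch, these quantities map directly onto base R; the vectors x and y and the reference value mu0 here are hypothetical:

set.seed(123)
x <- rnorm(100, mean = 5, sd = 2)
y <- 0.5 * x + rnorm(100)

cov(x, y) / (sd(x) * sd(y))                 # correlation rho_{X,Y}
cor(x, y)                                   # the same, directly

mu0 <- 5
(mean(x) - mu0) / (sd(x) / sqrt(length(x))) # t statistic
t.test(x, mu = mu0)$statistic               # the same, via t.test()

mean((x - mean(x))^2)                       # variance, population form
var(x)                                      # sample variance (divides by n - 1)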
Deviations give the distance of each value (observation) from a comparison/reference value.
The arithmetic mean is a very informative measure:
The mean itself is a model: observations are merely deviations from that model.
Deviations summarize the fit of all the points in the data to a single point
The mean is the mathematical expectation. It represents the observed values best for a normally distributed univariate set.
plotdata %>%
  mutate("Mean" = X - mean(X),
         "Mean + 3" = X - (mean(X) + 3)) %>%
  select("Mean", "Mean + 3") %>%
  colSums() %>%
  round(digits = 3)
##     Mean Mean + 3 
##        0     -300
The mean minimizes the deviations: they sum to zero about the mean, and no other value yields a smaller sum of squared deviations.
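A minimal sketch of that claim, with a hypothetical vector X: the value that minimizes the sum of squared deviations coincides with the arithmetic mean.

set.seed(123)
X <- rnorm(100, mean = 10, sd = 3)          # hypothetical data
ssd <- function(c) sum((X - c)^2)           # sum of squared deviations about c
optimize(ssd, interval = range(X))$minimum  # numerical minimizer of the SSD
mean(X)                                     # the arithmetic mean: same value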
Throughout statistics we make extensive use of squaring. What are the useful properties of squaring that statisticians are so fond of? Squared deviations are never negative, and they weigh large deviations more heavily than small ones.
Linear regression model \[ y_i = \alpha + \beta x_i + \varepsilon_i \]
Assumptions: we return to these below. First, consider the Anscombe data.
data(anscombe)
anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
anscombe %>%
  ggplot(aes(x1, y1)) +
  geom_point() +
  geom_smooth(method = "lm")
The linear model would take the following form:
fit <- yourdata %$%
  lm(youroutcome ~ yourpredictors)
fit %>% summary() # pipe
summary(fit)      # base R
Output for the anscombe data:
fit <- anscombe %$% lm(y1 ~ x1)
fit %>% summary()
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
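The fitted object contains more than the summary shows; a minimal sketch of pulling out its components:

fit %>% coef()                    # estimates of alpha and beta
fit %>% confint()                 # 95% confidence intervals
fit %>% residuals() %>% summary() # the estimated epsilon_i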
There are four key assumptions about the use of linear regression models.
In short, we assume:

- the outcome to have a linear relation with the predictors, and the predictor effects to be additive:

\[ y = \alpha + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon \]

- the residuals to be statistically independent,
- the residual variance to be constant (homoscedasticity),
- the residuals to be normally distributed with mean \(\mu_\epsilon = 0\).
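Most of these assumptions can be inspected visually. As a minimal sketch, assuming fit is the lm object fitted above, R's default diagnostic plots address linearity, normality, and constant variance:

par(mfrow = c(2, 2), cex = .6)
fit %>% plot() # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage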
\[ y = \alpha + \beta X + \epsilon \quad \text{and} \quad \hat{y} = \hat{\alpha} + \hat{\beta} X. \] As a result, the components of the linear model can also be seen as columns in a data set.
As a data set, this would be the result for lm(y1 ~ x1, data = anscombe)
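A minimal sketch of how such a data set could be constructed, using fitted() and residuals(); the column names here are illustrative:

fit <- anscombe %$% lm(y1 ~ x1)
anscombe %>%
  transmute(x1,
            y         = y1,
            predicted = fitted(fit),    # y-hat = alpha-hat + beta-hat * x1
            residual  = residuals(fit)) # y - y-hat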
Here the residuals are not independent, and the residual variance is not constant!
Here the residuals do not have mean zero.
Here the residual variance is not constant, the residuals are not normally distributed and the relation between \(Y\) and \(X\) is not linear!
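For intuition, a hypothetical simulation (not data from this document) that produces such a scenario:

set.seed(123)
x <- runif(100, 0, 10)
y <- exp(0.5 * x + rnorm(100, sd = 0.4)) # nonlinear relation; noise grows with x
badfit <- lm(y ~ x)
badfit %>% plot(which = 1) # residuals vs fitted: curvature and fanning variance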
Leverage: see the fitted line as a lever; cases far from the mean of the predictor(s) exert more pull on it.
Standardized residuals: residuals divided by an estimate of their standard deviation, putting them on a common scale.
Cook’s distance: how far the predicted values would move if your model were fit without the data point in question.
Outliers are cases with large \(\epsilon_z\) (standardized residuals).
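Each of these diagnostics is directly available for a fitted lm object; a minimal sketch using the fit from earlier:

fit %>% hatvalues()      # leverage of each case
fit %>% rstandard()      # standardized residuals (epsilon_z)
fit %>% cooks.distance() # Cook's distance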
If the model is correct, we expect the standardized residuals to behave like draws from a standard normal distribution: roughly 5% of cases with \(|\epsilon_z| > 1.96\) and almost none with \(|\epsilon_z| > 3\).
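As a quick numerical check of that expectation (a sketch, reusing the fit from above):

mean(abs(rstandard(fit)) > 1.96) # should be close to 0.05 if the model is correct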
Influential cases are cases whose removal would substantially change the parameter estimates.
par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))
These are the plots for the Violated Assumptions #3 scenario. There are many cases with unrealistically large \(|\epsilon_z|\), so these could be labeled as outliers. There are no cases with Cook's distance \(> 1\), but case 72 stands out. Of course, this model should never have been fit.