Data Science and Predictive Machine Learning

This lecture

  • squared deviations
  • linear modeling
  • assumptions associated with the linear model

We use the following packages

library(dplyr)
library(magrittr)
library(ggplot2)
library(mice)

Squared deviations

We have already met deviations before

  • correlations

\[\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X\sigma_Y} = \frac{\mathrm{E}[(X - \mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y}.\]

  • t-test

\[t = \frac{\bar{X}-\mu}{\hat{\sigma}/\sqrt{n}}.\]

  • variance

\[\sigma^2_X = \mathrm{E}[(X - \mu)^2].\]
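
All three of these quantities are built from deviations. A minimal sketch that verifies this numerically on a small simulated sample (the vectors x and y and the value mu0 are illustrative, not data used elsewhere in this lecture):

set.seed(123)
x   <- rnorm(100, mean = 5, sd = 2)  # illustrative sample
y   <- 0.5 * x + rnorm(100)          # second variable, correlated with x
mu0 <- 5                             # hypothesized mean for the t-test

# correlation: covariance of the deviations, scaled by the standard deviations
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1) / (sd(x) * sd(y))
cor(x, y)                            # same value

# t statistic: deviation of the sample mean from mu0, in standard errors
(mean(x) - mu0) / (sd(x) / sqrt(length(x)))
t.test(x, mu = mu0)$statistic        # same value

# variance: the average squared deviation (with n - 1 in the denominator)
sum((x - mean(x))^2) / (length(x) - 1)
var(x)                               # same value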

Deviations

Deviations tell us how far each value (observation) lies from a comparison/reference value.

  • often, the mean is chosen as the reference.

Why the mean?

The arithmetic mean is a very informative measure:

  • it is the average
  • it is the mathematical expectation
  • it is the central value of a set of discrete numbers

\[\text{The mean itself is a model: observations are merely a deviation from that model.}\]
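
This idea can be made concrete in R: an intercept-only regression estimates nothing but the mean, and its residuals are exactly the deviations from it. A minimal sketch (the vector X below is illustrative):

X <- c(2, 4, 6, 8, 10)   # illustrative observations

fit0 <- lm(X ~ 1)        # intercept-only model: the mean as a model
coef(fit0)               # the single coefficient equals mean(X)
residuals(fit0)          # the residuals equal X - mean(X): the deviations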

The mean as a center

Deviations from the mean

Plotting the deviations

Use of deviations

Deviations summarize the fit of all the points in the data to a single point.

The mean is the mathematical expectation. It represents the observed values best for a normally distributed univariate set.

  • The deviations from the mean sum to zero; deviations from any other reference value do not, as the code below demonstrates.

plotdata %>%
  mutate("Mean" = X - mean(X), 
         "Mean + 3" = X - (mean(X) + 3)) %>%
  select("Mean", "Mean + 3") %>%
  colSums %>%
  round(digits = 3)
##     Mean Mean + 3 
##        0     -300
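
The deviations from the mean cancel out exactly, and the mean is also the reference value with the smallest possible sum of squared deviations. A minimal sketch, assuming an illustrative vector X (plotdata itself is not reproduced here):

set.seed(123)
X <- rnorm(100, mean = 10, sd = 2)    # illustrative data

# sum of squared deviations around an arbitrary reference value
ssd <- function(ref) sum((X - ref)^2)

ssd(mean(X))                          # the smallest attainable value
ssd(mean(X) + 3)                      # any other reference value does worse

# numerical minimization recovers the mean
optimize(ssd, interval = range(X))$minimum
mean(X)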

The mean minimizes the deviations

What happens

Plotting the standardized deviations

Plotting the squared deviations

Why squared deviations are useful

Throughout statistics we make extensive use of squaring. \[\text{What are the useful properties of squaring that statisticians are so fond of?}\]
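
One property can already be shown numerically: raw deviations from the mean cancel out (they always sum to zero), whereas squared deviations are non-negative, so they accumulate rather than cancel and larger deviations weigh in more heavily. An illustrative sketch:

X <- c(1, 2, 3, 10)   # illustrative values; 10 lies far from the mean

dev <- X - mean(X)    # raw deviations
sum(dev)              # zero: positive and negative deviations cancel
dev^2                 # squaring removes the sign; the large deviation dominates
sum(dev^2)            # squared deviations accumulate instead of cancelling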

Deviations from the mean

Deviations from the mean

Least squares solution

Least squares solution

Fitting a line to data

Linear regression

Linear regression model \[ y_i=\alpha+\beta{x}_i+\varepsilon_i \]

Assumptions:

  • \(y_i\) conditionally normal with mean \(\mu_i=\alpha+\beta{x}_i\)
  • \(\varepsilon_i\) are \(i.i.d.\) with mean 0 and (constant) variance \(\sigma^2\)
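
A minimal simulation sketch that generates data under exactly these assumptions and then recovers the parameters with lm() (the values of alpha, beta and sigma are arbitrary choices for illustration):

set.seed(123)
n     <- 100
alpha <- 2      # illustrative intercept
beta  <- 0.5    # illustrative slope
sigma <- 1      # constant residual standard deviation

x <- runif(n, min = 0, max = 10)
y <- alpha + beta * x + rnorm(n, mean = 0, sd = sigma)  # i.i.d. normal errors

lm(y ~ x) %>% coef()   # estimates should be close to alpha and beta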

The anscombe data

anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Fitting a line

anscombe %>%
  ggplot(aes(x1, y1)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Fitting a line

Fitting a line

The linear model would take the following form:

fit <- 
  yourdata %>%
  lm(youroutcome ~ yourpredictors, data = .)

fit %>% summary() # pipe
summary(fit) # base R

Output:

  • Residuals: minimum, maximum, median and quartiles
  • Coefficients: estimates, SE’s, t-values and \(p\)-values
  • Fit measures
    • Residual standard error (the estimated standard deviation of the errors)
    • Multiple R-squared (proportion of variance explained)
    • F-statistic and \(p\)-value (test of overall model significance)
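
Each of these components can also be extracted from the fitted object directly, which is often more useful than reading them off the printed summary. A sketch, assuming a fitted lm object named fit as above:

fit %>% coef()                      # coefficient estimates
fit %>% confint()                   # their 95% confidence intervals
fit %>% residuals() %>% summary()   # distribution of the residuals
summary(fit)$r.squared              # proportion of variance explained
summary(fit)$fstatistic             # F-statistic with its degrees of freedom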

anscombe example

fit <- anscombe %$%
  lm(y1 ~ x1)

fit %>% summary
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Assumptions

The key assumptions

There are four key assumptions about the use of linear regression models.

In short, we assume

  • The outcome to have a linear relation with the predictors, and the predictor effects to be additive.

    • the expected value for the outcome is a straight-line function of each predictor, given that the others are fixed.
    • the slope of each line does not depend on the values of the other predictors
    • the effects of the predictors on the expected value are additive

    \[ y = \alpha + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon\]

  • The residuals are statistically independent

  • The residual variance is constant

    • across the expected values
    • across any of the predictors
  • The residuals are normally distributed with mean \(\mu_\epsilon = 0\)
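
These assumptions are usually inspected graphically. A minimal sketch using the built-in diagnostic plots for a fitted lm object (here the object fit from the anscombe example; the mapping of plots to assumptions is indicated in the comments):

par(mfrow = c(2, 2), cex = .6)
fit %>% plot(which = 1:4)
# 1: residuals vs fitted values -> linearity and constant variance
# 2: normal Q-Q plot            -> normality of the residuals
# 3: scale-location plot        -> constant residual variance
# 4: Cook's distance            -> influential cases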

A simple example

\[y = \alpha + \beta X + \epsilon\] and \[\hat{y} = \alpha + \beta X\] As a result, the following components of the linear model

  • outcome \(y\)
  • predicted outcome \(\hat{y}\)
  • predictor \(X\)
  • residual \(\epsilon\)

can also be seen as columns in a data set.

As a data set

As a data set, this would be the result for lm(y1 ~ x1, data = anscombe)
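
A sketch of how such a data set can be constructed from the fitted model (the object name fit1 and the column names predicted and residual are illustrative):

fit1 <- lm(y1 ~ x1, data = anscombe)

anscombe %>%
  mutate(predicted = fitted(fit1),     # the predicted outcome (hat y)
         residual  = residuals(fit1)) %>%
  select(x1, y1, predicted, residual)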

An example

Residual variance

Violated assumptions #1

Here the residuals are not independent, and the residual variance is not constant!

Residual variance

Violated assumptions #2

Here the residuals do not have mean zero.

Residual variance

Violated assumptions #3

Here the residual variance is not constant, the residuals are not normally distributed and the relation between \(Y\) and \(X\) is not linear!
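
This scenario can be mimicked with a small simulation in which the true relation is curved and the error spread grows with x (illustrative only; this is not the data behind the figure on the slide):

set.seed(123)
x <- runif(200, min = 0, max = 10)
y <- 2 + 0.5 * x^2 + rnorm(200, sd = x)   # nonlinear trend, variance grows with x

fit_bad <- lm(y ~ x)                      # a straight line is the wrong model here
par(mfrow = c(1, 2), cex = .6)
fit_bad %>% plot(which = 1:2)             # curved residual pattern, non-normal Q-Q plot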

Residual variance

Leverage and influence revisited

Outliers and influential cases

Leverage: see the fitted line as a lever.

  • some points pull/push harder; they have more leverage

Standardized residuals:

  • The values that have more leverage tend to be closer to the line
  • The line is fit so as to be closer to them
  • The residual standard deviation can differ at different points on \(X\) - even if the error standard deviation is constant.
  • Therefore we standardize the residuals so that they have constant variance (assuming homoscedasticity).

Cook’s distance: how far the predicted values would move if your model were fit without the data point in question.

  • it is a function of the leverage and standardized residual associated with each data point
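
All three quantities are directly available in base R for any fitted lm object; a sketch (using a fitted model fit, such as the anscombe fit from before):

data.frame(leverage     = hatvalues(fit),       # leverage of each case
           std_residual = rstandard(fit),       # standardized residuals
           cooks_d      = cooks.distance(fit)   # Cook's distance
           ) %>% 
  head()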

Fine

High leverage, low residual

Low leverage, high residual

High leverage, high residual

Outliers and influential cases

Outliers are cases with large \(\epsilon_z\) (standardized residuals).

If the model is correct we expect:

  • 5% of standardized residuals \(|\epsilon_z|>1.96\)
  • 1% of standardized residuals \(|\epsilon_z|>2.58\)
  • virtually 0% of standardized residuals \(|\epsilon_z|>3.3\)

Influential cases are cases with large influence on parameter estimates

  • cases with Cook’s Distance \(> 1\), or
  • cases with Cook’s Distance much larger than the rest
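
A sketch of how these rules of thumb can be checked for a fitted model (again assuming an lm object named fit):

ez <- rstandard(fit)          # standardized residuals
cd <- cooks.distance(fit)     # Cook's distances

mean(abs(ez) > 1.96)          # expected to be around 5% if the model is correct
mean(abs(ez) > 2.58)          # around 1%
mean(abs(ez) > 3.3)           # virtually 0%

which(cd > 1)                             # cases exceeding the absolute cut-off
sort(cd, decreasing = TRUE) %>% head(3)   # do any stand out from the rest?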

Outliers and influential cases

par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))

These are the plots for the Violated Assumptions #3 scenario. There are many cases with unrealistically large \(|\epsilon_z|\), so these could be labeled as outliers. There are no cases with Cook’s Distance \(>1\), but case 72 stands out. Of course, this model should never have been fit.

For fun