Data Science and Predictive Machine Learning

Packages used in this lecture

library(magrittr) # pipes
library(dplyr)    # data manipulation
library(ggplot2)  # plotting
library(DAAG)     # data sets and functions

So far

At this point we have covered the following models:

  • Simple linear regression (SLR)

\[y=\alpha+\beta x+\epsilon\]

The relationship between a numerical outcome and a numerical or categorical predictor

  • Multiple linear regression (MLR)

\[y=\alpha+\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon\]

The relationship between a numerical outcome and multiple numerical or categorical predictors

What remains

We have not yet covered how to handle outcomes that are not continuous, nor how to deal with predictors that are nonlinear or have a strict dependency structure.

What we have learned

Last week we covered how to:

  • fit SLR and MLR models
  • select MLR models
  • interpret model parameters
  • perform hypothesis tests on slope and intercept parameters
  • perform a hypothesis test for the whole regression model
  • calculate confidence intervals for regression parameters
  • obtain prediction intervals for fitted values
  • study the influence of single cases
  • study the validity of linear regression assumptions:
    • linearity, constant residual variance
  • study the residuals, leverage and Cook’s distance

Rewriting what we know

Instead of modeling

\[y=\alpha+\beta x+\epsilon\]

we can also consider \[\mathbb{E}[y] = \alpha + \beta x\]

These are the same model, written in different notation; the second form, however, places the model in a more general framework.

The upside is that we can now apply a function to the expectation \(\mathbb{E}[y]\) to allow for transformations: we can choose a function \(f\) such that \(f(\mathbb{E}[y])\) has a linear relation with \(x\).
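
As a sketch of what this buys us (the data here are hypothetical, made up purely for illustration): a log link models \(\log(\mathbb{E}[y])\), which is not the same as modeling a log-transformed outcome.

# Transforming the expectation (via a link function) differs from
# transforming the outcome itself. Hypothetical data:
x <- 1:100
y <- exp(0.5 + 0.02 * x + rnorm(100, sd = 0.1))

glm(y ~ x, family = gaussian(link = "log")) # models log(E[y]) = alpha + beta x
lm(log(y) ~ x)                              # models E[log(y)] = alpha + beta x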

This is what we will be doing today

Illustration of the problem

A simulated data set

To further illustrate why the linear model is not an appropriate model for discrete data, I propose the following simple simulated data set:

set.seed(123)
# 50 observations with outcome 0 and 50 with outcome 1, each with its own mean
simulated <- data.frame(discrete = c(rep(0, 50), rep(1, 50)),
                        continuous = c(rnorm(50, 10, 3), rnorm(50, 15, 3)))

simulated %>% summary
##     discrete     continuous    
##  Min.   :0.0   Min.   : 4.100  
##  1st Qu.:0.0   1st Qu.: 9.656  
##  Median :0.5   Median :12.904  
##  Mean   :0.5   Mean   :12.771  
##  3rd Qu.:1.0   3rd Qu.:15.570  
##  Max.   :1.0   Max.   :21.562

This data allows us to illustrate modeling the relation between the discrete outcome and the continuous predictor with logistic regression.

Remember that fixing the random seed makes the random number sequence, and thus this simulation, reproducible.

Visualizing simulated data

simulated %>% ggplot(aes(x = continuous, y = discrete)) +
  geom_point()

Modeling simulated with lm

simulated %>% ggplot(aes(x = continuous, y = discrete)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE, color = "orange") 

The orange line represents the lm linear regression line. It is not a good representation for our data, as it assumes the data are continuous and projects values outside of the range of the observed data.
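
We can check this numerically: with the simulated data from above, the fitted values of a linear model fall outside the \([0, 1]\) range of the outcome (fit_lm is our own name for this quick check; exact values depend on the seed):

fit_lm <- simulated %$% lm(discrete ~ continuous)
range(predict(fit_lm)) # here the fitted values dip below 0 and exceed 1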

Modeling simulated with glm

simulated %>% ggplot(aes(x = continuous, y = discrete)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE, color = "orange") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) 

The blue glm logistic regression line represents this data far better than the orange lm line. It assumes the outcome to be 0 or 1 and does not project values outside of the range of the observed data.

How does this work?

Generalized linear modeling

There is a very general way of addressing this type of problem in regression. The models that use this general way are called generalized linear models (GLMs).

Every generalized linear model has the following three characteristics:

  1. A probability distribution that describes the outcome
  2. A linear predictor model
  3. A link function that relates the linear predictor to the parameter of the outcome’s probability distribution.

The linear predictor model in (2) is \[\eta = \mathbf{X}\beta,\] where \(\eta\) denotes the linear predictor, and the link function in (3) is \[\mathbf{X}\beta = g(\mu).\] The technique to model a binary outcome based on a set of continuous or discrete predictors is called logistic regression. Logistic regression is an example of a generalized linear model.

The link function

Modeling the odds

Odds are a way of quantifying the probability of an event \(E\).

The odds for an event \(E\) are \[\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1 - P(E)}\] The odds of getting heads in a coin toss are

\[\text{odds}(\text{heads}) = \frac{P(\text{heads})}{P(\text{tails})} = \frac{P(\text{heads})}{1 - P(\text{heads})}\] For a fair coin, this would result in

\[\text{odds}(\text{heads}) = \frac{.5}{1 - .5} = 1\]
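
As a minimal sketch, we can wrap this definition in a helper function (odds() is our own name, not a base R function):

odds <- function(p) p / (1 - p) # odds from a probability p
odds(0.5) # a fair coin: odds of 1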

Another odds example

The game Lingo has 44 balls: 36 blue, 6 red and 2 green.

  • The odds of a player choosing a blue ball are \[\text{odds}(\text{blue}) = \frac{36}{8} = \frac{36/44}{8/44} = \frac{.8182}{.1818} = 4.5\]
  • The odds of a player choosing a red ball are \[\text{odds}(\text{red}) = \frac{6}{38} = \frac{6/44}{38/44} = \frac{.1364}{.8636}\approx .16\]
  • The odds of a player choosing a green ball are \[\text{odds}(\text{green}) = \frac{2}{42} = \frac{2/44}{42/44} = \frac{.0455}{.9545}\approx .05\]

Odds of 1 indicate an equal likelihood of the event occurring or not occurring. Odds < 1 indicate a lower likelihood of the event occurring vs. not occurring. Odds > 1 indicate a higher likelihood of the event occurring vs. not occurring.
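
The odds() helper from the coin example reproduces these numbers:

odds(36/44) # blue:  4.5
odds(6/44)  # red:   ~0.158
odds(2/44)  # green: ~0.048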

GLMs continued

Remember that \[y=\alpha+\beta x+\epsilon,\]

and that \[\mathbb{E}[y] = \alpha + \beta x.\]

As a result, \[y = \mathbb{E}[y] + \epsilon,\]

and the residuals do not need to be normal (heck, \(y\) probably isn’t, so why should \(\epsilon\) be?).

Logistic regression

Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors.

In logistic regression we assume that the true data generating model for the outcome variable follows a binomial distribution.

  • it is therefore intuitive to think of logistic regression as modeling the probability of success \(p\) for any given set of predictors.

How

We specify a reasonable link that connects \(\eta\) to \(p\). Most common in logistic regression is the logit link

\[\text{logit}(p)=\log\left(\frac{p}{1-p}\right), \text{ for } 0 < p < 1.\] We might recognize \(\frac{p}{1-p}\) as the odds.
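
In R, the logit and its inverse are readily available as qlogis() and plogis(), the quantile and distribution functions of the logistic distribution:

qlogis(0.5) # logit(0.5) = log(1) = 0
plogis(0)   # inverse logit: back to a probability of 0.5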

\(\log(\text{odds})\) explained

[Figure: the relation between the predictor(s) and the log odds, which is linear under the logit link.]

The link to the responses explained

[Figure: how the link function maps the linear predictor back to the response scale.]

Logistic regression

With linear regression we had the Sum of Squares (SS). Its logistic counterpart is the Deviance (D).

  • Deviance quantifies the discrepancy between the observed values and the values expected under the model; lower deviance means better fit.

With logistic regression we aim to maximize the likelihood, which is equivalent to minimizing the deviance.

The likelihood is the (joint) probability of the observed values, given the current model parameters.

For normally distributed data, \(\text{SS}=\text{D}\): the deviance equals the residual sum of squares.
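
As a quick sketch on the simulated data from earlier: for 0/1 outcomes the saturated model has log-likelihood zero, so the residual deviance is simply \(-2\) times the model’s log-likelihood (fit_sim is our own name):

fit_sim <- simulated %$% glm(discrete ~ continuous, family = binomial)
deviance(fit_sim)    # residual deviance
-2 * logLik(fit_sim) # the same value for 0/1 outcomes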

The logistic regression model

Remember the three characteristics for every generalized linear model:

  1. A probability distribution that describes the outcome
  2. A linear predictor model
  3. A link function that relates the linear predictor to the parameter of the outcome’s probability distribution.

For the logistic model this gives us:

  1. \(y_i \sim \text{Binom}(1, p_i)\)
  2. \(\eta = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n\)
  3. \(\text{logit}(p) = \eta\)

Simple substitution brings us to

\[p_i = \frac{\text{exp}(\eta)}{1+\text{exp}(\eta)} = \frac{\text{exp}(\beta_0 + \beta_1x_{1,i} + \dots + \beta_nx_{n,i})}{1+\text{exp}(\beta_0 + \beta_1x_{1,i} + \dots + \beta_nx_{n,i})}\]
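
This back-transformation is the inverse logit, which R provides as plogis(). A quick sanity check with a hypothetical value of the linear predictor:

eta <- 1.5 # a hypothetical value for the linear predictor
exp(eta) / (1 + exp(eta)) # manual back-transformation: 0.8175745
plogis(eta)               # identical: plogis() is the inverse logit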

Fitting a logistic regression

The anesthetic data

anesthetic %>% head(n = 10)
##    move conc    logconc nomove
## 1     0  1.0  0.0000000      1
## 2     1  1.2  0.1823216      0
## 3     0  1.4  0.3364722      1
## 4     1  1.4  0.3364722      0
## 5     1  1.2  0.1823216      0
## 6     0  2.5  0.9162907      1
## 7     0  1.6  0.4700036      1
## 8     1  0.8 -0.2231436      0
## 9     0  1.6  0.4700036      1
## 10    1  1.4  0.3364722      0

Thirty patients were given an anesthetic agent, maintained at a predetermined level (conc), for 15 minutes before an incision was made. It was then noted whether or not the patient moved, i.e. jerked or twisted.

Fitting a logistic regression model

Fitting a glm in R is not much different from fitting a lm. We do, however, need to specify what type of glm to use by specifying both the family and the type of link function we need.

For logistic regression we need the binomial family as the binomial distribution is the probability distribution that describes our outcome. We also use the logit link, which is the default for the binomial glm family.

fit <- anesthetic %$% 
  glm(nomove ~ conc, family = binomial(link="logit"))
fit
## 
## Call:  glm(formula = nomove ~ conc, family = binomial(link = "logit"))
## 
## Coefficients:
## (Intercept)         conc  
##      -6.469        5.567  
## 
## Degrees of Freedom: 29 Total (i.e. Null);  28 Residual
## Null Deviance:       41.46 
## Residual Deviance: 27.75     AIC: 31.75

The model parameters

fit %>% summary
## 
## Call:
## glm(formula = nomove ~ conc, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.76666  -0.74407   0.03413   0.68666   2.06900  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -6.469      2.418  -2.675  0.00748 **
## conc           5.567      2.044   2.724  0.00645 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.455  on 29  degrees of freedom
## Residual deviance: 27.754  on 28  degrees of freedom
## AIC: 31.754
## 
## Number of Fisher Scoring iterations: 5

The regression parameters

fit %>% summary %>% .$coefficients
##              Estimate Std. Error   z value    Pr(>|z|)
## (Intercept) -6.468675   2.418470 -2.674697 0.007479691
## conc         5.566762   2.043591  2.724010 0.006449448

With every unit increase in concentration conc, the log odds of not moving increase by 5.567. This increase can be considered different from zero, as the p-value is 0.0064.

In other words, an increase in conc will lower the probability of moving. We can verify this by modeling move instead of nomove:

anesthetic %$% 
  glm(move ~ conc, family = binomial(link="logit")) %>%
  summary %>% .$coefficients
##              Estimate Std. Error   z value    Pr(>|z|)
## (Intercept)  6.468675   2.418470  2.674697 0.007479691
## conc        -5.566762   2.043591 -2.724010 0.006449448
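
Because the parameters live on the log-odds scale, exponentiating them yields odds ratios, which are often easier to communicate:

exp(coef(fit)) # exp(5.567) is about 262: each unit increase in conc
               # multiplies the odds of not moving by about 262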

However...

Error

library(caret)
pred <- fit %>% predict(type = "response") # predicted probabilities of nomove
# classify with a 0.5 cutoff and cross-tabulate against the observed classes
confusionMatrix(data = factor(as.numeric(pred > 0.5), labels = c("move", "nomove")),
                reference = factor(anesthetic$nomove, labels = c("move", "nomove")))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction move nomove
##     move     10      2
##     nomove    4     14
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.6143, 0.9229)
##     No Information Rate : 0.5333          
##     P-Value [Acc > NIR] : 0.002316        
##                                           
##                   Kappa : 0.5946          
##                                           
##  Mcnemar's Test P-Value : 0.683091        
##                                           
##             Sensitivity : 0.7143          
##             Specificity : 0.8750          
##          Pos Pred Value : 0.8333          
##          Neg Pred Value : 0.7778          
##              Prevalence : 0.4667          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.4000          
##       Balanced Accuracy : 0.7946          
##                                           
##        'Positive' Class : move            
## 
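
For reference, the underlying cross-table can be reproduced with base R’s table(), which clarifies what confusionMatrix() counts:

table(predicted = ifelse(pred > 0.5, "nomove", "move"),
      observed  = ifelse(anesthetic$nomove == 1, "nomove", "move"))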

Issue

With the error of the model (or lack thereof) comes a problem.

  1. Is the (lack of) error due to the modeling?
  2. Would another model give us a different error?
  3. Is the (lack of) error due to the data?
  4. Would other data give us a different error?
    If (1) and (2) are the case -> model selection is needed to improve the model.
    But what if the better model is only better because of our data?
    If (3) and (4) are the case -> we need other data to validate that our model is reliable.

Some concepts

Training
If the model only fits well on the data it has been trained on, then we are overfitting. We have then successfully modeled not only the data, but also (much of) the noise.

  • overfitted models have little bias, but high variance
  • the opposite, underfitting, occurs when the model is too simple and cannot capture the structure of the data.
    • underfitted models have high bias, but low variance.

A great example is Borges’s Library of Babel. It contains every phrase, page, etc. that could ever be written in English. However, it is a most inefficient way of writing beautiful literature.

Testing
To avoid overfitting we can train the model on one data set and test its performance on another (separate) data set that comes from the same true data generating model.

Validation
If we are also optimizing hyperparameters, then the in-between step of validation makes sense. You then train the initial model on one data set, validate its optimization on another data set, and finally test its performance on a third data set.

On real data

Collecting multiple independent data sets to realize a true train/validate/test approach is a costly endeavor that takes up a lot of time and resources.

Alternative: splitting the observed data
We can randomly split the data into 2 parts: one part for training and one part for testing (a minimal sketch follows the list below)

  • The downside to this approach is that everything becomes split-dependent
    • in theory we could still obtain a good or bad performance because of the split.
    • influential values that are not part of the training data will skew the test performance
    • variance is usually low, but bias can be higher
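
A minimal sketch of such a split (the 70/30 ratio is a common choice, not a rule):

set.seed(123)
idx   <- sample(nrow(anesthetic), size = round(0.7 * nrow(anesthetic)))
train <- anesthetic[idx, ]  # 70% for training
test  <- anesthetic[-idx, ] # 30% for testing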

Solution: K-fold cross-validation
Partition the data into \(K\) folds and use each fold once as the test set and the remaining \(K-1\) folds as the training set.

  • You’ll have \(K\) estimates of the test accuracy
  • These \(K\) estimates balance bias and variance
  • A special case is \(K = N\), which is called leave-one-out cross-validation

Example

set.seed(123)
library(caret)

# define training control
train_control <- trainControl(method = "cv", number = 3, savePredictions = TRUE)

# train the model on training set
model <- train(as.factor(move) ~ conc,
               data = DAAG::anesthetic,
               trControl = train_control,
               method = "glm",
               family = binomial(link = "logit"))

# print cv scores
model$results
##   parameter  Accuracy     Kappa AccuracySD  KappaSD
## 1      none 0.7919192 0.5737505   0.121414 0.253953

The predictions

model$pred
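
The pred element holds the held-out predictions for every fold; with the columns caret stores when savePredictions = TRUE, a per-fold accuracy summary is one dplyr chain away:

model$pred %>%
  group_by(Resample) %>%                    # one group per fold
  summarise(accuracy = mean(pred == obs))   # held-out accuracy per fold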