Data Science and Predictive Machine Learning
library(magrittr) # pipes
library(dplyr)    # data manipulation
library(ggplot2)  # plotting
library(DAAG)     # data sets and functions
At this point we have covered the following models:
\[y=\alpha+\beta x+\epsilon\]
The relationship between a numerical outcome and a numerical or categorical predictor
\[y=\alpha+\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon\]
The relationship between a numerical outcome and multiple numerical or categorical predictors
We have not yet covered how to handle outcomes that are not continuous, or how to deal with predictors that are nonlinear or have a strict dependency structure.
Last week we covered the following topics:
Instead of modeling
\[y=\alpha+\beta x+\epsilon\]
we can also consider \[\mathbb{E}[y] = \alpha + \beta x\]
They are the same model, just written in a different notation and framed differently.
The upside is that we can now use a function for the expectation \(\mathbb{E}\) to allow for transformations. This would enable us to change \(\mathbb{E}[y]\) such that \(f(\mathbb{E}[y])\) has a linear relation with \(x\).
This is what we will be doing today
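For instance, with a binary outcome we could take \(f\) to be the logit function, which is the choice we will work towards today:
\[f(\mathbb{E}[y]) = \text{logit}(\mathbb{E}[y]) = \alpha + \beta x\]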
A simulated data set

To further illustrate why the linear model is not an appropriate model for discrete data, I propose the following simple simulated data set:
set.seed(123)
simulated <- data.frame(discrete = c(rep(0, 50), rep(1, 50)),
                        continuous = c(rnorm(50, 10, 3), rnorm(50, 15, 3)))
simulated %>% summary
##     discrete     continuous    
##  Min.   :0.0   Min.   : 4.100  
##  1st Qu.:0.0   1st Qu.: 9.656  
##  Median :0.5   Median :12.904  
##  Mean   :0.5   Mean   :12.771  
##  3rd Qu.:1.0   3rd Qu.:15.570  
##  Max.   :1.0   Max.   :21.562
This data allows us to illustrate modeling the relation between the discrete outcome and the continuous predictor with logistic regression.
Remember that fixing the random seed allows for a replicable random number generator sequence.
The simulated data

simulated %>%
  ggplot(aes(x = continuous, y = discrete)) +
  geom_point()
The simulated data with lm

simulated %>%
  ggplot(aes(x = continuous, y = discrete)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "orange")
The orange line represents the lm linear regression fit. It is not a good representation of our data, as it assumes the data are continuous and projects values outside of the range of the observed data.
The simulated data with glm

simulated %>%
  ggplot(aes(x = continuous, y = discrete)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "orange") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
The blue glm logistic regression line represents this data infinitely better than the orange lm line. It assumes the data to be 0 or 1 and does not project values outside of the range of the observed data.
There is a very general way of addressing this type of problem in regression. The models that use this general way are called generalized linear models (GLMs).
Every generalized linear model has the following three characteristics:

1. A probability distribution that describes the outcome
2. A linear predictor model
3. A link function that relates the linear predictor to the parameter of the outcome's probability distribution
The linear predictor model in (2) is
\[\eta = \bf{X}\beta\]
where \(\eta\) denotes the linear predictor, and the link function in (3) is
\[\bf{X}\beta = g(\mu)\]
The technique to model a binary outcome based on a set of continuous or discrete predictors is called logistic regression. Logistic regression is an example of a generalized linear model.
The link function for logistic regression is the logit link
\[\bf{X}\beta = \ln\left(\frac{\mu}{1 - \mu}\right)\]
where \[\mu = \frac{\text{exp}(\bf{X}\beta)}{1 + \text{exp}(\bf{X}\beta)} = \frac{1}{1 + \text{exp}(-\bf{X}\beta)}\]
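As a small numeric sketch, we can verify the logit link and its inverse directly in R (the objects p and eta below are just illustrative names):

# the logit link and its inverse, checked numerically
p   <- 0.75
eta <- log(p / (1 - p))    # logit(p); identical to qlogis(p)
exp(eta) / (1 + exp(eta))  # inverse logit; identical to plogis(eta); returns 0.75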
Before we continue with discussing the link function, we are first going to dive into the concept of odds.
Properly understanding odds is necessary to perform and interpret logistic regression, as the logit link is connected to the odds.
Odds are a way of quantifying the probability of an event \(E\). The odds for an event \(E\) are
\[\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1 - P(E)}\]
The odds of getting heads in a coin toss are
\[\text{odds}(\text{heads}) = \frac{P(\text{heads})}{P(\text{tails})} = \frac{P(\text{heads})}{1 - P(\text{heads})}\]
For a fair coin, this would result in
\[\text{odds}(\text{heads}) = \frac{.5}{1 - .5} = 1\]
The game Lingo has 44 balls: 36 blue, 6 red, and 2 green.
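For example, the odds of drawing a blue ball then follow directly from these counts:
\[\text{odds}(\text{blue}) = \frac{P(\text{blue})}{1 - P(\text{blue})} = \frac{36/44}{8/44} = 4.5\]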
Odds of 1 indicate an equal likelihood of the event occurring or not occurring. Odds < 1 indicate a lower likelihood of the event occurring vs. not occurring. Odds > 1 indicate a higher likelihood of the event occurring.
Remember that \[y=\alpha+\beta x+\epsilon,\]
and that \[\mathbb{E}[y] = \alpha + \beta x.\]
As a result
\[y = \mathbb{E}[y] + \epsilon,\]
and residuals do not need to be normal (heck, \(y\) probably isn't, so why should \(\epsilon\) be?).
Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors.
In logistic regression we assume that the true data generating model for the outcome variable follows a binomial distribution.
We specify a reasonable link that connects \(\eta\) to \(p\). The most common choice in logistic regression is the logit link
\[\text{logit}(p)=\log\left(\frac{p}{1-p}\right), \text{ for } 0 < p < 1\]
We might recognize \(\frac{p}{1-p}\) as the odds.
We can now visualize the relation between our predictor(s) and the log-odds, as well as the relation between our predictor(s) and the probability.
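A minimal sketch of how we could draw both relations ourselves, using the simulated data from before (the object names fit_sim and plot_data are just illustrative):

fit_sim <- glm(discrete ~ continuous, family = binomial(link = "logit"),
               data = simulated)
plot_data <- data.frame(continuous  = simulated$continuous,
                        logodds     = predict(fit_sim, type = "link"),
                        probability = predict(fit_sim, type = "response"))
# the log-odds are a linear function of the predictor
plot_data %>% ggplot(aes(x = continuous, y = logodds)) + geom_line()
# the probability follows an S-shaped (logistic) curve
plot_data %>% ggplot(aes(x = continuous, y = probability)) + geom_line()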
With linear regression we had the Sum of Squares (SS). Its logistic counterpart is the Deviance (D).

With logistic regression we aim to maximize the likelihood, which is equivalent to minimizing the deviance.
The likelihood is the (joint) probability of the observed values, given the current model parameters.
In normally distributed data: \(\text{SS}=\text{D}\).
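A quick check of this equivalence on the simulated data (lmfit is just an illustrative name):

# for a Gaussian (linear) model the deviance equals the residual sum of squares
lmfit <- lm(continuous ~ discrete, data = simulated)
deviance(lmfit)          # deviance
sum(residuals(lmfit)^2)  # sum of squares; returns the same value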
Remember the three characteristics of every generalized linear model:

1. A probability distribution that describes the outcome
2. A linear predictor model
3. A link function that relates the linear predictor to the parameter of the outcome's probability distribution
For the logistic model this gives us:

1. The binomial distribution for the outcome
2. The linear predictor \(\eta = \beta_0 + \beta_1x_{1,i} + \dots + \beta_nx_{n,i}\)
3. The logit link \(\eta = \text{logit}(p_i) = \log\left(\frac{p_i}{1-p_i}\right)\)
Simple substitution brings us to
\[p_i = \frac{\text{exp}(\eta)}{1+\text{exp}(\eta)} = \frac{\text{exp}(\beta_0 + \beta_1x_{1,i} + \dots + \beta_nx_{n,i})}{1+\text{exp}(\beta_0 + \beta_1x_{1,i} + \dots + \beta_nx_{n,i})}\]
The anesthetic data

anesthetic %>% head(n = 10)
##    move conc    logconc nomove
## 1     0  1.0  0.0000000      1
## 2     1  1.2  0.1823216      0
## 3     0  1.4  0.3364722      1
## 4     1  1.4  0.3364722      0
## 5     1  1.2  0.1823216      0
## 6     0  2.5  0.9162907      1
## 7     0  1.6  0.4700036      1
## 8     1  0.8 -0.2231436      0
## 9     0  1.6  0.4700036      1
## 10    1  1.4  0.3364722      0
Thirty patients were given an anesthetic agent, maintained at a predetermined concentration (conc), for 15 minutes before an incision was made. It was then noted whether the patient moved, i.e. jerked or twisted.
Fitting a glm in R is not much different from fitting a lm. We do, however, need to specify what type of glm to use, by specifying both the family and the type of link function we need.

For logistic regression we need the binomial family, as the binomial distribution is the probability distribution that describes our outcome. We also use the logit link, which is the default for the binomial glm family.
fit <- anesthetic %$%
  glm(nomove ~ conc, family = binomial(link = "logit"))
fit
## 
## Call:  glm(formula = nomove ~ conc, family = binomial(link = "logit"))
## 
## Coefficients:
## (Intercept)         conc
##      -6.469        5.567
## 
## Degrees of Freedom: 29 Total (i.e. Null);  28 Residual
## Null Deviance:     41.46
## Residual Deviance: 27.75    AIC: 31.75
fit %>% summary
## 
## Call:
## glm(formula = nomove ~ conc, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.76666  -0.74407   0.03413   0.68666   2.06900  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -6.469      2.418  -2.675  0.00748 **
## conc           5.567      2.044   2.724  0.00645 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.455  on 29  degrees of freedom
## Residual deviance: 27.754  on 28  degrees of freedom
## AIC: 31.754
## 
## Number of Fisher Scoring iterations: 5
fit %>% summary %>% .$coefficients
##              Estimate Std. Error   z value    Pr(>|z|)
## (Intercept) -6.468675   2.418470 -2.674697 0.007479691
## conc         5.566762   2.043591  2.724010 0.006449448
With every unit increase in concentration (conc), the log-odds of not moving increases by 5.5667617. This increase can be considered different from zero, as the p-value is 0.0064494.
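To interpret this effect on the odds scale rather than the log-odds scale, we can exponentiate the coefficients; a minimal sketch:

# odds ratios: exponentiate the log-odds coefficients
# for conc this gives exp(5.567), roughly 262: each unit increase in
# concentration multiplies the odds of not moving by about 262
fit %>% coef %>% exp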
In other words, an increase in conc will lower the probability of moving. We can verify this by modeling move instead of nomove:
anesthetic %$%
  glm(move ~ conc, family = binomial(link = "logit")) %>%
  summary %>%
  .$coefficients
##              Estimate Std. Error   z value    Pr(>|z|)
## (Intercept)  6.468675   2.418470  2.674697 0.007479691
## conc        -5.566762   2.043591 -2.724010 0.006449448
library(caret)
pred <- fit %>% predict(type = "response")
confusionMatrix(data = factor(as.numeric(pred > 0.5), labels = c("move", "nomove")),
                reference = factor(anesthetic$nomove, labels = c("move", "nomove")))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction move nomove
##     move     10      2
##     nomove    4     14
## 
##                Accuracy : 0.8             
##                  95% CI : (0.6143, 0.9229)
##     No Information Rate : 0.5333          
##     P-Value [Acc > NIR] : 0.002316        
## 
##                   Kappa : 0.5946          
## 
##  Mcnemar's Test P-Value : 0.683091        
## 
##             Sensitivity : 0.7143          
##             Specificity : 0.8750          
##          Pos Pred Value : 0.8333          
##          Neg Pred Value : 0.7778          
##              Prevalence : 0.4667          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.4000          
##       Balanced Accuracy : 0.7946          
## 
##        'Positive' Class : move            
## 
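As a quick sanity check, reusing the pred object from above, the reported accuracy can also be computed by hand:

# proportion of correct predictions at the 0.5 threshold: (10 + 14) / 30 = 0.8
mean(as.numeric(pred > 0.5) == anesthetic$nomove)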
With the error of the model (or lack thereof) comes a problem: how do we know that a seemingly better model is not better merely because of the data it was fit on?

Training

If the model only fits well on the data it has been trained on, then we are overfitting. We have then successfully modeled not only the data, but also (much of) the noise.
A great example is the Library of Babel. It contains every phrase, page, etc. that will ever be written in English. However, it is a most inefficient way of writing beautiful literature.
Testing
To avoid overfitting we can train the model on one data set and test its performance on another (separate) data set that comes from the same true data-generating model.
Validation
If we are also optimizing hyperparameters, then the in-between step of validation makes sense. You then train the initial model on one data set, validate its optimization on another data set, and finally test its performance on the last data set.
Collecting multiple independent data sets to realize a true train/validate/test approach is a costly endeavor that takes up a lot of time and resources.
Alternative: splitting the observed data
We can randomly split the data into two parts: one part for training and one part for testing, as sketched below.
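A minimal sketch of such a split on the anesthetic data, fitting on the training part and evaluating on the test part (the 70/30 proportion and the object names are arbitrary choices for illustration):

set.seed(123)
train_idx <- sample(nrow(anesthetic), size = round(0.7 * nrow(anesthetic)))
anesthetic_train <- anesthetic[train_idx, ]
anesthetic_test  <- anesthetic[-train_idx, ]

fit_train <- glm(nomove ~ conc, family = binomial(link = "logit"),
                 data = anesthetic_train)
test_pred <- predict(fit_train, newdata = anesthetic_test, type = "response")
# out-of-sample accuracy at the 0.5 threshold
mean(as.numeric(test_pred > 0.5) == anesthetic_test$nomove)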
A single split, however, uses only part of the data for fitting, and its result depends on which observations happen to end up in the test set.

Solution: K-fold cross-validation
Partition the data into \(K\) folds and use each fold once as the test set and the remaining \(K-1\) folds as the training set.
set.seed(123)
library(caret)

# define training control
train_control <- trainControl(method = "cv", number = 3, savePredictions = TRUE)

# train the model on training set
model <- train(as.factor(move) ~ conc,
               data = DAAG::anesthetic,
               trControl = train_control,
               method = "glm",
               family = binomial(link = "logit"))

# print cv scores
model$results
##   parameter  Accuracy     Kappa AccuracySD  KappaSD
## 1      none 0.7919192 0.5737505   0.121414 0.253953
model$pred