Fundamental Techniques in Data Science with R
library(magrittr) # pipes
library(dplyr)    # data manipulation
library(mice)     # data
library(lattice)  # plotting - used for conditional plotting
library(ggplot2)  # plotting
library(DAAG)     # data sets and functions
library(caret)    # confusion matrices
With logistic regression we can model a binary outcome on a set of continuous, binary and/or categorical predictors.
If we were to use linear regression with a binary outcome variable, we would create a linear probability model.
Because the linear probability model violates the assumptions of linear regression when fitted to a binary outcome, the standard errors and hypothesis tests are not valid. In short, you may draw invalid conclusions.
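A minimal sketch on simulated data (not the Titanic set) of why this goes wrong: lm() predictions are not constrained to the [0, 1] probability range, whereas glm() with a logit link always returns probabilities in (0, 1).

# simulate a binary outcome that depends on a continuous predictor
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(2 * x))
lm(y ~ x) %>% predict %>% range                                         # can fall outside [0, 1]
glm(y ~ x, family = binomial) %>% predict(type = "response") %>% range # stays within (0, 1)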
We start this lecture with a data set that logs the survival of passengers aboard the ocean liner Titanic on its disastrous maiden voyage.
titanic <- read.csv(file = "titanic.csv", header = TRUE, stringsAsFactors = TRUE)
titanic %>% head
##   Survived Pclass                                               Name    Sex Age
## 1        0      3                             Mr. Owen Harris Braund   male  22
## 2        1      1 Mrs. John Bradley (Florence Briggs Thayer) Cumings female  38
## 3        1      3                              Miss. Laina Heikkinen female  26
## 4        1      1        Mrs. Jacques Heath (Lily May Peel) Futrelle female  35
## 5        0      3                            Mr. William Henry Allen   male  35
## 6        0      3                                    Mr. James Moran   male  27
##   Siblings.Spouses.Aboard Parents.Children.Aboard    Fare
## 1                       1                       0  7.2500
## 2                       1                       0 71.2833
## 3                       0                       0  7.9250
## 4                       1                       0 53.1000
## 5                       0                       0  8.0500
## 6                       0                       0  8.4583
str(titanic)
## 'data.frame':    887 obs. of  8 variables:
##  $ Survived               : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass                 : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name                   : Factor w/ 887 levels "Capt. Edward Gifford Crosby",..: 602 823 172 814 733 464 700 33 842 839 ...
##  $ Sex                    : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age                    : num  22 38 26 35 35 27 54 2 27 14 ...
##  $ Siblings.Spouses.Aboard: int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parents.Children.Aboard: int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare                   : num  7.25 71.28 7.92 53.1 8.05 ...
We have information on the following features.
Our outcome/dependent variable:

- Survived

Some potential predictors:

- Sex: c(male, female)
- Age

and more.
We can start investigating whether there are patterns in this data that relate to the probability of survival.
For example, based on the credo "women and children first", we could hypothesize that

- Age relates to the probability of survival, in that younger passengers have a higher probability of survival
- Sex relates to survival, in that females have a higher probability of survival

Based on socio-economic status, we could hypothesize that

- Pclass relates to the probability of survival, in that a higher travel class leads to a higher probability of survival

And so on.
How are Age and survival related?

titanic %>% 
  ggplot(aes(x = Age, y = Survived)) + 
  geom_point() + 
  geom_smooth(method = "glm", 
              method.args = list(family = "binomial"), 
              se = FALSE) + 
  xlim(-1, 100)
We can also cross-tabulate Pclass against Survived:

titanic %$% table(Pclass, Survived)
##       Survived
## Pclass   0   1
##      1  80 136
##      2  97  87
##      3 368 119
It seems that the higher the class (i.e. 1 > 2 > 3), the higher the probability of survival.
We can verify this with row-wise proportions:
titanic %$% 
  table(Pclass, Survived) %>% 
  prop.table(margin = 1) %>% 
  round(digits = 2)
##       Survived
## Pclass    0    1
##      1 0.37 0.63
##      2 0.53 0.47
##      3 0.76 0.24
Let's fit these models, gradually building up the number of predictors.
fit1 <- titanic %$% glm(Survived ~ Age,    family = binomial(link = "logit"))
fit2 <- titanic %$% glm(Survived ~ Sex,    family = binomial(link = "logit"))
fit3 <- titanic %$% glm(Survived ~ Pclass, family = binomial(link = "logit"))
Survived ~ Age
titanic %$% histogram(~ Age | Survived == 1)
The distribution of Age for the survivors (TRUE) is different from the distribution of Age for the non-survivors (FALSE). Especially at the younger end there is a point mass for the survivors, which indicates that children have a higher probability of survival. However, the distributions are not dramatically different.
Survived ~ Age
fit1 %>% summary %>% .$coefficients
##                 Estimate  Std. Error   z value   Pr(>|z|)
## (Intercept) -0.209188757 0.159493684 -1.311580 0.18966182
## Age         -0.008774355 0.004947362 -1.773542 0.07613892
We can see that there is a trend: the log-odds of Survived decrease with increasing Age. For every year increase in age, the log-odds of Survived decrease by 0.0087744. However, this decrease is not very convincing given the effect, the standard error, and the size of the data. Hence, the p-value is somewhat on the larger side.
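To see how modest this effect is on the probability scale, we can compare predicted survival probabilities at two ages (a small illustration; the approximate values in the comment follow from the coefficients above):

predict(fit1, newdata = data.frame(Age = c(20, 60)), type = "response")
# roughly 0.41 at age 20 versus 0.32 at age 60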
When we inspect the deviance:
c(fit1$null.deviance, fit1$deviance)
## [1] 1182.770 1179.593
We see that there is almost no decrease in deviance between the empty model (i.e. the model with only an intercept) and the model with Age included. The difference is only 1 degree of freedom (one extra parameter).
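We can test this drop in deviance formally with a likelihood-ratio test; calling anova() on a fitted glm compares the null model against the model with Age:

anova(fit1, test = "Chisq")
# deviance difference: 1182.770 - 1179.593 = 3.177 on 1 df, p ≈ .075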
Survived ~ Sex
titanic %$% histogram(~ Sex | Survived == 1)
Wow! These distributions are very different! Females seem to have a much higher probability of survival.
Survived ~ Sex
fit2 %>% summary %>% coef
##              Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  1.056589  0.1289864   8.191477 2.580394e-16
## Sexmale     -2.505126  0.1672333 -14.979827 9.947556e-51
The coefficients confirm this. The log-odds of Survived decrease for males, when compared to females. The decrease is quite pronounced, too: -2.505126.
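Log-odds are hard to interpret directly. Exponentiating the coefficients turns them into odds ratios:

exp(coef(fit2))
# exp(-2.505126) ≈ 0.082: the odds of survival for males are
# roughly 8% of the odds of survival for females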
When we inspect the deviance:
c(fit2$null.deviance, fit2$deviance)
## [1] 1182.7699 916.1224
We see that there is a massive drop in deviance between the empty model (i.e. the model with only an intercept, represented by null.deviance) and the model with Sex included. The difference is again only 1 degree of freedom.
Survived ~ Pclass
titanic %$% histogram(~ Pclass | Survived == 1)
There is a very apparent difference between the distributions of the survivors and non-survivors over the classes. For example, relatively many of the survivors traveled 1st class, while the large majority of the non-survivors traveled 3rd class.
Survived ~ Pclass
fit3 %>% summary %>% coef
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)  1.4390660 0.20742485  6.937771 3.983356e-12
## Pclass      -0.8442286 0.08719006 -9.682624 3.574230e-22
Here, something is wrong! Class is an ordered categorical variable, not a continuous one. In other words, we cannot simply decrease the log-odds of Survived by 0.8442286 for every unit increase in Pclass. If we were to interpret these coefficients 'as is', we would assume that class 2 is twice the value of class 1, that class 3 is 1.5 times the value of class 2, and so on.
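Treating Pclass as a factor solves this, because each class beyond the reference level then gets its own dummy variable and its own coefficient. A minimal sketch with model.matrix() (the tiny data frame inside is just illustrative):

model.matrix(~ factor(Pclass), data = data.frame(Pclass = c(1, 2, 3)))
# columns: (Intercept), factor(Pclass)2, factor(Pclass)3 --
# one separate dummy per non-reference class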
titanic %<>% 
  mutate(Pclass = factor(Pclass, levels = c(3, 2, 1), ordered = FALSE))
str(titanic)
## 'data.frame':    887 obs. of  8 variables:
##  $ Survived               : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass                 : Factor w/ 3 levels "3","2","1": 1 3 1 3 1 1 3 1 1 2 ...
##  $ Name                   : Factor w/ 887 levels "Capt. Edward Gifford Crosby",..: 602 823 172 814 733 464 700 33 842 839 ...
##  $ Sex                    : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age                    : num  22 38 26 35 35 27 54 2 27 14 ...
##  $ Siblings.Spouses.Aboard: int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parents.Children.Aboard: int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare                   : num  7.25 71.28 7.92 53.1 8.05 ...
The Pclass column is now correctly coded as a factor. We ignore the ordering, as this goes beyond the scope of the course.
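The summary below comes from refitting the model with all three predictors after this recoding; the exact call is visible in its Call: line.

fit <- titanic %$% 
  glm(Survived ~ Age + Sex + Pclass, family = binomial(link = "logit"))
fit %>% summary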
## 
## Call:
## glm(formula = Survived ~ Age + Sex + Pclass, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6811  -0.6653  -0.4137   0.6367   2.4505  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.17948    0.22784   5.177 2.26e-07 ***
## Age         -0.03427    0.00716  -4.787 1.69e-06 ***
## Sexmale     -2.58872    0.18701 -13.843  < 2e-16 ***
## Pclass2      1.25633    0.22646   5.548 2.90e-08 ***
## Pclass1      2.45544    0.25322   9.697  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1182.77  on 886  degrees of freedom
## Residual deviance:  801.59  on 882  degrees of freedom
## AIC: 811.59
## 
## Number of Fisher Scoring iterations: 5
fit.interaction <- titanic %$% 
  glm(Survived ~ Age * Sex * Pclass, family = binomial(link = "logit"))
fit.interaction %>% summary %>% .$coefficients
##                        Estimate Std. Error    z value    Pr(>|z|)
## (Intercept)          0.38542858 0.35158572  1.0962578 0.272965982
## Age                 -0.01742787 0.01399943 -1.2448985 0.213169059
## Sexmale             -1.24102347 0.51966698 -2.3881130 0.016935134
## Pclass2              3.66379424 1.38138966  2.6522525 0.007995671
## Pclass1              1.11218683 1.49587117  0.7435044 0.457176346
## Age:Sexmale         -0.02261191 0.02066970 -1.0939639 0.273970802
## Age:Pclass2         -0.03196246 0.03845267 -0.8312158 0.405851754
## Age:Pclass1          0.08036169 0.05283156  1.5210925 0.128236622
## Sexmale:Pclass2     -1.91119761 1.57587112 -1.2127880 0.225210878
## Sexmale:Pclass1      0.81487712 1.66024791  0.4908165 0.623556217
## Age:Sexmale:Pclass2 -0.03128163 0.04938976 -0.6333627 0.526496788
## Age:Sexmale:Pclass1 -0.08001824 0.05687997 -1.4067912 0.159489308
Now most of the main effects are no longer significant. Part of the explained variance (the differences between groups) has flowed into the interaction effects.
An interaction occurs when the (causal) effect of one predictor on the outcome depends on the level of the (causal) effect of another predictor.
E.g. the relation between body temperature and air temperature depends on the species.
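In R's formula language, writing Age * Sex expands to Age + Sex + Age:Sex; the interaction column in the design matrix is simply the product of Age and the Sex dummy. A small sketch:

model.matrix(~ Age * Sex, data = titanic) %>% head
# columns: (Intercept), Age, Sexmale, Age:Sexmale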
To illustrate, I will limit this investigation to Age and Pclass, for males only.
We can use the predict() function to illustrate the conditional probabilities within each class. To do so, we need to create a new data frame that contains all the combinations of predictors we need.
new <- data.frame(Pclass = factor(rep(c(1, 2, 3), c(80, 80, 80))), 
                  Age    = rep(1:80, times = 3), 
                  Sex    = rep("male", times = 240))
new <- cbind(new, predict(fit.interaction, newdata = new, type = "link", se = TRUE))
head(new)
##   Pclass Age  Sex       fit    se.fit residual.scale
## 1      1   1 male 1.0317727 0.5963908              1
## 2      1   2 male 0.9920764 0.5826134              1
## 3      1   3 male 0.9523801 0.5688767              1
## 4      1   4 male 0.9126838 0.5551835              1
## 5      1   5 male 0.8729874 0.5415373              1
## 6      1   6 male 0.8332911 0.5279416              1
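As an aside: expand.grid() builds the same crossing of predictor values more compactly (new_alt is a hypothetical name; only the row order differs from new above):

new_alt <- expand.grid(Age    = 1:80, 
                       Pclass = factor(c(1, 2, 3)), 
                       Sex    = "male")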
There are two simple approaches to obtain the predicted probabilities. First, we could simply ask for the predicted response:
new$prob <- predict(fit.interaction, newdata = new, type = "response")
Or we could use the distribution function plogis():
new$prob2 <- plogis(new$fit)
head(new)
##   Pclass Age  Sex       fit    se.fit residual.scale      prob     prob2
## 1      1   1 male 1.0317727 0.5963908              1 0.7372594 0.7372594
## 2      1   2 male 0.9920764 0.5826134              1 0.7294979 0.7294979
## 3      1   3 male 0.9523801 0.5688767              1 0.7215936 0.7215936
## 4      1   4 male 0.9126838 0.5551835              1 0.7135490 0.7135490
## 5      1   5 male 0.8729874 0.5415373              1 0.7053669 0.7053669
## 6      1   6 male 0.8332911 0.5279416              1 0.6970504 0.6970504
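Both routes are equivalent, since type = "response" applies the inverse-logit to the linear predictor. A quick sanity check:

all.equal(new$prob, new$prob2)   # TRUE: identical probabilities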
To obtain 95% confidence bounds on the probability scale, we compute the interval on the link (log-odds) scale and transform the bounds with plogis():

new %<>% 
  mutate(lower = plogis(fit - 1.96 * se.fit),
         upper = plogis(fit + 1.96 * se.fit))
head(new)
##   Pclass Age  Sex       fit    se.fit residual.scale      prob     prob2
## 1      1   1 male 1.0317727 0.5963908              1 0.7372594 0.7372594
## 2      1   2 male 0.9920764 0.5826134              1 0.7294979 0.7294979
## 3      1   3 male 0.9523801 0.5688767              1 0.7215936 0.7215936
## 4      1   4 male 0.9126838 0.5551835              1 0.7135490 0.7135490
## 5      1   5 male 0.8729874 0.5415373              1 0.7053669 0.7053669
## 6      1   6 male 0.8332911 0.5279416              1 0.6970504 0.6970504
##       lower     upper
## 1 0.4657654 0.9003122
## 2 0.4626085 0.8941640
## 3 0.4594348 0.8876919
## 4 0.4562432 0.8808856
## 5 0.4530325 0.8737357
## 6 0.4498011 0.8662333
A data frame with simulated Pclass and Age for males.
new %>% summary()
##  Pclass      Age            Sex                 fit               se.fit      
##  1:80   Min.   : 1.00   Length:240         Min.   :-7.3657   Min.   :0.1588  
##  2:80   1st Qu.:20.75   Class :character   1st Qu.:-3.2680   1st Qu.:0.2879  
##  3:80   Median :40.50   Mode  :character   Median :-1.7875   Median :0.4157  
##         Mean   :40.50                      Mean   :-2.0998   Mean   :0.5133  
##         3rd Qu.:60.25                      3rd Qu.:-0.7446   3rd Qu.:0.6093  
##         Max.   :80.00                      Max.   : 1.0318   Max.   :1.6068  
##  residual.scale      prob                prob2               lower           
##  Min.   :1      Min.   :0.0006322   Min.   :0.0006322   Min.   :0.0000271  
##  1st Qu.:1      1st Qu.:0.0366906   1st Qu.:0.0366906   1st Qu.:0.0118488  
##  Median :1      Median :0.1433767   Median :0.1433767   Median :0.0826488  
##  Mean   :1      Mean   :0.2129610   Mean   :0.2129610   Mean   :0.1360227  
##  3rd Qu.:1      3rd Qu.:0.3220022   3rd Qu.:0.3220022   3rd Qu.:0.2223781  
##  Max.   :1      Max.   :0.7372594   Max.   :0.7372594   Max.   :0.4657654  
##      upper        
##  Min.   :0.01454  
##  1st Qu.:0.10794  
##  Median :0.25041  
##  Mean   :0.31211  
##  3rd Qu.:0.44139  
##  Max.   :0.90031  
new %>% 
  ggplot(aes(x = Age, y = fit)) + 
  geom_line(aes(colour = Pclass), lwd = 1)
new %>% 
  ggplot(aes(x = Age, y = prob)) + 
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = Pclass), alpha = .2) + 
  geom_line(aes(colour = Pclass), lwd = 1) + 
  ylab("Probability of Survival")
To repeat the previous scenario for the model where only main effects are included:
fit <- titanic %$% 
  glm(Survived ~ Age + Sex + Pclass, family = binomial(link = "logit"))
new2 <- new %>% select(Age, Pclass, Sex)
new2 <- cbind(new2, predict(fit, newdata = new2, type = "link", se = TRUE))
new2 %<>% 
  mutate(prob  = plogis(fit),
         lower = plogis(fit - 1.96 * se.fit),
         upper = plogis(fit + 1.96 * se.fit))
new2 %>% 
  ggplot(aes(x = Age, y = fit)) + 
  geom_line(aes(colour = Pclass), lwd = 1)
new2 %>% 
  ggplot(aes(x = Age, y = prob)) + 
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = Pclass), alpha = .2) + 
  geom_line(aes(colour = Pclass), lwd = 1) + 
  ylab("Probability of Survival")
anova(fit, fit.interaction, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Survived ~ Age + Sex + Pclass
## Model 2: Survived ~ Age * Sex * Pclass
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       882     801.59                          
## 2       875     755.18  7   46.413 7.263e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction model fits the data much better.
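The same test by hand: the drop in deviance, referred to a chi-squared distribution with 7 degrees of freedom:

dev_diff <- fit$deviance - fit.interaction$deviance   # 801.59 - 755.18 ≈ 46.41
pchisq(dev_diff, df = 7, lower.tail = FALSE)          # ≈ 7.3e-08, as in the table above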
AIC(fit, fit.interaction)
##                 df      AIC
## fit              5 811.5940
## fit.interaction 12 779.1805
We can now generate predicted probabilities for the observed data:

pred <- predict(fit, type = "response")
head(pred)
##          1          2          3          4          5          6 
## 0.10309564 0.91153082 0.57158684 0.91947951 0.06856994 0.08829226
To turn these probabilities into class predictions, we threshold at 0.5:

pred <- ifelse(pred > 0.5, yes = 1, no = 0)
head(pred)
## 1 2 3 4 5 6 
## 0 1 1 1 0 0
Tabulating the predictions against the observed outcomes gives the confusion matrix:

CM <- table(pred, titanic$Survived)
CM
##     
## pred   0   1
##    0 463 100
##    1  82 242
The accuracy can then be calculated as the percentage of correct predictions, i.e. the sum over the diagonal elements divided by the total number of cases:
correct <- CM %>% diag %>% sum
correct / sum(CM)
## [1] 0.794814
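Other diagnostics can be read off the same table. For instance, with class 0 treated as the 'positive' class (as caret does in the output below), sensitivity and specificity follow directly; this is a hedged by-hand check of the confusionMatrix() output that follows:

sens <- CM[1, 1] / sum(CM[, 1])   # 463 / (463 + 82)  ≈ 0.85
spec <- CM[2, 2] / sum(CM[, 2])   # 242 / (100 + 242) ≈ 0.71
c(sensitivity = sens, specificity = spec)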
confusionMatrix(as.factor(pred), reference = as.factor(titanic$Survived))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 463 100
##          1  82 242
## 
##                Accuracy : 0.7948
##                  95% CI : (0.7667, 0.8209)
##     No Information Rate : 0.6144
##     P-Value [Acc > NIR] : <2e-16
## 
##                   Kappa : 0.5627
## 
##  Mcnemar's Test P-Value : 0.2076
## 
##             Sensitivity : 0.8495
##             Specificity : 0.7076
##          Pos Pred Value : 0.8224
##          Neg Pred Value : 0.7469
##              Prevalence : 0.6144
##          Detection Rate : 0.5220
##    Detection Prevalence : 0.6347
##       Balanced Accuracy : 0.7786
## 
##        'Positive' Class : 0
## 
##                        Survived not.Survived
## Predicted Survived            A            B
## Predicted not.Survived        C            D

- A: cases of Survived that have been correctly predicted as Survived.
- B: cases of not Survived that have been incorrectly predicted as Survived.
- C: cases of Survived that have been incorrectly predicted as not Survived.
- D: cases of not Survived that have been correctly predicted as not Survived.

fit <- glm(Survived ~ Age + Sex + Pclass, family = binomial(link = "logit"), data = titanic)
set.seed(123)
cv <- CVbinary(fit)
## 
## Fold:  3 10 2 6 5 4 9 8 7 1
## Internal estimate of accuracy = 0.795
## Cross-validation estimate of accuracy = 0.793
The accuracy after crossvalidation is more or less the same as the accuracy on the data as a whole. This indicates that the fitted model may be useful for new data from the same population; i.e. the model may generalize to new data.
The CVbinary() function relies on K-fold crossvalidation. This type of crossvalidation randomly splits the data into folds (parts). To allow you to replicate these results, I fix the random seed.
set.seed(123)
CVbinary(fit, nfolds = 5)
## 
## Fold:  3 2 5 4 1
## Internal estimate of accuracy = 0.795
## Cross-validation estimate of accuracy = 0.797
CVbinary(fit, nfolds = 2)
## 
## Fold:  1 2
## Internal estimate of accuracy = 0.795
## Cross-validation estimate of accuracy = 0.782