Fundamental Techniques in Data Science with R

Packages and functions used

library(magrittr) # pipes
library(dplyr)    # data manipulation
library(mice)     # data
library(lattice)  # plotting - used for conditional plotting
library(ggplot2)  # plotting
library(DAAG)     # data sets and functions
library(caret)    # confusion matrices

Recap

With logistic regression we can model a binary outcome on a set of continuous, binary and/or categorical predictors.

If we would use linear regression with a binary outcome variable, we would create a linear probability model

  • The linear probability model is a way to describe the conditional probabilities
  • It is not a good way:
    • the residual variance is not constant (violation of homoscedasticity)
    • the residuals are not normally distributed

Because the linear probability model violates its assumptions when performed on a binary outcome variable, the standard errors and hypothesis tests are not valid. In short, you may draw invalid conclusions.

Titanic data

Example: titanic data

We start this lecture with a data set that logs the survival of passengers on board of the disastrous maiden voyage of the ocean liner Titanic

titanic <- read.csv(file = "titanic.csv", header = TRUE, stringsAsFactors = TRUE)
titanic %>% head
##   Survived Pclass                                               Name    Sex Age
## 1        0      3                             Mr. Owen Harris Braund   male  22
## 2        1      1 Mrs. John Bradley (Florence Briggs Thayer) Cumings female  38
## 3        1      3                              Miss. Laina Heikkinen female  26
## 4        1      1        Mrs. Jacques Heath (Lily May Peel) Futrelle female  35
## 5        0      3                            Mr. William Henry Allen   male  35
## 6        0      3                                    Mr. James Moran   male  27
##   Siblings.Spouses.Aboard Parents.Children.Aboard    Fare
## 1                       1                       0  7.2500
## 2                       1                       0 71.2833
## 3                       0                       0  7.9250
## 4                       1                       0 53.1000
## 5                       0                       0  8.0500
## 6                       0                       0  8.4583

Inspect the data set

str(titanic)
## 'data.frame':    887 obs. of  8 variables:
##  $ Survived               : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass                 : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name                   : Factor w/ 887 levels "Capt. Edward Gifford Crosby",..: 602 823 172 814 733 464 700 33 842 839 ...
##  $ Sex                    : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age                    : num  22 38 26 35 35 27 54 2 27 14 ...
##  $ Siblings.Spouses.Aboard: int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parents.Children.Aboard: int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare                   : num  7.25 71.28 7.92 53.1 8.05 ...

What sources of information

We have information on the following features.

Our outcome/dependent variable:

  • Survived: yes or no

Some potential predictors:

  • Sex: the passenger’s gender coded as c(male, female)
  • Pclass: the class the passenger traveled in
  • Age: the passenger’s age in years
  • Siblings.Spouses.Aboard: if siblings or spouses were also aboard
  • Parents.Children.Aboard: if the passenger’s parents or children were aboard

and more.

Hypothetically

We can start investigating if there are patterns in this data that are related to the survival probability.

For example, we could hypothesize based on the crede “women and children first” that

  • Age relates to the probability of survival in that younger passengers have a higher probability of survival
  • Sex relates to survival in that females have a higher probability of survival

Based on socio-economic status, we could hypothesize that

  • Pclass relates to the probability of survival in that higher travel class leads to a higher probability of survival

And so on.

A quick investigation

Is Age related?

Inspecting the data

titanic %$% table(Pclass, Survived)
##       Survived
## Pclass   0   1
##      1  80 136
##      2  97  87
##      3 368 119

It seems that the higher the class (i.e. 1 > 2 > 3), the higher the probability of survival.

We can verify this

titanic %$% table(Pclass, Survived) %>% prop.table(margin = 1) %>% round(digits = 2)
##       Survived
## Pclass    0    1
##      1 0.37 0.63
##      2 0.53 0.47
##      3 0.76 0.24

A more thorough inspection

Fitting the titanic models

Let’s fit these models and gradually build them up in the number of predictors.

fit1 <- titanic %$% 
  glm(Survived ~ Age, family = binomial(link="logit"))

fit2 <- titanic %$% 
  glm(Survived ~ Sex, family = binomial(link="logit"))

fit3 <- titanic %$% 
  glm(Survived ~ Pclass, family = binomial(link="logit"))

Survived ~ Age

titanic %$% histogram(~ Age | Survived == 1)

The distribution of Age for the survivors (TRUE) is different from the distribution of Age for the non-survivors (FALSE). Especially at the younger end there is a point mass for the survivors, which indicates that children have a higher probability of survival. However, it is not dramatically different.

Survived ~ Age

fit1 %>% summary %>% .$coefficients
##                 Estimate  Std. Error   z value   Pr(>|z|)
## (Intercept) -0.209188757 0.159493684 -1.311580 0.18966182
## Age         -0.008774355 0.004947362 -1.773542 0.07613892

We can see that there is a trend. The log odds of Survived decreases with increasing Age. For every year increase in age, the log-odds of Survived decreases with -0.0087744. However, this decrease is not too convincing given the effect, the standard error and the size of the data. Hence, the p-value is a little on the larger side.

When we inspect the deviance

c(fit1$null.deviance, fit1$deviance)
## [1] 1182.770 1179.593

We see that there is almost no decrease in deviance between the empty model (i.e. the model with only an intercept) and the model with Age included. The difference in df is only 1 parameter.

Survived ~ Sex

titanic %$% histogram(~ Sex | Survived == 1)

Wow! These distributions are very different! Females seem to have a much higher probability of survival.

Survived ~ Sex

fit2 %>% summary %>% coef
##              Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  1.056589  0.1289864   8.191477 2.580394e-16
## Sexmale     -2.505126  0.1672333 -14.979827 9.947556e-51

The coefficients confirm this. The log odds of Survived decreases for males, when compared to females. The decrease is quite pronounced, too: -2.505126.

When we inspect the deviance

c(fit2$null.deviance, fit2$deviance)
## [1] 1182.7699  916.1224

We see that there is a massive gain in the deviance between the empty model (i.e. the model with only an intercept - represented by null.deviance) and the model with Sex included. The difference in df is only 1 parameter.

Survived ~ Pclass

titanic %$% histogram(~ Pclass | Survived == 1)

There is a very apparent difference between the distributions of the survivors and non-survivors over the classes. For example, we see that in 1st and 2nd class there are more survivors than non-survivors, while in the third class this relation is opposite.

Survived ~ Pclass

fit3 %>% summary %>% coef
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)  1.4390660 0.20742485  6.937771 3.983356e-12
## Pclass      -0.8442286 0.08719006 -9.682624 3.574230e-22

Here, there is something wrong! Class is an ordered categorical variable, not a continuous variable. In other words, we cannot simply increase the log odds of Survived with -0.8442286 for every unit increase in Pclass.

Why not?

If we would interpret these coefficients ‘as is’, we would assume that class two is twice the value for class 1, that class three is 1.5 times the value for class 2, and so on.

Edit the data

titanic %<>% 
  mutate(Pclass = factor(Pclass, levels = c(3, 2, 1), ordered = FALSE))
str(titanic)
## 'data.frame':    887 obs. of  8 variables:
##  $ Survived               : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass                 : Factor w/ 3 levels "3","2","1": 1 3 1 3 1 1 3 1 1 2 ...
##  $ Name                   : Factor w/ 887 levels "Capt. Edward Gifford Crosby",..: 602 823 172 814 733 464 700 33 842 839 ...
##  $ Sex                    : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age                    : num  22 38 26 35 35 27 54 2 27 14 ...
##  $ Siblings.Spouses.Aboard: int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parents.Children.Aboard: int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare                   : num  7.25 71.28 7.92 53.1 8.05 ...

The Pclass column is now correctly coded as a factor. We ignore the ordering as this goes beyond the scope of the course.

Titanic with main effects

## 
## Call:
## glm(formula = Survived ~ Age + Sex + Pclass, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6811  -0.6653  -0.4137   0.6367   2.4505  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.17948    0.22784   5.177 2.26e-07 ***
## Age         -0.03427    0.00716  -4.787 1.69e-06 ***
## Sexmale     -2.58872    0.18701 -13.843  < 2e-16 ***
## Pclass2      1.25633    0.22646   5.548 2.90e-08 ***
## Pclass1      2.45544    0.25322   9.697  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1182.77  on 886  degrees of freedom
## Residual deviance:  801.59  on 882  degrees of freedom
## AIC: 811.59
## 
## Number of Fisher Scoring iterations: 5

Titanic with interactions

fit.interaction <- titanic %$% glm(Survived ~ Age * Sex * Pclass, 
                                   family = binomial(link="logit"))
fit.interaction %>% summary %>% .$coefficients
##                        Estimate Std. Error    z value    Pr(>|z|)
## (Intercept)          0.38542858 0.35158572  1.0962578 0.272965982
## Age                 -0.01742787 0.01399943 -1.2448985 0.213169059
## Sexmale             -1.24102347 0.51966698 -2.3881130 0.016935134
## Pclass2              3.66379424 1.38138966  2.6522525 0.007995671
## Pclass1              1.11218683 1.49587117  0.7435044 0.457176346
## Age:Sexmale         -0.02261191 0.02066970 -1.0939639 0.273970802
## Age:Pclass2         -0.03196246 0.03845267 -0.8312158 0.405851754
## Age:Pclass1          0.08036169 0.05283156  1.5210925 0.128236622
## Sexmale:Pclass2     -1.91119761 1.57587112 -1.2127880 0.225210878
## Sexmale:Pclass1      0.81487712 1.66024791  0.4908165 0.623556217
## Age:Sexmale:Pclass2 -0.03128163 0.04938976 -0.6333627 0.526496788
## Age:Sexmale:Pclass1 -0.08001824 0.05687997 -1.4067912 0.159489308

Now none of the main effects are significant. The variance (differences) have flowed into the interaction effects.

Interactions

An interaction occurs when the (causal) effect of one predictor on the outcome depends on the level of the (causal) effect of another predictor.

Image Source

E.g. the relation between body temperature and air temperature depends on the species.

Visualizing the effects

To illustrate, I will limit this investigation to Age and Pclass for males only.

  • We can use the predict function to illustrate the conditional probabilities within each class

To do so, we need to create a new data frame that has all the combinations of predictors we need.

new <- data.frame(Pclass = factor(rep(c(1, 2, 3), c(80, 80, 80))), 
                  Age = rep(1:80, times = 3),
                  Sex = rep("male", times = 240))
new <- cbind(new, 
             predict(fit.interaction, newdata = new, 
                     type = "link", se = TRUE))
head(new)
##   Pclass Age  Sex       fit    se.fit residual.scale
## 1      1   1 male 1.0317727 0.5963908              1
## 2      1   2 male 0.9920764 0.5826134              1
## 3      1   3 male 0.9523801 0.5688767              1
## 4      1   4 male 0.9126838 0.5551835              1
## 5      1   5 male 0.8729874 0.5415373              1
## 6      1   6 male 0.8332911 0.5279416              1

Adding the predicted probabilities

There are two simple approaches to obtain the predicted probabilities. First, we could simply ask for the predicted response:

new$prob <- predict(fit.interaction, newdata = new, type = "response")

Or we could use the distribution function plogis():

new$prob2 <- plogis(new$fit)
head(new)
##   Pclass Age  Sex       fit    se.fit residual.scale      prob     prob2
## 1      1   1 male 1.0317727 0.5963908              1 0.7372594 0.7372594
## 2      1   2 male 0.9920764 0.5826134              1 0.7294979 0.7294979
## 3      1   3 male 0.9523801 0.5688767              1 0.7215936 0.7215936
## 4      1   4 male 0.9126838 0.5551835              1 0.7135490 0.7135490
## 5      1   5 male 0.8729874 0.5415373              1 0.7053669 0.7053669
## 6      1   6 male 0.8332911 0.5279416              1 0.6970504 0.6970504

Adding confidence intervals

new %<>% 
  mutate(lower = plogis(fit - 1.96 * se.fit),
         upper = plogis(fit + 1.96 * se.fit))

head(new)
##   Pclass Age  Sex       fit    se.fit residual.scale      prob     prob2
## 1      1   1 male 1.0317727 0.5963908              1 0.7372594 0.7372594
## 2      1   2 male 0.9920764 0.5826134              1 0.7294979 0.7294979
## 3      1   3 male 0.9523801 0.5688767              1 0.7215936 0.7215936
## 4      1   4 male 0.9126838 0.5551835              1 0.7135490 0.7135490
## 5      1   5 male 0.8729874 0.5415373              1 0.7053669 0.7053669
## 6      1   6 male 0.8332911 0.5279416              1 0.6970504 0.6970504
##       lower     upper
## 1 0.4657654 0.9003122
## 2 0.4626085 0.8941640
## 3 0.4594348 0.8876919
## 4 0.4562432 0.8808856
## 5 0.4530325 0.8737357
## 6 0.4498011 0.8662333

What do we have?

A data frame with simulated Pclass and Age for males.

new %>% summary()
##  Pclass      Age            Sex                 fit              se.fit      
##  1:80   Min.   : 1.00   Length:240         Min.   :-7.3657   Min.   :0.1588  
##  2:80   1st Qu.:20.75   Class :character   1st Qu.:-3.2680   1st Qu.:0.2879  
##  3:80   Median :40.50   Mode  :character   Median :-1.7875   Median :0.4157  
##         Mean   :40.50                      Mean   :-2.0998   Mean   :0.5133  
##         3rd Qu.:60.25                      3rd Qu.:-0.7446   3rd Qu.:0.6093  
##         Max.   :80.00                      Max.   : 1.0318   Max.   :1.6068  
##  residual.scale      prob               prob2               lower          
##  Min.   :1      Min.   :0.0006322   Min.   :0.0006322   Min.   :0.0000271  
##  1st Qu.:1      1st Qu.:0.0366906   1st Qu.:0.0366906   1st Qu.:0.0118488  
##  Median :1      Median :0.1433767   Median :0.1433767   Median :0.0826488  
##  Mean   :1      Mean   :0.2129610   Mean   :0.2129610   Mean   :0.1360227  
##  3rd Qu.:1      3rd Qu.:0.3220022   3rd Qu.:0.3220022   3rd Qu.:0.2223781  
##  Max.   :1      Max.   :0.7372594   Max.   :0.7372594   Max.   :0.4657654  
##      upper        
##  Min.   :0.01454  
##  1st Qu.:0.10794  
##  Median :0.25041  
##  Mean   :0.31211  
##  3rd Qu.:0.44139  
##  Max.   :0.90031

Visualizing the effects: link

Visualizing the effects: probabilities

new %>%
  ggplot(aes(x = Age, y = prob)) + 
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = Pclass), alpha = .2) +
  geom_line(aes(colour = Pclass), lwd = 1) + ylab("Probability of Survival")

No interaction assumed

To repeat the previous scenario for the model where only main effects are included:

fit <- titanic %$% glm(Survived ~ Age + Sex + Pclass, family = binomial(link="logit")) 
new2 <- new %>% select(Age, Pclass, Sex)
new2 <- cbind(new2, predict(fit, newdata = new2, type = "link", se = TRUE))
new2 %<>% 
  mutate(prob = plogis(fit),
         lower = plogis(fit - 1.96 * se.fit),
         upper = plogis(fit + 1.96 * se.fit))

No interaction: the link

No interaction: probabilities

new2 %>%
  ggplot(aes(x = Age, y = prob)) + 
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = Pclass), alpha = .2) +
  geom_line(aes(colour = Pclass), lwd = 1) + ylab("Probability of Survival")

Model comparison

anova(fit, fit.interaction, test = "Chisq") 
## Analysis of Deviance Table
## 
## Model 1: Survived ~ Age + Sex + Pclass
## Model 2: Survived ~ Age * Sex * Pclass
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       882     801.59                          
## 2       875     755.18  7   46.413 7.263e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction model fits much better to the data.

AIC(fit, fit.interaction)
##                 df      AIC
## fit              5 811.5940
## fit.interaction 12 779.1805

Prediction

pred <- predict(fit, type = "response")
head(pred)
##          1          2          3          4          5          6 
## 0.10309564 0.91153082 0.57158684 0.91947951 0.06856994 0.08829226
pred <- ifelse(pred > 0.5, yes = 1, no = 0)
head(pred)
## 1 2 3 4 5 6 
## 0 1 1 1 0 0

Confusion matrix

CM <- table(pred, titanic$Survived)
CM
##     
## pred   0   1
##    0 463 100
##    1  82 242

The accuracy can then be calculated as the percentage of correct predictions, i.e. the sum over the diagonal elements divided by the total number of cases:

correct <- CM %>% diag %>% sum
correct / sum(CM)
## [1] 0.794814

In a function

confusionMatrix(as.factor(pred), reference = as.factor(titanic$Survived))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 463 100
##          1  82 242
##                                           
##                Accuracy : 0.7948          
##                  95% CI : (0.7667, 0.8209)
##     No Information Rate : 0.6144          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5627          
##                                           
##  Mcnemar's Test P-Value : 0.2076          
##                                           
##             Sensitivity : 0.8495          
##             Specificity : 0.7076          
##          Pos Pred Value : 0.8224          
##          Neg Pred Value : 0.7469          
##              Prevalence : 0.6144          
##          Detection Rate : 0.5220          
##    Detection Prevalence : 0.6347          
##       Balanced Accuracy : 0.7786          
##                                           
##        'Positive' Class : 0               
## 

Sensitivity / Specificity

##                        Survived not.Survived
## Predicted Survived            A            B
## Predicted not.Survived        C            D
  • Sensitivity is calculated as \(\text{Sensitivity} = A/(A+C) = \frac{463}{463+82} =\) 0.8495413
    • Sensitivity is the proportion of actual Survived that have been correctly predicted as Survived.
    • Sensitivity is the True Positive Rate
  • Specificity is calculated as \(\text{Specificity} = D/(B+D) = \frac{242}{100+242} =\) 0.7076023
    • Specificity is the proportion of actual not Survived that have been correctly predicted as not Survived.
    • Specificity is the True Negative Rate

Cross validating predictive power

fit <- glm(Survived ~ Age + Sex + Pclass, family = binomial(link="logit"), 
           data = titanic)
set.seed(123)
cv <- CVbinary(fit)
## 
## Fold:  3 10 2 6 5 4 9 8 7 1
## Internal estimate of accuracy = 0.795
## Cross-validation estimate of accuracy = 0.793

The accuracy after crossvalidation is more or less the same as the accuracy in the data as a whole. This indicates that the fitted model may be useful to new data that comes from the same population; i.e. the model may be generalized to new data.

The `CVbinary() function relies on K-fold crossvalidation. This type of crossvalidation relies on randomly splitting the data into folds (parts). To allow you to replicate these results I fix the random seed.

  • In real life there is no need for fixing the random seed if reproducibility is not required.

Cross validating predictive power

set.seed(123)
CVbinary(fit, nfolds = 5)
## 
## Fold:  3 2 5 4 1
## Internal estimate of accuracy = 0.795
## Cross-validation estimate of accuracy = 0.797
CVbinary(fit, nfolds = 2)
## 
## Fold:  1 2
## Internal estimate of accuracy = 0.795
## Cross-validation estimate of accuracy = 0.782