We use the following packages:
library(tidyverse)
library(magrittr)
library(caret)
library(kernlab)
library(MLeval)
library(pROC)
Let’s take the titanic data that we used before and fit four models - a logistic regression model and support vector machines with a linear, a polynomial and a radial basis function kernel - on a training version (70% of cases) of that data set. Finally, we compare the performance of all four techniques on the test version (the 30% of cases not yet used) of that data set.
We can use the following code block to directly load the data in our workspace:
con <- url("https://www.gerkovink.com/erasmus/Day%202/Part%20D/titanic.csv")
titanic <- read_csv(con)
## Rows: 887 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Name, Sex
## dbl (6): Survived, Pclass, Age, Siblings/Spouses Aboard, Parents/Children Ab...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We need to take care of some columns that are not well-coded. Let’s set every measurement level to what it should be: factors as factors, ordered factors as ordered factors, and so on.
titanic %<>% 
  mutate(Pclass = factor(Pclass, 
                         ordered = TRUE, 
                         labels = c("1st class", "2nd class", "3rd class")), 
         Survived = factor(Survived, 
                           labels = c("Died", "Survived")))
str(titanic)
## spec_tbl_df [887 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Survived : Factor w/ 2 levels "Died","Survived": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Ord.factor w/ 3 levels "1st class"<"2nd class"<..: 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:887] "Mr. Owen Harris Braund" "Mrs. John Bradley (Florence Briggs Thayer) Cumings" "Miss. Laina Heikkinen" "Mrs. Jacques Heath (Lily May Peel) Futrelle" ...
## $ Sex : chr [1:887] "male" "female" "female" "female" ...
## $ Age : num [1:887] 22 38 26 35 35 27 54 2 27 14 ...
## $ Siblings/Spouses Aboard: num [1:887] 1 1 0 1 0 0 0 3 0 1 ...
## $ Parents/Children Aboard: num [1:887] 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num [1:887] 7.25 71.28 7.92 53.1 8.05 ...
## - attr(*, "spec")=
## .. cols(
## .. Survived = col_double(),
## .. Pclass = col_double(),
## .. Name = col_character(),
## .. Sex = col_character(),
## .. Age = col_double(),
## .. `Siblings/Spouses Aboard` = col_double(),
## .. `Parents/Children Aboard` = col_double(),
## .. Fare = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The %<>% pipe from magrittr assigns the result of the pipe back to the object it started from.
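As a minimal sketch of what the assignment pipe does (using a throw-away vector x that is not part of the practical):
library(magrittr)
x <- c(4, 9, 16)
# the assignment pipe pipes x into sqrt() and stores the result back in x ...
x %<>% sqrt()
# ... which is shorthand for: x <- x %>% sqrt()
x
## [1] 2 3 4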
Let’s split the titanic data into a training and a test set. Before we do so, we fix the random number generator seed to allow for reproduction of our results. Any seed value will do; my favorite seed is 123.
set.seed(123)
Now we can split the data into a test and a training part.
idx <- createDataPartition(titanic$Survived, p = .7, list = FALSE)
train <- titanic[idx, ]
test <- titanic[-idx, ]
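createDataPartition samples within the levels of Survived, so the class balance should be (nearly) identical in both parts. A quick check (my own addition) of the sizes and survival rates:
# number of cases in each part (622 and 265 for this split)
nrow(train)
nrow(test)
# proportion of "Died" and "Survived" in each part; both should be close to the full data
prop.table(table(train$Survived))
prop.table(table(test$Survived))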
We now go through the four models, predicting Survived from the other features in titanic - with the exception of Name, naturally. If we used Name, we would fit a zero-residual model: i.e. a model for every row separately. For ease of coding we exclude the Name column from the titanic training set.
train %<>% select(-Name)
Again, we use the %<>% pipe because it assigns the result of the pipe back to the object.
Let’s fit the logistic regression model (a generalized linear model with a logit link):
lm.train <- glm(Survived ~ ., 
                data = train, 
                family = binomial(link = "logit"))
And generate the predicted values
lm.pred <- predict(lm.train, 
                   newdata = test %>% select(-Name), 
                   type = "response")
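These are predicted probabilities of survival rather than class labels; a quick peek (my own check) confirms that they lie between 0 and 1:
head(round(lm.pred, 3))
range(lm.pred)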
To inspect the performance of the final (and only) model:
confusionMatrix(ifelse(lm.pred < .5, "Died", "Survived") %>% factor, 
                test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 141 35
## Survived 22 67
##
## Accuracy : 0.7849
## 95% CI : (0.7305, 0.8328)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 2.519e-09
##
## Kappa : 0.5346
##
## Mcnemar's Test P-Value : 0.112
##
## Sensitivity : 0.8650
## Specificity : 0.6569
## Pos Pred Value : 0.8011
## Neg Pred Value : 0.7528
## Prevalence : 0.6151
## Detection Rate : 0.5321
## Detection Prevalence : 0.6642
## Balanced Accuracy : 0.7609
##
## 'Positive' Class : Died
##
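The 0.5 cut-off used above is a choice, not a given. As a small sketch of how that choice matters (my own addition, reusing lm.pred and caret’s confusionMatrix), we can recompute sensitivity and specificity for a few alternative thresholds:
thresholds <- c(.3, .5, .7)
sapply(setNames(thresholds, thresholds), function(t) {
  # "Died" remains the positive class, as in the confusion matrix above
  pred <- factor(ifelse(lm.pred < t, "Died", "Survived"), 
                 levels = levels(test$Survived))
  confusionMatrix(pred, test$Survived)$byClass[c("Sensitivity", "Specificity")]
})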
Let’s train the linear kernel support vector machine
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              savePredictions = TRUE, 
                              classProbs = TRUE, 
                              verboseIter = FALSE)
linearSVM <- train(Survived ~ ., 
                   data = train, 
                   method = "svmLinear", 
                   trControl = train_control, 
                   preProcess = c("center", "scale"), 
                   tuneGrid = expand.grid(C = seq(0.1, 10, by = .5)))
When we inspect the object, we see that the optimal value for \(C\) has been trained to be 0.1.
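We can pull that value straight from the trained object; caret stores the winning row of the tuning grid in the bestTune slot:
# the cross-validation-selected cost parameter (C = 0.1 for this run)
linearSVM$bestTune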
Let’s inspect the tuning parameters and the cross-validated performance on the training set.
plot(linearSVM)
Let’s also inspect the ROC curve on the cross-validated data:
plots <- evalm(linearSVM, showplots = FALSE, silent = TRUE)
plots$roc
plots$stdres
## $`Group 1`
## Score CI
## SENS 0.708 0.65-0.76
## SPEC 0.843 0.8-0.88
## MCC 0.556 <NA>
## Informedness 0.551 <NA>
## PREC 0.739 0.68-0.79
## NPV 0.821 0.78-0.86
## FPR 0.157 <NA>
## F1 0.723 <NA>
## TP 170.000 <NA>
## FP 60.000 <NA>
## TN 322.000 <NA>
## FN 70.000 <NA>
## AUC-ROC 0.720 0.68-0.76
## AUC-PR 0.580 <NA>
## AUC-PRG 0.240 <NA>
The Receiver Operating Characteristic (ROC) curve shows the trade-off between sensitivity - or true positive rate (TPR) - and specificity (1 – false positive rate, FPR). Classifiers whose curves lie closer to the top-left corner perform better. A random classifier is expected to yield predictions for which the TPR equals the FPR, so its ROC curve runs along the diagonal. The closer the curve comes to this 45-degree diagonal of the ROC space, the less accurate the classifier.
The ROC curve does not depend on the class distribution, which makes it very useful for evaluating classifiers that aim to predict rare events, such as diseases or disasters, where the class balance is heavily skewed. Accuracy would then favor classifiers that always predict the majority (negative) outcome.
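To make the TPR/FPR mechanics concrete, here is a small sketch (my own, not part of the practical) that traces an ROC curve by hand from the logistic predictions lm.pred, treating "Survived" as the positive class and sweeping the threshold from 0 to 1:
roc_points <- sapply(seq(0, 1, by = .01), function(t) {
  pred_pos <- lm.pred >= t                   # predict "Survived" at or above the threshold
  obs_pos  <- test$Survived == "Survived"
  c(FPR = mean(pred_pos[!obs_pos]),          # false positive rate among observed "Died"
    TPR = mean(pred_pos[obs_pos]))           # true positive rate among observed "Survived"
})
plot(roc_points["FPR", ], roc_points["TPR", ], type = "l", 
     xlab = "FPR (1 - specificity)", ylab = "TPR (sensitivity)")
abline(0, 1, lty = 2)                        # diagonal of a random classifier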
We can use the area under the ROC curve (AUC) to compare different predictive classifiers. The AUC on the cross-validated trained model is .72.
pred.probs <- predict(linearSVM, 
                      newdata = test %>% select(-Name), 
                      type = "prob")
ROC <- roc(predictor = pred.probs$Survived, 
           response = test$Survived, 
           plot = TRUE)
## Setting levels: control = Died, case = Survived
## Setting direction: controls < cases
ROC$auc
## Area under the curve: 0.8236
plot(ROC)
Let’s generate the predicted values
linearSVM.pred <- predict(linearSVM, 
                          newdata = test %>% select(-Name), 
                          type = "raw")
To inspect the performance of the final model on the test set:
confusionMatrix(linearSVM.pred, test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 142 39
## Survived 21 63
##
## Accuracy : 0.7736
## 95% CI : (0.7184, 0.8225)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 2.807e-08
##
## Kappa : 0.5055
##
## Mcnemar's Test P-Value : 0.02819
##
## Sensitivity : 0.8712
## Specificity : 0.6176
## Pos Pred Value : 0.7845
## Neg Pred Value : 0.7500
## Prevalence : 0.6151
## Detection Rate : 0.5358
## Detection Prevalence : 0.6830
## Balanced Accuracy : 0.7444
##
## 'Positive' Class : Died
##
Let’s train the polynomial kernel support vector machine
polySVM <- train(Survived ~ ., 
                 data = train, 
                 method = "svmPoly", 
                 trControl = train_control, 
                 preProcess = c("center", "scale"), 
                 tuneGrid = expand.grid(C = seq(0.25, 2, by = .25), 
                                        scale = seq(0.1, .3, by = .1), 
                                        degree = c(1:4)))
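This grid is considerably larger than the one for the linear kernel. A quick count (my own addition) of the number of candidate hyperparameter settings, and of the resulting model fits under 10-fold cross-validation with 3 repeats:
grid <- expand.grid(C = seq(0.25, 2, by = .25), 
                    scale = seq(0.1, .3, by = .1), 
                    degree = c(1:4))
nrow(grid)            # 8 x 3 x 4 = 96 candidate settings
nrow(grid) * 10 * 3   # 2880 model fits during tuning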
Let’s inspect the tuning parameters and the cross-validated performance on the training set.
plot(polySVM)
polySVM
## Support Vector Machines with Polynomial Kernel
##
## 622 samples
## 6 predictor
## 2 classes: 'Died', 'Survived'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 560, 560, 560, 559, 560, 560, ...
## Resampling results across tuning parameters:
##
## C scale degree Accuracy Kappa
## 0.25 0.1 1 0.7910309 0.5557889
## 0.25 0.1 2 0.8204728 0.6153980
## 0.25 0.1 3 0.8312255 0.6389489
## 0.25 0.1 4 0.8263868 0.6288898
## 0.25 0.2 1 0.7910309 0.5557889
## 0.25 0.2 2 0.8215822 0.6190424
## 0.25 0.2 3 0.8258577 0.6272477
## 0.25 0.2 4 0.8280253 0.6330356
## 0.25 0.3 1 0.7910309 0.5557889
## 0.25 0.3 2 0.8237071 0.6226657
## 0.25 0.3 3 0.8205325 0.6158598
## 0.25 0.3 4 0.8108124 0.5864759
## 0.50 0.1 1 0.7910309 0.5557889
## 0.50 0.1 2 0.8237157 0.6230856
## 0.50 0.1 3 0.8301417 0.6364636
## 0.50 0.1 4 0.8263868 0.6286117
## 0.50 0.2 1 0.7910309 0.5557889
## 0.50 0.2 2 0.8231695 0.6213465
## 0.50 0.2 3 0.8232122 0.6224694
## 0.50 0.2 4 0.8231951 0.6211647
## 0.50 0.3 1 0.7910309 0.5557889
## 0.50 0.3 2 0.8215822 0.6183047
## 0.50 0.3 3 0.8173323 0.6088670
## 0.50 0.3 4 0.7872162 0.5268853
## 0.75 0.1 1 0.7910309 0.5557889
## 0.75 0.1 2 0.8242704 0.6244605
## 0.75 0.1 3 0.8290835 0.6339438
## 0.75 0.1 4 0.8280082 0.6323988
## 0.75 0.2 1 0.7910309 0.5557889
## 0.75 0.2 2 0.8237071 0.6221929
## 0.75 0.2 3 0.8199949 0.6146466
## 0.75 0.2 4 0.8210445 0.6163681
## 0.75 0.3 1 0.7905018 0.5545475
## 0.75 0.3 2 0.8263697 0.6273625
## 0.75 0.3 3 0.8130312 0.5992729
## 0.75 0.3 4 0.8001109 0.5592056
## 1.00 0.1 1 0.7910309 0.5557889
## 1.00 0.1 2 0.8210360 0.6178495
## 1.00 0.1 3 0.8290835 0.6341977
## 1.00 0.1 4 0.8237413 0.6237084
## 1.00 0.2 1 0.7910309 0.5557889
## 1.00 0.2 2 0.8237071 0.6220430
## 1.00 0.2 3 0.8221454 0.6197063
## 1.00 0.2 4 0.8204984 0.6141388
## 1.00 0.3 1 0.7905018 0.5545475
## 1.00 0.3 2 0.8242277 0.6222235
## 1.00 0.3 3 0.8114183 0.5949912
## 1.00 0.3 4 0.7732719 0.4820830
## 1.25 0.1 1 0.7910309 0.5557889
## 1.25 0.1 2 0.8242448 0.6242691
## 1.25 0.1 3 0.8280082 0.6319069
## 1.25 0.1 4 0.8200034 0.6150001
## 1.25 0.2 1 0.7905018 0.5545475
## 1.25 0.2 2 0.8253030 0.6251883
## 1.25 0.2 3 0.8200119 0.6148727
## 1.25 0.2 4 0.8151220 0.5990568
## 1.25 0.3 1 0.7905018 0.5545475
## 1.25 0.3 2 0.8204813 0.6149046
## 1.25 0.3 3 0.8071002 0.5834182
## 1.25 0.3 4 0.7530125 0.4251175
## 1.50 0.1 1 0.7910309 0.5557889
## 1.50 0.1 2 0.8231866 0.6218777
## 1.50 0.1 3 0.8274706 0.6306296
## 1.50 0.1 4 0.8226745 0.6209851
## 1.50 0.2 1 0.7905018 0.5545475
## 1.50 0.2 2 0.8252944 0.6246708
## 1.50 0.2 3 0.8178614 0.6100472
## 1.50 0.2 4 0.8119218 0.5946906
## 1.50 0.3 1 0.7905018 0.5545475
## 1.50 0.3 2 0.8215480 0.6177644
## 1.50 0.3 3 0.8156938 0.6041679
## 1.50 0.3 4 0.7518177 0.4223787
## 1.75 0.1 1 0.7910309 0.5557889
## 1.75 0.1 2 0.8204984 0.6158411
## 1.75 0.1 3 0.8253115 0.6261000
## 1.75 0.1 4 0.8248165 0.6256193
## 1.75 0.2 1 0.7905018 0.5545475
## 1.75 0.2 2 0.8231609 0.6200506
## 1.75 0.2 3 0.8189452 0.6130803
## 1.75 0.2 4 0.7984810 0.5590285
## 1.75 0.3 1 0.7905018 0.5545475
## 1.75 0.3 2 0.8204813 0.6151917
## 1.75 0.3 3 0.8108722 0.5930300
## 1.75 0.3 4 0.7497440 0.4247249
## 2.00 0.1 1 0.7910309 0.5557889
## 2.00 0.1 2 0.8226318 0.6202868
## 2.00 0.1 3 0.8210360 0.6169541
## 2.00 0.1 4 0.8221454 0.6194494
## 2.00 0.2 1 0.7905018 0.5545475
## 2.00 0.2 2 0.8242277 0.6222787
## 2.00 0.2 3 0.8146527 0.6036025
## 2.00 0.2 4 0.8060079 0.5765475
## 2.00 0.3 1 0.7905018 0.5545475
## 2.00 0.3 2 0.8204813 0.6154468
## 2.00 0.3 3 0.8129886 0.5980960
## 2.00 0.3 4 0.7610514 0.4509216
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 0.25.
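Instead of scanning the full table, we can also sort the resampling results ourselves; caret keeps them in the results slot of the trained object:
# the five best-performing hyperparameter settings by cross-validated accuracy
polySVM$results %>% 
  arrange(desc(Accuracy)) %>% 
  head(5)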
Inspect the ROC curve of the predictions
pred.probs <- predict(polySVM, 
                      newdata = test %>% select(-Name), 
                      type = "prob")
ROC <- roc(predictor = pred.probs$Survived, 
           response = test$Survived, 
           plot = TRUE)
## Setting levels: control = Died, case = Survived
## Setting direction: controls < cases
ROC$auc
## Area under the curve: 0.817
plot(ROC)
Now we generate the predicted values
polySVM.pred <- predict(polySVM, 
                        newdata = test %>% select(-Name), 
                        type = "raw")
To inspect the performance of the final model on the test set:
confusionMatrix(polySVM.pred, test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 149 40
## Survived 14 62
##
## Accuracy : 0.7962
## 95% CI : (0.7426, 0.8431)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 1.858e-10
##
## Kappa : 0.5481
##
## Mcnemar's Test P-Value : 0.0006688
##
## Sensitivity : 0.9141
## Specificity : 0.6078
## Pos Pred Value : 0.7884
## Neg Pred Value : 0.8158
## Prevalence : 0.6151
## Detection Rate : 0.5623
## Detection Prevalence : 0.7132
## Balanced Accuracy : 0.7610
##
## 'Positive' Class : Died
##
Let’s train the radial basis function (RBF) kernel support vector machine
radialSVM <- train(Survived ~ ., 
                   data = train, 
                   method = "svmRadial", 
                   trControl = train_control, 
                   preProcess = c("center", "scale"), 
                   tuneLength = 10)
Instead of specifying a grid, we can also ask caret to use a tuneLength of 10. caret will then cycle over a hyperparameter grid of this length. For the linear SVM kernel there is only one tuning parameter, \(C\); tuneLength needs more than one tuning parameter to be useful. When we inspect the object, we see that the optimal values have been trained to be sigma = 0.4301104 and C = 0.5.
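Again, the winning combination can be read straight from the trained object, and we can count how much work the tuning involved:
# the cross-validation-selected hyperparameters (sigma = 0.4301104 and C = 0.5 for this run)
radialSVM$bestTune
# 10 candidate values of C, each fitted in 10 folds times 3 repeats = 300 model fits
nrow(radialSVM$results) * 10 * 3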
Let’s inspect the tuning parameters and the cross-validated performance on the training set.
plot(radialSVM)
radialSVM
## Support Vector Machines with Radial Basis Function Kernel
##
## 622 samples
## 6 predictor
## 2 classes: 'Died', 'Survived'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 560, 560, 560, 560, 559, 560, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8290067 0.6337552
## 0.50 0.8359874 0.6471976
## 1.00 0.8359788 0.6477731
## 2.00 0.8300905 0.6330929
## 4.00 0.8295699 0.6330901
## 8.00 0.8247483 0.6239330
## 16.00 0.8225806 0.6198035
## 32.00 0.8209677 0.6168844
## 64.00 0.8145332 0.6015914
## 128.00 0.8075696 0.5820259
##
## Tuning parameter 'sigma' was held constant at a value of 0.4301104
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.4301104 and C = 0.5.
Let’s inspect the ROC curve on the predictions
pred.probs <- predict(radialSVM, 
                      newdata = test %>% select(-Name), 
                      type = "prob")
ROC <- roc(predictor = pred.probs$Survived, 
           response = test$Survived, 
           plot = TRUE)
## Setting levels: control = Died, case = Survived
## Setting direction: controls < cases
ROC$auc
## Area under the curve: 0.812
plot(ROC)
And generate the predicted values
radialSVM.pred <- predict(radialSVM, 
                          newdata = test %>% select(-Name), 
                          type = "raw")
To inspect the performance of the final model on the test set:
confusionMatrix(radialSVM.pred, test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 150 37
## Survived 13 65
##
## Accuracy : 0.8113
## 95% CI : (0.7589, 0.8566)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 4.176e-12
##
## Kappa : 0.5832
##
## Mcnemar's Test P-Value : 0.001143
##
## Sensitivity : 0.9202
## Specificity : 0.6373
## Pos Pred Value : 0.8021
## Neg Pred Value : 0.8333
## Prevalence : 0.6151
## Detection Rate : 0.5660
## Detection Prevalence : 0.7057
## Balanced Accuracy : 0.7788
##
## 'Positive' Class : Died
##
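Finally, to bring together the comparison we set out to make, a short sketch (my own addition, assuming the prediction objects from above are still in the workspace) that collects the test-set accuracies of the four models:
acc <- function(pred) unname(confusionMatrix(pred, test$Survived)$overall["Accuracy"])
c(logistic  = acc(ifelse(lm.pred < .5, "Died", "Survived") %>% 
                    factor(levels = levels(test$Survived))),
  linearSVM = acc(linearSVM.pred),
  polySVM   = acc(polySVM.pred),
  radialSVM = acc(radialSVM.pred))
# for this run: 0.7849 (logistic), 0.7736 (linear), 0.7962 (polynomial) and 0.8113 (radial)
In this run the radial kernel SVM performs best on the test set, followed by the polynomial kernel SVM, the logistic model and the linear kernel SVM.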
End of Practical