We use the following packages:
library(tidyverse)
library(magrittr)
library(caret)
library(kernlab)
library(MLeval)
library(pROC)
Let’s take the titanic data that we used before and fit four models - a logistic regression model and support vector machines with a linear, a polynomial and a radial basis function kernel - on a training version (70% of cases) of that data set. Finally, we compare the performance of all four techniques on the test version (the 30% of cases not yet used) of that data set.
We can use the following code block to directly load the data in our workspace:
con <- url("https://www.gerkovink.com/erasmus/Day%202/Part%20D/titanic.csv")
titanic <- read_csv(con)
## Rows: 887 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Name, Sex
## dbl (6): Survived, Pclass, Age, Siblings/Spouses Aboard, Parents/Children Ab...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We need to take care of some columns that are not well-coded. Let’s set every measurement level to what it should be: factors as factors, ordered factors as ordered factors, and so on.
titanic %<>% 
  mutate(Pclass = factor(Pclass, 
                         ordered = TRUE, 
                         labels = c("1st class", "2nd class", "3rd class")), 
         Survived = factor(Survived, 
                           labels = c("Died", "Survived")))
str(titanic)
## spec_tbl_df [887 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Survived : Factor w/ 2 levels "Died","Survived": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Ord.factor w/ 3 levels "1st class"<"2nd class"<..: 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:887] "Mr. Owen Harris Braund" "Mrs. John Bradley (Florence Briggs Thayer) Cumings" "Miss. Laina Heikkinen" "Mrs. Jacques Heath (Lily May Peel) Futrelle" ...
## $ Sex : chr [1:887] "male" "female" "female" "female" ...
## $ Age : num [1:887] 22 38 26 35 35 27 54 2 27 14 ...
## $ Siblings/Spouses Aboard: num [1:887] 1 1 0 1 0 0 0 3 0 1 ...
## $ Parents/Children Aboard: num [1:887] 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num [1:887] 7.25 71.28 7.92 53.1 8.05 ...
## - attr(*, "spec")=
## .. cols(
## .. Survived = col_double(),
## .. Pclass = col_double(),
## .. Name = col_character(),
## .. Sex = col_character(),
## .. Age = col_double(),
## .. `Siblings/Spouses Aboard` = col_double(),
## .. `Parents/Children Aboard` = col_double(),
## .. Fare = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The %<>% pipe from magrittr assigns the result of the pipe back to the object it started from.
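As a minimal sketch of what the assignment pipe does (using a throw-away vector x that is not part of the practical):
library(magrittr)
x <- c(4, 9, 16)
# the assignment pipe pipes x into sqrt() and stores the result back in x ...
x %<>% sqrt()
# ... which is shorthand for: x <- x %>% sqrt()
x
## [1] 2 3 4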
Let’s split the titanic data into a training and a test set. Before we do so, we fix the random number generator seed to allow for reproduction of our results. Any seed value will do; my favorite seed is 123.
set.seed(123)
Now we can split the data into a test and a training part.
idx <- createDataPartition(titanic$Survived, p = .7, list = FALSE)
train <- titanic[idx, ]
test <- titanic[-idx, ]
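createDataPartition samples within the levels of Survived, so the class balance should be (nearly) identical in both parts. A quick check (my own addition) of the sizes and survival rates:
# number of cases in each part (622 and 265 for this split)
nrow(train)
nrow(test)
# proportion of "Died" and "Survived" in each part; both should be close to the full data
prop.table(table(train$Survived))
prop.table(table(test$Survived))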
We now go through the four models, predicting Survived from the other features in titanic - with the exception of Name, naturally. If we used Name, we would fit a zero-residual model: i.e. a model for every row separately. For ease of coding we exclude the Name column from the titanic training set.
train %<>% select(-Name)
Again, we use the %<>% pipe because it assigns the result of the pipe back to the object.
Let’s fit the logistic regression model (a generalized linear model with a logit link):
lm.train <- glm(Survived ~ ., 
                data = train, 
                family = binomial(link = "logit"))
And generate the predicted values
lm.pred <- predict(lm.train, 
                   newdata = test %>% select(-Name), 
                   type = "response")
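These are predicted probabilities of survival rather than class labels; a quick peek (my own check) confirms that they lie between 0 and 1:
head(round(lm.pred, 3))
range(lm.pred)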
To inspect the performance of the final (and only) model:
confusionMatrix(ifelse(lm.pred < .5, "Died", "Survived") %>% factor, 
                test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 141 35
## Survived 22 67
##
## Accuracy : 0.7849
## 95% CI : (0.7305, 0.8328)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 2.519e-09
##
## Kappa : 0.5346
##
## Mcnemar's Test P-Value : 0.112
##
## Sensitivity : 0.8650
## Specificity : 0.6569
## Pos Pred Value : 0.8011
## Neg Pred Value : 0.7528
## Prevalence : 0.6151
## Detection Rate : 0.5321
## Detection Prevalence : 0.6642
## Balanced Accuracy : 0.7609
##
## 'Positive' Class : Died
##
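The 0.5 cut-off used above is a choice, not a given. As a small sketch of how that choice matters (my own addition, reusing lm.pred and caret’s confusionMatrix), we can recompute sensitivity and specificity for a few alternative thresholds:
thresholds <- c(.3, .5, .7)
sapply(setNames(thresholds, thresholds), function(t) {
  # "Died" remains the positive class, as in the confusion matrix above
  pred <- factor(ifelse(lm.pred < t, "Died", "Survived"), 
                 levels = levels(test$Survived))
  confusionMatrix(pred, test$Survived)$byClass[c("Sensitivity", "Specificity")]
})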
Let’s train the linear kernel support vector machine
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              savePredictions = TRUE, 
                              classProbs = TRUE, 
                              verboseIter = FALSE)
linearSVM <- train(Survived ~ ., 
                   data = train, 
                   method = "svmLinear", 
                   trControl = train_control, 
                   preProcess = c("center", "scale"), 
                   tuneGrid = expand.grid(C = seq(0.1, 10, by = .5)))
When we inspect the object, we see that the optimal value for \(C\) has been trained to be 0.1.
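We can pull that value straight from the trained object; caret stores the winning row of the tuning grid in the bestTune slot:
# the cross-validation-selected cost parameter (C = 0.1 for this run)
linearSVM$bestTune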
Let’s inspect the tuning parameters and the cross-validated performance on the training set.
plot(linearSVM)
Let’s also inspect the ROC curve on the cross-validated data:
plots <- evalm(linearSVM, showplots = FALSE, silent = TRUE)
plots$roc
plots$stdres
## $`Group 1`
## Score CI
## SENS 0.708 0.65-0.76
## SPEC 0.843 0.8-0.88
## MCC 0.556 <NA>
## Informedness 0.551 <NA>
## PREC 0.739 0.68-0.79
## NPV 0.821 0.78-0.86
## FPR 0.157 <NA>
## F1 0.723 <NA>
## TP 170.000 <NA>
## FP 60.000 <NA>
## TN 322.000 <NA>
## FN 70.000 <NA>
## AUC-ROC 0.720 0.68-0.76
## AUC-PR 0.580 <NA>
## AUC-PRG 0.240 <NA>
The Receiver Operating Characteristic (ROC) curve shows the trade-off between sensitivity - or true positive rate (TPR) - and specificity (1 – false positive rate, FPR). Classifiers whose curves lie closer to the top-left corner perform better. A random classifier is expected to yield predictions for which the TPR equals the FPR, so its ROC curve runs along the diagonal. The closer the curve comes to this 45-degree diagonal of the ROC space, the less accurate the classifier.
The ROC curve does not depend on the class distribution, which makes it very useful for evaluating classifiers that aim to predict rare events, such as diseases or disasters, where the class balance is heavily skewed. Accuracy would then favor classifiers that always predict the majority (negative) outcome.
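To make the TPR/FPR mechanics concrete, here is a small sketch (my own, not part of the practical) that traces an ROC curve by hand from the logistic predictions lm.pred, treating "Survived" as the positive class and sweeping the threshold from 0 to 1:
roc_points <- sapply(seq(0, 1, by = .01), function(t) {
  pred_pos <- lm.pred >= t                   # predict "Survived" at or above the threshold
  obs_pos  <- test$Survived == "Survived"
  c(FPR = mean(pred_pos[!obs_pos]),          # false positive rate among observed "Died"
    TPR = mean(pred_pos[obs_pos]))           # true positive rate among observed "Survived"
})
plot(roc_points["FPR", ], roc_points["TPR", ], type = "l", 
     xlab = "FPR (1 - specificity)", ylab = "TPR (sensitivity)")
abline(0, 1, lty = 2)                        # diagonal of a random classifier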
We can use the area under the ROC curve (AUC) to compare different predictive classifiers. The AUC on the cross-validated trained model is .72.
pred.probs <- predict(linearSVM, 
                      newdata = test %>% select(-Name), 
                      type = "prob")
ROC <- roc(predictor = pred.probs$Survived, 
           response = test$Survived, 
           plot = TRUE)
## Setting levels: control = Died, case = Survived
## Setting direction: controls < cases
ROC$auc
## Area under the curve: 0.8236
plot(ROC)
Let’s generate the predicted values
linearSVM.pred <- predict(linearSVM, 
                          newdata = test %>% select(-Name), 
                          type = "raw")
To inspect the performance of the final model on the test set:
confusionMatrix(linearSVM.pred, test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 142 39
## Survived 21 63
##
## Accuracy : 0.7736
## 95% CI : (0.7184, 0.8225)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 2.807e-08
##
## Kappa : 0.5055
##
## Mcnemar's Test P-Value : 0.02819
##
## Sensitivity : 0.8712
## Specificity : 0.6176
## Pos Pred Value : 0.7845
## Neg Pred Value : 0.7500
## Prevalence : 0.6151
## Detection Rate : 0.5358
## Detection Prevalence : 0.6830
## Balanced Accuracy : 0.7444
##
## 'Positive' Class : Died
##
Let’s train the polynomial kernel support vector machine
polySVM <- train(Survived ~ ., 
                 data = train, 
                 method = "svmPoly", 
                 trControl = train_control, 
                 preProcess = c("center", "scale"), 
                 tuneGrid = expand.grid(C = seq(0.25, 2, by = .25), 
                                        scale = seq(0.1, .3, by = .1), 
                                        degree = c(1:4)))
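This grid is considerably larger than the one for the linear kernel. A quick count (my own addition) of the number of candidate hyperparameter settings, and of the resulting model fits under 10-fold cross-validation with 3 repeats:
grid <- expand.grid(C = seq(0.25, 2, by = .25), 
                    scale = seq(0.1, .3, by = .1), 
                    degree = c(1:4))
nrow(grid)            # 8 x 3 x 4 = 96 candidate settings
nrow(grid) * 10 * 3   # 2880 model fits during tuning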
Let’s inspect the tuning parameters and the cross-validated performance on the training set.
plot(polySVM)
polySVM
## Support Vector Machines with Polynomial Kernel
##
## 622 samples
## 6 predictor
## 2 classes: 'Died', 'Survived'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 560, 560, 560, 559, 560, 560, ...
## Resampling results across tuning parameters:
##
## C scale degree Accuracy Kappa
## 0.25 0.1 1 0.7910309 0.5557889
## 0.25 0.1 2 0.8204728 0.6153980
## 0.25 0.1 3 0.8312255 0.6389489
## 0.25 0.1 4 0.8263868 0.6288898
## 0.25 0.2 1 0.7910309 0.5557889
## 0.25 0.2 2 0.8215822 0.6190424
## 0.25 0.2 3 0.8258577 0.6272477
## 0.25 0.2 4 0.8280253 0.6330356
## 0.25 0.3 1 0.7910309 0.5557889
## 0.25 0.3 2 0.8237071 0.6226657
## 0.25 0.3 3 0.8205325 0.6158598
## 0.25 0.3 4 0.8108124 0.5864759
## 0.50 0.1 1 0.7910309 0.5557889
## 0.50 0.1 2 0.8237157 0.6230856
## 0.50 0.1 3 0.8301417 0.6364636
## 0.50 0.1 4 0.8263868 0.6286117
## 0.50 0.2 1 0.7910309 0.5557889
## 0.50 0.2 2 0.8231695 0.6213465
## 0.50 0.2 3 0.8232122 0.6224694
## 0.50 0.2 4 0.8231951 0.6211647
## 0.50 0.3 1 0.7910309 0.5557889
## 0.50 0.3 2 0.8215822 0.6183047
## 0.50 0.3 3 0.8173323 0.6088670
## 0.50 0.3 4 0.7872162 0.5268853
## 0.75 0.1 1 0.7910309 0.5557889
## 0.75 0.1 2 0.8242704 0.6244605
## 0.75 0.1 3 0.8290835 0.6339438
## 0.75 0.1 4 0.8280082 0.6323988
## 0.75 0.2 1 0.7910309 0.5557889
## 0.75 0.2 2 0.8237071 0.6221929
## 0.75 0.2 3 0.8199949 0.6146466
## 0.75 0.2 4 0.8210445 0.6163681
## 0.75 0.3 1 0.7905018 0.5545475
## 0.75 0.3 2 0.8263697 0.6273625
## 0.75 0.3 3 0.8130312 0.5992729
## 0.75 0.3 4 0.8001109 0.5592056
## 1.00 0.1 1 0.7910309 0.5557889
## 1.00 0.1 2 0.8210360 0.6178495
## 1.00 0.1 3 0.8290835 0.6341977
## 1.00 0.1 4 0.8237413 0.6237084
## 1.00 0.2 1 0.7910309 0.5557889
## 1.00 0.2 2 0.8237071 0.6220430
## 1.00 0.2 3 0.8221454 0.6197063
## 1.00 0.2 4 0.8204984 0.6141388
## 1.00 0.3 1 0.7905018 0.5545475
## 1.00 0.3 2 0.8242277 0.6222235
## 1.00 0.3 3 0.8114183 0.5949912
## 1.00 0.3 4 0.7732719 0.4820830
## 1.25 0.1 1 0.7910309 0.5557889
## 1.25 0.1 2 0.8242448 0.6242691
## 1.25 0.1 3 0.8280082 0.6319069
## 1.25 0.1 4 0.8200034 0.6150001
## 1.25 0.2 1 0.7905018 0.5545475
## 1.25 0.2 2 0.8253030 0.6251883
## 1.25 0.2 3 0.8200119 0.6148727
## 1.25 0.2 4 0.8151220 0.5990568
## 1.25 0.3 1 0.7905018 0.5545475
## 1.25 0.3 2 0.8204813 0.6149046
## 1.25 0.3 3 0.8071002 0.5834182
## 1.25 0.3 4 0.7530125 0.4251175
## 1.50 0.1 1 0.7910309 0.5557889
## 1.50 0.1 2 0.8231866 0.6218777
## 1.50 0.1 3 0.8274706 0.6306296
## 1.50 0.1 4 0.8226745 0.6209851
## 1.50 0.2 1 0.7905018 0.5545475
## 1.50 0.2 2 0.8252944 0.6246708
## 1.50 0.2 3 0.8178614 0.6100472
## 1.50 0.2 4 0.8119218 0.5946906
## 1.50 0.3 1 0.7905018 0.5545475
## 1.50 0.3 2 0.8215480 0.6177644
## 1.50 0.3 3 0.8156938 0.6041679
## 1.50 0.3 4 0.7518177 0.4223787
## 1.75 0.1 1 0.7910309 0.5557889
## 1.75 0.1 2 0.8204984 0.6158411
## 1.75 0.1 3 0.8253115 0.6261000
## 1.75 0.1 4 0.8248165 0.6256193
## 1.75 0.2 1 0.7905018 0.5545475
## 1.75 0.2 2 0.8231609 0.6200506
## 1.75 0.2 3 0.8189452 0.6130803
## 1.75 0.2 4 0.7984810 0.5590285
## 1.75 0.3 1 0.7905018 0.5545475
## 1.75 0.3 2 0.8204813 0.6151917
## 1.75 0.3 3 0.8108722 0.5930300
## 1.75 0.3 4 0.7497440 0.4247249
## 2.00 0.1 1 0.7910309 0.5557889
## 2.00 0.1 2 0.8226318 0.6202868
## 2.00 0.1 3 0.8210360 0.6169541
## 2.00 0.1 4 0.8221454 0.6194494
## 2.00 0.2 1 0.7905018 0.5545475
## 2.00 0.2 2 0.8242277 0.6222787
## 2.00 0.2 3 0.8146527 0.6036025
## 2.00 0.2 4 0.8060079 0.5765475
## 2.00 0.3 1 0.7905018 0.5545475
## 2.00 0.3 2 0.8204813 0.6154468
## 2.00 0.3 3 0.8129886 0.5980960
## 2.00 0.3 4 0.7610514 0.4509216
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 0.25.
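Instead of scanning the full table, we can also sort the resampling results ourselves; caret keeps them in the results slot of the trained object:
# the five best-performing hyperparameter settings by cross-validated accuracy
polySVM$results %>% 
  arrange(desc(Accuracy)) %>% 
  head(5)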
Inspect the ROC curve of the predictions
pred.probs <- predict(polySVM, 
                      newdata = test %>% select(-Name), 
                      type = "prob")
ROC <- roc(predictor = pred.probs$Survived, 
           response = test$Survived, 
           plot = TRUE)
## Setting levels: control = Died, case = Survived
## Setting direction: controls < cases
ROC$auc
## Area under the curve: 0.817
plot(ROC)
Now we generate the predicted values
polySVM.pred <- predict(polySVM, 
                        newdata = test %>% select(-Name), 
                        type = "raw")
To inspect the performance of the final model on the test set:
confusionMatrix(polySVM.pred, test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 149 40
## Survived 14 62
##
## Accuracy : 0.7962
## 95% CI : (0.7426, 0.8431)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 1.858e-10
##
## Kappa : 0.5481
##
## Mcnemar's Test P-Value : 0.0006688
##
## Sensitivity : 0.9141
## Specificity : 0.6078
## Pos Pred Value : 0.7884
## Neg Pred Value : 0.8158
## Prevalence : 0.6151
## Detection Rate : 0.5623
## Detection Prevalence : 0.7132
## Balanced Accuracy : 0.7610
##
## 'Positive' Class : Died
##
Let’s train the radial basis function (RBF) kernel support vector machine
radialSVM <- train(Survived ~ ., 
                   data = train, 
                   method = "svmRadial", 
                   trControl = train_control, 
                   preProcess = c("center", "scale"), 
                   tuneLength = 10)
Instead of specifying a grid, we can also ask caret to use a tuneLength of 10. caret will then cycle over a hyperparameter grid of this length. For the linear SVM kernel there is only one tuning parameter, \(C\); tuneLength needs more than one tuning parameter to be useful. When we inspect the object, we see that the optimal values have been trained to be sigma = 0.4301104 and C = 0.5.
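Again, the winning combination can be read straight from the trained object, and we can count how much work the tuning involved:
# the cross-validation-selected hyperparameters (sigma = 0.4301104 and C = 0.5 for this run)
radialSVM$bestTune
# 10 candidate values of C, each fitted in 10 folds times 3 repeats = 300 model fits
nrow(radialSVM$results) * 10 * 3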
Let’s inspect the tuning parameters and the cross-validated performance on the training set.
plot(radialSVM)
radialSVM
## Support Vector Machines with Radial Basis Function Kernel
##
## 622 samples
## 6 predictor
## 2 classes: 'Died', 'Survived'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 560, 560, 560, 560, 559, 560, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8290067 0.6337552
## 0.50 0.8359874 0.6471976
## 1.00 0.8359788 0.6477731
## 2.00 0.8300905 0.6330929
## 4.00 0.8295699 0.6330901
## 8.00 0.8247483 0.6239330
## 16.00 0.8225806 0.6198035
## 32.00 0.8209677 0.6168844
## 64.00 0.8145332 0.6015914
## 128.00 0.8075696 0.5820259
##
## Tuning parameter 'sigma' was held constant at a value of 0.4301104
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.4301104 and C = 0.5.
Let’s inspect the ROC curve on the predictions
pred.probs <- predict(radialSVM, 
                      newdata = test %>% select(-Name), 
                      type = "prob")
ROC <- roc(predictor = pred.probs$Survived, 
           response = test$Survived, 
           plot = TRUE)
## Setting levels: control = Died, case = Survived
## Setting direction: controls < cases
ROC$auc
## Area under the curve: 0.812
plot(ROC)
And generate the predicted values
radialSVM.pred <- predict(radialSVM, 
                          newdata = test %>% select(-Name), 
                          type = "raw")
To inspect the performance of the final model on the test set:
confusionMatrix(radialSVM.pred, test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 150 37
## Survived 13 65
##
## Accuracy : 0.8113
## 95% CI : (0.7589, 0.8566)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 4.176e-12
##
## Kappa : 0.5832
##
## Mcnemar's Test P-Value : 0.001143
##
## Sensitivity : 0.9202
## Specificity : 0.6373
## Pos Pred Value : 0.8021
## Neg Pred Value : 0.8333
## Prevalence : 0.6151
## Detection Rate : 0.5660
## Detection Prevalence : 0.7057
## Balanced Accuracy : 0.7788
##
## 'Positive' Class : Died
##
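Finally, to bring together the comparison we set out to make, a short sketch (my own addition, assuming the prediction objects from above are still in the workspace) that collects the test-set accuracies of the four models:
acc <- function(pred) unname(confusionMatrix(pred, test$Survived)$overall["Accuracy"])
c(logistic  = acc(ifelse(lm.pred < .5, "Died", "Survived") %>% 
                    factor(levels = levels(test$Survived))),
  linearSVM = acc(linearSVM.pred),
  polySVM   = acc(polySVM.pred),
  radialSVM = acc(radialSVM.pred))
# for this run: 0.7849 (logistic), 0.7736 (linear), 0.7962 (polynomial) and 0.8113 (radial)
In this run the radial kernel SVM performs best on the test set, followed by the polynomial kernel SVM, the logistic model and the linear kernel SVM.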
End of Practical