We use the following packages:
library(mice)
library(caret)
library(dplyr)
library(magrittr)
library(DAAG)
library(readr)
We use the titanic data set for this exercise. Download the titanic.csv data set.
con <- url("https://www.gerkovink.com/erasmus/Day%202/Part%20D/titanic.csv")
titanic <- read_csv(con)
## Rows: 887 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Name, Sex
## dbl (6): Survived, Pclass, Age, Siblings/Spouses Aboard, Parents/Children Ab...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
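As the message suggests, we could also declare the column types up front (or set show_col_types = FALSE) to silence it. A minimal sketch that re-reads the file straight from the URL, with a cols() specification mirroring the one shown by str() below:
titanic <- read_csv("https://www.gerkovink.com/erasmus/Day%202/Part%20D/titanic.csv",
                    col_types = cols(Survived = col_double(),
                                     Pclass = col_double(),
                                     Name = col_character(),
                                     Sex = col_character(),
                                     Age = col_double(),
                                     `Siblings/Spouses Aboard` = col_double(),
                                     `Parents/Children Aboard` = col_double(),
                                     Fare = col_double()))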
Exercise 1 Inspect the titanic data set by calling titanic in the console and with the functions summary(), str() and md.pattern().
titanic
## # A tibble: 887 × 8
## Survived Pclass Name Sex Age `Siblings/Spous… `Parents/Childr… Fare
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0 3 Mr. Owe… male 22 1 0 7.25
## 2 1 1 Mrs. Jo… female 38 1 0 71.3
## 3 1 3 Miss. L… female 26 0 0 7.92
## 4 1 1 Mrs. Ja… female 35 1 0 53.1
## 5 0 3 Mr. Wil… male 35 0 0 8.05
## 6 0 3 Mr. Jam… male 27 0 0 8.46
## 7 0 1 Mr. Tim… male 54 0 0 51.9
## 8 0 3 Master.… male 2 3 1 21.1
## 9 1 3 Mrs. Os… female 27 0 2 11.1
## 10 1 2 Mrs. Ni… female 14 1 0 30.1
## # … with 877 more rows
We can see that the titanic data set is imported as a tibble. A tibble is a more flexible data frame with a much nicer printing method.
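For comparison, we could print the first rows as a plain data.frame to see the difference (this only changes the printing, not the data):
titanic %>% as.data.frame() %>% head()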
summary(titanic)
## Survived Pclass Name Sex
## Min. :0.0000 Min. :1.000 Length:887 Length:887
## 1st Qu.:0.0000 1st Qu.:2.000 Class :character Class :character
## Median :0.0000 Median :3.000 Mode :character Mode :character
## Mean :0.3856 Mean :2.306
## 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1.0000 Max. :3.000
## Age Siblings/Spouses Aboard Parents/Children Aboard
## Min. : 0.42 Min. :0.0000 Min. :0.0000
## 1st Qu.:20.25 1st Qu.:0.0000 1st Qu.:0.0000
## Median :28.00 Median :0.0000 Median :0.0000
## Mean :29.47 Mean :0.5254 Mean :0.3833
## 3rd Qu.:38.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.0000 Max. :6.0000
## Fare
## Min. : 0.000
## 1st Qu.: 7.925
## Median : 14.454
## Mean : 32.305
## 3rd Qu.: 31.137
## Max. :512.329
The summary() output gives us direct information about the parametric nature of the columns in the data.
str(titanic)
## spec_tbl_df [887 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Survived : num [1:887] 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : num [1:887] 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:887] "Mr. Owen Harris Braund" "Mrs. John Bradley (Florence Briggs Thayer) Cumings" "Miss. Laina Heikkinen" "Mrs. Jacques Heath (Lily May Peel) Futrelle" ...
## $ Sex : chr [1:887] "male" "female" "female" "female" ...
## $ Age : num [1:887] 22 38 26 35 35 27 54 2 27 14 ...
## $ Siblings/Spouses Aboard: num [1:887] 1 1 0 1 0 0 0 3 0 1 ...
## $ Parents/Children Aboard: num [1:887] 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num [1:887] 7.25 71.28 7.92 53.1 8.05 ...
## - attr(*, "spec")=
## .. cols(
## .. Survived = col_double(),
## .. Pclass = col_double(),
## .. Name = col_character(),
## .. Sex = col_character(),
## .. Age = col_double(),
## .. `Siblings/Spouses Aboard` = col_double(),
## .. `Parents/Children Aboard` = col_double(),
## .. Fare = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
When we study the structure of the data set, we see that the outcome Survived is not coded as a factor, but as a numeric column. The same holds for Pclass. This will influence the default estimation later on. There are more irregularities, but we’ll ignore those for now.
md.pattern(titanic, rotate.names = TRUE)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## Survived Pclass Name Sex Age Siblings/Spouses Aboard
## 887 1 1 1 1 1 1
## 0 0 0 0 0 0
## Parents/Children Aboard Fare
## 887 1 1 0
## 0 0 0
There are no missing values in this titanic data set.
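A quick base-R check without mice leads to the same conclusion:
colSums(is.na(titanic)) # number of missing values per column
anyNA(titanic)          # TRUE if any value is missing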
Exercise 2 Correct the measurement level of the columns Pclass and Survived. Then ask for the summary() once more.
titanic %<>%
  mutate(Pclass = factor(Pclass, labels = c("1st class", "2nd class", "3rd class")),
         Survived = factor(Survived, labels = c("No", "Yes")))
titanic %>% summary()
## Survived Pclass Name Sex
## No :545 1st class:216 Length:887 Length:887
## Yes:342 2nd class:184 Class :character Class :character
## 3rd class:487 Mode :character Mode :character
##
##
##
## Age Siblings/Spouses Aboard Parents/Children Aboard
## Min. : 0.42 Min. :0.0000 Min. :0.0000
## 1st Qu.:20.25 1st Qu.:0.0000 1st Qu.:0.0000
## Median :28.00 Median :0.0000 Median :0.0000
## Mean :29.47 Mean :0.5254 Mean :0.3833
## 3rd Qu.:38.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.0000 Max. :6.0000
## Fare
## Min. : 0.000
## 1st Qu.: 7.925
## Median : 14.454
## Mean : 32.305
## 3rd Qu.: 31.137
## Max. :512.329
We now see the tabular information about the Survived and Pclass columns. This is because these columns are now coded as factors (i.e. categorical variables with a numeric representation). Note that in the mutate call, I used the %<>% pipe. This assign pipe returns the end result of the pipe to the original object. This avoids the <- assignment operator and the double mention of the titanic set in the regular strategy below:
titanic <- titanic %>%
  mutate(Pclass = factor(Pclass, labels = c("1st class", "2nd class", "3rd class")),
         Survived = factor(Survived, labels = c("No", "Yes")))
Exercise 3 Split the data manually into two parts: a training part (70% of cases) and a test part (30% of cases). Verify the dimensions of the splits with the function dim().
set.seed(123) # for reproducibility
trainIndex <- createDataPartition(titanic$Survived, p = .7, times = 1, list = FALSE)
train <- titanic[trainIndex, ]
test <- titanic[-trainIndex, ]
We make use of the createDataPartition() function from package caret to generate the row numbers for the splits. We could have also done this manually with e.g. trainIndex <- sample(1:nrow(titanic), size = round(.7 * nrow(titanic)), replace = FALSE). Note that the sampling must be done without replacement; otherwise cases could end up in the training set more than once. I find the createDataPartition() function convenient, because it plugs directly into the caret functionality.
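A minimal sketch of that manual alternative (the names manualIndex, manual_train and manual_test are just illustrative). Unlike createDataPartition(), which stratifies on the outcome, this simple sample is unstratified:
set.seed(123) # for reproducibility
manualIndex <- sample(1:nrow(titanic), size = round(.7 * nrow(titanic)), replace = FALSE)
manual_train <- titanic[manualIndex, ]  # 70% of cases
manual_test  <- titanic[-manualIndex, ] # the remaining 30%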
dim(train)
## [1] 622 8
dim(test)
## [1] 265 8
We can see that the split, with 622 cases in the train set and 265 cases in the test set, approximates the desired p = .7 split proportion: 622/887 ≈ 0.7012401.
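We can verify this proportion directly:
nrow(train) / nrow(titanic)
## [1] 0.7012401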
Exercise 4 Predict Age from Pclass, Sex and Survived. Train your model on the train set and validate it on the test set.
fit <- train %$%
  lm(Age ~ Pclass + Sex + Survived)
pred <- fit %>%
  predict(newdata = test)
The fit object contains the model fitted on the training data. The pred object contains the predictions obtained by applying the fit model to the test data.
Exercise 5 Now calculate the RMSE and the \(R^2\) for the predictions and compare those to the fitted model in fit.
results <- data.frame(R2 = c(cor(pred, test$Age)^2,
                             summary(fit)$r.squared),
                      RMSE = c((pred - test$Age)^2 %>% mean %>% sqrt,
                               fit$residuals^2 %>% mean %>% sqrt))
rownames(results) <- c("predicted", "fitted")
results
## R2 RMSE
## predicted 0.1802922 12.78799
## fitted 0.2066340 12.57457
We see that the \(R^2\) is lower for the predictions and that the root mean squared error is higher. For unbiased estimators we can view the RMSE as the standard error of the estimator. The MSE would then be the variance of that unbiased estimator.
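For reference, the root mean squared error computed above is \(\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\), where \(y_i\) are the observed and \(\hat{y}_i\) the predicted ages, and the \(R^2\) for the predictions is the squared correlation \(\text{cor}(\hat{y}, y)^2\) between predictions and observations.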
Exercise 6 Now use the caret package to do the same as above. Use the default parameters for the train() function and use the train data to train the model.
set.seed(123) # for reproducibility
# train the model on training set
model <- train(Age ~ Pclass + Sex + Survived,
               data = train,
               method = "lm")
model
## Linear Regression
##
## 622 samples
## 3 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 622, 622, 622, 622, 622, 622, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 12.83484 0.1941763 10.1362
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
We see that the train() function by default uses bootstrapped resampling: the train data is resampled with replacement 25 times and the model is evaluated on every resample. Every individual sample is slightly different and, hence, so is the distribution of the obtained results. We can get information about the variance from:
model$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 12.83484 0.1941763 10.1362 0.5506277 0.04287056 0.4955852
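caret also stores the individual resampling results, so we can inspect the spread behind those SD columns directly. A quick sketch:
model$resample %>% head() # per-resample RMSE, Rsquared and MAE
sd(model$resample$RMSE)   # should match the RMSESD column above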
Exercise 7 Now use the model from (6) to predict the test data and calculate the same metrics as in (5).
pred <- predict(model, newdata = test)
# R^2
cor(pred, test$Age)^2
## [1] 0.1802922
# RMSE
(pred - test$Age)^2 %>% mean %>% sqrt
## [1] 12.78799
A much easier way of obtaining the same metrics is with the postResample()
function:
postResample(pred = pred, obs = test$Age)
## RMSE Rsquared MAE
## 12.7879928 0.1802922 9.6969758
Exercise 8 Rerun the model from (6), but use 10-fold cross-validation on the training set. Evaluate the predictions with postResample().
set.seed(123) # for reproducibility
model <- train(Age ~ Pclass + Sex + Survived,
               data = train,
               method = "lm",
               trControl = trainControl(method = "cv", number = 10))
model
## Linear Regression
##
## 622 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 560, 558, 559, 560, 559, 562, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 12.5743 0.2086778 9.93655
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
pred <- predict(model, newdata = test)
postResample(pred, test$Age)
## RMSE Rsquared MAE
## 12.7879928 0.1802922 9.6969758
There’s not much more we can do for this linear model. At least we now know that the model below is not grossly overfitted and that, if new data were to come in, Age could not be predicted with much accuracy from these predictors. Let’s hope that never happens.
lm(Age ~ Pclass + Sex + Survived, data = titanic) %>% summary()
##
## Call:
## lm(formula = Age ~ Pclass + Sex + Survived, data = titanic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.406 -7.934 -0.934 7.493 47.066
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.406 1.422 29.122 < 2e-16 ***
## Pclass2nd class -9.803 1.281 -7.653 5.15e-14 ***
## Pclass3rd class -15.889 1.098 -14.467 < 2e-16 ***
## Sexmale 1.417 1.061 1.335 0.182
## SurvivedYes -5.427 1.098 -4.944 9.15e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.67 on 882 degrees of freedom
## Multiple R-squared: 0.1991, Adjusted R-squared: 0.1954
## F-statistic: 54.8 on 4 and 882 DF, p-value: < 2.2e-16
We can still infer that Age differs over these groups. The overall model is highly significant.
Exercise 9 Use the same train/test splits to evaluate the performance of a logistic model where Survived is predicted from Age, Pclass and Sex. Study the accuracy and the confusion matrix.
We start by specifying the caret model.
set.seed(123) # for reproducibility
model <- train(Survived ~ Age + Pclass + Sex,
               data = train,
               method = "glm",
               family = binomial(link = "logit"),
               trControl = trainControl(method = "cv", number = 10))
model
## Generalized Linear Model
##
## 622 samples
## 3 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 559, 560, 560, 560, 560, 560, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8020993 0.5771372
We can ask for a confusion matrix over the cross-validated sets.
confusionMatrix(model)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 52.6 10.9
## Yes 8.8 27.7
##
## Accuracy (average) : 0.8023
We see that a bit over 80% of cases are accurately predicted. The off-diagonal cells hold the remaining almost 20%.
When we apply the model to the test data to obtain predictions, we can choose to get the raw predictions (i.e. the scale of the response as recorded in the data), or prob predictions (i.e. the scale of the response as modeled, in probabilities).
pred <- predict(model, newdata = test, type = "raw")
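For illustration, type = "prob" returns a data frame with one column of predicted probabilities per class, named after the factor levels. A small sketch (the names pred_prob and pred_manual are illustrative), converting the probabilities back to class labels at a .5 cutoff; below we continue with the raw predictions in pred:
pred_prob <- predict(model, newdata = test, type = "prob")
head(pred_prob) # columns No and Yes with predicted probabilities
pred_manual <- factor(ifelse(pred_prob$Yes > .5, "Yes", "No"), levels = c("No", "Yes"))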
The confusion matrix over the predictions yields many informative measures.
confusionMatrix(pred, test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 140 35
## Yes 23 67
##
## Accuracy : 0.7811
## 95% CI : (0.7265, 0.8294)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 5.749e-09
##
## Kappa : 0.5274
##
## Mcnemar's Test P-Value : 0.1486
##
## Sensitivity : 0.8589
## Specificity : 0.6569
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.7444
## Prevalence : 0.6151
## Detection Rate : 0.5283
## Detection Prevalence : 0.6604
## Balanced Accuracy : 0.7579
##
## 'Positive' Class : No
##
Exercise 10 Compare the model obtained with caret’s train() with a model obtained with glm(). Fit the glm() model on the training set. Study the accuracy and the parameters.
We start by generating the relevant output from glm(). First, we fit the model with the correct family and link function:
fit <- train %$%
  glm(Survived ~ Age + Pclass + Sex,
      family = binomial(link = "logit"))
Next, we generate the predicted values:
pred.glm <- ifelse(predict(fit, newdata = test, type = "response") > .5, "Yes", "No")
We have to indicate how to go from the predicted probabilities back to No and Yes. I use the ifelse() function to do that: if the probability is over .5, then the new value will be Yes, else it will be No.
Now we can enter this vector of predicted Yes and No values into the postResample() function to compare it with the observations in test$Survived.
postResample(pred.glm, test$Survived)
## Accuracy Kappa
## 0.7811321 0.5273678
Finally, we can obtain the parameter summary from the fitted model.
fit %>% summary
##
## Call:
## glm(formula = Survived ~ Age + Pclass + Sex, family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0040 -0.6605 -0.4094 0.6136 2.4892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.672181 0.447397 8.208 2.25e-16 ***
## Age -0.036166 0.008427 -4.291 1.78e-05 ***
## Pclass2nd class -1.138621 0.316826 -3.594 0.000326 ***
## Pclass3rd class -2.453363 0.314649 -7.797 6.33e-15 ***
## Sexmale -2.643157 0.222832 -11.862 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 829.57 on 621 degrees of freedom
## Residual deviance: 555.37 on 617 degrees of freedom
## AIC: 565.37
##
## Number of Fisher Scoring iterations: 5
When we obtain the same information from the caret model, we see that there are no differences.
postResample(pred, test$Survived)
## Accuracy Kappa
## 0.7811321 0.5273678
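As a further check, the two prediction vectors can be compared directly; a one-line sketch:
all(pred.glm == as.character(pred)) # expected TRUE: same model, same predictions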
model$finalModel %>% summary()
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0040 -0.6605 -0.4094 0.6136 2.4892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.672181 0.447397 8.208 2.25e-16 ***
## Age -0.036166 0.008427 -4.291 1.78e-05 ***
## `Pclass2nd class` -1.138621 0.316826 -3.594 0.000326 ***
## `Pclass3rd class` -2.453363 0.314649 -7.797 6.33e-15 ***
## Sexmale -2.643157 0.222832 -11.862 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 829.57 on 621 degrees of freedom
## Residual deviance: 555.37 on 617 degrees of freedom
## AIC: 565.37
##
## Number of Fisher Scoring iterations: 5
The outputs are identical. caret does not perform magical parameter pooling over the cross-validated sets: the returned model is simply the fitted model. The accuracy that is reported for the train set is obtained by means of cross-validation. The fitted model is in both cases applied identically to the test set.
That said, the modeling possibilities with caret are enormous, and many modeling efforts are possible that lead to a properly trained model on a training set. Cross-validation is then needed, and a test set can be used to evaluate the performance of the trained model on unseen data, as in the sketch below.
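To give an impression of such a workflow, here is a hedged sketch that swaps in a penalized regression (the method "glmnet" and tuneLength = 5 are illustrative choices, and the glmnet package is assumed to be installed; this is not part of the practical):
set.seed(123) # for reproducibility
tuned <- train(Survived ~ Age + Pclass + Sex,
               data = train,
               method = "glmnet", # illustrative model choice
               tuneLength = 5,    # try 5 values per tuning parameter
               trControl = trainControl(method = "cv", number = 10))
postResample(predict(tuned, newdata = test), test$Survived)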
End of Practical.