We use the following packages:

library(mice)
library(caret)
library(dplyr)
library(magrittr)
library(DAAG)
library(readr)

We use the titanic data set for this exercise. Download the titanic.csv data set.

con <- url("https://www.gerkovink.com/erasmus/Day%202/Part%20D/titanic.csv")
titanic <- read_csv(con)
## Rows: 887 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Name, Sex
## dbl (6): Survived, Pclass, Age, Siblings/Spouses Aboard, Parents/Children Ab...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exercise 1 Inspect the titanic data set by calling titanic in the console and with functions summary(), str() and md.pattern().

titanic
## # A tibble: 887 × 8
##    Survived Pclass Name     Sex      Age `Siblings/Spous… `Parents/Childr…  Fare
##       <dbl>  <dbl> <chr>    <chr>  <dbl>            <dbl>            <dbl> <dbl>
##  1        0      3 Mr. Owe… male      22                1                0  7.25
##  2        1      1 Mrs. Jo… female    38                1                0 71.3 
##  3        1      3 Miss. L… female    26                0                0  7.92
##  4        1      1 Mrs. Ja… female    35                1                0 53.1 
##  5        0      3 Mr. Wil… male      35                0                0  8.05
##  6        0      3 Mr. Jam… male      27                0                0  8.46
##  7        0      1 Mr. Tim… male      54                0                0 51.9 
##  8        0      3 Master.… male       2                3                1 21.1 
##  9        1      3 Mrs. Os… female    27                0                2 11.1 
## 10        1      2 Mrs. Ni… female    14                1                0 30.1 
## # … with 877 more rows

We can see that the titanic data set is imported as a tibble. A tibble is a more flexible data frame with a much nicer print method.
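To illustrate, the two classes convert into one another without loss (a minimal sketch; the object names are mine):

titanic_df  <- as.data.frame(titanic)  # a plain data.frame prints all rows
titanic_tbl <- as_tibble(titanic_df)   # a tibble prints only the first 10
# the data are identical in both objects; only the print method differs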

summary(titanic)
##     Survived          Pclass          Name               Sex           
##  Min.   :0.0000   Min.   :1.000   Length:887         Length:887        
##  1st Qu.:0.0000   1st Qu.:2.000   Class :character   Class :character  
##  Median :0.0000   Median :3.000   Mode  :character   Mode  :character  
##  Mean   :0.3856   Mean   :2.306                                        
##  3rd Qu.:1.0000   3rd Qu.:3.000                                        
##  Max.   :1.0000   Max.   :3.000                                        
##       Age        Siblings/Spouses Aboard Parents/Children Aboard
##  Min.   : 0.42   Min.   :0.0000          Min.   :0.0000         
##  1st Qu.:20.25   1st Qu.:0.0000          1st Qu.:0.0000         
##  Median :28.00   Median :0.0000          Median :0.0000         
##  Mean   :29.47   Mean   :0.5254          Mean   :0.3833         
##  3rd Qu.:38.00   3rd Qu.:1.0000          3rd Qu.:0.0000         
##  Max.   :80.00   Max.   :8.0000          Max.   :6.0000         
##       Fare        
##  Min.   :  0.000  
##  1st Qu.:  7.925  
##  Median : 14.454  
##  Mean   : 32.305  
##  3rd Qu.: 31.137  
##  Max.   :512.329

The summary() output gives us direct information about the parametric nature of the columns in the data.

str(titanic)
## spec_tbl_df [887 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Survived               : num [1:887] 0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass                 : num [1:887] 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name                   : chr [1:887] "Mr. Owen Harris Braund" "Mrs. John Bradley (Florence Briggs Thayer) Cumings" "Miss. Laina Heikkinen" "Mrs. Jacques Heath (Lily May Peel) Futrelle" ...
##  $ Sex                    : chr [1:887] "male" "female" "female" "female" ...
##  $ Age                    : num [1:887] 22 38 26 35 35 27 54 2 27 14 ...
##  $ Siblings/Spouses Aboard: num [1:887] 1 1 0 1 0 0 0 3 0 1 ...
##  $ Parents/Children Aboard: num [1:887] 0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare                   : num [1:887] 7.25 71.28 7.92 53.1 8.05 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Survived = col_double(),
##   ..   Pclass = col_double(),
##   ..   Name = col_character(),
##   ..   Sex = col_character(),
##   ..   Age = col_double(),
##   ..   `Siblings/Spouses Aboard` = col_double(),
##   ..   `Parents/Children Aboard` = col_double(),
##   ..   Fare = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

When we study the structure of the data set, we see that the outcome Survived is not coded as a factor, but as a numeric column. The same holds for Pclass. This will influence the default estimation later on. There are more irregularities, but we’ll ignore those for now.

md.pattern(titanic, rotate.names = TRUE)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##     Survived Pclass Name Sex Age Siblings/Spouses Aboard
## 887        1      1    1   1   1                       1
##            0      0    0   0   0                       0
##     Parents/Children Aboard Fare  
## 887                       1    1 0
##                           0    0 0

There are no missing values in this titanic data set.
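To see what md.pattern() reports when values are missing, you could run it on an incomplete data set, for example the nhanes data that ships with mice (output omitted here):

md.pattern(nhanes, rotate.names = TRUE)  # displays the missing data patterns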


Exercise 2 Correct the measurement level of the columns Pclass and Survived. Then ask for the summary() once more.

titanic %<>% 
  mutate(Pclass = factor(Pclass, labels = c("1st class", "2nd class", "3rd class")), 
         Survived = factor(Survived, labels = c("No", "Yes")))
titanic %>% summary()
##  Survived        Pclass        Name               Sex           
##  No :545   1st class:216   Length:887         Length:887        
##  Yes:342   2nd class:184   Class :character   Class :character  
##            3rd class:487   Mode  :character   Mode  :character  
##                                                                 
##                                                                 
##                                                                 
##       Age        Siblings/Spouses Aboard Parents/Children Aboard
##  Min.   : 0.42   Min.   :0.0000          Min.   :0.0000         
##  1st Qu.:20.25   1st Qu.:0.0000          1st Qu.:0.0000         
##  Median :28.00   Median :0.0000          Median :0.0000         
##  Mean   :29.47   Mean   :0.5254          Mean   :0.3833         
##  3rd Qu.:38.00   3rd Qu.:1.0000          3rd Qu.:0.0000         
##  Max.   :80.00   Max.   :8.0000          Max.   :6.0000         
##       Fare        
##  Min.   :  0.000  
##  1st Qu.:  7.925  
##  Median : 14.454  
##  Mean   : 32.305  
##  3rd Qu.: 31.137  
##  Max.   :512.329

We now see the tabular information about the Survived and Pclass columns. This is because these columns are now coded as factors (i.e. categorical variables with a numeric representation). Note that in the mutate call, I used the %<>% pipe. This assign pipe returns the end result of the pipe to the original object. This avoids the use of the <- assignment operator and calling the titanic set twice, as in the regular strategy below:

titanic <- titanic %>% 
  mutate(Pclass = factor(Pclass, labels = c("1st class", "2nd class", "3rd class")), 
         Survived = factor(Survived, labels = c("No", "Yes")))

Data subsetting


Exercise 3 Split the data manually into two parts: a training part (70% of cases) and a test part (30% of cases). Verify the dimensions of the splits with function dim().

set.seed(123) # for reproducibility
trainIndex <- createDataPartition(titanic$Survived, p = .7, times = 1, list = FALSE)

train <- titanic[trainIndex, ]
test <- titanic[-trainIndex, ]

We make use of the createDataPartition() function from package caret to generate the row numbers for the splits. We could have also done this manually with e.g. trainIndex <- sample(1:nrow(titanic), round(.7*nrow(titanic))); note that we sample without replacement, so that no case can end up in both parts. I find createDataPartition() convenient because it plugs directly into the rest of the caret functionality, and it samples within the levels of the outcome, so the proportion of survivors is preserved in both parts.
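A minimal sketch of such a manual split (the object names train_manual and test_manual are mine):

set.seed(123)
idx          <- sample(nrow(titanic), size = round(.7 * nrow(titanic)))
train_manual <- titanic[idx, ]   # 70% of the rows
test_manual  <- titanic[-idx, ]  # the remaining 30%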

dim(train)
## [1] 622   8
dim(test)
## [1] 265   8

We can see that the split, with 622 cases in the train set and 265 cases in the test set, approximates the desired p = .7 split proportion well.
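We can check the realized proportion directly:

nrow(train) / nrow(titanic)
## [1] 0.7012401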


Linear model


Exercise 4 Predict Age from Pclass, Sex and Survived. Train your model on the train set and validate it on the test set.

fit <- train %$% 
  lm(Age ~ Pclass + Sex + Survived) 
pred <- fit %>% 
  predict(newdata = test)

The fit object contains the model fitted on the training data. The pred object contains the predictions obtained by applying the fit model to the test data.


Exercise 5 Now calculate the RMSE and the \(R^2\) for the predictions and compare those to the fitted model in fit.

results <- data.frame(R2 = c(cor(pred, test$Age)^2, 
                             summary(fit)$r.squared),
                      RMSE = c((pred - test$Age)^2 %>% mean %>% sqrt, 
                               fit$residuals^2 %>% mean %>% sqrt))
rownames(results) <- c("predicted", "fitted")
results
##                  R2     RMSE
## predicted 0.1802922 12.78799
## fitted    0.2066340 12.57457

We see that the \(R^2\) is lower for the predictions and that the root mean squared error is higher. For unbiased estimators we can view the RMSE as the standard error of the estimator. The MSE would then be the variance of that unbiased estimator.
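For reference, over the \(n\) test cases these metrics are

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MSE} = \mathrm{RMSE}^2,
\]

and the \(R^2\) for the predictions is the squared correlation between \(y\) and \(\hat{y}\), matching the cor(pred, test$Age)^2 call above.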


Exercise 6 Now use the caret package to do the same as above. Use the default parameters for the train() function and use the train data to train the model.

set.seed(123) # for reproducibility
# train the model on training set
model <- train(Age ~ Pclass + Sex + Survived,
               data = train,
               method = "lm")
model
## Linear Regression 
## 
## 622 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 622, 622, 622, 622, 622, 622, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   12.83484  0.1941763  10.1362
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

We see that the train() function by default uses bootstrap resampling: the train data are resampled with replacement 25 times and the model is evaluated on each resample. Every bootstrap sample is slightly different, so the obtained results also differ slightly between resamples. We can get information about this variance from:

model$results
##   intercept     RMSE  Rsquared     MAE    RMSESD RsquaredSD     MAESD
## 1      TRUE 12.83484 0.1941763 10.1362 0.5506277 0.04287056 0.4955852
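The *SD columns are the standard deviations of each metric over the 25 bootstrap resamples. The per-resample values themselves are stored in the resample element of the trained model (output omitted here):

head(model$resample)  # RMSE, Rsquared and MAE for each individual resample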

Exercise 7 Now use the model from (6) to predict the test data and calculate the same metrics as in (5).

pred <- predict(model, newdata = test)
# R^2
cor(pred, test$Age)^2
## [1] 0.1802922
# RMSE
(pred - test$Age)^2 %>% mean %>% sqrt
## [1] 12.78799

A much easier way of obtaining the same metrics is with the postResample() function:

postResample(pred = pred, obs = test$Age)
##       RMSE   Rsquared        MAE 
## 12.7879928  0.1802922  9.6969758

Exercise 8 Rerun the model from (6), but use 10-fold cross-validation on the training set. Evaluate the predictions with postResample().

set.seed(123) # for reproducibility
model <- train(Age ~ Pclass + Sex + Survived,
               data = train,
               method = "lm",
               trControl = trainControl(method = "cv", number = 10)
               )
model
## Linear Regression 
## 
## 622 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 560, 558, 559, 560, 559, 562, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE    
##   12.5743  0.2086778  9.93655
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
pred <- predict(model, newdata = test)
postResample(pred, test$Age)
##       RMSE   Rsquared        MAE 
## 12.7879928  0.1802922  9.6969758

There’s not much more we can do for this linear model. At least we now know that the model below is not grossly overfitted and that, if new data were to come in, there would not be much accuracy in predicting Age from these predictors. Let’s hope that never happens.

lm(Age ~ Pclass + Sex + Survived, data = titanic) %>% summary()
## 
## Call:
## lm(formula = Age ~ Pclass + Sex + Survived, data = titanic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.406  -7.934  -0.934   7.493  47.066 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       41.406      1.422  29.122  < 2e-16 ***
## Pclass2nd class   -9.803      1.281  -7.653 5.15e-14 ***
## Pclass3rd class  -15.889      1.098 -14.467  < 2e-16 ***
## Sexmale            1.417      1.061   1.335    0.182    
## SurvivedYes       -5.427      1.098  -4.944 9.15e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.67 on 882 degrees of freedom
## Multiple R-squared:  0.1991, Adjusted R-squared:  0.1954 
## F-statistic:  54.8 on 4 and 882 DF,  p-value: < 2.2e-16

We can still infer that Age differs over these groups. The overall model is highly significant.


Logistic regression


Exercise 9 Use the same train/test splits to evaluate the performance of a logistic model where Survived is predicted from Age, Pclass and Sex. Study the accuracy and the confusion matrix.

We start by specifying the caret model.

set.seed(123) # for reproducibility
model <- train(Survived ~ Age + Pclass + Sex,
               data = train,
               method = "glm",
               family = binomial(link = "logit"),
               trControl = trainControl(method = "cv", number = 10)
               )
model
## Generalized Linear Model 
## 
## 622 samples
##   3 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 559, 560, 560, 560, 560, 560, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8020993  0.5771372

We can ask for a confusion matrix over the cross-validated sets.

confusionMatrix(model)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  52.6 10.9
##        Yes  8.8 27.7
##                             
##  Accuracy (average) : 0.8023

We see that a bit over 80% of the cases are predicted accurately. The off-diagonal cells hold the remaining almost 20%.

When we apply the model to the test data to obtain predictions, we can choose to get the raw predictions (i.e. on the scale of the response as recorded in the data) or the prob predictions (i.e. the predicted probabilities for each class).

pred <- predict(model, newdata = test, type = "raw")
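The probability scale is available in the same way (a quick sketch; output omitted here):

prob <- predict(model, newdata = test, type = "prob")
head(prob)  # one column of predicted probabilities per class: No and Yes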

The confusion matrix over the predictions yields many informative measures.

confusionMatrix(pred, test$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  140  35
##        Yes  23  67
##                                           
##                Accuracy : 0.7811          
##                  95% CI : (0.7265, 0.8294)
##     No Information Rate : 0.6151          
##     P-Value [Acc > NIR] : 5.749e-09       
##                                           
##                   Kappa : 0.5274          
##                                           
##  Mcnemar's Test P-Value : 0.1486          
##                                           
##             Sensitivity : 0.8589          
##             Specificity : 0.6569          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.7444          
##              Prevalence : 0.6151          
##          Detection Rate : 0.5283          
##    Detection Prevalence : 0.6604          
##       Balanced Accuracy : 0.7579          
##                                           
##        'Positive' Class : No              
## 

Exercise 10 Compare the model obtained with caret’s train() with a model obtained with glm(). Fit the glm() model on the training set. Study accuracy and parameters.

We start with generating the relevant output from glm(). First, we fit the model with the correct family and link function:

fit <- train %$% 
  glm(Survived ~ Age + Pclass + Sex, 
      family = binomial(link = "logit"))
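For reference, with the logit link the model expresses the survival probability as

\[
\Pr(\text{Survived} = \text{Yes} \mid \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{x}^\top\boldsymbol{\beta}}},
\]

so the linear predictor \(\mathbf{x}^\top\boldsymbol{\beta}\) lives on the log-odds scale.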

Next, we generate the predicted values:

pred.glm <- ifelse(predict(fit, newdata = test, type = "response") > .5, "Yes", "No")

We have to indicate how to go from the predicted probabilities back to No and Yes. I use the ifelse() function to do that: if the probability is over .5, then the new value will be Yes, else it will be No.

Now we can enter this vector of predicted Yes and No in the postResample() function to compare it with the observations in test$Survived.

postResample(pred.glm, test$Survived)
##  Accuracy     Kappa 
## 0.7811321 0.5273678
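If we also want the full confusion matrix for these glm() predictions, we first convert the character vector to a factor, because confusionMatrix() expects factors with matching levels (a sketch; output omitted here):

pred.glm %>% 
  factor(levels = levels(test$Survived)) %>% 
  confusionMatrix(test$Survived)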

Finally, we can obtain the parameter summary from the fitted model.

fit %>% summary
## 
## Call:
## glm(formula = Survived ~ Age + Pclass + Sex, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0040  -0.6605  -0.4094   0.6136   2.4892  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.672181   0.447397   8.208 2.25e-16 ***
## Age             -0.036166   0.008427  -4.291 1.78e-05 ***
## Pclass2nd class -1.138621   0.316826  -3.594 0.000326 ***
## Pclass3rd class -2.453363   0.314649  -7.797 6.33e-15 ***
## Sexmale         -2.643157   0.222832 -11.862  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 829.57  on 621  degrees of freedom
## Residual deviance: 555.37  on 617  degrees of freedom
## AIC: 565.37
## 
## Number of Fisher Scoring iterations: 5

When we obtain the same information from the caret model, we see that there are no differences.

postResample(pred, test$Survived)
##  Accuracy     Kappa 
## 0.7811321 0.5273678
model$finalModel %>% summary()
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0040  -0.6605  -0.4094   0.6136   2.4892  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        3.672181   0.447397   8.208 2.25e-16 ***
## Age               -0.036166   0.008427  -4.291 1.78e-05 ***
## `Pclass2nd class` -1.138621   0.316826  -3.594 0.000326 ***
## `Pclass3rd class` -2.453363   0.314649  -7.797 6.33e-15 ***
## Sexmale           -2.643157   0.222832 -11.862  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 829.57  on 621  degrees of freedom
## Residual deviance: 555.37  on 617  degrees of freedom
## AIC: 565.37
## 
## Number of Fisher Scoring iterations: 5

The outputs are identical. caret does not perform magical parameter pooling over the cross-validated sets: the returned model is simply the fitted model. The accuracy over the train set is obtained by means of cross-validation. The fitted model is in both cases identically applied to the test set.
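We can make this explicit: the coefficient estimates of the two fits should be numerically identical, as the matching summaries above already suggest. The check.attributes = FALSE argument ignores the slightly different coefficient names:

all.equal(coef(fit), coef(model$finalModel), check.attributes = FALSE)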

That said, the modeling possibilities with caret are enormous, and many of them require proper tuning and training on a training set. Cross-validation is then needed. A test set can then be used to evaluate the performance of the trained model on unseen data.
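As a tiny illustration of such a workflow (hypothetical: method "knn" and the grid of k values are my choice, not part of this practical):

set.seed(123)
tuned <- train(Survived ~ Age + Pclass + Sex,
               data = train,
               method = "knn",                             # k-nearest neighbours
               tuneGrid = expand.grid(k = c(3, 5, 7, 9)),  # candidate tuning values
               trControl = trainControl(method = "cv", number = 10))
# cross-validation selects the k with the highest accuracy; the final model
# is then refit on the full training set and could be evaluated on the test set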


End of Practical.