We use the following packages:

library(tidyverse)
library(magrittr)
library(psych)
library(caret)
library(gbm)
library(xgboost)
library(data.table)
library(ggforce)

Introduction

Today, we will learn how to use different ensemble methods in R, recap how to evaluate the performance of these methods, and learn how we can substantively interpret the model output.

In this practical we will work with the ILPD (Indian Liver Patient Dataset) from the UCI Machine Learning Repository (you can find the data here). This data set contains records on 414 liver disease patients and 165 non-patients. In general, medical researchers have two distinct goals when doing research: (1) to classify the people in their waiting room as either patients or non-patients, and (2) to gain insight into the factors that are associated with the disease. In this practical, we will look at both aspects.

In this practical, we will use the tidyverse, magrittr, psych, caret, gbm, xgboost, data.table and ggforce libraries.

I have prepared the training and test data sets for you. You can load them in by running the following code block.

con <- url("https://www.gerkovink.com/erasmus/Day%204/Part%20J/train_test.Rdata")
load(con)
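
If loading succeeded, the workspace now contains the two data sets used throughout this practical (the object names train and test are taken from the code below); a quick sanity check:

# the .Rdata file provides the objects `train` and `test`
ls()
dim(train)
dim(test)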

We will use these data sets to make inferences and to train a prediction model.

Before we continue, we fix the random number generator seed.

set.seed(123)

1. Get an impression of the training data by looking at the structure of the data and creating some descriptive statistics.

First, we inspect the head() and tail() of the train data.

head(train)
## # A tibble: 6 × 11
##     Age Gender Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase
##   <dbl> <fct>            <dbl>            <dbl>                <dbl>
## 1    65 Female             0.7              0.1                  187
## 2    62 Male              10.9              5.5                  699
## 3    62 Male               7.3              4.1                  490
## 4    58 Male               1                0.4                  182
## 5    72 Male               3.9              2                    195
## 6    17 Male               0.9              0.3                  202
## # … with 6 more variables: Alamine_Aminotransferase <dbl>,
## #   Aspartate_Aminotransferase <dbl>, Total_Protiens <dbl>, Albumin <dbl>,
## #   Ratio_Albumin_Globulin <dbl>, Disease <fct>
tail(train)
## # A tibble: 6 × 11
##     Age Gender Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase
##   <dbl> <fct>            <dbl>            <dbl>                <dbl>
## 1    32 Male              25               13.7                  560
## 2    32 Male              12.7              8.4                  190
## 3    60 Male               0.5              0.1                  500
## 4    40 Male               0.6              0.1                   98
## 5    31 Male               1.3              0.5                  184
## 6    38 Male               1                0.3                  216
## # … with 6 more variables: Alamine_Aminotransferase <dbl>,
## #   Aspartate_Aminotransferase <dbl>, Total_Protiens <dbl>, Albumin <dbl>,
## #   Ratio_Albumin_Globulin <dbl>, Disease <fct>

We can also obtain descriptive statistics for this data, per Disease group, as follows

train %>%
  select(-c(Gender, Disease)) %>%
  describeBy(train$Disease, fast = TRUE)
## 
##  Descriptive statistics by group 
## group: Healthy
##                            vars   n   mean     sd  min    max  range    se
## Age                           1 332  46.20  15.78  8.0   90.0   82.0  0.87
## Total_Bilirubin               2 332   4.13   6.23  0.4   42.8   42.4  0.34
## Direct_Bilirubin              3 332   2.02   3.27  0.1   19.7   19.6  0.18
## Alkaline_Phosphotase          4 332 313.24 238.70 75.0 1750.0 1675.0 13.10
## Alamine_Aminotransferase      5 332  98.96 217.79 12.0 2000.0 1988.0 11.95
## Aspartate_Aminotransferase    6 332 140.73 356.85 11.0 4929.0 4918.0 19.58
## Total_Protiens                7 332   6.47   1.10  2.8    9.6    6.8  0.06
## Albumin                       8 332   3.06   0.78  0.9    5.5    4.6  0.04
## Ratio_Albumin_Globulin        9 332   0.92   0.34  0.3    2.8    2.5  0.02
## ------------------------------------------------------------ 
## group: Disease
##                            vars   n   mean     sd   min     max   range    se
## Age                           1 132  41.15  16.37  4.00   84.00   80.00  1.42
## Total_Bilirubin               2 132   1.20   1.11  0.50    7.30    6.80  0.10
## Direct_Bilirubin              3 132   0.42   0.57  0.10    3.60    3.50  0.05
## Alkaline_Phosphotase          4 132 222.58 154.04 90.00 1580.00 1490.00 13.41
## Alamine_Aminotransferase      5 132  33.92  23.18 10.00  181.00  171.00  2.02
## Aspartate_Aminotransferase    6 132  42.28  38.08 12.00  285.00  273.00  3.31
## Total_Protiens                7 132   6.44   1.04  3.70    9.20    5.50  0.09
## Albumin                       8 132   3.29   0.79  1.40    5.00    3.60  0.07
## Ratio_Albumin_Globulin        9 132   1.03   0.29  0.37    1.85    1.48  0.03
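
If you prefer to stay within the tidyverse, a rough equivalent of the above (with fewer statistics than describeBy provides) would be:

train %>%
  group_by(Disease) %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))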

It is quite clear that there are substantial differences between the diseased and non-diseased in the data.


2. To further explore the data for this practical, create some interesting data visualizations that show whether there are interesting patterns in the data.

Hint: Think about adding a color aesthetic for the variable Disease.

Here I give a set of visualizations that I think are informative. There are many more visualizations that one could create:

train %>%
  select(-Gender) %>%
  pivot_longer(where(is.numeric)) %>%
  ggplot(aes(x = value, col = Disease, fill = Disease)) +
  geom_boxplot(alpha = 0.8) +
  facet_wrap(~name, scales = "free") +
  scale_color_brewer(palette = "Paired") +
  scale_fill_brewer(palette = "Paired") +
  theme_minimal()

train %>%
  select(-Gender) %>%
  pivot_longer(where(is.numeric)) %>%
  ggplot(aes(x = value, col = Disease, fill = Disease)) +
  geom_density(alpha = 0.8) +
  facet_wrap(~name, scales = "free") +
  scale_color_brewer(palette = "Paired") +
  scale_fill_brewer(palette = "Paired") +
  theme_minimal()

prop.table(table(train$Gender, train$Disease), margin = 1) %>%
  as.data.frame %>%
  select(Gender = Var1, Disease = Var2, `Relative Frequency` = Freq) %>%
  ggplot(aes(y = `Relative Frequency`, x = Gender, col = Disease, fill = Disease)) +
  geom_col(alpha = 0.8, position = "dodge") +
  scale_fill_brewer(palette = "Paired") +
  scale_color_brewer(palette = "Paired") +
  theme_minimal()

From these visualizations we can see differences between the distributions for the two Disease categories. However, these differences do not seem to be dramatic. Additionally, there are relatively more women with the liver disease than men.


3. Shortly reflect on the difference between bagging, random forests, and boosting.

## Bagging:       fit a classification (or regression) tree to each of N
##                bootstrap samples of the training data, and aggregate the
##                predictions of all trees (majority vote for classification,
##                averaging for regression).
##                Note: the out-of-bag data can serve as an internal validation set.

## Random forest: as in bagging, trees are trained on bootstrap samples of
##                the data. However, at each split only a random subset of
##                the features is considered, which decorrelates the trees.

## Boosting:      trees are built sequentially. Given the current ensemble,
##                we fit a (small) tree on the residuals of the current model,
##                rather than on the outcome Y itself.
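
To make the contrast concrete, below is a minimal hand-rolled bagging sketch. This is illustrative only and not part of this practical's pipeline; it assumes the rpart package, which is also what caret's treebag method builds on (via ipred).

library(rpart)

# fit B classification trees, each on its own bootstrap sample of the data
bag_by_hand <- function(data, B = 25) {
  lapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    rpart(Disease ~ ., data = boot, method = "class")
  })
}

# aggregate: majority vote over the B trees, per observation
predict_bag <- function(trees, newdata) {
  votes <- sapply(trees, function(tr) as.character(predict(tr, newdata, type = "class")))
  factor(apply(votes, 1, function(v) names(which.max(table(v)))))
}

trees <- bag_by_hand(train)
head(predict_bag(trees, train))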

We are going to apply different machine learning models using the caret package.


4. Apply bagging to the training data, to predict the outcome Disease, using the caret library.

Note. We first specify the internal validation settings, like so:

cvcontrol <- trainControl(method = "repeatedcv", 
                          number = 10,
                          allowParallel = TRUE)
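
Note that we did not specify a repeats argument, so the 10-fold cross-validation is repeated only once (the model output below reports "repeated 1 times"). If you want more stable performance estimates, you could request truly repeated cross-validation, at the cost of extra computation; a sketch (we keep the single-repeat cvcontrol for the remainder of this practical):

# e.g., 10-fold cross-validation repeated 5 times
cvcontrol_repeated <- trainControl(method = "repeatedcv", 
                                   number = 10,
                                   repeats = 5,
                                   allowParallel = TRUE)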

These settings can be inserted within the train function from the caret package. Make sure to use the treebag method, to specify cvcontrol as the trControl argument and to set importance = TRUE.

bag_train <- train(Disease ~ .,
                   data = train, 
                   method = 'treebag',
                   trControl = cvcontrol,
                   importance = TRUE)

5. Interpret the variable importance measure using the varImp function on the trained model object.

bag_train %>%
  varImp %>%
  plot


6. Create training set predictions based on the bagged model, and use the confusionMatrix() function from the caret package to assess its performance.

Hint: You will have to create predictions based on the trained model for the training data, and evaluate these against the observed values of the training data.

confusionMatrix(predict(bag_train, type = "raw"),
                train$Disease)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Disease
##    Healthy     332       1
##    Disease       0     131
##                                           
##                Accuracy : 0.9978          
##                  95% CI : (0.9881, 0.9999)
##     No Information Rate : 0.7155          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9947          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9924          
##          Pos Pred Value : 0.9970          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.7155          
##          Detection Rate : 0.7155          
##    Detection Prevalence : 0.7177          
##       Balanced Accuracy : 0.9962          
##                                           
##        'Positive' Class : Healthy         
## 

We have achieved near-perfect training set performance. However, this shows nothing more than that we were able to fit the model to the training data rather well. We need to evaluate the model on the test set before we can draw conclusions about predictive power and test error.
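
As a preview of question 18, evaluating the model on the test set uses exactly the same tools; only the newdata argument changes:

# test-set performance of the bagged model (done for all models in question 18)
confusionMatrix(predict(bag_train, newdata = test),
                test$Disease)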


7. Now ask for the output of the bagged model. Explain why the performance estimates under both approaches differ.

bag_train
## Bagged CART 
## 
## 464 samples
##  10 predictor
##   2 classes: 'Healthy', 'Disease' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 417, 418, 417, 418, 418, 418, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7218316  0.2589208

The two accuracy estimates differ because they are computed on different data. The confusion matrix in the previous question evaluated the model on the very observations it was trained on, which gives a near-perfect and overly optimistic result. The accuracy reported here comes from 10-fold cross-validation, in which every observation is predicted by a model that did not see it during training, and is therefore a more honest estimate of out-of-sample performance.

We will now follow the same approach, but rather than bagging, we will train a random forest on the training data.


8. Fit a random forest to the training data to predict the outcome Disease, using the caret library.

Note. Use the same cvcontrol settings as in the previous model.

rf_train <- train(Disease ~ .,
                  data = train, 
                  method = 'rf',
                  trControl = cvcontrol,
                  importance = TRUE)

9. Again, interpret the variable importance measure using the varImp function on the trained model object. Do you draw the same conclusions as under the bagged model?

rf_train %>%
  varImp %>%
  plot

The random forest model rf_train indicates a different variable importance ordering than the bagged model bag_train. This is due to the random selection of predictors at each split within random forests: the bootstrap-based trees are thereby decorrelated.
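
By default, caret evaluates only a few candidate values for mtry, the number of predictors considered at each split (see the output in the next question). If you want to choose the candidates yourself, you can pass a tuning grid; a sketch with an example grid:

# evaluate a custom set of mtry values instead of caret's default grid
rf_tuned <- train(Disease ~ .,
                  data = train, 
                  method = 'rf',
                  trControl = cvcontrol,
                  tuneGrid = expand.grid(mtry = c(2, 4, 6, 8, 10)),
                  importance = TRUE)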


10. Output the model output from the random forest. Are we doing better than with the bagged model?

rf_train
## Random Forest 
## 
## 464 samples
##  10 predictor
##   2 classes: 'Healthy', 'Disease' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 417, 418, 417, 418, 418, 418, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7243004  0.2497997
##    6    0.7030161  0.2179426
##   10    0.7115749  0.2307716
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Yes, the most accurate model indicates that we do just slightly better than with the bagged model. However, this might well be due to chance.


11. Now, fit a boosting model using the caret library to predict disease status.

Hint: Use gradient boosting (the gbm method in caret).

gbm_train <- train(Disease ~ .,
                   data = train,
                   method = "gbm",
                   verbose = F,
                   trControl = cvcontrol)

12. Again, interpret the variable importance measure. You will have to call summary() on the model object you just created. Compare the output to the previously obtained variable importance measures.

summary(gbm_train)

##                                                   var    rel.inf
## Alkaline_Phosphotase             Alkaline_Phosphotase 17.4710342
## Age                                               Age 16.8212053
## Aspartate_Aminotransferase Aspartate_Aminotransferase 15.6829277
## Alamine_Aminotransferase     Alamine_Aminotransferase 13.2912014
## Direct_Bilirubin                     Direct_Bilirubin 11.0980359
## Total_Protiens                         Total_Protiens  9.3152709
## Albumin                                       Albumin  6.7373895
## Ratio_Albumin_Globulin         Ratio_Albumin_Globulin  6.1805270
## Total_Bilirubin                       Total_Bilirubin  2.8300197
## GenderFemale                             GenderFemale  0.5723883
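
For a comparison on the same footing as the earlier plots, caret's varImp also works on the boosted model; it rescales these same relative influences to a 0-100 range:

gbm_train %>%
  varImp %>%
  plot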

13. Output the model output from our gradient boosting procedure. Are we doing better than with the bagged and random forest model?

gbm_train
## Stochastic Gradient Boosting 
## 
## 464 samples
##  10 predictor
##   2 classes: 'Healthy', 'Disease' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 417, 418, 418, 418, 418, 418, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7240981  0.2019393
##   1                  100      0.7239593  0.2381550
##   1                  150      0.7218779  0.2268189
##   2                   50      0.7219704  0.2328491
##   2                  100      0.7175763  0.2311722
##   2                  150      0.7283071  0.2698521
##   3                   50      0.7154024  0.2243496
##   3                  100      0.7088344  0.2236268
##   3                  150      0.7281684  0.2802560
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  2, shrinkage = 0.1 and n.minobsinnode = 10.

Yes, our best model does slightly better than the previous two models. However, the performance gain is small and might well be due to random variation.
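
The grid shown above is caret's default for gbm, in which shrinkage and n.minobsinnode are held constant. If we wanted to tune these as well, we could supply our own grid; the values below are merely an example:

# custom tuning grid over all four gbm tuning parameters
gbm_grid <- expand.grid(n.trees = c(100, 300, 500),
                        interaction.depth = 1:3,
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = c(5, 10))

gbm_tuned <- train(Disease ~ .,
                   data = train,
                   method = "gbm",
                   verbose = FALSE,
                   trControl = cvcontrol,
                   tuneGrid = gbm_grid)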


For now, we will continue with extreme gradient boosting, although we will use a different procedure.

We will use xgboost to train a binary classification model, and create some visualizations to obtain additional insight into our model. We will create the visualizations using SHAP (SHapley Additive exPlanations) values, which measure the importance of each variable in the model. Specifically, SHAP values indicate the influence of each input variable on the predicted probability for each person: essentially, the difference between the predicted probability with and without that variable's contribution.


14. Download the file shap.R from this GitHub repository.

Note. There are multiple ways to do this, of which the simplest is to run the following code block.

con <- url("https://github.com/pablo14/shap-values/blob/master/shap.R?raw=TRUE")
source(con)

15. Specify your model as follows, and use it to create predictions on the training data.

# dummy-code the predictors (dropping the intercept column) and recode the
# outcome to numeric 0/1 (0 = Healthy, 1 = Disease), as xgboost requires
train_x <- model.matrix(Disease ~ ., train)[,-1]
train_y <- as.numeric(train$Disease) - 1
xgboost_train <- xgboost(data = train_x,
                         label = train_y, 
                         max.depth = 10,
                         eta = 1,
                         nthread = 4,
                         nrounds = 4,
                         objective = "binary:logistic",
                         verbose = 2)

# turn the predicted probabilities P(Disease) into class labels at a 0.5 cutoff
pred <- tibble(Disease = predict(xgboost_train, newdata = train_x)) %>%
  mutate(Disease = factor(ifelse(Disease < 0.5, 1, 2),
                          labels = c("Healthy", "Disease")))
confusionMatrix(pred$Disease, train$Disease)

16. First, calculate the SHAP rank scores for all variables in the data, and create a variable importance plot using these values. Interpret the plot.

shap_results <- shap.score.rank(xgboost_train,
                                X_train = train_x,
                                shap_approx = F)
## make SHAP score by decreasing order
var_importance(shap_results)

17. Plot the SHAP values for every individual for every feature and interpret them.

shap_long <- shap.prep(shap = shap_results,
                       X_train = train_x)

plot.shap.summary(shap_long)

xgb.plot.shap(train_x, features = colnames(train_x), model = xgboost_train, n_col = 3)

The first plot demonstrates that those with a high value for Direct_Bilirubin have a lower probability of being diseased. Also, those with a higher age have a lower probability of being diseased, while those with a higher Albumin have a higher probability of being diseased.

The second set of plots displays the marginal relationships between the SHAP values and the predictors. This conveys the same information, but in greater detail; the interpretation can be a bit tricky for the inexperienced data analyst.
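
A useful property to keep in mind: for a binary:logistic xgboost model, the SHAP values live on the log-odds scale and, together with a bias term, sum to each person's predicted log-odds. Assuming a reasonably recent xgboost version, you can verify this with the predcontrib argument to predict():

# SHAP contributions per person; the last column is the BIAS (baseline) term
phi <- predict(xgboost_train, newdata = train_x, predcontrib = TRUE)

# the row sums give each person's predicted log-odds ...
head(rowSums(phi))
# ... which should match the log-odds of the predicted probabilities
head(qlogis(predict(xgboost_train, newdata = train_x)))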


18. Verify which of the models you created in this practical performs best on the test data.

bag_test <- predict(bag_train, newdata = test)
rf_test  <- predict(rf_train, newdata = test)
gbm_test <- predict(gbm_train, newdata = test)
xgb_test <- predict(xgboost_train, newdata = model.matrix(Disease ~ ., test)[,-1]) %>%
  {factor(ifelse(. < 0.5, 1, 2), levels = c(1, 2), labels = c("Healthy", "Disease"))}

list(`bagging` = bag_test, 
     `random_forest` = rf_test, 
     `gradient_boosting` = gbm_test, 
     `xtreme_gradient_boosting` = xgb_test) %>%
  map(~ confusionMatrix(.x, test$Disease))
## $bagging
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Disease
##    Healthy      64      22
##    Disease      18      11
##                                           
##                Accuracy : 0.6522          
##                  95% CI : (0.5577, 0.7386)
##     No Information Rate : 0.713           
##     P-Value [Acc > NIR] : 0.9368          
##                                           
##                   Kappa : 0.1181          
##                                           
##  Mcnemar's Test P-Value : 0.6353          
##                                           
##             Sensitivity : 0.7805          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.7442          
##          Neg Pred Value : 0.3793          
##              Prevalence : 0.7130          
##          Detection Rate : 0.5565          
##    Detection Prevalence : 0.7478          
##       Balanced Accuracy : 0.5569          
##                                           
##        'Positive' Class : Healthy         
##                                           
## 
## $random_forest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Disease
##    Healthy      69      25
##    Disease      13       8
##                                           
##                Accuracy : 0.6696          
##                  95% CI : (0.5757, 0.7544)
##     No Information Rate : 0.713           
##     P-Value [Acc > NIR] : 0.87084         
##                                           
##                   Kappa : 0.0941          
##                                           
##  Mcnemar's Test P-Value : 0.07435         
##                                           
##             Sensitivity : 0.8415          
##             Specificity : 0.2424          
##          Pos Pred Value : 0.7340          
##          Neg Pred Value : 0.3810          
##              Prevalence : 0.7130          
##          Detection Rate : 0.6000          
##    Detection Prevalence : 0.8174          
##       Balanced Accuracy : 0.5419          
##                                           
##        'Positive' Class : Healthy         
##                                           
## 
## $gradient_boosting
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Disease
##    Healthy      73      22
##    Disease       9      11
##                                           
##                Accuracy : 0.7304          
##                  95% CI : (0.6397, 0.8089)
##     No Information Rate : 0.713           
##     P-Value [Acc > NIR] : 0.38375         
##                                           
##                   Kappa : 0.2534          
##                                           
##  Mcnemar's Test P-Value : 0.03114         
##                                           
##             Sensitivity : 0.8902          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.7684          
##          Neg Pred Value : 0.5500          
##              Prevalence : 0.7130          
##          Detection Rate : 0.6348          
##    Detection Prevalence : 0.8261          
##       Balanced Accuracy : 0.6118          
##                                           
##        'Positive' Class : Healthy         
##                                           
## 
## $xtreme_gradient_boosting
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Disease
##    Healthy      64      23
##    Disease      18      10
##                                           
##                Accuracy : 0.6435          
##                  95% CI : (0.5488, 0.7306)
##     No Information Rate : 0.713           
##     P-Value [Acc > NIR] : 0.9579          
##                                           
##                   Kappa : 0.0875          
##                                           
##  Mcnemar's Test P-Value : 0.5322          
##                                           
##             Sensitivity : 0.7805          
##             Specificity : 0.3030          
##          Pos Pred Value : 0.7356          
##          Neg Pred Value : 0.3571          
##              Prevalence : 0.7130          
##          Detection Rate : 0.5565          
##    Detection Prevalence : 0.7565          
##       Balanced Accuracy : 0.5418          
##                                           
##        'Positive' Class : Healthy         
## 

From the test-set confusion matrices we can see that the gradient boosting model performs best: it attains the highest accuracy (0.7304) and the highest balanced accuracy (0.6118) of the four models. Note, however, that none of the models significantly outperforms the no-information rate of 0.713, so there is ample room for improvement.

End of Practical