Introduction


In this practical, we will focus on ridge regression.

One of the packages we are going to use is glmnet. For this, you will probably need to run install.packages("glmnet") before calling the library() functions below. GGally is also a new package that needs to be installed before you can use its ggpairs() function.
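
A one-time installation could, for example, look like this (only run it if the packages are not yet on your system):

install.packages("glmnet")
install.packages("GGally")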

library(MASS)
library(magrittr)
library(dplyr)
library(GGally)
library(glmnet)
library(caret)

Before starting with the exercises, it is a good idea to set your seed, so that (1) your answers are reproducible and (2) you can compare your answers with the answers provided.

set.seed(123)

Exercises


The mtcars data set (part of the datasets package that ships with base R) contains fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Code sketches illustrating one possible approach to each exercise are collected after the exercise list.


  1. Make yourself familiar with the mtcars data set. Is everything properly coded?

  2. Recode the columns for which the measurement level is not properly set.

  3. Visually inspect the data structure.

  4. Fit a linear model with hp as the response and all other features as the predictors. Try to use the exposition (%$%) pipe.

  5. Inspect the model’s inference. How would you evaluate the model’s performance?

  6. Now fit a ridge regression model to the data. Name the resulting object ridge (a glmnet sketch follows this list).

  7. Inspect the ridge object and its coef().

  8. Study the summary() of the ridge regression.

  9. Plot the ridge regression’s fitted object ridge twice: once with the deviance on the x-axis and once with the log of lambda on the x-axis. Hint: see ?plot.glmnet.

  10. Now fit the ridge regression again, but in a cross-validation setting. Name the resulting object cv.ridge.

  11. Study the output of the cv.ridge object and run the object through plot().

  12. Compare the ridge regression to the linear model. Study the RMSE and R-squared of the predicted values from both approaches. Which performs better?

  13. Compare a 4-fold cross-validated ridge regression to the linear model again, but now train both models on a training set with 70% of cases. Use the remaining test cases to study the RMSE and R-squared of the predicted values (use both \(\lambda_{min}\) and \(\lambda_{1se}\) for ridge regression) from both approaches. Which method has better predictive power?
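
A possible approach to Exercises 1–3 is sketched below. It assumes that the binary columns vs and am are the ones whose measurement level needs fixing (you could argue for recoding cyl, gear and carb as well); the recoded data are written back into mtcars so that the later sketches use the same object.

# Exercises 1-3: inspect, recode and visualise mtcars
str(mtcars)   # every column is stored as numeric, including the categorical ones

mtcars <- mtcars %>%
  mutate(vs = factor(vs, labels = c("V-shaped", "straight")),
         am = factor(am, labels = c("automatic", "manual")))

ggpairs(mtcars)   # pairwise plots to inspect the data structure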
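
For Exercises 4–5, one way to use the exposition pipe is sketched below; fit is just an assumed name for the resulting object, and the predictors are spelled out because the . shorthand needs a data argument.

# Exercises 4-5: linear model for hp with all other columns as predictors
fit <- mtcars %$%
  lm(hp ~ mpg + cyl + disp + drat + wt + qsec + vs + am + gear + carb)
# equivalently: fit <- lm(hp ~ ., data = mtcars)

summary(fit)   # coefficients, standard errors, R-squared and the F-test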
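
For Exercises 6–9, a sketch using glmnet is given below. glmnet expects a numeric predictor matrix rather than a data frame, so model.matrix() is used to build one, and alpha = 0 selects the ridge penalty.

# Exercises 6-9: ridge regression with glmnet
x <- model.matrix(hp ~ ., data = mtcars)[, -1]   # predictor matrix without the intercept column
y <- mtcars$hp

ridge <- glmnet(x, y, family = "gaussian", alpha = 0)   # alpha = 0 gives ridge (alpha = 1 would be lasso)

ridge           # degrees of freedom, % deviance explained and lambda for each step
coef(ridge)     # coefficients for every value of lambda
summary(ridge)  # a summary of the glmnet list object

plot(ridge, xvar = "dev")      # fraction of deviance explained on the x-axis
plot(ridge, xvar = "lambda")   # log(lambda) on the x-axis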
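
Exercises 10–11 can be approached with cv.glmnet(), which runs 10-fold cross-validation by default.

# Exercises 10-11: cross-validated ridge regression
cv.ridge <- cv.glmnet(x, y, family = "gaussian", alpha = 0)

cv.ridge        # reports lambda.min and lambda.1se
plot(cv.ridge)  # cross-validated mean squared error against log(lambda)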
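
For Exercise 12, the predicted values of both models can be compared on the data they were fitted on. The sketch below uses postResample() from caret to obtain the RMSE and R-squared; using lambda.min for the ridge predictions is an assumption.

# Exercise 12: compare the ridge regression to the linear model
pred_lm    <- predict(fit)                                   # fitted values of the linear model
pred_ridge <- predict(cv.ridge, newx = x, s = "lambda.min")  # ridge predictions at lambda.min

postResample(pred_lm, mtcars$hp)                  # RMSE, R-squared and MAE
postResample(as.numeric(pred_ridge), mtcars$hp)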
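
Finally, one possible approach to Exercise 13. The 70/30 split below is drawn with a plain sample() call (caret::createDataPartition() would work just as well), nfolds = 4 gives the 4-fold cross-validation, and all object names are assumptions.

# Exercise 13: train on 70% of the cases, evaluate on the remaining 30%
train_idx <- sample(seq_len(nrow(mtcars)), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

x_train <- model.matrix(hp ~ ., data = train)[, -1]
x_test  <- model.matrix(hp ~ ., data = test)[, -1]

cv_ridge_4 <- cv.glmnet(x_train, train$hp, alpha = 0, nfolds = 4)  # 4-fold CV ridge
lm_train   <- lm(hp ~ ., data = train)                             # linear model on the training set

# Test-set RMSE and R-squared for both approaches
postResample(predict(lm_train, newdata = test), test$hp)
postResample(as.numeric(predict(cv_ridge_4, newx = x_test, s = "lambda.min")), test$hp)
postResample(as.numeric(predict(cv_ridge_4, newx = x_test, s = "lambda.1se")), test$hp)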

Important Note: We have used both a train/test split and a k-fold cross-validation procedure. A different seed value will result in different splits and different cross-validation folds. Keep in mind that once you fix the seed, all subsequent results become seed-dependent.


End of Practical