Introduction

In this practical, you will learn how to handle many variables with regression by using variable selection techniques, and how to tune hyperparameters for these techniques. This practical has been derived from chapter 6 of ISLR.

One of the packages we are going to use is glmnet. For this, you will probably need to run install.packages("glmnet") once before the library() calls below will work.

library(ISLR)
library(glmnet)
library(tidyverse)

To get reproducible results, it is wise to set a seed whenever you rely on random processes.

set.seed(45)

Best subset selection

Our goal for today is to use the Hitters dataset from the ISLR package to predict Salary.


  1. Prepare a dataframe baseball from the Hitters dataset where you remove the baseball players for which the Salary is missing. How many baseball players are left?

baseball <- Hitters %>% filter(!is.na(Salary))

nrow(baseball)
## [1] 263

  2. Create baseball_train (50%), baseball_valid (30%), and baseball_test (20%) datasets.

# randomly assign each of the 263 players to a split (132 + 79 + 52 = 263)
split <- c(rep("train", 132), rep("valid", 79), rep("test", 52))
baseball <- baseball %>% mutate(split = sample(split))

baseball_train <- baseball %>% filter(split == "train")
baseball_valid <- baseball %>% filter(split == "valid")
baseball_test  <- baseball %>% filter(split == "test")
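
A quick check that the split sizes come out as intended:

table(baseball$split)
## 
##  test train valid 
##    52   132    79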

  3. Create a function called lm_mse() with as inputs (1) a formula, (2) a training dataset, and (3) a validation dataset, which outputs the MSE on the validation dataset for predictions from a linear model fit on the training dataset.

Start like this:

lm_mse <- function(formula, train_data, valid_data) {
  # extract the name of the outcome variable and its true values
  y_name <- as.character(formula)[2]
  y_true <- valid_data[[y_name]]
  
  # fit on the training data, predict on the validation data
  lm_fit <- lm(formula, train_data)
  y_pred <- predict(lm_fit, newdata = valid_data)
  
  # return the mean squared error on the validation data
  mean((y_true - y_pred)^2)
}

  4. Try out your function with the formula Salary ~ Hits + Runs, using baseball_train and baseball_valid.

lm_mse(Salary ~ Hits + Runs, baseball_train, baseball_valid)
## [1] 142631.1

We have pre-programmed a function for you that generates, as a character vector, all formulas with a set number of p predictor variables. You can load the function into your environment by sourcing the .R file it is written in:

source("generate_formulas.R")

You can use it like so (a hypothetical call with made-up variable names, shown for illustration):
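
# a hypothetical call: all 2-predictor formulas for an outcome y
generate_formulas(p = 2, x_vars = c("x1", "x2", "x3"), y_var = "y")
# returns a character vector along the lines of
# "y ~ x1 + x2" "y ~ x1 + x3" "y ~ x2 + x3"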


  5. Create a character vector of all predictor variables from the Hitters dataset. colnames() may be of help. Note that Salary is not a predictor variable.

x_vars <- colnames(Hitters)
x_vars <- x_vars[x_vars != "Salary"]

  6. Generate all formulas with as outcome Salary and 3 predictors from the Hitters data. Assign this to a variable called formulas. There should be 969 elements in this vector.

formulas <- generate_formulas(p = 3, x_vars = x_vars, y_var = "Salary")
length(formulas)
## [1] 969

  7. Use a for loop to find the best set of 3 predictors in the Hitters dataset based on MSE. Use the baseball_train and baseball_valid datasets.

# Initialise a vector we will fill with MSE values
mses <- rep(0, length(formulas))

# loop over all the formulas
for (i in seq_along(formulas)) {
  mses[i] <- lm_mse(as.formula(formulas[i]), baseball_train, baseball_valid)
}

# select the formula with the lowest MSE
best_3_preds <- formulas[which.min(mses)]

  8. Do the same for 1, 2 and 4 predictors. Now select the best model with 1, 2, 3, or 4 predictors in terms of its out-of-sample MSE.

# Generate formulas
formulas_1 <- generate_formulas(p = 1, x_vars = x_vars, y_var = "Salary")
formulas_2 <- generate_formulas(p = 2, x_vars = x_vars, y_var = "Salary")
formulas_4 <- generate_formulas(p = 4, x_vars = x_vars, y_var = "Salary")

# Initialise a vector we will fill with MSE values
mses_1 <- rep(0, length(formulas_1))
mses_2 <- rep(0, length(formulas_2))
mses_4 <- rep(0, length(formulas_4))

# loop over all the formulas
for (i in seq_along(formulas_1)) {
  mses_1[i] <- lm_mse(as.formula(formulas_1[i]), baseball_train, baseball_valid)
}

for (i in seq_along(formulas_2)) {
  mses_2[i] <- lm_mse(as.formula(formulas_2[i]), baseball_train, baseball_valid)
}

for (i in seq_along(formulas_4)) {
  mses_4[i] <- lm_mse(as.formula(formulas_4[i]), baseball_train, baseball_valid)
}

# Compare the minimum MSE for each number of predictors
min(mses_1)
## [1] 111606.5
min(mses_2)
## [1] 102359.1
min(mses)   # the 3-predictor MSEs from the previous exercise
## [1] 97916.01
min(mses_4)
## [1] 97461.05

# min(mses_4) is the lowest of them all,
# so let's see which model that is
formulas_4[which.min(mses_4)]
## [1] "Salary ~ Years + CAtBat + CHits + PutOuts"

  9. Calculate the test MSE for this model. Then, create a plot comparing predicted values (mapped to x position) versus observed values (mapped to y position) of baseball_test.

# Estimate the selected model on the training data and calculate the test MSE
lm_best <- lm(Salary ~ Years + CAtBat + CHits + PutOuts, baseball_train)
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2) # MSE helper, reused below
mse(baseball_test$Salary, predict(lm_best, newdata = baseball_test))
## [1] 103973.4
# create a plot
tibble(
  y_true = baseball_test$Salary,
  y_pred = predict(lm_best, newdata = baseball_test)
) %>% 
  ggplot(aes(x = y_pred, y = y_true)) +
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  geom_point() +
  theme_minimal()

Through enumerating all possibilities, we have selected the best subset of at most 4 non-interacting predictors for the prediction of baseball salaries. This method works well for a small number of predictors, but the computational cost grows quickly: with p candidate predictors there are 2^p possible subsets, so exhaustive enumeration soon becomes infeasible.
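
To get a feel for the numbers, with the 19 candidate predictors in Hitters:

# number of distinct models with exactly 1 to 4 of the 19 predictors
choose(19, 1:4)
## [1]   19  171  969 3876

# the total number of possible subsets
2^19
## [1] 524288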

Regularisation with glmnet

glmnet is a package that implements efficient (quick!) algorithms for LASSO and ridge regression, among other things.
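
For a gaussian outcome, glmnet minimises the penalised least-squares criterion RSS/(2n) + lambda * [(1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1]: alpha = 1 gives the pure LASSO (L1) penalty and alpha = 0 gives ridge (L2) regression. We will pass the alpha argument explicitly below.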


  10. Read through the help file of glmnet. We are going to perform a linear regression with normal (gaussian) error terms. What format should our data be in?

# We need to input a predictor matrix x and a response (outcome) variable y, 
# and set family = "gaussian" for a linear model with normal errors.

Again, we will try to predict baseball salary, this time using all the available variables and using the LASSO penalty to perform subset selection. For this, we first need to generate an input matrix.


  11. First generate the input matrix using (a variation on) the following code. Remember that the “.” in a formula means “all available variables”. Make sure to check that this x_train looks like what you would expect.

x_train <- model.matrix(Salary ~ ., data = baseball_train %>% select(-split))
head(x_train)
##                   (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat
## -Alan Ashby                 1   315   81     7   24  38    39    14   3449
## -Andre Dawson               1   496  141    20   65  78    37    11   5628
## -Andres Galarraga           1   321   87    10   39  42    30     2    396
## -Alfredo Griffin            1   594  169     4   74  51    35    11   4408
## -Al Newman                  1   185   37     1   23   8    21     2    214
## -Argenis Salazar            1   298   73     0   24  24     7     3    509
##                   CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts
## -Alan Ashby         835     69   321  414    375       1         1     632
## -Andre Dawson      1575    225   828  838    354       1         0     200
## -Andres Galarraga   101     12    48   46     33       1         0     805
## -Alfredo Griffin   1133     19   501  336    194       0         1     282
## -Al Newman           42      1    30    9     24       1         0      76
## -Argenis Salazar    108      0    41   37     12       0         1     121
##                   Assists Errors NewLeagueN
## -Alan Ashby            43     10          1
## -Andre Dawson          11      3          1
## -Andres Galarraga      40      4          1
## -Alfredo Griffin      421     25          0
## -Al Newman            127      7          0
## -Argenis Salazar      283      9          0

The model.matrix() function takes a dataset and a formula and outputs the predictor matrix where the categorical variables have been correctly transformed into dummy variables, and it adds an intercept. It is used internally by the lm() function as well!
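
To see what model.matrix() does with a categorical variable, here is a tiny illustration with made-up data:

# a two-level factor is expanded into a single 0/1 dummy column
toy <- data.frame(y = c(1, 2, 3), division = factor(c("E", "W", "E")))
model.matrix(y ~ division, data = toy)
# yields the columns (Intercept) and divisionW, where divisionW is 1 for "W"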


  12. Using glmnet(), perform a LASSO regression with the generated x_train as the predictor matrix and Salary as the response variable. Set the lambda parameter of the penalty to 15. NB: Remove the intercept column from the x matrix; glmnet() adds an intercept internally.

result <- glmnet(x      = x_train[, -1],          # X matrix without intercept
                 y      = baseball_train$Salary,  # Salary as response
                 family = "gaussian",             # Normally distributed errors
                 alpha  = 1,                      # LASSO penalty
                 lambda = 15)                     # Penalty value

  13. The coefficients for the variables are in the beta element of the list generated by the glmnet() function. Which variables have been selected? You may use the coef() function.

rownames(coef(result))[which(coef(result) != 0)]
##  [1] "(Intercept)" "Hits"        "Runs"        "Walks"       "CHmRun"     
##  [6] "CRuns"       "CRBI"        "DivisionW"   "PutOuts"     "Assists"    
## [11] "NewLeagueN"

  14. Create a predicted versus observed plot for the model you generated with the baseball_valid data. Use the predict() function for this! What is the MSE on the validation set?

x_valid <- model.matrix(Salary ~ ., data = baseball_valid %>% select(-split))[, -1]
y_pred <- as.numeric(predict(result, newx = x_valid))

tibble(Predicted = y_pred, Observed = baseball_valid$Salary) %>% 
  ggplot(aes(x = Predicted, y = Observed)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  theme_minimal() +
  labs(title = "Predicted versus observed salary")

mse(baseball_valid$Salary, y_pred)
## [1] 131955.2

Tuning lambda

Like many methods of analysis, regularised regression has a tuning parameter. In the previous section, we’ve set this parameter to 15. The lambda parameter changes the strength of the shrinkage in glmnet(). Changing the tuning parameter will change the predictions, and thus the MSE. In this section, we will select the tuning parameter based on out-of-sample MSE.


  15. Fit a LASSO regression model on the same data as before, but now do not enter a specific lambda value. What is different about the object that is generated? Hint: use the coef() and plot() methods on the resulting object.

result_nolambda <- glmnet(x = x_train[, -1], y = baseball_train$Salary, 
                          family = "gaussian", alpha  = 1)

# This object contains sets of coefficients for different values of lambda,
# i.e., different models ranging from an intercept-only model (very high 
# lambda) to almost no shrinkage (very low lambda).

plot(result_nolambda)
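
The default plot shows the coefficient paths against the L1 norm of the coefficient vector; plotting against the log of the penalty value can be easier to interpret:

plot(result_nolambda, xvar = "lambda")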

For deciding which value of lambda to choose, we could proceed similarly to what we have done in the best subset selection section before: compute the out-of-sample MSE for a range of lambda values and pick the best one. However, the glmnet package includes a more convenient method for this task: cross-validation.
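
For illustration, here is a minimal sketch of that manual approach, reusing x_train, x_valid, and the mse() helper from earlier (the lambda grid is an arbitrary choice):

# fit the whole LASSO path over a grid of candidate penalty values
lambdas   <- 10^seq(3, -2, length.out = 50)
fit_path  <- glmnet(x_train[, -1], baseball_train$Salary,
                    family = "gaussian", alpha = 1, lambda = lambdas)

# predict() returns one column of predictions per lambda value
valid_mse <- apply(predict(fit_path, newx = x_valid), 2,
                   function(y_pred) mse(baseball_valid$Salary, y_pred))

# the lambda with the lowest validation MSE
fit_path$lambda[which.min(valid_mse)]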


  16. Use the cv.glmnet function to determine the lambda value for which the out-of-sample MSE is lowest using 15-fold cross-validation. As your dataset, you may use the training and validation sets bound together with bind_rows(). What is the best lambda value?

# bind the training and validation data together and drop the split column
x_cv <- model.matrix(Salary ~ ., bind_rows(baseball_train, baseball_valid) %>% select(-split))[, -1]
result_cv <- cv.glmnet(x = x_cv, 
                       y = c(baseball_train$Salary, baseball_valid$Salary), 
                       nfolds = 15)
best_lambda <- result_cv$lambda.min
best_lambda
## [1] 1.774744
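
Besides lambda.min, the fitted object also stores lambda.1se: the largest lambda whose cross-validated MSE is within one standard error of the minimum, often preferred when a sparser model is desired.

result_cv$lambda.1se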

  17. Try out the plot() method on this object. What do you see? What does this tell you about the bias-variance tradeoff?

plot(result_cv)

# The MSE is high for very small values of lambda (almost no shrinkage,
# high variance) and for very large values of lambda (intercept-only
# model, high bias).

# Introducing a bit of bias through shrinkage lowers the variance
# relatively strongly (fewer variables in the model), so the expected
# MSE (bias^2 + variance + irreducible error) is reduced in between.

  18. Use the predict() method directly on the object you just created to predict new salaries for the baseball players in the baseball_test dataset using the best lambda value you just created (hint: you need to use the s argument, look at ?predict.cv.glmnet for help). Create another predicted-observed scatter plot.

x_test <- model.matrix(Salary ~ ., data = baseball_test %>% select(-split))[, -1]
y_pred <- as.numeric(predict(result_cv, newx = x_test, s = best_lambda))

tibble(Predicted = y_pred, Observed = baseball_test$Salary) %>% 
  ggplot(aes(x = Predicted, y = Observed)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  theme_minimal() +
  labs(title = "Predicted versus observed salary: LASSO with cv tuning")

mse(baseball_test$Salary, y_pred)
## [1] 107147.6

Exercise: method comparison


  19. Create a bar plot comparing the test set (baseball_test) MSE of (a) linear regression with all variables, (b) the best subset selection regression model we created, (c) LASSO with lambda set to 50, and (d) LASSO with cross-validated lambda. As training dataset, use the rows of both baseball_train and baseball_valid, bound together.

# create the new training dataset: training + validation, without the split column
train_data <- bind_rows(baseball_train, baseball_valid) %>% select(-split)

# generate predictions from the models
y_pred_ols <- predict(lm(Salary ~ ., data = train_data), newdata = baseball_test)
y_pred_sub <- predict(lm(Salary ~ Years + CAtBat + CHits + PutOuts, data = train_data),
                      newdata = baseball_test)
# these two use x_cv and x_test from the previous exercises;
# glmnet() defaults to family = "gaussian" and alpha = 1 (LASSO)
y_pred_las <- as.numeric(predict(glmnet(x_cv, train_data$Salary, lambda = 50), newx = x_test))
y_pred_cv  <- as.numeric(predict(result_cv, newx = x_test, s = best_lambda))

# Calculate MSEs
mses <- c(
  mse(baseball_test$Salary, y_pred_ols),
  mse(baseball_test$Salary, y_pred_sub),
  mse(baseball_test$Salary, y_pred_las),
  mse(baseball_test$Salary, y_pred_cv)
)

# Create a plot
tibble(Method = as_factor(c("lm", "subset", "lasso", "cv_las")), MSE = mses) %>% 
  ggplot(aes(x = Method, y = MSE, fill = Method)) +
  geom_col(col = "black") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Comparison of test set MSE for different prediction methods") +
  scale_fill_viridis_d() # different colour scale