Introduction


In this practical, we will continue with regularized regression. We need the following packages:

library(magrittr)
library(dplyr)
library(GGally)
library(glmnet)
library(caret)
library(plotmo)
library(coefplot)

We continue with the previous practical. For convenience, I have prepared a workspace image with all necessary objects and functions from the previous practical, so that we can continue where we left off.

To load that workspace image, run the code block below.

con <- url("https://www.gerkovink.com/erasmus/Day%203/Part%20H/load_all_objects.RData")
load(con)

The comparison in the previous practical was between OLS and ridge regression for the mtc data. The mtc data set is our recoded version of the mtcars data, where the binary columns are set to factors. We created a training set (train) and a test set (test) based on this mtc data. If you recall, OLS performed worse than ridge regression:

postResample(pred, test$hp) # lm()
##       RMSE   Rsquared        MAE 
## 36.5474077  0.7369976 32.7630202
postResample(ridge_min, test$hp) # ridge with \lambda_min
##       RMSE   Rsquared        MAE 
## 26.6839013  0.8118065 25.2053081
postResample(ridge_1se, test$hp) # ridge with \lambda_1se
##      RMSE  Rsquared       MAE 
## 29.517895  0.853422 23.769065

Before starting with the exercises, it is a good idea to set your RNG seed, so that (1) your answers are reproducible and (2) you can compare your answers with the answers provided.

set.seed(123)

  1. Fit a lasso on the training data. Name the object fit.lasso. Do not do cross-validation yet.

fit.lasso <- glmnet(x = x, y = y, 
                    family = "gaussian", 
                    alpha = 1)

There is no need to set the argument standardize = TRUE explicitly to ensure that the predictors are measured in the same units, because glmnet() standardizes the predictors by default.
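
For completeness, the code block below spells out that default explicitly; the resulting fit should be identical to fit.lasso (fit.lasso_explicit is just an illustrative name).

fit.lasso_explicit <- glmnet(x = x, y = y, 
                             family = "gaussian", 
                             alpha = 1,
                             standardize = TRUE) # the default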


  2. Inspect the plots on the fitted object. Use different x-axes.

The first, generic plot on the fit.lasso object yields the plot of the coefficients against the \(L_1\) norm.

plot(fit.lasso)

We can see that the harder the penalization, the more the coefficients are shrunk towards zero and the fewer non-zero coefficients remain. When the \(L_1\) (Manhattan) norm reaches zero, all coefficients are set to zero. The lasso clearly bets on sparsity.
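
As a numeric counterpart to this plot, the fitted object stores the number of non-zero coefficients for every \(\lambda\) in the path, so we can inspect the sparsity of the solutions directly. A small sketch:

# fit.lasso$df holds the number of non-zero coefficients per lambda
data.frame(lambda = fit.lasso$lambda, nonzero = fit.lasso$df) %>% head()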

Let’s look at the same plot, but now with \(\text{Log}(\lambda)\) on the x-axis.

plot(fit.lasso, xvar = "lambda")

It is clear that with increasing \(\lambda\) comes increasing shrinkage and selection.

The final plot demonstrates the same trend, but now with the fraction of deviance explained on the x-axis.

plot(fit.lasso, xvar = "dev")


The function plot_glmnet() from the plotmo package produces a nicer, but very similar, plot. I often prefer these plots over the native plot.glmnet().


  3. Recreate the plots from (2) with plot_glmnet().

The arguments are slightly different for plot_glmnet():

plot_glmnet(fit.lasso, xvar = "norm")

plot_glmnet(fit.lasso, xvar = "lambda")

plot_glmnet(fit.lasso, xvar = "dev")

And there is one extra plot in plot_glmnet() that reverses the \(\text{Log}(\lambda)\) x-axis and gives the corresponding \(\lambda\) values at the top margin. This is the plot that is returned by default by plot_glmnet() when the argument xvar is not specified.

plot_glmnet(fit.lasso)


  4. Train a lasso regression model on the train data set. Name the resulting object lasso. Use 4-fold cross-validation, just like with ridge regression.

To fit the cross-validated lasso model, we need to specify alpha = 1 in cv.glmnet().

lasso <- cv.glmnet(x = x, y = y, 
                   family = "gaussian",
                   alpha = 1, 
                   standardize = TRUE, 
                   nfolds = 4)

  5. Find out which value of \(\lambda\) yields the lowest cross-validated error. Do a visual and a numeric inspection.

Let’s first look at the plot.

plot(lasso)

It is clear that different values of \(\text{Log}(\lambda)\) yield different mean-squared errors (MSE). The optimum, that is, the value of \(\text{Log}(\lambda)\) for which the MSE is minimal, lies at \(\text{Log}(\lambda_{min}) =\) 1.6267717. The more parsimonious optimum that lies within 1 standard error of the minimum can be found at \(\text{Log}(\lambda_{1se}) =\) 2.9292441.

The lasso object also returns the optimal values for \(\lambda\)

lasso
## 
## Call:  cv.glmnet(x = x, y = y, nfolds = 4, family = "gaussian", alpha = 1,      standardize = TRUE) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure     SE Nonzero
## min  5.087    27    1878  880.3       6
## 1se 18.713    13    2672 1449.5       4

Please note that these values are the \(\lambda\) values and not the \(\text{Log}(\lambda)\)’s from the plot.
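
To verify the correspondence with the plot, we can take the log of the stored optima ourselves:

log(lasso$lambda.min) # 1.6267717 in the plot
log(lasso$lambda.1se) # 2.9292441 in the plot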


  6. Add the performance of the lasso regression to the comparison made earlier. Does the lasso perform better than ridge regression?

Let’s take the more parsimonious \(\lambda_{1se}\) as the \(\lambda\)-parameter for the Lasso regression.

lasso_1se <- predict(lasso, s = "lambda.1se", newx = newx)

If we then compare the predicted values to the observations in the test object, we obtain the following performance measures.

postResample(pred, test$hp) # lm
##       RMSE   Rsquared        MAE 
## 36.5474077  0.7369976 32.7630202
postResample(ridge_1se, test$hp) # ridge
##      RMSE  Rsquared       MAE 
## 29.517895  0.853422 23.769065
postResample(lasso_1se, test$hp) # lasso
##       RMSE   Rsquared        MAE 
## 26.4962126  0.8682274 22.7184508

The lasso performs slightly better than ridge regression in this case. We must also note that the lasso needs fewer parameters to obtain nearly the same predictive performance:

lasso %>% coef(s = "lambda.1se")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept) 84.27990432
## mpg          .         
## cyl          8.98423993
## disp         0.08781704
## drat         .         
## wt           .         
## qsec        -2.67897374
## vsstraight   .         
## ammanual     .         
## gear         .         
## carb        12.21460415

The lasso only uses 5 parameters, including the intercept, whereas the ridge regression uses all 11 parameters. The lasso solution is therefore far more parsimonious.

ridge %>% coef(s = "lambda.1se")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) 186.48260848
## mpg          -1.06889609
## cyl           3.92218896
## disp          0.05429729
## drat         -5.16347160
## wt            4.55310965
## qsec         -4.14488672
## vsstraight  -11.27808604
## ammanual     -0.04598432
## gear          3.59664712
## carb          5.18327817
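
As a quick numeric check on the above, we can count the non-zero entries in both coefficient vectors directly. A small sketch (the counts include the intercept):

sum(as.vector(coef(lasso, s = "lambda.1se")) != 0) # 5 for the lasso
sum(as.vector(coef(ridge, s = "lambda.1se")) != 0) # 11 for the ridge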

  7. Rerun the cross-validated ridge regression, but now only with the parameters selected by the lasso regression. Does performance increase?

Let’s rerun the ridge regression, but first we need to adjust the model matrix.

x.subset <- model.matrix(hp ~ -1 + cyl + disp + qsec + carb, data = train) 

The -1 in the formula excludes the intercept column from the design matrix. We do not need it there because glmnet() adds an intercept automatically.
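
A quick check confirms that the new design matrix contains only the four selected predictors and no intercept column:

colnames(x.subset)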

Let’s fit the ridge model again

ridge.subset <- cv.glmnet(x = x.subset, y = y, 
                          family = "gaussian",
                          alpha = 0, 
                          nfolds = 4)
plot(ridge.subset)

Now, let’s obtain the predicted values for the ridge.subset model.

newx.subset <- model.matrix(hp ~ -1 + cyl + disp + qsec + carb, data = test) 
ridge.subset_1se <- predict(ridge.subset, s = "lambda.1se", newx = newx.subset)

If we calculate the performance measures for the ridge.subset model

postResample(ridge.subset_1se, test$hp) 
##      RMSE  Rsquared       MAE 
## 25.558627  0.877935 22.648077

we see that performance has indeed increased by making use of both \(L_1\) and \(L_2\) regularization.


The elastic net


Instead of fitting the lasso and ridge regressions separately, we can optimize the \(L_1\) and \(L_2\) norms simultaneously. We then have to specify the ratio between these norms. That is done with alpha, where \(0\leq\alpha\leq1\): alpha = 0 corresponds to ridge regression and alpha = 1 to the lasso.
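
For reference, for a given \(\alpha\), glmnet adds the penalty \(\lambda \left[ \frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1 \right]\) to the loss, so \(\alpha\) directly governs the mix between the ridge (\(L_2\)) and lasso (\(L_1\)) parts of the penalty.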


  8. Fit an elastic net on the train data (not the subset) and set alpha = 0.5. Does performance on the test data increase?

Let’s run the elastic net regression

e.net <- cv.glmnet(x = x, y = y, 
                   family = "gaussian",
                   alpha = 0.5, 
                   nfolds = 4)
plot(e.net)

The elastic net yields a slightly higher mean squared error than the previous ridge.subset solution.

e.net
## 
## Call:  cv.glmnet(x = x, y = y, nfolds = 4, family = "gaussian", alpha = 0.5) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure    SE Nonzero
## min  2.092    44    1489 303.3       8
## 1se 23.505    18    1740 739.7       5
ridge.subset
## 
## Call:  cv.glmnet(x = x.subset, y = y, nfolds = 4, family = "gaussian",      alpha = 0) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure    SE Nonzero
## min  19.15    87    1247 359.8       4
## 1se  93.14    70    1581 668.6       4

Let’s generate predicted values from the elastic net model

e.net_1se <- predict(e.net, s = "lambda.1se", newx = newx)

If we calculate the performance measures for the e.net model

postResample(e.net_1se, test$hp) 
##       RMSE   Rsquared        MAE 
## 23.0488087  0.8788349 20.5470349

we see that performance has slightly increased over our hacky combination of \(L_1\) and \(L_2\) regularization.


Different levels of alpha will yield different performance. So, just like \(\lambda\), \(\alpha\) is a hyperparameter that can be optimized.


  9. Train an elastic net model using caret. Can you find a combination of alpha (between 0.1 and 1 in steps of .1) and lambda (between 0.1 and 30 in steps of .1) that yields better predictive performance than the one currently obtained?

To do so, we need to define a tuning grid for the parameters alpha and lambda.

grid <- expand.grid(alpha = seq(0.1, 1, by = .1),
                    lambda = seq(0.1, 30, by = .1))
grid %>% head(n=20)
##    alpha lambda
## 1    0.1    0.1
## 2    0.2    0.1
## 3    0.3    0.1
## 4    0.4    0.1
## 5    0.5    0.1
## 6    0.6    0.1
## 7    0.7    0.1
## 8    0.8    0.1
## 9    0.9    0.1
## 10   1.0    0.1
## 11   0.1    0.2
## 12   0.2    0.2
## 13   0.3    0.2
## 14   0.4    0.2
## 15   0.5    0.2
## 16   0.6    0.2
## 17   0.7    0.2
## 18   0.8    0.2
## 19   0.9    0.2
## 20   1.0    0.2

The above tuning grid takes values for alpha (between 0.1 and 1) and lambda (between 0.1 and 30, in steps of .1). The total number of combinations of alpha and lambda is \(10 \times 300 = 3000\).
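
A quick sanity check on the size of the grid:

nrow(grid) # 10 x 300 = 3000 rows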

Now let’s fit the model (it may take a while!)

model <- train(x, y, 
               method = "glmnet",
               tuneGrid = grid,
               trControl = trainControl(method = "cv", number = 4))

The model search has to evaluate every combination specified in the tuning grid grid.

Let’s inspect the model

plot(model)

The solution for the lambdas is different from glmnet’s solution, because the cross-validation folds are sampled differently. The training data set is quite small, so the fold assignment can influence the hyperparameter optimization considerably.
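
If we want the folds themselves to be reproducible, we can fix them up front and pass them to trainControl(). A small sketch, assuming the same 4-fold setup (model_fixed is just an illustrative name):

set.seed(123)
folds <- createFolds(y, k = 4, returnTrain = TRUE)
model_fixed <- train(x, y, 
                     method = "glmnet",
                     tuneGrid = grid,
                     trControl = trainControl(method = "cv", index = folds))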

The best parameters that are obtained for alpha and lambda are

model$bestTune
##     alpha lambda
## 105   0.1   10.5

Generating predicted values for the trained model:

e.net.caret <- predict(model, newdata = newx)

The performance of the trained model is

postResample(e.net.caret, test$hp) 
##       RMSE   Rsquared        MAE 
## 29.6317102  0.7849322 28.7840665

which is far worse than the performance of the glmnet elastic net with alpha = 0.5.


  10. Use the alpha and lambda obtained in the previous exercise to generate predictions with glmnet. Is the performance the same?

glmnet(x, y, 
       alpha = model$bestTune$alpha, 
       lambda = model$bestTune$lambda) %>% 
  predict(newx = newx) %>% 
  postResample(test$hp) 
##       RMSE   Rsquared        MAE 
## 29.6294425  0.7849857 28.7824998

The results are virtually the same.


  11. Apply lasso regression to the prediction of am (automatic or manual cars), based on the following design matrix. Use the same train/test split.
x <- model.matrix(am ~ -1 + mpg + disp + hp + drat + wt + qsec, data = train)

Let’s first generate the data parts to feed to glmnet.

y <- train$am
x <- model.matrix(am ~ -1 + mpg + disp + hp + drat + wt + qsec, data = train)

Run the model

fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 4)
## Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground

## Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground

A lot of warnings are printed. There are few observations and the cross-validation may take us into dangerous territory: what if a fold contains only one class for the response?
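
We can see why this is a concern by inspecting the class balance in the training data; with 4 folds, a fold can easily end up with very few observations of one of the two classes:

table(train$am)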


  12. Inspect the lambda trace with plot(). What are the optimum values for lambda?

Let’s inspect the model, despite the warnings earlier

plot(fit)

The deviance stays relatively constant at the beginning, but increases steeply when \(\lambda\) becomes larger.

The optimal values for \(\lambda\) are

fit
## 
## Call:  cv.glmnet(x = x, y = y, nfolds = 4, family = "binomial", alpha = 1) 
## 
## Measure: Binomial Deviance 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.00684    44  0.2554 0.09672       4
## 1se 0.03324    27  0.3390 0.06950       3

  13. Can you reasonably predict the values of am with this model?
newx <- model.matrix(am ~ -1 + mpg + disp + hp + drat + wt + qsec, data = test)
pred <- fit %>% 
  predict(newx = newx, type = "class")
postResample(pred, test$am)
## Accuracy    Kappa 
##    0.875    0.750

Yes; only one mistake. Not bad!


  14. Inspect the confusion matrix. Is the performance still good?
newx <- model.matrix(am ~ -1 + mpg + disp + hp + drat + wt + qsec, data = test)
pred <- fit %>% 
  predict(newx = newx, type = "class")
confusionMatrix(factor(pred), test$am)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  automatic manual
##   automatic         4      0
##   manual            1      3
##                                           
##                Accuracy : 0.875           
##                  95% CI : (0.4735, 0.9968)
##     No Information Rate : 0.625           
##     P-Value [Acc > NIR] : 0.135           
##                                           
##                   Kappa : 0.75            
##                                           
##  Mcnemar's Test P-Value : 1.000           
##                                           
##             Sensitivity : 0.800           
##             Specificity : 1.000           
##          Pos Pred Value : 1.000           
##          Neg Pred Value : 0.750           
##              Prevalence : 0.625           
##          Detection Rate : 0.500           
##    Detection Prevalence : 0.500           
##       Balanced Accuracy : 0.900           
##                                           
##        'Positive' Class : automatic       
## 

Performance is clearly better than the no-information baseline of 0.625, although with only 8 test observations the difference is not statistically convincing (P-Value [Acc > NIR] = 0.135).


  15. How would a logistic model have performed?
pred <- glm(am ~ mpg + disp + hp + drat + wt + qsec, data = train,
            family = binomial(link = "logit")) %>% 
  predict(newdata = test, 
          type = "response")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
confusionMatrix(factor(ifelse(pred > .5, "manual", "automatic")),
                test$am)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  automatic manual
##   automatic         4      0
##   manual            1      3
##                                           
##                Accuracy : 0.875           
##                  95% CI : (0.4735, 0.9968)
##     No Information Rate : 0.625           
##     P-Value [Acc > NIR] : 0.135           
##                                           
##                   Kappa : 0.75            
##                                           
##  Mcnemar's Test P-Value : 1.000           
##                                           
##             Sensitivity : 0.800           
##             Specificity : 1.000           
##          Pos Pred Value : 1.000           
##          Neg Pred Value : 0.750           
##              Prevalence : 0.625           
##          Detection Rate : 0.500           
##    Detection Prevalence : 0.500           
##       Balanced Accuracy : 0.900           
##                                           
##        'Positive' Class : automatic       
## 

A logistic model in this case performed the same, but with a lot of algorithmic problems: the algorithm did not converge and fitted probabilities of exactly 0 or 1 occurred. It’s a good thing that the developers of glm() know what they are doing!
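
The warnings hint at (quasi-)complete separation: without a penalty, some coefficient estimates are pushed towards extreme values. A small sketch to see this for yourself (it will trigger the same warnings) is to inspect the unpenalized coefficients:

glm(am ~ mpg + disp + hp + drat + wt + qsec, data = train,
    family = binomial(link = "logit")) %>% 
  coef()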


Final challenge for today


Gene expression data

The data file we will be working with is the gene expression data. Using microarrays, the expression of many genes can be measured at the same time. The data file contains expressions for 54675 genes with IDs such as 1007_s_at, 202896_s_at, AFFX-r2-P1-cre-3_at. (NB: these IDs are specific to this type of chip and need to be converted to actual gene names before they can be looked up in a database such as “GeneCards”.) The values in the data file are related to the amount of RNA belonging to each gene found in the tissue sample.

The goal of the study for which this data was collected is one of exploratory cancer classification: are there differences in gene expression between tissue samples of human prostates with and without prostate cancer?

CHALLENGE: Use the challenge.zip archive on the course website to predict Disease from a huge set of gene expressions as accurately as possible using either ridge regression, lasso regression or a combination thereof (elastic net or the hacky way).

Rules:

The best performing model will be awarded a prize next week (or by e-mail).


End of Practical