In this practical, we will continue with regularized regression. We need the following packages:
library(magrittr)
library(dplyr)
library(GGally)
library(glmnet)
library(caret)
library(plotmo)
library(coefplot)
We continue with the previous practical. For convenience, I have prepared a workspace image with all necessary objects and functions from the previous practical, so that we can continue where we left off. To load that workspace image, run the code block below.
con <- url("https://www.gerkovink.com/erasmus/Day%203/Part%20H/load_all_objects.RData")
load(con)
The comparison in the previous practical was between OLS and ridge regression for the mtc data. The mtc data set is our recoded version of the mtcars data, where the binary columns are set to factors. We created a training set (train) and a test set (test) based on this mtc data. If you recall, OLS performed worse than ridge regression:
postResample(pred, test$hp) # lm()
## RMSE Rsquared MAE
## 36.5474077 0.7369976 32.7630202
postResample(ridge_min, test$hp) # ridge with lambda.min
## RMSE Rsquared MAE
## 26.6839013 0.8118065 25.2053081
postResample(ridge_1se, test$hp) # ridge with lambda.1se
## RMSE Rsquared MAE
## 29.517895 0.853422 23.769065
Before starting with the exercises, it is a good idea to set your RNG seed, so that (1) your answers are reproducible and (2) you can compare your answers with the answers provided.
set.seed(123)
Exercise: Fit a lasso regression model on the training data and name the resulting object fit.lasso. Do not do crossvalidation, yet.
fit.lasso <- glmnet(x = x, y = y,
family = "gaussian",
alpha = 1)
There is no need to set the argument standardize = TRUE to ensure that the predictors are measured in the same units, because it is set to TRUE by default.
The first, generic plot on the fit.lasso object yields the plot of the coefficients against the \(L_1\) norm.
plot(fit.lasso)
We can see that the harder the penalization, the more the coefficients are shrunk towards zero and the fewer non-zero coefficients remain. At the point where the Manhattan (\(L_1\)) norm is shrunk all the way to zero, all coefficients are set to zero. The lasso clearly bets on sparsity.
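To see this sparsity numerically, we can inspect the number of non-zero coefficients along the path; the glmnet fit stores these in its df component:
# number of non-zero coefficients (df) for the first few values of lambda
data.frame(lambda = fit.lasso$lambda, nonzero = fit.lasso$df) %>% head()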
Let’s look at the same plot, but now with \(\text{Log}(\lambda)\) on the x-axis.
plot(fit.lasso, xvar = "lambda")
It is clear that with increasing \(\lambda\) comes increasing shrinkage and selection.
The final plot demonstrates the same trend, but now with the fraction of deviance explained on the x-axis.
plot(fit.lasso, xvar = "dev")
The function plot_glmnet() from package plotmo produces a nicer - but very similar - plot. I often prefer these plots over the native plot.glmnet(). The arguments are slightly different for plot_glmnet().
plot_glmnet(fit.lasso, xvar = "norm")
plot_glmnet(fit.lasso, xvar = "lambda")
plot_glmnet(fit.lasso, xvar = "dev")
And there is one extra plot in plot_glmnet() that reverses the \(\text{Log}(\lambda)\) x-axis and gives the corresponding \(\lambda\) values at the top margin. This is the plot that is returned by default by plot_glmnet() when the argument xvar is not specified.
plot_glmnet(fit.lasso)
Exercise: Fit a lasso regression model with crossvalidation on the train data set. Name the resulting object lasso. Use 4-fold crossvalidation, just like with ridge regression.
To fit the lasso model with crossvalidation, we need to specify alpha = 1 in the function cv.glmnet().
lasso <- cv.glmnet(x = x, y = y,
family = "gaussian",
alpha = 1,
standardize = TRUE,
nfolds = 4)
Let’s first look at the plot.
plot(lasso)
It is clear that different values for \(\text{Log}(\lambda)\) yield different mean squared errors (MSE). The optimum, that is, the value of \(\text{Log}(\lambda)\) for which the MSE is minimal, lies at \(\text{Log}(\lambda_{min}) =\) 1.6267717. The more parsimonious optimum that lies within 1 standard error can be found at \(\text{Log}(\lambda_{1se}) =\) 2.9292441.
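These logged values can also be extracted directly from the lasso object:
# Log(lambda) at the minimum MSE and at the 1-standard-error rule
log(lasso$lambda.min)
log(lasso$lambda.1se)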
The lasso object also returns the optimal values for \(\lambda\):
lasso
##
## Call: cv.glmnet(x = x, y = y, nfolds = 4, family = "gaussian", alpha = 1, standardize = TRUE)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 5.087 27 1878 880.3 6
## 1se 18.713 13 2672 1449.5 4
Please note that these values are the \(\lambda\) values and not the \(\text{Log}(\lambda)\) values from the plot.
Let’s take the more parsimonious \(\lambda_{1se}\) as the \(\lambda\)-parameter for the Lasso regression.
lasso_1se <- predict(lasso, s = "lambda.1se", newx = newx)
If we then compare the predicted values to the observations in the test object, we obtain the following performance measures.
postResample(pred, test$hp) # lm
## RMSE Rsquared MAE
## 36.5474077 0.7369976 32.7630202
postResample(ridge_1se, test$hp) # ridge
## RMSE Rsquared MAE
## 29.517895 0.853422 23.769065
postResample(lasso_1se, test$hp) # lasso
## RMSE Rsquared MAE
## 26.4962126 0.8682274 22.7184508
The lasso performs slightly better than ridge regression in this case. We must also note that the lasso needs fewer parameters to obtain nearly the same predictive performance:
lasso %>% coef(s = "lambda.1se")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 84.27990432
## mpg .
## cyl 8.98423993
## disp 0.08781704
## drat .
## wt .
## qsec -2.67897374
## vsstraight .
## ammanual .
## gear .
## carb 12.21460415
The lasso only uses 5 parameters, including the intercept. The ridge regression uses all 11 parameters. The lasso regression solution is therefore far more parsimonious.
ridge %>% coef(s = "lambda.1se")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 186.48260848
## mpg -1.06889609
## cyl 3.92218896
## disp 0.05429729
## drat -5.16347160
## wt 4.55310965
## qsec -4.14488672
## vsstraight -11.27808604
## ammanual -0.04598432
## gear 3.59664712
## carb 5.18327817
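As a quick numeric check, we can count the non-zero coefficients (including the intercept) in both solutions:
sum(as.vector(coef(lasso, s = "lambda.1se")) != 0) # lasso: 5 non-zero coefficients
sum(as.vector(coef(ridge, s = "lambda.1se")) != 0) # ridge: all 11 coefficients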
Let’s rerun the ridge regression, but now using only the predictors that the lasso retained. First we need to adjust the model matrix.
x.subset <- model.matrix(hp ~ -1 + cyl + disp + qsec + carb, data = train)
The -1 in the formula excludes the intercept. We don’t need the intercept in the design matrix because glmnet() will add it automatically.
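As a quick check, we can compare the column names of the design matrix with and without the -1:
colnames(x.subset) # no intercept column
colnames(model.matrix(hp ~ cyl + disp + qsec + carb, data = train)) # includes "(Intercept)"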
Let’s fit the ridge model again
ridge.subset <- cv.glmnet(x = x.subset, y = y,
family = "gaussian",
alpha = 0,
nfolds = 4)
plot(ridge.subset)
Now, let’s obtain the predicted values for the ridge.subset model.
newx.subset <- model.matrix(hp ~ -1 + cyl + disp + qsec + carb, data = test)
ridge.subset_1se <- predict(ridge.subset, s = "lambda.1se", newx = newx.subset)
If we calculate the performance measures for the ridge.subset model
postResample(ridge.subset_1se, test$hp)
## RMSE Rsquared MAE
## 25.558627 0.877935 22.648077
we see that performance has indeed increased by making use of both \(L_1\) and \(L_2\) regularization.
Instead of fitting the lasso and ridge regressions separately, we can simultaneously optimize the \(L_1\) and \(L_2\) norms. We then have to specify the ratio between these norms. That is done with alpha, where \(0\leq\alpha\leq1\), alpha = 0 means ridge regression, and alpha = 1 indicates the lasso.
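For reference, for the Gaussian case glmnet minimizes the penalized least-squares criterion
\[
\frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^\top\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1\right],
\]
so that \(\alpha = 1\) yields the lasso penalty and \(\alpha = 0\) the ridge penalty.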
Exercise: Run an elastic net regression on the full train data (not the subset) and set alpha = 0.5. Does performance on the test data increase?
Let’s run the elastic net regression:
e.net <- cv.glmnet(x = x, y = y,
family = "gaussian",
alpha = 0.5,
nfolds = 4)
plot(e.net)
The elastic net yields a slightly higher mean squared error than the previous ridge.subset solution.
e.net
##
## Call: cv.glmnet(x = x, y = y, nfolds = 4, family = "gaussian", alpha = 0.5)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 2.092 44 1489 303.3 8
## 1se 23.505 18 1740 739.7 5
ridge.subset
##
## Call: cv.glmnet(x = x.subset, y = y, nfolds = 4, family = "gaussian", alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 19.15 87 1247 359.8 4
## 1se 93.14 70 1581 668.6 4
Let’s generate predicted values from the elastic net model
e.net_1se <- predict(e.net, s = "lambda.1se", newx = newx)
If we calculate the performance measures for the e.net model
postResample(e.net_1se, test$hp)
## RMSE Rsquared MAE
## 23.0488087 0.8788349 20.5470349
we see that performance has slightly increased over our hacky \(L_1\) and \(L_2\) regularization.
Different levels of alpha will yield different performance. So, just like \(\lambda\), \(\alpha\) is a hyperparameter that can be optimized.
Exercise: Tune the elastic net with caret. Can you find a combination of alpha (between 0.1 and 1 in steps of .1) and lambda (between 0.1 and 30 in steps of 1) that yields a better predictive performance than the one currently obtained?
To do so, we need to define a tuning grid for the parameters alpha and lambda.
grid <- expand.grid(alpha = seq(0.1, 1, by = .1),
lambda = seq(0.1, 30, by = .1))
grid %>% head(n=20)
## alpha lambda
## 1 0.1 0.1
## 2 0.2 0.1
## 3 0.3 0.1
## 4 0.4 0.1
## 5 0.5 0.1
## 6 0.6 0.1
## 7 0.7 0.1
## 8 0.8 0.1
## 9 0.9 0.1
## 10 1.0 0.1
## 11 0.1 0.2
## 12 0.2 0.2
## 13 0.3 0.2
## 14 0.4 0.2
## 15 0.5 0.2
## 16 0.6 0.2
## 17 0.7 0.2
## 18 0.8 0.2
## 19 0.9 0.2
## 20 1.0 0.2
The above tuning grid takes values for alpha (between 0.1 and 1) and lambda (between 0.1 and 30). Because lambda runs in steps of .1, the total number of combinations of alpha’s and lambda’s is \(10 \times 300 = 3000\).
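As a quick sanity check, the number of rows in grid gives the total number of hyperparameter combinations:
nrow(grid)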
Now let’s fit the model (it may take a while!)
model <- train(x, y,
method = "glmnet",
tuneGrid = grid,
trControl = trainControl(method = "cv", number = 4))
The model has to navigate over the combinations specified in the tuning grid grid.
Let’s inspect the model
plot(model)
The solution for the lambda’s is different from glmnet’s solution. This is because the cross-validated sampling is different. The training data is quite small, therefore the splitting can influence the hyperparameter optimization.
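One way to reduce this source of variation is to fix the fold assignment up front, so that the same splits are reused across runs. A minimal sketch with caret’s createFolds() (the object names here are illustrative):
set.seed(123)
folds <- createFolds(y, k = 4, returnTrain = TRUE) # fixed training indices per fold
ctrl  <- trainControl(method = "cv", index = folds) # reuse these folds in train()
# model_fixed <- train(x, y, method = "glmnet", tuneGrid = grid, trControl = ctrl)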
The best parameters that are obtained for alpha and lambda are
model$bestTune
## alpha lambda
## 105 0.1 10.5
Generating predicted values for the trained model:
e.net.caret <- predict(model, newdata = newx)
The performance of the trained model is
postResample(e.net.caret, test$hp)
## RMSE Rsquared MAE
## 29.6317102 0.7849322 28.7840665
which is far worse than the performance of the glmnet elastic net with alpha = 0.5.
Exercise: Use the values for alpha and lambda obtained in the previous exercise to generate predictions with glmnet. Is the performance the same?
glmnet(x, y,
alpha = model$bestTune$alpha,
lambda = model$bestTune$lambda) %>%
predict(newx = newx) %>%
postResample(test$hp)
## RMSE Rsquared MAE
## 29.6294425 0.7849857 28.7824998
The results are virtually the same.
Exercise: Fit a crossvalidated lasso logistic regression to predict am (automatic or manual cars) based on the following design matrix. Use the same train/test split.
x <- model.matrix(am ~ -1 + mpg + disp + hp + drat + wt + qsec, data = train)
Let’s first generate the data parts to feed to glmnet.
y <- train$am
x <- model.matrix(am ~ -1 + mpg + disp + hp + drat + wt + qsec, data = train)
Run the model
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 4)
## Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
A lot of warnings are printed. There are few observations and the crossvalidation may bring us into dangerous territory. What if a fold contains only one class for the response?
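We can see why by tabulating the response in the training data:
# class counts in the training data; with so few observations per class,
# a 4-fold split can easily produce folds in which one class is nearly absent
table(train$am)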
Exercise: Inspect the solution with plot(). What are the optimum values for lambda?
Let’s inspect the model, despite the earlier warnings:
plot(fit)
The deviance stays relatively constant at the beginning, but increases steeply when \(\lambda\) becomes larger.
The optimal values for \(\lambda\) are
fit
##
## Call: cv.glmnet(x = x, y = y, nfolds = 4, family = "binomial", alpha = 1)
##
## Measure: Binomial Deviance
##
## Lambda Index Measure SE Nonzero
## min 0.00684 44 0.2554 0.09672 4
## 1se 0.03324 27 0.3390 0.06950 3
Exercise: Can we reliably predict am with this model?
newx <- model.matrix(am ~ -1 + mpg + disp + hp + drat + wt + qsec, data = test)
pred <- fit %>%
predict(newx = newx, type = "class")
postResample(pred, test$am)
## Accuracy Kappa
## 0.875 0.750
Yes; only one mistake. Not bad!
newx <- model.matrix(am ~ -1 + mpg + disp + hp + drat + wt + qsec, data = test)
pred <- fit %>%
predict(newx = newx, type = "class")
confusionMatrix(factor(pred), test$am)
## Confusion Matrix and Statistics
##
## Reference
## Prediction automatic manual
## automatic 4 0
## manual 1 3
##
## Accuracy : 0.875
## 95% CI : (0.4735, 0.9968)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.135
##
## Kappa : 0.75
##
## Mcnemar's Test P-Value : 1.000
##
## Sensitivity : 0.800
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 0.750
## Prevalence : 0.625
## Detection Rate : 0.500
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.900
##
## 'Positive' Class : automatic
##
Performance is much better than the baseline performance.
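As a quick check, the baseline (no-information) accuracy is simply the proportion of the majority class in the test set:
# majority-class proportion; this matches the No Information Rate of 0.625 above
max(table(test$am)) / nrow(test)
For comparison, let’s also fit a standard, unregularized logistic regression with glm():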
pred <- glm(am ~ mpg + disp + hp + drat + wt + qsec, data = train,
family = binomial(link = "logit")) %>%
predict(newdata = test,
type = "response")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
confusionMatrix(factor(ifelse(pred > .5, "manual", "automatic")),
test$am)
## Confusion Matrix and Statistics
##
## Reference
## Prediction automatic manual
## automatic 4 0
## manual 1 3
##
## Accuracy : 0.875
## 95% CI : (0.4735, 0.9968)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.135
##
## Kappa : 0.75
##
## Mcnemar's Test P-Value : 1.000
##
## Sensitivity : 0.800
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 0.750
## Prevalence : 0.625
## Detection Rate : 0.500
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.900
##
## 'Positive' Class : automatic
##
A logistic model in this case performed the same, but with a lot of algorithmic problems. It’s a good thing that the developers of glm() know what they are doing!
The data file we will be working with is the gene expression data. Using microarrays, the expression of many genes can be measured at the same time. The data file contains expressions for 54675 genes with IDs such as 1007_s_at, 202896_s_at, and AFFX-r2-P1-cre-3_at. (NB: these IDs are specific to this type of chip and need to be converted to actual gene names before they can be looked up in a database such as “GeneCards”.) The values in the data file are related to the amount of RNA belonging to each gene found in the tissue sample.
The goal of the study for which this data was collected is one of exploratory cancer classification: are there differences in gene expression between tissue samples of human prostates with and without prostate cancer?
CHALLENGE: Use the challenge.zip archive on the course website to predict Disease from a huge set of genotypes as accurately as possible using either ridge regression, lasso regression, or a combination thereof (elastic net or the hacky way).
Rules:
- Use the challenge.Rmd file from the challenge archive
- Use glmnet or caret
- The best performing model will be awarded a prize next week (or per mail)
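If you want a place to start, a hypothetical sketch could look like the following; the file and column names are assumptions and must be adapted to whatever the challenge.Rmd file actually loads:
# hypothetical starting point - adjust file/column names to the challenge data
# genes <- readRDS("gene_expressions.rds") # assumed file name
# x <- as.matrix(genes[, setdiff(colnames(genes), "Disease")])
# y <- factor(genes$Disease)
# fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)
# predict(fit, newx = x, s = "lambda.1se", type = "class")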
End of Practical