In this practical, we will continue with regularized regression. We need the following packages:
library(magrittr)
library(dplyr)
library(GGally)
library(glmnet)
library(caret)
library(plotmo)
library(coefplot)
We continue with the previous practical. For convenience, I have prepared a workspace image with all the necessary objects and functions from the previous practical, so that we can continue where we left off. To load that workspace image, run the code block below.
con <- url("https://www.gerkovink.com/erasmus/Day%203/Part%20H/load_all_objects.RData")
load(con)
The comparison in the previous practical was between OLS and ridge regression for the mtc data. The mtc data set is our recoded version of the mtcars data, where the binary columns are set to factors. We created a training set (train) and a test set (test) based on this mtc data. If you recall, OLS performed worse than ridge regression:
postResample(pred, test$hp) # lm()
## RMSE Rsquared MAE
## 36.5474077 0.7369976 32.7630202
postResample(ridge_min, test$hp) # ridge with \lambda_min
## RMSE Rsquared MAE
## 26.6839013 0.8118065 25.2053081
postResample(ridge_1se, test$hp) # ridge with \lambda_1se
## RMSE Rsquared MAE
## 29.517895 0.853422 23.769065
Before starting with the exercises, it is a good idea to set your RNG seed, so that (1) your answers are reproducible and (2) you can compare your answers with the answers provided.
set.seed(123)
Fit a lasso regression model on the training data. Name the resulting object fit.lasso. Do not do crossvalidation, yet.

The function plot_glmnet() from package plotmo has a nicer - but very similar - plot class. I often prefer these plots over the native plot.glmnet() plots. Inspect the fitted model with plot_glmnet().
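A minimal sketch of what this could look like, assuming (as in the previous practical) that hp is the outcome and that the predictors come from the train data; the object names x_train and y_train are my own:

x_train <- model.matrix(hp ~ ., data = train)[, -1]  # glmnet needs a numeric matrix; drop the intercept column
y_train <- train$hp
fit.lasso <- glmnet(x_train, y_train, alpha = 1)      # alpha = 1 is the lasso (glmnet's default)
plot(fit.lasso, xvar = "lambda")                      # native coefficient paths
plot_glmnet(fit.lasso)                                # plotmo's labelled version of the same paths

Without crossvalidation, glmnet() simply fits the whole sequence of lambda values; the coefficient path plots show how the coefficients shrink towards zero as lambda grows.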
Train a lasso regression model on the train data set. Name the resulting object lasso. Use 4-fold crossvalidation, just like with ridge regression.
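A sketch of one way to do this, re-using the hypothetical x_train and y_train objects from the previous sketch:

lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 4)  # 4-fold crossvalidation over the lambda path
plot(lasso)          # CV error against log(lambda), with lambda.min and lambda.1se marked
lasso$lambda.min     # lambda that minimizes the CV error
lasso$lambda.1se     # largest lambda within one standard error of that minimum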
Instead of fitting the lasso and ridge regressions separately, we can simultaneously optimize the \(L1\) and \(L2\) norms. We then have to specify the ratio between these norms. That is done with alpha, where \(0\leq\alpha\leq1\): alpha = 0 would mean ridge regression and alpha = 1 indicates the lasso.
Now train an elastic net model on the train data (not the subset) and set alpha = 0.5. Does performance on the test data increase?
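One possible way to check this, again with the hypothetical objects from the earlier sketches and an analogous matrix for the test data:

x_test <- model.matrix(hp ~ ., data = test)[, -1]
elastic <- cv.glmnet(x_train, y_train, alpha = 0.5, nfolds = 4)   # mix of the L1 and L2 penalties
elastic_pred <- predict(elastic, newx = x_test, s = "lambda.min")
postResample(elastic_pred, test$hp)   # compare with the OLS and ridge results above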
Different levels of alpha will yield different performance. So, just like \(\lambda\), \(\alpha\) is a hyperparameter that can be optimized.
Tune both hyperparameters with caret. Can you find a combination of alpha (between 0.1 and 1 in steps of .1) and lambda (between 0.1 and 30 in steps of 1) that yields a better predictive performance than the one currently obtained?
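One way to set up such a grid search is sketched below; the grid follows the exercise, the object names are assumptions:

tune_grid <- expand.grid(alpha  = seq(0.1, 1, by = 0.1),
                         lambda = seq(0.1, 30, by = 1))
enet_caret <- train(hp ~ ., data = train, method = "glmnet",
                    trControl = trainControl(method = "cv", number = 4),
                    tuneGrid  = tune_grid)
enet_caret$bestTune                                        # best alpha/lambda combination found
postResample(predict(enet_caret, newdata = test), test$hp)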
Use the alpha and lambda obtained in the previous exercise to generate predictions with glmnet. Is the performance the same?
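A sketch of feeding those values back into glmnet directly (object names continue from the previous sketches):

best <- enet_caret$bestTune
fit_best <- glmnet(x_train, y_train, alpha = best$alpha, lambda = best$lambda)
pred_best <- predict(fit_best, newx = x_test)
postResample(pred_best, test$hp)   # compare with caret's performance above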
Predict am (automatic or manual cars) based on the following design matrix. Use the same train/test split.
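The design matrix referred to above is not reproduced here. Purely as an illustration, the sketch below assumes all remaining mtc columns as predictors and fits a crossvalidated logistic lasso (family = "binomial"); every object name is hypothetical:

x_am_train <- model.matrix(am ~ ., data = train)[, -1]
x_am_test  <- model.matrix(am ~ ., data = test)[, -1]
fit_am <- cv.glmnet(x_am_train, train$am, family = "binomial", nfolds = 4)
plot(fit_am)                                            # lambda.min and lambda.1se are marked
am_pred <- predict(fit_am, newx = x_am_test, s = "lambda.1se", type = "class")
table(predicted = am_pred, observed = test$am)          # simple confusion table for the test set

The plot() call and the confusion table also point towards the two questions asked next.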
Inspect the model with plot(). What are the optimum values for lambda? And how well can we predict am with this model?

The data file we will be working with is gene expression data. Using microarrays, the expression of many genes can be measured at the same time. The data file contains expressions for 54675 genes with IDs such as 1007_s_at, 202896_s_at and AFFX-r2-P1-cre-3_at. (NB: these IDs are specific to this type of chip and need to be converted to actual gene names before they can be looked up in a database such as “GeneCards”.) The values in the data file are related to the amount of RNA belonging to each gene found in the tissue sample.
The goal of the study for which this data was collected is one of exploratory cancer classification: are there differences in gene expression between tissue samples of human prostates with and without prostate cancer?
CHALLENGE: Use the challenge.zip archive on the course website to predict Disease from a huge set of genotypes as accurately as possible, using either ridge regression, lasso regression or a combination thereof (elastic net, or a more hacky way).
Rules:
- Use the challenge.Rmd file from the challenge archive
- Fit your models with glmnet or caret
- The best performing model will be awarded a prize next week (or per mail)
End of Practical