In this practical, we will focus on ridge regression. One of the packages we are going to use is glmnet. For this, you will probably need to run install.packages("glmnet") before the library() calls below. GGally is also a new package that needs to be installed to access the ggpairs() function.
library(MASS)
library(magrittr)
library(dplyr)
library(GGally)
library(glmnet)
library(caret)
Before starting with the exercises, it is a good idea to set your seed, so that (1) your answers are reproducible and (2) you can compare your answers with the answers provided.
set.seed(123)
The mtcars data set (built into R's datasets package) contains fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Inspect the mtcars data set. Is everything properly coded? Let's start with studying the dimensionality of the mtcars data:
dim(mtcars)
## [1] 32 11
There are not too many dimensions: the data set holds 32 cases over 11 columns. This is tiny if we'd like to predict one column based on the others.
The reason why it is wise to start with the dimensionality of the data is that, for large data sets, the output of str() and summary() would be too large to study conveniently. Now, let's look at the structure of the data:
mtcars %>% str
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The columns vs
and am
seem dichotomous. A call to summary()
verifies this:
mtcars %>% summary
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
The actual nature of these columns can also be retrieved from the help file. See e.g. ?mtcars
.
Let’s recode the vs
and am
columns into factors. To avoid confusion later on, I’ll name the recoded data set mtc
.
mtc <- mtcars %>%
mutate(vs = factor(vs, labels = c("v-shaped", "straight")),
am = factor(am, labels = c("automatic", "manual")))
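The same recode can be sketched and verified without dplyr, using only base R (a self-contained illustration that rebuilds mtc from mtcars):

```r
# Rebuild the recode with base R and confirm the factor levels
mtc <- within(mtcars, {
  vs <- factor(vs, labels = c("v-shaped", "straight"))
  am <- factor(am, labels = c("automatic", "manual"))
})
levels(mtc$vs)   # "v-shaped" "straight"
table(mtc$am)    # 19 automatic, 13 manual
```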
The measurement level of these columns has now been properly set:
mtc %>% str
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "v-shaped","straight": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
I always like GGally
’s function ggpairs()
to inspect the multivariate structure.
ggpairs(mtc)
It is clear that many features have quite substantial interrelations: the correlations between the first six columns are quite high. The boxplots for am and vs also demonstrate clear effects with the first few variables. It is also apparent that for some variable combinations there are very few observations.
Fit a linear model with hp as the response and all other features as the predictors. Try to use the exposition (%$%) pipe. This is straightforward:
fit <- mtc %$%
lm(hp ~ ., data = .)
fit %>% summary
##
## Call:
## lm(formula = hp ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.681 -15.558 0.799 18.106 34.718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.0484 184.5041 0.428 0.67270
## mpg -2.0631 2.0906 -0.987 0.33496
## cyl 8.2037 10.0861 0.813 0.42513
## disp 0.4390 0.1492 2.942 0.00778 **
## drat -4.6185 16.0829 -0.287 0.77680
## wt -27.6600 19.2704 -1.435 0.16591
## qsec -1.7844 7.3639 -0.242 0.81089
## vsstraight 25.8129 19.8512 1.300 0.20758
## ammanual 9.4863 20.7599 0.457 0.65240
## gear 7.2164 14.6160 0.494 0.62662
## carb 18.7487 7.0288 2.667 0.01441 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.97 on 21 degrees of freedom
## Multiple R-squared: 0.9028, Adjusted R-squared: 0.8565
## F-statistic: 19.5 on 10 and 21 DF, p-value: 1.898e-08
The model seems to fit the data quite well, as indicated by the high R-squared (and the similar adjusted R-squared); yet, few of the predictors are significant. In short: the observed outcome seems to be quite well represented by this model.
Fit a ridge regression and store the result in an object called ridge. We use glmnet to fit the ridge regression.
y <- mtc$hp
x <- model.matrix(fit)
ridge <- glmnet(x = x, y = y,
family = "gaussian",
alpha = 0,
standardize = TRUE)
glmnet requires the data to be specified in terms of a response \(y\) and a predictor space \(\mathbf{X}\). The predictor space \(\mathbf{X}\) needs to be a design matrix, meaning that all factors are coded as dummies, etc. The easiest way to obtain such a design matrix is with the model.matrix() function. Since I already created a fitted linear model, I might as well grab the design matrix from that object. The convenience here is that the lm() function has already realized the factor expansion into dummies for us.
The remaining two function arguments in the glmnet() call are vital for the process of ridge estimation. First, alpha = 0 indicates that ridge regression's \(L_2\) regularization should be used. Second, standardize = TRUE is the default flag indicating that the predictor space x should be standardized by glmnet() during estimation; standardize = FALSE would indicate that glmnet() should not perform standardization. Why is this standardization important? Let's imagine that some features are measured in completely different units than other features. Then, the features that would yield large OLS coefficients will be penalized more than those that yield smaller OLS coefficients. In our example:
fit %>% coef
## (Intercept) mpg cyl disp drat wt
## 79.0483879 -2.0630545 8.2037204 0.4390024 -4.6185488 -27.6600472
## qsec vsstraight ammanual gear carb
## -1.7843654 25.8128774 9.4862914 7.2164047 18.7486691
the coefficients for wt, vsstraight and carb will be shrunk more than e.g. disp. This would not be fair, because engine displacement disp is measured in cubic inches (yielding high numbers) and wt is measured in units of 1000 lbs. The unit scales of these variables differ and, hence, their coefficients are not comparable. Standardization puts all features on the same scale (\(\sim \mathcal{N}(0, 1)\)) such that their relative importance with respect to the response variable can be compared.
Naturally, the coefficients are backtransformed in the final model, such that they are in the scale of the original predictor values.
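The back-transformation is easy to sketch with a plain lm() rather than glmnet's internal code: if a predictor is standardized before fitting, dividing its coefficient by the predictor's original standard deviation returns it to the original scale. A minimal base-R illustration:

```r
# Illustration with a simple linear model: standardize one predictor,
# fit, then back-transform the slope to the original scale
x_raw <- mtcars$disp
x_std <- (x_raw - mean(x_raw)) / sd(x_raw)

b_raw <- coef(lm(mtcars$hp ~ x_raw))["x_raw"]
b_std <- coef(lm(mtcars$hp ~ x_std))["x_std"]

# dividing the standardized slope by sd(x_raw) recovers the raw-scale slope
all.equal(unname(b_std / sd(x_raw)), unname(b_raw))  # TRUE
```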
Inspect the ridge object and the coef() thereof. Let's start with the object ridge:
ridge
##
## Call: glmnet(x = x, y = y, family = "gaussian", alpha = 0, standardize = TRUE)
##
## Df %Dev Lambda
## 1 10 0.00 56180
## 2 10 1.10 51190
## 3 10 1.21 46640
## 4 10 1.32 42500
## 5 10 1.45 38720
## 6 10 1.59 35280
## 7 10 1.74 32150
## 8 10 1.91 29290
## 9 10 2.09 26690
## 10 10 2.29 24320
## 11 10 2.51 22160
## 12 10 2.74 20190
## 13 10 3.00 18400
## 14 10 3.29 16760
## 15 10 3.60 15270
## 16 10 3.93 13920
## 17 10 4.30 12680
## 18 10 4.70 11550
## 19 10 5.14 10530
## 20 10 5.61 9591
## 21 10 6.12 8739
## 22 10 6.68 7963
## 23 10 7.29 7255
## 24 10 7.94 6611
## 25 10 8.65 6024
## 26 10 9.41 5488
## 27 10 10.24 5001
## 28 10 11.13 4557
## 29 10 12.08 4152
## 30 10 13.10 3783
## 31 10 14.20 3447
## 32 10 15.37 3141
## 33 10 16.62 2862
## 34 10 17.95 2607
## 35 10 19.36 2376
## 36 10 20.86 2165
## 37 10 22.43 1972
## 38 10 24.08 1797
## 39 10 25.82 1638
## 40 10 27.63 1492
## 41 10 29.51 1360
## 42 10 31.47 1239
## 43 10 33.48 1129
## 44 10 35.55 1028
## 45 10 37.67 937
## 46 10 39.83 854
## 47 10 42.02 778
## 48 10 44.23 709
## 49 10 46.44 646
## 50 10 48.65 588
## 51 10 50.85 536
## 52 10 53.02 489
## 53 10 55.16 445
## 54 10 57.24 406
## 55 10 59.28 370
## 56 10 61.24 337
## 57 10 63.13 307
## 58 10 64.95 280
## 59 10 66.68 255
## 60 10 68.32 232
## 61 10 69.87 212
## 62 10 71.32 193
## 63 10 72.69 176
## 64 10 73.96 160
## 65 10 75.14 146
## 66 10 76.23 133
## 67 10 77.24 121
## 68 10 78.17 110
## 69 10 79.02 100
## 70 10 79.80 92
## 71 10 80.52 83
## 72 10 81.17 76
## 73 10 81.77 69
## 74 10 82.31 63
## 75 10 82.81 58
## 76 10 83.27 52
## 77 10 83.70 48
## 78 10 84.08 44
## 79 10 84.44 40
## 80 10 84.78 36
## 81 10 85.09 33
## 82 10 85.38 30
## 83 10 85.65 27
## 84 10 85.91 25
## 85 10 86.16 23
## 86 10 86.38 21
## 87 10 86.61 19
## 88 10 86.82 17
## 89 10 87.01 16
## 90 10 87.21 14
## 91 10 87.39 13
## 92 10 87.57 12
## 93 10 87.73 11
## 94 10 87.90 10
## 95 10 88.05 9
## 96 10 88.20 8
## 97 10 88.34 7
## 98 10 88.47 7
## 99 10 88.60 6
## 100 10 88.72 6
The output shows three columns over 100 options for \(\lambda\): the degrees of freedom Df, the percentage of deviance explained %Dev, and the value of Lambda itself. Function glmnet() fits the model for 100 values of lambda by default. The function will stop early when %Dev does not change sufficiently from one lambda to the next.
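The \(\lambda\) grid and the corresponding deviance ratios can also be extracted from the fitted object directly. A self-contained sketch (refitting on the raw mtcars for brevity, assuming glmnet is installed):

```r
library(glmnet)
y <- mtcars$hp
x <- model.matrix(hp ~ ., data = mtcars)[, -1]
ridge <- glmnet(x = x, y = y, family = "gaussian",
                alpha = 0, standardize = TRUE)

length(ridge$lambda)   # at most 100 values
head(ridge$lambda)     # the largest penalties come first
head(ridge$dev.ratio)  # corresponding fraction of deviance explained
```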
Let’s look at the coefficients.
ridge %>% coef()
## 12 x 100 sparse Matrix of class "dgCMatrix"
## [[ suppressing 100 column names 's0', 's1', 's2' ... ]]
##
## (Intercept) 1.466875e+02 1.472529e+02 1.473075e+02 1.473673e+02
## (Intercept) . . . .
## mpg -8.918920e-36 -1.155421e-02 -1.267149e-02 -1.389585e-02
## cyl 3.228109e-35 4.182744e-02 4.587301e-02 5.030647e-02
## disp 4.419724e-37 5.726389e-04 6.280213e-04 6.887128e-04
## drat -5.812650e-35 -7.513747e-02 -8.238584e-02 -9.032526e-02
## wt 4.662631e-35 6.036284e-02 6.619567e-02 7.258658e-02
## qsec -2.744816e-35 -3.562883e-02 -3.908167e-02 -4.286691e-02
## vsstraight -9.935867e-35 -1.287315e-01 -1.411815e-01 -1.548249e-01
## ammanual -3.375455e-35 -4.348827e-02 -4.766809e-02 -5.224324e-02
## gear -1.179948e-35 -1.507575e-02 -1.651120e-02 -1.807963e-02
## carb 3.214994e-35 4.173691e-02 4.578222e-02 5.021709e-02
##
## (Intercept) 1.474329e+02 1.475047e+02 1.475834e+02 1.476695e+02
## (Intercept) . . . .
## mpg -1.523733e-02 -1.670692e-02 -1.831654e-02 -2.007921e-02
## cyl 5.516426e-02 6.048617e-02 6.631554e-02 7.269954e-02
## disp 7.552123e-04 8.280642e-04 9.078614e-04 9.952495e-04
## drat -9.901993e-02 -1.085397e-01 -1.189605e-01 -1.303648e-01
## wt 7.958783e-02 8.725635e-02 9.565413e-02 1.048486e-01
## qsec -4.701610e-02 -5.156370e-02 -5.654731e-02 -6.200796e-02
## vsstraight -1.697738e-01 -1.861506e-01 -2.040886e-01 -2.237327e-01
## ammanual -5.724982e-02 -6.272697e-02 -6.871705e-02 -7.526586e-02
## gear -1.979264e-02 -2.166263e-02 -2.370290e-02 -2.592766e-02
## carb 5.507850e-02 6.040686e-02 6.624629e-02 7.264492e-02
##
## (Intercept) 147.763769583 147.866921292 147.97977190 148.103195779
## (Intercept) . . . .
## mpg -0.022009062 -0.024121470 -0.02643311 -0.028962077
## cyl 0.079689509 0.087341234 0.09571529 0.104877399
## disp 0.001090931 0.001195668 0.00131029 0.001435696
## drat -0.142841780 -0.156488409 -0.17140934 -0.187717676
## wt 0.114912937 0.125926736 0.13797620 0.151154724
## qsec -0.067990389 -0.074543327 -0.08171984 -0.089577679
## vsstraight -0.245240936 -0.268784618 -0.29454996 -0.322738883
## ammanual -0.082422809 -0.090241173 -0.09877822 -0.108095386
## gear -0.028352020 -0.030991994 -0.03386450 -0.036987297
## carb 0.079655217 0.087334390 0.09574472 0.104953998
##
## (Intercept) 148.238138647 148.385621817 148.546751999 148.722704928
## (Intercept) . . . .
## mpg -0.031727952 -0.034751924 -0.038057437 -0.041668215
## cyl 0.114898745 0.125856365 0.137834720 0.150921545
## disp 0.001572859 0.001722834 0.001886766 0.002065869
## drat -0.205535349 -0.224993612 -0.246234221 -0.269406933
## wt 0.165563324 0.181311106 0.198515834 0.217303934
## qsec -0.098179592 -0.107593744 -0.117894181 -0.129160877
## vsstraight -0.353570081 -0.387280154 -0.424124572 -0.464379136
## ammanual -0.118258447 -0.129337595 -0.141407292 -0.154547079
## gear -0.040378934 -0.044058638 -0.048046162 -0.052361650
## carb 0.115035916 0.126070539 0.138144743 0.151352735
##
## (Intercept) 148.914757835 149.124274960 149.352714695 149.601631771
## (Intercept) . . . .
## mpg -0.045611252 -0.049915089 -0.054610320 -0.059729662
## cyl 0.165214510 0.180817607 0.197842389 0.216408246
## disp 0.002261471 0.002474995 0.002707965 0.002962012
## drat -0.294673866 -0.322207194 -0.352189869 -0.384815600
## wt 0.237811532 0.260184432 0.284578536 0.311160084
## qsec -0.141480643 -0.154947231 -0.169661821 -0.185733434
## vsstraight -0.508340504 -0.556327500 -0.608681934 -0.665769375
## ammanual -0.168840385 -0.184375068 -0.201242967 -0.219539498
## gear -0.057025306 -0.062057122 -0.067476456 -0.073301524
## carb 0.165796493 0.181586294 0.198841214 0.217689620
##
## (Intercept) 149.872678504 150.167604853 150.488257008 150.836574179
## (Intercept) . . . .
## mpg -0.065308007 -0.071382459 -0.077992341 -0.085179179
## cyl 0.236642623 0.258681179 0.282667856 0.308754853
## disp 0.003238877 0.003540414 0.003868588 0.004225478
## drat -0.420288670 -0.458823526 -0.500644127 -0.545982978
## wt 0.340105807 0.371602947 0.405849138 0.443052091
## qsec -0.203279319 -0.222425329 -0.243306253 -0.266066099
## vsstraight -0.727979746 -0.795727701 -0.869452714 -0.949618816
## ammanual -0.239363071 -0.260814324 -0.283995114 -0.309007260
## gear -0.079548777 -0.086232135 -0.093362072 -0.100944515
## carb 0.238269659 0.260729713 0.285228826 0.311937073
##
## (Intercept) 151.214583239 151.624390808 152.068172378 152.548158005
## (Intercept) . . . .
## mpg -0.092986639 -0.101460433 -0.110648162 -0.120599106
## cyl 0.337102459 0.367878748 0.401259085 0.437425428
## disp 0.004613273 0.005034268 0.005490857 0.005985524
## drat -0.595079812 -0.648179856 -0.705531624 -0.767384171
## wt 0.483429064 0.527206071 0.574616785 0.625901093
## qsec -0.290858316 -0.317845926 -0.347201561 -0.379107365
## vsstraight -1.036713906 -1.131248533 -1.233754082 -1.344780224
## ammanual -0.335950965 -0.364922902 -0.396013915 -0.429306293
## gear -0.108979550 -0.117459894 -0.126369126 -0.135679648
## carb 0.341035864 0.372718154 0.407188551 0.444663275
##
## (Intercept) 153.066614130 153.625821050 154.228045619 154.875508774
## (Intercept) . . . .
## mpg -0.131363944 -0.142994396 -0.155542786 -0.169061510
## cyl 0.476565388 0.518871027 0.564537348 0.613760469
## disp 0.006520831 0.007099401 0.007723898 0.008397004
## drat -0.833983765 -0.905569894 -0.982370590 -1.064597020
## wt 0.681303264 0.741069672 0.805446047 0.874674217
## qsec -0.413754751 -0.451343972 -0.492083494 -0.536189122
## vsstraight -1.464891557 -1.594663299 -1.734675929 -1.885508670
## ammanual -0.464870583 -0.502761924 -0.543015864 -0.585643683
## gear -0.145350356 -0.155324011 -0.165524310 -0.175852655
## carb 0.485369962 0.529547274 0.577444280 0.629319587
##
## (Intercept) 155.570347591 156.314571661 157.1100137 157.95827492 158.86066448
## (Intercept) . . . . .
## mpg -0.183602401 -0.199216008 -0.2159508 -0.23385202 -0.25296110
## cyl 0.666735448 0.723653749 0.7847003 0.85005042 0.91986585
## disp 0.009121388 0.009899673 0.0107344 0.01162799 0.01258268
## drat -1.152437325 -1.246049737 -1.3455550 -1.45102813 -1.56248983
## wt 0.948988304 1.028610385 1.1137456 1.20457679 1.30125858
## qsec -0.583882865 -0.635391507 -0.6909448 -0.75077360 -0.81510695
## vsstraight -2.047731707 -2.221897056 -2.4085280 -2.60810727 -2.82106335
## ammanual -0.630627219 -0.677913254 -0.7274075 -0.77896853 -0.83240114
## gear -0.186184633 -0.196366265 -0.2062101 -0.21549096 -0.22394231
## carb 0.685440175 0.746079921 0.8115178 0.88203557 0.95791543
##
## (Intercept) 159.81811021 160.8311856 161.89990556 163.02374078 164.20152973
## (Intercept) . . . . .
## mpg -0.27331128 -0.2949375 -0.31785975 -0.34209161 -0.36763680
## cyl 0.99429254 1.0734530 1.15744551 1.24633823 1.34016540
## disp 0.01360057 0.0146833 0.01583234 0.01704876 0.01833318
## drat -1.67991187 -1.8031569 -1.93203928 -2.06627439 -2.20548140
## wt 1.40391684 1.5126220 1.62741036 1.74826031 1.87509060
## qsec -0.88417110 -0.9581810 -1.03734385 -1.12185065 -1.21187332
## vsstraight -3.04775854 -3.2884633 -3.54335274 -3.81248005 -4.09576059
## ammanual -0.88745093 -0.9437961 -1.00104587 -1.05873374 -1.11631580
## gear -0.23125115 -0.2370573 -0.24094690 -0.24245214 -0.24104958
## carb 1.03943678 1.1268730 1.22048767 1.32053059 1.42723321
##
## (Intercept) 165.43141454 166.71078412 168.03622714 169.40349792 170.80749780
## (Intercept) . . . . .
## mpg -0.39448811 -0.42262638 -0.45201978 -0.48262323 -0.51437807
## cyl 1.43892332 1.54256684 1.65100644 1.76410600 1.88168158
## disp 0.01968576 0.02110615 0.02259347 0.02414629 0.02576265
## drat -2.34917759 -2.49677466 -2.64757754 -2.80078598 -2.95549944
## wt 2.00775473 2.14603643 2.28964634 2.43822012 2.59131827
## qsec -1.30756010 -1.40903101 -1.51637323 -1.62963648 -1.74882859
## vsstraight -4.39295340 -4.70364329 -5.02722401 -5.36288283 -5.70958728
## ammanual -1.17316971 -1.22859627 -1.28182361 -1.33201447 -1.37827680
## gear -0.23616026 -0.22715134 -0.21333965 -0.19399731 -0.16835964
## carb 1.54080416 1.66142440 1.78924246 1.92436963 2.06687544
##
## (Intercept) 172.24227472 173.70104280 175.17622359 176.6595093 178.14194788
## (Intercept) . . . . .
## mpg -0.54721208 -0.58103978 -0.61576307 -0.6512722 -0.68744704
## cyl 2.00350119 2.12928576 2.25871132 2.3914124 2.52698688
## disp 0.02744002 0.02917541 0.03096534 0.0328060 0.03469325
## drat -3.11072551 -3.26539205 -3.41836297 -3.5684576 -3.71447308
## wt 2.74842788 2.90896650 3.07228799 3.2376906 3.40442683
## qsec -1.87391140 -2.00479710 -2.14134520 -2.2833603 -2.43059061
## vsstraight -6.06607466 -6.43084464 -6.80215578 -7.1780260 -7.55623762
## ammanual -1.41967780 -1.45526151 -1.48406952 -1.5051647 -1.51765678
## gear -0.13563534 -0.09501909 -0.04570629 0.0130903 0.08212249
## carb 2.21678335 2.37406698 2.53864694 2.7103883 2.88909925
##
## (Intercept) 179.6140487 181.06590616 182.48733747 183.86803044 185.19769432
## (Intercept) . . . . .
## mpg -0.7241585 -0.76127021 -0.79864063 -0.83612474 -0.87357629
## cyl 2.6650015 2.80499904 2.94650635 3.08904307 3.23213163
## disp 0.0366228 0.03859027 0.04059138 0.04262205 0.04467859
## drat -3.8552098 -3.98949858 -4.11622901 -4.23437865 -4.34304096
## wt 3.5717153 3.73875337 3.90473153 4.06884706 4.23031787
## qsec -2.5827281 -2.73940904 -2.90021674 -3.06468495 -3.23230303
## vsstraight -7.9343467 -8.30969725 -8.67943899 -9.04054940 -9.38985813
## ammanual -1.5207299 -1.51366965 -1.49588918 -1.46695287 -1.42659520
## gear 0.1620877 0.25361006 0.35722224 0.47334926 0.60229509
## carb 3.0745299 3.26637326 3.46426613 3.66779170 3.87648269
##
## (Intercept) 186.46620739 187.66375368 188.77592851 189.80262748 190.73216640
## (Intercept) . . . . .
## mpg -0.91084998 -0.94780384 -0.98462612 -1.02069427 -1.05608906
## cyl 3.37530796 3.51813308 3.66143703 3.80271164 3.94254949
## disp 0.04675784 0.04885734 0.05098625 0.05312257 0.05527569
## drat -4.44145132 -4.52900935 -4.60585355 -4.67018092 -4.72264347
## wt 4.38839444 4.54236936 4.69101170 4.83437186 4.97163620
## qsec -3.40252215 -3.57476256 -3.74827883 -3.92263045 -4.09712960
## vsstraight -9.72407313 -10.03980723 -10.33210468 -10.59989701 -10.83870257
## ammanual -1.37473361 -1.31147332 -1.23547268 -1.14988495 -1.05423666
## gear 0.74423325 0.89920190 1.06770954 1.24834316 1.44121560
## carb 4.08982526 4.30726356 4.52790832 4.75165737 4.97765285
##
## (Intercept) 191.55740418 192.27205646 192.87068175 193.3486372 193.70201849
## (Intercept) . . . . .
## mpg -1.09069715 -1.12441386 -1.15714575 -1.1888133 -1.21935425
## cyl 4.08066812 4.21684822 4.35094621 4.4829049 4.61276145
## disp 0.05744651 0.05963714 0.06185094 0.0640926 0.06636808
## drat -4.76332476 -4.79250509 -4.81065005 -4.8183854 -4.81645677
## wt 5.10228636 5.22585710 5.34192155 5.4500689 5.54987446
## qsec -4.27116187 -4.44413131 -4.61546822 -4.7846347 -4.95112797
## vsstraight -11.04509113 -11.21569516 -11.34722443 -11.4364772 -11.48035180
## ammanual -0.94934355 -0.83614126 -0.71563818 -0.5888631 -0.45681260
## gear 1.64580058 1.86146804 2.08750676 2.3231461 2.56757256
## carb 5.20524630 5.43378140 5.66259849 5.8910395 6.11845342
##
## (Intercept) 193.92109327 194.02191854 193.99443867 193.83803899 193.55250744
## (Intercept) . . . . .
## mpg -1.24958628 -1.27788644 -1.30490303 -1.33062218 -1.35504969
## cyl 4.74226023 4.86795813 4.99181549 5.11416174 5.23541158
## disp 0.06868099 0.07103425 0.07344181 0.07591337 0.07845923
## drat -4.80231755 -4.78166007 -4.75381589 -4.71975077 -4.68036600
## wt 5.63773425 5.71873939 5.79011555 5.85128880 5.90155881
## qsec -5.11376309 -5.27347084 -5.42934363 -5.58107857 -5.72840684
## vsstraight -11.47302080 -11.41796733 -11.30981867 -11.14626443 -10.92520896
## ammanual -0.31738503 -0.17801718 -0.03650641 0.10649780 0.25054535
## gear 2.81940382 3.07777793 3.34200895 3.61120714 3.88448289
## carb 6.34395738 6.56769239 6.78871127 7.00648413 7.22052318
##
## (Intercept) 193.13822267 192.6235899 191.97468050 191.20396732 190.31482894
## (Intercept) . . . . .
## mpg -1.37821255 -1.4008257 -1.42127520 -1.44048786 -1.45850504
## cyl 5.35603341 5.4750119 5.59408917 5.71384674 5.83469984
## disp 0.08108964 0.0837654 0.08657918 0.08950903 0.09256335
## drat -4.63640053 -4.5807940 -4.52925388 -4.47544037 -4.41960724
## wt 5.94007224 5.9632788 5.97688283 5.97616013 5.95994817
## qsec -5.87108814 -6.0086674 -6.14204537 -6.27041561 -6.39366861
## vsstraight -10.64485290 -10.3068884 -9.90732793 -9.44571180 -8.92205281
## ammanual 0.39534665 0.5381364 0.68087323 0.82389969 0.96721226
## gear 4.16089806 4.4347382 4.71317849 4.99197107 5.27000787
## carb 7.43040383 7.6368820 7.83802531 8.03416632 8.22521310
##
## (Intercept) 189.37257716 188.27470426 187.0687536 185.7595365 184.3519195
## (Intercept) . . . . .
## mpg -1.47534918 -1.49059909 -1.5047988 -1.5179336 -1.5300333
## cyl 5.95246313 6.07499424 6.1998551 6.3271725 6.4570548
## disp 0.09568011 0.09900362 0.1024771 0.1061069 0.1098975
## drat -4.35573114 -4.29876645 -4.2405424 -4.1813575 -4.1211935
## wt 5.92841217 5.88042664 5.8132188 5.7257870 5.6169480
## qsec -6.51277361 -6.62636283 -6.7345252 -6.8372005 -6.9342614
## vsstraight -8.34612170 -7.70301477 -7.0004500 -6.2407065 -5.4265766
## ammanual 1.10220811 1.24363242 1.3861696 1.5297716 1.6744514
## gear 5.53991500 5.81326972 6.0828195 6.3475418 6.6062949
## carb 8.41308865 8.59421180 8.7703459 8.9417301 9.1086946
##
## (Intercept) 182.9558484 181.3598369 179.6836123 178.0286143 176.1687057
## (Intercept) . . . . .
## mpg -1.5394633 -1.5491794 -1.5578918 -1.5634790 -1.5702765
## cyl 6.5805339 6.7155438 6.8525651 6.9825160 7.1250080
## disp 0.1137919 0.1179382 0.1222511 0.1267023 0.1313963
## drat -4.0637809 -4.0063802 -3.9480316 -3.8982249 -3.8430090
## wt 5.4966454 5.3437245 5.1664717 4.9777560 4.7484732
## qsec -7.0295272 -7.1152723 -7.1949698 -7.2729658 -7.3390116
## vsstraight -4.5761659 -3.6617589 -2.7048756 -1.7222137 -0.6883898
## ammanual 1.8047943 1.9531098 2.1018573 2.2371760 2.3929331
## gear 6.8544835 7.1008229 7.3378455 7.5648486 7.7848551
## carb 9.2735649 9.4323208 9.5881199 9.7425576 9.8929095
##
## (Intercept) 174.2374998 172.3242459 170.1250784 167.9465417 165.769300
## (Intercept) . . . . .
## mpg -1.5760497 -1.5785075 -1.5859257 -1.5890722 -1.589213
## cyl 7.2684806 7.4041494 7.5607698 7.7070944 7.845097
## disp 0.1362636 0.1413014 0.1465318 0.1519511 0.157590
## drat -3.7882349 -3.7471821 -3.6774904 -3.6284684 -3.599619
## wt 4.4927363 4.2244414 3.8922704 3.5514111 3.194440
## qsec -7.3981154 -7.4546376 -7.4913174 -7.5256794 -7.555251
## vsstraight 0.3730632 1.4481843 2.5627107 3.6790341 4.799234
## ammanual 2.5490449 2.6942526 2.8706812 3.0338030 3.193619
## gear 7.9937086 8.1937352 8.3758693 8.5520025 8.721551
## carb 10.0419624 10.1906801 10.3379391 10.4856741 10.634105
##
## (Intercept) 163.4428247 161.0919828 158.6210382 156.1186655 153.5193462
## (Intercept) . . . . .
## mpg -1.5924349 -1.5919122 -1.5946362 -1.5941659 -1.5969457
## cyl 7.9915465 8.1285241 8.2710823 8.4035649 8.5389892
## disp 0.1633429 0.1693553 0.1754512 0.1818056 0.1882199
## drat -3.5487641 -3.5257348 -3.4805817 -3.4625579 -3.4239166
## wt 2.7873116 2.3662043 1.8958211 1.4084889 0.8742336
## qsec -7.5689594 -7.5771660 -7.5691322 -7.5538718 -7.5223027
## vsstraight 5.9235270 7.0437772 8.1532667 9.2494730 10.3222755
## ammanual 3.3652046 3.5343900 3.7125781 3.8899779 4.0740702
## gear 8.8691874 9.0153281 9.1375263 9.2578672 9.3540732
## carb 10.7846319 10.9362100 11.0913373 11.2486117 11.4103177
##
## (Intercept) 150.8886940 148.1876123 145.4197146
## (Intercept) . . .
## mpg -1.5971154 -1.6004930 -1.6039712
## cyl 8.6634640 8.7882784 8.9076274
## disp 0.1948860 0.2015858 0.2084335
## drat -3.4114081 -3.3800171 -3.3550844
## wt 0.3206107 -0.2763911 -0.9026725
## qsec -7.4820869 -7.4259810 -7.3573035
## vsstraight 11.3731550 12.3898058 13.3744893
## ammanual 4.2584282 4.4469797 4.6382009
## gear 9.4478774 9.5179478 9.5782001
## carb 11.5751705 11.7449431 11.9185074
We can see that the resulting output holds the coefficients of 12 parameters for each of the 100 values of lambda that we saw already in the output of the ridge object. The extra, twelfth parameter is the result of the model matrix that we entered as the predictor space x: it holds an intercept (i.e. a column of 1's), while glmnet() by default adds an intercept too. The redundant extra intercept is set to zero.
Inspect the summary() of the ridge regression. There is no summary output, at least not as we know it. The reason is that the model is not fit to generate one specific scenario: for different values of the shrinkage/penalization parameter \(\lambda\), different results are obtained.
Plot the ridge object twice: once with the deviance on the x-axis and once with the log of lambda on the x-axis. Hint: see ?plot.glmnet. Let's first plot the trace of the coefficients against the natural logarithm of lambda.
plot(ridge, xvar = "lambda")
We can see that with increasing lambda, the parameters shrink towards zero. Note that \(\ln(\lambda) = 10\) would correspond to a lambda of \(e^{10} \approx 22026\).
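A quick check of that conversion from the log-lambda axis back to the lambda scale:

```r
# ln(lambda) = 10 on the plot's x-axis corresponds to
exp(10)   # approximately 22026.47
```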
plot(ridge, xvar = "dev")
We can see that with an increasing fraction of explained deviance, the parameters move away from zero.
The lesson is that there is an optimal value for \(\lambda\) somewhere in between: large enough to shrink the coefficients, but not so large that predictive performance suffers. We can find it through cross-validation:
ridge <- cv.glmnet(x = x[, -1], y = y,
family = "gaussian",
alpha = 0,
standardize = TRUE)
I choose to remove the intercept by excluding the first column ([, -1]): glmnet() will add an intercept by default, and otherwise I would end up with the redundant parameter that we saw before.
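As an alternative sketch, the same design matrix (minus the intercept) can be built straight from a formula, without first fitting a linear model. Using the raw mtcars here for brevity; on mtc the factor columns would expand into dummies:

```r
# model.matrix() on a formula builds the design matrix and adds an
# intercept column of 1's; [, -1] drops that redundant intercept
x_alt <- model.matrix(hp ~ ., data = mtcars)[, -1]
dim(x_alt)   # 32 rows, 10 predictor columns
```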
Store the cross-validated ridge regression in an object called cv.ridge.
cv.ridge <- cv.glmnet(x = x[, -1], y = y,
family = "gaussian",
alpha = 0,
standardize = TRUE)
Inspect the cv.ridge object and run the object through plot().
cv.ridge
##
## Call: cv.glmnet(x = x[, -1], y = y, family = "gaussian", alpha = 0, standardize = TRUE)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 20.66 86 1041 351.5 10
## 1se 145.78 65 1384 583.0 10
There are two optimal values for \(\lambda\) given, both of which yield 10 nonzero coefficients:
plot(cv.ridge)
The vertical dashed lines show the locations of \(\lambda_{min}\) and \(\lambda_{1se}\).
The 1-standard-error-rule acknowledges the fact that the risk curves are estimated (and have error). It therefore favors parsimony. In other words: a simpler model that yields about the same predictive power as the one under \(\lambda_{min}\). The motivation for \(\lambda_{1se}\) originated in the 1984 book Classification and regression trees by Breiman, Friedman, Olshen and Stone and can be found in chapter 3, paragraph 4.3.
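Both values can be pulled from the cross-validation object programmatically. A self-contained sketch (refitting on the raw mtcars, assuming glmnet is installed; the exact numbers depend on the seed):

```r
library(glmnet)
set.seed(123)
y <- mtcars$hp
x <- model.matrix(hp ~ ., data = mtcars)[, -1]
cv.ridge <- cv.glmnet(x = x, y = y, family = "gaussian",
                      alpha = 0, standardize = TRUE)

cv.ridge$lambda.min   # lambda minimizing cross-validated MSE
cv.ridge$lambda.1se   # largest lambda within 1 SE of that minimum
```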
First we obtain the linear model predictions (fitted values)
pred.lm <- predict(fit)
and then we generate the predicted values for the ridge regression under the \(\lambda_{1se}\) model:
pred.ridge <- predict(cv.ridge, s = "lambda.1se", newx = x[, -1])
Using the caret
function postResample()
we obtain the following performance measures.
caret::postResample(pred.lm, y)
## RMSE Rsquared MAE
## 21.0392190 0.9027993 18.8429652
caret::postResample(pred.ridge, y)
## RMSE Rsquared MAE
## 33.6481538 0.8323334 23.0017267
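For reference, the three measures postResample() reports can be reproduced by hand. A base-R sketch with made-up observed/predicted vectors (the numbers below are hypothetical, not taken from the models above):

```r
obs  <- c(110, 110, 93, 175)   # hypothetical observed hp values
pred <- c(120, 100, 90, 180)   # hypothetical predictions

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
mae  <- mean(abs(pred - obs))        # mean absolute error
rsq  <- cor(pred, obs)^2             # caret's Rsquared: squared correlation

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```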
The linear model performs better than the ridge regression. However, we do not know how well these models would do on new predictions. Simply verifying the predicted values on the data the model has been trained on would be faulty: the linear model tends to fit better in that scenario because it yields unbiased parameter estimates. In conclusion: for inference purposes, the linear model may be fine; for prediction purposes, however, it may overfit the data.
I have chosen to use 4 folds in this exercise because the effective sample size has decreased due to the train/test split. First, we create the train/test split.
idx <- createDataPartition(mtc$hp, p = .7, list = FALSE)
train <- mtc[idx, ]
test <- mtc[-idx, ]
Then we refit the linear model to the training cases. We store the predicted values on the test data in object pred
.
fit <- train %$% lm(hp ~ ., data = .)
pred <- predict(fit, newdata = test)
For ridge regression, we do the same. First we separate the training response into vector y and obtain the model matrix x from the fitted linear model (the one fitted on the training data, naturally).
y <- train$hp
x <- model.matrix(fit)[, -1] #exclude intercept
Then we run the cv.glmnet()
method to infer optimal values of lambda through cross-validation.
ridge <- cv.glmnet(x = x, y = y,
family = "gaussian",
alpha = 0,
standardize = TRUE,
nfolds = 4)
Finally we obtain the predicted values. To do so, I use a quick linear model on the test data to obtain the corresponding design matrix. I do this to avoid an annoying matrix property: if I converted the test object to a matrix directly, every column would turn into character class, because matrices can hold either numeric or character values, not both. predict.glmnet() requires the input for argument newx to be a matrix.
newx <- test %$% lm(hp ~ ., data = .) %>% model.matrix() %>% .[, -1]
ridge_min <- predict(ridge, s = "lambda.min", newx = newx)
ridge_1se <- predict(ridge, s = "lambda.1se", newx = newx)
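An alternative sketch that avoids fitting a throw-away lm(): call model.matrix() on the test data directly. To keep the block self-contained, a plain sample() split is used here as a stand-in for the createDataPartition() split above:

```r
set.seed(123)
idx  <- sample(nrow(mtcars), 22)   # stand-in for the ~70% train split
test <- mtcars[-idx, ]

# design matrix for the test cases, dropping the intercept column
newx_alt <- model.matrix(hp ~ ., data = test)[, -1]
dim(newx_alt)   # 10 test rows, 10 predictor columns
```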
Now, let’s compare the three solutions:
postResample(pred, test$hp)
## RMSE Rsquared MAE
## 36.5474077 0.7369976 32.7630202
postResample(ridge_min, test$hp)
## RMSE Rsquared MAE
## 26.6839013 0.8118065 25.2053081
postResample(ridge_1se, test$hp)
## RMSE Rsquared MAE
## 29.517895 0.853422 23.769065
We can see that both ridge predictions outperform the linear model predictions, with the more regularized \(\lambda_{1se}\) solution yielding the best predictive performance.
Important Note: we have used both a train/test split and a k-fold cross-validation procedure. A different seed value will result in different splits and cross-validation folds, and hence in different results. Don't forget that once you fix the seed, all results become seed dependent.
Re-running it all again yields:
## RMSE Rsquared MAE
## 30.8110303 0.8112899 26.2250707
## RMSE Rsquared MAE
## 28.1841095 0.8428196 22.6240256
## RMSE Rsquared MAE
## 29.9926533 0.8708709 22.9344999
End of Practical