In this practical, you will learn how to perform regression analysis, how to plot with confidence and prediction intervals, how to calculate the MSE, how to perform train-test splits, and how to write a function for cross-validation.
Just like in the practical at the end of chapter 3 of the ISLR book, we will use the Boston dataset, which is in the MASS package that comes with R.
library(ISLR)
library(MASS)
library(tidyverse)

Regression is performed through the lm() function. It requires two arguments: a formula and data. A formula is a specific type of object that can be constructed like so:
some_formula <- outcome ~ predictor_1 + predictor_2

You can read it as “the outcome variable is a function of predictors 1 and 2”. As with other objects, you can check its class and even convert it to other classes, such as a character vector:
class(some_formula)

## [1] "formula"
as.character(some_formula)

## [1] "~"         "outcome"
## [3] "predictor_1 + predictor_2"
You can estimate a linear model using lm() by specifying the outcome variable and the predictors in a formula and by inputting the dataset these variables should be taken from.
Train a regression model called lm_ses using the formula medv ~ lstat and the Boston dataset.

You have now trained a regression model with medv (housing value) as the outcome/dependent variable and lstat (socio-economic status) as the predictor/independent variable.
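A minimal sketch of what this could look like:

lm_ses <- lm(formula = medv ~ lstat, data = Boston)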
Remember that a regression estimates \(\beta_0\) (the intercept) and \(\beta_1\) (the slope) in the following equation:
\[\boldsymbol{y} = \beta_0 + \beta_1\cdot \boldsymbol{x}_1 + \boldsymbol{\epsilon}\]
Use coef() to extract the intercept and slope from the lm_ses object. Interpret the slope coefficient.

Use summary() to get a summary of the lm_ses object. What do you see? You can use the help file ?summary.lm.
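For reference, extracting the coefficients and the model summary could look like this:

coef(lm_ses)
summary(lm_ses)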
We now have a model object lm_ses that represents the formula

\[\text{medv}_i = 34.55 - 0.95 \cdot \text{lstat}_i + \epsilon_i\]
With this object, we can predict a new medv value by inputting its lstat value. The predict() method enables us to do this for the lstat values in the original dataset.
Save the predicted y values to a variable called y_pred.

Create a scatter plot with y_pred mapped to the x position and the true y value (Boston$medv) mapped to the y position. What do you see? What would this plot look like if the fit were perfect?
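A possible way to create this plot (the column names pred and obs are our own choices):

y_pred <- predict(lm_ses)

# with a perfect fit, all points would lie on the dashed y = x line
tibble(pred = y_pred, obs = Boston$medv) %>%
  ggplot(aes(x = pred, y = obs)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")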
We can also generate predictions from new data using the newdata argument in the predict() method. For that, we need to prepare a data frame with new values for the original predictors.

Use the seq() function to generate a sequence of 1000 equally spaced values from 0 to 40. Store this vector in a data frame (using data.frame() or tibble()) with lstat as its column name. Name the data frame pred_dat.

Use pred_dat as the newdata argument to a predict() call for lm_ses. Store it in a variable named y_pred_new.
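A sketch of these two steps:

pred_dat   <- tibble(lstat = seq(0, 40, length.out = 1000))
y_pred_new <- predict(lm_ses, newdata = pred_dat)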
A good way of understanding your model is by visualising it. We are going to walk through the construction of a plot with a fit line and prediction / confidence intervals from an lm object.

Create a scatter plot from the Boston dataset with lstat mapped to the x position and medv mapped to the y position. Store the plot in an object called p_scatter.
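One way to build this plot:

p_scatter <- ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point() +
  theme_minimal()

p_scatter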
Now we’re going to add a prediction line to this plot.

Add the vector y_pred_new to the pred_dat data frame with the name medv.

Add a geom_line() to p_scatter, with pred_dat as the data argument. What does this line represent?

The interval argument of predict() can be used to generate confidence or prediction intervals. Create a new object called y_pred_95 using predict() (again with the pred_dat data) with the interval argument set to “confidence”. What is in this object?

Create a data frame with 4 columns: medv, lstat, lower, and upper.

Add a geom_ribbon() to the plot with the data frame you just made. The ribbon geom requires three aesthetics: x (lstat, already mapped), ymin (lower), and ymax (upper). Add the ribbon below the geom_line() and the geom_point() from before to make sure those remain visible. Give it a nice colour and clean up the plot, too! These steps are combined in the sketch below.
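Putting these steps together might look like the following sketch (gg_pred is our own name for the interval data frame, and the colours are just one option):

pred_dat <- pred_dat %>% mutate(medv = y_pred_new)

y_pred_95 <- predict(lm_ses, newdata = pred_dat, interval = "confidence")

gg_pred <- tibble(
  lstat = pred_dat$lstat,
  medv  = y_pred_95[, "fit"],
  lower = y_pred_95[, "lwr"],
  upper = y_pred_95[, "upr"]
)

# the ribbon layer comes first so the points and the line are drawn on top
ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_ribbon(data = gg_pred, aes(ymin = lower, ymax = upper), fill = "lightblue") +
  geom_point(colour = "grey40") +
  geom_line(data = pred_dat, colour = "blue") +
  theme_minimal()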
Write a function called mse() that takes in two vectors: true y values and predicted y values, and which outputs the mean squared error. Start like so:

mse <- function(y_true, y_pred) {
  # your function here
}

Wikipedia may help with the formula.
Test whether your mse() function works correctly by running the following code.

mse(1:10, 10:1)

## [1] 33
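If you get stuck, one way to fill in the body, directly following the definition of the MSE:

mse <- function(y_true, y_pred) {
  # average squared difference between true and predicted values
  mean((y_true - y_pred)^2)
}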
You have now calculated the mean squared length of the dashed lines below.
Calculate the mean squared error of the lm_ses model. Use the medv column as y_true and use the predict() method to generate y_pred.

You have calculated the mean squared length of the dashed lines in the plot below.
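One way to compute this, reusing the mse() function from above:

mse(y_true = Boston$medv, y_pred = predict(lm_ses))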
Now we will use the sample() function to randomly select observations from the Boston dataset to go into a training, validation, and test set. The training set will be used to fit our model, the validation set will be used to calculate the out-of-sample prediction error during model building, and the test set will be used to estimate the true out-of-sample MSE.
The Boston dataset has 506 observations. Use c() and rep() to create a vector with 253 times the word "train", 152 times the word "validation", and 101 times the word "test". Call this vector splits.

Use sample() to randomly order this vector and add it to the Boston dataset using mutate(). Assign the newly created dataset to a variable called boston_master.

Use filter() to create a training, validation, and test set from the boston_master data. Call these datasets boston_train, boston_valid, and boston_test. A sketch of these steps follows below.

We will set aside the boston_test dataset for now.
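These steps might look as follows (a sketch; the random seed is not fixed, so results will differ between runs):

splits <- c(rep("train", 253), rep("validation", 152), rep("test", 101))

boston_master <- Boston %>% mutate(splits = sample(splits))

boston_train <- boston_master %>% filter(splits == "train")
boston_valid <- boston_master %>% filter(splits == "validation")
boston_test  <- boston_master %>% filter(splits == "test")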
Train a linear regression model called model_1 using the training dataset. Use the formula medv ~ lstat like in the first lm() exercise. Use summary() to check that this object is as you expect.

Calculate the MSE with this object. Save this value as model_1_mse_train.

Calculate the MSE on the validation set and save it as model_1_mse_valid. Hint: use the newdata argument in predict(). This is the estimated out-of-sample mean squared error.
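A sketch of these steps, reusing the mse() function from before:

model_1 <- lm(medv ~ lstat, data = boston_train)
summary(model_1)

model_1_mse_train <- mse(boston_train$medv, predict(model_1))
model_1_mse_valid <- mse(boston_valid$medv, predict(model_1, newdata = boston_valid))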
Create a second model model_2 for the training data, which includes age and tax as predictors in addition to lstat. Calculate the train and validation MSE.
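Analogous to model_1, this could look like:

model_2 <- lm(medv ~ lstat + age + tax, data = boston_train)

model_2_mse_train <- mse(boston_train$medv, predict(model_2))
model_2_mse_valid <- mse(boston_valid$medv, predict(model_2, newdata = boston_valid))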
This is an advanced exercise. Some components we have seen before in this and previous practicals, but some things will be completely new. Try to complete it by yourself, but don’t worry if you get stuck. If you don’t know about for loops in R, read up on those before you start the exercise.

Use help in this order:
You may also just read the answer and try to understand what happens in each step.
Write a function that performs k-fold cross-validation for linear models.

Inputs:

- formula: a formula just as in the lm() function
- dataset: a data frame
- k: the number of folds for cross-validation

Outputs:

- the cross-validated mean squared error (MSE) of the model

A possible implementation is sketched below.
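One possible implementation (a sketch; cv_mse is our own name, and we assume the outcome is the single variable on the left-hand side of the formula):

cv_mse <- function(formula, dataset, k) {
  n <- nrow(dataset)

  # randomly assign each row to one of k folds
  fold <- sample(rep(1:k, length.out = n))

  mses <- numeric(k)
  for (i in 1:k) {
    train <- dataset[fold != i, ]
    valid <- dataset[fold == i, ]

    fit  <- lm(formula, data = train)
    pred <- predict(fit, newdata = valid)

    # the outcome column is the left-hand side of the formula
    y_true  <- valid[[as.character(formula[[2]])]]
    mses[i] <- mean((y_true - pred)^2)
  }

  # average the MSE over the k folds
  mean(mses)
}

For example, cv_mse(medv ~ lstat, Boston, 9) would return a 9-fold cross-validated MSE estimate.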
Use your function to perform cross-validation with the formula medv ~ lstat + age + tax. Compare it to a model with the formula medv ~ lstat + I(lstat^2) + age + tax.

When you have finished the practical,
enclose all files of the project 04_Regression.Rproj (i.e. all .R and/or .Rmd files including the one with your answers, and the .Rproj file) in a zip file, and
hand in the zip by PR from your fork here. Do so before Lecture 6. That way we can iron out issues during the next Q&A in Week 5.