In this practical, you will learn how to perform regression analysis, how to plot with confidence and prediction intervals, how to calculate MSE, perform train-test splits, and write a function for cross validation.
Just like in the practical at the end of chapter 3 of the ISLR book, we will use the Boston dataset, which is in the MASS package that comes with R.
library(ISLR)
library(MASS)
library(tidyverse)
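If you want a quick first look at the data before modelling, you can, for example, print the first rows and open the help page (optional, not needed for the exercises):
# Optional: a quick look at the Boston data before modelling
head(Boston)   # first six rows, including medv and lstat
# ?Boston      # descriptions of all variables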
Regression in R
Regression is performed through the lm() function. It requires two arguments: a formula and data. A formula is a specific type of object that can be constructed like so:
some_formula <- outcome ~ predictor_1 + predictor_2
You can read it as “the outcome variable is a function of predictors 1 and 2”. As with other objects, you can check its class and even convert it to other classes, such as a character vector:
class(some_formula)
## [1] "formula"
as.character(some_formula)
## [1] "~" "outcome"
## [3] "predictor_1 + predictor_2"
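Formulas can also contain multiple predictors and transformations; for illustration (these names are placeholders, just like above):
# a formula with a squared term, purely as illustration
another_formula <- outcome ~ predictor_1 + I(predictor_1^2) + predictor_2
class(another_formula)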
You can estimate a linear model using lm() by specifying the outcome variable and the predictors in a formula and by inputting the dataset these variables should be taken from.
Create a linear model object called lm_ses using the formula medv ~ lstat and the Boston dataset.
lm_ses <- lm(formula = medv ~ lstat, data = Boston)
You have now trained a regression model with medv (housing value) as the outcome/dependent variable and lstat (socio-economic status) as the predictor/independent variable.
Remember that a regression estimates \(\beta_0\) (the intercept) and \(\beta_1\) (the slope) in the following equation:
\[\boldsymbol{y} = \beta_0 + \beta_1\cdot \boldsymbol{x}_1 + \boldsymbol{\epsilon}\]
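For reference, for a single predictor lm() computes the least-squares estimates of these coefficients:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\]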
Use coef() to extract the intercept and slope from the lm_ses object. Interpret the slope coefficient.
coef(lm_ses)
## (Intercept) lstat
## 34.5538409 -0.9500494
# for each point increase in lstat, the median housing value drops by 0.95
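As a quick sanity check on this interpretation, you could compute the predicted medv at, say, lstat = 10 directly from the coefficients (a small illustration, not part of the exercise):
# predicted medv at lstat = 10, computed from the extracted coefficients
coef(lm_ses)[1] + coef(lm_ses)[2] * 10
# roughly 34.55 - 0.95 * 10 = 25.05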
Use summary() to get a summary of the lm_ses object. What do you see? You can use the help file ?summary.lm.
summary(lm_ses)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
We now have a model object lm_ses that represents the formula
\[\text{medv}_i = 34.55 - 0.95 * \text{lstat}_i + \epsilon_i\]
With this object, we can predict a new medv value by inputting its lstat value. The predict() method enables us to do this for the lstat values in the original dataset.
Save the predicted y values in a variable called y_pred.
y_pred <- predict(lm_ses)
Create a scatter plot with y_pred mapped to the x position and the true y value (Boston$medv) mapped to the y position. What do you see? What would this plot look like if the fit were perfect?
tibble(pred = y_pred,
       obs  = Boston$medv) %>%
  ggplot(aes(x = pred, y = obs)) +
  geom_point() +
  theme_minimal() +
  geom_abline(slope = 1)
# I've added the ideal line on which all the points would lie if the
# fit were perfect.
We can also generate predictions from new data using the newdata argument in the predict() method. For that, we need to prepare a data frame with new values for the original predictors.
Use the seq() function to generate a sequence of 1000 equally spaced values from 0 to 40. Store this vector in a data frame (using data.frame() or tibble()) with lstat as its column name. Name the data frame pred_dat.
pred_dat <- tibble(lstat = seq(0, 40, length.out = 1000))
Use this data frame as the newdata argument to a predict() call for lm_ses. Store the result in a variable named y_pred_new.
y_pred_new <- predict(lm_ses, newdata = pred_dat)
Plotting lm() in ggplot
A good way of understanding your model is by visualising it. We are going to walk through the construction of a plot with a fit line and prediction/confidence intervals from an lm object.
Create a scatter plot of the Boston dataset with lstat mapped to the x position and medv mapped to the y position. Store the plot in an object called p_scatter.
p_scatter <- Boston %>%
  ggplot(aes(x = lstat, y = medv)) +
  geom_point() +
  theme_minimal()

p_scatter
Now we’re going to add a prediction line to this plot.
Add the vector y_pred_new to the pred_dat data frame with the name medv.
# this can be done in several ways. Here are two possibilities:
# pred_dat$medv <- y_pred_new
pred_dat <- pred_dat %>% mutate(medv = y_pred_new)
Add a geom_line() to p_scatter, with pred_dat as the data argument. What does this line represent?
p_scatter + geom_line(data = pred_dat)
# This line represents the predicted values of medv for the values of lstat in pred_dat.
The interval argument can be used to generate confidence or prediction intervals. Create a new object called y_pred_95 using predict() (again with the pred_dat data) with the interval argument set to "confidence". What is in this object?
y_pred_95 <- predict(lm_ses, newdata = pred_dat, interval = "confidence")
head(y_pred_95)
## fit lwr upr
## 1 34.55384 33.44846 35.65922
## 2 34.51580 33.41307 35.61853
## 3 34.47776 33.37768 35.57784
## 4 34.43972 33.34229 35.53715
## 5 34.40168 33.30690 35.49646
## 6 34.36364 33.27150 35.45578
# it's a matrix with the fit (estimate) and the lower and upper bounds of the confidence interval.
Create a data frame with 4 columns: medv, lstat, lower, and upper.
gg_pred <- tibble(
  lstat = pred_dat$lstat,
  medv  = y_pred_95[, 1],
  lower = y_pred_95[, 2],
  upper = y_pred_95[, 3]
)

gg_pred
Add a geom_ribbon() to the plot with the data frame you just made. The ribbon geom requires three aesthetics: x (lstat, already mapped), ymin (lower), and ymax (upper). Add the ribbon below the geom_line() and the geom_point() of before to make sure those remain visible. Give it a nice colour and clean up the plot, too!
# Create the plot
Boston %>%
  ggplot(aes(x = lstat, y = medv)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), data = gg_pred, fill = "#00008b44") +
  geom_point(colour = "#883321") +
  geom_line(data = pred_dat, colour = "#00008b", size = 1) +
  theme_minimal() +
  labs(x = "Proportion of low SES households",
       y = "Median house value",
       title = "Boston house prices")
# The ribbon represents the 95% confidence interval of the fit line.
# The uncertainty in the estimates of the coefficients is taken into
# account by this ribbon.
# You can think of it as:
# upon repeated sampling of data from the same population, at least 95% of
# the ribbons will contain the true fit line.
Do the same, but now with a prediction interval instead of a confidence interval.
# pred with pred interval
y_pred_95 <- predict(lm_ses, newdata = pred_dat, interval = "prediction")
# create the df
gg_pred <- tibble(
  lstat = pred_dat$lstat,
  medv  = y_pred_95[, 1],
  l95   = y_pred_95[, 2],
  u95   = y_pred_95[, 3]
)
# Create the plot
Boston %>%
  ggplot(aes(x = lstat, y = medv)) +
  geom_ribbon(aes(ymin = l95, ymax = u95), data = gg_pred, fill = "#00008b44") +
  geom_point(colour = "#883321") +
  geom_line(data = pred_dat, colour = "#00008b", size = 1) +
  theme_minimal() +
  labs(x = "Proportion of low SES households",
       y = "Median house value",
       title = "Boston house prices")
# You can look at ISLR p.81-82 for a discussion of prediction intervals
Write a function called mse() that takes in two vectors: true y values and predicted y values, and which outputs the mean squared error. Start like so:
mse <- function(y_true, y_pred) {
  # your function here
}
Wikipedia may help for the formula.
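For reference, for observed values \(y_i\) and predictions \(\hat{y}_i\), \(i = 1, \dots, n\), the mean squared error is

\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]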
# there are many ways of doing this.
mse <- function(y_true, y_pred) {
  mean((y_true - y_pred)^2)
}
Make sure your mse() function works correctly by running the following code.
mse(1:10, 10:1)
## [1] 33
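You can verify this value by hand (a small check, assuming your mse() follows the formula above):
# squared differences between 1:10 and 10:1
(1:10 - 10:1)^2        # 81 49 25  9  1  1  9 25 49 81
sum((1:10 - 10:1)^2)   # 330
330 / 10               # 33, the same value as mse(1:10, 10:1)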
You have now calculated the mean squared length of the dashed lines below.
Calculate the MSE of the lm_ses model. Use the medv column as y_true and use the predict() method to generate y_pred.
mse(Boston$medv, predict(lm_ses))
## [1] 38.48297
You have calculated the mean squared length of the dashed lines in the plot below.
Now we will use the sample() function to randomly select observations from the Boston dataset to go into a training, test, and validation set. The training set will be used to fit our model, the validation set will be used to calculate the out-of-sample prediction error during model building, and the test set will be used to estimate the true out-of-sample MSE.
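Note that sample() is random, so your splits, and therefore the MSE values shown below, will differ from run to run. If you want reproducible results, you can set a seed first (the value 45 is arbitrary):
# Optional: fix the random number generator so the split is reproducible
set.seed(45)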
The Boston dataset has 506 observations. Use c() and rep() to create a vector with 253 times the word "train", 152 times the word "validation", and 101 times the word "test". Call this vector splits.
splits <- c(rep("train", 253), rep("validation", 152), rep("test", 101))
Use sample() to randomly order this vector and add it to the Boston dataset using mutate(). Assign the newly created dataset to a variable called boston_master.
boston_master <- Boston %>% mutate(splits = sample(splits))
Use filter() to create a training, validation, and test set from the boston_master data. Call these datasets boston_train, boston_valid, and boston_test.
boston_train <- boston_master %>% filter(splits == "train")
boston_valid <- boston_master %>% filter(splits == "validation")
boston_test  <- boston_master %>% filter(splits == "test")
We will set aside the boston_test dataset for now.
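A quick optional sanity check is to confirm that the three subsets together cover all 506 rows:
# Optional check: the three sets should add up to the full dataset
table(boston_master$splits)
nrow(boston_train) + nrow(boston_valid) + nrow(boston_test)  # 506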
Train a linear regression model called model_1 using the training dataset. Use the formula medv ~ lstat like in the first lm() exercise. Use summary() to check that this object is as you expect.
model_1 <- lm(medv ~ lstat, data = boston_train)
summary(model_1)
##
## Call:
## lm(formula = medv ~ lstat, data = boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.825 -4.181 -1.591 1.895 22.239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.90748 0.76502 45.63 <2e-16 ***
## lstat -0.96058 0.05251 -18.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.167 on 251 degrees of freedom
## Multiple R-squared: 0.5714, Adjusted R-squared: 0.5697
## F-statistic: 334.7 on 1 and 251 DF, p-value: < 2.2e-16
Calculate the MSE of this model on the training data and call it model_1_mse_train.
model_1_mse_train <- mse(y_true = boston_train$medv, y_pred = predict(model_1))
Now calculate the MSE on the validation set and call it model_1_mse_valid. Hint: use the newdata argument in predict().
model_1_mse_valid <- mse(y_true = boston_valid$medv,
                         y_pred = predict(model_1, newdata = boston_valid))
This is the estimated out-of-sample mean squared error.
Create a second model model_2 for the training data which includes age and tax as predictors. Calculate the train and validation MSE.
model_2 <- lm(medv ~ lstat + age + tax, data = boston_train)
model_2_mse_train <- mse(y_true = boston_train$medv, y_pred = predict(model_2))
model_2_mse_valid <- mse(y_true = boston_valid$medv,
                         y_pred = predict(model_2, newdata = boston_valid))
# If you are interested in out-of-sample prediction, the
# answer may depend on the random sampling of the rows in the
# dataset splitting: everyone has a different split. However, it
# is likely that model_2 has both lower training and validation MSE.
Calculate the test MSE for the model of your choice. What does this number tell you?
model_2_mse_test <- mse(y_true = boston_test$medv,
                        y_pred = predict(model_2, newdata = boston_test))
# The estimate for the expected amount of error when predicting
# the median value of a previously unseen town in Boston when
# using this model is:
sqrt(model_2_mse_test)
## [1] 6.17603
This is an advanced exercise. Some components we have seen before in this and previous practicals, but some things will be completely new. Try to complete it by yourself, but don’t worry if you get stuck. If you don’t know about for loops in R, read up on those before you start the exercise.
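If for loops are new to you, here is a minimal example of the pattern used in the answer below: pre-allocate a vector and fill it one element per iteration (the object names are just for illustration):
# minimal for loop: fill a vector element by element
out <- rep(0, 3)
for (i in 1:3) {
  out[i] <- i^2
}
out  # 1 4 9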
Use help in this order:
You may also just read the answer and try to understand what happens in each step.
Create a function that performs k-fold cross-validation for linear models.

Inputs:
- formula: a formula just as in the lm() function
- dataset: a data frame
- k: the number of folds for cross validation

Outputs:
- the mean squared error, averaged over the k folds
# Just for reference, here is the mse() function once more
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)

cv_lm <- function(formula, dataset, k) {
  # We can do some error checking before starting the function
  stopifnot(is_formula(formula))       # formula must be a formula
  stopifnot(is.data.frame(dataset))    # dataset must be data frame
  stopifnot(is.integer(as.integer(k))) # k must be convertible to int

  # first, add a selection column to the dataset as before
  n_samples  <- nrow(dataset)
  select_vec <- rep(1:k, length.out = n_samples)
  data_split <- dataset %>% mutate(folds = sample(select_vec))

  # initialise an output vector of k mse values, which we
  # will fill by using a _for loop_ going over each fold
  mses <- rep(0, k)

  # start the for loop
  for (i in 1:k) {
    # split the data in train and validation set
    data_train <- data_split %>% filter(folds != i)
    data_valid <- data_split %>% filter(folds == i)

    # calculate the model on this data
    model_i <- lm(formula = formula, data = data_train)

    # Extract the y column name from the formula
    y_column_name <- as.character(formula)[2]

    # calculate the mean square error and assign it to mses
    mses[i] <- mse(y_true = data_valid[[y_column_name]],
                   y_pred = predict(model_i, newdata = data_valid))
  }
  # now we have a vector of k mse values. All we need is to
  # return the mean mse!
  mean(mses)
}
Use your function to perform a 9-fold cross-validation with a linear model with formula medv ~ lstat + age + tax. Compare it to a model with formula medv ~ lstat + I(lstat^2) + age + tax.
cv_lm(formula = medv ~ lstat + age + tax, dataset = Boston, k = 9)
## [1] 38.70845
cv_lm(formula = medv ~ lstat + I(lstat^2) + age + tax, dataset = Boston, k = 9)
## [1] 28.12842
When you have finished the practical, enclose all files of the project 04_Regression.Rproj (i.e. all .R and/or .Rmd files including the one with your answers, and the .Rproj file) in a zip file, and hand in the zip by PR from your fork here. Do so before Lecture 6. That way we can iron out issues during the next Q&A in Week 5.