In this practical, we will learn about three different classification methods: K-nearest neighbours, logistic regression, and linear discriminant analysis.
One of the packages we are going to use is class. For this, you will probably need to run install.packages("class") before running the library() functions.
library(MASS)
library(class)
library(ISLR)
library(tidyverse)
Before starting with the exercises, it is a good idea to set your seed, so that (1) your answers are reproducible and (2) you can compare your answers with the answers provided.
set.seed(45)
The Default dataset contains credit card loan data for 10 000 people. The goal is to classify credit card cases as yes or no based on whether they will default on their loan.
Create a scatterplot of the Default dataset, where balance is mapped to the x position, income is mapped to the y position, and default is mapped to the colour. Can you see any interesting patterns already?
Add facet_grid(cols = vars(student)) to the plot. What do you see?
Transform student into a dummy variable using ifelse() (0 = not a student, 1 = student). Then, randomly split the Default dataset into a training set default_train (80%) and a test set default_test (20%).
Now that we have explored the dataset, we can start on the task of classification. We can imagine a credit card company wanting to predict whether a customer will default on the loan so they can take steps to prevent this from happening.
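The exploration and split described above could be sketched as follows. This is one possible approach, not the reference answer: the helper column split and the object names p and default_df are illustrative assumptions, and the resulting split will generally differ from the one used for the output shown later.

```r
library(ISLR)      # provides the Default dataset
library(tidyverse) # ggplot2 and dplyr

set.seed(45)

# Scatterplot of balance against income, coloured by default status,
# with one panel per student status
p <- ggplot(Default, aes(x = balance, y = income, colour = default)) +
  geom_point(size = 0.8) +
  facet_grid(cols = vars(student)) +
  theme_minimal()
print(p)

# Recode student as a 0/1 dummy, then assign rows at random to an
# 80% training set and a 20% test set.
# The split column is a hypothetical helper, removed again afterwards.
default_df <- Default %>%
  mutate(
    student = ifelse(student == "Yes", 1, 0),
    split   = sample(rep(c("train", "test"), times = c(8000, 2000)))
  )

default_train <- default_df %>%
  filter(split == "train") %>%
  dplyr::select(-split)

default_test <- default_df %>%
  filter(split == "test") %>%
  dplyr::select(-split)
```

Writing dplyr::select() explicitly avoids picking up the select() function from MASS by accident, since both packages are loaded in this practical.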
The first method we will be using is k-nearest neighbours (KNN). It classifies datapoints based on a majority vote of the k points closest to it. In R, the class package contains a knn() function to perform KNN.
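To get a feel for the majority vote before applying it to the Default data, here is a minimal sketch on made-up data. The objects train_x, train_y, test_x and toy_pred are hypothetical and exist only for this illustration.

```r
library(class)

# Hypothetical toy data: two well-separated groups in two dimensions
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0),
                 c(5, 5), c(5, 6), c(6, 5))
train_y <- factor(c("A", "A", "A", "B", "B", "B"))

# Two new points to classify, one near each group
test_x <- rbind(c(0.5, 0.5), c(5.5, 5.5))

# Each new point receives the majority class among its
# k = 3 nearest training points
toy_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
toy_pred
```

Note that knn() takes the training predictors, the test predictors, and the training labels as separate arguments, and returns the predicted classes for the test points directly. Because it relies on Euclidean distance, variables measured on a large scale (such as income) weigh more heavily than others unless the predictors are rescaled.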
Create class predictions for the test set using the knn() function. Use student, balance, and income (but no basis functions of those variables) in the default_train dataset. Set k to 5. Store the predictions in a variable called knn_5_pred.
Create two scatter plots with income and balance as in the first plot you made: one with the true class (default) mapped to the colour aesthetic, and one with the predicted class (knn_5_pred) mapped to the colour aesthetic.
Hint: Add the predicted class knn_5_pred to the default_test dataset before starting your ggplot() call of the second plot. What do you see?
Repeat the same steps, but now with a knn_2_pred vector generated from a 2-nearest neighbours algorithm. Are there any differences?
The confusion matrix is an insightful summary of the plots we have made and the correct and incorrect classifications therein. A confusion matrix can be made in R with the table() function by entering two factors:
table(true = default_test$default, predicted = knn_2_pred)
##      predicted
## true    No  Yes
##   No  1899   31
##   Yes   55   15
We will go more into the assessment of confusion matrices in the next practical.
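As a small preview, the counts in the confusion matrix above can be read as follows: the diagonal holds the correct classifications, the off-diagonal cells the errors. The sketch below simply re-enters the printed counts by hand; conf_mat, n_correct and accuracy are illustrative names.

```r
# Counts copied from the confusion matrix printed above
# (matrix() fills column by column)
conf_mat <- matrix(
  c(1899, 55,   # predicted No:  true No, true Yes
    31,   15),  # predicted Yes: true No, true Yes
  nrow = 2,
  dimnames = list(true = c("No", "Yes"), predicted = c("No", "Yes"))
)

# Correct classifications sit on the diagonal
n_correct <- sum(diag(conf_mat))
accuracy  <- n_correct / sum(conf_mat)
accuracy

# Number of true defaulters in the test set
sum(conf_mat["Yes", ])
```

The overall proportion correct looks high, but the bottom row shows that most of the people who actually defaulted were predicted not to, which is one reason a single number is rarely enough to assess a classifier.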
KNN directly predicts the class of a new observation using a majority vote of the existing observations closest to it. In contrast to this, logistic regression predicts the log-odds of belonging to category 1. These log-odds can then be transformed to probabilities by performing an inverse logit transform:
\[ p = \frac{1}{1+e^{-\alpha}}, \]
where \(\alpha\) indicates the log-odds of being in class 1 and \(p\) is the probability.
Therefore, logistic regression is a probabilistic classifier, as opposed to a direct classifier such as KNN: it classifies indirectly, by outputting a probability which can then be used in conjunction with a cutoff (usually 0.5) to classify new observations.
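The inverse logit transform and the cutoff step can be written out in a few lines of R. The function name inv_logit and the example log-odds are assumptions made for illustration; plogis() is the equivalent function that ships with base R.

```r
# Inverse logit: maps log-odds (alpha) to a probability p
inv_logit <- function(alpha) 1 / (1 + exp(-alpha))

# Hypothetical log-odds for three observations
log_odds <- c(-2, 0, 2)
p <- inv_logit(log_odds)
p

# Base R's plogis() computes the same transform
all.equal(p, plogis(log_odds))

# Applying a cutoff of 0.5 turns the probabilities into class labels
ifelse(p > 0.5, "Yes", "No")
```

Log-odds of 0 correspond to a probability of exactly 0.5, so a cutoff of 0.5 on the probability scale is the same as a cutoff of 0 on the log-odds scale.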
Logistic regression in R happens with the glm() function, which stands for generalized linear model. Here we have to indicate that the outcome is modelled not with a Gaussian (normal) distribution, but with a binomial distribution.
Use glm() with argument family = binomial to fit a logistic regression model lr_mod to the default_train data.
Now that we have generated a model, we can use the predict() method to output the estimated probabilities for each point in the training dataset. By default, predict() outputs the log-odds, but we can transform these back using the inverse logit function from before or by setting the argument type = "response" within the predict() call.
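A minimal sketch of fitting the model and obtaining both kinds of prediction is given below. To keep it self-contained it rebuilds a training set with base R; that split is an assumption and will not necessarily match your own default_train. The names default_df, train_rows, log_odds and probs are illustrative.

```r
library(ISLR)  # provides the Default dataset

set.seed(45)

# Assumed preparation: 0/1 student dummy and a random 80% training set
default_df    <- transform(Default, student = ifelse(student == "Yes", 1, 0))
train_rows    <- sample(nrow(default_df), size = 0.8 * nrow(default_df))
default_train <- default_df[train_rows, ]

# Logistic regression: binomial family, all three predictors
lr_mod <- glm(default ~ student + balance + income,
              family = binomial, data = default_train)

# Default prediction type is "link", i.e. the log-odds
log_odds <- predict(lr_mod)

# type = "response" returns probabilities instead
probs <- predict(lr_mod, type = "response")

# Both routes agree: inverse logit of the log-odds equals the probabilities
all.equal(plogis(log_odds), probs)
```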
Visualise the predicted probabilities from lr_mod. You can choose for yourself which type of visualisation you would like to make. Write down your interpretations along with your plot.
Another advantage of logistic regression is that we get coefficients we can interpret.
Look at the coefficients of the lr_mod model and interpret the coefficient for balance. What would the probability of default be for a person who is not a student, has an income of 40000, and a balance of 3000 dollars at the end of each month? Is this what you expect based on the plots we’ve made before?
In two steps, we will visualise the effect balance has on the predicted default probability.
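One way to approach the coefficient question and the two visualisation steps is sketched here. It refits the model on an assumed base R split so that it runs on its own; the exact numbers will depend on the split you actually use. The names person, person_prob and the pred column are hypothetical.

```r
library(ISLR)
library(ggplot2)

set.seed(45)

# Assumed preparation, as in the practical: student dummy and 80% training set
default_df    <- transform(Default, student = ifelse(student == "Yes", 1, 0))
train_rows    <- sample(nrow(default_df), size = 0.8 * nrow(default_df))
default_train <- default_df[train_rows, ]
lr_mod <- glm(default ~ student + balance + income,
              family = binomial, data = default_train)

# Coefficient for balance: change in log-odds per extra dollar of balance;
# exponentiating gives the multiplicative change in the odds
coef(lr_mod)["balance"]
exp(coef(lr_mod)["balance"])

# Predicted probability for one hypothetical person
person <- data.frame(student = 0, balance = 3000, income = 40000)
person_prob <- predict(lr_mod, newdata = person, type = "response")
person_prob

# Step 1: 500 rows in which only balance varies
balance_df <- data.frame(
  student = rep(0, 500),
  balance = seq(0, 3000, length.out = 500),
  income  = rep(mean(default_train$income), 500)
)

# Step 2: predicted probabilities for each balance value, then plot
balance_df$pred <- predict(lr_mod, newdata = balance_df, type = "response")

ggplot(balance_df, aes(x = balance, y = pred)) +
  geom_line() +
  labs(x = "Balance", y = "Predicted probability of default") +
  theme_minimal()
```

Holding student and income fixed means the curve shows the effect of balance alone, which is why the other two columns are constant.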
Create a data frame called balance_df with 3 columns and 500 rows: student always 0, balance ranging from 0 to 3000, and income always the mean income in the default_train dataset.
Use this dataset as the newdata in a predict() call using lr_mod to output the predicted probabilities for different values of balance. Then create a plot with the balance_df$balance variable mapped to x and the predicted probabilities mapped to y. Is this in line with what you expect?
The last method we will use is LDA, using the lda() function from the MASS package.
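As a rough sketch of how lda() is used, the block below fits the model on an assumed training split and tabulates its test-set predictions. The split is rebuilt with base R so the example runs on its own and will not necessarily match yours; lda_pred is an illustrative name.

```r
library(MASS)  # provides lda()
library(ISLR)  # provides the Default dataset

set.seed(45)

# Assumed preparation: student dummy and a random 80/20 split
default_df    <- transform(Default, student = ifelse(student == "Yes", 1, 0))
train_rows    <- sample(nrow(default_df), size = 0.8 * nrow(default_df))
default_train <- default_df[train_rows, ]
default_test  <- default_df[-train_rows, ]

# Fit linear discriminant analysis with all three predictors
lda_mod <- lda(default ~ student + balance + income, data = default_train)

# Printing the object shows prior probabilities, group means,
# and the coefficients of the linear discriminant
lda_mod

# predict() returns a list; the $class element holds the predicted classes
lda_pred <- predict(lda_mod, newdata = default_test)$class
table(true = default_test$default, predicted = lda_pred)
```

The group means in the printed output are a convenient place to start when comparing people who default with people who do not.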
Train an LDA classifier lda_mod on the training set.
Inspect the lda_mod object. What can you conclude about the characteristics of the people who default on their loans?
[...] data/ folder. Would the passenger have survived if they were a girl in 2nd class?
When you have finished the practical, enclose all files of the project 06_classification.Rproj (i.e. all .R and/or .Rmd files including the one with your answers, and the .Rproj file) in a zip file, and hand in the zip by PR from your fork here. Do so before Lecture 8. That way we can iron out issues during the next Q&A in Week 7.