
Capita Selecta

Gerko Vink

Fundamental Techniques in Data Science with R

Packages and functions used

library(magrittr) # pipes
library(dplyr)    # data manipulation
library(ggplot2)  # plotting
library(mice)     # for boys data
library(GGally)   # correlation and distribution matrixplot
library(mctest)   # multicollinearity test
library(DAAG)     # now for vif() function
titanic <- read.csv(file = "titanic.csv", header = TRUE) %>%
  mutate(Pclass = factor(Pclass, 
                         levels = c(3, 2, 1), 
                         ordered = FALSE))

Recap

Linear modeling

Assumptions

Linear regression

  • requires a linear relation between the dependent and independent variables.

  • requires the error terms (residuals) to be normally distributed.

  • requires homoscedasticity

    • i.e. the standard deviations of the residuals are constant and do not depend on the independent variables
  • requires the dependent variable to be measured on an interval or ratio scale

  • assumes that there is little or no multicollinearity in the data.
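
Most of these assumptions can be inspected with standard residual diagnostics. A minimal sketch, using the anscombe data that is modeled below (any fitted lm object will do):

fit <- anscombe %$% lm(y1 ~ x1)
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2-by-2 grid
plot(fit)              # residuals vs fitted, normal Q-Q, scale-location, leverage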

Modeling

A simple model:

anscombe %$% lm(y1 ~ x1) %>% summary
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Modeling

A more complex model:

anscombe %$% lm(y1 ~ x1 + x2) %>% summary
## 
## Call:
## lm(formula = y1 ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## x2                NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Modeling

The system on the previous slide is highly multicollinear.

anscombe %$% identical(x1, x2)
## [1] TRUE

x1 and x2 are identical and there is no unique way to distribute the regression parameters over x1 and x2.

As a result, the model parameters cannot be uniquely estimated, so R’s lm() function will give you the lower-rank approximation with only one parameter. The resulting model outcome is still valid!
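
One way to see which term lm() has dropped is the alias() function from base R; a minimal sketch on the model above:

anscombe %$% lm(y1 ~ x1 + x2) %>% alias
# x2 is listed as completely aliased: it is a perfect linear combination of x1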

Modeling

A more complex model:

anscombe2 <- anscombe %>%
  mutate(y5 = factor(ifelse(y3 > 7.5, "yes", "no")))
fit <- anscombe2 %$% lm(x1 ~ y1 + y2 * y5)

Parameters

fit %>% summary
## 
## Call:
## lm(formula = x1 ~ y1 + y2 * y5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4769 -0.3182 -0.1109  0.2981  0.7319 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.42965    0.76580   0.561 0.595084    
## y1           0.06465    0.14272   0.453 0.666479    
## y2           0.91014    0.12688   7.173 0.000371 ***
## y5yes       32.21191    5.30427   6.073 0.000905 ***
## y2:y5yes    -3.26437    0.59856  -5.454 0.001582 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.519 on 6 degrees of freedom
## Multiple R-squared:  0.9853, Adjusted R-squared:  0.9755 
## F-statistic: 100.6 on 4 and 6 DF,  p-value: 1.254e-05

Prediction

coef(fit)
## (Intercept)          y1          y2       y5yes    y2:y5yes 
##  0.42964653  0.06465026  0.91014154 32.21191128 -3.26437210
new <- data.frame(y1 = c(5, 5), y2 = c(8, 8), y5 = as.factor(c("yes", "no")))
predict(fit, newdata = new)
##        1        2 
## 14.13096  8.03403
0.42964653 + 0.06465026 * 5 + 0.91014154 * 8 + 32.21191128  + -3.26437210 * 8 # y5 = "yes"
## [1] 14.13096
0.42964653 + 0.06465026 * 5 + 0.91014154 * 8 #y5 = "no"
## [1] 8.03403

Logistic modeling

Assumptions

Logistic regression

  • does not require a linear relation between the dependent and independent variables.

  • does not require the error terms (residuals) to be normally distributed.

  • does not require homoscedasticity:

    • So the standard deviations of the residuals may depend on the independent variables!
  • does not require the dependent variable to be measured on an interval or ratio scale

Assumptions continued

However, logistic regression

  • requires the outcome to be binary

  • requires the observations to be independent of each other

  • requires there to be little or no multicollinearity among the independent variables.

    • i.e. the independent variables should not be too highly correlated with each other.
  • assumes linearity of independent variables and log odds.

  • typically requires a large sample size:

    • a rule of thumb: at least 10 cases with the least frequent outcome for each independent variable:
    • If P(least frequent) = .2, then for 4 predictors you need (10 * 4) / .2 = 200 cases (see the sketch below)
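
A minimal sketch of this rule-of-thumb calculation, using the hypothetical numbers from the example above:

n_predictors <- 4    # number of independent variables in the model
p_least      <- .2   # probability of the least frequent outcome
(10 * n_predictors) / p_least   # required sample size: 200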

Modeling

A simple logistic model:

titanic %$% glm(Survived ~ Age)
## 
## Call:  glm(formula = Survived ~ Age)
## 
## Coefficients:
## (Intercept)          Age  
##    0.446210    -0.002058  
## 
## Degrees of Freedom: 886 Total (i.e. Null);  885 Residual
## Null Deviance:       210.1 
## Residual Deviance: 209.4     AIC: 1243

Modeling 2

A more complex model:

titanic %$% glm(Survived ~ Age * Pclass)
## 
## Call:  glm(formula = Survived ~ Age * Pclass)
## 
## Coefficients:
## (Intercept)          Age      Pclass2      Pclass1  Age:Pclass2  Age:Pclass1  
##    0.403619    -0.006323     0.346715     0.570724    -0.002968    -0.002564  
## 
## Degrees of Freedom: 886 Total (i.e. Null);  881 Residual
## Null Deviance:       210.1 
## Residual Deviance: 176.9     AIC: 1101

Parameters

fit <- titanic %$% glm(Survived ~ Age, 
                       family = binomial(link="logit"))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0864  -1.0017  -0.9439   1.3562   1.5806  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.209189   0.159494  -1.312   0.1897  
## Age         -0.008774   0.004947  -1.774   0.0761 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1182.8  on 886  degrees of freedom
## Residual deviance: 1179.6  on 885  degrees of freedom
## AIC: 1183.6
## 
## Number of Fisher Scoring iterations: 4

Prediction

new <- data.frame(Age = 40)
predict(fit, newdata = new, type = "response")
##         1 
## 0.3635098
coef(fit)
##  (Intercept)          Age 
## -0.209188757 -0.008774355
logodds <- -0.209188757 + (40 * -0.008774355)
odds <- exp(logodds)
prob <- odds / (1 + odds)
data.frame(logodds = logodds, odds = odds, prob = prob, plogis = plogis(logodds))
##     logodds     odds      prob    plogis
## 1 -0.560163 0.571116 0.3635098 0.3635098

Example 2

fit <- titanic %$% glm(Survived ~ Age * Pclass, 
                       family = binomial(link="logit"))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age * Pclass, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0888  -0.8368  -0.6521   1.0292   2.3197  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.222282   0.246559  -0.902 0.367303    
## Age         -0.038062   0.009838  -3.869 0.000109 ***
## Pclass2      1.306741   0.458158   2.852 0.004342 ** 
## Pclass1      2.365100   0.528674   4.474 7.69e-06 ***
## Age:Pclass2 -0.002199   0.015564  -0.141 0.887623    
## Age:Pclass1 -0.002480   0.014709  -0.169 0.866133    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1182.8  on 886  degrees of freedom
## Residual deviance: 1036.9  on 881  degrees of freedom
## AIC: 1048.9
## 
## Number of Fisher Scoring iterations: 4

Confidence intervals

confint(fit)
## Waiting for profiling to be done...
##                   2.5 %      97.5 %
## (Intercept) -0.70582264  0.26254796
## Age         -0.05784190 -0.01923168
## Pclass2      0.42354482  2.22513745
## Pclass1      1.35339096  3.43128459
## Age:Pclass2 -0.03318941  0.02801718
## Age:Pclass1 -0.03152570  0.02625743

Predict a 40 year old in 2nd class

coef(fit)[c(1:3, 5)]
##  (Intercept)          Age      Pclass2  Age:Pclass2 
## -0.222282007 -0.038061512  1.306741345 -0.002199355
new <- data.frame(Age = 40, 
                  Pclass = as.factor(2))
predict(fit, newdata = new, type = "response")
##         1 
## 0.3714561
logodds <- -0.222282007 + (40 * -0.038061512) + 1.306741345 + (40 * -0.002199355)
odds <- exp(logodds)
prob <- odds / (1 + odds)
data.frame(logodds = logodds, odds = odds, prob = prob, plogis = plogis(logodds))
##      logodds      odds      prob    plogis
## 1 -0.5259753 0.5909787 0.3714561 0.3714561

Predict 23/65yr in 3rd

coef(fit)[1:2]
## (Intercept)         Age 
## -0.22228201 -0.03806151
new <- data.frame(Age = c(23, 65), 
                  Pclass = as.factor(3))
predict(fit, newdata = new, type = "response")
##         1         2 
## 0.2501717 0.0631932
logodds <- -0.22228201 + (c(23, 65) * -0.03806151)
odds <- exp(logodds)
prob <- odds / (1 + odds)
data.frame(logodds = logodds, odds = odds, prob = prob, plogis = plogis(logodds))
##     logodds       odds       prob     plogis
## 1 -1.097697 0.33363866 0.25017170 0.25017170
## 2 -2.696280 0.06745597 0.06319321 0.06319321

Plot logodds

Plot probabilities

[Figure: Probability of Survival by Age, one curve per Pclass]
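
A minimal sketch of how such a probability plot could be generated, assuming the Survived ~ Age * Pclass logistic model that is fitted on the next slides (the object name fit_int is hypothetical):

fit_int <- titanic %$% glm(Survived ~ Age * Pclass, 
                           family = binomial(link = "logit"))
grid <- expand.grid(Age = 0:80, Pclass = factor(c(3, 2, 1), levels = c(3, 2, 1)))
grid$prob <- predict(fit_int, newdata = grid, type = "response")   # predicted probabilities
ggplot(grid, aes(x = Age, y = prob, colour = Pclass)) +
  geom_line() +
  labs(y = "Probability of Survival")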

Multicollinearity

Visualizing

ggpairs(anscombe, progress = FALSE)

How to inspect

A variance inflation factor (VIF) detects multicollinearity in regression analyses.

Multicollinearity happens when two or more predictors explain the same variance in the model.

Due to multicollinearity it is challenging to pinpoint a predictor’s contribution to explaining the variance in the outcome.

A VIF estimates how much the variance of a regression estimate is inflated because of multicollinearity in your model.

VIF example

fit <- anscombe %$% lm(y1 ~ x1 + x2)
vif(fit)
## x1 
##  1

VIF calculation

In a regression modeling effort where $\hat{y} = \beta X$, the variance inflation factor can be calculated as $\text{VIF}_i = \frac{1}{1 - R^2_i}$,

where $R^2_i$ is the coefficient of determination (proportion of explained variance) for a regression of predictor $X_i$ on all other predictors in $X$.

A rule of thumb is that if $\text{VIF}_i > 10$, then multicollinearity is high.

In R

fit <- boys %$% lm(age ~ wgt + hgt)
vif(fit)
##    wgt    hgt 
## 9.0125 9.0125
r.sq <- boys %$% lm(wgt ~ hgt) %>% summary %>% .$r.squared
r.sq
## [1] 0.8890426
1 / (1 - r.sq)
## [1] 9.012469

So we can see that the rule of thumb $\text{VIF}_i > 10$ corresponds to an $R^2_i$ of about $.9$.
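
That correspondence is easy to verify:

1 / (1 - 0.9)   # a VIF of exactly 10 when the R squared equals .9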

More sophisticated

The omcdiag() function from the mctest package calculates a suite of overall statistics.

fit <- boys %$% lm(age ~ bmi + wgt + hgt) 
fit %>% omcdiag
## 
## Call:
## omcdiag(mod = .)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                        MC Results detection
## Determinant |X'X|:         0.0172         0
## Farrar Chi-Square:      2941.2661         1
## Red Indicator:             0.7927         1
## Sum of Lambda Inverse:    64.7307         1
## Theil's Method:            0.8489         1
## Condition Number:         54.9661         1
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test

Individual diagnostics

The imcdiag() function calculates a suite of individual statistics.

fit %>% imcdiag
## 
## Call:
## imcdiag(mod = .)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##         VIF    TOL        Wi       Fi Leamer    CVIF Klein  IND1   IND2
## bmi  6.4429 0.1552  1970.329  3946.10 0.3940 -0.2016     0 4e-04 0.9148
## wgt 37.1650 0.0269 13091.715 26219.60 0.1640 -1.1631     1 1e-04 1.0537
## hgt 21.1228 0.0473  7284.462 14589.05 0.2176 -0.6610     0 1e-04 1.0316
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## * all coefficients have significant t-ratios
## 
## R-square of y on all x: 0.9608 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

Which regressors with VIF

fit %>% imcdiag(method = "VIF")
## 
## Call:
## imcdiag(mod = ., method = "VIF")
## 
## 
##  VIF Multicollinearity Diagnostics
## 
##         VIF detection
## bmi  6.4429         0
## wgt 37.1650         1
## hgt 21.1228         1
## 
## Multicollinearity may be due to wgt hgt regressors
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## ===================================

Example without VIF based MC

boys %$% lm(age ~ wgt + hgt) %>% omcdiag()
## 
## Call:
## omcdiag(mod = .)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                        MC Results detection
## Determinant |X'X|:         0.1110         0
## Farrar Chi-Square:      1592.8922         1
## Red Indicator:             0.9429         1
## Sum of Lambda Inverse:    18.0249         1
## Theil's Method:            0.8189         1
## Condition Number:         18.9067         0
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test

Example without VIF based MC

boys %$% lm(age ~ wgt + hgt) %>% imcdiag(method = "VIF")
## 
## Call:
## imcdiag(mod = ., method = "VIF")
## 
## 
##  VIF Multicollinearity Diagnostics
## 
##        VIF detection
## wgt 9.0125         0
## hgt 9.0125         0
## 
## NOTE:  VIF Method Failed to detect multicollinearity
## 
## 
## 0 --> COLLINEARITY is not detected by the test
## 
## ===================================

The actual correlation

boys %>% 
  select(wgt, hgt, age) %>%
  cor(use = "pairwise.complete.obs")
##           wgt       hgt       age
## wgt 1.0000000 0.9428906 0.9505762
## hgt 0.9428906 1.0000000 0.9752568
## age 0.9505762 0.9752568 1.0000000

Model fit

Model fit summary

See the previous lectures for detailed inspections of model fit

In short:

A model with the same fit but fewer parameters is the better model.

Techniques:

  • Use the anova() function to compare nested models
  • Use the AIC() function to compare any models fitted on the same outcome:

    • The AIC is based on the log-likelihood of the model given the data, penalized for the number of estimated parameters
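
A minimal sketch of that relation, for an arbitrary fitted model object (here hypothetically called fit):

ll <- logLik(fit)                           # log-likelihood of the model given the data
-2 * as.numeric(ll) + 2 * attr(ll, "df")    # AIC = -2 * log-likelihood + 2 * number of parameters
AIC(fit)                                    # gives the same value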

Example

fit1 <- boys %>% select(age, hgt, wgt) %>% na.omit() %$% lm(age ~ hgt)
fit2 <- boys %>% select(age, hgt, wgt) %>% na.omit() %$% lm(age ~ hgt + wgt)
anova(fit1, fit2)
## Analysis of Variance Table
## 
## Model 1: age ~ hgt
## Model 2: age ~ hgt + wgt
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    725 1679.0                                  
## 2    724 1400.6  1    278.46 143.94 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(fit1, fit2)
##      df      AIC
## fit1  3 2677.679
## fit2  4 2547.849

Why the fuss?

fit1 %>% summary() %>% .$coefficients
##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) -9.7562988 0.170397472 -57.25612 3.252757e-271
## hgt          0.1443346 0.001215674 118.72803  0.000000e+00
fit2 %>% summary() %>% .$coefficients
##                Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) -7.45952488 0.246781862 -30.22720 1.871803e-130
## hgt          0.10660192 0.003335514  31.95967 1.759561e-140
## wgt          0.07137607 0.005949200  11.99759  2.242865e-30

In this scenario, the overall interpretation of the parameter hgt did not change. However, multicollinearity may:

  • make choosing an efficient set of predictors challenging
  • make it harder to determine the precise effect of predictors

The overall fit of the model and the predictions are not affected by multicollinearity