Statistical Programming with R

Packages and functions used

library(magrittr) # pipes
library(DAAG) # data sets and functions
  • glm() Generalized linear models
  • predict() Obtain predictions based on a model
  • confint() Obtain confidence intervals for model parameters
  • coef() Obtain a model’s coefficients
  • DAAG::CVbinary() Cross-validation for regression with a binary response

Generalized linear models

GLMs

So far we have worked with \[y=\alpha+\beta x+\epsilon\]

Now we consider \[\mathbb{E}[y] = \alpha + \beta x\]

These describe the same model, but with different notation, and the second sits in a more general framework.

The upside is that we can now apply a function to the expectation \(\mathbb{E}[y]\): we choose a function \(f\) such that \(f(\mathbb{E}[y])\) has a linear relation with \(x\).

We call the function \(f(\cdot)\) the link function. It transforms the scale of the response/outcome to the scale of the linear predictor.

Examples: \(f(x) = x\) (identity), \(f(x) = 1/x\) (inverse), \(f(x) = \log(x)\) (log), and \(f(x) = \log(x/(1-x))\) (logit).

Link functions
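
In R, each of these links is available (together with its inverse) through a family object. A small sketch:

fam <- binomial(link = "logit") # the family object bundles link and inverse link
fam$linkfun(0.25)               # logit(0.25) = log(0.25/0.75)
fam$linkinv(fam$linkfun(0.25))  # back to 0.25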

GLMs continued

Remember that \[y=\alpha+\beta x+\epsilon,\]

and that \[\mathbb{E}[y] = \alpha + \beta x.\]

As a result \[y = \mathbb{E}[y] + \epsilon,\]

and the residuals do not need to be normal (heck, \(y\) probably isn't, so why should \(\epsilon\) be?).

Logit

\[\text{logit}(p) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p)\]

Logit continued

Logit models work on the \(\log(\text{odds})\) scale

\[\log(\text{odds}) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p) = \text{logit}(p)\]

The logit of the probability is the log of the odds.

Logistic regression allows us to model the \(\log(\text{odds})\) as a linear function of the predictors.

\(\log(\text{odds})\) explained

logodds <- rep(NA, 1001)              # log odds for p = 0, 0.001, ..., 1
for(i in 0:1000){
  p <- i / 1000                       # probability
  logodds[i + 1] <- log(p / (1 - p))  # logit(p)
}
head(logodds)
## [1]      -Inf -6.906755 -6.212606 -5.806138 -5.517453 -5.293305
tail(logodds)
## [1] 5.293305 5.517453 5.806138 6.212606 6.906755      Inf
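
The same vector can be produced without a loop; base R's qlogis() is the logit function:

p <- (0:1000) / 1000
logodds2 <- qlogis(p) # the quantile function of the logistic distribution is the logit
head(logodds2)        # matches the loop's result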

\(\log(\text{odds})\) explained

plot(0:1000, logodds, xlab = "x", ylab = "log(odds)", type = "l")

logit\(^{-1}\) explained

invlogit <- exp(logodds) / (1 + exp(logodds)) # inverse logit: back to probabilities
invlogit %>% head()
## [1] 0.000 0.001 0.002 0.003 0.004 0.005
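
Equivalently, base R's plogis() is the inverse logit (the logistic distribution function):

plogis(logodds) %>% head() # computes exp(x) / (1 + exp(x)) in a numerically stable way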

logit\(^{-1}\) explained

plot(0:1000, invlogit, xlab = "x", ylab = "invlogit", type = "l")

Logistic regression

Logistic regression

With linear regression we had the Sum of Squares (SS). Its logistic counterpart is the Deviance (D).

  • The deviance measures how far the observed values are from the values expected under the model.

With logistic regression we aim to maximize the likelihood, which is equivalent to minimizing the deviance.

The likelihood is the (joint) probability of the observed values, given the current model parameters.

For the normal linear model, the deviance equals the residual sum of squares: \(\text{SS}=\text{D}\).
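
For reference: the deviance compares the fitted model's log-likelihood with that of the saturated model (the model that reproduces the data perfectly),

\[D = 2\left(\log \mathcal{L}_{\text{saturated}} - \log \mathcal{L}_{\text{model}}\right).\]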

Logistic regression

anesthetic %>% head(n = 18)
##    move conc    logconc nomove
## 1     0  1.0  0.0000000      1
## 2     1  1.2  0.1823216      0
## 3     0  1.4  0.3364722      1
## 4     1  1.4  0.3364722      0
## 5     1  1.2  0.1823216      0
## 6     0  2.5  0.9162907      1
## 7     0  1.6  0.4700036      1
## 8     1  0.8 -0.2231436      0
## 9     0  1.6  0.4700036      1
## 10    1  1.4  0.3364722      0
## 11    1  0.8 -0.2231436      0
## 12    0  1.6  0.4700036      1
## 13    0  2.5  0.9162907      1
## 14    0  1.4  0.3364722      1
## 15    0  1.6  0.4700036      1
## 16    0  1.4  0.3364722      1
## 17    0  1.4  0.3364722      1
## 18    1  0.8 -0.2231436      0

Logistic regression

anesthetic %$% glm(nomove ~ conc, family = binomial(link="logit")) %>% summary()
## 
## Call:
## glm(formula = nomove ~ conc, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.76666  -0.74407   0.03413   0.68666   2.06900  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -6.469      2.418  -2.675  0.00748 **
## conc           5.567      2.044   2.724  0.00645 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.455  on 29  degrees of freedom
## Residual deviance: 27.754  on 28  degrees of freedom
## AIC: 31.754
## 
## Number of Fisher Scoring iterations: 5
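
As an illustration, the fitted probability of no movement at a (hypothetical) concentration of 1.5 follows from the estimates above via the inverse logit:

plogis(-6.469 + 5.567 * 1.5) # conc = 1.5 is an arbitrary example value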

A different approach

anestot <- aggregate(anesthetic[, c("move","nomove")],  
                     by = list(conc = anesthetic$conc), FUN = sum) 
anestot
##   conc move nomove
## 1  0.8    6      1
## 2  1.0    4      1
## 3  1.2    2      4
## 4  1.4    2      4
## 5  1.6    0      4
## 6  2.5    0      2

A different approach

anestot$total <- apply(anestot[, c("move","nomove")], 1 , sum) 
anestot$prop <- anestot$nomove / anestot$total 
anestot
##   conc move nomove total      prop
## 1  0.8    6      1     7 0.1428571
## 2  1.0    4      1     5 0.2000000
## 3  1.2    2      4     6 0.6666667
## 4  1.4    2      4     6 0.6666667
## 5  1.6    0      4     4 1.0000000
## 6  2.5    0      2     2 1.0000000

A different approach

anestot %$% glm(prop ~ conc, family = binomial(link="logit"), weights = total) %>% 
  summary()
## 
## Call:
## glm(formula = prop ~ conc, family = binomial(link = "logit"), 
##     weights = total)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6  
##  0.20147  -0.45367   0.56890  -0.70000   0.81838   0.04826  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -6.469      2.419  -2.675  0.00748 **
## conc           5.567      2.044   2.724  0.00645 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15.4334  on 5  degrees of freedom
## Residual deviance:  1.7321  on 4  degrees of freedom
## AIC: 13.811
## 
## Number of Fisher Scoring iterations: 5
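
Note that the estimates and standard errors are essentially identical to those from the ungrouped fit; only the deviances change, because grouping the data changes the saturated model.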

Logistic multiple regression

Always try to make the relation between the predictors and the outcome as linear as possible

  • after all, you are fitting a model that is linear on the scale of the linear predictor.

Do not forget that you can use transformations to “make” things more linear, as sketched below.
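
As a sketch of what such a transformation buys you, compare the raw and logged distance variable from the frogs data used below:

par(mfrow = c(1, 2)) # distance is heavily right-skewed; its log is far more symmetric
hist(frogs$distance,      main = "distance")
hist(log(frogs$distance), main = "log(distance)")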

Logistic multiple regression
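
The code for this model is not shown on the slide; reconstructed from the Call: in the output below, it was presumably something like:

glm(pres.abs ~ log(distance) + log(NoOfPools) + NoOfSites + avrain +
    I(meanmax + meanmin) + I(meanmax - meanmin),
    family = binomial, data = frogs) %>% summary()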

## 
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + NoOfSites + 
##     avrain + I(meanmax + meanmin) + I(meanmax - meanmin), family = binomial, 
##     data = frogs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9763  -0.7189  -0.2786   0.7970   2.5745  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          18.2688999 16.1381912   1.132  0.25762    
## log(distance)        -0.7583198  0.2558117  -2.964  0.00303 ** 
## log(NoOfPools)        0.5708953  0.2153335   2.651  0.00802 ** 
## NoOfSites            -0.0036201  0.1061469  -0.034  0.97279    
## avrain                0.0007003  0.0411710   0.017  0.98643    
## I(meanmax + meanmin)  1.4958055  0.3153152   4.744  2.1e-06 ***
## I(meanmax - meanmin) -3.8582668  1.2783921  -3.018  0.00254 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 197.65  on 205  degrees of freedom
## AIC: 211.65
## 
## Number of Fisher Scoring iterations: 5

Logistic multiple regression
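
Again reconstructing from the Call: below, the reduced model was presumably fitted as:

frogs %$% glm(pres.abs ~ log(distance) + log(NoOfPools) +
              I(meanmax + meanmin) + I(meanmax - meanmin),
              family = binomial) %>% summary()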

## 
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + I(meanmax + 
##     meanmin) + I(meanmax - meanmin), family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9753  -0.7224  -0.2780   0.7974   2.5736  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           18.5268     5.2673   3.517 0.000436 ***
## log(distance)         -0.7547     0.2261  -3.338 0.000844 ***
## log(NoOfPools)         0.5707     0.2152   2.652 0.007999 ** 
## I(meanmax + meanmin)   1.4985     0.3088   4.853 1.22e-06 ***
## I(meanmax - meanmin)  -3.8806     0.9002  -4.311 1.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 197.66  on 207  degrees of freedom
## AIC: 207.66
## 
## Number of Fisher Scoring iterations: 5

Fitted values

frogs.glm <- frogs %$% glm(pres.abs ~ log(distance) + log(NoOfPools) 
                           + I(meanmax + meanmin) + I(meanmax - meanmin),  
                           family = binomial)
frogs.glm %>% fitted() %>% head()
##         1         2         3         4         5         6 
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
frogs.glm %>% predict(type = "response") %>% head()
##         1         2         3         4         5         6 
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
frogs.glm %>% predict(type = "link") %>% head() # Scale of linear predictor 
##         1         2         3         4         5         6 
## 2.7815212 2.5256832 2.2303441 1.4628085 2.6085055 0.9835086
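
Note that the response-scale values are simply the inverse logit of the link-scale ones; a quick check using base R's plogis():

frogs.glm %>% predict(type = "link") %>% plogis() %>% head() # equals fitted()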

Fitted values with approximate SEs

pred <- frogs.glm %>% 
  predict(type = "link", se.fit = TRUE)
data.frame(fit = pred$fit, se = pred$se.fit) %>% head()
##         fit        se
## 1 2.7815212 0.6859612
## 2 2.5256832 0.4851882
## 3 2.2303441 0.4381720
## 4 1.4628085 0.4805175
## 5 2.6085055 0.5290599
## 6 0.9835086 0.3900549
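
The link-scale SEs can be turned into an approximate 95% interval for the fitted probabilities; a minimal sketch, back-transforming the bounds with plogis():

ci <- data.frame(lower = plogis(pred$fit - 1.96 * pred$se.fit),
                 upper = plogis(pred$fit + 1.96 * pred$se.fit))
head(ci)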

Confidence intervals for the \(\beta\)

frogs.glm %>% confint()
## Waiting for profiling to be done...
##                           2.5 %     97.5 %
## (Intercept)           8.5807639 29.3392400
## log(distance)        -1.2157643 -0.3247928
## log(NoOfPools)        0.1628001  1.0106909
## I(meanmax + meanmin)  0.9219804  2.1385678
## I(meanmax - meanmin) -5.7299615 -2.1868793
frogs.glm %>% coef()
##          (Intercept)        log(distance)       log(NoOfPools) 
##           18.5268418           -0.7546587            0.5707174 
## I(meanmax + meanmin) I(meanmax - meanmin) 
##            1.4984905           -3.8805856
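
Because the model is linear on the \(\log(\text{odds})\) scale, exponentiating the coefficients (and their confidence bounds) yields odds ratios:

frogs.glm %>% coef() %>% exp()    # multiplicative effect on the odds
frogs.glm %>% confint() %>% exp() # 95% bounds on the odds-ratio scale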

Cross-validating predictive power

set.seed(123)
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools), 
                 family = binomial, data = frogs)
cv <- CVbinary(frogs.glm)
## 
## Fold:  3 10 2 6 5 4 9 8 7 1
## Internal estimate of accuracy = 0.759
## Cross-validation estimate of accuracy = 0.745

The cross-validation estimate of accuracy is the proportion of correct predictions on the held-out folds.

Cross-validating predictive power

frogs.glm2 <- glm(pres.abs ~ log(distance) + log(NoOfPools) 
                 + I(meanmax + meanmin) + I(meanmax - meanmin),
                 family = binomial, data = frogs)
cv <- CVbinary(frogs.glm2)
## 
## Fold:  1 5 9 8 7 2 10 6 4 3
## Internal estimate of accuracy = 0.778
## Cross-validation estimate of accuracy = 0.778

Cross-validating predictive power

frogs.glm2 <- glm(pres.abs ~ log(distance) + log(NoOfPools) 
                 + I(meanmax + meanmin) + I(meanmax - meanmin),
                 family = binomial, data = frogs)
cv <- CVbinary(frogs.glm2, nfolds = 5)
## 
## Fold:  3 2 4 5 1
## Internal estimate of accuracy = 0.778
## Cross-validation estimate of accuracy = 0.783
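
Comparing the cross-validation estimates (about 0.75 for the two-predictor model versus about 0.78 once the climate terms are added) suggests that the larger model predicts somewhat better. Keep in mind that these estimates depend on the random fold assignment, which is why we fixed the seed above.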