Generalized linear models

14 november 2014

Chapter 7: Exercises

Who will discuss function

Choose <- function(n.names, replicate = 10000){
  names <- c("Lukas", "Roeline", "Zoë", "Lisanne", "Nigel", "Thijs")
  Who <- as.data.frame(matrix(NA, replicate))
  #pb <- txtProgressBar(min = 0, max = replicate, style = 3)
  for(i in 1:replicate){
    Who[i, ] <- sample(names, size = 1, replace=F)
    #setTxtProgressBar(pb, i)
  }
  #close(pb) 
  nms <- sort(table(Who)/replicate, decreasing = T)[1:n.names]
  if(n.names == 2){
    cat(paste("Today's presenters are: \n", 
              names(nms[1])," (",nms[1]," percent)", " and ", 
              names(nms[2])," (",nms[2]," percent)", sep = ""))}
  else{print(nms)}
}
Choose(2)

## Today's presenters are: 
## Lukas (0.1746 percent) and Nigel (0.1672 percent)

Generalized linear models

GLM's

We knew \[y=\alpha+\beta x+\epsilon\]

Now we consider \[\mathbb{E}[y] = \alpha + \beta x\]

They're the same. Different notation, different framework.

The upside is that we can now use a function for the expectation \(\mathbb{E}\) to allow for transformations. This would enable us to change \(\mathbb{E}[y]\) such that \(f(\mathbb{E}[y])\) has a linear relation with \(x\).

The function \(f()\) we call the link function. This function tranforms the scale of the response/outcome to the scale of the linear predictor.

Examples: \(f(x) = x, \\ f(x) = 1/x, \\ f(x) = \log(x), \\ f(x) = \log(x/(1-x)).\)

Link functions

GLM's continued

Remember that \[y=\alpha+\beta x+\epsilon,\]

and that \[\mathbb{E}[y] = \alpha + \beta x.\]

As a result \[y = \mathbb{E}[y] + \epsilon.\]

and residuals do not need to be normal (heck, \(y\) probably isn't, so why should \(\epsilon\) be?)

Logit

\[\text{logit}(p) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p)\]

Logit continued

Logit models work on the \(\log(\text{odds})\) scale

\[\log(\text{odds}) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p) = \text{logit}(p)\]

The logit of the probability is the log of the odds.

Logistic regression allows us to model the \(\log(\text{odds})\) as a function of other, linear predictors.

\(\log(\text{odds})\) explained

logodds <- rep(NA, 1001)
for(i in 0:1000){p <- i/1000; logodds[i+1] <- log(p / (1-p))}
head(logodds)

## [1]      -Inf -6.906755 -6.212606 -5.806138 -5.517453 -5.293305

tail(logodds)

## [1] 5.293305 5.517453 5.806138 6.212606 6.906755      Inf

\(\log(\text{odds})\) explained

plot(0:1000, logodds, xlab = "x", ylab = "log(odds)", type="l")

logit\(^{-1}\) explained

invlogit <- exp(logodds) / (1 + exp(logodds))
head(invlogit)

## [1] 0.000 0.001 0.002 0.003 0.004 0.005

tail(invlogit)

## [1] 0.995 0.996 0.997 0.998 0.999   NaN

logit\(^{-1}\) explained

plot(0:1000, invlogit, xlab = "x", ylab = "log(odds)", type="l")

Logistic regression

With linear regression we had the Sum of Squares (SS). Its logistic counterpart is the Deviance (D).

With logistic regression we aim to maximize the likelihood, which is equivalent to minimizing the deviance.

The likelihood is the (joint) probability of the observed values, given the current model parameters.

In normally distributed data: \(\text{SS}=\text{D}\).

Logistic regression

library(DAAG)

## Loading required package: lattice

anestot <- aggregate(anesthetic[, c("move","nomove")],  
                     by=list(conc=anesthetic$conc), FUN=sum) 
anestot$conc <- as.numeric(as.character(anestot$conc)) 
anestot$total <- apply(anestot[, c("move","nomove")], 1 , sum) 
anestot$prop <- anestot$nomove/anestot$total 
head(anestot)

##   conc move nomove total      prop
## 1  0.8    6      1     7 0.1428571
## 2  1.0    4      1     5 0.2000000
## 3  1.2    2      4     6 0.6666667
## 4  1.4    2      4     6 0.6666667
## 5  1.6    0      4     4 1.0000000
## 6  2.5    0      2     2 1.0000000

Logistic regression

glm(nomove ~ conc, family=binomial(link="logit"), data=anesthetic)

## 
## Call:  glm(formula = nomove ~ conc, family = binomial(link = "logit"), 
##     data = anesthetic)
## 
## Coefficients:
## (Intercept)         conc  
##      -6.469        5.567  
## 
## Degrees of Freedom: 29 Total (i.e. Null);  28 Residual
## Null Deviance:       41.46 
## Residual Deviance: 27.75     AIC: 31.75

glm(prop ~ conc, family=binomial(link="logit"), weights=total, data=anestot)$coef

## (Intercept)        conc 
##   -6.468675    5.566762

Logistic multiple regression

Always try to make the relation as linear as possible - after all you are assessing a linear model.

Do not forget that you use transformations to "make" things more linear

Logistic multiple regression

with(frogs, pairs(cbind(log(distance), log(NoOfPools), NoOfSites, avrain, altitude, meanmax+meanmin, meanmax-meanmin), col="gray", panel=panel.smooth, labels=c("log(Distance)", "log(NoOfPools)", "NoOfSites", "AvRainfall", "Altitude", "meanmax+meanmin", "meanmax-meanmin")))

Logistic multiple regression

## 
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + NoOfSites + 
##     avrain + I(meanmax + meanmin) + I(meanmax - meanmin), family = binomial, 
##     data = frogs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9763  -0.7189  -0.2786   0.7970   2.5745  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          18.2688999 16.1381912   1.132  0.25762    
## log(distance)        -0.7583198  0.2558117  -2.964  0.00303 ** 
## log(NoOfPools)        0.5708953  0.2153335   2.651  0.00802 ** 
## NoOfSites            -0.0036201  0.1061469  -0.034  0.97279    
## avrain                0.0007003  0.0411710   0.017  0.98643    
## I(meanmax + meanmin)  1.4958055  0.3153152   4.744  2.1e-06 ***
## I(meanmax - meanmin) -3.8582668  1.2783921  -3.018  0.00254 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 197.65  on 205  degrees of freedom
## AIC: 211.65
## 
## Number of Fisher Scoring iterations: 5

Logistic multiple regression

## 
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + I(meanmax + 
##     meanmin) + I(meanmax - meanmin), family = binomial, data = frogs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9753  -0.7224  -0.2780   0.7974   2.5736  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           18.5268     5.2673   3.517 0.000436 ***
## log(distance)         -0.7547     0.2261  -3.338 0.000844 ***
## log(NoOfPools)         0.5707     0.2152   2.652 0.007999 ** 
## I(meanmax + meanmin)   1.4985     0.3088   4.853 1.22e-06 ***
## I(meanmax - meanmin)  -3.8806     0.9002  -4.311 1.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 197.66  on 207  degrees of freedom
## AIC: 207.66
## 
## Number of Fisher Scoring iterations: 5

Fitted values

head(fitted(frogs.glm))

##         2         3         4         5         6         7 
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038

head(predict(frogs.glm, type="response"))

##         2         3         4         5         6         7 
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038

head(predict(frogs.glm, type="link"))  # Scale of linear predictor

##         2         3         4         5         6         7 
## 2.7815212 2.5256832 2.2303441 1.4628085 2.6085055 0.9835086

Fitted values with approximate SE's

pred <- predict(frogs.glm, type="link", se.fit=TRUE)
head(data.frame(fit = pred$fit, se = pred$se.fit))

##         fit        se
## 2 2.7815212 0.6859612
## 3 2.5256832 0.4851882
## 4 2.2303441 0.4381720
## 5 1.4628085 0.4805175
## 6 2.6085055 0.5290599
## 7 0.9835086 0.3900549

Individual contributions

par(mfrow=c(1,4), pty="s") 
frogs$maxminSum <- with(frogs, meanmax+meanmin) 
frogs$maxminDiff <- with(frogs, meanmax-meanmin) 
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools) +  
                 maxminSum + maxminDiff, family = binomial,  
                 data = frogs) 
termplot(frogs.glm)

Cross validating predictive power

frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools),
                 family=binomial,data=frogs)
cv <- CVbinary(frogs.glm)

## 
## Fold:  5 7 3 1 9 8 6 2 10 4
## Internal estimate of accuracy = 0.759
## Cross-validation estimate of accuracy = 0.745

ls(cv)

## [1] "acc.cv"       "acc.internal" "acc.training" "cvhat"       
## [5] "internal"     "training"

Logistic models for categorical data

Loglinear model

str(UCBAdmissions)

##  table [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ Admit : chr [1:2] "Admitted" "Rejected"
##   ..$ Gender: chr [1:2] "Male" "Female"
##   ..$ Dept  : chr [1:6] "A" "B" "C" "D" ...

UCBAdmissions

## , , Dept = A
## 
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19
## 
## , , Dept = B
## 
##           Gender
## Admit      Male Female
##   Admitted  353     17
##   Rejected  207      8
## 
## , , Dept = C
## 
##           Gender
## Admit      Male Female
##   Admitted  120    202
##   Rejected  205    391
## 
## , , Dept = D
## 
##           Gender
## Admit      Male Female
##   Admitted  138    131
##   Rejected  279    244
## 
## , , Dept = E
## 
##           Gender
## Admit      Male Female
##   Admitted   53     94
##   Rejected  138    299
## 
## , , Dept = F
## 
##           Gender
## Admit      Male Female
##   Admitted   22     24
##   Rejected  351    317

Loglinear model

dimnames(UCBAdmissions)  # Check levels of table margins

## $Admit
## [1] "Admitted" "Rejected"
## 
## $Gender
## [1] "Male"   "Female"
## 
## $Dept
## [1] "A" "B" "C" "D" "E" "F"

UCB <- as.data.frame.table(UCBAdmissions["Admitted", , ]) 
names(UCB)[3] <- "admit" 
UCB$reject <- as.data.frame.table(UCBAdmissions["Rejected", , ])$Freq 
UCB$Gender <- relevel(UCB$Gender, ref="Male") 
## Add further columns total and p (proportion admitted) 
UCB$total <- UCB$admit + UCB$reject 
UCB$p <- UCB$admit/UCB$total

Loglinear model

UCB

##    Gender Dept admit reject total          p
## 1    Male    A   512    313   825 0.62060606
## 2  Female    A    89     19   108 0.82407407
## 3    Male    B   353    207   560 0.63035714
## 4  Female    B    17      8    25 0.68000000
## 5    Male    C   120    205   325 0.36923077
## 6  Female    C   202    391   593 0.34064081
## 7    Male    D   138    279   417 0.33093525
## 8  Female    D   131    244   375 0.34933333
## 9    Male    E    53    138   191 0.27748691
## 10 Female    E    94    299   393 0.23918575
## 11   Male    F    22    351   373 0.05898123
## 12 Female    F    24    317   341 0.07038123

loglinear model

UCB.glm <- glm(p ~ Dept*Gender, family=binomial,  
               data=UCB, weights=total) 
anova(UCB.glm, test="Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: p
## 
## Terms added sequentially (first to last)
## 
## 
##             Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                           11     877.06              
## Dept         5   855.32         6      21.74 < 2.2e-16 ***
## Gender       1     1.53         5      20.20  0.215928    
## Dept:Gender  5    20.20         0       0.00  0.001144 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

loglinear model

summary(UCB.glm)$coef

##                       Estimate Std. Error     z value     Pr(>|z|)
## (Intercept)         0.49212143 0.07174966   6.8588682 6.940825e-12
## DeptB               0.04162783 0.11318919   0.3677721 7.130431e-01
## DeptC              -1.02763967 0.13549685  -7.5842331 3.344593e-14
## DeptD              -1.19607953 0.12640656  -9.4621632 3.016374e-21
## DeptE              -1.44908321 0.17681152  -8.1956378 2.492678e-16
## DeptF              -3.26186520 0.23119594 -14.1086615 3.358928e-45
## GenderFemale        1.05207596 0.26270810   4.0047336 6.208742e-05
## DeptB:GenderFemale -0.83205342 0.51039480  -1.6302153 1.030560e-01
## DeptC:GenderFemale -1.17699758 0.29955796  -3.9291147 8.525915e-05
## DeptD:GenderFemale -0.97008876 0.30261874  -3.2056467 1.347593e-03
## DeptE:GenderFemale -1.25226298 0.33032201  -3.7910371 1.500195e-04
## DeptF:GenderFemale -0.86318013 0.40266653  -2.1436600 3.206014e-02

The first six regression coefficients give the \(\log(\text{odds})\) for males.

In Departments "C", "D", "E" and "F", the \(\log(\text{odds})\) for females (relative to males of course) is reduced. In Department "A" the \(\log(\text{odds})\) for females is increased.

Poisson regression

ACF1

##    count endtime
## 1      1       6
## 2      3       6
## 3      5       6
## 4      1       6
## 5      2       6
## 6      1       6
## 7      1       6
## 8      3      12
## 9      1      12
## 10     2      12
## 11     6      12
## 12     0      12
## 13     0      12
## 14     4      12
## 15     1      12
## 16    10      18
## 17     6      18
## 18     6      18
## 19     7      18
## 20     5      18
## 21     7      18
## 22     6      18

Poisson regression

summary(ACF.glm <- glm(formula = count ~ endtime, family = poisson, data = ACF1))

## 
## Call:
## glm(formula = count ~ endtime, family = poisson, data = ACF1)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.46204  -0.47851  -0.07943   0.38159   2.26332  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.32152    0.40046  -0.803    0.422    
## endtime      0.11920    0.02642   4.511 6.44e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 51.105  on 21  degrees of freedom
## Residual deviance: 28.369  on 20  degrees of freedom
## AIC: 92.209
## 
## Number of Fisher Scoring iterations: 5

Poisson regression: fitted values

fitted(ACF.glm)

##        1        2        3        4        5        6        7        8 
## 1.482388 1.482388 1.482388 1.482388 1.482388 1.482388 1.482388 3.030821 
##        9       10       11       12       13       14       15       16 
## 3.030821 3.030821 3.030821 3.030821 3.030821 3.030821 3.030821 6.196674 
##       17       18       19       20       21       22 
## 6.196674 6.196674 6.196674 6.196674 6.196674 6.196674

Quasi-likelihood

A quasi-likelihood function has similar properties to the log-likelihood function

- a quasi-likelihood function is not the log-likelihood 
- it does not correspond to any actual probability distribution.

Used to allow for overdispersion during estimation

- Overdispersion occurs when there is greater variability in the data than would be expected from 

the statistical model that is used.

Quasi-poisson regression

summary(glm(formula = count ~ endtime, family = quasipoisson, data = ACF1))

## 
## Call:
## glm(formula = count ~ endtime, family = quasipoisson, data = ACF1)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.46204  -0.47851  -0.07943   0.38159   2.26332  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.32152    0.45532  -0.706 0.488239    
## endtime      0.11920    0.03004   3.968 0.000758 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 1.292733)
## 
##     Null deviance: 51.105  on 21  degrees of freedom
## Residual deviance: 28.369  on 20  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Transformations

\(y=\sqrt{n}\)

n <- rnorm(1000, 0, 1)^2
par(mfrow = c(1,2))
hist(n)
hist(sqrt(n))

\(y=\text{arcsin}(\sqrt{p})\)

Used to linearize sigmoid distributions and to equalize variances of proportions.

p <- seq(.1:1, by = .1)
p

##  [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

asin(sqrt(p))

##  [1] 0.3217506 0.4636476 0.5796397 0.6847192 0.7853982 0.8860771 0.9911566
##  [8] 1.1071487 1.2490458 1.5707963

asin(sqrt(p)) * 180/pi

##  [1] 18.43495 26.56505 33.21091 39.23152 45.00000 50.76848 56.78909
##  [8] 63.43495 71.56505 90.00000

Next session

Prepare for next session:

Exercises Chapter 8
- 1, 2, 3 and 4
Study the reading material and exercises from Chapters 6, 7 and 8.