14 november 2014
Choose <- function(n.names, replicate = 10000){ names <- c("Lukas", "Roeline", "Zoë", "Lisanne", "Nigel", "Thijs") Who <- as.data.frame(matrix(NA, replicate)) #pb <- txtProgressBar(min = 0, max = replicate, style = 3) for(i in 1:replicate){ Who[i, ] <- sample(names, size = 1, replace=F) #setTxtProgressBar(pb, i) } #close(pb) nms <- sort(table(Who)/replicate, decreasing = T)[1:n.names] if(n.names == 2){ cat(paste("Today's presenters are: \n", names(nms[1])," (",nms[1]," percent)", " and ", names(nms[2])," (",nms[2]," percent)", sep = ""))} else{print(nms)} } Choose(2)
## Today's presenters are: ## Lukas (0.1746 percent) and Nigel (0.1672 percent)
We knew \[y=\alpha+\beta x+\epsilon\]
Now we consider \[\mathbb{E}[y] = \alpha + \beta x\]
They're the same. Different notation, different framework.
The upside is that we can now use a function for the expectation \(\mathbb{E}\) to allow for transformations. This would enable us to change \(\mathbb{E}[y]\) such that \(f(\mathbb{E}[y])\) has a linear relation with \(x\).
The function \(f()\) we call the link function
. This function tranforms the scale of the response/outcome to the scale of the linear predictor.
Examples: \(f(x) = x, \\ f(x) = 1/x, \\ f(x) = \log(x), \\ f(x) = \log(x/(1-x)).\)
Remember that \[y=\alpha+\beta x+\epsilon,\]
and that \[\mathbb{E}[y] = \alpha + \beta x.\]
As a result \[y = \mathbb{E}[y] + \epsilon.\]
and residuals do not need to be normal (heck, \(y\) probably isn't, so why should \(\epsilon\) be?)
\[\text{logit}(p) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p)\]
Logit models work on the \(\log(\text{odds})\) scale
\[\log(\text{odds}) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p) = \text{logit}(p)\]
The logit of the probability is the log of the odds.
Logistic regression allows us to model the \(\log(\text{odds})\) as a function of other, linear predictors.
logodds <- rep(NA, 1001) for(i in 0:1000){p <- i/1000; logodds[i+1] <- log(p / (1-p))} head(logodds)
## [1] -Inf -6.906755 -6.212606 -5.806138 -5.517453 -5.293305
tail(logodds)
## [1] 5.293305 5.517453 5.806138 6.212606 6.906755 Inf
plot(0:1000, logodds, xlab = "x", ylab = "log(odds)", type="l")
invlogit <- exp(logodds) / (1 + exp(logodds)) head(invlogit)
## [1] 0.000 0.001 0.002 0.003 0.004 0.005
tail(invlogit)
## [1] 0.995 0.996 0.997 0.998 0.999 NaN
plot(0:1000, invlogit, xlab = "x", ylab = "log(odds)", type="l")
With linear regression we had the Sum of Squares (SS)
. Its logistic counterpart is the Deviance (D)
.
With logistic regression we aim to maximize the likelihood
, which is equivalent to minimizing the deviance.
The likelihood is the (joint) probability of the observed values, given the current model parameters.
In normally distributed data: \(\text{SS}=\text{D}\).
library(DAAG)
## Loading required package: lattice
anestot <- aggregate(anesthetic[, c("move","nomove")], by=list(conc=anesthetic$conc), FUN=sum) anestot$conc <- as.numeric(as.character(anestot$conc)) anestot$total <- apply(anestot[, c("move","nomove")], 1 , sum) anestot$prop <- anestot$nomove/anestot$total head(anestot)
## conc move nomove total prop ## 1 0.8 6 1 7 0.1428571 ## 2 1.0 4 1 5 0.2000000 ## 3 1.2 2 4 6 0.6666667 ## 4 1.4 2 4 6 0.6666667 ## 5 1.6 0 4 4 1.0000000 ## 6 2.5 0 2 2 1.0000000
glm(nomove ~ conc, family=binomial(link="logit"), data=anesthetic)
## ## Call: glm(formula = nomove ~ conc, family = binomial(link = "logit"), ## data = anesthetic) ## ## Coefficients: ## (Intercept) conc ## -6.469 5.567 ## ## Degrees of Freedom: 29 Total (i.e. Null); 28 Residual ## Null Deviance: 41.46 ## Residual Deviance: 27.75 AIC: 31.75
glm(prop ~ conc, family=binomial(link="logit"), weights=total, data=anestot)$coef
## (Intercept) conc ## -6.468675 5.566762
Always try to make the relation as linear as possible - after all you are assessing a linear model.
Do not forget that you use transformations to "make" things more linear
with(frogs, pairs(cbind(log(distance), log(NoOfPools), NoOfSites, avrain, altitude, meanmax+meanmin, meanmax-meanmin), col="gray", panel=panel.smooth, labels=c("log(Distance)", "log(NoOfPools)", "NoOfSites", "AvRainfall", "Altitude", "meanmax+meanmin", "meanmax-meanmin")))
## ## Call: ## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + NoOfSites + ## avrain + I(meanmax + meanmin) + I(meanmax - meanmin), family = binomial, ## data = frogs) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.9763 -0.7189 -0.2786 0.7970 2.5745 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 18.2688999 16.1381912 1.132 0.25762 ## log(distance) -0.7583198 0.2558117 -2.964 0.00303 ** ## log(NoOfPools) 0.5708953 0.2153335 2.651 0.00802 ** ## NoOfSites -0.0036201 0.1061469 -0.034 0.97279 ## avrain 0.0007003 0.0411710 0.017 0.98643 ## I(meanmax + meanmin) 1.4958055 0.3153152 4.744 2.1e-06 *** ## I(meanmax - meanmin) -3.8582668 1.2783921 -3.018 0.00254 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 279.99 on 211 degrees of freedom ## Residual deviance: 197.65 on 205 degrees of freedom ## AIC: 211.65 ## ## Number of Fisher Scoring iterations: 5
## ## Call: ## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + I(meanmax + ## meanmin) + I(meanmax - meanmin), family = binomial, data = frogs) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.9753 -0.7224 -0.2780 0.7974 2.5736 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 18.5268 5.2673 3.517 0.000436 *** ## log(distance) -0.7547 0.2261 -3.338 0.000844 *** ## log(NoOfPools) 0.5707 0.2152 2.652 0.007999 ** ## I(meanmax + meanmin) 1.4985 0.3088 4.853 1.22e-06 *** ## I(meanmax - meanmin) -3.8806 0.9002 -4.311 1.63e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 279.99 on 211 degrees of freedom ## Residual deviance: 197.66 on 207 degrees of freedom ## AIC: 207.66 ## ## Number of Fisher Scoring iterations: 5
head(fitted(frogs.glm))
## 2 3 4 5 6 7 ## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
head(predict(frogs.glm, type="response"))
## 2 3 4 5 6 7 ## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
head(predict(frogs.glm, type="link")) # Scale of linear predictor
## 2 3 4 5 6 7 ## 2.7815212 2.5256832 2.2303441 1.4628085 2.6085055 0.9835086
pred <- predict(frogs.glm, type="link", se.fit=TRUE) head(data.frame(fit = pred$fit, se = pred$se.fit))
## fit se ## 2 2.7815212 0.6859612 ## 3 2.5256832 0.4851882 ## 4 2.2303441 0.4381720 ## 5 1.4628085 0.4805175 ## 6 2.6085055 0.5290599 ## 7 0.9835086 0.3900549
par(mfrow=c(1,4), pty="s") frogs$maxminSum <- with(frogs, meanmax+meanmin) frogs$maxminDiff <- with(frogs, meanmax-meanmin) frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools) + maxminSum + maxminDiff, family = binomial, data = frogs) termplot(frogs.glm)
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools), family=binomial,data=frogs) cv <- CVbinary(frogs.glm)
## ## Fold: 5 7 3 1 9 8 6 2 10 4 ## Internal estimate of accuracy = 0.759 ## Cross-validation estimate of accuracy = 0.745
ls(cv)
## [1] "acc.cv" "acc.internal" "acc.training" "cvhat" ## [5] "internal" "training"
str(UCBAdmissions)
## table [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ... ## - attr(*, "dimnames")=List of 3 ## ..$ Admit : chr [1:2] "Admitted" "Rejected" ## ..$ Gender: chr [1:2] "Male" "Female" ## ..$ Dept : chr [1:6] "A" "B" "C" "D" ...
UCBAdmissions
## , , Dept = A ## ## Gender ## Admit Male Female ## Admitted 512 89 ## Rejected 313 19 ## ## , , Dept = B ## ## Gender ## Admit Male Female ## Admitted 353 17 ## Rejected 207 8 ## ## , , Dept = C ## ## Gender ## Admit Male Female ## Admitted 120 202 ## Rejected 205 391 ## ## , , Dept = D ## ## Gender ## Admit Male Female ## Admitted 138 131 ## Rejected 279 244 ## ## , , Dept = E ## ## Gender ## Admit Male Female ## Admitted 53 94 ## Rejected 138 299 ## ## , , Dept = F ## ## Gender ## Admit Male Female ## Admitted 22 24 ## Rejected 351 317
dimnames(UCBAdmissions) # Check levels of table margins
## $Admit ## [1] "Admitted" "Rejected" ## ## $Gender ## [1] "Male" "Female" ## ## $Dept ## [1] "A" "B" "C" "D" "E" "F"
UCB <- as.data.frame.table(UCBAdmissions["Admitted", , ]) names(UCB)[3] <- "admit" UCB$reject <- as.data.frame.table(UCBAdmissions["Rejected", , ])$Freq UCB$Gender <- relevel(UCB$Gender, ref="Male") ## Add further columns total and p (proportion admitted) UCB$total <- UCB$admit + UCB$reject UCB$p <- UCB$admit/UCB$total
UCB
## Gender Dept admit reject total p ## 1 Male A 512 313 825 0.62060606 ## 2 Female A 89 19 108 0.82407407 ## 3 Male B 353 207 560 0.63035714 ## 4 Female B 17 8 25 0.68000000 ## 5 Male C 120 205 325 0.36923077 ## 6 Female C 202 391 593 0.34064081 ## 7 Male D 138 279 417 0.33093525 ## 8 Female D 131 244 375 0.34933333 ## 9 Male E 53 138 191 0.27748691 ## 10 Female E 94 299 393 0.23918575 ## 11 Male F 22 351 373 0.05898123 ## 12 Female F 24 317 341 0.07038123
UCB.glm <- glm(p ~ Dept*Gender, family=binomial, data=UCB, weights=total) anova(UCB.glm, test="Chisq")
## Analysis of Deviance Table ## ## Model: binomial, link: logit ## ## Response: p ## ## Terms added sequentially (first to last) ## ## ## Df Deviance Resid. Df Resid. Dev Pr(>Chi) ## NULL 11 877.06 ## Dept 5 855.32 6 21.74 < 2.2e-16 *** ## Gender 1 1.53 5 20.20 0.215928 ## Dept:Gender 5 20.20 0 0.00 0.001144 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(UCB.glm)$coef
## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.49212143 0.07174966 6.8588682 6.940825e-12 ## DeptB 0.04162783 0.11318919 0.3677721 7.130431e-01 ## DeptC -1.02763967 0.13549685 -7.5842331 3.344593e-14 ## DeptD -1.19607953 0.12640656 -9.4621632 3.016374e-21 ## DeptE -1.44908321 0.17681152 -8.1956378 2.492678e-16 ## DeptF -3.26186520 0.23119594 -14.1086615 3.358928e-45 ## GenderFemale 1.05207596 0.26270810 4.0047336 6.208742e-05 ## DeptB:GenderFemale -0.83205342 0.51039480 -1.6302153 1.030560e-01 ## DeptC:GenderFemale -1.17699758 0.29955796 -3.9291147 8.525915e-05 ## DeptD:GenderFemale -0.97008876 0.30261874 -3.2056467 1.347593e-03 ## DeptE:GenderFemale -1.25226298 0.33032201 -3.7910371 1.500195e-04 ## DeptF:GenderFemale -0.86318013 0.40266653 -2.1436600 3.206014e-02
The first six regression coefficients give the \(\log(\text{odds})\) for males.
In Departments "C", "D", "E" and "F", the \(\log(\text{odds})\) for females (relative to males of course) is reduced. In Department "A" the \(\log(\text{odds})\) for females is increased.
ACF1
## count endtime ## 1 1 6 ## 2 3 6 ## 3 5 6 ## 4 1 6 ## 5 2 6 ## 6 1 6 ## 7 1 6 ## 8 3 12 ## 9 1 12 ## 10 2 12 ## 11 6 12 ## 12 0 12 ## 13 0 12 ## 14 4 12 ## 15 1 12 ## 16 10 18 ## 17 6 18 ## 18 6 18 ## 19 7 18 ## 20 5 18 ## 21 7 18 ## 22 6 18
summary(ACF.glm <- glm(formula = count ~ endtime, family = poisson, data = ACF1))
## ## Call: ## glm(formula = count ~ endtime, family = poisson, data = ACF1) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.46204 -0.47851 -0.07943 0.38159 2.26332 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.32152 0.40046 -0.803 0.422 ## endtime 0.11920 0.02642 4.511 6.44e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 51.105 on 21 degrees of freedom ## Residual deviance: 28.369 on 20 degrees of freedom ## AIC: 92.209 ## ## Number of Fisher Scoring iterations: 5
fitted(ACF.glm)
## 1 2 3 4 5 6 7 8 ## 1.482388 1.482388 1.482388 1.482388 1.482388 1.482388 1.482388 3.030821 ## 9 10 11 12 13 14 15 16 ## 3.030821 3.030821 3.030821 3.030821 3.030821 3.030821 3.030821 6.196674 ## 17 18 19 20 21 22 ## 6.196674 6.196674 6.196674 6.196674 6.196674 6.196674
A quasi-likelihood function has similar properties to the log-likelihood function
- a quasi-likelihood function is not the log-likelihood - it does not correspond to any actual probability distribution.
Used to allow for overdispersion during estimation
- Overdispersion occurs when there is greater variability in the data than would be expected from the statistical model that is used.
summary(glm(formula = count ~ endtime, family = quasipoisson, data = ACF1))
## ## Call: ## glm(formula = count ~ endtime, family = quasipoisson, data = ACF1) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.46204 -0.47851 -0.07943 0.38159 2.26332 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.32152 0.45532 -0.706 0.488239 ## endtime 0.11920 0.03004 3.968 0.000758 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for quasipoisson family taken to be 1.292733) ## ## Null deviance: 51.105 on 21 degrees of freedom ## Residual deviance: 28.369 on 20 degrees of freedom ## AIC: NA ## ## Number of Fisher Scoring iterations: 5
n <- rnorm(1000, 0, 1)^2 par(mfrow = c(1,2)) hist(n) hist(sqrt(n))
Used to linearize sigmoid distributions and to equalize variances of proportions.
p <- seq(.1:1, by = .1) p
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
asin(sqrt(p))
## [1] 0.3217506 0.4636476 0.5796397 0.6847192 0.7853982 0.8860771 0.9911566 ## [8] 1.1071487 1.2490458 1.5707963
asin(sqrt(p)) * 180/pi
## [1] 18.43495 26.56505 33.21091 39.23152 45.00000 50.76848 56.78909 ## [8] 63.43495 71.56505 90.00000
Prepare for next session: