14 november 2014
Choose <- function(n.names, replicate = 10000){
names <- c("Lukas", "Roeline", "Zoë", "Lisanne", "Nigel", "Thijs")
Who <- as.data.frame(matrix(NA, replicate))
#pb <- txtProgressBar(min = 0, max = replicate, style = 3)
for(i in 1:replicate){
Who[i, ] <- sample(names, size = 1, replace=F)
#setTxtProgressBar(pb, i)
}
#close(pb)
nms <- sort(table(Who)/replicate, decreasing = T)[1:n.names]
if(n.names == 2){
cat(paste("Today's presenters are: \n",
names(nms[1])," (",nms[1]," percent)", " and ",
names(nms[2])," (",nms[2]," percent)", sep = ""))}
else{print(nms)}
}
Choose(2)
## Today's presenters are: ## Lukas (0.1746 percent) and Nigel (0.1672 percent)
We knew \[y=\alpha+\beta x+\epsilon\]
Now we consider \[\mathbb{E}[y] = \alpha + \beta x\]
They're the same. Different notation, different framework.
The upside is that we can now use a function for the expectation \(\mathbb{E}\) to allow for transformations. This would enable us to change \(\mathbb{E}[y]\) such that \(f(\mathbb{E}[y])\) has a linear relation with \(x\).
The function \(f()\) we call the link function. This function tranforms the scale of the response/outcome to the scale of the linear predictor.
Examples: \(f(x) = x, \\ f(x) = 1/x, \\ f(x) = \log(x), \\ f(x) = \log(x/(1-x)).\)
Remember that \[y=\alpha+\beta x+\epsilon,\]
and that \[\mathbb{E}[y] = \alpha + \beta x.\]
As a result \[y = \mathbb{E}[y] + \epsilon.\]
and residuals do not need to be normal (heck, \(y\) probably isn't, so why should \(\epsilon\) be?)
\[\text{logit}(p) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p)\]
Logit models work on the \(\log(\text{odds})\) scale
\[\log(\text{odds}) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p) = \text{logit}(p)\]
The logit of the probability is the log of the odds.
Logistic regression allows us to model the \(\log(\text{odds})\) as a function of other, linear predictors.
logodds <- rep(NA, 1001)
for(i in 0:1000){p <- i/1000; logodds[i+1] <- log(p / (1-p))}
head(logodds)
## [1] -Inf -6.906755 -6.212606 -5.806138 -5.517453 -5.293305
tail(logodds)
## [1] 5.293305 5.517453 5.806138 6.212606 6.906755 Inf
plot(0:1000, logodds, xlab = "x", ylab = "log(odds)", type="l")
invlogit <- exp(logodds) / (1 + exp(logodds)) head(invlogit)
## [1] 0.000 0.001 0.002 0.003 0.004 0.005
tail(invlogit)
## [1] 0.995 0.996 0.997 0.998 0.999 NaN
plot(0:1000, invlogit, xlab = "x", ylab = "log(odds)", type="l")
With linear regression we had the Sum of Squares (SS). Its logistic counterpart is the Deviance (D).
With logistic regression we aim to maximize the likelihood, which is equivalent to minimizing the deviance.
The likelihood is the (joint) probability of the observed values, given the current model parameters.
In normally distributed data: \(\text{SS}=\text{D}\).
library(DAAG)
## Loading required package: lattice
anestot <- aggregate(anesthetic[, c("move","nomove")],
by=list(conc=anesthetic$conc), FUN=sum)
anestot$conc <- as.numeric(as.character(anestot$conc))
anestot$total <- apply(anestot[, c("move","nomove")], 1 , sum)
anestot$prop <- anestot$nomove/anestot$total
head(anestot)
## conc move nomove total prop ## 1 0.8 6 1 7 0.1428571 ## 2 1.0 4 1 5 0.2000000 ## 3 1.2 2 4 6 0.6666667 ## 4 1.4 2 4 6 0.6666667 ## 5 1.6 0 4 4 1.0000000 ## 6 2.5 0 2 2 1.0000000
glm(nomove ~ conc, family=binomial(link="logit"), data=anesthetic)
## ## Call: glm(formula = nomove ~ conc, family = binomial(link = "logit"), ## data = anesthetic) ## ## Coefficients: ## (Intercept) conc ## -6.469 5.567 ## ## Degrees of Freedom: 29 Total (i.e. Null); 28 Residual ## Null Deviance: 41.46 ## Residual Deviance: 27.75 AIC: 31.75
glm(prop ~ conc, family=binomial(link="logit"), weights=total, data=anestot)$coef
## (Intercept) conc ## -6.468675 5.566762
Always try to make the relation as linear as possible - after all you are assessing a linear model.
Do not forget that you use transformations to "make" things more linear
with(frogs, pairs(cbind(log(distance), log(NoOfPools), NoOfSites, avrain, altitude, meanmax+meanmin, meanmax-meanmin), col="gray", panel=panel.smooth, labels=c("log(Distance)", "log(NoOfPools)", "NoOfSites", "AvRainfall", "Altitude", "meanmax+meanmin", "meanmax-meanmin")))
## ## Call: ## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + NoOfSites + ## avrain + I(meanmax + meanmin) + I(meanmax - meanmin), family = binomial, ## data = frogs) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.9763 -0.7189 -0.2786 0.7970 2.5745 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 18.2688999 16.1381912 1.132 0.25762 ## log(distance) -0.7583198 0.2558117 -2.964 0.00303 ** ## log(NoOfPools) 0.5708953 0.2153335 2.651 0.00802 ** ## NoOfSites -0.0036201 0.1061469 -0.034 0.97279 ## avrain 0.0007003 0.0411710 0.017 0.98643 ## I(meanmax + meanmin) 1.4958055 0.3153152 4.744 2.1e-06 *** ## I(meanmax - meanmin) -3.8582668 1.2783921 -3.018 0.00254 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 279.99 on 211 degrees of freedom ## Residual deviance: 197.65 on 205 degrees of freedom ## AIC: 211.65 ## ## Number of Fisher Scoring iterations: 5
## ## Call: ## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + I(meanmax + ## meanmin) + I(meanmax - meanmin), family = binomial, data = frogs) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.9753 -0.7224 -0.2780 0.7974 2.5736 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 18.5268 5.2673 3.517 0.000436 *** ## log(distance) -0.7547 0.2261 -3.338 0.000844 *** ## log(NoOfPools) 0.5707 0.2152 2.652 0.007999 ** ## I(meanmax + meanmin) 1.4985 0.3088 4.853 1.22e-06 *** ## I(meanmax - meanmin) -3.8806 0.9002 -4.311 1.63e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 279.99 on 211 degrees of freedom ## Residual deviance: 197.66 on 207 degrees of freedom ## AIC: 207.66 ## ## Number of Fisher Scoring iterations: 5
head(fitted(frogs.glm))
## 2 3 4 5 6 7 ## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
head(predict(frogs.glm, type="response"))
## 2 3 4 5 6 7 ## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
head(predict(frogs.glm, type="link")) # Scale of linear predictor
## 2 3 4 5 6 7 ## 2.7815212 2.5256832 2.2303441 1.4628085 2.6085055 0.9835086
pred <- predict(frogs.glm, type="link", se.fit=TRUE) head(data.frame(fit = pred$fit, se = pred$se.fit))
## fit se ## 2 2.7815212 0.6859612 ## 3 2.5256832 0.4851882 ## 4 2.2303441 0.4381720 ## 5 1.4628085 0.4805175 ## 6 2.6085055 0.5290599 ## 7 0.9835086 0.3900549
par(mfrow=c(1,4), pty="s")
frogs$maxminSum <- with(frogs, meanmax+meanmin)
frogs$maxminDiff <- with(frogs, meanmax-meanmin)
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools) +
maxminSum + maxminDiff, family = binomial,
data = frogs)
termplot(frogs.glm)
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools),
family=binomial,data=frogs)
cv <- CVbinary(frogs.glm)
## ## Fold: 5 7 3 1 9 8 6 2 10 4 ## Internal estimate of accuracy = 0.759 ## Cross-validation estimate of accuracy = 0.745
ls(cv)
## [1] "acc.cv" "acc.internal" "acc.training" "cvhat" ## [5] "internal" "training"
str(UCBAdmissions)
## table [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ... ## - attr(*, "dimnames")=List of 3 ## ..$ Admit : chr [1:2] "Admitted" "Rejected" ## ..$ Gender: chr [1:2] "Male" "Female" ## ..$ Dept : chr [1:6] "A" "B" "C" "D" ...
UCBAdmissions
## , , Dept = A ## ## Gender ## Admit Male Female ## Admitted 512 89 ## Rejected 313 19 ## ## , , Dept = B ## ## Gender ## Admit Male Female ## Admitted 353 17 ## Rejected 207 8 ## ## , , Dept = C ## ## Gender ## Admit Male Female ## Admitted 120 202 ## Rejected 205 391 ## ## , , Dept = D ## ## Gender ## Admit Male Female ## Admitted 138 131 ## Rejected 279 244 ## ## , , Dept = E ## ## Gender ## Admit Male Female ## Admitted 53 94 ## Rejected 138 299 ## ## , , Dept = F ## ## Gender ## Admit Male Female ## Admitted 22 24 ## Rejected 351 317
dimnames(UCBAdmissions) # Check levels of table margins
## $Admit ## [1] "Admitted" "Rejected" ## ## $Gender ## [1] "Male" "Female" ## ## $Dept ## [1] "A" "B" "C" "D" "E" "F"
UCB <- as.data.frame.table(UCBAdmissions["Admitted", , ]) names(UCB)[3] <- "admit" UCB$reject <- as.data.frame.table(UCBAdmissions["Rejected", , ])$Freq UCB$Gender <- relevel(UCB$Gender, ref="Male") ## Add further columns total and p (proportion admitted) UCB$total <- UCB$admit + UCB$reject UCB$p <- UCB$admit/UCB$total
UCB
## Gender Dept admit reject total p ## 1 Male A 512 313 825 0.62060606 ## 2 Female A 89 19 108 0.82407407 ## 3 Male B 353 207 560 0.63035714 ## 4 Female B 17 8 25 0.68000000 ## 5 Male C 120 205 325 0.36923077 ## 6 Female C 202 391 593 0.34064081 ## 7 Male D 138 279 417 0.33093525 ## 8 Female D 131 244 375 0.34933333 ## 9 Male E 53 138 191 0.27748691 ## 10 Female E 94 299 393 0.23918575 ## 11 Male F 22 351 373 0.05898123 ## 12 Female F 24 317 341 0.07038123
UCB.glm <- glm(p ~ Dept*Gender, family=binomial,
data=UCB, weights=total)
anova(UCB.glm, test="Chisq")
## Analysis of Deviance Table ## ## Model: binomial, link: logit ## ## Response: p ## ## Terms added sequentially (first to last) ## ## ## Df Deviance Resid. Df Resid. Dev Pr(>Chi) ## NULL 11 877.06 ## Dept 5 855.32 6 21.74 < 2.2e-16 *** ## Gender 1 1.53 5 20.20 0.215928 ## Dept:Gender 5 20.20 0 0.00 0.001144 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(UCB.glm)$coef
## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.49212143 0.07174966 6.8588682 6.940825e-12 ## DeptB 0.04162783 0.11318919 0.3677721 7.130431e-01 ## DeptC -1.02763967 0.13549685 -7.5842331 3.344593e-14 ## DeptD -1.19607953 0.12640656 -9.4621632 3.016374e-21 ## DeptE -1.44908321 0.17681152 -8.1956378 2.492678e-16 ## DeptF -3.26186520 0.23119594 -14.1086615 3.358928e-45 ## GenderFemale 1.05207596 0.26270810 4.0047336 6.208742e-05 ## DeptB:GenderFemale -0.83205342 0.51039480 -1.6302153 1.030560e-01 ## DeptC:GenderFemale -1.17699758 0.29955796 -3.9291147 8.525915e-05 ## DeptD:GenderFemale -0.97008876 0.30261874 -3.2056467 1.347593e-03 ## DeptE:GenderFemale -1.25226298 0.33032201 -3.7910371 1.500195e-04 ## DeptF:GenderFemale -0.86318013 0.40266653 -2.1436600 3.206014e-02
The first six regression coefficients give the \(\log(\text{odds})\) for males.
In Departments "C", "D", "E" and "F", the \(\log(\text{odds})\) for females (relative to males of course) is reduced. In Department "A" the \(\log(\text{odds})\) for females is increased.
ACF1
## count endtime ## 1 1 6 ## 2 3 6 ## 3 5 6 ## 4 1 6 ## 5 2 6 ## 6 1 6 ## 7 1 6 ## 8 3 12 ## 9 1 12 ## 10 2 12 ## 11 6 12 ## 12 0 12 ## 13 0 12 ## 14 4 12 ## 15 1 12 ## 16 10 18 ## 17 6 18 ## 18 6 18 ## 19 7 18 ## 20 5 18 ## 21 7 18 ## 22 6 18
summary(ACF.glm <- glm(formula = count ~ endtime, family = poisson, data = ACF1))
## ## Call: ## glm(formula = count ~ endtime, family = poisson, data = ACF1) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.46204 -0.47851 -0.07943 0.38159 2.26332 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.32152 0.40046 -0.803 0.422 ## endtime 0.11920 0.02642 4.511 6.44e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 51.105 on 21 degrees of freedom ## Residual deviance: 28.369 on 20 degrees of freedom ## AIC: 92.209 ## ## Number of Fisher Scoring iterations: 5
fitted(ACF.glm)
## 1 2 3 4 5 6 7 8 ## 1.482388 1.482388 1.482388 1.482388 1.482388 1.482388 1.482388 3.030821 ## 9 10 11 12 13 14 15 16 ## 3.030821 3.030821 3.030821 3.030821 3.030821 3.030821 3.030821 6.196674 ## 17 18 19 20 21 22 ## 6.196674 6.196674 6.196674 6.196674 6.196674 6.196674
A quasi-likelihood function has similar properties to the log-likelihood function
- a quasi-likelihood function is not the log-likelihood - it does not correspond to any actual probability distribution.
Used to allow for overdispersion during estimation
- Overdispersion occurs when there is greater variability in the data than would be expected from the statistical model that is used.
summary(glm(formula = count ~ endtime, family = quasipoisson, data = ACF1))
## ## Call: ## glm(formula = count ~ endtime, family = quasipoisson, data = ACF1) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.46204 -0.47851 -0.07943 0.38159 2.26332 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.32152 0.45532 -0.706 0.488239 ## endtime 0.11920 0.03004 3.968 0.000758 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for quasipoisson family taken to be 1.292733) ## ## Null deviance: 51.105 on 21 degrees of freedom ## Residual deviance: 28.369 on 20 degrees of freedom ## AIC: NA ## ## Number of Fisher Scoring iterations: 5
n <- rnorm(1000, 0, 1)^2 par(mfrow = c(1,2)) hist(n) hist(sqrt(n))
Used to linearize sigmoid distributions and to equalize variances of proportions.
p <- seq(.1:1, by = .1) p
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
asin(sqrt(p))
## [1] 0.3217506 0.4636476 0.5796397 0.6847192 0.7853982 0.8860771 0.9911566 ## [8] 1.1071487 1.2490458 1.5707963
asin(sqrt(p)) * 180/pi
## [1] 18.43495 26.56505 33.21091 39.23152 45.00000 50.76848 56.78909 ## [8] 63.43495 71.56505 90.00000
Prepare for next session: