Gerko Vink
Fundamental Techniques in Data Science with R
Exercises

In this practical I detail multiple skills and show you a workflow for (predictive) analytics. Since these skills are vital for the assignment, I would like you to focus on understanding the code; I therefore give you the answers and interpretations of the analyses. It is not necessary to hand in any exercise this week.

Feel free to ask me if you have questions.

All the best,

Gerko


We use the following packages:

library(mice)
library(dplyr)
library(magrittr)
library(DAAG)

Exercises


The following table shows numbers of occasions when inhibition (i.e., no flow of current across a membrane) occurred within 120 s, for different concentrations of the protein peptide-C. The outcome yes implies that inhibition has occurred.

conc 0.1 0.5  1 10 20 30 50 70 80 100 150 
no     7   1 10  9  2  9 13  1  1   4   3 
yes    0   0  3  4  0  6  7  0  0   1   7

  1. Create this data in R. Make it into a data frame.
data <- data.frame(conc = c(.1, .5, 1, 10, 20, 30, 50, 70, 80, 100, 150),
                   no = c(7, 1, 10, 9, 2, 9, 13, 1, 1, 4, 3),
                   yes = c(0, 0, 3, 4, 0, 6, 7, 0, 0, 1 ,7)) 
data
##     conc no yes
## 1    0.1  7   0
## 2    0.5  1   0
## 3    1.0 10   3
## 4   10.0  9   4
## 5   20.0  2   0
## 6   30.0  9   6
## 7   50.0 13   7
## 8   70.0  1   0
## 9   80.0  1   0
## 10 100.0  4   1
## 11 150.0  3   7

  2. Add the following three new variables (columns) to the data:
  • the margin of no and yes over the rows (i.e. no + yes)
  • the proportion (yes over the margin)
  • the log odds

First, we create a function to calculate the logit (log odds):

logit <- function(p) log(p / (1 - p))
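
As a quick sanity check (not part of the original practical): the logit of 0.5 is 0, and proportions map symmetrically around it:

logit(c(0.25, 0.5, 0.75))
## [1] -1.098612  0.000000  1.098612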

Then we add the new columns

data <- 
  data %>% 
  mutate(margin = no+yes,
         prop = yes / margin,
         logit = logit(prop)) # apply the function

  3. Inspect the newly added columns in the data set. What do you see?
data
##     conc no yes margin      prop      logit
## 1    0.1  7   0      7 0.0000000       -Inf
## 2    0.5  1   0      1 0.0000000       -Inf
## 3    1.0 10   3     13 0.2307692 -1.2039728
## 4   10.0  9   4     13 0.3076923 -0.8109302
## 5   20.0  2   0      2 0.0000000       -Inf
## 6   30.0  9   6     15 0.4000000 -0.4054651
## 7   50.0 13   7     20 0.3500000 -0.6190392
## 8   70.0  1   0      1 0.0000000       -Inf
## 9   80.0  1   0      1 0.0000000       -Inf
## 10 100.0  4   1      5 0.2000000 -1.3862944
## 11 150.0  3   7     10 0.7000000  0.8472979

There are a lot of zero proportions, hence the −∞ in the logit. You can fix this (at least the interpretation of the log odds) by adding a constant (usually 0.5) to all cells, in line with the empirical log odds (see e.g. Cox and Snell 1989).

Another option is to add a constant of 1 to all cells. This is conceptually interesting, as the log of 1 equals 0.
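
A minimal sketch of this alternative (the helper logitAdj and its const argument are hypothetical and not used elsewhere in this practical):

logitAdj <- function(yes, no, const = 1) log((yes + const) / (no + const))
logitAdj(yes = 0, no = 7) # first row of the data: log(1/8), finite instead of -Inf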


  4. Add a new column where the log odds are calculated as: log(odds) = log((yes + 0.5) / (no + 0.5))
logitCandS <- function(yes, no) log((yes + .5) / (no + .5))
data <- 
  data %>% 
  mutate(logitCS = logitCandS(yes, no))
data
##     conc no yes margin      prop      logit    logitCS
## 1    0.1  7   0      7 0.0000000       -Inf -2.7080502
## 2    0.5  1   0      1 0.0000000       -Inf -1.0986123
## 3    1.0 10   3     13 0.2307692 -1.2039728 -1.0986123
## 4   10.0  9   4     13 0.3076923 -0.8109302 -0.7472144
## 5   20.0  2   0      2 0.0000000       -Inf -1.6094379
## 6   30.0  9   6     15 0.4000000 -0.4054651 -0.3794896
## 7   50.0 13   7     20 0.3500000 -0.6190392 -0.5877867
## 8   70.0  1   0      1 0.0000000       -Inf -1.0986123
## 9   80.0  1   0      1 0.0000000       -Inf -1.0986123
## 10 100.0  4   1      5 0.2000000 -1.3862944 -1.0986123
## 11 150.0  3   7     10 0.7000000  0.8472979  0.7621401

We can now see that the −∞ log odds are gone.


  5. Fit the model with margin as the weights, just like the model in slide 44 from this week’s lecture
fit <- 
  data %$%
  glm(prop ~ conc, family=binomial, weights=margin)
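
As an aside: an equivalent and equally standard specification in R (not the one used in the lecture) passes a two-column matrix of successes and failures, which makes the weights argument unnecessary:

fit.alt <- 
  data %$%
  glm(cbind(yes, no) ~ conc, family = binomial) # same estimates as fit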

  6. Look at the summary of the fitted object
summary(fit)
## 
## Call:
## glm(formula = prop ~ conc, family = binomial, weights = margin)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8159  -1.0552  -0.6878   0.3667   1.0315  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.32701    0.33837  -3.922 8.79e-05 ***
## conc         0.01215    0.00496   2.450   0.0143 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16.683  on 10  degrees of freedom
## Residual deviance: 10.389  on  9  degrees of freedom
## AIC: 30.988
## 
## Number of Fisher Scoring iterations: 4

A unit increase in conc increases the log odds of inhibition by 0.01215. This increase is significant.
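
To interpret this effect on the odds scale, we can exponentiate the coefficients (a common follow-up step, not part of the original exercise):

exp(coef(fit)) # exp(0.01215) is about 1.012: each unit of conc raises the odds by roughly 1.2%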


Just like with a linear model, we can obtain a series of plots for inspecting the logistic model.


  7. Inspect plots 1 and 5 for the object fit
plot(fit, which = c(1, 5))

The data set is small, but case 11 stands out in the Residuals vs. Leverage plot. Case 11 has quite a lot of leverage; although its residual is not out of the ordinary, this leverage is sufficient to give it a Cook’s distance of over 1. Further investigation is warranted.
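
We can back this up numerically (a quick check, not part of the original exercise):

cooks.distance(fit) # the value for case 11 should exceed 1, matching the plot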


Many models are built around the assumption of normality. Even in cases when this assumption is not strict, modeling efforts may benefit from a transformation towards normality. A transformation that is often used for skewed data is the log-transformation.


  8. conc is somewhat skewed. Plot the density of the conc column twice (a sketch follows below):
  • once as observed
  • once with a log-transformation
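
A possible answer, in the same plotting style as the mammalsleep example below (the original practical does not spell out this code):

par(mfrow = c(1, 2)) # two plots side-by-side
data %$%
  density(conc) %>%
  plot(main = "density", xlab = "conc")

data %$%
  log(conc) %>%
  density() %>%
  plot(main = "density", xlab = "log(conc)")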

In this case the transformation towards normality is not very apparent due to the small sample size. If we take, for example, the column bw (body weight) from the mice::mammalsleep data set, the log-transformation is very beneficial:

par(mfrow = c(1, 2)) # change parameter of plotting to 2 plots side-by-side
mammalsleep %$%
  density(bw) %>%
  plot(main = "density", xlab = "body weight")

mammalsleep %$% 
  log(bw) %>%
  density() %>%
  plot(main = "density", xlab = "log(body weight)")

I hope you all see that taking the log of bw would in this case make the predictor far more normally distributed.


We now return to the exercise and its data example with prop and conc.


  9. Investigate the log model

To apply this transformation directly in the model, it is best to wrap it in the I() function. I() indicates that any interpretation and/or conversion of objects should be inhibited and that the contents should be evaluated ‘as is’. For example, to run the model:

fit.log <- 
  data %$%
  glm(prop ~ I(log(conc)), family=binomial, weights=margin)

  10. Look at the summary of the fitted object again
summary(fit.log)
## 
## Call:
## glm(formula = prop ~ I(log(conc)), family = binomial, weights = margin)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2510  -1.0599  -0.5029   0.3152   1.3513  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.7659     0.5209  -3.390 0.000699 ***
## I(log(conc))   0.3437     0.1440   2.387 0.016975 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16.6834  on 10  degrees of freedom
## Residual deviance:  9.3947  on  9  degrees of freedom
## AIC: 29.994
## 
## Number of Fisher Scoring iterations: 4

The log odds for fit.log now reflect a unit increase in log(conc), instead of conc.
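
Both summaries report an AIC, so we can also compare the two models head-to-head (a quick comparison, not part of the original exercise). The log model has the lower AIC (29.994 versus 30.988) and the lower residual deviance, so it fits these data somewhat better:

AIC(fit, fit.log) # reproduces the AIC values from the two summaries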


  11. Inspect plots 1 and 5 of the fitted object based on log(conc)
plot(fit.log, which = c(1, 5))

Outliers are now less of an issue. This exercise demonstrates that data transformations may easily render our method more valid, but in exchange they make our model interpretation more difficult: parameters now have to be assessed in the log(conc) parameter space.
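
If we prefer an interpretation on the probability scale, predict() can do the back-transformation for us. A minimal sketch (the concentration value of 50 is only an illustrative choice):

predict(fit.log, newdata = data.frame(conc = 50), type = "response") # estimated P(inhibition) at conc = 50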


End of Practical.