Practical J&K

Exercises

The following packages are required for this practical:

library(dplyr)
library(magrittr)
library(mice)
library(ggplot2)
library(DAAG)
library(MASS)

The data sets elastic1 and elastic2 from the package DAAG were obtained using the same apparatus, including the same rubber band, as the data frame elasticband.

Using a different symbol and/or a different color, plot the data from the two data frames elastic1 and elastic2 on the same graph. Do the two sets of results appear consistent?

For each of the data sets elastic1 and elastic2, determine the regression of distance on stretch. In each case determine:

fitted values and standard errors of fitted values and
the \(R^2\) statistic.

Compare the two sets of results. What is the key difference between the two sets of data?

Study the residual vs leverage plots for both models. Hint use plot() on the fitted object

Use the robust regression function rlm() from the MASS package to fit lines to the data in elastic1 and elastic2. Compare the results with those from use of lm():

residuals
regression coefficients,
standard errors of coefficients,
plots of residuals against fitted values.

Use the elastic2 variable stretch to obtain predictions on the model fitted on elastic1.

Now make a scatterplot to investigate similarity between plot the predicted values against the observed values for elastic2

A recruiter for a large company suspects that the process his company uses to hire new applicants is biased. To test this, he records the application numbers that have been successfully hired in the last hiring round. He finds the following pattern:

numbers <- data.frame(hired = c(11, 19, 13, 4, 8, 4),
                      not_hired = c(89, 81, 87, 96, 92, 11))
numbers$probability <- round(with(numbers, hired / (hired + not_hired)), 2)
rownames(numbers) <- c(paste("Application number starts with", 0:5))
numbers

##                                  hired not_hired probability
## Application number starts with 0    11        89        0.11
## Application number starts with 1    19        81        0.19
## Application number starts with 2    13        87        0.13
## Application number starts with 3     4        96        0.04
## Application number starts with 4     8        92        0.08
## Application number starts with 5     4        11        0.27

Investigate whether there is indeed a pattern: does the probability to be hired depend a posteriori on the job application number?

The recruiter knows that application numbers are assigned to new applications based on the time and date the application has been submitted. A colleague suggests that applicants who submit early on in the process tend to be better prepared than applicants who submit later on in the process. Test this assumption by running a \(X^2\) test to compare the original data to the following pattern where a 2-percent drop over the starting numbers is expected.

decreasing <- data.frame(hired = c(16, 14, 12, 10, 8, 1),
                         not_hired = c(84, 86, 88, 91, 93, 14))
decreasing$probability <- round(with(decreasing, hired / (hired + not_hired)), 2)
decreasing

##   hired not_hired probability
## 1    16        84        0.16
## 2    14        86        0.14
## 3    12        88        0.12
## 4    10        91        0.10
## 5     8        93        0.08
## 6     1        14        0.07

The board of the company would like to improve their process if the process is systematically biased. They tell the recruiter that their standard process in hiring people is as follows:

The secretary sorts the applications by application number
The board determines for every application if the applicant would be hired
If half the vacancies are filled they take a coffee break
After the coffee break they continue the same process to distribute the other applications over the remaining vacancies.

The recruiter suspects that the following psychological process is occuring: The board realized at the coffee break that they were running out of vacancies to award the remaining half of the applications, then became more conservative for a while and return to baseline in the end.

If that were true, the following expected cell frequencies might be observed:

oops <- data.frame(hired = c(14, 14, 14, 2, 12, 3),
                   not_hired = c(86, 86, 86, 98, 88, 12))
oops$probability <- round(with(oops, hired / (hired + not_hired)), 2)
oops

##   hired not_hired probability
## 1    14        86        0.14
## 2    14        86        0.14
## 3    14        86        0.14
## 4     2        98        0.02
## 5    12        88        0.12
## 6     3        12        0.20

Verify if the oops pattern would fit to the observed pattern from the numbers data. Again, use a chi-squared test.

Plot the probability against the starting numbers and use different colours for each of the following patterns:

the observed pattern
the independence pattern (equal probability)
the decreasing probability pattern
the oops pattern.

Write a function that chooses automatically whether to do the chisq.test() or the fisher.test(). Create the function such that it:

takes two vectors as input
creates a table from the two vectors
checks if there is any expected cell frequency that is smaller than 5
and that then performs and prints the results of the appropriate test.

Test the function with the dataset bacteria (from MASS) by testing independence between compliance (hilo) and the presence or absence of disease (y).

Does your function work differently if we only put in the first 25 rows of the bacteria dataset?

The mammalsleep dataset is part of mice. It contains the Allison and Cicchetti (1976) data for mammalian species. To learn more about this data, type

?mammalsleep

Fit and inspect a model where brw is modeled from bw

Now fit and inspect a model where brw is predicted from both bw and species

Can you find a model that improves the \(R^2\) in modeling brw?

Inspect the diagnostic plots for the model obtained in exercise 16. What issues can you detect?

End of Practical JK.

Practical J&K

Gerko Vink

Statistical Programming in R

Exercises