We use the following packages in this Practical:
library(dplyr)
library(magrittr)
library(ggplot2)
In this practical you will need to perform regression analyses and create plots with ggplot2. I give you some examples and ask you to apply the techniques I demonstrate. For some exercises I give you the solution (e.g. the resulting graph) and the interpretation. The exercise is then to provide the code that generates this solution, and to give the interpretation for the exercises where it is omitted.
Feel free to ask me if you have questions.
All the best,
Gerko
Fit the following four linear models on the anscombe data:
- y1 predicted by x1, stored in object fit1
- y2 predicted by x2, stored in object fit2
- y3 predicted by x3, stored in object fit3
- y4 predicted by x4, stored in object fit4
I give you the code for the first regression model; you need to fit the other three models yourself.
fit1 <- anscombe %$%
  lm(y1 ~ x1)
fit2 <- anscombe %$%
  lm(y2 ~ x2)
fit3 <- anscombe %$%
  lm(y3 ~ x3)
fit4 <- anscombe %$%
  lm(y4 ~ x4)
Use the following code to mark up your output in a nice format:
output <- data.frame(fit1 = coef(fit1),
                     fit2 = coef(fit2),
                     fit3 = coef(fit3),
                     fit4 = coef(fit4))
row.names(output) <- names(coef(fit1))
output
##                  fit1     fit2      fit3      fit4
## (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
## x1          0.5000909 0.500000 0.4997273 0.4999091
Inspect the estimates in the output object. What do you conclude?
# These estimates are very similar.
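The similarity is not limited to the regression coefficients. As a quick extra check (not part of the exercise, just a sketch), you could also compare, for example, the R-squared values of the four fits:
sapply(list(fit1, fit2, fit3, fit4),
       function(fit) summary(fit)$r.squared)  # nearly identical for all four models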
Plot the pair (x1, y1) such that y1 is on the Y-axis, and make the color of the points blue. This is quite simple to do with ggplot2:
anscombe %>%
  ggplot(aes(x = x1, y = y1)) +
  geom_point(color = "blue")
In the above code we put the aesthetics aes(x = x1, y = y1) in the ggplot() function. This way, the aesthetics hold for the whole graph (i.e. all geoms we specify), unless otherwise specified. Alternatively, we could specify aesthetics for individual geoms, such as in
anscombe %>%
  ggplot() +
  geom_point(aes(x = x1, y = y1), color = "blue")
We can also override the aes(x = x1, y = y1) specified in ggplot() by specifying a different aes(x = x2, y = y2) under geom_point().
anscombe %>%
  ggplot(aes(x = x1, y = y1)) +
  geom_point(aes(x = x2, y = y2), color = "blue")
Create a single plot that shows the points for all four (x, y) pairs from the anscombe data, colored blue, gray, orange and purple, respectively. In other words, create the following plot:
gg <- anscombe %>%
  ggplot() +
  geom_point(aes(x = x1, y = y1), color = "blue") +
  geom_point(aes(x = x2, y = y2), color = "gray") +
  geom_point(aes(x = x3, y = y3), color = "orange") +
  geom_point(aes(x = x4, y = y4), color = "purple") +
  ylab("Y") + xlab("X")
gg
Add a regression line to the plot for the pairs (y3, x3) and (y4, x4), where the line inherits the colour from the respective points. Hint: use geom_smooth().
gg + # take the plot under #5 as the starting point
  geom_smooth(aes(x = x3, y = y3), method = "lm", se = FALSE, color = "orange") +
  geom_smooth(aes(x = x4, y = y4), method = "lm", se = FALSE, color = "purple")
Add a loess line to the plot from Exercise 5 for all pairs but (y4, x4), where the line inherits the colour from the respective points.
gg + # take the plot under #5 as the starting point
  geom_smooth(aes(x = x1, y = y1), method = "loess", se = FALSE, color = "blue") +
  geom_smooth(aes(x = x2, y = y2), method = "loess", se = FALSE, color = "gray") +
  geom_smooth(aes(x = x3, y = y3), method = "loess", se = FALSE, color = "orange")
Inspect the assumptions for the model fit1. HINT: use plot() and use the plots you've created in exercises 5-7.
plot(fit1)
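plot() on a fitted lm object draws its diagnostic plots one after the other. If you would rather see them together in a single 2-by-2 panel, you could set the plotting layout first; this is just a convenience sketch, not required for the exercise:
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2-by-2 grid
plot(fit1)
par(mfrow = c(1, 1))  # reset the plotting layout afterwards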
#- The residuals seem normally distributed, judging from the `Normal Q-Q` plot.
#- There is a dip in the `Residuals vs. Fitted` plot and the `Scale-Location` plot. Again, the dip in the `Scale-Location` plot can easily be explained by the small sample size and the deviation should be taken with a grain of salt.
Now inspect the assumptions for the model fit2. What do you think?
#- The data does not follow a linear trend; the deviation would definitely worry me.
#- The residuals seem non-normally distributed, especially in the tails of the `Normal Q-Q` plot.
#- I still could not argue that the residual variance is more-or-less constant over the level of the fitted values. The residual variance is heteroscedastic.
#- Case 8 has quite some leverage and a large residual. Its Cook's distance is greater than `.5`.
plot(fit2)
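If you prefer to verify the Cook's distances numerically rather than reading them off the plot, base R provides cooks.distance(); a quick sketch, not part of the exercise:
cooks.distance(fit2)  # case 8 should stand out with a value above .5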
Now inspect the assumptions for the model fit3. What do you think?
#- The data follows a perfect linear trend, except for case #3.
#- The residuals seem normally distributed, except for case #3
#- I still could not argue that the residual variance is more-or-less constant over the level of the fitted values. The residual variance is heteroscedastic. However, if case #3 were omitted, there would be no residuals: every point would fall perfectly on the regression line.
#- Case #3 has quite some leverage, but not as large as some other cases. Case #3 has the largest residual. Its Cook's distance is greater than `1`.
plot(fit3)
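If you are curious about the influence of case #3, you could refit the model without it and compare the coefficients; a small sketch, not part of the exercise (the object name fit3_no3 is only for illustration):
fit3_no3 <- anscombe[-3, ] %$%  # refit the third model without case 3
  lm(y3 ~ x3)
coef(fit3_no3)                  # compare with coef(fit3)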
Finally, inspect the assumptions for the model fit4. What do you think?
#- The data follows no trend. You'd be an idiot to perform linear regression on this set.
#- The residuals seem normally distributed
#- I still could not argue that the residual variance is more-or-less constant over the level of the fitted values. The residual variance does not exist for fitted values other than 7!
#- Case 8 has a leverage of 1; hence it is omitted from most of the plots. The plot over the remaining points is redundant.
plot(fit4)
## Warning: not plotting observations with leverage one:
## 8
## Warning: not plotting observations with leverage one:
## 8
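To check the leverage values numerically, you could inspect the hat values of the fitted model; again just a sketch, not part of the exercise:
hatvalues(fit4)  # case 8 has leverage (hat value) 1, which matches the warning above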