We use the following packages in this Practical:
library(dplyr)
library(magrittr)
library(ggplot2)
In this practical you will need to perform regression analyses and create plots with ggplot2. I give you some examples and ask you to apply the techniques I demonstrate. For some exercises I give you the solution (e.g. the resulting graph) and the interpretation. The exercise is then to provide the code that generates this solution, and to give the interpretation for the exercises where it is omitted.
Feel free to ask me if you have questions.
All the best,
Gerko
Fit the following four linear models on the anscombe data:
- y1 predicted by x1, stored in object fit1
- y2 predicted by x2, stored in object fit2
- y3 predicted by x3, stored in object fit3
- y4 predicted by x4, stored in object fit4
I give you the code for the first regression model; you need to fit the other three models yourself.
fit1 <- anscombe %$%
  lm(y1 ~ x1)
fit2 <- anscombe %$%
  lm(y2 ~ x2)
fit3 <- anscombe %$%
  lm(y3 ~ x3)
fit4 <- anscombe %$%
  lm(y4 ~ x4)
Use the following code to mark up your output in a nice format:
output <- data.frame(fit1 = coef(fit1),
                     fit2 = coef(fit2),
                     fit3 = coef(fit3),
                     fit4 = coef(fit4))
row.names(output) <- names(coef(fit1))
output
##                  fit1     fit2      fit3      fit4
## (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
## x1          0.5000909 0.500000 0.4997273 0.4999091
Inspect the estimates in the output object. What do you conclude?
# These estimates are very similar.
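The similarity is not limited to the regression coefficients. As a quick extra check (not part of the exercise, just a sketch), you could also compare, for example, the R-squared values of the four fits:
sapply(list(fit1, fit2, fit3, fit4),
       function(fit) summary(fit)$r.squared)  # nearly identical for all four models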
Plot the pair (x1, y1) such that y1 is on the Y-axis, and make the color of the points blue. This is quite simple to do with ggplot2:
anscombe %>%
  ggplot(aes(x = x1, y = y1)) +
  geom_point(color = "blue")
In the above code we put the aesthetics aes(x = x1, y = y1) in the ggplot() function. This way, the aesthetics hold for the whole graph (i.e. all geoms we specify), unless otherwise specified. Alternatively, we could specify aesthetics for individual geoms, such as in
anscombe %>%
  ggplot() +
  geom_point(aes(x = x1, y = y1), color = "blue")
We can also override the aes(x = x1, y = y1) specified in ggplot() by specifying a different aes(x = x2, y = y2) under geom_point().
anscombe %>%
  ggplot(aes(x = x1, y = y1)) +
  geom_point(aes(x = x2, y = y2), color = "blue")
Create a single plot that shows the points for all four (x, y) pairs from the anscombe data, colored blue, gray, orange and purple, respectively. In other words, create the following plot:
gg <- anscombe %>%
  ggplot() +
  geom_point(aes(x = x1, y = y1), color = "blue") +
  geom_point(aes(x = x2, y = y2), color = "gray") +
  geom_point(aes(x = x3, y = y3), color = "orange") +
  geom_point(aes(x = x4, y = y4), color = "purple") +
  ylab("Y") + xlab("X")
gg
Add a regression line to the plot for the pairs (y3, x3) and (y4, x4), where the line inherits the colour from the respective points. Hint: use geom_smooth().
gg + # take the plot under #5 as the starting point
  geom_smooth(aes(x = x3, y = y3), method = "lm", se = FALSE, color = "orange") +
  geom_smooth(aes(x = x4, y = y4), method = "lm", se = FALSE, color = "purple")
Add a loess line to the plot from Exercise 5 for all pairs but (y4, x4), where the line inherits the colour from the respective points.
gg + # take the plot under #5 as the starting point
  geom_smooth(aes(x = x1, y = y1), method = "loess", se = FALSE, color = "blue") +
  geom_smooth(aes(x = x2, y = y2), method = "loess", se = FALSE, color = "gray") +
  geom_smooth(aes(x = x3, y = y3), method = "loess", se = FALSE, color = "orange")
Inspect the assumptions for the model fit1. HINT: use plot() and use the plots you've created in exercises 5-7.
plot(fit1)
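plot() on a fitted lm object draws its diagnostic plots one after the other. If you would rather see them together in a single 2-by-2 panel, you could set the plotting layout first; this is just a convenience sketch, not required for the exercise:
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2-by-2 grid
plot(fit1)
par(mfrow = c(1, 1))  # reset the plotting layout afterwards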
#- The residuals seem normally distributed, judging from the `Normal Q-Q` plot.
#- There is a dip in the `Residuals vs. Fitted` plot and the `Scale-Location` plot. Again, the dip in the `Scale-Location` plot can easily be explained by the small sample size and the deviation should be taken with a grain of salt.
Now inspect the assumptions for the model fit2. What do you think?
#- The data does not follow a linear trend; the deviation would definitely worry me.
#- The residuals seem non-normally distributed, especially in the tails of the `Normal Q-Q` plot.
#- I still could not argue that the residual variance is more-or-less constant over the level of the fitted values. The residual variance is heteroscedastic.
#- Case 8 has quite some leverage and a large residual. Its Cook's distance is greater than `.5`.
plot(fit2)
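If you prefer to verify the Cook's distances numerically rather than reading them off the plot, base R provides cooks.distance(); a quick sketch, not part of the exercise:
cooks.distance(fit2)  # case 8 should stand out with a value above .5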
Now inspect the assumptions for the model fit3. What do you think?
#- The data follows a perfect linear trend, except for case #3.
#- The residuals seem normally distributed, except for case #3
#- I still could not argue that the residual variance is more-or-less constant over the level of the fitted values. The residual variance is heteroscedastic. However, if case #3 were omitted, there would be no residuals: every point would fall perfectly on the regression line.
#- Case #3 has quite some leverage, but not as large as some other cases. Case #3 has the largest residual. Its Cook's distance is greater than `1`.
plot(fit3)
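If you are curious about the influence of case #3, you could refit the model without it and compare the coefficients; a small sketch, not part of the exercise (the object name fit3_no3 is only for illustration):
fit3_no3 <- anscombe[-3, ] %$%  # refit the third model without case 3
  lm(y3 ~ x3)
coef(fit3_no3)                  # compare with coef(fit3)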
Finally, inspect the assumptions for the model fit4. What do you think?
#- The data follows no trend. You'd be an idiot to perform linear regression on this set.
#- The residuals seem normally distributed
#- I still could not argue that the residual variance is more-or-less constant over the level of the fitted values. The residual variance does not exist for fitted values other than 7!
#- Case 8 has a leverage of 1; hence it is omitted from most of the plots. The plot over the remaining points is redundant.
plot(fit4)
## Warning: not plotting observations with leverage one:
## 8
## Warning: not plotting observations with leverage one:
## 8
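To check the leverage values numerically, you could inspect the hat values of the fitted model; again just a sketch, not part of the exercise:
hatvalues(fit4)  # case 8 has leverage (hat value) 1, which matches the warning above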