In this practical I demonstrate multiple skills and show you a workflow for (predictive) analytics. Since these skills are vital for the assignment, I would like you to focus on understanding the code; I therefore give you the answers and interpretations of the analyses. It is not necessary to hand in any exercise this week.
Feel free to ask me if you have questions.
All the best,
Gerko
The following packages are required for this practical:
library(dplyr)
library(magrittr)
library(mice)
library(ggplot2)
library(DAAG)
library(MASS)
The data sets elastic1 and elastic2 from the package DAAG were obtained using the same apparatus, including the same rubber band, as the data frame elasticband.
Plot elastic1 and elastic2 on the same graph. Do the two sets of results appear consistent?
elastic <- rbind(elastic1, elastic2)
elastic$source <- c(rep("Elastic1", nrow(elastic1)),
rep("Elastic2", nrow(elastic2)))
elastic %>%
ggplot(aes(stretch, distance, colour = source)) +
geom_point() +
geom_smooth(method = "lm")
The results seem very consistent: Data set elastic2 has more observations over a larger range, but both sets result in roughly the same regression line. Data set elastic1 seems to have an odd-one-out value.
For each of the data sets elastic1 and elastic2, determine the regression of distance on stretch. In each case determine the fitted values and the standard errors of the fitted values, as well as the \(R^2\) statistic. Compare the two sets of results. What is the key difference between the two sets of data?
First we run the two models:
fit1 <-
elastic1 %$%
lm(distance ~ stretch)
fit2 <-
elastic2 %$%
lm(distance ~ stretch)
and then we compare the fitted values and their standard errors:
fit1 %>% predict(se.fit = TRUE)
## $fit
## 1 2 3 4 5 6 7
## 183.1429 235.7143 196.2857 209.4286 170.0000 156.8571 222.5714
##
## $se.fit
## [1] 6.586938 10.621119 5.891537 6.586938 8.331891 10.621119 8.331891
##
## $df
## [1] 5
##
## $residual.scale
## [1] 15.58754
fit2 %>% predict(se.fit = TRUE)
## $fit
## 1 2 3 4 5 6 7
## 77.58333 196.58333 137.08333 166.83333 256.08333 226.33333 107.33333
## 8 9
## 226.33333 285.83333
##
## $se.fit
## [1] 6.740293 3.520003 4.358744 3.635444 5.060323 4.064550 5.453165 4.064550
## [9] 6.296773
##
## $df
## [1] 7
##
## $residual.scale
## [1] 10.44202
We see that fit1 (based on elastic1) has a larger residual standard deviation (i.e. $residual.scale).
To get the \(R^2\) we can run a summary on the fitted models:
fit1 %>% summary()
##
## Call:
## lm(formula = distance ~ stretch)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.1429 -18.7143 -7.2857 -1.4286 8.0000 -6.8571 26.4286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -119.143 70.943 -1.679 0.15391
## stretch 6.571 1.473 4.462 0.00663 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.59 on 5 degrees of freedom
## Multiple R-squared: 0.7992, Adjusted R-squared: 0.7591
## F-statistic: 19.91 on 1 and 5 DF, p-value: 0.006631
fit2 %>% summary()
##
## Call:
## lm(formula = distance ~ stretch)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0833 -7.0833 -0.5833 5.1667 20.1667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -100.9167 15.6102 -6.465 0.000345 ***
## stretch 5.9500 0.3148 18.899 2.89e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.44 on 7 degrees of freedom
## Multiple R-squared: 0.9808, Adjusted R-squared: 0.978
## F-statistic: 357.2 on 1 and 7 DF, p-value: 2.888e-07
Or we can grab the \(R^2\) directly from the object, without a pipe:
summary(fit1)$r.squared
## [1] 0.7992446
summary(fit2)$r.squared
## [1] 0.9807775
The model based on elastic2 has smaller standard errors and a much larger \(R^2\). This is due to the larger range of values in elastic2 and the absence of an outlier.
Study the residual versus leverage plots for both models by calling plot() on the fitted object:
fit1 %>% plot(which = 5) #the fifth plot is the residuals vs leverage plot
fit2 %>% plot(which = 5)
For elastic1, case 2 has the largest influence on the estimation. However, it is not the case with the largest residual:
fit1$residuals
## 1 2 3 4 5 6
## -0.1428571 -18.7142857 -7.2857143 -1.4285714 8.0000000 -6.8571429
## 7
## 26.4285714
As we can see, case 7 has the largest residual.
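To see why case 2 is nevertheless the most influential, we can inspect the leverages (an added sketch, not part of the original practical; hatvalues() returns the diagonal of the hat matrix). Influence combines leverage with residual size, and case 2 pairs a high leverage with a sizeable residual:
fit1 %>% hatvalues() #leverage per case; extreme stretch values lie far from the predictor mean and yield high leverage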
Because there is a single value that influences the estimation and is somewhat different from the other values, a robust form of regression may be advisable to obtain more stable estimates. When robust methods are used, we refrain from omitting a suspected outlier from our analysis. In general, with robust analysis, influential cases that are not in line with the other cases receive less weight in the estimation procedure than under non-robust analysis.
Use the function rlm() from the MASS package to fit lines to the data in elastic1 and elastic2. Compare the results with those from the use of lm().
First, we run the same models again with rlm()
fit1.rlm <-
elastic1 %$%
rlm(distance ~ stretch)
fit2.rlm <-
elastic2 %$%
rlm(distance ~ stretch)
and then we look at the coefficients and the residuals:
data.frame(lm = coef(fit1),
rlm = coef(fit1.rlm))
## lm rlm
## (Intercept) -119.142857 -95.747207
## stretch 6.571429 6.039709
data.frame(lm = coef(fit2),
rlm = coef(fit2.rlm))
## lm rlm
## (Intercept) -100.9167 -103.055008
## stretch 5.9500 5.975157
We see that the coefficients for elastic1 are different for lm() and rlm(). The coefficients for elastic2 are very similar.
To study the standard errors of the coefficients:
data.frame(lm = summary(fit1)$coefficients[, "Std. Error"],
rlm = summary(fit1.rlm)$coefficients[, "Std. Error"])
## lm rlm
## (Intercept) 70.943496 60.690050
## stretch 1.472884 1.260009
data.frame(lm = summary(fit2)$coefficients[, "Std. Error"],
rlm = summary(fit2.rlm)$coefficients[, "Std. Error"])
## lm rlm
## (Intercept) 15.6101986 14.4955054
## stretch 0.3148387 0.2923567
The standard errors for the estimates for elastic1 have become much smaller with rlm() compared to standard lm() estimation. The standard errors for elastic2 are very similar.
To study the residuals:
data.frame(lm = residuals(fit1),
rlm = residuals(fit1.rlm))
## lm rlm
## 1 -0.1428571 0.9205815
## 2 -18.7142857 -13.3970925
## 3 -7.2857143 -5.1588370
## 4 -1.4285714 1.7617445
## 5 8.0000000 8.0000000
## 6 -6.8571429 -7.9205815
## 7 26.4285714 30.6823260
data.frame(lm = residuals(fit2),
rlm = residuals(fit2.rlm))
## lm rlm
## 1 -6.5833333 -5.1997008
## 2 -0.5833333 0.2971601
## 3 -10.0833333 -8.9512703
## 4 20.1666667 21.1729449
## 5 -7.0833333 -6.4544094
## 6 -9.3333333 -8.5786247
## 7 6.6666667 7.9245144
## 8 1.6666667 2.4213753
## 9 5.1666667 5.6698058
The residual trend for both models is very similar. Remember that deviant values will still be deviant under robust analyses; they are merely given less influence in the estimation.
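To see this down-weighting in action, we can inspect the final weights rlm() assigned to each case (an added check; the w component of an rlm object holds the robustness weights from the last IWLS iteration):
fit1.rlm$w #weights per case; deviant cases such as case 7 receive weights below 1, the rest keep full weight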
To plot the residuals against the fitted values:
plot(fit1, which = 1, add.smooth = FALSE, col = "blue", main = "elastic1")
points(residuals(fit1.rlm) ~ fitted(fit1.rlm), col = "orange")
plot(fit2, which = 1, add.smooth = FALSE, col = "blue", main = "elastic2")
points(residuals(fit2.rlm) ~ fitted(fit2.rlm), col = "orange")
The case 2 residual in elastic1 is smaller in the robust regression. This is because the case had less weight in the rlm() estimation of the coefficients than in the ordinary lm() regression.
Use the elastic2 variable stretch to obtain predictions on the model fitted on elastic1:
pred <- predict.lm(fit1, newdata = data.frame(stretch = elastic2$stretch))
We then plot the predicted values together with the observed values from elastic2:
new.dat <- data.frame(stretch = elastic2$stretch,
distance = c(elastic2$distance, pred))
new.dat$source <- c(rep("original", nrow(elastic2)),
rep("predicted", nrow(elastic2)))
new.dat %>%
ggplot(aes(stretch, distance, colour = source)) +
geom_point() +
geom_smooth(method = "lm")
The predicted values are very similar to the observed values:
data.frame(distance = elastic2$distance, predicted = pred) %>%
ggplot(aes(distance, predicted)) +
geom_point()
They do not strictly follow the straight line because there is some modeling error: we use elastic1’s model to predict elastic2’s distance [error source 1] and we compare those predictions to elastic2’s observed distance [error source 2]. Nevertheless, given this modeling error, the predictions are very accurate and correlate highly with the observed values:
data.frame(distance = elastic2$distance, predicted = pred) %>%
cor()
## distance predicted
## distance 1.0000000 0.9903421
## predicted 0.9903421 1.0000000
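As an added check (not shown in the original output), we could also express the prediction error on the scale of the outcome, for instance with the root mean squared error:
sqrt(mean((elastic2$distance - pred)^2)) #RMSE of the elastic1-based predictions for elastic2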
The mammalsleep dataset is part of mice. It contains the Allison and Cicchetti (1976) data for mammalian species. To learn more about this data, type
?mammalsleep
Fit and inspect the model where brw is modeled from bw:
mammalsleep %$%
lm(brw ~ bw) %>%
anova()
## Analysis of Variance Table
##
## Response: brw
## Df Sum Sq Mean Sq F value Pr(>F)
## bw 1 46068314 46068314 411.19 < 2.2e-16 ***
## Residuals 60 6722239 112037
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It seems that we can model brain weight brw with body weight bw. If we inspect the linear model, we see that the \(R^2\) is quite high:
mammalsleep %$%
lm(brw ~ bw) %>%
summary()
##
## Call:
## lm(formula = brw ~ bw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -810.07 -88.52 -79.64 -13.02 2050.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.00440 43.55258 2.09 0.0409 *
## bw 0.96650 0.04766 20.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 334.7 on 60 degrees of freedom
## Multiple R-squared: 0.8727, Adjusted R-squared: 0.8705
## F-statistic: 411.2 on 1 and 60 DF, p-value: < 2.2e-16
Now fit and inspect the model where brw is predicted from both bw and species:
mammalsleep %$%
lm(brw ~ bw + species) %>%
anova()
## Warning in anova.lm(.): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Analysis of Variance Table
##
## Response: brw
## Df Sum Sq Mean Sq F value Pr(>F)
## bw 1 46068314 46068314
## species 60 6722239 112037
## Residuals 0 0
There seems to be a perfect fit and we don’t get any p-values. If we inspect the linear model with summary(), we find that every animal is observed only once. Adding species as a predictor therefore yields the most overfitted model we can obtain, and our residuals effectively drop to zero.
mammalsleep %$%
lm(brw ~ bw + species) %>%
summary()
##
## Call:
## lm(formula = brw ~ bw + species)
##
## Residuals:
## ALL 62 residuals are 0: no residual degrees of freedom!
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.5316 NA NA NA
## bw 0.8564 NA NA NA
## speciesAfrican giant pouched rat -7.7880 NA NA NA
## speciesArctic Fox 28.0695 NA NA NA
## speciesArctic ground squirrel -8.6195 NA NA NA
## speciesAsian elephant 2408.2242 NA NA NA
## speciesBaboon 156.9334 NA NA NA
## speciesBig brown bat -13.2513 NA NA NA
## speciesBrazilian tapir 18.4448 NA NA NA
## speciesCat 9.2423 NA NA NA
## speciesChimpanzee 381.7987 NA NA NA
## speciesChinchilla -7.4956 NA NA NA
## speciesCow 11.2436 NA NA NA
## speciesDesert hedgehog -11.6026 NA NA NA
## speciesDonkey 245.2365 NA NA NA
## speciesEastern American mole -12.3958 NA NA NA
## speciesEchidna 8.8992 NA NA NA
## speciesEuropean hedgehog -10.7039 NA NA NA
## speciesGalago -8.7029 NA NA NA
## speciesGenet 2.7609 NA NA NA
## speciesGiant armadillo 16.0846 NA NA NA
## speciesGiraffe 213.4342 NA NA NA
## speciesGoat 77.7805 NA NA NA
## speciesGolden hamster -12.6344 NA NA NA
## speciesGorilla 215.1941 NA NA NA
## speciesGray seal 238.6746 NA NA NA
## speciesGray wolf 74.8555 NA NA NA
## speciesGround squirrel -9.6181 NA NA NA
## speciesGuinea pig -8.9222 NA NA NA
## speciesHorse 195.2854 NA NA NA
## speciesJaguar 57.8287 NA NA NA
## speciesKangaroo 12.4945 NA NA NA
## speciesLesser short-tailed shrew -13.3959 NA NA NA
## speciesLittle brown bat -13.2902 NA NA NA
## speciesMan 1253.3718 NA NA NA
## speciesMole rat -10.6361 NA NA NA
## speciesMountain beaver -6.5877 NA NA NA
## speciesMouse -13.1513 NA NA NA
## speciesMusk shrew -13.2427 NA NA NA
## speciesN. American opossum -8.6875 NA NA NA
## speciesNine-banded armadillo -5.7290 NA NA NA
## speciesOkapi 262.3691 NA NA NA
## speciesOwl monkey 1.5573 NA NA NA
## speciesPatas monkey 92.9044 NA NA NA
## speciesPhanlanger -3.5190 NA NA NA
## speciesPig 2.0401 NA NA NA
## speciesRabbit -3.5726 NA NA NA
## speciesRaccoon 21.9962 NA NA NA
## speciesRat -11.8714 NA NA NA
## speciesRed fox 33.2416 NA NA NA
## speciesRhesus monkey 159.6449 NA NA NA
## speciesRock hyrax (Hetero. b) -1.8739 NA NA NA
## speciesRock hyrax (Procavia hab) 4.3854 NA NA NA
## speciesRoe deer 71.9680 NA NA NA
## speciesSheep 113.9384 NA NA NA
## speciesSlow loris -2.2305 NA NA NA
## speciesStar nosed mole -12.5830 NA NA NA
## speciesTenrec -11.7023 NA NA NA
## speciesTree hyrax -2.9444 NA NA NA
## speciesTree shrew -11.1207 NA NA NA
## speciesVervet 40.8801 NA NA NA
## speciesWater opossum -12.6290 NA NA NA
## speciesYellow-bellied marmot NA NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 61 and 0 DF, p-value: NA
The analysis we ran is in fact equivalent to running a fixed effects model on clusters of size 1. Since in that scenario there is no clustering, we should omit the fixed effect for species and just model the random variation (in this case captured by the residual variance).
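We can confirm that species uniquely identifies the rows, so every "cluster" indeed has size 1 (a quick added check):
nlevels(mammalsleep$species) == nrow(mammalsleep) #62 levels for 62 rows: the species fixed effect absorbs all residual degrees of freedom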
Which variables would be reasonable candidates to predict brw? Since we have considered only linear models so far, I limit myself to linear models. The basis of the linear model is the variance-covariance matrix and how it translates to data relations. This is most easily summarized linearly in the correlation matrix:
mammalsleep %>%
subset(select = -species) %>% #exclude factor species
cor(use = "pairwise.complete.obs") #pairwise deletion
## bw brw sws ps ts mls
## bw 1.00000000 0.93416384 -0.3759462 -0.1093833 -0.3071859 0.30245056
## brw 0.93416384 1.00000000 -0.3692177 -0.1051388 -0.3581020 0.50925268
## sws -0.37594625 -0.36921766 1.0000000 0.5142539 0.9627147 -0.38443179
## ps -0.10938331 -0.10513879 0.5142539 1.0000000 0.7270870 -0.29574535
## ts -0.30718591 -0.35810203 0.9627147 0.7270870 1.0000000 -0.41020239
## mls 0.30245056 0.50925268 -0.3844318 -0.2957453 -0.4102024 1.00000000
## gt 0.65110218 0.74724248 -0.5947028 -0.4508987 -0.6313262 0.61484879
## pi 0.05949472 0.03385548 -0.3181846 -0.4474705 -0.3958350 -0.10254416
## sei 0.33827367 0.36780037 -0.5437566 -0.5372245 -0.6422845 0.36035221
## odi 0.13358123 0.14587888 -0.4838522 -0.5793365 -0.5877424 0.06177846
## gt pi sei odi
## bw 0.6511022 0.05949472 0.3382737 0.13358123
## brw 0.7472425 0.03385548 0.3678004 0.14587888
## sws -0.5947028 -0.31818462 -0.5437566 -0.48385220
## ps -0.4508987 -0.44747050 -0.5372245 -0.57933653
## ts -0.6313262 -0.39583497 -0.6422845 -0.58774241
## mls 0.6148488 -0.10254416 0.3603522 0.06177846
## gt 1.0000000 0.20050426 0.6382790 0.37861701
## pi 0.2005043 1.00000000 0.6182460 0.91604245
## sei 0.6382790 0.61824597 1.0000000 0.78720311
## odi 0.3786170 0.91604245 0.7872031 1.00000000
This matrix contains quite a few cells. To obtain only the correlations with brw we could select the respective column:
mammalsleep %>%
subset(select = -species) %>% #exclude factor species
cor(use = "pairwise.complete.obs") %>% #pairwise deletion
subset(select = brw) #only column brw from the correlation matrix
## brw
## bw 0.93416384
## brw 1.00000000
## sws -0.36921766
## ps -0.10513879
## ts -0.35810203
## mls 0.50925268
## gt 0.74724248
## pi 0.03385548
## sei 0.36780037
## odi 0.14587888
It seems that the following variables have a rather nice relation with brw:
- sws: slow wave sleep
- ts: total sleep
- mls: maximum life span
- gt: gestation time
- sei: sleep exposure index
However, from the larger correlation matrix we can also see that ts is highly collinear with sws; in fact, ts is calculated as the sum of sws and ps. Including both variables will not hurt our \(R^2\) per se, but it will certainly trouble the precision of our estimation, as including both will yield much larger standard errors. It may seem wise to select sws as a predictor: ts contains a source of error in the form of ps, so its linear association with brw is slightly weaker. However, sws misses 14 cases, while ts misses only 4.
mammalsleep %>%
summary()
## species bw brw
## African elephant : 1 Min. : 0.005 Min. : 0.14
## African giant pouched rat: 1 1st Qu.: 0.600 1st Qu.: 4.25
## Arctic Fox : 1 Median : 3.342 Median : 17.25
## Arctic ground squirrel : 1 Mean : 198.790 Mean : 283.13
## Asian elephant : 1 3rd Qu.: 48.203 3rd Qu.: 166.00
## Baboon : 1 Max. :6654.000 Max. :5712.00
## (Other) :56
## sws ps ts mls
## Min. : 2.100 Min. :0.000 Min. : 2.60 Min. : 2.000
## 1st Qu.: 6.250 1st Qu.:0.900 1st Qu.: 8.05 1st Qu.: 6.625
## Median : 8.350 Median :1.800 Median :10.45 Median : 15.100
## Mean : 8.673 Mean :1.972 Mean :10.53 Mean : 19.878
## 3rd Qu.:11.000 3rd Qu.:2.550 3rd Qu.:13.20 3rd Qu.: 27.750
## Max. :17.900 Max. :6.600 Max. :19.90 Max. :100.000
## NA's :14 NA's :12 NA's :4 NA's :4
## gt pi sei odi
## Min. : 12.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 35.75 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median : 79.00 Median :3.000 Median :2.000 Median :2.000
## Mean :142.35 Mean :2.871 Mean :2.419 Mean :2.613
## 3rd Qu.:207.50 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :645.00 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :4
Therefore it is highly preferable to use ts in the model, despite its weaker association.
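Two quick added checks support this reasoning: ts is indeed the sum of sws and ps wherever all three are observed, and using ts instead of sws leaves us more complete cases:
mammalsleep %$% range(ts - (sws + ps), na.rm = TRUE) #should be (near) zero
sum(complete.cases(mammalsleep[, c("brw", "bw", "sws", "mls", "gt", "sei")])) #complete cases with sws
sum(complete.cases(mammalsleep[, c("brw", "bw", "ts", "mls", "gt", "sei")])) #complete cases with ts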
We run the new model:
fit <-
mammalsleep %$%
lm(brw ~ bw + ts + mls + gt + sei)
fit %>%
summary()
##
## Call:
## lm(formula = brw ~ bw + ts + mls + gt + sei)
##
## Residuals:
## Min 1Q Median 3Q Max
## -485.56 -100.64 0.64 104.72 1434.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -197.18186 198.55848 -0.993 0.325987
## bw 0.81023 0.05633 14.383 < 2e-16 ***
## ts 4.90058 11.88823 0.412 0.682134
## mls 10.58363 2.69053 3.934 0.000287 ***
## gt 1.23352 0.58171 2.121 0.039509 *
## sei -43.36629 34.80674 -1.246 0.219243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 268.8 on 45 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.9373, Adjusted R-squared: 0.9304
## F-statistic: 134.6 on 5 and 45 DF, p-value: < 2.2e-16
and we have obtained a very high \(R^2\). If prediction were our goal, we would be doing great.
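As an added illustration, the correlation between the observed and fitted brain weights for the analyzed cases equals the square root of this \(R^2\):
cor(fit$model$brw, fitted(fit)) #fit$model holds the complete cases used in the fit, aligned with fitted(fit)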
Inspect the diagnostic plots for the model from exercise 16. What issues can you detect?
fit %>%
plot(which = 1:6)
Some issues spring to mind; most notably, the first and the fifth case exert a large influence on the model:
mammalsleep$species[c(1, 5)]
## [1] African elephant Asian elephant
## 62 Levels: African elephant African giant pouched rat ... Yellow-bellied marmot
If we sort the brw variable together with its strongest predictor
mammalsleep %>%
subset(select = c(species, brw, bw, ts, mls, gt, sei)) %>%
arrange(desc(brw)) #sort the data in descending order based on brw
## species brw bw ts mls gt sei
## 1 African elephant 5712.00 6654.000 3.3 38.6 645.0 5
## 2 Asian elephant 4603.00 2547.000 3.9 69.0 624.0 5
## 3 Man 1320.00 62.000 8.0 100.0 267.0 1
## 4 Giraffe 680.00 529.000 NA 28.0 400.0 5
## 5 Horse 655.00 521.000 2.9 46.0 336.0 5
## 6 Okapi 490.00 250.000 NA 23.6 440.0 5
## 7 Chimpanzee 440.00 52.160 9.7 50.0 230.0 1
## 8 Cow 423.00 465.000 3.9 30.0 281.0 5
## 9 Donkey 419.00 187.100 3.1 40.0 365.0 5
## 10 Gorilla 406.00 207.000 12.0 39.3 252.0 4
## 11 Gray seal 325.00 85.000 6.2 41.0 310.0 3
## 12 Pig 180.00 192.000 8.4 27.0 115.0 4
## 13 Baboon 179.50 10.550 9.8 27.0 180.0 4
## 14 Rhesus monkey 179.00 6.800 9.6 29.0 164.0 3
## 15 Sheep 175.00 55.500 3.8 20.0 151.0 5
## 16 Brazilian tapir 169.00 160.000 6.2 30.4 392.0 5
## 17 Jaguar 157.00 100.000 10.8 22.4 100.0 1
## 18 Gray wolf 119.50 36.330 13.0 16.2 63.0 1
## 19 Goat 115.00 27.660 3.8 20.0 148.0 5
## 20 Patas monkey 115.00 10.000 10.9 20.2 170.0 4
## 21 Roe deer 98.20 14.830 2.6 17.0 150.0 5
## 22 Giant armadillo 81.00 60.000 18.1 7.0 NA 1
## 23 Vervet 58.00 4.190 10.3 24.0 210.0 3
## 24 Kangaroo 56.00 35.000 NA 16.3 33.0 5
## 25 Red fox 50.40 4.235 9.8 9.8 52.0 1
## 26 Arctic Fox 44.50 3.385 12.5 14.0 60.0 1
## 27 Raccoon 39.20 4.288 12.5 13.7 63.0 2
## 28 Cat 25.60 3.300 14.5 28.0 63.0 2
## 29 Echidna 25.00 3.000 8.6 50.0 28.0 2
## 30 Rock hyrax (Procavia hab) 21.00 3.600 5.4 6.0 225.0 2
## 31 Genet 17.50 1.410 6.1 34.0 NA 2
## 32 Yellow-bellied marmot 17.00 4.050 NA 13.0 38.0 1
## 33 Owl monkey 15.50 0.480 17.0 12.0 140.0 2
## 34 Slow loris 12.50 1.400 11.0 12.7 90.0 2
## 35 Rock hyrax (Hetero. b) 12.30 0.750 6.6 7.0 225.0 2
## 36 Tree hyrax 12.30 2.000 5.4 7.5 200.0 1
## 37 Rabbit 12.10 2.500 8.4 18.0 31.0 5
## 38 Phanlanger 11.40 1.620 13.7 13.0 17.0 1
## 39 Nine-banded armadillo 10.80 3.500 17.4 6.5 120.0 1
## 40 Mountain beaver 8.10 1.350 11.2 NA 45.0 1
## 41 African giant pouched rat 6.60 1.000 8.3 4.5 42.0 1
## 42 Chinchilla 6.40 0.425 12.5 7.0 112.0 4
## 43 N. American opossum 6.30 1.700 19.4 5.0 12.0 1
## 44 Arctic ground squirrel 5.70 0.920 16.5 NA 25.0 2
## 45 Guinea pig 5.50 1.040 8.2 7.6 68.0 3
## 46 Galago 5.00 0.200 10.7 10.4 120.0 2
## 47 Ground squirrel 4.00 0.101 13.8 9.0 28.0 1
## 48 Water opossum 3.90 3.500 19.4 3.0 14.0 1
## 49 European hedgehog 3.50 0.785 10.7 6.0 42.0 2
## 50 Mole rat 3.00 0.122 10.6 NA 30.0 1
## 51 Tenrec 2.60 0.900 13.3 4.5 60.0 1
## 52 Tree shrew 2.50 0.104 15.8 2.3 46.0 2
## 53 Desert hedgehog 2.40 0.550 10.3 NA NA 1
## 54 Rat 1.90 0.280 13.2 4.7 21.0 1
## 55 Eastern American mole 1.20 0.075 8.4 3.5 42.0 1
## 56 Golden hamster 1.00 0.120 14.4 3.9 16.0 1
## 57 Star nosed mole 1.00 0.060 10.3 3.5 NA 1
## 58 Mouse 0.40 0.023 13.2 3.2 19.0 1
## 59 Musk shrew 0.33 0.048 12.8 2.0 30.0 1
## 60 Big brown bat 0.30 0.023 19.7 19.0 35.0 1
## 61 Little brown bat 0.25 0.010 19.9 24.0 50.0 1
## 62 Lesser short-tailed shrew 0.14 0.005 9.1 2.6 21.5 2
we see that Man has a large brw for a small bw and that African elephant is so large that it massively influences the model. For Man we would expect a much lighter brain given its body weight. We can also see this from the residuals:
fit %>%
residuals()
## 1 2 3 5 6
## -485.5560450 106.2290991 -1.1334890 1434.2683263 -14.2173431
## 7 8 9 10 11
## -99.9739636 -382.2870299 -138.2708329 -222.1389154 103.2062362
## 12 14 15 16 17
## -222.9813136 -207.3502817 111.6721285 -299.3810807 119.0327425
## 18 22 23 24 25
## -21.7754459 93.7473544 109.8701136 -176.6595017 -263.2907467
## 26 27 28 29 30
## 17.7387559 47.0471717 127.4386770 -268.6348480 -96.8256901
## 32 33 34 37 38
## 185.4170567 -172.4144182 83.3979884 118.9373998 119.9390604
## 39 40 42 43 44
## 82.6791039 -53.5732527 -83.9801061 0.6414618 24.9406553
## 45 46 47 48 49
## -73.6937449 154.1785525 35.6757062 101.8866990 71.6287560
## 50 51 52 53 54
## -55.4961698 -88.3634693 -65.5082879 122.5071227 127.4900821
## 55 57 58 59 60
## -4.0547720 55.6039203 -101.3157666 127.8168865 -181.6354614
## 61
## 97.5209191
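To make the Man case concrete, an added illustration: compare the observed brain weight of Man (1320 g) with what the simple body-weight model predicts for a 62 kg body:
fit.bw <- mammalsleep %$% lm(brw ~ bw) #refit the bw-only model from before
predict(fit.bw, newdata = data.frame(bw = 62)) #far below the observed 1320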
We can also see it from the influence statistics:
fit %>%
influence()
## $hat
## 1 2 3 5 6 7
## 0.90376652 0.09680085 0.03820138 0.27344636 0.05622069 0.11707227
## 8 9 10 11 12 14
## 0.19346612 0.06299083 0.13658848 0.09859497 0.07869705 0.12546971
## 15 16 17 18 22 23
## 0.09609692 0.24598832 0.03973070 0.03019447 0.12099853 0.04617505
## 24 25 26 27 28 29
## 0.10988139 0.09296178 0.03795117 0.04105129 0.05488669 0.09677947
## 30 32 33 34 37 38
## 0.04907583 0.06803407 0.13168851 0.54221947 0.04692941 0.04828122
## 39 40 42 43 44 45
## 0.10241613 0.11430997 0.11037221 0.06662589 0.04627464 0.07663133
## 46 47 48 49 50 51
## 0.20760998 0.02767828 0.04517846 0.06359325 0.02915966 0.13663756
## 52 53 54 55 57 58
## 0.15726711 0.13674991 0.11930401 0.02387457 0.04466402 0.19821965
## 59 60 61
## 0.06384060 0.04501389 0.10433937
##
## $coefficients
## (Intercept) bw ts mls gt
## 1 -243.72029374 -7.891619e-01 8.300378873 1.961265e+00 1.213301e+00
## 2 25.21936417 1.413647e-03 -1.206573436 -7.335218e-02 -1.813002e-02
## 3 -0.08631166 -6.519435e-06 0.001850449 -8.986121e-05 6.447977e-05
## 5 -222.37630813 -3.864155e-03 9.482902916 1.328092e+00 1.088658e+00
## 6 1.23516961 1.979373e-04 -0.063616829 -4.345374e-03 -1.847447e-04
## 7 15.38832782 -5.432214e-04 -1.324156374 -9.241778e-02 8.203378e-04
## 8 60.48183044 2.901918e-02 -2.848182251 8.097885e-01 -3.426136e-01
## 9 8.08219132 -1.659263e-03 -0.708783091 -2.266736e-01 3.415038e-02
## 10 -14.83398360 5.099153e-03 0.549892324 -4.320579e-01 -5.267207e-02
## 11 -13.71999924 -9.170728e-04 0.867035841 -1.055549e-01 2.565440e-03
## 12 -1.05736367 2.688443e-03 0.579881544 5.396998e-02 -1.076290e-02
## 14 6.18986203 9.959258e-03 0.240514300 9.009225e-02 -8.913409e-02
## 15 26.18872034 1.394162e-03 -1.242033539 -8.959104e-02 -1.713371e-02
## 16 -50.31756815 -1.692566e-02 3.179504037 -1.672341e+00 3.352325e-01
## 17 10.32949936 1.392736e-03 -0.383105546 -4.747950e-02 -2.249698e-02
## 18 -0.98684647 2.228338e-04 0.015753756 1.986039e-02 -2.935611e-03
## 22 6.58455160 4.821783e-04 -0.638174756 1.542362e-02 -3.505366e-02
## 23 4.89650785 1.187494e-03 0.099911014 -4.859549e-02 -9.064587e-03
## 24 36.40960548 4.536888e-03 -2.056885018 -1.466931e-01 -4.167628e-02
## 25 -5.00633467 1.105221e-02 0.433175399 -2.350633e-02 -1.165109e-01
## 26 1.04073805 1.201451e-04 -0.010461765 5.052978e-03 -1.136219e-03
## 27 2.54661940 5.084744e-04 0.001183841 -2.497786e-03 -5.040769e-03
## 28 11.88649415 1.357163e-03 -0.633372764 -5.456064e-02 -3.247090e-02
## 29 -0.24983987 4.472249e-03 0.850871137 -2.243379e-01 -1.807907e-02
## 30 -10.57779705 -4.922710e-04 0.393768168 -5.105855e-02 5.179576e-03
## 32 28.53048895 3.570485e-03 -1.358210784 -9.209852e-02 -5.688596e-02
## 33 30.19122380 -5.338780e-04 -2.483945499 -2.252012e-01 -1.753357e-03
## 34 11.21681229 -1.485900e-04 -0.865126400 1.140465e+00 -5.130672e-02
## 37 9.71039304 1.351139e-03 -0.172191353 -6.465768e-02 -1.166554e-02
## 38 10.77073455 9.402313e-04 -0.228081945 -9.675266e-02 -5.032490e-03
## 39 -9.89672227 4.285150e-04 0.967966630 -2.807885e-02 4.079672e-03
## 40 5.96424958 1.452604e-03 -0.601268163 8.715142e-02 -2.837401e-02
## 42 14.81578562 2.318740e-03 -1.154080165 9.618280e-02 -3.784149e-02
## 43 -0.07314966 -1.065209e-05 0.004206938 -1.934365e-04 5.630308e-05
## 44 1.50944208 4.203137e-04 -0.014371516 1.266259e-02 -5.390016e-03
## 45 0.25180713 -1.320030e-03 0.064563528 -9.447721e-02 3.115094e-02
## 46 -4.27223545 5.971332e-03 -0.111275900 2.538248e-01 -1.365289e-01
## 47 0.53009765 2.395557e-04 0.037122001 4.450059e-03 -4.448297e-03
## 48 8.10621682 1.179136e-03 -0.141994812 -4.181767e-02 -1.091671e-02
## 49 12.33365624 7.470523e-04 -0.540374093 -1.717373e-02 -9.612321e-03
## 50 0.69553868 5.210781e-04 -0.047772953 -3.262922e-02 4.553270e-04
## 51 -12.34827888 3.811564e-03 0.555351409 2.515678e-01 -5.483223e-02
## 52 -12.25405271 2.786901e-03 0.603867004 1.979415e-01 -3.961636e-02
## 53 13.97782003 5.011877e-04 -1.167005941 -2.509449e-02 -4.491956e-02
## 54 8.86474762 7.004253e-04 -0.858771111 1.788077e-02 -4.676115e-02
## 55 -0.19784084 1.055703e-07 0.004440846 1.220367e-03 1.184715e-04
## 57 3.18619488 -5.497384e-05 0.004555809 -5.141468e-02 4.983695e-03
## 58 -28.56155358 3.739395e-03 1.395498566 2.787572e-01 -5.798662e-02
## 59 -9.63366731 4.197479e-05 0.961028516 -1.154272e-01 5.556994e-03
## 60 9.89309757 5.111625e-03 -0.700129056 9.372341e-02 -5.521970e-02
## 61 -11.66188823 3.311138e-04 1.152496607 -5.752401e-02 8.412294e-03
## sei
## 1 9.81613002
## 2 -2.72951278
## 3 0.01557169
## 5 -7.18465245
## 6 -0.32753933
## 7 -0.83130974
## 8 -5.89595125
## 9 -1.44964422
## 10 7.80997880
## 11 3.65565323
## 12 -4.01918629
## 14 -1.92346523
## 15 -2.85032581
## 16 -0.20737504
## 17 -0.03834421
## 18 0.14299652
## 22 2.72227838
## 23 -0.68195638
## 24 -4.55163332
## 25 3.82628293
## 26 -0.21961298
## 27 -0.38536066
## 28 1.10977706
## 29 -3.38753239
## 30 2.00304757
## 32 -0.58636471
## 33 -1.41665399
## 34 -5.85319446
## 37 -1.15244417
## 38 -1.42227624
## 39 0.64273176
## 40 0.41679480
## 42 -0.73648205
## 43 0.01688128
## 44 -0.17609909
## 45 -1.92202225
## 46 9.05021867
## 47 0.11983892
## 48 -0.98553762
## 49 -1.50855921
## 50 -0.34524738
## 51 2.63589409
## 52 2.24214991
## 53 3.12404097
## 54 3.66670950
## 55 0.01182831
## 57 -0.71478964
## 58 5.47730002
## 59 1.58781903
## 60 -0.63930534
## 61 0.71772803
##
## $sigma
## 1 2 3 5 6 7 8
## 135.04872 271.35637 271.87903 97.82349 271.87013 271.40553 264.19703
## 9 10 11 12 14 15 16
## 271.02492 269.47973 271.38474 269.61397 269.81643 271.30182 266.86449
## 17 18 22 23 24 25 26
## 271.26167 271.85865 271.46086 271.34960 270.40967 268.66571 271.86541
## 27 28 29 30 32 33 34
## 271.78259 271.15991 268.51889 271.46669 270.33284 270.44438 271.24331
## 37 38 39 40 42 43 44
## 271.25800 271.24658 271.56058 271.74361 271.54753 271.87906 271.85182
## 45 46 47 48 49 50 51
## 271.63315 270.62231 271.82437 271.42429 271.64998 271.74646 271.50082
## 52 53 54 55 57 58 59
## 271.66616 271.15145 271.10660 271.87838 271.74378 271.34345 271.14870
## 60 61
## 270.43130 271.43491
##
## $wt.res
## 1 2 3 5 6
## -485.5560450 106.2290991 -1.1334890 1434.2683263 -14.2173431
## 7 8 9 10 11
## -99.9739636 -382.2870299 -138.2708329 -222.1389154 103.2062362
## 12 14 15 16 17
## -222.9813136 -207.3502817 111.6721285 -299.3810807 119.0327425
## 18 22 23 24 25
## -21.7754459 93.7473544 109.8701136 -176.6595017 -263.2907467
## 26 27 28 29 30
## 17.7387559 47.0471717 127.4386770 -268.6348480 -96.8256901
## 32 33 34 37 38
## 185.4170567 -172.4144182 83.3979884 118.9373998 119.9390604
## 39 40 42 43 44
## 82.6791039 -53.5732527 -83.9801061 0.6414618 24.9406553
## 45 46 47 48 49
## -73.6937449 154.1785525 35.6757062 101.8866990 71.6287560
## 50 51 52 53 54
## -55.4961698 -88.3634693 -65.5082879 122.5071227 127.4900821
## 55 57 58 59 60
## -4.0547720 55.6039203 -101.3157666 127.8168865 -181.6354614
## 61
## 97.5209191
From the influence statistics we see that $sigma would drop if the first and the fifth case were omitted. The influence(fit)$coefficients output is equivalent to the dfbeta() return:
head(influence(fit)$coefficients)
## (Intercept) bw ts mls gt
## 1 -243.72029374 -7.891619e-01 8.300378873 1.961265e+00 1.213301e+00
## 2 25.21936417 1.413647e-03 -1.206573436 -7.335218e-02 -1.813002e-02
## 3 -0.08631166 -6.519435e-06 0.001850449 -8.986121e-05 6.447977e-05
## 5 -222.37630813 -3.864155e-03 9.482902916 1.328092e+00 1.088658e+00
## 6 1.23516961 1.979373e-04 -0.063616829 -4.345374e-03 -1.847447e-04
## 7 15.38832782 -5.432214e-04 -1.324156374 -9.241778e-02 8.203378e-04
## sei
## 1 9.81613002
## 2 -2.72951278
## 3 0.01557169
## 5 -7.18465245
## 6 -0.32753933
## 7 -0.83130974
fit %>%
dfbeta() %>%
head()
## (Intercept) bw ts mls gt
## 1 -243.72029374 -7.891619e-01 8.300378873 1.961265e+00 1.213301e+00
## 2 25.21936417 1.413647e-03 -1.206573436 -7.335218e-02 -1.813002e-02
## 3 -0.08631166 -6.519435e-06 0.001850449 -8.986121e-05 6.447977e-05
## 5 -222.37630813 -3.864155e-03 9.482902916 1.328092e+00 1.088658e+00
## 6 1.23516961 1.979373e-04 -0.063616829 -4.345374e-03 -1.847447e-04
## 7 15.38832782 -5.432214e-04 -1.324156374 -9.241778e-02 8.203378e-04
## sei
## 1 9.81613002
## 2 -2.72951278
## 3 0.01557169
## 5 -7.18465245
## 6 -0.32753933
## 7 -0.83130974
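As a final added aside (not part of the original practical), Cook's distance combines residual size and leverage into a single influence measure; a minimal sketch flagging cases above the common 4/n rule of thumb:
cd <- cooks.distance(fit) #Cook's distance per analyzed case
cd[cd > 4 / length(cd)] #cases exceeding 4/n, a rough flag for influence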
End of Practical.