12/9/22
All materials can be found at www.gerkovink.com/rijkR
I owe a debt of gratitude to many people, as the thoughts and teachings in my slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.
When external figures and other sources are shown:
Scientific references are in the footer.
Opinions are my own.
Packages used:
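A plausible set, inferred from the functions and data that appear in these slides (select(), the %>% and %$% pipes, the boys data and ggplot2), is sketched below. The exact list is an assumption.

```r
# Assumed package set, inferred from what the slides use
library(dplyr)     # select()
library(magrittr)  # %>% and %$% pipes
library(mice)      # boys data set
library(ggplot2)   # layered plotting
```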
We begin today with an exploration into statistical inference.
Truths are boring, but they are convenient.
Q3: Would that mean that if we simply observe every potential unit, we would be unbiased about the truth?
The problem is a bit larger
We have three entities at play here:
The more features we use, the more we capture about the outcome for the cases in the data
All these things are related to uncertainty. Our model can still yield biased results when fitted to \(\infty\) features. Our inference can still be wrong when obtained on \(\infty\) cases.
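A minimal simulation can illustrate the point about features. It is a sketch on made-up data, not part of the original slides: the outcome depends on x only, yet the in-sample R-squared never decreases as unrelated features are added. All object names (x, y, noise, r2) are hypothetical.

```r
set.seed(123)
n <- 50
x <- rnorm(n)
y <- 2 * x + rnorm(n)                    # the outcome depends on x only
noise <- matrix(rnorm(n * 20), n, 20)    # 20 features unrelated to y

# In-sample R-squared when using x plus the first k noise features
r2 <- sapply(0:20, function(k) {
  X <- cbind(x, noise[, seq_len(k), drop = FALSE])
  summary(lm(y ~ X))$r.squared
})

round(r2, 3)
```

The fit to the cases in the data keeps improving, but that says nothing about how well the model describes cases outside the data.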
Core assumption: all observations are bona fide
     age  hgt   wgt   bmi   hc  gen  phb tv   reg
3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
At the end of this lecture we aim to understand what happens in
It effectively replaces round(cor(select(boys, is.numeric), use = "pairwise.complete.obs"), digits = 3).
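The piped form that this nested call corresponds to can be sketched as follows. It assumes the boys data from the mice package and uses where(is.numeric), the currently recommended way of passing a predicate to select(). The object names nested and piped are hypothetical.

```r
library(dplyr)
library(magrittr)
library(mice)  # assumed source of the boys data

# Nested call: read from the inside out
nested <- round(cor(select(boys, where(is.numeric)),
                    use = "pairwise.complete.obs"),
                digits = 3)

# Piped call: read from top to bottom
piped <- boys %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(digits = 3)

all.equal(nested, piped)
```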
Benefit: a single object in memory that is easy to interpret
Your code becomes more readable:
f(x) becomes x %>% f()
f(x, y) becomes x %>% f(y)
    age  hgt  wgt   bmi   hc  gen  phb tv   reg
3 0.035 50.1 3.65 14.54 33.7 <NA> <NA> NA south
h(g(f(x))) becomes x %>% f %>% g %>% h
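The three rewrites can be checked on a small vector. This is a minimal sketch that needs only magrittr. The vector x and the functions chosen are arbitrary illustrations.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# f(x) becomes x %>% f()
identical(sqrt(x), x %>% sqrt())

# f(x, y) becomes x %>% f(y)
identical(round(x / 3, 2), (x / 3) %>% round(2))

# h(g(f(x))) becomes x %>% f %>% g %>% h
nested <- log(sum(sqrt(x)))
piped  <- x %>% sqrt %>% sum %>% log
identical(nested, piped)
```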
%>% pipe
%$% pipe
. in a pipe
In a %>% b(arg1, arg2, arg3), a will become arg1. With . we can change this.
VS
The . can be used as a placeholder in the pipe.
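A minimal sketch of the placeholder, using the built-in anscombe data so that only magrittr is needed. The model y1 ~ x1 is an arbitrary illustration.

```r
library(magrittr)

# Without a placeholder, the left-hand side becomes the first argument:
# this is the same as head(anscombe, 3)
first_arg <- anscombe %>% head(3)

# lm() expects a formula first and the data second,
# so the dot sends the left-hand side to the data argument
fit_piped  <- anscombe %>% lm(y1 ~ x1, data = .)
fit_nested <- lm(y1 ~ x1, data = anscombe)

all.equal(coef(fit_piped), coef(fit_nested))
```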
Welch Two Sample t-test
data: age by ovwgt
t = -15.971, df = 32.993, p-value < 2.2e-16
alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
95 percent confidence interval:
-9.393179 -7.270438
sample estimates:
mean in group FALSE  mean in group TRUE
           9.103392           17.435200
is the same as
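A minimal sketch of two equivalent ways to run a test of age by ovwgt. Two things are assumptions here: that the data are boys from the mice package, and that ovwgt is defined as bmi > 25. The object names piped, nested and boys_ovwgt are hypothetical.

```r
library(dplyr)
library(magrittr)
library(mice)  # assumed source of the boys data

# Exposition pipe: %$% makes the columns of the left-hand side
# available to the expression on the right
piped <- boys %>%
  mutate(ovwgt = bmi > 25) %$%   # assumed definition of ovwgt
  t.test(age ~ ovwgt)

# The same test without pipes
boys_ovwgt <- mutate(boys, ovwgt = bmi > 25)
nested <- with(boys_ovwgt, t.test(age ~ ovwgt))

all.equal(unname(piped$statistic), unname(nested$statistic))
```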
        age   hgt  wgt   bmi   hc  gen  phb tv   reg
7410 20.372 1.887 59.8 16.79 55.2 <NA> <NA> NA  west
7418 20.429 1.811 67.2 20.48 56.6 <NA> <NA> NA north
7444 20.761 1.891 88.0 24.60   NA <NA> <NA> NA  west
7447 20.780 1.935 75.4 20.13   NA <NA> <NA> NA  west
7451 20.813 1.890 78.0 21.83 59.9 <NA> <NA> NA north
7475 21.177 1.818 76.5 23.14   NA <NA> <NA> NA  east
ggplot2
anscombe data
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89
      x1       x2       x3       x4       y1       y2       y3       y4
9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
       y1     y2     y3     y4
x1  0.816  0.816  0.816 -0.314
x2  0.816  0.816  0.816 -0.314
x3  0.816  0.816  0.816 -0.314
x4 -0.529 -0.718 -0.345  0.817
       y1     y2     y3     y4
x1  5.501  5.500  5.497 -2.115
x2  5.501  5.500  5.497 -2.115
x3  5.501  5.500  5.497 -2.115
x4 -3.565 -4.841 -2.321  5.499
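Summaries like the three above (column means, correlations and covariances between the x and y columns) can be reproduced with base R alone. This is one possible way to compute them and not necessarily the code used on the slides. The object names are hypothetical.

```r
# anscombe ships with base R (package datasets)
x <- anscombe[, c("x1", "x2", "x3", "x4")]
y <- anscombe[, c("y1", "y2", "y3", "y4")]

means <- colMeans(anscombe)             # mean of every column
cors  <- round(cor(x, y), digits = 3)   # correlations of x columns with y columns
covs  <- round(cov(x, y), digits = 3)   # covariances of x columns with y columns

means
cors
covs
```

The four x/y pairs share nearly identical means, correlations and covariances, which is why the data need to be plotted before they are summarised.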
Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.
What is ggplot2?
Layered plotting based on the book The Grammar of Graphics by Leland Wilkinson.
Wilkinson, L. (2006). The Grammar of Graphics. Springer Science & Business Media.
With ggplot2 you provide the data, map variables to aesthetics, and state which geometric object to display; ggplot2 then takes care of the details.
1: Provide the data
2: map variable to aesthetics
3: state which geometric object to display
Create the plot
Add another layer (smooth fit line)
Give it some labels and a nice look
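The steps above can be sketched as follows. The choice of the anscombe data, the variables x1 and y1, the linear smoother and the theme are all assumptions made for illustration.

```r
library(ggplot2)

# 1: provide the data; 2: map variables to aesthetics
p <- ggplot(data = anscombe, mapping = aes(x = x1, y = y1)) +
  # 3: state which geometric object to display
  geom_point()

# Add another layer: a linear smooth fit line
p <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Give it some labels and a nice look
p <- p +
  labs(title = "Anscombe's first data set", x = "x1", y = "y1") +
  theme_minimal()

p
```

Because every layer is added with +, the plot can be built up and inspected one step at a time.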
Is the same as
Gerko Vink