Fundamental Techniques in Data Science with R
Last week we started with linear modeling. Today we revisit parts of last week's lecture.
library(mvtnorm)  # Multivariate Normal tools
library(dplyr)    # data manipulation
library(magrittr) # pipes
library(ggplot2)  # plotting
library(mice)     # incomplete data analysis
set.seed(): Fixing the Random Number Generator Seed

There are four key assumptions about the use of linear regression models.
In short, we assume
The outcome has a linear relation with the predictors, and the predictor effects are additive.
\[ y = \alpha + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon\]
The residuals are statistically independent
The residual variance is constant
The residuals are normally distributed with mean \(\mu_\epsilon = 0\)
\[y = \alpha + \beta X + \epsilon\] and \[\hat{y} = \alpha + \beta X\] As a result, the following components in linear modeling can also be seen as columns in a data set.
As a data set, this would be the result for lm(y1 ~ x1, data = anscombe)
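A minimal sketch of recovering these columns from the fitted model, using the built-in `anscombe` data mentioned above:

```r
# Fit the model and inspect the modeling components as columns
fit <- lm(y1 ~ x1, data = anscombe)
head(data.frame(y        = anscombe$y1,    # observed outcome
                y.hat    = fitted(fit),    # predicted values
                residual = residuals(fit)) # y - y.hat
     , 3)
```

Note that the observed outcome decomposes exactly into the fitted values plus the residuals.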
Here the residuals are not independent, and the residual variance is not constant!
Here the residuals do not have mean zero.
Here the residual variance is not constant, the residuals are not normally distributed and the relation between \(Y\) and \(X\) is not linear!
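A short sketch of simulating such a violation; the data-generating process below is an assumption for illustration, not the lecture's own data:

```r
# Assumed scenario: quadratic relation (non-linear) with noise that grows with x
set.seed(123)
x <- runif(100, 0, 10)
y <- 2 + x^2 + rnorm(100, sd = x)  # variance increases with x
fit <- lm(y ~ x)
plot(fit, which = 1)               # residuals vs fitted: curvature and funneling
```

The residuals-vs-fitted plot then shows both the curvature (non-linearity) and the funnel shape (non-constant variance).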
Leverage: see the fitted line as a lever; cases with extreme predictor values exert more pull on it.
Standardized residuals: the residuals divided by an estimate of their standard deviation, putting them on a common scale.
Cook’s distance: how far the predicted values would move if your model were fit without the data point in question.
Outliers are cases with large \(\epsilon_z\) (standardized residuals).
If the model is correct, we expect the standardized residuals to behave like draws from a standard normal distribution (e.g. roughly 95% within \(\pm 2\)).
Influential cases are cases with large influence on parameter estimates
par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))
These are the plots for the Violated Assumptions #3 scenario. There are many cases with unrealistically large \(|e_z|\), so these could be labeled as outliers. There are no cases with Cook's Distance \(>1\), but case 72 stands out. Of course, this model should never have been fitted.
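The quantities behind these plots can also be extracted numerically. A sketch, refitting the `anscombe` model for self-containment (the lecture's `fit` object belongs to a different scenario):

```r
fit <- lm(y1 ~ x1, data = anscombe)
e.z <- rstandard(fit)        # standardized residuals
cd  <- cooks.distance(fit)   # Cook's distance
h   <- hatvalues(fit)        # leverage
which(abs(e.z) > 2)          # candidate outliers
which(cd > 1)                # influential cases, by the common rule of thumb
```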
We saw that we could create vectors
c(1, 2, 3, 4, 3, 2, 1)
## [1] 1 2 3 4 3 2 1
c(1:4, 3:1)
## [1] 1 2 3 4 3 2 1
We could create matrices
matrix(1:15, nrow = 3, ncol = 5, byrow = TRUE)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
Vectors from matrices
mat <- matrix(1:18, 3, 6)
mat
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7   10   13   16
## [2,]    2    5    8   11   14   17
## [3,]    3    6    9   12   15   18
c(mat)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
We can also draw random samples from the matrix: of the same size,
sample(mat)
## [1] 1 7 17 18 12 14 3 13 9 15 8 16 11 10 6 2 4 5
smaller size,
sample(mat, size = 5)
## [1] 16 14 3 5 15
or larger size
sample(mat, size = 50, replace = TRUE)
##  [1]  3 18 11 18  6  8  4  1 13  7  1 17  6 17  8  3 11  7 18  7 18 10  5 11  1
## [26]  6  3 16  8  3  8  4  5  1  6  8 16  5  1  5  8 16  2 16 10  2  4 10  9 11
hist(sample(mat, size = 50000, replace=TRUE), breaks = 0:18)
probs <- c(rep(.01, 15), .1, .25, .50)
probs
##  [1] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
## [16] 0.10 0.25 0.50
hist(sample(mat, size = 50000, prob = probs, replace=TRUE), breaks = 0:18)
set.seed(123)
sample(mat, size = 50, replace = TRUE)
##  [1] 15 14  3 10 18 11  5 14  5  9  3  8  7 10  9  4 14 17 11  7 12 15 10 13  7
## [26]  9  9 10  7  6  2  5  8 12 13 18  1  6 15  9 15 16  6 11  8  7 16 17 18 17
set.seed(123)
sample(mat, size = 50, replace = TRUE)
##  [1] 15 14  3 10 18 11  5 14  5  9  3  8  7 10  9  4 14 17 11  7 12 15 10 13  7
## [26]  9  9 10  7  6  2  5  8 12 13 18  1  6 15  9 15 16  6 11  8  7 16 17 18 17
set.seed(123)
sample(mat, size = 5, replace = TRUE)
## [1] 15 14 3 10 18
sample(mat, size = 7, replace = TRUE)
## [1] 11 5 14 5 9 3 8
set.seed(123)
sample(mat, size = 5, replace = TRUE)
## [1] 15 14 3 10 18
sample(mat, size = 7, replace = TRUE)
## [1] 11 5 14 5 9 3 8
The random seed is a number used to initialize the pseudo-random number generator
If replication is needed, pseudorandom number generators must be used
The initial value (the seed) itself does not need to be random.
When an R instance is started, there is initially no seed; R will then create one from the current time and process ID.
If we fix the random seed we can exactly replicate the random process
If the method has not changed:
the results of the process will be identical when using the same seed.
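A minimal check of this replication property:

```r
# Same seed, same pseudo-random draws
set.seed(123); a <- rnorm(5)
set.seed(123); b <- rnorm(5)
identical(a, b)  # TRUE
```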
We can draw data from a standard normal distribution
hist(rnorm(1000, mean = 0, sd = 1))
We can draw data from a specific normal distribution
hist(rnorm(1000, mean = 50, sd = 15))
hist(rnorm(100000, mean = 50, sd = 15))
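We can verify that the empirical mean and standard deviation of such a draw approach the population values:

```r
# With n = 100000 the sample statistics sit close to the parameters
x <- rnorm(100000, mean = 50, sd = 15)
c(mean = mean(x), sd = sd(x))  # both close to 50 and 15
```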
Random sample
Population parameter
Sample statistic
The sampling distribution is the distribution of a sample statistic
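A sketch simulating the sampling distribution of the sample mean; the population values 50 and 15 and the sample size 25 are chosen arbitrarily for illustration:

```r
# 1000 samples of size 25; the means form the sampling distribution
set.seed(123)
means <- replicate(1000, mean(rnorm(25, mean = 50, sd = 15)))
hist(means)
sd(means)  # approximates the SEM: 15 / sqrt(25) = 3
```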
Conditions for the CLT to hold
Statistical tests work with a null hypothesis, e.g.: \[ H_0:\mu=\mu_0 \]
\[ t_{(df)}=\frac{\bar{x}-\mu_0}{SEM} \]
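The \(t\)-statistic can be computed by hand and checked against `t.test()`; the data below are simulated under an assumed alternative, purely for illustration:

```r
set.seed(123)
x <- rnorm(20, mean = 52, sd = 10)                        # simulated sample
t.by.hand <- (mean(x) - 50) / (sd(x) / sqrt(length(x)))   # (x.bar - mu0) / SEM
t.by.hand
t.test(x, mu = 50)$statistic  # same value
```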
The \(t\)-distribution has heavier tails than the normal distribution (more uncertainty).
curve(dt(x, 100), -3, 3, ylab = "density")
curve(dt(x, 2), -3, 3, ylab = "", add = TRUE, col = "red")
curve(dt(x, 1), -3, 3, ylab = "", add = TRUE, col = "blue")
legend(1.8, .4, c("t(df=100)", "t(df=2)", "t(df=1)"),
       col = c("black", "red", "blue"), lty = 1)
curve(dnorm(x), -3, 3, ylab = "density")
curve(dt(x, 2), -3, 3, ylab = "", add = TRUE, col = "red")
curve(dt(x, 1), -3, 3, ylab = "", add = TRUE, col = "blue")
legend(1.8, .4, c("normal", "t(df=2)", "t(df=1)"),
       col = c("black", "red", "blue"), lty = 1)
The \(p\)-value in this situation is the probability that \(\bar{x}\) is at least that much different from \(\mu_0\):
We would reject \(H_0\) if \(p\) is smaller than the experimenters’ (that would be you) predetermined significance level \(\alpha\):
Example of two-sided test for \(t_{(df=10)}\) given that \(P(t<-1.559)=7.5\%\) (\(\alpha\) = 0.15)
t0 <- qt(.075, 10)
cord.x1 <- c(-3, seq(-3, t0, 0.01), t0)
cord.y1 <- c(0, dt(seq(-3, t0, 0.01), 10), 0)
cord.x2 <- c(-t0, seq(-t0, 3, 0.01), 3)
cord.y2 <- c(0, dt(seq(-t0, 3, 0.01), 10), 0)
curve(dt(x, 10), xlim = c(-3, 3), ylab = "density", main = '', xlab = "t-value")
polygon(cord.x1, cord.y1, col = 'red')
polygon(cord.x2, cord.y2, col = 'red')
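The shaded two-sided p-value can be computed directly from the \(t\)-distribution:

```r
# Two tails of 7.5% each around t = -1.559 with df = 10
2 * pt(-1.559, df = 10)  # roughly 0.15
```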
The p-value is not the probability that the null hypothesis is true, nor the probability that the alternative hypothesis is false; it measures neither.
The p-value is not the probability that a finding is “merely a fluke.” In fact, the calculation of the p-value is based on the assumption that every finding is the product of chance alone.
The p-value is not the probability of falsely rejecting the null hypothesis.
The p-value is not the probability that replicating the experiment would yield the same conclusion.
The significance level, \(\alpha\), is not determined by the p-value. The significance level is decided by the experimenter a-priori and compared to the p-value that is obtained a-posteriori.
The p-value does not indicate the size or importance of the observed effect; p-value, effect size, and sample size are interrelated, so a small p-value can accompany a trivial effect in a large sample.
If an infinite number of samples were drawn and CIs computed, then the true population mean \(\mu\) would be covered by 95% of these intervals.
\[ 95\%~CI=\bar{x}\pm{t}_{(1-\alpha/2)}\cdot SEM \]
Example
x.bar  <- 7.6             # sample mean
SEM    <- 2.1             # standard error of the mean
n      <- 11              # sample size
df     <- n - 1           # degrees of freedom
alpha  <- .15             # significance level, so this is an 85% CI
t.crit <- qt(1 - alpha / 2, df)  # t(1 - alpha / 2) for df = 10
c(x.bar - t.crit * SEM, x.bar + t.crit * SEM)
## [1] 4.325605 10.874395
Neyman, J. (1934). On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection. Journal of the Royal Statistical Society, 97(4), 558-625.
Confidence intervals are frequently misunderstood; even well-established researchers sometimes misinterpret them.
that there is a 95% probability the population parameter lies within the interval
that there is a 95% probability that the interval covers the population parameter
Once an experiment is done and an interval is calculated, the interval either covers, or does not cover the parameter value. Probability is no longer involved.
The 95% probability only has to do with the estimation procedure.
100 simulated samples from a population with \(\mu = 0\) and \(\sigma^2=1\). Out of 100 samples, only 5 samples have confidence intervals that do not cover the population mean.
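A sketch of such a coverage simulation; the sample size of 50 is an assumption, as the figure's exact setup is not shown:

```r
# Draw 100 samples from N(0, 1) and count how many 95% CIs cover mu = 0
set.seed(123)
covered <- replicate(100, {
  x  <- rnorm(50)            # sample from the population
  ci <- t.test(x)$conf.int   # default 95% CI for the mean
  ci[1] < 0 & 0 < ci[2]      # does the interval cover mu = 0?
})
sum(covered)  # typically close to 95
```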