Fundamental Techniques in Data Science with R

Last week

Last week we started with linear modeling. Today we revisit parts of last week's lecture.

Today

  • Assumptions
  • Generating data
  • Statistical inference

Packages and functions that we use

library(mvtnorm)  # Multivariate Normal tools
library(dplyr)    # Data manipulation
library(magrittr) # Pipes
library(ggplot2)  # Plotting suite
library(mice)     # Multiple imputation for incomplete data
  • set.seed(): Fixing the Random Number Generator seed

Assumptions

The key assumptions

There are four key assumptions underlying the use of linear regression models.

In short, we assume

  • The outcome has a linear relation with the predictors, and the predictor effects are additive.

    • the expected value for the outcome is a straight-line function of each predictor, given that the others are fixed.
    • the slope of each line does not depend on the values of the other predictors
    • the effects of the predictors on the expected value are additive

    \[ y = \alpha + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon\]

  • The residuals are statistically independent

  • The residual variance is constant

    • across the expected values
    • across any of the predictors
  • The residuals are normally distributed with mean \(\mu_\epsilon = 0\)
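
As a quick sketch (not necessarily the lecture's own example; the object name fit.anscombe is just illustrative), the standard diagnostic plots of a fitted lm object give a first impression of these assumptions:

fit.anscombe <- lm(y1 ~ x1, data = anscombe)  # anscombe ships with base R
par(mfrow = c(2, 2), cex = .6)
plot(fit.anscombe)  # residuals vs fitted, QQ-plot, scale-location, residuals vs leverage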

A simple example

\[y = \alpha + \beta X + \epsilon\] and \[\hat{y} = \alpha + \beta X\] This means that the following components in linear modeling

  • outcome \(y\)
  • predicted outcome \(\hat{y}\)
  • predictor \(X\)
  • residual \(\epsilon\)

can also be seen as columns in a data set.

As a data set

As a data set, this would be the result for lm(y1 ~ x1, data = anscombe)
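
A minimal sketch of how such a data set could be put together (the column names are just illustrative):

fit.anscombe <- lm(y1 ~ x1, data = anscombe)
data.frame(y        = anscombe$y1,
           y.hat    = fitted(fit.anscombe),
           x        = anscombe$x1,
           residual = residuals(fit.anscombe))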

An example

Residual variance

Violated assumptions #1

Here the residuals are not independent, and the residual variance is not constant!

Residual variance

Violated assumptions #2

Here the residuals do not have mean zero.

Residual variance

Violated assumptions #3

Here the residual variance is not constant, the residuals are not normally distributed, and the relation between \(Y\) and \(X\) is not linear!
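
As a hedged sketch (this is not the simulation behind the figures above; names and numbers are made up), data that violate several of these assumptions at once could be generated as follows:

set.seed(123)
x <- runif(100, 0, 10)
y <- 2 + .5 * x^2 + rchisq(100, df = 1) * x  # skewed errors whose spread grows with x
fit.bad <- lm(y ~ x)                         # a linear fit to a nonlinear relation
par(mfrow = c(1, 2), cex = .6)
plot(fit.bad, which = c(1, 2))               # curved residual pattern, non-normal QQ-plot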

Residual variance

Leverage and influence revisited

Outliers and influential cases

Leverage: see the fit line as a lever.

  • some points pull/push harder; they have more leverage

Standardized residuals:

  • The values that have more leverage tend to be closer to the line
  • The line is fit so as to be closer to them
  • The residual standard deviation can differ at different points on \(X\) - even if the error standard deviation is constant.
  • Therefore we standardize the residuals so that they have constant variance (assuming homoscedasticity).

Cook’s distance: how far the predicted values would move if your model were fit without the data point in question.

  • it is a function of the leverage and standardized residual associated with each data point

Fine

High leverage, low residual

Low leverage, high residual

High leverage, high residual

Outliers and influential cases

Outliers are cases with large \(\epsilon_z\) (standardized residuals).

If the model is correct we expect:

  • 5% of standardized residuals \(|\epsilon_z|>1.96\)
  • 1% of standardized residuals \(|\epsilon_z|>2.58\)
  • 0% of standardized residuals \(|\epsilon_z|>3.3\)

Influential cases are cases with large influence on parameter estimates

  • cases with Cook’s Distance \(> 1\), or
  • cases with Cook’s Distance much larger than the rest
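
As a sketch, assuming fit is the fitted lm object used on the next slide, both quantities can be computed directly:

ez <- rstandard(fit)                # standardized residuals
mean(abs(ez) > 1.96)                # about .05 expected if the model is correct
mean(abs(ez) > 2.58)                # about .01 expected
cd <- cooks.distance(fit)
any(cd > 1)                         # influential by the absolute cut-off?
head(sort(cd, decreasing = TRUE))   # cases that stand out relative to the rest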

Outliers and influential cases

par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))

These are the plots for the Violated Assumptions #3 scenario. There are many cases with unrealistically large \(|\epsilon_z|\), so these could be labeled as outliers. There are no cases with Cook’s Distance \(>1\), but case 72 stands out. Of course this model should have never been fit.

Generating data

Creating our own data

We saw that we could create vectors

c(1, 2, 3, 4, 3, 2, 1)
## [1] 1 2 3 4 3 2 1
c(1:4, 3:1)
## [1] 1 2 3 4 3 2 1

We could create matrices

matrix(1:15, nrow = 3, ncol = 5, byrow = TRUE)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15

Creating our own data

Vectors from matrices

mat <- matrix(1:18, 3, 6)
mat
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7   10   13   16
## [2,]    2    5    8   11   14   17
## [3,]    3    6    9   12   15   18
c(mat)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18

But we can also draw a sample

Of the same size,

sample(mat)
##  [1]  1  7 17 18 12 14  3 13  9 15  8 16 11 10  6  2  4  5

smaller size,

sample(mat, size = 5)
## [1] 16 14  3  5 15

or larger size

sample(mat, size = 50, replace = TRUE)
##  [1]  3 18 11 18  6  8  4  1 13  7  1 17  6 17  8  3 11  7 18  7 18 10  5 11  1
## [26]  6  3 16  8  3  8  4  5  1  6  8 16  5  1  5  8 16  2 16 10  2  4 10  9 11

We can do random sampling

hist(sample(mat, size = 50000, replace=TRUE), breaks = 0:18)

Or non-uniform (weighted) sampling

probs <- c(rep(.01, 15), .1, .25, .50)
probs
##  [1] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
## [16] 0.10 0.25 0.50
hist(sample(mat, size = 50000, prob = probs, replace=TRUE), breaks = 0:18)

We can replicate individual samples

set.seed(123)
sample(mat, size = 50, replace = TRUE)
##  [1] 15 14  3 10 18 11  5 14  5  9  3  8  7 10  9  4 14 17 11  7 12 15 10 13  7
## [26]  9  9 10  7  6  2  5  8 12 13 18  1  6 15  9 15 16  6 11  8  7 16 17 18 17
set.seed(123)
sample(mat, size = 50, replace = TRUE)
##  [1] 15 14  3 10 18 11  5 14  5  9  3  8  7 10  9  4 14 17 11  7 12 15 10 13  7
## [26]  9  9 10  7  6  2  5  8 12 13 18  1  6 15  9 15 16  6 11  8  7 16 17 18 17

We can replicate a chain of samples

set.seed(123)
sample(mat, size = 5, replace = TRUE)
## [1] 15 14  3 10 18
sample(mat, size = 7, replace = TRUE)
## [1] 11  5 14  5  9  3  8
set.seed(123)
sample(mat, size = 5, replace = TRUE)
## [1] 15 14  3 10 18
sample(mat, size = 7, replace = TRUE)
## [1] 11  5 14  5  9  3  8

The random seed

The random seed is a number used to initialize the pseudo-random number generator

If replication is needed, pseudorandom number generators must be used

  • Pseudorandom number generators generate a sequence of numbers
  • The properties of generated number sequences approximate the properties of random number sequences
  • Pseudorandom number generators are not truly random, because the process is determined by an initial value.

The initial value (the seed) itself does not need to be random.

  • The resulting process still behaves as random, because the seed value itself does not generate the randomness
  • It merely forms the starting point of the algorithm whose output behaves as a random sequence.

Why fix the random seed

When an R instance is started, there is initially no seed. In that case, R will create one from the current time and process ID.

  • Hence, different sessions will give different results when random numbers are involved.
  • When you store a workspace and reopen it, you will continue from the seed specified within the workspace.

If we fix the random seed we can exactly replicate the random process

If the method has not changed: the results of the process will be identical when using the same seed.

  • Replication allows for verification
  • But beware: the process depends on the seed
    • The results obtained could theoretically be extremely rare and might not have occurred with another seed
    • Run another seed before publishing your results

Random seeds


Random processes


Drawing data

We can draw data from a standard normal distribution

hist(rnorm(1000, mean = 0, sd = 1))

Drawing data

We can draw data from a specific normal distribution

hist(rnorm(1000, mean = 50, sd = 15))

Drawing data: many values

hist(rnorm(100000, mean = 50, sd = 15))
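
The mvtnorm package loaded at the start can draw correlated data from a multivariate normal distribution. A minimal sketch (the means, standard deviations and correlation are made up for illustration):

library(mvtnorm)
set.seed(123)
sigma <- matrix(c(225, 112.5,
                  112.5, 225), nrow = 2)  # both sd 15, correlation .5
draws <- rmvnorm(1000, mean = c(50, 100), sigma = sigma)
colMeans(draws)  # close to the specified means
cor(draws)       # off-diagonal close to .5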

A review of inference concepts

Basic concepts

Parameters and statistics

Random sample

  • set of values drawn independently from a population

Population parameter

  • Property of the population (e.g. \(\mu,\sigma,\rho\))

Sample statistic

  • Property of the sample (e.g. \(\bar{x},s,r\))

Sampling distributions

The sampling distribution is the distribution of a sample statistic

Central Limit Theorem

  • Given a distribution with mean \(\mu\) and standard deviation \(\sigma\), the sampling distribution of the mean approaches a normal distribution with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\) as \(n\to\infty\)

Conditions for the CLT to hold

  • \(\mu\) and \(\sigma\) exist and are finite
  • the sample size \(n\) is sufficiently large
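
A quick way to see the CLT at work is to simulate it. A minimal sketch using an arbitrary skewed population (the exponential distribution with mean 1):

set.seed(123)
means <- replicate(1000, mean(rexp(30, rate = 1)))  # 1000 sample means, n = 30 each
hist(means)  # roughly normal around mu = 1, with spread sigma / sqrt(30)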

Hypothesis tests and confidence intervals

Hypothesis tests

Statistical tests work with a null hypothesis, e.g.: \[ H_0:\mu=\mu_0 \]

  • If population parameters \(\mu\) and \(\sigma\) are unknown, we need to use sample mean \(\bar{x}\) and sample standard deviation \(s\) as estimators
  • as a consequence, the test statistic for the sample mean follows a \(t\)-distribution

\[ t_{(df)}=\frac{\bar{x}-\mu_0}{SEM} \]
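
As a hedged sketch with made-up numbers, the statistic can be computed by hand and compared with t.test():

set.seed(123)
x   <- rnorm(25, mean = 52, sd = 10)  # made-up sample
mu0 <- 50
SEM <- sd(x) / sqrt(length(x))
(mean(x) - mu0) / SEM                 # t statistic by hand
t.test(x, mu = mu0)                   # same statistic, df = n - 1, and a p-value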

\(t\)-distribution

The \(t\)-distribution has larger variance than the normal distribution (more uncertainty).

  • degrees of freedom: \(df=n-\) number of estimated means
  • the higher the degrees of freedom (the larger the sample) the more it resembles a normal distribution

\(t\)-distribution

curve(dt(x, 100), -3, 3, ylab = "density")
curve(dt(x,   2), -3, 3, ylab = "", add = T, col = "red")
curve(dt(x,   1), -3, 3, ylab = "", add = T, col = "blue")
legend(1.8, .4, c("t(df=100)", "t(df=2)", "t(df=1)"), 
       col = c("black", "red", "blue"), lty=1)

\(t\) versus normal

curve(dnorm(x), -3, 3, ylab = "density")
curve(dt(x, 2), -3, 3, ylab = "", add = T, col = "red")
curve(dt(x, 1), -3, 3, ylab = "", add = T, col = "blue")
legend(1.8, .4, c("normal", "t(df=2)", "t(df=1)"), 
       col = c("black", "red", "blue"), lty=1)

\(p\)-values

The \(p\)-value in this situation is the probability that \(\bar{x}\) is at least that much different from \(\mu_0\):

  • \(P(\bar{X} \geq \bar{x} \mid \mu = \mu_0)\)

We would reject \(H_0\) if \(p\) is smaller than the experimenters’ (that would be you) predetermined significance level \(\alpha\):

  • two-sided test if \(H_A:\mu\neq\mu_0\) (upper and lower (\(\alpha\)/2) * 100%)
  • one-sided test if \(H_A:\mu>\mu_0\) or \(H_A:\mu<\mu_0\) (upper or lower \(\alpha\) * 100%)
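
A minimal sketch of computing such tail probabilities with pt(); the value t = -1.559 matches the critical value in the figure below:

t.stat <- -1.559          # observed (or critical) t value
df     <- 10
pt(t.stat, df)            # one-sided lower-tail probability, about .075
2 * pt(-abs(t.stat), df)  # two-sided probability, about .15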

\(p\)-values

Example of two-sided test for \(t_{(df=10)}\) given that \(P(t<-1.559)=7.5\%\) (\(\alpha\) = 0.15)

\(p\)-values: code for the figure

t0      <- qt(.075, 10)

cord.x1 <- c(-3, seq(-3, t0, 0.01), t0)
cord.y1 <- c(0, dt(seq(-3, t0, 0.01), 10), 0)
cord.x2 <- c(-t0, seq(-t0, 3, 0.01), 3)
cord.y2 <- c(0, dt(seq(-t0, 3, 0.01), 10), 0) 

curve(dt(x, 10), xlim = c(-3, 3), ylab = "density", main = '', xlab = "t-value")
polygon(cord.x1, cord.y1, col = 'red')
polygon(cord.x2, cord.y2, col = 'red')

Misconceptions

  1. The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. It is not connected to either.

  2. The p-value is not the probability that a finding is “merely a fluke.” In fact, the calculation of the p-value is based on the assumption that every finding is the product of chance alone.

  3. The p-value is not the probability of falsely rejecting the null hypothesis.

  4. The p-value is not the probability that replicating the experiment would yield the same conclusion.

  5. The significance level, \(\alpha\), is not determined by the p-value. The significance level is decided by the experimenter a-priori and compared to the p-value that is obtained a-posteriori.

  6. The p-value does not indicate the size or importance of the observed effect; the p-value, the effect size, and the sample size are related to one another.

95% confidence interval

If an infinite number of samples were drawn and CIs computed, then the true population mean \(\mu\) would be in 95% of these intervals

\[ 95\%~CI=\bar{x}\pm{t}_{(1-\alpha/2)}\cdot SEM \]

Example

x.bar <- 7.6 # sample mean
SEM   <- 2.1 # standard error of the mean
n     <- 11 # sample size
df    <- n-1 # degrees of freedom
alpha <- .15 # significance level (alpha = .15 gives an 85% interval; use .05 for a 95% CI)
t.crit <- qt(1 - alpha / 2, df) # t(1 - alpha / 2) for df = 10
c(x.bar - t.crit * SEM, x.bar + t.crit * SEM) 
## [1]  4.325605 10.874395

    Neyman, J. (1934). On the Two Different Aspects of the Representative Method:
    The Method of Stratified Sampling and the Method of Purposive Selection.
    Journal of the Royal Statistical Society, 97(4), 558-625.

Misconceptions

Confidence intervals are frequently misunderstood; even well-established researchers sometimes misinterpret them.

  1. A realised 95% CI does not mean:
  • that there is a 95% probability the population parameter lies within the interval

  • that there is a 95% probability that the interval covers the population parameter

    Once an experiment is done and an interval is calculated, the interval either covers, or does not cover the parameter value. Probability is no longer involved.

    The 95% probability only has to do with the estimation procedure.

  2. A 95% confidence interval does not mean that 95% of the sample data lie within the interval.
  3. A confidence interval is not a range of plausible values for the sample mean, though it may be understood as an estimate of plausible values for the population parameter.
  4. A particular confidence interval of 95% calculated from an experiment does not mean that there is a 95% probability of a sample mean from a repeat of the experiment falling within this interval.

Confidence intervals

100 simulated samples from a population with \(\mu = 0\) and \(\sigma^2=1\). Out of 100 samples, only 5 samples have confidence intervals that do not cover the population mean.
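
A hedged sketch of such a simulation (the sample size of 50 per sample is an arbitrary choice here):

set.seed(123)
covered <- replicate(100, {
  x  <- rnorm(50)                           # a sample from N(0, 1)
  ci <- t.test(x, conf.level = .95)$conf.int
  ci[1] < 0 & 0 < ci[2]                     # does the interval cover mu = 0?
})
sum(covered)  # expected to be close to 95 out of 100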

To conclude