Supervised Learning and Visualization

This lecture

  1. Course pages
  2. Course overview
  3. Introduction to SLV
  4. (Dark) Data Science
  5. Data Wrangling
  6. Wrap-up

Disclaimer

I owe a debt of gratitude to many people, as the thoughts and teachings in my slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.

When external figures and other sources are shown:

  1. the references are included when the origin is known, or
  2. the objects are directly linked from within the public domain and the source can be obtained by right-clicking the objects.

Opinions are my own.

I am a statistician

Procedural stuff

  • If there is anything important - contact me!

  • The on-location lectures will not be recorded.

    • If you are ill, ask your classmates to cover for you.
  • If you feel that you are stuck, and the wait for the Q&A session is too long: open a GitHub issue here.

    • You are most likely not the only one with that question. You are simply the bravest or the first.
    • Do not contact us via private chat or e-mail for content-related questions.
    • Use a reprex to detail your issue when code is involved.
  • If you expect that you are going to miss some part(s) of the course, please notify me via a private MS-Teams message or e-mail.

Course pages

You can find all materials at the following location:

https://www.gerkovink.com/slv/


All course materials should be submitted through a pull request from your fork of

https://github.com/gerkovink/INFOMDA1-2022


The structure of your submissions should follow the corresponding repo’s README. To make it simple, I have added an example for the first practical. If you are unfamiliar with GitHub, forking and/or pull requests, please study this exercise from one of my other courses. There you can find video walkthroughs that detail the process.

Course overview

Team

Topics

Week | Focus                                                         | Teacher | Materials
1    | Data wrangling with R                                         | GV      | R4DS, ISLR
2    | The grammar of graphics                                       | GV      | R4DS
3    | Exploratory data analysis                                     | GV      | R4DS, FIMD
4    | Statistical learning: regression                              | MC      | ISLR, TBD
5    | Statistical learning: classification                          | EJvK    | ISLR, TBD
6    | Classification model evaluation                               | EJvK    | ISLR, TBD
7    | Nonlinear models                                              | MC      | ISLR, TBD
8    | Bagging, boosting, random forest and support vector machines  | MC      | ISLR, TBD

Course Setup

Each week we have the following:

  • 1 Lecture on Monday @ 9am in BBG 201
  • 1 Practical (not graded, but it must be submitted to pass). Hand in the practical before the next lecture.
  • 1 combined workgroup and Q&A session in BBG 106
  • Course materials to study. See the corresponding week on the course page.

Twice we have:

  • Group assignments
  • Each assignment is made in teams of 3-4 students.
  • Each assignment counts towards 25% of the total grade. Must be > 5.5 to pass.

Once we have:

  • Individual exam
  • BYOD: so charge and bring your laptop.
  • 50% of total grade. Must be > 5.5 to pass.

Groups

We will make groups on Wednesday Sept 14!

Introduction to SLV

Terms I may use

  • TDGM: True data generating model
  • DGP: Data generating process, closely related to the TDGM, but with all the wacky additional uncertainty
  • Truth: The comparative truth that we are interested in
  • Bias: The distance to the comparative truth
  • Variance: When not everything is the same
  • Estimate: Something that we calculate or guess
  • Estimand: The thing we aim to estimate and guess
  • Population: That larger entity without sampling variance
  • Sample: The smaller thing with sampling variance
  • Incomplete: There exists a more complete version, but we don’t have it
  • Observed: What we have
  • Unobserved: What we would also like to have

Some statistics

At the start

We begin this course series with a bit of statistical inference.

Statistical inference is the process of drawing conclusions from truths

Truths are boring, but they are convenient.

  • However, for most problems truths require a lot of calculations, tallying, or a complete census.
  • Therefore, a proxy of the truth is in most cases sufficient.
  • An example of such a proxy is a sample.
  • Samples are widely used and have been for a long time. See Jelke Bethlehem’s CBS discussion paper for an overview of the history of sampling within survey statistics.

Being wrong about the truth

  • The population is the truth
  • The sample comes from the population, but is generally smaller in size
  • This means that not all cases from the population can be in our sample
  • If not all information from the population is in the sample, then our sample may be wrong


    Q1: Why is it important that our sample is not wrong?
    Q2: How do we know that our sample is not wrong?

Solving the missingness problem

  • There are many flavours of sampling
  • If we give every unit in the population the same probability to be sampled, we do random sampling
  • The convenience of random sampling is that the missingness problem can be ignored (a small sampling sketch follows this list)
  • The missingness problem would in this case be: not every unit in the population has been observed in the sample
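
To make random sampling concrete, here is a minimal R sketch with a made-up population; both the population and the sample size are illustrative assumptions, not course material:

```r
set.seed(123)                      # for reproducibility

# A made-up finite "population" of 10,000 incomes
population <- rgamma(10000, shape = 2, scale = 1000)

# Simple random sampling: every unit has the same probability of selection
s <- sample(population, size = 100, replace = FALSE)

mean(population)  # the "truth" we are after
mean(s)           # our estimate, subject to sampling variance
```

Repeating the last two lines with a fresh sample shows the sampling variance in action: the estimate moves around the truth from sample to sample.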




Q3: Would that mean that if we simply observe every potential unit, we would be unbiased about the truth?

Sidestep

  • The problem is a bit larger

  • We have three entities at play here:

    1. The truth we’re interested in
    2. The proxy that we have (e.g. sample)
    3. The model that we’re running
  • The more features we use, the more we capture about the outcome for the cases in the data

  • The more cases we have, the more we approach the true information


    All these things are related to uncertainty. Our model can still yield biased results when fitted to \(\infty\) features. Our inference can still be wrong when obtained on \(\infty\) cases.

Core assumption: all observations are bonafide

Uncertainty simplified

When we do not have all information …

  1. We need to accept that we are probably wrong
  2. We just have to quantify how wrong we are


In some cases we estimate that we are only a bit wrong. In other cases we estimate that we could be very wrong. This is the purpose of testing.

The uncertainty measures about our estimates can be used to create intervals

Rumsfeld moment of fame in statistics

Confidence intervals

Confidence intervals can be hugely informative!

If we draw 100 samples from a population and compute a 95% CI for each, then the CI should cover the population value in at least 95 out of 100 samples.

  • If the coverage <95: bad estimation process with risk of errors and invalid inference
  • If the coverage >95: inefficient estimation process, but correct conclusions and valid inference. Lower statistical power.
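
A small simulation sketch in R illustrates this coverage statement; the population values and sample size below are made-up assumptions for the illustration:

```r
set.seed(123)
mu    <- 5        # population mean (the value we hope to cover)
sigma <- 2        # population standard deviation
n     <- 50       # sample size per draw

covered <- replicate(100, {
  x  <- rnorm(n, mean = mu, sd = sigma)        # draw one sample
  ci <- t.test(x, conf.level = 0.95)$conf.int  # 95% CI for the mean
  ci[1] <= mu & mu <= ci[2]                    # does the interval cover mu?
})

sum(covered)      # should be close to 95 out of 100
```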

The other type of intervals

Prediction intervals can also be hugely informative!

Prediction intervals are generally wider than confidence intervals

  • This is because they cover the inherent uncertainty in an individual data point on top of the sampling uncertainty
  • Just like CIs, PIs become narrower at locations where more information is observed (less uncertainty)
  • Usually this is at the location of the mean of the predicted values.


Narrower intervals mean less uncertainty. It does not mean less bias!
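
The difference between the two interval types is easy to see with predict() on a linear model; the simulated data below are purely illustrative:

```r
set.seed(123)
x   <- runif(100, 0, 10)
y   <- 2 + 0.5 * x + rnorm(100, sd = 1)
fit <- lm(y ~ x)

new <- data.frame(x = c(1, 5, 9))

# Confidence interval: uncertainty about the mean response at each x
predict(fit, newdata = new, interval = "confidence")

# Prediction interval: adds the uncertainty of an individual observation,
# so it is wider than the confidence interval at every x
predict(fit, newdata = new, interval = "prediction")
```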

The holy trinity

Whenever I evaluate something, I tend to look at three things:

  • bias (how far from the truth)
  • uncertainty/variance (how wide is my interval)
  • coverage (how often do I cover the truth with my interval)


As a function of model complexity in a specific modeling effort, these components play a role in the bias-variance tradeoff.

Now with missingness

We now have a new problem:

  • We do not have the whole truth, but merely a sample of the truth.
  • We do not even have the whole sample, but merely a sample of the sample of the truth.

Q4. What would be a simple solution that allows for valid inferences on the incomplete sample?
Q5. Would that solution work in practice?


The statistical solution

There are two sources of uncertainty that we need to cover:

  1. Uncertainty about the missing value:
    when we don’t know what the true observed value should be, we must create a distribution of values with proper variance (uncertainty).
  2. Uncertainty about the sampling:
    nothing can guarantee that our sample is the one true sample. So it is reasonable to assume that the parameters obtained on our sample are biased.


This becomes more challenging if the sample is not randomly drawn from the population, or if the feature set is too limited to solve for the substantive model of interest.
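
Multiple imputation is one common way to carry both sources of uncertainty into the analysis. Below is a minimal sketch with the R package mice on its built-in nhanes data; the analysis model is just an example. The spread between the m completed data sets reflects the uncertainty about the missing values, and pooling via Rubin's rules combines it with the usual sampling uncertainty:

```r
library(mice)

imp  <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)  # 5 multiply imputed data sets
fits <- with(imp, lm(chl ~ age + bmi))                      # analyze each completed data set
summary(pool(fits))                                         # Rubin's rules: pooled estimates and SEs
```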

Now how do we know we did well?

I’m really sorry, but:
We don’t. In practice we may often lack the necessary comparative truths!

For example:

  1. Predicting a future response when we only have the past
  2. Analyzing incomplete data without a reference for the truth
  3. Estimating the effect between two things that can never occur together
  4. Mixing bonafide observations with bonafide non-observations

What to do with uncertainty without a truth?

Scenario

Let’s assume that we have an incomplete data set and that we can impute (fill in) the incomplete values under multiple models

Challenge
Imputing this data set under one model may yield different results than imputing this data set under another model.

Problem
We have no idea about the validity of either model’s results: we would need either the true observed values or the estimand before we can judge the performance and validity of the imputation model.

We do have a constant in our problem, though: the observed values

Solution

We can overimpute the observed values and evaluate how well the models fit on the observed values.

The assumption would then be that any good imputation model would properly cover the observed data (i.e. would fit to the observed data).

  • If we overimpute the observations multiple times we can calculate bias, intervals and coverage.
  • The model that is unbiased, yields proper coverage, and has the smallest interval width would then be the most efficient model.

In the first example shown in the lecture, the model clearly does not fit well to the observations; in the second example, the model fits quite well.
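
One way to set up such an evaluation in R is sketched below. It is only an illustration of the idea, not the exact procedure from the lecture: it uses the where argument of mice to additionally (over)impute the observed bmi values in the built-in nhanes data, and then treats those observed values as the comparative truth for bias, interval width, and coverage. The percentile interval is a deliberate simplification.

```r
library(mice)

data <- nhanes                      # small example data set shipped with mice
m    <- 20                          # number of (over)imputations

# Impute the actual missing cells, and additionally overimpute the observed bmi values
where_mat          <- is.na(data)
where_mat[, "bmi"] <- TRUE

imp <- mice(data, m = m, where = where_mat, seed = 123, printFlag = FALSE)

# Collect the m overimputed values for each originally observed bmi cell
obs     <- !is.na(data$bmi)
overimp <- sapply(1:m, function(i) complete(imp, i)$bmi[obs])

bias     <- rowMeans(overimp) - data$bmi[obs]           # distance to the observed "truth"
lower    <- apply(overimp, 1, quantile, probs = 0.025)  # simple percentile interval
upper    <- apply(overimp, 1, quantile, probs = 0.975)
coverage <- mean(lower <= data$bmi[obs] & data$bmi[obs] <= upper)

mean(bias); mean(upper - lower); coverage               # bias, average width, coverage
```

Running this for competing imputation models and comparing the three numbers mimics the evaluation described above.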


Q6. Can we infer truth?

Bringing it in perspective

Focus points

  1. What are statistical learning and visualization?
  2. How does it connect to data analysis?
  3. Why do we need the above?
  4. What types of analyses and learning are there?

Some example questions

  • Did our imputations make sense?
  • Who will win the election?
  • Is the climate changing?
  • Why are women underrepresented in STEM degrees?
  • What is the best way to prevent heart failure?
  • Who is at risk of crushing debt?
  • Is this matter undergoing a phase transition?
  • What kind of topics are popular on Twitter?
  • How familiar are incoming DAV students with several DAV topics?

Goals in data analysis

  • Description:
    What happened?
  • Prediction:
    What will happen?
  • Explanation:
    Why did/does something happen?
  • Prescription:
    What should we do?

Modes in data analysis

  • Exploratory:
    Mining for interesting patterns or results
  • Confirmatory:
    Testing hypotheses

Some examples

             | Exploratory                | Confirmatory
Description  | EDA; unsupervised learning | One-sample t-test
Prediction   | Supervised learning        | Macro-economics
Explanation  | Visual mining              | Causal inference
Prescription | Personalised medicine      | A/B testing

In this course

  • Exploratory Data Analysis:
    Describing interesting patterns: use graphs and summaries to understand subgroups, detect anomalies, and understand the data.
    Examples: boxplot, five-number summary, histograms, missing data plots, …

  • Supervised learning:
    Regression: predict continuous labels from other values.
    Examples: linear regression, support vector machines, regression trees, …
    Classification: predict discrete labels from other values.
    Examples: logistic regression, discriminant analysis, classification trees, …
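
A minimal taste of both flavours in base R, on the built-in mtcars data; the particular variables are just convenient examples:

```r
# Exploratory: numerical and graphical summaries
summary(mtcars$mpg)                 # five-number summary plus the mean
hist(mtcars$mpg)                    # distribution of miles per gallon
boxplot(mpg ~ cyl, data = mtcars)   # mpg per number of cylinders

# Supervised, regression: predict a continuous label
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit_lm)

# Supervised, classification: predict a discrete label (automatic vs manual)
fit_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(fit_glm)
```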



Exploratory Data Analysis workflow

Data analysis

How do you think that data analysis relates to:

  • “Data analytics”?
  • “Data modeling”?
  • “Machine learning”?
  • “Statistical learning”?
  • “Statistics”?
  • “Data science”?
  • “Data mining”?
  • “Knowledge discovery”?

Explanation

People from different fields (such as statistics, computer science, information science, industry) have different goals and different standard approaches.

  • We often use the same techniques.
  • We just use different terms to highlight different aspects of so-called data analysis.
  • The terms on the previous slides are not exact synonyms.
  • But according to most people they carry the same analytic intentions.

In this course we emphasize drawing insights that help us understand the data.

Some examples

Space shuttle Challenger

36 years ago, on 28 January 1986, 73 seconds into its flight and at an altitude of 9 miles, the space shuttle Challenger experienced an enormous fireball caused by one of its two booster rockets and broke up. The crew compartment continued its trajectory, reaching an altitude of 12 miles, before falling into the Atlantic. All seven crew members, consisting of five astronauts and two payload specialists, were killed.

How wages differ

The origin of cholera

Predicting the outcome of elections

Google Flu Trends

Identifying Brontës from Austen

The tree of life


Exercise

Challenger disaster
How wages differ
John Snow and Cholera
Election prediction

Flu trends
Brontë or Austen
Elevation, climate and forest
The tree of life


Where would you place each example in the table?

  • Does each example fit in just one cell of the table?
  • For each of these analyses, we can ask some common questions, such as:
    • “How well does the model fit the data?”
    • “How well does the model do on new, unseen, data?”
    • …

Can we think of other common questions?

Can we think of an example of a case where the model did not do well?

Nothing happened, so we ignored it

In the decision process that led to the unfortunate launch of the space shuttle Challenger, some dark data existed.

Dark data is information that is not available.

Such unavailable information can mislead people. The notion that we could potentially be misled is important, because we then need to accept that the outcome of our analysis or decision process might be faulty.

If you do not have all information, there is always a possibility that you arrive at an invalid conclusion or a wrong decision.

From data collection to output

Why analysis and visualization?

  • When high-risk decisions are at hand, it is paramount to analyze the correct data.

  • When thinking about important topics, such as whether to stay in school, it helps to know that more highly educated people tend to earn more, but also that there is no difference for top earners.

  • Before John Snow, people thought “miasma” caused cholera and they fought it by airing out the house. It was not clear whether this helped, but people thought it must, because “miasma” theory said so.

  • Election polls vary randomly from day to day. Before aggregating services like Peilingwijzer, newspapers would make huge news items based on noise from opinion polls.

  • If we know the flu is coming two weeks earlier than usual, that is just enough time to arrange vaccinations for the most vulnerable people.

  • If we know how ecosystems are affected by temperature change, we know how our forests will change in the coming 50-100 years due to climate change.

Why analysis and visualization?

  • Scholars fight over who wrote various songs (Wilhelmus), treatises (Caesar), plays (Shakespeare), etc., with shifting arguments. By counting words, we can sometimes identify the most likely author of a text, and we can explain exactly why we think that is the correct answer.

  • Biologists have been constructing the tree of life based on the appearance of animals and plants. But sometimes outward appearances correspond merely by chance. DNA is a more precise method, because there is more of it and because it is more directly linked to evolution than appearance is. But there is so much of it that we need automated methods to reconstruct the tree.

There is a need

The examples have in common that data analysis and the accompanying visualizations have yielded insights and solved problems that could not be solved without them.

  • On some level, humans do nothing but analyze data;
  • They may not do it consistently, understandably, transparently, or correctly, however;
  • DAV help us process more data, and can keep us honest;
  • DAV can also exacerbate our biases when we are not careful.

Thought

Data wrangling

Wrangling in the pipeline

Data wrangling is the process of transforming and mapping data from one “raw” data form into another format.

  • The process is often iterative
  • The goal is to add purpose and value to the data in order to maximize the downstream analytical gains

Source: R4DS

Core ideas

  • Discovering: The first step of data wrangling is to gain a better understanding of the data: different data is worked with and organized in different ways.
  • Structuring: The next step is to organize the data. Raw data is typically unorganized and much of it may not be useful for the end product. This step makes computation and analysis easier in the later steps.
  • Cleaning: There are many forms of data cleaning: for example, catching dates that are formatted inconsistently, removing outliers that would skew results, and handling null values. This step is important for assuring the overall quality of the data.
  • Enriching: At this step, determine whether additional data that could easily be added would benefit the data set.
  • Validating: This step is similar to structuring and cleaning. Use repeated sequences of validation rules to assure data consistency as well as quality and security. An example of a validation rule is confirming the accuracy of fields by cross-checking data.
  • Publishing: Prepare the data set for use downstream, which could include use by people or software. Be sure to document any steps and logic applied during wrangling.

Source: Trifacta
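
A minimal tidyverse sketch of these steps on the starwars data that ships with dplyr; the specific transformations are made up for illustration:

```r
library(dplyr)

wrangled <- starwars |>
  select(name, height, mass, species, homeworld) |>  # structuring: keep the relevant columns
  filter(!is.na(height), !is.na(mass)) |>            # cleaning: drop incomplete rows
  mutate(bmi = mass / (height / 100)^2) |>           # enriching: derive a new feature
  filter(bmi < 60) |>                                # cleaning: remove an extreme outlier
  arrange(desc(bmi))

# Validating: simple consistency checks before publishing
stopifnot(all(wrangled$height > 0), !anyNA(wrangled$bmi))

head(wrangled)
```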

To Do