- Course pages
- Course overview
- Introduction to SLV
- (Dark) Data Science
- Data Wrangling
- Wrap-up
Supervised Learning and Visualization
I owe a debt of gratitude to many people, as the thoughts and teachings in my slides are the product of years of development and discussion with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.
When external figures and other sources are shown:
Opinions are my own.
I am a statistician
If there is anything important - contact me!
The on-location lectures will not be recorded.
If you feel that you are stuck, and the wait for the Q&A session is too long: open a GitHub issue here.
Use a reprex to detail your issue when code is involved.

If you expect that you are going to miss some part(s) of the course, please notify me via a private MS Teams message or e-mail.
You can find all materials at the following location:
All course materials should be submitted through a pull request from your fork of
The structure of your submissions should follow the corresponding repo’s README. To make it simple, I have added an example for the first practical. If you are unfamiliar with GitHub, forking and/or pull requests, please study this exercise from one of my other courses. There you can find video walkthroughs that detail the process.
All three have a PhD in statistics and a ton of experience in development, data analysis and visualization.
Week # | Focus | Teacher | Materials |
---|---|---|---|
1 | Data wrangling with R | GV | R4DS, ISLR |
2 | The grammar of graphics | GV | R4DS |
3 | Exploratory data analysis | GV | R4DS, FIMD |
4 | Statistical learning: regression | MC | ISLR, TBD |
5 | Statistical learning: classification | EJvK | ISLR, TBD |
6 | Classification model evaluation | EJvK | ISLR, TBD |
7 | Nonlinear models | MC | ISLR, TBD |
8 | Bagging, boosting, random forests and support vector machines | MC | ISLR, TBD |
Each week we have the following:
Twice we have:
Once we have:
We will make groups on Wednesday Sept 14!
We begin this course series with a bit of statistical inference.
Statistical inference is the process of drawing conclusions from truths
Truths are boring, but they are convenient.
Q3: Would that mean that if we simply observe every potential unit, we would be unbiased about the truth?
The problem is a bit larger
We have three entities at play, here:
The more features we use, the more we capture about the outcome for the cases in the data
The more cases we have, the more we approach the true information
All these things are related to uncertainty. Our model can still yield biased results when fitted to \(\infty\) features. Our inference can still be wrong when obtained on \(\infty\) cases.
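To make the second caution concrete, here is a small simulation (in Python, with made-up coefficients; the course itself works in R): even with a huge number of cases, a model that omits a relevant feature stays biased.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000  # a huge number of cases does not fix a misspecified model

# hypothetical truth: y = 1*x + 1*z, where z is an unobserved ("dark") feature
z = rng.normal(size=n)
x = z + rng.normal(size=n)      # x and z are correlated
y = x + z + rng.normal(size=n)

# regress y on x alone: the slope absorbs part of z's effect
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
print(slope)  # close to 1.5, not the true coefficient of 1
```

No amount of extra cases moves the estimate towards the true coefficient; only including the missing feature would.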
Core assumption: all observations are bona fide
When we do not have all information …
In some cases we estimate that we are only a bit wrong. In other cases we estimate that we could be very wrong. This is the purpose of testing.
The uncertainty measures about our estimates can be used to create intervals
Confidence intervals can be hugely informative!
If we draw 100 samples from a population, then on average 95 of the resulting 95% CIs will cover the population value.
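This coverage property can be checked by simulation; a quick Python sketch (population, sample size and number of replications are made up, and the normal critical value 1.96 is used as an approximation):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 5.0, 2.0, 50, 100  # hypothetical population and design

covered = 0
for _ in range(reps):
    sample = rng.normal(loc=mu, scale=sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)          # estimated standard error of the mean
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    covered += lo <= mu <= hi                      # does this CI cover the truth?

print(covered)  # around 95 of the 100 intervals cover the true mean
```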
Prediction intervals can also be hugely informative!
Prediction intervals are generally wider than confidence intervals
Narrower intervals mean less uncertainty. It does not mean less bias!
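To see why prediction intervals are wider, consider simple linear regression at a new point \(x_0\): the confidence interval for the mean uses standard error \(\hat\sigma\sqrt{1/n + (x_0-\bar x)^2/S_{xx}}\), while the prediction interval adds an extra 1 under the square root for the noise in a single new observation. A sketch with simulated data (all numbers made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 2 + 0.5 * x + rng.normal(scale=1.0, size=n)

# fit y = b0 + b1 * x by least squares
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
sigma = np.sqrt(np.sum(resid**2) / (n - 2))  # residual standard deviation

x0 = 5.0
sxx = np.sum((x - x.mean()) ** 2)
leverage = 1 / n + (x0 - x.mean()) ** 2 / sxx
se_mean = sigma * np.sqrt(leverage)       # CI: uncertainty about the conditional mean
se_pred = sigma * np.sqrt(1 + leverage)   # PI: adds the noise of one new observation

print(se_pred > se_mean)  # True: the prediction interval is always wider
```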
Whenever I evaluate something, I tend to look at three things:
As a function of model complexity, these components play a role in the bias/variance tradeoff of any specific modeling effort
We now have a new problem:
Q4. What would be a simple solution to allowing for valid inferences on the incomplete sample?
Q5. Would that solution work in practice?
There are two sources of uncertainty that we need to cover:
This is more challenging if the sample is not a random draw from the population, or if the feature set is too limited to fit the substantive model of interest
We don’t. In practice we may often lack the necessary comparative truths!
For example:
Let’s assume that we have an incomplete data set and that we can impute (fill in) the incomplete values under multiple models
Challenge
Imputing this data set under one model may yield different results than imputing this data set under another model.
Problem
We have no idea about the validity of either model’s results: we would need either the true observed values or the estimand before we can judge the performance and validity of the imputation model.
We do have a constant in our problem, though: the observed values
We can overimpute the observed values and evaluate how well the candidate models fit the observed values.

The assumption would then be that any good imputation model would properly cover the observed data (i.e. would fit the observed data). A model whose overimputations stray far from the observations clearly does not fit well; a model whose overimputations track the observations fits quite well.
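A minimal sketch of the overimputation idea (Python, toy data; the course itself works in R, e.g. with mice). This is a simplification of real overimputation, which would also carry imputation uncertainty: we fit two candidate imputation models on the observed part of a variable, "impute" the values we in fact observed, and compare how well each model recovers them.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(scale=0.5, size=n)  # hypothetical truth: y depends on x

# suppose some y-values are missing completely at random; the rest are observed
missing = rng.random(n) < 0.3
y_obs, x_obs = y[~missing], x[~missing]

# candidate model A: impute with the observed mean (ignores x)
overimpute_a = np.full_like(y_obs, y_obs.mean())

# candidate model B: impute from a regression of y on x
b1 = np.cov(x_obs, y_obs, ddof=1)[0, 1] / np.var(x_obs, ddof=1)
b0 = y_obs.mean() - b1 * x_obs.mean()
overimpute_b = b0 + b1 * x_obs

# overimputation check: which model covers the observed values better?
rmse = lambda pred: np.sqrt(np.mean((pred - y_obs) ** 2))
print(rmse(overimpute_a) > rmse(overimpute_b))  # True: model B fits the observed data better
```

The observed values act as the constant against which both candidate models are judged, even though the truth behind the missing values stays unknown.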
Q6. Can we infer truth?
 | Exploratory | Confirmatory |
---|---|---|
Description | EDA; unsupervised learning | One-sample t-test |
Prediction | Supervised learning | Macro-economics |
Explanation | Visual mining | Causal inference |
Prescription | Personalised medicine | A/B testing |
Exploratory Data Analysis:
Describing interesting patterns: use graphs, summaries, to understand subgroups, detect anomalies, understand the data
Examples: boxplot, five-number summary, histograms, missing data plots, …
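For instance, a five-number summary (minimum, first quartile, median, third quartile, maximum) on a made-up vector; in R you would reach for `summary()`, here sketched in Python:

```python
import numpy as np

data = np.array([2, 5, 7, 1, 9, 3, 8, 4, 6, 10])

# the five-number summary as the 0th, 25th, 50th, 75th and 100th percentiles
five = np.percentile(data, [0, 25, 50, 75, 100])
print(five)  # minimum 1, Q1 3.25, median 5.5, Q3 7.75, maximum 10
```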
Supervised learning:
Regression: predict continuous labels from other values.
Examples: linear regression, support vector machines, regression trees, …

Classification: predict discrete labels from other values.
Examples: logistic regression, discriminant analysis, classification trees, …
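A minimal contrast between the two tasks (Python with scikit-learn; the course practicals use R), fitting a linear regression on a continuous label and a logistic regression on a discrete one. Data and coefficients are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 1))

# regression: predict a continuous label (hypothetical true slope of 3)
y_cont = 3 * X[:, 0] + rng.normal(scale=0.5, size=200)
reg = LinearRegression().fit(X, y_cont)

# classification: predict a discrete label (1 when the noisy signal is positive)
y_class = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)

print(reg.coef_[0])           # close to the true slope of 3
print(clf.predict([[2.0]]))   # a clearly positive x is classified as 1
```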
How do you think that data analysis relates to:

People from different fields (such as statistics, computer science, information science, industry) have different goals and different standard approaches to data analysis. In this course we emphasize drawing insights that help us understand the data.
36 years ago, on 28 January 1986, 73 seconds into its flight and at an altitude of 9 miles, the Space Shuttle Challenger broke apart in an enormous fireball after the failure of a seal in one of its two booster rockets. The crew compartment continued its trajectory, reaching an altitude of 12 miles, before falling into the Atlantic. All seven crew members, consisting of five astronauts and two payload specialists, were killed.
Challenger disaster
How wages differ
John Snow and Cholera
Election prediction
Flu trends
Brontë or Austen
Elevation, climate and forest
The tree of life
Where would you place each example in the table?
Can we think of other common questions?
Can we think of an example of a case where the model did not do well?
In the decision process that led to the unfortunate launch of the Space Shuttle Challenger, some dark data existed.
Dark data is information that is not available.
Such unavailable information can mislead people. The notion that we could potentially be misled is important, because we then need to accept that our outcome analysis or decision process might be faulty.
If you do not have all information, there is always a possibility that you arrive at an invalid conclusion or a wrong decision.
When high-risk decisions are at hand, it is paramount to analyze the correct data.
When thinking about important topics, such as whether to stay in school, it helps to know that more highly educated people tend to earn more, but also that there is no difference for top earners.
Before John Snow, people thought “miasma” caused cholera, and they fought it by airing out the house. It was not clear whether this helped or not, but people thought it must, because “miasma” theory said so.
Election polls vary randomly from day to day. Before aggregating services like Peilingwijzer, newspapers would make huge news items based on noise from opinion polls.
If we know flu is coming two weeks earlier than usual, that’s just enough time to buy shots for very weak people.
If we know how ecosystems are affected by temperature change, we know how our forests will change in the coming 50-100 years due to climate change.
Scholars fight over who wrote various songs (Wilhelmus), treatises (Caesar), plays (Shakespeare), etc., with shifting arguments. By counting words, we can sometimes identify the most likely author of a text, and we can explain exactly why we think that is the correct answer.
Biologists have been constructing the tree of life based on appearance of the animal/plant. But sometimes the outward appearance corresponds by chance. DNA is a more precise method, because there is more of it, and because it is more directly linked to evolution than appearance. But there is so much of it that we need automated methods of reconstructing the tree.
The examples have in common that data analysis and the accompanying visualizations have yielded insights and solved problems that could not be solved without them.