Supervised Learning and Visualization

This lecture

  1. Course pages
  2. Course overview
  3. Introduction to SLV
  4. (Dark) Data Science
  5. Data Wrangling
  6. Wrap-up

Disclaimer

I owe a debt of gratitude to many people, as the thoughts and teachings in my slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.

When external figures and other sources are shown:

  1. the references are included when the origin is known, or
  2. the objects are directly linked from within the public domain and the source can be obtained by right-clicking the objects.

Opinions are my own.

I am a statistician

Procedural stuff

  • If there is anything important - contact me!

  • The on-location lectures will not be recorded.

    • If you are ill, ask your classmates to cover for you.
  • If you feel that you are stuck, and the wait for the Q&A session is too long: open a GitHub issue here.

    • You are most likely not the only one with that question. You are simply the bravest or the first.
    • Do not contact us via private chat or e-mail for content-related questions.
    • Use a reprex to detail your issue when code is involved.
  • If you expect that you are going to miss some part(s) of the course, please notify me via a private MS-Teams message or e-mail.

Course pages

You can find all materials at the following location:

https://www.gerkovink.com/slv/


All course materials should be submitted through a pull request from your fork of

https://github.com/gerkovink/INFOMDA1-2022


The structure of your submissions should follow the corresponding repo’s README. To make it simple, I have added an example for the first practical. If you are unfamiliar with GitHub, forking and/or pull requests, please study this exercise from one of my other courses. There you can find video walkthroughs that detail the process.

Course overview

Team

Topics

Week | Focus                                                         | Teacher | Materials
1    | Data wrangling with R                                         | GV      | R4DS, ISLR
2    | The grammar of graphics                                       | GV      | R4DS
3    | Exploratory data analysis                                     | GV      | R4DS, FIMD
4    | Statistical learning: regression                              | MC      | ISLR, TBD
5    | Statistical learning: classification                          | EJvK    | ISLR, TBD
6    | Classification model evaluation                               | EJvK    | ISLR, TBD
7    | Nonlinear models                                              | MC      | ISLR, TBD
8    | Bagging, boosting, random forest and support vector machines  | MC      | ISLR, TBD

Course Setup

Each week we have the following:

  • 1 Lecture on Monday @ 9am in BBG 201
  • 1 Practical (not graded, but it must be submitted to pass). Hand in the practical before the next lecture.
  • 1 combined workgroup and Q&A session in BBG 106
  • Course materials to study. See the corresponding week on the course page.

Twice we have:

  • Group assignments
  • Each assignment is made in teams of 3-4 students.
  • Each assignment counts towards 25% of the total grade. Must be > 5.5 to pass.

Once we have:

  • Individual exam
  • BYOD: so charge and bring your laptop.
  • 50% of total grade. Must be > 5.5 to pass.

Groups

We will make groups on Wednesday Sept 14!

Introduction to SLV

Terms I may use

  • TDGM: True data generating model
  • DGP: Data generating process, closely related to the TDGM, but with all the wacky additional uncertainty
  • Truth: The comparative truth that we are interested in
  • Bias: The distance to the comparative truth
  • Variance: When not everything is the same
  • Estimate: Something that we calculate or guess
  • Estimand: The thing we aim to estimate and guess
  • Population: That larger entity without sampling variance
  • Sample: The smaller thing with sampling variance
  • Incomplete: There exists a more complete version, but we don’t have it
  • Observed: What we have
  • Unobserved: What we would also like to have

Some statistics

At the start

We begin this course series with a bit of statistical inference.

Statistical inference is the process of drawing conclusions from truths

Truths are boring, but they are convenient.

  • However, for most problems truths require a lot of calculations, tallying, or a complete census.
  • Therefore, a proxy of the truth is in most cases sufficient.
  • An example of such a proxy is a sample.
  • Samples are widely used and have been for a long time. See Jelke Bethlehem’s CBS discussion paper for an overview of the history of sampling within survey statistics.

Being wrong about the truth

  • The population is the truth
  • The sample comes from the population, but is generally smaller in size
  • This means that not all cases from the population can be in our sample
  • If not all information from the population is in the sample, then our sample may be wrong


    Q1: Why is it important that our sample is not wrong?
    Q2: How do we know that our sample is not wrong?

Solving the missingness problem

  • There are many flavours of sampling
  • If we give every unit in the population the same probability to be sampled, we do random sampling
  • The convenience of random sampling is that the missingness problem can be ignored (a small sampling sketch follows this list)
  • The missingness problem would in this case be: not every unit in the population has been observed in the sample
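
To make random sampling concrete, here is a minimal R sketch with a made-up population; both the population and the sample size are illustrative assumptions, not course material:

```r
set.seed(123)                      # for reproducibility

# A made-up finite "population" of 10,000 incomes
population <- rgamma(10000, shape = 2, scale = 1000)

# Simple random sampling: every unit has the same probability of selection
s <- sample(population, size = 100, replace = FALSE)

mean(population)  # the "truth" we are after
mean(s)           # our estimate, subject to sampling variance
```

Repeating the last two lines with a fresh sample shows the sampling variance in action: the estimate moves around the truth from sample to sample.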




Q3: Would that mean that if we simply observe every potential unit, we would be unbiased about the truth?

Sidestep

  • The problem is a bit larger

  • We have three entities at play here:

    1. The truth we’re interested in
    2. The proxy that we have (e.g. sample)
    3. The model that we’re running
  • The more features we use, the more we capture about the outcome for the cases in the data

  • The more cases we have, the more we approach the true information


    All these things are related to uncertainty. Our model can still yield biased results when fitted to \(\infty\) features. Our inference can still be wrong when obtained on \(\infty\) cases.

Core assumption: all observations are bonafide

Uncertainty simplified

When we do not have all information …

  1. We need to accept that we are probably wrong
  2. We just have to quantify how wrong we are


In some cases we estimate that we are only a bit wrong. In other cases we estimate that we could be very wrong. This is the purpose of testing.

The uncertainty measures about our estimates can be used to create intervals

Rumsfeld moment of fame in statistics

Confidence intervals

Confidence intervals can be hugely informative!

If we draw 100 samples from a population and compute a 95% CI for each, then the CI should cover the population value in at least 95 out of 100 samples.

  • If the coverage <95: bad estimation process with risk of errors and invalid inference
  • If the coverage >95: inefficient estimation process, but correct conclusions and valid inference. Lower statistical power.
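
A small simulation sketch in R illustrates this coverage statement; the population values and sample size below are made-up assumptions for the illustration:

```r
set.seed(123)
mu    <- 5        # population mean (the value we hope to cover)
sigma <- 2        # population standard deviation
n     <- 50       # sample size per draw

covered <- replicate(100, {
  x  <- rnorm(n, mean = mu, sd = sigma)        # draw one sample
  ci <- t.test(x, conf.level = 0.95)$conf.int  # 95% CI for the mean
  ci[1] <= mu & mu <= ci[2]                    # does the interval cover mu?
})

sum(covered)      # should be close to 95 out of 100
```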

The other type of intervals

Prediction intervals can also be hugely informative!

Prediction intervals are generally wider than confidence intervals

  • This is because they cover the inherent uncertainty in an individual data point on top of the sampling uncertainty
  • Just like CIs, PIs become narrower at locations where more information is observed (less uncertainty)
  • Usually this is at the location of the mean of the predicted values.


Narrower intervals mean less uncertainty. It does not mean less bias!
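
The difference between the two interval types is easy to see with predict() on a linear model; the simulated data below are purely illustrative:

```r
set.seed(123)
x   <- runif(100, 0, 10)
y   <- 2 + 0.5 * x + rnorm(100, sd = 1)
fit <- lm(y ~ x)

new <- data.frame(x = c(1, 5, 9))

# Confidence interval: uncertainty about the mean response at each x
predict(fit, newdata = new, interval = "confidence")

# Prediction interval: adds the uncertainty of an individual observation,
# so it is wider than the confidence interval at every x
predict(fit, newdata = new, interval = "prediction")
```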

The holy trinity

Whenever I evaluate something, I tend to look at three things:

  • bias (how far from the truth)
  • uncertainty/variance (how wide is my interval)
  • coverage (how often do I cover the truth with my interval)


As a function of model complexity in a specific modeling effort, these components play a role in the bias-variance tradeoff.

Now with missingness

We now have a new problem:

  • We do not have the whole truth, but merely a sample of the truth.
  • We do not even have the whole sample, but merely a sample of the sample of the truth.

Q4. What would be a simple solution that allows for valid inferences on the incomplete sample?
Q5. Would that solution work in practice?


The statistical solution

There are two sources of uncertainty that we need to cover:

  1. Uncertainty about the missing value:
    when we don’t know what the true observed value should be, we must create a distribution of values with proper variance (uncertainty).
  2. Uncertainty about the sampling:
    nothing can guarantee that our sample is the one true sample. So it is reasonable to assume that the parameters obtained on our sample are biased.


This becomes more challenging if the sample is not randomly drawn from the population, or if the feature set is too limited to solve for the substantive model of interest.
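
Multiple imputation is one common way to carry both sources of uncertainty into the analysis. Below is a minimal sketch with the R package mice on its built-in nhanes data; the analysis model is just an example. The spread between the m completed data sets reflects the uncertainty about the missing values, and pooling via Rubin's rules combines it with the usual sampling uncertainty:

```r
library(mice)

imp  <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)  # 5 multiply imputed data sets
fits <- with(imp, lm(chl ~ age + bmi))                      # analyze each completed data set
summary(pool(fits))                                         # Rubin's rules: pooled estimates and SEs
```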

Now how do we know we did well?

I’m really sorry, but:
We don’t. In practice we may often lack the necessary comparative truths!

For example:

  1. Predicting a future response when we only have the past
  2. Analyzing incomplete data without a reference for the truth
  3. Estimating the effect between two things that can never occur together
  4. Mixing bonafide observations with bonafide non-observations

What to do with uncertainty without a truth?

Scenario

Let’s assume that we have an incomplete data set and that we can impute (fill in) the incomplete values under multiple models

Challenge
Imputing this data set under one model may yield different results than imputing this data set under another model.

Problem
We have no idea about the validity of either model’s results: we would need either the true observed values or the estimand before we can judge the performance and validity of the imputation model.

We do have a constant in our problem, though: the observed values

Solution

We can overimpute the observed values and evaluate how well the models fit on the observed values.

The assumption would then be that any good imputation model would properly cover the observed data (i.e. would fit to the observed data).

  • If we overimpute the observations multiple times we can calculate bias, intervals and coverage.
  • The model that is unbiased, yields proper coverage, and has the smallest interval width would then be the most efficient model.

In the first example shown in the lecture, the model clearly does not fit well to the observations; in the second example, the model fits quite well.
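
One way to set up such an evaluation in R is sketched below. It is only an illustration of the idea, not the exact procedure from the lecture: it uses the where argument of mice to additionally (over)impute the observed bmi values in the built-in nhanes data, and then treats those observed values as the comparative truth for bias, interval width, and coverage. The percentile interval is a deliberate simplification.

```r
library(mice)

data <- nhanes                      # small example data set shipped with mice
m    <- 20                          # number of (over)imputations

# Impute the actual missing cells, and additionally overimpute the observed bmi values
where_mat          <- is.na(data)
where_mat[, "bmi"] <- TRUE

imp <- mice(data, m = m, where = where_mat, seed = 123, printFlag = FALSE)

# Collect the m overimputed values for each originally observed bmi cell
obs     <- !is.na(data$bmi)
overimp <- sapply(1:m, function(i) complete(imp, i)$bmi[obs])

bias     <- rowMeans(overimp) - data$bmi[obs]           # distance to the observed "truth"
lower    <- apply(overimp, 1, quantile, probs = 0.025)  # simple percentile interval
upper    <- apply(overimp, 1, quantile, probs = 0.975)
coverage <- mean(lower <= data$bmi[obs] & data$bmi[obs] <= upper)

mean(bias); mean(upper - lower); coverage               # bias, average width, coverage
```

Running this for competing imputation models and comparing the three numbers mimics the evaluation described above.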


Q6. Can we infer truth?

Bringing it in perspective

Focus points

  1. What are statistical learning and visualization?
  2. How does it connect to data analysis?
  3. Why do we need the above?
  4. What types of analyses and learning are there?

Some example questions

  • Did our imputations make sense?
  • Who will win the election?
  • Is the climate changing?
  • Why are women underrepresented in STEM degrees?
  • What is the best way to prevent heart failure?
  • Who is at risk of crushing debt?
  • Is this matter undergoing a phase transition?
  • What kind of topics are popular on Twitter?
  • How familiar are incoming DAV students with several DAV topics?

Goals in data analysis

  • Description:
    What happened?
  • Prediction:
    What will happen?
  • Explanation:
    Why did/does something happen?
  • Prescription:
    What should we do?

Modes in data analysis

  • Exploratory:
    Mining for interesting patterns or results
  • Confirmatory:
    Testing hypotheses

Some examples

             | Exploratory                | Confirmatory
Description  | EDA; unsupervised learning | One-sample t-test
Prediction   | Supervised learning        | Macro-economics
Explanation  | Visual mining              | Causal inference
Prescription | Personalised medicine      | A/B testing

In this course

  • Exploratory Data Analysis:
    Describing interesting patterns: use graphs and summaries to understand subgroups, detect anomalies, and understand the data.
    Examples: boxplot, five-number summary, histograms, missing data plots, …

  • Supervised learning:
    Regression: predict continuous labels from other values.
    Examples: linear regression, support vector machines, regression trees, …
    Classification: predict discrete labels from other values.
    Examples: logistic regression, discriminant analysis, classification trees, …
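
A minimal taste of both flavours in base R, on the built-in mtcars data; the particular variables are just convenient examples:

```r
# Exploratory: numerical and graphical summaries
summary(mtcars$mpg)                 # five-number summary plus the mean
hist(mtcars$mpg)                    # distribution of miles per gallon
boxplot(mpg ~ cyl, data = mtcars)   # mpg per number of cylinders

# Supervised, regression: predict a continuous label
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit_lm)

# Supervised, classification: predict a discrete label (automatic vs manual)
fit_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(fit_glm)
```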



Exploratory Data Analysis workflow

Data analysis

How do you think that data analysis relates to:

  • “Data analytics”?
  • “Data modeling”?
  • “Machine learning”?
  • “Statistical learning”?
  • “Statistics”?
  • “Data science”?
  • “Data mining”?
  • “Knowledge discovery”?

Explanation

People from different fields (such as statistics, computer science, information science, industry) have different goals and different standard approaches.

  • We often use the same techniques.
  • We just use different terms to highlight different aspects of so-called data analysis.
  • The terms on the previous slides are not exact synonyms.
  • But according to most people they carry the same analytic intentions.

In this course we emphasize drawing insights that help us understand the data.

Some examples

Space shuttle Challenger

36 years ago, on 28 January 1986, 73 seconds into its flight and at an altitude of 9 miles, the space shuttle Challenger experienced an enormous fireball caused by one of its two booster rockets and broke up. The crew compartment continued its trajectory, reaching an altitude of 12 miles, before falling into the Atlantic. All seven crew members, consisting of five astronauts and two payload specialists, were killed.

How wages differ

The origin of cholera

Predicting the outcome of elections

Google Flu Trends

Identifying Brontës from Austen

The tree of life


Exercise

Challenger disaster
How wages differ
John Snow and Cholera
Election prediction

Flu trends
Brontë or Austen
Elevation, climate and forest
The tree of life


Where would you place each example in the table?

  • Does each example fit in just one cell of the table?
  • For each of these analyses, we can ask some common questions, such as:
    • “How well does the model fit the data?”
    • “How well does the model do on new, unseen, data?”
    • …

Can we think of other common questions?

Can we think of an example of a case where the model did not do well?

Nothing happened, so we ignored it

In the decision process that led to the unfortunate launch of the space shuttle Challenger, some dark data existed.

Dark data is information that is not available.

Such unavailable information can mislead people. The notion that we could potentially be misled is important, because we then need to accept that the outcome of our analysis or decision process might be faulty.

If you do not have all information, there is always a possibility that you arrive at an invalid conclusion or a wrong decision.

From data collection to output

Why analysis and visualization?

  • When high-risk decisions are at hand, it is paramount to analyze the correct data.

  • When thinking about important topics, such as whether to stay in school, it helps to know that more highly educated people tend to earn more, but also that there is no difference for top earners.

  • Before John Snow, people thought “miasma” caused cholera and they fought it by airing out the house. It was not clear whether this helped, but people thought it must, because “miasma” theory said so.

  • Election polls vary randomly from day to day. Before aggregating services like Peilingwijzer, newspapers would make huge news items based on noise from opinion polls.

  • If we know the flu is coming two weeks earlier than usual, that is just enough time to arrange vaccinations for the most vulnerable people.

  • If we know how ecosystems are affected by temperature change, we know how our forests will change in the coming 50-100 years due to climate change.

Why analysis and visualization?

  • Scholars fight over who wrote various songs (Wilhelmus), treatises (Caesar), plays (Shakespeare), etc., with shifting arguments. By counting words, we can sometimes identify the most likely author of a text, and we can explain exactly why we think that is the correct answer.

  • Biologists have been constructing the tree of life based on the appearance of animals and plants. But sometimes outward appearances correspond merely by chance. DNA is a more precise method, because there is more of it and because it is more directly linked to evolution than appearance is. But there is so much of it that we need automated methods to reconstruct the tree.

There is a need

The examples have in common that data analysis and the accompanying visualizations have yielded insights and solved problems that could not be solved without them.

  • On some level, humans do nothing but analyze data;
  • They may not do it consistently, understandably, transparently, or correctly, however;
  • DAV help us process more data, and can keep us honest;
  • DAV can also exacerbate our biases when we are not careful.

Thought

Data wrangling

Wrangling in the pipeline

Data wrangling is the process of transforming and mapping data from one “raw” data form into another format.

  • The process is often iterative
  • The goal is to add purpose and value to the data in order to maximize the downstream analytical gains

Source: R4DS

Core ideas

  • Discovering: The first step of data wrangling is to gain a better understanding of the data: different data is worked with and organized in different ways.
  • Structuring: The next step is to organize the data. Raw data is typically unorganized and much of it may not be useful for the end product. This step makes computation and analysis easier in the later steps.
  • Cleaning: There are many forms of data cleaning: for example, catching dates that are formatted inconsistently, removing outliers that would skew results, and handling null values. This step is important for assuring the overall quality of the data.
  • Enriching: At this step, determine whether additional data that could easily be added would benefit the data set.
  • Validating: This step is similar to structuring and cleaning. Use repeated sequences of validation rules to assure data consistency as well as quality and security. An example of a validation rule is confirming the accuracy of fields by cross-checking data.
  • Publishing: Prepare the data set for use downstream, which could include use by people or software. Be sure to document any steps and logic applied during wrangling.

Source: Trifacta
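
A minimal tidyverse sketch of these steps on the starwars data that ships with dplyr; the specific transformations are made up for illustration:

```r
library(dplyr)

wrangled <- starwars |>
  select(name, height, mass, species, homeworld) |>  # structuring: keep the relevant columns
  filter(!is.na(height), !is.na(mass)) |>            # cleaning: drop incomplete rows
  mutate(bmi = mass / (height / 100)^2) |>           # enriching: derive a new feature
  filter(bmi < 60) |>                                # cleaning: remove an extreme outlier
  arrange(desc(bmi))

# Validating: simple consistency checks before publishing
stopifnot(all(wrangled$height > 0), !anyNA(wrangled$bmi))

head(wrangled)
```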

To Do