Supervised Learning and Visualization

This lecture

  1. Covid and the course
  2. Course pages
  3. Course overview
  4. Introduction to SLV
  5. Some examples
  6. Data Wrangling
  7. Wrap-up

Disclaimer

Parts of this week’s slides may be based on materials from previous iterations of Data Analysis and Visualization courses. The authors of these materials include, but may not be limited to: Erik-Jan van Kesteren, Daniel Oberski and Peter van der Heijden.

When figures and other external sources are shown, the references are included when the origin is known.

Covid and the course

With the exception of the first lecture, all lectures are on location. There are some rules by which we abide:

Covid related:

  • You are required to wear a face-mask when moving. Please do so.
  • You may remove your mask when seated in the lecture hall, if you want. You do not have to.
  • Make use of the available space in the lecture rooms.
  • If you have doubts about your health, please skip class and contact a medical professional.
  • Please take other people's wishes, needs, rights and freedoms into consideration.
  • We have a separate online Q&A session every week to allow for more student-teacher interaction.

Procedure related:

  • The first lecture will be recorded because of schedule clashes.
  • The on-location lectures will not be recorded.
    • If you are ill, ask your classmates to cover for you.
  • If you feel that you are stuck, and the wait for the Q&A session is too long: open a GitHub issue here.
    • You are most likely not the only one with that question. You are simply the bravest or the first.
    • Do not contact the teachers via private chat or e-mail.
    • Use a reprex to detail your issue when code is involved.
  • If you expect that you are going to miss some part(s) of the course, please notify me via a private MS-Teams message.

Course pages

You can find all materials at the following location:

https://www.gerkovink.com/slv/


All course materials should be submitted through a pull request from your fork of

https://github.com/gerkovink/INFOMDA1-2021


The structure of your submissions should follow the corresponding repo’s README. To make it simple, I will add an example for the first of each submission type.

If you are unfamiliar with GitHub, forking and/or pull requests, please study this exercise from one of my other courses. There you can find video walkthroughs that detail the process.

Course overview

Team

Topics

Week # | Focus                                                        | Teacher | Materials
1      | Data wrangling with R                                        | GV      | R4DS, ISLR
2      | The grammar of graphics                                      | GV      | R4DS
3      | Exploratory data analysis                                    | GV      | R4DS, FIMD
4      | Statistical learning: regression                             | MC      | ISLR, TBD
5      | Regression model evaluation                                  | MC      | ISLR, TBD
6      | Statistical learning: classification                         | EJvK    | ISLR, TBD
7      | Classification model evaluation                              | EJvK    | ISLR, TBD
8      | Nonlinear models                                             | MC      | ISLR, TBD
9      | Bagging, boosting, random forest and support vector machines | MC      | ISLR, TBD

Course Setup

Each week we have the following:

  • 1 Lecture on Monday @ 9am in Ruppert D
  • 1 Practical (not graded). Must be submitted to pass. Hand in the practical before the next lecture.
  • 1 online Q&A session.
  • Course materials to study. See the corresponding week on the course page.

Twice we have:

  • Group assignments
    • Each assignment is made in teams.
    • Each assignment counts for 25% of the total grade. Must be > 5.5 to pass.

Once we have:

  • Individual exam
    • BYOD: so charge and bring your laptop.
    • 50% of total grade. Must be > 5.5 to pass.
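
The weighting above can be written out as a small calculation. The sketch below uses Python purely for illustration (the course itself works in R). The function names `final_grade` and `passes`, the example grades, and the reading that every component must individually exceed 5.5 are assumptions, not the official grading rule. The requirement to hand in the practicals is not modelled.

```python
def final_grade(assignment_1, assignment_2, exam):
    """Weighted grade: two group assignments at 25% each, the exam at 50%."""
    return 0.25 * assignment_1 + 0.25 * assignment_2 + 0.50 * exam


def passes(assignment_1, assignment_2, exam, threshold=5.5):
    """Assumed reading of 'Must be > 5.5 to pass': each component exceeds the threshold."""
    return all(g > threshold for g in (assignment_1, assignment_2, exam))


# Hypothetical grades, invented for illustration only.
print(final_grade(7.0, 8.0, 6.5))  # 0.25*7.0 + 0.25*8.0 + 0.50*6.5 = 7.0
print(passes(7.0, 8.0, 6.5))       # True: all three components are above 5.5
print(passes(7.0, 5.5, 9.0))       # False: 5.5 is not strictly greater than 5.5
```

Note that under this reading a high exam grade cannot compensate for an assignment at or below 5.5.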

Groups

We will make groups on Monday Sept 13!

Introduction to SLV

Focus points

  1. What are statistical learning and visualization?
  2. How do they connect to data analysis?
  3. Why do we need the above?
  4. What types of analyses and learning are there?

Some example questions

  • Who will win the election?
  • Is the climate changing?
  • Why are women underrepresented in STEM degrees?
  • What is the best way to prevent heart failure?
  • Who is at risk of crushing debt?
  • Is this matter undergoing a phase transition?
  • What kind of topics are popular on Twitter?
  • How familiar are incoming DAV students with several DAV topics?

Goals in data analysis

  • Description:
    What happened?
  • Prediction:
    What will happen?
  • Explanation:
    Why did/does something happen?
  • Prescription:
    What should we do?

Modes in data analysis

  • Exploratory:
    Mining for interesting patterns or results
  • Confirmatory:
    Testing hypotheses

Some examples

             | Exploratory                | Confirmatory
Description  | EDA; unsupervised learning | One-sample t-test
Prediction   | Supervised learning        | Macro-economics
Explanation  | Visual mining              | Causal inference
Prescription | Personalised medicine      | A/B testing

In this course

  • Exploratory Data Analysis:
    Describing interesting patterns: use graphs and summaries to understand subgroups, detect anomalies, and understand the data.
    Examples: boxplot, five-number summary, histograms, missing data plots, …

  • Supervised learning:
    Regression: predict continuous labels from other values.
    Examples: linear regression, support vector machines, regression trees, …
    Classification: predict discrete labels from other values.
    Examples: logistic regression, discriminant analysis, classification trees, …
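
To make the three ideas above concrete, here is a minimal sketch in Python (chosen only for illustration; the course itself uses R). The toy data and all function names are hypothetical. The classifier is a simple nearest-class-mean rule, swapped in as a stand-in because it fits in a few lines; it is not one of the methods listed above.

```python
import statistics

# Hypothetical toy data, invented for illustration only.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]               # feature
score = [2.0, 3.0, 5.0, 4.0, 6.0, 7.0, 9.0, 8.0]               # continuous label
passed = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]  # discrete label


def five_number_summary(values):
    """EDA: minimum, lower quartile, median, upper quartile, maximum."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def fit_line(x, y):
    """Regression: least-squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_class_means(x, labels):
    """Classification: store the mean feature value of each class."""
    return {
        c: statistics.fmean([xi for xi, li in zip(x, labels) if li == c])
        for c in set(labels)
    }


def predict_class(class_means, new_x):
    """Assign the class whose mean feature value is closest to new_x."""
    return min(class_means, key=lambda c: abs(new_x - class_means[c]))


print(five_number_summary(score))

intercept, slope = fit_line(hours, score)
print(intercept + slope * 10.0)    # predicted continuous label for hours = 10

means = fit_class_means(hours, passed)
print(predict_class(means, 3.0))   # predicted discrete label for hours = 3
```

The regression returns a number on a continuous scale, whereas the classifier returns one of a fixed set of labels; that difference in the type of label is the distinction drawn above.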



Exploratory Data Analysis workflow

Data analysis

How do you think that data analysis relates to:

  • “Data analytics”?
  • “Data modeling”?
  • “Machine learning”?
  • “Statistical learning”?
  • “Statistics”?
  • “Data science”?
  • “Data mining”?
  • “Knowledge discovery”?

Explanation

People from different fields (such as statistics, computer science, information science, industry) have different goals and different standard approaches.

  • We often use the same techniques.
  • We just use different terms to highlight different aspects of so-called data analysis.
  • The terms on the previous slide are not exact synonyms.
  • But according to most people they carry the same analytic intentions.

In this course we emphasize drawing insights that help us understand the data.

Some examples

Remember

             | Exploratory | Confirmatory
Description  |             |
Prediction   |             |
Explanation  |             |
Prescription |             |

Challenger disaster