Data Science and Predictive Machine Learning

This lecture

  1. Course overview
  2. Introduction to data science
  3. Some examples
  4. Pipes in R
  5. Visualization with ggplot2

Disclaimer

I owe a debt of gratitude to many people, as the thoughts and teachings in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. Wherever someone has contributed to the content of the slides, I have credited their authorship.

When figures and other external sources are shown, the references are included when the origin is known.

Opinions are my own.

Course page

Course overview

About me

  • Name: Gerko Vink
  • Identifies as: Dark data scientist
  • Studied to be: Psychologist
  • Landed in: Statistics
  • Enjoys: Putting easter eggs in software
  • Pet peeve: Excessively certain Data Scientists
  • Married, with 2 kids
  • Favorite guitarist: Jimmy Page
  • Still has a mortgage

Expertise: Missing data theory, statistical programming, computational evaluation

Topics

Week 1

Introduction to it all, modeling, least-squares estimation, linear regression, assumptions, model fit and complexity, intervals.

Week 2

Test/training splits, cross-validation, curse of dimensionality, underfitting and overfitting, bias/variance trade-off, logistic regression, classification and prediction, prediction evaluation, marginal likelihood, model interpretability, k-nearest neighbours and a peek into unsupervised clustering.

Week 3

The curse of dimensionality on steroids, how to avoid overfitting, problems with least squares estimation, ridge regression, the lasso, elastic net regularization, model interpretability.

Week 4

Support vector machines and non-linear predictions. BYOD.

Schedule

Introduction to SLV

Goals in data analysis

  • Description:
    What happened?
  • Prediction:
    What will happen?
  • Explanation:
    Why did/does something happen?
  • Prescription:
    What should we do?



Exploratory data analysis workflow

Data analysis

How do you think that data analysis relates to:

  • “Data analytics”?
  • “Data modeling”?
  • “Machine learning”?
  • “Statistical learning”?
  • “Statistics”?
  • “Data science”?
  • “Data mining”?
  • “Knowledge discovery”?

Explanation

People from different fields (such as statistics, computer science, information science, industry) have different goals and different standard approaches.

  • We often use the same techniques.
  • We just use different terms to highlight different aspects of so-called data analysis.
  • The terms on the previous slide are not exact synonyms.
  • But to most people they carry the same analytic intentions.

In this course we emphasize drawing insights that help us understand the data.

Some examples

Challenger disaster

How wages differ

The origin of cholera

Google Flu Trends

Bad performance



Why analysis and visualization in data science?

  • When high-risk decisions are at hand, it is paramount to analyze the correct data.

  • When thinking about important topics, such as whether to stay in school, it helps to know that more highly educated people tend to earn more, but also that there is no difference for top earners.

  • Before John Snow, people thought “miasma” caused cholera and they fought it by airing out the house. It was not clear whether this helped or not, but people thought it must because “miasma” theory said so.

  • If we know the flu is coming two weeks earlier than usual, that is just enough time to arrange flu shots for the most vulnerable people.

The above examples have in common that data analysis and the accompanying visualizations have yielded insights and solved problems that could not be solved without them.

  • On some level, humans do nothing but analyze data;
  • They may not do it consistently, understandably, transparently, or correctly, however;
  • Data analyses and visualizations help us process more data, and can keep us honest;
  • Data analysis and visualizations can also exacerbate our biases when we are not careful.

Thought

Data wrangling

Wrangling in the pipeline

Data wrangling is the process of transforming and mapping data from one “raw” data form into another format.

  • The process is often iterative
  • The goal is to add purpose and value to the data in order to maximize the downstream analytical gains

Source: R4DS

Core ideas

  • Discovering: The first step of data wrangling is to gain a better understanding of the data: different data are worked with and organized in different ways.
  • Structuring: The next step is to organize the data. Raw data is typically unorganized, and much of it may not be useful for the end product. This step is important for easier computation and analysis in the later steps.
  • Cleaning: Cleaning the data can take many forms: for example, standardizing dates that are formatted inconsistently, removing outliers that would skew results, and handling null values. This step is important for assuring the overall quality of the data.
  • Enriching: At this step, determine whether additional data that could easily be added would benefit the data set.
  • Validating: This step is similar to structuring and cleaning. Apply repeated sequences of validation rules to assure data consistency as well as quality and security. An example of a validation rule is confirming the accuracy of fields by cross-checking the data.
  • Publishing: Prepare the data set for use downstream, which could include use by people or software. Be sure to document any steps and logic applied during wrangling.

Source: Trifacta
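
A minimal sketch of such a pipeline (assuming the boys data from package mice and the dplyr verbs; not part of the original slides):

library(dplyr)

boys_clean <- mice::boys %>%
  filter(!is.na(hgt), !is.na(wgt)) %>%         # cleaning: drop rows with missing height/weight
  mutate(bmi_check = wgt / (hgt / 100)^2) %>%  # enriching: recompute BMI from height (cm) and weight (kg)
  arrange(age)                                 # structuring: order the rows by age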

Pipes

This is a pipe:

boys <- 
  read_csv("boys.txt") %>%
  head()

It effectively replaces boys <- head(read_csv("boys.txt")).

With pipes

Your code becomes more readable:

  • data operations are structured from left-to-right and not from in-to-out
  • nested function calls are avoided
  • local variables and copied objects are avoided
  • easy to add steps in the sequence

Benefit: a single object in memory that is easy to interpret
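
As a sketch of that readability gain (assuming %>% is available, e.g. via library(dplyr)), the two calls below are equivalent, but the piped version reads in the order the operations happen:

# nested: read from the inside out
head(na.omit(mice::boys))

# piped: read from left to right, top to bottom
mice::boys %>%
  na.omit() %>%
  head()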

The standard %>% pipe



This pipe is included in package dplyr in R (it originates in package magrittr).
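
Since the figure is not reproduced here, a minimal demonstration: %>% inserts the left-hand side as the first argument of the call on the right-hand side.

library(dplyr)            # makes %>% available

mice::boys %>% head(3)    # equivalent to head(mice::boys, 3)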

The %$% pipe



This pipe is not included in package dplyr in R. You need package magrittr for this.
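
A minimal demonstration (not from the original slides): %$% exposes the columns of the left-hand side by name to the expression on the right, much like with().

library(magrittr)

mice::boys %$%
  cor(age, hgt, use = "pairwise.complete.obs")  # boys contains missing heights
# equivalent to cor(mice::boys$age, mice::boys$hgt, use = "pairwise.complete.obs")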

The %T>% pipe



This pipe is also not included in package dplyr in R. You need package magrittr for this.
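
A minimal demonstration (not from the original slides): %T>% evaluates the right-hand call for its side effect and then passes the left-hand side, not that call's result, down the chain.

library(magrittr)
library(dplyr)

mice::boys %>%
  select(age, hgt, wgt) %>%  # keep three numeric columns
  na.omit() %T>%             # %T>% runs the next call for its side effect ...
  plot() %>%                 # ... (a scatterplot matrix) and passes the data onward
  colMeans()                 # so colMeans() receives the data, not plot()'s NULL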

The role of . in a pipe

In a %>% b(arg1, arg2, arg3), a will become arg1. With . we can change this.

mice::boys %>% 
  lm(age ~ hgt)
## Error in as.data.frame.default(data): cannot coerce class '"formula"' to a data.frame

VS

mice::boys %>% 
  lm(age ~ hgt, data = .)
## 
## Call:
## lm(formula = age ~ hgt, data = .)
## 
## Coefficients:
## (Intercept)          hgt  
##     -9.7438       0.1443

The . can be used as a placeholder in the pipe.

However …

mice::boys %$% 
  lm(age ~ .)
## Error in terms.formula(formula, data = data): '.' in formula and no 'data' argument

should be substituted by

mice::boys %$% 
  lm(age ~ ., data = .) 
## 
## Call:
## lm(formula = age ~ ., data = .)
## 
## Coefficients:
## (Intercept)          hgt          wgt          bmi           hc        gen.L  
##    2.556051     0.059987    -0.009846     0.142162    -0.024086     1.287455  
##       gen.Q        gen.C        gen^4        phb.L        phb.Q        phb.C  
##   -0.006861    -0.187256     0.034186     1.552398     0.499620     0.656277  
##       phb^4        phb^5           tv      regeast      regwest     regsouth  
##   -0.094722    -0.113686     0.074321    -0.222249    -0.233307    -0.258771  
##     regcity  
##    0.188423

Visualization 101

Why visualise?

  • We can process a lot of information quickly with our eyes
    • Our eyes can transfer up to 10^7 bits per second to our brain
    • Through reading we can transfer up to 10^3 bits per second to our brain
  • Plots give us information about
    • Distribution / shape
    • Irregularities
    • Assumptions
    • Intuitions
  • Summary statistics, correlations, parameters, model tests, p-values do not tell the whole story

ALWAYS plot your data!

Why visualise?

Source: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.
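
As a quick check of Anscombe's point (a sketch, not from the original slides), the anscombe data set that ships with R holds four x/y pairs with nearly identical summary statistics yet very different shapes when plotted:

# four data sets, (nearly) the same mean, variance and correlation
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_y = mean(y), var_y = var(y), cor_xy = cor(x, y))
})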

Why visualise?

ggplot2

What is ggplot2?

Layered plotting based on the book The Grammar of Graphics by Leland Wilkinson.

With ggplot2 you

  1. provide the data
  2. define how to map variables to aesthetics
  3. state which geometric object to display
  4. (optional) edit the overall theme of the plot

ggplot2 then takes care of the details

An example: scatterplot

1: Provide the data

boys %>%
  ggplot()

2: Map variables to aesthetics

boys %>%
  ggplot(aes(x = age, y = bmi))

3: State which geometric object to display

boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point()

An example: scatterplot

Why this syntax?

Create the plot

A <- 
  boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point(col = "dark green")

Add another layer (smooth fit line)

B <- A + 
  geom_smooth(col = "dark blue")

Give it some labels and a nice look

C <- B + 
  labs(x = "Age", y = "BMI", title = "BMI trend for boys") +
  theme_minimal()

Why this syntax?

plot(C)

Why this syntax?

For fun