Data Science and Predictive Machine Learning

This lecture

  1. Course overview
  2. Introduction to data science
  3. Some examples
  4. Pipes in R
  5. Visualization with ggplot2

Disclaimer

I owe a debt of gratitude to many people, as the thoughts and teachings in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. Wherever someone has contributed to the content of the slides, I have credited their authorship.

When figures and other external sources are shown, the references are included when the origin is known.

Opinions are my own.

Course page

Course overview

About me

  • Name: Gerko Vink
  • Identifies as: Dark data scientist
  • Studied to be: Psychologist
  • Landed in: Statistics
  • Enjoys: Putting easter eggs in software
  • Pet peeve: Excessively certain Data Scientists
  • Married, with 2 kids
  • Favorite guitarist: Jimmy Page
  • Still has a mortgage

Expertise: Missing data theory, statistical programming, computational evaluation

Topics

Week 1

Introduction to it all, modeling, least-squares estimation, linear regression, assumptions, model fit and complexity, intervals.

Week 2

Test/training splits, cross-validation, curse of dimensionality, underfitting and overfitting, bias/variance trade-off, logistic regression, classification and prediction, prediction evaluation, marginal likelihood, model interpretability, k-nearest neighbours and a peek into unsupervised clustering.

Week 3

The curse of dimensionality on steroids, how to avoid overfitting, problems with least squares estimation, ridge regression, the lasso, elastic net regularization, model interpretability.

Week 4

Support vector machines and non-linear predictions. BYOD.

Schedule

Introduction to SLV

Goals in data analysis

  • Description:
    What happened?
  • Prediction:
    What will happen?
  • Explanation:
    Why did/does something happen?
  • Prescription:
    What should we do?



Exploratory data analysis workflow

Data analysis

How do you think that data analysis relates to:

  • “Data analytics”?
  • “Data modeling”?
  • “Machine learning”?
  • “Statistical learning”?
  • “Statistics”?
  • “Data science”?
  • “Data mining”?
  • “Knowledge discovery”?

Explanation

People from different fields (such as statistics, computer science, information science, industry) have different goals and different standard approaches.

  • We often use the same techniques.
  • We just use different terms to highlight different aspects of so-called data analysis.
  • The terms on the previous slide are not exact synonyms.
  • But to most people they carry the same analytic intentions.

In this course we emphasize drawing insights that help us understand the data.

Some examples

Challenger disaster

How wages differ

The origin of cholera

Google Flu Trends

Bad performance



Why analysis and visualization in data science?

  • When high-risk decisions are at hand, it is paramount to analyze the correct data.

  • When thinking about important topics, such as whether to stay in school, it helps to know that more highly educated people tend to earn more, but also that there is no difference for top earners.

  • Before John Snow, people thought “miasma” caused cholera and they fought it by airing out the house. It was not clear whether this helped or not, but people thought it must because “miasma” theory said so.

  • If we know the flu is coming two weeks earlier than usual, that is just enough time to arrange flu shots for the most vulnerable people.

The above examples have in common that data analysis and the accompanying visualizations have yielded insights and solved problems that could not be solved without them.

  • On some level, humans do nothing but analyze data;
  • They may not do it consistently, understandably, transparently, or correctly, however;
  • Data analyses and visualizations help us process more data, and can keep us honest;
  • Data analysis and visualizations can also exacerbate our biases when we are not careful.

Thought

Data wrangling

Wrangling in the pipeline

Data wrangling is the process of transforming and mapping data from one “raw” data form into another format.

  • The process is often iterative
  • The goal is to add purpose and value to the data in order to maximize the downstream analytical gains

Source: R4DS

Core ideas

  • Discovering: The first step of data wrangling is to gain a better understanding of the data: different data are worked with and organized in different ways.
  • Structuring: The next step is to organize the data. Raw data is typically unorganized, and much of it may not be useful for the end product. This step is important for easier computation and analysis in the later steps.
  • Cleaning: Cleaning the data can take many forms: for example, standardizing dates that are formatted inconsistently, removing outliers that would skew results, and handling null values. This step is important for assuring the overall quality of the data.
  • Enriching: At this step, determine whether additional data that could easily be added would benefit the data set.
  • Validating: This step is similar to structuring and cleaning. Apply repeated sequences of validation rules to assure data consistency as well as quality and security. An example of a validation rule is confirming the accuracy of fields by cross-checking the data.
  • Publishing: Prepare the data set for use downstream, which could include use by people or software. Be sure to document any steps and logic applied during wrangling.

Source: Trifacta
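
A minimal sketch of such a pipeline (assuming the boys data from package mice and the dplyr verbs; not part of the original slides):

library(dplyr)

boys_clean <- mice::boys %>%
  filter(!is.na(hgt), !is.na(wgt)) %>%         # cleaning: drop rows with missing height/weight
  mutate(bmi_check = wgt / (hgt / 100)^2) %>%  # enriching: recompute BMI from height (cm) and weight (kg)
  arrange(age)                                 # structuring: order the rows by age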

Pipes

This is a pipe:

boys <- 
  read_csv("boys.txt") %>%
  head()

It effectively replaces boys <- head(read_csv("boys.txt")).

With pipes

Your code becomes more readable:

  • data operations are structured from left-to-right and not from in-to-out
  • nested function calls are avoided
  • local variables and copied objects are avoided
  • easy to add steps in the sequence

Benefit: a single object in memory that is easy to interpret
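
As a sketch of that readability gain (assuming %>% is available, e.g. via library(dplyr)), the two calls below are equivalent, but the piped version reads in the order the operations happen:

# nested: read from the inside out
head(na.omit(mice::boys))

# piped: read from left to right, top to bottom
mice::boys %>%
  na.omit() %>%
  head()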

The standard %>% pipe



This pipe is included in package dplyr in R (it originates in package magrittr).
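
Since the figure is not reproduced here, a minimal demonstration: %>% inserts the left-hand side as the first argument of the call on the right-hand side.

library(dplyr)            # makes %>% available

mice::boys %>% head(3)    # equivalent to head(mice::boys, 3)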

The %$% pipe



This pipe is not included in package dplyr in R. You need package magrittr for this.
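
A minimal demonstration (not from the original slides): %$% exposes the columns of the left-hand side by name to the expression on the right, much like with().

library(magrittr)

mice::boys %$%
  cor(age, hgt, use = "pairwise.complete.obs")  # boys contains missing heights
# equivalent to cor(mice::boys$age, mice::boys$hgt, use = "pairwise.complete.obs")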

The %T>% pipe



This pipe is also not included in package dplyr in R. You need package magrittr for this.
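
A minimal demonstration (not from the original slides): %T>% evaluates the right-hand call for its side effect and then passes the left-hand side, not that call's result, down the chain.

library(magrittr)
library(dplyr)

mice::boys %>%
  select(age, hgt, wgt) %>%  # keep three numeric columns
  na.omit() %T>%             # %T>% runs the next call for its side effect ...
  plot() %>%                 # ... (a scatterplot matrix) and passes the data onward
  colMeans()                 # so colMeans() receives the data, not plot()'s NULL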

The role of . in a pipe

In a %>% b(arg1, arg2, arg3), a will become arg1. With . we can change this.

mice::boys %>% 
  lm(age ~ hgt)
## Error in as.data.frame.default(data): cannot coerce class '"formula"' to a data.frame

VS

mice::boys %>% 
  lm(age ~ hgt, data = .)
## 
## Call:
## lm(formula = age ~ hgt, data = .)
## 
## Coefficients:
## (Intercept)          hgt  
##     -9.7438       0.1443

The . can be used as a placeholder in the pipe.

However …

mice::boys %$% 
  lm(age ~ .)
## Error in terms.formula(formula, data = data): '.' in formula and no 'data' argument

should be substituted by

mice::boys %$% 
  lm(age ~ ., data = .) 
## 
## Call:
## lm(formula = age ~ ., data = .)
## 
## Coefficients:
## (Intercept)          hgt          wgt          bmi           hc        gen.L  
##    2.556051     0.059987    -0.009846     0.142162    -0.024086     1.287455  
##       gen.Q        gen.C        gen^4        phb.L        phb.Q        phb.C  
##   -0.006861    -0.187256     0.034186     1.552398     0.499620     0.656277  
##       phb^4        phb^5           tv      regeast      regwest     regsouth  
##   -0.094722    -0.113686     0.074321    -0.222249    -0.233307    -0.258771  
##     regcity  
##    0.188423

Visualization 101

Why visualise?

  • We can process a lot of information quickly with our eyes
    • Our eyes can transfer up to 10^7 bits per second to our brain
    • Through reading we can transfer up to 10^3 bits per second to our brain
  • Plots give us information about
    • Distribution / shape
    • Irregularities
    • Assumptions
    • Intuitions
  • Summary statistics, correlations, parameters, model tests, p-values do not tell the whole story

ALWAYS plot your data!

Why visualise?

Source: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.
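
As a quick check of Anscombe's point (a sketch, not from the original slides), the anscombe data set that ships with R holds four x/y pairs with nearly identical summary statistics yet very different shapes when plotted:

# four data sets, (nearly) the same mean, variance and correlation
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_y = mean(y), var_y = var(y), cor_xy = cor(x, y))
})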

Why visualise?

ggplot2

What is ggplot2?

Layered plotting based on the book The Grammar of Graphics by Leland Wilkinson.

With ggplot2 you

  1. provide the data
  2. define how to map variables to aesthetics
  3. state which geometric object to display
  4. (optional) edit the overall theme of the plot

ggplot2 then takes care of the details

An example: scatterplot

1: Provide the data

boys %>%
  ggplot()

2: Map variables to aesthetics

boys %>%
  ggplot(aes(x = age, y = bmi))

3: State which geometric object to display

boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point()

An example: scatterplot

Why this syntax?

Create the plot

A <- 
  boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point(col = "dark green")

Add another layer (smooth fit line)

B <- A + 
  geom_smooth(col = "dark blue")

Give it some labels and a nice look

C <- B + 
  labs(x = "Age", y = "BMI", title = "BMI trend for boys") +
  theme_minimal()

Why this syntax?

plot(C)

Why this syntax?

For fun