Quick Overview


Fundamental Techniques in Data Science with R

In nine weeks you will learn the basics of data handling with R and the details about regression techniques in the context of statistical inference, as well as the connection to research philosophy. During every lecture we will treat a different theoretical aspect. Following each lecture there will be a computer lab exercise that connects the statistical theory to practice, as well as a workgroup meeting wherein you will work on solving motivating real-world case studies.

Assignment and Grading

The final grade is computed as follows:

| Graded part | Weight |
|---|---|
| Linear Regression Assignment and presentation | 25 % |
| Logistic Regression Assignment | 25 % |
| Written Exam | 50 % |

To develop the necessary skills for completing the assignments and the presentations, seven R exercises must be completed and submitted. These exercises are not graded, but students must complete them to pass the course.

In order to pass the course, the final grade must be 5.5 or higher, your contribution to the course should be sufficient and all assignments and R exercises should be handed in and/or passed. Otherwise, additional work is required concerning the assignments and/or exercises you have failed.


Schedule

| Week # | Topic | R-practical | Workgroup |
|---|---|---|---|
| 1 | The elemental building blocks of R | Assigning objects and elements; creating vectors, matrices, dataframes and lists | Receive instructions and form groups |
| 2 | Finding the least squares solution; simple linear regression | Subsetting data; using pipes to simplify the workflow | Locate a data set for predictive modeling and formulate a research hypothesis; make sure that the set facilitates continuous and dichotomous outcomes |
| 3 | Linear modeling in R; testing assumptions; standardized residuals, leverage and Cook’s distance | Class lm in R; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met |
| 4 | Inferential modeling; confidence intervals and hypothesis testing, non-constant error variance | Investigate the assumptions about linear modeling | Test and quantify the effect of the defined model; continue the project in rmarkdown |
| 5 | Model evaluation; cross-validation; categorical variables, non-linear relations, interactions and higher-order polynomials | Cross-validation and model fit in R | Evaluate if the model can be improved; prepare assignment A; evaluate the final linear model on your own data |
| 6 | Simple logistic regression | Class glm(formula, family = "binomial") in R; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met |
| 7 | Formulating the logistic model and interpreting the parameters; marginal effects | Parameter transformations; scale of the predictor/outcome and prediction and confidence intervals | Test and quantify the effect of the defined model |
| 8 | Logistic regression model evaluation; cross-validation; multiple regression; interactions | Multiple logistic regression and cross-validating the logistic regression in R | Evaluate if the model can be improved; prepare assignment B; evaluate the final logistic model on your own data |

Course Manual


Course description

Regression techniques are widely used to quantify the relationship between two or more variables. In data science it is very common to investigate this relation and linear and logistic regression are proven to be very powerful techniques. However, it is essential to understand how and when it is appropriate to apply these regression techniques. In this course, students will learn exactly how to do that with the statistical software package R.

This course gives students a new set of tools to explore the issues and problems so many people care about. The course will help students get acquainted with the principles of analytical data science and with linear and logistic regression, and it introduces the basics of statistical learning. These techniques will be presented in the context of estimation, testing and prediction. Students will learn to adapt these techniques in their way of thinking about statistical inference, which will help them to quantify the uncertainty and measure the accuracy of statistical estimates. Students will develop fundamental R programming skills and will gain experience with the tidyverse: they will visualize data with ggplot2 and perform basic data wrangling with dplyr. This course makes students better equipped for a further career (e.g. junior researcher or research assistant) or education in research, such as a (research) Master program, or a PhD.

Assignment

Students will form groups to work on two assignments. Students will need to perform calculations and program code for these assignments. All work needs to be combined in an easily understandable and insightful R project and must be submitted to the [Surfdrive file drop environment](https://surfdrive.surf.nl/files/index.php/s/LqUCEt3MWVsW7DF). Each assignment will be graded on

  1. Quality of the methodological application
  2. Model evaluation and assumptions checking
  3. Quality of the code and scripts

Grading

Students will be evaluated on the following aspects:

  1. apply and interpret the basic methodological and statistical concepts that are associated with doing predictive and/or inferential research;
    a. explain concepts from inferential statistics, such as probability, inference and modeling; and apply them in practice.
    b. make an informed choice for research designs that are suitable for regression analyses.
    c. apply and explain the choice for techniques to investigate data problems.
    d. apply and explain the concepts of linearity and non-linearity.
    e. interpret statistical software output and report software output following APA reporting guidelines.
    f. explain and conceptualize statistical inference and its relation to statistical theory.
    g. perform the different steps in solving basic regression analysis problems and report on these steps.
  2. apply and interpret important techniques in linear and logistic regression analysis;
    a. perform, interpret and evaluate quantitative (causal) analyses on data with the statistical software platform R.
    b. perform analyses in statistical software.

Relation between assessment and objective

  • The exam evaluates knowledge of methodological and statistical concepts (learning goals 1a, 1d, 1f), as well as the application of these concepts to research scenarios (learning goals 1b and 1c). During the exam students need to interpret statistical software output (learning goal 1e).
  • The practical labs test whether the student has sufficient skills to solve basic analysis problems and execute quantitative analyses on real-life data sets (learning goals 2a and 2b).
  • The workgroups focus on applying the newly gained knowledge and skills through a series of motivating real-world case studies aimed at solving relevant data analysis problems and reporting on the steps taken to obtain a solution (learning goal 1g).

After taking this course students can understand innovations in statistical markup, statistical simulation and reproducible research. Students are also able to approach challenges from different professional viewpoints. They have gained experience in marking up a professional manuscript and designing a state-of-the-art statistical archive in an open source repository.

How to prepare


Preparing your machine for the course

Dear all,

This semester you will participate in the Fundamental Techniques in Data Science with R course at Utrecht University. To realize a steeper learning curve, we will use some functionality that is not part of the base installation for R. The steps below guide you through installing both R and the necessary additions. Please complete them before the first meeting.

I look forward to seeing you all,

Gerko Vink

System requirements

Bring a laptop computer to the course and make sure that you have full write access and administrator rights to the machine. We will explore programming and compiling in this course, which means that you need full access to your machine. Some corporate laptops come with limited access for their users. I therefore advise you to bring a personal laptop computer to the workgroup meetings.

1. Install R

R can be obtained here. We won’t use R directly in the course, but rather call R through RStudio. Therefore it needs to be installed.

2. Install RStudio Desktop

RStudio is an Integrated Development Environment (IDE). It can be obtained as stand-alone software here. The free and open source RStudio Desktop version is sufficient.

3. Start RStudio and install the following packages.

Execute the following lines of code in the console window:

install.packages(c("ggplot2", "tidyverse", "magrittr", "micemd", "jomo", "pan", 
                 "lme4", "knitr", "rmarkdown", "plotly", "shiny", 
                 "devtools", "boot", "class", "car", "MASS", "ggplot2movies", 
                 "ISLR", "DAAG", "mice"), 
                 dependencies = TRUE)

If you are not sure where to execute code, use the following figure to identify the console:

[Figure: screenshot of RStudio identifying the console]

Just copy and paste the installation command and press the return key. When asked

Do you want to install from sources the package which needs 
compilation? (Yes/no/cancel)

type Yes in the console and press the return key.
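To check whether the installation succeeded, you could compare the requested packages with what is actually installed. The following is a minimal sketch, not part of the official instructions; the object names `pkgs`, `installed` and `missing_pkgs` are chosen here for illustration only.

```r
# Packages from the installation command above (duplicates removed)
pkgs <- c("ggplot2", "tidyverse", "magrittr", "micemd", "jomo", "pan",
          "lme4", "knitr", "rmarkdown", "plotly", "shiny", "devtools",
          "boot", "class", "car", "MASS", "ggplot2movies", "ISLR",
          "DAAG", "mice")

# Names of all packages currently installed in your R libraries
installed <- rownames(installed.packages())

# Packages that were requested but are not (yet) installed
missing_pkgs <- setdiff(pkgs, installed)

if (length(missing_pkgs) == 0) {
  message("All course packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

If any packages are reported as missing, running `install.packages(missing_pkgs, dependencies = TRUE)` is one way to retry only those.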


What if the steps above do not work for me?

If all else fails and you have insufficient rights to your machine, one of the following web-based services will offer a solution.

  1. Open a free account on rstudio.cloud. You can run your own cloud-based RStudio environment there.
  2. Use Utrecht University’s MyWorkPlace. You would have access to R and RStudio there. You may need to install packages for new sessions during the course.

Naturally, you will need internet access to use these services.

Week 1


Lecture (Monday)

Today’s lecture is about the elemental building blocks of R. We discuss what the object-oriented programming language R is, why we use RStudio, how to ‘speak’ S, the scripting language that is used in R and how to work with objects and elements.
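As a small illustration of objects and elements, the sketch below creates a vector, a matrix, a data frame and a list and selects elements from each. It uses only base R; the object names (`a`, `m`, `d`, `l`) are arbitrary and not taken from the lecture slides.

```r
# Assign a value to an object with the assignment operator <-
a <- 1:6                      # an integer vector with six elements

# Elements are selected with square brackets
a[2]                          # the second element
a[a > 3]                      # all elements larger than 3

# A matrix is a vector with dimensions (filled column by column)
m <- matrix(a, nrow = 2, ncol = 3)
m[2, 3]                       # row 2, column 3

# A data frame holds columns of equal length, possibly of different types
d <- data.frame(id = 1:3, name = c("x", "y", "z"))
d$name                        # select a column by name

# A list can hold objects of any type and size
l <- list(vec = a, mat = m, dat = d)
l$mat[1, ]                    # first row of the matrix stored in the list
```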

You can find the slides here

Workgroup (Thursday)

Today’s workgroup is online. See the Teams channel.

Make sure that you attend all workgroup meetings; today’s meeting in particular is vital. We will form groups and I will tell you all about the assignments and management of expectations.

Please do not forget to complete Exercise 1 before the workgroup meeting!

All the best,

Gerko


R exercise

This week’s R exercise is threefold. We need to cover a lot of ground this week to get you started for the rest of the course.

  • Complete Exercise 1 before Thursday’s workgroup. This exercise will get you started with R and RStudio.
  • Work on Exercise 2 during Thursday’s workgroup timeslot. Laura will be available in the Teams channels and/or meeting to help you during that time.
  • Complete and hand in Exercise 3 before the next lecture. For this week it is sufficient if you present the code for the relevant exercises.

A video discussion of the exercises will be available below after the second lecture.

Hand in a markdown compiled html + Rmd file for exercise 3 to the SurfDrive drop folder. Name the file Yourname.Rmd, where Yourname is your name. Do this before the next lecture on Monday.

Exercise discussion

The video discussion for the practical exercises:

Week 2


Lecture (Monday)

Today’s lecture has two parts:

  1. Let’s explore pipes: a more efficient way of organizing our R code that forces you to think about the analytical process while being more memory efficient.
  2. We look at squared deviations and see how useful these calculations can be. From there we leap to least-squares estimation and start with simple linear regression.
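Both parts can be sketched in a few lines of base R. The example below is an illustration under stated assumptions: it uses the built-in `mtcars` data rather than the lecture data, and the native pipe `|>` (available from R 4.1) instead of the magrittr pipe `%>%`. The object names are hypothetical.

```r
# Part 1: pipes. The native pipe |> passes the left-hand side as the
# first unnamed argument of the function on the right-hand side.
fit_nested <- lm(mpg ~ wt, data = subset(mtcars, cyl == 4))
fit_piped  <- mtcars |>
  subset(cyl == 4) |>
  lm(formula = mpg ~ wt)      # the piped data fill the `data` argument

# Part 2: least squares by hand for the simple linear model mpg ~ wt
x <- mtcars$wt
y <- mtcars$mpg

# Slope: sum of cross-products divided by sum of squared deviations of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: the fitted line passes through the point of means
b0 <- mean(y) - b1 * mean(x)

# The same estimates as obtained with lm()
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(unname(coef(fit)), c(b0, b1))
```

The nested and the piped call fit the same model; the piped version reads in the order in which the steps are carried out.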

You can find the slides here

Workgroup (Thursday)

Today’s workgroup is online. See the Teams channel.


R exercise

This week’s R exercise is about getting familiar with pipes and exploring the simple linear model.

Hand in a markdown compiled html + Rmd file for Exercise 4: Question 11 to the SurfDrive drop folder. Name the file Yourname.Rmd/html, where Yourname is your name. Do this before the next lecture on Monday.

Useful References

The above links are useful references that connect to this week’s materials.

Exercise answers

The answers to the fourth practical exercise:

Week 3


Lecture (Monday)

Today’s lecture is about linear modeling and its assumptions. Linear modeling is the ubiquitous workhorse that is used for estimation throughout contemporary data science. However, its application has limits. We explore the application and the limits in this lecture.
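A minimal sketch of how the diagnostics named in the schedule (standardized residuals, leverage and Cook’s distance) can be obtained for a fitted `lm` object. The model `mpg ~ wt + hp` on the built-in `mtcars` data is an assumption made for illustration; it is not the model from the lecture.

```r
# Fit a linear model with two predictors
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals: residuals scaled by their estimated standard error
res_std <- rstandard(fit)

# Leverage: the diagonal elements of the hat matrix
lev <- hatvalues(fit)

# Cook's distance: the influence of each case on the fitted values
cook <- cooks.distance(fit)

# Collect the diagnostics and show the three most influential cases
diagnostics <- data.frame(res_std, lev, cook)
head(diagnostics[order(-diagnostics$cook), ], 3)
```

Calling `plot(fit)` on the same object draws the standard diagnostic plots, which are a common starting point for evaluating whether the assumptions are met.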

You can find the slides here

Workgroup (Thursday)

Today’s workgroup is online. See the Teams channel.


R exercise

This week’s R exercise is about getting familiar with the linear model.

Hand in a markdown compiled html file for Exercise 5 to the SurfDrive drop folder. Do this before the next lecture on Monday.

Required reading

These readings are exam materials.

Week 4


Lecture (Monday)

Today’s lecture is about statistical inference. Inference is the process of drawing conclusions about true data generating models (TDGM). The most widely known TDGM is the population, the body we aim to infer about in social science and in classical statistics.

Even though statistical inference has been around for a long time, its associated components, quantities and estimands are not always straightforward. Let’s dive a bit deeper into that today, but in order to do so, we need some knowledge about statistical sampling and some widely used distributions. That is why we start today’s lecture with random numbers and data generation in R.
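The link between data generation and inference can be sketched with a small simulation. The example below is an illustration only: the normal distribution with mean 5 and standard deviation 2 is an arbitrary choice for the ‘true’ model, and the seed and sample sizes are assumptions.

```r
set.seed(123)                          # makes the random draws reproducible

# Treat a normal distribution with mean 5 and sd 2 as the 'true' model
x <- rnorm(n = 100, mean = 5, sd = 2)  # one sample of 100 observations

# 95% confidence interval for the mean, based on the t-distribution
n  <- length(x)
se <- sd(x) / sqrt(n)
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

# The same interval as computed by t.test()
ci_ttest <- t.test(x)$conf.int

# Repeat the sampling 1000 times and record whether each interval
# covers the true mean of 5
covers <- replicate(1000, {
  s   <- rnorm(100, mean = 5, sd = 2)
  int <- t.test(s)$conf.int
  int[1] < 5 & 5 < int[2]
})
mean(covers)                           # proportion of intervals that cover 5
```

Over many repetitions, the proportion of intervals that cover the true mean should be close to the nominal 95%.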

You can find the slides here

Workgroup (Thursday)

Today’s workgroup is online. See the Teams channel.


R exercise

This week’s R exercise is about getting familiar with the linear model.

Hand in a markdown compiled html file for Exercise 6 to the SurfDrive drop folder. Do this before the next lecture on Monday.

Solution to the exercise

You can find the answers to this week’s exercise and to the previous week’s exercise here. Use these answers only when you are stuck.

Video discussion of the exercises in two parts:

Week 5


Lecture (Monday)

Today’s lecture is about being a skilled modeler. There is a famous quote that is generally attributed to George Box:

\[\text{All models are wrong, but some are useful}\]

Today we dive deeper into why and how this statement applies. We also explore techniques to infer the usefulness of a model, given that it is wrong anyway.
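One technique for judging the usefulness of a model is cross-validation, which is listed in the schedule for this week. The following is a minimal base-R sketch of k-fold cross-validation; the helper `cv_mse()`, the choice of k = 5 and the use of `mtcars` are assumptions for illustration and not the course’s own implementation.

```r
set.seed(123)
k <- 5

# Randomly assign each row of mtcars to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Hypothetical helper: mean squared prediction error under k-fold CV
cv_mse <- function(formula, data, folds) {
  outcome <- all.vars(formula)[1]          # name of the outcome variable
  errors <- sapply(sort(unique(folds)), function(i) {
    train <- data[folds != i, ]            # fit on all folds but fold i
    test  <- data[folds == i, ]            # predict the held-out fold
    fit   <- lm(formula, data = train)
    mean((test[[outcome]] - predict(fit, newdata = test))^2)
  })
  mean(errors)                             # average over the k folds
}

mse_linear    <- cv_mse(mpg ~ wt, mtcars, folds)           # linear in weight
mse_quadratic <- cv_mse(mpg ~ poly(wt, 2), mtcars, folds)  # adds a quadratic term
c(linear = mse_linear, quadratic = mse_quadratic)
```

A lower cross-validated error suggests better out-of-sample prediction, although with a data set this small the comparison depends noticeably on the random fold assignment.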

You can find the slides here

Workgroup (Thursday)

Today’s workgroup is online. See the Teams channel.


R exercise

This week’s R exercise is about getting more familiar with the linear model and its usefulness.

No need to hand in any work, but make sure that you understand the contents, code and train of thought of the document. If not, ask Gerko or Laura.

Week 6


Lecture (Monday)

Today’s lecture is about logistic regression. We explore the basics of this method today and continue with a more in-depth exploration in the next weeks.
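As a first impression of what fitting such a model looks like in R, here is a minimal sketch with `glm()`. The outcome `am` and predictor `wt` from the built-in `mtcars` data are chosen purely for illustration and are not the lecture example.

```r
# am (0 = automatic, 1 = manual) is a dichotomous outcome
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Estimates are on the log-odds (logit) scale
summary(fit)$coefficients

# Exponentiating the estimates gives odds ratios
exp(coef(fit))

# Predicted probability of a manual transmission for a car with wt = 3
p3 <- predict(fit, newdata = data.frame(wt = 3), type = "response")
p3
```

Note that `family = binomial` must be specified explicitly; without it, `glm()` fits a Gaussian (ordinary linear) model.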

You can find the slides here

Workgroup (Thursday)

Today’s workgroup is online. See the Teams channel.


R exercise

This week’s R exercise is about getting familiar with fitting the logistic regression model.

To allow you to enjoy the holidays, there is no need to hand in any work, but make sure that you understand the contents, code and train of thought of the document. If not, ask Gerko or Laura.

Week 7


Lecture (Monday)

Today we continue with logistic regression. We’ll use the titanic data to demonstrate the technique and we will explore some ways to evaluate the usefulness of the fitted models.

You can find the slides here

Workgroup (Thursday)

Today’s workgroup is online. See the Teams channel.


R exercise

The titanic.csv data

This week’s exercise is to:

  1. Recreate the plots from slides 28 and 29, but now for females.
  2. For the model with the interactions:
    • Create a confusion matrix for the model with the interactions
    • Perform cross-validation on the titanic model with the interactions
  3. See if a model with all variables included is a better fit than the model with the interactions presented in the slides.
    • The model with all variables (excluding Name!) can simply be fit with glm(Survived ~ . * ., family = binomial, data = titanic[, -3])
    • The . here implies all remaining columns in the data. . * . yields all main effects and all two-way interactions between these columns.
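The mechanics of a confusion matrix and of cross-validating a logistic model can be sketched as follows. Because titanic.csv is not included here, the sketch uses the built-in `mtcars` data and the simple model `am ~ wt` as a stand-in; the 0.5 classification threshold, k = 5 and all object names are assumptions. You would substitute your own titanic model and data.

```r
# Stand-in logistic model (replace with the titanic model)
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Classify as 1 when the predicted probability exceeds 0.5
pred_class <- as.integer(predict(fit, type = "response") > 0.5)

# Confusion matrix: observed versus predicted classes
confusion <- table(observed  = mtcars$am,
                   predicted = factor(pred_class, levels = 0:1))
confusion
accuracy <- mean(pred_class == mtcars$am)   # proportion correctly classified

# k-fold cross-validation of the classification accuracy
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv_acc <- mean(sapply(1:k, function(i) {
  train  <- mtcars[folds != i, ]            # fit on all folds but fold i
  test   <- mtcars[folds == i, ]            # predict the held-out fold
  fit_i  <- glm(am ~ wt, family = binomial, data = train)
  prob_i <- predict(fit_i, newdata = test, type = "response")
  mean(as.integer(prob_i > 0.5) == test$am)
}))
cv_acc
```

The cross-validated accuracy is usually somewhat lower than the in-sample accuracy, because each case is predicted by a model that did not see it.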

Hand in a markdown compiled html file for Exercise 9 to the SurfDrive drop folder. Do this before the next lecture on Monday.

Week 8


Lecture (Monday)

Today we will wrap up this course. I’ll give you an overview of the highlights and introduce only a few new concepts.

You can find the slides here

Workgroup (Thursday)

Today’s workgroup is online. See the Teams channel.


R exercise

No exercise for this week. I believe that you have been given the proper skillset to solve the assignments in this course. Use your time to improve your skills.

Do not forget to include the newly introduced concepts for this week in your Assignment 2.

Exam material


Practice exam

You can find the practice exam here

What can be tested

This page as a pdf

The information in the lecture slides:

and the information in the following sources these lecture slides are based on:

What about equations and formulae?

Your knowledge of matrix algebra will not be tested. So, there is no need to memorize that the regression estimates \(\beta\) can be estimated as \(\hat{\beta} = ({\bf X}^T{\bf X})^{-1}{\bf X}^Ty\). However, you will need to know, understand and apply equations such as:

  • \(y = \beta_0 + \beta_1X+\epsilon\) and any more complicated version of this.
  • \(\epsilon = y - \hat{y}\)
  • \(\mathbb{E}[y] = \alpha + \beta x.\)
  • \(\log(\text{odds}) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p) = \text{logit}(p)\)
  • \(p_i = \frac{\text{exp}(\eta)}{1+\text{exp}(\eta)} = \frac{\text{exp}(\beta_0 + \beta_1x_{1,i} + \dots + \beta_nx_{n,i})}{1+\text{exp}(\beta_0 + \beta_1x_{1,i} + \dots + \beta_nx_{n,i})}\)
  • etcetera
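The logit and inverse-logit equations above can be checked numerically. This is a small worked sketch in base R; the value p = 0.8 is an arbitrary choice, and `qlogis()` and `plogis()` are the logit and inverse-logit functions from the stats package.

```r
p <- 0.8

# log(odds) = log(p / (1 - p)) = log(p) - log(1 - p) = logit(p)
log_odds <- log(p / (1 - p))
all.equal(log_odds, log(p) - log(1 - p))
all.equal(log_odds, qlogis(p))          # qlogis() is the logit function

# Inverse logit: p = exp(eta) / (1 + exp(eta))
eta    <- log_odds
p_back <- exp(eta) / (1 + exp(eta))
all.equal(p_back, plogis(eta))          # plogis() is the inverse logit
all.equal(p_back, p)                    # we recover the original probability
```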

What if you are in doubt?

If any of the course materials confuse you, drop me a line and I’d be more than happy to explain.

The second half of the last lecture is dedicated to a Q&A.