Fundamental Techniques in Data Science with R
In nine weeks you will learn the basics of data handling with R and the details about regression techniques in the context of statistical inference, as well as the connection to research philosophy. During every lecture we will treat a different theoretical aspect. Following each lecture there will be a computer lab exercise that connects the statistical theory to practice, as well as a workgroup meeting wherein you will work on solving motivating real-world case studies.
The final grade is computed as follows:
Graded part | Weight |
---|---|
Linear Regression Assignment and presentation | 25 % |
Logistic Regression Assignment | 25 % |
Written Exam | 50 % |
To develop the necessary skills for completing the assignments and the presentations, 7 R exercises must be completed and submitted. These exercises are not graded, but students must complete them to pass the course.
In order to pass the course, the final grade must be 5.5 or higher, your contribution to the course should be sufficient, and all assignments and R exercises should be handed in and/or passed. Otherwise, additional work is required concerning the assignments and/or exercises you have failed.
Week # | Topic | R practical | Workgroup |
---|---|---|---|
1 | The elemental building blocks of R | Assigning objects and elements; creating vectors, matrices, dataframes and lists | Receive instructions and form groups |
2 | Finding the least squares solution; simple linear regression | Subsetting data; using pipes to simplify the workflow | Locate a data set for predictive modeling and formulate a research hypothesis; make sure that the set facilitates continuous and dichotomous outcomes |
3 | Linear modeling in R; testing assumptions; standardized residuals, leverage and Cook’s distance | Class `lm` in R; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met |
4 | Inferential modeling; confidence intervals and hypothesis testing; non-constant error variance | Investigate the assumptions about linear modeling | Test and quantify the effect of the defined model; continue the project in rmarkdown |
5 | Model evaluation; cross-validation; categorical variables, non-linear relations, interactions and higher-order polynomials | Cross-validation and model fit in R | Evaluate if the model can be improved; prepare assignment A; evaluate the final linear model on your own data |
6 | Simple logistic regression | Class `glm(formula, family = "binomial")` in R; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met |
7 | Formulating the logistic model and interpreting the parameters; marginal effects | Parameter transformations; scale of the predictor/outcome and prediction and confidence intervals | Test and quantify the effect of the defined model |
8 | Logistic regression model evaluation; cross-validation; multiple regression; interactions | Multiple logistic regression and cross-validating the logistic regression in R | Evaluate if the model can be improved; prepare assignment B; evaluate the final logistic model on your own data |
Regression techniques are widely used to quantify the relationship between two or more variables. In data science it is very common to investigate this relationship, and linear and logistic regression have proven to be very powerful techniques. However, it is essential to understand how and when it is appropriate to apply these regression techniques. In this course, students will learn exactly how to do that with the statistical software package R.
This course gives students a new set of tools to explore the issues and problems so many people care about. The course will help students get acquainted with the principles of analytical data science and with linear and logistic regression, and introduces the basics of statistical learning. These techniques will be presented in the context of estimation, testing and prediction. Students will learn to incorporate these techniques into their way of thinking about statistical inference, which will help them to quantify the uncertainty and measure the accuracy of statistical estimates. Students will develop fundamental R programming skills and will gain experience with `tidyverse`: they will visualize data with `ggplot2` and perform basic data wrangling with `dplyr`. This course makes students better equipped for a further career (e.g. junior researcher or research assistant) or education in research, such as a (research) Master program or a PhD.
Students will form groups to work on two assignments. Students will need to perform calculations and write code for these assignments. All work needs to be combined in an easily understandable and insightful R project and must be submitted to the [Surfdrive file drop environment](https://surfdrive.surf.nl/files/index.php/s/LqUCEt3MWVsW7DF). Each assignment will be graded on
Students will be evaluated on the following aspects:
After taking this course students can understand innovations in statistical markup, statistical simulation and reproducible research. Students are also able to approach challenges from different professional viewpoints. They have gained experience in marking up a professional manuscript and designing a state-of-the-art statistical archive in an open source repository.
Dear all,
This semester you will participate in the Fundamental Techniques in Data Science with R course at Utrecht University. To realize a steeper learning curve, we will use some functionality that is not part of the base installation for R. The steps below guide you through installing both R and the necessary additions. Please do so before the first meeting.
I look forward to seeing you all,
Gerko Vink
Bring a laptop computer to the course and make sure that you have full write access and administrator rights to the machine. We will explore programming and compiling in this course. This means that you need full access to your machine. Some corporate laptops come with limited access for their users; I therefore advise you to bring a personal laptop computer to the workgroup meetings.
R
R can be obtained here. We won’t use R directly in the course, but rather call R through RStudio. Therefore, it needs to be installed.
RStudio Desktop
RStudio is an Integrated Development Environment (IDE). It can be obtained as stand-alone software here. The free and open source RStudio Desktop version is sufficient.
Execute the following lines of code in the console window:
install.packages(c("ggplot2", "tidyverse", "magrittr", "micemd", "jomo", "pan",
                   "lme4", "knitr", "rmarkdown", "plotly", "shiny",
                   "devtools", "boot", "class", "car", "MASS", "ggplot2movies",
                   "ISLR", "DAAG", "mice"),
                 dependencies = TRUE)
If you are not sure where to execute code, use the following figure to identify the console:
Just copy and paste the installation command and press the return key. When asked, type `Yes` in the console and press the return key.
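To check whether the installation succeeded, you could compare the requested packages against what R reports as installed. This is only a minimal sketch and not part of the official installation steps; the object names `pkgs` and `missing_pkgs` are made up for illustration.

```r
# Packages named in the installation command above
pkgs <- c("ggplot2", "tidyverse", "magrittr", "micemd", "jomo", "pan",
          "lme4", "knitr", "rmarkdown", "plotly", "shiny",
          "devtools", "boot", "class", "car", "MASS", "ggplot2movies",
          "ISLR", "DAAG", "mice")

# installed.packages() returns a matrix with one row per installed package;
# the row names are the package names
available <- rownames(installed.packages())

# Packages that were requested but are not (yet) installed
missing_pkgs <- setdiff(pkgs, available)
print(missing_pkgs)
```

If nothing is listed, every package was found. Otherwise, rerun `install.packages()` for the names that are printed.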
If all else fails and you have insufficient rights to your machine, the following web-based services will offer a solution.
… RStudio environment there.
… R and RStudio there. You may need to install packages for new sessions during the course.
Naturally, you will need internet access to use these services.
Today’s lecture is about the elemental building blocks of R. We discuss what the object-oriented programming language R is, why we use RStudio, how to ‘speak’ S, the scripting language that is used in R, and how to work with objects and elements.
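As a small, unofficial illustration of what working with objects and elements can look like (the object names below are arbitrary and not taken from the slides):

```r
# Assign values to objects with the assignment operator <-
a <- 1:5                      # integer vector
b <- c("x", "y", "z")         # character vector

# A matrix stores elements of a single type in rows and columns
m <- matrix(1:6, nrow = 2, ncol = 3)

# A data frame stores columns of equal length that may differ in type
d <- data.frame(id = 1:3, label = b)

# A list can hold objects of any type and size
l <- list(vec = a, mat = m, dat = d)

# Elements are selected with [ ], [[ ]] or $
a[2]          # second element of a
m[2, 3]       # element in row 2, column 3 of m
d$label       # column 'label' of d
l[["mat"]]    # the matrix stored in l
```

Note that `matrix()` fills column by column unless `byrow = TRUE` is specified.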
You can find the slides here
Today’s workgroup is online. See the Teams channel.
Make sure that you attend all workgroup meetings; today’s meeting, however, is vital. We will form groups and I will tell you all about the assignments and the management of expectations.
Please do not forget to complete Exercise 1 before the workgroup meeting!
All the best,
Gerko
The above links are useful references that connect to this week’s materials.
R exercise
This week’s R exercise is threefold. We need to cover a lot of ground this week to get you started for the rest of the course.
… R and RStudio.
A video discussion of the exercises will be available below after the second lecture.
Hand in a markdown-compiled `html` + `Rmd` file for Exercise 3 to the SurfDrive drop folder. Name the file `Yourname.Rmd`, where `Yourname` is your name. Do this before the next lecture on Monday.
The video discussion for the practical exercises:
Today’s lecture has two parts:
… R code that forces you to think about the analytical process while being more memory efficient.
You can find the slides here
Today’s workgroup is online. See the Teams channel.
R exercise
This week’s R exercise is about getting familiar with pipes and exploring the simple linear model.
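A minimal sketch of how a pipe can change the way a simple linear model is written. It assumes R 4.1 or later for the native pipe `|>` (the course materials may instead use the `%>%` pipe from `magrittr`, which reads much the same) and uses the built-in `mtcars` data purely as a stand-in.

```r
# Without a pipe: nested calls are read from the inside out
fit_nested <- lm(mpg ~ wt, data = subset(mtcars, cyl == 4))

# With a pipe: the steps are read from top to bottom.
# The piped data become the first unnamed argument; because `formula`
# is named explicitly, lm() matches the piped data to `data`.
fit_piped <- mtcars |>
  subset(cyl == 4) |>
  lm(formula = mpg ~ wt)

# Both routes give the same least squares estimates
coef(fit_nested)
coef(fit_piped)
```

The piped version makes the order of the analysis steps explicit: first subset, then model.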
Hand in a markdown-compiled `html` + `Rmd` file for Exercise 4: Question 11 to the SurfDrive drop folder. Name the file `Yourname.Rmd/html`, where `Yourname` is your name. Do this before the next lecture on Monday.
The above links are useful references that connect to this week’s materials.
Today’s lecture is about linear modeling and its assumptions. Linear modeling is the ubiquitous workhorse that is used for estimation throughout contemporary data science. However, its application has limits. We explore the application and the limits in this lecture.
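To give an impression of the quantities involved, here is a hedged sketch that fits a linear model and extracts standardized residuals, leverage and Cook’s distance. The `mtcars` data and the chosen predictors are assumptions for illustration only, not the lecture’s own example.

```r
# Fit a linear model on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)

# Case-wise diagnostics that help to judge the model assumptions
diagnostics <- data.frame(
  std_resid = rstandard(fit),       # standardized residuals
  leverage  = hatvalues(fit),       # diagonal of the hat matrix
  cooks_d   = cooks.distance(fit)   # influence of each case on the fit
)

# Cases with the largest Cook's distance deserve a closer look
head(diagnostics[order(-diagnostics$cooks_d), ], 3)

# Predictions with prediction intervals for two hypothetical cars
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
predict(fit, newdata = new_cars, interval = "prediction")
```

Calling `plot(fit)` additionally produces the standard diagnostic plots for an `lm` object.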
You can find the slides here
Today’s workgroup is online. See the Teams channel.
R exercise
This week’s R exercise is about getting familiar with the linear model.
Hand in a markdown-compiled `html` file for Exercise 5 to the SurfDrive drop folder. Do this before the next lecture on Monday.
These readings are exam materials.
Today’s lecture is about statistical inference. Inference is the process of drawing conclusions about true data generating models (TDGM). The most widely known TDGM is the population, the body we aim to infer about in social science and in classical statistics.
Even though statistical inference has been around for a long time, its associated components, quantities and estimands are not always straightforward. Let’s dive a bit deeper into that today, but in order to do so, we need some knowledge about statistical sampling and some widely used distributions. That is why we start today’s lecture with random numbers and data generation in R.
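A minimal sketch of generating data from a known population and drawing an inference from the sample. The seed, sample size and population parameters are arbitrary choices for illustration.

```r
set.seed(123)   # fix the seed so the random draws are reproducible

# Draw a sample of size 50 from a normal population with mean 10 and sd 2
x <- rnorm(50, mean = 10, sd = 2)

# Point estimate and standard error of the mean
est <- mean(x)
se  <- sd(x) / sqrt(length(x))

# 95% confidence interval based on the t-distribution
ci <- est + c(-1, 1) * qt(0.975, df = length(x) - 1) * se
ci

# t.test() reports the same interval
t.test(x)$conf.int
```

Because the population is known here, you can rerun the sketch with different seeds and see how often the interval covers the true mean of 10.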
You can find the slides here
Today’s workgroup is online. See the Teams channel.
R exercise
This week’s R exercise is about getting familiar with the linear model.
Hand in a markdown-compiled `html` file for Exercise 6 to the SurfDrive drop folder. Do this before the next lecture on Monday.
You can find the answers to this week’s exercise and to the previous week’s exercise here. Use these answers only when you are stuck.
Video discussion of the exercises in two parts:
These readings are exam materials.
Today’s lecture is about being a skilled modeler. There is a famous quote that is generally attributed to George Box:
\[\text{All models are wrong, but some are useful}\]
Today we dive deeper into why and how this statement applies. We also explore techniques to infer the usefulness of a model, given that it is wrong anyway.
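One way to judge the usefulness of a model is to see how well it predicts cases it has not seen. The following is a sketch of k-fold cross-validation in base R; the data (`mtcars`), the model and the number of folds are assumptions for illustration and not necessarily what the lecture uses.

```r
set.seed(123)
k <- 5

# Assign every row at random to one of k folds of roughly equal size
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: fit on the other folds, predict the held-out fold
cv_mse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  mean((test$mpg - predict(fit, newdata = test))^2)
})

# Cross-validated mean squared prediction error
mean(cv_mse)

# In-sample mean squared error of the model fitted on all data
full_fit <- lm(mpg ~ wt + hp, data = mtcars)
mean(residuals(full_fit)^2)
```

The in-sample error is computed on the same data that were used for fitting, so it tends to be the more optimistic of the two numbers.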
You can find the slides here
Today’s workgroup is online. See the Teams channel.
R exercise
This week’s R exercise is about getting more familiar with the linear model and its usefulness.
There is no need to hand in any work, but make sure that you understand the contents, code and train of thought of the document. If not, ask Gerko or Laura.
These readings are exam materials. Also useful, but not exam material, is:
Today’s lecture is about logistic regression. We explore the basics of this method today and continue with a more in-depth exploration in the next weeks.
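A hedged sketch of a simple logistic regression, using the built-in `mtcars` data as a stand-in for a data set with a dichotomous outcome (the variable choice is an assumption for illustration):

```r
# am is coded 0 (automatic) or 1 (manual): a dichotomous outcome
fit <- glm(am ~ wt, family = "binomial", data = mtcars)

# Coefficients are on the log-odds scale
coef(fit)

# Exponentiated coefficients are odds ratios
exp(coef(fit))

# Predicted probabilities of a manual transmission for three weights
new_cars <- data.frame(wt = c(2, 3, 4))
p <- predict(fit, newdata = new_cars, type = "response")
p
```

Without `type = "response"`, `predict()` returns values on the log-odds scale rather than probabilities.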
You can find the slides here
Today’s workgroup is online. See the Teams channel.
R exercise
This week’s R exercise is about getting familiar with fitting the logistic regression model.
To allow you to enjoy the holidays, there is no need to hand in any work, but make sure that you understand the contents, code and train of thought of the document. If not, ask Gerko or Laura.
These readings are exam materials.
Today we continue with logistic regression. We’ll use the titanic data to demonstrate the technique and we will explore some ways to evaluate the usefulness of the fitted models.
You can find the slides here
Today’s workgroup is online. See the Teams channel.
R exercise
This week’s exercise is to:
1. Recreate the plots from slides 28 and 29, but now for females
2. For the model with the interactions: …
The `titanic` model with the interactions (without `Name`!) can be simply fit with `glm(Survived ~ . * ., family = binomial, data = titanic[, -3])`. The `.` here implies all remaining columns in the data. `. * .` yields the model with all main effects and all pairwise interactions between these columns.
Hand in a markdown-compiled `html` file for Exercise 9 to the SurfDrive drop folder. Do this before the next lecture on Monday.
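To see what the `.` shorthand and negative column indexing do without fitting anything, you could inspect the expanded model terms. This sketch uses `mtcars` as a stand-in because the `titanic` data are not loaded here; the column selection is arbitrary.

```r
# A small stand-in data set: one outcome and three predictors
d <- mtcars[, c("mpg", "wt", "hp", "qsec")]

# A negative index drops a column by position (here the fourth, qsec)
d_dropped <- d[, -4]
names(d_dropped)

# '.' expands to every column in the data other than the outcome
labels_main <- attr(terms(mpg ~ ., data = d), "term.labels")
labels_main

# '. * .' crosses those columns with each other
labels_crossed <- attr(terms(mpg ~ . * ., data = d), "term.labels")
labels_crossed
```

With three predictors, the crossed formula holds the three main effects plus the three pairwise interaction terms.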
These readings are exam materials.
Today we will wrap up this course. I’ll give you an overview of the highlights and introduce only a few new concepts.
You can find the slides here
Today’s workgroup is online. See the Teams channel.
R exercise
No exercise for this week. I believe that you have been given the proper skillset to solve the assignments in this course. Use your time to improve your skills.
Do not forget to include the newly introduced concepts for this week in your Assignment 2.
You can find the practice exam here
The information in the lecture slides:
and the information in the following sources these lecture slides are based on:
Your knowledge of matrix algebra will not be tested. So, there is no need to memorize that the regression coefficients \(\beta\) can be estimated as \(\hat{\beta} = ({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf y}\). However, you will need to know, understand and apply equations such as:
If any of the course materials confuse you, drop me a line and I’d be more than happy to explain.
The second half of the last lecture is dedicated to a Q&A.