You can access the course materials quickly from
www.gerkovink.com/fundamentals
You can reach me at G.Vink@uu.nl or via the MS Teams channels
Fundamental Techniques in Data Science with R
We learn to use regression, a technique aimed at quantifying the strength of the relation between an outcome and a set of predictors.
Our analyses basically fall into one of the following three scenarios:
In general it holds that the more hours you spend on studying, the higher your grade. But this relation is not 1:1.
We will learn to identify the average conditional relation between outcome and predictors.
We will learn to differentiate our investigation between different groups. For example, is there a difference in the relation between Hours and Grade for different groups?
Finally, we will learn how to estimate the probability of pass or fail, based on the Hours studied. In other words, how many Hours should you study to pass?
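The three scenarios above can be sketched in a few lines of R. This is only an illustration on simulated data: the data frame study, the grouping variable Group, the pass mark of 5.5 and all coefficients are assumptions made up for this sketch, not course data.

```r
# Simulated, purely illustrative data (all values are assumptions)
set.seed(2024)
n <- 200
study <- data.frame(
  Hours = runif(n, min = 0, max = 40),
  Group = factor(rep(c("A", "B"), each = n / 2))
)
study$Grade <- 4 + 0.1 * study$Hours + rnorm(n, sd = 1)
study$Pass  <- as.integer(study$Grade >= 5.5)  # hypothetical pass mark

# Scenario 1: average conditional relation between Grade and Hours
fit_simple <- lm(Grade ~ Hours, data = study)

# Scenario 2: allow the relation to differ between groups (interaction)
fit_groups <- lm(Grade ~ Hours * Group, data = study)

# Scenario 3: probability of passing as a function of Hours
fit_pass <- glm(Pass ~ Hours, family = binomial, data = study)

coef(fit_simple)
p_pass <- predict(fit_pass, newdata = data.frame(Hours = c(5, 15, 30)),
                  type = "response")
p_pass
```

The linear models are fitted with lm() and the pass/fail model with glm() and a binomial family, the same two classes of models that the course schedule lists.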
What is the relation between age and bmi?
Can we infer the relation between age and bmi, but now for different levels of bmi?
Finally, if we would develop some intervention campaign, which age-group(s) should be targeted?
Learn to keep your cool and build the foundation for a successful scripting career in predictive and inferential analytics.
This means that you will learn the ins and outs of inferential and predictive research with linear and logistic models.
We use R to perform our data analysis and visualizations
R-exercises: every week

Week # | Topic | R-practical | Workgroup |
---|---|---|---|
1 | The elemental building blocks of R | Assigning objects and elements; creating vectors, matrices, dataframes and lists | Receive instructions and form groups |
2 | Finding the least squares solution; simple linear regression | Subsetting data; using pipes to simplify the workflow | Locate a data set for predictive modeling and formulate a research hypothesis; make sure that the set facilitates continuous and dichotomous outcomes |
3 | Linear modeling in R; testing assumptions; standardized residuals, leverage and Cook’s distance | Class lm in R; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met |
4 | Inferential modeling; confidence intervals and hypothesis testing, non-constant error variance | Demonstrate confidence validity of the linear model on simulated data with rmarkdown | Test and quantify the effect of the defined model; continue the project in rmarkdown |
5 | Model evaluation; cross-validation; categorical variables, non-linear relations, interactions and higher-order polynomials | Cross-validation and model fit in R | Evaluate if the model can be improved; prepare assignment A; evaluate the final linear model on your own data |

Week # | Topic | R-practical | Workgroup |
---|---|---|---|
6 | Simple logistic regression | Class glm(formula, family = "binomial") in R; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met |
7 | Formulating the logistic model and interpreting the parameters; marginal effects | Parameter transformations; scale of the predictor/outcome and prediction and confidence intervals | Test and quantify the effect of the defined model |
8 | Logistic regression model evaluation; cross-validation; multiple regression; interactions | Multiple logistic regression and cross-validating the logistic regression in R | Evaluate if the model can be improved; prepare assignment B; evaluate the final logistic model on your own data |
R is a language and environment for statistical computing and for graphics
GNU project (100% free software)
Managed by the R Foundation for Statistical Computing, Vienna, Austria.
Community-driven
Based on the object-oriented language S (1975)
R works with objects that consist of elements. The smallest elements are numbers and characters.
Everything that is published on the Comprehensive R Archive Network (CRAN) and is aimed at R users must be accompanied by a help file.
If you know the name of the function that performs an operation, e.g. anova(), then you just type ?anova or help(anova) in the console.
If you do not know the name of the function: type ?? followed by your search criterion. For example, ??anova returns a list of all help pages that contain the word ‘anova’.
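Besides the help pages, a few base R helpers let you look things up without leaving the console. The sketch below is a small addition to the commands above; it only uses apropos() and args(), and mentions help.search() in a comment.

```r
# apropos() lists the objects on the search path whose name matches a pattern
hits <- apropos("anova")
hits

# args() shows the arguments of a function without opening its help page
args(anova)

# help.search("anova") is the function that ??anova calls behind the scenes
```

apropos() searches object names only, whereas ?? searches the text of the installed help pages, so the two can return different results.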
Alternatively, the internet will tell you almost everything you’d like to know, and sites such as http://www.stackoverflow.com and http://www.stackexchange.com, as well as Google, can be of tremendous help.
When searching for R related issues, use ‘R:’ as a prefix in your search term.
Assigning things in R is very straightforward: use <-
For example, if you assign the value 100 (an element) to object a, you would type
a <- 100
Calling things in R is also very straightforward:
For example, we assigned the value 100 to object a. To call object a, we would type
a
## [1] 100
This is why we use RStudio.
a <- c(1, 2, 3, 4, 5)
a
## [1] 1 2 3 4 5
b <- 1:5
b
## [1] 1 2 3 4 5
Characters (or character strings) in R are indicated by the double quote identifier.
a.new <- c(a, "A")
a.new
## [1] "1" "2" "3" "4" "5" "A"
Notice the difference with a from the previous slide
a
## [1] 1 2 3 4 5
rep(a, 15)
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3
## [39] 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
If we would want just the third element, we would type
a[3]
## [1] 3
This we would refer to as a matrix
c <- matrix(a, nrow = 5, ncol = 2)
c
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5
matrix(a, nrow = 5, ncol = 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    1
## [4,]    2    3
## [5,]    4    5
c[1, ]
## [1] 1 1
c[, 2]
## [1] 1 2 3 4 5
c[1, 2]
## [1] 1
In short: square brackets [] are used to call elements, rows and columns.
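Square brackets also accept vectors of positions and negative positions. The sketch below uses a matrix m (a name chosen for this example) with the same layout as matrix c above.

```r
# A 5 x 2 matrix with the same layout as matrix c in the text
m <- matrix(1:5, nrow = 5, ncol = 2)

m[c(1, 3), ]          # rows 1 and 3, all columns
m[-1, ]               # every row except the first
m[2:4, 2]             # rows 2 to 4 of the second column (returned as a vector)
m[1, , drop = FALSE]  # first row, kept as a 1 x 2 matrix instead of a vector
```

By default R simplifies a single row or column to a vector; drop = FALSE keeps the matrix structure when later code expects two dimensions.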
If we add a character column to matrix c, everything becomes a character:
cbind(c, letters[1:5])
##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a"
## [2,] "2"  "2"  "b"
## [3,] "3"  "3"  "c"
## [4,] "4"  "4"  "d"
## [5,] "5"  "5"  "e"
Alternatively,
cbind(c, c("a", "b", "c", "d", "e"))
##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a"
## [2,] "2"  "2"  "b"
## [3,] "3"  "3"  "c"
## [4,] "4"  "4"  "d"
## [5,] "5"  "5"  "e"
Remember, matrices and vectors are numerical OR character objects. They can never contain both and still be used for numerical calculations.
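A minimal sketch of this coercion, using object names made up for the example: class() shows what a vector currently holds, and as.numeric() converts character digits back to numbers.

```r
num   <- c(1, 2, 3)
mixed <- c(num, "A")   # adding one character turns everything into characters

class(num)    # "numeric"
class(mixed)  # "character"

# Calculations need numbers again: convert the numeric-looking part back
back <- as.numeric(mixed[1:3])
sum(back)     # 6
```

Converting the full vector mixed with as.numeric() would turn "A" into NA and raise a warning, which is why only the first three elements are converted here.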
d <- data.frame("V1" = rnorm(5), "V2" = rnorm(5, mean = 5, sd = 2), "V3" = letters[1:5])
d
##           V1       V2 V3
## 1  2.1988103 4.047506  a
## 2  1.3124130 3.422794  b
## 3 -0.2651451 3.810765  c
## 4  0.5431941 8.301815  d
## 5 -0.4143399 4.891944  e
We ‘filled’ a data frame with two randomly generated sets from the normal distribution - where \(V1\) is standard normal and \(V2\) is normal with mean 5 and standard deviation 2 - and a character set.
Data frames can contain both numerical and character elements at the same time, although never in the same column.
You can name the columns and rows in data frames (just like in matrices)
row.names(d) <- c("row 1", "row 2", "row 3", "row 4", "row 5")
d
##               V1       V2 V3
## row 1  2.1988103 4.047506  a
## row 2  1.3124130 3.422794  b
## row 3 -0.2651451 3.810765  c
## row 4  0.5431941 8.301815  d
## row 5 -0.4143399 4.891944  e
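To check which type each column of a data frame holds, you can ask for the class per column. The sketch below builds its own data frame d_demo (a hypothetical name) with a fixed seed, so the random values differ from those printed above; stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
set.seed(123)  # fixed seed so the example is reproducible
d_demo <- data.frame(V1 = rnorm(5),
                     V2 = rnorm(5, mean = 5, sd = 2),
                     V3 = letters[1:5],
                     stringsAsFactors = FALSE)

col_classes <- sapply(d_demo, class)  # one class per column
col_classes
str(d_demo)                           # compact overview of the structure
```

Each column is a vector of a single type, which is why numbers and characters can share a data frame but never a column.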
There are two ways to obtain row 3 from data frame d:
d["row 3", ]
##               V1       V2 V3
## row 3 -0.2651451 3.810765  c
and
d[3, ]
##               V1       V2 V3
## row 3 -0.2651451 3.810765  c
The intersection between row 2 and column 3 can be obtained by
d[2, 3]
## [1] "b"
Both
d[, "V2"] # and
## [1] 4.047506 3.422794 3.810765 8.301815 4.891944
d[, 2]
## [1] 4.047506 3.422794 3.810765 8.301815 4.891944
yield the second column. But we can also use $ to call variable names in data frame objects
d$V2
## [1] 4.047506 3.422794 3.810765 8.301815 4.891944
Lists are just what they say they are: lists. You can have a list of everything mixed with everything. For example, a simple list can be created by
f <- list(a)
f
## [[1]]
## [1] 1 2 3 4 5
Elements or objects within lists can be called by using double square brackets [[]]. For example, the first (and only) element in list f is object a
f[[1]]
## [1] 1 2 3 4 5
We can simply add an object or element to an existing list
f[[2]] <- d
f
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
##               V1       V2 V3
## row 1  2.1988103 4.047506  a
## row 2  1.3124130 3.422794  b
## row 3 -0.2651451 3.810765  c
## row 4  0.5431941 8.301815  d
## row 5 -0.4143399 4.891944  e
to obtain a list with a vector and a data frame.
We can add names to the list as follows
names(f) <- c("vector", "data frame")
f
## $vector
## [1] 1 2 3 4 5
##
## $`data frame`
##               V1       V2 V3
## row 1  2.1988103 4.047506  a
## row 2  1.3124130 3.422794  b
## row 3 -0.2651451 3.810765  c
## row 4  0.5431941 8.301815  d
## row 5 -0.4143399 4.891944  e
Calling the vector (a) from the list can be done as follows
f[[1]]
## [1] 1 2 3 4 5
f[["vector"]]
## [1] 1 2 3 4 5
f$vector
## [1] 1 2 3 4 5
Take the following example
g <- list(f, f)
To call the vector from the second list within the list g, use the following code
g[[2]][[1]]
## [1] 1 2 3 4 5
g[[2]]$vector
## [1] 1 2 3 4 5
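Single and double square brackets behave differently on lists: [ ] returns a (smaller) list, while [[ ]] returns the element itself. The sketch below uses the hypothetical lists f_demo and g_demo so that it runs on its own.

```r
f_demo <- list(vector = 1:5, chars = c("a", "b"))

class(f_demo[1])    # "list": a list that contains the vector
class(f_demo[[1]])  # "integer": the vector itself

# Nested lists: chain double brackets (or $) from the outside in
g_demo <- list(f_demo, f_demo)
g_demo[[2]][["vector"]]
g_demo[[2]]$chars
```

This is why calculations such as mean() work on f_demo[[1]] but not on f_demo[1].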
Logical and relational operators are signs that evaluate a statement, such as ==, <, >, <=, >=, | (OR) and & (AND). Typing ! before a logical statement takes the complement of that statement. There are more operators, but these are the most useful.
For example, if we would like elements out of matrix c that are larger than 3, we would type:
c[c > 3]
## [1] 4 5 4 5
c > 3
##       [,1]  [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE
The number of TRUE values may differ between columns, so the selection cannot always be returned as a matrix. A vector as a return is therefore more appropriate.
c[c < 3 | c > 3] #c smaller than 3 or larger than 3
## [1] 1 2 4 5 1 2 4 5
or
c[c != 3] #c not equal to 3
## [1] 1 2 4 5 1 2 4 5
c != 3 # returns a matrix
##       [,1]  [,2]
## [1,]  TRUE  TRUE
## [2,]  TRUE  TRUE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE
c
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5
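The examples above use | and !=. A short sketch of & (AND) and ! (complement) follows, on a matrix m (a name chosen for this example) that has the same values as matrix c.

```r
m <- matrix(1:5, nrow = 5, ncol = 2)  # same values as matrix c in the text

m[m > 1 & m < 5]  # both conditions must hold: 2 3 4 2 3 4
m[!(m > 3)]       # complement of "larger than 3": 1 2 3 1 2 3

which(m > 3)      # positions (column by column) where the statement is TRUE
sum(m > 3)        # TRUE counts as 1, so this counts the matches: 4
```

which() and sum() are handy when you want the location or the number of matches rather than the values themselves.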
0 / 0
## [1] NaN
mean(c(1, 2, NA, 4, 5))
## [1] NA
There are two easy ways to perform “listwise deletion”:
mean(c(1, 2, NA, 4, 5), na.rm = TRUE)
## [1] 3
mean(na.omit(c(1, 2, NA, 4, 5)))
## [1] 3
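The same idea extends to data frames, where listwise deletion means dropping every row that contains at least one missing value. The sketch below uses a small made-up data frame df_na; is.na(), complete.cases() and na.omit() are base R functions.

```r
x <- c(1, 2, NA, 4, 5)
is.na(x)        # TRUE only for the missing element
sum(is.na(x))   # number of missing values: 1

is.nan(0 / 0)   # NaN ("not a number") is flagged by is.nan() ...
is.na(0 / 0)    # ... and also counts as missing

# Listwise deletion on a data frame: keep only the complete rows
df_na <- data.frame(x = x, y = c(10, NA, 30, 40, 50))
complete.cases(df_na)       # TRUE for rows without any NA
df_complete <- na.omit(df_na)
nrow(df_complete)           # rows 1, 4 and 5 remain: 3
```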
(3 - 2.9)
## [1] 0.1
(3 - 2.9) == 0.1
## [1] FALSE
Why does R tell us that 3 - 2.9 \(\neq\) 0.1?
(3 - 2.9) - .1
## [1] 8.326673e-17
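The tiny remainder appears because numbers such as 2.9 and 0.1 cannot be stored exactly in binary floating point. A common way around this, sketched below, is to compare with a tolerance instead of using ==; all.equal() is the base R function for that.

```r
x <- 3 - 2.9

x == 0.1                   # FALSE: exact comparison fails
isTRUE(all.equal(x, 0.1))  # TRUE: equal up to a small numerical tolerance
abs(x - 0.1) < 1e-8        # TRUE: the same idea written out by hand

print(x, digits = 17)      # shows the digits that are normally hidden
.Machine$double.eps        # the machine precision for doubles
```

all.equal() returns a description of the difference rather than FALSE when the values differ, so wrap it in isTRUE() when you use it inside if().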
Use comments (#) to clarify what you are doing
R-scripts
RStudio projects
Functions have parentheses (). Names directly followed by parentheses always indicate functions. For example:
matrix() is a function
c() is a function
(1 - 2) * 5 is a calculation, not a function
Packages give additional functionality to R.
By default, some packages are included. These packages allow you to do mainstream statistical analyses and data manipulation. Installing additional packages allows you to perform the state of the art in statistical programming and estimation.
The cool thing is that these packages are all developed by users. The throughput process is therefore very timely.
A list of available packages can be found on CRAN.
There are two ways to load a package in R
library(stats)
and
require(stats)
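The two functions differ in what happens when a package is not installed: library() stops with an error, whereas require() gives a warning and returns FALSE. A small sketch of that difference follows; the package name notARealPackage123 is a made-up name that is assumed not to exist.

```r
ok <- require(stats)  # returns TRUE (invisibly) when the package is available
ok

missing_pkg <- "notARealPackage123"  # hypothetical, assumed not installed

# require() returns FALSE instead of stopping
found <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)
found

# library() throws an error, which we catch here to keep the script running
res <- tryCatch(
  library(missing_pkg, character.only = TRUE),
  error = function(e) "error"
)
res
```

In scripts, library() is usually the safer choice, because a missing package is reported immediately rather than causing an obscure failure further down.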
The easiest way to install e.g. package mice is to use
install.packages("mice")
Alternatively, you can also do it in RStudio through
Tools --> Install Packages
R in depth
A workspace contains all changes you made to R.
A saved workspace contains everything at the time of the state wherein it was saved.
You do not need to run all the previous code again if you would like to continue working at a later time.
Workspaces are compressed and require relatively little disk space when stored. The compression is very efficient and beats reloading large datasets.
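Saving and restoring can also be done from code. The sketch below saves a single object to a temporary file with save() and restores it with load(); the object name a_vec and the file ws_file are chosen for this example. save.image() does the same for the whole workspace at once.

```r
a_vec <- c(1, 2, 3, 4, 5)

# Save the object to a temporary .RData file
ws_file <- tempfile(fileext = ".RData")
save(a_vec, file = ws_file)

# Remove it from the current session ...
rm(a_vec)

# ... and restore it from the file under its original name
load(ws_file)
a_vec
```

Note that load() silently overwrites objects with the same name, so it is worth checking what a file contains before loading it into a session you care about.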
R by default saves (part of) the code history and RStudio expands this functionality greatly.
It is often useful to look back at the code history, for example to recover code that you ran but did not save.
There are multiple ways to access the code history.
RStudio
Use common sense and BE CONSISTENT.
Browse through the tidyverse style guide
If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers and collaborators out of their rhythm when they go to read it. Try to avoid this.
Intentional spacing makes your code easier to interpret
a<-c(1,2,3,4,5)
vs.
a <- c(1, 2, 3, 4, 5)
At least put a space after every comma!
dplyr, pipes and linear modeling.