Dear all,

As a statistician, data scientist or developer you’ll need to master many skills. Some of these skills are made simpler by tools. This week we’ll bite the bullet by learning about two massively important additions to your toolset: **version control** and simulation.

To document our activity and our changes in detail, we’ll use Git. You can view Git as the ability to go back in time, all the way back to the very beginning of your project. A bonus: Git integrates nicely with RStudio. In this exercise we will learn:

  1. How to integrate Git within our projects.
  2. How to publish our projects as GitHub repositories
  3. How to go about development with Git and GitHub

In the meantime, we’ll also use Monte Carlo simulation to study one of the most important concepts in statistics:

  1. Confidence Validity

Use the appropriate channels to ask questions and hand in your work.

All the best,

Gerko


1 Git

Git is a free and open source version control system for text files. It can handle extensive change logging for you, no matter the size of the project. Git is fast and efficient, but its effectiveness also depends on how frequently you instruct it to log your project’s changes.

You can see Git as a blank canvas that starts at a certain point in time. Every time you (or others) instruct Git to log the changes that have been made, Git adds those changes to this canvas. We call these additions to the canvas commits. With every commit an extensive log is created that includes at least the following information:

  • the changes made
  • who made the changes
  • metadata
  • a small piece of text that describes the changes made

The differences between two commits, that is, the changes between them, are called diffs.
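
To make commits and diffs concrete, the sketch below creates a throwaway repository from R, records two commits, and then asks Git for the log and for the diff between the two commits. It assumes that the git executable can be found on your PATH. The helper git(), the file name analysis.R, and the user name and e-mail address are made up for illustration.

```r
# Throwaway repository in a fresh temporary directory
repo <- tempfile("git-demo-")
dir.create(repo)

# Hypothetical helper: run a git command inside `repo`, return its output lines
git <- function(...) {
  system2("git", c("-C", shQuote(repo), ...), stdout = TRUE)
}

git("init", "--quiet")
# Identity for this throwaway repository only (made-up values)
git("config", "user.name", shQuote("Example User"))
git("config", "user.email", "user@example.com")

# First commit: a one-line script
script <- file.path(repo, "analysis.R")
writeLines("x <- rnorm(10)", script)
git("add", "analysis.R")
git("commit", "--quiet", "-m", shQuote("Add first analysis line"))

# Second commit: the same script with one extra line
writeLines(c("x <- rnorm(10)", "mean(x)"), script)
git("add", "analysis.R")
git("commit", "--quiet", "-m", shQuote("Compute the mean"))

log_lines  <- git("log", "--oneline")        # one line per commit, newest first
diff_lines <- git("diff", "HEAD~1", "HEAD")  # changes between the two commits

print(log_lines)
print(diff_lines)
```

In the printed diff, lines that start with a plus sign were added in the second commit.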

If you’d like to know much more about Git, this online book is a very good resource. If you’d like to practice with the command line interface, use this webpage for a quick course. This book covers pretty much everything you need to marry Git and R.


2 GitHub

GitHub is the social and user interface to Git that allows you to work in repositories. These repositories can be seen as project folders in which you publish your work, but you can also use them as sandboxes for development, testing, and so on. There is a distinction between private repositories (only for you and those you grant access) and public repositories (visible to everyone).

Your public repositories can be viewed and forked by everyone. Forking is when other people create a copy of your repository on their own account. This allows them to work on a repository without affecting the master. You can also create a separate line of development inside your own repository; that process is called branching instead of forking. If you create a local copy of a repository on your own machine, the process is called cloning.

GitHub’s ability to branch, fork and clone is very useful as it allows other people and yourself to experiment on (the code in) a repository before any definitive changes are merged with the master. If you’re working in a forked repository, you can submit a pull request to the repository collaborators to accept (or reject) any suggested changes.

For now, this may be confusing, but I hope you recognize the benefits GitHub can have on the process of development and bug-fixing. For example, the most up-to-date version of the mice package in R can be directly installed from the mice repository with the following code:

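# Install devtools once; it provides install_github()
install.packages("devtools")
# Install the development version of mice from its GitHub repository
devtools::install_github(repo = "stefvanbuuren/mice")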

You can see that this process requires the devtools package, which extends R with essential development tools. Installing packages in R directly from their respective GitHub repositories allows you to obtain the latest - often improved and less buggy - iteration of that software even before it is published on CRAN.


3 Installing Git

3.1 Installing on Mac

I suggest you install Git by downloading and installing GitHub Desktop. GitHub’s desktop application is a nice GUI and, naturally, integrates well into the repository workflow on GitHub.

Once it is installed, go to GitHub Desktop > Install Command Line Tool.

After a reboot, all should be set.

3.2 Installing on Windows

Download and install Git for Windows. Then download and install GitHub Desktop. GitHub’s desktop application is a nice GUI and, naturally, integrates well into the repository workflow on GitHub.

After a reboot, all should be set.


4 Command line interface vs. GUI

Ultimately, you’ll want to learn how to use Git through the command line interface (CLI). It offers better and more comprehensive functionality. Again, take this 15-minute course to get a gentle introduction. Do not worry about missing out on the CLI if you don’t take this course: in week 4 we’ll explore in detail how to handle Git when things go haywire. You’ll be a CLI wizard by then.


5 .gitignore

Git sees every file in your repository as one of the following three types:

  • tracked files that have been (previously) staged and committed
  • untracked files that have not been staged or committed
  • ignored files that have been explicitly ignored

It may be wise to instruct Git to ignore changes in some files. For example, compiled files (think of .com, .exe, .o, .so, etc.), archives (e.g. .zip, .tar, .rar), logs (.log) and files generated at runtime (.temp) do not have to be tracked by Git. The same holds for hidden system files (e.g. .DS_Store or Thumbs.db). Adding such file types to a file named .gitignore and placing that file in the root of your repository will take care of focusing Git’s energy on useful files only. For common .gitignore examples, see this repository. There are many examples inside, such as this .gitignore example for R.
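
As an illustration, a .gitignore for a small R project could look like the sketch below. The patterns for compiled files, archives, logs, runtime files and system files follow the examples mentioned above; the R-specific entries at the top are an assumption about what you would typically not want to track in an RStudio project. Adjust the list to your own project.

```gitignore
# R and RStudio session files (assumed typical entries for an R project)
.Rhistory
.RData
.Rproj.user/

# Compiled files
*.com
*.exe
*.o
*.so

# Archives
*.zip
*.tar
*.rar

# Logs and files generated at runtime
*.log
*.temp

# Hidden system files
.DS_Store
Thumbs.db
```

Lines that start with # are comments, and a pattern such as *.log matches every file with that extension.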


6 Linking GitHub and RStudio

Securely linking your local machine to the remote repository is vital when collaborating with other people. In short: you would not want a potential hacker to have contributor access to any of your projects. I have prepared this walkthrough video that details the process of linking GitHub to your machine and RStudio. Below I explain the rationale of using both an SSH key and a personal access token.

If you still experience problems after following my walkthrough, check this chapter.

To learn more about maintaining a package as a GitHub repository within RStudio, have a look at this guide by Hadley Wickham.


6.1 SSH keys

With an SSH key you can identify yourself to an online server (in this case the GitHub server) without having to log in every time. It is like your machine having access to an online server through a unique biometric security measure, but instead of biometric data a cryptographic key pair is used: the private key stays on your machine and the matching public key is registered with GitHub. You will need an SSH key to link RStudio to your GitHub repository.


6.2 Personal access tokens

If you use GitHub’s 2FA functionality - you should! - your username and password are not sufficient to push commits to GitHub through RStudio. To solve this, follow these steps on github.com, as detailed in this walkthrough video:

  1. Log in to your account
  2. Click on your profile photo (upper right corner) and select Settings
  3. Go to Developer settings
  4. Select Personal access tokens in the left sidebar
  5. Click Generate new token
  6. Give the token a name
  7. Select at least the repo scope; you’ll need these permissions to access repositories
  8. Click Generate token

Copy the token. The token will not be displayed again, so make a note of it, or save it somewhere.

In RStudio, paste the generated token in the password field when RStudio asks for your credentials. The token will now serve as the unique authenticated link instead of your password.


7 Git exercise

  1. Fork the course repository. See also this walkthrough video I’ve made for you that details the next couple of steps.
  2. Clone the fork to your machine
  3. Create a new branch
  4. Add two pieces of personal information to a Yourname/Assignment 1 folder.
  5. Commit the changes
  6. Push to your GitHub fork
  7. Send a pull request to incorporate your changes into the upstream/master branch (i.e. gerkovink/markup2020). I have made another walkthrough video for you that details this step.
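
Forking (step 1) and sending the pull request (step 7) happen on github.com, but steps 2 to 6 can be rehearsed offline. The sketch below does this from R, with a local bare repository standing in for your fork. It assumes that the git executable is on your PATH. The helper git(), the branch name assignment-1, the file name about-me.txt and the identity are hypothetical; with a real fork you would pass its URL to git clone instead of the local path.

```r
# Hypothetical helper: run a git command inside directory `dir`
git <- function(dir, ...) {
  system2("git", c("-C", shQuote(dir), ...), stdout = TRUE)
}

base      <- tempfile("fork-demo-")
dir.create(base)
fork_dir  <- file.path(base, "fork.git")  # stands in for your fork on GitHub
clone_dir <- file.path(base, "clone")     # your local working copy

# Offline stand-in for the fork: an empty bare repository
system2("git", c("init", "--bare", "--quiet", shQuote(fork_dir)))

# Step 2: clone the fork to your machine
system2("git", c("clone", "--quiet", shQuote(fork_dir), shQuote(clone_dir)))
git(clone_dir, "config", "user.name", shQuote("Your Name"))  # made-up identity
git(clone_dir, "config", "user.email", "you@example.com")

# Step 3: create a new branch and switch to it
git(clone_dir, "checkout", "-b", "assignment-1")

# Step 4: add two pieces of personal information to Yourname/Assignment 1
folder <- file.path(clone_dir, "Yourname", "Assignment 1")
dir.create(folder, recursive = TRUE)
writeLines(c("Favourite statistic: the median", "Favourite colour: green"),
           file.path(folder, "about-me.txt"))

# Step 5: stage and commit the changes
git(clone_dir, "add", "--all")
git(clone_dir, "commit", "--quiet", "-m", shQuote("Add personal information"))

# Step 6: push the branch to the fork
git(clone_dir, "push", "--quiet", "-u", "origin", "assignment-1")

# The new branch is now present in the stand-in fork
remote_branches <- git(fork_dir, "branch", "--list")
print(remote_branches)
```

The same steps can be carried out with the buttons in RStudio’s Git pane or in GitHub Desktop; the commands only make explicit what those buttons do.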

8 Monte Carlo simulation exercise

  1. Perform a small simulation that does the following:
     1. Sample 100 samples from a standard normal distribution.
     2. For each of these samples, calculate the following statistics for the mean:
        • absolute bias
        • standard error
        • lower bound of the 95% confidence interval
        • upper bound of the 95% confidence interval
     3. Create a plot that demonstrates the following:

        “A replication of the procedure that generates a 95% confidence interval that is centered around the sample mean would cover the population value at least 95 out of 100 times” (Neyman, 1934)

     4. Present a table containing all simulated samples for which the resulting confidence interval does not contain the population value.
  2. Add the simulation and its results to the Yourname/Assignment 1 folder
  3. Commit the changes to the repository
  4. Push to your GitHub fork
  5. Send a pull request to incorporate your changes into the upstream/master branch (i.e. gerkovink/markup2020)
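
One possible way to set up the simulation part of this exercise (sampling, the statistics for the mean, the plot and the table) is sketched below in base R. This is a starting point under stated assumptions, not the reference solution: the exercise does not fix the number of observations per sample, so n = 1000 is an arbitrary choice, as are the seed and the names one_sample and results. The standard error is taken as sd(x) / sqrt(n) and the interval uses the t distribution.

```r
set.seed(123)   # arbitrary seed, only for reproducibility
n_sim <- 100    # number of samples, as in the exercise
n     <- 1000   # observations per sample (assumption, not given in the exercise)
mu    <- 0      # population mean of the standard normal distribution

# Hypothetical helper: draw one sample and summarise its mean
one_sample <- function(n, mu) {
  x    <- rnorm(n)
  m    <- mean(x)
  se   <- sd(x) / sqrt(n)
  crit <- qt(0.975, df = n - 1)   # two-sided 95% critical value
  c(mean  = m,
    bias  = abs(m - mu),          # absolute bias
    se    = se,
    lower = m - crit * se,
    upper = m + crit * se)
}

# One row per simulated sample
results <- as.data.frame(t(replicate(n_sim, one_sample(n, mu))))
results$covered <- results$lower <= mu & mu <= results$upper

# Proportion of intervals that cover the population value
mean(results$covered)

# Plot: one horizontal interval per sample; intervals that miss mu in red
plot(NA, xlim = range(results$lower, results$upper), ylim = c(1, n_sim),
     xlab = "Sample mean with 95% confidence interval", ylab = "Simulation")
segments(results$lower, seq_len(n_sim), results$upper, seq_len(n_sim),
         col = ifelse(results$covered, "grey40", "red"))
points(results$mean, seq_len(n_sim), pch = 16, cex = 0.5)
abline(v = mu, lty = 2)

# Table: the samples whose interval does not contain the population value
results[!results$covered, ]
```

Because each run is random, the observed number of covering intervals varies around 95 from one seed to the next; rerunning with a different seed shows this directly.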

End of exercises