Dear all,

Last week we’ve learned about using rmarkdown to create a reproducible workflow. With rmarkdown any independent researcher would be able to quickly reproduce, evaluate and/or add to your work. Also, your collaborators can do the same and new collaborators can be quickly brought up to speed.

This week we’ll add another toolset to reproducibility; one that is primarily designed for you: Git. You can view Git as the ability to go back in time. Back to the very beginning of your project.

Git integrates nicely with RStudio. In this exercise we will learn

  1. How to integrate Git within our projects.
  2. How to publish our projects as GitHub repositories

All the best,

Gerko


1 Git

Git is a free and open source version control system for text files. It can handle extensive change logging for you, no matter the size of the project. Git is fast and efficient, but its effectiveness depends also on the frequency you instruct it to log your project’s changes.

You can see Git as a blank canvas that starts at a certain point in time. Every time you (or others) instruct Git to log any changes that have been made, Git adds the changes that are made to this canvas. We call the changes to the canvas commits. With every commit an extensive log is created that includes at least the following information:

  • the changes made
  • who made the changes
  • metadata
  • a small piece of text that describe the changes made

The difference between two commits - or the changes between them - are called diffs.

If you’d like to know much more about Git, this online book is a very good resource. If you’d like to practice with the command line interface use this webpage for a quick course.


2 GitHub

GitHub is the social and user interface to Git that allows you to work in repositories. These repositories can be seen as project folders in which you publish your work, but you can also use them as test sites for development, testing, etcetera. There is a distinction between private repositories (only for you and those you grant access) and public repositories (visible for everyone).

Your public repositories can be viewed and forked by everyone. Forking is when other people create a copy of your repository on their own account. This allows them to work on a repository without affecting the master. You can also do this yourself, but then the process is called branching instead of forking. If you create a copy of a repository that is offline, the process is called cloning.

GitHub’s ability to branch, fork and clone is very useful as it allows other people and yourself to experiment on (the code in) a repository before any definitive changes are merged with the master. If you’re working in a forked repository, you can submit a pull request to the repository collaborators to accept (or reject) any suggested changes.

For now, this may be confusing, but I hope you recognize the benefits GitHub can have on the process of development and bug-fixing. For example, the most up-to-date version of the mice package in R can be directly installed from the mice repository with the following code:

install.packages("devtools")
devtools::install_github(repo = "stefvanbuuren/mice")

You can see that this process requires package devtools that expands the R functionality with essential development tools. Loading packages in R directly from their respective GitHub repositories, allows you to obtain the latest - often improved and less buggy - iteration of that software even before it is published on CRAN.


3 Installing Git

3.1 Installing on Mac

I suggest you install Git by downloading and installing GitHub Desktop. GitHub’s desktop application is a nice GUI and, naturally, integrates well into the repository workflow on GitHub.

When installed, you can go to GitHub Desktop > Install Command Line Tool

After a reboot, all should be set.

3.2 Installing on Windows

Download and install Git for Windows, Then download and install GitHub Desktop. GitHub’s desktop application is a nice GUI and, naturally, integrates well into the repository workflow on GitHub.

After a reboot, all should be set.


4 Command line interface vs. GUI

Ultimately, you’ll want to learn how to use Git through the command line. It offers better functionality. Again, take this 15-minute course to get a gentle introduction,


5 Git and RStudio

To link Git and RStudio, follow this document

To learn more about maintaining a package as GitHub repository within RStudio, have a look at this guide by Hadley Wickham.


6 .gitignore

GitHub sees every file in your repository as one of the following three

  • tracked files that have been (previously) staged and committed
  • untracked files that have not been staged or committed
  • ignored files that have been explicitly ignored

It may be wise to instruct Git to ignore changes in some files. For example, compiled files (think about .com, .exe, .o, .so, etc), archives (e.g. .zip, .tar, .rar), logs (.log) and files generated in runtime (.temp) do not have to be tracked by Git. The same holds for hidden system files (e.g. .DS_Store or Thumbs.db). Adding such filetypes to a file named .gitignore and placing that file in the root of your repository will take care of focusing Git’s energy on useful files only. For common .gitignore examples, see this repository. There are many examples inside, such as this .gitignore example for R


7 SSH keys

Follow this tutorial to add an SSH key to your GitHub account. With an SSH key you can identify yourself to an online server (in this case the GitHub server) without having to log in every time. It is like your machine having access to an online server through a unique biometric security measure, but instead of biometric data a bits-and-bytes hash code is communicated every time. You will need an SSH key to link RStudio to your GitHub repository.


8 2-Factor Authentication

If you use GitHub’s 2FA functionality - you should! - your username and password are not sufficient to push commits to GitHub through RStudio. To solve this follow these steps on github.com:

  1. Log in to your account
  2. Click on your profile photo (upper right corner) and select Settings
  3. Go to Developer settings
  4. Select Personal access tokens in the left sidebar
  5. Click Generate new token
  6. Give the token a name
  7. Select at least the repo scope; you’ll need these permissions to access repositories
  8. Click Generate token

Copy the token. The token will not be displayed again, so make a note of it, or save it somewhere.

In RStudio, paste the generated token in the password field when RStudio asks for your credentials. The token will now serve as the unique authenticated link instead of your password.


9 Exercise

  1. Add Git functionality to the project of last week’s exercise and publish that project on GitHub.
  2. Fork this repository
  • add a new branch with your name to your fork,
  • add your name to the file README.md
  • send a pull request to incorporate your changes into the upstream/master branch.

End of exercise