Dear all,
As a statistician / data scientist / developer you’ll need to master many skills. Some of these skills are made simpler with tools. This week we’ll bite the bullet by learning about two massively important additions to your toolset: **version control** and simulation.
To document our activity and our changes in detail, we’ll use Git. You can view Git as the ability to go back in time, all the way back to the very beginning of your project. A bonus: Git integrates nicely with RStudio. In this exercise we will learn about
- Git within our projects
- GitHub repositories
- Git and GitHub
In the meantime, we’ll also use Monte Carlo simulation to study one of the most important concepts in statistics: the confidence interval.
Use the appropriate channels to ask questions and hand in your work.
All the best,
Gerko
Git
Git
is a free and open source version control system for
text files. It can handle extensive change logging for you, no matter
the size of the project. Git
is fast and efficient, but its
effectiveness depends also on the frequency you instruct it to log your
project’s changes.
You can see Git as a blank canvas that starts at a certain point in time. Every time you (or others) instruct Git to log any changes that have been made, Git adds those changes to this canvas. We call the changes to the canvas commits.
With every commit an extensive log is created that includes at least the following information:
The differences between two commits, that is, the changes between them, are called diffs.
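To make commits and diffs concrete, here is a minimal sketch on the command line. It assumes Git is installed; the temporary directory, the file name `analysis.R`, and the name and e-mail address are placeholders invented for this illustration.

```shell
# Create a throwaway repository in a temporary directory (hypothetical demo project)
demo="$(mktemp -d)"
cd "$demo"
git init -q

# Identify yourself for this repository only; name and e-mail are placeholders
git config user.name "Your Name"
git config user.email "you@example.com"

# First commit: stage a new file and log it with a message
echo "x <- rnorm(100)" > analysis.R
git add analysis.R
git commit -q -m "Add first analysis script"

# Second commit: change the file and log the change
echo "mean(x)" >> analysis.R
git commit -q -am "Compute the sample mean"

# Inspect the log: abbreviated hash, author, date and message of every commit
git log --format='%h | %an | %ad | %s'

# Show the diff between the previous commit and the current one
git diff HEAD~1 HEAD
```

The last command prints the line that was added in the second commit, prefixed with a `+`.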
If you’d like to know much more about Git, this online book is a very good resource. If you’d like to practice with the command line interface, use this webpage for a quick course. This book covers pretty much everything you need to marry Git and R.
GitHub
GitHub
is the social and user interface to
Git
that allows you to work in repositories.
These repositories can be seen as project folders in which you publish
your work, but you can also use them as test sites for development,
testing, etcetera. There is a distinction between private
repositories (only for you and those you grant access) and public
repositories (visible for everyone).
Your public repositories can be viewed and forked by everyone. Forking is when other people create a copy of your repository on their own account. This allows them to work on a repository without affecting the master. You can achieve something similar within your own repository, but then the process is called branching instead of forking. If you create an offline copy of a repository on your own machine, the process is called cloning.
GitHub’s ability to branch, fork and clone is very useful as it allows other people and yourself to experiment on (the code in) a repository before any definitive changes are merged with the master. If you’re working in a forked repository, you can submit a pull request to the repository collaborators to accept (or reject) any suggested changes.
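Branching, merging and cloning can be tried out locally before touching GitHub. The sketch below assumes Git is installed and uses a temporary directory as a stand-in for a GitHub repository. The directory names, the branch name `experiment` and the identity are placeholders for illustration. Forking and pull requests happen on GitHub itself and are not shown.

```shell
# Throwaway "original" repository (hypothetical; stands in for a GitHub repository)
origin_repo="$(mktemp -d)"
cd "$origin_repo"
git init -q
git config user.name "Your Name"          # placeholder identity
git config user.email "you@example.com"
echo "# My project" > README.md
git add README.md
git commit -q -m "Initial commit"

# Name of the default branch: 'master' or 'main', depending on your Git setup
main_branch="$(git symbolic-ref --short HEAD)"

# Branching: experiment on a separate line of development
git checkout -q -b experiment
echo "An experimental idea" >> README.md
git commit -q -am "Try an experimental idea"

# Merging: bring the experiment back into the main line once you are happy with it
git checkout -q "$main_branch"
git merge -q experiment

# Cloning: create a full local copy of a repository, including its history
clone_dir="$(mktemp -d)/my-clone"
git clone -q "$origin_repo" "$clone_dir"
git -C "$clone_dir" log --oneline
```

With a real project you would pass the URL of your GitHub repository (or your fork) to `git clone` instead of a local path.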
For now, this may be confusing, but I hope you recognize the benefits GitHub can have on the process of development and bug-fixing. For example, the most up-to-date version of the mice package in R can be directly installed from the mice repository with the following code:
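# Install devtools from CRAN (only needed once)
install.packages("devtools")
# Install the development version of mice from its GitHub repository
devtools::install_github(repo = "stefvanbuuren/mice")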
You can see that this process requires the devtools package, which extends R with essential development tools. Installing packages in R directly from their respective GitHub repositories allows you to obtain the latest - often improved and less buggy - iteration of that software even before it is published on CRAN.
Git
I suggest you install Git by downloading and installing GitHub Desktop. GitHub’s desktop application is a nice GUI and, naturally, integrates well into the repository workflow on GitHub.
When installed, you can go to GitHub Desktop > Install Command Line Tool. After a reboot, all should be set.
On Windows, download and install Git for Windows, then download and install GitHub Desktop. After a reboot, all should be set.
Ultimately, you’ll want to learn how to use Git through the command line interface (CLI). It offers better and more comprehensive functionality. Again, take this 15-minute course to get a gentle introduction. But do not worry about missing out on the CLI if you skip this course: in week 4 we’ll explore in detail how to handle Git when things go haywire. You’ll be a CLI-wizard by then.
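A first step on the command line could be to check that the installation worked. This is a minimal sketch, assuming a terminal in which Git should be available after the installation above.

```shell
# Confirm that the git executable can be found on your PATH
command -v git

# Print the installed version
git --version
```

If the first command prints nothing, the terminal cannot find Git, and a reboot or reinstall may be needed.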
Git sees every file in your repository as one of the following three: tracked, untracked, or ignored.
It may be wise to instruct Git to ignore changes in some files. For example, compiled files (think about .com, .exe, .o, .so, etc.), archives (e.g. .zip, .tar, .rar), logs (.log) and files generated at runtime (.temp) do not have to be tracked by Git. The same holds for hidden system files (e.g. .DS_Store or Thumbs.db).
Adding such filetypes to a file named .gitignore and placing that file in the root of your repository will take care of focusing Git’s energy on useful files only. For common .gitignore examples, see this repository. There are many examples inside, such as this .gitignore example for R.
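As an illustration, a small .gitignore covering the file types mentioned above could be set up as follows. This is a sketch in a temporary repository, not a recommended template; the file names `run.log` and `analysis.R` are made up for the demonstration.

```shell
# Throwaway repository for the demonstration
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Write a small .gitignore in the root of the repository
cat > .gitignore <<'EOF'
# Compiled files
*.exe
*.o
*.so
# Archives
*.zip
*.tar
*.rar
# Logs and files generated at runtime
*.log
*.temp
# Hidden system files
.DS_Store
Thumbs.db
EOF

# Create one file that should be ignored and one that should be tracked
touch run.log analysis.R

# Ask Git which rule causes a file to be ignored
git check-ignore -v run.log

# Ignored files no longer show up as untracked
git status --short
```

Here `git status --short` lists `.gitignore` and `analysis.R` as untracked, while `run.log` is left out.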
GitHub and RStudio
Securely linking your local machine to the remote repository is vital when collaborating with other people. In short, you would not want a potential hacker to have contributor access to any of your projects. I have prepared this walkthrough video that details the process of linking GitHub to your machine and RStudio. Below I explain the rationale of using both an SSH key and a personal access token.
If you still experience problems after following my walkthrough, check this chapter.
To learn more about maintaining a package as a GitHub repository within RStudio, have a look at this guide by Hadley Wickham.
With an SSH key you can identify yourself to an online server (in this case the GitHub server) without having to log in every time. It is like your machine having access to an online server through a unique biometric security measure, but instead of biometric data, a cryptographic key pair is used: the public key is stored on the server, while the private key stays on your machine and proves your identity every time. You will need an SSH key to link RStudio to your GitHub repository.
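For those who prefer the command line, generating a key pair could look like the sketch below. It assumes the OpenSSH tool `ssh-keygen` is installed. The temporary directory, the key file name and the e-mail address are placeholders, chosen so that the demo never overwrites an existing key.

```shell
# Write the demo key pair to a temporary directory so existing keys are never overwritten.
# For real use you would typically accept the default location under ~/.ssh instead.
key_dir="$(mktemp -d)"
key_file="$key_dir/id_ed25519_demo"

# Generate an Ed25519 key pair; the e-mail address is a placeholder label, and
# -N "" sets an empty passphrase for this demo only (use a passphrase in practice)
ssh-keygen -q -t ed25519 -C "you@example.com" -f "$key_file" -N ""

# The private key stays on your machine ...
ls "$key_file"

# ... and the public key is the part you paste into your GitHub account settings
cat "$key_file.pub"

# Once the public key is registered, the connection could be checked with:
#   ssh -T git@github.com
```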
If you use GitHub’s 2FA functionality - you should! - your username and password are not sufficient to push commits to GitHub through RStudio. To solve this, follow these steps on github.com, as I detail in this walkthrough video:
1. Go to Settings.
2. Go to Developer settings.
3. Select Personal access tokens in the left sidebar.
4. Click Generate new token.
5. Select the repo scope; you’ll need these permissions to access repositories.
6. Copy the token. The token will not be displayed again, so make a note of it, or save it somewhere.
In RStudio, paste the generated token in the password field when RStudio asks for your credentials. The token will now serve as the unique authenticated link instead of your password.
Git exercise
- Yourname/Assignment 1 folder.
- GitHub fork
- upstream/master branch (i.e. gerkovink/markup2020). I have made another walkthrough video for you that details this step.
“A replication of the procedure that generates a 95% confidence interval that is centered around the sample mean would cover the population value at least 95 out of 100 times” (Neyman, 1934)
- Yourname/Assignment 1 folder
- GitHub fork
- upstream/master branch (i.e. gerkovink/markup2020)
End of exercises