As a statistician, data scientist, or developer, you’ll need to master many skills. Some of these skills are made simpler by tools. This week we’ll bite the bullet by learning about two massively important tools for your toolset: **version control** and simulation.
To document our activity and our changes in detail, we’ll use Git. You can view Git as the ability to go back in time, all the way back to the very beginning of your project. A bonus: Git integrates nicely with RStudio. In this exercise we will learn to use Git within our projects.
In the meantime, we’ll also use Monte Carlo simulation to study one of the most important concepts in statistics: the confidence interval.
Use the appropriate channels to ask questions and hand in your work.
All the best,
Git is a free and open source version control system for text files. It can handle extensive change logging for you, no matter the size of the project.
Git is fast and efficient, but its effectiveness also depends on how often you instruct it to log your project’s changes.
You can see Git as a blank canvas that starts at a certain point in time. Every time you (or others) instruct Git to log the changes that have been made, Git adds those changes to this canvas. We call the changes to the canvas commits. With every commit an extensive log is created that includes at least the following information:
The differences between two commits, that is, the changes between them, are called diffs.
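To make commits and the differences between them concrete, here is a minimal command-line sketch. It assumes Git is installed; the temporary directory, the file name `analysis.R`, the identity, and the commit messages are all hypothetical and only serve as illustration.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so nothing else on your machine is touched.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Hypothetical identity, set for this repository only.
git config user.name "Example Student"
git config user.email "student@example.com"

# First commit: the starting point of the canvas.
echo "x <- rnorm(100)" > analysis.R
git add analysis.R
git commit --quiet -m "Add first version of analysis script"

# Second commit: a change on top of the first.
echo "mean(x)" >> analysis.R
git commit --quiet -am "Compute the sample mean"

# The log lists every commit with its hash, author, date and message.
git log

# The diff shows which lines changed between the previous and the latest commit.
git diff HEAD~1 HEAD

# Count the commits on the current branch.
commit_count="$(git rev-list --count HEAD)"
echo "Number of commits: $commit_count"
```

Committing small, coherent changes often keeps each diff short and makes it easier to step back to a specific point in the project's history.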
If you’d like to know much more about Git, this online book is a very good resource. If you’d like to practice with the command line interface, use this webpage for a quick course. This book covers pretty much everything you need to marry
GitHub is the social and user interface to Git that allows you to work in repositories. These repositories can be seen as project folders in which you publish your work, but you can also use them as test sites for development and experimentation. There is a distinction between private repositories (only for you and those you grant access) and public repositories (visible to everyone).
Your public repositories can be viewed and forked by everyone. Forking is when other people create a copy of your repository on their own account. This allows them to work on a repository without affecting the master. You can also do this yourself within your own repository, but then the process is called branching instead of forking. If you create an offline copy of a repository, the process is called cloning.
GitHub’s ability to branch, fork and clone is very useful, as it allows other people and yourself to experiment on (the code in) a repository before any definitive changes are merged with the master. If you’re working in a forked repository, you can submit a pull request to the repository collaborators, who can accept (or reject) any suggested changes.
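The cloning, branching and merging steps described above can be sketched locally on the command line. This is only an illustration under assumptions: Git is installed, and the directory names `original` and `copy`, the branch name `experiment`, and the identity are hypothetical. Forking and pull requests happen on GitHub itself, so they are not shown here.

```shell
#!/bin/sh
set -eu

# Throwaway working area; all names below are hypothetical.
work="$(mktemp -d)"
cd "$work"

# A small "original" repository with one commit.
git init --quiet original
cd original
git config user.name "Example Student"
git config user.email "student@example.com"
echo "# My project" > README.md
git add README.md
git commit --quiet -m "Initial commit"

# The default branch is called "master" or "main", depending on your Git setup.
main_branch="$(git symbolic-ref --short HEAD)"

# Cloning: make a full local copy of the repository.
cd "$work"
git clone --quiet original copy
cd copy
git config user.name "Example Student"
git config user.email "student@example.com"

# Branching: experiment without touching the main line of development.
git checkout --quiet -b experiment
echo "An experimental idea." >> README.md
git commit --quiet -am "Try out an idea"

# Merging: once the experiment works, bring it into the main branch.
git checkout --quiet "$main_branch"
git merge --quiet experiment

# The last line of the README now comes from the experiment branch.
merged_line="$(tail -n 1 README.md)"
echo "$merged_line"
```

Until the merge, the main branch of `copy` is unaffected by the experiment, which is what makes branches a safe place to try things out.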
For now, this may be confusing, but I hope you recognize the benefits GitHub can have for the process of development and bug-fixing. For example, the most up-to-date version of the mice package in R can be installed directly from the mice repository with the following code:
install.packages("devtools")
devtools::install_github(repo = "stefvanbuuren/mice")
You can see that this process requires the devtools package, which extends R with essential development tools. Loading packages in R directly from their respective GitHub repositories allows you to obtain the latest (often improved and less buggy) iteration of that software even before it is published on CRAN.
I suggest you install Git by downloading and installing GitHub Desktop. GitHub’s desktop application is a nice GUI and, naturally, integrates well into the repository workflow on GitHub.
When installed, you can go to GitHub Desktop > Install Command Line Tool. After a reboot, all should be set.
Ultimately, you’ll want to learn how to use Git through the command line interface (CLI). It offers better and more comprehensive functionality. Again, take this 15-minute course to get a gentle introduction. But don’t worry about missing out on the CLI if you don’t study this link: in week 4 we’ll explore in detail how to handle Git when things go haywire. You’ll be a CLI wizard by then.
GitHub sees every file in your repository as one of the following three
It may be wise to instruct Git to ignore changes in some files. For example, compiled files (think of .so, etc.), archives (e.g. .rar), logs (.log) and files generated at runtime (.temp) do not have to be tracked by Git. The same holds for hidden system files (e.g. Thumbs.db). Adding such file types to a file named .gitignore and placing that file in the root of your repository will focus Git’s energy on useful files only. For common .gitignore examples, see this repository. There are many examples inside, such as this .gitignore example for
gerkovink/markup2020). I have made another walkthrough video for you that details this step.
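As an illustration of the ignore rules discussed earlier, the sketch below writes a small .gitignore and asks Git whether two files would be ignored. It assumes Git is installed; the patterns mirror the file types mentioned in the text, while the temporary directory and the file names `run.log` and `analysis.R` are hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway repository for the demonstration.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# A small .gitignore in the root of the repository.
cat > .gitignore <<'EOF'
# compiled files
*.so
# archives
*.rar
# logs
*.log
# files generated at runtime
*.temp
# hidden system files
Thumbs.db
EOF

# One file that matches a pattern and one that does not.
touch run.log analysis.R

# git check-ignore exits with status 0 for an ignored path and 1 otherwise.
if git check-ignore --quiet run.log; then
  log_status="ignored"
else
  log_status="not ignored"
fi

if git check-ignore --quiet analysis.R; then
  script_status="ignored"
else
  script_status="not ignored"
fi

echo "run.log: $log_status"
echo "analysis.R: $script_status"
```

Running `git status` in such a repository lists only the files that are not ignored, which keeps the overview of pending changes short.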
“A replication of the procedure that generates a 95% confidence interval that is centered around the sample mean would cover the population value at least 95 out of 100 times” (Neyman, 1934)
End of exercises