Reproducible Research

Gerko Vink

Methodology & Statistics @ UU

Hanne Oberman

Methodology & Statistics @ UU

6 Nov 2024

Overview

Today’s class

  • Reproducibility
  • Research compendiums

Why bother?

We would like our results to be as fully reproducible as possible:

A. Reproducibility is one of the pillars of science

  • It is the standard by which to judge scientific claims
  • It helps the cumulative growth of knowledge - no duplication of effort

B. Reproducibility may greatly benefit you

  • You’ll develop better work habits
  • Better teamwork - especially with new team members
  • Changing or amending your work is much easier
  • Higher research impact - more likely to be picked up and cited

By the end of this class…

  • … you will have heard what reproducible research is and why it is important
  • … you have applied this knowledge by critiquing the reproducibility of case study
  • … you will be able to produce a reproducible research archive from scratch

Reproducibility

Crisis?

Definition

A result is reproducible when the same analysis steps performed on the same dataset consistently produces the same answer.

Definition

Research results are replicable if there is sufficient information available for independent researchers to make the same findings using the same procedures.

True or false?

In computational sciences - such as ours - simply having the data and code means that the results are not only replicable, but fully reproducible.

Reproducibility of R scripts

Reproducible research is not the norm:

74% of R files failed to complete without error

Reproducibility spectrum

Case study

Case study

Our study is completely reproducible using the R code provided in online supplemental file 1, which uses freely available data.






# ====================================================================
# R CODE
# small scale simulation study to investigate impact of measurement error
# measurement error on (continuous) exposure and/or (continuous) confounding variable
# ====================================================================
# ====================================================================
# libraries:
library(Hmisc)
library(mice)
library(tidyverse)

Making research reproducible

  • reproducible documents
  • research compendiums

Research Compendiums

Research compendium

DOI

Definition

A research compendium is a collection of all digital parts of a research project including data, code, texts…

The collection is created in such a way that reproducing all results is straightforward1


The compendium serves as a means for distributing, managing, and updating the collection2

Basic compendium

A basic research compendium is just a folder…

compendium/
├── data
│   └── my_data.csv
├── analysis
│   └── my_script.R
├── requirements.txt
└── README.md

(Not so) basic compendium

… but it can become extensive…

|
├── paper/
│   ├── paper.qmd       
│   └── references.bib  
| 
├── figures/            
|
├── data/
│   ├── raw_data/       
│   └── clean_data/   
|
└── templates
    └── journal_template.csl     

(Not so) basic compendium

…or even executable!

|
├── _targets.R
├── R/
│   ├── functions_data.R
│   ├── functions_analysis.R
│   ├── functions_visualization.R
└── data/
    └── input_data.csv

(Not so) basic compendium

Guidelines

  • Completeness
  • Organization
  • Economy
  • Transparency
  • Documentation
  • Access
  • Provenance
  • Metadata
  • Automation
  • Review

In practice

Research Data Management Support workshop:

Writing Reproducible Manuscripts in R and Python

How to

  • Think about a good folder structure
    • Split up ‘read-only’, ‘human-generated’, and ‘project-generated’ files
  • Create folder structure (main directory and sub directories)
    • Add a landing page in the form of a README document
    • Make the compendium executable (to automatically generate the results; optional)
    • Make the compendium into a git repository (optional)
  • Add all files needed for reproducing the results of the project
    • Avoid ‘hard coded’ parameters or human intervention in the execution
  • Make the compendium as clean and easy to use as possible
    • Include a citation file and a LICENSE file with info on how it can be used
  • Publish your compendium
    • E.g. on Zenodo (optional, more on this in the last course week)

Wrap-up

‘Reproducible’ research

Boulesteix, Groenwold, Abrahamowicz, et al. (2020) with the pdf supplement

Reproducible but not replicable

set.seed(1)
run_simulation()
set.seed(2)
run_simulation()
set.seed(3)

Take aways

  • reproducibility
  • research compendiums
  • building a compendium

Next meeting

November 22nd: Developer Portfolio + hand in the exercise in this week’s lab

References and further reading

Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. https://www.nature.com/articles/533452a

Boulesteix, A.-L., Groenwold, R. H., Abrahamowicz, M., Binder, H., Briel, M., Hornung, R., Morris, T. P., Rahnenführer, J., & Sauerbrei, W. (2020). Introduction to statistical simulations in health research. BMJ Open, 10(12), e039921. https://doi.org/10.1136/bmjopen-2020-039921

Bryan, J., & TAs, T. S. 545. (n.d.-a). Chapter 33 Why and how we automate data analyses + examples | STAT 545. Retrieved October 30, 2023, from https://stat545.com/

Checklist. (n.d.). Retrieved October 31, 2023, from https://guide.esciencecenter.nl/#/best_practices/checklist

Checklist for a Software Management Plan. (n.d.). https://doi.org/10.5281/zenodo.2159713

Comment on Oberman & Vink: Should we fix or simulate the complete data in simulation studies evaluating missing data methods? - Morris—Biometrical Journal—Wiley Online Library. (n.d.). Retrieved October 30, 2023, from https://onlinelibrary.wiley.com/doi/full/10.1002/bimj.202300085

Drost, N., Spaaks, J. H., Andela, B., Veen, L., Zwaan, J. M., Verhoeven, S., Bos, P., Kuzak, M., Werkhoven, B., Attema, J., Hidding, J., Hees, V., Martinez-Ortiz, C., Spreeuw, H., Borgdorff, J., Leinweber, K., Diblen, F., Oord, G., Goncalves, R., … Bakker, T. (2020). Netherlands eScience Center—Software Development Guide (v0.9.1). Zenodo. https://doi.org/10.5281/ZENODO.4020564

References and further reading

Gentleman, R., & Temple Lang, D. (2007). Statistical Analyses and Reproducible Research. Journal of Computational and Graphical Statistics, 16(1), 1–23. https://doi.org/10.1198/106186007X178663

Jiménez, R. C., Kuzak, M., Alhamdoosh, M., Barker, M., Batut, B., Borg, M., Capella-Gutierrez, S., Hong, N. C., Cook, M., Corpas, M., Flannery, M., Garcia, L., Gelpí, J. L., Gladman, S., Goble, C., Ferreiro, M. G., Gonzalez-Beltran, A., Griffin, P. C., Grüning, B., … Crouch, S. (2017). Four simple recommendations to encourage best practices in research software (6:876). F1000Research. https://doi.org/10.12688/f1000research.11407.1

Knuth, D. E. (1984). Literate Programming. The Computer Journal, 27(2), 97–111. https://doi.org/10.1093/comjnl/27.2.97

Marwick, B., Boettiger, C., & Mullen, L. (2018). Packaging Data Analytical Work Reproducibly Using R (and Friends). The American Statistician, 72(1), 80–88. https://doi.org/10.1080/00031305.2017.1375986

Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074–2102. https://doi.org/10.1002/sim.8086

NHANES Questionnaires, Datasets, and Related Documentation. (n.d.). Retrieved October 30, 2023, from https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?Begi-nYear=2015

References and further reading

Nüst, D., Ostermann, F., Sileryte, R., Hofer, B., Granell, C., Teperek, M., Graser, A., Broman, K., Hettne, K., & Clare, C. (2019). AGILE Reproducible Paper Guidelines. https://doi.org/10.17605/OSF.IO/CB7Z8

Peng, R. D. (2011). Reproducible Research in Computational Science. Science, 334(6060), 1226–1227. https://doi.org/10.1126/science.1213847

Reliability and reproducibility checklist for molecular dynamics simulations. (2023). Communications Biology, 6(1), Article 1. https://doi.org/10.1038/s42003-023-04653-0

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten Simple Rules for Reproducible Computational Research. PLOS Computational Biology, 9(10), e1003285. https://doi.org/10.1371/journal.pcbi.1003285

Sayre, F., & Riegelman, A. (2019). Replicable Services for Reproducible Research: A Model for Academic Libraries | Sayre | College & Research Libraries. https://doi.org/10.5860/crl.80.2.260

Table 1 Reliability and reproducibility checklist for molecular dynamics simulations. (n.d.). Retrieved October 30, 2023, from https://www.nature.com/articles/s42003-023-04653-0/tables/1

Telford, R. J. (2023, September 6). Enough Markdown to Write a Thesis. https://biostats-r.github.io/biostats/quarto/

References and further reading

The Turing Way Community (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research (1.0.2) [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.3233853

The Turing Way Community & Scriberia. (2023). Illustrations from The Turing Way: Shared under CC-BY 4.0 for reuse. Zenodo. https://doi.org/10.5281/ZENODO.3332807

TIER Protocol 4.0 | Project TIER | Teaching Integrity in Empirical Research. (n.d.). Retrieved October 30, 2023, from https://www.projecttier.org/tier-protocol/protocol-4-0/

Trisovic, A., Lau, M. K., Pasquier, T., & Crosas, M. (2022). A large-scale study on research code quality and execution. Scientific Data, 9(1), Article 1. https://doi.org/10.1038/s41597-022-01143-6

Utrecht University (2023a, September 26). Best Practices for Writing Reproducible Code. https://utrechtuniversity.github.io/workshop-computational-reproducibility/

Utrecht University (2023b, October 24). Writing Reproducible Manuscripts in R & Python. https://utrechtuniversity.github.io/workshop-reproducible-manuscripts/