Essentials for Reproducibility

#reproducibility #datascience #research #data

Reproducibility of results is imperative for a sound research project. However, in computational research, especially in academic environments, we often overlook reproducibility and do not put enough effort into establishing the essential environments and workflows early in the project lifecycle. Readers who want to reproduce the results are often expected to figure things out themselves from the minimal information in the methods section. In this article, I discuss some of the practices that I have adopted over the past two years to make my work more reproducible.

Version Control: Most important of all is to keep your code base under a version control system (VCS) such as Git. For hosting, I prefer GitHub; however, other popular services like GitLab and Bitbucket are equally good. Posting your code on GitHub not only serves the purpose of versioning, it also acts as a backup. I have lost too much of my code in the past for various reasons, so now I keep all of it on GitHub. If something is embarrassingly trivial, you can keep it in a private repository. At least it will all be there in case you need to refer to it in the future. With all the code related to a project on GitHub, sharing it with collaborators or readers is as simple as pointing them to the repository. If you happen to write a package or a module, many programming languages can install it directly from its GitHub repository without the need to post it to a language-specific archive network.
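
One way to tie a result back to the exact code that produced it is to record the current commit hash alongside the output. The following is a minimal sketch, assuming Git is installed and the script runs inside a repository; the function name `current_commit` is my own and purely illustrative.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the Git commit hash of repo_dir, or "unknown" if unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, the directory is missing,
        # or repo_dir is not inside a repository.
        return "unknown"
    return result.stdout.strip()


if __name__ == "__main__":
    print(f"Results produced from commit: {current_commit()}")
```

Writing this hash into a results file or a figure caption lets a reader check out the same revision later.
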
Linux: Most academic projects run for two to four years before you can publish an excellent paper. Beyond the project's runtime, it is desirable for our results to remain reproducible even after a decade. That is a very long time for most operating systems to stay on the same version. In the case of Windows and macOS, it might not even be possible to get a legal copy of an older version. In contrast, at least in theory, you can always get an older version of a popular Linux distribution, install it in a virtual machine, and run code written against old libraries.
Containers: Containers serve as a lighter alternative to a virtual machine and can encapsulate an operating system along with all the installed software and packages. A container image can be stored in a file and transferred easily from one computer to another. Docker is one of the most widely used container systems. Docker images can be built from scratch or downloaded pre-built from their official vendors. If you are into scientific computing and frequently work in high-performance computing (HPC) environments, then Singularity containers should be your choice. Singularity is now available by default on most academic HPC systems, for example, all those provided by XSEDE in the USA. Once a non-writable container is built, its contents are frozen in time, and it can be executed with its engine at any time in the future.
Environments: Setting up a virtual environment for every project and fixing the library versions has multiple advantages. It provides an isolated environment to work in without tampering with the global settings, and each project can have its own selected versions of libraries. It saves you the hassle of resolving conflicts between multiple versions of libraries installed globally. However, in contrast to virtual machines and containers, a virtual environment can only control a specific set of libraries; it doesn't select an operating system. I find conda to be excellent for setting up environments. I use it for both Python and R.
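
To illustrate what "fixing the library versions" means in practice, here is a minimal sketch that lists every Python package installed in the active environment with its exact version, using only the standard library. The function name `pinned_requirements` is hypothetical; with conda, `conda env export` serves the same purpose and also covers non-Python packages.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    # Redirect this output to a file and commit it with the project.
    print("\n".join(pinned_requirements()))
```

Committing such a list next to the code gives a future reader a concrete record of the versions the results depended on.
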
Documentation: Never forget to document code both inside the scripts and outside in a lab book or a readme file. Often, in a hurry to produce results and get feedback from our supervisor, documentation is the first thing that we forget to do. It also seems least necessary while the project is running because we usually have all the facts at the top of our heads. However, as time passes, there is only so much that the human brain can remember and recall when needed. It would not be surprising if you forgot the very details of the project that you were so confident about and could recite in your sleep while you were actively working on it. Markdown readme files, Dropbox Paper, and GitHub wiki are some of my preferred ways to document my projects.
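
Documentation is easier to keep up when it costs one line of code. As a rough sketch of a Markdown lab book, the helper below appends a timestamped note together with the Python version and platform it ran on. The function name `log_entry` and the default file name `LABBOOK.md` are assumptions made for illustration, not an established convention.

```python
import platform
from datetime import datetime, timezone
from pathlib import Path


def log_entry(note, logbook="LABBOOK.md"):
    """Append a timestamped note to a Markdown lab book and return the entry."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = (
        f"\n## {stamp}\n\n"
        f"- Python {platform.python_version()} on {platform.platform()}\n"
        f"- {note}\n"
    )
    # Open in append mode so earlier entries are never overwritten.
    with Path(logbook).open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry
```

A call such as `log_entry("Reran the analysis with the updated input table")` at the end of a script leaves a dated trail that can be read years later.
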