As data science continues to gain momentum as a field, managing and versioning data and code has become increasingly important. Git, a powerful version control system, is a popular tool among software developers for managing source code changes. However, Git is not just limited to software development and can also be used effectively for managing data science projects.
In this article, we will explore how Git can be leveraged by data scientists to efficiently manage and version data, track changes, collaborate with team members, and reproduce experiments. Whether you are new to Git or an experienced user, this article aims to provide a comprehensive guide on using Git for data science projects.
Git is a distributed version control system used for tracking changes in source code during software development. It allows multiple people to collaborate on the same project by tracking changes to code. Git does this by taking snapshots of the files at various points in time, creating a complete history of changes made to those files. Each snapshot is called a "commit" and contains a reference to the previous commit, forming a "commit chain" or a "commit history".
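To make the idea of a commit chain concrete, here is a minimal sketch in Python. It is a toy model rather than Git's actual object format: the `Commit` class, the `history` helper, and the file names are all hypothetical. It only illustrates that each commit stores a snapshot plus the id of its parent, and that following the parent links reconstructs the history.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """A toy commit: a snapshot of file contents plus a link to its parent."""
    message: str
    snapshot: dict          # maps file name -> file content at commit time
    parent: Optional[str]   # id of the previous commit, None for the first one

    @property
    def commit_id(self) -> str:
        # Hash the message, snapshot, and parent id together, so changing
        # any of them produces a different id.
        payload = json.dumps(
            {"message": self.message, "snapshot": self.snapshot, "parent": self.parent},
            sort_keys=True,
        )
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def history(store: dict, head: str) -> list:
    """Walk parent links from `head` back to the first commit."""
    messages = []
    current = head
    while current is not None:
        commit = store[current]
        messages.append(commit.message)
        current = commit.parent
    return messages


store = {}
first = Commit("Add raw data loader", {"load.py": "v1"}, parent=None)
store[first.commit_id] = first
second = Commit("Handle missing values", {"load.py": "v2"}, parent=first.commit_id)
store[second.commit_id] = second

print(history(store, second.commit_id))
# ['Handle missing values', 'Add raw data loader']
```

Because the parent id is part of what gets hashed, editing an old commit would change the id of every commit that follows it.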
Git uses a distributed model, which means that each user has a local copy of the entire repository, including the commit history. This allows users to work offline and makes collaboration easier. When users are ready to share their changes, they can push their commits to a remote repository, from which other users can then pull to incorporate those changes into their local copies.
Git also offers tools for merging changes made by different people and reverting to earlier versions if necessary. It also provides tools for branching, enabling developers to work on different parts of a project simultaneously without disrupting each other's work.
Git is a command-line tool that allows developers to track source code history over time while also allowing them to collaborate on the same project with minimal conflict.
GitHub is a web platform built around Git that hosts remote repositories of Git projects. It also offers features such as bug tracking, project management, and automation. Alternatives to GitHub include GitLab and Bitbucket, among others.
Terminologies & Commands
- Repository: A repository is a central location where Git stores all the files and folders of a project, along with their revision history.
# create a new repository on your local computer
git init
- Commit: A commit is a snapshot of a repository at a specific point in time. It represents a set of changes that have been made to the repository. You must first stage the edited files using the git add command. This marks the files to go into the commit.
# stage all edited files
git add .
# stage a specific file
git add <file_name.ext>
# record the staged changes as a commit
git commit -m "commit message goes here"
- Branch: A branch is a separate version of the repository that allows developers to work on different features or fixes simultaneously without interfering with each other's work.
# create a branch, then check it out
git branch <branch_name>
git checkout <branch_name>
# create and check out a new branch in one step
git checkout -b <branch_name>
# list all branches in the repository
git branch
- Push: Push is the process of sending changes from a local repository to a remote repository, such as on GitHub.
# origin is the conventional name of the default remote; after git clone, it points to the repository you cloned from
git push origin <branch_name>
- Pull: Pull is the process of fetching and merging changes from a remote repository into a local repository.
git pull origin <branch_name>
- Merge: A merge is the process of combining changes from one branch into another branch.
git merge <feature_branch_name>
- Pull Request: A pull request is a request made by a developer to merge their changes from a branch into the main branch of the repository.
- Fork: A fork is a copy of a repository that allows a developer to make changes to the code without affecting the original repository.
- Clone: A clone is a local copy of a remote repository that a developer can work on without affecting the original repository.
git clone <link to remote repository>
- HEAD: A pointer to the commit that is currently checked out in your local repository, usually the latest commit on the current branch.
Whether you are working on a private or public repository, never commit secrets. These include usernames, passwords, API keys, TLS certificates, and other sensitive information. Keep in mind that private repositories can be accessed and cloned by multiple accounts and may be made public at some point.
To protect such sensitive information, make use of a .env file. This file's purpose is to hold environment variables. The .env file is in turn kept safe by including it in the .gitignore file.
To make collaboration easy, you should also create a .env.example file. This file tells other collaborators which environment variables the system expects. From this file, they can create a .env file with their own usernames, passwords, and secret keys.
# .env file:
API_KEY=97467282TTa89sdaf7659025f7sda22245

# .env.example file:
API_KEY=your_key

# .gitignore file:
.env

# app.py
import os
from dotenv import load_dotenv

load_dotenv()  # read the variables in .env into the process environment
api_key = os.getenv('API_KEY')
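A collaborator who forgets to fill in a variable usually finds out only when something fails at runtime. The sketch below shows one possible way to check the environment against the template file at startup. The helper names `parse_env_keys` and `missing_variables`, and the `DB_PASSWORD` variable, are hypothetical and used for illustration only. It uses just the standard library and handles only simple `KEY=value` lines.

```python
import os


def parse_env_keys(text: str) -> list:
    """Return the variable names declared in KEY=value lines."""
    keys = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        if "=" in line:
            keys.append(line.split("=", 1)[0].strip())
    return keys


def missing_variables(template_text: str, environ=None) -> list:
    """List template variables that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [key for key in parse_env_keys(template_text) if not environ.get(key)]


# Hypothetical template contents; in a project this would be read from the file.
template = "# template\nAPI_KEY=your_key\nDB_PASSWORD=your_password\n"
print(missing_variables(template, environ={"API_KEY": "abc123"}))
# ['DB_PASSWORD']
```

Calling `missing_variables` right after `load_dotenv()` and stopping with a clear message when the list is not empty tends to be friendlier than an authentication error later on.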
If you happen to commit a secret, you cannot fix it by simply deleting it in a later commit. Because Git is designed to maintain a persistent history of the code, removing the secret requires rewriting history. This can prove difficult when other people already have the secret in their local repositories. The simplest solution is to change the passwords and revoke the exposed secret keys.
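Since cleaning up afterwards is painful, a lightweight check before committing can help. The following is a rough sketch, not a replacement for a dedicated secret scanner: the pattern, the 16-character minimum, and the `find_possible_secrets` name are assumptions chosen for illustration, and the check will miss many kinds of secrets and may flag harmless lines.

```python
import re

# Hypothetical pattern: a credential-like variable name assigned a long literal.
# The 16-character minimum is an arbitrary choice so that short placeholder
# values such as "your_key" are not flagged.
SECRET_PATTERN = re.compile(
    r"(api[_-]?key|secret|password|token)\w*\s*[:=]\s*['\"]?([A-Za-z0-9_\-]{16,})",
    re.IGNORECASE,
)


def find_possible_secrets(text: str) -> list:
    """Return (line_number, variable_name) pairs for lines that look like secrets."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = SECRET_PATTERN.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


sample = (
    "API_KEY=97467282TTa89sdaf7659025f7sda22245\n"
    "api_key = os.getenv('API_KEY')\n"
)
print(find_possible_secrets(sample))
# [(1, 'API_KEY')]
```

The first line is flagged because it assigns a long literal value, while the second is not, since it only reads the value from the environment.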
The main purpose of Git is to track changes in text files, not large binary files such as datasets. If you work with very large datasets, you can accidentally commit them if you are not careful. There are several approaches you can take:
a) If your dataset does not change, you can upload it to a server and access it via its URL.
b) Use a .gitignore file. Add your dataset files or folders to the .gitignore file to avoid accidentally staging and committing them.
# ignore archives
*.zip
*.tar
*.tar.gz
*.rar

# ignore dataset folder and subfolders
datasets/
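If you are unsure which files belong in the ignore list, one option is to scan the project folder for unusually large files. The sketch below assumes a hypothetical helper named `find_large_files` and an arbitrary size threshold. It does not read the ignore file itself, it only reports candidates.

```python
import os
import tempfile


def find_large_files(root: str, max_bytes: int) -> list:
    """Return paths (relative to root) of files larger than max_bytes."""
    large = []
    for folder, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own metadata folder.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            path = os.path.join(folder, name)
            if os.path.getsize(path) > max_bytes:
                large.append(os.path.relpath(path, root))
    return sorted(large)


# Small demonstration in a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "datasets"))
    with open(os.path.join(root, "datasets", "train.csv"), "wb") as f:
        f.write(b"0" * 2048)
    with open(os.path.join(root, "train.py"), "wb") as f:
        f.write(b"print('hi')\n")
    # Only the 2048-byte file exceeds the 1024-byte threshold.
    print(find_large_files(root, max_bytes=1024))
```

In a real project the threshold would be far larger, for example a few megabytes, and the reported paths could be reviewed before adding them to the ignore file.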
Cell outputs in notebooks are a great feature. However, when using a version control system such as Git, a change to a code cell will most likely change its output. Keeping track of the changes made in output cells distracts from the more important changes in the code cells. This can prove tedious when multiple people are working on the same notebook.
You should, therefore, strip all outputs from a notebook before committing to Git by:
- Manually clearing all output cells from the main menu: Cell -> All Output -> Clear
- Setting up a pre-commit hook to clear outputs automatically.
- Using a .gitattributes file
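To show what stripping outputs amounts to, here is a minimal sketch that works directly on the notebook's JSON, assuming the version 4 notebook layout in which each code cell carries an `outputs` list and an `execution_count`. The function names `strip_outputs` and `strip_file` are hypothetical. Existing tools such as nbstripout cover this more thoroughly, so treat this as an illustration rather than a recommended replacement.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook (version 4 layout) with code outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def strip_file(path: str) -> None:
    """Rewrite the .ipynb file at `path` in place, without outputs."""
    with open(path, "r", encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(strip_outputs(notebook), f, indent=1)
        f.write("\n")


# A tiny made-up notebook with one markdown cell and one executed code cell.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["42\n"]}],
            "source": ["print(42)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}
print(strip_outputs(example)["cells"][1]["outputs"])
# []
```

A pre-commit hook could call `strip_file` on each staged notebook, so that only the code and markdown cells end up in the commit.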
At times, you may encounter an error when pushing to a remote that seems to call for the -f (--force) flag. There are situations that require using this flag. However, make it a habit to read the error message first, identify the origin of the error, and fix the underlying issue. If this proves challenging, ask for help.
Using --force habitually will prove detrimental in the long run.
As a general rule of thumb, a single commit should do one thing: fix one bug, not five; solve a single issue, not ten.
For example, a commit that fixes ten bugs will most likely have multiple changed files. Further, if the commit message is unclear like "Model now working", it becomes difficult for someone else to understand what happened in the commit. This provides zero value. The commit message "Fix special tokens not correctly tokenized" is short, but clear. You know what changed, and why.
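Some teams automate part of this advice with a small check on the commit message. The sketch below is one possible version: the list of vague phrases, the 72-character limit, and the `check_commit_message` name are assumptions made for illustration, not rules that Git enforces.

```python
# Hypothetical examples of messages that say little about the change.
VAGUE_MESSAGES = {"fix", "update", "changes", "wip", "model now working"}
MAX_SUBJECT_LENGTH = 72  # assumed limit for the first line of the message


def check_commit_message(message: str) -> list:
    """Return a list of problems found in a commit message (empty if none)."""
    problems = []
    lines = message.strip().splitlines()
    subject = lines[0].strip() if lines else ""
    if not subject:
        return ["empty subject line"]
    if subject.lower().rstrip(".!") in VAGUE_MESSAGES:
        problems.append("subject is too vague")
    if len(subject) > MAX_SUBJECT_LENGTH:
        problems.append(f"subject is longer than {MAX_SUBJECT_LENGTH} characters")
    if " and " in subject.lower():
        # A subject joining two actions often hides two separate changes.
        problems.append("subject may describe more than one change")
    return problems


print(check_commit_message("Model now working"))
# ['subject is too vague']
print(check_commit_message("Fix special tokens not correctly tokenized"))
# []
```

A check like this cannot judge whether a commit really does one thing, but it can catch the most obvious cases before they reach the shared history.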
Thankfully, you can fix your commit history if you haven't pushed to the remote yet. Learning to rewrite history can prove very useful in real-world projects.
If your project is constantly being worked on by many people or is in production, pull requests can prove very helpful. By default, a Git repository has a single branch, traditionally named master and named main on GitHub and in many newer setups. It is considered the central source of truth.
When you branch, you create a temporary offshoot from the main branch. You and other collaborators can work on different features simultaneously through branching. This allows you to work on new features or fix old ones without affecting the main branch.
When you are done working on your feature, you create a pull request to merge (include) the changes from your branch into the main branch. Pull requests are a GitHub concept rather than a part of Git itself, and they let other people review, comment on, suggest changes to, approve, or apply the changes in the pull request.
In this article, we've covered Git, how it works and the best practices when working with Git. To further help you in this journey, I have linked articles I found useful below:
I hope you found this post useful!