Rodney Kirui

Posted on Apr 3, 2023

Introduction to Data Version Control

#github #python

What is Data Version Control (DVC)?
Data Version Control (DVC) is an open-source tool that enables data scientists to track and manage changes to their data, models, and experiments. DVC is designed to work seamlessly with Git, the popular version control system used for software development.

Data version control is a critical aspect of any data science project. In traditional software development, version control is used to keep track of changes to source code. With the rise of data-driven applications, data has become a critical part of the development process, and version control is just as important for data as it is for code.
In standard software engineering, many people need to work on a shared codebase and handle multiple versions of the same code. This can quickly lead to confusion and costly mistakes.

To address this problem, developers use version control systems, such as Git, that help keep team members organized.

In a version control system, there’s a central repository of code that represents the current, official state of the project. A developer can make a copy of that project, make some changes, and request that their new version become the official one. Their code is then reviewed and tested before it’s deployed to production.

These quick feedback cycles can happen many times per day in traditional development projects. But similar conventions and standards are largely missing from commercial data science and machine learning. Data version control is a set of tools and processes that tries to adapt the version control process to the data world.

Having systems in place that allow people to work quickly and pick up where others have left off would increase the speed and quality of delivered results. It would enable people to manage data transparently, run experiments effectively, and collaborate with others.

What Is DVC?

DVC is a command-line tool written in Python. It mimics Git commands and workflows to ensure that users can quickly incorporate it into their regular Git practice. If you haven’t worked with Git before, then be sure to check out Introduction to Git and GitHub for Python Developers. If you’re familiar with Git but would like to take your skills to the next level, then check out Advanced Git Tips for Python Developers.

DVC is meant to be run alongside Git. In fact, the git and dvc commands will often be used in tandem, one after the other. While Git is used to store and version code, DVC does the same for data and model files.

Git can store code locally and also on a hosting service like GitHub, Bitbucket, or GitLab. Likewise, DVC uses a remote repository to store all your data and models. This is the single source of truth, and it can be shared amongst the whole team. You can get a local copy of the remote repository, modify the files, then upload your changes to share with team members.

The remote repository can be on the same computer you’re working on, or it can be in the cloud. DVC supports most major cloud providers, including AWS, GCP, and Azure. But you can set up a DVC remote repository on any server and connect it to your laptop. There are safeguards to keep members from corrupting or deleting the remote data.

When you start tracking your data and models with DVC, a .dvc file is created for them. A .dvc file is a small text file that records a hash of your actual data files, which DVC uses to locate them in its cache and in remote storage.

The .dvc file is lightweight and meant to be stored with your code in GitHub. When you download a Git repository, you also get the .dvc files. You can then use those files to get the data associated with that repository. Large data and model files go in your DVC remote storage, and small .dvc files that point to your data go in GitHub.
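To make the pointer idea concrete, here is a minimal Python sketch that hashes a large file and writes a tiny text file describing it. It is an illustration rather than DVC's own code: the helper names file_md5 and write_pointer and the file images.bin are hypothetical, and the text layout only loosely imitates what a real .dvc file contains.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small text file describing data_path (simplified, hypothetical layout)."""
    pointer_path = data_path.with_name(data_path.name + ".dvc")
    pointer_path.write_text(
        "outs:\n"
        f"- md5: {file_md5(data_path)}\n"
        f"  size: {data_path.stat().st_size}\n"
        f"  path: {data_path.name}\n"
    )
    return pointer_path


with tempfile.TemporaryDirectory() as workdir:
    data_file = Path(workdir) / "images.bin"
    data_file.write_bytes(b"\x00" * 5_000_000)  # stand-in for a large dataset
    pointer_file = write_pointer(data_file)
    data_size = data_file.stat().st_size
    pointer_size = pointer_file.stat().st_size
    pointer_text = pointer_file.read_text()

print(pointer_text)
print(f"data: {data_size} bytes, pointer: {pointer_size} bytes")
```

The pointer stays well under a kilobyte no matter how large the data grows, which is why it is cheap to keep in Git while the data itself lives elsewhere.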

The best way to understand DVC is to use it, so let’s dive in. You’ll explore the most important features by working through several examples. Before you start, you’ll need to set up an environment to work in and then get some data.

Set Up Your Working Environment

You’ll need to have Python and Git installed on your system. You can follow the Python 3 Installation and Setup Guide to install Python on your system. To install Git, you can read through Installing Git.

Since DVC is a command-line tool, you’ll need to be familiar with working in your operating system’s command line. If you’re a Windows user, have a look at Running DVC on Windows.

To prepare your workspace, you’ll take the following steps:

Create and activate a virtual environment.
Install DVC and its prerequisite Python libraries.
Fork and clone a GitHub repository with all the code.
Download a free dataset to use in the examples.

You can use any package and environment manager you want. This tutorial uses conda because it has great support for data science and machine learning tools. To create and activate a virtual environment, open your command-line interface of choice and type the following command:

$ conda create --name dvc python=3.8.2 -y

The create command creates a new virtual environment. The --name switch gives a name to that environment, which in this case is dvc. The python argument allows you to select the version of Python that you want installed inside the environment. Finally, the -y switch automatically agrees to install all the necessary packages that Python needs, without you having to respond to any prompts.

Once everything is installed, activate the environment:

$ conda activate dvc

You now have a Python environment that is separate from your operating system’s Python installation. This gives you a clean slate and prevents you from accidentally messing up something in your default version of Python.

You’ll also use some external libraries in this tutorial:

dvc is the star of this tutorial.
scikit-learn is a machine learning library that allows you to train models.
scikit-image is an image processing library that you’ll use to prepare data for training.
pandas is a library for data analysis that organizes data in table-like structures.
numpy is a numerical computing library that adds support for multidimensional data, like images.

Some of these are available only through conda-forge, so you’ll need to add it to your config and use conda install to install all the libraries:

$ conda config --add channels conda-forge
$ conda install dvc scikit-learn scikit-image pandas numpy

Alternatively, you can use the pip installer:

$ python -m pip install dvc scikit-learn scikit-image pandas numpy

Now you have all the necessary Python libraries to run the code.
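If you want to confirm that the installation worked, a short check like the one below can help. It is a sketch, not part of the tutorial repository: the function name missing_packages is hypothetical, and it only tests whether each module can be found, not whether its version is compatible.

```python
from importlib.util import find_spec

# Two libraries are imported under a different name than they are installed:
# scikit-learn is imported as sklearn and scikit-image as skimage.
REQUIRED = {
    "dvc": "dvc",
    "scikit-learn": "sklearn",
    "scikit-image": "skimage",
    "pandas": "pandas",
    "numpy": "numpy",
}


def missing_packages(required: dict) -> list:
    """Return the package names whose import module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All tutorial libraries are installed.")
```

Run it with the dvc environment activated, since a different interpreter would report on its own set of installed packages.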

This tutorial comes with a ready-to-go repository that contains the directory structure and code to quickly get you experimenting with DVC.
You need to fork the repository to your own GitHub account. On the repository’s GitHub page, click Fork in the top-right corner of the screen and select your personal account in the window that pops up. GitHub will create a forked copy of the repository under your account.

Clone the forked repository to your computer with the git clone command and position your command line inside the repository folder:

$ git clone https://github.com/YourUsername/data-version-control
$ cd data-version-control

Don’t forget to replace YourUsername in the above command with your actual username. You should now have a clone of the repository on your computer.

There are six folders in your repository:

src/ is for source code.
data/ is for all versions of the dataset.
data/raw/ is for data obtained from an external source.
data/prepared/ is for data modified internally.
model/ is for machine learning models.
data/metrics/ is for tracking the performance metrics of your models.

The src/ folder contains three Python files:

prepare.py contains code for preparing data for training.
train.py contains code for training a machine learning model.
evaluate.py contains code for evaluating the results of a machine learning model.
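
As a quick sanity check after cloning, the layout described above can be verified with a few lines of Python. This is a sketch under the assumption that you run it from the root of the cloned repository; the function name missing_paths is hypothetical. Keep in mind that Git does not track empty directories, so a folder can be absent from a fresh clone if it holds no files.

```python
from pathlib import Path

# Folder and file names taken from the lists above.
EXPECTED_DIRS = ["src", "data", "data/raw", "data/prepared", "model", "data/metrics"]
EXPECTED_FILES = ["src/prepare.py", "src/train.py", "src/evaluate.py"]


def missing_paths(repo_root: Path) -> list:
    """Return every expected folder or file that does not exist under repo_root."""
    missing = [d for d in EXPECTED_DIRS if not (repo_root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (repo_root / f).is_file()]
    return missing


problems = missing_paths(Path.cwd())
if problems:
    print("Not found:", ", ".join(problems))
else:
    print("Repository layout matches the tutorial.")
```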

The final step in the preparation is to get an example dataset you can use to practice DVC. Images are well suited for this particular tutorial because managing lots of large files is where DVC shines, so you’ll get a good look at DVC’s most powerful features.

How Data Version Control Works

At a high level, DVC works by creating a separate version control system for data and model files, while leveraging Git for code and experiment tracking. When a new data file is added to the project, DVC stores the file in its cache and generates a small metadata file (a .dvc file) that contains information about the data, such as its hash value and path.

When the data file changes and is added again, DVC updates the metadata file with the new information, including the new hash value. This metadata file is then committed to the Git repository, along with any code changes or experiment results.
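
The change detection described here rests on a simple property of content hashes: any edit to the data, however small, produces a different hash. The sketch below illustrates that with in-memory bytes. MD5 is used because .dvc files record an md5 value; the function name content_hash and the sample CSV rows are made up for the example.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the MD5 hex digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


original = b"id,label\n1,cat\n2,dog\n"
edited = b"id,label\n1,cat\n2,fox\n"  # a single value changed

recorded_hash = content_hash(original)  # what the metadata file would hold
current_hash = content_hash(edited)     # what the file on disk hashes to now

is_modified = recorded_hash != current_hash
print("recorded:", recorded_hash)
print("current: ", current_hash)
print("modified:", is_modified)
```

Comparing the recorded hash with the current one is enough to tell that the file changed, without keeping a second copy of the data around for comparison.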

Let's dive deeper into how DVC works step-by-step:

1. Initialize a DVC project: The first step is to run the dvc init command inside an existing Git repository. This creates a .dvc/ directory that holds the DVC configuration files and, by default, the local cache.

2. Add data to the project: Next, data files are added to the project using the dvc add command. When a data file is added, DVC stores its contents in the DVC cache directory and generates a small metadata file with a .dvc extension that contains information about the data, such as its hash value and path. The metadata file is placed next to the data file in your workspace, and the data file itself is added to .gitignore so that Git doesn't track it.

3. Track data changes: When a data file changes, the dvc status command reports the difference, and running dvc add again records the new hash value in the metadata file. The new contents are stored in the DVC cache directory under the new hash, so the earlier version isn't overwritten and can still be restored.

4. Commit changes to Git: Once the data changes are tracked by DVC, the changes are committed to the Git repository along with any code changes or experiment results. This ensures that all changes to data and code are tracked and versioned.

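5. Share data with others: To share data with others, push the Git repository (code plus metadata files) to a shared Git host and upload the data itself to the DVC remote with the dvc push command. Collaborators clone the Git repository and run dvc pull to download the matching data.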

By using separate metadata files for data and models, DVC can track changes to large files without actually storing the files in the Git repository. This allows data scientists to manage and share large files without overwhelming the Git repository or slowing down the development process.
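
The steps above can be sketched as a tiny content-addressed cache. This is a simplified illustration, not DVC's implementation: the functions add and checkout are hypothetical stand-ins for the commands of the same name, the file labels.csv is invented, and the cache layout (a two-character prefix folder followed by the rest of the hash) is only an approximation of what DVC keeps on disk.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add(data_path: Path, cache_dir: Path) -> str:
    """Copy data_path into the cache under its content hash and return the hash."""
    digest = hashlib.md5(data_path.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copyfile(data_path, cached)
    return digest


def checkout(digest: str, cache_dir: Path, target: Path) -> None:
    """Restore the version identified by digest into the workspace."""
    shutil.copyfile(cache_dir / digest[:2] / digest[2:], target)


with tempfile.TemporaryDirectory() as workdir:
    root = Path(workdir)
    cache = root / "cache"
    dataset = root / "labels.csv"

    dataset.write_text("id,label\n1,cat\n")
    v1 = add(dataset, cache)  # hash that the first Git commit would record

    dataset.write_text("id,label\n1,cat\n2,dog\n")
    v2 = add(dataset, cache)  # hash that the second Git commit would record

    cached_objects = sorted(p.name for p in cache.rglob("*") if p.is_file())
    checkout(v1, cache, dataset)  # go back to the first version
    restored = dataset.read_text()

print("versions differ:", v1 != v2)
print("objects in cache:", len(cached_objects))
print("restored first version:", restored == "id,label\n1,cat\n")
```

Because each version is stored under its own hash, Git only has to remember which hash belongs to which commit, and switching versions becomes a lookup in the cache.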

Benefits of Using DVC

Collaboration: DVC allows team members to work on the same project simultaneously, while ensuring that changes to data and models are tracked and shared.
Reproducibility: DVC ensures that data, models, and experiments are stored and versioned, enabling scientists to reproduce experiments easily.
Traceability: DVC provides a detailed history of changes to data and models, making it easy to track down the source of errors or issues.
Scalability: DVC is designed to handle large datasets, allowing data scientists to work with big data without compromising performance or storage.

Conclusion

Data Version Control (DVC) is a powerful tool that enables data scientists to track and manage changes to their data, models, and experiments. By using DVC alongside Git, data scientists can streamline their development process and focus on creating insights from data. DVC provides a way to manage and share large data files, collaborate with team members, and ensure reproducibility and traceability of experiments.
