Machine learning projects are often harder than they should be. We're dealing with data and software: it should be a simple matter of running the code, iterating through some algorithm tweaks, and after a while having a perfectly trained AI model. But fast forward three months, and the training data might have been changed or deleted, while your memory of which training script does what has faded to a vague recollection. Have you created a disconnect between the trained model and the process used to create it? How do you share work with colleagues so they can collaborate or replicate your results?
As is true for software projects in general, what's needed is better management of code versions and project assets. You may need to revisit the state of the project as it was at any stage in the past. We do this in software engineering all the time by reviewing old commits; shouldn't a machine learning project occasionally do the same? And it goes further than that: what about the equivalent of a pull request, or the other sorts of team management practices routinely used elsewhere?
As for me, I am just beginning my journey into machine learning tools. Among the learning materials I watch tutorial videos, and the instructors sometimes describe problems that remind me of a period early in my software engineering career. In 1993-94, for example, I was the lead engineer of a team developing an e-mail user agent, and we did not have any kind of Source Code Management (SCM) system. Every day I consulted the other team members to see what changes they had made that day. The only tool I had was to run a diff between their source tree and the master source tree (using diff -c | less), then manually apply the changes. Later, team members manually updated their source trees from the master source tree. It was a mess until we found an early SCM system (CVS). That one tool made the project run much more smoothly.
As I learn the tools used in machine learning and data science projects, the stories feel similar. Even today, ML researchers sometimes store experiments (data, code, etc.) in parallel directory structures to facilitate diffing, just as I did in 1993.
Principles
Let’s start with a brief overview of some principles that might be useful to improve the state of software management tools for machine learning projects.
In any machine learning project the scientist will run many experiments to develop the best trained model for the target scenario. Experiments contain:
- Code and Configuration: The software used in the experiment, along with configuration parameters
- Dataset: Any input data used. This can easily be many gigabytes in size, for example in projects that recognize the content of audio, image, or video files
- Outputs: The trained ML model and any other outputs from the experiment
A machine learning project is, at its core, just running software, yet teams often have difficulty sharing files with colleagues or reproducing results. Getting repeatable results that can be shared with colleagues, with the ability to go back in time and evaluate earlier stages of the project, requires more comprehensive management tools.
The solution needs to encompass ideas like these (abstracted from a talk by Patrick Ball titled Principled Data Processing):
- Transparency: Inspecting every aspect of an ML project.
  - What code, configuration and data files are used
  - What processing steps are used in the project, and the order of the steps
- Auditability: Inspecting intermediate results of a pipeline.
  - Looking at not only the final result but also any intermediate results
- Reproducibility: The ability to re-execute the project precisely at any stage of its development, and the ability for co-workers to re-execute the project precisely as well.
  - Recording the processing steps such that they're automatically rerunnable by anyone
  - Recording the state of the project as the project progresses. "State" means code, configuration, and datasets
  - The ability to recreate the exact datasets available at any point in the project history is crucial for auditability to be useful
- Scalability: Ability to support multiple co-workers working on a project, and the ability to work on multiple projects simultaneously
What makes ML projects different from regular software engineering?
You might already be wondering: if ML projects are the same as software engineering, why not just use regular software engineering tools in machine learning projects? Not so fast!
There are many tools used in regular software engineering projects that could be useful to ML researchers. The code and experiment configuration can be easily managed in a regular source code management system like Git, and techniques like pull requests can be used to manage updates to those files. CI/CD (Jenkins, etc) systems can even be useful in automating project runs.
But ML projects have differences that prevent regular software development tools from serving every need. Here are a few:
- Metrics-Driven development versus Feature-Driven development: In regular software engineering “whether to release” decisions are based on whether the team has reached feature milestones. By contrast, ML researchers look at an entirely different measurement - the predictive value of the generated machine learning model. The researcher will iteratively generate dozens (or more) models, measuring the accuracy of each. The project is guided by metrics achieved in each experiment, since the goal is to find the most accurate model.
- ML models require huge resources to train: Where a regular software project organizes files to compile a software product, an ML project instead trains a "model" that embodies an AI algorithm. In most cases compiling a software product takes a few minutes, which is so cheap that many teams follow a continuous integration strategy. Training an ML model takes so long that it's desirable to avoid doing so unless necessary.
- Enormous datasets and trained models: A generalization of the previous point is that machine learning development phases almost always require enormous datasets that are used in training the ML model, plus trained models can be enormous. Normal source code management tools (Git et al) do not handle large files very well, and add-ons like Git-LFS are not suitable for ML projects. (See my previous article)
- Pipelines: ML projects are a series of steps such as downloading data, preparing data, separating data into training/validation sets, training a model, and validating the model. Many use the word “pipeline”, and it is useful to structure an ML project with discrete commands for each step versus cramming everything into one program.
- Special purpose hardware: Software organizations can host their software infrastructure on any kind of server equipment. If they want a cloud deployment, they can rent everyday VPSs from their favorite cloud computing provider. ML researchers have huge computation needs: high-power GPUs do not just speed up video editing, they can make ML algorithms fly, slashing the time required to train ML models.
What if an intermediate result was generated three months ago, and things have changed such that you don't remember how the software was run at that time? What if the dataset has since been overwritten or changed? A system supporting transparency, auditability and reproducibility for an ML project must account for all of these things.
Now that we have a list of principles, let’s look at some open source tools in this context.
There are a large number of tools that might be suitable for data science and machine learning practitioners. In the following sections we’re specifically discussing two tools (MLFlow and DVC) while also talking about general principles.
Principled data and models storage for ML projects
One side of this discussion boils down to:
- Tracking which data files were used for every round of training machine learning models.
- Tracking resulting trained models and evaluation metrics
- A simple method to share data files with colleagues via any form of file sharing system.
A data tracking system is required to transparently audit, or to reproduce, the results. A data sharing system is required to scale the project team to multiple colleagues.
It may already be obvious, but it is impractical to use Git or other SCM (Source Code Management system) to store the data files used in a machine learning project. It would be attractively simple if the SCM storing the code and configuration files could also store the data files. Git-LFS is not a good solution either. My earlier article, Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis, went into some detail about the reasoning.
Some libraries provide an API to simplify dealing with files on remote storage, and manage uploading files to or from remote storage. While this can be useful for shared access to a remote dataset, it does not help with the problem described here. First, it is a form of embedded configuration since the file names are baked into the software. Any program where configuration settings are embedded in the source code is more difficult to reuse in other circumstances. Second, it does not correlate which data file was used for each version of the scripts.
Consider the example code for MLFlow:
mlflow.pytorch.load_model("runs:/<mlflow_run_id>/run-relative/path/to/model")
This supports several alternative file access "schemes", including cloud storage systems like S3. The example here loads a file, in this case a trained model, from the "run" area. An MLFlow "run" is generated each time you execute "a piece of data science code". You configure a location where run data is stored, and a run ID is generated for each run, which is used to index into the data storage area.
This looks to be useful as it will automatically associate the data with commits to the SCM repository storing code and configuration files. Additionally, as the MLFlow API is available for several languages, you’re not limited to Python.
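To illustrate how files end up under a run in the first place, here is a hedged sketch of logging a model inside a run and loading it back later; the model, the metric, and the artifact name are placeholders rather than details from any real project:

import mlflow
import mlflow.pytorch
import torch

# Hypothetical run: log a metric and a (placeholder) PyTorch model as a run artifact.
with mlflow.start_run() as run:
    model = torch.nn.Linear(4, 2)            # stand-in for a real trained model
    mlflow.log_metric("val_accuracy", 0.93)  # hypothetical metric value
    mlflow.pytorch.log_model(model, "my_model")
    run_id = run.info.run_id

# Later, possibly from another script, the model is retrieved through the
# "runs:/<run id>/<artifact path>" scheme shown above.
model = mlflow.pytorch.load_model(f"runs:/{run_id}/my_model")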
DVC has a different approach. Instead of integrating a file API into your ML scripts, your scripts simply input and output files using normal file-system APIs. For example:
model = torch.load('path/to/model.pkl')
Ideally this pathname would be passed in from the command line. The point is that nothing special is required of the code because DVC provides its value outside the context of the code used in training or validating models.
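The point about passing the path on the command line can be shown with a tiny sketch; the argument handling here is a hypothetical illustration, not anything DVC requires:

import sys
import torch

# The model path arrives as a command-line argument, so the script contains
# nothing DVC-specific; DVC simply ensures the right version of the file is
# present at that path in the workspace.
model_path = sys.argv[1] if len(sys.argv) > 1 else "path/to/model.pkl"
model = torch.load(model_path)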
DVC makes this transparent because the data file versioning is paired with Git. A file or directory is taken under DVC control with the command:
$ dvc add path/to/model.pkl
The data is stored in a natural place: your working directory. Navigating through the results of various runs is a simple matter of navigating through your Git history. Viewing a particular result is as simple as running git checkout, and DVC will be invoked to ensure the correct data files are linked into the workspace.
A "DVC file" is created to track each file or directory, and these DVC files are inserted into the workspace by DVC. They serve two purposes: one is tracking data and model files, the other is recording the workflow commands, which we'll go over in the next section.
These DVC files record MD5 checksums of the files or directories being tracked. They are committed to the Git workspace, and therefore the DVC files record the checksum of each file in the workspace for each Git commit. Behind the scenes, DVC uses what's called a "DVC cache directory" to store multiple instances of each file. The instances are indexed by checksum and are linked into the workspace using reflinks or symlinks. When DVC responds to the git checkout operation, it can quickly rearrange the linked files in the workspace based on the checksums recorded in the DVC files.
DVC supports a remote cache directory that is used to share data and models with others.
$ dvc remote add remote1 ssh://user@host.name/path/to/dir
$ dvc push
$ dvc pull
A DVC remote is a pool of storage through which data can be shared. It supports many storage backends, including S3 and similar services, HTTP, and FTP, and creating one is very simple. The dvc push and dvc pull commands are purposely similar to the git push and git pull commands: dvc push sends data to a remote DVC cache, and dvc pull retrieves data from a remote DVC cache.
Principled workflow descriptions for ML projects
Another side of the discussion is about how to best describe the workflow, or pipeline, used in the ML project. Do we pile the whole thing into one program? Or do we use multiple tools?
The greatest flexibility comes from implementing the workflow as a pipeline, or a directed acyclic graph, of reusable commands that take configuration options as command-line arguments. This is purposely similar to The Unix Philosophy of small well-defined tools, with narrow scope, that work well together, where behavior is tailored by command-line options or environment variables, and that can be mixed and matched as needed. There is a long collective history behind this philosophy.
By contrast many of the ML frameworks take a different approach in which a single program is written to drive the workflow used by the specific project. The single program might start with the step of splitting data into training and validation sets, then proceed through training a model and running validation of the model. This gives us limited chance to reuse code in other projects.
Structuring an ML project as a pipeline provides several benefits.
- Managing complexity: Implementing the steps as separate commands improves transparency and lets you focus on each step in isolation.
- Optimize execution: Ability to skip steps that do not need to be rerun if files have not changed.
- Reusability: The possibility of using the same tool between multiple projects.
- Scalability: Different tools can be independently developed by different team members.
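To make the small-command idea concrete, here is a sketch of what one such reusable pipeline step might look like. The file names, the split fraction, and the use of pandas are a hypothetical illustration, not something prescribed by any particular framework:

import argparse
import pandas as pd

# Hypothetical "split" step: read one CSV, write train and validation CSVs.
# Everything the step needs arrives as command-line options, so the same tool
# can be reused in other projects or rerun by a pipeline manager.
parser = argparse.ArgumentParser(description="Split a dataset for training")
parser.add_argument("--input", required=True)
parser.add_argument("--train-out", required=True)
parser.add_argument("--valid-out", required=True)
parser.add_argument("--valid-fraction", type=float, default=0.2)
parser.add_argument("--seed", type=int, default=42)
args = parser.parse_args()

data = pd.read_csv(args.input)
valid = data.sample(frac=args.valid_fraction, random_state=args.seed)
train = data.drop(valid.index)
train.to_csv(args.train_out, index=False)
valid.to_csv(args.valid_out, index=False)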
The MLFlow framework has you write a "driver program". That program contains whatever logic is required, such as processing data and generating a machine learning model. Behind the scenes, the MLFlow API sends requests to an MLFlow server, which then spawns the specified commands.
The MLFlow example for a multi-step workflow makes this clear. Namely:
...
load_raw_data_run = _get_or_run("load_raw_data", {}, git_commit)
ratings_csv_uri = os.path.join(load_raw_data_run.info.artifact_uri,
                               "ratings-csv-dir")
etl_data_run = _get_or_run("etl_data",
                           {"ratings_csv": ratings_csv_uri,
                            "max_row_limit": max_row_limit},
                           git_commit)
...
als_run = _get_or_run("als",
                      {"ratings_data": ratings_parquet_uri,
                       "max_iter": str(als_max_iter)},
                      git_commit)
...
_get_or_run("train_keras", keras_params, git_commit, use_cache=False)
...
The _get_or_run function is a simple wrapper around mlflow.run. The first argument to each is an entrypoint defined in the MLproject file. An entry point contains environment settings, the command to run, and options to pass to that command. For example:
etl_data:
  parameters:
    ratings_csv: path
    max_row_limit: {type: int, default: 100000}
  command: "python etl_data.py --ratings-csv {ratings_csv} --max-row-limit {max_row_limit}"
At first blush this appears to be very good. But here are a few questions to ponder:
- What if your workflow must be more complex than a straight line? You can pass False for the synchronous parameter to mlflow.run, then wait on the returned SubmittedRun object to learn when the task has finished (see the sketch after this list). In other words, it is possible to build a process management system on top of the MLFlow API.
- Why is a server required? Why not just run the commands at a command line? Requiring that a server be configured makes setup of a MLFlow project more complex.
- How do you avoid running a task that does not need to execute? In many ML projects, it takes days to train a model. That resource cost should only be spent if needed, such as changed data, changed parameters or changed algorithms.
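On the first of those questions, here is a rough sketch of launching two independent steps concurrently and waiting for both; the entry point names and parameters are hypothetical:

import mlflow

# Two hypothetical entry points that do not depend on each other.
# synchronous=False returns a SubmittedRun immediately instead of blocking.
text_step = mlflow.run(".", entry_point="featurize_text",
                       parameters={"input": "data/text.csv"}, synchronous=False)
image_step = mlflow.run(".", entry_point="featurize_images",
                        parameters={"input": "data/images"}, synchronous=False)

# Block until both steps report completion before moving on to training.
text_step.wait()
image_step.wait()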
DVC has an approach that works with regular command-line tools, but does not require setting up a server nor writing a driver program. DVC supports defining a workflow as a directed acyclic graph (DAG) using the set of DVC files mentioned earlier.
We mentioned DVC files earlier as being associated with files added to the workspace. DVC files can also describe commands to execute, such as:
$ dvc run -d matrix-train.p -d train_model.py \
-o model.p \
python train_model.py matrix-train.p 20180226 model.p
$ dvc run -d parsingxml.R -d Posts.xml \
-o Posts.csv \
Rscript parsingxml.R Posts.xml Posts.csv
The dvc run command creates a DVC file that describes a command to execute. The -d option documents a dependency on a file; DVC tracks its checksum to detect changes to that file. The -o option names an output of the command. Outputs of one command can of course be used as inputs to another command. By looking at dependencies and outputs, DVC can calculate the execution order of the commands.
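For context, the train_model.py stage in the first dvc run example could be as plain as the following sketch. The contents of matrix-train.p, the reading of the second argument as a random seed, and the choice of scikit-learn are all assumptions made for illustration:

import pickle
import sys

from sklearn.ensemble import RandomForestClassifier

# Hypothetical body for the train_model.py stage shown above. Assumes the
# pickled input holds a (features, labels) pair and argv[2] is a random seed.
input_path, seed, output_path = sys.argv[1], int(sys.argv[2]), sys.argv[3]

with open(input_path, "rb") as fp:
    features, labels = pickle.load(fp)

model = RandomForestClassifier(n_estimators=100, random_state=seed)
model.fit(features, labels)

# This file is the -o output above, so DVC tracks it in its cache automatically.
with open(output_path, "wb") as fp:
    pickle.dump(model, fp)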
All outputs, including trained models, are automatically tracked in the DVC cache just like any other data file in the workspace.
Because it computes checksums, DVC can detect changed files. When the user requests DVC to re-execute the pipeline it only executes stages where there are changes. DVC can skip over your three-day model training task if none of its input files changed.
Everything executes at a regular command line; there is no server to set up. If you want this to execute in a cloud computing environment, or on a server with attached GPUs, simply deploy the code and data to that server and run the DVC commands from the command line there.
Conclusion
We’ve come a long way with this exploration of some principles for improved machine learning practices. The ML field, as many recognize, needs better management tools so that ML teams can work more efficiently and reliably.
The ability to reproduce results means others can evaluate what you’ve done, or collaborate on further development. Reproducibility has many prerequisites including the ability to examine every part of a system, and the ability to precisely rerun the software and input data.
Some of the tools used in machine learning projects have nice user interfaces, such as Jupyter Notebook. These kinds of tools have their place in machine learning work. However, GUI tools do not fit well with the principles discussed in this article. Command-line tools are well suited for processing tasks running in the background, and can easily satisfy all the principles we outline, while typical GUIs interfere with most of them.
As we’ve seen in this article some tools and practices can be borrowed from regular software engineering. However, the needs of machine learning projects dictate tools that better fit the purpose. A few worthy tools include MLFlow, DVC, ModelDb and even Git-LFS (despite what we said earlier about it).
Comments
Great article! I understand that from your point of view using DVC is preferred compared to MLFlow.
On the other hand, DVC always requires data to be downloaded locally to the machine from which you execute the commands. If the workflow of the machine learning pipeline is to read from S3 into memory (e.g. as a DataFrame), or to load data into a Spark cluster and write back to S3, the DVC approach of pulling the data locally before executing the script can be a bottleneck.
Moreover, you have to remember to clean up the downloaded data after you have executed your script.
Wouldn't it be better to use DVC to track config and metadata files containing the path to a read-only data lake (such as S3), and have the data loaded at execution time?
In my team at Helixa we have been experimenting with Alluxio as an intermediate storage layer between the server machine and S3, in order to avoid unnecessary IO and network traffic with S3.
Moreover, I like the rich features of MLFlow as a model repository, and its UI for visualizing published metrics, as opposed to raw files versioned in DVC.
What are your thoughts?
Thanks
DVC and MLFlow are not mutually exclusive. You can use DVC for dataset versioning while using MLFlow or other tools for metrics tracking and visualization. An example is here.
Right. DVC works with local files. In this way, DVC solves the problem of file naming for multiple versions: you don't need to keep changing file suffixes/prefixes/hashes from your code.
You pointed to a good use case: sometimes you don't need a local copy and prefer to read from S3 into memory directly. This use case is already supported in DVC through external dependencies. You can even write results back to cloud storage through external outputs.
This works exactly as you described :) "track config and metadata files containing the path to a read-only datalake (such as S3)". The external dependency is recorded in a DVC metafile with all the paths, and the next dvc repro command will rerun the stage if the input has changed.