Discussion on: Principled Machine Learning: Practices and Tools for Efficient Collaboration

Great article! I understand that, from your point of view, DVC is preferable to MLflow.
On the other hand, DVC always requires the data to be downloaded locally onto the machine from which you execute the commands. If the workflow of the machine learning pipeline is to read from S3 into memory (e.g. as a DataFrame), or to load into a Spark cluster and write back to S3, then pulling the data locally with DVC before executing the script can be a bottleneck.
Moreover, you have to remember to clean up the downloaded data after you have executed your script.
Wouldn't it be better to use DVC to track config and metadata files containing the path to a read-only data lake (such as S3) and have the data loaded at execution time?
In my team at Helixa we have been experimenting with Alluxio as a mid-storage layer between the server machine and S3, in order to avoid unnecessary I/O and network traffic with S3.

Moreover, I like the rich features of MLflow as a model repository, and its UI for visualizing published metrics, as opposed to raw files versioned in DVC.
What are your thoughts on this?

Thanks

Dmitry Petrov • Jul 7 '19 • Edited

DVC and MLflow are not mutually exclusive. You can use DVC for dataset versioning and MLflow or other tools for metrics tracking and visualization. An example is here.

Right. DVC works with local files. This way, DVC solves the problem of file naming for multiple versions - you don't need to keep changing file suffixes/prefixes/hashes in your code.

You pointed to a good use case - sometimes you don't need a local copy and prefer to read from S3 directly into memory. This use case is already supported in DVC through external dependencies. You can even write results back to cloud storage through external outputs.

This works exactly how you described :) "track config and metadata files containing the path to a read-only datalake (such as S3)". The following command creates a DVC metafile with all the paths. The next dvc repro will then rerun the command if the input has changed.

$ dvc run -d s3://mybucket/input.json -o s3://mybucket/out.tsv ./run-my-spark-job.sh
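The rerun-only-if-the-input-changed idea can be illustrated with a small, purely conceptual shell sketch. It is not how DVC is implemented: the function name `run_if_changed`, the metafile layout, and the use of `cksum` on a local file are all assumptions made for illustration (DVC tracks its own checksums, and for S3 paths it would not read a local file at all).

```shell
#!/bin/sh
# Conceptual sketch only: rerun a command when its input's checksum changes.
# This is NOT DVC's implementation; run_if_changed and the metafile format
# are hypothetical names chosen for this example.

# Usage: run_if_changed INPUT METAFILE COMMAND [ARGS...]
run_if_changed() {
    input=$1
    metafile=$2
    shift 2

    # Checksum of the current input (cksum is POSIX; reading from stdin
    # keeps the file name out of the output).
    current=$(cksum < "$input")

    # Checksum recorded by the previous successful run, if any.
    recorded=""
    if [ -f "$metafile" ]; then
        recorded=$(cat "$metafile")
    fi

    if [ "$current" = "$recorded" ]; then
        echo "unchanged: skipping"
        return 0
    fi

    # Input is new or has changed: run the command, and record the new
    # checksum only if the command succeeded.
    "$@" && printf '%s\n' "$current" > "$metafile"
}
```

Calling `run_if_changed input.json input.meta ./run-my-spark-job.sh` twice in a row would run the job the first time and skip it the second time, until `input.json` changes. With DVC, the metafile created by `dvc run` plays the role of `input.meta`, and `dvc repro` plays the role of the repeated call.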