Discussion on: Principled Machine Learning: Practices and Tools for Efficient Collaboration

Dmitry Petrov

DVC and MLflow are not mutually exclusive. You can use DVC for dataset versioning and MLflow or other tools for metrics tracking and visualization. An example is here.

Right. DVC works with local files. That way, DVC solves the file-naming problem for multiple versions: you don't need to keep changing file suffixes, prefixes, or hashes in your code.

You pointed to a good use case: sometimes you don't need a local copy and prefer to read from S3 directly into memory. DVC already supports this use case through external dependencies. You can even write results back to cloud storage with external outputs.

This works exactly as you described :) "track config and metadata files containing the path to a read-only datalake (such as S3)". The following command creates a DVC metafile with all the paths. The next dvc repro command will rerun the command if the input has changed.

$ dvc run -d s3://mybucket/input.json -o s3://mybucket/out.tsv ./run-my-spark-job.sh
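
To make the "rerun if the input was changed" idea concrete, here is a toy Python sketch of checksum-based change detection. It is only an illustration of the general technique, not DVC's actual code: the function names (file_checksum, needs_rerun, record), the JSON metafile layout, and the use of SHA-256 over a local file are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_rerun(dep: Path, metafile: Path) -> bool:
    """Return True if the dependency changed since it was last recorded.

    The metafile is a hypothetical JSON file mapping a dependency path
    to the checksum seen on the previous run.
    """
    if not metafile.exists():
        # Nothing recorded yet, so the command has never run.
        return True
    recorded = json.loads(metafile.read_text())
    return recorded.get(str(dep)) != file_checksum(dep)


def record(dep: Path, metafile: Path) -> None:
    """Store the dependency's current checksum in the metafile."""
    metafile.write_text(json.dumps({str(dep): file_checksum(dep)}))
```

A caller would check needs_rerun before running the job and call record after it finishes, so an unchanged input is skipped on the next pass.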