Hard links and symbolic links have been available since time immemorial, and we use them all the time without even thinking about it. In machine learning projects they can help us rearrange data files quickly and efficiently when setting up new experiments. However, with traditional links, we run the risk of polluting the data files with erroneous edits. In this blog post we’ll go over the details of using links, some cool new stuff in modern file systems (reflinks), and an example of how DVC (Data Version Control, https://dvc.org/) leverages this.
As I study machine learning, I find myself wishing for a tool that would let us inspect ML projects the way we do regular software engineering projects. That is, to retrieve the state of the project at any given time, create branches or tags (with Git) based on an earlier state of a project, handle collaboration with colleagues, and so on. What makes ML projects different is the tremendous amount of data (thousands of images, audio, or video files), the trained models, and how difficult it is to manage those files with regular tools like Git. In my earlier articles I went over why Git by itself is insufficient, why Git-LFS is not a solution for machine learning projects, and some principles that seem to be useful for tools that manage ML projects.
DVC has proved to be very good at managing ML project datasets and workflow. It works hand-in-hand with Git, and can show you the state of the datasets corresponding to any Git commit. Simply by checking out a commit, DVC can rearrange the data files to exactly match what was present at the time of that commit.
The speed is rather magical, considering that potentially many gigabytes of data are being rearranged nearly instantaneously. So I was wondering: How does DVC pull off this trick?
The trick to rearranging gigabytes of training data
It turns out DVC’s secret to rearranging data and model files as quickly as Git is to link files rather than copy them. Git, of course, copies files into place when it checks out a commit, but Git typically deals with relatively small text files, as opposed to the large binary blobs used in ML projects. Linking a file, as DVC does, is incredibly fast, making it possible to rearrange any number of files in the blink of an eye, while avoiding copying and thus saving disk space.
Using file linking techniques is nothing new to the field. Some data science teams already use symlinks to save space and avoid copying large datasets. But symlinks are not the only sort of link that can be used. We will start with a strategy of copying files into place, then move on to hard links and symbolic links, and end up with a new type of link, the reflink, which implements copy-on-write capabilities in the file system. We will use DVC as an example of how tools can use different linking strategies.
Test setup
Since we’ll be testing different linking strategies, we need a sample workspace. The workspace was set up on my laptop, a MacBook Pro whose main drive is formatted with the APFS file system. Further tests were done on Linux on a drive formatted with XFS.
The data used is two “stub-articles” dumps of the Wikipedia website retrieved from two different days. Each is about 38 GB of XML, giving us enough data to be similar to an ML project. We then set up a Git/DVC workspace where one can switch between these two files by checking out different Git commits.
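As a rough sketch of what that setup involves (the exact commands, paths, and tag names here are my own illustration rather than a capture of the original workspace; the DVC tutorials linked later in this post walk through the full procedure):
$ git init && dvc init
$ mkdir data
$ cp wikidatawiki-20190401-stub-articles.xml data/wikidatawiki-stub-articles.xml
$ dvc add data/wikidatawiki-stub-articles.xml        # hashes the file into .dvc/cache and writes a .dvc metafile
$ git add data/wikidatawiki-stub-articles.xml.dvc data/.gitignore
$ git commit -m "dataset from 20190401" && git tag 20190401
# ...then swap in the second dump, repeat dvc add and git commit, and tag that commit as well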
$ ls -hl wikidatawiki-20190401-stub-articles.xml
-rw-r--r-- 1 david staff 35G Jul 20 21:35 wikidatawiki-20190401-stub-articles.xml
$ time cp wikidatawiki-20190401-stub-articles.xml wikidatawiki-stub-articles.xml
real 14m16.918s
...
As a base-line measure we’ll note that copying these files into the workspace took about 15 minutes apiece. Obviously an ML researcher would not have a pleasant life if it took 15 minutes to switch between commits in the repository.
Instead, we’ll be exploring another technique DVC and some other tools utilize: linking. There are two types of links all modern operating systems support: hard links and symbolic links. A new type of link, the reflink (copy-on-write), is becoming available in newer releases of Mac OS X and Linux (where one needs a file system that supports it). We’ll use each of them in turn and see how well they work.
DVC discusses the four strategies corresponding to the three link types (the 4th strategy is to just copy files) in its documentation. The strategy used depends on the file system capabilities, and whether the "dvc config cache.type" command has changed the configuration.
DVC defaults to using reflinks and, if they are not available, falls back to file copying. It avoids using symlinks and hardlinks because of the risk of accidental cache or repository corruption. We’ll see all this in the coming sections.
Versioned datasets using File Copying
The basic (or naive) strategy of copying files into place when checking out a Git tag is the equivalent of these commands:
$ rm data/wikidatawiki-stub-articles.xml
$ cp .dvc/cache/40/58c95964df74395df6e9e8e1aa6056 data/wikidatawiki-stub-articles.xml
This will run on any filesystem, but it will take a long time to copy the file, and it will consume twice the disk space.
What’s with the strange file name? It’s the filename within the DVC cache, the hex digits being the MD5 checksum. Files in the DVC cache are indexed by this checksum, allowing there to be multiple versions of the same file. The DVC documentation contains more details about the DVC cache implementation.
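You can see the naming scheme for yourself: the file's MD5 checksum matches the cache path, with the first two hex digits used as a subdirectory name. Something like the following, where the output is reconstructed from the cache path shown above rather than captured from a real run:
$ md5 data/wikidatawiki-stub-articles.xml
MD5 (data/wikidatawiki-stub-articles.xml) = 4058c95964df74395df6e9e8e1aa6056
$ ls .dvc/cache/40/
58c95964df74395df6e9e8e1aa6056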
In practice this is how it works in DVC:
To set up the workspace we created two Git tags, one for each file we downloaded. The DVC example on versioned datasets (or this alternate tutorial) should give you an idea of what’s involved with setting up the workspace. To test different file copying/linking modes we first change the DVC configuration, then check out the given Git commit. Running "dvc checkout" causes the corresponding data file to be inserted into the directory.
Oh boy, that sure took a long time. The Git portion of this is very fast, but DVC took a very long time. This is as expected: we told DVC to perform a file copy, and we already knew it took about 15 minutes to copy the file with the cp command.
As for disk space, obviously there are now two copies of the data file. There is the copy in the DVC cache directory, and the other that was copied into the workspace.
Source: https://www.xkcd.com/981/
Versioned datasets using Hard Links and Symlinks
Clearly copying files around to handle dataset versioning is slow and an inefficient use of disk space. An option that has existed since time immemorial in Unix-like environments is file links, both hard links and symbolic links. While Windows historically did not support file links, its "mklink" command can create both styles of links as well.
Hard links are a byproduct of the Unix model for file systems. What we think of as the filename is really just an entry in a directory file. The directory file entry contains the file name, and the “inode number” which is simply an index into the inode table. Inode table entries are data structures containing file attributes, and a pointer to the actual data. A hard link is simply two directory entries with the same inode number. In effect, it is the exact same file appearing at two locations in the filesystem. Hard links can only be made within a given mounted volume.
A symbolic link is a special file where the attributes contains a pathname specifying the target of the link. Because it contains a pathname, symbolic links can point to any file in the filesystem, even across mounted volumes or across network file systems.
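A quick way to see the "same inode, two names" behavior for yourself is to make a hard link and list the inode numbers; the file names here are just stand-ins:
$ echo "some data" > file.dat
$ ln file.dat hardlink.dat
$ ls -li file.dat hardlink.dat       # -i prints the inode number; both entries show the same inode and a link count of 2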
The equivalent commands in this case are:
$ rm data/wikidatawiki-stub-articles.xml
$ ln .dvc/cache/40/58c95964df74395df6e9e8e1aa6056 data/wikidatawiki-stub-articles.xml
This is for a hard link. For a symbolic link use "ln -s".
Then to perform the hard link scenario:
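Concretely, that boils down to something like the following; the tag name is again a placeholder:
$ dvc config cache.type hardlink
$ git checkout 20190401
$ time dvc checkout                  # this time it finishes in a couple of seconds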
The symbolic link scenario is the same, but with cache.type set to symlink. The timing is similar for both cases.
Two seconds (or less) is a lot faster than the roughly 15 minutes it took to copy the files. It happens so fast that "instantaneous" is the right word. File links are that much faster than copying files around. This is a big win.
As for disk space consumption, consider this:
$ ls -l data/
total 8
lrwxr-xr-x 1 david staff 70 Jul 21 18:43 wikidatawiki-stub-articles.xml -> /Users/david/dvc/linktest/.dvc/cache/2c/82d0130fb32a17d58e2b5a884cd3ce
The link takes up a negligible amount of disk space. But there is a wrinkle to consider.
Ok, looks great, right? Fast, no extra space consumed … But let’s think about what would happen if you were to edit data/wikidatawiki-stub-articles.xml in the workspace. Because that file is a link to a file in the DVC cache, editing it would change the file in the cache, polluting the cache. The DVC documentation has instructions on avoiding this problem, but it means always remembering to use a specific process for editing a data file, which, while not a deal breaker, is less than convenient. The better option is to use reflinks.
Versioned Datasets using Reflinks
Hard links and symbolic links have been in the Unix/Linux ecosystem for a long time. I first used symbolic links in 1984 on 4.2BSD, and hard links date back even further. Both hard links and symbolic links can be used to do what DVC does, namely quickly rearranging data files in a working directory. But surely in the last 35+ years there has been an advancement or two in file systems?
Indeed there has, and the Mac OS X "clonefile" and Linux "reflink" features are examples.
Copy-on-write links, a.k.a. reflinks, offer a way to quickly link a file into the workspace while avoiding any risk of polluting the cache. The hard link and symbolic link approaches are big wins because of their speed, but they run the risk of polluting the cache. With reflinks, the copy-on-write behavior means that if someone were to modify the data file, the copy in the cache would not be polluted. That means we get the same performance advantage as traditional links, with the added advantage of data safety.
Maybe, like me, you don’t know what a reflink is. The technique duplicates a file on disk such that the “copy” is a “clone”, similar to a hard link. Unlike a hard link, where two directory entries refer to the same inode entry, with reflinks there are two inode entries, and it is the data blocks that are shared. Creating one is as fast as creating a hard link, but there is an important difference: any write to the cloned file causes new data blocks to be allocated to hold that data. The cloned file appears changed, and the original file is unmodified. The clone is perfectly suited to duplicating a dataset, allowing modifications to the dataset without polluting the original.
Like with hard links, reflinks only work within a given mounted volume.
Reflinks are readily available on Mac OS X and, with a little work, on Linux. This feature is supported only on certain file systems:
- Linux
  - BTRFS
  - XFS
  - OCFS2
- Mac OS X
  - APFS
APFS is supported out of the box on macOS, and Apple strongly suggests we use it. For Linux, XFS is the easiest to set up, as shown in this tutorial.
For APFS the equivalent commands are:
$ rm data/wikidatawiki-stub-articles.xml
$ cp -c .dvc/cache/40/58c95964df74395df6e9e8e1aa6056 data/wikidatawiki-stub-articles.xml
With the -c option, the macOS cp command uses the clonefile(2) system call. The clonefile function sets up a reflink clone of the named file. On Linux, the cp command uses the --reflink option instead.
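For completeness, the Linux equivalent of the cp -c command above would look something like this on an XFS or BTRFS volume; --reflink=always makes cp fail rather than silently fall back to a full copy:
$ rm data/wikidatawiki-stub-articles.xml
$ cp --reflink=always .dvc/cache/40/58c95964df74395df6e9e8e1aa6056 data/wikidatawiki-stub-articles.xml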
Then to run the test:
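Again in rough outline, with a placeholder tag name:
$ dvc config cache.type reflink
$ git checkout 20190401
$ time dvc checkout                  # again nearly instantaneous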
The performance is, as expected, similar to the hard links and symbolic links strategies. What we learn is that reflinks are about as fast as hard links and symlinks, and disk space consumption is again negligible.
The cool thing about this type of link is that even though the files are connected, you can edit the file in the workspace without modifying the file in the cache. The changed data blocks are copied under the hood.
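A simple way to convince yourself of this on APFS is to clone a file, modify the clone, and check that the original is untouched; the file names here are stand-ins:
$ md5 original.dat                   # note the checksum before cloning
$ cp -c original.dat clone.dat       # instant clone; the two files share data blocks
$ echo "an edit" >> clone.dat        # this write allocates fresh blocks for the changed data
$ md5 original.dat                   # same checksum as before: the original is untouched
$ cmp original.dat clone.dat         # the two files now differ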
On Linux the same scenario runs with similar performance.
Conclusion
We’ve learned something about how to efficiently manage a large dataset, as is typical in machine learning projects. If we need to revisit any development stage in such projects, we’ll want a system for efficiently rearranging large datasets to match each stage.
We’ve seen it is possible to keep a list of files that were present at any Git commit. With that list we can link or copy those files into the working directory. That is exactly how DVC manages data files in a project. Using links, rather than file copying, lets us quickly and efficiently switch between revisions of the project.
Reflinks are an interesting new feature for file systems, and they are perfect for this scenario. Reflinks are as fast to create as traditional hard links and symbolic links, letting us quickly duplicate a file, or a whole directory structure, while consuming negligible extra space. And, since reflinks keep modifications in the linked file, they give us many more possibilities than traditional links. In this article we examined using reflinks in machine learning projects, but they are useful in other sorts of applications as well. For example, some database systems utilize them to manage data on disk more efficiently. Now that you’ve learned about reflinks, how will you go about using them?
Top comments (3)
Windows has support for copy-on-write if you use the Enterprise version and format the drive with ReFS. Windows can't boot from ReFS, but you can have additional drives formatted with it, which is useful. (Although I haven't tested it yet, I'm using VSS to version that drive, which holds my ML project.)
docs.microsoft.com/en-us/windows-s...
I'll consider formatting the drive to test if that dvc tool works.
It seems like it wouldn't be too hard to implement it docs.microsoft.com/en-us/windows/w...
The irony: git-lfs has support for it. github.com/git-lfs/git-lfs/blob/b9...
And dvc doesn't.
github.com/iterative/dvc/blob/318b...
Please take note!
Nice post! I wasn't aware of reflinks and their benefits, nor the exact distinctions among these link types, so thank you.