File-systems are tricky in general; that much needs saying up-front, and unambiguously.
POSIX file-systems (most of the ones you know and love) even more so, as POSIX specifies guarantees about what it means to perform certain operations, and those requirements are fairly rigid.
In the contemporary age of computing there's a handful of file-systems, the most common features have more-or-less settled in, and there's not nearly as much innovation as there used to be. For most of us, most of the time, that's just fine. Solid State Drives (SSDs) cover all of our common use-cases, and by virtue of being absolutely screaming fast, they propelled file-systems in general a decade into the future "by magic", purely through the underlying technology.
This post is a quick attempt to set down some thoughts about file-systems and the performance characteristics we expect, looking specifically at shared file-systems for development environments; Docker, to be precise.
Docker on Linux is a very thin wrapper over cgroups and kernel namespaces; the performance is so close to the native performance of the underlying file-system and hardware that there's barely anything to discuss. In that scenario the file-system stack looks something like this (a rough sketch):
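```
[ process in the container ]
[ overlayfs (image layers + writable layer) ]
[ ext4 (or xfs, btrfs, ...) on the host ]
[ block device (SSD) ]
```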
The story doesn't change much when using Docker shared volumes mapped between the container and the host, because the partitioning of the file-system that is visible to the container is all magic of kernel namespacing and clever use of "overlay" file-systems, which are a way to "stack" multiple, possibly sparse directories on top of one another to give a union view of all the parts of the stack.
When you do a `FROM someimage:latest`, and then mount or copy files into the container, you're creating a two or three layer deep stack, and Docker takes care of making sure all your pancake layers are there when you need them.
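To get a feel for what that stacking looks like outside of Docker, here's a minimal manual overlayfs mount; purely a sketch (the paths are made up, and it needs root on a Linux machine):

```
mkdir -p /tmp/lower /tmp/upper /tmp/work /tmp/merged
echo "from the lower layer" > /tmp/lower/a.txt
mount -t overlay overlay \
      -o lowerdir=/tmp/lower,upperdir=/tmp/upper,workdir=/tmp/work \
      /tmp/merged
cat /tmp/merged/a.txt            # reads fall through to the lower layer
echo "new" > /tmp/merged/b.txt   # writes land in the upper layer only
```

Docker assembles roughly this kind of union view for every container: one lower directory per image layer, with a writable upper directory on top.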
This, however, isn't a story about Linux and Docker file sharing; this is about the situation on Mac, and possibly Windows (although I really have no idea about Windows).
Because macOS isn't Linux, it doesn't have the cgroups and namespace features which are what make Docker work. That makes it impossible to run Docker natively on any platform that is not Linux, so to get around this limitation, Docker simply manages an "invisible" Linux virtual machine.
In theory there's no reason why this should prefer one hypervisor (the tech that virtualizes hardware; examples would be VMware Fusion, Parallels, VirtualBox, etc.) over another.
On macOS it's xhyve. There's nothing spectacular or special about xhyve as far as I'm aware, except that Apple bundles the underlying support into macOS as `Hypervisor.framework`, and xhyve just uses that.
The important thing, though, is that we no longer share any resources between the host and the VM. If we want to share a directory now, we need to share between macOS and the virtual machine, and from the virtual machine down to the Docker container.
Worse, because the invisible (Linux) virtual machine and the host use different file-systems (say ext4 in the VM, and HFS+ on the host), not all concepts are portable. HFS+ for example is (optionally) case insensitive, file-system events (notifications about changes to files and directories) don't work quite the same way, and there are probably other aspects I'm forgetting, such as whether or not files can be atomically replaced, or deleted whilst still being open, or whatever else on the long tail of occasionally important behaviour.
Here's how our stack from before looks with two extra levels of VM and file-sharing (again, a rough sketch):
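```
[ process in the container ]
[ overlayfs (image layers + writable layer) ]
[ ext4 inside the Linux VM ]
[ shared file-system bridge (osxfs, gRPC-Fuse, NFS, ...) ]
[ HFS+ / APFS on the macOS host ]
[ block device (SSD) ]
```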
In the realm of shared file-systems, the bridge between our Docker VM (and the containers it runs) and the host file-system must be some kind of network-based, or demi-network-based, file-system.
Commonly known solutions are NFS and CIFS (Samba); however, Docker provides their own `osxfs` (closed source), and in their edge channel provides one by the name of gRPC-Fuse (also closed source).
Common file-system benchmarks (including the extensive NFS benchmarking and tuning documentation) focus a great deal on optimizing reads and writes, which historically makes sense: magnetic storage wasn't renowned for speed, so (the software part of) file-systems ran the risk of being a significant bottleneck.
Indeed, for file browsing tools (Windows' Explorer, macOS' Finder, Gnome's Nautilus, etc.) even some of those benchmarks can make sense; it's common to open and read from files to search for icons, render previews, and so on, so the actual read/write performance of a file-system is a significant driver.
In the context, then, of Docker (for Mac) again, I did some profiling of file-system performance, not at the level of raw read/write performance, but a level lower.
Before actually reading a file it must be opened, and before even opening it one can check for its existence with calls to specific system APIs which return metadata.
The family of file-system related "system calls" (syscalls) runs a breadth of ~15 things, with `open`, `read`, `write`, `stat`, `lstat`, `access`, and `close` being the most frequently called ones (`rename`, `symlink`, etc. also being common).
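To watch just this family of calls for any given tool, strace can filter by syscall name; a sketch (the exact names vary by libc and kernel, and modern systems typically show `openat` rather than `open`, and `newfstatat` or `statx` rather than `stat`):

```
/ # strace -e trace=open,openat,read,write,close,stat,lstat,access cat /etc/hostname
```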
Syscalls themselves usually represent a trivially small percentage of the time spent reading a file. An `open` may take 7ns, where a `read` may take somewhere in the order of 20ns to read 1024 bytes.
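If you want to eyeball those costs yourself, strace's `-T` flag appends the time spent inside each syscall to every traced line (a trailing `<0.000021>`-style duration); a quick sketch, with the caveat that strace's own overhead inflates the numbers considerably:

```
/ # strace -T -e trace=open,openat,read,close cat /etc/hostname
```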
Take a look at this: after generating a 10MB file of zeros, we can ask `cat` to print it, and trace the underlying system calls. Note this only works on Linux, so I'm doing it in Docker on my Linux machine, in an Alpine container with no shared file-system.
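For the record, the setup looked roughly like this (a sketch; the image tag, the `apk` package name, and the exact `dd` invocation are assumptions rather than the precise commands used):

```
$ docker run --rm -it alpine:latest sh
/ # apk add strace
/ # dd if=/dev/zero of=10meg bs=1M count=10
```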
`cat` is using `open` and `close`, taking 0.000061 and 0.000011 seconds respectively, and the `sendfile` syscall, which copies between the file and my terminal without having to do `read`s of the file. This actually makes `cat` about twice as fast as `head` for printing a file:
```
/ # strace -c -f cat 10meg
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.99    0.048492       24246         2           sendfile
  0.01    0.000004           4         1           close
  0.00    0.000000           0         1           open
  0.00    0.000000           0         2           mprotect
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           getuid
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
------ ----------- ----------- --------- --------- ----------------
100.00    0.048496                    10           total
```
```
/ # strace -c -f head 10meg
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 64.48    0.051081           5      9757           writev
 35.29    0.027957           2      9767           readv
  0.07    0.000053          26         2           mprotect
  0.05    0.000036          36         1           open
  0.04    0.000028          28         1           execve
  0.02    0.000017          17         1           arch_prctl
  0.02    0.000016          16         1           ioctl
  0.02    0.000014          14         1           getuid
  0.02    0.000014          14         1           set_tid_address
  0.00    0.000003           3         1           close
------ ----------- ----------- --------- --------- ----------------
100.00    0.079219                 19533           total
```
OK, at the risk of having gotten lost along the way: let's say that programming languages, development tooling, and the kinds of things we usually run in Docker have quite specific requirements.