
Zero-assumptions ZFS: how to actually understand ZFS

By Nik ・ 11 min read

This is the first in a series of articles about ZFS, and is part of what I hope becomes an ongoing series here: the zero-assumptions write-up. This article is written assuming you know nothing about ZFS.

I’ve been interested in ZFS for a while now, but didn’t have a good reason to use it for anything. Last time I looked at it, ZFS on Linux was still a bit immature, but in the past few years Linux support for ZFS has really stepped up, so I decided to give it another go. ZFS is a bit of a world unto itself though, with most resources either walking you through some quick commands without explaining the concepts underlying ZFS, or assuming the user is very familiar with traditional RAID terminology.

Background

I keep one desktop machine in my house (running Bedrock Linux with an Ubuntu base and an Arch Linux stratum on top) that acts, among other things, as a storage/media server. I keep photos and other digital detritus I’ve collected over the years there, and would be very sad if they were to disappear. I back everything up nightly via the excellent restic to the also excellent Backblaze B2, but since I have terabytes of data stored there I haven’t followed the cardinal rule of backing up: make sure that you can actually restore from your backups. Since testing that on my internet connection would take months, and I’m afraid of accidental deletion or drive failure, I decided to add a bit more redundancy.

My server has 3 hard drives in it right now: one 4TB spinning disk, one 2TB spinning disk, and one 500GB SSD that holds the root filesystem. The majority of the data I want to keep is on the 2TB drive, and the 4TB drive is mostly empty. After doing some research (read: browsing posts on /r/datahoarder), it seems the two most common tools people use to add transparent redundancy are a snapraid + mergerfs combo, or the old standby, ZFS.

Installing ZFS on Linux

Getting ZFS installed on Linux (assuming you don’t try to use it as the root filesystem) is almost comically easy these days. On Ubuntu 16.04+ (and probably recent Debian releases too), this should be as straightforward as:

sudo apt install zfs-dkms zfs-fuse zfs-initramfs zfsutils-linux

Explanation:

For simplicity, the above command installs more than is strictly needed: zfs-dkms and zfs-fuse are different implementations of ZFS for Linux, and either should be enough to use ZFS on its own. The reason there are multiple implementations is due to how Linux does things. zfs-dkms uses a technology (unsurprisingly) called DKMS, while zfs-fuse uses (even less surprisingly) a technology called FUSE. FUSE makes it easier for developers to implement filesystems at the cost of a bit of performance. DKMS stands for Dynamic Kernel Module Support, and is a means by which you can install the source code for a module and let the Linux distro itself take care of compiling that source to match the running Linux kernel.
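Once installed, it’s worth checking that the module actually built and loaded against your running kernel. A quick sanity check might look like this (exact output varies by system):

```shell
# Check that DKMS built the zfs module for the running kernel
dkms status

# Load the module if it isn't loaded already
sudo modprobe zfs

# Confirm the kernel module is present
lsmod | grep zfs
```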

For Arch Linux you’ll need to build zfs-linux from the AUR (pacman alone can’t install AUR packages). Check the Arch wiki’s article on ZFS for more detailed instructions, but with an AUR helper like yay installed, this should suffice:

yay -S zfs-linux

Planning your drives

The first step to getting started with ZFS was to figure out how I wanted to use my drives. Most people who use ZFS for these purposes seem to go out and buy multiple big hard drives, and then use ZFS to mirror them. I just wanted more data redundancy on the drives I already had, so I decided to partition my drives.

Since I have one 2TB drive that I want backed up, I first partitioned my 4TB drive into two 2TB partitions using gparted. I then created an ext4 filesystem on the second of the new partitions, to temporarily hold the data.

Then I used blkid and lsblk to check my handiwork. These two tools print lists of all the “block devices” (read: hard disks) in my system and show different ways to refer to them in Linux:

$ blkid
/dev/sda1: UUID="7600-739F" TYPE="vfat" PARTUUID="ded30b23-f318-433c-bfb2-15738d42cc01"
/dev/sda2: LABEL="500gb-ssd-root" UUID="906bd064-2156-4a88-8d88-8940af7c5a34" TYPE="ext4" PARTLABEL="500gb-ssd-root" PARTUUID="cc6695ed-1a2b-4cb1-b302-37614cf07bf7"
/dev/sdc1: LABEL="zstore" UUID="5303013864921755800" UUID_SUB="17834655468516818280" TYPE="zfs_member" PARTUUID="072d0dd9-a1bf-4c67-b9b3-046f37c48846"
/dev/sdc2: LABEL="longterm" UUID="7765758551585446647" UUID_SUB="266677788785228698" TYPE="zfs_member" PARTLABEL="extra2tb" PARTUUID="1f9e7fd1-1da6-4dbd-9302-95f6ea62fff0"
/dev/sdb1: LABEL="longterm" UUID="7765758551585446647" UUID_SUB="89185545293388421" TYPE="zfs_member" PARTUUID="5626d9ea-01"
/dev/sde1: UUID="acd97a41-df27-4b69-924c-9290470b735d" TYPE="ext4" PARTLABEL="wd2tb" PARTUUID="6ca94069-5fc8-4466-bba2-e5b6237a19b7"

$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdb      8:16   0   1.8T  0 disk
└─sdb1   8:17   0   1.8T  0 part
sde      8:64   0   1.8T  0 disk
└─sde1   8:65   0   1.8T  0 part
sdc      8:32   0   3.7T  0 disk
├─sdc2   8:34   0   1.8T  0 part
└─sdc1   8:33   0   1.8T  0 part
sda      8:0    0   477G  0 disk
├─sda2   8:2    0 476.4G  0 part
└─sda1   8:1    0   512M  0 part

Explanation:

If you’re not familiar with how Linux handles hard disks: Linux refers to them as “block devices,” and provides access to physical hardware through a virtual filesystem it mounts at /dev. Hard disks will generally show up in the format /dev/sdX, where X is a letter from a–z that Linux assigns to the drive. Partitions on each disk are then assigned a number, so in the lsblk output above you can see that disk sdc has two partitions, which show up as sdc1 and sdc2.

The blkid command shows the traditional /dev/sdX labels, but also adds UUIDs, which you can think of as random IDs that will always refer to that particular disk. The reason for this is that if you were to unplug one of your drives and plug it into a different port, Linux may give it a different /dev/sdX name (e.g. the /dev/sdc drive might become /dev/sda), but it would keep the same UUID.
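You can see these stable names for yourself: Linux exposes them as symlinks under /dev/disk/. For example (the UUID below is the one from my 500gb-ssd-root partition in the blkid output above; substitute one of yours):

```shell
# Each symlink points at whichever /dev/sdX name the drive currently has
ls -l /dev/disk/by-uuid/

# blkid can also do the reverse lookup: UUID to device name
sudo blkid -U 906bd064-2156-4a88-8d88-8940af7c5a34
```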

I wanted to convert my 2TB drive to ZFS, but since my precious data all currently lives on it (/dev/sdb1 above), I decided to pull a swaparoo: first copy everything onto the second partition of my 4TB drive (/dev/sdc2 above), then let ZFS take over the original partition (/dev/sdb1) and copy the data back onto that drive.

The end result I’m looking for is to have a layout with two “pools” (zfs-speak for sets of drives, more on this later). One pool should consist of my original 2TB drive, replicated to one of the 2TB partitions on my 4TB drive. The extra 2TB partition available on the 4TB drive will act as a second pool, which gives me nice ZFS benefits like checksumming and the ability to take snapshots of the drive, as well as the option to add another 2TB drive/partition later and mirror the data.

If you’re already familiar with zpool, this is what the finished setup looks like:

$ sudo zpool status
pool: longterm
state: ONLINE
config:

       NAME        STATE     READ WRITE CKSUM
       longterm    ONLINE       0     0     0
         mirror-0  ONLINE       0     0     0
           sdc2    ONLINE       0     0     0
           sdb1    ONLINE       0     0     0

pool: zstore
state: ONLINE
config:

       NAME        STATE     READ WRITE CKSUM
       zstore      ONLINE       0     0     0
         sdc1      ONLINE       0     0     0

ZFS terminology and concepts: mirrors, stripes, pools, vdevs and parity

ZFS introduces a fair amount of new concepts and terminology which can take some getting used to. The first bit to understand is what ZFS actually does. ZFS usually works with pools of drives (hence the name of the zpool command), and allows you to do things like mirroring or striping the drives.

And what does it mean to mirror or stripe a drive, you ask? When two drives are mirrored they do everything in unison: any data written to one drive is also written to the other at the same time. This way, if one of your drives were to fail, your data would still be safe and sound on the other drive. And through a process ZFS calls “resilvering,” if you were to install a new hard drive to replace the failed one, ZFS would automatically take care of syncing all your data back onto it.

Striping is a different beast. Mirroring drives is great for redundancy, but has the obvious drawback that you only get to use half the disk space you have available. Sometimes the situation calls for the opposite trade-off: if you bought two 2TB drives and you wanted to be able to use all 4TB of available storage, striping would let you do that. In striped setups ZFS writes “stripes” of data to each drive. This means that if you write a single file ZFS may actually store part of the file on one drive and part of the file on another.

This has many advantages: it speeds up your reads and writes by making them concurrent. Since it’s storing pieces of one file on each drive, both drives can be writing at the same time, so your write speed could theoretically double. Read speed also gets a boost since you can also read from both drives at the same time. The downside to all this speed and space is that your data is less safe. Since your data is split between two drives, if one of the hard drives dies you will probably lose all your data – no one file will be complete because while your good drive might have half the file on it, the other half is gone with your dead hard disk. So in effect you’re trading close to double the speed and space for close to double the risk of losing all your data. Depending what you’re doing that might be a good choice to make, but I wouldn’t put any data I didn’t want to lose into a striped setup.
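In zpool terms, striping is what you get by default: list the devices without a grouping keyword like mirror, and each becomes its own vdev with data striped across them. A sketch (the device names here are placeholders):

```shell
# Two single-disk vdevs: ZFS stripes data across both, so two 2TB
# drives yield ~4TB of space, but losing either drive loses the pool
sudo zpool create fastpool /dev/sdX /dev/sdY
```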

There’s a third type, a sort of compromise solution, which is to use parity. This type of setup is frequently referred to as RAIDZ (or RAIDZ2 or RAIDZ3) and is somewhere between a full-on striped setup and a mirrored setup. This approach dedicates a disk’s worth of space to parity data, which acts as a kind of semi-backup. It’s backed by a lot of complicated math that I don’t pretend to understand, but the take-home message is that it provides a way to restore your data if a drive fails. So if you have three 2TB drives, you can choose to stripe them but dedicate one disk’s worth of space to parity. In this setup you’d have 4TB of available storage, but if a drive were to fail you wouldn’t lose any data (although performance would probably be pretty horrible until you replaced the failed disk). Think of it as a kind of half backup. You can tweak the ratio as well: if you dedicate more space to parity you can survive more failing drives without losing data. This is what the 2 and 3 in RAIDZ2 and RAIDZ3 mean.
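Creating a RAIDZ pool follows the same pattern, using the raidz keyword (or raidz2/raidz3 for double or triple parity). Another sketch with placeholder device names:

```shell
# Three 2TB drives, one drive's worth of parity: ~4TB usable,
# and the pool survives any single drive failure
sudo zpool create tank raidz /dev/sdX /dev/sdY /dev/sdZ

# RAIDZ2: two drives' worth of parity, survives two simultaneous failures
sudo zpool create tank2 raidz2 /dev/sdW /dev/sdX /dev/sdY /dev/sdZ
```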

More info on the different RAID levels you can use with ZFS here and here.

Now that we’ve gone over the high-level concepts of drive arrays and RAID, we can dive into the more ZFS-specific aspects. The first item to go over is the concept of a vdev. A vdev is a “virtual device,” and when zpool pools drives it pools collections of these virtual devices using one of the RAID approaches (striped or mirrored) we discussed above. However what makes vdevs useful is that you can put more than one physical drive (or partition) into a single vdev.

While zpool creates striped or mirrored arrays over pools of vdevs, each vdev can itself be a striped or mirrored set of drives. This is part of what makes ZFS so flexible. For example, you could get the speed benefits of a striped setup with the redundancy benefits of a mirrored setup by creating two mirror vdevs, each of which is configured to mirror data across two physical drives. You could then add both vdevs into a striped pool to get the fast reads and writes that striping allows, without running the risk of losing your data if a single drive were to fail (this is actually a fairly popular setup, and is known as RAID10 outside of ZFS-land).

This can get quite complicated quite quickly, but this article (backup link here since original was down at the time of writing) does a nice job walking through the various permutations of vdevs and zpools that are possible.

Experimenting

ZFS can also be used on loopback devices, which is a nice way to play with ZFS without having to invest in lots of hard drives. Let’s run through a few of the possibilities with some loopback devices so you can get a feeling for how ZFS works.

When ZFS uses files on another filesystem instead of accessing devices directly it requires that the files be allocated first. We can do that with a shell for loop by using the dd command to copy 1GB of zeros into each file (you should make sure you have at least 4GB of available disk space before running this command):

for i in 1 2 3 4; do dd if=/dev/zero of=zfs$i bs=1024M count=1; done
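As a side note, if you’d rather not wait for dd to write 4GB of zeros, truncate can create sparse files of the right size that consume almost no real disk space until written to. In my experience ZFS accepts these as file vdevs too, though this is an alternative I’m offering rather than something the article’s setup requires:

```shell
# Sparse alternative: each file reports 1GB but initially consumes ~nothing
for i in 1 2 3 4; do truncate -s 1G zfs$i; done
```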

Now that we have our empty files we can put them into a ZFS pool:

sudo zpool create testpool mirror $PWD/zfs1 $PWD/zfs2 mirror $PWD/zfs3 $PWD/zfs4

NOTE: The $PWD above is important: ZFS requires absolute paths when using files

You should now have a new zpool mounted at /testpool. Check on it with zpool status:

$ zpool status
  pool: testpool
 state: ONLINE
  scan: none requested
config:

        NAME                STATE     READ WRITE CKSUM
        testpool            ONLINE       0     0     0
          mirror-0          ONLINE       0     0     0
            /home/nik/zfs1  ONLINE       0     0     0
            /home/nik/zfs2  ONLINE       0     0     0
          mirror-1          ONLINE       0     0     0
            /home/nik/zfs3  ONLINE       0     0     0
            /home/nik/zfs4  ONLINE       0     0     0

errors: No known data errors

Your new ZFS filesystem is now live: you can cd to /testpool and start copying files into it.
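When you’re done experimenting, you can tear everything down and reclaim the space. Note that zpool destroy doesn’t ask for confirmation, so double-check the pool name first:

```shell
# Destroy the test pool, then remove the backing files
# (run the rm from the directory where you created them)
sudo zpool destroy testpool
rm zfs1 zfs2 zfs3 zfs4
```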

Next steps

We’ve gone over the basics of ZFS. In the next post we’ll move on to some of the more powerful and advanced features ZFS offers, like compression, snapshots, the zfs send and zfs receive commands, and the secret .zfs dir.

