
Use raw ZFS volume for VirtualBox guest

Michal Nowak ・8 min read

If you’ve ever run virtual guests on platforms like KVM, Xen, Hyper-V, VMware, or VirtualBox, you probably think of a disk attached to the guest as an image file in a special format (qcow2, VHDX, VMDK, VDI) that lies somewhere on the host’s file system. Less well known is that most of these platforms support other storage options, namely network file systems (NFS, SMB), logical volumes (partitions), whole disks, and, most notably here, ZFS volumes.

A file-based arrangement certainly has value for some users. In my experience it is the most convenient arrangement for guests you want to create and destroy rapidly, or when you don’t care about anything other than plain storage.

But it has drawbacks too. Here’s one of them: it relies on virtualization-specific designs that do not translate well from one platform to another. Having tried to convert an image from qcow2 to VMDK or VHDX, and to create an instant JeOS-like VM image for VMware with open source tools, I found that in the former case one may end up with a fixed-size image and, in the latter case, with an image that boots only from a slow virtual IDE controller. Some of those designs may have fatal flaws.

Even if you think you will stick to one platform and are not interested in platform interoperability, you may want to use your operating system's native tools with your guests. In OpenIndiana, an illumos distribution, system components like IPS (package manager), Zones (OS-level virtualization, or "containers"), and KVM (HW virtualization) are well integrated with ZFS, the advanced file system and volume manager, and can leverage its features like snapshots and encryption.

Thanks to this level of integration, on OpenIndiana we can run KVM guests in a Zone and thus keep them well isolated from other tenants on the host, control their resources, and more.

Interestingly, VirtualBox can use a raw device as storage for a guest. For our purposes a ZFS volume is an ideal device. First, I will show you how to create a VirtualBox guest running off a ZFS volume; then we will use ZFS snapshots to save the state of the guest; later we will send the guest to another ZFS pool; and finally we will run the guest from an encrypted ZFS volume.

If you want to try this on your systems, here are the prerequisites:

  • Host with a ZFS file system. I use OpenIndiana 2019.04 updated to the latest packages to get ZFS encryption; other Unices may work as well, but I haven't tried them. I have two ZFS pools: rpool is the default one, and tank is the one I use for some of the examples.
  • VirtualBox with the Oracle VM VirtualBox Extension Pack installed and working (I used version 6.0.8).
  • Another system with a graphical interface (the Workstation) where you can display the guest's graphical console, unless you display it on the Host.

Legend: $ introduces user command, # introduces root command. Guest/Host/Workstation denotes environment where the command is being executed.

I advise caution: VirtualBox guests write directly to raw devices, and using the wrong volume may result in data loss.

Create guest with ZFS volume

First, as root, create the 10 GB volume rpool/vboxzones/myvbox:

Host # zfs create rpool/vboxzones
Host # zfs create -V 10G rpool/vboxzones/myvbox
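Since the guest will later write straight to the device, it may be worth confirming that a dataset really is a volume before handing it to VirtualBox. The following is only a sketch: zvol_rdsk and is_zvol are hypothetical helper names, not part of ZFS or VirtualBox, and the /dev/zvol/rdsk/ prefix is the illumos raw device path used throughout this article.

```shell
# Hypothetical helpers, not part of ZFS or VirtualBox.

# Print the raw (character) device path of a ZFS volume on illumos.
zvol_rdsk() {
    printf '/dev/zvol/rdsk/%s\n' "$1"
}

# Succeed only if the named dataset exists and its type is "volume".
is_zvol() {
    [ "$(zfs get -H -o value type "$1" 2>/dev/null)" = "volume" ]
}

# Usage sketch:
#   is_zvol rpool/vboxzones/myvbox && zvol_rdsk rpool/vboxzones/myvbox
```

A file system such as rpool/vboxzones fails the check, so a mistyped dataset name is caught before any raw disk descriptor is created.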

Then, again as root, create a VirtualBox guest with a SATA controller. Port 0 holds the installation DVD and port 1 is reserved for the disk:

Host # VBoxManage createvm --name myvbox --register
Host # VBoxManage modifyvm myvbox --boot1 disk --boot2 dvd --cpus 1 \
--memory 2048 --vrde on --vram 4 --usb on --nic1 nat --nictype1 82540EM \
--cableconnected1 on --ioapic on --apic on --x2apic off --acpi on \
--biosapic=apic
Host # VBoxManage storagectl myvbox --name SATA --add sata \
--controller intelahci --bootable on
Host # VBoxManage storageattach myvbox --storagectl SATA --port 0 \
--device 0 --type dvddrive --medium OI-hipster-text-20190511.iso
Host # VBoxManage internalcommands createrawvmdk \
-filename /rpool/vboxzones/myvbox/myvbox.vmdk \
-rawdisk /dev/zvol/rdsk/rpool/vboxzones/myvbox

The last command creates a pass-through VMDK image which gets attached to the guest, but in reality, nothing gets written to this file and /dev/zvol/rdsk/rpool/vboxzones/myvbox volume is used directly instead. Attach the image to the guest:

Host # VBoxManage storageattach myvbox --storagectl SATA --port 1 \
--device 0 --type hdd --medium /rpool/vboxzones/myvbox/myvbox.vmdk \
--mtype writethrough

Start the guest and connect to it with an RDP client (rdesktop or FreeRDP on Unix):

Host # VBoxManage startvm myvbox --type headless

Workstation $ xfreerdp /v:$Host_IP

If you see the guest’s bootloader, you can proceed and install the guest from the ISO image. In the "Disks" screen of the installer, you’ll see your disk:

[Screenshot: the installer's "Disks" screen showing the guest disk]

Create a snapshot of the disk

Record the approximate time of the snapshot by creating a file with a timestamp in it:

Guest $ date > date

Now with this arrangement, we can create a snapshot of the volume:

Host # zfs snapshot rpool/vboxzones/myvbox@first
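Instead of a fixed name like "first", the snapshot name itself can carry the time it was taken. A minimal sketch, where snapshot_now is a hypothetical helper and the timestamp format is an arbitrary choice:

```shell
# Hypothetical helper: snapshot a dataset under a UTC timestamp and
# print the full snapshot name so it can be reused for a later rollback.
snapshot_now() {
    snap="$1@$(date -u +%Y%m%dT%H%M%SZ)"
    zfs snapshot "$snap" && printf '%s\n' "$snap"
}

# Usage sketch:
#   snap=$(snapshot_now rpool/vboxzones/myvbox)
```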

Wait some time and rewrite the timestamp:

Guest $ date > date

Power off the guest and roll back to the first snapshot. The -r option lets ZFS roll back to a snapshot that is not the most recent one by destroying any snapshots taken after it:

Guest # poweroff
Host # zfs rollback -r rpool/vboxzones/myvbox@first
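Rolling back while the guest still has the volume open would change the disk underneath it, so it can help to wait until VirtualBox reports the VM as powered off. A sketch, assuming that VBoxManage showvminfo --machinereadable prints a VMState="..." line; vm_state and wait_poweroff are hypothetical names:

```shell
# Print the VM state (e.g. "running" or "poweroff") as reported by
# VirtualBox in machine-readable form.
vm_state() {
    VBoxManage showvminfo "$1" --machinereadable |
        sed -n 's/^VMState="\(.*\)"$/\1/p'
}

# Poll once per second until the VM is powered off, giving up after
# $2 seconds (default 60).
wait_poweroff() {
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        [ "$(vm_state "$1")" = "poweroff" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Usage sketch:
#   wait_poweroff myvbox && zfs rollback -r rpool/vboxzones/myvbox@first
```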

Start the guest and check that the timestamp is set to the one from the first snapshot:

Guest $ cat date

Hooray! We demonstrated that VirtualBox guest works with ZFS snapshots.

Send guest disk to another ZFS pool

Now let’s see if we can send the guest disk to another ZFS pool and run it from there.

Guest # poweroff
Host # zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool   118G  40.2G  77.8G        -         -    18%    34%  1.00x  ONLINE  -
tank    232G  22.2G   210G        -         -     1%     9%  1.00x  ONLINE  -
Host # zfs send rpool/vboxzones/myvbox | zfs receive -v tank/vboxzones/myvbox
receiving full stream of rpool/vboxzones/myvbox@--head-- into tank/vboxzones/myvbox@--head--
received 9.34GB stream in 153 seconds (62.5MB/sec)

Now, “rewire” the pass-through disk so it points to the new location:

Host # VBoxManage storageattach myvbox --storagectl SATA --port 1 \
--device 0 --type hdd --medium emptydrive
Host # VBoxManage closemedium disk /rpool/vboxzones/myvbox/myvbox.vmdk \
--delete
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Host # VBoxManage internalcommands createrawvmdk \
-filename /rpool/vboxzones/myvbox/myvbox.vmdk \
-rawdisk /dev/zvol/rdsk/tank/vboxzones/myvbox
RAW host disk access VMDK file /rpool/vboxzones/myvbox/myvbox.vmdk created successfully.
Host # VBoxManage storageattach myvbox --storagectl SATA --port 1 \
--device 0 --type hdd --medium /rpool/vboxzones/myvbox/myvbox.vmdk \
--mtype writethrough
Host # VBoxManage startvm myvbox --type headless
Waiting for VM "myvbox" to power on…
VM "myvbox" has been successfully started.
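The four "rewire" steps (detach, close and delete the old descriptor, create a new one, attach) are the same whichever volume the disk moves to, so they can be wrapped in one function. This is only a sketch: rewire_disk is a hypothetical name, and the controller name, port, and options are simply the ones used in this article.

```shell
# Hypothetical helper wrapping the four steps above.
# $1 = VM name, $2 = path of the pass-through VMDK, $3 = ZFS volume name
rewire_disk() {
    vm=$1 vmdk=$2 dataset=$3
    # Detach the old pass-through disk from SATA port 1.
    VBoxManage storageattach "$vm" --storagectl SATA --port 1 \
        --device 0 --type hdd --medium emptydrive &&
    # Unregister the old descriptor and delete the file.
    VBoxManage closemedium disk "$vmdk" --delete &&
    # Create a new descriptor pointing at the new volume.
    VBoxManage internalcommands createrawvmdk \
        -filename "$vmdk" -rawdisk "/dev/zvol/rdsk/$dataset" &&
    # Attach the new descriptor to the same port.
    VBoxManage storageattach "$vm" --storagectl SATA --port 1 \
        --device 0 --type hdd --medium "$vmdk" --mtype writethrough
}

# Usage sketch:
#   rewire_disk myvbox /rpool/vboxzones/myvbox/myvbox.vmdk tank/vboxzones/myvbox
```

Chaining the steps with && stops the sequence at the first failure, so a descriptor is never attached if creating it went wrong.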

On boot the guest will notice that the disk’s device ID does not match the recorded one, but it corrects the record to what we actually have and won't warn on the next boot:

[Screenshot: boot-time message about the changed disk device ID]

So, we proved that the guest’s disk can be sent to another ZFS pool and successfully started from there.

Start guest from an encrypted volume

Now, let’s see if the guest starts from an encrypted ZFS volume.

First, create an encrypted file system tank/vboxzones_encrypted with a passphrase entered on the prompt:

Host # zfs create -o encryption=on -o keyformat=passphrase \
-o keylocation=prompt tank/vboxzones_encrypted
Enter passphrase:
Re-enter passphrase:

Then send the stream to the new encrypted dataset:

Host # zfs send rpool/vboxzones/myvbox | zfs receive -v tank/vboxzones_encrypted/myvbox
receiving full stream of rpool/vboxzones/myvbox@--head-- into tank/vboxzones_encrypted/myvbox@--head--
received 9.34GB stream in 166 seconds (57.6MB/sec)

And recreate the guest pass-through disk to point to the encrypted dataset:

Host # VBoxManage storageattach myvbox --storagectl SATA --port 1 \
--device 0 --type hdd --medium emptydrive
Host # VBoxManage closemedium disk /rpool/vboxzones/myvbox/myvbox.vmdk \
--delete
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Host # rm /rpool/vboxzones/myvbox/myvbox.vmdk
Host # VBoxManage internalcommands createrawvmdk \
-filename /rpool/vboxzones/myvbox/myvbox.vmdk \
-rawdisk /dev/zvol/rdsk/tank/vboxzones_encrypted/myvbox
RAW host disk access VMDK file /rpool/vboxzones/myvbox/myvbox.vmdk created successfully.
Host # VBoxManage storageattach myvbox --storagectl SATA --port 1 \
--device 0 --type hdd --medium /rpool/vboxzones/myvbox/myvbox.vmdk \
--mtype writethrough
Host # VBoxManage startvm myvbox --type headless
Waiting for VM "myvbox" to power on…
VM "myvbox" has been successfully started.

To prove that we really run from an encrypted volume, let’s unload the encryption key from memory manually (restarting the Host would have the same effect) and start the guest again:

Host # zfs unmount /tank/vboxzones_encrypted
Host # zfs unload-key tank/vboxzones_encrypted
Host # VBoxManage startvm myvbox --type headless
Waiting for VM "myvbox" to power on…
VBoxManage: error: Could not open the medium '/rpool/vboxzones/myvbox/myvbox.vmdk'.
VBoxManage: error: VD: error VERR_ACCESS_DENIED opening image file '/rpool/vboxzones/myvbox/myvbox.vmdk' (VERR_ACCESS_DENIED)
VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component MediumWrap, interface IMedium

As expected, it failed because the encrypted volume is unreadable for VirtualBox. Let’s load the key again and start the guest:

Host # zfs load-key tank/vboxzones_encrypted
Enter passphrase for 'tank/vboxzones_encrypted':
Host # VBoxManage startvm myvbox --type headless
Waiting for VM "myvbox" to power on…
VM "myvbox" has been successfully started.

After decryption the guest starts correctly.
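The failure above can be avoided by checking the key before starting the guest. A minimal sketch, relying on the ZFS keystatus property (which reads "available" once the key is loaded); start_encrypted_vm is a hypothetical helper name:

```shell
# Hypothetical helper: load the encryption key if needed, then start
# the guest headless.
# $1 = VM name, $2 = encrypted dataset that holds the guest volume
start_encrypted_vm() {
    vm=$1 dataset=$2
    if [ "$(zfs get -H -o value keystatus "$dataset")" != "available" ]; then
        # Prompts for the passphrase when keylocation=prompt.
        zfs load-key "$dataset" || return 1
    fi
    VBoxManage startvm "$vm" --type headless
}

# Usage sketch:
#   start_encrypted_vm myvbox tank/vboxzones_encrypted
```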


In these three examples we

  1. created a snapshot and then rolled back to it,
  2. sent the guest disk as a ZFS stream to another ZFS pool, and
  3. started the guest off an encrypted dataset.

Thus we demonstrated that we can leverage OS-level file system features in VirtualBox guests thanks to the little-known VirtualBox ability to use plain storage devices.

Concerns and Future directions

Unfortunately, not everything works ideally. Whatever I try, disk IO on the raw ZFS volume is significantly slower than on the VDI disk (or in the Global Zone, which I treat here mostly as a baseline for comparison).

I have run “seqwr” and “seqrd” test modes from the fileio test of the sysbench 0.4.1 benchmark on an old Samsung 840 PRO SSD (MLC) disk which seems to show some wear:

Operation / Storage type       Global Zone (host)   VDI disk (guest)   ZFS dataset (guest)
Sequential Write (2Gb file)    36.291 Mb/sec        54.302 Mb/sec      11.847 Mb/sec
Sequential Read (2Gb file)     195.63 Mb/sec        565.63 Mb/sec      155.21 Mb/sec

There are two things of concern:

  1. the ZFS volume is three times slower on sequential write, and ~20 % slower on sequential read, than the Global Zone;
  2. the VDI disk performance can actually be higher (sic) than in the Global Zone.

The first observation may mean that an unneeded fsync() occurs on some layer, and the second may mean that something fishy is happening with synchronizing files to the underlying file system on the hypervisor side. Perhaps fsync() is being ignored to some degree? (In any case these numbers are to be taken with a grain of salt and are good only to show the contrast between various storage types and their potential limitations.)
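The ratios quoted above can be re-derived from the table with a few lines of shell. This is just the arithmetic spelled out; ratio is a hypothetical helper name.

```shell
# Print the quotient of two throughput figures with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 36.291 11.847   # write, host vs. ZFS volume: 3.06 (about 3x slower)
ratio 155.21 195.63   # read, ZFS volume vs. host:  0.79 (about 20 % slower)
ratio 565.63 195.63   # read, VDI disk vs. host:    2.89 (faster than the host)
```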

To summarize, it turns out that the level of integration we see with KVM on various illumos distributions can be matched with VirtualBox, as we seem to have the necessary building blocks. But it remains to be investigated what the culprit is on the IO performance side, and how Zones can be added to the mix so that additional tenant isolation is achieved (especially because VirtualBox runs as root due to the use of ZFS volumes).

Acknowledgement: Johannes Schlüter’s ZFS and VirtualBox blog post inspired me to write this article.

Posted by:

Michal Nowak (@mnohime)

Engineer. Working on OpenIndiana (an illumos distribution).

Discussion


Things to explore as to why it could be slower on zfs:

  • Depending on how vbox is doing the writes, this could be sync writes for every write that is happening in the guest. It would be useful to measure the number of iops seen by the host in each of your tests.
  • If sync writes are happening such that they don't align with the dataset's block size (recordsize for filesystem, volblocksize for volume) you could be seeing write inflation. In the host, look for a lot of reads during a write workload and look to see if the amount of data written by the host is significantly more than that written to the guest.
  • If you are doing writes that are at least 32k (zfs_immediate_write_sz) but vbox is chopping them up into smaller writes, the data may be written to the zil (zil exists even if log devices don't) and again to its permanent home.
  • I believe that the Samsung 840 Pro advertises a physical sector size of 512b, which is almost certainly not true. It is likely 4k or 8k. The zpool will have been created with ashift=9, and all allocations will fall on a 512 byte boundary. If that 512 byte boundary just happens to also be a boundary that is the real sector size (4k or 8k), you will avoid the drive having to do a read-modify-write cycle when writing multiples of the drive's actual physical sector size. If the real sector size is 4k, a 512 byte allocation only has a 12.5% chance of landing on the right boundary. Repeat your benchmarks many times, looking for wild variations.
  • ZFS is doing more work than it would be without ZFS. The protections offered by copy-on-write, checksumming, etc. are quite cheap compared to the cost of operations on hard disks. The cost becomes more noticeable on faster disks. Normally I think of these costs being noticeable on NVMe and not so much on an earlier generation SATA SSD.
  • If the tests are with encryption enabled, I'd be interested to see what it's like without encryption.
 

Thanks Mike! Very much appreciate your comment. I will go thru those things and see what difference they make. I just started with ashift, which was indeed set to 9 (512B) but should be set to 13 (8K).

 

Minor typo:

"control it's resources, and more." ->
"control its resources, and more."

 

Thanks, Jon. Typo fixed.

 

a brilliant use of virtualisation ! using it to experiment with new file systems WITHOUT messing up your host or native env