Asynchronous I/O: A Practical Guide for Optimizing HPC Workflows with xiRAID in Lustre Environments

Sergey Platonov

In today's AI world, powerful storage is key when many GPUs work together on large datasets. These workflows involve an initial high-speed data load, continued data loading during training, and periodic checkpoints, all at tens of gigabytes per second. Storage must serve many GPUs accessing massive datasets (hundreds of terabytes to petabytes) while delivering high throughput and efficient handling of small operations. This versatility is critical to prevent stalls in computational clusters: our goal is to minimize downtime and maximize efficiency.

Research groups often use limited infrastructures or cloud services. The growth of AI cloud services demands a data storage system that is fast, high-capacity, and integrates seamlessly into cloud infrastructure. Ideal solutions should be software-defined, deployable on any hardware, and easy to integrate.

At Xinnor, we provide high-performance storage solutions for diverse clients. Recently, we've observed a growing demand for storage tailored to HPC and AI, especially for shared file systems in smaller setups (with 1-2 DGXs or HGXs) and among cloud providers. These setups require several key features.

For small standalone installations, we need solutions built from only one or two storage controllers. These solutions should deliver performance matching a 400-gigabit network and have the potential to scale to 800 gigabits in the future.

For cloud environments, we need a solution that delivers approximately 20 gigabytes per second per client virtual machine. It must also provide consistent read and write access simultaneously from multiple virtual machines and support essential requirements such as multi-tenancy and Quality of Service (QoS).

Our developments enable us to achieve the necessary performance levels while consuming minimal resources. Recent tests with KIOXIA showcased xiRAID's capabilities in both RAID5 and RAID6 configurations using 24 PCIe Gen5 drives. The results were impressive, achieving near-theoretical performance with minimal CPU load.

IMG1
However, to meet all the requirements for our solution, we certainly need a parallel or clustered file system. For this, we have chosen Lustre.

This blog highlights how xiRAID, combined with our Lustre tuning expertise, delivers outstanding results in Lustre environments:

  1. Why We Rely on Lustre
  2. Our Objectives
  3. Tested Architectures
  4. Test Stand Configuration
  5. Block Device Performance
  6. IOR Single Client Test Results
  7. Testing Synchronous and Asynchronous I/O Operations
  8. Lustre vs. NFSoRDMA Testing
  9. Testing Lustre in the Cloud Environment
  10. Final Thoughts
  11. Appendix

Why We Rely on Lustre

But first, why Lustre? At Xinnor, we rely on Lustre for our high-performance storage solutions because it offers several key advantages:

  • Shared storage for parallel workflows. Lustre enables the creation of shared storage, allowing multiple nodes to access the same file system concurrently. This is crucial for modern AI and HPC workloads, where numerous processing units need efficient, simultaneous access to data.

  • Scalability for growing demands. Lustre's architecture supports linear scalability. We can seamlessly add new storage nodes to the cluster without performance degradation. This ensures our storage solutions can grow alongside our clients' increasing workloads.

  • Shared-disk architecture. Lustre's shared-disk architecture perfectly aligns with our approach of providing high-performance block storage solutions.

  • Flexibility for Diverse Deployments. As a software solution, Lustre is deployable on any hardware and within virtualized infrastructures. This flexibility allows us to meet various performance targets across different environments.

Beyond these core benefits, Lustre boasts a proven track record in handling HPC tasks with high streaming performance. However, our goal extends beyond optimizing for large block I/O. We also aim to extract maximum performance from small block I/O operations, critical for many AI applications.

Our Objectives

At Xinnor, we have several installations with a combined capacity exceeding 100 petabytes. Each Lustre project is unique, and to ensure the highest performance levels, we fine-tune each solution through a multi-stage process:

  1. Hardware and software configuration: we configure drives, storage services, and OS settings. This includes software installation, testing, and RAID configuration adjustments.
  2. Lustre OSS and MDS testing: we test Lustre using I/O utilities such as obdfilter-survey and MDtest.
  3. Client-perspective testing: we use standard HPC I/O tools, IOR and FIO, for testing from the client's perspective. Running FIO with asynchronous engines helps us achieve optimal efficiency at each level of the storage stack.

Our performance objectives include:

  • Achieving tens of GBps throughput from a few Lustre clients using a couple of Object Storage Servers (OSS).
  • Attaining several million IOPS in the same configuration.
  • Maintaining a simple hardware and software configuration for easy deployment.
  • Developing an easily reproducible test approach for consistency.

Benchmarking with IOR presents challenges. Buffered I/O can be CPU-intensive, while Direct I/O may create uneven storage loads. Performance scaling with additional I/O threads can be limited on HDDs and read-intensive SSDs. Increasing I/O size isn't always effective either. Therefore, one of our objectives is to demonstrate how Asynchronous I/O (AIO) helps us achieve optimal performance.

We conducted a comprehensive analysis to identify the most effective configurations for HPC workflows. This included comparisons of Lustre 2.15.4 over ldiskfs, Lustre 2.15.4 over ZFS and NFSoRDMA (v3 and v4.2).

These comparisons will be further explored to showcase the best performing configurations. We will demonstrate the performance benefits of asynchronous engines compared to classic Buffered I/O and Direct I/O approaches. Additionally, we will benchmark ldiskfs vs. ZFS to determine the better backend file system for different scenarios. Finally, we will compare our solutions against NFS, a competitor for smaller research groups and cloud-deployed systems.

Through these comparisons and configurations, we aim to showcase the superior performance and versatility of our Lustre-based storage solutions in various HPC and AI environments.

Tested Architectures

Our testing involved two primary architectures: a "Cluster-in-the-box" solution and a virtualized solution, both designed to assess the performance of Lustre in various scenarios.

img2

Cluster-in-the-box architecture

The first architecture we are testing is the "Cluster-in-the-box" solution. This high-performance system is designed for small installations, offering a fully integrated and fault-tolerant setup within a single enclosure. Despite its compact form factor, it supports multiple clients, allowing them to read and write data consistently. Furthermore, leveraging Lustre's capabilities, it can easily scale if the need arises. This solution is ideal for small research groups, combining the convenience of a plug-and-play setup with the high performance typically associated with larger Lustre deployments. Unlike traditional Lustre clusters, which require separate OST/OSS/MDS/MGS components that need to be interconnected, our "Cluster in the Box" offers a streamlined, all-in-one package that significantly simplifies deployment and management.

Img3

Virtualized solution architecture

The second architecture is a virtualized solution, suitable for deployment in cloud environments, whether private or public. This approach caters to the growing demand for flexible, cloud-based storage solutions. While most clients use traditional bare-metal distributed systems, our virtualized architecture stands out by providing a robust and scalable solution within a virtual environment. This setup not only supports the performance needs of HPC and AI workloads but also ensures seamless integration into existing cloud infrastructures, offering a modern alternative to conventional bare-metal installations.

Test Stand Configuration

Hardware configuration:

  • CPU: 1x AMD EPYC 7702P (64 cores) per node
  • Memory: 256 GB RAM per node
  • Networking: 1x ConnectX-6 (MT28908) per node
  • Drives: 24x KIOXIA CM6-R 3.84TB per node, each configured with a 1.6TB namespace
  • Clients: same hardware, running Rocky Linux 9

Software configuration:

  • Rocky Linux 8 with Lustre 2.15.4.
  • RAID: 4x RAID 6, 10 drives each (8d+2p), strip size (ss) 64k, for OSS
  • 2x RAID1 for MGS and MDS

Block Device Performance

To start, we created a storage array consisting of four RAID6 volumes for data, with two arrays on each node, and two RAID1 volumes for the MGS/MDS. Our initial focus was to test the block device performance of these RAID6 arrays to establish a baseline for the potential performance of the file system within this compact unit. The results were impressive, reaching millions of IOPS for 4K random reads and writes and up to 93.4 GBps for sequential reads with multiple jobs.

FIO configuration 1:

[global]
# run once with rw=randread and once with rw=randwrite
rw=randread/randwrite
bs=4k
direct=1
group_reporting
random_generator=lfsr
norandommap
time_based=1
runtime=60
iodepth=128
ioengine=libaio
[file1]
filename=/dev/xi_data1
[file2]
filename=/dev/xi_data1

IMG4

FIO configuration 2:

[global]
bs=1024k
# run once with rw=read and once with rw=write
rw=read/write
direct=1
group_reporting
time_based=1
runtime=60
iodepth=32
ioengine=libaio
numjobs=1
offset_increment=3%
[file1]
filename=/dev/xi_data1
[file2]
filename=/dev/xi_data1

IMG5

When running multiple jobs, which is naturally our target workload, we observe that for random reads, we approach 12.5 million IOPS, and for random writes, we reach 2 million IOPS. Under streaming workloads, we can achieve up to 93 gigabytes per second for reads and 67 gigabytes per second for writes.
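For the multi-job runs we assume the same job file was used with numjobs raised; a minimal sketch of the sequential read variant (the value of 8 is illustrative, not the exact thread count used):

[global]
bs=1024k
rw=read
direct=1
group_reporting
time_based=1
runtime=60
iodepth=32
ioengine=libaio
# raised from 1 for the multi-job runs
numjobs=8
# keeps each job on its own region of the device
offset_increment=3%
[file1]
filename=/dev/xi_data1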

IOR Single Client Test Results

Next, we deployed the file system with default parameters and conducted the following round of tests. We conducted an IOR test with a single client to assess potential performance. For this test, the client was connected via a 200 Gbit port. The theoretical maximum throughput for streaming read/write operations is approximately 22-24 GBps, and for random operations, around 4.2M IOPS, which represents the maximum capacity of the 200 Gbit connection.
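The exact IOR invocations are not listed here, so the following is only a rough sketch of how such runs are typically launched; the process count, block and transfer sizes, and the /mnt/lustre path are illustrative, and --posix.odirect enables Direct I/O in recent IOR releases:

# large sequential I/O (Direct I/O, file per process)
mpirun -np 16 ior -w -r -e -F -t 64m -b 16g --posix.odirect -o /mnt/lustre/ior_seq
# small random I/O (Direct I/O, random offsets)
mpirun -np 16 ior -w -r -z -F -t 4k -b 1g --posix.odirect -o /mnt/lustre/ior_rand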

Our IOR tests provided valuable insights into the performance differences between Direct I/O (DIO) and Buffered I/O in various scenarios. Here's a detailed breakdown of the results.

Direct I/O (DIO) Performance

Large Sequential I/O Operations:

  • 64M Write Operations: 13053 MiB/s.
  • 64M Read Operations: 12288 MiB/s.

Small Random I/O Operations:

  • 4k Write Operations: 6542 IOPS.
  • 4k Read Operations: 6742 IOPS.

These results indicate that DIO provides stable and consistent performance for both large sequential and small random I/O operations. This stability is essential for applications requiring predictable I/O behavior, making DIO a reliable choice for demanding environments.

However, it is worth noting that the performance for small blocks is extremely low.

The load on the CPU is noticeable but not critical.

IMG5

Buffered I/O Performance

Large Sequential I/O Operations:

  • 64M Write Operations: 3874 MiB/s.
  • 64M Read Operations: 12757 MiB/s.

Small Random I/O Operations:

  • 4k Write Operations: 7359 IOPS.
  • 4k Read Operations: 556629 IOPS.

img6

One notable observation was the variability in buffered I/O write results, ranging from 2,000 to 28,000 IOPS. Additionally, CPU load fluctuated significantly during these operations, spanning from 6% to 100%. This variability and high CPU load highlight the challenges of using buffered I/O in environments requiring consistent performance.

Conclusions from Single Client Tests

  • With the existing approach, we can meet the required performance for large sequential I/Os, achieving a single-threaded peak of approximately 13 GBps (roughly half of the potential 24 GBps).
  • DIO proves to be more stable compared to buffered I/O in these scenarios, making it a preferred choice for large sequential workloads.
  • The performance for small random I/Os remains far from the potential performance of the block device.

Testing Synchronous (SYNC) and Asynchronous I/O (AIO) operations

We often use the FIO utility because it offers extensive capabilities for shaping the load and selecting different I/O engines for data access. This allows us to evaluate the system's behavior under various types of workloads.

We conducted tests with various parameters to understand their impact on performance:

4k random reads and writes:

  • Tested with a fixed numjobs=1 and variable iodepth.
  • Also tested with numjobs=32 and variable iodepth.
  • Tested with a fixed iodepth=1 and variable numjobs.

1M sequential reads and writes:

  • Conducted with fixed numjobs=1 and variable iodepth.
  • Also explored numjobs=32 and variable iodepth.

I/O engines used: libaio, io_uring, and sync. We also varied the Lustre client and Lustre OSS settings.
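As an illustration, here is a minimal sketch of one point in this test matrix: a 4k random read run from a Lustre client with an asynchronous engine. The /mnt/lustre mount point, file size, and queue depth are placeholders rather than the exact values used; direct=1 is set because buffered submissions through libaio/io_uring are generally not truly asynchronous:

fio --name=lustre_randread_4k --directory=/mnt/lustre --size=32G \
    --rw=randread --bs=4k --direct=1 --ioengine=io_uring --iodepth=64 \
    --numjobs=1 --time_based --runtime=60 --group_reporting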

Configuration descriptions for testing FIO in the distributed environment (a short sketch for applying and verifying the client settings follows this list):

  • OPT_OSS – optimized OSS/OST settings, with client settings left unchanged
  • OPT_CL1 – optimized client with max_rpcs_in_flight = 1 (lctl set_param osc.*.max_pages_per_rpc=4096 osc.*.checksums=0 osc.*.max_rpcs_in_flight=1)
  • OPT_CL128 – optimized client with max_rpcs_in_flight = 128 (lctl set_param osc.*.max_pages_per_rpc=4096 osc.*.checksums=0 osc.*.max_rpcs_in_flight=128)
  • ASYNC – ioengine=libaio/io_uring
  • SYNC – ioengine=sync
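As an example, the OPT_CL128 profile can be applied and checked on a client as follows (the set_param line repeats the parameters listed above; lctl get_param simply reads the values back):

# apply the OPT_CL128 client settings (run on each Lustre client)
lctl set_param osc.*.max_pages_per_rpc=4096 osc.*.checksums=0 osc.*.max_rpcs_in_flight=128
# verify that the values took effect
lctl get_param osc.*.max_rpcs_in_flight osc.*.max_pages_per_rpc osc.*.checksums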

Important! In this section, we do not differentiate between io_uring and libaio because we did not observe any difference between them in our tests. Therefore, we labeled both engines as “async” on the graphs. Also, we do not address sequential workloads here; we will address them later when we compare with NFS.

Testing Results

4k Random Reads, numjobs=1

graph1

When running a single job, increasing the number of in-flight requests (I/O depth) significantly improved performance for the async engine. The sync engine, which always operates at a queue depth of 1, only matched the async engine's performance when running 16 jobs at once. In other words, with the sync engine we can use multiple threads as an alternative to a high queue depth on the asynchronous engine, but this is effective only up to 16 jobs; beyond that, performance stops increasing. The sync engine couldn't handle more jobs effectively, and its latency increased significantly.

Performance varies depending on the type of workload and the settings of both the client and server parts of the Lustre file system. This is clearly illustrated in the graph.

4k Random Writes, numjobs=1

graph2

Similar results were seen with random writes. The Lustre client performed best when max_rpcs_in_flight was set to a moderate value (between 8 and 24). Again, the sync engine matched the async engine's performance only at 16 jobs and didn't scale well beyond that.

At the same time, the impact of different settings on random write performance is more straightforward.

4k Random Reads, numjobs=32

graph3

The most impressive results came from tests with 32 jobs per client. In this scenario, the async engine achieved nearly 4M IOPS, demonstrating Lustre's excellent scalability for high workloads.

On the graph, we can see that the performance of Lustre under async IO workloads scales quite well with increasing load, gradually approaching half of the network interface’s capacity in terms of throughput. And, of course, the performance is significantly higher than what we observed when using DIO and Buffered IO in the IOR utility.

4k Random Writes, numjobs=32

graph4

Random writes also showed near-linear improvement with increasing queue depth, reaching close to 1M IOPS. More detailed results, including latency graphs, can be found in our presentation at the LUG24 conference, “Asynchronous I/O: A Practical Guide for Optimizing HPC Workflows”.

Insights from Testing SYNC and AIO

Our tests provided several key insights into the performance of synchronous (SYNC) and asynchronous I/O (AIO) operations, particularly for large and small block I/Os.

  • The performance difference between synchronous (SYNC) and asynchronous I/O (AIO) is minimal for large I/O operations. This observation is consistent, though not explicitly shown in the charts.
  • For small block I/Os, the difference is severalfold. SYNC operations scale effectively up to 16 jobs, indicating good performance within this range.
  • We achieved 49% of the maximum potential performance for random writes. This performance level is primarily constrained by the drives' capabilities.
  • For random reads, we reached 46% of the maximum possible performance, limited by the dual 200Gbit Host Channel Adapters (HCAs).
  • Achieving around 50% of the storage system's capabilities on a parallel file system is a good result, since the file system has to do a significant amount of additional work compared to a plain block device. These results exceeded our expectations: we had repeatedly heard that Lustre does not suit this type of workload, but we have proved that wrong. We also see significant headroom for future growth and believe that over time we will be able to approach 70-80% efficiency.
  • The parameter max_rpcs_in_flight has a significant impact on performance. Optimal results were observed with values ranging from 8 to 24 (a one-line tuning example follows this list).
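For reference, this is a single lctl command on the client; the value 16 is just one point in the 8-24 range that worked well for us, and lctl set_param -P (run on the MGS) can make such a change persistent:

lctl set_param osc.*.max_rpcs_in_flight=16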

Lustre vs NFSoRDMA Testing

When building small-scale systems, there are various approaches to consider. For instance, with Lustre, you can create a fully open-source solution using the ZFS file system as the backend. ZFS offers integrated RAID management and a volume manager, and it integrates well with Lustre. This makes it a robust option for those who prefer open-source solutions.

Alternatively, you can build a system based on NFS instead of Lustre. NFS has its own advantages, such as having a built-in client in the Linux kernel, which means you are not dependent on specific versions compatible with the Lustre client. This plug-and-play nature is a significant benefit. However, the primary downside is the lack of an open-source NFS implementation that matches the comprehensive functionality of Lustre. While there is an open-source NFS server, it is mainly designed for testing and does not offer the same level of scalability as Lustre. Nonetheless, for small systems, particularly those with one or two controllers, NFS can still be a viable option worth testing.
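To illustrate the plug-and-play point: mounting an NFSoRDMA export from a stock Linux client is a one-liner. In the sketch below the server name, export path, and mount point are placeholders; 20049 is the standard NFS/RDMA port, and vers=3 or vers=4.2 selects the protocol version:

mount -t nfs -o proto=rdma,port=20049,vers=4.2 server:/export/data /mnt/nfs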

Here we'll present our test results for ZFS and NFS. We tested with optimal configurations; the best settings are described in the appendix. Now, let's delve into the performance graphs and detailed findings.

We compared NFSoRDMA and Lustre 2.15.4 over ldiskfs and ZFS using the same testing approach. We mounted NFS with both the sync and async options and varied the NFS server and client settings. Our key findings include:

  • No difference between sync and async mount options for reads.
  • No difference between NFS3 and NFS4.2 in most cases.

The graphs below show the outcome of our testing:

Lustre vs NFS3 vs NFS 4.2, 4k READ IOs, numjobs=1

graph5

The benchmark results showed an interesting trend. For single-threaded workloads (numjobs=1), NFS exhibited better performance, but this advantage plateaued quickly at around 250K IOPS, regardless of NFS version. In contrast, Lustre's performance kept scaling as the load increased.

Lustre vs NFS3 vs NFS 4.2, 4k READ IOs, numjobs=32

graph6

After increasing the number of jobs, Lustre performance continues to scale linearly while NFS stays at its 250K IOPS plateau. Lustre over ZFS hits a bottleneck at around 100K IOPS.

Lustre vs NFS3 vs NFS 4.2, 4k WRITE IOs, numjobs=1

graph7

Similar trends emerge in 4K random write tests. While NFS with the async mount option boasts the highest initial speeds, it again encounters a ceiling at 250K IOPS. Lustre maintains its linear scaling, but at a lower overall performance level. As expected, ZFS significantly lags behind.

Lustre vs NFS3 vs NFS 4.2, 4k WRITE IOs, numjobs=32

graph8

Once the number of jobs increases, NFS remains capped, while Lustre continues to scale linearly. Lustre over ZFS performance also does not scale.

Lustre vs NFS, 1M sequential reads

graph9

graph10

Tests involving sequential workloads, including reads at both single and 32 jobs, show minimal performance differences between NFS and Lustre over ldiskfs.

Note that the upper performance limit is around 45 GB/s, which is related to the performance of the network connections.

Lustre vs NFS, 1M sequential writes

graph11

graph12

The situation for sequential writes is similar, but for a single job, only Lustre over ldiskfs demonstrates scalability.

Note that the upper performance limit is around 45 GB/s, which is related to the performance of the network connections.

Conclusions for Lustre vs NFSoRDMA Testing

  • Lustre easily reaches the maximum network connection performance with a total throughput of 400 Gbps when using a small number of clients, which is exactly what we need.
  • ZFS does not allow us to achieve the required performance numbers on small block IOs.
  • NFS over RDMA performs better at numjobs=1 and iodepth=1 but does not scale well on small random IOs.
  • Lustre performs significantly better as the workload increases.
  • Overall, NFS over RDMA can be considered a decent solution for sequential workloads, but it slightly lags behind Lustre in terms of write performance.

Testing Lustre in the Cloud Environment

In this study, we investigated the virtualization of Lustre, specifically focusing on its OSS layer. We compared two data paths: a user-space implementation (using xiRAID Opus on SPDK) and a kernel-space implementation (virtio-blk within the Linux kernel). Both configurations used the optimized OPT_OSS Lustre settings and asynchronous I/O; the kernel-space virtio-blk path was configured with aio=io_uring.

img58

In this image, you can see the two deployment options. On the left is the option using xiRAID Opus, and on the right is the option using xiRAID Classic. Opus uses only one core, while the virtual machines operating as Lustre OSS servers use three virtual cores each. xiRAID Classic, by contrast, uses eight cores.
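For context, the two attachment styles differ mainly in how the RAID volume is handed to the guest. Below is a rough sketch of the relevant QEMU command-line fragments; the device path, memory size, queue count, and vhost socket path are placeholders (xiRAID Opus exposes its volumes through a vhost-user socket, as SPDK-based targets typically do):

# kernel-space path: host block device attached via virtio-blk with aio=io_uring
-drive file=/dev/xi_data1,if=none,id=disk0,format=raw,cache=none,aio=io_uring
-device virtio-blk-pci,drive=disk0

# user-space path: vhost-user-blk served from user space (requires shared guest memory)
-object memory-backend-file,id=mem0,size=8G,mem-path=/dev/hugepages,share=on
-numa node,memdev=mem0
-chardev socket,id=vhost_blk0,path=/var/tmp/vhost.0
-device vhost-user-blk-pci,chardev=vhost_blk0,num-queues=4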

For each deployment option, we had two OSS virtual machines and two RAID Protected Volumes.

Interestingly, both approaches delivered similar results for sequential reads and writes, reaching around 45 GB/s (the network capacity limit). The results were as follows:

img75

However, with an Opus-based solution, we can achieve the same level of performance as with a xiRAID Classic-based solution while using 8 times fewer CPU resources. Overall, to saturate the performance of a 200 Gbps interface with Opus, only a single CPU core is needed.

However, a significant difference emerged for random operations. The user-space approach scaled well, while the kernel-space approach did not. This can be attributed to limitations of the kernel virtio-blk implementation.

The user-space approach offers promising results for achieving scalability in random operations. We observed performance of 4M IOPS for random reads and 1M IOPS for random writes.

Additionally, we fully replicated the performance results of the bare-metal solution while consuming significantly fewer computational resources and less memory.

graph13

graph14

Conclusions for Lustre in Cloud Environment

  • Kernel-based virtio-blk achieves good performance on large IOs.
  • Kernel block devices exposed to VMs do not provide good performance on small block IOs.
  • The solution is to run the block device and vhost controller in user space: xiRAID Opus solves this problem.

Final Thoughts

Lustre stands out as a powerful option for data storage, even against established solutions like SAN and NFS. Lustre easily achieves performance matching the capabilities of network interfaces, 400 Gbps and above, using only two clients.

Asynchronous I/O significantly boosts performance; for random workloads, it is essential. It is also crucial to use the correct backend for Lustre, as ZFS does not handle this load well at all. However, for the best results, further development of io_uring features is needed.

Overall, after proper configuration, the performance for small block IOs turned out to be quite good and even exceeded our expectations. It approached half the capacity of a very high-performance block device, and we still see room for further improvement.

When operating in a virtualized environment, using only a few CPU cores, we can achieve performance levels comparable to a bare-metal solution while ensuring very high flexibility in deploying resources on demand where we need them: on any hypervisor, within GPU environments, or on external storage arrays.

However, this CPU efficiency, together with high-performance access for small random IO, is possible only with specialized storage solutions that operate in Linux user space, such as xiRAID Opus.

Thank you for reading! If you have any questions or thoughts, please leave them in the comments below. I’d love to hear your feedback!

