Sergey Platonov

Posted on Aug 7 • Edited on Aug 12

Saturating Infiniband Bandwith xiRAID, To Keep Nvidia Dgx Busy

#nvidia #raid #data

Objectives

Modern AI innovations require proper infrastructure, especially concerning data throughput and storage capabilities. While GPUs drive faster results, legacy storage solutions often lag behind, causing inefficient resource utilization and extended times in completing the project. Traditional enterprise storage or HPC-focused parallel file systems are costly and challenging to manage for AI-scale deployments. High-performance storage systems can significantly reduce AI model training time. Delays in data access can also impact AI model accuracy, highlighting the critical role of storage performance.

Xinnor partnered with DELTA Computer Products GMBH, a leading system integrator in Germany, to build a high-performance solution designed specifically for AI and HPC tasks. Thanks to the use of high-performance NVMe drives from Micron, efficient software RAID from Xinnor, and 400Gbit InfiniBand controllers from NVIDIA, the system designed by Delta ensures a high level of performance through NFSoRDMA interfaces, both for read and write operations, that is crucial for reducing checkpoint times typical of AI projects and for handling possible drive failures. NFSoRDMA enables parallel access for reading and writing from multiple nodes simultaneously. The 2U dual sockets server used by Delta and equipped with 24x 7450 NVMe 15.36 from Micron allows storage of up to 368TB and provides theoretical access speeds of up to 50GBps. In this document we’ll explain how to set up the system with xiRAID to saturate the InfiniBand bandwidth and provide the best possible performance to NVIDIA DGX H100 systems.

In addition, we’ll showcase the capabilities of xiRAID software. xiRAID represents a comprehensive software RAID engine, offering a range of features tailored to address diverse storage needs.

Finally, this report provides a detailed instruction manual for achieving optimal and consistent performance across various deployments.

Test Setup

Motherboard: Giga Computing MZ93-FS0
CPU: 2xAMD EPYC 9124
RAM: 756GB
Storage: Micron 7450 (15.36TB) x 24
Boot drives: Micron 7450 (960GB) x 2
Network: NVIDIA ConnectX-7 400Gbit
OS: Ubuntu 22.04.4 LTS (Jammy Jellyfish)
RAID: xiRAID 4.0.3

Client 1:

NVIDIA DGX H100
Intel(R) Xeon(R) Platinum 8480CL
2063937MB RAM
Network InfiniBand controller: Mellanox Technologies MT2910 Family [ConnectX-7]
Client 2:

NVIDIA DGX H100
Intel(R) Xeon(R) Platinum 8480CL
2063937MB RAM
Network InfiniBand controller: Mellanox Technologies MT2910 Family [ConnectX-7]

Testing Approach

We conducted tests of synchronous and asynchronous file access modes to demonstrate the difference in performance between the two approaches. Synchronous mode means that the host receives confirmation of the write only after the data has been written to the non-volatile memory. This mode ensures data integrity and more stable performance. In asynchronous mode, the client receives confirmation of the write when the data is saved in the page cache of the server. Asynchronous mode is less sensitive to storage-level delays and thus to array geometry, but it may provide an unstable level of performance, varying depending on the level of cache fill, and may lead to data loss in case of power outage and lack of proper tools to protect the cache itself.

If supported by the application, Xinnor recommends using synchronous mode.

RAID and File System Configuration

To achieve the best results in synchronous mode, it is necessary to correctly configure the array geometry and file system mounting parameters. In our case, we will create 1 RAID50 array with 18 drives, with a chunk size of 64k. For the journals, we will create a RAID1 from 2 drives (for each parity RAID), so that small log IOs will not interfere with writing large data blocks. This geometry allows us to aligne to 512kb blocks and consequently, to achieve better sequential write results, due to the reduced read-modify-write (RMW) operations. The alternative to this configuration could be 2 RAID5 where each RAID belongs to the dedicated NUMA node. In this testing we don’t see great value for NUMA affinity approach, but in some server configurations it may significantly help. It is worth mentioning that one xiRAID software instance supports unlimited number of RAIDs.

Example array for 1 shared folder

Possible Array Configuration Schemes

Scheme 1

Second testing configuration

A single RAID50/60 is created from 18/20 drives and a mirror of two drives. One file system (data + log) is created and exported as a single shared folder.

Pros:

If IO is a multiple of 256k, there are no RMW operations;
Unified data volume for all clients;
Small IO does not affect performance stability.

Cons:

Not all drives are used for data;
NUMA may affect overall performance.

Scheme 3

Third testing configuration

A single RAID50 or 60 is created with 24 drives. One file system with internal logs is created and exported as 1 shared folder.

Pros:

The entire volume is allocated for data;

Cons:

Slightly higher latency, lower performance in comparison with aligned IO.

Aligned IO Description

If the IO is not a multiple of the stripe size (for example, if the IO is 256kb and the stripe consists of 12 drives with a chunk size of 32kb), to update the checksums during writing we have to read the old data state, the old checksum state, recalculate, and write everything back.

The same situation occurs if the IO is equal to the stripe size but not aligned to its boundary and is written with an offset, then the RMW operation must be done for two stripes.

If the IO is aligned, for example if we write 256kb on an 8+1 stripe, we can generate the checksum only from the new data, and we do not need to read the old data and parity states.

Performance Tests

We conducted performance tests of the array locally to demonstrate its capabilities. Then we added the file system to assess its impact and conducted tests with clients over the network using NFSoRDMA protocols, both with one and two clients, to evaluate scalability. To understand the system's behavior in various scenarios, we presented the test results in case of failures and in asynchronous mode of NFS clients. Additionally, for comparison, we conducted tests on a single unaligned array to demonstrate the impact of such geometry on the results.

Local Performance Testing

Testing Drives and Array

We conducted tests on the drives and array. Prior to that, we needed to initialize the array. This is the FIO configuration file:

[global]
bs=1024k
ioengine=libaio
rw=write
direct=1
group_reporting
time_based
offset_increment=3%
runtime=90
iodepth=32
group_reporting
exitall
[nvme1n1]
filename=/dev/xi_xiraid
~
~

Test results for scheme 2 (1 RAID50 of 18 drives for data and 1 RAID1 of 2 drives for logs) are as follows:

The read performance is close to the theoretical maximum for this workload.

At the same time, the write performance is very good, greatly exceeding the capabilities of alternative solutions available in the market.

Testing the Local File System

When testing the local file system, we can assess the extent of its influence on the results. FIO configuration:

[globall]
bs=1024k
ioengine=libaio
rw=write
direct=1
group_reporting
time_based
runtime= 90
iodepth=32
exitall
[nvme1n1]
directory=/data

Now let's format and mount the file system:

mkfs.xfs -d su=64k,sw=8 -l logdev=/dev/xi_log1,size=1G /dev/xi_xiraid -f -ssize=4k

The mount options look the following way:

/dev/xi_xiraid /data xfs logdev=/dev/xi_log1,noatime,nodiratime,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,x-systemd.requires=xiraid-restore.service,x-systemd.device-timeout=5m,_netdev 0 0

Thanks to xiRAID architecture we don’t see significant impact on the results in comparison with previous test of RAID block device. As well we demonstrate that theoretically we can saturate all the network bandwidth.

Network Performance Testing

The NFS configuration file is available in Appendix 1.

The share parameters:

(/etc/exports
/data *(rw,no_root_squash,sync,insecure,no_wdelay)

vim /etc/modprobe.d/nfsclient.conf
options nfs max_session_slots=180
mount -o nfsvers=3,rdma,port=20049,sync,nconnect=16 10.10.10.254:/data /data1

We recommend using NFS v3 as it demonstrates more stable results in synchronous mode. FIO configuration on the client:

Synchronous Mode, Single Client Testing

Below are the results for single client performance testing.

Write operations provide 3/4 of the network interface's capabilities, while read operations offer the full potential of the interface (50GB/s or 400Gbs). Writing is slower than the interface results because in synchronous mode, IO parallelization decreases due to the need to wait for confirmation of the write on the drives.

Synchronous Mode, Single Client Testing, Degraded Mode

It is also important to check the system's behavior in degraded mode. Degraded mode is when one or more drives are removed from the RAID.

Array status in degraded mode

During one drive failure, no performance degradation is observed, meaning that DGX H100 client will not suffer any downtime.

Synchronous Mode, Two Clients Testing

Testing in synchronous mode demonstrates that write performance increases for low jobs count with two clients because of the increased workload from the clients, while read performance remains the same as we already reached the capabilities of a single-port 400 Gbit interface (50GB/s).

Asynchronous Mode

During asynchronous operations, the performance appears similar, but it might be unstable over time and for this reason we recommend running in synchronous mode whenever it is supported by the application.

Non-aligned RAID Performance Testing

In some cases, it may be necessary to increase the usable array capacity at the expense of some performance reduction, or, if the client behavior is not determined, there is no point or possibility in creating an aligned RAID.

Using all drives for testing, we will create a RAID50 array of 24 drives (scheme 3) and make some changes to the file system creation and mounting parameters (see fig. 4). We will decrease the chunk size to 32k to reduce stripe width. With this chunk size, we recommend using write intensive drives to avoid performance degradation.

Write performance on single client with non-aligned array is nearly one-third lower. Read operations are similar to aligned arrays.

Conclusions

The combination of NFSoverRDMA, xiRAID and Micron 7450 NVMe SSD enables to create a high-performance storage system capable of saturating the network bandwidth in read operation and ensuring fast flushing and checkpoint execution (write at 3/4 of the interface capability), therefore keeping DGX H100 busy with data and consequently optimizing its usage.
Storage performance remains unaffected in case of drive failures, eliminating the need for overprovisioning resources and avoiding system downtime.
Both synchronous and asynchronous operation modes are supported, and the solution offers the necessary set of settings to optimize performance for various scenarios and load patterns.

Thank you for reading! If you have any questions or thoughts, please leave them in the comments below. I’d love to hear your feedback!

Original article can be found here.

DEV Community