3FS Optimization 02 | Client Memory Usage Optimization

As introduced in the previous article in this series, Phantom AI has designed 3FS, a file system tailored for sample reading in deep learning training. 3FS uses Direct IO and RDMA Read so that model training achieves high read bandwidth with minimal CPU and memory overhead in the sample-reading stage, eliminating waits for data loading during training and making fuller use of the GPU's compute performance.

As we know, file systems are generally divided into a client side and a server side. In 3FS, the client is deployed on the compute nodes and accesses the 3FS servers, deployed on the storage nodes, over the network. Many aspects of this process are worth optimizing. In this issue, we discuss one of them: memory usage optimization on the client side.

Phantom AI's deep learning platform separates computation from storage, and the compute cluster is responsible only for computation. Data reading, model loading, gradient reduction, and so on must all travel through the NIC into host memory and then into GPU memory, being copied many times along the way, which consumes a great deal of memory bandwidth. If we can use memory more efficiently, model-training performance improves substantially.

Overview

The 3FS client is the part of 3FS deployed on the compute nodes; its role is to connect to the 3FS servers over the network, request data from them, and parse it. In our Phantom Firefly II cluster, each machine has a high-speed NIC. For a typical file system, consider the memory overhead required to read 23.5GB of data per second:

  1. Read from the NIC into kernel memory: one memory write
  2. Copy from kernel memory into the Page Cache: one memory read, one memory write
  3. Copy from the Page Cache into user memory: one memory read, one memory write
  4. In addition, since non-temporal store instructions are not available inside the kernel, the writes above cost one to two additional memory reads.

As the accounting above shows, achieving a read performance of 23.5 GBps consumes 23.5 * 6~7 = 141~165 GBps of memory bandwidth per second.
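The per-step accounting above can be sketched numerically; the read/write counts come from the list, and the 23.5 GBps figure from the text:

```python
# Memory-bandwidth amplification of the traditional (Page Cache) read path.
# Per-step memory operations, as enumerated in the list above:
steps = {
    "NIC -> kernel buffer":        {"reads": 0, "writes": 1},
    "kernel buffer -> Page Cache": {"reads": 1, "writes": 1},
    "Page Cache -> user buffer":   {"reads": 1, "writes": 1},
}
# Without non-temporal stores, in-kernel writes trigger
# roughly 1-2 extra memory reads on top of the counts above.
extra_reads_low, extra_reads_high = 1, 2

base_ops = sum(s["reads"] + s["writes"] for s in steps.values())   # 5
low, high = base_ops + extra_reads_low, base_ops + extra_reads_high  # 6, 7

nic_gbps = 23.5
print(f"amplification: {low}x~{high}x")
print(f"memory bandwidth: {nic_gbps * low:.1f}~{nic_gbps * high:.1f} GBps")
# -> amplification: 6x~7x
# -> memory bandwidth: 141.0~164.5 GBps
```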

Yet even the most advanced training platforms on the market today have limited memory bandwidth: a dual-socket AMD Epyc Rome/Milan platform totals roughly 270GBps~330GBps, and a dual-socket Intel Cascade Lake platform only about 220GBps. For a typical file system, then, merely reading data would consume about 50%~70% of the memory bandwidth! And these machines obviously need to do more than read data; if IO takes up this much bandwidth, little is left for computation and communication.

Therefore, reducing memory copies and lowering memory overhead are the key directions for optimizing 3FS client performance. We use RDMA Read to solve this class of problem.

RDMA Read

We read directly into user memory via RDMA, further avoiding the kernel-to-user memory copy. Continuing the example above, with RDMA Read the read becomes a single step: data goes from the NIC straight into user memory. Serving a full 23.5GBps from the NIC then consumes only 23.5GBps of memory bandwidth, less than 10% of the total.
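The single-write data path can be illustrated with a toy model; the class and method names here are ours for illustration, not real 3FS or ibverbs APIs:

```python
# Toy model of the RDMA Read data path: the NIC DMAs data once,
# directly into a buffer the application registered in advance.
# No kernel buffer, no Page Cache, no user-space copy afterwards.
# All names are illustrative, not the actual 3FS client interface.

class ToyNic:
    def __init__(self):
        self.mem_writes = 0  # count memory writes on the compute node

    def rdma_read(self, remote_data: bytes, user_buf: memoryview):
        # The one and only memory write: straight into user memory.
        user_buf[:len(remote_data)] = remote_data
        self.mem_writes += 1

nic = ToyNic()
buf = memoryview(bytearray(16))        # user-registered buffer
nic.rdma_read(b"sample-batch-001", buf)
print(bytes(buf), nic.mem_writes)      # data arrives with a single write
```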

We chose to initiate RDMA Read from the compute node rather than RDMA Write from the storage node. This is because a bulk read in 3FS may map a single network read request onto many application buffers, and RDMA Write cannot scatter to multiple addresses on the receiver side: when individual reads are very small, this would generate many tiny RDMA requests, significantly hurting read bandwidth.


Note, however, that reads must be aligned, and RDMA Read cannot simply fill the user's buffer with the requested data: the extra bytes needed for alignment are read on the storage node side into padding buffers allocated in the kernel and then discarded after alignment on the compute node side. During performance testing, we also found that RDMA Read requests of different sizes affect each other:
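The alignment bookkeeping can be sketched as follows; the 4 KiB alignment unit is an assumption for illustration, since the text only says reads must be aligned:

```python
# Sketch: expand an unaligned read to alignment boundaries, tracking
# how many padding bytes must be discarded on the compute node side.
# ALIGN = 4096 is an assumed alignment unit, not a stated 3FS constant.
ALIGN = 4096

def aligned_range(offset: int, length: int):
    """Expand [offset, offset+length) to ALIGN boundaries.

    Returns (aligned_offset, aligned_length, head_pad, tail_pad), where
    head_pad/tail_pad bytes land in padding buffers and are discarded
    after the read completes.
    """
    aligned_offset = offset - (offset % ALIGN)
    end = offset + length
    aligned_end = ((end + ALIGN - 1) // ALIGN) * ALIGN
    head_pad = offset - aligned_offset
    tail_pad = aligned_end - end
    return aligned_offset, aligned_end - aligned_offset, head_pad, tail_pad

print(aligned_range(5000, 1000))   # -> (4096, 4096, 904, 2192)
```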

  • If each request is too small, e.g. reading only a few KB at a time, reads are inefficient;
  • If each request is too large, e.g. reading several MB at a time, it performs poorly when the network is busy.

We combined small requests, split large requests, and finally settled on reading 64KB per RDMA, which yielded good results.
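The merge-small/split-large policy can be sketched as a simple request planner; the helper names are ours, not 3FS internals:

```python
# Sketch: normalize application read requests into ~64 KiB RDMA Reads,
# merging adjacent small requests and splitting large ones.
CHUNK = 64 * 1024  # the 64KB per-RDMA size chosen in the text

def plan_rdma_reads(requests):
    """requests: list of (offset, length), sorted and non-overlapping.
    Returns a list of (offset, length) RDMA Read operations."""
    # Merge adjacent requests into contiguous spans.
    spans = []
    for off, ln in requests:
        if spans and spans[-1][0] + spans[-1][1] == off:
            spans[-1][1] += ln
        else:
            spans.append([off, ln])
    # Split each span into CHUNK-sized reads.
    ops = []
    for off, ln in spans:
        while ln > 0:
            step = min(ln, CHUNK)
            ops.append((off, step))
            off += step
            ln -= step
    return ops

# Three small adjacent reads followed by one large read all coalesce
# into one contiguous span, which is then cut into 64 KiB operations:
reqs = [(0, 16384), (16384, 16384), (32768, 32768), (65536, 262144)]
print(plan_rdma_reads(reqs))
```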

Summary

In this article, we presented some of Phantom AI's thinking on optimizing client-side memory usage in the design of the 3FS file system. By adopting Direct IO and RDMA Read, server-side data is loaded through the NIC directly into user memory, reducing memory-bandwidth consumption and letting model training obtain very high read bandwidth in the sample-reading stage at only a very small CPU and memory cost, so training never waits for data to load and the GPU's compute performance is used more fully.

3FS did not reach this level of performance overnight; Phantom AI has met many challenges while optimizing it over long-term practice. This series of articles will continue to share stories from 3FS performance optimization, including pitfalls we have stepped on, in the hope of sparking discussion. Colleagues and experts in the industry are welcome to join the conversation.

You may repost, excerpt, and quote the content of this technical blog without distorting its original meaning, subject to the following terms: Attribution — you must credit the original author, but may not in any way imply endorsement by Phantom, nor negatively affect Phantom's rights. Non-commercial use — you may not use the content of this technical blog for commercial purposes. No derivatives — if you adapt, transform, or build upon this content, you may not publish or distribute the modified content; it is for personal use only.