

<h1>3FS Optimization 02 | Client Memory Usage Optimization</h1>
<p>As introduced in the article <a href="https://www.high-flyer.cn/blog/3fs">High-Flyer Power | High-Speed File Series: 3FS</a>, High-Flyer AI has designed 3FS, a file system well suited to reading training samples for deep learning. 3FS uses Direct IO and RDMA Read so that model training achieves very high read bandwidth in the sample-reading stage with minimal CPU and memory overhead, eliminating waits for data loading during training and making fuller use of the GPU's compute performance.</p>
<p>As we know, a file system is generally split into a client side and a server side. In 3FS, the client is deployed on the compute nodes and accesses the 3FS servers deployed on the storage nodes over the network. Many aspects of this path are worth optimizing; in this installment we discuss <strong>memory usage optimization</strong> on the client side.</p>
<p>High-Flyer AI's deep learning platform separates computation from storage, and the compute cluster is responsible only for computation. Data reading, model loading, gradient reduction and the like must all pass through the NIC into host memory and then into GPU memory, being copied several times along the way, which consumes a great deal of memory bandwidth.
If we can use memory more efficiently, model training performance will improve substantially.</p>
<h2 id="概述">Overview</h2>
<p>The 3FS client is the part of 3FS deployed on the compute nodes. Its job is to talk to the 3FS servers over the network, requesting data from them and parsing it. Every machine in High-Flyer's Fire-Flyer II cluster carries a high-speed NIC. For a typical file system, consider the memory operations required to read 23.5GB of data per second:</p>
<ol>
<li>From the NIC into kernel memory: one memory write;</li>
<li>From the kernel buffer into the Page Cache: one memory read and one memory write;</li>
<li>From the Page Cache into user memory: one memory read and one memory write;</li>
<li>In addition, because non-temporal store instructions are not available inside the kernel, the writes above incur one to two extra memory reads (each cache line is read before being overwritten). For details, see <a href="http://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/">here</a>.</li>
</ol>
<p>Adding this up, sustaining a read throughput of 23.5GBps consumes <code class="language-text">23.5 * 6~7 = 141~165GBps</code> of memory bandwidth.</p>
<p>Meanwhile, even the most advanced training platforms on the market, such as dual-socket AMD Epyc Rome/Milan, have a total memory bandwidth of only <code class="language-text">270GBps~330GBps</code>, and a dual-socket Intel Cascade Lake platform only about 220GBps. In other words, with a typical file system, merely reading data would consume roughly <code class="language-text">50%~70%</code> of the memory bandwidth!</p>
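The accounting above works out as follows. This is a toy calculation: the per-byte operation counts come from the list above, and 300GBps is taken as a representative midpoint of the quoted Epyc range.

```python
# Toy model of the memory-bandwidth amplification on the classic read path.
# Figures are from the text: a 23.5 GBps NIC and 6-7 memory operations
# per byte delivered.

NIC_GBPS = 23.5

def page_cache_path_ops():
    """Memory operations per delivered byte on the classic read path."""
    writes = 3               # NIC->kernel, kernel->Page Cache, Page Cache->user
    reads = 2                # the two copy steps each read their source
    extra_reads = (1, 2)     # write-allocate: no non-temporal stores in-kernel
    return (writes + reads + extra_reads[0], writes + reads + extra_reads[1])

lo, hi = page_cache_path_ops()
print(f"amplification: {lo}-{hi}x")
print(f"memory bandwidth used: {NIC_GBPS * lo:.1f}-{NIC_GBPS * hi:.1f} GBps")
# Share of total memory bandwidth on two example dual-socket platforms.
for name, total in [("2S EPYC Rome/Milan", 300), ("2S Cascade Lake", 220)]:
    print(f"{name}: {NIC_GBPS * lo / total:.0%}-{NIC_GBPS * hi / total:.0%} of {total} GBps")
```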
<p>Yet when using these machines, data alone is clearly not enough: if IO consumes that much bandwidth, fewer resources are left for computation and communication.</p>
<p>Reducing memory copies and lowering memory overhead are therefore the key directions for optimizing 3FS client performance. We use <strong>RDMA Read</strong> to address this class of problem.</p>
<h2 id="rdma-read">RDMA Read</h2>
<p>We <strong>read directly into user memory via RDMA</strong> to avoid the kernel-to-user memory copy. Continuing the earlier example, with RDMA Read the read path collapses to a single step: data goes from the NIC straight into user memory, so saturating a 23.5GBps NIC consumes only 23.5GBps of memory bandwidth, under 10% of the total.</p>
<p>We chose to have the compute node initiate an RDMA Read rather than having the storage node perform an RDMA Write. This is because a 3FS bulk read can map a single network read request onto many application buffers, and RDMA Write cannot scatter to multiple addresses on the receiver side; when individual reads are small, this would generate many tiny RDMA requests, significantly hurting read bandwidth.</p>
<p><img src="https://hfai-static.high-flyer.cn/static/9ce94dbce3caaf40da101fd4b29597a8/07108/image.png" alt="image.png" title="image.png" /></p>
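To make the scatter limitation concrete, here is a toy request count. All numbers are illustrative assumptions, not 3FS parameters: a 4MB bulk read landing in 4KB application buffers, where a server-initiated RDMA Write needs one work request per destination buffer, while a client-initiated RDMA Read can attach several local buffers to one work request via its scatter-gather list.

```python
# Toy comparison (numbers assumed for illustration): how many RDMA work
# requests are needed to deliver one bulk read into many small app buffers.

TOTAL = 4 * 1024 * 1024   # 4 MB bulk read (assumed)
BUF = 4 * 1024            # 4 KB application buffers (assumed)
MAX_SGE = 16              # per-work-request scatter-gather limit (assumed)

buffers = TOTAL // BUF                 # 1024 destination buffers
writes_needed = buffers                # Write cannot scatter: one WR per buffer
reads_needed = -(-buffers // MAX_SGE)  # ceil: Reads batch buffers via SGEs
print(writes_needed, reads_needed)     # prints: 1024 64
```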
<p>Note, however, that the data read must be aligned, and RDMA Read does not simply fill the user's buffer with whatever was read: the extra bytes used for alignment on the storage-node side are read into a padding buffer allocated in the kernel, then discarded after alignment on the compute node. During performance testing, we also found that RDMA Read requests of different sizes interfere with each other:</p>
<ul>
<li>Requests that are too small, e.g. only a few KB each, are inefficient;</li>
<li>Requests that are too large, e.g. several MB each, perform poorly when the network is busy.</li>
</ul>
<p>We therefore coalesce small requests and split large ones, finally settling on 64KB per RDMA read, which yielded good results.</p>
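The request shaping just described — coalescing small reads and splitting large ones into 64KB RDMA requests — can be sketched as follows. Function and variable names are ours for illustration, not taken from the 3FS implementation.

```python
# Sketch of request shaping: merge adjacent small reads, then split the
# merged ranges into 64 KB RDMA-sized chunks. Names are illustrative.

RDMA_CHUNK = 64 * 1024

def shape_requests(ranges):
    """ranges: list of (offset, length) reads, not necessarily sorted."""
    merged = []
    for off, length in sorted(ranges):
        if merged and merged[-1][0] + merged[-1][1] == off:
            merged[-1][1] += length          # coalesce adjacent reads
        else:
            merged.append([off, length])
    chunks = []
    for off, length in merged:
        while length > 0:                    # split into <= 64 KB RDMAs
            n = min(length, RDMA_CHUNK)
            chunks.append((off, n))
            off, length = off + n, length - n
    return chunks

# Three adjacent 8 KB reads coalesce into one 24 KB RDMA;
# a 200 KB read splits into 64+64+64+8 KB.
print(shape_requests([(0, 8192), (8192, 8192), (16384, 8192)]))
print(shape_requests([(1 << 20, 200 * 1024)]))
```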
<h2 id="小结">Wrap-up</h2>
<p>In this post we shared some of High-Flyer AI's thinking on optimizing client-side memory usage in the design of the 3FS file system. By adopting Direct IO and RDMA Read, server-side data is loaded through the NIC directly into user memory, cutting memory bandwidth consumption. This lets model training obtain very high read bandwidth in the sample-reading stage with only a small CPU and memory overhead, eliminating waits for data loading during training and making fuller use of the GPU's compute performance.</p>
<p>3FS did not reach this level of performance overnight. High-Flyer AI encountered many challenges while optimizing it in long-term practice. This series of articles will continue to share stories from 3FS performance tuning, including the pitfalls we stepped into along the way; we hope they spark discussion, and practitioners and experts in the field are welcome to join in.</p>