

{"id":501,"date":"2025-11-25T09:28:18","date_gmt":"2025-11-25T01:28:18","guid":{"rendered":"https:\/\/high-flyer.in.suopu.cc\/?p=501"},"modified":"2025-11-25T09:28:18","modified_gmt":"2025-11-25T01:28:18","slug":"pytorch-%e5%88%86%e5%b8%83%e5%bc%8f%e8%ae%ad%e7%bb%83%e6%96%b9%e6%b3%95","status":"publish","type":"post","link":"https:\/\/high-flyer.in.suopu.cc\/en\/blog\/501\/","title":{"rendered":"PyTorch Distributed Training Method"},"content":{"rendered":"<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/9ac28a620ade131f0826acbd7d76181e\/99f37\/p0.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/9ac28a620ade131f0826acbd7d76181e\/a6d36\/p0.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/9ac28a620ade131f0826acbd7d76181e\/222b7\/p0.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/9ac28a620ade131f0826acbd7d76181e\/ff46a\/p0.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/9ac28a620ade131f0826acbd7d76181e\/a6d36\/p0.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/9ac28a620ade131f0826acbd7d76181e\/e548f\/p0.png 975w,https:\/\/hfai-static.high-flyer.cn\/static\/9ac28a620ade131f0826acbd7d76181e\/99f37\/p0.png 1100w\" alt=\"image.png\" \/><\/a><\/span><\/p>\n<p>In 2018, the Bert model with nearly 300 million parameters came out of nowhere, pushing the NLP field to new heights. In recent years, the development of the artificial intelligence field has increasingly tended to the study of large models, and all major AI giants have released their large models with hundreds of billions of parameters, giving birth to many new AI application scenarios. 
On the other hand, several factors continue to drive the rapid development of large models: 1) society is undergoing a deep digital transformation, and vast amounts of data are converging, giving rise to many AI application scenarios and needs; 2) hardware keeps advancing: the NVIDIA A100 GPU, Google's TPU, Alibaba's Hanguang 800, and similar accelerators keep pushing AI computing power upward. It can be said that the combination of \u201cbig data + big model + big computing power\u201d is the cornerstone of AI 2.0. However, for a variety of reasons, most AI practitioners rarely get access to big data or big compute, and many graduate students are <strong>accustomed to using a single graphics card (a personal PC) to accelerate training<\/strong>. This not only slows down training, but also restricts imagination and imprisons creativity. High-Flyer AI built \u201cFirefly II\u201d around its own business needs, aiming to provide deep learning computing power as vast as the sea. The team developed its own intelligent time-sharing scheduling system, high-performance storage system, and network communication system, so the cluster can be used like an ordinary computer, elastically scaling GPU compute to match task demand. It also built the hfai data warehouse and model warehouse, optimized AI frameworks and operators, and integrated many cutting-edge application scenarios. Training of AI models can be easily accelerated through the Client interface or Jupyter.<\/p>\n<p>This installment shares <strong>how to use multiple graphics cards to speed up your AI models<\/strong>. Distributed training is becoming one of the essential skills for AI practitioners: it is the road from \u201csmall model\u201d to \u201cbig model\u201d. Let's take ResNet training written in PyTorch as an example to show different distributed training methods and their effects.<\/p>\n<p><strong>Training task<\/strong>: ResNet + ImageNet<\/p>\n<p><strong>Training framework<\/strong>: PyTorch 1.8.0<\/p>\n<p><strong>Training platform<\/strong>: Firefly II<\/p>\n<p><strong>Training code<\/strong>: <a href=\"https:\/\/github.com\/HFAiLab\/pytorch_distributed\">https:\/\/github.com\/HFAiLab\/pytorch_distributed<\/a><\/p>\n<h3 id=\"\u8bad\u7ec3\u51c6\u5907\">Training preparation<\/h3>\n<p>Firefly II fully supports PyTorch's parallel training environment. We first use 1 node with 8 cards to compare the time required and GPU utilization of different distributed training methods; after that, multiple nodes are used to verify the effect of parallel acceleration. The methods tested are:<\/p>\n<ol>\n<li>nn.DataParallel<\/li>\n<li>torch.distributed + torch.multiprocessing<\/li>\n<li>apex half precision<\/li>\n<\/ol>\n<p>To fully utilize GPU memory, <code class=\"language-text\">batch_size<\/code> is set to 400 here, and we record the elapsed time per epoch for comparison. 
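Per-epoch wall-clock time is the metric compared throughout; a minimal timing harness of the kind used for such measurements might look like the following sketch (the training step here is only a stand-in for a real epoch over ImageNet):

```python
import time

def train_one_epoch():
    # Stand-in for one full pass over the training set with batch_size=400.
    time.sleep(0.01)  # placeholder for the real forward/backward work

epoch_times = []
for epoch in range(3):
    start = time.time()
    train_one_epoch()
    epoch_times.append(time.time() - start)

mean_epoch_time = sum(epoch_times) / len(epoch_times)
print(f'mean seconds/epoch: {mean_epoch_time:.3f}')
```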
We first ran a single-machine, single-card test as the baseline, which took roughly 1786 seconds per epoch.<\/p>\n<p>The tests found that <strong>parallel acceleration combined with half precision works best<\/strong>; <strong>DataParallel is slower and makes poor use of GPU resources, and is not recommended<\/strong>; <strong>the acceleration from multiple machines and multiple cards is significant<\/strong>. The overall results are as follows:<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/2b8b7b8b386810d429647d93028deece\/d9ed0\/p1.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"result.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/2b8b7b8b386810d429647d93028deece\/a6d36\/p1.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/2b8b7b8b386810d429647d93028deece\/222b7\/p1.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/2b8b7b8b386810d429647d93028deece\/ff46a\/p1.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/2b8b7b8b386810d429647d93028deece\/a6d36\/p1.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/2b8b7b8b386810d429647d93028deece\/e548f\/p1.png 975w,https:\/\/hfai-static.high-flyer.cn\/static\/2b8b7b8b386810d429647d93028deece\/3c492\/p1.png 1300w,https:\/\/hfai-static.high-flyer.cn\/static\/2b8b7b8b386810d429647d93028deece\/d9ed0\/p1.png 1415w\" alt=\"result.png\" \/><\/a><\/span><\/p>\n<h3 id=\"nndataparallel\">nn.DataParallel<\/h3>\n<p>DataParallel is an early data-parallel training method in PyTorch. A single process loads the model and data onto multiple GPUs, controls the flow of data between GPUs, and coordinates the model replicas on the different GPUs for parallel training.<\/p>\n<p>It is very easy to use: we just wrap the model with\u00a0<code class=\"language-text\">nn.DataParallel<\/code>\u00a0and set a few parameters. The parameters to define include:<\/p>\n<ul>\n<li>which GPUs participate in training: device_ids=gpus;<\/li>\n<li>which GPU aggregates the gradients: output_device=gpus[0].<\/li>\n<\/ul>\n<p>DataParallel automatically slices the data and loads it onto the appropriate GPUs, copies the model to each GPU, performs forward propagation, and computes and aggregates the gradients:<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/93576e68226a3bcf0b52a87376ab932d\/f058b\/p2.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/93576e68226a3bcf0b52a87376ab932d\/f058b\/p2.png\" sizes=\"(max-width: 630px) 100vw, 630px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/93576e68226a3bcf0b52a87376ab932d\/222b7\/p2.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/93576e68226a3bcf0b52a87376ab932d\/ff46a\/p2.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/93576e68226a3bcf0b52a87376ab932d\/f058b\/p2.png 630w\" alt=\"image.png\" \/><\/a><\/span><\/p>\n<p>It is worth noting that, <strong>because there are 8 cards here, <\/strong><code class=\"language-text\">batch_size<\/code><strong> has to be set to 3200<\/strong>. Both the model and the data have to be loaded onto the GPU before the DataParallel module can process them; otherwise an error is reported.<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/0e4685b5f245ff537f3b322b8724df2e\/63ec5\/p3.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/0e4685b5f245ff537f3b322b8724df2e\/a6d36\/p3.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/0e4685b5f245ff537f3b322b8724df2e\/222b7\/p3.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/0e4685b5f245ff537f3b322b8724df2e\/ff46a\/p3.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/0e4685b5f245ff537f3b322b8724df2e\/a6d36\/p3.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/0e4685b5f245ff537f3b322b8724df2e\/63ec5\/p3.png 812w\" alt=\"image.png\" \/><\/a><\/span><\/p>\n<p>Start training. 
As we can see from the figure below, the average GPU utilization of the 8 graphics cards is not high (36.5%) when the graphics memory is almost full.<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/606dcf947bf33e727f91a0c771c451a2\/67fe0\/p4.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/606dcf947bf33e727f91a0c771c451a2\/a6d36\/p4.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/606dcf947bf33e727f91a0c771c451a2\/222b7\/p4.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/606dcf947bf33e727f91a0c771c451a2\/ff46a\/p4.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/606dcf947bf33e727f91a0c771c451a2\/a6d36\/p4.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/606dcf947bf33e727f91a0c771c451a2\/e548f\/p4.png 975w,https:\/\/hfai-static.high-flyer.cn\/static\/606dcf947bf33e727f91a0c771c451a2\/67fe0\/p4.png 1101w\" alt=\"image.png\" \/><\/a><\/span>Looking at the utilization of each graphics card individually reveals uneven resource usage. 
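For reference, the DataParallel recipe shown in the screenshots above amounts to roughly the following sketch (the tiny CNN is only a stand-in for ResNet, and the CUDA calls are guarded so the sketch also runs on a CPU-only machine):

```python
import torch
import torch.nn as nn

per_gpu_batch = 400
n_gpus = torch.cuda.device_count()
total_batch = per_gpu_batch * max(n_gpus, 1)  # 8 cards -> batch_size 3200, as in the post

# Tiny CNN as a stand-in for ResNet; any nn.Module is wrapped the same way.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 10))

if n_gpus > 0:
    model = model.cuda()  # the model must live on GPU before wrapping
    gpus = list(range(n_gpus))
    # A single process drives all listed GPUs; gradients are gathered on output_device.
    model = nn.DataParallel(model, device_ids=gpus, output_device=gpus[0])
else:
    model = nn.DataParallel(model)  # degrades to plain single-device execution

x = torch.randn(4, 3, 32, 32)  # data must also be on GPU when one is available
if n_gpus > 0:
    x = x.cuda()
out = model(x)
```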
Card #0, which aggregates the gradients, shows somewhat higher resource utilization than the other cards, but overall utilization stays below 50%.<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/7f6c45154108bd9094fff7ba70ec8c6f\/7798c\/p5.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/7f6c45154108bd9094fff7ba70ec8c6f\/a6d36\/p5.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/7f6c45154108bd9094fff7ba70ec8c6f\/222b7\/p5.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/7f6c45154108bd9094fff7ba70ec8c6f\/ff46a\/p5.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/7f6c45154108bd9094fff7ba70ec8c6f\/a6d36\/p5.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/7f6c45154108bd9094fff7ba70ec8c6f\/e548f\/p5.png 975w,https:\/\/hfai-static.high-flyer.cn\/static\/7f6c45154108bd9094fff7ba70ec8c6f\/7798c\/p5.png 1183w\" alt=\"image.png\" \/><\/a><\/span><\/p>\n<p>In the end, under the <code class=\"language-text\">nn.DataParallel<\/code> method, ResNet takes <strong>about 984s<\/strong> per epoch on average, <strong>roughly a 2x speedup<\/strong> over the single-machine, single-card baseline.<\/p>\n<h3 id=\"torchdistributed\">torch.distributed<\/h3>\n<p>Since PyTorch 1.0, common distributed methods have been officially encapsulated in PyTorch, supporting operations such as all-reduce, broadcast, send, and receive. 
It uses MPI for CPU communication and NCCL for GPU communication, solving DataParallel's problems of slow speed and unbalanced GPU load.<\/p>\n<p>Unlike DataParallel, where a single process controls multiple GPUs, with distributed we just write the code once and PyTorch automatically assigns it to n processes running on n GPUs.<\/p>\n<p>Specifically, first use\u00a0<code class=\"language-text\">init_process_group<\/code>\u00a0to set the backend and port used for communication between GPUs:<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/38fe459308837d43f8e467cc7c32b7fe\/024d6\/p6.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/38fe459308837d43f8e467cc7c32b7fe\/a6d36\/p6.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/38fe459308837d43f8e467cc7c32b7fe\/222b7\/p6.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/38fe459308837d43f8e467cc7c32b7fe\/ff46a\/p6.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/38fe459308837d43f8e467cc7c32b7fe\/a6d36\/p6.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/38fe459308837d43f8e467cc7c32b7fe\/024d6\/p6.png 961w\" alt=\"image.png\" \/><\/a><\/span>\u200b<\/p>\n<p>Then, wrapping the model with\u00a0<code class=\"language-text\">DistributedDataParallel<\/code>\u00a0does the all-reduce for us, aggregating the gradients computed on the different GPUs and synchronizing the result. 
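Putting the pieces this section describes together (process-group initialization, DDP wrapping, per-rank data sharding, and process spawning), a minimal end-to-end sketch might look like this; the toy dataset and model are stand-ins, and the gloo backend is chosen automatically so the sketch can also run without GPUs (on GPU nodes it uses nccl):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main(local_rank, world_size):
    # One process per GPU; on a single node, rank == GPU index.
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    backend = 'nccl' if torch.cuda.is_available() else 'gloo'
    dist.init_process_group(backend, rank=local_rank, world_size=world_size)

    model = torch.nn.Linear(10, 10)  # stand-in for ResNet
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
        model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    else:
        model = DDP(model)

    # Each rank trains only on its own shard of the dataset.
    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 10))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=local_rank)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)  # per-rank batch size

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()  # DDP all-reduces (averages) gradients across ranks here
        optimizer.step()
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count() or 2
    mp.spawn(main, args=(world_size,), nprocs=world_size)
```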
After the all-reduce, the gradient on each GPU is the average of the gradients that were computed on the individual GPUs:<\/p>\n<div class=\"gatsby-highlight\" data-language=\"python\">\n<pre class=\"language-python\"><code class=\"language-python\">model <span class=\"token operator\">=<\/span> torch<span class=\"token punctuation\">.<\/span>nn<span class=\"token punctuation\">.<\/span>parallel<span class=\"token punctuation\">.<\/span>DistributedDataParallel<span class=\"token punctuation\">(<\/span>model<span class=\"token punctuation\">.<\/span>cuda<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> device_ids<span class=\"token operator\">=<\/span><span class=\"token punctuation\">[<\/span>local_rank<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><\/code><\/pre>\n<\/div>\n<p>\u200b<\/p>\n<p>Finally, because multiple processes are used, we need\u00a0<code class=\"language-text\">DistributedSampler<\/code>\u00a0to partition the dataset: each batch is divided into several shards, and each process only fetches the shard corresponding to its own rank for training:<span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/f31c5d3b7c1a02c10fb68e7500c64dfa\/1f09d\/p7.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/f31c5d3b7c1a02c10fb68e7500c64dfa\/1f09d\/p7.png\" sizes=\"(max-width: 471px) 100vw, 471px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/f31c5d3b7c1a02c10fb68e7500c64dfa\/222b7\/p7.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/f31c5d3b7c1a02c10fb68e7500c64dfa\/ff46a\/p7.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/f31c5d3b7c1a02c10fb68e7500c64dfa\/1f09d\/p7.png 471w\" 
alt=\"image.png\" \/><\/a><\/span><\/p>\n<p>Noteworthy**, with **<code class=\"language-text\">**nn.DataParrallel**<\/code><strong>Different. Here.<\/strong><code class=\"language-text\">**batch_size**<\/code><strong>It should be 400.<\/strong>, since it is only responsible for the corresponding partition under the current rank, 8 cards make up a total batch_size of 3200 samples.<\/p>\n<p>As for the API level, PyTorch provides us with the\u00a0<code class=\"language-text\">torch.distributed.launch<\/code>\u00a0A launcher for distributed execution of python files on the command line. During execution, the launcher passes the current process's (actually the GPU's) index to python via a parameter, and we can get the current process's index this way:<\/p>\n<div class=\"gatsby-highlight\" data-language=\"python\">\n<pre class=\"language-python\"><code class=\"language-python\">parser <span class=\"token operator\">=<\/span> argparse<span class=\"token punctuation\">.<\/span>ArgumentParser<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span>\r\nparser<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">'--local_rank'<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">=<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">=<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">help<\/span><span class=\"token operator\">=<\/span><span class=\"token string\">'node rank for distributed training'<\/span><span class=\"token punctuation\">)<\/span>\r\nargs <span class=\"token operator\">=<\/span> parser<span class=\"token punctuation\">.<\/span>parse_args<span class=\"token punctuation\">(<\/span><span class=\"token 
punctuation\">)<\/span>\r\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>local_rank<span class=\"token punctuation\">)<\/span><\/code><\/pre>\n<\/div>\n<p>Start 8 command lines and execute the following commands:<\/p>\n<div class=\"gatsby-highlight\" data-language=\"bash\">\n<pre class=\"language-bash\"><code class=\"language-bash\"><span class=\"token assign-left variable\">CUDA_VISIBLE_DEVICES<\/span><span class=\"token operator\">=<\/span><span class=\"token number\">0,1<\/span>,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node<span class=\"token operator\">=<\/span><span class=\"token number\">8<\/span> train.py<\/code><\/pre>\n<\/div>\n<p><strong>torch.multiprocessing<\/strong>\u200b<\/p>\n<p>Starting the command line manually is a bit of a hassle, so here's an easier way to do it:<code class=\"language-text\">torch.multiprocessing<\/code>\u00a0will help create processes automatically, bypassing the\u00a0<code class=\"language-text\">torch.distributed.launch<\/code>\u00a0Some glitches in the automatic control of opening and exiting processes.<\/p>\n<p>As shown in the code below, the code that would otherwise require\u00a0<code class=\"language-text\">torch.distributed.launch<\/code>\u00a0The managed implementation is encapsulated into the\u00a0<code class=\"language-text\">main<\/code>\u00a0function by means of the\u00a0<code class=\"language-text\">torch.multiprocessing.spawn<\/code>\u00a0nprocs=8 processes are opened, each of which executes the\u00a0<code class=\"language-text\">main<\/code>\u00a0and pass local_rank (the current process index) into it.<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/886058f74182b0d6828dd04c7ddf4f04\/55811\/p8.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" 
src=\"https:\/\/hfai-static.high-flyer.cn\/static\/886058f74182b0d6828dd04c7ddf4f04\/55811\/p8.png\" sizes=\"(max-width: 501px) 100vw, 501px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/886058f74182b0d6828dd04c7ddf4f04\/222b7\/p8.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/886058f74182b0d6828dd04c7ddf4f04\/ff46a\/p8.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/886058f74182b0d6828dd04c7ddf4f04\/55811\/p8.png 501w\" alt=\"image.png\" \/><\/a><\/span><\/p>\n<p>Initiate training. As we can see from the graph below, the video memory is almost fully occupied and the average utilization of the 8 cards is high (95.8%).<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/034771139e2213367410e5b97647fb52\/84ee5\/p9.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/034771139e2213367410e5b97647fb52\/a6d36\/p9.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/034771139e2213367410e5b97647fb52\/222b7\/p9.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/034771139e2213367410e5b97647fb52\/ff46a\/p9.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/034771139e2213367410e5b97647fb52\/a6d36\/p9.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/034771139e2213367410e5b97647fb52\/e548f\/p9.png 975w,https:\/\/hfai-static.high-flyer.cn\/static\/034771139e2213367410e5b97647fb52\/84ee5\/p9.png 1076w\" alt=\"image.png\" \/><\/a><\/span>\u200b<\/p>\n<p>At the same time, each video card is more fully utilized, with a utilization rate of 80% or more:<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/882794889232274d41529dd36abe3f95\/d073d\/p10.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" 
class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/882794889232274d41529dd36abe3f95\/a6d36\/p10.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/882794889232274d41529dd36abe3f95\/222b7\/p10.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/882794889232274d41529dd36abe3f95\/ff46a\/p10.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/882794889232274d41529dd36abe3f95\/a6d36\/p10.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/882794889232274d41529dd36abe3f95\/e548f\/p10.png 975w,https:\/\/hfai-static.high-flyer.cn\/static\/882794889232274d41529dd36abe3f95\/d073d\/p10.png 1297w\" alt=\"image.png\" \/><\/a><\/span>\u200b<\/p>\n<p>In the end.<code class=\"language-text\">torch.distributed<\/code>Under the method, ResNet average per Epoch<strong>Takes 239s<\/strong>Around, compared to stand-alone single card acceleration<strong>About eight times more effective.<\/strong>The<\/p>\n<h3 id=\"apex\">apex<\/h3>\n<p>Apex is NVIDIA's open-source library for mixed-precision and distributed training, which encapsulates the process of mixed-precision training and allows you to change two or three lines of configuration to perform mixed-precision training, which significantly reduces memory usage and saves computing time. 
In addition, Apex also provides a package for distributed training, optimized for NVIDIA's NCCL communication library, which has been natively supported since PyTorch version 1.6, namely<code class=\"language-text\">torch.cuda.amp<\/code>The<\/p>\n<p>Direct use\u00a0<code class=\"language-text\">amp.initialize<\/code>\u00a0Packaging the model and optimizer, apex will automatically help us manage the model parameters and optimizer precision, and other configuration parameters can be passed in depending on the precision requirements.<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/b7a7aa514937fe1a8035b4cf71f9ed59\/d44c9\/p11.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/b7a7aa514937fe1a8035b4cf71f9ed59\/a6d36\/p11.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/b7a7aa514937fe1a8035b4cf71f9ed59\/222b7\/p11.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/b7a7aa514937fe1a8035b4cf71f9ed59\/ff46a\/p11.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/b7a7aa514937fe1a8035b4cf71f9ed59\/a6d36\/p11.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/b7a7aa514937fe1a8035b4cf71f9ed59\/d44c9\/p11.png 722w\" alt=\"image.png\" \/><\/a><\/span><\/p>\n<p>In terms of packaging for distributed training, Apex has not changed much, mainly optimizing NCCL communication. As a result, most of the code is still the same as the\u00a0<code class=\"language-text\">torch.distributed<\/code>\u00a0Keep it consistent. 
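Since, as noted above, mixed precision has been built into PyTorch since version 1.6, the same effect can be had without apex via torch.cuda.amp; a minimal native sketch follows (the tiny model is illustrative, and everything degrades to a no-op when no GPU is present):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

use_cuda = torch.cuda.is_available()
model = torch.nn.Linear(10, 2)
if use_cuda:
    model = model.cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler(enabled=use_cuda)  # loss scaling guards fp16 gradients from underflow

x = torch.randn(4, 10)
y = torch.randint(0, 2, (4,))
if use_cuda:
    x, y = x.cuda(), y.cuda()

optimizer.zero_grad()
with autocast(enabled=use_cuda):  # runs eligible ops in fp16, the rest in fp32
    loss = torch.nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()  # scale the loss up before backward
scaler.step(optimizer)         # unscale gradients, then step the optimizer
scaler.update()
```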
To use it, you just replace\u00a0<code class=\"language-text\">torch.nn.parallel.DistributedDataParallel<\/code>\u00a0with\u00a0<code class=\"language-text\">apex.parallel.DistributedDataParallel<\/code>\u00a0for wrapping the model.<\/p>\n<p>When computing the loss in the forward pass, Apex needs the\u00a0<code class=\"language-text\">amp.scale_loss<\/code>\u00a0wrapper so that the precision can be scaled automatically based on the loss value:<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/e13ffa82cd72bc92c84b7cc54408adc5\/ff42b\/p12.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/e13ffa82cd72bc92c84b7cc54408adc5\/ff42b\/p12.png\" sizes=\"(max-width: 484px) 100vw, 484px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/e13ffa82cd72bc92c84b7cc54408adc5\/222b7\/p12.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/e13ffa82cd72bc92c84b7cc54408adc5\/ff46a\/p12.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/e13ffa82cd72bc92c84b7cc54408adc5\/ff42b\/p12.png 484w\" alt=\"image.png\" \/><\/a><\/span><\/p>\n<p>Start training. 
As the figure below shows, with the same <code class=\"language-text\">batch_size<\/code>, GPU memory usage is only about 60%, while the average utilization of the 8 cards remains high (95.8%).<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/828d9008cfdd642b61515577bd904dbf\/87488\/p13.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/828d9008cfdd642b61515577bd904dbf\/a6d36\/p13.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/828d9008cfdd642b61515577bd904dbf\/222b7\/p13.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/828d9008cfdd642b61515577bd904dbf\/ff46a\/p13.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/828d9008cfdd642b61515577bd904dbf\/a6d36\/p13.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/828d9008cfdd642b61515577bd904dbf\/e548f\/p13.png 975w,https:\/\/hfai-static.high-flyer.cn\/static\/828d9008cfdd642b61515577bd904dbf\/87488\/p13.png 1282w\" alt=\"image.png\" \/><\/a><\/span>\u200b<\/p>\n<p>At the same time, each individual card is used fully, with utilization above 80%:<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/eb8903f9aa4abecd5b1d2c54b3253fbc\/21b4d\/p14.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/eb8903f9aa4abecd5b1d2c54b3253fbc\/a6d36\/p14.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/eb8903f9aa4abecd5b1d2c54b3253fbc\/222b7\/p14.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/eb8903f9aa4abecd5b1d2c54b3253fbc\/ff46a\/p14.png 
325w,https:\/\/hfai-static.high-flyer.cn\/static\/eb8903f9aa4abecd5b1d2c54b3253fbc\/a6d36\/p14.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/eb8903f9aa4abecd5b1d2c54b3253fbc\/e548f\/p14.png 975w,https:\/\/hfai-static.high-flyer.cn\/static\/eb8903f9aa4abecd5b1d2c54b3253fbc\/21b4d\/p14.png 1280w\" alt=\"image.png\" \/><\/a><\/span><\/p>\n<p>In the end, under the <code class=\"language-text\">apex<\/code> method, ResNet takes <strong>about 230s<\/strong> per epoch on average, slightly faster than <code class=\"language-text\">torch.distributed<\/code>. At the same time, <strong>apex also saves GPU memory and compute, which means a larger batch_size can be set for even faster training<\/strong>. In our stress tests, this yielded about a 50% performance improvement over the <code class=\"language-text\">torch.distributed<\/code> baseline.<\/p>\n<h3 id=\"\u591a\u673a\u591a\u5361\u8bad\u7ec3\u52a0\u901f\">Multi-machine, multi-card training acceleration<\/h3>\n<p>As the tests above show, 8 cards give an approximately 8x speedup. If we use more machines, can we go even faster? 
Here we take the\u00a0<code class=\"language-text\">apex<\/code>\u00a0code and configure 4 nodes with a total of 32 A100 GPUs for the experiment, with impressive results.<span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/ea59d00d6bf86822a0d2abceef8462ed\/218a4\/p15.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/ea59d00d6bf86822a0d2abceef8462ed\/a6d36\/p15.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/ea59d00d6bf86822a0d2abceef8462ed\/222b7\/p15.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/ea59d00d6bf86822a0d2abceef8462ed\/ff46a\/p15.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/ea59d00d6bf86822a0d2abceef8462ed\/a6d36\/p15.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/ea59d00d6bf86822a0d2abceef8462ed\/e548f\/p15.png 975w,https:\/\/hfai-static.high-flyer.cn\/static\/ea59d00d6bf86822a0d2abceef8462ed\/218a4\/p15.png 1052w\" alt=\"image.png\" \/><\/a><\/span>As can be seen, <strong>each epoch takes about 52 seconds in total<\/strong>, <strong>more than 30 times faster<\/strong> than the single-machine, single-card baseline, almost in line with the number of graphics cards.<\/p>\n<h3 id=\"\u5206\u5e03\u5f0f\u8bc4\u4ef7\">Distributed evaluation<\/h3>\n<p>The above shows the process of distributed training. 
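As a quick sanity check, the per-epoch times reported above imply the speedups quoted in this post:

```python
baseline = 1786  # seconds per epoch, single machine, single card

seconds_per_epoch = {
    'nn.DataParallel (8 cards)': 984,
    'torch.distributed (8 cards)': 239,
    'apex (8 cards)': 230,
    'apex (4 nodes, 32 cards)': 52,
}
speedups = {name: baseline / t for name, t in seconds_per_epoch.items()}
for name, s in speedups.items():
    print(f'{name}: {s:.1f}x')  # roughly 2x, 8x, 8x, and 30x+
```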
However, it is just as important to be able to run inference and evaluation on the trained model quickly. For example:<\/p>\n<ol>\n<li>The training samples are sliced into several shards, each handled by a separate process on a separate GPU; how do the processes communicate to aggregate this (on-GPU) information?<\/li>\n<li>Inference and evaluation on a single card are too slow; how can we evaluate in a distributed way with Distributed and gather the results together?<\/li>\n<li>\u2026<\/li>\n<\/ol>\n<p>To solve these problems, we need a more basic API that can <strong>aggregate metrics such as accuracy and loss values generated on different GPUs<\/strong>. This API is torch.distributed.reduce.<\/p>\n<p>APIs like reduce are the communication primitives of\u00a0<code class=\"language-text\">torch.distributed<\/code>; they help us control the interaction between processes and the transfer of GPU data, and are useful for customizing GPU collaboration logic and for aggregating small amounts of statistics across GPUs. 
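As a concrete illustration, the following sketch spawns two processes and uses torch.distributed.reduce to sum their losses onto rank 0 (the gloo backend is used so it also runs on CPU, and the per-rank loss values are made up):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def evaluate(rank, world_size):
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29501')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    loss = torch.tensor([float(rank + 1)])  # stand-in for this rank's summed loss
    dist.reduce(loss, dst=0, op=dist.ReduceOp.SUM)  # rank 0 ends up with the sum
    if rank == 0:
        print('total loss:', loss.item())  # 1.0 + 2.0 = 3.0 with two ranks
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(evaluate, args=(2,), nprocs=2)
```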
Proficiency with these APIs also helps us design and optimize our own distributed training and evaluation pipelines.<span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/296becc7cfe080c8de524b9cf43e4ba7\/854dc\/p16.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/296becc7cfe080c8de524b9cf43e4ba7\/854dc\/p16.png\" sizes=\"(max-width: 569px) 100vw, 569px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/296becc7cfe080c8de524b9cf43e4ba7\/222b7\/p16.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/296becc7cfe080c8de524b9cf43e4ba7\/ff46a\/p16.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/296becc7cfe080c8de524b9cf43e4ba7\/854dc\/p16.png 569w\" alt=\"image.png\" \/><\/a><\/span><\/p>\n<p>As shown in the figure above, reduce works in two steps:<\/p>\n<ol>\n<li>After reduce(tensor, op=...) is called, the destination process receives the tensors sent by the other processes.<\/li>\n<li>Once all tensors have been received, the destination process (e.g., rank 0) applies the op operation to its own tensor and the received ones.<\/li>\n<\/ol>\n<p>With these two steps, we can sum the training losses computed on different GPUs:<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/755f130d5fa29aeee931ae3051fe5921\/97655\/p17.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"image.png\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/755f130d5fa29aeee931ae3051fe5921\/a6d36\/p17.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/755f130d5fa29aeee931ae3051fe5921\/222b7\/p17.png 
163w,https:\/\/hfai-static.high-flyer.cn\/static\/755f130d5fa29aeee931ae3051fe5921\/ff46a\/p17.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/755f130d5fa29aeee931ae3051fe5921\/a6d36\/p17.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/755f130d5fa29aeee931ae3051fe5921\/97655\/p17.png 819w\" alt=\"image.png\" \/><\/a><\/span><\/p>\n<p>Besides\u00a0<code class=\"language-text\">reduce<\/code>, PyTorch officially provides six collective communication primitives, including\u00a0<code class=\"language-text\">scatter<\/code>,\u00a0<code class=\"language-text\">gather<\/code>\u00a0and\u00a0<code class=\"language-text\">all-reduce<\/code>; refer to the official documentation for details:\u00a0<a href=\"https:\/\/ptorch.com\/docs\/1\/distributed\">https:\/\/ptorch.com\/docs\/1\/distributed<\/a><\/p>\n<h3 id=\"\u603b\u7ed3\">Summary<\/h3>\n<p>This article has introduced several PyTorch parallel training methods and tested them on the Firefly II cluster. The results show that multi-node, multi-GPU parallel training effectively improves efficiency, and that apex combined with torch.distributed makes the most of the GPUs' compute, yielding a speedup that scales almost linearly with the number of cards. At the same time, we found that as the number of GPUs grows, per-GPU utilization gradually drops, limited by the reduce computation during gradient aggregation. 
How to optimize the reduce step efficiently, raising GPU utilization as much as possible and thus accelerating overall training, is a topic well worth further research.<\/p>","protected":false},"excerpt":{"rendered":"<p>2018\u5e74\uff0c\u5c06\u8fd13\u4ebf\u53c2\u6570\u7684Bert\u6a21\u578b\u6a2a\u7a7a\u51fa\u4e16\uff0c\u5c06NLP\u9886\u57df\u63a8\u5411\u4e86\u65b0\u7684\u9ad8\u5ea6\u3002\u8fd1\u5e74\u6765\u4eba\u5de5\u667a\u80fd\u9886\u57df\u7684\u53d1\u5c55\u6108\u6765\u6108\u8d8b\u5411 [&hellip;]<\/p>","protected":false},"author":1,"featured_media":502,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[9],"tags":[],"class_list":["post-501","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-parallel-optimization"],"acf":[],"_links":{"self":[{"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/posts\/501","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/comments?post=501"}],"version-history":[{"count":1,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/posts\/501\/revisions"}],"predecessor-version":[{"id":503,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/posts\/501\/revisions\/503"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/media\/502"}],"wp:attachment":[{"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/media?parent=501"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-jso
n\/wp\/v2\/categories?post=501"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/tags?post=501"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}