Alphafold Training Optimization 02 | Multi-Card Training Speedup
As described in the previous article, High-Flyer AI improved Alphafold's overall training performance by optimizing data processing, through both feature preprocessing and feature cropping. High-Flyer AI also maintains a number of tools for accelerating parallel training, such as hfreduce, 3FS, and the hfai.nn operator library. Can these further accelerate Alphafold training as a whole? This article experiments with these questions.
hfreduce
As described in the previous post, High-Flyer | Model Parallel Training Tool: hfreduce, Nvidia's official NCCL tool cannot fully exploit the communication bandwidth of the Firefly II cluster due to the characteristics of High-Flyer AI's architecture. High-Flyer AI therefore developed its own hfreduce tool to optimize inter-GPU communication and improve the efficiency of multi-machine, multi-GPU parallel training. It is very simple to use: just replace PyTorch's DDP with hfai's DDP:
model = hfai.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
The Alphafold model is characterized by a long forward propagation path and a short backward propagation path, i.e. it is "small in parameter size but large in computation". We use data parallelism to spread the data across different GPUs for parallel acceleration and thus speed up training. We therefore introduce hfreduce into Alphafold training, using the hfai.ddp tools for data parallelism, to try to further push Alphafold's computational efficiency on the Firefly II cluster.
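At the heart of this kind of data parallelism is the gradient all-reduce that every DDP backend (NCCL or hfreduce) performs after each backward pass: every worker's local gradients are averaged so all replicas apply the same update. The following is a pure-Python toy sketch of that averaging step, not hfreduce's actual implementation, which runs over the Firefly II interconnect:

```python
# Toy illustration of the gradient all-reduce performed by DDP backends
# such as NCCL or hfreduce after each backward pass. Real implementations
# run over the cluster interconnect; this only shows the arithmetic.

def allreduce_mean(worker_grads):
    """Average per-parameter gradients across all workers.

    worker_grads: one list of gradient values per GPU.
    Returns the synchronized gradient that every worker ends up holding.
    """
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [
        sum(grads[i] for grads in worker_grads) / n_workers
        for i in range(n_params)
    ]

# Two workers, each holding local gradients for three parameters:
local = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
synced = allreduce_mean(local)  # every worker receives [2.0, 3.0, 4.0]
```

The cost of this step scales with the parameter count, not with the amount of computation, which is why it matters how "small in parameters" Alphafold is.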
In practice, however, we found that hfreduce brings only a modest absolute improvement to Alphafold training. Why is that? We analyze this question in the experimental section below, in light of the experimental results.
Code repository: https://github.com/HFAiLab/alphafold-optimized
Experimental Settings
We designed and conducted three sets of experiments to compare the speedup from using hfreduce. The experimental settings are shown in the following table:
| No. | GPU count | DDP backend | Iterations | Timed operations | Feature cropping |
|---|---|---|---|---|---|
| 1 | 1 | None | 200 | loss.backward() | maximum |
| 2 | 128 | NCCL | 200 | loss.backward() | maximum |
| 3 | 128 | hfreduce | 200 | loss.backward() + model.sync_grad() | maximum |
We want to observe the extra overhead of multi-card communication directly by comparing the time spent in multi-card and single-card training. We therefore measured the average backward time over 200 iterations for each of the three training setups.
Here, "feature cropping: maximum" means the features are cropped as aggressively as possible, so that during multi-card training no GPU is left waiting a long time for gradient synchronization because of feature processing, which would distort the results. Since hfreduce splits gradient synchronization out as a separate step, the two functions must be summed to obtain the total time of gradient computation plus gradient synchronization.
In addition, the basic training parameters are:
- Number of training samples: 73,472
- Batch Size: 1
Results
Based on the experimental setup above, the complete training results with NCCL and with hfreduce parallel acceleration are shown in the following table:
| | NCCL | hfreduce |
|---|---|---|
| Average time per step (s) | 11.27 | 11.26 |
Comparing the total time of a single iteration directly, the results make clear that the choice between hfreduce and NCCL makes no significant difference to efficiency in Alphafold's training scenario. In theory, hfreduce fits the Firefly II cluster architecture better than NCCL and should deliver faster gradient synchronization than NCCL-based DDP. Yet in the Alphafold experiments, the per-step iteration speed of the two runs is almost identical. Why?
The main reason is likely that Alphafold's distinctive model structure and the characteristics of protein data mean that, although the model requires a huge amount of GPU computation, it transmits very little gradient data. Alphafold has only 93.2M parameters, i.e. fewer than 100 million, occupying just 372.8 MB of GPU memory in FP32 single precision. During forward propagation, Alphafold runs the main part of the model (the Evoformer) four times, which makes the forward path very long and the computation time much higher. In the end, however, only the original set of parameters needs to be updated, so the overhead of gradient synchronization appears negligible within each iteration.
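The memory figure follows directly from the parameter count: FP32 stores each parameter in 4 bytes, so 93.2M parameters occupy 372.8 MB. A quick check:

```python
# FP32 uses 4 bytes per parameter, so model size in MB is simply the
# parameter count (in millions) times 4.
n_params_millions = 93.2
bytes_per_fp32 = 4
model_size_mb = n_params_millions * bytes_per_fp32  # 372.8 MB
```

This is also the volume of gradient data each worker must synchronize per step, which is tiny relative to the per-step compute time.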
So, given that gradient synchronization is not a significant cost, can we break down the time of each part with profiling for quantitative optimization? We ran a more detailed test on the backward pass; the final results are shown in the following table:
| No. | GPU count | DDP backend | Average backward time (s) | Communication overhead (s) |
|---|---|---|---|---|
| 1 | 1 | None | 5.63 | – |
| 2 | 128 | NCCL | 6.03 | 0.40 |
| 3 | 128 | hfreduce | 5.91 | 0.28 |
Here, leaving aside waiting time caused by workers progressing through data processing at different rates, the backward time of multi-card training minus the backward time of single-card training can be taken as the extra communication overhead introduced by multi-machine, multi-card training.
From the table above, we can see that when training with DDP on Firefly II, choosing hfreduce saves about 30% of the communication overhead compared with NCCL, a considerable amount, which shows that hfreduce is indeed better suited to accelerating multi-machine, multi-card training on High-Flyer's Firefly II. At the same time, more than 90% of the time is spent in the forward computation, where the layer_norm and attention operators are especially expensive; further gains require hfai's optimized operators.
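The overhead estimate described above is a simple subtraction; plugging in the numbers from the table reproduces both the per-backend overheads and the roughly 30% saving:

```python
# Per-step communication overhead = multi-card backward time minus
# single-card backward time (numbers from the table above), then the
# relative saving of hfreduce over NCCL.
single_card_backward = 5.63
nccl_backward = 6.03
hfreduce_backward = 5.91

nccl_overhead = nccl_backward - single_card_backward          # ~0.40 s
hfreduce_overhead = hfreduce_backward - single_card_backward  # ~0.28 s
saving = 1 - hfreduce_overhead / nccl_overhead                # ~30%
```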
hfai.nn
As described in the previous post, High-Flyer Firefly | High-Performance Deep Learning Operators hfai.nn, High-Flyer AI has deeply optimized the PyTorch framework in combination with the cluster characteristics of Firefly II, redesigning and reimplementing a number of commonly used AI operators to further improve overall model training efficiency. It is very simple to use: just add one line to any existing model training code:
model = hfai.nn.to_hfai(model)
hfai automatically scans the model and substitutes the optimized operators. hfai.nn provides optimized operators for MultiHeadAttention, LayerNorm, LSTM, and other structures common in mainstream deep learning models, which can greatly accelerate these operators within a model. We introduce the hfai.nn library into Alphafold to test how much it benefits training.
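Conceptually, an operator-swapping call like to_hfai walks the module tree and replaces each supported layer with its optimized counterpart. The sketch below illustrates that idea in plain Python; the classes and the `children` attribute are hypothetical stand-ins for demonstration, not hfai's or PyTorch's actual API:

```python
# Toy sketch of the idea behind hfai.nn.to_hfai: recursively walk a module
# tree and swap each supported layer type for its optimized counterpart.
# All classes here are hypothetical stand-ins, not the real hfai API.

class LayerNorm:                   # stand-in for a standard layer
    pass

class HfaiLayerNorm(LayerNorm):    # stand-in for the optimized operator
    pass

class Block:
    """A container module holding named child modules."""
    def __init__(self, children=None):
        self.children = children or {}

def to_optimized(module, mapping):
    """Replace every child whose type appears in `mapping`, recursing
    into containers."""
    for name, child in module.children.items():
        if type(child) in mapping:
            module.children[name] = mapping[type(child)]()
        elif isinstance(child, Block):
            to_optimized(child, mapping)
    return module

model = Block({"norm": LayerNorm(),
               "sub": Block({"norm": LayerNorm()})})
model = to_optimized(model, {LayerNorm: HfaiLayerNorm})
```

Because the swap is purely structural, the rest of the training code is untouched, which is why a single added line suffices.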
Alphafold GPU Overhead Analysis
Since the speedup from hfai.nn's operators depends on how much each type of operation contributes to total GPU overhead, we first used the PyTorch Profiler to analyze the GPU time of the Alphafold model when running with standard torch operators. In an iteration with a total elapsed time of 12.6 s, the main overheads are as follows:
| Name | CUDA Time | CUDA % |
|---|---|---|
| aten::native_layer_norm | 2.105s | 16.81% |
| void at::native::(anonymous namespace)::RowwiseMomen... | 1.866s | 14.90% |
| aten::mm | 1.716s | 13.70% |
| aten::native_layer_norm_backward | 1.642s | 13.11% |
| aten::add_ | 1.454s | 11.61% |
The top five operation types alone account for about 70% of Alphafold's training overhead, with LayerNorm and the matrix operations inside Attention the main sources of time: LayerNorm operations by themselves take roughly 30% of the total. Compared with ordinary Transformer-class models such as BERT, Alphafold is more complex and uses its own Attention implementation, so its Attention computation cannot benefit from hfai.nn's acceleration. However, LayerNorm, which also accounts for a large share of the time, can use the hfai operator, so hfai.nn can still be expected to accelerate Alphafold noticeably.
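Both percentages quoted above can be read straight off the profile table: summing the CUDA % column for the top five kernels, and summing the forward and backward LayerNorm entries:

```python
# CUDA % shares of the top-5 kernels from the profile table above.
top5 = [16.81, 14.90, 13.70, 13.11, 11.61]
total_top5 = sum(top5)            # ~70% of all GPU time

# LayerNorm forward + backward kernels alone:
layer_norm_share = 16.81 + 13.11  # ~30% of all GPU time
```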
To explore the concrete speedup, we first analyzed how LayerNorm is used in Alphafold. In general, the shape of the input tensor has a large impact on operator acceleration, so we begin with a theoretical analysis of the different input tensor shapes reaching the model's LayerNorm layers. In a typical Transformer-like model, the input shape of LayerNorm is usually [BatchSize, SeqLen, EmbDim]. In Alphafold, because the model structure differs considerably from a plain Transformer, the input tensor shapes are more varied. The frequency of each input shape to the LayerNorm layers is shown in the following table:
| Input Shape | Run% | Relative Perf% | Torch Run Time(s) | Hfai Run Time(s) |
|---|---|---|---|---|
| [1,256,256,128] | 59.96% | 364.07% | 0.003 | 0.0008 |
| [1,132,256,256] | 13.2% | 267.02% | 0.0017 | 0.0007 |
| [1,1,256,256,64] | 8.03% | 379.80% | 0.0028 | 0.0007 |
| [1,128,256,256] | 7.92% | 276.52% | 0.0017 | 0.0006 |
| [1,256,132,256] | 4.4% | 276.13% | 0.0017 | 0.0006 |
| [1,256,128,256] | 2.64% | 272.92% | 0.0017 | 0.0006 |
| [1,256,384] | 1.87% | 80.50% | 0.0002 | 0.0003 |
| [1,1024,256,64] | 1.32% | 471.77% | 0.0111 | 0.0024 |
| [1,256,1024,64] | 0.44% | 474.66% | 0.0111 | 0.0024 |
| [1,4,256,256,64] | 0.11% | 474.41% | 0.0111 | 0.0024 |
| [1,256,256] | 0.11% | 93.81% | 0.0003 | 0.0003 |
For most inputs, the LayerNorm operator provided by hfai.nn achieves a severalfold performance improvement over torch.nn, and only for a very few shapes is it slightly slower. Moreover, weighting by the proportions of the different input tensor shapes in the table above, using hfai.nn yields an overall speedup of roughly 350% (about 3.5×) on the GPU overhead of Alphafold's LayerNorm operators. Since LayerNorm is used frequently and in many parts of Alphafold, a significant acceleration of the model as a whole can be expected.
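The overall figure can be reproduced from the table by weighting each shape's torch and hfai runtimes by how often that shape occurs; total torch time over total hfai time comes out at roughly 3.5×:

```python
# (run fraction, torch time s, hfai time s) per LayerNorm input shape,
# taken from the table above.
shapes = [
    (0.5996, 0.0030, 0.0008),
    (0.1320, 0.0017, 0.0007),
    (0.0803, 0.0028, 0.0007),
    (0.0792, 0.0017, 0.0006),
    (0.0440, 0.0017, 0.0006),
    (0.0264, 0.0017, 0.0006),
    (0.0187, 0.0002, 0.0003),
    (0.0132, 0.0111, 0.0024),
    (0.0044, 0.0111, 0.0024),
    (0.0011, 0.0111, 0.0024),
    (0.0011, 0.0003, 0.0003),
]
torch_total = sum(f * t for f, t, _ in shapes)
hfai_total = sum(f * h for f, _, h in shapes)
speedup = torch_total / hfai_total  # ~3.5x, i.e. roughly 350%
```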
Actual training acceleration results
Having analyzed in theory the magnitude of the speedup hfai.nn can bring to Alphafold, we also want to know the benefit it yields during real training. We therefore ran parallel training on 128 GPUs in a real scenario with the same settings as Alphafold pre-training, and measured the training time using the torch operators and the hfai operators respectively; the results are shown in the following table:
| Operator library | GPU count | Iterations | Average time per step |
|---|---|---|---|
| torch.nn | 128 | 200 | 11.26s |
| hfai.nn | 128 | 200 | 8.64s |
Compared with training on torch.nn operators, hfai.nn reduces the average time of a single iteration from 11.26 s to 8.64 s. Adding just one line of code to bring in hfai.nn during Alphafold training is thus enough to obtain a training performance improvement of about 30%. Notably, because Alphafold's custom Attention implementation does not use the standard nn.MultiHeadAttention, the operator acceleration obtained here comes entirely from the reduced cost of LayerNorm operations. For common, standard Transformer-class models, hfai.nn is likely to deliver an even larger performance gain.
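The 30% figure refers to throughput: a step-time drop from 11.26 s to 8.64 s means about 1.30× more iterations per unit time, equivalently about 23% less wall time per step:

```python
# Speedup from swapping in hfai.nn operators, from the table above.
torch_step = 11.26
hfai_step = 8.64
speedup = torch_step / hfai_step         # ~1.30x throughput (~30% more)
time_saved = 1 - hfai_step / torch_step  # ~23% less wall time per step
```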
Summary
Through the two experiments above, we verified the acceleration that hfreduce and the hfai.nn operators bring to the Alphafold model. They show different degrees of acceleration, indicating that during model optimization we need to choose the acceleration scheme appropriate to the structural characteristics of the model. High-Flyer AI will continue to invest in research and development to provide more optimization solutions and tools.
- Author: suopu
You may reproduce, excerpt, and quote the content of this technical blog without distorting the original meaning of the work, subject to the following terms: Attribution — you must credit the original author, but must not in any way imply that High-Flyer endorses you, nor adversely affect High-Flyer's rights. NonCommercial — you may not use the content of this technical blog for commercial purposes. NoDerivatives — if you adapt, transform, or build upon this content, you may not publish or distribute the modified content; it is for personal use only.