HIGH-FLYER | AI BLOG

New Releases

FlashAttention: A Fast, Memory-Efficient Attention Algorithm with IO Awareness

At the heart of the Transformer is the self-attention mechanism, whose time and memory complexity are both O(N²) in the sequence length N. As large language models (LLMs) continue to grow in scale, giving them longer context windows poses a significant engineering challenge. A team of researchers from the Department of Computer Science at Stanford University and the State University of New York at Buffalo has published a novel attention algorithm called FlashAttention, which not only supports longer contexts than PyTorch…
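As a rough illustration of where the O(N²) cost comes from, here is a minimal PyTorch sketch of standard (non-fused) attention. The tensor names and shapes are illustrative assumptions, not taken from the paper or the post.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    """Standard scaled dot-product attention.

    q, k, v: (batch, heads, seq_len, head_dim)
    The scores tensor below has shape (batch, heads, seq_len, seq_len),
    so its memory footprint grows quadratically with seq_len -- the O(N^2)
    term that an IO-aware kernel like FlashAttention avoids materializing
    in GPU HBM.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (B, H, N, N)
    probs = F.softmax(scores, dim=-1)
    return torch.matmul(probs, v)

# Illustrative sizes: doubling seq_len quadruples the score matrix.
q = k = v = torch.randn(1, 8, 1024, 64)
out = naive_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```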

PyTorch Distributed Training Method

In 2018, BERT, with roughly 300 million parameters, burst onto the scene and pushed the NLP field to new heights. In recent years, AI research has increasingly shifted toward large models, and the major AI players have all released models with hundreds of billions of parameters, giving rise to many new AI application scenarios. At the same time, a variety of factors continue to drive the rapid development of large models: 1) society is undergoing a deep digital transformation, and vast amounts of data are gradually being consolidated, creating many AI application scenarios and needs; 2) hardware technology continues to advance: the NVIDIA A100 GPU, Go…
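For orientation, below is a minimal sketch of a PyTorch data-parallel training loop using DistributedDataParallel, the standard starting point for multi-GPU training. The model, batch sizes, and launch assumptions (torchrun setting the usual environment variables) are placeholders, not the configuration discussed in the article.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes launch via torchrun, which sets RANK / LOCAL_RANK / WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; replace with the real network.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()   # DDP all-reduces gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```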

AlphaFold Training Optimization 01 | Data Processing Optimization

If there was one standout achievement in AI research in 2021, AlphaFold deserves the title. AlphaFold2 achieved far greater accuracy than comparable models on the CASP14 protein structure prediction challenge and, for the first time, raised the accuracy of protein structure prediction to the atomic level, which is already close to that of experimental measurement. The High-Flyer AI team successfully trained and ran AlphaFold2 on the Firefly 2 platform shortly after its release, as described in our previous article, "Firefly Running Models | AlphaFold".

AlphaFold Training Optimization 02 | Multi-Card Training Speedup

As mentioned in the previous article, High-Flyer AI improved the overall training performance of AlphaFold by optimizing data processing, using both feature preprocessing and feature cropping. High-Flyer AI also has a number of parallel training accelerators, such as hfreduce, 3FS, and the hfai.nn operator library; can they further speed up AlphaFold's overall training? In this issue, we experiment with these questions. hfreduce: As mentioned in the previous article "High-Flyer Power | Model Parallel Training Tool: hfreduce", due to High-Flyer AI's architecture…
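To make the role of hfreduce concrete without guessing at the hfai API, here is a generic PyTorch sketch of the gradient all-reduce pattern that a data-parallel communication tool like hfreduce performs across cards; this uses plain torch.distributed for clarity and is not the hfai interface.

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module):
    """Average each parameter's gradient across all data-parallel workers.

    This is the core communication step that dedicated tools such as
    hfreduce optimize; shown here with plain torch.distributed calls,
    not the hfai library itself.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)

# Usage inside a training step (process group already initialized):
#   loss.backward()
#   allreduce_gradients(model)
#   optimizer.step()
```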

AlphaFold Training Optimization 03 | Pitfall Diary

The previous two articles showed how High-Flyer AI optimized AlphaFold: feature preprocessing and feature cropping improved the performance of AlphaFold's data processing, parallel training acceleration tools further improved the model's training speed, and AlphaFold was deeply integrated with the characteristics of High-Flyer AI's clusters to extract maximum computational efficiency. Looking at the overall picture, what else should we pay attention to when training AlphaFold on Firefly 2, and how can the same kind of deep learning model be optimized in the future? On these topics, this article will share High-Flyer AI's…