HIGH-FLYER | AI BLOG

FlashAttention: A Novel Attention Algorithm with IO Awareness, Fast and Memory-Efficient

At the heart of the Transformer model is the self-attention mechanism, whose time and memory complexity both grow as O(N²) in the sequence length. As large language models (LLMs) continue to grow in scale, equipping them with longer contexts poses a significant engineering challenge. A team of researchers from the Department of Computer Science at Stanford University and the State University of New York at Buffalo has published a novel attention algorithm called FlashAttention, which not only enables longer context than PyTorch…
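
For reference, here is a minimal sketch of the standard (non-FlashAttention) attention computation, showing where the O(N²) cost comes from; it is a generic illustration rather than code from the paper or from High-Flyer's stack:

    import torch

    def naive_attention(q, k, v):
        # q, k, v: (N, d). The score matrix s has shape (N, N), so time and memory
        # grow as O(N^2) in the sequence length N -- the bottleneck FlashAttention
        # sidesteps by tiling the computation so the full matrix is never materialized.
        s = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        p = torch.softmax(s, dim=-1)
        return p @ v

    q = k = v = torch.randn(1024, 64)
    out = naive_attention(q, k, v)  # materializes a 1024 x 1024 attention matrix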

PyTorch Distributed Training Method

In 2018, the BERT model, with nearly 300 million parameters, appeared seemingly out of nowhere and pushed the NLP field to new heights. In recent years, research in artificial intelligence has increasingly shifted toward large models, and the major AI companies have all released models with hundreds of billions of parameters, giving rise to many new AI application scenarios. At the same time, a variety of factors continue to drive the rapid development of large models: 1) society is undergoing a deep digital transformation, and large amounts of data are gradually being aggregated, producing many AI application scenarios and needs; 2) hardware technology continues to advance: the NVIDIA A100 GPU, Go…

AlphaFold Training Optimization 01 | Data Processing Optimization

If one achievement stood out as the most exciting in AI research in 2021, AlphaFold deserves the title. AlphaFold 2 achieved far greater accuracy than comparable models on the CASP14 protein structure prediction challenge and, for the first time, improved the accuracy of protein structure prediction to the atomic level, which is already close to that of experimental measurements. The High-Flyer AI team successfully trained and ran AlphaFold 2 on the Fire-Flyer 2 platform shortly after its release, as described in our previous article, "Fire-Flyer Running Models | AlphaFold".

AlphaFold Training Optimization 02 | Multi-Card Training Speedup

As mentioned in the previous article, High-Flyer AI improved the overall training performance of AlphaFold by optimizing data processing, using both feature preprocessing and feature cropping. As is well known, High-Flyer AI has many parallel-training accelerators, such as hfreduce, 3FS, and the hfai.nn operator library; can they further accelerate the overall training of AlphaFold? In this issue, we experiment with these questions. hfreduce: As mentioned in the previous article "High-Flyer Power | Model Parallel Training Tool: hfreduce", due to the High-Flyer AI architecture…
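
To make hfreduce's role concrete, the sketch below shows the generic gradient all-reduce step that such a tool accelerates, written with plain torch.distributed; the function name is illustrative and this is not hfreduce's actual API:

    import torch
    import torch.distributed as dist

    def allreduce_gradients(model: torch.nn.Module):
        # Average gradients across all workers after backward(); assumes
        # dist.init_process_group() has already been called. This is the
        # collective that hfreduce-style tools optimize; the code here is
        # only a generic illustration.
        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad.div_(world_size)

    # Typical training step:
    #   loss.backward()
    #   allreduce_gradients(model)
    #   optimizer.step()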

AlphaFold Training Optimization 03 | Pitfall Diary

The previous two articles showed how High-Flyer AI optimized AlphaFold: feature preprocessing and feature cropping improved the performance of AlphaFold's data processing, parallel-training acceleration tools further increased the model's training speed, and AlphaFold was deeply adapted to the characteristics of High-Flyer AI's clusters to maximize computational efficiency. From an overall point of view, what else do we need to pay attention to when training AlphaFold on Fire-Flyer 2, and how should we optimize similar deep learning models in the future? On these topics, this article discusses High-Flyer AI's…

3FS Optimization 01 | Server-Side Optimization

As introduced in the article "High-Flyer Power | High-Speed File Series 3FS", High-Flyer AI has designed a file system for sample reading, 3FS, which is very well suited to deep learning training. 3FS uses Direct IO and RDMA Read, allowing model training to obtain high read bandwidth for sample reading with minimal CPU and memory overhead, which eliminates the need to wait for data to load during training and more fully utilizes the computational performance of the GPU. As we know, file systems are generally divided into a client side and a server side. In the 3FS file system…
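
As a rough illustration of what Direct IO means on the client side, the sketch below reads one block with O_DIRECT from plain Python on Linux, bypassing the kernel page cache; the file name is hypothetical and this is not 3FS client code:

    import os
    import mmap

    BLOCK = 4096  # O_DIRECT requires block-aligned offsets, sizes, and buffers

    # Hypothetical sample file; O_DIRECT makes the read bypass the page cache.
    fd = os.open("sample.bin", os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, BLOCK)       # anonymous mmap gives a page-aligned buffer
        nread = os.preadv(fd, [buf], 0)  # read the first block without caching it
        data = buf[:nread]
    finally:
        os.close(fd)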

3FS Optimization 02 | Client Memory Usage Optimization

As introduced in the article "High-Flyer Power | High-Speed File Series 3FS", High-Flyer AI has designed a file system for sample reading, 3FS, that is ideal for deep learning training. 3FS adopts Direct IO and RDMA Read, allowing model training to obtain ultra-high read bandwidth for sample reading with minimal CPU and memory overhead, which eliminates the need to wait for data to load during training and more fully utilizes the computational performance of the GPU. As we know, a file system is generally divided into a client side and a server side. In the 3FS file system, the client part…

3FS Optimization 03 | Data Read Mode Adaptation

As introduced in the article "High-Flyer Power | High-Speed File Series 3FS", High-Flyer AI has designed a file system for sample reading, 3FS, which is ideal for deep learning training. 3FS uses Direct IO and RDMA Read, allowing model training to obtain high read bandwidth for sample reading with minimal CPU and memory overhead, which eliminates the need to wait for data to load during training and more fully utilizes the GPU's computational performance. However, in practice, many problems arose that we did not anticipate, such as interference between tasks.

A bit of our practice in reducing network congestion (I)

For deep learning developers and researchers, high-performance computing power is an important weapon for making their research succeed. Among the factors that affect the speed of deep learning training, the important role of network transmission is easy to overlook. Especially in large-scale clusters and distributed training scenarios, network congestion can directly prevent GPU computing power from being put to use, much like an eight-lane, two-way expressway: if the traffic is poorly planned, the highway degrades into a large parking lot. In this issue, we share some of High-Flyer AI's thinking and optimization in this direction on the topic of the network. First, let's talk about network…

hfai python | Task Submission at Will, Fire-Flyer Training on the Fly

High-Flyer AI has released hfai, the deep learning suite it has used internally for many years, attracting many fellow researchers and developers to ask about trying it out. The suite has many features, and once you are familiar with its conventions, you can easily invoke the platform's compute resources to complete training tasks efficiently. For this reason, we have created a series of articles called “How to Use hfai” to introduce the design ideas and principles behind some of hfai's features, so that you can learn them better and faster and become comfortable with hfai's set of “special skills”. To cope with the challenges of deep learning jobs…