HIGH-FLYER | AI BLOG

New Releases

FlashAttention: A Novel, Fast and Memory-Efficient Attention Algorithm with IO-Awareness

At the heart of the Transformer model is the self-attention mechanism, whose time and memory complexity are both O(N²) in the sequence length. As large language models (LLMs) continue to grow in scale, equipping them with longer contexts poses a significant engineering challenge. A team of researchers from the Department of Computer Science at Stanford University and the State University of New York at Buffalo has published a novel attention algorithm called FlashAttention, which not only supports longer context than PyTorch…
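For context, here is a minimal sketch (our own illustration, not the FlashAttention implementation) of where the quadratic cost in standard attention comes from: the full N×N score matrix is materialized before the softmax, and it is exactly this memory traffic that FlashAttention avoids by computing attention block by block in on-chip SRAM. The shapes and names below are purely illustrative.

```python
# Naive attention: the (N, N) score matrix is formed in full -> O(N^2) memory.
import numpy as np

def naive_attention(Q, K, V):
    # Q, K, V: (N, d) arrays; toy shapes for illustration only.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (N, N) score matrix
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # (N, d) output

N, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d), dtype=np.float32) for _ in range(3))
out = naive_attention(Q, K, V)  # the (4096, 4096) score matrix dominates memory use
```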

3FS Optimization 01 | Server-Side Optimization

As introduced in the article "High-Flyer Power | High-Speed File System 3FS", High-Flyer AI has designed 3FS, a file system for sample reading that is well suited to deep learning training. By using Direct IO and RDMA Read, 3FS lets model training obtain high read bandwidth in the sample-reading phase with minimal CPU and memory overhead, eliminating the wait for data loading during training and making fuller use of the GPU's computational performance. As we know, a file system is generally divided into a client side and a server side. In the 3FS file system…
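As a rough illustration (our own sketch, not code from 3FS), the snippet below shows what a Direct IO read looks like on Linux: the file is opened with O_DIRECT and the data is read into a page-aligned buffer, bypassing the kernel page cache so the read costs little CPU and memory. The helper name, path, and sizes are hypothetical.

```python
# Minimal Direct IO read sketch (Linux): O_DIRECT bypasses the page cache,
# so data moves straight into an aligned user buffer with low CPU/memory cost,
# the same basic idea 3FS relies on for its sample-reading path.
import mmap
import os

ALIGN = 4096  # typical alignment required by O_DIRECT

def direct_read(path: str, size: int) -> bytes:
    # Round the request up to the alignment O_DIRECT expects.
    aligned_size = (size + ALIGN - 1) // ALIGN * ALIGN
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        # An anonymous mmap is page-aligned, satisfying O_DIRECT's
        # buffer-alignment requirement.
        buf = mmap.mmap(-1, aligned_size)
        n = os.readv(fd, [buf])
        return buf[:min(n, size)]
    finally:
        os.close(fd)

# Hypothetical usage: read the first 1 MiB of a training-sample file.
# data = direct_read("/path/to/sample.bin", 1 << 20)
```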

3FS Optimization 02 | Client Memory Usage Optimization

As introduced in the article "High-Flyer Power | High-Speed File System 3FS", High-Flyer AI has designed 3FS, a file system for sample reading that is ideal for deep learning training. By adopting Direct IO and RDMA Read, 3FS lets model training obtain ultra-high read bandwidth in the sample-reading portion of the program with minimal CPU and memory overhead, eliminating the wait for data loading during training and making fuller use of the GPU's computational performance. As we know, a file system is generally divided into a client side and a server side. In the 3FS file system, the client part…

3FS Optimization 03 | Data Read Mode Adaptation

As introduced in the article "High-Flyer Power | High-Speed File System 3FS", High-Flyer AI has designed 3FS, a file system for sample reading that is ideal for deep learning training. By using Direct IO and RDMA Read, 3FS lets model training obtain high read bandwidth in the sample-reading portion of the program with minimal CPU and memory overhead, eliminating the wait for data loading during training and making fuller use of the GPU's computational performance. However, in practice we encountered many problems we did not anticipate, such as the problem of interactions between tasks.