HIGH-FLYER | AI BLOG

New Releases

FlashAttention: A Novel IO-Aware Attention Algorithm, Fast and Memory-Efficient

At the heart of the Transformer is the self-attention mechanism, whose time and memory complexity are both O(N²) in the sequence length N. As large language models (LLMs) continue to grow in scale, equipping them with longer contexts poses a significant engineering challenge. A team of researchers from the Department of Computer Science at Stanford University and the State University of New York at Buffalo has published a novel attention algorithm called FlashAttention, which not only supports a longer context than PyT…
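The O(N²) memory cost comes from materializing the full N×N attention matrix. FlashAttention avoids this by streaming over blocks of keys/values and maintaining an online softmax, so only one small tile lives in fast memory at a time. A minimal NumPy sketch of the idea (not the actual fused CUDA kernel, which also tiles queries and keeps everything in GPU SRAM):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full N x N score matrix: O(N^2) memory.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=32):
    # FlashAttention-style streaming over K/V blocks with an online
    # softmax; the full N x N matrix is never stored.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, K.shape[0], block):
        S = (Q @ K[j:j + block].T) * scale        # N x block tile only
        m_new = np.maximum(m, S.max(axis=-1))
        correction = np.exp(m - m_new)            # rescale old state
        p = np.exp(S - m_new[:, None])
        l = l * correction + p.sum(axis=-1)
        O = O * correction[:, None] + p @ V[j:j + block]
        m = m_new
    return O / l[:, None]
```

Both functions compute identical outputs; the tiled version simply trades the quadratic score matrix for a constant-size working set per block.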

A Bit of Our Practice in Reducing Network Congestion (I)

For deep learning developers and researchers, high-performance compute is a key weapon in making their research succeed. Among the factors that affect deep learning training speed, the role of network transmission is easy to overlook. In large-scale clusters and distributed training scenarios in particular, network congestion can directly leave GPU compute power idle: it is like a two-way, eight-lane expressway that, under messy road planning, degenerates into a large parking lot. In this issue, we share some of High-Flyer AI's thinking and optimization work on the topic of networking. First, let's talk about network…

Continuous Batching: A Powerful Tool for Improving LLM Serving Throughput

Because of the enormous GPU memory overhead and compute cost of LLMs, in most applications machine learning engineers optimize through model-internal tweaks such as quantization and custom CUDA kernels. However, because an LLM generates its output iteratively, and LLM inference is typically memory-bound rather than compute-bound, in practice optimizing system-level batching can yield a 10x or greater difference in performance. One recently proposed optimization is continuous batching, also known as dynamic batching or iteration-level batch…
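The gain from iteration-level scheduling can be seen with a toy simulator: static batching holds every slot until the longest request in the batch finishes, while continuous batching refills a slot the moment its request completes. A minimal sketch with hypothetical request lengths (real serving systems also have to handle prefill, KV-cache memory, and so on):

```python
def static_batching_steps(lengths, batch_size):
    # Static batching: each batch of requests runs until its longest
    # member finishes; shorter requests pad and wait.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    # Continuous (iteration-level) batching: after every decode step,
    # finished requests leave and queued requests take their slots.
    queue, slots, steps = list(lengths), [], 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.pop(0))
        steps += 1  # one decode iteration for all active requests
        slots = [r - 1 for r in slots if r > 1]
    return steps
```

On real workloads, where output lengths vary wildly, the gap between the two schedules is far larger than in this toy example; that scheduling slack is exactly where the order-of-magnitude throughput gains come from.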

LLaMA-2 Technical Deep Dive (I): Data Labeling

LLaMA is currently one of the most closely watched open-source pre-trained large language models. Meta recently released LLaMA 2, the next generation of LLaMA, under a commercial-friendly license. LLaMA 2 comes in three sizes: 7B, 13B, and 70B; the versions fine-tuned for dialogue use cases are called LLaMA 2-Chat. The LLaMA 2-Chat models outperform existing open-source chat models on most benchmarks, and in human evaluations of helpfulness and safety, LLaMA 2-Chat…

A Deep Dive into the Architecture of GPT-4

GPT-4 (Generative Pre-trained Transformer 4) is the latest model in OpenAI's GPT series. It is a large-scale multimodal model: unlike GPT-3.5 / ChatGPT, GPT-4 can accept both image and text inputs and produces text output. The output is still an autoregressive next-word prediction task. Technically, GPT-4 adopts a mixture-of-experts (MoE) architecture to further strengthen the model's capabilities. Overall, GPT-4's performance across various…
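As a rough illustration of the MoE idea (a toy NumPy sketch, not GPT-4's undisclosed implementation): a gating network scores the experts for each token, only the top-k experts are actually evaluated, and their outputs are mixed using the renormalized gate weights. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def moe_layer(x, gate_w, experts_w, top_k=2):
    # x: (tokens, d); gate_w: (d, n_experts); experts_w: (n_experts, d, d).
    logits = x @ gate_w                        # per-token expert scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = top[t]
        w = np.exp(logits[t, sel])
        w /= w.sum()                           # softmax over selected experts
        for e, wi in zip(sel, w):
            out[t] += wi * (x[t] @ experts_w[e])  # only k experts run
    return out
```

The point of the routing is that compute per token scales with k, not with the total number of experts, so parameter count can grow far faster than inference cost.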

CC_Cleaner: A Smooth, Efficient, and Easily Extensible Data-Cleaning Pipeline

AGI is the perfect practice of data × algorithms × compute, and the elegant art of research + engineering + organization. In the stage preceding large-model training, cleaning big data is the foundation of data processing. Take the Common Crawl dataset as an example: easily accessible on Amazon, it is a free, petabyte-scale web-crawl dataset spanning more than 12 years of collected data: raw web pages (WARC), extracted metadata (WAT), and extracted text (WET). How to handle such vast and messy raw da…

HAI-LLM: An Efficient and Lightweight Tool for Training Large Models

To better harness the computing power of GPU clusters and train powerful trillion-parameter models with astonishing capabilities, an efficient and streamlined large-model training tool is essential. High-Flyer's foundational research team recently developed a deep learning training tool named HAI-LLM, which implements four parallel training strategies: ZeRO-backed data parallelism, pipeline parallelism, tensor-slicing model parallelism, and sequence parallelism. This combination of parallelism adapts to different workload requirements, supporting ultra-large models at the multi-trillion-parameter scale and scaling to thousands of GPUs. Built around the characteristics of the Fire-Flyer cluster, the in-house ha…
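One way to picture combining these parallel dimensions: the flat GPU ranks are factored into (data, pipeline, tensor) coordinates, with each GPU belonging to one group per dimension. A minimal sketch, assuming tensor-parallel ranks are innermost (HAI-LLM's actual rank layout is not specified in the excerpt; sequence parallelism typically shares the tensor-parallel group):

```python
def rank_coords(rank, dp, pp, tp):
    # Map a flat GPU rank to (data, pipeline, tensor) parallel
    # coordinates, with tensor-parallel ranks varying fastest.
    t = rank % tp
    p = (rank // tp) % pp
    d = rank // (tp * pp)
    return d, p, t
```

For example, with dp=2, pp=2, tp=4 (16 GPUs total), ranks 0-3 form one tensor-parallel group on pipeline stage 0 of data-parallel replica 0. The world size is always dp × pp × tp, which is why the three degrees must be chosen to factor the cluster size.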