

<h1>FlashAttention: A Novel Attention Algorithm with IO Awareness, Fast and Memory-Efficient</h1>
<p>At the heart of the Transformer model is the self-attention mechanism, whose time and memory complexity are both \(O(N^2)\) in the sequence length. As the size of Large Language Models (LLMs) continues to grow, equipping LLMs with longer contexts poses a very significant engineering challenge.</p>
<p>A team of researchers from Stanford University's Department of Computer Science and the State University of New York at Buffalo has published a new attention algorithm called FlashAttention, which not only runs 2-4 times faster than PyTorch's standard attention but also requires 5-20 times less memory.
FlashAttention-2, Flash-Decoding, and FlashDecoding++ would later be released with even more dramatic speedups.</p>
<p>High-Flyer's self-developed <a href="https://www.high-flyer.cn/blog/hai-llm">large model training tool HAI-LLM</a> has adopted FlashAttention across the board, dramatically improving GPU utilization and delivering excellent training performance. In this series of articles, we will discuss the technology behind FlashAttention and our practical experience with it.</p>
<p><strong>Paper</strong>: <a href="https://arxiv.org/abs/2205.14135">https://arxiv.org/abs/2205.14135</a></p>
<p><strong>Source code</strong>: <a href="https://github.com/Dao-AILab/flash-attention">https://github.com/Dao-AILab/flash-attention</a></p>
<h2 id="背景">Background</h2>
<p>The memory efficiency of the traditional attention algorithm is \(O(N^2)\). Some past approaches to optimizing the attention mechanism use approximations, such as sparse approximations, low-rank approximations, and combinations thereof.
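As a toy illustration of how such approximations reach linear complexity (this is background, not part of FlashAttention itself): if the softmax were removed so that attention became a plain matrix product, associativity alone would avoid the \(N \times N\) matrix. The variable names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1024, 32
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Without softmax, (Q K^T) V can be re-associated as Q (K^T V):
quadratic = (Q @ K.T) @ V   # O(N^2 d): materializes an N x N matrix
linear = Q @ (K.T @ V)      # O(N d^2): only a d x d intermediate
assert np.allclose(quadratic, linear)
```

The softmax breaks this associativity, which is why linear- and low-rank-attention methods have to approximate it with kernel feature maps.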
While these methods can reduce the computation to linear or near-linear \(O(N)\), they focus too narrowly on reducing the number of floating-point operations (FLOPs) and tend to ignore the overhead of memory accesses (IO).</p>
<p>GPU FLOPS have been growing faster than memory throughput (TB/s) for years. In our practice of optimizing model training on A100-class GPUs, we have found that memory throughput is the bottleneck limiting further training efficiency; FLOPS and memory throughput must be tightly coupled to fully exploit the hardware, and this requires more detailed design at the software level.</p>
<p>As shown in the figure below:</p>
<p><img src="https://hfai-static.high-flyer.cn/static/0f85be112184d6143d30b192d6933a59/e5a29/01.jpg" alt="Bandwidth and capacity of CPU and GPU memory tiers" /></p>
<p>The figure above shows the bandwidth and capacity of the different memory tiers of the CPU and GPU.
As you can see, memory is not a single component: it is hierarchical, and the general rule is that the faster the memory, the more expensive it is and the smaller its capacity.</p>
<p>Take the A100 as an example: an A100 GPU has 40-80 GB of High Bandwidth Memory (HBM) with a bandwidth of 1.5-2.0 TB/s, while each of its 108 streaming multiprocessors has 192 KB of SRAM, with an estimated aggregate bandwidth of 19 TB/s. Although SRAM is far smaller in capacity, it is roughly 10 times faster, so using SRAM efficiently is the key to accelerating the attention algorithm.</p>
<h2 id="标准注意力算法">Standard Attention Algorithm</h2>
<p>Let's first look at the computational logic of the standard attention algorithm:</p>
<p><img src="https://hfai-static.high-flyer.cn/static/889f325ebe5fe450ddcf3c5aa642d1f8/5ebd7/02.png" alt="Standard attention algorithm" /></p>
<p>As you can see, the standard attention algorithm essentially treats HBM load/store operations as free, i.e. it is not IO-aware.</p>
<p>The figure below shows the complete time breakdown for one attention operator in the GPT-2 model:</p>
<p><img src="https://hfai-static.high-flyer.cn/static/461d1f3fc17e9781510bca12cb498aac/1b42c/03.jpg" alt="Time breakdown of a GPT-2 attention operator" /></p>
<p>Masking, softmax, and dropout take up a large share of the time, while matrix multiplication (Matmul), the part that actually exploits the FLOPS, accounts for only a fraction of it. The FlashAttention algorithm is therefore designed to be hardware IO-aware: it drastically reduces redundant HBM IO and leverages SRAM to accelerate the computation.</p>
<h2 id="flashattention">FlashAttention</h2>
<p>The idea of FlashAttention is as follows: the standard attention algorithm writes the score matrix S back to HBM only in order to reload it for the softmax computation, so we can instead keep it in SRAM and write the final result back to HBM only after all intermediate steps have been performed.
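For contrast with the fused approach just described, here is a minimal NumPy sketch of the standard, non-fused forward pass, in which the full \(N \times N\) matrices S and P are materialized (on a GPU, each of these intermediates is written to and re-read from HBM). Function names and shapes are illustrative, not taken from the paper's code:

```python
import numpy as np

def standard_attention(Q, K, V):
    """Naive attention: materializes the full N x N matrices S and P.
    On a GPU, both intermediates round-trip through HBM."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                 # N x N score matrix -> HBM
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P = P / P.sum(axis=-1, keepdims=True)    # N x N softmax matrix -> HBM
    return P @ V                             # N x d output

rng = np.random.default_rng(0)
N, d = 128, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
O = standard_attention(Q, K, V)
print(O.shape)  # (128, 16)
```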
This is illustrated in the figure below:</p>
<p><img src="https://hfai-static.high-flyer.cn/static/890f0cd9d1814a44043b20a1969b37a5/a8417/04.png" alt="FlashAttention kernel fusion" /></p>
<p>FlashAttention fuses multiple operations together: it loads from HBM only once, performs the fused computation, and writes the result back to HBM. The fusion relies on two main techniques:</p>
<ul>
<li>Tiling: block-wise matrix computation; the softmax reduction is computed without access to the entire input. Used in both the forward and backward passes.</li>
<li>Recomputation: trading time for space; intermediate attention matrices are recomputed rather than stored. Used only in the backward pass.</li>
</ul>
<p>The complete pseudo-code is shown below:</p>
<p><img src="https://hfai-static.high-flyer.cn/static/9a1a42bebd7f7c37e17c447c836d6588/7b1dc/05.png" alt="FlashAttention pseudo-code" /></p>
<h3 id="1-tiling-分块计算">1. Tiling: block-wise computation</h3>
<p>Because SRAM capacity is limited, the \(O(N^2)\) storage requirement restricts the sequence length \(N\), so the matrices must be processed in blocks. For the matrix multiplications and the point-wise operations (scale, masking, dropout), blocking is relatively easy to implement; the main obstacle is the softmax, which couples all the score columns together.
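The coupling is easy to see numerically: the softmax of a row cannot be assembled from the softmaxes of its blocks, because the normalizer (and, for numerical stability, the max) depends on every element of the row. A toy illustration:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax of a 1-D vector.
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1.0, 3.0, 0.5, 2.0])
x1, x2 = x[:2], x[2:]

# Softmax of each block in isolation is NOT a slice of the full softmax:
full = softmax(x)
blockwise = np.concatenate([softmax(x1), softmax(x2)])
print(np.allclose(full, blockwise))  # False: each block used its own normalizer
```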
For this reason the researcher used a trick: since Softmax is related to the attention\u00a0<span class=\"math math-inline\"><span class=\"katex\"><span class=\"katex-mathml\">K<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathbf\">K<\/span><\/span><\/span><\/span><\/span>\u00a0columns are coupled, by introducing two additional statistics\u00a0<span class=\"math math-inline\"><span class=\"katex\"><span class=\"katex-mathml\">m(x),l(x)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">m<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">l<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span>\u00a0to decouple and realize the chunked computation. The details are as follows:<\/p>\n<p><span class=\"math math-inline\"><span class=\"katex\"><span class=\"katex-mathml\">m(x):=maxi xi, f(x):=[exi-m(x)...exB-m(x)], l(x):=\u2211i f(x)i, softmax(x):=f(x)l(x)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">m<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">:=<\/span><\/span><span class=\"base\"><span class=\"mop\">max<span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mord\">\u00a0<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing 
reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mspace nobreak\">\u00a0<\/span><span class=\"mspace nobreak\">\u00a0<\/span><span class=\"mord mathnormal\">f<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">:=<\/span><\/span><span class=\"base\"><span class=\"mopen\">[<\/span><span class=\"mord\"><span class=\"mord mathnormal\">e<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">x<\/span><span class=\"vlist-t vlist-t2\"><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"mbin mtight\">\u2212<\/span><span class=\"mord mathnormal mtight\">m<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\">&#8230;<\/span><span class=\"mord\"><span class=\"mord mathnormal\">e<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">x<\/span><span class=\"vlist-t vlist-t2\"><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mathnormal mtight\">B<\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"mbin mtight\">\u2212<\/span><span class=\"mord mathnormal mtight\">m<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"mclose 
mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">]<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\">\u00a0<\/span><span class=\"mspace nobreak\">\u00a0<\/span><span class=\"mspace nobreak\">\u00a0<\/span><span class=\"mord mathnormal\">l<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">:=<\/span><\/span><span class=\"base\"><span class=\"mop\"><span class=\"mop op-symbol small-op\">\u2211<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mord\">\u00a0<\/span><span class=\"mord mathnormal\">f<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord\">\u00a0<\/span><span class=\"mspace nobreak\">\u00a0<\/span><span class=\"mspace nobreak\">\u00a0<\/span><span class=\"mord mathnormal\">so<\/span><span class=\"mord mathnormal\">f<\/span><span class=\"mord mathnormal\">t<\/span><span class=\"mord mathnormal\">ma<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">:=<\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mfrac\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord 
mtight\"><span class=\"mord mathnormal mtight\">l<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"mclose mtight\">)<\/span><\/span><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">f<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p>For two vectors\u00a0<span class=\"math math-inline\"><span class=\"katex\"><span class=\"katex-mathml\">x(1),x(2)\u2208 RB<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>1<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>2<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mrel\">\u2208<\/span><\/span><span class=\"base\"><span class=\"mord\">\u00a0<\/span><span class=\"mord\"><span class=\"mord mathnormal\">R<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">B<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>The decoupled 
splicing vectors\u00a0<span class=\"math math-inline\"><span class=\"katex\"><span class=\"katex-mathml\">x=[x(1),x(2)]\u2208 R2B<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">x<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mopen\">[<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>1<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>2<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">]<\/span><span class=\"mrel\">\u2208<\/span><\/span><span class=\"base\"><span class=\"mord\">\u00a0<\/span><span class=\"mord\"><span class=\"mord mathnormal\">R<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2<span class=\"mord mathnormal mtight\">B<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>\u00a0The Softmax calculation of the<\/p>\n<p><span class=\"math math-inline\"><span class=\"katex\"><span class=\"katex-mathml\">m(x)=m([x(1),x(2)])=max(x(1),x(2)), f(x)=[em(x(1))-m(x)f(x(1)) em(x(2))-m(x)f(x(2))]<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">m<\/span><span class=\"mopen\">(<\/span><span class=\"mord 
mathnormal\">x<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">m<\/span><span class=\"mopen\">([<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>1<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>2<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">])<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mop\">max<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>1<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>2<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\">\u00a0<\/span><span class=\"mord 
mathnormal\">f<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mopen\">[<\/span><span class=\"mord\"><span class=\"mord mathnormal\">e<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">m<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mopen mtight\">(<\/span>1<span class=\"mclose mtight\">)<\/span><\/span><span class=\"mclose mtight\">)<\/span><span class=\"mbin mtight\">\u2212<\/span><span class=\"mord mathnormal mtight\">m<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord mathnormal\">f<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>1<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mord\">\u00a0<\/span><span class=\"mspace nobreak\">\u00a0<\/span><span class=\"mspace nobreak\">\u00a0<\/span><span class=\"mord\"><span class=\"mord mathnormal\">e<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">m<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span 
class=\"sizing reset-size3 size1 mtight\"><span class=\"mopen mtight\">(<\/span>2<span class=\"mclose mtight\">)<\/span><\/span><span class=\"mclose mtight\">)<\/span><span class=\"mbin mtight\">\u2212<\/span><span class=\"mord mathnormal mtight\">m<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord mathnormal\">f<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>2<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)]<\/span><\/span><\/span><\/span><\/span>\u00a0<span class=\"math math-inline\"><span class=\"katex\"><span class=\"katex-mathml\">l(x)=l([x(1),x(2)])=em(x(1))-m(x)l(x(1)) +em(x(2))-m(x)l(x(2)), softmax(x)=f(x)l(x)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">l<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord mathnormal\">l<\/span><span class=\"mopen\">([<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>1<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span 
class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>2<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">])<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">e<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">m<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mopen mtight\">(<\/span>1<span class=\"mclose mtight\">)<\/span><\/span><span class=\"mclose mtight\">)<\/span><span class=\"mbin mtight\">\u2212<\/span><span class=\"mord mathnormal mtight\">m<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord mathnormal\">l<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>1<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mord\">\u00a0<\/span><span class=\"mbin\">+<\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mord mathnormal\">e<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal 
mtight\">m<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mopen mtight\">(<\/span>2<span class=\"mclose mtight\">)<\/span><\/span><span class=\"mclose mtight\">)<\/span><span class=\"mbin mtight\">\u2212<\/span><span class=\"mord mathnormal mtight\">m<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord mathnormal\">l<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">x<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mopen mtight\">(<\/span>2<span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mpunct\">,<\/span><span class=\"mord\">\u00a0<\/span><span class=\"mspace nobreak\">\u00a0<\/span><span class=\"mspace nobreak\">\u00a0<\/span><span class=\"mord\">\u00a0<\/span><span class=\"mord mathnormal\">so<\/span><span class=\"mord mathnormal\">f<\/span><span class=\"mord mathnormal\">t<\/span><span class=\"mord mathnormal\">ma<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<\/span><span class=\"mrel\">=<\/span><\/span><span class=\"base\"><span class=\"mord\"><span class=\"mfrac\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">l<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"mclose mtight\">)<\/span><\/span><\/span><span class=\"sizing reset-size6 
size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">f<\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mathnormal mtight\">x<\/span><span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p>Note that the Softmax of multiple blocks can be computed in parallel with GPU multithreading: to fully exploit the hardware, the blocks are processed concurrently rather than serially.<\/p>\n<h3 id=\"2-\u91cd\u8ba1\u7b97\">2. Recomputation<\/h3>\n<p>To avoid redundant HBM reads and writes, FlashAttention does not keep the large intermediate result matrices for the backward pass.<\/p>\n<p>In the standard attention implementation, the backward pass needs the NxN intermediate matrices S and P to compute the gradients of Q, K, and V; FlashAttention does not store them. The trick used in this work is to save only the two statistics\u00a0<span class=\"math math-inline\"><span class=\"katex\"><span class=\"katex-mathml\">m(x),l(x)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">m<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<\/span><span class=\"mpunct\">,<\/span><span class=\"mord mathnormal\">l<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">x<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span>. With them, the attention matrices S and P are recomputed block by block in fast on-chip SRAM during the backward pass.
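<\/p>
<p>The merging of per-block statistics described above can be sketched in a few lines. The following is our own illustration in plain NumPy (not the paper's CUDA kernel): it combines the statistics m(x) and l(x) of two blocks and checks the result against a one-shot Softmax.<\/p>

```python
import numpy as np

# Illustrative sketch of the per-block statistics used by FlashAttention:
# m(x) is the block max, l(x) the normalizer of the shifted exponentials.
def block_stats(x):
    m = x.max()           # m(x): max of the block (numerical stability)
    f = np.exp(x - m)     # f(x): shifted exponentials
    l = f.sum()           # l(x): block normalizer
    return m, f, l

x1 = np.array([1.0, 3.0, 2.0])
x2 = np.array([4.0, 0.5, 2.5])
m1, f1, l1 = block_stats(x1)
m2, f2, l2 = block_stats(x2)

# Merge: rescale each block's statistics to the global max
m = max(m1, m2)
l = np.exp(m1 - m) * l1 + np.exp(m2 - m) * l2
softmax_blocked = np.concatenate([np.exp(m1 - m) * f1,
                                  np.exp(m2 - m) * f2]) / l

# Reference: Softmax over the full vector at once
x = np.concatenate([x1, x2])
softmax_full = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
```

<p>The same merge rule is what lets the backward pass rebuild the attention probabilities block by block in SRAM from Q, K, V and the saved m(x), l(x) alone.<\/p>
<p>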
This approach is considerably faster than the standard method.<\/p>\n<h2 id=\"\u5b9e\u9a8c\">Experiments<\/h2>\n<p>Compared with the standard attention algorithm, FlashAttention sharply reduces HBM I\/O and runtime, even though its GFLOPs increase because the backward pass requires recomputation, as shown on the left of the figure below:<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/73af47bd77921a706841144dfefa1106\/5819f\/06.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"photograph\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/73af47bd77921a706841144dfefa1106\/a6d36\/06.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/73af47bd77921a706841144dfefa1106\/222b7\/06.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/73af47bd77921a706841144dfefa1106\/ff46a\/06.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/73af47bd77921a706841144dfefa1106\/a6d36\/06.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/73af47bd77921a706841144dfefa1106\/e548f\/06.png 975w,https:\/\/hfai-static.high-flyer.cn\/static\/73af47bd77921a706841144dfefa1106\/5819f\/06.png 1042w\" alt=\"figure\" \/><\/a><\/span><\/p>\n<p>The right side of the figure also shows that as the Block Size increases, the number of HBM accesses falls and so does the runtime. Once the Block Size exceeds 256, however, the runtime stops improving even though HBM accesses keep decreasing: performance is then limited by other factors, such as compute throughput.
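<\/p>
<p>To make the role of the Block Size concrete, here is our own minimal single-head sketch of blockwise attention in plain NumPy (with a hypothetical block size B; the real implementation is a fused CUDA kernel). It streams K and V in blocks while carrying only the running max and normalizer, and matches a naive attention reference.<\/p>

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full N x N matrices S and P
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def blockwise_attention(Q, K, V, B=32):
    # Stream K/V in blocks of B rows, keeping only O(N) running statistics
    N, d = Q.shape
    O = np.zeros((N, d))       # unnormalized output accumulator
    m = np.full(N, -np.inf)    # running row-wise max
    l = np.zeros(N)            # running row-wise normalizer
    for j in range(0, K.shape[0], B):
        Kj, Vj = K[j:j + B], V[j:j + B]
        S = Q @ Kj.T / np.sqrt(d)              # scores for this block only
        m_new = np.maximum(m, S.max(axis=-1))
        p = np.exp(S - m_new[:, None])         # unnormalized block probabilities
        scale = np.exp(m - m_new)              # rescale old statistics to new max
        l = scale * l + p.sum(axis=-1)
        O = scale[:, None] * O + p @ Vj
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 16)) for _ in range(3))
out_blocked = blockwise_attention(Q, K, V, B=16)
out_naive = naive_attention(Q, K, V)
```

<p>A larger B means fewer passes over the K\/V data, but a larger working set that must fit in SRAM, which is precisely the trade-off discussed above.<\/p>
<p>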
It should also be noted that too large a Block Size can make the memory required by the fused operation exceed the SRAM capacity.<\/p>\n<p>The experiments were run on an A100 GPU; the speedups achieved by FlashAttention are shown below:<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/c1ef78d565bc4914b3f18e0e90aaf3a5\/302a4\/07.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"photograph\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/c1ef78d565bc4914b3f18e0e90aaf3a5\/a6d36\/07.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/c1ef78d565bc4914b3f18e0e90aaf3a5\/222b7\/07.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/c1ef78d565bc4914b3f18e0e90aaf3a5\/ff46a\/07.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/c1ef78d565bc4914b3f18e0e90aaf3a5\/a6d36\/07.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/c1ef78d565bc4914b3f18e0e90aaf3a5\/e548f\/07.png 975w,https:\/\/hfai-static.high-flyer.cn\/static\/c1ef78d565bc4914b3f18e0e90aaf3a5\/302a4\/07.png 1080w\" alt=\"figure\" \/><\/a><\/span><\/p>\n<p>The change in memory usage is shown below:<\/p>\n<p><span class=\"gatsby-resp-image-wrapper\"><a class=\"gatsby-resp-image-link\" href=\"https:\/\/hfai-static.high-flyer.cn\/static\/ba76f74a993e63da6afbe5838b062cbb\/302a4\/08.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"gatsby-resp-image-image\" title=\"photograph\" src=\"https:\/\/hfai-static.high-flyer.cn\/static\/ba76f74a993e63da6afbe5838b062cbb\/a6d36\/08.png\" sizes=\"(max-width: 650px) 100vw, 650px\" srcset=\"https:\/\/hfai-static.high-flyer.cn\/static\/ba76f74a993e63da6afbe5838b062cbb\/222b7\/08.png 163w,https:\/\/hfai-static.high-flyer.cn\/static\/ba76f74a993e63da6afbe5838b062cbb\/ff46a\/08.png 325w,https:\/\/hfai-static.high-flyer.cn\/static\/ba76f74a993e63da6afbe5838b062cbb\/a6d36\/08.png 650w,https:\/\/hfai-static.high-flyer.cn\/static\/ba76f74a993e63da6afbe5838b062cbb\/e548f\/08.png 975w,https:\/\/hfai-static.high-flyer.cn\/static\/ba76f74a993e63da6afbe5838b062cbb\/302a4\/08.png 1080w\" alt=\"figure\" \/><\/a><\/span><\/p>\n<p>As the figures show, the speedup varies with the sequence length and with whether dropout and masking are applied; and as the sequence length grows, FlashAttention delivers ever larger savings in memory consumption.<\/p>\n<h2 id=\"\u603b\u7ed3\">Summary<\/h2>\n<p>Most large language models limit their input and output sequence lengths to only 2K or 4K, essentially because the computational and space complexity of the self-attention block, the core component of the Transformer, is\u00a0<span class=\"math math-inline\"><span class=\"katex\"><span class=\"katex-mathml\">O(N2)<\/span><span class=\"katex-html\" aria-hidden=\"true\"><span class=\"base\"><span class=\"mord mathnormal\">O<\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\"><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span>\u00a0in the sequence length. The success of FlashAttention shows that deep learning models can be optimized and accelerated through tiling, operator fusion, and recomputation, techniques of great relevance as AI industrial practice moves into deeper waters.<\/p>","protected":false},"excerpt":{"rendered":"<p>At the heart of the Transformer model is the self-attention mechanism (self attention), whose time and storage
[&hellip;]<\/p>","protected":false},"author":1,"featured_media":505,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-504","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-basic-research"],"acf":[],"_links":{"self":[{"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/posts\/504","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/comments?post=504"}],"version-history":[{"count":1,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/posts\/504\/revisions"}],"predecessor-version":[{"id":506,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/posts\/504\/revisions\/506"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/media\/505"}],"wp:attachment":[{"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/media?parent=504"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/categories?post=504"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/high-flyer.in.suopu.cc\/en\/wp-json\/wp\/v2\/tags?post=504"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}