New Research Highlights How “FlashAttention-3” May Make Training and Inference More Efficient

FlashAttention-3, a new technique developed by researchers from multiple institutions, dramatically accelerates attention computation on Nvidia’s H100 and H800 GPUs, enabling faster and more efficient training and inference of large language models (LLMs).

The challenge of attention computation in LLMs: As LLMs grow larger and process longer input sequences, the attention mechanism becomes a significant bottleneck: its cost grows quadratically with sequence length, and parts of it rely on operations that map poorly onto GPU hardware.

  • Attention mixes matrix multiplications, which modern GPUs execute at very high throughput on dedicated units, with special functions such as the exponentials inside softmax, which run on slower units; even though the special functions account for a small fraction of the total operations, they can stall the overall computation (the sketch after this list makes the cost structure concrete).
  • Scheduling the workload so that operations do not block one another, and making careful use of the GPU’s different memory components, are therefore crucial to fast attention.
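
To make the bottleneck concrete, here is a minimal sketch of standard scaled dot-product attention in PyTorch (the shapes and names are illustrative, not from the paper). The score matrix has shape (N, N) for sequence length N, so both its memory footprint and the softmax work grow quadratically:

```python
import math
import torch

def naive_attention(q, k, v):
    """Standard attention that materializes the full (N, N) score matrix.

    q, k, v: (batch, heads, seq_len, head_dim)
    """
    d = q.shape[-1]
    # (batch, heads, N, N): memory and compute grow as N^2
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    # softmax evaluates exp() row by row; on GPUs these special-function
    # operations run on lower-throughput units than the matrix multiplies
    weights = torch.softmax(scores, dim=-1)
    return weights @ v  # (batch, heads, N, head_dim)

q = k = v = torch.randn(1, 8, 4096, 64)
out = naive_attention(q, k, v)  # the 4096 x 4096 score tensor dominates memory
```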

Improving hardware resource utilization: FlashAttention-3 builds upon previous versions to maximize performance on Nvidia Hopper GPUs by leveraging their new features and optimizing operation scheduling.

  • FlashAttention-3 maximizes the overlap between computation and data movement across the GPU’s memory tiers, reducing idle time, and interleaves matrix multiplication and softmax operations so that neither stalls the other (the sketch after this list illustrates the tiling idea these kernels build on).
  • The technique also employs a special arrangement of operations to make attention in low-precision (quantized) models both faster and more accurate, addressing the accuracy loss that quantization can otherwise introduce.
  • FlashAttention-3 achieves up to 75% of the H100 GPU’s theoretical maximum performance, translating into a 1.5–2x speedup over previous versions for both LLM training and inference.
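
The actual speedups come from hand-tuned kernels that exploit Hopper-specific features, but the core idea FlashAttention builds on can be sketched in plain PyTorch: process keys and values in blocks and maintain a running ("online") softmax, so the full (N, N) score matrix is never materialized. This is a conceptual sketch only, not FlashAttention-3’s kernel; the real implementation additionally streams tiles through on-chip memory and overlaps the matmul and softmax of consecutive tiles:

```python
import math
import torch

def tiled_attention(q, k, v, block=1024):
    """Blockwise attention with an online (running) softmax.

    Only (N, block)-sized score tiles exist at any time, so peak memory
    grows linearly rather than quadratically with sequence length.
    """
    d = q.shape[-1]
    n = k.shape[-2]
    scale = 1.0 / math.sqrt(d)
    m = torch.full(q.shape[:-1] + (1,), float("-inf"))  # running row max
    l = torch.zeros_like(m)                             # running normalizer
    acc = torch.zeros_like(q)                           # unnormalized output
    for start in range(0, n, block):
        kb = k[..., start:start + block, :]
        vb = v[..., start:start + block, :]
        s = (q @ kb.transpose(-2, -1)) * scale          # one score tile
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        alpha = torch.exp(m - m_new)                    # rescale prior state
        p = torch.exp(s - m_new)
        l = l * alpha + p.sum(dim=-1, keepdim=True)
        acc = acc * alpha + p @ vb
        m = m_new
    return acc / l

q = k = v = torch.randn(1, 8, 4096, 64)
out = tiled_attention(q, k, v)  # matches naive attention up to float error
```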

Implications and benefits of FlashAttention-3: The faster attention computation offered by FlashAttention-3 has several potential impacts on LLM development and applications.

  • Significantly reduced training time for LLMs, enabling researchers and developers to experiment with larger models and datasets.
  • Extended context windows for LLMs, allowing them to process longer sequences more efficiently and unlocking new applications in areas such as long-form document understanding and many-shot in-context learning.
  • Reduced cost of running LLMs in production by using a higher percentage of GPU capacity and potentially requiring fewer accelerators.

Broader availability and future work: The researchers have open-sourced FlashAttention-3 under a permissive license and plan to integrate it into popular deep learning libraries, making it easier for the community to leverage its performance benefits. They also anticipate future work on optimizing LLM inference and generalizing their techniques to other hardware architectures.
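
FlashAttention-2 already ships both as a standalone package and as a backend of PyTorch’s built-in attention operator; assuming FlashAttention-3 is integrated the same way, most users would pick up the speedups without code changes. A minimal example using PyTorch’s existing API, which dispatches to a fused FlashAttention-style kernel when the device, shapes, and dtypes allow:

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)

# PyTorch picks a fused flash-attention backend when hardware and
# dtypes permit; no (N, N) score matrix is ever materialized.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```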

Analyzing deeper: FlashAttention-3 represents a significant step forward in optimizing attention computation for LLMs, addressing a critical bottleneck in their development and deployment. By aligning algorithmic improvements closely with hardware advancements, the researchers have shown that substantial efficiency gains are achievable and that new capabilities, such as extended context windows, can be unlocked. As LLMs continue to grow in size and complexity, techniques like FlashAttention-3 will play an increasingly crucial role in making them more accessible, cost-effective, and applicable to a wider range of tasks. It remains to be seen, however, how well these optimizations will generalize to other hardware architectures, and whether comparable gains can be achieved for LLM inference, whose small-batch, memory-bound decoding workloads pose distinct challenges compared to compute-bound training.

FlashAttention-3 unleashes the power of H100 GPUs for LLMs
