New Research Highlights How “FlashAttention-3” May Make Training and Inference More Efficient

FlashAttention-3, a new technique developed by researchers from multiple institutions, dramatically accelerates attention computation on Nvidia’s H100 and H800 GPUs, enabling faster and more efficient training and inference of large language models (LLMs).

The challenge of attention computation in LLMs: As LLMs grow larger and process longer input sequences, the attention mechanism becomes a significant bottleneck, because its computational cost grows quadratically with sequence length and it relies on operations that GPUs do not execute as efficiently as matrix multiplications.

  • Attention mixes matrix multiplications with special functions such as softmax; the special functions run on much lower-throughput GPU units, so they can slow down the overall computation even though they account for a small share of the total operations (a reference implementation showing where these costs arise follows this list).
  • Scheduling workloads so that operations do not block one another, and making efficient use of the GPU's different memory tiers, is therefore crucial to fast attention computation.
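
For readers who want to see where these costs come from, here is a minimal, unfused reference implementation of attention in PyTorch. It is not code from the FlashAttention-3 project; it simply makes the quadratic score matrix and the separate softmax pass explicit.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    """Reference (unfused) attention: softmax(Q K^T / sqrt(d)) V.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    The score matrix has shape (batch, heads, seq_len, seq_len), so its
    memory and compute grow quadratically with seq_len, and the softmax
    is a separate pass over every one of those seq_len * seq_len entries.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5   # (B, H, L, L): quadratic in L
    probs = F.softmax(scores, dim=-1)           # special-function pass over the full matrix
    return probs @ v                            # (B, H, L, head_dim)

# Doubling seq_len roughly quadruples the size of the score matrix.
q = k = v = torch.randn(1, 8, 1024, 64)
out = naive_attention(q, k, v)
```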

Improving hardware resource utilization: FlashAttention-3 builds upon previous versions to maximize performance on Nvidia Hopper GPUs by leveraging their new features and optimizing operation scheduling.

  • FlashAttention-3 maximizes the overlap between computation and data movement across GPU memory segments, reducing idle time, and interleaves matrix multiplication and softmax operations to minimize bottlenecks (the tiling idea shared by all FlashAttention versions is sketched after this list).
  • The technique also employs a special arrangement of operations that speeds up attention in low-precision (FP8) quantized models while limiting the accuracy loss usually associated with quantization.
  • FlashAttention-3 reaches up to 75% of the H100 GPU's theoretical maximum throughput, a 1.5–2x speedup over previous versions for both LLM training and inference.
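
FlashAttention-3's specific gains come from Hopper-level scheduling (asynchrony between memory movement and compute units, plus FP8 support) that cannot be reproduced in Python. The sketch below only illustrates the tiling-with-streaming-softmax idea that all FlashAttention versions share: scores are computed block by block and the full score matrix is never materialized. Block sizes and variable names here are illustrative, not taken from the paper.

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """Illustrative tiled attention with an online (streaming) softmax.

    K and V are processed in blocks, so the full (L x L) score matrix is
    never materialized; a running row-wise max `m` and normalizer `l` are
    updated after each block. Shapes: q, k, v are (seq_len, head_dim),
    single head for clarity.
    """
    L, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    m = torch.full((L, 1), float("-inf"))   # running row-wise max
    l = torch.zeros(L, 1)                   # running softmax denominator

    for start in range(0, L, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        s = (q @ k_blk.T) * scale                    # scores for this block only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                     # block-local unnormalized probabilities
        correction = torch.exp(m - m_new)            # rescale earlier partial sums
        l = l * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        m = m_new
    return out / l

# Sanity check against the unfused reference.
q = k = v = torch.randn(1024, 64)
ref = torch.softmax((q @ k.T) / 64**0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```

Production kernels implement the same loop in CUDA and split it across warps, which is what lets the matrix-multiply units and the exponential/softmax units stay busy at the same time.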

Implications and benefits of FlashAttention-3: The faster attention computation offered by FlashAttention-3 has several potential impacts on LLM development and applications.

  • Significantly reduced training time for LLMs, enabling researchers and developers to experiment with larger models and datasets.
  • Extended context windows for LLMs, allowing them to process longer sequences more efficiently and unlocking new applications in areas such as long-form document understanding and many-shot in-context learning.
  • Reduced cost of running LLMs in production by using a higher percentage of GPU capacity and potentially requiring fewer accelerators.

Broader availability and future work: The researchers have open-sourced FlashAttention-3 under a permissive license and plan to integrate it into popular deep learning libraries, making it easier for the community to leverage its performance benefits. They also anticipate future work on optimizing LLM inference and generalizing their techniques to other hardware architectures.
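
As a rough illustration of what integration into popular libraries looks like in practice, the snippet below follows the call convention of earlier open-source FlashAttention releases (the flash-attn package's flash_attn_func). The exact package name and entry point for the FlashAttention-3 kernels may differ, so treat this as a sketch and check the project's repository.

```python
# Sketch only: assumes the flash-attn package is installed
# (pip install flash-attn --no-build-isolation) and a CUDA GPU is available.
# The FlashAttention-3 release may expose a different module or function name.
import torch
from flash_attn import flash_attn_func

# q, k, v: (batch, seq_len, num_heads, head_dim), fp16/bf16, on a CUDA device.
q = torch.randn(2, 4096, 16, 64, dtype=torch.bfloat16, device="cuda")
k = torch.randn(2, 4096, 16, 64, dtype=torch.bfloat16, device="cuda")
v = torch.randn(2, 4096, 16, 64, dtype=torch.bfloat16, device="cuda")

out = flash_attn_func(q, k, v, causal=True)  # fused attention kernel
```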

Analyzing deeper: FlashAttention-3 represents a significant step forward in optimizing attention computation for LLMs, addressing a critical bottleneck in their development and deployment. By closely aligning algorithmic improvements with hardware advancements, the researchers have demonstrated the potential for substantial efficiency gains and the unlocking of new capabilities, such as extended context windows. As LLMs continue to grow in size and complexity, techniques like FlashAttention-3 will play an increasingly crucial role in making them more accessible, cost-effective, and applicable to a wider range of tasks. However, it remains to be seen how well these optimizations will generalize to other hardware architectures and whether similar improvements can be achieved for LLM inference, which presents distinct challenges compared to training.

FlashAttention-3 unleashes the power of H100 GPUs for LLMs
