
The big picture: GPU Utilization, a commonly used metric for assessing GPU performance in machine learning workloads, can be misleading: it says nothing about how efficiently the GPU's compute resources are actually being used.

Understanding GPU Utilization: GPU Utilization, as defined by Nvidia, measures the percentage of time during which one or more kernels are executing on the GPU, but fails to account for the efficiency of core usage or workload parallelization.

  • This metric can reach 100% even when the GPU is only performing memory read/write operations without any actual computations.
  • The discrepancy between GPU Utilization and actual computational efficiency became apparent when a foundation model company achieved 100% GPU Utilization but only 20% Model FLOPs Utilization (MFU).

The importance of MFU: MFU (Model FLOPs Utilization), introduced in Google’s PaLM paper, provides a more accurate representation of GPU performance by comparing observed throughput to the theoretical maximum.

  • MFU is the ratio of the floating-point operations per second a training run actually achieves to the GPU’s theoretical peak.
  • Most LLM training runs achieve 35–45% MFU, making the 20% figure notably low despite full GPU Utilization.
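The ratio described above can be sketched in a few lines. The numbers below (model size, throughput, hardware peak) are illustrative assumptions, not figures from the article, and the calculation uses the common approximation that training a dense transformer costs roughly 6·N FLOPs per token:

```python
# Sketch of an MFU calculation with hypothetical numbers. Training a dense
# transformer takes roughly 6 * N FLOPs per token (forward + backward),
# where N is the parameter count.

def model_flops_utilization(params, tokens_per_sec, peak_flops_per_sec):
    """Ratio of achieved training FLOPs/s to the hardware's theoretical peak."""
    achieved = 6 * params * tokens_per_sec  # approximate FLOPs/s actually performed
    return achieved / peak_flops_per_sec

# Hypothetical run: a 7B-parameter model processing 7,360 tokens/s on a GPU
# with a 989 TFLOP/s theoretical peak.
mfu = model_flops_utilization(7e9, 7_360, 989e12)
print(f"MFU: {mfu:.1%}")  # → MFU: 31.3%
```

A run can sit at 100% GPU Utilization while this number stays low, which is exactly the gap the article describes.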

Diving deeper into GPU architecture: Understanding the structure of GPUs is crucial for interpreting performance metrics accurately.

  • GPUs consist of cores and streaming multiprocessors (SMs), with SMs acting as managers for groups of cores.
  • A CUDA kernel’s work is scheduled onto one or more SMs, which execute it on their CUDA cores.
  • GPU Utilization only measures kernel execution time, not the efficiency of core usage or workload parallelization.

Introducing SM Efficiency: SM Efficiency, also known as SM activity, provides a more nuanced view of GPU performance by measuring the percentage of active SMs during a given time interval.

  • This metric helps identify inefficiencies in model execution that may not be apparent from GPU Utilization alone.
  • In the case study, the Softmax kernel showed high GPU Utilization but low SM Efficiency, indicating a potential bottleneck.
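The gap between the two metrics can be made concrete with a toy model of a single profiling interval. The figures here (a 108-SM GPU, one memory-bound kernel on 8 SMs) are hypothetical, not from the case study:

```python
# Toy model of one profiling interval on a hypothetical 108-SM GPU.
# GPU Utilization asks only: "was *any* kernel running?"
# SM Efficiency asks: "what fraction of SMs were actually active?"

TOTAL_SMS = 108
INTERVAL_MS = 100.0

# (duration_ms, active_sms): a single memory-bound kernel (like an unfused
# softmax) that keeps the GPU "busy" the whole interval on only 8 SMs.
kernels = [(100.0, 8)]

busy_ms = sum(duration for duration, _ in kernels)
gpu_utilization = busy_ms / INTERVAL_MS
sm_efficiency = sum(d * sms for d, sms in kernels) / (INTERVAL_MS * TOTAL_SMS)

print(f"GPU Utilization: {gpu_utilization:.0%}")  # → 100%
print(f"SM Efficiency: {sm_efficiency:.1%}")      # → 7.4%
```

The kernel keeps the GPU nominally busy for the entire window, so GPU Utilization reads 100%, while only 8 of 108 SMs ever do work.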

Optimizing performance through kernel fusion: After identifying inefficiencies using SM Efficiency, the team focused on optimizing the transformer stack through kernel fusion.

  • Kernel fusion replaces PyTorch-native layer definitions with GPU kernels that combine multiple operations into a single launch.
  • This approach reduces memory read/write operations and improves overall computational efficiency.
  • Existing libraries like Flash Attention provide pre-implemented, hardware-optimized fused kernels for easy integration.
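To see what fusion buys, here is a naive, unfused attention forward pass in NumPy; this is a sketch only (Flash Attention's actual CUDA kernels are far more involved), and every intermediate array below stands in for a full round trip to GPU memory that a fused kernel avoids:

```python
import numpy as np

def unfused_attention(q, k, v):
    """Naive attention: each step materializes a full intermediate in memory.
    On a GPU, each line corresponds to at least one separate kernel launch."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # materializes an (L, L) matrix
    scores -= scores.max(axis=-1, keepdims=True)     # softmax pass 1: read/write (L, L)
    weights = np.exp(scores)                         # softmax pass 2: read/write (L, L)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax pass 3: read/write (L, L)
    return weights @ v                               # final matmul

# A fused kernel (e.g. Flash Attention) computes the same result tile by tile
# in on-chip memory, never writing the (L, L) score matrix out to GPU memory.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 64)) for _ in range(3))
out = unfused_attention(q, k, v)
print(out.shape)  # → (128, 64)
```

The separate softmax passes over the score matrix are exactly the kind of memory-bound work that showed up as high GPU Utilization but low SM Efficiency in the case study.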

Results and recommendations: By implementing these optimizations, the team achieved significant performance improvements.

  • Training time was reduced by a factor of 4, and MFU increased from 20% to 38%.
  • The authors recommend tracking SM Efficiency alongside GPU Utilization for a more comprehensive understanding of GPU performance.
  • While more granular metrics like SM occupancy exist, focusing on improving SM Efficiency is a more straightforward approach for most teams.

Looking ahead: As the field of AI and machine learning continues to evolve, so too will the methods for optimizing GPU performance.

  • While manual optimization is currently necessary, future developments in tools like torch.compile may automate many of these processes.
  • Continued research into GPU architecture and performance metrics will likely yield even more sophisticated methods for squeezing maximum performance out of GPUs.
  • As AI models grow more complex and resource-intensive, the importance of accurate performance metrics and optimization techniques will only increase, making this an area ripe for further innovation and study.
