The big picture: GPU Utilization, a commonly used metric for assessing GPU performance in machine learning workloads, can be misleading because it does not reflect how efficiently the GPU’s compute is actually being used.
Understanding GPU Utilization: GPU Utilization, as defined by Nvidia, measures the percentage of time during which one or more kernels are executing on the GPU, but fails to account for the efficiency of core usage or workload parallelization.
- This metric can reach 100% even when the GPU is only performing memory read/write operations without any actual computations.
- The gap between GPU Utilization and actual computational efficiency became apparent when a foundation model company reached 100% GPU Utilization but only 20% Model FLOPS Utilization (MFU), where FLOPS stands for floating point operations per second.
The importance of MFU: MFU, introduced in Google’s PaLM paper, gives a more accurate picture of GPU performance by comparing observed throughput to the hardware’s theoretical maximum.
- MFU is the ratio of the floating point operations per second a workload actually achieves to the peak the GPUs can deliver.
- Most LLM training runs achieve 35-45% MFU, so 20% is notably low despite 100% GPU Utilization.
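The ratio described above can be sketched in a few lines. This is a minimal illustration, not the exact PaLM formula: it assumes the common approximation of about 6 FLOPs per parameter per token for a dense transformer's forward and backward pass and ignores the attention term. The function name and all input numbers are hypothetical, chosen only to show the arithmetic.

```python
def model_flops_utilization(tokens_per_second, n_params, n_gpus, peak_flops_per_gpu):
    """Approximate MFU for training a dense transformer.

    Assumption: ~6 FLOPs per parameter per token (forward + backward),
    with the attention-specific term left out for simplicity.
    """
    achieved_flops = 6 * n_params * tokens_per_second  # FLOPs/s actually delivered
    peak_flops = n_gpus * peak_flops_per_gpu           # FLOPs/s the hardware could deliver
    return achieved_flops / peak_flops


# Hypothetical training run, for illustration only.
mfu = model_flops_utilization(
    tokens_per_second=20_000,    # assumed measured throughput across all GPUs
    n_params=7e9,                # assumed model size
    n_gpus=8,
    peak_flops_per_gpu=312e12,   # assumed peak: A100 dense BF16
)
print(f"MFU: {mfu:.1%}")
```

Note that nothing in this calculation depends on GPU Utilization, which is why the two numbers can diverge as widely as they did in the case above.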
Diving deeper into GPU architecture: Understanding the structure of GPUs is crucial for interpreting performance metrics accurately.
- A GPU is organized into streaming multiprocessors (SMs), each of which acts as a manager for a group of cores.
- CUDA kernels execute work on CUDA cores through one or more SMs.
- GPU Utilization only measures kernel execution time, not the efficiency of core usage or workload parallelization.
Introducing SM Efficiency: SM Efficiency, also known as SM activity, provides a more nuanced view of GPU performance by measuring the percentage of active SMs during a given time interval.
- This metric helps identify inefficiencies in model execution that may not be apparent from GPU Utilization alone.
- In the case study, the Softmax kernel showed high GPU Utilization but low SM Efficiency, indicating a potential bottleneck.
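The difference between the two metrics can be shown with a toy timeline. The sketch below is a simplified model, not how any profiler computes these values: it assumes kernels run one at a time, that each kernel keeps a fixed number of SMs active for its whole duration, and that the GPU has 100 SMs. The kernel timings and SM counts are invented for illustration.

```python
def gpu_utilization(kernels, window):
    """Fraction of the window during which at least one kernel is executing."""
    covered, current_end = 0.0, 0.0
    for start, end, _active_sms in sorted(kernels):
        start = max(start, current_end)  # skip time already counted
        end = min(end, window)
        if end > start:
            covered += end - start
            current_end = end
    return covered / window


def sm_efficiency(kernels, window, total_sms):
    """Time-averaged fraction of SMs that are active.

    Assumption: kernels do not overlap in time and all lie inside the window.
    """
    sm_time = sum((end - start) * active_sms for start, end, active_sms in kernels)
    return sm_time / (window * total_sms)


# Hypothetical 100 ms window on a GPU with 100 SMs:
# (start_ms, end_ms, active_sms)
kernels = [
    (0.0, 40.0, 100),   # a well-parallelized kernel that fills every SM
    (40.0, 100.0, 10),  # a poorly parallelized kernel that keeps only 10 SMs busy
]

util = gpu_utilization(kernels, window=100.0)
sm_eff = sm_efficiency(kernels, window=100.0, total_sms=100)
print(f"GPU Utilization: {util:.0%}, SM Efficiency: {sm_eff:.0%}")
```

In this toy case a kernel is always running, so GPU Utilization is at its maximum, while the second kernel drags the SM Efficiency well below it.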
Optimizing performance through kernel fusion: After identifying inefficiencies using SM Efficiency, the team focused on optimizing the transformer stack through kernel fusion.
- Kernel fusion involves replacing PyTorch native layer definitions with GPU kernels that combine multiple layers.
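- Because a fused kernel does not write intermediate results to GPU memory and read them back between layers, this approach reduces memory read/write operations and improves overall computational efficiency.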
- Existing libraries like Flash Attention provide pre-implemented, hardware-optimized fused kernels for easy integration.
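A rough way to see why fusion cuts memory traffic is to count the bytes moved. The sketch below is a back-of-the-envelope model for a chain of elementwise operations only; it is not a model of Flash Attention, which relies on tiling and other techniques. It assumes every unfused operation reads its full input from GPU memory and writes its full output back, while the fused version reads the input once and writes the final result once. The tensor shape and the three-operation chain are hypothetical.

```python
def memory_bytes_moved(num_elements, bytes_per_element, num_ops, fused):
    """Bytes read from and written to GPU memory for a chain of elementwise ops.

    Unfused: each op is its own kernel, so it reads the tensor and writes it back.
    Fused:   one kernel reads the input once and writes the final output once;
             intermediates are assumed to stay in registers.
    """
    passes = 1 if fused else num_ops
    return passes * 2 * num_elements * bytes_per_element  # 2 = one read + one write


# Hypothetical 4096 x 4096 half-precision tensor passing through three
# elementwise ops (for example: scale, add bias, activation).
num_elements = 4096 * 4096
unfused = memory_bytes_moved(num_elements, bytes_per_element=2, num_ops=3, fused=False)
fused = memory_bytes_moved(num_elements, bytes_per_element=2, num_ops=3, fused=True)

print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB")
```

Under these assumptions the traffic scales with the number of separate kernels, so fusing the chain into one kernel removes the intermediate round trips entirely.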
Results and recommendations: By implementing these optimizations, the team achieved significant performance improvements.
- Training time was reduced by a factor of 4, and MFU increased from 20% to 38%.
- The authors recommend tracking SM Efficiency alongside GPU Utilization for a more comprehensive understanding of GPU performance.
- While more granular metrics like SM occupancy exist, focusing on improving SM Efficiency is a more straightforward approach for most teams.
Looking ahead: The future of GPU performance optimization: As the field of AI and machine learning continues to evolve, so too will the methods for optimizing GPU performance.
- While manual optimization is currently necessary, future developments in tools like torch.compile may automate many of these processes.
- Continued research into GPU architecture and performance metrics will likely yield more sophisticated methods for getting the most out of GPUs.
- As AI models grow more complex and resource-intensive, accurate performance metrics and optimization techniques will matter even more.
GPU Utilization is a Misleading Metric