The evolution of Meta’s AI infrastructure: Scaling AI at Meta has driven major advances in hardware design and infrastructure optimization to support increasingly complex models and workloads.
- Meta has been integrating AI into its core products for years, including features like Feed and its advertising system.
- The company’s latest AI model, Llama 3.1 405B, boasts 405 billion parameters and required training across more than 16,000 NVIDIA H100 GPUs.
- Meta’s AI training clusters have rapidly scaled from 128 GPUs to two 24,000-GPU clusters in just over a year, with expectations for continued growth.
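To see why a 405B-parameter model demands clusters of this size, some back-of-envelope memory arithmetic helps. The sketch below is illustrative only: it assumes bf16 weights and Adam-style optimizer state (an fp32 master copy plus two fp32 moments, ~12 extra bytes per parameter); Meta’s actual training recipe and sharding strategy may differ.

```python
# Illustrative memory arithmetic for a 405B-parameter model.
# Assumptions (not from the source): bf16 weights, Adam-style
# optimizer state, 80 GB of HBM per H100.

PARAMS = 405e9          # Llama 3.1 405B parameter count
BF16_BYTES = 2          # bytes per parameter for bf16 weights
OPT_BYTES = 12          # fp32 master copy + two fp32 moments
H100_HBM_GB = 80        # HBM capacity of one H100

weights_gb = PARAMS * BF16_BYTES / 1e9
train_state_gb = PARAMS * (BF16_BYTES + OPT_BYTES) / 1e9

print(f"bf16 weights alone: {weights_gb:,.0f} GB "
      f"(~{weights_gb / H100_HBM_GB:.0f} H100s just to hold them)")
print(f"weights + optimizer state: {train_state_gb:,.0f} GB")
```

Even before activations and gradients, the training state runs to several terabytes, so the model must be sharded across hundreds of GPUs; the thousands beyond that exist to parallelize over training data.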
Networking challenges and solutions: The scale of Meta’s AI operations necessitates advanced networking solutions to ensure optimal performance and scalability.
- AI clusters require tightly integrated high-performance computing systems and isolated high-bandwidth compute networks.
- Meta anticipates needing injection bandwidth of around one terabyte per second per accelerator in the coming years, representing a tenfold increase from current capabilities.
- To meet these demands, the company is developing a high-performance, multi-tier, non-blocking network fabric with modern congestion control mechanisms.
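The scale of that bandwidth target becomes concrete with a rough sizing exercise. The per-accelerator figure and cluster size come from the text above; the 400 Gbit/s per-port link rate is an assumption chosen for illustration, not a statement about Meta’s deployed hardware.

```python
# Rough sizing of a non-blocking fabric at ~1 TB/s injection
# bandwidth per accelerator. Link rate (400G) is an assumption.

PER_ACCEL_TBPS = 1.0     # target injection bandwidth (TB/s, i.e. terabytes)
CLUSTER_GPUS = 24_000    # size of one current Meta training cluster
LINK_GBPS = 400          # assumed per-port rate in Gbit/s (400 GbE)

aggregate_tbps = PER_ACCEL_TBPS * CLUSTER_GPUS
links_per_gpu = PER_ACCEL_TBPS * 8 * 1000 / LINK_GBPS  # TB/s -> Gbit/s

print(f"aggregate injection bandwidth: {aggregate_tbps:,.0f} TB/s")
print(f"assumed 400G links per GPU: {links_per_gpu:.0f}")
```

At these aggregates, any oversubscription or congestion hot spot stalls the whole synchronous training job, which is why the text emphasizes non-blocking topologies and modern congestion control rather than raw link speed alone.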
Open hardware initiatives: Meta is championing open hardware solutions to accelerate AI innovation and foster collaboration within the industry.
- The company announced Catalina, a new high-powered rack designed for AI workloads, based on the NVIDIA Blackwell platform and capable of supporting up to 140kW of power.
- Meta has expanded its Grand Teton AI platform to support AMD Instinct MI300X accelerators, offering greater compute capacity and memory for large-scale AI inference workloads.
- The new Disaggregated Scheduled Fabric (DSF) for next-generation AI clusters aims to overcome limitations in scale, component supply options, and power density.
Collaboration with industry partners: Meta’s partnership with Microsoft and other tech giants is driving open innovation in AI infrastructure.
- Meta and Microsoft have collaborated on various OCP initiatives, including the Switch Abstraction Interface (SAI) and Open Accelerator Module (OAM) standard.
- The companies are currently working on Mount Diablo, a new disaggregated power rack featuring a 400 VDC unit designed for greater efficiency and scalability.
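A quick Ohm’s-law comparison shows why a higher distribution voltage helps at rack powers like Catalina’s 140 kW. The 140 kW figure comes from the text; the 48 V baseline is an assumption, used here only because it is a common OCP rack busbar voltage.

```python
# Busbar current at two distribution voltages for a 140 kW rack.
# 140 kW is Catalina's stated capacity; 48 V is an assumed
# comparison point (a typical OCP rack busbar voltage).

RACK_KW = 140

for volts in (48, 400):
    amps = RACK_KW * 1000 / volts   # P = V * I  =>  I = P / V
    print(f"{volts:>3} VDC -> {amps:,.0f} A busbar current")

# Conduction loss scales as I^2 * R, so cutting current ~8.3x
# (48 V -> 400 V) cuts resistive loss ~69x for the same conductor.
```

Lower current also means thinner, cheaper busbars and connectors, which is part of what makes a single scalable power unit practical at these densities.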
The importance of open source in AI development: Meta emphasizes the critical role of open source in advancing AI technology and ensuring its benefits are widely accessible.
- Open source software frameworks are essential for driving model innovation, ensuring portability, and promoting transparency in AI development.
- Standardized models help leverage collective expertise, make AI more accessible, and work towards minimizing biases in AI systems.
- Open AI hardware systems are crucial for delivering high-performance, cost-effective, and adaptable infrastructure necessary for AI advancement.
Looking ahead: Meta’s vision for the future of AI infrastructure emphasizes collaboration and open innovation to unlock the full potential of AI technology.
- The company encourages engagement with the OCP community to address AI’s infrastructure needs collectively.
- By fostering an open ecosystem for AI hardware and software development, Meta aims to make the benefits and opportunities of AI accessible to people worldwide.
Meta’s open AI hardware vision