Meta’s open AI hardware vision

The evolution of Meta’s AI infrastructure: Meta’s push to scale its AI capabilities has driven significant advances in hardware design and infrastructure optimization to support increasingly complex AI models and workloads.

  • Meta has been integrating AI into its core products for years, including features like Feed and its advertising system.
  • The company’s latest AI model, Llama 3.1 405B, has 405 billion parameters and was trained across more than 16,000 NVIDIA H100 GPUs.
  • Meta’s AI training clusters have rapidly scaled from 128 GPUs to two 24,000-GPU clusters in just over a year, with expectations for continued growth.
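
For a sense of the compute behind those numbers, here is a back-of-envelope sketch using the common ≈ 6 × parameters × tokens FLOPs approximation. The 15.6-trillion-token count comes from Meta’s Llama 3.1 report; the per-GPU throughput is the H100’s published dense BF16 peak, and the utilization figure is an illustrative assumption, not a Meta-reported value.

```python
# Back-of-envelope training-compute estimate for Llama 3.1 405B.
# Uses the common ~6 * params * tokens FLOPs approximation; the token
# count is from Meta's Llama 3.1 report, utilization is an assumption.

params = 405e9          # model parameters
tokens = 15.6e12        # training tokens (reported for Llama 3.1)
flops = 6 * params * tokens
print(f"Total training compute: {flops:.2e} FLOPs")   # ~3.8e25

gpus = 16_000           # H100s used for training (reported)
peak_bf16 = 989e12      # H100 peak dense BF16 FLOP/s (spec sheet)
mfu = 0.40              # assumed model FLOPs utilization

seconds = flops / (gpus * peak_bf16 * mfu)
print(f"Wall-clock at {mfu:.0%} MFU: {seconds / 86_400:.0f} days")
```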

Networking challenges and solutions: The scale of Meta’s AI operations necessitates advanced networking solutions to ensure optimal performance and scalability.

  • AI clusters require tightly integrated high-performance computing systems and isolated high-bandwidth compute networks.
  • Meta anticipates needing injection bandwidth of around one terabyte per second per accelerator in the coming years, a tenfold increase over current capabilities (see the sketch after this list).
  • To meet these demands, the company is developing a high-performance, multi-tier, non-blocking network fabric with modern congestion control mechanisms.
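
To make that target concrete, the sketch below works out what one terabyte per second of injection bandwidth per accelerator implies at the scale of a 24,000-GPU cluster. The aggregate figure and the implied current baseline are simple arithmetic from the numbers above, not Meta-published specifications.

```python
# Rough aggregate-bandwidth arithmetic for the injection-bandwidth target.
# Cluster size and per-accelerator target are from the article; the
# "current" baseline is implied by the stated tenfold increase.

accelerators = 24_000                 # GPUs in one of Meta's current clusters
target_per_gpu = 1e12                 # ~1 TB/s injection bandwidth target
current_per_gpu = target_per_gpu / 10 # tenfold increase implies ~100 GB/s today

aggregate = accelerators * target_per_gpu
print(f"Aggregate injection bandwidth: {aggregate / 1e15:.0f} PB/s")  # 24 PB/s
print(f"Per-GPU today vs. target: {current_per_gpu / 1e9:.0f} GB/s -> "
      f"{target_per_gpu / 1e12:.0f} TB/s")
```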

Open hardware initiatives: Meta is championing open hardware solutions to accelerate AI innovation and foster collaboration within the industry.

  • The company announced Catalina, a new high-powered rack designed for AI workloads, based on the NVIDIA Blackwell platform and capable of supporting up to 140 kW of power.
  • Meta has expanded its Grand Teton AI platform to support AMD Instinct MI300X accelerators, offering greater compute capacity and memory for large-scale AI inference workloads (see the memory sketch after this list).
  • The new Disaggregated Scheduled Fabric (DSF) for next-generation AI clusters aims to overcome limitations in scale, component supply options, and power density.
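
As a rough illustration of why accelerator memory matters at this scale, the sketch below estimates the weights-only footprint of a 405-billion-parameter model against the MI300X’s published 192 GB of HBM. The precision choice is an assumption, and KV-cache and activation memory are ignored, so the real accelerator count for serving would be higher.

```python
# Weights-only memory estimate for serving a 405B-parameter model.
# 192 GB is the published MI300X HBM capacity; the bf16 precision
# choice is an assumption, and KV cache/activations are ignored.

import math

params = 405e9
bytes_per_param = 2        # bf16/fp16 weights
hbm_per_gpu = 192e9        # MI300X HBM3 capacity in bytes

weights_bytes = params * bytes_per_param
gpus_needed = math.ceil(weights_bytes / hbm_per_gpu)
print(f"Weights: {weights_bytes / 1e9:.0f} GB -> "
      f"at least {gpus_needed} accelerators for weights alone")  # 810 GB -> 5
```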

Collaboration with industry partners: Meta’s partnership with Microsoft and other tech giants is driving open innovation in AI infrastructure.

  • Meta and Microsoft have collaborated on various OCP initiatives, including the Switch Abstraction Interface (SAI) and Open Accelerator Module (OAM) standard.
  • The companies are currently working on Mount Diablo, a new disaggregated power rack featuring a scalable 400 VDC unit designed to improve power-delivery efficiency.
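
The efficiency argument for higher-voltage distribution is simple physics: at a fixed power budget, current scales as I = P / V and conduction loss as I²R, so raising the busbar voltage cuts both. The sketch below applies this to the 140 kW figure from the Catalina rack above; the 48 V comparison point (a common OCP rack busbar voltage today) is an assumption for illustration.

```python
# Why a 400 VDC rack helps at high power: for a fixed power budget,
# busbar current is I = P / V and conduction loss scales as I^2 * R.
# 140 kW is the Catalina figure above; the 48 V baseline is an assumption.

power_w = 140_000                  # rack power budget in watts
for volts in (48, 400):
    amps = power_w / volts
    # Relative conduction loss for the same busbar resistance R,
    # normalized to the 400 V case: (I / I_400V)^2.
    rel_loss = (amps / (power_w / 400)) ** 2
    print(f"{volts:>3} V -> {amps:,.0f} A (relative I^2*R loss: x{rel_loss:.0f})")
```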

The importance of open source in AI development: Meta emphasizes the critical role of open source in advancing AI technology and ensuring its benefits are widely accessible.

  • Open source software frameworks are essential for driving model innovation, ensuring portability, and promoting transparency in AI development.
  • Standardized models help leverage collective expertise, make AI more accessible, and work towards minimizing biases in AI systems.
  • Open AI hardware systems are crucial for delivering high-performance, cost-effective, and adaptable infrastructure necessary for AI advancement.

Looking ahead: Meta’s vision for the future of AI infrastructure emphasizes collaboration and open innovation to unlock the full potential of AI technology.

  • The company encourages engagement with the OCP community to address AI’s infrastructure needs collectively.
  • By fostering an open ecosystem for AI hardware and software development, Meta aims to make the benefits and opportunities of AI accessible to people worldwide.