DeepMind’s new Mixture-of-Experts (MoE) architecture, PEER, scales language models to millions of tiny experts, improving performance while keeping computational costs down.
Key innovation: Parameter Efficient Expert Retrieval (PEER). DeepMind’s novel MoE architecture introduces a learned index that efficiently routes each input to a vast pool of millions of tiny experts, enabling significant scaling without slowing down inference:
- PEER replaces the fixed router of traditional MoE with a fast initial computation (product-key retrieval) that shortlists candidate experts before activating only the top ones.
- Unlike previous MoE architectures built from a few large experts, PEER uses tiny experts with a single neuron in the hidden layer, which encourages knowledge sharing among experts and improves parameter efficiency.
- PEER employs a multi-head retrieval approach, similar to multi-head attention in transformer models, to compensate for the small size of each expert (a minimal sketch of the layer follows this list).
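To make the routing and the single-neuron experts concrete, here is a minimal sketch of a PEER-style layer in PyTorch. It is an illustrative simplification, not DeepMind’s implementation: the class name, hyperparameters, and the plain GELU expert activation are assumptions; only the overall scheme (a query per retrieval head scored against two small sub-key tables, candidates combined via product keys, and single-neuron experts gated by a softmax over the retrieved scores) follows the description above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEERSketch(nn.Module):
    """Illustrative PEER-style layer: product-key retrieval over tiny single-neuron experts."""

    def __init__(self, d_model=256, num_experts=65_536, num_heads=8, top_k=16):
        super().__init__()
        # Experts are laid out on a sqrt(N) x sqrt(N) grid; the paper scales N past one million.
        self.n_sub = math.isqrt(num_experts)
        assert self.n_sub * self.n_sub == num_experts
        self.num_heads, self.top_k, self.d = num_heads, top_k, d_model
        # One query per retrieval head (multi-head retrieval).
        self.query = nn.Linear(d_model, num_heads * d_model)
        # Product keys: two sub-key tables of size sqrt(N) replace one N x d key table.
        self.sub_keys1 = nn.Parameter(torch.randn(self.n_sub, d_model // 2) / math.sqrt(d_model))
        self.sub_keys2 = nn.Parameter(torch.randn(self.n_sub, d_model // 2) / math.sqrt(d_model))
        # Each expert is a single neuron: one down-projection row and one up-projection row.
        self.w_down = nn.Embedding(num_experts, d_model)
        self.w_up = nn.Embedding(num_experts, d_model)

    def forward(self, x):                                   # x: (tokens, d)
        t = x.shape[0]
        q = self.query(x).view(t, self.num_heads, self.d)
        q1, q2 = q.split(self.d // 2, dim=-1)               # (tokens, heads, d/2) each
        # Score each query half against its sub-key table: O(sqrt(N)) instead of O(N).
        v1, i1 = (q1 @ self.sub_keys1.T).topk(self.top_k, dim=-1)
        v2, i2 = (q2 @ self.sub_keys2.T).topk(self.top_k, dim=-1)
        # Combine the two shortlists into k*k candidates, then keep the overall top-k experts.
        cand_scores = v1.unsqueeze(-1) + v2.unsqueeze(-2)             # (tokens, heads, k, k)
        cand_ids = i1.unsqueeze(-1) * self.n_sub + i2.unsqueeze(-2)
        scores, flat_idx = cand_scores.flatten(-2).topk(self.top_k, dim=-1)
        expert_ids = cand_ids.flatten(-2).gather(-1, flat_idx)       # (tokens, heads, k)
        gates = F.softmax(scores, dim=-1)
        # Fetch only the selected single-neuron experts and apply them.
        w_down = self.w_down(expert_ids)                     # (tokens, heads, k, d)
        w_up = self.w_up(expert_ids)
        h = F.gelu((w_down * x[:, None, None, :]).sum(-1))   # one scalar activation per expert
        out = (gates * h).unsqueeze(-1) * w_up
        return out.sum(dim=(1, 2))                           # sum over heads and selected experts

# Tiny configuration just to show the layer runs end to end.
layer = PEERSketch(d_model=64, num_experts=256, num_heads=2, top_k=4)
print(layer(torch.randn(4, 64)).shape)                       # torch.Size([4, 64])
```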
Overcoming MoE scaling limitations: Current MoE techniques are limited to a relatively small number of experts because their fixed routers must be readjusted whenever new experts are added:
- Studies suggest increasing MoE “granularity” (number of experts) can improve performance, especially with increased model size and training data.
- High-granularity MoE also lets models learn new knowledge more efficiently, by adding new experts together with proper regularization.
- PEER addresses these challenges, allowing MoE to scale to millions of experts without the limitations of fixed routers (a rough routing-cost comparison follows this list).
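To make the bottleneck behind these limitations concrete, here is a back-of-the-envelope comparison of per-token routing cost for a conventional dense router versus product-key retrieval. The function and the FLOP accounting are illustrative assumptions, not figures from the paper.

```python
import math

def routing_cost(num_experts: int, d_model: int = 256) -> dict:
    """Rough per-token multiply-add counts for two routing schemes (illustrative only)."""
    n_sub = math.isqrt(num_experts)
    return {
        "experts": num_experts,
        # Dense/fixed router: one d-dimensional dot product per expert.
        "dense_router": num_experts * d_model,
        # Product keys: two sub-key tables of size sqrt(N), each scored with d/2-dim dot products.
        "product_key_retrieval": 2 * n_sub * (d_model // 2),
    }

for n in (1_024, 65_536, 1_048_576):
    print(routing_cost(n))
# At ~1M experts the dense router needs ~268M multiply-adds per token just to route
# (and its N x d weight matrix must be resized and retrained when experts are added),
# while product-key retrieval needs ~262K.
```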
Outperforming dense models and other MoEs: Experiments show PEER models achieve a better performance-compute tradeoff compared to transformer models with dense feedforward layers and other MoE architectures:
- PEER reaches lower perplexity scores than its counterparts at the same computational budget (see the note on perplexity after this list).
- Increasing the number of experts in PEER leads to further perplexity reduction, challenging the belief that MoE efficiency peaks with a limited number of experts.
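As a quick reminder of the metric behind these comparisons (a standard definition, not a PEER-specific result): perplexity is the exponential of the mean per-token cross-entropy, so small loss differences at equal compute show up directly as the perplexity gaps reported.

```python
import math

def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity is exp(loss) when the cross-entropy is measured in nats per token."""
    return math.exp(mean_cross_entropy_nats)

# Illustrative numbers only: shaving the loss from 2.60 to 2.55 nats/token
# lowers perplexity from roughly 13.5 to roughly 12.8 at the same compute budget.
print(perplexity(2.60), perplexity(2.55))
```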
Broader implications for large language models: PEER’s approach can help further reduce the cost and complexity of training and serving very large language models:
- PEER might be used in DeepMind’s Gemini 1.5 models, which reportedly use “a new Mixture-of-Experts (MoE) architecture.”
- The architecture shows potential for dynamically adding new knowledge and features to LLMs by adapting PEER to select parameter-efficient fine-tuning (PEFT) adapters at runtime (a hypothetical sketch follows this list).
- As the race to scale LLMs continues, PEER represents a significant step in improving the performance-compute tradeoff, positioning it as a competitive alternative to dense layers in foundation models.
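As a purely hypothetical illustration of the adapter-selection idea above, the sketch below reuses the same retrieve-then-gate pattern to pick a few LoRA-style low-rank adapters per token at runtime. The class, shapes, and the simplified single key table (no product keys) are all assumptions; the article only describes this as a potential direction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdapterRetrievalSketch(nn.Module):
    """Hypothetical: retrieve a few low-rank adapters per token and apply them as a residual update."""

    def __init__(self, d_model=256, num_adapters=1024, rank=4, top_k=2):
        super().__init__()
        self.d, self.r, self.top_k = d_model, rank, top_k
        # One key per adapter; a real system could use product keys as in PEER.
        self.keys = nn.Parameter(torch.randn(num_adapters, d_model) / d_model ** 0.5)
        # Each adapter is a LoRA-style low-rank pair: down (d -> r) and up (r -> d).
        self.down = nn.Embedding(num_adapters, d_model * rank)
        self.up = nn.Embedding(num_adapters, rank * d_model)

    def forward(self, x):                                    # x: (tokens, d)
        scores, ids = (x @ self.keys.T).topk(self.top_k, dim=-1)
        gates = F.softmax(scores, dim=-1)                    # (tokens, k)
        down = self.down(ids).view(*ids.shape, self.d, self.r)
        up = self.up(ids).view(*ids.shape, self.r, self.d)
        h = torch.einsum("td,tkdr->tkr", x, down)            # project through each chosen adapter
        delta = torch.einsum("tkr,tkrd->tkd", h, up)
        return x + (gates.unsqueeze(-1) * delta).sum(dim=1)  # gated residual update

# Tiny configuration just to show the idea runs.
print(AdapterRetrievalSketch(d_model=32, num_adapters=64, rank=2, top_k=2)(torch.randn(3, 32)).shape)
```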