DeepMind, Berkeley Show How to Make AI Models Better, Not Bigger

Optimizing LLM performance through inference-time compute: Researchers from DeepMind and UC Berkeley studied how to improve large language model (LLM) performance by allocating compute strategically during inference, which could reduce the need for larger models or more extensive pre-training.

The study investigates how to maximize LLM performance using a fixed amount of inference-time compute, comparing different methods and their effectiveness against larger pre-trained models.
This approach aims to enable the deployment of smaller LLMs while achieving comparable performance to larger, more computationally expensive models.

Key strategies for inference-time compute optimization: The researchers focused on two main approaches to improve LLM performance without increasing model size or pre-training.

The first strategy modifies the proposal distribution, that is, the distribution from which the LLM samples its responses. This is done by fine-tuning the LLM to iteratively revise its own answers in complex, reasoning-heavy settings.
The second strategy optimizes the verifier, the mechanism that selects the best answer from the generated responses. This is done by training a process-based reward model (PRM) that scores the correctness of each individual step in an answer.
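The two strategies above can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the researchers' implementation: `toy_step_scorer` stands in for a trained process-based reward model, `toy_reviser` stands in for a revision-tuned LLM, and scoring an answer by its weakest step is just one possible aggregation rule.

```python
from typing import Callable, List, Sequence

Answer = List[str]  # an answer is a list of reasoning steps such as "2+3=5"


def toy_step_scorer(step: str) -> float:
    """Hypothetical stand-in for a process-based reward model.

    Scores a step of the form "a+b=c" as 1.0 if the arithmetic holds, else 0.0.
    """
    left, right = step.split("=")
    a, b = left.split("+")
    return 1.0 if int(a) + int(b) == int(right) else 0.0


def answer_score(answer: Answer, step_scorer: Callable[[str], float]) -> float:
    # Assumption: an answer is only as good as its weakest step.
    return min(step_scorer(step) for step in answer)


def best_of_n(candidates: Sequence[Answer],
              step_scorer: Callable[[str], float]) -> Answer:
    """Verifier side: keep the candidate with the highest aggregated score."""
    return max(candidates, key=lambda ans: answer_score(ans, step_scorer))


def toy_reviser(answer: Answer) -> Answer:
    """Hypothetical stand-in for a revision-tuned LLM: repairs the first wrong step."""
    revised = list(answer)
    for i, step in enumerate(revised):
        if toy_step_scorer(step) == 0.0:
            left, _ = step.split("=")
            a, b = left.split("+")
            revised[i] = f"{a}+{b}={int(a) + int(b)}"
            break
    return revised


def revise_sequentially(initial: Answer,
                        reviser: Callable[[Answer], Answer],
                        rounds: int) -> List[Answer]:
    """Proposal side: each round revises the previous answer in the chain."""
    chain = [initial]
    for _ in range(rounds):
        chain.append(reviser(chain[-1]))
    return chain


candidates = [["2+3=5", "5+4=10"], ["2+3=5", "5+4=9"], ["2+3=6", "6+4=10"]]
best = best_of_n(candidates, toy_step_scorer)
chain = revise_sequentially(["2+3=6", "5+4=8"], toy_reviser, rounds=2)
```

In this toy run, the verifier picks the only candidate whose steps all check out, while the revision chain repairs one wrong step per round. In a real system both the scorer and the reviser would be learned models.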

Experimental findings: The researchers conducted experiments on the challenging MATH benchmark using PaLM-2 models to evaluate their approach.

For easier problems, allowing the model to iteratively refine its initial answer proved more effective than generating multiple samples in parallel.
For more difficult problems requiring exploration of different solution strategies, resampling multiple responses in parallel or deploying tree-search against a process-based reward model was more effective.
The efficacy of a particular test-time (that is, inference-time) compute strategy was found to depend critically on both the nature of the specific problem and the base LLM used.
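To make "tree-search against a process-based reward model" more concrete, here is a small beam-search sketch. It is an illustration under stated assumptions, not the search procedure used in the study: `expand` stands in for the LLM proposing next steps, `score` stands in for the reward model rating a partial solution, and the toy task (pick three numbers that sum to a target) is invented for the example.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # a partial solution, here a tuple of chosen numbers


def beam_search(start: State,
                expand: Callable[[State], List[State]],
                score: Callable[[State], float],
                beam_width: int,
                depth: int) -> State:
    """Keep only the `beam_width` best partial solutions at each step."""
    beam = [start]
    for _ in range(depth):
        children = [child for state in beam for child in expand(state)]
        if not children:
            break
        # Rank partial solutions by the (stand-in) reward model score.
        children.sort(key=score, reverse=True)
        beam = children[:beam_width]
    return max(beam, key=score)


TARGET = 7  # hypothetical toy objective


def toy_expand(state: State) -> List[State]:
    # Stand-in for the LLM proposing candidate next steps.
    return [state + (n,) for n in (1, 2, 3)]


def toy_score(state: State) -> float:
    # Stand-in for a reward model: closer to the target sum is better.
    return -abs(TARGET - sum(state))


result = beam_search((), toy_expand, toy_score, beam_width=2, depth=3)
```

The point of the sketch is the control flow: compute is spent on exploring several branches at every step and pruning by score, rather than on polishing a single answer.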

Performance improvements: The study demonstrated significant performance gains through strategic allocation of test-time compute.

By allocating test-time compute appropriately, the researchers surpassed the best-of-N baseline (sample N answers in parallel and keep the one the verifier scores highest) while using only about 25% of the computation.
This finding highlights the potential of adaptive “compute-optimal” strategies, which choose the approach for each prompt so that additional computation is spent where it helps most.
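An adaptive strategy of this kind could be sketched as a simple planner. This is a hedged illustration only: the difficulty score in [0, 1], the `easy_threshold` cutoff, and the all-or-nothing split between revisions and parallel samples are assumptions made for the example, not values or rules reported by the researchers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    strategy: str
    parallel_samples: int      # independent answers started in parallel
    sequential_revisions: int  # revisions applied to each started answer

    @property
    def total_generations(self) -> int:
        # Each parallel sample costs one initial answer plus its revisions.
        return self.parallel_samples * (1 + self.sequential_revisions)


def plan_compute(difficulty: float, budget: int,
                 easy_threshold: float = 0.5) -> Plan:
    """Spend a fixed generation budget according to estimated difficulty.

    `difficulty` is a hypothetical estimate in [0, 1]; `budget` is the
    number of answers the model is allowed to generate for this prompt.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    if budget < 1:
        raise ValueError("budget must be at least 1")
    if difficulty < easy_threshold:
        # Easier prompt: refine one answer step by step.
        return Plan("sequential_revision", 1, budget - 1)
    # Harder prompt: explore different solution strategies in parallel.
    return Plan("parallel_search", budget, 0)


easy_plan = plan_compute(difficulty=0.2, budget=16)
hard_plan = plan_compute(difficulty=0.9, budget=16)
```

Both plans consume the same budget of 16 generations; only the shape of the computation changes. A practical planner would likely mix the two modes rather than switch between them outright.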

Comparing test-time compute to pre-training: The researchers also investigated how test-time computation compares to additional pre-training in terms of performance improvements.

For easier and medium-difficulty questions, a smaller model given additional test-time compute performed comparably to a 14x larger model that had received more pre-training compute.
However, for the most challenging questions, additional pre-training compute proved more effective. Current approaches to scaling test-time compute may therefore not be a perfect substitute for scaling pre-training in every scenario.
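A back-of-the-envelope calculation shows what such a trade-off looks like in FLOPs. The sketch below relies on common rule-of-thumb approximations (about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token) and assumes both models are pre-trained on the same number of tokens. The concrete parameter and token counts are hypothetical, and the summary above does not state that the researchers used exactly this accounting.

```python
def pretrain_flops(n_params: float, n_tokens: float) -> float:
    # Rule of thumb: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens


def inference_flops(n_params: float, n_tokens: float) -> float:
    # Rule of thumb: ~2 FLOPs per parameter per generated token.
    return 2 * n_params * n_tokens


def breakeven_inference_multiplier(scale: float,
                                   pretrain_tokens: float,
                                   inference_tokens: float) -> float:
    """How many times more inference tokens a small model can generate
    before its total FLOPs equal those of a `scale`-times larger model.

    Solves 6*N*Dp + 2*N*k*Di = 6*(scale*N)*Dp + 2*(scale*N)*Di for k,
    which gives k = scale + 3*(scale - 1)*Dp/Di.
    """
    return scale + 3 * (scale - 1) * pretrain_tokens / inference_tokens


# Hypothetical numbers, chosen only to make the arithmetic visible.
small_params = 1_000_000_000
pretrain_tokens = 1_000_000_000
inference_tokens = 1_000_000_000

k = breakeven_inference_multiplier(14, pretrain_tokens, inference_tokens)
small_total = (pretrain_flops(small_params, pretrain_tokens)
               + inference_flops(small_params, k * inference_tokens))
large_total = (pretrain_flops(14 * small_params, pretrain_tokens)
               + inference_flops(14 * small_params, inference_tokens))
```

The break-even multiplier grows as the ratio of pre-training tokens to inference tokens grows. Under these assumptions, trading pre-training for inference compute is most attractive when a model is expected to serve relatively few inference tokens.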

Future research directions: The study suggests several avenues for further exploration in optimizing LLM performance through inference-time compute.

Exploring more complex strategies that combine different revision and search techniques.
Developing more efficient methods for estimating question difficulty to better tailor the compute allocation strategy.
Investigating how to balance the allocation of compute resources between pre-training and inference for optimal performance across different types of tasks and difficulty levels.

Implications for AI development: This research points to a potential shift in how AI models are developed and deployed in the future.

The findings suggest that, in some scenarios, spending more compute on inference rather than on pre-training could be the more efficient approach.
This could lead to a future where fewer FLOPs (floating-point operations) are spent during pre-training and more are spent during inference, which would change how AI models are developed and deployed.

DeepMind and UC Berkeley shows how to make the most of LLM inference-time compute

VentureBeat
