Even human PhDs are struggling with this new math benchmark for AI models

FrontierMath marks a significant development in AI testing: a benchmark of expert-level mathematics problems that is proving exceptionally challenging even for the most advanced AI language models.

The benchmark’s unique approach: FrontierMath represents a novel testing framework that keeps its problems private to prevent AI models from being trained directly on the test data.

  • The test includes hundreds of expert-level mathematics problems that current AI models solve less than 2% of the time (a minimal scoring sketch follows this list)
  • Leading models like Claude 3.5 Sonnet, GPT-4o, o1-preview, and Gemini 1.5 Pro performed poorly, despite having access to Python environments for testing
  • This performance contrasts sharply with their 90%+ success rates on simpler mathematical benchmarks like GSM8K and MATH
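
As a rough illustration of how a headline number like "under 2%" could be produced, here is a minimal scoring sketch assuming strict pass/fail grading against private reference answers; solve_rate, run_model, and the demo data are hypothetical stand-ins, not Epoch AI's evaluation code.

```python
# Hypothetical sketch of computing a solve rate under strict pass/fail grading.
# run_model and the demo data are illustrative assumptions, not Epoch AI's code.
def solve_rate(problems, run_model):
    """Fraction of problems answered exactly correctly (no partial credit)."""
    solved = sum(1 for p in problems if run_model(p["statement"]) == p["answer"])
    return solved / len(problems)


# Toy stand-in data; the real FrontierMath problem set is kept private.
demo = [
    {"statement": "What is 2 + 2?", "answer": 4},
    {"statement": "What is 7 * 8?", "answer": 56},
]

always_four = lambda _statement: 4  # a "model" that always answers 4
print(f"solve rate: {solve_rate(demo, always_four):.0%}")  # prints 50%
```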

Development and validation process: Epoch AI collaborated extensively with the mathematical community to ensure the benchmark’s rigor and validity.

  • Over 60 mathematicians from leading institutions contributed to developing the problems
  • The problems underwent peer review to verify correctness and eliminate ambiguities
  • Approximately 5% of problems required corrections during review, matching the error rate of other major machine learning benchmarks
  • Fields Medal winners Terence Tao and Timothy Gowers participated in reviewing portions of the benchmark

Technical specifications: The benchmark incorporates specific design elements to ensure reliable testing and prevent gaming the system.

  • Problems span multiple mathematical disciplines, including computational number theory and abstract algebraic geometry
  • All answers must be automatically verifiable through computation (see the verification sketch after this list)
  • Problems are “guessproof,” with less than a 1% chance of being solved by random guessing
  • Solutions require either exact integers or precise mathematical objects
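
The automatic-verification requirement suggests a checker that compares a submitted answer against a private reference value by computation rather than by hand. Below is a minimal sketch, assuming answers are exact integers or symbolic expressions and using SymPy for the comparison; verify_answer and the sample answers are hypothetical, not FrontierMath's actual code.

```python
# A minimal sketch of computation-based answer verification, assuming answers
# are exact integers or symbolic expressions; illustrative only.
import sympy as sp


def verify_answer(submitted: str, reference: str) -> bool:
    """Return True only on an exact match, so near-misses score zero."""
    sub, ref = sp.sympify(submitted), sp.sympify(reference)
    # simplify(sub - ref) == 0 accepts equivalent forms like 2**10 vs 1024
    return sp.simplify(sub - ref) == 0


# Exact integer answer: equivalent closed forms verify, approximations fail.
assert verify_answer("2**10", "1024")
assert not verify_answer("1023.9999", "1024")

# A "precise mathematical object" such as an exact algebraic constant.
assert verify_answer("sqrt(2)/2", "1/sqrt(2)")
```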

Comparison to traditional mathematics competitions: FrontierMath takes a distinctly different approach from conventional mathematical challenges.

  • Unlike the International Mathematical Olympiad (IMO), which avoids specialized knowledge and complex calculations, FrontierMath embraces both
  • The benchmark leverages AI systems’ computational capabilities by focusing on algorithmic implementation rather than purely theoretical proofs
  • Problems require both creative insight and complex technical implementation (a toy example of this style follows the list)
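
To make the "algorithmic implementation" point concrete, here is a toy problem in that spirit, invented for illustration and far easier than anything in the benchmark: the insight is Legendre's formula for the exponent of 5 in n!, and the deliverable is a single exact integer that a harness can check automatically.

```python
# A toy illustration (not a FrontierMath problem) of combining a mathematical
# insight with a short computation that yields one exact, checkable integer.
def trailing_zeros(n: int) -> int:
    """Number of trailing zeros of n!, i.e. the exponent of 5 in n!."""
    count, power = 0, 5
    while power <= n:
        count += n // power
        power *= 5
    return count


def smallest_with_zeros(target: int) -> int:
    """Smallest n such that n! ends in exactly `target` zeros."""
    n = 0
    while trailing_zeros(n) < target:
        n += 1
    if trailing_zeros(n) != target:
        raise ValueError(f"no n! has exactly {target} trailing zeros")
    return n


print(smallest_with_zeros(1000))  # 4005 -- a single guess-resistant integer
```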

Future developments: Epoch AI has outlined plans to continue evolving and expanding the benchmark.

  • Regular evaluations of AI models against the benchmark will be conducted
  • Additional sample problems will be released in coming months
  • The problem set will be expanded over time

Strategic implications: The poor performance of leading AI models on FrontierMath raises important questions about the current limits of artificial intelligence in complex mathematical reasoning, suggesting that genuine mathematical problem-solving remains a significant challenge for even the most sophisticated AI systems.

