Scale AI and CAIS Unveil Results of Humanity's Last Exam
Scale AI and the Center for AI Safety (CAIS) have released results from "Humanity's Last Exam," a new AI benchmark testing expert-level knowledge across multiple fields; current AI models answered fewer than 10% of its questions correctly.
Project Overview: The benchmark aims to test AI systems’ capabilities at the frontiers of human expertise across mathematics, humanities, and natural sciences.
- The project collected over 70,000 trial questions, narrowed down to 3,000 final questions through expert review
- Leading AI models tested included OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet, Google Gemini 1.5 Pro, and OpenAI o1
- Nearly 1,000 contributors from more than 500 institutions across 50 countries participated in question development
Methodology and Design: The benchmark was created to address "benchmark saturation," the problem that AI models achieve near-perfect scores on existing tests yet still fail on questions beyond what those tests cover.
- Questions span multiple formats, including text-only and multi-modal challenges with images and diagrams
- Content focuses on world-class expert-level problems across diverse domains
- The exam includes highly specialized questions, such as a detailed ecology question about hummingbird anatomy
Key Findings: Current AI models demonstrated limited capability in answering expert-level questions, highlighting significant room for improvement.
- Models answered fewer than 10% of questions correctly
- Differences in accuracy between models were small enough that they could be attributed to chance
- The benchmark reveals clear gaps in AI systems’ reasoning capabilities at expert levels
Research Impact: The project establishes a new standard for evaluating advanced AI systems while promoting collaborative research.
- The dataset will be made available to the research community, with a small subset reserved for future evaluations
- Financial awards were offered for top contributions: $5,000 for each of the top 50 questions and $500 for the next 500 best submissions
- Contributors were offered coauthorship opportunities on the final paper
Looking Beyond Current Capabilities: Historical precedent suggests rapid progress in AI capabilities is possible, though significant challenges remain.
- On earlier benchmarks such as MATH, model accuracy climbed from under 10% to over 90% in just three years
- The current performance gap on Humanity’s Last Exam indicates substantial room for advancement in AI reasoning capabilities
- The benchmark provides a roadmap for future AI research and development priorities