AI models have scored poorly on a new ultra-difficult intelligence benchmark called “Humanity’s Last Exam,” with even the most advanced systems achieving less than 10% accuracy on its challenging questions.
The benchmark’s development: Scale AI and the Center for AI Safety (CAIS) collaborated to create Humanity’s Last Exam, designed to test AI systems at the absolute limits of human expertise and knowledge.
- The test comprises 3,000 questions contributed by experts from over 500 institutions across 50 countries
- Originally named “Humanity’s Last Stand,” the benchmark was later given the softer title “Humanity’s Last Exam”
- Questions span highly specialized topics requiring deep expertise in fields like biology, linguistics, and mythology
Performance metrics: Current AI models have performed notably poorly on the new benchmark, falling far short of their results on standard AI tests; a toy sketch of how such accuracy scores are computed follows the list below.
- DeepSeek-R1 achieved the highest score at 9.4%
- Google’s Gemini scored 6.2%
- Claude 3.5 Sonnet reached 4.3%
- OpenAI’s GPT-4o managed only 3.3%
- The results show a stark contrast to AI’s typically strong performance on other benchmarks like GPQA, MATH, and MMLU
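For context on what a score like 9.4% means, here is a minimal sketch of how exam-style benchmark accuracy is typically computed. The question format, exact-match grading, and `ask_model` interface below are illustrative assumptions, not the actual Humanity’s Last Exam evaluation harness.

```python
# Minimal sketch of exam-style benchmark scoring.
# The data format and `ask_model` interface are hypothetical stand-ins,
# not the actual Humanity's Last Exam harness.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExamQuestion:
    prompt: str  # the expert-written question
    answer: str  # reference answer used for grading


def score_model(
    questions: list[ExamQuestion],
    ask_model: Callable[[str], str],
) -> float:
    """Return accuracy: the fraction of questions answered correctly."""
    correct = 0
    for q in questions:
        prediction = ask_model(q.prompt)
        # Exact-match grading after light normalization; real benchmarks
        # often use more forgiving or judge-based grading.
        if prediction.strip().lower() == q.answer.strip().lower():
            correct += 1
    return correct / len(questions)


# A model answering roughly 1 in 10 questions correctly would score ~10%,
# in line with the sub-10% results reported above.
if __name__ == "__main__":
    demo = [
        ExamQuestion("What is 2 + 2?", "4"),
        ExamQuestion("What is the capital of France?", "Paris"),
    ]
    print(score_model(demo, lambda prompt: "4"))  # 0.5, i.e. 50% on this toy set
```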
Sample questions: The exam features extremely complex questions that challenge even human experts in their respective fields.
- One question involves detailed anatomical knowledge about hummingbird bone structure and tendon pairs
- Another requires advanced understanding of Biblical Hebrew syllable analysis using the Tiberian pronunciation tradition
- A third tests knowledge of Greek mythology genealogy
Current implications: The benchmark reveals significant limitations in current AI systems’ reasoning capabilities and specialized knowledge.
- AI models struggle with questions requiring deep domain expertise
- The gap between human expert knowledge and AI capabilities remains substantial in specialized fields
- The test serves as a meaningful measure of AI progress in advanced reasoning tasks
Looking ahead: While current AI models cannot yet handle Humanity’s Last Exam, the rapid pace of AI development suggests scores are likely to improve.
- OpenAI’s recent release of Operator, its first AI agent, demonstrates ongoing advances in AI capabilities
- The benchmark provides a clear metric for measuring progress in AI reasoning and specialized knowledge
- The significant performance gap indicates substantial room for improvement in AI systems
Reading between the lines: This benchmark may provide a more realistic assessment of AI capabilities than previous tests, helping to temper both excessive optimism and unfounded fears about current AI systems’ abilities while establishing a clear marker for measuring future progress.