The leading AI models just failed ‘Humanity’s Last Exam’ — but could you do any better?

AI models have scored poorly on a new ultra-difficult intelligence benchmark called “Humanity’s Last Exam,” with even the most advanced systems achieving less than 10% accuracy on its challenging questions.

The benchmark’s development: Scale AI and the Center for AI Safety (CAIS) collaborated to create Humanity’s Last Exam, designed to test AI systems at the absolute limits of human expertise and knowledge.

  • The test comprises 3,000 questions contributed by experts from over 500 institutions across 50 countries
  • Originally named “Humanity’s Last Stand,” the title was later softened to “Last Exam”
  • Questions span highly specialized topics requiring deep expertise in fields like biology, linguistics, and mythology

Performance metrics: Current AI models score far lower on this benchmark than they do on other standard AI tests.

  • DeepSeek-R1 achieved the highest score at 9.4%
  • Google’s Gemini scored 6.2%
  • Claude 3.5 Sonnet reached 4.3%
  • OpenAI’s GPT-4o managed only 3.3%
  • The results stand in stark contrast to AI’s typically strong performance on other benchmarks like GPQA, MATH, and MMLU

Sample questions: The exam features extremely complex questions that challenge even human experts in their respective fields.

  • One question involves detailed anatomical knowledge about hummingbird bone structure and tendon pairs
  • Another requires advanced understanding of Biblical Hebrew syllable analysis using the Tiberian pronunciation tradition
  • A third tests knowledge of genealogy in Greek mythology

Current implications: The benchmark reveals significant limitations in current AI systems’ reasoning capabilities and specialized knowledge.

  • AI models struggle with questions requiring deep domain expertise
  • The gap between human expert knowledge and AI capabilities remains substantial in specialized fields
  • The test serves as a meaningful measure of AI progress in advanced reasoning tasks

Looking ahead: While current AI models cannot effectively tackle Humanity’s Last Exam, the rapid pace of AI development suggests their scores are likely to improve.

  • OpenAI’s recent release of Operator, its first AI agent, demonstrates ongoing advances in AI capabilities
  • The benchmark provides a clear metric for measuring progress in AI reasoning and specialized knowledge
  • The significant performance gap indicates substantial room for improvement in AI systems

Reading between the lines: This benchmark may provide a more realistic assessment of AI capabilities than previous tests. That could help temper both excessive optimism and unfounded fears about current AI systems’ abilities, while establishing a clear marker for measuring future progress.

Could you pass ‘Humanity’s Last Exam’? Probably not, but neither can AI
