The race toward artificial general intelligence (AGI) has hit a sobering checkpoint as a new benchmark reveals the limitations of today’s most advanced AI systems. The ARC Prize Foundation’s ARC-AGI-2 test introduces efficiency metrics alongside performance standards, showing that even cutting-edge models score in the low single digits while costing significantly more than humans to complete basic reasoning tasks. This development signals a fundamental shift in how we evaluate AI progress, prioritizing not just raw capability but also computational efficiency.
The big picture: Current AI models, including OpenAI's sophisticated o3 systems, are failing a new benchmark designed to measure progress toward artificial general intelligence, scoring no higher than single digits out of 100.
How the benchmark works: ARC-AGI-2 tests AI models on seemingly simple tasks that require symbolic interpretation and adaptability, while also factoring in the computational efficiency and cost of running the models, as sketched below.
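For illustration only, here is a minimal sketch of what an efficiency-aware evaluation loop could look like. The `solve(task)` interface, task format, and cost figures are all hypothetical assumptions made for this example; the ARC Prize Foundation's actual harness and cost accounting differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class TaskResult:
    correct: bool    # did the solver reproduce the expected output?
    cost_usd: float  # estimated compute spend for this attempt

def evaluate(solve: Callable[[dict], Tuple[object, float]], tasks: List[dict]) -> None:
    """Run a solver over a task set, reporting accuracy alongside cost per task.

    `solve(task)` is a hypothetical interface returning (predicted_output, cost_usd);
    this is not the ARC Prize harness, just a sketch of scoring capability and
    efficiency together.
    """
    results = [
        TaskResult(correct=(pred == task["expected_output"]), cost_usd=cost)
        for task in tasks
        for pred, cost in [solve(task)]
    ]
    accuracy = sum(r.correct for r in results) / len(results)
    avg_cost = sum(r.cost_usd for r in results) / len(results)
    print(f"accuracy: {accuracy:.1%}  avg cost/task: ${avg_cost:.2f}")

    # Illustrative comparison only: the article notes models cost more per task
    # than human solvers; this baseline figure is a placeholder, not a published number.
    HUMAN_COST_PER_TASK = 5.00  # placeholder assumption
    print(f"cost relative to human baseline: {avg_cost / HUMAN_COST_PER_TASK:.1f}x")
```

The point of the sketch is the pairing: a leaderboard built this way rewards a model only if it solves tasks and does so without vastly outspending a human on compute.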
Between the lines: The new benchmark represents a philosophical shift in AI evaluation, moving beyond raw performance to consider the environmental and economic costs of increasingly powerful systems.
What experts are saying: Researchers are divided on the significance and framing of these benchmark tests in measuring progress toward AGI.
The counterpoint: Critics suggest these benchmarks mislead the public about AI capabilities by equating task-specific performance with general intelligence.
Looking ahead: As AI development continues, benchmark standards will likely keep evolving to match advancing capabilities.