A new benchmark called RE-Bench provides a direct, like-for-like comparison of how artificial intelligence agents stack up against human experts when tackling complex machine learning engineering tasks.
Core methodology and design: RE-Bench evaluates both human experts and AI agents built on frontier language models (Anthropic's Claude 3.5 Sonnet and OpenAI's o1-preview) across seven machine learning engineering environments.
- The benchmark focuses on realistic tasks such as fitting scaling laws and optimizing GPU kernels
- Testing occurs across time budgets ranging from 2 to 32 hours
- The evaluation framework scores humans and AI agents in the same environments under the same time budgets, allowing direct performance comparisons (see the normalization sketch after this list)
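Comparing results across seven very different environments requires putting each environment's raw metric on a common scale. Below is a minimal sketch of one such normalization, assuming the convention that the provided starting solution maps to 0 and the reference solution maps to 1; the function and variable names are illustrative, not taken from the benchmark's code.

```python
def normalized_score(raw: float, starting: float, reference: float) -> float:
    """Map a raw environment metric onto a common scale.

    Assumed convention: the provided starting solution scores 0.0 and the
    reference solution scores 1.0; values above 1.0 would mean an attempt
    beat the reference solution. Names are illustrative.
    """
    return (raw - starting) / (reference - starting)

# Hypothetical example: a kernel-optimization environment where lower runtime
# is better, so raw metrics are negated to make "higher is better".
start_runtime_ms, reference_runtime_ms, attempt_runtime_ms = 120.0, 40.0, 55.0
score = normalized_score(-attempt_runtime_ms, -start_runtime_ms, -reference_runtime_ms)
print(f"normalized score: {score:.2f}")  # ~0.81 on this made-up example
```

A common scale of this kind is what makes statements such as "humans score roughly twice as high at 32 hours" meaningful across tasks as different as scaling-law fitting and GPU kernel optimization.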
Key performance findings: AI agents demonstrated mixed results when compared to human experts, with performance varying significantly based on time constraints.
- In short 2-hour sessions, AI agents outperformed human experts
- However, humans achieved nearly double the score of the best AI agents when given a 32-hour time budget (see the best-of-k sketch after this list)
- AI solutions were notably more cost-effective, with operational costs several times lower than human expert rates
- The speed advantage was clear: AI agents could generate and test implementations more than 10 times faster than human experts
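One natural way for a fast, cheap agent to spend a larger time budget is to run several independent shorter attempts and keep the best result (best-of-k). The sketch below illustrates that pattern; the per-attempt score distribution and the sixteen-attempts-of-two-hours split are made up for illustration, not figures from the benchmark.

```python
import random

def best_of_k(attempt_scores: list[float]) -> float:
    """Score for a budget spent on k independent attempts: keep the best one."""
    return max(attempt_scores)

random.seed(0)

# Hypothetical per-attempt score distribution for short agent runs:
# most attempts make little progress, a few land strong solutions.
def sample_agent_attempt() -> float:
    return random.choices([0.05, 0.3, 0.9], weights=[0.6, 0.3, 0.1])[0]

# A 32-hour budget spent as sixteen 2-hour attempts (illustrative split).
k = 16
scores = [sample_agent_attempt() for _ in range(k)]
print(f"median attempt: {sorted(scores)[k // 2]:.2f}")
print(f"best of {k}:    {best_of_k(scores):.2f}")
```

The gap between the median attempt and the best of k mirrors the findings below: typical agent runs make little progress, while occasional runs land very strong solutions.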
Technical limitations: Despite some impressive capabilities, AI agents showed several consistent weaknesses in their approach to complex problems.
- Most AI attempts made minimal progress on complex tasks
- Agents struggled to effectively process and incorporate new information
- Building upon previous progress proved challenging for AI systems
- Median AI performance fell well below both human expert performance and the agents' own best attempts
Areas of promise: Several aspects of AI agent performance suggest potential for future improvements.
- AI systems demonstrated substantial machine learning expertise
- When given multiple attempts, agents occasionally discovered remarkably strong solutions
- The cost-effectiveness and speed advantages of AI agents point to potential hybrid approaches
- The open-source nature of the environments and agent transcripts enables further research and improvement
Looking ahead: While RE-Bench reveals current limitations in AI capabilities for complex engineering tasks, the results suggest that improved elicitation methods and hybrid human-AI approaches could lead to significant advances in AI-assisted research and development.
Source: RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts