Model Evaluation and Threat Research (METR) has completed preliminary evaluations of two advanced AI models, Anthropic’s Claude 3.5 Sonnet (October 2024 release) and a pre-deployment checkpoint of OpenAI’s o1, finding no immediate evidence of dangerous capabilities in either system.
Key findings from the autonomous risk evaluation: The evaluation consisted of 77 tasks designed to assess the models’ capabilities in areas such as cyberattacks, AI R&D, and autonomous replication.
- Claude 3.5 Sonnet performed at a level comparable to what human testers could achieve in about 1 hour
- The baseline o1 agent initially performed worse but, after optimization, improved to match what humans could achieve in about 2 hours
- A specialized “elicited” agent scaffold was developed for o1, implementing an advisor-actor-rater loop that improved its performance (a minimal sketch of such a loop follows this list)
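The advisor-actor-rater structure is described only at a high level, so the following Python sketch shows one way such a loop could be wired together. The `complete` function, the prompts, and all names are illustrative assumptions, not METR’s actual scaffold.

```python
# Illustrative advisor-actor-rater loop (hypothetical names and prompts;
# METR's actual "elicited" o1 scaffold is not reproduced here).

def complete(prompt: str) -> str:
    """Placeholder for a call to the underlying model API."""
    raise NotImplementedError

def advisor_actor_rater_step(task: str, history: list[str], n_candidates: int = 4) -> str:
    # Advisor: propose a high-level plan for the next step.
    advice = complete(f"Task: {task}\nHistory so far: {history}\nSuggest the next step.")

    # Actor: draft several candidate actions that follow the advice.
    candidates = [
        complete(f"Task: {task}\nAdvice: {advice}\nWrite one concrete action to take.")
        for _ in range(n_candidates)
    ]

    # Rater: score each candidate and return the highest-rated action.
    def rate(action: str) -> float:
        reply = complete(f"Rate this action from 0 to 10 for the task.\nTask: {task}\nAction: {action}")
        return float(reply.strip())

    return max(candidates, key=rate)
```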
Testing limitations and caveats: Several factors prevented the team from establishing definitive capability boundaries.
- The evaluation used default settings for o1’s reasoning parameters
- Lack of access to o1’s internal thought processes limited capability assessment
- Time constraints restricted the optimization of agent scaffolds
- Incomplete information about the systems’ training methodologies affected the testing approach
RE-Bench performance metrics: The evaluation included testing on RE-Bench, a suite of 7 complex AI R&D tasks.
- Claude 3.5 Sonnet performed at the 37th percentile of human expert level
- o1 achieved performance at the 30th percentile of human expert level (how percentile placement against human baselines works is sketched after this list)
- Different agent scaffolds showed varying levels of effectiveness for each model
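A percentile comparison like the one above places a model’s score within a distribution of human expert scores on the same tasks. Below is a minimal sketch of that calculation, assuming access to a single model score and a list of human baseline scores; all numbers are made up for illustration and are not METR’s data.

```python
# Illustrative: place a model's task score on a human-expert percentile scale.
# The human_scores list is hypothetical; METR's actual baselines are not reproduced here.

def percentile_of(model_score: float, human_scores: list[float]) -> float:
    """Percentage of human expert scores that the model score meets or exceeds."""
    if not human_scores:
        raise ValueError("need at least one human baseline score")
    below_or_equal = sum(s <= model_score for s in human_scores)
    return 100.0 * below_or_equal / len(human_scores)

# Example with made-up numbers: a model score of 0.42 against ten human expert scores.
human_scores = [0.10, 0.22, 0.30, 0.38, 0.45, 0.50, 0.61, 0.70, 0.82, 0.95]
print(percentile_of(0.42, human_scores))  # -> 40.0
```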
Methodological challenges: The evaluation process revealed significant hurdles in capability-based safety assessment.
- Small changes in agent scaffolding led to dramatic performance improvements
- Performance scaled significantly with increased inference budgets (a simple budget-scaling measurement is sketched after this list)
- The need for extensive testing resources may make thorough evaluations prohibitively expensive
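One simple way to see how results depend on inference budget is to give the agent a fixed number of independent attempts per task and count a task as solved if any attempt succeeds. The sketch below assumes a best-of-k elicitation strategy and a placeholder per-attempt success rate; it illustrates how budget scaling can be measured rather than METR’s actual methodology.

```python
# Illustrative sketch of measuring how success rate scales with inference budget,
# modelled here as "best of k" independent attempts per task (a common, simple
# elicitation strategy; not necessarily the one METR used).

import random

def attempt_task(task_id: int) -> bool:
    """Placeholder: one agent attempt at a task. Replace with a real agent run."""
    return random.random() < 0.2  # hypothetical 20% per-attempt success rate

def success_rate(task_ids: list[int], attempts_per_task: int) -> float:
    solved = 0
    for task_id in task_ids:
        if any(attempt_task(task_id) for _ in range(attempts_per_task)):
            solved += 1
    return solved / len(task_ids)

tasks = list(range(77))  # the autonomous risk evaluation used 77 tasks
for budget in (1, 2, 4, 8):
    print(budget, success_rate(tasks, attempts_per_task=budget))
```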
Future implications and research direction: While capability testing faces growing challenges, METR emphasizes its continued importance.
- The organization plans to continue evaluating new AI systems, including the recently released DeepSeek-R1
- Research into novel safety assessment methodologies remains a priority
- The findings suggest a need for alternative safety arguments beyond pure capability testing
Complex evaluation landscape: The results highlight increasing difficulty in making definitive capability-based safety assessments for advanced AI systems, pointing to a need for more sophisticated evaluation methods and potentially alternative approaches to AI safety verification.
Source: METR, “An update on our preliminary evaluations of Claude 3.5 Sonnet and o1”