AI companies are failing to provide adequate justification for their safety claims based on dangerous capability evaluations, according to a new analysis by researcher Zach Stein-Perlman. Despite OpenAI, Google DeepMind, and Anthropic publishing evaluation reports intended to demonstrate their models’ safety, these reports largely fail to explain why their results—which often show strong performance—actually indicate the models aren’t dangerous, particularly for biothreat and cyber capabilities.
The core problem: Companies consistently fail to bridge the gap between their evaluation results and safety conclusions, often reporting strong model performance while claiming safety without clear reasoning.
- OpenAI acknowledges that “several of our biology evaluations indicate our models are on the cusp of being able to meaningfully help novices create known biological threats,” yet doesn’t explain how it concludes this or what results would change its assessment.
- On biothreat evaluations, OpenAI’s o3 model performs well enough that one evaluation “is reaching saturation,” and it matches or substantially outperforms human expert baselines on others—results that seem to suggest dangerous capabilities rather than rule them out.
- DeepMind claims Gemini 2.5 Pro lacks dangerous CBRN (chemical, biological, radiological, and nuclear) capabilities because “it does not yet consistently or completely enable progress through key bottleneck stages,” but provides no comparison to human performance or criteria for what would change their conclusion.
Poor elicitation undermines results: Companies systematically underestimate their models’ true capabilities through inadequate testing methods, making their safety claims even less reliable.
- On a subset of RE-Bench (METR's research engineering benchmark for AI R&D tasks), Anthropic reported a score of 0.21 for Sonnet 3.6, while external researchers at METR (Model Evaluation and Threat Research) achieved 0.51 on the same model, more than double Anthropic's reported performance.
- Meta’s cyber evaluations showed such poor elicitation that when other researchers retested with better methods, performance jumped from 5% to 100% on some tests.
- DeepMind initially reported about 0.15 on an AI research and development evaluation, but later recalculated to allow multiple attempts, changing the score to about 0.72, a nearly five-fold increase (the sketch below shows how the scoring rule alone can drive a shift of this size).
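To make the elicitation point concrete, here is a minimal Python sketch of how the choice of aggregation rule changes a headline benchmark score. The best-of-k rule and all numbers are illustrative assumptions for demonstration only, not DeepMind's actual recalculation methodology.

```python
# Illustrative sketch: single-attempt vs. best-of-k scoring on one task.
# All values are made up; they do not come from any published eval report.

# Hypothetical normalized scores from k independent attempts by the same model.
attempts = [0.10, 0.15, 0.72, 0.05, 0.30]

# Single-attempt reporting: score the first (or only) run.
# This understates what the model can do when retries are cheap.
single_attempt_score = attempts[0]

# Best-of-k reporting: take the best result across all attempts,
# which is closer to the model's elicited capability ceiling.
best_of_k_score = max(attempts)

print(f"single attempt:        {single_attempt_score:.2f}")
print(f"best of {len(attempts)} attempts:     {best_of_k_score:.2f}")
```

The same raw runs can thus support either a "low" or a "high" reported score, which is why a report that doesn't state its aggregation rule is hard to interpret.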
Transparency gaps hide critical details: The evaluation processes remain largely opaque, preventing external verification of companies’ safety determinations.
- Companies often don’t provide clear thresholds for what constitutes dangerous capabilities or explain their decision-making criteria.
- Anthropic mentions “thresholds” for many evaluations but mostly doesn’t explain what these thresholds mean or which evaluations are actually driving safety decisions.
- In cyber evaluations, Anthropic reports that “Claude Opus 4 achieved generally higher performance than Claude Sonnet 3.7 on all three ranges” but provides no specific metrics or context for these claims.
What the analysis reveals: The fundamental issue extends beyond poor methodology to a lack of accountability and explanation in how companies interpret their results.
- Companies don’t explain what would change their minds about model safety or provide clear criteria for dangerous capabilities.
- There’s no external accountability mechanism to verify that evaluations are conducted properly or that results are interpreted correctly.
- When companies do report concerning results, they often fail to explain why these don’t indicate actual danger.
Why this matters: As AI capabilities rapidly advance, the gap between companies’ evaluation practices and their safety claims creates significant risks for public safety and regulatory oversight, potentially allowing dangerous capabilities to be deployed without adequate safeguards.
Source: Zach Stein-Perlman, "AI companies' eval reports mostly don't support their claims"