OpenAI’s newest AI models, o3 and o4-mini, are exhibiting an unexpected and concerning trend: higher hallucination rates than their predecessors. The regression in factual reliability comes at an awkward moment, since these models are built for more complex reasoning tasks; it risks undermining trust among enterprise clients and raises questions about how AI progress is being measured. The company has acknowledged the issue in its technical report but admits it doesn’t fully understand the underlying causes.
The hallucination problem: OpenAI’s technical report reveals that the o3 model hallucinated in response to 33% of questions on the company’s PersonQA evaluation, approximately double the rate of previous reasoning models.
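For context, a hallucination rate of this kind is simply the share of evaluated questions on which a grader flags the model’s answer as containing a fabricated claim. The sketch below is a hypothetical illustration of that tally, not OpenAI’s actual evaluation harness; the `EvalRecord` structure and the grading step are assumptions.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    question: str
    answer: str
    contains_hallucination: bool  # judged by a grader (human or model-based); assumed here

def hallucination_rate(records: list[EvalRecord]) -> float:
    """Fraction of evaluated questions whose answer was flagged as hallucinated."""
    if not records:
        return 0.0
    flagged = sum(1 for r in records if r.contains_hallucination)
    return flagged / len(records)

# Illustrative only: a rate of 0.33 means the model hallucinated on
# roughly one in three evaluated questions.
sample = [
    EvalRecord("Who founded Acme Corp?", "Jane Doe, in 1987.", True),
    EvalRecord("Where is Acme Corp headquartered?", "Springfield.", False),
    EvalRecord("What does Acme Corp sell?", "Industrial anvils.", False),
]
print(f"hallucination rate: {hallucination_rate(sample):.2f}")
```

Under this framing, “double the rate” means twice as many flagged answers over the same question set, which is why even a modest-looking percentage shift matters for reliability.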
Why this matters: The increased hallucination rate runs counter to the expected evolutionary path of AI models, where newer iterations typically demonstrate improved factual reliability alongside other capabilities.
Behind the numbers: OpenAI admits in its technical report that “more research is needed to understand the cause of this result,” suggesting the company is struggling to identify why its newer models are regressing in this specific dimension.
Reading between the lines: The hallucination increase reveals the complex trade-offs inherent in AI development, where optimizing for certain capabilities might inadvertently compromise others.
Where we go from here: With both models now released to the public, OpenAI may be hoping that widespread usage and feedback will help identify patterns and potentially resolve the hallucination issue through further training.