Apple researchers have released a study exposing fundamental flaws in advanced AI reasoning models, showing they completely collapse when faced with complex problems. The findings directly challenge claims about artificial general intelligence (AGI) and reveal that so-called “large reasoning models” from companies like OpenAI, Anthropic, and DeepSeek are sophisticated pattern-matching systems rather than true reasoning engines.
What you should know: Apple’s controlled experiments revealed that frontier AI models fail catastrophically on high-complexity tasks, achieving zero accuracy even with explicit algorithmic instructions.
- The study tested advanced “large reasoning models” (LRMs) including OpenAI’s o3-mini, Anthropic’s Claude-3.7-Sonnet, and DeepSeek’s R1/V3 systems against increasingly difficult mathematical and puzzle problems.
- While these models perform well on moderate-complexity tasks, they perform worse than standard language models on simple problems and completely break down on complex ones.
- “Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds,” Apple researchers wrote.
The big picture: These results suggest the AI industry’s claims about approaching human-level artificial general intelligence are premature and potentially misleading.
- The study challenges the prevailing narrative that AI models are developing genuine reasoning capabilities, instead revealing them as pattern-matching systems that mimic logical thinking.
- Apple’s research comes amid growing skepticism about AI capabilities, including delays to OpenAI’s GPT-5 system and reports that companies such as Klarna, a Swedish fintech firm, and Duolingo, a language-learning platform, are replacing AI systems with human workers.
How the breakdown works: The models exhibit a counterintuitive behavior where they actually reduce their reasoning effort as problems become more difficult.
- Apple found that “as problems approach critical difficulty, models paradoxically reduce their reasoning effort despite ample compute budgets,” using fewer “thinking tokens” when they should be working harder.
- The researchers describe this as “particularly concerning” because a system that genuinely reasons should produce longer, more detailed thought processes for harder problems, not shorter ones.
- Models also fail to use explicit algorithms consistently and show “complete failure” on complex multi-step reasoning tasks.
In plain English: Think of “thinking tokens” as the AI’s internal monologue—the mental steps it takes to work through a problem. When humans face harder math problems, we typically think through more steps and show more work. But these AI models do the opposite: they give up and use fewer mental steps precisely when they should be working harder.
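To make that drop-off concrete, here is a minimal sketch of how one could tally reasoning-trace length per difficulty level. It is not Apple’s measurement code: the whitespace split stands in for real tokenization, and the function name and data layout are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def reasoning_effort_by_complexity(runs: List[Tuple[int, str]]) -> Dict[int, float]:
    """Average reasoning-trace length at each complexity level.

    `runs` pairs a complexity level (e.g. number of Tower of Hanoi disks)
    with the raw chain-of-thought text; splitting on whitespace is a crude
    stand-in for counting actual "thinking tokens".
    """
    buckets: Dict[int, List[int]] = defaultdict(list)
    for complexity, trace in runs:
        buckets[complexity].append(len(trace.split()))
    return {c: mean(lengths) for c, lengths in sorted(buckets.items())}


# A system that truly reasons should show effort rising with complexity;
# the collapse Apple describes appears as the averages falling at the top end.
```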
Testing methodology: Apple designed four specialized puzzle environments to measure reasoning capabilities with fine-grained control over complexity levels.
- The puzzles included Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World problems, each systematically scaled in difficulty (a minimal sketch of one such environment follows this list).
- Unlike standard coding benchmarks that focus on final answer accuracy, Apple’s approach examined the internal “chain-of-thought” reasoning processes that models claim to use.
- This methodology revealed that models “overthink” simple problems, finding correct solutions early and then wasting computational resources exploring incorrect alternatives.
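The sketch below shows how a Tower of Hanoi environment of this kind can work: difficulty scales with the number of disks (an n-disk puzzle needs 2^n − 1 moves at minimum), and a model’s proposed move list can be checked mechanically. This is not Apple’s actual test harness; the function names and move format are assumptions for illustration.

```python
from typing import List, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg), pegs numbered 0-2


def optimal_solution(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Move]:
    """Classic recursive Tower of Hanoi solution: 2**n - 1 moves for n disks."""
    if n == 0:
        return []
    return (
        optimal_solution(n - 1, src, dst, aux)   # move n-1 disks out of the way
        + [(src, dst)]                           # move the largest disk
        + optimal_solution(n - 1, aux, src, dst) # stack the n-1 disks back on top
    )


def is_valid_solution(n: int, moves: List[Move]) -> bool:
    """Replay a proposed move list and check it legally transfers all disks."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                    # moving from an empty peg
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                    # a larger disk placed on a smaller one
        pegs[dst].append(disk)
    return pegs[2] == list(range(n, 0, -1)) # every disk ends up on the target peg


if __name__ == "__main__":
    # Difficulty grows sharply with disk count: 3 disks need 7 moves, 10 need 1,023.
    for n in (3, 5, 10):
        moves = optimal_solution(n)
        print(n, len(moves), is_valid_solution(n, moves))
```

Checking the full move sequence, rather than just the final board, is what lets this kind of environment grade the model’s reasoning process instead of only its final answer.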
Strategic implications: Apple’s late entry into the AI research spotlight may reflect both scientific rigor and competitive positioning against rivals like OpenAI and Google DeepMind.
- The study arrives as Apple has been notably quieter in the generative AI race compared to competitors, potentially using scientific caution to establish credibility while subtly critiquing industry hype.
- Critics of AI overpromising have embraced the research as evidence that the “AI hype machine has over-reached itself” amid broader questions about the sustainability of current AI investment levels.
What this means for AGI: The findings suggest “fundamental barriers to generalizable reasoning” that undermine the concept of artificial general intelligence as currently pursued.
- Apple’s research indicates that transformer-based AI architectures may have intrinsic scaling limits that prevent them from achieving human-level reasoning capabilities.
- The study joins growing evidence that current AI systems, while powerful for specific tasks, lack the flexible, generalizable intelligence that AGI proponents claim is imminent.