Apple research reveals key reasoning flaws in AI language models

AI Models Struggle with Basic Reasoning: Apple Study Reveals Flaws in LLMs

A recent study conducted by Apple’s artificial intelligence scientists has uncovered significant limitations in the reasoning abilities of large language models (LLMs), including those developed by industry leaders like Meta and OpenAI. The research highlights the fragility of these AI systems when faced with tasks requiring genuine understanding and critical thinking.

Key findings: LLMs lack robust reasoning skills

  • Apple researchers developed a new benchmark, GSM-Symbolic, to evaluate the reasoning capabilities of various LLMs; it generates variants of grade-school math questions in which surface details such as names and numbers change while the underlying math stays the same (a minimal sketch of this idea follows the list).
  • Initial testing showed that minor changes in question wording can lead to dramatically different answers, undermining the reliability of these models.
  • The study also found that adding contextual information that should not affect the core mathematics of a problem can cause LLMs to produce incorrect results.
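
To make the perturbation idea concrete, here is a minimal, hypothetical sketch of templating a grade-school word problem so that names and numbers vary while the answer logic does not. The template, names, and number ranges below are illustrative assumptions and are not taken from the Apple benchmark.

```python
# Minimal, illustrative sketch of the GSM-Symbolic idea (not Apple's code):
# turn a fixed word problem into a template whose surface details (names,
# numbers) vary while the underlying arithmetic stays the same.
import random

TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "How many apples does {name} have now?"
)

def make_variant(seed: int):
    """Return one surface-level variant of the question and its true answer."""
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Liam", "Ava", "Noah"])  # swap the name
    a, b = rng.randint(5, 50), rng.randint(5, 50)         # swap the numbers
    return TEMPLATE.format(name=name, a=a, b=b), a + b    # answer is unaffected

# A model that truly reasons should answer every variant correctly;
# the study reports that accuracy instead shifts with these surface changes.
for seed in range(3):
    question, answer = make_variant(seed)
    print(question, "->", answer)
```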

Fragility in mathematical reasoning exposed

  • The performance of every tested model declined when the numerical values in a question were altered, even slightly.
  • Adding a single sentence of seemingly relevant but actually irrelevant information to a math question reduced answer accuracy by as much as 65%.
  • Performance also deteriorated as questions grew more complex, as measured by the number of clauses they contain.

Real-world implications: Challenges for AI reliability

Illustrative example: The “GSM-NoOp” task

  • Researchers constructed a math problem resembling an elementary-school word problem to test LLM comprehension.
  • The problem included irrelevant information about the size of some of the kiwis picked on a particular day.
  • Both OpenAI's model and Meta's Llama3-8B incorrectly subtracted the smaller kiwis from the total, demonstrating a failure to distinguish relevant from irrelevant information (the arithmetic is sketched below).
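
For concreteness, here is the arithmetic behind a GSM-NoOp-style question in the spirit of the kiwi example; the specific numbers are illustrative rather than quoted from the study. The remark about some kiwis being smaller changes nothing about the count, yet the models treated it as a quantity to subtract.

```python
# Illustrative arithmetic for a GSM-NoOp-style kiwi problem (numbers are
# hypothetical, chosen only to mirror the structure described above).
friday = 44                   # kiwis picked on Friday
saturday = 58                 # kiwis picked on Saturday
sunday = 2 * friday           # "double the number picked on Friday"
smaller_than_average = 5      # irrelevant detail: size does not change the count

correct_total = friday + saturday + sunday                # 190: ignore the size remark
distracted_total = correct_total - smaller_than_average   # 185: the error the models made

print(correct_total, distracted_total)
```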

Previous research supports findings

  • A 2019 study showed that AI models could be consistently confused by adding background information to questions about Super Bowl quarterbacks’ ages.
  • This earlier research aligns with Apple’s findings, suggesting a persistent issue in AI reasoning capabilities.

Conclusion: Pattern matching vs. formal reasoning

  • The study found no evidence of formal reasoning in the tested language models.
  • Researchers concluded that LLM behavior is better explained by sophisticated pattern matching rather than genuine understanding.
  • The fragility of this pattern matching is so pronounced that even changing names within a problem can alter results.

Broader implications: Rethinking AI development and applications

This research from Apple underscores the need for a critical reassessment of current AI development approaches and of the limitations of LLM-based systems in real-world applications that require robust reasoning. As AI is integrated into more sectors, addressing these fundamental reasoning flaws becomes crucial to the reliability and safety of AI-driven decision-making.

