
AI Models Struggle with Basic Reasoning: Apple Study Reveals Flaws in LLMs

A recent study conducted by Apple’s artificial intelligence scientists has uncovered significant limitations in the reasoning abilities of large language models (LLMs), including those developed by industry leaders like Meta and OpenAI. The research highlights the fragility of these AI systems when faced with tasks requiring genuine understanding and critical thinking.

Key findings: LLMs lack robust reasoning skills

  • Apple researchers developed a new benchmark called GSM-Symbolic to evaluate the reasoning capabilities of various LLMs (a minimal sketch of the templating idea follows this list).
  • Initial testing showed that minor changes in query wording can lead to dramatically different answers, undermining the reliability of these models.
  • The study found that adding contextual information that shouldn’t affect the core mathematics of a problem can cause LLMs to produce incorrect results.
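
GSM-Symbolic works by turning fixed benchmark questions into templates whose names and numbers can be varied while the underlying arithmetic stays the same, so any change in a model’s answer reflects sensitivity to surface details rather than to the math. The Python sketch below illustrates that idea only; the template text, slot names, and numbers are invented for illustration and are not drawn from Apple’s benchmark.

```python
import random

# A GSM8K-style word problem turned into a template. The wording and the
# slots ({name}, {x}, {y}) are illustrative, not from the actual
# GSM-Symbolic benchmark.
TEMPLATE = (
    "{name} buys {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

NAMES = ["Sophie", "Liam", "Ava", "Noah"]

def make_variant(seed: int) -> tuple[str, int]:
    """Instantiate the template with new surface details.

    The ground-truth answer is recomputed from the sampled numbers, so every
    variant tests the same underlying arithmetic (x + y)."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    x, y = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(name=name, x=x, y=y)
    return question, x + y

if __name__ == "__main__":
    # A model with robust reasoning should answer every variant correctly;
    # the Apple study reports that accuracy instead shifts with these
    # surface-level changes.
    for seed in range(3):
        question, answer = make_variant(seed)
        print(question, "->", answer)
```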

Fragility in mathematical reasoning exposed

  • Performance of all tested models declined when numerical values in questions were altered, even slightly.
  • Adding a single sentence of seemingly relevant (but actually irrelevant) information to a math question can reduce answer accuracy by up to 65% (a sketch of how such a drop is measured follows this list).
  • Model performance deteriorates as question complexity, measured by the number of clauses, increases.
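
The reported drops come from comparing a model’s accuracy on the original questions with its accuracy on perturbed variants of the same questions. The sketch below shows one plausible way to compute such a drop; `model` is a hypothetical callable standing in for whatever LLM interface an evaluator would use, since the study’s own harness is not reproduced here.

```python
from typing import Callable

# (question text, ground-truth integer answer)
Item = tuple[str, int]

def accuracy(model: Callable[[str], int], items: list[Item]) -> float:
    """Fraction of questions the model answers exactly right.

    `model` is a hypothetical wrapper around an LLM under test; it is
    assumed to return a single integer answer per question."""
    correct = sum(1 for question, answer in items if model(question) == answer)
    return correct / len(items)

def degradation(model: Callable[[str], int],
                baseline: list[Item],
                perturbed: list[Item]) -> float:
    """Accuracy lost (in percentage points) when the same problems are
    re-asked with altered numbers or an added irrelevant clause."""
    return (accuracy(model, baseline) - accuracy(model, perturbed)) * 100
```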

Real-world implications: Challenges for AI reliability

Illustrative example: The “GSM-NoOp” task

  • Researchers developed a math problem similar to elementary school word problems to test LLM comprehension.
  • The problem included irrelevant information about the size of some kiwis picked on a particular day.
  • Both OpenAI’s model and Meta’s Llama3-8B incorrectly subtracted the mentioned smaller kiwis from the total, demonstrating a failure to distinguish relevant from irrelevant information (a worked version of the arithmetic follows this list).
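
To make the failure concrete, here is an illustrative reconstruction of a GSM-NoOp-style kiwi problem in Python. The specific numbers are chosen for illustration rather than quoted from the paper; the point is that the remark about smaller kiwis has no bearing on the count, yet the reported failure mode is to subtract it anyway.

```python
# Illustrative GSM-NoOp-style problem; the numbers are chosen for this
# example and are not quoted from Apple's paper.
friday = 44                  # kiwis picked on Friday
saturday = 58                # kiwis picked on Saturday
sunday = 2 * friday          # "double the number picked on Friday"
smaller_than_average = 5     # irrelevant detail: size does not change the count

correct_total = friday + saturday + sunday            # 44 + 58 + 88 = 190
flawed_total = correct_total - smaller_than_average   # 185: the reported failure
                                                      # mode, treating the size
                                                      # remark as a subtraction

print("correct answer:", correct_total)   # correct answer: 190
print("flawed answer: ", flawed_total)    # flawed answer:  185
```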

Previous research supports findings

  • A 2019 study showed that AI models could be consistently confused by adding background information to questions about Super Bowl quarterbacks’ ages.
  • This earlier research aligns with Apple’s findings, suggesting a persistent issue in AI reasoning capabilities.

Conclusion: Pattern matching vs. formal reasoning

  • The study found no evidence of formal reasoning in the tested language models.
  • Researchers concluded that LLM behavior is better explained by sophisticated pattern matching rather than genuine understanding.
  • The fragility of this pattern matching is so pronounced that even changing names within a problem can alter results.

Broader implications: Rethinking AI development and applications

This research from Apple underscores the need to critically reassess current AI development approaches, and it highlights the potential limitations of LLM-based systems in real-world applications that demand robust reasoning. As AI continues to integrate into various sectors, addressing these fundamental flaws in reasoning capability becomes crucial for ensuring the reliability and safety of AI-driven decision-making.

