AI Models Struggle with Basic Reasoning: Apple Study Reveals Flaws in LLMs
A recent study conducted by Apple’s artificial intelligence scientists has uncovered significant limitations in the reasoning abilities of large language models (LLMs), including those developed by industry leaders like Meta and OpenAI. The research highlights the fragility of these AI systems when faced with tasks requiring genuine understanding and critical thinking.
Key findings: LLMs lack robust reasoning skills
- Apple researchers developed a new benchmark called GSM-Symbolic to evaluate the reasoning capabilities of various LLMs.
- Initial testing showed that minor changes in query wording can lead to dramatically different answers, undermining the reliability of these models.
- The study found that adding contextual information that shouldn’t affect the core mathematics of a problem can cause LLMs to produce incorrect results.
Fragility in mathematical reasoning exposed
- Performance of all tested models declined when the numerical values in questions were altered, even slightly (a minimal sketch of this kind of symbolic perturbation follows this list).
- Adding a single sentence of seemingly relevant (but actually irrelevant) information to a math question can reduce answer accuracy by up to 65%.
- Question complexity, measured by the number of clauses, correlates directly with deteriorating model performance.
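To make the fragility concrete, here is a minimal Python sketch of how a GSM-Symbolic-style template can be instantiated with different names and numbers while the ground-truth answer is recomputed. The problem text, names, and values below are invented for illustration and are not taken from the benchmark itself; in the study's evaluation, each such variant is posed to a model and accuracy is compared across variants.

```python
import random

# Minimal sketch of a GSM-Symbolic-style template (illustrative only; not the
# benchmark's actual code or problems). Names and numbers are placeholders that
# get re-sampled, while the ground-truth answer is recomputed symbolically.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "On Wednesday, {name} picks {k} times as many apples as on Monday. "
    "How many apples does {name} have?"
)

def ground_truth(x: int, y: int, k: int) -> int:
    """Correct answer implied by the template's arithmetic."""
    return x + y + k * x

def sample_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the template with fresh names and numbers."""
    name = rng.choice(["Oliver", "Mia", "Ravi", "Lena"])
    x, y, k = rng.randint(20, 60), rng.randint(20, 60), rng.randint(2, 4)
    return TEMPLATE.format(name=name, x=x, y=y, k=k), ground_truth(x, y, k)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        question, answer = sample_variant(rng)
        # In a GSM-Symbolic-style evaluation, each variant would be sent to an
        # LLM and its reply compared against `answer`; the study reports that
        # accuracy drops even though the underlying arithmetic is unchanged.
        print(question, "->", answer)
```

The point of the study's design is that every variant requires identical reasoning; only the surface form changes, so any drop in accuracy reflects sensitivity to wording rather than harder math.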
Real-world implications: Challenges for AI reliability
Illustrative example: The “GSM-NoOp” task
- Researchers constructed a math word problem in the style of an elementary-school exercise to test LLM comprehension.
- The problem included an irrelevant detail about the size of some of the kiwis picked on a particular day.
- Both OpenAI’s model and Meta’s Llama3-8b incorrectly subtracted the smaller kiwis from the total, demonstrating a failure to distinguish relevant from irrelevant information (a worked illustration follows this list).
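To see the arithmetic at stake, here is a worked illustration in the spirit of the kiwi problem; the specific numbers are chosen for this sketch rather than quoted from the paper. The correct answer simply sums the kiwis picked across the days, while the reported failure mode subtracts the "smaller" kiwis even though their size has no bearing on the count.

```python
# Worked illustration of the GSM-NoOp failure mode (numbers invented for this
# sketch, not quoted from the paper).
friday = 44                       # kiwis picked on Friday
saturday = 58                     # kiwis picked on Saturday
sunday = 2 * friday               # "double the number picked on Friday"
smaller_kiwis = 5                 # irrelevant detail: some kiwis were a bit smaller

correct_total = friday + saturday + sunday        # size is irrelevant to the count
mistaken_total = correct_total - smaller_kiwis    # the reported failure: subtracting them

print(f"correct: {correct_total}, mistaken: {mistaken_total}")
```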
Previous research supports findings
- A 2019 study showed that AI models could be consistently confused by adding background information to questions about Super Bowl quarterbacks’ ages.
- This earlier research aligns with Apple’s findings, suggesting a persistent issue in AI reasoning capabilities.
Conclusion: Pattern matching vs. formal reasoning
- The study found no evidence of formal reasoning in the tested language models.
- Researchers concluded that LLM behavior is better explained by sophisticated pattern matching than by genuine understanding.
- The fragility of this pattern matching is so pronounced that even changing names within a problem can alter results.
Broader implications: Rethinking AI development and applications
This research from Apple underscores the potential limitations of LLM-based systems in real-world applications that demand robust reasoning, and it calls for a critical reassessment of current approaches to AI development. As AI continues to be integrated into various sectors, addressing these fundamental flaws in reasoning becomes crucial for ensuring reliability and safety in AI-driven decision-making.