AI Models Struggle with Basic Reasoning: Apple Study Reveals Flaws in LLMs
A recent study conducted by Apple’s artificial intelligence scientists has uncovered significant limitations in the reasoning abilities of large language models (LLMs), including those developed by industry leaders like Meta and OpenAI. The research highlights the fragility of these AI systems when faced with tasks requiring genuine understanding and critical thinking.
Key findings: LLMs lack robust reasoning skills
- Apple researchers developed a new benchmark called GSM-Symbolic to evaluate the reasoning capabilities of various LLMs.
- Initial testing showed that minor changes in query wording can lead to dramatically different answers, undermining the reliability of these models.
- The study found that adding contextual information that shouldn’t affect the core mathematics of a problem can cause LLMs to produce incorrect results.
Fragility in mathematical reasoning exposed
- Performance of all tested models declined when the numerical values in questions were altered, even slightly (a minimal sketch of this kind of symbolic perturbation follows this list).
- Adding a single sentence of seemingly relevant (but actually irrelevant) information to a math question can reduce answer accuracy by up to 65%.
- Question complexity, measured by the number of clauses, correlates directly with deteriorating model performance.
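To make the fragility concrete, here is a minimal Python sketch of how a GSM-Symbolic-style template can be instantiated with different names and numbers while the ground-truth answer is recomputed. The problem text, names, and values below are invented for illustration and are not taken from the benchmark itself; in the study's evaluation, each such variant is posed to a model and accuracy is compared across variants.

```python
import random

# Minimal sketch of a GSM-Symbolic-style template (illustrative only; not the
# benchmark's actual code or problems). Names and numbers are placeholders that
# get re-sampled, while the ground-truth answer is recomputed symbolically.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "On Wednesday, {name} picks {k} times as many apples as on Monday. "
    "How many apples does {name} have?"
)

def ground_truth(x: int, y: int, k: int) -> int:
    """Correct answer implied by the template's arithmetic."""
    return x + y + k * x

def sample_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the template with fresh names and numbers."""
    name = rng.choice(["Oliver", "Mia", "Ravi", "Lena"])
    x, y, k = rng.randint(20, 60), rng.randint(20, 60), rng.randint(2, 4)
    return TEMPLATE.format(name=name, x=x, y=y, k=k), ground_truth(x, y, k)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        question, answer = sample_variant(rng)
        # In a GSM-Symbolic-style evaluation, each variant would be sent to an
        # LLM and its reply compared against `answer`; the study reports that
        # accuracy drops even though the underlying arithmetic is unchanged.
        print(question, "->", answer)
```

The point of the study's design is that every variant requires identical reasoning; only the surface form changes, so any drop in accuracy reflects sensitivity to wording rather than harder math.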
Real-world implications: Challenges for AI reliability
Illustrative example: The “GSM-NoOp” task
- Researchers constructed a math word problem in the style of an elementary-school exercise to test LLM comprehension.
- The problem included an irrelevant detail about the size of some of the kiwis picked on a particular day.
- Both OpenAI’s model and Meta’s Llama3-8b incorrectly subtracted the smaller kiwis from the total, demonstrating a failure to distinguish relevant from irrelevant information (a worked illustration follows this list).
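To see the arithmetic at stake, here is a worked illustration in the spirit of the kiwi problem; the specific numbers are chosen for this sketch rather than quoted from the paper. The correct answer simply sums the kiwis picked across the days, while the reported failure mode subtracts the "smaller" kiwis even though their size has no bearing on the count.

```python
# Worked illustration of the GSM-NoOp failure mode (numbers invented for this
# sketch, not quoted from the paper).
friday = 44                       # kiwis picked on Friday
saturday = 58                     # kiwis picked on Saturday
sunday = 2 * friday               # "double the number picked on Friday"
smaller_kiwis = 5                 # irrelevant detail: some kiwis were a bit smaller

correct_total = friday + saturday + sunday        # size is irrelevant to the count
mistaken_total = correct_total - smaller_kiwis    # the reported failure: subtracting them

print(f"correct: {correct_total}, mistaken: {mistaken_total}")
```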
Previous research supports findings
- A 2019 study showed that AI models could be consistently confused by adding background information to questions about Super Bowl quarterbacks’ ages.
- This earlier research aligns with Apple’s findings, suggesting a persistent issue in AI reasoning capabilities.
Conclusion: Pattern matching vs. formal reasoning
- The study found no evidence of formal reasoning in the tested language models.
- Researchers concluded that LLM behavior is better explained by sophisticated pattern matching than by genuine understanding.
- The fragility of this pattern matching is so pronounced that even changing names within a problem can alter results.
Broader implications: Rethinking AI development and applications
This research from Apple underscores the potential limitations of LLM-based systems in real-world applications that demand robust reasoning, and it calls for a critical reassessment of current approaches to AI development. As AI continues to be integrated into various sectors, addressing these fundamental flaws in reasoning becomes crucial for ensuring reliability and safety in AI-driven decision-making.