AI Models Struggle with Basic Reasoning: Apple Study Reveals Flaws in LLMs
A recent study conducted by Apple’s artificial intelligence scientists has uncovered significant limitations in the reasoning abilities of large language models (LLMs), including those developed by industry leaders like Meta and OpenAI. The research highlights the fragility of these AI systems when faced with tasks requiring genuine understanding and critical thinking.
Key findings: LLMs lack robust reasoning skills
- Apple researchers developed a new benchmark called GSM-Symbolic to evaluate the reasoning capabilities of various LLMs; it generates many variants of each math word problem by swapping names and numerical values (a minimal sketch of that idea follows this list).
- Initial testing showed that minor changes in query wording can lead to dramatically different answers, undermining the reliability of these models.
- The study found that adding contextual information that shouldn’t affect the core mathematics of a problem can cause LLMs to produce incorrect results.
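To make the benchmark idea concrete, here is a minimal sketch, not Apple’s released code, of how GSM-Symbolic-style variants of a grade-school word problem can be generated from a template: the names and numbers change, but the underlying arithmetic stays the same. The template, names, and value ranges below are illustrative assumptions.

```python
# Minimal sketch (not Apple's benchmark code): generate GSM-Symbolic-style
# variants of one word problem by swapping names and numbers while keeping
# the underlying arithmetic identical.
import random

TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return (question, correct_answer) for one randomized instance."""
    name = rng.choice(["Sofia", "Liam", "Priya", "Mateo"])  # illustrative names
    a, b = rng.randint(10, 60), rng.randint(10, 60)         # illustrative value range
    question = TEMPLATE.format(name=name, a=a, b=b)
    return question, a + b

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

A model that genuinely reasons should score the same across such variants; the study found that scores shift when only these surface details change.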
Fragility in mathematical reasoning exposed
- Performance of all tested models declined when numerical values in questions were altered, even slightly.
- Adding a single sentence of seemingly relevant (but actually irrelevant) information to a math question can reduce answer accuracy by up to 65% (a minimal evaluation sketch appears after this list).
- Model performance deteriorates as question complexity, measured by the number of clauses, increases.
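The reported accuracy drop can be illustrated with a small evaluation harness. This is a hedged sketch under assumptions: ask_model is a hypothetical callable (prompt in, answer text out) standing in for whichever model is being tested, and the answer parsing is deliberately crude.

```python
# Sketch of measuring the accuracy drop when an irrelevant ("no-op") sentence
# is appended to each question. `ask_model` is a hypothetical callable
# (prompt -> reply string); substitute any model client.
from typing import Callable, Iterable

def accuracy(ask_model: Callable[[str], str],
             problems: Iterable[tuple[str, int]],
             distractor: str = "") -> float:
    """Fraction of problems answered correctly, with an optional distractor appended."""
    problems = list(problems)
    correct = 0
    for question, expected in problems:
        prompt = f"{question} {distractor}".strip()
        reply = ask_model(prompt)
        digits = "".join(ch for ch in reply if ch.isdigit())  # crude numeric parse for this sketch
        if digits and int(digits) == expected:
            correct += 1
    return correct / len(problems)

# Usage sketch (hypothetical names):
# baseline  = accuracy(ask_model, problems)
# perturbed = accuracy(ask_model, problems,
#                      distractor="Five of the apples were a bit smaller than average.")
# print(f"accuracy drop: {baseline - perturbed:.1%}")
```

Comparing the two scores isolates the effect of the distractor sentence, since the arithmetic in every question is unchanged.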
Real-world implications: Challenges for AI reliability
- The study concludes that building reliable AI agents on the current foundation of LLMs is not feasible due to their susceptibility to minor, irrelevant changes in input.
- This fragility raises concerns about the practical applications of LLMs in scenarios requiring consistent and accurate reasoning.
Illustrative example: The “GSM-NoOp” task
- Researchers wrote a math problem in the style of an elementary-school word problem to test LLM comprehension.
- The problem included irrelevant information about the size of some kiwis picked on a particular day.
- Both OpenAI’s model and Meta’s Llama3-8B incorrectly subtracted the smaller kiwis from the total, demonstrating a failure to distinguish relevant from irrelevant information (the arithmetic is worked through in the sketch below).
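For concreteness, the arithmetic of a GSM-NoOp-style kiwi problem can be worked through as follows; the specific values here are illustrative rather than quoted from the paper, but the error pattern matches the one described above.

```python
# Illustrative GSM-NoOp-style arithmetic: kiwis picked over three days, plus an
# irrelevant sentence noting that a few kiwis were smaller than average.
friday, saturday = 44, 58
sunday = 2 * friday           # e.g., "double the number picked on Friday"
smaller_than_average = 5      # irrelevant detail: size does not change the count

correct_total = friday + saturday + sunday             # 190
flawed_total = correct_total - smaller_than_average    # 185: the mistake described above

print("correct:", correct_total, "| flawed:", flawed_total)
```

The smaller-than-average detail changes nothing about how many kiwis were picked, yet the models reportedly treated it as a quantity to subtract.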
Previous research supports findings
- A 2019 study showed that AI models could be consistently confused by adding background information to questions about Super Bowl quarterbacks’ ages.
- This earlier research aligns with Apple’s findings, suggesting a persistent issue in AI reasoning capabilities.
Conclusion: Pattern matching vs. formal reasoning
- The study found no evidence of formal reasoning in the tested language models.
- Researchers concluded that LLM behavior is better explained by sophisticated pattern matching rather than genuine understanding.
- The fragility of this pattern matching is so pronounced that even changing names within a problem can alter results.
Broader implications: Rethinking AI development and applications
This research from Apple underscores the need for a critical reassessment of current AI development approaches and the potential limitations of LLM-based systems in real-world applications requiring robust reasoning skills. As AI continues to integrate into various sectors, addressing these fundamental flaws in reasoning capabilities becomes crucial for ensuring reliability and safety in AI-driven decision-making processes.