Fancy AI models are getting stumped by Sudoku while hallucinating explanations

University of Colorado Boulder researchers tested five AI models on 2,300 simple Sudoku puzzles and found significant gaps in both problem-solving ability and trustworthiness. The study found that even an advanced model like OpenAI’s o1 solved only 65% of six-by-six puzzles correctly, and that the models’ explanations frequently contained fabricated facts or bizarre responses, including one AI that gave an unprompted weather forecast when asked about Sudoku.

What you should know: The research focused less on puzzle-solving ability and more on understanding how AI systems think and explain their reasoning.

OpenAI’s o1 model performed best at solving puzzles but was particularly poor at explaining its methodology, using wrong terminology and failing to justify its moves.
Other AI models were deemed “not currently capable” of solving even simplified six-by-six Sudoku puzzles.
When asked to explain their reasoning, AI models frequently hallucinated facts, claiming constraints that didn’t actually exist in the puzzles.
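To make the idea of a hallucinated constraint concrete, here is a minimal Python sketch of how a claim such as "there's already a two in the same row" could be checked against the actual grid. It is an illustration only, not the researchers' method: the grid, the function name claim_holds, and the assumption that a six-by-six puzzle uses boxes of 2 rows by 3 columns are all hypothetical choices made for this example.

```python
from typing import List

Grid = List[List[int]]  # 0 marks an empty cell

# Hypothetical, partially filled 6x6 grid used only for illustration.
GRID: Grid = [
    [0, 0, 3, 4, 0, 6],
    [4, 0, 6, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [2, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Assumed box shape for a 6x6 puzzle: 2 rows by 3 columns.
BOX_ROWS, BOX_COLS = 2, 3


def claim_holds(grid: Grid, row: int, col: int, value: int, unit: str) -> bool:
    """Return True if `value` really is already present in the named unit
    ("row", "column", or "box") that contains the cell (row, col)."""
    if unit == "row":
        return value in grid[row]
    if unit == "column":
        return any(r[col] == value for r in grid)
    if unit == "box":
        top = (row // BOX_ROWS) * BOX_ROWS
        left = (col // BOX_COLS) * BOX_COLS
        return any(
            grid[r][c] == value
            for r in range(top, top + BOX_ROWS)
            for c in range(left, left + BOX_COLS)
        )
    raise ValueError(f"unknown unit: {unit}")


# An explanation says: "There cannot be a 2 at (0, 0) because there is
# already a 2 in the same row." The grid shows that claim is fabricated.
row_claim = claim_holds(GRID, 0, 0, 2, "row")        # no 2 in row 0
column_claim = claim_holds(GRID, 0, 0, 2, "column")  # a 2 sits at (3, 0)
print(row_claim, column_claim)
```

In this sketch the row-based justification fails the check while a column-based one would pass, which is the kind of mismatch between explanation and puzzle state that the researchers describe.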

Why this matters: The findings highlight critical trust issues that must be resolved before AI can become a reliable partner in human decision-making processes.

Only 41% of people currently trust AI technology, according to KPMG, a global consulting firm, despite 78% of organizations using AI in at least one business function.
The World Economic Forum identifies trust as a key factor that will shape outcomes in the AI-powered economy.

What they’re saying: Researchers emphasized the broader implications of AI’s reasoning failures.

“Sometimes, the AI explanations made up facts,” said Ashutosh Trivedi, study co-author and associate professor of computer science at CU Boulder. “So it might say, ‘There cannot be a two here because there’s already a two in the same row,’ but that wasn’t the case.”
“At that point, the AI had gone berserk and was completely confused,” explained study co-author Fabio Somenzi when describing the weather forecast incident.
“If you have AI prepare your taxes, you want to be able to explain to the IRS why the AI wrote what it wrote,” Somenzi added.

The big picture: The study underscores that while AI can perform complex tasks like coding websites and summarizing meetings, its reasoning processes remain opaque and unreliable.

The hallucinations and glitches “underscore significant challenges that must be addressed before LLMs can become effective partners in human-AI collaborative decision-making,” according to the researchers.
Understanding how AI systems think could ultimately improve public trust and ensure more reliable outputs across applications from computer code to financial services.
