
Long-context LLMs face reasoning challenges: DeepMind’s Michelangelo benchmark reveals that while large language models (LLMs) with extended context windows have improved at information retrieval, they struggle with complex reasoning over the contents of those long contexts.

  • Google DeepMind researchers developed Michelangelo to evaluate the long-context reasoning capabilities of LLMs, addressing limitations in existing benchmarks.
  • The benchmark aims to assess models’ ability to understand relationships and structures within vast amounts of information, rather than just retrieving isolated facts.
  • Michelangelo consists of three core tasks: Latent List, Multi-round Co-reference Resolution (MRCR), and “I Don’t Know” (IDK), each designed to test different aspects of long-context reasoning.
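To make the task designs concrete, here is a toy sketch in the spirit of the Latent List task. The generator, operation mix, and prompt wording below are illustrative assumptions, not DeepMind’s actual benchmark code: the idea is that the model must track a Python list through mutations scattered among irrelevant statements and report its final state.

```python
import random

def make_latent_list_example(n_ops=8, n_distractors=20, seed=0):
    """Toy Latent List-style instance (illustrative, not DeepMind's spec):
    a Python list is mutated by a sequence of statements interleaved with
    irrelevant assignments, and the model must report the list's final state.
    """
    rng = random.Random(seed)
    lst = []
    lines = ["lst = []"]
    for _ in range(n_ops):
        if lst and rng.random() < 0.3:
            lines.append("lst.pop()")
            lst.pop()
        else:
            v = rng.randint(0, 99)
            lines.append(f"lst.append({v})")
            lst.append(v)
    # Insert distractor statements about other variables after the first
    # line, so simple retrieval of one statement cannot solve the task.
    for i in range(n_distractors):
        lines.insert(rng.randrange(1, len(lines) + 1),
                     f"other_{i} = {rng.randint(0, 99)}")
    prompt = "\n".join(lines) + "\nWhat is the final value of lst?"
    return prompt, lst

prompt, answer = make_latent_list_example()
```

Because every mutation matters and the mutations are spread across the context, the answer cannot be read off from any single location, which is what separates this family of tasks from plain retrieval.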

The need for advanced benchmarks: As LLMs with context windows of up to 1 million tokens emerge, existing evaluation methods fall short in assessing their true capabilities.

  • Traditional benchmarks, like the “needle-in-a-haystack” test, have been saturated by current models and don’t reflect their ability to reason over entire contexts.
  • Many existing long-context reasoning evaluations can be solved through a combination of retrieval and information stored in model weights, bypassing any real test of long-context understanding.
  • Michelangelo addresses these limitations by focusing on the model’s capacity to comprehend relationships and structures within large context windows.
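For contrast, a needle-in-a-haystack probe can be sketched in a few lines (the function and wording below are illustrative, not any benchmark’s actual implementation). It buries one key fact in repetitive filler and asks the model to retrieve it, so a model can succeed by locating a single sentence without reasoning over the rest of the context:

```python
def make_haystack_probe(needle, question, filler_sentence,
                        n_filler=1000, position=0.5):
    """Toy needle-in-a-haystack probe: one key fact ("needle") buried in
    repeated filler, followed by a retrieval question. Solvable by locating
    a single sentence, with no cross-context reasoning required."""
    filler = [filler_sentence] * n_filler
    filler.insert(int(position * n_filler), needle)
    return " ".join(filler) + "\n" + question

probe = make_haystack_probe(
    "The magic number is 7.",
    "What is the magic number mentioned in the text?",
    "Grass is green.",
)
```

This is why such tests saturate: the difficulty grows with context length only in a shallow sense, since the answer always lives in one retrievable span.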

Latent Structure Queries framework: The researchers introduced a novel approach called Latent Structure Queries (LSQ) to design long-context reasoning evaluations.

  • LSQ allows for the creation of test data that can be extended to arbitrary lengths while avoiding data leakage into training corpora.
  • The framework emphasizes extracting information from structures rather than simple key-value retrieval, enabling a deeper assessment of context understanding.
  • LSQ provides a methodology for increasing task complexity and context length independently, making it adaptable to a wide range of reasoning tasks.
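The two independent dials described above can be illustrated with a minimal sketch, assuming a key-value store as the latent structure (the generator below is a hypothetical illustration of the LSQ idea, not the paper’s implementation): the number of scattered updates controls reasoning complexity, while padding controls context length, and the synthetic data cannot have leaked from a training corpus.

```python
import random

def make_lsq_instance(n_updates, target_length, seed=0):
    """Illustrative LSQ-style generator (not DeepMind's actual code).
    The latent structure is a key-value store built by scattered updates;
    n_updates scales reasoning complexity, target_length scales context
    size, and the two vary independently."""
    rng = random.Random(seed)
    store = {}
    updates = []
    for _ in range(n_updates):
        k = rng.choice("abcde")
        v = rng.randint(0, 9)
        store[k] = v
        updates.append(f"set {k} to {v}.")
    # Pad with filler until the context reaches the target length;
    # random insertion preserves the relative order of the updates.
    filler = "The weather was unremarkable that day."
    lines = list(updates)
    while sum(len(s) for s in lines) < target_length:
        lines.insert(rng.randrange(len(lines) + 1), filler)
    query_key = rng.choice(list(store))
    prompt = " ".join(lines) + f" What is the final value of {query_key}?"
    return prompt, store[query_key]
```

Answering correctly requires replaying every update to the queried key in order, so complexity can be raised (more updates) without touching length, or length raised (more filler) without touching complexity.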

Evaluating frontier models: The researchers tested ten leading LLMs, including variants of Gemini, GPT-4, and Claude, on the Michelangelo benchmark.

  • Different models exhibited strengths in various tasks: Gemini models performed best on MRCR, GPT models excelled in Latent List, and Claude 3.5 Sonnet achieved the highest scores on IDK.
  • All models showed a significant performance drop as reasoning task complexity increased, indicating room for improvement in long-context reasoning capabilities.
  • The evaluation revealed that even with very long context windows, current LLMs struggle with tasks requiring complex reasoning over large amounts of information.

Implications for enterprise applications: The findings from Michelangelo have important consequences for real-world LLM implementations.

  • In applications requiring multi-hop reasoning over disparate locations in very long contexts, model performance is likely to decrease as context length grows.
  • Models may struggle with documents containing large amounts of irrelevant information, making it difficult to distinguish crucial data.
  • LLMs are expected to maintain good performance on tasks where relevant information is concentrated in one general area of a document.

Ongoing research and future developments: DeepMind researchers plan to expand the Michelangelo benchmark and make it available to the wider research community.

  • Additional evaluations will be added to the benchmark to further assess long-context reasoning capabilities.
  • The team aims to make Michelangelo directly accessible to other researchers, enabling them to test their models on these advanced reasoning tasks.

Analyzing deeper: While long-context LLMs have made significant strides in handling vast amounts of information, Michelangelo reveals that true understanding and reasoning over complex structures remain challenging. As AI continues to evolve, addressing these limitations will be crucial for developing more capable and reliable language models that can effectively process and utilize extensive contextual information in real-world applications.
