Large language models (LLMs) with very long context windows are developing rapidly, and their ability to process information has attracted widespread attention. However, assessing how well these models understand and use large amounts of information remains challenging. Researchers at Google DeepMind developed the Michelangelo benchmark for this purpose, aiming to evaluate the reasoning capabilities of long-context LLMs more deeply and to provide direction for future model improvements.
Recently, large language models (LLMs) with very long context windows have become a hot topic of discussion. These models can handle hundreds of thousands or even millions of tokens in a single prompt, opening up many new possibilities for developers. But how well do these long-context LLMs actually understand and use the vast amount of information they receive? To answer this question, researchers at Google DeepMind introduced a new benchmark called Michelangelo, designed to evaluate long-context reasoning capabilities. The results show that although current state-of-the-art models have made progress in extracting information from large amounts of contextual data, they still struggle with tasks that require reasoning over and understanding the structure of that data.

As LLMs with long context windows emerged, researchers began to realize that new benchmarks were needed to evaluate their capabilities. Existing evaluations mostly focus on information retrieval, such as "needle in a haystack" tests that look for a specific piece of information in a large context. However, simple retrieval does not equate to genuine understanding of the overall context.

To address these issues, Michelangelo proposes a new evaluation method built on complex tasks that require models to perform deeper reasoning and synthesis when processing long texts. The framework contains multiple tasks involving both programming and natural language. These tasks test not only a model's ability to remember information but also how deeply it understands and processes it.

In Michelangelo's evaluation, the model must solve three basic long-document synthesis tasks: Latent List, Multi-Round Co-reference Resolution (MRCR), and IDK ("I Don't Know"). These tasks not only help evaluate a model's performance on long documents, but also reveal its shortcomings in reasoning and synthesis. In Latent List, the model must process a long sequence of operations on a Python list, filter out irrelevant or redundant statements, and determine the final state of the list (a toy illustration follows at the end of this section). In MRCR, the model must understand the structure of a long, multi-turn conversation and resolve references within it. In IDK, the model answers a series of multiple-choice questions and must determine whether the answer is actually contained in the context, accurately responding with "I don't know" when it is not.

The researchers evaluated ten leading LLMs on Michelangelo, including different versions of Gemini, GPT-4, and Claude, testing them with contexts of up to 1 million tokens. The Gemini models performed best on MRCR, the GPT models did well on Latent List, and Claude 3.5 Sonnet achieved the highest score on IDK.
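To make the Latent List task more concrete, here is a minimal, hypothetical sketch (not drawn from the benchmark itself) of the kind of instance a model must resolve. The variable names, operations, and distractors are illustrative assumptions; the real benchmark embeds far longer operation streams with many more irrelevant statements.

```python
# Hypothetical miniature of a Latent List instance (illustration only; the
# actual benchmark hides much longer operation streams inside long contexts).
context = """
my_list = []
my_list.append(3)
unused = "this assignment never touches my_list"   # distractor statement
my_list.append(7)
x = len(my_list)        # reads the list but does not modify it
my_list.pop()           # removes 7
my_list.append(5)
"""

question = "What is the final state of my_list?"

# Ground truth can be computed by simply executing the snippet; the model
# under test must instead track the latent list state from the text alone.
namespace = {}
exec(context, namespace)
print(question, namespace["my_list"])   # -> [3, 5]
```

The point of the task is that only a small fraction of the statements affect the final answer, so a high score requires tracking state across the whole context rather than retrieving a single passage.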
The editor of Downcodes concludes: the emergence of the Michelangelo benchmark provides a new perspective for evaluating ultra-long-context LLMs and highlights the shortcomings of current models in complex reasoning. In the future, more powerful LLMs will need to achieve breakthroughs in reasoning capabilities to better cope with more complex tasks and application scenarios. We look forward to future research bringing us more surprises!