A recent joint study by the University of Munich, the Munich Machine Learning Center, and Adobe Research evaluated 12 leading AI language models, including GPT-4o, Gemini, and Llama, on long-context conceptual reasoning tasks. The results are alarming: despite their ability to process very long inputs, these models show significant weaknesses in complex logical reasoning, and their performance falls off a cliff as context length grows. By deliberately avoiding keyword overlap, the team's NOLIMA benchmark exposes the models' fragility at forming conceptual associations and digs into the causes of this phenomenon.
The study, released jointly by the University of Munich, the Munich Machine Learning Center, and Adobe Research, shows that 12 top AI language models, including GPT-4o, Gemini 1.5 Pro, and Llama-3.3-70B, suffer pronounced performance decay on long-context conceptual reasoning tasks. Although all of these models support context windows of at least 128,000 tokens, their capacity for deep logical association remains fundamentally limited.
The NOLIMA (No Literal Matching) benchmark developed by the team reveals the models' fragility at conceptual linking by deliberately avoiding any keyword overlap between the question and the text. For example, when the text states that "Yuki lives next to the Semperoper", the model must draw on the common-sense fact that the Semperoper is located in Dresden in order to answer "Who has been to Dresden?".
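The idea can be illustrated with a minimal sketch of such a test item: a "needle" sentence is planted among filler paragraphs, and the question shares no vocabulary with it, forcing the model to bridge the two via world knowledge. The helper name, filler text, and prompt wording below are illustrative assumptions, not taken from the actual benchmark.

```python
# Sketch of a NOLIMA-style test item: the needle sentence and the question
# share no keywords, so literal matching cannot find the answer.
# All names and text here are illustrative, not from the real benchmark.

def build_test_item(needle: str, haystack_paragraphs: list[str],
                    position: float) -> str:
    """Insert the needle at a relative position (0.0 = start, 1.0 = end)
    within filler paragraphs, needle-in-a-haystack style."""
    idx = round(position * len(haystack_paragraphs))
    parts = haystack_paragraphs[:idx] + [needle] + haystack_paragraphs[idx:]
    return "\n\n".join(parts)

needle = "Actually, Yuki lives next to the Semperoper."
question = "Which character has been to Dresden?"  # no word overlap with the needle

filler = [f"Filler paragraph {i} about unrelated topics." for i in range(10)]
context = build_test_item(needle, filler, position=0.5)
prompt = f"{context}\n\nQuestion: {question}"
```

Varying `position` also lets one probe the finding below that accuracy drops when the key fact sits late in the context.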
The test results show:
1. **Long-context performance falls off a cliff**: As the context grows from 2,000 to 8,000 tokens, most models' performance drops markedly; at 32,000 tokens, 10 of the 12 models score only half of what they achieve on short texts.
2. **The attention mechanism shows its weaknesses**: Models struggle to pinpoint relevant information in long texts, and accuracy drops further when the key fact appears in the latter half of the context.
3. **Dedicated reasoning models are also flawed**: o1, o3-mini, and DeepSeek-R1, systems built for complex reasoning, scored below 50% on the 32K-token NOLIMA-Hard test, despite near-perfect results on short texts.
The study identifies the models' over-reliance on the habit of literal word matching as the core problem. When the test deliberately excludes shared vocabulary, even chain-of-thought (CoT) prompting yields only limited gains in Llama-3.3-70B's long-context performance. Worse, a literal-match distractor planted in irrelevant context actively amplifies model misjudgment.
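A CoT intervention of this kind amounts to wrapping the same context and question in an instruction to reason explicitly before answering. The template below is a hedged sketch of such a wrapper; its wording is an assumption, not the study's actual prompt.

```python
# Illustrative chain-of-thought prompt wrapper of the kind applied to models
# like Llama-3.3-70B. The instruction text is an assumed example, not the
# template used in the study.

def with_cot(context: str, question: str) -> str:
    return (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Think step by step: first recall any world knowledge that links "
        "entities in the text to the question, then give your answer."
    )
```

As the study notes, this only partially helps when no literal match exists: the model still has to form the conceptual link on its own.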
"This reveals the fundamental contradiction of current AI - it is easy to expand the context window, but it is difficult to improve deep reasoning capabilities." The researchers emphasized. Taking GPT-4o as an example, although it reaches the effective context length of 8,000 marks, it is still weak in the integration of cross-paragraph concepts. As the text is extended, the model's attention mechanism gradually "out of focus", making it difficult to maintain a coherent logical chain.
The research sounds an alarm for AI development: simply increasing the processable length cannot break through the reasoning bottleneck. The industry needs to re-examine model architecture design and develop more efficient mechanisms for extracting and associating information. Going forward, making AI genuinely understand text rather than rely on pattern matching will be the key to pushing past the limits of long-context processing.
The study underscores the limitations of current AI models in long-context reasoning and provides an important reference for future improvement. Enlarging the context window alone will not solve the problem; deeper research and changes at the architecture level are required to improve models' genuine understanding.