The reasoning ability of large language models (LLMs) is a major focus of research in artificial intelligence. Apple's AI research team recently published a paper revealing the limitations of LLMs in mathematical reasoning. The Downcodes editor interprets the paper's main content below and analyzes what it means for the development of AI technology.
In the world of artificial intelligence, the reasoning capabilities of machine learning models, especially large language models (LLMs), have long been a focus for scientists.
Recently, Apple's AI research team published a paper titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," which reveals the limitations of these models when handling logical problems.
In the paper, the researchers demonstrate this with a simple math problem about Oliver picking kiwis, shown below:
Oliver picked 44 kiwi fruits on Friday. On Saturday he picked another 58 kiwi fruits. On Sunday he picked twice as many kiwis as on Friday. How many kiwis does Oliver have in total?
Obviously, the answer is 44 + 58 + (44 × 2) = 190. Large language models are not mathematically flawless, but they can solve problems like this fairly reliably.
But observe what happens when some irrelevant information is added, such as:
Oliver picked 44 kiwi fruits on Friday. On Saturday he picked another 58. On Sunday he picked twice as many kiwis as on Friday, but five of them were a little smaller than average. How many kiwis does Oliver have?
Although this does not change the mathematical substance of the problem, even state-of-the-art LLMs give wrong answers under this small perturbation. For example, OpenAI's o1-mini incorrectly subtracted the 5 smaller kiwis from the number picked on Sunday.
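To make the failure concrete, here is a minimal arithmetic check (an illustration, not code from the paper): the irrelevant detail about kiwi size leaves the correct total unchanged, while the model's subtraction produces a different, wrong number.

```python
# Minimal arithmetic check (illustrative, not from the paper).
friday = 44
saturday = 58
sunday = 2 * friday              # "twice as many as on Friday" = 88

correct_total = friday + saturday + sunday
print(correct_total)             # 190 -- the size of five kiwis is irrelevant

# The reported failure mode: subtracting the five "smaller" kiwis
# even though they were still picked.
erroneous_total = friday + saturday + (sunday - 5)
print(erroneous_total)           # 185 -- wrong, because size does not affect the count
```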
This experiment shows that although LLMs can give the correct answer in some cases, they do not really understand the nature of the problem.
The researchers believe that the failure patterns of these models indicate that they are not performing true logical reasoning but are reproducing reasoning steps observed in their training data. It is like an LLM noticing from its statistics that "I love you" is usually followed by "I love you too": that does not mean it truly understands the meaning of love.
Mehrdad Farajtabar, one of the paper's co-authors, explained the findings further on social media. He points out that while better prompt engineering can improve model performance in some simple cases, complex perturbations may require much more contextual data for the model to handle correctly, even though such perturbations would pose no problem at all for a small child.
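This kind of perturbation testing can be sketched with a simple template in which the names, quantities, and an irrelevant clause vary while the underlying arithmetic stays the same. The snippet below is a hypothetical illustration in the spirit of the paper's approach, not the authors' benchmark code; the template text, distractor phrases, and helper names are assumptions.

```python
import random

# Hypothetical sketch of template-based perturbation: names, quantities,
# and an irrelevant clause change, but the arithmetic structure does not.
TEMPLATE = (
    "{name} picked {fri} kiwi fruits on Friday. On Saturday {name} picked "
    "another {sat}. On Sunday {name} picked twice as many kiwis as on Friday"
    "{distractor}. How many kiwis does {name} have in total?"
)

DISTRACTORS = [
    "",  # clean version
    ", but five of them were a little smaller than average",
    ", and a few of them had brown spots",
]

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return (prompt, ground-truth answer) for one sampled variant."""
    fri = rng.randint(20, 80)
    sat = rng.randint(20, 80)
    prompt = TEMPLATE.format(
        name=rng.choice(["Oliver", "Mia", "Noah"]),
        fri=fri,
        sat=sat,
        distractor=rng.choice(DISTRACTORS),
    )
    answer = fri + sat + 2 * fri   # the distractor never changes the answer
    return prompt, answer

rng = random.Random(0)
prompt, answer = make_variant(rng)
print(prompt)
print("Expected answer:", answer)
```

Because the ground-truth answer is computed from the same values used to fill the template, any model answer that deviates, such as subtracting kiwis that were merely described as smaller, can be flagged automatically.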
This study reminds us that although LLMs excel at language processing, their logical reasoning abilities are still limited. This is not just an academic question: as AI technology becomes an increasingly integral part of our daily lives, the answers to these questions become ever more important.
We cannot simply assume that AI can understand and perform complex tasks; we need to understand how these systems work and where their limits lie. This research deepens our understanding of AI technology and offers valuable insights into how we use and develop it.
Reference: https://techcrunch.com/2024/10/11/researchers-question-ais-reasoning-ability-as-models-stumble-on-math-problems-with-trivial-changes/
All in all, the Apple team's research highlights the limitations of large language models in logical reasoning, reminding us to be cautious about what AI can actually do, to keep a close eye on how the technology develops, and to avoid over-relying on it. In the future, deeper research is needed into how to improve the reasoning capabilities of LLMs so that they truly understand the essence of a problem rather than merely imitating existing patterns.