The editor of Downcodes learned that Apple recently released a study on the mathematical reasoning capabilities of large language models (LLMs), which has attracted widespread industry attention. The study questions how reliably existing LLMs perform on the GSM8K benchmark and introduces an improved benchmark, GSM-Symbolic, to evaluate LLM reasoning more dependably. The work sheds light on the limitations of LLMs in mathematics and offers valuable guidance for future improvements.
Recently, Apple conducted a study on the reasoning capabilities of large language models (LLMs), raising concerns about how these models perform on mathematical tasks.
The GSM8K benchmark is widely used to evaluate models' reasoning on elementary-school math problems. Although LLM scores on GSM8K have improved in recent years, the researchers questioned the reliability of these results and therefore conducted a large-scale study of current state-of-the-art open-source and closed-source models.
To evaluate reasoning ability more rigorously, the research team introduced an improved benchmark, GSM-Symbolic. The new benchmark uses symbolic templates to generate diverse variants of each question, allowing finer control over the evaluation process and yielding more reliable metrics.
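The template idea can be sketched in a few lines of Python. This is a minimal illustration of symbolic question generation, not code from the actual benchmark: the template text, the names, and the numeric ranges are all invented for the example.

```python
import random

# A hypothetical GSM8K-style symbolic template. The placeholders
# {name}, {x}, {y} are filled with sampled values to produce many
# variants of the same underlying problem. The wording and ranges
# here are illustrative, not taken from GSM-Symbolic itself.
TEMPLATE = (
    "{name} has {x} apples and buys {y} more. "
    "How many apples does {name} have now?"
)

def instantiate(template, rng):
    """Sample concrete values and return (question, ground-truth answer)."""
    name = rng.choice(["Ava", "Liam", "Noah", "Mia"])
    x = rng.randint(2, 50)
    y = rng.randint(2, 50)
    question = template.format(name=name, x=x, y=y)
    answer = x + y  # ground truth computed from the sampled values
    return question, answer

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        question, answer = instantiate(TEMPLATE, rng)
        print(question, "->", answer)
```

Because every variant shares the same logical structure, a model with genuine reasoning ability should score roughly the same across all of them; large score swings between variants are what the study flags as a warning sign.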
The study found that LLM performance fluctuates significantly when only the numerical values in a problem are changed. More strikingly, performance drops steadily as the number of clauses in a question increases. The researchers argue that this pattern suggests existing LLMs lack genuine logical reasoning and instead imitate the reasoning steps found in their training data.
In the experiments, adding just one seemingly relevant clause caused the performance of all state-of-the-art models to drop by as much as 65%. Even though such clauses contribute nothing to the chain of reasoning that leads to the final answer, they had an outsized impact on model accuracy. Overall, the study offers a deeper understanding of both the capabilities and the limits of LLMs in mathematical reasoning.
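The kind of perturbation described above can be shown concretely. The sketch below constructs two versions of a word problem, one with an extra clause that mentions a quantity but does not affect the computation, and checks that the ground-truth answer is identical for both. The problem text is illustrative and not taken verbatim from the benchmark.

```python
# Base problem, a distractor clause, and the question. The distractor
# mentions a number (five) but changes nothing in the required sum,
# so a sound reasoner should answer both prompts identically.
BASE = "A farmer picks 44 kiwis on Friday and 58 kiwis on Saturday."
DISTRACTOR = (
    " Five of the kiwis picked on Saturday were slightly"
    " smaller than average."
)
QUESTION = " How many kiwis did the farmer pick in total?"

def ground_truth():
    # The distractor clause is irrelevant to the total: 44 + 58.
    return 44 + 58

original = BASE + QUESTION
perturbed = BASE + DISTRACTOR + QUESTION

assert ground_truth() == 102  # same answer for both prompts
print(original)
print(perturbed)
```

Comparing a model's answers on such original/perturbed pairs is one way to measure how much its accuracy depends on surface wording rather than on the underlying arithmetic.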
Highlights:
The mathematical reasoning ability of LLMs varies markedly across different instances of the same problem.
As problem complexity increases, LLM performance drops significantly, especially once additional clauses are added.
Existing LLMs lack genuine logical reasoning and mainly rely on repeating and imitating patterns in their training data.
This research by Apple exposes the shortcomings of large language models in mathematical reasoning and points to important directions for future model improvement. Further research may strengthen the logical reasoning of LLMs and bring them closer to human cognitive performance. The editor of Downcodes will continue to follow the latest developments in this field.