Researchers at Princeton University and Yale University have conducted an in-depth study of the "chain-of-thought" (CoT) reasoning ability of large language models (LLMs) and released a report on their findings. Using the decoding of shift ciphers as the test task, the study analyzed three LLMs, GPT-4, Claude 3, and Llama 3.1, in order to reveal the mechanism behind CoT reasoning. The researchers found that CoT reasoning in LLMs is not simple symbolic logical reasoning but the result of a complex interaction of several factors, which offers a new perspective on how these models reason.
The report unravels part of the mystery of CoT reasoning: it is not symbolic reasoning based purely on logical rules, but a combination of memorization, probability, and noisy reasoning.
The researchers used decoding shift ciphers as the test task to analyze the performance of the three LLMs, GPT-4, Claude 3, and Llama 3.1. A shift cipher is a simple encoding scheme in which each letter is replaced by the letter a fixed number of positions later in the alphabet. For example, with a shift of 3, "CAT" becomes "FDW".
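A minimal sketch of the scheme, assuming uppercase ASCII letters (illustrative code, not the authors' implementation):

```python
# Shift cipher: encoding shifts each letter forward by a fixed amount,
# decoding shifts it back by the same amount.
def shift_encode(text: str, shift: int) -> str:
    return "".join(
        chr((ord(c) - ord("A") + shift) % 26 + ord("A")) if c.isalpha() else c
        for c in text.upper()
    )

def shift_decode(text: str, shift: int) -> str:
    return shift_encode(text, -shift)

print(shift_encode("CAT", 3))     # FDW
print(shift_decode("FDW", 3))     # CAT
print(shift_decode("URYYB", 13))  # HELLO (shift 13 is rot-13)
```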
The results show that three key factors affect the effectiveness of CoT reasoning:
Probability: LLMs tend to generate higher-probability outputs, even when the reasoning steps point to a lower-probability answer. For example, if the reasoning steps point to "STAZ" but "STAY" is the more common word, the model may "correct" itself and output "STAY".
Memory: LLMs memorize vast amounts of text during pre-training, and this memorization affects the accuracy of their CoT reasoning. For example, rot-13 is the most common shift cipher in web text, and the models' accuracy on rot-13 is significantly higher than on other shift values.
Noisy reasoning: The model's reasoning process is not exact but carries a certain amount of noise. As the shift of the cipher grows, so does the number of intermediate steps needed for decoding; the effect of this noise becomes more pronounced, and accuracy drops, as the sketch below illustrates.
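The following back-of-the-envelope sketch shows why more intermediate steps amplify noise. The fixed 5% per-step error rate is an assumption chosen for illustration, not a figure from the paper:

```python
# Illustrative model of noisy reasoning: if each intermediate step succeeds
# independently with probability (1 - eps), the chance that the whole chain
# is correct shrinks geometrically with the number of steps.
def expected_chain_accuracy(per_step_error: float, steps: int) -> float:
    return (1.0 - per_step_error) ** steps

for shift in (1, 3, 7, 13, 25):
    # Treat the shift size as a rough proxy for the number of steps per letter.
    acc = expected_chain_accuracy(0.05, shift)  # 5% per-step error is assumed
    print(f"shift={shift:2d}  expected accuracy ~ {acc:.2f}")
```

In the actual experiments this geometric decay is modulated by the other two factors; memorization, for instance, pushes accuracy back up at shift 13 (rot-13).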
The researchers also found that CoT reasoning in LLMs relies on self-conditioning: the model needs to explicitly generate text, which then serves as the context for subsequent reasoning steps. If the LLM is instructed to "think silently" without outputting any text, its reasoning ability drops sharply. In addition, the validity of the demonstration steps has little impact on CoT performance: even when the demonstrations contain errors, the model's CoT performance remains stable.
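As a rough illustration of self-conditioning, the two prompt variants below contrast an explicit step-by-step request with a "think silently" request; the wording is hypothetical and not taken from the paper's prompts:

```python
# Hypothetical prompt variants (illustrative wording, not the paper's prompts).
ciphertext = "FDW"  # "CAT" encoded with a shift of 3

# Variant 1: the model writes out each intermediate step, so the generated
# text becomes the context it conditions on for the following steps.
cot_prompt = (
    f"Decode the shift cipher '{ciphertext}' (shift 3). "
    "Write out each intermediate letter before giving the final answer."
)

# Variant 2: the model is told to "think silently" and produce only the
# answer, so there is no generated text for later steps to condition on.
silent_prompt = (
    f"Decode the shift cipher '{ciphertext}' (shift 3). "
    "Think silently and reply with the final word only."
)
```

The study reports that the second style sharply reduces accuracy, precisely because the written-out steps are what the model conditions on.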
This study shows that CoT reasoning in LLMs is not pure symbolic reasoning but a combination of memorization, probability, and noisy reasoning: during CoT reasoning an LLM behaves partly like a master of memorization and partly like a master of probability. The work deepens our understanding of LLM reasoning capabilities and offers valuable insights for building more powerful AI systems in the future.
Paper: https://arxiv.org/pdf/2407.01687
In summary, this study is significant for understanding the reasoning mechanism of large language models. Its findings provide a valuable reference for improving LLM reasoning and developing more powerful AI systems, and its emphasis on the roles of probability, memorization, and noise points AI researchers toward new directions.