The editor of Downcodes has learned that researchers from Meta FAIR, the University of California, Berkeley, and New York University have collaborated on a new technique called Thought Preference Optimization (TPO), which aims to significantly improve the instruction-following ability and response quality of large language models (LLMs). The technique moves beyond the traditional focus on the final answer alone: by mimicking a human thinking process, it lets the model reflect and reason internally before producing its answer, yielding responses that are more accurate and coherent. It is expected to broaden the application of LLMs across many fields and give users a better AI interaction experience.
At the core of TPO is an improved take on Chain of Thought (CoT) reasoning. During training, the approach encourages the model to "think before it answers", building a more organized internal thought process before it produces a final reply. Plain CoT prompting can sometimes reduce accuracy, and it is hard to train for because explicit thinking steps are rarely available as supervision. TPO sidesteps these difficulties by letting the model optimize and streamline its thinking process without ever exposing the intermediate steps to the user.
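As a rough illustration, the sketch below shows what such a "think before answering" prompt and the separation of the hidden thought from the user-visible answer might look like in Python. The prompt wording and the "Answer:" delimiter are assumptions made for illustration, not the exact format used by the researchers.

```python
# Illustrative sketch of a "think before answering" prompt and output parsing.
# The prompt text and the "Answer:" marker are assumptions, not the authors' format.

THOUGHT_PROMPT = (
    "Respond to the instruction below. First write out your internal thoughts, "
    "then give your final reply after a line that starts with 'Answer:'. "
    "Only the final answer will be shown to the user.\n\n"
    "Instruction: {instruction}"
)

def split_thought_and_answer(generation: str) -> tuple[str, str]:
    """Separate the hidden thought section from the user-visible answer."""
    if "Answer:" in generation:
        thought_part, answer_part = generation.split("Answer:", 1)
    else:
        # If the model skipped the marker, treat the whole output as the answer.
        thought_part, answer_part = "", generation
    return thought_part.strip(), answer_part.strip()
```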
During TPO training, the language model is first prompted to produce several candidate generations, each consisting of an internal thought followed by a final answer. A "judge" model then evaluates these outputs and identifies the best- and worst-performing responses, which serve as the "chosen" and "rejected" pairs for Direct Preference Optimization (DPO), steadily improving the model's response quality.
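The sampling-and-judging step might look roughly like the following sketch. The `policy` and `judge` objects and their method names are hypothetical placeholders for the two models involved, not a real library API.

```python
# Sketch of one sampling-and-judging step, under the assumption that `policy` and
# `judge` wrap two language models; generate_with_thought() and score() are
# hypothetical placeholder methods.

def build_preference_pair(policy, judge, instruction: str, num_samples: int = 8):
    """Sample several thought+answer candidates and keep the best and worst as a DPO pair."""
    candidates = []
    for _ in range(num_samples):
        # Each candidate contains a hidden thought followed by the final answer.
        thought, answer = policy.generate_with_thought(instruction)
        # The judge scores only the user-visible answer, never the hidden thought.
        score = judge.score(instruction, answer)
        candidates.append((score, thought, answer))

    candidates.sort(key=lambda item: item[0])
    worst, best = candidates[0], candidates[-1]
    # The full generations (thought + answer) form the "chosen"/"rejected" pair for DPO.
    chosen = best[1] + "\n" + best[2]
    rejected = worst[1] + "\n" + worst[2]
    return chosen, rejected
```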
By adjusting the training prompts, TPO encourages the model to think internally before answering, guiding it to refine its answers so they become clearer and more relevant. Crucially, the LLM-based judge scores only the final answer and never sees the hidden thinking steps, so the model is rewarded purely for the quality of what the user actually receives. DPO is then applied to the preferred and rejected pairs, which still contain the hidden thoughts, and over multiple rounds of training the model's internal reasoning process is progressively refined.
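Since the preference step relies on the standard DPO objective, a minimal sketch of that loss is shown below, applied to sequence log-probabilities computed over the full thought-plus-answer generation. This assumes PyTorch and takes the per-sequence log-probabilities as given; it is a sketch of the generic DPO loss, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss on a batch of chosen/rejected (thought + answer) sequences.

    Each argument is the summed log-probability of a full generation (hidden thought
    plus final answer) under the current policy or the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the policy to prefer the judge's chosen generation over the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```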
On the AlpacaEval and Arena-Hard benchmarks, TPO outperformed the conventional direct-response baseline as well as a Llama-3-8B-Instruct model using simple thought prompting. Iterative training further improved the model's thought-generation ability, ultimately surpassing multiple baselines. Notably, TPO is not limited to logic and math tasks; it also performs well on instruction-following tasks in creative areas such as marketing and health.
AI and robotics expert Karan Verma shared his views on the concept of the "thinking LLM" on social media, expressing interest in its potential for medical applications, where he believes it could lead to better treatment outcomes for patients.
This structured internal thinking process lets the model handle complex instructions more effectively, extending its reach into fields that demand multi-step reasoning and nuanced understanding, all without requiring humans to supply explicit thinking data. The research suggests that TPO could make large language models more adaptable and efficient across diverse contexts, particularly in areas that place high demands on the flexibility and depth of response generation.
All in all, the advent of TPO opens up new possibilities for improving the performance of large language models, and its application prospects across many fields are worth watching. The editor of Downcodes believes that as the technique continues to mature, TPO will play a significant role in ever more areas and contribute to the progress of artificial intelligence.