The editor of Downcodes has learned that the Shanghai AI Lab team has open sourced its LLaMA version o1 project. This is exciting news! The project aims to reproduce OpenAI's o1, a model renowned for its mathematical problem solving, and has already made significant progress. The team skillfully combined advanced techniques such as Monte Carlo tree search and reinforcement learning to surpass many closed-source solutions on the AIME2024 benchmark, demonstrating both strong technical strength and an open-source spirit. The release includes the pre-training dataset, models, and training code, giving developers valuable learning resources.
Long before the release of OpenAI's o1 series, the Shanghai AI Lab team had begun exploring the use of Monte Carlo tree search to improve the mathematical capabilities of large models. After o1's release, the team further upgraded the algorithm, focused it on Mathematical Olympiad problems, and developed it into an open-source counterpart to OpenAI's "Strawberry" project, the reported internal codename for o1.
To improve the LLaMA model's performance on Mathematical Olympiad problems, the team adopted a pairwise optimization strategy: rather than assigning an absolute score to a single answer, the model compares the relative merit of two candidate answers. With this approach, they achieved significant improvements on the hardest benchmark, AIME2024. Of the 30 test questions, the optimized model answered 8 correctly, while the original LLaMA-3.1-8B-Instruct model answered only 2. This result outperforms every commercial closed-source solution except o1-preview and o1-mini.
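The article does not give the team's exact objective function, but a common way to realize this kind of pairwise comparison is a Bradley-Terry style loss, which only cares about the score margin between the preferred and rejected answers. A minimal sketch under that assumption (the function names here are illustrative, not the project's actual code):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style pairwise objective: instead of regressing an
    absolute quality score for one answer, train the scorer so the better
    answer's score exceeds the worse answer's score.
    loss = -log sigmoid(s_preferred - s_rejected)"""
    return -math.log(sigmoid(score_preferred - score_rejected))

# A larger margin between the two answers yields a smaller loss:
small_margin = pairwise_loss(0.5, 0.0)
large_margin = pairwise_loss(3.0, 0.0)
```

When the two answers are scored equally, the loss is exactly log 2; it decays toward zero as the preferred answer pulls ahead.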
At the end of October, the team announced that it had made significant progress in reproducing OpenAI's o1 based on the AlphaGo Zero architecture: the model successfully acquired advanced reasoning capabilities by interacting with a search tree during training, without any manual annotation. Less than a week later, the project was open sourced.
Currently, the open-source release of LLaMA version o1 includes the pre-training dataset, pre-trained models, and reinforcement learning training code. The "OpenLongCoT-Pretrain" dataset contains more than 100,000 long chain-of-thought examples. Each example records a complete mathematical reasoning process: the problem description, thinking content, graphic coordinates, calculation steps, conclusion derivation, and scoring results, along with criticism and verification of each reasoning step that provide evaluation and guidance for the process. After continued pre-training on this dataset, the model can read and produce long chain-of-thought reasoning in the style of o1.
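The article does not publish the dataset's actual schema, but the fields it describes suggest a record shape roughly like the following. All field names and the sample problem below are illustrative assumptions, not the real OpenLongCoT-Pretrain format:

```python
# Hypothetical record for one long chain-of-thought training example;
# the field names are assumptions based on the fields the article lists,
# not the dataset's actual schema.
record = {
    "problem": "Find all real x such that x^2 - 5x + 6 = 0.",
    "reasoning_steps": [
        {
            "thought": "Factor the quadratic: (x - 2)(x - 3) = 0.",
            "calculation": "x^2 - 5x + 6 = (x - 2)(x - 3)",
            "critique": "Check: 2 * 3 = 6 and 2 + 3 = 5, so the factoring holds.",
            "score": 0.9,
        },
    ],
    "conclusion": "x = 2 or x = 3",
}

# Pre-training on such records teaches the model to emit the full
# thought -> critique -> conclusion chain described above.
step_count = len(record["reasoning_steps"])
```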
Although the project is called LLaMA-O1, the pre-trained model currently provided officially is based on Google's Gemma 2. Starting from this pre-trained model, developers can continue with reinforcement learning training. The training process consists of: using Monte Carlo tree search for self-play to generate experience; storing that experience in a prioritized experience replay buffer; sampling mini-batches from the buffer for training; and updating the model parameters and experience priorities. The training code also employs several key techniques: LoRA for parameter-efficient fine-tuning, the PPO algorithm as the policy optimization method, the GAE algorithm to compute the advantage function, and prioritized experience replay to improve training efficiency.
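The article names these components only at a high level. A minimal Python sketch of three of them, prioritized experience replay, GAE, and PPO's clipped objective, may help make the loop concrete; every class and function name here is an assumption for illustration, not the project's actual code:

```python
import random
from dataclasses import dataclass

@dataclass
class Experience:
    trajectory: list      # e.g. a sequence of (state, action, reward) steps
    priority: float       # sampling weight, updated after each training pass

class PrioritizedReplayBuffer:
    """Proportional prioritized experience replay: experiences are sampled
    with probability proportional to their priority, so informative
    self-play games are revisited more often."""
    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self.items: list[Experience] = []

    def add(self, exp: Experience) -> None:
        if len(self.items) >= self.capacity:
            # Evict the lowest-priority experience to make room.
            self.items.remove(min(self.items, key=lambda e: e.priority))
        self.items.append(exp)

    def sample(self, k: int) -> list[Experience]:
        weights = [e.priority for e in self.items]
        return random.choices(self.items, weights=weights, k=k)

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation, computed backwards in time:
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}"""
    advantages = [0.0] * len(rewards)
    running, next_value = 0.0, 0.0   # terminal state has value 0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: limits how far a single update can move the
    policy by clipping the new/old probability ratio to [1-eps, 1+eps]."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

In the loop the article describes, each self-play game would land in the buffer as one `Experience`, GAE would turn its rewards and value estimates into advantages, and the PPO objective would drive the LoRA parameter updates.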
It is worth noting that the LLaMA-O1 code was released under a GitHub account called SimpleBerry. The account has no particular introduction and appears rather mysterious. From other accounts and website information related to SimpleBerry, one can only tell that it is some kind of research laboratory; nothing more about its research direction has been disclosed.
In addition to LLaMA-O1, another o1 replication project with public progress is O1-Journey, from a Shanghai Jiao Tong University team. The team released its first progress report in early October, introducing the innovative Journey Learning paradigm and the first model to successfully integrate search and learning into mathematical reasoning. The core development team of O1-Journey is composed mainly of junior and senior undergraduates at Shanghai Jiao Tong University, together with first-year doctoral students from the university's GAIR Laboratory (Generative Artificial Intelligence Research Lab). The advisors include Liu Pengfei, an associate professor at Shanghai Jiao Tong University, and Li Yuanzhi, a Yao Class alumnus and Sloan Award winner, among others.
Paper addresses:
https://arxiv.org/pdf/2410.02884
https://arxiv.org/pdf/2406.07394
The open sourcing of the LLaMA version o1 project has brought new vitality to the field of AI mathematical problem solving and provided developers with valuable learning and research resources. We look forward to more open-source projects like it, pushing the field of artificial intelligence forward!