According to the official account of the Doubao Big Model Team, the "VideoWorld" experimental video generation model, developed by the team jointly with Beijing Jiaotong University and the University of Science and Technology of China, was recently open-sourced.
The model's biggest highlight is that it does not rely on traditional language models: it can recognize and understand the world from visual information alone. This research was inspired by Professor Fei-Fei Li's observation in her TED talk that young children can understand the real world without relying on language.
"VideoWorld" realizes complex inference, planning and decision-making capabilities by analyzing and processing large amounts of video data. The research team's experiments showed that the model achieved significant results with only 300M parameters. Unlike existing models that rely on language or tag data, VideoWorld can independently learn knowledge, especially in complex tasks such as origami and bow ties, which can provide a more intuitive learning method.
To verify the model's effectiveness, the research team set up two experimental environments: Go gameplay and simulated robot control. As a highly strategic game, Go effectively evaluates the model's rule learning and reasoning abilities, while the robot tasks examine its performance in control and planning. During training, the model gradually learns to predict future frames by watching large amounts of video demonstration data.
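As a rough, hedged illustration of this next-frame prediction objective (not the authors' implementation), the Python sketch below trains a toy causal transformer to predict the next frame's discrete token codes from the preceding ones; the class name, vocabulary size, and dimensions are all hypothetical:

```python
# Toy sketch of next-frame prediction over discrete video tokens.
# Everything here (NextFrameModel, sizes) is illustrative, not VideoWorld's code.
import torch
import torch.nn as nn

class NextFrameModel(nn.Module):
    """Causal transformer over per-frame token codes."""
    def __init__(self, vocab_size=1024, dim=256, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):  # tokens: (batch, seq_len)
        # Causal mask so each position only attends to earlier frames.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)  # logits over the next token at each position

model = NextFrameModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Fake batch: random codes standing in for tokenized demonstration frames.
tokens = torch.randint(0, 1024, (2, 64))
logits = model(tokens[:, :-1])  # predict targets shifted by one step
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 1024), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```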
To improve the efficiency of video learning, the team introduced a Latent Dynamics Model (LDM) designed to compress the visual changes between video frames and extract critical information. This approach not only reduces redundant information but also improves the model's efficiency at learning complex knowledge. With this innovation, VideoWorld demonstrates outstanding ability in the Go and robot tasks, even reaching the level of a professional 5-dan Go player.
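The following is a minimal sketch of that compression idea, assuming an autoencoder-style setup: an encoder squeezes the change between two consecutive frames into a small latent code, and a decoder reconstructs the next frame from the current frame plus that code. The architecture and sizes are illustrative assumptions, not the paper's design:

```python
# Minimal sketch of compressing inter-frame visual change into a latent code.
# The bottleneck forces the code to keep only the critical change information.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: current frame + next frame (6 channels) -> compact change code.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        # Decoder: current frame + broadcast change code -> predicted next frame.
        self.expand = nn.Linear(latent_dim, 64 * 64)
        self.decoder = nn.Sequential(
            nn.Conv2d(3 + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, frame_t, frame_t1):
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=1))
        change = self.expand(z).view(-1, 1, 64, 64)
        return self.decoder(torch.cat([frame_t, change], dim=1)), z

model = LatentDynamics()
frame_t = torch.rand(2, 3, 64, 64)   # fake 64x64 RGB frames
frame_t1 = torch.rand(2, 3, 64, 64)
pred, z = model(frame_t, frame_t1)
loss = nn.functional.mse_loss(pred, frame_t1)  # reconstruct the next frame
```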
Paper link: https://arxiv.org/abs/2501.09781
Code link: https://github.com/bytedance/VideoWorld
Project homepage: https://maverickren.github.io/VideoWorld.github.io
Key points:
The "VideoWorld" model can realize knowledge learning based on visual information alone, and does not rely on language models.
The model demonstrates excellent reasoning and planning capabilities in Go and robot simulation tasks.
The project's code and model have been open-sourced, and everyone is welcome to try them out and share feedback.