Recently, a study jointly conducted by ByteDance Research Institute and Tsinghua University raised questions about how well current AI video generation models understand physics. Through carefully designed experiments, the research team found that models such as OpenAI's Sora, although visually impressive, do not really understand basic physical laws; instead, they learn and prioritize surface features such as color, size, speed, and shape. The study has prompted closer scrutiny of how realistically AI can simulate the world, and it challenges claims that such models understand physics.
The researchers from ByteDance Research Institute and Tsinghua University point out that current AI video generation models, such as OpenAI's Sora, can create striking visual effects but have major flaws in their grasp of basic physical rules. The study has sparked extensive discussion of AI's ability to simulate reality.
The research team tested AI video generation models in three different scenarios: predictions for known patterns, predictions for unknown patterns, and new combinations of familiar elements. Their goal was to see whether these models actually learn the laws of physics or rely solely on surface features seen in training.
Through testing, the researchers found that these AI models did not learn universally applicable rules. Instead, they rely primarily on surface features such as color, size, speed, and shape when generating videos, and they follow a strict order of priority: color comes first, followed by size, speed, and shape.
In familiar scenarios, these models perform almost perfectly, but once they encounter unknown situations, they break down. One test in the study demonstrates the models' limitations in handling object motion. When the model was trained on fast-moving spheres travelling back and forth and was then given slow-moving spheres at test time, it showed the sphere suddenly changing direction after a few frames. This behavior is clearly visible in the accompanying videos.
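The gap between familiar and unfamiliar speeds can be illustrated with a small toy sketch. This is not the study's model or code: it assumes one-dimensional uniform motion, and the names `TRAIN_VELOCITIES`, `case_based_predict`, and `rule_based_predict` are hypothetical, chosen only for illustration. The "case-based" predictor copies the closest speed it saw in training, while the "rule-based" predictor applies the motion law to whatever speed it observes.

```python
# Toy illustration only: a stand-in for "matching training cases" versus
# "applying a physical rule". All names and numbers here are assumptions.

DT = 0.1  # time between frames (arbitrary units)

# Hypothetical training set: only fast-moving balls were ever seen.
TRAIN_VELOCITIES = [2.5, 3.0, 3.5, 4.0]


def rollout(x0, v, steps, dt=DT):
    """Positions of a ball in uniform motion: x_t = x0 + v * t * dt."""
    return [x0 + v * t * dt for t in range(steps)]


def observed_velocity(frames, dt=DT):
    """Estimate velocity from the first two conditioning frames."""
    return (frames[1] - frames[0]) / dt


def case_based_predict(frames, steps, dt=DT):
    """Continue the motion using the nearest velocity seen in training."""
    v_obs = observed_velocity(frames, dt)
    v_near = min(TRAIN_VELOCITIES, key=lambda v: abs(v - v_obs))
    return rollout(frames[0], v_near, steps, dt)


def rule_based_predict(frames, steps, dt=DT):
    """Continue the motion using the observed velocity itself."""
    return rollout(frames[0], observed_velocity(frames, dt), steps, dt)


def max_error(pred, truth):
    """Largest per-frame position error between prediction and truth."""
    return max(abs(p - q) for p, q in zip(pred, truth))


def evaluate(v_true, steps=20):
    """Return (case-based error, rule-based error) for a ball at v_true."""
    truth = rollout(0.0, v_true, steps)
    conditioning = truth[:2]  # the predictor only sees the first two frames
    return (
        max_error(case_based_predict(conditioning, steps), truth),
        max_error(rule_based_predict(conditioning, steps), truth),
    )


if __name__ == "__main__":
    print("fast ball (seen in training):", evaluate(3.0))
    print("slow ball (never seen):      ", evaluate(1.0))
```

In this sketch both predictors track the fast ball, but only the rule-based one tracks the slow ball; the case-based one drifts away because it substitutes a speed it has seen before. Real video models are far more complex, so this should be read as an analogy for the reported failure, not as an account of its mechanism.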
The researchers point out that simply enlarging the model or adding more training data does not solve the problem. Although larger models perform better on familiar patterns and combinations, they still fail to understand basic physical laws or to handle scenarios outside the scope of their training. Study co-author Kang Bingyi said: "If the data coverage is good enough in a specific scenario, it may be possible to form an overfitted world model." But such a model does not meet the definition of a real world model, because a real world model should be able to generalize beyond its training data.
Kang demonstrated this limitation on X, explaining that when the team trained the model on a fast-moving ball travelling from left to right and back again, then tested it with a slow-moving ball, the model showed the ball suddenly changing direction after only a few frames (this can be seen in the 1 minute 55 second video).
The results of this study challenge OpenAI's Sora project. OpenAI has said that Sora is expected to develop into a true world model through continued scaling, and it has even claimed that the model has a basic understanding of physical interaction and three-dimensional geometry. But the researchers point out that scaling alone is not enough for video generation models to discover basic physical laws.
Yann LeCun, Meta's chief AI scientist, has also expressed doubts, arguing that trying to predict the world by generating pixels is "a waste of time and doomed to fail." Despite this, many people are still looking forward to the release of Sora, which OpenAI first unveiled in mid-February 2024 to demonstrate its video generation potential.
Key points:
The study found that AI video generation models have major flaws in understanding physical laws and rely on surface features of the training data.
Scaling up the models does not solve the problem; they still perform poorly in unknown scenarios.
OpenAI's Sora project faces a challenge: scaling alone cannot produce a true world model.
In short, this study shows that simply scaling up cannot solve the fundamental problem of AI's understanding of physical laws, and in doing so it points to a direction for the development of AI video generation technology. Future AI models will need to learn physical principles more deeply if they are to simulate and predict the real world accurately, rather than merely imitate surface features.