Downcodes has learned that researchers from ByteDance Research and Tsinghua University recently released a study revealing major flaws in how current AI video generation models, such as OpenAI's Sora, understand physical laws. Through a series of tests, the study examined how these models perform across different scenarios and analyzed the mechanisms behind their behavior. The results warn of the limitations of current AI video generation technology and have prompted broad reflection in the industry on AI's ability to simulate reality.
The research team evaluated AI video generation models in three scenarios: prediction under known patterns, prediction under unknown patterns, and novel combinations of familiar elements. Their goal was to determine whether these models had actually learned the laws of physics or were merely relying on surface features from their training data.
Through these tests, the researchers found that the models had not learned universally applicable rules. Instead, when generating videos they rely mainly on surface features such as color, size, speed, and shape, and follow a strict order of priority: color first, then size, speed, and shape.
These models performed almost perfectly in familiar scenarios but broke down when confronted with unknown situations. One test in the study illustrates this limitation in handling object motion: a model trained on a fast sphere bouncing back and forth, then tested with a slow-moving sphere, generated video in which the sphere abruptly reversed direction after only a few frames. This failure is clearly visible in the accompanying videos.
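To make the failure mode concrete, here is a toy numerical sketch (not the paper's code; all names and values are illustrative). It contrasts a ground-truth physics step, where a ball reverses only on wall contact, with a "memorized" predictor that learned from fast-ball training data that reversals happen around a fixed frame, and so reverses a slow test ball far from any wall:

```python
# Toy illustration of the out-of-distribution failure described in the study.
# Assumption: the "model" is caricatured as a rule memorized from training
# data (fast ball, speed 4, reversal near frame 5), applied to a slow ball.

WALL_LEFT, WALL_RIGHT = 0.0, 20.0

def physics_step(x, v):
    """Ground truth: reverse velocity only on wall contact."""
    x += v
    if x <= WALL_LEFT or x >= WALL_RIGHT:
        v = -v
        x = min(max(x, WALL_LEFT), WALL_RIGHT)
    return x, v

def memorized_step(x, v, frame, reversal_frame=5):
    """Memorized surface pattern: in training, the fast ball always
    reversed around frame 5, so reverse there regardless of position."""
    if frame == reversal_frame:
        v = -v
    return x + v, v

# Slow test ball: speed 1, starting at the left wall.
x_true, v_true = 0.0, 1.0
x_pred, v_pred = 0.0, 1.0
for frame in range(1, 11):
    x_true, v_true = physics_step(x_true, v_true)
    x_pred, v_pred = memorized_step(x_pred, v_pred, frame)

print(v_true, v_pred)  # 1.0 -1.0: the real ball keeps moving right,
                       # but the memorized predictor reversed mid-air.
```

The point of the caricature is that a predictor keyed to surface statistics of the training distribution (how long until a reversal) rather than to the underlying rule (reverse on contact) can look perfect in-distribution and still fail as soon as the speed changes.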
The researchers point out that simply scaling up the model or adding more training data will not solve the problem. While larger models perform better on familiar patterns and combinations, they still fail to understand basic physics or handle scenarios beyond their training range. Study co-author Bingyi Kang noted: "If the data coverage is good enough in a specific scenario, an overfitting world model may be formed." But such a model does not meet the definition of a true world model, because a true world model should be able to generalize beyond its training data.
Kang demonstrated this limitation on X with the same ball experiment: trained on a fast ball moving left to right and back, then tested with a slow ball, the model shows the ball suddenly changing direction after just a few frames (visible in the video at 1 minute 55 seconds).
The findings pose a challenge to OpenAI's Sora project. OpenAI has said that Sora is expected to evolve into a true world model through continued scaling, and has even claimed that it already has a basic understanding of physical interactions and three-dimensional geometry. But the researchers counter that scaling alone is not enough for video generation models to discover fundamental physical laws.
Meta's AI chief Yann LeCun has also expressed skepticism, saying that predicting the world by generating pixels is "a waste of time and doomed to failure." Even so, many still expect OpenAI to release Sora as scheduled in mid-February 2024 to demonstrate its potential for video generation.
This research offers direction for the field of AI video generation, and it is also a reminder that evaluating AI capabilities cannot stop at surface-level results; it must probe the underlying mechanisms and limitations. How to make AI truly understand and simulate the physical world remains a major challenge.