OpenAI's Sora video generation model has attracted widespread attention. Its video generation capabilities are impressive, but the source of its training data has remained opaque. This article examines Sora's likely training data sources, in particular the role played by game livestreams and walkthrough videos, and analyzes the legal issues that may arise and the implications for future AI development.
OpenAI's video generation tool Sora has attracted much attention since its launch, but where it learned from has remained a mystery. Now part of that mystery appears to have been solved: Sora's training data likely includes large numbers of game livestreams and walkthrough videos from Twitch!
Sora is like a skilled "master of imitation": given only text prompts or images, it can generate videos up to 20 seconds long and control multiple aspect ratios and resolutions. In February of this year, when OpenAI first unveiled Sora, it hinted at the model's "training regimen" with a video of "Minecraft". So, besides "Minecraft", what other games might be hiding in Sora's bag of tricks?
The results are surprising: Sora seems familiar with a variety of game genres. It can generate a clone game video with clear shades of "Mario", albeit with some minor flaws; it can simulate a tense first-person shooter sequence that looks like a cross between "Call of Duty" and "Counter-Strike"; and it can recreate the fighting scenes of the 1990s "Teenage Mutant Ninja Turtles" arcade game, evoking vivid childhood memories.
Even more surprising, Sora is clearly familiar with the format of Twitch livestreams, which implies it has "watched" a large amount of streaming content. Screenshots of Sora-generated video not only accurately capture the frame layout of a livestream, but also vividly reproduce the likeness of well-known streamer Auronplay, down to the tattoo on his left arm.
Not only that, Sora also "knows" another Twitch streamer, Pokimane, and generated a video of a character who closely resembles her. To avoid copyright issues, however, OpenAI has put a filtering mechanism in place to prevent Sora from generating videos containing trademarked characters.
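OpenAI has not disclosed how this filtering works. A common approach to this kind of safeguard is to screen prompts for known protected names before generation begins; the sketch below is a purely hypothetical illustration of that idea, with an invented blocklist, and is not OpenAI's actual mechanism.

```python
# Hypothetical prompt-level trademark filter (illustrative only).
# The blocklist and logic are assumptions, not OpenAI's real system.

BLOCKED_TERMS = {"mario", "pikachu", "master chief"}  # invented example list

def is_prompt_allowed(prompt: str) -> bool:
    """Return False if the prompt mentions a blocked trademarked name."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

print(is_prompt_allowed("a plumber jumping over pipes"))  # True
print(is_prompt_allowed("Mario collecting coins"))        # False
```

Real systems are likely far more sophisticated (e.g. classifiers that also inspect the generated frames), but even this simple check shows why such filters are imperfect: "a plumber jumping over pipes" passes while still steering the model toward trademarked imagery.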
Although OpenAI is tight-lipped about the sources of its training data, there are signs that game content is very likely included in Sora's training set. In a March interview with the Wall Street Journal, Mira Murati, former CTO of OpenAI, did not directly deny that Sora was trained on content from YouTube, Instagram, and Facebook. OpenAI also acknowledges in Sora's technical documentation that it uses "publicly available" data as well as licensed data from media libraries such as Shutterstock.
If game content was indeed used to train Sora, it could trigger a series of legal issues, especially as OpenAI builds more interactive experiences on top of Sora. Joshua Weigensberg, an intellectual property lawyer at Pryor Cashman, pointed out that using game videos for AI training without authorization carries huge risks: training an AI model typically requires copying the training data, and game videos contain large amounts of copyright-protected content.
Generative AI models such as Sora are probabilistic: they learn patterns from large amounts of data and use those patterns to make predictions. This is what lets them "learn" how the world works, but it also carries a hidden danger: under certain prompts, a model may generate content strikingly similar to its training data. This has angered creators, who argue their works were used for training without permission.
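The memorization risk described above can be illustrated with a toy model. The sketch below trains a tiny bigram model on a single invented sentence and samples from it: because the learned statistics come from so little data, the sampled output can only recombine fragments of the training text, sometimes reproducing it nearly verbatim. This is a deliberate simplification; Sora's architecture is vastly more complex, but the probabilistic principle is the same.

```python
import random
from collections import Counter, defaultdict

# Toy bigram model: count which word follows which in the training
# text, then sample continuations from those learned frequencies.
training_text = "the streamer plays the game and the streamer wins"
tokens = training_text.split()

counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def sample_next(word: str, rng: random.Random) -> str:
    """Sample a successor word according to the learned bigram counts."""
    candidates = counts[word]
    total = sum(candidates.values())
    weights = [c / total for c in candidates.values()]
    return rng.choices(list(candidates), weights)[0]

rng = random.Random(0)
word, out = "the", ["the"]
for _ in range(5):
    if not counts[word]:  # no observed successor: stop generating
        break
    word = sample_next(word, rng)
    out.append(word)
print(" ".join(out))
```

Every adjacent word pair in the output was seen in training, so with sparse data the "generated" text is largely regurgitated training data, which is precisely what creators object to at scale.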
Microsoft and OpenAI are currently being sued over AI tools that allegedly reproduce licensed code. AI art companies such as Midjourney, Runway, and Stability AI have also faced accusations of infringing artists' rights, and major music companies have filed lawsuits against Udio and Suno, startups that develop AI song generators.
Many AI companies have long invoked the "fair use" doctrine, arguing that their models create "transformative" works rather than plagiarism. But game content has its own complications. Evan Everist, a copyright attorney at Dorsey & Whitney, pointed out that gaming videos involve at least two layers of copyright protection: the copyright in the game's content, owned by the game developer, and the copyright in the unique video created by the player or video producer. For some games there may also be a third layer of rights: copyright in user-generated content.
Fortnite, for example, allows players to create their own game maps and share them with others. A gaming video featuring those maps involves at least three copyright owners: Epic, the player, and the map creator. If courts determine that training AI models creates copyright liability, each of these copyright owners could become a potential plaintiff or licensing source.
In addition, Weigensberg noted that games themselves contain many "protectable" elements, such as proprietary textures, that judges may weigh in intellectual property litigation.
So far, game studios and publishers including Epic, Microsoft (which owns Minecraft), Ubisoft, Nintendo, Roblox, and Cyberpunk 2077 developer CD Projekt Red have not commented on the matter.
Even if AI companies win these legal disputes, users may not be shielded from liability. If a generative model reproduces a copyrighted work, the person who publishes that output or incorporates it into other projects may still be held liable for intellectual property infringement.
Some AI companies have indemnification clauses to cover such situations, but they usually come with exceptions: OpenAI's terms, for example, apply only to enterprise customers, not individual users. Beyond copyright, there are also risks such as trademark violations, for instance when output contains assets used for marketing and branding, including in-game characters.
As interest in world models grows, the situation may become even more complex. One application of world models is generating playable video games, which could raise legal issues if these "synthetic" games are too similar to the content the model was trained on.
Avery Williams, an intellectual property trial lawyer at McKool Smith, pointed out that using elements from games such as voices, movements, characters, songs, dialogue, and artwork to train an AI platform raises copyright infringement concerns. The questions about "fair use" raised in the many lawsuits against generative AI companies will affect the video game industry just as they do other creative markets.
Sora's success highlights the huge potential of generative AI in content creation, but it also exposes serious challenges around data usage and intellectual property. Balancing technological innovation with intellectual property protection will be a key issue in AI's future development. Going forward, the sources and legality of AI training data will face stricter scrutiny, with profound implications for the AI industry.