Since the birth of Sora ushered in a new era of AI video, major players at home and abroad have joined the AI video race one after another. But as video becomes more interactive and immersive, how can the industry address the challenges of cost, quality, and performance?
On October 15, Volcano Engine and Intel jointly released a video preprocessing solution for large model training at the Video Cloud Technology Conference. A reporter from "Daily Economic News" learned at the event that the solution has already been applied to the Doubao video generation model.
At the event, Li Hang, head of ByteDance Research, said that the Doubao video generation model PixelDance adopted Volcano Engine's video preprocessing solution for large model training during training, making full use of large amounts of tidal resources and providing strong support for model training.
In addition, Wang Yue, head of video architecture at Douyin Group, revealed the latest progress on ByteDance's self-developed video codec chip: as verified in Douyin Group's internal practice, the chip cuts costs by more than 95% at the same video compression efficiency.
"First, ultra-large-scale video training datasets have led to a surge in computing and processing costs," Wang Yue said, pointing out that large model makers face many challenges in preprocessing. "Second, the video sample data is uneven. Third, there are many processing stages and the engineering is complex. Finally, there is the scheduling and deployment of multiple heterogeneous computing resources such as GPU, CPU, and ARM."
Self-developed multimedia processing framework
At the Volcano Engine AI Innovation Tour on September 24, two Doubao video generation models, PixelDance and Seaweed, were released together, drawing attention from inside and outside the industry. In fact, ByteDance's efforts in video generation models do not stop there.
On October 15, Volcano Engine released its video preprocessing solution for large model training, which aims to address the cost, quality, and performance challenges of training large video models.
According to reports, preprocessing training videos is an important prerequisite for effective large model training. Preprocessing unifies the video data format, improves data quality, standardizes the data, reduces data volume, and processes annotation information, so that the model can learn the features and knowledge in the videos more efficiently, improving both training results and efficiency.
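The preprocessing steps listed above can be illustrated with a minimal sketch. It is not Volcano Engine's implementation: the `Clip` record, the thresholds, and the function names are all hypothetical, and the code works on clip metadata only rather than on real video files.

```python
from dataclasses import dataclass, replace

# Hypothetical metadata record for one training clip (not a real API).
@dataclass(frozen=True)
class Clip:
    clip_id: str
    content_hash: str   # assumed to be computed upstream from the video bytes
    height: int         # vertical resolution in pixels
    fps: float
    duration_s: float
    caption: str        # annotation text attached to the clip

# Illustrative thresholds; real pipelines would tune these.
TARGET_FPS = 24.0
MIN_HEIGHT = 480
MIN_DURATION_S = 2.0

def is_usable(clip: Clip) -> bool:
    """Quality filter: drop clips that are too small or too short."""
    return clip.height >= MIN_HEIGHT and clip.duration_s >= MIN_DURATION_S

def standardize(clip: Clip) -> Clip:
    """Record a unified target frame rate and tidy the caption whitespace.
    Actual frame-rate conversion would be done by a transcoder."""
    return replace(clip, fps=TARGET_FPS, caption=" ".join(clip.caption.split()))

def preprocess(clips: list[Clip]) -> list[Clip]:
    """Filter, deduplicate by content hash, then standardize."""
    seen: set[str] = set()
    kept: list[Clip] = []
    for clip in clips:
        if not is_usable(clip) or clip.content_hash in seen:
            continue  # removing bad or duplicate clips reduces data volume
        seen.add(clip.content_hash)
        kept.append(standardize(clip))
    return kept

raw = [
    Clip("a", "h1", 1080, 30.0, 5.0, "  a dog   runs "),
    Clip("b", "h1", 1080, 30.0, 5.0, "duplicate of a"),
    Clip("c", "h2", 240, 25.0, 8.0, "low resolution"),
    Clip("d", "h3", 720, 60.0, 1.0, "too short"),
]
cleaned = preprocess(raw)
print([c.clip_id for c in cleaned])
```

In this toy run only clip "a" survives: "b" is a duplicate, "c" is below the resolution threshold, and "d" is too short.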
In the training of video generation models, computing power cost is undoubtedly the number one challenge.
An algorithm engineer working on a domestic video generation model said in an interview with a reporter from "Daily Economic News" that, given high-quality data, video models will be more difficult to train than large language models and will require more computing power. "At present, the known open-source video models are not particularly large, mainly because many video models are still at a stage where they do not know how to use data, and there is not much high-quality data (for training)."
Research by computer scientist Matthias Plappert also shows that training Sora requires enormous computing power: about one month of training on 4,200 to 10,500 Nvidia H100s. Once the model is trained and reaches the inference stage, computing costs will quickly exceed those of training.
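To put that estimate in perspective, the figures can be converted into GPU-hours. The short calculation below is only a back-of-the-envelope sketch: it assumes a 30-day month of continuous use, which the estimate itself does not specify.

```python
# Assumption: "about 1 month" is treated as 30 days of round-the-clock use.
HOURS_PER_MONTH = 30 * 24

def gpu_hours(num_gpus: int, hours: int = HOURS_PER_MONTH) -> int:
    """Total GPU-hours for num_gpus running continuously for the given hours."""
    return num_gpus * hours

low = gpu_hours(4_200)    # lower bound of the estimate
high = gpu_hours(10_500)  # upper bound of the estimate
print(f"{low:,} to {high:,} H100 GPU-hours")
# prints "3,024,000 to 7,560,000 H100 GPU-hours"
```

Under that assumption, the training run would consume roughly 3 million to 7.6 million H100 GPU-hours.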
To reduce costs, Volcano Engine draws on Intel CPUs and other resources and builds its video preprocessing solution for large model training on its self-developed multimedia processing framework. Wang Yue said that the solution has also been optimized at the algorithm and engineering levels, and can perform high-quality preprocessing of massive volumes of video data, achieve efficient coordination among processing stages in a short time, and improve model training efficiency.
Regarding the application of this solution, Li Hang revealed at the event that the Doubao video generation model PixelDance adopted it during training. At the same time, the video-on-demand solution provided by the Volcano Engine Video Cloud team offers a one-stop service covering the entire life cycle of videos produced by PixelDance, from editing and uploading to transcoding, distribution, and playback, supporting the commercial application of the model.
In addition, at this conference, Volcano Engine also released a cross-language simultaneous live broadcast solution, a multi-modal video understanding and generation solution, a conversational AI real-time interaction solution, and an AIG3D & large scene reconstruction solution. From the production end of video to the interaction end and the consumption end, AI capabilities are integrated across the entire pipeline.
Where is AI video headed?
AI is reshaping every aspect of how people produce, disseminate, and receive information. In particular, emerging video technologies are taking people from a digital world of smooth, high-definition playback into an AI world of smarter, more interactive experiences.
In July this year, SenseTime launched Vimi, the first large controllable character video generation model for consumer (C-end) users; in August, MiniMax released the video generation model video-1; in September, Keling AI completed its ninth iteration and released the "Keling 1.5" model, Alibaba Cloud launched a new video generation model at the Yunqi Conference, and ByteDance also released two video generation models. New AI video products are now launched and iterated on an almost monthly basis.
Regarding the "explosion" of AI video products, Wang Peng, an associate researcher at the Beijing Academy of Social Sciences, said in an interview with a reporter from "Daily Economic News" that domestic AI video products are in a stage of rapid development and continuous iteration, mainly because of strong market demand, a wide range of application scenarios, and diverse commercialization models.
At present, AI video products on the market are mostly deployed in film and television, e-commerce marketing, and other fields. For example, in July this year, Jimeng AI and Bona Pictures jointly launched the country's first AIGC-generated continuous-narrative science fiction short series, "Sanxingdui: Future Enlightenment Record"; in September this year, Kuaishou teamed up with nine well-known directors, including Jia Zhangke and Li Shaohong, to launch the "Keling AI" director co-creation project.
Pan Helin, a member of the Information and Communication Economy Expert Committee of the Ministry of Industry and Information Technology, told the reporter from "Daily Economic News" that some AI video products are still in the introduction stage and are difficult to roll out in the market because of technical or compliance constraints. "Currently, it feels like open-source (AI video products) are more popular than closed-source ones, because the cost of AI video generation is high and video producers often lack funds, so using open-source AI algorithms downloaded to the terminal is a better way for them to produce and generate videos."
In his view, AI video products at this stage face two main obstacles: computing power and compliance risks. "Algorithms, computing power, and data all require enterprises to invest more resources and time; another difficulty lies in compliance risks. More and more attention is now paid to privacy. Compliance is an unavoidable topic, and AI videos may sometimes infringe on personal privacy," he explained.
In addition, Chen Chen, a research partner at Analysys, also expressed concerns about the short-term monetization ability of large video generation models in an interview with a reporter from "Daily Economic News". "Because the training and inference costs of large AI models are high, and C-end users' demand for AI tools is relatively scattered and their willingness to pay is insufficient, the commercialization of large video models in the C-end market will still face a long cultivation period."
The era of AI video has arrived, but how to reduce costs, increase efficiency, and win more markets will also become an important proposition faced by major Internet companies and technology companies.