Zhipu AI has open-sourced its video generation model CogVideoX, a move aimed at accelerating the development and broad adoption of video generation technology. The model is efficient: the CogVideoX-2B version can run inference on a single RTX 4090 GPU and be fine-tuned on a single A6000 GPU, which greatly lowers the barrier to entry and makes the model practical for commercial use. CogVideoX is built on a 3D variational autoencoder (3D VAE) combined with an expert Transformer, enabling it to generate high-quality video content; to support training, Zhipu AI also addressed the shortage of text descriptions for video data and strictly screened the video data, ensuring the quality of the training set.
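As a rough illustration of what single-GPU inference looks like, here is a minimal sketch using the Hugging Face diffusers integration of CogVideoX. The prompt and generation settings are illustrative assumptions, not official defaults; consult the repository linked below for authoritative usage:

```python
# Minimal text-to-video inference sketch for CogVideoX-2b.
# Assumes a diffusers version with CogVideoX support; the prompt
# and sampling settings here are illustrative, not official defaults.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # helps fit inference on a single 4090

video = pipe(
    prompt="A panda playing guitar in a bamboo forest",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```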
CogVideoX uses a 3D variational autoencoder (3D VAE) that compresses a video along both the spatial and temporal dimensions via three-dimensional convolutions, achieving a higher compression rate and better reconstruction quality. The architecture consists of an encoder, a decoder, and a latent-space regularizer, with temporal causal convolutions ensuring that information flows causally along the time axis. An expert Transformer then processes the encoded video latents together with the text input to generate high-quality video content.
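To make the causality constraint concrete, the sketch below shows one common way a temporally causal 3D convolution can be implemented: the input is padded only on the "past" side of the time axis, so each output frame depends only on the current and earlier frames. This is an illustrative PyTorch example, not the model's actual code:

```python
# Sketch of a temporally causal 3D convolution (PyTorch).
# Padding is applied only on the past side of the time axis,
# so frame t never sees information from frames > t.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel_size
        self.time_pad = kt - 1                            # past frames only
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size)

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        # F.pad order: (w_left, w_right, h_top, h_bottom, t_front, t_back)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

x = torch.randn(1, 3, 8, 32, 32)   # an 8-frame RGB clip
y = CausalConv3d(3, 16)(x)
print(y.shape)                     # torch.Size([1, 16, 8, 32, 32])
```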
To train CogVideoX, Zhipu AI developed a set of methods for screening high-quality video data, filtering out videos with problems such as excessive editing and incoherent motion, thereby ensuring the quality of the training data. At the same time, the shortage of text descriptions for video data is addressed with a pipeline that generates video captions from image captions.
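A pipeline of this shape might caption sampled frames with an off-the-shelf image captioner and then merge the frame captions into one video description. The sketch below is a hypothetical illustration; the captioner model and the merging step are placeholders, not the components Zhipu AI actually used:

```python
# Hedged sketch of a video-captioning pipeline built from image captions:
# sample frames, caption each one, then merge the captions. The model
# name and the naive merging step are placeholders; in practice a
# language model would summarize the frame captions into one fluent text.
from transformers import pipeline

frame_captioner = pipeline(
    "image-to-text", model="Salesforce/blip-image-captioning-base"
)

def caption_video(frames):
    """frames: list of PIL.Image objects sampled uniformly from the video."""
    captions = [frame_captioner(f)[0]["generated_text"] for f in frames]
    unique = list(dict.fromkeys(captions))  # deduplicate, keep order
    return "; ".join(unique)
```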
In performance evaluations, CogVideoX scores well on multiple metrics, including human actions, scenes, and degree of motion, as well as on evaluation tools focused on the dynamic characteristics of video. Zhipu AI says it will continue to explore innovations in video generation, including new model architectures, video information compression, and the fusion of text and video content.
Code repository:
https://github.com/THUDM/CogVideo
Model download:
https://huggingface.co/THUDM/CogVideoX-2b
Technical report:
https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf
The open-sourcing of CogVideoX provides valuable resources for research on video generation and signals a new wave of development in the field. Its efficiency and ease of use should draw more developers into video generation and promote its widespread application across industries. We look forward to more breakthroughs from Zhipu AI in this field!