The editor of Downcodes learned that Zhipu AI has recently open-sourced its new video generation model, CogVideoX-5B. Compared with the previous-generation CogVideoX-2B, it delivers better video generation quality and visual effects along with much-improved inference performance. Even an early GTX 1080 Ti can run the previous-generation model, while mainstream cards such as the RTX 3060 can handle CogVideoX-5B with ease. This further lowers the threshold for high-quality video generation, giving more developers and users a convenient and efficient option.
Zhipu AI recently open-sourced a new video generation model, CogVideoX-5B. It not only surpasses the previous-generation CogVideoX-2B in video generation quality and visual effects, but its inference performance has also been greatly improved: an early GTX 1080 Ti can run the previous-generation model, and mainstream "sweet-spot" desktop graphics cards such as the RTX 3060 can easily handle the new model.
Detailed parameter comparison between CogVideoX-5B and CogVideoX-2B:
CogVideoX-5B is a large-scale DiT (diffusion transformer) model designed for text-to-video generation. The technology behind it includes a 3D causal variational autoencoder (3D causal VAE), which achieves efficient video reconstruction by compressing video data into a latent space and decoding it along the temporal dimension.
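To make the idea of compressing video into a latent space more concrete, here is a minimal sketch of the shape arithmetic only. It is not the model's code: the function names are hypothetical, and the compression factors (4x in time, 8x in each spatial direction) and the 49-frame, 480x720 clip are illustrative assumptions rather than figures stated in this article. The sketch also assumes a causal layout in which the first frame is encoded on its own.

```python
def causal_vae_latent_shape(frames, height, width,
                            temporal_factor=4, spatial_factor=8):
    """Return (latent_frames, latent_height, latent_width).

    Assumption: the first frame is encoded alone, and every following
    group of `temporal_factor` frames becomes one latent frame.
    The default factors are illustrative, not taken from the article.
    """
    if (frames - 1) % temporal_factor != 0:
        raise ValueError("frames must equal 1 + k * temporal_factor")
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by spatial_factor")
    latent_frames = 1 + (frames - 1) // temporal_factor
    return latent_frames, height // spatial_factor, width // spatial_factor


def position_reduction(frames, height, width, **factors):
    """Ratio of pixel positions to latent positions (channels ignored)."""
    t, h, w = causal_vae_latent_shape(frames, height, width, **factors)
    return (frames * height * width) / (t * h * w)


if __name__ == "__main__":
    print(causal_vae_latent_shape(49, 480, 720))  # (13, 60, 90)
    print(round(position_reduction(49, 480, 720)))
```

Under these assumptions the transformer would work on a much smaller grid of latent positions than the raw pixel grid, which is where the efficiency of working in latent space comes from.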
The model also uses an Expert Transformer that combines text embeddings and video embeddings, uses 3D-RoPE as the positional encoding, normalizes the data of the two modalities through an expert adaptive layer, and applies a 3D full attention mechanism to model space and time jointly.
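As a rough illustration of what a 3D rotary position encoding can look like, the sketch below applies an ordinary 1D rotary embedding separately to three channel groups, one each for the time, height and width coordinates of a video token. This is a generic construction under stated assumptions, not the CogVideoX implementation: the function names, the channel split `dims` and the frequency base are all hypothetical.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    if len(vec) % 2 != 0:
        raise ValueError("channel count must be even")
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = pos * base ** (-i / half)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def rope_3d(vec, t, h, w, dims):
    """Apply 1D RoPE to three channel groups, one per axis (time, height, width).

    `dims` = (d_t, d_h, d_w) is a hypothetical channel split; each entry
    must be even and together they must cover the whole vector.
    """
    d_t, d_h, d_w = dims
    if d_t + d_h + d_w != len(vec):
        raise ValueError("dims must sum to the vector length")
    return (rope_1d(vec[:d_t], t)
            + rope_1d(vec[d_t:d_t + d_h], h)
            + rope_1d(vec[d_t + d_h:], w))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 0.9]
    k = [1.0, 0.3, -0.6, 0.8, 0.2, 1.1, -1.2, 0.4]
    dims = (2, 2, 4)
    # Both pairs of positions differ by the same offset (1, 1, 2).
    near = dot(rope_3d(q, 1, 2, 3, dims), rope_3d(k, 0, 1, 1, dims))
    far = dot(rope_3d(q, 5, 7, 9, dims), rope_3d(k, 4, 6, 7, dims))
    print(abs(near - far) < 1e-9)
```

The useful property of this construction is that the query-key dot product depends only on the relative offset between two tokens along each axis, so attention can reason about distance in space and time without absolute position embeddings.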
Code: https://top.aibase.com/tool/cogvideox
Model download: https://huggingface.co/THUDM/CogVideoX-5b
Paper link: https://arxiv.org/pdf/2408.06072
The open-sourcing of CogVideoX-5B brings a new advance to the field of video generation, lowers the technical threshold, and provides a solid foundation for future research and applications. The editor of Downcodes believes it will promote the further development of video generation technology and bring innovative applications to more fields.