The editor of Downcodes learned that Zhipu AI recently announced an open-source upgrade of its CogVLM2-Video model. The model marks a major step forward in video understanding, addressing the weakness of existing models in handling temporal information. By feeding in multiple video frames together with timestamp information, and by using an automated temporal grounding data construction method, CogVLM2-Video delivers strong performance in video captioning and temporal grounding, providing a powerful tool for tasks such as video generation and video summarization. The model achieves state-of-the-art results on public video understanding benchmarks, and its efficient automated data generation pipeline also reduces the cost of model training.
Zhipu AI has announced an open-source upgrade of CogVLM2-Video, a model that has made significant progress in video understanding. CogVLM2-Video overcomes a key limitation of existing video understanding models, the loss of temporal information, by feeding multiple video frames together with their timestamps into the encoder. Using an automated temporal grounding data construction method, the team generated 30,000 time-related video question-answer examples and trained a model that reaches state-of-the-art performance on public video understanding benchmarks. CogVLM2-Video excels at video captioning and temporal grounding, providing a powerful tool for tasks such as video generation and video summarization.
CogVLM2-Video extracts frames from the input video and annotates each frame with its timestamp, so that the language model knows exactly when each frame occurs in the video, enabling temporal grounding and time-related question answering.
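The announcement does not spell out the exact preprocessing pipeline, but the idea can be illustrated with a minimal sketch: sample a fixed number of frames and record each frame's timestamp, so both can be handed to the model together. The function name, frame count, and the use of OpenCV below are illustrative assumptions, not the project's actual code.

```python
import cv2  # pip install opencv-python

def sample_frames_with_timestamps(video_path: str, num_frames: int = 24):
    """Uniformly sample frames from a video and pair each frame
    with its timestamp in seconds (illustrative sketch)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    indices = [int(i * total / num_frames) for i in range(num_frames)]

    frames, timestamps = [], []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        timestamps.append(idx / fps)  # seconds since the start of the video
    cap.release()
    return frames, timestamps

# The (frame, timestamp) pairs would then be passed to the vision encoder,
# with the timestamps exposed to the language model (e.g. as text).
frames, ts = sample_frames_with_timestamps("example.mp4")  # placeholder path
print([f"{t:.1f}s" for t in ts])
```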
To support large-scale training, the team developed an automated video question-answer data generation pipeline that combines image understanding models with large language models, reducing annotation costs and improving data quality. The resulting Temporal Grounding Question and Answer (TQA) dataset contains 30,000 records, providing rich temporal grounding supervision for model training.
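The article describes this pipeline only at a high level. The sketch below shows roughly how such a pipeline could be wired together; `caption_frame` and `llm_generate_qa` stand in for an image understanding model and a large language model and are hypothetical placeholders, not the actual components used by the team.

```python
import json
from typing import Callable, List, Tuple

def build_tqa_records(
    video_id: str,
    frames_with_ts: List[Tuple[object, float]],
    caption_frame: Callable[[object], str],        # hypothetical image-understanding model
    llm_generate_qa: Callable[[str], List[dict]],  # hypothetical LLM call
) -> List[dict]:
    """Turn timestamped frames into temporal-grounding QA records (sketch)."""
    # 1. Caption each sampled frame and prefix it with its timestamp.
    timeline = "\n".join(
        f"[{ts:.1f}s] {caption_frame(frame)}" for frame, ts in frames_with_ts
    )
    # 2. Ask an LLM to write questions whose answers are points in time,
    #    e.g. "When does the dog jump into the pool?" -> "12.5s".
    prompt = (
        "Given this timestamped description of a video, write question-answer "
        "pairs where each answer is a point in time:\n" + timeline
    )
    qa_pairs = llm_generate_qa(prompt)
    # 3. Package the pairs as dataset records.
    return [
        {"video": video_id, "question": qa["question"], "answer_time": qa["answer_time"]}
        for qa in qa_pairs
    ]

# Records from many videos would be accumulated and written out, e.g.:
# with open("tqa.jsonl", "w") as f:
#     for rec in records:
#         f.write(json.dumps(rec) + "\n")
```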
CogVLM2-Video demonstrates strong performance across multiple public evaluation sets, including VideoChatGPT-Bench, zero-shot video QA benchmarks, and MVBench.
Code: https://github.com/THUDM/CogVLM2
Project website: https://cogvlm2-video.github.io
Online trial: http://36.103.203.44:7868/
All in all, the open-source upgrade of the CogVLM2-Video model brings new possibilities to the field of video understanding, and its efficiency and accuracy should drive related technologies forward. Interested developers can visit the links above to explore and try it out. The editor of Downcodes looks forward to more innovative applications built on this model!