Our series of works: [MMStar] [ShareGPT4V] [ShareGPT4Omni]
Official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
Here is a video that introduces ShareGPT4Video:
[2024/10/1] ShareGPT4Video was accepted by NeurIPS 2024 D&B track!
[2024/7/1] The batch-inference code for ShareCaptioner-Video is available now!
[2024/6/11] The web demo and local demo of ShareCaptioner-Video are available now!
[2024/6/11] The web demo and local demo of ShareGPT4Video-8B are available now!
[2024/6/7] Our paper was featured in HuggingFace Daily Papers and ranked 1st on June 7.
[2024/5/27] The ShareGPT4Video-8B model is released!
[2024/5/26] The ShareGPT4Video dataset and project page are released!
You can use our ShareGPT4Video model directly to chat about your own video with the following command:
python run.py --model-path Lin-Chen/sharegpt4video-8b --video examples/yoga.mp4 --query "Describe this video in detail."
Alternatively, you can launch a local demo of ShareGPT4Video-8B with the following command:
python app.py
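If you want to call the run.py command above from a script (for example, to loop over several videos), the sketch below builds the argument list in Python. It only uses the flags shown in this README (`--model-path`, `--video`, `--query`); the helper name `build_run_command` is hypothetical, and it assumes run.py accepts the query as a single string argument.

```python
import shlex
import subprocess
import sys


def build_run_command(video_path, query, model_path="Lin-Chen/sharegpt4video-8b"):
    """Return the argv list for one run.py invocation (hypothetical helper)."""
    return [
        sys.executable, "run.py",
        "--model-path", model_path,
        "--video", video_path,
        # The query is a single list element, so its spaces are preserved
        # without any manual shell quoting.
        "--query", query,
    ]


if __name__ == "__main__":
    cmd = build_run_command("examples/yoga.mp4", "Describe this video in detail.")
    # Print a shell-safe rendering of the command, e.g. for logging.
    print(shlex.join(cmd))
    # Uncomment to actually launch inference from the repository root:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids the shell splitting a multi-word query into separate arguments.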
You can launch a local demo of ShareCaptioner-Video with the following commands:
cd captioner
python app.py
git clone https://github.com/ShareGPT4Omni/ShareGPT4Video
conda create -n share4video python=3.10 -y
conda activate share4video
cd ShareGPT4Video
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
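After the installation steps above, a quick sanity check can confirm that the key packages are visible in the `share4video` environment. This is a minimal sketch, not part of the repository: the helper `check_packages` is hypothetical, and the module names checked (`torch`, `flash_attn`) are assumptions about what the install pulls in.

```python
import importlib.util


def check_packages(names):
    """Map each top-level module name to True if it can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed module names: "torch" as a core dependency, and "flash_attn",
    # which is the module installed by the flash-attn package.
    for name, found in check_packages(["torch", "flash_attn"]).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

The check only looks the modules up; it does not import them, so it runs quickly and does not require a GPU.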
To validate the effectiveness of high-quality video captions in improving LVLMs' comprehension capabilities, we choose the VideoLLaVA and LLaMA-VID models as our baselines. The SFT data used for both models is LLaVA-mix665K image data plus VideoChatGPT-100K video data. We replace the 28K caption samples in VideoChatGPT-100K with 28K high-quality caption samples from ShareGPT4Video. Below, we take VideoLLaVA as the example.
First, follow the instructions in VideoLLaVA to prepare the images and videos. Then download the 28K videos used in ShareGPT4Video from HuggingFace (only the bdd100k, ego4d, and panda sources are involved).
Finally, specify the llava_v1_5_mix665k_with_video_chatgpt72k_share4video28k.json file in finetune.sh and run SFT to reproduce the results in the paper.
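The data mix described above (image data, plus video data with its caption samples swapped for ShareGPT4Video captions) can be illustrated with a small sketch. This is not the authors' preprocessing script; the provided JSON file already contains the mixed data. The helper `replace_captions`, the `id` field, and the way caption samples are identified are all assumptions made for illustration.

```python
def replace_captions(video_samples, caption_ids, new_caption_samples):
    """Drop samples whose id is in caption_ids, then append the replacements.

    Assumes each sample is a dict with an "id" key (hypothetical schema).
    """
    kept = [s for s in video_samples if s.get("id") not in caption_ids]
    return kept + list(new_caption_samples)


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the real datasets.
    image_data = [{"id": "img0"}]                          # LLaVA-mix665K
    video_data = [{"id": "v0"}, {"id": "v1"}, {"id": "v2"}]  # VideoChatGPT-100K
    old_caption_ids = {"v1"}                               # caption samples to replace
    share4video = [{"id": "s0"}]                           # ShareGPT4Video captions

    mixed = image_data + replace_captions(video_data, old_caption_ids, share4video)
    print([s["id"] for s in mixed])
```

Keeping the non-caption video samples untouched means the only difference between the baseline and the new mix is the caption quality.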
If you find our work helpful for your research, please consider giving us a star and citing our work:
@article{chen2024sharegpt4video,
title={ShareGPT4Video: Improving Video Understanding and Generation with Better Captions},
author={Chen, Lin and Wei, Xilin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Lin, Bin and Tang, Zhenyu and others},
journal={arXiv preprint arXiv:2406.04325},
year={2024}
}
@article{chen2023sharegpt4v,
title={ShareGPT4V: Improving Large Multi-Modal Models with Better Captions},
author={Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and He, Conghui and Wang, Jiaqi and Zhao, Feng and Lin, Dahua},
journal={arXiv preprint arXiv:2311.12793},
year={2023}
}
@article{chen2024we,
title={Are We on the Right Way for Evaluating Large Vision-Language Models?},
author={Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Wang, Jiaqi and Qiao, Yu and Lin, Dahua and others},
journal={arXiv preprint arXiv:2403.20330},
year={2024}
}