Youku-mPLUG: A 10 Million Large-scale Chinese Video-Text Dataset

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks. Download link: here

Paper

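What is Youku-mPLUG?

We release Youku-mPLUG, the largest public high-quality Chinese video-language dataset (10 million videos). It was collected from Youku, a well-known Chinese video-sharing website, under strict criteria for safety, diversity, and quality.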

Figure: Examples of video clips and titles in the proposed Youku-mPLUG dataset.

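We provide 3 downstream multimodal video benchmark datasets to measure the capabilities of pre-trained models. The 3 tasks are:

Video Category Prediction: given a video and its title, predict the category of the video.
Video-Text Retrieval: given a set of videos and a set of texts, use a video to retrieve the matching text and a text to retrieve the matching video.
Video Captioning: given a video, describe its content.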
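
To make the first two tasks concrete, the sketch below computes top-k accuracy for category prediction and Recall@K for video-text retrieval from raw scores. It is illustrative only: the function names, the toy scores, and the assumption that text i is paired with video i are not taken from the Youku-mPLUG code, and the benchmark's own evaluation scripts may differ.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def recall_at_k(similarity, k=1):
    """Recall@K, assuming query i matches candidate i (diagonal ground truth)."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap rows and columns so video-to-text can reuse recall_at_k."""
    return [list(col) for col in zip(*matrix)]


# Toy text-to-video similarity matrix: rows are texts, columns are videos.
text_to_video = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.3, 0.7],
]
t2v_r1 = recall_at_k(text_to_video, k=1)             # text query, video candidates
v2t_r1 = recall_at_k(transpose(text_to_video), k=1)  # video query, text candidates
```

Transposing the matrix lets one function serve both retrieval directions, which keeps the two numbers directly comparable.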

Figure: Examples from the Youku-mPLUG downstream datasets.

Data statistics

The dataset contains 10 million high-quality videos in total, distributed across 20 super categories and 45 categories.

Figure: Category distribution of the Youku-mPLUG dataset.

Zero-shot Capability

Figure: Zero-shot examples (case 1 and case 2).

Download

You can download all the videos and annotation files through this link.

Setup

Note: due to a bug in megatron_util, after installing megatron_util you need to replace conda/envs/youku/lib/python3.10/site-packages/megatron_util/initialize.py with the initialize.py in the current directory.

conda env create -f environment.yml
conda activate youku
pip install megatron_util==1.3.0 -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

# For caption evaluation
apt-get install default-jre
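
The file replacement described in the note can also be scripted once the install has finished. The following is a minimal sketch, assuming it is run from the repository root inside the activated youku environment; the helper names are hypothetical and not part of the repository. It locates the installed package rather than hard-coding the conda path, and keeps a backup of the original file.

```python
import importlib.util
import shutil
from pathlib import Path


def find_package_dir(package_name):
    """Return the installed package's directory, or None if it cannot be found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def replace_module_file(package_dir, patched_file, filename="initialize.py"):
    """Back up package_dir/filename (only once) and overwrite it with patched_file."""
    target = Path(package_dir) / filename
    backup = target.with_name(target.name + ".bak")
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(patched_file, target)
    return target


def patch_megatron_util(patched_file="initialize.py"):
    """Apply the replacement to the megatron_util found in the current environment."""
    package_dir = find_package_dir("megatron_util")
    if package_dir is None:
        raise RuntimeError("megatron_util is not installed in this environment")
    return replace_module_file(package_dir, patched_file)
```

Calling patch_megatron_util() performs the replacement; restoring initialize.py.bak undoes it.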

mPLUG-Video (1.3B / 2.7B)

Pre-train

First, download the GPT-3 1.3B and 2.7B checkpoints from Modelscope. The pre-trained models can be downloaded here (1.3B) and here (2.7B).

Run the pre-training of mPLUG-Video as follows:

exp_name='pretrain/gpt3_1.3B/pretrain_gpt3_freezeGPT_youku_v0'
# Create the output directory so that tee can write the log file.
mkdir -p ./output/${exp_name}
PYTHONPATH=$PYTHONPATH:./ \
python -m torch.distributed.launch --nproc_per_node=8 --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  --nnodes=$WORLD_SIZE \
  --node_rank=$RANK \
  --use_env run_pretrain_distributed_gpt3.py \
  --config ./configs/${exp_name}.yaml \
  --output_dir ./output/${exp_name} \
  --enable_deepspeed \
  --bf16 \
  2>&1 | tee ./output/${exp_name}/train.log
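
In the command above, --nnodes and --node_rank are filled from $WORLD_SIZE and $RANK, so in this setup those variables appear to describe nodes rather than individual processes (an inference from the command, not something the README states). With --use_env, torch.distributed.launch hands each worker its local rank through the LOCAL_RANK environment variable instead of a --local_rank argument. The sketch below shows the resulting arithmetic; the function names are hypothetical.

```python
import os


def total_processes(nnodes, nproc_per_node):
    """Number of worker processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank of a worker, given its node index and its index on that node."""
    return node_rank * nproc_per_node + local_rank


def local_rank_from_env(environ=os.environ):
    """Read the local rank set by the launcher; default to 0 for single-process runs."""
    return int(environ.get("LOCAL_RANK", 0))
```

For example, with 2 nodes and 8 processes per node there are 16 workers, and the worker with local rank 3 on node 1 has global rank 11.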

Benchmarking

To fine-tune on a downstream task, take Video Category Prediction as an example:

exp_name='cls/cls_gpt3_1.3B_youku_v0_sharp_2'
# Create the output directory so that tee can write the log file.
mkdir -p ./output/${exp_name}
PYTHONPATH=$PYTHONPATH:./ \
python -m torch.distributed.launch --nproc_per_node=8 --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  --nnodes=$WORLD_SIZE \
  --node_rank=$RANK \
  --use_env downstream/run_cls_distributed_gpt3.py \
  --config ./configs/${exp_name}.yaml \
  --output_dir ./output/${exp_name} \
  --enable_deepspeed \
  --resume path/to/1_3B_mp_rank_00_model_states.pt \
  --bf16 \
  2>&1 | tee ./output/${exp_name}/train.log

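Experimental results

Below we show the results on the validation sets for reference.

Figure: Video category prediction results on the validation set.
Figure: Video retrieval results on the validation set.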

mPLUG-Video (BloomZ-7B)

We build the mPLUG-Video model on top of mPLUG-Owl. To use the model, first clone the mPLUG-Owl repo:

git clone https://github.com/X-PLUG/mPLUG-Owl.git
cd mPLUG-Owl/mPLUG-Owl

The instruction-tuned checkpoint is available on HuggingFace. To fine-tune the model, refer to the mPLUG-Owl repo. To run video inference, you can use the following code:

import torch
from mplug_owl_video.modeling_mplug_owl import MplugOwlForConditionalGeneration
from transformers import AutoTokenizer
from mplug_owl_video.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-youku-bloomz-7b'
model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    torch_dtype=torch.bfloat16,
    device_map={'': 0},
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = AutoTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)

# We use a human/AI template to organize the context as a multi-turn conversation.
# <|video|> denotes a video placeholder.
prompts = [
'''The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <|video|>
Human: 视频中的女人在干什么？
AI: ''']

video_list = ['yoga.mp4']

# Generation kwargs (the same as in transformers) are passed to model.generate().
generate_kwargs = {
    'do_sample': True,
    'top_k': 5,
    'max_length': 512
}
inputs = processor(text=prompts, videos=video_list, num_frames=4, return_tensors='pt')
# Cast float tensors to bfloat16 to match the model dtype, then move everything to the model device.
inputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()}
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    res = model.generate(**inputs, **generate_kwargs)
sentence = tokenizer.decode(res.tolist()[0], skip_special_tokens=True)
print(sentence)

Citing Youku-mPLUG

If you find this dataset useful for your research, please consider citing our paper.

@misc{xu2023youku_mplug,
    title={Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks},
    author={Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Chenliang Li, Qi Qian, Que Maofei, Ji Zhang, Xiao Zeng, Fei Huang},
    year={2023},
    eprint={2306.04362},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}