t2v_metrics 다운로드 - t2v_metrics 소스 코드 다운로드

텍스트-비주얼 모델 평가를 위한 VQAScore [프로젝트 페이지]

VQAScore를 사용하면 연구자는 한 줄의 Python 코드를 사용하여 텍스트-이미지/비디오/3D 모델을 자동으로 평가할 수 있습니다!

[VQAScore 페이지] [VQAScore 데모] [GenAI-Bench 페이지] [GenAI-Bench 데모] [CLIP-FlanT5 Model Zoo]

VQAScore: 이미지-텍스트 생성을 통한 텍스트-시각적 생성 평가 (ECCV 2024) [논문] [HF]
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan

GenAI-Bench: 구성 텍스트-비주얼 생성 평가 및 개선 (CVPR 2024, 최우수 단편 논문 @ SynData Workshop ) [논문] [HF]
Baiqi Li*, Zhiqiu Lin*, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia*, Pengchuan Zhang*, Graham Neubig*, Deva Ramanan*(*공동 1위 및 공동 선임 저자)

소식

[2024/08/13] VQAScore 는 Google의 Imagen3 보고서에서 자동 평가를 위한 CLIPScore의 가장 강력한 대체 제품으로 강조되었습니다! GenAI-Bench는 Imagen3의 뛰어난 프롬프트 이미지 정렬을 보여주는 주요 벤치마크 중 하나로 선택되었습니다. 이 성과에 대해 Google에 찬사를 보냅니다! [종이]
[2024/07/01] VQAScore가 ECCV 2024에 승인되었습니다!
[2024/06/20] GenAI-Bench가 CVPR'24 SynData Workshop에서 최우수 단편 논문을 수상했습니다! [워크숍 사이트].

VQAScore는 구성 텍스트 프롬프트에서 CLIPScore 및 PickScore와 같은 이전 측정항목보다 훨씬 뛰어난 성능을 발휘하며 인간 피드백이나 ChatGPT 및 GPT와 같은 독점 모델을 활용하는 이전 기술(예: ImageReward, HPSv2, TIFA, Davidsonian, VPEval, VIEScore)보다 훨씬 간단합니다. -4비전.

빠른 시작

다음을 통해 패키지를 설치하십시오.

git clone https://github.com/linzhiqiu/t2v_metrics
cd t2v_metrics

conda create -n t2v python=3.10 -y
conda activate t2v
conda install pip -y

pip install torch torchvision torchaudio
pip install git+https://github.com/openai/CLIP.git
pip install -e . # local pip install

또는 pip install t2v-metrics 통해 설치할 수 있습니다.

이제 다음 Python 코드는 이미지-텍스트 정렬을 위해 VQAScore를 계산하는 데 필요한 전부입니다(점수가 높을수록 유사성이 높음을 나타냄).

 import t2v_metrics
clip_flant5_score = t2v_metrics . VQAScore ( model = 'clip-flant5-xxl' ) # our recommended scoring model

### For a single (image, text) pair
image = "images/0.png" # an image path in string format
text = "someone talks on the phone angrily while another person sits happily"
score = clip_flant5_score ( images = [ image ], texts = [ text ])

### Alternatively, if you want to calculate the pairwise similarity scores 
### between M images and N texts, run the following to return a M x N score tensor.
images = [ "images/0.png" , "images/1.png" ]
texts = [ "someone talks on the phone angrily while another person sits happily" ,
         "someone talks on the phone happily while another person sits angrily" ]
scores = clip_flant5_score ( images = images , texts = texts ) # scores[i][j] is the score between image i and text j

GPU 및 캐시에 대한 참고 사항

GPU 사용량 : 기본적으로 이 코드는 컴퓨터의 첫 번째 cuda 장치를 사용합니다. clip-flant5-xxl 및 llava-v1.5-13b 와 같은 가장 큰 VQAScore 모델에는 40GB GPU를 권장합니다. GPU 메모리가 제한되어 있는 경우에는 clip-flant5-xl 및 llava-v1.5-7b 와 같은 더 작은 모델을 고려하십시오.
캐시 디렉터리 : t2v_metrics/constants.py에서 HF_CACHE_DIR 업데이트하여 모든 모델 체크포인트(기본값은 ./hf_cache/ )를 저장하는 캐시 폴더를 변경할 수 있습니다.

고급 사용법

더 많은 이미지-텍스트 쌍을 위한 일괄 처리
지원되는 모든 모델 확인
질문 및 답변 템플릿 사용자 정의(VQAScore용)
VQAScore 논문 결과 재현
GenAI-Bench 논문 결과 재현
VQAScore에 GPT-4o 사용
자신만의 점수 측정항목 구현
CLIP-FlanT5를 사용한 텍스트 생성(VQA)

더 많은 이미지-텍스트 쌍을 위한 일괄 처리

M개의 이미지 x N개의 텍스트로 구성된 대규모 배치의 경우, batch_forward() 함수를 사용하여 속도를 높일 수 있습니다.

 import t2v_metrics
clip_flant5_score = t2v_metrics . VQAScore ( model = 'clip-flant5-xxl' )

# The number of images and texts per dictionary must be consistent.
# E.g., the below example shows how to evaluate 4 generated images per text
dataset = [
  { 'images' : [ "images/0/DALLE3.png" , "images/0/Midjourney.jpg" , "images/0/SDXL.jpg" , "images/0/DeepFloyd.jpg" ], 'texts' : [ "The brown dog chases the black dog around the tree." ]},
  { 'images' : [ "images/1/DALLE3.png" , "images/1/Midjourney.jpg" , "images/1/SDXL.jpg" , "images/1/DeepFloyd.jpg" ], 'texts' : [ "Two cats sit at the window, the blue one intently watching the rain, the red one curled up asleep." ]},
  #...
]
scores = clip_flant5_score . batch_forward ( dataset = dataset , batch_size = 16 ) # (n_sample, 4, 1) tensor

지원되는 모든 모델 확인

현재 CLIP-FlanT5, LLaVA-1.5 및 InstructBLIP을 사용하여 VQAScore 실행을 지원합니다. 절제를 위해 CLIPScore, BLIPv2Score, PickScore, HPSv2Score 및 ImageReward도 포함합니다.

 llava_score = t2v_metrics . VQAScore ( model = 'llava-v1.5-13b' )
instructblip_score = t2v_metrics . VQAScore ( model = 'instructblip-flant5-xxl' )
clip_score = t2v_metrics . CLIPScore ( model = 'openai:ViT-L-14-336' )
blip_itm_score = t2v_metrics . ITMScore ( model = 'blip2-itm' ) 
pick_score = t2v_metrics . CLIPScore ( model = 'pickscore-v1' )
hpsv2_score = t2v_metrics . CLIPScore ( model = 'hpsv2' ) 
image_reward_score = t2v_metrics . ITMScore ( model = 'image-reward-v1' )

아래 명령을 실행하여 지원되는 모든 모델을 확인할 수 있습니다.

 print ( "VQAScore models:" )
t2v_metrics . list_all_vqascore_models ()

print ( "ITMScore models:" )
t2v_metrics . list_all_itmscore_models ()

print ( "CLIPScore models:" )
t2v_metrics . list_all_clipscore_models ()

질문 및 답변 템플릿 사용자 정의(VQAScore용)

우리 논문의 부록에 표시된 것처럼 질문과 답변은 최종 점수에 약간의 영향을 미칩니다. 각 모델에 대해 간단한 기본 템플릿을 제공하며 재현성을 위해 변경하지 않는 것이 좋습니다. 그러나 질문과 답변은 쉽게 수정될 수 있다는 점을 지적하고 싶습니다. 예를 들어 CLIP-FlanT5 및 LLaVA-1.5는 t2v_metrics/models/vqascore_models/clip_t5_model.py에서 찾을 수 있는 다음 템플릿을 사용합니다.

 # {} will be replaced by the caption
default_question_template = 'Does this figure show "{}"? Please answer yes or no.'
default_answer_template = 'Yes'

question_template 및 answer_template 매개변수를 전달 forward() 또는 batch_forward() 함수에 전달하여 템플릿을 사용자 정의할 수 있습니다.

 # Use a different question for VQAScore
scores = clip_flant5_score ( images = images ,
                           texts = texts ,
                           question_template = 'Is this figure showing "{}"? Please answer yes or no.' ,
                           answer_template = 'Yes' )

P(답변 | 이미지, 질문) 대신 P(캡션 | 이미지)(VisualGPTScore)를 계산할 수도 있습니다.

 scores = clip_flant5_score ( images = images ,
                           texts = texts ,
                           question_template = "" , # no question
                           answer_template = "{}" ) # this computes P(caption | image)

VQAScore 논문 결과 재현

우리의 eval.py를 사용하면 10개의 이미지/비전/3D 정렬 벤치마크(예: Winoground/TIFA160/SeeTrue/StanfordT23D/T2VScore)를 쉽게 실행할 수 있습니다.

python eval.py --model clip-flant5-xxl # for VQAScore
python eval.py --model openai:ViT-L-14 # for CLIPScore

# You can optionally specify question/answer template, for example:
python eval.py --model clip-flant5-xxl --question " Is the figure showing '{}'? " --answer " Yes "

GenAI-Bench 논문 결과 재현

genai_image_eval.py 및 genai_video_eval.py는 GenAI-Bench 결과를 재현할 수 있습니다. 추가로 genai_image_ranking.py는 GenAI-Rank 결과를 재현할 수 있습니다.

 # GenAI-Bench
python genai_image_eval.py --model clip-flant5-xxl
python genai_video_eval.py --model clip-flant5-xxl

# GenAI-Rank
python genai_image_ranking.py --model clip-flant5-xxl --gen_model DALLE_3
python genai_image_ranking.py --model clip-flant5-xxl --gen_model SDXL_Base

VQAScore에 GPT-4o를 사용합니다!

우리는 새로운 최첨단 성능을 달성하기 위해 GPT-4o를 사용하여 VQAScore를 구현했습니다. 예를 보려면 t2v_metrics/gpt4_eval.py를 참조하세요. 명령줄에서 사용하는 방법은 다음과 같습니다.

 openai_key = # Your OpenAI key
score_func = t2v_metrics . get_score_model ( model = "gpt-4o" , device = "cuda" , openai_key = openai_key , top_logprobs = 20 ) # We find top_logprobs=20 to be sufficient for most (image, text) samples. Consider increase this number if you get errors (the API cost will not increase).

자신만의 점수 측정항목 구현

자신만의 점수 측정 기준을 쉽게 구현할 수 있습니다. 예를 들어, 더 효과적이라고 생각되는 VQA 모델이 있는 경우 이를 t2v_metrics/models/vqascore_models 디렉터리에 통합할 수 있습니다. 지침은 LLaVA-1.5 및 InstructBLIP의 구현 예시를 시작점으로 참조하세요.

CLIP-FlanT5를 사용한 텍스트 생성(VQA)

CLIP-FlanT5를 사용하여 텍스트(캡션 또는 VQA 작업)를 생성하려면 아래 코드를 사용하십시오.

 import t2v_metrics
clip_flant5_score = t2v_metrics . VQAScore ( model = 'clip-flant5-xxl' )

images = [ "images/0.png" , "images/0.png" ] # A list of images
prompts = [ "Please describe this image: " , "Does the image show 'someone talks on the phone angrily while another person sits happily'?" ] # Corresponding prompts
clip_flant5_score . model . generate ( images = images , prompts = prompts )

소환

이 저장소가 귀하의 연구에 유용하다고 생각되면 다음을 사용하십시오(ArXiv ID로 업데이트하려면).

 @article{lin2024evaluating,
  title={Evaluating Text-to-Visual Generation with Image-to-Text Generation},
  author={Lin, Zhiqiu and Pathak, Deepak and Li, Baiqi and Li, Jiayao and Xia, Xide and Neubig, Graham and Zhang, Pengchuan and Ramanan, Deva},
  journal={arXiv preprint arXiv:2404.01291},
  year={2024}
}