VQAScore allows researchers to automatically evaluate text-to-image/video/3D models with one line of Python code!
[VQAScore Page] [VQAScore Demo] [GenAI-Bench Page] [GenAI-Bench Demo] [CLIP-FlanT5 Model Zoo]
VQAScore: Evaluating Text-to-Visual Generation with Image-to-Text Generation (ECCV 2024) [Paper] [HF]
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation (CVPR 2024, Best Short Paper @ SynData Workshop) [Paper] [HF]
Baiqi Li*, Zhiqiu Lin*, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia*, Pengchuan Zhang*, Graham Neubig*, Deva Ramanan* (*co-first and co-senior authors)
VQAScore significantly outperforms previous metrics (e.g., CLIPScore and PickScore) on compositional text prompts, and it is much simpler than prior art (e.g., ImageReward, HPSv2, TIFA, Davidsonian, VPEval, VIEScore) that relies on human feedback or proprietary models such as ChatGPT and GPT-4Vision.
Install the package via:
git clone https://github.com/linzhiqiu/t2v_metrics
cd t2v_metrics
conda create -n t2v python=3.10 -y
conda activate t2v
conda install pip -y
pip install torch torchvision torchaudio
pip install git+https://github.com/openai/CLIP.git
pip install -e . # local pip install
Or you can install it via pip install t2v-metrics.
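After installation, one quick way to sanity-check the setup is to import the package and list the available models (a minimal sketch using the list_all_* helpers shown later in this README):
import t2v_metrics  # quick post-install sanity check
print(t2v_metrics.list_all_vqascore_models())   # should include 'clip-flant5-xxl'
print(t2v_metrics.list_all_itmscore_models())
print(t2v_metrics.list_all_clipscore_models())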
Now you can compute the image-text alignment VQAScore with the following Python code (higher scores indicate greater similarity):
import t2v_metrics
clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')  # our recommended scoring model

### For a single (image, text) pair
image = "images/0.png"  # an image path in string format
text = "someone talks on the phone angrily while another person sits happily"
score = clip_flant5_score(images=[image], texts=[text])

### Alternatively, if you want to calculate the pairwise similarity scores
### between M images and N texts, run the following to return a M x N score tensor.
images = ["images/0.png", "images/1.png"]
texts = ["someone talks on the phone angrily while another person sits happily",
         "someone talks on the phone happily while another person sits angrily"]
scores = clip_flant5_score(images=images, texts=texts)  # scores[i][j] is the score between image i and text j
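For example, to find the best-matching text for each image from the M x N tensor (a small usage sketch, assuming scores is a standard PyTorch tensor):
# Usage sketch (assumption: `scores` is a standard M x N PyTorch tensor).
# For each image, print the text with the highest VQAScore.
best_text_idx = scores.argmax(dim=1)
for i, j in enumerate(best_text_idx.tolist()):
    print(f"{images[i]} best matches: {texts[j]} (score={scores[i][j].item():.3f})")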
Our recommended scoring models are clip-flant5-xxl and llava-v1.5-13b. If your GPU memory is limited, consider smaller models such as clip-flant5-xl and llava-v1.5-7b. You can change the cache folder where all model checkpoints are saved (default: ./hf_cache/) via HF_CACHE_DIR.
For a large batch of M images x N texts, you can use the batch_forward() function to speed things up:
import t2v_metrics
clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# The number of images and texts per dictionary must be consistent.
# E.g., the below example shows how to evaluate 4 generated images per text
dataset = [
    {'images': ["images/0/DALLE3.png", "images/0/Midjourney.jpg", "images/0/SDXL.jpg", "images/0/DeepFloyd.jpg"], 'texts': ["The brown dog chases the black dog around the tree."]},
    {'images': ["images/1/DALLE3.png", "images/1/Midjourney.jpg", "images/1/SDXL.jpg", "images/1/DeepFloyd.jpg"], 'texts': ["Two cats sit at the window, the blue one intently watching the rain, the red one curled up asleep."]},
    # ...
]
scores = clip_flant5_score.batch_forward(dataset=dataset, batch_size=16)  # (n_sample, 4, 1) tensor
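One common use of the resulting (n_sample, 4, 1) tensor is to rank the 4 generated images for each prompt (a small sketch, assuming a standard PyTorch tensor is returned):
# Sketch (assumption: `scores` is an (n_sample, 4, 1) PyTorch tensor as described above).
# Pick the highest-scoring generated image for each text prompt.
per_prompt_scores = scores.squeeze(-1)            # (n_sample, 4)
best_image_idx = per_prompt_scores.argmax(dim=1)  # index of the best image per prompt
for sample, idx in zip(dataset, best_image_idx.tolist()):
    print(f"Best image for '{sample['texts'][0]}': {sample['images'][idx]}")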
We currently support running VQAScore with CLIP-FlanT5, LLaVA-1.5, and InstructBLIP. For ablations, we also include CLIPScore, BLIPv2Score, PickScore, HPSv2Score, and ImageReward:
llava_score = t2v_metrics.VQAScore(model='llava-v1.5-13b')
instructblip_score = t2v_metrics.VQAScore(model='instructblip-flant5-xxl')
clip_score = t2v_metrics.CLIPScore(model='openai:ViT-L-14-336')
blip_itm_score = t2v_metrics.ITMScore(model='blip2-itm')
pick_score = t2v_metrics.CLIPScore(model='pickscore-v1')
hpsv2_score = t2v_metrics.CLIPScore(model='hpsv2')
image_reward_score = t2v_metrics.ITMScore(model='image-reward-v1')
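These scoring objects can then be called in the same way as clip_flant5_score above, for example (a usage sketch assuming all metrics share the images/texts calling convention):
# Usage sketch (assumption: all scoring models share the images/texts interface).
images = ["images/0.png", "images/1.png"]
texts = ["someone talks on the phone angrily while another person sits happily"]
clip_scores = clip_score(images=images, texts=texts)   # M x N CLIPScore tensor
pick_scores = pick_score(images=images, texts=texts)   # M x N PickScore tensor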
You can check all supported models by running the following commands:
print ( "VQAScore models:" )
t2v_metrics . list_all_vqascore_models ()
print ( "ITMScore models:" )
t2v_metrics . list_all_itmscore_models ()
print ( "CLIPScore models:" )
t2v_metrics . list_all_clipscore_models ()
The question and answer slightly affect the final score, as shown in the appendix of our paper. We provide a simple default template for each model and do not recommend changing it, for the sake of reproducibility. However, we do want to point out that the question and answer can be easily modified. For example, CLIP-FlanT5 and LLaVA-1.5 use the following template, which can be found in t2v_metrics/models/vqascore_models/clip_t5_model.py:
# {} will be replaced by the caption
default_question_template = 'Does this figure show "{}"? Please answer yes or no.'
default_answer_template = 'Yes'
You can customize the templates by passing the question_template and answer_template parameters to the forward() or batch_forward() functions:
# Use a different question for VQAScore
scores = clip_flant5_score(images=images,
                           texts=texts,
                           question_template='Is this figure showing "{}"? Please answer yes or no.',
                           answer_template='Yes')
You can also compute P(caption | image) (VisualGPTScore) instead of P(answer | image, question):
scores = clip_flant5_score(images=images,
                           texts=texts,
                           question_template="",    # no question
                           answer_template="{}")    # this computes P(caption | image)
Our eval.py lets you easily run 10 image/vision/3D alignment benchmarks (e.g., Winoground/TIFA160/SeeTrue/StanfordT23D/T2VScore):
python eval.py --model clip-flant5-xxl # for VQAScore
python eval.py --model openai:ViT-L-14 # for CLIPScore
# You can optionally specify question/answer template, for example:
python eval.py --model clip-flant5-xxl --question "Is the figure showing '{}'?" --answer "Yes"
Our genai_image_eval.py and genai_video_eval.py can reproduce the GenAI-Bench results. In addition, genai_image_ranking.py can reproduce the GenAI-Rank results:
# GenAI-Bench
python genai_image_eval.py --model clip-flant5-xxl
python genai_video_eval.py --model clip-flant5-xxl
# GenAI-Rank
python genai_image_ranking.py --model clip-flant5-xxl --gen_model DALLE_3
python genai_image_ranking.py --model clip-flant5-xxl --gen_model SDXL_Base
We implement VQAScore with GPT-4o to achieve new state-of-the-art performance. See t2v_metrics/gpt4_eval.py for an example. Here is how to use it in code:
openai_key = "your_openai_key"  # Your OpenAI key
score_func = t2v_metrics.get_score_model(model="gpt-4o", device="cuda", openai_key=openai_key, top_logprobs=20)  # We find top_logprobs=20 to be sufficient for most (image, text) samples. Consider increasing this number if you get errors (the API cost will not increase).
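Once constructed, the GPT-4o scorer can presumably be called like the other scoring models above (a usage sketch assuming the same images/texts interface):
# Usage sketch (assumption: the GPT-4o scorer follows the same calling
# convention as the other scoring models in this README).
images = ["images/0.png"]
texts = ["someone talks on the phone angrily while another person sits happily"]
scores = score_func(images=images, texts=texts)  # M x N score tensor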
You can easily implement your own scoring metric. For example, if you have a VQA model that you believe is more effective, you can incorporate it into the t2v_metrics/models/vqascore_models directory. For guidance, please refer to our example implementations of LLaVA-1.5 and InstructBLIP as starting points.
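As a rough illustration only, a custom VQAScore model might look like the hypothetical skeleton below; the actual base class, constructor arguments, and required method signatures should be copied from the LLaVA-1.5 and InstructBLIP implementations in t2v_metrics/models/vqascore_models:
# Hypothetical skeleton (not part of the library); follow the existing
# LLaVA-1.5 / InstructBLIP files in t2v_metrics/models/vqascore_models
# for the real base class and method signatures.
import torch

class MyVQAScoreModel:
    def __init__(self, device='cuda'):
        self.device = device
        self.load_model()

    def load_model(self):
        # Load your VQA model and its processor/tokenizer here.
        self.model = ...
        self.processor = ...

    @torch.no_grad()
    def forward(self, images, texts, question_template, answer_template):
        # Build the question/answer from the templates and return, for each
        # (image, text) pair, the probability of the answer given the question.
        questions = [question_template.format(t) for t in texts]
        answers = [answer_template.format(t) for t in texts]
        ...  # run the VQA model and return a score tensor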
To generate text (for captioning or VQA tasks) with CLIP-FlanT5, please use the following code:
import t2v_metrics
clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')
images = ["images/0.png", "images/0.png"]  # A list of images
prompts = ["Please describe this image: ", "Does the image show 'someone talks on the phone angrily while another person sits happily'?"]  # Corresponding prompts
clip_flant5_score.model.generate(images=images, prompts=prompts)
If you find this repository useful for your research, please cite it using the following (updated with the ArXiv ID):
@article{lin2024evaluating,
title={Evaluating Text-to-Visual Generation with Image-to-Text Generation},
author={Lin, Zhiqiu and Pathak, Deepak and Li, Baiqi and Li, Jiayao and Xia, Xide and Neubig, Graham and Zhang, Pengchuan and Ramanan, Deva},
journal={arXiv preprint arXiv:2404.01291},
year={2024}
}
This repository is inspired by the Perceptual Metric (LPIPS) repository by Richard Zhang for automatic evaluation of image quality.