VQAScore allows researchers to automatically evaluate text-to-image/video/3D models with one line of Python code!
[VQAScore Page] [VQAScore Demo] [GenAI-Bench Page] [GenAI-Bench Demo] [CLIP-FlanT5 Model Zoo]
VQAScore: Evaluating Text-to-Visual Generation with Image-to-Text Generation (ECCV 2024) [Paper] [HF]
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation (CVPR 2024, Best Short Paper @ SynData Workshop) [Paper] [HF]
Baiqi Li*, Zhiqiu Lin*, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia*, Pengchuan Zhang*, Graham Neubig*, Deva Ramanan* (*co-first and co-senior authors)
VQAScore significantly outperforms previous metrics (e.g., CLIPScore and PickScore) on compositional text prompts, and it is much simpler than prior art (e.g., ImageReward, HPSv2, TIFA, Davidsonian, VPEval, VIEScore) that relies on human feedback or proprietary models such as ChatGPT and GPT-4Vision.
Install the package via:
git clone https://github.com/linzhiqiu/t2v_metrics
cd t2v_metrics
conda create -n t2v python=3.10 -y
conda activate t2v
conda install pip -y
pip install torch torchvision torchaudio
pip install git+https://github.com/openai/CLIP.git
pip install -e . # local pip install
Or you can install it via pip install t2v-metrics.
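After installation, one quick way to sanity-check the setup is to import the package and list the available models (a minimal sketch using the list_all_* helpers shown later in this README):
import t2v_metrics  # quick post-install sanity check
print(t2v_metrics.list_all_vqascore_models())   # should include 'clip-flant5-xxl'
print(t2v_metrics.list_all_itmscore_models())
print(t2v_metrics.list_all_clipscore_models())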
Now you can compute the image-text alignment VQAScore with the following Python code (higher scores indicate greater similarity):
import t2v_metrics
clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')  # our recommended scoring model

### For a single (image, text) pair
image = "images/0.png"  # an image path in string format
text = "someone talks on the phone angrily while another person sits happily"
score = clip_flant5_score(images=[image], texts=[text])

### Alternatively, if you want to calculate the pairwise similarity scores
### between M images and N texts, run the following to return a M x N score tensor.
images = ["images/0.png", "images/1.png"]
texts = ["someone talks on the phone angrily while another person sits happily",
         "someone talks on the phone happily while another person sits angrily"]
scores = clip_flant5_score(images=images, texts=texts)  # scores[i][j] is the score between image i and text j
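For example, to find the best-matching text for each image from the M x N tensor (a small usage sketch, assuming scores is a standard PyTorch tensor):
# Usage sketch (assumption: `scores` is a standard M x N PyTorch tensor).
# For each image, print the text with the highest VQAScore.
best_text_idx = scores.argmax(dim=1)
for i, j in enumerate(best_text_idx.tolist()):
    print(f"{images[i]} best matches: {texts[j]} (score={scores[i][j].item():.3f})")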
Our recommended scoring models are clip-flant5-xxl and llava-v1.5-13b. If your GPU memory is limited, consider smaller models such as clip-flant5-xl and llava-v1.5-7b. You can change the cache folder where all model checkpoints are saved (default: ./hf_cache/) via HF_CACHE_DIR.
For a large batch of M images x N texts, you can use the batch_forward() function to speed things up:
import t2v_metrics
clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# The number of images and texts per dictionary must be consistent.
# E.g., the below example shows how to evaluate 4 generated images per text
dataset = [
    {'images': ["images/0/DALLE3.png", "images/0/Midjourney.jpg", "images/0/SDXL.jpg", "images/0/DeepFloyd.jpg"], 'texts': ["The brown dog chases the black dog around the tree."]},
    {'images': ["images/1/DALLE3.png", "images/1/Midjourney.jpg", "images/1/SDXL.jpg", "images/1/DeepFloyd.jpg"], 'texts': ["Two cats sit at the window, the blue one intently watching the rain, the red one curled up asleep."]},
    # ...
]
scores = clip_flant5_score.batch_forward(dataset=dataset, batch_size=16)  # (n_sample, 4, 1) tensor
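One common use of the resulting (n_sample, 4, 1) tensor is to rank the 4 generated images for each prompt (a small sketch, assuming a standard PyTorch tensor is returned):
# Sketch (assumption: `scores` is an (n_sample, 4, 1) PyTorch tensor as described above).
# Pick the highest-scoring generated image for each text prompt.
per_prompt_scores = scores.squeeze(-1)            # (n_sample, 4)
best_image_idx = per_prompt_scores.argmax(dim=1)  # index of the best image per prompt
for sample, idx in zip(dataset, best_image_idx.tolist()):
    print(f"Best image for '{sample['texts'][0]}': {sample['images'][idx]}")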
We currently support running VQAScore with CLIP-FlanT5, LLaVA-1.5, and InstructBLIP. For ablations, we also include CLIPScore, BLIPv2Score, PickScore, HPSv2Score, and ImageReward:
llava_score = t2v_metrics.VQAScore(model='llava-v1.5-13b')
instructblip_score = t2v_metrics.VQAScore(model='instructblip-flant5-xxl')
clip_score = t2v_metrics.CLIPScore(model='openai:ViT-L-14-336')
blip_itm_score = t2v_metrics.ITMScore(model='blip2-itm')
pick_score = t2v_metrics.CLIPScore(model='pickscore-v1')
hpsv2_score = t2v_metrics.CLIPScore(model='hpsv2')
image_reward_score = t2v_metrics.ITMScore(model='image-reward-v1')
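These scoring objects can then be called in the same way as clip_flant5_score above, for example (a usage sketch assuming all metrics share the images/texts calling convention):
# Usage sketch (assumption: all scoring models share the images/texts interface).
images = ["images/0.png", "images/1.png"]
texts = ["someone talks on the phone angrily while another person sits happily"]
clip_scores = clip_score(images=images, texts=texts)   # M x N CLIPScore tensor
pick_scores = pick_score(images=images, texts=texts)   # M x N PickScore tensor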
You can check all supported models by running the following commands:
print ( "VQAScore models:" )
t2v_metrics . list_all_vqascore_models ()
print ( "ITMScore models:" )
t2v_metrics . list_all_itmscore_models ()
print ( "CLIPScore models:" )
t2v_metrics . list_all_clipscore_models ()
The question and answer slightly affect the final score, as shown in the appendix of our paper. We provide a simple default template for each model and do not recommend changing it, for the sake of reproducibility. However, we do want to point out that the question and answer can be easily modified. For example, CLIP-FlanT5 and LLaVA-1.5 use the following template, which can be found in t2v_metrics/models/vqascore_models/clip_t5_model.py:
# {} will be replaced by the caption
default_question_template = 'Does this figure show "{}"? Please answer yes or no.'
default_answer_template = 'Yes'
You can customize the templates by passing the question_template and answer_template parameters to the forward() or batch_forward() functions:
# Use a different question for VQAScore
scores = clip_flant5_score(images=images,
                           texts=texts,
                           question_template='Is this figure showing "{}"? Please answer yes or no.',
                           answer_template='Yes')
You can also compute P(caption | image) (VisualGPTScore) instead of P(answer | image, question):
scores = clip_flant5_score(images=images,
                           texts=texts,
                           question_template="",    # no question
                           answer_template="{}")    # this computes P(caption | image)
Our eval.py lets you easily run 10 image/vision/3D alignment benchmarks (e.g., Winoground/TIFA160/SeeTrue/StanfordT23D/T2VScore):
python eval.py --model clip-flant5-xxl # for VQAScore
python eval.py --model openai:ViT-L-14 # for CLIPScore
# You can optionally specify question/answer template, for example:
python eval.py --model clip-flant5-xxl --question "Is the figure showing '{}'?" --answer "Yes"
Our genai_image_eval.py and genai_video_eval.py can reproduce the GenAI-Bench results. In addition, genai_image_ranking.py can reproduce the GenAI-Rank results:
# GenAI-Bench
python genai_image_eval.py --model clip-flant5-xxl
python genai_video_eval.py --model clip-flant5-xxl
# GenAI-Rank
python genai_image_ranking.py --model clip-flant5-xxl --gen_model DALLE_3
python genai_image_ranking.py --model clip-flant5-xxl --gen_model SDXL_Base
We implement VQAScore with GPT-4o to achieve new state-of-the-art performance. See t2v_metrics/gpt4_eval.py for an example. Here is how to use it in code:
openai_key = "your_openai_key"  # Your OpenAI key
score_func = t2v_metrics.get_score_model(model="gpt-4o", device="cuda", openai_key=openai_key, top_logprobs=20)  # We find top_logprobs=20 to be sufficient for most (image, text) samples. Consider increasing this number if you get errors (the API cost will not increase).
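Once constructed, the GPT-4o scorer can presumably be called like the other scoring models above (a usage sketch assuming the same images/texts interface):
# Usage sketch (assumption: the GPT-4o scorer follows the same calling
# convention as the other scoring models in this README).
images = ["images/0.png"]
texts = ["someone talks on the phone angrily while another person sits happily"]
scores = score_func(images=images, texts=texts)  # M x N score tensor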
You can easily implement your own scoring metric. For example, if you have a VQA model that you believe is more effective, you can incorporate it into the t2v_metrics/models/vqascore_models directory. For guidance, please refer to our example implementations of LLaVA-1.5 and InstructBLIP as starting points.
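As a rough illustration only, a custom VQAScore model might look like the hypothetical skeleton below; the actual base class, constructor arguments, and required method signatures should be copied from the LLaVA-1.5 and InstructBLIP implementations in t2v_metrics/models/vqascore_models:
# Hypothetical skeleton (not part of the library); follow the existing
# LLaVA-1.5 / InstructBLIP files in t2v_metrics/models/vqascore_models
# for the real base class and method signatures.
import torch

class MyVQAScoreModel:
    def __init__(self, device='cuda'):
        self.device = device
        self.load_model()

    def load_model(self):
        # Load your VQA model and its processor/tokenizer here.
        self.model = ...
        self.processor = ...

    @torch.no_grad()
    def forward(self, images, texts, question_template, answer_template):
        # Build the question/answer from the templates and return, for each
        # (image, text) pair, the probability of the answer given the question.
        questions = [question_template.format(t) for t in texts]
        answers = [answer_template.format(t) for t in texts]
        ...  # run the VQA model and return a score tensor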
To generate text (for captioning or VQA tasks) with CLIP-FlanT5, please use the following code:
import t2v_metrics
clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')
images = ["images/0.png", "images/0.png"]  # A list of images
prompts = ["Please describe this image: ", "Does the image show 'someone talks on the phone angrily while another person sits happily'?"]  # Corresponding prompts
clip_flant5_score.model.generate(images=images, prompts=prompts)
If you find this repository useful for your research, please cite it using the following (updated with the ArXiv ID):
@article{lin2024evaluating,
title={Evaluating Text-to-Visual Generation with Image-to-Text Generation},
author={Lin, Zhiqiu and Pathak, Deepak and Li, Baiqi and Li, Jiayao and Xia, Xide and Neubig, Graham and Zhang, Pengchuan and Ramanan, Deva},
journal={arXiv preprint arXiv:2404.01291},
year={2024}
}
This repository is inspired by the Perceptual Metric (LPIPS) repository by Richard Zhang for automatic evaluation of image quality.