تنزيل t2v_metrics - تنزيل كود المصدر t2v

VQAScore لتقييم نماذج النص إلى المرئية [صفحة المشروع]

يسمح VQAScore للباحثين بتقييم نماذج تحويل النص إلى صورة/فيديو/ثلاثية الأبعاد تلقائيًا باستخدام سطر واحد من كود Python!

[صفحة VQAScore] [عرض VQAScore] [صفحة GenAI-Bench] [عرض GenAI-Bench] [CLIP-FlanT5 Model Zoo]

VQAScore: تقييم إنشاء تحويل النص إلى مرئي مع إنشاء تحويل صورة إلى نص (ECCV 2024) [ورقة] [HF]
زيكيو لين، ديباك باتاك، بايكي لي، جياياو لي، إكسيدي شيا، جراهام نيوبيج، بينجتشوان تشانغ، ديفا رامانان

GenAI-Bench: تقييم وتحسين إنشاء تحويل النص إلى مرئي (CVPR 2024، أفضل ورقة قصيرة في ورشة عمل SynData ) [ورقة] [HF]
بايكي لي*، زيكيو لين*، ديباك باثاك، جياياو لي، ييكسين فاي، كيوين وو، تيفاني لينغ، إكسيد شيا*، بينغشوان تشانغ*، غراهام نيوبيج*، ديفا رامانان* (*مؤلفون مشاركين أول ومؤلفون كبار مشاركين)

أخبار

[2024/08/13] تم تسليط الضوء على VQAScore في تقرير Imagen3 من Google باعتباره أقوى بديل لـ CLIPScore للتقييم الآلي! تم اختيار GenAI-Bench كأحد المعايير الرئيسية لعرض محاذاة الصورة السريعة الفائقة لـ Imagen3. مجد لجوجل على هذا الإنجاز! [ورق]
[2024/07/01] تم قبول VQAScore في ECCV 2024!
[2024/06/20] فاز GenAI-Bench بجائزة أفضل ورقة بحثية قصيرة في ورشة عمل SynData CVPR'24! [موقع ورشة العمل].

يتفوق VQAScore بشكل كبير على المقاييس السابقة مثل CLIPScore وPickScore في مطالبات النص التركيبي، كما أنه أبسط بكثير من التقنية السابقة (على سبيل المثال، ImageReward وHPSv2 وTIFA وDavidsonian وVPEval وVIEScore) باستخدام التعليقات البشرية أو النماذج الخاصة مثل ChatGPT وGPT -4الرؤية.

بداية سريعة

تثبيت الحزمة عبر:

git clone https://github.com/linzhiqiu/t2v_metrics
cd t2v_metrics

conda create -n t2v python=3.10 -y
conda activate t2v
conda install pip -y

pip install torch torchvision torchaudio
pip install git+https://github.com/openai/CLIP.git
pip install -e . # local pip install

أو يمكنك التثبيت عبر pip install t2v-metrics .

الآن، كود Python التالي هو كل ما تحتاجه لحساب VQAScore لمحاذاة نص الصورة (تشير الدرجات الأعلى إلى تشابه أكبر):

 import t2v_metrics
clip_flant5_score = t2v_metrics . VQAScore ( model = 'clip-flant5-xxl' ) # our recommended scoring model

### For a single (image, text) pair
image = "images/0.png" # an image path in string format
text = "someone talks on the phone angrily while another person sits happily"
score = clip_flant5_score ( images = [ image ], texts = [ text ])

### Alternatively, if you want to calculate the pairwise similarity scores 
### between M images and N texts, run the following to return a M x N score tensor.
images = [ "images/0.png" , "images/1.png" ]
texts = [ "someone talks on the phone angrily while another person sits happily" ,
         "someone talks on the phone happily while another person sits angrily" ]
scores = clip_flant5_score ( images = images , texts = texts ) # scores[i][j] is the score between image i and text j

ملاحظات على GPU وذاكرة التخزين المؤقت

استخدام GPU : افتراضيًا، يستخدم هذا الرمز أول جهاز cuda على جهازك. نوصي باستخدام وحدات معالجة الرسومات بسعة 40 جيجابايت لأكبر طرز VQAScore مثل clip-flant5-xxl و llava-v1.5-13b . إذا كانت لديك ذاكرة GPU محدودة، ففكر في نماذج أصغر مثل clip-flant5-xl و llava-v1.5-7b .
دليل ذاكرة التخزين المؤقت : يمكنك تغيير مجلد ذاكرة التخزين المؤقت الذي يحفظ جميع نقاط فحص النموذج (الافتراضي هو ./hf_cache/ ) عن طريق تحديث HF_CACHE_DIR في t2v_metrics/constants.py.

الاستخدام المتقدم

المعالجة المجمعة لمزيد من أزواج الصور والنص
تحقق من كافة النماذج المدعومة
تخصيص قالب الأسئلة والأجوبة (لـ VQAScore)
إعادة إنتاج نتائج ورقة VQAScore
استنساخ نتائج ورقة GenAI-Bench
استخدام GPT-4o لـ VQAScore
تنفيذ مقياس التسجيل الخاص بك
إنشاء النص (VQA) باستخدام CLIP-FlanT5

المعالجة المجمعة لمزيد من أزواج الصور والنص

مع مجموعة كبيرة من الصور M × N النصوص، يمكنك تسريع استخدام وظيفة batch_forward() .

 import t2v_metrics
clip_flant5_score = t2v_metrics . VQAScore ( model = 'clip-flant5-xxl' )

# The number of images and texts per dictionary must be consistent.
# E.g., the below example shows how to evaluate 4 generated images per text
dataset = [
  { 'images' : [ "images/0/DALLE3.png" , "images/0/Midjourney.jpg" , "images/0/SDXL.jpg" , "images/0/DeepFloyd.jpg" ], 'texts' : [ "The brown dog chases the black dog around the tree." ]},
  { 'images' : [ "images/1/DALLE3.png" , "images/1/Midjourney.jpg" , "images/1/SDXL.jpg" , "images/1/DeepFloyd.jpg" ], 'texts' : [ "Two cats sit at the window, the blue one intently watching the rain, the red one curled up asleep." ]},
  #...
]
scores = clip_flant5_score . batch_forward ( dataset = dataset , batch_size = 16 ) # (n_sample, 4, 1) tensor

تحقق من كافة النماذج المدعومة

نحن ندعم حاليًا تشغيل VQAScore مع CLIP-FlanT5، وLLaVA-1.5، وInstructBLIP. بالنسبة للاستئصال، نقوم أيضًا بتضمين CLIPScore وBLIPv2Score وPickScore وHPSv2Score وImageReward:

 llava_score = t2v_metrics . VQAScore ( model = 'llava-v1.5-13b' )
instructblip_score = t2v_metrics . VQAScore ( model = 'instructblip-flant5-xxl' )
clip_score = t2v_metrics . CLIPScore ( model = 'openai:ViT-L-14-336' )
blip_itm_score = t2v_metrics . ITMScore ( model = 'blip2-itm' ) 
pick_score = t2v_metrics . CLIPScore ( model = 'pickscore-v1' )
hpsv2_score = t2v_metrics . CLIPScore ( model = 'hpsv2' ) 
image_reward_score = t2v_metrics . ITMScore ( model = 'image-reward-v1' )

يمكنك التحقق من جميع النماذج المدعومة عن طريق تشغيل الأوامر التالية:

 print ( "VQAScore models:" )
t2v_metrics . list_all_vqascore_models ()

print ( "ITMScore models:" )
t2v_metrics . list_all_itmscore_models ()

print ( "CLIPScore models:" )
t2v_metrics . list_all_clipscore_models ()

تخصيص قالب الأسئلة والأجوبة (لـ VQAScore)

يؤثر السؤال والجواب قليلاً على النتيجة النهائية، كما هو موضح في ملحق ورقتنا. نحن نقدم قالبًا افتراضيًا بسيطًا لكل نموذج ولا ننصح بتغييره من أجل إمكانية التكرار. ومع ذلك، نريد أن نشير إلى أنه يمكن تعديل السؤال والجواب بسهولة. على سبيل المثال، يستخدم CLIP-FlanT5 وLLaVA-1.5 القالب التالي، والذي يمكن العثور عليه في t2v_metrics/models/vqascore_models/clip_t5_model.py:

 # {} will be replaced by the caption
default_question_template = 'Does this figure show "{}"? Please answer yes or no.'
default_answer_template = 'Yes'

يمكنك تخصيص القالب عن طريق تمرير معلمات question_template و answer_template إلى الدالتين forward() أو batch_forward() :

 # Use a different question for VQAScore
scores = clip_flant5_score ( images = images ,
                           texts = texts ,
                           question_template = 'Is this figure showing "{}"? Please answer yes or no.' ,
                           answer_template = 'Yes' )

يمكنك أيضًا حساب P(caption | image) (VisualGPTScore) بدلاً من P(answer | image, question):

 scores = clip_flant5_score ( images = images ,
                           texts = texts ,
                           question_template = "" , # no question
                           answer_template = "{}" ) # this computes P(caption | image)

إعادة إنتاج نتائج ورقة VQAScore

يتيح لك موقع eval.py الخاص بنا تشغيل 10 معايير لمحاذاة الصور/الرؤية/ثلاثية الأبعاد بسهولة (على سبيل المثال، Winoground/TIFA160/SeeTrue/StanfordT23D/T2VScore):

python eval.py --model clip-flant5-xxl # for VQAScore
python eval.py --model openai:ViT-L-14 # for CLIPScore

# You can optionally specify question/answer template, for example:
python eval.py --model clip-flant5-xxl --question " Is the figure showing '{}'? " --answer " Yes "

استنساخ نتائج ورقة GenAI-Bench

يمكن لـ genai_image_eval.py و genai_video_eval.py إنتاج نتائج GenAI-Bench. في genai_image_ranking.py الإضافي يمكن إعادة إنتاج نتائج GenAI-Rank:

 # GenAI-Bench
python genai_image_eval.py --model clip-flant5-xxl
python genai_video_eval.py --model clip-flant5-xxl

# GenAI-Rank
python genai_image_ranking.py --model clip-flant5-xxl --gen_model DALLE_3
python genai_image_ranking.py --model clip-flant5-xxl --gen_model SDXL_Base

استخدام GPT-4o لـ VQAScore!

قمنا بتطبيق VQAScore باستخدام GPT-4o لتحقيق أداء جديد متطور. الرجاء مراجعة t2v_metrics/gpt4_eval.py للحصول على مثال. إليك كيفية استخدامه في سطر الأوامر:

 openai_key = # Your OpenAI key
score_func = t2v_metrics . get_score_model ( model = "gpt-4o" , device = "cuda" , openai_key = openai_key , top_logprobs = 20 ) # We find top_logprobs=20 to be sufficient for most (image, text) samples. Consider increase this number if you get errors (the API cost will not increase).

تنفيذ مقياس التسجيل الخاص بك

يمكنك بسهولة تنفيذ مقياس التسجيل الخاص بك. على سبيل المثال، إذا كان لديك نموذج VQA تعتقد أنه أكثر فعالية، فيمكنك دمجه في الدليل على t2v_metrics/models/vqascore_models. للحصول على إرشادات، يرجى الرجوع إلى أمثلة تطبيقات LLaVA-1.5 وInstructBLIP كنقاط بداية.

إنشاء النص (VQA) باستخدام CLIP-FlanT5

لإنشاء النصوص (التسميات التوضيحية أو مهام VQA) باستخدام CLIP-FlanT5، يرجى استخدام الكود أدناه:

 import t2v_metrics
clip_flant5_score = t2v_metrics . VQAScore ( model = 'clip-flant5-xxl' )

images = [ "images/0.png" , "images/0.png" ] # A list of images
prompts = [ "Please describe this image: " , "Does the image show 'someone talks on the phone angrily while another person sits happily'?" ] # Corresponding prompts
clip_flant5_score . model . generate ( images = images , prompts = prompts )

الاقتباس

إذا وجدت هذا المستودع مفيدًا لبحثك، فيرجى استخدام ما يلي (للتحديث باستخدام معرف ArXiv).

 @article{lin2024evaluating,
  title={Evaluating Text-to-Visual Generation with Image-to-Text Generation},
  author={Lin, Zhiqiu and Pathak, Deepak and Li, Baiqi and Li, Jiayao and Xia, Xide and Neubig, Graham and Zhang, Pengchuan and Ramanan, Deva},
  journal={arXiv preprint arXiv:2404.01291},
  year={2024}
}