t2v_metricsダウンロード - t2v_metricsソースコードのダウンロード

Text-to-Visual モデルを評価するための VQAScore [プロジェクトページ]

VQAScore を使用すると、研究者は 1 行の Python コードを使用して、テキストから画像、ビデオ、3D モデルへの変換を自動的に評価できます。

[VQAScore ページ] [VQAScore デモ] [GenAI-Bench ページ] [GenAI-Bench デモ] [CLIP-FlanT5 Model Zoo]

VQAScore: 画像からテキストへの生成によるテキストからビジュアルへの生成の評価(ECCV 2024) [論文] [HF]
Zhiqiu Lin、Deepak Pathak、Baiqi Li、Jiayao Li、Xide Xia、Graham Neubig、Pengchuan Zhang、Deva Ramanan

GenAI-Bench: 構成テキストからビジュアルへの生成の評価と改善(CVPR 2024、最優秀短編論文 @ SynData Workshop ) [論文] [HF]
Baiqi Li*、Zhiqiu Lin*、Deepak Pathak、Jiayao Li、Yixin Fei、Kewen Wu、Tiffany Ling、Xide Xia*、Pengchuan Zhang*、Graham Neubig*、Deva Ramanan* (*共同筆頭著者および共同上級著者)

ニュース

[2024/08/13] Google の Imagen3 レポートで、 VQAScoreが自動評価用の CLIPScore の最強の代替品として取り上げられました。 GenAI-Bench は、 Imagen3 の優れたプロンプト画像調整を示す重要なベンチマークの 1 つとして選ばれました。この功績に対して Google に敬意を表します。 [紙]
[2024/07/01] VQAScore がECCV 2024 に採択されました。
[2024/06/20] GenAI-BenchがCVPR'24 SynData WorkshopでBest Short Paperを受賞しました！【ワークショップサイト】。

VQAScore は、構成テキストプロンプトで CLIPScore や PickScore などの以前の指標を大幅に上回り、人間によるフィードバックや ChatGPT や GPT などの独自モデルを利用する従来技術 (ImageReward、HPSv2、TIFA、Davidsonian、VPEval、VIEScore など) よりもはるかに簡単です。 -4ビジョン。

クイックスタート

次の方法でパッケージをインストールします。

git clone https://github.com/linzhiqiu/t2v_metrics
cd t2v_metrics

conda create -n t2v python=3.10 -y
conda activate t2v
conda install pip -y

pip install torch torchvision torchaudio
pip install git+https://github.com/openai/CLIP.git
pip install -e . # local pip install

または、 pip install t2v-metrics経由でインストールすることもできます。

画像とテキストの位置合わせの VQAScore を計算するために必要なのは、次の Python コードだけです (スコアが高いほど、類似性が高いことを示します)。

 import t2v_metrics
clip_flant5_score = t2v_metrics . VQAScore ( model = 'clip-flant5-xxl' ) # our recommended scoring model

### For a single (image, text) pair
image = "images/0.png" # an image path in string format
text = "someone talks on the phone angrily while another person sits happily"
score = clip_flant5_score ( images = [ image ], texts = [ text ])

### Alternatively, if you want to calculate the pairwise similarity scores 
### between M images and N texts, run the following to return a M x N score tensor.
images = [ "images/0.png" , "images/1.png" ]
texts = [ "someone talks on the phone angrily while another person sits happily" ,
         "someone talks on the phone happily while another person sits angrily" ]
scores = clip_flant5_score ( images = images , texts = texts ) # scores[i][j] is the score between image i and text j

GPUとキャッシュに関する注意事項

GPU 使用率: デフォルトでは、このコードはマシン上の最初の cuda デバイスを使用します。 clip-flant5-xxlやllava-v1.5-13bなどの最大の VQAScore モデルには 40GB GPU をお勧めします。 GPU メモリが限られている場合は、 clip-flant5-xlやllava-v1.5-7bなどのより小さいモデルを検討してください。
キャッシュディレクトリ: t2v_metrics/constants.py のHF_CACHE_DIR更新することで、すべてのモデルチェックポイントを保存するキャッシュフォルダー (デフォルトは./hf_cache/ ) を変更できます。

高度な使用法

より多くの画像とテキストのペアのバッチ処理
サポートされているすべてのモデルを確認する
質問と回答のテンプレートのカスタマイズ (VQAScore 用)
VQAScore 論文の結果を再現する
GenAI-Bench の論文結果の再現
VQAScore に GPT-4o を使用する
独自のスコアリング指標の実装
CLIP-FlanT5を使用したテキスト生成（VQA）

より多くの画像とテキストのペアのバッチ処理

M 個の画像 x N 個のテキストからなる大きなバッチの場合、 batch_forward()関数を使用すると高速化できます。

 import t2v_metrics
clip_flant5_score = t2v_metrics . VQAScore ( model = 'clip-flant5-xxl' )

# The number of images and texts per dictionary must be consistent.
# E.g., the below example shows how to evaluate 4 generated images per text
dataset = [
  { 'images' : [ "images/0/DALLE3.png" , "images/0/Midjourney.jpg" , "images/0/SDXL.jpg" , "images/0/DeepFloyd.jpg" ], 'texts' : [ "The brown dog chases the black dog around the tree." ]},
  { 'images' : [ "images/1/DALLE3.png" , "images/1/Midjourney.jpg" , "images/1/SDXL.jpg" , "images/1/DeepFloyd.jpg" ], 'texts' : [ "Two cats sit at the window, the blue one intently watching the rain, the red one curled up asleep." ]},
  #...
]
scores = clip_flant5_score . batch_forward ( dataset = dataset , batch_size = 16 ) # (n_sample, 4, 1) tensor

サポートされているすべてのモデルを確認する

現在、CLIP-FlanT5、LLaVA-1.5、および InstructBLIP を使用した VQAScore の実行をサポートしています。アブレーションのために、CLIPScore、BLIPv2Score、PickScore、HPSv2Score、および ImageReward も含まれています。

 llava_score = t2v_metrics . VQAScore ( model = 'llava-v1.5-13b' )
instructblip_score = t2v_metrics . VQAScore ( model = 'instructblip-flant5-xxl' )
clip_score = t2v_metrics . CLIPScore ( model = 'openai:ViT-L-14-336' )
blip_itm_score = t2v_metrics . ITMScore ( model = 'blip2-itm' ) 
pick_score = t2v_metrics . CLIPScore ( model = 'pickscore-v1' )
hpsv2_score = t2v_metrics . CLIPScore ( model = 'hpsv2' ) 
image_reward_score = t2v_metrics . ITMScore ( model = 'image-reward-v1' )

以下のコマンドを実行すると、サポートされているすべてのモデルを確認できます。

 print ( "VQAScore models:" )
t2v_metrics . list_all_vqascore_models ()

print ( "ITMScore models:" )
t2v_metrics . list_all_itmscore_models ()

print ( "CLIPScore models:" )
t2v_metrics . list_all_clipscore_models ()

質問と回答のテンプレートのカスタマイズ (VQAScore 用)

論文の付録に示すように、質問と回答は最終スコアにわずかに影響します。各モデルには単純なデフォルトのテンプレートが提供されていますが、再現性を確保するためにこれを変更することはお勧めしません。ただし、質問と回答は簡単に変更できることを指摘しておきます。たとえば、CLIP-FlanT5 および LLaVA-1.5 は、t2v_metrics/models/vqascore_models/clip_t5_model.py にある次のテンプレートを使用します。

 # {} will be replaced by the caption
default_question_template = 'Does this figure show "{}"? Please answer yes or no.'
default_answer_template = 'Yes'

question_templateとanswer_templateパラメーターをforward()関数またはbatch_forward()関数に渡すことで、テンプレートをカスタマイズできます。

 # Use a different question for VQAScore
scores = clip_flant5_score ( images = images ,
                           texts = texts ,
                           question_template = 'Is this figure showing "{}"? Please answer yes or no.' ,
                           answer_template = 'Yes' )

P(answer | image, question) の代わりに P(caption | image) (VisualGPTScore) を計算することもできます。

 scores = clip_flant5_score ( images = images ,
                           texts = texts ,
                           question_template = "" , # no question
                           answer_template = "{}" ) # this computes P(caption | image)

VQAScore 論文の結果を再現する

eval.py を使用すると、10 個の画像/ビジョン/3D アライメントベンチマーク (例: Winoground/TIFA160/SeeTrue/StanfordT23D/T2VScore) を簡単に実行できます。

python eval.py --model clip-flant5-xxl # for VQAScore
python eval.py --model openai:ViT-L-14 # for CLIPScore

# You can optionally specify question/answer template, for example:
python eval.py --model clip-flant5-xxl --question " Is the figure showing '{}'? " --answer " Yes "

GenAI-Bench の論文結果の再現

genai_image_eval.py と genai_video_eval.py は GenAI-Bench の結果を再現できます。さらに genai_image_ranking.py は GenAI-Rank の結果を再現できます。

 # GenAI-Bench
python genai_image_eval.py --model clip-flant5-xxl
python genai_video_eval.py --model clip-flant5-xxl

# GenAI-Rank
python genai_image_ranking.py --model clip-flant5-xxl --gen_model DALLE_3
python genai_image_ranking.py --model clip-flant5-xxl --gen_model SDXL_Base

VQAScore に GPT-4o を使用!

GPT-4o を使用して VQAScore を実装し、新しい最先端のパフォーマンスを実現しました。例については、t2v_metrics/gpt4_eval.py を参照してください。コマンドラインでの使用方法は次のとおりです。

 openai_key = # Your OpenAI key
score_func = t2v_metrics . get_score_model ( model = "gpt-4o" , device = "cuda" , openai_key = openai_key , top_logprobs = 20 ) # We find top_logprobs=20 to be sufficient for most (image, text) samples. Consider increase this number if you get errors (the API cost will not increase).

独自のスコアリング指標の実装

独自のスコアリング指標を簡単に実装できます。たとえば、より効果的であると思われる VQA モデルがある場合は、それを t2v_metrics/models/vqascore_models のディレクトリに組み込むことができます。ガイダンスとして、開始点として LLaVA-1.5 および InstructBLIP の実装例を参照してください。

CLIP-FlanT5を使用したテキスト生成（VQA）

CLIP-FlanT5 を使用してテキスト (キャプションまたは VQA タスク) を生成するには、以下のコードを使用してください。

 import t2v_metrics
clip_flant5_score = t2v_metrics . VQAScore ( model = 'clip-flant5-xxl' )

images = [ "images/0.png" , "images/0.png" ] # A list of images
prompts = [ "Please describe this image: " , "Does the image show 'someone talks on the phone angrily while another person sits happily'?" ] # Corresponding prompts
clip_flant5_score . model . generate ( images = images , prompts = prompts )

引用

このリポジトリが研究に役立つと思われる場合は、以下を使用してください (ArXiv ID で更新するには)。

 @article{lin2024evaluating,
  title={Evaluating Text-to-Visual Generation with Image-to-Text Generation},
  author={Lin, Zhiqiu and Pathak, Deepak and Li, Baiqi and Li, Jiayao and Xia, Xide and Neubig, Graham and Zhang, Pengchuan and Ramanan, Deva},
  journal={arXiv preprint arXiv:2404.01291},
  year={2024}
}

謝辞

このリポジトリは、Richard Zhang による画質の自動評価用の Perceptual Metric (LPIPS) リポジトリからインスピレーションを得ています。

拡大する

t2v_metrics

Text-to-Visual モデルを評価するための VQAScore [プロジェクトページ]

ニュース

クイックスタート

GPUとキャッシュに関する注意事項

高度な使用法

より多くの画像とテキストのペアのバッチ処理

サポートされているすべてのモデルを確認する

質問と回答のテンプレートのカスタマイズ (VQAScore 用)

VQAScore 論文の結果を再現する

GenAI-Bench の論文結果の再現

VQAScore に GPT-4o を使用!

独自のスコアリング指標の実装

CLIP-FlanT5を使用したテキスト生成（VQA）

引用

謝辞

OpenCore_NO_ACPI_Build

nspanel_pro_tools_apk

kube state metrics

zkwork_aleo_gpu_worker

nextcloud_share_url_downloader

Lihua データ分析エンジン無料版 3.0_検索_ナビゲーション_コレクション_世論_ランキング_api

chat.petals.dev

GPT Prompt Templates

GPTyped

waymo open dataset

SmartTube

Sunamu

waymo open dataset

wp functions

termwind

t2v_metrics

Text-to-Visual モデルを評価するための VQAScore [プロジェクト ページ]

ニュース

クイックスタート

GPUとキャッシュに関する注意事項

高度な使用法

より多くの画像とテキストのペアのバッチ処理

サポートされているすべてのモデルを確認する

質問と回答のテンプレートのカスタマイズ (VQAScore 用)

VQAScore 論文の結果を再現する

GenAI-Bench の論文結果の再現

VQAScore に GPT-4o を使用!

独自のスコアリング指標の実装

CLIP-FlanT5を使用したテキスト生成（VQA）

引用

謝辞

Text-to-Visual モデルを評価するための VQAScore [プロジェクトページ]