HallusionBench下載 - HallusionBench原始碼下載

HallusionBench

其他源碼

1.0.0

下載

HallusionBench：大型視覺語言模型中糾纏語言幻覺和視覺錯覺的高級診斷套件 [CVPR 2024]

你看到你的想法了嗎？或者你認為你所看到的是什麼？ GPT-4V(ision)、LLaVA-1.5 和其他多模態模型的圖像上下文推理基準挑戰

天瑞關*、劉福曉*、吳希陽、冼瑞琪、李宗霞、劉曉宇、王希軍、陳立昌、黃芙蓉、Yaser Yacoob、Dinesh Manocha、週天一

更新

[09/20] 我們的論文「AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models」被EMNLP 2024接收。我們的程式碼和數據可以在 github 上找到。
[03/13] 我們的論文「MMC: Advancing Multimodal Chart Understanding with LLMInstruction Tuning」被NAACL 2024接收。
[02/26] 我們的 HallusionBench 被CVPR 2024接受。
[01/15] 我們的工作「Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning」被ICLR 2024接收。
[11/28] 論文全文已上傳，可在此處存取。資料集已擴展，排行榜已更新。
[11/13] LLaVA-1.5評估結果更新。更多模型結果即將推出！
[10/27]排行榜及評測代碼出爐！歡迎在我們的排行榜上更新您的模型！
[10/24] 包含案例分析和見解的早期報告可在此處取得。

歡迎大家將大型多模態模型（GPT-4V）的失敗案例貢獻給我們的社群！

大型語言模型（LLM）在與視覺模型對齊並整合到視覺語言模型（VLM）中後，可以為圖像推理任務帶來令人印象深刻的改進。最近發布的 GPT-4V(ison)、LLaVA-1.5 等證明了這一點。甚至是矛盾的）語言先於推理。相較之下，VLM 中的視覺模組比 LLM 弱，可能會導致誤導性的視覺表示，然後由 LLM 轉化為自信的錯誤。為了研究這兩類 VLM 錯誤，即語言幻覺和視錯覺，我們策劃了 HallusionBench，這是一個圖像上下文推理基準，即使對 GPT-4V 和 LLaVA-1.5 來說仍然具有挑戰性。我們對 HallusionBench 中的範例進行了詳細分析，這些分析為 VLM 的幻覺或幻覺以及未來如何改進它們提供了新穎的見解。

如果您發現我們的論文有用，請引用我們的論文：

 @misc { wu2024autohallusion ,
      title = { AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models } , 
      author = { Xiyang Wu and Tianrui Guan and Dianqi Li and Shuaiyi Huang and Xiaoyu Liu and Xijun Wang and Ruiqi Xian and Abhinav Shrivastava and Furong Huang and Jordan Lee Boyd-Graber and Tianyi Zhou and Dinesh Manocha } ,
      year = { 2024 } ,
      eprint = { 2406.10900 } ,
      archivePrefix = { arXiv } ,
      primaryClass = { cs.CV } ,
      url = { https://arxiv.org/abs/2406.10900 } , 
}
@InProceedings { Guan_2024_CVPR ,
    author    = { Guan, Tianrui and Liu, Fuxiao and Wu, Xiyang and Xian, Ruiqi and Li, Zongxia and Liu, Xiaoyu and Wang, Xijun and Chen, Lichang and Huang, Furong and Yacoob, Yaser and Manocha, Dinesh and Zhou, Tianyi } ,
    title     = { HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models } ,
    booktitle = { Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) } ,
    month     = { June } ,
    year      = { 2024 } ,
    pages     = { 14375-14385 }
}
@misc { liu2023mitigating ,
      title = { Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning } , 
      author = { Fuxiao Liu and Kevin Lin and Linjie Li and Jianfeng Wang and Yaser Yacoob and Lijuan Wang } ,
      year = { 2023 } ,
      eprint = { 2306.14565 } ,
      archivePrefix = { arXiv } ,
      primaryClass = { cs.CV }
}
@misc { liu2023mmc ,
      title = { MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning } , 
      author = { Fuxiao Liu and Xiaoyang Wang and Wenlin Yao and Jianshu Chen and Kaiqiang Song and Sangwoo Cho and Yaser Yacoob and Dong Yu } ,
      year = { 2023 } ,
      eprint = { 2311.10774 } ,
      archivePrefix = { arXiv } ,
      primaryClass = { cs.CL }
}

資料集下載

為了使評估簡單，我們僅以是/否問題的形式提供問題。

更新於	問題和註釋	人物	問題數	人物數量
2023 年 10 月 27 日	HallusionBench.json	幻覺_bench.zip	第254章	69

評估

克隆存儲庫。

 git clone https://github.com/tianyi-lab/HallusionBench.git
cd ./HallusionBench

下載圖片 Hallusion_bench.zip 並將資料夾解壓縮到同一目錄。
問題和圖像位置保存在./HallusionBench.json中。資料樣本如下：

 {'category': 'VD', 'subcategory': 'illusion', 'visual_input': '1', 'set_id': '0', 'figure_id': '0', 'sample_note': 'circle', 'question_id': '0', 'question': 'Is the right orange circle the same size as the left orange circle?', 'gt_answer_details': 'The right orange circle is the same size as the left orange circle.', 'gt_answer': '1', 'filename': './hallusion_bench/VD/illusion/0_0.png'}

關鍵的visual_input表示問題是否需要影像等視覺輸入。如果visual_input=1 ，則表示問題需要視覺輸入。如果visual_input=0 ，則表示問題不需要視覺輸入。這是純文字問題。

在./HallusionBench.json上執行模型並將輸出檔案儲存為./HallusionBench_result.json 。您需要在鍵'model_prediction'中新增模型的輸出。我們在此提供範例結果。
最後，執行以下程式碼進行評估：

 python evaluation.py

您可以透過編輯此處的程式碼，使用自己的 API 金鑰進行 GPT4 評估。

排行榜

定義

視覺相關（VD）問題：沒有視覺背景就沒有肯定答案的問題。
- 簡單：從互聯網獲得的原始圖像。
- 硬：根據原始影像編輯影像。
視覺補充（VS）問題：無需視覺輸入即可回答的問題；視覺組件僅提供補充資訊。
- 簡單：沒有視覺輸入。沒有幻覺的不確定答案也被認為是正確答案。
- 硬：透過視覺輸入。答案必須遵循提供的圖形和視覺上下文。

公制

每個圖形的準確性（一致性測試） ：基於每個圖形的準確性。為了確保模型真正理解圖像，我們基於同一圖形上的相同知識提出不同的問題，如果模型能夠正確回答所有問題，則認為它是正確的。例如，模型不應該對「A 比 B 大嗎？」的問題給予不一致的答案。和「B比A小嗎？」。
每個問題的準確性：所有問題的準確性，包括簡單問題和困難問題。
每個問題對的準確性：我們對相似圖像（或有圖像和沒有圖像）提出相同的問題。我們將不同視覺上下文中的相同問題文字視為問題對（通常它們帶有一個簡單問題和相應的困難問題）。此指標計算所有問題對的準確性。

模型	問題對 Acc	圖 Acc	簡單問題ACC	困難問題 Acc	問題附件	傑森
GPT4V 2023 年 9 月 25 日版本（人類評估）	31.42	44.22	79.56	38.37	67.58	VD、VS
GPT4V 2023 年 9 月 25 日版本（GPT 評估）	28.79	39.88	75.60	37.67	65.28	VD、VS
克勞德 3 （GPT 評估）	21.76	28.61	55.16	41.40	56.86	VD、VS
LLaVA-1.5 （人類評估）	9.45	25.43	50.77	29.07	47.12	VD、VS
LLaVA-1.5 （GPT 評估）	10.55	24.86	49.67	29.77	46.94	VD、VS
雙子座專業視覺 2023 年 12 月版本（GPT 評估）	7.69	8.67	35.60	30.23	36.85	VD、VS
GUA_VL （GPT 評估）	16.70	23.12	53.63	39.77	51.82	VD、VS
BLIP2-T5 （GPT 評估）	15.16	20.52	45.49	43.49	48.09	VD、VS
Qwen-VL （GPT 評估）	5.93	6.65	31.43	24.88	39.15	VD、VS
開放-火烈鳥（GPT 評估）	6.37	11.27	39.56	27.21	38.44	VD、VS
迷你GPT5 （GPT 評估）	10.55	9.83	36.04	28.37	40.30	VD、VS
迷你GPT4 （GPT 評估）	8.79	10.12	31.87	27.67	35.78	VD、VS
指導BLIP （GPT 評估）	9.45	10.11	35.60	45.12	45.26	VD、VS
BLIP2 （GPT 評估）	5.05	12.43	33.85	40.70	40.48	VD、VS
mPLUG_Owl-v2 （GPT 評估）	13.85	19.94	44.84	39.07	47.30	VD、VS
mPLUG_Owl-v1 （GPT 評估）	9.45	10.40	39.34	29.77	43.93	VD、VS
LRV_指令（GPT 評估）	8.79	13.01	39.78	27.44	42.78	VD、VS
維LT （GPT 評估）	8.3516	11.2717	37.8022	45.3488	44.4641	VD、VS
吉特（GPT 評估）	5.27	6.36	26.81	31.86	34.37	VD、VS

在排行榜上重現 GPT4V 結果

我們使用註釋保存了 GPT4V 的輸出。將HallusionBench.tsv放入此儲存庫的根目錄中，或將 gpt4v_benchmark.py 中的input_file_name設定為 HallusionBench.tsv 檔案的位置。
（可選）如果您無法存取 GPT API，則無需運行它，因為我們已保存評估結果。可以下載它們作為 Visual Dependent 和 Visual Supplement。將json檔案放在這個repo的根目錄下，或是將gpt4v_benchmark.py中的save_json_path_vd和save_json_path_vd設定到各自的位置。
運行python gpt4v_benchmark.py 。