LRV Instruction下載 - LRV Instruction原始碼下載

LRV Instruction

Ai源碼

1.0.0

下載

透過穩健的指令調整減輕大型多模態模型中的幻覺 [ICLR 2024]

劉福曉、林凱文、李林傑、王劍鋒、Yaser Yacoob、王麗娟

[專案頁] [論文]

您可以在下面比較我們的型號和原始型號。如果線上簡報不起作用，請發送電子郵件[email protected] 。如果您發現我們的工作有趣，請引用我們的工作。謝謝！

 @article { liu2023aligning ,
  title = { Aligning Large Multi-Modal Model with Robust Instruction Tuning } ,
  author = { Liu, Fuxiao and Lin, Kevin and Li, Linjie and Wang, Jianfeng and Yacoob, Yaser and Wang, Lijuan } ,
  journal = { arXiv preprint arXiv:2306.14565 } ,
  year = { 2023 }
}
@article { liu2023hallusionbench ,
  title = { HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V (ision), LLaVA-1.5, and Other Multi-modality Models } ,
  author = { Liu, Fuxiao and Guan, Tianrui and Li, Zongxia and Chen, Lichang and Yacoob, Yaser and Manocha, Dinesh and Zhou, Tianyi } ,
  journal = { arXiv preprint arXiv:2310.14566 } ,
  year = { 2023 }
}
@article { liu2023mmc ,
  title = { MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning } ,
  author = { Liu, Fuxiao and Wang, Xiaoyang and Yao, Wenlin and Chen, Jianshu and Song, Kaiqiang and Cho, Sangwoo and Yacoob, Yaser and Yu, Dong } ,
  journal = { arXiv preprint arXiv:2311.10774 } ,
  year = { 2023 }
}

LRV-V1和LRV-V2都支援在V100 32GB上進行訓練。

[LRV-V2(Mplug-Owl) 演示], [mplug-owl 演示]

[LRV-V1(MiniGPT4) 示範]、[MiniGPT4-7B 示範]

更新

[03/13] 我們的論文「MMC: Advancing Multimodal Chart Understanding with LLMInstruction Tuning」被NAACL 2024接收。
[02/26] 我們的論文“HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challengeing for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models”是已接受CVPR 2024 。
[01/15] 我們的論文被ICLR 2024接收。相機就緒版本即將推出！
[11/15] 我們的論文「MMC：透過 LLM 指令調優推進多模態圖表理解」現已在 Arxiv 上發布。
[10/24] 請查看我們的新工作，對GPT4V 的失敗案例進行基準測試“HallusionBench：You See What You Think？或者You Think What You See？一個圖像上下文推理基準挑戰GPT-4V(ision)、LLaVA - 1.5 和其他多模態模型」（repo）。
[9/20]更多知識操控數據即將發布！
[8/24] 我們發布了一些圖表圖像的視覺指令資料（帶有知識操作），以增加資料集的多樣性。數據和圖像。
[8/17] LRV-Instruction V2的模型重量可從此處取得。
[8/16] 我們透過產生的 GPT4 發布了額外的180k視覺指令調優資料。您可以從這裡下載。我們的 LRV-Instruction 資料集總共包含320k視覺指令資料。
[8/14] 我們手動清理資料集。新版本可以從訓練集和評估集下載。
[8/05] 在 mplug-owl 上進行微調的LRV-Instruction V2在 MME 基準測試中取得了SOTA 結果。
[7/05] 在 MiniGPt4 上微調的 LRV-Instruction V1 發布！
[6/30] 我們的資料集在Hugging Face上可用。
[6/27] 我們的論文由 AK 發佈在推特上。
[6/26] 我們的技術報告可在 arxiv 上找到。

模型檢查點

型號名稱	骨幹	下載連結
LRV-指令V2	Mplug-貓頭鷹	關聯
LRV-指令V1	迷你GPT4	關聯

指令數據

型號名稱	操作說明	影像
輕軌指令	關聯	關聯
LRV指令(更多)	關聯	關聯
圖表說明	關聯	關聯

視覺指令資料（LRV-指令）

我們使用 GPT4 產生的30 萬條視覺指令更新資料集，涵蓋 16 個具有開放式指令和答案的視覺和語言任務。 LRV 指令包括正向指令和負向指令，以實現更穩健的視覺指令調整。我們資料集的圖像來自 Visual Genome。可以從這裡存取我們的數據。

 {'image_id': '2392588', 'question': 'Can you see a blue teapot on the white electric stove in the kitchen?', 'answer': 'There is no mention of a teapot on the white electric stove in the kitchen.', 'task': 'negative'}

對於每個實例， image_id指的是來自 Visual Genome 的映像。 question和answer是指指令-答案對。 task表示任務名稱。您可以從這裡下載圖像。

我們提供 GPT-4 查詢提示，以便更好地促進該領域的研究。請查看prompts夾以了解正例和負例的生成。 negative1_generation_prompt.txt包含使用不存在的元素操作產生負指令的提示。 negative2_generation_prompt.txt包含使用現有元素操作產生負指令的提示。您可以參考此處的程式碼來產生更多資料。請參閱我們的論文以了解更多詳細資訊。

LRV-Instruction 可以讓 LMM 具備說「不」的能力，並且即使 LRV-Instruction 資料集中沒有圖表影像，也可以提供正確的答案。

型號

?LRV-指令(V1)設定

LRV-Instruction(V1)基於MiniGPT4-7B。

1.克隆這個儲存庫

https://github.com/FuxiaoLiu/LRV-Instruction.git

2. 安裝包

conda env create -f environment.yml --name LRV
conda activate LRV

3. 準備駱駝毛重量

我們的模型在 MiniGPT-4 和 Vicuna-7B 上進行了微調。請參閱此處的說明來準備駱駝毛重量或從此處下載。然後，在 MiniGPT-4/minigpt4/configs/models/minigpt4.yaml 第 15 行設定 Vicuna 權重的路徑。

4. 準備模型的預訓練檢查點

從此處下載預訓練的檢查點

然後，在 MiniGPT-4/eval_configs/minigpt4_eval.yaml 的第 11 行設定預訓練檢查點的路徑。我們將來會發布 MiniGPT-4-13B 和 LLaVA 的檢查點。

5.設定資料集路徑

取得資料集後，然後在第5行的MiniGPT-4/minigpt4/configs/datasets/cc_sbu/align.yaml中設定資料集路徑的路徑。

 /MiniGPt-4/cc_sbu_align
├── image(Visual Genome images)
├── filter_cap.json

6. 本地演示

透過運行在本機電腦上嘗試我們微調模型的演示 demo.py

 cd ./MiniGPT-4
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0

您可以嘗試這裡的範例。

7. 模型推理

此處設定推理指令檔案的路徑，此處設定推理影像資料夾，此處設定輸出位置。我們在訓練過程中不會進行推理。

 cd ./MiniGPT-4
python inference.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0

?LRV-指令(V2)設定

LRV-Instruction(V2)是基於插件Owl-7B。

1.依照mplug-owl安裝環境。

我們在 8 V100 上對 mplug-owl 進行了微調。如果您在V100上實施時遇到任何問題，請隨時告訴我！

2. 下載檢查點

首先從鏈接下載 mplug-owl 的檢查點，並從這裡下載訓練好的 lora 模型權重。

3. 編輯程式碼

對於mplug-owl/serve/model_worker.py ，編輯以下程式碼，並在 lora_path 中輸入 lora 模型權重的路徑。

 self.image_processor = MplugOwlImageProcessor.from_pretrained(base_model)
self.tokenizer = AutoTokenizer.from_pretrained(base_model)
self.processor = MplugOwlProcessor(self.image_processor, self.tokenizer)
self.model = MplugOwlForConditionalGeneration.from_pretrained(
     base_model,
     load_in_8bit=load_in_8bit,
     torch_dtype=torch.bfloat16 if bf16 else torch.half,
     device_map="auto"
 )
self.tokenizer = self.processor.tokenizer

        
peft_config = LoraConfig(target_modules=r'.*language_model.*.(q_proj|v_proj)', inference_mode=False, r=8,lora_alpha=32, lora_dropout=0.05)
self.model = get_peft_model(self.model, peft_config)
lora_path = 'Your lora model path'
prefix_state_dict = torch.load(lora_path, map_location='cpu')
self.model.load_state_dict(prefix_state_dict)

4. 本地演示

當您在本機電腦中啟動演示時，您可能會發現沒有空間用於文字輸入。這是因為python和gradio之間的版本衝突。最簡單的解決方案是conda activate LRV

 python -m serve.web_server --base-model 'the mplug-owl checkpoint directory' --bf16

5. 模型推理

首先 git 從 mplug-owl 複製程式碼，用我們的/utils/model_worker.py取代/mplug/serve/model_worker.py並新增檔案/utils/inference.py 。然後編輯輸入資料檔案和影像資料夾路徑。最後運行：

 python -m serve.inference --base-model 'your checkpoint directory' --bf16

評估(GAVIE)

我們引入 GPT4 輔助視覺指令評估 (GAVIE) 作為一種更靈活、更強大的方法來測量 LMM 生成的幻覺，而不需要人工註釋的真實答案。 GPT4 將帶有邊界框座標的密集標題作為圖像內容，並比較人類指令和模型回應。然後我們要求 GPT4 充當聰明的老師，根據兩個標準對學生的答案進行評分（0-10）：（1）準確性：答案是否與圖像內容產生幻覺。 (2)相關性：回應是否直接遵循指令。 prompts/GAVIE.txt包含GAVIE的提示。

我們的評估集可在此處取得。

 {'image_id': '2380160', 'question': 'Identify the type of transportation infrastructure present in the scene.'}

對於每個實例， image_id指的是來自 Visual Genome 的映像。 instruction是指指令。 answer_gt指的是純文字 GPT4 的真實答案，但我們在評估中不使用它們。相反，我們使用純文字 GPT4 透過使用視覺基因組資料集中的密集標題和邊界框作為視覺內容來評估模型輸出。

若要評估模型輸出，請先從此處下載 vg 註解。其次根據這裡的程式碼產生評估提示。第三，將提示輸入 GPT4。

排行榜

GPT4(GPT4-32k-0314) 充當智慧教師，根據兩個標準對學生的答案進行評分 (0-10)。

(1)準確度：反應是否與影像內容產生幻覺。 (2)相關性：回應是否直接遵循指令。

方法	GAVIE-準確性	GAVIE-相關性
LLaVA1.0-7B	4.36	6.11
拉瓦1.5-7B	6.42	8.20
MiniGPT4-v1-7B	4.14	5.81
MiniGPT4-v2-7B	6.01	8.10
mPLUG-Owl-7B	4.84	6.35
指導BLIP-7B	5.93	7.34
MMGPT-7B	0.91	1.79
我們的7B	6.58	8.46

致謝

駱駝毛：駱駝毛的語言能力令人驚嘆。
MiniGPT4、LAVIS 和 mplug-owl：非常感謝 MiniGPT4、LAVIS 和 mplug-owl，我們的許多程式碼都是基於它們的！
很棒的多模式大型語言模型。 LMM 的調查非常有幫助！

引文

如果您發現我們的工作對您的研究和應用有用，請使用此 BibTeX 進行引用：

 @article { liu2023aligning ,
  title = { Aligning Large Multi-Modal Model with Robust Instruction Tuning } ,
  author = { Liu, Fuxiao and Lin, Kevin and Li, Linjie and Wang, Jianfeng and Yacoob, Yaser and Wang, Lijuan } ,
  journal = { arXiv preprint arXiv:2306.14565 } ,
  year = { 2023 }
}