ดาวน์โหลด FLMR - ดาวน์โหลดซอร์สโค้ด FLMR

เอฟแอลเอ็มอาร์

การใช้ตัวแปลงฮักกิ้งเฟซของมัลติโมดัลรีทรีฟเวอร์แบบปฏิกิริยาล่าช้าแบบละเอียด

การดำเนินการอย่างเป็นทางการอยู่ที่นี่

ดูรายละเอียดรุ่นและจุดตรวจได้ที่นี่

ดูรายละเอียดสำหรับการสร้างชุดข้อมูลและการประเมินผลในรายงานได้ที่นี่

อัพเดท

[03/09/2024] เราได้อัปโหลดภาพที่ใช้ในเกณฑ์มาตรฐาน M2KR ที่นี่
[10/08/2024] เราได้รับคำขอมากมายเกี่ยวกับการเพิ่มความสามารถหลายภาษาให้กับ PreFLMR เราประกาศว่า ขณะนี้เรากำลังฝึกอบรม PreFLMR เวอร์ชันภาษาจีน และจะเผยแพร่ในเร็วๆ นี้ คอยติดตาม!
[05/06/2024] เราได้อัปเดตการใช้งานบางส่วน
- เพิ่มสคริปต์การประเมินผลที่สร้างผลลัพธ์ในรายงาน PreFLMR ที่นี่
- เพิ่มผลลัพธ์การวัดประสิทธิภาพที่อัปเดตพร้อมกับการใช้งานหม้อแปลงที่นี่
- เพิ่มสคริปต์ตัวอย่างเพื่อปรับแต่ง PreFLMR บนชุดข้อมูลการดึงข้อมูลแบบกำหนดเองที่นี่
- สิ่งสำคัญ : แก้ไขการแยกข้อมูลของ OVEN ในเกณฑ์มาตรฐาน M2KR และอัปเดตแต่ละรายการด้วยคำสั่งคงที่เพื่อให้แน่ใจว่าผลการประเมินจะไม่ได้รับผลกระทบจากการสุ่มตัวอย่างคำสั่ง โปรดลบแคชในเครื่องของคุณและดาวน์โหลดชุดข้อมูลอีกครั้ง

สารบัญ

เอฟแอลเอ็มอาร์
- อัพเดท
- สารบัญ
- แบบจำลองและผลลัพธ์เกณฑ์มาตรฐาน
- วิธีใช้แพ็คเกจนี้
  - สิ่งแวดล้อม
  - สร้างดัชนีคอลเลกชันเอกสารแบบกำหนดเอง
  - ค้นหาคอลเลกชั่นเอกสารแบบกำหนดเอง
  - การฝึกอบรมกับการเรียนรู้ที่ตรงกันข้าม
- ทางเลือก: ใช้ Transformers.AutoModel เพื่อโหลดโมเดลที่ผ่านการฝึกอบรมมาแล้ว
- ใช้สคริปต์ตัวอย่าง
  - ใช้ FLMR
  - [ใหม่!] ใช้ PreFLMR
  - [ใหม่!] ประเมินโมเดล PreFLMR บนการวัดประสิทธิภาพ M2KR ทั้งหมด
  - [ใหม่!] ปรับแต่งโมเดล PreFLMR บนชุดข้อมูลดาวน์สตรีม
    - ดำเนินการปรับแต่ง
    - เรียกใช้การทดสอบ
    - ตัวอย่างผลลัพธ์การปรับแต่ง
- บันทึก
- การอ้างอิง

แบบจำลองและผลลัพธ์เกณฑ์มาตรฐาน

แบบอย่าง	วิท รีคอล@10	IGLUE เรียกคืน@1	KVQA เรียกคืน@5	MSMARCO เรียกคืน@5	เตาอบ Recall@5	ลาวา รีคอล@1	EVQA เรียกคืน@5	EVQA หลอก Recall@5	OKVQA เรียกคืน@5	OKVQA หลอก Recall@5	ข้อมูลข่าวสาร Recall@5	Infoseek หลอก Recall@5
LinWeizheDragon/PreFLMR_ViT-G?	0.619	0.718	0.419	0.783	0.643	0.726	0.625	0.721	0.302	0.674	0.392	0.577
LinWeizheDragon/PreFLMR_ViT-L?	0.605	0.699	0.440	0.779	0.608	0.729	0.609	0.708	0.314	0.690	0.374	0.578
LinWeizheDragon/PreFLMR_ViT-B?	0.427	0.574	0.294	0.786	0.468	0.673	0.550	0.663	0.272	0.658	0.260	0.496

หมายเหตุ: เราแปลงจุดตรวจสอบจาก PyTorch เป็น Huggingface-transformers ซึ่งผลลัพธ์การวัดประสิทธิภาพแตกต่างจากตัวเลขที่รายงานในรายงานต้นฉบับเล็กน้อย คุณสามารถทำซ้ำผลลัพธ์ในรายงานข้างต้นโดยอ้างอิงจากคำแนะนำในเอกสารนี้

วิธีใช้แพ็คเกจนี้

สิ่งแวดล้อม

สร้าง virtualenv:

 conda create -n FLMR python=3.10 -y
conda activate FLMR

ติดตั้ง Pytorch:

 pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

ติดตั้งไฟส

 conda install -c pytorch -c nvidia faiss-gpu=1.7.4 mkl=2021 blas=1.0=mkl

ทดสอบว่า fais สร้างข้อผิดพลาดหรือไม่

 python -c "import faiss"

ติดตั้ง FLMR

 git clone https://github.com/LinWeizheDragon/FLMR.git
cd FLMR
pip install -e .

ติดตั้งเครื่องยนต์ ColBERT

 cd third_party/ColBERT
pip install -e .

ติดตั้งการพึ่งพาอื่น ๆ

 pip install ujson gitpython easydict ninja datasets transformers

สร้างดัชนีคอลเลกชันเอกสารแบบกำหนดเอง

โหลดโมเดลที่ผ่านการฝึกอบรมมาแล้ว

 import os
import torch
import pandas as pd
import numpy as np
from torchvision . transforms import ToPILImage
from transformers import AutoImageProcessor

from flmr import index_custom_collection
from flmr import FLMRQueryEncoderTokenizer , FLMRContextEncoderTokenizer , FLMRModelForRetrieval

# load models
checkpoint_path = "LinWeizheDragon/PreFLMR_ViT-G"
image_processor_name = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"

query_tokenizer = FLMRQueryEncoderTokenizer . from_pretrained ( checkpoint_path , subfolder = "query_tokenizer" )
context_tokenizer = FLMRContextEncoderTokenizer . from_pretrained (
    checkpoint_path , subfolder = "context_tokenizer"
)

model = FLMRModelForRetrieval . from_pretrained (
    checkpoint_path ,
    query_tokenizer = query_tokenizer ,
    context_tokenizer = context_tokenizer ,
)
image_processor = AutoImageProcessor . from_pretrained ( image_processor_name )

สร้างคอลเลกชันเอกสาร

 num_items = 100
feature_dim = 1664
passage_contents = [ f"This is test sentence { i } " for i in range ( num_items )]
# Option 1. text-only documents
custom_collection = passage_contents
# Option 2. multi-modal documents with pre-extracted image features
# passage_image_features = np.random.rand(num_items, feature_dim)
# custom_collection = [
#     (passage_content, passage_image_feature, None) for passage_content, passage_image_feature in zip(passage_contents, passage_image_features)
# ]
# Option 3. multi-modal documents with images
# random_images = torch.randn(num_items, 3, 224, 224)
# to_img = ToPILImage()
# if not os.path.exists("./test_images"):
#     os.makedirs("./test_images")
# for i, image in enumerate(random_images):
#     image = to_img(image)
#     image.save(os.path.join("./test_images", "{}.jpg".format(i)))

# image_paths = [os.path.join("./test_images", "{}.jpg".format(i)) for i in range(num_items)]

# custom_collection = [
#     (passage_content, None, image_path)
#     for passage_content, image_path in zip(passage_contents, image_paths)
# ]

เรียกใช้การสร้างดัชนีในคอลเลกชันที่กำหนดเอง

 index_custom_collection (
    custom_collection = custom_collection ,
    model = model ,
    index_root_path = "." ,
    index_experiment_name = "test_experiment" ,
    index_name = "test_index" ,
    nbits = 8 , # number of bits in compression
    doc_maxlen = 512 , # maximum allowed document length
    overwrite = True , # whether to overwrite existing indices
    use_gpu = False , # whether to enable GPU indexing
    indexing_batch_size = 64 ,
    model_temp_folder = "tmp" ,
    nranks = 1 , # number of GPUs used in indexing
)

ค้นหาคอลเลกชั่นเอกสารแบบกำหนดเอง

สร้างข้อมูลแบบสอบถามของเล่น

 num_queries = 2

query_instructions = [ f"instruction { i } " for i in range ( num_queries )]
query_texts = [ f" { query_instructions [ i ] } : query { i } " for i in range ( num_queries )]
query_images = torch . zeros ( num_queries , 3 , 224 , 224 )
query_encoding = query_tokenizer ( query_texts )
query_pixel_values = image_processor ( query_images , return_tensors = "pt" )[ 'pixel_values' ]

รับการฝังแบบสอบถามด้วยโมเดล

 inputs = dict (
    input_ids = query_encoding [ 'input_ids' ],
    attention_mask = query_encoding [ 'attention_mask' ],
    pixel_values = query_pixel_values ,
)

# Run model query encoding
res = model . query ( ** inputs )

queries = { i : query_texts [ i ] for i in range ( num_queries )}
query_embeddings = res . late_interaction_output

ค้นหาคอลเลกชัน

 from flmr import search_custom_collection , create_searcher

# initiate a searcher
searcher = create_searcher (
    index_root_path = "." ,
    index_experiment_name = "test_experiment" ,
    index_name = "test_index" ,
    nbits = 8 , # number of bits in compression
    use_gpu = True , # whether to enable GPU searching
)
# Search the custom collection
ranking = search_custom_collection (
    searcher = searcher ,
    queries = queries ,
    query_embeddings = query_embeddings ,
    num_document_to_retrieve = 5 , # how many documents to retrieve for each query
)

# Analyse retrieved documents
ranking_dict = ranking . todict ()
for i in range ( num_queries ):
    print ( f"Query { i } retrieved documents:" )
    retrieved_docs = ranking_dict [ i ]
    retrieved_docs_indices = [ doc [ 0 ] for doc in retrieved_docs ]
    retrieved_doc_scores = [ doc [ 2 ] for doc in retrieved_docs ]
    retrieved_doc_texts = [ passage_contents [ doc_idx ] for doc_idx in retrieved_docs_indices ]

    data = {
        "Confidence" : retrieved_doc_scores ,
        "Content" : retrieved_doc_texts ,
    }

    df = pd . DataFrame . from_dict ( data )

    print ( df )

การฝึกอบรมกับการเรียนรู้ที่ตรงกันข้าม

 import torch
from flmr import FLMRQueryEncoderTokenizer , FLMRContextEncoderTokenizer , FLMRModelForRetrieval

checkpoint_path = "LinWeizheDragon/PreFLMR_ViT-L"
image_processor_name = "openai/clip-vit-large-patch14"
query_tokenizer = FLMRQueryEncoderTokenizer . from_pretrained ( checkpoint_path , subfolder = "query_tokenizer" )
context_tokenizer = FLMRContextEncoderTokenizer . from_pretrained ( checkpoint_path , subfolder = "context_tokenizer" )

model = FLMRModelForRetrieval . from_pretrained ( checkpoint_path ,
                                                query_tokenizer = query_tokenizer ,
                                                context_tokenizer = context_tokenizer ,
                                                )

Q_encoding = query_tokenizer ([ "Using the provided image, obtain documents that address the subsequent question: What is the capital of France?" , "Extract documents linked to the question provided in conjunction with the image: What is the capital of China?" ])
D_encoding = context_tokenizer ([ "Paris is the capital of France." , "Beijing is the capital of China." ,
                            "Paris is the capital of France." , "Beijing is the capital of China." ])
Q_pixel_values = torch . zeros ( 2 , 3 , 224 , 224 )
inputs = dict (
    query_input_ids = Q_encoding [ 'input_ids' ],
    query_attention_mask = Q_encoding [ 'attention_mask' ],
    query_pixel_values = Q_pixel_values ,
    context_input_ids = D_encoding [ 'input_ids' ],
    context_attention_mask = D_encoding [ 'attention_mask' ],
    use_in_batch_negatives = True ,
)

res = model . forward ( ** inputs )
print ( res )

โปรดทราบ ว่าตัวอย่างในบล็อคโค้ดนี้มีไว้เพื่อการสาธิตเท่านั้น พวกเขาแสดงให้เห็นว่าแบบจำลองที่ได้รับการฝึกอบรมล่วงหน้าจะให้คะแนนสูงกว่าในการแก้ไขเอกสาร ในการฝึกอบรมจริง คุณจะต้องส่งเอกสารตามลำดับเสมอ "เอกสารเชิงบวกสำหรับแบบสอบถาม 1, doc1 เชิงลบสำหรับแบบสอบถาม 1, doc2 เชิงลบสำหรับแบบสอบถาม 1, ... doc เชิงบวกสำหรับแบบสอบถาม 2, doc1 เชิงลบสำหรับแบบสอบถาม 2, doc2 เชิงลบสำหรับแบบสอบถาม 2, …”. คุณอาจต้องการอ่านส่วนหลังซึ่งมีตัวอย่างสคริปต์การปรับแต่ง

ทางเลือก: ใช้ Transformers.AutoModel เพื่อโหลดโมเดลที่ผ่านการฝึกอบรมมาแล้ว

 pip install transformers

 from transformers import AutoConfig , AutoModel , AutoImageProcessor , AutoTokenizer
import torch

checkpoint_path = "LinWeizheDragon/PreFLMR_ViT-L"
image_processor_name = "openai/clip-vit-large-patch14"
query_tokenizer = AutoTokenizer . from_pretrained ( checkpoint_path , subfolder = "query_tokenizer" , trust_remote_code = True )
context_tokenizer = AutoTokenizer . from_pretrained ( checkpoint_path , subfolder = "context_tokenizer" , trust_remote_code = True )

model = AutoModel . from_pretrained ( checkpoint_path ,
                                query_tokenizer = query_tokenizer ,
                                context_tokenizer = context_tokenizer ,
                                trust_remote_code = True ,
                                )
image_processor = AutoImageProcessor . from_pretrained ( image_processor_name )

ใช้สคริปต์ตัวอย่าง

เรามีสคริปต์สองสคริปต์เพื่อแสดงให้เห็นว่าแบบจำลองที่ได้รับการฝึกล่วงหน้าสามารถนำมาใช้ในการประเมินได้อย่างไร:

examples/example_use_flmr.py : สคริปต์ตัวอย่างเพื่อประเมิน FLMR (พร้อม 10 ROI) บน OK-VQA
examples/example_use_preflmr.py : สคริปต์ตัวอย่างเพื่อประเมิน PreFLMR บน E-VQA

ใช้ FLMR

 cd examples/

ดาวน์โหลด KBVQA_data จากที่นี่ และแตกไฟล์โฟลเดอร์รูปภาพ รวมผลลัพธ์ ROI/คำอธิบายภาพ/การตรวจจับวัตถุแล้ว

รันคำสั่งต่อไปนี้ (ลบ --run_indexing หากคุณได้รันการจัดทำดัชนีแล้วครั้งหนึ่ง):

python example_use_flmr.py 
            --use_gpu --run_indexing 
            --index_root_path " . " 
            --index_name OKVQA_GS
            --experiment_name OKVQA_GS 
            --indexing_batch_size 64 
            --image_root_dir /path/to/KBVQA_data/ok-vqa/ 
            --dataset_path BByrneLab/OKVQA_FLMR_preprocessed_data 
            --passage_dataset_path BByrneLab/OKVQA_FLMR_preprocessed_GoogleSearch_passages 
            --use_split test 
            --nbits 8 
            --Ks 1 5 10 20 50 100 
            --checkpoint_path LinWeizheDragon/FLMR 
            --image_processor_name openai/clip-vit-base-patch32 
            --query_batch_size 8 
            --num_ROIs 9

[ใหม่!] ใช้ PreFLMR

คุณสามารถดาวน์โหลดภาพ E-VQA ได้จาก https://github.com/google-research/google-research/tree/master/encyclopedic_vqa เราจะเพิ่มลิงก์ชุดข้อมูลที่นี่เร็วๆ นี้

 cd examples/

รันคำสั่งต่อไปนี้ (ลบ --run_indexing หากคุณได้รันการจัดทำดัชนีแล้วครั้งหนึ่ง):

python example_use_preflmr.py 
            --use_gpu --run_indexing 
            --index_root_path " . " 
            --index_name EVQA_PreFLMR_ViT-G 
            --experiment_name EVQA 
            --indexing_batch_size 64 
            --image_root_dir /rds/project/rds-hirYTW1FQIw/shared_space/vqa_data/KBVQA_data/EVQA/eval_image/ 
            --dataset_hf_path BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR 
            --dataset EVQA 
            --use_split test 
            --nbits 8 
            --Ks 1 5 10 20 50 100 500 
            --checkpoint_path LinWeizheDragon/PreFLMR_ViT-G 
            --image_processor_name laion/CLIP-ViT-bigG-14-laion2B-39B-b160k 
            --query_batch_size 8 
            --compute_pseudo_recall

ที่นี่ เราอัปโหลดชุดข้อมูล M2KR ทั้งหมดลงในชุดข้อมูล HF ชุดเดียว BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR โดยมีชุดข้อมูลที่แตกต่างกันเป็นชุดย่อย หากต้องการสร้างผลลัพธ์ของชุดข้อมูลอื่นๆ ในรายงาน คุณสามารถเปลี่ยน --dataset เป็น OKVQA , KVQA , LLaVA , OVEN , Infoseek , WIT , IGLUE และ EVQA

อัปเดต :

เปิดใช้งาน --compute_pseudo_recall เพื่อคำนวณการเรียกคืนหลอกสำหรับชุดข้อมูลเช่น EVQA/OKVQA/Infoseek
Enable --Ks 1 5 10 20 50 100 500 : max(Ks) ต้องเป็น 500 เพื่อให้ตรงกับประสิทธิภาพที่รายงานในรายงาน PreFLMR

[ใหม่!] ประเมินโมเดล PreFLMR บนการวัดประสิทธิภาพ M2KR ทั้งหมด

เปลี่ยนเส้นทางรูทของรูปภาพใน examples/evaluate_all.sh และดำเนินการ:

 cd examples
bash evaluate_all.sh

รับรายงานโดย:

python report.py

[ใหม่!] ปรับแต่งโมเดล PreFLMR บนชุดข้อมูลดาวน์สตรีม

คุณจะต้องติดตั้ง pytorch-lightning:

 pip install pytorch-lightning==2.1.0

ดำเนินการปรับแต่ง

python example_finetune_preflmr.py 
    --image_root_dir /path/to/EVQA/images/ 
    --dataset_hf_path BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR 
    --dataset EVQA 
    --freeze_vit 
    --log_with_wandb 
    --model_save_path saved_models 
    --checkpoint_path LinWeizheDragon/PreFLMR_ViT-G 
    --image_processor_name laion/CLIP-ViT-bigG-14-laion2B-39B-b160k 
    --batch_size 8 
    --accumulate_grad_batches 8 
    --valid_batch_size 16 
    --test_batch_size 64 
    --mode train 
    --max_epochs 99999999 
    --learning_rate 0.000005 
    --warmup_steps 100 
    --accelerator auto 
    --devices auto 
    --strategy ddp_find_unused_parameters_true 
    --num_sanity_val_steps 2 
    --precision bf16 
    --val_check_interval 2000 
    --save_top_k -1

เรียกใช้การทดสอบ

python example_use_preflmr.py 
    --use_gpu --run_indexing 
    --index_root_path " . " 
    --index_name EVQA_PreFLMR_ViT-G_finetuned_model_step_10156 
    --experiment_name EVQA 
    --indexing_batch_size 64 
    --image_root_dir /path/to/EVQA/images/ 
    --dataset_hf_path BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR 
    --dataset EVQA 
    --use_split test 
    --nbits 8 
    --num_gpus 1 
    --Ks 1 5 10 20 50 100 500 
    --checkpoint_path saved_models/model_step_10156 
    --image_processor_name laion/CLIP-ViT-bigG-14-laion2B-39B-b160k 
    --query_batch_size 8

ตัวอย่างผลลัพธ์การปรับแต่ง

ด้วยการรันสคริปต์ข้างต้น เราจะสามารถรับประสิทธิภาพการปรับแต่งดังต่อไปนี้:

ขั้นตอน	Pseudo Recall@5 บน EVQA
2500	73.6
10,000	73.55
12000	74.21
14000	73.73

(จุดตรวจสอบที่มีการสูญเสียการตรวจสอบความถูกต้องต่ำถูกเลือกและทดสอบ ทำงานบน A100 GPU 2 ตัว)

ภาพหน้าจอ 2024-06-05 171340

บันทึก

โมเดล FLMR ถูกนำมาใช้ตามรูปแบบเอกสารของ transformers คุณสามารถดูเอกสารประกอบโดยละเอียดได้ในไฟล์การสร้างแบบจำลอง

การอ้างอิง

หากงานของเราช่วยคุณวิจัยได้ โปรดอ้างอิงรายงานของเราเกี่ยวกับ FLMR และ PreFLMR

 @inproceedings{
    lin2023finegrained,
    title={Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering},
    author={Weizhe Lin and Jinghong Chen and Jingbiao Mei and Alexandru Coca and Bill Byrne},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
    url={https://openreview.net/forum?id=IWWWulAX7g}
        }
        
@inproceedings{lin-etal-2024-preflmr,
    title = "{P}re{FLMR}: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers",
    author = "Lin, Weizhe  and
      Mei, Jingbiao  and
      Chen, Jinghong  and
      Byrne, Bill",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.289",
    pages = "5294--5316",
    abstract = "Large Multimodal Models (LMMs) excel in natural language and visual understanding but are challenged by exacting tasks such as Knowledge-based Visual Question Answering (KB-VQA) which involve the retrieval of relevant information from document collections to use in shaping answers to questions. We present an extensive training and evaluation framework, M2KR, for KB-VQA. M2KR contains a collection of vision and language tasks which we have incorporated into a single suite of benchmark tasks for training and evaluating general-purpose multi-modal retrievers. We use M2KR to develop PreFLMR, a pre-trained version of the recently developed Fine-grained Late-interaction Multi-modal Retriever (FLMR) approach to KB-VQA, and we report new state-of-the-art results across a range of tasks. We also present investigations into the scaling behaviors of PreFLMR intended to be useful in future developments in general-purpose multi-modal retrievers.",
}

ขยาย