FLMR

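The Huggingface-transformers implementation of the Fine-grained Late-interaction Multi-modal Retriever (FLMR).

The official implementation is here.

Details about the models and checkpoints can be found here.

Details about reproducing the datasets and evaluation in the paper can be found here.

Updates

[March 9, 2024] We have uploaded the images used in the M2KR benchmark here.
[October 8, 2024] We received many requests to add multilingual abilities to PreFLMR. We are now training a Chinese version of PreFLMR and will release it soon. Stay tuned!
[May 6, 2024] We made some updates to the implementation.
- Added an evaluation script that reproduces the results in the PreFLMR paper here.
- Added the updated benchmark results obtained with the transformers implementation here.
- Added an example script for fine-tuning PreFLMR on a custom retrieval dataset here.
- IMPORTANT: fixed the OVEN data splits in the M2KR benchmark and updated each entry with a fixed instruction, so that evaluation results are not affected by random sampling of instructions. Please delete your local cache and download the dataset again.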

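FLMR
- Updates
- Table of Contents
- Models and Benchmark Results
- How to use this package
  - Environment
  - Index a custom document collection
  - Search a custom document collection
  - Training with contrastive learning
- Alternative: use transformers.AutoModel to load pre-trained models
- Use example scripts
  - Use FLMR
  - [NEW!] Use PreFLMR
  - [NEW!] Evaluate the PreFLMR models on all M2KR benchmarks
  - [NEW!] Fine-tune the PreFLMR model on downstream datasets
    - Run fine-tuning
    - Run testing
    - Example fine-tuning results
- Notes
- Citation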

Models and Benchmark Results

Model	WIT Recall@10	IGLUE Recall@1	KVQA Recall@5	MSMARCO Recall@5	OVEN Recall@5	LLaVA Recall@1	EVQA Recall@5	EVQA Pseudo Recall@5	OKVQA Recall@5	OKVQA Pseudo Recall@5	Infoseek Recall@5	Infoseek Pseudo Recall@5
LinWeizheDragon/PreFLMR_ViT-G	0.619	0.718	0.419	0.783	0.643	0.726	0.625	0.721	0.302	0.674	0.392	0.577
LinWeizheDragon/PreFLMR_ViT-L	0.605	0.699	0.440	0.779	0.608	0.729	0.609	0.708	0.314	0.690	0.374	0.578
LinWeizheDragon/PreFLMR_ViT-B	0.427	0.574	0.294	0.786	0.468	0.673	0.550	0.663	0.272	0.658	0.260	0.496

Note: We converted the checkpoints from PyTorch to Huggingface-transformers, and the resulting benchmark numbers differ slightly from those reported in the original paper. You can reproduce the results above by following the instructions in this document.

How to use this package

Environment

Create a virtual environment with conda:

conda create -n FLMR python=3.10 -y
conda activate FLMR

Install PyTorch:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Install faiss:

conda install -c pytorch -c nvidia faiss-gpu=1.7.4 mkl=2021 blas=1.0=mkl

Test whether faiss imports without errors:

python -c "import faiss"

Install FLMR:

git clone https://github.com/LinWeizheDragon/FLMR.git
cd FLMR
pip install -e .

Install the ColBERT engine:

cd third_party/ColBERT
pip install -e .

Install the other dependencies:

pip install ujson gitpython easydict ninja datasets transformers
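
Once the steps above are complete, a quick check that every package can be located may save a failed run later. The following is a minimal sketch, not part of the FLMR package: the helper name find_missing_modules and the list of import names are assumptions made for illustration (for example, gitpython is imported as git, and the ColBERT engine is assumed to import as colbert).

```python
import importlib.util

# Import names assumed for the packages installed above (illustrative only).
# Note that the "gitpython" distribution is imported as "git".
REQUIRED_MODULES = [
    "torch", "faiss", "ujson", "git", "easydict",
    "datasets", "transformers", "flmr", "colbert",
]


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

The check only locates the modules and does not import them, so it will not catch a broken CUDA or faiss build; the python -c "import faiss" test above remains the more reliable check for faiss.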

Index a custom document collection

Load pre-trained models

import os
import torch
import pandas as pd
import numpy as np
from torchvision.transforms import ToPILImage
from transformers import AutoImageProcessor

from flmr import index_custom_collection
from flmr import FLMRQueryEncoderTokenizer, FLMRContextEncoderTokenizer, FLMRModelForRetrieval

# load models
checkpoint_path = "LinWeizheDragon/PreFLMR_ViT-G"
image_processor_name = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"

query_tokenizer = FLMRQueryEncoderTokenizer.from_pretrained(checkpoint_path, subfolder="query_tokenizer")
context_tokenizer = FLMRContextEncoderTokenizer.from_pretrained(
    checkpoint_path, subfolder="context_tokenizer"
)

model = FLMRModelForRetrieval.from_pretrained(
    checkpoint_path,
    query_tokenizer=query_tokenizer,
    context_tokenizer=context_tokenizer,
)
image_processor = AutoImageProcessor.from_pretrained(image_processor_name)

Create a document collection

num_items = 100
feature_dim = 1664
passage_contents = [f"This is test sentence {i}" for i in range(num_items)]
# Option 1. text-only documents
custom_collection = passage_contents
# Option 2. multi-modal documents with pre-extracted image features
# passage_image_features = np.random.rand(num_items, feature_dim)
# custom_collection = [
#     (passage_content, passage_image_feature, None) for passage_content, passage_image_feature in zip(passage_contents, passage_image_features)
# ]
# Option 3. multi-modal documents with images
# random_images = torch.randn(num_items, 3, 224, 224)
# to_img = ToPILImage()
# if not os.path.exists("./test_images"):
#     os.makedirs("./test_images")
# for i, image in enumerate(random_images):
#     image = to_img(image)
#     image.save(os.path.join("./test_images", "{}.jpg".format(i)))

# image_paths = [os.path.join("./test_images", "{}.jpg".format(i)) for i in range(num_items)]

# custom_collection = [
#     (passage_content, None, image_path)
#     for passage_content, image_path in zip(passage_contents, image_paths)
# ]
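
The three options above differ only in the shape of each collection entry: a plain string, a (text, image_feature, None) tuple, or a (text, None, image_path) tuple. The sketch below wraps that choice in one hypothetical helper, build_collection, which is not part of the flmr package. Rejecting entries that carry both a feature and a path is an assumption of this sketch, not a documented restriction.

```python
def build_collection(texts, image_features=None, image_paths=None):
    """Build a collection in one of the three entry layouts.

    - text only:               "passage text"
    - pre-extracted features:  (text, image_feature, None)
    - image files:             (text, None, image_path)
    """
    if image_features is not None and image_paths is not None:
        # Assumption of this sketch: use one image source per collection.
        raise ValueError("Provide image_features or image_paths, not both.")
    if image_features is not None:
        if len(image_features) != len(texts):
            raise ValueError("Need exactly one image feature per passage.")
        return [(text, feature, None) for text, feature in zip(texts, image_features)]
    if image_paths is not None:
        if len(image_paths) != len(texts):
            raise ValueError("Need exactly one image path per passage.")
        return [(text, None, path) for text, path in zip(texts, image_paths)]
    return list(texts)


texts = [f"This is test sentence {i}" for i in range(3)]
paths = [f"./test_images/{i}.jpg" for i in range(3)]
print(build_collection(texts)[0])
print(build_collection(texts, image_paths=paths)[0])
```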

Run indexing on the custom collection

index_custom_collection(
    custom_collection=custom_collection,
    model=model,
    index_root_path=".",
    index_experiment_name="test_experiment",
    index_name="test_index",
    nbits=8,  # number of bits in compression
    doc_maxlen=512,  # maximum allowed document length
    overwrite=True,  # whether to overwrite existing indices
    use_gpu=False,  # whether to enable GPU indexing
    indexing_batch_size=64,
    model_temp_folder="tmp",
    nranks=1,  # number of GPUs used in indexing
)

Search a custom document collection

Create toy query data

num_queries = 2

query_instructions = [f"instruction {i}" for i in range(num_queries)]
query_texts = [f"{query_instructions[i]} : query {i}" for i in range(num_queries)]
query_images = torch.zeros(num_queries, 3, 224, 224)
query_encoding = query_tokenizer(query_texts)
query_pixel_values = image_processor(query_images, return_tensors="pt")['pixel_values']

Obtain query embeddings with the model

inputs = dict(
    input_ids=query_encoding['input_ids'],
    attention_mask=query_encoding['attention_mask'],
    pixel_values=query_pixel_values,
)

# Run model query encoding
res = model.query(**inputs)

queries = {i: query_texts[i] for i in range(num_queries)}
query_embeddings = res.late_interaction_output
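
The query embeddings obtained above hold one vector per query token, which is what makes late-interaction scoring possible. As a rough illustration, the sketch below implements ColBERT-style MaxSim scoring in plain Python: each query token takes its best-matching document token, and the per-token maxima are summed. It assumes this standard formulation; the scoring inside the FLMR package may differ in details such as masking and normalization, and the function name maxsim_score is hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, doc_vectors):
    """Late-interaction relevance score.

    For every query token vector, take the highest similarity against any
    document token vector, then sum those maxima over the query tokens.
    """
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Toy 2-dimensional token embeddings (made-up numbers).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # matches both query tokens exactly
doc_b = [[1.0, 0.0], [0.5, 0.5]]  # matches the second query token only partly

print(maxsim_score(query, doc_a))
print(maxsim_score(query, doc_b))
```

Because each query token is matched independently, a document scores highly only when it covers all parts of the query, which is the fine-grained behaviour that single-vector retrievers lack.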

Search the collection

from flmr import search_custom_collection, create_searcher

# initiate a searcher
searcher = create_searcher(
    index_root_path=".",
    index_experiment_name="test_experiment",
    index_name="test_index",
    nbits=8,  # number of bits in compression
    use_gpu=True,  # whether to enable GPU searching
)
# Search the custom collection
ranking = search_custom_collection(
    searcher=searcher,
    queries=queries,
    query_embeddings=query_embeddings,
    num_document_to_retrieve=5,  # how many documents to retrieve for each query
)

# Analyse retrieved documents
ranking_dict = ranking.todict()
for i in range(num_queries):
    print(f"Query {i} retrieved documents:")
    retrieved_docs = ranking_dict[i]
    retrieved_docs_indices = [doc[0] for doc in retrieved_docs]
    retrieved_doc_scores = [doc[2] for doc in retrieved_docs]
    retrieved_doc_texts = [passage_contents[doc_idx] for doc_idx in retrieved_docs_indices]

    data = {
        "Confidence": retrieved_doc_scores,
        "Content": retrieved_doc_texts,
    }

    df = pd.DataFrame.from_dict(data)

    print(df)
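
The analysis loop above reads the document index from position 0 and the score from position 2 of each ranking entry. The sketch below does the same flattening without pandas, on a hand-written stand-in for ranking.todict(). It assumes each entry is a (doc_index, rank, score) tuple; the meaning of position 1 is an assumption here, and summarize_ranking is a hypothetical helper.

```python
def summarize_ranking(ranking_dict, passages):
    """Flatten {query_id: [(doc_index, rank, score), ...]} into a list of rows."""
    rows = []
    for query_id, retrieved in ranking_dict.items():
        for entry in retrieved:
            doc_index, score = entry[0], entry[2]  # same positions as the loop above
            rows.append(
                {"query": query_id, "doc": doc_index, "score": score, "text": passages[doc_index]}
            )
    return rows


# Hand-written stand-in for ranking.todict(); the numbers are made up.
toy_passages = ["passage zero", "passage one", "passage two"]
toy_ranking = {0: [(2, 1, 21.5), (0, 2, 19.0)], 1: [(1, 1, 23.25)]}

for row in summarize_ranking(toy_ranking, toy_passages):
    print(row)
```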

Training with contrastive learning

import torch
from flmr import FLMRQueryEncoderTokenizer, FLMRContextEncoderTokenizer, FLMRModelForRetrieval

checkpoint_path = "LinWeizheDragon/PreFLMR_ViT-L"
image_processor_name = "openai/clip-vit-large-patch14"
query_tokenizer = FLMRQueryEncoderTokenizer.from_pretrained(checkpoint_path, subfolder="query_tokenizer")
context_tokenizer = FLMRContextEncoderTokenizer.from_pretrained(checkpoint_path, subfolder="context_tokenizer")

model = FLMRModelForRetrieval.from_pretrained(
    checkpoint_path,
    query_tokenizer=query_tokenizer,
    context_tokenizer=context_tokenizer,
)

Q_encoding = query_tokenizer([
    "Using the provided image, obtain documents that address the subsequent question: What is the capital of France?",
    "Extract documents linked to the question provided in conjunction with the image: What is the capital of China?",
])
D_encoding = context_tokenizer([
    "Paris is the capital of France.", "Beijing is the capital of China.",
    "Paris is the capital of France.", "Beijing is the capital of China.",
])
Q_pixel_values = torch.zeros(2, 3, 224, 224)
inputs = dict(
    query_input_ids=Q_encoding['input_ids'],
    query_attention_mask=Q_encoding['attention_mask'],
    query_pixel_values=Q_pixel_values,
    context_input_ids=D_encoding['input_ids'],
    context_attention_mask=D_encoding['attention_mask'],
    use_in_batch_negatives=True,
)

res = model.forward(**inputs)
print(res)

Note that the examples in this code block are for demonstration purposes only. They show that the pre-trained model gives higher scores to the correct documents. In real training, you always need to pass in the documents in the order "positive doc for query1, negative doc1 for query1, negative doc2 for query1, ..., positive doc for query2, negative doc1 for query2, negative doc2 for query2, ...". You may also want to read the later section, which provides an example fine-tuning script.
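
The ordering rule above is easy to get wrong when documents are assembled by hand, so it may help to build the flat list programmatically. The following sketch is illustrative only: flatten_training_docs is a hypothetical helper, and requiring the same number of negatives for every query is an assumption made here so that the flat list has a regular layout.

```python
def flatten_training_docs(examples):
    """Flatten per-query (positive, negatives) pairs into the required order.

    Returns the flat document list and the position of each query's positive.
    """
    if not examples:
        return [], []
    num_negatives = len(examples[0][1])
    docs, positive_positions = [], []
    for positive, negatives in examples:
        if len(negatives) != num_negatives:
            # Assumption of this sketch: a fixed number of negatives per query.
            raise ValueError("Every query needs the same number of negative documents.")
        positive_positions.append(len(docs))  # the positive comes first in its group
        docs.append(positive)
        docs.extend(negatives)
    return docs, positive_positions


examples = [
    ("Paris is the capital of France.", ["Beijing is the capital of China."]),
    ("Beijing is the capital of China.", ["Paris is the capital of France."]),
]
docs, positive_positions = flatten_training_docs(examples)
print(docs)
print(positive_positions)
```

With one negative per query, the positives land at positions 0 and 2, and the resulting list can be passed to the context tokenizer in place of the hand-written one.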

Alternative: use transformers.AutoModel to load pre-trained models

pip install transformers

from transformers import AutoConfig, AutoModel, AutoImageProcessor, AutoTokenizer
import torch

checkpoint_path = "LinWeizheDragon/PreFLMR_ViT-L"
image_processor_name = "openai/clip-vit-large-patch14"
query_tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, subfolder="query_tokenizer", trust_remote_code=True)
context_tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, subfolder="context_tokenizer", trust_remote_code=True)

model = AutoModel.from_pretrained(
    checkpoint_path,
    query_tokenizer=query_tokenizer,
    context_tokenizer=context_tokenizer,
    trust_remote_code=True,
)
image_processor = AutoImageProcessor.from_pretrained(image_processor_name)

Use example scripts

We provide two scripts to show how the pre-trained models can be used in evaluation:

examples/example_use_flmr.py: an example script to evaluate FLMR (with 10 ROIs) on OK-VQA.
examples/example_use_preflmr.py: an example script to evaluate PreFLMR on E-VQA.

Use FLMR

cd examples/

Download KBVQA_data from here and unzip the image folders. The ROI, captioning, and object detection results are included.

Run the following command (remove --run_indexing if you have already run indexing once):

python example_use_flmr.py \
    --use_gpu --run_indexing \
    --index_root_path "." \
    --index_name OKVQA_GS \
    --experiment_name OKVQA_GS \
    --indexing_batch_size 64 \
    --image_root_dir /path/to/KBVQA_data/ok-vqa/ \
    --dataset_path BByrneLab/OKVQA_FLMR_preprocessed_data \
    --passage_dataset_path BByrneLab/OKVQA_FLMR_preprocessed_GoogleSearch_passages \
    --use_split test \
    --nbits 8 \
    --Ks 1 5 10 20 50 100 \
    --checkpoint_path LinWeizheDragon/FLMR \
    --image_processor_name openai/clip-vit-base-patch32 \
    --query_batch_size 8 \
    --num_ROIs 9

[NEW!] Use PreFLMR

The E-VQA images can be downloaded from https://github.com/google-research/google-research/tree/master/encyclopedic_vqa (a dataset link will be added here soon).

cd examples/

Run the following command (remove --run_indexing if you have already run indexing once):

python example_use_preflmr.py \
    --use_gpu --run_indexing \
    --index_root_path "." \
    --index_name EVQA_PreFLMR_ViT-G \
    --experiment_name EVQA \
    --indexing_batch_size 64 \
    --image_root_dir /rds/project/rds-hirYTW1FQIw/shared_space/vqa_data/KBVQA_data/EVQA/eval_image/ \
    --dataset_hf_path BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR \
    --dataset EVQA \
    --use_split test \
    --nbits 8 \
    --Ks 1 5 10 20 50 100 500 \
    --checkpoint_path LinWeizheDragon/PreFLMR_ViT-G \
    --image_processor_name laion/CLIP-ViT-bigG-14-laion2B-39B-b160k \
    --query_batch_size 8 \
    --compute_pseudo_recall

Here, all the M2KR datasets are uploaded into a single HF dataset, BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR, with the individual datasets as subsets. To reproduce the paper's results on the other datasets, change --dataset to one of OKVQA, KVQA, LLaVA, OVEN, Infoseek, WIT, IGLUE, or EVQA.

Updates:

Enable --compute_pseudo_recall to compute pseudo recall for datasets such as EVQA, OKVQA, and Infoseek.
Enable --Ks 1 5 10 20 50 100 500: max(Ks) needs to be 500 to match the performance reported in the PreFLMR paper.
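
To make the two metrics concrete, here is a minimal sketch of how Recall@K and pseudo Recall@K can be computed for a single query. It assumes the common definitions: a query counts as a hit for Recall@K when any of its top-K documents is a labeled positive, and as a hit for pseudo Recall@K when any of its top-K documents contains one of the answer strings. The example scripts may match answers differently (for example in normalization), and the function names and data below are hypothetical.

```python
def recall_at_k(retrieved_ids, positive_ids, k):
    """1.0 if any of the top-k retrieved ids is a labeled positive, else 0.0."""
    return float(any(doc_id in positive_ids for doc_id in retrieved_ids[:k]))


def pseudo_recall_at_k(retrieved_texts, answers, k):
    """1.0 if any of the top-k retrieved texts contains an answer string, else 0.0."""
    top_texts = [text.lower() for text in retrieved_texts[:k]]
    return float(any(answer.lower() in text for text in top_texts for answer in answers))


# Made-up retrieval result for one query, best document first.
retrieved_ids = ["d3", "d7", "d1"]
retrieved_texts = [
    "Berlin is the capital of Germany.",
    "Paris is the capital of France.",
    "France is a country in Europe.",
]
positive_ids = {"d1"}
answers = ["Paris"]

for k in (1, 2, 3):
    print(
        f"K={k}",
        "recall:", recall_at_k(retrieved_ids, positive_ids, k),
        "pseudo recall:", pseudo_recall_at_k(retrieved_texts, answers, k),
    )
```

In this toy case the pseudo metric becomes a hit at K=2, one rank earlier than the labeled-positive metric, because an unlabeled document already contains the answer. Dataset-level numbers are the mean of these per-query values.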

[NEW!] Evaluate the PreFLMR models on all M2KR benchmarks

Change the image root paths in examples/evaluate_all.sh and run:

cd examples
bash evaluate_all.sh

Obtain the report with:

python report.py

[NEW!] Fine-tune the PreFLMR model on downstream datasets

You will need to install pytorch-lightning:

pip install pytorch-lightning==2.1.0

Run fine-tuning

python example_finetune_preflmr.py \
    --image_root_dir /path/to/EVQA/images/ \
    --dataset_hf_path BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR \
    --dataset EVQA \
    --freeze_vit \
    --log_with_wandb \
    --model_save_path saved_models \
    --checkpoint_path LinWeizheDragon/PreFLMR_ViT-G \
    --image_processor_name laion/CLIP-ViT-bigG-14-laion2B-39B-b160k \
    --batch_size 8 \
    --accumulate_grad_batches 8 \
    --valid_batch_size 16 \
    --test_batch_size 64 \
    --mode train \
    --max_epochs 99999999 \
    --learning_rate 0.000005 \
    --warmup_steps 100 \
    --accelerator auto \
    --devices auto \
    --strategy ddp_find_unused_parameters_true \
    --num_sanity_val_steps 2 \
    --precision bf16 \
    --val_check_interval 2000 \
    --save_top_k -1
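
When adapting the command above to different hardware, it helps to keep the effective batch size in mind. The sketch below uses the usual convention for data-parallel training with gradient accumulation, and it assumes that --batch_size is a per-device value, which is an assumption about the script rather than something stated in this document.

```python
def effective_batch_size(per_device_batch_size, accumulate_grad_batches, num_devices):
    """Number of examples contributing to each optimizer step under data parallelism."""
    return per_device_batch_size * accumulate_grad_batches * num_devices


# Values taken from the command above (--batch_size 8, --accumulate_grad_batches 8),
# shown for one and two devices.
for num_devices in (1, 2):
    print(num_devices, "device(s):", effective_batch_size(8, 8, num_devices))
```

If you change the number of GPUs, adjusting --accumulate_grad_batches so that this product stays the same is one way to keep runs comparable.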

Run testing

python example_use_preflmr.py \
    --use_gpu --run_indexing \
    --index_root_path "." \
    --index_name EVQA_PreFLMR_ViT-G_finetuned_model_step_10156 \
    --experiment_name EVQA \
    --indexing_batch_size 64 \
    --image_root_dir /path/to/EVQA/images/ \
    --dataset_hf_path BByrneLab/multi_task_multi_modal_knowledge_retrieval_benchmark_M2KR \
    --dataset EVQA \
    --use_split test \
    --nbits 8 \
    --num_gpus 1 \
    --Ks 1 5 10 20 50 100 500 \
    --checkpoint_path saved_models/model_step_10156 \
    --image_processor_name laion/CLIP-ViT-bigG-14-laion2B-39B-b160k \
    --query_batch_size 8

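Example fine-tuning results

By running the above scripts, you should obtain the following fine-tuning performance:

Step	Pseudo Recall@5 on EVQA
2500	73.6
10000	73.55
12000	74.21
14000	73.73

(Checkpoints with low validation loss were selected and tested; the run used 2 A100 GPUs.)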

Notes

The FLMR model is implemented following the documentation style of transformers. You can find detailed documentation in the modeling files.

Citation

If our work helped your research, please cite our papers on FLMR and PreFLMR.

@inproceedings{
    lin2023finegrained,
    title={Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering},
    author={Weizhe Lin and Jinghong Chen and Jingbiao Mei and Alexandru Coca and Bill Byrne},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
    url={https://openreview.net/forum?id=IWWWulAX7g}
}

@inproceedings{lin-etal-2024-preflmr,
    title = "{P}re{FLMR}: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers",
    author = "Lin, Weizhe  and
      Mei, Jingbiao  and
      Chen, Jinghong  and
      Byrne, Bill",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.289",
    pages = "5294--5316",
    abstract = "Large Multimodal Models (LMMs) excel in natural language and visual understanding but are challenged by exacting tasks such as Knowledge-based Visual Question Answering (KB-VQA) which involve the retrieval of relevant information from document collections to use in shaping answers to questions. We present an extensive training and evaluation framework, M2KR, for KB-VQA. M2KR contains a collection of vision and language tasks which we have incorporated into a single suite of benchmark tasks for training and evaluating general-purpose multi-modal retrievers. We use M2KR to develop PreFLMR, a pre-trained version of the recently developed Fine-grained Late-interaction Multi-modal Retriever (FLMR) approach to KB-VQA, and we report new state-of-the-art results across a range of tasks. We also present investigations into the scaling behaviors of PreFLMR intended to be useful in future developments in general-purpose multi-modal retrievers.",
}
