LLaVAR

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun

Project Page

Arxiv Link

@misc{zhang2023llavar,
    title={LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding},
    author={Yanzhe Zhang and Ruiyi Zhang and Jiuxiang Gu and Yufan Zhou and Nedim Lipka and Diyi Yang and Tong Sun},
    year={2023},
    eprint={2306.17107},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

[UPDATE 08/01] Check out the ready-to-use model checkpoint and the finetuning dataset from the community on Huggingface!

[UPDATE 07/21] Released the metadata of the LAION images used: pretraining/finetuning.

[UPDATE 07/12] Released the OCR evaluation results/script on the MME benchmark. LLaVAR increases the OCR score of LLaVA from 50 to 80.

[UPDATE 07/05] Data available on Huggingface.

[UPDATE 07/05] Model weight delta available on Huggingface.

[UPDATE 06/29] Initial release.

The main difference between our code and LLaVA's is that we modified the training/testing/serving files to support Vicuna v1.1, which uses '</s>' as the separator instead of '###'.
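To illustrate why the separator matters, here is a minimal sketch of trimming a generated answer at the separator. The helper `truncate_at_separator` and the sample string are hypothetical and are not taken from the LLaVA or LLaVAR code; the point is only that looking for '###' in output delimited by '</s>' leaves the end-of-turn marker in place.

```python
def truncate_at_separator(generated, sep):
    """Return the text before the first occurrence of sep (hypothetical helper)."""
    idx = generated.find(sep)
    return generated if idx == -1 else generated[:idx]


# Made-up model output that ends its turn with the Vicuna v1.1 separator.
raw_output = "The sign says OPEN.</s>"

# Matching separator: the answer is trimmed cleanly.
print(truncate_at_separator(raw_output, "</s>"))

# Mismatched separator: nothing is found, so the marker stays in the answer.
print(truncate_at_separator(raw_output, "###"))
```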

Environment Setup

Please prepare the environment and merge the model weights following LLaVA.

Model weight delta: Google Drive, Huggingface

The delta should be merged with LLaMA-13B.

After merging, please add "v1" to your folder name and make sure the conversation mode "llava_v1" is used.
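A small pre-flight check can catch a folder that was not renamed. The sketch below is an assumption about how one might enforce the naming rule above; `conv_mode_for` is a hypothetical helper, not a LLaVA API, and the paths are placeholders.

```python
from pathlib import Path


def conv_mode_for(model_dir):
    """Return "llava_v1" if the folder name contains "v1", else raise.

    Hypothetical helper: it only mirrors the naming rule stated above.
    """
    name = Path(model_dir).name.lower()
    if "v1" not in name:
        raise ValueError(f'folder name "{name}" should contain "v1"')
    return "llava_v1"


# Placeholder path for a merged checkpoint folder.
print(conv_mode_for("/path/to/LLaVAR-13b-v1"))
```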

Training Data (Huggingface)

Our image data is already converted to the LLaVA pretraining/finetuning format (the files have "fake" names that follow the CC3M and COCO naming schemes). You can download them and merge them into the LLaVA training sets.

Our instruction files, on the other hand, already include LLaVA's instructions.
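Merging the images amounts to copying the downloaded files into the existing LLaVA image folder. Below is a minimal sketch of one cautious way to do that; `merge_images` is a hypothetical helper, and skipping files whose names already exist is a design choice made here, not something the project prescribes.

```python
import shutil
from pathlib import Path


def merge_images(src_dir, dst_dir):
    """Copy files from src_dir into dst_dir without overwriting existing names.

    Returns two lists of file names: (copied, skipped).
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied, skipped = [], []
    for item in sorted(src.iterdir()):
        if not item.is_file():
            continue  # ignore sub-folders in this simple sketch
        target = dst / item.name
        if target.exists():
            skipped.append(item.name)  # keep the file that is already there
        else:
            shutil.copy2(item, target)
            copied.append(item.name)
    return copied, skipped


# Example call with placeholder paths:
# copied, skipped = merge_images("/path/to/llavar_pretrain_images", "/path/to/cc3m")
```

A non-empty `skipped` list would indicate a name collision worth inspecting before training.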

Pretraining Images: Google Drive

Pretraining Instructions (595K + 422K): Google Drive

Finetuning Images: Google Drive

Finetuning Instructions (158K + 16K): Google Drive

Finetuning Instructions (158K + 20K): Google Drive

Evaluation Data (Huggingface)

We collected 50 instruction-following questions and answers on 50 text-rich images from LAION, which can be used for GPT-4-based instruction-following evaluation.

Evaluation Images: Google Drive

GPT-4 Evaluation Contexts (595K + 422K): File

GPT-4 Evaluation Rules: File

Questions: File

GPT-4 Answers: File

Training Script

You should merge our pretraining images into the cc3m folder.

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    /path/to/LLaVA/llava/train/train_mem.py \
    --model_name_or_path /path/to/models/vicuna_13b_v1_1 \
    --data_path /path/to/chat_llavar.json \
    --image_folder /path/to/cc3m \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end \
    --bf16 True \
    --output_dir /path/to/checkpoint \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 4000 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --image_aspect_ratio 'pad' \
    --report_to wandb
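As a quick sanity check on the flags above, the effective (global) batch size per optimizer step can be worked out from the process count, the per-device batch size, and the gradient accumulation steps. The sketch assumes plain data parallelism across the 8 processes launched on a single node; `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(num_processes, per_device_batch, grad_accum_steps):
    """Samples contributing to one optimizer step under data parallelism."""
    return num_processes * per_device_batch * grad_accum_steps


# Values taken from the pretraining command: 8 processes, batch 8, accumulation 2.
pretrain_global_batch = effective_batch_size(8, 8, 2)
print(pretrain_global_batch)  # 128
```

If you train with fewer GPUs, raising `--gradient_accumulation_steps` by the same factor keeps this product unchanged.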

You should merge our finetuning images into the coco2017 folder.

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    /path/to/LLaVA/llava/train/train_mem.py \
    --model_name_or_path /path/to/models/vicuna_13b_v1_1 \
    --data_path /path/to/llava_instruct_150k_llavar_16k.json \
    --image_folder /path/to/coco/images/train2017 \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter /path/to/mm_proj/llava-13b-pretrain.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir /path/to/checkpoint \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 8000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --image_aspect_ratio 'pad' \
    --report_to wandb
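To get a feel for what `--warmup_ratio 0.03` and `--lr_scheduler_type "cosine"` do to the finetuning learning rate of 2e-5, here is a minimal sketch. It assumes the common shape of a linear warmup followed by a half-cosine decay to zero; the trainer's exact rounding may differ, and the total of 1000 steps is a made-up number for illustration.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Learning rate under linear warmup followed by half-cosine decay (sketch)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for step in (0, 15, 30, 500, 1000):
    print(step, lr_at_step(step, total))
```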

Evaluation Script

Instruction-following on COCO images.

python /path/to/LLaVA/llava/eval/model_vqa.py \
    --model-name /path/to/checkpoint \
    --question-file /path/to/LLaVA/playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
    --image-folder /path/to/coco2014/val2014 \
    --answers-file /path/to/qa90-answer-file.jsonl \
    --conv-mode "llava_v1"
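The `.jsonl` extension of the question and answer files suggests one JSON object per line. Assuming that layout, a minimal reader for inspecting the generated answers could look like the sketch below; `read_jsonl` is a hypothetical helper, and no particular field names are assumed.

```python
import json


def read_jsonl(path):
    """Load a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Example call with the placeholder path from the command above:
# answers = read_jsonl("/path/to/qa90-answer-file.jsonl")
# print(len(answers), sorted(answers[0].keys()))
```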

Instruction-following on a given image URL.

python -m llava.eval.run_llava \
    --model-name /path/to/checkpoint \
    --image-file "https://cdn.shopify.com/s/files/1/0057/3728/3618/products/a-man-called-otto_ezrjr0pm_480x.progressive.jpg" \
    --query "Who starred in the movie?"

For text-based VQA (from MultimodalOCR): after cloning their repo and preparing the data, you can put ./MultimodalOCR/Eval_LLaVAR.py in /your/path/to/MultimodalOCR/models/LLaVA/ and add our model to /your/path/to/MultimodalOCR/eval.py for evaluation.

Acknowledgement

The code base is mainly from the LLaVA project. Our evaluation is also built on the MultimodalOCR project.

For a better language decoder, you can also pay attention to the recent Vicuna model update.

@article{liu2023llava,
    author      = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
    title       = {Visual Instruction Tuning},
    publisher   = {arXiv:2304.08485},
    year        = {2023}
}

@misc{liu2023hidden,
    title={On the Hidden Mystery of OCR in Large Multimodal Models},
    author={Yuliang Liu and Zhang Li and Hongliang Li and Wenwen Yu and Yang Liu and Biao Yang and Mingxin Huang and Dezhi Peng and Mingyu Liu and Mingrui Chen and Chunyuan Li and Xucheng Yin and Cheng-lin Liu and Lianwen Jin and Xiang Bai},
    year={2023},
    eprint={2305.07895},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

@misc{vicuna2023,
    title = {Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality},
    url = {https://lmsys.org/blog/2023-03-30-vicuna/},
    author = {Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P.},
    month = {March},
    year = {2023}
}

