LLaVAR

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun

Project Page

arXiv Link

 @misc{zhang2023llavar,
    title={LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding},
    author={Yanzhe Zhang and Ruiyi Zhang and Jiuxiang Gu and Yufan Zhou and Nedim Lipka and Diyi Yang and Tong Sun},
    year={2023},
    eprint={2306.17107},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

[UPDATE 08/01] Check out the ready-to-use model checkpoint and finetuning dataset from the community on Huggingface!

[UPDATE 07/21] Released the metadata of the LAION images used: pretraining/finetuning.

[UPDATE 07/12] Released the OCR evaluation results/script on the MME benchmark. LLaVAR increases the OCR score of LLaVA from 50 to 80.

[UPDATE 07/05] Data available on Huggingface.

[UPDATE 07/05] Model weight delta available on Huggingface.

[UPDATE 06/29] Initial release.

The main difference between our code and LLaVA's code is that we modified the training/testing/serving files to support Vicuna v1.1, which uses '</s>' as the separator instead of '###'.
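To make the separator change concrete, here is a minimal sketch of the two prompt layouts. The functions `render_v0` and `render_v1` are hypothetical helpers written for illustration, not code from LLaVA or LLaVAR; the role names and the single space between turns are assumptions modeled on common Vicuna prompt templates.

```python
def render_v0(system, turns, sep="###"):
    """Single-separator layout: every message ends with the same '###' marker."""
    out = system + sep
    for role, message in turns:
        out += f"{role}: {message}{sep}"
    return out


def render_v1(system, turns, sep=" ", sep2="</s>"):
    """Two-separator layout: user turns end with a space (assumed),
    assistant turns end with the '</s>' end-of-sequence token."""
    seps = [sep, sep2]
    out = system + seps[0]
    for i, (role, message) in enumerate(turns):
        # Even positions are user turns, odd positions are assistant turns.
        out += f"{role}: {message}{seps[i % 2]}"
    return out


old_style = render_v0("S", [("Human", "hi"), ("Assistant", "hello")])
new_style = render_v1("S", [("USER", "hi"), ("ASSISTANT", "hello")])
```

With '</s>' as the separator, the end of an assistant reply coincides with the tokenizer's end-of-sequence token, so generation can stop there instead of searching the output for a literal '###' string.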

Environment Setup

Prepare the environment and merge the model weights following LLaVA.

Model weight delta: Google Drive, Huggingface

This delta should be merged with LLaMA-13B.

After merging, add "v1" to your folder name and make sure the conversation mode "llava_v1" is used.
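The naming requirement above can be checked before launching anything. The sketch below uses a hypothetical helper, `conv_mode_for`, that is not part of LLaVA or LLaVAR; it assumes the "v1" marker in the folder name is what identifies a Vicuna v1.1 based checkpoint, and the example paths are placeholders.

```python
from pathlib import Path


def conv_mode_for(model_dir):
    """Return the conversation mode to use for a merged checkpoint folder.

    Assumption: the folder name must contain "v1", as the setup notes ask.
    """
    name = Path(model_dir).name.lower()
    if "v1" not in name:
        raise ValueError(
            f'folder name {name!r} lacks "v1"; rename the merged checkpoint folder'
        )
    return "llava_v1"


# Placeholder path for illustration only.
mode = conv_mode_for("/path/to/llavar-13b-v1")
```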

Training Data (Huggingface)

Our image data is already transformed into the LLaVA pretraining/finetuning format (the images have "fake" file names in the CC3M and COCO formats). You can download them and merge them into the LLaVA training sets.

Our instructions, on the other hand, already contain LLaVA's instructions.

Pretraining Images: Google Drive

Pretraining Instructions (595K + 422K): Google Drive

Finetuning Images: Google Drive

Finetuning Instructions (158K + 16K): Google Drive

Finetuning Instructions (158K + 20K): Google Drive
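Merging the downloaded images into an existing LLaVA image folder amounts to a file copy. The following is a minimal sketch, assuming both folders are flat directories of image files; `merge_images` is a hypothetical helper, not a script shipped with LLaVAR. It refuses to overwrite so that a name clash with an existing CC3M or COCO file is noticed before any file is copied.

```python
import shutil
from pathlib import Path


def merge_images(src_dir, dst_dir):
    """Copy every file from src_dir into dst_dir without overwriting.

    Returns the sorted names of the copied files.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in src.iterdir() if p.is_file())
    # Check for clashes first so a failure never leaves a half-merged folder.
    clashes = [p.name for p in files if (dst / p.name).exists()]
    if clashes:
        raise FileExistsError(f"{len(clashes)} file(s) already exist, e.g. {clashes[0]}")
    for path in files:
        shutil.copy2(path, dst / path.name)
    return [p.name for p in files]
```

The instruction files need no such step, since they already include LLaVA's instructions.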

Evaluation Data (Huggingface)

We collected 50 instruction-following questions and answers on 50 text-rich images from LAION, which can be used for GPT-4-based instruction-following evaluation.

Evaluation Images: Google Drive

GPT-4 Evaluation Contexts (595K + 422K): File

GPT-4 Evaluation Rules: File

Questions: File

GPT-4 Answers: File

Training Script

You should merge our pretraining images into the cc3m folder.

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    /path/to/LLaVA/llava/train/train_mem.py \
    --model_name_or_path /path/to/models/vicuna_13b_v1_1 \
    --data_path /path/to/chat_llavar.json \
    --image_folder /path/to/cc3m \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end \
    --bf16 True \
    --output_dir /path/to/checkpoint \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 4000 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --image_aspect_ratio 'pad' \
    --report_to wandb

You should merge our finetuning images into the coco2017 folder.

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    /path/to/LLaVA/llava/train/train_mem.py \
    --model_name_or_path /path/to/models/vicuna_13b_v1_1 \
    --data_path /path/to/llava_instruct_150k_llavar_16k.json \
    --image_folder /path/to/coco/images/train2017 \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter /path/to/mm_proj/llava-13b-pretrain.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir /path/to/checkpoint \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 8000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --image_aspect_ratio 'pad' \
    --report_to wandb
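The two commands above imply different global batch sizes, which matters when reproducing them on a different number of GPUs. The arithmetic can be sketched as follows; the helper names are hypothetical, and adjusting gradient accumulation to keep the global batch unchanged is a common practice assumed here, not something the authors prescribe.

```python
def effective_batch_size(num_gpus, per_device_batch, grad_accum_steps):
    """Samples contributing to one optimizer step across all processes."""
    return num_gpus * per_device_batch * grad_accum_steps


def accumulation_steps_for(target_batch, num_gpus, per_device_batch):
    """Gradient accumulation steps needed to reach target_batch exactly."""
    per_step = num_gpus * per_device_batch
    if target_batch % per_step:
        raise ValueError("target batch is not a multiple of num_gpus * per_device_batch")
    return target_batch // per_step


# Values taken from the pretraining and finetuning commands (8 processes each).
pretrain_batch = effective_batch_size(8, 8, 2)   # 128
finetune_batch = effective_batch_size(8, 4, 1)   # 32

# Example: matching the pretraining batch on 4 GPUs needs 4 accumulation steps.
steps_on_4_gpus = accumulation_steps_for(pretrain_batch, 4, 8)
```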

Evaluation Script

Instruction-following on COCO images.

python /path/to/LLaVA/llava/eval/model_vqa.py \
    --model-name /path/to/checkpoint \
    --question-file /path/to/LLaVA/playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
    --image-folder /path/to/coco2014/val2014 \
    --answers-file /path/to/qa90-answer-file.jsonl \
    --conv-mode "llava_v1"
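The question and answer files in this command use the .jsonl extension, which conventionally means one JSON object per line. A minimal sketch for loading such a file for inspection is shown below; `read_jsonl` is a hypothetical helper, and the keys inside each record depend on the evaluation script, so none are assumed here.

```python
import json


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `read_jsonl("/path/to/qa90-answer-file.jsonl")` would return one dict per generated answer.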

Instruction-following on a given image URL.

python -m llava.eval.run_llava \
    --model-name /path/to/checkpoint \
    --image-file "https://cdn.shopify.com/s/files/1/0057/3728/3618/products/a-man-called-otto_ezrjr0pm_480x.progressive.jpg" \
    --query "Who starred in the movie?"

For text-based VQA (from MultimodalOCR): after cloning their repository and preparing the data, you can put ./MultimodalOCR/Eval_LLaVAR.py in /your/path/to/MultimodalOCR/models/LLaVA/ and add our model to /your/path/to/MultimodalOCR/eval.py for evaluation.

Acknowledgement

The code base is mainly from the LLaVA project. Our evaluation is also built on the MultimodalOCR project.

For a better language decoder, you can also pay attention to the recent Vicuna model update.

 @article{liu2023llava,
    author      = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
    title       = {Visual Instruction Tuning},
    publisher   = {arXiv:2304.08485},
    year        = {2023}
}

@misc{liu2023hidden,
    title={On the Hidden Mystery of OCR in Large Multimodal Models},
    author={Yuliang Liu and Zhang Li and Hongliang Li and Wenwen Yu and Yang Liu and Biao Yang and Mingxin Huang and Dezhi Peng and Mingyu Liu and Mingrui Chen and Chunyuan Li and Xucheng Yin and Cheng-lin Liu and Lianwen Jin and Xiang Bai},
    year={2023},
    eprint={2305.07895},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

@misc{vicuna2023,
    title = {Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality},
    url = {https://lmsys.org/blog/2023-03-30-vicuna/},
    author = {Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P.},
    month = {March},
    year = {2023}
}

Additional Information

Version v1
Type Other source code
Update date 2024-12-23
Size 22.77MB
Source GitHub
