Project Credits | Otter Paper | OtterHD Paper | MIMIC-IT Paper
Checkpoints:
For who in the mainland China: |
Disclaimer: The code may not be perfectly polished and refactored, but all opensourced codes are tested and runnable as we also use the code to support our research. If you have any questions, please feel free to open an issue. We are eagerly looking forward to suggestions and PRs to improve the code quality.
[2023-11]: Supporting GPT4V's Evaluation on 8 Benchmarks; Anouncing OtterHD-8B, improved from Fuyu-8B. Checkout OtterHD for details.
datasets:
- name: magnifierbench
split: test
prompt: Answer with the option's letter from the given choices directly.
api_key: [Your API Key] # GPT4 or GPT3.5 to evaluate the answers and ground truth.
debug: true # put debug=true will save the model response in log file.
- name: mme
split: test
debug: true
- name: mmbench
split: test
debug: true
models:
- name: gpt4v
api_key: [Your API Key] # to call GPT4V model.
IMAGE_TEXT: # Group name should be in [IMAGE_TEXT, TEXT_ONLY, IMAGE_TEXT_IN_CONTEXT]
LADD: # Dataset name can be assigned at any name you want
mimicit_path: azure_storage/json/LA/LADD_instructions.json # Path of the instruction json file
images_path: azure_storage/Parquets/LA.parquet # Path of the image parquet file
num_samples: -1 # Number of samples you want to use, -1 means use all samples, if not set, default is -1.
M3IT_CAPTIONING:
mimicit_path: azure_storage/json/M3IT/captioning/coco/coco_instructions.json
images_path: azure_storage/Parquets/coco.parquet
num_samples: 20000
[2023-08]
[2023-07]: Anouncing MIMIC-IT dataset for multiple interleaved image-text/video instruction tuning.
[2023-06]
frame tensors
were mistakenly unsqueezed to a wrong vision_x
.
Make sure to adjust the
sys.path.append("../..")
correctly to accessotter.modeling_otter
in order to launch the model.
Large Language Models (LLMs) have demonstrated exceptional universal aptitude as few/zero-shot learners for numerous tasks, owing to their pre-training on extensive text data. Among these LLMs, GPT-3 stands out as a prominent model with significant capabilities. Additionally, variants of GPT-3, namely InstructGPT and ChatGPT, have proven effective in interpreting natural language instructions to perform complex real-world tasks, thanks to instruction tuning.
Motivated by the upstream interleaved format pretraining of the Flamingo model, we present ? Otter, a multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo). We train our Otter in an in-context instruction tuning way on our proposed MI-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. Otter showcases improved instruction-following and in-context learning ability in both images and videos.
MIMIC-IT enables the application of egocentric visual assistant model that can serve that can answer your questions like Hey, Do you think I left my keys on the table?. Harness the power of MIMIC-IT to unlock the full potential of your AI-driven visual assistant and elevate your interactive vision-language tasks to new heights.
We also introduce Syphus, an automated pipeline for generating high-quality instruction-response pairs in multiple languages. Building upon the framework proposed by LLaVA, we utilize ChatGPT to generate instruction-response pairs based on visual content. To ensure the quality of the generated instruction-response pairs, our pipeline incorporates system messages, visual annotations, and in-context examples as prompts for ChatGPT.
For more details, please check the MIMIC-IT dataset.
Otter is designed to support multi-modal in-context instruction tuning based on the OpenFlamingo model, which involves conditioning the language model on the corresponding media, such as an image that corresponds to a caption or an instruction-response pair.
We train Otter on MIMIC-IT dataset with approximately 2.8 million in-context instruction-response pairs, which are structured into a cohesive template to facilitate various tasks. Otter supports videos inputs (frames are arranged as original Flamingo's implementation) and multiple images inputs as in-context examples, which is the first multi-modal instruction tuned model.
The following template encompasses images, user instructions, and model-generated responses, utilizing the User
and GPT
role labels to enable seamless user-assistant interactions.
prompt = f"<image>User: {instruction} GPT:<answer> {response}<endofchunk>"
Training the Otter model on the MIMIC-IT dataset allows it to acquire different capacities, as demonstrated by the LA and SD tasks. Trained on the LA task, the model exhibits exceptional scene comprehension, reasoning abilities, and multi-round conversation capabilities.
# multi-round of conversation
prompt = f"<image>User: {first_instruction} GPT:<answer> {first_response}<endofchunk>User: {second_instruction} GPT:<answer>"
Regarding the concept of organizing visual-language in-context examples, we demonstrate here the acquired ability of the Otter model to follow inter-contextual instructions after training on the LA-T2T task. The organized input data format is as follows:
# Multiple in-context example with similar instructions
prompt = f"<image>User:{ict_first_instruction} GPT: <answer>{ict_first_response}<|endofchunk|><image>User:{ict_second_instruction} GPT: <answer>{ict_second_response}<|endofchunk|><image>User:{query_instruction} GPT: <answer>"
For more details, please refer to our paper's appendix for other tasks.
conda env create -f environment.yml
. Especially to make sure the transformers>=4.28.0
, accelerate>=0.18.0
.After configuring environment, you can use the ? Flamingo model / ? Otter model as a ? Hugging Face model with only a few lines! One-click and then model configs/weights are downloaded automatically. Please refer to Huggingface Otter/Flamingo for details.
Otter is trained based on OpenFlamingo. You may need to use converted weights at luodian/OTTER-9B-INIT or luodian/OTTER-MPT7B-Init. They are respectively converted from OpenFlamingo-LLaMA7B-v1 and OpenFlamingo-MPT7B-v2, we added a <answer>
token for Otter's downstream instruction tuning.
You may also use any trained Otter weights to start with your training on top of ours, see them at Otter Weights. You can refer to MIMIC-IT for preparing image/instruction/train json files.
export PYTHONPATH=.
RUN_NAME="Otter_MPT7B"
GPU=8
WORKERS=$((${GPU}*2))
echo "Using ${GPU} GPUs and ${WORKERS} workers"
echo "Running ${RUN_NAME}"
accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_zero3.yaml
--num_processes=${GPU}
pipeline/train/instruction_following.py
--pretrained_model_name_or_path=luodian/OTTER-MPT7B-Init
--model_name=otter
--instruction_format=simple
--training_data_yaml=./shared_scripts/Demo_Data.yaml
--batch_size=8
--num_epochs=3
--report_to_wandb
--wandb_entity=ntu-slab
--external_save_dir=./checkpoints
--run_name=${RUN_NAME}
--wandb_project=Otter_MPTV
--workers=${WORKERS}
--lr_scheduler=cosine
--learning_rate=2e-5
--warmup_steps_ratio=0.01
--save_hf_model
--max_seq_len=1024
If you found this repository useful, please consider citing:
@article{li2023otter,
title={Otter: A Multi-Modal Model with In-Context Instruction Tuning},
author={Li, Bo and Zhang, Yuanhan and Chen, Liangyu and Wang, Jinghao and Yang, Jingkang and Liu, Ziwei},
journal={arXiv preprint arXiv:2305.03726},
year={2023}
}
@article{li2023mimicit,
title={MIMIC-IT: Multi-Modal In-Context Instruction Tuning},
author={Bo Li and Yuanhan Zhang and Liangyu Chen and Jinghao Wang and Fanyi Pu and Jingkang Yang and Chunyuan Li and Ziwei Liu},
year={2023},
eprint={2306.05425},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
We thank Jack Hessel for the advise and support, as well as the OpenFlamingo team for their great contribution to the open source community.
Huge accolades to Flamingo and OpenFlamingo team for the work on this great architecture.