Although great progress has been made by recent LLM-based table understanding methods, they rely heavily on the premise that given tables must be converted into a certain text sequence (such as Markdown or HTML) to serve as model input. However, it is difficult to access high-quality textual table representations in some real-world scenarios like scanned documents and webpage screentshots, and table images are much more accessible. Therefore, how to directly understand tables using intuitive visual information is a crucial and urgent challenge for developing more practical applications.
Facing the above challenge, we propose the multimodal table understanding problem, where the model is required to generate correct responses to different table-related requests (e.g., questions) in an end-to-end fashion based on the table image. Correspondingly, we construct MMTab, the first open-source large-scale dataset for multimodal table understanding problem, which can support both the training and evaluation of generalist MLLMs towards multimodal table understanding. Based on the curated MMTab dataset, we develop a versatile tabular MLLM named Table-LLaVA with an enhanced two-stage training paradigm of LLaVA v1.5. Table-LLaVA beats strong MLLM baselines on 17 held-in and 6 held-out benchmarks, and is even competitive with the powerful GPT-4V on 14 benchmarks under a subset of test samples. The right figure shows an intuitive comparison of Table LLaVA 7B and existing MLLMs on various multimodal table understanding benchmarks.
We constructed MMTab based on 14 publicly available table datasets of 8 domains. We carefully design scripts to convert original textual tables in these datasets into table images highlighting a broad coverage of table structures and styles, and transform all task-specific samples into multimodal instruction-tuning samples with a unified format of <table image, input request, output response>
. The resulting dataset contains three parts and can be downloaded from the Hugging Face Dataset. During the dataset
construction, data augmentations at multiple levels (e.g., table-level, task-level) were adopted to further improve the data diversity.
Dataset Split | #Table Images | #Samples |
---|---|---|
MMTab-pre | 97K | 150K table recognition samples for pre-training |
MMTab-instruct | 82K | 232K samples of 14 table-based tasks for instruction-tuning |
MMTab-eval | 23K | 45K samples of 17 held-in benchmarks and 4K samples of 7 held-out benchmarks for evaluation |
Dataset examples are shown in the following figure and more examples are shown in the Appendix A in the original paper.
Table LLaVA follows the LLaVA v1.5 architecture, with CLIP-ViT-L-336px as the visual encoder (336*336 image resolution), Vicuna-v1.5-7B or Vicuna-v1.5-13B as the base LLM and a two-layer MLP as the vision-language connector. The saved model checkpoints can be downloaded from the following Hugging Face Repository:
Version | Size | Schedule | Base LLM | Vision Encoder | Projection layer | Checkpoints |
---|---|---|---|---|---|---|
Table LLaVA | 7B | full_finetune-1_epoch | Vicuna-v1.5-7B | CLIP-ViT-L-336px | MLP-2x | SpursgoZmy/table-llava-v1.5-7b |
Table LLaVA | 13B | full_finetune-1_epoch | Vicuna-v1.5-13B | CLIP-ViT-L-336px | MLP-2x | SpursgoZmy/table-llava-v1.5-13b |
pretrained_mm_projector of Table LLaVA 7B | 5M | full_finetune-1_epoch | Vicuna-v1.5-7B | CLIP-ViT-L-336px | MLP-2x | SpursgoZmy/table-llava-v1.5-pretrained_mm_projector |
pretrained_mm_projector of Table LLaVA 13B | 5M | full_finetune-1_epoch | Vicuna-v1.5-13B | CLIP-ViT-L-336px | MLP-2x | SpursgoZmy/table-llava-v1.5-pretrained_mm_projector |
Note: The above Table-LLaVA checkpoints are saved from the original LLaVA repository, which is not directly compatible with the Transformers, i.e., it can not be directly loaded in the way like LlavaForConditionalGeneration.from_pretrained('SpursgoZmy/table-llava-v1.5-7b')
. This problem is mentioned in this github issue. I will try the provided conversion script to make Table-LLaVa checkpoints become compatible with Transformers and upload new checkpoints to a new hub. But for now, maybe the checkpoints can only be loaded with the LLaVA repository like this instead of directly loading from HuggingFace. Sorry for this inconvenience!
We use the code base of LLaVA v1.5 for model training and inference. Thus, Table LLaVA can be used as the normal LLaVA v1.5 model and the environment can be installed in a similar way. Note that our code base is downloaded in December 2023 and maybe not the latest. Please refer to the official LLaVA v1.5 github for its latest update.
git clone https://github.com/SpursGoZmy/Table-LLaVA.git
cd Table-LLaVA
conda create -n table_llava python=3.10 -y
conda activate table_llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
Table LLaVA training consists of two stages: (1) Pre-training stage: the vision-language connector (a two-layer MLP) is trained to connect the frozen pretrained vision encoder (ViT) to the frozen LLM (Vicuna v1.5); (2) Instruction-tuning stage: the vision-language connector and the base LLM are trained to follow multimodal instructions.
The training data of each stage is shown below:
Training Stage | Data Description | Data Size | Hugging Face Dataset |
---|---|---|---|
Pre-training | 558K original LLaVA-1.5 pre-training data | 558K | blip_laion_cc_sbu_558k.json |
150K table recognition data (MMTab-pre) | 150K | MMTab-pre_pretrain_data_llava_format_150K.json | |
Instruction Fine-tuning | 665K original LLaVA-1.5 fine-tuning data | 665K | llava_v1_5_mix665k.json |
232K multimodal instruction tuning data of 14 tabular tasks (MMTab-instruct) | 232K | MMTab-instruct_sft_data_llava_format_232K.json |
The merged pre-training and instruction fine-tuning data in the LLaVA data format can be found in the MMTab dataset,
i.e., enhanced_llava_pretrain_data_708K.json
and enhanced_llava_sft_data_898K.json
, which can be directly used to train Table LLaVA.
Table LLaVA was trained on 8 A800 GPUs with 80GB memory. We use a similar set of hyperparameters as LLaVA v1.5 except that we increased the max sequence length from 2048 to 2560 to accommodate longer text sequences. The hyperparameters used in pretraining and finetuning are provided below.
Stage | Trained Weights | Global Batch Size | Learning rate | Epochs | Max length | Weight decay | warmup ratio | Deepspeed Stage |
---|---|---|---|---|---|---|---|---|
Pre-training | vision-language connector | 256 | 1e-3 | 1 | 2560 | 0 | 0.03 | ZeRO-2 |
Instruction Fine-tuning | base LLM and vision-language connector | 128 | 2e-5 | 1 | 2048 | 0 | 0.03 | ZeRO-3 |
images.zip
from here. Put it under ./LLaVA-Pretrain/images
and unzip it.MMTab-instruct_table_images_82K.zip
and MMTab-pre_table_images_part_2_16K.zip
from MMTab dataset. Put them under ./LLaVA-Pretrain/images
and unzip them. Rename the IID_train_image
dir to table_pretrain_part_1
.enhanced_llava_pretrain_data_708K.json
from MMTab dataset to ./LLaVA-Pretrain
.LLaVA-Pretrain
├── images
│ ├── table_pretrain_part_1
| ├── table_pretrain_part_2
| ├── 00453
| ├── 00019
| ├── ...
| └── 00095
└── enhanced_llava_pretrain_data_708K.json
pretrain_table_llava.sh
. If you cannot automaticly download the base Vicuna v1.5 and ViT model through HuggingFace, you can download these models manually and set corresponding command-line parameters (model_name_or_path
and vision_tower
) to the local model paths. Once the pre-training is finished, the trained vision-language projector will be saved at the specified output_dir
../LLaVA-Finetune/images
whose names are coco
, gqa
, ocr_vqa
, textvqa
and vg
, respectively. Follow instructions from here to download images from these 5 datasets for LLaVA v1.5 fine-tuning. Put the zip files in the corresponding folders and unzip them.MMTab-instruct_table_images_82K.zip
from MMTab dataset. Put it under ./LLaVA-Finetune/images/table_instructV
and unzip it. Rename the resulting IID_train_image
dir to images
.enhanced_llava_sft_data_898K.json
from MMTab dataset to ./LLaVA-Finetune
.LLaVA-Finetune
├── images
│ ├── coco
| | └── train2017
| ├── gqa
| | └── images
| ├── ocr_vqa
| | └── images
| ├── textvqa
| | └── train_images
| ├── vg
| | ├── VG_100K
| | └── VG_100K_2
| ├── table_instructV
| | └── images
└── enhanced_llava_sft_data_898K.json
continue_sft_table_llava.sh
. Set the pretrain_mm_mlp_adapter
parameter to the path of your pre-trained vision-language projector, such as ./pretrained_mm_projector/llava-v1.5-7b-with-table-pretrain/mm_projector.bin
. The trained table llava model will be saved at the specified output_dir
.The inference data should be stored in the LLaVA's jsonl format. Each line in the input file corresponds to an input sample, which is a JSON string (generated by json.dumps()
) of a Python dict. The sample format should look like:
{ "question_id": "TSD_test_item_17", # item_id
"image": "TABMWP_24663.jpg", # corresponding image file
"text": "This image displays a table. Could you provide me ...", # input text
"category": "TABMWP_for_TSD" # {dataset_name}_for_{task_type}, which can be used to separate data of different benchmarks.
}
For inference on the MMTab-eval, download the 49K MMTab-eval test samples in the jsonl format (MMTab-eval_test_data_49K_llava_jsonl_format.jsonl) and its image files (MMTab-eval_table_images_23K.zip). Then create a folder named 'LLaVA-Inference' and organize the data as follows:
LLaVA-Inference
├── MMTab-eval_test_data_49K_llava_jsonl_format.jsonl
└── all_test_image
Inference on multi-GPU: start_multicard_inference.sh
. You can also inference on your own data. Remember adjust parameters like 'question-file
' (input file path), 'image-folder
' (image folder path) in the table_llava_inference.sh
. The inference results (merge.jsonl
) will be stored in the path of the 'answers-file
' parameter, e.g., ./eval_results/answers/MMTab_eval/table-llava-v1.5-7b/merge.jsonl
.
With the offical inference script, the inference result format in the merge.jsonl
should look like:
{ 'question_id': 'TABMWP_8', # item_id
'prompt': 'Problem: nHannah baked cookies each day ...', # input_prompt
'text': 'Find the numbers in the table.nnSaturday: ...', # model_output
'answer_id': 'jELcxSPcXHBj3xvHfm5r8T', # answer_id
'model_id': 'table-llava-7b', # model_id
'category': 'TABMWP_for_TQA'
} # item category
The evaluation scripts are stored in the MMTab-eval_evaluation
folder. First, cd MMTab-eval_evaluation
and pip install -r eval_requirements.txt
to install necessary packages like 'Sacrebleu' for evaluation. For table recognition task, we use the PubTabNet's TEDS computation script for evaluation. Then, download the MMTab-eval test data (MMTab-eval_test_data_49K.json) and test tables (MMTab-eval_test_tables_23K.json), and put them into the MMTab-eval_evaluation
folder together with the LLaVA's inference result (merge.jsonl
). Use the MMTab_evaluation.ipynb notebook for automatic evaluation.
For the evaluation on the ToTTo test set, you need to organize the model output into a txt file and upload it to the offical ToTTo leaderboard.
LlavaForConditionalGeneration.from_pretrained('SpursgoZmy/table-llava-v1.5-7b')
. This problem is mentioned in this issue@misc{zheng2024multimodal,
title={Multimodal Table Understanding},
author={Mingyu Zheng and Xinwei Feng and Qingyi Si and Qiaoqiao She and Zheng Lin and Wenbin Jiang and Weiping Wang},
year={2024},
eprint={2406.08100},
archivePrefix={arXiv},
}
}