Large language models (LLMs) have clear shortcomings when processing tabular data, and a research team at Zhejiang University's Institute of Computing Innovation has developed the TableGPT2 model to address this. The model integrates and processes tabular data efficiently, opening new possibilities for business intelligence (BI) and other data-driven applications. TableGPT2's core innovation is a novel table encoder that captures both a table's structure and its cell contents, improving the model's handling of fuzzy queries, missing column names, and irregular tables. Through large-scale pretraining and fine-tuning, including continual pretraining (CPT) and supervised fine-tuning (SFT), TableGPT2 demonstrates strong coding and reasoning capabilities for complex BI tasks.
The rise of large language models (LLMs) has revolutionized applications of artificial intelligence, but these models fall short on tabular data. A research team from Zhejiang University's Institute of Computing Innovation has released a new model called TableGPT2, which integrates and processes tabular data directly and efficiently, opening new possibilities for business intelligence (BI) and other data-driven applications.
The core innovation of TableGPT2 is its table encoder, designed specifically to capture a table's structure and cell contents, thereby enhancing the model's ability to handle the fuzzy queries, missing column names, and irregular tables commonly found in real-world applications. Built on the Qwen2.5 architecture, TableGPT2 underwent large-scale pretraining and fine-tuning involving more than 593,800 tables and 2.36 million high-quality query-table-output tuples, a scale of table-related data unprecedented in prior research.
To strengthen TableGPT2's coding and reasoning capabilities, the researchers performed continual pretraining (CPT), with 80% of the data consisting of carefully annotated code to ensure strong coding ability. They also collected a large amount of reasoning data, along with textbooks containing domain-specific knowledge, to enhance the model's reasoning. The final CPT corpus contains 86 billion rigorously filtered tokens, providing the coding and reasoning capabilities TableGPT2 needs for complex BI and related tasks.
To adapt TableGPT2 to specific BI tasks and scenarios, the researchers then performed supervised fine-tuning (SFT). They constructed a dataset covering a variety of critical, realistic scenarios, including multi-turn dialogue, complex reasoning, tool use, and highly business-oriented queries. The dataset combines manual annotation with an expert-driven automatic annotation process to ensure quality and relevance. The SFT process used 2.36 million samples in total, further refining the model to meet the specific needs of BI and other table-centric environments.
TableGPT2 also introduces a novel semantic table encoder that takes an entire table as input and generates a compact set of embedding vectors for each column. The architecture is tailored to the distinctive properties of tabular data, capturing relationships between rows and columns through a bidirectional attention mechanism and a hierarchical feature-extraction process. In addition, a column-wise contrastive learning approach encourages the model to learn meaningful, structure-aware semantic representations of tables.
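To make that design concrete, below is a minimal PyTorch sketch of a column-wise table encoder trained with a contrastive objective. Everything here (module names, dimensions, mean-pooling, the InfoNCE-style loss) is an illustrative assumption, not TableGPT2's actual implementation:

```python
# Minimal sketch of a column-wise table encoder with contrastive learning.
# Shapes and module names are illustrative, not TableGPT2's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TableColumnEncoder(nn.Module):
    def __init__(self, cell_dim=128, n_heads=4):
        super().__init__()
        # Attention over cells within each column (captures row-wise relations)
        self.row_attn = nn.MultiheadAttention(cell_dim, n_heads, batch_first=True)
        # Attention across column summaries (captures column-wise relations)
        self.col_attn = nn.MultiheadAttention(cell_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(cell_dim, cell_dim)

    def forward(self, cells):
        # cells: (n_cols, n_rows, cell_dim) -- one pre-embedded vector per cell
        h, _ = self.row_attn(cells, cells, cells)   # mix information across rows
        col = h.mean(dim=1, keepdim=True)           # pool each column to one vector
        col = col.transpose(0, 1)                   # (1, n_cols, cell_dim)
        col, _ = self.col_attn(col, col, col)       # mix information across columns
        return F.normalize(self.proj(col.squeeze(0)), dim=-1)  # (n_cols, cell_dim)

def column_contrastive_loss(z1, z2, tau=0.07):
    # InfoNCE-style objective: column i in view 1 should match column i in view 2.
    logits = z1 @ z2.T / tau
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

# Toy usage: two "views" of a table with 5 columns and 20 rows.
enc = TableColumnEncoder()
view1 = torch.randn(5, 20, 128)   # stand-in for an augmented view of the table
view2 = torch.randn(5, 20, 128)   # stand-in for a differently perturbed view
loss = column_contrastive_loss(enc(view1), enc(view2))
print(loss.item())
```

In a real training setup the two views would be different augmentations of the same table (for example, row subsampling or cell masking), so that matching columns across views pushes the encoder toward structure-aware representations.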
To integrate TableGPT2 seamlessly with enterprise-grade data analytics tools, the researchers also designed an agent workflow runtime framework. The framework contains three core components: runtime prompt engineering, a secure code sandbox, and an agent evaluation module, which together enhance the agent's capabilities and reliability. The workflow supports complex data analysis tasks through modular steps (input normalization, agent execution, and tool calls) that jointly manage and monitor the agent's performance. By integrating retrieval-augmented generation (RAG) for efficient context retrieval and a code sandbox for safe execution, the framework ensures TableGPT2 delivers accurate, context-aware insights on real-world problems.
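The article describes this workflow only at a high level; the following Python skeleton is one hedged reading of how those components could fit together. All function names and the retry policy are hypothetical, and the retriever and sandbox are stubs rather than real implementations:

```python
# Illustrative skeleton of the described agent workflow: input normalization,
# context retrieval (RAG), agent code generation, sandboxed execution, and an
# evaluation/retry loop. All names are hypothetical, not TableGPT2's API.
from dataclasses import dataclass

@dataclass
class StepResult:
    ok: bool
    output: str

def normalize_input(query: str) -> str:
    # e.g., strip noise and resolve column-name aliases against the table schema
    return query.strip()

def retrieve_context(query: str) -> str:
    # RAG step: fetch relevant schema snippets or past examples (stubbed here)
    return f"-- retrieved context for: {query}"

def run_in_sandbox(code: str) -> StepResult:
    # A real system executes generated code in an isolated sandbox;
    # this stub only pretends to.
    return StepResult(ok=True, output=f"executed: {code!r}")

def agent_step(query: str, llm_generate) -> StepResult:
    q = normalize_input(query)
    ctx = retrieve_context(q)
    code = llm_generate(prompt=f"{ctx}\n# Task: {q}\n")  # runtime prompt engineering
    result = run_in_sandbox(code)
    if not result.ok:  # evaluation module: one retry with the error fed back
        code = llm_generate(prompt=f"{ctx}\n# Task: {q}\n# Error: {result.output}\n")
        result = run_in_sandbox(code)
    return result

# Toy usage with a stand-in for the model:
fake_llm = lambda prompt: "df.groupby('region')['sales'].sum()"
print(agent_step("total sales by region", fake_llm).output)
```

The key design point is separation of concerns: the LLM only proposes code, execution happens in an isolated sandbox, and the evaluation step decides whether to retry, so generated code never touches production data directly.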
The researchers evaluated TableGPT2 extensively on a variety of widely used table benchmarks and general benchmarks. The results show that TableGPT2 performs well in table comprehension, processing, and reasoning: the 7-billion-parameter model improves average performance by 35.20% and the 72-billion-parameter model by 49.32%, while maintaining strong general-purpose performance. For a fair evaluation, they compared TableGPT2 only against open-source, benchmark-neutral models such as Qwen and DeepSeek, ensuring the model performs in a balanced, versatile way across tasks without overfitting to any single benchmark. They also introduced and partially released a new benchmark, RealTabBench, which emphasizes unconventional tables, anonymized fields, and complex queries that better reflect real-world scenarios.
Although TableGPT2 achieved state-of-the-art performance in experiments, challenges remain in deploying LLMs in real-world BI environments. The researchers point to several future research directions:
Domain-specific coding: enabling LLMs to adapt quickly to enterprise-specific domain-specific languages (DSLs) or pseudo-code, to better meet the specific needs of enterprise data infrastructure.
Multi-agent design: exploring how to effectively integrate multiple LLMs into a unified system that can handle the complexity of real-world applications.
Versatile table processing: improving the model's ability to handle irregular tables, such as the merged cells and inconsistent structures commonly found in Excel and Pages, to better cope with the many forms tabular data takes in the real world.
The launch of TableGPT2 marks significant progress for LLMs in processing tabular data, bringing new possibilities to business intelligence and other data-driven applications. As this line of research deepens, TableGPT2 is likely to play an increasingly important role in data analysis.
Paper address: https://arxiv.org/pdf/2411.02059v1
All in all, TableGPT2 delivers remarkable results in processing tabular data, and its innovative architecture and training methods make it stand out across multiple benchmarks. Future research will continue to focus on the model's adaptability and practicality, to better meet the needs of real-world business intelligence applications.