Natural Language to SQL (NL2SQL) technology is developing rapidly and has become a key area of progress in natural language processing. It lets users query databases in natural language, which greatly simplifies data access and improves efficiency. However, existing methods still struggle with accuracy and adaptability, especially on complex databases and in cross-domain applications. The editor of Downcodes will introduce the XiYan-SQL framework proposed by the Alibaba team and explain how it addresses these problems.
Implementing NL2SQL involves a trade-off between query accuracy and adaptability. Some methods cannot guarantee accurate SQL and are difficult to adapt to different types of databases. Some existing solutions rely on large language models (LLMs) to generate multiple outputs and select the best query through prompt engineering, but this increases the computational burden and is poorly suited to real-time applications. Supervised fine-tuning (SFT), meanwhile, can produce targeted SQL generation but struggles with cross-domain applications and complex database operations, so new frameworks are needed.
Alibaba's research team launched XiYan-SQL, a new NL2SQL framework. It uses a multi-generator ensemble strategy that combines the advantages of prompt engineering and SFT. A key innovation of XiYan-SQL is M-Schema, a semi-structured schema representation that helps the system understand the database hierarchy, including data types, primary keys and sample values, thereby improving the accuracy and contextual fit of the generated SQL queries.
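To make the idea of a semi-structured schema description more concrete, here is a minimal sketch that renders a SQLite database as text listing each column's data type, primary-key flag and a few sample values. The function name `describe_schema` and the output layout are assumptions made for illustration; this is not the official M-Schema syntax, only a simplified rendering of the same three kinds of information.

```python
import sqlite3

def describe_schema(conn, max_examples=3):
    """Render tables as semi-structured text (hypothetical layout, not official M-Schema)."""
    lines = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    for table in tables:
        lines.append(f"# Table: {table}")
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for _cid, name, col_type, _notnull, _default, pk in columns:
            samples = [row[0] for row in conn.execute(
                f'SELECT DISTINCT "{name}" FROM "{table}" '
                f'WHERE "{name}" IS NOT NULL LIMIT {int(max_examples)}')]
            pk_note = ", Primary Key" if pk else ""
            lines.append(f"({name}: {col_type}{pk_note}, Examples: {samples})")
    return "\n".join(lines)

# Toy database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
INSERT INTO singer VALUES (1, 'Ana', 'Spain'), (2, 'Bo', 'China');
""")
schema_text = describe_schema(conn)
print(schema_text)
```

Text of this kind can be placed in a prompt so that the model sees types, keys and representative values alongside the table and column names.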
XiYan-SQL uses a three-stage process to generate and optimize SQL queries.
First, the system identifies the relevant database elements through schema linking, which reduces redundant information and focuses on the key structures. Next, SQL candidates are produced by generators based on in-context learning (ICL) and on SFT. Finally, the system applies an error-correction model and a selection model to refine and filter the candidates so that the best query is chosen. XiYan-SQL integrates these steps into a single pipeline rather than relying on prompt engineering or fine-tuning alone.
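The three stages described above can be sketched as a small pipeline. Everything here is a hypothetical stand-in rather than XiYan-SQL's actual code: the linking step is simple keyword matching, the generators are hard-coded stubs instead of ICL or SFT models, the refinement step only discards candidates that fail to execute, and the selection step is a majority vote over execution results instead of a trained selection model.

```python
import sqlite3
from collections import Counter

def link_schema(question, schema):
    """Stage 1 stand-in: keep tables whose name or columns are mentioned in the question."""
    words = set(question.lower().replace("?", " ").split())
    return {table: cols for table, cols in schema.items()
            if table in words or words & set(cols)}

def generate_candidates(question, linked_schema, generators):
    """Stage 2: every generator proposes one SQL candidate."""
    return [generate(question, linked_schema) for generate in generators]

def execute(conn, sql):
    """Run a query and return its rows, or None if the database rejects it."""
    try:
        return tuple(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def select_best(conn, candidates):
    """Stage 3 stand-in: drop failing candidates, then vote on execution results."""
    executed = [(sql, execute(conn, sql)) for sql in candidates]
    valid = [(sql, rows) for sql, rows in executed if rows is not None]
    if not valid:
        return None
    winning_rows, _ = Counter(rows for _, rows in valid).most_common(1)[0]
    return next(sql for sql, rows in valid if rows == winning_rows)

# Toy data and stub generators, assumed purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE concert (concert_id INTEGER PRIMARY KEY, venue TEXT);
INSERT INTO singer VALUES (1, 'Ana', 'Spain'), (2, 'Bo', 'China'), (3, 'Cy', 'Spain');
""")
schema = {"singer": ["singer_id", "name", "country"],
          "concert": ["concert_id", "venue"]}
generators = [
    lambda q, s: "SELECT COUNT(*) FROM singer WHERE country = 'Spain'",
    lambda q, s: "SELECT COUNT(singer_id) FROM singer WHERE country = 'Spain'",
    lambda q, s: "SELECT COUNT(*) FROM singers",  # invalid: table does not exist
]

question = "How many singer rows have country Spain?"
linked = link_schema(question, schema)
candidates = generate_candidates(question, linked, generators)
best = select_best(conn, candidates)
print(best)
```

Voting on execution results is used here only because it needs no trained model; the point of the sketch is how the three stages hand their output to one another.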
After rigorous benchmark testing, XiYan-SQL performed well in multiple standard test sets. For example, it achieved an execution accuracy of 89.65% in the Spider test set, significantly ahead of previous top models.
XiYan-SQL also adapts well to non-relational data sets, reaching an accuracy of 41.20% on the NL2GQL test set. These results show that XiYan-SQL is both flexible and accurate across a variety of scenarios.
GitHub: https://github.com/XGenerationLab/XiYan-SQL
All in all, the XiYan-SQL framework advances the field of NL2SQL with its M-Schema representation and multi-generator ensemble strategy, providing a new solution for efficient and accurate natural language database queries. Its strong results on multiple test sets also point to its practicality and broad application prospects. Interested readers can visit the GitHub link for more information.