Key Points

1. The paper provides a comprehensive review of NL2SQL techniques powered by large language models (LLMs), covering the entire lifecycle of the NL2SQL task.

2. The paper discusses the challenges of the NL2SQL task, including the uncertainty of natural language, the complexity of databases, and the difficulty of translating natural language queries into SQL queries.

3. The paper categorizes the evolution of NL2SQL solutions based on the development of language models, from rule-based methods to the current LLM-based approaches.

4. The paper reviews existing NL2SQL benchmarks, analyzing their characteristics and statistical information, and discusses methods for collecting and synthesizing high-quality training data.

5. The paper highlights the importance of comprehensive evaluation of NL2SQL models, including multi-angle evaluation and scenario-based evaluation.

6. The paper proposes a two-level taxonomy to summarize and analyze the typical errors produced by NL2SQL models.

7. The paper provides a roadmap for optimizing existing LLMs for the NL2SQL task and a decision flow for selecting appropriate NL2SQL modules for different scenarios.

8. The paper discusses new research opportunities in the field of NL2SQL, including open-world NL2SQL tasks, cost-effective NL2SQL with LLMs, and trustworthy NL2SQL solutions.

9. The paper introduces an NL2SQL Handbook that is continuously updated to track the latest NL2SQL techniques and provide practical guidance for researchers and practitioners.

Summary

This paper provides a comprehensive review of techniques for Natural Language to SQL (NL2SQL), the task of converting natural language (NL) queries into SQL queries that can be executed on a relational database. The authors present a new framework for systematically reviewing recent NL2SQL techniques, focusing on advances in pre-trained language models (PLMs) and large language models (LLMs).
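To make the task definition concrete, the following is a minimal sketch of a single NL2SQL input/output pair, executed against an in-memory SQLite database. The table, its rows, the question, and the SQL query are all hypothetical illustrations and are not taken from the paper; in a real system the SQL string would be produced by a model rather than written by hand.

```python
import sqlite3

# Hypothetical toy database; schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        dept TEXT,
        salary REAL
    );
    INSERT INTO employees VALUES
        (1, 'Ada', 'Engineering', 120.0),
        (2, 'Bob', 'Engineering', 100.0),
        (3, 'Cy',  'Sales',        90.0);
    """
)

# The NL2SQL input: a natural language question about the database.
nl_question = "What is the average salary in each department?"

# The NL2SQL output: a SQL query that answers the question.
# Hand-written here; an NL2SQL model would generate this string.
sql_query = (
    "SELECT dept, AVG(salary) FROM employees "
    "GROUP BY dept ORDER BY dept"
)

# Executing the query on the relational database yields the answer.
rows = conn.execute(sql_query).fetchall()
print(nl_question)
print(rows)
```

Even in this small case, the question never mentions the column names dept or salary, so the translation has to map the wording of the question onto the database schema.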

The paper covers four key aspects of the NL2SQL lifecycle: 1) Model: NL2SQL translation techniques that handle NL ambiguity and under-specification and that correctly map NL to the database schema and instances; 2) Data: the collection of training data, data synthesis to address training data scarcity, and NL2SQL benchmarks; 3) Evaluation: evaluating NL2SQL methods from multiple angles using different metrics and granularities; and 4) Error Analysis: analyzing NL2SQL errors to find their root causes and to guide the evolution of NL2SQL models.
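As an illustration of the evaluation aspect, the sketch below computes execution accuracy, a metric widely used in the NL2SQL literature: a predicted query counts as correct when it returns the same rows as the reference (gold) query. This is a simplified version under stated assumptions and not the paper's evaluation code. The function names execution_match and execution_accuracy, the toy table, and the query pairs are hypothetical, and row order is ignored here, which would be wrong for questions that require a specific ordering.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute counts as incorrect.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as multisets so that row order does not matter.
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose results match."""
    if not pairs:
        return 0.0
    matches = sum(execution_match(conn, p, g) for p, g in pairs)
    return matches / len(pairs)


# Hypothetical toy database and query pairs, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (name TEXT, dept TEXT, salary REAL);
    INSERT INTO employees VALUES
        ('Ada', 'Engineering', 120.0),
        ('Bob', 'Engineering', 100.0),
        ('Cy',  'Sales',        90.0);
    """
)

pairs = [
    # Different SQL text, same result on this data: counted as correct.
    ("SELECT name FROM employees WHERE salary > 95",
     "SELECT name FROM employees WHERE salary >= 100"),
    # Valid SQL, wrong result: counted as incorrect.
    ("SELECT name FROM employees WHERE dept = 'Sales'",
     "SELECT name FROM employees WHERE dept = 'Engineering'"),
    # Invalid SQL: counted as incorrect.
    ("SELEC name FROM employees",
     "SELECT name FROM employees"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2f}")
```

The first pair shows why a single metric gives only a partial picture: the two queries agree on this particular database but are not equivalent in general, which is one motivation for evaluating from multiple angles.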

Practical Guidance for Developing NL2SQL Solutions

The authors also provide practical guidance for developing NL2SQL solutions, including a roadmap for optimizing LLMs for the NL2SQL task and a decision flow for selecting appropriate NL2SQL modules for different scenarios. Finally, the paper discusses research challenges and open problems in the era of LLMs, such as the open-world NL2SQL problem, cost-effective NL2SQL solutions, and trustworthy NL2SQL systems.

Overall, this survey offers a systematic review of the NL2SQL field, highlighting recent advances and the challenges that must be addressed to develop robust and practical NL2SQL solutions.

Reference: https://arxiv.org/abs/2408.05109