From Natural Language To SQL: Review of LLM-based Text-to-SQL Systems

Abstract—LLMs, when used with Retrieval Augmented Generation (RAG), are greatly improving the SOTA of translating natural language queries to structured and correct SQL. Unlike previous reviews, this survey provides a comprehensive study of the evolution of LLM-based text-to-SQL systems, from early rule-based models to advanced LLM approaches that use RAG systems. We discuss benchmarks, evaluation methods, and evaluation metrics. Also, we uniquely study the use of Graph RAGs for better contextual accuracy and schema linking in these systems. Finally, we highlight key challenges such as computational efficiency, model robustness, and data privacy toward improvements of LLM-based text-to-SQL systems.

Index Terms—Text-to-SQL, Large Language Models, Database, Natural Language Processing, SQL Generation

I. INTRODUCTION
A. Overview of the Text-to-SQL Task

Organizations increasingly rely on relational databases to manage and analyze vast amounts of structured information. These databases are critical parts of many modern systems, from business intelligence to customer relationship management. As the volume of data increases, the need to query, extract, and make sense of this information also increases. However, querying databases often requires the use of Structured Query Language (SQL), which is a technical skill. There is a gap between users who need access to data and the specialized knowledge required to retrieve it [1]. Text-to-SQL parsing in natural language processing (NLP) bridges this gap. LLMs make the task of translating natural language (NL) queries given by a non-technical user to precise SQL easier. This ability is important when the complexity of data increases and makes manual data exploration impractical and inefficient [2]. For example, consider a relational database that contains information about gas stations in Louisiana, with columns such as GasStationID, PARISH, NEED, STATION_NAME, CITY, and WORKING_GAS_STATIONS_5_MILES, among others. Suppose we have a NL query: "Where can I find a gas station with power less than 2 miles from the University?" A text-to-SQL system would analyze this query, understand the user's intent, and automatically generate the corresponding SQL query to extract the correct information from the database. In this case, the final SQL query might look something like:

SELECT STATION_NAME, location
FROM gas_stations
WHERE fuel_available = 'Yes'
  AND distance < 2
  AND ST_Distance_Sphere(Point(long, lat),
      Point(University_Long, University_Lat)) < 2;

This SQL query retrieves all gas stations within a 2-mile radius of the University that have power. The text-to-SQL system enables the user to extract this specific information without needing to write the SQL query, thus making complex data more accessible.

Stack Overflow shows that 51.52% of professional developers use SQL in their work. However, SQL is often a barrier for non-technical users because of its technical nature. This leaves a gap between the vast stores of data housed in relational databases and the ability of many users to access related data efficiently [3].

However, building reliable text-to-SQL systems is highly challenging. The complexity arises from several factors:

• 1. Ambiguity and Linguistic Complexity: NL queries often include complex structures such as nested clauses, pronouns, or vague terms. A query might contain phrases like "all gas stations in the area," where "area" could refer to a specific region not directly mentioned.
• 2. Schema Understanding: To generate accurate SQL queries, text-to-SQL systems must understand the database schema. In realistic applications, database schemas are often complex and can vary greatly between domains, making it challenging to map the NL query correctly to the database schema [4].
• 3. Rare and Complex SQL Operations: Some queries involve uncommon SQL operations, such as nested sub-queries, joins across multiple tables, or window functions. These operations can be difficult for text-to-SQL models to generate, especially if they are not frequently seen during training [3].
• 4. Cross-Domain Generalization: Text-to-SQL models trained on one domain (e.g., customer service databases) may not generalize well to other domains (e.g., healthcare). Differences in schema structures, terminology, and query patterns make it difficult for models to perform consistently across a wide variety of databases [5].

Addressing these challenges is essential for the effectiveness of text-to-SQL systems. As databases become more complex, higher accuracy of text-to-SQL models is also needed.
B. Introducing Retrieval Augmented Generation as a Solution

As already stated, despite advances in existing text-to-SQL systems, they still face limitations in schema understanding, handling complex queries, and generalizing across domains. Retrieval Augmented Generation (RAG) has emerged as a promising framework to address these challenges. RAG systems combine two key components:

• A Retrieval Module that dynamically fetches relevant schema details, SQL query templates, or domain-specific knowledge from structured and/or unstructured sources like documents [6].
• A Generative Module that produces text-based output, such as SQL queries or direct answers, by adding the retrieved context into the generation process.

In this way, a RAG system, instead of using only the general knowledge of LLMs and guessing the answers, uses related external knowledge to generate corresponding answers. RAGs facilitate text-to-SQL tasks in two ways:

• Improving SQL generation means that RAG retrieves schema-specific information, query examples, or templates to improve SQL query generation for complex or ambiguous queries in natural language [7].
• Bypassing SQL generation means that instead of generating SQL, RAG can directly retrieve answers from data sources.

This dual capability makes RAG a powerful approach to address the limitations of text-to-SQL systems. By dynamically retrieving context and integrating it into the response generation process, RAG improves schema understanding, resolves linguistic ambiguities, and adapts to new domains without extensive retraining.
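To make the two-component architecture concrete, the following sketch shows a minimal retrieval-then-generation loop in Python. It is an illustration only: embed() and llm() are hypothetical placeholders for whatever embedding model and LLM a given system uses, and real RAG systems add indexing, caching, and prompt management on top of this pattern.

import numpy as np

def embed(text):
    raise NotImplementedError  # placeholder: any text-embedding model

def llm(prompt):
    raise NotImplementedError  # placeholder: any generative LLM

def retrieve(question, schema_docs, k=3):
    # Retrieval Module: rank schema snippets, templates, or example
    # queries by cosine similarity to the question; keep the top k.
    q = embed(question)
    def score(doc):
        v = embed(doc)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(schema_docs, key=score, reverse=True)[:k]

def rag_to_sql(question, schema_docs):
    # Generative Module: condition generation on retrieved context
    # instead of relying on the LLM's general knowledge alone.
    context = "\n".join(retrieve(question, schema_docs))
    prompt = (f"Database context:\n{context}\n\n"
              f"Translate to SQL: {question}\nSQL:")
    return llm(prompt)

The same retrieve() step can also serve the "bypassing SQL generation" mode by returning stored answers directly rather than schema snippets.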
C. Contributions of this Survey

This survey provides a comprehensive review of text-to-SQL systems and explores how Retrieval-Augmented Generation (RAG) can address their limitations. The key contributions are:

• This survey presents a comprehensive review of text-to-SQL systems, including their challenges, datasets, benchmarks, and evaluation metrics.
• This survey identifies important limitations in existing text-to-SQL systems, such as schema understanding, linguistic ambiguity, domain generalization, and complex query handling.
• This survey explains how RAG complements and enhances text-to-SQL systems, addressing key challenges with dynamic retrieval capabilities.
• This survey introduces a taxonomy to classify text-to-SQL techniques, and illustrates how RAG fits into this landscape.
• This survey explains how Graph RAG became the state of the art and can enhance the accuracy of text-to-SQL systems.

II. EVOLUTION OF TEXT-TO-SQL SYSTEMS IN THE LITERATURE

A. Evolutionary Progression

Figure 1 shows the historical evolution of text-to-SQL systems. Initially they relied on rules, but now they use deep neural networks and powerful large language models (LLMs) [3]. Each step in the evolution brought important innovations, making it easier for the newer models to understand and create SQL queries from everyday language.

• Rule-Based Approaches
Early text-to-SQL systems used rule-based approaches. These systems used manually crafted grammar rules and heuristics to translate NL queries into SQL commands. These methods were limited in handling complex queries and diverse schemas [9]. The fixed rules made it hard to deal with the variety of human language, often causing errors when faced with tricky or unexpected query structures. Also, these systems needed a lot of manual work to design and update the rules. This made it hard for them to adapt to new domains or database formats. Systems like LUNAR [10] and NaLIX [11] showed the potential of semantic parsing but required significant manual feature engineering, hindering scalability and adaptability [10], [11]. Deep learning approaches have addressed many of these limitations by introducing models capable of learning complex patterns and representations directly from data [12].

• Deep Learning Approaches
Deep learning greatly improved text-to-SQL systems by using large models to interpret NL and generate SQL queries. Models like Seq2SQL and SQLNet [13] used sequence-to-sequence architectures like LSTMs and transformers [14]. These models introduced end-to-end, differentiable architectures that directly convert text to SQL, offering improvements over earlier systems and allowing the models to learn complex patterns and relationships within the data, leading to more accurate and efficient translation compared to earlier rule-based systems [13], [15], [16].
Transformer-based models like BERT [17] and TaBERT [18] enhanced understanding of both database schema and user intent. These models improved generalization to unseen databases by capturing dependencies between NL queries and underlying schemas. However, challenges such as handling nested queries, cross-domain generalization, and efficiently mapping ambiguous natural language to structured SQL still remained [19], [18].

• Pre-trained Language Models
Pre-trained Language Models (PLMs) shifted the approach from task-specific supervised learning to a more generalized pre-training method followed by fine-tuning. Models like BERT [17] (Bidirectional Encoder Representations from Transformers) and GPT [20] (Generative Pre-training Transformer) started this shift by using large-scale, unsupervised text datasets to pre-train models that could then be fine-tuned for specific tasks.
Fig. 1. How text-to-SQL research has evolved over time, highlighting different implementation approaches. Each phase includes key techniques and notable
works. The dates are approximate, based on when these key works were released, with a margin of error of about a year. The design is inspired by [3], [8].
The concept of PLMs, grounded in transfer learning, allowed models to acquire a deep understanding of NL through extensive pre-training on vast datasets, and this understanding transferred to many tasks, like text generation, sentiment analysis, and question-answering [21], [22].
PLMs were better at capturing semantic relationships between user queries and database schema structures. PLMs such as TaBERT and BERT-SQL [23] enabled the integration of both the NL query and the database schema into a unified representation that improved the system context. These models addressed several challenges in text-to-SQL systems, such as handling complex queries with multiple joins, nested queries, and cross-domain generalization. However, PLMs still had limitations with regard to their need for domain-specific fine-tuning and the difficulty of understanding complex database schemas without additional schema linking mechanisms [24].

• Large Language Models
The advent of Large Language Models (LLMs) such as GPT-4 [25], Codex [26], and LLaMA [27] has revolutionized NL processing. LLMs are a major advance over earlier machine learning methods by virtue of their vast size and training on massive datasets to generate more accurate and comprehensive responses. These models show exceptional performance in tasks that require understanding and generating human-like text, often without needing additional fine-tuning [28].
LLMs capture more complex relationships between NL queries and structured database schemas than the earlier models. Unlike previous pre-trained language models, which require significant task-specific fine-tuning and schema linking, LLMs can handle zero-shot and few-shot scenarios more effectively because of their large-scale pre-training and reasoning capabilities [29]. Studies show that models like Codex [28] achieve high performance in generating SQL queries with minimal prompt engineering. However, challenges such as handling ambiguous queries and optimizing SQL statements for performance and correctness remain [28].
As seen in Figure 2, the architecture of LLM-based text-to-SQL systems can be broken down into several key phases: natural language understanding, schema comprehension, SQL generation, and SQL execution. Each step involves sophisticated techniques to ensure that user queries are accurately mapped to SQL, providing correct and meaningful results from the database.

• RAG Systems in Text-to-SQL
The evolution of text-to-SQL systems has seen significant advancements, with the integration of RAG marking another step forward. By combining retrieval mechanisms with large-scale generative models, RAG systems address some of the persistent limitations in text-to-SQL tasks, particularly in schema understanding, handling complex queries, and domain generalization.
– Dynamic Knowledge Retrieval: RAG systems utilize retrieval modules to fetch relevant schema information, table relationships, and example queries from structured or unstructured sources, such as relational databases or external documents. These retrieved elements are then integrated into the generative process, providing real-time context to improve SQL generation [30], [7].
– Enhanced Schema Linking: Unlike earlier models that relied heavily on pre-defined schema representations, RAG systems dynamically adapt to schema complexities. They align user queries with database schemas more effectively by retrieving schema-specific details during query processing, thus reducing errors caused by schema ambiguity [31], [32].
– Cross-Domain Generalization: Traditional text-to-SQL systems often struggle to generalize across different database schemas or domains. RAG systems mitigate this challenge by leveraging domain-specific retrieval mechanisms, enabling seamless generalization without extensive fine-tuning. This makes RAG systems particularly effective for zero-shot or few-shot scenarios [31].

• Graph RAG System
Graph RAG [33] offers a structured and hierarchical approach to Retrieval-Augmented Generation (RAG), making it a promising solution for text-to-SQL tasks. Unlike traditional semantic search methods that rely solely on text-based retrieval, Graph RAG constructs a knowledge graph from raw data,
organizes schema elements into a hierarchical structure, and generates summaries for different database components. By leveraging these structured relationships, Graph RAG enhances schema understanding, improves retrieval accuracy, and enables more precise SQL query generation, making it particularly effective for complex database interactions.
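As a rough illustration of what constructing such a knowledge graph can look like for a database schema, the sketch below builds a small schema graph with networkx (an assumed dependency) and attaches a summary to each table node; production Graph RAG systems additionally build hierarchical community summaries over graphs like this.

import networkx as nx

def build_schema_graph(tables):
    # tables: {table_name: {"columns": [...], "foreign_keys": [(col, ref_table)]}}
    g = nx.DiGraph()
    for name, meta in tables.items():
        g.add_node(name, kind="table",
                   summary=f"Table {name} with columns: {', '.join(meta['columns'])}")
        for col in meta["columns"]:
            g.add_node(f"{name}.{col}", kind="column")
            g.add_edge(name, f"{name}.{col}", kind="has_column")
        for col, ref in meta.get("foreign_keys", []):
            g.add_edge(name, ref, kind="references", via=col)
    return g

# Retrieval can then walk the graph: a question about "gas stations
# with power" pulls the GasStations node, its column nodes, and any
# tables reachable over "references" edges, instead of relying on a
# flat text search over schema documentation.
g = build_schema_graph({"GasStations": {"columns": ["GasStationID", "PARISH", "NEED"],
                                        "foreign_keys": []}})
print(g.nodes(data=True))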
B. LLM-based Text-to-SQL Architecture and RAG-Integrated Systems

• Natural Language Understanding:
In traditional LLM-based text-to-SQL systems, the process begins with user input in the form of NL queries. The LLM first processes these inputs to understand the user's intent, identifying key components such as entities, conditions, and relations within the question.
• Schema Linking:
Once the NL query is parsed, the system moves to schema linking, where the LLM maps the parsed components of the query to the corresponding tables, columns, and relationships in the database schema. For example, "gas stations" is linked to a table named GasStations, and "power" is matched with a column named PowerAvailable. This phase ensures that the system can correctly interpret the query.
• SQL Generation:
After the query has been parsed and linked to the schema, the LLM generates an SQL query based on the established semantic relationships. This stage uses the model's understanding of SQL syntax and database logic to form a structured query that reflects the user's intent. The generated SQL is then validated and optimized for accuracy and performance.
• SQL Execution and Output:
The final SQL query is executed on the underlying database (such as SQLite or MySQL) to retrieve the requested information. The results of the query are returned either in raw format or, in some systems, converted back into NL for easier interpretation by the user. Figure 2 shows the flow from user input to the final SQL query. Each phase makes text-to-SQL systems more accessible for non-technical users.
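As an illustration of the schema linking step, even a naive linker based on string similarity makes the idea concrete; the function below is our hypothetical example, while real systems use embeddings or models fine-tuned for linking.

import difflib

def link_schema(question_terms, schema):
    # schema: {"GasStations": ["GasStationID", "PowerAvailable", ...]}
    # Map each extracted question term to the closest table or column.
    names = list(schema) + [c for cols in schema.values() for c in cols]
    links = {}
    for term in question_terms:
        # difflib ranks candidates by character-level similarity, so
        # "gas stations" -> "GasStations" and "power" -> "PowerAvailable".
        match = difflib.get_close_matches(term, names, n=1, cutoff=0.4)
        if match:
            links[term] = match[0]
    return links

print(link_schema(["gas stations", "power"],
                  {"GasStations": ["GasStationID", "PowerAvailable"]}))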
RAG improves the traditional text-to-SQL architecture by integrating dynamic retrieval mechanisms with LLMs to address challenges like schema complexity, ambiguous queries, and domain generalization.

• 1. Improved Natural Language Understanding:
RAG-to-SQL systems dynamically retrieve relevant external knowledge: 1. Schema metadata, example queries, and documentation are fetched from a vector database based on the input NL query. 2. If the query is ambiguous, RAG systems retrieve clarifying context, such as domain-specific examples or schema descriptions, to refine understanding.
• 2. Schema Linking with Contextual Retrieval:
In RAG-to-SQL systems, RAG retrieves detailed schema information, including table-column relationships, foreign keys, and other data, which helps more precise linking of query components.
• 3. Advanced SQL Generation:
The retrieved schema details, query examples, and metadata are integrated into the prompt, guiding the LLM to generate SQL queries that are both correct and semantically aligned with the database.
• 4. Iterative SQL Execution and Error Feedback:
RAG-to-SQL systems incorporate a feedback loop to address execution errors. SQL execution errors are detected during query execution. These errors are used to retrieve additional context from the vector database. The SQL query is re-generated and re-executed until the errors are resolved. A minimal sketch of this loop is shown after this list.
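The sketch below shows one way to write that execute-retrieve-regenerate loop, reusing the hypothetical llm() and retrieve() helpers from the earlier RAG sketch; real systems are more careful about classifying errors and bounding retries.

import sqlite3

def generate_with_feedback(question, schema_docs, db_path, max_tries=3):
    # Execute the generated SQL; on failure, feed the error message
    # back into retrieval and prompting, then regenerate.
    error = None
    for _ in range(max_tries):
        context = "\n".join(retrieve(question + " " + (error or ""), schema_docs))
        hint = f"\nPrevious attempt failed with: {error}" if error else ""
        sql = llm(f"Context:\n{context}{hint}\n"
                  f"Translate to SQL: {question}\nSQL:")
        try:
            with sqlite3.connect(db_path) as conn:
                return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)  # drives the next retrieval round
    raise RuntimeError(f"unresolved after {max_tries} attempts: {error}")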
In Table I, a high-level comparison between the LLM-based text-to-SQL system and RAG-integrated system architectures is shown, and Fig. 3 illustrates the high-level workflow of RAG-to-SQL.

III. BENCHMARKS AND EVALUATION METHODS

Evaluating LLM-based text-to-SQL systems is crucial for measuring how well they understand NL and generate accurate SQL queries. Researchers use a variety of datasets and benchmarks that test these models across different scenarios, both within specific domains and across multiple domains.

A. Types of Datasets used in Benchmarks

Text-to-SQL research has made rapid progress with the help of many benchmark datasets, each contributing unique challenges for model development. These datasets are categorized into four types based on their characteristics: cross-domain, knowledge-augmented, context-dependent, and robustness. In Table II, we categorize the most well-known datasets according to these criteria.

• 1. Cross-domain Datasets
Datasets like WikiSQL [35], Spider [36], and KaggleDBQA [44] focus on evaluating the generalization capabilities of models across multiple databases from different domains. These datasets test whether models can generate accurate SQL queries for databases they have not seen during training [36].
In RAG-to-SQL systems, RAG has better performance in cross-domain settings where schemas differ significantly across databases. Retrieval modules dynamically fetch schema details or examples from diverse domains, and help to improve robust generalization.
• 2. Knowledge-Augmented Datasets
Datasets such as SQUALL [45] and BIRD [17] use external knowledge to improve the semantics of SQL generation. These datasets aim to enhance the model's comprehension by augmenting the schema with additional context, allowing for more accurate and better SQL generation [17]. Spider-DK [5] adds domain knowledge requirements to the Spider dataset. RAG systems can retrieve external documents or unstructured knowledge to handle questions needing additional context.
• 3. Context-Dependent Datasets
Datasets like CoSQL [4] and SParC [38] emphasize the conversational nature of querying databases, where previous interactions influence current queries.
Fig. 2. Illustrates the key stages of the traditional text-to-SQL process using Large Language Models (LLMs).
These datasets challenge models to maintain context throughout multi-turn interactions, making them essential for developing systems that can handle complex, dialog-driven database queries [38].
• 4. Robustness Datasets
A dataset like ADVETA [46] tests the robustness of text-to-SQL systems by introducing adversarial table perturbations. This method tests if models are capable of handling unexpected changes in database schema or table structure, thereby assessing their adaptability to real-world scenarios [46]. The datasets in Table II push the boundaries of text-to-SQL research, providing challenges across various dimensions, from domain generalization to contextual understanding and system robustness. Retrieval modules fetch clarifying schema descriptions or mappings, helping LLMs resolve ambiguities introduced in the dataset.
• 5. Semantic Parsing-Oriented Datasets
These datasets are designed for evaluating the semantic parsing capabilities of models, where precise mappings from natural language to SQL are tested [41].
• 6. Multi-Lingual or Cross-Lingual Datasets
These datasets test model performance across multiple languages, requiring the system to map non-English queries to SQL. Retrieval modules in RAG systems can fetch schema mappings or query examples in the specific language of the input, enhancing multilingual performance [37].
• 7. Conversational or Interactive Datasets
These datasets are designed for conversational settings, where context from previous turns must be considered [4]. CoSQL extends Spider for conversational text-to-SQL tasks, and SParC [38] is a dataset for conversational SQL queries with dependencies between turns. RAG systems can retrieve prior turns' context dynamically, ensuring continuity across dialogue turns.

B. Evaluation Metrics Used in Benchmarks

Benchmarks for text-to-SQL systems use metrics that capture both the correctness and efficiency of an SQL query. These metrics test that systems not only produce accurate SQL queries, but also perform efficiently in realistic database environments.
TABLE I
COMPARISON BETWEEN LLM-BASED TEXT-TO-SQL SYSTEM AND RAG-INTEGRATED SYSTEMS ARCHITECTURES
Fig. 4. Taxonomy of research approaches in LLM-based text-to-SQL. The format is adapted from [34].
Figure 5 shows evaluation metrics that fall into two categories: Content Matching-based Metrics and Execution-based Metrics.
1) Content Matching-based Metrics: Content matching-based metrics focus on how closely the structure of the generated SQL query matches the gold (or reference) query. This type of evaluation is needed for ensuring that the model follows the correct SQL syntax and structure, even if it might not always produce the most optimized SQL query.
• 1. Component Matching (CM):
This metric evaluates each component of the SQL query (such as SELECT, FROM, WHERE) individually. Even if the components appear in a different order than in the gold query, they are considered correct as long as they correspond to the expected components. This allows for flexibility in the query structure while ensuring the essential parts of the SQL query are present and accurate [36].
• 2. Exact Matching (EM):
Exact matching is stricter, requiring the generated SQL query to match the gold query exactly in terms of both structure and order. Every element, including the sequence of components, must be identical to the gold query. The disadvantage of this metric is that it can penalize queries that are functionally correct but structured differently [36].
2) Execution-based Metrics: These focus on the actual performance of the generated SQL query when run on a database. These metrics assess whether the queries not only follow the correct structure but also return the correct results and run efficiently in realistic scenarios.
• 1. Execution Accuracy (EX):
Checks whether the generated SQL query, when executed on the database, returns the same result as the gold query. Execution accuracy focuses on the correctness of the result, regardless of how the query is structured [36].
• 2. Valid Efficiency Score (VES):
The Valid Efficiency Score measures the computational efficiency of the generated SQL query compared to the gold query. While a query might return the correct result, it could still be inefficient, requiring unnecessary computational resources. VES penalizes queries that introduce extra complexity, such as redundant sub-queries or unnecessary joins, even if the results match [4].
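To make the two execution-based metrics concrete, the sketch below implements EX and a simplified VES-style runtime ratio against a SQLite database; the function names are ours, and the official VES definition used by the BIRD benchmark differs in its exact normalization.

import sqlite3
import time

def run(db_path, sql):
    with sqlite3.connect(db_path) as conn:
        t0 = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        return rows, time.perf_counter() - t0

def execution_accuracy(db_path, pred_sql, gold_sql):
    # EX: the prediction is correct iff it returns the same rows as
    # the gold query, compared order-insensitively.
    try:
        pred, _ = run(db_path, pred_sql)
    except sqlite3.Error:
        return False  # invalid SQL counts as a miss
    gold, _ = run(db_path, gold_sql)
    return sorted(map(repr, pred)) == sorted(map(repr, gold))

def efficiency_ratio(db_path, pred_sql, gold_sql):
    # Simplified VES-style score: 0 for a wrong result, otherwise the
    # gold/predicted runtime ratio (values below 1 flag slow queries).
    if not execution_accuracy(db_path, pred_sql, gold_sql):
        return 0.0
    _, t_pred = run(db_path, pred_sql)
    _, t_gold = run(db_path, gold_sql)
    return t_gold / t_pred if t_pred > 0 else 1.0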
C. Methods

LLM-based text-to-SQL systems fall into two main classes of methods: in-context learning and fine-tuning (FT). Each class uses different strategies for training the model to generate SQL queries from NL inputs. These methods rely on using the model's pre-trained knowledge through prompts or fine-tuning with added task-specific data to improve query generation.
TABLE II
COMPARISON OF DATASET TYPES IN TEXT-TO-SQL: KEY CHARACTERISTICS, STRENGTHS, CHALLENGES, AND EXAMPLES [35], [36], [4], [37], [17], [38], [5], [39], [40], [41], [42], [43], [7].
Cross-Domain
  Focus: Tests generalization across diverse database schemas and domains.
  Relation to RAG-to-SQL: RAG dynamically retrieves schema-specific mappings and examples, enabling better handling of unseen schemas.
  Strengths: 1. Evaluates generalization across diverse schemas. 2. Most common benchmark for real-world adaptability.
  Weaknesses: 1. Limited focus on schema-specific nuances. 2. May oversimplify domain-specific challenges.
  Challenges posed to models: 1. Handling unseen schemas without extensive fine-tuning. 2. Mapping ambiguous queries to schema elements.
  Examples: Spider; WikiSQL.

Knowledge-Augmented
  Focus: Requires external domain knowledge or context beyond the schema itself.
  Relation to RAG-to-SQL: RAG retrieves unstructured knowledge (e.g., FAQs, documents) to provide additional context for SQL generation.
  Strengths: 1. Tests integration of external domain knowledge. 2. Simulates real-world domain-specific challenges.
  Weaknesses: 1. Requires external knowledge sources, increasing complexity. 2. High reliance on retrieval accuracy.
  Challenges posed to models: 1. Linking domain-specific knowledge to schema elements. 2. Avoiding hallucinations during SQL generation.
  Examples: Spider-DK; BIRD.

Context-Dependent
  Focus: Multi-turn queries requiring understanding of previous turns in a conversation.
  Relation to RAG-to-SQL: RAG retrieves prior turns' context or intermediate query states to ensure continuity across dialogue.
  Strengths: 1. Evaluates continuity and coherence across dialogue turns. 2. Tests memory and contextual adaptation.
  Weaknesses: 1. Limited datasets available for conversational SQL. 2. Requires additional context tracking and retrieval logic.
  Challenges posed to models: 1. Handling ambiguity in multi-turn interactions. 2. Efficiently retrieving prior turn information.
  Examples: CoSQL; SParC.

Robustness
  Focus: Evaluates performance under adversarial or ambiguous inputs.
  Relation to RAG-to-SQL: RAG retrieves schema clarifications, synonyms, or disambiguation examples to handle perturbations.
  Strengths: 1. Highlights system resilience to perturbations. 2. Tests error handling under noisy input conditions.
  Weaknesses: 1. May overemphasize edge cases that are rare in real-world applications. 2. Often lacks domain realism.
  Challenges posed to models: 1. Resolving synonym ambiguities. 2. Handling minimal or incomplete schema descriptions.
  Examples: Spider-Syn; Spider-Realistic.

Semantic Parsing-Oriented
  Focus: Tests baseline NL-to-SQL translation accuracy and schema linking.
  Relation to RAG-to-SQL: RAG can retrieve schema metadata or query examples to improve schema linking and parsing accuracy.
  Strengths: 1. Focuses on core NL-to-SQL translation abilities. 2. Serves as a foundation for all other categories.
  Weaknesses: 1. Limited to simple queries and fixed schemas. 2. Doesn't test domain generalization or robustness.
  Challenges posed to models: 1. Achieving precise semantic parsing for varied linguistic structures. 2. Avoiding overfitting to simple datasets.
  Examples: GeoQuery; ATIS; WikiSQL (single-turn queries).

Multi-Lingual or Cross-Lingual
  Focus: Evaluates performance across multiple languages.
  Relation to RAG-to-SQL: RAG retrieves schema metadata and query examples in the same language as the input, improving multi-lingual adaptation.
  Strengths: 1. Tests adaptability across languages. 2. Highlights multi-lingual and cross-lingual generalization capabilities.
  Weaknesses: 1. Limited availability of high-quality multi-lingual datasets. 2. Heavily language-dependent retrieval models.
  Challenges posed to models: 1. Mapping language-specific terms to schema elements. 2. Handling multi-language inconsistencies in schema linking.
  Examples: mSpider; DuSQL.

Real-World Application
  Focus: Benchmarks derived from real-world industry use cases.
  Relation to RAG-to-SQL: RAG retrieves domain-specific schema relationships or external documentation to support domain-specific SQL generation.
  Strengths: 1. Evaluates systems on realistic, domain-specific queries. 2. High practical relevance for real-world applications.
  Weaknesses: 1. Often lacks diversity in schema structure. 2. Domain-specific datasets may be inaccessible or proprietary.
  Challenges posed to models: 1. Integrating domain-specific external knowledge. 2. Handling highly specific schemas with unique structures.
  Examples: Financial SQL; Nibiru; ClinicalDB.

Conversational or Interactive
  Focus: Designed for interactive dialogue settings with evolving context.
  Relation to RAG-to-SQL: RAG retrieves historical interaction logs or previous conversational turns to ensure continuity and context-awareness in SQL queries.
  Strengths: 1. Simulates real-world conversational settings. 2. Evaluates dynamic context management in multi-turn queries.
  Weaknesses: 1. High dependency on accurate retrieval of prior interactions. 2. Computationally intensive for large conversations.
  Challenges posed to models: 1. Managing long conversational histories. 2. Handling ambiguous or implicit references across turns.
  Examples: CoSQL; SParC.
1) In-context Learning: In in-context learning, the model generates the SQL query based on a given context without any updates to the model parameters. Instead, the model is guided through accurately constructed prompts. In-context learning includes several categories that optimize how queries are generated and improve accuracy. Mathematically, the SQL query generation task is described as

Y = f(Q, S, I; θ),   (1)

where Y is the generated SQL query, Q is the NL question set, S represents the database schema, I is the intermediate reasoning, and θ are the parameters of the pre-trained model. In in-context learning, this formula highlights that the model's output is determined by the input information (Q, S, I). The pre-trained knowledge embedded in θ remains fixed during the process. The model's performance depends on how effectively the input prompt is engineered to guide the model towards generating accurate SQL queries. Table IV classifies the most well-known methods into five categories. The following illustrates all these categories.

• Zero-Shot and Few-Shot Learning:
In zero-shot learning, the model generates SQL queries without any prior exposure to similar examples. The model relies on its pre-trained knowledge to interpret the input and produce SQL queries. For example, in C3's zero-shot prompting of ChatGPT for text-to-SQL tasks, no fine-tuning is needed [2]. Zero-shot learning is most effective when the LLM has been pre-trained on a vast corpus that includes SQL-related content. In few-shot learning, by contrast, the model is given a few examples of input-output pairs to guide its generation of new SQL queries. An example is FinSQL, where models are given a small set of SQL queries and asked to generalize from these examples.
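The equation above maps directly onto prompt construction. The sketch below shows one way to build zero-shot and few-shot prompts; the helper name and prompt format are illustrative, not the format used by any specific surveyed system.

def build_prompt(question, schema, examples=()):
    # In-context learning: the parameters θ stay fixed; everything
    # the model sees is packed into the prompt (Q = question,
    # S = schema, I = optional exemplars / intermediate reasoning).
    parts = [f"Schema:\n{schema}"]
    for q, sql in examples:  # exemplars; an empty tuple means zero-shot
        parts.append(f"Question: {q}\nSQL: {sql}")
    parts.append(f"Question: {question}\nSQL:")
    return "\n\n".join(parts)

# Zero-shot: build_prompt(q, schema)
# Few-shot:  build_prompt(q, schema, examples=[
#     ("List all parishes.", "SELECT DISTINCT PARISH FROM gas_stations;")])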
TABLE III
COMPARISON OF THE STATE OF THE ART IN RAG-BASED TEXT-TO-SQL SYSTEMS VS. SOME NEWER TRADITIONAL LLM-BASED TEXT-TO-SQL SYSTEMS
Chat2Data
  LLM used: Not specified. Database: Not specified. Metrics: Execution accuracy.
  Category: RAG-based. Sub-category: Knowledge-Enhanced Retrieval, Decomposition.
  Novelty: Uses vector databases to enhance prompt generation.
  Weakness: Lacks LLM specification and detailed dataset usage. Reference: [31].

Sample-aware Prompting and Dynamic Revision Chain
  LLM used: GPT-3.5. Database: Not specified. Metrics: Exact matching (EM), Execution accuracy.
  Category: RAG-based. Sub-category: Dynamic Retrieval, Context-Aware Retrieval.
  Novelty: Dynamic revision chain for fine-grained feedback.
  Weakness: Increased computational cost. Reference: [47].

RAG-based Text-to-SQL
  LLM used: GPT-4o-mini. Database: PostgreSQL. Metrics: Exact matching (EM), Execution accuracy.
  Category: RAG-based. Sub-category: Knowledge-Enhanced Retrieval.
  Novelty: Direct text-to-SQL query generation.
  Weakness: Struggles with database normalization principles. Reference: [48].

RAG-enhanced LLMs
  LLM used: GPT-3.5, GPT-4, Llama2. Database: Not specified. Metrics: Exact matching (EM), Efficiency.
  Category: RAG-based. Sub-category: Knowledge-Enhanced Retrieval.
  Novelty: Evaluation across LLMs with and without RAG integration.
  Weakness: Increased query generation times. Reference: [49].

TAG
  LLM used: Not specified. Database: Custom datasets. Metrics: Exact match, Accuracy.
  Category: RAG-based. Sub-category: Knowledge-Enhanced Retrieval, Schema-Augmented Prompting.
  Novelty: Unified approach combining RAG and Text2SQL paradigms.
  Weakness: High computational costs for iterative data synthesis. Reference: [50].

Context-Aware Generation
  LLM used: Not specified. Database: OutSystems data. Metrics: Execution accuracy.
  Category: RAG-based. Sub-category: Dynamic Retrieval, Context-Aware Retrieval.
  Novelty: Handling large context sizes by reducing irrelevant data.
  Weakness: Increased inference time despite improved accuracy. Reference: [51].

CRUSH4SQL
  LLM used: GPT-3. Database: SPIDER, BIRD, SocialDB. Metrics: Recall, Accuracy.
  Category: RAG-based. Sub-category: Schema-Augmented Prompting, Context-Aware Retrieval.
  Novelty: Leverages hallucination as a mechanism for high-recall schema subsetting.
  Weakness: Increased complexity and resource requirements for schema extraction. Reference: [52].

FATO-SQL
  LLM used: Medium-scale LLMs. Database: Petrochemical data. Metrics: Accuracy.
  Category: RAG-based. Sub-category: Context-Aware Retrieval, Dynamic Retrieval.
  Novelty: Four-stage process with two rounds of LLM calls to improve SQL generation.
  Weakness: Scalability is limited by medium-scale LLM parameters. Reference: [53].

Distyl-SQL
  LLM used: Proprietary LLM. Database: Not specified. Metrics: Execution accuracy.
  Category: RAG-based. Sub-category: Hierarchical CTE, Knowledge Management.
  Novelty: Incorporates pre-processing for hierarchical decomposition of SQL generation.
  Weakness: High latency and needs adaptive knowledge enhancement. Reference: [54].

TARGET
  LLM used: Not specified. Database: Various. Metrics: Accuracy, Recall.
  Category: RAG-based. Sub-category: Table Retrieval.
  Novelty: Benchmarking retrieval effectiveness for generative tasks over tabular data.
  Weakness: High variability in retriever performance across datasets. Reference: [55].

XRICL
  LLM used: Codex. Database: SQLite. Metrics: Component Matching (CM), Execution accuracy.
  Category: In-context. Sub-category: Zero-shot, Prompt Optimization.
  Novelty: Translation as Chain-of-Thought prompt for non-English text-to-SQL.
  Weakness: Limited cross-lingual retrieval exemplar availability. Reference: [56].

RSL-SQL
  LLM used: GPT-4o, DeepSeek. Database: Spider, BIRD. Metrics: Execution accuracy.
  Category: Fine-tuning. Sub-category: Enhanced Architecture, Data Augmentation.
  Novelty: Bidirectional schema linking to mitigate schema pruning risks.
  Weakness: Increased complexity and risks from multi-turn strategies. Reference: [57].

TCSR-SQL
  LLM used: Not specified. Database: Custom benchmark set. Metrics: Execution accuracy, Exact-set-match.
  Category: In-context. Sub-category: Reasoning Enhancement, Schema-Augmented Prompting.
  Novelty: Self-retrieval and multi-round generation-execution-revision process.
  Weakness: High costs for fuzzy testing with large datasets. Reference: [58].

E-SQL
  LLM used: Not specified. Database: BIRD. Metrics: Execution accuracy.
  Category: In-context. Sub-category: Question Decomposition, Execution Refinement.
  Novelty: Direct schema linking and candidate predicate generation.
  Weakness: Diminishing returns when used with advanced LLMs. Reference: [59].

DataGpt-SQL
  LLM used: DataGpt-SQL-7B. Database: Spider. Metrics: Accuracy (EX, TS).
  Category: Fine-tuning. Sub-category: Data Augmentation, Decomposition.
  Novelty: Uses preference datasets and self-refine mechanisms to mitigate LLM risks.
  Weakness: Requires fine-tuning on large datasets, potentially increasing costs. Reference: [60].

SEA-SQL
  LLM used: GPT-3.5. Database: Spider, BIRD. Metrics: Execution accuracy.
  Category: In-context. Sub-category: Reasoning Enhancement, Adaptive Refinement.
  Novelty: Adaptive bias elimination and dynamic execution adjustment.
  Weakness: Limited efficiency when using complex schema structures. Reference: [61].

TA-SQL
  LLM used: GPT-4. Database: BIRD, Spider. Metrics: Execution accuracy.
  Category: In-context. Sub-category: Task Alignment, Schema Linking, Logical Synthesis.
  Novelty: Task alignment to mitigate hallucinations, providing robustness in SQL generation.
  Weakness: Dependence on task alignment makes implementation more complex. Reference: [62].

Interactive-T2S
  LLM used: GPT-4, ChatGPT. Database: BIRD, Spider-Dev. Metrics: Execution accuracy.
  Category: In-context. Sub-category: Multi-Turn Interaction, Prompt Optimization.
  Novelty: Stepwise generation with four tools for proactive information retrieval.
  Weakness: Lack of scalability for wide tables. Reference: [63].

SQLfuse
  LLM used: Open-source LLMs. Database: Spider. Metrics: Execution accuracy.
  Category: Fine-tuning. Sub-category: Enhanced Architecture, SQL Critic.
  Novelty: SQL generation with a critic module for continuous improvement.
  Weakness: Complex module integration requires high system resources. Reference: [64].

Knowledge-to-SQL
  LLM used: DELLM. Database: BIRD, Spider. Metrics: Execution accuracy.
  Category: Fine-tuning. Sub-category: Data Expert, Schema Augmentation.
  Novelty: Uses tailored Data Expert LLM for knowledge enhancement in SQL generation.
  Weakness: Knowledge generation specialized for specific domains only. Reference: [65].
TABLE V
STATE-OF-THE-ART METHODS USED IN FINE-TUNING (FT) FOR LLM-BASED TEXT-TO-SQL. C1: PRE-TRAINED, C2: DECOMPOSITION, C3: DATA AUGMENTED, C4: ENHANCED ARCHITECTURE [78], [64], [79], [80], [81], [82]. THIS TABLE WAS COMPILED BASED ON INFORMATION FROM PREVIOUS WORK [3].
for performance and correctness, minimizing errors in execution [73].
2) Fine-Tuning: Fine-tuning involves refining the model's internal parameters, θ, using task-specific datasets. Unlike in-context learning methods, where the model's parameters remain fixed and prompts are the primary mechanism of control, fine-tuning updates the model's parameters based on examples from the target task. This process allows the model to become more specialized in tasks like SQL query generation, improving the ability to translate NL questions into accurate SQL queries over time.
Mathematically, the outcome of this process is represented as a function g:

θ′ = g(θ, D)   (2)

where θ is the pre-trained model's parameters, D is the task-specific dataset (pairs of questions and SQL queries), and θ′ represents the updated parameters after fine-tuning [79].
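The following sketch shows what g(θ, D) can look like in practice as a minimal supervised fine-tuning loop, assuming a Hugging Face seq2seq model and tokenizer (e.g., T5); batching, evaluation, and checkpointing are omitted.

import torch

def fine_tune(model, tokenizer, pairs, epochs=1, lr=1e-5):
    # θ' = g(θ, D): update the pre-trained parameters θ on a
    # task-specific dataset D of (question + schema, SQL) pairs.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for question, sql in pairs:
            batch = tokenizer(question, text_target=sql,
                              return_tensors="pt", truncation=True)
            loss = model(**batch).loss  # teacher-forced LM loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model  # its parameters are now θ'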
In fine-tuning, the model learns patterns specific to SQL generation, such as understanding database schema and query syntax. This helps it to perform better at generating SQL queries by modifying its internal parameters based on task data, making the model more specialized and accurate for the SQL task. The new parameters, θ′, improve the model's ability to generalize across different databases and queries. Currently, a number of studies have been released exploring improved fine-tuning methods. Table V shows categorized, well-designed fine-tuning methods.

• Pre-trained Methods: Pre-trained methods form the backbone of fine-tuning for text-to-SQL systems by leveraging the general knowledge embedded in LLMs, such as GPT, LLaMA, and T5. These models, trained on diverse textual data, are adapted for SQL query generation by fine-tuning with task-specific data. The fine-tuning process enhances their ability to interpret NL and accurately map it to SQL commands across different domains and database schemas [83]. Examples like SQL-GEN show how pre-trained models are fine-tuned with synthetic data for dialect-specific SQL generation, while systems like RESDSQL [83] fine-tune LLMs on datasets like Spider for complex query handling [78].
• Fine-Tuning Decomposition: Fine-tuning decomposition methods aim to enhance the performance of LLMs on text-to-SQL tasks by breaking down the complex process of query generation into smaller, manageable sub-tasks. The main idea is to address each sub-task individually, thereby allowing the model to better focus and fine-tune its parameters for specific challenges related to text-to-SQL generation. By decomposing the task into stages like schema linking and query formation, the model can be trained to handle these distinct processes more effectively than if it were trained on the entire query generation task all at once [80] (see the sketch after this list). The typical fine-tuning decomposition process involves:
  – Task Segmentation: breaking down the text-to-SQL conversion into smaller tasks like schema linking and SQL query generation.
  – Sequential Fine-Tuning: training the model on these sub-tasks in sequence or in parallel so that each sub-task is learned optimally.
• Data Augmented Methods: The performance of the model is particularly affected by the quality of the training labels during fine-tuning. Inadequate labeling can be counterproductive, and often optimal results are not achieved. Conversely, if effective augmentation or high-quality data is present, fine-tuning is likely to yield better results than even the best fine-tuning strategies applied to low-quality or raw data. In text-to-SQL and other problems, data-augmented fine-tuning has progressed greatly, as more efforts now aim at improving the data quality rather than the architecture. As an example, Symbol-LLM has developed an injection and an infusion phase with a focus on improving the data quality during instruction tuning [82], [81].
• Enhanced Architecture: The generative pre-trained transformer (GPT) framework employs a decoder-only transformer architecture combined with standard auto-regressive decoding for text generation [3]. However, recent research on the efficiency of large language models (LLMs) has highlighted a shared challenge: when generating long sequences in the auto-regressive paradigm, the attention mechanism increases latency. This issue is pronounced in LLM-based text-to-SQL systems, where generating SQL queries is slower than traditional language modeling, posing a challenge for developing efficient, localized NL interfaces to databases (NLIDB). To address this, consistency large language models (CLLMs) have been developed with an enhanced model architecture, providing a solution to reduce latency and speed up SQL query generation [79].
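To make the decomposition idea concrete, the sketch below wires two stages together, reusing the hypothetical link_schema() and llm() helpers from the earlier sketches; in a real decomposition pipeline each stage would be a separately fine-tuned model rather than a prompt.

def decomposed_text_to_sql(question, schema):
    # Stage 1 (schema linking): prune the schema to the parts that
    # matter for this question, learned as its own sub-task.
    linked = link_schema(question.split(), schema)
    # Stage 2 (SQL generation): generate conditioned only on the
    # pruned schema, learned as a second sub-task.
    prompt = (f"Relevant schema: {linked}\n"
              f"Question: {question}\nSQL:")
    return llm(prompt)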
3) RAG-based Text-to-SQL Systems: RAG-based text-to-SQL systems integrate dynamic retrieval abilities with generative models to improve SQL query generation [31]. These systems can be categorized into five categories: Dynamic Retrieval, Knowledge-Enhanced Retrieval, Schema-Augmented Prompting, Context-Aware Retrieval, and Robustness Enhancement.

• Dynamic Retrieval:
These systems dynamically fetch schema-related information, such as metadata, table descriptions, or previously used queries, to provide relevant context for SQL generation [51]. These systems improve adaptability in zero-shot or few-shot scenarios [47]. However, computational overhead due to repeated retrieval queries can reduce efficiency in real-time applications.
• Knowledge-Enhanced Retrieval:
Methods in this category integrate domain-specific unstructured knowledge with schema-based retrieval to improve SQL generation [50]. These systems bridge the gap between schema and query understanding by using external knowledge sources [48]. These systems are also useful for handling domain-specific terminology and queries with incomplete schema information. However, they are reliant on the quality and availability of domain-specific knowledge, which may not always be accessible.
• Schema-Augmented Prompting:
Schema-augmented prompting uses retrieved schema information to build precise prompts for LLMs [52]. These systems improve accuracy for complex SQL, like queries requiring multi-table joins or nested operations [50]. However, prompt construction may become verbose, running into token limitations and inefficiencies in LLM inference.
• Context-Aware Retrieval:
Context-aware retrieval systems focus on retrieving relevant context across multi-turn conversations or interactive sessions to generate accurate SQL queries [53]. These systems perform well at handling ambiguities in follow-up queries and maintaining continuity across dialogue turns [51]. However, context tracking and retrieval can become expensive as conversations grow longer.
• Robustness Enhancement:
Robustness systems retrieve alternate schema interpretations or use synonym mappings to handle ambiguity and adversarial challenges in SQL generation. These systems increase resilience to noisy inputs and ambiguous schema mappings. However, if irrelevant data is included, the potential for inaccurate SQL increases.

4) Novelty and Advantages of RAG-based Systems: RAG-based text-to-SQL systems introduce several novel features that distinguish them from in-context learning and fine-tuning-based methods:

• 1. Dynamic Contextualization:
Unlike static fine-tuned models, RAG-based systems can dynamically adapt to new schemas or domains by
TABLE VI
COMPARISON OF RAG-BASED, IN-CONTEXT LEARNING, AND FINE-TUNING METHODS FOR LLM-BASED TEXT-TO-SQL