CN118484465A - A method and device for generating SQL statements from natural language statements

Info

Publication number: CN118484465A
Application number: CN202410452634.4A
Authority: CN
Inventors: 卞华星; 李金霞; 程力涵; 温富国; 丁漫江; 沈键; 周晓宇; 施新宇; 黄燕; 王茜
Original assignee: Materials Branch of State Grid Jiangsu Electric Power Co Ltd
Current assignee: Materials Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2024-04-16
Filing date: 2024-04-16
Publication date: 2024-08-13
Anticipated expiration: 2044-04-16
Also published as: CN118484465B

Abstract

The present invention discloses a method and device for generating SQL statements from natural language statements. The method comprises establishing a language processing model based on data enhancement, wherein the language processing model comprises a data enhancement module and a task training module; obtaining first training data and inputting the data into the data enhancement module for data syntax analysis, information removal and data sampling to obtain second training data; merging the first training data and the second training data to generate training data and sending the training data to the task training module for training until the language processing model converges; establishing a language conversion model and embedding the language processing model into the language conversion model; the language conversion model receives the natural language statement to be converted, and decomposes the natural language statement to be converted through the language conversion model to obtain a simple statement; encoding and decoding are performed based on the simple statement to generate SQL statements. The present invention can realize the effective processing of complex natural language statements and improve the processing capability of complex database queries.

Description

A method and device for generating SQL statements from natural language statements

Technical Field

The present invention relates to the field of computer technology, and in particular to a method and device for generating SQL statements from natural language statements.

Background Art

In computer network applications, large amounts of data are stored in databases. Retrieving and analyzing that data requires structured query statements, such as SQL statements, so users must understand the syntax rules of the query language, which limits use by non-technical users. NL2SQL (Natural Language to Structured Query Language) is a technology that converts natural language statements entered by users into corresponding executable SQL statements. It breaks down the barrier between computer languages and human natural language by automatically converting natural language into structured SQL query statements, so that users can interact with a database in natural language without expert knowledge.

Patent document CN110888897A provides a method and device for generating SQL statements from natural language. The method comprises: 1) converting the natural language N into a sentence vector Ns using a sentence vector generation method; 2) using the sentence vector generation method to convert the table descriptions of all tables in the database to be queried into a description vector Ti for each table; 3) calculating the correlation between each table's description vector Ti and the natural language sentence vector Ns; 4) selecting the n tables with the highest correlation as candidate tables; 5) converting the natural language N into a corresponding SQL template using a semantic analysis algorithm, traversing the selected candidate tables, and inserting each candidate table into the SQL template to obtain a list of SQL statements; 6) calculating the confidence of the SQL statement list and selecting the matching SQL statement according to the confidence. However, this method is only suitable for converting simple natural language. Complex natural language is difficult to express with a single SQL template, and the model becomes overly complicated.

Summary of the Invention

The present invention provides a method and device for generating SQL statements from natural language statements. Non-technical users need not understand SQL syntax rules, which lowers the barrier to use, and both the capability to process complex queries and query accuracy can be improved.

In a first aspect, the present invention provides a method for generating SQL statements from natural language statements, comprising:

establishing a language processing model based on data enhancement, wherein the language processing model includes a data enhancement module and a task training module;

acquiring first training data, and inputting the first training data into the data enhancement module for data syntax analysis, information removal, and data sampling to obtain second training data;

merging the first training data and the second training data to generate training data, and sending the training data to the task training module for training until the language processing model converges;

establishing a language conversion model, and embedding the language processing model into the language conversion model;

receiving, by the language conversion model, a natural language statement to be converted, and decomposing the natural language statement to be converted through the language conversion model to obtain simple statements;

encoding and decoding based on the simple statements to generate an SQL statement.

Furthermore, the first training data includes multiple groups of sample data, and each group of sample data includes a natural language statement, the corresponding SQL statement, and the associated table schema.

Furthermore, inputting the first training data into the data enhancement module for data syntax analysis, information removal, and data sampling to obtain the second training data includes:

performing syntax analysis on the SQL statements in the first training data to determine the syntax rules in the SQL statements;

determining each component of the SQL statements according to the syntax rules;

for each SQL statement, removing information according to its components to generate multiple corresponding sub-productions;

analyzing the sub-productions to obtain corresponding SQL template classification results;

selecting a preset number of the most frequently occurring SQL templates, and obtaining the multiple different natural language statements corresponding to them;

randomly sampling and combining the SQL templates, the corresponding natural language statements, and the table schemas to generate the second training data.
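By way of illustration only, the template selection and sampling steps above may be sketched in Python as follows. The sketch substitutes a much simpler technique for the sub-production analysis: it derives a template by replacing literals with placeholders. The function names (`to_template`, `top_templates`, `synthesize`), the placeholder tokens, and the sample format are assumptions introduced for the example and are not taken from the disclosed embodiment.

```python
import random
import re
from collections import Counter, defaultdict


def to_template(sql):
    """Collapse a query to a coarse template (assumed simplification:
    only string and numeric literals are abstracted away)."""
    sql = re.sub(r"'[^']*'", "<STR>", sql)           # string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "<NUM>", sql)   # numeric literals
    return re.sub(r"\s+", " ", sql).strip()


def top_templates(samples, k):
    """Group (question, sql) pairs by template and keep the k most frequent."""
    groups = defaultdict(list)
    for question, sql in samples:
        groups[to_template(sql)].append(question)
    counts = Counter({tpl: len(qs) for tpl, qs in groups.items()})
    return [(tpl, groups[tpl]) for tpl, _ in counts.most_common(k)]


def synthesize(samples, schemas, k, n, seed=0):
    """Randomly combine frequent templates, their questions, and table schemas."""
    rng = random.Random(seed)
    chosen = top_templates(samples, k)
    synthetic = []
    for _ in range(n):
        template, questions = rng.choice(chosen)
        synthetic.append({
            "template": template,
            "question": rng.choice(questions),
            "schema": rng.choice(schemas),
        })
    return synthetic
```

In this sketch the synthetic records would still need their placeholders filled with values drawn from the sampled schema before they could be merged with the first training data.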

Furthermore, the syntax rules of the SQL statement are described by productions; the components of the SQL statement include non-terminal symbols and expressions, and an expression is a sequence of non-terminal symbols and terminal symbols;

for each SQL statement, removing information according to its components to generate multiple corresponding sub-productions includes:

in one operation, removing one non-terminal symbol from the expression of the SQL statement to obtain one sub-production;

removing the non-terminal symbols in the expression of the SQL statement one after another to obtain multiple sub-productions.

Furthermore, training by the task training module according to the training data includes:

establishing a first intermediate state statement between the natural language statement and the SQL statement;

iteratively performing the following operations until a training stop condition is met:

randomly masking the syntax sequence of the first intermediate state statement based on a masking mechanism;

predicting and filling in the randomly masked parts of the first intermediate state statement according to the corresponding natural language statement and SQL statement in the training data;

calculating the loss function value from the prediction result, and obtaining the gradient values for gradient descent during the calculation;

updating the parameters of the language processing model according to the gradient values.
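By way of illustration only, the random masking step of the training loop above may be sketched in Python as follows. Only the construction of masked inputs and prediction targets is shown; the model, the loss function, and the gradient update are omitted. The `[MASK]` token, the masking ratio, and the function names are assumptions introduced for the example.

```python
import random

MASK = "[MASK]"  # assumed placeholder token


def mask_sequence(tokens, ratio=0.3, seed=None):
    """Randomly hide a share of the tokens of an intermediate state statement.

    Returns the masked sequence and a dict {position: original token}
    that a model would be trained to predict.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(len(tokens) * ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


def token_accuracy(predictions, targets):
    """Share of masked positions filled correctly (a stand-in for the loss)."""
    correct = sum(1 for pos, tok in targets.items() if predictions.get(pos) == tok)
    return correct / len(targets)
```

A training iteration under these assumptions would call `mask_sequence`, ask the model to fill the masked positions given the natural language statement and the SQL statement, and compute a loss over the entries of `targets`.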

Furthermore, decomposing, by the language conversion model, the natural language statement to be converted to obtain simple statements includes:

using the embedded language processing model, encoding, by the language conversion model, the natural language statement to be converted into a vector representation, and inputting the vector representation into a preset feedforward neural network to predict whether the natural language statement to be converted contains multiple layers of complex semantics;

if the natural language statement to be converted contains multiple layers of complex semantics, building a semantic tree from the natural language statement to be converted;

performing a depth-first traversal of the semantic tree to obtain the simple statements.
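By way of illustration only, the depth-first traversal of the semantic tree may be sketched in Python as follows. The source does not specify the node structure or the visiting order; the `SemanticNode` class and the post-order visit (nested sub-questions are emitted before the question that depends on them) are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SemanticNode:
    """One node of a hypothetical semantic tree: a simple statement plus
    the nested sub-questions it depends on."""
    text: str
    children: List["SemanticNode"] = field(default_factory=list)


def simple_statements(root):
    """Depth-first (post-order) traversal returning the simple statements."""
    ordered = []

    def visit(node):
        for child in node.children:
            visit(child)
        ordered.append(node.text)

    visit(root)
    return ordered


tree = SemanticNode(
    "Which cities have an average salary above that value?",
    [SemanticNode("What is the overall average salary?")],
)
print(simple_statements(tree))
```

With post-order visiting, the inner question about the overall average appears first, which matches the order in which a nested SQL subquery would have to be resolved.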

Furthermore, encoding and decoding based on the simple statements to generate an SQL statement includes:

parsing a simple statement and converting it into a statement graph representation;

analyzing the relationships between the tables and columns of the database to obtain a graph representation of the database schema;

associating and matching the statement graph representation with the graph representation of the database schema to obtain an association graph;

encoding the association graph using the language conversion model and a pre-established relation-aware network to obtain a vector representation;

decoding the vector representation using a long short-term memory (LSTM) network to generate a second intermediate state statement between the simple statement and the database schema;

inferring and concatenating SQL statements according to the second intermediate state statement to generate the final SQL statement.
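By way of illustration only, the construction of the association graph in the first three steps above may be sketched in Python as follows. The source does not state how the statement graph is built or how matching is performed; the chain-shaped statement graph, the exact-name matching rule, the edge labels, and the function names are all assumptions introduced for the example. The encoding and decoding steps are not covered.

```python
def schema_graph(schema):
    """schema: {table: [columns]} -> list of (source, relation, target) edges."""
    edges = []
    for table, columns in schema.items():
        for col in columns:
            edges.append((f"table:{table}", "has_column", f"column:{table}.{col}"))
    return edges


def link(tokens, schema):
    """Connect statement tokens to schema items whose names match exactly
    (case-insensitive). A real system would need fuzzier matching."""
    edges = []
    for i, tok in enumerate(tokens):
        word = tok.lower()
        for table, columns in schema.items():
            if word == table.lower():
                edges.append((f"token:{i}", "matches_table", f"table:{table}"))
            for col in columns:
                if word == col.lower():
                    edges.append((f"token:{i}", "matches_column", f"column:{table}.{col}"))
    return edges


def association_graph(tokens, schema):
    """Union of the statement graph (here a simple token chain),
    the schema graph, and the linking edges between them."""
    statement_edges = [
        (f"token:{i}", "next", f"token:{i + 1}") for i in range(len(tokens) - 1)
    ]
    return statement_edges + schema_graph(schema) + link(tokens, schema)
```

The resulting edge list is the kind of structure from which an input vector for the encoder could then be constructed.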

Furthermore, encoding the association graph using the language conversion model and the pre-established relation-aware network to obtain the vector representation includes:

constructing an input vector according to the association graph;

encoding the input vector using the language conversion model to obtain an initial vector representation;

performing position embedding for each word in the simple statement;

further embedding the initial vector representation and the position embeddings using the relation-aware network to obtain the vector representation.

In a second aspect, the present invention provides a device for generating SQL statements from natural language statements, comprising:

a first model building module, configured to establish a language processing model based on data enhancement, wherein the language processing model includes a data enhancement module and a task training module;

a data processing module, configured to acquire first training data, and input the first training data into the data enhancement module for data syntax analysis, information removal, and data sampling to obtain second training data;

a training module, configured to merge the first training data and the second training data to generate training data, and send the training data to the task training module for training until the language processing model converges;

a second model building module, configured to establish a language conversion model and embed the language processing model into the language conversion model;

a decomposition module, configured to receive, through the language conversion model, a natural language statement to be converted, and decompose the natural language statement to be converted through the language conversion model to obtain simple statements;

a statement generation module, configured to perform encoding and decoding based on the simple statements to generate an SQL statement.

Furthermore, the data processing module includes:

Furthermore, the data processing module also includes:

Furthermore, the training module also includes:

Furthermore, the decomposition module also includes:

Furthermore, the statement generation module includes:

Furthermore, the statement generation module also includes:

In a third aspect, the present invention further provides a computer storage medium storing computer instructions which, when invoked, are used to execute the above method.

The method and device for generating SQL statements from natural language statements provided by the present invention have at least the following beneficial effects:

(1) Non-technical users can interact with a database in natural language without understanding SQL syntax rules, which lowers the barrier to use. Establishing a language processing model based on data enhancement allows complex natural language statements to be processed effectively, and the data enhancement and task training modules improve the capability to process complex queries. Decomposing natural language into simple statements through the language conversion model can improve query accuracy. The data enhancement module can strengthen the generalization ability of the model. Automated conversion from natural language to SQL statements reduces the need to write and debug SQL statements manually, making the process automated and intelligent.

(2) The first intermediate state statement and the second intermediate state statement provide a more direct and simplified path for converting natural language queries into executable SQL statements, so that complex queries are processed more effectively and errors that may occur during conversion are reduced.

Brief Description of the Drawings

FIG. 1 is a flowchart of a method for generating SQL statements from natural language statements provided by the present invention;

FIG. 2 is a flowchart of obtaining the second training data according to an embodiment of the present invention;

FIG. 3 is a flowchart of obtaining simple statements according to an embodiment of the present invention;

FIG. 4 is a flowchart of generating an SQL statement according to an embodiment of the present invention;

FIG. 5 is a flowchart of obtaining the vector representation according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a device for generating SQL statements from natural language statements provided by the present invention.

Detailed Description

For a better understanding of the above technical solution, it is described in detail below with reference to the accompanying drawings and specific embodiments.

As shown in FIG. 1, this embodiment provides a method for generating SQL statements from natural language statements, including:

The method provided by the above embodiment enables non-technical users to interact with a database in natural language without understanding SQL syntax rules, which lowers the barrier to use. Establishing a language processing model based on data enhancement allows complex natural language statements to be processed effectively. Through the data enhancement and task training modules, the model can learn a richer mapping between natural language and database schemas, which improves its capability to process complex queries. Decomposing natural language into simple statements through the language conversion model, and then encoding and decoding those simple statements to generate SQL statements, can improve query accuracy. The data enhancement module lets the model train on more diverse data, which helps it adapt to different database schemas and natural language expressions and thereby strengthens its generalization ability. Automated conversion from natural language to SQL statements reduces the need to write and debug SQL statements manually, making the process automated and intelligent.

The language processing model may be a pre-trained language model, that is, a machine learning model trained on large amounts of text data in order to understand and generate human language. By learning from large-scale text corpora, such a model captures the grammar, vocabulary, and contextual relationships of a language, and can therefore encode and decode language without task-specific guidance.

The first training data includes multiple groups of sample data, and each group of sample data includes a natural language statement, the corresponding SQL statement, and the associated table schema.

Natural language statements are the form of language used in everyday communication, usually free-form text that can express rich information and requests. In NL2SQL, a natural language statement is a question or command issued by a user in order to retrieve or manipulate data in a database. For example, a user may ask: "Which cities have an average salary higher than the average?" This natural language statement needs to be converted into a corresponding SQL statement so that the query can be executed on the database.

SQL (Structured Query Language) is a language for managing and operating relational databases. It is a structured query language used to perform database operations such as querying, updating, inserting, and deleting data. SQL statements follow strict syntax rules and must specify the correct keywords, table names, column names, conditions, and so on. For example, for the user question above, the corresponding SQL statement may be: "SELECT city FROM salaries WHERE avg_salary > (SELECT AVG(salary) FROM salaries)".

The table schema is the structure of the tables in the database, including the column names, data types, and keys (such as primary keys and foreign keys) of each table, and the relationships between tables (such as one-to-one, one-to-many, or many-to-many relationships). The table schema is part of the database and defines how data is organized and stored in it. In an NL2SQL system, the table schema is crucial for understanding the entities and attributes in natural language statements, because it provides the contextual information needed to map natural language to specific database fields.
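By way of illustration only, the example SQL statement above can be executed against a small in-memory table using Python's standard `sqlite3` module. The source does not give the schema of the `salaries` table; the column layout (`city`, `salary`, `avg_salary`) and the sample rows below are assumptions chosen so that the statement runs as written.

```python
import sqlite3


def cities_above_average():
    """Run the example query on a hypothetical salaries table."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema: one row per city, with both columns the query mentions.
        conn.execute(
            "CREATE TABLE salaries (city TEXT, salary REAL, avg_salary REAL)"
        )
        conn.executemany(
            "INSERT INTO salaries VALUES (?, ?, ?)",
            [
                ("Nanjing", 8000, 8000),
                ("Suzhou", 10000, 10000),
                ("Wuxi", 12000, 12000),
            ],
        )
        sql = (
            "SELECT city FROM salaries "
            "WHERE avg_salary > (SELECT AVG(salary) FROM salaries)"
        )
        return [row[0] for row in conn.execute(sql)]
    finally:
        conn.close()


print(cities_above_average())
```

With these sample rows the subquery evaluates to 10000, so only the city whose `avg_salary` exceeds that value is returned.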

Further, as shown in FIG. 2, inputting the first training data into the data enhancement module for data syntax analysis, information removal, and data sampling to obtain the second training data may include:

The syntax analysis performed to determine the syntax rules in the SQL statement may include the following:

Identifying keywords and operators: syntax analysis needs to identify the keywords (such as SELECT, FROM, WHERE, and GROUP BY) and operators (such as =, <, >, AND, and OR) in the SQL statement; keywords and operators are the basic elements that make up the structure of an SQL statement;

Parsing expression structure: analyzing the expressions in the SQL statement, such as column names, function calls, and conditional expressions, together with the hierarchical relationships among them, for example identifying the relationship between a column name and a comparison operator, or the structure of a function name and its arguments;

Determining clause structure: an SQL statement can contain multiple clauses, such as a SELECT clause, a FROM clause, and a WHERE clause; syntax analysis needs to determine where each clause starts and ends, as well as the order and nesting of the clauses within the statement;

Identifying aggregate functions: aggregate functions such as COUNT, SUM, AVG, MAX, and MIN are used in SQL statements to perform calculations over a set of values; syntax analysis needs to identify where these functions are used and understand their role in the query;

Processing join queries: for SQL statements containing JOIN operations, syntax analysis needs to identify the join type (such as INNER JOIN, LEFT JOIN, or RIGHT JOIN), as well as the join conditions and the tables involved;

Identifying data types and constraints: SQL statements may specify data types (such as INT, VARCHAR, and DATE) and constraints (such as PRIMARY KEY, FOREIGN KEY, and NOT NULL); syntax analysis needs to identify this information so that the data is understood and processed correctly;

Handling nested queries: for SQL statements containing subqueries, syntax analysis needs to identify the scope of each subquery and understand its relationship to the outer query;

Identifying comments and whitespace: comments and whitespace in an SQL statement do not affect its execution but matter for reading and understanding it; syntax analysis needs to handle them correctly so that they do not interfere with the recognition of syntax rules.

Through the above syntax analysis steps, the data enhancement module can gain a thorough understanding of the structure and meaning of SQL statements, and can therefore enhance and transform the data effectively while preserving its semantics, providing richer and more diverse training data for the subsequent training module. The enhanced data helps improve the generalization ability and accuracy of the NL2SQL model.

When determining the components of the SQL statement, the components may include:

1. SELECT clause: specifies the columns to be returned in the query result, for example: SELECT customers.name, orders.total_amount, where customers.name selects the name column from the customers table and orders.total_amount selects the total_amount column from the orders table.

2. FROM clause: specifies the tables involved in the query.

3. JOIN clause: joins multiple tables and specifies the join conditions.

4. WHERE clause: specifies filter conditions so that only records meeting those conditions are returned.

5. GROUP BY clause: groups the result set by one or more columns.

6. HAVING clause: filters the grouped results so that only groups meeting the specified conditions are included.

7. ORDER BY clause: specifies how the returned results are sorted.
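By way of illustration only, splitting a flat SQL statement into the components listed above may be sketched in Python with a regular expression. This is a deliberately naive stand-in for real syntax analysis: it does not handle subqueries, keywords inside string literals, or comments, and it leaves any JOIN inside the FROM component. The function name, the keyword list, and the join columns in the usage example (`customers.id`, `orders.customer_id`) are assumptions introduced for the example.

```python
import re

# Clause keywords recognized by this sketch (JOIN is kept inside FROM).
CLAUSE_KEYWORDS = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY"]

_CLAUSE_RE = re.compile(
    r"\b(" + "|".join(k.replace(" ", r"\s+") for k in CLAUSE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def split_clauses(sql):
    """Return {clause keyword: clause body} for a flat (non-nested) query."""
    matches = list(_CLAUSE_RE.finditer(sql))
    clauses = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(sql)
        keyword = re.sub(r"\s+", " ", match.group(1)).upper()
        clauses[keyword] = sql[match.end():end].strip()
    return clauses


example = (
    "SELECT customers.name, orders.total_amount FROM customers "
    "JOIN orders ON customers.id = orders.customer_id "
    "WHERE orders.total_amount > 100 ORDER BY orders.total_amount DESC"
)
for keyword, body in split_clauses(example).items():
    print(keyword, "->", body)
```

A production system would more plausibly rely on a full SQL parser; the sketch only shows how clause boundaries map onto the component list.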

In this embodiment, the syntax rules of the SQL statement are described by productions; the components of the SQL statement include non-terminal symbols and expressions, and an expression is a sequence of non-terminal symbols and terminal symbols. For each SQL statement, removing information according to its components to generate multiple corresponding sub-productions includes:

The expression of an SQL statement is the part of the statement that specifies an operation. Non-terminal symbols are the symbols used to define syntax rules in a context-free grammar, and they represent syntactic structures that can be decomposed further. In this embodiment, multiple sub-productions are obtained by removing non-terminal symbols step by step; each sub-production is a simplified or transformed form of the original expression. As an example, a simplified context-free grammar can be defined to describe the structure of an SQL statement expression, with the following grammar rules:

SelectStatement → SELECT Columns FROM Table WHERE Condition

Columns → Column | Columns, Column

Column → * | ColumnName

Table → TableName

Condition → Column Operator Value | Condition AND Condition

Operator → = | < | > | !=

According to the above grammar rules, the expression of an SQL statement can be represented as:

SelectStatement

Removing the SelectStatement non-terminal symbol yields the sub-production:

SELECT Columns FROM Table WHERE Condition

Removing the Columns non-terminal symbol yields two possible sub-productions:

SELECT Column FROM Table WHERE Condition

SELECT Columns, Column FROM Table WHERE Condition

For the first sub-production, one can go on to remove the Column, Table, or Condition non-terminal symbol. Removing the Column non-terminal symbol again yields two possible sub-productions:

SELECT * FROM Table WHERE Condition

SELECT ColumnName FROM Table WHERE Condition

Removing the Table non-terminal symbol yields the following sub-production:

SELECT Columns FROM TableName WHERE Condition

where TableName is a specific table name, such as customers;

Removing the Condition non-terminal symbol yields the following sub-production:

SELECT Columns FROM Table

which represents a query with no filter condition.

By removing non-terminal symbols in this way, multiple possible variants of an SQL statement can be generated. Each sub-production is a valid alternative to the original expression, and these alternatives can be used for data enhancement or other operations in NL2SQL tasks.
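By way of illustration only, the removal steps above may be sketched in Python as follows. The sketch reads "removing" a non-terminal symbol as rewriting it with each alternative of its grammar rule, and models the removal of Condition separately as dropping the WHERE part, since that is what the example shows. This reading, the `GRAMMAR` dictionary layout, and the function names are assumptions introduced for the example.

```python
# The simplified grammar from the text, one list of alternatives per non-terminal.
GRAMMAR = {
    "SelectStatement": [["SELECT", "Columns", "FROM", "Table", "WHERE", "Condition"]],
    "Columns": [["Column"], ["Columns", ",", "Column"]],
    "Column": [["*"], ["ColumnName"]],
    "Table": [["TableName"]],
    "Condition": [["Column", "Operator", "Value"], ["Condition", "AND", "Condition"]],
    "Operator": [["="], ["<"], [">"], ["!="]],
}


def expand_once(form, symbol):
    """Rewrite the leftmost occurrence of `symbol` with each of its alternatives,
    returning one sub-production per alternative."""
    if symbol not in form:
        return []
    i = form.index(symbol)
    return [form[:i] + alt + form[i + 1:] for alt in GRAMMAR[symbol]]


def drop_where(form):
    """Remove the filter part entirely, giving a query with no condition."""
    if "WHERE" not in form:
        return list(form)
    return form[:form.index("WHERE")]


start = GRAMMAR["SelectStatement"][0]
for sub in expand_once(start, "Columns"):
    print(" ".join(sub))
print(" ".join(drop_where(start)))
```

Applying `expand_once` repeatedly to each result enumerates further variants, which is the set of sub-productions that the template classification step would then group.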

进一步的，所述任务训练模块根据所述训练数据进行训练，可以包括：Furthermore, the task training module performs training according to the training data, which may include:

其中，中间状态语句(第一中间状态语句和第二中间状态语句)为中间表示，是一种结构化的形式，旨在简化从自然语言到SQL语句的转换过程。中间状态语句捕捉了自然语言语句中的关键信息，并将其组织成一种更接近SQL语句的结构，但比完整的SQL语句更简单、更易于处理的形式。中间状态语句存在以下几种表示方式：Among them, the intermediate state statements (the first intermediate state statement and the second intermediate state statement) are intermediate representations, which are structured forms designed to simplify the conversion process from natural language to SQL statements. The intermediate state statements capture the key information in the natural language statements and organize them into a structure that is closer to SQL statements, but simpler and easier to process than complete SQL statements. There are several ways to represent intermediate state statements:

1. Template-based intermediate representation:

The natural language statement may be: find the names and positions of all employees working in Beijing.

The corresponding intermediate state statement (template-based) is: SELECT name, position FROM employees WHERE city = 'Beijing';

Here the intermediate state statement follows a predefined template in which the SELECT clause specifies the columns to be selected (name and position), the FROM clause specifies the table (employees), and the WHERE clause specifies the filter condition (city = 'Beijing').

2. Syntax-tree-based intermediate representation:

The natural language statement may be: find the products whose sales exceed the average.

The corresponding intermediate state statement (syntax-tree-based) is: SELECT product_id, sales FROM sales_data GROUP BY product_id HAVING sales > (SELECT AVG(sales) FROM sales_data);

Here the intermediate state statement uses the SQL GROUP BY and HAVING clauses to compute the average sales and filter out the products whose sales exceed it. This representation is closer to the final SQL statement but omits some details, such as specific table and column names; the omitted parts can be determined in subsequent processing.

3. Logical-form-based intermediate representation:

The natural language statement may be: list the average salary of each department.

The corresponding intermediate state statement (logical-form-based) is: department average salary filter(department_id, employees)

Here the intermediate state statement expresses the query in a logical form: "department average salary" describes the aggregate value to be computed, and filter(department_id, employees) indicates that filtering is performed according to department_id in the employees table. This logical form does not correspond directly to SQL syntax, but it provides clear query semantics that can be further converted into an SQL statement.
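As a small illustration of the template-based representation (the first of the three forms above), the following sketch fills a fixed template with slot values. It assumes a single three-slot template; `QUERY_TEMPLATE` and `fill_template` are hypothetical names, and a real system would presumably hold many templates.

```python
from string import Template

# Hypothetical predefined template with three slots.
QUERY_TEMPLATE = Template("SELECT $columns FROM $table WHERE $condition")


def fill_template(columns, table, condition):
    """Fill the template slots with values extracted from the question."""
    return QUERY_TEMPLATE.substitute(
        columns=", ".join(columns),
        table=table,
        condition=condition,
    )


if __name__ == "__main__":
    # Slot values for "find the names and positions of all employees
    # working in Beijing".
    print(fill_template(["name", "position"], "employees", "city = 'Beijing'"))
```

The slot values would come from the model's analysis of the question; here they are written by hand.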

The present invention introduces intermediate state statements to provide a more direct and simplified path for converting natural language queries into executable SQL statements. In this way, the NL2SQL system can process complex queries more effectively and reduce the errors that may occur during conversion.

In this embodiment of the present invention, predicting and filling the randomly masked parts of the first intermediate state statement may cover the following aspects:

1. Keywords and operators: the keywords and operators in the SQL statement, such as SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY, and the logical relationships between them. For example, if the HAVING clause in the intermediate state statement is masked, the model needs to predict and fill in this part to ensure the correctness of the query.

2. Column names and table names: the model needs to predict the column names and table names referenced in the SQL statement. This usually involves extracting the relevant information from the natural language statement and converting it into the corresponding identifiers in the database schema.

3. Conditional expressions: in the WHERE clause, the model needs to predict the conditional expression, including the column name, the comparison operator (such as =, <, >, !=), and the condition value. For example, if the natural language statement mentions "age greater than 30", the model needs to predict the conditional expression age > 30.

4. Aggregate functions: for queries containing aggregate functions, such as SUM, AVG, COUNT, MAX, and MIN, the model needs to predict these functions and the column names to which they are applied.

5. Grouping and sorting: if the query contains GROUP BY and ORDER BY clauses, the model needs to predict the grouping columns and the sort order (ascending or descending).
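The random masking that precedes prediction can be sketched as follows. This is an assumption-laden illustration: the mask token `[MASK]`, the masking ratio, and the function `mask_statement` are hypothetical and are not specified in the text.

```python
import random

MASK = "[MASK]"  # hypothetical mask token


def mask_statement(tokens, ratio=0.3, seed=None):
    """Randomly mask a share of the tokens of an intermediate state statement.

    Returns the masked token list and a dict {position: original token},
    which serves as the prediction target during training.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    count = min(len(tokens), max(1, round(len(tokens) * ratio)))
    positions = rng.sample(range(len(tokens)), count)
    masked = list(tokens)
    targets = {}
    for position in positions:
        targets[position] = masked[position]
        masked[position] = MASK
    return masked, targets


if __name__ == "__main__":
    statement = "SELECT name FROM employees WHERE age > 30".split()
    masked, targets = mask_statement(statement, seed=0)
    print(masked)
    print(targets)
```

Whitespace tokenization is used only to keep the sketch short; masking whole clauses (for example an entire HAVING clause) would follow the same pattern with clause spans instead of single tokens.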

The prediction and filling process can be implemented with any of the following techniques:

1. Rule-based methods: use predefined rules and patterns to guide the prediction.

2. Statistical methods: use machine learning models, such as random forests or gradient boosting machines, to predict from the patterns in the training data.

3. Deep-learning-based methods: use neural networks, such as recurrent neural networks (RNNs) or Transformer models, to learn complex mappings and make end-to-end predictions.

In the NL2SQL task, the loss function measures the difference between the model's prediction and the actual SQL statement. By minimizing the loss function, the model adjusts its parameters to improve prediction accuracy. The loss function may include cross-entropy loss and sequence-to-sequence loss, which apply to classification problems and sequence generation problems respectively. The specific form of the loss function and its gradient must be adapted to the characteristics of the task and the model used. For example, if the task is to convert a natural language statement into an intermediate representation of an SQL statement, the loss function needs to take into account both the semantic information of the natural language statement and the structural information of the SQL statement. In addition, if a specific intermediate representation (such as an abstract syntax tree or a logical form) is used, the loss function and gradient computation must be adjusted accordingly so that they reflect the characteristics of that representation.
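For concreteness, a token-level cross-entropy over a generated sequence can be computed as below. This is a generic sketch of the standard loss, assuming the model emits one vector of unnormalized scores (logits) per output position; the function names are hypothetical and the text does not fix the exact loss used.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sequence_cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the target token at each position."""
    losses = []
    for logits, target in zip(step_logits, target_ids):
        probabilities = softmax(logits)
        losses.append(-math.log(probabilities[target]))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Two output positions over a toy vocabulary of three tokens.
    logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
    print(sequence_cross_entropy(logits, [0, 2]))
```

A structure-aware variant could weight positions differently (for example keywords versus identifiers), which is one way the adjustment mentioned above might be realized.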

Furthermore, as shown in FIG. 3, the language conversion model decomposing the natural language sentence to be converted to obtain simple sentences includes:

Encoding the natural language sentence to be converted into a vector representation is the process of turning natural language into a numerical form that a machine learning model can process. This process involves the following steps:

Tokenization and embedding: first, the natural language sentence is split into individual words or subwords (tokens); then each word or subword is mapped to an embedding vector in a high-dimensional space; these embedding vectors are obtained by pre-training on a large amount of text data and capture the semantic information of the words;

Context encoding: because the embedding of a single word may not be sufficient to express complex semantic relationships, a context encoder (such as a recurrent neural network (RNN), a long short-term memory network (LSTM), or a gated recurrent unit (GRU)) is used to process the natural language sequence; these encoders take into account the order of and dependencies between words and produce a vector representation that contains contextual information;

Attention mechanism: some more advanced models, such as Transformer or BERT, use an attention mechanism to attend dynamically to different parts of the input sequence; this allows the model to focus on the most relevant part of the input sequence when generating each output.
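The tokenization and embedding step can be illustrated with the toy sketch below. It makes strong simplifying assumptions: a whitespace tokenizer stands in for a real subword tokenizer, and the embedding table is random rather than pre-trained. All names (`build_vocab`, `make_embeddings`, `encode`) are hypothetical.

```python
import random

UNK = "<unk>"  # hypothetical token for out-of-vocabulary words


def build_vocab(sentences):
    """Assign an integer id to every distinct lower-cased token."""
    vocab = {UNK: 0}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def make_embeddings(vocab_size, dim, seed=0):
    """Random stand-in for a pre-trained embedding table."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def encode(sentence, vocab, table):
    """Map a sentence to token ids and to the matching embedding vectors."""
    ids = [vocab.get(token, vocab[UNK]) for token in sentence.lower().split()]
    return ids, [table[index] for index in ids]


if __name__ == "__main__":
    vocab = build_vocab(["find employees in Beijing"])
    table = make_embeddings(len(vocab), dim=4)
    ids, vectors = encode("find managers in Beijing", vocab, table)
    print(ids)
```

The resulting vectors would then be passed to the context encoder; unknown words fall back to the `<unk>` entry.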

Prediction through the preset feedforward neural network may include:

Preparing the input vectors: the encoded natural language sentence vectors are fed into the feedforward neural network as input; these vectors already contain the semantic and structural information of the natural language;

Layer-by-layer processing: the feedforward neural network consists of multiple layers, each containing multiple neurons; the input vector is passed through the layers in turn, and each layer applies a weight transformation and an activation function (such as ReLU or Sigmoid) to learn more abstract feature representations;

Output and prediction: the last layer of the network typically outputs a probability distribution indicating how likely it is that the natural language sentence to be converted contains multi-layer complex semantics; this output can be converted into probabilities by the softmax function and compared with the true label through the loss function;

Loss calculation and backpropagation: the loss between the prediction and the actual label is computed (for example, using binary cross-entropy loss); the gradients are then computed by the backpropagation algorithm, and the network weights are updated according to these gradients.
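The forward pass and loss of such a classifier can be sketched as follows. The sketch assumes a single hidden layer and a sigmoid output, which for a two-class decision is equivalent to a two-way softmax; the layer sizes, weights, and function names are hypothetical, and backpropagation is omitted.

```python
import math


def linear(x, weights, bias):
    """Affine transformation: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, value) for value in x]


def sigmoid(z):
    # Branching keeps math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_complex(x, w1, b1, w2, b2):
    """Probability that the sentence contains multi-layer complex semantics."""
    hidden = relu(linear(x, w1, b1))
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return sigmoid(logit)


def binary_cross_entropy(probability, label, eps=1e-12):
    """Loss between a predicted probability and a 0/1 label."""
    p = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


if __name__ == "__main__":
    x = [0.2, -0.4, 0.7]                        # encoded sentence vector (toy)
    w1 = [[0.5, -0.1, 0.3], [0.2, 0.4, -0.6]]   # hidden layer with 2 neurons
    b1 = [0.0, 0.1]
    p = predict_complex(x, w1, b1, w2=[1.0, -1.0], b2=0.0)
    print(p, binary_cross_entropy(p, 1))
```

A probability above a chosen threshold would route the sentence to the decomposition path; the threshold itself is not specified in the text.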

Building the semantic tree from the natural language sentence to be converted may include:

Identifying the main concepts and logical structure: analyze the natural language sentence to identify the main concepts (such as entities, attributes, and events) and the logical relationships between them (such as coordination, selection, and causality).

Constructing the root node: take the core semantics of the sentence, that is, its topic or main action, as the root node of the semantic tree.

Recursively constructing child nodes: for each main concept, recursively build child nodes according to its role in the sentence and its relationships to other concepts; child nodes can be expanded further into deeper subtrees to represent more complex semantic structures.

Representing coordination and selection relationships: coordinated elements in the sentence (such as those joined by "and" or "or") are represented by sibling nodes at the same level of the semantic tree; selection relationships (such as the "either ... or ..." structure) are represented by conditional branches.

Handling nested and modifying structures: for nested statements or subqueries, build internal subtrees to represent the nested logic; for modifying phrases or clauses (such as attributives and adverbials), link them to the appropriate position in the main tree by connecting the modifying node to the modified node.

Labeling semantic roles: assign a semantic role label, such as "subject", "predicate", "object", or "condition", to each node in the semantic tree to make the function of each node in the sentence explicit.

Optimizing and simplifying the semantic tree: as needed, remove redundant nodes and merge branches that can be merged, so that the tree structure is clear and easy to process.
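A possible data structure for such a semantic tree is sketched below. The class `SemanticNode`, its fields, and the hand-built example tree are assumptions made for illustration; the text does not prescribe a particular node layout.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticNode:
    label: str   # the concept, e.g. an entity, attribute, or action
    role: str    # semantic role tag such as "predicate" or "condition"
    children: list["SemanticNode"] = field(default_factory=list)

    def add(self, child):
        """Attach a child node and return it so that subtrees can be chained."""
        self.children.append(child)
        return child


def depth(node):
    """Number of levels in the subtree rooted at node."""
    return 1 + max((depth(child) for child in node.children), default=0)


# Hand-built tree for "find the names and positions of all employees
# working in Beijing". The coordinated elements "name" and "position"
# become sibling nodes; the modifier hangs under the node it modifies.
root = SemanticNode("find", "predicate")
root.add(SemanticNode("name", "object"))
root.add(SemanticNode("position", "object"))
employees = root.add(SemanticNode("employees", "subject"))
employees.add(SemanticNode("city = Beijing", "condition"))

if __name__ == "__main__":
    print(depth(root), [child.label for child in root.children])
```

In practice the tree would be produced by a parser rather than written by hand; the sketch only shows how coordination, modification, and role labels map onto nodes.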

The depth-first traversal of the semantic tree may include the following steps:

1) Select the starting node: start the traversal from the root node of the semantic tree.

2) Visit the current node: visit the current node and perform the operation appropriate to its type and content. For example, if the node represents an entity, the name of the entity may be recorded.

3) Recursively traverse the child nodes: recursively traverse all child nodes of the current node. Specifically, visit the leftmost child first, then the next child, and so on until all children have been visited. During the recursion, each child node is treated as a new "current node", and steps 2) and 3) are repeated.

4) Backtrack: after all child nodes of the current node have been visited, backtrack to its parent node. If the parent node has other unvisited children, continue the traversal from those children. If all children of the parent node have been visited, continue backtracking to higher-level parent nodes until no nodes remain to be visited.

5) Process leaf nodes: when a leaf node is reached, there are no child nodes to visit, so the information of the leaf node is recorded and the traversal backtracks to the parent node.

6) Traversal complete: repeat steps 1) to 5) until all nodes have been visited, at which point the depth-first traversal of the semantic tree is complete.
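The traversal steps above correspond to a standard pre-order depth-first search. The sketch below shows a recursive and an equivalent iterative version; representing a node as a `(label, children)` tuple is an assumption chosen to keep the example self-contained.

```python
def dfs_recursive(node, order=None):
    """Visit the node, then each child from left to right (steps 2 to 5)."""
    if order is None:
        order = []
    label, children = node
    order.append(label)               # visit the current node
    for child in children:            # leftmost child first
        dfs_recursive(child, order)   # returning from the call is the backtrack
    return order


def dfs_iterative(root):
    """Same visiting order, using an explicit stack instead of recursion."""
    order, stack = [], [root]
    while stack:
        label, children = stack.pop()
        order.append(label)
        # Push children reversed so that the leftmost child is popped first.
        stack.extend(reversed(children))
    return order


# Toy semantic tree: (label, [children]).
TREE = (
    "find",
    [
        ("name", []),
        ("position", []),
        ("employees", [("city = Beijing", [])]),
    ],
)

if __name__ == "__main__":
    print(dfs_recursive(TREE))
    print(dfs_iterative(TREE))
```

The iterative form avoids deep recursion on very deep trees; both produce the same visiting order.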

Further, as shown in FIG. 4, encoding and decoding based on the simple statement to generate an SQL statement includes:

Parsing the simple statement and converting it into a statement graph representation may include the following steps:

1. Syntactic analysis (parsing): perform syntactic analysis on the simple statement to identify its components, such as subject, predicate, object, attributive, and adverbial. Syntactic analysis yields a dependency tree in which each node represents a word or phrase and each edge represents a dependency between words.

2. Building the statement graph: on the basis of the dependency tree, build a statement graph that maps the entities, attributes, and relationships in the statement to nodes in the graph. Entities usually correspond to table or column names in the database, attributes correspond to specific column values or conditions, and relationships represent connections between entities or logical operations in the statement.

3. Labeling nodes and edges: nodes must be labeled correctly to reflect their roles in the SQL statement. For example, table names, column names, operators, functions, and values are each given a specific label. Edges represent relationships between nodes, such as JOIN, ON, WHERE, and SELECT, and their types must be labeled according to the semantics of the statement.

4. Handling complex structures: statements containing complex structures, such as nested queries, subqueries, and aggregate functions, require special handling to keep the graph accurate. These structures may need additional nodes and edges so that the statement graph fully captures the semantics of the statement.

5. Graph optimization: after the statement graph is built, it can be optimized by removing unnecessary nodes or edges and merging nodes that can be merged, which simplifies the graph and reduces ambiguity. Optimization improves the efficiency of subsequent processing steps, such as encoding and decoding.
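A labeled statement graph of this kind can be held in a very small structure, as sketched below. The class `StatementGraph`, the node labels, and the edge types used in the example are hypothetical; the graph is written by hand for the Beijing example rather than derived from a dependency parse.

```python
from dataclasses import dataclass, field


@dataclass
class StatementGraph:
    nodes: dict = field(default_factory=dict)  # node name -> label
    edges: list = field(default_factory=list)  # (source, edge type, target)

    def add_node(self, name, label):
        self.nodes[name] = label

    def add_edge(self, source, edge_type, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((source, edge_type, target))

    def neighbors(self, name, edge_type=None):
        """Targets reachable from name, optionally restricted to one edge type."""
        return [
            target
            for source, kind, target in self.edges
            if source == name and (edge_type is None or kind == edge_type)
        ]


# Hand-built graph for "find the names and positions of all employees
# working in Beijing".
graph = StatementGraph()
for name, label in [
    ("query", "root"),
    ("employees", "table"),
    ("name", "column"),
    ("position", "column"),
    ("city", "column"),
    ("'Beijing'", "value"),
]:
    graph.add_node(name, label)
graph.add_edge("query", "SELECT", "name")
graph.add_edge("query", "SELECT", "position")
graph.add_edge("query", "FROM", "employees")
graph.add_edge("query", "WHERE", "city")
graph.add_edge("city", "=", "'Beijing'")

if __name__ == "__main__":
    print(graph.neighbors("query", "SELECT"))
```

Typed edges make it straightforward to read off each clause later: the SELECT edges give the projected columns and the WHERE edge leads to the condition.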

The graph representation of the database schema may include nodes, edges, node attributes, a hierarchy, and a graph layout. Specifically, each node in the schema graph represents an entity in the database, which may be a table or a column of a table. A table is drawn as a rectangle, and a column is drawn as an ellipse or other shape placed inside the rectangle of its table.

Edges represent relationships between nodes; in a database schema these are foreign key relationships between tables or data dependencies between columns. If two tables are related by a foreign key, their nodes are joined by an edge whose label indicates the relationship type, such as "1:N" (one-to-many), "N:M" (many-to-many), or "1:1" (one-to-one). If there is a data dependency between columns, for example when the value of one column depends on the value of another, the nodes of the two columns are also joined by an edge.

Node attributes provide additional information, such as the table name, column name, data type, and whether the column is a primary key. These attributes can be displayed as node labels or viewed interactively by hovering or clicking. In some cases, the schema graph may show a hierarchy between tables, in which the relationship between parent and child tables is expressed by the levels of the edges. This hierarchy helps in understanding how the data is organized and which paths a query can take. The layout of the schema graph shows the relationships between tables and columns clearly, avoiding crossing and overlapping edges, and groups related tables and columns together so that their relationships are easy to identify and understand.
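To make the schema graph concrete, the sketch below stores tables, columns, and foreign key edges in plain Python structures and finds a join path between two tables with breadth-first search. The three-table schema, the relationship labels, and the function names are all hypothetical examples, not part of the disclosed method.

```python
from collections import deque

# Hypothetical schema: table name -> column names.
SCHEMA = {
    "employees": ["id", "name", "department_id"],
    "departments": ["id", "name", "location_id"],
    "locations": ["id", "city"],
}

# Foreign key edges: (table, column, referenced table, referenced column, type).
FOREIGN_KEYS = [
    ("employees", "department_id", "departments", "id", "N:1"),
    ("departments", "location_id", "locations", "id", "N:1"),
]


def column_nodes(schema):
    """Column nodes of the graph, each tied to the table that contains it."""
    return [(table, column) for table, columns in schema.items() for column in columns]


def build_adjacency(foreign_keys):
    """Undirected table-to-table adjacency derived from the foreign keys."""
    adjacency = {}
    for source, _, target, _, _ in foreign_keys:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


def join_path(start, goal, foreign_keys):
    """Shortest chain of tables linking start to goal, or None if unconnected."""
    adjacency = build_adjacency(foreign_keys)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(adjacency.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


if __name__ == "__main__":
    print(join_path("employees", "locations", FOREIGN_KEYS))
```

A path such as employees, departments, locations indicates which JOINs a query over employee names and cities would need.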

Further, as shown in FIG. 5, encoding the association graph using the language conversion model and a pre-established relation-aware network to obtain a vector representation includes:

Encoding the input vector using the language conversion model to obtain an initial vector representation may include the following steps:

The input natural language text is converted into a vector representation by an embedding layer and a recurrent neural network (such as an LSTM or GRU). In the embedding layer, each word or subword is converted into a fixed-length vector that captures its semantic information. These word vectors are fed into the recurrent neural network, which processes sequence data and captures long-distance dependencies. The final output of the network is a vector representation containing the information of the entire input sequence, that is, the initial vector representation.

The relation-aware network can be used to enhance the initial vector representation so that it better reflects the semantic relationships in the natural language sentence and the structural information of the database schema. Further embedding the initial vector representation and the position embedding with the relation-aware network to obtain the vector representation includes the following steps:

1. Initial vector representation: the initial vector representation is obtained through a language model (such as BERT or GPT) and contains the contextual information of each word in the natural language sentence.

2. Position embedding: the position embedding is additional vector information that captures the position of each element in the sequence. In the Transformer architecture, the position embedding is added to the word embedding so that the model can take word order into account.

3. Relation-aware embedding: the relation-aware network introduces an additional attention mechanism to identify and encode the entity relationships in the input vectors. For example, it can learn which words are table names in the database, which words are column names, and how they are related. For the database schema, the relation-aware network can handle foreign key relationships between tables as well as relationships between columns within a table.

4. Further embedding: in the relation-aware network, the initial vector representation and the position embedding are first fed into one or more attention layers. These layers compute attention scores to determine the importance of different parts of the sequence. The attention scores are computed dynamically from the relationships between entities and their roles in the query. For example, if a column name is associated with a specific condition, the attention mechanism gives that column name a higher weight. In this way, the network generates an enhanced vector representation that contains not only the original semantic information but also the semantic information of the relationships between entities.
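One common way to make attention relation-aware is to add a learned relation embedding to each key before the dot product. The text does not give the exact formula, so the sketch below should be read as an assumption: it computes attention weights for one query position, where `relation_embeddings[j]` encodes the relation between the query item and item j (for example "column belongs to table" or "no relation").

```python
import math


def relation_aware_weights(query, keys, relation_embeddings):
    """Attention weights with a relation embedding added to each key.

    score_j = query . (key_j + relation_j) / sqrt(d), followed by softmax.
    """
    d = len(query)
    scores = []
    for key, relation in zip(keys, relation_embeddings):
        biased_key = [k + r for k, r in zip(key, relation)]
        scores.append(sum(q * b for q, b in zip(query, biased_key)) / math.sqrt(d))
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


if __name__ == "__main__":
    query = [1.0, 0.0]                 # e.g. the question word "Beijing"
    keys = [[0.5, 0.5], [0.5, 0.5]]    # two schema items with equal content
    relations = [[1.0, 0.0],           # item 0 is related to the query word
                 [0.0, 0.0]]           # item 1 has no relation
    print(relation_aware_weights(query, keys, relations))
```

With identical keys, the item that carries a matching relation embedding receives the larger weight, which is the behavior described in step 4.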

The enhanced vector representation obtained in this embodiment is richer and provides more accurate context for the subsequent generation of the SQL statement. It can be used to guide the generation of more accurate SELECT, FROM, WHERE, and other clauses of the SQL statement. Applying the relation-aware network enables the NL2SQL system to understand the complex relationships between natural language queries and the database schema more deeply, and thus to generate SQL statements that are more accurate and closer to the user's intent. This is particularly effective for queries that involve multiple entities and complex logic.

Decoding the vector representation with an LSTM (long short-term memory network) to generate the second intermediate state statement between the simple statement and the database schema includes the following steps:

Decoding phase: the LSTM network uses its hidden state to generate the second intermediate state statement. The goal of this phase is to convert the encoded semantic information into the structure of an executable SQL statement. The output of the LSTM network passes through a fully connected layer (also called a linear layer) to generate each component of the intermediate state statement, such as the SELECT clause, the FROM clause, and the WHERE clause.

Attention mechanism: to improve decoding accuracy, an attention mechanism can be integrated into the LSTM network. The attention mechanism allows the model to focus on the most relevant part of the input sequence when generating each word. In this way, the model can better understand the correspondence between the natural language statement and the database schema and generate an accurate intermediate state statement.

Statement generation: the LSTM network generates the components of the second intermediate state statement one by one, with the output of each step used as the input of the next step, until the complete statement has been generated. The generated second intermediate state statement should reflect the intent of the natural language statement and match the database schema, laying the foundation for the final executable SQL statement.

The LSTM network can thus decode the vector representation of the natural language statement into a second intermediate state statement that matches the database schema. The second intermediate state statement is a structured representation that converts the semantic information of the natural language into a form closer to an SQL statement, which facilitates the subsequent SQL generation steps.
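The step-by-step generation, in which each output becomes the next input, can be sketched as a greedy decoding loop. The LSTM cell and the linear layer are deliberately replaced by a stand-in `step` function backed by a lookup table, so this shows only the control flow; `greedy_decode`, `toy_step`, and the special tokens are hypothetical.

```python
BOS, EOS = "<bos>", "<eos>"  # hypothetical start and end markers


def greedy_decode(step, state, max_len=20):
    """Generate tokens one at a time, feeding each output back as input.

    step(token, state) must return (scores, new_state), where scores maps
    candidate tokens to numbers. In a real model this would be one LSTM
    step followed by a linear layer and softmax.
    """
    token, output = BOS, []
    for _ in range(max_len):
        scores, state = step(token, state)
        token = max(scores, key=scores.get)  # pick the highest-scoring token
        if token == EOS:
            break
        output.append(token)
    return output


# Stand-in for a trained decoder: a fixed next-token table.
NEXT_TOKEN = {
    BOS: "SELECT",
    "SELECT": "name",
    "name": "FROM",
    "FROM": "employees",
    "employees": EOS,
}


def toy_step(token, state):
    """Deterministic toy step; the state only counts the steps taken."""
    return {NEXT_TOKEN[token]: 1.0}, state + 1


if __name__ == "__main__":
    print(" ".join(greedy_decode(toy_step, state=0)))
```

Beam search could replace the greedy choice without changing the surrounding loop structure; the text does not state which search strategy is used.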

Inferring and splicing the SQL statement from the second intermediate state statement to generate the final SQL statement includes the following steps:

1. Parse the second intermediate state statement: analyze the structure of the second intermediate state statement and identify the key components required by the query, such as the column names of the SELECT clause, the table names of the FROM clause, and the conditions of the WHERE clause. For statements that contain aggregate functions or grouping, these components must also be identified and their positions in the final SQL statement determined.

2. Build the framework of the SQL statement: based on the information in the second intermediate state statement, build the basic framework of the SQL statement, including the SELECT, FROM, and WHERE clauses. For queries that require aggregation and grouping, the GROUP BY and HAVING clauses must also be built.

3. Refine each part of the SQL statement: fill in each part of the SQL statement according to the detailed information in the intermediate state statement. For example, fill the specific column names into the SELECT clause and the table names into the FROM clause. For the WHERE clause, generate the specific filtering logic from the conditional expression.

4. Handle complex SQL components: for queries that contain complex components, such as subqueries, joins (JOIN), and nested queries, these components must be handled specially and spliced correctly into the SQL statement. This may also involve transforming the logic in the intermediate state statement to comply with SQL syntax.

5. Optimize the SQL statement: during splicing, the generated SQL statement may need to be optimized to improve query efficiency and accuracy, including simplifying subqueries, optimizing join operations, and adjusting conditional expressions. The optimization steps can be adapted to the specific characteristics of the database and the performance requirements of the query.

6. Generate the final SQL statement: after the above steps, a complete and syntactically correct SQL statement is produced. This SQL statement can be executed directly in the database management system to retrieve and manipulate data.
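Steps 2 and 3 above (building the clause framework and filling it in) can be sketched as a small assembly function. The dictionary keys, the function `assemble_sql`, and the `salary` column in the example are assumptions for illustration; subqueries, joins, and optimization are not covered.

```python
def assemble_sql(parts):
    """Splice parsed components into one SQL string in clause order."""
    if not parts.get("select") or not parts.get("from"):
        raise ValueError("SELECT columns and a FROM table are required")
    clauses = [
        f"SELECT {', '.join(parts['select'])}",
        f"FROM {parts['from']}",
    ]
    if parts.get("where"):
        clauses.append(f"WHERE {' AND '.join(parts['where'])}")
    if parts.get("group_by"):
        clauses.append(f"GROUP BY {', '.join(parts['group_by'])}")
    if parts.get("having"):
        clauses.append(f"HAVING {parts['having']}")
    if parts.get("order_by"):
        clauses.append(f"ORDER BY {', '.join(parts['order_by'])}")
    return " ".join(clauses) + ";"


if __name__ == "__main__":
    # Components for "list the average salary of each department";
    # the column name salary is assumed.
    parsed = {
        "select": ["department_id", "AVG(salary)"],
        "from": "employees",
        "group_by": ["department_id"],
    }
    print(assemble_sql(parsed))
```

Keeping the clause order fixed in one place makes it harder to produce statements such as HAVING before GROUP BY, which is one class of error the splicing step has to avoid.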

本发明通过将用户的自然语言查询转换为结构化的SQL语句，使用户无需深入了解SQL语法即可与数据库进行交互。The present invention converts the user's natural language query into a structured SQL statement, so that the user can interact with the database without having a deep understanding of SQL syntax.

如图6所示，本实施例提供一种自然语言语句生成SQL语句的装置，包括：As shown in FIG6 , this embodiment provides a device for generating SQL statements from natural language statements, including:

本发明通过建立基于数据增强的语言处理模型，对第一训练数据进行语法分析、信息去除以及数据采样，获得第二训练数据，将所述第一训练数据和第二训练数据进行合并生成训练数据并发送至任务训练模块进行训练，通过所述语言转换模型将自然语言语句进行分解，获得简单语句，将简单语句进行编码和解码，生成SQL语句，其中涉及前馈神经网络、关系感知网络、长短期记忆网络等算法，能够实现有效处理复杂的自然语言语句，提高对数据库复杂查询的处理能力。The present invention establishes a language processing model based on data enhancement, performs grammatical analysis, information removal and data sampling on first training data to obtain second training data, merges the first training data and the second training data to generate training data and sends it to a task training module for training, decomposes natural language sentences through the language conversion model to obtain simple sentences, encodes and decodes the simple sentences to generate SQL sentences, which involves algorithms such as feedforward neural networks, relationship-aware networks, and long short-term memory networks, and can effectively process complex natural language sentences and improve the processing ability of complex database queries.

本实施例还提供一种计算机存储介质，所述计算机存储介质存储有计算机指令，所述计算机指令被调用时，用于执行上述的方法。This embodiment further provides a computer storage medium, wherein the computer storage medium stores computer instructions, and when the computer instructions are called, they are used to execute the above method.

Although preferred embodiments of the present invention have been described, those skilled in the art may make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention. Obviously, those skilled in the art may make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these changes and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. A method for generating SQL statements from natural language statements, characterized by comprising:

Establishing a language processing model based on data enhancement, wherein the language processing model includes a data enhancement module and a task training module;

Acquiring first training data, and inputting the first training data into the data enhancement module for data syntax analysis, information removal and data sampling to obtain second training data;

Merging the first training data and the second training data to generate training data and sending the training data to the task training module for training until the language processing model converges;

Establishing a language conversion model, and embedding the language processing model into the language conversion model;

Receiving, by the language conversion model, a natural language statement to be converted, and decomposing the natural language statement to be converted through the language conversion model to obtain a simple statement;

Performing encoding and decoding based on the simple statement to generate an SQL statement.

2. The method according to claim 1, characterized in that the first training data includes multiple groups of sample data, and the sample data includes natural language statements, corresponding SQL statements and associated table schemas.

3. The method according to claim 2, characterized in that inputting the first training data into the data enhancement module for data syntax analysis, information removal and data sampling to obtain the second training data comprises:

Performing grammatical analysis on the SQL statements in the first training data to determine grammatical rules in the SQL statements;

Determining each component of the SQL statement according to the grammatical rules;

For each SQL statement, removing information according to its components to generate multiple sub-productions;

Analyzing the sub-productions to obtain the corresponding SQL template classification result;

Selecting a preset number of SQL templates with the largest number of occurrences, and obtaining the multiple different natural language statements corresponding to them;

Randomly sampling and synthesizing the SQL templates, the corresponding natural language statements and the table schemas to generate the second training data.
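The template selection and sampling steps of claim 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: templates and natural language statements are modeled as strings with `{table}` and `{col}` placeholders, and `top_templates`, `synthesize`, `SAMPLES` and `SCHEMAS` are hypothetical names and data that do not appear in the claims.

```python
import random
from collections import Counter, defaultdict


def top_templates(samples, k):
    """Return the k SQL templates that occur most often in (template, question) pairs."""
    counts = Counter(template for template, _ in samples)
    return [template for template, _ in counts.most_common(k)]


def synthesize(samples, schemas, k, n, seed=0):
    """Randomly combine frequent templates, their questions and table schemas."""
    rng = random.Random(seed)
    questions = defaultdict(list)
    for template, question in samples:
        questions[template].append(question)
    chosen = top_templates(samples, k)
    augmented = []
    for _ in range(n):
        template = rng.choice(chosen)
        question = rng.choice(questions[template])
        schema = rng.choice(schemas)
        # Assumption: placeholders are filled from the sampled schema entry.
        augmented.append({
            "sql": template.format(**schema),
            "question": question.format(**schema),
            "schema": schema,
        })
    return augmented


# Hypothetical first training data, reduced to (SQL template, question) pairs.
SAMPLES = [
    ("SELECT SUM({col}) FROM {table}", "What is the total {col} in {table}?"),
    ("SELECT SUM({col}) FROM {table}", "Add up {col} across {table}."),
    ("SELECT SUM({col}) FROM {table}", "Give me the sum of {col} for {table}."),
    ("SELECT COUNT(*) FROM {table}", "How many rows are in {table}?"),
    ("SELECT COUNT(*) FROM {table}", "Count the records of {table}."),
    ("SELECT MAX({col}) FROM {table}", "What is the largest {col} in {table}?"),
]
SCHEMAS = [
    {"table": "orders", "col": "amount"},
    {"table": "stock", "col": "quantity"},
]

second_training_data = synthesize(SAMPLES, SCHEMAS, k=2, n=5)
for row in second_training_data:
    print(row["question"], "->", row["sql"])
```

The synthesized rows would then be merged with the first training data before training.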

4. The method according to claim 3, characterized in that the grammatical rules of the SQL statement are described by productions; the components of the SQL statement include non-terminal symbols and expressions, and an expression is a sequence composed of non-terminal symbols and terminal symbols;

For each SQL statement, removing information according to its components to generate multiple sub-productions comprises:

Removing, in a single operation, one non-terminal symbol from the expression of the SQL statement to obtain one sub-production;

Removing the non-terminal symbols in the expression of the SQL statement in sequence to obtain multiple sub-productions.
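The removal operation of claim 4 can be illustrated with a short Python sketch. It assumes one reading of "in sequence": each non-terminal symbol is removed in turn from the original expression, so an expression with three non-terminal symbols yields three sub-productions. The symbol names, the `NON_TERMINALS` set and the `sub_productions` helper are hypothetical.

```python
# Hypothetical non-terminal symbols of a simplified SQL grammar.
NON_TERMINALS = {"Select", "Filter", "Order"}


def sub_productions(head, expression):
    """Remove one non-terminal symbol per operation; each removal gives one sub-production."""
    results = []
    for index, symbol in enumerate(expression):
        if symbol in NON_TERMINALS:
            # Terminal symbols such as "from" are never removed.
            results.append((head, expression[:index] + expression[index + 1:]))
    return results


# Production "Query -> Select from Filter Order", with "from" as a terminal symbol.
EXPRESSION = ["Select", "from", "Filter", "Order"]
subs = sub_productions("Query", EXPRESSION)
for head, body in subs:
    print(head, "->", " ".join(body))
```

Sub-productions with the same remaining structure can then be grouped together, which is one plausible basis for the SQL template classification in claim 3.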

5. The method according to claim 2, characterized in that training the task training module according to the training data comprises:

Establishing a first intermediate state statement between the natural language statement and the SQL statement;

Iteratively performing the following operations until the training stop condition is met:

Randomly masking the grammatical sequence of the first intermediate state statement based on a masking mechanism;

Predicting and filling the randomly masked part of the first intermediate state statement according to the corresponding natural language statement and SQL statement in the training data;

Calculating a loss function value based on the prediction result, and obtaining the gradient value generated by gradient descent during the calculation;

Updating the parameters of the language processing model according to the gradient value.
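The random masking step of claim 5 can be sketched as below. Only the masking itself is shown; the prediction, loss calculation and parameter update depend on the model and are omitted. The `[MASK]` token, the masking ratio, the `mask_sequence` helper and the example token sequence are assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # assumed mask token; the claims do not name one


def mask_sequence(tokens, ratio, rng):
    """Mask a random subset of a non-empty token sequence.

    Returns the masked copy and a {position: original token} map that a model
    would be trained to predict.
    """
    count = max(1, round(len(tokens) * ratio))
    positions = sorted(rng.sample(range(len(tokens)), count))
    masked = list(tokens)
    targets = {}
    for position in positions:
        targets[position] = masked[position]
        masked[position] = MASK
    return masked, targets


# Hypothetical grammatical sequence of a first intermediate state statement.
TOKENS = ["Query", "Select", "Agg(sum)", "Column(amount)", "Table(orders)", "Filter"]
masked, targets = mask_sequence(TOKENS, ratio=0.3, rng=random.Random(0))
print(masked)
print(targets)
```

In each training iteration a fresh mask would be drawn, so the model sees different hidden positions of the same intermediate state statement over time.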

6. The method according to claim 5, characterized in that decomposing the natural language statement to be converted through the language conversion model to obtain a simple statement comprises:

Encoding, by the language conversion model using the embedded language processing model, the natural language statement to be converted into a vector representation, and inputting the vector representation into a preset feedforward neural network to predict whether the natural language statement to be converted contains multiple layers of complex semantics;

If the natural language statement to be converted contains multiple layers of complex semantics, establishing a semantic tree according to the natural language statement to be converted;

Performing a depth-first traversal of the semantic tree to obtain simple statements.
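The depth-first traversal of claim 6 can be sketched as follows. The claim does not state the visiting order, so post-order is assumed here, which emits sub-questions before the question that depends on them. The `Node` class, the `simple_statements` helper and the example tree are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a semantic tree; each node holds one simple statement."""
    text: str
    children: list = field(default_factory=list)


def simple_statements(root):
    """Depth-first, post-order traversal that collects the simple statements."""
    collected = []

    def visit(node):
        for child in node.children:
            visit(child)
        collected.append(node.text)

    visit(root)
    return collected


# Hypothetical semantic tree for a nested question.
tree = Node(
    "Which suppliers have a total order amount above the average?",
    [
        Node("What is the total order amount of each supplier?"),
        Node("What is the average total order amount?"),
    ],
)
statements = simple_statements(tree)
for statement in statements:
    print(statement)
```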

7. The method according to claim 5, characterized in that encoding and decoding based on the simple statement to generate an SQL statement comprises:

Parsing the simple statement and converting it into a statement graph representation;

Analyzing the relationships between database tables and columns to obtain a graph representation of the database schema;

Associating and matching the statement graph representation with the graph representation of the database schema to obtain an association graph;

Encoding the association graph using the language conversion model and a pre-established relationship-aware network to obtain a vector representation;

Decoding the vector representation using a long short-term memory network to generate a second intermediate state statement between the simple statement and the database schema;

Inferring and concatenating SQL statements based on the second intermediate state statement to generate a final SQL statement.
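The association and matching step of claim 7 can be illustrated with a small Python sketch that builds a schema graph and links statement words to it. Exact, case-insensitive name matching is a simplifying assumption, not the matching rule of the claims; the edge labels and the `schema_graph` and `link` helpers are hypothetical.

```python
def schema_graph(schema):
    """Edges linking each table to its columns; schema maps table name -> column names."""
    return [
        (table, "has_column", f"{table}.{column}")
        for table, columns in schema.items()
        for column in columns
    ]


def link(tokens, schema):
    """Edges linking statement words to schema items with the same name."""
    edges = []
    for index, token in enumerate(tokens):
        word = token.lower()
        for table, columns in schema.items():
            if word == table.lower():
                edges.append((f"q{index}:{token}", "matches_table", table))
            for column in columns:
                if word == column.lower():
                    edges.append((f"q{index}:{token}", "matches_column", f"{table}.{column}"))
    return edges


# Hypothetical schema and simple statement.
SCHEMA = {"orders": ["id", "supplier", "amount"]}
TOKENS = ["total", "amount", "of", "orders"]

association_graph = schema_graph(SCHEMA) + link(TOKENS, SCHEMA)
for edge in association_graph:
    print(edge)
```

The resulting edge list is the kind of structure that the subsequent encoding step could consume as relation information between statement words and schema items.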

8. The method according to claim 7, characterized in that encoding the association graph using the language conversion model and a pre-established relationship-aware network to obtain a vector representation comprises:

Constructing an input vector according to the association graph;

Encoding the input vector using the language conversion model to obtain an initial vector representation;

Performing position embedding for each word in the simple statement;

Further embedding the initial vector representation and the position embeddings using the relationship-aware network to obtain the vector representation.

9. A device for generating SQL statements from natural language statements, comprising:

A first model building module, used to build a language processing model based on data enhancement, wherein the language processing model includes a data enhancement module and a task training module;

A data processing module, used for acquiring first training data, inputting the first training data into the data enhancement module for data syntax analysis, information removal and data sampling, and obtaining second training data;

A training module, used for merging the first training data and the second training data to generate training data and sending the training data to the task training module for training until the language processing model converges;

A second model building module, used for building a language conversion model and embedding the language processing model into the language conversion model;

A decomposition module, used for receiving a natural language statement to be converted through the language conversion model, and decomposing the natural language statement to be converted through the language conversion model to obtain a simple statement;

A statement generation module, used for performing encoding and decoding based on the simple statement to generate an SQL statement.

10. A computer storage medium, characterized in that the computer storage medium stores computer instructions which, when called, are used to execute the method according to any one of claims 1-8.