CN118981475B

CN118981475B - SQL statement generation method and device based on large model

Info

Publication number: CN118981475B
Application number: CN202411464018.7A
Authority: CN
Inventors: 万建伟; 黄嘉俊; 邹秉吾; 倪哲鸣; 钮博彦
Original assignee: Pingkai Star Beijing Technology Co ltd
Current assignee: Pingkai Star Beijing Technology Co ltd
Filing date: 2024-10-18
Publication date: 2025-02-18
Anticipated expiration: 2044-10-18

Abstract

The embodiments of the application provide a large model-based SQL statement generation method and device, relating to the field of distributed technology, for example, distributed databases. The method comprises: obtaining an initial query statement of a user and initial database schema information of a target database; and inputting the initial query statement of the user and the initial database schema information of the target database into a large language model to obtain a structured SQL query statement. The embodiments of the application can improve the accuracy of SQL generation based on a large language model.

Description

SQL statement generation method and device based on large model

Technical Field

The application relates to the field of distributed technology, and in particular to a large model-based SQL statement generation method and device.

Background

Text-to-SQL (abbreviated as T2S or Text2SQL) is a technology for converting a natural language query into a Structured Query Language (SQL) query. Its main application scenario is distributed database querying, where a user can query in natural language without mastering SQL.

Currently, Text2SQL relies primarily on methods based on templates, rules, and traditional machine learning. However, these methods have limitations in processing complex and diverse natural language queries, resulting in low accuracy when converting natural language queries into SQL queries.

Disclosure of Invention

The embodiments of the application provide a large model-based SQL statement generation method, which aims to solve the problem in the prior art of low accuracy when converting natural language queries into SQL queries.

Correspondingly, the embodiments of the application also provide a large model-based SQL statement generation device, an electronic device and a storage medium, which are used to ensure the implementation and application of the method.

To solve the above problem, the embodiments of the application disclose a large model-based SQL statement generation method, which comprises the following steps:

acquiring an initial query statement of a user and initial database schema information of a target database;

inputting the initial query statement of the user and the initial database schema information of the target database into a large language model to obtain a structured SQL query statement;

wherein the large language model is configured to perform the following operations:

performing semantic analysis on the initial database schema information of the target database to obtain a semantic analysis result of the target database;

performing semantic analysis on the initial query statement of the user according to the semantic analysis result of the target database to obtain user query semantics;

screening the initial database schema information according to the user query semantics and the semantic analysis result of the target database to obtain database key entity information corresponding to the user query;

and generating the SQL query statement according to the database key entity information and the semantic analysis result of the target database corresponding to the database key entity information.

The embodiments of the application also disclose a large model-based SQL statement generation device, which comprises:

an acquisition module, used for acquiring an initial query statement of a user and initial database schema information of a target database;

a processing module, used for inputting the initial query statement of the user and the initial database schema information of the target database into a large language model to obtain an SQL query statement.

The embodiments of the application also disclose an electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor, when executing the program, implements the method described in one or more embodiments of the application.

Embodiments of the present application also disclose a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs a method as described in one or more of the embodiments of the present application.

Embodiments of the application also disclose a computer program product comprising a computer program which, when executed by a processor, implements a method as described in one or more of the embodiments of the application.

The technical scheme provided by the embodiments of the application has the following beneficial effects:

In the embodiments of the application, the large model-based SQL statement generation method parses natural language queries and generates SQL by combining the user's initial query statement with the database schema information of the target database. The large language model can both understand the semantic structure of the target database and analyze the semantics of the user query, and can therefore extract the database key entity information related to the query. Compared with existing Text2SQL methods, the embodiments of the application improve semantic understanding of the database schema information, enhance the ability to generalize across diverse query scenarios, and improve the final accuracy of SQL generation.

Additional aspects and advantages of embodiments of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.

Drawings

The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flowchart of a large model-based SQL statement generation method according to an embodiment of the application;

FIG. 2 is a schematic diagram of a first example provided by an embodiment of the present application;

FIG. 3 is a flow chart of a second example provided by an embodiment of the present application;

FIG. 4 is a flow chart of a third example provided by an embodiment of the present application;

FIG. 5 is a flow chart of a fourth example provided by an embodiment of the present application;

FIG. 6 is a flowchart of a fifth example provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a large model-based SQL statement generation device according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and do not limit those technical solutions.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "plurality" refers to two or more, so "plurality" may also be understood as "at least two" in embodiments of the present application. The term "and/or" describes an association relationship between associated objects and covers three cases; for example, "A and/or B" may mean that A exists alone, that A and B exist together, or that B exists alone. The character "/", unless otherwise specified, generally indicates an "or" relationship between the associated objects.

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.

Currently, Text2SQL techniques rely primarily on methods based on templates, rules, and traditional machine learning. However, these approaches have limitations in processing complex and diverse natural language queries. Template-based methods have difficulty covering all possible forms of language expression, which results in insufficient generalization capability. Rule-based methods require a large number of manually written rules, which involves a heavy workload and is difficult to maintain. Traditional machine learning methods perform poorly at understanding complex semantics and long-distance dependencies, and have difficulty accurately analyzing the intent of natural language.

Therefore, when existing Text2SQL methods face complex and changeable query requirements, the accuracy of SQL generation is low and the results are unsatisfactory.

The application provides a large model-based SQL statement generation method and device, and aims to solve the above technical problems in the prior art.

The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

The scheme provided by the embodiments of the application can be executed by any electronic device, such as a terminal device or a server. The server can be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the terminal and the server may include a database.

The large model-based SQL statement generation method and device provided by the application aim to solve at least one of the technical problems in the prior art described above.

The embodiments of the application provide a possible implementation. FIG. 1 shows a flowchart of a large model-based SQL statement generation method. The method can be executed by any electronic device; optionally, it can be applied to a server or a terminal device. For convenience of description, the method is described below with a server as the executing body, where the executing body may be a processing module in a database installed on the server.

The method and the device can be applied to the technical field of databases. The method comprises: obtaining an initial query statement of a user and initial database schema information of a target database; and inputting the initial query statement of the user and the initial database schema information of the target database into a large language model to obtain a structured SQL query statement. The large language model is used for: performing semantic analysis on the initial database schema information of the target database to obtain a semantic analysis result of the target database; performing semantic analysis on the initial query statement of the user according to the semantic analysis result of the target database to obtain user query semantics; screening the initial database schema information according to the user query semantics and the semantic analysis result of the target database to obtain database key entity information corresponding to the user query; and generating the SQL query statement according to the database key entity information. In this way, the user's natural language query can be parsed and the SQL query statement generated automatically. For example, an SQL query statement that closely matches the user's needs may be generated through the large language model's semantic understanding of the database schema information and the user query statement. By combining the semantic analysis of the target database with the user query semantics, the database key entity information related to the query can be screened out, which supports the accuracy and efficiency of SQL query statement generation, improves the accuracy and generalization capability of query analysis, reduces manual intervention, and improves the automation and user experience of the system.

Specifically, as shown in FIG. 1, the large model-based SQL statement generation method according to the embodiments of the application may include the following steps:

S101, acquiring an initial query statement of a user and initial database schema information of a target database.

In the technical field of databases, database schema information includes definitions of the various objects in a database and their relationships. In the embodiments of the application, the initial database schema information is the structural description of the target database and provides the basic organization information of the database. For example, the initial database schema information typically includes a database name (the name of the target database), table names (the names of the tables in the database), column names (the name of each column in a table), column types (the data type of each column in a table, e.g., integer, string, date, etc.), and column sample data (some example values in a column that help the model understand the meaning of the data). This information is an important basis for understanding the database structure and generating the correct SQL query.
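As an illustration only, the schema fields listed above can be held in a small data structure and serialized to text before being passed to the model. This is a minimal sketch, not the applicant's implementation; the class names `ColumnInfo`, `TableInfo`, `SchemaInfo`, the function `render_schema`, and the example `shop` database are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnInfo:
    name: str                                     # column name
    col_type: str                                 # column type, e.g. "DATE"
    samples: list = field(default_factory=list)   # column sample data


@dataclass
class TableInfo:
    name: str       # table name
    columns: list   # list of ColumnInfo


@dataclass
class SchemaInfo:
    database: str   # database name
    tables: list    # list of TableInfo


def render_schema(schema: SchemaInfo) -> str:
    """Serialize the schema into plain text suitable for a model prompt."""
    lines = [f"Database: {schema.database}"]
    for table in schema.tables:
        lines.append(f"Table: {table.name}")
        for col in table.columns:
            # Show at most three sample values per column.
            sample = ", ".join(str(s) for s in col.samples[:3])
            lines.append(f"  - {col.name} ({col.col_type}) e.g. {sample}")
    return "\n".join(lines)


# Hypothetical example data.
schema = SchemaInfo(
    database="shop",
    tables=[TableInfo("sales", [
        ColumnInfo("sale_date", "DATE", ["2023-01-05", "2023-02-11"]),
        ColumnInfo("amount", "REAL", [19.9, 250.0]),
    ])],
)
schema_text = render_schema(schema)
print(schema_text)
```

Including a few sample values per column is what lets a model distinguish, for instance, an order date from a shipping date when the column name alone is ambiguous.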

The initial query statement of the user refers to the query requirement that the user enters in natural language. Such a query statement may express the user's need to retrieve, filter, or calculate certain data in the database. In an actual business scenario, because individuals differ in their habits and ways of expression, the same requirement may be phrased in many different ways. For example, users may express the same query intent using different words or word orders: "find sales data in 2023" and "what sales were in 2023" differ in form but express the same query requirement.

In the embodiments of the application, by acquiring the initial query statement of the user and the initial database schema information of the target database, the system obtains the user's query requirement and the related database structure information, which provides the basis for semantic analysis and SQL generation in the subsequent steps.

S102, inputting the initial query statement of the user and the initial database schema information of the target database into a large language model to obtain a structured SQL query statement.

Because existing Text2SQL technology has limitations in processing complex and diverse natural language queries, the embodiments of the application change the approach to Text2SQL to a method based on a large language model (Large Language Model, LLM) and the Transformer deep learning architecture, and improve the performance and accuracy of Text2SQL by exploiting the strength of the large language model in semantic analysis. A large language model (hereinafter referred to as a large model) is a natural language processing model based on deep learning that can understand and generate complex text. After training on a large amount of data, the model has strong language understanding and semantic analysis capabilities and can process complex natural language queries. In the embodiments of the application, the task of the large model is to combine the user's natural language query with the structure of the target database to generate a corresponding SQL query statement. For example, the user's initial query statement and the initial database schema information of the target database are input into the large model to generate a structured SQL query statement.
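One way to picture step S102 is as the assembly of a single prompt that carries both inputs together with the four operations the model is configured to perform. The sketch below is an assumption made for illustration: the function name `build_prompt` and the prompt wording are hypothetical, and the source does not specify how the model is actually instructed.

```python
def build_prompt(question: str, schema_text: str) -> str:
    """Combine the user query and schema text into one model prompt."""
    # The four operations attributed to the large model in the method.
    steps = [
        "Analyze the semantics of the database schema.",
        "Using that result, analyze the semantics of the user query.",
        "Select the key tables and columns the query needs.",
        "Write one SQL statement using only those tables and columns.",
    ]
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, start=1))
    return (
        "You translate natural-language questions into SQL.\n"
        f"Schema:\n{schema_text}\n\n"
        f"Question: {question}\n\n"
        f"Work through these steps:\n{numbered}\n"
    )


# Hypothetical inputs.
prompt = build_prompt(
    "find 2023 sales total",
    "Table: sales\n  - sale_date (DATE)\n  - amount (REAL)",
)
print(prompt)
```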

The key to Text2SQL is mapping construction, that is, identifying from the given database schema information the entities required to convert a user query into an SQL query, such as table names, column names and conditions. However, in actual business scenarios the applicant has found that large model-based Text2SQL schemes generally suffer from the shortcomings described in cases one to three below, which make them difficult to deploy or cause them to perform poorly.

Case one: the semantic information in the database schema is weak.

In an actual business scenario, the acquired database schema information may contain only table names, column names, column types, and the like. This basic information often cannot directly reflect the specific meaning of tables and columns, the association relationships between tables, or the main functions and uses of the database. The lack of semantic information greatly restricts the construction of the Text2SQL mapping and the accuracy of the finally generated SQL query statement. If the specific meaning of the tables and columns in the database is not known, the generated SQL query may deviate from the actual query requirement. For example, if the user's initial query statement is "list all customers who placed an order in the past week" and it is not clear whether the "date" column in the database indicates the order date or the shipping date, an incorrect SQL query statement may be generated. Likewise, if the relationships between tables are not clear, the generated SQL query statement may not join the tables correctly, and the correct query result may not be obtained. For example, if the user's initial query statement is "list all customers who purchased commodity A" and it is unclear how the "order table" and the "customer table" in the database are linked, the correct SQL query statement cannot be generated.

In some embodiments, to solve the technical problem described in case one, the large model is used to perform semantic analysis on the initial database schema information of the target database to obtain a semantic analysis result of the target database. Specifically, the large model first performs semantic analysis on the input initial database schema information of the target database. Here, semantic analysis means that the large model understands the meaning behind the database structure, such as the relationships between tables, the meaning of each table, and the meaning of each column. The result of the semantic analysis helps the large model understand the various entities in the database. For example, when analyzing the database schema information, the large model may derive that the "order table" contains information such as "order ID", "customer ID" and "order amount", and understand the specific meaning and use of these fields.

Case two: the semantics of the user's initial query statement are uncertain and diverse.

In an actual business scenario, because different users have different habits and ways of expression, similar query requirements may be phrased in many different ways. In addition, users may not be clear about their actual needs, or may be unable to express them accurately, so that the query statement entered is vague or ambiguous, or even completely unrelated to the contents of the current database. In this case, if an SQL query is blindly generated by a large model, the SQL query may appear reasonable or plausible but actually be erroneous or unfounded, a phenomenon known as "hallucination". All of these factors present additional challenges for Text2SQL. If the user's query intent cannot be clarified, accurate mapping construction cannot be completed, which affects the accuracy and efficiency of Text2SQL.

In some embodiments, to solve the technical problem described in case two, the large model and the semantic analysis result of the database are used to perform semantic analysis on the initial query statement of the user to obtain the user query semantics. Specifically, the semantic analysis result of the database is used to clarify the question posed in the user's initial query statement and to refine the understanding of the user's query intent, yielding user query semantics that clearly express the query result the user expects.
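The idea of checking a query against the database's semantics before generating SQL can be sketched with a deliberately simple stand-in. The example below is a keyword-overlap heuristic, not the large-model analysis described in the source; the function `clarify_query`, the `known_terms` vocabulary and the status strings are all hypothetical. It only illustrates the control flow: a query unrelated to the database is sent back for clarification rather than turned into a hallucinated SQL statement.

```python
def clarify_query(question: str, known_terms: set) -> dict:
    """Flag queries that share no vocabulary with the database semantics."""
    # Normalize words: strip trailing punctuation and lowercase.
    words = {w.strip("?.,").lower() for w in question.split()}
    matched = sorted(words & known_terms)
    if not matched:
        # Nothing in the query relates to the database: ask the user
        # to clarify instead of generating SQL.
        return {"status": "needs_clarification", "matched": []}
    return {"status": "ok", "matched": matched}


# Hypothetical vocabulary derived from the database's semantic analysis.
known_terms = {"sales", "order", "customer"}

relevant = clarify_query("What were sales in 2023?", known_terms)
unrelated = clarify_query("Tell me a joke", known_terms)
print(relevant, unrelated)
```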

Case three: a database with a large data volume cannot be processed.

In an actual business scenario, the database schema information may be very large (e.g., one database contains many tables, or some tables contain many columns), and existing large models usually have a context window limitation. This means that if the database schema information exceeds the context window of the large model, the large model cannot process it in a single pass, and therefore cannot identify, from the large volume of database schema information, the database key entity information required to convert the user query into an SQL query. This not only affects the final accuracy of Text2SQL but also limits the scope of application of the Text2SQL solution. How to process very large databases while maintaining model performance is therefore also an important challenge for Text2SQL.

In some embodiments, to solve the technical problem described in case three, the application uses the large model to identify, from the initial database schema information and according to the user query semantics and the semantic analysis result of the target database, the database key entity information (including the relevant tables, columns, etc.) required to convert the user query into an SQL query, which makes the scheme suitable for very large databases. For example, for the user query "find 2023 sales total", the database key entity information may include the "sales date" and "sales amount" columns in the "sales table", which are the key data required for the query.
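Once the key entity information is known, the schema passed on to SQL generation can be reduced to just those tables and columns, which is what keeps a large schema within the context window. The following is a minimal sketch under stated assumptions: the dictionary layout, the function name `prune_schema`, and the example tables are hypothetical, and the key entities are supplied by hand here rather than by a model.

```python
def prune_schema(schema: dict, key_entities: dict) -> dict:
    """Keep only the tables and columns named in key_entities."""
    pruned = {}
    for table, columns in key_entities.items():
        if table not in schema:
            continue  # ignore entities that do not exist in the database
        kept = {c: t for c, t in schema[table].items() if c in columns}
        if kept:
            pruned[table] = kept
    return pruned


# Hypothetical full schema: table -> {column: type}.
schema = {
    "sales": {"sale_date": "DATE", "amount": "REAL", "store_id": "INTEGER"},
    "customers": {"id": "INTEGER", "name": "TEXT"},
}
# Hypothetical screening result; "ghost" stands for a table that was
# named by mistake and must not reach SQL generation.
key_entities = {"sales": ["sale_date", "amount"], "ghost": ["x"]}

pruned = prune_schema(schema, key_entities)
print(pruned)
```

Dropping names that are absent from the real schema is a cheap guard against the hallucinated tables discussed in case two.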

In some embodiments, after obtaining the database key entity information corresponding to the user query, the large model generates a corresponding SQL query statement. The SQL query statement is a structured expression of the user's natural language query and can be executed directly by the database. For example, the user query "find 2023 sales total" would be converted to the SQL statement SELECT SUM(sales_amount) FROM sales_table WHERE sales_date BETWEEN '2023-01-01' AND '2023-12-31'.
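To show that a statement of this shape is directly executable, the sketch below runs it against an in-memory SQLite database using Python's standard `sqlite3` module. The identifiers `sales_table`, `sales_date` and `sales_amount` are an assumed rendering of the translated names in the example, and the rows are invented test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_table (sales_date TEXT, sales_amount REAL)")
conn.executemany(
    "INSERT INTO sales_table VALUES (?, ?)",
    [
        ("2022-12-31", 100.0),  # outside 2023, excluded
        ("2023-01-01", 40.0),   # first day of 2023, included
        ("2023-06-15", 60.0),   # included
        ("2024-01-01", 5.0),    # outside 2023, excluded
    ],
)

# The statement a model would be expected to produce for
# "find 2023 sales total".
generated_sql = (
    "SELECT SUM(sales_amount) FROM sales_table "
    "WHERE sales_date BETWEEN '2023-01-01' AND '2023-12-31'"
)
total = conn.execute(generated_sql).fetchone()[0]
conn.close()
print(total)
```

Because the dates are stored as ISO 8601 strings, the `BETWEEN` comparison works lexicographically; a schema with a different date representation would need a different condition.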

In some embodiments, the solution of step S102 is applicable to a variety of scenarios, especially in cases where the user is unfamiliar with the SQL language but requires a query to a complex database. For example, in a business analysis system, a user may wish to enter a query using natural language, and the system needs to retrieve data in a database by automatically generating SQL. The method is also suitable for a data analysis platform, an intelligent customer service system and the like.

As an example, in connection with fig. 2, the flow shown in fig. 2 illustrates the basic steps and working principles of a large model-based SQL statement generation method.

S201, connecting to the target database.

In the embodiments of the application, the user can acquire the database schema information by connecting to the target database. The database schema information contains structural information such as the tables, columns and field types of the database, which is the basis for the subsequent generation of SQL queries.

S202, schema query and modeling of the database.

The embodiments of the application query the database schema information of the target database and model the database according to the query result. This process involves semantic understanding and analysis of the database structure to provide input information for the large model.

In some embodiments, the results of the schema query are stored in a storage service for use in subsequent steps.

S203, user query input.

In the embodiments of the application, the user states the query requirement in natural language and enters it into the system, for example "find 2023 sales data".

S204, question clarification.

In the embodiments of the application, after receiving the user's natural language query, the system enters a question clarification stage. At this stage, the large language model parses the user input, understands the query intent, and performs semantic analysis.

S205, mapping construction.

In the embodiments of the application, on the basis of understanding the user query semantics, the large model matches and maps the user's query intent to the schema of the database. At this point, the model needs to determine the database key entity information involved in the query, such as table names and column names.

In some embodiments, the mapping results may also be stored in a storage service in the system for subsequent SQL generation processes.

S206, SQL generation and optimization.

In the embodiments of the application, based on the result of mapping construction, the large model generates a corresponding SQL query statement. In the generation process, the large model combines the database schema information and the user query semantics to generate an optimized SQL query statement so that the query executes efficiently.

In some embodiments, in the process of generating the SQL query statement, the system can also perform optimization processing on the SQL query statement, so that the query efficiency is further improved.
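The flow S201 to S206 can be sketched end to end as follows. Only the schema query uses real functionality (the SQLite `sqlite_master` table and `PRAGMA table_info`); the functions `clarify`, `build_mapping` and `generate_sql` are hypothetical placeholders that stand in for calls to the large model, and their trivial logic is an assumption made so the example runs on its own.

```python
import sqlite3


def query_schema(conn):
    """S201/S202: read table names, column names and types from SQLite."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        schema[table] = {col[1]: col[2] for col in cols}
    return schema


def clarify(question, schema):
    """S204 placeholder: a real system would ask the large model."""
    return question.strip().lower()


def build_mapping(semantics, schema):
    """S205 placeholder: keep tables whose name appears in the query."""
    return {t: list(cols) for t, cols in schema.items() if t in semantics}


def generate_sql(semantics, mapping):
    """S206 placeholder: emit a plain SELECT over the first mapped table."""
    table, cols = next(iter(mapping.items()))
    return f"SELECT {', '.join(cols)} FROM {table}"


# S201: connect to a (hypothetical, in-memory) target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")

schema = query_schema(conn)                      # S202
question = "Find 2023 sales data"                # S203
semantics = clarify(question, schema)            # S204
mapping = build_mapping(semantics, schema)       # S205
sql = generate_sql(semantics, mapping)           # S206
rows = conn.execute(sql).fetchall()              # the SQL is executable
print(sql, rows)
```

In a real deployment the schema query result and the mapping would be written to the storage service mentioned above, so that later requests do not repeat the schema analysis.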

In one embodiment of the present application, performing semantic analysis on the initial database schema information of the target database to obtain a semantic analysis result of the target database includes:

acquiring the column meanings and the table meaning of each table in the target database according to the initial database schema information of the target database;

determining the reference relationships between each table and the other tables according to the table meaning of each table and the column meaning of each column in each table;

determining the core tables in the target database according to the number of reference relationships between each table and the other tables, and clustering the core tables to obtain database core entity information;

and determining the database characteristic information of the target database according to the database core entity information.

In an embodiment of the present application, the initial database schema information includes, but is not limited to, the database name (the unique identifier of the database), table names (the name of each table in the database, used for dividing different business modules), column names (the name of each field in a table), column types (the type of the field data, such as integer, character string, date, etc.), and column sample data (some example data for each column, used for further inferring the actual business meaning of the column). This information provides the data-structure basis for subsequent semantic analysis and for identifying relationships between tables.

In the embodiments of the application, the column meanings and the table meaning of each table in the target database are obtained from the initial database schema information of the target database. Specifically, to infer the business meaning of a column, the role of the column in the actual business can be inferred from the column name, column type, and column sample data. For example, the "order_id" column may be inferred to mean "order number" and the "created_at" column may be inferred to mean "creation time".

The business meaning of the whole table is then inferred from the business meaning of each column. For example, a table containing fields such as "order_id", "order_date" and "customer_id" may be inferred to be an "order information table".

The column meanings are then refined in the reverse direction using the business meaning of the table. For example, if a table is inferred to be an "order information table", the business meaning of any column in the table that appears unrelated to orders may need to be readjusted or discarded.

In the embodiments of the application, the business meanings of the columns and tables are refined through multiple iterations, so that the business meaning of each field and each table is clear and accurate and closely matches the actual business scenario, which makes the subsequent inference of reference relationships more reliable.
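The iteration described above (columns inform the table meaning, and the table meaning in turn refines the columns) can be sketched as a loop that stops when nothing changes. This is an illustrative sketch only: `refine_meanings` and the two stub inference functions are hypothetical, and in the described method the inference itself would be performed by the large model rather than by hard-coded rules.

```python
def refine_meanings(columns, infer_column, infer_table, max_rounds=3):
    """Alternate column and table inference until the result is stable."""
    col_meanings, table_meaning = {}, None
    for _ in range(max_rounds):
        # Infer each column, using the current table meaning as context.
        new_cols = {c: infer_column(c, table_meaning) for c in columns}
        # Infer the table meaning from the refreshed column meanings.
        new_table = infer_table(new_cols)
        if new_cols == col_meanings and new_table == table_meaning:
            break  # fixed point reached, stop iterating
        col_meanings, table_meaning = new_cols, new_table
    return col_meanings, table_meaning


def infer_column(name, table_meaning):
    """Stub standing in for a model call (hypothetical rules)."""
    if name == "order_id":
        return "order number"
    if name == "created_at":
        # Table context sharpens an otherwise generic meaning.
        if table_meaning == "order information table":
            return "order creation time"
        return "creation time"
    return name.replace("_", " ")


def infer_table(col_meanings):
    """Stub standing in for a model call (hypothetical rule)."""
    if "order number" in col_meanings.values():
        return "order information table"
    return "unknown table"


col_meanings, table_meaning = refine_meanings(
    ["order_id", "created_at", "customer_id"], infer_column, infer_table
)
print(table_meaning, col_meanings)
```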

In the embodiments of the application, after the table meanings have been inferred, the system traverses each table in the database and infers the possible reference relationships between each table and the other tables according to the table meaning of each table and the column meaning of each column in each table. Specifically, the system constructs two agents, namely a first agent (agent-a) and a second agent (agent-b). Agent-a is responsible for identifying and extracting the reference relationships between tables, and agent-b is responsible for evaluating the accuracy of these reference relationships and providing feedback. For example, there is typically a foreign-key reference between the order table and the customer table, indicating the association between an order and a customer.

In the embodiments of the application, agent-b provides feedback on the extraction result of agent-a. If the extraction result contains errors or omissions, agent-a extracts the reference relationships again according to the feedback, until the set number of iterations or a triggering end condition is reached. Finally, the system filters out any duplicate reference relationships so that each reference relationship is unique and accurate. Through this step, the embodiments of the application not only identify the reference relationships between tables but also lay the foundation for the subsequent extraction of the core tables and the database core entity information.
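The extract-and-review loop between agent-a and agent-b can be sketched as below. The control flow (retry on feedback, stop at an iteration limit or when the reviewer has no objections, then remove duplicates) follows the description; the two agent functions are hypothetical rule-based stubs standing in for model-backed agents, and the naming convention they rely on (`<name>_id` pointing at table `<name>s`) is an assumption made for the example.

```python
def extract_references(schema, extractor, reviewer, max_rounds=3):
    """Run extractor and reviewer in a loop, then drop duplicate results."""
    feedback, refs = [], []
    for _ in range(max_rounds):
        refs = extractor(schema, feedback)
        objections = reviewer(schema, refs)
        if not objections:
            break  # end condition: the reviewer accepts the extraction
        feedback.extend(objections)
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(refs))


def agent_a(schema, feedback):
    """Stub extractor: '<x>_id' columns reference table '<x>s'."""
    rejected = set(feedback)
    refs = []
    for table, cols in schema.items():
        for col in cols:
            if not col.endswith("_id"):
                continue
            target = col[:-3] + "s"
            ref = (table, col, target, col)
            if target in schema and ref not in rejected:
                refs.append(ref)
    return refs


def agent_b(schema, refs):
    """Stub reviewer: reject a table 'referencing' its own key."""
    return [r for r in refs if r[0] == r[2]]


schema = {
    "customers": ["customer_id", "name"],
    "orders": ["order_id", "customer_id", "amount"],
}
references = extract_references(schema, agent_a, agent_b)
print(references)
```

In the first round the stub extractor wrongly reports each table's primary key as a self-reference; the reviewer's feedback removes those in the second round, leaving only the order-to-customer reference.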

In the embodiment of the application, the core tables in the target database are determined according to the reference relationships between each table and other tables, and the core tables are clustered to obtain the database core entity information. Specifically, the system ranks the tables by importance according to the number of reference relationships each table has, and selects the top N tables with the most reference relationships as the core tables. For example, the 20 tables with the most reference relationships are selected as the core tables.

Further, a third agent (agent-c) and a fourth agent (agent-d) are constructed, the core tables are clustered through the agent-c, and the core entity information of the database is extracted. For example, the database core entity information may be entity information such as "customer information", "order information" or "product information" that contains specific tables and core field descriptions.

In some embodiments, agent-d evaluates the extraction of the database core entity information and provides feedback to agent-c to ensure that the extracted database core entity information is semantically accurate. Through the process, the embodiment of the application can automatically identify the key business modules or entities in the database, and greatly improves the understanding and optimizing capability of the database.

In the embodiment of the application, the database characteristic information of the whole database is inferred from the extracted database core entity information. The database characteristic information includes, but is not limited to, the functions and purposes of the database, the field to which it belongs, and its core fields. The functions and purposes indicate, for example, that the database may be used for order management, customer information management or product inventory management. The field to which the database belongs, such as e-commerce, finance or medical care, can be inferred, for example, by analyzing the tables and entities. The core fields are the fields in the database that are decisive for critical services, such as order ID, customer ID and product name.

Therefore, the embodiment of the application can accurately acquire the business meaning of each table and each column by carrying out multi-level and multi-dimensional analysis and inference on the initial database mode information of the target database, thereby improving the understanding capability of the system on the complex database structure. By introducing the intelligent agent, the embodiment of the application can automatically identify the reference relations among the tables, continuously optimize the identification accuracy of the reference relations through an iterative feedback mechanism and avoid the complicated operation of manual definition. In addition, the embodiment of the application can automatically identify the core table in the database, extract the key entity, and improve the accuracy of entity extraction by optimizing the feedback mechanism, so that the main service module of the database is more clearly visible.

In some embodiments, the obtaining the column meaning and the table meaning of each table in the target database according to the initial database schema information of the target database includes:

deducing and storing business meanings of each column in each table;

Optimizing and storing the business meaning of each table based on the business meaning of each column;

Optimizing and storing the business meaning of each column in each table based on the business meaning of each table;

and iteratively optimizing the business meaning of each column and the business meaning of each table until a set number of iterations is reached or an end condition is triggered.
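The iterative optimization of column meanings and table meanings described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the three callables `infer_column`, `infer_table` and `refine_column` are hypothetical rule-based stand-ins for large-model calls, and the end condition is assumed to be "nothing changed in the last round".

```python
def refine_meanings(tables, infer_column, infer_table, refine_column, max_iters=3):
    """tables maps a table name to its list of column names."""
    # Step 1: infer an initial business meaning for every column.
    col_meaning = {t: {c: infer_column(t, c) for c in cols} for t, cols in tables.items()}
    table_meaning = {}
    for _ in range(max_iters):
        # Step 2: derive each table meaning from its column meanings.
        new_table = {t: infer_table(t, col_meaning[t]) for t in tables}
        # Step 3: refine each column meaning using the table meaning.
        new_col = {
            t: {c: refine_column(c, col_meaning[t][c], new_table[t]) for c in cols}
            for t, cols in tables.items()
        }
        # Assumed end condition: a full round produced no change.
        converged = new_table == table_meaning and new_col == col_meaning
        table_meaning, col_meaning = new_table, new_col
        if converged:
            break
    return col_meaning, table_meaning


# Hypothetical stand-ins for the large-model calls.
def infer_column(table, column):
    return column.replace("_", " ")

def infer_table(table, column_meanings):
    if any(m.startswith("order") for m in column_meanings.values()):
        return "order information table"
    return "unknown table"

def refine_column(column, meaning, table_meaning):
    if table_meaning.startswith("order") and "order" not in meaning:
        return "order " + meaning
    return meaning


cols, tabs = refine_meanings({"t1": ["order_id", "amount"]},
                             infer_column, infer_table, refine_column)
```

In this toy run the column "amount" is first read in isolation and is then narrowed to "order amount" once the table has been recognized as an order information table.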

In some embodiments, the determining, according to the table meaning of each table, a reference relationship existing between each table and other tables includes:

The method comprises the steps of constructing a first agent and a second agent, wherein the first agent is used for extracting a reference relation existing between each table and other tables according to the table meaning of each table, and the second agent is used for evaluating the reference relation and feeding back an evaluation result to the first agent;

and extracting again by the first agent according to the evaluation result, and iteratively optimizing the reference relation existing between each table and other tables until a set number of iterations is reached or an end condition is triggered.
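The extract-evaluate-re-extract loop between the first agent and the second agent, followed by the removal of repeated reference relations, might look like the sketch below. The functions `agent_a` and `agent_b` are hypothetical canned stand-ins for the two agents, a reference relation is assumed to be a tuple (table, column, referenced table, referenced column), and the end condition is assumed to be "the evaluator raises no objection".

```python
def extract_references(tables, extract, evaluate, max_iters=3):
    feedback = None
    relations = []
    for _ in range(max_iters):
        relations = extract(tables, feedback)    # first agent: extract relations
        feedback = evaluate(tables, relations)   # second agent: evaluate them
        if not feedback:                         # assumed end condition
            break
    # Filter out repeated relations while keeping their original order.
    return list(dict.fromkeys(relations))


# Hypothetical stand-in extractor with one repeated and one wrong proposal.
def agent_a(tables, feedback):
    proposals = [
        ("orders", "customer_id", "customers", "customer_id"),
        ("orders", "customer_id", "customers", "customer_id"),  # repeated on purpose
        ("orders", "order_id", "customers", "name"),            # wrong on purpose
    ]
    rejected = set(feedback or [])
    return [p for p in proposals if p not in rejected]

# Hypothetical stand-in evaluator: reject relations whose column names differ.
def agent_b(tables, relations):
    return [r for r in relations if r[1] != r[3]]


tables = {"orders": ["order_id", "customer_id"], "customers": ["customer_id", "name"]}
references = extract_references(tables, agent_a, agent_b)
```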

In some embodiments, the determining the core table in the target database according to the reference relation between each table and other tables, and clustering the core tables to obtain the core entity information of the database includes:

sorting according to the number of the reference relations of each table, and selecting the first N tables as the core tables;

The method comprises the steps of constructing a third agent and a fourth agent, wherein the third agent is used for clustering the core tables to obtain database core entity information, and the fourth agent is used for evaluating and feeding back the database core entity information obtaining result;

And extracting again by the third agent according to the feedback on the database core entity information acquisition result, and iteratively optimizing the database core entity information acquisition result until a set number of iterations is reached or an end condition is triggered.
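The selection of the first N tables by number of reference relations can be illustrated with a short sketch. It assumes that a relation counts once for the referencing table and once for the referenced table, and that ties are broken by table name. Both choices, like the function name `select_core_tables`, are illustrative assumptions that the text does not specify.

```python
from collections import Counter

def select_core_tables(relations, n):
    """relations: iterable of (table, column, referenced_table, referenced_column)."""
    counts = Counter()
    for src_table, _src_col, dst_table, _dst_col in relations:
        counts[src_table] += 1   # the table that holds the reference
        counts[dst_table] += 1   # the table being referenced
    # Most-referenced first; the name is used only to make ties deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:n]


relations = [
    ("orders", "customer_id", "customers", "customer_id"),
    ("orders", "product_id", "products", "product_id"),
    ("order_items", "order_id", "orders", "order_id"),
    ("reviews", "product_id", "products", "product_id"),
]
core_tables = select_core_tables(relations, 2)
```

Here "orders" takes part in three relations and "products" in two, so they are selected as the two core tables.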

As an example, in connection with fig. 3, the flow shown in fig. 3 illustrates the basic steps and working principles of semantic parsing of initial database schema information of the target database based on a large model.

S301, connecting to the target database, and acquiring the initial database schema information.

In the embodiment of the application, the system first connects to the target database and acquires the initial database schema information of the database. The information includes basic information such as the database name, table names, column names and column types, and sample data for some of the columns. This initial data provides data structure and type references for the subsequent inference of column meanings and table meanings.
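As a rough illustration of S301, the sketch below collects table names, column names, column types and a few sample values. It assumes a SQLite database purely for convenience, since the application does not name a database engine. The function name `fetch_initial_schema` and the `sample_rows` parameter are hypothetical.

```python
import sqlite3

def fetch_initial_schema(conn, sample_rows=3):
    """Return {table: [{"name", "type", "samples"}, ...]} for a SQLite connection."""
    schema = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name").fetchall()]
    for table in tables:
        columns = []
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for _cid, name, col_type, *_rest in info:
            samples = [r[0] for r in conn.execute(
                f'SELECT "{name}" FROM "{table}" LIMIT ?', (sample_rows,)).fetchall()]
            columns.append({"name": name, "type": col_type, "samples": samples})
        schema[table] = columns
    return schema


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
conn.execute("INSERT INTO orders VALUES (1, '2024-10-18')")
schema = fetch_initial_schema(conn)
```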

S302, deducing column meanings.

In the embodiment of the present application, in this step a business meaning is inferred for each column of each table according to the database schema information acquired in S301. By analyzing column names, column types and sample data, the system infers the specific business meaning of each column. For example, the business meaning of a column may be inferred from its column name, such as "order_id". The system can also optimize the business meaning of the column through multiple iterations to ensure the accuracy of the inferred result.

S303, deducing the meaning of the table.

In the embodiment of the present application, based on the business meanings of each column in S302, the overall business meaning of the table is further inferred. In this process, the meaning of a table is generally determined by the columns it contains. For example, if the table contains fields of "order_id", "order_date", etc., it may be inferred that the table is related to order services.

S304, deducing the table relation.

In the embodiment of the present application, in this step the reference relationships between tables are inferred based on the field relationships between the tables. The current table is typically treated as the referencing table, and the other related tables as the referenced tables. Through the collaboration of the large model and the agents, the system identifies foreign keys, dependencies and the like between tables. In this process, the system introduces a result-evaluation agent to provide feedback on the inferred results, continuously improving the accuracy of reference relationship recognition.

S305, extracting the core entity.

In the embodiment of the application, the reference relation of all the tables is analyzed, the core tables of the database are identified, and the core tables are clustered and the entity is extracted. For example, the core entity may include business entities such as "clients", "orders", etc., and the accuracy of extraction and the matching degree of the business are continuously improved through a feedback optimization mechanism.

S306, deducing the functions and purposes of the database.

In the embodiment of the application, the main functions, purposes and fields of the whole database are deduced based on the core entity and the table relation thereof. And judging whether the database is used for a specific business scene, such as an order management system, a customer relationship management system and the like, according to the business meaning and the entity identified in the previous step. This inference provides an overall view of the business interpretation of the database.

In some embodiments, the performing semantic analysis on the initial query statement of the user according to the semantic analysis result of the target database to obtain the query semantic of the user includes:

Determining context information required by semantic analysis of the initial query statement of the user according to the semantic analysis result of the target database;

And determining the user query semantics according to the context information.

In the embodiment of the application, before carrying out semantic analysis on the initial query statement of the user, relevant context information which can help to understand the initial query statement of the user is determined according to the semantic analysis result of the target database. The context information refers to key data extracted from semantic analysis results of a database and capable of helping a system understand user queries. Such information may include core entities in the database, business meaning of fields, relationships between tables, and functions and uses of the database, among others.

In some embodiments, when handling very large databases, the system needs to intelligently select the most important context information according to the context window (Context Window) limit of the large model, ensuring efficient use of computing resources. Specifically, when the semantic analysis result does not exceed the context window limit, the entire semantic analysis result of the target database is used as the context. This provides a complete reference for semantic analysis and maximizes accuracy. When the semantic analysis result exceeds the context window limit, the database core entity information and the database characteristic information are extracted as the context. This is because core entities often reflect the key tables and data relationships in the database and can provide sufficient context information for most queries. If the sum of the core entity information and the database characteristic information still exceeds the context window limit, the context is further simplified, and only key information such as the functions and purposes of the database and the description of the core fields is used. This information provides a global view that helps the system understand the background and purpose of the user query.

In the embodiment of the application, the system can understand the main concepts or variables in the initial query statement of the user through the context information. For example, if the user mentions "order amount" in the initial query statement, the contextual information helps the system identify the specific fields and associated tables of the concept in the database. The system matches these identified concepts with the context information to ensure that each concept has a clear meaning in the database structure.

In the embodiment of the application, when semantic analysis is performed, if the system finds that certain concepts or variables in the initial query statement of the user do not match the context or are ambiguous, the system tries to generate a relevant explanation. For example, the user may have entered an ambiguous field name, in which case the system provides possible field interpretations based on the context information. For instance, if the user query mentions "total", the system may identify multiple possible "total" fields in the database, such as "order total" or "payment", and provide a clear explanation.

According to the embodiment of the application, the context information is utilized to perform semantic analysis and clarification on the user query, so that the query precision can be remarkably improved. Ambiguous portions of the user query may be clarified by contextual inference of the system, making the query results more accurate. Furthermore, the hierarchical policy in the context selection scheme makes the system more efficient in handling very large databases. When the database information exceeds the computing capacity of the large model, the system can intelligently extract core information for processing, so that unnecessary computing burden is avoided, and meanwhile, the effectiveness of the context is guaranteed.

In some embodiments, the determining, according to the semantic analysis result of the target database, context information required for performing semantic analysis on the user initial query statement includes:

Under the condition that the semantic analysis result does not exceed the context window limit of the large model, taking the semantic analysis result as the context information;

When the semantic analysis result exceeds the context window limit of the large model, taking the database core entity information and the database characteristic information as context information;

And if the sum of the database core entity information and the database characteristic information exceeds the context window limit of the large model, taking the database characteristic information as the context information.
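The three-level context selection above reduces to a short cascade of size checks. The sketch below is illustrative only: it measures size with a pluggable `count_tokens` function that defaults to character count, which is an assumption, since a real system would use the tokenizer of the large model. The function name `select_context` is likewise hypothetical.

```python
def select_context(full_analysis, core_entities, db_features, window_limit,
                   count_tokens=len):
    # Level 1: the whole semantic analysis result fits in the context window.
    if count_tokens(full_analysis) <= window_limit:
        return full_analysis
    # Level 2: core entity information plus database characteristic information.
    combined = core_entities + "\n" + db_features
    if count_tokens(combined) <= window_limit:
        return combined
    # Level 3: database characteristic information only.
    return db_features


full = "x" * 100   # placeholder for the full semantic analysis result
core = "c" * 30    # placeholder for database core entity information
feat = "f" * 10    # placeholder for database characteristic information

context_large = select_context(full, core, feat, window_limit=200)
context_medium = select_context(full, core, feat, window_limit=50)
context_small = select_context(full, core, feat, window_limit=20)
```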

As an example, in connection with fig. 4, the flow shown in fig. 4 illustrates the basic steps and principles of operation of semantic analysis of the user's initial query statement based on a large model.

S401, constructing a relevant context based on the modeling result.

In the embodiment of the application, which Context information is most important is intelligently selected according to the Context Window limit of the large model, so that the efficient utilization of computing resources is ensured.

S402, identifying main concepts or variables in the user query.

In the embodiment of the application, the system can understand the main concepts or variables in the initial query statement of the user through the context information.

S403, judging whether blurring or ambiguity exists.

In the embodiment of the present application, when performing semantic analysis, if the system finds that some concepts or variables in the initial query statement of the user do not match the context or are ambiguous, step S404 is performed, otherwise step S405 is performed.

S404, generating the most relevant explanation according to the constructed context.

In the embodiment of the present application, for the concepts or variables that are not clear or ambiguous in step S403, the most relevant explanation is generated according to the context of construction.

S405, optimizing the user query.

In the embodiment of the present application, the explanation generated in step S404 is used to optimize the user query, ensuring that the result the user expects is clearly expressed.
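Steps S403 to S405 can be pictured with a toy example. The sketch assumes that the main concepts have already been identified in S402 and are passed in as a list, and it uses plain substring matching against field meanings as a stand-in for the large model's judgement. The field names, meanings and the function `clarify_query` are hypothetical.

```python
def clarify_query(query, concepts, field_meanings):
    """field_meanings maps "table.column" to its inferred business meaning."""
    notes = []
    for concept in concepts:
        matches = sorted(field for field, meaning in field_meanings.items()
                         if concept.lower() in meaning.lower())
        if len(matches) > 1:      # S403: ambiguous, so S404 generates an explanation
            notes.append(f'"{concept}" may refer to: ' + ", ".join(matches))
        elif not matches:         # S403: no match in the constructed context
            notes.append(f'"{concept}" has no matching field in the context')
    if not notes:
        return query              # nothing to clarify
    return query + " [" + "; ".join(notes) + "]"   # S405: optimized query


field_meanings = {
    "orders.total_amount": "order total",
    "payments.amount": "payment total",
    "orders.order_id": "order ID",
}
ambiguous = clarify_query("show the total for last month", ["total"], field_meanings)
clear = clarify_query("show order ID 42", ["order ID"], field_meanings)
```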

In some embodiments, the filtering the initial database schema information according to the user query semantics and the semantic analysis result of the target database to obtain database key entity information corresponding to the user query includes:

According to the semantic analysis result of the target database, determining N tables related to the user query semantics through a semantic similarity search technology;

In a context window of the large model, M tables related to the user query semantics and corresponding key columns in the N tables are identified, wherein M is less than or equal to N;

And aggregating the M tables and the key columns to obtain the database key entity information corresponding to the user query.

In the embodiment of the application, firstly, N tables related to the user query semantics are determined by utilizing a semantic similarity search technology. The semantic similarity searching technology is to determine a database table most relevant to user query by analyzing semantic information of text content. This technique utilizes natural language processing and machine learning models to understand the intent of the user's query and find the best matching table in the schema of the database. For example, through a semantic similarity search technology, the first 20 tables related to the user query semantics are determined, and the search scope is reduced. After the first 20 tables are obtained, it is necessary to further accurately identify which tables are most relevant to the semantics of the user query. Specifically, the embodiment of the application provides a decomposition processing strategy which is performed in the Context Window of a large model. The Context Window refers to a content range which can be considered by the model when processing data, so that the model can be accurately matched and identified within a limited range.

In some embodiments, when processing a user query, the large model analyzes the contents and structure of the first 20 tables to determine which tables are semantically most relevant to the query. From the 20 tables, the M tables (M ≤ 20) that are most relevant to the user query are selected, and the key columns in those tables are identified.

In some embodiments, after the M most relevant tables and corresponding key columns are determined, these data need to be filtered and aggregated to determine the final required database key entity information. Specifically, the M tables are further filtered to remove redundant and irrelevant tables and columns. And summarizing the related tables and columns to integrate the final database key entity information required by the user query. The goal of this step is to reduce and determine the minimum and necessary tables and columns to meet the user's query needs. The embodiment of the application maps the user query to the key table and the column in the database through semantic similarity search, context analysis and data aggregation to generate the final SQL query. The method is particularly suitable for processing a very large-scale database, because the method avoids the whole scanning of the whole database and improves the efficiency through gradual screening and aggregation.

As an example, in conjunction with fig. 5, the flow shown in fig. 5 illustrates the basic steps and working principles of screening the initial database schema information based on a large model to obtain the database key entity information corresponding to the user query.

S501, determining N tables related to the user query semantics based on the semantic analysis result of the target database.

In the embodiment of the application, N tables related to the user query semantics are determined by utilizing a semantic similarity search technology.

S502, based on the large model, further identifying the M key tables and the key columns that are most necessary for the user query.

In the embodiment of the application, when processing the user query, the large model analyzes the content and structure of the first N tables to determine which tables are semantically most relevant to the query. The M tables (M ≤ N) most relevant to the user query are selected from the N tables, and the key columns in those tables are identified.

S503, aggregating the M tables and the key columns to determine the minimum and necessary tables and columns finally needed.

In the embodiment of the application, after M most relevant tables and corresponding key columns are determined, the data are required to be screened and aggregated to determine the final required key entity information of the database.
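The first and last stages of this screening flow (S501 and S503) can be sketched as below. Bag-of-words cosine similarity is used here only as a simple stand-in for the semantic similarity search, which in practice would more likely rely on text embeddings. The selection in S502 is performed by the large model and is represented merely by a hand-written list of (table, column) pairs. All names and sample data are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts, using word counts as vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def top_n_tables(query, table_meanings, n):
    # S501: rank tables by similarity between the query and each table meaning.
    ranked = sorted(table_meanings,
                    key=lambda t: (-cosine(query, table_meanings[t]), t))
    return ranked[:n]

def aggregate(selected):
    # S503: merge (table, column) pairs into a minimal table -> columns mapping.
    result = {}
    for table, column in selected:
        columns = result.setdefault(table, [])
        if column not in columns:
            columns.append(column)
    return result


table_meanings = {
    "orders": "customer order records with order amount",
    "customers": "customer contact details",
    "logs": "system audit events",
}
candidates = top_n_tables("total order amount per customer", table_meanings, 2)
# Stand-in for the S502 output of the large model.
key_entities = aggregate([("orders", "amount"), ("customers", "name"),
                          ("orders", "amount"), ("orders", "customer_id")])
```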

In some embodiments, the generating the SQL query statement according to the database key entity information and the semantic analysis result of the target database corresponding to the database key entity information includes:

The method comprises the steps of constructing a fifth agent and a sixth agent, wherein the fifth agent is used for generating an SQL query statement according to the database key entity information and semantic analysis results of the target database corresponding to the database key entity information, and the sixth agent is used for evaluating the SQL query statement and feeding back evaluation results to the fifth agent;

and regenerating the SQL query statement by the fifth agent according to the evaluation result, and iteratively optimizing the SQL query statement until a set number of iterations is reached or an end condition is triggered.

In the embodiment of the application, a fifth agent (agent-e) and a sixth agent (agent-f) are constructed. The agent-e is responsible for generating SQL query statements according to the database key entity information and the semantic analysis results of the target database corresponding to the database key entity information. This agent converts the user query into an initial SQL query statement using the table and column information obtained in the previous steps and the corresponding semantic analysis results. The agent-f is responsible for evaluating the execution results of the SQL query statements and providing feedback. Its task is to examine the execution of the SQL query and to feed back optimization suggestions based on the results. Specifically, agent-e constructs an SQL query statement using the data in the database key entity information. This is the initial version of the SQL query, written based on the mapping construction result. Then agent-f executes the SQL query statement and, if the query fails to execute (e.g., due to a syntax error or a database structure problem), submits the error information as feedback to agent-e. If the SQL query executes successfully, agent-f verifies whether the SQL query meets the requirement of the user query, for example by checking whether the returned data is accurate and contains all necessary information. If the SQL query does not fully meet the requirement, agent-f makes optimization suggestions.

In some embodiments, agent-e regenerates the SQL query statement based on agent-f's feedback. The optimization process may include that if the feedback contains syntax error information, agent-e will modify the SQL query to correct these errors. According to feedback of result verification, agent-e may adjust the logic of the SQL query to ensure that the returned result better meets the user's needs.

In some embodiments, the above process is repeated until a set number of iterations is reached or an end condition is triggered.

As an example, in connection with fig. 6, the flow shown in fig. 6 illustrates the basic steps and working principles of generating the SQL query statement from the database key entity information based on a large model.

S601, generating an SQL query statement based on the mapping construction result and feedback.

In the embodiment of the application, agent-e generates SQL query sentences according to the mapping construction result, namely the key entity information of the database and/or the feedback of agent-f.

S602, evaluating the generated SQL query statement.

In the embodiment of the application, agent-f evaluates the SQL query statement generated by agent-e and feeds back comments to agent-e, including S6021, S6022, S6023, S6024 and S6025.

S6021, executing SQL.

S6022, determining whether the SQL reports an error; if yes, executing S6023, otherwise executing S6024.

S6023, record error information, and execute S6025.

S6024, checking whether the generated SQL query statement meets the requirement of the user query.

S6025, generating final feedback.
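The generate-and-evaluate flow of S601 to S6025 can be sketched against an in-memory SQLite database. This is an assumption-laden illustration rather than the method of the application: `agent_e` is a hypothetical canned generator standing in for the large model, `meets_requirement` is a hypothetical check supplied by the caller, and SQLite is chosen only because it ships with Python.

```python
import sqlite3

def evaluate_sql(conn, sql, meets_requirement):
    """Evaluator role (agent-f): execute, record errors, check the result."""
    try:
        rows = conn.execute(sql).fetchall()                 # S6021
    except sqlite3.Error as exc:                            # S6022 -> S6023
        return {"ok": False, "feedback": f"execution error: {exc}", "rows": None}
    if meets_requirement(rows):                             # S6024
        return {"ok": True, "feedback": None, "rows": rows}
    return {"ok": False, "rows": rows,                      # S6025
            "feedback": "result does not meet the query requirement"}

def generate_with_feedback(conn, generate, meets_requirement, max_iters=3):
    feedback, sql, verdict = None, None, None
    for _ in range(max_iters):
        sql = generate(feedback)                            # S601 (agent-e)
        verdict = evaluate_sql(conn, sql, meets_requirement)  # S602 (agent-f)
        if verdict["ok"]:
            break
        feedback = verdict["feedback"]
    return sql, verdict


# Hypothetical generator: the first attempt has a misspelled column name.
def agent_e(feedback):
    if feedback is None:
        return "SELECT SUM(amout) FROM orders"
    return "SELECT SUM(amount) FROM orders"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 5.5)])
sql, verdict = generate_with_feedback(
    conn, agent_e, lambda rows: len(rows) == 1 and rows[0][0] is not None)
```

When model-generated SQL is executed for evaluation in this way, running it over a read-only connection or against a copy of the data is a sensible precaution.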

Based on the same principle as the method provided by the embodiment of the application, the embodiment of the application also provides a large model-based SQL sentence generating device, as shown in FIG. 7, which comprises:

An obtaining module 701, configured to obtain a user initial query statement and initial database schema information of a target database;

The processing module 702 is configured to input the user initial query statement and the initial database schema information of the target database into a large language model, so as to obtain an SQL query statement.

The large model-based SQL statement generation device provided by the embodiment of the application can realize each process realized in the method embodiments of fig. 1 to 6, and in order to avoid repetition, the description is omitted here.

According to the SQL sentence generating device based on the large model, provided by the application, the efficient analysis and generation of natural language query are realized by combining the initial query sentence of the user and the database mode information of the target database. The large language model not only can deeply understand the semantic structure of the target database, but also can accurately analyze the semantic of the user query, thereby effectively extracting the key entity information of the database related to the query. Compared with the existing Text2SQL method, the embodiment of the application remarkably improves semantic understanding of the database mode information, enhances generalization capability of coping with diversified query scenes, and greatly improves final accuracy of SQL generation.

The large model-based SQL statement generation device according to the embodiments of the present application may execute the large model-based SQL statement generation method according to the embodiments of the present application, and the implementation principle is similar, and actions executed by each module and unit in the large model-based SQL statement generation device according to each embodiment of the present application correspond to steps in the large model-based SQL statement generation method according to each embodiment of the present application, and detailed functional descriptions of each module of the large model-based SQL statement generation device may be referred to the descriptions in the corresponding large model-based SQL statement generation method shown in the foregoing, which are not repeated herein.

Based on the same principles as the methods shown in the embodiments of the present application, the embodiments of the present application also provide an electronic device that may include, but is not limited to, a processor and a memory, the memory being for storing a computer program, the processor being for executing the large model-based SQL statement generation method shown in any of the alternative embodiments of the present application by invoking the computer program. Compared with the prior art, the SQL sentence generation method based on the large model realizes the efficient analysis and generation of natural language query by combining the user initial query sentence and the database mode information of the target database. The large language model not only can deeply understand the semantic structure of the target database, but also can accurately analyze the semantic of the user query, thereby effectively extracting the key entity information of the database related to the query. Compared with the existing Text2SQL method, the embodiment of the application remarkably improves semantic understanding of the database mode information, enhances generalization capability of coping with diversified query scenes, and greatly improves final accuracy of SQL generation.

In an alternative embodiment, an electronic device is also provided. As shown in FIG. 8, the electronic device 8000 comprises a processor 8001 and a memory 8003. The processor 8001 is coupled to the memory 8003, for example via a bus 8002. Optionally, the electronic device 8000 may also include a transceiver 8004, which may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. It should be noted that, in practical applications, the transceiver 8004 is not limited to one, and the structure of the electronic device 8000 does not constitute a limitation on the embodiment of the present application.

The processor 8001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 8001 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.

Bus 8002 may include a path for transferring information between the above components. Bus 8002 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, etc. Bus 8002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or only one type of bus.

Memory 8003 may be, without limitation, a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.

The memory 8003 is used to store a computer program that executes an embodiment of the present application, and is controlled to be executed by the processor 8001. The processor 8001 is configured to execute a computer program stored in the memory 8003 to implement the steps shown in the foregoing method embodiment.

Among them, the electronic devices include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), car terminals (e.g., car navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 8 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the application.

Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.

The embodiment of the application also provides a computer program product, which comprises a computer program, wherein the computer program can realize the steps and corresponding contents of the embodiment of the method when being executed by a processor.

The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.

It should be understood that, although the operation steps are indicated by arrows in the flowcharts of the embodiments of the present application, these steps need not be performed in the order indicated by the arrows. Unless explicitly stated otherwise herein, in some implementation scenarios of the embodiments of the present application, the steps in the flowcharts may be performed in other orders as required. Furthermore, depending on the actual implementation scenario, some or all of the steps in the flowcharts may include multiple sub-steps or stages. Some or all of these sub-steps or stages may be performed at the same time, or each may be performed at a different time. Where they are performed at different times, their execution order can be configured flexibly as required, which is not limited by the embodiments of the present application.

The foregoing describes only optional implementations for some implementation scenarios of the present application. It should be noted that, for those skilled in the art, other similar implementations based on the technical ideas of the present application, adopted without departing from those technical ideas, also fall within the protection scope of the embodiments of the present application.

Claims

1. A method for generating SQL statements based on a large model, characterized by comprising:

Obtaining the user's initial query statement and the initial database schema information of the target database;

Inputting the user's initial query statement and the initial database schema information of the target database into a large language model to obtain a structured SQL query statement;

The large language model is used to perform the following operations:

Performing semantic parsing on the initial database schema information of the target database to obtain a semantic parsing result of the target database;

Performing semantic analysis on the user's initial query statement according to the semantic parsing result of the target database to obtain user query semantics;

Filtering and processing the initial database schema information according to the user query semantics and the semantic parsing result of the target database to obtain database key entity information corresponding to the user query;

Generating the SQL query statement according to the database key entity information and the semantic parsing result of the target database corresponding to the database key entity information;

Wherein filtering and processing the initial database schema information according to the user query semantics and the semantic parsing result of the target database to obtain the database key entity information corresponding to the user query comprises:

Determining, according to the semantic parsing result of the target database, N tables related to the user query semantics through a semantic similarity search technique;

Identifying, within the context window of the large model, M tables and their corresponding key columns among the N tables that are semantically related to the user query, where M≤N;

Aggregating the M tables and the key columns to obtain the database key entity information corresponding to the user query.
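
By way of illustration only, the two-stage screening recited above (a similarity search that yields N candidate tables, followed by a narrowing step that keeps M≤N tables and their key columns) might be sketched as follows. Everything named here is an assumption made for the sketch: `embed` is a toy bag-of-words stand-in for a real embedding model, `keyword_pick` is a stub standing in for the large-model call made inside the context window, and the table and column meanings are invented sample data.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector (assumption: a real system would use an embedding model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_n_tables(query, table_meanings, n):
    # Stage 1: rank tables by similarity between the query and each table meaning.
    q = embed(query)
    ranked = sorted(table_meanings,
                    key=lambda name: cosine(q, embed(table_meanings[name])),
                    reverse=True)
    return ranked[:n]

def select_key_entities(query, candidates, column_meanings, pick):
    # Stage 2: `pick` sees only the N candidates and returns tables with key columns.
    chosen = pick(query, {t: column_meanings[t] for t in candidates})
    # Discard anything outside the candidate list so that M <= N always holds.
    return {t: cols for t, cols in chosen.items() if t in candidates}

def keyword_pick(query, tables):
    # Hypothetical stub for the model call: keep columns whose meaning shares a query word.
    q = set(embed(query))
    picked = {}
    for table, cols in tables.items():
        keys = [c for c, meaning in cols.items() if q & set(embed(meaning))]
        if keys:
            picked[table] = keys
    return picked

# Invented sample data.
table_meanings = {
    "orders": "customer orders with order date and total amount",
    "customers": "customer profile name and city",
    "audit_log": "internal system audit events",
}
column_meanings = {
    "orders": {"amount": "total amount of the order", "created_at": "creation timestamp"},
    "customers": {"name": "full name of the person", "city": "city of residence"},
    "audit_log": {"event": "audit event type"},
}
query = "total order amount per customer city"

candidates = top_n_tables(query, table_meanings, n=2)
key_entities = select_key_entities(query, candidates, column_meanings, keyword_pick)
```

In this sketch the final filter in `select_key_entities` is what guarantees M≤N even if the stubbed model call returns a table that was never a candidate.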

2. The large-model-based SQL statement generation method according to claim 1, characterized in that performing semantic parsing on the initial database schema information of the target database to obtain the semantic parsing result of the target database comprises:

Obtaining the column meanings and the table meaning of each table in the target database according to the initial database schema information of the target database;

Determining the reference relationships between each table and other tables according to the table meaning of each table and the column meaning of each column in each table;

Determining the core tables in the target database according to the number of reference relationships between each table and other tables, and clustering the core tables to obtain database core entity information;

Determining database feature information of the target database according to the database core entity information.

3. The large-model-based SQL statement generation method according to claim 2, characterized in that obtaining the column meanings and the table meaning of each table in the target database according to the initial database schema information of the target database comprises:

Inferring and storing the business meaning of each column in each table;

Optimizing and storing the business meaning of each table based on the business meanings of its columns;

Optimizing and storing the business meaning of each column in each table based on the business meaning of that table;

Iteratively optimizing the business meaning of each column and the business meaning of each table until a set number of iterations is reached or an end condition is triggered.

4. The large-model-based SQL statement generation method according to claim 2, characterized in that determining the reference relationships between each table and other tables according to the table meaning of each table comprises:

Constructing a first agent and a second agent, wherein the first agent is used to extract the reference relationships between each table and other tables according to the table meaning of each table, and the second agent is used to evaluate the reference relationships and feed the evaluation result back to the first agent;

Re-extracting, by the first agent, the reference relationships according to the evaluation result, and iteratively optimizing the reference relationships between each table and other tables until a set number of iterations is reached or an end condition is triggered.
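
By way of illustration only, the extract-and-evaluate loop between the two agents might be sketched as follows. The helper names (`refine`, `make_extractor`, `make_evaluator`) are hypothetical, both agents are simple rule-based stubs rather than large-model calls, and the naming rule "a column `x_id` refers to table `xs`" is an assumption made purely so the example runs.

```python
def refine(generate, evaluate, max_rounds=3):
    # Generic loop: generate, evaluate, regenerate with feedback,
    # stopping on acceptance (end condition) or after max_rounds evaluations.
    feedback = None
    result = generate(feedback)
    for _ in range(max_rounds):
        ok, feedback = evaluate(result)
        if ok:
            break
        result = generate(feedback)
    return result

def make_extractor(tables):
    # Hypothetical "first agent": guesses references from column names.
    rejected = set()
    def extract(feedback):
        if feedback:
            rejected.update(feedback.split(","))
        refs = set()
        for table, cols in tables.items():
            for col in cols:
                if col.endswith("_id"):
                    target = col[:-3] + "s"  # assumed naming rule
                    if target not in rejected:
                        refs.add((table, target))
        return refs
    return extract

def make_evaluator(tables):
    # Hypothetical "second agent": rejects references to tables that do not exist.
    def evaluate(refs):
        unknown = sorted({target for _, target in refs if target not in tables})
        return (not unknown, ",".join(unknown))
    return evaluate

# Invented sample schema: warehouse_id points at a table that is not present.
tables = {
    "orders": ["id", "customer_id", "warehouse_id"],
    "customers": ["id", "name"],
}
references = refine(make_extractor(tables), make_evaluator(tables))
```

Here the first pass proposes a reference to a non-existent `warehouses` table, the evaluator reports it, and the second pass drops it, after which the end condition (acceptance) is met.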

5. The large-model-based SQL statement generation method according to claim 2, characterized in that determining the core tables in the target database according to the reference relationships between each table and other tables, and clustering the core tables to obtain the database core entity information, comprises:

Sorting the tables by their number of reference relationships and selecting the first N tables as the core tables;

Constructing a third agent and a fourth agent, wherein the third agent is used to cluster the core tables to obtain the database core entity information, and the fourth agent is used to evaluate the database core entity information and feed the evaluation result back to the third agent;

Re-extracting, by the third agent, according to the evaluation result, and iteratively optimizing the database core entity information until a set number of iterations is reached or an end condition is triggered.
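
By way of illustration only, the ranking step (sorting tables by their number of reference relationships and keeping the first N) might be sketched as follows. The sketch assumes that a reference relationship counts toward both the referencing and the referenced table, and that ties are broken alphabetically; neither detail is specified above. The clustering performed by the agents is not shown.

```python
from collections import Counter

def core_tables(references, n):
    # references: iterable of (referencing_table, referenced_table) pairs.
    counts = Counter()
    for src, dst in references:
        counts[src] += 1  # assumption: outgoing references count
        counts[dst] += 1  # assumption: incoming references count
    # Highest count first; alphabetical order breaks ties deterministically.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:n]

# Invented sample reference relationships.
sample_refs = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
    ("payments", "orders"),
]
top_two = core_tables(sample_refs, 2)
```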

6. The large-model-based SQL statement generation method according to claim 2, characterized in that performing semantic analysis on the user's initial query statement according to the semantic parsing result of the target database to obtain the user query semantics comprises:

Determining context information required for semantic analysis of the user's initial query statement according to the semantic parsing result of the target database;

Determining the user query semantics according to the context information.

7. The large-model-based SQL statement generation method according to claim 6, characterized in that determining the context information required for semantic analysis of the user's initial query statement according to the semantic parsing result of the target database comprises:

When the semantic parsing result does not exceed the context window limit of the large model, using the semantic parsing result as the context information;

When the semantic parsing result exceeds the context window limit of the large model, using the database core entity information and the database feature information as the context information;

When the database core entity information and the database feature information together exceed the context window limit of the large model, using the database feature information as the context information.
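
By way of illustration only, the three-level fallback for choosing context information might be sketched as follows. The function name `choose_context` is hypothetical, and size is measured in characters via `len` as a stand-in for the token count a real large model would use.

```python
def choose_context(full_result, core_entities, features, window_limit, size=len):
    # Level 1: the full semantic parsing result fits in the context window.
    if size(full_result) <= window_limit:
        return full_result
    # Level 2: fall back to core entity information plus feature information.
    combined = core_entities + "\n" + features
    if size(combined) <= window_limit:
        return combined
    # Level 3: fall back to feature information alone.
    return features

# Invented sample payloads of 100, 30 and 10 characters.
full = "x" * 100
core = "c" * 30
feat = "f" * 10
```

A caller wanting token-accurate limits could pass a tokenizer-based `size` function instead of `len`.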

8. The large-model-based SQL statement generation method according to claim 1, characterized in that generating the SQL query statement according to the database key entity information and the semantic parsing result of the target database corresponding to the database key entity information comprises:

Constructing a fifth agent and a sixth agent, wherein the fifth agent is used to generate the SQL query statement according to the database key entity information and the semantic parsing result of the target database corresponding to the database key entity information, and the sixth agent is used to evaluate the SQL query statement and feed the evaluation result back to the fifth agent;

Regenerating, by the fifth agent, the SQL query statement according to the evaluation result, and iteratively optimizing the SQL query statement until a set number of iterations is reached or an end condition is triggered.
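
By way of illustration only, one possible form of the evaluation step might be sketched as follows. How the sixth agent evaluates a statement is not specified above; this sketch assumes it can ask a database engine to plan the statement without running it, and it uses an in-memory SQLite database purely as a convenient stand-in for the target database. `generate_sql` is a hard-coded stub in place of a large-model call, and the schema is invented.

```python
import sqlite3

def evaluate_sql(conn, sql):
    # Hypothetical "sixth agent": plan the statement; report the engine's error if any.
    try:
        conn.execute("EXPLAIN QUERY PLAN " + sql)
        return True, ""
    except sqlite3.Error as exc:
        return False, str(exc)

def generate_sql(question, feedback):
    # Hypothetical "fifth agent" stub: the first draft uses columns that do not exist,
    # the revised draft (produced once feedback arrives) uses the real schema.
    if feedback is None:
        return "SELECT city, SUM(total) FROM orders GROUP BY city"
    return ("SELECT c.city, SUM(o.amount) FROM orders o "
            "JOIN customers c ON o.customer_id = c.id GROUP BY c.city")

def generate_with_review(conn, question, max_rounds=3):
    feedback = None
    sql = generate_sql(question, feedback)
    for _ in range(max_rounds):
        ok, feedback = evaluate_sql(conn, sql)
        if ok:  # end condition: the evaluator accepts the statement
            return sql
        sql = generate_sql(question, feedback)
    return sql  # iteration budget exhausted; return the latest draft

# Invented sample schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")
final_sql = generate_with_review(conn, "total order amount per city")
```

A plan-only check of this kind catches unknown tables, unknown columns, and syntax errors, but it does not confirm that the statement answers the user's question, so a model-based evaluator would still be needed for semantic review.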

9. A SQL statement generation device based on a large model, characterized by comprising:

An acquisition module, used to acquire the user's initial query statement and the initial database schema information of the target database;

A processing module, used to input the user's initial query statement and the initial database schema information of the target database into a large language model to obtain an SQL query statement;

The large language model is used to perform the following operations:

10. An electronic device, characterized in that it comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 8 when executing the program.

11. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method according to any one of claims 1 to 8 is implemented.

12. A computer program product, comprising a computer program, wherein when the computer program is executed by a processor, the method according to any one of claims 1 to 8 is implemented.