CN117725078A - Multi-table data query and prediction method based on natural language - Google Patents
- Publication number: CN117725078A
- Application number: CN202311619928.3A
- Authority: CN (China)
- Prior art keywords: data, model, query, analysis, data set
- Legal status: Granted (the legal status is an assumption and not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a multi-table data query and prediction method based on natural language. Common business questions are collected on the basis of the existing business database and the specific business scenario; a table extraction data set, a business knowledge data set and a query analysis data set are then constructed on this basis; a table extraction model and a query analysis instruction generation model are selected and trained by full-parameter fine-tuning; the models are deployed in the production environment, and the corresponding instruction translation, correction and execution modules are adapted and developed according to the actual functional requirements; finally, data query and analysis requests are sent to the models through a WEB front-end page. Using natural language alone, the invention achieves accurate data extraction and query, visual analysis and data prediction in a complex multi-table, multi-field business data warehouse environment, so that business personnel can perform interactive data query and analysis without mastering structured data extraction and analysis languages such as SQL and Python.
Description
Technical Field
The invention relates to a low-code automatic data query and analysis method in the fields of big data and artificial intelligence, and in particular to a multi-table data query, visualization and prediction method based on natural language.
Background
In the data-driven era, the popularity and importance of data query and predictive analysis in daily work cannot be ignored. In finance, marketing, healthcare, education and many other fields, data analysis has become a core tool for decision support, profoundly shaping the working patterns of professionals and the efficiency of business operations. However, as companies and their businesses grow, data analysis workflows become increasingly complicated; in addition, departments and levels differ in their data requirements, data ownership and proficiency in data analysis languages, so the data query and analysis process inside a company is tedious and inefficient.
To address these problems, a great deal of recent research has focused on the design of low-code business intelligence (BI) applications: for example, Tao Hong et al. studied the use of Power BI software in analyzing and managing nationally procured drug usage data (2022), and Zhu Xiaowei et al. investigated the design and application of enterprise BI software (2023). In addition, with the development of generative large language model (LLM) technology, it has become possible to analyze table data by generating SQL statements from natural language (NL2SQL); on the latest Spider text-to-SQL leaderboard, the Alibaba team (Gao and Wang, 2023) achieved 86.6% accuracy in 3-5 table scenarios with a DAIL-SQL + GPT-4 scheme, using knowledge injection, multiple-generation recall and similar techniques.
However, conventional low-code BI tools still have high learning costs and cannot bridge differences in data definitions, so communication and coordination across departments, levels and personnel remain unavoidable in the data analysis and query process, and the efficiency gain is limited. In addition, when LLM-based SQL generation (NL2SQL) is applied to real business scenarios, the following problems exist: 1) in a multi-table data scenario, too many tables make the input context (context-length) too long, so the model easily forgets content and the accuracy of the generated result suffers; 2) directly generated SQL statements are poorly controllable, and once a generation error occurs the result is hard to correct; 3) NL2SQL is strong at queries thanks to the capabilities of SQL statements, but offers limited support for visualization, predictive analysis and other functions, and can hardly meet actual usage requirements.
Disclosure of Invention
Aiming at the defects in the current field of intelligent data analysis and query, the invention designs a method that, based on pure natural-language dialogue interaction, uses several cascaded models to complete table extraction and merging, and finally realizes query, analysis, prediction and display functions. In addition, to address the poor controllability and incomplete coverage of analysis functions of SQL statements generated by a large language model (LLM), a domain-specific language (Domain-specific Language, DSL) is designed that covers most SQL query requirements and also supports visualization and predictive analysis.
In order to overcome the defects of the prior art, the invention provides the following technical solutions:
1. Multi-table data query and prediction method based on natural language
Step one: according to the actual business requirements, confirm the data warehouses and data tables to be connected, then obtain the basic distinguishing information between different data warehouses and between different data tables, and record it as data source information;
step two: collect user questions and requests from the currently connected data warehouse to form seed questions and business term explanations;
step three: construct a table extraction data set, a business knowledge data set and a query analysis data set respectively, based on the currently connected data warehouse, the data tables, the data source information, the seed questions and the business term explanations;
step four: perform data enhancement and data cleaning on the table extraction data set, the business knowledge data set and the query analysis data set respectively, to obtain a preprocessed table extraction data set, a preprocessed business knowledge data set and a preprocessed query analysis data set;
step five: train a table extraction model with the preprocessed table extraction data set to obtain a trained table extraction model; train a query analysis instruction generation model with the business knowledge data set and the query analysis data set to obtain a trained query analysis instruction generation model;
step six: deploy the trained table extraction model and the trained query analysis instruction generation model in cascade to obtain a multi-table query instruction generation model; input the submitted user request text into the multi-table query instruction generation model to obtain the final query and analysis result.
The query analysis data set comprises domain-specific language data built from user questions and the corresponding related (or possibly related) table definition statements.
In step five, the table selection model uses RoBERTa-Chinese-WWM as its base model and adopts a two-tower structure during training.
In step five, the base model of the query analysis instruction generation model is the large language generation model WizardLM, and the model is fine-tuned with a full-parameter fine-tuning method.
In step six, the output of the trained table extraction model is preprocessed and then, together with the input of the trained table extraction model, used as the input of the trained query analysis instruction generation model.
2. Computer equipment
The computer device comprises a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method when executing the computer program.
3. Computer readable storage medium
A computer-readable storage medium has stored thereon a computer program, characterized in that the computer program when executed by a processor realizes the steps of the method.
The beneficial effects of the invention are as follows:
the method designs a processing framework that takes natural language as its only input and completely removes manual assistance and intervention from the intermediate process: a user can obtain data query and analysis results directly, without any knowledge of common data analysis languages such as Python or SQL. This greatly lowers the technical threshold of business data query and analysis, reduces the communication cost of cross-department coordination and data-definition confirmation, and improves the efficiency of data analysis and query.
The invention adopts a cascade of a table selection model and a query analysis instruction generation model. By selecting only the relevant tables, the input context length of the query analysis instruction generation model is greatly reduced, so that natural-language-driven data analysis techniques (such as NL2SQL) can be better adapted to real multi-table, multi-field production environments, while the accuracy of instruction (code) generation is improved.
To address the facts that SQL cannot perform common data analysis functions such as visual analysis, model training and prediction, and that directly generated SQL has poor controllability and is hard to correct when errors occur, the invention designs a domain-specific language (Domain-specific Language, DSL) covering SQL query, visualization, training and prediction functions. The DSL instruction set is output in a key:value format, can be corrected quickly after generation, and is easy to read and understand; complex multi-step query tasks can be completed by outputting a multi-step instruction list. It should also be emphasized that the content of the DSL instruction set can be flexibly extended and defined for different connected production environments, integrating the existing tools and components of those environments.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from these drawings.
FIG. 1 is the construction flow of a data analysis and query framework using natural language according to an embodiment of the present invention;
FIG. 2 is a diagram showing the intermediate process of sending a query request from the front end and obtaining the final query and analysis result according to an embodiment of the present invention;
FIG. 3 is a sample data query and analysis request issued directly in natural language, after model construction and deployment on the basis of a fund index management database, according to an embodiment of the present invention;
FIG. 4 shows a sample data query and analysis request according to the present invention (the request asks for the average rise and fall of different types of funds in the last week) and the corresponding query and analysis result.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the embodiment of the invention belongs to the fields of big data analysis and artificial intelligence. Taking an intelligent fund query project of a securities company as an example, the construction of the system for natural-language-based data query and analysis proceeds as follows:
step one: according to the actual business requirements, the data warehouse and the data table which are in butt joint are confirmed, then basic distinguishing information among different data warehouses and basic distinguishing information among different data tables are obtained, and the basic distinguishing information is recorded as data source information; the basic distinguishing information between different data warehouses and between different data tables comprises key information such as data sub-warehouses, data table descriptions, data field descriptions, table foreign key constraints and the like;
the data sub-warehouse refers to a data subset corresponding to different service types in the data warehouse. Taking a company operation index data warehouse as an example, the sub-warehouses possibly included in the operation index data warehouse are: a corporate personnel index warehouse, a corporate financial index warehouse, a corporate project investment index warehouse, etc.;
information such as data table description, data field description, table foreign key constraint and the like can be obtained from a data definition language Data Definition Language corresponding to the data warehouse, and is basic information for distinguishing different data warehouses and data tables.
Under an embodiment, it should be noted that the original conditions of the docking service data warehouse, such as tables, field layering and descriptions (comments), foreign key constraint integrity, can have a large impact on the accuracy of model selection tables, query analysis instruction generation. Therefore, the necessary administration work for the data warehouse is required before the framework is built.
Step two: collecting common problems and requests of users from a data warehouse in current butt joint so as to form basic seed problems and common business noun interpretation;
seed problems refer to the problems provided by the user that are most commonly used on a particular business scenario. The seed problem is expanded into a plurality of different problems of semantic approximation in the data enhancement process, and is used for enhancing the generalized understanding capability of the model on specific problems and semantics.
In the embodiment, the scene corresponds to the data warehouse, and a total of 500 tables are involved, the number of fields of each table is different from 5 to 30, and 5-10 seed questions are preferably collected corresponding to each table. These 5-10 seed questions should be guaranteed to cover most of the common fields in the table, and the seed questions involve business and field content should remain largely different, eventually collecting about 3000 seed questions.
Respectively constructing a table extraction data set, a business knowledge data set and a query analysis data set based on the currently-butted data warehouse, the data table, the data source information, the seed problem and the business noun interpretation;
in step three, the table extraction data set refers to a data set used to train the deep learning model to distinguish the mapping relation between different questions and different tables and table fields. Each data item consists of a triplet annotated as "input", "right_label" and "false_label", with the following structure:
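A minimal illustrative sketch of such a triplet in Python form (the question text and both table DDL strings below are assumed examples for illustration, not items taken from the actual data set):

```python
# Illustrative sketch only: the question and both table schemas are assumed examples.
table_extraction_sample = {
    # natural-language user question
    "input": "What is the average net value of equity funds in the last week?",
    # DDL of a table that can answer the question (positive label)
    "right_label": "CREATE TABLE fund_nav (fund_code VARCHAR, fund_type VARCHAR, nav FLOAT, trade_date DATE);",
    # DDL of an unrelated table (negative label)
    "false_label": "CREATE TABLE employee_info (emp_id VARCHAR, dept VARCHAR, hire_date DATE);",
}
```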
the query analysis dataset includes data for a domain-defined language based on the user questions and corresponding related or likely related form definition statements. The data format is as follows:
in the field definition language (Domain-specific Language, DSL), the functions of common keyword combination, visual analysis, training prediction analysis and the like of the SQL are covered in the data query analysis scene. A single domain definition language (DSL) instruction consists of four tuples of "input", "output", "command" and "command_ars". Each individual DSL instruction can be regarded as a line of programming language in the form of: output=command (input, ×command_ars). Wherein command can be regarded as a function_api, command_args is an input parameter of the function_api, and input and output are input and output data of the function_api respectively. Any complex query and analysis task can be disassembled into a combination of a limited number of DSL instructions, and according to the names of input/output in each DSL statement, the execution sequence or connection relation between DSL instructions can be obtained, and finally, the DSL instructions can be translated into a simple program defined by a limited stroke order language.
Due to space limitations, the complete domain-specific language instruction set used in this scenario is not shown in full. For the query function, taking the DSL corresponding to the SQL group-by aggregate query Select Agg() ... Group By key as an example, the corresponding instruction output has the following form:
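A hypothetical rendering of this instruction (the command name and argument keys are assumed; the table and column names "df_ori", "City", "Age" and "df_group_agg_0" come from the meaning explained below):

```python
# Illustrative sketch only: command name and argument keys are assumed.
dsl_groupby_agg = {
    "input": "df_ori",
    "output": "df_group_agg_0",
    "command": "groupby_agg",
    "command_args": {"groupby_keys": ["City"], "agg_column": "Age", "agg_func": "mean"},
}
```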
the specific meaning of the instruction is: from the table with the original name df_ori, the columns of "City" are taken as grouping columns, the columns of "Age" with different "City" types are averaged, and the result is saved as a table of "df_group_agg_0".
For the visualization function, taking pie chart drawing as an example, the corresponding DSL instruction output has the following form:
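A hypothetical rendering of this instruction (command name and argument keys assumed; the table name "df_ori" and the columns "signups", "sales" and "visits" come from the meaning explained below):

```python
# Illustrative sketch only: command name and argument keys are assumed.
dsl_pie_chart = {
    "input": "df_ori",
    "output": "fig_pie_0",
    "command": "plot_pie",
    "command_args": {"value_columns": ["signups", "sales", "visits"]},
}
```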
the actual meaning of the instruction is: from the data table originally named "df_ori", values of three columns of "signups", "sales", "visits" are extracted to draw a pie chart.
For the training and prediction analysis function, taking model training as an example, the corresponding DSL instruction has the following form:
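A hypothetical rendering of this instruction (command name and argument keys assumed; the table name "df_ori", the target column "measured" and the classification task come from the meaning explained below):

```python
# Illustrative sketch only: command name and argument keys are assumed.
dsl_train_model = {
    "input": "df_ori",
    "output": "model_cls_0",
    "command": "train_model",
    "command_args": {"task": "classification", "target_column": "measured"},
}
```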
the actual meaning of the instruction is: based on the original data table df_ori, a classification (classification) model is trained by taking the "measured" column as a prediction target column.
The above DSL examples each involve only a single-step query. For more complex multi-step query analysis, the query analysis instruction generation model can generate a list of multiple DSL instructions: [DSL1, DSL2, …, DSLN].
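As a hedged illustration of such a multi-step list (all command names and arguments assumed), a request such as "draw a pie chart of the average Age per City" might decompose into two chained instructions, the output name of the first step being reused as the input name of the second:

```python
# Illustrative sketch only: a two-step DSL list, aggregation followed by plotting.
dsl_list = [
    {"input": "df_ori", "output": "df_group_agg_0", "command": "groupby_agg",
     "command_args": {"groupby_keys": ["City"], "agg_column": "Age", "agg_func": "mean"}},
    {"input": "df_group_agg_0", "output": "fig_pie_0", "command": "plot_pie",
     "command_args": {"value_columns": ["Age"]}},
]
```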
Step four: perform data enhancement and data cleaning on the table extraction data set, the business knowledge data set and the query analysis data set respectively, obtaining a preprocessed table extraction data set, a preprocessed business knowledge data set and a preprocessed query analysis data set;
data enhancement means that, starting from the seed questions collected in step two, operations such as synonym replacement, near-duplicate sentence generation and replacement of key conditions (such as the dates and numerical values mentioned in a question) are performed, expanding one piece of original data into several approximate pieces of data.
Data cleaning refers to screening and optimizing the annotated original data through a set of fixed filtering rules, removal of rare fields, and manual screening.
In the embodiment, the ratio of enhanced and cleaned data samples to the data samples corresponding to the original seed questions is about 10:1, i.e., each original seed question (Query) corresponds to 10 table extraction data items and 10 query analysis data items.
Practice shows that the enhanced and cleaned data significantly improve the understanding and generalization ability of the table selection model and the query analysis instruction generation model for scenario-specific questions.
Step five: train a table extraction model with the preprocessed table extraction data set to obtain a trained table extraction model; train a query analysis instruction generation model with the business knowledge data set and the query analysis data set to obtain a trained query analysis instruction generation model. In a specific implementation, the table extraction model and the query analysis instruction generation model are additionally evaluated with pre-reserved test/validation data, and the models are iteratively tuned (by supplementing data, adjusting model parameters, etc.) until the evaluation results meet production requirements.
In step five, the table selection model uses RoBERTa-Chinese-WWM as its base model and is trained with a two-tower structure. Each triplet sample ["query", "right_label", "false_label"] is split into two samples, ["query", "right_label"] and ["query", "false_label"], which are fed into the two base models respectively; the outputs of the two models are pooled (Pooling) and the contrastive loss (Contrastive Loss) is computed.
The contrastive loss (Contrastive Loss) is defined as follows:
L = (1 / 2N) · Σ_{n=1}^{N} [ y · D_W² + (1 − y) · max(m − D_W, 0)² ]
wherein: D_W represents the Euclidean distance between the two elements of a sample pair (["query", "right_label"] or ["query", "false_label"]); y is the label indicating whether the two elements of the pair match, with y = 1 for ["query", "right_label"] samples and y = 0 for ["query", "false_label"] samples; m is a preset margin hyper-parameter threshold; N is the number of samples; max() is the maximum function, and the term max(m − D_W, 0)² means that once the Euclidean distance of a non-matching pair exceeds the threshold m its contribution to the loss is 0, i.e. dissimilar samples whose features are already far apart should incur a low contrastive loss.
In step five, the query analysis instruction generation model uses the large language generation model WizardLM as its base; the query analysis data set and the business knowledge data set are used as the training corpus, and the model is fine-tuned with a full-parameter fine-tuning method.
Full-parameter fine-tuning means that the training process may adjust the parameters of all modules and intermediate layers of the large model. This training method is adopted because the query analysis scenario places high demands on the accuracy of the output instructions. Practical experience shows that, compared with parameter-efficient fine-tuning methods such as low-rank adaptation (LoRA), the fully fine-tuned model achieves the highest instruction generation accuracy.
In the embodiment, the table selection model base adopts RoBERTa-Chinese-ext-large, and the query analysis instruction generation model base adopts the WizardLM 13B base.
The table selection model selects tables by recalling the Top-10 vector cosine similarity between the ["Query"] text and the ["Table schema DDL"] texts; the evaluation process requires the Top-10 recall accuracy to be above 99%.
Recall accuracy is evaluated as follows: the probability that the set of tables corresponding to the Top-10 similarity ranking contains the correct table must be greater than 99%.
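A minimal sketch of how the Top-10 cosine-similarity recall and its accuracy could be computed (the embedding generation is assumed to happen elsewhere; this is not the actual evaluation code):

```python
import numpy as np

def top10_tables(query_vec, table_vecs, table_ids):
    """Return ids of the 10 tables whose DDL embedding is most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    t = table_vecs / np.linalg.norm(table_vecs, axis=1, keepdims=True)
    top_idx = np.argsort(-(t @ q))[:10]
    return [table_ids[i] for i in top_idx]

def top10_recall_accuracy(samples, table_vecs, table_ids):
    """Fraction of samples whose correct table appears in the Top-10 set (required > 99%)."""
    hits = sum(s["right_table"] in top10_tables(s["query_vec"], table_vecs, table_ids)
               for s in samples)
    return hits / len(samples)
```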
The query analysis model performs inference with input of the form ["Table schema DDL + Query"], and its output is the corresponding list of domain-specific language (DSL) instructions [DSL1, DSL2, …, DSLN]. The evaluation process requires that, on the validation set, the result match rate between the instruction set generated by the model and the annotated instruction set exceeds 90%.
The result match rate of the instruction sets is computed as follows:
1. Count the number and proportion of exact matches between the generated instruction set DSL_p and the annotated instruction set DSL_true.
2. For pairs of DSL_p and DSL_true that do not match exactly, input both into the instruction set interpreter/executor to obtain their final analysis results; if the two analysis results are identical, DSL_p and DSL_true are considered to match.
3. Sum the accuracies corresponding to the samples matched in steps 1) and 2); the total must finally exceed 90%.
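A minimal sketch of this match-rate computation (the interpreter/executor is represented by an assumed callable `execute`):

```python
def instruction_set_match_rate(pred_sets, true_sets, execute):
    """Match rate between generated (DSL_p) and annotated (DSL_true) instruction sets.

    pred_sets / true_sets: one DSL instruction list per validation sample.
    execute: assumed callable that interprets and executes an instruction list
             and returns the final analysis result.
    """
    matched = 0
    for dsl_p, dsl_true in zip(pred_sets, true_sets):
        if dsl_p == dsl_true:                          # step 1: exact match
            matched += 1
        elif execute(dsl_p) == execute(dsl_true):      # step 2: identical execution results
            matched += 1
    return matched / len(true_sets)                    # required to exceed 0.90
```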
Step six: as shown in fig. 2, the trained table extraction model and the trained query analysis instruction generation model are deployed in cascade to obtain a multi-table query instruction generation model. The user request text sent from the front-end page of the product is input into the multi-table query instruction generation model; table selection, query analysis instruction generation, and instruction interpretation and execution are performed step by step, and the instruction set generated by the model is interpreted and executed by the executor to obtain the final query and analysis result.
In step six, the output of the trained table extraction model is preprocessed and then, together with the input of the trained table extraction model (namely, the user request text), used as the input of the trained query analysis instruction generation model.
Cascade deployment refers to deploying the table selection model and the query analysis instruction generation model in cascade. When a request sent from the front end is received, the user request text is first input into the table selection model, which outputs the data definition language (DDL) statements of 5-10 tables according to the similarity between the user request (Query) and the DDL of each table. The output of the table extraction model is then preprocessed, combined again with the user request text, and input into the query analysis instruction generation model.
The model cascade deployment is adopted because a typical business data warehouse contains many tables and many table fields, and directly feeding the definition statements of all tables and fields into the query analysis instruction generation model easily exceeds the model's context-length limit. Even if techniques such as RoPE are used to enhance long-text input capability, long-text forgetting easily occurs and greatly affects the accuracy of the generated content. Using the table selection model to remove tables irrelevant to the question in advance, and selecting only 5-10 tables and their relevant fields as input, greatly reduces the context-length load on the query analysis instruction generation model and improves the accuracy of the output instructions.
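A minimal sketch of this cascade flow under the above assumptions (the model and executor interfaces below are placeholders, not the actual deployment API):

```python
def answer_request(user_query, table_selector, instruction_generator, executor, ddl_index, k=10):
    """Cascade: select tables -> generate the DSL instruction list -> interpret and execute."""
    # 1. Table selection: keep only the DDL statements of the k most relevant tables.
    selected_ddls = table_selector(user_query, ddl_index, k)
    # 2. Build the generator input: selected table DDLs plus the user request text.
    prompt = "\n".join(selected_ddls) + "\nQuestion: " + user_query
    # 3. Generate the DSL instruction list [DSL1, DSL2, ..., DSLN].
    dsl_list = instruction_generator(prompt)
    # 4. Interpret and execute the instruction list to obtain the query/analysis result.
    return executor(dsl_list)
```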
In the embodiment, the code of the instruction set interpreter and executor is written in Python, and the intermediate result tables generated by different DSL instructions are saved as CSV files and persisted in a specific storage unit.
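A minimal sketch of how such an interpreter/executor might dispatch DSL commands and persist intermediate tables as CSV files (the command names and the output directory are assumptions; only two commands are shown):

```python
import pandas as pd

def run_dsl_list(dsl_list, source_tables, out_dir="dsl_results"):
    """Execute a DSL instruction list in order, persisting each intermediate table as CSV."""
    env = dict(source_tables)                              # table name -> DataFrame
    for step in dsl_list:
        df = env[step["input"]]
        args = step["command_args"]
        if step["command"] == "groupby_agg":               # assumed command name
            result = (df.groupby(args["groupby_keys"])[args["agg_column"]]
                        .agg(args["agg_func"]).reset_index())
        elif step["command"] == "filter":                  # assumed command name
            result = df.query(args["condition"])
        else:
            raise NotImplementedError(step["command"])
        env[step["output"]] = result
        result.to_csv(f"{out_dir}/{step['output']}.csv", index=False)  # persist intermediate result
    return env
```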
Fig. 3 and fig. 4 are sample diagrams of the query analysis results obtained in this embodiment by sending a request from the front end, executing the components and performing the data analysis.
Finally, it should be noted that the above embodiments and descriptions only illustrate the technical solution of the present invention and do not limit it. It will be understood by those skilled in the art that various modifications and equivalent substitutions may be made to the present invention without departing from the spirit and scope of the present invention as defined in the appended claims.
Claims (7)
1. A multi-table data query and prediction method based on natural language, comprising the steps of:
step one: according to the actual business requirements, confirm the data warehouses and data tables to be connected, then obtain the basic distinguishing information between different data warehouses and between different data tables, and record it as data source information;
step two: collect user questions and requests from the currently connected data warehouse to form seed questions and business term explanations;
step three: construct a table extraction data set, a business knowledge data set and a query analysis data set respectively, based on the currently connected data warehouse, the data tables, the data source information, the seed questions and the business term explanations;
step four: perform data enhancement and data cleaning on the table extraction data set, the business knowledge data set and the query analysis data set respectively, to obtain a preprocessed table extraction data set, a preprocessed business knowledge data set and a preprocessed query analysis data set;
step five: train a table extraction model with the preprocessed table extraction data set to obtain a trained table extraction model; train a query analysis instruction generation model with the business knowledge data set and the query analysis data set to obtain a trained query analysis instruction generation model;
step six: deploy the trained table extraction model and the trained query analysis instruction generation model in cascade to obtain a multi-table query instruction generation model; input the submitted user request text into the multi-table query instruction generation model to obtain the final query and analysis result.
2. The natural-language-based multi-table data query and prediction method of claim 1, wherein the query analysis data set comprises domain-specific language data built from user questions and the corresponding related (or possibly related) table definition statements.
3. The natural-language-based multi-table data query and prediction method of claim 1, wherein the table selection model in step five uses RoBERTa-Chinese-WWM as its base model and adopts a two-tower structure during training.
4. The natural-language-based multi-table data query and prediction method of claim 1, wherein the base model of the query analysis instruction generation model in step five is the large language generation model WizardLM, and the model is fine-tuned with a full-parameter fine-tuning method.
5. The natural-language-based multi-table data query and prediction method of claim 1, wherein in step six the output of the trained table extraction model is preprocessed and then, together with the input of the trained table extraction model, used as the input of the trained query analysis instruction generation model.
6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
Priority Applications (1)
- CN202311619928.3A, filed 2023-11-30 (priority date 2023-11-30): CN117725078B — Multi-table data query and analysis method based on natural language
Publications (2)
- CN117725078A (publication): 2024-03-19
- CN117725078B (grant): 2024-10-18
Family ID: 90204360
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant