CN117725078A - Multi-table data query and prediction method based on natural language - Google Patents
- Publication number: CN117725078A
- Application number: CN202311619928.3A
- Authority: CN (China)
- Prior art keywords: data, model, query, analysis, data set
- Legal status: Granted (the legal status is an assumption and not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a multi-table data query and prediction method based on natural language. Common business questions are collected on the basis of the existing business database and the specific business scenario; a table extraction data set, a business knowledge data set and a query analysis data set are then constructed on this basis; a table extraction model and a query analysis instruction generation model are selected and trained by full-parameter fine-tuning; the models are deployed in the production environment, and the corresponding instruction translation, correction and execution modules are adapted and developed according to the actual functional requirements; finally, data query and analysis requests are sent to the models through a WEB front-end page. Using natural language alone, the invention achieves accurate data extraction and query, visual analysis and data prediction in a complex multi-table, multi-field business data warehouse environment, so that business personnel can perform interactive data query and analysis without mastering structured data extraction and analysis languages such as SQL and Python.
Description
Technical Field
The invention relates to a low-code automatic data query and analysis method in the fields of big data and artificial intelligence, and in particular to a multi-table data query, visualization and prediction method based on natural language.
Background
In the data-driven era, the popularity and importance of data query and predictive analysis in daily work cannot be ignored. In finance, marketing, healthcare, education and many other fields, data analysis has become a core tool for decision support, profoundly shaping the working patterns of professionals and the efficiency of business operations. However, as companies and their businesses grow, data analysis workflows become increasingly complicated; in addition, departments and levels differ in their data requirements, data ownership and proficiency in data analysis languages, so the data query and analysis process inside a company is tedious and inefficient.
To address these problems, a great deal of recent research has focused on the design of low-code business intelligence (BI) applications: for example, Tao Hong et al. studied the use of Power BI software in analyzing and managing nationally procured drug usage data (2022), and Zhu Xiaowei et al. investigated the design and application of enterprise BI software (2023). In addition, with the development of generative large language model (LLM) technology, it has become possible to analyze table data by generating SQL statements from natural language (NL2SQL); on the latest Spider text-to-SQL leaderboard, the Alibaba team (Gao and Wang, 2023) achieved 86.6% accuracy in 3-5 table scenarios with a DAIL-SQL + GPT-4 scheme, using knowledge injection, multiple-generation recall and similar techniques.
However, conventional low-code BI tools still have high learning costs and cannot bridge differences in data definitions, so communication and coordination across departments, levels and personnel remain unavoidable in the data analysis and query process, and the efficiency gain is limited. In addition, when LLM-based SQL generation (NL2SQL) is applied to real business scenarios, the following problems exist: 1) in a multi-table data scenario, too many tables make the input context (context-length) too long, so the model easily forgets content and the accuracy of the generated result suffers; 2) directly generated SQL statements are poorly controllable, and once a generation error occurs the result is hard to correct; 3) NL2SQL is strong at queries thanks to the capabilities of SQL statements, but offers limited support for visualization, predictive analysis and other functions, and can hardly meet actual usage requirements.
Disclosure of Invention
Aiming at the defects in the current field of intelligent data analysis and query, the invention designs a method that, based on pure natural-language dialogue interaction, uses several cascaded models to complete table extraction and merging, and finally realizes query, analysis, prediction and display functions. In addition, to address the poor controllability and incomplete coverage of analysis functions of SQL statements generated by a large language model (LLM), a domain-specific language (Domain-specific Language, DSL) is designed that covers most SQL query requirements and also supports visualization and predictive analysis.
In order to overcome the defects of the prior art, the invention provides the following technical solutions:
1. Multi-table data query and prediction method based on natural language
Step one: according to the actual business requirements, confirm the data warehouses and data tables to be connected, then obtain the basic distinguishing information between different data warehouses and between different data tables, and record it as data source information;
step two: collect user questions and requests from the currently connected data warehouse to form seed questions and business term explanations;
step three: construct a table extraction data set, a business knowledge data set and a query analysis data set respectively, based on the currently connected data warehouse, the data tables, the data source information, the seed questions and the business term explanations;
step four: perform data enhancement and data cleaning on the table extraction data set, the business knowledge data set and the query analysis data set respectively, to obtain a preprocessed table extraction data set, a preprocessed business knowledge data set and a preprocessed query analysis data set;
step five: train a table extraction model with the preprocessed table extraction data set to obtain a trained table extraction model; train a query analysis instruction generation model with the business knowledge data set and the query analysis data set to obtain a trained query analysis instruction generation model;
step six: deploy the trained table extraction model and the trained query analysis instruction generation model in cascade to obtain a multi-table query instruction generation model; input the submitted user request text into the multi-table query instruction generation model to obtain the final query and analysis result.
The query analysis data set comprises domain-specific language data built from user questions and the corresponding related (or possibly related) table definition statements.
In step five, the table selection model uses RoBERTa-Chinese-WWM as its base model and adopts a two-tower structure during training.
In step five, the base model of the query analysis instruction generation model is the large language generation model WizardLM, and the model is fine-tuned with a full-parameter fine-tuning method.
In step six, the output of the trained table extraction model is preprocessed and then, together with the input of the trained table extraction model, used as the input of the trained query analysis instruction generation model.
2. Computer equipment
The computer device comprises a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method when executing the computer program.
3. Computer readable storage medium
A computer-readable storage medium has stored thereon a computer program, characterized in that the computer program when executed by a processor realizes the steps of the method.
The beneficial effects of the invention are as follows:
the method designs a processing framework that takes natural language as its only input and completely removes manual assistance and intervention from the intermediate process: a user can obtain data query and analysis results directly, without any knowledge of common data analysis languages such as Python or SQL. This greatly lowers the technical threshold of business data query and analysis, reduces the communication cost of cross-department coordination and data-definition confirmation, and improves the efficiency of data analysis and query.
The invention adopts a cascade of a table selection model and a query analysis instruction generation model. By selecting only the relevant tables, the input context length of the query analysis instruction generation model is greatly reduced, so that natural-language-driven data analysis techniques (such as NL2SQL) can be better adapted to real multi-table, multi-field production environments, while the accuracy of instruction (code) generation is improved.
To address the facts that SQL cannot perform common data analysis functions such as visual analysis, model training and prediction, and that directly generated SQL has poor controllability and is hard to correct when errors occur, the invention designs a domain-specific language (Domain-specific Language, DSL) covering SQL query, visualization, training and prediction functions. The DSL instruction set is output in a key:value format, can be corrected quickly after generation, and is easy to read and understand; complex multi-step query tasks can be completed by outputting a multi-step instruction list. It should also be emphasized that the content of the DSL instruction set can be flexibly extended and defined for different connected production environments, integrating the existing tools and components of those environments.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from these drawings.
FIG. 1 is the construction flow of a data analysis and query framework using natural language according to an embodiment of the present invention;
FIG. 2 is a diagram showing the intermediate process of sending a query request from the front end and obtaining the final query and analysis result according to an embodiment of the present invention;
FIG. 3 is a sample data query and analysis request issued directly in natural language, after model construction and deployment on the basis of a fund index management database, according to an embodiment of the present invention;
FIG. 4 shows a sample data query and analysis request according to the present invention (the request asks for the average rise and fall of different types of funds in the last week) and the corresponding query and analysis result.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the embodiment of the invention belongs to the fields of big data analysis and artificial intelligence. Taking an intelligent fund query project of a securities company as an example, the construction of the system for natural-language-based data query and analysis proceeds as follows:
step one: according to the actual business requirements, the data warehouse and the data table which are in butt joint are confirmed, then basic distinguishing information among different data warehouses and basic distinguishing information among different data tables are obtained, and the basic distinguishing information is recorded as data source information; the basic distinguishing information between different data warehouses and between different data tables comprises key information such as data sub-warehouses, data table descriptions, data field descriptions, table foreign key constraints and the like;
the data sub-warehouse refers to a data subset corresponding to different service types in the data warehouse. Taking a company operation index data warehouse as an example, the sub-warehouses possibly included in the operation index data warehouse are: a corporate personnel index warehouse, a corporate financial index warehouse, a corporate project investment index warehouse, etc.;
information such as data table description, data field description, table foreign key constraint and the like can be obtained from a data definition language Data Definition Language corresponding to the data warehouse, and is basic information for distinguishing different data warehouses and data tables.
Under an embodiment, it should be noted that the original conditions of the docking service data warehouse, such as tables, field layering and descriptions (comments), foreign key constraint integrity, can have a large impact on the accuracy of model selection tables, query analysis instruction generation. Therefore, the necessary administration work for the data warehouse is required before the framework is built.
Step two: collecting common problems and requests of users from a data warehouse in current butt joint so as to form basic seed problems and common business noun interpretation;
seed problems refer to the problems provided by the user that are most commonly used on a particular business scenario. The seed problem is expanded into a plurality of different problems of semantic approximation in the data enhancement process, and is used for enhancing the generalized understanding capability of the model on specific problems and semantics.
In the embodiment, the scene corresponds to the data warehouse, and a total of 500 tables are involved, the number of fields of each table is different from 5 to 30, and 5-10 seed questions are preferably collected corresponding to each table. These 5-10 seed questions should be guaranteed to cover most of the common fields in the table, and the seed questions involve business and field content should remain largely different, eventually collecting about 3000 seed questions.
Respectively constructing a table extraction data set, a business knowledge data set and a query analysis data set based on the currently-butted data warehouse, the data table, the data source information, the seed problem and the business noun interpretation;
in step three, the table extraction data set refers to a data set used to train the deep learning model to distinguish the mapping relation between different questions and different tables and table fields. Each data item consists of a triplet annotated as "input", "right_label" and "false_label", with the following structure:
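A minimal illustrative sketch of such a triplet in Python form (the question text and both table DDL strings below are assumed examples for illustration, not items taken from the actual data set):

```python
# Illustrative sketch only: the question and both table schemas are assumed examples.
table_extraction_sample = {
    # natural-language user question
    "input": "What is the average net value of equity funds in the last week?",
    # DDL of a table that can answer the question (positive label)
    "right_label": "CREATE TABLE fund_nav (fund_code VARCHAR, fund_type VARCHAR, nav FLOAT, trade_date DATE);",
    # DDL of an unrelated table (negative label)
    "false_label": "CREATE TABLE employee_info (emp_id VARCHAR, dept VARCHAR, hire_date DATE);",
}
```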
the query analysis dataset includes data for a domain-defined language based on the user questions and corresponding related or likely related form definition statements. The data format is as follows:
in the field definition language (Domain-specific Language, DSL), the functions of common keyword combination, visual analysis, training prediction analysis and the like of the SQL are covered in the data query analysis scene. A single domain definition language (DSL) instruction consists of four tuples of "input", "output", "command" and "command_ars". Each individual DSL instruction can be regarded as a line of programming language in the form of: output=command (input, ×command_ars). Wherein command can be regarded as a function_api, command_args is an input parameter of the function_api, and input and output are input and output data of the function_api respectively. Any complex query and analysis task can be disassembled into a combination of a limited number of DSL instructions, and according to the names of input/output in each DSL statement, the execution sequence or connection relation between DSL instructions can be obtained, and finally, the DSL instructions can be translated into a simple program defined by a limited stroke order language.
Due to space limitations, the complete domain-specific language instruction set used in this scenario is not shown in full. For the query function, taking the DSL corresponding to the SQL group-by aggregate query Select Agg() ... Group By key as an example, the corresponding instruction output has the following form:
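A hypothetical rendering of this instruction (the command name and argument keys are assumed; the table and column names "df_ori", "City", "Age" and "df_group_agg_0" come from the meaning explained below):

```python
# Illustrative sketch only: command name and argument keys are assumed.
dsl_groupby_agg = {
    "input": "df_ori",
    "output": "df_group_agg_0",
    "command": "groupby_agg",
    "command_args": {"groupby_keys": ["City"], "agg_column": "Age", "agg_func": "mean"},
}
```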
the specific meaning of the instruction is: from the table with the original name df_ori, the columns of "City" are taken as grouping columns, the columns of "Age" with different "City" types are averaged, and the result is saved as a table of "df_group_agg_0".
For the visualization function, taking pie chart drawing as an example, the corresponding DSL instruction output has the following form:
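A hypothetical rendering of this instruction (command name and argument keys assumed; the table name "df_ori" and the columns "signups", "sales" and "visits" come from the meaning explained below):

```python
# Illustrative sketch only: command name and argument keys are assumed.
dsl_pie_chart = {
    "input": "df_ori",
    "output": "fig_pie_0",
    "command": "plot_pie",
    "command_args": {"value_columns": ["signups", "sales", "visits"]},
}
```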
the actual meaning of the instruction is: from the data table originally named "df_ori", values of three columns of "signups", "sales", "visits" are extracted to draw a pie chart.
For the training and prediction analysis function, taking model training as an example, the corresponding DSL instruction has the following form:
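A hypothetical rendering of this instruction (command name and argument keys assumed; the table name "df_ori", the target column "measured" and the classification task come from the meaning explained below):

```python
# Illustrative sketch only: command name and argument keys are assumed.
dsl_train_model = {
    "input": "df_ori",
    "output": "model_cls_0",
    "command": "train_model",
    "command_args": {"task": "classification", "target_column": "measured"},
}
```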
the actual meaning of the instruction is: based on the original data table df_ori, a classification (classification) model is trained by taking the "measured" column as a prediction target column.
The above DSL examples each involve only a single-step query. For more complex multi-step query analysis, the query analysis instruction generation model can generate a list of multiple DSL instructions: [DSL1, DSL2, …, DSLN].
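As a hedged illustration of such a multi-step list (all command names and arguments assumed), a request such as "draw a pie chart of the average Age per City" might decompose into two chained instructions, the output name of the first step being reused as the input name of the second:

```python
# Illustrative sketch only: a two-step DSL list, aggregation followed by plotting.
dsl_list = [
    {"input": "df_ori", "output": "df_group_agg_0", "command": "groupby_agg",
     "command_args": {"groupby_keys": ["City"], "agg_column": "Age", "agg_func": "mean"}},
    {"input": "df_group_agg_0", "output": "fig_pie_0", "command": "plot_pie",
     "command_args": {"value_columns": ["Age"]}},
]
```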
Step four: perform data enhancement and data cleaning on the table extraction data set, the business knowledge data set and the query analysis data set respectively, obtaining a preprocessed table extraction data set, a preprocessed business knowledge data set and a preprocessed query analysis data set;
data enhancement means that, starting from the seed questions collected in step two, operations such as synonym replacement, near-duplicate sentence generation and replacement of key conditions (such as the dates and numerical values mentioned in a question) are performed, expanding one piece of original data into several approximate pieces of data.
Data cleaning refers to screening and optimizing the annotated original data through a set of fixed filtering rules, removal of rare fields, and manual screening.
In the embodiment, the ratio of enhanced and cleaned data samples to the data samples corresponding to the original seed questions is about 10:1, i.e., each original seed question (Query) corresponds to 10 table extraction data items and 10 query analysis data items.
Practice shows that the enhanced and cleaned data significantly improve the understanding and generalization ability of the table selection model and the query analysis instruction generation model for scenario-specific questions.
Step five: train a table extraction model with the preprocessed table extraction data set to obtain a trained table extraction model; train a query analysis instruction generation model with the business knowledge data set and the query analysis data set to obtain a trained query analysis instruction generation model. In a specific implementation, the table extraction model and the query analysis instruction generation model are additionally evaluated with pre-reserved test/validation data, and the models are iteratively tuned (by supplementing data, adjusting model parameters, etc.) until the evaluation results meet production requirements.
In step five, the table selection model uses RoBERTa-Chinese-WWM as its base model and is trained with a two-tower structure. Each triplet sample ["query", "right_label", "false_label"] is split into two samples, ["query", "right_label"] and ["query", "false_label"], which are fed into the two base models respectively; the outputs of the two models are pooled (Pooling) and the contrastive loss (Contrastive Loss) is computed.
The contrastive loss (Contrastive Loss) is defined as follows:
L = (1 / 2N) · Σ_{n=1}^{N} [ y · D_W² + (1 − y) · max(m − D_W, 0)² ]
wherein: D_W represents the Euclidean distance between the two elements of a sample pair (["query", "right_label"] or ["query", "false_label"]); y is the label indicating whether the two elements of the pair match, with y = 1 for ["query", "right_label"] samples and y = 0 for ["query", "false_label"] samples; m is a preset margin hyper-parameter threshold; N is the number of samples; max() is the maximum function, and the term max(m − D_W, 0)² means that once the Euclidean distance of a non-matching pair exceeds the threshold m its contribution to the loss is 0, i.e. dissimilar samples whose features are already far apart should incur a low contrastive loss.
In step five, the query analysis instruction generation model uses the large language generation model WizardLM as its base; the query analysis data set and the business knowledge data set are used as the training corpus, and the model is fine-tuned with a full-parameter fine-tuning method.
Full-parameter fine-tuning means that the training process may adjust the parameters of all modules and intermediate layers of the large model. This training method is adopted because the query analysis scenario places high demands on the accuracy of the output instructions. Practical experience shows that, compared with parameter-efficient fine-tuning methods such as low-rank adaptation (LoRA), the fully fine-tuned model achieves the highest instruction generation accuracy.
In the embodiment, the table selection model base adopts RoBERTa-Chinese-ext-large, and the query analysis instruction generation model base adopts the WizardLM 13B base.
The table selection model selects tables by recalling the Top-10 vector cosine similarity between the ["Query"] text and the ["Table schema DDL"] texts; the evaluation process requires the Top-10 recall accuracy to be above 99%.
Recall accuracy is evaluated as follows: the probability that the set of tables corresponding to the Top-10 similarity ranking contains the correct table must be greater than 99%.
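A minimal sketch of how the Top-10 cosine-similarity recall and its accuracy could be computed (the embedding generation is assumed to happen elsewhere; this is not the actual evaluation code):

```python
import numpy as np

def top10_tables(query_vec, table_vecs, table_ids):
    """Return ids of the 10 tables whose DDL embedding is most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    t = table_vecs / np.linalg.norm(table_vecs, axis=1, keepdims=True)
    top_idx = np.argsort(-(t @ q))[:10]
    return [table_ids[i] for i in top_idx]

def top10_recall_accuracy(samples, table_vecs, table_ids):
    """Fraction of samples whose correct table appears in the Top-10 set (required > 99%)."""
    hits = sum(s["right_table"] in top10_tables(s["query_vec"], table_vecs, table_ids)
               for s in samples)
    return hits / len(samples)
```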
The query analysis model performs inference with input of the form ["Table schema DDL + Query"], and its output is the corresponding list of domain-specific language (DSL) instructions [DSL1, DSL2, …, DSLN]. The evaluation process requires that, on the validation set, the result match rate between the instruction set generated by the model and the annotated instruction set exceeds 90%.
The result match rate of the instruction sets is computed as follows:
1. Count the number and proportion of exact matches between the generated instruction set DSL_p and the annotated instruction set DSL_true.
2. For pairs of DSL_p and DSL_true that do not match exactly, input both into the instruction set interpreter/executor to obtain their final analysis results; if the two analysis results are identical, DSL_p and DSL_true are considered to match.
3. Sum the accuracies corresponding to the samples matched in steps 1) and 2); the total must finally exceed 90%.
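A minimal sketch of this match-rate computation (the interpreter/executor is represented by an assumed callable `execute`):

```python
def instruction_set_match_rate(pred_sets, true_sets, execute):
    """Match rate between generated (DSL_p) and annotated (DSL_true) instruction sets.

    pred_sets / true_sets: one DSL instruction list per validation sample.
    execute: assumed callable that interprets and executes an instruction list
             and returns the final analysis result.
    """
    matched = 0
    for dsl_p, dsl_true in zip(pred_sets, true_sets):
        if dsl_p == dsl_true:                          # step 1: exact match
            matched += 1
        elif execute(dsl_p) == execute(dsl_true):      # step 2: identical execution results
            matched += 1
    return matched / len(true_sets)                    # required to exceed 0.90
```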
Step six: as shown in fig. 2, the trained table extraction model and the trained query analysis instruction generation model are deployed in cascade to obtain a multi-table query instruction generation model. The user request text sent from the front-end page of the product is input into the multi-table query instruction generation model; table selection, query analysis instruction generation, and instruction interpretation and execution are performed step by step, and the instruction set generated by the model is interpreted and executed by the executor to obtain the final query and analysis result.
In step six, the output of the trained table extraction model is preprocessed and then, together with the input of the trained table extraction model (namely, the user request text), used as the input of the trained query analysis instruction generation model.
Cascade deployment refers to deploying the table selection model and the query analysis instruction generation model in cascade. When a request sent from the front end is received, the user request text is first input into the table selection model, which outputs the data definition language (DDL) statements of 5-10 tables according to the similarity between the user request (Query) and the DDL of each table. The output of the table extraction model is then preprocessed, combined again with the user request text, and input into the query analysis instruction generation model.
The model cascade deployment is adopted because a typical business data warehouse contains many tables and many table fields, and directly feeding the definition statements of all tables and fields into the query analysis instruction generation model easily exceeds the model's context-length limit. Even if techniques such as RoPE are used to enhance long-text input capability, long-text forgetting easily occurs and greatly affects the accuracy of the generated content. Using the table selection model to remove tables irrelevant to the question in advance, and selecting only 5-10 tables and their relevant fields as input, greatly reduces the context-length load on the query analysis instruction generation model and improves the accuracy of the output instructions.
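A minimal sketch of this cascade flow under the above assumptions (the model and executor interfaces below are placeholders, not the actual deployment API):

```python
def answer_request(user_query, table_selector, instruction_generator, executor, ddl_index, k=10):
    """Cascade: select tables -> generate the DSL instruction list -> interpret and execute."""
    # 1. Table selection: keep only the DDL statements of the k most relevant tables.
    selected_ddls = table_selector(user_query, ddl_index, k)
    # 2. Build the generator input: selected table DDLs plus the user request text.
    prompt = "\n".join(selected_ddls) + "\nQuestion: " + user_query
    # 3. Generate the DSL instruction list [DSL1, DSL2, ..., DSLN].
    dsl_list = instruction_generator(prompt)
    # 4. Interpret and execute the instruction list to obtain the query/analysis result.
    return executor(dsl_list)
```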
In the embodiment, the code of the instruction set interpreter and executor is written in Python, and the intermediate result tables generated by different DSL instructions are saved as CSV files and persisted in a specific storage unit.
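A minimal sketch of how such an interpreter/executor might dispatch DSL commands and persist intermediate tables as CSV files (the command names and the output directory are assumptions; only two commands are shown):

```python
import pandas as pd

def run_dsl_list(dsl_list, source_tables, out_dir="dsl_results"):
    """Execute a DSL instruction list in order, persisting each intermediate table as CSV."""
    env = dict(source_tables)                              # table name -> DataFrame
    for step in dsl_list:
        df = env[step["input"]]
        args = step["command_args"]
        if step["command"] == "groupby_agg":               # assumed command name
            result = (df.groupby(args["groupby_keys"])[args["agg_column"]]
                        .agg(args["agg_func"]).reset_index())
        elif step["command"] == "filter":                  # assumed command name
            result = df.query(args["condition"])
        else:
            raise NotImplementedError(step["command"])
        env[step["output"]] = result
        result.to_csv(f"{out_dir}/{step['output']}.csv", index=False)  # persist intermediate result
    return env
```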
Fig. 3 and fig. 4 are sample diagrams of the query analysis results obtained in this embodiment by sending a request from the front end, executing the components and performing the data analysis.
Finally, it should be noted that the above embodiments and descriptions only illustrate the technical solution of the present invention and do not limit it. It will be understood by those skilled in the art that various modifications and equivalent substitutions may be made to the present invention without departing from the spirit and scope of the present invention as defined in the appended claims.
Claims (7)
1. A multi-table data query and prediction method based on natural language, comprising the steps of:
step one: according to the actual business requirements, confirm the data warehouses and data tables to be connected, then obtain the basic distinguishing information between different data warehouses and between different data tables, and record it as data source information;
step two: collect user questions and requests from the currently connected data warehouse to form seed questions and business term explanations;
step three: construct a table extraction data set, a business knowledge data set and a query analysis data set respectively, based on the currently connected data warehouse, the data tables, the data source information, the seed questions and the business term explanations;
step four: perform data enhancement and data cleaning on the table extraction data set, the business knowledge data set and the query analysis data set respectively, to obtain a preprocessed table extraction data set, a preprocessed business knowledge data set and a preprocessed query analysis data set;
step five: train a table extraction model with the preprocessed table extraction data set to obtain a trained table extraction model; train a query analysis instruction generation model with the business knowledge data set and the query analysis data set to obtain a trained query analysis instruction generation model;
step six: deploy the trained table extraction model and the trained query analysis instruction generation model in cascade to obtain a multi-table query instruction generation model; input the submitted user request text into the multi-table query instruction generation model to obtain the final query and analysis result.
2. The natural-language-based multi-table data query and prediction method of claim 1, wherein the query analysis data set comprises domain-specific language data built from user questions and the corresponding related (or possibly related) table definition statements.
3. The natural-language-based multi-table data query and prediction method of claim 1, wherein the table selection model in step five uses RoBERTa-Chinese-WWM as its base model and adopts a two-tower structure during training.
4. The natural-language-based multi-table data query and prediction method of claim 1, wherein the base model of the query analysis instruction generation model in step five is the large language generation model WizardLM, and the model is fine-tuned with a full-parameter fine-tuning method.
5. The natural-language-based multi-table data query and prediction method of claim 1, wherein in step six the output of the trained table extraction model is preprocessed and then, together with the input of the trained table extraction model, used as the input of the trained query analysis instruction generation model.
6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
Priority Applications (1)
- CN202311619928.3A, filed 2023-11-30 (priority date 2023-11-30): CN117725078B — Multi-table data query and analysis method based on natural language
Publications (2)
- CN117725078A (publication): 2024-03-19
- CN117725078B (grant): 2024-10-18
Family ID: 90204360
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant