
CN118312524B - Table recall method, apparatus, electronic device and medium - Google Patents


Info

Publication number
CN118312524B
Authority
CN
China
Prior art keywords
target
tables
fields
vector
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410733522.6A
Other languages
Chinese (zh)
Other versions
CN118312524A (en
Inventor
陈坤鹏
彭鑫
黄飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanxin Technology Group Co ltd
Original Assignee
Lanxin Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanxin Technology Group Co ltd
Priority to CN202410733522.6A
Publication of CN118312524A
Application granted
Publication of CN118312524B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/248Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a table recall method, a table recall apparatus, an electronic device and a medium, and relates to the field of artificial intelligence. The method comprises: constructing a first vector library for all tables of a database according to the table headers of the tables; constructing a second vector library for all fields of each table according to the fields of that table; querying, from the first vector library, a plurality of target tables similar to a statement to be queried; querying, from the second vector library corresponding to each target table, at least one target field similar to the statement to be queried; performing field filtering on each target table using its target fields to obtain a filtered table corresponding to each target table; and sorting the plurality of target tables using a large model and the plurality of filtered tables to obtain a table recall result. The scheme can significantly improve table recall precision and solves the problems caused by an overlong context when a large model is called to recall tables.

Description

Table recall method, apparatus, electronic device and medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a table recall method, a table recall device, electronic equipment and a medium.
Background
Intelligent data query aims to help users search data conveniently and quickly, and basic intelligent data query requirements can generally be met with the Text-to-SQL technical route from the AI field. Text-to-SQL usually relies on the capability of a large model: the user's natural-language question is converted into an SQL query statement, the query is executed automatically in the background, and the result is returned, so that the user can query large databases conveniently and quickly without writing complex, highly specialized SQL statements. A Text-to-SQL solution generally comprises the stages shown in fig. 1. Table recall is an important stage: in a practical application scenario, a large number of database tables may exist in the background, and inputting all table information into the large model often exceeds the context length limit and greatly degrades the generation quality. The few table headers most relevant to the query question therefore need to be recalled by technical means, so that the large model can generate an accurate SQL query statement from a small amount of table header information.
Currently, there are two main approaches to table recall. One approach passes all table information to the large model at once; when the database contains many tables, the context becomes too long, which degrades the generation quality of the large model and the accuracy of the SQL, and an overlong context may also exceed hardware resource limits so that the model cannot run. The other approach converts the table headers into vectors and retrieves by vector similarity; however, the accuracy of the tables returned by retrieval is not very high, and because table recall is an upstream stage of the Text-to-SQL pipeline, insufficient accuracy strongly affects the subsequent stages and reduces the accuracy of the finally generated SQL.
Disclosure of Invention
The invention provides a table recall method, an apparatus, an electronic device and a medium to overcome the low precision of existing table recall. It implements a coarse-to-fine two-stage recall strategy combined with table field filtering to achieve high-precision table recall, and at the same time solves the problem of overlong context caused by many tables or wide tables.
According to a first aspect of the present invention there is provided a table recall method comprising:
constructing a first vector library for all tables of a database according to the table headers of the tables, wherein the first vector library takes the whole database as an object, the data of each table header is converted into a corresponding vector, and the vectors corresponding to all the tables form the first vector library;
constructing a second vector library for all fields of each table according to the fields of the table, wherein the second vector library takes each table as an object, each field data of the table is converted into a corresponding vector, the vectors corresponding to all the fields form the second vector library, and each second vector library corresponds to one vector in the first vector library;
querying a plurality of target tables similar to a statement to be queried from the first vector library;
querying at least one target field similar to the statement to be queried from the second vector library corresponding to each target table;
performing field filtering on the corresponding target table by using the target field to obtain a filtered table corresponding to each target table;
and sorting the plurality of target tables by using a large model and the plurality of filtered tables to obtain a table recall result.
In some possible implementations, the constructing a first vector library for all tables of the database according to the table headers of the tables includes:
acquiring the table header content and the table unique identifier of each table in the database;
constructing each table header content into a first text string;
extracting features from each first text string by using an embedding model to obtain a feature vector corresponding to each table, and associating the feature vector with the table unique identifier of the corresponding table;
and aggregating the feature vectors corresponding to all the tables to obtain the first vector library.
In some possible implementations, the constructing a second vector library for all fields of each table according to the fields of the table includes:
performing the following operations on each table in the database:
acquiring the field content of each field in the table and the table unique identifier of the table to which the field belongs;
constructing each field content into a second text string;
extracting features from each second text string by using an embedding model to obtain a feature vector corresponding to each field, and associating each feature vector with the table unique identifier of the table to which the field belongs;
and, for each table, aggregating the feature vectors corresponding to all its fields to obtain the second vector library corresponding to that table.
In some possible implementations, the querying, from the first vector library, a plurality of target tables similar to the statement to be queried includes:
acquiring a query statement input to an intelligent data query service to obtain the statement to be queried;
extracting features from the statement to be queried by using an embedding model to obtain a vector to be queried;
calculating a first similarity between the vector to be queried and each vector in the first vector library;
and taking all the tables whose first similarity is smaller than a first preset value as target tables, or sorting all the tables from small to large based on the first similarity and taking the first preset number of top-ranked tables as target tables.
In some possible implementations, the querying at least one target field similar to the statement to be queried from the second vector library corresponding to each target table includes:
performing the following operations on the second vector library corresponding to each target table:
calculating a second similarity between the vector to be queried and each vector in the second vector library;
and taking all the fields whose second similarity is smaller than a second preset value as target fields, or sorting all the fields from small to large based on the second similarity and taking the second preset number of top-ranked fields as target fields.
In some possible implementations, the performing field filtering on the corresponding target table using the target field to obtain a filtered table corresponding to each target table includes:
acquiring all target fields corresponding to each target table;
deleting, from each target table, all fields other than its corresponding target fields to obtain the filtered table corresponding to each target table.
In some possible implementations, the sorting the plurality of target tables using the large model and the plurality of filtered tables to obtain the table recall result includes:
combining each filtered table with the statement to be queried;
analyzing each combination by using the large model to obtain the degree of correlation of each combination;
sorting the plurality of target tables from large to small based on the degree of correlation;
outputting the plurality of target tables in the sorted order, or outputting the top third preset number of target tables.
According to a second aspect of the present invention, there is also provided a table recall device for implementing a table recall method as described in any one of the above, the device comprising:
the first construction module is used for constructing a first vector library for all tables of the database according to the table headers of the tables, wherein the first vector library takes the whole database as an object, the data of each table header is converted into a corresponding vector, and the vectors corresponding to all the tables form the first vector library;
the second construction module is used for constructing a second vector library for all fields of each table according to the fields of the table, wherein the second vector library takes each table as an object, each field data of the table is converted into a corresponding vector, the vectors corresponding to all the fields form the second vector library, and each second vector library corresponds to one vector in the first vector library;
the first query module is used for querying a plurality of target tables similar to the statement to be queried from the first vector library;
the second query module is used for querying, from the second vector library corresponding to each target table, at least one target field similar to the statement to be queried;
the filtering module is used for performing field filtering on the corresponding target table by using the target field to obtain a filtered table corresponding to each target table;
and the sorting module is used for sorting the plurality of target tables by using the large model and the plurality of filtered tables to obtain a table recall result.
According to a third aspect of the present invention there is also provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a table recall method as described in any one of the above when executing the program.
According to a fourth aspect of the present invention there is also provided a non-transitory computer readable storage medium having stored thereon a computer program which when executed by a processor implements a table recall method as described in any of the above.
In the table recall method provided by the invention, a first vector library and a second vector library corresponding to each table are constructed in advance from the table headers and the fields of the tables, respectively. A coarse-grained query at the table header level is first performed through the first vector library to obtain a plurality of target tables similar to the statement to be queried; a fine-grained query at the field level is then performed on each target table through its corresponding second vector library to obtain target fields; and each target table is filtered using its target fields, thereby eliminating irrelevant data. Finally, the target tables are sorted using a large model and the filtered tables. Because the recalled tables are determined by vector similarity retrieval under a two-stage strategy and are then reranked using the large model and the filtered tables, the table recall precision is significantly improved, and the problems caused by an overlong context when the large model is called to recall tables are also solved.
In addition, the table recall device, the electronic device and the non-transitory computer readable storage medium provided by the invention can also realize the technical effects, and are not repeated here.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of implementing intelligent data query using Text-to-SQL technology.
FIG. 2 is a schematic flow chart of a table recall method according to the present invention.
FIG. 3 is a second flowchart of the table recall method according to the present invention.
FIG. 4 is a schematic diagram of a database vector query provided by the present invention.
FIG. 5 is a schematic diagram of a table recall device according to the present invention.
Fig. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A table recall method, a table recall apparatus, an electronic device, and a non-transitory computer-readable storage medium of the present invention are described below with reference to fig. 2 to 6.
Fig. 2 is a schematic flow chart of a table recall method according to an embodiment of the present invention, and referring to fig. 2, the present embodiment provides a table recall method, which can be implemented through steps S201 to S206, and the following details are provided with reference to each step:
Step S201, a first vector library is constructed for all tables of the database according to the table headers of the tables.
In this embodiment, the first vector library is constructed from the table header data of each table in the database. Specifically, the whole database is taken as an object, the data of each table header is converted into a corresponding vector, and the vectors corresponding to all the tables form the first vector library. Table header data refers to data describing the data structure of a table; exemplary table header data is "table name: electric quantity index statistical table; table description: records the various electric quantity indexes for the electronic large screen; fields include index value identification, index code, index value, year-over-year ratio and period-over-period ratio; the primary key is index value identification". The data of each table header can be regarded as text data or a character string, and the ways of converting it into a vector include, but are not limited to: text vectorization with word embedding (Word Embedding), text vectorization with one-hot encoding, text vectorization based on a bag-of-words model, and text vectorization based on a term frequency-inverse document frequency (TF-IDF) model.
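As an illustration of the bag-of-words option listed above, the following is a minimal sketch in plain Python. The helper names (`build_header_text`, `tokenize`, `bag_of_words`), the layout of the header string and the second example table are assumptions made for this example; the patent does not prescribe them.

```python
from collections import Counter

def build_header_text(name, description, fields, primary_key):
    """Flatten header metadata into one text string (this layout is an assumption)."""
    return (
        f"table name: {name}; table description: {description}; "
        f"fields: {', '.join(fields)}; primary key: {primary_key}"
    )

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()

def bag_of_words(texts):
    """Return (vocabulary, vectors) with one term-count vector per input text."""
    vocabulary = sorted({tok for text in texts for tok in tokenize(text)})
    vectors = []
    for text in texts:
        counts = Counter(tokenize(text))
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

headers = [
    build_header_text(
        "electric quantity index statistics",
        "indexes recording the electric quantity shown on the large screen",
        ["index value id", "index code", "index value"],
        "index value id",
    ),
    build_header_text(
        "customer orders",
        "one row per customer order",
        ["order id", "customer id", "amount"],
        "order id",
    ),
]
vocabulary, header_vectors = bag_of_words(headers)
```

Each table header becomes one vector whose length equals the vocabulary size, so the vectors of all tables can be compared with one another and with a query vectorized over the same vocabulary.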
Step S202, a second vector library is constructed for all fields of each table according to the fields of the table.
In this embodiment, the second vector library is constructed from the data of all fields in each table. Specifically, each table is taken as an object, each field data of the table is converted into a corresponding vector, and the vectors corresponding to all the fields form the second vector library. The second vector libraries correspond one-to-one to the records in the first vector library, that is, each second vector library corresponds to one vector in the first vector library, and this correspondence allows the two to be mapped to each other. Field data refers to data describing a field; exemplary field data is "field name: year-over-year ratio; field explanation: the year-over-year ratio refers to the growth rate of the current year's power usage relative to the same period of the previous year". The data of each field can be regarded as text data or a character string, and it can be converted into a vector by, without limitation, text vectorization with word embedding (Word Embedding), text vectorization with one-hot encoding, text vectorization based on a bag-of-words model, or text vectorization based on a term frequency-inverse document frequency (TF-IDF) model. In addition, all vectors in the first vector library and the second vector libraries can be obtained with the same conversion method, or several conversion methods can be mixed when constructing any vector library.
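The TF-IDF option mentioned above can be sketched as follows for the field texts of a single table. This is only an illustration: the function name `tf_idf`, the whitespace tokenization, the plain `log(N / df)` weighting and the two sample field texts are assumptions, not details taken from the patent.

```python
import math
from collections import Counter

def tf_idf(texts):
    """Return (vocabulary, vectors) using tf = count / length and idf = log(N / df)."""
    docs = [text.lower().split() for text in texts]
    vocabulary = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many texts each term occurs at least once.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term]) if doc else 0.0
            for term in vocabulary
        ])
    return vocabulary, vectors

# Hypothetical field texts for one table.
field_texts = [
    "field name yoy growth rate of power usage",
    "field name index code of the index",
]
vocabulary, field_vectors = tf_idf(field_texts)
```

Terms shared by every field text (such as "field" and "name" here) receive a weight of zero, so the resulting vectors emphasize the words that distinguish one field from another.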
Step S203, query a plurality of target tables similar to the sentence to be queried from the first vector library.
In a specific implementation, the statement to be queried is the statement for which data is to be retrieved; it can be a question or a conversational statement input by a user or acquired from another application program, for example "what happened on this day in history" or "which planet is nearest to the earth". A target table is a table whose vector, found in the first vector library, is similar to the statement to be queried. Whether a vector in the first vector library is similar to the statement to be queried can be determined with any existing measure of similarity between the two, for example keywords, distance, or a similarity score. It should be noted that a plurality of target tables should be found from the first vector library, and that tables whose vectors have relatively high similarity to the statement to be queried are preferentially selected as target tables. It should also be noted that the number of target tables may or may not be fixed, as long as a plurality of tables is found.
Step S204, at least one target field similar to the sentence to be queried is queried from the second vector library corresponding to each target table.
In this embodiment, a target field is a field whose vector, found in a second vector library, is similar to the statement to be queried. It should be noted that the target-field queries for different tables are independent of one another. The target fields are queried on the same principle as the target tables, so the lookup of target fields can be implemented by referring to the above way of querying target tables. Unlike the target-table lookup, which should return a plurality of tables, the target-field lookup only needs to return at least one field.
Step S205, performing field filtering on the corresponding target table by using the target field to obtain a filtering table corresponding to each target table.
In this embodiment, since the target fields correspond to the respective target tables, each target table should be filtered using its own target fields, and the field filtering of different target tables does not affect one another.
Step S206, sorting the multiple target tables by using the large model and the multiple filtering tables to obtain a table recall result.
According to the table recall method of this embodiment, a first vector library and a second vector library corresponding to each table are constructed in advance from the table headers and the fields of the tables, respectively. A coarse-grained query at the table header level is first performed through the first vector library to obtain a plurality of target tables similar to the statement to be queried; a fine-grained query at the field level is then performed on each target table through its corresponding second vector library to obtain target fields; and each target table is filtered using its target fields, so that irrelevant data is removed. Finally, the target tables are sorted using a large model and the filtered tables. Because the recalled tables are determined by vector similarity retrieval under a two-stage strategy and are then reranked using the large model and the filtered tables, recall precision is significantly improved, and the problems caused by an overlong context when the large model is called to recall tables are also solved.
In some possible implementations, the aforementioned step S201 constructs a first vector library for all tables of the database according to the header of the table, and specifically includes:
Acquiring the table head content and the table unique identifier of each table in the database;
Constructing each header content into a first text string;
extracting features from each first text string by using an embedding model to obtain a feature vector corresponding to each table, and associating the feature vector with the table unique identifier of the corresponding table;
and aggregating the feature vectors corresponding to all the tables to obtain the first vector library.
According to the table recall method of this embodiment, the embedding model is used to convert the table header information of each table into a vector, and the table unique identifier is used to distinguish the vectors corresponding to different tables. Because the first vector library takes the table header of each table as an object, different tables can be effectively distinguished, and the constructed first vector library can be used for a coarse search of the tables related to the statement to be queried, which improves query efficiency.
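The construction of the first vector library can be sketched as a mapping from table unique identifiers to header vectors. In the sketch below, `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, and the table identifiers and header strings are hypothetical; an actual deployment would call whatever embedding model is in use.

```python
import hashlib
import math

DIM = 16  # toy dimensionality chosen for this example only

def toy_embed(text, dim=DIM):
    """Stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def build_table_library(headers):
    """Map each table's unique identifier to the vector of its header text."""
    return {table_id: toy_embed(text) for table_id, text in headers.items()}

# Hypothetical header strings keyed by table unique identifier.
headers = {
    "t_power_index": "table name: power index statistics; fields: index id, index value",
    "t_orders": "table name: customer orders; fields: order id, amount",
}
table_library = build_table_library(headers)
```

Keying the library by the table unique identifier is what lets a retrieved vector be traced back to its table in the later steps.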
In some possible implementations, the step S202 of constructing a second vector library for all fields of each table according to the fields of the table specifically includes:
the following operations are respectively performed on each table in the database:
acquiring the field content of each field in the table and the table unique identifier of the table to which the field belongs;
Constructing each field content into a second text string;
extracting features from each second text string by using an embedding model to obtain a feature vector corresponding to each field, and associating each feature vector with the table unique identifier of the table to which the field belongs;
and, for each table, aggregating the feature vectors corresponding to all its fields to obtain the second vector library corresponding to that table.
According to the table recall method of this embodiment, for each table, the embedding model is used to convert the table field information into vectors, and the table unique identifier is used to associate these vectors with the table to which they belong. Because the second vector library takes the fields of a table as objects, different fields can be effectively distinguished, and the constructed second vector library can be used for a fine-grained search of the fields related to the statement to be queried, so that irrelevant fields in the table need not be processed.
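A per-table second vector library can be sketched as a nested mapping, one field library per table unique identifier. The embedding function is passed in as a parameter; `vowel_counts` is a deliberately trivial stand-in for an embedding model, and the table and field names are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]

def vowel_counts(text: str) -> Vector:
    """Trivial stand-in for an embedding model: counts of each vowel."""
    lowered = text.lower()
    return [float(lowered.count(v)) for v in "aeiou"]

def build_field_libraries(
    tables: Dict[str, Dict[str, str]],
    embed: Callable[[str], Vector],
) -> Dict[str, Dict[str, Vector]]:
    """Return {table_id: {field_name: vector}}, i.e. one field library per table."""
    libraries = {}
    for table_id, fields in tables.items():
        libraries[table_id] = {
            name: embed(f"field name: {name}; field explanation: {explanation}")
            for name, explanation in fields.items()
        }
    return libraries

# Hypothetical tables: field name -> field explanation.
tables = {
    "t_power_index": {
        "yoy": "year-over-year growth rate of power usage",
        "index_value": "value of the index",
    },
    "t_orders": {"amount": "order amount"},
}
field_libraries = build_field_libraries(tables, vowel_counts)
```

Because the outer key is the same table unique identifier used in the first vector library, each field library maps back to exactly one table vector.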
In some possible implementations, the step S203 queries a plurality of target tables similar to the sentence to be queried from the first vector library, and specifically includes:
acquiring a query statement input to an intelligent data query service to obtain the statement to be queried;
extracting features from the statement to be queried by using an embedding model to obtain a vector to be queried;
calculating a first similarity between the vector to be queried and each vector in the first vector library;
and taking all the tables whose first similarity is smaller than a first preset value as target tables, or sorting all the tables from small to large based on the first similarity and taking the first preset number of top-ranked tables as target tables.
According to the table recall method of this embodiment, the created first vector library is used to perform a coarse query in the database for the query statement input to the intelligent data query service; the query is fast, and the way of determining the target tables can be set flexibly according to the required query precision, actual requirements and the like.
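The two selection rules above (a preset threshold or a preset number of top-ranked tables) can be sketched as follows. The sketch assumes that the "first similarity" is measured as a cosine distance, so that smaller values mean closer matches, which is one reading consistent with the "smaller than" and "from small to large" wording. The function names and the three toy table vectors are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def select_target_tables(query_vec, table_library, top_n=None, max_distance=None):
    """Rank tables by ascending distance, then apply a threshold and/or a top-N cut."""
    scored = sorted(
        (cosine_distance(query_vec, vec), table_id)
        for table_id, vec in table_library.items()
    )
    if max_distance is not None:
        scored = [(d, t) for d, t in scored if d < max_distance]
    if top_n is not None:
        scored = scored[:top_n]
    return [table_id for _, table_id in scored]

# Toy first vector library and query vector (hypothetical values).
table_library = {
    "t_power_index": [1.0, 0.0, 0.0],
    "t_orders": [0.0, 1.0, 0.0],
    "t_users": [0.7, 0.7, 0.0],
}
query_vec = [0.9, 0.1, 0.0]

by_count = select_target_tables(query_vec, table_library, top_n=2)
by_threshold = select_target_tables(query_vec, table_library, max_distance=0.1)
```

With these toy values, the top-2 rule keeps `t_power_index` and `t_users`, while the 0.1 threshold keeps only `t_power_index`; since the method expects a plurality of target tables, a threshold this tight would be too strict in practice.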
In some possible implementations, the step S204 of querying, from the second vector library corresponding to each target table, at least one target field similar to the statement to be queried specifically includes:
performing the following operations on the second vector library corresponding to each target table:
calculating a second similarity between the vector to be queried and each vector in the second vector library;
and taking all the fields whose second similarity is smaller than a second preset value as target fields, or sorting all the fields from small to large based on the second similarity and taking the second preset number of top-ranked fields as target fields.
According to the table recall method of this embodiment, for each target table, the created second vector library is used to perform a fine query within that table for the query statement input to the intelligent data query service; the query precision is high, and the way of determining the target fields can be set flexibly according to the required query precision, actual requirements and the like.
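The per-table field lookup can be sketched in the same spirit, here using the threshold rule. The sketch assumes the "second similarity" is a Euclidean distance (smaller means closer), and it falls back to the single closest field when no field passes the threshold so that each target table keeps at least one target field; that fallback is an assumption of this example, as are the function name and the toy data.

```python
import math

def select_target_fields(query_vec, field_libraries, target_tables, max_distance):
    """For each target table, keep the fields whose distance to the query is below the threshold."""
    result = {}
    for table_id in target_tables:
        ranked = sorted(
            (math.dist(query_vec, vec), name)
            for name, vec in field_libraries[table_id].items()
        )
        kept = [name for dist, name in ranked if dist < max_distance]
        # Assumption: fall back to the closest field so each table keeps at least one.
        result[table_id] = kept or [name for _, name in ranked[:1]]
    return result

# Toy second vector libraries keyed by table identifier (hypothetical values).
field_libraries = {
    "t_power_index": {"yoy": [1.0, 0.0], "index_code": [0.0, 1.0]},
    "t_orders": {"amount": [0.0, 1.0], "order_id": [-1.0, 0.0]},
}
target_fields = select_target_fields(
    [0.9, 0.1], field_libraries, ["t_power_index", "t_orders"], max_distance=0.5
)
```

Each table is processed independently against its own field library, which mirrors the statement that the target-field queries for different tables do not affect one another.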
In some possible implementations, the step S205, performing field filtering on the corresponding target table by using the target field to obtain a filtered table corresponding to each target table, specifically includes:
Acquiring all target fields corresponding to each target table;
deleting all the fields except all the corresponding target fields in each target table to obtain a filtering table corresponding to each target table.
According to the table recall method of this embodiment, for each target table, the queried target fields are used to filter the fields of that table: most irrelevant table fields are filtered out and only the table fields that may be related to the statement to be queried are returned, so that the number of fields the large model has to process is significantly reduced.
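Field filtering amounts to dropping every field that is not a target field. A minimal sketch, assuming a table schema is represented as a dictionary with a `name` and a `fields` mapping (this representation and the sample schemas are hypothetical):

```python
def filter_table(table_schema, target_fields):
    """Return a copy of the schema that keeps only the target fields."""
    keep = set(target_fields)
    return {
        "name": table_schema["name"],
        "fields": {
            field: explanation
            for field, explanation in table_schema["fields"].items()
            if field in keep
        },
    }

def filter_tables(schemas, target_fields_by_table):
    """Filter each target table with its own target fields, independently of the others."""
    return {
        table_id: filter_table(schemas[table_id], fields)
        for table_id, fields in target_fields_by_table.items()
    }

# Hypothetical schemas and target fields.
schemas = {
    "t_power_index": {
        "name": "power index statistics",
        "fields": {"index_id": "identifier", "yoy": "year-over-year ratio", "index_code": "code"},
    },
    "t_orders": {
        "name": "customer orders",
        "fields": {"order_id": "identifier", "amount": "order amount"},
    },
}
filtered = filter_tables(schemas, {"t_power_index": ["yoy"], "t_orders": ["amount"]})
```

The original schemas are left untouched; only the filtered copies, which are much shorter for wide tables, would be handed to the large model.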
In some possible implementations, the step S206, which uses the large model and the multiple filtering tables to sort the multiple target tables to obtain the table recall result, specifically includes:
combining each filtering table with the statement to be queried;
Respectively analyzing each pair of combinations by using a large model to obtain the correlation degree of each pair of combinations;
Sorting the plurality of target tables from large to small based on the correlation;
outputting a plurality of target tables according to the sorting, or outputting a third preset number of target tables arranged in front. In a specific implementation, the third preset number should be smaller than the first preset number.
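The reranking step can be sketched with the relevance scorer passed in as a function. In the sketch below, `overlap_score` is only a word-overlap stand-in for a large-model call, and `rerank_tables`, the table texts and the query are hypothetical; a real system would prompt the large model with each pair of filtered table and query and parse a relevance score from its answer.

```python
def rerank_tables(query, filtered_tables, score_relevance, top_n=None):
    """Score each (filtered table, query) pair and sort tables by descending relevance."""
    scored = [
        (score_relevance(query, table_text), table_id)
        for table_id, table_text in filtered_tables.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    ranked = [table_id for _, table_id in scored]
    return ranked if top_n is None else ranked[:top_n]

def overlap_score(query, table_text):
    """Stand-in for a large-model call: fraction of query words found in the table text."""
    query_words = set(query.lower().split())
    table_words = set(table_text.lower().split())
    return len(query_words & table_words) / len(query_words) if query_words else 0.0

# Hypothetical filtered tables rendered as text.
filtered_tables = {
    "t_orders": "table customer orders fields amount",
    "t_power_index": "table power index statistics fields yoy index value",
}
query = "power index value this year"

ranking = rerank_tables(query, filtered_tables, overlap_score)
top_one = rerank_tables(query, filtered_tables, overlap_score, top_n=1)
```

Passing `top_n` corresponds to outputting only the third preset number of top-ranked target tables; omitting it outputs all target tables in sorted order.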
In still another embodiment, to facilitate understanding of the solution of the present invention, an intelligent data query service is taken as a specific application scenario below. Referring to fig. 3, this embodiment provides a table recall method applied in that scenario, and the specific implementation process is as follows:
Firstly, constructing a table vector library;
Specifically, the table vector library is constructed as follows. The header information of each table in the database is acquired; it includes the table name, the table definition, the field names, the definitions of the fields, the primary and foreign keys, and the like. The header information is organized into a text string and input to an embedding model to extract features, so that each table yields one high-dimensional feature vector (512 or 768 dimensions). The feature vectors of all tables are organized into a vector library, for which the Chroma vector database or a similar tool can be used. Each record in the vector library contains a table vector, table meta information (table name, table description, and the like), a table index, and the like.
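The construction described above might look like the sketch below. The embedding model is replaced by a hypothetical `toy_embed` (a hashed bag of words), the dimension is reduced to 64 for readability, and the library is a plain Python list rather than a vector database; all of these are assumptions for illustration only.

```python
import hashlib
import math

DIM = 64  # assumption: the text mentions 512 or 768 dimensions; 64 keeps the sketch small


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def header_text(table):
    """Flatten table name, definition, fields and keys into one text string."""
    fields = "; ".join(f"{name}: {meaning}" for name, meaning in table["fields"])
    return (f"table {table['name']} ({table['definition']}) "
            f"fields [{fields}] keys [{table['keys']}]")


def build_table_library(tables):
    """One record per table: table index, table vector and table meta information."""
    return [
        {
            "index": i,
            "vector": toy_embed(header_text(t)),
            "meta": {"name": t["name"], "definition": t["definition"]},
        }
        for i, t in enumerate(tables)
    ]


# Hypothetical header information for two tables.
tables = [
    {"name": "orders", "definition": "customer orders",
     "fields": [("order_id", "order identifier"), ("amount", "order amount")],
     "keys": "primary key order_id"},
    {"name": "customers", "definition": "customer master data",
     "fields": [("customer_id", "customer identifier"), ("city", "home city")],
     "keys": "primary key customer_id"},
]
table_library = build_table_library(tables)
print(len(table_library), len(table_library[0]["vector"]))
```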
Then, constructing a field vector library;
Specifically, in a manner similar to constructing the table vector library, a field feature vector is constructed for each field in each table, and the feature vectors of all fields are organized into a vector library. Each record in the vector library contains a field vector and field meta information (the name of the table to which the field belongs, the field type, the field description, and the like).
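As a rough sketch of the field-level library, the code below stores one record per field and groups the records by table name, so that each table has its own field library. The grouping, the record layout and the placeholder `length_embed` are assumptions; any callable that maps text to a vector could be passed as `embed`.

```python
def field_text(table_name, field):
    """Text string describing a single field, including the table it belongs to."""
    return f"{table_name}.{field['name']} ({field['type']}): {field['meaning']}"


def build_field_libraries(tables, embed):
    """Return {table name: [field records]}, i.e. one field library per table."""
    libraries = {}
    for table in tables:
        libraries[table["name"]] = [
            {
                "vector": embed(field_text(table["name"], field)),
                "meta": {"table": table["name"], "field": field["name"],
                         "type": field["type"], "meaning": field["meaning"]},
            }
            for field in table["fields"]
        ]
    return libraries


def length_embed(text):
    # Placeholder only: a real embedding model would be called here.
    return [float(len(text)), float(text.count(" "))]


# Hypothetical tables with typed fields.
tables = [
    {"name": "orders", "fields": [
        {"name": "order_id", "type": "bigint", "meaning": "order identifier"},
        {"name": "amount", "type": "decimal", "meaning": "order amount"},
    ]},
    {"name": "customers", "fields": [
        {"name": "city", "type": "varchar", "meaning": "home city"},
    ]},
]
field_libraries = build_field_libraries(tables, length_embed)
print({name: len(records) for name, records in field_libraries.items()})
```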
Next, querying the Top K similar tables;
Referring to fig. 4, the specific query flow is as follows: the query question input by the user is acquired, an embedding model is used to extract a query vector, the table vector library is searched based on the query vector, and the Top K most similar tables are returned. The value of K can be set somewhat large: this retrieval step is equivalent to coarse ranking, which filters out most irrelevant tables and returns the tables that are possibly related to the query.
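The coarse retrieval step can be illustrated with a small similarity search. Cosine similarity is used here as an assumption, since the text does not name a similarity measure, and the three-dimensional vectors are made-up stand-ins for real embedding outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_tables(query_vector, table_library, k):
    """Return the names of the k tables whose vectors are most similar to the query."""
    ranked = sorted(table_library,
                    key=lambda record: cosine(query_vector, record["vector"]),
                    reverse=True)
    return [record["meta"]["name"] for record in ranked[:k]]


# Hypothetical table vector library with hand-written vectors.
table_library = [
    {"vector": [0.0, 0.2, 0.9], "meta": {"name": "employees"}},
    {"vector": [0.6, 0.6, 0.1], "meta": {"name": "customers"}},
    {"vector": [0.9, 0.1, 0.0], "meta": {"name": "orders"}},
]
query_vector = [1.0, 0.2, 0.0]
candidates = top_k_tables(query_vector, table_library, k=2)
print(candidates)
```

A brute-force scan like this is enough for a sketch; a vector database would normally perform the same search with an index.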
Next, querying and filtering irrelevant table fields;
Specifically, irrelevant table fields are queried and filtered as follows. After the Top K tables are returned, the fields in those tables that are close to the query are retrieved by a conditional query (only fields whose table information belongs to the Top K tables are retrieved), which gives a similarity ranking of those fields against the query. Most irrelevant fields with low similarity to the query can then be filtered out, so that only relevant fields are retained in the Top K tables.
In practical business scenarios, wide tables are common, that is, tables containing tens or hundreds of fields, while the table fields involved in a user's query question are usually limited. Without filtering, a large number of irrelevant table fields would be fed to the large model as context, which reduces the accuracy of the large model, and an overly long context can cause anomalies. The field filtering method effectively filters out most irrelevant fields and thereby simplifies the tables; the context length of the simplified Top K tables is significantly reduced, which effectively solves the problem of overly long contexts.
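The conditional field query can be sketched as below: field records are first restricted to the tables returned by the coarse step, then ranked by similarity, and only the best few per table are kept. The dot product as similarity, the `per_table` limit and the sample vectors are all assumptions for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relevant_fields(query_vector, field_library, top_tables, per_table):
    """Keep at most `per_table` most similar fields for each table in `top_tables`."""
    # Conditional query: ignore fields whose table is not among the returned tables.
    candidates = [r for r in field_library if r["meta"]["table"] in top_tables]
    candidates.sort(key=lambda r: dot(query_vector, r["vector"]), reverse=True)
    kept = {name: [] for name in top_tables}
    for record in candidates:
        bucket = kept[record["meta"]["table"]]
        if len(bucket) < per_table:
            bucket.append(record["meta"]["field"])
    return kept


def record(table, field, vector):
    return {"vector": vector, "meta": {"table": table, "field": field}}


# Hypothetical field vector library covering three tables.
field_library = [
    record("orders", "warehouse_code", [0.1, 0.9]),
    record("orders", "amount", [0.9, 0.1]),
    record("orders", "order_id", [0.5, 0.2]),
    record("customers", "city", [0.2, 0.8]),
    record("customers", "customer_id", [0.6, 0.3]),
    record("customers", "name", [0.4, 0.1]),
    record("employees", "salary", [0.8, 0.1]),
]
kept = relevant_fields([1.0, 0.0], field_library, ["orders", "customers"], per_table=2)
print(kept)
```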
Then, rearranging the Top K similar tables based on the large model;
After the table fields are simplified in the previous step, the context of each table's information is significantly shorter. At this point, the capability of the large model is used to analyze the degree of correlation between the query and each of the Top K tables, so as to reorder the Top K tables and obtain a more accurate ordering and recall result.
Specifically, the rearrangement uses the capability of the large model to perform a fine ordering of the returned Top K tables and obtain more accurate recall results, without requiring the large model to process every table in the library. The two-stage recall strategy combines the capabilities of the embedding model and the large model and effectively improves recall accuracy.
Finally, returning the recall result;
After the Top K tables are rearranged, a new table sequence is obtained; the earlier a table appears in the sequence, the higher its degree of correlation with the query. The first N (N < K) tables can be taken as the final recalled tables for use in the subsequent text-to-SQL steps.
For comparison, a self-built table recall test set was used: both a conventional vector recall method and the method of the invention were run on the test set, and the recall rates of the two methods were compared on three metrics, recall@1, recall@2 and recall@3. The comparison results are shown in Table 1.
Table 1. Comparison of recall rates of different table recall methods
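For readers unfamiliar with the metrics, recall@k can be computed as in the sketch below. It assumes one correct table per question, which the text does not state, and the predictions and labels are made up purely to exercise the function; they are not the results reported in Table 1.

```python
def recall_at_k(predictions, gold, k):
    """Share of questions whose correct table appears among the first k recalled tables."""
    hits = sum(1 for ranked, answer in zip(predictions, gold) if answer in ranked[:k])
    return hits / len(gold)


# Made-up ranked recall results for three questions, with one correct table each.
predictions = [
    ["orders", "customers", "employees"],
    ["customers", "orders", "employees"],
    ["employees", "orders", "customers"],
]
gold = ["orders", "orders", "customers"]

for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(predictions, gold, k):.2f}")
```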
The table recall method of this embodiment has at least the following advantages. It provides a coarse-to-fine two-stage recall strategy together with a table field filtering method: the two-stage strategy recalls more accurate results through similarity retrieval in the vector library followed by large-model rearrangement, and table field filtering solves the problem of overly long context when multiple tables or wide tables are input to the large model. In addition, compared with existing table recall methods, the method achieves higher table recall precision, which helps generate higher-quality SQL statements in the subsequent text-to-SQL steps.
The table recall device provided by the invention is described below, and the table recall device described below and the table recall method described above can be referred to correspondingly.
Referring to fig. 5, this embodiment provides a table recall device configured to implement any one of the table recall methods described above. It specifically includes a first construction module 510, a second construction module 520, a first query module 530, a second query module 540, a filtering module 550, and a sorting module 560, which are described in detail below:
The first construction module 510 is configured to construct a first vector library for all tables of the database according to the table header of the table, where the first vector library uses the whole database as an object, converts the data of each table header into a corresponding vector, and the vectors corresponding to all tables form the first vector library;
A second construction module 520, configured to construct a second vector library for all the fields of each table according to the fields of the table, where each second vector library takes one table as its object, converts the data of each field of that table into a corresponding vector, and is formed by the vectors corresponding to all the fields of that table; each second vector library corresponds to one vector in the first vector library;
a first query module 530, configured to query the first vector library for a plurality of target tables similar to the sentence to be queried;
a second query module 540, configured to query the second vector library corresponding to each target table for at least one target field similar to the sentence to be queried;
a filtering module 550, configured to perform field filtering on the corresponding target table by using the target field to obtain a filtering table corresponding to each target table;
a sorting module 560, configured to sort the multiple target tables by using the large model and the multiple filtering tables to obtain a table recall result.
According to the table recall device, a first vector library, and a second vector library corresponding to each table, are constructed in advance from the table headers and the fields of the tables, respectively. A coarse-grained query at the table-header level is first performed through the first vector library to obtain a plurality of target tables similar to the statement to be queried. A fine-grained query at the field level is then performed on each target table through its second vector library to obtain the target fields, which are used to filter the corresponding target table and thereby remove irrelevant data. Finally, the target tables are sorted by using the large model and the filtering tables. The two-stage strategy, in which vector similarity retrieval determines the recalled tables and the large model then rearranges them using the filtering tables, significantly improves recall precision and also resolves the problems caused by overly long context when the large model is called to recall tables.
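The way the query, filtering and sorting modules hand data to one another can be sketched as a single function that receives each stage as a callable. The function name `recall_tables`, its parameters and the stub stages are hypothetical; they only illustrate the order of the stages and the constraint that fewer tables are output than were coarsely retrieved.

```python
def recall_tables(query, coarse_search, field_search, rerank, k, n):
    """Two-stage recall: coarse table search, field filtering, then large-model ordering."""
    if n >= k:
        raise ValueError("the number of output tables should be smaller than k")
    target_tables = coarse_search(query, k)                     # first query module
    filtered = {t: field_search(query, t) for t in target_tables}  # second query + filtering
    ordered = rerank(query, filtered)                           # sorting module
    return ordered[:n]


# Stub stages standing in for the vector libraries and the large model.
def stub_coarse(query, k):
    return ["orders", "customers", "employees"][:k]


def stub_fields(query, table):
    return {"orders": ["amount"], "customers": ["customer_id"], "employees": []}[table]


def stub_rerank(query, filtered):
    # Pretend tables with more surviving fields are more relevant.
    return sorted(filtered, key=lambda t: len(filtered[t]), reverse=True)


result = recall_tables("total order amount", stub_coarse, stub_fields, stub_rerank, k=3, n=2)
print(result)
```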
The first construction module 510 is further configured to:
Acquiring the table head content and the table unique identifier of each table in the database;
Constructing each header content into a first text string;
extracting features of each first text string by using an embedding model to obtain a feature vector corresponding to each table, and associating each feature vector with the table unique identifier of the corresponding table;
collecting the feature vectors extracted for all the tables to obtain a first vector library.
The second construction module 520 is further configured to:
the following operations are respectively performed on each table in the database:
acquiring the field content of each field in the table and the table unique identifier of the table to which the field belongs;
Constructing each field content into a second text string;
extracting features of each second text string by using an embedding model to obtain a feature vector corresponding to each field, and associating each feature vector with the table unique identifier of the table to which the field belongs;
collecting, for each table, the feature vectors extracted for all of its fields to obtain a second vector library corresponding to each table.
The first query module 530 is further configured to:
acquiring a query statement input to an intelligent question number service to obtain the statement to be queried;
extracting features of the sentence to be queried by using the embedding model to obtain a vector to be queried;
calculating a first similarity between the vector to be queried and each vector in the first vector library;
And taking all the tables with the first similarity smaller than a first preset value as target tables, or sorting all the tables from small to large based on the first similarity, and taking the tables with the first preset number arranged before as target tables.
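The two alternative selection rules can be sketched as follows. The text selects tables whose similarity value is smaller than a preset value and sorts from small to large, which reads naturally if the "similarity" is a distance-like score where smaller means closer; that interpretation, the function names and the sample values are assumptions of this sketch.

```python
def select_by_threshold(distances, limit):
    """Keep every table whose distance-like score is below the preset value."""
    return [name for name, value in distances.items() if value < limit]


def select_first_n(distances, n):
    """Sort ascending by score and keep the first n tables."""
    return sorted(distances, key=distances.get)[:n]


# Hypothetical distance-like scores between the query vector and each table vector.
distances = {"orders": 0.12, "customers": 0.35, "employees": 0.80}

print(select_by_threshold(distances, limit=0.5))
print(select_first_n(distances, n=1))
```

The threshold rule returns a variable number of tables, while the fixed-count rule bounds the amount of context passed on to later steps; the same two rules apply to field selection in the second query module.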
The second query module 540 is further configured to:
the following operations are respectively executed for the second vector library corresponding to each target table:
Calculating a second similarity of the vector to be queried and each vector in the second vector library;
and taking all the fields with the second similarity smaller than a second preset value as target fields, or sorting all the fields from small to large based on the second similarity, and taking the fields with the second preset number arranged before as target fields.
The filtering module 550 is further configured to:
Acquiring all target fields corresponding to each target table;
deleting all the fields except all the corresponding target fields in each target table to obtain a filtering table corresponding to each target table.
The sorting module 560 is further configured to:
combining each filtering table with the statement to be queried;
Respectively analyzing each pair of combinations by using a large model to obtain the correlation degree of each pair of combinations;
Sorting the plurality of target tables from large to small based on the correlation;
Outputting a plurality of target tables according to the sorting, or outputting a third preset number of target tables arranged in front.
It should be noted that each module in the table recall device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or be independent of, a processor in the electronic device, or may be stored in software form in a memory in the electronic device, so that the processor can call and execute the operations corresponding to the above modules.
Fig. 6 illustrates a physical schematic diagram of an electronic device. As shown in fig. 6, the electronic device may include a processor 610, a communication interface 620, a memory 630, and a communication bus 640, where the processor 610, the communication interface 620, and the memory 630 communicate with each other through the communication bus 640. The processor 610 may call logic instructions in the memory 630 to perform a table recall method comprising: constructing a first vector library for all tables of the database according to the table heads of the tables; constructing a second vector library for all the fields of each table according to the fields of the table; querying a plurality of target tables similar to the sentence to be queried from the first vector library; querying at least one target field similar to the sentence to be queried from the second vector library corresponding to each target table; performing field filtering on the corresponding target table by using the target field to obtain a filtering table corresponding to each target table; and sorting the multiple target tables by using the large model and the multiple filtering tables to obtain a table recall result.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the table recall method provided by the methods described above, the method comprising: constructing a first vector library for all tables of the database according to the table heads of the tables; constructing a second vector library for all the fields of each table according to the fields of the table; querying a plurality of target tables similar to the sentence to be queried from the first vector library; querying at least one target field similar to the sentence to be queried from the second vector library corresponding to each target table; performing field filtering on the corresponding target table by using the target field to obtain a filtering table corresponding to each target table; and sorting the multiple target tables by using the large model and the multiple filtering tables to obtain a table recall result.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the table recall method provided by the methods described above, the method comprising: constructing a first vector library for all tables of the database according to the table heads of the tables; constructing a second vector library for all the fields of each table according to the fields of the table; querying a plurality of target tables similar to the sentence to be queried from the first vector library; querying at least one target field similar to the sentence to be queried from the second vector library corresponding to each target table; performing field filtering on the corresponding target table by using the target field to obtain a filtering table corresponding to each target table; and sorting the multiple target tables by using the large model and the multiple filtering tables to obtain a table recall result.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1. A method of table recall, the method comprising:
constructing a first vector library for all tables of the database according to the table heads of the tables;
constructing a second vector library for all the fields of each table according to the fields of the table;
Querying a plurality of target tables similar to sentences to be queried from the first vector library;
Querying at least one target field similar to a sentence to be queried from the second vector library corresponding to each target table;
Performing field filtering on the corresponding target table by using the target field to obtain a filtering table corresponding to each target table;
sorting the multiple target tables by using the large model and the multiple filtering tables to obtain a table recall result;
the constructing a first vector library for all tables of the database according to the table head of the table comprises the following steps:
Acquiring the table head content and the table unique identifier of each table in the database;
Constructing each header content into a first text string;
extracting features of each first text string by using an embedding model to obtain a feature vector corresponding to each table, and associating each feature vector with the table unique identifier of the corresponding table;
collecting the feature vectors extracted for all the tables to obtain a first vector library;
the constructing a second vector library for all the fields of each table according to the fields of the table respectively comprises the following steps:
the following operations are respectively performed on each table in the database:
acquiring the field content of each field in the table and the table unique identifier of the table to which the field belongs;
Constructing each field content into a second text string;
extracting features of each second text string by using an embedding model to obtain a feature vector corresponding to each field, and associating each feature vector with the table unique identifier of the table to which the field belongs;
collecting, for each table, the feature vectors extracted for all of its fields to obtain a second vector library corresponding to each table;
The querying the target tables similar to the sentences to be queried from the first vector library comprises the following steps:
acquiring a query statement input to an intelligent question number service to obtain the statement to be queried;
extracting features of the sentence to be queried by using the embedding model to obtain a vector to be queried;
calculating a first similarity between the vector to be queried and each vector in the first vector library;
Taking all the tables with the first similarity smaller than a first preset value as target tables, or sorting all the tables from small to large based on the first similarity, and taking the tables with the first preset quantity arranged before as target tables;
The querying at least one target field similar to the sentence to be queried from the second vector library corresponding to each target table comprises:
the following operations are respectively executed for the second vector library corresponding to each target table:
Calculating a second similarity of the vector to be queried and each vector in the second vector library;
Taking all fields with the second similarity smaller than a second preset value as target fields, or sorting all fields from small to large based on the second similarity, and taking the fields with the second preset number arranged before as target fields;
and performing field filtering on the corresponding target table by using the target field to obtain a filtering table corresponding to each target table, including:
Acquiring all target fields corresponding to each target table;
Deleting all the fields except all the corresponding target fields in each target table to obtain a filtering table corresponding to each target table;
The method for sorting the multiple target tables by using the large model and the multiple filtering tables to obtain a table recall result comprises the following steps:
combining each filtering table with the statement to be queried;
Respectively analyzing each pair of combinations by using a large model to obtain the correlation degree of each pair of combinations;
Sorting the plurality of target tables from large to small based on the correlation;
Outputting a plurality of target tables according to the sorting, or outputting a third preset number of target tables arranged in front.
2. A recall device, the device comprising:
the first construction module is used for constructing a first vector library for all tables of the database according to the table heads of the tables;
The second construction module is used for constructing a second vector library for all the fields of each table according to the fields of the table;
The first query module is used for querying a plurality of target tables similar to the sentences to be queried from the first vector library;
A second query module, configured to query, from the second vector library corresponding to each target table, at least one target field similar to a sentence to be queried;
the filtering module is used for performing field filtering on the corresponding target table by using the target field to obtain a filtering table corresponding to each target table;
the sorting module is used for sorting the multiple target tables by utilizing the large model and the multiple filtering tables to obtain a table recall result;
the first construction module is further configured to:
Acquiring the table head content and the table unique identifier of each table in the database;
Constructing each header content into a first text string;
extracting features of each first text string by using an embedding model to obtain a feature vector corresponding to each table, and associating each feature vector with the table unique identifier of the corresponding table;
collecting the feature vectors extracted for all the tables to obtain a first vector library;
the second construction module is further configured to:
the following operations are respectively performed on each table in the database:
acquiring the field content of each field in the table and the table unique identifier of the table to which the field belongs;
Constructing each field content into a second text string;
extracting features of each second text string by using an embedding model to obtain a feature vector corresponding to each field, and associating each feature vector with the table unique identifier of the table to which the field belongs;
collecting, for each table, the feature vectors extracted for all of its fields to obtain a second vector library corresponding to each table;
the first query module is further configured to:
acquiring a query statement input to an intelligent question number service to obtain the statement to be queried;
extracting features of the sentence to be queried by using the embedding model to obtain a vector to be queried;
calculating a first similarity between the vector to be queried and each vector in the first vector library;
Taking all the tables with the first similarity smaller than a first preset value as target tables, or sorting all the tables from small to large based on the first similarity, and taking the tables with the first preset quantity arranged before as target tables;
the second query module is further configured to:
the following operations are respectively executed for the second vector library corresponding to each target table:
Calculating a second similarity of the vector to be queried and each vector in the second vector library;
Taking all fields with the second similarity smaller than a second preset value as target fields, or sorting all fields from small to large based on the second similarity, and taking the fields with the second preset number arranged before as target fields;
the filtering module is further configured to:
Acquiring all target fields corresponding to each target table;
Deleting all the fields except all the corresponding target fields in each target table to obtain a filtering table corresponding to each target table;
the sorting module is further configured to:
combining each filtering table with the statement to be queried;
Respectively analyzing each pair of combinations by using a large model to obtain the correlation degree of each pair of combinations;
Sorting the plurality of target tables from large to small based on the correlation;
Outputting a plurality of target tables according to the sorting, or outputting a third preset number of target tables arranged in front.
3. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the table recall method of claim 1 when the program is executed by the processor.
4. A non-transitory computer readable storage medium having stored thereon a computer program which when executed by a processor implements the table recall method of claim 1.
CN202410733522.6A 2024-06-07 2024-06-07 Table recall method, apparatus, electronic device and medium Active CN118312524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410733522.6A CN118312524B (en) 2024-06-07 2024-06-07 Table recall method, apparatus, electronic device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410733522.6A CN118312524B (en) 2024-06-07 2024-06-07 Table recall method, apparatus, electronic device and medium

Publications (2)

Publication Number Publication Date
CN118312524A CN118312524A (en) 2024-07-09
CN118312524B true CN118312524B (en) 2024-09-27

Family

ID=91726658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410733522.6A Active CN118312524B (en) 2024-06-07 2024-06-07 Table recall method, apparatus, electronic device and medium

Country Status (1)

Country Link
CN (1) CN118312524B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119203958B (en) * 2024-11-25 2025-04-08 云筑信息科技(成都)有限公司 Method for extracting and aligning multi-format document table data based on large language model
CN119357220A (en) * 2024-12-26 2025-01-24 中科云谷科技有限公司 Database table recall method, device, system and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083688A (en) * 2019-05-10 2019-08-02 北京百度网讯科技有限公司 Search result recalls method, apparatus, server and storage medium
CN116401344A (en) * 2023-02-16 2023-07-07 广州广电运通金融电子股份有限公司 Method and device for searching table according to question

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767375A (en) * 2020-05-13 2020-10-13 平安科技(深圳)有限公司 Semantic recall method and device, computer equipment and storage medium
CN118093629A (en) * 2024-03-13 2024-05-28 北京沃东天骏信息技术有限公司 Database query statement generation method, device, equipment and medium

Also Published As

Publication number Publication date
CN118312524A (en) 2024-07-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant