CN119166740A

CN119166740A - Knowledge base construction method, data processing method, device, storage medium and program product

Info

Publication number: CN119166740A
Application number: CN202411200392.6A
Authority: CN
Inventors: 陈端; 黄铃钰
Original assignee: Hangzhou Alibaba Overseas Internet Industry Co ltd
Current assignee: Hangzhou Alibaba Overseas Internet Industry Co ltd
Priority date: 2024-08-28
Filing date: 2024-08-28
Publication date: 2024-12-20

Abstract

The embodiment of the present application provides a knowledge base construction method, data processing method, device, storage medium and program product. Among them, based on database data and document data, table description information of the data table in the database data and field description of the data table are generated; and the table relationship information corresponding to the data table is identified from the historical query statement of the database data; thus, the asset knowledge base can be constructed based on the table description information, the field description information and the table relationship information. The embodiment of the present application generates table description information and field description information based not only on database data but also on document data, and identifies and obtains table relationship information from the historical query statement and document data, so that the information in the constructed asset knowledge base is rich and usable, and the data quality of the asset knowledge base is improved.

Description

Knowledge base construction method, data processing method, device, storage medium, and program product

Technical Field

Embodiments of the present application relate to the field of large model technologies, and in particular, to a knowledge base construction method, a data processing method, a computing device, a computer readable storage medium, and a computer program product.

Background

In some enterprise database application scenarios based on large models, such as a look-up scenario of data from a database or a use scenario of processing data, or a NL2SQL scenario of converting text into SQL for a database, constructing a high quality asset knowledge base becomes a solution to improve the performance of large models.

The asset knowledge base is used to centrally store and manage organizational data assets, including data tables, data fields, and the like. And the combined asset knowledge base can be used for constructing training samples of the large model to fine tune the large model or serve as auxiliary information to carry out retrieval enhancement on the large model and the like.

Therefore, how to construct a high-quality asset knowledge base is a technical problem that needs to be solved at present.

Disclosure of Invention

The embodiment of the application provides a knowledge base construction method, a data processing method, equipment, a storage medium and a program product, which are used for solving the technical problem of how to construct a high-quality asset knowledge base in the prior art.

In a first aspect, an embodiment of the present application provides a knowledge base construction method, including:

acquiring database data and document data corresponding to a data provider, wherein the database data comprises data table data and metadata in a database;

Generating table description information of a data table in the database and field description information of the data table based on the database data and the document data;

Identifying table relation information corresponding to the data table from historical query sentences of the database data and the document data;

and constructing an asset knowledge base based on the table description information, the field description information and the table relation information.

In a second aspect, an embodiment of the present application provides a data processing method, including:

Determining an asset knowledge base;

Identifying target knowledge information matched with the input data from an asset knowledge base, wherein the target knowledge information comprises target table description information, target field description information and/or target table relation information;

and inputting a target big model based on the input data and the target table description information, the target field description information and/or the target table relation information to obtain output data.

In a third aspect, an embodiment of the present application provides a data processing method, including:

Determining an asset knowledge base;

constructing a training sample based on table description information, field description information and table relation information in the asset database;

And training a target large model by using the training sample.

In a fourth aspect, an embodiment of the present application provides a data processing method, including:

processing the query request by using the target large model;

Under the condition that the query request relates to database data, querying target knowledge information from an asset knowledge base corresponding to the database data, wherein the target knowledge information comprises table description information, target field description information and/or target table relation information;

generating a query result based on the target knowledge information;

and taking the query result as output data of the target large model.

In a fifth aspect, an embodiment of the present application provides a knowledge base construction apparatus, including:

The data acquisition module is used for acquiring database data and document data corresponding to a data provider, wherein the database data comprises data table data and metadata in a database;

a description information generating module, configured to generate table description information of a data table in the database and field description information of the data table based on the database data and the document data;

the first information identification module is used for identifying table relation information corresponding to the data table from historical query sentences of the database data and the document data;

And the knowledge base construction module is used for constructing an asset knowledge base based on the table description information, the field description information and the table relation information.

In a sixth aspect, an embodiment of the present application provides a data processing method, including:

the first knowledge base determining module is used for determining an asset knowledge base;

The second information identification module is used for identifying target knowledge information matched with the input data from an asset knowledge base, wherein the target knowledge information comprises target table description information, target field description information and/or target table relation information;

And the data generation module is used for inputting a target big model based on the input data, the target table description information, the target field description information and/or the target table relation information so as to obtain output data.

In a seventh aspect, an embodiment of the present application provides a data processing method, including:

the second knowledge base determining module is used for determining an asset knowledge base;

The sample construction module is used for constructing training samples based on the table description information, the field description information and the table relation information in the asset database;

And the model training module is used for training a target large model by utilizing the training sample.

In an eighth aspect, an embodiment of the present application provides a data processing method, including:

the request processing module is used for processing the query request by utilizing the target large model;

The query module is used for querying target knowledge information from an asset knowledge base corresponding to database data when the query request relates to the database data, wherein the target knowledge information comprises table description information, target field description information and/or target table relation information;

the result generation module is used for generating a query result based on the target knowledge information;

And the data output module is used for taking the query result as output data of the target large model.

In a ninth aspect, in an embodiment of the present application, a computing device includes a processing component and a storage component;

the storage component stores one or more computer instructions which are used for being called and executed by the processing component to realize the knowledge base construction method provided by the application or realize the data processing method provided by the application.

In a tenth aspect, in an embodiment of the present application, there is provided a computer readable storage medium having stored thereon a computer program, which when executed by a processing component, implements a knowledge base construction method provided by the present application, or implements a data processing method provided by the present application.

In an eleventh aspect, in an embodiment of the present application, a computer program product is provided, where the computer program/instruction implements a knowledge base construction method provided by the present application or implements a data processing method provided by the present application when the computer program/instruction is executed by a processing component.

In the embodiment of the application, the table description information and the field description information of the data table in the database data are generated by carrying out data analysis on the database data and the document data, and the table relation information corresponding to the data table is identified from the historical query statement of the database data and the document data, so that the asset knowledge base can be constructed based on the table description information, the field description information and the table relation information.

These and other aspects of the application will be more readily apparent from the following description of the embodiments.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of one embodiment of a knowledge base construction method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a knowledge base construction process in a practical application of an embodiment of the present application;

FIG. 3 is a flow chart of one embodiment of a data processing method provided by an embodiment of the present application;

FIG. 4 is a flow chart of yet another embodiment of a data processing method provided by an embodiment of the present application;

FIG. 5 is a flow chart of yet another embodiment of a data processing method provided by an embodiment of the present application;

FIG. 6 is a schematic diagram illustrating the construction of an embodiment of a knowledge base construction apparatus according to an embodiment of the present application;

FIG. 7 is a schematic diagram of an embodiment of a data processing apparatus according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a data processing apparatus according to another embodiment of the present application;

FIG. 9 is a schematic diagram of a data processing apparatus according to another embodiment of the present application;

FIG. 10 illustrates a schematic diagram of one embodiment of a computing device provided by embodiments of the present application.

Detailed Description

In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present application with reference to the accompanying drawings.

In some of the flows described in the specification and claims of the present application and in the foregoing figures, a plurality of operations occurring in a particular order are included, but it should be understood that the operations may be performed out of order or performed in parallel, with the order of operations such as 101, 102, etc., being merely used to distinguish between the various operations, the order of the operations themselves not representing any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.

In the current large model research field, the search is gradually going from pure model algorithm search to search with more emphasis on data quality. The driving force behind this transition is the increasing importance of data quality to model optimization. Such as a find number scenario, a use number scenario, or an NL2SQL (NL-to-SQL) scenario in which a Natural Language (NL) is converted into an SQL (Structured Query Language) statement, the requirement for data accuracy is continuously improved, and the data quality becomes particularly critical. Taking NL2SQL scenario as an example, the large model is used to convert the query text of natural language into SQL statements, the large model needs to identify keywords in the query text of natural language, and associate with table names, field names, etc. in the database, and in this process, the challenge is to perform semantic analysis on the database, understand the vocabulary and data relationships in the database, etc.

In the traditional scheme, metadata is often used as auxiliary information in the process of implementing the application, however, the inventor discovers that in some databases, such as ODPS (Open Data Processing Service, development of data service) or Hadoop (an open-source big data processing framework) ecological Metastore (a database for storing metadata in Hadoop), the relationship between tables cannot be intuitively embodied in the data model design stage, so that the asset knowledge base constructed based on the metadata is still quite high in understanding cost for the large model due to the lack of idle current wattmeter relationship information. Further, table annotation information, field annotation information, and the like related to metadata are manually provided, and thus have problems such as irregular annotations, and increase understanding costs of large models.

Therefore, how to construct a high-quality asset knowledge base to improve the performance of large models is a technical problem to be solved. According to the technical scheme, the inventor provides the technical scheme of the embodiment of the application through multiple rounds of tests, in the embodiment of the application, data analysis is carried out on database data and document data to generate table description information of a data table in the database data and field description information of the data table, table relation information corresponding to the data table is identified from historical query sentences of the database data and document data, so that an asset knowledge base can be constructed based on the table description information, the field description information and the table relation information.

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.

Fig. 1 shows a flowchart of a knowledge base construction method according to an embodiment of the present application, and as shown in fig. 1, the knowledge base construction method specifically may include the following steps:

and 101, acquiring database data and document data corresponding to the data provider.

The database data can be database data corresponding to the data provider, and the asset knowledge base constructed by the embodiment of the application can be used for docking a target large model applied by the data provider. A data provider may refer to, for example, an enterprise, in which many data may be generated during the production process of the enterprise, where the data may be stored in a database built by the enterprise, and some data may exist in text form, video form, or audio form, etc. to form document data, for example, text files such as plain text files (.txt), office documents, PDF files, web pages, emails, blog articles, academic papers, reports, etc.

The database data may include data table data, metadata, etc., wherein the metadata may include database schema information (schema), etc., for example. In practice, enterprise databases may typically employ ODPS and/or Hadoop Metastore, etc., and such databases may lack metadata information, often table relationship information, etc.

102, Generating table description information of a data table and field description information of the data table in the database data based on the database data and the document data.

The table description information is descriptive information of the data table, and may relate to various table description types, for example, may include a table name, a table type, table annotation information, and the like of the data table. The table description information may be extracted from the database data according to various table description types.

The field description information may include descriptive information of a field, may relate to various field description types, may include a field name, a field annotation, a table name of a data table to which the field description information belongs, and the like, and may be extracted from database data according to various field description types.

The plurality of table description types and the plurality of field description types may be set in connection with actual data quality requirements for the asset knowledge base, etc., such that the generated table description information and field description information meet the data quality requirements.

The specific generation modes of the table description information and the field description information are described in detail in the corresponding embodiments below.

In addition, since document data is also data generated during the production process of the data provider, entity related information about entities associated with the data table is also generally present, and such information can also be used to generate table description information, which will be described in detail in the following corresponding embodiments.

Alternatively, table description information of a data table in the database data and field description information of the data table may be generated based on the database data and the document data using the large model.

And 103, identifying table relation information corresponding to the data table from historical query sentences of the database data and document data.

The historical query statement may be recorded in a historical query record in the metadata. The table relation information is often related to the historical query statement, and the embodiment of the application can identify and acquire the table relation information through analyzing and processing the historical query statement.

The historical query statement may include a query statement corresponding to a historical data query operation or a data cleansing operation of database data. The data cleansing operation refers to a series of operations performed on data to improve its quality, including but not limited to removing duplicate data, correcting erroneous data, filling missing values, normalizing data formats, etc. In a database environment, data cleansing and data querying are typically accomplished using query statements, such as SQL statements, in which table relationships are typically embodied. Such as cross-table operations like JOIN (JOIN), sub-queries, etc.

In addition, since entity related information of an entity associated with the data table, such as entity related information of an entity corresponding to the data table, entity relationship information, entity operation information, and the like may exist in the document data, the table relationship information may be identified from the entity relationship information and the entity operation information.

Alternatively, table description information of a data table in the database data and field description information of the data table may be generated based on the database data and the document data using the large model. The table relation information corresponding to the data table may be identified from the historical query sentence of the database data and the document data by using a large model, which will be described in detail in the following embodiments.

104, Constructing an asset knowledge base based on the table description information, the field description information and the table relation information.

The asset knowledge base is used for interfacing with the target large model, for example, is used for generating a training corpus to fine tune the target large model or is used as auxiliary information to realize retrieval enhancement of the large model or help the large model to carry out semantic understanding in the data query process, and the like, and is used as a knowledge base of the large model.

In one embodiment of the present application, the history query sentence may include all query sentences historically executed in the database, but is not limited thereto, and may refer to query sentences executed in a preset history period. The query term may refer to, for example, an SQL term.

The table relation information may be a relation between the data table and other data tables stored in the database, and the table relation information may include, for example, a table association relation, a table blood edge relation, a field blood edge relation, blood edge path information, and the like. The table association relationship refers to the association between different data tables, such as the association between one record of one data table and one record of another data table, the association is usually established by associating an external key of one data table with a main key of another data table, such as a customer ID (identification) field in an order table, which refers to the main key ID of the customer table, the table blood-edge relationship refers to the blood-edge relationship of table granularity, which represents the source data table, the target data table and the like involved in the data circulation process, the field blood-edge relationship refers to the blood-edge relationship of field granularity, which represents the fields of the source data table, the target data table and the like involved in the data circulation process, and the blood-edge path information may refer to the data table, the circulation sequence, the path depth and the like involved in the data circulation process. In addition, the table relationship information may include table or field containing relationships, conversion relationships, dependency relationships, aggregation relationships, and/or derivatives relationships, where a containing relationship may refer to a field or record in one data table being part of a field or record in another data table, for example, an order table containing information of an order list, each record in the order list corresponding to a specific order, a conversion relationship may refer to a change in data from one data table or field to another data table or field, a dependency relationship may refer to a case where a field or data table depends on another field or data table, where the dependency may be logical or formed during data processing, an aggregation relationship may refer to how data in multiple data tables is summarized into one data table, for example, a sales summary table may aggregate data from multiple sales lists, and derivatives relationships may refer to data in one field or data table being calculated based on data in other fields or data tables.

The table relationship information can be obtained analytically from query statements, for example, in SQL statements, JOIN operations can describe the relationships between tables, sub-queries and CTE (Common Table Expressions ) define the flow of data between different tables, thereby determining the relationships of the tables, field selection and computation in SQL statements can also represent the dependencies between fields, and the definition of aggregate and computation fields can display the derived relationships of the fields, thereby indicating the relationships of the fields.

In the embodiment of the application, after the table description information, the field description information and the table relation information are generated, an asset knowledge base can be constructed based on the table description information, the field description information and the table relation information, and the asset knowledge base can be used for butting a target large model, so that the asset knowledge base can provide data support for training and reasoning of the target large model.

The asset knowledge base may provide a standardized set of interfaces to interface with the target large model so that the target large model can conveniently query and access data in the asset knowledge base.

In this embodiment, the table description information, the field description information and the table relation information corresponding to the identification data table of the data table in the database are generated by extracting the database data, and the asset knowledge base butt-joint target large model is constructed based on the table description information, the field description information and the table relation information, so that the asset knowledge base contains relatively comprehensive related information of the database assets, the data quality is ensured, the target large model is helped to better understand and process the data, and the interpretation of the large model are improved.

In an actual application scenario, the accuracy, the integrity and the validity of data in an asset database have important influence on the reasoning accuracy and the training result of a target large model, so before table description information and field description information of a data table are generated, the quality evaluation of the database data can be firstly carried out.

In some embodiments, generating table description information of a data table in the database data and field description information of the data table based on the database data and the document data can be specifically implemented by performing quality evaluation on the database data and the document data to screen the database data and the document data meeting the data quality requirement, and generating table description information of the data table in the database data and field description information of the data table according to the database data meeting the data quality requirement.

The quality evaluation of the database data may include, for example, determining whether there is a missing value in the data table, whether the field data is out of a range of values, whether a data format of the field data meets a format requirement, whether there is duplicate data, and the like. The data quality requirement may include that there is no missing value, that the field data is within the value range, that the data format of the field data meets the format requirement, that there is no duplicate data, etc., it should be noted that only possible quality evaluation modes are illustrated here, and that the quality evaluation modes may be set in combination with actual situations, or that database data meeting the data quality requirement may be screened from database data by using a large model, which is not limited by the present application.

Also, evaluating the quality of the document data may include determining whether missing values exist in the document data, whether the value range is exceeded, whether the data format meets the format requirements, whether duplicate data exists, and the like. Of course, document data meeting the data quality requirement can be screened from large model data document data.

Then, after the quality evaluation is performed on the database data and the document data, the acquired database data and document data can be subjected to data cleaning according to the inspection result, and the data tables and fields which do not meet the data quality requirements are cleaned, so that the database data and the document data which meet the quality requirements are screened. Therefore, table description information of the data table and field description information of the data table can be generated aiming at the screened database data and document data meeting the quality requirement, so that the data quality of the asset knowledge base can be ensured, the data accuracy, integrity and consistency of the asset database can be improved, and the value and reliability of the asset database can be improved.

In order to make full use of the database data, maintenance may be performed on data that does not meet quality requirements when quality assessment is performed. In some embodiments, the knowledge base construction method can further comprise outputting data maintenance prompt information for database data which does not meet the data quality requirement, responding to the data adjustment request, updating the database data, and taking the updated database data as the database data which meets the data quality requirement.

When the quality evaluation is carried out on the database data, the data which do not meet the data quality requirement can be marked, and after the quality evaluation is finished, data maintenance prompt information is output. The data maintenance prompt information may include specific quality problems of the data and suggested correction modes.

The data maintenance prompt information can be sent to related personnel, the data adjustment request can be triggered by the related personnel, the data adjustment request can comprise a correction mode and the like, so that database data which do not meet the quality requirements can be updated, and the updated database data are used as database data which meet the quality requirements.

And aiming at the document data which does not meet the data quality requirement, the data maintenance prompt information can be output, so that the document data is updated in response to the data adjustment request, and the updated document data is used as database data with the symbol data quality requirement.

In one embodiment of the present application, after updating the database, the quality of the updated database may be further evaluated again to determine whether the updated database data meets the data quality requirement.

In addition, in databases, a Data Model (Data Model) has an important and direct influence on Data quality, and is an abstract framework describing the structure, organization and management of Data, defining the Data structure and storage manner, and how to operate on Data such as adding, deleting, accessing, maintaining, updating, etc., so that a database system can efficiently and accurately process and store a large amount of Data.

Thus, the data model may affect the accuracy, consistency, integrity, expansibility, and usability of the data. Therefore, in order to ensure the data quality of the database data and ensure the data quality of the asset knowledge base built based on the database data, the quality evaluation can be further performed on the data model of the database on the basis of the quality evaluation of the database data.

In some embodiments, the knowledge base construction method may further include performing a quality assessment of a data model of the database data to determine whether the data model meets model quality requirements. If not, outputting model maintenance prompt information, and responding to the model adjustment request, and updating the data model.

In the quality evaluation of the data model, for example, the quality evaluation of the data model can be performed from at least one evaluation dimension, so as to obtain a quality score of the data model according to an evaluation score of the at least one evaluation dimension, and whether the data model meets the model quality requirement is determined according to the quality. The at least one evaluation dimension may include, for example, normalization and/or validity, etc. The validity evaluation of the data model may be, for example, whether a data structure and an organization manner of the data model meet a specification requirement, for example, whether a table usage instruction, an annotation, a naming, a field and the like are included, whether corresponding description requirements are met, and the like, so that a normalized evaluation score may be obtained according to a judgment result, and the validity evaluation of the data model may be, for example, whether a data table defined by the data model meets a validity requirement, where the validity requirement may include, for example, at least one condition of no empty data table that is no longer updated, no data table accessed in a last period, no data table with increment record of 0, no downstream application scenario, and downstream application offline, and the like, so that the corresponding evaluation score may be determined according to the met validity condition.

Alternatively, the quality of the data model may be obtained by weighting and summing the evaluation scores of at least one evaluation dimension, and the weight coefficient corresponding to each evaluation dimension may be set in combination with the actual situation, which is not limited in the present application.

Of course, the quality evaluation of the data model can also be implemented by using a large model, which is not limited by the present application.

In the case that the data model is determined to not meet the model quality requirement through quality assessment, in some embodiments, the method can further comprise outputting model maintenance prompt information, wherein the model maintenance prompt information can comprise quality problems of the data model and suggested correction methods.

The model maintenance prompt information is sent to related personnel, and the model adjustment request can be triggered by the related personnel and can include a correction mode to update the data model according to the correction mode.

The model quality requirement can be set according to practical situations, so as to evaluate the quality of the data model, which is not limited by the application.

In order to accurately and comprehensively determine table relationship information for data tables in database data, in some embodiments, the knowledge base construction method may further include determining manually configured table relationship information in response to a relationship maintenance request for the database data.

Manually configuring table relationships may refer to specifying table relationship information of data tables in a database by manual definition, for example, may specify association relationships between database tables by defining primary and foreign key relationships, specifying connection conditions, describing relationships between tables in a logic, and the like. Therefore, the table relation information corresponding to the identification data table in the historical query statement of the database data can be specifically realized by identifying the table relation information corresponding to the data table in the historical query statement of the database data and manually configured table relation information.

In addition, the metadata may also include partial table relation information, and in some embodiments, the table relation information corresponding to the identified data table may be specifically identified from the historical query statement of the database data, the document data, the table relation information maintained in the metadata, and the manually configured table relation information.

As described above, the table relationship information, such as JOIN, sub-query, etc., may be included in the historical query statement of the database, and may be identified by parsing the query statement. In addition, the metadata of the database typically includes table structure information, primary keys, foreign keys, etc., which may also be used to determine table relationships.

On the basis, when the table relation information corresponding to the data table is determined, the manually configured table relation information can be combined, so that the table relation information can be more comprehensively and accurately identified.

In some embodiments, generating the table description information of the data table and the field description information of the data table in the database data based on the database data and the document data may include generating the table description information of the data table and the field description information of the data table in the database data based on the database data, extracting entity related information of an entity associated with the data table from the document data, and taking the entity related information as one type of table description information.

The method comprises the steps of generating table description information of a data table in database data and field description information of the data table based on the database data, wherein the table description information of the data table and the field description information of the data table can comprise normalization processing of the table annotation information of the data table to be used as one type of table description information, extracting table index keywords from the table annotation information of the data table after normalization processing to be used as one type of table description information, the table index keywords are used for carrying out table description information matching, normalization processing of the field annotation information of the data table to be used as one type of field description information, extracting field index keywords from the field annotation information of the data table after normalization processing to be used as one type of field description information, and the field index keywords are used for carrying out field description information matching.

The entity related information extracted from the document data can be used as table description information of a data table, and can be combined into table annotation information after normalization processing.

In the embodiment of the application, the table description information can comprise table annotation information, wherein the table annotation information is descriptive annotation of a database table and is used for explaining the structure, the application, the data source and the like of the data table. In a database, table annotations may be added to a data table by a specific syntax. When interfacing an asset database to a target large model, table annotation information may help the large model understand the structure and purpose of the data table, etc.

However, as can be seen from the foregoing description, there are problems of non-standardization of table annotation in the current database, which are disadvantageous for generalization and understanding of the target large model, and increase of understanding cost, so that the table annotation information can be normalized to be converted into a more standard and uniform form.

The normalization processing of the table annotation information can include format normalization, logic normalization and the like. The format normalization is used for normalizing the text format of the annotation information, and the logic normalization can process the annotation information from the semantic level, so that the annotation content is more unified and meaningful logically.

By performing format normalization and logic normalization processing on the table annotation information, the table annotation information can be effectively processed, so that the table annotation information is consistent in form and semantics.

In addition, after normalization of the table annotation information, keyword extraction may be performed from the table annotation information, and the extracted table index keyword may be used as table description information. The table index keywords can be used for matching table description information, for example, input data and the table index keywords are matched to determine target table description information matched with the input data, and the target table description information can be used as auxiliary information of the input data to be input into a target large model so as to realize retrieval enhancement.

In addition, similar to the table description information, normalization processing is also required for field annotation information, which is text for describing fields in a database table, and generally contains information about descriptions of fields, data types, length restrictions, constraints, and the like, and in the database, field annotation can be added to the database table by a specific syntax. However, the field annotation has the problems of non-standardization and the like, so that the understanding cost of the target large model is increased, and therefore, the embodiment of the application can normalize the field annotation information. The normalization processing of the field annotation information can include format normalization, logic normalization and the like. The format normalization is used for normalizing the text format of the field annotation information, and the logic normalization can be processed from the semantic level, so that the annotation content is more unified and meaningful logically.

Alternatively, the format normalization may include, for example, case-to-case conversion to a unified format, removal of special symbols, removal of extra spaces, removal of nonsensical characters, unified date format, unified numerical format, correction of splice errors, and the like. The logical normalization may include, for example, adjusting text order, supplemental information, and the like. Specific implementation of normalization of text the present application is not limited in this regard.

The entity related information of the entity corresponding to the data table extracted from the document data can comprise the data field of the entity, proper noun interpretation information of the related proper noun, related application field and other information which can assist in understanding the data assets in the asset knowledge base. Alternatively, entity-related information of entities associated with the data table may be extracted from the document data using a large model.

The entity related information, such as the data domain, proper noun interpretation information of the related proper noun, the related application domain, and the like, can be used as one table description information of the corresponding data table or be combined into table annotation information after the normalization processing of the data table.

In addition, the entity-related information extracted from the document data may also participate in normalization processing of table annotation information and field annotation information, keyword extraction, and the like.

In some embodiments, the normalizing process is performed on the table annotation information of the data table, so that the table annotation information after normalization is used as a table description information, and the table index keyword is extracted from the table annotation information after normalization, so that the table annotation information after normalization can be used as a table description information by combining the large model with the entity related information, and carrying out normalization processing on the table annotation information of the data table, and extracting the table index key words from the table annotation information after normalization processing to obtain the table annotation information and the table index key words after normalization processing, wherein the table annotation information and the table index key words can be used as table description information. Alternatively, the entity-related information and the table annotation information may generate corresponding hint information to be input into the large model, so that the large model obtains the normalized table annotation information and the table index keyword, etc.

The normalizing process is performed on the field annotation information of the data table, so that the field annotation information after the normalizing process is used as field description information, and the field index keyword is extracted from the field annotation information after the normalizing process to be used as field description information, wherein the large model is combined with the entity related information, the normalizing process is performed on the field annotation information of the data table, and the field index keyword is extracted from the field annotation information after the normalizing process to obtain the field annotation information and the field index keyword after the normalizing process, and the field annotation information and the field index keyword can be used as field description information. Alternatively, the entity-related information and the field annotation information may generate corresponding hint information to be input into the large model, so that the large model obtains the table annotation information and the table index keyword after normalization processing, and the like.

In some embodiments, the knowledge base construction method may further include:

Downstream application information of the data table is identified from the database data, and the downstream application information is used as table description information of the data table.

Downstream applications may refer to applications or systems that operate on or further process data in a data table. The downstream application information may include identification information of the applications or systems, processing type, processing results, and the like. Downstream application information is also important for the target large model understanding database. Therefore, in the embodiment of the present application, the downstream application information is also used as a table description information.

Different downstream application information may reflect different degrees of importance of the data table. For example, a data table used by a core application, such as an order processing system or a customer relationship system, may indicate that the data table is important, while a data table used only by a secondary application, such as a log analysis tool or a statistical reporting system, may be less important. By using the downstream application information of the data table as table description information, understanding of the data table by the target large model can be facilitated and enhanced.

Furthermore, since each data packet in the database may represent a set of entities, i.e., a collection of entities of the same type, each row (record) in the data table represents a specific instance of an entity, and each column (field) represents an attribute of an entity. Therefore, the downstream application information of the entity corresponding to the data table can be extracted from the document data, and the downstream application information of the entity can be stored in the downstream application information data table.

In some embodiments, the knowledge base construction method may further include identifying downstream application information of the fields in the data table from the database data, and using the downstream application information as a field description information.

According to the field corresponding to the data table data used by the downstream application information, the downstream application information can be used as the downstream application information of the field. The downstream application information of the data table may include downstream application information of different fields.

In some embodiments, the knowledge base construction method may further include identifying the entity and the application domain corresponding to the entity in the data table from the database data, and using the entity and the application domain as a table description information of the data table or merging the entity and the application domain into the table annotation information after the normalization processing.

Wherein, in the database, an entity may refer to an object in the database that is used to represent information that needs to be stored, such as "customer," "order," "product," etc. Entities are typically represented in databases as tables, each table representing a set of entities, each row (data record) in the table representing a particular instance of an entity, and each column (field) representing an attribute of the entity, so that the entity can be identified by analyzing the structure, fields, table annotation information, etc.

An application domain may refer to the role or use of an entity in the production process of a data provider, e.g., an "order" entity may relate to a "sales management" domain, and a "customer" entity may relate to a "customer service" domain. The application domain of the identified entity map may be determined by pre-configuring the application domain of the different entity map.

In practical application, the entity and the application field thereof are important for understanding the database for the target large model, so that in one possible implementation mode of the application, the identified entity and the application field corresponding to the entity can be respectively used as table description information of a data table.

In another possible implementation manner of the present application, the entity and the corresponding application domain may be incorporated into the normalized table annotation information, so as to use the table annotation information as a table description information.

By identifying the entities in the data table and the corresponding application fields thereof and taking the information as table annotation information, the role and importance of the data table can be more comprehensively described, and the object large model is helped to understand the database.

In addition, the application domain of the entity associated with the data table can be extracted from the document data.

In some embodiments, extracting the table index key from the table annotation information after normalization processing as a table description information may be implemented by extracting the table index key from the table annotation information after normalization processing and the field annotation information using the first large model as a table description information;

in some embodiments, extracting the field index key from the field annotation information after normalization as a field description information includes extracting the field index key from the field annotation information after normalization as a field description information using a first large model.

In the embodiment of the application, the capability of the large model can be utilized to extract keywords from the table annotation information and the field annotation information after normalization processing. Because the large model can understand the semantic relation in the natural language, important keywords can be more accurately identified, and the accuracy and efficiency of keyword extraction are improved.

When keyword extraction is performed on the table annotation information and the field annotation information after normalization processing by utilizing the capability of the first large model, the Prompt (Prompt word) can be designed first, and the Prompt can help the large model understand the input intention so as to guide the large model to generate specific output data. Promt is a form of input that prompts or directs a large speech model to give an output that meets expectations. What action should be taken or what output should be generated when used to instruct the large language model to perform a particular task. Prompt is a natural language input, similar to a command or instruction, to let the large language model know what it needs to do.

Alternatively, the first hint word may be generated according to the first hint template based on the table annotation information and the field annotation information after the normalization process, and the first hint word may be input into the first large model to extract the table index keyword using the first large model.

Alternatively, a second hint word may be generated according to a second hint template based on the field annotation information after normalization processing, and the second hint word is input into the first large model to extract the field index keyword using the first large model.

In addition, since there may be some specific words such as professional words in the table annotation information or the field annotation information, these professional words may be reserved as keywords.

Therefore, the above extraction of the table index keyword from the table annotation information and the field annotation information after the normalization processing using the first large model may be to extract the table index keyword from the table annotation information and the field annotation information after the normalization processing using the first large model and to retain a specific type of vocabulary as the table index keyword.

The extracting the field index keyword from the field annotation information after normalization processing by using the first large model as one type of field description information may be extracting the field index keyword from the table annotation information and the field annotation information after normalization processing by using the first large model, and reserving a specific type of vocabulary as the field index keyword.

The particular type of vocabulary may include, for example, time-type vocabulary, behavior-type vocabulary, and/or proper nouns, among others.

For ease of understanding, the first hint word may be, for example:

{

' existing physical table annotation \ { table } \, table contains field annotation \ { filtered }, please refine core keywords for the following text \ { table }, { filtered }, etc.

If keywords of proper nouns such as p4p, AB3, DUV, 2N, 4PL, 3PL, app are used (case is ignored), reservation is needed

If the keywords such as last 1 day, last 7 days, last 30 days, last 90 days, and financial year accumulation are defined with respect to time, the keywords need to be reserved \

If keywords such as account hanging, inquiry, business opportunity, ab3 and the like are limited, the keywords need to be reserved and list annotation keyword modification needs to be added

The extracted keywords require field annotation as core content \

The length of the key word is not more than 20\

Finally please return information only in json format, keyword: 'keyword'

}

In addition, in some embodiments, table basic information extracted from database data such as metadata, such as table names, table types, and the like, may also be used as table description information, and field basic information extracted from database data such as metadata, such as field names, table names of the belonging data tables, and the like, may also be used as table description information.

Further, for some predetermined types of fields, a specified amount of field data may be selected as one type of field description information. The field data may help the big model understand the data output form, etc., and the predetermined type may refer to a String type, for example, for a String (String) type field, such as whether the field name is XX product, whether the field data is Yes or No. Yes or no can help the target big model understand that the data output form of the field is expressed in Chinese instead of letters such as Y or N.

For example, for a field of a predetermined type, a repeat may be avoided by first performing a repeat removal (distinct) operation to obtain a different value of the field, i.e., field data, and thus find a different enumerated value of the field. For example, a field may have only a few fixed values (e.g., a status field may have only two values, "enabled" and "disabled"), these enumerated values may be obtained through a deduplication operation, and all possible values of the field are identified.

After all the enumerated values of the field are obtained, several values with the highest frequency of occurrence, for example, the first three values with the highest frequency of occurrence, that is, field data, may be selected as a field description information.

Of course, the field data of the preset type can also be combined into the field annotation information after normalization processing, so that the field annotation information not only contains concise core keywords, but also is attached with main enumeration values of fields, so that the field annotation information is more visual and easier to understand, the quality of the data is improved, and the understandability and operability of the data are enhanced.

In addition, the predetermined type of field data may be directly used as a field index key, and used as field description information.

In some embodiments, identifying table relationship information corresponding to a data table from historical query sentences of database data and document data can be specifically realized by analyzing the historical query sentences of the database data by using a second large model to extract the table relationship information corresponding to the data table, extracting entity related information of a data table associated entity from the document data, wherein the entity related information comprises entity relationship information and entity operation information, and identifying the table relationship information corresponding to the data table from the entity relationship information and the entity operation information.

In the embodiment of the application, the historical query statement of the database data can be analyzed by utilizing the capability of the large model so as to extract the table relation information corresponding to the data table.

Alternatively, a third hint word may be generated based on the historical query statement, and the third hint word is input into the second bigram to extract table relationship information corresponding to the data table.

For ease of understanding, the third hint word may be, for example:

please analyze the following SQL, ask to find out the related tables in SQL and the corresponding association, please return in the following format

{

'table1':table_name1,

'table2':table_name2,

"rel":table_name1.col=table2.col

}

[ SQL info ]:

select*from(select DISTINCT t1.channel_name,t1.domain_account,t1.nam,t1.work_no,t1.total_job_days,t1.regist_date,t1.manager_org_id,t1.manager_org_name,t1.super_show_name,t1.is_dimission,t1.inner_org_id from icbucdm.dim_en_sales_line_user t1 where t1.ds＝'${bizdate}'and t1.is_sales＝'Y')t1

left join(select distinct manager_org_id,manager_org_name,region_org_name,area_org_name,crm_new_zone from icbucdm.dim_en_crm_org where ds＝'${bizdate}'

)t2 on t1.manager_org_id=t2.manager_org_id

left outer join icbudwa.dwm_en_crm_sales_profile_d t3

on t1.domain_account=t3.crnt_owner_id

and t3.ds=max_pt('icbudwa.dwm_en_crm_sales_profile_d')

The SQL information is a historical query statement, and the SQL information describes that data are extracted from three data packets and connected, and comprises three data tables, namely a data table icbucdm.dim_en_samples_line_user (alias t 1), a data table icbucdm.dim_en_crm_org (alias t 2) and a data table icbucdwa.dwm_en_crm_samples_profile_d (alias t 3).

The main query logic of the above SQL information is that t1 is associated with t2 by field manager_org_id and t1 is associated with t3 by field domain_account and field crnt_own_id,

Thus, the output data of the second largest model is:

{

'table1':icbucdm.dim_en_sales_line_user

'table2':icbucdm.dim_en_crm_org

"rel":t1.manager_org_id=t2.manager_org_id

}

{'

table1':icbucdm.dim_en_sales_line_user

'table2':icbudwa.dwm_en_crm_sales_profile_d

"rel":t1.domain_account=t3.crnt_owner_id

}

The output data of the second large model is in a JSON format, the extracted table relation information is a table association relation, the table association relation comprises that an association relation exists between a data table t1 and a data table t2, the data table t1 is associated to a main key manager_org_id of the data table t2 through an external key manager_org_id, and the data table t1 and the data table t2 have an association relation, and the main key crnt_wner_id of the data table t3 is associated through an external key domain_account.

The first large model and the second large model can be the same large model or different large models, which means a machine learning model with a large number of parameters and complex structures, can process mass data, and can complete various complex tasks such as natural language processing, computer vision, voice recognition and the like.

The first large model or the second large model may employ a current open source large language model (Large Language Model, abbreviated as LLM), such as GPT-3 (GENERATIVE PRE-Trained Transformer-3, third generation pre-training model), GPT-4 (GENERATIVE PRE-Trained Transforme-4, fourth generation pre-training model), BERT (Bidirectional Encoder Representation from Transformers, a Transformers-based bi-directional encoder model), turing NLG (Turing Natural language Generation ), and the like. The application is not limited in this regard.

As can be seen from the above description, the entity related information may include entity operation information, entity relationship information between entities, and the like, and in the database, the entities are typically represented as data tables, each data table may represent a set of entities, that is, a set of entities of the same type, each row (record) in the data table represents a specific entity instance, and each column (field) represents an attribute of the entity. Therefore, table relation information corresponding to the data table can be identified from the entity operation information and the entity relation information.

The entity operation information may be an action operation performed by the pointer on the entity. The association relationship or the blood relationship between the related data tables or the blood relationship between the related fields can be determined according to the association relationship information or the blood relationship information and the like between the related data tables or the fields related to the entity operation information. Since each entity can correspond to one data table, the entity relationship information can also represent the corresponding table relationship information.

Alternatively, entity operation information, entity relationship information and the like may be extracted from the document data by using the large model, corresponding prompt information may be generated based on the document data, and the prompt information may be input into the large model to obtain entity related information such as entity operation information, entity relationship information and the like.

Alternatively, the downstream application information of the data table and the downstream application information of the field may be obtained by identifying from metadata, table annotation information and/or field information using a large model.

In some embodiments, generating the table description information of the data table and the field description information of the data table in the database data based on the database data may be specifically implemented by generating the table description information of the data table and the field description information of the data table in the database data using a third large model based on the table annotation information of the data table in the database data, the fields and the field annotation information included in the data table, and the metadata.

In an embodiment of the present application, large model capabilities may be utilized to generate table description information of a data table in database data and field description information of the data table based on table annotation information of the data table in database data, fields and field annotation information contained in the data table, and metadata.

Alternatively, table description information may be generated based on table annotation information of a data table in the database data, fields and field annotation information contained in the data table, and metadata, and then field description information may be generated based on the fields and field annotation information contained in the data table using the third large model.

A fourth hint word may be generated based on the table annotation information, the field annotation information, and the metadata, the fourth hint word is input to the third big model to generate the table description information, a fifth hint word is generated based on the field annotation information and the metadata, and the fifth hint word is input to the third big model to generate the field description information.

For ease of understanding, the fourth prompting word may be, for example, it should be noted that this is merely illustrative, and the present application is not limited thereto:

{

”'

desired # # # (EXPECTATION) \n\ \

Please finish the summary extraction, keyword extraction, importance degree judgment and problem judgment of the Chinese description of the data table according to the metadata of the given data table, output the Chinese description, the keyword extraction, the importance degree judgment reason and the problem six parts of the table according to the format in the output example, wherein the output comprises basic understanding, use scenes, keyword extraction, importance degree judgment reasons and the problem six parts of the table, and the six parts are not required to be in the presence of and thickened. N \

The # # ROLE (ROLE) \n\

You are a data management expert responsible for understanding and analyzing the table information in the data warehouse to ensure efficient understanding, utilization, management and lookup of data. N \

ACTION # # # ACTION (ACTION) \n \

1. Metadata is understood by carefully reading and understanding the results of the two columns of fields of "manually extracted keywords" and "manually judged whether important" while referencing the provided [ -proper nouns ] to understand deeply the [ -metadata information ] of each data table. Particular attention is drawn to the content with "×" the number of the images. N \

-For [ metadata ]

Metadata requiring a comprehensive understanding of table and field levels;

focusing on the information identified by "/quadrant"

Learning the method of extracting key words by ' manually extracting key words ', and learning the importance degree judging method that the judging result is ' Y ' in ' manually judging whether the key words are important \

The field level information should be used to further refine the summary. N \

2. Summary refinement of the table chinese description: n \

The method is divided into two parts of "basic understanding" and "method of use: \

Basic understanding is that all acquired metadata information is integrated, the main content of the table is summarized, chinese annotation of the repeated table is avoided, the important reference table annotation, the name of the source table, the information described by the source table are emphasized, the full quantity/increment judgment of the data table is not involved, and the description is specifically described but not generally refers to a B2B electronic commerce platform, and \

The usage method is specifically described how to use the table, and the usage scenario in the measurement, the limit information, the label and the dimension description in the reference ". Times.quadrant". Times.is understood according to the field and the table description if not. N \

3. Keyword extraction: n \

Extracting key words from list annotation information, list name of source list, list description of source list, attribution data field and list supplementary description information, and facilitating quick positioning of user

If the content exists in the 'manual keyword extraction', the keyword is needed to be included and the extraction method of the keyword is learned

Reference keywords including merchant image, commodity image, buyer image, product dimension table (omitted here, set in connection with actual situation),/

The length of the key words is more than or equal to 2 characters, and a plurality of key words are separated by the keys

The method is characterized by comprising the following steps of (1) not appearing proper nouns such as application scenes, creators, index values, descriptions and the like which are too wide, not appearing common nouns such as names, contact ways and the like which are used for carrying main bodies such as sales names, sales contact ways, member contact ways and the like, not appearing fuzzy keywords such as organization relations and the like, and being capable of being in global sales organization relations and international organization relations. N \

4. Judging the importance degree: n \

Three levels of "important", "medium" and "unimportant" are used to evaluate the importance of a table, based on the assets in the table (assets are important), recent access times (access times exceeding 1000 is considered more), priority (7 is highest), number of downstream nodes (more table is important), table supplemental description information (not empty is relatively more important), number of partitions (partition number is generally more important), last modification time (later is important, now 2026), and the like

Learning the result of the judgment of the importance degree of the column value of 'Y', wherein 'is the value of' human judgment \

-Providing a reason for the judgment. N \

5. Source table problem identification: n \

Judging whether there is a problem such as no longer recommended use, abandoned, not updated for a long time, containing more bin modeling terms which are not easy to understand, and the like based on the table and field information

-Bypassing directly the output if an illegal situation is found. N \

6. Proper noun understanding: n \

Understanding the following proper nouns \

Duv detail page access (detail unique visitor); \

-Ab active buyer (active buyer) \

Ab pro, deep active buyers (advanced version of ab, calculated by algorithm);

(. Cndot. Cndot.) this is the part is omitted from the description and the part is not needed, can be set according to the actual situation

7. Output example \n\

Taking the 'icbucdm_en_crm_ ctrct _ ggs _ord_stat_di' table as an example, the table is strictly output in the following format, the title is not thickened, the symbol is not carried, and the English name of the table is \n\is not carried

1. It is basically understood that the table slightly summarizes member contract order information of the international merchant, including the most recent order information of the merchant, first order information, service information, and the like. \

2. The use scene is mainly used for inquiring the order information of international merchants and identifying information such as customers, the latest order, the earliest order and the like in international service. \

3. Keyword extraction, namely international order mild statistics, international merchant order information and international service clients. \

4. And judging the importance degree. \

5. The importance degree judging reason is that the table is configured with data quality inspection, and the priority is higher, and the downstream calling times are frequent. \

6. Problems with the tables table notes describe many terms of art for modeling in several bins, which is not easy to understand. N \

8. Metadata { tableMsg } \n\

Please complete the corresponding tasks for each table in the database according to the guidelines and examples above. N \

”'

}

The third large model may be the same large model or different large models as the first large model or the second large model, and may be implemented using an open source large language model.

In some embodiments, as can be seen in conjunction with the foregoing description, the table relationship information includes table relationship information, table blood relationship information, field blood relationship information, and blood path information.

The knowledge base construction method can further comprise the steps of extracting relation index keywords corresponding to the table association relation information, the table blood edge relation information, the field blood edge relation information and the blood edge path information respectively, and storing the relation index keywords in the asset knowledge base in a corresponding mode with the table association relation information, the table blood edge relation information, the field blood edge relation information and the blood edge path information, wherein the relation index keywords are used for carrying out relation recognition.

After the table association relation information, the table blood-edge relation information, the field blood-edge relation information and the blood-edge path information are correspondingly stored in the asset knowledge base, the table association relation information can be matched through the relation index key words, for example, input data and the relation index key words are matched to determine target relation information matched with the input data, and the target relation information can be used as auxiliary information of the input data to be input into a target large model so as to realize retrieval enhancement.

In the application scenario that the asset knowledge base is applied to retrieval enhancement and the like, in order to facilitate data searching, table description information, field description information and table relation information can be converted into feature vectors and then stored in the asset knowledge base.

In addition, in practical application, since the data structures of the description information (including the field description information and the table description information) and the table relation information are not communicated, if the description information and the table relation information are stored together when the asset knowledge base is constructed, data pollution may be caused. Therefore, to further enhance data lookup and avoid data pollution, etc., the table description information and the field description information may be stored separately from the table relationship information, and in some embodiments, building the asset knowledge base based on the table description information, the field description information, and the table relationship information includes building an asset vector base based on the table description information and the field description information, and building a relationship vector base based on the table relationship information.

The table description information and the field description information can be converted into feature vectors to form an asset vector library, and the table relation information can be converted into feature vectors to form a relation vector library.

The asset knowledge base includes the asset vector base and the relation vector base.

In one embodiment of the application, the vectorization capability of the large model may be utilized to construct an asset vector library and a relationship vector library based on the extracted table description information, field description information, and table relationship information.

By vectorizing and storing the description information and the table relation information respectively, the data quality can be ensured, the data pollution can be avoided, the query and processing performance of a large model can be improved, and the availability and consistency of the data can be enhanced.

In some embodiments, the knowledge base construction method can further comprise outputting table description information, field description information and table relationship information, updating the table description information, the field description information or the table relationship information in response to an information update request for the table description information, the field description information or the table relationship information;

The constructing the asset knowledge base based on the table description information, the field description information, and the table relationship information may be constructing the asset knowledge base based on the table description information, the field description information, and the table relationship information in response to an information confirmation request for the table description information, the field description information, and the table relationship information.

The table description information, the field description information and the table relation information are output by sending the table description information, the field description information and the table relation information to related personnel for manual checking by the related personnel to update or confirm the information. The information update request and the information confirmation request may be triggered by a person.

It will be appreciated in light of the foregoing that data generated during production by a data provider may also be stored in document data, and thus the document data also contains much knowledge that may also assist in data understanding by a target large model. Document data may also exist in tables or fields that have no associated knowledge that may also aid in data understanding by the target large model and thus, in some embodiments, the method may further include:

identifying an object from the document data and object related information of the object;

The constructing an asset knowledge base based on the table description information, the field description information and the table relationship information includes constructing an asset knowledge base based on the table description information, the field description information, the table relationship information and the object related information.

Wherein the object may be a processing target in the data provider production process, such as "order", "commodity", etc.

The object related information may include, for example, an application domain to which the object belongs, a downstream application scene of the object, proper nouns related to the object, interpretation information thereof, related application domain, object operation information, and object relationship information. The object operation information may include a behavior operation procedure performed on the object, introduction information of an execution subject related to the behavior operation procedure, and an execution policy, and the execution subject may be a system that performs a certain behavior operation on the object by using a pointer, etc.

As can be seen from the above description, in the embodiment of the present application, the table description information may include one or more of the following information:

table name, table type, table index key, table annotation information, interpretation information of data table related proper nouns, and downstream application information of data table.

The table type may refer to, for example, a real table or a dimension table.

The field description information may include one or more of the following:

The field name, the table name of the data table, the field annotation information, the field index key words, the interpretation information of the proper nouns related to the fields and the downstream application information of the field table.

The table relationship information may include one or more of table association information, table blood-edge relationship information, field blood-edge relationship information, and field blood-edge path information.

Wherein, the table association relationship information may include one or more of the following information:

table name, association table name of association data table, foreign key field in data table, primary key field in association data table, association condition, association times and relationship index key.

The association condition may refer to, for example, an association manner of the foreign key field and the primary key field, such as that the foreign key field is the same as the primary key field, or that the foreign key field is larger or smaller than the primary key field, or the like.

The table blood relationship information may include one or more of the following:

The list name of the source data list, the target list name of the target data list, the blood relationship type, the relationship creation time, the relationship modification time and the relationship index key word.

The field blood-edge key information may include one or more of the following:

table names of source data table, field names in source data table, target table names in target data table, target field names in target data table, blood relationship type, relationship creation time, relationship modification time, and relationship index key.

The field blood-edge path information may include one or more of the following:

table name of source data table, field name in source data table, target table name of target data table, target field name in target data table, blood edge path, path depth and path index key.

The table description information, the field description information, the table relation information can be identified by data, extracted from database data and document data by means of a large model, and the like, and the specific implementation can be described in the corresponding embodiments and is not repeated herein.

The embodiment of the application can convert all available data of the data provider into content in the asset knowledge base, such as database data, document data and the like, and automatically analyze and organize the asset knowledge base by means of the modes of data identification, association relation construction, large model understanding and the like by means of the Prompt design, the metadata design, the product function design and the like, wherein the asset knowledge base can be directly connected with a target large model, and the target large model can be any large model for the data provider. The target large model may be pre-trained in connection with corresponding training data of the data provider, etc., which may be obtained by fine tuning a currently common large language model, etc.

In order to facilitate data lookup, the table description information, the field description information, and the table relationship information may be stored in a structured manner, etc., and thus, in some embodiments, building the asset knowledge base based on the table description information, the field description information, and the table relationship information may include generating a plurality of data tables based on the table description information, the field description information, and the table relationship information according to respective corresponding data structures, and building the asset knowledge base based on the plurality of data tables.

The data structure may define the storage of table description information, field description information, table relationship information, or the like in the form of tables to generate a plurality of storage tables that may constitute an asset knowledge base, where the storage tables may be stored in a database of the data provider, such as ODPS or HDFS, and when hot, may also be stored in other databases, such as MySQL or Hologres, or the like.

The data structure may define, among other things, the table names of the memory tables, the individual fields included, etc.

Alternatively, the first storage table may be created based on the table description information according to the first data structure, for example, in practical application, the first storage table may be created according to the following SQL statement, which, of course, is not limited thereto:

CREATE TABLE IF NOT EXISTS dwm_en_meta_table_detail_d(

table _ NAME STRING component 'table name',

Table _ comment STRING COMMENT 'table notation',

Table _ TYPE STRING com 'table type [ fact table, dimension table',

Keywords STRING COMMENT' table index key,

The proper noun interpretation 'referred to by the proper noun STRING COMMENT' table,

Downstream usage scenario (including downstream execution system, analysis report, etc.) of downstream stream_ used STRING COMMENT 'table'

)

Detailed information of COMMENT 'Table'

PARTITIONED BY

(

Ds STRING COMMENT 'partition'

)

LIFECYCLE 90;

Using the SQL statement, a table named dwm _en_meta_table_detail_d can be created. The table is used for storing description information about other tables, and the storage table can comprise the following fields:

table_name, a field storing a table name, and the data type is STRING.

Table_comment, storing table comment information, wherein the data type is STRING, and the table comment information after normalization processing can be specifically stored;

table_type, store table type, e.g. "fact table" or "dimension table", data type is STRING.

Keywords, which can help locate this table when searching or intent recognition, are stored, with the data type being STRING.

Property_no, the interpretation information of the proper noun is stored, and the data type is STRING.

Downstream_used, downstream application information is stored, and the data type is STRING.

In addition, the memory table has the following characteristics:

The store is partitioned according to a ds field, where ds is a STRING type field, typically representing the date. Such a partitioning strategy may improve the efficiency of querying for particular date range data.

The memory table is set to LIFECYCLE to 90, which means that the data partition of the memory table will automatically expire and be cleaned up 90 days after creation.

Alternatively, the second storage table may be created based on the field description information according to the second data structure, for example, in actual application, the second storage table may be created according to the following SQL statement:

CREATE TABLE IF NOT EXISTS dwm_en_meta_column_detail_d(

keywords STRING COMMENT 'keywords, query' for intent recognition similarity,

Column NAME STRING component 'field name',

Column comment STRING COMMENT 'field original notes',

Column enum STRING COMMENT 'field enumerations value',

Table _ NAME STRING component 'table name',

Proper noun interpretation 'referred to by the proper noun STRING COMMENT' field,

Downstream usage scenario (including downstream execution system, analysis report, index configuration, etc.) of downstream stream_ used STRING COMMENT 'field'

)

Detailed information of COMMENT 'field'

PARTITIONED BY

(

Ds STRING COMMENT 'partition'

)

LIFECYCLE 90;

Using the SQL statement, a second memory table named dwm _en_meta_column_detail_d can be created, which is used to store field denomination information in the database. The second memory table may include the following fields:

And (3) storing field index keywords which can be used for intention recognition and similarity query, helping the content and the purpose of splitting.

Column_name, storage field name, data type is STRING.

Column_comment, storing field annotation information, and setting the data type as STRING.

Column_enum if the column field has enumerated values, the enumerated values, i.e., field data, are stored, with the data type being STRING.

Table_name, which stores the table name of the data part to which the field belongs, and the data type is STRING.

Proper_NOUN, proper nouns and interpretation information related to storage field, and data type is STRING.

The second memory table may have the following characteristics:

partitioning is performed according to the ds field, where ds is a STRING type field, typically representing the date. Such a partitioning strategy may improve the efficiency of querying for particular date range data. The memory table is set to LIFECYCLE to 90, which means that the data partition of the memory table will automatically expire and be cleaned up 90 days after creation.

Alternatively, the third storage table and the fourth storage table may be created based on the table association relationship information according to the third data structure, for example, in practical application, the third storage table may be created according to the following SQL statement:

CREATE TABLE IF NOT EXISTS dwm_en_meta_primary_foreign_key_d(

keywords STRING COMMENT 'keywords, query' for intent recognition similarity,

The foreign key col NAME STRING com foreign key field',

The foreign_key_table_ NAME STRING COMMENT 'foreign key physical table',

Primary key col NAME STRING com main 'key field',

Primary_key_table_ NAME STRING com 'primary key physical table',

)

COMMENT 'primary foreign key information'

PARTITIONED BY

(

Ds STRING COMMENT 'partition'

)

LIFECYCLE 90;

The SQL statement can be used for creating a memory table with a table name dwm _en_meta_primary_form_key_d, namely a third memory table, which is used for storing the table association relation. The fields referred to in the third memory table may include:

and storing relation index keywords, namely, keywords related to a main key or an external key, which are used for intention recognition and similarity query, wherein the data type is STRING.

Foreign_key_col_name, the field name of the memory foreign key field, and the data type is STRING.

The table name of the data table to which the foreign key field belongs is stored, and the data type is STRING.

Primary_key_col_name, field name of the stored primary key field, data type is STRING.

Primary_key_table_name, the table name of the data table to which the primary key field belongs is stored, and the data type is STRING.

The third memory table may also be characterized by partitioning according to a ds field, where ds is a STRING type field, typically representing the date. Such a partitioning strategy may improve the efficiency of querying for particular date range data. The memory table is set to LIFECYCLE to 90, which means that the data partition of the memory table will automatically expire and be cleaned up 90 days after creation.

For example, in practical application, the fourth storage table may be created according to the following SQL statement:

CREATE TABLE IF NOT EXISTS dwm_en_meta_table_associate_d(

keywords STRING COMMENT 'keywords, query' for intent recognition similarity,

Table _ NAME STRING component 'table name',

Associate _table_ NAME STRING component 'associated table name',

Associate _ DETAIL STRING component 'associated condition',

Associate _ cnt BIGINT COMMENT 'associated number of times',

)

Association relationship between COMMENT tables'

PARTITIONED BY

(

Ds STRING COMMENT 'partition'

)

LIFECYCLE 90;

A table named dwm _en_meta_table_ associate _d, that is, a fourth table, may be created using the SQL statement, and the fields involved in the fourth table may include:

and (3) storing relation keywords related to the table association relation information for intention recognition and similarity query, wherein the data type is STRING.

Table_name, the table name of the stored data table, and the data type is STRING.

Associate _table_name, the table name of the associated data table is stored, and the data type is STRING.

Associate _delete, store association condition, data type is STRING.

Associate _cnt, storing the association times, and setting the data type as BIGINT.

The fourth memory table may also have the property of being partitioned by a ds field, where ds is a STRING type field, typically representing the date. Such a partitioning strategy may improve the efficiency of querying for particular date range data. The memory table is set to LIFECYCLE to 90, which means that the data partition of the memory table will automatically expire and be cleaned up 90 days after creation.

Alternatively, the fifth storage table may be created based on the table blood relationship information according to the fifth data structure, for example, in practical applications, the fifth storage table may be created according to the following SQL statement:

CREATE TABLE IF NOT EXISTS dwm_en_meta_table_relation_d(

keywords STRING COMMENT 'keywords, query' for intent recognition similarity,

The src_table_ NAME STRING COMMENT 'source table name',

Dest_table_ NAME STRING COMMENT 'target table name',

The relationship_ TYPE STRING COMMENT 'relationship source type is 1 direct mapping 2 function conversion, 3.Join', create_ TIME STRING COMMENT 'relationship creation time',

Modification_ TIME STRING COMMENT 'relationship modification time'

)

COMMENT 'surface particle size blood margin'

PARTITIONED BY

(

Ds STRING COMMENT 'partition'

)

LIFECYCLE 90;

A table named dwm _en_meta_table_relation_d, that is, a fifth table, may be created by using the SQL statement to store table blood-edge relationship information, and the related fields may include:

And (3) storing relation keywords of the list blood-reason relation for intention recognition and similarity query, wherein the data type is STRING.

Src_table_name, the table name of the storage source data table, and the data type is STRING.

Dest_table_name, the table name of the storage target data table, and the data type is STRING.

Relation_type, store relation source type, such as direct mapping, function conversion or JOIN operation, data type is STRING.

Create_time, store relationship creation time, data type is STRING.

Modification_time, the last modification time of the storage relationship, and the data type is STRING.

The fifth memory table has the property that the partitioning is done according to the ds field, where ds is a STRING type field, typically representing the date. Such a partitioning strategy may improve the efficiency of querying for particular date range data. The memory table is set to LIFECYCLE to 90, which means that the data partition of the memory table will automatically expire and be cleaned up 90 days after creation.

Alternatively, the sixth storage table may be created based on the field blood relationship information in accordance with the sixth data structure, for example, in practical applications, the sixth storage table may be created in accordance with the following SQL statement:

CREATE TABLE IF NOT EXISTS dwm_en_meta_column_relation_d(

keywords STRING COMMENT 'keywords, query' for intent recognition similarity,

The src_table_ NAME STRING COMMENT 'source table name',

The src_column_ NAME STRING component 'source field name',

Dest_table_ NAME STRING COMMENT 'target table name',

Dest_column_ NAME STRING component 'destination field name',

Modification_ TIME STRING COMMENT 'relationship modification time'

)

COMMENT 'model metadata- -field granularity blood margin'

PARTITIONED BY

(

Ds STRING COMMENT 'partition'

)

LIFECYCLE 90;

A table named dwm _en_meta_column_relation_d, that is, a sixth table, may be created by using the SQL statement to store the blood-edge relationship information of the fields, and the related fields may include:

And (3) storing the relation keywords of the field blood-edge relation information for intention recognition and similarity query, wherein the data type is STRING.

Src_table_name, the table name of the storage source data table, and the data type is STRING. Source data tables, i.e. initial data tables in the blood-edge path

Src_column_name, field name of storage source field, data type is STRING.

Dest_table_name, the table name of the storage target data table, and the data type is STRING. The target data table, i.e., the final data table in the blood-edge path.

Dest_column_name, field name of storage target field, data type is STRING.

And (3) the relation_type is used for storing the associated source type, and the data type is STRING.

Create_time, store relationship creation time, data type is STRING.

The sixth memory table has the property that the partitioning is done according to the ds field, where ds is a STRING type field, typically representing the date. Such a partitioning strategy may improve the efficiency of querying for particular date range data. The memory table is set to LIFECYCLE to 90, which means that the data partition of the memory table will automatically expire and be cleaned up 90 days after creation.

Alternatively, the seventh storage table may be created based on the blood-edge path information in accordance with the seventh data structure, for example, in actual application, the seventh storage table may be created in accordance with the following SQL statement:

CREATE TABLE IF NOT EXISTS dwm_en_meta_column_path_d(

keywords STRING COMMENT 'keywords, query' for intent recognition similarity,

The src_table_ NAME STRING COMMENT 'source table name',

The src_column_ NAME STRING component 'source field name',

Dest_table_ NAME STRING COMMENT 'target table name',

Dest_column_ NAME STRING component 'destination field name',

Full _ id _ PATH STRING com 'full path',

Full_id_path \u dept BIGINT COMMENT 'path depth'

)

COMMENT 'field Path'

PARTITIONED BY

(

Ds STRING COMMENT 'partition'

)

LIFECYCLE 90;

A table named dwm _en_meta_column_path_d, that is, a seventh table, may be created using the SQL statement to store the blood-edge path information, and the related fields may include:

And (3) storing relation keywords of blood-edge path information for intention recognition and similarity query, wherein the data type is STRING.

Src_column_name, field name of storage source field, data type is STRING.

Dest_column_name, field name of storage target field, data type is STRING.

Full_id_path, the complete blood path of the field may include the table name of each data table related to the path and the field name of each field, and the data type is STRING.

Full_id_path_ dept path depth, data type BIGINT. The path depth refers to, for example, the number of related fields, etc.

The seventh memory table also has the property that the partitioning is done according to the ds field, where ds is a STRING type field, typically representing the date. Such a partitioning strategy may improve the efficiency of querying for particular date range data. The memory table is set to LIFECYCLE to 90, which means that the data partition of the memory table will automatically expire and be cleaned up 90 days after creation.

It should be noted that, the foregoing merely illustrates a possible storage manner of the table description information, the field description information and the table relation information, which may be in a table form, or any other form such as a json (a file format and a data exchange format) format text, the vector features after vectorization may also be stored according to different formats as required, or reference may be made to the characteristics of different selected vector libraries, or the storage manner may also be determined by combining the characteristics of the target large model. The present application is not limited thereto. In addition, the object related information extracted from the document data is stored according to different formats, such as a table form, or any other form, etc., or may refer to the characteristics of different selected vector libraries, or may be combined with the characteristics of the target large model to determine a storage mode, etc., or may be converted into vector features and then stored.

For ease of understanding, fig. 2 shows an overall flowchart of a knowledge base construction process according to an embodiment of the present application.

As shown in fig. 2, when constructing the asset knowledge base, first, database data and document data corresponding to the data provider may be determined, and the database data may include data table data and metadata.

Thereafter, the operations of step 201 may be performed to evaluate the quality of the database data and the document data to screen the database data and the document data for quality requirements to obtain high quality data assets.

In one embodiment of the present application, the quality assessment may be performed using a data asset screening model constructed based on a large model, i.e., the acquired data may be input into the data asset screening model, which, after screening, outputs high quality data assets meeting quality requirements.

High quality data assets can be categorized into data table data, table relationship information maintained in metadata, historical query statements, and document data.

In addition, the data assets which do not meet the quality requirements can be corrected through a manual maintenance mode, so that manually maintained data assets (including data table data, document data and the like) are obtained, and in addition, table relation information can be provided through a manual configuration mode, so that manually configured table relation information is obtained.

Thereafter, for these high quality data assets, the operations of step 202 may be performed for entity identification and application domain mapping, and the operations of step 203 may be performed for table annotation information normalization processing and table index keyword extraction and field annotation information normalization processing and field keyword extraction, and the operations of step 204 may be performed for table relationship information identification, etc. In addition, for the document data, step 205 may be performed to perform object recognition, from which objects and object related information are recognized.

Therefore, knowledge base contents such as table description information, field description information, table relation information and object related information can be obtained through the steps 202-205.

The table description information, field description information, table relation information and object related information can also be output to related personnel for manual sampling examination, updating, confirmation and the like.

Thereafter, in step 206, table description information and field description information may be used to construct an asset vector library, table relationship information may construct a relationship vector library, and object related information may construct a knowledge vector library.

The asset vector library, the relationship vector library and the knowledge vector library can be used as asset knowledge libraries to interface with the target large model.

The asset knowledge base can be used for butting a target large model, and the technical scheme of the embodiment of the application is described below from the application angle of the asset knowledge base respectively:

FIG. 3 is a flow chart of one embodiment of a data processing method provided by an embodiment of the present application, which may include the steps of:

301 determining an asset knowledge base.

302, Obtaining target knowledge information matched with input data from an asset knowledge base;

The target knowledge information may include target table description information, target field description information, and/or target table relationship information, among others.

In addition, in the case where object related information is included in the asset knowledge base, the target knowledge may also include target object information and the like.

And 303, inputting the input data and the target knowledge information into the target large model to obtain output data.

Alternatively, the asset knowledge base may be a general database, and knowledge information obtained from the asset knowledge base may be converted into a corresponding data format for processing in order to adapt to different target large models, in which case the target knowledge information may be converted into index and metadata forms for input into the target large models.

The construction process of the asset knowledge base may be described with reference to the embodiment shown in fig. 1, and will not be described herein.

In this embodiment, the asset knowledge base may interface with the target large model to implement RAG (RETRIEVAL-augmented Generation, search enhancement generation), and target knowledge information such as target table description information, target field description information, and/or target table relationship information may be used as auxiliary information of the input data to instruct generation of output data of the target large model.

The target knowledge information matched with the input data can be obtained through similarity calculation, and target table description information, target field description information, target table relation information and/or target object information with the similarity larger than a specified threshold or the maximum similarity is selected, wherein the similarity calculation can be realized through calculation modes such as Euclidean distance, cosine similarity and the like, and the application is not limited to the calculation modes.

Wherein the input data and the target knowledge information may generate a specific Prompt word (promt) according to a specific Prompt template, and input the target large model such that the target large model generates target data based on the specific Prompt word.

In practical applications, the input data may be, for example, a search request in a search scene, a use request in a use scene, or a query text in NL2SQL, etc., and of course, any data input by a user may be used in a scene such as a question-answering system, an intelligent assistant, etc., which is not limited in the present application.

In this embodiment, the input data and the information are combined by using the target knowledge information obtained from the asset knowledge base, so that a structured input set can be generated, the generated structured input set is input into the target large model, and necessary context information is provided for the target large model, so that the corresponding output data can be generated by using the reasoning capability of the target large model, the reasoning effect of the target large model is improved, and the accuracy and the efficiency of data processing are ensured.

FIG. 4 is a flowchart of another embodiment of a data processing method according to an embodiment of the present application, where the method may include the following steps:

401, determining an asset knowledge base;

And 402, constructing a training sample based on the table description information, the field description information and the table relation information in the asset knowledge base.

Optionally, in the case where the asset knowledge base further includes object related information, the training samples may be constructed based on table description information, field description information, table relationship information, and object related information in the asset knowledge base.

403, Training the large model of the target by using the training sample.

The construction process of the asset knowledge base may refer to the knowledge base construction method shown in fig. 1, and will not be described herein.

In this embodiment, the asset knowledge base interfaces with the target large model, which may be a training sample used to build fine-tuning to fine-tune the target large model. The target large model can be a large model which is trained in advance, and can be a large model which is already trimmed, and further trimming can be continuously performed on the basis of the previous trimming large model by adopting the technical scheme of the embodiment of the application so that the effect of the target large model is better.

Because the asset knowledge base can contain data in specific fields, such as proper nouns, industry terms, abbreviations and the like, the training sample is constructed by acquiring the table description information, the field description information and the table relation information from the asset knowledge base, and the training sample is used for further training and fine tuning the target large model, the target large model can learn the field knowledge contained in the asset knowledge base, and the reasoning capacity and knowledge reserve of the target large model in the specific fields are improved.

In some embodiments, constructing the training samples based on the table description information, the field description information, and the table relationship information in the asset knowledge base may be specifically implemented as:

And constructing training samples from the table description information, the field description information and the table relation information in the asset knowledge base based on at least one sample example by using a fourth large model, wherein the sample example comprises sample input data and sample training labels.

That is, training samples of the target large model may also be obtained from the asset knowledge base with the aid of large model capabilities. The sample examples may be used as Context information for the fourth large model, and Context Learning (ICL) may enable the fourth large model to infer based on some sample examples to generate input data, i.e., as input data In the training samples, and output data as training labels In the training samples.

Optionally, respective hint words may be generated based on at least one sample instance to input a fourth large model such that the fourth large model learns patterns from among the sample instances to generate training samples and builds the training samples from an asset knowledge base. For ease of understanding, the hint word may be:

{ please refer to the following example for generating training samples:

"campt": "ggs revenue- >" and "completion": n\ nggs are abbreviations for overseas gold plate members, representing merchants who purchased the member product package. N revenue includes member fee revenue, marketing revenue and other value added revenue. N\n thus ggs earnings refer to member fee earnings plus marketing earnings plus other value added earnings of overseas merchants "}

In the above example, "sample" is the sample input data, and "completion" is the sample training label. It should be noted that the foregoing examples are merely illustrative of possible organization of the sample examples, and the present application is not limited thereto.

The fourth biggest model may be a machine learning model with a large number of parameters and a complex structure, and may be capable of processing massive data to complete various complex tasks, such as natural language processing, computer vision, speech recognition, and so on. It can be implemented by using any large language model of the current open source, and the application is not limited to this.

Optionally, the knowledge information obtained from the asset knowledge base may be converted into a corresponding data format for processing, in this embodiment, the knowledge information in the asset knowledge base may be converted into json format, so that the fourth large model generates a training sample in josn format, and the training sample may be used to perform fine tuning processing on the target large model. In this embodiment, by using the reasoning, data analysis and processing capabilities of the large model, high-quality table description information, field description information and table relation information can be extracted from the complex data of the asset knowledge base, so as to construct a training sample, and training the target large model by using the high-quality training sample can improve the training quality.

FIG. 5 shows a flowchart of yet another embodiment of a data processing method provided by an embodiment of the present application, which may include the steps of:

501, processing a query request by utilizing a target large model;

and 502, in the case that the query request relates to database data, querying target knowledge information from an asset knowledge base corresponding to the database data.

The object knowledge information may include object table description information, object field description information, object table relationship information, and/or object information in the case where object related information is included in the asset knowledge base.

503, Generating a query result based on the target knowledge information;

504, taking the query result as output data of the target large model.

A user initiated query request may require that specific data be retrieved and analyzed from a database. In this case, the large model may use the asset knowledge base, and table description information, field description information and/or table relation information in the asset knowledge base may be considered as metadata generated by rearranging database data, and by querying the asset knowledge base, data may be further queried in combination with target table description information, target field description information and/or target table relation information, and the target table description information, the target field description information and/or the target table relation information may be used as index information to reduce a query range or be used as auxiliary information to improve query accuracy.

The target table description information, the target field description information and/or the target table relation information matched with the query request can be searched in a similarity calculation mode.

Alternatively, knowledge information obtained from the asset knowledge base may be converted into a corresponding data format for processing, and in this embodiment, the asset knowledge base may be converted into a key-value format for query processing by the target large model, and so on. In this embodiment, the target large model accesses the asset knowledge base, and since the asset knowledge base stores table description information, field description information and table relationship information between tables of all data tables in the database. This information is critical to understanding the database structure and data content. The target large model may retrieve relevant table description information, field description information, and table relationship information between tables from the asset knowledge base based on the specific content of the query request. With the detailed table description information, field description information and table relation information among tables, the target large model can analyze the information by using an intelligent algorithm of the target large model, identify the relation among data, screen out key data conforming to a query request, and possibly further aggregate, sort or process the data in other forms so as to obtain a query result.

Fig. 6 is a schematic structural diagram of an embodiment of a knowledge base construction device according to an embodiment of the present application, where the device may include:

the data acquisition module 601 is configured to acquire database data and document data corresponding to a data provider, where the database data includes data table data and metadata in a database;

A description information generating module 602, configured to generate table description information of a data table in the database and field description information of the data table based on the database data and the document data;

A first information identifying module 603, configured to identify table relationship information corresponding to the data table from the historical query statement of the database data and the document data;

A knowledge base construction module 604, configured to construct an asset knowledge base based on the table description information, the field description information, and the table relationship information.

In some embodiments, the descriptive information generation module 602 includes:

The first quality evaluation sub-module is used for performing quality evaluation on the database data and the document data so as to screen the database data and the document data which meet the data quality requirement;

And the description information sub-module is used for generating table description information of a data table in the database data and field description information of the data table aiming at the database data and the document data meeting the data quality requirement.

In some embodiments, the knowledge base construction apparatus further comprises:

the first maintenance prompt information output module is used for outputting data maintenance prompt information aiming at database data which does not meet the data quality requirement;

And the data updating module is used for responding to the data adjustment request, updating the database data and taking the updated database data as the database data meeting the quality requirement.

and the second quality evaluation sub-module performs quality evaluation on the data model of the database data to determine whether the data model meets the model quality requirement.

The second maintenance prompt information output module is used for outputting the maintenance prompt information of the model under the condition that the data model does not accord with the model quality requirement;

And the model adjustment module is used for responding to the model adjustment request and updating the data model under the condition that the data model meets the model quality requirement.

and the data maintenance module is used for responding to the relation maintenance request aiming at the database data and determining manually configured table relation information.

In some embodiments, the first information identification module 603 includes:

And the first information identification sub-module is used for identifying the table relation information corresponding to the data table from the historical query statement of the database data, the table relation information maintained in the document data and the metadata and the manually configured table relation information.

the first normalization sub-module is used for carrying out normalization processing on the table annotation information of the data table, so that the table annotation information after normalization processing is used as table description information;

The first keyword extraction sub-module is used for extracting a table index keyword from the table annotation information after normalization processing to be used as table description information, wherein the table index keyword is used for matching the table description information;

The second normalization sub-module is used for carrying out normalization processing on the field annotation information of the data table so as to take the field annotation information after normalization processing as field description information;

The second keyword extraction sub-module is used for extracting field index keywords from the field annotation information after normalization processing to be used as field description information, wherein the field index keywords are used for carrying out field description information matching;

An entity information extraction sub-module, configured to extract entity related information related to the data table related entity from the document data;

And the table description information determining submodule is used for taking the entity related information as table description information of the data table.

a first downstream application identification module, configured to identify downstream application information of the data table from the database data, and use the downstream application information as table description information of the data table;

a second downstream application identification module, configured to identify downstream application information of a field in the data table from the database data, and use the downstream application information as a field description information;

the entity identification module is used for identifying the entity in the data table and the application field corresponding to the entity from the database data;

And the merging normalization module is used for taking the entity and the application field as table description information of the data table or merging the entity and the application field into table annotation information after normalization processing.

In some embodiments, the first keyword extraction submodule includes:

a first keyword extraction unit for extracting a table index keyword from the table annotation information and the field annotation information after normalization processing using the first large model as a kind of table description information;

in some embodiments, the second keyword extraction submodule includes:

and a second keyword extraction unit for extracting a field index keyword from the field annotation information after normalization processing using the first large model as a kind of field description information.

In some embodiments, the first information identification module 603 includes:

The table relation information extraction sub-module is used for analyzing historical query sentences of the database data by using the second large model so as to extract table relation information corresponding to the data table;

the related information extraction sub-module is used for extracting entity related information of a data table related entity from the document data, wherein the entity related information comprises entity relation information and entity operation information;

And the table relation information identification sub-module is used for identifying the table relation information corresponding to the data table from the entity relation information and the entity operation information.

And the description information generation sub-module is used for generating the table description information of the data table in the database data and the field description information of the data table by using a third large model based on the table annotation information of the data table in the database data, the fields and the field annotation information contained in the data table and the metadata corresponding to the data table.

In some embodiments, the table relationship information includes table association relationship information, table blood-edge relationship information, field blood-edge relationship information, and blood-edge path information.

The information extraction module is used for extracting relation index keywords corresponding to the table association relation information, the table blood edge relation information, the field blood edge relation information and the blood edge path information respectively, and storing the relation index keywords in the asset knowledge base in a corresponding mode with the table association relation information, the table blood edge relation information, the field blood edge relation information and the blood edge path information, wherein the relation index keywords are used for carrying out relation identification.

In some embodiments, knowledge base construction module 604 includes:

the first vector conversion submodule is used for converting the table description information and the field description information into vector features so as to construct an asset vector library;

and the second vector conversion sub-module is used for converting the table relation information into vector characteristics so as to construct a relation vector library.

In some embodiments, the table description information includes a table name, a table type, a table index key, table annotation information, and downstream application information of the data table;

The field description information comprises a field name, a table name of a data table to which the field description information belongs, field annotation information, a field index keyword, interpretation information of proper nouns related to the field and field downstream application information;

the table relation information comprises table association relation information, table blood edge relation information, field blood edge relation information and field blood edge path information;

The table association relation information comprises a table name, an association table name of an association data table, an external key field in the data table, a main key field in the association data table, association conditions, association times and relation index keywords;

The table blood edge relation information comprises a table name of a source data table, a target table name of a target data table, a blood edge relation type, relation creation time, relation modification time and a relation index keyword;

The field blood-edge key information comprises a table name of a source data table, a field name in the source data table, a target table name of a target data table, a target field name in the target data table, a blood-edge relation type, a relation creation time, a relation modification time and a relation index key word;

The field blood edge path information comprises a table name of a source data table, a field name in the source data table, a table name of a data table, a target table name of a target data table, a target field name in the target data table, a blood edge path, a path depth and a path index keyword.

In some embodiments, knowledge base construction module 604 includes:

The data table generating sub-module is used for generating a plurality of data tables according to the corresponding data structures and based on the table description information, the field description information and the table relation information;

And the first knowledge base construction sub-module is used for constructing an asset knowledge base based on the plurality of data tables.

A document identification module for identifying an object and object related information of the object from the document data;

in some embodiments, knowledge base construction module 604 includes:

and the second knowledge base construction sub-module is used for constructing an asset knowledge base based on the table description information, the field description information, the table relation information and the object related information.

Fig. 7 is a schematic structural diagram of an embodiment of a data processing apparatus according to an embodiment of the present application, where the apparatus may include:

a first knowledge base determination module 701, configured to determine an asset knowledge base;

A second information identifying module 702 for identifying target knowledge information from the asset knowledge base that matches the input data, the target knowledge information including target table description information, target field description information, and/or target table relationship information;

The data generating module 703 is configured to input a large target model based on the input data and the target table description information, the target field description information, and/or the target table relationship information, so as to obtain output data.

Fig. 8 is a schematic structural diagram of an embodiment of a data processing apparatus according to an embodiment of the present application, where the apparatus may include:

A second knowledge base determination module 801 for determining an asset knowledge base;

A sample construction module 802 for constructing training samples based on table description information, field description information, and table relationship information in the asset database;

model training module 803 is configured to train a large model of the target using the training samples.

In some embodiments, the sample construction module 802 includes:

And the sample construction sub-module is used for constructing training samples from the table description information, the field description information and the table relation information in the asset database by utilizing a fourth large model based on sample examples, wherein the sample examples comprise sample input data and sample training labels.

Fig. 9 is a schematic structural diagram of an embodiment of a data processing apparatus according to an embodiment of the present application, where the apparatus may include:

a request processing module 901, configured to process a query request using a target big model;

the query module 902 is configured to query, when the query request relates to database data, target knowledge information from an asset knowledge base corresponding to the database data, where the target knowledge information includes table description information, target field description information, and/or target table relationship information;

a result generating module 903, configured to generate a query result based on the target knowledge information;

And the data output module 904 is used for taking the query result as output data of the target large model.

In one possible design, an apparatus provided by an embodiment of the present application may be implemented as a computing device, as shown in fig. 10, which may include a storage component 1001 and a processing component 1002;

The storage component 1001 stores one or more computer instructions, where the one or more computer instructions are called by the processing component 1002 to implement a knowledge base construction method and a data processing method provided by the embodiments of the present application.

Of course, the computing device may necessarily include other components, such as input/output interfaces, communication components, and the like. The input/output interface provides an interface between the processing component and a peripheral interface module, which may be an output device, an input device, etc. The communication component is configured to facilitate wired or wireless communication between the computing device and other devices, and the like.

The computing device may be a physical device or an elastic computing host provided by the cloud computing platform, and at this time, the computing device may be a cloud server, and the processing component, the storage component, and the like may be a base server resource rented or purchased from the cloud computing platform.

When the computing device is a physical device, the computing device may be implemented as a distributed cluster formed by a plurality of servers or terminal devices, or may be implemented as a single server or a single terminal device.

The embodiment of the application also provides a computer readable storage medium which stores a computer program, and the computer program can realize the knowledge base construction method and the data processing method provided by the embodiment of the application when being executed by a computer.

The embodiment of the application also provides a computer program product, which comprises a computer program, wherein the computer program can realize the knowledge base construction method and the data processing method provided by the embodiment of the application when being executed by a computer.

Wherein the processing components of the respective embodiments above may include one or more processors to execute computer instructions to perform all or part of the steps of the methods described above. Of course, the processing component may also be implemented as one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic elements for executing the methods described above.

The storage component is configured to store various types of data to support operation in the device. The memory component may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same, and although the present application has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present application.

Claims

1. A method for constructing a knowledge base, comprising:

Obtaining database data and document data corresponding to the data provider; the database data includes data table data and metadata in the database;

Generate table description information of a data table in the database and field description information of the data table based on the database data and the document data;

Identifying table relationship information corresponding to the data table from historical query statements of the database data and the document data;

An asset knowledge base is constructed based on the table description information, the field description information, and the table relationship information.

2. The method according to claim 1, characterized in that the step of generating a table description of a data table in the database data and field description information of the data table based on the database data and the document data comprises:

Performing quality assessment on the database data and the document data to select database data and document data that meet data quality requirements;

For database data and document data that meet data quality requirements, table description information of data tables in the database data and field description information of the data tables are generated.

3. The method according to claim 2, further comprising:

Output data maintenance prompt information for database data that does not meet data quality requirements;

In response to the data adjustment request, the database data is updated, and the updated database data is used as database data that meets the quality requirements.

4. The method according to claim 2, further comprising:

Performing a quality assessment on the data model of the database data to determine whether the data model meets the model quality requirements;

If not, output model maintenance prompt information;

In response to the model adjustment request, the data model is updated.

5. The method according to claim 1, further comprising:

In response to a relationship maintenance request for the database data, determining manually configured table relationship information;

The identifying, from the historical query statements of the database data and the document data, the table relationship information corresponding to the data table comprises:

The table relationship information corresponding to the data table is identified from the historical query statements of the database data, the document data, the table relationship information maintained in the metadata, and the manually configured table relationship information.

6. The method according to claim 1, characterized in that the step of generating table description information of a data table in the database data and field description information of the data table based on the database data and the document data comprises:

Normalizing the table annotation information of the data table so as to use the normalized table annotation information as a kind of table description information;

Extracting table index keywords from the normalized table annotation information as a kind of table description information; the table index keywords are used for matching the table description information;

Normalizing the field annotation information of the data table so as to use the normalized field annotation information as a kind of field description information;

Extracting field index keywords from the normalized field annotation information as a kind of field description information; the field index keywords are used to match the field description information;

Extracting entity related information of entities associated with the data table from the document data;

The entity related information is used as a kind of table description information of the data table.

7. The method according to claim 6, further comprising:

Identifying downstream application information of the data table from the database data, and using the downstream application information as a kind of table description information of the data table;

Identify downstream application information of a field in the data table from the database data, and use the downstream application information as a field description information;

From the database data, identifying entities in a data table and application fields corresponding to the entities;

The entity and the application field are used as a kind of table description information of the data table, or the entity and the application field are merged into the table annotation information after normalization processing.

8. The method according to claim 6, wherein extracting table index keywords from the normalized table annotation information as a kind of table description information comprises:

Using the first model, extracting table index keywords from the normalized table annotation information and field annotation information as a kind of table description information;

The extracting of field index keywords from the field annotation information after normalization as a kind of field description information includes:

The first large model is used to extract field index keywords from the field annotation information after normalization as a kind of field description information.

9. The method according to claim 1, wherein the step of identifying the table relationship information corresponding to the data table from the historical query statements of the database data and the document data comprises:

Analyzing the historical query statements of the database data using the second largest model to extract table relationship information corresponding to the data table;

Extracting entity related information of entities associated with data tables from the document data; the entity related information includes entity relationship information and entity operation information;

From the entity relationship information and the entity operation information, table relationship information corresponding to the data table is identified.

10. The method according to claim 1, characterized in that the step of generating table description information of a data table in the database data and field description information of the data table based on the database data and the document data comprises:

Based on the table annotation information of the data table in the database data, the fields and field annotation information contained in the data table, and the metadata corresponding to the data table, the third model is used to generate the table description information of the data table in the database data and the field description information of the data table.

11. The method according to claim 1, characterized in that the table relationship information includes table association relationship information, table lineage relationship information, field lineage relationship information and lineage path information;

The method further comprises:

Relationship index keywords corresponding to table association relationship information, table lineage relationship information, field lineage relationship information and lineage path information are extracted, and saved in the asset knowledge base in correspondence with the table association relationship information, table lineage relationship information, field lineage relationship information and lineage path information; the relationship index keywords are used for relationship identification.

12. The method according to claim 1, characterized in that the step of constructing an asset knowledge base based on the table description information, field description information and table relationship information comprises:

Converting the table description information and the field description information into vector features to construct an asset vector library;

The table relationship information is converted into vector features to construct a relationship vector library.

13. The method according to claim 1, characterized in that the table description information includes table name, table type, table index keyword, table annotation information and downstream application information of the data table;

The field description information includes the field name, the table name of the data table to which it belongs, field annotation information, field index keywords, explanation information of the proper nouns involved in the field, and field downstream application information;

The table relationship information includes table association relationship information, table lineage relationship information, field lineage relationship information and field lineage path information;

The table association relationship information includes table name, associated table name of the associated data table, foreign key field in the data table, primary key field in the associated data table, association condition, association times and relationship index keyword;

The table blood relationship information includes the table name of the source data table, the target table name of the target data table, the blood relationship type, the relationship creation time, the relationship modification time and the relationship index keyword;

The field lineage key information includes the table name of the source data table, the field name in the source data table, the target table name of the target data table, the target field name in the target data table, the lineage relationship type, the relationship creation time, the relationship modification time and the relationship index keyword;

The field lineage path information includes the table name of the source data table, the field name in the source data table, the table name of the belonging data table, the target table name of the target data table, the target field name in the target data table, the lineage path, the path depth and the path index keyword.

14. The method according to claim 13, characterized in that the step of constructing an asset knowledge base based on the table description information, the field description information and the table relationship information comprises:

Generate multiple data tables according to the corresponding data structures based on the table description information, the field description information and the table relationship information;

An asset knowledge base is constructed based on the multiple data tables.

15. The method according to claim 1, further comprising:

identifying an object and object-related information of the object from the document data;

The constructing of the asset knowledge base based on the table description information, the field description information and the table relationship information includes:

An asset knowledge base is constructed based on the table description information, the field description information, the table relationship information, and the object related information.

16. A data processing method, comprising:

Identify the asset knowledge base;

Identifying target knowledge information matching the input data from the asset knowledge base; the target knowledge information includes target table description information, target field description information and/or target table relationship information;

Based on the input data and the target table description information, the target field description information and/or the target table relationship information, a target big model is input to obtain output data.

17. A data processing method, comprising:

Identify the asset knowledge base;

Constructing training samples based on table description information, field description information, and table relationship information in the asset database;

The target large model is trained using the training samples.

18. The method according to claim 16, wherein constructing training samples based on table description information, field description information, and table relationship information in the asset database comprises:

The fourth model is used to construct training samples based on sample examples from table description information, field description information and table relationship information in the asset database; the sample examples include sample input data and sample training labels.

19. A data processing method, comprising:

Use the target large model to process query requests;

In the case where the query request involves database data, querying target knowledge information from an asset knowledge base corresponding to the database data; the target knowledge information includes table description information, target field description information and/or target table relationship information;

Based on the target knowledge information, generating a query result;

The query result is used as output data of the target large model.

20. A computing device, comprising a processing component and a storage component;

The storage component stores one or more computer instructions; the one or more computer instructions are used to be called and executed by the processing component to implement the knowledge base construction method as described in any one of claims 1 to 15, or to implement the data processing method as described in claim 16, or to implement the data processing method as described in any one of claims 17 to 18, or to implement the data processing method as described in claim 19.

21. A computer-readable storage medium, characterized in that a computer program is stored thereon, and when the computer program is executed by a processing component, it implements the knowledge base construction method as described in claims 1 to 15, or implements the data processing method as described in claim 16, or implements the data processing method as described in any one of claims 17 to 18, or implements the data processing method as described in claim 19.

22. A computer program product, characterized in that it includes a computer program/instruction, which, when executed by a processing component, implements the knowledge base construction method as described in any one of claims 1 to 15, or implements the data processing method as described in claim 16, or implements the data processing method as described in any one of claims 17 to 18, or implements the data processing method as described in claim 19.