CN113434654A

CN113434654A - Data processing method and device, equipment and storage medium

Info

Publication number: CN113434654A
Application number: CN202110790742.9A
Authority: CN
Inventors: 张蒙
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2021-07-13
Filing date: 2021-07-13
Publication date: 2021-09-24
Anticipated expiration: 2041-07-13
Also published as: CN113434654B

Abstract

The present application discloses a data processing method, apparatus, device and storage medium. The method includes: acquiring query content; determining at least one query field, an identifier of a target data model, and at least one output field based on the query content and a metadata database, where the metadata database stores fields in at least two data models; and using a running data query engine to obtain a query result based at least on the at least one query field, the identifier of the target data model and the at least one output field. This solution can lift the limitation to a single business scenario and improve query accuracy in complex business scenarios.

Description

Data processing method and device, equipment and storage medium

Technical Field

The present application relates to the field of data processing technology, and relates to, but is not limited to, a data processing method, apparatus, device, and storage medium.

Background

With the development of Artificial Intelligence (AI) and Natural Language Processing (NLP) technologies, intelligent question-answering products (e.g., Apple's voice assistant, Microsoft XiaoIce, and JD.com's intelligent customer service) are increasingly used in industries such as e-commerce, retail, finance, and transportation, as well as in daily life.

Meanwhile, with the development of Business Intelligence (BI), more and more enterprises and researchers are paying attention to intelligent question answering over structured databases, and research on converting natural language into structured query language is being widely pursued.

However, in the related art, an intelligent question-answering product can only answer a question by querying a single data model in a single business scenario; when facing a complex business scenario, the accuracy of the query result cannot be ensured if the answer is queried in only a single business scenario.

Disclosure of Invention

The present application provides a data processing method, apparatus, device, and storage medium, which can remove the limitation to a single business scenario and improve query accuracy in complex business scenarios.

The technical solution of the present application is implemented as follows:

The present application provides a data processing method, which includes the following steps:

acquiring query content;

determining at least one query field, an identifier of a target data model, and at least one output field based on the query content and a metadata database, where the metadata database stores fields in at least two data models;

using a running data query engine to obtain a query result based at least on the at least one query field, the identifier of the target data model, and the at least one output field.

The present application provides a data processing apparatus, the apparatus comprising:

an acquisition unit configured to acquire query content;

a determining unit configured to determine at least one query field, an identifier of a target data model and at least one output field based on the query content and a metadata database, where the metadata database stores fields in at least two data models;

a processing unit configured to obtain a query result based on at least the at least one query field, the identification of the target data model, and the at least one output field using a running data query engine.

The present application further provides an electronic device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the above data processing method when executing the program.

The present application also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described data processing method.

According to the data processing method, apparatus, device, and storage medium provided by the present application, query content is acquired; at least one query field, an identifier of a target data model, and at least one output field are determined based on the query content and a metadata database, where the metadata database stores fields in at least two data models; and a running data query engine is used to obtain a query result based at least on the at least one query field, the identifier of the target data model, and the at least one output field. In this solution, the fields of multiple data models are integrated in the metadata database, so that answers to questions can be found across the multiple data models by querying the metadata database, which achieves higher query accuracy and removes the restriction to a single business scenario.

Drawings

FIG. 1 is an optional schematic structural diagram of a data processing system according to an embodiment of the present application;

FIG. 2 is an optional schematic structural diagram of a data processing end according to an embodiment of the present application;

FIG. 3 is an optional schematic flowchart of a data processing method according to an embodiment of the present application;

FIG. 4 is an optional schematic flowchart of a data processing method according to an embodiment of the present application;

FIG. 5 is an optional schematic flowchart of a data processing method according to an embodiment of the present application;

FIG. 6 is an optional schematic flowchart of a data processing method according to an embodiment of the present application;

FIG. 7 is an optional schematic flowchart of a data processing method according to an embodiment of the present application;

FIG. 8 is an optional schematic flowchart of a data processing method according to an embodiment of the present application;

FIG. 9 is an optional schematic structural diagram of a multi-table intelligent question-answering system according to an embodiment of the present application;

FIG. 10 is an optional schematic flowchart of a data processing method according to an embodiment of the present application;

FIG. 11 is an optional schematic structural diagram of a text input unit according to an embodiment of the present application;

FIG. 12 is an optional schematic structural diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 13 is an optional schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the following will describe the specific technical solutions of the present application in further detail with reference to the accompanying drawings in the embodiments of the present application. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.

In the following description, the terms "first/second/third" are used merely to distinguish between similar objects and do not represent a specific ordering of the objects. It is to be understood that "first/second/third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.

In order to facilitate understanding of the technical solutions of the present application, the following first explains the terms or technical terms referred to in the present application.

1) A data model is a model that includes a plurality of items and the values of each item.

In one example, the data model may be in the form of a two-dimensional table; Table 1 illustrates data model A, which contains student achievements; Table 2 illustrates data model B, which contains basic information about the students.

TABLE 1 data model example (data model A)

Student name	Chinese achievement	Math achievement	English achievement	Sports achievement
Zhang San	85	98	71	96
Li Si	95	94	88	91
……	……	……	……	……
Wang Wu	90	88	98	86

TABLE 2 data model example (data model B)

Student name	Gender	Height (cm)	Home address	Health condition
Zhang San	Male	155	Address A	Good
Li Si	Female	156	Address B	Good
……	……	……	……	……
Wang Wu	Male	158	Address C	Good

In another example, the data model may also be in graphical form, JavaScript Object Notation (JSON) form, dictionary form, and the like.
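As a minimal sketch of the dictionary and JSON forms mentioned above, the rows of Table 1 can be held as a list of dictionaries and serialized with Python's standard json module. The variable names are illustrative assumptions and do not come from the application, and only three of the five columns are shown.

```python
import json

# Data model A (Table 1) in dictionary form: one dict per row, keyed by field name.
# Only three of the five columns are reproduced here to keep the sketch short.
data_model_a = [
    {"student name": "Zhang San", "Chinese achievement": 85, "math achievement": 98},
    {"student name": "Li Si", "Chinese achievement": 95, "math achievement": 94},
    {"student name": "Wang Wu", "Chinese achievement": 90, "math achievement": 88},
]

# The same data model in JSON form.
as_json = json.dumps(data_model_a, ensure_ascii=False)

# Round-tripping through JSON preserves the fields and their values.
restored = json.loads(as_json)
fields = list(restored[0].keys())
print(fields)
```

Whichever form is used, the keys play the role of the fields defined next.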

2) A field is an item in the data model. In an example, a field may be the name of a column in a two-dimensional table; for example, a field may be "Student name" in Table 1; for another example, a field may be "Chinese achievement" in Table 1.

3) A query field is a field that is determined from the query content and used for querying in the data model. In Example 1, the query content is "What is the average Chinese achievement?", and the query field may be considered to be "Chinese achievement".

4) An output field is a field that is determined from the query content and used for presenting the query result. In Example 1, the output field is "Chinese achievement" in Table 1.

The embodiments of the present application provide a data processing method, apparatus, device, and storage medium. In practical applications, the data processing method may be implemented by a data processing apparatus, and each functional entity in the data processing apparatus may be cooperatively implemented by hardware resources of an electronic device (e.g., a terminal device), such as computing resources (e.g., a processor) and communication resources (e.g., resources supporting communication over optical cable, cellular networks, and the like).

The data processing method provided by the embodiment of the application is applied to a data processing system, and the data processing system comprises a data processing end. In an example, the data processing system may also include a client.

As an example, a data processing system may be configured as shown in FIG. 1, including: a client 10 and a data processing end 20.

In an example, the client 10 and the data processing end 20 may be the same physical entity; in another example, as shown in FIG. 1, the client 10 and the data processing end 20 may be different physical entities that interact with each other through the network 30.

Here, the client 10 is configured to receive an operation of a user and send query content to the data processing end 20 based on that operation. The data processing end 20 is configured to receive the query content sent by the client 10, obtain an answer to the query content from at least two data models according to the query content, and output the answer.

In an example, the structure of the data processing end may be as shown in fig. 2, and the data processing end 20 includes: a semantic parsing module 201, a data storage module 202 and a background processing module 203.

The semantic parsing module 201 is configured to receive and process data sent by the client 10. For example, in conjunction with the embodiment of the present application, the semantic parsing module 201 may be configured to receive query content sent by the client 10 and parse it to obtain the structured query condition corresponding to the query content (including at least one query field, an identifier of a target data model, and at least one output field); the semantic parsing module 201 may be further configured to send the structured query condition to the client, so that it can be displayed at the client; the semantic parsing module 201 may further be configured to organize the fields in the multiple data models and send information such as the fields included in the multiple data models and the identifiers of the data models to which the fields belong to the data storage module 202.

The data storage module 202 is used for storing data. For example, in combination with the embodiment of the present application, the data storage module 202 may include a metadata database, where the metadata database is used to store information sent by the semantic parsing module 201, such as the fields included in the multiple data models and the identifiers of the data models to which the fields belong; the data storage module 202 may further include a Frequently Asked Questions (FAQ) sub-library, where the FAQ sub-library is used to store frequently asked questions (also referred to as query content or question content) and the structured query conditions corresponding to them. It will be appreciated that the FAQ sub-library may also be stored as a sub-library of the metadata database.

The background processing module 203 is used for processing the related data. For example, in conjunction with an embodiment of the present application, the background processing module 203 may be configured to determine the answers (which may also be referred to as query results) corresponding to the query content in a plurality of data models based on the structured query condition; the background processing module 203 may also be configured to organize the answers into a data report or a visual graphic format and send them to the client 10, so that the client displays the answers in that format; the background processing module 203 may also be configured to search the FAQ sub-library for the structured query condition corresponding to a frequently asked question.

In the embodiment of the present application, based on the data processing system shown in fig. 1, a client sends query content to a data processing end, and the data processing end receives the query content and executes: determining at least one query field, an identification of a target data model, and at least one output field based on the query content and a metadata repository; the metadata base stores fields in at least two data models; obtaining, with a data query engine running, a query result based at least on the at least one query field, the identification of the target data model, and the at least one output field.

Embodiments of a data processing method, a data processing apparatus, a data processing device, and a storage medium according to the embodiments of the present application are described below with reference to a schematic diagram of a data processing system shown in fig. 1.

The present embodiment provides a data processing method, which is applied to a data processing apparatus, wherein the data processing apparatus can be implemented on an electronic device as a data processing end. The functions implemented by the method can be implemented by calling program code by a processor in an electronic device, and the program code can be stored in a computer storage medium.

The electronic device may be any device having associated information processing capabilities and, in one embodiment, may be a server.

Of course, the embodiments of the present application are not limited to the provided method and hardware, and may be implemented in various ways, for example, as a storage medium (storing instructions for executing the data processing method provided by the embodiments of the present application).

The following describes a data processing method provided in an embodiment of the present application.

FIG. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application, which is used to determine the query result of query content. Every query content is processed in a similar way; the following takes one query content as an example and describes in detail how the query result corresponding to it is obtained.

The data processing method may include, but is not limited to, S301 to S303 described below as shown in fig. 3.

S301, the electronic equipment acquires the query content.

S301 may be implemented as: the electronic equipment receives data content input by a client or sent by other equipment through a data transmission interface, and takes the data content as query content or converts the data content into the query content.

The way in which the query content is acquired is not specifically limited and can be configured according to actual requirements. For example, the electronic device may receive the query content as text through a text input interface on the client device; for another example, the electronic device may receive speech through a voice input interface on the client device and convert the speech into text to obtain the query content; for another example, the electronic device may receive the query content through a data transmission interface with another device, or receive data through that interface and convert it into the query content.

The specific access form of the client may be configured according to actual requirements, which is not specifically limited in the embodiment of the present application. For example, the client may be accessed as a World Wide Web (Web) page or as an application (App).

For example, S301 may be implemented as: the electronic device receives the text "number of people who left department A in the last three months" entered through a text input interface on the web page, and takes this text as the query content.

S302, the electronic device determines at least one query field, an identification of the target data model and at least one output field based on the query content and the metadata base.

The metadata database stores fields in at least two data models, and different data models store the data content of different business scenarios.

In an example, the metadata repository also stores an identification of the data model to which the field belongs; wherein the identification of the data model uniquely points to one data model.

The content stored in the metadata database is not specifically limited, and can be configured according to actual requirements.

S302 may be implemented as: the electronic device parses the query content, and determines at least one query field, the identifier of the target data model and at least one output field according to the parsed query content and the content of the metadata database.

It will be appreciated that the at least one query field, the identifier of the target data model, and the at least one output field may also be packaged as a structured query condition.

The format of the structured query condition is not specifically limited in the embodiment of the application, as long as it can be recognized by the data query engine. For example, the structured query condition may be a Structured Query Language (SQL) statement.

Example A: the query content is "What is the Chinese achievement of each student?", and the metadata database may include: (student name, data model A and data model B), (Chinese achievement, data model A), (math achievement, data model A), (English achievement, data model A), (sports achievement, data model A), (gender, data model B), (height, data model B), (home address, data model B), (health condition, data model B); each pair of brackets contains a field and the identifier(s) of the data model(s) to which the field belongs. After S302, the electronic device determines that the query fields include the student name and the Chinese achievement, the target data model includes data model A, and the output fields include the student name and the Chinese achievement.
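The packaging of Example A into a structured query condition can be sketched as follows. This is only an illustration under stated assumptions: the class name QueryCondition, the table name data_model_A, and the rendering into a plain SELECT statement are hypothetical choices, and filter conditions or aggregation functions are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryCondition:
    """Hypothetical container for the result of S302."""
    query_fields: List[str]
    target_model: str
    output_fields: List[str]

    def to_sql(self) -> str:
        # Quote identifiers because the field names contain spaces.
        columns = ", ".join(f'"{name}"' for name in self.output_fields)
        # Only the output fields and the target model are rendered;
        # filters and aggregations are left out of this sketch.
        return f'SELECT {columns} FROM "{self.target_model}"'


# Values taken from Example A; the table name is an assumed mapping
# from the identifier of data model A to a physical table.
condition = QueryCondition(
    query_fields=["student name", "Chinese achievement"],
    target_model="data_model_A",
    output_fields=["student name", "Chinese achievement"],
)
print(condition.to_sql())
```

Keeping the three parts in one object makes it straightforward to display the condition at the client or hand it to a query engine.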

S303, the electronic equipment obtains a query result at least based on the at least one query field, the identification of the target data model and the at least one output field by using a running data query engine.

The form of the data query engine is not particularly limited in the embodiment of the application, and the data query engine can be configured according to actual requirements. For example, the data query engine may be a Presto high speed engine, Spark analysis engine, Hive, or the like.

S303 may be implemented as: the electronic device uses the running data query engine to locate the target data model pointed to by the identifier of the target data model, looks up the value of each of the at least one query field in the target data model, determines the output field according to the query field and the value of the output field according to the value of the query field, and finally takes the at least one output field and its values as the query result.

Based on example a, the query result obtained by the electronic device through S303 may be table 3 below.

TABLE 3 query result example

Student name	Chinese achievement
Zhang San	85
Li Si	95
……	……
Wang Wu	90
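The lookup in S303 can be imitated on a small scale with Python's built-in sqlite3 module. SQLite is used here only as a stand-in for engines such as Presto, Spark, or Hive, and the table name, column subset, and helper function are assumptions made for this sketch.

```python
import sqlite3

# In-memory database standing in for the running data query engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE data_model_A ('
    '"student name" TEXT, "Chinese achievement" INTEGER, "math achievement" INTEGER)'
)
# Rows from Table 1 (three columns only).
conn.executemany(
    "INSERT INTO data_model_A VALUES (?, ?, ?)",
    [("Zhang San", 85, 98), ("Li Si", 95, 94), ("Wang Wu", 90, 88)],
)


def run_query(connection, target_model, output_fields):
    """Read the output fields from the target data model (hypothetical helper)."""
    columns = ", ".join(f'"{name}"' for name in output_fields)
    # ORDER BY rowid keeps the rows in insertion order.
    cursor = connection.execute(
        f'SELECT {columns} FROM "{target_model}" ORDER BY rowid'
    )
    return cursor.fetchall()


result = run_query(conn, "data_model_A", ["student name", "Chinese achievement"])
print(result)
```

The returned rows correspond to the named rows of Table 3.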

The data processing scheme provided by the embodiment of the application includes: acquiring query content; determining at least one query field, an identifier of a target data model, and at least one output field based on the query content and a metadata database, where the metadata database stores fields in at least two data models; and using a running data query engine to obtain a query result based at least on the at least one query field, the identifier of the target data model, and the at least one output field. In this scheme, the fields of multiple data models are integrated in the metadata database, so that answers to questions can be found across the multiple data models by querying the metadata database, which achieves higher query accuracy and removes the restriction to a single business scenario.

The metadata base is described in detail below.

In an example, the metadata database further stores aliases of fields, types of fields, dimension enumeration values of fields, field metric values, enumeration value mapping fields, and the like. The dimension enumeration values of a field are the list of values the field can take; the field metric values are the left and right boundary values of the field's value range; an enumeration value mapping field indicates that an enumeration value uniquely corresponds to one field.
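One possible in-memory shape for such a metadata record is sketched below. The class FieldMeta, the type labels "dimension" and "metric", the aliases, and the 0 to 100 value range are all assumptions introduced for illustration; the application does not prescribe them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FieldMeta:
    """Hypothetical metadata record for one field."""
    name: str
    model_ids: List[str]                          # data models the field belongs to
    aliases: List[str] = field(default_factory=list)
    field_type: str = "dimension"                 # assumed labels: "dimension" or "metric"
    enum_values: List[str] = field(default_factory=list)   # dimension enumeration values
    metric_range: Optional[Tuple[float, float]] = None     # left and right boundary values


records = [
    FieldMeta("gender", ["B"], aliases=["sex"], enum_values=["male", "female"]),
    FieldMeta("Chinese achievement", ["A"], aliases=["Chinese score"],
              field_type="metric", metric_range=(0, 100)),
]


def build_enum_mapping(metas: List[FieldMeta]) -> Dict[str, str]:
    """Map each enumeration value to the single field it corresponds to."""
    mapping: Dict[str, str] = {}
    for meta in metas:
        for value in meta.enum_values:
            mapping[value] = meta.name
    return mapping


enum_to_field = build_enum_mapping(records)
print(enum_to_field)
```

With such a mapping, a word like "female" in the query content can be traced back to the gender field even when the field name itself is not mentioned.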

Illustratively, the metadata repository may also store the relevant content in table 4 below.

TABLE 4 metadata repository examples

It is understood that the metadata base may also store FAQ sub-bases or other data, etc., which are not described in detail herein.

The storage format of the data in the metadata database is not specifically limited, and can be adjusted according to actual requirements.

Example 1, the storage may be in the form of a file.

For example, the fields and field aliases that the data model includes may be stored in a file; storing an identification of the data model to which the field belongs in another file; storing the type of the field in another file; store the attributes of the fields in another file, and so on.

Example 2, the storage may be in the form of a table.

For example, various information of the fields may be stored in a table format.

For ease of description, the at least one query field, the identification of the target data model, and the at least one output field will be referred to below simply as the first query condition.

In an example, the first query condition can correspond to a structured query condition corresponding to the first query field.

In another example, the structured query condition corresponding to the first query field may also include other content, for example, the structured query condition corresponding to the first query field may also include an attribute filter condition, an aggregation function of a metric field, and so on.

When determining the first query condition in S302, the electronic device may use, but is not limited to, the following two implementations:

Implementation A: determining the first query condition according to the FAQ sub-library in the metadata database;

Implementation B: determining the first query condition according to the fields in the metadata database and the identifiers of the data models to which the fields belong.

Implementation A may include: a plurality of question contents and the query conditions corresponding to them are stored in advance in the FAQ sub-library of the metadata database. Implementation A may be implemented as: the electronic device matches the query content against the question contents in the FAQ sub-library; if target question content satisfying the fast query condition exists in the FAQ sub-library, the electronic device reads the query condition corresponding to the target question content and takes it as the first query condition.

The content of the fast query condition is not specifically limited and can be configured according to actual requirements. For example, the fast query condition may include: the content similarity with the query content is greater than or equal to a fast query threshold.

When the first query condition is determined by Implementation A, because the question contents and their query conditions are stored in the FAQ sub-library in advance, the query content does not need to be analyzed in detail to generate the first query condition; the query condition of the question content that satisfies the fast query condition can be used directly as the first query condition. This improves the response speed and, in turn, the user experience.

It is understood that the FAQ sub-library may also be stored in parallel with the metadata library as a separate database, that is, the determination of the first query condition may also be implemented by accessing the separate FAQ library.

It can be understood that if the target question content satisfying the fast query condition does not exist in the FAQ sub-library, the first query condition may be determined by using the implementation B.
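Implementation A and its fallback can be sketched as follows. The similarity measure (the ratio from Python's difflib.SequenceMatcher), the threshold of 0.9, and the stored FAQ entry are assumptions chosen for the sketch; the application leaves the similarity algorithm and the fast query threshold open.

```python
from difflib import SequenceMatcher

# Hypothetical FAQ sub-library: question content -> stored query condition.
FAQ = {
    "what is the chinese achievement of each student": {
        "query_fields": ["student name", "Chinese achievement"],
        "target_model": "A",
        "output_fields": ["student name", "Chinese achievement"],
    },
}

FAST_QUERY_THRESHOLD = 0.9  # assumed value


def similarity(a: str, b: str) -> float:
    """Content similarity in [0, 1]; SequenceMatcher is a stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookup_faq(query: str, faq=FAQ, threshold=FAST_QUERY_THRESHOLD):
    """Return the stored condition of the best-matching question, or None."""
    best_question, best_score = None, 0.0
    for question in faq:
        score = similarity(query, question)
        if score > best_score:
            best_question, best_score = question, score
    if best_question is not None and best_score >= threshold:
        return faq[best_question]
    return None  # caller falls back to Implementation B


hit = lookup_faq("What is the Chinese achievement of each student")
miss = lookup_faq("How tall is Li Si")
print(hit is not None, miss is None)
```

A return value of None signals that no question content satisfies the fast query condition, so the first query condition has to be derived by Implementation B.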

Implementation B may include, but is not limited to, S401 to S404 shown in fig. 4 described below.

S401, the electronic device determines the at least one query field based on the query content and the candidate field set in the metadata base.

The candidate field set includes the fields in at least two data models and the aliases of those fields.

The number of query fields is not specifically limited in the embodiment of the application and can be determined according to actual requirements.

S401 may be implemented as: based on the query content, the electronic device determines the fields in the candidate field set of the metadata database that satisfy a field screening condition as the query fields.

The field screening conditions are not specifically limited in the embodiments of the present application, and can be configured according to actual requirements.

S402, for each query field in the at least one query field, the electronic device searches the metadata database for the identifier of the data model to which the query field belongs, and uses it as the identifier of at least one query data model.

Since the metadata database also stores the identifiers of the data models to which the fields belong, S402 can be implemented as: for each query field, the electronic device looks up in the metadata database the identifier of the data model to which the query field belongs, and uses it as the identifier of a query data model.

Based on Example A, S402 may be implemented as Example B: the electronic device finds in the metadata database that the field "student name" belongs to data model A and data model B, and that the field "Chinese achievement" belongs to data model A, so the query data models determined by the electronic device include data model A and data model B.
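Example B amounts to a dictionary lookup per query field. A minimal sketch, assuming the metadata database is available as a plain mapping from field name to model identifiers (the names METADATA and find_query_models are hypothetical):

```python
# Hypothetical in-memory view of the metadata database from Example A:
# field name -> identifiers of the data models the field belongs to.
METADATA = {
    "student name": ["A", "B"],
    "Chinese achievement": ["A"],
    "math achievement": ["A"],
    "English achievement": ["A"],
    "sports achievement": ["A"],
    "gender": ["B"],
    "height": ["B"],
    "home address": ["B"],
    "health condition": ["B"],
}


def find_query_models(query_fields, metadata):
    """Collect the identifiers of all data models that contain a query field (S402)."""
    model_ids = []
    for name in query_fields:
        for model_id in metadata.get(name, []):
            if model_id not in model_ids:  # keep each identifier once, in first-seen order
                model_ids.append(model_id)
    return model_ids


query_models = find_query_models(["student name", "Chinese achievement"], METADATA)
print(query_models)
```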

S403, the electronic device determines the identifier of the target data model according to the identifier of the at least one query data model.

S403 may be implemented as: the electronic device compares the number of identifiers of the at least one query data model with a first value; if the number is equal to the first value, the identifier of the query data model is used as the identifier of the target data model; if the number is greater than the first value, the identifier of the query data model that satisfies a constraint condition is used as the identifier of the target data model.

The embodiment of the application does not limit the specific content of the constraint condition, which can be configured according to actual requirements. Illustratively, the constraint condition may include: being the identifier of the query data model that includes the largest number of the query fields.

The specific value of the first value is not uniquely limited and can be configured according to actual requirements. In one example, the first value may be 1.

Based on Example B, assuming that the first value is 1 and the constraint condition is being the identifier of the query data model that includes the largest number of the query fields, S403 may be implemented as Example C: the electronic device determines that the number of identifiers of query data models is 2; because 2 is greater than the first value 1, it further determines that data model A includes 2 of the query fields and data model B includes 1, so data model A includes more query fields than data model B, and the identifier of data model A is determined to be the identifier of the target data model.

It should be noted that, if the number of the identifiers of the at least one query data model is smaller than the first value, the identifier of the at least one query data model is used as the identifier of the target data model.
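The selection rule of S403, with the example constraint from Example C, can be sketched as below. The function name, the in-memory metadata mapping, and the handling of ties (all tied models are returned) are assumptions; the application does not state how ties are resolved.

```python
# Hypothetical metadata from Example A: field name -> data model identifiers.
METADATA = {
    "student name": ["A", "B"],
    "Chinese achievement": ["A"],
    "gender": ["B"],
    "height": ["B"],
}


def choose_target_models(query_fields, metadata, first_value=1):
    """Return the identifier(s) of the target data model (S403)."""
    # Count how many of the query fields each query data model contains.
    counts = {}
    for name in query_fields:
        for model_id in metadata.get(name, []):
            counts[model_id] = counts.get(model_id, 0) + 1
    # Number of candidates equal to or smaller than the first value: use them directly.
    if len(counts) <= first_value:
        return list(counts)
    # Otherwise apply the example constraint: keep the model(s)
    # that contain the largest number of query fields.
    best = max(counts.values())
    return [model_id for model_id, count in counts.items() if count == best]


target = choose_target_models(["student name", "Chinese achievement"], METADATA)
print(target)
```

For Example C the counts are 2 for data model A and 1 for data model B, so only the identifier of data model A is kept.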

S404, the electronic device determines the at least one output field based on the at least one query field.

The number of output fields is not particularly limited in the embodiments of the present application and can be determined according to actual requirements.

In an example, S404 may be implemented as: the electronic device takes all of the at least one query field as output fields.

In another example, S404 may be implemented as: the electronic device deletes the fields that satisfy an optimization condition from the at least one query field and uses the remaining query fields as output fields.

The optimization conditions of the embodiments of the present application are not specifically limited, and may be configured according to actual requirements.

In an example, implementations of S401 may include, but are not limited to, S4011 through S4014 described below.

S4011, the electronic device obtains at least one content slice included in the query content.

The electronic device removes stop words from the query content to extract the valuable text information, and then slices the result using a content slicing algorithm to obtain one or more content slices.

Based on example A, S4011 can be implemented as example D: the electronic device removes the stop words in the query content to obtain "student Chinese score", and then applies a content slicing algorithm to obtain two content slices: "student" and "Chinese score".

S4012, the electronic device calculates, for each content slice in the at least one content slice, a content similarity between the content slice and each candidate field in the candidate field set.

S4012 can be implemented as: taking a first content slice (any one of the at least one content slice) as an example, the electronic device uses a similarity algorithm to calculate the content similarity between the first content slice and each candidate field in the candidate field set. The electronic device traverses every content slice in this way to obtain the content similarity between each content slice and each candidate field.

Based on example D, S4012 can be implemented as example E: the electronic device uses a similarity algorithm to calculate the content similarity between the content slice "student" and the candidate fields "student name", "Chinese score", "math score", "English score", "sports score", "sex", "height", "home address" and "health condition", obtaining 0.6, 0.1 and 0.1, respectively; the content similarity between the content slice "Chinese score" and the same candidate fields is calculated to be 0.1, 1, 0.4, 0.1 and 0.1, respectively.

S4013, the electronic device determines, among the content similarities, the target similarity that satisfies a first similarity condition.

The embodiment of the present application does not limit the specific content of the first similarity condition, and can be configured according to actual requirements. For example, the first similarity condition may include that the content similarity is greater than or equal to a content similarity threshold.

S4013 can be implemented as: the electronic device traverses all the content similarities and determines each content similarity that satisfies the first similarity condition as a target similarity.

Based on example E, assuming the first similarity condition is that the content similarity is greater than or equal to 0.5, S4013 can be implemented as example F: the electronic device determines that the content similarity 0.6 between the content slice "student" and the candidate field "student name" is greater than the content similarity threshold 0.5, and that the content similarity 1 between the content slice "Chinese score" and the candidate field "Chinese score" is greater than the content similarity threshold 0.5. The target similarities therefore include: the content similarity between the content slice "student" and the candidate field "student name", and the content similarity between the content slice "Chinese score" and the candidate field "Chinese score".

S4014, the electronic device determines the candidate field with the target similarity as the query field.

The electronic device takes each candidate field that yielded a target similarity as a query field.

Based on example F, the electronic device determines, via S4014, that the query fields include "student name" and "Chinese score".
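For illustration only, S4012 to S4014 can be sketched as below. The embodiments do not fix a similarity algorithm, so this sketch substitutes a simple word-set Jaccard overlap; the function names and the resulting scores are assumptions and do not reproduce the values of example E.

```python
from typing import Dict, Iterable


def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard overlap of lower-cased word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def identify_query_fields(
    slices: Iterable[str], candidates: Iterable[str], threshold: float = 0.5
) -> Dict[str, float]:
    """Return {query field: highest content similarity} (S4012 to S4014)."""
    candidates = list(candidates)
    query_fields: Dict[str, float] = {}
    for content_slice in slices:              # S4012: every content slice ...
        for field_name in candidates:         # ... against every candidate field
            score = token_jaccard(content_slice, field_name)
            if score >= threshold:            # S4013: first similarity condition
                # S4014: the candidate becomes a query field; keep its best score.
                query_fields[field_name] = max(score, query_fields.get(field_name, 0.0))
    return query_fields
```

Calling `identify_query_fields(["student", "Chinese score"], ["student name", "Chinese score", "math score", "height"])` keeps "student name" and "Chinese score", mirroring the outcome of example F.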

After the electronic device determines the at least one query field in S401, some query fields may be removed. The removal process is described below, taking the removal of a first field as an example.

The electronic device determines whether the query fields contain a second field that simultaneously satisfies the following: the first field is covered by the second field, and the content similarity of the first field is smaller than that of the second field. If such a second field exists, the first field is removed from the query fields.

The first field is different from the second field; the content similarity of the first field is the highest content similarity between the first field and the at least one content slice; the content similarity of the second field is a highest content similarity between the second field and the at least one content slice.

For example, if the at least one query field includes a field "date" with a similarity of 0.88 and a field "date of birth" with a similarity of 0.91, the field "date" is covered by "date of birth" and has the lower similarity, so the field "date" should be removed.
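The removal rule can be sketched as follows, under the assumption that "covered" means the first field name appears as a substring of the second field name. The function name and the sample field names are hypothetical.

```python
from typing import Dict


def remove_covered_fields(scores: Dict[str, float]) -> Dict[str, float]:
    """Drop every field that is covered by another field with a higher similarity.

    scores maps each query field to its highest content similarity.
    """
    kept: Dict[str, float] = {}
    for first, first_score in scores.items():
        covered = any(
            first != second and first in second and first_score < second_score
            for second, second_score in scores.items()
        )
        if not covered:
            kept[first] = first_score
    return kept
```

For instance, `remove_covered_fields({"date": 0.88, "date of birth": 0.91})` keeps only "date of birth".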

After the electronic device determines the at least one query field in S401, further query fields may also be added. The process of adding a query field is described below and may include, but is not limited to, S501 to S505 shown in fig. 5.

S501, the electronic device determines the unrecognized text in the query content.

The unrecognized text consists of the second slices, that is, the slices of the query content other than the first slices, where a first slice is a content slice that yielded a target similarity. In other words, a first slice is a recognized slice and a second slice is an unrecognized slice.

S501 may be implemented as: the electronic device acquires the query content, determines each content slice associated with a target similarity as a first slice, and deletes the first slices from the query content, thereby obtaining the unrecognized text in the query content.

S502, the electronic device converts the unrecognized text into pinyin.

S502 may be implemented as: the electronic device converts the unrecognized text to pinyin using a character recognition algorithm and a pinyin conversion algorithm.

S503, the electronic device obtains at least one pinyin slice included in the pinyin.

The implementation process of S503 may be: the electronic device processes the pinyin converted from the unrecognized text by using a pinyin slicing algorithm, thereby obtaining one or more pinyin slices.

S504, for each pinyin slice in the at least one pinyin slice, the electronic device calculates the pinyin similarity between the pinyin slice and each candidate field in the candidate field set.

For specific implementation of S504, reference may be made to detailed description of calculating, by the electronic device in S4012, content similarity between the content slice and each candidate field in the candidate field set, which is not described herein again.

S505, the electronic device adds each candidate field whose pinyin similarity satisfies a second similarity condition to the at least one query field.

The embodiments of the present application do not limit the specific content of the second similarity condition, which can be configured according to actual requirements. For example, the second similarity condition may include the pinyin similarity being greater than or equal to a pinyin similarity threshold.

The embodiments of the present application do not specifically limit the relative magnitude of the content similarity threshold (the first similarity threshold) and the pinyin similarity threshold (the second similarity threshold). In an example, the first similarity threshold may be less than or equal to the second similarity threshold.

For a specific implementation of S505, reference may be made to detailed descriptions of the query field determined by the electronic device in S4013 to S4014, which are not described herein again.

In this way, the query fields can be expanded through pinyin similarity calculation, which improves the accuracy and completeness of the query fields and in turn the accuracy of the query.
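A minimal sketch of S502 to S505 is given below. It assumes a toy character-to-pinyin table that covers only the sample characters, uses `difflib.SequenceMatcher` as a stand-in for the unspecified similarity algorithm, and takes the unrecognized slices as already computed; all names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, List

# Toy character-to-pinyin table covering only this example (an assumption);
# a real system would use a complete pinyin conversion algorithm.
TOY_PINYIN = {"身": "shen", "深": "shen", "高": "gao", "成": "cheng", "绩": "ji"}


def to_pinyin(text: str) -> str:
    """S502: convert each known character to toneless pinyin."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)


def pinyin_similarity(a: str, b: str) -> float:
    """S504: similarity between the pinyin forms of two strings."""
    return SequenceMatcher(None, to_pinyin(a), to_pinyin(b)).ratio()


def supplement_query_fields(
    unrecognized_slices: Iterable[str],
    candidates: Iterable[str],
    query_fields: Iterable[str],
    threshold: float = 0.9,
) -> List[str]:
    """S505: add candidates whose pinyin similarity meets the threshold."""
    candidates = list(candidates)
    result = list(query_fields)
    for piece in unrecognized_slices:
        for candidate in candidates:
            if candidate not in result and pinyin_similarity(piece, candidate) >= threshold:
                result.append(candidate)
    return result
```

Here the mistyped slice "深高" has the same pinyin as the candidate field "身高" (height), so the field is added even though the characters differ.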

If the query fields include a metric field having a metric attribute, the data processing method provided in the embodiments of the present application further obtains an aggregation function corresponding to the metric field. The process may include, but is not limited to, S601 to S602 shown in fig. 6 and described below.

S601, the electronic device acquires the metric field having a metric attribute from the at least one query field.

The metric attribute indicates that the values of the field have a magnitude.

S601 may be implemented as: the electronic device looks up the attribute of each query field in the metadata database to determine the metric fields having the metric attribute.

Based on example E, the query fields include "student name" and "Chinese score". The electronic device determines from the metadata database that the attribute of the field "student name" is a spatial dimension attribute, so it is a spatial dimension field, and that the attribute of the field "Chinese score" is a metric attribute, so it is a metric field.

S602, the electronic device sets the aggregation function of the metric field according to an aggregation condition.

The embodiments of the present application do not limit the specific content of the aggregation condition, which can be configured according to actual requirements.

In one example, the aggregation condition may include: the aggregation function of a metric field whose values can be summed is the sum function (sum), and the aggregation function of a metric field whose values cannot be summed is the average function (avg).

Optionally, for the dimension field, an aggregation function may also be set, for example, the aggregation function of the dimension field may be set as a count function (count).
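The example aggregation condition can be sketched as follows, assuming the condition distinguishes metric fields whose values can meaningfully be summed from those whose values cannot. The metadata keys `attribute` and `additive` and the function name are hypothetical.

```python
from typing import Dict


def assign_aggregations(field_meta: Dict[str, dict], count_dimensions: bool = False) -> Dict[str, str]:
    """S602: choose an aggregation function for each field.

    field_meta maps a field name to metadata with the hypothetical keys
    "attribute" ("metric" or a dimension attribute) and "additive" (bool).
    """
    aggregations: Dict[str, str] = {}
    for name, meta in field_meta.items():
        if meta["attribute"] == "metric":
            # Summable metrics use sum; the others use avg.
            aggregations[name] = "sum" if meta.get("additive", False) else "avg"
        elif count_dimensions:
            # Optional: dimension fields may be given the count function.
            aggregations[name] = "count"
    return aggregations
```

A score such as "Chinese score" would typically be marked non-additive and therefore averaged, whereas a quantity such as a sales amount could be summed.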

Correspondingly, S303 may be implemented as: the electronic device uses the running data query engine to locate the target data model pointed to by the identifier of the target data model, retrieves from the target data model the field values corresponding to the at least one query field, determines the output fields according to the query fields, determines the value of each output field according to the values of the query fields and the aggregation function, and finally takes the output fields and their values as the query result.

The data processing method provided in the embodiment of the present application may further include determining an attribute filtering condition corresponding to the field attribute, and the process may include, but is not limited to, S701 and S702 shown in fig. 7 described below.

S701, the electronic device determines the field attribute of each field included in the at least one query field.

The electronic device looks up the field attributes for each query field in the metadata database.

The attribute of a field is used for characterizing the content of the field (which may also be referred to as the value of the field). Different features correspond to different field attributes.

For example, the content of the Chinese score field in table 1 is a numeric value whose magnitude can be measured, so the Chinese score field has a metric attribute.

As another example, the date of birth field in table 1 has a temporal meaning and its content is not measurable, so the date of birth field has a time dimension attribute.

As another example, the student name field in table 1 has no temporal meaning and its content is not measurable, so the student name field has a spatial dimension attribute.

S702, the electronic device determines an attribute screening condition corresponding to the field attribute based on the field attribute of each field.

The embodiment of the application does not limit the specific content of the attribute screening condition corresponding to the field attribute, and can be configured according to actual requirements.

In an example, a field of the time dimension attribute corresponds to a time dimension filtering condition; the field of the spatial dimension attribute corresponds to a spatial dimension screening condition; the fields of the metric attribute correspond to metric screening conditions.

Correspondingly, S303 may be implemented as: the electronic device uses the running data query engine to locate the target data model pointed to by the identifier of the target data model, retrieves from the target data model the field values corresponding to the query fields according to the attribute screening condition of each field, determines the output fields according to the query fields, determines the value of each output field according to the values of the query fields, and finally takes the output fields and their values as the query result.

It can be understood that, when the processing method provided in the embodiments of the present application includes both obtaining the aggregation function corresponding to the metric field and determining the attribute screening condition corresponding to the field attribute, S303 may be implemented as: the electronic device uses the running data query engine to locate the target data model pointed to by the identifier of the target data model, retrieves from the target data model the field values corresponding to the query fields according to the attribute screening condition of each field, determines the output fields according to the query fields, determines the value of each output field according to the values of the query fields and the aggregation function, and finally takes the output fields and their values as the query result.
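To make the combined case concrete, the sketch below runs such a query over an in-memory list of rows instead of a real data query engine. The function `run_query`, its parameters, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List

AGGREGATORS = {"sum": sum, "avg": mean, "count": len}


def run_query(
    rows: List[dict],
    dimension: str,
    metric: str,
    aggregation: str,
    row_filter: Callable[[dict], bool] = lambda row: True,
) -> Dict[str, float]:
    """Filter rows, group them by the dimension field, and aggregate the metric field."""
    groups = defaultdict(list)
    for row in rows:
        if row_filter(row):                       # attribute screening condition
            groups[row[dimension]].append(row[metric])
    # The output fields are the dimension value and the aggregated metric value.
    return {key: AGGREGATORS[aggregation](values) for key, values in groups.items()}


rows = [
    {"student": "A", "score": 80, "year": 2020},
    {"student": "A", "score": 90, "year": 2021},
    {"student": "B", "score": 70, "year": 2021},
]
averages_2021 = run_query(rows, "student", "score", "avg", lambda r: r["year"] == 2021)
```

In this sketch the screening condition is passed as a predicate; a production query engine would instead translate it into its own filter syntax.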

The implementation of S702 may include, but is not limited to, obtaining the screening condition for the time dimension attribute, obtaining the screening condition for the spatial dimension attribute, and obtaining the screening condition for the metric attribute. The process of obtaining the screening condition is similar for each attribute. Taking the time dimension attribute as an example, the process of determining the attribute screening condition corresponding to the field attribute may include, but is not limited to, S801 and S802 shown in fig. 8 and described below.

S801, the electronic device acquires at least one time content slice.

A time content slice corresponds to a field having a time dimension attribute.

S801 may be implemented as: the electronic device identifies the time-related content slices among the content slices as the at least one time content slice.

The number of time content slices is not uniquely limited and can be determined according to the actual situation.

Example 1: a time content slice acquired by the electronic device may be "April to October 2015".

Optionally, synonym conversion of time and numbers can be performed on the time content slices, converting time-sensitive time phrases and colloquial numeric phrases in the time content slices into a form that is convenient for querying.

For example, for time expressions, "this year" is converted to "2021" and "last month" is converted to "2021-06"; for numeric expressions, either colloquial form of "two" is converted to "2", and "thousand" is converted to "000".
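The conversion can be sketched as below for English stand-in phrases. The phrase list, the `YYYY-MM` output format, and the function name are assumptions; the reference date is passed in explicitly so that the result does not depend on when the code runs.

```python
import datetime
import re


def normalize_time_and_numbers(text: str, today: datetime.date) -> str:
    """Rewrite time-sensitive phrases and colloquial numbers into a query-friendly form."""
    # Last day of the previous month, which also handles the January case.
    last_month_end = today.replace(day=1) - datetime.timedelta(days=1)
    text = text.replace("this year", str(today.year))
    text = text.replace("last month", f"{last_month_end.year}-{last_month_end.month:02d}")
    # Colloquial numbers: "two" becomes "2", then "thousand" after a digit becomes "000".
    text = re.sub(r"\btwo\b", "2", text)
    text = re.sub(r"(\d)\s*thousand\b", r"\g<1>000", text)
    return text
```

With a reference date of 13 July 2021, "last month" resolves to the month 2021-06 and "two thousand" becomes "2000".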

S802, the electronic device converts the content of each time content slice in the at least one time content slice into a corresponding enumeration screening function and screening value, which together serve as the attribute screening condition corresponding to the field attribute.

The function type of the enumeration screening function is not specifically limited in the embodiment of the application, and can be configured according to actual requirements.

In an example, the enumeration screening functions may include: the interval function (area), the equal-to function (EQ), the not-equal-to function (NEQ), the greater-than function (GT), and the greater-than-or-equal-to function (GTEQ).

Based on example 1, S802 may be implemented as example 2: the electronic device determines that the enumeration screening function is area and the screening values are ['2015-04', '2015-10']; the corresponding screening condition is between '2015-04' and '2015-10'.
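One possible way to turn an enumeration screening function and its screening values into a textual condition is sketched below. The operator mapping, the SQL-like output, and the field name `birth_date` are assumptions; a production system would bind the values as query parameters rather than interpolate them into a string.

```python
from typing import List

# Hypothetical mapping from screening function names to comparison operators.
OPERATORS = {"EQ": "=", "NEQ": "<>", "GT": ">", "GTEQ": ">=", "LT": "<", "LTEQ": "<="}


def render_condition(field_name: str, function: str, values: List[str]) -> str:
    """Render (field, screening function, screening values) as a SQL-like condition."""
    quoted = [f"'{value}'" for value in values]
    if function == "area":
        low, high = quoted  # the interval function needs exactly two values
        return f"{field_name} BETWEEN {low} AND {high}"
    (value,) = quoted       # every other function takes a single value
    return f"{field_name} {OPERATORS[function]} {value}"
```

For example 2, `render_condition("birth_date", "area", ["2015-04", "2015-10"])` yields `birth_date BETWEEN '2015-04' AND '2015-10'`.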

The data processing method provided by the embodiments of the present application also allows the determined first query condition to be modified. The modification process may include: receiving input modification content, and obtaining the query result according to the modification content, where the modification content may include at least one of: the identifier of the target data model and the output field. It should be noted that in this case S302 is no longer executed; S303 is executed directly according to the modification content to obtain the query result.

The data processing method provided by the embodiment of the application can also arrange the query result into a visual graph or a data report, and then send the visual graph or the data report to the client.

The data processing method provided in the embodiments of the present application is described below with specific application scenarios.

In recent years, with the development and spread of AI and NLP technologies, intelligent question-answering products have become increasingly widely used in industries such as e-commerce, retail, finance, and transportation, as well as in daily life; representative products include Apple's voice assistant, Microsoft XiaoIce, and Jingdong (JD.com) intelligent customer service. These products are usually based on a manually constructed knowledge base or knowledge graph and, with the help of deep learning and machine learning models, perform semantic understanding and pattern reasoning on the user's natural language input through semantic parsing in order to give an intuitive answer.

Meanwhile, with the trend toward digitization and intelligence in the BI industry and the unprecedented demand for big data mining and exploration, intelligent question answering over structured databases is attracting growing attention from enterprises and researchers, and research and products for converting natural language to structured query language are emerging rapidly. Some products have already seen initial practical use.

1. From a functional point of view, interactive intelligent question-answering products (such as Apple's voice assistant) have the following disadvantages:

(1) They depend heavily on a manually constructed and edited knowledge base or knowledge graph. The knowledge base occupies a large amount of storage, which degrades read/write response efficiency; meanwhile, the costs of initial construction and later maintenance are high, making it difficult to keep up with rapidly changing enterprise data.

(2) They can only return ready-made results from the knowledge base and cannot compute on or process the data.

(3) The accuracy of the result is difficult to guarantee in highly specialized complex business scenarios.

2. From a technical point of view, the core of the interactive intelligent question-answering technology is natural language to structured query language (NL2SQL) technology, which mainly has the following disadvantages:

(1) The technology is still in its infancy. Current products can generally query only a single data source table (also called a single data model or a data model); for queries over multiple data source tables, accuracy is low and far from practical. For example, the best accuracy currently reported on Yale University's English Spider dataset is only 71%, which is insufficient for multi-table queries in flexible and complex real business scenarios.

(2) Compared with English, Chinese has a more complex structure and more flexible forms of expression; the technology also lacks evaluation mechanisms and mature models, so improving its practical effectiveness is even more difficult.

The following describes the data processing method provided in the embodiments of the present application in detail by taking a multi-data source table as an example.

The embodiments of the present application can be applied to multi-data-source-table queries in enterprise-level complex business scenarios. An end-to-end, data-driven multi-table intelligent question-answering system is designed that supports interaction from natural language input to query results presented as a report or visual graph, and effectively addresses the limitations of traditional interactive intelligent question answering and structured query language (NL2SQL) technology: low accuracy of Chinese semantic parsing, difficulty handling multi-table queries in production, and heavy dependence on a large knowledge base.

The structure of the multi-table intelligent question-answering system designed in the present application is shown in fig. 9. The system consists of four parts: a client 901, a semantic parsing module 902, a background processing module 903, and a data storage module 904. The semantic parsing module 902 is the core of the system.

The user may enter query content in a text input unit of the client 901. The client 901 may present to the user the structured query condition corresponding to the query content, and may present the query result in the form of a data report or a visual graph.

The semantic parsing module 902 may include an application programming interface (API) 9021 of the structured query language model (NL2SQL model), an NL2SQL model 9022, and a Platform as a Service (PaaS) platform 9023. The NL2SQL model API 9021 provides an API interface and receives the query content; the NL2SQL model 9022 may call a data source 9041 in the data storage module 904 and convert the query content into the corresponding structured query condition; the PaaS platform 9023 is used to deploy metadata knowledge automatically.

The background processing module 903 may include an FAQ library 9031, a query engine 9032, a data processing unit 9033, and a visualization unit 9034. The FAQ library 9031 stores commonly used question content and the structured query conditions corresponding to that question content; the query engine 9032 is configured to search the FAQ library for the structured query condition corresponding to the query content; the data processing unit 9033 may arrange the query result into a data report; and the visualization unit 9034 may arrange the query result into a visual graph.

It should be noted that the semantic parsing module 902 in fig. 9 may be equivalent to the semantic parsing module 201 in fig. 2, the background processing module 903 in fig. 9 may be equivalent to the background processing module 203 in fig. 2, and the data storage module 904 in fig. 9 may be equivalent to the data storage module 202 in fig. 2. That is, the semantic parsing module 902, the background processing module 903, and the data storage module 904 in fig. 9 may constitute the data processing terminal 20 in fig. 2.

The system workflow includes the steps of data preparation, user input, client response, text semantic parsing, background processing, result output, user interaction, and the like shown in fig. 10.

Referring now to fig. 9 and 10, the data processing procedure provided in the present application may include, but is not limited to, the following S1 to S7.

S1: Data preparation.

The electronic device writes the data models to be queried in each business scenario into the data storage module in the form of two-dimensional tables (also called data source tables) to generate the data source. The data storage medium may be a distributed file system, a columnar database, a relational database, or the like.

S2: User input.

A user inputs query content (which may also be referred to as input text) in the text input unit of the client, for example, "the number of people who left department A in the last three months". The client may take the form of a web page or an app, and the text input unit is shown in fig. 11.

S3: Client response.

After receiving the input query content, the client first matches it against the question content in the FAQ library using a preset similarity threshold. If no question in the FAQ library satisfies the matching condition, the intelligent computing unit is required to parse the text, that is, S4 is executed. If the question with the highest similarity in the FAQ library satisfies the matching condition with the input text, the client responds quickly by returning the structured query condition corresponding to that question in the FAQ library, and S5 is executed.

The structured query condition may be an SQL statement (if the query engine supports SQL), so that the data query engine can execute it directly. To make it easier for the user to modify the query condition for extended queries, or when the data source and the background processing module use an internal protocol that the intelligent algorithm module cannot obtain, the structured query condition may instead take the form of a data object. For example, a structured query condition may contain four pieces of information: 1) the name or code of the data source table to be queried (which may be referred to as the identifier of the data model); 2) the query (SELECT) fields and the output (GROUP BY) dimension fields; 3) the metric fields and the aggregation function of each metric field; 4) the field names in the screening condition (i.e., WHERE condition) part, together with their screening functions and screening values.
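The data-object form might look like the sketch below, which also shows one way such an object could be rendered as SQL when the engine supports it. The class `StructuredQuery`, its attribute names, the table and column names, and the generated SQL dialect are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

_OPS = {"EQ": "=", "NEQ": "<>", "GT": ">", "GTEQ": ">=", "LT": "<", "LTEQ": "<="}


@dataclass
class StructuredQuery:
    model_id: str                                   # 1) data source table name or code
    dimension_fields: List[str]                     # 2) SELECT / GROUP BY dimension fields
    metric_aggregations: Dict[str, str]             # 3) metric field -> aggregation function
    filters: List[Tuple[str, str, List[str]]] = field(default_factory=list)  # 4) WHERE part

    def to_sql(self) -> str:
        """Render the data object as a SQL string (values inlined for readability only)."""
        select = list(self.dimension_fields)
        select += [f"{fn.upper()}({name}) AS {name}_{fn}" for name, fn in self.metric_aggregations.items()]
        sql = f"SELECT {', '.join(select)} FROM {self.model_id}"
        conditions = []
        for name, fn, values in self.filters:
            if fn == "area":
                conditions.append(f"{name} BETWEEN '{values[0]}' AND '{values[1]}'")
            else:
                conditions.append(f"{name} {_OPS[fn]} '{values[0]}'")
        if conditions:
            sql += " WHERE " + " AND ".join(conditions)
        if self.dimension_fields and self.metric_aggregations:
            sql += " GROUP BY " + ", ".join(self.dimension_fields)
        return sql


query = StructuredQuery(
    "student_scores",
    ["student_name"],
    {"chinese_score": "avg"},
    [("exam_date", "area", ["2015-04", "2015-10"])],
)
```

Keeping the condition as a data object lets the client display and edit individual parts, such as the output fields, before anything is executed.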

S4: Text semantic parsing.

The client sends a request to the API of the semantic parsing module. The core of the semantic parsing module is the NL2SQL algorithm model, which is deployed on a PaaS cloud service platform and exposes a calling interface to the client.

The algorithm model comprises two parts: a metadata knowledge base (also called a metadata database) and a text parsing algorithm. The metadata knowledge base stores, as key-value pairs, offline information about the data source tables such as field data types, field types, the table in which each field is located, dimension enumeration values, metric boundary values, field aliases, and enumeration-value mapping fields; the specific contents are shown in table 5.

TABLE 5 examples of metadata repository content

Since the metadata may change over time, the online model of the system periodically reads the latest metadata knowledge from the data source and is updated automatically by a workflow task scheduling system. This automatic update mechanism for the metadata knowledge base effectively reduces storage requirements, improves read/write response efficiency, and ensures the timeliness of query results.

The steps of the NL2SQL model for semantic parsing of the input text are described in detail in S4.1 to S4.5 below.

S4.1, removing stop words.

Stop words are removed from the query content input by the user.

S4.2, identifying the field.

Specific steps may include S4.2.1 through S4.2.4 described below.

S4.2.1, field preliminary identification.

All fields in the metadata knowledge base and their aliases are used as the candidate set. Using edit distance matching or named entity recognition, with a preset similarity threshold as the criterion (for example, an edit distance similarity threshold of 0.85), all successfully matched fields are collected into a recognized field list; for each such field, the slice of the text with the highest similarity to the field name (i.e., a substring, hereinafter called the matching slice) and its similarity value are recorded.
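A minimal sketch of the edit distance variant is shown below. It assumes the similarity is defined as one minus the Levenshtein distance divided by the longer string length, and it searches all substrings of the text by brute force, which is only practical for short inputs; the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def best_slice(text: str, name: str) -> Tuple[str, float]:
    """Return the substring of text most similar to name, with its similarity."""
    best = ("", 0.0)
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            score = similarity(text[start:end], name)
            if score > best[1]:
                best = (text[start:end], score)
    return best


def recognize_fields(text: str, aliases: Dict[str, List[str]], threshold: float = 0.85) -> Dict[str, Tuple[str, float]]:
    """Map each recognized field to its (matching slice, similarity)."""
    recognized = {}
    for field_name, alias_list in aliases.items():
        matches = [best_slice(text, name) for name in [field_name, *alias_list]]
        matching_slice, score = max(matches, key=lambda m: m[1])
        if score >= threshold:
            recognized[field_name] = (matching_slice, score)
    return recognized
```

Because the threshold is below 1, a slightly misspelled mention such as "chinse score" can still be matched to the field "chinese score".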

S4.2.2, removing the covered field.

In the recognized field list, if a field name is completely covered by another successfully matched field name and its similarity is lower than that of the latter, the field is removed from the recognized field list. For example, if the recognized field list contains a field "date" with a similarity of 0.88 and also a field "date of birth" with a similarity of 0.91, the field "date" should be removed.

S4.2.3, hiding the matching slice.

After the recognized fields are confirmed through the above steps, their matching slices are hidden in the input text so that they do not interfere with subsequent steps.

S4.2.4, supplementary matching of similar-sounding words.

To improve tolerance of wrongly written characters and misspellings, the Chinese characters in the processed text and in the candidate field names are converted into pinyin, and supplementary matching is performed using the edit distance method (the similarity threshold may be raised appropriately relative to S4.2.1 to ensure accuracy). Fields that match successfully are added to the recognized field list.

S4.3, identifying field screening conditions.

The specific steps may include, but are not limited to, S4.3.1 to S4.3.5 described below.

S4.3.1, synonym conversion of time and numbers.

Time-sensitive time phrases and colloquial numeric phrases in the input text are converted into structured forms that are convenient for querying. For example, for time expressions, "this year" is converted to "2021" and "last month" is converted to "2021-03"; for numeric expressions, either colloquial form of "two" is converted to "2", and "thousand" is converted to "000".

S4.3.2, time dimension field screening condition identification.

For the time dimension fields among the recognized fields, the three types of statements in S4.3.2.1 to S4.3.2.3 are recognized in the input text in sequence, using methods such as regular expression matching. If there are multiple time dimension fields, the left and right boundary values of each time dimension field recorded under "dimension enumeration value" in the metadata knowledge base are consulted, and the field whose range contains the screening value is selected. After each recognition step is completed, the successfully matched slices are hidden in the input text.

S4.3.2.1, multiple screening functions and identification of values.

The relevant candidate functions include: area (between … and …), the greater-than-or-equal-to function (GTEQ), the greater-than function (GT), the less-than-or-equal-to function (LTEQ), the less-than function (LT), and the like. For example, from the slice "April to October 2015", the screening function can be identified as area and the screening values as ['2015-04', '2015-10'], i.e., the screening condition is between '2015-04' and '2015-10'.

S4.3.2.2, identification of single-enumerated-value screening functions and values.

The candidate functions include: equal to (EQ), not equal to (NEQ), and the like. For example, from the slice "except for 2020", the screening function can be identified as NEQ and the screening value as '2020', i.e., the screening condition is != '2020'.

S4.3.2.3, identification of screening functions and values for time-sensitive descriptive phrases.

The candidate functions include: GTEQ, GT, and the like. For example, from the slice "last three years", the screening function can be identified as GTEQ and the screening value as '2018' (assuming the current year is 2021), i.e., the screening condition is >= '2018'.
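The three identification passes and the masking step can be sketched as follows. This is a minimal illustration under stated assumptions: the English regular expressions, the mask character, and the function name `identify_time_filters` are hypothetical, and the input is assumed to have already been normalized into "YYYY-MM" and "YYYY" forms.

```python
import re

MASK = "#"  # assumed placeholder used to hide matched slices


def identify_time_filters(text: str, this_year: int):
    """Return (conditions, masked_text); patterns are tried in a fixed order."""
    conditions = []

    def mask(match):
        # Replace the matched slice with placeholders of the same length.
        return MASK * len(match.group(0))

    # Pass 1: range expressions -> area (between ... and)
    pattern = re.compile(r"from (\d{4}-\d{2}) to (\d{4}-\d{2})")
    for m in pattern.finditer(text):
        conditions.append(("area", [m.group(1), m.group(2)]))
    text = pattern.sub(mask, text)

    # Pass 2: single values -> EQ / NEQ
    pattern = re.compile(r"\b(except|in) (\d{4})\b")
    for m in pattern.finditer(text):
        func = "NEQ" if m.group(1) == "except" else "EQ"
        conditions.append((func, m.group(2)))
    text = pattern.sub(mask, text)

    # Pass 3: time-sensitive descriptive phrases -> GTEQ
    pattern = re.compile(r"last (\d+) years")
    for m in pattern.finditer(text):
        conditions.append(("GTEQ", str(this_year - int(m.group(1)))))
    text = pattern.sub(mask, text)

    return conditions, text
```

Masking after each pass keeps a later, looser pattern from matching text that an earlier pass has already consumed.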

S4.3.3, metric field screening condition identification.

Similar to the time dimension fields, for the metric fields among the identified fields, the two types of expressions in S4.3.3.1 and S4.3.3.2 are identified in the input text in sequence, using methods such as regular expression matching. If multiple metric fields exist, the left and right boundary values of each field recorded under "measurement boundary values" in the metadata knowledge base are used to select the field whose range contains the screening value. After each identification step is completed, the successfully matched slices in the input text are masked.

S4.3.3.1, identification of multi-value screening functions and values.

The candidate functions include: area (between … and), GTEQ, GT, LTEQ, LT, and the like.

S4.3.3.2, identification of single-enumerated-value screening functions and values.

The candidate functions include: EQ, NEQ, and the like.
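A possible sketch of the metric case follows, showing how stored boundary values can decide which metric field a screening value belongs to. The comparison phrases, field names, and bounds are hypothetical examples, and only the comparison functions are covered.

```python
import re

# Longer phrases come first so "greater than or equal to" wins over "greater than".
COMPARATORS = [
    ("greater than or equal to", "GTEQ"),
    ("less than or equal to", "LTEQ"),
    ("greater than", "GT"),
    ("less than", "LT"),
]


def identify_metric_filter(text, bounds):
    """Return (function, value, fields) or None.

    `bounds` maps a metric field to its (low, high) boundary values, as they
    might be kept in the metadata knowledge base (assumed layout).
    """
    for phrase, func in COMPARATORS:
        m = re.search(rf"{phrase} (\d+(?:\.\d+)?)", text)
        if m:
            value = float(m.group(1))
            # Keep only the fields whose stored range contains the value.
            fields = [f for f, (low, high) in bounds.items() if low <= value <= high]
            return func, value, fields
    return None
```

With bounds of (0, 500) for a unit price and (0, 1000000) for a sales amount, a threshold of 5000 can only refer to the sales amount.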

S4.3.4, spatial dimension field screening condition identification.

Specifically, this may include, but is not limited to, S4.3.4.1 to S4.3.4.3 described below.

S4.3.4.1, for each spatial dimension field among the identified fields, matching its enumeration values, taken from "dimension enumeration value" in the metadata knowledge base, against the input text with a preset similarity threshold.

S4.3.4.2, applying the method of S4.2.3 to the list of successfully matched enumeration values to remove covered enumeration values, obtaining the final screening values.

S4.3.4.3, determining the screening function using methods such as regular expression matching, in combination with the screening values obtained in the previous step. The candidate functions include: EQ, NEQ, IN, and NIN. The successfully matched slices in the input text are then masked.
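Steps S4.3.4.1 and S4.3.4.3 can be sketched as below. This is an assumption-laden illustration: the standard-library `difflib` ratio is swapped in for whatever similarity measure the embodiment uses, matching is per single token, and the negation cue words and the 0.8 cutoff are hypothetical.

```python
import difflib
import re


def match_enum_values(text, enum_values, cutoff=0.8):
    """Fuzzy-match each token of the text against a field's enumeration values."""
    matched = []
    for token in re.findall(r"\w+", text):
        hit = difflib.get_close_matches(token, enum_values, n=1, cutoff=cutoff)
        if hit and hit[0] not in matched:
            matched.append(hit[0])
    return matched


def choose_enum_filter(text, matched_values):
    """Pick EQ/NEQ for one value and IN/NIN for several, based on a negation cue."""
    negated = re.search(r"\b(except|excluding|not)\b", text) is not None
    if len(matched_values) == 1:
        return ("NEQ" if negated else "EQ"), matched_values[0]
    return ("NIN" if negated else "IN"), list(matched_values)
```

For example, the misspelled token "Shanghia" still resolves to the enumeration value "Shanghai", and two matched values without a negation cue yield an IN condition.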

S4.3.5, supplementary identification of spatial dimension fields from enumerated values.

In S4.3.2 to S4.3.4, fields are identified by field name or alias. In a real-world query scenario, the user may omit the field name from a filtering condition and enter only an enumerated value. For example, "performance distribution of employees in department A" implies the filtering condition "department name = 'department A'", so the enumerated value and its corresponding field need to be identified at the same time. Note that only enumerated values contained in the "enumerated value mapping field" of the metadata knowledge base can be identified here, and an enumerated value that appears in more than one field is not identified.

Specifically, this may include, but is not limited to, S4.3.5.1 to S4.3.5.3.

S4.3.5.1, matching the input text against the enumeration values in the "enumeration value mapping field" with a preset similarity threshold, using a method such as edit distance, and recording all successfully matched enumeration values, their corresponding field names, and the matched input content slices.

S4.3.5.2, removing covered enumeration values. In the list of identified enumeration values, if the name of an enumeration value is completely covered by the name of another successfully matched enumeration value and its similarity is lower than that of the latter, the enumeration value is removed from the list.

S4.3.5.3, determining the screening function using methods such as regular expression matching, in combination with the screening enumeration values obtained in the previous step. The candidate functions include: EQ, NEQ, IN, and NIN. The fields corresponding to the enumeration values are added to the identified field list, and the successfully matched slices in the input text are masked.
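The uniqueness rule and the removal of covered values can be illustrated with the following sketch. The data layout and the names `build_enum_index` and `remove_covered` are assumptions made for illustration.

```python
def build_enum_index(field_enums):
    """Map each enumerated value to its field, skipping values shared by several fields."""
    owners = {}
    for field, values in field_enums.items():
        for value in values:
            owners.setdefault(value, set()).add(field)
    # Only values owned by exactly one field can be identified unambiguously.
    return {value: next(iter(fields))
            for value, fields in owners.items() if len(fields) == 1}


def remove_covered(matches):
    """Drop a match whose name is contained in another matched name with higher similarity.

    `matches` maps an enumerated value name to its similarity score.
    """
    kept = {}
    for name, score in matches.items():
        covered = any(other != name and name in other and score < other_score
                      for other, other_score in matches.items())
        if not covered:
            kept[name] = score
    return kept
```

In this sketch, a value such as "Sales" that occurs in two fields is left out of the index, which corresponds to the rule that ambiguous enumerated values are not identified.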

S4.4, identifying the data source table.

For multi-table queries, after the fields and screening conditions have been identified, "the table where the fields are located" in the metadata knowledge base needs to be consulted to further identify the data source table to be queried. The specific steps are as follows.

S4.4.1, if the identified field list is empty, returning the exception information "no valid field recognized" to the API. Otherwise, performing step S4.4.2.

S4.4.2, if exactly one data source table contains all of the identified fields, determining that table as the query data source table. Otherwise, if the number of identified fields contained in one data source table exceeds the total number of identified fields contained in the other data source tables, taking that table as the query data source table and removing, from the identified fields and the screening conditions, the results relating to fields that are not in the table. Otherwise, returning the exception information "data source table cannot be determined" to the API.
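Steps S4.4.1 and S4.4.2 can be sketched as follows. This is a minimal illustration; the mapping layout, the function name, and the use of `ValueError` to carry the exception information are assumptions.

```python
def identify_source_table(identified_fields, field_tables):
    """Return (table, kept_fields).

    `field_tables` maps a field name to the set of tables that contain it,
    as recorded in the metadata knowledge base (assumed layout).
    """
    if not identified_fields:
        raise ValueError("no valid field recognized")

    # Count how many identified fields each table contains.
    counts = {}
    for field in identified_fields:
        for table in field_tables.get(field, ()):
            counts[table] = counts.get(table, 0) + 1

    full = [t for t, c in counts.items() if c == len(identified_fields)]
    if len(full) == 1:
        table = full[0]
    else:
        total = sum(counts.values())
        # A table wins if it holds more identified fields than all others combined.
        dominant = [t for t, c in counts.items() if c > total - c]
        if not dominant:
            raise ValueError("data source table cannot be determined")
        table = dominant[0]

    kept = [f for f in identified_fields if table in field_tables.get(f, ())]
    return table, kept
```

At most one table can satisfy the strict-majority test, so the choice in the second branch is unambiguous.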

S4.5, optimizing the identification result.

The identified result undergoes final optimization and structured integration and is then returned to the API and the client. The specific steps are S4.5.1 to S4.5.4.

S4.5.1, completing the screening conditions.

Some data source tables need to be supplemented with default screening conditions. For example, if the query data source table identified in the previous step is a zipper table or a partition table, a default screening condition needs to be added so that the query range covers only valid or current data. In addition, some data tables may contain dirty data, and a default screening condition is added to exclude it.

S4.5.2, optimizing the output fields.

Some fields in the field list identified in the previous steps may be intended by the user only for filtering and are not expected to appear in the query result, where they would be a distraction. The system therefore needs to finalize the output fields, identifying the fields that are used only for screening and excluding them from the output.

The output fields consist of output dimensions and output metrics, which are initialized to the dimension fields and the metric fields among the identified fields, respectively. The following is a recommended output field selection logic: for an identified time dimension field, if it is in a filtering condition and the filtering function is EQ, the time dimension is removed from the output dimensions (i.e., the time dimension is used only for filtering; the same applies below); for an identified spatial dimension field, if it is in a screening condition and the screening function is not IN, the spatial dimension is removed from the output dimensions; for an identified metric field, if it is in a filtering condition, the metric field is removed from the output metrics.
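The recommended selection logic can be written down compactly. The sketch below assumes each identified field carries one of three kinds ("time", "space", "metric") and at most one screening function; these representations and the function name are hypothetical.

```python
def select_output_fields(fields, filters):
    """Return (output_dimensions, output_metrics).

    `fields` maps a field name to its kind: "time", "space", or "metric".
    `filters` maps a filtered field name to its screening function.
    """
    dimensions, metrics = [], []
    for name, kind in fields.items():
        func = filters.get(name)  # None when the field is not filtered
        if kind == "time":
            # A time field filtered with EQ is used only for filtering.
            if func != "EQ":
                dimensions.append(name)
        elif kind == "space":
            # A filtered spatial field stays in the output only for IN.
            if func is None or func == "IN":
                dimensions.append(name)
        elif kind == "metric":
            # Any filtered metric is dropped from the output.
            if func is None:
                metrics.append(name)
    return dimensions, metrics
```

The intuition is that a dimension pinned to a single value would contribute only a constant column, whereas an IN filter still leaves several values worth grouping by.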

S4.5.3, determining the aggregation function of the metric fields.

The output data is summary data, so an aggregation function needs to be returned for each metric field. The system sets the aggregation function to sum for additive metrics and to avg for non-additive metrics. If the output fields contain no metric field, an aggregate output indicator is added: count.
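A short sketch of this rule follows; the set of additive metrics is assumed to be available from metadata, and the "*" key used for the count fallback is a hypothetical convention.

```python
def choose_aggregations(output_metrics, additive):
    """Assign sum to additive metrics, avg to the rest, and count when none exist."""
    if not output_metrics:
        # No metric in the output: fall back to a row count.
        return {"*": "count"}
    return {m: ("sum" if m in additive else "avg") for m in output_metrics}
```

For instance, a sales amount is additive and would be summed, while a unit price is not and would be averaged.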

S4.5.4, integrating the above identification results, including the data source table, the output dimension field list, the output metric fields and their aggregation functions, and the screening conditions (fields, screening functions, and screening values), returning them to the API in the form of structured query conditions, and then outputting them to the client.

S5, background processing.

The data query engine acquires the structured query conditions from the client, executes the corresponding SQL (or Hive SQL) statements, and reads data from the data source. The query can be executed by engines such as the Presto high-speed engine, the Spark analysis engine, or Hive, of which Presto is the faster. After the summarized data have been read in, the data processing unit performs any necessary processing and integrates the data into the report or visual chart form required for output.
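One way the structured query conditions might be turned into a statement is sketched below. This is an assumption, not the embodiment's implementation: the parameter layout, the "?" placeholder style, and the alias naming are hypothetical, and table and field identifiers are assumed to come from the trusted metadata knowledge base rather than from raw user text.

```python
OPERATORS = {"EQ": "=", "NEQ": "!=", "GT": ">", "GTEQ": ">=", "LT": "<", "LTEQ": "<="}


def build_sql(table, dimensions, metrics, filters):
    """Return (sql, params).

    `metrics` maps a field to its aggregation function; `filters` is a list of
    (field, screening function, value) tuples. Values are bound as parameters.
    """
    def alias(field, agg):
        return "row_count" if field == "*" else f"{agg}_{field}"

    select = list(dimensions) + [
        f"{agg}({field}) AS {alias(field, agg)}" for field, agg in metrics.items()
    ]
    sql = f"SELECT {', '.join(select)} FROM {table}"

    clauses, params = [], []
    for field, func, value in filters:
        if func == "area":  # between ... and
            clauses.append(f"{field} BETWEEN ? AND ?")
            params.extend(value)
        elif func in ("IN", "NIN"):
            marks = ", ".join("?" for _ in value)
            keyword = "NOT IN" if func == "NIN" else "IN"
            clauses.append(f"{field} {keyword} ({marks})")
            params.extend(value)
        else:
            clauses.append(f"{field} {OPERATORS[func]} ?")
            params.append(value)

    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if dimensions:
        sql += " GROUP BY " + ", ".join(dimensions)
    return sql, params
```

Binding the screening values as parameters rather than interpolating them keeps user-supplied values out of the statement text.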

S6, outputting the result.

The background outputs the processed data result to the client in the form of a report or a visual chart.

S7, user interaction.

The user can modify the information in the structured query conditions to perform an extended query, for example by modifying the data source table or the output fields, or by modifying the value range of a screening field. In this case, the modified structured query conditions are transmitted directly to the data query engine without passing through the semantic parsing module, and S5 and S6 are executed in sequence.

The technical effects of the data processing method provided by the embodiment of the application can include:

1. To address database query requirements in enterprise-level complex business scenarios, an end-to-end, data-driven, multi-table intelligent question-answering system is designed, consisting of a client, a semantic analysis module, a background processing module, and a data storage module. The semantic analysis module is the core of the system. It effectively addresses problems in current intelligent question answering and NL2SQL technologies, such as the low accuracy of Chinese semantic analysis, the difficulty of handling multi-table queries in production environments, and heavy dependence on a large knowledge base. Only the metadata knowledge base is stored, and it is deployed automatically at regular intervals, which saves storage space, improves read and write efficiency, and ensures timely output.

2. An intelligent text semantic analysis algorithm is designed for the semantic analysis module. Text semantic analysis is completed through field identification, screening condition identification, data source identification, and optimization of the identification result. Screening condition identification includes time dimension screening condition identification, metric screening condition identification, spatial dimension screening condition identification, supplementary identification by enumerated value, and the like, so that user requirements are identified as fully as possible. During semantic analysis, techniques such as synonymous conversion and masking of matched slices are used, which effectively improves identification accuracy.

Fig. 12 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in Fig. 12, the data processing apparatus 120 includes: an obtaining unit 1201, a determining unit 1202, and a processing unit 1203. Wherein:

an obtaining unit 1201 is configured to obtain query content.

A determining unit 1202, configured to determine at least one query field, an identifier of a target data model, and at least one output field based on the query content and a metadata database; the metadata database stores fields in at least two data models.

A processing unit 1203, configured to obtain, by using a running data query engine, a query result based on at least the at least one query field, the identification of the target data model, and the at least one output field.

In some embodiments, the determining unit 1202 is further configured to:

determining the at least one query field based on the query content and a set of candidate fields in the metadata database; the set of candidate fields includes fields in the at least two data models, and alias fields for the fields;

for each query field in the at least one query field, searching the metadata database for the identifier of the data model to which the query field belongs, and using that identifier as an identifier of the at least one query data model; the metadata database also stores the identifier of the data model to which each field belongs;

determining an identity of the target data model from an identity of the at least one query data model;

determining the at least one output field based on the at least one query field.

In some embodiments, the determining unit 1202 is further configured to:

obtaining at least one content slice included in the query content;

calculating a content similarity between the content slice and each candidate field in the candidate field set for each content slice in the at least one content slice;

determining a target similarity meeting a first similarity condition in the content similarities;

and determining the candidate field with the target similarity as the query field.
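The slice-and-match procedure above can be illustrated with the following sketch. It rests on assumptions: slices are taken as contiguous token runs, the standard-library `difflib` ratio is swapped in as the content similarity, and a fixed threshold of 0.85 plays the role of the first similarity condition.

```python
from difflib import SequenceMatcher


def content_slices(tokens, max_len=3):
    """All contiguous token runs of up to max_len tokens."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(tokens) - n + 1)]


def identify_query_fields(query, candidates, threshold=0.85):
    """Return {field: (best_slice, similarity)} for candidates meeting the threshold.

    `candidates` maps a field name or alias to its canonical field.
    """
    best = {}
    for piece in content_slices(query.lower().split()):
        for name, field in candidates.items():
            score = SequenceMatcher(None, piece, name.lower()).ratio()
            # Keep the highest-scoring slice for each canonical field.
            if score >= threshold and score > best.get(field, ("", 0.0))[1]:
                best[field] = (piece, score)
    return best
```

Recording the best slice alongside each field is what later allows the matched slices to be excluded when the unrecognized text is determined.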

In some embodiments, the data processing apparatus 120 further comprises a removal unit.

A removing unit, configured to remove a first field included in the at least one query field from the at least one query field if the first field covers a second field included in the at least one query field and the content similarity of the first field is smaller than the content similarity of the second field; the first field is different from the second field; the content similarity of the first field is the highest content similarity between the first field and the at least one content slice; the content similarity of the second field is the highest content similarity between the second field and the at least one content slice.

In some embodiments, the data processing apparatus 120 further comprises an adding unit.

An adding unit, configured to determine an unrecognized text in the query content; the unrecognized text is composed of a second slice except a first slice in the query content, and the first slice is a slice for obtaining the target similarity;

converting the unrecognized text into pinyin;

obtaining at least one pinyin slice included in the pinyin;

and respectively calculating the pinyin similarity between the pinyin slice and each candidate field in the candidate field set for each pinyin slice in the at least one pinyin slice, and adding the candidate field of which the pinyin similarity meets a second similarity condition to the at least one query field.

In some embodiments, the determining unit 1202 is further configured to:

if the number of the identifiers of the at least one query data model is equal to a first numerical value, taking the identifiers of the query data models as the identifiers of the target data models;

and if the number of the identifiers of the at least one query data model is greater than the first numerical value, taking the identifier of the query data model meeting the constraint condition as the identifier of the target data model.

In some embodiments, the determining unit 1202 is further configured to: deleting the fields meeting the optimization conditions in the at least one query field to obtain the at least one output field.

In some embodiments, the data processing device 120 further comprises an aggregation unit.

The aggregation unit is used for acquiring a measurement field with a measurement attribute in the at least one query field; the measurement attribute is used for representing that the value of the field has a size characteristic;

setting an aggregation function of the measurement field according to the aggregation condition;

correspondingly, the processing unit 1203 is further configured to: determining the query result based at least on the at least one query field, the identification of the target data model, the at least one output field, and the aggregation function.

In some embodiments, the data processing apparatus 120 further comprises a screening unit.

A screening unit for determining a field attribute of each field included in the at least one query field;

determining an attribute screening condition corresponding to the field attribute based on the field attribute of each field;

correspondingly, the processing unit 1203 is further configured to: obtaining the query result based at least on the at least one query field, the identification of the target data model, the at least one output field, and the attribute filtering condition.

In some embodiments, the screening unit is further configured to:

obtaining at least one temporal content slice; the temporal content slice corresponds to a field having a temporal dimension attribute; the field attribute comprises a time dimension attribute, and the time dimension attribute of the field is used for representing that the value of the field does not have the size characteristic and represents the time meaning;

and converting the content of each time content slice in the at least one time content slice into a corresponding enumeration screening function and a screening value as an attribute screening condition corresponding to the field attribute.

In an example, the functions of the obtaining unit 1201 and of the determining unit 1202 can be implemented by the semantic parsing module 201 in Fig. 2, the functions of the processing unit 1203 can be implemented by the background processing module 203 in Fig. 2, and the functions of the removing unit, the adding unit, the aggregating unit, and the screening unit can be implemented by the semantic parsing module 201 in Fig. 2.

It should be noted that each unit included in the data processing apparatus provided in the embodiments of the present application may be implemented by a processor in an electronic device, or alternatively by a specific logic circuit. In implementation, the processor may be a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.

The above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.

It should be noted that, in the embodiment of the present application, if the data processing method is implemented in the form of a software functional module and sold or used as a standalone product, the data processing method may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the related art may be embodied in the form of a software product stored in a storage medium, and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.

Correspondingly, an embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program that can be run on the processor, and the processor executes the computer program to implement the steps in the data processing method provided in the foregoing embodiment.

Accordingly, embodiments of the present application provide a storage medium, that is, a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the data processing method provided in the above embodiments.

Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.

It should be noted that fig. 13 is a schematic hardware entity diagram of the electronic device 130 according to an embodiment of the present application, and in an example, the electronic device 130 may be the above-mentioned electronic device. As shown in fig. 13, the electronic device 130 includes: a processor 1301, at least one communication bus 1302, a user interface 1303, at least one external communication interface 1304, and memory 1305. Wherein the communication bus 1302 is configured to enable connective communication between these components. The user interface 1303 may include a display screen, and the external communication interface 1304 may include a standard wired interface and a wireless interface.

The Memory 1305 is configured to store instructions and applications executable by the processor 1301, and may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 1301 and modules in the electronic device, and may be implemented by a FLASH Memory (FLASH) or a Random Access Memory (RAM).

It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in some embodiments" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.

Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the related art may be embodied in the form of a software product stored in a storage medium, and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.

The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A data processing method, characterized in that the method comprises:

acquiring query content;

determining at least one query field, an identifier of a target data model, and at least one output field based on the query content and a metadata database; the metadata database stores fields in at least two data models;

obtaining, by using a running data query engine, a query result based at least on the at least one query field, the identifier of the target data model, and the at least one output field.

2. The method according to claim 1, wherein the determining at least one query field, the identifier of the target data model and at least one output field based on the query content and the metadata database comprises:

determining the at least one query field based on the query content and a set of candidate fields in the metadata database; the set of candidate fields includes fields in the at least two data models, and alias fields of the fields;

for each query field in the at least one query field, looking up, in the metadata database, the identifier of the data model to which the query field belongs, as an identifier of the at least one query data model; the metadata database also stores the identifier of the data model to which each field belongs;

determining the identifier of the target data model according to the identifier of the at least one query data model;

determining the at least one output field based on the at least one query field.

3. The method according to claim 2, wherein the determining the at least one query field based on the query content and the set of candidate fields in the metadata database comprises:

acquiring at least one content slice included in the query content;

for each content slice in the at least one content slice, calculating the content similarity between the content slice and each candidate field in the candidate field set;

determining, among the content similarities, a target similarity that satisfies a first similarity condition;

determining the candidate field for which the target similarity is obtained as the query field.

4. The method according to claim 3, wherein the method further comprises:

if a first field included in the at least one query field covers a second field included in the at least one query field, and the content similarity of the first field is smaller than the content similarity of the second field, removing the first field from the at least one query field; the first field is different from the second field; the content similarity of the first field is the highest content similarity between the first field and the at least one content slice; the content similarity of the second field is the highest content similarity between the second field and the at least one content slice.

5. The method according to claim 3, wherein the method further comprises:

determining the unrecognized text in the query content; the unrecognized text is composed of second slices in the query content other than first slices, and a first slice is a slice for which the target similarity is obtained;

converting the unrecognized text into pinyin;

obtaining at least one pinyin slice included in the pinyin;

for each pinyin slice in the at least one pinyin slice, calculating the pinyin similarity between the pinyin slice and each candidate field in the candidate field set, and adding the candidate fields whose pinyin similarity satisfies a second similarity condition to the at least one query field.

6. The method according to claim 2, wherein the determining the identifier of the target data model according to the identifier of the at least one query data model comprises:

if the number of identifiers of the at least one query data model is equal to a first value, using the identifier of the query data model as the identifier of the target data model;

if the number of identifiers of the at least one query data model is greater than the first value, using the identifier of the query data model that satisfies a constraint condition as the identifier of the target data model.

7. The method of claim 2, wherein the determining the at least one output field based on the at least one query field comprises:

deleting the fields that satisfy an optimization condition from the at least one query field to obtain the at least one output field.

8. The method of claim 1, wherein the method further comprises:

obtaining a metric field having a metric attribute in the at least one query field; the metric attribute is used to characterize that the value of the field has a size characteristic;

setting an aggregation function of the metric field according to an aggregation condition;

the obtaining a query result based at least on the at least one query field, the identifier of the target data model, and the at least one output field comprises:

determining the query result based at least on the at least one query field, the identifier of the target data model, the at least one output field, and the aggregation function.

9. The method of claim 1, wherein the method further comprises:

determining a field attribute of each field included in the at least one query field;

Determine the attribute filter condition corresponding to the field attribute based on the field attribute of each field;

The query result is obtained based on at least the at least one query field, the identifier of the target data model, the at least one output field, and the attribute filter condition.

10. The method according to claim 7, wherein the field attribute comprises a time dimension attribute, the time dimension attribute of a field is used to characterize that the value of the field does not have a size characteristic and represents a time meaning, and the determining, based on the field attribute of each field, the attribute filter condition corresponding to the field attribute comprises:

Obtain at least one time content slice; the time content slice corresponds to a field with a time dimension attribute;

The content of each of the time content slices in the at least one time content slice is converted into a corresponding enumeration filter function and filter value as the attribute filter condition corresponding to the field attribute.
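The conversion of a time content slice into a filter function and filter value can be sketched as below. The supported phrases, the function names `EQUALS` and `BETWEEN`, and the ISO date format of the filter values are assumptions chosen for illustration, not terms defined by the claim.

```python
from datetime import date, timedelta


def time_slice_to_filter(time_slice, today):
    """Map a time content slice to an assumed (filter_function, filter_value) pair.

    Returns None when the slice is not one of the phrases handled here.
    """
    if time_slice == "today":
        return ("EQUALS", today.isoformat())
    if time_slice == "yesterday":
        return ("EQUALS", (today - timedelta(days=1)).isoformat())
    if time_slice == "this year":
        start = date(today.year, 1, 1)
        end = date(today.year, 12, 31)
        return ("BETWEEN", (start.isoformat(), end.isoformat()))
    if time_slice == "last month":
        # The day before the first of this month is the last day of last month.
        last_day = today.replace(day=1) - timedelta(days=1)
        first_day = last_day.replace(day=1)
        return ("BETWEEN", (first_day.isoformat(), last_day.isoformat()))
    return None
```

Passing the reference date explicitly, rather than calling `date.today()` inside the function, keeps the conversion deterministic and easy to check.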

11. A data processing device, characterized in that the device comprises:

an acquiring unit, configured to acquire query content;

a determining unit, configured to determine at least one query field, an identifier of a target data model and at least one output field based on the query content and the metadata database; the metadata database stores fields in at least two data models;

a processing unit, configured to obtain, by using a running data query engine, a query result based on at least the at least one query field, the identifier of the target data model and the at least one output field.

12. An electronic device comprising a memory and a processor, wherein the memory stores a computer program that can be run on the processor, and when the processor executes the program, the data processing method according to any one of claims 1 to 10 is implemented.

13. A storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the data processing method according to any one of claims 1 to 10 is implemented.