CN118349679B

CN118349679B - Task matching data security matching method based on neural network

Info

Publication number: CN118349679B
Application number: CN202410500108.0A
Authority: CN
Inventors: 王世谦; 邵志鹏; 张小建; 王圆圆; 贾一博; 高先周; 李为; 宋大为; 李秋燕; 费稼轩; 黄秀丽; 卜飞飞; 华远鹏; 韩丁; 于雪辉
Original assignee: Henan Jiuyu Tenglong Information Engineering Co ltd; Economic and Technological Research Institute of State Grid Henan Electric Power Co Ltd; State Grid Smart Grid Research Institute of SGCC
Current assignee: Henan Jiuyu Tenglong Information Engineering Co ltd; Economic and Technological Research Institute of State Grid Henan Electric Power Co Ltd; State Grid Smart Grid Research Institute of SGCC
Priority date: 2024-04-24
Filing date: 2024-04-24
Publication date: 2025-03-25
Anticipated expiration: 2044-04-24
Also published as: CN118349679A

Abstract

The present invention relates to the field of data matching technology, and in particular to a task matching data security matching method based on a neural network, the method comprising: collecting text data sets from different data sources; constructing a homologous data association coefficient between any two text data according to the semantic information difference and structural feature similarity between each text data in the text data set of the same data source; constructing a multi-section valuable matching cost between any two text data according to the association rule difference between any two words in different data sources and the spatial information difference between any two text data; using a twin network to obtain the matching degree between any two text data in different data sources, and screening the matching data of each text data in each data source according to the matching degree. The present invention aims to improve the accuracy of the twin network in matching text data from different data sources, thereby improving the effectiveness of data matching.

Description

Task matching data security matching method based on a neural network

Technical Field

The application relates to the technical field of data matching, in particular to a task matching data security matching method based on a neural network.

Background

Data matching refers to the process of matching and integrating related data in different data sources. In the data matching process, the association relationship between the data in different data sources is found by comparing and matching the data, and the data are combined into a group of more complete and useful data. The data matching can be applied to various fields and application scenes, and helps integrate data from different sources, so that more accurate and comprehensive information is provided, and decision making and business process proceeding are supported. The data matching process comprises the steps of data cleaning, data matching, data merging and the like.

As enterprises' demand for energy resources increases, so does the frequency with which they conduct energy transactions. When an energy transaction is carried out, the actual energy use of the enterprise, for example its consumption of electric energy and its emission of carbon dioxide, needs to be effectively checked. Because energy use involves data in different forms and formats, all check data of an enterprise must be obtained by task matching: the data to be checked is automatically matched against, and combined with, the data already checked to obtain a more complete and effective set of enterprise data, which can greatly improve matching efficiency. Conventional graph embedding models at the present stage embed each item of a pair to be matched independently into a high-dimensional vector. Because of the lack of interaction between nodes and graphs, such models can only compare the overall similarity of the pair, and some fine-grained features are lost, so the quality of the data matching result is low.

Disclosure of Invention

In order to solve the technical problems, the invention provides a task matching data security matching method based on a neural network so as to solve the existing problems.

The task matching data security matching method based on the neural network adopts the following technical scheme:

An embodiment of the invention provides a neural network-based task matching data security matching method, which comprises the following steps:

Collecting text data sets of different data sources;

For each text data in the text data set of the same data source, constructing a structural multivalent matrix of each text data according to the correlation relationship between each text data and the dependency relationship between words, acquiring the whole text meaning matching degree between any two text data according to the semantic information difference between any two text data, acquiring the data dependency structure similarity between any two text data according to the structure feature similarity condition between any two text data, constructing the homologous data correlation coefficient between any two text data based on the whole text meaning matching degree and the data dependency structure similarity;

Acquiring transaction association rule matching cost between any two words in different data sources according to association rule differences between any two words in different data sources; for any two pieces of text data between different data sources, acquiring the text characteristic space information matching degree between any two pieces of text data according to the space information difference between the any two pieces of text data; acquiring multi-section valuable matching cost between any two pieces of text data based on text feature space information matching degree and task multi-dimensional matching cost weight;

And obtaining the matching degree between any two pieces of text data in different data sources by adopting a twin network, and screening the matching data of each piece of text data in each data source according to the matching degree.

Preferably, the construction of the structural multivalent matrix of each text data according to the correlation relationship between each text data and the dependency relationship between words includes:

The Jaccard coefficients between each piece of text data and all the text data in the text data set of the same data source are used as input of the Otsu threshold algorithm to obtain a segmentation threshold;

The text data whose Jaccard coefficient is larger than the segmentation threshold, among all the text data in the text data set of the same data source, form the similar data set of each piece of text data;

counting all dependency relationship types and corresponding frequencies of all text data in a similar data set of each piece of text data;

And for each piece of text data, taking the frequency of the r-th dependency relationship corresponding to the p-th word in the text data as the element of the p-th row and the r-th column in the structural multivalent matrix of the text data.

Preferably, the obtaining the whole text meaning matching degree between any two text data according to the semantic information difference between any two text data includes:

Obtaining the ED editing distance between any two pieces of text data; taking the absolute value of the difference between the numbers of words in the two pieces of text data as the exponent of an exponential function with the natural constant as base; and calculating the DTW distance between the structural multivalent matrices of the two pieces of text data;

And calculating the product of the calculation result of the exponential function and the DTW distance, and taking the sum of the product and the ED editing distance as the whole text meaning matching degree between any two text data.

Preferably, the obtaining the data dependency structure similarity between any two pieces of text data according to the structure feature similarity between any two pieces of text data includes:

for each word in each piece of text data, adopting ELMo model to obtain word vector of the word, using the word vector of the word as first element of single dependency structure down-conversion sequence, using the dependency relationship frequency in the corresponding row vector in the structure multivalent matrix of the text data where the word is located as second to last element of single dependency structure down-conversion sequence according to descending order;

Taking any two words in any two pieces of text data as a group of word pairs, calculating pearson correlation coefficients between single dependency structure down-conversion sequences of the word pairs, and calculating the average value of the pearson correlation coefficients of all the word pairs in the any two pieces of text data;

And taking the product of the number of words present in both pieces of text data and the mean value as the data dependency structure similarity between the two pieces of text data.

Preferably, the constructing the homologous data association coefficient between any two text data based on the whole text meaning matching degree and the data dependency structure similarity includes:

Constructing a first exponential function with the natural constant as base and the whole text meaning matching degree between any two pieces of text data as exponent; constructing a second exponential function with the natural constant as base and the data dependency structure similarity between the two pieces of text data as exponent; and taking the ratio of the result of the second exponential function to the result of the first exponential function as the homologous data association coefficient between the two pieces of text data.

Preferably, the obtaining the transaction association rule matching cost between any two words in different data sources according to the association rule difference between any two words in different data sources includes:

acquiring an association rule and an association rule confidence coefficient of each word in a text data set of each data source by adopting an Apriori algorithm;

For any two words in different data sources, acquiring the number of association rules in the data source where the words are located, and recording the sum of the number of association rules of any two words as a first sum;

For any two words in different data sources, obtaining, for each word, the minimum value and the average value of the confidence of its association rules in the data source where the word is located, calculating the product of the minimum value and the average value, and recording the sum of the products of the two words as a second sum;

And taking the ratio of the first sum to the second sum as the transaction association rule matching cost between any two words in different data sources.

Preferably, the obtaining the matching degree of the text feature space information between any two text data according to the space information difference between any two text data includes:

The method comprises the steps of taking each text data in each data source as each Node, taking a homologous data association coefficient between any two text data in each data source as edge weight between corresponding nodes, and constructing a label graph of each data source according to the nodes and the edge weight between the nodes;

for a node pair formed by any two nodes directly connected with corresponding nodes in the label graph of any two text data, calculating the average value of the correlation coefficients of the homologous data of the node pair;

And taking the sum of the sum and the first Euclidean distance as the matching degree of text feature space information between any two pieces of text data.

Preferably, the obtaining the task multidimensional matching cost weight between any two text data according to the association rule difference between words in any two text data includes:

Acquiring the sum of the reciprocals of the transaction association rule matching costs over all word pairs formed from any two pieces of text data;

Acquiring a set of matching cost levels corresponding to transaction association rule matching costs of all any two words in any two pieces of text data, wherein the same transaction association rule matching cost is used as the same matching cost level;

And calculating the Jaccard coefficient between the sets of the two pieces of text data, and multiplying the reciprocal of the sum of the Jaccard coefficient and a preset parameter adjustment factor by the sum of reciprocals to obtain the task multidimensional matching cost weight between the two pieces of text data.

Preferably, the obtaining the multiple sections of valuable matching cost between any two pieces of text data based on the text feature space information matching degree and the task multidimensional matching cost weight includes:

And calculating the product of the matching degree of the text characteristic space information between any two pieces of text data and the multi-dimensional matching cost weight of the task for any two pieces of text data, and taking the reciprocal of the sum of the product and a preset parameter adjusting factor as the multi-section valuable matching cost between any two pieces of text data.
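As a non-limiting illustration, the weight and cost defined in the preceding paragraphs can be sketched as two small functions. The function names, the default value of the parameter adjustment factor `eps`, and the choice to pass the Jaccard coefficient of the matching cost level sets in as a precomputed argument are assumptions made for this sketch only.

```python
def task_weight(pair_costs, level_jaccard, eps=1.0):
    """Task multidimensional matching cost weight: sum(1 / D) / (J + eps).

    pair_costs    -- transaction association rule matching costs D of all word pairs
    level_jaccard -- Jaccard coefficient J of the matching cost level sets (precomputed)
    eps           -- preset parameter adjustment factor (value assumed here)
    """
    if any(cost <= 0 for cost in pair_costs):
        raise ValueError("matching costs must be positive")
    reciprocal_sum = sum(1.0 / cost for cost in pair_costs)
    return reciprocal_sum / (level_jaccard + eps)


def multi_section_cost(spatial_match, weight, eps=1.0):
    """Multi-section valuable matching cost: 1 / (spatial_match * weight + eps)."""
    return 1.0 / (spatial_match * weight + eps)
```

For example, `multi_section_cost(2.0, task_weight([2.0, 4.0], 0.5, eps=0.5), eps=0.5)` combines both steps for one pair of text data.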

Preferably, the screening the matching data of each piece of text data in each data source according to the matching degree includes:

And regarding each piece of text data in each data source, taking the text data corresponding to the maximum value of the matching degree between each piece of text data and all text data in all the rest data sources as the matching data of each piece of text data.
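The screening step above amounts to taking, for each piece of text data, the candidate with the largest matching degree. A minimal sketch follows, assuming the matching degrees output by the twin network are already available as a dictionary keyed by hypothetical `(text_id, candidate_id)` pairs.

```python
def select_matches(matching_degree):
    """Return, for each text, the candidate with the maximum matching degree.

    matching_degree -- dict mapping (text_id, candidate_id) to a matching degree
    """
    best = {}
    for (text_id, candidate_id), degree in matching_degree.items():
        # Keep the candidate seen so far with the largest degree.
        if text_id not in best or degree > best[text_id][1]:
            best[text_id] = (candidate_id, degree)
    return {text_id: candidate for text_id, (candidate, _) in best.items()}
```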

The invention has at least the following beneficial effects:

By analyzing the similarity among the text data in the text data sets of multiple data sources during a single energy transaction of an enterprise, the invention screens a similar data set for each piece of text data. Word analysis is thus carried out on text data with similar characteristics, which increases the number of samples available for word relationship analysis and improves its quality. The invention constructs the whole text meaning matching degree between any two pieces of text data from the semantic information features of different text data in homologous data, mining the overall difference information of the text data, the words and the structural multivalent matrix, so that data matching is analyzed at the level of the whole text. It then constructs the data dependency structure similarity between any two pieces of text data from the differences between the single dependency structure down-conversion sequences of all word pairs in the two pieces of text data, so that similarity between data structures is analyzed at the level of structural features. Combining semantic correlation with lexical structural similarity yields the homologous data association coefficient, which considers both the semantic information and the syntactic structural features of the text data and therefore accurately evaluates the association strength between text data in the same data source;

The invention determines the transaction association rule matching cost used when matching text data of different data sources from the association rules mined from the text data of each data source. Some words have association rules of high confidence but occur so frequently that their importance in matching is reduced; the matching cost of words with different frequencies is therefore constrained by both the number and the confidence of their association rules. Analyzing in depth how the rule relationships among words affect text data matching reflects the underlying word rule structure of the text data at a deeper level and increases the reliability of the matching analysis. A label graph is constructed from the homologous data association coefficients of the text data, and the spatial information of each node in the label graph is mined; mapping the association information to a low-dimensional continuous space helps identify the similarity among nodes and thus helps judge data matching. Finally, the multi-section valuable matching index between text data in different data sources is determined from the matching costs of individual words and of the whole text, and is used as the metric of the downstream network in the twin network, which improves the accuracy of the twin network in matching text data from different data sources.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a neural network-based task matching data security matching method provided by the invention;

FIG. 2 is a schematic diagram of a structural multivalent matrix;

FIG. 3 is a schematic diagram of a network architecture;

FIG. 4 is a matching data acquisition flow chart.

Detailed Description

In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following detailed description is given below of the task matching data security matching method based on the neural network according to the invention, and the detailed implementation, structure, characteristics and effects thereof are described in detail below with reference to the accompanying drawings and the preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

The following specifically describes a specific scheme of the task matching data security matching method based on the neural network provided by the invention with reference to the accompanying drawings.

An embodiment of the invention provides a task matching data security matching method based on a neural network.

Specifically, referring to FIG. 1, the method comprises the following steps:

Step S001: collecting text data from different data sources when enterprise A carries out an energy transaction, and preprocessing the collected data.

In this embodiment, the task matching data generated when enterprise A carries out an energy transaction is used as the basic data for security matching. When enterprise A carries out an energy transaction, its energy consumption, the purchase amount of each energy source, the result of checking its actual energy use, and so on must be counted. The energy transaction flow therefore involves multiple tasks, and the sample spaces of different tasks differ: for example, the energy use data in the enterprise database differs from the energy types purchased by enterprise A and the purchase amounts of each energy source recorded by the energy supply center.

In this embodiment, the number of data sources required for processing multiple tasks is recorded as M, and the value of M is set by the implementer according to the actual situation, and in this embodiment, the value is 3. The 3 data sources are a third-party auditing institution, an enterprise database and an energy supply center respectively. When the enterprise a conducts an energy transaction, the supply of the seller and the demand of the enterprise a include a plurality of specific branches, such as the discharge condition, purchase amount, purchase time, selling amount of the seller, etc., of the enterprise a, and the above data are recorded in text form in different data sources.

For any piece of original text data that contains time data, the time data is first converted into the form of a timestamp. Secondly, in this embodiment, all original text data in each data source is used as input, each piece of original text data is converted into string form by using the dataframe.str function in a Python tool library, the converted result is used as text data, and the Chinese characters in each piece of text data are marked. For any data source, all text data in the data source form a text data set.
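The preprocessing described above can be sketched with the Python standard library alone. This is an illustrative stand-in for the DataFrame-based conversion named in the embodiment, not a reproduction of it; the record layout, the field name `time`, the time format and the use of UTC are all assumptions of the sketch.

```python
from datetime import datetime, timezone


def to_timestamp(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a time string to an integer Unix timestamp (UTC assumed)."""
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def record_to_text(record, time_fields=("time",)):
    """Turn one raw record (a dict) into a single text string.

    Time fields are replaced by timestamps; every value is converted to str.
    """
    parts = []
    for key, value in record.items():
        if key in time_fields:
            value = to_timestamp(value)
        parts.append(str(value))
    return " ".join(parts)


def build_text_data_set(records):
    """Form the text data set of one data source from its raw records."""
    return [record_to_text(record) for record in records]
```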

So far, the text data used for task matching during energy transaction in the embodiment, namely the text data set of each data source, is obtained and used for carrying out data matching among different data sources in the follow-up process.

Step S002: constructing a homologous data association coefficient based on the semantic correlation and lexical structural similarity between text data in the same data source, and determining the multi-section valuable matching index between text data in different data sources based on the label graph of each data source and the matching cost.

The embodiment aims at matching text data of different data sources through a neural network, and data security matching is completed based on a matching result. For datasets of different data sources, the present embodiment contemplates matching using the data structures and entity attribute tags of text data in the different data sources.

Specifically, when the label graph is constructed, each piece of text data is represented by a node, each edge represents the association relationship between two pieces of text data, and the same text data at different moments can be treated as the same node.

Further, a label graph is constructed for the text data set of each data source; its purpose is to reflect the relevance of entities and attributes in different data sources. This is because the same energy use is recorded with different content in different data sources. For example, when enterprise A purchases electricity once, the name of enterprise A, the time, the purchase amount and so on are recorded in the energy supply center, while the purchasing staff, the unit price, the signing staff and other data are recorded in the database of enterprise A; the entities corresponding to these different data for a single electricity purchase should all have a certain association with enterprise A.

For each piece of text data, the Jaccard coefficient of the character string sets between that text data and every other piece of text data in the text data set of the same data source is calculated. These Jaccard coefficients are used as input to the Otsu threshold algorithm to obtain a segmentation threshold, and the set formed by all text data whose Jaccard coefficient is larger than the segmentation threshold is taken as the similar data set of that text data. The Jaccard coefficient and the Otsu threshold algorithm are known techniques and are not described further in this embodiment.
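A minimal sketch of this screening step is given below. It assumes that the "character string set" of a text is simply its set of characters, and it applies Otsu's criterion (maximizing between-class variance) directly to the list of coefficients rather than to a histogram, returning the midpoint between the two classes as the threshold; both choices are assumptions of the sketch.

```python
def jaccard(a, b):
    """Jaccard coefficient of the character sets of two strings."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def otsu_threshold(values):
    """Threshold maximizing between-class variance over a list of numbers."""
    vals = sorted(values)
    n = len(vals)
    if n == 0:
        raise ValueError("at least one value is required")
    total = sum(vals)
    best_t, best_var, running = vals[0], -1.0, 0.0
    for i in range(1, n):
        running += vals[i - 1]
        if vals[i] == vals[i - 1]:
            continue  # no valid split between equal values
        w0, w1 = i / n, (n - i) / n
        m0, m1 = running / i, (total - running) / (n - i)
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, (vals[i - 1] + vals[i]) / 2
    return best_t


def similar_data_set(index, texts):
    """Indices of the texts whose Jaccard coefficient with texts[index] exceeds the threshold."""
    others = [j for j in range(len(texts)) if j != index]
    if not others:
        return []
    coeffs = {j: jaccard(texts[index], texts[j]) for j in others}
    threshold = otsu_threshold(list(coeffs.values()))
    return [j for j in others if coeffs[j] > threshold]
```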

Each piece of text data is taken as input, and the dependency relationship between any two words in it is obtained by dependency parsing; dependency parsing is a known technique and the specific process is not repeated. Next, the dependency relationships between any two words in each piece of text data in the similar data set of each piece of text data are obtained, and the types and frequencies of the dependency relationships are counted to form the structural multivalent matrix of each piece of text data. The structural multivalent matrix of the a-th text data is shown in FIG. 2, wherein $m$ is the number of words in the a-th text data, $H$ is the number of categories of all dependency relationships in all text data, $q_1$ to $q_H$ are the 1st to H-th dependency relationships, $w_1$ to $w_m$ are the 1st to m-th words, and $i_{11}$ is the frequency with which the first word $w_1$ in the a-th text data has the first dependency relationship $q_1$.
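The construction of the structural multivalent matrix can be illustrated as follows. The sketch assumes that a dependency parser has already produced, for every text in the similar data set, a list of `(word, relation_type)` pairs, and that "frequency" means a raw count; the function and parameter names are hypothetical.

```python
from collections import Counter


def structural_multivalent_matrix(words, similar_set_arcs, relation_types):
    """Build the m x H matrix of dependency relationship frequencies.

    words            -- the m words of the text, in order (rows)
    similar_set_arcs -- (word, relation_type) pairs gathered from the
                        similar data set of the text (parser output, assumed given)
    relation_types   -- the H dependency relationship types (columns)
    """
    counts = Counter(similar_set_arcs)
    # Element (p, r) is the frequency of relation r for the p-th word.
    return [[counts[(word, rel)] for rel in relation_types] for word in words]
```

For instance, with words `["A", "buys", "power"]` and relation types `["nsubj", "obj", "root"]`, each row lists how often the corresponding word carries each relation across the similar data set.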

For any data source, all sentences in the text data set of the data source are used as input to an ELMo model (Embeddings from Language Models), and the ELMo model is used to obtain the word vector of each word; the ELMo model is a known technique and the specific process is not repeated. Next, the word vector of the x-th word in the structural multivalent matrix of each text data is taken as the first element, the dependency relationship frequencies of the x-th word, arranged in descending order, are taken as the second to last elements, and the resulting sequence is the single dependency structure down-conversion sequence of the x-th word.

Based on the analysis, a homologous data association coefficient is constructed here for characterizing the degree of association of structural features and semantic information between two text data. Calculating a homologous data association coefficient between the a text data and the b text data:

$$S_{ab} = \mathrm{ED}(C_a, C_b) + \exp(|N_a - N_b|) \times \mathrm{dtw}(Y_a, Y_b)$$

$$L_{ab} = m_1 \times \frac{1}{N_a N_b} \sum_{x=1}^{N_a} \sum_{y=1}^{N_b} P(d_x, d_y)$$

$$l_{ab} = \frac{\exp(L_{ab})}{\exp(S_{ab})}$$

Wherein $S_{ab}$ is the whole text meaning matching degree between the a-th and b-th text data; $C_a$ and $C_b$ are the character strings of the a-th and b-th text data; $\mathrm{ED}(C_a, C_b)$ is the ED editing distance between the character strings $C_a$ and $C_b$; $\exp(\cdot)$ is the exponential function with the natural constant $e$ as base; $N_a$ and $N_b$ are the numbers of words in the a-th and b-th text data; $Y_a$ and $Y_b$ are the structural multivalent matrices of the a-th and b-th text data; and $\mathrm{dtw}(Y_a, Y_b)$ is the DTW distance between the matrices $Y_a$ and $Y_b$. The ED editing distance and the DTW distance are known techniques, and the detailed process is not repeated;

$L_{ab}$ is the data dependency structure similarity between the a-th and b-th text data; $m_1$ is the number of words existing simultaneously in the a-th and b-th text data; $x$ and $y$ index the x-th word of the a-th text data and the y-th word of the b-th text data respectively; $d_x$ is the single dependency structure down-conversion sequence of the x-th word of the a-th text data; $d_y$ is the single dependency structure down-conversion sequence of the y-th word of the b-th text data; and $P(d_x, d_y)$ is the Pearson correlation coefficient between the sequences $d_x$ and $d_y$;

$l_{ab}$ is the homologous data association coefficient between the a-th and b-th text data.

The greater the probability that the a-th and b-th text data were generated by the same energy transaction of enterprise A, the stronger the association between the two pieces of text data. The more similar the character strings of the a-th and b-th text data, the smaller the value of $\mathrm{ED}(C_a, C_b)$. The more similar the a-th and b-th text data and their similar data sets, the closer their word counts and the more similar the elements of their structural multivalent matrices, so the smaller the values of $\exp(|N_a - N_b|)$ and $\mathrm{dtw}(Y_a, Y_b)$; that is, the smaller the whole text meaning matching degree, the higher the matching degree between the a-th and b-th text data. The more words exist simultaneously in the a-th and b-th text data, the greater the value of $m_1$; the more similar the single dependency structure down-conversion sequences of the words in the two pieces of text data, the greater the value of $P(d_x, d_y)$ and hence the greater the data dependency structure similarity. Accordingly, the smaller the whole text meaning matching degree and the greater the data dependency structure similarity, the greater the homologous data association coefficient and the stronger the association between the a-th and b-th text data.
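The three quantities discussed above can be sketched in plain Python. Several points are assumptions of the sketch rather than details stated in the embodiment: the edit distance is taken to be the Levenshtein distance, the DTW distance treats each matrix row as one sequence element with the Euclidean distance between rows as local cost, and the word vector is flattened into the sequence before the Pearson correlation is computed. The ratio exp(L)/exp(S) is evaluated as exp(L - S), which is algebraically identical.

```python
import math


def edit_distance(s, t):
    """Levenshtein edit distance between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def dtw_distance(mat_a, mat_b):
    """DTW distance between two matrices, aligning rows (assumed local cost: Euclidean)."""
    inf = float("inf")
    n, m = len(mat_a), len(mat_b)
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(mat_a[i - 1], mat_b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def dependency_sequence(word_vector, freq_row):
    """Word vector followed by the dependency frequencies in descending order."""
    return list(word_vector) + sorted(freq_row, reverse=True)


def text_meaning_match(c_a, c_b, n_a, n_b, y_a, y_b):
    """S_ab = ED(C_a, C_b) + exp(|N_a - N_b|) * dtw(Y_a, Y_b)."""
    return edit_distance(c_a, c_b) + math.exp(abs(n_a - n_b)) * dtw_distance(y_a, y_b)


def structure_similarity(seqs_a, seqs_b, shared_words):
    """L_ab = m_1 * mean Pearson correlation over all word pairs."""
    pairs = [(da, db) for da in seqs_a for db in seqs_b]
    if not pairs:
        return 0.0
    return shared_words * sum(pearson(da, db) for da, db in pairs) / len(pairs)


def association_coefficient(s_ab, l_ab):
    """Homologous data association coefficient exp(L_ab) / exp(S_ab)."""
    return math.exp(l_ab - s_ab)
```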

Further, when any two pieces of text data from two data sources are matched, the matching results of different words contribute differently to the matching degree between the text data. For example, among all text data generated by enterprise A during one energy transaction, time data is recorded in different forms: the enterprise A data center records the date as October 9, 2021, while the energy supply center records the date in a different written form. Likewise, the purchasing personnel recorded in the enterprise A data center are staff $I_1$, $I_2$ and $I_3$, while the energy supply center records the signing person $I_1$. The words corresponding to the two attributes of time and personnel therefore have different influence when text data of different data sources are matched, and the matching degree of two pieces of text data can be determined directly from the matching results of the time and of part of the personnel, because under normal conditions energy transactions are infrequent and do not occur several times in one day. That is, when text data of different data sources are matched, the associated combinations of different words in the text data contribute differently to the match.

Therefore, in this embodiment, the text data set of each data source is used as a basic database of the transaction library, and the Apriori data mining algorithm is used to obtain the association rules of all the words in the text data set of each data source and the confidence level corresponding to each association rule, where the Apriori data mining algorithm is a known technology, and the specific process is not repeated.
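For illustration only, the following sketch mines association rules in the Apriori manner, restricted to itemsets of at most two words: infrequent single words are pruned first, and candidate pairs are formed only from the surviving words. Each text is treated as one transaction of words. The support and confidence thresholds are assumed values, since the embodiment does not state them.

```python
from collections import Counter
from itertools import combinations


def pairwise_association_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return {(lhs, rhs): confidence} for rules lhs -> rhs between single words."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    # Apriori pruning: only frequent single words can form frequent pairs.
    frequent = {item for item, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent)
        pair_counts.update(combinations(kept, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        if count / n < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules[(lhs, rhs)] = confidence
    return rules
```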

For the association rules of the text data sets of any one data source, the relevance and the association degree between different words are different, the words in the text data form association rules with different confidence degrees, when the text data of different data sources are matched, different matching costs are considered to be set according to the confidence degree weights of the association rules, the matching cost between the words corresponding to the association rules with larger confidence degree weights is smaller, and the reason for the setting is that the more reliable the relevance between the words in the association rules with larger confidence degrees and the text data is, the more accurate the matching result is. The calculation formula of the transaction association rule matching cost between the xth word and the p-th word in different data sources is as follows:

$$D_{xp} = \frac{n_x + n_p}{\mu_{x,\min} \times \bar{\mu}_x + \mu_{p,\min} \times \bar{\mu}_p}$$

Where $D_{xp}$ is the transaction association rule matching cost between the x-th word and the p-th word; $n_x$ and $n_p$ are the numbers of association rules containing the x-th and p-th words in the corresponding data sources; $\mu_{x,\min}$ and $\mu_{p,\min}$ are the minimum confidences of all association rules containing the x-th and p-th words in the corresponding data sources; and $\bar{\mu}_x$ and $\bar{\mu}_p$ are the mean confidences of the association rules containing the x-th and p-th words in the corresponding data sources.

The larger the confidence of the association rules containing the x-th and p-th words in the corresponding data sources, the larger the values of $\mu_{x,\min}$, $\mu_{p,\min}$, $\bar{\mu}_x$ and $\bar{\mu}_p$, and the smaller the value of $D_{xp}$. The sum of $n_x$ and $n_p$ is taken as the numerator because some words in the text data have association rules of high confidence yet occur so frequently that their importance in matching is reduced; the matching cost of words with different frequencies in the text data is thus constrained by both the number and the confidence of the association rules.
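The matching cost can be computed directly from the confidences of the rules that contain each word. The sketch below follows the ratio described in the disclosure (number of rules over the sum of minimum-times-mean confidence); the function name and the list-based input format are assumptions.

```python
def rule_matching_cost(conf_x, conf_p):
    """Transaction association rule matching cost between two words.

    conf_x -- confidences of all association rules containing the x-th word
    conf_p -- confidences of all association rules containing the p-th word
    """
    if not conf_x or not conf_p:
        raise ValueError("each word needs at least one association rule")
    first_sum = len(conf_x) + len(conf_p)               # n_x + n_p
    mean_x = sum(conf_x) / len(conf_x)
    mean_p = sum(conf_p) / len(conf_p)
    second_sum = min(conf_x) * mean_x + min(conf_p) * mean_p
    return first_sum / second_sum
```

A word that appears in many low-confidence rules thus receives a high cost, while a word with few, highly confident rules receives a low one.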

According to the steps, the homologous data association coefficient between any two pieces of text data in each data source is obtained respectively, and the homologous data association coefficient between any two pieces of text data is used as the weight of the edge between the corresponding two nodes to obtain the label graph of each data source.

Because the forms of data records and record carriers differ, the text data sets of different data sources may contain special characters and abbreviations, and concatenating many such feature words may also cause the text to lose its semantics. As a result, when two pieces of text data contain a large number of abbreviations and special character combinations, the text similarity between the full name and the abbreviation of the same attribute value is low, so text similarity alone cannot measure the true similarity and other information must be introduced to assist matching.

Specifically, the label graph of each data source is taken as input, and the Node2vec algorithm is used to obtain the spatial information of each node in the input label graph. Node2vec learns node representations by exploring the neighbors of each node in the graph in a flexible way, and finally maps the nodes of the graph to a low-dimensional continuous space in which the spatial information of the graph is recorded and preserved. Node2vec is a known technique, and its specific process is not repeated here. For text data in different data sources with a higher matching degree, the spatial information of the corresponding nodes will have greater similarity.
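The "flexible exploration" of neighbors in Node2vec refers to its biased second-order random walk. The Python sketch below shows only that walk step on a small weighted graph; the skip-gram training that turns walks into low-dimensional vectors is omitted. The graph, its edge weights and the function name are hypothetical illustration values, not data from the specification.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """One biased second-order random walk in the style of Node2vec.

    graph: {node: {neighbor: edge_weight}}, undirected, weights > 0.
    p is the return parameter and q the in-out parameter of Node2vec.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = list(graph[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            # First step: only the edge weights matter.
            weights = [graph[cur][n] for n in neighbors]
        else:
            prev = walk[-2]
            weights = []
            for n in neighbors:
                if n == prev:            # return to the previous node
                    bias = 1.0 / p
                elif n in graph[prev]:   # stay at distance 1 from it
                    bias = 1.0
                else:                    # move further away
                    bias = 1.0 / q
                weights.append(graph[cur][n] * bias)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical label graph: nodes are text records, edge weights are
# made-up homologous data association coefficients.
label_graph = {
    "t1": {"t2": 0.9, "t3": 0.2},
    "t2": {"t1": 0.9, "t3": 0.5},
    "t3": {"t1": 0.2, "t2": 0.5},
}
walks = [node2vec_walk(label_graph, n, 5, p=0.5, q=2.0, rng=random.Random(0))
         for n in label_graph]
```

In a full pipeline the walks would be fed to a skip-gram model to obtain the node vectors that the specification calls spatial information.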

Further, the transaction association rule matching cost between any two words in any two data sources is obtained, and each distinct value of the transaction association rule matching cost is treated as one matching cost level, so that equal costs belong to the same level. When two pieces of text data from two data sources are matched, the more matching cost levels there are, the more unstable the matching costs of the different words in the two pieces of text data are, and the greater the data matching cost is.

Based on the above analysis, a multi-section valuable matching index is constructed to represent the cost of matching two pieces of text data from different data sources. The multi-section valuable matching index between the a-th text data and the k-th text data in two data sources is calculated as:

$$R_{ak} = \mathrm{dist}(o_a, o_k) + \sum_{j=1}^{M_1}\sum_{h=1}^{M_2} \frac{L_{aj} + L_{kh}}{2}\,\mathrm{dist}(o_j, o_h)$$

$$u_{ak} = \frac{1}{\mathrm{jac}(G_a, G_k) + \varepsilon}\sum_{x=1}^{m_2}\sum_{p=1}^{m_3}\frac{1}{D(a_x, k_p)}$$

$$V_{ak} = \frac{1}{R_{ak}\,u_{ak} + \varepsilon}$$

where $R_{ak}$ is the text feature spatial information matching degree between the a-th text data and the k-th text data; $o_a$ and $o_k$ are the spatial information of the nodes corresponding to the a-th and k-th text data in their respective label graphs; $M_1$ and $M_2$ are the numbers of nodes directly connected to the nodes corresponding to the a-th and k-th text data in the label graphs; j and h index the j-th and h-th nodes directly connected to the nodes corresponding to the a-th and k-th text data; $L_{aj}$ is the homologous data association coefficient between the node corresponding to the a-th text data and the j-th node; $L_{kh}$ is the homologous data association coefficient between the node corresponding to the k-th text data and the h-th node; $o_j$ and $o_h$ are the spatial information of the j-th and h-th nodes; and $\mathrm{dist}(o_a, o_k)$ and $\mathrm{dist}(o_j, o_h)$ are the Euclidean distances between $o_a$ and $o_k$ and between $o_j$ and $o_h$;

$u_{ak}$ is the task multi-dimensional matching cost weight between the a-th text data and the k-th text data; $G_a$ and $G_k$ are the sets of matching cost levels corresponding to the transaction association rule matching costs of all words in the a-th and k-th text data respectively; $\mathrm{jac}(G_a, G_k)$ is the Jaccard coefficient between the sets $G_a$ and $G_k$; $m_2$ and $m_3$ are the numbers of words in the a-th and k-th text data respectively; $a_x$ and $k_p$ are the x-th word in the a-th text data and the p-th word in the k-th text data respectively; and $D(a_x, k_p)$ is the transaction association rule matching cost between the words $a_x$ and $k_p$;

$V_{ak}$ is the multi-section valuable matching index between the a-th and k-th text data in the two data sources; $\varepsilon$, the symbol used here for the preset parameter adjusting factor, prevents the denominator from being 0 and takes the value 0.001.

When energy transaction data are matched between two data sources, the greater the matching degree between text data in the two data sources, the more similar the spatial information $o_a$ and $o_k$ in the low-dimensional continuous space. The greater the matching degree between the a-th and k-th text data, the more likely they were generated by the same energy transaction, the more similar the connection structures of their corresponding nodes in the label graphs, the smaller the values of $\mathrm{dist}(o_a, o_k)$ and $\mathrm{dist}(o_j, o_h)$, and the smaller the value of $R_{ak}$. The more unstable the matching costs of the different words in the a-th and k-th text data, the lower the distribution similarity between the two, and the smaller the value of $\mathrm{jac}(G_a, G_k)$. The greater the confidence of the association rules in the a-th and k-th text data, the more reliable the association between them and the more accurate the matching result, the smaller the transaction association rule matching costs between their words, and the larger the value of $u_{ak}$. The multi-section valuable matching index $V_{ak}$ in turn decreases as $R_{ak}$ and $u_{ak}$ increase.
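The three quantities can be sketched in Python following the wording of claims 7 to 9: the spatial matching degree is the first Euclidean distance plus the sum, over neighbor pairs, of the mean association coefficient times the neighbor distance; the cost weight is the sum of reciprocal word-pair costs divided by the Jaccard coefficient plus the adjusting factor; and the index is the reciprocal of their product plus the adjusting factor. Reading "mean value of the homologous data association coefficient of the node pair" as the average of the two coefficients is an assumption, as are all function names and input formats. The cost-level sets and word-pair costs are taken as given inputs.

```python
import math

EPS = 0.001  # preset parameter adjusting factor given in the text


def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def spatial_matching(o_a, o_k, nbrs_a, nbrs_k):
    """Text feature spatial information matching degree between a and k.

    nbrs_a / nbrs_k: lists of (association coefficient, embedding) for
    the nodes directly connected to a / k in their label graphs.
    """
    r = math.dist(o_a, o_k)  # first Euclidean distance
    for l_aj, o_j in nbrs_a:
        for l_kh, o_h in nbrs_k:
            # Mean coefficient of the node pair times its distance.
            r += (l_aj + l_kh) / 2 * math.dist(o_j, o_h)
    return r


def cost_weight(levels_a, levels_k, pair_costs):
    """Task multi-dimensional matching cost weight between a and k.

    levels_a / levels_k: sets of matching cost levels of the two records.
    pair_costs: rule matching cost for every cross-record word pair.
    """
    reciprocal_sum = sum(1.0 / d for d in pair_costs)
    return reciprocal_sum / (jaccard(levels_a, levels_k) + EPS)


def matching_index(r_ak, u_ak):
    """Multi-section valuable matching index: 1 / (R * u + EPS)."""
    return 1.0 / (r_ak * u_ak + EPS)


# Made-up embeddings, coefficients and costs for illustration only.
r = spatial_matching((0.0, 0.0), (3.0, 4.0),
                     [(0.8, (1.0, 0.0))], [(0.6, (4.0, 4.0))])
u = cost_weight({2.0, 5.0}, {2.0, 4.0}, [2.0, 4.0, 5.0, 2.0])
v = matching_index(r, u)
```

Keeping the three functions separate mirrors the three formulas, so each intermediate value can be inspected before it enters the index.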

At this point, the multi-section valuable matching index between any two pieces of text data in different data sources has been obtained; it is used as the distance metric in the twin network.

Step S003: based on the multi-section valuable matching index between text data, a twin network is used to obtain the data matching result between different data sources in the energy transaction of enterprise A.

Following the above steps, the multi-section valuable matching index between any two pieces of text data in different data sources is obtained, and data matching is completed on the basis of this index.

Specifically, two pieces of text data from two data sources are used as the input of a twin network, which outputs the matching degree between the two inputs. The output of the upstream network in the twin network is the high-dimensional features of the two pieces of text data in the same feature space, and the multi-section valuable matching index between the two pieces of text data is used as the metric distance in the downstream network of the twin network. The twin network is a coupled architecture built from two sub-networks that share weights; its network structure is shown in Fig. 3. The twin network is a known technique, and its specific process is not repeated here.

For any piece of text data in each data source, the text data in the remaining data sources that has the maximum matching degree with it is taken as its matching data. The flow chart for obtaining the matching data is shown in Fig. 4.
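The selection rule amounts to taking, for each record, the candidate with the highest matching degree. A minimal Python sketch follows; the `degree` callable stands in for the output of the trained twin network and, like the toy values, is hypothetical.

```python
def select_matches(records, candidates, degree):
    """For each record, return the candidate with the highest matching degree.

    degree(a, k) stands in for the twin network output (hypothetical).
    """
    return {a: max(candidates, key=lambda k: degree(a, k)) for a in records}


# Toy matching degrees in place of a trained twin network.
toy = {("a1", "b1"): 0.2, ("a1", "b2"): 0.9,
       ("a2", "b1"): 0.7, ("a2", "b2"): 0.1}
matches = select_matches(["a1", "a2"], ["b1", "b2"],
                         lambda a, k: toy[(a, k)])
# matches == {"a1": "b2", "a2": "b1"}
```

Ties are resolved in favor of the first candidate, because `max` returns the first maximal element; the specification does not state a tie-breaking rule.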

This completes the embodiment.

In summary, the embodiment of the present invention analyzes the similarity among the text data in the text data sets of multiple data sources generated during a single energy transaction of an enterprise and screens out a similar data set for each piece of text data. This makes it convenient to analyze words across text data with similar characteristics, increases the number of samples available for word relationship analysis, and improves the quality of that analysis. The embodiment then constructs the whole text meaning matching degree between any two pieces of text data from the semantic information features of the text data in homologous data, mining the overall difference information of the character strings, the words and the structural multivalent matrices of the text data and analyzing the data matching situation from the perspective of the text data as a whole. Next, it constructs the data dependency structure similarity between any two pieces of text data from the differences between the single dependency structure frequency reduction sequences of all word pairs in the different text data, analyzing the similarity between data structures from the perspective of structural features. Finally, it combines the semantic correlation and the lexical structure similarity into the homologous data association coefficient. Because both the semantic information and the structural features of the text data are taken into account, the association strength between text data in the same data source can be evaluated accurately.

In addition, the embodiment of the present invention mines association rules from the text data of different data sources and uses them to determine the transaction association rule matching cost for matching text data across data sources. This cost takes into account that some words, although their association rules have high confidence, occur so frequently that their importance for matching is affected, and it constrains the matching cost of words with different frequencies through the number of association rules and their confidences. By analyzing in depth how the rule relationships among words affect the weights used in text data matching, it reflects the underlying word rule structure of the text data at a deeper level and improves the reliability of text data matching analysis. The spatial information of each node in the label graph, which is built from the homologous data association coefficients of the text data, is also mined; mapping the association information to a low-dimensional continuous space helps identify the similarity between nodes and thus assists in judging the data matching situation. Finally, based on the matching costs of individual words and of the whole text, the multi-section valuable matching index between text data in different data sources is determined and used as the metric in the downstream network of the twin network, which improves the accuracy of the twin network in matching text data from different data sources.

It should be noted that the order of the embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments. The foregoing describes specific embodiments of this specification. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

In this specification, the embodiments are described in a progressive manner; for parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments.

The embodiments described above are only intended to illustrate the technical solutions of the present application, not to limit them. Any modification of the technical solutions described in the foregoing embodiments, or any equivalent replacement of some of their technical features, that does not cause the essence of the corresponding technical solution to depart from the scope of the technical solutions of the embodiments of the present application shall be included in the protection scope of the present application.

Claims

1. A task matching data security matching method based on a neural network, characterized in that the method comprises the following steps:

Collect text datasets from different data sources;

For each piece of text data in a text data set of the same data source, a structural multivalent matrix of each piece of text data is constructed according to the correlation between each piece of text data and the dependency between words; the whole text meaning matching degree between any two pieces of text data is obtained according to the semantic information difference between any two pieces of text data; the data dependency structure similarity between any two pieces of text data is obtained according to the similarity of structural features between any two pieces of text data; the homologous data association coefficient between any two pieces of text data is constructed based on the whole text meaning matching degree and the data dependency structure similarity;

According to the difference in association rules between any two words in different data sources, the transaction association rule matching cost between any two words in different data sources is obtained; for any two text data from different data sources, the text feature spatial information matching degree between any two text data is obtained according to the difference in spatial information between any two text data; according to the difference in association rules between words in any two text data, the task multi-dimensional matching cost weight between any two text data is obtained; based on the text feature spatial information matching degree and the task multi-dimensional matching cost weight, the multi-section valuable matching cost between any two text data is obtained;

The twin network is used to obtain the matching degree between any two text data in different data sources, and the matching data of each text data in each data source are screened according to the matching degree;

The step of constructing a structural multivalent matrix of each piece of text data according to the correlation between each piece of text data and the dependency between words includes:

The similarity between each piece of text data and all the remaining text data in the text data set of the same data source is used as the input of the threshold segmentation algorithm to obtain the segmentation threshold;

The text data whose similarity is greater than the segmentation threshold in all the remaining text data in the text data set of the same data source are used to form a similar data set for each text data;

A dependency analysis algorithm is used to obtain the dependency relationship between any two words in each text data; all dependency relationship types and corresponding frequencies of all text data in a similar data set of each text data are counted;

For each piece of text data, the frequency of the rth dependency relationship corresponding to the pth word in the text data is used as the element of the pth row and rth column in the structural multivalent matrix of the text data;

The method of obtaining the whole text meaning matching degree between any two text data according to the semantic information difference between any two text data includes:

Obtain the degree of difference between any two text data; use the absolute value of the difference in the number of words in any two text data as the exponent of an exponential function with a natural constant as the base; calculate the metric distance between the structural multivalent matrices of any two text data;

Calculating the product of the calculation result of the exponential function and the metric distance, the whole text meaning matching degree between any two text data is positively correlated with the product and the difference degree respectively;

The step of obtaining the transaction association rule matching cost between any two words in different data sources according to the association rule difference between any two words in different data sources includes:

An association rule mining algorithm is used to obtain the association rule and association rule confidence of each word in the text data set of each data source;

For any two words in different data sources, the number of association rules in the data source where the words are located is obtained, and the sum of the number of association rules of the any two words is recorded as a first sum;

For any two words in different data sources, obtain the minimum value and the mean value of the confidence of the association rules in the data sources where the words are located, and calculate the fusion result of the minimum value and the mean value; record the sum of the fusion results of the any two words as the second sum value;

The transaction association rule matching cost between any two words in different data sources is positively correlated with the first sum value and negatively correlated with the second sum value;

The step of obtaining the task multi-dimensional matching cost weight between any two text data according to the difference in association rules between words in any two text data includes:

Obtain the sum of the reciprocals of the transaction association rule matching costs between any two words in any two text data;

Obtaining a set of matching cost levels corresponding to the transaction association rule matching costs of all word pairs in any two text data; wherein equal transaction association rule matching costs are regarded as the same matching cost level;

Calculating the similarity between the sets of any two text data, the task multi-dimensional matching cost weight between the any two text data being positively correlated with the sum of the reciprocals and negatively correlated with the similarity;

The multi-section valuable matching cost between any two text data is negatively correlated with the text feature space information matching degree between any two text data and the task multi-dimensional matching cost weight.

2. The task matching data security matching method based on neural network as claimed in claim 1 is characterized in that the structural multivalent matrix of each text data is constructed according to the correlation between each text data and the dependency between words, including:

The Jaccard coefficient between each text data and all the remaining text data in the text data set of the same data source is used as the input of the Otsu threshold algorithm to obtain the segmentation threshold;

The text data with Jaccard coefficient greater than the segmentation threshold in all the remaining text data in the text data set of the same data source are combined into a similar data set for each text data;

Dependency syntax is used to obtain the dependency relationship between any two words in each text data; all dependency relationship types and corresponding frequencies of all text data in the similar data set of each text data are counted;

For each piece of text data, the frequency of the rth dependency relationship corresponding to the pth word in the text data is used as the element of the pth row and rth column in the structural multivalent matrix of the text data.

3. The task matching data security matching method based on neural network according to claim 2 is characterized in that the step of obtaining the whole text meaning matching degree between any two text data according to the semantic information difference between any two text data comprises:

Obtain the ED edit distance between any two text data; use the absolute value of the difference in the number of words in any two text data as the exponent of an exponential function with a natural constant as the base; calculate the DTW distance between the structural multivalent matrices of any two text data;

The product of the calculation result of the exponential function and the DTW distance is calculated, and the sum of the product and the ED edit distance is used as the whole text meaning matching degree between any two text data.

4. The task matching data security matching method based on neural network according to claim 2 is characterized in that the data dependency structure similarity between any two text data is obtained according to the similarity of the structural features between any two text data, including:

For each word in each text data, the ELMo model is used to obtain the word vector of the word; the word vector of the word is used as the first element of the single dependency structure frequency reduction sequence, and the dependency frequency in the corresponding row vector in the structural multivalent matrix of the text data where the word is located is used as the second to the last element of the single dependency structure frequency reduction sequence in descending order;

Taking any two words in any two text data as a group of word pairs, calculating the Pearson correlation coefficient between the single dependency structure frequency reduction sequences of the word pairs, and calculating the average of the Pearson correlation coefficients of all word pairs in the any two text data;

The number of words that exist simultaneously in any two text data is obtained; and the product of the number of words and the mean is used as the data dependency structure similarity between any two text data.

5. The task matching data security matching method based on neural network as claimed in claim 1 is characterized in that the homologous data association coefficient between any two text data is constructed based on the whole text meaning matching degree and the data dependency structure similarity, including:

A first exponential function is constructed with a natural constant as the base and the whole text meaning matching degree between any two text data as the exponent; a second exponential function is constructed with a natural constant as the base and the data dependency structure similarity between any two text data as the exponent; and the ratio of the calculation result of the second exponential function to the calculation result of the first exponential function is used as the homologous data association coefficient between any two text data.

6. The task matching data security matching method based on neural network according to claim 1 is characterized in that the step of obtaining the transaction association rule matching cost between any two words in different data sources according to the association rule difference between any two words in different data sources comprises:

The Apriori algorithm is used to obtain the association rules and association rule confidence of each word in the text data set of each data source;

For any two words in different data sources, obtain the minimum value and the mean value of the confidence of the association rules in the data sources where the words are located, and calculate the fusion result of the minimum value and the mean value; record the sum of the fusion results of any two words as the second sum value; wherein the fusion result is the product of the minimum value and the mean value;

The ratio of the first sum value to the second sum value is used as the transaction association rule matching cost between any two words in different data sources.

7. The task matching data security matching method based on neural network according to claim 1 is characterized in that the text feature spatial information matching degree between any two text data is obtained according to the spatial information difference between any two text data, including:

Each text data in each data source is taken as a node, and the homologous data association coefficient between any two text data in each data source is taken as the edge weight between the corresponding nodes; the label graph of each data source is constructed from the nodes and the edge weights between the nodes; the Node2vec algorithm is applied to the label graph of each data source to obtain the spatial information of each node; the Euclidean distance between the spatial information of the nodes corresponding to any two text data in their respective label graphs is calculated and recorded as the first Euclidean distance;

For a node pair consisting of any two nodes directly connected to the corresponding node in the label graph where the two text data are located, calculate the mean value of the homologous data association coefficient of the node pair; calculate the Euclidean distance between the spatial information of the node pair, recorded as the second Euclidean distance;

The sum of the product of the mean value of all the node pairs and the second Euclidean distance is calculated and recorded as the third sum; the sum of the third sum and the first Euclidean distance is used as the text feature space information matching degree between any two text data.

8. The task matching data security matching method based on neural network according to claim 1 is characterized in that the step of obtaining the task multi-dimensional matching cost weight between any two text data according to the difference in association rules between words in any two text data comprises:

Obtain the sum of the reciprocals of the transaction association rule matching costs between any two words in any two text data, recorded as the fourth sum;

Calculate the Jaccard coefficient between the sets of any two text data, and multiply the reciprocal of the sum of the Jaccard coefficient and the preset parameter adjustment factor by the fourth sum to obtain the task multi-dimensional matching cost weight between any two text data.

9. The task matching data security matching method based on neural network as claimed in claim 1 is characterized in that the multi-section valuable matching cost between any two text data is obtained based on the text feature space information matching degree and the task multi-dimensional matching cost weight, including:

For any two pieces of text data, calculate the product of the text feature space information matching degree between the two pieces of text data and the task multi-dimensional matching cost weight, which is recorded as the first product; take the reciprocal of the sum of the first product and the preset parameter adjustment factor as the multi-section valuable matching cost between any two pieces of text data.

10. The task matching data security matching method based on neural network according to claim 1 is characterized in that the step of screening the matching data of each text data in each data source according to the matching degree comprises:

For each piece of text data in each data source, text data corresponding to the maximum value of the matching degree between each piece of text data and all text data in all remaining data sources is used as matching data for each piece of text data.