CN118349679B - Task matching data security matching method based on neural network - Google Patents

Task matching data security matching method based on neural network

Info

Publication number
CN118349679B
Authority
CN
China
Prior art keywords
text data
data
matching
text
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410500108.0A
Other languages
Chinese (zh)
Other versions
CN118349679A (en)
Inventor
王世谦
邵志鹏
张小建
王圆圆
贾一博
高先周
李为
宋大为
李秋燕
费稼轩
黄秀丽
卜飞飞
华远鹏
韩丁
于雪辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Jiuyu Tenglong Information Engineering Co ltd
Economic and Technological Research Institute of State Grid Henan Electric Power Co Ltd
State Grid Smart Grid Research Institute of SGCC
Original Assignee
Henan Jiuyu Tenglong Information Engineering Co ltd
Economic and Technological Research Institute of State Grid Henan Electric Power Co Ltd
State Grid Smart Grid Research Institute of SGCC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Jiuyu Tenglong Information Engineering Co ltd, Economic and Technological Research Institute of State Grid Henan Electric Power Co Ltd, State Grid Smart Grid Research Institute of SGCC
Priority to CN202410500108.0A
Publication of CN118349679A
Application granted
Publication of CN118349679B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data matching, and in particular to a task matching data security matching method based on a neural network. The method comprises: collecting text data sets from different data sources; constructing a homologous data association coefficient between any two pieces of text data according to the semantic information differences and structural feature similarity between the pieces of text data in the text data set of the same data source; constructing a multi-section valuable matching cost between any two pieces of text data according to the association rule differences between any two words in different data sources and the spatial information differences between any two pieces of text data; and using a twin (Siamese) network to obtain the matching degree between any two pieces of text data in different data sources, then screening the matching data of each piece of text data in each data source according to the matching degree. The invention aims to improve the accuracy of the twin network when matching text data from different data sources, and thereby the effectiveness of data matching.

Description

Task matching data security matching method based on neural network
Technical Field
The invention relates to the technical field of data matching, and in particular to a task matching data security matching method based on a neural network.
Background
Data matching is the process of matching and integrating related data from different data sources. By comparing and matching the data, the association relationships between data in different sources are found and the data are merged into a more complete and useful set. Data matching applies to many fields and scenarios; it helps integrate data from different sources so as to provide more accurate and comprehensive information and to support decision making and business processes. The data matching process includes steps such as data cleaning, data matching, and data merging.
As enterprises' demand for energy resources grows, so does the frequency of their energy transactions. When an energy transaction is carried out, the enterprise's actual energy use, for example its electricity consumption and carbon dioxide emissions, must be effectively verified. Because energy use involves data in different forms and formats, all of an enterprise's verification data must be obtained through task matching: the data to be verified is automatically matched and merged with the data already verified to obtain a more complete and effective data set for the enterprise, which can greatly improve matching efficiency. Conventional graph embedding models currently embed each of the items to be matched independently into a high-dimensional vector. Because they lack interaction between nodes and graphs, such models can only compare the items to be matched as a whole and may lose fine-grained features, so the quality of the data matching result is low.
Disclosure of Invention
To solve the above technical problem, the invention provides a task matching data security matching method based on a neural network.
The task matching data security matching method based on a neural network adopts the following technical solution:
An embodiment of the invention provides a task matching data security matching method based on a neural network, comprising the following steps:
Collecting text data sets of different data sources;
For each text data in the text data set of the same data source, constructing a structural multivalent matrix of each text data according to the correlation relationship between each text data and the dependency relationship between words, acquiring the whole text meaning matching degree between any two text data according to the semantic information difference between any two text data, acquiring the data dependency structure similarity between any two text data according to the structure feature similarity condition between any two text data, constructing the homologous data correlation coefficient between any two text data based on the whole text meaning matching degree and the data dependency structure similarity;
Acquiring transaction association rule matching cost between any two words in different data sources according to association rule differences between any two words in different data sources; for any two pieces of text data between different data sources, acquiring the text characteristic space information matching degree between any two pieces of text data according to the space information difference between the any two pieces of text data; acquiring multi-section valuable matching cost between any two pieces of text data based on text feature space information matching degree and task multi-dimensional matching cost weight;
And obtaining the matching degree between any two pieces of text data in different data sources by adopting a twin network, and screening the matching data of each piece of text data in each data source according to the matching degree.
Preferably, the construction of the structural multivalent matrix of each text data according to the correlation relationship between each text data and the dependency relationship between words includes:
taking the Jaccard coefficients between each piece of text data and all text data in the text data set of the same data source as input to the Otsu threshold algorithm to obtain a segmentation threshold;
forming the similar data set of each piece of text data from all text data in the text data set of the same data source whose Jaccard coefficient is greater than the segmentation threshold;
counting all dependency relationship types and their frequencies over all text data in the similar data set of each piece of text data;
and, for each piece of text data, taking the frequency of the r-th dependency relationship of the p-th word in the text data as the element in the p-th row and r-th column of the structural multivalent matrix of the text data.
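The similar-data-set step above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the Jaccard coefficient is taken over the character sets of the two strings, excludes a text from its own similar set, and uses a hand-written one-dimensional Otsu threshold. The function names `jaccard`, `otsu_threshold`, and `similar_set` are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient of the character sets of two strings (assumed definition)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def otsu_threshold(values: list[float]) -> float:
    """1-D Otsu: choose the cut that maximizes the between-class variance."""
    xs = sorted(values)
    if not xs:
        return 0.0
    best_cut, best_var = xs[0], -1.0
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no cut can separate equal values
        low, high = xs[:i], xs[i:]
        w0, w1 = len(low) / len(xs), len(high) / len(xs)
        mu0, mu1 = sum(low) / len(low), sum(high) / len(high)
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_cut = var, (xs[i - 1] + xs[i]) / 2
    return best_cut


def similar_set(index: int, texts: list[str]) -> list[int]:
    """Indices of texts whose Jaccard coefficient with texts[index] exceeds the Otsu cut."""
    scores = {j: jaccard(texts[index], t) for j, t in enumerate(texts) if j != index}
    cut = otsu_threshold(list(scores.values()))
    return [j for j, s in scores.items() if s > cut]


texts = ["buy power 100", "buy power 200", "audit carbon", "buy power 300"]
print(similar_set(0, texts))
```

When all coefficients are equal the sketch returns an empty similar set; how the method treats that case is not stated in the text.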
Preferably, the obtaining the whole text meaning matching degree between any two text data according to the semantic information difference between any two text data includes:
obtaining the ED edit distance between any two pieces of text data, taking the absolute value of the difference between the numbers of words in the two pieces of text data as the exponent of an exponential function with the natural constant as base, and calculating the DTW distance between the structural multivalent matrices of the two pieces of text data;
calculating the product of the result of the exponential function and the DTW distance, and taking the sum of the product and the ED edit distance as the whole text meaning matching degree between the two pieces of text data.
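A minimal sketch of this computation is given below. It assumes the ED edit distance is the Levenshtein distance and that the DTW distance treats each structural multivalent matrix as a sequence of row vectors with Euclidean local cost; the text does not fix either choice, and the function names are hypothetical.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def dtw_distance(ya: list[list[float]], yb: list[list[float]]) -> float:
    """DTW over matrix rows; Euclidean distance between rows is an assumption."""
    n, m = len(ya), len(yb)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ya[i - 1], yb[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def whole_text_matching_degree(ca, cb, na, nb, ya, yb) -> float:
    """ED(Ca, Cb) + exp(|Na - Nb|) * dtw(Ya, Yb), as described in the text."""
    return edit_distance(ca, cb) + math.exp(abs(na - nb)) * dtw_distance(ya, yb)
```

Because every term is a distance, a smaller value indicates that the two texts are closer, despite the name "matching degree".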
Preferably, the obtaining the data dependency structure similarity between any two pieces of text data according to the structure feature similarity between any two pieces of text data includes:
for each word in each piece of text data, obtaining the word vector of the word with an ELMo model and taking it as the first element of the word's single dependency structure down-conversion sequence, then taking the dependency relationship frequencies in the word's row of the structural multivalent matrix of its text data, sorted in descending order, as the second to last elements of the sequence;
taking any two words, one from each of the two pieces of text data, as a word pair, calculating the Pearson correlation coefficient between the single dependency structure down-conversion sequences of each word pair, and calculating the mean of the Pearson correlation coefficients over all word pairs of the two pieces of text data;
and taking the product of the number of words shared by the two pieces of text data and the mean as the data dependency structure similarity between the two pieces of text data.
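The following sketch illustrates one reading of this step. It assumes the word vector is flattened into the head of the sequence so that all sequences have equal length, and it treats a constant sequence as uncorrelated; both are assumptions, and the ELMo vectors are taken as given inputs. All function names are hypothetical.

```python
from statistics import mean


def pearson(u: list[float], v: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    mu_u, mu_v = mean(u), mean(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # assumption: a constant sequence carries no correlation
    return cov / (var_u * var_v) ** 0.5


def dependency_sequence(word_vector: list[float], freq_row: list[float]) -> list[float]:
    """Word vector first, then dependency frequencies in descending order."""
    return list(word_vector) + sorted(freq_row, reverse=True)


def structure_similarity(seqs_a, seqs_b, shared_words: int) -> float:
    """Shared-word count times the mean Pearson coefficient over all word pairs."""
    pairs = [pearson(dx, dy) for dx in seqs_a for dy in seqs_b]
    return shared_words * mean(pairs)
```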
Preferably, the constructing the homologous data association coefficient between any two text data based on the whole text meaning matching degree and the data dependency structure similarity includes:
constructing a first exponential function with the natural constant as base and the whole text meaning matching degree between any two pieces of text data as exponent, constructing a second exponential function with the natural constant as base and the data dependency structure similarity between the two pieces of text data as exponent, and taking the ratio of the result of the second exponential function to the result of the first exponential function as the homologous data association coefficient between the two pieces of text data.
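As a small illustration of the ratio just described, the sketch below evaluates it as a single exponential of the difference, which is algebraically identical and avoids overflow when either exponent is large. The function name is hypothetical.

```python
import math


def homologous_association(structure_similarity: float, meaning_matching_degree: float) -> float:
    """exp(L) / exp(S), computed as exp(L - S) for numerical stability."""
    return math.exp(structure_similarity - meaning_matching_degree)
```

The coefficient grows with the data dependency structure similarity and shrinks as the whole text meaning matching degree (a sum of distances) grows.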
Preferably, the obtaining the transaction association rule matching cost between any two words in different data sources according to the association rule difference between any two words in different data sources includes:
acquiring an association rule and an association rule confidence coefficient of each word in a text data set of each data source by adopting an Apriori algorithm;
For any two words in different data sources, acquiring the number of association rules in the data source where the words are located, and recording the sum of the number of association rules of any two words as a first sum;
aiming at any two words in different data sources, obtaining the minimum value and the average value of the confidence coefficient of the association rule in the data source where the words are located, and calculating the product of the minimum value and the average value;
And taking the ratio of the first sum value to the second sum value as the transaction association rule matching cost between any two words in different data sources.
Preferably, the obtaining the matching degree of the text feature space information between any two text data according to the space information difference between any two text data includes:
The method comprises the steps of taking each text data in each data source as each Node, taking a homologous data association coefficient between any two text data in each data source as edge weight between corresponding nodes, and constructing a label graph of each data source according to the nodes and the edge weight between the nodes;
for a node pair formed by any two nodes directly connected with corresponding nodes in the label graph of any two text data, calculating the average value of the correlation coefficients of the homologous data of the node pair;
And taking the sum of the sum and the first Euclidean distance as the matching degree of text feature space information between any two pieces of text data.
Preferably, the obtaining the task multidimensional matching cost weight between any two text data according to the association rule difference between words in any two text data includes:
acquiring the sum of the reciprocal of the transaction association rule matching cost between all any two words in any two pieces of text data;
Acquiring a set of matching cost levels corresponding to transaction association rule matching costs of all any two words in any two pieces of text data, wherein the same transaction association rule matching cost is used as the same matching cost level;
and calculating the Jaccard coefficient between the sets of the two pieces of text data, and multiplying the reciprocal of the sum of the Jaccard coefficient and a preset parameter adjustment factor by the sum value to obtain the task multidimensional matching cost weight between the two pieces of text data.
Preferably, the obtaining the multiple sections of valuable matching cost between any two pieces of text data based on the text feature space information matching degree and the task multidimensional matching cost weight includes:
And calculating the product of the matching degree of the text characteristic space information between any two pieces of text data and the multi-dimensional matching cost weight of the task for any two pieces of text data, and taking the reciprocal of the sum of the product and a preset parameter adjusting factor as the multi-section valuable matching cost between any two pieces of text data.
Preferably, the screening the matching data of each piece of text data in each data source according to the matching degree includes:
And regarding each piece of text data in each data source, taking the text data corresponding to the maximum value of the matching degree between each piece of text data and all text data in all the rest data sources as the matching data of each piece of text data.
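The screening rule can be sketched as a per-item arg-max. The example assumes the matching degrees produced by the twin network are already available in a nested dictionary keyed by hypothetical `(source, index)` tuples; the network itself is not modeled here.

```python
def select_matches(degrees: dict) -> dict:
    """For each item, pick the candidate from any other source with the highest matching degree."""
    matches = {}
    for item, candidates in degrees.items():
        # Keep only candidates that belong to a different data source.
        others = {c: d for c, d in candidates.items() if c[0] != item[0]}
        if others:
            matches[item] = max(others, key=others.get)
    return matches


degrees = {
    ("enterprise_db", 0): {
        ("supply_center", 0): 0.2,
        ("supply_center", 1): 0.9,
        ("enterprise_db", 1): 0.99,  # same source, ignored
    }
}
print(select_matches(degrees))
```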
The invention has at least the following beneficial effects:
By analyzing the similarity among the text data in the text data sets of multiple data sources during one energy transaction of an enterprise, the invention screens out the similar data set of each text data, so that word analysis is performed on text data with similar features; this increases the number of samples available for word relationship analysis and improves its quality. The invention also constructs the whole text meaning matching degree between any two pieces of text data from the semantic information features of the text data in the same data source, mining the overall difference information of the text data, the words, and the structural multivalent matrices so as to analyze the data matching from the text data as a whole. It then constructs the data dependency structure similarity between any two pieces of text data from the differences between the single dependency structure down-conversion sequences of all word combinations in the two pieces of text data, analyzing the similarity between data structures from their structural features. Combining semantic relevance and lexical structural similarity, it constructs the homologous data association coefficient, which considers both the semantic information and the syntactic structure features of the text data and accurately evaluates the association strength between text data in the same data source.
The invention further determines the transaction association rule matching cost for matching text data of different data sources based on the association rules mined from that text data. Some words have association rules with high confidence but occur too frequently, which affects their importance when words are matched. The invention therefore constrains the matching cost of words of different frequencies through the number and confidence of their association rules, analyzes in depth how much the rule relationships among words influence text data matching, reflects the basic word rule structure of the text data at a deeper level, and increases the reliability of the matching analysis. A label graph is constructed from the homologous data association coefficients of the text data and the spatial information of each node in the label graph is mined; mapping the association information into a low-dimensional continuous space helps identify the similarity among nodes and thus judge the data matching. Finally, based on the matching costs of individual words and of whole texts, the multi-section valuable matching index between text data in different data sources is determined and used as the metric of the downstream task in the twin network, which improves the accuracy of the twin network when matching text data from different data sources.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a neural network-based task matching data security matching method provided by the invention;
FIG. 2 is a schematic diagram of a structural multivalent matrix;
FIG. 3 is a schematic diagram of a network architecture;
FIG. 4 is a matching data acquisition flow chart.
Detailed Description
In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following detailed description is given below of the task matching data security matching method based on the neural network according to the invention, and the detailed implementation, structure, characteristics and effects thereof are described in detail below with reference to the accompanying drawings and the preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following specifically describes a specific scheme of the task matching data security matching method based on the neural network provided by the invention with reference to the accompanying drawings.
An embodiment of the invention provides a task matching data security matching method based on a neural network. Referring to FIG. 1, the method comprises the following steps:
Step S001: collect text data from the different data sources involved when enterprise A carries out an energy transaction, and preprocess the collected data.
In this embodiment, the task matching data generated when enterprise A carries out an energy transaction is used as the basic data for security matching. When enterprise A carries out an energy transaction, its energy consumption, the purchase amount of each energy source, the verified actual energy use, and so on must be compiled. The energy transaction flow therefore involves multiple tasks, and the sample spaces of different tasks differ; for example, the energy use data in the enterprise database differs from the energy types purchased by enterprise A and the purchase amount of each energy source recorded by the energy supply center.
In this embodiment, the number of data sources required for processing the multiple tasks is denoted M. The value of M is set by the implementer according to the actual situation; in this embodiment M is 3. The 3 data sources are a third-party auditing institution, the enterprise database, and the energy supply center. When enterprise A conducts an energy transaction, the seller's supply and enterprise A's demand include several specific branches, such as enterprise A's emissions, purchase amount, and purchase time and the seller's sales amount, and these data are recorded in text form in the different data sources.
For any original text data that contains time data, the time data is first converted into a timestamp. Next, all original text data in each data source is taken as input, each piece of original text data is converted into a character string using the dataframe.str function of a Python tool library, the converted result is taken as the text data, and the Chinese characters in each piece of text data are marked. For each data source, all of its text data forms a text data set.
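The preprocessing can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than the embodiment's code: it does not use the dataframe.str function the text mentions, and the field name `time`, the date format `%Y-%m-%d`, and the UTC interpretation are all hypothetical choices.

```python
from datetime import datetime, timezone


def to_timestamp(text: str, fmt: str = "%Y-%m-%d") -> int:
    """Parse a date string (assumed format) and return a UTC Unix timestamp."""
    return int(datetime.strptime(text, fmt).replace(tzinfo=timezone.utc).timestamp())


def record_to_text(record: dict, time_field: str = "time") -> str:
    """Replace the time field with its timestamp and join all values into one string."""
    fields = dict(record)
    if time_field in fields:
        fields[time_field] = to_timestamp(str(fields[time_field]))
    return " ".join(str(v) for v in fields.values())


print(record_to_text({"buyer": "A", "time": "2021-10-09", "amount": 100}))
```

Normalizing the time field first means that differently formatted dates in different sources become comparable before any string matching is attempted.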
So far, the text data used for task matching during energy transaction in the embodiment, namely the text data set of each data source, is obtained and used for carrying out data matching among different data sources in the follow-up process.
Step S002: construct a homologous data association coefficient based on the semantic relevance and lexical structural similarity between text data in the same data source, and determine the multi-section valuable matching index between text data in different data sources based on the label graph of each data source and the matching cost.
The embodiment aims at matching text data of different data sources through a neural network, and data security matching is completed based on a matching result. For datasets of different data sources, the present embodiment contemplates matching using the data structures and entity attribute tags of text data in the different data sources.
Specifically, when the label graph is constructed, each text data is represented by a node, each edge represents the association relationship between two text data, and the same text data at different moments is treated as the same node.
Further, a label graph is constructed for the text data set of each data source in order to reflect the relevance of entities and attributes across data sources. This is because different data sources record different content for the same energy use. For example, when enterprise A purchases electricity once, the energy supply center records enterprise A's name, the time, and the purchase amount, while enterprise A's database records the purchasing staff, the unit price, the signatory, and similar data. The entities corresponding to these different data in a single electricity purchase should all have some association with enterprise A.
For each piece of text data, the Jaccard coefficient of the character string sets between it and every other piece of text data in the text data set of the same data source is calculated. These Jaccard coefficients are taken as input to the Otsu threshold algorithm to obtain a segmentation threshold, and the set of all text data whose Jaccard coefficient is greater than the segmentation threshold is taken as the similar data set of that piece of text data. The Jaccard coefficient and the Otsu threshold algorithm are known techniques and are not described further in this embodiment.
Each text data is taken as input, and the dependency relationship between any two words in it is obtained by dependency parsing; dependency parsing is a known technique, and the specific process is not repeated. Next, the dependency relationships between any two words in each piece of text data in the similar data set of each text data are obtained, and the types and frequencies of the dependency relationships are counted to form the structural multivalent matrix of each text data. The structural multivalent matrix of the a-th text data is shown in FIG. 2, where m is the number of words in the a-th text data, H is the number of dependency relationship types across all text data, q_1 to q_H are the 1st to H-th dependency relationships, w_1 to w_m are the 1st to m-th words, and i_11 is the frequency with which the first word w_1 in the a-th text data has the first dependency relationship q_1.
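Building the matrix from parsed output can be sketched as below. The sketch assumes a dependency parser has already produced `(word, relation)` observations over the similar data set and that "frequency" means a raw count; the parser itself, the relation labels, and the function name are not specified by the text and are hypothetical.

```python
from collections import Counter


def structure_matrix(words: list[str], relations: list[str],
                     observations: list[tuple[str, str]]) -> list[list[int]]:
    """Rows are the words of one text, columns are relation types, cells are counts."""
    counts = Counter(observations)
    return [[counts[(w, r)] for r in relations] for w in words]


words = ["purchase", "electricity"]
relations = ["nsubj", "dobj"]
observations = [("purchase", "dobj"), ("purchase", "dobj"), ("electricity", "nsubj")]
print(structure_matrix(words, relations, observations))
```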
For each data source, all sentences in its text data set are taken as input to an ELMo model (Embeddings from Language Models), which is used to obtain the word vector of each word; the ELMo model is a known technique, and the specific process is not repeated. Next, the word vector of the x-th word in the structural multivalent matrix of each text data is taken as the first element, the dependency relationship frequencies of the x-th word, sorted in descending order, are taken as the second to last elements, and the resulting sequence is the single dependency structure down-conversion sequence of the x-th word.
Based on the analysis, a homologous data association coefficient is constructed here for characterizing the degree of association of structural features and semantic information between two text data. Calculating a homologous data association coefficient between the a text data and the b text data:
$$S_{ab} = ED(C_a, C_b) + \exp(|N_a - N_b|) \times DTW(Y_a, Y_b)$$

$$l_{ab} = m_1 \times \frac{1}{N_a N_b} \sum_{x=1}^{N_a} \sum_{y=1}^{N_b} P(d_x, d_y)$$

$$L_{ab} = \frac{\exp(l_{ab})}{\exp(S_{ab})}$$

Where S_ab is the whole text meaning matching degree between the a-th and b-th text data; C_a and C_b are the character strings of the a-th and b-th text data; ED(C_a, C_b) is the edit distance between the character strings C_a and C_b; exp() is the exponential function with the natural constant e as its base; N_a and N_b are the numbers of words in the a-th and b-th text data; Y_a and Y_b are the structure multivalent matrices of the a-th and b-th text data; and DTW(Y_a, Y_b) is the DTW distance between the matrices Y_a and Y_b. The edit distance and the DTW distance are known techniques and their specific processes are not repeated here;
l_ab is the data dependency structure similarity between the a-th and b-th text data; m_1 is the number of words that exist simultaneously in the a-th and b-th text data; x and y index the x-th word of the a-th text data and the y-th word of the b-th text data; d_x is the single dependency structure down-conversion sequence of the x-th word of the a-th text data; d_y is the single dependency structure down-conversion sequence of the y-th word of the b-th text data; and P(d_x, d_y) is the Pearson correlation coefficient between the sequences d_x and d_y;
L_ab is the homologous data association coefficient between the a-th and b-th text data.
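The formula for S_ab can be tried out with the sketch below. The patent treats the edit distance and DTW as known techniques and does not fix a variant, so this sketch assumes a Levenshtein edit distance and a DTW that treats each row of the structure multivalent matrix as one sequence element compared by Euclidean distance. All function names are illustrative.

```python
import math

def edit_distance(s, t):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def dtw_distance(rows_a, rows_b):
    """DTW between two matrices, treating each row as one sequence element (assumption)."""
    inf = float("inf")
    n, m = len(rows_a), len(rows_b)
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(rows_a[i - 1], rows_b[j - 1])
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

def whole_text_meaning_match(c_a, c_b, n_a, n_b, y_a, y_b):
    """S_ab = ED(C_a, C_b) + exp(|N_a - N_b|) * DTW(Y_a, Y_b)."""
    return edit_distance(c_a, c_b) + math.exp(abs(n_a - n_b)) * dtw_distance(y_a, y_b)

# Toy inputs: identical matrices and equal word counts, so only the edit distance remains.
y = [[1, 0], [0, 1]]
s_ab = whole_text_meaning_match("kitten", "sitting", 2, 2, y, y)
print(s_ab)  # 3.0
```

Because every term is a distance, a smaller S_ab corresponds to two texts whose overall content is closer.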
The greater the probability that the a-th and b-th text data were generated by the same energy transaction of enterprise A, the stronger the association between the two pieces of text data. The more similar the character strings of the a-th and b-th text data, the smaller the value of ED(C_a, C_b). The closer their word counts and the more similar the elements of their structure multivalent matrices Y_a and Y_b, the smaller the values of exp(|N_a - N_b|) and DTW(Y_a, Y_b), so the smaller the value of S_ab and the better the meanings of the two whole texts match. The more words exist simultaneously in the a-th and b-th text data, the larger the value of m_1; the more similar the single dependency structure down-conversion sequences of their words, the larger the value of P(d_x, d_y), and therefore the larger the value of l_ab. Accordingly, the stronger the association between the a-th and b-th text data, the larger their homologous data association coefficient.
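The two remaining quantities can be sketched from the wording of claims 4 and 5: the data dependency structure similarity is the number of shared words multiplied by the mean Pearson coefficient over all word pairs, and the homologous data association coefficient is the ratio of exp(similarity) to exp(whole text meaning matching degree). The sketch assumes equal-length sequences and returns 0.0 for a constant sequence, a convention the source does not specify.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mu_u, mu_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)

def dependency_structure_similarity(seqs_a, seqs_b, shared_words):
    """l_ab = m_1 * mean of P(d_x, d_y) over all word pairs (x in a, y in b)."""
    pairs = [pearson(dx, dy) for dx in seqs_a for dy in seqs_b]
    return shared_words * sum(pairs) / len(pairs)

def homologous_association(l_ab, s_ab):
    """L_ab = exp(l_ab) / exp(S_ab), computed as exp(l_ab - S_ab) to limit overflow."""
    return math.exp(l_ab - s_ab)

# Toy sequences: one word per text, perfectly correlated, one shared word.
l_ab = dependency_structure_similarity([[1, 2, 3]], [[2, 4, 6]], shared_words=1)
print(homologous_association(l_ab, 1.0))  # 1.0
```

Raising l_ab or lowering S_ab increases the coefficient, which matches the directions described in the paragraph above.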
Further, when any two pieces of text data in two data sources are matched, the matching results of different words contribute differently to the matching degree between the text data. For example, among all the text data generated by enterprise A during one energy transaction, time data is recorded in different forms: the enterprise A data center records the date as October 9, 2021, while the energy supply center records the same date in a different written form. Likewise, the purchasing personnel recorded by the enterprise A data center are staff members I_1, I_2 and I_3, while the energy supply center records only the signing person I_1. Words corresponding to the two attributes of time and personnel therefore have different influences when text data of different data sources are matched, and the matching degree of two pieces of text data can be determined directly from the matching results for the time and for some of the personnel, because energy transactions are normally infrequent and do not occur many times in one day. That is, when text data of different data sources are matched, the associated combinations of different words in the text data contribute differently to the match.
Therefore, in this embodiment, the text data set of each data source is used as the transaction database, and the Apriori data mining algorithm is used to obtain the association rules of all words in the text data set of each data source and the confidence of each association rule; the Apriori algorithm is a known technique and its specific process is not repeated here.
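To make "association rule" and "confidence" concrete, the sketch below computes the confidence of single-word rules x -> y directly from a toy transaction database. It is a deliberately simplified stand-in, not the full Apriori algorithm (no candidate generation or pruning over larger itemsets), and the minimum support of 0.5 is an assumed parameter.

```python
from itertools import permutations

def pairwise_rules(transactions, min_support=0.5):
    """Return {(antecedent, consequent): confidence} for frequent word pairs."""
    n = len(transactions)
    sets = [set(t) for t in transactions]
    vocab = sorted(set().union(*sets))
    rules = {}
    for x, y in permutations(vocab, 2):
        both = sum(1 for s in sets if x in s and y in s)
        if both / n >= min_support:
            # confidence(x -> y) = support(x and y) / support(x)
            rules[(x, y)] = both / sum(1 for s in sets if x in s)
    return rules

# Hypothetical transactions: each text record reduced to its set of words.
transactions = [
    ["date", "buyer", "coal"],
    ["date", "buyer"],
    ["date", "coal"],
    ["buyer", "coal"],
]
rules = pairwise_rules(transactions)
```

In this toy database every word pair co-occurs in two of four records, so all six directed rules pass the support threshold with confidence 2/3.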
For the association rules of the text data set of any one data source, the correlation and degree of association between different words differ, so the words in the text data form association rules with different confidences. When text data of different data sources are matched, different matching costs are set according to the confidence weights of the association rules: the larger the confidence weight of the association rules, the smaller the matching cost between the corresponding words. The reason for this setting is that the larger the confidence of an association rule, the more reliable the association between its words and the text data, and the more accurate the matching result. The transaction association rule matching cost between the x-th word and the p-th word in different data sources is calculated as follows:

$$D_{xp} = \frac{n_x + n_p}{\mu_{x,\min} \times \bar{\mu}_x + \mu_{p,\min} \times \bar{\mu}_p}$$

Where D_xp is the transaction association rule matching cost between the x-th word and the p-th word; n_x and n_p are the numbers of association rules containing the x-th and p-th words in their respective data sources; μ_x,min and μ_p,min are the minimum confidences among all association rules containing the x-th and p-th words in their respective data sources; and μ̄_x and μ̄_p are the mean confidences of the association rules containing the x-th and p-th words in their respective data sources.
The larger the confidences of the association rules containing the x-th and p-th words in their respective data sources, the larger the values of μ_x,min, μ_p,min, μ̄_x and μ̄_p, and the smaller the matching cost. The sum of n_x and n_p is used as the numerator to account for words whose association rules have high confidence but which occur too frequently in the text data, which would distort their importance in matching; the matching cost of words with different frequencies in the text data is thus constrained by both the number and the confidence of the association rules.
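As a worked example of the cost as worded in claim 6 (the ratio of the summed rule counts to the summed products of minimum and mean confidence), the sketch below takes, for each word, the list of confidences of the association rules that contain it. The function name and the sample confidences are illustrative.

```python
def matching_cost(conf_x, conf_p):
    """D_xp = (n_x + n_p) / (min_x * mean_x + min_p * mean_p)."""
    def fused(conf):
        # fusion result of claim 6: minimum confidence times mean confidence
        return min(conf) * (sum(conf) / len(conf))
    return (len(conf_x) + len(conf_p)) / (fused(conf_x) + fused(conf_p))

# Word x appears in two rules (confidences 0.5 and 1.0), word p in two rules (0.5 and 0.5).
d = matching_cost([0.5, 1.0], [0.5, 0.5])
print(d)  # 6.4

# Higher confidences lower the cost; more rules (a more frequent word) raise it.
print(matching_cost([1.0, 1.0], [1.0, 1.0]))  # 2.0
```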
According to the above steps, the homologous data association coefficient between any two pieces of text data in each data source is obtained. Each piece of text data is taken as a node, and the homologous data association coefficient between any two pieces of text data is taken as the weight of the edge between the two corresponding nodes, giving the label graph of each data source.
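One way to hold such a label graph in memory is a nested dictionary of edge weights, sketched below. The `coefficient` argument stands for whatever function computes the homologous data association coefficient; the word-overlap lambda used here is only a placeholder so the example runs on its own.

```python
from itertools import combinations

def build_label_graph(texts, coefficient):
    """Undirected weighted graph as {node: {neighbour: weight}}, one node per text."""
    graph = {i: {} for i in range(len(texts))}
    for a, b in combinations(range(len(texts)), 2):
        w = coefficient(texts[a], texts[b])
        graph[a][b] = w
        graph[b][a] = w
    return graph

# Placeholder coefficient (NOT the patent's L_ab): number of shared words.
texts = ["buy coal", "buy gas", "sell gas"]
overlap = lambda s, t: float(len(set(s.split()) & set(t.split())))
graph = build_label_graph(texts, overlap)
print(graph[0])  # {1: 1.0, 2: 0.0}
```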
Because the data recording forms and recording carriers differ, the text data sets of different data sources may contain special characters and abbreviations, and the concatenation of many such feature words may also cause the text to lose its semantics. As a result, when a large number of abbreviations and special-character combinations exist between two pieces of text data, the text similarity between the full name and the abbreviation of the same attribute value is low, so text similarity alone cannot measure the true similarity, and other information must be introduced to assist matching.
Specifically, the label graph of each data source is taken as input, and the Node2vec algorithm is used to obtain the spatial information of each node in the input label graph. Node2vec learns node representations by designing a flexible exploration strategy over the neighbors of the graph's nodes and finally maps the nodes of the graph into a low-dimensional continuous space, so that the spatial information of the graph can be recorded and stored; Node2vec is a known technique and its specific process is not repeated here. For text data in different data sources that match well, the spatial information of the corresponding nodes should be highly similar.
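The "flexible exploration" of neighbors in Node2vec is usually realised as a biased second-order random walk. The sketch below shows only that walk; the embedding itself is normally learned afterwards by feeding many such walks to a skip-gram model, which is omitted here. The return parameter p, the in-out parameter q and the toy graph are assumptions, and edge weights must be positive because `random.choices` rejects weights that sum to zero.

```python
import random

def biased_walk(graph, start, length, p=1.0, q=1.0, rng=random):
    """One second-order random walk over {node: {neighbour: weight}}."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = list(graph[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            weights = [graph[cur][n] for n in nbrs]
        else:
            prev = walk[-2]
            weights = []
            for n in nbrs:
                if n == prev:
                    bias = 1.0 / p  # step back to the previous node
                elif n in graph[prev]:
                    bias = 1.0      # stay within the previous node's neighbourhood
                else:
                    bias = 1.0 / q  # move further away from the previous node
                weights.append(graph[cur][n] * bias)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

toy_graph = {
    "a": {"b": 1.0, "c": 0.5},
    "b": {"a": 1.0, "c": 2.0},
    "c": {"a": 0.5, "b": 2.0},
}
walk = biased_walk(toy_graph, "a", 5, p=0.5, q=2.0, rng=random.Random(0))
```

Every consecutive pair in `walk` is an edge of the graph, and heavier edges are followed more often.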
Further, the transaction association rule matching cost between any two words in any two data sources is obtained, and each group of equal transaction association rule matching costs is treated as one matching cost level. When two pieces of text data from two data sources are matched, the more matching cost levels there are, the more unstable the matching costs of the different words in the two pieces of text data, and the higher the data matching cost.
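Matching cost levels are simply the distinct cost values, and two level sets can be compared with a Jaccard coefficient as in claim 8. The sketch below rounds costs before grouping so that floating-point near-duplicates fall into the same level; that rounding, and the convention of returning 1.0 for two empty sets, are assumptions of the sketch.

```python
def cost_levels(costs, ndigits=6):
    """Distinct cost values; equal costs collapse into one matching cost level."""
    return {round(c, ndigits) for c in costs}

def jaccard(g_a, g_k):
    """Jaccard coefficient of two sets (1.0 for two empty sets, by convention here)."""
    union = g_a | g_k
    return len(g_a & g_k) / len(union) if union else 1.0

g_a = cost_levels([6.4, 6.4, 3.2])  # two levels: 6.4 and 3.2
g_k = cost_levels([6.4, 1.6])       # two levels: 6.4 and 1.6
jac = jaccard(g_a, g_k)             # one shared level out of three distinct levels
```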
Based on the above analysis, a multi-section valuable matching index is constructed to characterize the cost of matching two pieces of text data in different data sources. The multi-section valuable matching index between the a-th text data and the k-th text data in two data sources is calculated as follows:

$$R_{ak} = dist(o_a, o_k) + \sum_{j=1}^{M_1} \sum_{h=1}^{M_2} \frac{L_{aj} + L_{kh}}{2} \times dist(o_j, o_h)$$

$$U_{ak} = \frac{1}{Jac(G_a, G_k) + \varepsilon} \times \sum_{x=1}^{m_2} \sum_{p=1}^{m_3} \frac{1}{D(a_x, k_p)}$$

$$V_{ak} = \frac{1}{R_{ak} \times U_{ak} + \varepsilon}$$

Where R_ak is the text feature spatial information matching degree between the a-th and k-th text data; o_a and o_k are the spatial information of the nodes corresponding to the a-th and k-th text data in their respective label graphs; M_1 and M_2 are the numbers of nodes directly connected to the nodes corresponding to the a-th and k-th text data in their label graphs; j and h index the j-th and h-th nodes directly connected to the nodes corresponding to the a-th and k-th text data; L_aj is the homologous data association coefficient between the node corresponding to the a-th text data and the j-th node; L_kh is the homologous data association coefficient between the node corresponding to the k-th text data and the h-th node; o_j and o_h are the spatial information of the j-th and h-th nodes; and dist(o_a, o_k) and dist(o_j, o_h) are the Euclidean distances between o_a and o_k and between o_j and o_h;
U_ak is the task multi-dimensional matching cost weight between the a-th and k-th text data; G_a and G_k are the sets of matching cost levels corresponding to the transaction association rule matching costs of all words in the a-th and k-th text data; Jac(G_a, G_k) is the Jaccard coefficient between the sets G_a and G_k; m_2 and m_3 are the numbers of words in the a-th and k-th text data; a_x and k_p are the x-th word of the a-th text data and the p-th word of the k-th text data; and D(a_x, k_p) is the transaction association rule matching cost between the words a_x and k_p;
V_ak is the multi-section valuable matching index between the a-th and k-th text data in the two data sources; ε is a preset parameter-adjusting factor that prevents the denominator from being 0, and its value is 0.001.
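The three quantities can be sketched from the wording of claims 7 to 9: the spatial matching degree is the Euclidean distance between the two nodes plus, over all pairs of their direct neighbors, the mean of the two edge coefficients times the distance between the neighbors; the cost weight is the sum of reciprocal word-pair costs divided by the Jaccard coefficient plus the adjusting factor; and the final value is the reciprocal of their product plus the adjusting factor. Input layouts and function names below are assumptions of the sketch.

```python
import math

EPS = 0.001  # preset parameter-adjusting factor given in the text

def spatial_match(o_a, o_k, nbrs_a, nbrs_k):
    """R_ak; nbrs_* are lists of (edge coefficient to the neighbour, neighbour embedding)."""
    total = math.dist(o_a, o_k)
    for l_aj, o_j in nbrs_a:
        for l_kh, o_h in nbrs_k:
            total += (l_aj + l_kh) / 2 * math.dist(o_j, o_h)
    return total

def cost_weight(pair_costs, jac):
    """U_ak = sum(1 / D) / (Jac + eps), over the costs of all word pairs."""
    return sum(1.0 / d for d in pair_costs) / (jac + EPS)

def multi_section_index(r_ak, u_ak):
    """V_ak = 1 / (R_ak * U_ak + eps)."""
    return 1.0 / (r_ak * u_ak + EPS)

# Toy embeddings: the two nodes are 5 apart, their single neighbours are 1 apart.
r_ak = spatial_match([0, 0], [3, 4], [(1.0, [0, 0])], [(1.0, [0, 1])])
u_ak = cost_weight([2.0, 4.0], jac=0.499)
v_ak = multi_section_index(r_ak, u_ak)
print(r_ak)  # 6.0
```

With these definitions the final value shrinks as either R_ak or U_ak grows, which is the negative correlation stated in claim 1.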
When the matching task for energy transaction data between two data sources is performed, the greater the matching degree between the a-th and k-th text data, the more likely they were generated by the same energy transaction, the more similar their spatial information o_a and o_k in the low-dimensional continuous space, and the more similar the connection structures of their corresponding nodes in the label graphs; hence the smaller the values of dist(o_a, o_k) and dist(o_j, o_h), and the smaller the value of R_ak. The more unstable the matching costs of the different words in the a-th and k-th text data, the lower the distribution similarity of their matching cost levels, and the smaller the value of Jac(G_a, G_k). The greater the confidence of the association rules of the words in the a-th and k-th text data, the more reliable the association between the words and the text data, the more accurate the matching result, and the smaller the transaction association rule matching costs between the words. The multi-section valuable matching index is negatively correlated with both the text feature spatial information matching degree and the task multi-dimensional matching cost weight.
So far, the multi-section valuable matching index between any two pieces of text data in different data sources has been obtained; it is used as the measurement mode in the twin network.
Step S003: based on the multi-section valuable matching indexes between text data, a twin network is used to obtain the data matching results between the different data sources in the energy transaction of enterprise A.
According to the above steps, the multi-section valuable matching index between any two pieces of text data in different data sources is obtained, and data matching is completed based on these indexes.
Specifically, two pieces of text data from two data sources are used as the input of the twin network, and the twin network outputs the matching degree between the two inputs. The output of the upstream network of the twin network is the high-dimensional features of the two pieces of text data in the same feature space, and the multi-section valuable matching index between the two pieces of text data is used as the metric distance in the downstream network. The twin network is a coupled architecture built from two sub-networks that share weights; its network structure is shown in fig. 3. The twin network is a known technique and its specific process is not repeated here.
For any piece of text data in each data source, the text data in the remaining data sources that has the maximum matching degree with it is taken as its matching data. The flow chart for obtaining the matching data is shown in fig. 4.
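The final selection step amounts to an arg-max over the matching degrees. The sketch below assumes the twin network's outputs have already been collected into a dictionary keyed by record pairs; the record identifiers and scores are made up.

```python
def select_matches(match_degree):
    """match_degree: {(record_a, record_k): degree}; keep the best record_k per record_a."""
    best = {}
    for (a, k), deg in match_degree.items():
        if a not in best or deg > best[a][1]:
            best[a] = (k, deg)
    return {a: k for a, (k, _) in best.items()}

# Hypothetical matching degrees between records of two data sources.
degrees = {
    ("a1", "k1"): 0.2, ("a1", "k2"): 0.9,
    ("a2", "k1"): 0.7, ("a2", "k2"): 0.1,
}
matches = select_matches(degrees)
print(matches)  # {'a1': 'k2', 'a2': 'k1'}
```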
This embodiment is completed.
In summary, according to the embodiment of the invention, through analyzing the similarity condition among text data in text data sets in a plurality of data sources in one-time energy transaction process of an enterprise, similar data sets of each text data are screened, so that word analysis is conveniently performed on text data with similar characteristics, the number of samples for word relation analysis is increased, and the quality of relation analysis is improved; meanwhile, the embodiment of the invention constructs the whole text meaning matching degree between any two text data based on semantic information features among different text data in homologous data, mines the overall difference information of the text data's character strings, words and the structural multivalent matrix, analyzes the data matching condition from the overall angle of the text data, then constructs the data dependency structure similarity between any two text data according to the single structure down-conversion sequence difference of all word combinations in different text data, analyzes the similarity between the data structures from the structural feature angle, combines the semantic correlation and the lexical structural similarity to construct the homologous data association coefficient, and simultaneously considers the semantic information and the structural feature of the text data, thus being capable of accurately evaluating the association strength between the text data in the same data source;
Meanwhile, the embodiment of the invention determines the transaction association rule matching cost for matching text data of different data sources based on the association rules mined from the text data of each data source. This cost takes into account that some words in the text data have association rules with high confidence but occur too frequently, which distorts their importance in matching; the matching cost of words with different frequencies is therefore constrained by both the number and the confidence of the association rules. By analyzing in depth how the rule relationships among words affect the weights used in text data matching, the basic word-rule architecture of the text data is reflected at a deeper level, which improves the reliability of the text data matching analysis. In addition, a label graph is constructed from the homologous data association coefficients of the text data, and the spatial information of each node in the label graph is mined; mapping the association information to a low-dimensional continuous space helps identify the similarity between nodes and thus assists in judging the data matching situation. Finally, based on the matching costs of single words and of whole texts, the multi-section valuable matching index between text data in different data sources is determined and used as the measurement mode of the downstream network in the twin network, which improves the accuracy of data matching.
It should be noted that the sequence of the embodiments of the present invention is only for description, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this specification. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and the same or similar parts of each embodiment are referred to each other, and each embodiment mainly describes differences from other embodiments.
The embodiments described above are only intended to illustrate the technical solutions of the present application, not to limit them. Modifications to the technical solutions described in the foregoing embodiments, or equivalent replacements of some of their technical features, that do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application shall all be included in the protection scope of the present application.

Claims (10)

1.基于神经网络的任务匹配数据安全撮合方法,其特征在于,该方法包括以下步骤:1. A task matching data security matching method based on a neural network, characterized in that the method comprises the following steps: 采集不同数据源的文本数据集;Collect text datasets from different data sources; 对于同一数据源的文本数据集中的每条文本数据,根据每条文本数据之间的相关关系以及词语之间的依存关系构建每条文本数据的结构多价矩阵;根据任意两条文本数据之间的语义信息差异获取任意两条文本数据之间的整条文本含义匹配度;根据任意两条文本数据之间的结构特征相似情况获取任意两条文本数据之间的数据依存结构相似度;基于整条文本含义匹配度以及数据依存结构相似度构建任意两条文本数据之间的同源数据关联系数;For each piece of text data in a text data set of the same data source, a structural multivalent matrix of each piece of text data is constructed according to the correlation between each piece of text data and the dependency between words; the whole text meaning matching between any two pieces of text data is obtained according to the semantic information difference between any two pieces of text data; the data dependency structure similarity between any two pieces of text data is obtained according to the similarity of structural features between any two pieces of text data; the homology data association coefficient between any two pieces of text data is constructed based on the whole text meaning matching and the data dependency structure similarity; 根据不同数据源中任意两个词语之间的关联规则差异获取不同数据源中任意两个词语之间的交易关联规则匹配代价;对于不同数据源之间的任意两条文本数据,根据任意两条文本数据之间的空间信息差异获取任意两条文本数据之间的文本特征空间信息匹配度;根据任意两条文本数据内词语之间的关联规则差异获取任意两条文本数据之间的任务多维匹配代价权重;基于文本特征空间信息匹配度以及任务多维匹配代价权重获取任意两条文本数据之间的多节有价匹配代价;According to the difference in association rules between any two words in different data sources, the transaction association rule matching cost between any two words in different data sources is obtained; for any two text data from different data sources, the text feature spatial information matching degree between any two text data is obtained according to the difference in spatial information between any two text data; according to the difference in association rules between words in any two text data, the task multi-dimensional matching cost weight between any two text data is 
obtained; based on the text feature spatial information matching degree and the task multi-dimensional matching cost weight, the multi-section valuable matching cost between any two text data is obtained; 采用孪生网络获取不同数据源中任意两条文本数据之间的匹配度,根据匹配度筛选每个数据源中每条文本数据的匹配数据;The twin network is used to obtain the matching degree between any two text data in different data sources, and the matching data of each text data in each data source are screened according to the matching degree; 所述根据每条文本数据之间的相关关系以及词语之间的依存关系构建每条文本数据的结构多价矩阵,包括:The step of constructing a structural multivalent matrix of each piece of text data according to the correlation between each piece of text data and the dependency between words includes: 将每条文本数据与同一数据源的文本数据集中剩余所有文本数据之间的相似性,作为阈值分割算法的输入,获取分割阈值;The similarity between each piece of text data and all the remaining text data in the text data set of the same data source is used as the input of the threshold segmentation algorithm to obtain the segmentation threshold; 将同一数据源的文本数据集中剩余所有文本数据中,所述相似性大于分割阈值的文本数据组成每条文本数据的相似数据集合;The text data whose similarity is greater than the segmentation threshold in all the remaining text data in the text data set of the same data source are used to form a similar data set for each text data; 采用依存关系分析算法获取每条文本数据中任意两个词语之间的依存关系;统计每条文本数据的相似数据集合中所有文本数据的所有依存关系种类及对应频率;A dependency analysis algorithm is used to obtain the dependency relationship between any two words in each text data; all dependency relationship types and corresponding frequencies of all text data in a similar data set of each text data are counted; 对于每条文本数据,将文本数据中第p个词语对应的第r种依存关系的频率作为文本数据的结构多价矩阵中第p行第r列的元素;For each piece of text data, the frequency of the rth dependency relationship corresponding to the pth word in the text data is used as the element of the pth row and rth column in the structural multivalent matrix of the text data; 所述根据任意两条文本数据之间的语义信息差异获取任意两条文本数据之间的整条文本含义匹配度,包括:The method of obtaining the whole text meaning matching degree between any two text data 
according to the semantic information difference between any two text data includes: 获取任意两条文本数据之间的差异程度;将任意两条文本数据中的词语数量的差值绝对值作为以自然常数为底数的指数函数的指数;计算任意两条文本数据的结构多价矩阵之间的度量距离;Obtain the degree of difference between any two text data; use the absolute value of the difference in the number of words in any two text data as the exponent of an exponential function with a natural constant as the base; calculate the metric distance between the structural multivalent matrices of any two text data; 计算所述指数函数的计算结果与所述度量距离的乘积,任意两条文本数据之间的整条文本含义匹配度分别与所述乘积及所述差异程度成正相关关系;Calculating the product of the calculation result of the exponential function and the metric distance, the whole text meaning matching degree between any two text data is positively correlated with the product and the difference degree respectively; 所述根据不同数据源中任意两个词语之间的关联规则差异获取不同数据源中任意两个词语之间的交易关联规则匹配代价,包括:The step of obtaining the transaction association rule matching cost between any two words in different data sources according to the association rule difference between any two words in different data sources includes: 采用关联规则挖掘算法获取每个数据源的文本数据集中每个词语的关联规则和关联规则置信度;An association rule mining algorithm is used to obtain the association rule and association rule confidence of each word in the text data set of each data source; 对于不同数据源中任意两个词语,获取词语所在数据源内的关联规则数量,将任意两个词语的所述关联规则数量的和值记为第一和值;For any two words in different data sources, the number of association rules in the data source where the words are located is obtained, and the sum of the number of association rules of the any two words is recorded as a first sum; 针对不同数据源中任意两个词语,获取词语所在数据源中关联规则置信度的最小值、均值,计算所述最小值与均值的融合结果;将任意两个词语的所述融合结果的和值记为第二和值;For any two words in different data sources, obtain the minimum value and the mean value of the confidence of the association rules in the data sources where the words are located, and calculate the fusion result of the minimum value and the mean value; record the sum of the fusion results of the any two words as the second sum value; 
不同数据源中任意两个词语之间的交易关联规则匹配代价与所述第一和值成正相关关系,与所述第二和值成负相关关系;The transaction association rule matching cost between any two words in different data sources is positively correlated with the first sum value and negatively correlated with the second sum value; 所述根据任意两条文本数据内词语之间的关联规则差异获取任意两条文本数据之间的任务多维匹配代价权重,包括:The step of obtaining the task multi-dimensional matching cost weight between any two text data according to the difference in association rules between words in any two text data includes: 获取任意两条文本数据中,所有任意两个词语之间的交易关联规则匹配代价的倒数的和值;Obtain the sum of the reciprocals of the transaction association rule matching costs between any two words in any two text data; 获取任意两条文本数据中所有任意两个词语的交易关联规则匹配代价对应的匹配代价级的集合;其中,将相同的交易关联规则匹配代价作为同一个匹配代价级;Obtaining a set of matching cost levels corresponding to the transaction association rule matching costs of all two arbitrary words in any two text data; wherein the same transaction association rule matching costs are regarded as the same matching cost level; 计算任意两条文本数据的所述集合之间的相似程度,任意两条文本数据之间的任务多维匹配代价权重与所述倒数的和值成正相关关系,与所述相似程度成负相关关系;Calculating the similarity between the sets of any two text data, the multi-dimensional matching cost weight of the task between the any two text data is positively correlated with the sum of the reciprocals, and negatively correlated with the similarity; 所述任意两条文本数据之间的多节有价匹配代价分别与任意两条文本数据之间的文本特征空间信息匹配度及任务多维匹配代价权重成负相关关系。The multi-section valuable matching cost between any two text data is negatively correlated with the text feature space information matching degree between any two text data and the task multi-dimensional matching cost weight. 2.如权利要求1所述的基于神经网络的任务匹配数据安全撮合方法,其特征在于,所述根据每条文本数据之间的相关关系以及词语之间的依存关系构建每条文本数据的结构多价矩阵,包括:2. 
The task matching data security matching method based on neural network as claimed in claim 1 is characterized in that the structural multivalent matrix of each text data is constructed according to the correlation between each text data and the dependency between words, including: 将每条文本数据与同一数据源的文本数据集中剩余所有文本数据之间的杰卡德系数,作为大津阈值算法的输入,获取分割阈值;The Jaccard coefficient between each text data and all the remaining text data in the text data set of the same data source is used as the input of the Otsu threshold algorithm to obtain the segmentation threshold; 将同一数据源的文本数据集中剩余所有文本数据中,杰卡德系数大于分割阈值的文本数据组成每条文本数据的相似数据集合;The text data with Jaccard coefficient greater than the segmentation threshold in all the remaining text data in the text data set of the same data source are combined into a similar data set for each text data; 采用依存句法获取每条文本数据中任意两个词语之间的依存关系;统计每条文本数据的相似数据集合中所有文本数据的所有依存关系种类及对应频率;Dependency syntax is used to obtain the dependency relationship between any two words in each text data; all dependency relationship types and corresponding frequencies of all text data in the similar data set of each text data are counted; 对于每条文本数据,将文本数据中第p个词语对应的第r种依存关系的频率作为文本数据的结构多价矩阵中第p行第r列的元素。For each piece of text data, the frequency of the rth dependency relationship corresponding to the pth word in the text data is used as the element of the pth row and rth column in the structural multivalent matrix of the text data. 3.如权利要求2所述的基于神经网络的任务匹配数据安全撮合方法,其特征在于,所述根据任意两条文本数据之间的语义信息差异获取任意两条文本数据之间的整条文本含义匹配度,包括:3. 
The task matching data security matching method based on neural network according to claim 2 is characterized in that the step of obtaining the whole text meaning matching degree between any two text data according to the semantic information difference between any two text data comprises: 获取任意两条文本数据之间的ED编辑距离;将任意两条文本数据中的词语数量的差值绝对值作为以自然常数为底数的指数函数的指数;计算任意两条文本数据的结构多价矩阵之间的DTW距离;Obtain the ED edit distance between any two text data; use the absolute value of the difference in the number of words in any two text data as the exponent of an exponential function with a natural constant as the base; calculate the DTW distance between the structural multivalent matrices of any two text data; 计算所述指数函数的计算结果与所述DTW距离的乘积,将所述乘积与所述ED编辑距离的和值作为任意两条文本数据之间的整条文本含义匹配度。The product of the calculation result of the exponential function and the DTW distance is calculated, and the sum of the product and the ED edit distance is used as the whole text meaning matching degree between any two text data. 4.如权利要求2所述的基于神经网络的任务匹配数据安全撮合方法,其特征在于,所述根据任意两条文本数据之间的结构特征相似情况获取任意两条文本数据之间的数据依存结构相似度,包括:4. 
The task matching data security matching method based on neural network according to claim 2 is characterized in that the data dependency structure similarity between any two text data is obtained according to the similarity of the structural features between any two text data, including: 对于每条文本数据中的每个词语,采用ELMo模型获取词语的词向量;将词语的词向量作为单一依存结构降频序列的第一个元素,将词语所在文本数据的结构多价矩阵中对应行向量内的依存关系频率,按照降序顺序作为单一依存结构降频序列的第二个至最后一个元素;For each word in each text data, the ELMo model is used to obtain the word vector of the word; the word vector of the word is used as the first element of the single dependency structure frequency reduction sequence, and the dependency frequency in the corresponding row vector in the structural multivalent matrix of the text data where the word is located is used as the second to the last element of the single dependency structure frequency reduction sequence in descending order; 将任意两条文本数据中的任意两个词语作为一组词语对,计算所述词语对的单一依存结构降频序列之间的皮尔逊相关系数,计算所述任意两条文本数据中的所有词语对的所述皮尔逊相关系数的均值;Taking any two words in any two text data as a group of word pairs, calculating the Pearson correlation coefficient between the single dependency structure frequency reduction sequences of the word pairs, and calculating the average of the Pearson correlation coefficients of all word pairs in the any two text data; 获取任意两条文本数据中同时存在的词语数量;将所述词语数量与所述均值的乘积作为任意两条文本数据之间的数据依存结构相似度。The number of words that exist simultaneously in any two text data is obtained; and the product of the number of words and the mean is used as the data dependency structure similarity between any two text data. 5.如权利要求1所述的基于神经网络的任务匹配数据安全撮合方法,其特征在于,所述基于整条文本含义匹配度以及数据依存结构相似度构建任意两条文本数据之间的同源数据关联系数,包括:5. 
The task matching data security matching method based on neural network as claimed in claim 1 is characterized in that the homologous data correlation coefficient between any two text data is constructed based on the matching degree of the meaning of the entire text and the similarity of the data dependency structure, including: 以自然常数为底数、以任意两条文本数据之间的整条文本含义匹配度为指数构建第一指数函数;以自然常数为底数、以任意两条文本数据之间的数据依存结构相似度为指数构建第二指数函数;将第二指数函数的计算结果与第一指数函数的计算结果的比值作为任意两条文本数据之间的同源数据关联系数。A first exponential function is constructed with a natural constant as the base and the degree of matching of the meaning of the entire text between any two text data as the exponent; a second exponential function is constructed with a natural constant as the base and the similarity of the data dependency structure between any two text data as the exponent; and the ratio of the calculation result of the second exponential function to the calculation result of the first exponential function is used as the homologous data correlation coefficient between any two text data. 6.如权利要求1所述的基于神经网络的任务匹配数据安全撮合方法,其特征在于,所述根据不同数据源中任意两个词语之间的关联规则差异获取不同数据源中任意两个词语之间的交易关联规则匹配代价,包括:6. 
The task matching data security matching method based on neural network according to claim 1 is characterized in that the step of obtaining the transaction association rule matching cost between any two words in different data sources according to the association rule difference between any two words in different data sources comprises: 采用Apriori算法获取每个数据源的文本数据集中每个词语的关联规则和关联规则置信度;The Apriori algorithm is used to obtain the association rules and association rule confidence of each word in the text data set of each data source; 对于不同数据源中任意两个词语,获取词语所在数据源内的关联规则数量,将任意两个词语的所述关联规则数量的和值记为第一和值;For any two words in different data sources, the number of association rules in the data source where the words are located is obtained, and the sum of the number of association rules of the any two words is recorded as a first sum; 针对不同数据源中任意两个词语,获取词语所在数据源中关联规则置信度的最小值、均值,计算所述最小值与均值的融合结果;将任意两个词语的所述融合结果的和值记为第二和值;其中所述融合结果为所述最小值与均值的乘积;For any two words in different data sources, obtain the minimum value and the mean value of the confidence of the association rules in the data sources where the words are located, and calculate the fusion result of the minimum value and the mean value; record the sum of the fusion results of any two words as the second sum value; wherein the fusion result is the product of the minimum value and the mean value; 将第一和值与第二和值的比值作为不同数据源中任意两个词语之间的交易关联规则匹配代价。The ratio of the first sum value to the second sum value is used as the transaction association rule matching cost between any two words in different data sources. 7.如权利要求1所述的基于神经网络的任务匹配数据安全撮合方法,其特征在于,所述根据任意两条文本数据之间的空间信息差异获取任意两条文本数据之间的文本特征空间信息匹配度,包括:7. 
The task matching data security matching method based on neural network according to claim 1 is characterized in that the text feature spatial information matching degree between any two text data is obtained according to the spatial information difference between any two text data, including: 将每个数据源中的每条文本数据作为各节点,将每个数据源中任意两条文本数据之间的同源数据关联系数作为对应节点之间的边权重,根据节点以及节点之间的边权重构建每个数据源的标签图;将每个数据源的标签图采用Node2vec算法获取各节点的空间信息;计算任意两条文本数据所在标签图中对应节点的空间信息之间的欧式距离,记为第一欧式距离;Each text data in each data source is taken as each node, and the homologous data correlation coefficient between any two text data in each data source is taken as the edge weight between the corresponding nodes. The label graph of each data source is constructed according to the nodes and the edge weights between the nodes; the label graph of each data source is used to obtain the spatial information of each node using the Node2vec algorithm; the Euclidean distance between the spatial information of the corresponding nodes in the label graph where any two text data are located is calculated, which is recorded as the first Euclidean distance; 对于与所述任意两条文本数据所在标签图中对应节点的直接相连的任意两个节点组成的节点对,计算节点对的同源数据关联系数均值;计算节点对的空间信息之间的欧式距离,记为第二欧式距离;For a node pair consisting of any two nodes directly connected to the corresponding node in the label graph where the two text data are located, calculate the mean value of the homologous data association coefficient of the node pair; calculate the Euclidean distance between the spatial information of the node pair, recorded as the second Euclidean distance; 计算所有所述节点对的所述均值与所述第二欧式距离的乘积的和值,记为第三和值;将所述第三和值与所述第一欧式距离之和作为任意两条文本数据之间的文本特征空间信息匹配度。The sum of the product of the mean value of all the node pairs and the second Euclidean distance is calculated and recorded as the third sum; the sum of the third sum and the first Euclidean distance is used as the text feature space information matching degree between any two text data. 
8. The task matching data security matching method based on neural network according to claim 1, characterized in that obtaining the task multi-dimensional matching cost weight between any two pieces of text data according to the association rule difference between the words in the two pieces of text data comprises:
obtaining, over all pairs of words in the two pieces of text data, the sum of the reciprocals of the transaction association rule matching costs, recorded as a fourth sum;
obtaining, for the two pieces of text data, the set of matching cost levels corresponding to the transaction association rule matching costs of all pairs of words, wherein identical transaction association rule matching costs are treated as the same matching cost level;
calculating the Jaccard coefficient between the sets of the two pieces of text data, and multiplying the reciprocal of the sum of the Jaccard coefficient and a preset parameter adjustment factor by the fourth sum to obtain the task multi-dimensional matching cost weight between the two pieces of text data.

9. The task matching data security matching method based on neural network according to claim 1, characterized in that obtaining the multi-section valuable matching cost between any two pieces of text data based on the text feature spatial information matching degree and the task multi-dimensional matching cost weight comprises:
for any two pieces of text data, calculating the product of the text feature spatial information matching degree and the task multi-dimensional matching cost weight between them, recorded as a first product; taking the reciprocal of the sum of the first product and the preset parameter adjustment factor as the multi-section valuable matching cost between the two pieces of text data.

10. The task matching data security matching method based on neural network according to claim 1, characterized in that screening the matching data of each piece of text data in each data source according to the matching degree comprises:
for each piece of text data in each data source, taking the text data that has the maximum matching degree with it, among all text data in all remaining data sources, as the matching data of that piece of text data.
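The calculations in claims 8 to 10 can be sketched in Python as below. This is only an illustration under assumptions: the per-pair matching costs, the two sets of matching cost levels, and the matching degrees are taken as given inputs, because this section does not spell out how each set is assembled or how the final matching degree is computed. The default value 1.0 for the preset parameter adjustment factor `eps` and all function names are hypothetical.

```python
def jaccard(levels_a, levels_b):
    """Jaccard coefficient of two collections of matching cost levels."""
    a, b = set(levels_a), set(levels_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cost_weight(pair_costs, levels_a, levels_b, eps=1.0):
    """Claim 8 sketch: task multi-dimensional matching cost weight.

    pair_costs: positive matching costs of all word pairs of the two texts.
    levels_a, levels_b: matching cost levels associated with each text
    (how each set is built is an assumption left to the caller).
    eps: preset parameter adjustment factor (assumed value).
    """
    # Fourth sum: sum of reciprocals of the word-pair matching costs.
    fourth_sum = sum(1.0 / c for c in pair_costs)
    return fourth_sum / (jaccard(levels_a, levels_b) + eps)

def multi_section_cost(spatial_degree, weight, eps=1.0):
    """Claim 9 sketch: reciprocal of (first product + adjustment factor)."""
    return 1.0 / (spatial_degree * weight + eps)

def best_match(degrees):
    """Claim 10 sketch: pick the candidate with the maximum matching degree.

    degrees: dict mapping candidate text ids from the other data sources to
    their matching degree with the text under consideration.
    """
    return max(degrees, key=degrees.get)
```

A call such as `best_match({"b1": 0.2, "b2": 0.9})` selects `"b2"`; ties are resolved by dictionary order, which the claims do not address.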
CN202410500108.0A 2024-04-24 2024-04-24 Task matching data security matching method based on neural network Active CN118349679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410500108.0A CN118349679B (en) 2024-04-24 2024-04-24 Task matching data security matching method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410500108.0A CN118349679B (en) 2024-04-24 2024-04-24 Task matching data security matching method based on neural network

Publications (2)

Publication Number Publication Date
CN118349679A (en) 2024-07-16
CN118349679B (en) 2025-03-25

Family

ID=91818146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410500108.0A Active CN118349679B (en) 2024-04-24 2024-04-24 Task matching data security matching method based on neural network

Country Status (1)

Country Link
CN (1) CN118349679B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868656A (en) * 2021-09-30 2021-12-31 中国电子科技集团公司第十五研究所 A behavioral pattern-based homology determination method for APT events
CN115545001A (en) * 2022-11-29 2022-12-30 支付宝(杭州)信息技术有限公司 Text matching method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109597994B (en) * 2018-12-04 2023-06-06 挖财网络技术有限公司 Short text problem semantic matching method and system
CN109740126B (en) * 2019-01-04 2023-11-21 平安科技(深圳)有限公司 Text matching method and device, storage medium and computer equipment
CN113011155B (en) * 2021-03-16 2023-09-05 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for text matching
US12111815B2 (en) * 2021-05-07 2024-10-08 Sightly Enterprises, Inc. Correlating event data across multiple data streams to identify compatible distributed data files with which to integrate data at various networked computing devices
CN116720485A (en) * 2023-05-25 2023-09-08 上证所信息网络有限公司 Multi-source text information fusion system and method

Also Published As

Publication number Publication date
CN118349679A (en) 2024-07-16

Similar Documents

Publication Publication Date Title
CN111813950B (en) Building field knowledge graph construction method based on neural network self-adaptive optimization tuning
CN113779264B (en) Transaction recommendation method based on patent supply and demand knowledge graph
WO2020143409A1 (en) Method and device for predicting business indicators
Pei et al. Concept factorization with adaptive neighbors for document clustering
US20120078913A1 (en) System and method for schema matching
CN110910243A (en) Property right transaction method based on reconfigurable big data knowledge map technology
CN116109195B (en) Performance evaluation method and system based on graph convolution neural network
CN110852856A (en) Invoice false invoice identification method based on dynamic network representation
CN108038720A (en) A kind of ad click rate Forecasting Methodology based on Factorization machine
CN114119057A (en) User portrait model construction system
CN115547466B (en) Medical institution registration and review system and method based on big data
CN113822018B (en) Entity Relation Joint Extraction Method
CN112883066B (en) Method for estimating multi-dimensional range query cardinality on database
CN118735704A (en) Financial data collection system and method based on web crawler technology
Irshad et al. Inferential properties with a novel two parameter Poisson generalized Lindley distribution with regression and application to INAR (1) process
CN118691210B (en) Supply chain material data management method and system
CN118349679B (en) Task matching data security matching method based on neural network
CN113779933A (en) Commodity encoding method, electronic device and computer-readable storage medium
CN115098694B (en) Customs data classification method, device and storage medium based on knowledge graph representation
CN117454188A (en) Multi-strategy data governance rule adaptation method and system based on standard data elements
Hewapathirana et al. A systematic investigation on the effectiveness of the tabbert model for credit card fraud detection
Cao et al. Financial network analysis using polymodel theory
CN114817566A (en) Emotion reason pair extraction method based on emotion embedding
CN115063165A (en) Manufacturing equipment price prediction method based on feature screening and attention mechanism
CN114510943A (en) An Incremental Named Entity Recognition Method Based on Pseudo-Sample Replay

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant