
CN115329850A - Information comparison method and device, electronic equipment and storage medium - Google Patents

Information comparison method and device, electronic equipment and storage medium

Info

Publication number
CN115329850A
CN115329850A
Authority
CN
China
Prior art keywords
text
features
item
information
comprehensive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210920358.0A
Other languages
Chinese (zh)
Inventor
武晗
祝恒书
熊辉
刘浩
秦川
刘淇
陈恩红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210920358.0A priority Critical patent/CN115329850A/en
Publication of CN115329850A publication Critical patent/CN115329850A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides an information comparison method, an information comparison device, electronic equipment and a storage medium, and relates to the technical field of artificial intelligence, in particular to the technical field of intelligent search. The specific implementation scheme is as follows: extracting a plurality of items of text information from the text content of a reference file, and extracting metadata features based on the metadata of the reference file; respectively extracting text features of each item of text information, and fusing the text features to obtain comprehensive text features; and determining the similarity between the reference file and a file to be compared based on the metadata features and the comprehensive text features. In the embodiment of the disclosure, a plurality of items of text information are extracted from the reference file and their text features are extracted separately, which helps to capture the thought features independently expressed by each item of text information. The comprehensive text features obtained by combining the plurality of text features can represent the overall text features of the file. Further combining the metadata features, the reference file can be characterized from multiple dimensions, which further improves the accuracy of the file similarity.

Description

Information comparison method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to the field of intelligent search technology.
Background
The accumulation of files over the years, coupled with new text that is continually being produced, results in a huge number of files. Although a database can be used to manage various files, and relevant files can be retrieved by query methods such as fuzzy query, how to screen out a desired file from a massive collection of files remains a problem to be solved.
Disclosure of Invention
The disclosure provides an information comparison method, an information comparison device, electronic equipment and a storage medium.
According to a first aspect of the present disclosure, there is provided an information comparison method, including:
extracting a plurality of items of text information from the text content of the reference file, and extracting metadata features based on the metadata of the reference file;
respectively extracting text features of each item of text information, and performing fusion processing on the text features of each item of text information to obtain comprehensive text features;
and determining the similarity between the reference file and the file to be compared based on the metadata characteristics and the comprehensive text characteristics.
According to a second aspect of the present disclosure, there is provided an information comparing apparatus, including:
the acquisition module is used for extracting a plurality of items of text information from the text content of the reference file and extracting metadata characteristics based on the metadata of the reference file;
the extraction module is used for respectively extracting text features of each item of text information and fusing the text features of each item of text information to obtain comprehensive text features;
and the comparison module is used for determining the similarity between the reference file and the file to be compared based on the metadata characteristics and the comprehensive text characteristics.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the aforementioned first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the aforementioned first aspect.
According to the scheme provided by the embodiment, a plurality of items of text information are extracted from the reference file, and the text characteristics are respectively extracted from each item of text information, so that the thought characteristics independently expressed by each item of text information can be independently refined. And combining a plurality of text features to obtain a comprehensive text feature which can represent the whole text feature of the reference file. The metadata characteristics of the reference file are further combined, the characteristic description of the reference file can be realized from multiple dimensions, and the accuracy of the file similarity can be further improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of an information comparison method according to an embodiment of the disclosure;
FIG. 2 is another schematic flow chart diagram of an information comparison method according to another embodiment of the present disclosure;
fig. 3 is a schematic diagram of a model structure for extracting text features in an information comparison method according to an embodiment of the disclosure;
FIG. 4 is another schematic flow chart of an information comparison method according to another embodiment of the present disclosure;
FIG. 5 is a schematic diagram of another model structure for extracting text features in an information comparison method according to another embodiment of the disclosure;
FIG. 6 is another schematic flow chart diagram illustrating an information comparison method according to another embodiment of the present disclosure;
FIG. 7 is another schematic flow chart diagram illustrating an information comparison method according to another embodiment of the present disclosure;
fig. 8 is a schematic diagram of a model structure for extracting comprehensive text features in an information comparison method according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of another model structure for extracting comprehensive text features in an information comparison method according to another embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a model structure for extracting comprehensive text features and metadata features in an information comparison method according to another embodiment of the present disclosure;
FIG. 11 is a schematic diagram of an exemplary configuration of an information comparing apparatus according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of another structure of an information comparing apparatus according to another embodiment of the disclosure;
fig. 13 is a block diagram of an electronic device for implementing the information comparison method according to the embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, as will be recognized by those of ordinary skill in the art, various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terms "first" and "second," and the like in the description embodiments and in the claims of the present disclosure, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises" and "comprising," as well as any variations thereof, are intended to cover a non-exclusive inclusion, such that a list of steps or elements is included. A method, system, article, or apparatus is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, system, article, or apparatus.
The embodiment of the present disclosure provides an information comparison method, which is a schematic flow chart shown in fig. 1 and includes the following steps:
s101, extracting a plurality of items of text information from the text content of the reference file, and extracting metadata characteristics based on the metadata of the reference file.
And S102, respectively extracting text features of each item of text information, and fusing the text features of each item of text information to obtain comprehensive text features.
Taking a patent application as an example, the plurality of items of text information may include at least two of the following: the abstract, the title, the claims and a description of the technical effects. The abstract reflects the core inventive points of the patent application document and summarizes the technical solution; the title of the patent application is a summary of the subject matter to which the patent application relates; the claims express the core solution of the patent application in a concise and direct manner; and the description of the technical effects explains in depth the technical effects brought about by the implementation. These items of text information reflect the core content of the patent application file, so they can be used to extract the core idea of the patent application file, allowing patents similar to the patent application file to be retrieved accurately.
In addition, other types of documents, such as papers, periodicals, magazines and novels, typically have abstracts and titles. For such files, the plurality of items of text information may be the abstract and/or the title. Alternatively, the items of text information may be selected according to actual requirements, for example according to the template requirements of a periodical, which is not limited in the embodiments of the present disclosure.
S103, determining the similarity between the reference file and the file to be compared based on the metadata characteristic and the comprehensive text characteristic.
In the embodiment of the disclosure, a plurality of items of text information are extracted from the reference file, and text features are extracted separately from each item, which helps to refine the thought features independently expressed by each item of text information. The text features of the items of text information are then combined to obtain the comprehensive text features of the reference file, so that the comprehensive text features can represent the overall text features of the reference file while including the thought features independently expressed by each item of text information. The embodiment of the disclosure further combines the metadata features of the reference file, so that the reference file can be characterized from multiple dimensions. When the file to be compared is compared, the similarity between the reference file and the file to be compared can be accurately described on the basis of this multi-dimensional feature description.
In some embodiments, in order to better extract the text features of the reference file, when the text features of each item of text information are respectively extracted, the embodiment of the disclosure may perform operations as shown in fig. 2 for each item of text information, including:
s201, extracting initial text features of the text information by adopting a first language model corresponding to the text information.
In the disclosed embodiment, the language model may be a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model. A RoBERTa (A Robustly Optimized BERT Pretraining Approach) model may also be selected, as may NEZHA, a BERT-based Chinese pre-trained language model.
S202, inputting the initial text characteristics of the text information into a first full connection layer corresponding to the text information to obtain the text characteristics of the text information output by the first full connection layer.
After the plurality of items of text information are extracted from the reference file, each item of text information is a text sequence. In implementation, punctuation may either be kept in the obtained text sequence or be removed. For example, for literary documents that may need to express emotion, the punctuation marks are kept in the text sequence; by contrast, for patent application documents that emphasize the technical solution, the punctuation marks can be removed.
After the text sequence of each text message is obtained, each character in the text sequence may be converted into a word vector, resulting in a word vector representation of the text sequence. In implementation, the text sequence can be converted into word vector representation by word embedding technology, such as FastText (fast text classifier), word2vec (word-to-vector), and the like.
Then, the word vector representations of the text sequences are respectively input into their respective BERTs (namely the first language models) to extract the initial text features of each item of text information. As shown in fig. 3, assuming that the plurality of items of text information extracted from a reference file include text item 1, text item 2 and text item 3, the word vector representation of each text item is input into its corresponding BERT model to obtain an initial text feature, and the text feature of each item of text information is then output after processing by an FC (fully connected) layer (i.e., the first fully connected layer).
In the embodiment of the disclosure, each item of text information corresponds to its own first language model, and these language models are used in parallel to extract the initial text features of each item of text information, which improves the extraction efficiency of the initial text features. The initial text features are then input into the fully connected layers to obtain the text features. Because each node of a fully connected layer is connected to all nodes of the previous layer, the features extracted by the language model can be integrated. When the initial text features extracted by the language model are processed by the fully connected layer, the fully connected layer can act as a classifier, and it also maps the distributed feature representation learned by the language model into another feature space, so that the features extracted by the language model are further refined into features that distinguish the file from other files while keeping similar features consistent. Therefore, the embodiment of the disclosure can extract accurate features for text comparison based on the language model and the fully connected layer, thereby improving the accuracy of file comparison.
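As an illustration of this design, the following is a minimal sketch (in PyTorch with the Hugging Face transformers library) of one language model plus one first fully connected layer per item of text information; the checkpoint name, the 256-dimensional feature size and the mean pooling over token states are illustrative assumptions, not details fixed by the disclosure:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ItemEncoder(nn.Module):
    """One language model plus one first fully connected layer for one item."""
    def __init__(self, model_name="bert-base-chinese", feat_dim=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)             # first language model
        self.fc = nn.Linear(self.bert.config.hidden_size, feat_dim)   # first FC layer

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1).float()
        # Assumed pooling: average the token states into one initial text feature.
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        return self.fc(pooled)                                        # text feature of this item

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
# One independent encoder per item, applied in parallel to title/abstract/claims.
encoders = nn.ModuleDict({k: ItemEncoder() for k in ("title", "abstract", "claims")})
batch = tokenizer(["一种信息对比方法"], return_tensors="pt", padding=True)
title_feature = encoders["title"](batch["input_ids"], batch["attention_mask"])
```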
In some embodiments, some text items (i.e., items of text information) may be more complex. For example, the abstract of a patent application document is a single paragraph and the title is likewise a single paragraph, whereas the claims contain a plurality of claims; in the disclosed embodiment each claim is regarded as one paragraph, so the content of the claims is clearly more complex. In view of this, text information including a plurality of paragraphs may be defined as a complex text item in the embodiments of the present disclosure. Since each paragraph of a complex text item expresses its own content individually (as in the claims, where each claim expresses one content on its own), each paragraph of the complex text item can be processed individually so as to obtain a comprehensive feature expression of the complex text item. In the embodiment of the present disclosure, the text features of text information including a single paragraph may be extracted in the manner described in figs. 2 and 3. For a complex text item, in addition to extraction in the manner of figs. 2 and 3, the embodiment of the present disclosure further provides extraction of text features in the manner shown in fig. 4, including the following steps:
s401, based on the second language model corresponding to the complex text item, respectively extracting sub-text features of each text segment in the complex text item.
S402, performing dimension-reduction processing on the sub-text features of each text segment to obtain the dimension-reduction features of the complex text item.
And S403, inputting the dimension reduction features of the complex text items into a second full-connection layer corresponding to the complex text items to obtain the text features of the complex text items output by the second full-connection layer.
On the basis of fig. 3, the operation of extracting the text features of a complex text item is shown in fig. 5. In fig. 5, it is assumed that text item 3 is a complex text item; each paragraph in the complex text item corresponds to a text sequence, and each text sequence is converted into a word vector representation through the word embedding technique. The word vector representations of the complex text item form a multi-dimensional matrix, which is input into a BERT model to obtain the initial text features of each paragraph.
Because a complex text item yields more initial text features, in order to avoid over-weighting the text features of the complex text item and neglecting the text features of text items with fewer paragraphs, the embodiment of the disclosure applies dimension-reduction processing to the text features of the multiple paragraphs of the complex text item. In addition, dimension reduction can further extract the deep-level features of the text features of the complex text item, thereby improving the accuracy of the comparison result between the reference file and the file to be compared.
In the embodiment of the present disclosure, an AVG (Word Averaging) model shown in fig. 5 may be selected as a way of reducing the dimension, and an average value of the initial text features of each text segment is obtained, so as to implement feature dimension reduction.
In other embodiments, linear dimension reduction methods such as subset selection, principal component analysis, and the like can be selected for dimension reduction.
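A sketch of this complex-text-item path under the same assumptions: each paragraph (e.g., each claim) is encoded separately by the second language model, the per-paragraph sub-text features are averaged (the AVG dimension-reduction layer), and the result passes through the second fully connected layer. Pooling choices and dimensions are again illustrative:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ComplexItemEncoder(nn.Module):
    def __init__(self, model_name="bert-base-chinese", feat_dim=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)             # second language model
        self.fc = nn.Linear(self.bert.config.hidden_size, feat_dim)   # second FC layer

    def forward(self, para_input_ids, para_attention_mask):
        # One row per paragraph (e.g., per claim): shape (num_paragraphs, seq_len).
        out = self.bert(input_ids=para_input_ids, attention_mask=para_attention_mask)
        mask = para_attention_mask.unsqueeze(-1).float()
        sub_feats = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # per-paragraph features
        reduced = sub_feats.mean(dim=0)   # AVG dimension reduction over paragraphs
        return self.fc(reduced)           # text feature of the complex text item
```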
In the above description, how to extract text features of each item of text information is introduced, and how to perform fusion processing on the text features of each item of text information to obtain comprehensive text features is explained below.
In one possible implementation, the text features of each item of text information may be processed based on a hierarchical Attention (Attention) mechanism to obtain a comprehensive text feature. Therefore, the key features can be focused on based on the hierarchical attention mechanism, so that the obtained comprehensive text features are more suitable for information comparison, and the accuracy of information comparison is improved.
In another embodiment, the text features of each item of text information can be processed based on a hierarchical attention mechanism, so as to obtain the semantic features of the reference file. And then, splicing the semantic features and the text features of at least one item of text information to obtain comprehensive text features. In the disclosed embodiment, the hierarchical attention mechanism can help to extract deep semantic features of the reference document. The semantic features and the text features of other text information are spliced, so that the comprehensive text features can express the semantics of the reference file and can also synthesize other text features, the features of the reference file can be comprehensively described from multiple dimensions, and the text comparison accuracy is improved.
In some embodiments, processing the text features of the text messages based on the hierarchical attention mechanism to obtain the comprehensive text features may be implemented as shown in fig. 6:
S601, determining a complex text item comprising a plurality of paragraphs among the plurality of items of text information, and determining the text information other than the complex text item as simple text information;
S602, determining the key features, value features and query features of the hierarchical attention mechanism based on the complex text item, wherein the sub-text features of each paragraph in the text features of the complex text item serve as the key features and the value features, and the text features of the simple text information serve as the query features;
S603, determining the optimized text features of the complex text item based on the key features, the value features and the query features;
and S604, splicing the optimized text features of the complex text item and the text features of the simple text information to obtain the comprehensive text features.
Specifically, the sub-text features of each paragraph in the text features of the complex text item are taken as the key features and value features of the hierarchical attention mechanism, and the text feature of each item of simple text information is taken in turn as the query feature, so as to obtain the sub-semantic feature corresponding to that simple text information (which can also be understood as a sub-feature of the semantic features); the optimized text features of the complex text item are then determined based on these sub-semantic features.
Taking the patent application document as an example, the extracted items of text information include the title, the abstract and the claims. The text feature of the title, denoted $h^{t}$, serves as the Query vector (i.e., the query feature), and the sub-text features of the individual claims within the text features of the claims, denoted $h^{c}_{1}, \dots, h^{c}_{s}$, serve as the Key vectors (i.e., key features) and the Value vectors (i.e., value features). Here $h^{c}_{i}$ is the sub-text feature of claim $i$, and $s$ is the total number of claims; each claim may be viewed as one paragraph.
The attention weights are computed from the Query, Key and Value vectors, and the sub-text features of the claims are then summed with these weights. The computation is shown in formula (1):

$$u^{t} = \sum_{i=1}^{s} \alpha^{t}_{i}\, h^{c}_{i}, \qquad \alpha^{t}_{i} = \frac{\exp\big(f(h^{t}, h^{c}_{i}) / \sqrt{d_k}\big)}{\sum_{j=1}^{s} \exp\big(f(h^{t}, h^{c}_{j}) / \sqrt{d_k}\big)} \qquad (1)$$

where $\alpha^{t}_{i}$ is the attention weight of $h^{c}_{i}$; $f(\cdot,\cdot)$ measures the similarity between the title feature $h^{t}$ and the claim feature $h^{c}_{i}$ and may be computed by dot product, cosine similarity or the like; and the Query, Key and Value vectors all have dimension $d_k$, so $\sqrt{d_k}$ is a preset hyper-parameter.
In the same way, the text feature of the abstract, denoted $h^{a}$, can serve as the Query vector, with the claim sub-text features $h^{c}_{1}, \dots, h^{c}_{s}$ as the Key vectors and Value vectors. The attention weights are again computed from the Query, Key and Value vectors, and the claim sub-text features are summed with these weights, as shown in formula (2):

$$u^{a} = \sum_{i=1}^{s} \alpha^{a}_{i}\, h^{c}_{i}, \qquad \alpha^{a}_{i} = \frac{\exp\big(f(h^{a}, h^{c}_{i}) / \sqrt{d_k}\big)}{\sum_{j=1}^{s} \exp\big(f(h^{a}, h^{c}_{j}) / \sqrt{d_k}\big)} \qquad (2)$$

where $\alpha^{a}_{i}$ is the attention weight of $h^{c}_{i}$; $f(\cdot,\cdot)$ measures the similarity between the abstract feature $h^{a}$ and the claim feature $h^{c}_{i}$, computed by dot product, cosine similarity or the like; and $\sqrt{d_k}$ is a preset hyper-parameter.
Then, the sub-semantic features obtained with the title and the abstract as Queries are averaged to obtain the semantic feature of the reference file, denoted $h^{sem}$. The calculation is shown in formula (3):

$$h^{sem} = \frac{1}{2}\big(u^{t} + u^{a}\big) \qquad (3)$$
in summary, in the embodiment of the present disclosure, for a complex text item, the content difference conveyed by each paragraph can be considered, and the sub-semantic features are obtained by using the text features of other text information as a reference, so that the extracted sub-semantic features can focus on the key features. The semantic features of the reference file obtained based on the sub-semantic features can comprehensively describe the information characteristics conveyed by the complex text items, and the accuracy of semantic feature extraction is improved.
Further, still taking the patent application document as an example, the text features of the title and the abstract are spliced with the semantic feature to obtain a spliced feature, which can be expressed as $[h^{t}; h^{a}; h^{sem}]$. The spliced feature is then processed through the fully connected layer to obtain the comprehensive text feature of the reference file; this processing can be expressed as shown in formula (4):

$$h^{text} = W_{o}\,[h^{t}; h^{a}; h^{sem}] + b_{o} \qquad (4)$$

where $W_{o}$ and $b_{o}$ are the parameters of the fully connected layer.
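A minimal sketch of formulas (1) to (3), using the dot product as the similarity function $f(\cdot,\cdot)$ (cosine similarity would work equally, per the text above); the feature values are random stand-ins:

```python
import torch

def sub_semantic(query: torch.Tensor, claims: torch.Tensor) -> torch.Tensor:
    # query: (d_k,) title or abstract feature; claims: (s, d_k) claim sub-text features
    d_k = query.shape[-1]
    scores = claims @ query / d_k ** 0.5   # f(h, h_i^c) / sqrt(d_k), dot product as f
    alpha = torch.softmax(scores, dim=0)   # attention weights of formulas (1)/(2)
    return alpha @ claims                  # weighted sum of the Value vectors

h_t, h_a = torch.randn(256), torch.randn(256)   # title and abstract text features
h_c = torch.randn(5, 256)                       # sub-text features of s = 5 claims
u_t, u_a = sub_semantic(h_t, h_c), sub_semantic(h_a, h_c)
h_sem = (u_t + u_a) / 2                         # semantic feature, formula (3)
h_text_input = torch.cat([h_t, h_a, h_sem])     # spliced input to the FC of formula (4)
```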
In some embodiments, the semantic features may be concatenated with the text features of the abstract alone, with the text features of the title alone, or with the text features of the abstract, the title and the claims together. That is, in the embodiment of the present disclosure, the semantic features may be spliced with the text features of some of the items of text information, or with the text features of all of them, which is not limited in the embodiment of the present disclosure.
In some embodiments, besides the comprehensive text features obtained by adopting a hierarchical attention mechanism, the text features of each item of text information can be spliced to obtain the comprehensive text features. The splicing processing is simple and easy to operate, and the efficiency of extracting the comprehensive text features is improved while the comprehensive text features can be obtained.
In the embodiment of the disclosure, the text features of each item of text information are respectively extracted, and the text features of each item of text information are fused to obtain the comprehensive text features, which can be realized based on the comprehensive text feature network model, that is, the text features of each item of text information are obtained through the comprehensive text feature network model and are fused to obtain the comprehensive text features.
The comprehensive text feature network model can use artificial intelligence technology to mine the text features of the reference file, so as to improve the accuracy of the comparison result between the reference file and the file to be compared.
The comprehensive text feature network model can obtain the comprehensive text features by using the hierarchical attention mechanism described above, or by splicing the text features of all the items of text information. Either way, the embodiment of the present disclosure may train the comprehensive text feature network model by the following method, as shown in fig. 7, including the following steps:
s701, extracting multiple text messages from the same file, constructing a positive sample, extracting multiple text messages from different files, and constructing a negative sample.
For example, the plurality of items of text information include a title, an abstract, and a claim. If the three text messages are all from the same patent application file, the three text messages are positive samples, and if at least one text message is from other files, the three text messages are negative samples. For example, a positive example is constructed by extracting three text messages of title, abstract and claims from the patent application document a. The title and abstract are extracted from patent application document A, but the claims are extracted from patent application document B, and the three text messages are constructed as negative examples.
And S702, respectively inputting the positive sample and the negative sample into the initial text feature network to obtain the comprehensive text features of the positive sample and the comprehensive text features of the negative sample output by the initial text feature network.
And S703, respectively classifying the comprehensive text characteristics of the positive samples and the comprehensive text characteristics of the negative samples by adopting a classifier to obtain classification processing results, wherein the classification category of the classifier comprises the positive samples and the negative samples.
S704, based on the classification processing result, the class label of the positive sample and the class label of the negative sample, a classification loss value is determined.
In the embodiment of the disclosure, the positive and negative samples are input to the classifier for discrimination and prediction to obtain the classification processing result of the positive and negative samples, and the loss function may adopt a commonly used cross entropy loss function.
S705, based on the classification loss value, adjusting model parameters of the initial text feature network to obtain a comprehensive text feature network model.
In the embodiment of the disclosure, the training samples do not need to be manually marked, so that the sample acquisition efficiency is improved. And the classification model is adopted to be combined with the positive sample and the negative sample for training, only the positive sample and the negative sample need to be classified during training, the model training mode is simple and easy, the training efficiency is high, and the converged comprehensive text feature network model can be obtained as soon as possible.
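A sketch of this training procedure, assuming `net` is the initial text feature network that maps the three items of text information to one comprehensive text feature; the sample construction follows the positive/negative scheme above, while the 768-dimensional feature size and the helper names are illustrative:

```python
import random
import torch
import torch.nn as nn

def build_sample(files, positive):
    """files: list of dicts with 'title', 'abstract', 'claims' fields."""
    f = random.choice(files)
    items = {k: f[k] for k in ("title", "abstract", "claims")}
    if not positive:                      # at least one item taken from another file
        other = random.choice([g for g in files if g is not f])
        key = random.choice(list(items))
        items[key] = other[key]
    return items, int(positive)

feat_dim = 768
classifier = nn.Linear(feat_dim, 2)       # two classes: positive / negative sample
criterion = nn.CrossEntropyLoss()         # the commonly used cross-entropy loss

def train_step(net, optimizer, files):
    items, label = build_sample(files, positive=random.random() < 0.5)
    feature = net(items)                  # comprehensive text feature, shape (feat_dim,)
    logits = classifier(feature).unsqueeze(0)
    loss = criterion(logits, torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()                       # classification loss value drives the update
    optimizer.step()
    return loss.item()
```

In practice the optimizer would cover both the network and the classifier parameters; the sketch only shows the loop structure of steps S701 to S705.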
The two comprehensive text feature network model structures, namely obtaining the comprehensive text features by splicing the text features and obtaining the comprehensive text features through the hierarchical attention mechanism, are respectively explained below.
Fig. 8 is a structural diagram of the comprehensive text feature network model that obtains the comprehensive text features by splicing the text features; as shown in fig. 8, the network model includes a language model (BERT), fully connected layers (FC), a dimension-reduction layer and a splicing layer.
Taking a patent application document as an example, the text sequences of the title, the abstract and the claims are extracted from the patent application text, where each claim in the claims is regarded as one paragraph and a text sequence is extracted from each claim separately. The word vector representation of the title text sequence is defined as $E^{t} = \{e^{t}_{1}, \dots, e^{t}_{l_t}\}$, the word vector representation of the abstract as $E^{a} = \{e^{a}_{1}, \dots, e^{a}_{l_a}\}$, and the word vector representation of the claims as $E^{c} = \{E^{c}_{1}, \dots, E^{c}_{s}\}$, where each word vector $e \in \mathbb{R}^{d}$, $l$ is the length of the corresponding word sequence and $s$ is the number of claims.
As shown in fig. 8, the word vector representations of the items of text information are processed by their respective language models and then passed through their respective fully connected layers to obtain the text feature of the title $h^{t}$, the text feature of the abstract $h^{a}$ and the text feature of the claims $h^{c}$. In fig. 8, after the claims are processed by the language model claim by claim, the features of all the claims are reduced in dimension by the AVG dimension-reduction layer and then input into the fully connected layer FC.
The processing of the fully connected layers is shown in formula (5):

$$h^{t} = W_{t}\Big(\frac{1}{l_t}\sum_{i=1}^{l_t} z^{t}_{i}\Big) + b_{t}, \qquad h^{a} = W_{a}\Big(\frac{1}{l_a}\sum_{i=1}^{l_a} z^{a}_{i}\Big) + b_{a}, \qquad h^{c} = W_{c}\Big(\frac{1}{s}\sum_{i=1}^{s} z^{c}_{i}\Big) + b_{c} \qquad (5)$$

In formula (5), $W_{t}, b_{t}, W_{a}, b_{a}, W_{c}, b_{c}$ are the parameters of the respective fully connected layers to be trained; $z^{t}_{i}$ denotes the result obtained after the $i$-th word vector in the title is processed by the BERT model, and $l_t$ denotes the length of the title text sequence; similarly, $z^{a}_{i}$ denotes the result obtained after the $i$-th word vector in the abstract is processed by the BERT model, and $l_a$ denotes the length of the abstract text sequence; $z^{c}_{i}$ denotes the result obtained after the word vectors of the $i$-th claim are processed by the BERT model, and $s$ denotes the number of claims.
Finally, the text feature of the title $h^{t}$, the text feature of the abstract $h^{a}$ and the text feature of the claims $h^{c}$ are spliced by the Concat splicing layer to obtain the comprehensive text feature $h = [h^{t}; h^{a}; h^{c}]$.
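A compact sketch of formula (5) and the Concat layer of fig. 8, with random stand-ins for the BERT outputs $z$ and illustrative dimensions:

```python
import torch
import torch.nn as nn

d_bert, d_f = 768, 256
fc = nn.ModuleDict({k: nn.Linear(d_bert, d_f) for k in ("t", "a", "c")})

z_t = torch.randn(12, d_bert)   # title: l_t = 12 per-word BERT outputs
z_a = torch.randn(80, d_bert)   # abstract: l_a = 80 per-word BERT outputs
z_c = torch.randn(5, d_bert)    # claims: s = 5 per-claim BERT outputs

h_t = fc["t"](z_t.mean(0))      # formula (5), title branch
h_a = fc["a"](z_a.mean(0))      # formula (5), abstract branch
h_c = fc["c"](z_c.mean(0))      # AVG dimension reduction, then FC
h = torch.cat([h_t, h_a, h_c])  # comprehensive text feature via the Concat layer
```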
As shown in fig. 9, a schematic structural diagram of a comprehensive text feature network model for obtaining a comprehensive text feature by using a hierarchical attention mechanism is shown, where in fig. 9, the network model includes a language model (BERT), a third fully-connected layer (FC), a hierarchical attention mechanism network, a concatenation layer, and a fourth fully-connected layer.
Taking a patent application file as an example, the text sequences of the title, the abstract and the claims are extracted from the patent application text, where each claim in the claims is regarded as one paragraph and a text sequence is extracted from each claim separately. The word vector representation of the title text sequence is defined as $E^{t} = \{e^{t}_{1}, \dots, e^{t}_{l_t}\}$, the word vector representation of the abstract as $E^{a} = \{e^{a}_{1}, \dots, e^{a}_{l_a}\}$, and the word vector representation of the claims as $E^{c} = \{E^{c}_{1}, \dots, E^{c}_{s}\}$, where each word vector $e \in \mathbb{R}^{d}$, $l$ is the length of the corresponding word sequence and $s$ is the number of claims.
These representations are processed by their respective language models and then by the respective third fully connected layers to obtain the text feature of the title $h^{t}$, the text feature of the abstract $h^{a}$ and the sub-text features of the claims $h^{c}_{1}, \dots, h^{c}_{s}$. The processing of each item of text information through its third fully connected layer follows formula (5) and is not repeated here. It should be noted that in fig. 9 the features of the claims need not be reduced in dimension.
Then, with the title feature $h^{t}$ as the Query vector and the claim features $h^{c}_{1}, \dots, h^{c}_{s}$ as the Key vectors and Value vectors, a first sub-semantic feature $u^{t}$ is obtained through the hierarchical attention mechanism network (Attention in fig. 9).
In the same way, with the abstract feature $h^{a}$ as the Query vector and the claim features $h^{c}_{1}, \dots, h^{c}_{s}$ as the Key vectors and Value vectors, a second sub-semantic feature $u^{a}$ is obtained through the hierarchical attention mechanism network (Attention in fig. 9).
Then, the first sub-semantic feature and the second sub-semantic feature are averaged to obtain the semantic feature $h^{sem}$.
Finally, the semantic feature $h^{sem}$, the text feature of the title $h^{t}$ and the text feature of the abstract $h^{a}$ are spliced by the Concat layer and then processed through the fourth fully connected layer to obtain the comprehensive text feature $h^{text}$.
On the basis of fig. 9, the comprehensive text feature $h^{text}$ can be fused with the metadata feature of the reference file to obtain the feature representation of the reference file. A schematic diagram of the citation characterization network for extracting metadata features is shown in fig. 10. A patent embedding network can be defined in the disclosed embodiments; the patent embedding network comprises the comprehensive text feature network model for extracting the comprehensive text features and also comprises the citation characterization network for extracting the metadata features.
For ease of understanding, the citation characterization network is explained below with reference to fig. 10.
In the embodiment of the present disclosure, metadata is first extracted from the reference file. Taking a patent application file as an example, the content of the metadata is shown in table 1. It should be noted that table 1 is only used to illustrate the embodiments of the present disclosure and does not limit them.
TABLE 1
(Table 1, reproduced as an image in the original publication, lists the metadata fields of the file, such as the patent's forward citations in recent years.)
The forward citation trend may be determined based on the forward citations of patent A in recent years. For example, the number of forward citations in 2017 is a, and the number of forward citations in 2018 is b. The forward citation trend can then be expressed based on the numbers of forward citations over these years.
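For instance, a sketch of turning per-year forward citation counts into a trend component of a node's attribute characterization (the years, counts and chosen fields are illustrative assumptions):

```python
# Forward citations of patent A per year (illustrative numbers).
citations_by_year = {2017: 4, 2018: 9, 2019: 15}
trend = [citations_by_year.get(year, 0) for year in range(2017, 2020)]
attribute_part = trend + [sum(trend)]   # e.g., yearly counts plus the running total
```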
After the metadata is obtained, a citation network is established, where each node in the citation network is a file and the attribute characterization of each node is the metadata of the file.
The citation network is defined as $G = \{V, E, X\}$, where $V = \{v_k \mid k = 1, 2, \dots, N\}$ represents all the file nodes, $E = \{e_{ij}\}$ represents the citation relationships between file nodes, and $X = \{x_k \mid k \in S_P\}$ denotes the metadata of the file nodes. If file node $v_i$ cites file node $v_j$, then an edge $e_{ij}$ can be observed; this edge represents the citation relationship between the file nodes, and nodes connected by a citation relationship are called neighbor nodes. Then, through network embedding learning, the high-dimensional, unstructured attribute characterizations of the nodes of the citation network are converted into low-dimensional representations, thereby characterizing the file features more accurately.
Taking patent application documents as an example, the attribute characterizations of all the patent application documents are first assembled into an attribute matrix $X^{m} \in \mathbb{R}^{N \times d_m}$, where $N$ is the number of patent application documents and $d_m$ is the dimension of each node's attribute characterization. The $k$-th row of the attribute matrix $X^{m}$ is denoted $x_k$ and represents the attribute characterization of patent $k$.
According to a random walk strategy, each node in the patent citation network $G$ is taken as a root node and its neighbor nodes are randomly sampled, yielding paths such as:
<root, neighborhood1, neighborhood2, …>
For each path, the set of neighbor nodes of patent $k$ on the path is called the context information of patent $k$, expressed as shown in formula (6):

$$\mathrm{context}(v_k) = \{v_{k-s}, \dots, v_{k+s}\} \setminus \{v_k\} \qquad (6)$$

Formula (6) indicates that $v_k$ itself is excluded from the extracted context information of patent $k$; $s$ is used to limit the length of the context information.
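A sketch of the random walk strategy and the context extraction of formula (6), on a toy citation graph (the graph, walk length and window size $s$ are illustrative):

```python
import random

def random_walk(graph, root, length=10):
    """Walk from a root node over randomly sampled citation neighbors."""
    path = [root]
    while len(path) < length and graph.get(path[-1]):
        path.append(random.choice(graph[path[-1]]))
    return path

def contexts(path, s=2):
    """Yield (center node, context) pairs; context(v_k) excludes v_k itself."""
    for i, v in enumerate(path):
        yield v, path[max(0, i - s):i] + path[i + 1:i + s + 1]

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # toy citation graph
for center, ctx in contexts(random_walk(graph, "A")):
    print(center, ctx)
```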
Then, the citation characterization network F is trained by maximizing the following conditional probability (as shown in formula (7)), which expresses the probability of guessing the center node $v_k$ from its neighbor nodes:

$$\max \prod_{v_k \in V} p\big(v_k \mid \mathrm{context}(v_k)\big) \qquad (7)$$

where $v_j \in \mathrm{context}(v_k)$ ranges over the neighbor nodes of $v_k$; $m_k$ and $\bar{m}_k$ respectively denote the metadata feature output for patent $k$ by the citation characterization network F and the metadata feature output for its context information by the citation characterization network F; and $p\big(v_k \mid \mathrm{context}(v_k)\big)$ is defined as shown in formula (8):

$$p\big(v_k \mid \mathrm{context}(v_k)\big) = \frac{\exp\big(m_k^{\top}\, \bar{m}_k\big)}{\sum_{v \in V} \exp\big(m_v^{\top}\, \bar{m}_k\big)}, \qquad \bar{m}_k = \frac{1}{|\mathrm{context}(v_k)|} \sum_{v_j \in \mathrm{context}(v_k)} m_j \qquad (8)$$

where $m_j$ is the metadata feature output by the citation characterization network F for neighbor node $v_j$ in the context information of patent $k$.
To incorporate the attribute matrix $X^{m}$ of the patent nodes into the network embedding learning process, the citation characterization network is defined as a mapping $F: \mathbb{R}^{d_m} \rightarrow \mathbb{R}^{d}$, which, for each patent $k$, converts its attribute characterization $x_k \in \mathbb{R}^{d_m}$ into the metadata feature $m_k = F(x_k) \in \mathbb{R}^{d}$.
When training the citation characterization network F, in the embodiment of the present disclosure, nodes not adjacent to patent node $k$ serve as negative samples of patent node $k$. The objective function is approximated by a negative sampling strategy to optimize the parameters, the objective function being as shown in formula (9):

$$\max \sum_{v_k \in V} \bigg( \log \sigma\big(m_k^{\top}\, \bar{m}_k\big) + \sum_{i=1}^{neg} \mathbb{E}_{v_i \sim P_n(v)} \Big[ \log \sigma\big(-m_{v_i}^{\top}\, \bar{m}_k\big) \Big] \bigg) \qquad (9)$$

where in formula (9) $\sigma(x) = 1/(1 + \exp(-x))$ is the Sigmoid function, $neg$ is the number of negative samples drawn for each positive sample $k$, $\mathbb{E}$ is the mathematical expectation, $P_n(v) \propto d_v^{3/4}$ is the noise distribution over negative sample nodes, and $d_v$ represents the out-degree of node $v$.
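A sketch of the negative-sampling loss of formula (9), realizing the citation characterization network F as a single linear projection; that realization is an assumption, since the disclosure only requires that F map the attribute characterization $x_k$ to a low-dimensional $m_k$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as fn

d_m, d = 64, 16
f_net = nn.Linear(d_m, d)   # citation characterization network F (assumed linear)

def negative_sampling_loss(x_center, x_context, x_negatives):
    m_k = f_net(x_center)                  # metadata feature of the center node
    m_bar = f_net(x_context).mean(dim=0)   # averaged context feature, as in formula (8)
    m_neg = f_net(x_negatives)             # features of non-adjacent (negative) nodes
    pos = fn.logsigmoid(m_k @ m_bar)
    neg = fn.logsigmoid(-(m_neg @ m_bar)).sum()
    return -(pos + neg)                    # minimize the negative of formula (9)

loss = negative_sampling_loss(torch.randn(d_m), torch.randn(3, d_m), torch.randn(5, d_m))
```

Sampling the negatives from $P_n(v) \propto d_v^{3/4}$ rather than uniformly would complete the picture; the sketch leaves that to the caller.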
After the citation characterization network is trained, inputting the attribute characterization of each patent $k$ yields the metadata feature $m_k$ of patent $k$.
As shown in fig. 10, the patent citation network enters the input layer (Input Layer) of the citation characterization network, with node $k$ at the center. The context information and the attribute characterization $x_k$ of node $k$ are extracted from the patent citation network through sequence generation, and the context information and attribute characterizations are input into the citation characterization network F to train it. The trained citation characterization network can obtain the metadata feature of patent $k$ based on the attribute characterization and context information of patent $k$. In fig. 10, the citation characterization network projects (Project) the attribute matrix into the feature space of the metadata features through embedding learning. The metadata feature $m_k$ extracted by the citation characterization network and the comprehensive text feature $h^{text}$ extracted by the comprehensive text feature network model are spliced by the Concat layer in the fusion layer (Fusion Layer) and then processed through the fully connected layer FC to obtain the comparison feature of the patent application file. The similarity between this comparison feature and the comparison features of other files can then be calculated, so that patents similar to the patent application file can be found.
Based on the same technical concept, an embodiment of the present disclosure further provides an information comparing apparatus, as shown in fig. 11, including:
an obtaining module 1101, configured to extract multiple items of text information from text content of a reference file, and extract metadata features based on metadata of the reference file;
the extraction module 1102 is configured to extract text features of each item of text information, and perform fusion processing on the text features of each item of text information to obtain a comprehensive text feature;
a comparison module 1103, configured to determine similarity between the reference file and the file to be compared based on the metadata feature and the comprehensive text feature.
In some embodiments, on the basis of fig. 11, as shown in fig. 12, the extracting module 1102 is configured to process text features of each item of text information based on a hierarchical attention mechanism to obtain a comprehensive text feature.
In some embodiments, based on fig. 11, as shown in fig. 12, the extraction module 1102 includes:
a text item determining unit 1201, configured to determine a complex text item including a plurality of paragraphs in the plurality of items of text information, and determine that text information other than the complex text item in the plurality of items of text information is simple text information;
a feature determination unit 1202, configured to determine, based on the complex text item, the key features, value features and query features of the hierarchical attention mechanism, wherein the sub-text features of each paragraph in the text features of the complex text item serve as the key features and the value features, and the text features of the simple text information serve as the query features;
a feature optimization unit 1203, configured to determine an optimized text feature of the complex text item based on the key feature, the value feature, and the query feature;
and a splicing unit 1204, configured to splice the optimized text features of the complex text item and the text features of the simple text information to obtain the comprehensive text features.
In some embodiments, the extraction module 1102 is configured to perform a splicing process on text features of each item of text information to obtain a comprehensive text feature.
In some embodiments, the extracting module 1102 is configured to perform the following operations for each text message:
extracting initial text features of the text information by adopting a first language model corresponding to the text information;
and inputting the initial text characteristics of the text information into a first full-connection layer corresponding to the text information to obtain the text characteristics of the text information output by the first full-connection layer.
In some embodiments, for a complex text item including multiple paragraphs in multiple items of text information, the extraction module 1102 extracts text features of the complex text item based on the following method:
based on a second language model corresponding to the complex text item, respectively extracting sub-text features of each text segment in the complex text item;
performing dimension-reduction processing on the sub-text features of each text segment to obtain the dimension-reduction features of the complex text item;
and inputting the dimension reduction characteristics of the complex text items into a second full-connection layer corresponding to the complex text items to obtain the text characteristics of the complex text items output by the second full-connection layer.
In some embodiments, the extracting module 1102 is configured to extract text features of each item of text information respectively based on the comprehensive text feature network model, and perform fusion processing on the text features of each item of text information to obtain comprehensive text features.
In some embodiments, the training module 1203 is further included for training the comprehensive text feature network model based on the following method:
extracting a plurality of items of text information from the same file to construct a positive sample, and extracting a plurality of items of text information from different files to construct a negative sample;
respectively inputting the positive sample and the negative sample into an initial text feature network to obtain the comprehensive text features of the positive sample and the comprehensive text features of the negative sample output by the initial text feature network;
respectively classifying the comprehensive text characteristics of the positive samples and the comprehensive text characteristics of the negative samples by adopting a classifier to obtain classification processing results, wherein the classification category of the classifier comprises the positive samples and the negative samples;
determining a classification loss value based on the classification processing result, the class label of the positive sample and the class label of the negative sample;
and adjusting model parameters of the initial text feature network based on the classification loss value to obtain the comprehensive text feature network model.
According to another embodiment of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
Fig. 13 illustrates a schematic block diagram of an example electronic device 1300 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 13, the electronic device 1300 includes a computing unit 1301 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1302 or a computer program loaded from a storage unit 1308 into a random access memory (RAM) 1303. The RAM 1303 can also store various programs and data necessary for the operation of the electronic device 1300. The computing unit 1301, the ROM 1302 and the RAM 1303 are connected to each other via a bus 1304. An input/output (I/O) interface 1305 is also connected to the bus 1304.
A number of components in the electronic device 1300 are connected to the I/O interface 1305, including: an input unit 1306 such as a keyboard, a mouse, or the like; an output unit 1307 such as various types of displays, speakers, and the like; storage unit 1308, such as a magnetic disk, optical disk, or the like; and a communication unit 1309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1309 allows the electronic device 1300 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1301 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1301 performs the information comparison method described above. In some embodiments, the information comparison method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1300 via the ROM 1302 and/or the communication unit 1309. When the computer program is loaded into the RAM 1303 and executed by the computing unit 1301, one or more steps of the information comparison method may be performed. Alternatively, in other embodiments, the computing unit 1301 may be configured to perform the information comparison method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. An information comparison method, comprising:
extracting a plurality of items of text information from text content of a reference file, and extracting metadata features based on metadata of the reference file;
respectively extracting text features of each item of text information, and performing fusion processing on the text features of each item of text information to obtain comprehensive text features;
and determining the similarity between the reference file and a file to be compared based on the metadata features and the comprehensive text features.
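
As a rough illustration of the flow recited in claim 1, the following Python sketch (PyTorch assumed; every name below, such as text_items, encode_metadata, or fuse, is an illustrative placeholder rather than a term defined by this disclosure) shows one plausible way to combine per-item text features with metadata features and score similarity:

    # Minimal sketch of the comparison flow of claim 1 (PyTorch assumed).
    # All names below are illustrative placeholders.
    import torch
    import torch.nn.functional as F

    def compare_files(ref_file, cand_file, item_encoders, encode_metadata, fuse):
        # Encode each extracted item of text information (e.g. title,
        # abstract, body text) into its own text feature vector.
        ref_feats = [item_encoders[k](v) for k, v in ref_file["text_items"].items()]
        cand_feats = [item_encoders[k](v) for k, v in cand_file["text_items"].items()]

        # Fuse the per-item text features into one comprehensive text feature.
        ref_text, cand_text = fuse(ref_feats), fuse(cand_feats)

        # Encode file metadata (e.g. size, type, creation time) separately.
        ref_meta = encode_metadata(ref_file["metadata"])
        cand_meta = encode_metadata(cand_file["metadata"])

        # Describe each file from both dimensions, then score similarity.
        ref_vec = torch.cat([ref_text, ref_meta], dim=-1)
        cand_vec = torch.cat([cand_text, cand_meta], dim=-1)
        return F.cosine_similarity(ref_vec, cand_vec, dim=-1)

Cosine similarity is used here only as one plausible scoring function; the claim itself does not fix how the two feature dimensions are combined into a similarity value.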
2. The method according to claim 1, wherein the fusing of the text features of each item of text information to obtain the comprehensive text features comprises:
processing the text features of each item of text information based on a hierarchical attention mechanism to obtain the comprehensive text features.
3. The method according to claim 2, wherein the processing of the text features of each item of text information based on the hierarchical attention mechanism to obtain the comprehensive text features comprises:
determining a complex text item comprising a plurality of paragraphs among the plurality of items of text information, and determining the text information other than the complex text item among the plurality of items of text information as simple text information;
determining key features, value features, and query features of the hierarchical attention mechanism based on the complex text item; wherein the sub-text features of each paragraph in the text features of the complex text item are the key features and the value features, and the text features of the simple text information are the query features;
determining optimized text features for the complex text item based on the key features, the value features, and the query features;
and splicing the optimized text features of the complex text item and the text features of the simple text information to obtain the comprehensive text features.
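
One plausible reading of the hierarchical attention recited in claim 3, sketched with PyTorch's standard multi-head attention (the feature dimension, head count, and the mean-pooling step are assumptions, not requirements of the claim):

    # Illustrative sketch of the hierarchical attention fusion of claim 3.
    import torch
    import torch.nn as nn

    class HierarchicalFusion(nn.Module):
        def __init__(self, dim: int = 256, heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, paragraph_feats, simple_feats):
            # paragraph_feats: (batch, n_paragraphs, dim) sub-text features of
            # the complex text item, serving as both keys and values.
            # simple_feats: (batch, n_simple, dim) features of the simple text
            # information, serving as queries.
            optimized, _ = self.attn(query=simple_feats,
                                     key=paragraph_feats,
                                     value=paragraph_feats)
            # Pool to one optimized text feature for the complex text item,
            # then splice it together with the simple-item features.
            optimized = optimized.mean(dim=1, keepdim=True)
            fused = torch.cat([optimized, simple_feats], dim=1)
            return fused.flatten(start_dim=1)  # comprehensive text features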
4. The method according to claim 1, wherein the fusing of the text features of each item of text information to obtain the comprehensive text features comprises:
splicing the text features of each item of text information to obtain the comprehensive text features.
5. The method according to any one of claims 1-4, wherein the extracting of the text features of each item of text information respectively comprises:
performing the following operations for each item of text information respectively:
extracting initial text features of the text information by using a first language model corresponding to the text information;
and inputting the initial text features of the text information into a first fully connected layer corresponding to the text information to obtain the text features of the text information output by the first fully connected layer.
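
A minimal sketch of the per-item extraction of claim 5, assuming a HuggingFace transformers encoder as the first language model (the model name and the use of the [CLS] vector as the initial text feature are illustrative assumptions):

    # Sketch of claim 5: a language model followed by a fully connected layer.
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class ItemEncoder(nn.Module):
        def __init__(self, model_name: str = "bert-base-chinese", out_dim: int = 256):
            super().__init__()
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.lm = AutoModel.from_pretrained(model_name)            # first language model
            self.fc = nn.Linear(self.lm.config.hidden_size, out_dim)  # first FC layer

        def forward(self, text: str):
            tokens = self.tokenizer(text, return_tensors="pt",
                                    truncation=True, max_length=512)
            initial = self.lm(**tokens).last_hidden_state[:, 0]  # initial text features
            return self.fc(initial)  # text features of this item of text information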
6. The method according to any one of claims 1-4, wherein extracting, for a complex text item comprising a plurality of paragraphs among the plurality of items of text information, the text features of the complex text item comprises:
extracting, based on a second language model corresponding to the complex text item, the sub-text features of each paragraph in the complex text item respectively;
performing dimension reduction processing on the sub-text features of each paragraph to obtain dimension-reduced features of the complex text item;
and inputting the dimension-reduced features of the complex text item into a second fully connected layer corresponding to the complex text item to obtain the text features of the complex text item output by the second fully connected layer.
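
For the complex-item path of claim 6, a sketch under the assumption that each paragraph is encoded separately and that mean pooling stands in for the dimension reduction step, which the claim leaves unspecified:

    # Sketch of claim 6: per-paragraph sub-text features, dimension reduction,
    # then a second fully connected layer. Pooling choice is an assumption.
    import torch
    import torch.nn as nn

    class ComplexItemEncoder(nn.Module):
        def __init__(self, paragraph_encoder: nn.Module, hidden: int = 768,
                     out_dim: int = 256):
            super().__init__()
            self.encoder = paragraph_encoder      # second language model (assumed)
            self.fc = nn.Linear(hidden, out_dim)  # second fully connected layer

        def forward(self, paragraphs):
            # One sub-text feature vector of size `hidden` per paragraph.
            subs = torch.stack([self.encoder(p) for p in paragraphs], dim=0)
            reduced = subs.mean(dim=0)            # dimension-reduced features
            return self.fc(reduced)               # text features of the complex item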
7. The method according to claim 1, wherein the extracting of the text features of each item of text information respectively and the fusing of the text features of each item of text information to obtain the comprehensive text features comprise:
extracting the text features of each item of text information respectively based on a comprehensive text feature network model, and fusing the text features of each item of text information to obtain the comprehensive text features.
8. The method according to claim 7, further comprising training the comprehensive text feature network model by:
extracting a plurality of items of text information from the same file to construct a positive sample, and extracting a plurality of items of text information from different files to construct a negative sample;
inputting the positive sample and the negative sample respectively into an initial text feature network to obtain the comprehensive text features of the positive sample and the comprehensive text features of the negative sample output by the initial text feature network;
classifying the comprehensive text features of the positive sample and the comprehensive text features of the negative sample by using a classifier to obtain a classification result, wherein the classification categories of the classifier comprise positive sample and negative sample;
determining a classification loss value based on the classification result, the class label of the positive sample, and the class label of the negative sample;
and adjusting the model parameters of the initial text feature network based on the classification loss value to obtain the comprehensive text feature network model.
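
The training scheme of claim 8 might look roughly as follows; the binary cross-entropy loss and the optimizer wiring are assumptions, since the claim only requires a classifier over positive and negative samples and a classification loss:

    # Sketch of the training scheme of claim 8. Positive batches pair items of
    # text information from the same file; negative batches mix items across files.
    import torch
    import torch.nn as nn

    def train_step(feature_net, classifier, optimizer, pos_batch, neg_batch):
        pos_feat = feature_net(pos_batch)   # comprehensive features, positive sample
        neg_feat = feature_net(neg_batch)   # comprehensive features, negative sample

        logits = classifier(torch.cat([pos_feat, neg_feat], dim=0)).squeeze(-1)
        labels = torch.cat([torch.ones(pos_feat.size(0)),
                            torch.zeros(neg_feat.size(0))])

        # The classification loss value drives the update of the model
        # parameters of the initial text feature network.
        loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()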
9. An information comparison apparatus, comprising:
an acquisition module configured to extract a plurality of items of text information from the text content of a reference file and to extract metadata features based on the metadata of the reference file;
an extraction module configured to extract text features of each item of text information respectively and to fuse the text features of each item of text information to obtain comprehensive text features;
and a comparison module configured to determine the similarity between the reference file and a file to be compared based on the metadata features and the comprehensive text features.
10. The apparatus according to claim 9, wherein the extraction module is configured to process the text features of each item of text information based on a hierarchical attention mechanism to obtain the comprehensive text features.
11. The apparatus according to claim 10, wherein the extraction module comprises:
a text item determining unit configured to determine a complex text item comprising a plurality of paragraphs among the plurality of items of text information, and to determine the text information other than the complex text item among the plurality of items of text information as simple text information;
a feature determination unit configured to determine, based on the complex text item, the key features, value features, and query features of the hierarchical attention mechanism; wherein the sub-text features of each paragraph in the text features of the complex text item are the key features and the value features, and the text features of the simple text information are the query features;
a feature optimization unit configured to determine optimized text features for the complex text item based on the key features, the value features, and the query features;
and a splicing unit configured to splice the optimized text features of the complex text item and the text features of the simple text information to obtain the comprehensive text features.
12. The apparatus according to claim 9, wherein the extraction module is configured to splice the text features of each item of text information to obtain the comprehensive text features.
13. The apparatus according to any one of claims 9-12, wherein the extraction module is configured to perform the following operations for each item of text information respectively:
extracting initial text features of the text information by using a first language model corresponding to the text information;
and inputting the initial text features of the text information into a first fully connected layer corresponding to the text information to obtain the text features of the text information output by the first fully connected layer.
14. The apparatus according to any one of claims 9-12, wherein, for a complex text item comprising a plurality of paragraphs among the plurality of items of text information, the extraction module extracts the text features of the complex text item by:
extracting, based on a second language model corresponding to the complex text item, the sub-text features of each paragraph in the complex text item respectively;
performing dimension reduction processing on the sub-text features of each paragraph to obtain dimension-reduced features of the complex text item;
and inputting the dimension-reduced features of the complex text item into a second fully connected layer corresponding to the complex text item to obtain the text features of the complex text item output by the second fully connected layer.
15. The apparatus according to claim 9, wherein the extraction module is configured to extract the text features of each item of text information respectively based on a comprehensive text feature network model, and to fuse the text features of each item of text information to obtain the comprehensive text features.
16. The apparatus according to claim 15, further comprising a training module configured to train the comprehensive text feature network model by:
extracting a plurality of items of text information from the same file to construct a positive sample, and extracting a plurality of items of text information from different files to construct a negative sample;
inputting the positive sample and the negative sample respectively into an initial text feature network to obtain the comprehensive text features of the positive sample and the comprehensive text features of the negative sample output by the initial text feature network;
classifying the comprehensive text features of the positive sample and the comprehensive text features of the negative sample by using a classifier to obtain a classification result, wherein the classification categories of the classifier comprise positive sample and negative sample;
determining a classification loss value based on the classification result, the class label of the positive sample, and the class label of the negative sample;
and adjusting the model parameters of the initial text feature network based on the classification loss value to obtain the comprehensive text feature network model.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-8.
CN202210920358.0A 2022-08-02 2022-08-02 Information comparison method and device, electronic equipment and storage medium Pending CN115329850A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210920358.0A CN115329850A (en) 2022-08-02 2022-08-02 Information comparison method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210920358.0A CN115329850A (en) 2022-08-02 2022-08-02 Information comparison method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115329850A 2022-11-11

Family

ID=83920364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210920358.0A Pending CN115329850A (en) 2022-08-02 2022-08-02 Information comparison method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115329850A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116795789A (en) * 2023-08-24 2023-09-22 卓望信息技术(北京)有限公司 Methods and devices for automatically generating patent search reports
CN116795789B (en) * 2023-08-24 2024-04-19 卓望信息技术(北京)有限公司 Method and device for automatically generating patent search report

Similar Documents

Publication Publication Date Title
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
EP3940555A2 (en) Method and apparatus of processing information, method and apparatus of recommending information, electronic device, and storage medium
RU2628436C1 (en) Classification of texts on natural language based on semantic signs
RU2628431C1 (en) Selection of text classifier parameter based on semantic characteristics
US20170200065A1 (en) Image Captioning with Weak Supervision
AU2016256764A1 (en) Semantic natural language vector space for image captioning
CN114549874A (en) Training method of multi-target image-text matching model, image-text retrieval method and device
JP2023541742A (en) Sorting model training method and device, electronic equipment, computer readable storage medium, computer program
CN114840671A (en) Dialogue generation method, model training method, device, equipment and medium
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN112395487A (en) Information recommendation method and device, computer-readable storage medium and electronic equipment
US11886515B2 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
CN117931858B (en) Data query method, device, computer equipment and storage medium
JP2022117941A (en) Image searching method and device, electronic apparatus, and computer readable storage medium
CN112925912A (en) Text processing method, and synonymous text recall method and device
CN117891939A (en) Text classification method combining particle swarm algorithm with CNN convolutional neural network
CN115510193B (en) Query result vectorization method, query result determination method and related devices
CN116028618A (en) Text processing, text retrieval method, device, electronic device and storage medium
CN115248839A (en) Knowledge system-based long text retrieval method and device
CN114818727A (en) Key sentence extraction method and device
CN115329850A (en) Information comparison method and device, electronic equipment and storage medium
CN113378015A (en) Search method, search apparatus, electronic device, storage medium, and program product
CN116226533A (en) Method, device and medium for news association recommendation based on association prediction model
CN116127111A (en) Image search method, device, electronic device and computer-readable storage medium
CN114528378A (en) Text classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination