CN116522165B - Public opinion text matching system and method based on twin structure - Google Patents
Public opinion text matching system and method based on a twin structure
- Publication number: CN116522165B
- Application number: CN202310761055.3A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a public opinion text matching system based on a twin structure, comprising a twin neural network module, which constructs the coding layer of a twin neural network and obtains a first similarity characterization vector between named entities; a semantic interaction module, which obtains a second similarity characterization vector; a fusion module, which splices the first and second similarity characterization vectors to obtain the final similarity characterization vector of a sentence pair; and a matching module, which passes the final similarity characterization vector through a softmax classification function to obtain the text matching result. The method extracts both the named-entity similarity features and the semantic similarity features of public opinion texts, fuses the two kinds of features, and then computes semantic similarity to decide whether two public opinion texts are similar. Because it does not simply match the topic and meaning of the texts but also considers whether the expressions are aimed at the same person, thing, or phenomenon, it improves the accuracy and robustness of public opinion text matching.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a public opinion text matching system and method based on a twin structure.
Background
The core problem of public opinion text matching is judging the similarity of text data; the matching accuracy of a public opinion text system can be improved only when this similarity judgment is accurate. Traditional approaches require a great deal of manpower and time to manually judge, label, and remove similar public opinion texts, so an intelligent public opinion text matching system is needed to distill important information and improve the efficiency of text analysis. Public opinion text matching plays a crucial role in public opinion analysis and early warning, and its accuracy directly affects the accuracy of subsequent public opinion assessment.
At present, public opinion text matching is mostly computed in one of two ways: with a traditional text matching algorithm, or with a deep-learning-based text matching algorithm. Traditional text matching algorithms can generally be classified into string-based, statistics-based, and knowledge-base-based methods. Most of them capture only the surface-level meaning of a text and struggle to mine its deeper meaning. As the demands of natural language processing tasks have broadened, traditional methods have never broken through the bottleneck of the semantic similarity task, so they have gradually been replaced by deep-learning-based semantic similarity algorithms. Deep-learning-based text matching can understand the deeper meaning of a text and therefore performs better, but because this line of research is still young, its accuracy still needs improvement. The distributed word vector method word2vec, proposed in 2013, predicts the word vector of each word in a text from its context within a fixed window; the concatenated word vectors can then represent some semantic information. However, the context each word depends on is limited, so the semantic information each word vector expresses is only local. In 2014, the doc2vec method was proposed to vectorize whole documents, which, unlike words, have no word-to-word logical structure and are treated as integral text data. The vectors generated by both methods are static, that is, they cannot change dynamically with different textual contexts, which limits the accuracy and performance of these methods.
In recent years, BERT has had a great influence on the field of natural language processing. It combines the self-attention mechanism with two novel and effective pre-training objectives, the masked language model task and the next sentence prediction task, which greatly improve performance and make BERT one of the most commonly used methods for generating dynamic word vectors. Public opinion text matching is harder than general text matching: it must judge not only whether two texts are semantically similar, but also whether the beliefs, attitudes, opinions, and emotions they express are aimed at the same person, thing, or phenomenon. Existing text matching algorithms consider only character matching or meaning matching, that is, two texts are judged similar when they share many characters or express the same topic or meaning, without targeting the person or event level. The invention therefore provides a public opinion text matching method based on a twin structure, which further improves the accuracy and robustness of text matching in public opinion scenarios.
Disclosure of Invention
Compared with general text matching, public opinion text matching is more difficult: it must judge not only whether two texts are semantically similar, but also whether the beliefs, attitudes, opinions, and emotions they express are aimed at the same person, thing, or phenomenon.
In order to overcome the defects of the prior art, the invention aims to provide a public opinion text matching system and method based on a twin structure.
According to a first aspect of the present invention, there is provided a system for public opinion text matching based on a twin structure, comprising
Twin neural network module: used for constructing the coding layer of a twin neural network, extracting the named entity information in sentence pairs, and performing similarity calculation on the extracted named entities to obtain a first similarity characterization vector between the named entities;
semantic interaction module: used for obtaining a second similarity characterization vector of the sentence pair in terms of semantics;
and a fusion module: used for splicing the first similarity characterization vector and the second similarity characterization vector to obtain the final similarity characterization vector of the sentence pair;
and a matching module: used for passing the final similarity characterization vector through a softmax classification function to obtain the text matching result.
In an exemplary embodiment of the present invention, the twin neural network module specifically uses a BERT+CRF method to construct the coding layer of the twin neural network. The coding layer comprises a coupled three-layer architecture built from two identical or similar networks: an input layer, a feature extraction layer, and a similarity measurement layer. The input layer takes the sentence pair to be matched; the feature extraction layer embeds the two input sentence samples into a high-dimensional space to obtain their characterization vectors; and the similarity measurement layer performs a similarity calculation on the two extracted characterization vectors through a mathematical formula to obtain the first similarity characterization vector of the sentence pair.
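By way of illustration only, the coupled three-layer architecture can be sketched as follows; the class name `SiameseEncoder`, the use of the Hugging Face transformers library, the `bert-base-chinese` weights, and the mean-pooled cosine similarity are assumptions of this sketch, not details fixed by the invention.

```python
# Minimal sketch of the coupled three-layer twin (Siamese) encoder,
# assuming the Hugging Face transformers library; names are illustrative.
import torch.nn as nn
from transformers import BertModel

class SiameseEncoder(nn.Module):
    def __init__(self, bert_name: str = "bert-base-chinese"):
        super().__init__()
        # Feature extraction layer: one BERT instance shared by both branches,
        # so the two coupled networks have identical weights.
        self.bert = BertModel.from_pretrained(bert_name)

    def encode(self, input_ids, attention_mask):
        # Embed one sentence of the pair into a high-dimensional space.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state            # (batch, seq_len, hidden)

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        h_a = self.encode(ids_a, mask_a)        # characterization of sentence A
        h_b = self.encode(ids_b, mask_b)        # characterization of sentence B
        # Similarity measurement layer: cosine similarity of pooled vectors.
        v_a, v_b = h_a.mean(dim=1), h_b.mean(dim=1)
        return nn.functional.cosine_similarity(v_a, v_b, dim=-1)
```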
In an exemplary embodiment of the invention, the BERT model of the twin neural network module further comprises a masked language model task unit: during training, part of the characters in the input layer are randomly masked, and the masked characters are then predicted from the remaining unmasked characters; trained in this way, the model can fully learn word-level text features of the input sentence, after which the feature vectors output by the BERT layer are input to the CRF layer;
it further comprises a next sentence prediction task unit, which judges whether the A sentence and the B sentence of an input sentence pair are contextually related, so that the model learns the relationship between the two texts and handles the sentence-level problem; the feature vectors output by the BERT layer are then input to the CRF layer;
in an exemplary embodiment of the present invention, the CRF model of the twin neural network module further includes a transition probability unit between tags in the dataset, and the CRF layer corrects the output of the BERT layer by learning the transition probabilities between tags in the dataset, thereby ensuring the rationality of the predicted tags;
the system also comprises a labeling unit, wherein a named entity in a sentence pair is required to be extracted, a training set, namely the sentence pair, is used for labeling the entity by adopting a BIO method, B (begin) represents that the character is positioned at the beginning of one entity, I (inside) represents that the character is positioned at the internal position of the entity, and O (outside) represents that the non-entity character outside the entity is not concerned; for public opinion texts, people names (PER), place names (GEO) and Organizations (ORG) in the texts are important to pay attention to, so that 7 types of labels, namely B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG and O, are used as entity labels of a training set;
the method also comprises a unit for obtaining part-of-speech states to characterize vectors, wherein [ CLS ] is required to be added to the head of a sentence before the sentence pair is sent into the twin neural network]Identifier, get A sentence of A, B sentence pairAndthe method comprises the steps of carrying out a first treatment on the surface of the Will beAndsending the text into the BERT for fine tuning, introducing context information into characters at each position in the sentence through the code of the BERT layer so as to obtain part-of-speech states to perform characterization vectors, wherein the output of all BERT is used as the input of the CRF layer;
in an exemplary embodiment of the invention, the semantic interaction module employs a following predictive task to learn sentence-to-sentence relationships between text, in particular based on BERTThe system is characterized by comprising an encoding layer of an interaction module, a pooling layer of the interaction module and a normalization layer of the interaction module, wherein the encoding layer of the interaction module needs to add [ CLS ] to the head of a sentence before sending the sentence pair into the BERT]Identifier and insert [ SEP ] between two sentences]The identifier is split. The sentences after being spliced are processedSending into BERT model for fine adjustment, and outputtingI.e., vectorized representation of sentence pairs;
the pooling layer of the interactive module obtains sentence vectors through BERTExtracting important features through a pooling layer to reduce the dimension;
normalization layer of the interaction module and sentence vectorThe output result after the layer normalization is the second similarity characterization vector of the sentence pair acquired by the interaction module.
In an exemplary embodiment of the present invention, in the matching module, the softmax classification function is $P(y=j \mid x) = \frac{e^{W_j^{\top} x}}{\sum_{i=1}^{k} e^{W_i^{\top} x}}$, that is, the probability that the sample vector x belongs to the j-th class, where W is a weight matrix and k is the number of classes.
The final similarity characterization vector $v = [v_1; v_2]$ is input into the softmax function, where $v_1$ is the output of the twin neural network module, $v_2$ is the output of the interaction module, and $v$ plays the role of x in the softmax function above. The resulting matching probability lies in the interval [0, 1]; assuming the text similarity threshold is set to 0.5, the two texts are considered matched when the probability exceeds 0.5, and unmatched otherwise.
According to a second aspect of the present invention, there is provided a method for matching public opinion text based on a twin structure, to which the system for matching public opinion text based on a twin structure is applied, comprising the steps of:
constructing a coding layer of the twin neural network, thereby extracting named entity information in sentence pairs, and carrying out similarity calculation on the extracted named entities to obtain a first similarity characterization vector among the named entities;
obtaining a second similarity characterization vector of the sentence pairs in terms of semantics;
splicing the first similarity characterization vector and the second similarity characterization vector to obtain a final similarity characterization vector of the sentence pair;
and obtaining a text matching result through a softMax classification function by the final similarity characterization vector.
In an exemplary embodiment of the present invention, the coding layer of the twin neural network is specifically constructed with the BERT+CRF method and comprises a coupled three-layer architecture built from two identical or similar neural networks: an input layer, a feature extraction layer, and a similarity measurement layer. The input layer takes the sentence pair to be matched; the feature extraction layer embeds the input sentence-pair samples into a high-dimensional space to obtain the characterization vectors of the two samples; and the similarity measurement layer performs a similarity calculation on the two extracted characterization vectors through a mathematical formula to obtain the first similarity characterization vector of the sentence pair.
In an exemplary embodiment of the present invention, constructing the coding layer of the twin neural network, extracting the named entity information in sentence pairs, and performing similarity calculation on the extracted named entities to obtain the first similarity characterization vector between the named entities specifically further comprises:
the masked language model task of the BERT layer is used to obtain word-level text features of the input sentences, and the feature vectors output by the BERT layer are then input to the CRF layer;
the CRF layer corrects the output of the BERT layer by learning the transition probability among the tags in the data set;
the training set (the sentence pairs) labels entities with the BIO method: B (begin) indicates that a character is at the beginning of an entity, I (inside) indicates that a character is inside an entity, and O (outside) marks non-entity characters of no concern; for public opinion texts, person names (PER), place names (GEO), and organizations (ORG) are the important entities, so 7 tag types, namely B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG, and O, are used as the entity labels of the training set;
before a sentence pair is fed into the twin neural network, a [CLS] identifier must be added to the head of each sentence, giving the sentence vectors $T_A$ and $T_B$ for the A and B sentences of the pair; $T_A$ and $T_B$ are fed into BERT for fine-tuning, and through the BERT-layer encoding, context information is introduced for the character at each position in the sentence, yielding the context-dependent characterization vectors; the full BERT output is used as the input to the CRF layer.
In an exemplary embodiment of the present invention, obtaining the second similarity characterization vector of the sentence pair in terms of semantics specifically comprises: based on BERT, using the next sentence prediction task to learn sentence-relationship features between the texts; before the sentence pair is fed into BERT, adding a [CLS] identifier to the head of the sentence and inserting a [SEP] identifier between the two sentences as a separator; feeding the spliced sentence $T = \{[CLS], A, [SEP], B\}$ into the BERT model for fine-tuning, the output $H_T$ being the vectorized representation of the sentence pair; passing the sentence vector $H_T$ obtained through BERT through a pooling layer to extract the important features and reduce the dimensionality; the output of layer normalization applied to the pooled sentence vector is the second similarity characterization vector of the sentence pair obtained by the interaction module.
According to a third aspect of the present invention, there is provided a computer readable storage medium comprising a stored program, wherein the program when run performs the above-described method of twin structure-based public opinion text matching.
According to a fourth aspect of the present invention there is provided an electronic device comprising a memory having a computer program stored therein and a processor arranged to perform the twin structure based method of public opinion text matching by the computer program.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
the system is divided into two main modules, namely a twin neural network module based on BERT+CRF and a semantic interaction module based on BERT. The twin neural network module utilizes a BERT+CRF method to construct a coding layer of the twin neural network, so that named entity information in sentence pairs including names, places and the like is extracted, similarity calculation is carried out on the extracted named entities, and similarity characteristics (characterization vectors) among the named entities are obtained. The BERT-based semantic interaction module may obtain semantically similar features (token vectors) of sentence pairs. According to the invention, the named entity similarity feature and the text semantic similarity feature of the public opinion text are extracted through the two modules, semantic similarity calculation is carried out after the two types of features are fused, whether the two public opinion texts are similar or not is analyzed, and accuracy and robustness of matching of the public opinion text are improved, because the topic and meaning of the text are not simply matched, and meanwhile, matching of expressions aiming at the same person, thing or phenomenon is considered.
Drawings
Fig. 1 is a schematic diagram of a public opinion text matching system based on a twin structure.
FIG. 2 is a vector diagram of the input characterization of the BERT model of the twin neural network module of the present invention.
Fig. 3 is a specific label form diagram of a training set of the twin neural network module of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.
Example 1
Referring to fig. 1, the present embodiment provides a system for matching public opinion text based on a twin structure, including: the twin neural network module is used for constructing a coding layer of the twin neural network, extracting named entity information in sentence pairs, and carrying out similarity calculation on the extracted named entities to obtain a first similarity characterization vector among the named entities; the semantic interaction module is used for acquiring a second similarity characterization vector of the sentence pairs in terms of semantics; the fusion module is used for splicing the first similarity characterization vector and the second similarity characterization vector to obtain a final similarity characterization vector of the sentence pair; and the matching module is used for obtaining a text matching result from the final similarity characterization vector through a softMax classification function.
In an exemplary embodiment, the twin neural network module specifically uses the BERT+CRF method (i.e., a BERT model plus a CRF model) to construct the coding layer of the twin neural network and comprises a coupled three-layer architecture built from two identical or similar neural networks (specifically BERT model + CRF model); the natural advantages of this coupled architecture make it well suited to similarity matching problems. The three layers are an input layer, a feature extraction layer, and a similarity measurement layer. The input layer takes the sentence-pair samples to be matched; the feature extraction layer embeds the input sentence-pair samples into a high-dimensional space to obtain the characterization vectors of the two samples; and the similarity measurement layer performs a similarity calculation on the two extracted characterization vectors through a mathematical formula, generally the Euclidean distance, cosine distance, or Jaccard distance, to obtain the first similarity characterization vector of the sentence pair.
Specifically, the BERT model uses a multi-layer Transformer encoder as its network layer, enabling it to mine important features deep in the text and capture context information over longer distances. BERT is a multitask model, and a pre-trained BERT model can complete a variety of downstream tasks. The input of the model may be either a single sentence or a text pair. For text input, a special classification symbol [CLS] is added to the head of the text sequence, and a special symbol [SEP] is added at the end of each sentence as its delimiter and end marker. Each character in the text is first vector-initialized by a word2vec model to form the original characterization vector. To distinguish character sources, a segment embedding is added to indicate whether a character comes from sentence A or sentence B of the sentence pair. Finally, so that the model can learn how the position of each character in the sentence affects the sentence meaning, a position embedding is also needed. The final input characterization vector of the BERT model is the sum of three parts, word embedding, segment embedding, and position embedding, as shown in fig. 2.
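A minimal sketch of the three-part input characterization, assuming placeholder vocabulary, hidden, and position sizes (the real BERT configuration may differ):

```python
# Illustrative input characterization: word + segment + position embeddings.
import torch
import torch.nn as nn

class BertInput(nn.Module):
    """Input characterization vector = word embedding + segment embedding
    + position embedding (placeholder sizes; illustrative only)."""
    def __init__(self, vocab=21128, hidden=768, max_pos=512):
        super().__init__()
        self.word = nn.Embedding(vocab, hidden)
        self.segment = nn.Embedding(2, hidden)      # sentence A vs. sentence B
        self.position = nn.Embedding(max_pos, hidden)

    def forward(self, token_ids, segment_ids):
        # Positions 0..seq_len-1 broadcast over the batch dimension.
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word(token_ids) + self.segment(segment_ids) + self.position(pos)
```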
The pre-training of the BERT model consists of two unsupervised learning subtasks: the masked language model and the next sentence prediction task. The masked language model masks part of the characters in the input layer during training and then predicts the masked characters from the remaining unmasked characters; trained this way, the model can fully learn word-level text features of the input sentence. The next sentence prediction task makes the model judge whether two input sentences are contextually related, so that the model learns the relationship between two texts and handles the sentence-level problem. After every character has been fully trained on these two tasks over a large unsupervised corpus, the language features of the text are learned and character vector encodings with deeper expressiveness are output. In downstream tasks, the trained model parameters can be used directly to vectorize text.
In an exemplary embodiment, the BERT model of the twin neural network module further comprises a masked language model task unit: part of the characters in the input layer are randomly masked during training, and the masked characters are then predicted from the remaining unmasked characters, so that the model fully learns word-level text features of the input sentence; the feature vectors output by the BERT layer are then input to the CRF layer. It further comprises a next sentence prediction task unit, which judges whether the A sentence and the B sentence of an input sentence pair are contextually related, so that the model learns the relationship between the two texts and handles the sentence-level problem; the feature vectors output by the BERT layer are then input to the CRF layer.
In an exemplary embodiment of the invention, the CRF model of the twin neural network module further comprises a unit for the transition probabilities between tags in the dataset: the CRF layer corrects the output of the BERT layer by learning the transition probabilities between tags, thereby ensuring that the predicted tags are reasonable; for example, if the BERT layer previously output a vector X, the corrected output is X'. It further comprises a labeling unit, since the named entities in sentence pairs must be extracted: the training set, that is, the sentence pairs, labels entities with the BIO method, where B (begin) indicates that a character is at the beginning of an entity, I (inside) indicates that a character is inside an entity, and O (outside) marks non-entity characters of no concern; for public opinion texts, person names (PER), place names (GEO), and organizations (ORG) are the important entities, so 7 tag types, namely B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG, and O, are used as the entity labels of the training set. It further comprises a unit for obtaining the context-dependent characterization vectors: before the A and B sentences of a sentence pair are fed into the twin neural network, a [CLS] identifier must be added to the head of each sentence, giving the sentence vectors $T_A$ and $T_B$; $T_A$ and $T_B$ are fed into BERT for fine-tuning, and through the BERT-layer encoding, context information is introduced for the character at each position in the sentence, yielding the characterization vectors, with the full BERT output used as the input to the CRF layer. It may further comprise a preprocessing unit: the training set of the twin neural network module, that is, the sentence pairs, is the input of the model; the text of the input sentence pairs is cleaned and stop words are removed, a stop-word list is used to filter the whole text (thereby reducing the text length and improving the computational efficiency of the model), and a direct cut-off is used to limit the length of the input text.
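A minimal preprocessing sketch, assuming a user-supplied stop-word set and maximum length (both are illustrative, not specified by the invention); traditional-to-simplified conversion and full-width-to-half-width normalization are omitted here for brevity:

```python
# Illustrative text cleaning, stop-word filtering, and direct truncation.
import re

def preprocess(text: str, stopwords: set, max_len: int = 128) -> str:
    text = re.sub(r"\s+", "", text)                       # drop blank symbols
    text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)   # drop emoticons
    text = "".join(ch for ch in text if ch not in stopwords)
    return text[:max_len]                                 # direct cut-off
```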
In short, the twin neural network module constructs the coding layer of the twin neural network with the BERT+CRF method, extracts the named entity information in sentence pairs, and performs similarity calculation on the extracted named entities. The module first uses the masked language model task of the BERT layer to obtain word-level text features of the input sentences. The feature vectors output by the BERT layer are then input to the CRF layer, which corrects the BERT output by learning the transition probabilities between tags in the dataset, thereby ensuring that the predicted tags are reasonable. The steps are as follows:
the sentence pairs A and B, namely, the sentence pairs A and B are the sentence pairs which need to be judged whether to be similar or not, namely, the training set of the twin neural network module, the sentence pairs are used as the input of the model, and the input text needs to be cleaned and the word removal is stopped at first. And (3) cleaning the text, namely processing redundant information and error information in the text, deleting unimportant information such as blank symbols or emoticons, converting traditional Chinese characters in the text into simplified Chinese characters, and unifying character formats in the text into half-angle formats so as to facilitate subsequent text characterization vectors. The method can directly delete the mood words or some unimportant words in the text, and filter the whole text by adopting the stop word list, thereby reducing the text length and improving the calculation efficiency of the model. The direct cut-off mode is adopted to limit the length of the input text. The processed sentence a has a length n and the sentence B has a length m, which is denoted as a= { WA1, WA2,..once., WAn }, b= { WB1, WB2,..once., WBn }, wherein WAi and WBi represent the i-th words of the sentence a and the sentence B, respectively. Because named entities in sentence pairs need to be extracted, the training set marks the entities by using a BIO method, B (begin) indicates that the character is at the beginning of an entity, I (inside) indicates that the character is at the internal position of the entity, and O (outside) indicates non-entity characters outside the entity that are not concerned. For public opinion texts, people names (PER), place names (GEO) and Organizations (ORG) in the text are important to pay attention to, so the physical labels of the training set are 7 types of labels, namely B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG and O. A specific label format is shown in fig. 3. Before the sentence pair is sent into the twin neural network, it is necessary to add [ CLS ] to the head of the sentence]The identity of the tag is used to determine,obtainingAnd. Will beAndsending into BERT for fine tuning, and introducing context information for characters at each position in sentence through BERT layer coding to obtain part-of-speech state for representing vectorAnd。the encoding vector representing the i-th word to which sentence a corresponds,the text represents the code vector of the i-th word corresponding to sentence B. The output of all BERTs will be input to the CRF layer. The CRF has two types of feature functions, one for the correspondence between the observed sequence and the state (e.g., "i" are generally nouns) and one for the relationship between states (e.g., "verbs" are generally followed by "nouns"). In the BERT+CRF model, the output of the former type of characteristic function is replaced by the output of BERT, and the output of the latter type of characteristic function is a label transfer matrixLabel transfer matrixRepresenting the transition score between tags. Specifically, the token vector of the BERT layer outputFor a matrix, get eachCharacter(s)The corresponding tag score is distributed asThis matrix is referred to as the transmit matrix. For sentence A, its corresponding tagIs one strand. Sentence A has a length of n and a total of 7 types of tags, so that it is commonThe possible marking results are thatPossible species. 
For public opinion texts, person names (PER), place names (GEO), and organizations (ORG) are the important entities, so the entity labels of the training set are the 7 tag types B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG, and O; the number of tags may be smaller or larger depending on the specific application scenario, this being only the general case. For a character $W_{Ai}$, its label score distribution $P_i$ is a 7-dimensional vector, and the score of label $y_i$ is $P_{i,y_i}$, where $y_i$ is an integer label index. Adding up all $P_{i,y_i}$ gives the emission score of the character nodes. From the label transfer matrix $T$, the transition score $T_{y_{i-1},y_i}$ from $y_{i-1}$ to $y_i$ is obtained. Finally, summing all scores gives the score of each possible labeling result $y$ of sentence A as $s(A, y) = \sum_{i=1}^{n} P_{i,y_i} + \sum_{i=2}^{n} T_{y_{i-1},y_i}$. The probability of each labeling result is then obtained by softmax normalization, $P(y \mid A) = \frac{e^{s(A,y)}}{\sum_{y'} e^{s(A,y')}}$. Similarly, the probability of each labeling result of sentence B is $P(y \mid B) = \frac{e^{s(B,y)}}{\sum_{y'} e^{s(B,y')}}$.
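The path score above can be sketched as follows, where `emissions` plays the role of the emission matrix output by the BERT layer and `trans` the label transfer matrix; shapes and names are assumptions of this sketch, and the normalization over all $7^n$ paths (computed efficiently by the forward algorithm in CRF libraries) is omitted:

```python
# Sketch of the CRF path score s(A, y) = sum_i P[i, y_i] + sum_i T[y_{i-1}, y_i].
import torch

def path_score(emissions: torch.Tensor, trans: torch.Tensor, tags: torch.Tensor):
    """emissions: (n, 7) per-character label scores from the BERT layer;
    trans: (7, 7) label transfer matrix; tags: (n,) integer label indices."""
    score = emissions[torch.arange(len(tags)), tags].sum()   # emission part
    score += trans[tags[:-1], tags[1:]].sum()                # transition part
    return score
```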
The labeling result with the highest probability is taken as the entity tags of the characters: a character tagged B starts an entity, and all following characters tagged I are spliced onto it to form a word, the entity word. The character characterization vectors output by the BERT layer at the positions of the entity word's characters are extracted to obtain the entity vectors $E_A$ and $E_B$, and a cosine-based similarity measurement layer computes the distance feature between the two vectors as $\cos(E_A, E_B) = \frac{E_A \cdot E_B}{\lVert E_A \rVert \, \lVert E_B \rVert}$.
a first similarity characterization vector, which is a similarity feature matrix representing sentence pairs obtained by using a twin neural network (SEN), is further fused with interaction features of the sentence pairs obtained by BERT.
In an exemplary embodiment, the semantic interaction module, specifically a BERT-based semantic interaction module, uses the next sentence prediction task to learn sentence-relationship features between texts and comprises the coding layer, the pooling layer, and the normalization layer of the interaction module;
before the sentence pair is sent to the BERT, the encoding layer of the interactive module needs to add a [ CLS ] identifier to the head of the sentence and insert a [ SEP ] identifier between two sentences for segmentation.
The spliced sentence $T = \{[CLS], A, [SEP], B\}$ is fed into the BERT model for fine-tuning, and the output $H_T$ is the vectorized representation of the sentence pair. For example, for the sentence pair 好好学习 ("study hard") and 天天向上 ("make progress every day"), T = {[CLS], 好, 好, 学, 习, [SEP], 天, 天, 向, 上}.
In the pooling layer of the interaction module, the sentence vector $H_T$ obtained through BERT passes through pooling to extract the important features and reduce the dimensionality;
normalization layer of interaction module, sentence vectorThe output result after the layer normalization is the second sentence pair obtained by the interaction moduleThe similarity characterizes the vector.
As a specific example, the model is trained with a community question-answer dataset, a large-scale, high-quality question-answer dataset about social questions; each question has multiple feedback answers, and the feedback to the same question can be used as similar public opinion.
For the coding layer of the interaction module, before the sentence pair is fed into BERT, a [CLS] identifier must be added to the head of the sentence and a [SEP] identifier inserted between the two sentences as a separator. The spliced sentence $T$ is fed into the BERT model for fine-tuning, and the output $H_T$ is the vectorized representation of the sentence pair.
For the pooling layer of the interaction module, the sentence vector $H_T$ obtained through BERT passes through pooling to extract the important features. Average pooling is mainly used when all the information should contribute, for example to obtain global context or deep-layer semantic information in the network. Max pooling mainly reduces the influence of uninformative content, while also reducing the feature dimension and extracting stronger semantic features. To make the model more robust, the characterization vectors are processed with average pooling and max pooling together. The result of average pooling the sentence vector $H_T$ is $h_{avg}$ and the result of max pooling is $h_{max}$, where $h_{avg}$ is the vector of sentence T obtained after global average pooling and $h_{max}$ the vector of sentence T obtained after global max pooling. The average-pooled and max-pooled results are then spliced, that is, $h_{pool} = [h_{avg}; h_{max}]$.
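The pooling step can be sketched as follows (the shape convention is an assumption of this sketch):

```python
# Illustrative global average + max pooling with concatenation.
import torch

def pool_sentence_vector(h_t: torch.Tensor) -> torch.Tensor:
    """h_t: (seq_len, hidden) BERT output for the spliced sentence pair T."""
    h_avg = h_t.mean(dim=0)          # all positions contribute (global context)
    h_max = h_t.max(dim=0).values    # suppress uninformative positions
    return torch.cat([h_avg, h_max], dim=-1)   # h_pool, shape (2 * hidden,)
```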
For the normalization layer of the interaction module, the output of layer normalization applied to $h_{pool}$ is $v_2 = \mathrm{LayerNorm}(h_{pool})$, which serves as the second similarity characterization vector of the interaction module.
The system further comprises a matching module, which splices the first similarity characterization vector $v_1$ obtained by the twin neural network module with the second similarity characterization vector $v_2$ obtained by the BERT-based interaction module to obtain the final similarity characterization vector of sentences A and B, $v = [v_1; v_2]$. This vector not only expresses the differences between the entity words in the sentence pair but also, by incorporating the BERT model, captures the deep semantic interaction features of the sentence pair, yielding more accurate text similarity information. Finally, the result is obtained through the softmax classification function.
In an exemplary embodiment of the present invention, in the matching module, the softmax classification function is $P(y=j \mid x) = \frac{e^{W_j^{\top} x}}{\sum_{i=1}^{k} e^{W_i^{\top} x}}$, that is, the probability that the sample vector x belongs to the j-th class, where W is a weight matrix and k is the number of classes.
The final similarity characterization vector $v = [v_1; v_2]$ is input into the softmax function, where $v_1$ is the output of the twin neural network module, $v_2$ is the output of the interaction module, and $v$ plays the role of x above. The resulting matching probability lies in the interval [0, 1]; assuming the threshold for the text similarity of sentences A and B is set to 0.5, the two texts A and B are considered matched when the probability exceeds 0.5, and unmatched otherwise.
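A minimal sketch of the fusion and matching decision, assuming a two-class linear layer ahead of the softmax; dimensions and names are illustrative, and only the 0.5 threshold follows the text:

```python
# Illustrative fusion of the two similarity vectors and softmax matching.
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    def __init__(self, dim_v1: int, dim_v2: int):
        super().__init__()
        self.proj = nn.Linear(dim_v1 + dim_v2, 2)   # matched / not matched

    def forward(self, v1: torch.Tensor, v2: torch.Tensor) -> bool:
        v = torch.cat([v1, v2], dim=-1)             # final similarity vector
        p = torch.softmax(self.proj(v), dim=-1)[1]  # P(matched), in [0, 1]
        return bool(p > 0.5)                        # threshold from the text
```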
Example two
The embodiment provides a public opinion text matching method based on a twin structure by using the public opinion text matching system based on the twin structure in the first embodiment, which comprises the following steps:
constructing a coding layer of a twin neural network, extracting named entity information in sentence pairs by utilizing a twin neural network module, and carrying out similarity calculation on the extracted named entities to obtain a first similarity characterization vector among the named entities;
the method comprises the steps of constructing a coding layer of a twin neural network, specifically constructing the coding layer of the twin neural network by using a BERT+CRF model (method), and constructing a coupling three-layer framework which is built by two identical or similar neural networks and is respectively an input layer, a feature extraction layer and a similarity measurement layer, wherein the input layer inputs sentence pairs to be matched, the feature extraction layer embeds input sentence pair samples into a high latitude space to obtain characterization vectors of the sentence pairs and the two samples, and the similarity measurement layer carries out similarity calculation on the extracted characterization vectors of the two samples through a mathematical formula to obtain a first similarity characterization vector of the sentence pairs.
Specifically, constructing the coding layer of the twin neural network with the BERT+CRF model (method), extracting the named entity information in sentence pairs, and performing similarity calculation on the extracted named entities to obtain the first similarity characterization vector between the named entities further comprises:
the masked language model task of the BERT layer is used to obtain word-level text features of the input sentences, and the feature vectors output by the BERT layer are then input to the CRF layer;
the method comprises the steps of adopting a BERT layer context prediction task to judge whether an A sentence and a B sentence of an input sentence pair are related up and down, so that a model learns the relation between two texts, and the problem of a sentence level is solved; inputting the feature vector output by the BERT layer to the CRF layer;
the CRF layer corrects the output of the BERT layer by learning the transition probability among the tags in the data set;
the training set (the sentence pairs) labels entities with the BIO method: B (begin) indicates that a character is at the beginning of an entity, I (inside) indicates that a character is inside an entity, and O (outside) marks non-entity characters of no concern; for public opinion texts, person names (PER), place names (GEO), and organizations (ORG) are the important entities, so 7 tag types, namely B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG, and O, are used as the entity labels of the training set;
before a sentence pair is fed into the twin neural network, a [CLS] identifier must be added to the head of each sentence, giving the sentence vectors $T_A$ and $T_B$ for the A and B sentences of the pair; $T_A$ and $T_B$ are fed into BERT for fine-tuning, and through the BERT-layer encoding, context information is introduced for the character at each position in the sentence, yielding the characterization vectors; the full BERT output is used as the input to the CRF layer.
The second similarity characterization vector of the sentence pairs in terms of semantics is obtained by utilizing a semantic interaction module, and the method specifically comprises the following steps:
specifically, based on BERT, the next sentence prediction task is used to learn sentence-relationship features between texts;
before the sentence pair is fed into BERT, a [CLS] identifier must be added to the head of the sentence and a [SEP] identifier inserted between the two sentences as a separator; the spliced sentence $T = \{[CLS], A, [SEP], B\}$ is fed into the BERT model for fine-tuning, and the output $H_T$ is the vectorized representation of the sentence pair;
the sentence vector $H_T$ obtained through BERT passes through a pooling layer to extract the important features and reduce the dimensionality;
the output of layer normalization applied to the pooled sentence vector is the second similarity characterization vector of the sentence pair obtained by the interaction module.
The first similarity characterization vector and the second similarity characterization vector are spliced by the fusion module to obtain the final similarity characterization vector of the sentence pair;
the final similarity characterization vector then yields the text matching result through the softmax classification function. The matching module splices the first similarity characterization vector obtained by the twin neural network module with the second similarity characterization vector obtained by the BERT-based interaction module to obtain the final similarity characterization vector $v = [v_1; v_2]$ of sentences A and B. This vector not only expresses the differences between the entity words in the sentence pair but also, by incorporating the BERT model, captures the deep semantic interaction features of the sentence pair, yielding more accurate text similarity information. Finally, the result is obtained through the softmax classification function.
In an exemplary embodiment of the present invention, in the matching module, the softmax classification function is $P(y=j \mid x) = \frac{e^{W_j^{\top} x}}{\sum_{i=1}^{k} e^{W_i^{\top} x}}$, that is, the probability that the sample vector x belongs to the j-th class, where W is a weight matrix and k is the number of classes.
The final similarity characterization vector $v = [v_1; v_2]$ is input into the softmax function, where $v_1$ is the output of the twin neural network module, $v_2$ is the output of the interaction module, and $v$ plays the role of x above. The resulting matching probability lies in the interval [0, 1]; assuming the threshold for the text similarity of sentences A and B is set to 0.5, the two texts A and B are considered matched when the probability exceeds 0.5, and unmatched otherwise.
To further demonstrate the technical effect of the invention, the twin-structure-based public opinion text matching method was applied to the STS-B semantic similarity dataset. Each item in the dataset comprises a sentence pair and a similarity score from 0 to 5; the higher the score, the more similar the sentence pair, with a score of 0 meaning the two sentences are semantically dissimilar. The dataset is divided into a training set, a validation set, and a test set, with 5231 items in the training set, 1458 in the validation set, and 1361 in the test set.
In addition, for a more intuitive comparison, several mainstream models for text matching tasks, Siamese-CNN, Siamese-LSTM, ABCNN, and BERT, were used in comparison experiments. The experimental results of the different models on the STS-B dataset are as follows:
model name | Model accuracy |
Siamese-CNN | 60.21 |
Siamese-LSTM | 64.52 |
ABCNN | 66.80 |
BERT | 75.52 |
The method provided by the invention | 83.96 |
The experimental results show that applying the twin neural network structure to the semantic similarity field effectively improves model performance. The method judges similarity from the semantic features of the two texts, sentence A and sentence B, and uses their entity features to judge whether the texts describe the same person, thing, or phenomenon. This makes the similarity judgment of text data more accurate, improves the matching accuracy of the public opinion text system, reduces the large amount of manpower and time spent on manual judgment, and improves the efficiency of public opinion text analysis.
Example III
In another aspect, the present invention further provides a computer readable storage medium, where the computer readable storage medium includes a stored program, where the program executes the above-mentioned method for matching public opinion text based on a twinning structure.
Example IV
The invention also provides an electronic device comprising a memory and a processor, the memory having stored therein a computer program, the processor being arranged to perform the method of twin structure based public opinion text matching by the computer program.
Claims (4)
1. A public opinion text matching system based on a twin structure is characterized by comprising
Twin neural network module: the method comprises the steps of constructing a coding layer of a twin neural network, extracting named entity information in sentence pairs, and carrying out similarity calculation on the extracted named entities to obtain a first similarity characterization vector between the named entities;
semantic interaction module: the method comprises the steps of obtaining a second similarity characterization vector of sentence pairs in terms of semantics, wherein the second similarity vector represents text semantic similarity characteristics;
the semantic interaction module comprises an encoding layer of the interaction module, a pooling layer of the interaction module and a normalization layer of the interaction module, wherein the encoding layer of the interaction module needs to add [ CLS ] to the head of a sentence before sending the sentence pair into the BERT]Identifier and insert [ SEP ] between two sentences]Splitting the identifier, and splitting the spliced sentence t= {,[SEP],/>Sending the character information into BERT model for fine tuning, introducing context information into the characters at each position in the sentence through BERT layer coding to obtain part-of-speech state for representing vector, and outputting ++>={-vectorized expression of sentence pairs; the pooling layer of the interactive module, sentence vector +.>Extracting important features through a pooling layer to reduce the dimension; normalization layer of the interaction module, sentence vector->The output result after the layer normalization is a second similarity characterization vector of the sentence pairs obtained by the interaction module;
the twin neural network module is specifically configured to construct the coding layer of the twin neural network by a BERT+CRF method; the twin neural network module comprises a coupled three-layer architecture built from two identical neural networks, each neural network comprising an input layer and a feature extraction layer, with the two identical networks sharing a similarity measurement layer; the input layer of each network receives one sentence of the sentence pair to be matched, the feature extraction layer embeds the input sentence into a high-dimensional space to obtain a vector representation of the sentence, and the similarity measurement layer performs similarity calculation on the extracted sentence vector representations through a mathematical formula to obtain the first similarity characterization vector of the sentence pair; before sentences A and B of the sentence pair are sent into the twin neural network, a [CLS] identifier needs to be added to the head of each sentence to obtain the A sentence vector and the B sentence vector; the two sentence vectors are sent into BERT for fine-tuning, context information is introduced into the characters at each position of the sentence through BERT-layer encoding so as to obtain part-of-speech states for the characterization vectors, and all the BERT outputs serve as the input of the CRF layer;
the training set of the twin neural network module, namely sentence pairs, marks the entity by adopting a BIO method, wherein B represents that a character is at the beginning of one entity, I represents that the character is at the internal position of the entity, and O represents that non-entity characters outside the entity are not focused; the public opinion text needs to pay attention to the name PER, the place name GEO and the organization ORG in the text, so that 7 types of labels, namely B-PER, I-PER, B-GEO, I-GEO, B-ORG, I-ORG and O, are used as the entity labels of the training set; b is the shorthand of begin, I is the shorthand of inside, O is the shorthand of outside;
a fusion module: configured to splice the first similarity characterization vector and the second similarity characterization vector to obtain the final similarity characterization vector of the sentence pair;
and a matching module: configured to obtain a text matching result from the final similarity characterization vector through a Softmax classification function.
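As a purely illustrative rendering of the BIO scheme named in claim 1, the snippet below lists the 7-label tag set and one invented character-level annotation; the sentence and its labels do not come from the patent's training data.

```python
# The 7-label BIO tag set named in claim 1.
LABELS = ["B-PER", "I-PER", "B-GEO", "I-GEO", "B-ORG", "I-ORG", "O"]

# Invented character-level example: 张三在武汉大学演讲
# ("Zhang San gave a speech at Wuhan University")
chars = list("张三在武汉大学演讲")
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
assert len(chars) == len(tags)
for char, tag in zip(chars, tags):
    print(char, tag)
```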
2. A public opinion text matching method based on a twin structure, applied to the public opinion text matching system based on a twin structure according to claim 1, characterized by comprising the following steps:
constructing the coding layer of the twin neural network, extracting named entity information in sentence pairs, and carrying out similarity calculation on the extracted named entities to obtain a first similarity characterization vector between the named entities;
acquiring a second similarity characterization vector of the sentence pair in terms of semantics through the semantic interaction module, the second similarity characterization vector representing text semantic similarity features;
splicing the first similarity characterization vector and the second similarity characterization vector to obtain the final similarity characterization vector of the sentence pair;
and obtaining a text matching result from the final similarity characterization vector through a Softmax classification function.
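Mapped onto the hypothetical sketch given earlier (after the experimental results), the four steps of this claim would correspond to roughly the following usage; the names are the same assumed ones as in that sketch, not identifiers from the patent.

```python
# Steps 1-2: encode A and B separately (twin branch) and jointly (interaction);
# step 3: splice the two similarity vectors; step 4: Softmax classification.
probs = model(enc_a, enc_b, enc_pair)        # reuses objects from the sketch above
is_match = probs.argmax(dim=-1).item() == 1  # assuming index 1 means "match"
print("match" if is_match else "no match")
```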
3. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when run, performs the twin-structure-based public opinion text matching method according to claim 2.
4. An electronic device comprising a memory and a processor, characterized in that the memory stores a computer program and the processor is arranged to execute the twin-structure-based public opinion text matching method according to claim 2 by means of the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310761055.3A CN116522165B (en) | 2023-06-27 | 2023-06-27 | Public opinion text matching system and method based on twin structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116522165A (en) | 2023-08-01 |
CN116522165B (en) | 2024-04-02 |
Family
ID=87408580
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310761055.3A Active CN116522165B (en) | 2023-06-27 | 2023-06-27 | Public opinion text matching system and method based on twin structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116522165B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117194614B (en) * | 2023-11-02 | 2024-01-30 | 北京中电普华信息技术有限公司 | Text difference recognition method, device and computer readable medium |
CN118760908B (en) * | 2024-09-05 | 2025-02-11 | 浙商证券股份有限公司 | Financial public opinion similarity probability prediction method, system and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9965459B2 (en) * | 2014-08-07 | 2018-05-08 | Accenture Global Services Limited | Providing contextual information associated with a source document using information from external reference documents |
US20220198146A1 (en) * | 2020-12-17 | 2022-06-23 | Jpmorgan Chase Bank, N.A. | System and method for end-to-end neural entity linking |
2023-06-27 CN CN202310761055.3A patent/CN116522165B/en active Active
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111259127A (en) * | 2020-01-15 | 2020-06-09 | 浙江大学 | Long text answer selection method based on transfer learning sentence vector |
CN111723575A (en) * | 2020-06-12 | 2020-09-29 | 杭州未名信科科技有限公司 | Method, device, electronic equipment and medium for recognizing text |
CN113673225A (en) * | 2021-08-20 | 2021-11-19 | 中国人民解放军国防科技大学 | Method and device for judging similarity of Chinese sentences, computer equipment and storage medium |
CN114386421A (en) * | 2022-01-13 | 2022-04-22 | 平安科技(深圳)有限公司 | Similar news detection method and device, computer equipment and storage medium |
CN114329225A (en) * | 2022-01-24 | 2022-04-12 | 平安国际智慧城市科技股份有限公司 | Search method, device, equipment and storage medium based on search statement |
CN114579731A (en) * | 2022-02-28 | 2022-06-03 | 江苏至信信用评估咨询有限公司 | Network information topic detection method, system and device based on multi-feature fusion |
CN114896397A (en) * | 2022-04-29 | 2022-08-12 | 中航华东光电(上海)有限公司 | Empty pipe instruction repeating inspection method based on BERT-CRF word vector model |
CN115292447A (en) * | 2022-07-14 | 2022-11-04 | 昆明理工大学 | News matching method fusing theme and entity knowledge |
CN115408494A (en) * | 2022-07-25 | 2022-11-29 | 中国科学院深圳先进技术研究院 | A Text Matching Method Fused with Multi-Head Attention Alignment |
CN115374778A (en) * | 2022-08-08 | 2022-11-22 | 北京工商大学 | Cosmetic public opinion text entity relation extraction method based on deep learning |
CN115687939A (en) * | 2022-09-02 | 2023-02-03 | 重庆大学 | Mask text matching method and medium based on multi-task learning |
CN115630632A (en) * | 2022-09-29 | 2023-01-20 | 北京蜜度信息技术有限公司 | Method, system, medium and terminal for correcting personal name in specific field based on context semantics |
CN115470871A (en) * | 2022-11-02 | 2022-12-13 | 江苏鸿程大数据技术与应用研究院有限公司 | Policy matching method and system based on named entity recognition and relation extraction model |
CN115712713A (en) * | 2022-11-23 | 2023-02-24 | 桂林电子科技大学 | Text matching method, device and system and storage medium |
CN115759104A (en) * | 2023-01-09 | 2023-03-07 | 山东大学 | Financial field public opinion analysis method and system based on entity recognition |
CN116304745A (en) * | 2023-03-27 | 2023-06-23 | 济南大学 | Text topic matching method and system based on deep semantic information |
Non-Patent Citations (4)
Title |
---|
A Graph-based Text Similarity Measure That Employs Named Entity Information; Leonidas Tsekouras et al.; Proceedings of Recent Advances in Natural Language Processing; pp. 765-771 *
Chinese Entity Recognition Based on the BERT-BiLSTM-CRF Model; Xie Teng; Yang Jun'an; Liu Hui; Computer Systems & Applications; 2020-07-15 (No. 07); pp. 52-59 *
A BERTCA-based Model for Computing Semantic Relevance between News Entities and Body Text; Xiang Junyi et al.; Proceedings of the 19th Chinese National Conference on Computational Linguistics; pp. 288-300 *
An Entity Recognition Method for Judicial Documents Based on the BERT Model; Chen Jian; He Tao; Wen Yingyou; Ma Lintao; Journal of Northeastern University (Natural Science); 2020-10-15 (No. 10); pp. 16-21 *
Also Published As
Publication number | Publication date |
---|---|
CN116522165A (en) | 2023-08-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113609859B (en) | A Chinese named entity recognition method for special equipment based on pre-training model | |
CN110298037B (en) | Text Recognition Approach Based on Convolutional Neural Network Matching with Enhanced Attention Mechanism | |
CN114064918B (en) | Multi-modal event knowledge graph construction method | |
CN110765775B (en) | A Domain Adaptation Method for Named Entity Recognition Fusing Semantics and Label Differences | |
CN109992782B (en) | Legal document named entity identification method and device and computer equipment | |
CN114757182B (en) | A BERT short text sentiment analysis method with improved training method | |
CN110609891A (en) | A Visual Dialogue Generation Method Based on Context-Aware Graph Neural Network | |
CN109635288B (en) | Resume extraction method based on deep neural network | |
CN110134771A (en) | An Implementation Method of Fusion Network Question Answering System Based on Multi-Attention Mechanism | |
CN117171333B (en) | Electric power file question-answering type intelligent retrieval method and system | |
CN110929030A (en) | A joint training method for text summarization and sentiment classification | |
CN110737758A (en) | Method and apparatus for generating a model | |
CN116522165B (en) | Public opinion text matching system and method based on twin structure | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN111274804A (en) | Case information extraction method based on named entity recognition | |
CN111914556B (en) | Emotion guiding method and system based on emotion semantic transfer pattern | |
CN114357166B (en) | Text classification method based on deep learning | |
CN112069312B (en) | A text classification method and electronic device based on entity recognition | |
CN111159405B (en) | Irony detection method based on background knowledge | |
CN114444507A (en) | Context-parameter Chinese entity prediction method based on water environment knowledge graph-enhanced relationship | |
CN111221964B (en) | A Text Generation Method Guided by Evolutionary Trends of Different Faceted Viewpoints | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN115858750A (en) | Power grid technical standard intelligent question-answering method and system based on natural language processing | |
CN115713072A (en) | Relation category inference system and method based on prompt learning and context awareness | |
CN108536781B (en) | Social network emotion focus mining method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | ||
Denomination of invention: A Public Opinion Text Matching System and Method Based on Twin Structure
Granted publication date: 2024-04-02
Pledgee: Guanggu Branch of Wuhan Rural Commercial Bank Co., Ltd.
Pledgor: Wuhan AGCO Software Technology Co., Ltd.
Registration number: Y2024980019034