CN119398054A - Chinese entity recognition model and method based on vocabulary enhancement and character external information - Google Patents
- Publication number
- CN119398054A (application CN202411501310.1A)
- Authority
- CN
- China
- Prior art keywords
- character
- embedding
- matching
- chinese
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a Chinese entity recognition model and method based on vocabulary enhancement and character-external information, which solve the problems that existing methods do not fully utilize information external to Chinese characters and determine static weights for matched words solely by word frequency. The Chinese entity recognition model sequentially connects a Chinese character information extraction part, a Chinese character feature extraction part containing dynamic matching-word features, a dual-path encoder part for enhanced character embedding and component embedding, and a decoding output part. The method comprises data cleaning, Chinese character feature extraction containing dynamic matching-word features, and dual-path encoding of enhanced character embedding and component embedding followed by decoding output. The invention uses a dual-path encoding structure to integrate the vocabulary, pinyin and component information of Chinese characters into the model, and applies dynamic weights when weighting the matched-word embeddings, making the weighting more balanced and reasonable, preventing the weight distribution from being biased toward high-frequency words, and improving generalization capability and recognition precision. The invention is used for extracting Chinese entities from unstructured input text.
Description
Technical Field
The invention belongs to the field of artificial intelligence and natural language processing, mainly relates to Chinese named entity recognition, and in particular to a Chinese entity recognition model and method based on vocabulary enhancement and character-external information, which are used for extracting Chinese entities such as person names, place names and organization names from unstructured input text, with wide application in fields such as intelligent customer service, intelligent recommendation and intelligent question answering.
Background
Named Entity Recognition (NER) aims to detect entities in sentences and identify their predefined types, such as names of people, places and institutions. Named entity recognition is the basis of many downstream Natural Language Processing (NLP) tasks, such as knowledge graphs, information retrieval, question-answering systems and machine translation. Chinese, one of the most widely used languages in the world, plays an increasingly important role in international communication, business and cultural exchange. The popularity of the internet has produced a vast amount of Chinese text data, including social media, news stories and academic articles. Chinese named entity recognition can extract key information from such data and help users quickly find the content they need, thereby improving the efficiency of information retrieval and processing.
Early named entity recognition methods were based primarily on manually written rules and dictionaries for matching named entities in text. However, such rules tend to be strongly tied to the domain and language style, and building rules and dictionaries relies on expert knowledge and is extremely time-consuming. Furthermore, such systems are poorly portable and inconvenient to transfer across scenarios.
Statistical methods convert the NER task into a sequence labeling task and train on manually annotated corpora. The cost of data annotation is far lower than that of rule design, and statistical methods are generic, requiring few manually designed rules; they gradually became the dominant approach before the rise of deep learning. Common statistical models include conditional random fields, hidden Markov models and maximum entropy models. However, statistical methods rely on hand-crafted features, such as part of speech and glyph features, which require the knowledge and experience of domain experts, and incorrect feature design can severely degrade model performance.
In recent years, deep learning has made significant breakthroughs in the NLP field. Applied to named entity recognition, deep learning can learn complex hidden representations without elaborate feature engineering or rich domain knowledge. In addition, deep-learning-based named entity recognition generalizes better and is more broadly applicable. Therefore, compared with traditional rule-based and statistics-based methods, deep-learning-based named entity recognition is more widely used and achieves higher recognition accuracy. It has been studied from many angles, but mostly for English. Chinese named entity recognition is more difficult than English named entity recognition, mainly in two respects. First, an obvious difference between Chinese and English is that the basic unit of Chinese is the character whereas English uses words, and Chinese has no explicit separator like the space in English. Second, many entities in English carry an obvious surface feature, namely capitalization. As a result, boundary recognition is the harder part of Chinese entity recognition, while English entity recognition focuses mainly on identifying the entity type.
Since there are no explicit separators in Chinese, early studies mostly performed word segmentation before Chinese NER. However, Chinese word segmentation errors may lead to false identification of entity boundaries, degrading model performance. Recently proposed NER methods are character-based, which eliminates word segmentation errors but loses word information. In Chinese NER, word information and lexical boundary information are critical. Some recent vocabulary-enhancement-based NER methods use an external dictionary to supplement the missing vocabulary information, enabling NER models to capture entity information more accurately by integrating the dictionary's vocabulary information into the character-level representation layer. Zhang Y et al., in the paper "Chinese NER using lattice LSTM" (Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018: 1554-1564), proposed Lattice LSTM, the first method to obtain potential word information from an external dictionary and incorporate it into a character-based Chinese named entity model; it introduces word information without relying on word segmentation results, avoiding error propagation caused by segmentation. R. Ma et al., in the paper "Simplify the Usage of Lexicon in Chinese NER" (Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 5951-5960), proposed SoftLexicon, which integrates lexical information into the character representation while avoiding complex model structures; by injecting word information into the model through an external dictionary, it achieved promising results on several Chinese NER datasets. However, after integrating dictionary information into the character representation, most existing methods ignore additional information of Chinese characters (such as strokes and pinyin), which has been shown to benefit the performance of Chinese NER models. After introducing such external information, how to integrate it effectively into the model is also a key issue. In addition, existing vocabulary-enhancement-based methods directly use the word frequency of matched words as the weight, which does not capture well the different correlations between characters and words in different contexts.
The prior art compensates for the vocabulary and word boundary information missing from character-based Chinese named entity methods by integrating external dictionary information. However, such methods often ignore additional information of Chinese characters (such as strokes and pinyin), which can further enrich the feature representation and is very helpful for improving model performance. In addition, most existing methods use word frequency as the weight and cannot sufficiently capture the correlation between characters and words in different contexts.
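As an illustration of the weakness described above, the following minimal sketch (not part of the patent; the word list and corpus counts are hypothetical) shows a frequency-based *static* weighting scheme: each matched word's weight is simply its relative corpus frequency, fixed regardless of the surrounding sentence.

```python
# Illustrative sketch of static, frequency-based weighting of matched words.
# Words, counts and the normalization choice are hypothetical examples.
def static_word_weights(freqs):
    """Normalize raw word frequencies into fixed, context-independent weights."""
    total = sum(freqs.values())
    return {w: f / total for w, f in freqs.items()}

freqs = {"南京": 50, "南京市": 5, "市长": 20}   # hypothetical corpus counts
weights = static_word_weights(freqs)
# The frequent word dominates no matter what sentence the character appears in,
# which is exactly the bias toward high-frequency words criticized above.
```

Because these weights never see the context, rare but contextually correct matches such as "南京市" can never outweigh frequent ones; the dynamic attention weighting introduced below is designed to remove this limitation.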
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a Chinese named entity recognition model and method based on vocabulary enhancement and character-external information, with higher recognition precision and stronger generalization capability.
The invention relates to a Chinese entity recognition model based on vocabulary enhancement and character-external information, characterized in that a Chinese character information extraction part, a Chinese character feature extraction part containing dynamic matching-word features, a dual-path encoder part for enhanced character embedding and component embedding, and a decoding output part are sequentially connected. The Chinese character information extraction part is provided with a character extraction module and, on its basis, a matching-word extraction module, a pinyin extraction module and a component extraction module. The Chinese character feature extraction part containing dynamic matching-word features comprises a character feature extraction module, a dynamic matching-word feature extraction module, a pinyin feature extraction module and a component feature extraction module. The character feature extraction module takes the output of the character extraction module as input to obtain character embeddings. The dynamic matching-word feature extraction module takes the outputs of the matching-word extraction module and of the character extraction module as joint input, and performs a weighted summation over the matched-word embeddings based on an attention mechanism to obtain the dynamic matching-word feature embedding. The pinyin feature extraction module takes the output of the pinyin extraction module as input and extracts the pinyin feature embedding. The component feature extraction module takes the output of the component extraction module as input and extracts the component feature embedding. The dual-path encoder part for enhanced character embedding and component embedding comprises two branches, which respectively encode the context features of the component embeddings and the context features of the enhanced character embeddings that incorporate the dynamic matching-word features; the component context features and the enhanced character context features are then concatenated as the dual-path encoder output. The decoding output part decodes the encoder output to obtain the optimal tag sequence.
The invention also provides a Chinese entity recognition method based on vocabulary enhancement and character-external information, implemented on the Chinese entity recognition model based on vocabulary enhancement and character-external information according to any of claims 1-2, and characterized by comprising the following steps:
(1) Data cleaning: receive the original Chinese input text, divide over-long text into several short sentences, remove illegal characters and invalid Unicode characters from the short sentences, and complete data cleaning to obtain cleaned sentences;
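Step (1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the sentence-final punctuation set, the character whitelist and the maximum sentence length are all assumptions.

```python
# Sketch of data cleaning: split over-long text into short sentences at
# Chinese sentence-final punctuation, then strip characters outside an
# assumed whitelist (CJK, ASCII letters/digits, common punctuation).
import re

def clean(text, max_len=128):
    sentences = re.split(r"(?<=[。！？；])", text)  # keep the delimiter
    cleaned = []
    for s in sentences:
        # remove illegal / invalid characters (whitelist is an assumption)
        s = re.sub(r"[^\u4e00-\u9fff0-9A-Za-z。！？；，、：（）()]", "", s)
        if s:
            # split any still over-long sentence into max_len chunks
            cleaned.extend(s[i:i + max_len] for i in range(0, len(s), max_len))
    return cleaned
```

For example, `clean("张三在南京。李四在北京！\x00乱码\ufffe测试？")` drops the control and invalid characters and returns three short sentences.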
(2) Extract multiple features from the cleaned input sentence, including character features, dynamic matching-word features, pinyin features and component features. For character feature extraction, the cleaned sentence is fed into a BERT model to obtain the feature embedding of each character. For dynamic matching-word feature extraction, each character is first string-matched against an external dictionary to obtain matched words; dynamic weights of the matched words are computed from the matched-word embeddings and the character feature embeddings, and the matched-word embeddings are summed with these weights to obtain the final dynamic matching-word feature embedding. For pinyin feature extraction, the pinyin of each Chinese character in the input sentence is obtained and a convolutional neural network extracts the pinyin feature embedding. The character feature embedding, the dynamic matching-word feature embedding and the pinyin feature embedding are concatenated in order to obtain the enhanced character embedding;
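The dictionary string-matching in step (2) can be sketched as below. This is an assumed implementation of the matched-word lookup only: for each character position, collect every dictionary word containing it, grouped by the character's role in the word (B = begins the word, M = inside it, E = ends it, S = single-character word, as defined in embodiment 2). The tiny lexicon is hypothetical.

```python
# Sketch of matched-word lookup with B/M/E/S grouping per character position.
def match_words(sentence, lexicon, max_word_len=4):
    groups = [{"B": [], "M": [], "E": [], "S": []} for _ in sentence]
    n = len(sentence)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_word_len) + 1):
            w = sentence[i:j]
            if w not in lexicon:
                continue
            if len(w) == 1:
                groups[i]["S"].append(w)        # character is the word itself
            else:
                groups[i]["B"].append(w)        # first character of w
                groups[j - 1]["E"].append(w)    # last character of w
                for k in range(i + 1, j - 1):
                    groups[k]["M"].append(w)    # middle character of w
    return groups
```

On the classic example "南京市长江大桥" with a lexicon containing 南京, 南京市, 市长, 长江, 大桥 and 长江大桥, the character 市 both ends 南京市 and begins 市长, showing why keeping all four groups preserves segmentation ambiguity for the model to resolve.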
(3) Dual-path encoding of enhanced character embedding and component embedding: two bidirectional gated recurrent units (Bi-GRU) extract the context features of the enhanced character embeddings and of the Chinese character component embeddings. The character embedding, dynamic matching-word feature embedding and pinyin embedding are concatenated to obtain the enhanced character embedding, which is fed into one Bi-GRU to extract its context representation; the Chinese character component embedding is fed into the other Bi-GRU to extract the component context representation. The component context representation and the enhanced character context representation are concatenated, completing the dual-path encoding of enhanced character embedding and component embedding;
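The dual-path encoding in step (3) can be sketched with a minimal numpy Bi-GRU: one Bi-GRU runs over the enhanced character embeddings, another over the component embeddings, and their outputs are concatenated per position. All dimensions and the random weights are placeholders, not the patent's trained parameters.

```python
# Minimal numpy sketch of dual-path Bi-GRU encoding (illustrative only).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_pass(x, Wz, Wr, Wh, h0):
    """Single-direction GRU over x of shape (T, d_in); weights act on [x; h]."""
    h, outs = h0, []
    for t in range(x.shape[0]):
        xh = np.concatenate([x[t], h])
        z = sigmoid(Wz @ xh)                                  # update gate
        r = sigmoid(Wr @ xh)                                  # reset gate
        h_tilde = np.tanh(Wh @ np.concatenate([x[t], r * h])) # candidate state
        h = (1 - z) * h + z * h_tilde
        outs.append(h)
    return np.stack(outs)

def bigru(x, params_f, params_b, d_h):
    fwd = gru_pass(x, *params_f, np.zeros(d_h))
    bwd = gru_pass(x[::-1], *params_b, np.zeros(d_h))[::-1]
    return np.concatenate([fwd, bwd], axis=-1)                # (T, 2*d_h)

rng = np.random.default_rng(0)
T, d_char, d_rad, d_h = 5, 8, 4, 6                            # placeholder sizes
mk = lambda d_in: [rng.normal(0, 0.1, (d_h, d_in + d_h)) for _ in range(3)]
char_ctx = bigru(rng.normal(size=(T, d_char)), mk(d_char), mk(d_char), d_h)
rad_ctx = bigru(rng.normal(size=(T, d_rad)), mk(d_rad), mk(d_rad), d_h)
encoder_out = np.concatenate([char_ctx, rad_ctx], axis=-1)    # (T, 4*d_h)
```

Note how the two branches carry independent weight sets (`mk(d_char)` vs. `mk(d_rad)`), reflecting the description that each branch's encoder can be tuned separately.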
(4) Decoding output: a linear-chain conditional random field (CRF) is selected as the decoder; the encoded context representation fusing Chinese character information, matched-word information, pinyin information and component information is fed into the CRF for decoding, and the CRF extracts the entities in the sentence based on the tag sequence, completing the decoding output.
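The CRF decoding in step (4) amounts to Viterbi search over emission scores and a tag-transition matrix. The sketch below uses a toy 3-tag scheme (B, I, O) with hand-picked scores; a trained model would supply both from learned parameters.

```python
# Viterbi decoding sketch for a linear-chain CRF (illustrative scores).
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, K) per-position tag scores; transitions[i, j] scores
    moving from tag i to tag j. Returns the highest-scoring tag path."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)   # best previous tag for each current tag
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):       # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

TAGS = ["B", "I", "O"]
trans = np.array([[ 0.0, 1.0, 0.0],     # B -> I encouraged
                  [ 0.0, 0.5, 0.0],
                  [ 0.5, -9.0, 0.5]])   # O -> I effectively forbidden
emit = np.array([[2.0, 0.0, 0.1],
                 [0.0, 2.0, 0.1],
                 [0.1, 0.0, 2.0]])
best = [TAGS[i] for i in viterbi(emit, trans)]
```

The transition matrix is what lets the CRF enforce tag-sequence constraints (e.g. an I tag never directly following O), which per-character classification alone cannot guarantee.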
The invention solves the problems that existing methods cannot fully utilize information external to Chinese characters and rely solely on word frequency to determine static weights for matched words.
The invention provides a dual-path encoding structure that effectively fuses word information, pinyin information and component information of Chinese characters into the model, greatly enriching the semantic representation of character-based Chinese named entity recognition and solving the problem that traditional methods cannot fully utilize information external to Chinese characters. In addition, when introducing vocabulary information, the invention adopts an attention mechanism and dynamically adjusts the weight of each matched word by computing the correlation between the characters and the matched word. Compared with existing methods that rely only on word frequency to determine static weights, the invention distributes matched-word weights more flexibly and reasonably and treats each matched word more evenly, avoiding weight distributions biased toward high-frequency words and improving the generalization capability and recognition accuracy of the model.
Compared with the prior art, the invention has the following advantages:
The invention integrates pinyin, component and vocabulary information into the character feature extraction process and adopts a new dual-path encoding fusion strategy: during context encoding, two parallel modules separately process the enhanced character representation and the Chinese character component embedding, which optimizes boundary recognition in Chinese named entity recognition and effectively improves the recognition accuracy of the model in complex text environments.
The invention adopts an attention mechanism to extract the dynamic matching-word features and dynamically adjusts the weight of each matched word by computing the correlation between the characters and the matched word. The Chinese entity recognition model can thus better capture the different correlations between characters and words in different contexts and treat each matched word more evenly and accurately, avoiding weight distributions biased toward high-frequency words and further improving the generalization capability and recognition accuracy of the model.
Drawings
FIG. 1 is a block diagram of a model of the present invention;
FIG. 2 is a diagram of a model structure of the present invention;
FIG. 3 is a flow chart of the present invention;
FIG. 4 is a flow chart of the present invention;
FIG. 5 is a diagram of the convolutional neural network structure used to extract the pinyin and component features.
Detailed Description
Example 1
Most existing vocabulary-enhancement-based Chinese named entity recognition methods ignore further external information of Chinese characters after integrating dictionary information into the character representation, so the accuracy and robustness of named entity recognition are insufficient: the entity recognition model cannot fully use the rich information of Chinese character structure and pronunciation and lacks an understanding of Chinese character structure. Existing vocabulary-enhancement-based methods also directly use the word frequency of matched words as the weight and do not capture well the different correlations between characters and words in different contexts, so important semantic information may be lost, affecting the overall understanding of the text. In view of these problems, the invention carries out analysis and research and proposes a Chinese entity recognition model and method based on vocabulary enhancement and character-external information.
Referring to fig. 1, fig. 1 is a block diagram of the model of the invention. The model sequentially connects a Chinese character information extraction part, a Chinese character feature extraction part containing dynamic matching-word features, a dual-path encoder part for enhanced character embedding and component embedding, and a decoding output part. The Chinese character information extraction part is provided with a character extraction module and, on its basis, a matching-word extraction module, a pinyin extraction module and a component extraction module. The Chinese character feature extraction part containing dynamic matching-word features comprises a character feature extraction module, a dynamic matching-word feature extraction module, a pinyin feature extraction module and a component feature extraction module; the character feature extraction module takes the output of the character extraction module as its input to obtain character embeddings. The dynamic matching-word feature extraction module takes the outputs of the matching-word extraction module and of the character extraction module as inputs, and performs a weighted summation over the matched-word embeddings based on the attention mechanism to obtain the final dynamic matching-word feature embedding.
Compared with the static weights adopted by most traditional Chinese entity recognition methods, the dynamic weights adopted in the invention better capture the different correlations of characters and words in different contexts and treat each matched word more evenly and accurately; they avoid weight distributions biased toward high-frequency words, thereby improving the generalization capability and recognition accuracy of the model. The pinyin feature extraction module takes the output of the pinyin extraction module as input and uses a convolutional neural network to extract the pinyin feature embedding; since polyphones are particularly common in Chinese, the pinyin feature embedding makes it easy to distinguish the different meanings of homographs. The component feature extraction module takes the output of the component extraction module as input and uses a convolutional neural network to extract the component feature embedding; Chinese characters are pictographic, and their structural information such as strokes and components carries rich semantics, so integrating component feature embeddings into the Chinese entity recognition model of the invention improves its semantic representation capability, strengthens its understanding of Chinese character structure, and improves performance.
The dual-path encoder part for enhanced character embedding and component embedding comprises two branches, which respectively encode the context features of the component embeddings and the context features of the enhanced character embeddings fused with the dynamic matching-word and pinyin features, helping the model understand the different types of input more deeply; the component context features and the enhanced character context features are then concatenated as the dual-path encoder output. In addition, the dual-path encoder allows the hyperparameters of each branch's encoder to be tuned independently, controlling the model's learning process more precisely and improving performance. The decoding output part decodes the encoder output to obtain the optimal tag sequence.
The model of the invention introduces pinyin and component information and uses a dual-path encoding structure to effectively integrate character, vocabulary, pinyin and component information, improving model precision. In addition, when introducing vocabulary information, an attention mechanism is adopted: by computing the correlation between characters and matched words, each matched word is treated more evenly and weight distributions biased toward high-frequency words are avoided, thereby improving the generalization capability of the model.
The invention provides an overall technical solution for Chinese entity recognition based on vocabulary enhancement and character-external information, addressing the problems that existing methods cannot fully utilize information external to Chinese characters and lack dynamic weight assignment when integrating external dictionary information, which results in insufficient generalization capability. The solution comprises a Chinese character information extraction part, a Chinese character feature extraction part containing dynamic matching-word features, a dual-path encoder part for enhanced character embedding and component embedding, and a decoding output part, thereby improving the semantic representation capability and recognition precision of the Chinese entity recognition model.
Example 2
The general composition of the Chinese entity recognition model based on vocabulary enhancement and character-external information is the same as in embodiment 1. In the dynamic matching-word feature extraction module, the final dynamic matching-word feature embedding is obtained by a weighted summation over the matched-word embeddings, where the dynamic attention weight α of each matched word is obtained by computing, through an attention mechanism, the correlation between the character embedding and each matched-word embedding:

α = softmax((xW_Q)(yW_K)^T / √d_k)

where x and y are the character embedding and the matched-word embedding, respectively, W_Q and W_K are learnable parameters, and √d_k is a scaling factor;
To preserve segmentation information, the invention divides the matched words into four groups B, M, E, S according to the relation between the character and the matched word: in group B the character is the first character of the matched word, in group M the character is a middle character of the matched word, in group E the character is the last character of the matched word, and in group S the character itself constitutes the matched word. The weighted embedding of each group is

v_i(G) = Σ_{ω∈G} α_ω e^w(ω),  G ∈ {B, M, E, S}

where α_ω is the attention weight of the matched word ω and e^w(ω) is the embedded representation of ω.
The four groups of weighted word embeddings are concatenated to obtain the final dynamic matching-word embedded representation v_i:

v_i = {v_i(B); v_i(M); v_i(E); v_i(S)}

where v_i(B), v_i(M), v_i(E), v_i(S) are the dynamically weighted word embeddings of the B, M, E, S groups of matched words, respectively.
The dynamic weights used in the invention combine, through the attention computation, the embedding of the character in the sentence with its correlation to the matched words, and the weights can be adjusted flexibly according to the specific context. The Chinese entity recognition model can therefore better capture the different correlations between characters and words in different contexts, improving recognition accuracy. In addition, the dynamic weights avoid weight distributions biased toward high-frequency words, improving the generalization capability of the model.
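The attention weighting of embodiment 2 can be sketched in numpy as follows: α = softmax((xW_Q)(yW_K)^T / √d_k) over one group's matched-word embeddings, then the weighted sum. All dimensions and the random matrices are placeholders standing in for the learned parameters W_Q and W_K.

```python
# Numpy sketch of the dynamic matching-word weighting (per B/M/E/S group).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtract max for numerical stability
    return e / e.sum()

def dynamic_word_feature(x, Y, W_Q, W_K):
    """x: (d,) character embedding; Y: (m, d_w) matched-word embeddings.
    Returns the attention weights alpha and the weighted-sum embedding."""
    d_k = W_Q.shape[1]
    scores = (x @ W_Q) @ (Y @ W_K).T / np.sqrt(d_k)   # (m,) scaled dot products
    alpha = softmax(scores)                            # dynamic weights
    return alpha, alpha @ Y                            # v_i(G), shape (d_w,)

rng = np.random.default_rng(1)
d, d_w, d_k, m = 8, 6, 4, 3                            # placeholder dimensions
x = rng.normal(size=d)
Y = rng.normal(size=(m, d_w))
alpha, v = dynamic_word_feature(x, Y, rng.normal(size=(d, d_k)),
                                rng.normal(size=(d_w, d_k)))
```

Because the scores depend on the character embedding x, the same matched word receives different weights in different sentences, unlike the fixed frequency-based weights of prior methods.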
Example 3
The invention also provides a Chinese entity recognition method based on vocabulary enhancement and character-external information, implemented on any of the above Chinese entity recognition models based on vocabulary enhancement and character-external information; the model is the same as in embodiments 1-2. Referring to fig. 3, fig. 3 is a flow chart of the invention, comprising the following steps:
(1) Data cleaning: receive the original Chinese input text, divide over-long text into several short sentences, remove illegal characters, invalid Unicode characters and the like from the short sentences, and finish data cleaning to obtain cleaned sentences.
The invention divides the long text into short sentences during data cleaning, which is helpful for the model to concentrate more on local context and improves the processing efficiency and accuracy. Illegal characters and Unicode characters interfere with the learning process of the model, resulting in a reduction in recognition accuracy, and thus also need to be removed.
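A minimal sketch of such a cleaning step, assuming a simple keep-list of characters and splitting at Chinese end-of-sentence punctuation (the exact character classes and the maximum sentence length are illustrative assumptions, not the patent's specification):

```python
import re

# Characters to keep: CJK ideographs, CJK punctuation, alphanumerics,
# common fullwidth punctuation. Everything else counts as "illegal".
_KEEP = re.compile(r"[^\u4e00-\u9fff\u3000-\u303fA-Za-z0-9，。！？；：、“”‘’（）\s]")
_SENT_END = re.compile(r"([。！？；])")

def clean_and_split(text, max_len=100):
    """Remove illegal characters, then split overlong text into short
    sentences at Chinese end-of-sentence punctuation."""
    text = _KEEP.sub("", text).strip()
    pieces = _SENT_END.split(text)
    # re.split keeps the captured delimiters as separate items; glue
    # each delimiter back onto the sentence it terminates
    sentences = []
    for i in range(0, len(pieces) - 1, 2):
        sent = (pieces[i] + pieces[i + 1]).strip()
        if sent:
            sentences.append(sent)
    if len(pieces) % 2 == 1 and pieces[-1].strip():
        sentences.append(pieces[-1].strip())
    # hard-split anything still longer than max_len
    out = []
    for s in sentences:
        out.extend(s[i:i + max_len] for i in range(0, len(s), max_len))
    return out
```

The fixed `max_len` stands in for the BERT input-length limit mentioned later in the description.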
(2) Extract multiple features of the cleaned input sentence, including character features, dynamic matching word features, pinyin features and component features. Character features are extracted by inputting the cleaned sentence into the BERT model to obtain a feature embedding for each character. For dynamic matching word features, each character is first string-matched against an external dictionary to obtain its matching words; the dynamic weight of each matching word is then calculated from the matching word embedding and the character feature embedding, and the matching word embeddings are weighted and summed to obtain the final dynamic matching word feature embedding. This dynamic weighting mechanism captures the relevance between characters and words in different contexts more effectively. Compared with the static weights adopted by most current Chinese entity recognition methods, the dynamic weights avoid an excessive bias toward high-frequency words and improve the generalization capability and recognition accuracy of the model. Polyphonic characters are particularly common in Chinese; pinyin feature embeddings make their different readings and meanings easy to distinguish, so integrating pinyin feature embeddings into the Chinese entity recognition model improves entity recognition on text containing polyphones. Pinyin feature extraction obtains the pinyin of each Chinese character in the input sentence and uses a convolutional neural network to extract the pinyin feature embedding. Finally, the character feature embedding, dynamic matching word feature embedding and pinyin feature embedding are spliced in sequence to obtain the enhanced character embedding.
Chinese characters are pictographic in origin, and their structural information, such as strokes and components, contains rich semantics; embedding the component features into the Chinese entity recognition model improves the model's semantic representation capability and thus its performance. When extracting component features, the invention first decomposes each Chinese character in the input sentence into its constituent components, then uses a convolutional neural network to extract the component feature embedding, completing the Chinese character feature extraction containing the dynamic matching word features.
(3) The invention uses two bidirectional gated recurrent units (Bi-GRU) to extract the context features of the enhanced character embedding and of the Chinese character component embedding, respectively. The character embedding, dynamic matching word feature embedding and pinyin embedding are spliced to obtain the enhanced character embedding, which is input into one Bi-GRU to extract its context representation. Using a separate Bi-GRU for each input type helps the model understand the different kinds of input more deeply; in addition, the two-way encoder allows the hyperparameters of each Bi-GRU encoder to be adjusted independently, controlling the learning process of the model more precisely and improving its performance. The Chinese character component embedding is input into the other Bi-GRU to extract the component context representation; the component context representation and the enhanced character context representation are then spliced, completing the two-way encoding of the enhanced character embedding and the component embedding.
Referring to fig. 2, fig. 2 is a block diagram of the Chinese entity recognition model of the present invention. The model first obtains the characters of the input sentence and, based on these characters, obtains the matching words, pinyin and components. The component extraction module extracts the components of each Chinese character and obtains the Chinese character component feature embedding through a convolutional neural network, which serves as the input of the first Bi-GRU in the two-way encoding. The matching word extraction module obtains the matching words of each character, takes the character embedding and matching word embeddings as input, calculates their correlation through an attention mechanism as dynamic weights, and performs a weighted summation of the matching word embeddings to obtain the final dynamic matching word feature embedding. The pinyin extraction module obtains the pinyin of each character and produces the pinyin embedding through a convolutional neural network. The character embedding, dynamic matching word embedding and pinyin embedding are spliced in sequence to form the input of the other Bi-GRU in the two-way encoding; the character embedding is thus both part of the enhanced character embedding and an input of the matching word extraction module, where it participates in the calculation of the dynamic matching word features.
(4) The invention selects the linear chain conditional random field CRF as the decoder, and the CRF can effectively model the dependency relationship between labels, thereby improving the identification accuracy. Inputting the coding context representation fused with the Chinese character information, the matching word information, the pinyin information and the component information into the CRF for decoding, and extracting entities in sentences based on the tag sequence by the CRF to finish decoding output.
The Chinese entity recognition method based on vocabulary enhancement and character external information provided by the invention addresses two problems of existing methods: they cannot fully utilize information external to the Chinese character, such as stroke information, pinyin information and component information, and they lack dynamic weight distribution when integrating external dictionary information, resulting in insufficient generalization capability. The invention is an integral technical scheme whose flow includes data cleaning, Chinese character feature extraction containing dynamic matching word features, two-way encoding of the enhanced character embedding and the component embedding, and decoded output. When extracting matching word features, the invention weights the matching word embeddings with dynamic weights, which better capture the varying relevance between characters and vocabulary in different contexts and treat each matching word more evenly and accurately. Compared with the static weights adopted by most current Chinese entity recognition methods, the dynamic weights avoid weight distributions that are overly biased toward high-frequency words, improving the generalization capability and recognition accuracy of the model. When extracting Chinese character features, the invention also extracts pinyin and component feature embeddings. Polyphonic characters are particularly common in Chinese, and pinyin feature embeddings make their different readings and meanings easy to distinguish, so the Chinese entity recognition model integrating pinyin feature embeddings recognizes entities better in text containing polyphones.
The Chinese character is pictographic character, the structural information of strokes, parts and the like of the Chinese character contains rich semantics, and the embedding of the part features into the Chinese entity recognition model of the invention improves the semantic representation capability of the model and realizes the performance improvement. The invention uses the double-path coding structure to extract the context characteristics of the enhanced character embedding and the context characteristics of the Chinese character component embedding, and the double-path encoder can independently adjust super parameters aiming at each Bi-GRU encoder, thereby more accurately controlling the learning process of the model and improving the model performance.
Example 4
The Chinese character feature extraction in the step (2) of the invention specifically extracts the Chinese character features based on vocabulary enhancement and character external information in the same way as in the embodiments 1 to 3, and comprises the following steps:
2.1) The cleaned sentence is input into a Chinese-BERT-wwm model to obtain the feature embedding of each character. The Chinese-BERT-wwm model is pre-trained at large scale on Chinese corpora, carries rich language knowledge, and can effectively capture context information.
2.2) Referring to fig. 2, the invention first performs string matching between each character and an external dictionary to obtain the matching words of each character, then divides the matching words into four groups B, M, E and S according to the relation between the character and the matching word. The correlation between the character embedding and each matching word embedding is calculated through an attention mechanism and used as the dynamic weight of the matching word; each group of matching words is then weighted and summed to obtain the weighted word embedding of that group, and the four weighted word embeddings are spliced to obtain the final dynamic matching word feature embedding. The Chinese entity recognition model can thus better capture the varying relevance between characters and words in different contexts and treat each matching word more evenly and accurately, avoiding weight distributions that are overly biased toward high-frequency words and further improving the generalization capability and recognition accuracy of the model.
2.3 Pinyin feature extraction, namely, the phenomenon of polyphones is particularly common in Chinese, different meanings of homophones and heterowords can be easily distinguished based on Pinyin feature embedding, and the entity recognition capability of the model on a text with polyphones can be improved by embedding and integrating the Pinyin feature into the Chinese entity recognition model. The invention firstly utilizes PyPinyin tools to obtain the pinyin of each Chinese character in the input sentence, and then uses convolutional neural network to extract and obtain pinyin characteristic embedding.
2.4 Component feature extraction, namely, chinese characters are pictographic characters, structural information such as strokes, components and the like of the Chinese characters contains rich semantics, and embedding the component features into the Chinese entity recognition model can improve the semantic representation capability of the model and realize performance improvement. The invention firstly decomposes each Chinese character in an input sentence to obtain components for forming the Chinese character, and then uses a convolutional neural network to extract and obtain component characteristic embedding.
2.5) Referring to fig. 5, fig. 5 is an architecture diagram of the convolutional neural network used for pinyin/component feature extraction. The convolutional neural network in the pinyin feature extraction has the same structure as the one in the component feature extraction: a one-dimensional convolutional layer, a max pooling layer and a fully connected layer, connected in sequence.
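The forward pass of such a network can be sketched in plain Python with toy dimensions and hand-set weights; a real implementation would use a deep learning framework with learned parameters.

```python
def conv1d(seq, kernels, bias):
    """Valid 1-D convolution over a sequence of feature vectors.
    seq: list of n input vectors; kernels: list of filters, each a list
    of k weight vectors; returns one output channel per filter."""
    k = len(kernels[0])
    out = []
    for f, fil in enumerate(kernels):
        channel = []
        for i in range(len(seq) - k + 1):
            s = bias[f]
            for j in range(k):
                s += sum(w * x for w, x in zip(fil[j], seq[i + j]))
            channel.append(s)
        out.append(channel)
    return out

def max_pool(channels):
    """Global max pooling: keep the strongest response per channel."""
    return [max(c) for c in channels]

def fully_connected(x, weights, bias):
    """Dense layer: one weighted sum per output unit."""
    return [b + sum(w * xi for w, xi in zip(row, x))
            for row, b in zip(weights, bias)]
```

A pinyin or component sequence, once embedded as vectors, would pass through `conv1d`, `max_pool` and `fully_connected` in exactly this order.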
According to the invention, the dynamic matching word characteristics are extracted, the dynamic weights better capture the different relevance between the characters and the vocabulary in different contexts, and each matching word is treated more uniformly and accurately, so that the situation that the distribution of the weights is too biased to high-frequency words is avoided, and the generalization capability and recognition accuracy of the model are improved. The invention also extracts the pinyin characteristics and the component characteristics of the Chinese characters, the pinyin and the component information of the Chinese characters contain rich semantics, and the pinyin characteristics and the component characteristics are integrated into the Chinese entity recognition model of the invention, so that the semantic representation capability of the model is improved and the performance improvement is realized.
Example 5
The Chinese entity recognition model and method based on vocabulary enhancement and character external information are the same as those of the embodiment 1-4, step 2.2), and the matching word feature extraction method comprises the following steps:
2.2.1) In step (2), the final matching word feature embedding is obtained by a weighted summation of the matching word embeddings. The dynamic attention weight α of each matching word is obtained by calculating the correlation between the character embedding and each matching word embedding through an attention mechanism:
α = softmax( ((W_Q·x)ᵀ(W_K·y)) / √d_k )
where x and y are the character embedding and the matching word embedding, respectively, W_Q and W_K are learnable parameters, and √d_k is the scaling factor.
2.2.2) In order to preserve segmentation information, the matching words are divided into four groups B, M, E and S according to the relation between the character and the matching word: in group B the character is the first character of the matching word, in group M the character appears in the middle of the matching word, in group E the character is the last character of the matching word, and in group S the character is identical to the matching word. The word embeddings of each of the four groups are weighted and summed to obtain the weighted word embedding of that group:
v_i(G) = Σ_{ω∈G} α_ω·e_w(ω),  G ∈ {B, M, E, S}
where α_ω is the attention weight of the matching word ω, and e_w(ω) is the embedded representation of the matching word ω.
The four groups of weighted words are embedded and spliced to obtain the final dynamic matching word embedded representation v i, and the calculation formula is as follows:
vi={vi(B);vi(M);vi(E);vi(S)}
wherein v i(B)、vi(M)、vi(E)、vi (S) is a dynamically weighted word embedding of B, M, E, S sets of matching words, respectively.
The dynamic weight adopted by the invention is calculated through an attention mechanism, the embedding of characters in sentences is combined with the correlation of matching words, and the weight can be flexibly adjusted according to specific context. This means that the model can better capture different relevance of characters and words in different contexts, thereby improving recognition accuracy. In addition, the dynamic weight can avoid the situation that the weight distribution is too biased to the high-frequency words, so that the generalization capability of the model is improved.
Example 6
The Chinese entity recognition model and method based on vocabulary enhancement and character external information are the same as those of embodiments 1-5. The decomposition of Chinese characters into components in step 2.4) is performed with a Chinese character component dictionary: for most common characters, the components are obtained quickly and conveniently by a direct dictionary lookup. If a character cannot be found in the dictionary, it is usually a rare character with low occurrence probability, and the invention represents its components with the special token ["None"].
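The lookup-with-fallback described above can be sketched as follows; the dictionary entries here are illustrative examples, not the patent's actual component dictionary.

```python
# Toy Chinese character component dictionary (illustrative entries only).
COMPONENT_DICT = {
    "明": ["日", "月"],
    "好": ["女", "子"],
}

def components_of(char, table=COMPONENT_DICT):
    """Return the components of a character; rare characters absent from
    the dictionary map to the special token ["None"]."""
    return table.get(char, ["None"])
```

The fallback token keeps the component channel well-defined for every character, so downstream feature extraction never has to special-case missing entries.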
Example 7
The Chinese entity recognition model and method based on vocabulary enhancement and character external information are the same as those of embodiments 1-6, and the two Bi-directional gating circulation units Bi-GRU used in the two-way coding in the step (3) are independent in parameters, optimized independently and are not interfered with each other. The bidirectional gating circulation unit is the combination of two GRUs in the forward direction and the reverse direction, the two GRUs are consistent in network structure, only the input sequence directions are opposite, and the forward network is realized by the following steps:
zt=σ(Wz·[ht-1,xt])
rt=σ(Wr·[ht-1,xt])
h̃t=tanh(W·[rt⊙ht-1,xt])
ht=(1−zt)⊙ht-1+zt⊙h̃t
where σ is the sigmoid function, tanh is the hyperbolic tangent function, rt and zt are the reset gate and the update gate, respectively, ht is the hidden state at time step t, h̃t is the candidate hidden state, and Wr, Wz and W are trainable weight parameters. The Bi-GRU hidden state at character i is the concatenation of the forward and backward hidden states: hi=[→hi;←hi], where →hi is the forward hidden state and ←hi is the backward hidden state.
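The forward and backward recurrences can be sketched with a scalar GRU cell; the one-dimensional weights and omitted biases are simplifications for illustration, not the model's actual parameterization.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h_prev, x, Wz, Wr, W):
    """One forward step of a scalar GRU cell; each weight is a pair
    (weight on h_{t-1}, weight on x_t), biases omitted for brevity."""
    z = sigmoid(Wz[0] * h_prev + Wz[1] * x)             # update gate
    r = sigmoid(Wr[0] * h_prev + Wr[1] * x)             # reset gate
    h_cand = math.tanh(W[0] * (r * h_prev) + W[1] * x)  # candidate state
    return (1 - z) * h_prev + z * h_cand                # new hidden state

def bigru(xs, Wz, Wr, W):
    """Bi-GRU: run the cell forward and backward over the sequence and
    concatenate the two hidden states at each position."""
    fwd, h = [], 0.0
    for x in xs:
        h = gru_step(h, x, Wz, Wr, W)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(xs):
        h = gru_step(h, x, Wz, Wr, W)
        bwd.append(h)
    bwd.reverse()
    return list(zip(fwd, bwd))
```

Note that the hidden state at position i thereby depends on both the left and the right context, which is what the description relies on for entity boundary detection.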
The invention adopts a brand-new double-path coding fusion strategy, uses two parallel modules to respectively process the enhanced character representation and the Chinese character component embedding in the context coding process, optimizes the boundary recognition effect of Chinese named entity recognition, and effectively improves the recognition accuracy of the model in a complex text environment. The dual-path encoder can independently adjust super parameters aiming at each Bi-GRU encoder, and more precisely control the learning process of the model.
The invention unifies pinyin, parts and vocabulary information into a character feature extraction process. And a brand new fusion strategy is adopted, and two parallel modules are used for respectively processing the enhanced character representation and the Chinese character component embedding, so that Chinese character pinyin information and Chinese character component information can be well combined with vocabulary information. Further enriching the character's characteristic representation and achieving performance improvements.
Example 8
The invention designs a dual-stream model based on vocabulary enhancement and Chinese character external information. It adopts a new dual-stream encoding fusion strategy, using two parallel modules to process the enhanced character representation and the Chinese character component embedding respectively, so as to effectively fuse the rich external information of Chinese characters (information that usually does not reside in the single character itself but is critical to understanding and processing it). This further improves the vocabulary-enhancement-based Chinese named entity model and thereby the accuracy of named entity recognition.
In order to achieve the above purpose, the technical scheme of the invention comprises the following steps:
Step 1) preprocessing the input text data set, including removing redundant characters and special symbols. Then, the overlong text data is segmented into reasonable sentence units through an intelligent sentence dividing algorithm, so that the high-efficiency processing capacity of the model is ensured;
step 2) this step aims at extracting various features of the input sentence, including character features, vocabulary features, pinyin features, component features, etc.;
Step 2.1) inputting the cleaned sentence s= { c 1,c2,c3,…,cn } into the BERT model to obtain the embedded representation of each character c i;
Step 2.2) carrying out character string matching on each character and the external dictionary to obtain matching words of each character. Each character may have a plurality of matching words, each matching word needs to be assigned a weight, and previous methods were based directly on the word frequency of the matching word as its weight. In order to capture the relation between characters and matching words thereof, the weight of each matching word is more reasonably distributed, and the invention calculates the attention weight of the characters and each matching word thereof based on an attention mechanism:
α = softmax( ((W_Q·x)ᵀ(W_K·y)) / √d_k )
where x and y are the character embedding and the matching word embedding, respectively, W_Q and W_K are learnable parameters, and √d_k is the scaling factor.
To retain the segmentation information, the matching words are divided into four groups according to the relation between the character and the matching word: B (the character is the first character of the matching word), M (the character appears in the middle of the matching word), E (the character is the last character of the matching word) and S (the character and the matching word are equal). The word embeddings of each of the four groups are weighted and summed to obtain the weighted word embedding of that group:
v_i(G) = Σ_{ω∈G} α_ω·e_w(ω),  G ∈ {B, M, E, S}
where α_ω is the attention weight of the matching word ω, and e_w(ω) is the embedded representation of the matching word ω.
The four groups of weighted words are embedded and spliced to obtain the final dynamic matching word embedded representation v i, and the calculation formula is as follows:
vi={vi(B);vi(M);vi(E);vi(S)}
wherein v i(B)、vi(M)、vi(E)、vi (S) is a dynamically weighted word embedding of B, M, E, S sets of matching words, respectively.
Step 2.3) acquiring the pinyin of each Chinese character in the input sentence, and extracting pinyin characteristics by using a convolutional neural network to obtain pinyin characteristic embedding;
step 2.4) decomposing each Chinese character in the input sentence to obtain components for forming the Chinese character, and extracting the characteristics of the components by using a convolutional neural network to obtain component embedding;
Step 3) The character embedding, word embedding and pinyin embedding are spliced to obtain the enhanced character representation:
xi←[xi;vi;ep]
Where x i is the character embedded representation and e p is the pinyin embedded representation;
Step 4) extracting the context characteristics represented by the enhanced characters and the context characteristics embedded by the Chinese character components by using a Bi-channel Bi-directional gating circulating unit (Bi-GRU);
Step 4.1) The character embedding, vocabulary embedding and pinyin embedding are spliced to obtain the enhanced character representation, which is input into a Bi-GRU to further extract the context representation of the characters. The bidirectional gated recurrent unit can be understood as the combination of a forward GRU and a backward GRU; the two GRUs have identical network structures, only the input sequence directions are opposite. The forward network is implemented as follows:
zt=σ(Wz·[ht-1,xt])
rt=σ(Wr·[ht-1,xt])
h̃t=tanh(W·[rt⊙ht-1,xt])
ht=(1−zt)⊙ht-1+zt⊙h̃t
where σ is the sigmoid function, tanh is the hyperbolic tangent function, rt and zt are the reset gate and the update gate, respectively, ht is the hidden state at time step t, h̃t is the candidate hidden state, and Wr, Wz and W are trainable weight parameters. The Bi-GRU hidden state at character i is the concatenation of the forward and backward hidden states: hi=[→hi;←hi]. The Bi-GRU output of this step can be represented as the set of hidden vectors Hx={h1,h2,…,hn}.
Step 4.2) The Chinese character component embedding is input into another Bi-GRU to further extract the component context representation. The Bi-GRU output of this step can be represented as the set of hidden vectors Hc={h1,h2,…,hn}.
Step 5) The invention selects the linear chain conditional random field (CRF) as the decoder. The outputs of the Bi-GRUs in step 4.1) and step 4.2) are spliced together to obtain the final representation s, which is input to the CRF. The CRF outputs the score of the entity tag sequence y=[l1,l2,…,ln]:
p(y|s) = exp( Σi (W(li)·hi + b(li-1,li)) ) / Σy' exp( Σi (W(l'i)·hi + b(l'i-1,l'i)) )
where y' ranges over all possible tag sequences, W(li) is a learnable parameter specific to tag li, and b(li-1,li) is the transition offset from tag li-1 to tag li.
The invention uses the Viterbi algorithm to search the label sequence with highest score, and can extract the entity in the sentence based on the label sequence.
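The Viterbi search for the highest-scoring tag sequence can be sketched in plain Python over per-position emission scores and tag-transition scores; the score layout is illustrative, since in the patent's model these scores come from the Bi-GRU features and the CRF parameters.

```python
def viterbi(emissions, transitions, tags):
    """Find the highest-scoring tag sequence.
    emissions: list over positions of {tag: score};
    transitions: {(prev_tag, tag): score}; tags: list of tag names."""
    # best score of any path ending in each tag at the first position
    best = {t: emissions[0][t] for t in tags}
    back = []
    for em in emissions[1:]:
        ptr, nxt = {}, {}
        for t in tags:
            # best previous tag for paths ending in t here
            p = max(tags, key=lambda s: best[s] + transitions[(s, t)])
            ptr[t] = p
            nxt[t] = best[p] + transitions[(p, t)] + em[t]
        back.append(ptr)
        best = nxt
    # backtrack from the best final tag
    last = max(tags, key=lambda t: best[t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[last]
```

Dynamic programming keeps the search linear in sentence length, instead of enumerating all |tags|^n sequences.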
The present invention will be further described by the following examples of the model and method fused together.
Example 9
The Chinese entity recognition model and method based on vocabulary enhancement and character external information are the same as those of embodiments 1-3, and specific embodiments and effects of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of this example are as follows:
And step 1, data cleaning.
The input text may be non-canonical and may contain irregular characters, anomalous Unicode characters, etc., which need to be removed. In addition, some input texts may be too long: BERT limits the length of its input, and overlong text is also disadvantageous for the Bi-GRU when extracting context information, so overlong input text needs to be split into several short sentences.
And 2, extracting the characteristics of the input sentence.
This step aims to extract various features of the input sentence including character features, vocabulary features, pinyin features, part features, etc. The Chinese named entity recognition method provided by the invention is a character-based named entity recognition method, and for the sentences cleaned in the step 1, the character embedding, the matching word embedding, the pinyin embedding and the Chinese character component embedding of the sentences can be obtained based on each character in the sentences.
The specific implementation of the steps is as follows:
2.1) Extract character embeddings. There are various methods for extracting character embeddings from the cleaned sentence; the invention extracts them with the BERT model, which is based on the Transformer architecture, uses a masked language model and next sentence prediction as pre-training tasks, and is pre-trained on a large amount of data, so it can generate high-quality language representations. The cleaned sentence s={c1,c2,c3,…,cn} is input into the BERT model to obtain the embedded representation of each character ci.
2.2 Extracting vocabulary and embedding, namely carrying out character string matching on each character in the sentence and an external dictionary to obtain matching words of each character. Each character may have a plurality of matching words, and each matching word needs to be assigned a weight, and most previous methods are directly based on word frequencies of the matching words as weights. In order to capture the relation between characters and matching words thereof, the weight of each matching word is more reasonably distributed, and the invention calculates the attention weight of the characters and each matching word thereof based on an attention mechanism:
α = softmax( ((W_Q·x)ᵀ(W_K·y)) / √d_k )
where x and y are the character embedding and the matching word embedding, respectively, W_Q and W_K are learnable parameters, and √d_k is the scaling factor.
To retain the segmentation information, the matching words are divided into four groups according to the relation between the character and the matching word: B (the character is the first character of the matching word), M (the character appears in the middle of the matching word), E (the character is the last character of the matching word) and S (the matching word is the single character itself). The word embeddings of each of the four groups are weighted and summed to obtain the weighted word embedding of that group:
v_i(G) = Σ_{ω∈G} α_ω·e_w(ω),  G ∈ {B, M, E, S}
where α_ω is the attention weight of the matching word ω, and e_w(ω) is the embedded representation of the matching word ω.
The four groups of weighted words are embedded and spliced to obtain the final dynamic matching word embedded representation v i, and the calculation formula is as follows:
vi={vi(B);vi(M);vi(E);vi(S)}
wherein v i(B)、vi(M)、vi(E)、vi (S) is a dynamically weighted word embedding of B, M, E, S sets of matching words, respectively.
2.3) Extract pinyin embeddings. The pinyin of each character in the sentence is extracted with the PyPinyin tool; for example, for the character "中" PyPinyin outputs ["z","h","o","n","g","1"], where "1" indicates that the tone is the first tone. The pinyin sequence is then input into a convolutional neural network to extract the pinyin embedding. Referring to fig. 5, the convolutional neural network includes a one-dimensional convolutional layer, a max pooling layer and a fully connected layer.
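A small sketch of preparing such a PyPinyin character sequence for the one-dimensional convolutional layer; the padding length and the vocabulary scheme are illustrative assumptions, not the patent's specification.

```python
PAD = "<pad>"

def pinyin_to_sequence(pinyin_chars, max_len=8):
    """Pad/truncate a PyPinyin character list (e.g. ["z","h","o","n","g","1"]
    for "中") to a fixed length so it can be fed to the convolutional layer."""
    seq = list(pinyin_chars)[:max_len]
    seq += [PAD] * (max_len - len(seq))
    return seq

def build_vocab(sequences):
    """Map each pinyin letter/tone symbol to an integer id; id 0 is PAD."""
    vocab = {PAD: 0}
    for seq in sequences:
        for ch in seq:
            vocab.setdefault(ch, len(vocab))
    return vocab
```

The integer ids would then index an embedding table whose vectors form the input sequence of the convolutional network.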
2.4) Extract Chinese character component embeddings. The invention constructs a Chinese character component dictionary; for most common characters, the components of the character can be obtained by a direct dictionary lookup, for example, the components of "亮" (bright) are ["亠","口","冖","儿"]. Likewise, the components are input into a convolutional neural network, whose structure is identical to that of the convolutional neural network mentioned in 2.3), to extract the Chinese character component embedding.
And step 3, obtaining the enhanced character embedding.
Character embedding, vocabulary embedding and pinyin embedding are spliced to obtain enhanced character representation:
xi←[xi;vi;ep]
Where x i is the character-embedded representation and e p is the pinyin-embedded representation.
And 4, extracting text context characteristics.
The invention provides a double-flow model for fusing various external information of Chinese characters, and selects two Bi-directional gating circulating units (Bi-GRU) as encoders to extract context characteristics of enhanced character representation and component embedding respectively. Bi-GRU can capture both forward and backward information in the sequence, meaning that the output at each point in time is based on the context of the entire sequence, which is critical to named entity recognition. Bi-GRU can be flexibly applied to different sequence lengths and effectively cope with long-term dependency in long sequences.
4.1 The bidirectional gating circulation unit can be understood as the combination of a forward direction GRU and a reverse direction GRU, the two GRUs are consistent in network structure, only the input sequence directions are opposite, and the forward direction network is realized by the following steps:
zt=σ(Wz·[ht-1,xt])
rt=σ(Wr·[ht-1,xt])
h̃t=tanh(W·[rt⊙ht-1,xt])
ht=(1−zt)⊙ht-1+zt⊙h̃t
where σ is the sigmoid function, rt and zt are the reset gate and the update gate, respectively, h̃t is the candidate hidden state, and Wr, Wz and W are trainable parameters. The Bi-GRU bidirectional hidden state at character i can be represented as the concatenation of the forward and backward hidden states: hi=[→hi;←hi]. The Bi-GRU output of this step can be represented as the set of hidden vectors Hx={h1,h2,…,hn}.
4.2) The Chinese character component embedding is input into another Bi-GRU to further extract the component context representation. The Bi-GRU output of this step can be represented as the set of hidden vectors Hc={h1,h2,…,hn}.
Step 5, performing CRF decoding to obtain the output.
The linear-chain conditional random field (CRF) can capture global information of the entire sequence, ensuring that the labeling of the whole sequence is optimal. This means that the CRF considers not only the optimal label at a single position but also the optimal label combination over the whole sequence, which is critical for the NER task. The outputs of the two Bi-GRU modules in step 4 are concatenated into s = [s1, s2, ..., sn] and input to the CRF model, which outputs the score of an entity tag sequence y = [l1, l2, ..., ln]:

p(y | s) = exp(Σi (W^(li)·si + b^(li-1, li))) / Σy′ exp(Σi (W^(l′i)·si + b^(l′i-1, l′i)))

where y′ ranges over all possible tag sequences, W^(li) is a learnable parameter specific to label li, and b^(li-1, li) is a learnable transition score from label li-1 to label li. The invention uses the Viterbi algorithm to search for the tag sequence with the highest score, and the entities in the sentence can be extracted based on this tag sequence.
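The Viterbi search over label scores can be sketched generically as below; the emission/transition parameterisation is a standard linear-chain CRF assumption, not the patent's exact parameters:

```python
import numpy as np

# Minimal Viterbi decoder over per-position label scores plus a label
# transition matrix, as used in the CRF decoding step.
def viterbi(emissions, transitions):
    """emissions: (n, L) per-character label scores; transitions: (L, L)."""
    n, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, L), dtype=int)
    for t in range(1, n):
        total = score[:, None] + transitions + emissions[t]  # (L, L)
        back[t] = total.argmax(axis=0)   # best previous label per label
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):        # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 3 characters, 2 labels; label 1 is favoured everywhere.
em = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
tr = np.zeros((2, 2))
assert viterbi(em, tr) == [1, 1, 1]
```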
The invention unifies pinyin, component, and vocabulary information in one character feature extraction process, and adopts a new fusion strategy in which two parallel modules respectively process the enhanced character representation and the Chinese character component embedding, so that Chinese character pinyin information and component information are well combined with vocabulary information, further enriching the character feature representation and improving performance. In the implementation, the input text is first cleaned: invalid characters are removed and overlong text is segmented. Multiple features of the input sentence are then extracted, including character features, vocabulary features, pinyin features, and component features. First, the BERT model is used to extract the embedding of each character of the input sentence. Each character of the input sentence is string-matched against an external dictionary to obtain its matching words, and the matching-word embeddings are weighted and summed based on an attention mechanism to obtain the vocabulary embedding. After the PyPinyin tool acquires the pinyin of each character in the input sentence, the pinyin is fed into a convolutional neural network to extract the pinyin embedding. After the components of each character are obtained from the constructed Chinese character component dictionary, they are fed into a convolutional neural network to extract the Chinese character component embedding.
The character features, pinyin features, and vocabulary features are concatenated to obtain the enhanced character embedding. The enhanced character embedding is input into one Bi-GRU model to extract character context features, and the component embedding is input into the other Bi-GRU model to extract component context features; the outputs of the two Bi-GRUs are concatenated and input into a linear-chain conditional random field model for decoding, yielding the output entity tag sequence.
The technical effects of the present invention are further illustrated by the following simulations.
Example 10
The Chinese entity recognition model and method based on vocabulary enhancement and character external information are the same as embodiments 1-9.
Simulation conditions
Experimental data used the Weibo, OntoNotes, and Resume datasets. The simulation platform is an Intel Core i5-12400 CPU with 16.0 GB memory and an NVIDIA GeForce RTX 3090 Ti graphics card, running the Windows 11 operating system.
Simulation content
The three datasets are tested with the invention and the existing SoftLexicon, TENER, FLAT, and MECT algorithms respectively, and the F1 score is used to evaluate their performance. The results are shown in Table 1, a comparison of the simulation results of the invention and the prior art.
Simulation results and analysis
See Table 1, where the invention improves the F1 score by 2.58% over the baseline model SoftLexicon on the Weibo dataset, improves the F1 score to 83.32% on the OntoNotes dataset, and achieves the best result on the Resume dataset, improving the F1 score by 0.15% over the baseline model SoftLexicon. The simulation results show that on the difficult Weibo dataset, the F1 score of the proposed method is significantly higher than that of the other methods, demonstrating its adaptability in complex text environments. The simulation results show that the invention outperforms the existing methods on all three datasets.
Table 1 comparison of simulation results of the present invention and prior art
| Model | Weibo | OntoNotes | Resume |
| SoftLexicon | 70.50 | 82.81 | 96.11 |
| TENER | 58.39 | 72.82 | 95.25 |
| FLAT | 68.55 | 81.82 | 95.86 |
| MECT | 70.43 | 82.57 | 95.98 |
| The invention | 73.08 | 83.32 | 96.26 |
Example 11
The Chinese entity recognition model and method based on vocabulary enhancement and character external information are the same as in Examples 1-9, and the simulation conditions are the same as in Example 10; in this example, only the Weibo dataset is used.
Simulation content
In order to verify the contribution of each module in the model of the invention, an ablation experiment is carried out on the Weibo dataset, and precision, recall, and F1 score are used to evaluate the performance of each variant. The results are shown in Table 2, the ablation experiment results of the invention.
Simulation results and analysis
To verify the contribution of each module in the model of the present invention, an ablation experiment was performed on the Weibo dataset; the results of the ablation experiments are shown in Table 2.
TABLE 2 results of ablation experiments of the invention
| Model | Precision | Recall | F1 score |
| Complete model | 73.73 | 72.46 | 73.08 |
| Removing pinyin information | 71.06 | 72.95 | 71.99 |
| Removing component information | 73.42 | 70.04 | 71.69 |
| Removing component and pinyin information | 72.82 | 68.60 | 70.65 |
| Removing dynamic weights | 73.27 | 71.50 | 72.37 |
When the pinyin embedding is removed, the F1 score drops to 71.99, 1.09% lower than the full model. The invention removes the component information to evaluate its contribution; after removal, the F1 score of the model on the Weibo dataset is reduced by 1.39%. In addition, when both pinyin and component information are removed, the F1 score further decreases to 70.65, 2.43% lower than the full model. These results also demonstrate that supplementing multiple kinds of external information further improves model performance. Finally, the dynamic weights used in the invention are removed and, as in SoftLexicon, word frequency is used instead as the weight of the matching words, thereby evaluating the importance of the dynamic weights. The experimental results show that the F1 score decreases by 0.71%, indicating that the dynamic weights used by the method effectively improve the accuracy and generalization capability of the model.
In summary, the Chinese entity recognition model and method based on vocabulary enhancement and character external information provided by the invention unify pinyin, component, and vocabulary information in one character feature extraction process. A new fusion strategy uses two parallel modules to respectively process the enhanced character representation and the Chinese character component embedding, so that the pinyin and component information of Chinese characters are well combined with the vocabulary information, further enriching the character feature representation and improving performance. The method solves the problems that existing methods cannot fully utilize the external information of Chinese characters and that they rely only on word frequency to determine static weights for matching words. The Chinese entity recognition model sequentially connects a Chinese character information extraction part, a Chinese character feature extraction part containing dynamic matching word features, a two-way encoder part for enhanced character embedding and component embedding, and a decoding output part. The implementation comprises data cleaning, Chinese character feature extraction containing dynamic matching word features, and two-way encoding and decoding output for enhanced character embedding and component embedding. The invention provides a two-way encoding structure that effectively fuses the word, pinyin, and component information of Chinese characters into the model, greatly enriching the semantic expression of character-based Chinese named entity recognition and solving the problem that traditional methods cannot fully utilize the external information of Chinese characters.
In addition, when introducing vocabulary information, the invention adopts an attention mechanism that dynamically adjusts the weights of the matching words by computing the correlation between characters and matching words. Compared with existing methods that rely only on word frequency to determine static weights, the invention allocates matching-word weights more flexibly and reasonably, treats each matching word more evenly, avoids weight allocation that is overly biased toward high-frequency words, and improves the generalization capability and recognition accuracy of the model. The invention is used for extracting Chinese entities from unstructured input text.
Claims (7)
1. A Chinese entity recognition model and method based on vocabulary enhancement and character external information, characterized in that a Chinese character information extraction part, a Chinese character feature extraction part containing dynamic matching word features, a two-way encoder part for enhanced character embedding and component embedding, and a decoding output part are sequentially connected; the Chinese character information extraction part is provided with a character extraction module, a matching word extraction module, a pinyin extraction module, and a component extraction module based on the character extraction module; the Chinese character feature extraction part containing dynamic matching word features comprises a character feature extraction module, a dynamic matching word feature extraction module, a pinyin feature extraction module, and a component feature extraction module, wherein the character feature extraction module takes the output of the character extraction module in the Chinese character information extraction part as input to obtain the character embedding; the dynamic matching word feature extraction module takes the output of the matching word extraction module in the Chinese character information extraction part together with the output of the character extraction module as input, and then performs weighted summation on the matching word embeddings based on an attention mechanism to obtain the dynamic matching word feature embedding; the pinyin feature extraction module takes the output of the pinyin extraction module in the Chinese character information extraction part as input to extract the pinyin feature embedding; the component feature extraction module takes the output of the component extraction module in the Chinese character information extraction part as input to extract the component feature embedding; the two-way encoder part for enhanced character embedding and component embedding comprises two branches,
respectively used for encoding the context features of the component embedding and the context features of the enhanced character embedding combined with the dynamic matching word features, and then concatenating the component context features and the enhanced character context features as the output of the two-way encoder; the decoding output part decodes the encoder output to obtain the optimal tag sequence.
2. The Chinese entity recognition model based on vocabulary enhancement and character external information according to claim 1, wherein, when the dynamic matching word feature extraction module performs weighted summation on the matching words to obtain the dynamic matching word feature embedding, the dynamic attention weight α of each matching word is obtained by computing, through an attention mechanism, the correlation between the character embedding and each matching word embedding, with the calculation formula:

α = softmax((xWQ)(yWK)T / √dk)

where x and y are the character embedding and the matching word embedding respectively, WQ and WK are learnable parameters, and √dk is the scaling factor;
In order to preserve the segmentation information, the matching words are divided into four groups B, M, E, S according to the relation between the character and the matching word: in group B the character is the first character of the matching word, in group M the character appears in the middle of the matching word, in group E the character is the last character of the matching word, and in group S the character is identical to the matching word. The word embeddings of the matching words in each group are weighted and summed to obtain the weighted word embedding of that group:

vi(G) = Σω∈G αω·ew(ω), G ∈ {B, M, E, S}

where αω is the attention weight of the matching word ω, and ew(ω) is the embedded representation of the matching word ω;
The four groups of weighted words are embedded and spliced to obtain the final dynamic matching word embedded representation v i, and the calculation formula is as follows:
vi={vi(B);vi(M);vi(E);vi(S)};
wherein v i(B)、vi(M)、vi(E)、vi (S) is a dynamically weighted word embedding of B, M, E, S sets of matching words, respectively.
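The grouped attention weighting of claim 2 can be sketched as follows; WQ, WK, the dimensions, and the random group contents are illustrative placeholders:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Sketch of the dynamic matching-word weighting: attention scores between
# a character embedding x and each matching-word embedding, then a
# weighted sum per B/M/E/S group, then concatenation of the four groups.
d, d_k = 8, 8
rng = np.random.default_rng(1)
W_Q = rng.standard_normal((d, d_k))
W_K = rng.standard_normal((d, d_k))
x = rng.standard_normal(d)                     # character embedding

def weighted_group(word_embs):
    """word_embs: (n_words, d) embeddings of one B/M/E/S group."""
    if len(word_embs) == 0:
        return np.zeros(d)
    scores = (x @ W_Q) @ (word_embs @ W_K).T / np.sqrt(d_k)
    alpha = softmax(scores)                    # dynamic attention weights
    return alpha @ word_embs                   # weighted sum of the group

groups = {g: rng.standard_normal((2, d)) for g in "BMES"}
v_i = np.concatenate([weighted_group(groups[g]) for g in "BMES"])
assert v_i.shape == (4 * d,)                   # v_i = {v_i(B); v_i(M); v_i(E); v_i(S)}
```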
3. A method for recognizing Chinese entities based on vocabulary enhancement and character external information, implemented on the Chinese named entity recognition model based on vocabulary enhancement and character external information of any one of claims 1-2, comprising the steps of:
(1) Data cleaning: an original Chinese input text is received, overlong text is divided into a plurality of short sentences, and illegal characters and invalid Unicode characters in the short sentences are removed, completing the data cleaning and obtaining cleaned sentences;
(2) Feature extraction: multiple features of the input sentence after data cleaning are extracted, including character features, dynamic matching word features, pinyin features, and component features; the character features are extracted by inputting the cleaned sentence into a BERT model to obtain the feature embedding of each character; for the dynamic matching word features, each character is first string-matched against an external dictionary to obtain its matching words, the dynamic weights of the matching words are calculated based on the matching word embeddings and the character feature embedding, and the matching word embeddings are then weighted and summed to obtain the dynamic matching word feature embedding; for the pinyin features, the pinyin of each Chinese character in the input sentence is obtained and a convolutional neural network is used to extract the pinyin feature embedding; the character feature embedding, the dynamic matching word feature embedding, and the pinyin feature embedding are sequentially concatenated to obtain the enhanced character embedding;
(3) Two-way encoding of the enhanced character embedding and the component embedding: two bidirectional gated recurrent units (Bi-GRU) are used to extract the context features of the enhanced character embedding and of the Chinese character component embedding; the character embedding, dynamic matching word feature embedding, and pinyin embedding are concatenated to obtain the enhanced character embedding, which is input into one Bi-GRU to extract the context representation of the enhanced character embedding; the Chinese character component embedding representation is input into the other Bi-GRU to extract the context representation of the component embedding; the component context representation and the enhanced character context representation are concatenated, completing the two-way encoding of the enhanced character embedding and the component embedding;
(4) Decoding output: a linear-chain conditional random field (CRF) is selected as the decoder, the encoded context representation fusing Chinese character information, matching word information, pinyin information, and component information is input into the CRF for decoding, and the CRF extracts the entities in the sentence based on the tag sequence, completing the decoding output.
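As one possible reading of cleaning step (1), a sketch with assumed rules — the character whitelist, sentence-ending punctuation, and 128-character limit are illustrative choices, not specified by the claim:

```python
import re

# Illustrative data-cleaning sketch: strip characters outside a whitelist,
# then split the text into short sentences at Chinese sentence enders.
MAX_LEN = 128  # assumed length limit for "overlong" text
VALID = re.compile(r"[^\u4e00-\u9fff0-9A-Za-z，。！？、：；（）]")

def clean(text: str) -> list:
    text = VALID.sub("", text)                  # drop illegal characters
    parts = re.split(r"(?<=[。！？；])", text)   # split after sentence enders
    return [p[:MAX_LEN] for p in parts if p]

assert clean("你好！\x00今天天气好。") == ["你好！", "今天天气好。"]
```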
4. The method for recognizing Chinese entities based on vocabulary enhancement and character external information according to claim 3, wherein the extraction of the Chinese character features in step (2) specifically comprises:
2.1 Inputting the sentence after cleaning into a Chinese-BERT-wwm model to obtain the feature embedding of each character;
2.2 Firstly, carrying out character string matching on each character and an external dictionary to obtain matching words of each character, then dividing the matching words into B, M, E, S groups according to the relation between the characters and the matching words, calculating the relativity of character embedding and each matching word embedding based on an attention mechanism to be used as the dynamic weight of the matching words, then respectively weighting and summing each group of matching words to obtain weighted word embedding of each group of matching words, and splicing the four groups of weighted word embedding to obtain final dynamic matching word feature embedding;
2.3 Pinyin feature extraction, namely utilizing PyPinyin tools to obtain the pinyin of each Chinese character in an input sentence, and utilizing a convolutional neural network to extract and obtain pinyin feature embedding;
2.4 Decomposing each Chinese character in the input sentence to obtain components for forming the Chinese character, and extracting and obtaining component characteristic embedding by using a convolutional neural network;
2.5) The convolutional neural networks used in the pinyin feature extraction and the component feature extraction have the same structure: a one-dimensional convolutional layer, a maximum pooling layer, and a fully connected layer are sequentially connected.
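The shared CNN of step 2.5) (one-dimensional convolution, max-pooling, fully connected layer) can be sketched in numpy as follows; all sizes are illustrative:

```python
import numpy as np

# Sketch of the shared CNN used for both pinyin and component embeddings:
# one 1-D convolution -> max-pooling over positions -> fully connected
# projection, as described in step 2.5). Dimensions are placeholders.
rng = np.random.default_rng(0)

def char_cnn(seq_emb, w_conv, w_fc):
    """seq_emb: (seq_len, d_in); w_conv: (k, d_in, d_conv); w_fc: (d_conv, d_out)."""
    k = w_conv.shape[0]
    seq_len = seq_emb.shape[0]
    # 1-D convolution over the symbol sequence (valid padding)
    conv = np.stack([
        np.einsum("kd,kdc->c", seq_emb[t:t + k], w_conv)
        for t in range(seq_len - k + 1)
    ])                              # (seq_len - k + 1, d_conv)
    pooled = conv.max(axis=0)       # max-pooling over positions
    return pooled @ w_fc            # fully connected projection

pinyin_seq = rng.standard_normal((6, 16))   # e.g. embedded letters of one syllable
out = char_cnn(pinyin_seq,
               rng.standard_normal((3, 16, 32)),
               rng.standard_normal((32, 64)))
assert out.shape == (64,)
```

The same module, with its own weights, would consume the component sequence of step 2.4).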
5. The method for recognizing chinese entities based on vocabulary enhancement and character external information according to claim 4, wherein the dynamically matching word feature extraction in step 2.2) comprises the steps of:
2.2.1) The dynamic attention weight α of each matching word is obtained by computing the correlation between the character embedding and each matching word embedding through an attention mechanism, with the calculation formula:

α = softmax((xWQ)(yWK)T / √dk)

where x and y are the character embedding and the matching word embedding respectively, WQ and WK are learnable parameters, and √dk is the scaling factor;
2.2.2) In order to preserve the segmentation information, the matching words are divided into four groups B, M, E, S according to the relation between the character and the matching word: in group B the character is the first character of the matching word, in group M the character appears in the middle of the matching word, in group E the character is the last character of the matching word, and in group S the character is identical to the matching word; the word embeddings of the matching words in each group are weighted and summed to obtain the weighted word embedding of each group:

vi(G) = Σω∈G αω·ew(ω), G ∈ {B, M, E, S}

where αω is the attention weight of the matching word ω, and ew(ω) is the embedded representation of the matching word ω;
The four groups of weighted words are embedded and spliced to obtain the final dynamic matching word embedded representation v i, and the calculation formula is as follows:
vi={vi(B);vi(M);vi(E);vi(S)};
wherein v i(B)、vi(M)、vi(E)、vi (S) is a dynamically weighted word embedding of B, M, E, S sets of matching words, respectively.
6. The method of claim 4, wherein the components constituting the Chinese character in step 2.4) are obtained based on a Chinese character component dictionary: for the most common characters, the dictionary is searched directly to obtain the components of the character, and if the character cannot be found in the dictionary, a special token [ "None" ] is used to represent its composition.
7. The method for recognizing Chinese entities based on vocabulary enhancement and character external information according to claim 3, wherein the two bidirectional gated recurrent units (Bi-GRU) used in the two-way encoding of step (3) have independent parameters, are optimized independently, and do not interfere with each other; a bidirectional gated recurrent unit is the combination of a forward GRU and a backward GRU with identical network structure, differing only in the direction of the input sequence, and the forward network is realized by:

zt = σ(Wz·[ht-1, xt])
rt = σ(Wr·[ht-1, xt])
h̃t = tanh(W·[rt ⊙ ht-1, xt])
ht = (1 - zt) ⊙ ht-1 + zt ⊙ h̃t

where σ represents the sigmoid function, tanh is the hyperbolic tangent function, r and z represent the reset gate and the update gate respectively, ht denotes the hidden state at time step t, h̃t is the candidate hidden state, and Wr, Wz, and W are trainable weight parameters; the Bi-GRU hidden state at character i is the concatenation of the forward hidden state and the backward hidden state, hi = [→hi; ←hi], where →hi is the forward hidden state and ←hi is the backward hidden state.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411501310.1A CN119398054B (en) | 2024-10-25 | 2024-10-25 | Chinese entity recognition model and method based on vocabulary enhancement and character external information |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119398054A true CN119398054A (en) | 2025-02-07 |
| CN119398054B CN119398054B (en) | 2025-10-28 |
Family
ID=94418160
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411501310.1A Active CN119398054B (en) | 2024-10-25 | 2024-10-25 | Chinese entity recognition model and method based on vocabulary enhancement and character external information |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119398054B (en) |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114662476A (en) * | 2022-02-24 | 2022-06-24 | 北京交通大学 | Character sequence recognition method fusing dictionary and character features |
| CN114818717A (en) * | 2022-05-25 | 2022-07-29 | 华侨大学 | Chinese named entity recognition method and system fusing vocabulary and syntax information |
| CN115238696A (en) * | 2022-08-03 | 2022-10-25 | 江西理工大学 | Chinese named entity recognition method, electronic equipment and storage medium |
| CN115600597A (en) * | 2022-10-18 | 2023-01-13 | 淮阴工学院(Cn) | Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium |
| CN116306642A (en) * | 2022-12-14 | 2023-06-23 | 河北工业大学 | A Named Entity Recognition Method Fused with Glyph Features and Local Attention Enhancement |
| CN116579341A (en) * | 2023-04-26 | 2023-08-11 | 南京航空航天大学 | A Chinese Named Entity Recognition Method Based on Word Fusion with Low Vocabulary Information Loss |
| CN117113997A (en) * | 2023-07-25 | 2023-11-24 | 四川大学 | A Chinese named entity recognition method that enhances the integration of dictionary knowledge |
| CN117272998A (en) * | 2023-08-24 | 2023-12-22 | 大连民族大学 | A method for recognizing named entities in ancient Chinese based on knowledge embedding |
2024
- 2024-10-25 CN CN202411501310.1A patent/CN119398054B/en active Active
Non-Patent Citations (2)
| Title |
|---|
| GUOQIANG PENG ET AL.: "A Chinese named entity approach for character semantic enhancement", ACM, 31 December 2022 (2022-12-31) * |
| XIE Runzhong; LI Ye: "Text sentiment classification model based on BERT and dual-channel attention", Journal of Data Acquisition and Processing, no. 04, 15 July 2020 (2020-07-15) * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119398054B (en) | 2025-10-28 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||