CN107203511B - A Network Text Named Entity Recognition Method Based on Neural Network Probabilistic Disambiguation - Google Patents
A Network Text Named Entity Recognition Method Based on Neural Network Probabilistic Disambiguation
- Publication number
- CN107203511B CN107203511B CN201710390409.2A CN201710390409A CN107203511B CN 107203511 B CN107203511 B CN 107203511B CN 201710390409 A CN201710390409 A CN 201710390409A CN 107203511 B CN107203511 B CN 107203511B
- Authority
- CN
- China
- Prior art keywords
- neural network
- word
- named entity
- network
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
Description
Technical Field
The invention relates to the processing and analysis of network text, and in particular to a method for named entity recognition in network text based on neural network probabilistic disambiguation.
Background Art
The Internet has brought the speed and scale of information collection and dissemination to an unprecedented level and enabled global information sharing and interaction; it has become indispensable infrastructure for the information society. Modern communication and dissemination technologies have greatly increased the speed and reach of information, but they come with a side effect: the flood of information can be overwhelming, and quickly and accurately retrieving the information one needs most from this vast sea has become very difficult. Identifying the people, places, organizations, and other named entities that Internet users care about in massive volumes of network text provides important supporting information for upper-level applications such as online marketing and group sentiment analysis. This makes named entity recognition for network text an important core technology in network data processing and analysis.
Research on named entity recognition methods falls into two main categories: rule-based methods and statistics-based methods. With the continuous refinement of machine learning theory and the great improvement in computing performance, statistics-based methods have become the more popular choice.
At present, the statistical models applied to named entity recognition mainly include hidden Markov models, decision trees, maximum entropy models, support vector machines, conditional random fields, and artificial neural networks. Artificial neural networks can achieve better results in named entity recognition than conditional random fields or maximum entropy models, yet practical systems still rely mainly on the latter two. For example, patent CN201310182978.X proposes a named entity recognition method and apparatus for microblog text that uses conditional random fields combined with a named entity lexicon, and patent CN200710098635.X proposes a named entity recognition method that models character features with a maximum entropy model. The reason artificial neural networks remain hard to deploy is that, for named entity recognition, they typically require converting each word into a vector in a word vector space; newly coined words have no corresponding vector, which has prevented large-scale practical application.
Given this situation, named entity recognition for network text faces two main problems. First, because network text contains a large number of Internet slang terms, newly coined words, and typos, it is impossible to train a word vector space covering every word with which to train the neural network. Second, the arbitrary language forms, irregular grammatical structures, and frequent typos in network text reduce the accuracy of named entity recognition.
Summary of the Invention
Purpose of the invention: To overcome the deficiencies of the prior art, the present invention provides a network text named entity recognition method based on neural network probabilistic disambiguation that extracts word features incrementally without retraining the neural network and performs recognition with probabilistic disambiguation. The method trains a neural network to obtain a matrix of predicted probabilities over the named entity types of each word, then disambiguates this prediction matrix with a probabilistic model, improving both the accuracy and the precision of named entity recognition in network text.
Technical solution: To achieve the above purpose, the technical solution adopted by the present invention is as follows.
A network text named entity recognition method based on neural network probabilistic disambiguation: segment the unlabeled corpus into words, extract word vectors with Word2Vec, convert the sample corpus into word feature matrices and window them, and build and train a deep neural network whose output layer applies a softmax function for normalization, yielding for each word a probability matrix over the named entity categories. The probability matrix is then re-windowed and disambiguated with a conditional random field model to obtain the final named entity labels.
The method specifically comprises the following steps:
Step 1: Obtain an unlabeled corpus with a web crawler and a sample corpus annotated with named entities from corpus repositories, then segment the unlabeled corpus into words with natural language tools.
Step 2: Train the word vector space on the segmented unlabeled corpus and the sample corpus with the Word2Vec tool.
Step 3: Convert the text of the sample corpus into word vectors representing word features according to the trained Word2Vec model, window the word vectors, and feed the two-dimensional matrix of window size w by word vector length d into the neural network as input. Convert the labels in the sample corpus into one-hot form as the neural network's output. Normalize the output layer of the neural network with a softmax function so that the classification result is the probability that a word is a non-named entity or each type of named entity, and train the network while tuning its structure, depth, node count, step size, activation function, and initial-value parameters.
Step 4: Re-window the prediction matrix output by the neural network, treat the context prediction information of each word to be labeled as the points associated with that word's actual class in a conditional random field model, compute the expected value of each edge from the training corpus with the EM algorithm, and train the corresponding conditional random field model.
Step 5: At recognition time, first convert the text to be recognized into word vectors representing word features according to the trained Word2Vec model. If the Word2Vec model does not contain a corresponding training word, convert that word into a word vector by incremental learning: obtain the word vector, then roll the word vector space back. Window the word vectors and feed the two-dimensional matrix of window size w by word vector length d into the neural network. Then re-window the prediction matrix produced by the neural network and pass it to the trained conditional random field model for disambiguation, obtaining the final named entity labels for the text.
Preferably, the parameters of the Word2Vec tool are as follows: word vector length 200, 25 iterations, initial step size 0.025, minimum step size 0.0001, using the CBOW model.
Preferably, the parameters of the neural network are as follows: 2 hidden layers, 150 hidden nodes, step size 0.01, batchSize 40, and the sigmoid activation function.
Preferably, the labels in the sample corpus are converted into one-hot form as follows: the "/o", "/n", and "/p" labels in the sample corpus are first converted into the named entity labels "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B", "/Loc-I", and then into one-hot form.
Preferably, the window size for word vector windowing is 5.
Preferably, during neural network training, one tenth of the words in the sample data are withheld from training and used as the yardstick for evaluating the neural network.
Compared with the prior art, the present invention has the following beneficial effects:
Word vectors can be extracted incrementally without retraining the neural network, and the neural network's predictions are disambiguated with a probabilistic model, giving the method better practicality, accuracy, and precision in named entity recognition on network text. For this task, the invention provides a word vector incremental learning method that does not change the neural network structure, addressing the prevalence of Internet slang and newly coined words in such text, and adopts probabilistic disambiguation to cope with its irregular grammar and frequent typos. The method of the present invention therefore achieves a higher accuracy rate in network text named entity recognition.
Brief Description of the Drawings
FIG. 1 is a flow chart of training a network text named entity recognition apparatus based on neural network probabilistic disambiguation according to the present invention.
FIG. 2 is a flow chart of converting words into word features according to the present invention.
FIG. 3 is a schematic diagram of the text processing and the neural network structure according to the present invention.
Detailed Description of the Embodiments
The present invention is further explained below with reference to the accompanying drawings and specific embodiments. It should be understood that these examples are only intended to illustrate the invention and not to limit its scope; after reading this disclosure, modifications by those skilled in the art to the various equivalent forms of the invention all fall within the scope defined by the claims appended to this application.
A network text named entity recognition method based on neural network probabilistic disambiguation: segment the unlabeled corpus into words, extract word vectors with Word2Vec, convert the sample corpus into word feature matrices and window them, and build and train a deep neural network whose output layer applies a softmax function for normalization, yielding for each word a probability matrix over the named entity categories. The probability matrix is then re-windowed and disambiguated with a conditional random field model to obtain the final named entity labels.
The method specifically comprises the following steps:
Step 1: Crawl unlabeled network text with a web crawler, download corpora annotated with named entities from various corpus repositories as the sample corpus, and segment the unlabeled corpus into words with natural language tools.
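The patent leaves the segmentation tool unspecified ("natural language tools"). As a minimal sketch, assuming the jieba segmenter and an illustrative sentence:

```python
# A minimal sketch of the segmentation in Step 1; jieba is our assumed
# "natural language tool" (the patent does not name one).
import jieba

def segment_corpus(raw_texts):
    """Segment each crawled document into a list of word tokens."""
    return [jieba.lcut(text) for text in raw_texts]

unlabeled_sentences = segment_corpus([
    "网络使得信息的采集、传播的速度和规模达到空前的水平。",
])
print(unlabeled_sentences[0])  # e.g. ['网络', '使得', '信息', '的', ...]
```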
Step 2: Train the word vector space on the segmented unlabeled corpus and the sample corpus with the Word2Vec tool.
Step 3: Convert the text of the sample corpus into word vectors representing word features according to the trained Word2Vec model, and use them as the input of the neural network. Convert the labels in the sample corpus into one-hot form as the neural network's output. Because a named entity may be split across several words during text processing, the labels use the IOB scheme to guarantee that recognized entities are complete.
What type of named entity a word is cannot be determined from the word alone; it also depends on the context in which the word appears. Therefore, when building the neural network, we introduce the concept of a window: when classifying a word, the feature information of the word and of its fixed-length context is fed into the neural network together, so the network's input is no longer a single word feature vector of length d but a two-dimensional matrix of window size w by word feature length d.
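A sketch of this windowing, assuming window size w = 5 and zero vectors as padding at sentence boundaries (the patent does not state how boundaries are handled); the function name is illustrative:

```python
import numpy as np

def window_features(word_vectors, w=5):
    """word_vectors: (n_words, d) array for one sentence.
    Returns (n_words, w, d): one w-by-d input matrix per word,
    zero-padded at the sentence boundaries."""
    n, d = word_vectors.shape
    half = w // 2
    padded = np.vstack([np.zeros((half, d)), word_vectors, np.zeros((half, d))])
    return np.stack([padded[i:i + w] for i in range(n)])

sent = np.random.rand(8, 200)   # 8 words, word vector length d = 200
X = window_features(sent)       # shape (8, 5, 200)
```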
The output layer of the neural network is normalized with a softmax function so that the classification result is the probability that the word is a non-named entity or each type of named entity. The structure, depth, node count, step size, activation function, and initial-value parameters of the network are tuned while training it.
Step 4: Re-window the prediction matrix output by the neural network, treat the context prediction information of each word to be labeled as the points associated with that word's actual class in a conditional random field model, compute the expected value of each edge from the training corpus with the EM algorithm, and train the corresponding conditional random field model.
Step 5: At recognition time, first convert the text to be recognized into word vectors representing word features according to the trained Word2Vec model. If the Word2Vec model does not contain a corresponding training word, convert that word into a word vector by incremental learning, vector extraction, and vector-space rollback, as follows.
(1) Look up the word to be converted in the trained word vector space.
(2) If the word matches an entry in the word vector space, convert it directly into the corresponding word vector.
(3) If the Word2Vec model does not contain the word, back up the word vector space first, so that the drift introduced by incremental learning cannot degrade the accuracy of the neural network model. Then load the Word2Vec model, fetch the sentence containing the unmatched word, feed that sentence into the Word2Vec model for incremental training, obtain the word's vector, and finally roll the model back using the backed-up word vector space (see the sketch below).
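A sketch of sub-steps (1)-(3), assuming the gensim 4.x Word2Vec API; the helper name is illustrative, and the model must be configured so a once-seen word enters the vocabulary (e.g. min_count=1):

```python
import copy
from gensim.models import Word2Vec

def vector_with_rollback(model, word, context_sentence):
    """Return a vector for `word`; if it is unseen, learn it incrementally
    from its sentence, then restore the backed-up vector space."""
    if word in model.wv:                  # (1)-(2): direct match
        return model.wv[word]
    backup = copy.deepcopy(model.wv)      # (3): back up the vector space
    model.build_vocab([context_sentence], update=True)
    model.train([context_sentence], total_examples=1, epochs=model.epochs)
    vec = model.wv[word].copy()           # vector of the new word
    model.wv = backup                     # roll the space back to avoid drift
    return vec
```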
Window the word vectors and feed the two-dimensional matrix of window size w by word vector length d into the neural network. Then re-window the prediction matrix produced by the neural network and pass it to the trained conditional random field model for disambiguation, obtaining the final named entity labels for the text.
Example
Network text was crawled from the Sogou News website, and a named entity corpus downloaded from the Datatang corpus repository served as the sample corpus. The crawled network text was segmented with natural language tools, and the segmented corpus together with the sample corpus was used to train the word vector space via the Word2Vec model in Python's gensim package, with the following parameters: word vector length 200, 25 iterations, initial step size 0.025, minimum step size 0.0001, using the CBOW model.
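A sketch of this training call with the stated parameters, assuming the gensim 4.x API (vector_size and epochs were size and iter in gensim releases contemporary with the patent); unlabeled_sentences and sample_sentences are illustrative names for the two segmented corpora:

```python
from gensim.models import Word2Vec

w2v = Word2Vec(
    sentences=unlabeled_sentences + sample_sentences,
    vector_size=200,    # word vector length 200
    sg=0,               # CBOW model
    alpha=0.025,        # initial step size
    min_alpha=0.0001,   # minimum step size
    epochs=25,          # 25 iterations
    min_count=1,        # our assumption: keep rare web words in the vocabulary
)
w2v.save("w2v_ner.model")
```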
The text of the sample corpus was converted into word vectors representing word features according to the trained Word2Vec model; for any word not covered by the Word2Vec model, the incremental learning, vector extraction, and vector-space rollback procedure above was used to obtain its vector. These vectors served as the features of each word. The "/o", "/n", "/p", and other labels in the Datatang sample corpus were converted into the named entity labels "/Org-B", "/Org-I", "/Per-B", "/Per-I", "/Loc-B", "/Loc-I", etc., and then into one-hot form as the neural network's output.
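A sketch of the label conversion; the tag inventory of seven classes (the six IOB entity tags plus O for non-entities) is our reading of the scheme above:

```python
import numpy as np

TAGS = ["O", "Org-B", "Org-I", "Per-B", "Per-I", "Loc-B", "Loc-I"]
TAG_INDEX = {t: i for i, t in enumerate(TAGS)}

def one_hot(tag):
    """Convert a named entity tag into its one-hot vector."""
    v = np.zeros(len(TAGS))
    v[TAG_INDEX[tag]] = 1.0
    return v

print(one_hot("Per-B"))  # [0. 0. 0. 1. 0. 0. 0.]
```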
The window size was set to 5: when deciding the named entity category of the current word, the word features of the word itself and of the two words on each side are fed into the neural network, so the network's input is a batchSize*1000 matrix. One tenth of the words in the sample data were withheld from training as the yardstick for the network. The output layer uses softmax normalization so that the classification result is the probability that the word is a non-named entity or each type of named entity; for now, the class with the maximum probability is taken as the final classification. After tuning the structure, depth, node count, step size, activation function, and initial values, the network achieved good accuracy with the following final parameters: 2 hidden layers, 150 hidden nodes, step size 0.01, batchSize 40, and the sigmoid activation function. This configuration yields a good classification effect: accuracy reaches 99.83%, and the F-scores for the most representative entity types, person names, place names, and organization names, reach 93.4%, 84.2%, and 80.4% respectively.
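A sketch of a classifier with the final parameters above, using Keras as the framework (the patent names none); we read "150 hidden nodes" as 150 per layer, flatten the 5 x 200 window into a length-1000 input, and use an illustrative epoch count. X and Y stand for the windowed vectors and one-hot tags built earlier:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1000,)),                   # flattened 5 x 200 window
    tf.keras.layers.Dense(150, activation="sigmoid"),
    tf.keras.layers.Dense(150, activation="sigmoid"),
    tf.keras.layers.Dense(7, activation="softmax"),  # P(tag | window)
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# validation_split=0.1 withholds the tenth of the data used as the yardstick.
model.fit(X, Y, batch_size=40, epochs=10, validation_split=0.1)
```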
The step that takes the maximum-probability class of the neural network's prediction matrix as the final classification is then removed. Instead, the probability matrix is re-windowed directly, the context prediction information of each word to be labeled is used as the points associated with that word's actual class in a conditional random field model, the expected value of each edge of the conditional random field is computed from the training corpus with the EM algorithm, and the corresponding conditional random field model is trained. After disambiguating with the conditional random field, the F-scores for person names, place names, and organization names rise to 94.8%, 85.0%, and 82.0%.
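A sketch of this disambiguation stage, with sklearn-crfsuite standing in for the CRF (it trains by L-BFGS rather than the EM procedure described above, so this is an approximation, not the patent's exact model). Each word's features are the windowed class probabilities emitted by the neural network; X_crf and y_crf stand for lists of per-sentence feature sequences and tag sequences:

```python
import sklearn_crfsuite

def prob_features(probs, w=5):
    """probs: (n_words, 7) softmax outputs for one sentence.
    Returns one feature dict per word over a w-sized probability window."""
    half, n = w // 2, len(probs)
    feats = []
    for i in range(n):
        f = {}
        for off in range(-half, half + 1):
            j = i + off
            if 0 <= j < n:
                for k, p in enumerate(probs[j]):
                    f[f"w{off}_c{k}"] = float(p)  # e.g. w-1_c3 = P(Per-B) of left word
        feats.append(f)
    return feats

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_crf, y_crf)          # train the disambiguation model
y_pred = crf.predict(X_crf)    # final named entity labels per sentence
```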
As the embodiment above shows, compared with traditional supervised named entity recognition methods, the text named entity recognition method based on neural network probabilistic disambiguation provided by the present invention uses a word vector conversion method that extracts word features incrementally without shifting the word vector space, allowing the neural network to be applied to network text full of new words and typos. Moreover, the invention re-windows the probability matrix output by the neural network and disambiguates it in context with a conditional random field model, which copes well with the frequent typos and irregular grammar of network text.
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.
Claims (6)
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710390409.2A CN107203511B (en) | 2017-05-27 | 2017-05-27 | A Network Text Named Entity Recognition Method Based on Neural Network Probabilistic Disambiguation |
PCT/CN2017/089135 WO2018218705A1 (en) | 2017-05-27 | 2017-06-20 | Method for recognizing network text named entity based on neural network probability disambiguation |
CA3039280A CA3039280C (en) | 2017-05-27 | 2017-06-20 | Method for recognizing network text named entity based on neural network probability disambiguation |
RU2019117529A RU2722571C1 (en) | 2017-05-27 | 2017-06-20 | Method of recognizing named entities in network text based on elimination of probability ambiguity in neural network |
AU2017416649A AU2017416649A1 (en) | 2017-05-27 | 2017-06-20 | Method for recognizing network text named entity based on neural network probability disambiguation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710390409.2A CN107203511B (en) | 2017-05-27 | 2017-05-27 | A Network Text Named Entity Recognition Method Based on Neural Network Probabilistic Disambiguation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107203511A CN107203511A (en) | 2017-09-26 |
CN107203511B true CN107203511B (en) | 2020-07-17 |
Family
ID=59905476
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710390409.2A Active CN107203511B (en) | 2017-05-27 | 2017-05-27 | A Network Text Named Entity Recognition Method Based on Neural Network Probabilistic Disambiguation |
Country Status (5)
Country | Link |
---|---|
CN (1) | CN107203511B (en) |
AU (1) | AU2017416649A1 (en) |
CA (1) | CA3039280C (en) |
RU (1) | RU2722571C1 (en) |
WO (1) | WO2018218705A1 (en) |
Families Citing this family (70)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107203511B (en) * | 2017-05-27 | 2020-07-17 | 中国矿业大学 | A Network Text Named Entity Recognition Method Based on Neural Network Probabilistic Disambiguation |
CN107665252B (en) * | 2017-09-27 | 2020-08-25 | 深圳证券信息有限公司 | Method and device for creating knowledge graph |
CN107908614A (en) * | 2017-10-12 | 2018-04-13 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on Bi LSTM |
CN107832289A (en) * | 2017-10-12 | 2018-03-23 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on LSTM CNN |
CN107885721A (en) * | 2017-10-12 | 2018-04-06 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on LSTM |
CN107967251A (en) * | 2017-10-12 | 2018-04-27 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on Bi-LSTM-CNN |
CN107797989A (en) * | 2017-10-16 | 2018-03-13 | 平安科技(深圳)有限公司 | Enterprise name recognition methods, electronic equipment and computer-readable recording medium |
CN107943788B (en) * | 2017-11-17 | 2021-04-06 | 平安科技(深圳)有限公司 | Enterprise abbreviation generation method and device and storage medium |
CN110019648B (en) * | 2017-12-05 | 2021-02-02 | 深圳市腾讯计算机系统有限公司 | Method and device for training data and storage medium |
CN108121702B (en) * | 2017-12-26 | 2020-11-24 | 浙江讯飞智能科技有限公司 | Method and system for evaluating and reading mathematical subjective questions |
CN108052504B (en) * | 2017-12-26 | 2020-11-20 | 浙江讯飞智能科技有限公司 | Structure analysis method and system for mathematic subjective question answer result |
CN108280062A (en) * | 2018-01-19 | 2018-07-13 | 北京邮电大学 | Entity based on deep learning and entity-relationship recognition method and device |
CN108563626B (en) * | 2018-01-22 | 2022-01-25 | 北京颐圣智能科技有限公司 | Medical text named entity recognition method and device |
CN108388559B (en) * | 2018-02-26 | 2021-11-19 | 中译语通科技股份有限公司 | Named entity identification method and system under geographic space application and computer program |
CN108763192B (en) * | 2018-04-18 | 2022-04-19 | 达而观信息科技(上海)有限公司 | Entity relation extraction method and device for text processing |
CN108805196B (en) * | 2018-06-05 | 2022-02-18 | 西安交通大学 | Automatic incremental learning method for image recognition |
RU2699687C1 (en) * | 2018-06-18 | 2019-09-09 | Общество с ограниченной ответственностью "Аби Продакшн" | Detecting text fields using neural networks |
CN109062983A (en) * | 2018-07-02 | 2018-12-21 | 北京妙医佳信息技术有限公司 | Name entity recognition method and system for medical health knowledge mapping |
CN109255119B (en) * | 2018-07-18 | 2023-04-25 | 五邑大学 | Sentence trunk analysis method and system of multi-task deep neural network based on word segmentation and named entity recognition |
CN109241520B (en) * | 2018-07-18 | 2023-05-23 | 五邑大学 | Sentence trunk analysis method and system based on multi-layer error feedback neural network for word segmentation and named entity recognition |
CN109299458B (en) * | 2018-09-12 | 2023-03-28 | 广州多益网络股份有限公司 | Entity identification method, device, equipment and storage medium |
CN109446514B (en) * | 2018-09-18 | 2024-08-20 | 平安科技(深圳)有限公司 | News entity identification model construction method and device and computer equipment |
CN109657238B (en) * | 2018-12-10 | 2023-10-13 | 宁波深擎信息科技有限公司 | Knowledge graph-based context identification completion method, system, terminal and medium |
CN109710927B (en) * | 2018-12-12 | 2022-12-20 | 东软集团股份有限公司 | Named entity identification method and device, readable storage medium and electronic equipment |
CN109670177A (en) * | 2018-12-20 | 2019-04-23 | 翼健(上海)信息科技有限公司 | One kind realizing the semantic normalized control method of medicine and control device based on LSTM |
CN109858025B (en) * | 2019-01-07 | 2023-06-13 | 鼎富智能科技有限公司 | Word segmentation method and system for address standardized corpus |
CN109767817B (en) * | 2019-01-16 | 2023-05-30 | 南通大学 | Drug potential adverse reaction discovery method based on neural network language model |
CN111563380A (en) * | 2019-01-25 | 2020-08-21 | 浙江大学 | Named entity identification method and device |
CN109800437B (en) * | 2019-01-31 | 2023-11-14 | 北京工业大学 | A named entity recognition method based on feature fusion |
CN109992629B (en) * | 2019-02-28 | 2021-08-06 | 中国科学院计算技术研究所 | A neural network relation extraction method and system incorporating entity type constraints |
CN109858041B (en) * | 2019-03-07 | 2023-02-17 | 北京百分点科技集团股份有限公司 | Named entity recognition method combining semi-supervised learning with user-defined dictionary |
CN109933801B (en) * | 2019-03-25 | 2022-03-29 | 北京理工大学 | Bidirectional LSTM named entity identification method based on predicted position attention |
CN111858838B (en) * | 2019-04-04 | 2024-12-06 | 拉扎斯网络科技(上海)有限公司 | A cuisine calibration method, device, electronic device and non-volatile storage medium |
CN110083778A (en) * | 2019-04-08 | 2019-08-02 | 清华大学 | The figure convolutional neural networks construction method and device of study separation characterization |
CN110334110A (en) * | 2019-05-28 | 2019-10-15 | 平安科技(深圳)有限公司 | Natural language classification method, device, computer equipment and storage medium |
CN110245242B (en) * | 2019-06-20 | 2022-01-18 | 北京百度网讯科技有限公司 | Medical knowledge graph construction method and device and terminal |
CN110298043B (en) * | 2019-07-03 | 2023-04-07 | 吉林大学 | Vehicle named entity identification method and system |
CN110750992B (en) * | 2019-10-09 | 2023-07-04 | 吉林大学 | Named entity recognition method, named entity recognition device, electronic equipment and named entity recognition medium |
CN110781646B (en) * | 2019-10-15 | 2023-08-22 | 泰康保险集团股份有限公司 | Name standardization method, device, medium and electronic equipment |
CN111008271B (en) * | 2019-11-20 | 2022-06-24 | 佰聆数据股份有限公司 | Neural network-based key information extraction method and system |
CN110993081B (en) * | 2019-12-03 | 2023-08-11 | 济南大学 | A doctor online recommendation method and system |
CN111091003B (en) * | 2019-12-05 | 2023-10-10 | 电子科技大学广东电子信息工程研究院 | Parallel extraction method based on knowledge graph query |
CN111209748B (en) * | 2019-12-16 | 2023-10-24 | 合肥讯飞数码科技有限公司 | Error word recognition method, related device and readable storage medium |
CN111259144A (en) * | 2020-01-16 | 2020-06-09 | 中国平安人寿保险股份有限公司 | Multi-model fusion text matching method, device, equipment and storage medium |
CN113139382A (en) * | 2020-01-20 | 2021-07-20 | 北京国双科技有限公司 | Named entity identification method and device |
CN111368545B (en) * | 2020-02-28 | 2024-04-30 | 北京明略软件系统有限公司 | Named entity recognition method and device based on multitask learning |
CN111477320B (en) * | 2020-03-11 | 2023-05-30 | 北京大学第三医院(北京大学第三临床医学院) | Treatment effect prediction model construction system, treatment effect prediction system and terminal |
CN111523323B (en) * | 2020-04-26 | 2022-08-12 | 梁华智能科技(上海)有限公司 | Disambiguation processing method and system for Chinese word segmentation |
CN111581957B (en) * | 2020-05-06 | 2022-04-12 | 浙江大学 | Nested entity detection method based on pyramid hierarchical network |
CN111476022B (en) * | 2020-05-15 | 2023-07-07 | 湖南工商大学 | Character embedding and mixed LSTM entity identification method, system and medium for entity characteristics |
CN111859937B (en) * | 2020-07-20 | 2024-07-30 | 上海汽车集团股份有限公司 | Entity identification method and device |
CN112199953B (en) * | 2020-08-24 | 2024-06-28 | 广州九四智能科技有限公司 | Method and device for extracting information in telephone call and computer equipment |
RU2760637C1 (en) * | 2020-08-31 | 2021-11-29 | Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк) | Method and system for retrieving named entities |
CN112101041B (en) * | 2020-09-08 | 2022-02-15 | 平安科技(深圳)有限公司 | Entity relationship extraction method, device, equipment and medium based on semantic similarity |
CN114330342A (en) * | 2020-10-09 | 2022-04-12 | 阿里巴巴集团控股有限公司 | Named entity identification method, device and equipment |
CN112487816B (en) * | 2020-12-14 | 2024-02-13 | 安徽大学 | Named entity identification method based on network classification |
CN112765983A (en) * | 2020-12-14 | 2021-05-07 | 四川长虹电器股份有限公司 | Entity disambiguation method based on neural network combined with knowledge description |
CN112905742B (en) * | 2021-02-20 | 2022-07-29 | 厦门吉比特网络技术股份有限公司 | Method and device for recognizing new vocabulary based on semantic model neural network |
CN113343690B (en) * | 2021-06-22 | 2024-03-12 | 北京语言大学 | Text readability automatic evaluation method and device |
CN114218924A (en) * | 2021-07-27 | 2022-03-22 | 广东电力信息科技有限公司 | Text intention and entity combined identification method based on BERT model |
CN114519355A (en) * | 2021-08-25 | 2022-05-20 | 浙江万里学院 | Medicine named entity recognition and entity standardization method |
CN113849597B (en) * | 2021-08-31 | 2024-04-30 | 艾迪恩(山东)科技有限公司 | Illegal advertisement word detection method based on named entity recognition |
CN113934815B (en) * | 2021-09-18 | 2024-10-29 | 有米科技股份有限公司 | Advertisement document characteristic information identification method and device based on neural network |
CN114021549B (en) * | 2021-10-15 | 2024-10-22 | 华中科技大学 | Chinese named entity recognition method and device based on vocabulary enhancement and multi-features |
CN114036948B (en) * | 2021-10-26 | 2024-05-31 | 天津大学 | Named entity identification method based on uncertainty quantification |
CN114048749B (en) * | 2021-11-19 | 2024-02-02 | 北京第一因科技有限公司 | Chinese named entity recognition method suitable for multiple fields |
CN114510943B (en) * | 2022-02-18 | 2024-05-28 | 北京大学 | Incremental named entity recognition method based on pseudo sample replay |
WO2023204724A1 (en) * | 2022-04-20 | 2023-10-26 | Общество С Ограниченной Ответственностью "Дентонс Юроп" (Ооо "Дентонс Юроп") | Method for analyzing a legal document |
CN115587594B (en) * | 2022-09-20 | 2023-06-30 | 广东财经大学 | Network security unstructured text data extraction model training method and system |
CN115905456B (en) * | 2023-01-06 | 2023-06-02 | 浪潮电子信息产业股份有限公司 | A data identification method, system, device and computer-readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103455581A (en) * | 2013-08-26 | 2013-12-18 | 北京理工大学 | Mass short message information filtering method based on semantic extension |
CN105740349A (en) * | 2016-01-25 | 2016-07-06 | 重庆邮电大学 | Sentiment classification method capable of combining Doc2vce with convolutional neural network |
CN105868184A (en) * | 2016-05-10 | 2016-08-17 | 大连理工大学 | Chinese name recognition method based on recurrent neural network |
CN106202032A (en) * | 2016-06-24 | 2016-12-07 | 广州数说故事信息科技有限公司 | A kind of sentiment analysis method towards microblogging short text and system thereof |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7502971B2 (en) * | 2005-10-12 | 2009-03-10 | Hewlett-Packard Development Company, L.P. | Determining a recurrent problem of a computer resource using signatures |
US8583416B2 (en) * | 2007-12-27 | 2013-11-12 | Fluential, Llc | Robust information extraction from utterances |
RU2399959C2 (en) * | 2008-10-29 | 2010-09-20 | Закрытое акционерное общество "Авикомп Сервисез" | Method for automatic text processing in natural language through semantic indexation, method for automatic processing collection of texts in natural language through semantic indexation and computer readable media |
US8239349B2 (en) * | 2010-10-07 | 2012-08-07 | Hewlett-Packard Development Company, L.P. | Extracting data |
CN105404632B (en) * | 2014-09-15 | 2020-07-31 | 深港产学研基地 | System and method for carrying out serialized annotation on biomedical text based on deep neural network |
CN104809176B (en) * | 2015-04-13 | 2018-08-07 | 中央民族大学 | Tibetan language entity relation extraction method |
CN106202044A (en) * | 2016-07-07 | 2016-12-07 | 武汉理工大学 | A kind of entity relation extraction method based on deep neural network |
CN107203511B (en) * | 2017-05-27 | 2020-07-17 | 中国矿业大学 | A Network Text Named Entity Recognition Method Based on Neural Network Probabilistic Disambiguation |
- 2017-05-27 CN CN201710390409.2A patent/CN107203511B/en active Active
- 2017-06-20 RU RU2019117529A patent/RU2722571C1/en active
- 2017-06-20 CA CA3039280A patent/CA3039280C/en active Active
- 2017-06-20 AU AU2017416649A patent/AU2017416649A1/en not_active Abandoned
- 2017-06-20 WO PCT/CN2017/089135 patent/WO2018218705A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN107203511A (en) | 2017-09-26 |
CA3039280A1 (en) | 2018-12-06 |
CA3039280C (en) | 2021-07-20 |
AU2017416649A1 (en) | 2019-05-02 |
RU2722571C1 (en) | 2020-06-01 |
WO2018218705A1 (en) | 2018-12-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107203511B (en) | A Network Text Named Entity Recognition Method Based on Neural Network Probabilistic Disambiguation | |
CN111897908B (en) | Event extraction method and system for fusing dependency information and pre-trained language model | |
CN113761936B (en) | Multi-task chapter-level event extraction method based on multi-head self-attention mechanism | |
CN112101028B (en) | Multi-feature bidirectional gating field expert entity extraction method and system | |
CN109325112B (en) | A cross-language sentiment analysis method and device based on emoji | |
CN104572958B (en) | A kind of sensitive information monitoring method based on event extraction | |
CN104598535B (en) | A kind of event extraction method based on maximum entropy | |
Dashtipour et al. | Exploiting deep learning for Persian sentiment analysis | |
CN113255320A (en) | Entity relation extraction method and device based on syntax tree and graph attention machine mechanism | |
CN106569998A (en) | Text named entity recognition method based on Bi-LSTM, CNN and CRF | |
CN110059188A (en) | A kind of Chinese sentiment analysis method based on two-way time convolutional network | |
CN111382565A (en) | Multi-label-based emotion-reason pair extraction method and system | |
CN110019839A (en) | Medical knowledge map construction method and system based on neural network and remote supervisory | |
CN104268200A (en) | Unsupervised named entity semantic disambiguation method based on deep learning | |
CN107944014A (en) | A kind of Chinese text sentiment analysis method based on deep learning | |
CN110046353B (en) | Aspect level emotion analysis method based on multi-language level mechanism | |
CN108388554A (en) | Text emotion identifying system based on collaborative filtering attention mechanism | |
CN110196963A (en) | Model generation, the method for semantics recognition, system, equipment and storage medium | |
CN115203406A (en) | RoBERTA model-based long text information ground detection method | |
CN111859979A (en) | Sarcastic text collaborative recognition method, apparatus, device, and computer-readable medium | |
CN115718792A (en) | Sensitive information extraction method based on natural semantic processing and deep learning | |
Feifei et al. | Bert-based Siamese network for semantic similarity | |
CN110851593A (en) | Complex value word vector construction method based on position and semantics | |
CN107122347A (en) | A kind of news subevent Forecasting Methodology and device based on depth learning technology | |
Rudra Murthy et al. | A deep learning solution to named entity recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |