
CN110008469A - A Multi-level Named Entity Recognition Method - Google Patents


Info

Publication number
CN110008469A
CN110008469A (application CN201910207179.0A, granted publication CN110008469B)
Authority
CN
China
Prior art keywords
sequence
entity
text
vector
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910207179.0A
Other languages
Chinese (zh)
Other versions
CN110008469B (en)
Inventor
常亮
王文凯
宾辰忠
宣闻
秦赛歌
陈源鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN201910207179.0A
Publication of CN110008469A
Application granted
Publication of CN110008469B
Legal status: Active

Classifications

    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis
    • G06N3/02 Neural networks
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The present invention proposes a multi-level named entity recognition method, comprising: S1, preprocessing the data text to obtain a vocabulary C; S2, using pre-trained word vectors, combined with the character-feature sequence of the text, to obtain a vector representation of the text; S3, encoding the vector representation of the text to obtain an encoded sequence of text feature vectors; S4, decoding the text feature vector sequence with a CRF model and labeling the entities in it; S5, taking the preceding context, the following context, and the entity's own information for each labeled entity as candidate sequences for subsequent recognition passes; S6, feeding the text feature vector sequence and the candidate sequences into an attention-based inference unit to compute an attention vector; S7, feeding the attention vector and the text feature vector sequence into the CRF model and labeling the entities in the sequence.

Description

A Multi-level Named Entity Recognition Method

Technical Field

The present invention relates to the field of natural language processing, and in particular to a multi-level named entity recognition method.

Background

As an intersection of computer science and artificial intelligence, natural language processing has advanced alongside the rapid development of the AI field. Named entity recognition (NER) is a fundamental task in natural language processing: its goal is to identify entities with specific meanings in text and classify them, the main types being person names, organization names, locations, and other proper nouns. With the massive volumes of data generated on the Internet, the NER task has drawn growing attention from both academia and industry, and it is widely used in machine translation, question answering, information retrieval, and other natural language processing tasks.

Existing named entity recognition methods include traditional rule-based, dictionary-based, and statistical approaches; the most representative statistical models are the Hidden Markov Model (HMM) and the Conditional Random Field (CRF). With the rise of deep learning, many neural-network-based methods have also emerged, such as named entity recognition with long short-term memory (LSTM) networks, and combinations of traditional and neural methods have achieved good results.

Rule-based and dictionary-based methods depend heavily on how the dictionaries and rules are constructed, so they suit only small, domain-restricted corpora; they scale poorly to large corpora and handle new vocabulary badly. Statistical methods rely on manual feature engineering, which costs considerable labor and time. Many neural-network methods alleviate these shortcomings to some extent, but for rare words or semantically ambiguous words in text, their recall and precision still need improvement.

Summary of the Invention

In view of the above shortcomings of the prior art, the purpose of the present invention is to provide a multi-level named entity recognition method. By performing multiple recognition passes with an inference unit and a storage unit, the method addresses the low accuracy of practical named entity recognition on rare or semantically ambiguous words, improving both the recall and the precision on text.

To achieve the above and other related objects, the present invention provides a multi-level named entity recognition method comprising the following steps:

S1: preprocess the data text to obtain a vocabulary C;

S2: use pre-trained word vectors, combined with the character-feature sequence of the text, to obtain a vector representation of the text;

S3: encode the vector representation of the text to obtain an encoded sequence of text feature vectors;

S4: decode the text feature vector sequence with a CRF model and label the entities in it;

S5: take the preceding context, the following context, and the entity's own information for each labeled entity as candidate sequences for subsequent recognition passes;

S6: feed the text feature vector sequence and the candidate sequences into an attention-based inference unit to compute an attention vector;

S7: feed the attention vector and the text feature vector sequence into the CRF model and label the entities in the sequence;

S8: repeat steps S5 to S7 until step S7 produces no new entities.

Optionally, preprocessing the data text to obtain the character-feature sequence of the text specifically comprises:

first splitting a long passage into sentences at full stops;

tokenizing all of the sentences;

then removing duplicate words to build the vocabulary C.

Optionally, using the pre-trained word vectors combined with the character-feature sequence of the text to obtain the vector representation of the text specifically comprises:

pre-training on a corpus to obtain a set of word vectors;

representing the input text in one-hot form as the character-feature sequence;

obtaining the word-vector representation of each token from the vocabulary C, the character-feature sequence, and the pre-trained word vectors;

applying word embedding to the tokenized sentences to obtain the vector representation X of the text.

Optionally, a BiLSTM encodes the vector representation of the text to obtain the encoded sequence of text feature vectors.

Optionally, decoding the text feature vector sequence with the CRF model and labeling the entities in it specifically comprises:

feeding the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t} into the CRF model, which computes the predicted label sequence L = {l_1, l_2, l_3, ..., l_n}, where each l is the label of one word. Every labeled span containing "B...I...E" or "S" is a recognized entity e, and the entity set is E = (e_1, e_2, e_3, ..., e_m), with m the number of entities.

Optionally, the predicted label sequence is computed as follows:

Let the tag sequence be Y = {y_1, y_2, y_3, ..., y_n}, where Y ranges over all possible tags of the BIOES scheme. The score of assigning label sequence y to input x is:

score(y, x) = Σ_{i,j} λ_j t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k s_k(y_i, x, i)

where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model, taking the value 0 or 1 and called a transition feature: given x, it indicates whether the previous tag node y_{i-1} transitions to the current tag node y_i. s_k(y_i, x, i), also taking 0 or 1 and called a state feature, indicates whether the current tag node y_i is assigned at position i of x. λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions;

Exponentiating and normalizing score(y, x) gives the conditional probability p(y|x) of tagging x with y:

p(y|x) = exp(score(y, x)) / Z(x)

where Z(x) is the normalization factor:

Z(x) = Σ_{y'} exp(score(y', x))

with y' = (y'_1, y'_2, y'_3, ..., y'_n) ranging over the possible tag sequences;

Solving with the Viterbi algorithm, the model takes the y' with the highest probability as the labeling result, denoted l, i.e. l = argmax_{y'} p(y'|x), so the predicted label sequence is L = {l_1, l_2, l_3, ..., l_n}.

Optionally, for each entity e_i (i = 1, 2, 3, ..., m) in the entity set E, the forward LSTM hidden-layer outputs and the backward LSTM hidden-layer outputs are combined to represent the entity in the form V' = [v_1, v_2, v_3, v_4], where v_1 is the entity's preceding-context information, v_2 is the entity's own information from the forward LSTM, v_3 is the entity's own information from the backward LSTM, and v_4 is the entity's following-context information. The entity sequence in the storage unit is then V = {V'_1, V'_2, V'_3, ..., V'_m'}, where m' is the number of entities already stored, and this stored entity sequence serves as the candidate sequence.

Optionally, the candidate sequence V in the storage unit and the text feature vector sequence H are input into the inference unit, which derives how strongly each stored entity in V influences the text feature vector sequence H, yielding the attention vector S;

For the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t} and the candidate sequence V = {V'_1, V'_2, V'_3, ..., V'_m'}, compute at each time step the dot product of h_i with every V'_j (j = 1, 2, 3, ..., m') to obtain the attention scores σ, where i = 1, 2, 3, ..., t;

At any time t, the attention score of the candidate sequence V for h_t is:

σ_{t,j} = h_t · V'_j

A softmax turns the attention scores into a probability distribution used as the weights α of the subsequent weighted sum:

α_t = softmax(σ_t)

A weighted sum over the candidate sequence V then gives, at any time t, the attention vector s of V for h_t:

s_t = Σ_j α_{t,j} V'_j

After computing this for all time steps, the attention vector sequence of the candidate sequence V over the text feature vector sequence H is S = {s_1, s_2, s_3, ..., s_t}.

Optionally, the method further comprises computing the similarity between each entity from step S7 and the entities from step S5; when the similarity is below a similarity threshold, the entity is treated as a new entity.

Optionally, the similarity is computed as cosine similarity:

sim(A, B) = (A · B) / (|A| |B|)
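The similarity check above can be sketched in plain Python. The threshold value 0.9 and the helper name `is_new_entity` are illustrative assumptions; the patent does not fix a threshold.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity sim(A, B) = (A·B) / (|A|·|B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def is_new_entity(vec, stored, threshold=0.9):
    """A candidate counts as new when its similarity to every stored
    entity vector stays below the (assumed) threshold."""
    return all(cosine_similarity(vec, s) < threshold for s in stored)
```

An orthogonal vector (similarity 0) would thus be accepted as a new entity, while a near-duplicate of a stored vector would be rejected.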

As described above, the multi-level named entity recognition method of the present invention has the following beneficial effects:

1. Compared with methods that perform named entity recognition only once, this method performs multiple recognition passes, which raises the recall of the named entity recognition task.

2. Existing methods suit short texts; on longer texts their effectiveness drops. The present invention designs a storage unit that stores important entity information together with its context, handling long texts this way, and designs a candidate unit to reduce the storage unit's space overhead.

3. An inference unit is designed and used during entity recognition; for rare words and semantically ambiguous words in the text, the recognition quality improves, raising the system's precision and recall.

Brief Description of the Drawings

To further illustrate the content of the present invention, specific embodiments are described in detail below with reference to the accompanying drawings. It should be understood that these drawings are typical examples only and should not be regarded as limiting the scope of the present invention.

Figure 1 is a flowchart of the named entity recognition system;

Figure 2 shows the structure of the storage unit and of the candidate unit;

Figure 3 shows the structure of the inference unit;

Figure 4 shows the system architecture;

Figure 5 shows the CRF structure;

Figure 6 shows the LSTM unit structure.

Detailed Description of the Embodiments

The embodiments of the present invention are described below through specific examples; those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or varied from different viewpoints and for different applications without departing from the spirit of the present invention. Note that, absent conflict, the following embodiments and their features may be combined with one another.

Note that the drawings provided with the following embodiments illustrate the basic concept of the present invention only schematically: they show only the components relevant to the present invention rather than the number, shape, and size of components in an actual implementation. In practice the type, quantity, and proportion of each component may vary freely, and the component layout may be more complex.

The present invention provides a multi-level named entity recognition method that, through multiple recognition passes, progressively recognizes the entities in the text that earlier passes missed. The core problems the present invention addresses are the following two:

1. A traditional named entity recognition system performs only a single pass, so some words in the result remain unrecognized or are recognized incorrectly, lowering both the recall and the precision of named entity recognition. This is especially severe when recognizing rare words, such as uncommon person or place names, and words whose semantics are unclear in the text.

2. Many current named entity recognition systems handle long texts poorly; the present invention solves this by using a storage unit to store entity-related information.

The core idea of the present invention is to save the context information of already-recognized entities together with the entities' own information: an entity's context combined with the entity itself can be seen as sentence-pattern information, and this information, used over multiple recognition passes, helps the system recognize entities in the text that have not yet been recognized. The specific method is explained in the example below.

An embodiment of the present invention is shown in Figure 1, which captures the overall idea. The labeling scheme adopted is BIOES: "B" marks the first character of a multi-character entity, "I" a middle character, "O" any non-entity word, "E" the last character of a multi-character entity, and "S" a single word that is an entity by itself; "PER" denotes a person name, "ORG" an organization name, and "LOC" a place name.

As shown in Figure 1, this embodiment provides a multi-level named entity recognition method comprising the following steps:

S1: preprocess the data text to obtain a vocabulary C;

S2: use pre-trained word vectors, combined with the character-feature sequence of the text, to obtain a vector representation of the text;

S3: encode the vector representation of the text to obtain an encoded sequence of text feature vectors;

S4: decode the text feature vector sequence with a CRF model and label the entities in it;

S5: take the preceding context, the following context, and the entity's own information for each labeled entity as candidate sequences for subsequent recognition passes;

S6: feed the text feature vector sequence and the candidate sequences into an attention-based inference unit to compute an attention vector;

S7: feed the attention vector and the text feature vector sequence into the CRF model and label the entities in the sequence;

S8: repeat steps S5 to S7 until step S7 produces no new entities.

In step S1, the data text is preprocessed as follows: a long passage is first split into sentences at full stops, and the processed sentences are stored one per line; the open-source tool jieba tokenizes every sentence into words; duplicates are then removed to obtain the vocabulary, denoted C.
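The preprocessing of step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: whitespace splitting stands in for the jieba segmenter named above, and the example text is hypothetical.

```python
def build_vocabulary(text, tokenize=None):
    """Sketch of step S1: split a long text into sentences at full stops,
    tokenize each sentence, and deduplicate tokens into a vocabulary C.
    `tokenize` stands in for a real segmenter such as jieba.cut."""
    if tokenize is None:
        tokenize = str.split  # whitespace stand-in for jieba
    # Treat the Chinese full stop and the ASCII period alike.
    sentences = [s.strip() for s in text.replace("\u3002", ".").split(".") if s.strip()]
    vocab, seen = [], set()
    for sent in sentences:
        for tok in tokenize(sent):
            if tok not in seen:  # drop duplicates, keep first-seen order
                seen.add(tok)
                vocab.append(tok)
    return sentences, vocab

sentences, C = build_vocabulary("the cat sat. the dog sat. a cat ran.")
```

Each sentence is kept as its own unit ("stored one per line"), and C contains each word exactly once.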

In step S2, pre-trained word vectors combined with the character-feature sequence of the text yield the text's vector representation. Specifically, the Chinese Wikipedia corpus is first tokenized with jieba and then pre-trained with the open-source tool word2vec, producing a set of word vectors; let the dimension of the pre-trained word vectors be d. The input text is first represented in one-hot form as the character-feature sequence; then, via the vocabulary C and the pre-trained word vectors, the word-vector representation of each token is obtained. Let a word vector be x; applying word embedding to the tokenized sentences gives the text's vector representation X, where X = {x_1, x_2, x_3, ..., x_n}, X ∈ R^(d×n), and n is the number of words after tokenization.
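The one-hot encoding and embedding lookup of step S2 can be sketched as below. The random toy vectors merely stand in for word2vec output, and the vocabulary words are hypothetical; real pre-trained vectors would be loaded from the tool named above.

```python
import random

def one_hot(token, vocab):
    """One-hot character-feature vector: 1 at the token's index in C."""
    vec = [0] * len(vocab)
    if token in vocab:
        vec[vocab.index(token)] = 1
    return vec

def embed(tokens, word_vectors, d):
    """Sketch of the embedding step: look up each token's pre-trained
    word vector; out-of-vocabulary tokens fall back to a zero vector."""
    zero = [0.0] * d
    return [word_vectors.get(tok, zero) for tok in tokens]

# Toy stand-in for word2vec output with dimension d = 4.
vocab = ['guilin', 'is', 'a', 'city']
random.seed(0)
wv = {w: [random.uniform(-1, 1) for _ in range(4)] for w in vocab}
X = embed(['guilin', 'is', 'a', 'city', 'UNK'], wv, d=4)
```

X is then the n-column, d-dimensional representation fed to the encoder of step S3.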

In step S3, the vector representation X produced by steps S1 and S2 is fed as the input sequence into a bidirectional LSTM model. An LSTM extracts text feature information, but at a given time step a word's features can depend not only on the words before it but also on the words after it, so a bidirectional LSTM is used to extract the text feature vector sequence fully from both directions. The bidirectional LSTM encodes the text's vector representation X into the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t}, the output of the bidirectional LSTM's hidden layer at each time step, formed by concatenating the forward LSTM hidden-layer output and the backward LSTM hidden-layer output at each step.

The LSTM model used in this step is shown in Figure 6; its formulas are as follows:

First the LSTM decides which information to forget and which to keep; f_t is the output of the forget gate:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

Next the LSTM determines the information to update; i_t is the output of the input (memory) gate, and C̃_t is the state value of the LSTM's candidate memory cell:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

The cell state is then updated, where C_t and C_{t-1} denote the LSTM's current cell state and the cell state at the previous time step:

C_t = f_t * C_{t-1} + i_t * C̃_t

Finally the hidden-layer information is output; o_t is the value of the output gate:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

h_t = o_t * tanh(C_t)

where x_t is the input at the current time step, h_{t-1} is the hidden-layer output of the previous step, h_t is the hidden-layer output of the current step, W is the weight matrix of each gate, b the corresponding bias, σ the sigmoid function, and tanh the hyperbolic tangent.
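The gate equations above can be traced with a pure-Python single time step. The tiny uniform weights are placeholders for learned parameters; this is a sketch of the standard LSTM cell, not the patent's trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the equations above:
      f_t = sigmoid(W_f·[h_{t-1},x_t] + b_f)   forget gate
      i_t = sigmoid(W_i·[h_{t-1},x_t] + b_i)   input gate
      C~_t = tanh(W_C·[h_{t-1},x_t] + b_C)     candidate cell state
      C_t = f_t*C_{t-1} + i_t*C~_t             cell update
      o_t = sigmoid(W_o·[h_{t-1},x_t] + b_o)   output gate
      h_t = o_t * tanh(C_t)
    W and b hold the four weight matrices/biases keyed 'f','i','C','o'."""
    z = h_prev + x_t  # list concatenation models [h_{t-1}, x_t]
    f = [sigmoid(u + bu) for u, bu in zip(matvec(W['f'], z), b['f'])]
    i = [sigmoid(u + bu) for u, bu in zip(matvec(W['i'], z), b['i'])]
    Ctil = [math.tanh(u + bu) for u, bu in zip(matvec(W['C'], z), b['C'])]
    C = [ft * cp + it * ct for ft, cp, it, ct in zip(f, C_prev, i, Ctil)]
    o = [sigmoid(u + bu) for u, bu in zip(matvec(W['o'], z), b['o'])]
    h = [ot * math.tanh(c) for ot, c in zip(o, C)]
    return h, C

# Toy run: hidden size 2, input size 2, so the concatenated vector has size 4.
W = {k: [[0.1] * 4, [0.2] * 4] for k in 'fiCo'}
b = {k: [0.0, 0.0] for k in 'fiCo'}
h, C = lstm_step([1.0, 0.5], [0.0, 0.0], [0.0, 0.0], W, b)
```

A bidirectional encoder would run this recurrence left-to-right and right-to-left and concatenate the two hidden states per position.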

In step S4, the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t} is fed into the CRF model shown in Figure 5, which computes the predicted label sequence L = {l_1, l_2, l_3, ..., l_n}, where each l is the label of one word. Every labeled span containing "B...I...E" or "S" is a recognized entity e, and the entity set is E = (e_1, e_2, e_3, ..., e_m), with m the number of entities.
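Reading entities out of a BIOES label sequence, as described above, can be sketched as follows. The sample tokens and tags are illustrative only.

```python
def extract_entities(tokens, labels):
    """Collect spans whose BIOES tags form B-...(I-...)E-... or a single
    S-..., returning (entity_text, type) pairs such as ('桂林', 'LOC')."""
    entities, current, ctype = [], [], None
    for tok, lab in zip(tokens, labels):
        tag = lab.split('-')[0]
        if tag == 'S':
            entities.append((tok, lab.split('-')[1]))
            current, ctype = [], None
        elif tag == 'B':
            current, ctype = [tok], lab.split('-')[1]
        elif tag == 'I' and current:
            current.append(tok)
        elif tag == 'E' and current:
            current.append(tok)
            entities.append((''.join(current), ctype))
            current, ctype = [], None
        else:  # 'O' or a malformed run: reset the open span
            current, ctype = [], None
    return entities

ents = extract_entities(
    ['桂', '林', '是', '城', '市'],
    ['B-LOC', 'E-LOC', 'O', 'O', 'O'])
```

The resulting entity set corresponds to E = (e_1, ..., e_m) in the text above.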

The predicted label sequence L in this step is computed as follows:

Let the tag sequence be Y = {y_1, y_2, y_3, ..., y_n}, where Y ranges over all possible tags of the BIOES scheme. The score of assigning label sequence y to input x is:

score(y, x) = Σ_{i,j} λ_j t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k s_k(y_i, x, i)

where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model, taking the value 0 or 1 and called a transition feature: given x, it indicates whether the previous tag node y_{i-1} transitions to the current tag node y_i; s_k(y_i, x, i), also taking 0 or 1 and called a state feature, indicates whether the current tag node y_i is assigned at position i of x; λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions.

Exponentiating and normalizing score(y, x) gives the conditional probability p(y|x) of tagging x with y:

p(y|x) = exp(score(y, x)) / Z(x)

Z(x) is the normalization factor:

Z(x) = Σ_{y'} exp(score(y', x))

where y' = (y'_1, y'_2, y'_3, ..., y'_n) ranges over the possible tag sequences.

Solving with the Viterbi algorithm, the model takes the y' with the highest probability as the labeling result, denoted l, i.e. l = argmax_{y'} p(y'|x), so the predicted label sequence is L = {l_1, l_2, l_3, ..., l_n}.
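The Viterbi decoding above can be sketched in a few lines. Here per-position emission scores play the role of the state-feature terms and a transition matrix the role of the transition-feature terms; the toy label set and scores are hypothetical.

```python
def viterbi(emissions, transitions, labels):
    """Return the label sequence maximizing the summed score, i.e. the
    argmax over y' of score(y', x) for a linear-chain model.
    emissions[t][y]: score of label y at position t;
    transitions[yp][y]: score of moving from label yp to label y."""
    n = len(emissions)
    # best[t][y] = best score of any path ending in label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back = [{}]
    for t in range(1, n):
        best.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda yp: best[t - 1][yp] + transitions[yp][y])
            best[t][y] = best[t - 1][prev] + transitions[prev][y] + emissions[t][y]
            back[t][y] = prev
    # trace back from the best final label
    y = max(labels, key=lambda yy: best[n - 1][yy])
    path = [y]
    for t in range(n - 1, 0, -1):
        y = back[t][y]
        path.append(y)
    return path[::-1]

labels = ['O', 'S-PER']
emis = [{'O': 0.1, 'S-PER': 2.0}, {'O': 1.5, 'S-PER': 0.2}]
trans = {'O': {'O': 0.0, 'S-PER': 0.0}, 'S-PER': {'O': 0.0, 'S-PER': -1.0}}
path = viterbi(emis, trans, labels)
```

Because the normalizer Z(x) is shared by all candidate sequences, maximizing the raw score is equivalent to maximizing p(y|x).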

In step S5: step S4 has recognized some entities, but the text may still contain unrecognized rare words or semantically ambiguous words, so the information of the entities already recognized is stored, and each entity's own information together with its context helps the system recognize these rare words.

步骤S5的操作如下:The operation of step S5 is as follows:

According to grammatical rules, a noun is usually preceded or followed by a verb or a preposition, and the pattern formed by a verb or preposition together with a noun carries rich textual feature information. Because the LSTM hidden-layer outputs contain information about every time step of the text, for each entity ei (i = 1, 2, 3, …, m) in E, the forward LSTM hidden-layer output is used to store the information of the time step preceding ei as v1 and the information of ei itself as v2, and the backward LSTM hidden-layer output is used to store the information of ei itself as v3 and the information of the time step following ei as v4. Each entity is then represented in the form V′ = [v1, v2, v3, v4] and written to the storage unit. As shown in Figure 2, v1 represents the entity's preceding context, v2 the entity information obtained from the forward LSTM, v3 the entity information obtained from the backward LSTM, and v4 the entity's following context. The entity sequence in the storage unit is then expressed as V = {V′1, V′2, V′3, …, V′m′}, where m′ is the number of entities already stored in the storage unit, and this sequence is taken as the candidate sequence V.
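As a sketch of how the four-part representation V′ = [v1, v2, v3, v4] might be assembled from the BiLSTM hidden states — the span convention, the use of the last/first hidden state of a multi-word entity, and the zero-padding at sentence boundaries are all assumptions made for illustration:

```python
import numpy as np

def build_entity_memory(fwd_h, bwd_h, spans):
    """Assemble V' = [v1, v2, v3, v4] for each recognized entity span.

    fwd_h, bwd_h: [T, d] forward / backward LSTM hidden states
    spans:        list of (start, end) token indices, end inclusive
    """
    memory = []
    T = fwd_h.shape[0]
    for start, end in spans:
        v1 = fwd_h[start - 1] if start > 0 else np.zeros_like(fwd_h[0])  # preceding context
        v2 = fwd_h[end]                                                  # entity (forward LSTM)
        v3 = bwd_h[start]                                                # entity (backward LSTM)
        v4 = bwd_h[end + 1] if end + 1 < T else np.zeros_like(bwd_h[0])  # following context
        memory.append(np.concatenate([v1, v2, v3, v4]))
    return np.stack(memory)  # candidate sequence V, one row per entity
```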

Step S6: the candidate sequence V in the storage unit and the text feature vector sequence H are fed into the inference unit, which computes the degree of influence of each entry of the candidate sequence V on the text feature vector sequence H, yielding the attention vector sequence S.

A schematic diagram of the inference unit used in this step is shown in Figure 3. Using an attention mechanism, an attention vector is obtained by computing the degree of attention that each entity in the storage unit pays to each feature vector.

For the text feature vector sequence H = {h1, h2, h3, …, ht} and the candidate sequence V = {V′1, V′2, V′3, …, V′m′}, the dot product of each hi (i = 1, 2, 3, …, t) with every V′j (j = 1, 2, 3, …, m′) is computed, giving the attention scores σ.

At any time step t, the attention score of V with respect to ht is computed as:

σt,j = ht · V′j,  j = 1, 2, 3, …, m′

The softmax function then converts the attention scores into a probability distribution, which serves as the weights α of the subsequent weighted sum:

αt = softmax(σt)

Finally, a weighted sum over V gives, for any time step t, the attention vector s of V with respect to ht:

st = Σj αt,j V′j

After the results for all time steps have been computed, the attention vector sequence of V with respect to H is S = {s1, s2, s3, …, st}.
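The three formulas above (dot-product scores, softmax weights, weighted sum) can be sketched in a few lines of numpy. The assumption that H and V live in the same feature dimension d is made here purely for illustration:

```python
import numpy as np

def memory_attention(H, V):
    """Dot-product attention of stored entities V over text features H.

    H: [T, d] text feature vectors; V: [m, d] stored entity vectors
    Returns S: [T, d], one attention vector per time step.
    """
    scores = H @ V.T                              # sigma[t, j] = h_t . V'_j
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)     # softmax over the m entities
    return alpha @ V                              # s_t = sum_j alpha[t, j] * V'_j
```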

In step S7, the text feature vector sequence H and the attention vector sequence S are concatenated in the same way as the hidden-layer states of the bidirectional LSTM; the concatenated vector is written [H:S]. Feeding the concatenated vectors into the CRF model produces a new labeling result; the CRF model used here shares its parameters with the CRF model of step S4.
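The concatenation itself is a simple feature-wise join, mirroring how the forward and backward LSTM states are combined (the shapes below are illustrative):

```python
import numpy as np

def concat_features(H, S):
    """Join text features H [T, d] and attention vectors S [T, d] along the
    feature axis, producing [H : S] of shape [T, 2d] for the shared CRF layer."""
    return np.concatenate([H, S], axis=-1)
```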

In one embodiment, the multi-level named entity recognition method further comprises: the new labels from step S7 yield corresponding new entities, which are represented in the form V″ = [v1, v2, v3, v4] using the method of step S5, in preparation for storing V″ = [v1, v2, v3, v4] in the storage unit. Storing new entity information without screening could later make the data volume of the storage unit excessive and waste storage space, so before a new entity is written to the storage unit it is first placed in a candidate unit, whose structure is shown in Figure 2. The similarity between each new word in the candidate unit and the words already in the storage unit is computed against a preset threshold β; only when the similarity is below the threshold is the new entity written to the storage unit.

The similarity between words in this step is computed with the cosine similarity method, specifically:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)

where A and B are the vector representations of the two words.
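The cosine-similarity screen against the threshold β can be sketched as follows; the function names and the example threshold value are illustrative choices, not taken from the patent:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two word/entity vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def admit_to_storage(new_v, memory, beta=0.9):
    """Write a candidate entity to the storage unit only if its similarity to
    every stored entry stays below the threshold beta."""
    return all(cosine_similarity(new_v, m) < beta for m in memory)
```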

In one embodiment, the multi-level named entity recognition method further comprises step 9: steps S5 to S8 are repeated until step S7 produces no new entities, at which point the named entity recognition process ends.

The named entity recognition process described above is illustrated with an example, shown in Figure 4:

Step 1: the input sentence is "当汤姆遇到杰瑞时李某遇到蔡某" ("When Tom meets Jerry, Li meets Cai"). The sentence is first preprocessed and segmented with the jieba Chinese word segmentation tool, giving "当/汤姆/遇到/杰瑞/时/李某/遇到/蔡某", where "/" marks the segmentation boundaries. The pre-trained word vectors then give the word vector representation of the sentence.

Step 2: the word vector representation obtained in step 1 is fed into the bidirectional LSTM, which serves as the encoder for the word vectors; the output of the LSTM hidden layer is the encoded text feature vector sequence we need.

Step 3: the text feature vectors from step 2 are fed into the CRF model of the next layer, which serves as the decoder, and the sequence is labeled, giving the following result:

when 汤姆tom 遇到meet 杰瑞Jerry Time 李某Lee 遇到meet 蔡某Cai OO S-PERS-PER OO S-PERS-PER OO OO OO OO

The result above shows that the entity person names "汤姆 (Tom)" (S-PER) and "杰瑞 (Jerry)" (S-PER) are labeled correctly and recognized, while the other two words "李某 (Li)" and "蔡某 (Cai)" are not labeled, because these two words are comparatively rare.

Observing the example sentence "当汤姆遇到杰瑞时李某遇到蔡某", one can see that although "李某" and "蔡某" are comparatively rare words, their context information closely resembles that of the already recognized entities "汤姆" and "杰瑞": the word "遇到" (meet) appears in both "汤姆遇到杰瑞" and "李某遇到蔡某". That is, "汤姆遇到杰瑞" and "李某遇到蔡某" are similar sentence patterns, and these words are entities with very similar context semantics. By storing the context information of "汤姆" and "杰瑞", this information can be used to help the system recognize the two entities "李某" and "蔡某".

Step 4: the entities "汤姆" and "杰瑞" recognized in step 3 are each formed into a candidate sequence in the format [entity preceding context, forward entity information, backward entity information, entity following context], and the candidate sequences are stored in the storage unit.

Step 5: the text feature vector sequence obtained in step 2 and the candidate sequences held in the storage unit are fed into the inference unit, which computes how strongly the own information and context information of the entities "汤姆" and "杰瑞" influence each word of the original sentence, yielding a set of attention vectors.

Step 6: the attention vectors from step 5 together with the text feature vector sequence are fed into the CRF model for decoding; with the attention vectors as reference, the CRF model now labels the two entities "李某" and "蔡某".

Step 7: "李某" and "蔡某" are placed in the candidate unit and the threshold is set to β. Computing their similarity to every word in the storage unit shows that the similarity between "李某" and "汤姆", and between "蔡某" and "杰瑞", exceeds the threshold β, so these two words are not written to the storage unit.

Step 8: after steps 4 to 7 are repeated, no new entities are produced, so named entity recognition ends; the final labeled entities are:

when 汤姆tom 遇到meet 杰瑞Jerry Time 李某Lee 遇到meet 蔡某Cai OO S-PERS-PER OO S-PERS-PER OO S-PERS-PER OO S-PERS-PER

Once the entities recognized by the above process have been stored in the storage unit, whenever later text contains information similar to an entity in the storage unit, i.e. text with a sentence pattern similar to one stored there, those entities can all be recognized, which addresses the problem of low precision and recall on long texts.

Optionally, to reduce computational cost, the system can fix the number of recognition iterations in advance instead of using "no new entities produced" as the termination condition of the named entity system. This reduces the computational cost but lowers recall and precision, so the trade-off must be weighed in practical use.

The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A multi-level named entity recognition method, characterized in that the method comprises the following steps:
S1: preprocessing the data text to obtain a vocabulary C;
S2: using pre-trained word vectors, combined with the text information feature sequence, to obtain a vector representation of the text;
S3: encoding the vector representation of the text to obtain the encoded text feature vector sequence;
S4: decoding the text feature vector sequence with a CRF model and labeling the entities in the text feature vector sequence;
S5: taking the preceding context information, the following context information, and the entity's own information of the labeled entities as the candidate sequence of the subsequent recognition process;
S6: feeding the text feature vector sequence and the candidate sequence into an inference unit based on an attention mechanism, and computing the attention vectors;
S7: feeding the attention vectors and the text feature vector sequence into the CRF model and labeling the entities in the sequence;
S8: repeating steps S5 to S7 until step S7 produces no new entities.
2. The multi-level named entity recognition method according to claim 1, characterized in that preprocessing the data text to obtain the text information feature sequence specifically comprises:
segmenting a long text into sentences, using the full stop as the delimiter;
performing word segmentation on all sentences;
then removing duplicate words to build the vocabulary C.
3. The multi-level named entity recognition method according to claim 1, characterized in that using pre-trained word vectors, combined with the text information feature sequence, to obtain the vector representation of the text specifically comprises:
pre-training on a corpus to obtain the word vectors;
representing the input text in one-hot form as the text information feature sequence;
obtaining the word vector representation of each word from the vocabulary C, the text information feature sequence, and the pre-trained word vectors;
performing word embedding on the segmented sentences to obtain the vector representation X of the text.
4. The multi-level named entity recognition method according to claim 3, characterized in that a BiLSTM is used to encode the vector representation of the text, obtaining the encoded text feature vector sequence.
5. The multi-level named entity recognition method according to claim 4, characterized in that decoding the text feature vector sequence with the CRF model and labeling the entities in the text feature vector sequence specifically comprises:
feeding the text feature vector sequence H = {h1, h2, h3, …, ht} into the CRF model, which computes the predicted label sequence L = {l1, l2, l3, …, ln}, where l denotes the label of each word; every span whose labels contain "BIE" or "S" in the labeling result is a labeled entity e, and the entity set is expressed as E = (e1, e2, e3, …, em), where m denotes the number of entities.
6. The multi-level named entity recognition method according to claim 5, characterized in that the predicted label sequence is computed as follows:
let the label sequence be Y = {y1, y2, y3, …, yn}, where Y ranges over all label sequences permitted by the BIOES tagging scheme; the score of labeling x with the sequence y is:

score(y, x) = Σ_{i,j} λ_j · t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k · s_k(y_i, x, i)

where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model describing, given x, the transition from the previous label node y_{i-1} to the current label node y_i, taking the value 0 or 1, called a transition feature; s_k(y_i, x, i) indicates whether the current label node y_i is assigned at position i of x, also taking the value 0 or 1, called a state feature; λ_j and μ_k are the weights of t_j and s_k respectively, and j and k index the feature functions;
exponentiating and normalizing score(y, x) yields the conditional probability p(y|x) of labeling x with y:

p(y|x) = exp(score(y, x)) / Z(x)

where Z(x) is the normalization factor:

Z(x) = Σ_{y′} exp(score(y′, x))

in which y′ = (y′1, y′2, y′3, …, y′n) ranges over the possible label sequences;
decoding is performed with the Viterbi algorithm: the model takes the y′ with the highest probability as the labeling result, denoted l, i.e. l = argmax_{y′} p(y′|x); the predicted label sequence is then L = {l1, l2, l3, …, ln}.
7. The multi-level named entity recognition method according to claim 6, characterized in that for each entity ei (i = 1, 2, 3, …, m) in the entity set E, the forward LSTM hidden-layer output and the backward LSTM hidden-layer output are combined to express each entity in the form V′ = [v1, v2, v3, v4], where v1 represents the entity's preceding context, v2 the entity information obtained from the forward LSTM, v3 the entity information obtained from the backward LSTM, and v4 the entity's following context; the entity sequence in the storage unit is then expressed as V = {V′1, V′2, V′3, …, V′m′}, where m′ denotes the number of entities already stored in the storage unit, and this entity sequence is taken as the candidate sequence.
8. The multi-level named entity recognition method according to claim 7, characterized in that the candidate sequence V in the storage unit and the text feature vector sequence H are fed into the inference unit, which computes the degree of influence of each entity entry in V on the text feature vector sequence H, yielding the attention vector sequence S;
for the text feature vector sequence H = {h1, h2, h3, …, ht} and the candidate sequence V = {V′1, V′2, V′3, …, V′m′}, the dot product of each hi with every V′j is computed, giving the attention scores σ, where i = 1, 2, 3, …, t and j = 1, 2, 3, …, m′;
at any time step t, the attention score of the candidate sequence V with respect to ht is computed as:

σt,j = ht · V′j

the attention scores are converted into a probability distribution serving as the weights α of the subsequent weighted sum:

αt = softmax(σt)

a weighted sum over the candidate sequence V then gives, for any time step t, the attention vector s of V with respect to ht:

st = Σj αt,j V′j

after the results for all time steps have been computed, the attention vector sequence of the candidate sequence V with respect to the text feature vector sequence H is S = {s1, s2, s3, …, st}.
9. The multi-level named entity recognition method according to claim 8, characterized in that the method further comprises: computing the similarity between the entities of step S7 and the entities of step S5, and taking an entity of step S7 as a new entity when the similarity is below a similarity threshold.
10. The multi-level named entity recognition method according to claim 9, characterized in that the similarity is computed with the cosine similarity method, specifically:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)
CN201910207179.0A 2019-03-19 2019-03-19 Multilevel named entity recognition method Active CN110008469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910207179.0A CN110008469B (en) 2019-03-19 2019-03-19 Multilevel named entity recognition method


Publications (2)

Publication Number Publication Date
CN110008469A true CN110008469A (en) 2019-07-12
CN110008469B CN110008469B (en) 2022-06-07

Family

ID=67167300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910207179.0A Active CN110008469B (en) 2019-03-19 2019-03-19 Multilevel named entity recognition method

Country Status (1)

Country Link
CN (1) CN110008469B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516241A (en) * 2019-08-26 2019-11-29 北京三快在线科技有限公司 Geographical address analytic method, device, readable storage medium storing program for executing and electronic equipment
CN110688854A (en) * 2019-09-02 2020-01-14 平安科技(深圳)有限公司 Named entity recognition method, device and computer readable storage medium
CN110852108A (en) * 2019-11-11 2020-02-28 中山大学 Joint training method, apparatus and medium for entity recognition and entity disambiguation
CN111241832A (en) * 2020-01-15 2020-06-05 北京百度网讯科技有限公司 Core entity labeling method, device and electronic device
CN111274815A (en) * 2020-01-15 2020-06-12 北京百度网讯科技有限公司 Method and device for mining entity attention points in text
CN111581957A (en) * 2020-05-06 2020-08-25 浙江大学 Nested entity detection method based on pyramid hierarchical network
CN111832293A (en) * 2020-06-24 2020-10-27 四川大学 Entity and Relation Joint Extraction Method Based on Head Entity Prediction
CN111858817A (en) * 2020-07-23 2020-10-30 中国石油大学(华东) A BiLSTM-CRF Path Inference Method for Sparse Trajectories
CN112151183A (en) * 2020-09-23 2020-12-29 上海海事大学 An entity recognition method for Chinese electronic medical records based on Lattice LSTM model
CN112185572A (en) * 2020-09-25 2021-01-05 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic device and medium
CN112307208A (en) * 2020-11-05 2021-02-02 Oppo广东移动通信有限公司 Long text classification method, terminal and computer storage medium
CN112836514A (en) * 2020-06-19 2021-05-25 合肥量圳建筑科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
WO2021114745A1 (en) * 2019-12-13 2021-06-17 华南理工大学 Named entity recognition method employing affix perception for use in social media
CN113947081A (en) * 2021-01-11 2022-01-18 复旦大学 A Chinese Named Entity Recognition System Combined with Dictionary
CN114171147A (en) * 2021-11-30 2022-03-11 中国医学科学院北京协和医院 Novel medical text preprocessing system
CN114648028A (en) * 2020-12-21 2022-06-21 阿里巴巴集团控股有限公司 Method and device for training label model, electronic equipment and storage medium
CN115146033A (en) * 2022-07-18 2022-10-04 北京龙智数科科技服务有限公司 Named entity recognition method and device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
US20170109355A1 (en) * 2015-10-16 2017-04-20 Baidu Usa Llc Systems and methods for human inspired simple question answering (hisqa)
CN106980608A (en) * 2017-03-16 2017-07-25 四川大学 A Chinese electronic medical record word segmentation and named entity recognition method and system
WO2017165038A1 (en) * 2016-03-21 2017-09-28 Amazon Technologies, Inc. Speaker verification method and system
CN107871158A (en) * 2016-09-26 2018-04-03 清华大学 A kind of knowledge mapping of binding sequence text message represents learning method and device
CN107977353A (en) * 2017-10-12 2018-05-01 北京知道未来信息技术有限公司 A kind of mixing language material name entity recognition method based on LSTM-CNN
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Name entity recognition method, device, equipment and computer readable storage medium
CN108536754A (en) * 2018-03-14 2018-09-14 四川大学 Electronic health record entity relation extraction method based on BLSTM and attention mechanism
CN109062901A (en) * 2018-08-14 2018-12-21 第四范式(北京)技术有限公司 Neural network training method and device and name entity recognition method and device
CN109062893A (en) * 2018-07-13 2018-12-21 华南理工大学 A kind of product name recognition methods based on full text attention mechanism
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian Named Entity Recognition Method and Recognition System Based on Neural Network


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GUL KHAN SAFI QAMAS等: "基于深度神经网络的命名实体识别方法研究", 《信息网络安全》 *
HUI-KANG YI等: "A Chinese Named Entity Recognition System with Neural Networks", 《ITM WEB OF CONFERENCES》 *
XIAOCHENG FENG等: "Multi-Level Cross-Lingual Attentive Neural Architecture for Low Resource Name Tagging", 《TSINGHUA SCIENCE AND TECHNOLOGY》 *
姜宇新: "基于深度学习的生物医学命名实体识别研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *
常量: "图像理解中的卷积神经网络", 《自动化学报》 *
张璞: "基于深度学习的中文微博评价对象抽取方法", 《计算机工程与设计》 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516241B (en) * 2019-08-26 2021-03-02 北京三快在线科技有限公司 Geographic address resolution method and device, readable storage medium and electronic equipment
CN110516241A (en) * 2019-08-26 2019-11-29 北京三快在线科技有限公司 Geographical address analytic method, device, readable storage medium storing program for executing and electronic equipment
CN110688854A (en) * 2019-09-02 2020-01-14 平安科技(深圳)有限公司 Named entity recognition method, device and computer readable storage medium
CN110852108A (en) * 2019-11-11 2020-02-28 中山大学 Joint training method, apparatus and medium for entity recognition and entity disambiguation
WO2021114745A1 (en) * 2019-12-13 2021-06-17 华南理工大学 Named entity recognition method employing affix perception for use in social media
CN111241832B (en) * 2020-01-15 2023-08-15 北京百度网讯科技有限公司 Core entity labeling method, device and electronic equipment
CN111274815B (en) * 2020-01-15 2024-04-12 北京百度网讯科技有限公司 Method and apparatus for mining entity attention points in text
US11775761B2 (en) 2020-01-15 2023-10-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for mining entity focus in text
CN111274815A (en) * 2020-01-15 2020-06-12 北京百度网讯科技有限公司 Method and device for mining entity attention points in text
CN111241832A (en) * 2020-01-15 2020-06-05 北京百度网讯科技有限公司 Core entity labeling method, device and electronic device
CN111581957B (en) * 2020-05-06 2022-04-12 浙江大学 A Nested Entity Detection Method Based on Pyramid Hierarchical Network
CN111581957A (en) * 2020-05-06 2020-08-25 浙江大学 Nested entity detection method based on pyramid hierarchical network
CN112836514A (en) * 2020-06-19 2021-05-25 合肥量圳建筑科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
CN111832293A (en) * 2020-06-24 2020-10-27 四川大学 Entity and Relation Joint Extraction Method Based on Head Entity Prediction
CN111832293B (en) * 2020-06-24 2023-05-26 四川大学 Entity and Relation Joint Extraction Based on Head Entity Prediction
CN111858817B (en) * 2020-07-23 2021-05-18 中国石油大学(华东) A BiLSTM-CRF Path Inference Method for Sparse Trajectories
CN111858817A (en) * 2020-07-23 2020-10-30 中国石油大学(华东) A BiLSTM-CRF Path Inference Method for Sparse Trajectories
CN112151183A (en) * 2020-09-23 2020-12-29 上海海事大学 An entity recognition method for Chinese electronic medical records based on Lattice LSTM model
CN112151183B (en) * 2020-09-23 2024-11-22 上海海事大学 An entity recognition method for Chinese electronic medical records based on Lattice LSTM model
CN112185572A (en) * 2020-09-25 2021-01-05 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic device and medium
CN112185572B (en) * 2020-09-25 2024-03-01 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic equipment and medium
CN112307208A (en) * 2020-11-05 2021-02-02 Oppo广东移动通信有限公司 Long text classification method, terminal and computer storage medium
CN114648028A (en) * 2020-12-21 2022-06-21 阿里巴巴集团控股有限公司 Method and device for training label model, electronic equipment and storage medium
CN113947081A (en) * 2021-01-11 2022-01-18 复旦大学 A Chinese Named Entity Recognition System Combined with Dictionary
CN114171147A (en) * 2021-11-30 2022-03-11 中国医学科学院北京协和医院 Novel medical text preprocessing system
CN115146033A (en) * 2022-07-18 2022-10-04 北京龙智数科科技服务有限公司 Named entity recognition method and device

Also Published As

Publication number Publication date
CN110008469B (en) 2022-06-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant