
CN110008469A - A Multi-level Named Entity Recognition Method - Google Patents


Info

Publication number
CN110008469A
CN110008469A (application CN201910207179.0A, granted publication CN110008469B)
Authority
CN
China
Prior art keywords
sequence
entity
text
vector
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910207179.0A
Other languages
Chinese (zh)
Other versions
CN110008469B (en)
Inventor
常亮
王文凯
宾辰忠
宣闻
秦赛歌
陈源鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN201910207179.0A
Publication of CN110008469A
Application granted
Publication of CN110008469B
Legal status: Active

Classifications

    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis
    • G06N3/02 Neural networks
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The present invention proposes a multi-level named entity recognition method, comprising: S1, preprocessing the data text to obtain a vocabulary C; S2, using pre-trained word vectors, combined with the character-feature sequence of the text, to obtain a vector representation of the text; S3, encoding the vector representation of the text to obtain an encoded sequence of text feature vectors; S4, decoding the text feature vector sequence with a CRF model and labeling the entities in it; S5, taking the preceding context, the following context, and the entity's own information for each labeled entity as candidate sequences for subsequent recognition passes; S6, feeding the text feature vector sequence and the candidate sequences into an attention-based inference unit to compute an attention vector; S7, feeding the attention vector and the text feature vector sequence into the CRF model and labeling the entities in the sequence.

Description

A Multi-level Named Entity Recognition Method

Technical Field

The present invention relates to the field of natural language processing, and in particular to a multi-level named entity recognition method.

Background

As an intersection of computer science and artificial intelligence, natural language processing has advanced alongside the rapid development of the AI field. Named entity recognition (NER) is a fundamental task in natural language processing: its goal is to identify entities with specific meanings in text and classify them, the main types being person names, organization names, locations, and other proper nouns. With the massive volumes of data generated on the Internet, the NER task has drawn growing attention from both academia and industry, and it is widely used in machine translation, question answering, information retrieval, and other natural language processing tasks.

Existing named entity recognition methods include traditional rule-based, dictionary-based, and statistical approaches; the most representative statistical models are the Hidden Markov Model (HMM) and the Conditional Random Field (CRF). With the rise of deep learning, many neural-network-based methods have also emerged, such as named entity recognition with long short-term memory (LSTM) networks, and combinations of traditional and neural methods have achieved good results.

Rule-based and dictionary-based methods depend heavily on how the dictionaries and rules are constructed, so they suit only small, domain-restricted corpora; they scale poorly to large corpora and handle new vocabulary badly. Statistical methods rely on manual feature engineering, which costs considerable labor and time. Many neural-network methods alleviate these shortcomings to some extent, but for rare words or semantically ambiguous words in text, their recall and precision still need improvement.

Summary of the Invention

In view of the above shortcomings of the prior art, the purpose of the present invention is to provide a multi-level named entity recognition method. By performing multiple recognition passes with an inference unit and a storage unit, the method addresses the low accuracy of practical named entity recognition on rare or semantically ambiguous words, improving both the recall and the precision on text.

To achieve the above and other related objects, the present invention provides a multi-level named entity recognition method comprising the following steps:

S1: preprocess the data text to obtain a vocabulary C;

S2: use pre-trained word vectors, combined with the character-feature sequence of the text, to obtain a vector representation of the text;

S3: encode the vector representation of the text to obtain an encoded sequence of text feature vectors;

S4: decode the text feature vector sequence with a CRF model and label the entities in it;

S5: take the preceding context, the following context, and the entity's own information for each labeled entity as candidate sequences for subsequent recognition passes;

S6: feed the text feature vector sequence and the candidate sequences into an attention-based inference unit to compute an attention vector;

S7: feed the attention vector and the text feature vector sequence into the CRF model and label the entities in the sequence;

S8: repeat steps S5 to S7 until step S7 produces no new entities.

Optionally, preprocessing the data text to obtain the character-feature sequence of the text specifically comprises:

first splitting a long passage into sentences at full stops;

tokenizing all of the sentences;

then removing duplicate words to build the vocabulary C.

Optionally, using the pre-trained word vectors combined with the character-feature sequence of the text to obtain the vector representation of the text specifically comprises:

pre-training on a corpus to obtain a set of word vectors;

representing the input text in one-hot form as the character-feature sequence;

obtaining the word-vector representation of each token from the vocabulary C, the character-feature sequence, and the pre-trained word vectors;

applying word embedding to the tokenized sentences to obtain the vector representation X of the text.

Optionally, a BiLSTM encodes the vector representation of the text to obtain the encoded sequence of text feature vectors.

Optionally, decoding the text feature vector sequence with the CRF model and labeling the entities in it specifically comprises:

feeding the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t} into the CRF model, which computes the predicted label sequence L = {l_1, l_2, l_3, ..., l_n}, where each l is the label of one word. Every labeled span containing "B...I...E" or "S" is a recognized entity e, and the entity set is E = (e_1, e_2, e_3, ..., e_m), with m the number of entities.

Optionally, the predicted label sequence is computed as follows:

Let the tag sequence be Y = {y_1, y_2, y_3, ..., y_n}, where Y ranges over all possible tags of the BIOES scheme. The score of assigning label sequence y to input x is:

score(y, x) = Σ_{i,j} λ_j t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k s_k(y_i, x, i)

where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model, taking the value 0 or 1 and called a transition feature: given x, it indicates whether the previous tag node y_{i-1} transitions to the current tag node y_i. s_k(y_i, x, i), also taking 0 or 1 and called a state feature, indicates whether the current tag node y_i is assigned at position i of x. λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions;

Exponentiating and normalizing score(y, x) gives the conditional probability p(y|x) of tagging x with y:

p(y|x) = exp(score(y, x)) / Z(x)

where Z(x) is the normalization factor:

Z(x) = Σ_{y'} exp(score(y', x))

with y' = (y'_1, y'_2, y'_3, ..., y'_n) ranging over the possible tag sequences;

Solving with the Viterbi algorithm, the model takes the y' with the highest probability as the labeling result, denoted l, i.e. l = argmax_{y'} p(y'|x), so the predicted label sequence is L = {l_1, l_2, l_3, ..., l_n}.

Optionally, for each entity e_i (i = 1, 2, 3, ..., m) in the entity set E, the forward LSTM hidden-layer outputs and the backward LSTM hidden-layer outputs are combined to represent the entity in the form V' = [v_1, v_2, v_3, v_4], where v_1 is the entity's preceding-context information, v_2 is the entity's own information from the forward LSTM, v_3 is the entity's own information from the backward LSTM, and v_4 is the entity's following-context information. The entity sequence in the storage unit is then V = {V'_1, V'_2, V'_3, ..., V'_m'}, where m' is the number of entities already stored, and this stored entity sequence serves as the candidate sequence.

Optionally, the candidate sequence V in the storage unit and the text feature vector sequence H are input into the inference unit, which derives how strongly each stored entity in V influences the text feature vector sequence H, yielding the attention vector S;

For the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t} and the candidate sequence V = {V'_1, V'_2, V'_3, ..., V'_m'}, compute at each time step the dot product of h_i with every V'_j (j = 1, 2, 3, ..., m') to obtain the attention scores σ, where i = 1, 2, 3, ..., t;

At any time t, the attention score of the candidate sequence V for h_t is:

σ_{t,j} = h_t · V'_j

A softmax turns the attention scores into a probability distribution used as the weights α of the subsequent weighted sum:

α_t = softmax(σ_t)

A weighted sum over the candidate sequence V then gives, at any time t, the attention vector s of V for h_t:

s_t = Σ_j α_{t,j} V'_j

After computing this for all time steps, the attention vector sequence of the candidate sequence V over the text feature vector sequence H is S = {s_1, s_2, s_3, ..., s_t}.

Optionally, the method further comprises computing the similarity between each entity from step S7 and the entities from step S5; when the similarity is below a similarity threshold, the entity is treated as a new entity.

Optionally, the similarity is computed as cosine similarity:

sim(A, B) = (A · B) / (|A| |B|)
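The similarity check above can be sketched in plain Python. The threshold value 0.9 and the helper name `is_new_entity` are illustrative assumptions; the patent does not fix a threshold.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity sim(A, B) = (A·B) / (|A|·|B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def is_new_entity(vec, stored, threshold=0.9):
    """A candidate counts as new when its similarity to every stored
    entity vector stays below the (assumed) threshold."""
    return all(cosine_similarity(vec, s) < threshold for s in stored)
```

An orthogonal vector (similarity 0) would thus be accepted as a new entity, while a near-duplicate of a stored vector would be rejected.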

As described above, the multi-level named entity recognition method of the present invention has the following beneficial effects:

1. Compared with methods that perform named entity recognition only once, this method performs multiple recognition passes, which raises the recall of the named entity recognition task.

2. Existing methods suit short texts; on longer texts their effectiveness drops. The present invention designs a storage unit that stores important entity information together with its context, handling long texts this way, and designs a candidate unit to reduce the storage unit's space overhead.

3. An inference unit is designed and used during entity recognition; for rare words and semantically ambiguous words in the text, the recognition quality improves, raising the system's precision and recall.

Brief Description of the Drawings

To further illustrate the content of the present invention, specific embodiments are described in detail below with reference to the accompanying drawings. It should be understood that these drawings are typical examples only and should not be regarded as limiting the scope of the present invention.

Figure 1 is a flowchart of the named entity recognition system;

Figure 2 shows the structure of the storage unit and of the candidate unit;

Figure 3 shows the structure of the inference unit;

Figure 4 shows the system architecture;

Figure 5 shows the CRF structure;

Figure 6 shows the LSTM unit structure.

Detailed Description of the Embodiments

The embodiments of the present invention are described below through specific examples; those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or varied from different viewpoints and for different applications without departing from the spirit of the present invention. Note that, absent conflict, the following embodiments and their features may be combined with one another.

Note that the drawings provided with the following embodiments illustrate the basic concept of the present invention only schematically: they show only the components relevant to the present invention rather than the number, shape, and size of components in an actual implementation. In practice the type, quantity, and proportion of each component may vary freely, and the component layout may be more complex.

The present invention provides a multi-level named entity recognition method that, through multiple recognition passes, progressively recognizes the entities in the text that earlier passes missed. The core problems the present invention addresses are the following two:

1. A traditional named entity recognition system performs only a single pass, so some words in the result remain unrecognized or are recognized incorrectly, lowering both the recall and the precision of named entity recognition. This is especially severe when recognizing rare words, such as uncommon person or place names, and words whose semantics are unclear in the text.

2. Many current named entity recognition systems handle long texts poorly; the present invention solves this by using a storage unit to store entity-related information.

The core idea of the present invention is to save the context information of already-recognized entities together with the entities' own information: an entity's context combined with the entity itself can be seen as sentence-pattern information, and this information, used over multiple recognition passes, helps the system recognize entities in the text that have not yet been recognized. The specific method is explained in the example below.

An embodiment of the present invention is shown in Figure 1, which captures the overall idea. The labeling scheme adopted is BIOES: "B" marks the first character of a multi-character entity, "I" a middle character, "O" any non-entity word, "E" the last character of a multi-character entity, and "S" a single word that is an entity by itself; "PER" denotes a person name, "ORG" an organization name, and "LOC" a place name.

As shown in Figure 1, this embodiment provides a multi-level named entity recognition method comprising the following steps:

S1: preprocess the data text to obtain a vocabulary C;

S2: use pre-trained word vectors, combined with the character-feature sequence of the text, to obtain a vector representation of the text;

S3: encode the vector representation of the text to obtain an encoded sequence of text feature vectors;

S4: decode the text feature vector sequence with a CRF model and label the entities in it;

S5: take the preceding context, the following context, and the entity's own information for each labeled entity as candidate sequences for subsequent recognition passes;

S6: feed the text feature vector sequence and the candidate sequences into an attention-based inference unit to compute an attention vector;

S7: feed the attention vector and the text feature vector sequence into the CRF model and label the entities in the sequence;

S8: repeat steps S5 to S7 until step S7 produces no new entities.

In step S1, the data text is preprocessed as follows: a long passage is first split into sentences at full stops, and the processed sentences are stored one per line; the open-source tool jieba tokenizes every sentence into words; duplicates are then removed to obtain the vocabulary, denoted C.
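The preprocessing of step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: whitespace splitting stands in for the jieba segmenter named above, and the example text is hypothetical.

```python
def build_vocabulary(text, tokenize=None):
    """Sketch of step S1: split a long text into sentences at full stops,
    tokenize each sentence, and deduplicate tokens into a vocabulary C.
    `tokenize` stands in for a real segmenter such as jieba.cut."""
    if tokenize is None:
        tokenize = str.split  # whitespace stand-in for jieba
    # Treat the Chinese full stop and the ASCII period alike.
    sentences = [s.strip() for s in text.replace("\u3002", ".").split(".") if s.strip()]
    vocab, seen = [], set()
    for sent in sentences:
        for tok in tokenize(sent):
            if tok not in seen:  # drop duplicates, keep first-seen order
                seen.add(tok)
                vocab.append(tok)
    return sentences, vocab

sentences, C = build_vocabulary("the cat sat. the dog sat. a cat ran.")
```

Each sentence is kept as its own unit ("stored one per line"), and C contains each word exactly once.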

In step S2, pre-trained word vectors combined with the character-feature sequence of the text yield the text's vector representation. Specifically, the Chinese Wikipedia corpus is first tokenized with jieba and then pre-trained with the open-source tool word2vec, producing a set of word vectors; let the dimension of the pre-trained word vectors be d. The input text is first represented in one-hot form as the character-feature sequence; then, via the vocabulary C and the pre-trained word vectors, the word-vector representation of each token is obtained. Let a word vector be x; applying word embedding to the tokenized sentences gives the text's vector representation X, where X = {x_1, x_2, x_3, ..., x_n}, X ∈ R^(d×n), and n is the number of words after tokenization.
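The one-hot encoding and embedding lookup of step S2 can be sketched as below. The random toy vectors merely stand in for word2vec output, and the vocabulary words are hypothetical; real pre-trained vectors would be loaded from the tool named above.

```python
import random

def one_hot(token, vocab):
    """One-hot character-feature vector: 1 at the token's index in C."""
    vec = [0] * len(vocab)
    if token in vocab:
        vec[vocab.index(token)] = 1
    return vec

def embed(tokens, word_vectors, d):
    """Sketch of the embedding step: look up each token's pre-trained
    word vector; out-of-vocabulary tokens fall back to a zero vector."""
    zero = [0.0] * d
    return [word_vectors.get(tok, zero) for tok in tokens]

# Toy stand-in for word2vec output with dimension d = 4.
vocab = ['guilin', 'is', 'a', 'city']
random.seed(0)
wv = {w: [random.uniform(-1, 1) for _ in range(4)] for w in vocab}
X = embed(['guilin', 'is', 'a', 'city', 'UNK'], wv, d=4)
```

X is then the n-column, d-dimensional representation fed to the encoder of step S3.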

In step S3, the vector representation X produced by steps S1 and S2 is fed as the input sequence into a bidirectional LSTM model. An LSTM extracts text feature information, but at a given time step a word's features can depend not only on the words before it but also on the words after it, so a bidirectional LSTM is used to extract the text feature vector sequence fully from both directions. The bidirectional LSTM encodes the text's vector representation X into the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t}, the output of the bidirectional LSTM's hidden layer at each time step, formed by concatenating the forward LSTM hidden-layer output and the backward LSTM hidden-layer output at each step.

The LSTM model used in this step is shown in Figure 6; its formulas are as follows:

First the LSTM decides which information to forget and which to keep; f_t is the output of the forget gate:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

Next the LSTM determines the information to update; i_t is the output of the input (memory) gate, and C̃_t is the state value of the LSTM's candidate memory cell:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

The cell state is then updated, where C_t and C_{t-1} denote the LSTM's current cell state and the cell state at the previous time step:

C_t = f_t * C_{t-1} + i_t * C̃_t

Finally the hidden-layer information is output; o_t is the value of the output gate:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

h_t = o_t * tanh(C_t)

where x_t is the input at the current time step, h_{t-1} is the hidden-layer output of the previous step, h_t is the hidden-layer output of the current step, W is the weight matrix of each gate, b the corresponding bias, σ the sigmoid function, and tanh the hyperbolic tangent.
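The gate equations above can be traced with a pure-Python single time step. The tiny uniform weights are placeholders for learned parameters; this is a sketch of the standard LSTM cell, not the patent's trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the equations above:
      f_t = sigmoid(W_f·[h_{t-1},x_t] + b_f)   forget gate
      i_t = sigmoid(W_i·[h_{t-1},x_t] + b_i)   input gate
      C~_t = tanh(W_C·[h_{t-1},x_t] + b_C)     candidate cell state
      C_t = f_t*C_{t-1} + i_t*C~_t             cell update
      o_t = sigmoid(W_o·[h_{t-1},x_t] + b_o)   output gate
      h_t = o_t * tanh(C_t)
    W and b hold the four weight matrices/biases keyed 'f','i','C','o'."""
    z = h_prev + x_t  # list concatenation models [h_{t-1}, x_t]
    f = [sigmoid(u + bu) for u, bu in zip(matvec(W['f'], z), b['f'])]
    i = [sigmoid(u + bu) for u, bu in zip(matvec(W['i'], z), b['i'])]
    Ctil = [math.tanh(u + bu) for u, bu in zip(matvec(W['C'], z), b['C'])]
    C = [ft * cp + it * ct for ft, cp, it, ct in zip(f, C_prev, i, Ctil)]
    o = [sigmoid(u + bu) for u, bu in zip(matvec(W['o'], z), b['o'])]
    h = [ot * math.tanh(c) for ot, c in zip(o, C)]
    return h, C

# Toy run: hidden size 2, input size 2, so the concatenated vector has size 4.
W = {k: [[0.1] * 4, [0.2] * 4] for k in 'fiCo'}
b = {k: [0.0, 0.0] for k in 'fiCo'}
h, C = lstm_step([1.0, 0.5], [0.0, 0.0], [0.0, 0.0], W, b)
```

A bidirectional encoder would run this recurrence left-to-right and right-to-left and concatenate the two hidden states per position.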

In step S4, the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t} is fed into the CRF model shown in Figure 5, which computes the predicted label sequence L = {l_1, l_2, l_3, ..., l_n}, where each l is the label of one word. Every labeled span containing "B...I...E" or "S" is a recognized entity e, and the entity set is E = (e_1, e_2, e_3, ..., e_m), with m the number of entities.
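Reading entities out of a BIOES label sequence, as described above, can be sketched as follows. The sample tokens and tags are illustrative only.

```python
def extract_entities(tokens, labels):
    """Collect spans whose BIOES tags form B-...(I-...)E-... or a single
    S-..., returning (entity_text, type) pairs such as ('桂林', 'LOC')."""
    entities, current, ctype = [], [], None
    for tok, lab in zip(tokens, labels):
        tag = lab.split('-')[0]
        if tag == 'S':
            entities.append((tok, lab.split('-')[1]))
            current, ctype = [], None
        elif tag == 'B':
            current, ctype = [tok], lab.split('-')[1]
        elif tag == 'I' and current:
            current.append(tok)
        elif tag == 'E' and current:
            current.append(tok)
            entities.append((''.join(current), ctype))
            current, ctype = [], None
        else:  # 'O' or a malformed run: reset the open span
            current, ctype = [], None
    return entities

ents = extract_entities(
    ['桂', '林', '是', '城', '市'],
    ['B-LOC', 'E-LOC', 'O', 'O', 'O'])
```

The resulting entity set corresponds to E = (e_1, ..., e_m) in the text above.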

The predicted label sequence L in this step is computed as follows:

Let the tag sequence be Y = {y_1, y_2, y_3, ..., y_n}, where Y ranges over all possible tags of the BIOES scheme. The score of assigning label sequence y to input x is:

score(y, x) = Σ_{i,j} λ_j t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k s_k(y_i, x, i)

where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model, taking the value 0 or 1 and called a transition feature: given x, it indicates whether the previous tag node y_{i-1} transitions to the current tag node y_i; s_k(y_i, x, i), also taking 0 or 1 and called a state feature, indicates whether the current tag node y_i is assigned at position i of x; λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions.

Exponentiating and normalizing score(y, x) gives the conditional probability p(y|x) of tagging x with y:

p(y|x) = exp(score(y, x)) / Z(x)

Z(x) is the normalization factor:

Z(x) = Σ_{y'} exp(score(y', x))

where y' = (y'_1, y'_2, y'_3, ..., y'_n) ranges over the possible tag sequences.

Solving with the Viterbi algorithm, the model takes the y' with the highest probability as the labeling result, denoted l, i.e. l = argmax_{y'} p(y'|x), so the predicted label sequence is L = {l_1, l_2, l_3, ..., l_n}.
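The Viterbi decoding above can be sketched in a few lines. Here per-position emission scores play the role of the state-feature terms and a transition matrix the role of the transition-feature terms; the toy label set and scores are hypothetical.

```python
def viterbi(emissions, transitions, labels):
    """Return the label sequence maximizing the summed score, i.e. the
    argmax over y' of score(y', x) for a linear-chain model.
    emissions[t][y]: score of label y at position t;
    transitions[yp][y]: score of moving from label yp to label y."""
    n = len(emissions)
    # best[t][y] = best score of any path ending in label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back = [{}]
    for t in range(1, n):
        best.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda yp: best[t - 1][yp] + transitions[yp][y])
            best[t][y] = best[t - 1][prev] + transitions[prev][y] + emissions[t][y]
            back[t][y] = prev
    # trace back from the best final label
    y = max(labels, key=lambda yy: best[n - 1][yy])
    path = [y]
    for t in range(n - 1, 0, -1):
        y = back[t][y]
        path.append(y)
    return path[::-1]

labels = ['O', 'S-PER']
emis = [{'O': 0.1, 'S-PER': 2.0}, {'O': 1.5, 'S-PER': 0.2}]
trans = {'O': {'O': 0.0, 'S-PER': 0.0}, 'S-PER': {'O': 0.0, 'S-PER': -1.0}}
path = viterbi(emis, trans, labels)
```

Because the normalizer Z(x) is shared by all candidate sequences, maximizing the raw score is equivalent to maximizing p(y|x).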

In step S5: step S4 has recognized some entities, but the text may still contain unrecognized rare words or semantically ambiguous words, so the information of the entities already recognized is stored, and each entity's own information together with its context helps the system recognize these rare words.

步骤S5的操作如下:The operation of step S5 is as follows:

According to grammatical rules, a noun is usually preceded or followed by a verb or a preposition, and the pattern formed by a verb or preposition together with a noun carries rich textual feature information. Because the LSTM hidden-layer outputs contain information about every time step of the text, for each entity ei (i = 1, 2, 3, …, m) in E, the forward LSTM hidden-layer output is used to store the information of the time step preceding ei as v1 and the information of ei itself as v2, and the backward LSTM hidden-layer output is used to store the information of ei itself as v3 and the information of the time step following ei as v4. Each entity is then represented in the form V′ = [v1, v2, v3, v4] and written to the storage unit. As shown in Figure 2, v1 represents the entity's preceding context, v2 the entity information obtained from the forward LSTM, v3 the entity information obtained from the backward LSTM, and v4 the entity's following context. The entity sequence in the storage unit is then expressed as V = {V′1, V′2, V′3, …, V′m′}, where m′ is the number of entities already stored in the storage unit, and this sequence is taken as the candidate sequence V.
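As a sketch of how the four-part representation V′ = [v1, v2, v3, v4] might be assembled from the BiLSTM hidden states — the span convention, the use of the last/first hidden state of a multi-word entity, and the zero-padding at sentence boundaries are all assumptions made for illustration:

```python
import numpy as np

def build_entity_memory(fwd_h, bwd_h, spans):
    """Assemble V' = [v1, v2, v3, v4] for each recognized entity span.

    fwd_h, bwd_h: [T, d] forward / backward LSTM hidden states
    spans:        list of (start, end) token indices, end inclusive
    """
    memory = []
    T = fwd_h.shape[0]
    for start, end in spans:
        v1 = fwd_h[start - 1] if start > 0 else np.zeros_like(fwd_h[0])  # preceding context
        v2 = fwd_h[end]                                                  # entity (forward LSTM)
        v3 = bwd_h[start]                                                # entity (backward LSTM)
        v4 = bwd_h[end + 1] if end + 1 < T else np.zeros_like(bwd_h[0])  # following context
        memory.append(np.concatenate([v1, v2, v3, v4]))
    return np.stack(memory)  # candidate sequence V, one row per entity
```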

Step S6: the candidate sequence V in the storage unit and the text feature vector sequence H are fed into the inference unit, which computes the degree of influence of each entry of the candidate sequence V on the text feature vector sequence H, yielding the attention vector sequence S.

A schematic diagram of the inference unit used in this step is shown in Figure 3. Using an attention mechanism, an attention vector is obtained by computing the degree of attention that each entity in the storage unit pays to each feature vector.

For the text feature vector sequence H = {h1, h2, h3, …, ht} and the candidate sequence V = {V′1, V′2, V′3, …, V′m′}, the dot product of each hi (i = 1, 2, 3, …, t) with every V′j (j = 1, 2, 3, …, m′) is computed, giving the attention scores σ.

At any time step t, the attention score of V with respect to ht is computed as:

σt,j = ht · V′j,  j = 1, 2, 3, …, m′

The softmax function then converts the attention scores into a probability distribution, which serves as the weights α of the subsequent weighted sum:

αt = softmax(σt)

Finally, a weighted sum over V gives, for any time step t, the attention vector s of V with respect to ht:

st = Σj αt,j V′j

After the results for all time steps have been computed, the attention vector sequence of V with respect to H is S = {s1, s2, s3, …, st}.
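The three formulas above (dot-product scores, softmax weights, weighted sum) can be sketched in a few lines of numpy. The assumption that H and V live in the same feature dimension d is made here purely for illustration:

```python
import numpy as np

def memory_attention(H, V):
    """Dot-product attention of stored entities V over text features H.

    H: [T, d] text feature vectors; V: [m, d] stored entity vectors
    Returns S: [T, d], one attention vector per time step.
    """
    scores = H @ V.T                              # sigma[t, j] = h_t . V'_j
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)     # softmax over the m entities
    return alpha @ V                              # s_t = sum_j alpha[t, j] * V'_j
```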

In step S7, the text feature vector sequence H and the attention vector sequence S are concatenated in the same way as the hidden-layer states of the bidirectional LSTM; the concatenated vector is written [H:S]. Feeding the concatenated vectors into the CRF model produces a new labeling result; the CRF model used here shares its parameters with the CRF model of step S4.
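The concatenation itself is a simple feature-wise join, mirroring how the forward and backward LSTM states are combined (the shapes below are illustrative):

```python
import numpy as np

def concat_features(H, S):
    """Join text features H [T, d] and attention vectors S [T, d] along the
    feature axis, producing [H : S] of shape [T, 2d] for the shared CRF layer."""
    return np.concatenate([H, S], axis=-1)
```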

In one embodiment, the multi-level named entity recognition method further comprises: the new labels from step S7 yield corresponding new entities, which are represented in the form V″ = [v1, v2, v3, v4] using the method of step S5, in preparation for storing V″ = [v1, v2, v3, v4] in the storage unit. Storing new entity information without screening could later make the data volume of the storage unit excessive and waste storage space, so before a new entity is written to the storage unit it is first placed in a candidate unit, whose structure is shown in Figure 2. The similarity between each new word in the candidate unit and the words already in the storage unit is computed against a preset threshold β; only when the similarity is below the threshold is the new entity written to the storage unit.

The similarity between words in this step is computed with the cosine similarity method, specifically:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)

where A and B are the vector representations of the two words.
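The cosine-similarity screen against the threshold β can be sketched as follows; the function names and the example threshold value are illustrative choices, not taken from the patent:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two word/entity vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def admit_to_storage(new_v, memory, beta=0.9):
    """Write a candidate entity to the storage unit only if its similarity to
    every stored entry stays below the threshold beta."""
    return all(cosine_similarity(new_v, m) < beta for m in memory)
```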

In one embodiment, the multi-level named entity recognition method further comprises step 9: steps S5 to S8 are repeated until step S7 produces no new entities, at which point the named entity recognition process ends.

The named entity recognition process described above is illustrated with an example, shown in Figure 4:

Step 1: the input sentence is "当汤姆遇到杰瑞时李某遇到蔡某" ("When Tom meets Jerry, Li meets Cai"). The sentence is first preprocessed and segmented with the jieba Chinese word segmentation tool, giving "当/汤姆/遇到/杰瑞/时/李某/遇到/蔡某", where "/" marks the segmentation boundaries. The pre-trained word vectors then give the word vector representation of the sentence.

Step 2: the word vector representation obtained in step 1 is fed into the bidirectional LSTM, which serves as the encoder for the word vectors; the output of the LSTM hidden layer is the encoded text feature vector sequence we need.

Step 3: the text feature vectors from step 2 are fed into the CRF model of the next layer, which serves as the decoder, and the sequence is labeled, giving the following result:

when 汤姆tom 遇到meet 杰瑞Jerry Time 李某Lee 遇到meet 蔡某Cai OO S-PERS-PER OO S-PERS-PER OO OO OO OO

The result above shows that the entity person names "汤姆 (Tom)" (S-PER) and "杰瑞 (Jerry)" (S-PER) are labeled correctly and recognized, while the other two words "李某 (Li)" and "蔡某 (Cai)" are not labeled, because these two words are comparatively rare.

Observing the example sentence "当汤姆遇到杰瑞时李某遇到蔡某", one can see that although "李某" and "蔡某" are comparatively rare words, their context information closely resembles that of the already recognized entities "汤姆" and "杰瑞": the word "遇到" (meet) appears in both "汤姆遇到杰瑞" and "李某遇到蔡某". That is, "汤姆遇到杰瑞" and "李某遇到蔡某" are similar sentence patterns, and these words are entities with very similar context semantics. By storing the context information of "汤姆" and "杰瑞", this information can be used to help the system recognize the two entities "李某" and "蔡某".

Step 4: the entities "汤姆" and "杰瑞" recognized in step 3 are each formed into a candidate sequence in the format [entity preceding context, forward entity information, backward entity information, entity following context], and the candidate sequences are stored in the storage unit.

Step 5: the text feature vector sequence obtained in step 2 and the candidate sequences held in the storage unit are fed into the inference unit, which computes how strongly the own information and context information of the entities "汤姆" and "杰瑞" influence each word of the original sentence, yielding a set of attention vectors.

Step 6: the attention vectors from step 5 together with the text feature vector sequence are fed into the CRF model for decoding; with the attention vectors as reference, the CRF model now labels the two entities "李某" and "蔡某".

Step 7: "李某" and "蔡某" are placed in the candidate unit and the threshold is set to β. Computing their similarity to every word in the storage unit shows that the similarity between "李某" and "汤姆", and between "蔡某" and "杰瑞", exceeds the threshold β, so these two words are not written to the storage unit.

Step 8: after steps 4 to 7 are repeated, no new entities are produced, so named entity recognition ends; the final labeled entities are:

when 汤姆tom 遇到meet 杰瑞Jerry Time 李某Lee 遇到meet 蔡某Cai OO S-PERS-PER OO S-PERS-PER OO S-PERS-PER OO S-PERS-PER

Once the entities recognized by the above process have been stored in the storage unit, whenever later text contains information similar to an entity in the storage unit, i.e. text with a sentence pattern similar to one stored there, those entities can all be recognized, which addresses the problem of low precision and recall on long texts.

Optionally, to reduce computational cost, the system can fix the number of recognition iterations in advance instead of using "no new entities produced" as the termination condition of the named entity system. This reduces the computational cost but lowers recall and precision, so the trade-off must be weighed in practical use.

The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A multi-level named entity recognition method, characterized in that the method comprises the following steps:
S1: preprocessing the data text to obtain a vocabulary C;
S2: using pre-trained word vectors, combined with the text information feature sequence, to obtain a vector representation of the text;
S3: encoding the vector representation of the text to obtain the encoded text feature vector sequence;
S4: decoding the text feature vector sequence with a CRF model and labeling the entities in the text feature vector sequence;
S5: taking the preceding context information, the following context information, and the entity's own information of the labeled entities as the candidate sequence of the subsequent recognition process;
S6: feeding the text feature vector sequence and the candidate sequence into an inference unit based on an attention mechanism, and computing the attention vectors;
S7: feeding the attention vectors and the text feature vector sequence into the CRF model and labeling the entities in the sequence;
S8: repeating steps S5 to S7 until step S7 produces no new entities.
2. The multi-level named entity recognition method according to claim 1, characterized in that preprocessing the data text to obtain the text information feature sequence specifically comprises:
segmenting a long text into sentences, using the full stop as the delimiter;
performing word segmentation on all sentences;
then removing duplicate words to build the vocabulary C.
3. The multi-level named entity recognition method according to claim 1, characterized in that using pre-trained word vectors, combined with the text information feature sequence, to obtain the vector representation of the text specifically comprises:
pre-training on a corpus to obtain the word vectors;
representing the input text in one-hot form as the text information feature sequence;
obtaining the word vector representation of each word from the vocabulary C, the text information feature sequence, and the pre-trained word vectors;
performing word embedding on the segmented sentences to obtain the vector representation X of the text.
4. The multi-level named entity recognition method according to claim 3, characterized in that a BiLSTM is used to encode the vector representation of the text, obtaining the encoded text feature vector sequence.
5. The multi-level named entity recognition method according to claim 4, characterized in that decoding the text feature vector sequence with the CRF model and labeling the entities in the text feature vector sequence specifically comprises:
feeding the text feature vector sequence H = {h1, h2, h3, …, ht} into the CRF model, which computes the predicted label sequence L = {l1, l2, l3, …, ln}, where l denotes the label of each word; every span whose labels contain "BIE" or "S" in the labeling result is a labeled entity e, and the entity set is expressed as E = (e1, e2, e3, …, em), where m denotes the number of entities.
6. The multi-level named entity recognition method according to claim 5, characterized in that the predicted label sequence is computed as follows:
let the label sequence be Y = {y1, y2, y3, …, yn}, where Y ranges over all label sequences permitted by the BIOES tagging scheme; the score of labeling x with the sequence y is:

score(y, x) = Σ_{i,j} λ_j · t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k · s_k(y_i, x, i)

where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model describing, given x, the transition from the previous label node y_{i-1} to the current label node y_i, taking the value 0 or 1, called a transition feature; s_k(y_i, x, i) indicates whether the current label node y_i is assigned at position i of x, also taking the value 0 or 1, called a state feature; λ_j and μ_k are the weights of t_j and s_k respectively, and j and k index the feature functions;
exponentiating and normalizing score(y, x) yields the conditional probability p(y|x) of labeling x with y:

p(y|x) = exp(score(y, x)) / Z(x)

where Z(x) is the normalization factor:

Z(x) = Σ_{y′} exp(score(y′, x))

in which y′ = (y′1, y′2, y′3, …, y′n) ranges over the possible label sequences;
decoding is performed with the Viterbi algorithm: the model takes the y′ with the highest probability as the labeling result, denoted l, i.e. l = argmax_{y′} p(y′|x); the predicted label sequence is then L = {l1, l2, l3, …, ln}.
7. The multi-level named entity recognition method according to claim 6, characterized in that for each entity ei (i = 1, 2, 3, …, m) in the entity set E, the forward LSTM hidden-layer output and the backward LSTM hidden-layer output are combined to express each entity in the form V′ = [v1, v2, v3, v4], where v1 represents the entity's preceding context, v2 the entity information obtained from the forward LSTM, v3 the entity information obtained from the backward LSTM, and v4 the entity's following context; the entity sequence in the storage unit is then expressed as V = {V′1, V′2, V′3, …, V′m′}, where m′ denotes the number of entities already stored in the storage unit, and this entity sequence is taken as the candidate sequence.
8. The multi-level named entity recognition method according to claim 7, characterized in that the candidate sequence V in the storage unit and the text feature vector sequence H are fed into the inference unit, which computes the degree of influence of each entity entry in V on the text feature vector sequence H, yielding the attention vector sequence S;
for the text feature vector sequence H = {h1, h2, h3, …, ht} and the candidate sequence V = {V′1, V′2, V′3, …, V′m′}, the dot product of each hi with every V′j is computed, giving the attention scores σ, where i = 1, 2, 3, …, t and j = 1, 2, 3, …, m′;
at any time step t, the attention score of the candidate sequence V with respect to ht is computed as:

σt,j = ht · V′j

the attention scores are converted into a probability distribution serving as the weights α of the subsequent weighted sum:

αt = softmax(σt)

a weighted sum over the candidate sequence V then gives, for any time step t, the attention vector s of V with respect to ht:

st = Σj αt,j V′j

after the results for all time steps have been computed, the attention vector sequence of the candidate sequence V with respect to the text feature vector sequence H is S = {s1, s2, s3, …, st}.
9. The multi-level named entity recognition method according to claim 8, characterized in that the method further comprises: computing the similarity between the entities of step S7 and the entities of step S5, and taking an entity of step S7 as a new entity when the similarity is below a similarity threshold.
10. The multi-level named entity recognition method according to claim 9, characterized in that the similarity is computed with the cosine similarity method, specifically:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)
CN201910207179.0A 2019-03-19 2019-03-19 Multilevel named entity recognition method Active CN110008469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910207179.0A CN110008469B (en) 2019-03-19 2019-03-19 Multilevel named entity recognition method


Publications (2)

Publication Number Publication Date
CN110008469A true CN110008469A (en) 2019-07-12
CN110008469B CN110008469B (en) 2022-06-07

Family

ID=67167300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910207179.0A Active CN110008469B (en) 2019-03-19 2019-03-19 Multilevel named entity recognition method

Country Status (1)

Country Link
CN (1) CN110008469B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516241A (en) * 2019-08-26 2019-11-29 北京三快在线科技有限公司 Geographical address analytic method, device, readable storage medium storing program for executing and electronic equipment
CN110688854A (en) * 2019-09-02 2020-01-14 平安科技(深圳)有限公司 Named entity recognition method, device and computer readable storage medium
CN110852108A (en) * 2019-11-11 2020-02-28 中山大学 Joint training method, apparatus and medium for entity recognition and entity disambiguation
CN111241832A (en) * 2020-01-15 2020-06-05 北京百度网讯科技有限公司 Core entity labeling method, device and electronic device
CN111274815A (en) * 2020-01-15 2020-06-12 北京百度网讯科技有限公司 Method and device for mining entity attention points in text
CN111581957A (en) * 2020-05-06 2020-08-25 浙江大学 Nested entity detection method based on pyramid hierarchical network
CN111832293A (en) * 2020-06-24 2020-10-27 四川大学 Entity and Relation Joint Extraction Method Based on Head Entity Prediction
CN111858817A (en) * 2020-07-23 2020-10-30 中国石油大学(华东) A BiLSTM-CRF Path Inference Method for Sparse Trajectories
CN112151183A (en) * 2020-09-23 2020-12-29 上海海事大学 An entity recognition method for Chinese electronic medical records based on Lattice LSTM model
CN112185572A (en) * 2020-09-25 2021-01-05 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic device and medium
CN112307208A (en) * 2020-11-05 2021-02-02 Oppo广东移动通信有限公司 Long text classification method, terminal and computer storage medium
CN112836514A (en) * 2020-06-19 2021-05-25 合肥量圳建筑科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
WO2021114745A1 (en) * 2019-12-13 2021-06-17 华南理工大学 Named entity recognition method employing affix perception for use in social media
CN113947081A (en) * 2021-01-11 2022-01-18 复旦大学 A Chinese Named Entity Recognition System Combined with Dictionary
CN114171147A (en) * 2021-11-30 2022-03-11 中国医学科学院北京协和医院 Novel medical text preprocessing system
CN114648028A (en) * 2020-12-21 2022-06-21 阿里巴巴集团控股有限公司 Method and device for training label model, electronic equipment and storage medium
CN115146033A (en) * 2022-07-18 2022-10-04 北京龙智数科科技服务有限公司 Named entity recognition method and device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
US20170109355A1 (en) * 2015-10-16 2017-04-20 Baidu Usa Llc Systems and methods for human inspired simple question answering (hisqa)
CN106980608A (en) * 2017-03-16 2017-07-25 四川大学 A Chinese electronic medical record word segmentation and named entity recognition method and system
WO2017165038A1 (en) * 2016-03-21 2017-09-28 Amazon Technologies, Inc. Speaker verification method and system
CN107871158A (en) * 2016-09-26 2018-04-03 清华大学 A kind of knowledge mapping of binding sequence text message represents learning method and device
CN107977353A (en) * 2017-10-12 2018-05-01 北京知道未来信息技术有限公司 A kind of mixing language material name entity recognition method based on LSTM-CNN
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Name entity recognition method, device, equipment and computer readable storage medium
CN108536754A (en) * 2018-03-14 2018-09-14 四川大学 Electronic health record entity relation extraction method based on BLSTM and attention mechanism
CN109062901A (en) * 2018-08-14 2018-12-21 第四范式(北京)技术有限公司 Neural network training method and device and name entity recognition method and device
CN109062893A (en) * 2018-07-13 2018-12-21 华南理工大学 A kind of product name recognition methods based on full text attention mechanism
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian Named Entity Recognition Method and Recognition System Based on Neural Network


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GUL KHAN SAFI QAMAS等: "基于深度神经网络的命名实体识别方法研究", 《信息网络安全》 *
HUI-KANG YI等: "A Chinese Named Entity Recognition System with Neural Networks", 《ITM WEB OF CONFERENCES》 *
XIAOCHENG FENG等: "Multi-Level Cross-Lingual Attentive Neural Architecture for Low Resource Name Tagging", 《TSINGHUA SCIENCE AND TECHNOLOGY》 *
姜宇新: "基于深度学习的生物医学命名实体识别研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *
常量: "图像理解中的卷积神经网络", 《自动化学报》 *
张璞: "基于深度学习的中文微博评价对象抽取方法", 《计算机工程与设计》 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516241B (en) * 2019-08-26 2021-03-02 北京三快在线科技有限公司 Geographic address resolution method and device, readable storage medium and electronic equipment
CN110516241A (en) * 2019-08-26 2019-11-29 北京三快在线科技有限公司 Geographical address analytic method, device, readable storage medium storing program for executing and electronic equipment
CN110688854A (en) * 2019-09-02 2020-01-14 平安科技(深圳)有限公司 Named entity recognition method, device and computer readable storage medium
CN110852108A (en) * 2019-11-11 2020-02-28 中山大学 Joint training method, apparatus and medium for entity recognition and entity disambiguation
WO2021114745A1 (en) * 2019-12-13 2021-06-17 华南理工大学 Named entity recognition method employing affix perception for use in social media
CN111241832B (en) * 2020-01-15 2023-08-15 北京百度网讯科技有限公司 Core entity labeling method, device and electronic equipment
CN111274815B (en) * 2020-01-15 2024-04-12 北京百度网讯科技有限公司 Method and apparatus for mining entity attention points in text
US11775761B2 (en) 2020-01-15 2023-10-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for mining entity focus in text
CN111274815A (en) * 2020-01-15 2020-06-12 北京百度网讯科技有限公司 Method and device for mining entity attention points in text
CN111241832A (en) * 2020-01-15 2020-06-05 北京百度网讯科技有限公司 Core entity labeling method, device and electronic device
CN111581957B (en) * 2020-05-06 2022-04-12 浙江大学 A Nested Entity Detection Method Based on Pyramid Hierarchical Network
CN111581957A (en) * 2020-05-06 2020-08-25 浙江大学 Nested entity detection method based on pyramid hierarchical network
CN112836514A (en) * 2020-06-19 2021-05-25 合肥量圳建筑科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
CN111832293A (en) * 2020-06-24 2020-10-27 四川大学 Entity and Relation Joint Extraction Method Based on Head Entity Prediction
CN111832293B (en) * 2020-06-24 2023-05-26 四川大学 Entity and Relation Joint Extraction Based on Head Entity Prediction
CN111858817B (en) * 2020-07-23 2021-05-18 中国石油大学(华东) A BiLSTM-CRF Path Inference Method for Sparse Trajectories
CN111858817A (en) * 2020-07-23 2020-10-30 中国石油大学(华东) A BiLSTM-CRF Path Inference Method for Sparse Trajectories
CN112151183A (en) * 2020-09-23 2020-12-29 上海海事大学 An entity recognition method for Chinese electronic medical records based on Lattice LSTM model
CN112151183B (en) * 2020-09-23 2024-11-22 上海海事大学 An entity recognition method for Chinese electronic medical records based on Lattice LSTM model
CN112185572A (en) * 2020-09-25 2021-01-05 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic device and medium
CN112185572B (en) * 2020-09-25 2024-03-01 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic equipment and medium
CN112307208A (en) * 2020-11-05 2021-02-02 Oppo广东移动通信有限公司 Long text classification method, terminal and computer storage medium
CN114648028A (en) * 2020-12-21 2022-06-21 阿里巴巴集团控股有限公司 Method and device for training label model, electronic equipment and storage medium
CN113947081A (en) * 2021-01-11 2022-01-18 复旦大学 A Chinese Named Entity Recognition System Combined with Dictionary
CN114171147A (en) * 2021-11-30 2022-03-11 中国医学科学院北京协和医院 Novel medical text preprocessing system
CN115146033A (en) * 2022-07-18 2022-10-04 北京龙智数科科技服务有限公司 Named entity recognition method and device

Also Published As

Publication number Publication date
CN110008469B (en) 2022-06-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant