CN110705293A

CN110705293A - Electronic medical record text named entity recognition method based on pre-training language model

Info

Publication number: CN110705293A
Application number: CN201910785097.4A
Authority: CN
Inventors: 戴亚康; 戴斌; 耿辰; 周志勇; 胡冀苏
Original assignee: Suzhou Institute of Biomedical Engineering and Technology of CAS
Current assignee: Suzhou Institute of Biomedical Engineering and Technology of CAS
Priority date: 2019-08-23
Filing date: 2019-08-23
Publication date: 2020-01-17

Abstract

The invention belongs to the technical field of medical information data processing, and in particular relates to a named entity recognition method for electronic medical record text based on a pre-trained language model, comprising: collecting electronic medical record text from a public data set as original text and preprocessing it; labeling the entities in the preprocessed original text based on a standardized medical terminology set to obtain labeled text; inputting the labeled text into a pre-trained language model to obtain training text represented by character vectors; building a BiLSTM-CRF sequence labeling model and training it on the training text to obtain a trained labeling model; and using the trained labeling model as the entity recognition model, so that inputting test text outputs the labeled category label sequence. The method uses the text features and semantic information of a deep language model trained on a very large Chinese corpus, which provides better semantic compression, avoids the tedium and complexity of manual annotation, does not rely on dictionaries or rules, and improves the recall and accuracy of named entity recognition.

Description

Named Entity Recognition Method for Electronic Medical Record Text Based on Pre-trained Language Model

Technical Field

The invention belongs to the technical field of medical information data processing, and in particular relates to a named entity recognition method for electronic medical record text based on a pre-trained language model.

Background

A medical record (case history) is the record kept by medical staff of medical activities such as examination, diagnosis and treatment concerning the occurrence, development and outcome of a patient's disease. It is also the patient's medical and health file, written in a prescribed format and to prescribed requirements after the collected data have been summarized, organized and comprehensively analyzed. With the development of computer and Internet technology, most hospitals have digitized their clinical records. An electronic medical record uses electronic equipment to record, save, manage, transmit and reproduce digitized medical records, and is safe and reliable as well as convenient to record, store and share. Electronic medical records not only provide the most practical and richest data for health administration, medical diagnosis and treatment, and scientific research, but also serve as an important basis for assigning responsibility when evaluating medical quality and management level and when handling medical disputes. At present, electronic medical records are mostly entered in natural language and then converted by computer into structured data for later analysis and search.

Natural language processing (NLP) is an interdisciplinary field spanning computer science and artificial intelligence. Named entity recognition (NER) is a basic NLP task that aims to identify entities with a specific meaning in natural language text, such as person names, place names, organization names and proper nouns. Because NER is an important component of NLP applications such as public opinion analysis, information retrieval, query classification, automatic question answering and machine translation, the quality of its results directly affects the performance of those applications.

Existing named entity recognition methods generally fall into three types: dictionary-based methods, heuristic-rule-based methods and machine-learning-based methods. Dictionary-based methods first build a large-scale entity dictionary and then recognize entities by matching sentences against it; they depend heavily on the dictionary, cannot recognize out-of-vocabulary words, cannot handle entity matching situations, and have low accuracy. Heuristic-rule-based methods build rules from the contextual features specific to entities and then match text against the rules; building the rules requires linguistic background knowledge, and because Chinese expression is diverse, the rules are hard to enumerate and prone to conflict, which limits recall and accuracy. Machine-learning-based methods formalize named entity recognition as a sequence labeling task and jointly predict entity boundaries and entity types by predicting a label for each character or word, but they require a large amount of manual annotation.

Summary of the Invention

Therefore, the technical problem to be solved by the present invention is to overcome the limited accuracy and recall and the cumbersome complexity of existing named entity recognition methods, and to provide a fast and simple named entity recognition method for electronic medical record text, based on a pre-trained language model, with high accuracy and recall.

To solve the above technical problem, the present invention adopts the following technical solution:

The present invention provides a named entity recognition method for electronic medical record text based on a pre-trained language model, comprising the following steps:

Step 1: collecting electronic medical record text from a public data set as original text, and performing data preprocessing on the original text;

Step 2: performing entity labeling, based on a standardized medical terminology set, on the original text preprocessed in step 1 to obtain labeled text;

Step 3: inputting the labeled text into a pre-trained language model to obtain training text represented by character vectors;

Step 4: building a BiLSTM-CRF sequence labeling model and training it on the training text to obtain a trained labeling model;

Step 5: using the trained labeling model as the entity recognition model, so that inputting test text outputs the labeled category label sequence.

Preferably, in step 1 of the method, the characters in all the original texts in the data set are counted, stop words and useless symbols are removed, and a dictionary file is generated.

Preferably, in step 2 of the method, the six entity types appearing in the original text preprocessed in step 1, namely disease and diagnosis, examination, test, surgery, drug and anatomical site, are labeled based on the SNOMED CT medical terminology set using the BIO labeling scheme.

Further preferably, in step 3 of the method, the ERNIE pre-trained language model is used.

Further preferably, in step 3 of the method, the ERNIE pre-trained language model is trained in a semi-supervised parallel manner.

Further preferably, in step 4 of the method, the BiLSTM-CRF sequence labeling model comprises a look-up layer, a bidirectional LSTM layer and a CRF layer.

Further preferably, in step 4 of the method, the character vectors of the training text are input into the BiLSTM-CRF sequence labeling model, which is trained iteratively.

Further preferably, in step 4 of the method, a loss function is constructed and a loss value is computed to determine the number of training iterations, training continuing until the loss value is less than a set threshold.

Further preferably, in step 4 of the method, the category label sequence and the labeled data of the training text are input into the loss function to compute the loss value.

The technical solution of the present invention has the following advantages:

1. The named entity recognition method for electronic medical record text based on a pre-trained language model provided by the present invention comprises: step 1, collecting electronic medical record text from a public data set as original text, and performing data preprocessing on the original text; step 2, performing entity labeling, based on a standardized medical terminology set, on the original text preprocessed in step 1 to obtain labeled text; step 3, inputting the labeled text into a pre-trained language model to obtain training text represented by character vectors; step 4, building a BiLSTM-CRF sequence labeling model and training it on the training text to obtain a labeling model; step 5, using the labeling model as the entity recognition model, so that inputting test text outputs the labeled category label sequence.

The method uses the text features and semantic information of a deep language model trained on a very large Chinese corpus, which provides better semantic compression, avoids the tedium and complexity of manual annotation, and does not rely on dictionaries or rules. It thereby improves the recall and accuracy of named entity recognition and the overall recognition performance.

2. The method ensures that the key information learned from the public data set is more comprehensive, which improves recall and also makes it easier to check electronic medical record text for errors.

3. The method uses the ERNIE pre-trained language model in place of the conventional BERT pre-trained language model. By modeling entity concepts in massive data, it learns complete semantic representations of entities and can therefore represent their semantic information better.

Brief Description of the Drawings

To illustrate the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of the named entity recognition method for electronic medical record text based on a pre-trained language model provided in Embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of the BiLSTM-CRF sequence labeling model.

Detailed Description

To facilitate understanding of the purpose, technical solution and key points of the present invention, embodiments of the invention are described in further detail below. The invention may be embodied in many different forms and should not be construed as limited to the embodiment set forth here. Rather, this embodiment is provided so that the disclosure will be thorough and complete and will fully convey the concept of the invention to those skilled in the art; the invention is defined only by the claims.

Embodiment 1

This embodiment provides a named entity recognition method for electronic medical record text based on a pre-trained language model which, as shown in FIG. 1, comprises the following steps:

Step 1: collecting electronic medical record text from a public data set as original text, and performing data preprocessing on the original text;

Specifically, in this embodiment, the corpus used as the original text comes from electronic medical record text collected from a public data set. The characters appearing in all the original texts in the data set are counted, stop characters, irrelevant symbols and the like are removed, and a dictionary file is generated.
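The preprocessing in step 1 can be sketched as follows. This is a minimal illustration, not the inventors' implementation: the function names, the reserved `<PAD>` and `<UNK>` entries, the frequency-based ordering and the one-character-per-line file layout are all assumptions made here.

```python
from collections import Counter


def build_char_vocab(texts, stop_chars, reserved=("<PAD>", "<UNK>")):
    """Count characters over all texts and assign each kept character an id.

    `reserved` tokens are an assumption of this sketch; the source only says
    that a dictionary file is generated.
    """
    counts = Counter(
        ch
        for text in texts
        for ch in text
        if ch not in stop_chars and not ch.isspace()
    )
    vocab = {token: idx for idx, token in enumerate(reserved)}
    # Most frequent characters first; ties broken by code point for determinism.
    for ch, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[ch] = len(vocab)
    return vocab


def save_vocab(vocab, path):
    """Write the dictionary file, one character per line, ordered by id."""
    with open(path, "w", encoding="utf-8") as handle:
        for token, _ in sorted(vocab.items(), key=lambda kv: kv[1]):
            handle.write(token + "\n")


if __name__ == "__main__":
    records = ["患者咳嗽，", "患者发热。"]
    vocab = build_char_vocab(records, stop_chars=set("，。"))
    print(vocab)
```

Characters not present in the dictionary at test time would map to the `<UNK>` id under this layout.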

Step 2: performing entity labeling, based on a standardized medical terminology set, on the original text preprocessed in step 1 to obtain labeled text;

Specifically, the corpus in the original text preprocessed in step 1 is labeled based on the SNOMED CT medical terminology set using the BIO labeling scheme, giving the labeled text. In the labeled text, entities fall into six categories:

Disease and diagnosis: diseases as defined in medicine, and the judgments doctors make in clinical work about etiology, pathophysiology, classification, staging and the like;

Examination: imaging examinations (such as X-ray, CT, MR and PET-CT), angiography, ultrasound, electrocardiography and the like; to avoid excessive overlap between examination procedures and surgical procedures, other diagnostic procedures (such as gastroscopy and colonoscopy) are not included;

Test: physical or chemical tests performed in a laboratory; in this embodiment this refers specifically to assays performed by the clinical laboratory department in clinical work, excluding laboratory tests in the broad sense such as immunohistochemistry;

Surgery: treatments such as resection and suturing that a doctor performs on part of the patient's body, the main treatment method of surgical departments;

Drug: specific chemical substances used to treat disease;

Anatomical site: the anatomical site of the human body where a disease, symptom or sign occurs.
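The source does not say whether labeling against the terminology set is manual or automated. One possible way to automate it is greedy longest-match tagging against a term list, sketched below. The three-entry lexicon is a hypothetical stand-in for SNOMED CT, and the `TYPE-B` / `TYPE-I` tag spelling follows the label names used later in this embodiment.

```python
def bio_tag(text, lexicon):
    """Tag each character of `text` with a BIO label by longest match.

    `lexicon` maps a term to its entity type, e.g. {"胃癌": "DISEASE"}.
    Characters outside any matched term receive the label "O".
    """
    tags = ["O"] * len(text)
    max_len = max((len(term) for term in lexicon), default=0)
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so "胃切除术" wins over "胃".
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        if match is None:
            i += 1
            continue
        entity_type = lexicon[match]
        tags[i] = f"{entity_type}-B"
        for j in range(i + 1, i + len(match)):
            tags[j] = f"{entity_type}-I"
        i += len(match)
    return tags


if __name__ == "__main__":
    # Hypothetical mini-lexicon; a real system would load a terminology set.
    lexicon = {"胃癌": "DISEASE", "胃": "BODY", "胃切除术": "OPERATION"}
    for ch, tag in zip("胃癌患者行胃切除术", bio_tag("胃癌患者行胃切除术", lexicon)):
        print(ch, tag)
```

Longest match is a design choice here: it prefers the more specific term when a shorter term is a prefix of a longer one, at the cost of never labeling the nested shorter entity.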

From the entity-labeled text, 1000 records were extracted as the labeled set. Its statistics are shown in Table 1:

Table 1. Entity counts

| | Original texts | Disease and diagnosis | Examination | Test | Surgery | Drug | Anatomical site | Total |
|---|---|---|---|---|---|---|---|---|
| Labeled set | 1000 | 2116 | 222 | 318 | 765 | 456 | 1486 | 5363 |

Step 3: inputting the labeled text into a pre-trained language model to obtain training text represented by character vectors;

Specifically, Baidu's ERNIE pre-trained language model and a number of electronic medical record text corpora are obtained. The ERNIE model is trained on the electronic medical record corpora in a semi-supervised, parallel, bidirectional manner to obtain a new language model. The labeled set of the labeled text from step 2 is input into the new model to obtain a training set of training text represented by character vectors.

Step 4: building a BiLSTM-CRF sequence labeling model and training it on the training text to obtain a trained labeling model;

The BiLSTM-CRF sequence labeling model is a bidirectional recurrent neural network that, based on context, gives the probability of each predicted label for an input character. The model has three layers: the first is the look-up layer, the second is the bidirectional LSTM layer (BiLSTM layer), and the third is the CRF layer.

As shown in FIG. 2, the one-hot vectors of the characters of the training text from step 3 are input into the model. Taking the sentence as the unit, a sentence of n characters (a character sequence) is written as x = (x_1, x_2, ..., x_n), where x_i is the id of the i-th character of the sentence in the dictionary file. From this id the character's one-hot vector is obtained, whose dimension is the size of the dictionary file.

The look-up layer, the first layer of the model, uses a pre-trained or randomly initialized embedding matrix to map each character x_i of the sentence from its one-hot vector to a low-dimensional dense character vector (character embedding) x_i ∈ R^d, where d is the embedding dimension. Dropout is applied before the next layer to mitigate overfitting.
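The relationship between the one-hot input and the look-up layer can be illustrated with a small pure-Python sketch. The 4×3 embedding matrix, the function names and the inverted-dropout scaling are assumptions chosen for illustration; a real model would use a deep learning framework.

```python
import random


def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def lookup_via_matmul(one_hot_vec, embedding):
    """Multiply a one-hot row vector by the embedding matrix (vocab x d)."""
    dim = len(embedding[0])
    return [
        sum(one_hot_vec[row] * embedding[row][col] for row in range(len(embedding)))
        for col in range(dim)
    ]


def lookup(index, embedding):
    """Equivalent shortcut: simply select the row for this character id."""
    return list(embedding[index])


def dropout(vec, rate, rng):
    """Inverted dropout: zero each component with probability `rate`,
    scale the survivors by 1 / (1 - rate). Used only during training."""
    if rate <= 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


if __name__ == "__main__":
    embedding = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
    char_id = 2
    print(lookup_via_matmul(one_hot(char_id, 4), embedding))
    print(lookup(char_id, embedding))
    print(dropout(lookup(char_id, embedding), 0.5, random.Random(0)))
```

Because the one-hot product only selects a row, implementations normally index the embedding matrix directly rather than materializing vectors as large as the dictionary.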

The bidirectional LSTM layer, the second layer of the model, extracts sentence features automatically. The character embedding sequence (x_1, x_2, ..., x_n) of a sentence is used as the input at each time step of the bidirectional LSTM. The hidden state sequence output by the forward LSTM and the hidden states output by the backward LSTM are concatenated position by position to give the complete hidden state sequence (h_1, h_2, ..., h_n) ∈ R^{n×m}. After dropout, a linear layer maps each hidden state vector from m dimensions to k dimensions, where k is the number of labels in the training set, giving the automatically extracted sentence features as the matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k}. Each component p_ij of p_i ∈ R^k is treated as the score for assigning character x_i to the j-th label. Applying Softmax to P would amount to an independent k-class classification at each position, which cannot make use of labels already assigned when labeling a position; a CRF layer is therefore added to perform the labeling.

The CRF layer, the third layer of the model, performs sentence-level sequence labeling. Its parameter is a (k+2)×(k+2) transition matrix A, where A_ij is the transition score from the i-th label to the j-th label, so that labels already assigned can be used when labeling a position. The 2 is added to k because a start state is added at the head of the sentence and an end state at its tail. For a label sequence y = (y_1, y_2, ..., y_n) whose length equals the sentence length, the model's score for labeling sentence x with y is:
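The formula itself is not reproduced in this text. Assuming the conventional BiLSTM-CRF scoring, with y_0 and y_{n+1} denoting the added start and end states (an indexing convention assumed here), formula (1) would take the form:

```latex
\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i,\,y_i} + \sum_{i=1}^{n+1} A_{y_{i-1},\,y_i} \tag{1}
```

The first sum collects the per-position scores from the matrix P, and the second collects the transition scores, including the transitions out of the start state and into the end state.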

By formula (1), the score of the whole sequence equals the sum of the scores of the individual positions, and the score of each position has two parts: one determined by the p_i output by the LSTM, and the other determined by the transition matrix A of the CRF layer. Softmax then gives the normalized probability:
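A form of formula (2) consistent with this description, assuming the normalization runs over every candidate label sequence y' of the same length, is:

```latex
P(y \mid x) = \frac{\exp\bigl(\mathrm{score}(x, y)\bigr)}{\sum_{y'} \exp\bigl(\mathrm{score}(x, y')\bigr)} \tag{2}
```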

The model is trained by maximizing the log-likelihood function. Formula (3) gives the log-likelihood of a training sample (x, y^x):
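Taking the logarithm of a softmax over sequence scores, with y' again ranging over all candidate label sequences (an assumption of this reconstruction), formula (3) would be:

```latex
\log P(y^{x} \mid x) = \mathrm{score}(x, y^{x}) - \log \sum_{y'} \exp\bigl(\mathrm{score}(x, y')\bigr) \tag{3}
```

In practice the second term is computed with the forward algorithm rather than by enumerating sequences, since the number of sequences grows as k^n.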

During prediction (decoding), the model uses the Viterbi dynamic programming algorithm to find the optimal path:
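The decoding target, presumably formula (4), is the highest-scoring label sequence, y* = argmax over y' of score(x, y'). A minimal pure-Python sketch of Viterbi decoding follows. As a simplification assumed here, the start and end rows of the (k+2)×(k+2) matrix A are passed as separate vectors `start` and `end`, and `transitions` covers only the k real labels; all names are illustrative.

```python
def sequence_score(emissions, transitions, start, end, labels):
    """Score of one label sequence: emission scores plus transition scores,
    including the transitions from the start state and into the end state."""
    total = start[labels[0]] + end[labels[-1]]
    for i, label in enumerate(labels):
        total += emissions[i][label]
        if i > 0:
            total += transitions[labels[i - 1]][label]
    return total


def viterbi_decode(emissions, transitions, start, end):
    """Return (best_path, best_score) for an n x k emission matrix."""
    n, k = len(emissions), len(emissions[0])
    # best[j]: highest score of any partial path ending in label j.
    best = [start[j] + emissions[0][j] for j in range(k)]
    backpointers = []
    for i in range(1, n):
        new_best, pointers = [], []
        for j in range(k):
            prev = max(range(k), key=lambda a: best[a] + transitions[a][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[i][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    last = max(range(k), key=lambda j: best[j] + end[j])
    best_score = best[last] + end[last]
    # Follow the back-pointers from the last position to the first.
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_score


if __name__ == "__main__":
    emissions = [[1.0, 0.2], [0.3, 0.9], [0.6, 0.5]]
    transitions = [[0.1, 0.8], [-0.5, 0.2]]
    print(viterbi_decode(emissions, transitions, start=[0.0, 0.0], end=[0.0, 0.0]))
```

The dynamic program runs in O(n·k²) time, whereas scoring every candidate sequence with `sequence_score` would take O(n·k^n).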

The character vectors of the training set from step 3 are input into the BiLSTM-CRF sequence labeling model described above to obtain a category label sequence. Each character vector in the input training set corresponds to one label. Labeling the six entity types above with the BIO scheme gives 'DISEASE-I': 1, 'DISEASE-B': 2, 'CHECK-B': 3, 'CHECK-I': 4, 'EXAMINE-I': 5, 'EXAMINE-B': 6, 'OPERATION-B': 7, 'OPERATION-I': 8, 'MEDICINE-I': 9, 'MEDICINE-B': 10, 'BODY-B': 11, 'BODY-I': 12, plus O, for a total of 13 categories. The test set carries no labels.
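A predicted label sequence in this scheme can be turned back into entities by grouping each `TYPE-B` label with the `TYPE-I` labels that follow it. The sketch below is illustrative only: the function name and the choice to ignore an `I` label that has no preceding `B` of the same type are assumptions.

```python
def extract_entities(chars, tags):
    """Group BIO labels such as 'DISEASE-B', 'DISEASE-I', 'O' into entities.

    Returns a list of (text, entity_type, start, end) with `end` exclusive.
    """
    entities = []
    start, entity_type = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        kind, _, position = tag.rpartition("-")  # "O" -> ("", "", "O")
        continues = position == "I" and start is not None and kind == entity_type
        if continues:
            continue
        if start is not None:
            entities.append(("".join(chars[start:i]), entity_type, start, i))
            start, entity_type = None, None
        if position == "B" and kind:
            start, entity_type = i, kind
        # An "I" label with no matching open entity is ignored (assumption).
    return entities


if __name__ == "__main__":
    text = "胃癌患者行胃切除术"
    tags = ["DISEASE-B", "DISEASE-I", "O", "O", "O",
            "OPERATION-B", "OPERATION-I", "OPERATION-I", "OPERATION-I"]
    for entity in extract_entities(text, tags):
        print(entity)
```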

To make the category label sequence output in each training round more accurate, a loss function is constructed. The category label sequence and the labeled data of the training set are input into the loss function to compute a loss value, and a reference value is chosen as the set threshold. After each round the loss value is compared with the threshold: if the loss is greater than or equal to the threshold, another round is trained, and this repeats until the loss is less than the threshold, at which point the sequence output by the CRF layer of the labeling model is taken to be the optimal solution.
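The stopping rule described above can be sketched as a loop around a per-round training step. The `train_one_round` callable is a hypothetical placeholder for one pass over the training set that returns the loss (presumably the negative log-likelihood of the CRF), and the `max_rounds` cap is a safeguard added here that the source does not mention.

```python
def train_until_threshold(train_one_round, loss_threshold, max_rounds=1000):
    """Repeat training rounds until the loss drops below `loss_threshold`.

    Returns the list of loss values, one per completed round.
    """
    history = []
    for _ in range(max_rounds):
        loss = train_one_round()
        history.append(loss)
        if loss < loss_threshold:
            break  # loss is below the set threshold: stop iterating
    return history


if __name__ == "__main__":
    # Toy stand-in for real training: the loss halves every round.
    state = {"loss": 1.0}

    def fake_round():
        state["loss"] *= 0.5
        return state["loss"]

    print(train_until_threshold(fake_round, loss_threshold=0.1))
```

Stopping on training loss alone can overfit; monitoring a held-out set is a common alternative, though the source describes only the threshold rule.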

Step 5: using the trained labeling model as the entity recognition model, so that inputting test text outputs the labeled category label sequence;

Specifically, the trained labeling model is saved as the entity recognition model. New electronic medical record text is used as the test text, 300 records are selected as the test set, and inputting them into the entity recognition model outputs the labeled category label sequence. The number of texts and the accuracy for the training set and the test set are shown in Table 2:

Table 2. Number of texts and accuracy for the training set and the test set

| | Training set | Test set |
|---|---|---|
| Number of texts | 6292 | 1573 |
| Accuracy | 0.9487 | 0.8196 |

As Table 2 shows, the labeling model obtained from the training text achieves high recognition accuracy on the training set, and when used as the entity recognition model it also achieves relatively high accuracy on the entities in the test set.

The above embodiment is only an example given for clarity of description and does not limit the implementations. A person of ordinary skill in the art can make other changes or modifications in different forms on the basis of the above description. It is neither necessary nor possible to list all implementations exhaustively here. Obvious changes or modifications derived from this description remain within the scope of protection of the present invention.

Claims

1. A named entity recognition method for electronic medical record text based on a pre-trained language model, characterized by comprising the following steps:

step 1, collecting electronic medical record text from a public data set as original text, and performing data preprocessing on the original text;

step 2, performing entity labeling, based on a standardized medical terminology set, on the original text preprocessed in step 1 to obtain labeled text;

step 3, inputting the labeled text into a pre-trained language model to obtain training text represented by character vectors;

step 4, building a BiLSTM-CRF sequence labeling model and training it on the training text to obtain a trained labeling model;

step 5, using the trained labeling model as the entity recognition model, and inputting test text to output the labeled category label sequence.

2. The named entity recognition method for electronic medical record text based on a pre-trained language model as claimed in claim 1, wherein in step 1, the characters in all the original texts in the data set are counted, stop words and useless symbols are removed, and a dictionary file is generated.

3. The named entity recognition method for electronic medical record text based on a pre-trained language model as claimed in claim 1 or 2, wherein in step 2, the six entity types appearing in the original text preprocessed in step 1, namely disease and diagnosis, examination, test, surgery, drug and anatomical site, are labeled based on the SNOMED CT medical terminology set using the BIO labeling scheme.

4. The named entity recognition method for electronic medical record text based on a pre-trained language model as claimed in claim 3, wherein in step 3, the ERNIE pre-trained language model is used.

5. The named entity recognition method for electronic medical record text based on a pre-trained language model as claimed in claim 4, wherein in step 3, the ERNIE pre-trained language model is trained in a semi-supervised parallel manner.

6. The named entity recognition method for electronic medical record text based on a pre-trained language model as claimed in claim 5, wherein in step 4, the BiLSTM-CRF sequence labeling model comprises a look-up layer, a bidirectional LSTM layer and a CRF layer.

7. The method as claimed in claim 6, wherein in step 4, the character vectors of the training text are input into the BiLSTM-CRF sequence labeling model, which is trained iteratively.

8. The named entity recognition method for electronic medical record text based on a pre-trained language model as claimed in claim 7, wherein in step 4, a loss function is constructed and a loss value is computed to determine the number of training iterations, training continuing until the loss value is less than a set threshold.

9. The method as claimed in claim 8, wherein in step 4, the category label sequence and the labeled data of the training text are input into the loss function to compute the loss value.