CN110705293A - Electronic medical record text named entity recognition method based on pre-training language model - Google Patents
Electronic medical record text named entity recognition method based on pre-training language model
- Publication number
- CN110705293A (application CN201910785097.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- training
- model
- language model
- electronic medical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Epidemiology (AREA)
- Medical Informatics (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention belongs to the technical field of medical information data processing, and specifically relates to a named entity recognition method for electronic medical record text based on a pre-trained language model. The method comprises: collecting electronic medical record text from a public data set as original text and preprocessing it; annotating entities in the preprocessed original text based on a standardized medical terminology set to obtain annotated text; inputting the annotated text into a pre-trained language model to obtain training text represented by character vectors; building a BiLSTM-CRF sequence labeling model and training it on the training text to obtain a trained labeling model; and using the trained labeling model as the entity recognition model, so that inputting test text outputs the labeled category label sequence. By using the text features and semantic information in a deep language model trained on a very large Chinese corpus, the method provides a better semantic compression effect, avoids tedious and complicated manual annotation, does not rely on dictionaries or rules, and improves the recall and accuracy of named entity recognition.
Description
Technical Field
The invention belongs to the technical field of medical information data processing, and specifically relates to a named entity recognition method for electronic medical record text based on a pre-trained language model.
Background
A medical record (case history) is the record kept by medical staff of medical activities such as the examination, diagnosis, and treatment of the onset, progression, and outcome of a patient's disease. It is also the patient's medical and health file, in which the collected information is summarized, organized, and comprehensively analyzed and then written in the prescribed format and to the prescribed requirements. With the development of computer and Internet technology, most hospitals have digitized their clinical medical records. An electronic medical record uses electronic equipment to record, store, manage, transmit, and reproduce digitized medical records; it is secure and reliable and is convenient to record, store, and share. Electronic medical records not only provide the most practical and richest data for health service management, medical diagnosis and treatment, and scientific research, but also serve as an important basis for evaluating medical quality and management level and for determining responsibility in medical disputes. At present, most electronic medical records are entered in natural language and then converted by computer into structured data for later data analysis and search.
Natural language processing is an interdisciplinary area spanning computer science and artificial intelligence. Named entity recognition (NER) is a basic natural language processing task that aims to identify entities with specific meanings in natural language text, such as person names, place names, organization names, and proper nouns. NER is an important component of natural language processing applications such as public opinion analysis, information retrieval, query classification, automatic question answering, and machine translation, so the quality of its results directly affects their performance.
Existing named entity recognition methods generally fall into three types: dictionary-based methods, heuristic rule-based methods, and machine learning-based methods. Dictionary-based methods first construct a large-scale entity dictionary and then recognize entities by matching sentences against the dictionary. They depend heavily on the dictionary, cannot recognize out-of-vocabulary words, cannot handle entity collocation cases, and have low accuracy. Heuristic rule-based methods build rules from the context features specific to entities and then match text against the rules. Building the rules requires linguistic background knowledge, and because Chinese expression is so varied, the rules are hard to enumerate and prone to conflict, which limits recall and accuracy. Machine learning-based methods formalize named entity recognition as a sequence labeling task: by predicting a label for each character or each word, they jointly predict entity boundaries and entity types. However, they require a large amount of manual annotation.
Summary of the Invention
Therefore, the technical problem to be solved by the invention is to overcome the limited accuracy and recall and the cumbersome complexity of existing named entity recognition methods, and thereby to provide a fast and simple named entity recognition method for electronic medical record text, based on a pre-trained language model, with high accuracy and recall.
To solve the above technical problem, the invention adopts the following technical solution:
The invention provides a named entity recognition method for electronic medical record text based on a pre-trained language model, comprising the following steps:
Step 1: collect electronic medical record text from a public data set as original text, and perform data preprocessing on the original text;
Step 2: based on a standardized medical terminology set, annotate entities in the original text preprocessed in step 1 to obtain annotated text;
Step 3: input the annotated text into a pre-trained language model to obtain training text represented by character vectors;
Step 4: build a BiLSTM-CRF sequence labeling model and train it on the training text to obtain a trained labeling model;
Step 5: use the trained labeling model as the entity recognition model; inputting test text outputs the labeled category label sequence.
Preferably, in step 1 of the method, the characters in all the original texts in the data set are counted, stop words and useless symbols are removed, and a dictionary file is generated.
Preferably, in step 2 of the method, the six entity types that appear in the original text preprocessed in step 1 (disease and diagnosis, examination, laboratory test, surgery, drug, and anatomical site) are annotated based on the SNOMED CT medical terminology set using the BIO annotation scheme.
Further preferably, in step 3 of the method, the ERNIE pre-trained language model is used.
Further preferably, in step 3 of the method, the ERNIE pre-trained language model is trained in a semi-supervised, parallel manner.
Further preferably, in step 4 of the method, the BiLSTM-CRF sequence labeling model comprises a look-up layer, a bidirectional LSTM layer, and a CRF layer.
Further preferably, in step 4 of the method, the character vectors of the training text are input into the BiLSTM-CRF sequence labeling model, which is trained iteratively.
Further preferably, in step 4 of the method, a loss function is constructed and a loss value is computed to determine the number of training iterations; training continues until the loss value is less than a set threshold.
Further preferably, in step 4 of the method, the category label sequence of the training text and the labeled data are input into the loss function to compute the loss value.
The technical solution of the invention has the following advantages:
1. The named entity recognition method for electronic medical record text based on a pre-trained language model provided by the invention comprises: step 1, collecting electronic medical record text from a public data set as original text and performing data preprocessing on the original text; step 2, annotating entities in the original text preprocessed in step 1 based on a standardized medical terminology set to obtain annotated text; step 3, inputting the annotated text into a pre-trained language model to obtain training text represented by character vectors; step 4, building a BiLSTM-CRF sequence labeling model and learning from the training text to obtain a labeling model; step 5, using the labeling model as the entity recognition model, so that inputting test text outputs the labeled category label sequence.
By using the text features and semantic information in a deep language model trained on a very large Chinese corpus, the method provides a better semantic compression effect, avoids the tedious and complicated problem of manual annotation, does not rely on dictionaries or rules, improves the recall and accuracy of named entity recognition, and improves the recognition of named entities.
2. The method ensures that the key information learned from the public data set is more comprehensive, which improves recall and also makes it easier to check electronic medical record text for errors.
3. The method uses the ERNIE pre-trained language model in place of the conventional BERT pre-trained language model. By modeling entity concepts in massive data, ERNIE learns complete semantic representations of entities and can therefore better represent their semantic information.
Description of Drawings
To illustrate the specific embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic flowchart of the named entity recognition method for electronic medical record text based on a pre-trained language model provided in Embodiment 1 of the invention;
Figure 2 is a schematic diagram of the BiLSTM-CRF sequence labeling model.
Detailed Description
To facilitate understanding of the purpose, technical solutions, and key points of the invention, embodiments of the invention are described in further detail below. The invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth here. Rather, these embodiments are provided so that the disclosure is thorough and complete and fully conveys the concept of the invention to those skilled in the art; the invention is defined only by the claims.
Embodiment 1
This embodiment provides a named entity recognition method for electronic medical record text based on a pre-trained language model, as shown in Figure 1, comprising the following steps:
Step 1: collect electronic medical record text from a public data set as original text, and perform data preprocessing on the original text;
Specifically, in this embodiment, the corpus used as the original text comes from electronic medical record text collected from a public data set. The characters that appear in all the original texts in the data set are counted, stop characters, irrelevant symbols, and the like are removed, and a dictionary file is generated.
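The preprocessing in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the stop-character set, the `<UNK>` entry, the frequency-based id order, and the tab-separated file layout are all assumptions, since the source does not specify them.

```python
from collections import Counter

# Hypothetical stop characters and symbols; the source does not list them.
STOP_CHARS = set(",。、;:() \n\t")


def build_char_dict(texts, stop_chars=STOP_CHARS):
    """Count characters over all texts and assign ids by descending frequency."""
    counts = Counter(ch for text in texts for ch in text if ch not in stop_chars)
    # Id 0 is reserved for unknown characters (an assumption, not from the source).
    char2id = {"<UNK>": 0}
    # Ties in frequency are broken by code point so the result is deterministic.
    for ch, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        char2id[ch] = len(char2id)
    return char2id


def save_char_dict(char2id, path):
    """Write one 'character<TAB>id' pair per line (assumed file layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for ch, idx in char2id.items():
            f.write(f"{ch}\t{idx}\n")


# Two made-up sentences standing in for medical record text.
texts = ["患者行胃镜检查。", "患者胃痛,行CT检查。"]
char2id = build_char_dict(texts)
```

The id of a character in this dictionary is what the later description calls the id of the character in the dictionary file.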
Step 2: based on a standardized medical terminology set, annotate entities in the original text preprocessed in step 1 to obtain annotated text;
Specifically, the corpus in the original text preprocessed in step 1 is annotated based on the SNOMED CT medical terminology set using the BIO annotation scheme to obtain the annotated text. In the annotated text, entities are divided into six categories:
Disease and diagnosis: diseases as defined in medicine, and the judgments that doctors make in clinical work about etiology, pathophysiology, classification, staging, and so on;
Examination: imaging examinations (such as X-ray, CT, MR, and PET-CT), contrast imaging, ultrasound, electrocardiography, and the like. To avoid too much overlap between examination procedures and surgical procedures, other diagnostic procedures (such as gastroscopy and colonoscopy) are not included;
Laboratory test: physical or chemical tests performed in a laboratory. In this embodiment the term refers specifically to assays performed by the clinical laboratory department in clinical work, and excludes laboratory tests in the broader sense such as immunohistochemistry;
Surgery: treatments such as resection and suturing that a doctor performs on a part of the patient's body; surgery is the main treatment method in surgical practice;
Drug: a specific chemical substance used to treat disease;
Anatomical site: the anatomical site of the human body where a disease, symptom, or sign occurs.
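As an illustration of BIO annotation over these categories, the sketch below converts entity spans into one tag per character. The example sentence and its spans are made up, and the tag spelling (`DISEASE-B`, `MEDICINE-I`, and so on) follows the label names that appear in the description of step 4; which label name corresponds to which category is assumed here.

```python
def bio_tags(text, entities):
    """Return one BIO tag per character of `text`.

    entities: list of (start, end, type) spans with `end` exclusive.
    Spans are assumed not to overlap.
    """
    tags = ["O"] * len(text)
    for start, end, etype in entities:
        tags[start] = f"{etype}-B"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"{etype}-I"          # remaining characters of the entity
    return tags


# Hypothetical sentence: "胃癌" as a disease, "阿司匹林" as a drug.
text = "胃癌术后服用阿司匹林"
entities = [(0, 2, "DISEASE"), (6, 10, "MEDICINE")]
tags = bio_tags(text, entities)
```

Characters outside every span keep the tag `O`, which is why six entity types produce twelve B/I tags plus `O`.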
From the entity-annotated text, 1000 records were extracted as the annotation set. The statistics of this annotation set are shown in Table 1:
Table 1. Number of entities
Step 3: input the annotated text into the pre-trained language model to obtain training text represented by character vectors;
Specifically, Baidu's ERNIE pre-trained language model and a number of electronic medical record text corpora are obtained. The ERNIE model is trained on the electronic medical record corpora in a semi-supervised, parallel, bidirectional manner to obtain a new language model. The annotation set of the annotated text from step 2 is input into the new trained model to obtain the training set of the training text represented by character vectors.
Step 4: build a BiLSTM-CRF sequence labeling model and train it on the training text to obtain the trained labeling model;
The BiLSTM-CRF sequence labeling model is a bidirectional recurrent neural network that can give, for each input character, the probability of each predicted label based on context information. The model has three layers: the first is a look-up layer, the second is a bidirectional LSTM layer (the BiLSTM layer), and the third is a CRF layer.
As shown in Figure 2, the one-hot vectors of the characters of the training text from step 3 are input into the model. That is, taking the sentence as the unit, a sentence containing n characters (a character sequence) is written as x = (x_1, x_2, ..., x_n), where x_i is the id in the dictionary file of the i-th character of the sentence. This gives the one-hot vector of each character, whose dimension is the size of the dictionary file.
The look-up layer is the first layer of the model. It uses a pre-trained or randomly initialized embedding matrix to map each character x_i in the sentence from its one-hot vector to a low-dimensional dense character vector (character embedding) x_i ∈ R^d, where d is the embedding dimension. Dropout is applied before the input to the next layer to mitigate overfitting.
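The relationship between the one-hot input and the look-up layer can be sketched in plain Python. Multiplying a one-hot vector by the embedding matrix selects a single row, so the look-up is equivalent to indexing. The sizes, the random initialization range, and the inverted-dropout variant below are illustrative assumptions; the source does not give them.

```python
import random


def one_hot(idx, size):
    """One-hot vector of length `size` with a 1.0 at position `idx`."""
    vec = [0.0] * size
    vec[idx] = 1.0
    return vec


def row_times_matrix(vec, matrix):
    """Multiply a row vector (length V) by a V x d matrix."""
    cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix))) for c in range(cols)]


def dropout(vec, rate, rng):
    """Inverted dropout: zero each component with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


vocab_size, d = 5, 3                      # toy dictionary size and embedding dimension
rng = random.Random(0)
embedding = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(vocab_size)]

sentence_ids = [2, 0, 4]                  # ids of the characters in the dictionary file
looked_up = [embedding[i] for i in sentence_ids]
via_one_hot = [row_times_matrix(one_hot(i, vocab_size), embedding) for i in sentence_ids]
dropped = [dropout(vec, 0.5, rng) for vec in looked_up]   # applied only during training
```

In practice the indexing form is used because it avoids building vectors as large as the dictionary.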
The bidirectional LSTM layer is the second layer of the model and extracts sentence features automatically. The character embedding sequence (x_1, x_2, ..., x_n) of a sentence is used as the input to the time steps of the bidirectional LSTM. The hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated position by position to give the complete hidden state sequence (h_1, h_2, ..., h_n) ∈ R^(n×m). After dropout, a linear layer maps each hidden state vector from m dimensions to k dimensions, where k is the number of labels in the training set. The result is the automatically extracted sentence features, written as the matrix P = (p_1, p_2, ..., p_n) ∈ R^(n×k). Each component p_ij of p_i ∈ R^k is treated as the score for assigning character x_i to the j-th label. Applying Softmax to P at this point would amount to an independent k-class classification at each position, but then labeling a position could not use the labels already assigned to other positions, so a CRF layer is added to do the labeling.
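The shapes involved in this layer can be sketched without an actual LSTM. The hidden states below are made-up numbers standing in for the forward and backward LSTM outputs (already aligned by position), and the weight matrix is an arbitrary example; only the concatenation, the linear projection to k scores, and the independent per-position choice are illustrated.

```python
def concat_states(forward, backward):
    """Concatenate forward and backward hidden states position by position."""
    return [f + b for f, b in zip(forward, backward)]


def linear(h, weight, bias):
    """Map an m-dimensional vector to k dimensions; weight is k x m, bias has length k."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weight, bias)]


# Toy hidden states for n = 3 positions, 2 units per direction, so m = 4.
forward = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
backward = [[0.6, 0.5], [0.4, 0.3], [0.2, 0.1]]
H = concat_states(forward, backward)          # n x m

k = 3                                         # number of labels in this toy example
weight = [[1, 0, 0, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
bias = [0.0, 0.0, 0.0]
P = [linear(h, weight, bias) for h in H]      # n x k score matrix

# Independent per-position classification: pick the best label at each position
# without looking at the labels chosen for neighbouring positions.
independent = [max(range(k), key=lambda j: p[j]) for p in P]
```

The independent choice ignores label-to-label constraints such as an I tag having to follow a B tag of the same type, which is the gap the CRF layer addresses.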
The CRF layer is the third layer of the model and performs sentence-level sequence labeling. Its parameter is a (k+2)×(k+2) transition matrix A, where A_ij is the score of a transition from the i-th label to the j-th label, so that labeling a position can use the labels already assigned before it. The 2 is added to k because a start state is added at the beginning of the sentence and an end state at the end. Given a label sequence y = (y_1, y_2, ..., y_n) whose length equals the sentence length, the model's score for sentence x having label sequence y is:
By formula (1), the score of the whole sequence is the sum of the scores at each position, and the score at each position has two parts: one determined by the p_i output by the LSTM and the other by the transition matrix A of the CRF layer. Softmax then gives the normalized probability:
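Formula (1) and the normalized probability (presumably formula (2)) are not reproduced in this text. A reconstruction consistent with the description, assuming the standard linear-chain CRF form in which y_0 denotes the added start state and y_{n+1} the added end state, would be the following; the exact notation of the original may differ.

```latex
\begin{align}
\operatorname{score}(x, y) &= \sum_{i=1}^{n} P_{i,\,y_i} \;+\; \sum_{i=1}^{n+1} A_{y_{i-1},\,y_i} \tag{1} \\
\Pr(y \mid x) &= \frac{\exp\bigl(\operatorname{score}(x, y)\bigr)}
                      {\sum_{y'} \exp\bigl(\operatorname{score}(x, y')\bigr)} \tag{2}
\end{align}
```

Here P_{i,y_i} is the entry of the score matrix P for position i and label y_i, and the sum in the denominator runs over all label sequences y' of length n.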
The model is trained by maximizing the log-likelihood function. Formula (3) gives the log-likelihood of a training sample (x, y^x):
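Under the standard linear-chain CRF form, the log-likelihood of a sample is the score of its gold label sequence minus the logarithm of the summed exponentiated scores of all label sequences. The sketch below computes this with the forward algorithm and checks it against brute-force enumeration. Placing the start state at index k and the end state at index k+1 of A, and all the numbers, are assumptions made for illustration.

```python
import math
from itertools import product


def sequence_score(P, A, y):
    """Sum of emission scores P[i][y_i] and transition scores along start -> y -> end."""
    k = len(P[0])
    start, end = k, k + 1                     # assumed positions of the two extra states
    path = [start] + list(y) + [end]
    emit = sum(P[i][tag] for i, tag in enumerate(y))
    trans = sum(A[a][b] for a, b in zip(path, path[1:]))
    return emit + trans


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(P, A):
    """log of the sum over all label sequences of exp(score), via the forward algorithm."""
    k = len(P[0])
    start, end = k, k + 1
    alpha = [A[start][j] + P[0][j] for j in range(k)]
    for i in range(1, len(P)):
        alpha = [logsumexp([alpha[a] + A[a][j] for a in range(k)]) + P[i][j]
                 for j in range(k)]
    return logsumexp([alpha[j] + A[j][end] for j in range(k)])


def log_likelihood(P, A, y):
    """log Pr(y | x) = score(x, y) - log-partition."""
    return sequence_score(P, A, y) - log_partition(P, A)


# Toy example: n = 3 positions, k = 2 labels, A is (k+2) x (k+2).
P = [[1.0, 0.2], [0.3, 0.9], [0.8, 0.1]]
A = [[0.5, -0.2, 0.0, 0.1],
     [-0.3, 0.4, 0.0, 0.2],
     [0.1, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
all_sequences = list(product(range(2), repeat=3))
brute = logsumexp([sequence_score(P, A, y) for y in all_sequences])
total_prob = sum(math.exp(log_likelihood(P, A, y)) for y in all_sequences)
```

A training loss is then typically the negative of this log-likelihood, summed or averaged over the training samples.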
During prediction (decoding), the model uses the Viterbi dynamic programming algorithm to find the optimal path:
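Decoding selects the label sequence with the highest score. A minimal Viterbi sketch over an emission matrix P and a transition matrix A is shown below; as before, the start state at index k and the end state at index k+1 are an indexing assumption, and the numbers are made up.

```python
def viterbi_decode(P, A):
    """Return (best label path, best score) for emissions P (n x k) and transitions A."""
    k = len(P[0])
    start, end = k, k + 1                     # assumed positions of the two extra states
    best = [A[start][j] + P[0][j] for j in range(k)]
    backpointers = []
    for i in range(1, len(P)):
        new_best, pointers = [], []
        for j in range(k):
            # Best previous label for arriving at label j in position i.
            prev = max(range(k), key=lambda a: best[a] + A[a][j])
            pointers.append(prev)
            new_best.append(best[prev] + A[prev][j] + P[i][j])
        best = new_best
        backpointers.append(pointers)
    last = max(range(k), key=lambda j: best[j] + A[j][end])
    best_score = best[last] + A[last][end]
    # Follow the back-pointers from the last position to the first.
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_score


P = [[1.0, 0.2], [0.3, 0.9], [0.8, 0.1]]
A = [[0.5, -0.2, 0.0, 0.1],
     [-0.3, 0.4, 0.0, 0.2],
     [0.1, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
path, score = viterbi_decode(P, A)            # path is [0, 0, 0]
greedy = [max(range(2), key=lambda j: p[j]) for p in P]   # per-position argmax: [0, 1, 0]
```

In this toy example the per-position argmax picks label 1 in the middle, while Viterbi keeps label 0 throughout because the transition scores favour staying in the same label. This is the effect of letting the CRF layer use neighbouring labels.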
The character vectors of the training set of the training text from step 3 are input into the BiLSTM-CRF sequence labeling model described above to obtain the category label sequence. Each character vector in the training set corresponds to one label. Labeling the six entity types above with the BIO three-tag scheme gives 'DISEASE-I': 1, 'DISEASE-B': 2, 'CHECK-B': 3, 'CHECK-I': 4, 'EXAMINE-I': 5, 'EXAMINE-B': 6, 'OPERATION-B': 7, 'OPERATION-I': 8, 'MEDICINE-I': 9, 'MEDICINE-B': 10, 'BODY-B': 11, 'BODY-I': 12, and O, for 13 categories in total. The test set has no labels.
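The label set above maps directly onto a lookup table. In the sketch below the twelve entity ids are taken from the text; the id 0 for `O` is an assumption, since the source lists `O` without an id.

```python
TAG2ID = {
    "O": 0,                # assumption: the source gives no id for O
    "DISEASE-I": 1, "DISEASE-B": 2,
    "CHECK-B": 3, "CHECK-I": 4,
    "EXAMINE-I": 5, "EXAMINE-B": 6,
    "OPERATION-B": 7, "OPERATION-I": 8,
    "MEDICINE-I": 9, "MEDICINE-B": 10,
    "BODY-B": 11, "BODY-I": 12,
}
ID2TAG = {idx: tag for tag, idx in TAG2ID.items()}


def encode_tags(tags):
    """Convert a tag sequence to label ids."""
    return [TAG2ID[tag] for tag in tags]


def decode_ids(ids):
    """Convert label ids back to tags."""
    return [ID2TAG[idx] for idx in ids]


entity_types = sorted({tag.rsplit("-", 1)[0] for tag in TAG2ID if tag != "O"})
encoded = encode_tags(["DISEASE-B", "DISEASE-I", "O"])
```

With this table, k in the score matrix P and the transition matrix A would be 13.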
To make the category label sequence output in each round of training more accurate, a loss function is constructed. The category label sequence and the label data of the training set are input into the loss function to compute the loss value, and a reference value is chosen as the set threshold. The loss value of each round is compared with the set threshold: if the loss value is greater than or equal to the threshold, another round of training is run, and this repeats until the loss value is less than the threshold, at which point the sequence output by the CRF layer of the labeling model is considered the optimal solution.
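The stopping rule described here can be sketched as a loop around a training round. The `train_one_round` callable is a hypothetical stand-in for one pass of BiLSTM-CRF training that returns the loss, and the `max_rounds` cap is an added safeguard that the source does not mention.

```python
def train_until_threshold(train_one_round, threshold, max_rounds=1000):
    """Repeat training rounds while the loss is >= threshold.

    Returns (number of rounds run, last loss value).
    `max_rounds` is a safety cap assumed here so the loop always terminates.
    """
    loss = float("inf")
    rounds = 0
    while loss >= threshold and rounds < max_rounds:
        loss = train_one_round()
        rounds += 1
    return rounds, loss


def make_fake_trainer(initial_loss):
    """Purely illustrative trainer whose loss halves on every round."""
    state = {"loss": initial_loss}

    def step():
        state["loss"] *= 0.5
        return state["loss"]

    return step


rounds, final_loss = train_until_threshold(make_fake_trainer(8.0), threshold=0.3)
```

With the fake trainer the loss goes 4, 2, 1, 0.5, 0.25, so the loop stops after the fifth round, the first one whose loss is below 0.3.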
Step 5: use the trained labeling model as the entity recognition model; inputting test text outputs the labeled category label sequence;
Specifically, the learned labeling model is saved as the entity recognition model. New electronic medical record text is used as the test text, and 300 records are selected as the test set. Inputting them into the entity recognition model outputs the labeled category label sequence. The number of texts and the accuracy for the training set and the test set are shown in Table 2:
Table 2. Number of texts and accuracy for the training set and test set
As Table 2 shows, the labeling model obtained by training on the training text has a high recognition accuracy, and when used as the entity labeling model it also recognizes the entities in the test set with high accuracy.
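The source does not say how the figures in Table 2 are computed. One common convention, assumed here purely for illustration, is to turn each label sequence back into entity spans and score exact span-and-type matches. The tag spelling follows the `TYPE-B` / `TYPE-I` / `O` labels listed for step 4, and the example sequences are made up.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence; end is exclusive."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing entity
        if start is not None and tag != f"{etype}-I":
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.endswith("-B"):
            start, etype = i, tag[:-2]
    return entities


def precision_recall(gold_tags, pred_tags):
    """Entity-level precision and recall with exact span and type matching."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


gold_tags = ["DISEASE-B", "DISEASE-I", "O", "MEDICINE-B", "MEDICINE-I"]
pred_tags = ["DISEASE-B", "DISEASE-I", "O", "MEDICINE-B", "O"]
precision, recall = precision_recall(gold_tags, pred_tags)
```

In this example the disease span matches exactly while the drug span is cut short, so one of two predicted entities is correct and one of two gold entities is found.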
Obviously, the above embodiment is only an example given for clarity of description and does not limit the implementations. A person of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all implementations exhaustively here. Obvious changes or variations derived from the above remain within the scope of protection of the invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910785097.4A CN110705293A (en) | 2019-08-23 | 2019-08-23 | Electronic medical record text named entity recognition method based on pre-training language model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910785097.4A CN110705293A (en) | 2019-08-23 | 2019-08-23 | Electronic medical record text named entity recognition method based on pre-training language model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110705293A true CN110705293A (en) | 2020-01-17 |
Family
ID=69193850
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910785097.4A Pending CN110705293A (en) | 2019-08-23 | 2019-08-23 | Electronic medical record text named entity recognition method based on pre-training language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110705293A (en) |
Cited By (83)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111310471A (en) * | 2020-01-19 | 2020-06-19 | 陕西师范大学 | A Tourism Named Entity Recognition Method Based on BBLC Model |
CN111339250A (en) * | 2020-02-20 | 2020-06-26 | 北京百度网讯科技有限公司 | Mining method, electronic device and computer readable medium of new category label |
CN111341404A (en) * | 2020-02-26 | 2020-06-26 | 山东健康医疗大数据有限公司 | Electronic medical record data set analysis method and system based on ernie model |
CN111339759A (en) * | 2020-02-21 | 2020-06-26 | 北京百度网讯科技有限公司 | Method and device for training field element recognition model and electronic equipment |
CN111353311A (en) * | 2020-03-03 | 2020-06-30 | 平安医疗健康管理股份有限公司 | Named entity identification method and device, computer equipment and storage medium |
CN111428502A (en) * | 2020-02-19 | 2020-07-17 | 中科世通亨奇(北京)科技有限公司 | Named entity labeling method for military corpus |
CN111523316A (en) * | 2020-03-04 | 2020-08-11 | 平安科技(深圳)有限公司 | Medicine identification method based on machine learning and related equipment |
CN111553164A (en) * | 2020-04-29 | 2020-08-18 | 平安科技(深圳)有限公司 | Training method and device for named entity recognition model and computer equipment |
CN111582497A (en) * | 2020-04-27 | 2020-08-25 | 平安医疗健康管理股份有限公司 | Training file generation and evaluation method, device, computer system and storage medium |
CN111651986A (en) * | 2020-04-28 | 2020-09-11 | 银江股份有限公司 | Event keyword extraction method, device, equipment and medium |
CN111651991A (en) * | 2020-04-15 | 2020-09-11 | 天津科技大学 | A Medical Named Entity Recognition Method Using Multi-Model Fusion Strategy |
CN111666414A (en) * | 2020-06-12 | 2020-09-15 | 上海观安信息技术股份有限公司 | Method for detecting cloud service by sensitive data and cloud service platform |
CN111696674A (en) * | 2020-06-12 | 2020-09-22 | 电子科技大学 | Deep learning method and system for electronic medical record |
CN111709241A (en) * | 2020-05-27 | 2020-09-25 | 西安交通大学 | A Named Entity Recognition Method for Network Security |
CN111767723A (en) * | 2020-05-14 | 2020-10-13 | 上海大学 | A BIC-based entity labeling method for Chinese electronic medical records |
CN111785387A (en) * | 2020-07-02 | 2020-10-16 | 朱玮 | Method and system for disease standardized mapping classification by using Bert |
CN111785367A (en) * | 2020-06-30 | 2020-10-16 | 平安科技(深圳)有限公司 | Triage method, device and computer equipment based on neural network model |
CN111783466A (en) * | 2020-07-15 | 2020-10-16 | 电子科技大学 | A Named Entity Recognition Method for Chinese Medical Records |
CN111797629A (en) * | 2020-06-23 | 2020-10-20 | 平安医疗健康管理股份有限公司 | Medical text data processing method and device, computer equipment and storage medium |
CN111834014A (en) * | 2020-07-17 | 2020-10-27 | 北京工业大学 | A method and system for named entity recognition in the medical field |
CN111858935A (en) * | 2020-07-13 | 2020-10-30 | 北京航空航天大学 | A fine-grained sentiment classification system for flight reviews |
CN111930909A (en) * | 2020-08-11 | 2020-11-13 | 付立军 | Geological intelligent question and answer oriented data automatic sequence labeling identification method |
CN111986765A (en) * | 2020-09-03 | 2020-11-24 | 平安国际智慧城市科技股份有限公司 | Electronic case entity marking method, device, computer equipment and storage medium |
CN111984983A (en) * | 2020-08-28 | 2020-11-24 | 山东健康医疗大数据有限公司 | User privacy encryption method |
CN112001177A (en) * | 2020-08-24 | 2020-11-27 | 浪潮云信息技术股份公司 | Electronic medical record named entity identification method and system integrating deep learning and rules |
CN112115714A (en) * | 2020-09-25 | 2020-12-22 | 平安国际智慧城市科技股份有限公司 | Deep learning sequence labeling method and device and computer readable storage medium |
CN112347771A (en) * | 2020-12-03 | 2021-02-09 | 云知声智能科技股份有限公司 | Method and equipment for extracting entity relationship |
CN112347253A (en) * | 2020-11-04 | 2021-02-09 | 新智数字科技有限公司 | Method and device for establishing text information recognition model and terminal equipment |
CN112380327A (en) * | 2020-11-09 | 2021-02-19 | 天翼爱音乐文化科技有限公司 | Cold-start slot filling method, system, device and storage medium |
CN112420205A (en) * | 2020-12-08 | 2021-02-26 | 医惠科技有限公司 | Entity recognition model generation method and device and computer readable storage medium |
CN112487814A (en) * | 2020-11-27 | 2021-03-12 | 北京百度网讯科技有限公司 | Entity classification model training method, entity classification device and electronic equipment |
CN112614559A (en) * | 2020-12-29 | 2021-04-06 | 苏州超云生命智能产业研究院有限公司 | Medical record text processing method and device, computer equipment and storage medium |
CN112613273A (en) * | 2020-12-16 | 2021-04-06 | 上海交通大学 | Compression method and system of multi-language BERT sequence labeling model |
CN112614562A (en) * | 2020-12-23 | 2021-04-06 | 联仁健康医疗大数据科技股份有限公司 | Model training method, device, equipment and storage medium based on electronic medical record |
CN112669918A (en) * | 2020-12-24 | 2021-04-16 | 上海市第一人民医院 | Ophthalmic VEGF-related multidimensional clinical trial data processing method and system |
CN112669928A (en) * | 2021-01-06 | 2021-04-16 | 腾讯科技(深圳)有限公司 | Structured information construction method and device, computer equipment and storage medium |
CN112699682A (en) * | 2020-12-11 | 2021-04-23 | 山东大学 | Named entity identification method and device based on combinable weak authenticator |
CN112712863A (en) * | 2021-01-05 | 2021-04-27 | 中国人民解放军海军军医大学第一附属医院 | Method and system for calculating clinical data of accurate drug administration for liver metastasis of colon cancer |
CN112712118A (en) * | 2020-12-29 | 2021-04-27 | 银江股份有限公司 | Medical text data oriented filtering method and system |
CN112732863A (en) * | 2021-01-15 | 2021-04-30 | 清华大学 | Standardized segmentation method for electronic medical records |
CN112749562A (en) * | 2020-12-31 | 2021-05-04 | 合肥工业大学 | Named entity identification method, device, storage medium and electronic equipment |
CN112802570A (en) * | 2021-02-07 | 2021-05-14 | 成都延华西部健康医疗信息产业研究院有限公司 | Named entity recognition system and method for electronic medical record |
CN112860842A (en) * | 2021-03-05 | 2021-05-28 | 联仁健康医疗大数据科技股份有限公司 | Medical record labeling method and device and storage medium |
CN112949311A (en) * | 2021-03-05 | 2021-06-11 | 北京工业大学 | Named entity identification method fusing font information |
CN112949308A (en) * | 2021-02-25 | 2021-06-11 | 武汉大学 | Method and system for identifying named entities of Chinese electronic medical record based on functional structure |
CN112949310A (en) * | 2021-03-01 | 2021-06-11 | 创新奇智(上海)科技有限公司 | Model training method, traditional Chinese medicine name recognition method and device and network model |
CN113204969A (en) * | 2021-05-31 | 2021-08-03 | 平安科技(深圳)有限公司 | Medical named entity recognition model generation method and device and computer equipment |
CN113299375A (en) * | 2021-07-27 | 2021-08-24 | 北京好欣晴移动医疗科技有限公司 | Method, device and system for marking and identifying digital file information entity |
CN113345542A (en) * | 2021-06-18 | 2021-09-03 | 北京百度网讯科技有限公司 | Electronic medical record management method, device, equipment, storage medium and computer program product |
CN113343680A (en) * | 2021-05-19 | 2021-09-03 | 大连东软教育科技集团有限公司 | Structured information extraction method based on multi-type case history texts |
CN113392633A (en) * | 2021-08-05 | 2021-09-14 | 中国医学科学院阜外医院 | Medical named entity identification method, device and storage medium |
CN113408279A (en) * | 2021-06-23 | 2021-09-17 | 平安科技(深圳)有限公司 | Training method, device and equipment of sequence labeling model and storage medium |
CN113536182A (en) * | 2021-07-12 | 2021-10-22 | 广州万孚生物技术股份有限公司 | Method, device, electronic device and storage medium for generating long text webpage |
CN113642329A (en) * | 2020-04-27 | 2021-11-12 | 阿里巴巴集团控股有限公司 | Method and device for establishing term identification model, term identification method and device |
CN113657105A (en) * | 2021-08-31 | 2021-11-16 | 平安医疗健康管理股份有限公司 | Medical entity extraction method, device, equipment and medium based on vocabulary enhancement |
CN113674866A (en) * | 2021-06-23 | 2021-11-19 | 江苏天瑞精准医疗科技有限公司 | Medical text oriented pre-training method |
CN113761891A (en) * | 2021-08-31 | 2021-12-07 | 国网冀北电力有限公司 | Power grid text data entity identification method, system, device and medium |
CN113807097A (en) * | 2020-10-30 | 2021-12-17 | 北京中科凡语科技有限公司 | Named Entity Recognition Model Establishment Method and Named Entity Recognition Method |
CN113868372A (en) * | 2021-09-08 | 2021-12-31 | 北京国研网信息股份有限公司 | Statistical bulletin index extraction method based on rules and text sequence annotation |
CN113886602A (en) * | 2021-10-19 | 2022-01-04 | 四川大学 | An entity recognition method of domain knowledge base based on multi-granularity cognition |
CN114048745A (en) * | 2021-11-05 | 2022-02-15 | 新智道枢(上海)科技有限公司 | Method and system for recognizing named entities of digital police service warning situation addresses |
CN114218954A (en) * | 2021-12-16 | 2022-03-22 | 云知声智能科技股份有限公司 | Method and device for distinguishing negative and positive of disease entity and symptom entity in medical record text |
CN114330350A (en) * | 2022-01-05 | 2022-04-12 | 北京环境特性研究所 | Named entity identification method and device, electronic equipment and storage medium |
CN114328902A (en) * | 2020-10-09 | 2022-04-12 | 阿里巴巴集团控股有限公司 | Text labeling model construction method and device |
CN114385824A (en) * | 2021-12-07 | 2022-04-22 | 暨南大学 | An Ethical Behavior Extraction Method Based on Pre-training Model |
CN114530223A (en) * | 2022-01-18 | 2022-05-24 | 华南理工大学 | NLP-based cardiovascular disease medical record structuring system |
CN114528408A (en) * | 2022-02-25 | 2022-05-24 | 大连海洋大学 | ResNeXt-fused EBAC fish text combined extraction method |
CN114692634A (en) * | 2022-01-27 | 2022-07-01 | 清华大学 | Chinese named entity recognition and classification method and device |
CN114757191A (en) * | 2022-03-29 | 2022-07-15 | 国网江苏省电力有限公司营销服务中心 | Named Entity Recognition Method System in Electric Power Public Opinion Field Based on Deep Learning |
CN114913953A (en) * | 2022-07-19 | 2022-08-16 | 北京惠每云科技有限公司 | Medical entity relationship identification method and device, electronic equipment and storage medium |
CN114944211A (en) * | 2022-05-20 | 2022-08-26 | 医渡云(北京)技术有限公司 | Medical entity identification method, device, equipment and storage medium |
CN115146642A (en) * | 2022-07-21 | 2022-10-04 | 北京市科学技术研究院 | Automatic training set labeling method and system for named entity recognition |
CN115148367A (en) * | 2022-06-27 | 2022-10-04 | 厦门快商通科技股份有限公司 | A method and system for normalizing medical symptom entity information |
CN115171835A (en) * | 2022-09-02 | 2022-10-11 | 北京智源人工智能研究院 | Case structured model training method and device and case structured method |
CN115631868A (en) * | 2022-11-17 | 2023-01-20 | 神州医疗科技股份有限公司 | Infectious disease early warning direct reporting method and system based on prompt learning model |
CN116386800A (en) * | 2023-06-06 | 2023-07-04 | 神州医疗科技股份有限公司 | Medical record data segmentation method and system based on pre-training language model |
CN116525125A (en) * | 2023-07-04 | 2023-08-01 | 之江实验室 | A method and device for generating a virtual electronic medical record |
CN116720519A (en) * | 2023-06-08 | 2023-09-08 | 吉首大学 | A method for identifying named entities of Miao medicine |
CN116804998A (en) * | 2023-08-22 | 2023-09-26 | 神州医疗科技股份有限公司 | Medical terminology retrieval method and system based on medical semantic understanding |
CN116994694A (en) * | 2023-09-27 | 2023-11-03 | 之江实验室 | Patient medical record data screening method, device and medium based on information extraction |
CN117094325A (en) * | 2023-09-25 | 2023-11-21 | 安徽农业大学 | Named entity identification method in rice pest field |
CN119361058A (en) * | 2024-10-29 | 2025-01-24 | 南京迈拓医药科技有限公司 | Electronic medical record feature extraction method based on natural language processing |
CN119514536A (en) * | 2025-01-20 | 2025-02-25 | 成都医星科技有限公司 | Method, electronic device and storage medium for extracting post-structured information from electronic medical records based on large language model |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107644014A (en) * | 2017-09-25 | 2018-01-30 | 南京安链数据科技有限公司 | A named entity recognition method based on bidirectional LSTM and CRF |
CN109359291A (en) * | 2018-08-28 | 2019-02-19 | 昆明理工大学 | A Named Entity Recognition Method |
CN109522546A (en) * | 2018-10-12 | 2019-03-26 | 浙江大学 | Context-sensitive medical named entity recognition method |
CN109558569A (en) * | 2018-12-14 | 2019-04-02 | 昆明理工大学 | A Lao part-of-speech tagging method based on the BiLSTM+CRF model |
2019-08-23 CN CN201910785097.4A patent/CN110705293A/en active Pending
Non-Patent Citations (2)
Title |
---|
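DETERMINED22: "DL4NLP: Sequence Labeling: Character-Based Chinese Named Entity Recognition with a BiLSTM-CRF Model", 《HTTPS://WWW.CNBLOGS.COM/DETERMINED22/P/7238342.HTML》 * |
机器之心PRO: "Comprehensively Surpassing BERT on Chinese Tasks: Baidu Officially Releases the NLP Pre-trained Model ERNI", 《HTTPS://BAIJIAHAO.BAIDU.COM/S?ID=1628129789985269995&WFR=SPIDER&FOR=PC》 * |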
Cited By (124)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111310471A (en) * | 2020-01-19 | 2020-06-19 | 陕西师范大学 | A Tourism Named Entity Recognition Method Based on BBLC Model |
CN111428502A (en) * | 2020-02-19 | 2020-07-17 | 中科世通亨奇(北京)科技有限公司 | Named entity labeling method for military corpus |
CN111339250A (en) * | 2020-02-20 | 2020-06-26 | 北京百度网讯科技有限公司 | Mining method, electronic device and computer readable medium of new category label |
US11755654B2 (en) | 2020-02-20 | 2023-09-12 | Beijing Baidu Netcom Science Technology Co., Ltd. | Category tag mining method, electronic device and non-transitory computer-readable storage medium |
CN111339250B (en) * | 2020-02-20 | 2023-08-18 | 北京百度网讯科技有限公司 | Mining method for new category labels, electronic equipment and computer readable medium |
CN111339759A (en) * | 2020-02-21 | 2020-06-26 | 北京百度网讯科技有限公司 | Method and device for training field element recognition model and electronic equipment |
CN111339759B (en) * | 2020-02-21 | 2023-07-25 | 北京百度网讯科技有限公司 | Domain element recognition model training method and device and electronic equipment |
CN111341404B (en) * | 2020-02-26 | 2023-07-14 | 山东浪潮智慧医疗科技有限公司 | Electronic medical record data set analysis method and system based on ernie model |
CN111341404A (en) * | 2020-02-26 | 2020-06-26 | 山东健康医疗大数据有限公司 | Electronic medical record data set analysis method and system based on ernie model |
CN111353311A (en) * | 2020-03-03 | 2020-06-30 | 平安医疗健康管理股份有限公司 | Named entity identification method and device, computer equipment and storage medium |
CN111523316A (en) * | 2020-03-04 | 2020-08-11 | 平安科技(深圳)有限公司 | Medicine identification method based on machine learning and related equipment |
CN111523316B (en) * | 2020-03-04 | 2024-11-01 | 平安科技(深圳)有限公司 | Drug identification method based on machine learning and related equipment |
CN111651991A (en) * | 2020-04-15 | 2020-09-11 | 天津科技大学 | A Medical Named Entity Recognition Method Using Multi-Model Fusion Strategy |
CN111651991B (en) * | 2020-04-15 | 2022-08-26 | 天津科技大学 | Medical named entity identification method utilizing multi-model fusion strategy |
CN111582497A (en) * | 2020-04-27 | 2020-08-25 | 平安医疗健康管理股份有限公司 | Training file generation and evaluation method, device, computer system and storage medium |
CN113642329A (en) * | 2020-04-27 | 2021-11-12 | 阿里巴巴集团控股有限公司 | Method and device for establishing term identification model, term identification method and device |
CN111651986A (en) * | 2020-04-28 | 2020-09-11 | 银江股份有限公司 | Event keyword extraction method, device, equipment and medium |
CN111651986B (en) * | 2020-04-28 | 2024-04-02 | 银江技术股份有限公司 | Event keyword extraction method, device, equipment and medium |
WO2021218024A1 (en) * | 2020-04-29 | 2021-11-04 | 平安科技(深圳)有限公司 | Method and apparatus for training named entity recognition model, and computer device |
CN111553164A (en) * | 2020-04-29 | 2020-08-18 | 平安科技(深圳)有限公司 | Training method and device for named entity recognition model and computer equipment |
CN111767723A (en) * | 2020-05-14 | 2020-10-13 | 上海大学 | A BIC-based entity labeling method for Chinese electronic medical records |
CN111709241A (en) * | 2020-05-27 | 2020-09-25 | 西安交通大学 | A Named Entity Recognition Method for Network Security |
CN111709241B (en) * | 2020-05-27 | 2023-03-28 | 西安交通大学 | Named entity identification method oriented to network security field |
CN111666414B (en) * | 2020-06-12 | 2023-10-17 | 上海观安信息技术股份有限公司 | Method for detecting cloud service by sensitive data and cloud service platform |
CN111666414A (en) * | 2020-06-12 | 2020-09-15 | 上海观安信息技术股份有限公司 | Method for detecting cloud service by sensitive data and cloud service platform |
CN111696674A (en) * | 2020-06-12 | 2020-09-22 | 电子科技大学 | Deep learning method and system for electronic medical record |
CN111696674B (en) * | 2020-06-12 | 2023-09-08 | 电子科技大学 | A deep learning method and system for electronic medical records |
CN111797629B (en) * | 2020-06-23 | 2022-07-29 | 平安医疗健康管理股份有限公司 | Method and device for processing medical text data, computer equipment and storage medium |
CN111797629A (en) * | 2020-06-23 | 2020-10-20 | 平安医疗健康管理股份有限公司 | Medical text data processing method and device, computer equipment and storage medium |
CN111785367A (en) * | 2020-06-30 | 2020-10-16 | 平安科技(深圳)有限公司 | Triage method, device and computer equipment based on neural network model |
CN111785387B (en) * | 2020-07-02 | 2024-06-11 | 朱玮 | Method and system for classifying disease standardization mapping by using Bert |
CN111785387A (en) * | 2020-07-02 | 2020-10-16 | 朱玮 | Method and system for disease standardized mapping classification by using Bert |
CN111858935A (en) * | 2020-07-13 | 2020-10-30 | 北京航空航天大学 | A fine-grained sentiment classification system for flight reviews |
CN111783466A (en) * | 2020-07-15 | 2020-10-16 | 电子科技大学 | A Named Entity Recognition Method for Chinese Medical Records |
CN111834014A (en) * | 2020-07-17 | 2020-10-27 | 北京工业大学 | A method and system for named entity recognition in the medical field |
CN111930909B (en) * | 2020-08-11 | 2023-09-12 | 付立军 | Geological intelligent question-answering oriented data automation sequence labeling identification method |
CN111930909A (en) * | 2020-08-11 | 2020-11-13 | 付立军 | Geological intelligent question and answer oriented data automatic sequence labeling identification method |
CN112001177A (en) * | 2020-08-24 | 2020-11-27 | 浪潮云信息技术股份公司 | Electronic medical record named entity identification method and system integrating deep learning and rules |
CN111984983A (en) * | 2020-08-28 | 2020-11-24 | 山东健康医疗大数据有限公司 | User privacy encryption method |
CN111986765B (en) * | 2020-09-03 | 2023-11-21 | 深圳平安智慧医健科技有限公司 | Electronic case entity marking method, electronic case entity marking device, electronic case entity marking computer equipment and storage medium |
CN111986765A (en) * | 2020-09-03 | 2020-11-24 | 平安国际智慧城市科技股份有限公司 | Electronic case entity marking method, device, computer equipment and storage medium |
CN112115714B (en) * | 2020-09-25 | 2023-08-18 | 深圳平安智慧医健科技有限公司 | Deep learning sequence labeling method, device and computer readable storage medium |
CN112115714A (en) * | 2020-09-25 | 2020-12-22 | 平安国际智慧城市科技股份有限公司 | Deep learning sequence labeling method and device and computer readable storage medium |
CN114328902A (en) * | 2020-10-09 | 2022-04-12 | 阿里巴巴集团控股有限公司 | Text labeling model construction method and device |
CN113807097A (en) * | 2020-10-30 | 2021-12-17 | 北京中科凡语科技有限公司 | Named Entity Recognition Model Establishment Method and Named Entity Recognition Method |
CN112347253B (en) * | 2020-11-04 | 2023-09-08 | 新奥新智科技有限公司 | Text information recognition model building method and device and terminal equipment |
CN112347253A (en) * | 2020-11-04 | 2021-02-09 | 新智数字科技有限公司 | Method and device for establishing text information recognition model and terminal equipment |
CN112380327B (en) * | 2020-11-09 | 2022-03-04 | 天翼爱音乐文化科技有限公司 | Cold-start slot filling method, system, device and storage medium |
CN112380327A (en) * | 2020-11-09 | 2021-02-19 | 天翼爱音乐文化科技有限公司 | Cold-start slot filling method, system, device and storage medium |
CN112487814A (en) * | 2020-11-27 | 2021-03-12 | 北京百度网讯科技有限公司 | Entity classification model training method, entity classification device and electronic equipment |
CN112487814B (en) * | 2020-11-27 | 2024-04-02 | 北京百度网讯科技有限公司 | Entity classification model training method, entity classification device and electronic equipment |
CN112347771A (en) * | 2020-12-03 | 2021-02-09 | 云知声智能科技股份有限公司 | Method and equipment for extracting entity relationship |
CN112420205A (en) * | 2020-12-08 | 2021-02-26 | 医惠科技有限公司 | Entity recognition model generation method and device and computer readable storage medium |
CN112699682A (en) * | 2020-12-11 | 2021-04-23 | 山东大学 | Named entity identification method and device based on combinable weak authenticator |
CN112613273A (en) * | 2020-12-16 | 2021-04-06 | 上海交通大学 | Compression method and system of multi-language BERT sequence labeling model |
CN112613273B (en) * | 2020-12-16 | 2022-09-23 | 上海交通大学 | Compression method and system of multi-language BERT sequence labeling model |
CN112614562B (en) * | 2020-12-23 | 2024-05-31 | 联仁健康医疗大数据科技股份有限公司 | Model training method, device, equipment and storage medium based on electronic medical record |
CN112614562A (en) * | 2020-12-23 | 2021-04-06 | 联仁健康医疗大数据科技股份有限公司 | Model training method, device, equipment and storage medium based on electronic medical record |
CN112669918A (en) * | 2020-12-24 | 2021-04-16 | 上海市第一人民医院 | Ophthalmic VEGF-related multidimensional clinical trial data processing method and system |
CN112614559A (en) * | 2020-12-29 | 2021-04-06 | 苏州超云生命智能产业研究院有限公司 | Medical record text processing method and device, computer equipment and storage medium |
CN112712118A (en) * | 2020-12-29 | 2021-04-27 | 银江股份有限公司 | Medical text data oriented filtering method and system |
CN112712118B (en) * | 2020-12-29 | 2024-06-21 | 银江技术股份有限公司 | Medical text data-oriented filtering method and system |
CN112749562A (en) * | 2020-12-31 | 2021-05-04 | 合肥工业大学 | Named entity identification method, device, storage medium and electronic equipment |
CN112712863A (en) * | 2021-01-05 | 2021-04-27 | 中国人民解放军海军军医大学第一附属医院 | Method and system for calculating clinical data of accurate drug administration for liver metastasis of colon cancer |
CN112669928A (en) * | 2021-01-06 | 2021-04-16 | 腾讯科技(深圳)有限公司 | Structured information construction method and device, computer equipment and storage medium |
CN112732863A (en) * | 2021-01-15 | 2021-04-30 | 清华大学 | Standardized segmentation method for electronic medical records |
CN112732863B (en) * | 2021-01-15 | 2022-12-23 | 清华大学 | Standardized segmentation method for electronic medical records |
CN112802570A (en) * | 2021-02-07 | 2021-05-14 | 成都延华西部健康医疗信息产业研究院有限公司 | Named entity recognition system and method for electronic medical record |
CN112949308A (en) * | 2021-02-25 | 2021-06-11 | 武汉大学 | Method and system for identifying named entities of Chinese electronic medical record based on functional structure |
CN112949310A (en) * | 2021-03-01 | 2021-06-11 | 创新奇智(上海)科技有限公司 | Model training method, traditional Chinese medicine name recognition method and device and network model |
CN112860842A (en) * | 2021-03-05 | 2021-05-28 | 联仁健康医疗大数据科技股份有限公司 | Medical record labeling method and device and storage medium |
CN112949311A (en) * | 2021-03-05 | 2021-06-11 | 北京工业大学 | Named entity identification method fusing font information |
CN113343680A (en) * | 2021-05-19 | 2021-09-03 | 大连东软教育科技集团有限公司 | Structured information extraction method based on multi-type case history texts |
CN113204969A (en) * | 2021-05-31 | 2021-08-03 | 平安科技(深圳)有限公司 | Medical named entity recognition model generation method and device and computer equipment |
WO2022252378A1 (en) * | 2021-05-31 | 2022-12-08 | 平安科技(深圳)有限公司 | Method and apparatus for generating medical named entity recognition model, and computer device |
CN113345542A (en) * | 2021-06-18 | 2021-09-03 | 北京百度网讯科技有限公司 | Electronic medical record management method, device, equipment, storage medium and computer program product |
CN113408279A (en) * | 2021-06-23 | 2021-09-17 | 平安科技(深圳)有限公司 | Training method, device and equipment of sequence labeling model and storage medium |
CN113408279B (en) * | 2021-06-23 | 2022-05-20 | 平安科技(深圳)有限公司 | Training method, device and equipment of sequence labeling model and storage medium |
CN113674866B (en) * | 2021-06-23 | 2024-06-14 | 江苏天瑞精准医疗科技有限公司 | Pre-training method for medical text |
CN113674866A (en) * | 2021-06-23 | 2021-11-19 | 江苏天瑞精准医疗科技有限公司 | Medical text oriented pre-training method |
CN113536182A (en) * | 2021-07-12 | 2021-10-22 | 广州万孚生物技术股份有限公司 | Method, device, electronic device and storage medium for generating long text webpage |
CN114093468A (en) * | 2021-07-27 | 2022-02-25 | 北京好欣晴移动医疗科技有限公司 | Cardiovascular disease information entity labeling and identifying method, device and system |
CN113299375A (en) * | 2021-07-27 | 2021-08-24 | 北京好欣晴移动医疗科技有限公司 | Method, device and system for marking and identifying digital file information entity |
CN113392633A (en) * | 2021-08-05 | 2021-09-14 | 中国医学科学院阜外医院 | Medical named entity identification method, device and storage medium |
CN113392633B (en) * | 2021-08-05 | 2021-12-24 | 中国医学科学院阜外医院 | Method, device and storage medium for medical named entity recognition |
CN113761891A (en) * | 2021-08-31 | 2021-12-07 | 国网冀北电力有限公司 | Power grid text data entity identification method, system, device and medium |
CN113657105A (en) * | 2021-08-31 | 2021-11-16 | 平安医疗健康管理股份有限公司 | Medical entity extraction method, device, equipment and medium based on vocabulary enhancement |
CN113761891B (en) * | 2021-08-31 | 2024-08-20 | 国网冀北电力有限公司 | Method, system, equipment and medium for identifying text data entity of power grid |
CN113868372A (en) * | 2021-09-08 | 2021-12-31 | 北京国研网信息股份有限公司 | Statistical bulletin index extraction method based on rules and text sequence annotation |
CN113886602A (en) * | 2021-10-19 | 2022-01-04 | 四川大学 | An entity recognition method of domain knowledge base based on multi-granularity cognition |
CN114048745A (en) * | 2021-11-05 | 2022-02-15 | 新智道枢(上海)科技有限公司 | Method and system for recognizing named entities of digital police service warning situation addresses |
CN114385824A (en) * | 2021-12-07 | 2022-04-22 | 暨南大学 | An Ethical Behavior Extraction Method Based on Pre-training Model |
CN114218954B (en) * | 2021-12-16 | 2024-11-08 | 云知声智能科技股份有限公司 | Method and device for distinguishing the positive and negative nature of disease entities and symptom entities in medical record text |
CN114218954A (en) * | 2021-12-16 | 2022-03-22 | 云知声智能科技股份有限公司 | Method and device for distinguishing negative and positive of disease entity and symptom entity in medical record text |
CN114330350B (en) * | 2022-01-05 | 2024-10-11 | 北京环境特性研究所 | Named entity recognition method and device, electronic equipment and storage medium |
CN114330350A (en) * | 2022-01-05 | 2022-04-12 | 北京环境特性研究所 | Named entity identification method and device, electronic equipment and storage medium |
CN114530223A (en) * | 2022-01-18 | 2022-05-24 | 华南理工大学 | NLP-based cardiovascular disease medical record structuring system |
CN114692634A (en) * | 2022-01-27 | 2022-07-01 | 清华大学 | Chinese named entity recognition and classification method and device |
CN114692634B (en) * | 2022-01-27 | 2024-12-31 | 清华大学 | Chinese named entity recognition and classification method and device |
CN114528408A (en) * | 2022-02-25 | 2022-05-24 | 大连海洋大学 | ResNeXt-fused EBAC fish text combined extraction method |
CN114757191A (en) * | 2022-03-29 | 2022-07-15 | 国网江苏省电力有限公司营销服务中心 | Named Entity Recognition Method System in Electric Power Public Opinion Field Based on Deep Learning |
CN114944211A (en) * | 2022-05-20 | 2022-08-26 | 医渡云(北京)技术有限公司 | Medical entity identification method, device, equipment and storage medium |
CN115148367B (en) * | 2022-06-27 | 2025-03-14 | 厦门快商通科技股份有限公司 | A medical symptom entity information normalization method and system |
CN115148367A (en) * | 2022-06-27 | 2022-10-04 | 厦门快商通科技股份有限公司 | A method and system for normalizing medical symptom entity information |
CN114913953A (en) * | 2022-07-19 | 2022-08-16 | 北京惠每云科技有限公司 | Medical entity relationship identification method and device, electronic equipment and storage medium |
CN115146642B (en) * | 2022-07-21 | 2023-08-29 | 北京市科学技术研究院 | Named entity recognition-oriented training set automatic labeling method and system |
CN115146642A (en) * | 2022-07-21 | 2022-10-04 | 北京市科学技术研究院 | Automatic training set labeling method and system for named entity recognition |
CN115171835B (en) * | 2022-09-02 | 2022-12-23 | 北京智源人工智能研究院 | Case structured model training method, device and case structured method |
CN115171835A (en) * | 2022-09-02 | 2022-10-11 | 北京智源人工智能研究院 | Case structured model training method and device and case structured method |
CN115631868B (en) * | 2022-11-17 | 2023-04-21 | 神州医疗科技股份有限公司 | Infectious disease early warning direct-reporting method and system based on prompt learning model |
CN115631868A (en) * | 2022-11-17 | 2023-01-20 | 神州医疗科技股份有限公司 | Infectious disease early warning direct reporting method and system based on prompt learning model |
CN116386800A (en) * | 2023-06-06 | 2023-07-04 | 神州医疗科技股份有限公司 | Medical record data segmentation method and system based on pre-training language model |
CN116386800B (en) * | 2023-06-06 | 2023-08-18 | 神州医疗科技股份有限公司 | Medical record data segmentation method and system based on pre-training language model |
CN116720519B (en) * | 2023-06-08 | 2023-12-19 | 吉首大学 | Miao medicine named entity identification method |
CN116720519A (en) * | 2023-06-08 | 2023-09-08 | 吉首大学 | A method for identifying named entities of Miao medicine |
CN116525125A (en) * | 2023-07-04 | 2023-08-01 | 之江实验室 | A method and device for generating a virtual electronic medical record |
CN116525125B (en) * | 2023-07-04 | 2023-09-19 | 之江实验室 | Virtual electronic medical record generation method and device |
CN116804998A (en) * | 2023-08-22 | 2023-09-26 | 神州医疗科技股份有限公司 | Medical terminology retrieval method and system based on medical semantic understanding |
CN117094325B (en) * | 2023-09-25 | 2024-03-29 | 安徽农业大学 | Named entity recognition method in the field of rice diseases and insect pests |
CN117094325A (en) * | 2023-09-25 | 2023-11-21 | 安徽农业大学 | Named entity identification method in rice pest field |
CN116994694B (en) * | 2023-09-27 | 2024-01-09 | 之江实验室 | Patient medical record data screening method, device and medium based on information extraction |
CN116994694A (en) * | 2023-09-27 | 2023-11-03 | 之江实验室 | Patient medical record data screening method, device and medium based on information extraction |
CN119361058A (en) * | 2024-10-29 | 2025-01-24 | 南京迈拓医药科技有限公司 | Electronic medical record feature extraction method based on natural language processing |
CN119514536A (en) * | 2025-01-20 | 2025-02-25 | 成都医星科技有限公司 | Method, electronic device and storage medium for extracting post-structured information from electronic medical records based on large language model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110705293A (en) | Electronic medical record text named entity recognition method based on pre-training language model | |
CN111949759B (en) | Retrieval method, system and computer equipment for medical record text similarity | |
CN110459287B (en) | Structured report data from medical text reports | |
CN109697285B (en) | Hierarchical BiLSTM Chinese electronic medical record disease coding and labeling method for enhancing semantic representation | |
CN111540468B (en) | A method and system for ICD automatic coding with visualization of diagnosis causes | |
US11244755B1 (en) | Automatic generation of medical imaging reports based on fine grained finding labels | |
CN111078875B (en) | Method for extracting question-answer pairs from semi-structured document based on machine learning | |
Zhang et al. | MIE: A medical information extractor towards medical dialogues | |
Li et al. | Ffa-ir: Towards an explainable and reliable medical report generation benchmark | |
CN109192255B (en) | Medical record structuring method | |
US20200160510A1 (en) | Automated Patient Complexity Classification for Artificial Intelligence Tools | |
López-Úbeda et al. | Natural language processing in radiology: update on clinical applications | |
CN111651991B (en) | Medical named entity identification method utilizing multi-model fusion strategy | |
CN112541066B (en) | Text-based structured medical report detection method and related equipment | |
CN112151183A (en) | An entity recognition method for Chinese electronic medical records based on Lattice LSTM model | |
Ke et al. | Medical entity recognition and knowledge map relationship analysis of Chinese EMRs based on improved BiLSTM-CRF | |
CN113808758B (en) | Method and device for normalizing check data, electronic equipment and storage medium | |
US11763081B2 (en) | Extracting fine grain labels from medical imaging reports | |
CN114492444B (en) | A method for labeling medical entity parts of speech in Chinese electronic medical records | |
CN106844351A (en) | Method and device for recognizing medical-institution organization entities across multiple data sources | |
Yu et al. | Bios: An algorithmically generated biomedical knowledge graph | |
CN115841861A (en) | Similar medical record recommendation method and system | |
Peng et al. | A self-attention based deep learning method for lesion attribute detection from CT reports | |
CN118171653B (en) | Health physical examination text treatment method based on deep neural network | |
JP2022128583A (en) | Medical information processing device, medical information processing method and medical information processing program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200117 |