CN116312915B - A method and system for standardized association of drug terms in electronic medical records - Google Patents

Info

Publication number
CN116312915B
Authority
CN
China
Prior art keywords
drug
term
terms
synonym
electronic medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310567874.4A
Other languages
Chinese (zh)
Other versions
CN116312915A (en
Inventor
李劲松
马爽
杨宗峰
王昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202310567874.4A
Publication of CN116312915A
Application granted
Publication of CN116312915B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Primary Health Care (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a method and system for standardized association of drug terms in electronic medical records. A drug term library is updated by synonym mining, which addresses the low semantic similarity between standard drug terms in the library and external drug terms in electronic medical records. When external drug terms in an electronic medical record are associated with standard drug terms in the updated library, the semantic information includes the pinyin character sequence of each drug term in addition to its Chinese character tokens, and the graph structure information of both the drug term library and the external drug terms is fully used. An association prediction model based on semantic embedding and structural embedding is constructed, so that associations between external drug terms in real-world electronic medical records and standard drug terms in the drug term library are established accurately.

Description

A method and system for standardized association of drug terms in electronic medical records

Technical field

The invention belongs to the field of medical information technology, and in particular relates to a method and system for standardized association of drug terms in electronic medical records.

Background

With the development of information technology and its increasingly deep application in the healthcare industry, the industry has accumulated a large amount of data. Typical examples are knowledge bases (KBs), which present information in a relatively standardized form, and electronic health records (EHRs), which record real-world diagnosis and treatment processes. A knowledge base is a technology that computer systems use to store complex structured and unstructured information. A terminology base (TB) is a special type of knowledge base that stores term concepts and their related information. In drug research, general-purpose drug term libraries that have been built and are still being updated include DrugBank and WHODrug, and Chinese drug term libraries have also been built on demand in academia and industry. However, in real-world clinical practice, different regions, hospitals, and even individual doctors may use several different names for the same drug, and an existing drug term library does not necessarily record every name of a drug. For example, the drug with DrugBank ID DB00736 has the English name "Esomeprazole magnesium"; its former Chinese name is "埃索美拉唑镁" and its current Chinese name is "艾司奥美拉唑镁". In an EHR system it may be recorded as "埃索美拉唑镁" before the generic name was changed and as "艾司奥美拉唑镁" afterwards. When EHR data are used for real-world drug research, missing either name makes data retrieval incomplete, which leads to unreasonable selection of the study population and incorrect estimates of drug use, and ultimately degrades research quality. Therefore, when EHR data are used for real-world drug research, especially multi-center studies involving multiple drugs, associating the drug names in EHRs with the corresponding drugs in a drug term library is necessary and is an important precondition for research quality and reliable results. Drug term libraries are important information for medical research and engineering, and keeping them up to date is a basis for information exchange and technical progress in the field. Associating them with real-world EHR data provides underlying support for, and helps advance, research and engineering tasks based on EHRs, such as natural language processing, artificial intelligence, expert systems, and real-world drug research.

Among existing drug association methods, a general-model-based medical standard terminology management system and method (publication number CN115080751A) addresses the mapping between medical record text and standard terms. It first splits the medical record text with a sequence labeling model to obtain fine-grained text attributes, then computes their similarity to each semantic standard word, and uses the semantic similarity to judge whether a standardized mapping is valid. A valid mapping is used directly as the mapping result; if it is invalid, other possible standardized mappings are recomputed and output as algorithm-recommended mappings that require manual review. However, this solution judges mapping validity by semantic similarity alone and ignores the structural features of the drug term library.

A drug name matching method and device (publication number CN112711642A) addresses drug matching between different electronic medical records. It first trains word vectors on an electronic medical record corpus, extracts drug names based on the Unified Medical Language System to obtain drug entity word vectors, uses a neural network model to obtain composition vectors, and combines these with engineered features to compute the similarity between drug entities, thereby matching drugs across different electronic medical record systems. Assuming that the Unified Medical Language System is complete, this solution solves drug matching between different sets of electronic health record data, but it offers limited guidance for the problem addressed by the present invention, namely matching drug terms in electronic health records to a drug term library.

A drug product information matching method and system (publication number 107103048B) addresses matching between drug products. It first obtains sub-information of the drug product to be matched in multiple dimensions, such as product name, preparation specification, and dosage form, and performs relevance identification between the target sub-information and the standard sub-information. When the relevance result meets a preset requirement, the target information that meets the requirement is paired with one or more pieces of standard information to form one or more candidate information pairs. For each candidate pair, the similarity between the target information and the standard information is computed on each dimension of sub-information, and a comprehensive matching score is computed from these similarities. Finally, the standard information in the candidate pair with the highest comprehensive matching score is taken as the match for the target information. A medical drug product matching method, device, electronic equipment, and storage medium (publication number CN111798969A) addresses matching between a target drug product and a drug product standard library. For a target drug product to be matched, the method selects from the product information several identifiers or specifications that characterize the product as benchmark items, assigns each benchmark item a weight according to its importance, matches the benchmark items against the standard items in the drug product benchmark library and computes comparison values, and then computes the degree of matching between the target drug product and the products in the benchmark library from the comparison values and weights, thereby mapping the identifier of the target drug product to its standard identifier stored in the benchmark library. These two solutions solve the matching of target drug products to a drug product standard library. Compared with a drug (药物), a drug product (药品) carries more sub-information: besides the product name, it includes multi-dimensional information such as preparation specification, dosage form, manufacturer, and approval number. The problem addressed by the present invention is drug association, where the available text information is limited, so these methods are not applicable.

The limitations of the prior art are mainly as follows. Only semantic similarity is used in the association process, and graph structure information is not used. The semantic similarity also does not use pinyin information. Drug names may be written with different characters that are pronounced the same, so if only the semantic similarity of the Chinese names is used, "头孢拉啶" (cefradine, variant spelling) may be scored as about equally similar to "头孢拉定" (cefradine) and "头孢他啶" (ceftazidime). From the pinyin, however, it is clear that "头孢拉啶" and "头孢拉定" are the same drug. Ignoring pinyin therefore leads to inaccurate association results.
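The homophone problem described above can be illustrated with a minimal sketch. The `PINYIN` table is a hand-written, hypothetical lookup that covers only the characters in this example, and `overlap` is a simple position-wise match ratio chosen for illustration; it is not the similarity measure used by the invention.

```python
# Hypothetical hand-written pinyin table covering only the characters used below.
# A real system would need a full pinyin dictionary.
PINYIN = {"头": "tou", "孢": "bao", "拉": "la", "啶": "ding", "定": "ding", "他": "ta"}


def to_pinyin(term):
    """Map each Chinese character to its toneless pinyin syllable."""
    return [PINYIN[ch] for ch in term]


def overlap(a, b):
    """Fraction of aligned positions where two equal-length sequences agree."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)


query = "头孢拉啶"
for candidate in ("头孢拉定", "头孢他啶"):
    char_sim = overlap(list(query), list(candidate))
    pinyin_sim = overlap(to_pinyin(query), to_pinyin(candidate))
    print(candidate, char_sim, pinyin_sim)
# 头孢拉定 0.75 1.0
# 头孢他啶 0.75 0.75
```

At the character level both candidates look equally close to the query, while the pinyin sequences single out "头孢拉定" as the homophone.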

Summary of the invention

The purpose of the present invention is to address the shortcomings of the prior art by providing a method and system for standardized association of drug terms in electronic medical records, which associates external drug terms in electronic medical records with standard drug terms in a drug term library.

The purpose of the present invention is achieved through the following technical solutions:

According to a first aspect of this specification, a method for standardized association of drug terms in electronic medical records is provided, including:

S1: input the drug term library and obtain the synonym set of each standard drug term;

S2: obtain a drug term library updated by synonym mining, including:

constructing a corpus for synonym mining and obtaining a list of drug terms from the corpus;

training a synonym set classifier to obtain, for each drug term in the drug term list, a classification prediction against the synonym sets in the drug term library, and obtaining all synonym sets updated by synonym mining according to a preset probability threshold;

updating the drug term library according to all the synonym sets updated by synonym mining;

S3: train an association prediction model based on semantic embedding and structural embedding from the updated drug term library and the external drug terms in the electronic medical record, including:

obtaining, through a pre-trained language model, a semantic embedding representation of each pair of an external drug term in the electronic medical record and a standard drug term in the updated drug term library, specifically: the external drug term and its pinyin character sequence, and the standard drug term and its pinyin character sequence, are combined with a start character and separator characters to form a character sequence for the associated drug term pair, which is input to the pre-trained language model to obtain the semantic embedding representation;

obtaining, through a graph convolutional neural network model, a structural embedding representation of each pair of an external drug term in the electronic medical record and a standard drug term in the updated drug term library, specifically: candidate association relationships between the external drug term and the drug terms in the updated drug term library are established by similarity calculation; the semantic embedding representations of the external drug term and of the drug terms in the updated drug term library are used as the initial node embedding representations of the corresponding drug terms and input to the graph convolutional neural network model to obtain the node embedding representations of the corresponding drug terms; and the product of the node embedding representations of the external drug term and the standard drug term is used as the structural embedding representation;

S4: use the association prediction model to predict the association between external drug terms in the electronic medical record and standard drug terms in the drug term library.
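The input sequence described in S3 for the pre-trained language model can be sketched as follows. The token strings `[CLS]` and `[SEP]`, the order of the four segments, and the letter-by-letter spelling of pinyin are assumptions made for illustration; the text only states that the terms, their pinyin character sequences, a start character, and separator characters are combined. `PINYIN` is a hypothetical lookup that covers only the example characters.

```python
# Hypothetical pinyin lookup for illustration only.
PINYIN = {"头": "tou", "孢": "bao", "拉": "la", "啶": "ding", "定": "ding"}
CLS, SEP = "[CLS]", "[SEP]"  # assumed BERT-style start and separator tokens


def pinyin_chars(term):
    """Spell a term as a flat list of pinyin letters."""
    return [letter for ch in term for letter in PINYIN[ch]]


def build_pair_sequence(external_term, standard_term):
    """Assemble: start, external term, its pinyin, standard term, its pinyin."""
    return (
        [CLS]
        + list(external_term) + [SEP]
        + pinyin_chars(external_term) + [SEP]
        + list(standard_term) + [SEP]
        + pinyin_chars(standard_term) + [SEP]
    )


seq = build_pair_sequence("头孢拉啶", "头孢拉定")
print(" ".join(seq))
```

In a setup like this, the embedding at the start token would serve as the pair-level semantic embedding.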

Further, during training of the synonym set classifier, the probability that a drug term to be classified belongs to a synonym set is predicted from the change in the set unity score. The set unity score is computed as follows: compute the embedding representation of each term in the set; input each embedding representation to a fully connected neural network model to obtain a new term representation; sum all new term representations and take the mean to obtain the initial term set representation; and input the initial term set representation to a fully connected neural network model to obtain the set unity score.
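A minimal sketch of the set unity score follows, using plain Python lists in place of a neural network library. The tanh activation, the single-layer networks, the toy weights, and the use of a sigmoid over the score difference to obtain a probability are all assumptions for illustration; the text specifies only the sequence of operations (embed, transform, average, score).

```python
import math


def dense(x, weights, bias):
    """One fully connected layer with tanh activation (activation is an assumption)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def set_unity_score(term_embeddings, term_layer, set_layer):
    """Score how well a set of terms holds together as one synonym set."""
    # 1) Transform each term embedding with a fully connected layer.
    transformed = [dense(e, *term_layer) for e in term_embeddings]
    # 2) Sum and average the transformed terms to get the set representation.
    n = len(transformed)
    set_repr = [sum(col) / n for col in zip(*transformed)]
    # 3) Map the set representation to a scalar score.
    return dense(set_repr, *set_layer)[0]


def membership_probability(set_embeddings, candidate, term_layer, set_layer):
    """Sigmoid of the score change caused by adding the candidate (assumed form)."""
    before = set_unity_score(set_embeddings, term_layer, set_layer)
    after = set_unity_score(set_embeddings + [candidate], term_layer, set_layer)
    return 1.0 / (1.0 + math.exp(-(after - before)))


# Toy, untrained weights and 2-dimensional embeddings, for illustration only.
term_layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
set_layer = ([[1.0, 1.0]], [0.0])
synset = [[0.9, 0.1], [0.8, 0.2]]
p_close = membership_probability(synset, [0.85, 0.15], term_layer, set_layer)
p_far = membership_probability(synset, [-0.9, -0.1], term_layer, set_layer)
print(p_close > p_far)
```

A candidate whose embedding resembles the set members leaves the score nearly unchanged, while a dissimilar candidate lowers it and therefore receives a lower probability.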

Further, the training set of the synonym set classifier is generated as follows: a drug term is drawn at random from a synonym set and combined with the set formed by the remaining drug terms of that synonym set to produce a positive training sample; each positive training sample is matched with multiple negative training samples, which are drawn from the drug term library after excluding the drug terms in that synonym set.
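The sampling scheme can be sketched as below. The function name, the `(remaining_set, term, label)` tuple layout, the default of two negatives per positive, and the fixed seed are illustrative assumptions.

```python
import random


def make_training_samples(synonym_sets, negatives_per_positive=2, seed=0):
    """Build (remaining_set, term, label) samples from known synonym sets."""
    rng = random.Random(seed)
    all_terms = sorted({t for s in synonym_sets for t in s})
    samples = []
    for synset in synonym_sets:
        if len(synset) < 2:
            continue  # at least one term must remain in the set
        # Positive sample: hold out one random member of the set.
        held_out = rng.choice(sorted(synset))
        rest = frozenset(synset) - {held_out}
        samples.append((rest, held_out, 1))
        # Negative samples: terms drawn from outside this synonym set.
        pool = [t for t in all_terms if t not in synset]
        k = min(negatives_per_positive, len(pool))
        for negative in rng.sample(pool, k):
            samples.append((rest, negative, 0))
    return samples


synonym_sets = [
    {"esomeprazole magnesium", "埃索美拉唑镁", "艾司奥美拉唑镁"},
    {"cefradine", "头孢拉定"},
    {"ceftazidime", "头孢他啶"},
]
for rest, term, label in make_training_samples(synonym_sets):
    print(sorted(rest), term, label)
```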

Further, the drug term library is updated according to all the synonym sets updated by synonym mining, specifically: if the updated synonym set belongs to a standard drug term that is a hypernym, a synonym association is established between each new synonym and that hypernym, and a hyponym association is established between each new synonym and every standard drug term associated with that hypernym; if the updated synonym set belongs to a standard drug term that is not a hypernym, a synonym association is established between each new synonym and that standard drug term.
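The update rule can be sketched with a small in-memory structure. The dictionary-of-sets layout, the function name, and the placeholder term names are hypothetical; the text does not specify how the library is stored.

```python
def add_mined_synonym(library, standard_term, new_synonym):
    """Attach a mined synonym to a standard term and propagate hyponym links."""
    # Every mined synonym is linked to its standard term.
    library["synonyms"].setdefault(standard_term, set()).add(new_synonym)
    # If the standard term is a hypernym, the synonym inherits its hyponyms.
    children = library["hyponyms"].get(standard_term, set())
    if children:
        library["hyponyms"].setdefault(new_synonym, set()).update(children)


# Placeholder term names, for illustration only.
library = {
    "synonyms": {"drug-A": {"drug-A-name-1"}, "drug-A-salt": set()},
    "hyponyms": {"drug-A": {"drug-A-salt", "drug-A-ester"}},
}
add_mined_synonym(library, "drug-A", "drug-A-alias")            # hypernym case
add_mined_synonym(library, "drug-A-salt", "drug-A-salt-alias")  # non-hypernym case
print(library)
```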

Further, the pre-trained language model is fine-tuned, specifically: the semantic embedding representation of the start character is the independent variable, and the dependent variable is a label indicating whether the external drug term in the electronic medical record and the drug term in the updated drug term library are semantically associated; a nonlinear activation function is applied to obtain a prediction from the semantic embedding representation; positive training samples are obtained by manual annotation and negative training samples by random sampling to form the training set; and the pre-trained language model is trained on this set to optimize the semantic embedding representation.

Further, the candidate association relationships between the external drug term and the drug terms in the updated drug term library are established as follows: compute the TF-IDF value of each word in the drug terms to obtain a vector representation of the external drug term in the electronic medical record and of each drug term in the updated drug term library; compute the similarity between the two drug terms; and if the similarity is greater than a preset similarity threshold, establish a candidate association relationship between the external drug term and the corresponding drug term in the drug term library.
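A minimal sketch of this candidate generation step follows. Treating each Chinese character as a "word", the smoothed IDF formula, cosine similarity as the similarity measure, and the threshold of 0.5 are assumptions for illustration; the text specifies only TF-IDF vectors, a similarity, and a preset threshold.

```python
import math
from collections import Counter


def tfidf_vectors(terms):
    """Character-level TF-IDF vectors (characters as tokens is an assumption)."""
    docs = [Counter(t) for t in terms]
    n = len(docs)
    df = Counter(ch for d in docs for ch in d)
    # Smoothed inverse document frequency.
    idf = {ch: math.log((1 + n) / (1 + c)) + 1.0 for ch, c in df.items()}
    return [{ch: cnt / sum(d.values()) * idf[ch] for ch, cnt in d.items()}
            for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def candidate_links(external_terms, library_terms, threshold=0.5):
    """Return (external, library, similarity) triples above the threshold."""
    vecs = tfidf_vectors(list(external_terms) + list(library_terms))
    ext_vecs = vecs[:len(external_terms)]
    lib_vecs = vecs[len(external_terms):]
    return [(e, l, cosine(ev, lv))
            for e, ev in zip(external_terms, ext_vecs)
            for l, lv in zip(library_terms, lib_vecs)
            if cosine(ev, lv) > threshold]


links = candidate_links(["头孢拉啶"], ["头孢拉定", "头孢他啶", "艾司奥美拉唑镁"])
for ext, lib, sim in links:
    print(ext, lib, round(sim, 3))
```

In this toy run both cephalosporin names pass the threshold and the unrelated term does not, which is consistent with this step producing candidates that the downstream prediction model then has to disambiguate.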

Further, the input of each layer of the graph convolutional neural network model has two parts: a node embedding representation matrix and an adjacency matrix. The output of each layer, obtained through a normalized graph Laplacian transform, serves as the node embedding representation matrix of the next layer. The graph convolutional neural network model is optimized with a margin-based distance loss function.
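One common way to realize such a layer is the symmetric normalization D^-1/2 (A + I) D^-1/2 followed by a linear map and ReLU. The sketch below assumes that form, added self-loops, Euclidean distance, and a triplet-style margin loss; these specifics are not stated in the text above and may differ from the invention's exact formulas.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 (self-loops are an assumption)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj_norm, h, w):
    """One graph convolution: ReLU(adj_norm @ H @ W)."""
    z = matmul(matmul(adj_norm, h), w)
    return [[max(0.0, v) for v in row] for row in z]


def distance(u, v):
    """Euclidean distance between two node embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def margin_loss(anchor, positive, negative, margin=1.0):
    """Margin-based distance loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)


# Toy graph: nodes 0 and 1 are linked, node 2 is isolated.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # initial node embeddings
w = [[1.0, 0.0], [0.0, 1.0]]              # identity weights for readability
out = gcn_layer(normalize_adjacency(adj), h, w)
print(out)
```

After one layer the two linked nodes share the same embedding, which is the smoothing effect that lets graph structure inform the representation of each term.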

Further, the entries of the adjacency matrix are set as follows: if there is an edge from one drug term to another drug term in the updated drug term library, the corresponding entry is 1, otherwise 0; if there is an edge from an external drug term in the electronic medical record to a drug term in the updated drug term library, the corresponding entry is the similarity value of the candidate association relationship.
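The adjacency rule can be sketched as follows. The node ordering (library terms first, then external terms), the function name, and the placeholder term names are illustrative assumptions.

```python
def build_adjacency(library_terms, external_terms, library_edges, candidate_links):
    """Build a weighted adjacency matrix over library and external terms."""
    nodes = list(library_terms) + list(external_terms)
    index = {term: i for i, term in enumerate(nodes)}
    adj = [[0.0] * len(nodes) for _ in nodes]
    # Edges inside the term library get weight 1.
    for src, dst in library_edges:
        adj[index[src]][index[dst]] = 1.0
    # Edges from external terms carry the candidate similarity value.
    for ext, lib, sim in candidate_links:
        adj[index[ext]][index[lib]] = sim
    return nodes, adj


nodes, adj = build_adjacency(
    library_terms=["A", "B"],
    external_terms=["x"],
    library_edges=[("A", "B")],
    candidate_links=[("x", "A", 0.8)],
)
for name, row in zip(nodes, adj):
    print(name, row)
```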

Further, the semantic embedding representation and the structural embedding representation are concatenated, and the concatenated representation is input to a multi-layer perceptron comprising multiple fully connected hidden layers and a single-node output layer. The output of the multi-layer perceptron is converted to a scalar by a nonlinear activation function, giving the association probability between each external drug term in the electronic medical record and each standard drug term in the updated drug term library.
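The fusion step can be sketched in plain Python. The element-wise form of the node embedding product, the ReLU hidden activation, the sigmoid output, and the toy weights are assumptions for illustration.

```python
import math


def structural_embedding(node_external, node_standard):
    """Element-wise product of two node embeddings (element-wise is an assumption)."""
    return [a * b for a, b in zip(node_external, node_standard)]


def association_probability(semantic, structural, hidden_layers, output_layer):
    """Concatenate both embeddings, run an MLP, and squash the result to (0, 1)."""
    x = list(semantic) + list(structural)  # concatenation
    for weights, bias in hidden_layers:    # fully connected hidden layers with ReLU
        x = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, bias)]
    w_out, b_out = output_layer            # single-node output layer
    logit = sum(w * v for w, v in zip(w_out, x)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid as the nonlinear activation


# Toy, untrained weights for illustration only.
semantic = [1.0, 0.0]
structural = structural_embedding([1.0, 2.0], [0.5, 0.5])
hidden_layers = [([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]], [0.0, 0.0])]
output_layer = ([1.0, 1.0], -1.0)
p = association_probability(semantic, structural, hidden_layers, output_layer)
print(round(p, 4))
```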

According to a second aspect of this specification, a system for standardized association of drug terms in electronic medical records is provided, including:

an electronic medical record drug term input module, configured to obtain all external drug terms in the electronic medical record that are to be standardized;

a drug term library synonym mining and update module, configured to construct a corpus and obtain from it a list of drug terms for synonym mining; obtain the synonym set of each standard drug term in the drug term library; train a synonym set classifier to obtain, for each drug term in the drug term list, a classification prediction against the synonym sets in the drug term library; obtain all synonym sets updated by synonym mining according to a preset probability threshold; and update the drug term library;

a candidate association relationship establishment module, configured to establish candidate association relationships between external drug terms in the electronic medical record and drug terms in the updated drug term library by similarity calculation;

a semantic embedding representation module, configured to combine the external drug term in the electronic medical record and its pinyin character sequence, and the standard drug term in the updated drug term library and its pinyin character sequence, with a start character and separator characters to form a character sequence for the associated drug term pair, and to input this sequence to a pre-trained language model to obtain the semantic embedding representation;

a structural embedding representation module, configured to use the semantic embedding representations of the external drug terms in the electronic medical record and of the drug terms in the updated drug term library as the initial node embedding representations of the corresponding drug terms, input them to a graph convolutional neural network model to obtain the node embedding representations of the corresponding drug terms, and use the product of the node embedding representations of the external drug term and the standard drug term as the structural embedding representation;

an association prediction module, configured to train an association prediction model based on semantic embedding and structural embedding from the updated drug term library and the external drug terms in the electronic medical record, and to use the association prediction model to predict the association between external drug terms in the electronic medical record and standard drug terms in the drug term library.

The beneficial effects of the present invention are as follows. Synonym mining enriches both the semantic information and the graph structure information, and the association prediction model uses semantic information and graph structure information together. Specifically:

1) The drug term library is updated by synonym mining to obtain a library updated by synonym mining, which addresses the low semantic similarity between standard drug terms in the library and external drug terms in electronic medical records.

2) When external drug terms in the electronic medical record are associated with standard drug terms in the updated drug term library, the semantic information includes the pinyin tokens of each term in addition to its Chinese character tokens.

3) When external drug terms in the electronic medical record are associated with standard drug terms in the updated drug term library, the graph structure information of the drug term library is fully used.

4) When external drug terms in the electronic medical record are associated with standard drug terms in the updated drug term library, the graph structure information of the external drug terms is obtained by establishing associations between the external drug terms and the drug terms in the library.

5) Through the above, embedding representations of pairs of external drug terms in the electronic medical record and standard drug terms in the updated drug term library are obtained and used for prediction by the association prediction model.

Description of the drawings

Figure 1 is a flow chart of the overall steps of the method for standardized association of drug terms in electronic medical records, according to an exemplary embodiment;

Figure 2 is a flow chart of the implementation of the method for standardized association of drug terms in electronic medical records, according to an exemplary embodiment;

Figure 3 is a schematic diagram of the original drug terminology database, according to an exemplary embodiment;

Figure 4 is a schematic diagram of the drug terminology database updated by synonym mining, according to an exemplary embodiment;

Figure 5 is a structural diagram of the system for standardized association of drug terms in electronic medical records, according to an exemplary embodiment.

Detailed description of the embodiments

To make the above objects, features and advantages of the present invention easier to understand, specific embodiments are described in detail below with reference to the accompanying drawings.

Many specific details are set out in the following description to support a full understanding of the present invention. The invention can also be implemented in ways other than those described here, and those skilled in the art can generalize it without departing from its substance, so the invention is not limited to the specific embodiments disclosed below.

As shown in Figures 1 and 2, an embodiment of the present invention provides a method for standardized association of drug terms in electronic medical records, comprising the following steps:

Step S1: Input the drug terminology database, denoted as a pair $(E, R)$, where $E$ is the set of drug terms, containing standard drug terms (each either a hypernym or a non-hypernym) and synonyms of standard drug terms, and $R$ is the set of relations between drug terms. Specifically, $R = \{r_{\text{hypo}}, r_{\text{syn}}\}$, where $r_{\text{hypo}}$ denotes the hyponym relation and $r_{\text{syn}}$ denotes the synonym relation. A relation between drug terms is written as a triple: $(h, r_{\text{syn}}, t)$ means that drug term $h$ is a synonym of $t$, and $(h, r_{\text{hypo}}, t)$ means that drug term $h$ is a hyponym of $t$. The drug terms related by $r_{\text{syn}}$ are converted into synonym sets, giving one synonym set for each standard drug term;

Figure 3 shows an example drug terminology database. Its standard drug terms are 埃索美拉唑 (esomeprazole), 埃索美拉唑锶 (esomeprazole strontium), 埃索美拉唑锶水合物 (esomeprazole strontium hydrate), 埃索美拉唑钠 (esomeprazole sodium), 埃索美拉唑镁 (esomeprazole magnesium), 埃索美拉唑镁二水合物 (esomeprazole magnesium dihydrate), 埃索美拉唑镁三水合物 (esomeprazole magnesium trihydrate), 钆喷酸 (gadopentetic acid), 钆喷酸二葡甲胺 (gadopentetate dimeglumine), 拉氧头孢 (latamoxef) and 拉氧头孢钠 (latamoxef sodium). The hyponyms of 埃索美拉唑 are 埃索美拉唑锶, 埃索美拉唑锶水合物, 埃索美拉唑钠, 埃索美拉唑镁, 埃索美拉唑镁二水合物 and 埃索美拉唑镁三水合物; taking 埃索美拉唑锶 as an example, the relation is written $(\text{埃索美拉唑锶}, r_{\text{hypo}}, \text{埃索美拉唑})$. The synonyms of 拉氧头孢 include 拉他头孢 (an alternative Chinese name for latamoxef), and its hyponyms include 拉氧头孢钠. The hyponyms of 钆喷酸 include 钆喷酸二葡甲胺, whose synonyms include 钆喷酸双葡甲胺 and 钆喷酸葡胺 (alternative Chinese names for the same gadopentetate salt); the relation between 钆喷酸双葡甲胺 and 钆喷酸二葡甲胺 is written $(\text{钆喷酸双葡甲胺}, r_{\text{syn}}, \text{钆喷酸二葡甲胺})$. The standard drug terms are the target objects of association in this embodiment.
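The conversion in step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the relation labels `SYN` and `HYPO`, the function name `build_synonym_sets`, and the assumption that both ends of a hyponym edge are standard terms are all introduced here for the example. The triples are a subset of the Figure 3 example.

```python
from collections import defaultdict

# Hypothetical relation labels; the source only names the two relation types.
SYN, HYPO = "synonym", "hyponym"

# (h, r, t): h is a synonym / hyponym of t. Subset of the Figure 3 example.
triples = [
    ("埃索美拉唑锶", HYPO, "埃索美拉唑"),
    ("埃索美拉唑钠", HYPO, "埃索美拉唑"),
    ("拉他头孢", SYN, "拉氧头孢"),
    ("拉氧头孢钠", HYPO, "拉氧头孢"),
    ("钆喷酸二葡甲胺", HYPO, "钆喷酸"),
    ("钆喷酸双葡甲胺", SYN, "钆喷酸二葡甲胺"),
    ("钆喷酸葡胺", SYN, "钆喷酸二葡甲胺"),
]

def build_synonym_sets(triples):
    """Return {standard term: set holding the term and its synonyms}."""
    standard = set()
    synonyms = defaultdict(set)
    for h, r, t in triples:
        if r == SYN:
            synonyms[t].add(h)        # t is the standard term, h its synonym
            standard.add(t)
        else:
            standard.update((h, t))   # assumed: both ends of a hyponym edge are standard terms
    # A standard term without synonyms yields a singleton set.
    return {s: {s} | synonyms[s] for s in standard}

synsets = build_synonym_sets(triples)
```

Each standard term gets exactly one set, so the number of synonym sets equals the number of standard terms, as the description requires.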

Step S2: The original drug terminology database has incomplete synonym relations because of non-standard translation, imprecise manual annotation, delayed data updates and similar problems. A synonym mining method is therefore used to complete the synonym relations, yielding a drug terminology database updated by synonym mining. This step comprises the following sub-steps:

Step S21: Chinese abstracts and full texts of drug-related Chinese literature are obtained from literature retrieval platforms such as CNKI and Wanfang to form the corpus for synonym mining. A named entity recognition method is applied to the corpus to obtain the drug terms used for synonym mining, denoted as the drug term list $T = [t_1, t_2, \dots, t_{|T|}]$, where $t_i$ is the $i$-th drug term and $|T|$ is the number of drug terms in $T$. The named entity recognition method used in this embodiment is a conditional random field model;

All synonym sets in the drug terminology database are denoted $\mathcal{S} = \{S_1, S_2, \dots, S_{|\mathcal{S}|}\}$, where $S_j$ is the synonym set of the $j$-th standard drug term and $|\mathcal{S}|$ is the number of synonym sets in $\mathcal{S}$. $|\mathcal{S}|$ equals the number of standard drug terms in the database. If a standard drug term has no synonyms, its synonym set contains a single element, the standard drug term itself.

Step S22: A synonym set classifier is trained, and a classification prediction is obtained for each drug term in the drug term list $T$ against the synonym sets in the drug terminology database. This step comprises the following sub-steps:

Step S221: The synonym set classifier is denoted $f(S, t)$, where $S$ is a synonym set and $t$ is a drug term to be assigned to a synonym set;

Step S222: The probability that the drug term $t$ belongs to the synonym set $S$ is predicted from the change in the set unity score: $\Pr(t \in S) = \sigma\big(q(S \cup \{t\}) - q(S)\big)$, where $\Pr$ denotes probability, $\sigma$ is the sigmoid activation function, and $q(\cdot)$ is the set unity score function;

Step S223: Consider a data set $\{(X_i, y_i)\}_{i=1}^{N}$, where $X_i = (S_i, t_i)$ consists of a synonym set $S_i$ and a drug term $t_i$ to be assigned to $S_i$, and $y_i \in \{0, 1\}$ is the label: $y_i = 1$ means $t_i \in S_i$ and $y_i = 0$ means $t_i \notin S_i$. In this embodiment the synonym set classifier is a fully connected neural network, and its loss function is the log loss:

$\mathcal{L} = -\sum_{i=1}^{N} \big[\, y_i \log \Pr(t_i \in S_i) + (1 - y_i) \log\big(1 - \Pr(t_i \in S_i)\big) \big]$.

In one embodiment, the set unity score $q(S)$ is estimated as follows:

First, for each term $t$ in the term set $S$, a text embedding method is used to compute its embedding $x_t$, which serves as the initialization of the embedding layer; the text embedding method used in this embodiment is Word2Vec. The embedding is then fed into a fully connected neural network to obtain a new term representation $\phi(x_t)$. The new representations of all terms in $S$ are summed and averaged, which guarantees permutation invariance, giving the initial set representation $v(S) = \frac{1}{|S|} \sum_{t \in S} \phi(x_t)$. Finally, $v(S)$ is fed into a fully connected neural network to obtain the unity score $q(S)$ of the term set, which measures how similar the terms in $S$ are to one another.
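The scoring scheme above can be illustrated with a small, dependency-free sketch. Everything concrete here is an assumption made for the example: the random vectors stand in for Word2Vec embeddings, the two networks are single linear layers (with ReLU on the first), and the names `set_score`, `prob_in_set` and `log_loss` are hypothetical. It shows mean pooling giving an order-independent set score and the probability computed from the score difference.

```python
import math
import random

def linear(x, W, b):
    """y = W x + b for a list vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def set_score(S, emb, params):
    """q(S): transform each term, mean-pool, then score the pooled vector."""
    W1, b1, W2, b2 = params
    transformed = [[max(0.0, a) for a in linear(emb[t], W1, b1)] for t in S]
    pooled = [sum(col) / len(transformed) for col in zip(*transformed)]
    return linear(pooled, W2, b2)[0]

def prob_in_set(S, t, emb, params):
    """Pr(t in S) = sigmoid(q(S + {t}) - q(S))."""
    delta = set_score(list(S) + [t], emb, params) - set_score(list(S), emb, params)
    return 1.0 / (1.0 + math.exp(-delta))

def log_loss(p, y, eps=1e-12):
    """Log loss for one sample with label y in {0, 1}."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

rng = random.Random(0)
dim, hidden = 4, 3
# Random stand-ins for Word2Vec embeddings of three placeholder terms.
emb = {t: [rng.uniform(-1, 1) for _ in range(dim)] for t in ["a", "b", "c"]}
params = (
    [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden)], [0.0] * hidden,
    [[rng.uniform(-1, 1) for _ in range(hidden)]], [0.0],
)
p = prob_in_set(["a", "b"], "c", emb, params)
```

Because the pooled vector is a mean over the set, reordering the terms of `S` leaves `set_score` unchanged up to floating-point rounding.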

In one embodiment, the training set is generated as follows:

First, the drug terms in the drug terminology database that are related by $r_{\text{syn}}$ are converted into synonym sets, and all synonym sets in the database are denoted $\mathcal{S}$. Each synonym set is written $ES = \{e_1, e_2, \dots, e_m\}$, where each $e_k$ is a drug term in that synonym set. One drug term $e$ is drawn at random from $ES$, and the remaining terms form the set $ES \setminus \{e\}$, giving a positive sample $(ES \setminus \{e\}, e)$ for model training with label $y = 1$. Each positive sample is matched with $K$ negative samples $(ES \setminus \{e\}, e^-_k)$ with label $y = 0$; $K = 5$ in this embodiment. Each $e^-_k$ is drawn from the drug terminology database after excluding the drug terms in $ES$. Specifically, the negatives are a mix, in a set ratio, of samples drawn in two ways: (1) fully random sampling; (2) random sampling restricted to drug terms that share characters with the drug terms in $ES \setminus \{e\}$. In this embodiment the ratio is set to 2:3.
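The sampling procedure can be sketched as below. Several details are assumptions for the sake of a runnable example: the function name `make_samples`, skipping single-element synonym sets (no term would remain in the set), reading the 2:3 ratio as random to character-overlap negatives, and falling back to fewer negatives when the pool is too small.

```python
import random

def make_samples(synsets, all_terms, k=5, ratio=(2, 3), seed=0):
    """Build (set, term, label) samples: one positive and up to k negatives per set."""
    rng = random.Random(seed)
    n_random = round(k * ratio[0] / sum(ratio))     # share of fully random negatives
    samples = []
    for es in synsets:
        if len(es) < 2:
            continue                                 # assumed: singleton sets are skipped
        t = rng.choice(sorted(es))
        rest = frozenset(es) - {t}
        samples.append((rest, t, 1))                 # positive sample
        pool = sorted(set(all_terms) - set(es))      # exclude every term of ES
        chars = set("".join(rest))
        hard = [c for c in pool if chars & set(c)]   # terms sharing a character with the set
        negs = rng.sample(pool, min(n_random, len(pool)))
        remaining = [c for c in hard if c not in negs]
        negs += rng.sample(remaining, min(k - len(negs), len(remaining)))
        samples.extend((rest, n, 0) for n in negs)
    return samples

synsets = [{"钆喷酸二葡甲胺", "钆喷酸双葡甲胺", "钆喷酸葡胺"}]
all_terms = ["钆喷酸二葡甲胺", "钆喷酸双葡甲胺", "钆喷酸葡胺", "钆喷酸",
             "埃索美拉唑", "埃索美拉唑钠", "埃索美拉唑镁", "拉氧头孢", "拉氧头孢钠", "拉他头孢"]
samples = make_samples(synsets, all_terms)
```

Character-overlap negatives are harder than random ones because they look like plausible synonyms, which is presumably why the description mixes both kinds.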

Step S224: The model is trained on the training set, and the trained model is used to predict the probability that each drug term in the drug term list $T$ belongs to each synonym set in the drug terminology database;

Specifically, before synonym mining, all synonym sets in the drug terminology database are denoted $\mathcal{S}$. For each drug term $t_i$ in $T$, the probability that it belongs to each synonym set $S_j$ in $\mathcal{S}$ is computed, denoted $\Pr(t_i \in S_j)$, and the set with the highest probability is taken as the candidate class of $t_i$. A probability threshold $\theta$ is set. If the highest probability exceeds $\theta$, the corresponding synonym set is updated with $t_i$; if it is less than or equal to $\theta$, $t_i$ is put back into $T$. The next round then starts, and rounds continue until every drug term remaining in $T$ has a probability of at most $\theta$ of belonging to any synonym set. The result is the full collection of synonym sets updated by synonym mining, denoted $\mathcal{S}'$. In this embodiment $\theta$ is set to a fixed value (the figure is missing from this text).
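The round-based assignment can be sketched as follows. The function `mine_synonyms`, the lookup-table classifier `toy_prob` and the threshold 0.5 are illustrative assumptions only; in the described method the probabilities come from the trained synonym set classifier.

```python
def mine_synonyms(terms, synsets, prob, theta):
    """Repeat rounds; add a term to its best set when the probability exceeds theta."""
    synsets = [set(s) for s in synsets]
    pending = list(terms)
    changed = True
    while pending and changed:
        changed = False
        still_pending = []
        for t in pending:
            scores = [prob(s, t) for s in synsets]
            best = max(range(len(synsets)), key=scores.__getitem__)
            if scores[best] > theta:
                synsets[best].add(t)
                changed = True
            else:
                still_pending.append(t)    # put back for the next round
        pending = still_pending
    return synsets, pending

# Stand-in for the trained classifier: fixed scores for two known pairs.
table = {("埃索美拉唑", "艾司奥美拉唑"): 0.9, ("埃索美拉唑钠", "艾司奥美拉唑钠"): 0.8}

def toy_prob(s, t):
    return max(table.get((m, t), 0.1) for m in s)

sets, left = mine_synonyms(
    ["艾司奥美拉唑", "艾司奥美拉唑钠", "阿司匹林"],
    [{"埃索美拉唑"}, {"埃索美拉唑钠"}],
    toy_prob,
    theta=0.5,
)
```

The loop stops once a full round adds nothing, which is the same condition as every remaining term scoring at or below the threshold for all sets.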

Step S23: The drug terminology database is updated according to the synonym sets updated by synonym mining, yielding the drug terminology database updated by synonym mining;

Specifically, if the synonym set of a standard drug term that is a hypernym has been updated, the new synonym is linked to that standard drug term by a synonym relation, and in addition a hyponym relation is established between the new synonym and every non-hypernym standard drug term associated with that hypernym, recorded as a hyponym relation triple. If the synonym set of a non-hypernym standard drug term has been updated, the new synonym is linked to that standard drug term by a synonym relation, recorded as a synonym relation triple.

In the example of Figure 4, the synonym sets updated by synonym mining are [{埃索美拉唑, 艾司奥美拉唑}, {埃索美拉唑钠, 艾司奥美拉唑钠}, …], where 艾司奥美拉唑 is an alternative Chinese name for esomeprazole. The synonym 艾司奥美拉唑 is linked by a synonym relation to the hypernym standard drug term 埃索美拉唑, and hyponym relations are established between 艾司奥美拉唑 and the standard drug terms associated with 埃索美拉唑: 埃索美拉唑锶, 埃索美拉唑锶水合物, 埃索美拉唑钠, 埃索美拉唑镁, 埃索美拉唑镁二水合物 and 埃索美拉唑镁三水合物. 埃索美拉唑钠 and 艾司奥美拉唑钠 are linked by a synonym relation. The result is the drug terminology database updated by synonym mining.
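A minimal sketch of this update rule, using the Figure 4 example, is given below. The triple layout `(h, relation, t)`, the string labels and the function name `update_term_base` are assumptions for illustration; the hyponym triples place the existing hyponym first and the new synonym second, following the "h is a hyponym of t" convention.

```python
def update_term_base(triples, old_sets, new_sets):
    """Add synonym edges for mined synonyms and propagate hyponym edges to them."""
    triples = set(triples)
    hyponyms = {}
    for h, r, t in triples:
        if r == "hyponym":
            hyponyms.setdefault(t, set()).add(h)
    for std, new in new_sets.items():
        for syn in new - old_sets.get(std, {std}):
            triples.add((syn, "synonym", std))
            # If std is a hypernym, its hyponyms also become hyponyms of the new synonym.
            for h in hyponyms.get(std, ()):
                triples.add((h, "hyponym", syn))
    return triples

base = [("埃索美拉唑钠", "hyponym", "埃索美拉唑"), ("埃索美拉唑镁", "hyponym", "埃索美拉唑")]
old_sets = {"埃索美拉唑": {"埃索美拉唑"}, "埃索美拉唑钠": {"埃索美拉唑钠"}}
new_sets = {"埃索美拉唑": {"埃索美拉唑", "艾司奥美拉唑"},
            "埃索美拉唑钠": {"埃索美拉唑钠", "艾司奥美拉唑钠"}}
updated = update_term_base(base, old_sets, new_sets)
```

A non-hypernym term such as 埃索美拉唑钠 has no entry in `hyponyms`, so only the synonym edge is added for it.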

Step S3: Using the drug terminology database updated by synonym mining and the external drug terms in real-world electronic medical record data, an association prediction model based on semantic embedding and structural embedding is trained. This step comprises the following sub-steps:

Step S31: A pre-trained language model is used to obtain the semantic embedding of each pair of an external drug term in the electronic medical records and a standard drug term in the updated drug terminology database. The pre-trained language model in this embodiment is BERT;

Specifically, the set of external drug terms in the real-world electronic medical records is denoted $G = \{g_1, g_2, \dots\}$, and the set of standard drug terms in the updated drug terminology database is denoted $E = \{e_1, e_2, \dots\}$. For any external drug term $g_i$ in $G$ and any standard drug term $e_j$ in $E$, the pinyin character sequences of the two terms are formed, and the characters and pinyin of both terms are combined with the start token [CLS] and the separator token [SEP] to form the character sequence of the associated drug term pair, denoted $x_{ij}$. With $g_i$ = "艾司奥美拉唑钠" and $e_j$ = "埃索美拉唑钠" as an example, the sequence is {[CLS][艾][司][奥][美][拉][唑][钠][ai][si][ao][mei][la][zuo][na][SEP][埃][索][美][拉][唑][钠][ai][suo][mei][la][zuo][na][SEP]}. This embodiment uses a BERT model pre-trained on a Chinese corpus. The multiple bidirectional Transformer encoder layers produce the semantic embedding of the sequence, and the semantic embedding of the start token [CLS] is used to represent the association between $g_i$ and $e_j$.
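The construction of the pair sequence can be sketched as follows. The `PINYIN` table is a hypothetical lookup that covers only the characters of this example (a real system would use a pinyin dictionary or conversion library), and the function names are introduced here for illustration.

```python
# Hypothetical per-character pinyin table covering only this example.
PINYIN = {"艾": "ai", "司": "si", "奥": "ao", "美": "mei", "拉": "la",
          "唑": "zuo", "钠": "na", "埃": "ai", "索": "suo"}

def term_tokens(term):
    """Chinese character tokens followed by their pinyin tokens."""
    chars = list(term)
    return chars + [PINYIN[c] for c in chars]

def pair_sequence(external, standard):
    """[CLS] external-term tokens [SEP] standard-term tokens [SEP]."""
    return ["[CLS]"] + term_tokens(external) + ["[SEP]"] + term_tokens(standard) + ["[SEP]"]

seq = pair_sequence("艾司奥美拉唑钠", "埃索美拉唑钠")
```

The two terms differ in their first characters (艾 versus 埃) but share the pinyin "ai", which is the kind of phonetic overlap the pinyin tokens are meant to expose.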

In this embodiment the BERT model is fine-tuned as follows. The semantic embedding of the start token, denoted $h^{\text{sem}}$, is the independent variable. The dependent variable is the label $y$ indicating whether the external drug term $g_i$ from the real-world electronic medical records and the standard drug term $e_j$ from the updated drug terminology database are semantically associated: $y = 1$ if the association exists and $y = 0$ otherwise. A nonlinear activation function $\sigma$ is applied to obtain the prediction $\hat{y}$ from the BERT semantic embedding; this embodiment uses the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$. The loss function is the binary cross-entropy $\mathcal{L} = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \big]$. Positive training samples are obtained by manual annotation and negative training samples by random sampling. The resulting training set is used to train the BERT model and thereby refine the BERT semantic embeddings.

Step S32: A graph convolutional neural network is used to obtain the structural embedding of each pair of an external drug term in the electronic medical records and a standard drug term in the updated drug terminology database;

Specifically, candidate associations are first established between external drug terms in the real-world electronic medical records and drug terms in the updated drug terminology database. The TF-IDF value of each word in each drug term is computed, giving a vector representation for every external drug term and every drug term in the database. The cosine similarity between them is computed and a similarity threshold is set. If the similarity exceeds the threshold, a candidate association is established between the external drug term and the corresponding drug term in the database and recorded as a candidate association edge. In the example of Figure 4, 艾司奥美拉唑 has a candidate association with 艾司奥美拉唑镁, and 埃索美拉唑镁 has a candidate association with 艾司奥美拉唑镁.
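The candidate-linking step can be sketched with the standard library alone. Treating each Chinese character as a "word", the smoothed IDF formula, the threshold 0.3 and the function names are all assumptions made for this example; the source does not state its tokenization or threshold.

```python
import math
from collections import Counter

def tfidf_vectors(terms):
    """Character-level TF-IDF vectors; each term is treated as one document."""
    docs = {t: Counter(t) for t in terms}
    n = len(docs)
    df = Counter(c for counts in docs.values() for c in counts)
    idf = {c: math.log((1 + n) / (1 + df[c])) + 1 for c in df}   # smoothed IDF
    return {t: {c: (k / len(t)) * idf[c] for c, k in counts.items()}
            for t, counts in docs.items()}

def cosine(u, v):
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def candidate_links(external, base_terms, threshold):
    """Return (external term, base term, similarity) for pairs above the threshold."""
    vecs = tfidf_vectors(list(external) + list(base_terms))
    links = []
    for g in external:
        for e in base_terms:
            s = cosine(vecs[g], vecs[e])
            if s > threshold:
                links.append((g, e, s))
    return links

links = candidate_links(["艾司奥美拉唑镁"], ["艾司奥美拉唑", "埃索美拉唑镁", "拉氧头孢"], 0.3)
```

In this toy run the external term 艾司奥美拉唑镁 links to 艾司奥美拉唑 and 埃索美拉唑镁, matching the Figure 4 example, while 拉氧头孢 shares only one common character and falls below the threshold.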

Each external drug term in the electronic medical records is converted into a token sequence, and each drug term in the updated drug terminology database is converted into a token sequence in the same way. The BERT model trained in step S31 computes the semantic embedding of each sequence, and the semantic embedding of the start token [CLS] is used as the initial node embedding of the corresponding drug term;

The initial node embeddings are fed into the graph convolutional neural network. Specifically, the network has $L$ layers, with $L = 10$ in this embodiment. The input of the $l$-th layer has two parts. The first part is the $n \times d_l$ node embedding matrix $H^{(l)}$, where $n$ is the number of nodes, that is, the total number of external drug terms in the electronic medical records and drug terms in the updated drug terminology database, and $d_l$ is the node embedding dimension of the $l$-th layer. The second part is the $n \times n$ adjacency matrix $A$. The output of the $l$-th layer is the node embedding matrix of the $(l+1)$-th layer, obtained through the normalized graph Laplacian transform:

$H^{(l+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\big)$

where $\sigma$ is a nonlinear activation function, for which the sigmoid function can be used; $\tilde{A} = A + I$, with $I$ the identity matrix; $\tilde{D}$ is a diagonal matrix whose diagonal elements are $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$; and $W^{(l)}$ is the weight matrix of the $l$-th layer;

The entries of the adjacency matrix $A$ are set as follows. If there is an edge from drug term $e_i$ to drug term $e_j$ in the updated drug terminology database, then $A_{ij} = 1$, otherwise $A_{ij} = 0$. If there is an edge from an external drug term $g_i$ in the electronic medical records to a drug term $e_j$ in the updated drug terminology database, then $A_{ij}$ is the similarity of the two terms in the candidate association above;

The output of the $L$-th layer gives the node embeddings of the drug terms in the updated drug terminology database and of the external drug terms in the electronic medical records. From it, the node embedding of each standard drug term $e$ in the updated database and of each external drug term $g$ in the electronic medical records is taken, and the product of the two node embeddings serves as the structural embedding representing the association between $e$ and $g$, denoted $h^{\text{str}}$;
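A single propagation layer of the kind described above can be sketched without external libraries. The three-node graph, the weight values and the similarity 0.47 are made up for illustration, the adjacency matrix is used as given (directed, with no symmetrization), and the function names are hypothetical.

```python
import math

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def normalized_adjacency(A):
    """D^-1/2 (A + I) D^-1/2, with D the diagonal of row sums of A + I."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]

def gcn_layer(A_hat, H, W):
    """H_next = sigmoid(A_hat @ H @ W)."""
    Z = matmul(matmul(A_hat, H), W)
    return [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in Z]

# Nodes 0 and 1: terms in the terminology database; node 2: an external term.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.47, 0.0]]          # 0.47: illustrative candidate-association similarity
H0 = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]   # stand-ins for [CLS] embeddings
W0 = [[0.5, -0.2], [0.1, 0.4]]
A_hat = normalized_adjacency(A)
H1 = gcn_layer(A_hat, H0, W0)
```

Adding the identity before normalizing keeps every row sum positive, so the square roots are always defined even for a node with no outgoing edges.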

The graph convolutional neural network is optimized with a margin-based distance loss function:

$\mathcal{L} = \sum_{(e, g) \in \Omega^{+}} \sum_{(e', g') \in \Omega^{-}} \max\big(0,\; d(e, g) + \gamma - d(e', g')\big)$

where $d(e, g)$ is the distance function between the structural embeddings of a standard drug term $e$ in the updated drug terminology database and an external drug term $g$ in the electronic medical records, $\gamma$ is a hyperparameter giving the margin that separates positive from negative samples, and $\Omega^{+}$ and $\Omega^{-}$ are the positive and negative sample sets respectively. The distance function used in this embodiment is the Euclidean distance, $d(e, g) = \lVert h_e - h_g \rVert_2$. In this embodiment $\gamma$ is set to a fixed value (the figure is missing from this text).
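As an illustration of a margin-based distance loss, the sketch below uses a common hinge-style form: each positive pair should be closer than each negative pair by at least the margin. The pairing of every positive with every negative, the function names and the toy vectors are assumptions for this example rather than details confirmed by the source.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def margin_loss(pos_pairs, neg_pairs, gamma):
    """Sum of max(0, d(pos) + gamma - d(neg)) over all positive/negative pairings."""
    return sum(
        max(0.0, euclidean(*p) + gamma - euclidean(*n))
        for p in pos_pairs
        for n in neg_pairs
    )

pos = [([0.0, 0.0], [0.0, 1.0])]   # distance 1: embeddings of an associated pair
neg = [([0.0, 0.0], [3.0, 4.0])]   # distance 5: embeddings of an unassociated pair
loss_small_margin = margin_loss(pos, neg, gamma=1.0)
loss_large_margin = margin_loss(pos, neg, gamma=5.0)
```

With a margin of 1 the negative pair is already far enough away and contributes nothing; with a margin of 5 the gap of 4 falls short by 1, so the loss is 1.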

Step S33: The semantic embedding $h^{\text{sem}}$ output by step S31 and the structural embedding $h^{\text{str}}$ output by step S32 are concatenated, written $[h^{\text{sem}}; h^{\text{str}}]$, and used as the input of a multilayer perceptron. The multilayer perceptron consists of several fully connected hidden layers and a single-node output layer. Its output is converted into a scalar by a nonlinear activation function $\sigma$, giving as the final output the association probability $\hat{y}$ between each external drug term $g$ in the electronic medical records and each standard drug term $e$ in the updated drug terminology database. This embodiment uses the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$. The loss function is the same binary cross-entropy used for the BERT model, $\mathcal{L} = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \big]$, where $y$ is the label indicating whether $g$ and $e$ are associated: $y = 1$ if the association exists and $y = 0$ otherwise.
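The fusion step can be sketched as a tiny forward pass. The layer sizes, the hand-picked weights and the ReLU hidden activation are assumptions for illustration (the source does not name the hidden activation), and `mlp_predict` and `bce` are hypothetical names.

```python
import math

def mlp_predict(h_sem, h_str, layers):
    """Concatenate both embeddings, apply ReLU hidden layers, then a sigmoid output node."""
    x = list(h_sem) + list(h_str)
    for i, (W, b) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]      # assumed hidden activation
    return 1.0 / (1.0 + math.exp(-x[0]))

def bce(y, p, eps=1e-12):
    """Binary cross-entropy for one sample."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

h_sem = [0.2, -0.1]                            # stand-in semantic embedding
h_str = [0.5]                                  # stand-in structural embedding
layers = [
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0]),   # hidden layer: 3 inputs, 2 units
    ([[1.0, -1.0]], [0.0]),                              # single-node output layer
]
p = mlp_predict(h_sem, h_str, layers)
```

Here the hidden layer produces [0.2, 0.4], the output node gives -0.2, and the sigmoid maps that to a probability slightly below 0.5, so the loss is lower for label 0 than for label 1.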

Step S4: The association prediction model is used to predict the association between external drug terms in the electronic medical records and standard drug terms in the drug terminology database, thereby establishing the association between external drug terms in real-world electronic medical records and standard drug terms in the database.

As shown in Figure 5, the present invention also provides an embodiment of a system for standardized association of drug terms in electronic medical records that implements the above method. The system comprises:

An electronic medical record drug term input module, used to obtain all external drug terms in the electronic medical records that are to be standardized;

A drug terminology database synonym mining and update module, used to construct a corpus and obtain from it the drug term list for synonym mining; to obtain the synonym set of each standard drug term in the drug terminology database; to train a synonym set classifier and obtain the classification prediction of each drug term in the drug term list against the synonym sets in the database; and to obtain, according to a preset probability threshold, all synonym sets updated by synonym mining and update the drug terminology database accordingly;

A candidate association module, used to establish candidate associations, based on similarity computation, between external drug terms in the electronic medical records and drug terms in the updated drug terminology database;

A semantic embedding module, used to combine the external drug terms in the electronic medical records and their pinyin character sequences, and the standard drug terms in the updated drug terminology database and their pinyin character sequences, with the start token and separator token into the character sequence of an associated drug term pair, which is fed into the pre-trained language model to obtain the semantic embedding;

A structural embedding module, used to take the semantic embeddings of the external drug terms in the electronic medical records and of the drug terms in the updated drug terminology database as the initial node embeddings of the corresponding drug terms, feed them into the graph convolutional neural network to obtain the node embeddings of those terms, and take the product of the node embeddings of the external drug term and the standard drug term as the structural embedding;

An association prediction module, used to train the association prediction model based on semantic embedding and structural embedding from the updated drug terminology database and the external drug terms in the electronic medical records, and to use that model to predict the association between external drug terms in the electronic medical records and standard drug terms in the drug terminology database.

Corresponding to the foregoing embodiment of the method for standardized association of drug terms in electronic medical records, the present invention also provides an embodiment of a device for standardized association of drug terms in electronic medical records. The device comprises a memory and one or more processors. The memory stores executable code, and when the processor executes the executable code, it implements the method for standardized association of drug terms in electronic medical records of the above embodiment.

An embodiment of the present invention also provides a computer-readable storage medium storing a program which, when executed by a processor, implements the method for standardized association of drug terms in electronic medical records of the above embodiment.

The computer-readable storage medium may be an internal storage unit, such as a hard disk or memory, of any device with data processing capability described in any of the foregoing embodiments. It may also be an external storage device of such a device, for example a plug-in hard disk, a Smart Media Card (SMC), an SD card or a flash card fitted to the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of any device with data processing capability. The computer-readable storage medium is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been output or is to be output.

The above are only preferred embodiments of one or more embodiments of this specification and are not intended to limit them. Any modification, equivalent replacement or improvement made within the spirit and principles of one or more embodiments of this specification shall fall within their scope of protection.

Claims (2)

1. A method for standardized association of drug terms in electronic medical records, characterized by comprising the following steps:
S1: inputting a drug term database and obtaining a synonym set for each standard drug term; the drug term database comprises a set of drug terms and a set of relationships between drug terms; the set of drug terms comprises standard drug terms, each being either a hypernym or a non-hypernym, and synonyms of the standard drug terms; the set of relationships comprises hyponym relationships and synonym relationships;
S2: obtaining a drug term database updated on the basis of synonym mining, comprising:
constructing a corpus for synonym mining and obtaining a list of drug terms from the corpus;
training a synonym set classifier to obtain classification predictions between each drug term in the drug term list and the synonym sets in the drug term database, and obtaining all synonym sets updated by synonym mining according to a preset probability threshold; during training of the synonym set classifier, the probability that a drug term to be classified belongs to a synonym set is predicted from the change in the set unity score, the set unity score being computed as follows: computing an embedding representation of each term in the set; inputting the embedding representations into a fully connected neural network model to obtain new term representations; summing all new term representations and taking the mean to obtain an initial term set representation; and inputting the initial term set representation into a fully connected neural network model to obtain the set unity score;
the training set of the synonym set classifier is generated as follows: randomly drawing one drug term from a synonym set and combining it with the set formed by the remaining drug terms of that synonym set to generate a positive training sample; for each positive training sample, matching multiple negative training samples, the negative training samples being drawn from the drug term database after excluding the drug terms in that synonym set;
updating the drug term database according to all synonym sets updated by synonym mining, specifically: if the synonym set of a standard drug term that is a hypernym has been updated, establishing a synonym association between each corresponding synonym and that hypernym standard drug term, and also establishing a hyponym association between each corresponding synonym and all standard drug terms associated with that hypernym standard drug term; if the synonym set of a non-hypernym standard drug term has been updated, establishing a synonym association between each corresponding synonym and that non-hypernym standard drug term;
S3: training an association prediction model based on semantic embedding and structural embedding, according to the updated drug term database and the external drug terms in the electronic medical records, comprising:
obtaining, through a pre-trained language model, a semantic embedding representation of each pair consisting of an external drug term in the electronic medical records and a standard drug term in the updated drug term database, specifically: combining the external drug term and its pinyin character sequence, and the standard drug term and its pinyin character sequence, with a start character and separator characters to form a character sequence for the associated drug term pair, and inputting this sequence into the pre-trained language model to obtain the semantic embedding representation;
fine-tuning the pre-trained language model, specifically: taking the semantic embedding representation of the start character as the independent variable, the dependent variable being a label indicating whether the external drug term in the electronic medical records and the drug term in the updated drug term database are semantically associated; using a nonlinear activation function to obtain a prediction based on the semantic embedding representation; obtaining positive training samples by manual annotation and negative training samples by random sampling to form a training set; and training the pre-trained language model to optimize the semantic embedding representation;
obtaining, through a graph convolutional neural network model, a structural embedding representation of each pair consisting of an external drug term in the electronic medical records and a standard drug term in the updated drug term database, specifically: establishing candidate associations between the external drug terms and the drug terms in the updated drug term database based on similarity computation, comprising: computing the TF-IDF value of each word in the drug terms, obtaining a vector representation of the external drug term in the electronic medical records and of each drug term in the updated drug term database, and computing the similarity between two drug terms; if the similarity is greater than a preset similarity threshold, establishing a candidate association between the external drug term in the electronic medical records and the corresponding drug term in the drug term database; taking the semantic embedding representations of the external drug term and of the drug terms in the updated drug term database as the initial node embedding representations of the respective drug terms, inputting them into the graph convolutional neural network model to obtain the node embedding representations of the respective drug terms, and taking the product of the node embedding representations of the external drug term and the standard drug term as the structural embedding representation;
the input of each layer of the graph convolutional neural network model comprises two parts, the first being a node embedding representation matrix and the second being an adjacency matrix; the output of each layer serves as the node embedding representation matrix of the next layer and is obtained through a normalized graph Laplacian transform; the graph convolutional neural network model is optimized with a margin-based distance loss function; the values of the adjacency matrix are specifically as follows: if there is an edge from one drug term in the updated drug term database to another drug term, the corresponding value is 1, otherwise the value is 0; if there is an edge from an external drug term in the electronic medical records to a drug term in the updated drug term database, the corresponding value is the similarity value of the candidate association;
concatenating the semantic embedding representation and the structural embedding representation and inputting the concatenated representation into a multilayer perceptron, the multilayer perceptron comprising multiple fully connected hidden layers and a single-node output layer; converting the output of the multilayer perceptron into a scalar through a nonlinear activation function to obtain the association probability between each external drug term in the electronic medical records and each standard drug term in the updated drug term database;
S4: using the association prediction model to predict the association results between the external drug terms in the electronic medical records and the standard drug terms in the drug term database.
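The set unity score in step S2 of claim 1 can be illustrated with a minimal sketch. The claim fixes only the pipeline (embed each term, pass it through a fully connected network, average, pass the average through a fully connected network to get a score) and states that membership probability is predicted from the change in that score. Everything else here is an assumption made for illustration: the tanh activation, the random untrained weights, the toy 4-dimensional embeddings, the use of a sigmoid over the score difference, and all function names (`dense`, `init_layer`, `unity_score`, `membership_probability`).

```python
import math
import random

def dense(weights, bias, x):
    """Fully connected layer with an assumed tanh activation: tanh(W x + b)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def init_layer(rng, n_out, n_in):
    """Random, untrained parameters; a real classifier would learn these."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias

def unity_score(term_set, embed, transform, scorer):
    """Embed each term, transform it, average the results, then score the average."""
    transformed = [dense(*transform, embed[t]) for t in term_set]
    dim = len(transformed[0])
    set_repr = [sum(vec[i] for vec in transformed) / len(transformed)
                for i in range(dim)]
    return dense(*scorer, set_repr)[0]

def membership_probability(term_set, candidate, embed, transform, scorer):
    """Assumed mapping: sigmoid of the score change when the candidate joins the set."""
    before = unity_score(term_set, embed, transform, scorer)
    after = unity_score(list(term_set) + [candidate], embed, transform, scorer)
    return 1.0 / (1.0 + math.exp(-(after - before)))

rng = random.Random(0)
terms = ["aspirin", "acetylsalicylic acid", "ASA", "ibuprofen"]
embed = {t: [rng.uniform(-1, 1) for _ in range(4)] for t in terms}  # toy embeddings
transform = init_layer(rng, 3, 4)   # per-term fully connected layer
scorer = init_layer(rng, 1, 3)      # set-level fully connected layer

prob = membership_probability(["aspirin", "acetylsalicylic acid"], "ASA",
                              embed, transform, scorer)
print(f"P('ASA' joins the synonym set) = {prob:.3f}")
```

Because the term representations are averaged before scoring, the score does not depend on the order in which the terms of a set are listed, which is what makes it usable as a set-level quantity. A term whose probability exceeds the preset threshold would be added to the synonym set.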
2. A system for standardized association of drug terms in electronic medical records, implemented based on the method of claim 1, characterized by comprising:
an electronic medical record drug term input module, configured to obtain all external drug terms in the electronic medical records that are to undergo drug term standardization;
a drug term database synonym mining and update module, configured to construct a corpus and obtain from the corpus a list of drug terms for synonym mining; obtain the synonym set of each standard drug term in the drug term database; train a synonym set classifier to obtain classification predictions between each drug term in the drug term list and the synonym sets in the drug term database; obtain all synonym sets updated by synonym mining according to a preset probability threshold; and update the drug term database;
a candidate association establishment module, configured to establish candidate associations between the external drug terms in the electronic medical records and the drug terms in the updated drug term database based on similarity computation;
a semantic embedding representation module, configured to combine the external drug terms in the electronic medical records and their pinyin character sequences, and the standard drug terms in the updated drug term database and their pinyin character sequences, with a start character and separator characters to form character sequences for associated drug term pairs, and to input these sequences into a pre-trained language model to obtain semantic embedding representations;
a structural embedding representation module, configured to take the semantic embedding representations of the external drug terms in the electronic medical records and of the drug terms in the updated drug term database as the initial node embedding representations of the respective drug terms, input them into a graph convolutional neural network model to obtain the node embedding representations of the respective drug terms, and take the product of the node embedding representations of the external drug term and the standard drug term as the structural embedding representation;
an association prediction module, configured to train an association prediction model based on semantic embedding and structural embedding according to the updated drug term database and the external drug terms in the electronic medical records, and to use the association prediction model to predict the association results between the external drug terms in the electronic medical records and the standard drug terms in the drug term database.
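The candidate association step used by the candidate association establishment module (TF-IDF vectors, a pairwise similarity, and a preset threshold) can be sketched as follows. The claims do not specify the tokenization, the TF-IDF variant, or the similarity measure, so this sketch assumes character-level tokens, a smoothed IDF, and cosine similarity. The function names, the example drug names, and the threshold of 0.3 are hypothetical.

```python
import math
from collections import Counter

def tokenize(term):
    """Assumption: treat each non-space character as one token."""
    return [ch for ch in term.lower() if not ch.isspace()]

def tfidf_vectors(terms):
    """Sparse TF-IDF vectors (token -> weight), using a smoothed IDF."""
    docs = [Counter(tokenize(t)) for t in terms]
    n = len(docs)
    df = Counter(tok for doc in docs for tok in doc)  # document frequency
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            tok: (cnt / total) * (math.log((1 + n) / (1 + df[tok])) + 1.0)
            for tok, cnt in doc.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def candidate_associations(external_terms, library_terms, threshold):
    """Keep (external, library) pairs whose similarity exceeds the threshold."""
    vectors = tfidf_vectors(external_terms + library_terms)
    ext_vecs = vectors[:len(external_terms)]
    lib_vecs = vectors[len(external_terms):]
    found = {}
    for ext, ev in zip(external_terms, ext_vecs):
        for lib, lv in zip(library_terms, lib_vecs):
            sim = cosine(ev, lv)
            if sim > threshold:
                found[(ext, lib)] = sim
    return found

# Hypothetical example: one external EMR term and two terms from the term database.
external = ["阿司匹林片"]
library = ["阿司匹林肠溶片", "布洛芬缓释胶囊"]
candidates = candidate_associations(external, library, threshold=0.3)
for (ext, lib), sim in candidates.items():
    print(ext, "->", lib, round(sim, 3))
```

The similarity values kept here are the same quantities that claim 1 uses as the adjacency matrix entries for edges from external drug terms to drug terms in the database, so the output of this step can feed the graph convolutional network directly.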
CN202310567874.4A 2023-05-19 2023-05-19 A method and system for standardized association of drug terms in electronic medical records Active CN116312915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310567874.4A CN116312915B (en) 2023-05-19 2023-05-19 A method and system for standardized association of drug terms in electronic medical records

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310567874.4A CN116312915B (en) 2023-05-19 2023-05-19 A method and system for standardized association of drug terms in electronic medical records

Publications (2)

Publication Number Publication Date
CN116312915A CN116312915A (en) 2023-06-23
CN116312915B true CN116312915B (en) 2023-09-19

Family

ID=86781981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310567874.4A Active CN116312915B (en) 2023-05-19 2023-05-19 A method and system for standardized association of drug terms in electronic medical records

Country Status (1)

Country Link
CN (1) CN116312915B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118210960B (en) * 2023-12-13 2024-10-18 西湖大学 Construction and use of natural medicinal materials domain knowledge base
CN118227776B (en) * 2024-05-23 2024-07-23 四川省肿瘤医院 Disease science popularization method and system based on artificial intelligence

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544383A (en) * 2013-10-10 2014-01-29 中国中医科学院 Standard-term-based fast EMR (electronic medical record) entry system
US9436760B1 (en) * 2016-02-05 2016-09-06 Quid, Inc. Measuring accuracy of semantic graphs with exogenous datasets
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
CN106897568A (en) * 2017-02-28 2017-06-27 北京大数医达科技有限公司 The treating method and apparatus of case history structuring
CN107665190A (en) * 2017-09-29 2018-02-06 李晓妮 A kind of method for automatically constructing and device of text proofreading mistake dictionary
CN111460175A (en) * 2020-04-08 2020-07-28 福州数据技术研究院有限公司 SNOMED-CT-based medical noun dictionary construction and expansion method
KR20200097949A (en) * 2019-02-11 2020-08-20 네이버 주식회사 Method and system for extracting synonym by using keyword relation structure
CN111986759A (en) * 2020-08-31 2020-11-24 平安医疗健康管理股份有限公司 Method and system for analyzing electronic medical record, computer equipment and readable storage medium
CN113657109A (en) * 2021-08-31 2021-11-16 平安医疗健康管理股份有限公司 Method, apparatus and computer device for standardization of model-based clinical terminology
CN114091425A (en) * 2021-11-25 2022-02-25 北京富通东方科技有限公司 Medical entity alignment method and device
CN114417809A (en) * 2021-12-27 2022-04-29 北京滴普科技有限公司 Entity Alignment Method Based on Combining Graph Structure Information and Text Semantic Model
WO2022088672A1 (en) * 2020-10-29 2022-05-05 平安科技(深圳)有限公司 Machine reading comprehension method and apparatus based on bert, and device and storage medium
CN114444501A (en) * 2022-01-24 2022-05-06 荃豆数字科技有限公司 Method and device for searching traditional Chinese medicine decoction pieces, electronic equipment and storage medium
CN115374792A (en) * 2022-09-14 2022-11-22 山东省计算中心(国家超级计算济南中心) Policy text labeling method and system combining pre-training and graph neural network
WO2023065858A1 (en) * 2021-10-19 2023-04-27 之江实验室 Medical term standardization system and method based on heterogeneous graph neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10380241B2 (en) * 2010-05-26 2019-08-13 Warren Daniel Child Modular system and method for managing chinese, japanese, and korean linguistic data in electronic form
US10157220B2 (en) * 2015-07-23 2018-12-18 International Business Machines Corporation Context sensitive query expansion
US12242975B2 (en) * 2020-10-01 2025-03-04 International Business Machines Corporation Querying knowledge graphs with sub-graph matching networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Text Preprocessing Approach for Efficacious Information Retrieval; Shweta Taneja et al.; 《Smart Innovations in Communication and Computational Sciences》; vol. 669; 13–22 *
基于本体概念相似度的网页排序算法研究; 张健; 冯飞; 刘宇; 马红烨; 情报学报 (No. 11); 56-65 *
基于语料库对比的英语母语者有标转折复句习得研究; 赵蒙月; 《中国优秀硕士学位论文全文数据库 哲学与人文科学辑》 (No. 11); F084-699 *

Also Published As

Publication number Publication date
CN116312915A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
Pezoulas et al. Medical data quality assessment: On the development of an automated framework for medical data curation
WO2021000676A1 (en) Q&a method, q&a device, computer equipment and storage medium
CN116312915B (en) A method and system for standardized association of drug terms in electronic medical records
CN113035362A (en) Medical prediction method and system based on semantic graph network
CN113312461A (en) Intelligent question-answering method, device, equipment and medium based on natural language processing
CN116994694B (en) Patient medical record data screening method, device and medium based on information extraction
CN113707339B (en) Method and system for concept alignment and content inter-translation among multi-source heterogeneous databases
CN111259111B (en) Medical record-based decision-making assisting method and device, electronic equipment and storage medium
CN106844351A (en) A kind of medical institutions towards multi-data source organize class entity recognition method and device
CN113505601A (en) Positive and negative sample pair construction method and device, computer equipment and storage medium
CN113420552B (en) Biomedical multi-event extraction method based on reinforcement learning
CN116737945B (en) Mapping method for EMR knowledge map of patient
CN116719840B (en) Medical information pushing method based on post-medical-record structured processing
CN117577254A (en) Method and system for constructing language model in medical field and structuring text of electronic medical record
Adduru et al. Towards Dataset Creation And Establishing Baselines for Sentence-level Neural Clinical Paraphrase Generation and Simplification.
CN114840685A (en) Emergency plan knowledge graph construction method
CN116975212A (en) Answer searching method and device for question text, computer equipment and storage medium
Grissette Semisupervised neural biomedical sense disambiguation approach for aspect-based sentiment analysis on social networks
US11961622B1 (en) Application-specific processing of a disease-specific semantic model instance
CN113111660A (en) Data processing method, device, equipment and storage medium
CN119127979A (en) A method for structuring electronic medical records based on large language model
CN117689027A (en) Prompt text generation method and device, electronic equipment and storage medium
CN114707615B (en) Ancient character similarity quantification method based on duration Chinese character knowledge graph
CN116644179A (en) Text classification method, device, electronic equipment and storage medium
CN116258136A (en) Error detection model training method, medical image report detection method, system and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant