CN112100405B - Veterinary drug residue knowledge graph construction method based on weighted LDA - Google Patents
Veterinary drug residue knowledge graph construction method based on weighted LDA
- Publication number
- CN112100405B (application number CN202011010727.XA)
- Authority
- CN
- China
- Prior art keywords
- topic
- veterinary drug
- knowledge
- lda
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/367 — Information retrieval of unstructured textual data; creation of semantic tools, e.g. ontology or thesauri: Ontology
- G06F16/288 — Information retrieval of structured data, e.g. relational data; databases characterised by their database models; relational databases: Entity relationship models
- G06F16/35 — Information retrieval of unstructured textual data: Clustering; Classification
- G06F18/2411 — Pattern recognition; classification techniques based on the proximity to a decision surface, e.g. support vector machines
- G06F40/216 — Handling natural language data; natural language analysis: Parsing using statistical methods
- G06F40/242 — Handling natural language data; lexical tools: Dictionaries
- G06F40/289 — Handling natural language data; recognition of textual entities: Phrasal analysis, e.g. finite state techniques or chunking
Abstract
Description
Technical Field

The present invention relates to the field of natural language processing, and in particular to a method for constructing a veterinary drug residue knowledge graph based on weighted LDA.

Background Art

Food safety has attracted increasing attention, and the safety of meat, egg, and dairy products is a top priority. Veterinary drugs play an important role in preventing and treating animal diseases and promoting animal growth, and the raising of livestock is inseparable from them. However, non-standard use, prohibited use, and abuse of veterinary drugs lead to residues that exceed permitted limits and can cause poisoning incidents. By constructing a veterinary drug residue knowledge graph, the characteristics and patterns of veterinary drug residues and the reasons why such residues harm the human body can be identified, ensuring the quality and safety of meat, egg, and milk products and thereby protecting people's health and lives.

Veterinary drug residue data cover the residue standards for veterinary drugs, cases in which sampling inspections exceed those standards, animal toxicology experimental data on residues, and the symptoms of harm to humans. These data include both structured data and unstructured text. They are used for knowledge extraction and classification to build a basic framework of veterinary drug residue knowledge, and the constructed framework is then used to download literature related to veterinary drug knowledge.

Literature is downloaded with reference to the basic framework of veterinary drug residue knowledge to obtain publications related to veterinary drugs, and LDA is applied to them for topic mining to uncover latent information. LDA (Latent Dirichlet Allocation) is an unsupervised machine learning technique that can identify the latent topic information in large-scale document collections or corpora. Standard LDA topic mining treats all words as equally weighted, although in practice a large number of high-frequency irrelevant words contribute nothing to topic mining; a weighted LDA topic mining method that combines hierarchical semantic similarity over the veterinary drug knowledge framework with TF-IDF is therefore used to guide the literature retrieval.

Through data fusion, data integration, entity recognition, and relation extraction, a knowledge graph is then constructed to identify the characteristics and patterns of veterinary drug residues and the reasons why such residues harm the human body, ensuring the quality and safety of meat, eggs, and milk and thereby protecting people's health and lives.
Summary of the Invention

The object of the present invention is to provide a method for constructing a veterinary drug residue knowledge graph based on weighted LDA. To solve the above technical problems, the main technical content of the present invention is as follows:

A method for constructing a veterinary drug residue knowledge graph based on weighted LDA, comprising the following steps:

(1) Construct a veterinary drug knowledge framework: knowledge is extracted from veterinary pharmacology and veterinary toxicology textbooks using hierarchical-analysis and rule-based methods, and veterinary drug toxicology knowledge is obtained from the PubChem website using a wrapper-based method. The jieba word segmentation tool is used to remove stop words, segment the text, and apply part-of-speech tagging to these corpora, finally forming a dictionary and a hierarchical veterinary drug knowledge framework (a preprocessing sketch is given after the framework details below);

(2) Download literature data: using the dictionary obtained in the previous step together with veterinary drug names, a multi-level search is performed on Web of Science, i.e., every path from the root node to a leaf node is traversed and all the terms on each path are combined into a search within the previous level's results. A support vector machine (SVM) is used to classify the retrieved literature into two classes, related to veterinary drug knowledge and unrelated to it, and the weighted LDA method is used to extract topics from the related literature;

(3) Information extraction: dictionary-based named entity recognition and relation extraction;

(4) Construct the knowledge graph: the entities of the above veterinary drug domain knowledge and the relations between them are imported into the Neo4j database in CSV format.
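The following is a minimal sketch of the import in step (4), assuming the extracted entities and relations have already been written to two CSV files; the file names, column names, node label, and connection settings are illustrative assumptions rather than details specified by the patent. It uses the official neo4j Python driver.

```python
# Minimal sketch of step (4): loading entity/relation CSVs into Neo4j.
# File names, column names, labels, and connection settings are assumptions.
import csv
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_graph(entity_csv: str, relation_csv: str) -> None:
    with driver.session() as session, \
         open(entity_csv, encoding="utf-8") as ef, \
         open(relation_csv, encoding="utf-8") as rf:
        # Each entity row: name, type (e.g. drug, organ, toxic effect).
        for row in csv.DictReader(ef):
            session.run(
                "MERGE (e:Entity {name: $name}) SET e.type = $type",
                name=row["name"], type=row["type"],
            )
        # Each relation row: head entity, relation label, tail entity.
        for row in csv.DictReader(rf):
            session.run(
                "MATCH (h:Entity {name: $head}), (t:Entity {name: $tail}) "
                "MERGE (h)-[r:REL {label: $label}]->(t)",
                head=row["head"], tail=row["tail"], label=row["label"],
            )

if __name__ == "__main__":
    load_graph("entities.csv", "relations.csv")
```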
Constructing the veterinary drug knowledge framework in step (1) includes the following:

(a) A veterinary drug residue knowledge structure is formulated, comprising five parts: veterinary drug residues, introduction to toxicology, effects on organs and systems, attributes, and toxicity;

(b) Veterinary drug residues: causes, effects, and hazards, where hazards are further divided into hazards to the human body, to food, and to the environment;

(c) Veterinary drug attributes: category, physical and chemical properties, pharmacokinetics, action, application, maximum residue limit, and adverse reactions;

(d) Toxicity of veterinary drugs: classification of toxic effects, commonly used parameters, special risk groups, exposure routes, preventive measures, inhalation routes, and animal experiments. The classification of toxic effects covers nature, time of occurrence, site, and recovery. Commonly used parameters include acute toxicity, mutagenicity, carcinogenicity, and teratogenicity. Animal experiment subjects include mice, rats, rabbits, and dogs;

(e) Introduction to veterinary drug toxicology: purpose, content, and methods, where methods are divided into biological experiments and population surveys. Effects on organs and systems: eyes, skin, liver, kidneys, nervous system, blood system, immune system, gastrointestinal tract, endocrine system, and respiratory system;

(f) If a part contains tabular content, it is placed under the corresponding category.
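As noted in step (1), the corpora taken from the textbooks and PubChem are segmented, cleaned of stop words, and POS-tagged with jieba before the hierarchy above is assembled into a dictionary. Below is a minimal preprocessing sketch; the stop-word file name and the sample sentence are illustrative assumptions.

```python
# Minimal sketch of the step (1) preprocessing: segmentation, stop-word
# removal, and POS tagging with jieba. The stop-word file is an assumption.
import jieba.posseg as pseg

def load_stopwords(path: str = "stopwords.txt") -> set:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess(text: str, stopwords: set) -> list:
    """Return (word, POS) pairs with stop words and whitespace removed."""
    return [(w, flag) for w, flag in pseg.cut(text)
            if w.strip() and w not in stopwords]

if __name__ == "__main__":
    stopwords = load_stopwords()
    # Illustrative sentence about a maximum residue limit.
    tokens = preprocess("恩诺沙星在牛肌肉中的最大残留限量为100微克每千克", stopwords)
    print(tokens)  # e.g. [('恩诺沙星', 'n'), ('牛', 'n'), ('肌肉', 'n'), ...]
```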
The multi-level search in step (2) includes the following steps:

(a) Selection of veterinary drugs: the national food safety standard "Maximum Residue Limits for Veterinary Drugs in Foods" specifies 2191 residue limits and usage requirements for 267 veterinary drugs (or drug classes) in livestock and poultry products, aquatic products, and bee products;

(b) Selenium and ChromeDriver are used to crawl data from dynamic (Ajax) web pages. The search covers all literature indexed since the Web of Science database was established; because the volume of research on veterinary drug toxicology is relatively small, the search is not restricted to particular journals;

(c) Following the veterinary drug knowledge framework from the root node to each leaf node, the keywords of all nodes on every path are combined into a search within the previous level's results (a sketch of this query generation follows).
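A minimal sketch of the path enumeration in (c), assuming the knowledge framework is held as a nested dictionary; the example hierarchy, the drug name, and the TS=(...) query syntax are illustrative assumptions rather than the exact Web of Science interaction, which the patent performs through Selenium.

```python
# Minimal sketch of step (2)(c): enumerate every root-to-leaf path of the
# knowledge framework and combine the keywords on the path into one query.
# The example hierarchy and the "TS=(...)" topic-search syntax are assumptions.
from typing import Iterator

framework = {
    "veterinary drug residue": {
        "toxicity": {
            "acute toxicity": {},
            "carcinogenicity": {},
        },
        "maximum residue limit": {},
    }
}

def root_to_leaf_paths(tree: dict, prefix: tuple = ()) -> Iterator[tuple]:
    for node, children in tree.items():
        path = prefix + (node,)
        if children:
            yield from root_to_leaf_paths(children, path)
        else:
            yield path

def build_queries(tree: dict, drug: str) -> list:
    """One query per root-to-leaf path, combined with the drug name."""
    return [f'TS=("{drug}" AND ' + " AND ".join(f'"{kw}"' for kw in path) + ")"
            for path in root_to_leaf_paths(tree)]

if __name__ == "__main__":
    for q in build_queries(framework, "enrofloxacin"):
        print(q)
```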
SVM text classification in step (2) includes the following steps:

(a) The goal is to split the retrieved literature into two classes: related to veterinary drug knowledge and unrelated to it;

(b) The TF-IDF method computes and expresses the importance of a keyword in a text by statistical means. TF (term frequency) is the frequency with which a specified term appears in a text, and IDF (inverse document frequency) is a measure of the general importance of a word. The TF of term $t_i$ in text $d_j$ is computed as

$$tf_{i,j}=\frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of times term $t_i$ appears in text $d_j$ and the denominator sums the occurrence counts of all words in $d_j$. IDF is computed as

$$idf_i=\log\frac{|D|}{1+|\{j: t_i\in d_j\}|}$$

where $|D|$ is the total number of texts and $|\{j: t_i\in d_j\}|$ is the number of texts containing term $t_i$; 1 is added to the denominator to avoid a zero denominator when the term appears in no text. The TF-IDF value of term $t_i$ is finally

$$tfidf_{i,j}=tf_{i,j}\times idf_i$$
(c) The TF-IDF algorithm is first used to extract feature words from part of each paper's abstract and to generate a document vector. Using the full abstract would produce vectors of excessive dimensionality, increasing computational complexity and hindering the subsequent classification. Given the characteristics of the veterinary-drug-related literature, the short conclusion portion of the abstract is sufficient; TF-IDF is applied to this text to extract feature words and generate the document vector;

(d) Part of the data is randomly selected and manually labelled, with the training and test sets split 8:2;

(e) The SVM penalty parameter C is tuned, and the model is evaluated with accuracy (A), precision (P), recall (R), and the F1 score;

(f) The model is validated on the test set;

(g) The trained model is applied to the literature data to obtain the documents relevant to the veterinary drug residue topic (a classification sketch follows this list).
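A minimal sketch of steps (c)–(g) with scikit-learn, assuming the conclusion sentences of the abstracts and their manual labels are already available as Python lists; the sample texts, labels, and the value of C are illustrative assumptions.

```python
# Minimal sketch of the step (2) SVM classifier: TF-IDF features over the
# conclusion part of each abstract, an 8:2 train/test split, and a tunable
# penalty parameter C. The texts, labels, and C value are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

texts = [
    "enrofloxacin residues in muscle exceeded the maximum residue limit",
    "sulfonamide residues in eggs showed hepatotoxicity in rats",
    "oxytetracycline residues were detected in milk above the limit",
    "we propose a new image compression codec for satellite data",
    "this paper studies traffic flow prediction on urban road networks",
]
labels = [1, 1, 1, 0, 0]  # 1 = related to veterinary drug knowledge

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

clf = SVC(C=1.0, kernel="linear")  # tune the penalty parameter C
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
p, r, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary", zero_division=0)
print("accuracy", accuracy_score(y_test, y_pred), "P", p, "R", r, "F1", f1)
```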
Establishing the weighted LDA topic model in step (2) includes the following steps:

(a) LDA (Latent Dirichlet Allocation) is a three-layer Bayesian model that describes the relationships among documents, topics, and words; its graphical model is shown in Figure 2. The symbols in the figure have the following meanings: α is the hyperparameter of the Dirichlet prior on θ, β is the hyperparameter of the Dirichlet prior on φ, θ is the document–topic multinomial distribution, φ is the topic–word multinomial distribution, z is the topic assignment of a word, w is a word, K is the number of topics, M is the number of documents, and N is the number of words in a document;

(b) The LDA procedure:

1. Randomly assign a topic number z to each word of each document in the corpus;

2. Rescan the corpus and, for each word, resample its topic with the Gibbs sampling formula and update the assignment in the corpus;

3. Repeat step 2 until Gibbs sampling converges;

4. Count the topic–word co-occurrence frequency matrix of the corpus; this matrix is the LDA model.
(c) The process by which Gibbs sampling fits θ and φ:

1. Scan each document and randomly assign a topic $z_j$ to each word $w_n$;

2. Initialize $z_j$ to an integer between 1 and K;

3. Rescan each document and perform topic modelling on the corpus with the LDA model; parameter inference iterates with Gibbs sampling while recording the values of $z_j$. The parameters θ and φ are computed as

$$\theta_{d,j}=\frac{n_d^{(j)}+\alpha}{\sum_{k=1}^{K} n_d^{(k)}+K\alpha},\qquad \varphi_{j,w}=\frac{n_j^{(w)}+\beta}{\sum_{w'} n_j^{(w')}+V\beta}$$

where $n_d^{(j)}$ is the number of words in document d assigned to topic j, $\sum_k n_d^{(k)}$ is the number of words in document d over all topics, $n_j^{(w)}$ is the number of times word w is assigned to topic j, $\sum_{w'} n_j^{(w')}$ is the total number of words assigned to topic j, and V is the vocabulary size (a sampling sketch follows).
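The following is a minimal collapsed Gibbs sampling sketch of (b) and (c), operating on documents that have already been converted to lists of word ids; the toy corpus, the number of topics, α, β, and the iteration count are illustrative assumptions.

```python
# Minimal sketch of collapsed Gibbs sampling for LDA, illustrating (b) and (c).
# Documents are given as lists of word ids; the toy corpus, K, alpha, beta,
# and the iteration count are illustrative assumptions.
import numpy as np

def gibbs_lda(docs, V, K=4, alpha=0.25, beta=0.1, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # words in document d assigned to topic k
    n_kw = np.zeros((K, V))           # times word w is assigned to topic k
    n_k = np.zeros(K)                 # total words assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initial topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_dk[d, z[d][i]] += 1
            n_kw[z[d][i], w] += 1
            n_k[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # full conditional p(z = k | all other assignments)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    # smoothed estimates of theta (document-topic) and phi (topic-word)
    theta = (n_dk + alpha) / (n_dk.sum(1, keepdims=True) + K * alpha)
    phi = (n_kw + beta) / (n_kw.sum(1, keepdims=True) + V * beta)
    return theta, phi

if __name__ == "__main__":
    docs = [[0, 1, 2, 1], [2, 3, 3, 0], [4, 4, 1, 2]]  # toy word-id documents
    theta, phi = gibbs_lda(docs, V=5)
    print(theta.round(2))
```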
(d) The LDA algorithm does not incorporate relevant semantic information during topic modelling, which seriously harms the semantic coherence and interpretability of the topics and the accuracy of the text's semantic representation. In view of the lexical distribution characteristics of veterinary drug knowledge, the semantic similarity between each word and the veterinary drug knowledge seed words is computed with a hierarchical semantic similarity formula, different weights are assigned to the words accordingly, and the weight information is incorporated into the Gibbs sampling process;

In the similarity formula, p1 and p2 denote two words and d denotes the path distance between p1 and p2 in the veterinary drug knowledge hierarchy; the larger d is, the smaller the similarity. The similarity takes values in [0, 1]. k is an adjustable parameter, set to 20 by default.

(e) LDA's parameter estimation favours the extraction of high-frequency words, while the low-frequency feature words of some latent topics are drowned out. To take account of both high-frequency words and the low-frequency feature words of latent topics, TF-IDF is used for optimization: TF-IDF computes and expresses the importance of a keyword in a text by statistical means. The topic–word matrix iteratively produced by the topic model is weighted by the words' TF-IDF values, which effectively weakens the influence of high-frequency noise words (a weighting sketch follows this section).
(f) Steps of the weighted LDA:

1. Segment the abstract data set and remove stop words;

2. Perform Gibbs sampling on the corpus to generate the document–topic and topic–word distributions;

3. Compute the similarities, sort by similarity, keep the top K/2 topics as candidate topics, and construct new document–topic and topic–word distributions from the candidate topics;

4. Weight the topic–word distribution with TF-IDF to obtain weighted probabilities, then select the 20 feature words with the highest weights from the topic–word distribution.

(g) Determining the number of topics in LDA: the number of topics must be set before training, and the parameters are then tuned manually according to the training results. The number of topics is set to 40, the hyperparameter α to 0.25, and β to 0.1;

(h) The documents in the corpus are the veterinary-drug-related documents obtained from the SVM classification in the previous step; topic mining is performed on them to obtain the relevant topic words, and these topic words are then used to search again.
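A minimal sketch of the weighting in (d)–(f), assuming the topic–word matrix phi from the sampler above, each word's path distance to the nearest seed word, and corpus-level TF-IDF scores are already available. The similarity form k/(k + d) is an illustrative assumption: the patent only states that the similarity decreases with path distance d, lies in [0, 1], and uses an adjustable parameter k that defaults to 20.

```python
# Minimal sketch of the (d)-(f) weighting: combine a hierarchical semantic
# similarity (assumed form k/(k+d), decreasing in path distance d) with
# TF-IDF weights over the topic-word matrix, then pick the top feature words.
# phi, path distances, TF-IDF scores, and the vocabulary are assumptions.
import numpy as np

def hierarchy_similarity(d: np.ndarray, k: float = 20.0) -> np.ndarray:
    """Assumed similarity in [0, 1]: 1 when d = 0, tending to 0 as d grows."""
    return k / (k + d)

def weighted_topic_words(phi, path_dist, tfidf, vocab, top_n=20, k=20.0):
    word_weight = hierarchy_similarity(path_dist, k) * tfidf   # per-word weight
    weighted = phi * word_weight                               # reweight topic-word matrix
    weighted /= weighted.sum(axis=1, keepdims=True)            # renormalize each topic
    return [[vocab[i] for i in np.argsort(-row)[:top_n]] for row in weighted]

if __name__ == "__main__":
    vocab = ["residue", "toxicity", "limit", "codec", "satellite"]
    phi = np.array([[0.4, 0.3, 0.2, 0.05, 0.05],
                    [0.1, 0.1, 0.1, 0.4, 0.3]])
    path_dist = np.array([1, 1, 2, 6, 6])        # distance to nearest seed word
    tfidf = np.array([0.8, 0.7, 0.6, 0.2, 0.2])
    for t, words in enumerate(weighted_topic_words(phi, path_dist, tfidf, vocab, top_n=3)):
        print("topic", t, words)
```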
The information extraction in step (3) includes the following steps:

(a) Named entity recognition: an open-source word segmentation tool is used for segmentation and stop-word removal, and the veterinary drug knowledge dictionary is used for named entity recognition;

(b) Relation extraction: relation extraction models between veterinary drug knowledge entities are defined in advance and used to extract relations (a sketch follows).
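A minimal sketch of step (3), assuming the dictionary maps entity strings to types and that relations are extracted with a simple predefined trigger-word pattern; the dictionary entries, the trigger word, and the sample sentence are illustrative assumptions.

```python
# Minimal sketch of step (3): dictionary-based NER plus a predefined pattern
# for relation extraction. Dictionary entries, the trigger word, and the
# sample sentence are illustrative assumptions.
import re

entity_dict = {          # entity string -> entity type
    "enrofloxacin": "veterinary_drug",
    "liver": "organ",
    "acute toxicity": "toxic_effect",
}

def recognize_entities(sentence: str):
    """Dictionary lookup; longer entries are matched first."""
    found = []
    for name in sorted(entity_dict, key=len, reverse=True):
        for m in re.finditer(re.escape(name), sentence, flags=re.IGNORECASE):
            found.append((name, entity_dict[name], m.start()))
    return sorted(found, key=lambda e: e[2])

def extract_relations(sentence: str, entities):
    """Predefined pattern: drug ... 'damage' ... organ -> (drug, damages, organ)."""
    triples = []
    for h, h_type, _ in entities:
        for t, t_type, _ in entities:
            if h_type == "veterinary_drug" and t_type == "organ" \
                    and "damage" in sentence.lower():
                triples.append((h, "damages", t))
    return triples

if __name__ == "__main__":
    s = "Residual enrofloxacin can damage the liver and cause acute toxicity."
    ents = recognize_entities(s)
    print(ents)
    print(extract_relations(s, ents))
```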
Brief Description of the Drawings

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is the veterinary drug knowledge dictionary framework of the present invention;

Figure 2 is a schematic diagram of the LDA algorithm;

Figure 3 is a flow chart of the weighted LDA literature search algorithm.
Detailed Description of the Embodiments

To further explain the technical means and effects adopted by the present invention to achieve its intended purpose, the specific embodiments, structure, features, and effects of the present invention are described in detail below in conjunction with the accompanying drawings and preferred embodiments.
A method for constructing a veterinary drug residue knowledge graph based on weighted LDA, comprising the following steps:

(1) Construct a veterinary drug knowledge framework: knowledge is extracted from veterinary pharmacology and veterinary toxicology textbooks using hierarchical-analysis and rule-based methods, and veterinary drug toxicology knowledge is obtained from the PubChem website using a wrapper-based method. The jieba word segmentation tool is used to remove stop words, segment the text, and apply part-of-speech tagging to these corpora, finally forming a dictionary and a hierarchical veterinary drug knowledge framework;

(2) Download literature data: using the dictionary obtained in the previous step together with veterinary drug names, a multi-level search is performed on Web of Science, i.e., every path from the root node to a leaf node is traversed and all the terms on each path are combined into a search within the previous level's results. A support vector machine (SVM) is used to classify the retrieved literature into two classes, related to veterinary drug knowledge and unrelated to it, and the improved (weighted) LDA method is used to extract topics from the related literature;

(3) Information extraction: dictionary-based named entity recognition and relation extraction;

(4) Construct the knowledge graph: the entities of the above veterinary drug domain knowledge and the relations between them are imported into the Neo4j database in CSV format.
Constructing the veterinary drug knowledge framework in step (1) includes the following:

(a) A veterinary drug residue knowledge structure is formulated, comprising five parts: veterinary drug residues, introduction to toxicology, effects on organs and systems, attributes, and toxicity;

(b) Veterinary drug residues: causes, effects, and hazards, where hazards are further divided into hazards to the human body, to food, and to the environment;

(c) Veterinary drug attributes: category, physical and chemical properties, pharmacokinetics, action, application, maximum residue limit, and adverse reactions;

(d) Toxicity of veterinary drugs: classification of toxic effects, commonly used parameters, special risk groups, exposure routes, preventive measures, inhalation routes, and animal experiments. The classification of toxic effects covers nature, time of occurrence, site, and recovery. Commonly used parameters include acute toxicity, mutagenicity, carcinogenicity, and teratogenicity. Animal experiment subjects include mice, rats, rabbits, and dogs;

(e) Introduction to veterinary drug toxicology: purpose, content, and methods, where methods are divided into biological experiments and population surveys. Effects on organs and systems: eyes, skin, liver, kidneys, nervous system, blood system, immune system, gastrointestinal tract, endocrine system, and respiratory system;

(f) If a part contains tabular content, it is placed under the corresponding category.
The multi-level search in step (2) includes the following steps:

(a) Selection of veterinary drugs: the national food safety standard "Maximum Residue Limits for Veterinary Drugs in Foods" specifies 2191 residue limits and usage requirements for 267 veterinary drugs (or drug classes) in livestock and poultry products, aquatic products, and bee products;

(b) Selenium and ChromeDriver are used to crawl data from dynamic (Ajax) web pages. The search covers all literature indexed since the Web of Science database was established; because the volume of research on veterinary drug toxicology is relatively small, the search is not restricted to particular journals;

(c) Following the veterinary drug knowledge framework from the root node to each leaf node, the keywords of all nodes on every path are combined into a search within the previous level's results.
SVM text classification in step (2) includes the following steps:

(a) The goal is to split the retrieved literature into two classes: related to veterinary drug knowledge and unrelated to it;

(b) The TF-IDF method computes and expresses the importance of a keyword in a text by statistical means. TF (term frequency) is the frequency with which a specified term appears in a text, and IDF (inverse document frequency) is a measure of the general importance of a word. The TF of term $t_i$ in text $d_j$ is computed as

$$tf_{i,j}=\frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of times term $t_i$ appears in text $d_j$ and the denominator sums the occurrence counts of all words in $d_j$. IDF is computed as

$$idf_i=\log\frac{|D|}{1+|\{j: t_i\in d_j\}|}$$

where $|D|$ is the total number of texts and $|\{j: t_i\in d_j\}|$ is the number of texts containing term $t_i$; 1 is added to the denominator to avoid a zero denominator when the term appears in no text. The TF-IDF value of term $t_i$ is finally

$$tfidf_{i,j}=tf_{i,j}\times idf_i$$
(c) The TF-IDF algorithm is first used to extract feature words from part of each paper's abstract and to generate a document vector. Using the full abstract would produce vectors of excessive dimensionality, increasing computational complexity and hindering the subsequent classification. Given the characteristics of the veterinary-drug-related literature, the short conclusion portion of the abstract is sufficient; TF-IDF is applied to this text to extract feature words and generate the document vector;

(d) Part of the data is randomly selected and manually labelled, with the training and test sets split 8:2;

(e) The SVM penalty parameter C is tuned, and the model is evaluated with accuracy (A), precision (P), recall (R), and the F1 score;

(f) The model is validated on the test set;

(g) The trained model is applied to the literature data to obtain the documents relevant to the veterinary drug residue topic.
Establishing the weighted LDA topic model in step (2) includes the following steps:

(a) LDA (Latent Dirichlet Allocation) is a three-layer Bayesian model that describes the relationships among documents, topics, and words; its graphical model is shown in Figure 2. The symbols in the figure have the following meanings: α is the hyperparameter of the Dirichlet prior on θ, β is the hyperparameter of the Dirichlet prior on φ, θ is the document–topic multinomial distribution, φ is the topic–word multinomial distribution, z is the topic assignment of a word, w is a word, K is the number of topics, M is the number of documents, and N is the number of words in a document;

(b) The LDA procedure:

1. Randomly assign a topic number z to each word of each document in the corpus;

2. Rescan the corpus and, for each word, resample its topic with the Gibbs sampling formula and update the assignment in the corpus;

3. Repeat step 2 until Gibbs sampling converges;

4. Count the topic–word co-occurrence frequency matrix of the corpus; this matrix is the LDA model.
(c) The process by which Gibbs sampling fits θ and φ:

1. Scan each document and randomly assign a topic $z_j$ to each word $w_n$;

2. Initialize $z_j$ to an integer between 1 and K;

3. Rescan each document and perform topic modelling on the corpus with the LDA model; parameter inference iterates with Gibbs sampling while recording the values of $z_j$. The parameters θ and φ are computed as

$$\theta_{d,j}=\frac{n_d^{(j)}+\alpha}{\sum_{k=1}^{K} n_d^{(k)}+K\alpha},\qquad \varphi_{j,w}=\frac{n_j^{(w)}+\beta}{\sum_{w'} n_j^{(w')}+V\beta}$$

where $n_d^{(j)}$ is the number of words in document d assigned to topic j, $\sum_k n_d^{(k)}$ is the number of words in document d over all topics, $n_j^{(w)}$ is the number of times word w is assigned to topic j, $\sum_{w'} n_j^{(w')}$ is the total number of words assigned to topic j, and V is the vocabulary size.
(d) The LDA algorithm does not incorporate relevant semantic information during topic modelling, which seriously harms the semantic coherence and interpretability of the topics and the accuracy of the text's semantic representation. In view of the lexical distribution characteristics of veterinary drug knowledge, the semantic similarity between each word and the veterinary drug knowledge seed words is computed with a hierarchical semantic similarity formula, different weights are assigned to the words accordingly, and the weight information is incorporated into the Gibbs sampling process;

In the similarity formula, p1 and p2 denote two words and d denotes the path distance between p1 and p2 in the veterinary drug knowledge hierarchy; the larger d is, the smaller the similarity. The similarity takes values in [0, 1]. k is an adjustable parameter, set to 20 by default.

(e) LDA's parameter estimation favours the extraction of high-frequency words, while the low-frequency feature words of some latent topics are drowned out. To take account of both high-frequency words and the low-frequency feature words of latent topics, TF-IDF is used for optimization: TF-IDF computes and expresses the importance of a keyword in a text by statistical means. The topic–word matrix iteratively produced by the topic model is weighted by the words' TF-IDF values, which effectively weakens the influence of high-frequency noise words.
(f) Steps of the weighted LDA:

1. Segment the abstract data set and remove stop words;

2. Perform Gibbs sampling on the corpus to generate the document–topic and topic–word distributions;

3. Compute the similarities, sort by similarity, keep the top K/2 topics as candidate topics, and construct new document–topic and topic–word distributions from the candidate topics;

4. Weight the topic–word distribution with TF-IDF to obtain weighted probabilities, then select the 20 feature words with the highest weights from the topic–word distribution.

(g) Determining the number of topics in LDA: the number of topics must be set before training, and the parameters are then tuned manually according to the training results. The number of topics is set to 40, the hyperparameter α to 0.25, and β to 0.1;

(h) The documents in the corpus are the veterinary-drug-related documents obtained from the SVM classification in the previous step; topic mining is performed on them to obtain the relevant topic words, and these topic words are then used to search again.
The information extraction in step (3) includes the following steps:

(a) Named entity recognition: an open-source word segmentation tool is used for segmentation and stop-word removal, and the veterinary drug knowledge dictionary is used for named entity recognition;

(b) Relation extraction: relation extraction models between veterinary drug knowledge entities are defined in advance and used to extract relations.
Parts not addressed in the present invention are the same as the prior art or can be implemented using the prior art.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (3)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011010727.XA CN112100405B (en) | 2020-09-23 | 2020-09-23 | Veterinary drug residue knowledge graph construction method based on weighted LDA |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011010727.XA CN112100405B (en) | 2020-09-23 | 2020-09-23 | Veterinary drug residue knowledge graph construction method based on weighted LDA |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112100405A CN112100405A (en) | 2020-12-18 |
CN112100405B true CN112100405B (en) | 2024-01-30 |
Family
ID=73755147
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011010727.XA Active CN112100405B (en) | 2020-09-23 | 2020-09-23 | Veterinary drug residue knowledge graph construction method based on weighted LDA |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112100405B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113127627B (en) * | 2021-04-23 | 2023-01-17 | 中国石油大学(华东) | Poetry Recommendation Method Based on LDA Topic Model and Poetry Knowledge Graph |
WO2022246691A1 (en) * | 2021-05-26 | 2022-12-01 | 深圳晶泰科技有限公司 | Construction method and system for small molecule drug crystal form knowledge graph |
CN114188024A (en) * | 2021-12-14 | 2022-03-15 | 扬州大学 | Prediction and classification method of livestock and poultry diseases based on knowledge map of livestock and poultry diseases |
CN114428862A (en) * | 2021-12-22 | 2022-05-03 | 国家石油天然气管网集团有限公司 | Oil and gas pipeline-based knowledge graph construction method and processor |
CN114117082B (en) * | 2022-01-28 | 2022-04-19 | 北京欧应信息技术有限公司 | Method, apparatus and medium for correction of data to be corrected |
CN118211650B (en) * | 2024-04-09 | 2025-04-29 | 广东利通科技投资有限公司 | A method for constructing a knowledge base of highway electromechanical operation and maintenance based on big data |
CN118823757B (en) * | 2024-07-17 | 2025-02-25 | 上海寰通商务科技有限公司 | A method, device and storage medium for identifying the name of a drug to be identified |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102131099B1 (en) * | 2014-02-13 | 2020-08-05 | 삼성전자 주식회사 | Dynamically modifying elements of User Interface based on knowledge graph |
- 2020-09-23: CN application CN202011010727.XA filed; granted as patent CN112100405B (status: Active)
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103823848A (en) * | 2014-02-11 | 2014-05-28 | 浙江大学 | LDA (latent dirichlet allocation) and VSM (vector space model) based similar Chinese herb literature recommendation method |
CN105653706A (en) * | 2015-12-31 | 2016-06-08 | 北京理工大学 | Multilayer quotation recommendation method based on literature content mapping knowledge domain |
CN105677856A (en) * | 2016-01-07 | 2016-06-15 | 中国农业大学 | Text classification method based on semi-supervised topic model |
CN107122444A (en) * | 2017-04-24 | 2017-09-01 | 北京科技大学 | A kind of legal knowledge collection of illustrative plates method for auto constructing |
WO2020082560A1 (en) * | 2018-10-25 | 2020-04-30 | 平安科技(深圳)有限公司 | Method, apparatus and device for extracting text keyword, as well as computer readable storage medium |
CN109684483A (en) * | 2018-12-11 | 2019-04-26 | 平安科技(深圳)有限公司 | Construction method, device, computer equipment and the storage medium of knowledge mapping |
CN110633364A (en) * | 2019-09-23 | 2019-12-31 | 中国农业大学 | Construction method and display mode of food safety knowledge graph based on graph database |
CN110674274A (en) * | 2019-09-23 | 2020-01-10 | 中国农业大学 | A knowledge graph construction method for food safety regulations question answering system |
CN110941721A (en) * | 2019-09-28 | 2020-03-31 | 国家计算机网络与信息安全管理中心 | Short text topic mining method and system based on variational self-coding topic model |
CN111159430A (en) * | 2019-12-31 | 2020-05-15 | 秒针信息技术有限公司 | Live pig breeding prediction method and system based on knowledge graph |
CN111291156A (en) * | 2020-01-21 | 2020-06-16 | 同方知网(北京)技术有限公司 | Question-answer intention identification method based on knowledge graph |
CN111209412A (en) * | 2020-02-10 | 2020-05-29 | 同方知网(北京)技术有限公司 | A method for constructing knowledge graph of journal literature with cyclic update and iteration |
Non-Patent Citations (2)
Title |
---|
Major event trend prediction based on improved LDA feature extraction; Peng Boyuan, Peng Dongliang, Gu Yu, Peng Junli; Journal of Hangzhou Dianzi University (Natural Sciences), No. 2; full text *
Design and implementation of an ontology-based food safety news crawler; Zhang Hanchi, Yang Lu, Fang Xiongwu, Zheng Limin; Agricultural Network Information, No. 5; full text *
Also Published As
Publication number | Publication date |
---|---|
CN112100405A (en) | 2020-12-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information |
Inventors after the change: Chen Juan, Yang Lu, Zheng Limin, Wang Ran, Zhang Tian, Li Yixuan, Wang Pengjie, Liu Rong, Fang Bing, Liu Siyuan, Qiu Ju. Inventors before the change: Zheng Limin, Yang Lu, Zhang Tian.
CB03 | Change of inventor or designer information | ||
GR01 | Patent grant | ||
GR01 | Patent grant |