CN112100405B

CN112100405B - Veterinary drug residue knowledge graph construction method based on weighted LDA

Info

Publication number: CN112100405B
Application number: CN202011010727.XA
Authority: CN
Inventors: 陈娟; 王然; 张恬; 李依璇; 王鹏杰; 刘蓉; 方冰; 刘思源; 仇菊; 杨璐; 郑丽敏
Original assignee: China Agricultural University
Current assignee: China Agricultural University
Priority date: 2020-09-23
Filing date: 2020-09-23
Publication date: 2024-01-30
Anticipated expiration: 2040-09-23
Also published as: CN112100405A

Abstract

The invention discloses a method for constructing a veterinary drug residue knowledge graph based on weighted LDA (Latent Dirichlet Allocation). First, a veterinary drug knowledge framework is constructed, and a web crawler guided by the framework performs a deep search and downloads documents. To address the topic noise and feature-word bias of the LDA topic model, a weighted LDA method is used for topic mining, and the mined topic words are used to download veterinary-drug-related documents again. Named entity recognition and relationship extraction are performed with a dictionary-based model. Finally, the veterinary drug knowledge graph is built in the Neo4j graph database. The invention can be used to construct a veterinary drug residue knowledge graph, to identify the characteristic patterns of veterinary drug residues and the reasons they harm the human body, and to help ensure the quality and safety of meat, eggs and milk, thereby protecting people's health and safety.

Description

A method for constructing a veterinary drug residue knowledge graph based on weighted LDA

Technical field

The present invention relates to the field of natural language processing, and in particular to a method for constructing a veterinary drug residue knowledge graph based on weighted LDA.

Background

Food safety is attracting growing attention, and the safety of meat, eggs and dairy products is a top priority. Veterinary drugs play an important role in preventing and treating animal diseases and in promoting animal growth, and livestock production cannot do without them. However, the non-standard use, prohibited use and abuse of veterinary drugs lead to residues that exceed the limits and cause poisoning incidents. Constructing a veterinary drug residue knowledge graph makes it possible to identify the characteristic patterns of veterinary drug residues and the reasons they harm the human body, and to help ensure the quality and safety of meat, egg and milk products, thereby protecting people's health and safety.

Veterinary drug residue data covers residue standards for veterinary drugs, sampling inspections in which residues exceed the standards, animal toxicology experimental data on veterinary drug residues, and symptoms of harm to humans. The data include both structured data and unstructured text. Knowledge is extracted from these data and classified to build a basic framework of veterinary drug residue knowledge, and the constructed framework is then used to download literature related to veterinary drug knowledge.

Literature is downloaded under the guidance of the basic framework of veterinary drug residue knowledge. LDA is used for topic mining to uncover latent information in the veterinary drug literature. LDA (Latent Dirichlet Allocation) is an unsupervised machine learning technique that can identify latent topic information in large document collections or corpora. Standard LDA topic mining treats all words as having the same weight, although in practice many high-frequency irrelevant words contribute nothing to topic mining. The invention therefore uses a weighted LDA topic mining method that combines hierarchical semantic similarity over veterinary drug knowledge with TF-IDF to guide the literature download.

Through data fusion, data integration, entity recognition and relationship extraction, a knowledge graph is then built to identify the characteristic patterns of veterinary drug residues and the reasons they harm the human body, and to help ensure the quality and safety of meat, eggs and milk, thereby protecting people's health and safety.

Summary of the invention

The purpose of the present invention is to provide a method for constructing a veterinary drug residue knowledge graph based on weighted LDA. To solve the above technical problems, the main technical content of the invention is as follows:

A method for constructing a veterinary drug residue knowledge graph based on weighted LDA, comprising the following steps:

(1) Construct a veterinary drug knowledge framework: use methods based on hierarchical analysis and rules to extract knowledge from veterinary pharmacology and veterinary toxicology textbooks. Use a wrapper-based method to obtain veterinary drug toxicology knowledge from the PubChem website. Use the jieba word segmentation tool to segment these corpora, remove stop words and tag parts of speech, producing a dictionary that forms a hierarchical veterinary drug knowledge framework;

(2) Download literature data: using the dictionary obtained in the previous step together with veterinary drug names, perform a multi-level search on Web of Science; that is, traverse every path from the root node to a leaf node and, for all the words on each path, search successively within results. Use a support vector machine (SVM) to classify the retrieved documents into two classes: related to veterinary drug knowledge and unrelated to it. For the related documents, use the weighted LDA method for topic extraction;

(3) Information extraction: dictionary-based named entity recognition and relationship extraction;

(4) Build the knowledge graph: import the entities of the veterinary drug domain knowledge above, and the relationships between them, into the Neo4j database in csv format.
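Step (4) can be illustrated with a minimal sketch of the CSV export. The text specifies only that entities and relationships are imported into Neo4j in csv format, so the column names (`id`, `name`, `label`, `start`, `end`, `type`), the relation type and the two example rows below are hypothetical and chosen purely for illustration.

```python
import csv
import io

# Hypothetical example rows; the column names are assumptions, not from the patent.
entities = [
    {"id": "e1", "name": "enrofloxacin", "label": "VeterinaryDrug"},
    {"id": "e2", "name": "liver", "label": "Organ"},
]
relations = [
    {"start": "e1", "end": "e2", "type": "AFFECTS_ORGAN"},
]


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


entity_csv = to_csv(entities, ["id", "name", "label"])
relation_csv = to_csv(relations, ["start", "end", "type"])
print(entity_csv)
print(relation_csv)
```

Files of this shape could then be loaded on the Neo4j side, for example with Cypher's `LOAD CSV WITH HEADERS` clause; keeping entities and relationships in separate files lets nodes be created before the edges that reference them.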

Constructing the veterinary drug knowledge framework in step (1) includes the following:

(a) Define the structure of veterinary drug residue knowledge, which comprises five parts: veterinary drug residues, introduction to toxicology, effects on organs and systems, attributes, and toxicity;

(b) Veterinary drug residues: causes, effects and hazards. Hazards are further divided into hazards to the human body, to food and to the environment;

(c) Veterinary drug attributes: category, physical and chemical properties, pharmacokinetics, effects, applications, maximum residue limits and adverse reactions;

(d) Toxicity of veterinary drugs: classification of toxic effects, common parameters, special risk groups, exposure routes, preventive measures, inhalation methods and animal experiments. Toxic effects are classified by nature, time of onset, site and recovery. Common parameters include acute toxicity, mutagenicity, carcinogenicity and teratogenicity, among others. Animal experiment subjects include mice, rats, rabbits and dogs, among others;

(e) Introduction to veterinary drug toxicology: purpose, content and methods; the methods are divided into biological experiments and population surveys. Effects on organs and systems: eyes, skin, liver, kidneys, nervous system, blood system, immune system, gastrointestinal tract, endocrine system and respiratory system;

(f) Any tabular content in a part is placed under the corresponding category.

The multi-level search in step (2) includes the following steps:

(a) Selected veterinary drugs: the "National Food Safety Standard: Maximum Residue Limits for Veterinary Drugs in Foods" specifies 2191 residue limits and usage requirements for 267 veterinary drugs (or drug classes) in livestock and poultry products, aquatic products and bee products;

(b) Use Selenium and ChromeDriver to scrape dynamic (Ajax) web pages. The search covers all literature from the establishment of the Web of Science database to the present. Because relatively little data exists on veterinary drug toxicology research, the search is not restricted by journal;

(c) Follow the veterinary drug knowledge framework from the root node to each leaf node. For all the nodes on each path, combine their keywords and search successively within results.
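The path traversal in step (c) can be sketched as follows. The tree below is a small hypothetical fragment of the framework (its node names come from the residue part described earlier), and combining the keywords with `AND` is an assumption standing in for successive "search within results"; the actual query syntax submitted to Web of Science is not given in the text.

```python
# Hypothetical fragment of the knowledge framework as a nested dict;
# an empty dict marks a leaf node.
framework = {
    "veterinary drug residues": {
        "causes": {},
        "hazards": {
            "human body": {},
            "food": {},
            "environment": {},
        },
    },
}


def root_to_leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path as a tuple of node names."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from root_to_leaf_paths(children, path)
        else:
            yield path


def build_query(drug, path):
    """Combine a drug name with all keywords on one path (assumed AND search)."""
    terms = [drug, *path]
    return " AND ".join(f'"{term}"' for term in terms)


paths = list(root_to_leaf_paths(framework))
queries = [build_query("enrofloxacin", path) for path in paths]
for query in queries:
    print(query)
```

Each printed query corresponds to one root-to-leaf path, so the number of searches per drug equals the number of leaves in the framework.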

SVM text classification in step (2) includes the following steps:

(a) The goal is to divide the retrieved literature into two classes: related to veterinary drug knowledge and unrelated to it;

(b) TF-IDF is a statistical method for computing and expressing how important a keyword is in a text. TF is the term frequency, the frequency with which a given term appears in a text; IDF is the inverse document frequency, a measure of the general importance of a term. The TF of term t_i in text d_j is calculated as:

tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}

where n_{i,j} is the number of times term t_i appears in text d_j, and the denominator is the sum of the occurrence counts of all words in text d_j.

Calculation of the IDF and of the combined TF-IDF weight:

tfidf_{i,j} = tf_{i,j} \times idf_i
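The TF-IDF weighting can be sketched in a few lines. The TF follows the definition in the text (count of the term divided by the total word count of the document). The IDF form used here, log(N / number of documents containing the term), is the common textbook definition and is an assumption, since the IDF formula itself is not reproduced in this text; the toy corpus is hypothetical.

```python
import math
from collections import Counter


def tf(term, doc):
    """Term frequency: count of `term` divided by the total tokens in `doc`."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """Assumed standard IDF: log(N / number of documents containing the term).

    Raises ZeroDivisionError if the term occurs in no document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tfidf(term, doc, docs):
    """TF-IDF weight of `term` in `doc` relative to the corpus `docs`."""
    return tf(term, doc) * idf(term, docs)


# Toy tokenized corpus (hypothetical).
docs = [
    ["residue", "liver", "residue", "toxicity"],
    ["residue", "milk"],
    ["kidney", "toxicity"],
]

for term in ("residue", "liver"):
    print(term, round(tfidf(term, docs[0], docs), 4))
```

In this toy corpus "residue" is frequent in the first document but appears in two of the three documents, so its IDF is low; "liver" is rarer across the corpus and receives the higher weight despite the lower count.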

(c) First, use the TF-IDF algorithm to extract feature words from part of each paper's abstract and generate a document vector. Using the full abstract makes the vector dimension too high, which increases computational complexity and hinders the subsequent classification. Given the characteristics of the literature related to veterinary drug knowledge, the short text of the conclusion part of the abstract is sufficient; feature words are extracted from it with TF-IDF to generate the document vector;

(d) Randomly select part of the data and label it manually. Set the ratio of training set to test set to 8:2;

(e) Tune the SVM penalty parameter C, and evaluate the model using accuracy (A), precision (P), recall (R) and the F1 score;

(f) Validate the model on the test set;

(g) Apply the trained model to the literature data to obtain the literature related to veterinary drug residues.
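The 8:2 split in step (d) and the four evaluation measures in step (e) can be sketched without any particular SVM library. The function names, the fixed random seed and the example labels below are hypothetical; the classifier itself is left out, and only the bookkeeping around it is shown.

```python
import random


def split_8_2(samples, seed=0):
    """Shuffle and split samples into 80% training and 20% test data."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


train, test = split_8_2(list(range(10)))

# Hypothetical labels: 1 = related to veterinary drug knowledge, 0 = unrelated.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(evaluate(y_true, y_pred))
```

When tuning C, the same `evaluate` call would be repeated for each candidate value on held-out data, and the value giving the best trade-off between precision and recall kept.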

Building the weighted LDA topic model in step (2) includes the following steps:

(a) LDA (Latent Dirichlet Allocation) is a three-layer Bayesian model that describes the relationships among documents, topics and words. Its graphical model is shown in Figure 2. The symbols in the figure have the following meanings: α is the hyperparameter of the Dirichlet distribution of θ, β is the hyperparameter of the Dirichlet distribution of φ, θ is the multinomial "document-topic" distribution, φ is the multinomial "topic-word" distribution, z is the topic assignment of a word, w is a word, K is the number of topics, M is the number of documents, and N is the number of words in a document;

(b) The LDA procedure:

1. Randomly assign a topic number Z to every word in every document of the corpus;

2. Rescan the corpus; for each word, resample its topic with the Gibbs sampling formula and update the assignment in the corpus;

3. Repeat step 2 until Gibbs sampling converges;

4. Count the topic-word co-occurrence frequency matrix of the corpus; this matrix is the LDA model.

(c) Fitting θ and φ by Gibbs sampling:

1. Scan each article and randomly assign a topic Z_j to each word w_n;

2. Initialize Z_j to an integer between 1 and K;

3. Rescan each article, model the topics of the corpus with LDA, iterate Gibbs sampling for parameter inference, and record the value of Z_j. The parameters θ and φ are calculated as follows:

\theta_d^{(j)} = \frac{n_d^{(j)} + \alpha}{n_d^{(\cdot)} + K\alpha}, \qquad \varphi_j^{(w)} = \frac{n_j^{(w)} + \beta}{n_j^{(\cdot)} + V\beta}

where n_d^{(j)} is the number of words in article d assigned to topic j, n_d^{(\cdot)} is the number of words in article d over all topics, n_j^{(w)} is the number of times word w appears under topic j, n_j^{(\cdot)} is the total number of words assigned to topic j, and V is the vocabulary size.
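The LDA procedure and the Gibbs fitting of θ and φ can be sketched in pure Python. This is a minimal collapsed Gibbs sampler under stated assumptions: symmetric priors α and β, a fixed number of sweeps in place of a convergence test, documents given as lists of integer word ids, and the usual smoothed-count estimates for θ and φ. The function name `gibbs_lda`, the vocabulary-size symbol `V` and the toy corpus are illustrative, not taken from the patent.

```python
import random


def gibbs_lda(docs, K, alpha, beta, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    V = max(w for doc in docs for w in doc) + 1  # vocabulary size (assumed symbol)
    n_dk = [[0] * K for _ in docs]       # words in document d assigned to topic k
    n_kw = [[0] * V for _ in range(K)]   # times word w is assigned to topic k
    n_k = [0] * K                        # total words assigned to topic k

    def count(d, w, k, delta):
        n_dk[d][k] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    # Random initial topic for every word.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[d]):
            count(d, w, k, +1)

    # Resample each word's topic from its conditional distribution.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                count(d, w, z[d][i], -1)
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + beta) / (n_k[j] + V * beta)
                    for j in range(K)
                ]
                z[d][i] = rng.choices(range(K), weights=weights)[0]
                count(d, w, z[d][i], +1)

    # Smoothed point estimates of theta (document-topic) and phi (topic-word).
    theta = [[(n_dk[d][j] + alpha) / (len(doc) + K * alpha) for j in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(n_kw[j][w] + beta) / (n_k[j] + V * beta) for w in range(V)]
           for j in range(K)]
    return theta, phi


docs = [[0, 1, 0, 2], [3, 4, 3], [0, 1, 4]]  # toy corpus of word ids
theta, phi = gibbs_lda(docs, K=2, alpha=0.25, beta=0.1)
print(theta)
```

The values α = 0.25 and β = 0.1 match the hyperparameters reported in the text; K = 2 is used only to keep the toy example small.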

(d) The LDA algorithm does not incorporate semantic information well during topic modeling, which seriously harms the semantic coherence and interpretability of the topics and the accuracy of the semantic representation of text. Based on the vocabulary distribution of veterinary drug knowledge, the invention computes the semantic similarity between each word and the veterinary drug knowledge seed words with a hierarchical semantic similarity formula, assigns different weights to words accordingly, and integrates the weights into the Gibbs sampling process;

In the similarity formula, p1 and p2 denote two words, and d denotes the path distance between p1 and p2 in the veterinary drug knowledge hierarchy; the larger d is, the smaller the similarity. The similarity takes values in [0, 1]. k is an adjustable parameter, set to 20 by default.
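The hierarchical similarity can be illustrated with a short sketch. The exact formula is not reproduced in this text, so the form k / (d + k) used below is an assumption, chosen only because it matches the stated properties: it decreases as the path distance d grows, stays within [0, 1], and has an adjustable parameter k (20 by default). The child-to-parent map is a hypothetical fragment of the hierarchy.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def path_distance(a, b, parent):
    """Number of edges between nodes a and b in the hierarchy tree."""
    up_a = path_to_root(a, parent)
    depth_in_b = {n: i for i, n in enumerate(path_to_root(b, parent))}
    for i, n in enumerate(up_a):
        if n in depth_in_b:
            return i + depth_in_b[n]
    raise ValueError("nodes are not in the same tree")


def similarity(a, b, parent, k=20):
    """ASSUMED form k / (d + k); the patent's exact formula is not shown here."""
    return k / (path_distance(a, b, parent) + k)


# Hypothetical fragment of the hierarchy, stored as child -> parent.
parent = {
    "hazards": "veterinary drug residues",
    "causes": "veterinary drug residues",
    "human body": "hazards",
    "food": "hazards",
}

print(similarity("human body", "food", parent))
print(similarity("human body", "causes", parent))
```

Sibling nodes such as "human body" and "food" are two edges apart and score higher than nodes in different branches; a word compared with itself has distance 0 and similarity 1.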

(e) During parameter estimation, LDA favors high-frequency words, while low-frequency feature words that carry latent topics are drowned out. To account for both, TF-IDF is used for optimization; TF-IDF is a statistical method for computing and expressing how important a keyword is in a text. Weighting the topic-word matrix iteratively generated by the topic model with the TF-IDF values of the words effectively weakens the influence of high-frequency noise words.

(f) Steps of the weighted LDA:

1. Segment the paper abstract data set into words and remove stop words;

2. Run Gibbs sampling on the corpus to generate the document-topic and topic-word distributions;

3. Compute the similarity, sort the topics by similarity, keep the top K/2 topics as candidate topics, and construct new document-topic and topic-word distributions from the candidate topics;

4. Weight the topic-word distribution with TF-IDF to obtain weighted probabilities, then select the 20 highest-weighted feature words according to the topic-word distribution.
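Steps 3 and 4 above can be sketched on a toy topic-word matrix. Several details here are assumptions rather than the patent's stated method: each topic is given a single similarity score to the seed words, the TF-IDF weight is one number per vocabulary word, and the weighted row is renormalized to sum to 1. All numbers and names are hypothetical.

```python
def select_topics(topic_scores, keep):
    """Return indices of the `keep` topics with the highest similarity score."""
    ranked = sorted(range(len(topic_scores)),
                    key=lambda j: topic_scores[j], reverse=True)
    return ranked[:keep]


def weight_and_normalize(phi_row, tfidf_weights):
    """Multiply each word probability by its TF-IDF weight and renormalize."""
    weighted = [p * w for p, w in zip(phi_row, tfidf_weights)]
    total = sum(weighted)
    return [x / total for x in weighted]


def top_words(weighted_row, vocab, n):
    """Return the n words with the highest weighted probability."""
    ranked = sorted(range(len(vocab)),
                    key=lambda w: weighted_row[w], reverse=True)
    return [vocab[w] for w in ranked[:n]]


vocab = ["study", "residue", "liver", "toxicity"]
phi = [
    [0.70, 0.10, 0.10, 0.10],  # topic 0, dominated by a high-frequency word
    [0.40, 0.30, 0.20, 0.10],  # topic 1
]
topic_scores = [0.2, 0.9]              # hypothetical similarity to seed words
tfidf_weights = [0.1, 1.0, 1.0, 1.0]   # hypothetical per-word TF-IDF weights

kept = select_topics(topic_scores, keep=len(phi) // 2)  # top K/2 topics
weighted = {j: weight_and_normalize(phi[j], tfidf_weights) for j in kept}
print(top_words(weighted[kept[0]], vocab, n=2))
```

In the toy data the generic word "study" has the largest raw probability in the kept topic, but its low TF-IDF weight pushes it below "residue" and "liver", which is the intended effect of the weighting. With the settings in the text, `keep` would be K/2 = 20 and `n` would be 20.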

(g) Determining the number of topics in LDA: the number of topics must be set before training and is tuned manually according to the training results. The number of topics is set to 40, the hyperparameter α to 0.25, and β to 0.1;

(h) The corpus consists of the veterinary-drug-related documents produced by the SVM classification in the previous step. Topic mining on these documents yields the relevant topic words, which are then used to search again.

Information extraction in step (3) includes the following steps:

(a) Named entity recognition: use an open-source word segmentation tool to segment the text and remove stop words, then use the veterinary drug knowledge dictionary to recognize named entities;

(b) Relationship extraction: predefine a relationship extraction model between veterinary drug knowledge entities and use it to extract relationships.
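A dictionary-based recognizer and a predefined relation model can be sketched as below. The greedy longest-match lookup, the entity types, the `AFFECTS` relation and the rule that any matching type pair in the same sentence yields a triple are all assumptions made for illustration; the dictionary entries and the example sentence are hypothetical.

```python
def recognize_entities(tokens, dictionary, max_len=3):
    """Greedy longest-match lookup of token n-grams in an entity dictionary."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                found.append((phrase, dictionary[phrase]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return found


def extract_relations(entities, patterns):
    """Emit a triple for each entity pair whose type pair has a predefined relation."""
    triples = []
    for a, a_type in entities:
        for b, b_type in entities:
            relation = patterns.get((a_type, b_type))
            if relation:
                triples.append((a, relation, b))
    return triples


# Hypothetical dictionary and predefined relation patterns.
dictionary = {"enrofloxacin": "Drug", "nervous system": "Organ", "liver": "Organ"}
patterns = {("Drug", "Organ"): "AFFECTS"}

# Hypothetical, already segmented sentence.
tokens = "enrofloxacin residues may affect the nervous system".split()
entities = recognize_entities(tokens, dictionary)
triples = extract_relations(entities, patterns)
print(entities)
print(triples)
```

The resulting triples have the same head, relation, tail shape as the rows that are later written to csv for the graph database.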

Description of the drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 shows the veterinary drug knowledge dictionary framework of the present invention;

Figure 2 is a schematic diagram of the LDA algorithm;

Figure 3 is a flow chart of the weighted LDA literature search algorithm.

Detailed description

To further explain the technical means adopted by the present invention to achieve its intended purpose, and their effects, the specific implementation, structure, features and effects of the invention are described in detail below with reference to the drawings and preferred embodiments.

The parts not covered by the present invention are the same as the prior art or can be implemented using the prior art.

The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and the parts that are the same or similar across embodiments can be cross-referenced. The device disclosed in an embodiment corresponds to the method disclosed in that embodiment, so its description is brief; see the description of the method for the relevant details.

The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. The present invention is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for constructing a knowledge graph of veterinary drug residues based on weighted LDA, which is characterized by including the following steps:

(1) Build a veterinary drug knowledge framework: use methods based on hierarchical analysis and rules to extract knowledge from veterinary pharmacology and veterinary toxicology books, use a wrapper-based method to obtain veterinary drug toxicology knowledge from the PubChem website, and use the jieba word segmentation tool to segment this knowledge, remove stop words and tag parts of speech, finally producing a dictionary that forms a hierarchical veterinary drug knowledge framework;

(2) Download literature data: use the dictionary obtained in the previous step, combined with the names of veterinary drugs, to conduct a multi-layer search on Web of Science, that is, traverse every path from the root node to a leaf node and, for all words on each path, search successively within results; use the support vector machine (SVM) method to classify the retrieved documents into two categories, related to veterinary drug knowledge and unrelated to veterinary drug knowledge; for documents related to veterinary drug knowledge, use the weighted LDA method for topic extraction;

Establishing the weighted LDA topic model in step (2) includes the following steps:

(4a) LDA is a three-layer Bayesian model that describes the relationships among documents, topics and words; α is the hyperparameter of the Dirichlet distribution of θ, β is the hyperparameter of the Dirichlet distribution of φ, θ is the multinomial "document-topic" distribution, φ is the multinomial "topic-word" distribution, z is the topic assignment of a word, w is a word, K is the number of topics, M is the number of documents, and N is the number of words in a document;

(4b) LDA process:

(1) Randomly assign a topic number Z to each word in each document in the corpus;

(2) Re-scan the corpus, use the Gibbs Sampling formula to sample each word, find its topic, and update it in the corpus;

(3) Repeat step (2) until Gibbs Sampling converges;

(4) Count the topic-word co-occurrence frequency matrix of the corpus, which is the LDA model;

(4c) Fitting θ and φ by Gibbs sampling:

(1) Scan each article and randomly assign a topic Z_j to each word w_n;

(2) Initialize Z_j to an integer between 1 and K;

(3) Rescan each article, model the topics of the corpus with LDA, iterate Gibbs sampling for parameter inference, and record the value of Z_j, the parameters θ and φ being calculated as follows:

\theta_d^{(j)} = \frac{n_d^{(j)} + \alpha}{n_d^{(\cdot)} + K\alpha}, \qquad \varphi_j^{(w)} = \frac{n_j^{(w)} + \beta}{n_j^{(\cdot)} + V\beta}

where n_d^{(j)} is the number of words in article d assigned to topic j, n_d^{(\cdot)} is the number of words in article d over all topics, n_j^{(w)} is the number of times word w appears under topic j, n_j^{(\cdot)} is the total number of words assigned to topic j, and V is the vocabulary size;

(4d) Based on the vocabulary distribution of veterinary drug knowledge, compute the semantic similarity between each word and the veterinary drug knowledge seed words with a hierarchical semantic similarity formula, assign different weights to the words accordingly, and integrate the weights into the Gibbs sampling process;

p1 and p2 represent two words, d represents the path distance between p1 and p2 in the veterinary drug knowledge hierarchy system. The larger d, the smaller the similarity. The value range of similarity is [0,1], and k is set to 20;

(4e) Use the TF-IDF method for optimization: TF-IDF calculates and expresses the importance of a keyword in the text through statistical methods, and the TF-IDF values of the words are used to weight the topic-word matrix iteratively generated by the topic model, weakening the influence of high-frequency noise words;

(4f) Steps of weighted LDA:

(1) Segment the paper abstract data set into words and remove stop words;

(2) Perform Gibbs sampling on the corpus to generate document-topic distribution and topic-word distribution;

(3) Calculate the similarity, sort according to the similarity, retain the top K/2 topics as candidate topics, and construct a new document-topic distribution and topic-word distribution based on the candidate topics;

(4) Use TF-IDF to weight the topic-word distribution to obtain the weighted probability, and then select the 20 feature words with the highest weight based on the topic-word distribution;

(4g) Determination of the number of topics in LDA: When training the model, the number of topics needs to be set in advance. According to the training results, the parameters are manually adjusted. The hyperparameter α is set to 0.25 and β is set to 0.1;

(4h) The documents in the corpus are veterinary drug knowledge-related documents obtained by SVM classification in the previous step. Perform topic mining on these documents to obtain related topic words, and then use these topic words to search again;

(3) Information extraction: dictionary-based named entity recognition and relationship extraction;

(4) Build a knowledge graph: Import the entities of the above-mentioned knowledge in the field of veterinary medicine and the relationships between entities into the Neo4j database in csv format.

2. The method for constructing a veterinary drug residue knowledge graph based on weighted LDA according to claim 1, characterized in that constructing the veterinary drug knowledge framework in step (1) includes the following content:

(2a) Formulate a veterinary drug residue knowledge system structure, which includes five parts: veterinary drug residues, toxicology introduction, effects on organs and systems, properties and toxicity;

(2b) Veterinary drug residues: including causes, effects and hazards. Harms can be divided into three parts: harm to the human body, food and the environment;

(2c) Veterinary drug attributes: category, physical and chemical properties, pharmacokinetics, effects, applications, maximum residue limits and adverse reactions;

(2d) Toxicity of veterinary drugs: classification of toxic effects, common parameters, special risk groups, exposure routes, preventive measures, inhalation methods and animal experiments. Toxic effects are classified by nature, time of onset, site and recovery. Common parameters include acute toxicity, mutagenicity, carcinogenicity and teratogenicity. The subjects of animal experiments include mice, rats, rabbits and dogs;

(2e) Introduction to veterinary drug toxicology: including purpose, content and methods, which are divided into two parts: biological experiments and group surveys; effects on organs and systems include eyes, skin, liver, kidneys, nervous system, blood system, immune system, gastrointestinal tract, endocrine system, and respiratory system;

(2f) If each part contains table content, place it under the corresponding category.

3. The method for constructing a veterinary drug residue knowledge graph based on weighted LDA according to claim 1, characterized in that the multi-layer search in step (2) includes the following steps:

(3a) Selected veterinary drugs: the "National Food Safety Standard: Maximum Residue Limits for Veterinary Drugs in Foods" specifies 2191 residue limits and usage requirements for 267 veterinary drugs (or drug classes) in livestock and poultry products, aquatic products and bee products;

(3b) Use Selenium and ChromeDriver to scrape dynamic web page data. The search covers all literature from the establishment of the Web of Science database to the present. Because relatively little data exists on veterinary drug toxicology research, the search is not restricted by journal;

(3c) According to the veterinary drug knowledge framework, from the root node to the leaf node, for all nodes on each path, these keywords are combined to search in multi-layer results.