CN109033738A - A drug activity prediction method based on deep learning
- Publication number
- CN109033738A (application number CN201810742486.4A)
- Authority
- CN
- China
- Prior art keywords
- molecule
- node
- data
- conv
- pharmaceutical activity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The invention discloses a drug activity prediction method based on deep learning. It uses the open-source RDKit library to compute the basic features of each atom in a given molecule, including atom type, valence, and formal charge; computing only atomic features greatly reduces time cost. The prediction model combines two models, graph convolution and an LSTM (long short-term memory network). For the graph convolution model, every molecule is represented as a graph by treating atoms as nodes and bonds as edges of an undirected graph, and molecular structure features are extracted from that graph; the graph convolutional neural network reduces time cost while obtaining features that traditional methods cannot. The LSTM learns a complex metric by exchanging information between the evidence molecules and the query molecule. Together these achieve high prediction accuracy with little data.
Description
Technical field
The invention relates to a drug activity prediction method based on deep learning and belongs to the field of software technology.
Background
The main goal of drug research and of the pharmaceutical industry is to discover drug molecules that treat disease, and developing methods for lead discovery is the principal route to that goal. Even when biological research finds that a particular molecule has therapeutic activity, the molecule is often abandoned for reasons such as toxicity, low activity, or low solubility. According to the Pharmaceutical Research and Manufacturers of America, new drug research and development accounts for 12.8% of sales revenue across the pharmaceutical industry, and 75% of that is attributable to research and development efforts that fail; fewer than 5% of the compounds that hit in primary screening proceed to preclinical evaluation. Because computational virtual screening is not limited by the availability of samples, a strategy of virtual screening first and pharmacological testing second can significantly shorten the development cycle of new drugs and reduce development costs compared with the traditional strategy of going directly to pharmacological testing. At present, the mainstream direction of lead discovery is the study of quantitative structure-activity relationships (QSAR), which amounts to describing molecular structure quantitatively: choosing a method for describing molecular features and choosing the mathematical function that relates those features to activity.
Current practice falls mainly into the following categories:
1. Methods based on the relationship between the topology, side chains, and scaffolds of compound molecules and specific sites of toxic action. Wang et al. studied these relationships (for example skin toxicity, hematological toxicity, and renal toxicity) for about 60,000 toxic compound molecules in the RTECS (Registry of Toxic Effects of Chemical Substances) database, and compared how often each topological structure occurs in the whole database with how often it occurs in the toxic chemical library. This method needs a large amount of data and many positive samples, and because it extracts only toxicity features it makes large errors when judging non-toxic molecules.
2. Predicting the activity of a candidate drug with a support vector machine. Zhang et al. used gene information associated with hereditary diseases to screen the available drug targets for target genes linked to those diseases, and obtained a feature attribute for each sample drug, namely the correlation between the drug's targets and the disease-associated target genes. With each sample drug's feature attribute as the input vector and its activity as the output, they built a support vector machine model to predict the activity of the drug under test. The molecular features are hard to obtain, a specific dataset is required, and the method generalizes poorly.
3. Identifying drug-active molecules by combining supervised and unsupervised deep learning algorithms. Gao Shuangyin combined several methods, namely the support vector machine, artificial neural network, semi-supervised support vector machine, cost-security semi-supervised support vector machine, stacked autoencoder, and deep belief network, to study three classes of drug-active molecules (PLK1PBD, SMAD3, IL-1B) in depth. Because the structures of drug-active molecules are complex, the chemometrics software MOE was used to compute their 2D and 3D molecular descriptors precisely, and the two classes of algorithms were then used to identify drug-active molecules. This method needs a large dataset, and computing molecular features with chemometrics software takes a great deal of time.
In summary, each approach to drug activity prediction is limited by its own characteristics. Methods based on big data analysis need large amounts of data and place high demands on the sample distribution; traditional machine learning methods spend a great deal of time on sample collection, classification, and training; and the supervised and unsupervised machine learning algorithms above not only need large amounts of data but also spend a great deal of time computing molecular features with chemometrics software.
Glossary:
LSTM: long short-term memory network.
Atom degree: the value computed with RDKit for each atom, equal to the number of atoms directly bonded to that atom.
Lewis structure: a way of writing a molecule, for example H-C≡N for hydrogen cyanide.
Sigmoid: a mathematical function with an S-shaped curve, given by $S(x) = \frac{1}{1 + e^{-x}}$.
It is widely used in logistic regression and artificial neural networks.
Tanh: the hyperbolic tangent function, derived from the basic hyperbolic functions sinh and cosh: $\tanh x = \frac{\sinh x}{\cosh x} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
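The two activation functions in the glossary can be written directly with the Python standard library. This is a minimal sketch; the function names `sigmoid` and `tanh_from_sinh_cosh` are illustrative and do not come from the patent.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-x)), an S-shaped curve with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sinh_cosh(x: float) -> float:
    """Hyperbolic tangent written as the ratio sinh(x) / cosh(x)."""
    return math.sinh(x) / math.cosh(x)
```

The two functions are related by tanh(x) = 2 * sigmoid(2x) - 1, so tanh is a rescaled sigmoid with outputs in (-1, 1).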
Summary of the invention
The invention overcomes the deficiencies of the prior art and discloses a drug activity prediction method based on deep learning. It uses the open-source RDKit library to compute the basic features of each atom in a given molecule, including atom type, valence, and formal charge; computing only atomic features greatly reduces time cost. For the graph convolution model, every molecule is represented as a graph by treating atoms as nodes and bonds as edges of an undirected graph, and molecular structure features are extracted from that graph; the graph convolutional neural network reduces time cost while obtaining features that traditional methods cannot. The LSTM learns a complex metric by exchanging information between the evidence molecules and the query molecule. Together these achieve high prediction accuracy with little data.
To solve the above technical problems, the invention adopts the following technical solution:
A drug activity prediction method based on deep learning, comprising the following steps:
Step 1: build a drug activity dataset and split it, using one part of the data as the training set, one part as the development set, and one part as the test set;
Step 2: extract atomic features from the molecules in the training set, and convert the molecular structures in the training set into adjacency matrices;
Step 3: build the prediction model, which contains five graph convolution layers and one LSTM layer;
Step 4: train using the outputs of steps 2 and 3;
Step 5: after graph convolution, pooling, and full connection, pass the output values to the classifier, optimize the loss function, and continue training;
Step 6: obtain the trained prediction model through iterative computation;
Step 7: feed the drug to be predicted into the prediction model to obtain the prediction result.
2. The drug activity prediction method based on deep learning of claim 1, wherein in step 7 the development set and the test set are first processed through steps 2 to 6 in the same way and fed into the prediction model to obtain test results.
3. The drug activity prediction method based on deep learning of claim 1, wherein step 1 comprises the following steps:
1.1 Shuffle and split the drug activity dataset into an 80% training set, a 10% development set, and a 10% test set, and keep the development set and test set fixed for comparison; the split ensures that the data in the training, development, and test sets are all uniformly distributed over the dataset;
1.2 Label molecules in the dataset that affect the receptor as 1 (positive samples) and molecules that have no effect as 0 (negative samples), and remove null entries that have no data; removing this interfering data improves accuracy.
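The labelling rule in step 1.2 can be sketched as a small cleaning function. This is an illustration under assumptions: the record layout (a SMILES string paired with a raw value that may be `None`) and the name `clean_labels` are hypothetical, and the patent does not say how "affects the receptor" is encoded in the raw data.

```python
from typing import List, Optional, Tuple


def clean_labels(
    records: List[Tuple[str, Optional[int]]]
) -> List[Tuple[str, int]]:
    """Drop entries with no data and map the rest to 1 (active) or 0 (inactive).

    Assumption: a raw value of None means "no data"; any non-zero value
    means the molecule affects the receptor.
    """
    cleaned = []
    for smiles, value in records:
        if value is None:
            continue  # null entry: removed rather than guessed
        cleaned.append((smiles, 1 if value != 0 else 0))
    return cleaned
```

Dropping nulls instead of treating them as negatives keeps unmeasured molecules from being counted as inactive.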
4. The drug activity prediction method based on deep learning of claim 1, wherein in step 2 atomic features are extracted from the molecules in the training set and the molecular structures in the training set are converted into adjacency matrices:
2.1 Extract statistical features from the molecular data: ['C','N','O','S','F','Si','P','Cl','Br','Mg','Na','Ca','Fe','As','Al','I','B','V','K','Tl','Yb','Sb','Sn','Ag','Pd','Co','Se','Ti','Zn','H','Li','Ge','Cu','Au','Ni','Cd','In','Mn','Zr','Cr','Pt','Hg','Pb','=','+','-','(',')','/','\','[',']','@','#','Unknown']. Digits and decimal points are ignored. The result is a dictionary covering all statistical features in the molecule, whose values are the number of times each element or symbol occurs;
2.2 Extract the degree of each atom in the molecule, in the range 0 to 10; the degree of an atom is defined as the number of atoms directly bonded to it;
2.3 Extract the number of implicit high spins in the molecule, in the range 0 to 6; the angular momentum of an atomic nucleus is called the spin of the nucleus;
2.4 Extract the formal charge of each atom in the molecule;
2.5 Extract the number of free electrons of each atom in the molecule;
2.6 Extract whether the molecule is an aromatic compound;
2.7 Represent every molecule as a structure graph by treating its atoms as nodes and its chemical bonds as edges of an undirected graph, and generate the molecular structure graph as an adjacency matrix. The adjacency matrix uses all atoms in the molecule as its row and column labels; when two atoms in the molecule are joined by a chemical bond, the corresponding matrix entry is 1.
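The symbol-count dictionary of step 2.1 can be sketched as follows. The patent gives only the vocabulary, so several choices here are assumptions: the input is taken to be a SMILES-like string, two-character symbols are matched before one-character symbols, and any other character (including lowercase aromatic atoms such as `c`) is counted under 'Unknown'. The names `VOCAB` and `count_symbols` are illustrative.

```python
VOCAB = [
    'C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca', 'Fe',
    'As', 'Al', 'I', 'B', 'V', 'K', 'Tl', 'Yb', 'Sb', 'Sn', 'Ag', 'Pd', 'Co',
    'Se', 'Ti', 'Zn', 'H', 'Li', 'Ge', 'Cu', 'Au', 'Ni', 'Cd', 'In', 'Mn',
    'Zr', 'Cr', 'Pt', 'Hg', 'Pb', '=', '+', '-', '(', ')', '/', '\\', '[',
    ']', '@', '#', 'Unknown',
]
VOCAB_SET = set(VOCAB)


def count_symbols(smiles: str) -> dict:
    """Count how often each vocabulary symbol occurs in a SMILES-like string."""
    counts = {token: 0 for token in VOCAB}
    i = 0
    while i < len(smiles):
        pair = smiles[i:i + 2]
        char = smiles[i]
        if len(pair) == 2 and pair in VOCAB_SET:
            counts[pair] += 1      # two-letter element such as 'Cl'
            i += 2
        elif char in VOCAB_SET:
            counts[char] += 1      # one-letter element or bond/bracket symbol
            i += 1
        elif char.isdigit() or char == '.':
            i += 1                 # digits and decimal points are ignored
        else:
            counts['Unknown'] += 1
            i += 1
    return counts
```

For example, `count_symbols("CC(=O)Cl")` counts two 'C', one 'O', one 'Cl', and one each of '(', ')', and '='. Greedy two-character matching can misread rare adjacent pairs, so a real tokenizer would need chemistry-aware parsing.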
5. The drug activity prediction method based on deep learning of claim 1, wherein step 3 comprises the following steps:
3.1 The input x has two parts: the atomic features of the molecule, and the adjacency matrix obtained from the molecular structure. x is a single matrix formed by combining the two;
3.2 The true value of the output y is encoded with the array [1,0] for 0 and the array [0,1] for 1. The result of each training or test run is an array [a,b], where a and b are two probabilities with a + b = 1; one of them is the probability that the true value of y is [1,0], and the other is the probability that it is [0,1];
3.3 The prediction model uses a five-layer graph convolutional neural network. A graph convolutional neural network has two basic properties: every node has its own feature information, and every node in the graph also has structural information. The graph convolution is computed by formula (1), with v as the central node of the convolution:
[Formula (1) is not preserved in this text.]
u: a neighbor node of the central node v; $h_{conv}(v)$: the graph convolution feature value of the central node v and node u; M: the set of all nodes in the graph convolutional neural network;
[Symbol not preserved in this text]: the feature parameters, which are all preset to 1 and updated continuously during training;
σ: the pooling function;
Let [definition not preserved in this text].
Formula (1) transforms the feature of one edge of the central node v into $h_{conv}(v)$ and then sums $h_{conv}(v)$ over all neighbor nodes u, which gives the graph convolution of the central node v;
$h_{conv}(G) = [h_{conv}(v_1), h_{conv}(v_2), h_{conv}(v_3), \ldots]$ (2)
$h_{conv}(G)$ is the set of $h_{conv}(v)$ for the drug molecule currently being computed, and G denotes that molecule;
Finally, the set of graph convolutions of all nodes v in the molecule is obtained, which is the set of molecular structure features.
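Formula (1) itself is not available in this text, so the following is only a sketch of the neighbor-sum aggregation that the surrounding description outlines; it should not be read as the patent's exact formula. Assumptions: the feature parameters are one weight per feature (initialized to 1, as the text states), σ is passed in as a callable and defaults to the identity, and the names `graph_convolution`, `weights`, and `sigma` are illustrative.

```python
from typing import Callable, List


def graph_convolution(
    features: List[List[float]],
    adjacency: List[List[int]],
    weights: List[float],
    sigma: Callable[[float], float] = lambda z: z,
) -> List[List[float]]:
    """Return h_conv(G): one aggregated feature vector per node v."""
    n = len(features)
    h_conv = []
    for v in range(n):                      # each node in turn is the center
        total = [0.0] * len(weights)
        for u in range(n):
            if adjacency[v][u] != 1:
                continue                    # u is not a neighbor of v
            for k, w in enumerate(weights):
                total[k] += w * features[u][k]   # transformed neighbor feature
        h_conv.append([sigma(t) for t in total])
    return h_conv


# Toy path graph 0-1-2 with two features per node (hypothetical data).
FEATURES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ADJACENCY = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H = graph_convolution(FEATURES, ADJACENCY, weights=[1.0, 1.0])
```

With all weights at their initial value of 1, each node's output is simply the sum of its neighbors' feature vectors; training would then adjust the weights.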
6. The drug activity prediction method based on deep learning of claim 4, wherein in step 5 the graph convolution process is as follows:
5.1.1 Traverse all nodes in the molecular structure graph;
5.1.3 Set the central node of the graph convolution to v;
5.1.4 Traverse all neighbor nodes u of the central node v and build a relation dictionary d;
5.1.5 Transform the feature of node u into u′: [formula not preserved in this text]
where [symbol not preserved in this text] denotes the feature parameters, which are all preset to 1 and updated continuously during training;
5.1.6 Sum all u′;
5.1.7 Return the feature of the central node v;
The pooling process is as follows:
5.2.1 Max-pool the neighbor nodes u′;
5.2.2 Return the graph convolution feature $h_{conv}(v)$ of the central node v;
The full connection process is as follows:
5.3.1 Use the LSTM to judge whether each graph convolution feature of the molecule is useful, and so select the useful features;
5.3.2 Concatenate all the selected useful features and pass the output values to the classifier.
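The relation dictionary of step 5.1.4 and the neighbor max-pooling of steps 5.2.1 and 5.2.2 can be sketched as below. This is an illustration under assumptions: the dictionary maps each node index to the list of its neighbor indices, pooling takes an element-wise maximum over the neighbors only, and a node without neighbors keeps its own features. The names `build_neighbor_dict` and `max_pool_neighbors` are hypothetical.

```python
from typing import Dict, List


def build_neighbor_dict(adjacency: List[List[int]]) -> Dict[int, List[int]]:
    """Map each node v to the list of nodes u bonded to it."""
    return {
        v: [u for u, bonded in enumerate(row) if bonded == 1]
        for v, row in enumerate(adjacency)
    }


def max_pool_neighbors(
    features: List[List[float]], neighbors: Dict[int, List[int]]
) -> List[List[float]]:
    """Element-wise maximum over the feature vectors of each node's neighbors."""
    pooled = []
    for v, own in enumerate(features):
        if not neighbors[v]:
            pooled.append(list(own))   # isolated node: nothing to pool over
            continue
        pooled.append(
            [max(features[u][k] for u in neighbors[v]) for k in range(len(own))]
        )
    return pooled
```

Max pooling keeps the strongest value of each feature in a node's neighborhood and is independent of the order in which neighbors are visited.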
7. The drug activity prediction method based on deep learning of claim 1, wherein the iterative computation in step 6 that yields the trained model proceeds as follows:
6.1 At each iteration, randomly draw a batch of 128 samples from the training set and feed it into the model for training; once the training result is obtained, optimize the loss function by gradient descent.
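The loop in step 6.1, random batches of 128 followed by a gradient descent update, can be sketched with a deliberately simple stand-in model. Logistic regression with a cross-entropy loss is used here only so the example stays self-contained; it is not the patent's graph convolution and LSTM network. The learning rate, the seed, and the name `train_logistic` are assumptions, and the default of 2000 steps follows the iteration count mentioned later in the description.

```python
import math
import random
from typing import List, Tuple


def train_logistic(
    samples: List[List[float]],
    labels: List[int],
    batch_size: int = 128,
    steps: int = 2000,
    learning_rate: float = 0.1,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Mini-batch gradient descent on a logistic regression stand-in model."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    w = [0.0] * n_features
    b = 0.0
    indices = list(range(len(samples)))
    for _ in range(steps):
        # Draw one random batch (the whole set if it has fewer samples).
        batch = rng.sample(indices, min(batch_size, len(indices)))
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for i in batch:
            z = sum(wk * xk for wk, xk in zip(w, samples[i])) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - labels[i]          # gradient of cross-entropy w.r.t. z
            for k in range(n_features):
                grad_w[k] += err * samples[i][k]
            grad_b += err
        # Step against the batch-averaged gradient.
        w = [wk - learning_rate * gk / len(batch) for wk, gk in zip(w, grad_w)]
        b -= learning_rate * grad_b / len(batch)
    return w, b
```

Because each batch is drawn at random, how much of the training set is visited over all iterations depends on the training set size relative to the number of draws.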
As a further improvement, the prediction model in step 3 is a binary classification model.
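The two-class output convention of step 3.2, labels encoded as [1,0] or [0,1] and predictions returned as [a,b] with a + b = 1, can be sketched as follows. The patent does not name the function that produces the two probabilities; a softmax over two raw scores is assumed here, and the names `one_hot`, `softmax2`, and `predict_label` are illustrative.

```python
import math
from typing import List


def one_hot(label: int) -> List[int]:
    """Encode label 0 as [1, 0] and label 1 as [0, 1]."""
    return [1, 0] if label == 0 else [0, 1]


def softmax2(scores: List[float]) -> List[float]:
    """Turn two raw scores into probabilities [a, b] with a + b = 1."""
    shift = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(probabilities: List[float]) -> int:
    """Pick the class whose probability is larger (ties go to class 0)."""
    return 0 if probabilities[0] >= probabilities[1] else 1
```

Here a is read as the probability of class 0 and b as the probability of class 1, matching the positions used by `one_hot`.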
Compared with the prior art, the invention has the following advantages:
1. Steps 1 and 2 preprocess the data more sensibly: interfering entries that have no data are removed, which improves model accuracy. Feature extraction is also simpler and more effective: only atomic features are computed and no simulation of the molecular structure is needed. The molecular structure is converted into an adjacency matrix and features are extracted by graph convolution, which greatly reduces time cost.
2. Step 3 builds a more reasonable model: the five graph convolution layers extract the structural features of the molecule more efficiently, and the LSTM layer filters those features to obtain better ones.
3. Steps 4 to 7 implement the whole training process and optimize the model. Training for 2000 iterations with a batch size of 128 ensures that all of the training data is traversed while the model is optimized further, yielding a relatively low loss value.
4. The method of this patent combines graph convolution and LSTM, which greatly reduces feature extraction time. It extracts reasonable, appropriate features for the atoms in a molecule without spending time on the more detailed molecular feature data computed by traditional computational chemistry methods, and it obtains more reasonable features that traditional methods cannot, so it achieves better drug activity prediction accuracy with little data.
Description of the drawings
Figure 1 is the overall flowchart;
Figure 2 shows the adjacency matrix of the ethane (C2H6) molecule;
Figure 3 is the LSTM flowchart.
Detailed description
Figure 1 is the overall flowchart of this patent.
The specific technical solution of this patent is as follows:
Step 1: build the dataset:
1.1 Shuffle and split the drug activity dataset into an 80% training set, a 10% development set, and a 10% test set, and keep the development set and test set fixed for comparison.
1.2 Label molecules in the dataset that affect the receptor as 1 (positive samples) and molecules that have no effect as 0 (negative samples), and remove null entries that have no data; removing this interfering data can significantly improve accuracy.
1.3 The split ensures that the training, development, and test sets follow the same distribution.
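One way to obtain an 80/10/10 split whose three parts share the same label distribution is to shuffle and split each class separately. The patent does not say how the consistent distribution is achieved, so stratifying by label is an assumption here, as are the fixed seed and the name `stratified_split`.

```python
import random
from typing import List, Tuple


def stratified_split(
    labels: List[int], seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, dev, test) index lists in roughly 80/10/10 proportions."""
    rng = random.Random(seed)          # fixed seed keeps dev/test sets stable
    train, dev, test = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = len(idx) * 8 // 10   # integer arithmetic avoids float rounding
        n_dev = len(idx) // 10
        train += idx[:n_train]
        dev += idx[n_train:n_train + n_dev]
        test += idx[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test
```

Reusing the same seed reproduces the same development and test sets on every run, which matches the requirement that they stay fixed for comparison.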
第二步、对训练集的分子提取原子特征,同时将训练集的分子结构转化为邻接矩阵:The second step is to extract the atomic features of the molecules in the training set, and at the same time convert the molecular structure of the training set into an adjacency matrix:
2.1对分子数据提取统计特征:['C','N','O','S','F','Si', 'P','Cl','Br','Mg','Na','Ca','Fe','As','Al','I','B', 'V','K','Tl','Yb','Sb','Sn','Ag','Pd','Co','Se','Ti', 'Zn','H','Li','Ge','Cu','Au','Ni','Cd','In','Mn', 'Zr','Cr','Pt','Hg','Pb','=','+','-','(',')','/', '\','[',']','@','#','Unknown']。以上特征包含常见元素以及代表特殊价键,括号,特殊分子,离子等的符号,忽略数字,小数点。得到一个包含分子中所有统计特征的字典,字典值为该分子或字符出现次数;2.1 Extract statistical features from molecular data: ['C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg',' Na', 'Ca', 'Fe', 'As', 'Al', 'I', 'B', 'V', 'K', 'Tl', 'Yb', 'Sb', 'Sn' , 'Ag', 'Pd', 'Co', 'Se', 'Ti', 'Zn', 'H', 'Li', 'Ge', 'Cu', 'Au', 'Ni',' Cd', 'In', 'Mn', 'Zr', 'Cr', 'Pt', 'Hg', 'Pb', '=', '+', '-', '(', ')' , '/', '\', '[', ']', '@', '#', 'Unknown']. The above features contain common elements as well as symbols representing special valence bonds, parentheses, special molecules, ions, etc. Numbers, decimal points are ignored. Get a dictionary containing all the statistical features in the molecule, and the dictionary value is the number of occurrences of the molecule or character;
2.2 Extract the degree of each atom in the molecule, in the range 0 to 10; the degree of an atom is defined as the number of atoms directly bonded to it;
2.3 Extract the number of implicit high spins in the molecule, in the range 0 to 6. The angular momentum of an atomic nucleus is called its nuclear spin, which is an important quantum-mechanical property of the nucleus.
2.4 Extract the formal charge of each atom in the molecule. Formal charge is introduced when writing the Lewis structure of a covalent compound, in order to judge the stability of each candidate species.
2.5 Extract the number of free electrons of each atom in the molecule. Free electrons are electrons that are not bound within any single atom; their number affects properties such as the electrical and thermal conductivity of a substance.
2.6 Extract whether the molecule is an aromatic compound. Aromatic compounds are compounds with a benzene-ring structure; they are structurally stable, not easily decomposed, and highly toxic.
2.7 Represent every molecule as a graph by treating atoms as nodes and bonds as edges of an undirected graph, which yields the molecular topology in the form of an adjacency matrix. The rows and columns of the adjacency matrix are labeled by the atoms of the molecule, and an entry is 1 when the two corresponding atoms are joined by a chemical bond and 0 otherwise. Figure 2 shows the adjacency matrix of the ethane (C2H6) molecule.
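As an illustration of steps 2.2 and 2.7, the sketch below builds the adjacency matrix of ethane and reads the atom degrees off its rows. The atom ordering (the two carbons first, then the six hydrogens) and the helper name `adjacency_matrix` are assumptions and need not match the layout of Figure 2.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 matrix from a list of bonded atom index pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1      # undirected graph, so the matrix is symmetric
    return matrix

# Ethane, C2H6: one C-C bond and three C-H bonds on each carbon.
atoms = ['C', 'C', 'H', 'H', 'H', 'H', 'H', 'H']
bonds = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (1, 6), (1, 7)]

A = adjacency_matrix(len(atoms), bonds)

# The degree of an atom (step 2.2) is the sum of its row.
degrees = [sum(row) for row in A]
```

Each carbon has degree 4 and each hydrogen degree 1, which is the row sum of the matrix.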
Step 3: Build the prediction model (a binary classifier), which comprises five graph-convolution layers and one LSTM layer:
3.1 The input x consists of the atomic features of the molecule and the adjacency matrix derived from its molecular structure;
3.2 For the ground-truth output y, [1,0] denotes 0 and [0,1] denotes 1. The result of each training or test run is an array [a,b], where a and b are two probabilities with a+b=1;
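The label encoding of step 3.2 can be sketched as below. The text only states that the two outputs are probabilities summing to 1; obtaining them with a softmax is an assumption, as are the function names and the example scores.

```python
import math

def one_hot(label):
    """Encode class 0 as [1, 0] and class 1 as [0, 1], as in step 3.2."""
    return [1, 0] if label == 0 else [0, 1]

def softmax(scores):
    """Turn two raw scores into probabilities [a, b] with a + b = 1."""
    m = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

a, b = softmax([0.3, 1.2])                   # hypothetical raw model scores
predicted = 0 if a >= b else 1               # pick the more probable class
```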
3.3 The prediction model uses a five-layer graph convolutional neural network. A graph convolutional neural network has two basic characteristics: first, every node has its own feature information; second, every node in the graph also has structural information. The formula below gives the graph convolution, with v denoting its central node:
u: a neighbor node of the central node v; $h_{conv}(v)$: the graph-convolution feature value of the central node v; M: the set of all nodes in the graph convolutional neural network;
Feature parameters: each is preset to an initial value of 1 and is updated continuously during training;
σ: pooling function;
Equation (1) transforms the feature of one edge of node v into $h_{conv}(v)$ and then sums $h_{conv}(v)$ over all neighbor nodes u, which gives the graph convolution of node v;
$$h_{conv}(G) = [\,h_{conv}(v_1),\ h_{conv}(v_2),\ h_{conv}(v_3),\ \ldots\,] \tag{2}$$
$h_{conv}(G)$ denotes the set of $h_{conv}(v)$ values of the molecule currently being computed, and G denotes that molecule;
Finally, the set of graph convolutions of all nodes v in the molecule is obtained; this is the set of molecular structural features.
3.4 LSTM (long short-term memory network):
What mainly distinguishes an LSTM from an RNN is that it adds to the algorithm a "processor" that judges whether information is useful (the middle module in Figure 3).
The repeating module of an LSTM contains four interacting activation functions (three sigmoid, one tanh). In the figure, each line carries a complete vector from the output of one node to the inputs of other nodes. As shown in Figure 3, circles represent pointwise operations such as vector addition, and rectangular boxes represent gate activation functions. Lines that merge denote concatenation, and lines that fork denote content being copied and sent to different places.
The structures in the memory cell that manage removing information from, or adding information to, the cell are called gates. There are three: the forget gate, the input gate and the output gate. A gate consists of a sigmoid activation function and a pointwise multiplication. The hidden state of the previous time step is sent to the forget gate (input node), to the input gate and to the output gate. In the forward pass, the input gate learns when to let activations into the memory cell, and the output gate learns when to let activations out of it. Correspondingly, in the backward pass, the output gate learns when to let error flow into the memory cell, and the input gate learns when to let it flow out.
Using the input x_t and the output h_{t-1} of step t-1, the forgetting rate is computed; it decides whether a feature is to be forgotten, where 0 means forget completely and 1 means remember everything.
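For reference, the textbook LSTM forget gate has the form below. Whether the patent uses exactly this form is an assumption; $W_f$ and $b_f$ are the conventional names for the gate's weight matrix and bias, and $[h_{t-1}, x_t]$ denotes the concatenation of the previous output and the current input.

```latex
f_t = \operatorname{sigmoid}\!\left( W_f \, [h_{t-1}, x_t] + b_f \right),
\qquad
\operatorname{sigmoid}(z) = \frac{1}{1 + e^{-z}}
```

Each component of $f_t$ lies between 0 and 1 and multiplies the corresponding component of the previous cell state, so values near 0 discard a feature and values near 1 keep it.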
Step 4: Train on the data obtained in steps 2) and 3).
Step 5: After convolution, pooling and the fully connected stage, the output is passed to the classifier, the loss function is optimized, and training continues. The procedure is as follows:
5.1 Graph convolution process:
for all nodes v in graph
    set k = deg(v)
    for u in neigh(v) ∪ {v}
        set d = dist(v, u)
        transform features u′ = W_{k,d} u + b_{k,d}
    sum all u′ and apply nonlinearity
    return new features for v
That is: 5.1.1 Traverse all nodes in the molecular structure graph;
5.1.3 Set the central node of the graph convolution to v;
5.1.4 Traverse all neighbors u of the central node v and build the relation dictionary d;
5.1.5 Transform the features of node u into u′:
$u' = W_{k,d}\,u + b_{k,d}$ (as explained for equation (1))
5.1.6 Sum all the u′;
5.1.7 Return the features of node v;
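The procedure in 5.1 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the nonlinearity is taken to be ReLU (the text only says "nonlinearity"), dist(v, u) is taken as 0 for the node itself and 1 for a bonded neighbor, and the names `graph_conv`, `weights` and `biases` are illustrative. The parameters are initialized to 1 with zero bias, following the statement that the parameters are preset to 1.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(x):
    """Assumed nonlinearity; the text does not name one."""
    return [max(0.0, xi) for xi in x]

def graph_conv(features, adjacency, weights, biases):
    """One graph-convolution layer.

    weights[(k, d)] and biases[(k, d)] hold the parameters for a central
    node of degree k and a node at distance d (0 = itself, 1 = neighbor).
    They must cover every degree that occurs in the graph.
    """
    new_features = []
    for v, row in enumerate(adjacency):
        neighbours = [u for u, bonded in enumerate(row) if bonded]
        k = len(neighbours)                          # k = deg(v)
        total = [0.0] * len(biases[(k, 0)])
        for u in neighbours + [v]:                   # neigh(v) ∪ {v}
            d = 0 if u == v else 1                   # d = dist(v, u)
            W, b = weights[(k, d)], biases[(k, d)]
            u_prime = [t + bi for t, bi in zip(matvec(W, features[u]), b)]
            total = [s + t for s, t in zip(total, u_prime)]   # sum all u'
        new_features.append(relu(total))
    return new_features

# Toy three-atom chain with one feature per atom.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0], [2.0], [3.0]]
weights = {(k, d): [[1.0]] for k in (1, 2) for d in (0, 1)}
biases = {(k, d): [0.0] for k in (1, 2) for d in (0, 1)}

h_conv = graph_conv(features, adjacency, weights, biases)
```

With all weights equal to 1, each node's new feature is simply the sum of its own feature and its neighbors' features.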
5.2 Pooling process:
max over u in neigh(v) ∪ {v}
return new features for v
5.2.1 Max-pool over the neighbor nodes u′;
5.2.2 Return the new features of node v.
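The pooling step in 5.2 takes, for every node, the element-wise maximum over the node and its neighbors. A minimal sketch follows; the function name `graph_pool` and the toy feature values are assumptions.

```python
def graph_pool(features, adjacency):
    """Element-wise max over each node and its bonded neighbors."""
    pooled = []
    for v, row in enumerate(adjacency):
        # Feature vectors of neigh(v) ∪ {v}.
        group = [features[u] for u, bonded in enumerate(row) if bonded]
        group.append(features[v])
        # zip(*group) iterates over feature dimensions.
        pooled.append([max(column) for column in zip(*group)])
    return pooled

# Toy three-atom chain with two features per atom.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[3.0, 0.5], [6.0, 0.1], [5.0, 0.9]]

pooled = graph_pool(features, adjacency)
```

Pooling leaves the number of nodes and the feature width unchanged; only the values are replaced by neighborhood maxima.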
5.3 Fully connected process:
5.3.1 Use the LSTM to judge whether each feature is useful, thereby selecting the meaningful features.
5.3.2 Concatenate all the selected features and pass the output to the classifier.
Step 6: Obtain the trained model after many iterations:
6.1 In each iteration, a batch of 128 samples is drawn at random from the training set and fed into the model for training; once the training output is obtained, the loss function is optimized by gradient descent.
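The training loop of step 6.1 can be illustrated with a small stand-in model. The sketch below keeps the batch size of 128 from the text, but everything else is an assumption made for illustration: a two-parameter logistic model replaces the graph-convolution and LSTM network, the data are synthetic, and the learning rate, iteration count and cross-entropy loss are arbitrary choices.

```python
import math
import random

BATCH_SIZE = 128  # batch size stated in step 6.1

def predict(weights, x):
    """Probability of class 1 under a logistic model (stand-in model)."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(weights, data):
    """Mean cross-entropy loss over a dataset."""
    total = 0.0
    for x, y in data:
        p = predict(weights, x)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(data)

def sgd_step(weights, batch, learning_rate=0.1):
    """One gradient-descent update computed on a mini-batch."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        error = predict(weights, x) - y
        for j, xi in enumerate(x):
            grad[j] += error * xi / len(batch)
    return [w - learning_rate * g for w, g in zip(weights, grad)]

# Synthetic training set: label is 1 when the single feature is positive.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
train_set = [([1.0, v], 1 if v > 0 else 0) for v in values]

weights = [0.0, 0.0]
initial_loss = mean_loss(weights, train_set)
for _ in range(200):
    batch = rng.sample(train_set, BATCH_SIZE)   # random batch of 128 samples
    weights = sgd_step(weights, batch)
final_loss = mean_loss(weights, train_set)
```

The loss on the training set falls from its starting value as the updates accumulate, which is the behavior step 6.1 relies on.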
Step 7: Apply the same feature processing to the development set and the test set, and feed them into the model to obtain the test results.
Step 8: Experimental results and discussion.
8.1 The dataset used in this patent is the Tox21 dataset (Tox21 Data Challenge) [https://tripod.nih.gov/tox21/challenge/]. The 2014 Tox21 Data Challenge was designed to help scientists understand the potential of chemicals and compounds to damage biological organs. The Tox21 dataset was compiled by scientists through toxicological analysis, which indicates that these chemicals and compounds may have toxic effects on organisms;
8.2 The Tox21 dataset contains data on 8013 compounds that may affect 12 human receptors (NR-AR, NR-AR-LBD, NR-AhR, NR-Aromatase, NR-ER, NR-ER-LBD, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, SR-p53), with 8000 records for each receptor;
8.3 Experiment 1 of this patent builds a separate model for each of the 12 receptor datasets, giving 12 prediction results:
Table 1. Experimental results on the Tox21 dataset
8.4 Experiment 2 of this patent compares the performance of the present model with that of traditional machine-learning models. The results of Experiment 1 are compared with five methods: logistic regression, support vector machine and three naive Bayes variants (BernoulliNB, GaussianNB and MultinomialNB):
Table 2. Comparison with traditional methods
In the experiments, all of the machine-learning methods being compared use the same data processing as the model of this patent, which ensures that the comparison is valid; the figures in the table are test-set results. The data in Table 2 show that, on the Tox21 dataset, this patent outperforms the traditional machine-learning methods across the board on 5 of the 12 receptor datasets. Compared with traditional machine-learning methods, this patent obtains better and more stable results, which demonstrates that its innovation in the model is effective.
The above example is merely one specific embodiment of the present invention; simple variations, substitutions and the like also fall within the scope of protection of the invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810742486.4A CN109033738B (en) | 2018-07-09 | 2018-07-09 | Deep learning-based drug activity prediction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810742486.4A CN109033738B (en) | 2018-07-09 | 2018-07-09 | Deep learning-based drug activity prediction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109033738A true CN109033738A (en) | 2018-12-18 |
CN109033738B CN109033738B (en) | 2022-01-11 |
Family
ID=64641565
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810742486.4A Active CN109033738B (en) | 2018-07-09 | 2018-07-09 | Deep learning-based drug activity prediction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033738B (en) |
Cited By (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110210330A (en) * | 2019-05-13 | 2019-09-06 | 清华大学 | Electromagnetic signal recognition methods and device based on Tacit Knowledge structure figures convolutional network |
CN110277173A (en) * | 2019-05-21 | 2019-09-24 | 湖南大学 | BiGRU Drug Toxicity Prediction System and Prediction Method Based on Smi2Vec |
CN110322972A (en) * | 2019-05-29 | 2019-10-11 | 平安科技(深圳)有限公司 | Intelligent drug toxicity judgment method, device and computer readable storage medium |
CN110517790A (en) * | 2019-06-24 | 2019-11-29 | 江苏大学 | Compound hepatotoxicity wind agitation method for early prediction based on deep learning and gene expression data |
CN110600085A (en) * | 2019-06-01 | 2019-12-20 | 重庆大学 | Organic matter physicochemical property prediction method based on Tree-LSTM |
CN110689919A (en) * | 2019-08-13 | 2020-01-14 | 复旦大学 | Pharmaceutical protein binding rate prediction method and system based on structure and grade classification |
CN110797093A (en) * | 2019-11-20 | 2020-02-14 | 中国石油大学(北京) | Gas hydrate 512Cage identification method and system |
CN110867254A (en) * | 2019-11-18 | 2020-03-06 | 北京市商汤科技开发有限公司 | Prediction method and device, electronic device and storage medium |
CN110957012A (en) * | 2019-11-28 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for analyzing properties of compound |
CN110970098A (en) * | 2019-11-26 | 2020-04-07 | 重庆大学 | A kind of functional polypeptide bitterness prediction method |
CN111062543A (en) * | 2019-12-30 | 2020-04-24 | 集美大学 | Method for predicting hydrogen release temperature of metal borohydride |
CN111199779A (en) * | 2019-12-26 | 2020-05-26 | 中科曙光国际信息产业有限公司 | Virtual drug screening method and device based on molecular docking |
CN111243682A (en) * | 2020-01-10 | 2020-06-05 | 京东方科技集团股份有限公司 | Method, device, medium and apparatus for predicting toxicity of drug |
CN111370073A (en) * | 2020-02-27 | 2020-07-03 | 福州大学 | A deep learning-based prediction method for drug interaction rules |
CN111402948A (en) * | 2020-04-02 | 2020-07-10 | 江苏食品药品职业技术学院 | Pharmacokinetic prediction model based on artificial intelligence and animal experimental datasets |
CN111445020A (en) * | 2019-01-16 | 2020-07-24 | 阿里巴巴集团控股有限公司 | Graph-based convolutional network training method, device and system |
CN111540419A (en) * | 2020-04-28 | 2020-08-14 | 上海交通大学 | Anti-senile dementia drug effectiveness prediction system based on deep learning |
CN111626119A (en) * | 2020-04-23 | 2020-09-04 | 北京百度网讯科技有限公司 | Target recognition model training method, device, equipment and storage medium |
CN111710376A (en) * | 2020-05-13 | 2020-09-25 | 中国科学院计算机网络信息中心 | Load balancing method and system for block computing in macromolecular and cluster systems |
CN111755078A (en) * | 2020-07-30 | 2020-10-09 | 腾讯科技(深圳)有限公司 | Drug molecule attribute determination method, device and storage medium |
CN111798935A (en) * | 2019-04-09 | 2020-10-20 | 南京药石科技股份有限公司 | A neural network-based prediction method for universal compound structure-property correlation |
CN111816252A (en) * | 2020-07-21 | 2020-10-23 | 腾讯科技(深圳)有限公司 | Drug screening method and device and electronic equipment |
CN111916143A (en) * | 2020-07-27 | 2020-11-10 | 西安电子科技大学 | Molecular activity prediction method based on fusion of multiple substructure features |
CN111933225A (en) * | 2020-09-27 | 2020-11-13 | 平安科技(深圳)有限公司 | Drug classification method and device, terminal equipment and storage medium |
WO2020230043A1 (en) * | 2019-05-15 | 2020-11-19 | International Business Machines Corporation | Feature vector feasibilty estimation |
CN112102889A (en) * | 2020-10-14 | 2020-12-18 | 深圳晶泰科技有限公司 | Free energy perturbation network design method based on machine learning |
CN112309509A (en) * | 2019-10-15 | 2021-02-02 | 腾讯科技(深圳)有限公司 | Compound property prediction method, device, computer device and readable storage medium |
CN112635080A (en) * | 2021-01-15 | 2021-04-09 | 复星领智(上海)医药科技有限公司 | Deep learning-based drug prediction method and device |
CN112885415A (en) * | 2021-01-22 | 2021-06-01 | 中国科学院生态环境研究中心 | Molecular surface point cloud-based estrogen activity rapid screening method |
CN112955962A (en) * | 2019-10-11 | 2021-06-11 | 迈立塔股份有限公司 | New drug candidate substance derivation method and device |
CN113053457A (en) * | 2021-03-25 | 2021-06-29 | 湖南大学 | Drug target prediction method based on multi-pass graph convolution neural network |
CN113140260A (en) * | 2020-01-20 | 2021-07-20 | 腾讯科技(深圳)有限公司 | Method and device for predicting reactant molecular composition data of composition |
CN113474841A (en) * | 2019-02-22 | 2021-10-01 | 3M创新有限公司 | Machine learning quantification of target organisms using nucleic acid amplification assays |
CN113544786A (en) * | 2019-02-08 | 2021-10-22 | 谷歌有限责任公司 | Systems and methods for predicting olfactory properties of molecules using machine learning |
CN113628696A (en) * | 2021-07-19 | 2021-11-09 | 武汉大学 | Drug connection graph score prediction method and device based on double-graph convolution fusion model |
CN113673610A (en) * | 2021-08-25 | 2021-11-19 | 上海鹏冠生物医药科技有限公司 | Image preprocessing method for tissue cell pathological image diagnosis system |
CN114255878A (en) * | 2021-12-07 | 2022-03-29 | 广东省人民医院 | A training method, system, device and storage medium for a disease classification model |
CN114429801A (en) * | 2022-01-26 | 2022-05-03 | 北京百度网讯科技有限公司 | Data processing method, training method, identification method, apparatus, equipment and medium |
CN115171807A (en) * | 2022-09-07 | 2022-10-11 | 合肥机数量子科技有限公司 | Molecular coding model training method, molecular coding method and molecular coding system |
WO2023029352A1 (en) * | 2021-08-30 | 2023-03-09 | 平安科技(深圳)有限公司 | Drug small molecule property prediction method and apparatus based on graph neural network, and device |
WO2023115338A1 (en) * | 2021-12-21 | 2023-06-29 | 深圳晶泰科技有限公司 | Improvement processing method and apparatus for drug pressing parameters, device, and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5526281A (en) * | 1993-05-21 | 1996-06-11 | Arris Pharmaceutical Corporation | Machine-learning approach to modeling biological activity for molecular design and to modeling other characteristics |
CN101587510A (en) * | 2008-05-23 | 2009-11-25 | 中国科学院上海药物研究所 | Carcinogenic Toxicity Prediction Method of Compounds Based on Complex Sampling and Improved Decision Forest Algorithm |
CN102592040A (en) * | 2002-07-24 | 2012-07-18 | 基德姆生物科学有限公司 | Drug discovery method |
CN106874688A (en) * | 2017-03-01 | 2017-06-20 | 中国药科大学 | Intelligent lead compound based on convolutional neural networks finds method |
Non-Patent Citations (2)
Title |
---|
HAN ALTAE-TRAN ET AL.: "Low Data Drug Discovery with One-Shot Learning", 《ACS CENTRAL SCIENCE》 * |
黄丽霞 等: "《信息检索教程》", 31 July 2014, 知识产权出版社 * |
Cited By (63)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111445020A (en) * | 2019-01-16 | 2020-07-24 | 阿里巴巴集团控股有限公司 | Graph-based convolutional network training method, device and system |
CN111445020B (en) * | 2019-01-16 | 2023-05-23 | 阿里巴巴集团控股有限公司 | Graph-based convolutional network training method, device and system |
CN113544786A (en) * | 2019-02-08 | 2021-10-22 | 谷歌有限责任公司 | Systems and methods for predicting olfactory properties of molecules using machine learning |
CN113474841A (en) * | 2019-02-22 | 2021-10-01 | 3M创新有限公司 | Machine learning quantification of target organisms using nucleic acid amplification assays |
CN111798935A (en) * | 2019-04-09 | 2020-10-20 | 南京药石科技股份有限公司 | A neural network-based prediction method for universal compound structure-property correlation |
CN110210330A (en) * | 2019-05-13 | 2019-09-06 | 清华大学 | Electromagnetic signal recognition methods and device based on Tacit Knowledge structure figures convolutional network |
WO2020230043A1 (en) * | 2019-05-15 | 2020-11-19 | International Business Machines Corporation | Feature vector feasibilty estimation |
GB2599520A (en) * | 2019-05-15 | 2022-04-06 | Ibm | Feature vector feasibilty estimation |
CN113795889A (en) * | 2019-05-15 | 2021-12-14 | 国际商业机器公司 | Feature vector feasibility estimation |
US11798655B2 (en) | 2019-05-15 | 2023-10-24 | International Business Machines Corporation | Feature vector feasibility estimation |
CN110277173A (en) * | 2019-05-21 | 2019-09-24 | 湖南大学 | BiGRU Drug Toxicity Prediction System and Prediction Method Based on Smi2Vec |
CN110322972B (en) * | 2019-05-29 | 2022-05-20 | 平安科技(深圳)有限公司 | Intelligent drug toxicity judgment method and device and computer readable storage medium |
CN110322972A (en) * | 2019-05-29 | 2019-10-11 | 平安科技(深圳)有限公司 | Intelligent drug toxicity judgment method, device and computer readable storage medium |
CN110600085B (en) * | 2019-06-01 | 2024-04-09 | 重庆大学 | Tree-LSTM-based organic matter physicochemical property prediction method |
CN110600085A (en) * | 2019-06-01 | 2019-12-20 | 重庆大学 | Organic matter physicochemical property prediction method based on Tree-LSTM |
CN110517790A (en) * | 2019-06-24 | 2019-11-29 | 江苏大学 | Compound hepatotoxicity wind agitation method for early prediction based on deep learning and gene expression data |
CN110689919A (en) * | 2019-08-13 | 2020-01-14 | 复旦大学 | Pharmaceutical protein binding rate prediction method and system based on structure and grade classification |
CN112955962A (en) * | 2019-10-11 | 2021-06-11 | 迈立塔股份有限公司 | New drug candidate substance derivation method and device |
CN112309509A (en) * | 2019-10-15 | 2021-02-02 | 腾讯科技(深圳)有限公司 | Compound property prediction method, device, computer device and readable storage medium |
CN112309509B (en) * | 2019-10-15 | 2021-05-28 | 腾讯科技(深圳)有限公司 | Compound property prediction method, device, computer device and readable storage medium |
CN110867254A (en) * | 2019-11-18 | 2020-03-06 | 北京市商汤科技开发有限公司 | Prediction method and device, electronic device and storage medium |
JP2022518283A (en) * | 2019-11-18 | 2022-03-14 | ベイジン センスタイム テクノロジー デベロップメント カンパニー, リミテッド | Prediction methods and devices, electronic devices and storage media |
TWI771803B (en) * | 2019-11-18 | 2022-07-21 | 大陸商北京市商湯科技開發有限公司 | Prediction method, electronic device and storage medium thereof |
WO2021098256A1 (en) * | 2019-11-18 | 2021-05-27 | 北京市商汤科技开发有限公司 | Prediction method and apparatus, electronic device, and storage medium |
CN110797093A (en) * | 2019-11-20 | 2020-02-14 | 中国石油大学(北京) | Gas hydrate 512Cage identification method and system |
CN110970098A (en) * | 2019-11-26 | 2020-04-07 | 重庆大学 | A kind of functional polypeptide bitterness prediction method |
CN110957012A (en) * | 2019-11-28 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for analyzing properties of compound |
WO2021103761A1 (en) * | 2019-11-28 | 2021-06-03 | 腾讯科技(深圳)有限公司 | Compound property analysis method and apparatus, compound property analysis model training method, and storage medium |
CN111199779A (en) * | 2019-12-26 | 2020-05-26 | 中科曙光国际信息产业有限公司 | Virtual drug screening method and device based on molecular docking |
CN111062543B (en) * | 2019-12-30 | 2022-04-29 | 集美大学 | A method for predicting the dehydrogenation temperature of metal borohydrides |
CN111062543A (en) * | 2019-12-30 | 2020-04-24 | 集美大学 | Method for predicting hydrogen release temperature of metal borohydride |
CN111243682A (en) * | 2020-01-10 | 2020-06-05 | 京东方科技集团股份有限公司 | Method, device, medium and apparatus for predicting toxicity of drug |
CN113140260B (en) * | 2020-01-20 | 2023-09-08 | 腾讯科技(深圳)有限公司 | Method and device for predicting reactant molecular composition data of composition |
CN113140260A (en) * | 2020-01-20 | 2021-07-20 | 腾讯科技(深圳)有限公司 | Method and device for predicting reactant molecular composition data of composition |
CN111370073A (en) * | 2020-02-27 | 2020-07-03 | 福州大学 | A deep learning-based prediction method for drug interaction rules |
CN111370073B (en) * | 2020-02-27 | 2023-04-07 | 福州大学 | Medicine interaction rule prediction method based on deep learning |
CN111402948A (en) * | 2020-04-02 | 2020-07-10 | 江苏食品药品职业技术学院 | Pharmacokinetic prediction model based on artificial intelligence and animal experimental datasets |
CN111626119A (en) * | 2020-04-23 | 2020-09-04 | 北京百度网讯科技有限公司 | Target recognition model training method, device, equipment and storage medium |
CN111626119B (en) * | 2020-04-23 | 2023-09-01 | 北京百度网讯科技有限公司 | Target recognition model training method, device, equipment and storage medium |
CN111540419A (en) * | 2020-04-28 | 2020-08-14 | 上海交通大学 | Anti-senile dementia drug effectiveness prediction system based on deep learning |
CN111710376B (en) * | 2020-05-13 | 2023-04-07 | 中国科学院计算机网络信息中心 | Block calculation load balancing method and system for macromolecules and cluster systems |
CN111710376A (en) * | 2020-05-13 | 2020-09-25 | 中国科学院计算机网络信息中心 | Load balancing method and system for block computing in macromolecular and cluster systems |
CN111816252A (en) * | 2020-07-21 | 2020-10-23 | 腾讯科技(深圳)有限公司 | Drug screening method and device and electronic equipment |
CN111916143B (en) * | 2020-07-27 | 2023-07-28 | 西安电子科技大学 | Molecular activity prediction method based on multi-substructural feature fusion |
CN111916143A (en) * | 2020-07-27 | 2020-11-10 | 西安电子科技大学 | Molecular activity prediction method based on fusion of multiple substructure features |
CN111755078A (en) * | 2020-07-30 | 2020-10-09 | 腾讯科技(深圳)有限公司 | Drug molecule attribute determination method, device and storage medium |
CN111755078B (en) * | 2020-07-30 | 2022-09-23 | 腾讯科技(深圳)有限公司 | Drug molecule attribute determination method, device and storage medium |
CN111933225B (en) * | 2020-09-27 | 2021-01-05 | 平安科技(深圳)有限公司 | Drug classification method and device, terminal equipment and storage medium |
CN111933225A (en) * | 2020-09-27 | 2020-11-13 | 平安科技(深圳)有限公司 | Drug classification method and device, terminal equipment and storage medium |
CN112102889A (en) * | 2020-10-14 | 2020-12-18 | 深圳晶泰科技有限公司 | Free energy perturbation network design method based on machine learning |
CN112635080A (en) * | 2021-01-15 | 2021-04-09 | 复星领智(上海)医药科技有限公司 | Deep learning-based drug prediction method and device |
CN112885415B (en) * | 2021-01-22 | 2024-02-06 | 中国科学院生态环境研究中心 | Quick screening method for estrogen activity based on molecular surface point cloud |
CN112885415A (en) * | 2021-01-22 | 2021-06-01 | 中国科学院生态环境研究中心 | Molecular surface point cloud-based estrogen activity rapid screening method |
CN113053457A (en) * | 2021-03-25 | 2021-06-29 | 湖南大学 | Drug target prediction method based on multi-pass graph convolution neural network |
CN113628696B (en) * | 2021-07-19 | 2023-10-31 | 武汉大学 | Medicine connection graph score prediction method and device based on double-graph convolution fusion model |
CN113628696A (en) * | 2021-07-19 | 2021-11-09 | 武汉大学 | Drug connection graph score prediction method and device based on double-graph convolution fusion model |
CN113673610A (en) * | 2021-08-25 | 2021-11-19 | 上海鹏冠生物医药科技有限公司 | Image preprocessing method for tissue cell pathological image diagnosis system |
WO2023029352A1 (en) * | 2021-08-30 | 2023-03-09 | 平安科技(深圳)有限公司 | Drug small molecule property prediction method and apparatus based on graph neural network, and device |
CN114255878A (en) * | 2021-12-07 | 2022-03-29 | 广东省人民医院 | A training method, system, device and storage medium for a disease classification model |
WO2023115338A1 (en) * | 2021-12-21 | 2023-06-29 | 深圳晶泰科技有限公司 | Improvement processing method and apparatus for drug pressing parameters, device, and storage medium |
CN114429801A (en) * | 2022-01-26 | 2022-05-03 | 北京百度网讯科技有限公司 | Data processing method, training method, identification method, apparatus, equipment and medium |
CN115171807B (en) * | 2022-09-07 | 2022-12-06 | 合肥机数量子科技有限公司 | Molecular coding model training method, molecular coding method and molecular coding system |
CN115171807A (en) * | 2022-09-07 | 2022-10-11 | 合肥机数量子科技有限公司 | Molecular coding model training method, molecular coding method and molecular coding system |
Also Published As
Publication number | Publication date |
---|---|
CN109033738B (en) | 2022-01-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109033738A (en) | A kind of pharmaceutical activity prediction technique based on deep learning | |
He et al. | A generalization of vit/mlp-mixer to graphs | |
Dai et al. | Generative modeling of convolutional neural networks | |
Djenouri et al. | Fast and accurate deep learning framework for secure fault diagnosis in the industrial internet of things | |
CN107862179A (en) | A kind of miRNA disease association Relationship Prediction methods decomposed based on similitude and logic matrix | |
Shi et al. | Protein complex detection with semi-supervised learning in protein interaction networks | |
CN106874688A (en) | Intelligent lead compound based on convolutional neural networks finds method | |
US9043326B2 (en) | Methods and systems for biclustering algorithm | |
CN114091603A (en) | A spatial transcriptome cell clustering and analysis method | |
Sarwar et al. | A survey of big data analytics in healthcare | |
CN116798652A (en) | Anticancer drug response prediction method based on multitasking learning | |
CN102722578B (en) | Unsupervised cluster characteristic selection method based on Laplace regularization | |
CN113990401A (en) | Method and apparatus for designing drug molecules of intrinsically disordered proteins | |
CN110782948A (en) | Predicting potential associations of miRNAs with diseases based on constrained probability matrix factorization | |
Meirom et al. | Optimizing tensor network contraction using reinforcement learning | |
CN115019878B (en) | A drug discovery method based on graph representation and deep learning | |
Dobos et al. | A comparative study of anomaly detection methods for gross error detection problems | |
CN114242168A (en) | Method for identifying biologically essential protein | |
CN113272646B (en) | Correlating complex data | |
Zhu et al. | Structural landmarking and interaction modelling: a “slim” network for graph classification | |
Pearson et al. | Multi-round random subspace feature selection for incomplete gene expression data | |
CN114360637B (en) | Protein-ligand affinity evaluation method based on graph attention network | |
CN113780334B (en) | High Dimensional Data Classification Method Based on Two-Stage Hybrid Feature Selection | |
CN115394354A (en) | Drug target prediction method based on graph convolution neural network | |
Jui et al. | Evolution of graph embedding trends: A review with potential future directions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||