CN109033738A - A drug activity prediction method based on deep learning

Info

Publication number: CN109033738A
Application number: CN201810742486.4A
Authority: CN
Inventors: 全哲; 范益世; 王凡; 乐雨泉; 林轩; 刘彦
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2018-07-09
Filing date: 2018-07-09
Publication date: 2018-12-18
Anticipated expiration: 2038-07-09
Also published as: CN109033738B

Abstract

The invention discloses a drug activity prediction method based on deep learning. The method uses the open-source RDKit library to compute basic features of each atom in a given molecule, including atom type, valence, and formal charge; computing only atomic features greatly reduces the time cost. The prediction model combines graph convolution with an LSTM (long short-term memory network). For the graph convolution model, every molecule is represented as a graph by treating atoms as nodes and bonds as edges of an undirected graph, and molecular structure features are extracted from that graph; the graph convolutional neural network reduces the time cost while capturing features that traditional methods cannot obtain. The LSTM learns a complex metric by exchanging information between evidence molecules and query molecules, which yields higher prediction accuracy when little data is available.

Description

A drug activity prediction method based on deep learning

Technical field

The invention relates to a method for predicting drug activity based on deep learning and belongs to the field of software technology.

Background

The main goal of drug research and of the pharmaceutical industry is to discover drug molecules relevant to treating disease, and exploring lead discovery methods is the main route to that goal. When biological research finds that a particular molecule has therapeutic activity, the molecule is often abandoned for reasons such as toxicity, low activity, or low solubility. According to statistics from the Pharmaceutical Research and Manufacturers of America, new drug research and development accounts for 12.8% of sales revenue across the pharmaceutical industry, 75% of which is attributable to failed research and development, and fewer than 5% of the compounds hit in primary screening enter preclinical evaluation. Because computer-based virtual screening is not limited by sample availability, performing virtual screening first and pharmacological testing afterwards significantly shortens the development cycle of new drugs and reduces development costs compared with the traditional strategy of going directly to pharmacological testing. At present, the mainstream direction of lead discovery is the study of quantitative structure-activity relationships (QSAR), which mainly concerns describing molecular structure quantitatively: choosing a method for describing molecular features and choosing the mathematical function that links those features to activity.

Current approaches fall mainly into the following categories:

Methods based on the relationship between the topological structure, side chains, and skeleton of compound molecules and specific sites of toxic action. Wang et al. studied this relationship for about 60,000 toxic compound molecules in the RTECS (Registry of Toxic Effects of Chemical Substances) database (for example skin toxicity, hematological toxicity, and renal toxicity), and compared how often these topological structures occur in the whole database with how often they occur in the toxic chemical library. This method requires a large amount of data and many positive samples, and extracting only toxicity features leads to large errors when judging non-toxic molecules.

1. Predicting the activity of a candidate drug with a support vector machine. Zhang et al. used gene information associated with hereditary diseases to screen the obtained drug targets for target genes associated with those diseases, and obtained a feature attribute for each sample drug, namely the correlation between the drug targets of the sample drug and the target genes associated with the hereditary disease. With the feature attributes of each sample drug as the input vector and the activity of the sample drug as the output, a model is built with the support vector machine method to predict the activity of the candidate drug. The molecular features required by this method are difficult to obtain, a specific dataset is needed, and the method generalizes poorly.

2. Identifying active drug molecules by combining supervised and unsupervised deep learning algorithms. Gao Shuangyin combined several methods, namely the support vector machine (Support Vector Machine), artificial neural network (Artificial Neural Network), semi-supervised support vector machine (Semi-supervised support vector machine), cost-security semi-supervised support vector machine (Cost security semi-supervised support vector machine), stacked autoencoder (StackedAutoEncode), and deep belief network (Deep Belief Network), to study three classes of active drug molecules (PLK1PBD, SMAD3, IL-1B) in depth. Because the structures of active drug molecules are complex, the computational chemistry software MOE was used to compute their 2D and 3D molecular descriptors precisely, and the two classes of algorithms above were then used to identify active drug molecules. This method requires large datasets, and computing molecular features with computational chemistry software is time-consuming.

In summary, each approach to drug activity prediction is limited by its own characteristics. Methods based on big data analysis require large amounts of data and place high demands on the sample distribution; traditional machine learning methods spend a great deal of time on sample collection, classification, and training; and the supervised and unsupervised machine learning algorithms above not only require large amounts of data but also spend a great deal of time computing molecular features with computational chemistry software.

Glossary:

LSTM: long short-term memory network.

Atom degree: the value computed by RDKit for each atom, equal to the number of atoms directly bonded to that atom.

Lewis structural formula: a way of writing a molecule, for example hydrogen cyanide as H-C≡N.

Sigmoid: the sigmoid function is a mathematical function with an S-shaped curve, given by $S(x) = \frac{1}{1 + e^{-x}}$.

It is widely used in logistic regression and artificial neural networks.

Tanh: the hyperbolic tangent function, derived from the basic hyperbolic functions hyperbolic sine and hyperbolic cosine: $\tanh x = \frac{\sinh x}{\cosh x} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$.
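The two glossary functions can be checked numerically with a short sketch. The function names below are illustrative and do not come from the patent; the identity tanh(x) = 2·S(2x) − 1 used in the last line is a standard relation between the two functions.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_exponentials(x: float) -> float:
    """tanh(x) = sinh(x) / cosh(x) = (e^x - e^(-x)) / (e^x + e^(-x))."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


# Compare the hand-written tanh with the library version and with the
# sigmoid-based identity tanh(x) = 2 * S(2x) - 1.
for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, tanh_from_exponentials(x), math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)
```

Sigmoid maps any real input into (0, 1), which is why it is used for gates, while tanh maps into (-1, 1).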

Summary of the invention

The invention overcomes the shortcomings of the prior art and discloses a drug activity prediction method based on deep learning. The method uses the open-source RDKit library to compute basic features of each atom in a given molecule, including atom type, valence, and formal charge; computing only atomic features greatly reduces the time cost. For the graph convolution model, every molecule is represented as a graph by treating atoms as nodes and bonds as edges of an undirected graph, and molecular structure features are extracted from that graph; the graph convolutional neural network reduces the time cost while capturing features that traditional methods cannot obtain. The LSTM learns a complex metric by exchanging information between evidence molecules and query molecules, which yields higher prediction accuracy when little data is available.

To solve the above technical problems, the technical solution adopted by the invention is:

A drug activity prediction method based on deep learning, comprising the following steps:

Step 1. Construct a drug activity dataset and split it, using one part of the data as the training set, one part as the development set, and one part as the test set;

Step 2. Extract atomic features from the molecules of the training set and convert the molecular structures of the training set into adjacency matrices;

Step 3. Build a prediction model comprising five graph convolution layers and one LSTM layer;

Step 4. Train using the data from step 2 and the model from step 3;

Step 5. After graph convolution, pooling, and full connection, pass the output values to the classifier, optimize the loss function, and continue training;

Step 6. Obtain the trained prediction model through iterative computation;

Step 7. Input the drug to be predicted into the prediction model to obtain the prediction result.

2. The drug activity prediction method based on deep learning according to claim 1, characterized in that, in step 7, the development set and the test set are first processed in the same way through steps 2 to 6 and fed into the prediction model to obtain test results.

3. The drug activity prediction method based on deep learning according to claim 1, characterized in that step 1 comprises the following steps:

1.1 Shuffle the drug activity dataset and split it into an 80% training set, a 10% development set, and a 10% test set; the development and test sets are then held fixed for comparison. The split ensures that the training, development, and test data are all evenly distributed across the dataset;

1.2 Label molecules in the dataset that affect the receptor as 1 (positive samples) and molecules that have no effect as 0 (negative samples); remove null entries that have no data, since eliminating such interfering data improves accuracy.
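Steps 1.1 and 1.2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record format `(smiles, activity)`, the use of `None` for missing values, and the per-label (stratified) shuffling used to keep the three subsets evenly distributed are all assumptions made for the example.

```python
import random


def label_and_clean(records):
    """Step 1.2: drop records with no data (None) and map activity to 1 or 0."""
    cleaned = []
    for smiles, activity in records:
        if activity is None:  # assumed encoding of a missing measurement
            continue
        cleaned.append((smiles, 1 if activity else 0))
    return cleaned


def stratified_split(samples, seed=0, train_frac=0.8, dev_frac=0.1):
    """Step 1.1: shuffle and split 80/10/10, separately for each label."""
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for label in (0, 1):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        n_train = int(len(group) * train_frac)
        n_dev = int(len(group) * dev_frac)
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    for part in (train, dev, test):
        rng.shuffle(part)  # mix positives and negatives inside each subset
    return train, dev, test


# Hypothetical records: 80 inactive, 20 active, 5 with no measurement.
RECORDS = ([("C" * (i + 1), 0) for i in range(80)]
           + [("N" * (i + 1), 1) for i in range(20)]
           + [("O" * (i + 1), None) for i in range(5)])
TRAIN, DEV, TEST = stratified_split(label_and_clean(RECORDS))
print(len(TRAIN), len(DEV), len(TEST))
```

Fixing the seed keeps the development and test sets unchanged between runs, which matches the requirement that they be held fixed for comparison.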

4. The drug activity prediction method based on deep learning according to claim 1, characterized in that, in step 2, atomic features are extracted from the molecules of the training set and the molecular structures of the training set are converted into adjacency matrices as follows:

2.1 Extract statistical features from the molecular data: ['C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca', 'Fe', 'As', 'Al', 'I', 'B', 'V', 'K', 'Tl', 'Yb', 'Sb', 'Sn', 'Ag', 'Pd', 'Co', 'Se', 'Ti', 'Zn', 'H', 'Li', 'Ge', 'Cu', 'Au', 'Ni', 'Cd', 'In', 'Mn', 'Zr', 'Cr', 'Pt', 'Hg', 'Pb', '=', '+', '-', '(', ')', '/', '\', '[', ']', '@', '#', 'Unknown']. These features ignore digits and decimal points. The result is a dictionary containing all statistical features of the molecule, in which each value is the number of times the corresponding element or character occurs;

2.2 Extract the degree of each atom in the molecule, in the range 0 to 10; the degree of an atom is defined as the number of atoms directly bonded to it;

2.3 Extract the number of implicit high spins in the molecule, in the range 0 to 6; the angular momentum possessed by an atomic nucleus is called the nuclear spin;

2.4 Extract the formal charge of the atoms in the molecule;

2.5 Extract the number of free electrons of the atoms in the molecule;

2.6 Extract whether the molecule is an aromatic compound;

2.7 Represent every molecule as a structure graph by treating the atoms of the molecule as nodes and the chemical bonds as edges of an undirected graph, and generate the molecular structure graph in the form of an adjacency matrix. The adjacency matrix uses all atoms of the molecule as the labels of its rows and columns; when two atoms of the molecule are connected by a chemical bond, the corresponding matrix entry is 1.
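The statistical features of step 2.1 can be sketched as a symbol count over a molecule string. The sketch assumes the molecules are given as SMILES strings, which the patent does not state explicitly, and the tokenizing rule is also an assumption: outside square brackets only 'Cl' and 'Br' are read as two-letter symbols, inside brackets any two-letter element of the vocabulary is, and anything not in the vocabulary (for example lowercase aromatic atoms) is counted as 'Unknown'.

```python
VOCAB = [
    'C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca', 'Fe',
    'As', 'Al', 'I', 'B', 'V', 'K', 'Tl', 'Yb', 'Sb', 'Sn', 'Ag', 'Pd', 'Co',
    'Se', 'Ti', 'Zn', 'H', 'Li', 'Ge', 'Cu', 'Au', 'Ni', 'Cd', 'In', 'Mn',
    'Zr', 'Cr', 'Pt', 'Hg', 'Pb', '=', '+', '-', '(', ')', '/', '\\', '[',
    ']', '@', '#', 'Unknown',
]
TWO_LETTER = {symbol for symbol in VOCAB if len(symbol) == 2}


def count_symbols(smiles: str) -> dict:
    """Return {symbol: occurrences}, ignoring digits and decimal points."""
    counts = {symbol: 0 for symbol in VOCAB}
    i, in_bracket = 0, False
    while i < len(smiles):
        ch = smiles[i]
        if ch.isdigit() or ch == '.':
            i += 1
            continue
        pair = smiles[i:i + 2]
        # Assumed rule: two-letter elements other than Cl/Br only occur in brackets.
        if pair in TWO_LETTER and (in_bracket or pair in ('Cl', 'Br')):
            token = pair
        else:
            token = ch
        if ch == '[':
            in_bracket = True
        elif ch == ']':
            in_bracket = False
        counts[token if token in counts else 'Unknown'] += 1
        i += len(token)
    return counts


ACETYL_CHLORIDE = count_symbols('CC(=O)Cl')
print({k: v for k, v in ACETYL_CHLORIDE.items() if v})
```

The remaining per-atom features (degree, formal charge, and so on) come from RDKit in the patent and are not reproduced in this sketch.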

5. The drug activity prediction method based on deep learning according to claim 1, characterized in that step 3 comprises the following steps:

3.1 The input x has two parts: the atomic features of the molecule and the adjacency matrix derived from the molecular structure; x is a single matrix formed by combining the two;

3.2 The true value of the output y is encoded as the array [1,0] for 0 and as the array [0,1] for 1. Each training or test result is an array [a,b], where a and b are two probabilities with a + b = 1: one is the probability that the true value of y is [1,0], and the other is the probability that it is [0,1];

3.3 The prediction model uses a five-layer graph convolutional neural network. A graph convolutional neural network has two basic properties: every node has its own feature information, and every node in the graph also has structural information. The graph convolution is computed by the following formula, where v is the central node of the graph convolution:

u: a neighbor node of the central node v; h_conv(v): the graph convolution feature value of the central node v and node u; M: the set of all nodes in the graph convolutional neural network;

The feature parameters are each preset to 1 and are updated continuously during training;

σ: the pooling function;

Assume

Equation (1) transforms the feature of one edge of the central node v into h_conv(v), and then accumulates h_conv(v) over all neighbor nodes u, which gives the graph convolution of the central node v;

$h_{conv}(G) = [\,h_{conv}(v_1),\ h_{conv}(v_2),\ h_{conv}(v_3),\ \ldots\,]$ (2)

h_conv(G) denotes the set of h_conv(v) values of the drug molecule currently being computed, and G denotes that molecule;

The final result is the set of graph convolutions of all nodes v in the molecule, which is the set of molecular structure features.

6. The drug activity prediction method based on deep learning according to claim 4, characterized in that, in step 5, the graph convolution process is as follows:

5.1.1 Traverse all nodes in the molecular structure graph;

5.1.3 Set the central node of the graph convolution to v;

5.1.4 Traverse all neighbor nodes u of the central node v and build a relation dictionary d;

5.1.5 Transform the feature of node u into u′: $u' = W^{k,d}u + b^{k,d}$

where $W^{k,d}$ and $b^{k,d}$ denote the feature parameters, which are each preset to 1 and are updated continuously during training;

5.1.6 Sum all u′;

5.1.7 Return the features of the central node v;

The pooling process is as follows:

5.2.1 Max-pool over the neighbor nodes u′;

5.2.2 Return the graph convolution feature h_conv(v) of the central node v;

The fully connected process is as follows:

5.3.1 Use the LSTM to judge whether the graph convolution features of the molecule are useful, and thereby select the useful features;

5.3.2 Concatenate all selected useful features and pass the output values to the classifier.

7. The drug activity prediction method based on deep learning according to claim 1, characterized in that the trained model is obtained in step 6 through multiple iterations as follows:

6.1 In each iteration, randomly draw a batch of 128 samples from the training set and feed it into the model for training; once the training result is obtained, optimize the loss function by gradient descent.

As a further improvement, the prediction model in step 3 is a binary classification model.
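The two-class output encoding of step 3.2 can be sketched as below. The patent only requires that the output [a, b] consist of two probabilities with a + b = 1; producing them with a softmax over two raw scores is an assumption of this example, and the function names are illustrative.

```python
import math


def one_hot(label: int) -> list:
    """Encode the true value: 0 -> [1, 0], 1 -> [0, 1]."""
    return [1, 0] if label == 0 else [0, 1]


def softmax2(scores):
    """Turn two raw scores into probabilities [a, b] with a + b = 1 (assumed)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def predicted_label(probs) -> int:
    """Pick the class whose probability is larger."""
    return 0 if probs[0] >= probs[1] else 1


PROBS = softmax2([0.3, 1.2])
print(PROBS, predicted_label(PROBS), one_hot(1))
```

Here the first entry is read as the probability of class 0 and the second as the probability of class 1, matching the order of the one-hot arrays.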

Compared with the prior art, the invention has the following advantages:

1. Steps 1 and 2 preprocess the data more sensibly: interfering entries that have no data are removed, which improves the accuracy of the model. At the same time, feature extraction is simpler and more effective: only atomic features are computed, the molecular structure does not need to be simulated, the molecular structure is converted into an adjacency matrix, and features are extracted by graph convolution, which greatly reduces the time cost.

2. Step 3 builds a more sensible model: the five graph convolution layers extract the structural features of the molecule more efficiently, and the LSTM layer filters the features to obtain better ones.

3. Steps 4 to 7 implement the whole training process and optimize the model. Training runs for 2000 iterations with a batch size of 128, which ensures that all of the training data is traversed while the model is better optimized, yielding a relatively low loss value.

4. The method of this patent combines graph convolution and LSTM, which greatly reduces feature extraction time. It extracts reasonable and appropriate features for the atoms in the molecule without spending time computing more detailed molecular feature data with traditional computational chemistry methods, while still obtaining more reasonable features that traditional methods cannot provide, and so achieves better drug activity prediction accuracy when little data is available.
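The coverage claim in advantage 3 can be checked with simple arithmetic: 2000 iterations of 128 samples give 2000 × 128 = 256,000 draws, roughly 43 passes over a training set of 5951 molecules (the NR-AR training set size reported in Table 1). The sketch below simulates the random batch sampling; the sampling scheme (without replacement inside a batch, independent across batches) and the seed are assumptions. Strictly speaking, random sampling makes full coverage overwhelmingly likely rather than certain.

```python
import random


def count_covered(n_train, batch_size=128, iterations=2000, seed=0):
    """Return how many distinct training samples appear in at least one batch."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(iterations):
        seen.update(rng.sample(range(n_train), batch_size))
    return len(seen)


N_TRAIN = 5951                      # NR-AR training set size from Table 1
TOTAL_DRAWS = 2000 * 128            # samples seen over the whole training run
PASSES = TOTAL_DRAWS / N_TRAIN      # average number of times each sample is seen
COVERED = count_covered(N_TRAIN)
print(TOTAL_DRAWS, round(PASSES, 1), COVERED)
```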

Description of drawings

Fig. 1 is the overall flowchart;

Fig. 2 is the adjacency matrix of the ethane (C₂H₆) molecule;

Fig. 3 is the LSTM flowchart.

Detailed description

Fig. 1 is the overall flowchart of this patent.

The specific technical solution of this patent is:

Step 1. Build the dataset:

1.1 Shuffle the drug activity dataset and split it into an 80% training set, a 10% development set, and a 10% test set; the development and test sets are then held fixed for comparison.

1.2 Label molecules in the dataset that affect the receptor as 1 (positive samples) and molecules that have no effect as 0 (negative samples), and remove null entries that have no data; eliminating such interfering data can significantly improve accuracy.

1.3 The split ensures that the training, development, and test sets have the same distribution.

Step 2. Extract atomic features from the molecules of the training set and convert the molecular structures of the training set into adjacency matrices:

2.1 Extract statistical features from the molecular data: ['C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca', 'Fe', 'As', 'Al', 'I', 'B', 'V', 'K', 'Tl', 'Yb', 'Sb', 'Sn', 'Ag', 'Pd', 'Co', 'Se', 'Ti', 'Zn', 'H', 'Li', 'Ge', 'Cu', 'Au', 'Ni', 'Cd', 'In', 'Mn', 'Zr', 'Cr', 'Pt', 'Hg', 'Pb', '=', '+', '-', '(', ')', '/', '\', '[', ']', '@', '#', 'Unknown']. These features cover common elements as well as symbols that represent special bonds, parentheses, special molecules, ions, and so on; digits and decimal points are ignored. The result is a dictionary containing all statistical features of the molecule, in which each value is the number of times the corresponding element or character occurs;

2.3 Extract the number of implicit high spins in the molecule, in the range 0 to 6. The angular momentum possessed by an atomic nucleus is called the nuclear spin and is an important quantum mechanical property of the nucleus.

2.4 Extract the formal charge of the atoms in the molecule. Formal charge is introduced when writing the Lewis structural formula of a covalent compound in order to judge the stability of each possible species.

2.5 Extract the number of free electrons of the atoms in the molecule. Free electrons are electrons that are not bound inside a particular atom; their number affects properties of the substance such as electrical and thermal conductivity.

2.6 Extract whether the molecule is an aromatic compound. Aromatic compounds are compounds with a benzene ring structure; they are structurally stable, do not decompose easily, and are highly toxic.

2.7 Represent every molecule as a graph by treating atoms as nodes and bonds as edges of an undirected graph, and generate the molecular topology in the form of an adjacency matrix. The adjacency matrix uses all atoms of the molecule as the labels of its rows and columns; when two atoms of the molecule are connected by a chemical bond, the corresponding matrix entry is 1. Fig. 2 shows the adjacency matrix of the ethane (C₂H₆) molecule.
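The adjacency matrix of step 2.7 can be sketched for ethane. The atom ordering and the names C1, C2, H1 to H6 are assumptions made for this example, since Fig. 2 is not reproduced in the text; only the rule "bonded atoms get a 1" comes from the description.

```python
def adjacency_matrix(atoms, bonds):
    """Build a symmetric 0/1 matrix with one row and column per atom."""
    index = {name: i for i, name in enumerate(atoms)}
    size = len(atoms)
    matrix = [[0] * size for _ in range(size)]
    for a, b in bonds:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected graph: mirror every bond
    return matrix


def degrees(matrix):
    """Atom degree = number of directly bonded atoms = row sum."""
    return [sum(row) for row in matrix]


# Ethane (C2H6): one C-C bond, three C-H bonds on each carbon.
ATOMS = ['C1', 'C2', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6']
BONDS = [('C1', 'C2'),
         ('C1', 'H1'), ('C1', 'H2'), ('C1', 'H3'),
         ('C2', 'H4'), ('C2', 'H5'), ('C2', 'H6')]
ETHANE = adjacency_matrix(ATOMS, BONDS)
for name, row in zip(ATOMS, ETHANE):
    print(name, row)
```

The row sums reproduce the atom degree from the glossary: 4 for each carbon and 1 for each hydrogen.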

Step 3. Build the prediction model (a binary classification model) comprising five graph convolution layers and one LSTM layer:

3.1 The input x consists of the atomic features of the molecule and the adjacency matrix derived from the molecular structure;

3.2 The true value of the output y is encoded as [1,0] for 0 and [0,1] for 1. Each training or test result is an array [a,b], where a and b are two probabilities with a + b = 1;

u: a neighbor node of the central node v; h_conv(v): the graph convolution feature value of the central node v and node u; M: the set of all nodes in the graph convolutional neural network;

The feature parameters are each preset to 1 and are updated continuously during training;

σ: the pooling function;

Equation (1) transforms the feature of one edge of node v into h_conv(v), and then accumulates h_conv(v) over all neighbor nodes u, which gives the graph convolution of node v;

h_conv(G) denotes the set of h_conv(v) values of the molecule currently being computed, and G denotes that molecule;

3.4 LSTM (long short-term memory network):

The main difference between an LSTM and a plain RNN is that the LSTM adds a "processor" that judges whether information is useful (the middle module in Fig. 3).

The repeating module of an LSTM contains four interacting activation functions (three sigmoid, one tanh). Each line in the figure carries a complete vector from the output of one node to the inputs of other nodes. As shown in Fig. 3, circles denote pointwise operations such as vector addition, and rectangles denote gate activation functions. Merging lines denote concatenation, and forking lines denote content being copied and sent to different places.

The structures in the memory cell that control removing information from or adding information to the cell are called gates. There are three of them: the forget gate, the input gate, and the output gate. A gate consists of a sigmoid activation function and a pointwise multiplication. The hidden state of the previous time step is sent to the forget gate (input node), to the input gate, and to the output gate. In the forward pass, the input gate learns when to let activations into the memory cell, and the output gate learns when to let activations out of it. Correspondingly, in the backward pass, the output gate learns when to let error flow into the memory cell, and the input gate learns when to let it flow out.

Using the input x_t and the output h_{t-1} of step t-1, compute the forget rate, which decides whether a feature is forgotten: 0 means forgotten completely and 1 means retained completely.
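The forget rate can be illustrated with the standard LSTM forget gate, f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f). The patent's own formula is not reproduced in the text, so this standard form is an assumption, and the weight and bias values below are hypothetical numbers chosen only to show the two extremes.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forget_gate(x_t, h_prev, weights, bias):
    """f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f), one rate per cell entry."""
    concat = list(h_prev) + list(x_t)
    return [sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
            for row, b in zip(weights, bias)]


def apply_forget(cell_prev, rates):
    """Scale the previous cell state: rate 0 forgets, rate 1 keeps everything."""
    return [r * c for r, c in zip(rates, cell_prev)]


X_T = [0.5, -0.2]            # current input (hypothetical)
H_PREV = [0.1, 0.3]          # previous output h_{t-1} (hypothetical)
W_F = [[0.0, 0.0, 0.0, 0.0],  # two cell entries, four concatenated inputs
       [0.0, 0.0, 0.0, 0.0]]
B_F = [10.0, -10.0]          # large bias keeps entry 0, forgets entry 1
RATES = forget_gate(X_T, H_PREV, W_F, B_F)
print(RATES, apply_forget([2.0, 2.0], RATES))
```

Because the sigmoid output lies strictly between 0 and 1, real gates forget or retain features to a degree rather than absolutely.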

Step 4. Train using the data obtained in steps 2) and 3).

Step 5. After convolution, pooling, and full connection, pass the output values to the classifier, optimize the loss function, and continue training. The specific process is as follows:

5.1 Graph convolution process:

for all nodes v in graph
    set k = deg(v)
    for u in neigh(v) ∪ {v}
        set d = dist(v, u)
        transform features u′ = W^{k,d} u + b^{k,d}
    sum all u′ and apply nonlinearity
    return new features for v

That is: 5.1.1 Traverse all nodes in the molecular structure graph;

5.1.4 Traverse all neighbors u of the central node v and build a relation dictionary d;

5.1.5 Transform the feature of node u into u′:

(as explained for equation (1))

5.1.6 Sum all u′;

5.1.7 Return the features of node v;
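The graph convolution steps above can be sketched in plain Python. This is an illustrative reading of the pseudocode, not the patent's implementation: the parameter layout (one weight matrix and bias per pair of degree k and distance d), the use of tanh as the nonlinearity (the activation listed in Table 1), and the toy three-atom graph are assumptions. All parameters start at 1, as the description states; no training is performed here.

```python
import math


def init_params(max_degree, n_in, n_out):
    """One (W, b) pair per (degree k, distance d); every entry starts at 1.0."""
    return {
        (k, d): ([[1.0] * n_in for _ in range(n_out)], [1.0] * n_out)
        for k in range(max_degree + 1)
        for d in (0, 1)  # d = 0: the node itself, d = 1: a bonded neighbor
    }


def affine(weights, bias, x):
    """Compute W x + b for a list-of-lists matrix W and vectors b, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def graph_conv(features, adjacency, params):
    """Return new per-node features; adjacency must have a zero diagonal."""
    n_out = len(next(iter(params.values()))[1])
    new_features = []
    for v, row in enumerate(adjacency):
        neighbors = [u for u, bonded in enumerate(row) if bonded]
        k = len(neighbors)                      # k = deg(v)
        total = [0.0] * n_out
        for u in neighbors + [v]:               # u in neigh(v) ∪ {v}
            d = 0 if u == v else 1              # d = dist(v, u)
            weights, bias = params[(k, d)]
            u_prime = affine(weights, bias, features[u])
            total = [t + p for t, p in zip(total, u_prime)]
        new_features.append([math.tanh(t) for t in total])
    return new_features


# Toy chain of three atoms with one feature each (hypothetical values).
ADJACENCY = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
FEATURES = [[0.1], [0.2], [0.3]]
H_CONV = graph_conv(FEATURES, ADJACENCY, init_params(2, 1, 1))
print(H_CONV)
```

Indexing the parameters by degree lets atoms with different numbers of bonds use different transformations, which is one way to read the `k` and `d` superscripts in the pseudocode.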

5.2 Pooling process:

    max over u in neigh(v) ∪ {v}
    return new features for v

5.2.2 Return the new features of node v.
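The pooling step can be sketched as an element-wise maximum over each node and its bonded neighbors. The function name and the toy graph are illustrative assumptions; the rule "max over u in neigh(v) ∪ {v}" is taken from the pseudocode.

```python
def graph_max_pool(features, adjacency):
    """For every node v, take the element-wise max over neigh(v) and v itself."""
    pooled = []
    for v, row in enumerate(adjacency):
        group = [features[u] for u, bonded in enumerate(row) if bonded]
        group.append(features[v])
        pooled.append([max(column) for column in zip(*group)])
    return pooled


# Toy chain of three atoms with two features each (hypothetical values).
ADJACENCY = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
FEATURES = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.4]]
POOLED = graph_max_pool(FEATURES, ADJACENCY)
print(POOLED)
```

Unlike image pooling, this operation keeps the number of nodes unchanged; it only replaces each node's features with the strongest response in its neighborhood.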

5.3 Fully connected process:

5.3.1 Use the LSTM to judge whether features are useful, and thereby select the meaningful features.

5.3.2 Concatenate all selected features and pass the output values to the classifier.

Step 6. Obtain the trained model after multiple iterations:

Step 7. Apply the same feature processing to the development set and the test set, and feed them into the model to obtain test results.

Step 8. Experimental results and discussion.

8.1 The dataset used in this patent is the Tox21 dataset (Tox21 Data Challenge) [https://tripod.nih.gov/tox21/challenge/]. The 2014 Tox21 Data Challenge aimed to help scientists understand the potential of chemicals and compounds to damage biological organs; the Tox21 dataset records chemicals and compounds that toxicological analysis indicates may have toxic effects on living organisms;

8.2 The Tox21 dataset contains data on 8013 substances that may affect 12 human receptors (NR-AR, NR-AR-LBD, NR-AhR, NR-Aromatase, NR-ER, NR-ER-LBD, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, SR-p53), with 8000 records per receptor;

8.3 Experiment 1 of this patent builds a separate model for each of the 12 receptor datasets, giving 12 prediction results:

| Receptor | Training set | Test set | Development set | Activation function | Epochs | Number of features | Dev set accuracy | Test set accuracy |
|---|---|---|---|---|---|---|---|---|
| NR-AR | 5951 | 744 | 744 | tanh | 2 | 75 | 0.961 | 0.962 |
| NR-AR-LBD | 5521 | 691 | 691 | tanh | 2 | 75 | 0.972 | 0.972 |
| NR-AhR | 5353 | 669 | 669 | tanh | 2 | 75 | 0.922 | 0.921 |
| NR-Aromatase | 4752 | 594 | 594 | tanh | 2 | 75 | 0.949 | 0.949 |
| NR-ER | 5052 | 632 | 632 | tanh | 2 | 75 | 0.867 | 0.867 |
| NR-ER-LBD | 5612 | 710 | 710 | tanh | 2 | 75 | 0.949 | 0.951 |
| NR-PPAR-gamma | 5266 | 658 | 658 | tanh | 2 | 75 | 0.953 | 0.956 |
| SR-ARE | 4748 | 593 | 593 | tanh | 2 | 75 | 0.828 | 0.829 |
| SR-ATAD5 | 5786 | 723 | 723 | tanh | 2 | 75 | 0.929 | 0.938 |
| SR-HSE | 5275 | 660 | 660 | tanh | 2 | 75 | 0.916 | 0.912 |
| SR-MMP | 4736 | 592 | 592 | tanh | 2 | 75 | 0.897 | 0.866 |
| SR-p53 | 5527 | 691 | 691 | tanh | 2 | 75 | 0.935 | 0.920 |

Table 1. Experimental results on the Tox21 dataset

8.4 Experiment 2 of this patent compares the performance of this model with traditional machine learning models. The results of Experiment 1 are compared against five methods: logistic regression (LR), support vector machine (SVM), and three naive Bayes variants (BernoulliNB, GaussianNB and MultinomialNB):

| Receptor | This method | LR | SVM | BernoulliNB | GaussianNB | MultinomialNB |
|---|---|---|---|---|---|---|
| NR-AR | 0.962 | 0.957 | 0.961 | 0.858 | 0.190 | 0.884 |
| NR-AR-LBD | 0.972 | 0.968 | 0.965 | 0.890 | 0.158 | 0.907 |
| NR-AhR | 0.922 | 0.861 | 0.855 | 0.810 | 0.228 | 0.682 |
| NR-Aromatase | 0.949 | 0.933 | 0.933 | 0.879 | 0.143 | 0.832 |
| NR-ER | 0.867 | 0.859 | 0.850 | 0.790 | 0.185 | 0.805 |
| NR-ER-LBD | 0.957 | 0.945 | 0.938 | 0.904 | 0.138 | 0.885 |
| NR-PPAR-gamma | 0.956 | 0.831 | 0.804 | 0.775 | 0.114 | 0.912 |
| SR-ARE | 0.829 | 0.815 | 0.815 | 0.753 | 0.202 | 0.724 |
| SR-ATAD5 | 0.938 | 0.945 | 0.798 | 0.734 | 0.141 | 0.840 |
| SR-HSE | 0.916 | 0.930 | 0.727 | 0.926 | 0.139 | 0.874 |
| SR-MMP | 0.897 | 0.855 | 0.818 | 0.796 | 0.221 | 0.740 |
| SR-p53 | 0.935 | 0.931 | 0.931 | 0.883 | 0.126 | 0.842 |

Table 2. Comparison with traditional methods

In the experiments, every machine learning method under comparison uses the same data processing as the model of this patent, which keeps the comparison valid. The figures in the table are test-set results. The data in Table 2 show that, on the Tox21 dataset, this patent outperforms the traditional machine learning methods across the board on 5 of the 12 receptor datasets. Compared with traditional machine learning methods, this patent obtains better and more stable results, which demonstrates that its model innovation is effective.

The above example is only one specific embodiment of the present invention; simple transformations, substitutions, and the like also fall within the protection scope of the invention.

Claims

1. A drug activity prediction method based on deep learning, characterized in that it comprises the following steps:

Step 1: construct a drug activity data set and split it, with one part of the data used as the training set, one part as the development set, and one part as the test set;

Step 2: extract atomic features from the molecules of the training set, and convert the molecular structures of the training set into adjacency matrices;

Step 3: build a prediction model comprising five graph convolution layers and one LSTM layer;

Step 4: train on the data obtained in steps 2 and 3;

Step 5: after graph convolution, pooling, and full connection, pass the output value to the classifier, optimize the loss function, and continue training;

Step 6: through iterative computation, obtain the trained prediction model;

Step 7: input the drug to be predicted into the prediction model to obtain the prediction result.

2. The drug activity prediction method based on deep learning according to claim 1, characterized in that, in step 7, the development set and the test set are first also processed through steps 2 to 6 and then fed into the prediction model to obtain test results.

3. The drug activity prediction method based on deep learning according to claim 1, characterized in that step 1 comprises the following steps:

1.1 Shuffle the drug activity data set and split it into an 80% training set, a 10% development set, and a 10% test set; the development set and test set are kept fixed and used for comparison; the split ensures that the data of the training set, development set, and test set are uniformly distributed over the data set;

1.2 Label the molecules in the data set that affect the receptor as 1 (positive samples) and those with no effect as 0 (negative samples); remove null values that carry no data, so that interfering data are rejected and accuracy is improved.
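As an illustration of items 1.1 and 1.2, the following minimal sketch drops unlabeled records, shuffles, and splits 80/10/10. The `(molecule, label)` tuple layout, the function name `split_dataset`, and the fixed seed are assumptions for the example; a plain shuffle only approximates a uniform distribution across the three sets.

```python
import random

def split_dataset(records, seed=0):
    """Drop records without a label, shuffle, and split 80/10/10.

    records: list of (molecule, label) tuples; label is 1, 0, or None.
    """
    labeled = [r for r in records if r[1] is not None]  # 1.2: remove null values
    rng = random.Random(seed)       # a fixed seed keeps the dev and test sets fixed
    rng.shuffle(labeled)            # 1.1: shuffle before splitting
    n_train = int(0.8 * len(labeled))
    n_dev = int(0.1 * len(labeled))
    train = labeled[:n_train]
    dev = labeled[n_train:n_train + n_dev]
    test = labeled[n_train + n_dev:]
    return train, dev, test

# 100 labeled placeholder molecules plus one record with a missing label.
demo = [("mol%d" % i, i % 2) for i in range(100)] + [("missing", None)]
train_set, dev_set, test_set = split_dataset(demo)
```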

4. The drug activity prediction method based on deep learning according to claim 1, characterized in that, in step 2, atomic features are extracted from the molecules of the training set and the molecular structures of the training set are converted into adjacency matrices as follows:

2.1 Extract statistical features from the molecular data: ['C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca', 'Fe', 'As', 'Al', 'I', 'B', 'V', 'K', 'Tl', 'Yb', 'Sb', 'Sn', 'Ag', 'Pd', 'Co', 'Se', 'Ti', 'Zn', 'H', 'Li', 'Ge', 'Cu', 'Au', 'Ni', 'Cd', 'In', 'Mn', 'Zr', 'Cr', 'Pt', 'Hg', 'Pb', '=', '+', '-', '(', ')', '/', ' ', '[', ']', '@', '#', 'Unknown']; ignoring digits and decimal points, obtain a dictionary of all statistical features in the molecule, whose values are the number of times each element symbol or corresponding character appears;

2.2 Extract the degree of each atom in the molecule, with a range of 0 to 10; the degree of an atom is defined as the number of atoms directly connected to it;

2.3 Extract the number of implicit high-spin states in the molecule, with a range of 0 to 6; the angular momentum possessed by an atomic nucleus is called nuclear spin;

2.4 Extract the formal charge of each atom in the molecule;

2.5 Extract the number of free electrons of each atom in the molecule;

2.6 Extract whether the molecule is an aromatic compound;

2.7 By treating the atoms in a molecule as nodes and the chemical bonds as edges of an undirected graph, represent every molecule as a structure graph and generate the molecular structure expressed as an adjacency matrix; the adjacency matrix uses all atoms in the molecule as its row and column labels, and when two atoms in the molecule are connected by a bond, the value at the corresponding matrix position is 1.
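Items 2.2 and 2.7 can be illustrated with a short sketch. It assumes the bonds are already available as pairs of atom indices (for example, as exported from a cheminformatics toolkit); the helper names `adjacency_matrix` and `atom_degrees` are hypothetical.

```python
def adjacency_matrix(num_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of bonded atom pairs."""
    matrix = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1    # undirected graph: mirror every bond
    return matrix

def atom_degrees(matrix):
    """Degree of each atom: the number of atoms directly connected to it."""
    return [sum(row) for row in matrix]

# Heavy atoms of ethanol, C-C-O, indexed 0, 1, 2.
ethanol = adjacency_matrix(3, [(0, 1), (1, 2)])
ethanol_degrees = atom_degrees(ethanol)
# ethanol == [[0, 1, 0], [1, 0, 1], [0, 1, 0]]; ethanol_degrees == [1, 2, 1]
```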

5. The drug activity prediction method based on deep learning according to claim 1, characterized in that step 3 comprises the following steps:

3.1 The input x consists of two parts: the atomic features of the molecule, and the adjacency matrix converted from the molecular structure; x is the single matrix obtained by combining the atomic features of the molecule with that adjacency matrix;

3.2 For the true value of the output y, the array [1,0] represents 0 and the array [0,1] represents 1; the result of each training or test run is an array [a, b], where a and b are two probability values and a+b=1; one of a and b is the probability that the true value of y is the array [1,0], and the other is the probability that the true value of y is the array [0,1];

3.3 The prediction model uses a five-layer graph convolutional neural network, which has two basic characteristics: first, each node has its own feature information; second, each node in the graph also has structural information. The graph convolution formula is given below, with the central node of the graph convolution denoted v:

u: denotes a neighbor node of the central node v; h_conv(v): denotes the graph convolution feature value of the central node v and node u; M: denotes the set of all nodes in the graph convolutional neural network; b^v: denotes a feature parameter, for which a value can be preset (all 1) and which is continuously updated during training;

σ: denotes the pooling function;

If

Formula (1) converts the feature of one edge of the central node v into h_conv(v), and then accumulates the h_conv(v) of all neighbor nodes u, which gives the graph convolution of the central node v;

h_conv(G)=[h_conv(v₁), h_conv(v₂), h_conv(v₃) ...] (2)

h_conv(G) denotes the set of h_conv(v) values of the drug molecule currently being computed, and G denotes that molecule;

Finally, the set of graph convolutions of all nodes v in the molecule is obtained, which is the set of molecular features.

6. The drug activity prediction method based on deep learning according to claim 4, characterized in that, in step 5, the graph convolution process is as follows:

5.1.1 Traverse all nodes in the molecular structure;

5.1.3 Set the central node of the graph convolution to v;

5.1.4 Traverse all neighbor nodes u of the central node v and build the relation dictionary d;

5.1.5 Transform the features of node u into u′:

where b^v denotes a feature parameter, for which a value can be preset (all 1) and which is continuously updated during training;

5.1.6 Sum all u′;

5.1.7 Return the features of the central node v;

The pooling process is as follows:

5.2.1 Max-pool over the neighbor nodes u′;

5.2.2 Return the graph convolution feature h_conv(v) of the central node v;

The fully connected process is as follows:

5.3.1 Use the LSTM to judge whether the graph convolution features of the molecule are useful, thereby selecting the useful features;

5.3.2 Concatenate all the selected useful features and pass the output value to the classifier.

7. The drug activity prediction method based on deep learning according to claim 1, characterized in that, in step 6, the steps of computing over multiple iterations to obtain the trained model are as follows:

6.1 Each time, randomly draw a batch of 128 samples (batch size 128) from the training set and feed it into the model for training; after the training result is obtained, optimize the loss function using gradient descent.
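Item 6.1 can be sketched as random mini-batch sampling followed by a plain gradient-descent update. The learning rate, the flat parameter list, and both function names are assumptions for illustration; the claim does not specify them.

```python
import random

BATCH_SIZE = 128

def sample_batch(training_set, rng, batch_size=BATCH_SIZE):
    """Randomly draw one mini-batch from the training set without replacement."""
    return rng.sample(training_set, min(batch_size, len(training_set)))

def gradient_descent_step(params, grads, learning_rate=0.01):
    """Plain gradient-descent update: p <- p - learning_rate * g."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

rng = random.Random(0)
batch = sample_batch(list(range(1000)), rng)
updated = gradient_descent_step([1.0, 2.0], [10.0, -10.0], learning_rate=0.1)
```

In a full training loop, the gradients would come from the loss computed on each sampled batch, and the two functions would be called once per iteration.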

8. The drug activity prediction method based on deep learning according to claim 1, characterized in that, in step 3, the prediction model is a binary classification model.