CN116189759A

CN116189759A - Virtual screening method and application of quorum sensing lead compounds

Info

Publication number: CN116189759A
Application number: CN202310234744.9A
Authority: CN
Inventors: 江高飞; 薛卫; 张家璇; 刘佐; 韦中; 徐阳春; 沈其荣
Original assignee: Nanjing Agricultural University
Current assignee: Nanjing Agricultural University
Priority date: 2023-03-13
Filing date: 2023-03-13
Publication date: 2023-05-30
Anticipated expiration: 2043-03-13
Also published as: CN116189759B

Abstract

The invention discloses a virtual screening method for quorum sensing lead compounds, which mainly comprises the following steps: the input molecular compound structure is preprocessed to construct a molecular adjacency matrix, which is fed into a GNN1 network to generate compound features; the amino acid composition and dipeptide frequencies of the input protein sequence are extracted and combined into a preliminary protein feature vector, which is fed into a cross network to generate cross-fusion features; at the same time, a contact map is generated from the protein sequence and fed into a GNN2 network to generate protein sequence features; finally, the three features are combined and fed into fully connected layers to predict an affinity value. The invention can be used to discover new compounds with quorum sensing activity and provides new ideas and means for the control and prevention of bacteria such as Ralstonia solanacearum; it can also efficiently screen for compounds that bind the PhcA and PhcR proteins, and thereby identify compounds with quorum sensing activity.

Description

A virtual screening method and application of quorum sensing lead compounds

Technical field

The invention relates to the field of medicinal chemistry, and in particular to a targeted virtual screening method for quorum sensing lead compounds of Ralstonia solanacearum and its application.

Background

Ralstonia solanacearum is one of the most destructive soil-borne pathogens in the world. It is widely distributed across tropical, subtropical and temperate regions and is gradually spreading to high-latitude and high-altitude areas. The virulence behaviors involved in the invasion of the crop rhizosphere by soil-borne R. solanacearum are regulated by quorum sensing. R. solanacearum has two quorum sensing systems: the AHL system and the 3-hydroxypalmitic acid methyl ester (3-OH PAME) system; the AHL system does not affect virulence. Through the 3-OH PAME quorum sensing system, R. solanacearum globally regulates metabolism and virulence behavior, coordinates the assembly of its secretion systems, and directs the sequential expression and secretion of virulence factors, thereby completing rhizosphere invasion. The system consists of the PhcBSR synthesis components and the regulator PhcA. PhcB synthesizes the signal molecule 3-OH PAME and PhcS senses it. When the concentration of 3-OH PAME exceeds a threshold, PhcS activates PhcR, which relieves the inhibition of PhcA by PhcR. PhcA regulates not only the primary metabolism and the AHL quorum sensing system of R. solanacearum, but also virulence behaviors closely tied to rhizosphere invasion, including motility, siderophores, biofilm, cell wall degrading enzymes, type III virulence factors that disrupt the plant immune system, and exopolysaccharides. Some studies have tried to block rhizosphere invasion by degrading the quorum sensing molecules, but the control effect has been unsatisfactory.

Virtual screening is now a very common strategy in computer-aided drug design and is widely used. Drug-target affinity (DTA) prediction is an important step in virtual screening: it quickly matches targets with drugs and speeds up drug development. DTA prediction gives the binding strength between a drug and a target protein, and can therefore indicate whether a small molecule binds to the protein. For proteins with known structure and binding-site information, molecular simulation and molecular docking can be used for detailed modeling and give more accurate results; this is called structure-based virtual screening. However, many proteins still have no structural information, and even with homology modeling it remains difficult to obtain structures for many of them. Predicting the binding affinity between proteins and drug molecules from sequence (sequence-based virtual screening) is therefore a pressing problem, and it is the focus of the present invention.

Virtual screening based on molecular docking has become a core technology of computer-aided compound design and is widely used in the targeted development of new compounds. Using virtual screening to find quorum sensing lead compounds of R. solanacearum that interfere with quorum sensing may therefore be an important way to control soil-borne bacterial wilt.

Summary of the invention

The object of the present invention is to provide a targeted virtual screening method for quorum sensing lead compounds of R. solanacearum. The method uses reinforcement learning to search automatically for the best graph network structure, which improves the performance of the graph-network-based virtual screening model for quorum sensing lead compounds. It is a screening method for quorum sensing lead compounds based on the PhcA and PhcR protein structures; it can be used to discover new compounds with quorum sensing activity and provides new ideas and means for the control and prevention of bacteria such as R. solanacearum.

The technical scheme adopted by the present invention is as follows:

A virtual screening method for quorum sensing lead compounds, in which a molecular compound structure and a protein sequence are fed as input into a preprocessing module to extract preliminary features, which are then fed into a prediction model network, wherein the structure and parameters of the prediction model are generated by training an LSTM controller. The specific process is as follows:

The input molecular compound structure is preprocessed to construct a molecular adjacency matrix, which is fed into the GNN1 network to generate compound features;

From the input protein sequence, the amino acid composition and dipeptide frequencies are extracted and combined into a preliminary protein feature vector, which is fed into the cross network to generate cross-fusion features; at the same time, a contact map is generated from the protein sequence and fed into the GNN2 network to generate protein sequence features;

Finally, the three features are combined and fed into the fully connected layers to predict the affinity value.

Further, let the constructed molecular adjacency matrix be X₁. An element is 1 if the two atoms are adjacent in the molecular structure graph and 0 otherwise. The matrix has size n*n, where n is the number of nodes in the structure graph, that is, the number of atoms.

Further, the protein sequence is processed with the Pconsc4 software, which outputs an m*m matrix of the probability that each residue pair is in contact, where m is the number of residues. Values greater than 0.5 are retained and all other values are set to 0; the filtered matrix is the protein contact map X₂.

Further, the amino acid composition is the frequency of occurrence of each of the 20 amino acids in the sequence, and the dipeptide frequency is the frequency of occurrence of each amino acid pair formed by any two amino acids.

Further, the prediction model network consists of the GNN1 network, the GNN2 network and the cross network in parallel; their outputs are merged and fed into a concatenation layer and a Dropout layer, followed by two fully connected layers.

Further, the cross network consists of 5 cross layers in series followed by a 128-dimensional fully connected layer whose output is f₃. Each cross layer is defined by:

$$C_{l+1} = X_0 C_l^{T} W_{c,l} + b_{c,l} + C_l$$

where l = 1, 2, …, 5; C_l and C_{l+1} are the outputs of the l-th and (l+1)-th cross layers; C₀ is the combination X₃ of amino acid composition and dipeptide frequencies; and W_{c,l} and b_{c,l} are the connection parameters between the two layers. All variables in the formula are column vectors. The output of each layer is the output of the previous layer plus a feature cross.

Further, the molecular adjacency matrix and the protein contact map are fed into two separate networks, GNN1 and GNN2, each consisting of 3 GNN layers. The output features of the two GNN networks are f₁ and f₂. Together with the cross-fusion feature f₃ they are concatenated into f₁+f₂+f₃, the overall feature of the small molecule-protein pair used for prediction. This feature is fed into a fully connected layer with output dimension 128 and then into a second fully connected layer with output dimension 1, which gives the affinity value predicted by the network.

Further, the virtual screening network model is optimized by an LSTM controller. Model optimization means using reinforcement learning within a defined parameter space to obtain the best structural parameters of the two GNNs together with the parameters of the other neurons in the network. For a GNN structure M, the following parameters must be determined: the sampling function (S), the correlation measure function (Att), the aggregation function (Agg), the number of attention heads (K), the output hidden embedding dimension (Dim) and the activation function (Act).

Further, the optimization consists of two steps. First, the LSTM predicts the S, Att, Agg, Act, K and Dim operations for GNN1. Each prediction is made by a softmax classifier of the LSTM, and the predicted value is fed into the next time step to obtain the next parameter prediction. When the number of GNN layers reaches 3, the LSTM controller has generated one architecture; the process is repeated to generate the parameters of GNN2. The whole prediction network is then built and trained to obtain the weights of the GNN networks and the other network layers. Second, the parameters of the LSTM are optimized by reinforcement learning, based on the accuracy obtained after network training, to yield an improved controller model. The two steps alternate for a set number of iterations, after which the final screening network model is obtained.

The above method can be applied to the virtual screening of quorum sensing lead compounds of R. solanacearum.

The beneficial effects of the present invention are as follows. First, the method extracts the molecular adjacency matrix, the protein contact map and the protein sequence cross-fusion features to form multi-dimensional features that better represent the molecule and the protein. Second, the method uses reinforcement learning to optimize the model structure, which avoids choosing model parameters by experience or extensive manual effort. Refining and extending this method enables effective virtual screening of lead compounds, and the method has broad prospects and considerable significance.

Description of drawings

Fig. 1 is a diagram of the virtual screening model of the present invention;

Fig. 2 shows the molecular graph representation.

Detailed description

The present invention is described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the virtual screening method for quorum sensing lead compounds is based on the PhcA and PhcR protein structures. It is a screening method for quorum sensing lead compounds of R. solanacearum established with computer-aided virtual screening technology based on the PhcA and PhcR protein structures, molecular biology techniques and the like. The method is implemented as a virtual screening model. The inputs of the model are a molecular compound structure and a protein sequence, which are fed into a preprocessing module to extract preliminary features and then into the prediction model network; the structure and parameters of the prediction model are generated by training an LSTM controller. The basic process is as follows. The molecular formula of the compound is input, the molecular adjacency matrix is constructed, and the matrix is fed into the GNN1 sub-network to generate compound features. The protein sequence is input, and its amino acid composition and dipeptide frequencies are extracted and combined into a preliminary protein feature vector, which is fed into the cross network to generate cross-fusion features; at the same time, a contact map is generated from the protein sequence and fed into the GNN2 network to generate protein sequence features. Finally, the three features are combined and fed into the fully connected layers to predict the affinity value; an output of 0 means no affinity and 1 means affinity. The following sections describe the construction of the molecular adjacency matrix and the protein contact map and the generation of the protein sequence cross-fusion features; the GNN structure optimization and the optimization of the other prediction network parameters are carried out by the LSTM controller.

1. Data preprocessing

(1) Constructing the molecular adjacency matrix

Molecules are represented in the dataset in SMILES format. A molecular graph is built from the SMILES string of the drug, with atoms as nodes and bonds as edges. The construction of the molecular graph is shown in Figure 2. Let the resulting molecular adjacency matrix be X₁: an element is 1 if the two atoms are adjacent and 0 otherwise, and the matrix has size n*n, where n is the number of nodes in the graph, that is, the number of atoms.
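The adjacency-matrix construction can be sketched as follows. This is a minimal illustration only: it assumes the SMILES string has already been parsed into an atom count and a list of bonded atom-index pairs (in practice a cheminformatics toolkit would do the parsing), and the function name `build_adjacency` and the ethanol example are hypothetical, not taken from the patent.

```python
def build_adjacency(num_atoms, bonds):
    """Return an n*n 0/1 adjacency matrix X1 for a molecular graph.

    num_atoms: number of atoms (graph nodes), n.
    bonds: iterable of (i, j) atom-index pairs, one per bond (graph edge).
    """
    x1 = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        # Bonds are undirected, so the matrix is symmetric.
        x1[i][j] = 1
        x1[j][i] = 1
    return x1


# Hypothetical example: the three heavy atoms of ethanol (SMILES "CCO"),
# indexed C0-C1-O2, with bonds C0-C1 and C1-O2.
ethanol = build_adjacency(3, [(0, 1), (1, 2)])
```

Whether hydrogens are counted as nodes depends on how the SMILES string is parsed; the example above counts heavy atoms only, which is an assumption.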

(2) Protein contact map

A protein contact map is a graph representation of the contacts between the residues of a protein and is used to describe protein structure and function. The protein sequence is processed with the Pconsc4 software, which outputs an m*m matrix of the probability that each residue pair is in contact. Values greater than 0.5 are retained and all other values are set to 0; the filtered matrix is the protein contact map X₂.
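The filtering step can be sketched as below. The sketch assumes the contact-probability matrix has already been produced by the predictor and is available as a nested list; the function name `filter_contact_map` and the 3-residue matrix are invented for illustration, and no Pconsc4 call is shown.

```python
def filter_contact_map(prob, threshold=0.5):
    """Keep probabilities strictly above the threshold; set all others to 0."""
    return [[p if p > threshold else 0.0 for p in row] for row in prob]


# Hypothetical 3-residue probability matrix (values invented for illustration).
prob = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
x2 = filter_contact_map(prob)
```

Because the text keeps values "greater than 0.5", an entry exactly equal to 0.5 is set to 0 in this sketch.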

(3) Protein composition features

The protein composition feature X₃ combines the amino acid composition and the dipeptide frequencies and has 420 dimensions. The amino acid composition is the frequency of occurrence of each of the 20 amino acids in the sequence. The dipeptide frequency is the frequency of occurrence of each amino acid pair formed by any two amino acids. Since protein sequences are built from 20 amino acids, there are 400 dipeptides.
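A possible way to compute the 420-dimensional vector is sketched below. It assumes that dipeptides are counted as overlapping adjacent residue pairs and that frequencies are normalized by sequence length (for amino acids) and by the number of adjacent pairs (for dipeptides); the text does not spell out these details, and the names `composition_features`, `AMINO_ACIDS` and `DIPEPTIDES` are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def composition_features(seq):
    """Return X3: 20 amino acid frequencies followed by 400 dipeptide frequencies."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    aac = [seq.count(a) / n for a in AMINO_ACIDS]
    # Count overlapping adjacent pairs; a sequence of length n has n - 1 of them.
    pair_counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n - 1):
        pair = seq[i:i + 2]
        if pair in pair_counts:  # skip pairs containing non-standard residues
            pair_counts[pair] += 1
    total_pairs = max(n - 1, 1)
    dpc = [pair_counts[d] / total_pairs for d in DIPEPTIDES]
    return aac + dpc


x3 = composition_features("ACCA")  # toy sequence, not a real protein
```

For the toy sequence, A and C each have frequency 0.5, and the adjacent pairs AC, CC and CA each have frequency 1/3.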

2. Prediction model architecture

(1) Cross network

The cross network takes X₃ as input. It consists of 5 cross layers in series followed by a 128-dimensional fully connected layer whose output is f₃. Each cross layer is defined by:

$$C_{l+1} = X_0 C_l^{T} W_{c,l} + b_{c,l} + C_l$$

where l = 1, 2, …, 5; C_l and C_{l+1} are the outputs of the l-th and (l+1)-th cross layers; C₀ is X₃; and W_{c,l} and b_{c,l} are the connection parameters between the two layers. All variables in the formula are column vectors. The output of each layer is the output of the previous layer plus a feature cross.
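The cross-layer recurrence can be illustrated with plain Python lists. This sketch assumes that X₀ in the formula is the network input, so that X₀ = C₀ = X₃, and it omits the final 128-dimensional fully connected layer; the function names and the toy 2-dimensional weights are hypothetical.

```python
def cross_layer(x0, c, w, b):
    """One cross layer: C_next = X0 * (C^T W) + b + C, all arguments length-d vectors."""
    scalar = sum(ci * wi for ci, wi in zip(c, w))  # C^T W is a scalar for column vectors
    return [x0i * scalar + bi + ci for x0i, bi, ci in zip(x0, b, c)]


def cross_network(x0, weights, biases):
    """Apply the cross layers in series, starting from C_0 = X0."""
    c = list(x0)
    for w, b in zip(weights, biases):
        c = cross_layer(x0, c, w, b)
    return c


# Toy 2-dimensional input with two layers (values chosen for easy checking).
out = cross_network(
    [1.0, 2.0],
    weights=[[1.0, 0.0], [0.0, 1.0]],
    biases=[[0.0, 0.0], [1.0, 1.0]],
)
```

In the toy run the first layer maps [1, 2] to [2, 4] and the second maps that to [7, 13]. Because C^T W collapses to a scalar, each layer adds only 2d parameters for a d-dimensional input.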

(2) Overall structure of the affinity prediction network

The prediction model consists of two GNN networks (GNN1 and GNN2) and the cross network in parallel; their outputs are merged and fed into a concatenation layer and a Dropout layer, followed by two fully connected layers. The molecular adjacency matrix of the drug molecule and the contact map of the protein are fed into the two separate networks GNN1 and GNN2, each consisting of 3 GNN layers. The output features of the two GNN networks are f₁ and f₂. Together with the cross-fusion feature f₃ they are concatenated into f₁+f₂+f₃, the overall feature of the small molecule-protein pair used for prediction. This feature is fed into a fully connected layer with output dimension 128 and then into a second fully connected layer with output dimension 1, which gives the affinity value predicted by the network. The specific structural parameters of the GNN layers and the parameters of the other network layers are obtained from the network model optimization and training process described below.

(3) Optimization of the virtual screening network model by the LSTM controller

Model optimization means using reinforcement learning within a defined parameter space to obtain the best structural parameters of the two GNNs together with the parameters of the other neurons in the network. For a GNN structure M, the following parameters must be determined: the sampling function (S), the correlation measure function (Att), the aggregation function (Agg), the number of attention heads (K), the output hidden embedding dimension (Dim) and the activation function (Act).

The parameters and their possible values are as follows:

1. Output hidden embedding (Dim). Dim is the output dimension of each GNN layer, an integer.

2. Sampling function (S). Each GNN layer needs a sampling function, which selects the receptive field of a given target node in the graph neural network. Three are used in this method: a. fixed-size neighbor sampling, b. importance sampling, c. first-order neighbor ranking.

3. Correlation measure function (Att) and number of attention heads (K). For each GNN layer, an Att method and a number of attention heads K are chosen. Att has two options, GAT and GCN, corresponding to the GAT and GCN network structures. The GCN network comprises 2 graph convolution layers, 1 ReLU activation layer and 1 Dropout layer. The GAT network comprises K multi-head graph attention layers, 1 Softmax activation layer and 1 Dropout layer. GAT assigns neighborhood importance through attention layers, whereas GCN assigns it according to node degree.

4. Aggregation function (Agg). Each GNN layer needs an aggregation function. The options are: a. sum aggregator, b. mean aggregator, c. pooling aggregator.

5. Activation function (Act). Each GNN layer needs an activation function, which increases the nonlinear fitting capacity of the graph network and improves the expressive power of the model. The options are: a. ReLU, b. LeakyReLU, c. ELU, d. Linear, e. Softmax.

Network model optimization trains the network with an LSTM controller neural network and consists of two steps. First, the LSTM predicts the [S, Att, Agg, Act, K, Dim] operations for GNN1. Each prediction is made by a softmax classifier of the LSTM, and the predicted value is fed into the next time step to obtain the next parameter prediction. When the number of GNN layers reaches 3, the LSTM controller has generated one architecture; the process is repeated to generate the parameters of GNN2. The whole prediction network is then built and trained to obtain the weights of the GNN networks and the other network layers. Second, the parameters of the LSTM are optimized by reinforcement learning, based on the accuracy obtained after network training, to yield an improved controller model. The two steps alternate for a set number of iterations, after which the final screening network model is obtained.
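The architecture-generation step can be sketched as follows. This covers only the sampling of one candidate architecture: uniform random choices stand in for the softmax outputs of the trained LSTM controller, and the training of the child network and the reinforcement-learning update are omitted. The options for S, Att, Agg and Act follow the description above, while the candidate values for K and Dim, like all names in the code, are assumptions for illustration.

```python
import random

# Search space: S, Att, Agg and Act options follow the text; the candidate
# values for K and Dim are assumptions made for this illustration.
SEARCH_SPACE = {
    "S": ["fixed_neighbors", "importance", "first_order_ranking"],
    "Att": ["GAT", "GCN"],
    "Agg": ["sum", "mean", "pool"],
    "Act": ["relu", "leaky_relu", "elu", "linear", "softmax"],
    "K": [1, 2, 4, 8, 16],
    "Dim": [64, 128, 256],
}
NUM_LAYERS = 3  # each GNN consists of 3 GNN layers


def sample_gnn(rng):
    """Sample one 3-layer GNN description, one decision per parameter per layer."""
    return [
        {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
        for _ in range(NUM_LAYERS)
    ]


def sample_architecture(rng):
    """Sample GNN1 first, then repeat the same process for GNN2."""
    return {"GNN1": sample_gnn(rng), "GNN2": sample_gnn(rng)}


rng = random.Random(0)  # random sampling stands in for the LSTM softmax outputs
arch = sample_architecture(rng)
```

In the full procedure each sampled architecture would be built and trained, and its validation accuracy would serve as the reward used to update the controller.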

Model training is described further below:

Model training uses the public KIBA dataset. The dataset contains 229 unique proteins and 2111 unique drugs, with 118254 protein-drug affinity pairs. For training, the dataset is split into training, validation and test sets in the ratio 80%:10%:10%.
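The 80%:10%:10% split can be sketched as below. The text does not say whether the split is random or how rounding is handled, so the seeded shuffle and the choice to give the remainder to the test set are assumptions; `split_dataset` is a hypothetical helper, and integer indices stand in for the protein-drug pairs.

```python
import random


def split_dataset(pairs, seed=0):
    """Shuffle the samples and split them 80% / 10% / 10%."""
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder after rounding goes to the test set
    return train, val, test


# Indices 0..118253 stand in for the 118254 affinity pairs.
train, val, test = split_dataset(range(118254))
```

With 118254 pairs this yields 94603 training, 11825 validation and 11826 test samples.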

Training parameters: the controller is an LSTM network with 100 hidden units. It is trained with the ADAM optimizer at a learning rate of 0.0035. The controller samples graph network structures to generate child models, each of which is trained for 200 epochs. During training, L2 regularization with λ = 0.0005 is applied. In addition, Dropout with p = 0.5 is applied to the inputs of both layers and to the normalized attention layer.

The LSTM has three layers: an input layer, a hidden layer and an output layer. The input dimension is 6*1, the hidden layer has 100 neurons, and the time step is 10.

The LSTM network parameters are set as follows:

Layer1: LSTM(input_size=8, hidden_size=100, num_layers=2)

Layer2: Dropout(p=0.5)

Layer3: Linear(hidden_size=500, n_class=1)

Here input_size is the dimension of the input data; hidden_size is the output dimension; num_layers is the number of stacked LSTM layers (default 1); and n_class is the output dimension of the LSTM network, where 1 means a regression value is output.

After the controller has been trained 1000 times, it outputs the best model from 200 sampled GNNs. The results show that this optimization method can produce the best-performing variant of the original virtual screening model.

Experimental results

The structure of the best virtual screening model after optimization and selection is as follows:

GNN1 structure:

Layer1: molecular graph attention layer GAT1 (input dimension = n*n, output dimension = 128, hidden units = 128, attention heads = 4, activation function = elu(), aggregation function = sum()).

Layer2: molecular graph convolution layer GCN2 (input dimension = 128, output dimension = 256, hidden units = 256, attention heads = 4, activation function = relu(), aggregation function = max()).

Layer3: molecular graph attention layer GAT3 (input dimension = 256, output dimension = 128, hidden units = 128, attention heads = 8, activation function = elu(), aggregation function = avg()).

GNN2 structure:

Layer1: protein graph convolution layer GCN4 (input dimension = m*m, output dimension = 64, hidden units = 64, attention heads = 16, activation function = relu(), aggregation function = max()).

Layer2: protein graph attention layer GCN5 (input dimension = 64, output dimension = 256, hidden units = 256, attention heads = 4, activation function = elu(), aggregation function = pooling()).

Layer3: protein graph convolution layer GCN6 (input dimension = 256, output dimension = 256, hidden units = 256, attention heads = 16, activation function = relu(), aggregation function = max()).

Cross network: input dimension = 420, output dimension = 128, number of layers n = 5.

Feature concatenation layer: Concat1 (input dimensions = (128, 256, 128), output dimension = 512).

Dropout layer: (512, 512), rate p = 0.5.

Fully connected layer: Linear(512, 128).

Fully connected layer: Linear(128, 1).

After 300 iterations the network reached a good state, and the corresponding parameters were saved for the virtual screening of quorum sensing lead compounds of R. solanacearum.

总之，本发明提供了一种基于PhcA和PhcR蛋白结构的群体感应先导化合物的筛选方法，可用于发现新的具有群体感应活性的化合物，为青枯菌等细菌的控制和防治提供新的思路和手段。该方法利用计算机辅助虚拟筛选技术和分子生物学技术相结合，可以高效地筛选出与PhcA和PhcR蛋白结合的化合物，从而发现具有群体感应活性的化合物。该方法具有操作简便、高效快速、筛选准确性高等优点，可广泛应用于生物医药和农业等领域。In a word, the present invention provides a screening method for quorum sensing lead compounds based on the protein structure of PhcA and PhcR, which can be used to discover new compounds with quorum sensing activity, and provide new ideas and ideas for the control and prevention of bacteria such as Ralstonia solanacearum. means. The method combines computer-aided virtual screening technology and molecular biology technology to efficiently screen compounds that bind to PhcA and PhcR proteins, thereby discovering compounds with quorum sensing activity. The method has the advantages of simple operation, high efficiency and rapidity, and high screening accuracy, and can be widely used in fields such as biomedicine and agriculture.

值得注意的是，本发明中的PhcA和PhcR是青枯菌中的两个关键蛋白，其结构和功能在细菌的群体感应中起着重要作用。然而，在其他细菌中，可能存在不同的群体感应蛋白，因此需要根据不同的细菌种类进行筛选和研究，以便发现适用于不同细菌的群体感应先导化合物。It is worth noting that PhcA and PhcR in the present invention are two key proteins in Ralstonia solanacearum, and their structure and function play an important role in quorum sensing of bacteria. However, in other bacteria, there may be different quorum-sensing proteins, so it is necessary to screen and study according to different bacterial species in order to find quorum-sensing lead compounds suitable for different bacteria.

以上显示和描述了本发明的基本原理、主要特征和优点。本领域的普通技术人员应该了解，上述实施例不以任何形式限制本发明的保护范围，凡采用等同替换等方式所获得的技术方案，均落于本发明的保护范围内。The basic principles, main features and advantages of the present invention have been shown and described above. Those of ordinary skill in the art should understand that the above-mentioned embodiments do not limit the protection scope of the present invention in any form, and all technical solutions obtained by means of equivalent replacement or the like fall within the protection scope of the present invention.

The parts not covered by the present invention are the same as the prior art or can be implemented using the prior art.

Claims

1. A virtual screening method for quorum sensing lead compounds, characterized in that a molecular compound structure and a protein sequence are fed as input to a preprocessing module to extract preliminary features, which are then fed to a prediction model network, wherein the structure and parameters of the prediction model are generated through training with an LSTM controller; the specific process is as follows:

The input molecular compound structure is preprocessed to construct a molecular adjacency matrix, which is sent to the GNN1 network to generate compound features;

From the input protein sequence, extract its protein amino acid composition and dipeptide frequency to form a preliminary protein feature vector, send it to the cross network to generate cross fusion features; at the same time, generate a corresponding contact map for the protein sequence, and then send it to the GNN2 network to generate a protein sequence feature;

Finally, the combination of the three features is sent to the fully connected layer to predict the affinity value.

2. The virtual screening method for quorum sensing lead compounds according to claim 1, characterized in that the constructed molecular adjacency matrix is denoted X₁, in which the element for a pair of adjacent atoms in the molecular structure graph is 1 and the element for a non-adjacent pair is 0; the matrix has size n×n, where n is the number of nodes in the structure graph, that is, the number of all atoms.
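The adjacency matrix construction described in this claim can be illustrated with a minimal sketch. The function name `adjacency_matrix`, the bond-list input format and the water molecule example are assumptions made for illustration; the source does not specify how the molecular structure is parsed.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build an n x n 0/1 adjacency matrix from bonded atom index pairs.

    `bonds` is a hypothetical input format: a list of (i, j) index pairs,
    one per chemical bond. Bond order is ignored, as the claim only
    distinguishes adjacent (1) from non-adjacent (0) atoms.
    """
    x1 = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        x1[i][j] = 1  # atoms i and j are adjacent
        x1[j][i] = 1  # the molecular graph is undirected, so mirror the entry
    return x1


# Water, counting all atoms: O (index 0), H (index 1), H (index 2).
X1 = adjacency_matrix(3, [(0, 1), (0, 2)])
for row in X1:
    print(row)
```

In practice a cheminformatics toolkit would normally derive the bond list from a molecular file or SMILES string; that step is outside this sketch.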

3. The virtual screening method for quorum sensing lead compounds according to claim 1, characterized in that the protein sequence is processed with the Pconsc4 software, which outputs a probability matrix of size m×m indicating whether each residue pair is in contact; values in the matrix greater than 0.5 are kept and all other values are set to 0, and the filtered matrix is the protein contact map X₂, where m is the number of residues.
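The thresholding step of this claim can be sketched as follows. The probability matrix `P` below is made-up illustrative data, not real Pconsc4 output, and the function name `contact_map` is an assumption; how the matrix is obtained from Pconsc4 is not shown.

```python
def contact_map(prob, threshold=0.5):
    """Keep contact probabilities above `threshold`; set all others to 0."""
    m = len(prob)
    return [
        [prob[i][j] if prob[i][j] > threshold else 0.0 for j in range(m)]
        for i in range(m)
    ]


# Hypothetical 3-residue probability matrix (illustrative values only).
P = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
X2 = contact_map(P)
for row in X2:
    print(row)
```

Note that the kept entries retain their probability values rather than being binarized, which is one reading of "values greater than 0.5 are kept".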

4. The virtual screening method for quorum sensing lead compounds according to claim 1, characterized in that the amino acid composition is the frequency with which each of the 20 amino acids occurs in the sequence, and the dipeptide frequency is the frequency with which each amino acid pair formed by any two amino acids occurs.
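A minimal sketch of these two descriptors is given below. It assumes that "amino acid pair" means two adjacent residues in the sequence and that both descriptors are normalized counts, giving 20 + 400 = 420 values; the function names and the example sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def amino_acid_composition(seq):
    """Fraction of the sequence made up by each of the 20 amino acids."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_frequency(seq):
    """Fraction of adjacent residue pairs equal to each of the 400 dipeptides."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = {}
    for pair in pairs:
        counts[pair] = counts.get(pair, 0) + 1
    return [counts.get(a + b, 0) / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]


seq = "MKVLAAGV"  # hypothetical short sequence, for illustration only
x3 = amino_acid_composition(seq) + dipeptide_frequency(seq)
print(len(x3))  # 20 composition values followed by 400 dipeptide values
```

Concatenating the two lists yields a preliminary protein feature vector of the kind that is fed to the cross network.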

5. The virtual screening method for quorum sensing lead compounds according to claim 1, characterized in that the prediction model network consists of the GNN1 network, the GNN2 network and the cross network connected in parallel; after merging, their outputs are sent to a concatenation layer and a dropout layer, followed by two fully connected layers.

6. The virtual screening method for quorum sensing lead compounds according to claim 1, characterized in that the cross network consists of 5 cross layers connected in series, followed by a 128-dimensional fully connected layer whose output is f₃; each cross layer is given by the following formula:

$$C_{l+1} = X_0 C_l^{T} W_{c,l} + b_{c,l} + C_l$$

where $l = 1, 2, \ldots, 5$; $C_l$ and $C_{l+1}$ are the outputs of the $l$-th and $(l+1)$-th cross layers, respectively; $C_0$ is $X_3$, the combination of the amino acid composition and the dipeptide frequency; and $W_{c,l}$ and $b_{c,l}$ are the connection parameters between the two layers. All variables in the above formula are column vectors. The output of each layer is the output of the previous layer plus the feature cross.
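The cross layer formula can be sketched in plain Python as follows. This is an illustrative sketch, assuming that X₀ in the formula denotes the same input vector as C₀ and that, because all variables are column vectors, the product of the transposed layer output with W reduces to a scalar. The function names and the toy weights are hypothetical, and the final 128-dimensional fully connected layer is omitted.

```python
def cross_layer(x0, c_l, w, b):
    """One cross layer: C_{l+1} = X_0 * (C_l^T W) + b + C_l."""
    scalar = sum(c_i * w_i for c_i, w_i in zip(c_l, w))  # C_l^T W is a scalar
    return [x0_i * scalar + b_i + c_i
            for x0_i, b_i, c_i in zip(x0, b, c_l)]


def cross_network(x0, weights, biases):
    """Apply the cross layers in series, starting from C_0 = X_0."""
    c = list(x0)
    for w, b in zip(weights, biases):
        c = cross_layer(x0, c, w, b)
    return c


x0 = [1.0, 2.0]  # toy 2-dimensional input; the real input would be X_3
one_step = cross_layer(x0, x0, [0.5, 0.5], [0.0, 0.0])
print(one_step)  # [2.5, 5.0]

# Five layers in series, as in the claim, with toy weights.
five_steps = cross_network(x0, [[0.1, 0.1]] * 5, [[0.0, 0.0]] * 5)
print(five_steps)
```

The trailing `+ c_i` term is the residual connection, which is why each layer's output is the previous layer's output plus the feature cross.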

7. The virtual screening method for quorum sensing lead compounds according to claim 1, 5 or 6, characterized in that the molecular adjacency matrix and the protein contact map are input into two different networks, GNN1 and GNN2, respectively, each composed of 3 GNN layers; the output features of the two GNN networks are f₁ and f₂, which are concatenated with the cross fusion feature f₃ to give f₁+f₂+f₃, the overall feature of the corresponding small molecule-protein pair; this feature is then sent to a fully connected layer with an output dimension of 128 and then to a second fully connected layer with an output dimension of 1, whose output is the affinity value predicted by the network.
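The concatenation and the two fully connected layers can be sketched as below. Several details are assumptions not given in the source: the ReLU activation after the first layer, the random weight initialization, the dimensions of f₁ and f₂ (4 each here), and all function names. Dropout is omitted, as it would be at inference time, and the GNN layers that produce f₁ and f₂ are not shown.

```python
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def init_params(in_dim, hidden_dim=128, seed=0):
    """Random toy weights; a trained network would supply these instead."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "w2": [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]],
        "b2": [0.0],
    }


def predict_affinity(f1, f2, f3, params):
    fused = f1 + f2 + f3  # list concatenation, i.e. the splicing of features
    hidden = relu(dense(fused, params["w1"], params["b1"]))  # output dim 128
    return dense(hidden, params["w2"], params["b2"])[0]      # output dim 1


f1 = [0.1] * 4    # stand-in for the GNN1 compound feature
f2 = [0.2] * 4    # stand-in for the GNN2 protein feature
f3 = [0.05] * 128  # stand-in for the 128-dimensional cross fusion feature
params = init_params(len(f1) + len(f2) + len(f3))
affinity = predict_affinity(f1, f2, f3, params)
print(affinity)
```

Here `+` on Python lists is concatenation rather than element-wise addition, which matches the splicing described in the claim.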

8. The virtual screening method for quorum sensing lead compounds according to claim 1, characterized in that the virtual screening network model is optimized by an LSTM controller; model optimization uses reinforcement learning to obtain, within a defined parameter space, the optimal structural parameters of the two GNNs and the parameters of the other neurons in the entire network; for a GNN structure M, the following parameters need to be determined: the sampling function (S), the correlation measurement function (Att), the aggregation function (Agg), the number of attention heads (K), the output hidden embedding dimension (Dim) and the activation function (Act).

9. The virtual screening method for quorum sensing lead compounds according to claim 8, characterized in that the optimization consists of two steps: first, the LSTM predicts the operations for S, Att, Agg, Act, K and Dim of GNN1, each prediction being made by the softmax classifier of the LSTM, and the predicted value is then input at the next time step to obtain the next parameter prediction; when the number of GNN layers reaches 3, the LSTM controller has completed the generation of one architecture; this process is repeated to generate the parameters of GNN2; the entire prediction network is then built and trained to obtain the weight parameters of the GNN networks and the other network layers; second, the parameters of the LSTM are optimized based on the accuracy obtained after network training, to obtain an optimized controller model; the two steps are executed alternately for a certain number of steps to obtain the final screening network model.
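The alternating sample-then-update loop can be illustrated with a deliberately simplified sketch. It is not the controller of the claims: the LSTM is replaced by a table of independent softmax logits, the update rule is a REINFORCE-style policy gradient (the source says only "reinforcement learning"), the option lists in `SEARCH_SPACE` are placeholders, and the reward value stands in for the accuracy obtained by training the prediction network.

```python
import math
import random

# Placeholder option lists; the source names only the six parameters.
SEARCH_SPACE = {
    "S": ["first_order"],
    "Att": ["const", "gat", "cos"],
    "Agg": ["sum", "mean", "max"],
    "Act": ["relu", "tanh", "elu"],
    "K": [1, 2, 4],
    "Dim": [16, 32, 64],
}
ORDER = ["S", "Att", "Agg", "Act", "K", "Dim"]  # prediction order per layer
NUM_LAYERS = 3


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_architecture(logits, rng):
    """Sample one option index per parameter for each of the 3 GNN layers."""
    arch = []
    for layer in range(NUM_LAYERS):
        for name in ORDER:
            probs = softmax(logits[(layer, name)])
            idx = rng.choices(range(len(probs)), weights=probs)[0]
            arch.append((layer, name, idx))
    return arch


def reinforce_update(logits, arch, reward, baseline, lr=0.1):
    """Raise the log-probability of sampled choices when reward > baseline."""
    advantage = reward - baseline
    for layer, name, idx in arch:
        probs = softmax(logits[(layer, name)])
        for k, p in enumerate(probs):
            grad = (1.0 if k == idx else 0.0) - p  # d log p(idx) / d logit_k
            logits[(layer, name)][k] += lr * advantage * grad


rng = random.Random(0)
logits = {(layer, name): [0.0] * len(SEARCH_SPACE[name])
          for layer in range(NUM_LAYERS) for name in ORDER}
arch = sample_architecture(logits, rng)
print([(layer, name, SEARCH_SPACE[name][idx]) for layer, name, idx in arch])
```

In a full loop, each sampled architecture would be built and trained, its validation accuracy passed as `reward` to `reinforce_update`, and the two steps repeated; the same sampling would be run a second time to describe GNN2.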

10. The application of a virtual screening method for quorum sensing lead compounds in Ralstonia solanacearum.