Disclosure of Invention
The invention aims to solve the defects of the prior art and provides a 'drug-target' interaction prediction method based on a neighbor attention network.
Therefore, the invention firstly carries out intensive research and analysis on the problems in the prior art and discovers that:
1. The embedded representations in existing DTI prediction methods lack interpretability: the drug/target embeddings learned by existing deep learning (DL) or matrix factorization (MF) methods are difficult to interpret, the latent space they generate provides no tractable way to indicate how its dimensions affect the interaction, and their black-box nature prevents direct guidance of drug design.
2. Existing prediction models are very sensitive to missing labels: in practice the collection of "drug-target" labels is incomplete, yet existing methods rarely account for missing interaction labels between "drug-target" pairs or consider whether the missing interactions contribute to DTI prediction.
3. Existing prediction modes have difficulty predicting interactions of new compound molecules/proteins. At present there are two main prediction modes, namely direct-push (transductive) prediction and inductive prediction.
The task of direct-push prediction is to construct a function mapping F: D×T → [0,1] to infer potential interactions between unlabeled "drug-target" pairs; the features or similarities of the drugs and targets are used to learn the function F. Inductive prediction corresponds to the well-known cold-start problem in recommender systems; its task is likewise to learn a functional mapping F: D×T → [0,1], but F must infer the potential interaction between a new drug molecule D_x and a new target protein T_y, or between D_x and a known target T_y, where D_x and T_y were not involved in learning F. However, almost all current similarity-based DTI prediction methods belong to direct-push prediction: they extract topological embedding features from the DTI network or similarity matrices, and the training phase uses labeled training samples and unlabeled test samples simultaneously, so when the labels of new samples must be determined in practice the model has to be trained again, which cannot meet the requirements of current drug development.
Therefore, in order to achieve the above object, the technical solution provided by the present invention is:
the 'drug-target' interaction prediction method based on the neighbor attention network is characterized by comprising the following steps of:
1) construction of a model for predicting drug-target pair interactions
The drug-target pair interaction prediction model consists of a neighbor attention module and a deep neural network module;
2) collecting sample data, and training the 'drug-target' pair interaction prediction model constructed in the step 1) to obtain a trained 'drug-target' pair interaction prediction model;
the sample data comprises relevant data of the drug and the target and real interaction of the drug and the target; the specific training process is as follows:
2.1) calculating the similarity between every two of all the drug molecules and the similarity between every two of all the target proteins by using the related data, and constructing an interaction relation matrix A of the drug molecules and the target proteins;
wherein the related data comprises the structural information of the drug molecules, the sequence information of the target protein and the interaction relation information of the drug molecules and the target protein;
2.2) constructing a TsDNA module by utilizing the interaction relation matrix A of the drug molecules and the target proteins obtained in step 2.1) and the similarity data between drug molecules, and extracting the embedded representation between a target protein and all drug molecules, namely the feature vector indicating whether a connecting edge exists;
and/or
constructing a DsTNA module by utilizing the interaction relation matrix A obtained in step 2.1) and the similarity data between target proteins, and extracting the embedded representation between a drug molecule and all target proteins, namely the feature vector indicating whether a connecting edge exists;
wherein the embedded representation of a drug d_x with respect to a target t_p is extracted by the TsDNA module as follows:
a1. Sort all drugs by their similarity to the drug d_x from high to low, obtaining the virtual keys K_1, K_2, …, K_m;
a2. Obtain all drugs that interact with the target t_p, eliminating the non-interacting drugs;
a3. Obtain the attention of the drug d_x with respect to the target t_p, expressed as follows:
a(d_x, t_p) = Σ_i s(d_x, k̃_i) · v_i
wherein k̃_i is an assigned key, which is a drug interacting with t_p; K̃ is the series of assigned keys; v_i is the value vector of k̃_i; and s(d_x, k̃_i) is the similarity of d_x and k̃_i;
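The extraction steps a1-a3 can be sketched as follows; this is a minimal illustration assuming the attention is the similarity-weighted sum of one-hot value vectors, so each kept neighbor contributes its similarity score at the position of its assigned key (the function name `tsdna_attention` and the array layout are illustrative, not from the patent):

```python
import numpy as np

def tsdna_attention(x, p, drug_sim, interactions, num_keys):
    """Sketch of TsDNA neighbor attention for drug d_x and target t_p.

    drug_sim:     (n_drugs, n_drugs) pairwise drug similarity matrix
    interactions: (n_drugs, n_targets) binary interaction matrix A
    num_keys:     size |K| of the virtual-key dictionary
    """
    # a1: rank all drugs by similarity to d_x, from high to low
    order = np.argsort(-drug_sim[x])
    # a2: keep only drugs known to interact with t_p (d_x itself excluded)
    neighbors = [d for d in order if d != x and interactions[d, p] == 1]
    # a3: assign each kept neighbor to the next virtual key; since the value
    # v_i of key i is one-hot, the attention vector stays sparse
    a = np.zeros(num_keys)
    for i, d in enumerate(neighbors[:num_keys]):
        a[i] = drug_sim[x, d]  # weight = similarity s(d_x, k_i)
    return a
```

With this layout, the nearest interacting neighbor of d_x always lands in dimension 0, which is what makes the embedding interpretable.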
The DsTNA module extracts the embedded representation of a target t_p with respect to a drug d_x as follows:
b1. Sort all targets by their similarity to the target t_p from high to low, obtaining the virtual keys H_1, H_2, …, H_m;
b2. Obtain all targets that interact with the drug d_x, eliminating the non-interacting targets;
b3. Obtain the attention of the target t_p with respect to the drug d_x, expressed as follows:
a(t_p, d_x) = Σ_i s(t_p, h̃_i) · u_i
wherein h̃_i is an assigned key, which is a target interacting with d_x; H̃ is the series of assigned keys; u_i is the value vector of h̃_i; and s(t_p, h̃_i) is the similarity of t_p and h̃_i;
The embedded representation of the pair of drug d_x and target t_p is generated by concatenating the two one-way representations:
e(d_x, t_p) = [a(d_x, t_p) || a(t_p, d_x)];
For the extraction of a new drug's embedded representation, since the DsTNA module cannot be constructed for the new data, only TsDNA is constructed when building the test set and training set, in order to keep the data balanced;
for the extraction of a new target's embedded representation, since the TsDNA module cannot be constructed for the new data, only DsTNA is constructed when building the test set and training set, in order to keep the data balanced;
that is, e(d_x, t_p) = [a(d_x, t_p)] or e(d_x, t_p) = [a(t_p, d_x)].
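A small sketch of the pair representation, assuming each one-way attention has already been computed as a vector; passing `None` for the unavailable side models the new-drug/new-target cases (names are illustrative):

```python
import numpy as np

def pair_embedding(a_dt, a_td):
    """e(d_x, t_p): concatenate the two one-way attentions.

    For a new drug (no DsTNA) or a new target (no TsDNA), pass None for
    the unavailable side; the pair then falls back to the single
    one-way representation, as described in the text."""
    parts = [v for v in (a_dt, a_td) if v is not None]
    if not parts:
        raise ValueError("at least one one-way representation is required")
    return np.concatenate(parts)
```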
2.3) processing the embedded representations obtained in step 2.2) with the feature importance network:
S1, performing step 2.2) on all drugs and targets, and stacking the obtained embedded expressions of the drug-target pairs together to obtain a matrix E;
s2, constructing a mapping attention matrix M for the matrix E obtained in the step S1 through a deep neural network;
S3, constructing the attention-enhanced representation matrix Ê = E ⊙ M from the matrix M obtained in step S2, where ⊙ denotes the element-wise product, which facilitates identification by the subsequent classifier;
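Steps S1-S3 can be sketched as follows, under the assumption (ours; the text omits the exact formula) that the enhanced matrix is the element-wise product of E and M; each per-dimension network DNN_i is stood in for by an arbitrary callable:

```python
import numpy as np

def importance_matrix(E, column_nets):
    """S2: M(:, i) = DNN_i(E), one small network per embedded dimension.

    Here each element of column_nets is any callable mapping the stacked
    embedding matrix E to a column of importance scores."""
    return np.stack([net(E) for net in column_nets], axis=1)

def enhance(E, M):
    """S3 (assumed form): element-wise reweighting of E by the learned
    importance scores -- a sketch, not the patent's stated formula."""
    return E * M
```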
2.4) inputting the attention-enhanced representation matrix obtained in step 2.3) into a deep neural network model as its input layer to obtain the predicted drug-target interactions;
2.5) comparing the predicted drug-target interactions obtained in step 2.4) with the real drug-target interactions, and learning the model's weight values through back propagation, to obtain the trained "drug-target" pair interaction prediction model.
That is, the training of the prediction model uses an interpretable deep-learning model, NNAttNet, which comprises three modules: a neighbor attention module, a feature importance network, and a multi-layer deep neural network. For "drug-target" pairs, the first module generates interpretable embedded representations that are more robust to missing labels in the training data and are feasible in both the direct-push and inductive prediction scenarios. In addition, the algorithm adapts to both feature inputs and similarity inputs. The second module, the feature importance network, represents the importance of each dimension of the embedded features and provides interpretable feature selection; it is one of the steps built on the neighbor attention module. The last module distinguishes whether a "drug-target" pair is a potential DTI.
3) Predicting interactions by using the "drug-target" interaction prediction model trained in step 2).
Further, in the step 2.1), the similarity between every two of all the drug molecules is calculated by using the acquired structural information of the drug molecules and adopting a SIMCOMP method;
and calculating the similarity between every two target proteins by using the sequence information of the collected target proteins and adopting a Smith-Waterman algorithm.
Further, the SIMCOMP method is as follows:
SIMCOMP provides a global similarity score based on the size of the common substructure between two drug compounds using a graph alignment algorithm, wherein the similarity s(c, c') of compounds c and c' is calculated as follows:
s(c, c') = |c ∩ c'| / |c ∪ c'|
where |c ∩ c'| is the size of the common substructure of c and c' found by graph alignment, and |c ∪ c'| is the size of their union.
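The global score has the Jaccard form |c ∩ c'| / |c ∪ c'|; a simplified stand-in that applies this form to sets of substructure keys, skipping the graph-alignment step that SIMCOMP actually performs, looks like:

```python
def tanimoto(fp_a, fp_b):
    """Jaccard/Tanimoto score over substructure-key sets, the same
    |c ∩ c'| / |c ∪ c'| form SIMCOMP applies to the maximal common
    subgraph.  Sets of keys stand in for the aligned graphs here."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # two empty structures are trivially identical
    return len(fp_a & fp_b) / len(union)
```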
further, the Smith-Waterman algorithm is specifically as follows:
Two target sequences to be aligned are defined as A = a_1a_2a_3…a_n and B = b_1b_2b_3…b_m, where n and m are the lengths of sequences A and B, respectively;
determining parameters:
s is the score when the aligned elements of the sequences are identical;
W_k is the gap penalty for a gap of length k;
creating a score matrix H of size (n+1) × (m+1) and initializing its first row and first column to zero;
scoring from left to right and top to bottom, filling the remainder of the scoring matrix H, where:
H(i, j) = max{ 0, H(i-1, j-1) + s(a_i, b_j), max_k [H(i-k, j) - W_k], max_l [H(i, j-l) - W_l] }
and selecting the highest-scoring entry in the score matrix H as the matching score of sequences A and B, denoted SW(A, B).
The similarity between sequence A and sequence B is:
s(A, B) = SW(A, B) / √(SW(A, A) · SW(B, B))
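A minimal implementation of the algorithm above, simplified to a linear gap penalty W_k = k·g (the general algorithm allows an arbitrary W_k), together with the normalized similarity:

```python
import math

def smith_waterman(A, B, match=3, mismatch=-3, gap=2):
    """Local alignment score SW(A, B).  With a linear gap penalty,
    single-step gap moves reproduce max_k {H(i-k, j) - W_k}."""
    n, m = len(A), len(B)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # (n+1) x (m+1), zero borders
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if A[i - 1] == B[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # match / mismatch
                          H[i - 1][j] - gap,    # gap in B
                          H[i][j - 1] - gap)    # gap in A
            best = max(best, H[i][j])
    return best

def normalized_sw(A, B, **kw):
    """Normalized target similarity SW(A,B) / sqrt(SW(A,A) * SW(B,B))."""
    return smith_waterman(A, B, **kw) / math.sqrt(
        smith_waterman(A, A, **kw) * smith_waterman(B, B, **kw))
```

The normalization bounds the similarity by 1, reached when a sequence is aligned against itself.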
Further, in step S2, the mapped attention matrix M is constructed column by column as follows:
M(:, i) = DNN_i(E).
Further, in step S3, the attention-enhanced representation matrix Ê is specifically:
Ê = E ⊙ M
where ⊙ denotes the element-wise product.
Further, in step 2.4), the deep neural network model includes an input layer, hidden layers using ReLU as the activation function, and an output layer of two neurons using Sigmoid as the activation function; the deep neural network model acts as a binary predictor, with the output layer producing a probability representing the likelihood of drug-target pair interaction. The entire NNAttNet network, with its neighbor attention weights, feature importance terms, and DNN weights, can be jointly optimized by a binary cross-entropy loss function, as follows:
L(θ) = -Σ [ Y log f(E; θ) + (1 - Y) log(1 - f(E; θ)) ] + λ R(θ)
wherein Y is the true label of the drug-target pair; f(·) is the DNN; θ is the weight parameter of the entire network; R(·) is the L2-norm; and λ is the coefficient of the regularization term.
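The objective can be sketched numerically as binary cross-entropy plus an L2 penalty (a sketch of the loss only, with `theta` as a list of weight arrays; the function name is illustrative):

```python
import numpy as np

def nnattnet_loss(Y, f, theta, lam=1e-3):
    """Binary cross-entropy over all drug-target pairs plus the L2 penalty
    lambda * R(theta) on the network weights.

    Y:     array of true 0/1 labels
    f:     array of predicted interaction probabilities
    theta: list of weight arrays of the entire network
    """
    eps = 1e-12
    f = np.clip(f, eps, 1 - eps)  # guard log(0)
    bce = -np.mean(Y * np.log(f) + (1 - Y) * np.log(1 - f))
    l2 = sum(np.sum(w ** 2) for w in theta)
    return bce + lam * l2
```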
The present invention also provides a computer-readable storage medium having stored thereon a computer program characterized in that: which when executed by a processor implements the steps of the above-described method.
An electronic device, characterized in that: including a processor and a computer-readable storage medium;
the computer-readable storage medium has stored thereon a computer program which, when being executed by the processor, performs the steps of the above-mentioned method.
The invention has the advantages that:
The invention provides a deep-learning-based prediction method whose prediction model is a neighbor attention network (NNAttNet). It addresses the above problems by constructing neighbor-based embedded representations of drug-target pairs (DTPs); the method makes drug-protein interactions interpretable, reduces the influence of missing DTI entries, and provides a unified representation for direct-push and inductive prediction. In addition, NNAttNet provides attention-based selection of key features to predict DTIs more accurately, and evaluation of NNAttNet on benchmark datasets shows that it has better DTI prediction performance.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
an embodiment of the "drug-target" interaction prediction method based on the neighbor attention network proposed by the present invention is specifically as follows:
the present embodiment includes three parts:
In the first part, when the test set is constructed, a portion of the connecting edges in the network are deleted and then predicted.
In the second part, some drugs and all their links in the network are deleted in order to simulate the scenario of new drug prediction.
In the third part, some targets and all their connecting edges in the network are deleted, in order to simulate the scenario of new-target prediction.
In the result statistics and presentation, the performance metrics of the overall results are reported; no specific drug (target) is singled out.
This example uses a benchmark dataset in the predictive performance comparison experiments, dividing the proteins into 4 subsets based on the characteristics of the protein families: enzymes (En), ion channels (IC), G-protein-coupled receptors (GPCR), and nuclear receptors (NR). Each subset includes known "drug-target" interactions, pairwise similarities between drugs, and pairwise similarities between targets, where the pairwise similarities between drugs are computed by the SIMCOMP algorithm and the pairwise similarities between targets by the Smith-Waterman algorithm. The details of this dataset are shown in Table 1.
TABLE 1 details of the reference data set
A dictionary K of virtual key types is set up for each data subset; values are assigned to the dictionary using the collected data; the TsDNA and DsTNA modules between drug molecules and target proteins are constructed; and finally the embedded representations of all drug-target pairs are obtained.
TsDNA (see FIG. 2) consists of a dictionary K of virtual key types and values v describing these virtual keys. In the dictionary, the virtual keys are sorted by semantic adjacency: in short, the first key is the nearest neighbor, the second key is the second-nearest neighbor, and the last key is the farthest neighbor. Notably, this ordering is what makes the classification distinction between known DTIs and unknown DTIs explainable.
When considering whether a drug d_x interacts with a target protein t_p, the empty keys are assigned with the other known drugs that have an interaction relationship with the target t_p. In contrast, no operation is performed for those drugs that have no interaction with t_p. The attention of d_x with respect to t_p can then be defined as:
a(d_x, t_p) = Σ_i s(d_x, k̃_i) · v_i
wherein k̃_i is an assigned key, which is a drug interacting with t_p; K̃ is the series of assigned keys; v_i is the value vector of k̃_i; and s(d_x, k̃_i) is the similarity of d_x and k̃_i. Note that a virtual key is featureless; only after being assigned does it carry a feature. This attention gives the one-way representation d_x → t_p.
To enhance the interpretability of the TsDNA module, we set V = [v_1, v_2, …, v_|K|] to be a diagonal matrix, where each v_i is a one-hot-like vector, that is, only the i-th element of v_i is nonzero and the other elements are all 0. By this method, the attention vector of d_x → t_p is sparse. Considering the well-accepted hypothesis that similar drugs tend to interact with the same target proteins, it follows that if drug d_x and target t_p have an interaction relationship, their representation possesses more nonzero values in the first few feature dimensions than those of drugs and targets without an interaction relationship. In other words, if d_x and t_p interact, the other drugs interacting with t_p are usually among the first few neighbors of d_x. This sparse attention embedding provides evidence for the later interpretability analysis of TsDNA.
Due to the symmetric roles of the nodes in the two networks, we can similarly construct a DsTNA module that outputs another one-way representation t_p → d_x, denoted a(t_p, d_x); thus the final representation of the pair (d_x, t_p) is generated by concatenating the two one-way representations:
e(d_x, t_p) = [a(d_x, t_p) || a(t_p, d_x)]
The embedded representations of all "drug-target" pairs are stacked together, denoted as the attention matrix E.
The attention matrix E is modeled column by column by the formula M(:, i) = DNN_i(E) to obtain the importance matrix M.
The attention-enhanced representation matrix Ê is then constructed as Ê = E ⊙ M.
A generic DNN is used as a binary predictor to predict whether a "drug-target" pair will interact. The binary predictor comprises an input layer (the embedded representation of a drug-target pair), hidden layers with ReLU as the activation function, and an output layer of two neurons with Sigmoid as the activation function. The output layer produces a probability representing the likelihood of drug-target pair interaction. The entire NNAttNet network, with its neighbor attention weights, feature importance terms, and DNN weights, can be jointly optimized by a binary cross-entropy loss function, as follows:
L(θ) = -Σ [ Y log f(E; θ) + (1 - Y) log(1 - f(E; θ)) ] + λ R(θ)
wherein Y is the true label of the drug-target pair; f(·) is the DNN; θ is the weight parameter of the entire network; R(·) is the L2-norm; and λ is the coefficient of the regularization term.
In this example we evaluated the performance of each method by 10-fold cross-validation (CV), using AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) as the metrics of DTI prediction performance.
In the 10-fold cross-validation, we calculated the AUROC/AUPRC score of each prediction method and obtained the final AUROC/AUPRC score by averaging over the 10 repetitions.
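AUROC can be computed directly from its rank-statistic definition, which is convenient for reproducing fold-averaged scores (a small self-contained sketch, not the evaluation code of the invention):

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney rank statistic: the probability that a
    random positive is scored above a random negative, ties counted 1/2."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Averaging this score over the 10 folds yields the final reported AUROC.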
In order to comprehensively evaluate the performance of each method, the CV test was performed in consideration of the following three scenarios.
Under CVS1, 90% of the DTPs (embedded representation of drug versus neighbor) were used for training, while the remaining 10% were used for each round of testing.
Under CVS2 (or CVS3), 90% of the drug (or target) interactions were used for training, and the remaining 10% of the drug (or target) interactions were used for testing.
CVS2(CVS3) is a cold start DTI prediction because there is no overlap between the training drug (target) and the test drug (target).
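The cold-start splits of CVS2/CVS3 partition entities rather than pairs; a sketch of a drug-disjoint 10-fold split (the function name is illustrative, and CVS3 is obtained by splitting targets the same way):

```python
import random

def drug_disjoint_folds(n_drugs, n_folds=10, seed=0):
    """CVS2-style split: partition the *drugs* (not the pairs) into folds,
    so no test drug appears in training, simulating new-drug prediction."""
    ids = list(range(n_drugs))
    random.Random(seed).shuffle(ids)  # fixed seed for reproducibility
    return [ids[i::n_folds] for i in range(n_folds)]
```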
Notably, CVS1 is a direct-push type of prediction task.
CVS2/CVS3 may be a direct-push or inductive prediction task, depending on the nature of the prediction method. The experimental results of NNAttNet are shown in Tables 2, 3, and 4.
TABLE 2 Performance display of DTI prediction on 4 data sets by CVS1
Note: ROC and PR are abbreviations for AUROC and AUPRC.
TABLE 3 Performance display of DTI prediction on 4 data sets by CVS2
Note: ROC and PR are abbreviations for AUROC and AUPRC.
TABLE 4 Performance display of DTI prediction on 4 data sets by CVS3
Note: ROC and PR are abbreviations for AUROC and AUPRC.
The interpretability of the prediction method in the present invention is explained below based on the experimental results of the present example.
Taking the GPCR dataset as an example, by computing the two mean embedding vectors of known DTIs and unlabeled DTPs, the dictionary distribution of drug key types from K1 to K100 was obtained (see FIG. 4). The significantly high values of the embedding features occurring in the first n nearest neighbors indicate that a drug interacting with a particular target will always find its top n nearest neighbors among the drugs interacting with the same target. This observation indicates that if a drug has more nonzero-value units in the first n feature dimensions (keys) than a non-interacting drug, it is likely to interact with the target.
The present invention also indicates, on this embodiment, which embedded features in the M matrix cause an interaction to occur. Since cells with larger values in M represent important feature dimensions, the importance M(:, i) of each feature f_i can be measured by the average of the values in the i-th column of M (see FIG. 5). The distribution of the importance of the keys in dictionary K shows that features of higher importance are typically located among the top n nearest neighbors. This observation agrees markedly with the visual observation above on the first 10 keys, with a large Spearman correlation (r = 0.8182).
This example investigates the prediction performance of the top-k features (see FIG. 6), with k taking values in {1, 5, 10, 15, …, 220}. The prediction performance increases dramatically as k increases to 50; as k continues to increase, performance grows slowly, and it even decreases when k becomes larger.
One reason NNAttNet still performs well on the missing-label problem is that it utilizes an embedding vector composed of neighboring nodes. We investigated the distribution of feature importance at different missing rates of DTIs (FIG. 7). The figure reveals that the distribution of feature keys shows a similar trend at different missing rates. Meanwhile, the feature importance vectors under the nine missing rates are highly correlated: the Spearman correlation coefficients between the feature importance vector at a 10% missing rate and those at the other missing rates (20%-90%) are 0.9996, 0.9993, 0.9989, 0.9979, 0.9969, 0.9943, 0.9919, and 0.9770, respectively. This high correlation indicates that even with missing data, the feature importance network can still indicate the critical features. Thus, even when labels are missing, if a few interacting drugs are found among the top n neighbors of the queried drug, the ordered key dictionary in the neighbor attention module can still indicate that the queried drug interacts with the target.
The invention demonstrates the feasibility of NNAttNet through the above examples: interpretability of drug-protein interactions, stronger robustness to missing DTI labels, a unified representation for direct-push and inductive DTI prediction, and attention-based selection of important features for more accurate DTI prediction.
Well-known implementations and features of the above-described arrangements are not described in great detail herein. It should be noted that, for those skilled in the art, various modifications can be made without departing from the invention, and these should also be construed as the scope of the invention, which does not affect the effect of the invention and the practicability of the patent. The scope of protection claimed in the present application shall be defined by the claims, and the detailed description and the like in the specification shall be used for explaining the contents of the claims.