2021; Huang et al. 2022; Zhao et al. 2022; Bai et al. 2023). However, these approaches often overlook the enormous amount of unlabeled biomedical data, which hinders the models from fully leveraging the chemical structures and interactions of drugs and proteins. Consequently, the models struggle to extract truly informative features, leading to limited generalization ability.

The second challenge is hidden bias and shortcut learning. The issue of hidden bias has been reported on the DUD-E and MUV datasets (Sieg, Flachsenberg, and Rarey 2019). It has been observed that models trained on the DUD-E dataset (Chen et al. 2019) and other datasets (Chen et al. 2020) tend to rely predominantly on drug patterns when making predictions, rather than capturing the comprehensive interaction between drugs and targets. This leads to a gap between theoretical modeling and practical application. We further identify two main reasons for this issue: 1) the presence of a greater variety and quantity of drug molecules in datasets than proteins; 2) the inherent ease of feature extraction for drug molecules compared to proteins. These factors result in shortcut learning, where the model tends to prioritize learning features from the larger and easier-to-learn drug molecule data rather than focusing on the features of proteins. Consequently, the model struggles to effectively capture the interaction features between drugs and proteins.

The third challenge lies in the model's ability to generalize and make predictions on out-of-domain data, which is closely related to the previous two challenges. Developing a first-in-class drug often involves predicting interactions with a completely new target using novel compounds, which may have a distribution that differs significantly from the data on which the model was trained. Thus, the model needs to be capable of cross-domain generalization (Abbasi et al. 2020; Bai et al. 2023; Kao et al. 2021). Currently, most models are trained on limited labeled data and fail to address the issue of shortcut learning, resulting in limited ability to predict interactions between completely new drugs and proteins.

To tackle the three challenges, we propose MlanDTI, a semi-supervised domain adaptive multilevel attention network for DTI prediction. We utilize two pre-trained BERT models to acquire bidirectional embeddings of protein and SMILES (drug) sequences from millions of unlabeled data. Inspired by the least mean squared error reconstruction (Lmser) network (Xu 1993; Huang, Tu, and Xu 2022), we then devise a variant of the transformer with a multi-level attention mechanism that takes drug and protein embeddings as input. It enables the joint extraction of both drug and target features with reduced hidden bias and facilitates the learning of multi-level interactions. Moreover, we incorporate a simple yet effective semi-supervised pseudo-labeling method to further enhance our model's predictive ability in cross-domain scenarios. Experiments on four datasets demonstrate that MlanDTI achieves state-of-the-art performance over other methods under intra-domain settings and outperforms all other approaches under cross-domain settings.

The main contributions are three-fold as follows:

• To leverage massive unlabeled biomedical data, we employ two pre-trained BERT models to acquire representations that possess better robustness and generalization capabilities. We observed that the representations obtained by the BERT models significantly enhance the accuracy of pseudo-labeling.

• We propose a novel multi-level attention mechanism which enables effective feature extraction by allowing the model to dynamically focus on different aspects of proteins and drugs during the learning process. The attention mechanism mitigates the shortcut learning problem and reduces the impact of hidden bias on predictions.

• We propose a simple yet effective pseudo-label domain adaptation method, which significantly reduces the noise of pseudo-labels.

Related Work

Leveraging Additional Data

One of the keys to DTI prediction is how to represent drug molecules and proteins in a way that allows the model to learn useful features. Learning from 3D structural information (Wallach, Dzamba, and Heifets 2015; Stepniewska-Dziubinska, Zielenkiewicz, and Siedlecki 2018) is undoubtedly the most direct approach, but it is limited by high computational costs and model complexity. Another, indirect approach is to provide additional data containing 3D structural information, such as molecular dynamics simulations (Wu et al. 2022) and protein pocket data (Yazdani-Jahromi et al. 2022). While the aforementioned methods are limited by the availability of a finite amount of 3D structural data, MolTrans (Huang et al. 2021), in contrast, leverages a vast amount of unlabeled protein and drug sequences by using the Frequent Consecutive Sub-sequence (FCS) algorithm to extract high-quality substructures and enhances the representations using transformers. However, FCS has certain limitations in its ability to comprehensively extract information from sequence data, and the quantity of unlabeled data utilized is also insufficient. In this paper, we utilize two pre-trained BERT (Devlin et al. 2018) models learned on a large amount of unlabeled data to obtain rich representations of protein and drug sequences with powerful generalization abilities.

Learning Interactions

Proteins and drugs are two fundamentally different types of data, and the task of DTI prediction requires the model to learn their interaction features. The simplest approach is to concatenate the features (Öztürk, Özgür, and Ozkirimli 2018; Lee, Keum, and Nam 2019; Zheng et al. 2020; Nguyen et al. 2021) and pass them through a Fully-Connected Network (FCN) to obtain the prediction results. Another approach (Qian, Wu, and Zhang 2022) is to overlap the feature maps and use a CNN to extract interaction features. However, these methods lack interpretability and overlook the inherent structure of interactions. Recently, attention mechanisms have been demonstrated effective in capturing intricate interactions between proteins and drugs. Multi-head attention (Bian et al. 2023; Chen et al. 2020) and other attention variations (Bai et al. 2023; Zhao et al. 2022) have been widely applied in DTI prediction.
However, Chen et al. (2020) found hidden bias in some datasets that led models to rely mainly on drug patterns rather than on the interactions for prediction. We further observed that this issue is prevalent in existing models. To address it, we propose a multi-level attention mechanism.

Domain Generalization in DTI Predictions

In previous works (Huang et al. 2021; Yazdani-Jahromi et al. 2022; Zhao et al. 2022), the evaluation of model generalization was often conducted through the partitioning of datasets into "unseen drug" or "unseen protein" scenarios, where drugs or proteins were only present in the test set. However, such evaluations still fall into the intra-domain setting, different from real-world applications. Currently, there is limited research on domain generalization in DTI prediction. DrugBAN (Bai et al. 2023) addresses this challenge by utilizing a Conditional Domain Adversarial Network (CDAN) to transfer the learned knowledge from the source domain to the target domain, thereby enhancing the model's performance in cross-domain settings. Here, we leverage pseudo-labeling techniques to mitigate the distribution discrepancy between the target and source domains. Through the integration of an auxiliary classifier and the powerful representational capacity of BERT models, our approach significantly improves the accuracy of pseudo-labeling. Under the cross-domain setting, our method demonstrates remarkable performance, surpassing that of DrugBAN.

Method

Problem Formulation

The task of DTI prediction aims to determine whether a drug compound and a target protein will interact. For drug compounds, most existing deep learning methods utilize SMILES strings to represent the drugs. Specifically, a drug is represented as D = (d_1, ..., d_m), where d_i is a SMILES symbol with chemical meaning, such as an atom, and m is the length. As for target proteins, each protein sequence is represented as T = (a_1, ..., a_n), where a_i corresponds to one of the 23 amino acids and n is the length of the protein sequence.

Given a drug SMILES sequence D and a protein sequence T, the objective is to train a model to assign an interaction probability score P ∈ [0, 1] by mapping the joint feature representation space D × T.

The Proposed Framework

An overview of MlanDTI is depicted in Figure 1. It commences by encoding the drug and target sequences into vector embeddings via pre-trained BERT models, i.e., ChemBERTa-2 (Ahmad et al. 2022) and ProtTrans (Elnaggar et al. 2021). Subsequently, these embeddings are passed through the encoder and decoder of a modified transformer architecture with a multi-level attention module to extract interaction features. The classifier comprises a bilinear attention module and a max pooling layer, followed by an FCN for prediction. For cross-domain prediction, we employ an auxiliary classifier that directly accepts BERT outputs. It aids in learning implicit distributional information from BERT representations, thereby enhancing pseudo-label accuracy. After training the two classifiers on labeled source domain data, the model predicts on unlabeled target domain data to obtain pseudo-labels. The pseudo-label learning process consists of learning high-confidence pseudo-labels and minimizing conflicting predictions.
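As a sketch of this first stage, the two sequence encoders can be loaded through the HuggingFace transformers API. The checkpoint IDs below are publicly hosted versions of ChemBERTa-2 and ProtBert (ProtTrans) that we assume for illustration; they are not necessarily the exact checkpoints used here, and the toy inputs are not drawn from the datasets.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint IDs for ChemBERTa-2 and ProtTrans (ProtBert).
drug_tok = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MLM")
drug_bert = AutoModel.from_pretrained("DeepChem/ChemBERTa-77M-MLM")
prot_tok = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
prot_bert = AutoModel.from_pretrained("Rostlab/prot_bert")

smiles = "CC(=O)Oc1ccccc1C(=O)O"   # toy drug (aspirin)
protein = " ".join("MGNPVR")        # ProtBert expects space-separated residues

with torch.no_grad():
    drug_emb = drug_bert(**drug_tok(smiles, return_tensors="pt")).last_hidden_state
    prot_emb = prot_bert(**prot_tok(protein, return_tensors="pt")).last_hidden_state
# drug_emb: (1, m, d_drug); prot_emb: (1, n, d_prot) -> inputs to the
# decoder/encoder and to the auxiliary classifier.
```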
Encoder for Protein Sequence We build the encoder by adopting a modification of the transformer similar to TransformerCPI (Chen et al. 2020). Instead of using the self-attention module, we utilize a 1D-CNN and a GLU (gated linear unit) (Dauphin et al. 2017) as alternatives. The hidden layers h_0, ..., h_L in the encoder are computed as

    h_i(X_T) = (X_T W_{i1} + s) ⊗ σ(X_T W_{i2} + t),    (1)

where X_T ∈ R^{n×m_1} is the input of layer h_i; W_{i1} ∈ R^{k×m_1×m_2}, s ∈ R^{m_2}, W_{i2} ∈ R^{k×m_1×m_2}, and t ∈ R^{m_2} are parameters; n is the input sequence length; k is the patch size; m_1 and m_2 are the dimensions of the input and hidden vectors; σ is the sigmoid function; and ⊗ is the element-wise product.

Since the length of a protein sequence may range in the thousands or even tens of thousands, the self-attention module in transformers poses a significant computational and memory burden, with O(n²) time and space complexity, and is prone to overfitting when working on small datasets. The modification in Eq. (1) mitigates the computational and storage burden on long protein sequences and remedies overfitting on small datasets.
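As a concrete illustration, a single gated convolutional layer implementing our reading of Eq. (1) can be sketched in PyTorch as follows; the kernel size and dimensions are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class GLUConvLayer(nn.Module):
    """One encoder layer per Eq. (1): a 1D convolution whose output is gated
    by a sigmoid-activated parallel convolution (Dauphin et al. 2017)."""

    def __init__(self, m1: int, m2: int, k: int = 7):
        super().__init__()
        # The two convolutions play the roles of (W_i1, s) and (W_i2, t).
        self.conv_value = nn.Conv1d(m1, m2, kernel_size=k, padding=k // 2)
        self.conv_gate = nn.Conv1d(m1, m2, kernel_size=k, padding=k // 2)

    def forward(self, x_t: torch.Tensor) -> torch.Tensor:
        x = x_t.transpose(1, 2)           # (batch, n, m1) -> (batch, m1, n)
        h = self.conv_value(x) * torch.sigmoid(self.conv_gate(x))  # gating
        return h.transpose(1, 2)          # back to (batch, n, m2)

# A protein of length 1000 with 128-dim embeddings -> 256-dim hidden vectors.
h = GLUConvLayer(m1=128, m2=256)(torch.randn(2, 1000, 128))
```

With an odd kernel size and symmetric padding, the sequence length is preserved, so layers can be stacked to obtain h_0, ..., h_L.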
Figure 1: (a) The overall framework of MlanDTI. It consists of two pre-trained BERT models that convert SMILES and amino acid sequences into vector embeddings. The encoder and decoder are connected by a multilevel attention module, and the final output is processed through the classifier with a bilinear attention module and a max pooling layer before being fed into an FCN to generate the prediction results. (b) The details of the multilevel attention module. (c) Training with pseudo-labeling using an auxiliary classifier.
Multilevel Cross-Attention For the task of DTI prediction, the most crucial ability for the model is to learn the interaction patterns between drugs and targets. This involves aligning the features of proteins with the features of drugs in a shared feature subspace. However, extracting multi-level features from proteins is more challenging than extracting features from drugs, because protein sequences are notably long, with intricate multi-level structures, while drugs are often small chemical molecules. This difference contributes to the hidden bias in DTI models (another source is inherent dataset bias). Aligning protein features with drug features also requires a multi-level process, but the model may not capture the multi-level features of proteins well and effectively align them with drug features. Thus, existing models tend to learn a shortcut by relying on the features of drug molecules to predict drug-target interactions.

In early literature (Xu 1993), the Lmser network was proposed to enhance representation learning by building bidirectional skip connections at every level of layers between encoder and decoder. It was first demonstrated in a deep CNN implementation to be robust and effective in image processing (Huang, Tu, and Xu 2022; Xu 2019), and then the Lmser-transformer was developed to improve molecular representation learning by adding hierarchical connections to the original transformer (Qian et al. 2022). Inspired by these works, we propose a multi-level cross-attention mechanism to address this issue, as illustrated in Figure 1(a). In the vanilla transformer, the encoder uses the protein features from its last layer as the Key and Value for the cross-attention layer of the decoder, aligning them with the drug features in the decoder. However, the protein features obtained from the encoder's output do not fully capture the expression of the multi-level structural information of proteins, and they do not align with drug features at different levels in the decoder.

Here, we develop the multi-level attention mechanism in two steps: 1) the multi-level feature fusing step and 2) the cross-attention feature aligning step. Suppose the protein feature matrices of each encoder layer are T_0, ..., T_n ∈ R^{m×d}, where n is the number of transformer layers, m is the size of the protein sequence, and d is the vector dimension. For the ℓ-th decoder layer, we concatenate the protein feature matrices from the preceding ℓ layers to form Tcat_ℓ = [T_0, ..., T_ℓ] ∈ R^{ℓ×m×d}. Then, we perform a cross-layer feature aggregation by applying a fusion matrix F_ℓ ∈ R^{ℓ×1}. This results in the multi-level fused protein feature matrix T′_ℓ = F_ℓ^T Tcat_ℓ. To summarize, we compute all T′_ℓ as

    diag(T′_0, ..., T′_n) = F · diag(Tcat_0, ..., Tcat_n),    (2)

where F is a learnable diagonal matrix with each diagonal element being F_ℓ from each layer, i.e., ℓ = 0, ..., n. Then, the query, key, and value for the multi-level cross-attention mechanism at the ℓ-th layer are respectively computed by

    Q = D_ℓ W_q,   K = T′_ℓ W_k,   V = T′_ℓ W_v,    (3)

where D_ℓ is the drug feature matrix which has passed through the self-attention module, and T′_ℓ is the multi-level protein feature matrix given by Eq. (2).

To enhance the extraction capabilities of attention heads for multi-level interactions, we incorporate the talking-heads attention mechanism (Shazeer et al. 2020) for feature alignment. This variation of multi-head attention introduces two additional linear projections, which transform the attention logits and the attention weights, respectively, allowing information to flow across different attention heads and improving the overall performance of the model, i.e.,

    Attention(Q, K, V) = softmax(P_ℓ (Q K^T / √d_k)) P_w V,    (4)

where Q, K, V are given by Eq. (3), and P_ℓ ∈ R^{h_k×h_k}, P_w ∈ R^{h_k×h_v} are the two additional linear projections. h_k represents the number of attention heads for keys and queries, and h_v denotes the number of attention heads for values; they can optionally differ in size.

The advantages of the proposed multi-level attention mechanism are briefly summarized below; a code sketch of the mechanism follows the list.

• Encourage multi-level feature learning: By fusing protein features, drug features are derived to interact with relevant characteristics, which thereby captures multi-level interaction features, leading to a more comprehensive understanding of drug-target interactions.

• Alleviate hidden bias and reduce overfitting: Multi-level attention encourages the model to focus more on hierarchical interaction features; the model becomes less prone to biased representations that might emerge from focusing solely on specific patterns, and thus it is less likely to overfit to noisy patterns in the data.

• Improve generalization abilities: Multi-level attention enables the model to learn domain-invariant interaction features. These representations exhibit robustness and enhance transferability across different data domains.
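To make the two steps concrete, the following is a minimal PyTorch sketch of Eqs. (2)-(4) for a single decoder layer. It assumes equal numbers of key and value heads (h_k = h_v), one scalar fusion weight per encoder level, and illustrative tensor shapes; it sketches the mechanism rather than reproducing the paper's implementation.

```python
import torch
import torch.nn as nn

class MultiLevelTalkingHeadsAttention(nn.Module):
    """Fuse per-level protein features with learnable weights F_l (Eq. 2),
    form Q/K/V (Eq. 3), and mix attention logits/weights across heads with
    the talking-heads projections P_l and P_w (Eq. 4)."""

    def __init__(self, d: int, heads: int, num_levels: int):
        super().__init__()
        self.h, self.d_k = heads, d // heads
        self.w_q = nn.Linear(d, d)
        self.w_k = nn.Linear(d, d)
        self.w_v = nn.Linear(d, d)
        self.fuse = nn.Parameter(torch.full((num_levels,), 1.0 / num_levels))  # F_l
        self.p_l = nn.Parameter(torch.eye(heads))  # mixes attention logits
        self.p_w = nn.Parameter(torch.eye(heads))  # mixes attention weights

    def forward(self, drug: torch.Tensor, protein_levels: list) -> torch.Tensor:
        # drug: (b, m_d, d); protein_levels: per-encoder-layer (b, m_p, d) tensors.
        t = torch.stack(protein_levels)                      # (L, b, m_p, d)
        t_fused = torch.einsum("l,lbmd->bmd", self.fuse, t)  # Eq. (2)
        b, m_d, _ = drug.shape
        split = lambda x: x.view(b, -1, self.h, self.d_k).transpose(1, 2)
        q = split(self.w_q(drug))      # (b, h, m_d, d_k)
        k = split(self.w_k(t_fused))   # (b, h, m_p, d_k)
        v = split(self.w_v(t_fused))
        logits = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (b, h, m_d, m_p)
        logits = torch.einsum("ij,bjqk->biqk", self.p_l, logits)        # P_l
        weights = torch.einsum("ij,bjqk->biqk", self.p_w, logits.softmax(-1))  # P_w
        out = weights @ v                                    # Eq. (4), per head
        return out.transpose(1, 2).reshape(b, m_d, -1)

# Example: 4 encoder levels, drug length 50, protein length 300, d = 256.
attn = MultiLevelTalkingHeadsAttention(d=256, heads=8, num_levels=4)
out = attn(torch.randn(2, 50, 256), [torch.randn(2, 300, 256) for _ in range(4)])
```

Mixing the logits (P_ℓ) before the softmax and the weights (P_w) after it is what lets information flow across heads, following Shazeer et al. (2020).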
The Classifier The classifier consists of the bilinear attention module from HyperAttentionDTI (Zhao et al. 2022) to further extract bidirectional interaction features. Subsequently, we utilize a multi-layer FCN, with each layer followed by a leaky ReLU activation function (He et al. 2015), to combine these features and generate the prediction results. Since this is a binary classification problem, we train the model with the binary cross-entropy loss

    L_CE = −[y log ŷ + (1 − y) log(1 − ŷ)],    (5)

where y is the ground-truth label and ŷ is the classifier's output.
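For illustration, the prediction head and loss of Eq. (5) can be sketched as below; the layer widths are placeholder assumptions, and a pooled feature vector stands in for the bilinear-attention output.

```python
import torch
import torch.nn as nn

# Minimal sketch of the prediction head: FCN layers with leaky ReLU,
# ending in a sigmoid to produce the interaction probability P in [0, 1].
head = nn.Sequential(
    nn.Linear(512, 256), nn.LeakyReLU(),
    nn.Linear(256, 64), nn.LeakyReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
criterion = nn.BCELoss()              # Eq. (5): binary cross-entropy

joint = torch.randn(8, 512)           # batch of pooled interaction features
y = torch.randint(0, 2, (8, 1)).float()
loss = criterion(head(joint), y)
```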
Pseudo Label Learning for Domain Adaptation

Pseudo-labeling (Lee et al. 2013) is a semi-supervised learning (SSL) method that utilizes a model trained on labeled source domain data to generate pseudo-labels.

The second step is to penalize conflicting predictions. Let X_d be the set of samples for which the two classifiers exhibit conflicting classifications, i.e.,

    X_d = {x^(i) | x^(i) ∈ X_t, argmax p^(i) ≠ argmax p^(i)_aux},    (9)

where p^(i) = (p^(i)_0, p^(i)_1) and p^(i)_aux = (p^(i)_{0,aux}, p^(i)_{1,aux}).

We randomly select a subset X′_d of size M′ from X_d, where the value of M′ increases with the number of model iterations. We utilize a modified binary cross-entropy loss to augment the prediction uncertainty for conflicting samples between the two classifiers.
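A minimal sketch of the conflict-set construction in Eq. (9) follows; the (N, 2) probability layout and the growth schedule for M′ are assumptions.

```python
import torch

def conflict_set(p_main: torch.Tensor, p_aux: torch.Tensor, m_prime: int):
    """Eq. (9): indices where the primary and auxiliary classifiers disagree,
    with a random subset of size M' drawn for the penalty term."""
    disagree = p_main.argmax(dim=1) != p_aux.argmax(dim=1)  # conflicting samples
    idx = disagree.nonzero(as_tuple=True)[0]                # X_d
    perm = idx[torch.randperm(idx.numel())]
    return perm[:m_prime]                                   # random subset X'_d

# Example with two 4-sample probability tables.
p = torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
q = torch.tensor([[0.8, 0.2], [0.7, 0.3], [0.4, 0.6], [0.1, 0.9]])
print(conflict_set(p, q, m_prime=2))  # subset of the disagreeing indices {1, 2}
```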
Table 1: The results of the proposed model and baselines on four datasets (5 random runs). Metrics: AUROC (AUC), AUPRC (AUPR), F1-score (F1). The best results are indicated in bold; "–" means no result for this metric.

Table 2: In-domain (cold-pair split: unseen drugs & proteins) and cross-domain (clustering-based split) comparison on the BindingDB and BioSNAP datasets (5 random runs). 1) Underlined values: we chose a threshold of 0.5 (the same as in MolTrans) to calculate the F1-score of DrugBAN, to ensure a fair comparison and to avoid ineffective classification caused by overly low thresholds in DrugBAN; further information is provided in the appendix. 2) NA: not applicable to this study. 3) "with PL" within parentheses refers to our method incorporating the pseudo-labeling module.

                           cold                                     cross-domain
methods            BindingDB            BioSNAP             BindingDB            BioSNAP
                 AUC   AUPR  F1      AUC   AUPR  F1      AUC   AUPR  F1      AUC   AUPR  F1
MolTrans         0.595 0.522 0.511   0.672 0.697 0.437   0.537 0.476 0.389   0.632 0.635 0.401
TransformerCPI   0.656 0.594 0.566   0.680 0.708 0.523   0.568 0.450 0.410   0.656 0.693 0.432
HyperAttDTI      0.661 0.598 0.582   0.732 0.760 0.539   0.545 0.462 0.376   0.654 0.685 0.395
DrugBAN          0.655 0.600 0.542   0.651 0.667 0.449   0.578 0.471 0.484   0.608 0.606 0.438
DrugBAN_CDAN     NA    NA    NA      NA    NA    NA      0.616 0.512 0.426   0.673 0.706 0.542
Ours             0.671 0.594 0.601   0.782 0.801 0.653   0.657 0.537 0.489   0.728 0.759 0.604
Ours (with PL)   NA    NA    NA      NA    NA    NA      0.687 0.579 0.564   0.749 0.770 0.629
(SVM) (Cortes and Vapnik 1995), Random Forest (RF) (Ho 1995), GraphDTA (Nguyen et al. 2021), DeepConv-DTI (Lee, Keum, and Nam 2019), MolTrans (Huang et al. 2021), TransformerCPI (Chen et al. 2020), HyperAttentionDTI (Zhao et al. 2022), and DrugBAN (Bai et al. 2023). These baselines encompass both classic machine learning methods and the current state-of-the-art deep learning approaches, ensuring a comprehensive comparison. All deep learning methods were employed with their default configurations as provided by their respective authors. Our proposed method is implemented in PyTorch, utilizing the Adam optimizer with an initial learning rate of 0.001. Detailed hyperparameter settings are provided in the appendix.
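A minimal sketch of this training setup follows; only the optimizer choice and the learning rate come from the text above, while the model and batch are placeholders.

```python
import torch

model = torch.nn.Linear(512, 1)   # stand-in for MlanDTI, not the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, lr = 0.001
criterion = torch.nn.BCEWithLogitsLoss()

# One illustrative optimization step on a random batch.
x, y = torch.randn(8, 512), torch.randint(0, 2, (8, 1)).float()
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```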
Intra-domain Experiments

Table 1 displays the comparison on the human and C.elegans datasets. These two datasets are relatively small, with balanced positive and negative samples, enabling us to evaluate the model's predictive ability within the same distribution. Our method outperforms all deep learning baselines in terms of AUROC and AUPRC, and it also exhibits competitive performance in terms of F1-score.

We also conducted comparisons on the larger datasets, BindingDB and BioSNAP. In the random split tests, our model achieved state-of-the-art performance on the BioSNAP dataset, but its performance on the BindingDB dataset was not particularly competitive. This discrepancy is due to the hidden bias issue present in the BindingDB dataset.

The BindingDB dataset contains 14643 drugs and 2623 proteins, which results in an extremely imbalanced drug-to-protein ratio compared to the other datasets (BioSNAP: 4510/2181, human: 2726/2001, C.elegans: 1767/1876). Compared to the other three datasets, deep learning models even struggle to outperform traditional machine learning methods (AUC: RF 0.942, DeepConv-DTI 0.945) on the BindingDB dataset. Previous studies (Bai et al. 2023) have also reported that performance on the BindingDB dataset under the unseen-drug setting shows minimal decline compared to random splits. This phenomenon is attributed to the presence of a large number of highly similar molecules in the dataset, which makes it challenging for the naive unseen-drug setting to distinguish between them. The excessive number of highly similar drug samples causes baseline models to lean towards learning drug patterns rather than drug-target interactions for prediction. As a result, deep learning and machine learning methods exhibit similar performance levels. However, this shortcut learning contradicts the original intent of DTI prediction and cannot be considered reliable in practical applications.
Table 3: Ablation results under the cross-domain setting on the BindingDB and BioSNAP datasets.

                  BindingDB               BioSNAP
Ablation      AUC   AUPR  F1          AUC   AUPR  F1
Full model    0.687 0.579 0.564       0.749 0.770 0.629
-BERT         0.573 0.455 0.413       0.648 0.671 0.499
-MLA          0.628 0.511 0.523       0.731 0.753 0.585
-PL           0.657 0.537 0.489       0.728 0.759 0.604
-Aux Cls      0.626 0.486 0.503       0.739 0.776 0.633
However, our model focuses more on learning the multi-level interactions between proteins and drugs. In the cold split setting in Table 2, the model can only learn drug-target interaction features, due to the lack of sufficiently similar drug and protein molecules as references. Our model outperforms the other baselines on the BindingDB dataset, while on the more balanced BioSNAP dataset, our model achieves a superior performance compared to the baselines.

Overall, the challenges posed by the hidden bias issue on the BindingDB dataset highlight the importance of our model's ability to capture multi-level drug-target interactions, which allows it to perform well in scenarios where other baselines struggle to maintain effectiveness.

Cross-Domain Experiments

Table 2 presents a comparison of model performance on the BindingDB and BioSNAP datasets under the cross-domain setting. Compared to the intra-domain setting, the majority of models experience a significant performance drop due to the differences in data distributions. Particularly, for the BindingDB dataset, the clustering-based strategy ensures that there are no similar drugs or proteins between the training and testing sets, preventing the models from relying on drug patterns. This breaks the false high-performance illusion observed in the intra-domain scenario, and some models even perform no better than random guessing (AUC: 0.5). Among all baselines, DrugBAN_CDAN, which leveraged a conditional domain adversarial network (CDAN) for domain adaptation, achieved the best performance. However, DrugBAN_CDAN did not surpass our vanilla model without pseudo-labeling, and our model with pseudo-labeling significantly outperformed all state-of-the-art models, including DrugBAN with the domain adaptation module. Specifically, our model outperformed DrugBAN_CDAN by 11.52% and 11.29% (AUROC) on the BindingDB and BioSNAP datasets, respectively.

Ablation Studies

We conducted ablation studies in Table 3 under the cross-domain setting on the BindingDB and BioSNAP datasets to analyze the effectiveness of the modules in our proposed model.

Figure 2: Ablation experiments on pseudo-labeling accuracy on (a) the BindingDB dataset and (b) the BioSNAP dataset.

Effectiveness of BERT Embeddings We replaced BERT with Word2Vec and GCN, as used in TransformerCPI (Chen et al. 2020), to obtain embeddings for drugs and proteins. As shown in Table 3, the performance of the model experienced a notable decline. This outcome can be attributed to the auxiliary classifier's inability to effectively capture the implicit relationship between the source and target domains through the representations. As a result, as shown in Figure 2, the accuracy of pseudo-labels exhibited a significant drop, introducing a substantial amount of noisy pseudo-labels that deteriorated the model's performance.

Effectiveness of Multilevel Attention We replaced the multilevel attention (MLA) mechanism with the original Transformer multi-head attention. On both datasets, the model exhibited performance drops of varying degrees. With an increase in training iterations, a significant decline in the accuracy of pseudo-labels was observed. It turns out that the multilevel attention mechanism is better equipped to capture domain-invariant drug-target interaction features, thereby enhancing the model's performance in the target domain.

Effectiveness of Pseudo Labeling and Auxiliary Classifier Pseudo-labeling (PL) proves effective in enhancing the model's performance within the target domain. Concurrently, the auxiliary classifier contributes to reducing the noise within these pseudo-labels. This effect is particularly pronounced on the BindingDB dataset, which exhibits substantial disparities in domain distributions. The absence of the auxiliary classifier exacerbates the noise present within the pseudo-labels, leaving the pseudo-labeling approach insufficient to enhance the model's performance.

Conclusion

In this paper, we proposed MlanDTI, a semi-supervised domain adaptive multilevel attention network that leverages a large amount of unlabeled data to obtain enriched bidirectional representations of drugs and proteins from pre-trained BERT models. Additionally, we introduced the multilevel attention mechanism to capture domain-invariant interaction features between proteins and drugs at different levels and depths. Finally, we incorporated a simple yet effective pseudo-labeling method to further enhance our model's generalization ability. Our model demonstrated excellent domain generalization capabilities, making it well-suited for predicting interactions between new drugs and targets in drug development. Through comprehensive comparisons with state-of-the-art models, we established a substantial performance superiority over prior methodologies.
Acknowledgements

This work was supported by the National Natural Science Foundation of China (62172273) and Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102). Shikui Tu and Lei Xu are co-corresponding authors.

References

Abbasi, K.; Razzaghi, P.; Poso, A.; Amanlou, M.; Ghasemi, J. B.; and Masoudi-Nejad, A. 2020. DeepCDA: deep cross-domain compound–protein affinity prediction through LSTM and convolutional neural networks. Bioinformatics, 36(17): 4633–4642.
Agamah, F. E.; Mazandu, G. K.; Hassan, R.; Bope, C. D.; Thomford, N. E.; Ghansah, A.; and Chimusa, E. R. 2020. Computational/in silico methods in drug target and lead prediction. Briefings in bioinformatics, 21(5): 1663–1675.
Ahmad, W.; Simon, E.; Chithrananda, S.; Grand, G.; and Ramsundar, B. 2022. Chemberta-2: Towards chemical foundation models. arXiv preprint arXiv:2209.01712.
Bai, P.; Miljković, F.; John, B.; and Lu, H. 2023. Interpretable bilinear attention network with domain adaptation improves drug–target prediction. Nature Machine Intelligence, 5(2): 126–136.
Bakheet, T. M.; and Doig, A. J. 2009. Properties and identification of human protein drug targets. Bioinformatics, 25(4): 451–457.
Bian, J.; Zhang, X.; Zhang, X.; Xu, D.; and Wang, G. 2023. MCANet: shared-weight-based MultiheadCrossAttention network for drug–target interaction prediction. Briefings in Bioinformatics, 24(2): bbad082.
Broach, J. R.; Thorner, J.; et al. 1996. High-throughput screening for drug discovery. Nature, 384(6604): 14–16.
Chen, L.; Cruz, A.; Ramsey, S.; Dickson, C. J.; Duca, J. S.; Hornak, V.; Koes, D. R.; and Kurtzman, T. 2019. Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening. PloS one, 14(8): e0220113.
Chen, L.; Tan, X.; Wang, D.; Zhong, F.; Liu, X.; Yang, T.; Luo, X.; Chen, K.; Jiang, H.; and Zheng, M. 2020. TransformerCPI: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics, 36(16): 4406–4414.
Cheng, F.; Zhou, Y.; Li, J.; Li, W.; Liu, G.; and Tang, Y. 2012. Prediction of chemical–protein interactions: multitarget-QSAR versus computational chemogenomic methods. Molecular BioSystems, 8(9): 2373–2384.
Cortes, C.; and Vapnik, V. 1995. Support-vector networks. Machine learning, 20: 273–297.
Dauphin, Y. N.; Fan, A.; Auli, M.; and Grangier, D. 2017. Language modeling with gated convolutional networks. In International conference on machine learning, 933–941. PMLR.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. 2021. Prottrans: Toward understanding the language of life through self-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 44(10): 7112–7127.
Ezzat, A.; Wu, M.; Li, X.-L.; and Kwoh, C.-K. 2019. Computational prediction of drug–target interactions using chemogenomic approaches: an empirical survey. Briefings in bioinformatics, 20(4): 1337–1357.
Faulon, J.-L.; Misra, M.; Martin, S.; Sale, K.; and Sapra, R. 2008. Genome scale enzyme–metabolite and drug–target interaction predictions using the signature molecular descriptor. Bioinformatics, 24(2): 225–233.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, 1026–1034.
Ho, T. K. 1995. Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition, volume 1, 278–282. IEEE.
Huang, K.; Xiao, C.; Glass, L. M.; and Sun, J. 2021. MolTrans: molecular interaction transformer for drug–target interaction prediction. Bioinformatics, 37(6): 830–836.
Huang, L.; Lin, J.; Liu, R.; Zheng, Z.; Meng, L.; Chen, X.; Li, X.; and Wong, K.-C. 2022. CoaDTI: multi-modal co-attention based framework for drug–target interaction annotation. Briefings in Bioinformatics, 23(6): bbac446.
Huang, W.; Tu, S.; and Xu, L. 2022. Deep CNN based Lmser and strengths of two built-in dualities. Neural Processing Letters, 54(5): 3565–3581.
Kao, P.-Y.; Kao, S.-M.; Huang, N.-L.; and Lin, Y.-C. 2021. Toward drug-target interaction prediction via ensemble modeling and transfer learning. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2384–2391. IEEE.
Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Lee, D.-H.; et al. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, 896. Atlanta.
Lee, I.; Keum, J.; and Nam, H. 2019. DeepConv-DTI: Prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS computational biology, 15(6): e1007129.
Liu, T.; Lin, Y.; Wen, X.; Jorissen, R. N.; and Gilson, M. K. 2007. BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic acids research, 35(suppl 1): D198–D201.
Meng, F.-R.; You, Z.-H.; Chen, X.; Zhou, Y.; and An, J.-Y. 2017. Prediction of drug–target interaction networks
from the integration of protein sequences and drug chemical structures. Molecules, 22(7): 1119.
Nguyen, T.; Le, H.; Quinn, T. P.; Nguyen, T.; Le, T. D.; and Venkatesh, S. 2021. GraphDTA: Predicting drug–target binding affinity with graph neural networks. Bioinformatics, 37(8): 1140–1147.
Öztürk, H.; Özgür, A.; and Ozkirimli, E. 2018. DeepDTA: deep drug–target binding affinity prediction. Bioinformatics, 34(17): i821–i829.
Paul, S. M.; Mytelka, D. S.; Dunwiddie, C. T.; Persinger, C. C.; Munos, B. H.; Lindborg, S. R.; and Schacht, A. L. 2010. How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nature reviews Drug discovery, 9(3): 203–214.
Qian, H.; Lin, C.; Zhao, D.; Tu, S.; and Xu, L. 2022. AlphaDrug: protein target specific de novo molecular generation. PNAS nexus, 1(4): pgac227.
Qian, Y.; Wu, J.; and Zhang, Q. 2022. CAT-CPI: Combining CNN and transformer to learn compound image features for predicting compound-protein interactions. Frontiers in Molecular Biosciences, 9: 963912.
Rifaioglu, A. S.; Atas, H.; Martin, M. J.; Cetin-Atalay, R.; Atalay, V.; and Doğan, T. 2019. Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases. Briefings in bioinformatics, 20(5): 1878–1912.
Rizve, M. N.; Duarte, K.; Rawat, Y. S.; and Shah, M. 2021. In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. arXiv preprint arXiv:2101.06329.
Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2008. The graph neural network model. IEEE transactions on neural networks, 20(1): 61–80.
Shazeer, N.; Lan, Z.; Cheng, Y.; Ding, N.; and Hou, L. 2020. Talking-heads attention. arXiv preprint arXiv:2003.02436.
Sieg, J.; Flachsenberg, F.; and Rarey, M. 2019. In need of bias control: evaluating chemical data for machine learning in structure-based virtual screening. Journal of chemical information and modeling, 59(3): 947–961.
Stepniewska-Dziubinska, M. M.; Zielenkiewicz, P.; and Siedlecki, P. 2018. Development and evaluation of a deep learning model for protein–ligand binding affinity prediction. Bioinformatics, 34(21): 3666–3674.
Tsubaki, M.; Tomii, K.; and Sese, J. 2019. Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics, 35(2): 309–318.
Wallach, I.; Dzamba, M.; and Heifets, A. 2015. AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855.
Wang, X.-r.; Cao, T.-t.; Jia, C. M.; Tian, X.-m.; and Wang, Y. 2021. Quantitative prediction model for affinity of drug–target interactions based on molecular vibrations and overall system of ligand-receptor. BMC bioinformatics, 22(1): 1–18.
Wu, F.; Jin, S.; Jiang, Y.; Jin, X.; Tang, B.; Niu, Z.; Liu, X.; Zhang, Q.; Zeng, X.; and Li, S. Z. 2022. Pre-Training of Equivariant Graph Matching Networks with Conformation Flexibility for Drug Binding. Advanced Science, 9(33): 2203796.
Xu, L. 1993. Least mean square error reconstruction principle for self-organizing neural-nets. Neural networks, 6(5): 627–648.
Xu, L. 2019. An overview and perspectives on bidirectional intelligence: Lmser duality, double IA harmony, and causal computation. IEEE/CAA Journal of Automatica Sinica, 6(4): 865–893.
Yazdani-Jahromi, M.; Yousefi, N.; Tayebi, A.; Kolanthai, E.; Neal, C. J.; Seal, S.; and Garibay, O. O. 2022. AttentionSiteDTI: an interpretable graph-based model for drug-target interaction prediction using NLP sentence-level relation classification. Briefings in Bioinformatics, 23(4): bbac272.
Zhao, Q.; Zhao, H.; Zheng, K.; and Wang, J. 2022. HyperAttentionDTI: improving drug–protein interaction prediction by sequence-based deep learning with attention mechanism. Bioinformatics, 38(3): 655–662.
Zheng, S.; Li, Y.; Chen, S.; Xu, J.; and Yang, Y. 2020. Predicting drug–protein interaction using quasi-visual question answering system. Nature Machine Intelligence, 2(2): 134–140.