[go: up one dir, main page]

0% found this document useful (0 votes)
27 views9 pages

27786-Article Text-31840-1-2-20240324

Combination of articles

Uploaded by

saleh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views9 pages

27786-Article Text-31840-1-2-20240324

Combination of articles

Uploaded by

saleh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 9

The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

Multilevel Attention Network with Semi-supervised Domain Adaptation for


Drug-Target Prediction
Zhousan Xie1 , Shikui Tu1 * , Lei Xu1,2*
1
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2
Guangdong Institute of Intelligence Science and Technology, Zhuhai, Guangdong 519031, China
{waduhek, tushikui, leixu}@sjtu.edu.cn

Abstract docking and molecular simulations have shown great suc-


cess in drug discovery (Cheng et al. 2012), but they are lim-
Prediction of drug-target interactions (DTIs) is a crucial step
in drug discovery, and deep learning methods have shown ited for being computationally resource-intensive and rely-
great promise on various DTI datasets. However, existing ap- ing on the availability of 3D structure data. Methods includ-
proaches still face several challenges, including limited la- ing machine learning approaches (Faulon et al. 2008; Wang
beled data, hidden bias issue, and a lack of generalization et al. 2021; Meng et al. 2017) perform well in predicting
ability to out-of-domain data. These challenges hinder the DTIs for known drug-target pairs, while their performance
model’s capacity to learn truly informative interaction fea- tends to deteriorate when applied to unknown structures.
tures, leading to shortcut learning and inferior predictive per- With the accumulation of a large volume of labeled DTI
formance on novel drug-target pairs. To address these is-
sues, we propose MlanDTI, a semi-supervised domain adap-
data in recent years, numerous end-to-end deep learning
tive multilevel attention network (Mlan) for DTI prediction. methods have been employed for predicting DTIs. From
We utilize two pre-trained BERT models to acquire bidirec- the perspective of the input data, DTI prediction models
tional representations enriched with information from unla- can be categorized into three groups. The first category
beled data. Then, we introduce a multilevel attention mech- is sequence-based models, where drugs are represented as
anism, enabling the model to learn domain-invariant DTIs at Simplified Molecular Input Line Entry System (SMILES)
different hierarchical levels. Moreover, we present a simple or Extended-Connectivity Fingerprints (ECFP) and pro-
yet effective semi-supervised pseudo-labeling method to fur- teins are treated as amino acid sequences. These models
ther enhance our model’s predictive ability in cross-domain commonly utilize 1D-CNN (Öztürk, Özgür, and Ozkirimli
scenarios. Experiments on four datasets show that MlanDTI 2018; Lee, Keum, and Nam 2019; Zhao et al. 2022; Bai
achieves state-of-the-art performances over other methods
under intra-domain settings and outperforms all other ap-
et al. 2023) or transformer architectures (Chen et al. 2020;
proaches under cross-domain settings. The source code is Huang et al. 2022). Secondly, drug molecules can be repre-
available at https://github.com/CMACH508/MlanDTI. sented as graphs (Nguyen et al. 2021; Tsubaki, Tomii, and
Sese 2019; Huang et al. 2022) or images (Qian, Wu, and
Zhang 2022). Similarly, protein distance maps can serve as
Introduction a 2D abstraction of their 3D structural information (Zheng
The process of drug discovery and development is charac- et al. 2020), enabling the use of Graph Neural Networks
terized by its high costs and time-intensive nature. Bring- (GNNs) (Scarselli et al. 2008), Graph Convolutional Net-
ing a first-in-class drug to the market typically requires sev- works (GCNs) (Kipf and Welling 2016), and Convolutional
eral decades and substantial investments amounting to bil- Neural Networks (CNNs). Thirdly, the incorporation of 3D
lions of dollars. Predicting drug-target interactions (DTIs) structural data such as protein pockets (Yazdani-Jahromi
is an essential task in drug discovery and drug repurposing et al. 2022) or molecular dynamics simulation data (Wu et al.
(Paul et al. 2010), which hold significant value in the field of 2022) undoubtedly improves model performance and re-
biomedicine (Agamah et al. 2020; Ezzat et al. 2019). While duces computational complexity compared to those directly
traditional techniques like high-throughput screening, pro- using the whole 3D data as input (Wallach, Dzamba, and
teomics, and genomics remain prevalent, they suffer from Heifets 2015; Stepniewska-Dziubinska, Zielenkiewicz, and
time and cost constraints due to the vast chemical space Siedlecki 2018). Nonetheless, they are still constrained by
involved (Broach, Thorner et al. 1996; Bakheet and Doig the limited availability of 3D structural data.
2009). Despite these remarkable development, deep learning
In order to expedite the drug discovery process and re- methods still face several challenges. The first challenge is
duce costs, virtual screening (VS) techniques have been de- the restriction of limited labeled data. Previous works have
veloped to aid in silico (Rifaioglu et al. 2019). Molecular primarily concentrated on utilizing the available labeled data
* Corresponding authors and learn interactions on a few thousands labeled drug-target
Copyright © 2024, Association for the Advancement of Artificial pairs (Öztürk, Özgür, and Ozkirimli 2018; Lee, Keum, and
Intelligence (www.aaai.org). All rights reserved. Nam 2019; Tsubaki, Tomii, and Sese 2019; Nguyen et al.

329
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

2021; Huang et al. 2022; Zhao et al. 2022; Bai et al. 2023). sentations that possess better robustness and generaliza-
However, these approaches often overlook the enormous tion capabilities. We observed that the representations
amount of unlabeled biomedical data, which hinders the obtained by the BERT models significantly enhance the
models from fully leveraging the chemical structures and accuracy of pseudo-labeling.
interactions of drugs and proteins. Consequently, the mod- • We propose a novel multi-level attention mechanism
els struggle to extract truly informative features, leading to which enables effective feature extraction by allowing
limited generalization ability. the model to dynamically focus on different aspects of
The second challenge is the hidden bias and shortcut proteins and drugs during the learning process. The atten-
learning. The issue of hidden bias has been reported on the tion mechanism mitigates the shortcut learning problem
DUD-E and MUV datasets (Sieg, Flachsenberg, and Rarey and reduces the impact of hidden bias on predictions.
2019). It has been observed that models trained on the DUD- • We propose a simple yet effective pseudo-label domain
E dataset (Chen et al. 2019) and other datasets (Chen et al. adaptation method, which significantly reduces the noise
2020), tend to rely predominantly on drug patterns when of pseudo-labels.
making predictions, rather than capturing the comprehen-
sive interaction between drugs and targets. This lead to a gap Related Work
between theoretical modeling and practical application. We
further identify two main reasons for this issue: 1) The pres- Leveraging Additional Data
ence of a greater variety and quantity of drug molecules in One of the key to DTI prediction is how to represent drug
datasets than proteins; 2) The inherent ease of feature extrac- molecules and proteins that allows the model to learn use-
tion for drug molecules compared to proteins. These factors ful features. Learning from 3D structural information (Wal-
result in shortcut learning, where the model tends to priori- lach, Dzamba, and Heifets 2015; Stepniewska-Dziubinska,
tize learning features from the larger and easier-to-learn drug Zielenkiewicz, and Siedlecki 2018) is undoubtedly the most
molecule data, rather than focusing on the features of pro- direct approach, but it is limited by the high computational
teins. Consequently, the model struggles to effectively cap- costs and model complexity. Another indirect approach is
ture the interaction features between drugs and proteins. to provide additional data containing 3D structural infor-
The third challenge lies in the model’s ability to gener- mation, such as molecular dynamics simulations (Wu et al.
alize and make predictions on out-of-domain data, which is 2022) and protein pocket data (Yazdani-Jahromi et al. 2022).
closely related to the previous two challenges. Developing a While the aforementioned methods are limited by the avail-
first-in-class drug often involves predicting interactions with ability of a finite amount of 3D structural data, Moltrans
a completely new target using novel compounds, which may (Huang et al. 2021), in contrast, leverages a vast amount
have a distribution that differs significantly from the data on of unlabeled protein and drug sequences by using Frequent
which the model was trained. Thus, the model needs to be Consecutive Sub-sequence (FCS) algorithm to extract high-
capable of cross-domain generalization (Abbasi et al. 2020; quality substructures and enhances the representations us-
Bai et al. 2023; Kao et al. 2021). Currently, most models are ing transformers. However, FCS has certain limitations in
trained on limited labeled data and fail to address the issue its ability to comprehensively extract information from se-
of shortcut learning, resulting in limited ability to predict in- quence data, and the quantity of unlabeled data utilized is
teractions between completely new drugs and proteins. also insufficient. In this paper, we utilize two pre-trained
To tackle the three challenges, we propose MlanDTI, a BERT (Devlin et al. 2018) models learned on a large amount
semi-supervised domain adaptive multilevel attention net- of unlabeled data to obtain rich representations of proteins
work for DTI prediction. We utilize two pre-trained BERT and drug sequences with powerful generalization abilities.
models to acquire bidirectional embeddings of protein and
SMILES (drug) sequences from millions of unlabeled data. Learning Interactions
Inspired by the least mean squared error reconstruction Proteins and drugs are two fundamentally different types
(Lmser) network (Xu 1993; Huang, Tu, and Xu 2022), we of data, and the task of DTI prediction requires the model
then devise a variant of transformer with a multi-level atten- to learn their interaction features. The simplest approach
tion mechanism with drug and protein embeddings as input. is to concatenate the features (Öztürk, Özgür, and Ozkir-
It enables the joint extraction of both drug and target fea- imli 2018; Lee, Keum, and Nam 2019; Zheng et al. 2020;
tures with reduced hidden bias and facilitates the learning of Nguyen et al. 2021) and pass them through a Fully-
multi-level interactions. Moreover, we incorporates a sim- Connected Network (FCN) to obtain the prediction results.
ple yet effective semi-supervised pseudo-labeling method Another approach (Qian, Wu, and Zhang 2022) is to over-
to further enhance our model’s predictive ability in cross- lap the feature maps and use CNN to extract interaction
domain scenarios. Experiments on four datasets demonstrate features. However, these methods lack interpretability and
that MlanDTI achieves state-of-the-art performances over overlook the inherent structure of interactions. Recently, at-
other methods under intra-domain settings and outperforms tention mechanisms have been demonstrated effective in
all other approaches under cross-domain settings. capturing intricate interactions between proteins and drugs.
The main contributions are three-fold as follows: Multi-head attention (Bian et al. 2023; Chen et al. 2020) and
other attention variations (Bai et al. 2023; Zhao et al. 2022)
• To leverage massive unlabeled biomedical data, we em- have been widely applied in DTI prediction. However, (Chen
ployed two pre-trained BERT models to acquire repre- et al. 2020) found that the hidden bias in some datasets that

330
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

led models to rely mainly on drug patterns rather than the data, the model predicts on unlabeled target domain data
interactions for prediction. We further observed that this is- to obtain pseudo-labels. The pseudo-label learning process
sue was prevalent in existing models. To address this issue, consists learning high-confidence pseudo-labels and mini-
we proposed a multi-level attention mechanism. mizing conflicting predictions.

Domain Generalization in DTI Predictions Encoder for Protein Sequence We build the encoder by
adopting a modification on the transformer similar to Trans-
In previous works (Huang et al. 2021; Yazdani-Jahromi et al. formerCPI (Chen et al. 2020). Instead of using the self-
2022; Zhao et al. 2022), the evaluation of model generaliza- attention module, we utilize a 1D-CNN and GLU (gated lin-
tion was often conducted through the partitioning of datasets ear unit) (Dauphin et al. 2017) as alternatives. The hidden
into “unseen drug” or “unseen protein” scenarios, where layers h0 , ..., hL in the encoder are computed as:
drugs or proteins were only present in the test set. However,
such evaluations still fall into the intra-domain setting, dif- hi (XT ) = (XT Wi1 + s) ⊗ σ(XT Wi2 + t), (1)
ferent from real-world applications. Currently, there is lim- n×m1
where XT ∈ R is the input of layer hi , Wi1 ∈
ited research on domain generalization in DTI prediction.
Rk×m1 ×m2 , s ∈ Rm2 , Wi2 ∈ Rk×m1 ×m2 , t ∈ Rm2 are pa-
DrugBAN (Bai et al. 2023) addresses this challenge by uti-
rameters, n is the input sequence length, k is the patch size,
lizing Conditional Domain Adversarial Network (CDAN) to
m1 , m2 are the dimensions of input and hidden vectors, σ is
transfer the learned knowledge from the source domain to
the sigmoid function, and ⊗ is the element-wise product.
the target domain, thereby enhancing the model’s perfor-
Since the length of a protein sequence may range in the
mance in cross-domain settings. Here, we leverage pseudo-
thousands or even tens of thousands, the self-attention mod-
labeling techniques to mitigate the distribution discrepancy
ule in transformers poses a significant computational and
between the target and source domains. Through the integra-
memory burden with O(n2 ) time and space complexity, and
tion of an auxiliary classifier and the powerful representa-
is prone to overfitting when working on small datasets. The
tional capacity of BERT models, our approach significantly
above modification by Eq. (1) mitigates the computational
improves the accuracy of pseudo-labeling. Under the cross-
and storage burden on long protein sequences and remedies
domain setting, our method demonstrates remarkable per-
overfitting on small datasets.
formance surpassing that of DrugBAN.
Multilevel Cross-Attention For the task of DTI predic-
Method tion, the most crucial ability for the model is to learn the
interaction patterns between drugs and targets. It involves
Problem Formulation
aligning the features of proteins with the features of drugs in
The task of DTI prediction aims to determine whether a a shared feature subspace. However, extracting multi-level
drug compound and a target protein will interact. For drug features from proteins is more challenging than extracting
compounds, most existing deep learning methods utilize the features from drugs, because protein sequences are notably
SMILES strings to represent the drugs. Specifically, a drug long, with intricate multi-level structures, while drugs are
is represented as D = (d1 , ..., dm ), where di is a SMILES often small chemical molecules. This difference contributes
symbol with chemical meanings such as atoms, m is the to the hidden bias in DTI models (another is inherent dataset
length. As for target proteins, each protein sequence is rep- bias). Aligning protein features with drug features also re-
resented as T = (a1 , ..., an ), where ai corresponds to one of quires a multi-level process, but the model may not capture
the 23 amino acids, n is the length of the protein sequence. the multi-level features of proteins well and effectively align
Given a drug SMILES sequence D and a protein sequence them with drug features. Thus, the existing models tend to
T, the objective is to train a model to assign an interaction learn a shortcut by relying on the features of drug molecules
probability score P ∈ [0, 1] by mapping the joint feature to predict drug-target interactions.
representation space D × T. In an early literature (Xu 1993), Lmser network was pro-
posed to enhance the representation learning by building
The Proposed Framework bidirectional skip connections on every levels of layers be-
An overview of MlanDTI is depicted in Figure 1. It tween encoder and decoder. It was first demonstrated in a
commences by encoding the drug and target sequences deep CNN implementation to be robust and effective on
into vector embeddings via pre-trained BERT models, i.e., image processing (Huang, Tu, and Xu 2022; Xu 2019),
ChemBERTa-2 (Ahmad et al. 2022) and ProtTrans (Elnag- and then Lmser-transformer was developed to improve the
gar et al. 2021). Subsequently, these embeddings are passed molecular representation learning by adding hierarchical
through the encoder and decoder of a modified transformer connections to the original transformer (Qian et al. 2022).
architecture with a multi-level attention module to extract in- Inspired by these works, we propose a multi-level cross-
teraction features. The classifier comprises a bilinear atten- attention mechanism to address this issue, as illustrated in
tion module and a max pooling layer, followed by a FCN for Figure 1(a). In the vanilla transformer, the encoder uses the
prediction. For cross-domain prediction, we employ an aux- protein features from the last layer of the encoder as the Key
iliary classifier that directly accepts BERT outputs. It aids and Value for the cross-attention layer of the decoder, align-
in learning implicit distributional information from BERT ing them with the drug features in the decoder. However,
representations, thereby enhancing pseudo-label accuracy. the protein features obtained from the encoder’s output do
After training the two classifiers on labeled source domain not fully capture the expression of the multi-level structural

331
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

(a) N layers (b) Feature


Fusing
encoder decoder

Protein Sequence … …

Add & Scale


Add & Scale

Max pooling
Prot-trans

1DCNN
1DCNN
encoder decoder

GLU
GLU
encoder decoder

Prediction results
Full connected layers
Bilinear attention

Concatenated
MGN…PVR (c)
Protein
Primary Select
embeddings Multilevel Multilevel classifier Pseudo
Drug Feature
labels
Fusing Attention outputs
SMILES Sequence embeddings

classifier Train
Self Attention
Chemberta-2

heads Cross

Feedforward
Add & Norm
Add & Norm

Max pooling
Auxiliary with

Attention
classifier

Talking
pseudo
Transformer outputs labels
with
Multilevel
attention classifier
C1C1=C

Bert Model
N layers

Figure 1: (a). The overall framework of MlanDTI, it consists of two pre-trained BERT models that convert SMILES and amino
acid sequences into vector embeddings. The Encoder and Decoder are connected by a multilevel attention module, and the final
output is processed through the classifier with a bilinear attention module and a max pooling layer before being fed into FCN
to generate the prediction results. (b). The detail of Multilevel attention. (c). Training with pseudo labeling with an auxiliary
classifier.

information of proteins, and they do not align with drug fea- across different attention heads and improving the overall
tures at different levels in the decoder. performance of the model, i.e.,
Here, we develop the multi-level attention mechanism by
QK T
 
two steps: 1) the multi-level feature fusing step and 2) the
Attention(Q, K, V ) = softmax Pℓ √ Pw V, (4)
cross-attention feature aligning step. Suppose the protein dk
feature matrices of each encoder layers are T0 , ..., Tn ∈
Rm×d , where n is the number of transformer layers, m is where Q, K, V are given by Eq. (3), and Pℓ ∈
the size of the protein sequence and d is the vector dimen- Rhk ×hk , Pw ∈ Rhk ×hv are the two additional linear projec-
sion. For the ℓ-th decoder layer, we concatenate the pro- tions. hk represents the number of attention heads for keys
tein feature matrices from the preceding ℓ layers to form and queries, and hv denotes the number of attention heads
Tcatℓ = [T0 , ..., Tℓ ] ∈ Rℓ×m×d . Then, we perform a cross- for values, and they can optionally differ in size.
layer feature aggregation by applying a fusion matrix Fℓ ∈ The advantages of the proposed multi-level attention
Rℓ×1 . This results in multi-level fused protein feature matrix mechanism are briefly summarized below.
Tℓ′ = FℓT Tcatℓ . To summarize, we compute all Tℓ′ as: • Encourage multi-level feature learning: By fusing pro-
diag(T0′ , ..., Tn′ ) = F · diag(Tcat0 , ..., Tcatn ), (2) tein features, drug features are derived to interact with
relevant characteristics, which thereby captures multi-
where F is a learnable diagonal matrix with each diagonal level interaction features, leading to a more comprehen-
element being Fℓ from each layer, i.e., ℓ = 0, ..., n. Then, sive understanding of drug-target interactions.
the query, key, and value for the multi-level cross-attention
mechanism at the ℓ-th layer are respectively computed by • Alleviate hidden bias and reduce overfitting: Multi-
level attention encourages the model to focus more on
Q = D ℓ Wq , K= Tℓ′ Wk , V = Tℓ′ Wv , (3) hierarchical interaction features, the model becomes less
where Dℓ is the drug feature matrix which has passed the prone to biased representations that might emerge from
self-attention module, and Tℓ′ the multi-level protein feature focusing solely on specific patterns, and thus the model
matrix given by Eq. (2). is less likely to overfit to noisy patterns of the data.
To enhance the extraction capabilities of attention heads • Improve generalization abilities: Multi-level attention
for multi-level interactions, we incorporate the talking-heads enables the model to learn domain-invariant interaction
attention mechanism (Shazeer et al. 2020) for feature align- features. These representations exhibit robustness and
ment. This variation of multi-head attention in the trans- enhance transferability across different data domains.
former introduces two additional linear projections. These
projections transform the attention logits and the atten- The Classifier The classifier consists of the bilinear at-
tion weights, respectively, allowing the flow of information tention module from hyperattentionDTI (Zhao et al. 2022)

332
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

to further extract bidirectional interaction features. Subse- The second step is to penalize conflict predictions. Let Xd
quently, we utilize a multi-layer FCN with each layer fol- be the set of samples for which the two classifiers exhibit
lowed by a leaky ReLU activation function (He et al. 2015) conflicting classifications, i.e.,
to combine these features and generate prediction results.
Since it is a binary classification problem, we utilize the bi- Xd = {x(i) |x(i) ∈ Xt , argmax p(i) ̸= argmax p(i)
aux }, (9)
nary cross-entropy loss function to train the model. (i) (i) (i) (i) (i)
where p(i) = (p0 , p1 ), paux = (p0,aux , p1,aux ).
LCE = −[y log ŷ + (1 − y) log(1 − ŷ)], (5)
We randomly select a subset, Xd′ , of size M ′ , from Xd ,
where y is the ground truth label, ŷ is the classifier’s output. where the value of M ′ increases with the number of model
Pseudo Label Learning for Domain Adaptation iterations. We utilize a modified binary cross entropy loss to
Pseudo-labeling (Lee et al. 2013) is a semi-supervised augment the prediction uncertainty for conflicting samples
learning (SSL) method that utilizes a model trained on between the two classifiers, i.e.,
labeled source domain data to generate pseudo-labels M ′

for unlabeled target domain data. By incorporating these 1 X


Lconf =− ′ [y log 0.5 + (1 − y) log(1 − 0.5)]. (10)
pseudo-labels into the training process, the model can M i=1
adapt to target domain, which is particularly suitable
for DTI predictions where labeled data are limited and Both steps enable the model to acquire pseudo-labels
unlabeled data are massive. However, in other domains, with reduced noise for training, consequently enhancing the
pseudo-labeling based SSL methods often suffer from poor model’s performance in the target domain.
model performance due to the presence of noisy pseudo-
labels (Rizve et al. 2021). Here, we propose a simple yet Experiments
effective approach that significantly improves the accuracy Datasets
of generated labels and reduces the impact of noisy labels
on the model. We evaluated our model on the human dataset, Caenorhab-
Our method consists of two steps. In the first step, we per- ditis elegans dataset (Tsubaki, Tomii, and Sese 2019), bind-
form the selection and learning of high-confidence pseudo- ingdb dataset (Liu et al. 2007), and Biosnap dataset (Huang
labels. To achieve this, we introduce an auxiliary classi- et al. 2021). Specifically, we conducted both intra-domain
fier for co-training, which is essentially the classifier men- and cross-domain tests on the BindingDB and Biosnap
tioned earlier but directly takes BERT representations as datasets. For the intra-domain evaluation, we randomly split
(i) (i) N the dataset into training, validation, and test sets with a ratio
input. Let P1 = {p1 }N i=1 , P0 = {p0 }i=1 and P1,aux = of 8:1:1 in smaller human and C.elegans datasets, and 7:1:2
(i) (i)
{p1,aux }N N
i=1 , P0,aux = {p0,aux }i=1 be the probability out- in larger BindingDB and Biosnap datasets. We also con-
puts of the model and the auxiliary classifier for target do- ducted cold pair split experiments on BindingDB and Bios-
(i) (i)
main data Xt = {x(i) }N i=1 , respectively, such that p0 , p0,aux
nap datasets. We select 70% of drugs/proteins randomly, and
is the probability of no interaction for sample x(i) and all related DT pairs were collected as the training set. Sub-
(i) (i) sequently, DT pairs in the remaining 30% were split into a
p1 , p1,aux is the probability the sample interact. Rather 3:7 ratio, as validation set and test set. This ensures that all
than selecting thresholds, which we observed may lead drugs and proteins in the test set are unseen to model.
to unbalanced pseudo-labels, we sort (P1 + P1,aux ) and For the cross-domain evaluation, we followed the
(P0+P0,aux ) in descending order and select the top M pos- clustering-based split strategy used in DrugBAN. We ap-
itive and negative sample pairs based on their probabilities plied the ECFP4 and PSC algorithms to cluster drugs and
to assign pseudo-labels: proteins, respectively. Then, we randomly selected 60% of
(i) (i) (i) the drug and protein clusters and used all drug-protein pairs
Y1 = {ŷ1 = 1|p1 + p1,aux ∈ topM (P1 + P1,aux )}, (6)
belonging to these clusters as the source domain data. The
(i) (i) (i)
Y0 = {ŷ0 = 0|p0 + p0,aux ∈ topM (P0 + P0,aux )}, (7) drug-protein pairs in the remaining 40% clusters were used
where Y1 , Y0 represent the sets of pseudo-labels for positive as the target domain data. This data partitioning ensured that
and negative samples, respectively, and M is the number of the target domain and source domain data were from dis-
selected samples which grows with the number of iterations. joint distributions, making the evaluation more challenging
The auxiliary classifier focuses on learning the latent re- and enabling a true assessment of the model’s ability to pre-
lationships between target domain and source domain data dict interactions for unknown proteins and molecules.
within the BERT representations, while the main model pri- For the domain adaptation setting, we used all labeled
oritizes learning domain-invariant DT interaction features. source domain data and 80% of the unlabeled target domain
This leads classifier discrepancy in nature, enabling higher data as the training set. This 80% of the target domain data
accuracy for pseudo-label with high confidence on both clas- was also used as the validation set, while the remaining 20%
sifiers. After generating pseudo-labels, we employ the cross- of labeled data from the target domain served as the test set.
entropy loss to train model, i.e.,
2M
Baselines and Implementation Details
1 X We conducted a comparison between our proposed method
Lpseudo =− [y log ŷ (i) +(1−y) log(1− ŷ (i) )]. (8)
2M i=1 and eight baseline approaches: Support Vector Machine

333
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

human C.elegans BindingDB BioSNAP


methods
AUC AUPR F1 AUC AUPR F1 AUC AUPR F1 AUC AUPR F1
SVM 0.910 – 0.967 0.894 – 0.801 0.939 0.928 0.787 0.862 0.864 0.762
RF 0.940 – 0.878 0.902 – 0.832 0.942 0.921 0.858 0.860 0.886 0.808
GraphDTA 0.960 0.959 0.897 0.974 0.975 0.919 0.951 0.934 0.867 0.887 0.890 0.789
DeepConvDTI 0.967 0.964 0.922 0.983 0.985 0.944 0.945 0.925 0.859 0.886 0.890 0.797
MolTrans 0.974 0.976 0.944 0.982 0.985 0.966 0.952 0.936 0.865 0.895 0.897 0.824
TransformerCPI 0.973 0.975 0.920 0.988 0.986 0.952 0.943 0.925 0.855 0.889 0.893 0.798
HyperAttDTI 0.984 0.984 0.946 0.989 0.990 0.958 0.959 0.948 0.887 0.901 0.902 0.838
DrugBAN 0.981 0.983 0.940 0.986 0.988 0.949 0.959 0.947 0.881 0.903 0.902 0.832
Ours 0.988 0.990 0.961 0.990 0.992 0.962 0.945 0.926 0.857 0.909 0.912 0.841

Table 1: The results of the proposed model and baslines on four datasets (5 random runs), Metric: AUROC (AUC), AUPRC
(AUPR), F1-score (F1), The best results are indicated by bold. ”–” means no result for this metric.

cold cross-domain
methods BindingDB BioSNAP BindingDB BioSNAP
AUC AUPR F1 AUC AUPR F1 AUC AUPR F1 AUC AUPR F1
Moltrans 0.595 0.522 0.511 0.672 0.697 0.437 0.537 0.476 0.389 0.632 0.635 0.401
TransformerCPI 0.656 0.594 0.566 0.680 0.708 0.523 0.568 0.450 0.410 0.656 0.693 0.432
HyperAttDTI 0.661 0.598 0.582 0.732 0.760 0.539 0.545 0.462 0.376 0.654 0.685 0.395
DrugBAN 0.655 0.600 0.542 0.651 0.667 0.449 0.578 0.471 0.484 0.608 0.606 0.438
DrugBANCDAN NA NA NA NA NA NA 0.616 0.512 0.426 0.673 0.706 0.542
Ours 0.671 0.594 0.601 0.782 0.801 0.653 0.657 0.537 0.489 0.728 0.759 0.604
Ours (with PL) NA NA NA NA NA NA 0.687 0.579 0.564 0.749 0.770 0.629

Table 2: In-domain (cold pair split: unseen drugs & proteins) and cross-domain (clustering-based split) comparison on the Bind-
ingDB and BioSNAP datasets (5 random runs). 1) Underlined values explanation: We chose a threshold of 0.5 (the same one as
in MolTrans) to calculate the F1-score of DrugBAN. This is to ensure a fair comparison and to avoid ineffective classification
caused by overly low thresholds in DrugBAN. Further information is provided in the appendix. 2) NA, not applicable to this
study. 3) The term “with PL” within parentheses refers to our method that incorporates the pseudo-labeling module.

(SVM) (Cortes and Vapnik 1995), Random Forest (RF) was not particularly competitive. This discrepancy was due
(Ho 1995), GraphDTA (Nguyen et al. 2021), DeepConv- to the hidden bias issue present in the BindingDB dataset.
DTI (Lee, Keum, and Nam 2019), MolTrans (Huang et al. The BindingDB dataset contains 14643 drugs and 2623
2021), TransformerCPI (Chen et al. 2020), Hyperatten- proteins, which results in an extremely imbalanced drug-to-
tionDTI (Zhao et al. 2022), and DrugBAN (Bai et al. 2023). protein ratio compared to the other datasets (BioSNAP: 4510
These baselines encompass both classic machine learning / 2181, human: 2726 / 2001, C.elegans: 1767 / 1876). Com-
methods and the current state-of-the-art deep learning ap- pared to the other three datasets, deep learning models even
proaches, ensuring a comprehensive comparison. All deep struggle to outperform traditional machine learning meth-
learning methods were employed with their default configu- ods (AUC: RF 0.942, deepConv-DTI 0.945) on the Bind-
rations as provided by their respective authors. Our proposed ingDB dataset. Previous studies (Bai et al. 2023) have also
method in implemented in PyTorch, utilizing the Adam op- reported that the performance in the BindingDB dataset un-
timizer with an initial learning rate of 0.001. Detailed hyper- der unseen-drug setting shows minimal decline compared to
parameter settings are provided in the appendix. random splits. This phenomenon is attributed to the presence
of a large number of highly similar molecules in the dataset,
Intra-domain Experiments which makes it challenging for the naive unseen-drug set-
Table 1 displays the comparison on the human and C.elegans ting to distinguish between them. The excessive number
datasets. These two datasets are relatively small, with bal- of highly similar drug samples causes baseline models to
anced positive and negative samples, enabling us to evalu- lean towards learning drug patterns rather than drug-target
ate the model’s predictive ability within the same distribu- interactions for prediction. As a result, deep learning and
tion. Our method outperforms all deep learning baselines in machine learning methods exhibit similar performance lev-
terms of AUROC and AUPRC, and it also exhibits competi- els. However, this shortcut learning approach contradicts the
tive performance in terms of F1-score. original intent of DTI prediction and cannot be considered
We also conducted comparisons on the larger datasets, reliable in practical applications.
BindingDB and BioSNAP. In the random split tests, our However, our model focuses more on learning the multi-
model achieved state-of-the-art performance on the BioS- level interactions between proteins and drugs. In the cold
NAP dataset, but its performance on the BindingDB dataset split setting in Table 2, the model can only learn drug-target

334
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

BindingDB BioSNAP
Ablation
AUC AUPR F1 AUC AUPR F1
0.687 0.579 0.564 0.749 0.770 0.629
-BERT 0.573 0.455 0.413 0.648 0.671 0.499
-MLA 0.628 0.511 0.523 0.731 0.753 0.585
-PL 0.657 0.537 0.489 0.728 0.759 0.604
-Aux Cls 0.626 0.486 0.503 0.739 0.776 0.633

Table 3: Ablation study on BindingDB and BioSNAP


datasets (cross-domain,five random runs)
(a) (b)

interaction features, due to the lack of sufficiently similar Figure 2: Ablation experiments of Pseudo labeling accuracy
drug and protein molecules as references. Our model out- on (a) BindingDB dataset (b) BioSNAP dataset
performs other baselines on the BindingDB dataset, while
on the more balanced BioSNAP dataset, our model achieves
a superior performance compared to the baselines. substantial amount of noisy pseudo-labels that deteriorated
Overall, the challenges posed by the hidden bias issue the model’s performance.
on the BindingDB dataset highlight the importance of our
model’s ability to capture multi-level drug-target interac-
tions, which allows it to perform well in scenarios where Effectiveness of Multilevel Attention We replaced the
other baselines struggle to maintain effectiveness. multilevel attention (MLA) mechanism with the origi-
nal Transformer multi-head attention. However, on both
Cross-Domain Experiments datasets, the model exhibited performance drop in varying
degrees. With an increase in training iterations, a significant
Table 2 presents a comparison of model performance on the
decline in the accuracy of pseudo-labels was observed. It
BindingDB and BioSNAP datasets under the cross-domain
turns out that the multilevel attention mechanism is better
setting. Compared to the intra-domain setting, the major-
equipped to capture domain-invariant drug-target interaction
ity of models experience significant performance drop due
features, thereby enhancing the model’s performance in the
to the differences in data distributions. Particularly, for the
target domain.
BindingDB dataset, the clustering-based strategy ensures
that there are no similar drugs or proteins between the train-
ing and testing sets, preventing the models from relying Effectiveness of Pseudo Labeling and Auxiliary Classi-
on drug patterns. This breaks the false high-performance fier Pseudo-labeling (PL) proves effective in enhancing
illusion observed in the intra-domain scenario, and some the model’s performance within the target domain. Concur-
models even show no better performances than random rently, auxiliary classifiers contribute to reducing the noise
guessing (AUC: 0.5). Among all baselines, DrugBANCDAN , within these pseudo-labels. This effect is particularly pro-
which leveraged a conditional domain adversarial network nounced in BindingDB dataset, which exhibits substantial
(CDAN) for domain adaptation, achieved the best perfor- disparities in domain distributions. The absence of auxiliary
mance. However, DrugBANCDAN did not surpass our vanilla classifiers exacerbates the noise present within the pseudo-
model with out pseudo labeling, and our model with pseudo labels, leading to the insufficiency of the pseudo-labeling ap-
labeling significantly outperformed all state-of-the-art mod- proach in enhancing the model’s performance.
els, including DrugBAN with domain adaptation module.
Specifically, our model outperformed DrugBANCDAN by
11.52% and 11.29% (AUROC) on the BindingDB and BioS- Conclusion
NAP datasets, respectively.
In this paper, we proposed MlanDTI, a semi-supervised
Ablation Studies domain adaptive multilevel attention network that lever-
We conducted ablation studies in Table 3 under the cross- ages a large amount of unlabeled data to obtain enriched
domain setting on the BindingDB and BioSNAP datasets to bidirectional representations of drugs and proteins from a
analyze the effectiveness of modules in our proposed model. pre-trained BERT model. Additionally, we introduced the
multilevel-attention mechanism to capture domain-invariant
Effectiveness of BERT Embeddings We replaced BERT interaction features between proteins and drugs at differ-
with Word2Vec and GCN as used in TransformerCPI (Chen ent levels and depths. Finally, we incorporated a simple
et al. 2020) to obtain embeddings for drugs and proteins. As yet effective pseudo labeling method to further enhance our
shown in Table 3, the performance of the model experienced model’s generalization ability. Our model demonstrated ex-
a notable decline. This outcome can be attributed to the aux- cellent domain generalization capabilities, making it well-
iliary classifier’s inability to effectively capture the implicit suited for predicting interactions between new drugs and tar-
relationship between the source and target domains through gets in drug development. Through comprehensive compar-
the representations. As a result, in Figure 2 the accuracy isons with state-of-the-art models, we establish a substantial
of pseudo-labels exhibited a significant drop, introducing a performance superiority over prior methodologies.

335
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

Acknowledgements Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018.
Bert: Pre-training of deep bidirectional transformers for lan-
This work was supported by the National Natural
guage understanding. arXiv preprint arXiv:1810.04805.
Science Foundation of China (62172273) and Shang-
hai Municipal Science and Technology Major Project Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.;
(2021SHZDZX0102). Shikui Tu and Lei Xu are co- Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.;
corresponding authors. Steinegger, M.; et al. 2021. Prottrans: Toward understanding
the language of life through self-supervised learning. IEEE
transactions on pattern analysis and machine intelligence,
References 44(10): 7112–7127.
Abbasi, K.; Razzaghi, P.; Poso, A.; Amanlou, M.; Ghasemi, Ezzat, A.; Wu, M.; Li, X.-L.; and Kwoh, C.-K. 2019.
J. B.; and Masoudi-Nejad, A. 2020. DeepCDA: deep Computational prediction of drug–target interactions using
cross-domain compound–protein affinity prediction through chemogenomic approaches: an empirical survey. Briefings
LSTM and convolutional neural networks. Bioinformatics, in bioinformatics, 20(4): 1337–1357.
36(17): 4633–4642.
Faulon, J.-L.; Misra, M.; Martin, S.; Sale, K.; and Sapra, R.
Agamah, F. E.; Mazandu, G. K.; Hassan, R.; Bope, C. D.; 2008. Genome scale enzyme–metabolite and drug–target in-
Thomford, N. E.; Ghansah, A.; and Chimusa, E. R. 2020. teraction predictions using the signature molecular descrip-
Computational/in silico methods in drug target and lead pre- tor. Bioinformatics, 24(2): 225–233.
diction. Briefings in bioinformatics, 21(5): 1663–1675. He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep
Ahmad, W.; Simon, E.; Chithrananda, S.; Grand, G.; and into rectifiers: Surpassing human-level performance on im-
Ramsundar, B. 2022. Chemberta-2: Towards chemical foun- agenet classification. In Proceedings of the IEEE interna-
dation models. arXiv preprint arXiv:2209.01712. tional conference on computer vision, 1026–1034.
Bai, P.; Miljković, F.; John, B.; and Lu, H. 2023. Inter- Ho, T. K. 1995. Random decision forests. In Proceedings
pretable bilinear attention network with domain adaptation of 3rd international conference on document analysis and
improves drug–target prediction. Nature Machine Intelli- recognition, volume 1, 278–282. IEEE.
gence, 5(2): 126–136. Huang, K.; Xiao, C.; Glass, L. M.; and Sun, J. 2021.
Bakheet, T. M.; and Doig, A. J. 2009. Properties and identifi- MolTrans: molecular interaction transformer for drug–target
cation of human protein drug targets. Bioinformatics, 25(4): interaction prediction. Bioinformatics, 37(6): 830–836.
451–457. Huang, L.; Lin, J.; Liu, R.; Zheng, Z.; Meng, L.; Chen, X.;
Bian, J.; Zhang, X.; Zhang, X.; Xu, D.; and Wang, G. Li, X.; and Wong, K.-C. 2022. CoaDTI: multi-modal co-
2023. MCANet: shared-weight-based MultiheadCrossAt- attention based framework for drug–target interaction anno-
tention network for drug–target interaction prediction. Brief- tation. Briefings in Bioinformatics, 23(6): bbac446.
ings in Bioinformatics, 24(2): bbad082. Huang, W.; Tu, S.; and Xu, L. 2022. Deep CNN based Lmser
and strengths of two built-in dualities. Neural Processing
Broach, J. R.; Thorner, J.; et al. 1996. High-throughput Letters, 54(5): 3565–3581.
screening for drug discovery. Nature, 384(6604): 14–16.
Kao, P.-Y.; Kao, S.-M.; Huang, N.-L.; and Lin, Y.-C.
Chen, L.; Cruz, A.; Ramsey, S.; Dickson, C. J.; Duca, J. S.; 2021. Toward drug-target interaction prediction via en-
Hornak, V.; Koes, D. R.; and Kurtzman, T. 2019. Hidden semble modeling and transfer learning. In 2021 IEEE In-
bias in the DUD-E dataset leads to misleading performance ternational Conference on Bioinformatics and Biomedicine
of deep learning in structure-based virtual screening. PloS (BIBM), 2384–2391. IEEE.
one, 14(8): e0220113.
Kipf, T. N.; and Welling, M. 2016. Semi-supervised classi-
Chen, L.; Tan, X.; Wang, D.; Zhong, F.; Liu, X.; Yang, T.; fication with graph convolutional networks. arXiv preprint
Luo, X.; Chen, K.; Jiang, H.; and Zheng, M. 2020. Trans- arXiv:1609.02907.
formerCPI: improving compound–protein interaction pre- Lee, D.-H.; et al. 2013. Pseudo-label: The simple and effi-
diction by sequence-based deep learning with self-attention cient semi-supervised learning method for deep neural net-
mechanism and label reversal experiments. Bioinformatics, works. In Workshop on challenges in representation learn-
36(16): 4406–4414. ing, ICML, volume 3, 896. Atlanta.
Cheng, F.; Zhou, Y.; Li, J.; Li, W.; Liu, G.; and Tang, Lee, I.; Keum, J.; and Nam, H. 2019. DeepConv-DTI: Pre-
Y. 2012. Prediction of chemical–protein interactions: diction of drug-target interactions via deep learning with
multitarget-QSAR versus computational chemogenomic convolution on protein sequences. PLoS computational bi-
methods. Molecular BioSystems, 8(9): 2373–2384. ology, 15(6): e1007129.
Cortes, C.; and Vapnik, V. 1995. Support-vector networks. Liu, T.; Lin, Y.; Wen, X.; Jorissen, R. N.; and Gilson, M. K.
Machine learning, 20: 273–297. 2007. BindingDB: a web-accessible database of experimen-
Dauphin, Y. N.; Fan, A.; Auli, M.; and Grangier, D. 2017. tally determined protein–ligand binding affinities. Nucleic
Language modeling with gated convolutional networks. In acids research, 35(suppl 1): D198–D201.
International conference on machine learning, 933–941. Meng, F.-R.; You, Z.-H.; Chen, X.; Zhou, Y.; and An, J.-
PMLR. Y. 2017. Prediction of drug–target interaction networks

336
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

from the integration of protein sequences and drug chemi- Wu, F.; Jin, S.; Jiang, Y.; Jin, X.; Tang, B.; Niu, Z.; Liu,
cal structures. Molecules, 22(7): 1119. X.; Zhang, Q.; Zeng, X.; and Li, S. Z. 2022. Pre-Training
Nguyen, T.; Le, H.; Quinn, T. P.; Nguyen, T.; Le, T. D.; of Equivariant Graph Matching Networks with Conforma-
and Venkatesh, S. 2021. GraphDTA: Predicting drug–target tion Flexibility for Drug Binding. Advanced Science, 9(33):
binding affinity with graph neural networks. Bioinformatics, 2203796.
37(8): 1140–1147. Xu, L. 1993. Least mean square error reconstruction prin-
Öztürk, H.; Özgür, A.; and Ozkirimli, E. 2018. DeepDTA: ciple for self-organizing neural-nets. Neural networks, 6(5):
deep drug–target binding affinity prediction. Bioinformatics, 627–648.
34(17): i821–i829. Xu, L. 2019. An overview and perspectives on bidirectional
Paul, S. M.; Mytelka, D. S.; Dunwiddie, C. T.; Persinger, intelligence: Lmser duality, double IA harmony, and causal
C. C.; Munos, B. H.; Lindborg, S. R.; and Schacht, A. L. computation. IEEE/CAA Journal of Automatica Sinica, 6(4):
2010. How to improve R&D productivity: the pharmaceuti- 865–893.
cal industry’s grand challenge. Nature reviews Drug discov- Yazdani-Jahromi, M.; Yousefi, N.; Tayebi, A.; Kolanthai, E.;
ery, 9(3): 203–214. Neal, C. J.; Seal, S.; and Garibay, O. O. 2022. AttentionSit-
Qian, H.; Lin, C.; Zhao, D.; Tu, S.; and Xu, L. 2022. Al- eDTI: an interpretable graph-based model for drug-target in-
phaDrug: protein target specific de novo molecular genera- teraction prediction using NLP sentence-level relation clas-
tion. PNAS nexus, 1(4): pgac227. sification. Briefings in Bioinformatics, 23(4): bbac272.
Qian, Y.; Wu, J.; and Zhang, Q. 2022. CAT-CPI: Combin- Zhao, Q.; Zhao, H.; Zheng, K.; and Wang, J. 2022. HyperAt-
ing CNN and transformer to learn compound image features tentionDTI: improving drug–protein interaction prediction
for predicting compound-protein interactions. Frontiers in by sequence-based deep learning with attention mechanism.
Molecular Biosciences, 9: 963912. Bioinformatics, 38(3): 655–662.
Rifaioglu, A. S.; Atas, H.; Martin, M. J.; Cetin-Atalay, R.; Zheng, S.; Li, Y.; Chen, S.; Xu, J.; and Yang, Y. 2020. Pre-
Atalay, V.; and Doğan, T. 2019. Recent applications of deep dicting drug–protein interaction using quasi-visual question
learning and machine intelligence on in silico drug discov- answering system. Nature Machine Intelligence, 2(2): 134–
ery: methods, tools and databases. Briefings in bioinformat- 140.
ics, 20(5): 1878–1912.
Rizve, M. N.; Duarte, K.; Rawat, Y. S.; and Shah, M.
2021. In defense of pseudo-labeling: An uncertainty-aware
pseudo-label selection framework for semi-supervised
learning. arXiv preprint arXiv:2101.06329.
Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and
Monfardini, G. 2008. The graph neural network model.
IEEE transactions on neural networks, 20(1): 61–80.
Shazeer, N.; Lan, Z.; Cheng, Y.; Ding, N.; and Hou, L. 2020.
Talking-heads attention. arXiv preprint arXiv:2003.02436.
Sieg, J.; Flachsenberg, F.; and Rarey, M. 2019. In need of
bias control: evaluating chemical data for machine learning
in structure-based virtual screening. Journal of chemical in-
formation and modeling, 59(3): 947–961.
Stepniewska-Dziubinska, M. M.; Zielenkiewicz, P.; and
Siedlecki, P. 2018. Development and evaluation of a deep
learning model for protein–ligand binding affinity predic-
tion. Bioinformatics, 34(21): 3666–3674.
Tsubaki, M.; Tomii, K.; and Sese, J. 2019. Compound–
protein interaction prediction with end-to-end learning of
neural networks for graphs and sequences. Bioinformatics,
35(2): 309–318.
Wallach, I.; Dzamba, M.; and Heifets, A. 2015. Atom-
Net: a deep convolutional neural network for bioactivity pre-
diction in structure-based drug discovery. arXiv preprint
arXiv:1510.02855.
Wang, X.-r.; Cao, T.-t.; Jia, C. M.; Tian, X.-m.; and Wang,
Y. 2021. Quantitative prediction model for affinity of drug–
target interactions based on molecular vibrations and overall
system of ligand-receptor. BMC bioinformatics, 22(1): 1–
18.

337

You might also like