SESNet
Abstract
Deep learning has been widely used for protein engineering. However, it is limited by
the lack of sufficient experimental data to train an accurate model for predicting the
functional fitness of high-order mutants. Here, we develop SESNet, a supervised deep-
learning model that predicts the fitness of protein mutants by leveraging both sequence
and structure information and exploiting an attention mechanism. Our model integrates
the local evolutionary context from homologous sequences, the global evolutionary context
encoding rich semantics of the universal protein sequence space, and the structure
information accounting for the microenvironment around each residue in a protein. We
show that SESNet outperforms state-of-the-art models in predicting the sequence-
function relationship on 26 deep mutational scanning datasets. More importantly, we
propose a data-augmentation strategy that leverages data from unsupervised models
to pre-train our model. After pre-training, our model achieves strikingly high accuracy in
predicting the fitness of protein mutants, especially for higher-order variants (>4
mutation sites), when fine-tuned with only a small number of experimental
mutation data (<50). This strategy is of great practical value because the required
experimental effort, i.e., producing a few tens of experimental mutation data points for a given
protein, is generally affordable for an ordinary biochemical group and can be applied
to almost any protein.
Introduction
Proteins are the workhorses of life. Their various functions, such as
catalysis, binding, and transport, underlie most of the metabolic activities in cells.
In addition, they are key components of the cytoskeleton, supporting the stable and
diverse forms of organisms. Nature provides numerous proteins with great potential
value for practical applications. However, natural proteins often do not have the
optimal function to meet the demands of bioengineering. Directed evolution is a widely
used experimental method to optimize protein functionality, namely fitness, by
employing a greedy local search over the protein fitness landscape[1, 2]. During this process,
gain-of-function mutants are obtained by mutating several amino acids
(AAs) in the protein, which are selected and accumulated through iterative
rounds of mutagenesis, with hundreds to thousands of variants tested in each generation.
Despite the great success of directed evolution, the region of the protein
fitness landscape that can be screened by this method is rather limited. Furthermore, to
obtain a mutant of excellent fitness, especially a high-order mutant with multiple mutated
AAs, directed evolution often requires an effective high-throughput screen or a
large number of experimental tests, which is
experimentally and economically challenging[3].
Since experimental screening for directed evolution is costly, particularly
for high-order mutations, in silico prediction of the fitness of protein variants is highly
desirable. Recently, deep learning methods have been applied to predict the fitness
landscape of protein variants[2]. By training models to learn the sequence-
function relationship, deep learning can predict the fitness of each mutant in the whole
sequence space and provide a list of the most favorable candidate mutants for experimental
tests. Generally, these deep learning models can be classified into protein language
models[4-11], which learn representations from global unlabeled sequences[6, 7,
12], and multiple sequence alignment (MSA)-based models, which capture the
evolutionary information within the family of the targeted protein[13-16]. More
recent works have proposed to combine these two strategies, learning from evolutionary
information together with global natural sequences as the representation[17, 18], and
training the model on labelled experimental data of screened variants to predict the
fitness of all possible sequences. Nevertheless, all these models focus on protein
sequence, i.e., they use the protein sequence as the model input. Apart from sequence
information, protein structure can provide additional information on function. Owing to
the experimental challenge of determining protein structures, the number of reported
protein structures is orders of magnitude smaller than that of known protein sequences,
which has hindered the development of geometric deep learning models that leverage protein
structural features. Thanks to the dramatic breakthroughs in deep learning-based
protein structure prediction[19, 20], especially AlphaFold 2, it is now
possible to efficiently predict protein structures from sequences at a large scale[21].
Recently, some studies have directly taken protein structural features as input to train
geometric deep learning models, which have been shown to achieve better or similar
performance in predicting protein function compared to language models[22-24].
However, fused deep-learning methods that exploit both the sequence
and structural information of a protein to map the sequence-function relationship remain largely
unexplored[25].
Recently, both supervised and unsupervised models have been developed for
protein engineering, i.e., for predicting the fitness of protein mutants[24, 26]. Generally
speaking, supervised models often achieve better performance than
unsupervised models[26], but the former require a large amount (at least hundreds
to thousands) of experimental mutation data for the protein studied, which
is experimentally challenging[18]. In contrast, unsupervised models do not need
any such experimental data, but their performance is relatively worse, especially for
high-order mutants, which are often the final product of a directed-evolution project. It
is thus highly desirable to develop a deep-learning algorithm that can efficiently and
accurately predict the fitness of protein variants, especially high-order mutants,
without requiring a large amount of experimental mutation data for the protein concerned.
In the present work, we built a supervised deep learning model (SESNet), which
effectively fuses protein sequence and structure information to predict the
fitness of variant sequences (Fig 1A). We demonstrated that SESNet outperforms
several state-of-the-art models on 26 mutagenesis datasets. Moreover, to reduce the
model's dependence on the quantity of experimental mutation data, we proposed a
data-augmentation strategy (Fig 1B), in which the model is first pre-trained on a
large quantity of low-quality results derived from an unsupervised model and then
fine-tuned with a small amount of high-quality experimental results. We showed that
the proposed model achieves very high accuracy in predicting the fitness of high-
order variants of a protein, even for those with more than four mutation sites, when the
experimental dataset used for fine-tuning is as small as 40 variants. Moreover, our model can
predict the key AA sites that are crucial for protein fitness, so that protein
engineers can focus on these key sites for mutagenesis, greatly reducing the
experimental cost of trial and error.
Results
Deep learning-based architecture of SESNet for predicting protein fitness.
To exploit the diverse information from protein sequence, coevolution and
structure, we fuse three encoder modules into our model, as shown in Fig 1A: the first
one (local encoder) accounts for residue interdependence in a specific protein learned
from evolution-related sequences[15, 16]; the second one (global encoder) captures
sequence features in the global protein sequence universe[6, 12]; and the third one (structure
module) captures local structural features around each residue learned from the 3D
geometric structure of the protein[23, 24]. To integrate the information from the different
modules, we first concatenate the representations of the local and global encoders to obtain an
integrated sequence representation. This integrated sequence representation is then passed
through an attention layer to produce the sequence attention weights, which are further
averaged with the structure attention weights derived from the structure module, yielding
the combined attention weights. Finally, the product of the combined attention weights and
the integrated sequence representation is fed into a fully connected layer to
generate the predicted fitness. The combined attention weights can also be used to
predict the key AA sites critical for protein fitness, details of which are discussed in
the Method section.
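As an illustration of this data flow, the PyTorch-style sketch below shows one way the fusion step could be implemented. It is a minimal sketch under our own assumptions (module and argument names, layer sizes, omission of layer normalization, and the simple 50/50 averaging of attention weights), not the released SESNet implementation.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Illustrative sketch of the fusion step described above.

    Inputs (per protein of length L):
      local_rep:  (L, d_local)   residue features from the local (MSA/Potts) encoder
      global_rep: (L, d_global)  residue features from the global language-model encoder
      struct_att: (L,)           structure attention weights from the structure module
    """

    def __init__(self, d_local: int, d_global: int, d_hidden: int = 256):
        super().__init__()
        d_joint = d_local + d_global
        self.att_proj = nn.Linear(d_joint, 1)  # produces per-residue sequence attention scores
        self.fc = nn.Sequential(
            nn.Linear(d_joint, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1)
        )

    def forward(self, local_rep, global_rep, struct_att):
        joint = torch.cat([local_rep, global_rep], dim=-1)                 # integrated representation (L, d_joint)
        seq_att = torch.softmax(self.att_proj(joint).squeeze(-1), dim=0)   # sequence attention weights (L,)
        combined_att = 0.5 * (seq_att + struct_att)                        # average with structure attention
        context = (combined_att.unsqueeze(-1) * joint).sum(dim=0)          # weighted sum over residues
        return self.fc(context)                                            # predicted fitness (scalar)
```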
Figure 1. Architecture of the model and schematic of the data-augmentation strategy. Architecture
of SESNet (A): The local encoder accounts for the inter-residue dependence in a protein learned
from an MSA of homologous sequences using a Markov random field[27]. The global encoder captures
sequence features in the global protein sequence universe using a protein language model[6]. The
structure module accounts for the microenvironmental features of a residue learned from the
3D geometric structure of the protein[23, 28]. Schematic of the data-augmentation strategy (B): We
first build a mutant library containing all single-site mutants and numerous double-site
mutants. All of these mutated sequences are then scored by the unsupervised model. After that,
these mutants are used to pre-train the initial model (SESNet), which is further fine-tuned on a
small number of low-order experimental mutation data.
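For concreteness, a minimal self-contained Python sketch of the library-construction and scoring step is shown below. The amino-acid alphabet, the toy wild-type sequence and the random placeholder scorer are illustrative assumptions; a real pipeline would score each sequence with an unsupervised model (e.g., ESM-1v) instead.

```python
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"

def single_mutants(seq):
    """Enumerate all single-site mutants of a wild-type sequence."""
    for i, wt in enumerate(seq):
        for aa in AAS:
            if aa != wt:
                yield seq[:i] + aa + seq[i + 1:]

def sample_double_mutant(seq):
    """Mutate two distinct, randomly chosen sites."""
    i, j = random.sample(range(len(seq)), 2)
    s = list(seq)
    s[i] = random.choice([a for a in AAS if a != seq[i]])
    s[j] = random.choice([a for a in AAS if a != seq[j]])
    return "".join(s)

def build_pretraining_set(wild_type, scorer, n_doubles=1000):
    """Score all singles plus sampled doubles to form a noisy pseudo-labelled set,
    used to pre-train the model before fine-tuning on experimental data."""
    library = list(single_mutants(wild_type)) + [
        sample_double_mutant(wild_type) for _ in range(n_doubles)
    ]
    return [(seq, scorer(seq)) for seq in library]

# Usage with a placeholder scorer (a real run would call an unsupervised model here):
pretrain_set = build_pretraining_set("MKTAYIAKQR", scorer=lambda s: random.random())
```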
The combined attention weights guide the identification of key AA sites.
The combined attention weights can be used to measure the importance of each AA
site for protein fitness when mutated. To a first approximation, the higher the attention
score, the more important the AA site. To test this approximation, we trained our model
on the experimental data of 1084 single-site mutants in the GFP dataset [29], a green
fluorescent protein from Aequorea victoria. The ground-truth key sites of GFP
are defined here as the experimentally discovered top 20 sites, which exhibit the largest
change of protein fitness when mutated, or the AAs forming and stabilizing the
chromophore, which are known to significantly affect the fluorescent function of the
protein [30] but lack fitness results in the experimental dataset. Indeed, at least 4 out of
the 7 top attention-score AA sites predicted by our model are key sites: two of them
(AG65 and T201) are located at the chromophore, and the other two (P73 and R71) are
among the top 20 residues experimentally found to produce the largest change of fitness
when mutated (Fig 3A and Fig S1A). Interestingly, when we removed the structure
module from the model, only one residue among the predicted top-7 attention-score
AAs is a key site (Fig 3B and Fig S1B).
To further verify this finding, we performed the same tests on the dataset of
RRM, the RNA recognition motif of the Saccharomyces cerevisiae poly(A)-binding
protein[31]. The key sites of RRM are defined as the experimentally discovered top 20
sites, which produce the largest change of protein fitness when mutated, or the
binding sites, which lie within 5 Å of the RNA molecule in the structure
of PDB 6R5K. Fig 3C and Fig S2A show that 4 out of the 7 top attention-score AA sites
predicted by our model are key AAs: one of them (I12) is among the top 20 residues
and three of them (N7, P10 and K39) are binding sites. In contrast, no key residue is
found among the predicted top-seven attention-score AAs when the structure module
is removed (Fig 3D and Fig S2B).
The results in Fig. 3 demonstrate that the structure module, which learns the
microscopic structural information around each residue, makes an important
contribution to identifying the key AAs crucial for protein fitness. Although
the ablation study (Supplementary Table 5) reveals that adding the structure
module improves the average Spearman correlation over 20 datasets by only 4 percent,
Fig. 3 demonstrates an important role of the structure module: it can guide
protein engineers to identify the important AA sites of a protein for mutagenesis.
Figure 5. Results of models trained on different datasets. A-B: The Spearman correlation of
fitness prediction for high-order mutants after fine-tuning on 37 experimental single-site mutation
results from the F7YBW8 dataset and on 20 experimental single-site mutation results from the
CAPSD_AAV2S dataset, respectively. The red and blue bars represent the results of the pre-trained
model and the original model without pre-training, respectively, and the green bars correspond to the
results of the unsupervised model, which is ProGen2 for F7YBW8 and ESM-IF1 for CAPSD_AAV2S.
Discussion
In this study, we present a supervised deep learning model that leverages both the
sequence and structure information of a protein to predict the fitness of variants, and
we find that this model outperforms existing state-of-the-art models for protein
engineering. Moreover, we proposed a data-augmentation strategy, which pre-trains our
model using the results predicted by an unsupervised model and then fine-tunes the
model with only a small number of experimental results. We demonstrated that such
data augmentation significantly improves the accuracy of the model when the
experimental results are very limited (~40), including for high-order mutants with >4
mutation sites. We note that our work, especially the data-augmentation strategy
proposed here, is of great practical importance, as the experimental effort it requires
is generally affordable for an ordinary biochemical research group and it can be applied
to most proteins.
Method
Details of Model Architecture
Local encoder. Residue interdependencies are crucial for evaluating whether a mutation is
acceptable. Several models, including ESM-MSA-1b [37], DeepSequence[14],
EVE[38] and Potts-model-based methods[27] such as EVmutation[16] and ECNet [39], utilize
multiple sequence alignment (MSA) to mine the constraints of the evolutionary process at
the residue level. In the present work, we use a Potts model to establish the local encoder.
This method first searches for homologous sequences and builds the MSA of the given
protein with HHsuite [40]. A statistical model is then used to identify the
evolutionary couplings by learning a generative model of the MSA of homologous
sequences using a Markov random field. In this model, the probability of each sequence
depends on an energy function, defined as the sum of single-site constraints
$e_i$ and all pairwise coupling constraints $e_{ij}$:

$$E(x) = \sum_i e_i(x_i) + \sum_{i \neq j} e_{ij}(x_i, x_j) \qquad (1)$$

where $i$ and $j$ are position indices along the sequence. The $i$-th amino acid $x_i$ is
encoded by a vector whose elements are the single-site term $e_i(x_i)$ and the pairwise
coupling terms $e_{ij}(x_i, x_j)$ for $j = 1, \ldots, L$, where $L$ is the number of residues in
the sequence. The coupling parameters $e_i$ and $e_{ij}$ can be estimated using a regularized
maximum pseudolikelihood algorithm [41, 42]. As a result, each amino acid in the sequence is
represented by a vector of length $L + 1$, and the whole input sequence is
encoded as a matrix of size $(L + 1) \times L$. Since the length of the local
evolutionary representation of each amino acid is close to the length of the sequence,
the $(L+1)$-long vector is transformed into a new vector of fixed length $d_l$
(in our local encoder, $d_l = 128$) through a fully connected layer to avoid overfitting.
The protein sequence, with randomly initialized embeddings, is also passed through a Bi-LSTM
layer and transformed into an $L \times d_l$ matrix. By concatenating the two matrices above, we
obtain the output of the local encoder $e' = \langle e'_1, e'_2, \ldots, e'_L \rangle$, whose size is $L \times 2 d_l$.
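A minimal PyTorch sketch of this encoder is given below, assuming the per-residue Potts features are supplied as an $L \times (L+1)$ matrix and the sequence as integer-encoded amino acids. The class and argument names are illustrative, not taken from the SESNet codebase.

```python
import torch
import torch.nn as nn

class LocalEncoder(nn.Module):
    """Sketch of the local encoder: project per-residue Potts features
    (length L+1) to d_l, run a Bi-LSTM over learned sequence embeddings,
    and concatenate both streams into an (L, 2*d_l) representation."""

    def __init__(self, seq_len: int, vocab_size: int = 20, d_l: int = 128):
        super().__init__()
        self.potts_proj = nn.Linear(seq_len + 1, d_l)   # (L+1) -> d_l per residue
        self.embed = nn.Embedding(vocab_size, d_l)      # randomly initialized embeddings
        self.bilstm = nn.LSTM(d_l, d_l // 2, bidirectional=True, batch_first=True)

    def forward(self, potts_feats, tokens):
        # potts_feats: (L, L+1) single-site + pairwise coupling terms
        # tokens:      (L,)     integer-encoded amino acids
        evo = self.potts_proj(potts_feats)                      # (L, d_l)
        seq, _ = self.bilstm(self.embed(tokens).unsqueeze(0))   # (1, L, d_l)
        return torch.cat([evo, seq.squeeze(0)], dim=-1)         # (L, 2*d_l)

# Toy usage: a length-100 protein yields a (100, 256) local representation.
enc = LocalEncoder(seq_len=100)
out = enc(torch.randn(100, 101), torch.randint(0, 20, (100,)))
```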
Global Encoder. Recently, large-scale pre-trained models have been successfully
applied to diverse tasks for inferring protein structure or function from sequence
information, such as secondary structure prediction, contact prediction and
prediction of mutational effects. We therefore take a pre-trained protein language model as
the global encoder, which is responsible for extracting the biochemical properties and
evolutionary information of protein sequences. Several effective language models are
available, such as UniRep [12], TAPE [43], ESM-1v [44], ESM-1b [37] and ProteinBERT[11]. We tested
these language models on our validation datasets, and the results show that ESM-1b
performs better than the others. Therefore, we chose ESM-1b as the global encoder.
The model is a BERT-based [45] context-aware language model for proteins, trained on
the UniRef50 protein sequence dataset (86 billion amino acids across 250 million
protein sequences). Owing to its ability to represent the biological properties and
evolutionary diversity of proteins, we utilize this model as our global encoder to encode
the protein sequence. Formally, given a protein sequence $x = \langle x_1, x_2, \ldots, x_L \rangle$
as input, where $x_i$ is the one-hot representation of the $i$-th amino acid,
$L$ is the length of the sequence, and $N$ is the size of the amino-acid alphabet,
the global encoder first encodes each amino acid and its
context as $g = \langle g_1, g_2, \ldots, g_L \rangle$, where $g_i \in \mathbb{R}^n$ (in ESM-1b, $n = 1280$). Then
$g_i$ is projected to $g'_i$ in a hidden space $\mathbb{R}^h$ of lower dimension (in our default
model configuration, $h = 256$): $g'_i = W_G\, g_i + b$, where $W_G \in \mathbb{R}^{n \times h}$ is a learnable
affine transform parameter matrix and $b \in \mathbb{R}^h$ is the bias. The output of the global
encoder is $g' = \langle g'_1, g'_2, \ldots, g'_L \rangle \in \mathbb{R}^{L \times h}$. We integrate the ESM-1b architecture into
our model, i.e., we update the parameters of ESM-1b dynamically during the training
process.
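As a concrete sketch of this step, the snippet below extracts per-residue ESM-1b representations with the fair-esm package and projects them to the hidden dimension $h = 256$. The example sequence is arbitrary, the first run downloads the pretrained weights, and, unlike in SESNet where ESM-1b is fine-tuned jointly, the language model is kept frozen here.

```python
import torch
import torch.nn as nn
import esm  # facebookresearch/esm (pip install fair-esm)

# Load the pretrained ESM-1b model and its tokenizer.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()

# Tokenize a toy sequence and extract the final-layer (33) representations.
_, _, tokens = batch_converter([("wt", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33][0, 1:-1]   # (L, n): drop BOS/EOS tokens

# Learnable affine projection to the lower-dimensional hidden space (h = 256).
proj = nn.Linear(reps.size(-1), 256)
g_prime = proj(reps)                         # (L, h) global representation
```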
Intra-Attention. The outputs of the local encoder and the global encoder are embedding
vectors aligned to all positions of the input sequence. We utilize an intra-attention mechanism
to compress these embeddings into a context vector. The inputs of the attention layer are:
(1) the global representations $g' = \langle g'_1, g'_2, \ldots, g'_L \rangle$; (2) the local representations
$e' = \langle e'_1, e'_2, \ldots, e'_L \rangle$; (3) the reconstruction perplexities $p' = \langle p'_1, p'_2, \ldots, p'_L \rangle$
from the structure module. First, the local and global representations are normalized by layer
normalization [46] over the length dimension for stable training: $g' = \mathrm{LayerNorm}(g')$ and
$e' = \mathrm{LayerNorm}(e')$. Second, the normalized global and local representations are concatenated
into joint representations $r = \langle r_1, r_2, \ldots, r_L \rangle$, where $r_i = [g'_i; e'_i] \in \mathbb{R}^{2h}$.
Then we use a dot-product attention layer to compute the sequence attention weights
$a = \langle a_1, a_2, \ldots, a_L \rangle \in \mathbb{R}^L$, where $a_i \in \mathbb{R}$ is the attention weight on the $i$-th position:

$$a_i = \frac{\exp(r_i \cdot W_a r_i)}{\sum_{k=1}^{L} \exp(r_k \cdot W_a r_k)},$$

where $W_a$ is a learnable parameter matrix. Besides the sequence attention weights, there are
structure attention weights $s = \langle s_1, s_2, \ldots, s_L \rangle \in \mathbb{R}^L$, calculated from the
reconstruction perplexities:

$$s_i = \frac{\exp(p'_i)}{\sum_{k=1}^{L} \exp(p'_k)}.$$

We use the average of the sequence attention and the structure attention as the final combined
attention weights $w = \langle w_1, w_2, \ldots, w_L \rangle$, where $w_i = (a_i + s_i)/2$. Based on the combined
attention weights, we obtain the context vector $c = \sum_{i=1}^{L} w_i\, r_i$ as the embedding vector of the
entire sequence.
Output layer. The inputs of the output layer are the context vector $c$ from the
attention aggregator and an evolutionary score $d$ from the unsupervised model [23].
Since the evolutionary score may not be reliable in all cases, we use a dynamic
weight to take this score into account. The context vector $c$ is first transformed into
a hidden vector $h = \mathrm{ReLU}(W_h c + b)$, where $W_h$ and $b$ are learnable
parameters and ReLU [47] is the activation function. Then, the hidden vector $h$ is
used to calculate the weight $p \in (0, 1)$ on $d$: $p = \mathrm{Sigmoid}(W_p [h; d])$. The value of
$p$ quantifies how much the model should trust the score from the zero-shot model. Finally,
we use a linear layer to compute a fitness score $y_q \in \mathbb{R}$ directly from the hidden
vector $h$, where $y_q = W_q h + b$. The output of our model, i.e., the
predicted fitness $y \in \mathbb{R}$, is computed as:

$$y = (1 - p)\, y_q + p\, d. \qquad (2)$$

We use the mean squared error (MSE) as the loss function to update the model parameters
during back-propagation:

$$\mathrm{loss} = \frac{1}{N} \sum_{i=1}^{N} (t_i - y_i)^2, \qquad (3)$$

where $N$ is the number of samples in a mini-batch, $t_i$ is the target fitness and $y_i$ is
the predicted fitness.
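The gating logic of Eqs. (2)-(3) can be sketched in PyTorch as follows; the class and the dimensions are illustrative assumptions rather than the released implementation, and the evolutionary score $d$ is passed as a plain float.

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    """Sketch of the gated output layer: combine the supervised prediction
    from the context vector with the unsupervised evolutionary score d
    via a learned gate p (h = 256 as in the default configuration)."""

    def __init__(self, d_context: int, h: int = 256):
        super().__init__()
        self.to_hidden = nn.Linear(d_context, h)
        self.gate = nn.Linear(h + 1, 1)   # computes p from [h; d]
        self.head = nn.Linear(h, 1)       # computes y_q from h

    def forward(self, c, d):
        # c: (d_context,) context vector; d: scalar evolutionary score
        d = torch.as_tensor([float(d)], dtype=c.dtype)
        hvec = torch.relu(self.to_hidden(c))
        p = torch.sigmoid(self.gate(torch.cat([hvec, d])))   # trust in the zero-shot score
        y_q = self.head(hvec)                                # supervised prediction from h
        return (1 - p) * y_q + p * d                         # Eq. (2)

# Eq. (3): mean squared error between target and predicted fitness.
loss_fn = nn.MSELoss()
```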
Training details. SESNet was trained with the Adam optimizer with weight decay (i.e.,
L2 regularization). Hyperparameters of the model were tuned with a local grid search on the
validation set. Since conducting 5-fold cross-validation and grid search on 20 datasets
is costly, we only searched on two representative datasets: the GFP dataset as a representative
multi-site dataset and the RRM dataset as a representative single-site dataset. The best
hyperparameter configuration found there was then applied to the other datasets.
We tested hidden sizes of [128, 256, 512], learning rates of [1e-3, 5e-4, 1e-4, 5e-5,
1e-5], and dropout rates of [0.1, 0.2, 0.4]. Table 7 in the SI gives the details of the
hyperparameter configuration. All experiments were conducted on a GPU server with
10 RTX 3090 GPUs (24 GB VRAM) and 2 Intel Gold 6226R CPUs with 2 TB RAM.
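The snippet below sketches this grid search and optimizer setup. The model here is a small stand-in MLP rather than SESNet, the weight-decay value is illustrative, and the training and validation loops are elided.

```python
import itertools
import torch
import torch.nn as nn

# Hyperparameter grid as described above.
grid = {
    "hidden": [128, 256, 512],
    "lr": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
    "dropout": [0.1, 0.2, 0.4],
}

for hidden, lr, dropout in itertools.product(grid["hidden"], grid["lr"], grid["dropout"]):
    # Stand-in model for illustration only.
    model = nn.Sequential(nn.Linear(512, hidden), nn.ReLU(),
                          nn.Dropout(dropout), nn.Linear(hidden, 1))
    # Adam with weight decay (L2 regularization); the decay strength is an assumption.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    # ... train on the training fold, compute the Spearman correlation on the
    # validation fold, and keep the best (hidden, lr, dropout) configuration.
```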
Model comparison. The source code of the ECNet model used for comparison was downloaded from the
GitHub repository (https://github.com/luoyunan/ECNet) provided by Ref [17]. The ESM-
1b model was also reproduced on our local machines with the architecture described
in its publication [6]. The code of ESM-IF1, ESM-1v and the MSA transformer (ESM-
MSA-1b) was obtained from the Facebook Research GitHub repository
(https://github.com/facebookresearch/esm). For each assay, all experiments with the
different models were performed on the same dataset.
Ethical Approval
Not applicable
Competing interests
The authors declare no competing interests.
Authors’ contributions
LH and PT designed this project; PT and ML proposed the model; ML, LQ, YX, YGW
and GF implemented the method and performed the calculations. All authors read
and approved the final manuscript.
Author’s notes
Funding
This work was financially supported by the Natural Science Foundation of China (Grant
Nos. 12104295, 11974239, 31630002, 61872094 and 61832019), the Innovation Program
of the Shanghai Municipal Education Commission, and the Shanghai Jiao Tong University
Multidisciplinary Research Fund of Medicine and Engineering (YG 2016QN13). The
computing hardware resources were supported by the Center for High Performance
Computing at Shanghai Jiao Tong University.