
SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering

Mingchen Li1,4†, Liqi Kang1,2†, Yi Xiong5, Yu Guang Wang1, Guisheng Fan4, Pan Tan1*, Liang Hong1,2,3*


1. Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of
Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
2. School of Physics and Astronomy & School of Pharmacy, Shanghai Jiao Tong University, Shanghai 200240, China
3. Shanghai Artificial Intelligence Laboratory, Shanghai 200240, China
4. School of Information Science and Engineering, East China University of Science and
Technology, Shanghai 200240, China
5. School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240,
China

Abstract
Deep learning has been widely used for protein engineering. However, it is limited by the lack of sufficient experimental data for training an accurate model to predict the functional fitness of high-order mutants. Here, we develop SESNet, a supervised deep-learning model that predicts the fitness of protein mutants by leveraging both sequence and structure information and exploiting the attention mechanism. Our model integrates the local evolutionary context from homologous sequences, the global evolutionary context encoding rich semantics from the universal protein sequence space, and the structure information accounting for the microenvironment around each residue in a protein. We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship on 26 deep mutational scanning datasets. More importantly, we propose a data-augmentation strategy that leverages data generated by unsupervised models to pre-train our model. After that, our model achieves strikingly high accuracy in predicting the fitness of protein mutants, especially for high-order variants (>4 mutation sites), when finetuned using only a small number of experimental mutation data points (<50). The proposed strategy is of great practical value, as the required experimental effort, i.e., producing a few tens of experimental mutation data points on a given protein, is generally affordable for an ordinary biochemical group and can be applied to almost any protein.

Introduction
Proteins are the workhorses of life. Their various functions, such as catalysis, binding, and transport, underpin most of the metabolic activities in cells. In addition, they are key components of the cytoskeleton, supporting the stable and diverse forms of organisms. Nature provides numerous proteins with great potential value for practical applications. However, natural proteins often do not have the optimal function to meet the demands of bioengineering. Directed evolution is a widely used experimental method to optimize a protein's functionality, namely its fitness, by employing a greedy local search[1, 2]. During this process, gain-of-function mutants are obtained by mutating several amino acids (AAs) in the protein; favorable mutations are selected and accumulated through iterative rounds in which hundreds to thousands of variants are tested per generation. Despite the great success of directed evolution, the portion of the protein fitness landscape that can be screened by this method is rather limited. Furthermore, to acquire a mutant of excellent fitness, especially a high-order mutant with multiple AAs mutated, directed evolution often requires an effective high-throughput screening method or a large number of experimental tests, which is experimentally and economically challenging[3].
Since experimental screening for directed evolution is highly costly, particularly for high-order mutations, in silico prediction of the fitness of protein variants is highly desirable. Recently, deep learning methods have been applied to predict the fitness landscape of protein variants[2]. By training models to learn the sequence-function relationship, deep learning can predict the fitness of each mutant in the whole sequence space and provide a list of the most favorable candidate mutants for experimental tests. Generally, these deep learning models can be classified into protein language models[4-11], which learn representations from global unlabeled sequences[6, 7, 12], and multiple sequence alignment (MSA)-based models, which capture the evolutionary information within the family of the target protein[13-16]. More recent works have proposed to combine the two strategies, learning from evolutionary information together with global natural sequences as the representation[17, 18], and training the model on labelled experimental data of screened variants to predict the fitness of all possible sequences. Nevertheless, all these models focus on protein sequence, i.e., they use the protein sequence as the input of the model. Apart from sequence information, protein structure can provide additional information on function. Owing to the experimental challenge of determining protein structures, the number of reported protein structures is orders of magnitude smaller than that of known protein sequences, which has hindered the development of geometric deep learning models that leverage protein structural features. Thanks to the dramatic breakthroughs in deep learning-based protein structure prediction[19, 20], especially AlphaFold 2, it is now possible to efficiently predict protein structures from sequences at a large scale[21]. Recently, some studies have taken protein structure features directly as input to train geometric deep learning models, which have been shown to achieve better or similar performance in predicting protein function compared to language models[22-24]. However, fused deep-learning methods that make use of both sequence and structural information of a protein to map the sequence-function relationship remain largely unexplored[25].
Recently, both supervised and unsupervised models have been developed for protein engineering, i.e., for predicting the fitness of protein mutants[24, 26]. Generally speaking, a supervised model can often achieve better performance than an unsupervised one[26], but the former requires a large amount (at least hundreds to thousands) of experimental mutation data of the studied protein for training, which is experimentally challenging[18]. In contrast, the unsupervised model does not need any such experimental data, but its performance is relatively worse, especially for high-order mutants, which are often the final product of a directed-evolution project. It is thus highly desirable to develop a deep-learning algorithm that can efficiently and accurately predict the fitness of protein variants, especially high-order mutants, without requiring a large amount of experimental mutation data for the protein concerned. In the present work, we built a supervised deep learning model (SESNet), which effectively fuses protein sequence and structure information to predict the fitness of variant sequences (Fig 1A). We demonstrate that SESNet outperforms several state-of-the-art models on 26 mutagenesis datasets. Moreover, to reduce the dependence of the model on the quantity of experimental mutation data, we propose a data-augmentation strategy (Fig 1B), in which the model is first pre-trained using a large quantity of low-quality results derived from an unsupervised model and then finetuned with a small amount of high-quality experimental results. We show that the proposed model can achieve very high accuracy in predicting the fitness of high-order variants of a protein, even for those with more than four mutation sites, when the experimental dataset used for finetuning is as small as 40 data points. Moreover, our model can predict the key AA sites that are crucial for protein fitness, so that protein engineers can focus on these key sites for mutagenesis. This can greatly reduce the experimental cost of trial and error.

Results
Deep learning-based architecture of SESNet for predicting protein fitness.
To exploit the diverse information from protein sequence, coevolution and structure, we fuse three encoder modules into our model. As shown in Fig 1A, the first one (local encoder) accounts for residue interdependence in a specific protein learned from evolution-related sequences[15, 16]; the second one (global encoder) captures the sequence features of the global protein sequence universe[6, 12]; and the third one (structure module) captures the local structural features around each residue learned from the 3D geometric structure of the protein[23, 24]. To integrate the information of the different modules, we first concatenate the representations of the local and global encoders to obtain an integrated sequence representation. This integrated sequence representation is then sent to an attention layer to produce the sequence attention weights, which are further averaged with the structure attention weights derived from the structure module, leading to the combined attention weights. Finally, the product of the combined attention weights and the integrated sequence representation is fed into a fully connected layer to generate the predicted fitness. The combined attention weights can also be used to predict the key AA sites critical for protein fitness, details of which are discussed in the Method section.
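As a high-level sketch of this fusion, the snippet below composes the three modules described above; every module here is a placeholder callable (names and signatures are illustrative only, not the authors' released code), and the per-module details are given in the Method section.

```python
def sesnet_forward(local_enc, global_enc, struct_mod, attention, head,
                   seq_tokens, potts_feats, structure):
    """Illustrative SESNet forward pass; all modules are placeholder callables."""
    e = local_enc(potts_feats, seq_tokens)   # (B, L, 2*d_l) local evolutionary representation
    g = global_enc(seq_tokens)               # (B, L, h) global language-model representation
    p = struct_mod(structure, seq_tokens)    # (B, L) per-residue structure signal
    context, weights = attention(g, e, p)    # combined attention weights and pooled context
    fitness = head(context)                  # fully connected layer -> predicted fitness
    return fitness, weights
```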
Figure 1. Architecture of the model and schematic of the data-augmentation strategy. (A) Architecture of SESNet: the local encoder accounts for the inter-residue dependence in a protein learned from the MSA of homologous sequences using a Markov random field[27]. The global encoder captures the sequence features of the global protein sequence universe using a protein language model[6]. The structure module accounts for the microscopically environmental features of a residue learned from the 3D geometric structure of the protein[23, 28]. (B) Schematic of the data-augmentation strategy: we first build a mutant library containing all single-site mutants and numerous double-site mutants. Then, all of these mutated sequences are scored by the unsupervised model. After that, these mutants are used to pre-train the initial model (SESNet), which is further finetuned on a small number of low-order experimental mutation data.

SESNet outperforms state-of-the-art methods for predicting the fitness of variants on deep mutational scanning (DMS) datasets
We compared our supervised model against existing state-of-the-art supervised models, ECNet[17] and ESM-1b[6], and unsupervised models, ESM-1v[9], ESM-IF1[23] and the MSA Transformer[15]. As can be seen in Fig 2A, in 19 out of 20 datasets the supervised models generally outperform the unsupervised ones, as expected, and our model (SESNet) achieves the best performance among all models. Moreover, we further explored the ability of our model to predict the fitness of higher-order variants by training it on the experimental results of low-order variants on 6 DMS datasets. As shown in Fig 2B&C, our model outperforms all the other models. The data in Fig 2 are presented in Supplementary Tables 1, 2 & 3. These datasets cover various proteins and different types of functionality, including catalytic rate, stability, and binding affinity to peptide, DNA, RNA and antibody, as well as fluorescence intensity (Table 4). While most of the datasets contain only single-site mutants, five of them involve both single-site and double-site mutants, and the GFP dataset contains data up to 15-site mutants.

All three components contribute positively to the performance of SESNet.


As described in the architecture above (Fig 1A), our model integrates three different encoders or modules. To investigate how much each of the three parts contributes, we performed ablation studies on 20 datasets of single-site mutants. Briefly, we removed each of the three components and compared the performance to that of the original model. As shown in Supplementary Table 5, the average Spearman correlation of the original model is 0.672, much higher than that without the local encoder (0.639), without the global encoder (0.247), or without the structure module (0.630). The ablation study reveals that all three components contribute to the improvement of model performance, and the contribution from the global encoder, which captures the sequence features of the global protein sequence universe, is the most significant.

The combined attention weights guide the finding of the key AA site.
The combined attention weights can be used to measure the importance of each AA
site on protein fitness when mutated. To the first approximation, higher the attention
score is, more important the AA site is. To test this approximation, we trained our model
on the experimental data of 1084 single-site mutants in the dataset of GFP [29], a green
fluorescent protein from Aequorea victoria. The ground truth of the key sites of GFP
are defined here as the experimentally discovered top 20 sites, which exhibit the largest
change of protein fitness when mutated, or the AAs forming and stabilizing the
chromophore, which are known to significantly affect the fluorescent function of the
protein [30], but lack the fitness results in the experimental dataset. Indeed, one can
observe that, at least 4 out of 7 top attention-score AA sites predicted by our model are
the key sites as two of them (AG65 and T201) are located at the chromophore, and the
other two (P73 and R71) were among the top 20 residues discovered in experiment to
render the highest change of fitness when mutated (Fig 3A and Fig S1A). Interestingly,
when we removed the structure module from the model, only one residue in the
predicted top-7 attention-score AA is the key site (Fig 3B and Fig S1B).
To further verify this discovery, we also performed these tests on the dataset of RRM, the RNA recognition motif of the Saccharomyces cerevisiae poly(A)-binding protein[31]. The key sites of RRM are defined as the experimentally discovered top 20 sites that render the largest change of fitness of the protein when mutated, or the binding sites that are within 5 Å of the RNA molecule as revealed in the structure of PDB 6R5K. Fig 3C and Fig S2A show that 4 out of the 7 top attention-score AA sites predicted by our model are key AAs. One of them (I12) is among the top 20 residues and three of them (N7, P10 and K39) are binding sites. In contrast, no key residue was found among the predicted top-seven attention-score AAs when we removed the structure module (Fig 3D and Fig S2B).
The results in Fig 3 demonstrate that the structure module, which learns the microscopic structural information around each residue, makes an important contribution to identifying the key AAs crucial for protein fitness. Although the ablation study (Supplementary Table 5) reveals that the addition of the structure module improves the average Spearman correlation over 20 datasets by only 4 percent, Fig 3 demonstrates an important role of the structure module: it can guide the protein engineer to identify the important AA sites in a protein for mutagenesis.
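As a minimal illustration of how such candidate sites can be read off, the sketch below ranks residues by the combined attention weights described in the Method section; the weight array and the choice k = 7 are placeholders for this example, not values exported from the trained model.

```python
import numpy as np

# Hypothetical combined attention weights w (one value per residue) from a trained model;
# random values stand in here purely for illustration.
L = 230                                   # sequence length (placeholder)
w = np.random.rand(L)

k = 7                                     # number of candidate key sites to inspect
top_sites = np.argsort(w)[::-1][:k] + 1   # 1-based residue positions with the largest weights
print("Top attention-score sites:", sorted(top_sites.tolist()))
```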

Data-augmentation strategy boosts the performance of fitness prediction when finetuned on a small set of labelled experimental data.
Supervised models normally perform better than unsupervised models (see Fig 2)[26], but the accuracy of a supervised model is highly affected by the amount of experimental data used for training. However, it is experimentally challenging and costly to generate sufficient data (many hundreds or even thousands of data points) for every protein studied. To address this challenge, we propose a simple data-augmentation strategy: we use results generated by an unsupervised model to pre-train our model on a given protein, and then finetune it using a limited number of experimental results on the same protein. We call the result a pre-trained model. We note that data-augmentation strategies have been applied in various earlier works and have achieved good success in protein design[23, 32, 33]. In particular, to improve the accuracy of inverse folding, ref [23] used 16,153 experimentally determined 3D protein structures and 12 million structures predicted by AlphaFold 2[19] to train the model ESM-IF1[23]. In the present work, the data-augmentation strategy is used for a different purpose: to reduce the dependence of the supervised model on the size of the experimental dataset when predicting the fitness of protein mutants. We took GFP as an example to illustrate our data-augmentation strategy, as GFP has a large number of experimental data for testing, particularly for high-order mutants (up to 15-site mutants). We used the fitness results of low-order mutants predicted by the unsupervised model ESM-IF1 to pre-train our model. The pre-training dataset contains the fitness of all single-site mutants and 30,000 double-site mutants randomly selected out of tens of millions of double-site variants. Then, we finetuned the pre-trained model on a certain number of experimental results of single-site mutants. The resulting model was used to predict the fitness of high-order mutants. As can be seen in Fig 4A-D, compared with the original model without pre-training (blue bars), the performance of the pre-trained model is significantly improved (red bars). Such improvement is particularly large when only a small number of experimental data points (40) is fed for training; it gradually diminishes when more experimental data are fed and eventually disappears when more than 1000 experimental data points are used for training. Here, we would like to particularly highlight the case in which the finetuning experimental dataset contains only 40 data points. As can be seen in Fig 4A, the pre-trained model can achieve a high Spearman correlation of 0.5-0.7 for multi-site mutants, even for high-order mutants with 5-8 mutation sites. This is remarkably important for most protein engineers, as such an experimental workload (40 data points) is generally affordable in an ordinary biochemical research group. Without pre-training, however, the performance of the supervised model is rather low (~0.2). This comparison demonstrates the advantage of the data-augmentation strategy proposed in the present work.
Moreover, we also compared the performance of the pre-trained model with that of the unsupervised model (green bars), which was used to generate the low-quality pre-training datasets. As can be seen, when only 40 experimental data points were used for training, the pre-trained model performs similarly to the unsupervised model for low-order mutants (<4 mutation sites), but clearly outperforms the latter for high-order mutants (>4 mutation sites). When feeding more experimental data, especially a couple of hundred points, the pre-trained model outperforms the unsupervised model regardless of how many sites of the protein were mutated. The unsupervised model used for the analysis in Fig 4 is ESM-IF1, which captures the local structural information of a residue. To demonstrate the general superiority of the data-augmentation strategy proposed here, we also tested the results using another unsupervised model to generate the augmented datasets for GFP. As can be seen in Fig S3, we used ProGen2[8], an unsupervised model that learns global sequence information, for data augmentation, and reached the same conclusion as in Fig 4: the pre-trained model outperforms the original model without pre-training, especially when a small experimental dataset is used for training, and it also beats the unsupervised model, particularly for high-order mutants.
To further validate the generality of the data-augmentation strategy proposed here, we performed the analysis on the datasets of other proteins: a toxin-antitoxin complex (F7YBW8)[34] containing data up to 4-site mutants, and adeno-associated virus capsids (CAPSD_AAV2S)[35], a deep mutational dataset including data up to 23-site mutants. We used the unsupervised model ProGen2[8] to generate the low-quality data of F7YBW8 for pre-training, since we found that ProGen2 performs better than ESM-IF1 on this dataset. As shown in Fig 5A, the pre-trained model outperforms both the original model without pre-training and the unsupervised model in the fitness prediction of all multi-site mutants (2-4 sites) after being finetuned using only 37 experimental data points. In addition, on the CAPSD_AAV2S dataset (Fig 5B), the pre-trained model also achieves the best performance on all of the high-order mutants ranging from 2 to 23 sites, when finetuned on only 20 experimental data points. These results further support the practical use of our data-augmentation strategy, as the required experimental effort is affordable for most proteins.

Learned models provide insight into protein fitness.


SESNet projects a protein sequence into a high-dimensional latent space and represents each mutant as a vector from the last hidden layer. Thus, we can visualize the relationships between sequences in this latent space to reveal how the network learns and comprehends protein fitness. Specifically, we trained SESNet on the experimental data of single-site mutants from the GFP and RRM datasets, then used the trained model and the untrained model to encode each variant, extracting the output of the last hidden layer as a representation of the variant sequence. Fig S4 shows a two-dimensional projection of the high-dimensional latent space using t-SNE[36]. We found that the representations of positive and negative variants, i.e., those whose experimental fitness values are larger or smaller than that of the wildtype, generated by the trained SESNet are clearly clustered into distinct groups (Fig S4A and Fig S4B). In contrast, the representations from the untrained model cannot provide a distinguishable boundary between positive and negative variants (Fig S4C and Fig S4D). Therefore, with supervised training, SESNet learns a latent representation space that distinguishes the functional fitness of mutants.
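The projection itself can be reproduced with standard tooling; a minimal sketch follows, assuming the per-variant last-hidden-layer vectors and binary fitness labels have already been exported to NumPy files (the file names below are hypothetical).

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Assumed inputs: (n_variants, hidden_dim) last-hidden-layer vectors from SESNet and
# labels marking each variant as positive (1) or negative (0) relative to wildtype fitness.
reps = np.load("variant_representations.npy")   # hypothetical file name
labels = np.load("variant_labels.npy")          # hypothetical file name

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(reps)
plt.scatter(emb[labels == 1, 0], emb[labels == 1, 1], s=4, label="positive")
plt.scatter(emb[labels == 0, 0], emb[labels == 0, 1], s=4, label="negative")
plt.legend()
plt.savefig("latent_tsne.png", dpi=300)
```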
Furthermore, to explore why the data-augmentation strategy works, we performed a case study on the GFP dataset. Here, we compared the latent-space representations from the last hidden layer generated by our model with and without pre-training on the augmented data from the unsupervised model. As seen in Fig S5A, after pre-training, even without finetuning on experimental data, SESNet can already roughly distinguish the negative and positive mutants. One can thus deduce that the pre-training furnishes a good parameter initialization for SESNet. After further finetuning the pre-trained SESNet on only 40 experimental data points of single-site mutants, a rather clear boundary between negative and positive high-order mutants emerges (Fig S5B). In contrast, when we skipped the pre-training process, i.e., directly trained the model on 40 experimental data points, the separation between the positive and negative high-order mutants is rather ambiguous (Fig S5C). This comparison demonstrates the superiority of our data-augmentation strategy in distinguishing mutants of distinct fitness values when the number of available experimental data points is limited.
Figure 2. Spearman correlation of predicted fitness. A: Comparison of our model with other models on the predicted fitness of single-site mutants on 20 datasets. We performed five-fold cross-validation with a 7:1:2 ratio of training, validation and test sets. B: Comparison of the predicted fitness of double-site mutants of our model with other unsupervised models (ESM-1v, ESM-IF1 and the MSA Transformer) and supervised models (ECNet and ESM-1b). Here, our model and the other supervised models were trained on the data of single-site mutants. We used 10% of double-site mutants as the validation set and the remaining 90% as the test set. C: Comparison of our model with other models on fitness prediction of quadruple-site mutants of GFP. Here, our model and the other supervised models were trained using single-, double-, and triple-site mutants, and all three together. We used 10% of quadruple-site mutants as the validation set and the remaining 90% as the test set. The error bars for single-site mutants were obtained from the five-fold cross-validation. Since five-fold cross-validation cannot be performed for fitness prediction of high-order mutants trained on low-order mutants, error bars are not shown for those data.
Figure 3. The sites with the top 7 largest attention scores on the wildtype sequence. A&B: The key sites of GFP are marked as red spheres. A: 4 key sites were recovered by our model. G65 and T201 are the active residues helping to form and stabilize the chromophore in GFP, as described in Ref [30]. P73 and R71 are among the experimentally discovered top 20 sites, which render the highest change of fitness when mutated. B: Only one key site, Y37, was identified by the model when the structure module was removed; it is among the experimentally discovered top 20 AA sites. C&D: The key sites of RRM are marked as red spheres. C: 4 key sites were recovered by the original model. N7, P10 and K39 are binding sites within 5 Å of the RNA molecule. I12 is among the experimentally discovered top 20 sites, which render the highest change of fitness when mutated. D: No key site was identified by the model when the structure module was removed.
Figure 4. Results of models trained on different numbers of experimental variants. A-D: Spearman correlation of fitness prediction for multi-site (2-8 sites) mutants after finetuning on 40, 100, 400, and 1084 single-site experimental mutation results from the GFP dataset. The red and blue bars represent the results of the pre-trained model and of the original model without pre-training, respectively; the green bars correspond to the results of the unsupervised model ESM-IF1 as a control.

Figure 5. Results of models trained on different datasets. A-B: Spearman correlation of fitness prediction for high-order mutants after finetuning on 37 experimental single-site mutation results from the F7YBW8 dataset and on 20 experimental single-site mutation results from CAPSD_AAV2S, respectively. The red and blue bars represent the results of the pre-trained model and of the original model without pre-training; the green bars correspond to the results of the unsupervised model, which is ProGen2 for F7YBW8 and ESM-IF1 for CAPSD_AAV2S.
Discussion
In this study, we present a supervised deep learning model that leverages both sequence and structure information of a protein to predict the fitness of variants. This model is found to outperform existing state-of-the-art models for protein engineering. Moreover, we propose a data-augmentation strategy that pre-trains our model on results predicted by another, unsupervised model and then finetunes it with only a small number of experimental results. We demonstrated that such data augmentation significantly improves the accuracy of the model when experimental results are very limited (~40 data points), including for high-order mutants with >4 mutation sites. We note that our work, especially the data-augmentation strategy proposed here, is of great practical importance, as the experimental effort it requires is generally affordable for an ordinary biochemical research group and can be applied to most proteins.

Method
Details of Model Architecture
Local encoder. Residue interdependencies are crucial for evaluating whether a mutation is acceptable. Several models, including ESM-MSA-1b [37], DeepSequence [14], EVE [38] and Potts-model-based methods [27] such as EVmutation [16] and ECNet [39], utilize multiple sequence alignments (MSAs) to extract the constraints of the evolutionary process at the residue level. In the present work, we use a Potts model to build the local encoder. This method first searches for homologous sequences and builds the MSA of the given protein with HH-suite [40]. After that, a statistical model identifies the evolutionary couplings by learning a generative model of the MSA of homologous sequences using a Markov random field. In this model, the probability of each sequence depends on an energy function, defined as the sum of single-site constraints $e_i$ and all pairwise coupling constraints $e_{ij}$:

$$E(\mathbf{x}) = \sum_{i} e_i(x_i) + \sum_{i \neq j} e_{ij}(x_i, x_j) \qquad (1)$$

where $i$ and $j$ are position indices along the sequence. The $i$-th amino acid $x_i$ is encoded by a vector whose elements are the single-site term $e_i(x_i)$ and the pairwise coupling terms $e_{ij}(x_i, x_j)$ for $j = 1, \dots, L$, where $L$ is the number of residues in the sequence. The coupling parameters $e_i$ and $e_{ij}$ can be estimated using a regularized maximum pseudolikelihood algorithm [41, 42]. As a result, each amino acid in the sequence is represented by a vector of length $(L+1)$, and the whole input sequence is encoded as a matrix of size $(L+1) \times L$. Since the length of the local evolutionary representation of each amino acid is close to the length of the sequence, the $(L+1)$-long vector is transformed into a new vector of fixed length $d_l$ (in our local encoder, $d_l = 128$) through a fully connected layer to avoid overfitting. The protein sequence is also passed through a Bi-LSTM layer, from a random initialization, and transformed into an $L \times d_l$ matrix. By concatenating the two matrices above, we obtain the output of the local encoder $\mathbf{e}' = \langle \mathbf{e}'_1, \mathbf{e}'_2, \dots, \mathbf{e}'_L \rangle$, whose size is $L \times 2 d_l$.
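A minimal sketch of the local-encoder bookkeeping follows, assuming the Potts features have already been computed (e.g., by pseudolikelihood fitting of the MSA) and are supplied as an $(L+1)$-long vector per residue; the module and tensor names are illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn

class LocalEncoder(nn.Module):
    """Project per-residue Potts features to d_l and concatenate a Bi-LSTM sequence embedding."""
    def __init__(self, seq_len: int, n_tokens: int = 20, d_l: int = 128):
        super().__init__()
        self.proj = nn.Linear(seq_len + 1, d_l)    # (L+1)-long Potts vector -> d_l
        self.embed = nn.Embedding(n_tokens, d_l)   # randomly initialized amino-acid embedding
        self.bilstm = nn.LSTM(d_l, d_l // 2, batch_first=True, bidirectional=True)

    def forward(self, potts_feats, tokens):
        # potts_feats: (B, L, L+1) evolutionary features; tokens: (B, L) amino-acid indices
        local = self.proj(potts_feats)             # (B, L, d_l)
        seq, _ = self.bilstm(self.embed(tokens))   # (B, L, d_l) from the bidirectional LSTM
        return torch.cat([local, seq], dim=-1)     # (B, L, 2*d_l)
```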
Global encoder. Recently, large-scale pre-trained models have been successfully applied to diverse tasks for inferring protein structure or function from sequence information, such as secondary structure prediction, contact prediction and prediction of mutational effects. Thus, we take a pre-trained protein language model as the global encoder, which is responsible for extracting the biochemical properties and evolutionary information of protein sequences. There are several effective language models, such as UniRep [12], TAPE [43], ESM-1v [44], ESM-1b [37] and ProteinBERT [11]. We tested these language models on our validation datasets, and the results show that ESM-1b performs better than the others; we therefore chose ESM-1b as the global encoder. The model is a BERT-based [45] context-aware language model for proteins, trained on the UniRef50 protein sequence dataset (86 billion amino acids across 250 million protein sequences). Owing to its ability to represent the biological properties and evolutionary diversity of proteins, we utilize this model as our global encoder to encode the protein sequence. Formally, we take a protein sequence $\mathbf{x} = \langle \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_L \rangle$ as input, where $\mathbf{x}_i \in \{0,1\}^N$ is the one-hot representation of the $i$-th amino acid, $L$ is the length of the sequence, and $N$ is the size of the amino acid alphabet. The global encoder first encodes each amino acid and its context to $\mathbf{g} = \langle \mathbf{g}_1, \mathbf{g}_2, \dots, \mathbf{g}_L \rangle$, where $\mathbf{g}_i \in \mathbb{R}^n$ (in ESM-1b, $n = 1280$). Then $\mathbf{g}_i$ is projected to $\mathbf{g}'_i$ in a hidden space $\mathbb{R}^h$ of lower dimension (in our default model configuration, $h = 256$), $\mathbf{g}'_i = \mathbf{W}_G \mathbf{g}_i + \mathbf{b}$, where $\mathbf{W}_G \in \mathbb{R}^{h \times n}$ is a learnable affine transform parameter matrix and $\mathbf{b} \in \mathbb{R}^h$ is the bias. The output of the global encoder is $\mathbf{g}' = \langle \mathbf{g}'_1, \mathbf{g}'_2, \dots, \mathbf{g}'_L \rangle \in \mathbb{R}^{L \times h}$. We integrate the ESM-1b architecture into our model, i.e., we update the parameters of ESM-1b dynamically during the training process.
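As a sketch of this step, the snippet below uses the fair-esm package to obtain per-residue ESM-1b representations and project them to $h = 256$. This is illustrative glue code under our assumptions (the example sequence, variable names, and the untrained projection), not the authors' implementation.

```python
import torch.nn as nn
import esm  # pip install fair-esm

# Load ESM-1b and its tokenizer (in SESNet, ESM-1b parameters are updated during training).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()

h = 256
proj = nn.Linear(1280, h)  # W_G and b: project per-residue embeddings to the hidden space

_, _, tokens = batch_converter([("example", "MSKGEELFTGVVPILVELDGDVNGHK")])  # example sequence
out = model(tokens, repr_layers=[33])
g = out["representations"][33][:, 1:-1, :]   # drop BOS/EOS tokens -> (1, L, 1280)
g_prime = proj(g)                            # (1, L, 256) global representation
```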

Structure module. The structure module utilizes microenvironmental information to guide the fitness prediction. In this part, we use the ESM-IF1 model [23] to generate scores for mutant sequences, which evaluate their ability to fold into the wildtype structure of the given protein. Higher scores mean the corresponding mutations are more favorable than others. Specifically, all possible single mutants at each position of a sequence obtain corresponding scores, and the predicted sequence distribution is an $L \times 20$ matrix. We then calculate the cross-entropy at each position of the sequence between this matrix and the one-hot encoding matrix of the mutant sequence. After passing the results through a softmax function, we obtain an $L \times 1$ output vector, the reconstruction perplexities $\mathbf{p}' = \langle p'_1, p'_2, \dots, p'_L \rangle$ aligned with the input sequence. In the present work, we do not directly encode the distance map or the 3D coordinates of the mutated protein, since that would require folding every specific mutant from its sequence, which would lead to an unaffordable computational cost and is impractical for the task of fitness prediction.
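A minimal sketch of this per-residue scoring, assuming the per-position amino-acid logits from an inverse-folding model such as ESM-IF1 have already been computed and are supplied as an L x 20 tensor (the function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def reconstruction_perplexities(logits: torch.Tensor, mutant_tokens: torch.Tensor) -> torch.Tensor:
    """Per-residue structure signal from inverse-folding predictions.

    logits: (L, 20) per-position amino-acid logits from an inverse-folding model (assumed given).
    mutant_tokens: (L,) integer amino-acid indices of the mutant sequence.
    Returns: (L,) vector p' obtained by softmax-normalizing the per-position cross-entropy
             between the predicted distribution and the mutant one-hot encoding.
    """
    ce = F.cross_entropy(logits, mutant_tokens, reduction="none")  # (L,) per-position cross-entropy
    return torch.softmax(ce, dim=0)                                # (L,) reconstruction perplexities p'
```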

Intra-attention. The outputs of the local encoder and the global encoder are embedding vectors aligned with all positions of the input sequence. We utilize an intra-attention mechanism to compress the whole embedding into a context vector. The inputs of the attention layer are: (1) the global representations $\mathbf{g}' = \langle \mathbf{g}'_1, \mathbf{g}'_2, \dots, \mathbf{g}'_L \rangle$; (2) the local representations $\mathbf{e}' = \langle \mathbf{e}'_1, \mathbf{e}'_2, \dots, \mathbf{e}'_L \rangle$; and (3) the reconstruction perplexities $\mathbf{p}' = \langle p'_1, p'_2, \dots, p'_L \rangle$. First, the local and global representations are normalized by layer normalization [46] for stable training, i.e., $\mathbf{g}' = \mathrm{LayerNorm}(\mathbf{g}')$ and $\mathbf{e}' = \mathrm{LayerNorm}(\mathbf{e}')$. Second, the normalized global and local representations are concatenated into joint representations $\mathbf{r} = \langle \mathbf{r}_1, \mathbf{r}_2, \dots, \mathbf{r}_L \rangle$, where $\mathbf{r}_i = [\mathbf{g}'_i; \mathbf{e}'_i] \in \mathbb{R}^{2h}$. We then use a dot-product attention layer to compute the sequence attention weights $\mathbf{a} = \langle a_1, a_2, \dots, a_L \rangle \in \mathbb{R}^L$, where $a_i \in \mathbb{R}$ is the attention weight on the $i$-th position:

$$a_i = \frac{\exp(\mathbf{r}_i \cdot \mathbf{W}_a \mathbf{r}_i)}{\sum_{k=1}^{L} \exp(\mathbf{r}_k \cdot \mathbf{W}_a \mathbf{r}_k)}$$

where $\mathbf{W}_a \in \mathbb{R}^{2h \times 2h}$ is a learnable parameter matrix. Besides the sequence attention weights, there are structure attention weights $\mathbf{s} = \langle s_1, s_2, \dots, s_L \rangle \in \mathbb{R}^L$, calculated from the reconstruction perplexities:

$$s_i = \frac{\exp(p'_i)}{\sum_{k=1}^{L} \exp(p'_k)}$$

We use the average of the sequence attention and the structure attention as the final combined attention weights, $\mathbf{w} = \langle w_1, w_2, \dots, w_L \rangle$ with $w_i = (a_i + s_i)/2$. According to the combined attention weights, we obtain the context vector $\mathbf{c} = \sum_{i=1}^{L} w_i \mathbf{r}_i$ as the embedding vector of the entire sequence.
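The following PyTorch sketch mirrors the fusion described above (layer normalization, concatenation, bilinear dot-product attention, averaging with the structure weights, and pooling into a context vector). The shapes and the bilinear parameterization of $\mathbf{W}_a$ are our reading of the equations under stated assumptions, not the released code.

```python
import torch
import torch.nn as nn

class IntraAttention(nn.Module):
    """Fuse local/global representations and structure perplexities into a context vector."""
    def __init__(self, h: int = 256, d_l: int = 128):
        super().__init__()
        dim = h + 2 * d_l                           # dimension of the joint representation r_i
        self.norm_g = nn.LayerNorm(h)
        self.norm_e = nn.LayerNorm(2 * d_l)
        self.W_a = nn.Linear(dim, dim, bias=False)  # bilinear attention parameter

    def forward(self, g, e, p):
        # g: (B, L, h) global reps; e: (B, L, 2*d_l) local reps; p: (B, L) reconstruction perplexities
        r = torch.cat([self.norm_g(g), self.norm_e(e)], dim=-1)   # (B, L, dim)
        scores = (r * self.W_a(r)).sum(-1)                        # r_i . W_a r_i, shape (B, L)
        a = torch.softmax(scores, dim=-1)                         # sequence attention weights
        s = torch.softmax(p, dim=-1)                              # structure attention weights
        w = 0.5 * (a + s)                                         # combined attention weights
        c = (w.unsqueeze(-1) * r).sum(dim=1)                      # context vector, (B, dim)
        return c, w
```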

Output layer. The inputs of the output layer are the context vector $\mathbf{c}$ from the attention aggregator and an evolutionary score $d$ from the unsupervised model [23]. Since the evolutionary score may not be reliable in many cases, we use a dynamic weight to take this score into account. The context vector $\mathbf{c}$ is first transformed into a hidden vector $\mathbf{h} = \mathrm{ReLU}(\mathbf{W}_h \mathbf{c} + \mathbf{b})$, where $\mathbf{W}_h$ and $\mathbf{b}$ are learnable parameters and ReLU [47] is the activation function. The hidden vector $\mathbf{h}$ is then used to calculate the weight $p \in (0,1)$ on $d$: $p = \mathrm{Sigmoid}(\mathbf{W}_p [\mathbf{h}; d])$. The scale of $p$ quantifies how much the model should trust the score from the zero-shot model. Finally, we use a linear layer to compute a fitness score $y_q \in \mathbb{R}$ directly from the hidden vector $\mathbf{h}$: $y_q = \mathbf{W}_q \mathbf{h} + b$. The output of our model, i.e., the predicted fitness $y \in \mathbb{R}$, is computed as:

$$y = (1 - p) \times y_p + p \times y_q \qquad (2)$$

We utilize the mean square error (MSE) as the loss function to update the model parameters during back-propagation:

$$\mathrm{loss} = \frac{1}{N} \sum_{i=1}^{N} (t_i - y_i)^2 \qquad (3)$$

where $N$ is the number of samples in a mini-batch, $t_i$ is the target fitness and $y_i$ is the output fitness.
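Because $y_p$ in Eq. (2) is not defined explicitly in the text, the sketch below reads it as the zero-shot evolutionary score $d$; treat this gating module as one possible reading of the output layer under that assumption, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    """Gate between a supervised fitness score and the zero-shot evolutionary score.

    Assumption: y_p in Eq. (2) is interpreted here as the zero-shot score d; the manuscript
    does not define y_p explicitly, so this mapping is an assumption of this sketch.
    """
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.to_hidden = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())  # W_h, b, ReLU
        self.gate = nn.Linear(hidden + 1, 1)   # W_p acting on the concatenation [h; d]
        self.head = nn.Linear(hidden, 1)       # W_q, b: supervised fitness score y_q

    def forward(self, c, d):
        # c: (B, dim) context vector; d: (B, 1) zero-shot evolutionary score
        h = self.to_hidden(c)
        p = torch.sigmoid(self.gate(torch.cat([h, d], dim=-1)))  # dynamic trust weight in (0, 1)
        y_q = self.head(h)
        return (1.0 - p) * d + p * y_q          # Eq. (2) under the y_p = d assumption
```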

Dataset and experimental settings


Benchmark dataset collection. We first collected 20 deep mutational scanning datasets from Ref [14]. Most of them contain only the fitness data of single-site mutants, while one of them (RRM) [31] also provides data for high-order mutants. The fitness measured in these datasets includes enzyme function, growth rate, peptide binding, viral replication and protein stability. We also collected the mutant data of the WW domain of human Yap1, the GB1 domain of protein G from Streptococcus sp. group G and the FOS-JUN heterodimer from Ref [48], and the prion-like domain of TDP-43 from Ref [49], to evaluate the ability of our model to predict the effect of double-site mutants by learning from the data of single-site mutants. In addition, the ability to predict the fitness of higher-order mutants (more than 2 sites) was tested on the dataset from Ref [29], which analyzed the local fitness landscape of the green fluorescent protein from Aequorea victoria (avGFP) by measuring the native function (fluorescence) of tens of thousands of derivative genotypes of avGFP. Detailed information on these datasets is provided in Table 4 of the Supplementary Information.

Prediction of single-site mutation effects. We compared our model with ECNet, ESM-1b, ESM-1v and the MSA Transformer on the DMS datasets. For the supervised models (ECNet and ESM-1b), we performed five-fold cross-validation on these datasets, with 12.5% of each training set randomly selected as the validation set. Spearman correlation was used to evaluate the performance of the different models.
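For reference, the evaluation metric can be computed as below; the prediction and target arrays are placeholders for a held-out test split.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder arrays: predicted and experimentally measured fitness values on a test split.
y_pred = np.array([0.12, 0.85, 0.33, 0.91, 0.40])
y_true = np.array([0.10, 0.80, 0.45, 0.95, 0.30])

rho, _ = spearmanr(y_pred, y_true)   # rank correlation between predictions and measurements
print(f"Spearman correlation: {rho:.3f}")
```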

Prediction of high-order mutation effects. We evaluated the performance for predicting the fitness of high-order mutants using models trained on low-order mutants. The training set for the prediction of double-site mutants contains only the experimental fitness of single-site mutants. The models used to predict the fitness of quadruple mutants of avGFP were trained on single-, double-, and triple-site mutants, and on all three types of mutants together, respectively. In both the double-mutant and quadruple-mutant predictions, we chose 10% of the high-order mutant data as the validation set. The performance of the models was evaluated by Spearman correlation.

Data-augmentation strategy. The data augmentation was conducted by pre-training our model on the results predicted by the unsupervised model. Specifically, we first built a mutant library containing all single-site mutants and 30,000 double-site mutants randomly selected from tens of millions of saturated double-site mutants. Then, we used ESM-IF1 (or ProGen2) to score all of these sequences. The sequence-score pairs were used to pre-train our model, with 90% of the data used as the training set and 10% as the validation set. After that, we finetuned the pre-trained model on single-site mutants from experiment, with the high-order mutants as the test set.
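A schematic of this pipeline is sketched below, with the unsupervised scorer and the training routines abstracted behind placeholder functions (score_with_unsupervised_model, pretrain and finetune are hypothetical names, shown only in comments).

```python
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq, substitutions):
    """Apply a list of (position, new_aa) substitutions to a wildtype sequence."""
    s = list(seq)
    for pos, aa in substitutions:
        s[pos] = aa
    return "".join(s)

def build_mutant_library(wt, n_double=30000, seed=0):
    """All single-site mutants plus n_double randomly sampled double-site mutants."""
    rng = random.Random(seed)
    singles = [mutate(wt, [(i, aa)]) for i in range(len(wt)) for aa in AAS if aa != wt[i]]
    doubles = []
    for _ in range(n_double):
        i, j = rng.sample(range(len(wt)), 2)
        doubles.append(mutate(wt, [(i, rng.choice([a for a in AAS if a != wt[i]])),
                                   (j, rng.choice([a for a in AAS if a != wt[j]]))]))
    return singles + doubles

# Pipeline (placeholder functions stand in for the actual models and trainers):
# library = build_mutant_library(wildtype_sequence)
# pseudo_labels = score_with_unsupervised_model(library)     # e.g. ESM-IF1 or ProGen2 scores
# model = pretrain(sesnet, library, pseudo_labels)           # 90/10 train/validation split
# model = finetune(model, experimental_single_site_mutants)  # a few tens of labelled data points
```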

Training details. SESNet was trained with the Adam optimizer with weight decay (equivalent to an L2 penalty). Hyperparameters of the model were tuned with a local grid search on the validation set. Since conducting 5-fold cross-validation and grid search on 20 datasets is costly, we only searched on two representative datasets: the GFP dataset for the multi-site setting and the RRM dataset for the single-site setting. We obtained the best hyperparameter configuration from these searches and applied it to the other datasets. We tested hidden sizes of [128, 256, 512], learning rates of [1e-3, 5e-4, 1e-4, 5e-5, 1e-5], and dropout rates of [0.1, 0.2, 0.4]. Table 7 in the SI shows the details of the hyperparameter configuration. All experiments were conducted on a GPU server with 10 RTX 3090 GPUs (24 GB VRAM) and 2 Intel Gold 6226R CPUs with 2 TB RAM.
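The grid search over these three hyperparameters can be organized as below; train_and_evaluate is a placeholder to be wired to the actual training loop and is not part of the authors' code.

```python
import itertools

def train_and_evaluate(hidden_size, learning_rate, dropout):
    """Placeholder: train SESNet with this configuration and return the validation Spearman correlation."""
    raise NotImplementedError  # connect to the actual training and evaluation routine

hidden_sizes = [128, 256, 512]
learning_rates = [1e-3, 5e-4, 1e-4, 5e-5, 1e-5]
dropouts = [0.1, 0.2, 0.4]

best_cfg, best_rho = None, float("-inf")
for h, lr, do in itertools.product(hidden_sizes, learning_rates, dropouts):
    rho = train_and_evaluate(hidden_size=h, learning_rate=lr, dropout=do)
    if rho > best_rho:
        best_cfg, best_rho = (h, lr, do), rho
print("Best (hidden size, lr, dropout):", best_cfg, "validation Spearman:", best_rho)
```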
Model comparison. The source code of the ECNet model used for comparison was downloaded from the GitHub repository (https://github.com/luoyunan/ECNet) provided by Ref [17]. The ESM-1b model was also reproduced on our local computers following the architecture described in its publication [6]. The code of ESM-IF1, ESM-1v and the MSA Transformer (ESM-MSA-1b) was obtained from the GitHub repository of Facebook Research (https://github.com/facebookresearch/esm). For each assay, all experiments of the different models were performed on the same dataset.

Ethical Approval
Not applicable

Competing interests
The authors declare no competing interests.

Authors’ contributions
LH and PT designed this project; PT and ML proposed the model; ML, LQ, YX, YGW and GF implemented the method and performed the calculations. All authors read and approved the final manuscript.

Authors' notes

†These authors contributed equally to this work.

*To whom correspondence should be addressed: tpan1039@alumni.sjtu.edu.cn;


hongl3liang@sjtu.edu.cn.

Funding
This work was financially supported by the Natural Science Foundation of China (Grant Nos. 12104295, 11974239, 31630002, 61872094, 61832019), the Innovation Program of Shanghai Municipal Education Commission, and the Shanghai Jiao Tong University Multidisciplinary Research Fund of Medicine and Engineering (YG 2016QN13). The computing hardware resources were supported by the Center for High Performance Computing at Shanghai Jiao Tong University.

Availability of data and materials


Source code for SESNet and all the datasets used in the present work can be found in the supplementary materials. The original sources of the datasets are declared and cited in the main text.
Reference
1. Arnold, F.H., Design by directed evolution. Accounts of chemical research, 1998. 31(3): p. 125-
131.
2. Wu, Z., et al., Machine learning-assisted directed protein evolution with combinatorial libraries.
Proceedings of the National Academy of Sciences, 2019. 116(18): p. 8852-8858.
3. Cui, Y., et al., Computational Redesign of a PETase for Plastic Biodegradation under Ambient
Condition by the GRAPE Strategy. ACS Catalysis, 2021. 11(3): p. 1340-1350.
4. Hie, B., et al., Learning the language of viral evolution and escape. Science, 2021. 371(6526):
p. 284-288.
5. Hie, B.L., K.K. Yang, and P.S. Kim, Evolutionary velocity with protein language models
predicts evolutionary dynamics of diverse proteins. Cell Systems, 2022.
6. Rives, A., et al., Biological structure and function emerge from scaling unsupervised learning
to 250 million protein sequences. Proceedings of the National Academy of Sciences, 2021.
118(15): p. e2016239118.
7. Rao, R., et al., Evaluating protein transfer learning with TAPE. Advances in neural information
processing systems, 2019. 32: p. 9689.
8. Nijkamp, E., et al. ProGen2: Exploring the Boundaries of Protein Language Models. 2022.
9. Meier, J., et al., Language models enable zero-shot prediction of the effects of mutations on
protein function. bioRxiv, 2021: p. 2021.07.09.450648.
10. Elnaggar, A., et al., ProtTrans: towards cracking the language of Life's code through self-
supervised deep learning and high performance computing. arXiv preprint arXiv:2007.06225,
2020.
11. Brandes, N., et al., ProteinBERT: A universal deep-learning model of protein sequence and
function. Bioinformatics, 2022. 38(8): p. 2102-2110.
12. Alley, E.C., et al., Unified rational protein engineering with sequence-based deep
representation learning. Nature methods, 2019. 16(12): p. 1315-1322.
13. Russ, W.P., et al., An evolution-based model for designing chorismate mutase enzymes. Science,
2020. 369(6502): p. 440-445.
14. Riesselman, A.J., J.B. Ingraham, and D.S. Marks, Deep generative models of genetic variation
capture the effects of mutations. Nature Methods, 2018. 15(10): p. 816-822.
15. Rao, R.M., et al., MSA Transformer, in Proceedings of the 38th International Conference on
Machine Learning, M. Marina and Z. Tong, Editors. 2021, PMLR: Proceedings of Machine
Learning Research. p. 8844--8856.
16. Hopf, T.A., et al., Mutation effects predicted from sequence co-variation. Nature Biotechnology,
2017. 35(2): p. 128-135.
17. Luo, Y., et al., ECNet is an evolutionary context-integrated deep learning framework for protein
engineering. Nature Communications, 2021. 12(1): p. 5743.
18. Biswas, S., et al., Low-N protein engineering with data-efficient deep learning. Nature Methods,
2021. 18(4): p. 389-396.
19. Jumper, J., et al., Highly accurate protein structure prediction with AlphaFold. Nature, 2021.
596(7873): p. 583-589.
20. Baek, M., et al., Accurate prediction of protein structures and interactions using a three-track
neural network. Science, 2021. 373(6557): p. 871-+.
21. Varadi, M., et al., AlphaFold Protein Structure Database: massively expanding the structural
coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 2022.
50(D1): p. D439-D444.
22. Zhang, Z., et al., Protein Representation Learning by Geometric Structure Pretraining. arXiv
preprint arXiv:2203.06125, 2022.
23. Hsu, C., et al., Learning inverse folding from millions of predicted structures. bioRxiv, 2022.
24. Lu, H., et al., Machine learning-aided engineering of hydrolases for PET depolymerization.
Nature, 2022. 604(7907): p. 662-667.
25. Wang, Z., et al., LM-GVP: an extensible sequence and structure informed deep learning
framework for protein property prediction. Scientific Reports, 2022. 12(1): p. 6832.
26. Gelman, S., et al., Neural networks to learn protein sequence–function relationships from deep
mutational scanning data. Proceedings of the National Academy of Sciences, 2021. 118(48): p.
e2104878118.
27. Ekeberg, M., et al., Improved contact prediction in proteins: Using pseudolikelihoods to infer
Potts models. Physical Review E, 2013. 87(1): p. 012707.
28. Shroff, R., et al., Discovery of novel gain-of-function mutations guided by structure-based deep
learning. ACS synthetic biology, 2020. 9(11): p. 2927-2935.
29. Sarkisyan, K.S., et al., Local fitness landscape of the green fluorescent protein. Nature, 2016.
533(7603): p. 397-401.
30. Zimmer, M., Green fluorescent protein (GFP): applications, structure, and related
photophysical behavior. Chemical reviews, 2002. 102(3): p. 759-782.
31. Melamed, D., et al., Deep mutational scanning of an RRM domain of the Saccharomyces
cerevisiae poly (A)-binding protein. Rna, 2013. 19(11): p. 1537-1551.
32. Minot, M. and S.T. Reddy, Nucleotide augmentation for machine learning-guided protein
engineering. bioRxiv, 2022: p. 2022.03.08.483422.
33. Hsu, C., et al., Learning protein fitness models from evolutionary and assay-labeled data.
Nature Biotechnology, 2022. 40(7): p. 1114-1122.
34. Aakre, Christopher D., et al., Evolving New Protein-Protein Interaction Specificity through
Promiscuous Intermediates. Cell, 2015. 163(3): p. 594-606.
35. Sinai, S., et al., Generative AAV capsid diversification by latent interpolation. bioRxiv, 2021: p.
2021.04.16.440236.
36. Van der Maaten, L. and G. Hinton, Visualizing data using t-SNE. Journal of machine learning
research, 2008. 9(11).
37. Rao, R.M., et al. Msa transformer. in International Conference on Machine Learning. 2021.
PMLR.
38. Frazer, J., et al., Disease variant prediction with deep generative models of evolutionary data.
Nature, 2021. 599(7883): p. 91-95.
39. Luo, Y., et al., ECNet is an evolutionary context-integrated deep learning framework for protein
engineering. Nature communications, 2021. 12(1): p. 1-14.
40. Steinegger, M., et al., HH-suite3 for fast remote homology detection and deep protein
annotation. BMC bioinformatics, 2019. 20(1): p. 1-15.
41. Hopf, T.A., et al., Sequence co-evolution gives 3D contacts and structures of protein complexes.
elife, 2014. 3: p. e03430.
42. Seemayer, S., M. Gruber, and J. Söding, CCMpred — fast and precise prediction of protein
residue–residue contacts from correlated mutations. Bioinformatics, 2014. 30(21): p. 3128-
3130.
43. Rao, R., et al., Evaluating protein transfer learning with TAPE. Advances in neural information
processing systems, 2019. 32.
44. Meier, J., et al., Language models enable zero-shot prediction of the effects of mutations on
protein function. Advances in Neural Information Processing Systems, 2021. 34: p. 29287-
29303.
45. Devlin, J., et al., Bert: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805, 2018.
46. Ba, J.L., J.R. Kiros, and G.E. Hinton, Layer normalization. arXiv preprint arXiv:1607.06450,
2016.
47. Fukushima, K., Cognitron: A self-organizing multilayered neural network. Biological
cybernetics, 1975. 20(3): p. 121-136.
48. Rollins, N.J., et al., Inferring protein 3D structure from deep mutation scans. Nature genetics,
2019. 51(7): p. 1170-1176.
49. Bolognesi, B., et al., The mutational landscape of a prion-like domain. Nature communications,
2019. 10(1): p. 1-12.
