自然语言处理——蛋白序列

Synthesis
Learning the protein language: Evolution, structure,

and function
Highlights Authors
d Deep protein language models can learn information from Tristan Bepler, Bonnie Berger
protein sequence
Correspondence
d They capture the structure, function, and evolutionary fitness
tbepler@nysbc.org (T.B.),
of sequence variants bab@mit.edu (B.B.)
d They can be enriched with prior knowledge and inform

In brief
function predictions
In this synthesis, Bepler and Berger
d They can revolutionize protein biology by suggesting new discuss recent advances in protein
ways to approach design language modeling and their applications
to downstream protein property
prediction problems. They consider how
these models can be enriched with prior
biological knowledge and introduce an
approach for encoding protein structural
knowledge into the learned
representations.
Bepler & Berger, 2021, Cell Systems 12, 654–669

June 16, 2021 ª 2021 The Authors. Published by Elsevier Inc.
https://doi.org/10.1016/j.cels.2021.05.017 ll
ll
OPEN ACCESS
Synthesis
Learning the protein language:
Evolution, structure, and function
Tristan Bepler1,2,3,* and Bonnie Berger2,4,5,*
1Simons Machine Learning Center, New York Structural Biology Center, New York, NY, USA
2Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
3Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA, USA
4Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
5Lead contact
*Correspondence: tbepler@nysbc.org (T.B.), bab@mit.edu (B.B.)

https://doi.org/10.1016/j.cels.2021.05.017
SUMMARY
Language models have recently emerged as a powerful machine-learning approach for distilling information
from massive protein sequence databases. From readily available sequence data alone, these models
discover evolutionary, structural, and functional organization across protein space. Using language models,
we can encode amino-acid sequences into distributed vector representations that capture their structural
and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss
recent advances in protein language modeling and their applications to downstream protein property predic-
tion problems. We then consider how these models can be enriched with prior biological knowledge and
introduce an approach for encoding protein structural knowledge into the learned representations. The
knowledge distilled by these models allows us to improve downstream function prediction through transfer
learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to
approach protein and therapeutic design. However, further developments are needed to encode strong bio-
logical priors into protein language models and to increase their accessibility to the broader community.
INTRODUCTION about evolutionary processes, but have become increasingly

data driven with the growing amount of available natural
Proteins are molecular machines that carry out the majority of the sequence information.
molecular function of cells. They are composed of linear se- Physics-based approaches use all atom energy functions (Hor-
quences of amino acids which fold into complex ensembles of nak et al., 2006; Hess et al., 2008; Alford et al., 2017) or heuristics
3-dimensional structures, which can range from ordered to designed for proteins (Rohl et al., 2004) to estimate the energy of a
disordered and undergo conformational changes; biochemical given conformation and simulate natural motions. These methods
and cellular functions emerge from protein sequence and are appealing, because they draw on our fundamental under-
structure. Understanding the sequence-structure-function rela- standing of the physics of these systems and generate interpret-
tionship is the central problem of protein biology and is pivotal able hypotheses. The Rosetta tool, which stitches together folded
for understanding disease mechanisms and designing proteins fragments associated with small constant-size contiguous subse-
and drugs for therapeutic and bioengineering applications. quences, has been remarkably successful in its use of free energy
The complexity of the sequence-structure-function relation- estimation for protein folding and design (Leaver-Fay et al., 2011),
ship continues to challenge our computational modeling ability, and molecular dynamics software such as GROMACS are widely
in part because existing tools do not fully realize the potential used for modeling dynamics and fine-grained structure prediction
of the increasing quantity of sequence, structure, and functional (Hess et al., 2008). Statistical sampling approaches have also
information stored in large databases. Until recently, computa- been developed that seek to sample from accessible conforma-
tional methods for analyzing proteins have used either first prin- tions based on coarse grained energy functions (Godzik, Kolinski
ciples-based structural simulations or statistical sequence and Skolnick, 1993; Srinivasan and Rose, 1995; Choi and Pappu,
modeling approaches that seek to identify sequence patterns 2019). Rosetta has been especially successful for solving the
that reflect evolutionary, and therefore functional, pressures design problem by using a mix of structural templates and free en-
(Marks, Hopf and Sander, 2012; Ekeberg et al., 2013; Wang ergy minimization to find sequences that match a target structure.
et al., 2017; Liu et al., 2018; Yang et al., 2020) (Figure 1). Within However, despite Rosetta’s successes, it and similar approaches
these methods, structural analysis has been largely first princi- assume simplified energy models, are extremely computationally
ples driven while sequence analysis methods are primarily based expensive, require expert knowledge to set up correctly, and have
on statistical sequence models, which make strong assumptions limited accuracy.
654 Cell Systems 12, 654–669, June 16, 2021 ª 2021 The Authors. Published by Elsevier Inc.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
ll
Synthesis OPEN ACCESS
Language models have recently emerged as a powerful para-

digm for generative modeling of sequences and as a means to
learn ‘‘content-aware’’ data representations from large-scale
sequence datasets. Statistical language models are probability
distributions over sequences of tokens (e.g., words or characters
in natural language processing, amino acids for proteins). Given
a sequence of tokens, a language model assigns a probability to
the whole sequence. In natural language processing (NLP), lan-
guage models are widely used for machine translation, question-
answering, and information retrieval among other applications.
In biology, profile Hidden Markov Models (HMMs) are simple lan-
guage models that are already widely used for homology
modeling and search. Language models are able to capture
complex dependencies between amino acids and can be trained
on all protein sequences rather than being focused on individual
families; in doing so, they have the potential to push the limits of
statistical sequence modeling. In bringing these models to
biology, we now not only have the ability to learn from naturally
observed sequences, including across all of known sequence
space (Alley et al., 2019; Bepler and Berger, 2019), but are also
able to incorporate existing structural and functional knowledge
Figure 1. Two-dimensional schematic of some recent and classical through multi-task learning. (Box 1 provides a glossary of terms
methods in protein sequence and structure analysis, characterized that might be less familiar.) Language models learn the probabil-
by the extent to which the approach is motivated by first principles ity of a sequence occurring and this can be directly applied to
(strong biological priors) versus driven by big data
predict the fitness of sequence mutations (Riesselman, Ingra-
We color methods by types of input-output pairs. Green: sequence-sequence,
purple: sequence-structure, blue: structure-sequence, orange: structure- ham and Marks, 2018; Hie et al., 2020a, 2021). They also learn
structure. Classical methods tend to be more strongly first principles driven summary representations, powerful features that can be used
while newer methods are increasingly data driven. Existing methods tend to be to better capture sequence relationships and link sequence to
either data driven or first principles-based with few methods existing in be- function via transfer learning (Alley et al., 2019; Bepler and
tween. Note that, at this time, details of AlphaFold2 have not been made Berger, 2019; Rao et al., 2019; Rives et al., 2019; Hie et al.,
public, so placement in Figure 1 is a rough estimate. Some methods, especially
2020b; Luo et al., 2020). Finally, language models also offer
Rosetta, can perform multiple functions.
the potential for controlled sequence generation by conditioning
the language model on structural (Ingraham et al., 2019a) or
At the other end of the spectrum, statistical sequence models functional (Madani et al., 2020) specifications.
have proven extremely useful for modeling the amino acid se- Deep language models are an exciting breakthrough in protein
quences of related sets of proteins. These methods allow us to sequence modeling, allowing us to discover aspects of structure
discover constraints on amino acids imposed by evolutionary and function from only the evolutionary relationships present in a
pressures and are widely used for homology search (Altschul corpus of sequences. However, the full potential of these models
and Koonin, 1998; Bateman et al., 2004; Rohl et al., 2004; Finn, has not been realized as they continue to benefit from more pa-
Clements and Eddy, 2011; Remmert et al., 2011a) and for predict- rameters, more compute power, and more data. At the same
ing residue-residue contacts in the 3D protein structure using time, these models can be enriched with strong biological priors
covariation between amino acids at pairs of positions in the through multi-task learning.
sequence (coevolution) (Göbel et al., 1994; Berger, 1995; Berger Here, we propose that methods incorporating both large data-
et al., 1995; Wolf, Kim and Berger, 1997; McDonnell et al., 2006; sets and strong domain knowledge will be key to unlocking the
Trigg et al., 2011; Marks, Hopf and Sander, 2012; de Juan, Pazos full potential of protein sequence modeling. Specifically, physical
and Valencia, 2013; Ekeberg et al., 2013). Advances in protein structure-based priors can be learned through structure supervi-
structure prediction have been driven by building increasingly sion while also learning evolutionary relationships from hundreds
large deep learning systems to predict residue-residue distances of millions of natural protein sequences. Furthermore, the evolu-
from sequence families (Liu et al., 2018; Xu and Wang, 2019) and tionary and structural relationships encoded allow us to learn
fold proteins based on the predicted distance constraints which functional properties of proteins through transfer learning. In
culminated recently in the success of AlphaFold2 at the Critical this synergy, we will discuss these developments and present
Assessment of protein Structure Prediction (CASP) 14 competi- new results toward enriching large-scale language models with
tion (Jumper et al., 2020). These methods rely on large datasets structure-based priors through multi-task learning. First, we
of protein sequences that are similar enough to be aligned with will discuss new developments in deep learning and language
high confidence but contain enough divergence to confidently modeling and their application to protein sequence modeling
infer statistical couplings between positions. Accordingly, they with large datasets. Second, we will discuss how we can enrich
are unable to learn patterns across large-scale databases of these models with structure supervision. Third, we will discuss
possibly unrelated proteins and have limited ability to draw on transfer learning and demonstrate that the evolutionary and
the increasing structure and function information available. structural information encoded in our deep language models
Cell Systems 12, 654–669, June 16, 2021 655

ll
OPEN ACCESS Synthesis
Box 1. Glossary
1-hot [embedding]. Vector representation of a discrete variable commonly used for discrete values that have no meaningful
ordering. Each token is transformed into a V-dimensional zero vector, where V is the size of the vocabulary (the number of unique
tokens, e.g., 20, 21, or 26 for amino acids depending on inclusion of missing and non-canonical amino acid tokens), except for the
index representing the token, which is set to one.
autoregressive [language model]. Language models that factorize the probability of a sequence into a product of conditional
Q
probabilities in which the probability of each token is conditioned on the preceding tokens, pðx1 ::: xL Þ = Li= 1 pðxi jx1 :::xi1 Þ. Ex-
amples of autoregressive language models include k-mer (AKA n-gram) models, Hidden Markov Models, and typical autoregres-
sive recurrent neural network or generative transformer language models. These models are called autoregressive because they
model the probability of one token after another in order.
Bayesian methods. A statistical inference approach that uses Bayes rule to infer a posterior distribution over model parameters
given by the observed data. Because these methods describe distributions over parameters or functions, they are especially useful
in small data regimes or other settings when prediction uncertainties are desirable.
cloze task. A task in natural language processing, also known as the cloze test. The task is to fill in missing words given the
context. For example, ‘‘The quick brown ____ jumps over the lazy dog.’’
conditional random field. Models the probability of a set (sequence in this case, i.e. linear chain CRF) of labels given a set of input
variables by factorizing it into locally conditioned potentials conditioned on the input variables, pðy1 ::: yL j x1 ::: xL Þ =
Q
pðy1 j x1 ::: xL Þ Li= 2 pðyi j yi1 ; x1 ::: xL Þ. This is often simplified such that each conditional only depends on the local input variable,
Q
i.e., pðy1 ::: yL j x1 ::: xL Þ = pðy1 j x1 Þ Li= 2 pðyi j yi1 ; xi Þ. Linear chain CRFs can be seen as the discriminative version of Hidden Mar-
kov Models.
contextual vector embedding. Vector embeddings that include information about the sequence context in which a token occurs.
Encoding context into vector embeddings is important in NLP, because words can have different meanings in different contexts
(i.e. many homonyms exist). For example, in the sentences, ‘‘she tied the ribbon into a bow’’ and ‘‘she drew back the string on her
bow,’’ the word bow refers to two different objects that can only be inferred from context. In the case of proteins, this problem is
even worse, because there are only 20 (canonical) amino acids and so their ‘‘meaning’’ is highly context dependent. This is in
contrast to typical vector embedding methods that learn a single vector embedding per token regardless of context.
distributional hypothesis. The observation that words that occur in similar contexts tend to have similar meanings. Applies also to
proteins due to evolutionary pressure (Harris, 1954).
Gaussian process. A class of models that describes distributions over functions conditioned on observations from those func-
tions. Gaussian processes model outputs as being jointly normally distributed where the covariance between the outputs is a func-
tion of the input features. See Rasmussen and Williams for a comprehensive overview (Rasmussen and Williams, 2005)
generative model. A model of the data distribution, pðXÞ, joint data distribution, pðX;YÞ, or conditional data distribution, pðXjY =
yÞ. Usually framed in contrast to discriminative models that model the probability of the target given an observation, pðYjX = xÞ.
Here, Xis observable, for example the protein sequence, and Yis a target that is not observed, for example the protein structure or
function. Conditional generative and discriminative models are related by Bayes’ theorem. Language models are generative
models.
hidden layer. Intermediate vector representations in a deep neural network. Deep neural networks are structured as layered data
transformations before outputting a final prediction. The intermediate layers are referred to as ‘‘hidden’’ layers.
inductive bias. Describes the assumptions that a model uses to make predictions for data points it has not seen (Mitchell, 1980).
That is, the inductive bias of a model is how that model generalizes to new data. Every machine learning model has inductive
biases, implicitly or explicitly. For example, protein phenotype prediction based on homology assumes that phenotypes covary
over evolutionary relatedness. In other words, it formally models the idea that proteins that are more evolutionarily related are likely
to share the same function. In thinking about deep neural networks applied to proteins, it is important to understand the inductive
biases these models assume, because it naturally relates to the true properties of the function we are trying to model. However, this
is challenging, because we can only roughly describe the inductive biases of these models (Battaglia, Hamrick and Bapst, 2018).
language model. Probabilistic model of whole sequences. In the case of natural language, language models typically describe the
probability of sentences or documents. In the case of proteins, they model the probability of amino acid sequences. Being simply
probabilistic models, language models can take on many specific incarnations from column frequencies in multiple sequence
alignments to Hidden Markov Models to Potts models (direct coupling analysis) to deep neural networks.
manifold embedding. A distance preserving, low dimensional embedding of the data. The goal of manifold embedding is to find
points low dimensional vectors, z1 ::: zn , such that the distances, dðzi ; zj Þ , are as close as possible to the distances in the original
data space, dðxi ; xj Þ, given n high dimensional data vectors, x1 ::: xn . t-SNE is a commonly used manifold embedding approach for
visualization of high dimensional data.
masked language model. The training task used by BERT and other recent bidirectional language models. Instead of modeling
the probability of a sequence autoregressively, masked language models seek to model the probability of each token given all
other tokens. For computational convenience, this is achieved by randomly masking some percentage of the tokens in each
(Continued on next page)
656 Cell Systems 12, 654–669, June 16, 2021

ll
Box 1. Continued
minibatch and training the model to recover those tokens. An auxiliary token is added to the vocabulary to indicate that this token
has been masked.
multi-task learning. A machine learning paradigm in which multiple tasks are learned simultaneously. The idea is that similarities
between tasks can lead to each task being learned better in combination rather than learning each individually. In the case of rep-
resentation learning, multi-task learning can also be useful for learning representations that encode information relevant for all
tasks. Multi-task learning allows us to use the signals encoded in other training signals as an inductive bias when learning the
goal task.
representation learning. The problem of learning features, or intermediate data representations, better suited for solving a pre-
diction problem on raw data. Deep learning systems are described as representation learning systems, because they learn a series
of data transformations that make the goal task progressively easier to solve before outputting a prediction.
residue-residue contact prediction. The task of learning which amino acid residues are in contact in folded protein structures,
where contact is assumed to be within a small number of angstroms, often with the goal of constraining the search space for pro-
tein structure prediction.
self-supervised learning. A relatively new term for methods for learning from data without labels. Generally used to describe
methods that ‘‘automatically’’ create labels through data augmentation or generative modeling. Can be viewed as a subset of un-
supervised learning focused on learning representations useful for transfer learning.
semantic priors. Prior semantic understanding of a word or token, e.g., protein structure or function.
semantics. The meaning of a word or token. In reference to proteins, we use semantics to mean the ‘‘functional’’ purpose of a
residue, or combinations of residues.
structural classification of proteins (SCOP). A mostly manual curation of structural domains based on similarities of their se-
quences and structure. Similar databases include CATH (Sillitoe et al., 2021).
structural similarity prediction. Given two protein sequences, predict how similar their respective structures would be according
to some similarity measure.
supervised learning. A problem in machine learning. How we can learn a function to predict a target variable, usually denoted y,
given an observed one, usually denoted x, from a set of known x, y pairs.
transfer learning. A problem in machine learning. How we can take knowledge learned from one task and apply it to solve another
related task. When the tasks are different but related, representations learned on one task can be applied to the other. For example,
representations learned from recognizing dogs could be transferred to recognizing cats. In the case of proteins and language
models, we are interested in applying knowledge gained from learning to generate sequences to predicting function. Transfer
learning could also be applied to applying representations learned from predicting structure to function or from predicting one
function to another function among other applications.
unsupervised learning. A problem in machine learning that asks how we can learn patterns from unlabeled data. Clustering is a
classic unsupervised learning problem. Unsupervised learning is often formulated as a generative modeling problem, where we
view the data as being generated from some unobserved latent variable(s) that we infer jointly with the parameters of the model.
vector embedding. A term used to describe multidimensional real numbered representations of data that is usually discrete or
high dimensional, word embeddings being a classic example. Sometimes referred to as ‘‘distributed vector embeddings’’ or
‘‘manifold embeddings’’ or simply just ‘‘embeddings.’’ Low-dimensional vector representations of high dimensional data such
as images or gene expression vectors as found by methods such as t-SNE are also vector embeddings. Usually, the goal in
learning vector embeddings is to capture some semantic similarity between data as a function of similarity or distance in the vector
embedding space.
can be used to improve protein function prediction. Finally, we properties are often subject to evolutionary pressures because
will discuss future directions in protein machine learning and these functions must be maintained or enhanced in order for an
large-scale language modeling. organism to survive and reproduce. These pressures manifest
in the distribution over amino acids present in natural protein se-
Protein language models distill information from quences and, hence, are discoverable from large and diverse
massive protein sequence databases enough sets of naturally occurring sequences.
Language models for protein sequence representation learning The ability to learn semantics emerges from the distributional
(Figure 2) have seen a surge of interest following the success of hypothesis: tokens (e.g., words, amino acids) that occur in
large-scale models in the field of natural language processing similar contexts tend to carry similar meanings. Language
(NLP). These models draw on the idea that distributed vector rep- models only require sequences to be observed and are trained
resentations of proteins can be extracted from generative models to model the probability distribution over amino acids using an
of protein sequences, learned from a large and diverse database autoregressive formulation (Figures 2A and 2B) or masked posi-
of sequences across natural protein space, and thus can capture tion prediction formulation (also called a cloze task in NLP, Fig-
the semantics, or function, of a given sequence. Here, function re- ure 2C). In autoregressive language models, the probability of
fers to any and all properties related to what a protein does. These a sequence is factorized such that the probability of each token

ll
Figure 2. Diagram of model architectures and language modeling approaches

(A) Language models model the probability of sequences. Typically, this distribution is factorized over the sequence such that the probability of a token (e.g.,
amino acid) at position i (xi) is conditioned on the previous tokens. In neural language models, this is achieved by first computing a hidden layer (hi) given by the
sequence up to position i-1 and then calculating the probability distribution over token xi given hi. In this example sequence, ‘‘^’’ and ‘‘$’’ represent start and stop
tokens respectively and the sequence has length L.
(B) Bidirectional language models instead model the probability of a token conditioned on the previous and following tokens independently. For each token xi, we
compute a hidden layer using separate forward and reverse direction models. These hidden layers are then used to calculate the probability distribution over
tokens at position i conditioned on all other tokens in the sequence. This allows us to extract representations that capture complete sequence context.
(C) Masked language models model the probability of tokens at each position conditioned on all other tokens in the sequence by replacing the token at each
position with an extra ‘‘mask’’ token (‘‘X’’). In these models, the hidden layer at each position is calculated from all tokens in the sequence which allows the model
to capture conditional non-independence between tokens on either side of the masked token. This formulation lends itself well to transfer learning, because the
representations can depend on the full context of each token.
is conditioned only on the preceding tokens. This factorization is Recent advances in NLP have been driven by innovations in
exact and is useful when sampling from the distribution or neural network architectures, new training approaches,
evaluating the probabilities themselves is of primary interest. increasing compute power, and increasing accessibility of
The drawback to this formulation is that the representations huge text corpuses. Several NLP methods have been proposed
learned for each position depend only on preceding positions, that draw on unsupervised, now often called self-supervised,
potentially making them less useful as contextual representa- learning (Devlin et al., 2018; Peters et al., 2018) to fit large-scale
tions. The masked position prediction formulation (also known bidirectional long-short term recurrent neural networks (bidirec-
as masked language modeling) addresses this problem by tional LSTMs or biLSTMs) (Hochreiter and Schmidhuber, 1997;
considering the probability distribution over each token at each Graves, Fernández and Schmidhuber, 2005) or Transformers
position conditioned on all other tokens in the sequence. The (Vaswani et al., 2017) and its recent variants. LSTMs are recur-
masked language modeling approach does not allow calculating rent neural networks. These models process sequences one to-
correctly normalized probabilities of whole sequences but is ken at a time in order and therefore learn representations that
more appropriate when the learned representations are the out- capture information from a position and all previous positions.
comes of primary interest. The unprecedented recent success of In order to include information from tokens before and after
language models in natural language processing, e.g.Google’s any given position, bidirectional LSTMs combine two separate
BERT and OpenAI’s GTP-3, is largely driven by their ability to LSTMs operating in the forward and backward directions in
learn from billions of text entries in enormous online corpora. each layer (e.g., as in Figure 2B). Although these models can
Analogously, we have natural protein sequence databases with learn representations including whole sequence context, their
100 s of millions of unique sequences that continue to grow ability to learn distant dependencies is limited in practice. To
rapidly. address this limitation, transformers learn representations by

ll
explicitly calculating an attention vector over each position in the describing the empirical distribution over naturally occurring se-
sequence. In the self-attention mechanism, the representation quences has shown promise for predicting the fitness of
for each position is learned by ‘‘attending to’’ each position of sequence variants (Riesselman, Ingraham and Marks, 2018;
the same sequence, well suited for masked language modeling Hie et al., 2020a, 2021). Because these models learn from evolu-
(Figure 2C). In a self-attention module, the output representation tionary data directly, they can make accurate predictions about
of each element of a sequence is calculated as a weighted sum protein function when function is reflected in the fitness of natural
over transformations of the input representations at each posi- sequences. Riesselman et al. first demonstrated that language
tion where the weighting itself is based on a learned transforma- models fit on individual protein families are surprisingly accurate
tion of the inputs. The attention mechanism is typically believed predictors of variant fitness measured in deep mutational scan-
to allow transformers to learn dependencies between positions ning datasets (Riesselman, Ingraham and Marks, 2018). New
distant in the linear sequence more easily. Transformers are work has since shown that the representations learned by lan-
also useful as autoregressive language models. guage models are also powerful features for learning of variant
In natural language processing, Peters et al. recognized that fitness as a subsequent supervised learning task (Rives et al.,
the hidden layers (intermediate representations of stack neural 2019; Luo et al., 2020), building on earlier observations that lan-
networks) of biLSTMs encoded semantic meaning of words in guage models can improve protein property prediction through
context. This observation has been newly leveraged for biolog- transfer learning (Bepler and Berger, 2019). Recently, Hie et al.
ical sequence analysis (Alley et al., 2019; Bepler and Berger, used language models to learn evolutionary fitness of viral enve-
2019) to learn more semantically meaningful sequence represen- lope proteins and were able to predict mutations that could allow
tations. The success of deep transformers for machine transla- the SARS-CoV-2 spike protein to escape neutralizing antibodies
tion inspired their application to contextual text embedding, (Hie et al., 2020a, 2021). As of publication, several variants pre-
that is learning contextual vector embeddings of words and sen- dicted to have high escape potential have appeared in SARS-
tences, giving rise to the now widely used Bidirectional Encoder CoV-2 sequencing efforts around the world, but viral escape
Representations from Transformers (BERT) model in NLP (Devlin has not yet been experimentally verified (Walensky et al., 2021).
et al., 2018). BERT is a deep transformer trained as a masked A few recent works have focused on increasing the scale of
language model on a large text corpus. As a result, it learns these models by adding more parameters and more learnable
contextual representations of text that capture contextual mean- layers to improve sequence modeling. Interestingly, because
ing and improve the accuracy of downstream NLP systems. so many sequences are available, these models continue to
Transformers have also demonstrated impressive performance benefit from increased size (Rives et al., 2019). This parallels
as autoregressive language models, for example with the Gener- the general trend in natural language processing, where the
ative Pre-trained Transformer (GPT) family of models (Radford number of parameters, rather than specific architectural
et al., 2018, 2019; Brown et al., 2020), which have made impres- choices, is the best indicator of model performance (Kaplan
sive strides in natural language generation. These works have et al., 2020). However, ultimately, model size is limited by the
inspired subsequent applications to protein sequences (Rao computational resources available to train and apply these
et al., 2019; Rives et al., 2019; Elnaggar et al., 2020; Vig models. In NLP, models such as BERT and GPT-3 have become
et al., 2020). so large that only the best funded organizations with massive
Although transformers are powerful models, they require enor- Graphics Processing Unit (GPU) compute clusters are realisti-
mous numbers of parameters and train more slowly than typical cally able to train and deploy them. This is demonstrated in
recurrent neural networks. With massive scale datasets and some recent work on protein models where single transformer-
compute and time budgets, transformers can achieve impres- based models were trained for days to weeks on hundreds of
sive results, but, generally, recurrent neural networks (e.g., GPUs (Rives et al., 2019; Elnaggar et al., 2020; Vig et al.,
biLSTMs) need less training data and less compute, so might 2020), costing potentially 100 s of thousands of dollars for
be more suitable for problems where fewer sequences are avail- training. Increasing the scale of these models promises to
able, such as training on individual protein families, or compute continue to improve our ability to model proteins, but more
budgets are tight. Constructing language models that achieve resource efficient algorithms are needed to make these models
high accuracy with better compute efficiency is an algorithmic more accessible to the broader scientific community.
challenge for the field. An advantage of general purpose pre- So far, the language models we have discussed use natural
trained protein models is that we only need to do the expensive protein sequence information. However, they do not learn from
training step once; the models can then be used to make predic- the protein structure and function knowledge that has been
tions or can be applied to new problems via transfer learning accumulated over the past decades of protein research. Incor-
(Bengio, 2012), as discussed below. porating such knowledge requires supervised approaches.
Using these and other tools, protein language models are able
to synthesize the enormous quantity of known protein se- Supervision encodes biological meaning
quences by training on 100 s of millions of sequences stored in Proteins are more than sequences of characters: they are phys-
protein databases (e.g., UniProt, Pfam, NCBI (Bateman et al., ical chains of amino acids that fold into three-dimensional struc-
2004; Pruitt, Tatusova and Maglott, 2007; UniProt Consortium, tures and carry out functions based on those structures. The
2019)). The distribution over sequences learned by language sequence-structure-function relationship is the central pillar of
models captures the evolutionary fitness landscape of known protein biology and significant time and effort has been spent
proteins. When trained on tens of thousands of evolutionarily to elucidate this relationship for select proteins of interest. In
related proteins, the learned probability mass function particular, the increasing throughput and ease-of-use of protein

ll
structure determination methods, (e.g., X-ray crystallography models alone. Supervision that represents known protein struc-
and cryo-EM (Cheng et al., 2015; Callaway, 2020)), has driven ture, function, and other prior knowledge may be necessary to
a rapid increase in the number of known protein structures avail- encode distant sequence relationships into learned embed-
able in databases such as the Protein Data Bank (PDB) (Berman dings. By analogy, cars and boats are both means of transporta-
et al., 2000). There are nearly 175,000 entries in PDB as of tion, but we would not expect a generative image model to infer
publication and this number is growing rapidly. 14,000 new this relationship from still images alone. However, we can teach
structures were deposited in 2020 and the rate of new structure these relationships through supervision.
deposition is increasing. We pursue the intuition that incorpo- On this premise, we hypothesize that incorporating structural
rating such knowledge into our models via supervised learning supervision when training protein language models will improve
can aid in predicting function from sequence, bypassing the the ability to predict function in downstream tasks through trans-
need for solved structures. fer learning. Eventually, such language models may become
Supervised learning is the problem of finding a mathematical powerful enough that we can predict function directly without
function to predict a target variable given some observed vari- the need for solved structures. In the remainder of this Synthesis,
ables. In the case of proteins, supervised learning is commonly we will explore this idea.
used to predict protein structure from sequence, protein function
from sequence, or for other sequence annotation problems (e.g., Multi-task language models capture the semantic
signal peptide or transmembrane region annotation). Beyond organization of proteins
making predictions, supervised learning can be used to encode Here, we will demonstrate that training protein language models
specific semantics into learned representations. This is common with self-supervision on a large amount of natural sequence data
in computer vision where, for example, pre-training image recog- and with structure supervision on a smaller set of sequence,
nition models on the large ImageNet dataset is used to prime the structure pairs enriches the learned representations and trans-
model with information from natural image categories (Russa- lates into improvements in downstream prediction problems
kovsky et al., 2015). (Figure 3). First, we generate a dataset that contains 76 million
When we use supervised approaches, we encode semantic protein sequences from Uniref (Suzek et al., 2007) and an addi-
priors into our models. These priors are important for learning re- tional 28,000 protein sequences with structures from the Struc-
lationships that are not obvious from the raw data. For example, tural Classification of Proteins (SCOP) database, which classifies
unrelated protein sequences can form the same structural fold protein sequences into a hierarchy of structural motifs based on
and, therefore, are semantically similar. However, we cannot their sequence and structural similarities (e.g., family, super-
deduce this relationship from sequences alone. Supervision is family, class) (Fox, Brenner and Chandonia, 2014; Chandonia,
required to learn that these sequences belong to the same se- Fox and Brenner, 2017). Next, we train a bidirectional LSTM
mantic category. Although structure is more informative of func- with three learning tasks simultaneously: 1) the masked lan-
tion than sequence (Zhang and Kim, 2003; Shin et al., 2007) and guage modeling task (Figures 2C and 3A), 2) residue-residue
structure is encoded by sequence, predicting structure remains contact prediction (Figure 3B), and 3) structural similarity predic-
hard, particularly due to the relative paucity of structural relative tion (Figure 3C).
to sequence data. Significant strides have been made recently The fundamental idea behind this novel training scheme is to
with massive computing resources (Jumper et al., 2020); yet combine self-supervised and supervised learning approaches
there is still a long way to go before a complete sequence to to overcome the shortcomings of each. Specifically, the masked
structure mapping is possible. The degree to which such a language modeling objective (self-supervision) allows us to learn
map could or should be possible, even in principle, is unclear. from millions of natural protein sequences from the Uniprot data-
Evolutionary relationships between sequences are informative base. However, this does not include any prior semantic knowl-
of structural and functional relationships, but only when the de- edge from protein structure and, therefore, has difficulty learning
gree of sequence homology is sufficiently high. Above 30% semantic similarity between divergent sequences. To address
sequence identity, structure and function are usually conserved this, we consider two structural supervision tasks, residue-resi-
between natural proteins (Rost, 1999). Often called the ‘‘twilight due contact prediction and structural similarity prediction,
zone’’ of protein sequence homology, proteins with similar struc- trained with tens of thousands of protein structures classified
tures and functions still exist below this level, but they can no by SCOP. In the residue-residue contact prediction task, we
longer be detected from sequence similarity alone and it is un- use the hidden layers of the language model to predict contacts
clear whether their functions are conserved. Although it is gener- between residues within the 3D structure using a learned bilinear
ally believed that proteins with similar sequences form similar projection layer (Figure 3B). In the structural similarity prediction
structures, there are also interesting examples of highly similar task, we use the hidden layers of the language model to predict
protein sequences having radically different structures and func- the number of shared structural levels in the SCOP hierarchy by
tions (Kosloff and Kolodny, 2008; Wei et al., 2020) and of se- aligning the proteins in vector embedding space and using this
quences that can form multiple folds (James and Tawfik, alignment score to predict structural similarity from the
2003). Evolutionary innovation requires that protein function sequence embeddings. This task is critical for encoding struc-
can change with only a few mutations. Furthemore, it is impor- tural relationships between unrelated sequences into the model.
tant to note that although structure and function are related, The parameters of the language model are shared across the
they should not be directly conflated. self-supervised and two supervised tasks and the entire model
These phenomena suggest that there are aspects of protein is trained end-to-end. The set of proteins with known structure
biology that may not be discoverable by statistical sequence is much smaller than the full set of known proteins in Uniprot

ll
Figure 3. Our multi-task contextual embed-

ding model learning framework
We train a neural network (NN) sequence encoder to
solve three tasks simultaneously. The first task is
masked language modeling on millions of natural
protein sequences. We include two sources of
structural supervision in a multi-task framework
(MT-LSTM for Multi-Task LSTM) in order to encode
structural semantics directly into the representa-
tions learned by our language model. We combine
this with the masked language model objective to
benefit from evolutionary and less available struc-
ture information (only 10 s of thousands of proteins).
(A) The masked language model objective allows us
to learn contextual embeddings from hundreds of
millions of sequences. Our training framework is
agnostic to the NN architecture, but we specifically
use a three layer bidirectional LSTM with skip con-
nections (inset box) in this work in order to capture
long range dependencies but train quickly. We can
train language models using only this objective
(DLM-LSTM), but can also enrich the model with
structural supervision.
(B) The first structure task is predicting contacts
between residues in protein structures using a
bilinear projection of the learned embeddings. In this
task, The hidden layer representations of the lan-
guage model are then used to predict residue-res-
idue contacts using a bilinear projection. That is, we
model the log likelihood ratio of a contact between
the i-th and j-th residues in the protein sequence, by
zi Wzj + bwhere matrix W and scalar b are learned
parameters.
(C) The second source of structural supervision is
structural similarity, defined by the Structural Classification of Proteins (SCOP) hierarchy (Hubbard et al., 1997). We predict the ordinal levels of similarity between
pairs of proteins by aligning the sequences in embedding space. Here, we embed the query and target sequences using the language model (Z1 and Z2) and then
predict the structural homology by calculating the pairwise distances between the query and target embeddings (di,j) and aligning the sequences based on these
distances.
and, therefore, by combining these tasks in a multi-task learning with the masked language modeling objective (DLM-LSTM),
approach we can learn language models and sequence repre- which is not enriched with the structure-based priors. We
sentations that are enriched with strong biological priors from observe that even though the DLM-LSTM model was trained us-
known protein structures. We refer to this model as the multi- ing only sequence information, protein sequences still organize
task (MT)-LSTM. roughly by structure in embedding space. However, this organi-
Next, we demonstrate how the trained language model can be zation is improved when we include structure supervision in the
used for protein sequence analysis and compare this with con- language model training (Figure 4B).
ventional approaches. Given the trained MT-LSTM, we apply it The semantic organization of our learned embedding space
to new protein sequences to embed them into the learned se- enables a direct application: we can search protein sequence
mantic representation space (Figure 4A). Sequences are fed databases for semantically related proteins by comparing pro-
through the model and the hidden layer vectors are combined teins based on their vector embeddings (Bepler and Berger,
to form vector embeddings of each position of the sequences. 2019). Because we embed sequences into a semantic represen-
Given a sequence of length L, this yields L D-dimensional vec- tation space, we can find structurally related proteins even
tors, where D is the dimension of the vector embeddings. This although their sequences are not closely related (Figure 4C, Table
lows us to map the semantic space of each residue within a S1). To demonstrate this, we take pairs of proteins in the SCOP
sequence, but we can also map the semantic space of whole se- database, not seen by our multi-task model during training, and
quences by summarizing them into fixed size vector embed- calculate the similarity between these pairs of sequences using
dings via a reduction operation. Practically, this is useful for direct sequence homology-based methods (Needleman-
coarse sequence comparisons including clustering and manifold Wunsch alignment, HMM-sequence alignment, and HMM-
embedding for visualization of large protein datasets, revealing HMM alignment (Needleman and Wunsch, 1970; Eddy, 2011;
evolutionary, structural, and functional relationships between se- Remmert et al., 2011b)), a popular structure-based method
quences in the dataset (Figure 4B). In this figure, we visualize (TMalign (Zhang and Skolnick, 2005)), and an alignment between
proteins in the SCOP dataset, colored by structural class, after the sequences in our learned embedding space. We then eval-
embedding with our MT-LSTM. For comparison, we also show uate these methods based on their ability to correctly find pairs
results of embedding using a bidirectional LSTM trained only of proteins that are similar at the class, fold, superfamily, and

ll
Figure 4. Language models capture the semantic organization of proteins

(A) Given a trained language model, we embed sequences by processing them with the neural network and taking the hidden layer representations for each
position of the sequence. This gives an LxD matrix containing a D-dimensional vector embedding for each position of a length L sequence. We can reduce this to a
D-dimensional vector ‘‘summarizing’’ the entire sequence by a pooling operation. Specifically, we use averaging here. These representations allow us to directly
visualize large protein datasets with manifold embedding techniques.
(B) Manifold embedding of SCOP protein sequences reveals that our language models learn protein sequence representations that capture structural semantics
of proteins. We embed thousands of protein sequences from the SCOP database and show t-SNE plots of the embedded proteins colored by SCOP structural
class. The masked language (unsupervised) model (DLM-LSTM) learns embeddings that separate protein sequences by structural class, whereas the multi-task
language model (MT-LSTM) with structural supervision learns an even better organized embedding space. In contrast, manifold embedding of sequences directly
(edit distance) produces an unintelligible mash and does not resolve structural groupings of proteins.
(C) In order to quantitatively evaluate the quality of the learned semantic embeddings, we calculate the correspondence between semantic similarity predicted by
our language model representations and ground truth structural similarities between proteins in the SCOP database. Given two proteins, we calculate the se-
mantic similarity between them by embedding these proteins using our MT-LSTM, align the proteins using the embeddings, and calculate an alignment score. We
compute the average-precision score for retrieving pairs of proteins similar at different structural levels in the SCOP hierarchy based on this predicted semantic
similarity and find that our semantic similarity score dramatically outperforms other direct sequence comparison methods for predicting protein similarity.
(legend continued on next page)

ll
family level, based on their SCOP classification. We find that our find that increasing model size improves transfer learning
learned semantic embeddings dramatically outperform the performance.
sequence comparison methods and even outperform structure Here, we demonstrate two use cases where transfer learning
comparison with TMalign when predicting structural similarity. from our MT-LSTM improves performance on downstream
Interestingly, we observe that the structural supervision compo- tasks. First, we consider the problem of transmembrane predic-
nent is critical for learning well organized embeddings at a fine- tion. This is a sequence labeling task in which we are provided
grained level, because the DLM-LSTM representations alone do with the amino acid sequence of a protein and wish to decode,
not perform well at this task (Table S1). Furthermore, the multi- for each position of the protein, whether that position is in a
task learning approach outperforms a two-step learning transmembrane (i.e., membrane spanning) region of the protein
approach presented previously (SSA-LSTM) (Bepler and Berger, or not. This problem is complicated by the presence of signal
2019). peptides, which are often confused as transmembrane regions.
With the success of our self-supervised and supervised lan- In order to compare different sequence representations for
guage models, we sought to investigate whether protein lan- this problem, we train a small one layer bidirectional LSTM
guage models could improve function prediction through trans- with a conditional random field (BiLSTM+CRF) decoder on a
fer learning. well-defined transmembrane protein benchmark dataset (Tsiri-
gos et al., 2015a). Methods are compared by 10-fold cross
Transfer learning improves downstream applications validation. We find that the BiLSTM+CRFs with our new embed-
A key challenge in biology is that many problems are small data dings (DLM-LSTM and MT-LSTM) outperform existing trans-
problems. Quantitative protein characterization assays are rarely membrane predictors and a BiLSTM+CRF using our previous
high throughput and methods are needed that can generalize smaller embedding model (SSA-LSTM). Furthermore, represen-
given only 10 s to 100 s of experimental measurements. Further- tations learned by our MT-LSTM model significantly outperform
more, we are often interested in extrapolating from data (paired t test, p = 0.044) the embeddings learned by our DLM-
collected over a small region of protein sequence space to other LSTM model on this application (Figure 5B).
sequences, often with little to no homology. Learned protein rep- Second, we demonstrate that we are able to accurately pre-
resentations improve predictive ability for downstream predic- dict functional implications of small changes in protein sequence
tion problems through transfer learning (Figure 5A). Transfer through transfer learning. An ideal model would be sensitive
learning is the problem of applying knowledge learned from solv- down to the single amino acid level and would group mutations
ing some prior tasks to a different task of interest. In other words, with similar functional outcomes closely in semantic space.
learning to solve task A can help learn to solve task B; analo- Recently, Luo et al. presented a method for combining language
gously, learning how to wax cars helps to learn karate moves model-based representations with local evolutionary context-
(Karate Kid, 1984). This is especially useful for tasks with little based representations (ECNet) and demonstrated that these
available training data, such as protein function prediction, representations were powerful for sequence-to-phenotype
because models can be pre-trained on other tasks with plentiful mapping on a panel of deep mutational scanning datasets (Luo
training data to improve performance through transfer learning. et al., 2020). In this problem, we observe a relatively small set
Application of protein language models to downstream tasks (100 s-1000s) of sequence-phenotype measurement pairs and
through transfer learning was first demonstrated by Bepler and our goal is to predict phenotypes for unmeasured variants.
Berger (2019). They showed that transfer learning was useful Observing that these are small data problems, we reasoned
for structural similarity prediction, secondary structure predic- that this is an ideal setting for Bayesian methods and that trans-
tion, residue-residue contact prediction, and transmembrane fer learning will be important for achieving good performance. To
region prediction, by fitting task specific models on top of a this end, we propose a framework in which sequence variants
pre-trained bidirectional language model. The key insight was are first embedded using our MT-LSTM and then phenotype pre-
that the sequence representations (vector embeddings) learned dictions are made using Gaussian process (GP) regression using
by the language model were powerful features for solving other our embeddings as features. We find that we can predict the
prediction problems. Since then, various language model-based phenotypes of unobserved sequence variants across datasets
protein embedding methods have been applied to these and better than existing methods (Figure 5C). Our MT-LSTM embed-
other protein prediction problems through transfer learning, ding powered GP achieves an average Spearman correlation of
including protein phenotype prediction (Alley et al., 2019; Rao 0.65 with the measured phenotypes significantly outperforming
et al., 2019; Rives et al., 2019; Luo et al., 2020), residue-residue (paired t test, p = 0.006) the next best method, ECNet, which rea-
contact prediction (Rives et al., 2019; Rao et al., 2020), fold ches 0.60 average Spearman correlation.
recognition (Rao et al., 2019), protein-protein (Zhou et al., Semi-supervised learning (van Engelen and Hoos, 2020), few-
2020; Sledzieski et al., 2021) and protein-drug interaction predic- shot learning (Wang et al., 2020a), meta-learning (Vanschoren,
tion (Hie et al., 2020b; Truong and Truong, 2020). Recent works 2018; Hospedales et al., 2020), and other methods for rapid
have shown that increasing language model scale leads to adaptation to new problems and domains will be key future de-
continued improvements in downstream applications, such as velopments for pushing the limit of data efficient learning.
residue-residue contact prediction (Rao et al., 2020). We also Methods that capture uncertainty (e.g., Gaussian processes
Furthermore, our entirely sequence-based method even outperforms structural comparison with TMalign when predicting structural similarity in the SCOP
database. Furthermore, we contrast our end-to-end MT-LSTM model with an earlier two-step language model (SSA-LSTM) and find that training end-to-end in a
unified multi-task framework improves structural similarity classification.

ll
Figure 5. Protein language models with transfer learning improve function prediction
(A) Transfer learning is the problem of applying knowledge gained from learning to solve some task, A, to another related task, B. For example, applying
knowledge from recognizing dogs to recognizing cats. Usually, transfer learning is used to improve performance on tasks with little available data by transferring
knowledge from other tasks with large amounts of available data. In the case of proteins, we are interested in applying knowledge from evolutionary sequence
modeling and structure modeling to protein function prediction tasks.
(B) Transfer learning improves transmembrane prediction. Our transmembrane prediction model consists of two components. First, the protein sequence is
embedded using our pre-trained language model (MT-LSTM) by taking the hidden layers of the language model at each position. Then, these representations are
fed into a small single layer bidirectional LSTM (BiLSTM) and the output of this is fed into a conditional random field (CRF) to predict the transmembrane label at
each position. We evaluate the model by 10-fold cross validation on proteins split into four categories: transmembrane only (TM), signal peptide and trans-
membrane (TM+SP), globular only (Globular), and globular with signal peptide (Globular+SP). A protein is considered correctly predicted if 1) the presence or
absence of signal peptide is correctly predicted and 2) the number of locations of transmembrane regions is correctly predicted. The table reports the fraction of
correctly predicted proteins in each category for our model (BiLSTM+CRF) and widely used transmembrane prediction methods. A BiLSTM+CRF model trained
(legend continued on next page)
ll
and other Bayesian methods) will continue to be important, At the same time, current protein language models make
particularly for guiding experimental design. Some recent works heavily simplified phylogenetic assumptions. By treating each
have explored Gaussian process-based methods for guiding sequence as an independent draw from some prior distribution
protein design with simple protein sequence representations over sequences, current methods assume that all protein se-
(Romero, Krause and Arnold, 2013; Bedbrook et al., 2017; quences arise independently in a star phylogeny. Convention-
Yang et al., 2018). Hie et al., presented a GP-based method for ally, this problem is crudely addressed by filtering sequences
guiding experimental drug design informed by deep protein em- based on percent identity. However, significant effort has been
beddings (Hie et al., 2020b). Other works have explored dedicated to understanding protein sequences as emerging
combining neural network and GP models (Ding et al., 2019; Pa- from tree-structured evolutionary processes over time or coales-
tacchiola et al., 2019); while still others considered non-GP- cent processes in reverse time (Rosenberg and Nordborg, 2002;
based uncertainty aware prediction methods for antibody design Nascimento, Reis and Yang, 2017). Methods for inferring these
and major histocompatibility complex (MHC) peptide display latent phylogenetic trees continue to be of substantial interest
prediction (Zeng and Gifford, 2019; Liu et al., 2020). Methods (Huelsenbeck and Ronquist, 2001; Lartillot, Lepage and Blan-
for combining multiple predictors and for incorporating strong quart, 2009; Bouckaert et al., 2019), but are frustrated by long
priors into protein design can also help to alleviate problems run times and poor scalability to large datasets. In the future,
that arise in the low data regime (Brookes, Park and Listgarten, deep generative models of proteins might seek to merge these
2019). Transfer learning and massive protein language models disciplines to model proteins as being generated from evolu-
will play a key role in future protein property prediction and ma- tionary processes other than star phylogenies.
chine learning driven protein and drug design efforts. Other practical considerations continue to frustrate our ability
to develop new protein language models and rapidly iterate on ex-
Conclusions and perspectives: Strong biological priors periments. High compute costs and murky design guidelines
are key to improving protein language models mean that developing new models is often an expensive, time
Future developments in protein language modeling and repre- consuming, and ad hoc process. It is not clear at what dataset
sentation learning will need to model properties that are unique sizes and levels of sequence diversity one model will outperform
to proteins. Biological sequences are not natural language, and another or how many parameters a model should include. At the
we should develop new language models that capture the funda- upper limit of large natural protein databases, larger models
mental nature of biological sequences. While demonstrably use- continue to yield improved performance. However, for individual
ful, existing methods based on recurrent neural networks and protein families or other application specific protein datasets,
Transformers still do not fundamentally encode key protein prop- the gold standard is to select model architectures and number
erties in the model architecture and the inductive biases of these of parameters via brute force hyperparameter search methods.
models are only roughly understood (Box 1). Fine-tuning pre-trained models can help with this problem but
Proteins are objects that exist in physical space. Similarly, we does not fully resolve it. Sequence length also remains a chal-
understand many of the fundamental evolutionary processes lenge for these models. Transformers scale quadratically with
that give rise to the diversity of protein sequences observed sequence length, which means that in practical implementations
today. These two elements, physics and evolution, are the long sequences need to either be excluded or truncated. New
key properties of proteins and our models might benefit from linear complexity attention mechanisms may help to alleviate
being structured explicitly to incorporate evolutionary and this limitation (Choromanski et al., 2020; Wang et al., 2020b).
physics-based inductive biases. Early attempts at capturing This problem is less extreme for recurrent neural networks, which
physical properties of proteins as part of machine learning scale linearly with sequence length, but very long sequences are
models have already demonstrated that conditioning on struc- still impractical for RNNs to handle and long-range sequence de-
ture improves generative models of sequence (Ingraham et al., pendencies are unlikely to be learned well by these models.
2019b) and significant work has been done in the opposite direc- Language models capture complex relationships between
tion of machine learning-based structure prediction methods residues in protein sequences by condensing information from
that explicitly incorporate constraints on protein geometries enormous protein sequence databases. They are a powerful
(Liu et al., 2018; AlQuraishi, 2019; Ingraham et al., 2019b; Xu, new development for understanding and making predictions
2019; Jumper et al., 2020; Yang et al., 2020). However, new about biological sequences. Increasing model size, compute po-
methods are needed to fuse these directions with physics-based wer, and dataset size will only continue to improve performance
approaches and to start to fully merge sequence- and structure- of protein language models. Already, these methods are trans-
based models. forming computational protein biology today due to their ease
using 1-hot embeddings of the protein sequence instead of our language model representations performs poorly, highlighting the importance of transfer learning
for this task (Table S2).
(C) Transfer learning improves sequence-to-phenotype prediction. Deep mutational scanning measures function for thousands of protein sequence variants. We
consider 19 mutational scanning datasets spanning a variety of proteins and phenotypes. For each dataset, we learn the sequence-to-phenotype mapping by
fitting a Gaussian process regression model on top of representations given by our pre-trained language model. We compare three unsupervised approaches (+),
prior works in supervised learning (o), and our Gaussian process regression approaches with (ÿ, GP (MT-LSTM)) and without (GP (1-hot)) transfer learning by 5-
fold cross validation. Spearman rank correlation coefficients between predicted and ground truth functional measurements are plotted. Our GP with transfer
learning outperforms all other methods, having an average correlation of 0.65 across datasets. The benefits of transfer learning are highlighted by the
improvement over the 1-hot representations which only reach 0.57 average correlation across datasets. Transfer learning improves performance on 18 out of 19
datasets.

ll
of use and widespread applicability. Furthermore, augmenting Alley, E.C., Khimulya, G., Biswas, S., AlQuraishi, M., and Church, G.M. (2019).
language models with protein specific properties such as struc- Unified rational protein engineering with sequence-based deep representation
learning. Nat. Methods 16, 1315–1322.
ture and function offers one already successful route toward
even richer representations and novel biology. However, it re- AlQuraishi, M. (2019). End-to-End Differentiable Learning of Protein Structure.
Cell Syst. 8, 292–301.e3.
mains unclear how best to encode prior biological knowledge
into the inductive bias of these models. We hope this Synthesis Altschul, S.F., and Koonin, E.V. (1998). Iterated profile searches with PSI-
BLAST–a tool for discovery in protein databases. Trends Biochem. Sci. 23,
propels the community to work toward developing purpose-built
444–447.
protein language models with natural inductive biases suited for
Araya, C.L., Fowler, D.M., Chen, W., Muniez, I., Kelly, J.W., and Fields, S.
the physical nature of proteins and how they evolve.
(2012). A fundamental protein property, thermodynamic stability, revealed
solely from large-scale measurements of protein function. Proc. Natl. Acad.
STAR+METHODS Sci. USA 109, 16858–16863.
Bandaru, P., Shah, N.H., Bhattacharyya, M., Barton, J.P., Kondo, Y., Cofsky,
Detailed methods are provided in the online version of this paper J.C., Gee, C.L., Chakraborty, A.K., Kortemme, T., Ranganathan, R., and
and include the following: Kuriyan, J. (2017). Deconstruction of the Ras switching cycle through satura-
tion mutagenesis. eLife 6, e27810. https://doi.org/10.7554/eLife.27810.
d RESOURCE AVAILABILITY Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S.,
B Lead contact Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. (2004). The
B Materials availability Pfam protein families database. Nucleic Acids Res. 32, D138–D141.
B Data and code availability Battaglia, P.W., Hamrick, J.B., and Bapst, V. (2018). Relational inductive
d METHOD DETAILS biases, deep learning, and graph networks. arXiv, 1806.01261 https://arxiv.
B Bidirectional LSTM encoder with skip connections org/abs/1806.01261.
B Masked language modeling module Bedbrook, C.N., Yang, K.K., Rice, A.J., Gradinaru, V., and Arnold, F.H. (2017).
B Residue-residue contact prediction module Machine learning to design integral membrane channelrhodopsins for efficient
eukaryotic expression and plasma membrane localization. PLoS Comput.
B Structure similarity prediction module
Biol. 13, e1005786.
B Multi-task loss
Bengio, Y. (2012) Deep learning of representations for unsupervised and trans-
B Training datasets
fer learning. In Proceedings of ICML workshop on unsupervised and transfer
B Hyperparameters and training details learning. jmlr.org, pp. 17–36.
B Protein structural similarity prediction evaluation
Bepler, T., and Berger, B. (2019). Learning protein sequence embeddings us-
B Transmembrane region prediction training and eval- ing information from structure. In International Conference on Learning
uation Representations. 1902.08661, https://arxiv.org/abs/1902.08661.
B Sequence-to-phenotype prediction and evaluation Berger, B. (1995). Algorithms for protein structural motif recognition.
J. Comput. Biol. 2, 125–138.
SUPPLEMENTAL INFORMATION Berger, B., Wilson, D.B., Wolf, E., Tonchev, T., Milla, M., and Kim, P.S. (1995).
Predicting coiled coils by use of pairwise residue correlations. Proc. Natl.
Supplemental information can be found online at https://doi.org/10.1016/j. Acad. Sci. USA 92, 8259–8263.
cels.2021.05.017.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H.,
Shindyalov, I.N., and Bourne, P.E. (2000). The Protein Data Bank. Nucleic
ACKNOWLEDGMENTS
Acids Res. 28, 235–242.
The authors are grateful to Grace Yeo and Brian Hie for helpful suggestions. Bouckaert, R., Vaughan, T.G., Barido-Sottani, J., Duchêne, S., Fourment, M.,
T.B. is supported by the Simons Foundation International, Ltd. (SF349247). €hnert, D., De Maio, N., et al. (2019).
Gavryushkina, A., Heled, J., Jones, G., Ku
B.B. is partially supported by NIH grant R35 GM141861. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis.
PLoS Comput. Biol. 15, e1006650.
AUTHOR CONTRIBUTIONS Brenan, L., Andreev, A., Cohen, O., Pantel, S., Kamburov, A., Cacchiarelli, D.,
Persky, N.S., Zhu, C., Bagul, M., Goetz, E.M., et al. (2016). Phenotypic
All authors conceived and guided the project and methodology. T.B. wrote the Characterization of a Comprehensive Set of MAPK1/ERK2 Missense
software and performed the computational experiments. All authors inter- Mutants. Cell Rep. 17, 1171–1183.
preted the results and wrote the manuscript.
Brookes, D., Park, H., and Listgarten, J. (2019) ‘Conditioning by adaptive sam-
pling for robust design’, in Chaudhuri, K. and Salakhutdinov, R. (eds)
DECLARATION OF INTERESTS
Proceedings of the 36th International Conference on Machine Learning.
The authors declare no competing interests. PMLR (Proceedings of Machine Learning Research), pp. 773–782.
Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P.,
Received: January 16, 2021 Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language
Revised: May 20, 2021 Models are Few-Shot Learners. arXiv, 2005.14165 http://arxiv.org/abs/
Accepted: May 20, 2021 2005.14165.
Published: June 16, 2021
Callaway, E. (2020). Revolutionary cryo-EM is taking over structural biology.
Nature 578, 201.
REFERENCES
Chandonia, J.-M., Fox, N.K., and Brenner, S.E. (2017). SCOPe: Manual
Alford, R.F., Leaver-Fay, A., Jeliazkov, J.R., O’Meara, M.J., DiMaio, F.P., Park, Curation and Artifact Removal in the Structural Classification of Proteins -
H., Shapovalov, M.V., Renfrew, P.D., Mulligan, V.K., Kappel, K., et al. (2017). extended Database. J. Mol. Biol. 429, 348–355.
The Rosetta All-Atom Energy Function for Macromolecular Modeling and Cheng, Y., Grigorieff, N., Penczek, P.A., and Walz, T. (2015). A primer to single-
Design. J. Chem. Theory Comput. 13, 3031–3048. particle cryo-electron microscopy. Cell 161, 438–449.

ll
Choi, J.-M., and Pappu, R.V. (2019). Improvements to the ABSINTH Force Hornak, V., Abel, R., Okur, A., Strockbine, B., Roitberg, A., and Simmerling, C.
Field for Proteins Based on Experimentally Derived Amino Acid Specific (2006). Comparison of multiple Amber force fields and development of
Backbone Conformational Statistics. J. Chem. Theory Comput. 15, improved protein backbone parameters. Proteins 65, 712–725.
1367–1382. Hospedales, T., Antoniou, A., Micaelli, P., and Storkey, A. (2020). Meta-
Choromanski, K.M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, Learning in Neural Networks: A Survey. arXiv, 2004.05439 http://arxiv.org/
T., Hawkins, P., Davis, J.Q., Mohiuddin, A., Kaiser, L., et al. (2020). Rethinking abs/2004.05439.
Attention with Performers. In International Conference on Learning Hubbard, T.J., Murzin, A.G., Brenner, S.E., and Chothia, C. (1997). SCOP: a
Representations. https://openreview.net/pdf?id=Ua6zuk0WRH (Accessed: structural classification of proteins database. Nucleic Acids Res. 25, 236–239.
20 May 2021).
Huelsenbeck, J.P., and Ronquist, F. (2001). MRBAYES: Bayesian inference of
de Juan, D., Pazos, F., and Valencia, A. (2013). Emerging methods in protein phylogenetic trees. Bioinformatics 17, 754–755.
co-evolution. Nat. Rev. Genet. 14, 249–261. Ingraham, J., Garg, V.K., Barzilay, R., and Jaakkola, T. (2019a). Generative
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training Models for Graph-Based Protein Design. In Advances in Neural Information
of Deep Bidirectional Transformers for Language Understanding. arXiv, Processing Systems, H. Wallach, et al., eds. (Curran Associates, Inc.),
1810.04805 http://arxiv.org/abs/1810.04805. pp. 15820–15831.
Ding, X., Zou, Z., and Brooks III, C.L. (2019). Deciphering protein evolution and Ingraham, J., Riesselman, A., Sander, C., and Marks, D. (2019b). Learning pro-
fitness landscapes with latent space models. Nat. Commun. 10, 5644. tein structure with a differentiable simulator. In International Conference on
Eddy, S.R. (2011). Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, Learning Representations https://openreview.net/forum?id=Byg3y3C9Km.
e1002195. Jacquier, H., Birgy, A., Le Nagard, H., Mechulam, Y., Schmitt, E., Glodt, J.,
Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., and Aurell, E. (2013). Improved Bercot, B., Petit, E., Poulain, J., Barnaud, G., et al. (2013). Capturing the muta-
contact prediction in proteins: using pseudolikelihoods to infer Potts models. tional landscape of the beta-lactamase TEM-1. Proc. Natl. Acad. Sci. USA 110,
Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 87, 012707. 13067–13072.
Elnaggar, A., Heinzinger, M., Dallago, C., Rihawi, G., Wang, Y., Jones, L., James, L.C., and Tawfik, D.S. (2003). Conformational diversity and protein
Gibbs, T., Feher, T., Angerer, C., Steinegger, M., et al. (2020). ProtTrans: evolution–a 60-year-old hypothesis revisited. Trends Biochem. Sci. 28,
Towards Cracking the Language of Life’s Code Through Self-Supervised 361–368.
Deep Learning and High Performance Computing. arXiv, 2007.06225 http:// Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Tunyasuvunakool,
arxiv.org/abs/2007.06225. K., Ronneberger, O., Bates, R., Zidek, A., Brigland, A., et al.. https://
predictioncenter.org/casp14/doc/CASP14_Abstracts.pdf.
Findlay, G.M., Daza, R.M., Martin, B., Zhang, M.D., Leith, A.P., Gasperini, M.,
Janizek, J.D., Huang, X., Starita, L.M., and Shendure, J. (2018). Accurate clas- Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R.,
sification of BRCA1 variants with saturation genome editing. Nature 562, Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling Laws for
217–222. Neural Language Models. arXiv, 2001.08361 http://arxiv.org/abs/2001.08361.
Finn, R.D., Clements, J., and Eddy, S.R. (2011). HMMER web server: interac- Kingma, D.P., and Ba, J. (2015). Adam: A Method for Stochastic Optimization.
tive sequence similarity searching. Nucleic Acids Res. 39, 29–37. In International Conference on Learning Representations. 1412.6980, http://
arxiv.org/abs/1412.6980.
Fox, N.K., Brenner, S.E., and Chandonia, J.-M. (2014). SCOPe: Structural
Classification of Proteins–extended, integrating SCOP and ASTRAL data Kitzman, J.O., Starita, L.M., Lo, R.S., Fields, S., and Shendure, J. (2015).
and classification of new structures. Nucleic Acids Res. 42, D304–D309. Massively parallel single-amino-acid mutagenesis. Nat. Methods 12,
203–206, 4 p following 206.
Gardner, J., et al. (2018). GPyTorch: Blackbox Matrix-Matrix Gaussian
Process Inference with GPU Acceleration. In Advances in Neural Information Klesmith, J.R., Bacik, J.-P., Wrenbeck, E.E., Michalczyk, R., and Whitehead,
Processing Systems, S. Bengio, et al., eds. (Curran Associates, Inc), T.A. (2017). Trade-offs between enzyme fitness and solubility illuminated by
pp. 7576–7586. deep mutational scanning. PNAS 114, 2265–2270.
Kosloff, M., and Kolodny, R. (2008). Sequence-similar, structure-dissimilar
Göbel, U., Sander, C., Schneider, R., and Valencia, A. (1994). Correlated mu-
protein pairs in the PDB. Proteins 71, 891–902.
tations and residue contacts in proteins. Proteins 18, 309–317.
Lartillot, N., Lepage, T., and Blanquart, S. (2009). PhyloBayes 3: a Bayesian
Godzik, A., Kolinski, A., and Skolnick, J. (1993). De novo and inverse folding
software package for phylogenetic reconstruction and molecular dating.
predictions of protein structure and dynamics. J. Comput. Aided Mol. Des.
Bioinformatics 25, 2286–2288.
7, 397–438.
Leaver-Fay, A., Tyka, M., Lewis, S.M., Lange, O.F., Thompson, J., Jacak, R.,
Graves, A., Fernández, S., and Schmidhuber, J. (2005). Bidirectional LSTM
Kaufman, K., Renfrew, P.D., Smith, C.A., et al. (2011). Chapter nineteen -
Networks for Improved Phoneme Classification and Recognition. In Artificial
Rosetta3: An Object-Oriented Software Suite for the Simulation and Design
Neural Networks: Formal Models and Their Applications – ICANN 2005
of Macromolecules. In Methods in Enzymology, M.L. Johnson and L. Brand,
(Springer Berlin Heidelberg), pp. 799–804.
eds. (Academic Press), pp. 545–574.
Harris, Z.S. (1954). Distributional Structure. Word World 10, 146–162.
Liu, Y., Palmedo, P., Ye, Q., Berger, B., and Peng, J. (2018). Enhancing Evolutionary
Hess, B., Kutzner, C., van der Spoel, D., and Lindahl, E. (2008). GROMACS Couplings with Deep Convolutional Neural Networks. Cell Syst. 6, 65–74.e3.
4:cAlgorithms for Highly Efficient, Load-Balanced, and Scalable Molecular
Liu, G., Zeng, H., Mueller, J., Carter, B., Wang, Z., Schilz, J., Horny, G.,
Simulation. J. Chem. Theory Comput. 4, 435–447.
Birnbaum, M.E., Ewert, S., and Gifford, D.K. (2020). Antibody complementarity
Hie, B., Zhong, E., Bryson, B., and Berger, B. (2020a). Learning mutational se- determining region design using high-capacity machine learning.
mantics. Advances in Neural Information Processing Systems 33. https:// Bioinformatics 36, 2126–2133.
proceedings.neurips.cc/paper/2020/hash/6754e06e46dfa419d5afe3c9781c Luo, Y., Vo, L., Ding, H., Su, Y., Liu, Y., Qian, W.W., Zhao, H., and Peng, J.
ecad-Abstract.html. (2020). Evolutionary Context-Integrated Deep Sequence Modeling for
Hie, B., Bryson, B.D., and Berger, B. (2020b). Leveraging Uncertainty in Protein Engineering. In Research in Computational Molecular Biology
Machine Learning Accelerates Biological Discovery and Design. Cell Syst. (Springer International Publishing), pp. 261–263.
11, 461–477.e9. Madani, A., McCann, B., Naik, N., Keskar, N.S., Anand, N., Eguchi, R.R.,
Hie, B., Zhong, E.D., Berger, B., and Bryson, B. (2021). Learning the language Huang, P.-S., and Sochler, R. (2020). ProGen: Language Modeling for
of viral evolution and escape. Science 371, 284–288. Protein Generation. arXiv, 2004.03497 http://arxiv.org/abs/2004.03497.
Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Marks, D.S., Hopf, T.A., and Sander, C. (2012). Protein structure prediction
Comput. 9, 1735–1780. from sequence variation. Nat. Biotechnol. 30, 1072–1080.

ll
Matreyek, K.A., Starita, L.M., Stephany, J.J., Martin, B., Chiasson, M.A., Gray, Rohl, C.A., Strauss, C.E., Misura, K.M., and Baker, D. (2004). Protein structure
V.E., Kircher, M., Khechaduri, A., Dines, J.N., Hause, R.J., et al. (2018). prediction using Rosetta. Methods Enzymol. 383, 66–93.
Multiplex assessment of protein variant abundance by massively parallel Romero, P.A., Krause, A., and Arnold, F.H. (2013). Navigating the protein
sequencing. Nat. Genet. 50, 874–882. fitness landscape with Gaussian processes. Proc. Natl. Acad. Sci. USA 110,
McDonnell, A.V., Jiang, T., Keating, A.E., and Berger, B. (2006). Paircoil2: E193–E201.
improved prediction of coiled coils from sequence. Bioinformatics 22, 356–358. Rosenberg, N.A., and Nordborg, M. (2002). Genealogical trees, coalescent
McLaughlin, R.N., Jr., Poelwijk, F.J., Raman, A., Gosal, W.S., and theory and the analysis of genetic polymorphisms. Nat. Rev. Genet. 3,
Ranganathan, R. (2012). The spatial architecture of protein function and adap- 380–390.
tation. Nature 491, 138–142. Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng.
Melamed, D., Young, D.L., Gamble, C.E., Miller, C.R., and Fields, S. (2013). 12, 85–94.
Deep mutational scanning of an RRM domain of the Saccharomyces cerevi- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
siae poly(A)-binding protein. RNA 19, 1537–1551. Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., et al. (2015).
ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis.
Mitchell, T. M. 1980. The need for biases in learning generalizations. New
115, 211–252.
Jersey: Department of Computer Science, Laboratory for Computer Science
Research, Rutgers Univ.; 1980 May. Shin, D.H., Hou, J., Chandonia, J.M., Das, D., Choi, I.G., Kim, R., and Kim, S.H.
(2007). Structure-based inference of molecular functions of proteins of un-
Nascimento, F.F., Reis, M.D., and Yang, Z. (2017). A biologist’s guide to
known function from Berkeley Structural Genomics Center. J. Struct. Funct.
Bayesian phylogenetic analysis. Nat. Ecol. Evol. 1, 1446–1454.
Genomics 8, 99–105.
Needleman, S.B., and Wunsch, C.D. (1970). A general method applicable to
Sillitoe, I., Bordin, N., Dawson, N., Waman, V.P., Ashford, P., Scholes, H.M.,
the search for similarities in the amino acid sequence of two proteins.
Pang, C.S.M., Woodridge, L., Rauer, C., Sen, N., et al. (2021). CATH: increased
J. Mol. Biol. 48, 443–453.
structural coverage of functional space. Nucleic Acids Res. 49 (D1),
Patacchiola, M., Turner, J., Crowley, E.J., O’Boyle, M., and Storkey, A. (2019). D266–D273.
Bayesian Meta-Learning for the Few-Shot Setting via Deep Kernels. arXiv, Sledzieski, S., Singh, R., Cowen, L., and Berger, B. (2021). Sequence-based
1910.05199 https://arxiv.org/abs/1910.05199. prediction of protein-protein interactions: a structure-aware interpretable
Peters, M.E., Neumann M, and Iyyer M. (2018). Deep contextualized word rep- deep learning model. In Research in Computational Molecular Biology
resentations. arXiv, 1802.05365 http://arxiv.org/abs/1802.05365. (RECOMB). https://doi.org/10.1101/2021.01.22.427866.
Pruitt, K.D., Tatusova, T., and Maglott, D.R. (2007). NCBI reference sequences Srinivasan, R., and Rose, G.D. (1995). LINUS: a hierarchic procedure to predict
(RefSeq): a curated non-redundant sequence database of genomes, tran- the fold of a protein. Proteins 22, 81–99. https://doi.org/10.1002/prot.
scripts and proteins. Nucleic Acids Res. 35, D61–D65. 340220202.
Paszke, A., Gross, S., Chintala, S., et al.. https://proceedings.neurips.cc/ Starita, L.M., Pruneda, J.N., Lo, R.S., Fowler, D.M., Kim, H.J., Hiatt, J.B.,
paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html. Shendure, J., Brzovic, P.S., Fields, S., and Klevit, R.E. (2013). Activity-
enhancing mutations in an E3 ubiquitin ligase identified by high-throughput
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018) Improving
mutagenesis. Proc. Natl. Acad. Sci. USA 110, E1263–E1272.
language understanding by generative pre-training. cs.ubc.ca. https://www.
cs.ubc.ca/amuham01/LING530/papers/radford2018improving.pdf Suzek, B.E., Huang, H., McGarvey, P., Mazumder, R., and Wu, C.H. (2007).
(Accessed: 14 January 2021). UniRef: comprehensive and non-redundant UniProt reference clusters.
Bioinformatics 23, 1282–1288.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019).
Trigg, J., Gutwin, K., Keating, A.E., and Berger, B. (2011). Multicoil2: predicting
Language models are unsupervised multitask learners. OpenAI blog 1.
coiled coils and their oligomerization states from sequence in the twilight zone.
https://d4mucfpksywv.cloudfront.net/better-language-models/language_
PLoS ONE 6, e23519.
models_are_unsupervised_multitask_learners.pdf.
Truong, T.F., and Truong, T.F., Jr. (2020). Interpretable deep learning frame-
Rao, R., Bhattacharya, N., Thomas, N., Duan, Y., Chen, X., Canny, J., Abbeel,
work for binding affinity prediction. Massachusetts Institute of Technology
P., and Song, Y.S. (2019). Evaluating Protein Transfer Learning with TAPE.
https://dspace.mit.edu/handle/1721.1/127527?show=full.
Adv. Neural Inf. Process. Syst. 32, 9689–9701.
€ll, L., and Elofsson, A. (2015a). The
Tsirigos, K.D., Peters, C., Shu, N., Ka
Rao, R., Meier, J., Sercu, T., Ovchinnikov, S., and Rives, A. (2020). Transformer
TOPCONS web server for consensus prediction of membrane protein topol-
protein language models are unsupervised structure learners. bioRxiv. https://
ogy and signal peptides. Nucleic Acids Res. 43 (W1), W401-7.
doi.org/10.1101/2020.12.15.422761.
€ll, L., and Elofsson, A. (2015b). The
Tsirigos, K.D., Peters, C., Shu, N., Ka
Rasmussen, C.E., and Williams, C.K.I. (2005). Gaussian Processes for TOPCONS web server for consensus prediction of membrane protein topol-
Machine Learning. (MIT Press). ogy and signal peptides. Nucleic Acids Res. 43 (W1), W401-7.
Remmert, M., Biegert, A., Hauser, A., and Söding, J. (2011a). HHblits: light- UniProt Consortium (2019). UniProt: a worldwide hub of protein knowledge.
ning-fast iterative protein sequence searching by HMM-HMM alignment. Nucleic Acids Res. 47 (D1), D506–D515.
Nat. Methods 9, 173–175.
van Engelen, J.E., and Hoos, H.H. (2020). A survey on semi-supervised
Remmert, M., Biegert, A., Hauser, A., and Söding, J. (2011b). HHblits: light- learning. Mach. Learn. 109, 373–440.
ning-fast iterative protein sequence searching by HMM-HMM alignment. Vanschoren, J. (2018). Meta-Learning: A Survey. arXiv, 1810.03548 http://
Nat. Methods 9, 173–175. arxiv.org/abs/1810.03548.
Remmert, M., Biegert, A., Hauser, A., and Söding, J. (2011c). HHblits: light- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
ning-fast iterative protein sequence searching by HMM-HMM alignment. Kaiser, L., and Polosukhin, I. (2017). Attention is All you Need. In Advances
Nat. Methods 9, 173–175. in Neural Information Processing Systems, I. Guyon, et al., eds. (Curran
Riesselman, A.J., Ingraham, J.B., and Marks, D.S. (2018). Deep generative Associates, Inc.), pp. 5998–6008.
models of genetic variation capture the effects of mutations. Nat. Methods Vig, J., Madani, A., Varshney, L.R., Xiong, C., Socher, R., and Rajani, N.F.
15, 816–822. (2020). BERTology Meets Biology: Interpreting Attention in Protein Language
Rives, A., Goyal, S., Meier, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., and Models. arXiv, 2006.15222 http://arxiv.org/abs/2006.15222.
Fergus, R. (2019). Biological structure and function emerge from scaling unsu- Wang, S., Sun, S., Li, Z., Zhang, R., and Xu, J. (2017). Accurate De Novo
pervised learning to 250 million protein sequences. bioRxiv https://www. Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLoS
biorxiv.org/content/10.1101/622803v1. Comput. Biol. 13, e1005324.

ll
Walensky, R.P., Walke, H.T., and Fauci, A.S. (2021). SARS-CoV-2 Variants of Xu, J. (2019). Distance-based protein folding powered by deep learning. Proc.
Concern in the United States—Challenges and Opportunities. JAMA 325 (11), Natl. Acad. Sci. USA 116, 16856–16865.
1037–1038. https://doi.org/10.1001/jama.2021.2294. Xu, J., and Wang, S. (2019). Analysis of distance-based protein structure pre-
Wang, S., Li, B.Z., Khabsa, M., Fang, H., and Ma, H. (2020a). Linformer: Self- diction by deep learning in CASP13. Proteins 87, 1069–1081.
Attention with Linear Complexity. arXiv, 2006.04768 http://arxiv.org/abs/
Yang, K.K., Wu, Z., Bedbrook, C.N., and Arnold, F.H. (2018). Learned protein
2006.04768.
embeddings for machine learning. Bioinformatics 34, 4138.
Wang, Y., Yao, Q., Kwok, J., and Ni, L.M. (2020b). Generalizing from a Few
Yang, J., Anishchenko, I., Park, H., Peng, Z., Ovchinnikov, S., and Baker, D.
Examples: A Survey on Few-shot Learning. ACM Comput. Surv. 53, 1–34.
(2020). Improved protein structure prediction using predicted interresidue ori-
Wei, K.Y., Moschidi, D., Bick, M.J., Nerli, S., McShan, A.C., Carter, L.P., entations. Proc. Natl. Acad. Sci. USA 117, 1496–1503.
Huang, P.S., Fletcher, D.A., Sgourakis, N.G., Boyken, S.E., and Baker, D.
(2020). Computational design of closely related proteins that adopt two well- Zeng, H., and Gifford, D.K. (2019). Quantification of Uncertainty in Peptide-
defined but structurally divergent folds. Proc. Natl. Acad. Sci. USA 117, MHC Binding Prediction Improves High-Affinity Peptide Selection for
7208–7215. Therapeutic Design. Cell Syst. 9, 159–166.e3.
Weile, J., Sun, S., Cote, A.G., Knapp, J., Verby, M., Mellor, J.C., Wu, Y., Pons, Zhang, C., and Kim, S.-H. (2003). Overview of structural genomics: from struc-
C., Wong, C., van Lieshout, N., et al. (2017). A framework for exhaustively map- ture to function. Curr. Opin. Chem. Biol. 7, 28–32.
ping functional missense variants. Mol. Syst. Biol. 13, 957. Zhang, Y., and Skolnick, J. (2005). TM-align: a protein structure alignment al-
Wolf, E., Kim, P.S., and Berger, B. (1997). MultiCoil: a program for predicting gorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309.
two- and three-stranded coiled coils. Protein Sci. 6, 1179–1189. Zhou, G., Chen, M., Ju, C.J.T., Wang, Z., Jiang, J.Y., and Wang, W. (2020).
Wrenbeck, E.E., Azouz, L.R., and Whitehead, T.A. (2017). Single-mutation Mutation effect estimation on protein-protein interactions using deep contex-
fitness landscapes for an enzyme on multiple substrates reveal specificity is tualized representation learning. NAR Genom. Bioinform. 2, a015. https://doi.
globally encoded. Nat. Commun. 8, 15695. org/10.1093/nargab/lqaa015.

ll
STAR+METHODS
RESOURCE AVAILABILITY
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Bonnie
Berger (bab@mit.edu).
Materials availability
This study did not generate new materials.
Data and code availability

d This paper did not generate new data.
d Source code and model parameters are available at https://github.com/tbepler/prose.
d Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
METHOD DETAILS
Bidirectional LSTM encoder with skip connections

We structure the sequence encoder of our DLM- and MT-LSTM models as a three-layered bidirectional LSTM with skip connections
from each layer to the final output. Our LSTMs have 1024 hidden units in each direction of each layer. We feed a 1-hot encoding of the
amino acid sequence as the input to the first layer. Given a sequence input, x, of length L, this sequence is 1-hot encoded into a ma-
trix, O, of size Lx21 where entry oi,j = 1 if xi = j (that is, amino acid xi has index j) and oi,j = 0 otherwise. We then calculate H(1) = f(1)(O),
H(2) = f(2)(H(1)), H(3) = f(3)(H(2)), and Z = [H(1) H(2) H(3)] where H(a) is the hidden units of the ath layer and f(a) is the ath BiLSTM layer. The final
output of the encoder, Z, is the concatenation of the hidden units of each layer along the embedding dimension.
Masked language modeling module

We use a masked language modeling objective for training on sequences only. During training, we randomly replace 10% of the
amino acids in a sequence with either an auxiliary mask token or a uniformly random draw from the amino acids and train our model
to predict the original amino acids at those positions. Given an input sequence, x, we randomly mask this sequence to create a new
sequence, x’. This sequence is fed into our encoder to give a sequence of vector representations, Z. We decode these vectors into a
distribution over amino acids at each position, p, using a linear layer. The parameters of this layer are learned jointly with the param-
eters of the encoder network. We calculate the masked language modeling loss as the negative log likelihood of the true amino acid at
P
each of the masked positions, Lmasked = 1n log pi;xi where there are n masked positions indexed by i.
i
Residue-residue contact prediction module

We predict intra-residue contacts using a bilinear projection of the sequence embeddings. Given a sequence, x, with embeddings, Z,
calculated using our encoder network, the bilinear projection calculates ZWZT + b, where W and b are learnable parameters of dimen-
sion DxD and 1 respectively where D is the dimension of an embedding vector. These parameters are fit together with the parameters
of the encoder network. This produces an LxL matrix, where L is the length of x. We interpret the i,jth entry in this matrix as the log-
likelihood ratio between the probability that the ith and jth residues are within 8 Å in the 3D protein structure and the probability that
they are not. We then calculate the contact loss, Lcontact , as the negative log-likelihood of the true contacts given the predicted con-
tact probabilities.
Structure similarity prediction module

Our structure similarity prediction module follows previously described methods (Bepler and Berger, 2019). Given two input se-
quences, X and X’ with lengths N and M, that have been encoded into vector representations, Z and Z’, we calculate reduced dimen-
sion projections, A = ZB and A’ = Z’B, where B is a DxK matrix that is trained together with the encoder network parameters. K is a
hyperparameter and is set to 100. Given A and A’, we calculate the inter-residue semantic distances between the two sequences as
the Manhattan distance between the embedding at position i in the first sequence and embedding at position j in the second
sequence, di,j = ||Ai - A’j||1. Given these distances, we calculate a soft alignment between the positions of sequences X and X’.
k k
The alignment weight between two positions, i and j, is defined as ci;j = ai;j + bi;j ai;j bi;j where ai;j = PNi;j and bi;j = PMi;j and
k k
l = 1 i;l l = 1 l;j
ki;j = edi;j . With the inter-residue semantic distances and the alignment weights, we then define a global similarity between the
P
two sequences as the negative semantic distance between the positions averaged over the alignment, s = C1 ci;j di;j where
i;j
P
c = ci;j .
i;j
e1 Cell Systems 12, 654–669.e1–e3, June 16, 2021

ll
With this global similarity based on the sequence embeddings in hand, we need to compare it against a ground truth similarity to
calculate the gradient of our loss signal and update the parameters. Because we want our semantic similarity to reflect structural
similarity, we retrieve ground truth labels, t, from the SCOP database by assigning increasing levels of similarity to proteins based
on the number of levels in the SCOP hierarchy that they share. In other words, we assign a ground truth label of 0 to proteins not
in the same class, 1 to proteins in the same class but not the same fold, 2 to proteins in the same fold but not the same superfamily,
3 to proteins in the same superfamily but not in the same family, and finally 4 to proteins in the same family. We relate our semantic
similarity to these levels of structural similarity through ordinal regression. We calculate the probability that two sequences are similar
at a level t or higher as pðy RtÞ = qt s + bt where qt and bt are additional learnable parameters for tR1. We impose the constraint that
qt R0 in order to ensure that increasing similarity between the embeddings corresponds to increasing numbers of shared levels in the
SCOP hierarchy. Given these distributions, we calculate the probability that two proteins are similar at exactly level t as pðy = tÞ =
pðy RtÞð1 pðy Rt + 1ÞÞ. That is, the probability that two sequences are similar at exactly level t is equal to the probability they are
similar at at least level t times the probability they are not similar at a level above t.
We then define the structural similarity prediction loss to be the negative log-likelihood of the observed similarity labels under this
model, Lsimilarity = log pðy = tÞ.
Multi-task loss
We define the combined multi-task loss as a weighted sum of the language modeling, contact prediction, and similarity prediction
losses, LMT = lmasked Lmasked + lcontact Lcontact + lsimilarity Lsimilarity :
Training datasets
We train our masked language models on a large corpus of protein sequences, UniRef90 (Suzek et al., 2007), retrieved in July 2018.
This dataset contains 76,215,872 protein sequences filtered to 90% sequence identity. For structural supervision, we use the SCOPe
ASTRAL protein dataset previously presented by Bepler & Berger (Fox, Brenner and Chandonia, 2014; Chandonia, Fox and Brenner,
2017; Bepler and Berger, 2019). This dataset contains 28,010 protein sequences with known structures and SCOP classifications
from the SCOPe ASTRAL 2.06 release. These sequences are split into 22,408 training sequences and 5,602 testing sequences.
Hyperparameters and training details

We train two language models with different settings of the weights in the loss term. The first model, DLM-LSTM, uses only the
masked language modeling objective so is trained with lmasked = 1; lcontact = 0; and lsimilarity = 0:The second model, MT-LSTM,
uses the full multi-task objective with weights lmasked = 0:5; lcontact = 0:9; and lsimilarity = 0:1:The DLM-LSTM model was trained for
1,000,000 parameter updates using a minibatch size of 100 using the Adam optimizer (Kingma and Ba, 2015) with a learning rate
of 0.0001. The MT-LSTM model was also trained for 1,000,000 parameter updates using Adam with a learning rate of 0.0001,
but, due to GPU RAM restrictions, we had to train the MT-LSTM model with smaller minibatch sizes of 64 for the masked language
model objective and 16 for the structure-based objectives. Following Bepler & Berger (Bepler and Berger, 2019), we sampled pairs of
proteins for the structural similarity prediction task with an exponential smoothing parameter, t = 0:5; in order to oversample the rela-
tively rare highly similar protein pairs in the dataset. During training, we applied a mild regularization on the structure tasks by
randomly resampling positions from a uniform distribution over amino acids with probability 0.05.
Models were implemented using PyTorch (Paszke et al., 2017) and trained on a single NVIDIA V100 GPU with 32GB of RAM.
Training time was roughly 13 days for the DLM-LSTM model and 51 days for the MT-LSTM model.
Protein structural similarity prediction evaluation

We evaluate protein structural similarity methods on the SCOPe ASTRAL test set described above (Training datasets). All methods
are evaluated on 100,000 randomly sampled protein pairs in this dataset. For each prediction method, we calculate the predicted
similarity between each pair using only the sequence of each protein with the exception of TMalign which operates on the protein
structures. Because TMscore is not symmetric, we calculate TMscore for both comparison directions and average them together
for each protein pair. We found this outperformed other methods of combining the two scores. For HHalign, we first constructed pro-
file HMMs for each protein by iteratively searching for homologs in the uniprot30 database provided by the authors using HHblits
(Remmert et al., 2011c). We then calculate the similarities between each pair of proteins by aligning their HMMs with HHalign. For
protein language model embedding methods, we calculate the predicted similarity as described above (Structural similarity predic-
tion module).
We compare the predicted structural similarity scores against the ground truth scores defined by SCOP across a variety of metrics.
Accuracy is the fraction of protein pairs for which the similarity level is predicted exactly correctly. We also calculate the Pearson
correlation coefficient (r) and Spearman rank correlation coefficient ðrÞ between the predicted and ground truth similarities. Finally,
we calculate the average-precision score for retrieving pairs of proteins at or above each level of similarity. That is, we report the
average-precision score for each method where the positive set is proteins in the same class, in the same fold, in the same super-
family, or in the same family.
Cell Systems 12, 654–669.e1–e3, June 16, 2021 e2

ll
Transmembrane region prediction training and evaluation

We follow the procedure for transmembrane prediction and evaluation previously described by Tsirigos et al. and the model
described by Bepler & Berger (Tsirigos et al., 2015a; Bepler and Berger, 2019). The TOPCONS2 dataset contains protein sequences
and transmembrane annotations for four categories of proteins: 1) proteins with transmembrane regions (TM), 2) proteins with trans-
membrane regions and a signal peptide (TM+SP), 3) proteins without transmembrane regions or a signal peptide (globular), and 4)
proteins without transmembrane regions but with a signal peptide (globular+SP). Altogether, the dataset contains 5154 proteins
broken down into 286 TM, 627 TM+SP, 2927 globular, and 1314 globular+SP proteins.
In order to compare different protein representations for transmembrane prediction, we fit a single layer BiLSTM followed by a con-
ditional random field (CRF) decoder using either 1-hot encodings of the amino acid sequence or embeddings generated by the SSA-
LSTM, DLM-LSTM, or MT-LSTM models. The BiLSTM has 150 hidden units in each direction and the CRF decodes the outputs of the
BiLSTM to one of four states: signal peptide, cytosolic region, transmembrane region, or extracellular region. In the CRF, we use the
hidden state grammar and transitions defined by Tsirigos et al. (Tsirigos et al., 2015b) and only fit the input potentials. The models are
trained for 10 epochs over the data with a batch size of 1 using the Adam optimizer (Kingma and Ba, 2015) with a learning rate
of 0.0003.
We compare methods by 10-fold cross validation. We calculate prediction performance over proteins in the held-out set by decod-
ing the most likely sequence of labels using the Viterbi algorithm and then scoring a protein as correctly predicted if 1) the protein is
globular and we predict no transmembrane or signal peptide regions, 2) the protein is globular+SP and we predict that the protein
starts with a signal peptide and has no transmembrane regions, 3) the protein is TM and we predict the correct number of transmem-
brane regions with at least 50% overlap to the ground truth regions and no signal peptide, and 4) the protein is TM+SP and is the same
as TM except that we also predict that the protein starts with a signal peptide.
Sequence-to-phenotype prediction and evaluation

We retrieve the set of deep mutational scanning datasets aggregated by Riesselman et al. (Riesselman, Ingraham, and Marks, 2018)
and follow the supervised learning procedure used by Luo et al. (Luo et al., 2020). These datasets contain phenotypic measurements
of sequence variants across a variety of proteins and measured phenotypes. Phenotypes include enzyme function (Bandaru et al.,
2017; Wrenbeck, Azouz and Whitehead, 2017), growth (Melamed et al., 2013; Kitzman et al., 2015; Brenan et al., 2016; Klesmith et al.,
2017; Weile et al., 2017; Findlay et al., 2018), stability (Matreyek et al., 2018), peptide binding (Araya et al., 2012; McLaughlin et al.,
2012), ligase activity (Starita et al., 2013), and MIC (Jacquier et al., 2013).
For each dataset, we featurize the amino acid sequences of each variant as either a 1-hot encoding or by embedding the sequence
with our MT-LSTM model. We then apply dimensionality reduction to these vectors using PCA down to the minimum of 1000 PCs or
the number of data points in the dataset in order to improve the runtime of the learning algorithm. We then fit a Gaussian process (GP)
regression model using the RBF kernel and fit the kernel hyperparameters by maximum likelihood. We implement our GP models in
GPyTorch (Gardner et al., 2018). To compare methods, we follow Luo et al. and perform 5-fold cross validation on each deep muta-
tional scanning dataset (Luo et al., 2020) and calculate the Spearman rank correlation coefficient between our predicted phenotypes
and the ground truth phenotypes on the heldout data for each fold.
e3 Cell Systems 12, 654–669.e1–e3, June 16, 2021

自然语言处理——蛋白序列

Uploaded by

Document Informationclick to expand document information

Document Informationclick to expand document information

Copyright:

Available Formats

自然语言处理——蛋白序列

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

自然语言处理——蛋白序列

Uploaded by

Copyright:

Available Formats

Synthesis

Learning the protein language: Evolution, structure,

d They can be enriched with prior knowledge and inform

Bepler & Berger, 2021, Cell Systems 12, 654–669

*Correspondence: tbepler@nysbc.org (T.B.), bab@mit.edu (B.B.)

INTRODUCTION about evolutionary processes, but have become increasingly

Language models have recently emerged as a powerful para-

Cell Systems 12, 654–669, June 16, 2021 655

656 Cell Systems 12, 654–669, June 16, 2021

Cell Systems 12, 654–669, June 16, 2021 657

Figure 2. Diagram of model architectures and language modeling approaches

658 Cell Systems 12, 654–669, June 16, 2021

Cell Systems 12, 654–669, June 16, 2021 659

660 Cell Systems 12, 654–669, June 16, 2021

Figure 3. Our multi-task contextual embed-

Cell Systems 12, 654–669, June 16, 2021 661

Figure 4. Language models capture the semantic organization of proteins

(legend continued on next page)

662 Cell Systems 12, 654–669, June 16, 2021

Cell Systems 12, 654–669, June 16, 2021 663

Cell Systems 12, 654–669, June 16, 2021 665

666 Cell Systems 12, 654–669, June 16, 2021

Cell Systems 12, 654–669, June 16, 2021 667

668 Cell Systems 12, 654–669, June 16, 2021

Cell Systems 12, 654–669, June 16, 2021 669

Data and code availability

Bidirectional LSTM encoder with skip connections

Masked language modeling module

Residue-residue contact prediction module

Structure similarity prediction module

e1 Cell Systems 12, 654–669.e1–e3, June 16, 2021

Hyperparameters and training details

Protein structural similarity prediction evaluation

Cell Systems 12, 654–669.e1–e3, June 16, 2021 e2

Transmembrane region prediction training and evaluation

Sequence-to-phenotype prediction and evaluation

e3 Cell Systems 12, 654–669.e1–e3, June 16, 2021

You might also like