Synthesis
Learning the protein language:
Evolution, structure, and function
Tristan Bepler1,2,3,* and Bonnie Berger2,4,5,*
1Simons Machine Learning Center, New York Structural Biology Center, New York, NY, USA
2Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
3Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA, USA
4Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
5Lead contact
SUMMARY
Language models have recently emerged as a powerful machine-learning approach for distilling information
from massive protein sequence databases. From readily available sequence data alone, these models
discover evolutionary, structural, and functional organization across protein space. Using language models,
we can encode amino-acid sequences into distributed vector representations that capture their structural
and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss
recent advances in protein language modeling and their applications to downstream protein property predic-
tion problems. We then consider how these models can be enriched with prior biological knowledge and
introduce an approach for encoding protein structural knowledge into the learned representations. The
knowledge distilled by these models allows us to improve downstream function prediction through transfer
learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to
approach protein and therapeutic design. However, further developments are needed to encode strong bio-
logical priors into protein language models and to increase their accessibility to the broader community.
Box 1. Glossary
1-hot [embedding]. Vector representation of a discrete variable commonly used for discrete values that have no meaningful
ordering. Each token is transformed into a V-dimensional zero vector, where V is the size of the vocabulary (the number of unique
tokens, e.g., 20, 21, or 26 for amino acids depending on inclusion of missing and non-canonical amino acid tokens), except for the
index representing the token, which is set to one.
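For illustration, a minimal sketch of 1-hot encoding in Python; the 20-letter canonical alphabet and the absence of extra tokens are simplifying assumptions for this example.

```python
import numpy as np

# Canonical 20 amino acids; real vocabularies may add tokens for unknown or
# non-canonical residues and, for masked language models, an auxiliary mask token.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(sequence: str) -> np.ndarray:
    """Encode an amino acid sequence as an (L, V) matrix of 1-hot vectors."""
    x = np.zeros((len(sequence), len(ALPHABET)), dtype=np.float32)
    for i, aa in enumerate(sequence):
        x[i, INDEX[aa]] = 1.0
    return x

print(one_hot("MKV").shape)  # (3, 20)
```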
autoregressive [language model]. Language models that factorize the probability of a sequence into a product of conditional
probabilities in which the probability of each token is conditioned on the preceding tokens,
$p(x_1 \ldots x_L) = \prod_{i=1}^{L} p(x_i \mid x_1 \ldots x_{i-1})$. Examples of autoregressive language models include k-mer (AKA n-gram) models, Hidden Markov Models, and typical autoregressive recurrent neural network or generative transformer language models. These models are called autoregressive because they model the probability of one token after another in order.
Bayesian methods. A statistical inference approach that uses Bayes' rule to infer a posterior distribution over model parameters
given the observed data. Because these methods describe distributions over parameters or functions, they are especially useful
in small data regimes or other settings where prediction uncertainties are desirable.
cloze task. A task in natural language processing, also known as the cloze test. The task is to fill in missing words given the
context. For example, ‘‘The quick brown ____ jumps over the lazy dog.’’
conditional random field. Models the probability of a set (a sequence in this case, i.e., a linear chain CRF) of labels given a set of input
variables by factorizing it into locally conditioned potentials conditioned on the input variables,
$p(y_1 \ldots y_L \mid x_1 \ldots x_L) = p(y_1 \mid x_1 \ldots x_L) \prod_{i=2}^{L} p(y_i \mid y_{i-1}, x_1 \ldots x_L)$.
This is often simplified such that each conditional only depends on the local input variable, i.e.,
$p(y_1 \ldots y_L \mid x_1 \ldots x_L) = p(y_1 \mid x_1) \prod_{i=2}^{L} p(y_i \mid y_{i-1}, x_i)$. Linear chain CRFs can be seen as the discriminative version of Hidden Markov Models.
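For illustration, a minimal sketch of the simplified factorization above, assuming the local conditionals are provided as probability tables; all inputs here are hypothetical placeholders.

```python
import numpy as np

def sequence_log_prob(y, init_probs, trans_probs):
    """log p(y_1 ... y_L | x_1 ... x_L) under the simplified linear-chain factorization.

    init_probs:  length-K array giving p(y_1 = k | x_1).
    trans_probs: list of (K, K) arrays for positions i = 2 ... L, where
                 trans_probs[i-2][j, k] = p(y_i = k | y_{i-1} = j, x_i),
                 already conditioned on the local input variable x_i.
    """
    logp = np.log(init_probs[y[0]])
    for i in range(1, len(y)):
        logp += np.log(trans_probs[i - 1][y[i - 1], y[i]])
    return logp
```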
contextual vector embedding. Vector embeddings that include information about the sequence context in which a token occurs.
Encoding context into vector embeddings is important in NLP, because words can have different meanings in different contexts
(i.e. many homonyms exist). For example, in the sentences, ‘‘she tied the ribbon into a bow’’ and ‘‘she drew back the string on her
bow,’’ the word bow refers to two different objects that can only be inferred from context. In the case of proteins, this problem is
even worse, because there are only 20 (canonical) amino acids and so their ‘‘meaning’’ is highly context dependent. This is in
contrast to typical vector embedding methods that learn a single vector embedding per token regardless of context.
distributional hypothesis. The observation that words that occur in similar contexts tend to have similar meanings. Applies also to
proteins due to evolutionary pressure (Harris, 1954).
Gaussian process. A class of models that describes distributions over functions conditioned on observations from those func-
tions. Gaussian processes model outputs as being jointly normally distributed where the covariance between the outputs is a func-
tion of the input features. See Rasmussen and Williams for a comprehensive overview (Rasmussen and Williams, 2005).
generative model. A model of the data distribution, $p(X)$, joint data distribution, $p(X, Y)$, or conditional data distribution, $p(X \mid Y = y)$. Usually framed in contrast to discriminative models that model the probability of the target given an observation, $p(Y \mid X = x)$.
Here, X is observable, for example the protein sequence, and Y is a target that is not observed, for example the protein structure or
function. Conditional generative and discriminative models are related by Bayes' theorem. Language models are generative
models.
hidden layer. Intermediate vector representations in a deep neural network. Deep neural networks are structured as layered data
transformations before outputting a final prediction. The intermediate layers are referred to as ‘‘hidden’’ layers.
inductive bias. Describes the assumptions that a model uses to make predictions for data points it has not seen (Mitchell, 1980).
That is, the inductive bias of a model is how that model generalizes to new data. Every machine learning model has inductive
biases, implicitly or explicitly. For example, protein phenotype prediction based on homology assumes that phenotypes covary
over evolutionary relatedness. In other words, it formally models the idea that proteins that are more evolutionarily related are likely
to share the same function. In thinking about deep neural networks applied to proteins, it is important to understand the inductive
biases these models assume, because it naturally relates to the true properties of the function we are trying to model. However, this
is challenging, because we can only roughly describe the inductive biases of these models (Battaglia, Hamrick and Bapst, 2018).
language model. Probabilistic model of whole sequences. In the case of natural language, language models typically describe the
probability of sentences or documents. In the case of proteins, they model the probability of amino acid sequences. Being simply
probabilistic models, language models can take on many specific incarnations from column frequencies in multiple sequence
alignments to Hidden Markov Models to Potts models (direct coupling analysis) to deep neural networks.
manifold embedding. A distance preserving, low dimensional embedding of the data. The goal of manifold embedding is to find
low dimensional vectors, $z_1 \ldots z_n$, such that the distances, $d(z_i, z_j)$, are as close as possible to the distances in the original
data space, $d(x_i, x_j)$, given n high dimensional data vectors, $x_1 \ldots x_n$. t-SNE is a commonly used manifold embedding approach for
visualization of high dimensional data.
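For example, a typical visualization workflow (sketched here with scikit-learn on placeholder data) reduces high dimensional protein embeddings to two dimensions with t-SNE.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder: n proteins, each summarized as a D-dimensional vector embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))

# t-SNE approximately preserves local neighborhood structure in 2D.
Z = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(Z.shape)  # (500, 2)
```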
masked language model. The training task used by BERT and other recent bidirectional language models. Instead of modeling
the probability of a sequence autoregressively, masked language models seek to model the probability of each token given all
other tokens. For computational convenience, this is achieved by randomly masking some percentage of the tokens in each
minibatch and training the model to recover those tokens. An auxiliary token is added to the vocabulary to indicate that this token
has been masked.
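For illustration, a minimal sketch of the masking step; the 15% masking rate and the single mask token are illustrative assumptions, not the exact recipe of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, mask_index, p=0.15):
    """Randomly replace a fraction p of tokens with an auxiliary mask token.

    Returns the corrupted input and the boolean positions the model must recover.
    """
    tokens = np.asarray(tokens)
    positions = rng.random(len(tokens)) < p
    corrupted = tokens.copy()
    corrupted[positions] = mask_index
    return corrupted, positions

# Integer-encoded sequence; index 20 is assumed to be the added mask token.
x = np.array([12, 3, 7, 0, 19, 5, 5, 2])
corrupted, target_positions = mask_tokens(x, mask_index=20)
```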
multi-task learning. A machine learning paradigm in which multiple tasks are learned simultaneously. The idea is that similarities
between tasks can lead to each task being learned better in combination rather than learning each individually. In the case of rep-
resentation learning, multi-task learning can also be useful for learning representations that encode information relevant for all
tasks. Multi-task learning allows us to use the information encoded in other training signals as an inductive bias when learning the
goal task.
representation learning. The problem of learning features, or intermediate data representations, better suited for solving a pre-
diction problem on raw data. Deep learning systems are described as representation learning systems, because they learn a series
of data transformations that make the goal task progressively easier to solve before outputting a prediction.
residue-residue contact prediction. The task of learning which amino acid residues are in contact in folded protein structures,
where contact is assumed to be within a small number of angstroms, often with the goal of constraining the search space for pro-
tein structure prediction.
self-supervised learning. A relatively new term for methods for learning from data without labels. Generally used to describe
methods that ‘‘automatically’’ create labels through data augmentation or generative modeling. Can be viewed as a subset of un-
supervised learning focused on learning representations useful for transfer learning.
semantic priors. Prior semantic understanding of a word or token, e.g., protein structure or function.
semantics. The meaning of a word or token. In reference to proteins, we use semantics to mean the ‘‘functional’’ purpose of a
residue, or combinations of residues.
structural classification of proteins (SCOP). A mostly manual curation of structural domains based on similarities of their se-
quences and structure. Similar databases include CATH (Sillitoe et al., 2021).
structural similarity prediction. Given two protein sequences, predict how similar their respective structures would be according
to some similarity measure.
supervised learning. A problem in machine learning. How we can learn a function to predict a target variable, usually denoted y,
given an observed one, usually denoted x, from a set of known x, y pairs.
transfer learning. A problem in machine learning. How we can take knowledge learned from one task and apply it to solve another
related task. When the tasks are different but related, representations learned on one task can be applied to the other. For example,
representations learned from recognizing dogs could be transferred to recognizing cats. In the case of proteins and language
models, we are interested in applying knowledge gained from learning to generate sequences to predicting function. Transfer
learning can also be used to apply representations learned from predicting structure to predicting function, or from predicting one function to another, among other applications.
unsupervised learning. A problem in machine learning that asks how we can learn patterns from unlabeled data. Clustering is a
classic unsupervised learning problem. Unsupervised learning is often formulated as a generative modeling problem, where we
view the data as being generated from some unobserved latent variable(s) that we infer jointly with the parameters of the model.
vector embedding. A term used to describe multidimensional real numbered representations of data that is usually discrete or
high dimensional, word embeddings being a classic example. Sometimes referred to as ‘‘distributed vector embeddings’’ or
‘‘manifold embeddings’’ or simply just ‘‘embeddings.’’ Low-dimensional vector representations of high dimensional data such
as images or gene expression vectors as found by methods such as t-SNE are also vector embeddings. Usually, the goal in
learning vector embeddings is to capture some semantic similarity between data as a function of similarity or distance in the vector
embedding space.
can be used to improve protein function prediction. Finally, we will discuss future directions in protein machine learning and large-scale language modeling.

Protein language models distill information from massive protein sequence databases

Language models for protein sequence representation learning (Figure 2) have seen a surge of interest following the success of large-scale models in the field of natural language processing (NLP). These models draw on the idea that distributed vector representations of proteins can be extracted from generative models of protein sequences, learned from a large and diverse database of sequences across natural protein space, and thus can capture the semantics, or function, of a given sequence. Here, function refers to any and all properties related to what a protein does. These properties are often subject to evolutionary pressures because these functions must be maintained or enhanced in order for an organism to survive and reproduce. These pressures manifest in the distribution over amino acids present in natural protein sequences and, hence, are discoverable from large and diverse enough sets of naturally occurring sequences.

The ability to learn semantics emerges from the distributional hypothesis: tokens (e.g., words, amino acids) that occur in similar contexts tend to carry similar meanings. Language models only require sequences to be observed and are trained to model the probability distribution over amino acids using an autoregressive formulation (Figures 2A and 2B) or masked position prediction formulation (also called a cloze task in NLP, Figure 2C). In autoregressive language models, the probability of a sequence is factorized such that the probability of each token is conditioned only on the preceding tokens. This factorization is exact and is useful when sampling from the distribution or evaluating the probabilities themselves is of primary interest. The drawback to this formulation is that the representations learned for each position depend only on preceding positions, potentially making them less useful as contextual representations. The masked position prediction formulation (also known as masked language modeling) addresses this problem by considering the probability distribution over each token at each position conditioned on all other tokens in the sequence. The masked language modeling approach does not allow calculating correctly normalized probabilities of whole sequences but is more appropriate when the learned representations are the outcomes of primary interest.
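To make the two formulations concrete, a minimal PyTorch-style sketch of the corresponding training losses is shown below; `model` is a hypothetical network returning per-position logits over the amino acid vocabulary, and the masking rate is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(model, x):
    """Next-token prediction: p(x_i | x_1 ... x_{i-1}) at every position."""
    logits = model(x[:, :-1])              # (batch, L-1, vocab); causal model assumed
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), x[:, 1:].reshape(-1))

def masked_lm_loss(model, x, mask_index, p=0.15):
    """Cloze-style objective: predict randomly masked tokens from the full context."""
    corrupted = x.clone()
    positions = torch.rand_like(x, dtype=torch.float) < p
    corrupted[positions] = mask_index
    logits = model(corrupted)              # (batch, L, vocab); bidirectional model assumed
    return F.cross_entropy(logits[positions], x[positions])
```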
The unprecedented recent success of language models in natural language processing, e.g., Google's BERT and OpenAI's GPT-3, is largely driven by their ability to learn from billions of text entries in enormous online corpora. Analogously, we have natural protein sequence databases with hundreds of millions of unique sequences that continue to grow rapidly.

Recent advances in NLP have been driven by innovations in neural network architectures, new training approaches, increasing compute power, and increasing accessibility of huge text corpora. Several NLP methods have been proposed that draw on unsupervised, now often called self-supervised, learning (Devlin et al., 2018; Peters et al., 2018) to fit large-scale bidirectional long short-term memory recurrent neural networks (bidirectional LSTMs or biLSTMs) (Hochreiter and Schmidhuber, 1997; Graves, Fernández and Schmidhuber, 2005) or Transformers (Vaswani et al., 2017) and their recent variants. LSTMs are recurrent neural networks. These models process sequences one token at a time in order and therefore learn representations that capture information from a position and all previous positions. In order to include information from tokens before and after any given position, bidirectional LSTMs combine two separate LSTMs operating in the forward and backward directions in each layer (e.g., as in Figure 2B). Although these models can learn representations including whole sequence context, their ability to learn distant dependencies is limited in practice. To address this limitation, transformers learn representations by explicitly calculating an attention vector over each position in the sequence. In the self-attention mechanism, the representation for each position is learned by "attending to" each position of the same sequence, well suited for masked language modeling (Figure 2C). In a self-attention module, the output representation of each element of a sequence is calculated as a weighted sum over transformations of the input representations at each position, where the weighting itself is based on a learned transformation of the inputs. The attention mechanism is typically believed to allow transformers to learn dependencies between positions distant in the linear sequence more easily. Transformers are also useful as autoregressive language models.
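A minimal numpy sketch of the single-head self-attention weighted sum described above; randomly initialized matrices stand in for the learned projections.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (L, D) input representations; returns (L, D_v) contextual representations.

    Each output position is a weighted sum over value projections of all positions,
    with weights given by a softmax over query-key similarity.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (L, L) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V

rng = np.random.default_rng(0)
L, D = 8, 16
X = rng.normal(size=(L, D))
out = self_attention(X, *(rng.normal(size=(D, D)) for _ in range(3)))
```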
In natural language processing, Peters et al. recognized that the hidden layers (intermediate representations of stacked neural networks) of biLSTMs encoded semantic meaning of words in context. This observation has been newly leveraged for biological sequence analysis (Alley et al., 2019; Bepler and Berger, 2019) to learn more semantically meaningful sequence representations. The success of deep transformers for machine translation inspired their application to contextual text embedding, that is, learning contextual vector embeddings of words and sentences, giving rise to the now widely used Bidirectional Encoder Representations from Transformers (BERT) model in NLP (Devlin et al., 2018). BERT is a deep transformer trained as a masked language model on a large text corpus. As a result, it learns contextual representations of text that capture contextual meaning and improve the accuracy of downstream NLP systems. Transformers have also demonstrated impressive performance as autoregressive language models, for example with the Generative Pre-trained Transformer (GPT) family of models (Radford et al., 2018, 2019; Brown et al., 2020), which have made impressive strides in natural language generation. These works have inspired subsequent applications to protein sequences (Rao et al., 2019; Rives et al., 2019; Elnaggar et al., 2020; Vig et al., 2020).

Although transformers are powerful models, they require enormous numbers of parameters and train more slowly than typical recurrent neural networks. With massive scale datasets and compute and time budgets, transformers can achieve impressive results, but, generally, recurrent neural networks (e.g., biLSTMs) need less training data and less compute, so they might be more suitable for problems where fewer sequences are available, such as training on individual protein families, or where compute budgets are tight. Constructing language models that achieve high accuracy with better compute efficiency is an algorithmic challenge for the field. An advantage of general purpose pre-trained protein models is that we only need to do the expensive training step once; the models can then be used to make predictions or can be applied to new problems via transfer learning (Bengio, 2012), as discussed below.

Using these and other tools, protein language models are able to synthesize the enormous quantity of known protein sequences by training on hundreds of millions of sequences stored in protein databases (e.g., UniProt, Pfam, NCBI (Bateman et al., 2004; Pruitt, Tatusova and Maglott, 2007; UniProt Consortium, 2019)). The distribution over sequences learned by language models captures the evolutionary fitness landscape of known proteins. When trained on tens of thousands of evolutionarily related proteins, the learned probability mass function describing the empirical distribution over naturally occurring sequences has shown promise for predicting the fitness of sequence variants (Riesselman, Ingraham and Marks, 2018; Hie et al., 2020a, 2021). Because these models learn from evolutionary data directly, they can make accurate predictions about protein function when function is reflected in the fitness of natural sequences. Riesselman et al. first demonstrated that language models fit on individual protein families are surprisingly accurate predictors of variant fitness measured in deep mutational scanning datasets (Riesselman, Ingraham and Marks, 2018). New work has since shown that the representations learned by language models are also powerful features for learning of variant fitness as a subsequent supervised learning task (Rives et al., 2019; Luo et al., 2020), building on earlier observations that language models can improve protein property prediction through transfer learning (Bepler and Berger, 2019). Recently, Hie et al. used language models to learn evolutionary fitness of viral envelope proteins and were able to predict mutations that could allow the SARS-CoV-2 spike protein to escape neutralizing antibodies (Hie et al., 2020a, 2021). As of publication, several variants predicted to have high escape potential have appeared in SARS-CoV-2 sequencing efforts around the world, but viral escape has not yet been experimentally verified (Walensky et al., 2021).

A few recent works have focused on increasing the scale of these models by adding more parameters and more learnable layers to improve sequence modeling. Interestingly, because so many sequences are available, these models continue to benefit from increased size (Rives et al., 2019). This parallels the general trend in natural language processing, where the number of parameters, rather than specific architectural choices, is the best indicator of model performance (Kaplan et al., 2020). However, ultimately, model size is limited by the computational resources available to train and apply these models. In NLP, models such as BERT and GPT-3 have become so large that only the best funded organizations with massive Graphics Processing Unit (GPU) compute clusters are realistically able to train and deploy them. This is demonstrated in some recent work on protein models where single transformer-based models were trained for days to weeks on hundreds of GPUs (Rives et al., 2019; Elnaggar et al., 2020; Vig et al., 2020), costing potentially hundreds of thousands of dollars for training. Increasing the scale of these models promises to continue to improve our ability to model proteins, but more resource efficient algorithms are needed to make these models more accessible to the broader scientific community.

So far, the language models we have discussed use natural protein sequence information. However, they do not learn from the protein structure and function knowledge that has been accumulated over the past decades of protein research. Incorporating such knowledge requires supervised approaches.

Supervision encodes biological meaning

Proteins are more than sequences of characters: they are physical chains of amino acids that fold into three-dimensional structures and carry out functions based on those structures. The sequence-structure-function relationship is the central pillar of protein biology, and significant time and effort has been spent to elucidate this relationship for select proteins of interest. In particular, the increasing throughput and ease-of-use of protein structure determination methods (e.g., X-ray crystallography and cryo-EM (Cheng et al., 2015; Callaway, 2020)) has driven a rapid increase in the number of known protein structures available in databases such as the Protein Data Bank (PDB) (Berman et al., 2000). There are nearly 175,000 entries in the PDB as of publication and this number is growing rapidly: 14,000 new structures were deposited in 2020, and the rate of new structure deposition is increasing. We pursue the intuition that incorporating such knowledge into our models via supervised learning can aid in predicting function from sequence, bypassing the need for solved structures.

Supervised learning is the problem of finding a mathematical function to predict a target variable given some observed variables. In the case of proteins, supervised learning is commonly used to predict protein structure from sequence, protein function from sequence, or for other sequence annotation problems (e.g., signal peptide or transmembrane region annotation). Beyond making predictions, supervised learning can be used to encode specific semantics into learned representations. This is common in computer vision where, for example, pre-training image recognition models on the large ImageNet dataset is used to prime the model with information from natural image categories (Russakovsky et al., 2015).

When we use supervised approaches, we encode semantic priors into our models. These priors are important for learning relationships that are not obvious from the raw data. For example, unrelated protein sequences can form the same structural fold and, therefore, are semantically similar. However, we cannot deduce this relationship from sequences alone. Supervision is required to learn that these sequences belong to the same semantic category. Although structure is more informative of function than sequence (Zhang and Kim, 2003; Shin et al., 2007) and structure is encoded by sequence, predicting structure remains hard, particularly due to the relative paucity of structural data relative to sequence data. Significant strides have been made recently with massive computing resources (Jumper et al., 2020); yet there is still a long way to go before a complete sequence to structure mapping is possible. The degree to which such a map could or should be possible, even in principle, is unclear. Evolutionary relationships between sequences are informative of structural and functional relationships, but only when the degree of sequence homology is sufficiently high. Above 30% sequence identity, structure and function are usually conserved between natural proteins (Rost, 1999). Often called the "twilight zone" of protein sequence homology, proteins with similar structures and functions still exist below this level, but they can no longer be detected from sequence similarity alone and it is unclear whether their functions are conserved. Although it is generally believed that proteins with similar sequences form similar structures, there are also interesting examples of highly similar protein sequences having radically different structures and functions (Kosloff and Kolodny, 2008; Wei et al., 2020) and of sequences that can form multiple folds (James and Tawfik, 2003). Evolutionary innovation requires that protein function can change with only a few mutations. Furthermore, it is important to note that although structure and function are related, they should not be directly conflated.

These phenomena suggest that there are aspects of protein biology that may not be discoverable by statistical sequence models alone. Supervision that represents known protein structure, function, and other prior knowledge may be necessary to encode distant sequence relationships into learned embeddings. By analogy, cars and boats are both means of transportation, but we would not expect a generative image model to infer this relationship from still images alone. However, we can teach these relationships through supervision.

On this premise, we hypothesize that incorporating structural supervision when training protein language models will improve the ability to predict function in downstream tasks through transfer learning. Eventually, such language models may become powerful enough that we can predict function directly without the need for solved structures. In the remainder of this Synthesis, we will explore this idea.

Multi-task language models capture the semantic organization of proteins

Here, we will demonstrate that training protein language models with self-supervision on a large amount of natural sequence data and with structure supervision on a smaller set of sequence, structure pairs enriches the learned representations and translates into improvements in downstream prediction problems (Figure 3). First, we generate a dataset that contains 76 million protein sequences from Uniref (Suzek et al., 2007) and an additional 28,000 protein sequences with structures from the Structural Classification of Proteins (SCOP) database, which classifies protein sequences into a hierarchy of structural motifs based on their sequence and structural similarities (e.g., family, superfamily, class) (Fox, Brenner and Chandonia, 2014; Chandonia, Fox and Brenner, 2017). Next, we train a bidirectional LSTM with three learning tasks simultaneously: 1) the masked language modeling task (Figures 2C and 3A), 2) residue-residue contact prediction (Figure 3B), and 3) structural similarity prediction (Figure 3C).

The fundamental idea behind this novel training scheme is to combine self-supervised and supervised learning approaches to overcome the shortcomings of each. Specifically, the masked language modeling objective (self-supervision) allows us to learn from millions of natural protein sequences from the UniProt database. However, this does not include any prior semantic knowledge from protein structure and, therefore, has difficulty learning semantic similarity between divergent sequences. To address this, we consider two structural supervision tasks, residue-residue contact prediction and structural similarity prediction, trained with tens of thousands of protein structures classified by SCOP. In the residue-residue contact prediction task, we use the hidden layers of the language model to predict contacts between residues within the 3D structure using a learned bilinear projection layer (Figure 3B). In the structural similarity prediction task, we use the hidden layers of the language model to predict the number of shared structural levels in the SCOP hierarchy by aligning the proteins in vector embedding space and using this alignment score to predict structural similarity from the sequence embeddings. This task is critical for encoding structural relationships between unrelated sequences into the model. The parameters of the language model are shared across the self-supervised and two supervised tasks and the entire model is trained end-to-end. The set of proteins with known structure is much smaller than the full set of known proteins in UniProt and, therefore, by combining these tasks in a multi-task learning approach we can learn language models and sequence representations that are enriched with strong biological priors from known protein structures. We refer to this model as the multi-task (MT)-LSTM.
task (MT)-LSTM. roughly by structure in embedding space. However, this organi-
Next, we demonstrate how the trained language model can be zation is improved when we include structure supervision in the
used for protein sequence analysis and compare this with con- language model training (Figure 4B).
ventional approaches. Given the trained MT-LSTM, we apply it The semantic organization of our learned embedding space
to new protein sequences to embed them into the learned se- enables a direct application: we can search protein sequence
mantic representation space (Figure 4A). Sequences are fed databases for semantically related proteins by comparing pro-
through the model and the hidden layer vectors are combined teins based on their vector embeddings (Bepler and Berger,
to form vector embeddings of each position of the sequences. 2019). Because we embed sequences into a semantic represen-
Given a sequence of length L, this yields L D-dimensional vec- tation space, we can find structurally related proteins even
tors, where D is the dimension of the vector embeddings. This al- though their sequences are not closely related (Figure 4C, Table
lows us to map the semantic space of each residue within a S1). To demonstrate this, we take pairs of proteins in the SCOP
sequence, but we can also map the semantic space of whole se- database, not seen by our multi-task model during training, and
quences by summarizing them into fixed size vector embed- calculate the similarity between these pairs of sequences using
dings via a reduction operation. Practically, this is useful for direct sequence homology-based methods (Needleman-
coarse sequence comparisons including clustering and manifold Wunsch alignment, HMM-sequence alignment, and HMM-
embedding for visualization of large protein datasets, revealing HMM alignment (Needleman and Wunsch, 1970; Eddy, 2011;
evolutionary, structural, and functional relationships between se- Remmert et al., 2011b)), a popular structure-based method
quences in the dataset (Figure 4B). In this figure, we visualize (TMalign (Zhang and Skolnick, 2005)), and an alignment between
proteins in the SCOP dataset, colored by structural class, after the sequences in our learned embedding space. We then eval-
embedding with our MT-LSTM. For comparison, we also show uate these methods based on their ability to correctly find pairs
results of embedding using a bidirectional LSTM trained only of proteins that are similar at the class, fold, superfamily, and
family level, based on their SCOP classification. We find that our find that increasing model size improves transfer learning
learned semantic embeddings dramatically outperform the performance.
sequence comparison methods and even outperform structure Here, we demonstrate two use cases where transfer learning
comparison with TMalign when predicting structural similarity. from our MT-LSTM improves performance on downstream
Interestingly, we observe that the structural supervision compo- tasks. First, we consider the problem of transmembrane predic-
nent is critical for learning well organized embeddings at a fine- tion. This is a sequence labeling task in which we are provided
grained level, because the DLM-LSTM representations alone do with the amino acid sequence of a protein and wish to decode,
not perform well at this task (Table S1). Furthermore, the multi- for each position of the protein, whether that position is in a
task learning approach outperforms a two-step learning transmembrane (i.e., membrane spanning) region of the protein
approach presented previously (SSA-LSTM) (Bepler and Berger, or not. This problem is complicated by the presence of signal
2019). peptides, which are often confused as transmembrane regions.
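The basic embedding operations described above can be sketched as follows; mean pooling and cosine similarity are illustrative choices, whereas the alignment-based comparison used in this work is more involved.

```python
import numpy as np

def pool(per_residue_embedding):
    """Reduce an (L, D) matrix of per-residue vectors to a single D-dimensional vector."""
    return per_residue_embedding.mean(axis=0)

def search(query_vec, database_vecs, top_k=5):
    """Rank database proteins by cosine similarity to the query in embedding space."""
    q = query_vec / np.linalg.norm(query_vec)
    db = database_vecs / np.linalg.norm(database_vecs, axis=1, keepdims=True)
    scores = db @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

rng = np.random.default_rng(0)
query = pool(rng.normal(size=(120, 100)))   # placeholder per-residue embedding of one protein
database = rng.normal(size=(1000, 100))     # placeholder pooled embeddings of database proteins
hits, sims = search(query, database)
```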
With the success of our self-supervised and supervised language models, we sought to investigate whether protein language models could improve function prediction through transfer learning.

Transfer learning improves downstream applications

A key challenge in biology is that many problems are small data problems. Quantitative protein characterization assays are rarely high throughput and methods are needed that can generalize given only tens to hundreds of experimental measurements. Furthermore, we are often interested in extrapolating from data collected over a small region of protein sequence space to other sequences, often with little to no homology. Learned protein representations improve predictive ability for downstream prediction problems through transfer learning (Figure 5A). Transfer learning is the problem of applying knowledge learned from solving some prior tasks to a different task of interest. In other words, learning to solve task A can help learn to solve task B; analogously, learning how to wax cars helps to learn karate moves (Karate Kid, 1984). This is especially useful for tasks with little available training data, such as protein function prediction, because models can be pre-trained on other tasks with plentiful training data to improve performance through transfer learning.

Application of protein language models to downstream tasks through transfer learning was first demonstrated by Bepler and Berger (2019). They showed that transfer learning was useful for structural similarity prediction, secondary structure prediction, residue-residue contact prediction, and transmembrane region prediction, by fitting task specific models on top of a pre-trained bidirectional language model. The key insight was that the sequence representations (vector embeddings) learned by the language model were powerful features for solving other prediction problems. Since then, various language model-based protein embedding methods have been applied to these and other protein prediction problems through transfer learning, including protein phenotype prediction (Alley et al., 2019; Rao et al., 2019; Rives et al., 2019; Luo et al., 2020), residue-residue contact prediction (Rives et al., 2019; Rao et al., 2020), fold recognition (Rao et al., 2019), protein-protein (Zhou et al., 2020; Sledzieski et al., 2021) and protein-drug interaction prediction (Hie et al., 2020b; Truong and Truong, 2020). Recent works have shown that increasing language model scale leads to continued improvements in downstream applications, such as residue-residue contact prediction (Rao et al., 2020). We also find that increasing model size improves transfer learning performance.

Here, we demonstrate two use cases where transfer learning from our MT-LSTM improves performance on downstream tasks. First, we consider the problem of transmembrane prediction. This is a sequence labeling task in which we are provided with the amino acid sequence of a protein and wish to decode, for each position of the protein, whether that position is in a transmembrane (i.e., membrane spanning) region of the protein or not. This problem is complicated by the presence of signal peptides, which are often confused with transmembrane regions. In order to compare different sequence representations for this problem, we train a small one layer bidirectional LSTM with a conditional random field (BiLSTM+CRF) decoder on a well-defined transmembrane protein benchmark dataset (Tsirigos et al., 2015a). Methods are compared by 10-fold cross validation. We find that the BiLSTM+CRFs with our new embeddings (DLM-LSTM and MT-LSTM) outperform existing transmembrane predictors and a BiLSTM+CRF using our previous smaller embedding model (SSA-LSTM). Furthermore, representations learned by our MT-LSTM model significantly outperform (paired t test, p = 0.044) the embeddings learned by our DLM-LSTM model on this application (Figure 5B).

Second, we demonstrate that we are able to accurately predict functional implications of small changes in protein sequence through transfer learning. An ideal model would be sensitive down to the single amino acid level and would group mutations with similar functional outcomes closely in semantic space. Recently, Luo et al. presented a method for combining language model-based representations with local evolutionary context-based representations (ECNet) and demonstrated that these representations were powerful for sequence-to-phenotype mapping on a panel of deep mutational scanning datasets (Luo et al., 2020). In this problem, we observe a relatively small set (hundreds to thousands) of sequence-phenotype measurement pairs and our goal is to predict phenotypes for unmeasured variants. Observing that these are small data problems, we reasoned that this is an ideal setting for Bayesian methods and that transfer learning will be important for achieving good performance. To this end, we propose a framework in which sequence variants are first embedded using our MT-LSTM and then phenotype predictions are made using Gaussian process (GP) regression using our embeddings as features. We find that we can predict the phenotypes of unobserved sequence variants across datasets better than existing methods (Figure 5C). Our MT-LSTM embedding powered GP achieves an average Spearman correlation of 0.65 with the measured phenotypes, significantly outperforming (paired t test, p = 0.006) the next best method, ECNet, which reaches 0.60 average Spearman correlation.
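A minimal sketch of this kind of pipeline with scikit-learn is shown below; the embeddings and phenotype values are placeholders, and the kernel choice is an assumption rather than the exact setup described in the STAR Methods.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))   # placeholder: language-model embeddings of 300 variants
y = rng.normal(size=300)          # placeholder: measured phenotypes (e.g., fitness scores)

# GP regression provides both predictions and uncertainty estimates, which is useful
# in the small-data regime typical of deep mutational scanning datasets.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X[:200], y[:200])
mean, std = gp.predict(X[200:], return_std=True)
```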
Semi-supervised learning (van Engelen and Hoos, 2020), few-shot learning (Wang et al., 2020a), meta-learning (Vanschoren, 2018; Hospedales et al., 2020), and other methods for rapid adaptation to new problems and domains will be key future developments for pushing the limit of data efficient learning.
(Figure 4 legend, continued) Furthermore, our entirely sequence-based method even outperforms structural comparison with TMalign when predicting structural similarity in the SCOP database. In addition, we contrast our end-to-end MT-LSTM model with an earlier two-step language model (SSA-LSTM) and find that training end-to-end in a unified multi-task framework improves structural similarity classification.
Figure 5. Protein language models with transfer learning improve function prediction
(A) Transfer learning is the problem of applying knowledge gained from learning to solve some task, A, to another related task, B. For example, applying knowledge from recognizing dogs to recognizing cats. Usually, transfer learning is used to improve performance on tasks with little available data by transferring knowledge from other tasks with large amounts of available data. In the case of proteins, we are interested in applying knowledge from evolutionary sequence modeling and structure modeling to protein function prediction tasks.
(B) Transfer learning improves transmembrane prediction. Our transmembrane prediction model consists of two components. First, the protein sequence is embedded using our pre-trained language model (MT-LSTM) by taking the hidden layers of the language model at each position. Then, these representations are fed into a small single layer bidirectional LSTM (BiLSTM) and the output of this is fed into a conditional random field (CRF) to predict the transmembrane label at each position. We evaluate the model by 10-fold cross validation on proteins split into four categories: transmembrane only (TM), signal peptide and transmembrane (TM+SP), globular only (Globular), and globular with signal peptide (Globular+SP). A protein is considered correctly predicted if 1) the presence or absence of a signal peptide is correctly predicted and 2) the number and locations of transmembrane regions are correctly predicted. The table reports the fraction of correctly predicted proteins in each category for our model (BiLSTM+CRF) and widely used transmembrane prediction methods. (legend continued below)
Methods that capture uncertainty (e.g., Gaussian processes and other Bayesian methods) will continue to be important, particularly for guiding experimental design. Some recent works have explored Gaussian process-based methods for guiding protein design with simple protein sequence representations (Romero, Krause and Arnold, 2013; Bedbrook et al., 2017; Yang et al., 2018). Hie et al. presented a GP-based method for guiding experimental drug design informed by deep protein embeddings (Hie et al., 2020b). Other works have explored combining neural network and GP models (Ding et al., 2019; Patacchiola et al., 2019); while still others considered non-GP-based uncertainty aware prediction methods for antibody design and major histocompatibility complex (MHC) peptide display prediction (Zeng and Gifford, 2019; Liu et al., 2020). Methods for combining multiple predictors and for incorporating strong priors into protein design can also help to alleviate problems that arise in the low data regime (Brookes, Park and Listgarten, 2019). Transfer learning and massive protein language models will play a key role in future protein property prediction and machine learning driven protein and drug design efforts.

Conclusions and perspectives: Strong biological priors are key to improving protein language models

Future developments in protein language modeling and representation learning will need to model properties that are unique to proteins. Biological sequences are not natural language, and we should develop new language models that capture the fundamental nature of biological sequences. While demonstrably useful, existing methods based on recurrent neural networks and Transformers still do not fundamentally encode key protein properties in the model architecture, and the inductive biases of these models are only roughly understood (Box 1).

Proteins are objects that exist in physical space. Similarly, we understand many of the fundamental evolutionary processes that give rise to the diversity of protein sequences observed today. These two elements, physics and evolution, are the key properties of proteins, and our models might benefit from being structured explicitly to incorporate evolutionary and physics-based inductive biases. Early attempts at capturing physical properties of proteins as part of machine learning models have already demonstrated that conditioning on structure improves generative models of sequence (Ingraham et al., 2019b), and significant work has been done in the opposite direction of machine learning-based structure prediction methods that explicitly incorporate constraints on protein geometries (Liu et al., 2018; AlQuraishi, 2019; Ingraham et al., 2019b; Xu, 2019; Jumper et al., 2020; Yang et al., 2020). However, new methods are needed to fuse these directions with physics-based approaches and to start to fully merge sequence- and structure-based models.

At the same time, current protein language models make heavily simplified phylogenetic assumptions. By treating each sequence as an independent draw from some prior distribution over sequences, current methods assume that all protein sequences arise independently in a star phylogeny. Conventionally, this problem is crudely addressed by filtering sequences based on percent identity. However, significant effort has been dedicated to understanding protein sequences as emerging from tree-structured evolutionary processes over time or coalescent processes in reverse time (Rosenberg and Nordborg, 2002; Nascimento, Reis and Yang, 2017). Methods for inferring these latent phylogenetic trees continue to be of substantial interest (Huelsenbeck and Ronquist, 2001; Lartillot, Lepage and Blanquart, 2009; Bouckaert et al., 2019), but are frustrated by long run times and poor scalability to large datasets. In the future, deep generative models of proteins might seek to merge these disciplines to model proteins as being generated from evolutionary processes other than star phylogenies.

Other practical considerations continue to frustrate our ability to develop new protein language models and rapidly iterate on experiments. High compute costs and murky design guidelines mean that developing new models is often an expensive, time consuming, and ad hoc process. It is not clear at what dataset sizes and levels of sequence diversity one model will outperform another or how many parameters a model should include. At the upper limit of large natural protein databases, larger models continue to yield improved performance. However, for individual protein families or other application specific protein datasets, the gold standard is to select model architectures and number of parameters via brute force hyperparameter search methods. Fine-tuning pre-trained models can help with this problem but does not fully resolve it. Sequence length also remains a challenge for these models. Transformers scale quadratically with sequence length, which means that in practical implementations long sequences need to either be excluded or truncated. New linear complexity attention mechanisms may help to alleviate this limitation (Choromanski et al., 2020; Wang et al., 2020b). This problem is less extreme for recurrent neural networks, which scale linearly with sequence length, but very long sequences are still impractical for RNNs to handle and long-range sequence dependencies are unlikely to be learned well by these models.

Language models capture complex relationships between residues in protein sequences by condensing information from enormous protein sequence databases. They are a powerful new development for understanding and making predictions about biological sequences. Increasing model size, compute power, and dataset size will only continue to improve performance of protein language models.
(Figure 5 legend, continued) A BiLSTM+CRF model trained using 1-hot embeddings of the protein sequence instead of our language model representations performs poorly, highlighting the importance of transfer learning for this task (Table S2).
(C) Transfer learning improves sequence-to-phenotype prediction. Deep mutational scanning measures function for thousands of protein sequence variants. We consider 19 mutational scanning datasets spanning a variety of proteins and phenotypes. For each dataset, we learn the sequence-to-phenotype mapping by fitting a Gaussian process regression model on top of representations given by our pre-trained language model. We compare three unsupervised approaches (+), prior works in supervised learning (o), and our Gaussian process regression approaches with (GP (MT-LSTM)) and without (GP (1-hot)) transfer learning by 5-fold cross validation. Spearman rank correlation coefficients between predicted and ground truth functional measurements are plotted. Our GP with transfer learning outperforms all other methods, having an average correlation of 0.65 across datasets. The benefits of transfer learning are highlighted by the improvement over the 1-hot representations, which only reach 0.57 average correlation across datasets. Transfer learning improves performance on 18 out of 19 datasets.
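The evaluation protocol described in panel (C) can be sketched as follows, using placeholder data and a simple stand-in predictor; the actual models compared are those described in the text.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge   # stand-in predictor for illustration

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))           # placeholder variant embeddings
y = rng.normal(size=500)                 # placeholder measured phenotypes

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    rho, _ = spearmanr(model.predict(X[test_idx]), y[test_idx])
    scores.append(rho)
print(np.mean(scores))  # average Spearman rank correlation across folds
```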
Already, these methods are transforming computational protein biology today due to their ease of use and widespread applicability. Furthermore, augmenting language models with protein specific properties such as structure and function offers one already successful route toward even richer representations and novel biology. However, it remains unclear how best to encode prior biological knowledge into the inductive bias of these models. We hope this Synthesis propels the community to work toward developing purpose-built protein language models with natural inductive biases suited for the physical nature of proteins and how they evolve.

STAR+METHODS

Detailed methods are provided in the online version of this paper and include the following:

- RESOURCE AVAILABILITY
  - Lead contact
  - Materials availability
  - Data and code availability
- METHOD DETAILS
  - Bidirectional LSTM encoder with skip connections
  - Masked language modeling module
  - Residue-residue contact prediction module
  - Structure similarity prediction module
  - Multi-task loss
  - Training datasets
  - Hyperparameters and training details
  - Protein structural similarity prediction evaluation
  - Transmembrane region prediction training and evaluation
  - Sequence-to-phenotype prediction and evaluation

SUPPLEMENTAL INFORMATION

Supplemental information can be found online at https://doi.org/10.1016/j.cels.2021.05.017.

ACKNOWLEDGMENTS

The authors are grateful to Grace Yeo and Brian Hie for helpful suggestions. T.B. is supported by the Simons Foundation International, Ltd. (SF349247). B.B. is partially supported by NIH grant R35 GM141861.

AUTHOR CONTRIBUTIONS

All authors conceived and guided the project and methodology. T.B. wrote the software and performed the computational experiments. All authors interpreted the results and wrote the manuscript.

DECLARATION OF INTERESTS

The authors declare no competing interests.

Received: January 16, 2021
Revised: May 20, 2021
Accepted: May 20, 2021
Published: June 16, 2021

REFERENCES

Alford, R.F., Leaver-Fay, A., Jeliazkov, J.R., O'Meara, M.J., DiMaio, F.P., Park, H., Shapovalov, M.V., Renfrew, P.D., Mulligan, V.K., Kappel, K., et al. (2017). The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design. J. Chem. Theory Comput. 13, 3031–3048.

Alley, E.C., Khimulya, G., Biswas, S., AlQuraishi, M., and Church, G.M. (2019). Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322.

AlQuraishi, M. (2019). End-to-End Differentiable Learning of Protein Structure. Cell Syst. 8, 292–301.e3.

Altschul, S.F., and Koonin, E.V. (1998). Iterated profile searches with PSI-BLAST–a tool for discovery in protein databases. Trends Biochem. Sci. 23, 444–447.

Araya, C.L., Fowler, D.M., Chen, W., Muniez, I., Kelly, J.W., and Fields, S. (2012). A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function. Proc. Natl. Acad. Sci. USA 109, 16858–16863.

Bandaru, P., Shah, N.H., Bhattacharyya, M., Barton, J.P., Kondo, Y., Cofsky, J.C., Gee, C.L., Chakraborty, A.K., Kortemme, T., Ranganathan, R., and Kuriyan, J. (2017). Deconstruction of the Ras switching cycle through saturation mutagenesis. eLife 6, e27810. https://doi.org/10.7554/eLife.27810.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. (2004). The Pfam protein families database. Nucleic Acids Res. 32, D138–D141.

Battaglia, P.W., Hamrick, J.B., and Bapst, V. (2018). Relational inductive biases, deep learning, and graph networks. arXiv, 1806.01261 https://arxiv.org/abs/1806.01261.

Bedbrook, C.N., Yang, K.K., Rice, A.J., Gradinaru, V., and Arnold, F.H. (2017). Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization. PLoS Comput. Biol. 13, e1005786.

Bengio, Y. (2012). Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning (jmlr.org), pp. 17–36.

Bepler, T., and Berger, B. (2019). Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations. 1902.08661, https://arxiv.org/abs/1902.08661.

Berger, B. (1995). Algorithms for protein structural motif recognition. J. Comput. Biol. 2, 125–138.

Berger, B., Wilson, D.B., Wolf, E., Tonchev, T., Milla, M., and Kim, P.S. (1995). Predicting coiled coils by use of pairwise residue correlations. Proc. Natl. Acad. Sci. USA 92, 8259–8263.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. (2000). The Protein Data Bank. Nucleic Acids Res. 28, 235–242.

Bouckaert, R., Vaughan, T.G., Barido-Sottani, J., Duchêne, S., Fourment, M., Gavryushkina, A., Heled, J., Jones, G., Kühnert, D., De Maio, N., et al. (2019). BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 15, e1006650.

Brenan, L., Andreev, A., Cohen, O., Pantel, S., Kamburov, A., Cacchiarelli, D., Persky, N.S., Zhu, C., Bagul, M., Goetz, E.M., et al. (2016). Phenotypic Characterization of a Comprehensive Set of MAPK1/ERK2 Missense Mutants. Cell Rep. 17, 1171–1183.

Brookes, D., Park, H., and Listgarten, J. (2019). Conditioning by adaptive sampling for robust design. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds. (PMLR), pp. 773–782.

Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language Models are Few-Shot Learners. arXiv, 2005.14165 http://arxiv.org/abs/2005.14165.

Callaway, E. (2020). Revolutionary cryo-EM is taking over structural biology. Nature 578, 201.

Chandonia, J.-M., Fox, N.K., and Brenner, S.E. (2017). SCOPe: Manual Curation and Artifact Removal in the Structural Classification of Proteins - extended Database. J. Mol. Biol. 429, 348–355.

Cheng, Y., Grigorieff, N., Penczek, P.A., and Walz, T. (2015). A primer to single-particle cryo-electron microscopy. Cell 161, 438–449.
Choi, J.-M., and Pappu, R.V. (2019). Improvements to the ABSINTH Force Field for Proteins Based on Experimentally Derived Amino Acid Specific Backbone Conformational Statistics. J. Chem. Theory Comput. 15, 1367–1382.

Choromanski, K.M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J.Q., Mohiuddin, A., Kaiser, L., et al. (2020). Rethinking Attention with Performers. In International Conference on Learning Representations. https://openreview.net/pdf?id=Ua6zuk0WRH (Accessed: 20 May 2021).

de Juan, D., Pazos, F., and Valencia, A. (2013). Emerging methods in protein co-evolution. Nat. Rev. Genet. 14, 249–261.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, 1810.04805 http://arxiv.org/abs/1810.04805.

Ding, X., Zou, Z., and Brooks III, C.L. (2019). Deciphering protein evolution and fitness landscapes with latent space models. Nat. Commun. 10, 5644.

Eddy, S.R. (2011). Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, e1002195.

Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., and Aurell, E. (2013). Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 87, 012707.

Elnaggar, A., Heinzinger, M., Dallago, C., Rihawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., et al. (2020). ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing. arXiv, 2007.06225 http://arxiv.org/abs/2007.06225.

Findlay, G.M., Daza, R.M., Martin, B., Zhang, M.D., Leith, A.P., Gasperini, M., Janizek, J.D., Huang, X., Starita, L.M., and Shendure, J. (2018). Accurate classification of BRCA1 variants with saturation genome editing. Nature 562, 217–222.

Finn, R.D., Clements, J., and Eddy, S.R. (2011). HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, 29–37.

Fox, N.K., Brenner, S.E., and Chandonia, J.-M. (2014). SCOPe: Structural Classification of Proteins–extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 42, D304–D309.

Gardner, J., et al. (2018). GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration. In Advances in Neural Information Processing Systems, S. Bengio, et al., eds. (Curran Associates, Inc.), pp. 7576–7586.

Göbel, U., Sander, C., Schneider, R., and Valencia, A. (1994). Correlated mutations and residue contacts in proteins. Proteins 18, 309–317.

Godzik, A., Kolinski, A., and Skolnick, J. (1993). De novo and inverse folding predictions of protein structure and dynamics. J. Comput. Aided Mol. Des. 7, 397–438.

Graves, A., Fernández, S., and Schmidhuber, J. (2005). Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. In Artificial Neural Networks: Formal Models and Their Applications – ICANN 2005 (Springer Berlin Heidelberg), pp. 799–804.

Harris, Z.S. (1954). Distributional Structure. Word 10, 146–162.

Hess, B., Kutzner, C., van der Spoel, D., and Lindahl, E. (2008). GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. J. Chem. Theory Comput. 4, 435–447.

Hie, B., Zhong, E., Bryson, B., and Berger, B. (2020a). Learning mutational semantics. Advances in Neural Information Processing Systems 33. https://proceedings.neurips.cc/paper/2020/hash/6754e06e46dfa419d5afe3c9781cecad-Abstract.html.

Hie, B., Bryson, B.D., and Berger, B. (2020b). Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design. Cell Syst. 11, 461–477.e9.

Hie, B., Zhong, E.D., Berger, B., and Bryson, B. (2021). Learning the language of viral evolution and escape. Science 371, 284–288.

Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780.

Hornak, V., Abel, R., Okur, A., Strockbine, B., Roitberg, A., and Simmerling, C. (2006). Comparison of multiple Amber force fields and development of improved protein backbone parameters. Proteins 65, 712–725.

Hospedales, T., Antoniou, A., Micaelli, P., and Storkey, A. (2020). Meta-Learning in Neural Networks: A Survey. arXiv, 2004.05439 http://arxiv.org/abs/2004.05439.

Hubbard, T.J., Murzin, A.G., Brenner, S.E., and Chothia, C. (1997). SCOP: a structural classification of proteins database. Nucleic Acids Res. 25, 236–239.

Huelsenbeck, J.P., and Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755.

Ingraham, J., Garg, V.K., Barzilay, R., and Jaakkola, T. (2019a). Generative Models for Graph-Based Protein Design. In Advances in Neural Information Processing Systems, H. Wallach, et al., eds. (Curran Associates, Inc.), pp. 15820–15831.

Ingraham, J., Riesselman, A., Sander, C., and Marks, D. (2019b). Learning protein structure with a differentiable simulator. In International Conference on Learning Representations. https://openreview.net/forum?id=Byg3y3C9Km.

Jacquier, H., Birgy, A., Le Nagard, H., Mechulam, Y., Schmitt, E., Glodt, J., Bercot, B., Petit, E., Poulain, J., Barnaud, G., et al. (2013). Capturing the mutational landscape of the beta-lactamase TEM-1. Proc. Natl. Acad. Sci. USA 110, 13067–13072.

James, L.C., and Tawfik, D.S. (2003). Conformational diversity and protein evolution–a 60-year-old hypothesis revisited. Trends Biochem. Sci. 28, 361–368.

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Tunyasuvunakool, K., Ronneberger, O., Bates, R., Zidek, A., Bridgland, A., et al. https://predictioncenter.org/casp14/doc/CASP14_Abstracts.pdf.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv, 2001.08361 http://arxiv.org/abs/2001.08361.

Kingma, D.P., and Ba, J. (2015). Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations. arXiv, 1412.6980 http://arxiv.org/abs/1412.6980.

Kitzman, J.O., Starita, L.M., Lo, R.S., Fields, S., and Shendure, J. (2015). Massively parallel single-amino-acid mutagenesis. Nat. Methods 12, 203–206, 4 p following 206.

Klesmith, J.R., Bacik, J.-P., Wrenbeck, E.E., Michalczyk, R., and Whitehead, T.A. (2017). Trade-offs between enzyme fitness and solubility illuminated by deep mutational scanning. Proc. Natl. Acad. Sci. USA 114, 2265–2270.

Kosloff, M., and Kolodny, R. (2008). Sequence-similar, structure-dissimilar protein pairs in the PDB. Proteins 71, 891–902.

Lartillot, N., Lepage, T., and Blanquart, S. (2009). PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25, 2286–2288.

Leaver-Fay, A., Tyka, M., Lewis, S.M., Lange, O.F., Thompson, J., Jacak, R., Kaufman, K., Renfrew, P.D., Smith, C.A., et al. (2011). Chapter nineteen - Rosetta3: An Object-Oriented Software Suite for the Simulation and Design of Macromolecules. In Methods in Enzymology, M.L. Johnson and L. Brand, eds. (Academic Press), pp. 545–574.

Liu, Y., Palmedo, P., Ye, Q., Berger, B., and Peng, J. (2018). Enhancing Evolutionary Couplings with Deep Convolutional Neural Networks. Cell Syst. 6, 65–74.e3.

Liu, G., Zeng, H., Mueller, J., Carter, B., Wang, Z., Schilz, J., Horny, G., Birnbaum, M.E., Ewert, S., and Gifford, D.K. (2020). Antibody complementarity determining region design using high-capacity machine learning. Bioinformatics 36, 2126–2133.

Luo, Y., Vo, L., Ding, H., Su, Y., Liu, Y., Qian, W.W., Zhao, H., and Peng, J. (2020). Evolutionary Context-Integrated Deep Sequence Modeling for Protein Engineering. In Research in Computational Molecular Biology (Springer International Publishing), pp. 261–263.

Madani, A., McCann, B., Naik, N., Keskar, N.S., Anand, N., Eguchi, R.R., Huang, P.-S., and Socher, R. (2020). ProGen: Language Modeling for Protein Generation. arXiv, 2004.03497 http://arxiv.org/abs/2004.03497.

Marks, D.S., Hopf, T.A., and Sander, C. (2012). Protein structure prediction from sequence variation. Nat. Biotechnol. 30, 1072–1080.
Walensky, R.P., Walke, H.T., and Fauci, A.S. (2021). SARS-CoV-2 Variants of Concern in the United States - Challenges and Opportunities. JAMA 325, 1037–1038. https://doi.org/10.1001/jama.2021.2294.

Wang, S., Li, B.Z., Khabsa, M., Fang, H., and Ma, H. (2020a). Linformer: Self-Attention with Linear Complexity. arXiv, 2006.04768 http://arxiv.org/abs/2006.04768.

Wang, Y., Yao, Q., Kwok, J., and Ni, L.M. (2020b). Generalizing from a Few Examples: A Survey on Few-shot Learning. ACM Comput. Surv. 53, 1–34.

Wei, K.Y., Moschidi, D., Bick, M.J., Nerli, S., McShan, A.C., Carter, L.P., Huang, P.S., Fletcher, D.A., Sgourakis, N.G., Boyken, S.E., and Baker, D. (2020). Computational design of closely related proteins that adopt two well-defined but structurally divergent folds. Proc. Natl. Acad. Sci. USA 117, 7208–7215.

Weile, J., Sun, S., Cote, A.G., Knapp, J., Verby, M., Mellor, J.C., Wu, Y., Pons, C., Wong, C., van Lieshout, N., et al. (2017). A framework for exhaustively mapping functional missense variants. Mol. Syst. Biol. 13, 957.

Wolf, E., Kim, P.S., and Berger, B. (1997). MultiCoil: a program for predicting two- and three-stranded coiled coils. Protein Sci. 6, 1179–1189.

Wrenbeck, E.E., Azouz, L.R., and Whitehead, T.A. (2017). Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded. Nat. Commun. 8, 15695.

Xu, J. (2019). Distance-based protein folding powered by deep learning. Proc. Natl. Acad. Sci. USA 116, 16856–16865.

Xu, J., and Wang, S. (2019). Analysis of distance-based protein structure prediction by deep learning in CASP13. Proteins 87, 1069–1081.

Yang, K.K., Wu, Z., Bedbrook, C.N., and Arnold, F.H. (2018). Learned protein embeddings for machine learning. Bioinformatics 34, 4138.

Yang, J., Anishchenko, I., Park, H., Peng, Z., Ovchinnikov, S., and Baker, D. (2020). Improved protein structure prediction using predicted interresidue orientations. Proc. Natl. Acad. Sci. USA 117, 1496–1503.

Zeng, H., and Gifford, D.K. (2019). Quantification of Uncertainty in Peptide-MHC Binding Prediction Improves High-Affinity Peptide Selection for Therapeutic Design. Cell Syst. 9, 159–166.e3.

Zhang, C., and Kim, S.-H. (2003). Overview of structural genomics: from structure to function. Curr. Opin. Chem. Biol. 7, 28–32.

Zhang, Y., and Skolnick, J. (2005). TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309.

Zhou, G., Chen, M., Ju, C.J.T., Wang, Z., Jiang, J.Y., and Wang, W. (2020). Mutation effect estimation on protein-protein interactions using deep contextualized representation learning. NAR Genom. Bioinform. 2, lqaa015. https://doi.org/10.1093/nargab/lqaa015.
STAR+METHODS
RESOURCE AVAILABILITY
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Bonnie
Berger (bab@mit.edu).
Materials availability
This study did not generate new materials.
METHOD DETAILS
$k_{i,j} = e^{-d_{i,j}}$. With the inter-residue semantic distances and the alignment weights, we then define a global similarity between the two sequences as the negative semantic distance between the positions averaged over the alignment, $s = -\frac{1}{C} \sum_{i,j} c_{i,j} d_{i,j}$, where $C = \sum_{i,j} c_{i,j}$.
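As a concrete illustration, the following minimal PyTorch sketch computes this alignment-weighted similarity from two per-residue embedding matrices. The construction of the alignment weights $c_{i,j}$ shown here (a soft symmetric normalization of the negative distances) is an assumption for illustration, since their exact definition precedes this excerpt; the sketch is not the exact training code.

```python
import torch

def soft_alignment_similarity(z1, z2):
    """Global similarity between two embedded sequences (illustrative sketch).

    z1: (L1, D) tensor of per-residue embeddings for the first sequence.
    z2: (L2, D) tensor of per-residue embeddings for the second sequence.
    """
    # Inter-residue semantic distances d_{i,j} (L1 distance between embeddings).
    d = torch.cdist(z1, z2, p=1)                # (L1, L2)

    # Assumed soft symmetric alignment weights c_{i,j}:
    # normalize the kernel exp(-d) over each sequence and combine symmetrically.
    alpha = torch.softmax(-d, dim=1)            # align residues of sequence 1 onto sequence 2
    beta = torch.softmax(-d, dim=0)             # align residues of sequence 2 onto sequence 1
    c = alpha + beta - alpha * beta             # symmetric alignment weights

    # Global similarity: negative distance averaged over the alignment.
    C = c.sum()
    return -(c * d).sum() / C

# Example with random embeddings (lengths 50 and 60, dimension 128).
s = soft_alignment_similarity(torch.randn(50, 128), torch.randn(60, 128))
```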
With this global similarity based on the sequence embeddings in hand, we need to compare it against a ground truth similarity to
calculate the gradient of our loss signal and update the parameters. Because we want our semantic similarity to reflect structural
similarity, we retrieve ground truth labels, t, from the SCOP database by assigning increasing levels of similarity to proteins based
on the number of levels in the SCOP hierarchy that they share. In other words, we assign a ground truth label of 0 to proteins not
in the same class, 1 to proteins in the same class but not the same fold, 2 to proteins in the same fold but not the same superfamily,
3 to proteins in the same superfamily but not in the same family, and finally 4 to proteins in the same family. We relate our semantic
similarity to these levels of structural similarity through ordinal regression. We calculate the probability that two sequences are similar
at level $t$ or higher as $p(y \geq t) = \sigma(\theta_t s + b_t)$, where $\sigma$ is the logistic sigmoid and $\theta_t$ and $b_t$ are additional learnable parameters for $t \geq 1$. We impose the constraint that $\theta_t \geq 0$ to ensure that increasing similarity between the embeddings corresponds to an increasing number of shared levels in the SCOP hierarchy. Given these distributions, we calculate the probability that two proteins are similar at exactly level $t$ as $p(y = t) = p(y \geq t)\,(1 - p(y \geq t + 1))$. That is, the probability that two sequences are similar at exactly level $t$ equals the probability that they are similar at level $t$ or higher times the probability that they are not similar at any level above $t$.
We then define the structural similarity prediction loss as the negative log-likelihood of the observed similarity labels under this model, $L_{similarity} = -\log p(y = t)$.
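A hedged PyTorch sketch of this ordinal regression head and its negative log-likelihood is given below; parameter names and the clamping constant are illustrative choices rather than details taken from the released implementation. In training, `s` would be the alignment-based similarity defined above and `y` the SCOP-derived label in {0, ..., 4}.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrdinalSimilarityLoss(nn.Module):
    """Ordinal regression over SCOP similarity levels (illustrative sketch).

    Maps a scalar semantic similarity s to probabilities over labels
    y in {0, ..., num_levels}; theta_t is kept non-negative so that larger
    s implies more shared SCOP levels.
    """

    def __init__(self, num_levels=4):
        super().__init__()
        self.theta = nn.Parameter(torch.ones(num_levels))  # parameters for t = 1..num_levels
        self.bias = nn.Parameter(torch.zeros(num_levels))

    def forward(self, s, y):
        # s: (batch,) semantic similarities; y: (batch,) integer labels in [0, num_levels].
        theta = F.relu(self.theta)                                  # enforce theta_t >= 0
        p_ge = torch.sigmoid(s.unsqueeze(1) * theta + self.bias)    # p(y >= t) for t = 1..num_levels
        ones = torch.ones_like(p_ge[:, :1])                         # p(y >= 0) = 1
        zeros = torch.zeros_like(p_ge[:, :1])                       # p(y >= num_levels + 1) = 0
        p_ge = torch.cat([ones, p_ge, zeros], dim=1)
        p_eq = p_ge[:, :-1] * (1.0 - p_ge[:, 1:])                   # p(y = t) = p(y >= t)(1 - p(y >= t + 1))
        nll = -torch.log(p_eq.gather(1, y.unsqueeze(1)).clamp_min(1e-8))
        return nll.mean()

# Example: a batch of 8 similarities with SCOP-level labels in {0, ..., 4}.
loss_fn = OrdinalSimilarityLoss()
loss = loss_fn(torch.randn(8), torch.randint(0, 5, (8,)))
```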
Multi-task loss
We define the combined multi-task loss as a weighted sum of the language modeling, contact prediction, and similarity prediction
losses, $L_{MT} = \lambda_{masked} L_{masked} + \lambda_{contact} L_{contact} + \lambda_{similarity} L_{similarity}$.
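In code, this combination is a one-line weighted sum; the weights below are placeholders rather than the values used for training, which are not specified in this excerpt.

```python
# Placeholder weights for the three loss terms.
lambda_masked, lambda_contact, lambda_similarity = 1.0, 1.0, 1.0

def multitask_loss(loss_masked, loss_contact, loss_similarity):
    """Weighted sum of the masked-LM, contact-prediction, and similarity losses."""
    return (lambda_masked * loss_masked
            + lambda_contact * loss_contact
            + lambda_similarity * loss_similarity)
```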
Training datasets
We train our masked language models on a large corpus of protein sequences, UniRef90 (Suzek et al., 2007), retrieved in July 2018.
This dataset contains 76,215,872 protein sequences filtered to 90% sequence identity. For structural supervision, we use the SCOPe
ASTRAL protein dataset previously presented by Bepler and Berger (Fox et al., 2014; Chandonia et al., 2017; Bepler and Berger, 2019). This dataset contains 28,010 protein sequences with known structures and SCOP classifications
from the SCOPe ASTRAL 2.06 release. These sequences are split into 22,408 training sequences and 5,602 testing sequences.
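To make the structural labels concrete, the short sketch below derives the shared-level label in {0, ..., 4} for a pair of domains from their SCOPe classification strings (e.g., 'a.1.1.2' encodes class.fold.superfamily.family). This is an illustrative helper for how such labels could be computed, not part of the released code.

```python
def scop_similarity_label(sccs_a, sccs_b):
    """Number of shared leading SCOP levels (class, fold, superfamily, family).

    sccs strings look like 'a.1.1.2'. Returns 0 (different class) up to
    4 (same family). Illustrative only.
    """
    shared = 0
    for level_a, level_b in zip(sccs_a.split("."), sccs_b.split(".")):
        if level_a != level_b:
            break
        shared += 1
    return shared

# Example: same superfamily but different family -> label 3.
assert scop_similarity_label("a.1.1.1", "a.1.1.2") == 3
```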