Trends in

Biochemical Sciences OPEN ACCESS


Novel machine learning approaches revolutionize protein knowledge

Nicola Bordin,1 Christian Dallago,2,3 Michael Heinzinger,2,4 Stephanie Kim,5,6 Maria Littmann,2 Clemens Rauer,1 Martin Steinegger,5,6,@ Burkhard Rost,2,7,8,@ and Christine Orengo1,*,@

Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.

Highlights

Two artificial intelligence (AI)-based methods for protein structure prediction, AlphaFold 2 and RoseTTAFold, dramatically increase the quality of structural modeling from sequence, nearing experimental accuracy.

Protein language models encode the written language of proteins, allowing for more accurate annotations and predictions than homology-based methods.

Most model organisms, neglected disease pathogens, and proteins with curated annotations have models available with varying quality, aiding wet-laboratory experiments targeting single-question issues.

Ultrafast alignment tools can traverse the protein space by both sequence and structure to identify remote evolutionary relations previously out of reach for older and slower methods.

Preliminary analyses of predicted AlphaFold 2 3D-models from 21 model organisms suggest that the majority (>90%) of globular domains in proteins can be assigned to currently characterized domain evolutionary superfamilies.

1 Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK
2 Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology – i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
3 VantAI, 151 W 42nd Street, New York, NY 10036, USA
4 TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
5 School of Biological Sciences, Seoul National University, Seoul, South Korea
6 Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
7 Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany
8 TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany

*Correspondence: c.orengo@ucl.ac.uk (C. Orengo).
Twitter: @CATH_Gene3D (C. Orengo), @rostlab (B. Rost), and @thesteinegger (M. Steinegger).

Trends in Biochemical Sciences, April 2023, Vol. 48, No. 4 https://doi.org/10.1016/j.tibs.2022.11.001 345
© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

From protein sequence and structure to function through ML

The number of experimentally determined, high-resolution structures deposited in the Protein Data Bank^i (PDB) [1] has grown immensely since its beginning in 1976, enabling research into biological mechanisms, and in turn the development of novel therapeutics and industrial applications. This growth is, however, exponentially outpaced by that of known protein sequences, which is increasingly driven by high-throughput metagenomic experiments that yield billions of entries per experiment. Closing the ever-increasing gap between protein sequence and annotations of structure and function is thus a desideratum in molecular and medical biology research.

Most proteins comprise two or more structural domains [2], that is, constituents with compact structures assumed to fold largely independently. Structural domains are often associated with specific functional roles [3], although functional sites can be formed from multiple domains [3]. These structural domains – often dubbed ‘folds’ – recur in nature [4], and have been estimated to be limited to a number in the order of thousands [5]. Folds are reminiscent of Plato’s allegory of the cave: they are more an image, idea, or concept than a real object (Plato, Politeia [6]); this image helps to map relations between proteins.

Various resources emerged to classify domain structures in evolutionary families and fold groups (e.g., SCOP^ii [7], CATH^iii [8], SCOPe^iv [9], and ECOD^v [10]), and these have saturated at about 5000 structural families and about 1300 folds over the past decade, despite structural genomics initiatives targeting proteins likely to have new folds [11]. As increasingly powerful sequence profile methods [12–14] have identified structural families in completely sequenced organisms (complete proteomes), studies suggest that up to 70% of all domains resemble those already classified in SCOP or CATH [3,15–17]. Unsurprisingly, the distribution of family sizes follows a power law: most families/folds are small or species-specific, but a few hundred are very highly populated, tend to be universal across species, and have important functions [8]. In parallel, flexible or intrinsically disordered regions (IDRs) [18] making up 20–30% of all residues in a given proteome have been associated with protein function [19–21]. While structural domains can be thought of as the structural units of proteins, the millions of domain combinations create the immense diversity of functional repertoires.

Since the details of function for most proteins in most organisms remain uncharacterized, understanding how domains evolve and combine to modify function would be a major step in our quest to understand and engineer biology. Protein structure data can provide a waymark, and exciting advances in structure prediction over the past year suggest that a landmark has been reached [22]. While structure prediction has steadily improved over time, thanks to the exponential growth in protein sequence data and covariation methods, this new era was kick-started by the remarkable performance of AlphaFold^vi (AF) at CASP13 [23]. The method, though, was not made available to the scientific community, resulting in various groups trying to replicate the features behind its breakthrough performance. New versions of previously state-of-the-art methods, such as RoseTTAFold [24] and PREFMD [25], were released based on these advancements. DeepMind’s AF2 outperformed others in CASP14 [26], and reports suggest that high-quality models can be comparable to crystallographic structures, with competing methods reproducing DeepMind’s results [24,27,28]. DeepMind has recently announced the availability of 214 million putative protein structures for the whole of UniProt, which are available through the AF Database [29] and 3D-Beacons platform at the European Bioinformatics Institute (EBI). The latter provides AF models alongside models from other prediction methods [30]. This 1000-fold increase in structural data requires equally transformative developments in methods for processing and analyzing the mix of experimental and putative structure/sequence data, including methods reliably predicting aspects of function from sequence alone [31–33], and methods to quickly sift through putative structures [34].

In this review, we consider recent developments in deep learning, a branch of ML (see Glossary)
operating on sequence and structure comparisons that enable highly sensitive detection of dis-
tant relationships between proteins. These will allow us to harness important insights on putative
structure space, on domain combinations, and on the extent and role of disorder. One important
observation from this review: no single modality has all the answers. Instead, protein sequence,
evolutionary information, latent embeddings from pLMs, and structure information all play
key roles in helping to uncover how proteins fold and act. Application of these tools has enabled
rapid evolutionary classification of good-quality AF2 models (defined as AF2 Predicted Local Distance
Difference Test (pLDDT) ≥ 70 [22]) for 21 model organisms, including human (Homo
sapiens), mouse (Mus musculus), rat (Rattus norvegicus), and thale cress (Arabidopsis thaliana)
[35]. We review the insights derived from these studies and the future opportunities they bring
for understanding the links between protein structural arrangements and their functions.

Sequence-based approaches to find homologs

Sequence similarity and evolutionary information provide gold-standard baselines
Comparing a query protein sequence against the growing sequence databases can reveal a
goldmine of evolutionary information encoded in related (similar) sequences (Table 1). Closely
related sequences with annotations of function and structure have successfully been used for
homology-based inference (HBI), that is, the transfer of annotations from labeled to
sequence-similar yet unlabeled proteins [36,37]. Beyond annotation transfer, evolutionary infor-
mation condensed in multiple sequence alignments (MSAs) can serve for de novo protein func-
tion and structure prediction methods, which have ranked highly for decades in independent
evaluations [37–40]. However, the runtime and parameter sensitivity of popular solutions to generate
MSAs [12,13,41], in combination with selection-biased sequence datasets, create major
bottlenecks: (i) slow runtime, (ii) uninformative MSAs from inappropriate default parameters, from difficult to align families (e.g., IDRs), or from lack of diversity for understudied or species-specific families. Uninformative MSAs affect the prediction quality even for AF2 [22,42,43]. While advances in computer hardware coupled with clever engineering overcame some of the speed limitations [14,44], the faster-than-Moore’s-law [45] growth of sequence databases demands alternative or complementary solutions.

Table 1. Advantages and disadvantages of methods for homology detection

Method | Advantages | Disadvantages
Homology-based inference (HBI) | • Highly reliable • Interpretable | • Computationally expensive • Sensitive to choice of databases and parameters
Embedding-based annotation transfer (EAT) | • Fast inference, that is, generation of embeddings • Data-driven feature learning and extraction --> reduced human bias • Detection of distant homologs | • Computationally expensive pretraining (only has to be done once) • Choice of dataset, redundancy level, preprocessing is still human biased
Contrastive learning | • Specialization for specific use-case improves performance • Detection of distant homologs | • Pre-trained model is not generally applicable for identification of homologs for all aspects of protein function
Supervised learning | • Detection of distant homologs | • Difficult to extend to more classes • Requires enough data
Structure-based annotation transfer (SAT) | • Detection of very distant homologs • Highly interpretable since alignments can be interpreted visually using structure | • Computationally expensive • No standardized metrics of similarity [root mean square deviation (RMSD), TM-score returned by TM-align method]

Glossary

Embedding-based annotation transfer (EAT): embedding-based annotation transfer applies the same logic as HBI but replaces sequence similarity (SIM) with similarity of embedding vectors (generalized sequences).

Evolutionary couplings (EV): pairs of residues coupled through coevolution. The adequate preprocessing of these signals (e.g., through direct coupling analysis) was one important milestone toward AF2.

Evolutionary information: information compiled through the comparison of protein sequences and structures by grouping proteins into families connected through evolution. Typically, evolutionary information is compiled in so-called family profiles, or position-specific scoring matrices (PSSMs). The combination of evolutionary information and ML affected most breakthrough steps in protein structure prediction from 1992 to 2021.

Homology-based inference (HBI): inference at the base of most protein annotations. Assume a query Q without known phenotype, and a protein K with experimentally known phenotype P, then HBI operates as follows: if the sequence similarity between Q and K exceeds some threshold T, we assume Q to have the same phenotype as K: if SIM(Q,K) > T -> phenotype (Q) = phenotype (K) = P. How to measure SIM and what value to choose for T depends on the type of phenotype and has to be empirically determined for each type of ‘phenotype’: for example, for 3D structure, and also for secondary structure, or for molecular function in the GeneOntology and also for binding particular classes of ligands. Incidentally, the evolutionary link in the word ‘homology’ is crucial, because protein design can generate protein pairs with similar sequences and dissimilar

Machine learning (ML): computational systems that aim to emulate human intelligence, usually by means of statistics and probability.

Protein language models (pLMs) and embeddings: while language models (LMs) from natural language processing (NLP) understand natural language from data, pLMs aim at understanding the language of life through implicitly capturing the evolutionary, functional, and structural constraints on protein sequences. Effectively, such constraints can be learned from large sets of unannotated protein sequences, because all sequences that can ever be observed are not a random subset of all possible sequences. These constraints are captured in the connections of the ‘neurons’ used to train the pLMs, and can be written as vectors (rows of real-valued numbers) that are referred to as ‘the embeddings’.

Structure-based annotation transfer (SAT): implementing a logic similar to HBI, SAT expands beyond the evolutionary connection. Instead, the assumption is that similar shapes (3D coordinates) and feature descriptors (e.g., density of charged residues in surface patch) affect some similarity in terms of other function-related

pLMs: deep learning learns protein grammar
One alternative to direct evolutionary information extraction is leveraging deep learning teaching machines to encode information contained in billions of known protein sequences by adapting so-called language models (LMs) from natural language processing (NLP) to learn aspects of the ‘grammar’ of the language of life as encoded in protein sequences [46–51]. Where traditional ML models are trained to learn from labeled data (i.e., data with annotations) (supervised training), pLMs implicitly learn data attributes, such as constraints (evolutionary, structural, or functional) shaping protein sequences (dubbed self-supervised learning). This can be achieved either by autoregression, that is, training on predicting the next token (the word in text, the residue in pLMs) given all previous tokens in a sequence, or via masked-language modeling (i.e., by training on reconstructing corrupted sequences from noncorrupted sequence context) [47,48,50]. Repeating this on billions of sequences forces the pLM to learn properties and statistical commonalities of the underlying protein language. The resulting solutions can be transferred to other tasks (transfer learning) to predict many different phenotypes [43,52–55] (Figure 1). Technically, this can be achieved by extracting the hidden states from the pLM referred to as embeddings. One key advantage of pLMs over evolutionary information is that the computation-heavy information extraction (learning the pLM) needs to be done only once during model training on efficient, high-performance computing centers. The extraction and use of the embeddings, by contrast, is done efficiently on consumer-grade hardware such as modern personal computers or even laptops.

Figure 1. Overview of embeddings applications in protein structure and function characterization. Images were retrieved from Wikipedia (alpha helix and beta strand, binding sites), Creative Proteomics (cell structure), and bioRxiv (Transmembrane Regions - CASP13 Target T1008, Structure Prediction - PDB 1Q9F) with permission from the authors.

pLMs improve the prediction of protein function
Since the introduction of the first general-purpose pLMs around 3 years ago [46,48,50,56], pLMs have been shown to astutely capture aspects of protein structure, function, and evolution just from the information contained in databases of raw sequences [32,43,47,48,50,52,53,57,58]. In an analogy with HBI, which transfers annotations based on sequence similarity, embedding-based annotation transfer (EAT) captures more information through comparing proteins in embedding, not sequence space [59,60]. Without any domain-specific optimization, and without ever seeing any labeled data, simple EAT outperformed HBI by a large margin and ranked among the top ten methods for predicting the molecular function of a protein during Critical Assessment of Functional Annotation 4 (CAFA4) [31]. Adding domain optimizations, EAT predicted proteins according to the CATH classification [8,60] beyond what could be detected by advanced sequence profile methods [61]. The power of pLMs was confirmed as CATHe revealed distant evolutionary relationships, not detected by sequence profile methods, yet confirmed by structure comparison of AF2-predicted models.
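The transfer logic shared by HBI and EAT can be illustrated with a minimal sketch. It assumes embeddings are already available as plain lists of floats, uses Euclidean distance as the similarity measure, and takes a hypothetical cut-off `max_distance`; none of these choices is prescribed by the methods cited above, and the cut-off would have to be calibrated empirically for each type of annotation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def embedding_annotation_transfer(query, lookup, max_distance):
    """Transfer the label of the nearest labeled embedding, if it is close enough.

    query        -- embedding of the unlabeled protein Q
    lookup       -- list of (embedding, label) pairs for labeled proteins K
    max_distance -- hypothetical cut-off (assumption; calibrate per task)

    Returns (label, distance), or (None, distance) when the nearest
    labeled protein is farther away than max_distance.
    """
    best_label, best_dist = None, float("inf")
    for emb, label in lookup:
        d = euclidean(query, emb)
        if d < best_dist:
            best_label, best_dist = label, d
    if best_dist <= max_distance:
        return best_label, best_dist
    return None, best_dist
```

Replacing `euclidean` with a sequence-similarity score (and flipping the comparison so that higher is better) turns the same function into the HBI rule; only the space in which proteins are compared changes.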

Leap in protein structure prediction combines ML with evolutionary information and hardware
Considered 2021’s method of the year [62], AF2 [22] combines advanced deep learning with
evolutionary information from larger MSAs – obtained from the BFD with 2.1 billion sequences
[63] or MGnify^vii [64] with 2.4 billion sequences, as opposed to UniProt^viii with 231 million [65] –
and more potent computer hardware to make major advances in protein structure prediction,
providing good quality models for at least 50% of the likely globular domains in UniProt
sequences. All top structure prediction methods, including AF2 and RoseTTAFold, rely on
evolutionary couplings (EV) [66] extracted from MSAs. These approaches detect pairs of coevolving
residues, which tend to lie in close spatial proximity. The preprocessing of this information has
advanced substantially over the past decades, for example, through direct coupling analysis, which
sharpens the signal [67,68]. Although the leap of AF2 required this foundation in 2021, future
advances may build their models on a different foundation [43].
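To make the notion of a covariation signal concrete, the sketch below scores two MSA columns by their mutual information. This is a deliberately simpler stand-in and not direct coupling analysis: plain mutual information does not separate direct from indirect couplings and applies no correction for phylogeny or sampling bias. The function name and the toy alignment are illustrative assumptions.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an MSA.

    msa -- list of aligned sequences of equal length (strings)
    """
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)              # residue counts in column i
    col_j = Counter(seq[j] for seq in msa)              # residue counts in column j
    pairs = Counter((seq[i], seq[j]) for seq in msa)    # joint counts
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Toy alignment: columns 0 and 1 change together, column 2 is fully conserved.
toy_msa = ["AKC", "AKC", "GEC", "GEC"]
coupled = column_mutual_information(toy_msa, 0, 1)
conserved = column_mutual_information(toy_msa, 0, 2)
```

In the toy alignment, the two columns that always change together receive a high score, whereas a fully conserved column carries no covariation signal at all; this also hints at why shallow or low-diversity MSAs are uninformative.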

Protein structure proxies function for distant homologs

Structure-based annotation transfer (SAT) has emerged as an alternative to HBI and EAT. SAT
more reliably captures distantly related proteins (Figure 2). With the recent breakthrough in protein
structure prediction [22] solving structures computationally at near-X-ray quality [26], new possibilities
to apply SAT at the proteome scale have arisen. A large new collection of in silico predicted
structures is available through the AF Protein Structure Database (AFDB^ix) [35], which has been
analyzed through fold recognition algorithms to refine protein families and to discover novel

Figure 2. Comparison of search sensitivity and speed for language models, sequence/profile–profile aligners, and structure aligners. Average sensitivity up to the fifth false positive (x-axis) for family, superfamily, and fold, measured on SCOP40e (version 2.01) [9], against average time to search a single query against 100 million proteins (y-axis). For each SCOP40e domain, we computed the fraction of true positives detected for family, superfamily, and fold up to the fifth false positive (FP; a hit with a different fold), and plotted the sensitivity averaged over all domains (x-axis).
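The sensitivity measure described in this caption can be sketched as follows. The sketch assumes that, for each query domain, the hits are already sorted best-first and labeled as true or false positives for the level of interest; how those labels are derived from SCOP40e, and how hits that count as neither are handled, is outside its scope. Function names are illustrative.

```python
def sensitivity_up_to_nth_fp(ranked_is_tp, total_tp, n_fp=5):
    """Fraction of true positives ranked before the n-th false positive.

    ranked_is_tp -- hits sorted best-first; True = true positive, False = false positive
    total_tp     -- number of true positives present in the database for this query
    """
    if total_tp == 0:
        return 0.0
    found, fps = 0, 0
    for is_tp in ranked_is_tp:
        if is_tp:
            found += 1
        else:
            fps += 1
            if fps == n_fp:  # stop counting at the n-th false positive
                break
    return found / total_tp


def mean_sensitivity(per_query):
    """Average the per-query sensitivity over (ranked_is_tp, total_tp) pairs."""
    values = [sensitivity_up_to_nth_fp(ranked, total) for ranked, total in per_query]
    return sum(values) / len(values)
```

Because counting stops at a fixed number of false positives, the measure rewards methods that rank true relatives ahead of unrelated folds, independent of the raw score scale each method uses.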

protein folds. For instance, by mining structures with a widely used structural alignment tool
(DALI) [69], a new member of the perforin/gasdermin (GSDM) pore-forming family in humans
was identified in spite of having only 1% sequence identity with the GSDM family [70,71]. Further-
more, the expanded search to all proteomes, covering 356 000 predicted structures, discovered
16 novel perforin-like proteins [72].

Faster solutions for structure–structure alignments enable high-throughput analyses in seconds

Despite efforts to improve the speed and sensitivity of structural aligners, traditional approaches
[69,73–75] are too slow to cope with the rapidly increasing size of predicted structure databases
[35,76] (Figure 2). Hence, novel ideas for structural comparison algorithms are emerging to accelerate
run times. These methods gain in speed by representing structures in a compressed form (Table 2).

One way to compress structural information is to break structures into fixed-size fragments.
Geometricus [77] represents proteins as a bag of shape-mers: fixed-sized structural fragments
described as moment invariants. It was used to cluster the AFDB and PDB using non-negative
matrix factorization to identify novel groups of protein structures [78]. RUPEE [79] is another
method that breaks structures into structural fragments. It discretizes protein structures by their
backbone torsion angles and then compares the Jaccard similarity of bags of torsion fragments
of the two structures. The top 8000 hits are then realigned by TM-align [74] in top-align mode.
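Comparing two structures as bags of fragments can be sketched as below. The sketch assumes each structure has already been reduced to a string of discretized torsion labels (the one-letter labels here are hypothetical), cuts it into overlapping fragments of length `k`, and uses the multiset form of the Jaccard index; it illustrates the general idea rather than reproducing the exact RUPEE procedure.

```python
from collections import Counter


def fragments(labels, k=3):
    """Overlapping length-k fragments of a sequence of discretized torsion labels."""
    return [tuple(labels[i:i + k]) for i in range(len(labels) - k + 1)]


def bag_jaccard(frags_a, frags_b):
    """Jaccard similarity of two bags (multisets) of fragments."""
    a, b = Counter(frags_a), Counter(frags_b)
    intersection = sum((a & b).values())  # per-fragment minimum counts
    union = sum((a | b).values())         # per-fragment maximum counts
    return intersection / union if union else 0.0


# Hypothetical label strings for two short structures.
score = bag_jaccard(fragments("HHHEE"), fragments("HHHHE"))
```

Since the comparison ignores where in the chain a fragment occurs, it is fast and alignment-free, which is also why a slower aligner is still needed afterwards to realign the top hits.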

Another category of tools represents tertiary structure as discretized volumes and compares
these. BioZernike [80], for example, approximates volumes through 3D Zernike descriptors and
compares these by a pretrained distance function. 3D-AF-Surfer [15] also applies 3D Zernike
descriptors followed by a support vector machine (SVM) trained to calculate the probability of
two structures being in the same fold. Results are ranked by the SVM scores, while individual
hits can be realigned using combinatorial extension (CE) [75].

Table 2. Advantages and disadvantages of methods for structure-based homology detection

Approach | Tools | Advantages | Disadvantages | Representation | Similarity calculation | Alignment
Structure fragments | Geometricus | Fast structure similarity search tool; accurate compared to other alignment-free techniques | Global comparison only; sensitivity is limited | Backbone encoded as fixed sized fragments as moment invariants | Vector distance similarity of Geometricus embedding vectors | Not available
Structure fragments | RUPEE | Fast structure database search tool; easy to use through webserver | Global comparison only; sensitivity is limited | Backbone encoded as fixed sized fragments of backbone torsion angles | Jaccard-similarity of torsion fragments or TM-score | TM-align [74]
Structure volume | BioZernike | Protein chain, and quaternary structure topology free comparison technique (avoids chain matching problem); both methods provide easy to use webserver | Global comparison only; searches similar surface shape, which is not sensitive | 3D Zernike descriptor of volume | Pretrained distance function to compare two volumes | Not available
Structure volume | 3D-AF-Surfer | Protein chain, and quaternary structure topology free comparison technique (avoids chain matching problem); both methods provide easy to use webserver | Global comparison only; searches similar surface shape, which is not sensitive | 3D Zernike descriptors of volume | Predicted probability of being in the same fold by neural network | CE [75]
Structural alphabets | Foldseek | Fast and accurate structure alignment tool; local or global alignment; easy to use through webserver | No quaternary structures comparison | 3Di alphabet that describes tertiary residue–residue interactions | E-value, LDDT, TM-score | Structural Smith–Waterman or TM-align

The fastest category of structural aligners represents structures as sequences of a discrete structural alphabet. Most of these alphabets discretize the backbone angles of the structure [81–83], albeit at a loss of information in the structured regions. Another type of structural discretization to a sequence was proposed by Foldseek^x [34]. It describes tertiary residue–residue interactions as a discrete alphabet and locally aligns the resulting structure sequences using the fast MMseqs2^xi algorithm [84]. Foldseek achieves the sensitivity of a state-of-the-art structural aligner like TM-align, while being at least 20 000 times faster.
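The idea of turning a structure into a string can be illustrated with a toy backbone-angle alphabet. The nine-letter alphabet and the 120° bins below are arbitrary assumptions made for illustration; they correspond neither to Foldseek's 3Di alphabet (which encodes tertiary residue–residue interactions rather than backbone angles) nor to any of the published alphabets cited above.

```python
BIN_SIZE = 120             # degrees per bin (assumption): 3 bins per angle
TOY_ALPHABET = "ABCDEFGHI" # 3 x 3 = 9 states, one letter each


def torsion_letter(phi, psi):
    """Map a (phi, psi) pair in degrees to one letter of the toy alphabet."""
    bins_per_angle = 360 // BIN_SIZE
    row = int((phi + 180) % 360 // BIN_SIZE)
    col = int((psi + 180) % 360 // BIN_SIZE)
    return TOY_ALPHABET[row * bins_per_angle + col]


def structure_to_string(torsions):
    """Encode a list of per-residue (phi, psi) pairs as a structure 'sequence'."""
    return "".join(torsion_letter(phi, psi) for phi, psi in torsions)


# Two helix-like residues followed by one strand-like residue.
encoded = structure_to_string([(-60, -45), (-60, -45), (-120, 130)])
```

Once a structure is a string, the whole toolbox of fast sequence search, alignment, and clustering applies unchanged, which is where the speed advantage of this category comes from.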

Sequence-based structural alignment tools are well equipped to handle the upcoming avalanche
of predicted protein structures. Efficient storage of structure information and queries against
these makes searches against hundreds of millions of structures feasible. Representing struc-
tures as sequences allows us to also adapt fast clustering algorithms like Linclust [84] to compare
billions of structures within a day. We expect current tools to further increase in sensitivity to
match or exceed the performance of DALI [69] (Figure 2).

Embeddings from pLMs in combination with fast structural aligners (Foldseek) could offer orthogonal
means of covering, classifying, and validating assignments in large swaths of protein fold space,
as shown in Figure 3.

Application of sequence and structure approaches to analyses of the protein

Deep learning extends fold space
Following the release of AFDB [35], structural analyses using fast, deep-learning-based methods (e.g., Geometricus,
3DZD) [15,77] suggested a slight predominance of mainly-alpha structures compared to the PDB,
and predicted the existence of hundreds to thousands more structural families in the dataset [15].
About 75% of the AF2 structures are of sufficient global quality (pLDDT scores of ≥70) for these
studies, depending on the analyses. However, even in these well-predicted 3D models, at least
26% of residues were of low model quality [16]. Recent studies showed that nearly 6% of these

Figure 3. Visual analysis of the structure space spanned by CATH domains expanded by AlphaFold 2 (AF2) models. We showcase how distance in either
structure (left) or embedding space (middle and right) can be used to gain insight into large sets of proteins. Simply put, we used pairwise distance between proteins to
summarize ~850 000 protein domains in a single 2D plot and colored them according to their CATH class and architecture. This exemplifies a general-purpose tool for
breaking down the complexity of large sets of proteins and allows, for example, detection of large-scale relationships that would otherwise be hard to find, or to detect
outliers. More specifically, ~850 000 domains were structurally aligned using Foldseek [34] (left) in an all-versus-all fashion, resulting in a distance matrix based on the
average pairwise bitscore within a superfamily as superfamily distances. The domain sequences were converted to embeddings using the ProtT5 (center) and
ProtTucker (right) protein language models (pLMs). Similarly to the structural approach, the distance matrix between superfamilies was calculated using the average
Euclidean distance between embeddings belonging to different superfamilies. Using different modalities (i.e., structure and sequence embeddings) for computing distances
on the same set of proteins provides different, potentially orthogonal angles on the same problem, which can be helpful during hypothesis generation. The resulting
distance matrices were used as precomputed inputs for uniform manifold approximation and projection (UMAP) [121] and plotted with seaborn [122].
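The embedding-based superfamily distances described in this caption can be sketched as follows. The sketch assumes embeddings are given as lists of floats grouped by superfamily name and sets the diagonal to zero; both the input layout and the treatment of the diagonal are assumptions, as the caption does not specify them.

```python
import math
from itertools import product


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def superfamily_distance(embs_a, embs_b):
    """Average Euclidean distance over all embedding pairs from two superfamilies."""
    dists = [euclidean(u, v) for u, v in product(embs_a, embs_b)]
    return sum(dists) / len(dists)


def distance_matrix(superfamilies):
    """Build a symmetric superfamily-by-superfamily distance matrix.

    superfamilies -- dict mapping superfamily name -> list of embeddings
    Returns (names, matrix); the diagonal is set to 0.0 (assumption).
    """
    names = sorted(superfamilies)
    matrix = [
        [0.0 if a == b else superfamily_distance(superfamilies[a], superfamilies[b])
         for b in names]
        for a in names
    ]
    return names, matrix
```

A matrix of this form can be handed to a dimensionality-reduction tool that accepts precomputed distances; in umap-learn this is done by constructing the model with `metric="precomputed"`.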

low-quality residues are predicted to be disordered by sequence-based approaches [78,85]. It
is also clear that AF2 struggles to predict domains from small, species-specific families [16],
suggesting that covariation data are needed for good-quality models. Preliminary analyses
[16] revealed some very unusual structural architectures in which common folds are connected
by large unordered regions or combined in quite regular arrangements using helical scaffolds
(see also Figure 4).

We recently analyzed the proportion of predicted AF2 structural domains in the 21 model organ-
isms that could be assigned to known superfamilies in CATH [16]. Only good-quality models were
analyzed according to a range of criteria (pLDDT ≥70, large proportions of ordered residues,
characteristic packing of secondary structure). We used well-established hidden Markov model
(HMM)-based protocols and a novel deep-learning method (CATHe [61] based on ProtT5 [47])
to detect domain regions in the AF2 models, and Foldseek comparisons gave rapid confirmation
of matches to CATH relatives [34]. We found that 92%, on average, of domains could be mapped
to 3253 CATH superfamilies (out of 5600). We see that the proportion of residues in compact
globular domains varies according to the organism, with well-studied model organisms having
higher proportions of residues assigned to globular regions (ranging from 32% for Leishmania
infantum to 76% for Escherichia coli).

By classifying good-quality AF2 models into CATH, we can expand the number of structurally
characterized domains by ~67%, and our knowledge of fold groups in superfamilies (structurally
similar relatives which can be well superposed) increases by ~36% to 14 859 [16]. As with other
recent studies of AF2 models [15], we observe the greatest expansion in global fold groups
for mainly-alpha proteins (2.5-fold). Fewer than 5% of CATH superfamilies (~250) are highly
populated, accounting for 57.7% of all domains [8], and in these so-called MEGAfamilies AF2
considerably increases the structural diversity, with some superfamilies now identified as having
more than 1000 different fold groups, suggesting considerable structural plasticity outside the
common structural core.

Our analyses identified 2367 putative novel families [16]. However, detailed manual analyses of
618 human AF2 structures revealed problematic features in the models, and some very distant
homologies, with only 25 superfamilies verified as novel, suggesting that the majority of domain
superfamilies may already be known. It is even likely that, as we bring more relatives into the
AF2 superfamilies, links between current CATH superfamilies will be established and the number
of superfamilies reduced. Indeed, most of the 25 new superfamilies identified possess domain
structures with very similar architectures to those in existing CATH superfamilies; mainly-alpha
structures (both orthogonal and up–down bundles) were particularly common, as were small
alpha–beta two-layer sandwiches and mainly-beta barrels.

Biological discoveries enabled by AF2 data

The availability of off-the-shelf solutions based on AF2 – both as a tool (ColabFold^xii [42], AF2 [22],
AF-Multimer [86]) and as a collection of precomputed models (AFDB [35]) – is akin to the introduction
of next-generation sequencing in small research groups enabled by nanopore sequencing.
Suddenly, work on almost any protein of interest, in projects ranging from medical to environmental
research, is no longer held back by a lack of experimentally derived structures in the PDB.

Although AF2 solves many challenging issues in structural modeling, its limitations have been
rapidly identified by the community. All models have accompanying scores for each residue,
indicating several aspects about the confidence of the prediction. For example, pLDDT gives

Figure 4. New folds in CATH-AlphaFold 2 (AF2). Examples of novel folds previously not encountered in CATH or
Protein Data Bank (PDB). Structures are identified as novel folds if they have no significant structural similarity to domains
or structures in the PDB using Foldseek as a comparison method. Each structure identifier is in the format UniProt_ID/
start–stop with its current name in UniProt.

Trends in Biochemical Sciences, April 2023, Vol. 48, No. 4 353

Trends in Biochemical Sciences

the confidence for a particular residue, and predicted align error reflects inter-residue distances
and local structural environments. Other measures are also provided. Models scoring below an
average pLDDT value of 70 and containing large portions with incorrectly oriented secondary
structure segments are unsuitable for most biological applications and do not reach the quality
of experimentally derived structures [22]. Most issues could be related to the nature of the
MSA the model is built upon, as shallowness or gaps in the alignment often result in a poor
model [22,87].
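
As a rough illustration of the pLDDT threshold discussed above, the following Python sketch reads per-residue pLDDT values from a PDB-format AF2 model and reports their average. It assumes that pLDDT is stored in the B-factor column, as in AF2 PDB-format outputs such as those distributed by AFDB. The function names `residue_plddt` and `summarize` are hypothetical, and the check covers only the average score, not the orientation of secondary structure segments.

```python
from statistics import mean


def residue_plddt(pdb_text):
    """Return {(chain, residue_number): pLDDT} read from the CA atoms of a PDB-format model.

    Assumption: the B-factor column (columns 61-66) holds the per-residue pLDDT.
    """
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-column PDB format: atom name in columns 13-16, chain in column 22,
        # residue number in columns 23-26, B-factor in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            resnum = int(line[22:26])
            scores[(chain, resnum)] = float(line[60:66])
    return scores


def summarize(pdb_text, threshold=70.0):
    """Report the mean pLDDT and the residues scoring below `threshold`."""
    scores = residue_plddt(pdb_text)
    if not scores:
        raise ValueError("no CA atoms found in the input")
    avg = mean(scores.values())
    low = sorted(key for key, value in scores.items() if value < threshold)
    return {"mean_plddt": avg, "below_threshold": low, "passes": avg >= threshold}
```

Calling `summarize(open(path).read())` on a downloaded model file (with `path` supplied by the user) returns the average score together with the list of low-confidence residues, which can then be inspected alongside the PAE before the model is used further.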

Furthermore, overrepresentation of proteins with a particular folding state results in a model that is
not representative of other alternative states [88]. Some models with low pLDDT point to IDRs
that undergo disorder-to-order transition upon binding or are prone to fold-switching [89–91].
Other features that may be available for experimental structures are missing from AF2 models,
such as ions, cofactors, ligands, and post-translational modifications (PTMs) [92].

While some effects of sequence variants are captured by AF2, others – in particular point mutations
or single amino acid variants – remain elusive, partly because each prediction constitutes a
family-averaged rather than a sequence-specific solution, owing to the MSA underlying it
[78,93].

Small- and medium-scale applications of AF2

With these caveats and limitations taken into account, AF2 has enabled both small- and large-scale
applications to biological questions. The sudden availability of a reliable model relieved many
research groups from long-term structural characterization efforts, allowing for targeted answers
in conformational studies [94,95], oligomerization prediction [96,97], drug channel conformations
[98], and early-stage assembly of complexes in disease [99]. Predictions derived from AF2
models helped in validating experimentally derived structures and complexes [100], aided in
solving X-ray crystal structures by molecular replacement [101], and replaced
experimental characterization entirely where it fails in particularly tough cases [98]. Transmembrane
proteins, in particular, are not easily solved by X-ray crystallography, so AF2 in combination
with other techniques such as NMR is being used as an orthogonal validation for experiments
where particular conformations of import channels were unclear [98].

Large-scale applications of AF2

Large-scale applications of AF2 and AF-Multimer are creating entirely novel resources (AFDB
[35]), complementing or expanding already established ones (CATH [8], APPRIS [102], and
Membranome [103]), or enabling more focused collections and analyses, such as the characterization
of the ‘metallome’ by identifying all metal-binding sites across proteomes [104], shedding
light on the human dark proteome [105], or improving annotation of the human genome
through comparison of the predicted structures of 140k isoforms [106]. AF2 models have now also
been released for neglected tropical diseases, which should advance research on these often
underfunded or ignored diseases. The recent release of protein structure models for the whole
of UniProt will also enable large-scale analyses across the Tree of Life such as evolutionary studies
on domain archaeology, among others.

Unlocking new deep-learning venues

Thanks to the increase in high-quality structure predictions spawned by AF2, there is an increasing
need for prediction methods that readily leverage 3D information. Instead of using representations
that first map 3D structures to 2D (e.g., contact maps) or 1D (e.g., secondary structure)
before feeding them to a predictor, such methods use networks that operate directly on 3D representations of
macromolecules to make predictions about their properties. Using so-called ‘inductive bias’
when designing a network (i.e., incorporating domain knowledge directly into the architecture)
avoids information loss during abstraction and enables the network to directly learn useful information
from the raw 3D data itself. Recently, geometric deep-learning research [107], which
focuses on methods handling complex representations like graphs, has seen a steady increase
in adoption, accuracy, and potential opportunities [108]. Protein 3D structures are naturally fit
for geometric deep-learning approaches, whether for supervised tasks, like the prediction of
molecule binding [109], or unsupervised learning approaches, which could generate alternatives
to learned representations from pLMs [110,111]. Geometric deep-learning approaches stand to
benefit the most from large putative 3D structure sets, potentially unlocking further opportunities
for alternative, unsupervised protein representations derived from structure, or deep-learning-based
potentials to substitute expensive molecular dynamics simulations for molecular docking.

A leveling effect

AF2 will help in eliminating an underlying bias in structural biology that has tended to focus more
on drug discovery for human diseases, model organisms, or structures involved in pathogens.
With its low footprint and cost compared to traditional experimental means of characterizing
protein 3D structure, AF2 is constrained neither by access to expensive experimental instruments
nor by beam-time, which is usually prioritized for large consortia. This will enable groups across
the world to work on their proteins of interest without geographical or economic limitations. In
a similar fashion to nanopore sequencing, model building could be done in real time for issues
that are time- or location-sensitive (e.g., emerging pandemics, neglected tropical diseases), or
limited to an individual, potentially allowing a transition from whole-genome sequencing to
whole-proteome modeling, drug binding, and efficacy profiling.

Outstanding questions

The best structural models rely on quality and availability of protein structures found in nature. How can we further improve these deep-learning methods without relying on previous structural knowledge?

How much structural novelty is hidden in metagenomes? Are we close to discovering all ways nature can shape a protein?

Could protein modeling entirely replace experimentally derived structures?

How can we use these methods to engineer a highly efficient enzyme function never encountered in living organisms?

Speed, accuracy, and coverage of structure predictions and methods are skyrocketing. How can these tools be improved to better probe the dark universe of uncharacterized proteins?

Could future methods build complexes of protein–protein interactions and models quickly and accurately enough to be used for precision medicine?

Can we use embeddings-based methods to predict evolution in sequences?

Can today’s deep-learning structure-prediction methods capture dynamic changes in structure?

Concluding remarks and future perspectives

Computationally predicting protein properties with increasing accuracy using deep learning
[22,53,57,58] remains crucial to build structures and assemblies that assist researchers in
uncovering cellular machinery. Conveniently, the better these predictors become, the more interesting
it is to hijack them to design new proteins that perform desired functions [112]. Recently,
deep-learning approaches emerged that ‘hallucinate’ new proteins [113] which systems like
AF2 confirm may fold into plausible structures. These tools can generate new protein sequences
from start to finish or, similarly to text autocompletion, conditioned on part of a given sequence
input [112], all within milliseconds of runtime. Coupled with blazingly fast predictors [43,114,115],
millions of potentially foldable, ML-generated sequences can be screened reliably in silico, saving
energy, time, and resources, requiring in vitro and in vivo experiments only at the most exciting
stages of the experimental discovery process. Whilst not all designs fold, and caution is needed,
an approach similar to spell correction in NLP but trained on millions of protein sequences allowed
researchers to evolve and optimize existing antibodies to better perform a desired activity [116].
Additionally, approaches that generate protein sequences from 3D structure (in some sense,
the opposite direction of the classical folding problem) will become increasingly important in
the post-AF2 era [117]. By selecting for sequence diversity conditioned on structure, new
candidates for families may be found.

With booming putative structure databases, we see the emergence of analytical approaches
that use them much as UniProt’s mix of curated (SwissProt) and putative (TrEMBL)
sequence databases [65] is being used. In part, we can build on years of advances in maintaining
and searching sequence databases (e.g., to extract evolutionary relationships) to create tools
to analyze structure databases instead, with performant tools already available [34]. However,
mainstreaming structural analysis on billions of entries will require domain-specific infrastructure
and tooling. Geometric deep learning may also assist this modality by bringing new unsupervised
solutions similar to pLMs but trained on protein 3D structures instead [110]. With reliable solutions
in this space, we expect practitioners to combine sequence and structural analysis.
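
To give a concrete picture of what operating on protein 3D structure can mean for geometric approaches, the Python sketch below turns a list of Cα coordinates into a residue graph whose edges connect spatially close residues. It is only an illustration under stated assumptions: the function name `contact_graph`, the 8 Å cutoff, and the `min_separation` parameter are hypothetical choices, not taken from the methods cited here.

```python
import math


def contact_graph(ca_coords, cutoff=8.0, min_separation=1):
    """Build an adjacency list linking residues whose CA atoms lie within `cutoff` angstroms.

    ca_coords: list of (x, y, z) tuples, one per residue, in sequence order.
    min_separation: minimum sequence distance between two linked residues
    (assumed parameter; 1 keeps sequence neighbours as edges).
    Returns {residue_index: [(neighbour_index, distance), ...]}.
    """
    n = len(ca_coords)
    graph = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + min_separation, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            if d <= cutoff:
                # Undirected edge, stored in both directions with its length.
                graph[i].append((j, d))
                graph[j].append((i, d))
    return graph
```

Unlike a binary 2D contact map, such a graph can carry edge distances (and, in richer variants, node features such as residue type), so a downstream model sees the geometry rather than only a contact pattern.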

Within the realm of ‘traditional’ pLMs, which learn information solely from large unlabeled protein
sequence databases, there is still room for improvement, as highlighted by recent advances in
NLP. For example, some approaches optimize the efficiency of LMs, especially on long
sequences, either by modifying the existing attention mechanism [118] or by proposing a
completely different solution that does not rely on the de facto standard (attention) [119]. Orthogonal
to such architectural improvements, recent research highlights the importance of hyperparameter
optimization [120], moving away from the constant increase in model size and instead
suggesting that ‘smaller’ models (which still have billions of parameters) be trained on more
samples. Taken together, these improvements hold the potential to improve today’s sequence-based
pLMs further.

Ultimately, we see a plethora of effective and efficient ML tools operating on different modalities,
each with unique strengths and weaknesses, becoming available at researchers’ fingertips.
Further developments in structure- and sequence-based approaches are inevitably needed
(see Outstanding questions), yet, even today, combining different ML and software solutions
will bring researchers to an untapped world of novel mechanisms that await discovery.

Acknowledgments
N.B. acknowledges funding from the Wellcome Trust Grant 221327/Z/20/Z. C.R. acknowledges funding from the BBSRC
grant BB/T002735/1. M.S. and S.K. acknowledge support from the National Research Foundation of Korea (NRF), grants
(2019R1-A6A1-A10073437, 2020M3-A9G7-103933, 2021-R1C1-C102065, and 2021-M3A9-I4021220), Samsung DS
research fund program, and the Creative-Pioneering Researchers Program through Seoul National University. This work
was additionally supported by the Bavarian Ministry of Education through funding to the TUM and by a grant from the
Alexander von Humboldt foundation through the German Ministry for Research and Education (Bundesministerium für
Bildung und Forschung, BMBF), by two grants from BMBF (031L0168117 and program ‘Software Campus 2.0 (TUM)
2.0’ 01IS17049) as well as by a grant from Deutsche Forschungsgemeinschaft (DFG-GZ: RO1320/4-1).

Declaration of interests
No interests are declared.


Trends in Biochemical Sciences, April 2023, Vol. 48, No. 4 359

