Review
These structural domains – often dubbed ‘folds’ – recur in nature [4], and have been estimated to be limited to a number in the order of thousands [5]. Folds resemble Plato’s allegory of the cave: more the image, idea, or concept than the real object (Plato, Politeia [6]); this image helps to map relations between proteins.

Various resources emerged to classify domain structures into evolutionary families and fold groups (e.g., SCOPii [7], CATHiii [8], SCOPeiv [9], and ECODv [10]), and these have saturated at about 5000 structural families and about 1300 folds over the past decade, despite structural genomics initiatives targeting proteins likely to have new folds [11]. As increasingly powerful sequence profile methods [12–14] have identified structural families in completely sequenced organisms (complete proteomes), studies suggest that up to 70% of all domains resemble those already classified in SCOP or CATH [3,15–17]. The distribution of family sizes follows a power law: most families/folds are small or species-specific, but a few hundred are very highly populated, tend to be universal across species, and have important functions [8]. In parallel,
Trends in Biochemical Sciences, April 2023, Vol. 48, No. 4 https://doi.org/10.1016/j.tibs.2022.11.001 345
© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
flexible or intrinsically disordered regions (IDRs) [18], making up 20–30% of all residues in a given proteome, have been associated with protein function [19–21]. As much as structural domains can be thought of as the structural units of proteins, it is the millions of domain combinations that create the immense diversity of functional repertoires.

Since the details of function for most proteins in most organisms remain uncharacterized, understanding how domains evolve and combine to modify function would be a major step in our quest
to understand and engineer biology. Protein structure data can provide a waymark, and exciting
advances in structure prediction over the past year suggest that a landmark has been reached
[22]. While structure prediction has steadily improved over time, thanks to the exponential growth in protein sequence data and covariation methods, this new era was kick-started by the remarkable performance of AlphaFoldvi (AF) at CASP13 [23]. The method, though, was not made available to the scientific community, resulting in various groups trying to replicate the features behind
its breakthrough performance. Methods that were previously state-of-the-art released new ver-
sions based on these advancements, such as RoseTTAFold [24] and PREFMD [25]. DeepMind’s
AF2 outperformed others in CASP14 [26], and reports suggest that high-quality models can be
comparable to crystallographic structures, with competing methods reproducing DeepMind’s
results [24,27,28]. DeepMind has recently announced the availability of 214 million putative
protein structures for the whole of UniProt, which are available through the AF Database [29]
and 3D-Beacons platform at the European Bioinformatics Institute (EBI). The latter provides AF
models and other models by other prediction methods [30]. This 1000-fold increase in structural
data requires equally transformative developments in methods for processing and analyzing the
mix of experimental and putative structure/sequence data, including methods reliably predicting
aspects of function from sequence alone [31–33], and methods to quickly sift through putative
structures [34].
In this review, we consider recent developments in deep learning, a branch of ML (see Glossary)
operating on sequence and structure comparisons that enable highly sensitive detection of dis-
tant relationships between proteins. These will allow us to harness important insights on putative
structure space, on domain combinations, and on the extent and role of disorder. One important
observation from this review: no single modality has all the answers. Instead, protein sequence,
evolutionary information, latent embeddings from pLMs, and structure information all play
key roles in helping to uncover how proteins fold and act. Application of these tools has enabled
rapid evolutionary classification of good quality AF2 models (defined as AF2 Predicted Local Dis-
tance Difference Test (pLDDT) ≥ 70 [22]) for 21 model organisms, including human (Homo
sapiens), mouse (Mus musculus), rat (Rattus norvegicus), and thale cress (Arabidopsis thaliana)
[35]. We review the insights derived from these studies and the future opportunities they bring
for understanding the links between protein structural arrangements and their functions.
Figure 1. Overview of embeddings applications in protein structure and function characterization. Images were
retrieved from Wikipedia (alpha helix and beta strand, binding sites), Creative Proteomics (cell structure), and bioRxiv
(Transmembrane Regions - CASP13 Target T1008, Structure Prediction - PDB 1Q9F) with permission from the authors.
Leap in protein structure prediction combines ML with evolutionary information and hardware
Considered 2021’s method of the year [62], AF2 [22] combines advanced deep learning with
evolutionary information from larger MSAs – obtained from the BFD with 2.1 billion sequences
[63] or MGnifyvii [64] with 2.4 billion sequences, as opposed to UniProtviii with 231 million [65] –
and more potent computer hardware to make major advances in protein structure prediction,
providing good quality models for at least 50% of the likely globular domains in UniProt
sequences. All top structure prediction methods, including AF2 and RoseTTAFold, rely on
evolutionary couplings (EV) [66] extracted from MSAs. These approaches detect protein residues that are in close proximity and coevolve. Preprocessing of this information has advanced crucially over the past decades, for example, through direct coupling analysis sharpening the signal [67,68]. Although the leap of AF2 required this foundation in 2021, future
advances may build their models on a different foundation [43].
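The coupling signal these methods exploit can be illustrated with a toy mutual-information calculation over alignment columns, a minimal sketch (the MSA and column choices are invented for illustration; real pipelines use direct coupling analysis to disentangle transitive correlations):

```python
from collections import Counter
from math import log2

def column_mi(msa, i, j):
    """Mutual information between alignment columns i and j:
    high MI suggests the two positions co-vary (a possible contact)."""
    n = len(msa)
    pi = Counter(seq[i] for seq in msa)
    pj = Counter(seq[j] for seq in msa)
    pij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Toy MSA: columns 0 and 2 co-vary perfectly (A<->Y, S<->F),
# while column 1 is conserved and carries no coupling signal.
msa = ["AGY", "AGY", "SGF", "SGF", "AGY", "SGF"]
print(round(column_mi(msa, 0, 2), 3))  # -> 1.0 (perfect covariation)
print(round(column_mi(msa, 0, 1), 3))  # -> 0.0 (no covariation)
```

In practice, such raw correlation scores are noisy, which is why the direct-coupling-style corrections mentioned above were crucial.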
Figure 2. Comparison of search sensitivity and speed for language models, sequence/profile–profile and structure aligners. For each SCOP40e (version 2.01) [9] domain, we compute the fraction of detected true positives for family, superfamily, and fold up to the fifth false positive (= different fold), and plot the average sensitivity over the domains (x-axis) against the average search time for a single query against 100 million proteins (y-axis).
protein folds. For instance, by mining structures with a widely used structural alignment tool
(DALI) [69], a new member of the perforin/gasdermin (GSDM) pore-forming family in humans
was identified in spite of having only 1% sequence identity with the GSDM family [70,71]. Further-
more, the expanded search to all proteomes, covering 356 000 predicted structures, discovered
16 novel perforin-like proteins [72].
One way to compress structural information is to break structures into fixed-size fragments.
Geometricus [77] represents proteins as a bag of shape-mers: fixed-sized structural fragments
described as moment invariants. It was used to cluster the AFDB and PDB using non-negative
matrix factorization to identify novel groups of protein structures [78]. RUPEE [79] is another
method that breaks structures into structural fragments. It discretizes protein structures by their
backbone torsion angles and then compares the Jaccard similarity of bags of torsion fragments
of the two structures. The top 8000 hits are then realigned by TM-align [74] in top-align mode.
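This bag-of-fragments comparison can be sketched in a few lines, assuming the backbone torsion angles have already been discretized into letter states (the fragment length and the 'h'/'e' state labels below are invented stand-ins; RUPEE's actual discretization and ranking are more elaborate):

```python
def fragment_bag(torsion_states, k=3):
    """Bag (set) of overlapping length-k fragments of a structure whose
    backbone torsion angles were discretized into letter states."""
    return {torsion_states[i:i + k] for i in range(len(torsion_states) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two fragment bags: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

# Hypothetical discretized backbones: 'h' helix-like, 'e' strand-like bins.
s1 = fragment_bag("hhhheeeehhhh")
s2 = fragment_bag("hhhheeee")
print(round(jaccard(s1, s2), 2))  # -> 0.67
```

Because bags discard fragment order, this comparison is very fast but approximate, which is why top hits are then realigned with a full structural aligner.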
Another category of tools represents tertiary structure as discretized volumes and compares
these. BioZernike [80], for example, approximates volumes through 3D Zernike descriptors and
compares these by a pretrained distance function. 3D-AF-Surfer [15] also applies 3D Zernike
descriptors followed by a support vector machine (SVM) trained to calculate the probability of
two structures being in the same fold. Results are ranked by the SVM scores, while individual
hits can be realigned using combinatorial extension (CE) [75].
The fastest category of structural aligners represents structures as sequences of a discrete structural alphabet. Most of these alphabets discretize the backbone angles of the structure [81–83], albeit at a loss of information in the structured regions. Another type of structural discretization
to a sequence was proposed by Foldseekx [34]. It describes tertiary residue–residue interactions
as a discrete alphabet. It locally aligns the structure sequences using the fast MMseqs2xi algo-
rithm [84]. Foldseek achieves the sensitivity of a state-of-the-art structural aligner like TM-align,
while being at least 20 000 times faster.
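Once structures are encoded as letter strings, standard sequence algorithms apply directly. The sketch below runs a plain Smith–Waterman local alignment over such 'structure sequences' (the states and scoring are invented stand-ins; Foldseek uses a learned 3Di alphabet with a tuned substitution matrix inside the MMseqs2 engine):

```python
def local_align_score(s, t, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score between two structures
    encoded as strings over a discrete structural alphabet."""
    prev = [0] * (len(t) + 1)
    best = 0
    for a in s:
        curr = [0]
        for j, b in enumerate(t, 1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best

# Two hypothetical discretized structures sharing a local motif 'QWQW'.
print(local_align_score("AAQWQWAA", "CCQWQWCC"))  # -> 8 (4 matches x 2)
```

The payoff of this representation is that decades of fast sequence-search machinery (k-mer prefilters, vectorized alignment) carry over to structure search unchanged.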
Sequence-based structural alignment tools are well equipped to handle the upcoming avalanche
of predicted protein structures. Efficient storage of structure information and queries against
these makes searches against hundreds of millions of structures feasible. Representing struc-
tures as sequences allows us to also adapt fast clustering algorithms like Linclust [84] to compare
billions of structures within a day. We expect current tools to further increase in sensitivity to
match or exceed the performance of DALI [69] (Figure 2).
Embeddings from pLMs in combination with fast structural aligners (Foldseek) could be orthog-
onal in covering, classifying, and validating assignments in large swaths of protein fold space,
as shown in Figure 3.
Figure 3. Visual analysis of the structure space spanned by CATH domains expanded by AlphaFold 2 (AF2) models. We showcase how distance in either
structure (left) or embedding space (middle and right) can be used to gain insight into large sets of proteins. Simply put, we used pairwise distance between proteins to
summarize ~850 000 protein domains in a single 2D plot and colored them according to their CATH class and architecture. This exemplifies a general-purpose tool for
breaking down the complexity of large sets of proteins and allows, for example, detection of large-scale relationships that would otherwise be hard to find, or to detect
outliers. More specifically, ~850 000 domains were structurally aligned using Foldseek [34] (left) in an all-versus-all fashion, with superfamily distances derived from the average pairwise bitscore within each superfamily. The domain sequences were converted to embeddings using the ProtT5 (center) and ProtTucker (right) protein language models (pLMs). Analogously to the structural approach, the distance matrix between superfamilies was calculated using the average Euclidean distance between embeddings belonging to different superfamilies. Using different modalities (i.e., structure and sequence embeddings) for computing distances on the same set of proteins provides different, potentially orthogonal angles on the same problem, which can be helpful during hypothesis generation. The resulting
distance matrices were used as precomputed inputs for uniform manifold approximation and projection (UMAP) [121] and plotted with seaborn [122].
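The superfamily-level distance matrix described in the caption can be sketched as follows, a toy version with made-up 2D ‘embeddings’ and hypothetical superfamily labels (real ProtT5 embeddings are high-dimensional vectors, and the matrix would then be passed to UMAP as a precomputed metric):

```python
from itertools import product
from math import dist  # Euclidean distance (Python >= 3.8)

# Hypothetical per-domain embeddings grouped by superfamily label.
embeddings = {
    "sf_A": [(0.0, 0.0), (0.2, 0.1)],
    "sf_B": [(3.0, 4.0), (3.1, 4.2)],
}

def superfamily_distance(sf1, sf2):
    """Average pairwise Euclidean distance between the embeddings of
    two superfamilies, as used for the precomputed UMAP input."""
    pairs = list(product(embeddings[sf1], embeddings[sf2]))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

labels = sorted(embeddings)
matrix = [[superfamily_distance(a, b) for b in labels] for a in labels]
print(matrix[0][1])  # ~5.0: sf_A and sf_B lie far apart in embedding space
```

Averaging over all cross-superfamily pairs smooths out individual outlier domains, at the cost of hiding within-superfamily structural diversity.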
We recently analyzed the proportion of predicted AF2 structural domains in the 21 model organ-
isms that could be assigned to known superfamilies in CATH [16]. Only good-quality models were
analyzed according to a range of criteria (pLDDT ≥70, large proportions of ordered residues,
characteristic packing of secondary structure). We used well-established hidden Markov model
(HMM)-based protocols and a novel deep-learning method (CATHe [61] based on ProtT5 [47])
to detect domain regions in the AF2 models, and Foldseek comparisons gave rapid confirmation
of matches to CATH relatives [34]. We found that 92%, on average, of domains could be mapped
to 3253 CATH superfamilies (out of 5600). We see that the proportion of residues in compact
globular domains varies according to the organism, with well-studied model organisms having
higher proportions of residues assigned to globular regions (ranging from 32% for Leishmania
infantum to 76% for Escherichia coli).
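The first of these quality filters can be sketched in a few lines, assuming per-residue pLDDT values have already been extracted (AF2 writes them into the B-factor column of its PDB files). The pLDDT ≥ 70 cut-off follows [22]; the ordered-fraction threshold below is an arbitrary placeholder, not the exact criterion used:

```python
def passes_quality_filter(plddt, min_mean=70.0, min_ordered_frac=0.5):
    """Keep a model only if its mean pLDDT is high enough and enough
    residues are individually confident (a proxy for 'ordered')."""
    mean = sum(plddt) / len(plddt)
    ordered = sum(1 for p in plddt if p >= min_mean) / len(plddt)
    return mean >= min_mean and ordered >= min_ordered_frac

print(passes_quality_filter([90, 85, 92, 88]))      # -> True
print(passes_quality_filter([40, 35, 90, 30, 20]))  # -> False
```

Note that a per-residue check matters alongside the mean: a model can average above 70 while still containing long low-confidence (often disordered) stretches.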
By classifying good-quality AF2 models into CATH, we can expand the number of structurally
characterized domains by ~67%, and our knowledge of fold groups in superfamilies (structurally
similar relatives which can be well superposed) increases by ~36% to 14 859 [16]. As with other
recent studies of AF2 models [15], we observe the greatest expansion in global fold groups
for mainly-alpha proteins (2.5-fold). Less than 5% of CATH superfamilies (~250) are highly
populated, accounting for 57.7% of all domains [8], and in these so-called MEGAfamilies AF2
considerably increases the structural diversity, with some superfamilies now identified as having
more than 1000 different fold groups, suggesting considerable structural plasticity outside the
common structural core.
Our analyses identified 2367 putative novel families [16]. However, detailed manual analyses of
618 human AF2 structures revealed problematic features in the models, and some very distant
homologies, with only 25 superfamilies verified as novel, suggesting that the majority of domain
superfamilies may already be known. It is even likely that, as we bring more relatives into the
AF2 superfamilies, links between current CATH superfamilies will be established and the number
of superfamilies reduced. Indeed, most of the 25 new superfamilies identified possess domain
structures with very similar architectures to those in existing CATH superfamilies; mainly-alpha
structures (both orthogonal and up–down bundles) were particularly common, as were small
alpha–beta two-layer sandwiches and mainly-beta barrels.
Caveats
Although AF2 solves many challenging issues in structural modeling, its limitations have been
rapidly identified by the community. All models have accompanying scores for each residue,
indicating several aspects of the confidence of the prediction. For example, pLDDT gives the confidence for a particular residue, and the predicted aligned error (PAE) reflects inter-residue distances and local structural environments. Other measures are also provided.

Figure 4. New folds in CATH-AlphaFold 2 (AF2). Examples of novel folds previously not encountered in CATH or the Protein Data Bank (PDB). Structures are identified as novel folds if they have no significant structural similarity to domains or structures in the PDB using Foldseek as a comparison method. Each structure identifier is in the format UniProt_ID/start–stop, with its current name in UniProt.

Models scoring below an
average pLDDT value of 70 and containing large portions with incorrectly oriented secondary
structure segments are unsuitable for most biological applications and do not reach the quality
of experimentally derived structures [22]. Most issues could be related to the nature of the
MSA the model is built upon, as shallowness or gaps in the alignment often result in a poor
model [22,87].
Furthermore, overrepresentation of proteins with a particular folding state results in a model that is
not representative of other alternative states [88]. Some models with low pLDDT point to IDRs
that undergo disorder-to-order transition upon binding or are prone to fold-switching [89–91].
Other features that may be available for experimental structures are missing from AF2 models,
such as ions, cofactors, ligands, and post-translational modifications (PTMs) [92].
While some effects of sequence variants are captured by AF2, others – in particular point mutations or single amino acid variants – remain elusive to AF2, partly because predictions constitute a family-averaged rather than a sequence-specific solution, owing to the MSA underlying each prediction [78,93].
when designing a network (i.e., incorporating domain knowledge directly into the architecture) avoids information loss during abstraction and enables the network to directly learn useful information from the raw 3D data itself. Recently, geometric deep-learning research [107], which focuses on methods handling complex representations like graphs, has seen a steady increase in adoption, accuracy, and potential opportunities [108]. Protein 3D structures are a natural fit for geometric deep-learning approaches, whether for supervised tasks, like the prediction of molecule binding [109], or unsupervised learning approaches, which could generate alternatives to learned representations from pLMs [110,111]. Geometric deep-learning approaches stand to benefit the most from large sets of putative 3D structures, potentially unlocking further opportunities for alternative, unsupervised protein representations derived from structure, or deep-learning-based potentials to substitute expensive molecular dynamics simulations for molecular docking.

A leveling effect
AF2 will help in eliminating an underlying bias in structural biology that has tended to focus more on drug discovery for human diseases, model organisms, or structures involved in pathogens. With its small footprint and cost, compared to traditional experimental means to characterize protein 3D structure, AF2 is constrained neither by access to expensive experimental instruments, nor by beam-time usually prioritized for large consortia. This will enable groups across the world to work on their proteins of interest without geographical or economic limitations. In a similar fashion to nanopore sequencing, model building could be done in real time for issues that are time- or location-sensitive (i.e., emerging pandemics, neglected tropical diseases), or limited to an individual, potentially allowing a transition from whole-genome sequencing to whole-proteome modeling, drug binding, and efficacy profiling.

Outstanding questions
The best structural models rely on quality and availability of protein structures found in nature. How can we further improve these deep-learning methods without relying on previous structural knowledge?

How much structural novelty is hidden in metagenomes? Are we close to discovering all ways nature can shape a protein?

Could protein modeling entirely replace experimentally derived structures?

How can we use these methods to engineer a highly efficient enzyme function never encountered in living organisms?

Speed, accuracy, and coverage of structure predictions and methods are skyrocketing. How can these tools be improved to better probe the dark universe of uncharacterized proteins?

Could future methods build complexes of protein–protein interactions and models quickly and accurately enough to be used for precision medicine?

Can we use embeddings-based methods to predict evolution in sequences?

Can today’s deep-learning structure-prediction methods capture dynamic changes in structure?

Concluding remarks and future perspectives
Computationally predicting protein properties with increasing accuracy using deep learning [22,53,57,58] remains crucial to build structures and assemblies that assist researchers in uncovering cellular machinery. Conveniently, the better these predictors become, the more interesting it is to hijack them to design new proteins that perform desired functions [112]. Recently, deep-learning approaches emerged that ‘hallucinate’ new proteins [113], which systems like AF2 confirm may fold into plausible structures. These tools can generate new protein sequences from start to finish or, similarly to text autocompletion, conditioned on part of a given sequence input [112], all within milliseconds of runtime. Coupled with blazingly fast predictors [43,114,115], millions of potentially foldable, ML-generated sequences can be screened reliably in silico, saving energy, time, and resources, and requiring in vitro and in vivo experiments only at the most exciting stages of the experimental discovery process. Whilst not all designs fold, and caution is needed, an approach similar to spell correction in NLP but trained on millions of protein sequences allowed researchers to evolve and optimize existing antibodies to better perform a desired activity [116]. Additionally, approaches that generate protein sequences from 3D structure (in some sense, the opposite direction of the classical folding problem) will become more and more important in the post-AF2 era [117]. By selecting for sequence diversity conditioned on structure, new candidates for families may be found.
With booming putative structure databases, we see the emergence of analytical approaches
leveraging a model similar to how UniProt’s mix of curated (SwissProt) and putative (TrEMBL) [65] sequence databases is being used. In part, we can build on years of advances in maintaining and searching sequence databases (e.g., to extract evolutionary relationships) to create tools
to analyze structure databases instead, with performant tools already available [34]. However,
mainstreaming structural analysis on billions of entries will require domain-specific infrastructure
and tooling. Geometric deep learning may also assist this modality by bringing new unsupervised
solutions similar to pLMs but trained on protein 3D structures instead [110]. With reliable solutions
in this space, we expect practitioners to combine sequence and structural analysis.
Within the realm of ‘traditional’ pLMs, that solely learn information from large unlabeled protein
sequence databases, there is still room for improvement, as highlighted by recent advances in
NLP. For example, there are approaches optimizing the efficiency of LMs, especially on long sequences, either by modifying the existing attention mechanism [118] or by proposing a completely different solution not relying on the de facto standard (attention) [119]. Orthogonal to such architectural improvements, recent research highlights the importance of hyperparameter optimization [120], which moves away from the constant increase in model size and instead suggests training ‘smaller’ models (still with billions of parameters) on more samples. Taken together, these improvements hold the potential to further improve today’s sequence-based pLMs.
Ultimately, we see that a plethora of effective and efficient ML tools operating on different modalities, each with unique strengths and weaknesses, is becoming available at researchers’ fingertips.
Further developments in structure- and sequence-based approaches are inevitably needed
(see Outstanding questions), yet, even today, combining different ML and software solutions
will bring researchers to an untapped world of novel mechanisms that await discovery.
Acknowledgments
N.B. acknowledges funding from the Wellcome Trust Grant 221327/Z/20/Z. C.R. acknowledges funding from the BBSRC
grant BB/T002735/1. M.S. and S.K. acknowledge support from the National Research Foundation of Korea (NRF), grants
(2019R1-A6A1-A10073437, 2020M3-A9G7-103933, 2021-R1C1-C102065, and 2021-M3A9-I4021220), Samsung DS
research fund program, and the Creative-Pioneering Researchers Program through Seoul National University. This work
was additionally supported by the Bavarian Ministry of Education through funding to the TUM and by a grant from the
Alexander von Humboldt foundation through the German Ministry for Research and Education (Bundesministerium für
Bildung und Forschung, BMBF), by two grants from BMBF (031L0168117 and program ‘Software Campus 2.0 (TUM)
2.0’ 01IS17049) as well as by a grant from Deutsche Forschungsgemeinschaft (DFG-GZ: RO1320/4-1).
Declaration of interests
No interests are declared.
Resources
i www.wwpdb.org/
ii https://scop2.mrc-lmb.cam.ac.uk/
iii www.cathdb.info/
iv https://scop.berkeley.edu/
v http://prodata.swmed.edu/ecod/
vi www.deepmind.com/blog/putting-the-power-of-alphafold-into-the-worlds-hands
vii www.ebi.ac.uk/metagenomics/
viii www.uniprot.org/
ix www.alphafold.ebi.ac.uk/
x https://search.foldseek.com/
xi https://github.com/soedinglab/MMseqs2
xii https://github.com/sokrypton/ColabFold
References
1. wwPDB consortium (2019) Protein Data Bank: the single global 3. Orengo, C.A. and Thornton, J.M. (2005) Protein families and
archive for 3D macromolecular structure data. Nucleic Acids their evolution—a structural perspective. Annu. Rev. Biochem.
Res. 47, D520–D528 74, 867–900
2. Liu, J. and Rost, B. (2004) CHOP proteins into structural 4. Chothia, C. (1992) Proteins. One thousand families for the
domain-like fragments. Proteins 55, 678–688 molecular biologist. Nature 357, 543–544
5. Orengo, C.A. et al. (1994) Protein superfamilies and domain 34. van Kempen, M. et al. (2022) Foldseek: fast and accurate pro-
superfolds. Nature 372, 631–634 tein structure search. bioRxiv Published online September 20,
6. Sweeney, L. and St Louis University (1971) ‘The Republic of 2022. https://doi.org/10.1101/2022.02.07.479398
Plato’, translated with notes and an interpretative essay by 35. Varadi, M. et al. (2022) AlphaFold Protein Structure Database:
Allan Bloom. Mod. Sch. 48, 280–284 massively expanding the structural coverage of protein-
7. Murzin, A.G. et al. (1995) SCOP: a structural classification of sequence space with high-accuracy models. Nucleic Acids
proteins database for the investigation of sequences and Res. 50, D439–D444
structures. J. Mol. Biol. 247, 536–540 36. Hamp, T. et al. (2013) Homology-based inference sets the bar
8. Sillitoe, I. et al. (2021) CATH: increased structural coverage of high for protein function prediction. BMC Bioinforma. 14, S7
functional space. Nucleic Acids Res. 49, D266–D273 37. Qiu, J. et al. (2020) ProNA2020 predicts protein–DNA, protein–
9. Chandonia, J.-M. et al. (2022) SCOPe: improvements to the RNA, and protein–protein binding proteins and residues from
structural classification of proteins – extended database to facil- sequence. J. Mol. Biol. 432, 2428–2443
itate variant interpretation and machine learning. Nucleic Acids 38. Cui, Y. et al. (2019) Predicting protein-ligand binding residues
Res. 50, D553–D559 with deep convolutional neural networks. BMC Bioinforma.
10. Cheng, H. et al. (2014) ECOD: An evolutionary classification of 20, 93
protein domains. PLoS Comput. Biol. 10, e1003926 39. Rost, B. and Sander, C. (1993) Prediction of protein secondary
11. Dessailly, B.H. et al. (2009) PSI-2: Structural genomics to cover structure at better than 70% accuracy. J. Mol. Biol. 232,
protein domain family space. Structure 17, 869–881 584–599
12. Johnson, L.S. et al. (2010) Hidden Markov model speed heuristic 40. Hecht, M. et al. (2015) Better prediction of functional effects for
and iterative HMM search procedure. BMC Bioinforma. 11, 431 sequence variants. BMC Genomics 16, S1
13. Remmert, M. et al. (2012) HHblits: lightning-fast iterative protein 41. Altschul, S. (1997) Gapped BLAST and PSI-BLAST: a new gen-
sequence searching by HMM–HMM alignment. Nat. Methods eration of protein database search programs. Nucleic Acids
9, 173–175 Res. 25, 3389–3402
14. Mirdita, M. et al. (2019) MMseqs2 desktop and local web server 42. Mirdita, M. et al. (2022) ColabFold: making protein folding
app for fast, interactive sequence searches. Bioinformatics 35, accessible to all. Nat. Methods 19, 679–682
2856–2858 43. Weissenow, K. et al. (2022) Protein language model embeddings
15. Aderinwale, T. et al. (2022) Real-time structure search and for fast, accurate, alignment-free protein structure prediction.
structure classification for AlphaFold protein models. Commun. Structure 30, 1169–1177.e4
Biol. 5, 316 44. Buchfink, B. et al. (2021) Sensitive protein alignments at tree-of-
16. Bordin, N. et al. AlphaFold2 reveals commonalities and novel- life scale using DIAMOND. Nat. Methods 18, 366–368
ties in protein structure space for 21 model organisms. 45. Moore, G. (1965) Cramming more components onto integrated
Commun. Biol. In press circuits. Electronics 38, 82–85
17. Kolodny, R. et al. (2013) On the universe of protein folds. Annu. 46. Bepler, T. and Berger, B. (2019) Learning protein sequence
Rev. Biophys. 42, 559–582 embeddings using information from structure. arXiv Published
18. Dunker, A.K. et al. (2013) What’s in a name? Why these proteins online October 16, 2019. https://doi.org/10.48550/arXiv.
are intrinsically disordered: Why these proteins are intrinsically 1902.08661
disordered. Intrinsically Disord. Proteins 1, e24157 47. Elnaggar, A. et al. (2022) ProtTrans: towards cracking the lan-
19. Romero, P. et al. (1998) Thousands of proteins likely to have long guage of lifes code through self-supervised deep learning and
disordered regions. Pac. Symp. Biocomput. 1998, 437–448 high performance computing. IEEE Trans. Pattern Anal. Mach.
20. Schlessinger, A. et al. (2011) Protein disorder – a breakthrough Intell. 44, 7112–7127
invention of evolution? Curr. Opin. Struct. Biol. 21, 412–418 48. Heinzinger, M. et al. (2019) Modeling aspects of the language
21. Kastano, K. et al. (2020) Evolutionary study of disorder in of life through transfer-learning protein sequences. BMC
… protein sequences. Biomolecules 10, 1413
22. Jumper, J. et al. (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589
23. Kryshtafovych, A. et al. (2019) Critical assessment of methods of protein structure prediction (CASP) – round XIII. Proteins Struct. Funct. Bioinforma. 87, 1011–1020
24. Baek, M. et al. (2021) Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876
25. Heo, L. and Feig, M. (2020) High-accuracy protein structures by combining machine-learning with physics-based refinement. Proteins Struct. Funct. Bioinforma. 88, 637–642
26. Lupas, A.N. et al. (2021) The breakthrough in protein structure prediction. Biochem. J. 478, 1885–1890
27. Ahdritz, G. et al. (2022) OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. bioRxiv Published online November 24, 2022. https://doi.org/10.1101/2022.11.20.517210
28. Sen, N. et al. (2022) Characterizing and explaining the impact of disease-associated mutations in proteins without known structures or structural homologs. Brief. Bioinform. 23, bbac187
29. Tunyasuvunakool, K. et al. (2021) Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596
30. Humphreys, I.R. et al. (2021) Computed structures of core eukaryotic protein complexes. Science 374, eabm4805
31. Littmann, M. et al. (2021) Embeddings from deep learning transfer GO annotations beyond homology. Sci. Rep. 11, 1160
32. Littmann, M. et al. (2021) Protein embeddings and deep learning predict binding residues for various ligand types. Sci. Rep. 11, 23916
33. Zhao, B. et al. (2021) DescribePROT: database of amino acid-level protein structure and function predictions. Nucleic Acids Res. 49, D298–D308
… Bioinforma. 20, 723
49. Ofer, D. et al. (2021) The language of proteins: NLP, machine learning & protein sequences. Comput. Struct. Biotechnol. J. 19, 1750–1758
50. Rives, A. et al. (2021) Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U. S. A. 118, e2016239118
51. Brandes, N. et al. (2022) ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 38, 2102–2110
52. Stärk, H. et al. (2021) Light attention predicts protein location from the language of life. Bioinforma. Adv. 1, vbab035
53. Marquet, C. et al. (2022) Embeddings from protein language models predict conservation and variant effects. Hum. Genet. 141, 1629–1647
54. Villegas-Morcillo, A. et al. (2021) Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 37, 162–170
55. Thumuluri, V. et al. (2022) DeepLoc 2.0: multi-label subcellular localization prediction using protein language models. Nucleic Acids Res. 50, W228–W234
56. Alley, E.C. et al. (2019) Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322
57. Seo, S. et al. (2018) DeepFam: deep learning based alignment-free method for protein family modeling and prediction. Bioinformatics 34, i254–i262
58. Vig, J. et al. (2021) BERTology meets biology: interpreting attention in protein language models. arXiv Published online March 28, 2021. https://doi.org/10.48550/arXiv.2006.15222
59. Littmann, M. et al. (2021) Clustering FunFams using sequence embeddings improves EC purity. Bioinformatics 37, 3449–3455
60. Heinzinger, M. et al. (2022) Contrastive learning on protein embeddings enlightens midnight zone. NAR Genomics Bioinforma. 4, lqac043
61. Nallapareddy, V. et al. (2022) CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models. bioRxiv Published online March 13, 2022. https://doi.org/10.1101/2022.03.10.483805
62. Marx, V. (2022) Method of the year: protein structure prediction. Nat. Methods 19, 5–10
63. Steinegger, M. et al. (2019) Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods 16, 603–606
64. Mitchell, A.L. et al. (2020) MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res. 48, D570–D578
65. The UniProt Consortium et al. (2021) UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480–D489
66. Marks, D.S. et al. (2011) Protein 3D structure computed from evolutionary sequence variation. PLoS One 6, e28766
67. Anishchenko, I. et al. (2017) Origins of coevolution between residues distant in protein 3D structures. Proc. Natl. Acad. Sci. 114, 9122–9127
68. Jones, D.T. et al. (2011) PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28, 184–190
69. Holm, L. (2020) Using Dali for protein structure comparison. In Structural Bioinformatics (2112) (Gáspári, Z., ed.), pp. 29–42, Springer
70. Ruan, J. et al. (2018) Cryo-EM structure of the gasdermin A3 membrane pore. Nature 557, 62–67
71. Ding, J. et al. (2016) Pore-forming activity and structural autoinhibition of the gasdermin family. Nature 535, 111–116
72. Bayly-Jones, C. and Whisstock, J.C. (2022) Mining folded proteomes in the era of accurate structure prediction. PLoS Comput. Biol. 18, e1009930
73. Taylor, W.R. and Orengo, C.A. (1989) Protein structure alignment. J. Mol. Biol. 208, 1–22
74. Zhang, Y. (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309
75. Shindyalov, I.N. and Bourne, P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. Des. Sel. 11, 739–747
76. Waterhouse, A. et al. (2018) SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 46, W296–W303
77. Durairaj, J. et al. (2020) Geometricus represents protein structures as shape-mers derived from moment invariants. Bioinformatics 36, i718–i725
78. Akdel, M. et al. (2022) A structural biology community assessment of AlphaFold 2 applications. Nat. Struct. Mol. Biol. 29, 1056–1067
79. Ayoub, R. and Lee, Y. (2019) RUPEE: a fast and accurate purely geometric protein structure search. PLoS One 14, e0213712
80. Guzenko, D. et al. (2020) Real time structural search of the Protein Data Bank. PLoS Comput. Biol. 16, e1007970
81. Yang, J.-M. (2006) Protein structure database search and evolutionary classification. Nucleic Acids Res. 34, 3646–3659
82. de Brevern, A.G. et al. (2000) Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins Struct. Funct. Genet. 41, 271–287
83. Wang, S. and Zheng, W.-M. (2008) CLePAPS: fast pair alignment of protein structures based on conformational letters. J. Bioinforma. Comput. Biol. 6, 347–366
84. Steinegger, M. and Söding, J. (2018) Clustering huge protein sequence sets in linear time. Nat. Commun. 9, 2542
85. Porta-Pardo, E. et al. (2022) The structural coverage of the human proteome before and after AlphaFold. PLoS Comput. Biol. 18, e1009818
86. Evans, R. et al. (2022) Protein complex prediction with AlphaFold-Multimer. bioRxiv Published online March 10, 2022. https://doi.org/10.1101/2021.10.04.463034
87. Bondarenko, V. et al. (2022) Structures of highly flexible intracellular domain of human α7 nicotinic acetylcholine receptor. Nat. Commun. 13, 793
88. del Alamo, D. et al. (2022) Sampling alternative conformational states of transporters and receptors with AlphaFold2. eLife 11, e75751
89. Ruff, K.M. and Pappu, R.V. (2021) AlphaFold and implications for intrinsically disordered proteins. J. Mol. Biol. 433, 167208
90. Wilson, C.J. et al. (2022) AlphaFold2: a role for disordered protein prediction? Int. J. Mol. Sci. 23, 4591
91. Alderson, T.R. et al. (2022) Systematic identification of conditionally folded intrinsically disordered regions by AlphaFold2. bioRxiv Published online February 18, 2022. https://doi.org/10.1101/2022.02.18.481080
92. Perrakis, A. and Sixma, T.K. (2021) AI revolutions in biology: the joys and perils of AlphaFold. EMBO Rep. 22, e54046
93. Schmidt, A. et al. (2022) Predicting the pathogenicity of missense variants using features derived from AlphaFold2. bioRxiv Published online March 05, 2022. https://doi.org/10.1101/2022.03.05.483091
94. Esposito, L. et al. (2021) AlphaFold-predicted structures of KCTD proteins unravel previously undetected relationships among the members of the family. Biomolecules 11, 1862
95. Saldaño, T. et al. (2022) Impact of protein conformational diversity on AlphaFold predictions. Bioinformatics 38, 2742–2748
96. Santuz, H. et al. (2022) Small oligomers of Aβ42 protein in the bulk solution with AlphaFold2. ACS Chem. Neurosci. 13, 711–713
97. Ivanov, Y.D. et al. (2022) Prediction of monomeric and dimeric structures of CYP102A1 using AlphaFold2 and AlphaFold multimer and assessment of point mutation effect on the efficiency of intra- and interprotein electron transfer. Molecules 27, 1386
98. del Alamo, D. et al. (2021) AlphaFold2 predicts the inward-facing conformation of the multidrug transporter LmrP. Proteins Struct. Funct. Bioinforma. 89, 1226–1228
99. Goulet, A. and Cambillau, C. (2021) Structure and topology prediction of phage adhesion devices using AlphaFold2: the case of two Oenococcus oeni phages. Microorganisms 9, 2151
100. van Breugel, M. et al. (2022) Structural validation and assessment of AlphaFold2 predictions for centrosomal and centriolar proteins and their complexes. Commun. Biol. 5, 312
101. Millán, C. et al. (2021) Assessing the utility of CASP14 models for molecular replacement. Proteins Struct. Funct. Bioinforma. 89, 1752–1769
102. Rodriguez, J.M. et al. (2022) APPRIS: selecting functionally important isoforms. Nucleic Acids Res. 50, D54–D59
103. Lomize, A.L. et al. (2022) Membranome 3.0: database of single-pass membrane proteins with AlphaFold models. Protein Sci. 31, e4318
104. Wehrspan, Z.J. et al. (2022) Identification of iron-sulfur (Fe-S) cluster and zinc (Zn) binding sites within proteomes predicted by DeepMind's AlphaFold2 program dramatically expands the metalloproteome. J. Mol. Biol. 434, 167377
105. Binder, J.L. et al. (2022) AlphaFold illuminates half of the dark human proteins. Curr. Opin. Struct. Biol. 74, 102372
106. Sommer, M.J. et al. (2022) Highly accurate isoform identification for the human transcriptome. bioRxiv Published online June 09, 2022. https://doi.org/10.1101/2022.06.08.495354
107. Bronstein, M.M. et al. (2021) Geometric deep learning: grids, groups, graphs, geodesics, and gauges. arXiv Published online May 2, 2021. https://doi.org/10.48550/arXiv.2104.13478
108. Veličković, P. (2022) Message passing all the way up. arXiv Published online February 22, 2022. https://doi.org/10.48550/arXiv.2202.11097
109. Stärk, H. et al. (2022) EquiBind: geometric deep learning for drug binding structure prediction. In Proceedings of the 39th International Conference on Machine Learning (162), pp. 20503–20521
110. Zhang, Z. et al. (2022) Protein representation learning by geometric structure pretraining. arXiv Published online September 19, 2022. https://doi.org/10.48550/arXiv.2203.06125
111. Ingraham, J. et al. (2019) Generative models for graph-based protein design. In NIPS'19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, Article No. 1417, pp. 15820–15831
112. Ferruz, N. et al. (2022) ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348
113. Anishchenko, I. et al. (2021) De novo protein design by deep network hallucination. Nature 600, 547–552
114. Teufel, F. et al. (2022) SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat. Biotechnol. 40, 1023–1025
115. Høie, M.H. et al. (2022) NetSurfP-3.0: accurate and fast prediction of protein structural features by protein language models and deep learning. Nucleic Acids Res. 50, W510–W515
116. Hie, B.L. et al. (2022) Efficient evolution of human antibodies from general protein language models and sequence information alone. bioRxiv Published online September 6, 2022. https://doi.org/10.1101/2022.04.10.487779
117. Hsu, C. et al. (2022) Learning inverse folding from millions of predicted structures. bioRxiv Published online September 6, 2022. https://doi.org/10.1101/2022.04.10.487779
118. Ma, X. et al. (2022) Mega: moving average equipped gated attention. arXiv Published online September 26, 2022. https://doi.org/10.48550/arXiv.2209.10655
119. Gu, A. et al. (2022) Efficiently modeling long sequences with structured state spaces. arXiv Published online August 5, 2022. https://doi.org/10.48550/arXiv.2111.00396
120. Hoffmann, J. et al. (2022) Training compute-optimal large language models. arXiv Published online March 29, 2022. https://doi.org/10.48550/arXiv.2203.15556
121. McInnes, L. et al. (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv Published online September 18, 2020. https://doi.org/10.48550/arXiv.1802.03426
122. Waskom, M. (2021) Seaborn: statistical data visualization. J. Open Source Softw. 6, 3021