Downloaded from rsif.royalsocietypublishing.org on August 27, 2014

Biophysics of protein evolution and evolutionary protein biophysics

Tobias Sikosek1,2,3 and Hue Sun Chan1,2,3

1Department of Biochemistry, 2Department of Molecular Genetics, and 3Department of Physics, University of Toronto, Toronto, Ontario, Canada M5S 1A8

TS, 0000-0001-9929-3525

Headline review

J. R. Soc. Interface 2014 11, 20140419, published 27 August 2014

Cite this article: Sikosek T, Chan HS. 2014 Biophysics of protein evolution and evolutionary protein biophysics. J. R. Soc. Interface 11: 20140419. http://dx.doi.org/10.1098/rsif.2014.0419

Received: 22 April 2014
Accepted: 28 July 2014

References: This article cites 435 articles, 131 of which can be accessed free (http://rsif.royalsocietypublishing.org/content/11/100/20140419.full.html#ref-list-1)

Subject Areas: biophysics, bioinformatics, computational biology

Keywords: adaptation, promiscuous functions, conformational dynamics, hidden states, protein folding, protein–protein interactions

Authors for correspondence: Tobias Sikosek (e-mail: t.sikosek@utoronto.ca), Hue Sun Chan (e-mail: chan@arrhenius.med.toronto.edu)

The study of molecular evolution at the level of protein-coding genes often entails comparing large datasets of sequences to infer their evolutionary relationships.
Despite the importance of a protein’s structure and conformational dynamics to its function and thus its fitness, common phylogenetic methods embody minimal biophysical knowledge of proteins. To underscore the biophysical constraints on natural selection, we survey effects of protein mutations, highlighting the physical basis for marginal stability of natural globular proteins and how requirement for kinetic stability and avoidance of misfolding and misinteractions might have affected protein evolution. The biophysical underpinnings of these effects have been addressed by models with an explicit coarse-grained spatial representation of the polypeptide chain. Sequence–structure mappings based on such models are powerful conceptual tools that rationalize mutational robustness, evolvability, epistasis, promiscuous function performed by ‘hidden’ conformational states, resolution of adaptive conflicts and conformational switches in the evolution from one protein fold to another. Recently, protein biophysics has been applied to derive more accurate evolutionary accounts of sequence data. Methods have also been developed to exploit sequence-based evolutionary information to predict biophysical behaviours of proteins. The success of these approaches demonstrates a deep synergy between the fields of protein biophysics and protein evolution.

1. Introduction

Biological evolution uses mutations as its basic working material. Mutations occur in DNA molecules through various mechanisms. Some mutations are relatively ‘silent’ in that their effects are less appreciable, whereas others have a more prominent impact on the biological function. The most immediate effect of a mutation is the alteration of the DNA molecule itself and thus, possibly, its affinities to bind certain proteins or RNA. Given the vastness of many genomes, it was once believed that many mutations in DNA fall in regions that have no biological function.
However, with increasing knowledge of the functional roles of non-coding DNA sequences, the proportion of genomes that is considered non-functional has decreased significantly [1]. Regions of the genome that do encode for a functional RNA or protein can undergo several different kinds of mutations, such as insertions, deletions and duplications of entire segments of DNA. The present review focuses primarily on the effect of point mutations (change of a single nucleotide) and will consider only proteins but not RNA, although many general principles of evolution are applicable to both classes of biomolecules. We refer to other authors for the evolution of protein structures via sequence re-arrangements such as domain-wise evolution [2–4], the fusion of small peptide fragments [5] or the ‘chimeric’ recombination of fragments that is also exploited in protein engineering [6–9].

© 2014 The Author(s) Published by the Royal Society. All rights reserved.

Figure 1. Enhancing evolutionary methods with biophysical information. (Panels (a) and (b) plot evolutionary rate (dN/dS) against the relative solvent accessibility (RSA) of a residue; the regions marked in panel (b) are elevated evolutionary rate, neutral baseline (majority of residues) and reduced evolutionary rate.) (a) Relative solvent accessibility (RSA), the exposure of an amino acid to solvent in the folded structure, is strongly correlated with the evolutionary rate ω ≡ dN/dS [22], where dN/dS is the ratio of non-synonymous over synonymous substitution rates. (b) RSA was used to improve the neutral baseline for the detection of positive and negative selection in influenza proteins [23,24]. Data points in yellow have elevated evolutionary rates (larger ω) but can be either below or above the conventional ω = 1 divide that typically distinguishes positive (ω > 1) and negative/purifying (ω < 1) selection. Blue data points show sites evolving at reduced evolutionary rates compared with the neutral baseline. (c) The homo-trimeric haemagglutinin [25] is an influenza surface glycoprotein. The structure of one of the three monomeric units is shown as black ribbons with the residues under positive or negative selection highlighted as spheres using the same colour code as that in (b); the other two monomeric units are depicted by the grey surface. (Adapted from [24]). A majority of the positively selected sites are found around the region that is most frequently targeted by antibodies (top right of the structure in (c)) and are thus under strong selection pressure to diversify.

Current study of molecular evolution can benefit from a huge amount of sequence data, but only a relatively small body of structural data. Consequently, many approaches in evolutionary studies are predominantly sequence-based. A prime example is phylogenetic inference methods based upon multiple sequence alignments. Mostly, the biophysical foundation of these mathematical methods is provided only rudimentarily by the BLOSUM [10] or PAM [11] substitution matrices that are empirical summaries of the posterior probabilities of various amino acid substitutions. These models can roughly capture the tendency to conserve the physico-chemical properties of amino acids when they undergo mutations, like polar amino acids that are mostly substituted by other polar amino acids but less frequently by hydrophobic ones. However, such trends capture only a tiny aspect of the many biophysical implications of mutations that can be important for the biological function of proteins.
For instance, they often do not even consider the local structural environment of a given amino acid residue position such as backbone conformation and hydrogen bonding pattern that might constrain evolutionary choices [12]. In this context, a number of authors from within the biophysics community have recently called for a stronger collaboration between the fields of molecular evolution and protein biophysics in order to achieve new and deeper insights into protein evolution [13–17]. At the same time, within the phylogenetics community there is a growing realization of the need for including structure information into evolutionary models [18–20]. As a first step in pursuing this direction of investigation, the effect of mutations on protein stability or binding affinities is probably the most promising example of how biophysics can contribute to a better understanding of evolution [21].

The two fields can clearly benefit from each other. For example, a common evolutionary method to identify gene positions that have undergone significant mutational changes and to quantify the degree of selection is to compute the ratio of non-synonymous to synonymous substitutions. However, a correlation between this ratio and the solvent exposure of the site in the folded protein structure has been noted recently [22] (figure 1), suggesting that this ratio may not be purely a measure of adaptive selection but may also reflect the site’s contribution to protein stability. Based on this finding, solvent exposure of residues has been used in establishing a new neutral baseline that reflects this biophysical constraint under which natural selection must operate. Notably, this procedure has led to recognition of new amino acid positions in the influenza protein haemagglutinin that have undergone adaptation (figure 1), highlighting how biophysical/structural knowledge can improve evolutionary analysis [23,24].
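The idea of judging each site against an exposure-dependent baseline, rather than against the fixed ω = 1 divide, can be illustrated with a minimal Python sketch. It is not the procedure of [23,24]: the linear form of the baseline, its two coefficients, the tolerance and the four example sites are all hypothetical values chosen only to show the logic.

```python
# Hypothetical linear neutral baseline: expected omega = dN/dS as a function of RSA.
# The coefficients and the tolerance below are illustrative assumptions, not fitted values.
BASELINE_INTERCEPT = 0.1
BASELINE_SLOPE = 0.5
TOLERANCE = 0.3


def baseline_omega(rsa):
    """Expected evolutionary rate for a site with the given relative solvent accessibility."""
    return BASELINE_INTERCEPT + BASELINE_SLOPE * rsa


def classify_site(rsa, omega):
    """Label a site by its deviation from the RSA-dependent baseline."""
    residual = omega - baseline_omega(rsa)
    if residual > TOLERANCE:
        return "elevated"
    if residual < -TOLERANCE:
        return "reduced"
    return "baseline"


# Made-up (RSA, omega) pairs: buried and fast, exposed and fast,
# exposed and slow, and a site close to the baseline.
sites = [(0.05, 0.80), (0.90, 1.30), (0.80, 0.15), (0.40, 0.32)]

labels = [classify_site(rsa, omega) for rsa, omega in sites]
conventional = [omega > 1.0 for _, omega in sites]  # the usual omega > 1 criterion

for (rsa, omega), label, flagged in zip(sites, labels, conventional):
    print(f"RSA={rsa:.2f} omega={omega:.2f} -> {label} (omega > 1: {flagged})")
```

In this toy data set the buried site with ω = 0.80 is labelled as elevated even though it falls below ω = 1, which is the kind of site a fixed threshold would miss.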
Conversely, evolutionary information can also provide novel biophysical understanding of proteins. One earlier example in using such an approach that may be termed evolutionary protein biophysics is the utilization of evolutionary data on the PDZ domain family to predict energetically coupled positions on the protein, some of which are spatially far apart [26]. Another example is the inference of structural information from protein sectors, which are co-evolving clusters of spatially proximate and physically interacting amino acids within a protein structure. A protein such as rat trypsin [27], for example, can have several such clusters that have distinct functions and evolve independently (figure 2). The existence of protein sectors raises fundamental concerns over phylogenetic methods that assume no such biophysical interactions, because those methods led to inconsistent phylogenetic trees depending on whether they are deduced from all mutations of the protein or from considering only mutations within a sector. However, with appropriate analysis, biophysical studies of proteins can use this type of evolutionary information to predict the correct fold of a protein, deduce interactions between protein monomeric units in a multiple-chain protein complex and identify hitherto unknown functional conformations [28–32].

In the following, we first discuss the basic constraints of biophysics on evolution by surveying salient biophysical consequences of protein mutations. We then outline recent advances in using biophysical concepts to shed light on experimentally observed evolutionary behaviours.

Figure 2. Co-evolving residues in rat trypsin (PDB code 3TGI), a serine protease. Protein sectors are networks of co-evolving residues with independent functions [27]. Here, the three sectors of serine proteases are shown in red (substrate specificity), blue (thermal stability) and green (catalysis). Known functional residues are shown as sticks. The existence of protein sectors has important consequences for phylogenetic analyses, since each sector evolves independently. Protein sectors were identified using the statistical coupling analysis (SCA) approach, whereas a different approach, direct coupling analysis (DCA), yielded a partially different set of co-evolving residues (dashed lines). Residue pairs from DCA have successfully been used in combination with structure-based models to predict native structure, protein–protein interactions and conformational changes [28,29]. These examples illustrate how the fields of biophysics and molecular evolution can benefit from each other. (Adapted from [27,29].)

2. Biophysical consequences of protein mutations

2.1. Mutational effects on the thermodynamic stability of protein folded states

For proteins that have a globular folded native structure, the thermodynamic stability of the folded structure relative to the ensemble of unfolded conformations is determined by the balance between the interactions that favour the folded state and the conformational entropy that favours the unfolded state. The more stable a protein, the more difficult it is to unfold (denature) under high temperatures or high concentrations of denaturing chemicals. To illustrate the energetic balance governing protein stability and its kinetic implications, the conformational diversity of the unfolded state and the essentially unique native structure of a globular protein is often depicted by a funnel-like representation of the free energy landscape of the protein conformations. The folded state is situated at the bottom of the funnel whereas the unfolded state populates the top of the funnel [33–36] (see for example the top-left drawing in figure 3).

In protein evolution studies, stability is often used as a proxy for the fidelity of a protein function, because a sufficient stability of the native state is often required for function [21]. Although a protein’s function is not equivalent to its stability, experimental support exists for a positive correlation between protein functionality and native stability (e.g. [39–41]). This relationship can be seen very clearly in a recent experiment demonstrating how the evolutionary trajectory of influenza nucleoprotein is probably constrained to avoid low-stability sequences [42] (see further discussion in §3.8). In general, a mutation that decreases the stability of a protein is more likely than a mutation that preserves stability to lead to the formation of other non-functional structures that would be detrimental to the protein’s original (wild-type) biological function, and in the worst case can cause serious harm to the organism.

The qualitative impact of a mutation on the folded state of a protein can often be anticipated. In globular proteins, surface residues are mostly polar and charged, while core residues have a higher tendency to be hydrophobic [43,44]. Mutations that conserve these properties are less likely to result in a large change in stability. In addition, the statistical propensities for certain amino acids to occur in a particular type of secondary structure have also been compiled and can be used to predict probable mutational effects on secondary structure (e.g. [45]). A recent comprehensive review of numerous studies of mutants occurring in natural protein families and superfamilies shows clearly that amino acid substitutions are constrained differently (i.e. their viabilities vary) in different local environments as defined by the main-chain secondary structure, solvent accessibility and hydrogen bonding [12]. Using stability as proxy for function, quantitative stability prediction is widely used to address the effect of mutations on protein function.
Figure 3. Schematics of some of the possible effects of mutations on protein folding and interaction. (Diagram labels: unfolded protein chain, mutation, folding, misfolding, surface change, mutated proteins, misinteraction, new interaction, functional folded protein, interaction partner, other proteins, no interactions, functional interaction.) The top-left cartoon of the folding landscape of a globular protein shows the correctly folded structure as the global free energy minimum, whereas a shallower minimum corresponds to a misfolded structure. Interactions of the original protein are indicated by black arrows; those of the mutant are indicated in red. Mutations can lead to misfolding and/or aggregation and/or misinteractions. Mutations can also lead to no apparent changes (neutral mutation). Some non-neutral mutations, however, can lead to new functional interactions that can then be subject to evolutionary selection. Note that the depiction of interactions between folded proteins as a ‘lock and key’ fit between specific shapes is adopted here merely to simplify the schematic representation. The perspective conveyed by the present figure does not preclude more dynamic binding mechanisms such as induced fit [37] and conformational selection [38].

Many tools exist to calculate an estimated ΔΔG, or change in free energy, after one or more mutations [46–52]. Most of these methods focus on a static reference structure for which an energy or a score is calculated according to an empirical forcefield. To implement the mutation, the structure is computationally modified; energy is then recalculated and compared against the pre-mutation wild-type value. ΔΔG prediction is widely used to screen large numbers of mutations, often in combination with laboratory experiments [53–57]. The approach has also served as a fitness estimator in simulation studies of protein evolution [58,59].
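The comparison step of such a screen can be written down in a few lines. The Python sketch below is only an illustration under stated assumptions: the forcefield scoring itself is not shown, the mutant names and stability values are made up, stability is taken as the unfolding free energy in kcal mol⁻¹ (positive when the folded state is favoured), and the conversion to a folded population assumes simple two-state folding.

```python
import math

R_KCAL_PER_MOL_K = 1.987e-3  # gas constant in kcal mol^-1 K^-1


def ddg(stability_wt, stability_mut):
    """ddG = dG(mutant) - dG(wild-type); negative means destabilizing (assumed sign convention)."""
    return stability_mut - stability_wt


def fraction_folded(stability, temperature_k=298.15):
    """Folded-state population for a two-state protein with the given unfolding free energy."""
    return 1.0 / (1.0 + math.exp(-stability / (R_KCAL_PER_MOL_K * temperature_k)))


def screen(stability_wt, mutant_stabilities):
    """Return (name, ddG) pairs sorted from most destabilizing to most stabilizing."""
    changes = {name: ddg(stability_wt, s) for name, s in mutant_stabilities.items()}
    return sorted(changes.items(), key=lambda item: item[1])


# Hypothetical estimated stabilities in kcal/mol; in practice these would come
# from rescoring the computationally modified structure.
wild_type = 5.5
mutants = {"mutant_1": 3.5, "mutant_2": 5.5, "mutant_3": 6.0}

ranked = screen(wild_type, mutants)
for name, change in ranked:
    population = fraction_folded(mutants[name])
    print(f"{name}: ddG = {change:+.1f} kcal/mol, folded fraction = {population:.5f}")
```

Because only one folded-state number per variant enters the calculation, a sketch of this kind says nothing about how the mutation affects the unfolded state or alternative structures.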
One obvious limitation of these DDG prediction methods is that, with few exceptions [60–62], they consider only a single ‘native’ protein conformation. In essence, these methods disregard mutational effects on the unfolded state and often ignore the possibility of structural adjustment of the folded state in response to the mutation. The accuracy of these methods is limited because in reality the mutational effects on protein stability are determined by the balance between the impact of the mutation on the folded and the unfolded states. Moreover, these methods do not address possible change from one folded structure to another, nor the possibility of misfolding; but conformational transition is crucial for exploring new protein functions during evolution, with polar-to-hydrophobic substitutions having a higher potential to lead to alternative folded structures [63–65]. In fact, sometimes a mutation may seem harmless in the native structure but can have dramatic effects during the folding process so that the native state might not even be formed (see §2.2). In principle, with improved algorithms and appropriate atomistic forcefields, extensive molecular dynamics simulations that sample both the folded and unfolded conformations may provide more accurate stability predictions [66], even predictions of conformation transition [67–69]. But currently the computational cost for such simulations is very high; thus molecular dynamics cannot yet be used for large-scale mutation screenings. J. R. Soc. Interface 11: 20140419 failed interaction no effect (neutral) rsif.royalsocietypublishing.org aggregation Downloaded from rsif.royalsocietypublishing.org on August 27, 2014 2.3. Interactions and misinteractions The biological functions of most proteins require them to interact with other proteins and/or other biomolecules [98]. Mutations affect these interactions and can lead to misinteractions [99]. 
A classic example is the glutamic acid to valine mutation in haemoglobin [100] that causes aggregation of haemoglobin and consequently sickle-cell anaemia [101]. More recent examples include mutations implicated in prion, amyloid and other misfolding diseases mentioned above [102] as well as disease-causing mutations that disrupt or weaken the proper binding between two proteins [103,104]. The cellular environment is crowded [105,106]. This crowdedness is probably dictated by biophysical constraints imposed by a living cell’s need for efficient rates of biochemical reactions [107]. Within the cellular confine, a given protein can potentially come into contact with a large number of other proteins [108,109]. Although the possibility of non-specific binding probably constitutes a biophysical constraint that might have restricted the number of proteins in a cell [110], natural proteins can function by being remarkably specific binders. This interaction specificity entails not only favourable binding with a protein’s target molecule(s) but also extremely unfavourable— essentially absence of—binding with many other molecules. This requirement is conceptually similar to the well-known principle for protein design, i.e. that an optimized sequence has to ‘design in’ the target structure as well as ‘design out’ alternative structures [111]. Many natural proteins have evolved not only to fold to the functional native state but also to strongly destabilize non-native intermediate states [112] by increasing the energetic 5 J. R. Soc. Interface 11: 20140419 The impact of mutations on a globular protein is not limited to its folded structure. The folding process itself is altered by mutations, even when the end-point of the folding kinetics of the mutant is essentially the same folded structure as that of the original sequence. 
Kinetics of folding is often two-state-like for small, single-domain proteins [70] but transiently populated intermediate states are observed in many other proteins [71]. Mutations can affect folding speeds of both two-state-like and non-two-state proteins by modulating the interactions that favour the native state [72–75] or through strengthening certain non-native interactions not present in the folded structure [76,77]. Folding kinetics can be subject to natural selection. A recent estimate pointed to an overall increase in folding speed during evolution. Specifically, the folding speeds of a-proteins (folded structures consisting mostly of a-helices) have increased throughout evolution whereas those of b-proteins (folded structures consisting mostly of b-sheets) appear to have been decreasing in the last 1.5 billion years [78]. In an earlier study of conserved amino acid positions across protein families, it was concluded that conserved sites are important for function or stability, and that there has been ‘evolutionary pressure towards fast (not necessarily the fastest) folding of several proteins’ [79]. By contrast, a subsequent investigation of 48 natural mutants with single-site substitutions in the hydrophobic core of the SH3 domain (a b-protein; not considered in [79]) indicated that conservation correlates well with unfolding rates but not the folding rates of the mutants. In other words, mutants with slower unfolding rates occur more frequently than mutants with faster unfolding rates, but a positive or negative correlation between folding rate with occurrence frequency was not observed. This finding suggests that evolution selects more strongly for a slower unfolding rate than faster folding rate, at least for the SH3 family [80]. In this regard, a recent survey argued that protein kinetic stability, i.e. 
a slow unfolding rate, is often more strongly selected by evolution than thermodynamic stability, most probably because kinetic instability (a faster unfolding rate) facilitates irreversible alteration processes such as amyloid formation and other forms of detrimental protein aggregation even if overall thermodynamic stability is maintained by a higher folding rate [81]. Echoing the aforementioned study of SH3 domains, an investigation of 27 single-substitution variants of thioredoxin—the fold of which is apparently extremely ancient in evolutionary history [82]—indicates that viable mutants can at most be 2 kcal mol21 less stable than the wild-type, but a significant correlation exists between slower unfolding rate and the occurrence frequency of a given residue in sequence alignments, again suggesting a significant natural selection for slower unfolding rates [83]. For proteins that undergo folding with significantly populated transient intermediates, a mutation may stabilize or destabilize the intermediate conformations, or even abrogate the intermediates encountered in the folding of the original sequence, or create new intermediates. In fact, in some experiments, mutations were intentionally introduced to stabilize various folding intermediates to facilitate their characterization [84,85]. In one case, swapping certain hydrophobic core residues between two related proteins could also swap the associated folding intermediates [86]. In more extreme cases, a mutation could lead to the formation of different folding intermediates or even different folded structures with potentially severe implications for protein function and aggregation [87]. In particular, highly abundant proteins with relatively low solubilities are prone to aggregate [88]. 
An increasing number of neurodegenerative and other varieties of prion and amyloid diseases are now known to be caused by misfolded structures (different ‘native’ structures) or by aggregation/oligomerization of intermediate conformational states, with propensity for misfolding increased by certain mutations [87,89] (figure 3). Cataracts in the human eye are also found to be caused by accumulation of misfolded proteins [90] and associated with mutations that led to abnormal folding behaviour [91,92]. As exemplified by the mouse prion protein and consistent with the general observation of evolutionary selection for kinetic stability [81], the folding and maintenance of the non-disease folded form of some of the pertinent proteins (the misfolded forms of which are implicated in diseases) is under kinetic rather than thermodynamic control [93]. Consistent with these observations, the experimentally observed distribution of protein evolution rates may be rationalized by an evolutionary process that selects against misfolding [94]. In the cellular environment, mutations can affect not only the folding kinetics of a protein in isolation but also how it interacts with the complex cellular machinery while it is folding. Inasmuch as folding kinetics is concerned, the in vivo translational rate can affect co-translational folding [95,96] because, for example, fast-translating codons can be useful for avoiding misfolding. In this regard, even synonymous mutations that do not change the amino acid sequence of a protein can lead to altered folding pathways in the cell [97]. rsif.royalsocietypublishing.org 2.2. Effects of mutation on folding kinetics and intermediate states Downloaded from rsif.royalsocietypublishing.org on August 27, 2014 different substrates [128,129], which can be created easily via mutations from an original interface that binds only one substrate. This perspective is consistent with a recent directed evolution study on the bacterial immunity protein Im9. 
The wild-type Im9 primarily inhibits deoxyribonuclease ColE9 but also inhibits ColE7 promiscuously, i.e. to a much lesser extent. The experiment shows that it can evolve readily into a primary ColE7-inhibitor with an approximately 105-fold increase in affinity and 108-fold increase in selectivity via a ‘generalist’ intermediate that allows for rapid evolutionary divergence [130]. Since native stability is required for globular proteins to perform their biological functions (§2.1) and to avoid misfolding and aggregation (§2.2), it might seem that a higher native stability should always be desirable and therefore favoured by evolution. However, natural globular proteins are not extremely stable. An early survey of the thermal stability of 12 proteins at 258C showed considerable variation of native stability among them, with average stabilizing free energies of 0.05 –0.12 kcal per mole of amino acid residues [131]. This and other experimental data indicate an approximate native stability of 5 –15 kcal mol21 for a natural globular protein with about 100 amino acids. These findings have since been rationalized theoretically by considering the strength of intra-protein interactions and conformational entropy [44,132]. This experimental level of stability of natural globular proteins is often characterized as ‘marginally stable’. ‘Marginal’ here points to the relatively small free energies of folding. Sometimes the term also refers to the fact that the net balance of 5–15 kcal mol21 for native stability is the result of a partial cancellation of two much larger free energies on the order of 100–200 kcal mol21 contributed by favourable intra-protein interactions on one hand and conformational entropy on the other [44]. If evolutionary selection for stability is expected, why are natural proteins only marginally stable? One possible reason is that native stability is not the only requirement on a functional globular protein. 
Conformational flexibility is crucial for certain biological functions. Therefore, adaptation towards increased conformational flexibility might have acted as a check against proteins evolving to become extremely stable [21,133,134], suggesting that marginal stability can be an adaptive trait. 2.4.1. Marginal stability may not be an adaptive property Is a strong selection pressure for marginal stability necessary to account for the experimentally observed marginal stability of natural proteins? Biophysics-based models have suggested otherwise by showing that marginal stability could be a nonadaptive property [135,136]. The number of sequences encoding for a given structure generally decreases with native stability. Hence, even in the absence of any evolutionary selection, there are more sequences encoding for a given native structure with low stabilities than sequences encoding for the same structure with high stabilities. This phenomenon is a basic property of protein sequence space and is consistent with the ‘superfunnel’ perspective [137] (§3.2.3). Therefore, as long as a certain minimum stability requirement for folding and function is met, random mutational drift will lead an evolving population to a region of sequence space that J. R. Soc. Interface 11: 20140419 2.4. Marginal native stability 6 rsif.royalsocietypublishing.org separation between the folded and unfolded states [113,114] such that the folding–unfolding transition is switch-like [36,115]. Therefore, in line with both the folding and interaction requirements, functional proteins have to disfavour nonnative intra-protein interactions as well as discriminate against detrimental inter-protein misinteractions (figure 3). There is a biophysical limit to evolutionary optimization of protein binding specificity, however. 
Because proteins are made up of a finite alphabet of amino acid residues [116], the heterogeneity, or designability, of their interactions are constrained by the physico-chemical properties of the alphabet. It is not physically possible to eliminate all favourable interactions between a protein and all other proteins except its presumed functional partner(s). In other words, misinteractions cannot be eliminated completely by optimization. In the living cell, there can be more misinteractions because some evolving proteins have not had time to minimize them [117]. In fact, even the folded form of a globular protein is probably a metastable state, whereas amyloid [118] or prion-like [119] aggregates are expected to be thermodynamically more stable configurations at longer timescales. Therefore, binding should not be understood as an all-or-none proposition; instead it is a question of binding affinities that can vary over a wide range. Although proteins bind their evolved interaction partners particularly strongly, they probably also interact transiently with many other proteins, albeit with low affinities. Currently it is not feasible to identify the effects of a given mutation on the many possible interactions a protein can engage in, especially when the mutation has no detectable effect on the main function. Nonetheless, computational prediction methods are being developed to perform efficient tests for potential binding between large numbers of proteins [120]. Any mutation on a protein can potentially increase the binding strength with some molecular partners. If this change alters the cellular biochemistry, the mutation may be subject to either positive or negative natural selection (figure 3). A misinteraction is created by mutation if an originally negligible protein –protein interaction is strengthened to an appreciable level. 
If the misinteraction is beneficial, it can underpin a new oligomeric state or promiscuous function of the protein which can then be positively selected [121,122] (see further discussion in §3.5). In those cases, computational modelling suggests that positive selection of an interacting region can also facilitate evolution of globally well-packed globular structures in the interacting proteins [123,124]. Protein–protein interactions require geometric coupling of the protein interfaces. Mutations within the interfaces naturally have a direct impact on binding; mutations outside the interface can affect binding allosterically as well [125] (see further discussion in §2.7.1). Biophysically, new protein–protein interactions are not unlikely to emerge. A recent survey of heterodimers found that functional binding interfaces bury a surface area between 380 and 3400 Å² [126]. Another recent study indicated that only two amino acid substitutions are needed to shift the average amino acid composition of a 1000 Å², approximately 28-residue non-interacting protein surface to that of a protein–protein interface [127]. In this light, transient binding may be possible with even smaller interfaces. One can imagine a ‘grey area’ of interface sizes where a single surface mutation may significantly increase the binding affinity to a new substrate. There are also overlapping binding interfaces that bind

Is it physically possible for some amino acid sequences to fold with exceedingly high stability? The perspective from experiments is different from that suggested by Goldstein [136]. Among 290 single-residue substitutions of staphylococcal nuclease created artificially by Shortle and co-workers [139–141], 257 are destabilizing, five lead to stabilities essentially the same as that of the wild-type (approx. 5.5 kcal mol⁻¹), and only 28 are stabilizing.
Moreover, each destabilizing artificial mutation destabilizes by more than 2.08 kcal mol⁻¹ on average (maximum = 7.5 kcal mol⁻¹), whereas each stabilizing artificial mutation stabilizes by only 0.36 kcal mol⁻¹ on average (maximum = 1.0 kcal mol⁻¹). A similar trend is exhibited by the 98 artificial mutants of chymotrypsin inhibitor 2 studied by Fersht and co-workers [142] (77 with a single substitution, 17 with two substitutions, and four with three substitutions): 90 artificial mutants are less stable than the wild-type (7.6 kcal mol⁻¹), only eight artificial mutants are more stable than the wild-type. On average, a destabilizing mutation destabilizes by 1.67 kcal mol⁻¹ (maximum = 4.93 kcal mol⁻¹ among single-substitution mutants), whereas a stabilizing mutation stabilizes by only 0.18 kcal mol⁻¹ (maximum = 0.42 kcal mol⁻¹). These data suggest that the stabilities of natural proteins are close to, albeit not exactly at, the maximum achievable by sequences in the immediate

2.4.3. Reconciling evolutionary selection for stability with marginal stability

Taken together, the above discussion indicates that fundamentally, natural globular proteins without disulfide and other cross-links are marginally stable because of the physical constraints on native stability itself. Exceedingly high native stability is physically impossible. Because there are more sequences encoding for lower stabilities than higher stabilities [135,137], extensive evolutionary selection to decrease native stability is not necessary, though selection for local flexibility may sometimes result in functional globular proteins that are not the most stable possible for the given folds [21,133,134].
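The asymmetry in the mutagenesis data quoted in §2.4.2 can be made concrete with a short tabulation. The sketch below is not part of the cited analyses. It simply combines the counts and mean stability changes stated in the text to estimate the fraction of destabilizing mutants and the mean net loss of native stability per mutant. Two simplifications are assumed here: the ‘more than 2.08 kcal mol⁻¹’ figure is treated as exactly 2.08, and the single- and multiple-substitution mutants of chymotrypsin inhibitor 2 are pooled. The dictionary layout and the function name `summarize` are illustrative choices.

```python
# Counts and mean stability changes (kcal/mol) as quoted in the text.
# How they are combined below is an illustrative assumption, not the
# analysis performed in the cited studies.
DATASETS = {
    "staphylococcal nuclease": {
        "n_destab": 257, "n_neutral": 5, "n_stab": 28,
        "mean_destab": 2.08, "mean_stab": 0.36,
    },
    "chymotrypsin inhibitor 2": {
        "n_destab": 90, "n_neutral": 0, "n_stab": 8,
        "mean_destab": 1.67, "mean_stab": 0.18,
    },
}


def summarize(d):
    """Return (total mutants, fraction destabilizing, fraction stabilizing,
    mean net loss of native stability per mutant in kcal/mol)."""
    total = d["n_destab"] + d["n_neutral"] + d["n_stab"]
    frac_destab = d["n_destab"] / total
    frac_stab = d["n_stab"] / total
    # Positive value means that an average mutant loses native stability.
    mean_net_loss = (
        d["n_destab"] * d["mean_destab"] - d["n_stab"] * d["mean_stab"]
    ) / total
    return total, frac_destab, frac_stab, mean_net_loss


for name, data in DATASETS.items():
    total, frac_destab, frac_stab, net = summarize(data)
    print(f"{name}: n={total}, destabilizing={frac_destab:.1%}, "
          f"stabilizing={frac_stab:.1%}, mean net loss={net:.2f} kcal/mol")
```

Under these assumptions, roughly 89% of the nuclease mutants and 92% of the chymotrypsin inhibitor 2 mutants are destabilizing, with a mean net loss of about 1.8 and 1.5 kcal mol⁻¹ per mutant, respectively.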
Experimental evidence abounds, however, for evolutionary selection for higher native stability [40,83] (see §2.1), though not necessarily the highest once a certain threshold for function is achieved [150], as illustrated by the data on the artificial mutants of staphylococcal nuclease and chymotrypsin inhibitor 2 discussed in §2.4.2. Therefore, natural globular proteins are marginally stable (because of biophysical constraints) but they are nonetheless nearly maximally stable (by evolution) for the structures they fold to. This conclusion is supported by theory: neutral net topology in protein sequence space tends to concentrate large evolving populations toward sequences that are mutationally most robust [137,151]. These sequences are often also thermodynamically most stable [137]. But random mutations alone, in the absence of a fitness drive towards higher native stability, are not sufficient to produce a highly concentrated population at the most stable ‘prototype’ sequence at the

2.4.2. How stable can real proteins be?

sequence-space neighbourhood of the wild-type sequence. However, when larger numbers of amino acid substitutions are applied to a wild-type, an increase in thermodynamic and/or kinetic stability of 3–4 kcal mol⁻¹ has been observed in several proteins (e.g. [143,144]). There is no experimental evidence to date indicating the existence of polypeptides that encode for an essentially unique folded structure with native stability as high as approximately 0.4 kcal per mole of amino acid residues as posited by Goldstein [136]. A case in point is the 93-residue designed protein Top7, which is already characterized as extremely stable. Its stability is approximately 13 kcal mol⁻¹ at 25 °C [145].
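A per-residue comparison puts these figures on a common footing. The short sketch below divides native stability by chain length, using only values quoted in §2.4.1 and §2.4.2: Top7 (93 residues, approximately 13 kcal mol⁻¹) and the 300-residue model protein of [136] (posited maximum 118 kcal mol⁻¹, evolved population about 9 kcal mol⁻¹). The helper name `per_residue` and the list layout are hypothetical, and dividing by chain length is only a crude normalization.

```python
# (label, chain length in residues, native stability in kcal/mol),
# all taken from the values quoted in the text.
PROTEINS = [
    ("Top7 (designed)", 93, 13.0),
    ("model protein of [136], posited maximum", 300, 118.0),
    ("model protein of [136], evolved population", 300, 9.0),
]


def per_residue(length, stability):
    """Native stability per residue (kcal/mol per residue)."""
    return stability / length


for label, length, stability in PROTEINS:
    print(f"{label}: {per_residue(length, stability):.2f} kcal/mol per residue")
```

On this measure the extremely stable Top7 reaches about 0.14 kcal mol⁻¹ per residue, whereas the posited model maximum corresponds to about 0.39, consistent with the ‘approximately 0.4’ quoted above.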
Although this level of native stability is significantly higher than several single-domain proteins [146] including the 97-residue S6 with similar secondary structure (native stability = 8.5 kcal mol⁻¹) [147], the stability of the artificially designed Top7 is still within the 5–15 kcal mol⁻¹ native stability range long recognized for natural proteins [44,131]. The highest stability achieved by more recent attempts to design stable proteins is 14.9 kcal mol⁻¹, or 0.14 kcal per mole of amino acid residues for a 110-residue construct [148]. In this light, the 118 kcal mol⁻¹ stability estimated in [136] is physically unrealistic. This exceedingly high estimate is probably an artefact of the non-explicit-chain approach used in the study (for a discussion of explicit- versus non-explicit-chain protein models, see [36]), which tends to underestimate mutational effects on the unfolded states. From a protein biophysics standpoint, however, any given mutation not only impacts the free energy of the native state but can also have a significant effect on the denatured (unfolded) state, and the effects on the two states often partially cancel, such that extremely high native stability is physically not possible [149].

encodes with marginal stabilities (close to the minimum required stability) simply because there are more sequences with that property [135]. In a more recent model, an evolved population is seen to prefer marginal stability even when the model fitness function increases exponentially with native stability [136]. In this view, if marginal stability of a protein is functionally beneficial, it may represent a ‘spandrel’ [135], i.e. a tendency occurring originally for non-adaptive reasons that is exploited subsequently by biology [138]. This population consideration argues convincingly that there might not have been extensive positive evolutionary selection to decrease the stabilities of globular proteins.
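The population argument can be illustrated with a deliberately minimal toy model, which is not the model of [135] or [136]. It assumes that native stability takes discrete levels above the minimum required for folding and function, that each mutation raises the level with a small probability `p_up` and lowers it otherwise (a stand-in for there being fewer sequences at higher stabilities), and that mutations falling below the threshold are simply rejected. All names and parameter values are hypothetical.

```python
def stationary_distribution(n_levels, p_up):
    """Stationary distribution of a reflecting random walk on stability levels.

    Level 0 is the minimum stability compatible with folding and function.
    A mutation raises the level with probability p_up and lowers it with
    probability 1 - p_up. Moves below level 0 (non-viable) or above the top
    level are rejected. Detailed balance then gives weights proportional to
    (p_up / (1 - p_up)) ** level.
    """
    ratio = p_up / (1.0 - p_up)
    weights = [ratio ** level for level in range(n_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def evolve(dist, p_up, steps):
    """Propagate a population distribution through `steps` rounds of mutation."""
    n = len(dist)
    for _ in range(steps):
        new = [0.0] * n
        for level, mass in enumerate(dist):
            up = level + 1 if level + 1 < n else level   # rejected at the top
            down = level - 1 if level > 0 else level     # rejected below threshold
            new[up] += mass * p_up
            new[down] += mass * (1.0 - p_up)
        dist = new
    return dist


if __name__ == "__main__":
    n_levels, p_up = 10, 0.1            # arbitrary illustrative values
    start = [0.0] * (n_levels - 1) + [1.0]   # population starts maximally stable
    final = evolve(start, p_up, steps=500)
    print("fraction at the marginal level:", round(final[0], 3))
```

In this sketch, a population that starts at the highest stability level drifts to the marginal level without any selection against stability. The only selection present is the viability threshold, in keeping with the nonadaptive picture described above.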
A fundamental issue that remains to be addressed, however, is the extent of evolutionary selection to increase stability. This question asks whether the stabilities of natural proteins are close to their biophysical maximum, as envisioned in the superfunnel picture (§3.2.3) or are far from a biophysically possible maximum that was not selected evolutionarily. Notably, both of the models discussed above [135,136] posit that there are amino acid sequences that can fold to a given structure uniquely with native stabilities far exceeding the experimentally observed stabilities of natural proteins. Results of the random mutation model of Taverna & Goldstein [135] show a significant population of sequences encoding with higher native stabilities than the sequences around the peak of the steady-state population. Therefore, if the sequences near the peak of the population distribution are taken as models for natural proteins, their results suggest that a significant fraction of mutations of natural proteins would lead to higher native stabilities (although that fraction is smaller than the fraction of mutations leading to lower native stabilities). In a more recent model of Goldstein [136], it is stated specifically that the 300-residue protein used in the study can potentially reach an extremely high stability of 118 kcal mol⁻¹ but the evolved population has a stability of only about 9 kcal mol⁻¹.

2.5. Geometric/topological constraints imposed by the native structure

2.6. Chaperones and in vivo folding

In molecular biology, chaperones are a class of proteins that assist the folding and assembly of other proteins, or even reverse misfolding [163]. Many mutant proteins fail to fold or be expressed in the cell because of reduced native stability, increased probability of misinteractions during folding, or other changes in folding kinetics that are detrimental to productive folding.
These biophysical constraints hinder evolution because they limit the number of mutants that can be explored. Mutations that decrease native stability below a certain threshold cannot participate in the evolutionary process even if they possess superior functionality (provided they are properly folded) because relative native instability compromises protein folding and expression. In the cellular environment, chaperones offer a degree of relief from these constraints. Molecular chaperones enhance evolvability, i.e. a genome’s ability to produce adaptive variants [164] (see §3.5), because they help mutants that are less stable to fold to functional structures and to avoid non-functional aggregation, thus allowing more mutants with potentially beneficial new functions to be explored in vivo [165,166]. This principle was borne out in experiments involving the Escherichia coli GroEL/GroES chaperonin complex. In a set of laboratory evolution experiments on four enzymes, the divergence of modified enzymatic specificity was found to be much faster when GroEL/GroES is overexpressed, most probably because GroEL/GroES assists folding of enzyme variants, allowing mutants that lose as much as 3.5 kcal mol⁻¹ in native stability to be viable whereas only approximately 1 kcal mol⁻¹ loss in stability is permitted in

2.7. Multi-basin folding landscapes, allostery and conformational dynamics

Protein structures are dynamic, and conformational dynamics is crucial in many biomolecular interactions [171,172]. Even for globular proteins that fold to an essentially unique native structure under physiological conditions, other less favourable ‘excited-state’ conformations are always populated, albeit to a much lesser extent than the dominant native conformation that is commonly identified as the ground-state structure. The balance between the dominant ground-state and excited-state populations can be altered by mutations.
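Populations of this kind map directly onto free energy differences. The sketch below assumes a simple two-state equilibrium between a ground state and a single excited state, an idealization introduced here for illustration, so that ΔG = −RT ln(p_ex/p_gs). The function names are hypothetical. As a worked case, an excited state populated to 3% at 25 °C lies about 2.1 kcal mol⁻¹ above the ground state under this assumption.

```python
import math

R_KCAL = 1.987204e-3  # molar gas constant in kcal/(mol*K)


def free_energy_gap(p_excited, temp_celsius=25.0):
    """Free energy of the excited state relative to the ground state (kcal/mol).

    Assumes a two-state equilibrium, so p_ground = 1 - p_excited and
    dG = -R*T*ln(p_excited / p_ground).
    """
    if not 0.0 < p_excited < 1.0:
        raise ValueError("p_excited must lie strictly between 0 and 1")
    temp_kelvin = temp_celsius + 273.15
    return -R_KCAL * temp_kelvin * math.log(p_excited / (1.0 - p_excited))


def excited_population(delta_g, temp_celsius=25.0):
    """Inverse relation: excited-state population for a given gap (kcal/mol)."""
    temp_kelvin = temp_celsius + 273.15
    boltzmann = math.exp(-delta_g / (R_KCAL * temp_kelvin))
    return boltzmann / (1.0 + boltzmann)


if __name__ == "__main__":
    gap = free_energy_gap(0.03)
    print(f"3% excited state at 25 C: {gap:.2f} kcal/mol above the ground state")
```

Because the gap is only a few times RT, a mutation that shifts the relative free energy of the two states by 1–2 kcal mol⁻¹ can change the excited-state population substantially, which is why the balance between the states is sensitive to mutations.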
For instance, a recent NMR experiment demonstrated that a mutant T4 lysozyme populates an excited state to about 3% at 25 °C [173] (figure 4a). Besides uniquely folding proteins, there are globular proteins that have more than one dominant folded conformation. For these proteins, the same amino acid sequence adopts more than one structure with similarly high probabilities. Thus, instead of a single funnel, the energy landscape of such a protein has multiple basins of attraction [177,178]. In some cases, these alternative structures freely interconvert during the lifetime of the protein as for the cytokine lymphotactin [179] and the cell cycle control protein Mad2 [180]. Sometimes it takes an additional factor to stabilize an alternative structure, such as a change in the solvent conditions or a binding event (e.g. [181–184]).

2.7.1. Conformational diversity is often needed for function

Multi-basin energy landscapes are widely used by Nature to regulate protein function. A prime example is allostery, by which the function of a protein is regulated through binding a ligand (effector) at a site (the allosteric site) on the protein

The Top7 example mentioned in §2.4.2 also offers insights into other aspects of the interplay between biophysical constraints and evolution. It shows that the tendency to misfold does not necessarily diminish with increasing native stability: despite the high native stability of Top7, its folding kinetics is complex, probably involving multiple kinetic traps [154,155]. Theoretical considerations indicate that the lack of two-state-like behaviour of Top7 is probably caused more fundamentally by its peculiar native structure, more so than the fact that it is an artificially designed protein that did not undergo natural selection [156].
Thus, native geometry or topology (the pattern of residue–residue contacts in the native structure) probably imposes a physical constraint on the level of stability and folding cooperativity that natural or artificial selection can achieve [156,157]. In this connection, it has been shown using simple lattice protein models that not all protein structures are equally encodable [158] or designable [159–161]. Some structures may not be encodable at all [158,162]. This represents another set of biophysical constraints under which protein evolution must operate.

the absence of GroEL/GroES [165]. In a more recent experiment to evolve a phosphotriesterase into an arylesterase in vivo, GroEL/GroES is again seen to increase the ability to adapt to new functions by allowing for more genetic variation. Moreover, it was found that mutational tolerance is not determined by in vitro native stability per se, but rather by the level of soluble expression of the mutant protein in the cell. In this case, the GroEL/GroES chaperone enhances soluble expression by apparently stabilizing a folding intermediate against detrimental aggregation and thus indirectly promotes productive folding, underscoring the critical importance of mutants’ in vivo folding kinetics on the course of protein evolution [166]. Consistent with this trend, there is also strong evidence at the genome level that proteins that use GroEL/GroES obligately for folding evolve faster [167] and are less dependent on optimal codon usage to avoid translation-induced misfolding [168] than proteins that do not require these chaperones for folding but rather rely more on optimal codons [169]. The link between translation errors and evolutionary rates will be discussed further in §3.10. On the theory front, a recent simulation of protein evolution considered a model cell containing a few interacting protein species that can adopt either ‘folded’ or ‘molten-globule’ structures [170].
Consistent with the trend seen in experiments [165,166], the simulation indicated that chaperones that actively catalyse folding also accelerate evolutionary adaptation because the increased chaperone-assisted folding rates allow for deeper searches of the sequence space [170].

bottom of the sequence-space superfunnel because of the large number of sequences that are less stable [40,137,152,153]. Therefore, the experimental observation that natural proteins are often a nearly most stable sequence that behaves like a prototype sequence suggests strongly that they are results of positive selection for higher native stability (see further discussion in §3.2.3).

[Figure 4. Panels: (a) T4 lysozyme (L99A, G113A, R119P); (b) Arc repressor (N11L, L12N); (c) cysteine-rich domain NW1 (K21P, G11V); (d) GA, GA98, GB98 and GB (panel labels: 27 mut., L45Y, 21 mut.); (e) P22 Cro, chameleon sequence and λ Cro (panel labels: 9 mut., 14 mut.). Caption opposite.]

that changes the structure and/or dynamics of the protein’s active, functional site positioned at a distance from the allosteric site [185]. Allostery is important for biological function and its malfunction is implicated in disease processes [186,187]. Mutations affect allostery. Mutational effects on allostery can be subtle because allosteric communication between the allosteric and active sites can be underpinned by multiple mechanisms [188,189]. Nonetheless, mutational effects on allostery can be rationalized by computational approaches in some instances [190]. Conformational flexibility, dynamics of protein folded states and allosteric transitions often can be deduced to a reasonable degree from the structure(s) of the protein in question using elastic network models for folded-state dynamics [191–194] or native-centric Gō-like potentials [195] with multiple folding basins ([196]; reviewed in [177]).
Similar to the aforementioned case for the probable existence of geometric/topological constraints on the evolution of folding stability and cooperativity (§2.5), the success of structure-based native-centric modelling in rationalizing conformational dynamics and allosteric transitions suggests that there are significant structural constraints on the evolution of functional folded-state dynamics. The computational efficiency of elastic network models also allows enzymes that are dissimilar in sequence and structure yet probably perform similar functions to be detected by their similar dynamic properties [194,197], making it possible for relationships between evolutionary conservation and conformational dynamics to be explored [198]. Allostery is envisioned to have evolved by oligomerization, gene fusion and/or recruitment of unused/flexible parts of a pre-existing protein structure (reviewed in [199]). The latter evolutionary route may proceed by positive selection of opportunistic binding of excited-state conformations. The mechanism of such binding may lie anywhere between the ‘conformational selection’ and ‘induced fit’ scenarios [177,200]. Evolution has apparently exploited latent allosteric potentials entailed by conformational dynamics in this manner, as in the case of Ste5 activators that target MAP kinases in yeast [201]. Opportunistic binding of excited-state conformations can also facilitate evolution of new functions that are not necessarily allosteric [122,202,203]. During such an evolutionary process, a sequence with a multi-basin energy landscape can serve as an evolutionary bridge. In particular, the evolutionary intermediate of two sequences each encoding for a

Figure 4. (Opposite.) Examples of experimentally designed bi-stable proteins and mutation-induced structural switches.
(a) Wild-type T4 lysozyme was mutated (L99A) to create an internal cavity that allows for the population of an excited-state conformation with an altered helical segment (blue; left). This T4 lysozyme variant could be further transformed via a single G113A substitution into a bi-stable protein that also populates a new folded structure in which the local structure of the helical segment is modified (red; right). An additional R119P substitution on this L99A, G113A variant then leads to a protein that adopts the conformation on the right as its essentially unique native structure [173]. (b) Wild-type Arc repressor is a homo-dimeric protein. Each monomeric unit contributes a β-strand to form a two-stranded antiparallel β-sheet (blue; left). This shared configuration becomes bi-stable with the introduction of a single N11L substitution to each of the monomeric units. The mutated sequence now populates the original structure as well as a new structure with the β-strands changed into two short helices (red; right). An additional L12N substitution on each of the monomeric units results in a sequence that adopts the new configuration on the right as its essentially unique native structure [65]. (c) The cysteine-rich domain NW1 forms a stable structural element (blue; left) with three disulfide bonds (yellow sticks) between the residue pairs (8,20), (12,25) and (16,24). A single K21P substitution results in a bi-stable mutant that also populates a structure with a different overall conformation (red; right) and an alternate disulfide-bonding pattern, now between residue pairs (8,24), (12,20) and (16,25). Introduction of a single G11V substitution on this bi-stable mutant results in a sequence that adopts the conformation on the right as its essentially unique native structure [174].
(d) Two domains of streptococcal Protein G, named GA and GB, with a 3α and a 4β + α fold, respectively, and no significant sequence similarity, were transformed into each other by a series of point mutations that resulted in a structure pair GA98 and GB98 that allows the switch between the two structures with just a single L45Y mutation. GA98 exhibits a small 4β + α population and thus may also be regarded as bi-stable [175]. (e) The viral P22 Cro and λ Cro are DNA-binding proteins. Encoded by different sequences, they have structurally very similar helical N-terminal domains (represented by the yellow and green ribbon, respectively) but have structurally distinct C-terminal domains. P22 Cro has a helical C-terminal, whereas the C-terminal of the homo-dimeric λ Cro forms a β-sheet. A 24-residue chameleon sequence created largely by mixing residues from the helical and sheet-forming C-termini adopts different secondary structure depending on whether it is inserted in the P22 or λ context [176]. Sequence and structural information presented in this figure was taken from the cited original references.

2.7.2. Bi-stable proteins and conformational switches

2.8. Intrinsic disorder

When structural plasticity is extreme, one might expect a multi-stable sequence to morph into one without a discrete set of clearly discernible favoured conformations. This in itself is not surprising because an overwhelming majority of polypeptides with random amino acid sequences do not fold to a unique structure [116]. What is remarkable, in the context of our decades-long near-exclusive focus on proteins with well-ordered structures, is the existence of many functional proteins with such extreme conformational diversity.
Although our main concern here is evolution of globular proteins, it is important to recognize that intrinsically disordered proteins (IDPs) or intrinsically disordered regions (IDRs) play key roles in cellular processes [216–222].

2.8.1. Any protein conformational state can potentially have biological function

With the discovery of functional IDPs/IDRs, it has become abundantly clear that biology can exploit any protein conformational state that it finds useful. In this respect, an

Experiments in several laboratories have found cases where a single mutation was able to either create a bi-stable protein from a uniquely folding protein or completely switch one uniquely folding protein to another with a new native structure [65,173–175,205,206]. Although these cases of mutation-induced structure switches were artificially engineered, they demonstrated that it is generally possible for bi-stable proteins to arise through mutations during natural evolution. An early example of mutation-induced structure switching was the Arc repressor, which is a homodimer with a two-stranded inter-unit β-sheet. Experiments by Cordes et al. [205] showed that the β-sheet in the wild-type protein can be changed to a pair of 3₁₀-helices by two amino acid substitutions that swap the neighbouring sequence positions of an asparagine and a leucine. A subsequent experiment indicated that a mutant with a single asparagine-to-leucine substitution has approximately equal populations of the β-sheet and helical forms, and thus may be regarded as an evolutionary bridge [65] (figure 4b). A recent study showed further that if two more polar or charged to hydrophobic substitutions are introduced, the resulting triple mutant adopts an octamer configuration with approximately half the helical content of wild-type Arc, indicating that new protein–protein interactions and novel oligomeric states can readily result from a small number of mutations [207].
Experimental mutagenesis has uncovered a similar behaviour in the cysteine-rich domains (CRD) of cnidarian nematocyst proteins. Different CRDs fold to either one of two structures with different disulfide-bonding arrangements despite high sequence similarity and identical sequence patterns for their cysteines. Meier et al. [174] found that a CRD sequence that folds to one disulfide arrangement can be converted to another disulfide arrangement by only two amino acid substitutions, one from lysine to proline and the other from glycine to valine, whereas the single-substitution mutant with only the lysine-to-proline mutation behaves as an evolutionary bridge that populates both disulfide arrangements (figure 4c). This finding again underscores that large structural changes can be effected by minimal changes in the amino acid sequence. The study by Alexander et al. [175] of the GA/GB system showed that a single leucine-to-tyrosine substitution can convert a sequence encoding for an albumin-binding 3α (GA) structure to a sequence encoding for an immunoglobulin-binding 4β + α (GB) structure (figure 4d). A subsequent experiment on two other mutants identified two additional 3α ↔ 4β + α structure switches induced by a single amino acid substitution [206]. Interestingly, a mutant with a conformational ensemble that is 95% 3α and only 5% 4β + α when measured in isolation nevertheless binds immunoglobulin but not albumin [206], providing an excellent example of how protein–protein interactions can dramatically shift the conformational distributions of the binding partners [200]. Another recent example of an artificial ‘evolutionary intermediate’ is a 24-residue sequence that can adopt either the α-helical or β-sheet C-terminal conformations, respectively, of transcription factors P22 Cro and λ Cro, depending on whether the designed sequence is fused with the N-terminal domain of P22 Cro or λ Cro [176] (figure 4e).
In this case, the naturally occurring wild-type 24-residue C-terminal sequences of P22 Cro and λ Cro have only five identical amino acid positions, whereas the amino acid residues of the designed sequence at all but four positions are identical to those in either the wild-type P22 Cro or the wild-type λ Cro. This finding underscores the critical role of tertiary context in determining secondary structure in proteins [208]. Although the designed sequence is nine and 14 substitutions away from the corresponding sequences in wild-type P22 Cro and λ Cro, respectively, the successful design of a structurally ambivalent ‘chameleon’ sequence in this experiment suggests that a smooth evolutionary transition from one Cro fold to another is possible [176]. Computation-assisted design of conformational switches has seen notable success [209,210]; but it is still a challenge to apply our current biophysical knowledge to provide a fundamental physical rationalization for experimentally observed conformational switching. For the GA/GB system, a mutation-induced gradual stabilization of one structure over another was demonstrated using a commonly used software package for ΔΔG prediction (§2.1) [178]. However, the mutation-induced GA/GB conformational switching was not reproduced in atomistic molecular dynamics simulations [67], even though a part of the simulated energetics is consistent with experiment [68]. The structural plasticity in bi-stable and multi-stable proteins probably plays an important role in protein evolution [122,211,212]. Conformational switches and bridge sequences facilitate evolution by allowing continuous or near-continuous transition from one folded structure to another. The experiments in figure 4 suggest that, under certain circumstances, multi-functional proteins can be created by only a few mutations that stabilize certain hidden or excited states.
A situation where it is advantageous to take such a route is the coevolution of pathogens and their hosts, a highly competitive evolutionary process that demands frequent change of protein shapes and functions. It is thus unsurprising that bi-stability and multi-specificity are exhibited in antibodies [213], antimicrobial peptides in natural plant defence [214] and antiviral proteins [215].

different dominant structure can be a bi-stable sequence that folds to both structures with equal or similar probabilities [161,178,204].

What can be expected of the biophysical constraints on the evolution of IDPs/IDRs? IDPs/IDRs do not fold to a unique structure. Therefore, in contrast to many globular proteins, the energy landscapes of IDPs/IDRs are not funnel-like [222]. As far as near-neutral mutations [237,238] are concerned, one might expect fewer biophysical constraints on IDP/IDR evolution than on globular protein evolution because for IDPs/IDRs there is no need to maintain an essentially unique folded structure. However, it can also be argued that evolution of certain IDPs/IDRs may be subject to even more restrictive constraints because of their requirement to bind to multiple partners. As a result, these IDPs/IDRs may suffer from low mutational robustness similar to that of bi-stable globular proteins that play the role of an evolutionary bridge between two folded structures [178,239]. Nevertheless, even in such cases, IDPs/IDRs in a neutral net might only need to conserve certain functional residues that are compatible with multiple binding partners while imposing few constraints on mutations at amino acid sites in the rest of the protein. These expectations are largely consistent with database studies and experiments. Phylogenetic analyses indicate

2.9. Protein dynamics and phenotypic plasticity: what is a molecular phenotype?
In the study of molecular evolution, the term genotype is used for the inheritable part of genetic information; whereas phenotype refers to the biomolecules of interest that are produced based on the genotypic information. In theoretical studies of protein evolution, as a modelling simplification, the genotype may be identified with the amino acid sequence because as far as in vitro protein folding is concerned, it contains essentially the same information as the nucleic acid sequence that encodes it. This is a simplified approach that neglects in vivo complexities such as the fact that synonymous mutations can lead to altered cellular folding pathways (§2.2). In principle, the molecular phenotype should encompass all

2.8.2. Biophysical constraints on evolution of intrinsically disordered proteins and regions

that IDRs generally evolved faster than ordered regions of proteins, but some IDRs such as DNA-binding regions evolved slower [240,241]. For proteins that have both ordered and disordered regions, mutations in IDRs lead to smaller stability changes than in ordered regions. Thus, IDPs/IDRs may enhance protein evolvability and the development of new functions [242], as evolutionary changes in protein sequence and structure are often correlated with local flexibility and disorder [243]. The biophysical constraints on IDP/IDR evolution [244] are quite different from those on folded protein evolution [12]. In fact, the accepted amino acid substitutions in IDPs/IDRs resemble those in solvent-exposed loops and turns of globular proteins [244]. Chemical composition defined as the fraction of positive, negative, polar, hydrophobic and special (proline and glycine) residues is often maintained across IDR orthologues that otherwise exhibit little conservation [245].
This observation is in line with the finding that whether an IDP is elastomeric or amyloidic depends largely on the relative compositions of proline and glycine [228], and is consistent with the central role of aromatic composition in a set of IDP interactions that are presumably underpinned by cation–π attraction [236]. Relative to the substitution matrices for globular proteins, substitution matrices for IDPs/IDRs entail a generally higher probability of evolutionary changes, but some residues such as tryptophan and tyrosine tend to be highly conserved in IDPs/IDRs, perhaps because of their critical role in protein–protein interfaces [244,246]. It should be recognized that IDP/IDR conformations are far from random. Biological functions of proteins are always underpinned by conformational structures. In this respect, the difference between IDPs/IDRs and ordered proteins is that the IDP/IDR function is conferred by a much more diverse conformational ensemble than for globular proteins. The transient, ‘fuzzy’ tertiary contacts in IDP/IDR conformations are often important for the function; hence mutations that disrupt such contacts can be extremely detrimental to function. An example of how a single mutation can disrupt IDP function is the threonine-to-arginine mutation at position 45 of the cyclin-dependent kinase inhibitor Sic1 [234,247]. This amino acid substitution leads to a dramatic increase in its hydrodynamic radius [234] and, at the same time, a serious disruption of its biological function in regulating the cell cycle [247]. Current biophysical understanding of this and other mutational effects on IDP/IDR conformational distribution is limited. Much remains to be discovered about the evolution of these proteins.
rsif.royalsocietypublishing.org intriguing recent suggestion is that although avoidance of amyloid-like aggregation has apparently been a driving force of protein evolution [223] (§2.2), it is possible that modern protein folds have an amyloid origin in evolution [224]. For IDPs/IDRs, current understanding of the evolution of the triplet genetic code [225] suggests that the amino acid composition of primordial polypeptides was conducive to more disordered conformations before the modern genetic code for a 20-letter amino acid alphabet was completed [222]. However, surveys of modern proteomes indicate that IDPs/IDRs are more common in eukaryotes than in prokaryotes: more than 32% of amino acid residues in eukaryotic proteins are in IDPs/IDRs whereas the corresponding percentage is less than 27% for prokaryotic proteins. This pattern suggests that the proteins in the last universal ancestor were probably well structured and emergence of the IDPs/IDRs observed today was relatively late [222], perhaps coinciding with an evolutionary trend that has witnessed a general decrease in protein hydrophobicity [226]. According to one estimate, more than 30% of eukaryotic proteins have IDRs of more than 50 consecutive residues [216], consisting of more proline, glycine and charged residues but fewer hydrophobic residues [227,228]. IDPs/IDRs are involved in fundamental processes such as transcription, translation and cell cycle regulation that, when they malfunction, can lead to cancer. The essential role of IDPs/IDRs in mediating biological regulation suggests that, in some situations, they have certain advantages over folded proteins in recognition and binding [229]. For instance, their ability to flexibly bind to many different partners has allowed them to occupy hub-like roles in protein –protein interaction networks [230,231]. They can also encode relatively larger intermolecular interfaces to economize genome and cell sizes [218]. 
Protein–protein interactions for some IDPs/IDRs entail significant folding upon binding [219], while others undergo only restricted local ordering at the binding site with other parts of the protein remaining disordered, thus forming a dynamic ‘fuzzy’ complex [220,232–236].

3. Applications of biophysics-based models to understand protein evolution

3.1. Protein sequence and structure spaces: evolution meets biophysics

Recent genomic and proteomic initiatives have greatly advanced our knowledge of the global sequence–structure relationship and the evolution of the primary, secondary, supersecondary, domain, tertiary and quaternary/complex structures of proteins [266–269]. An earlier study suggested that α/β folds appeared later in evolution than other structural classes [266]. However, more recent investigations indicate that the globular protein architectures observed today had emerged during evolution roughly in the following order: α/β (e.g. TIM-barrel), α + β, all-α, all-β, then multi-domain proteins [268]. Consistent with this perspective, a core of functional diversity corresponding mostly to the more ancient α/β folds in the protein structure space has been identified [270]. Interestingly, a computer modelling study of the dynamic properties of protein fold space suggests that α/β folds are also more stable than other fold classes [271]. In the space of all possible amino acid sequences, an overwhelming majority of sequences do not have a biological function. It has been estimated that the probability of finding a functional protein among random amino acid sequences is approximately 10⁻¹¹ [272]. Evolutionarily, natural protein sequences are still diverging from one another today, albeit at a slow rate because biophysical and functional constraints allow only about 2% of amino acid sites to be mutated [273].
This ongoing divergence means that the coverage of the space of all possible sequences by biologically viable sequences has been and still is increasing; i.e., there has been a continuing expansion of the ‘protein universe’ since the beginning of life on the Earth [273]. There is consensus among researchers that the repertoire of globular protein folds is probably finite. The SCOP classification [274] currently identifies about 1200 different folds in the PDB. Estimates for the total number of possible folds range from about 1000 [275], 2000 [276] to 10 000 [277]. Fold classifications are inherently difficult and estimation of the total number of possible folds is sensitive to the definition of a fold (see [278] and references therein). Nonetheless, in most cases, structures of recently sequenced proteins are related to known folds [279], suggesting that the existing PDB structures are probably a near-complete representation of all biologically viable globular protein folds. A broader issue is whether the observed natural globular protein folds constitute a relatively small subset selected by evolution from a much larger collection of all physically possible compact conformations. Biophysics and polymer physics

To gain insights into how proteins evolve, various models of the mapping from protein sequence (genotype) space to structure (phenotype) space have been constructed. Here, we focus largely on models with an explicit representation of the protein chain and biophysics-based interactions because these models provide a better delineation of what is physically plausible for real proteins than other models of molecular evolution that postulate such a mapping in the absence of or with little biophysical considerations.

properties—including but not limited to biological functions—of the protein encoded by the genotype.
In practice, molecular phenotypes in theoretical and experimental investigations are defined, and thus are restricted, by the question being addressed. However, an oversimplified view of molecular phenotypes that is too restrictive can hinder understanding of important principles of protein evolution. For globular proteins that have an essentially unique folded structure, a practical and seemingly natural definition of molecular phenotype of a given amino acid sequence is its structure as deposited in the Protein Data Bank (PDB). This practice is useful for constructing a neutral net of sequences that encode uniquely for the same protein structure and the evolution from one such phenotype to another [137,161]. However, this simplistic view of molecular phenotype neglects the dynamic nature of proteins. Recent advances in experimental techniques, especially those using NMR, have enabled detailed characterizations of the dynamic properties of proteins [173,248 –250] and, in conjunction with computation, allowed for the construction of ensembles of diverse conformations of disordered proteins based on NMR and other experimental measurements [251,252]. As a result of these experimental advances and the theoretical energy landscape perspective [34,253,254], our view of how protein molecules function has undergone a drastic change in the past two decades, with increasing recognition of the biophysical, biological and evolutionary significance of protein dynamics [255,256]. Because of the role of dynamics in protein function (§2.7 and 2.8), identifying a protein’s molecular phenotype only with its native folded structure is often too restrictive. Ideally, the molecular phenotype of an amino acid sequence should correspond to the totality of its biologically relevant properties. 
Although it may not be practical to enumerate many properties of a protein, for many applications the molecular phenotype should at least be understood as an ensemble of conformations with a sequence-specific and environment-dependent distribution. Within this ensemble, certain phenotypic properties, such as the presence of a secondary structure in the protein conformation, are not necessarily fixed but can undergo thermal fluctuations or environment-induced changes. This phenomenon is referred to as single-genotype phenotypic fluctuation or phenotypic plasticity, which can underpin important evolutionary responses to environmental changes [257]. Phenotypic plasticity tends to enhance evolvability. This trend can be seen clearly in an experimental evolution study of E. coli cells that express mutants of green fluorescent protein. In this experiment, mutants leading to a larger fluctuation in fluorescence among cells containing the same green fluorescent protein gene were found to exhibit a higher rate of evolution [258]. A positive correlation between single-genotype phenotypic fluctuation and evolvability has also been rationalized recently by computational models of proteins [259] and RNA [260]. As mentioned in §2.7.2 above, plasticity of molecular phenotype (i.e. conformational diversity) is generally conducive to higher evolution rates that can be beneficial to organisms in rapidly changing environments [261,262] through selection of ‘moonlighting’ [263] promiscuous functions [264]. We will elaborate below on how the relatively new view of protein dynamics and conformational distribution [202,203,265] has enriched our understanding of evolution.

Sequence and structure spaces of proteins are vast. Coarse-grained explicit-chain models are valuable tools in the study of protein evolution.
Currently, modelling the sequence –structure mapping by energetically and structurally high-resolution representations is often not practical, especially for addressing large-scale evolutionary changes involving many different protein folds. In this regard, lattice models—wherein conformations of model proteins are configured on two- or three-dimensional (2D or 3D) lattices—are particularly useful because of their computational tractability [292– 294]. In view of their historical and current utility for investigating fundamental evolutionary issues (see, e.g., recent applications of 2D lattice models to study the basis of homology modelling [295], adaptive conflict [239] (§3.6) and long-term survivability [296]), a 3.2.1. Conformational enumeration Among lattice models of protein evolution, simple exact models allow for exhaustive enumeration of all possible sequences and structures in the model [292]. These models include the two-letter 2D hydrophobic-polar (HP) model that uses a reduced alphabet consisting of only two types of residues, hydrophobic (H) and polar (P), to capture the prominent effects of hydrophobic interactions in protein energetics [137,161,162,297–299] (figure 5), two-letter variants of the 2D HP model (e.g. [293,301,302]), and a four-letter 2D model that also includes residues behaving somewhat like positive and negative charges [303]. Some 20-letter 2D models may also be considered as exact, because they consider all possible mutations in the immediate sequence-space neighbourhood of a given sequence [40,153]. Other types of lattice models have been used to study protein evolution as well. These models either restrict chain conformations to be maximally compact so as to allow model proteins with longer chain lengths to be studied [152,160,304], or rely to various degrees on sequence-space sampling instead of considering all possible sequences because of their usage of a 20-letter amino acid alphabet that entails many more sequences (e.g. 
[40,153,305]) or both (e.g. [306,307]). Restricting model structure space to maximally compact conformations in 2D [306] or 3D [307,308] reduces computation drastically because such conformations constitute only a tiny fraction of all possible conformations [280,309–312]. For instance, whereas the total number of all distinguishable conformations (not related by rotations and reflections) for a chain with 25 residues configured on the 2D square lattice is 5 768 299 665 [313], the number of maximally compact 25-residue conformations restricted to a 5 × 5 square, as considered in the study of Taverna & Goldstein [306], is only 1081 [309]. In 3D, the total number of all possible conformations (not related by rotations and inversions) on the simple cubic lattice for a chain with 27 residues is 11 447 808 041 780 409 [313,314], but the number of maximally compact 27-residue conformations restricted to a 3 × 3 × 3 cube, as considered by Deeds & Shakhnovich [307], is only 103 346 [280]. However, from a biophysical standpoint, it is important to keep in mind that real protein chains are not restricted to be maximally compact. Although behavioural trends predicted by models that use only maximally compact conformations may sometimes correlate with models that consider the full conformational ensemble, significant distortions of protein folding energetics are introduced by this approach [158,292]. In particular, for a given model sequence with a physically plausible interaction, the lowest-energy structure among maximally compact conformations may not be the true ground state structure, which is often less than maximally compact [299,315,316].

3.2.2. Model interactions and their biophysical basis

We next consider the physicality of the model interaction potential. Several examples presented below to illustrate recent biophysical insights into basic principles of protein evolution are based on the 2D HP model (figure 5).
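The exhaustive conformational enumeration discussed in §3.2.1 can be illustrated for very short chains. The following is only a minimal sketch, not the method of the cited studies: the function name `count_conformations` is hypothetical, and the symmetry handling (fixing the first bond along +x and requiring the first off-axis step to point in +y) is one simple way, assumed here, to count conformations not related by rotations and reflections.

```python
def count_conformations(n_residues):
    """Count self-avoiding chain conformations on the 2D square lattice,
    counting conformations related by rotation or reflection only once.

    Illustrative sketch; the name and approach are assumptions, not taken
    from the review or its references.
    """
    if n_residues < 2:
        return 1
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))

    def extend(pos, visited, remaining, bent):
        # 'bent' records whether the chain has already left the x-axis.
        if remaining == 0:
            return 1
        total = 0
        for dx, dy in moves:
            # Until the first turn, forbid the downward step; this removes
            # the mirror image of every non-straight conformation.
            if not bent and dy == -1:
                continue
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt in visited:  # self-avoidance
                continue
            visited.add(nxt)
            total += extend(nxt, visited, remaining - 1, bent or dy != 0)
            visited.remove(nxt)
        return total

    # Fixing the first bond along +x removes the four lattice rotations.
    start, second = (0, 0), (1, 0)
    return extend(second, {start, second}, n_residues - 2, False)


if __name__ == "__main__":
    for n in range(3, 11):
        print(n, count_conformations(n))
```

The count grows roughly exponentially with chain length, so a plain recursion like this is practical only for short chains; reproducing the 25-residue figure quoted above would require a far more efficient algorithm.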
We choose the 2D HP model for this purpose because owing to

3.2. Simple exact models and other explicit-chain coarse-grained models of protein evolution

detailed assessment of the biophysical foundation of these models is in order.

have shed light on this question. A hallmark of most globular proteins is their helical and/or sheet-like organization. These secondary structures facilitate backbone–backbone hydrogen bonding in the folded protein core (reviewed in [12]). Secondary structures are conducive to tight tertiary packing as well. It has been shown that secondary-structure-like chain organization is enhanced by conformational compactness [280–282], but in the absence of hydrogen bonding such structures exhibit deviations from sharply defined α-helices and β-sheets [283–285]. These findings suggest that biophysical constraints of conformational compactness in conjunction with hydrogen bonding can go a long way in accounting for the basic architecture of globular protein folds. A more recent study using computational sampling of homopolypeptide conformations suggested further that the current repertoire of globular protein folds is nearly complete in its coverage of all physically possible compact folds [286]. However, studies by three other groups have found instead that the current fold repertoire represents only a small fraction of all possible folds [287–289]. In particular, an investigation of the compact conformations of 60-residue polyvaline concluded that known protein folds constitute only approximately 5% of all physically possible folds, and that on average the natural folds have more local intrachain contacts (i.e. lower contact orders [146]) than the set of all possible folds, suggesting an evolutionary preference for structures with lower contact orders [288].
In response, a recent study argued that inasmuch as an appropriate criterion for matching simulated compact conformations and natural folds is applied, the existing library of single-domain PDB structures is probably complete in covering all physically possible folds [290]. A separate study by the same group indicated that computer-generated compact conformations tend to contain cavities resembling binding pockets in natural proteins as well, even in the absence of selection, suggesting that ‘many features of biochemical function arise from the physical properties of proteins that evolution likely fine-tunes to achieve specificity’ [291]. While the degree to which evolution has shaped the space of known protein folds remains to be further elucidated, the investigative effort described above is an excellent illustration of how explicit-chain biophysical models can be harnessed to address fundamental questions in protein evolution.

[Figure 5, panels (a)–(c): lattice-model drawings not reproduced. Labels recoverable from the figure: ‘hydrophobic–hydrophobic (HH) contact’; in panel (c), ‘misfolded’, ‘P to H’ and ‘H to P’.]

Figure 5. The simple exact 2D HP lattice model is a useful tool for studying evolution across entire sequence and structure spaces. (a) An example HP model sequence of length 18. Hydrophobic (H) and polar (P) residues are depicted, respectively, as black and grey beads. A favourable energy is assigned to each hydrophobic–hydrophobic (HH) contact (as indicated by the orange connections between two black beads), other contact types are neutral (carry a zero energy). The total energy of a conformation is proportional to the number of HH contacts. For a given sequence, the energy of every conformation can be computed accordingly. The schematic drawing below the sequence shows the folding funnel of the sequence.
Conformations with more HH contacts are placed lower, because they are energetically more favourable than conformations with fewer HH contacts. The conformation encased in the grey circle is the lowest-energy (native) conformation of the given sequence. (b) Conserved and variable sites in an HP model structure. Among all the HP sequences that fold to this structure, the sequence shown here is the one that provides the highest native stability. The number on each bead is the relative frequency of occurrence of an H residue at the given sequence position among the entire neutral set of 48 HP sequences encoding for this structure [137]. Most core hydrophobic positions cannot be mutated (1.0H, i.e. 100% H) without losing the native structure; one surface position must be polar (0.0H, i.e. 100% P); but most other surface positions could be mutated to either polar or hydrophobic. This means that a hydrophobic-to-polar substitution has a very different structural effect depending on its location (surface or core). (c) Epistasis in the 2D HP lattice model. The sequence on the right is a double mutant of the sequence on the left (mutated positions indicated by circles and arrows, respectively). As for a real protein, here the first mutation could occur in either of the two positions to produce a single-substitution intermediate sequence. One of the intermediate sequences is viable. It folds uniquely to the same structure (lower middle drawing in (c)), whereas the other intermediate sequence is misfolded: in addition to the original structure, this sequence adopts two other equally likely lowest-energy conformations as shown in the upper middle drawing in (c). Since the protein core is conserved, these epistatic interactions occur at the surface. In approximately 90% of such epistatic double mutants in the model (as calculated for the given example structure), the non-viable mutation is a P to H substitution.
A real-world example of this form of epistasis is found in adenylate kinase [300].

the model’s simplicity, its biophysical underpinning is transparent and intuitive; yet, the same underpinnings are closely related to those of more complex 2D lattice models with a biophysics-based 20-letter alphabet (see discussion below in this section). In fact, the biophysical criteria used to evaluate the 2D HP model may be applied as well to assess various 3D lattice models of protein evolution, including those that are being developed to study the impact of protein–protein interactions on evolution in model cells (§3.2.3). One advantage of the minimalist HP construct is that, within the model, it allows for an exact, complete description of the sequence–structure mapping for short model proteins of lengths up to approximately 25 residues [316,317]. However, the extreme coarse-graining of both the sequence and structure spaces in this model means that energetic heterogeneity (diversity) among 20-letter real protein sequences with the same two-letter HP pattern is ignored; and mutations among 20-letter sequences with the same HP pattern are not considered. Obviously, the correspondence between model lattice conformations and real protein structures can only be intuitive. Nevertheless, short-chain 2D HP models do capture a number of essential features of the sequence–structure mapping of real globular proteins. First, only a small fraction (approximately 2%) of HP sequences with chain lengths less than or equal to 30 have a unique lowest-energy structure

3.2.3. Predictions and rationalizations

In several 2D [137,161,293,299,303,304,323] and 3D [305,335] lattice models as well as an off-lattice model [304], protein sequence space was found to be organized as multiple neutral nets.
A neutral net is a network of sequences that are connected by single-point substitutions and encoding for the same folded structure. In a few studies that addressed the global connectivity of sequence space, it has been observed further that neutral nets for different folded structures are interconnected to form a dominant ‘supernet’ covering most of the sequence space [298,303,324] in a manner similar to that envisioned by Maynard Smith [336]. Consistent with experiment [139 –142] (§2.4.2), mutations on encoding sequences often result in sequences encoding for the same structure [137,297,305], especially if the mutation site is on the surface of the folded protein [297]. In other words, many sequences are stable against mutation, a property referred to as mutational robustness [306]. Several models indicate that the topology of a protein neutral net has a superfunnel organization [137], wherein sequences encoding for the same structure tend to centre around a prototype sequence with maximum mutational robustness as well as maximum thermodynamic stability for the encoded native structure [137,293,304,323,335]. Mutational stability is generally correlated with thermodynamic stability [114,137], such that sequences at the edge of the neutral net have lower native stability [137,299,337]. Consistent with this trend, a 20-letter 2D lattice model predicts that, with increasing number of amino acid substitutions, the probability for a protein to retain its original structure declines exponentially [40]. Evolution of protein function has been explored using 2D square-lattice [296,301,302,330] and 3D diamond-lattice [335] model sequences that encode folded structures with a binding site. In the 2D square-lattice model studies of Bloom et al. [330,338], folded-state stability was found to promote evolvability of new binding functions, consistent with experimental observations. These simulated evolution processes allow for extensive exploration of sequence space. 
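The HP energy function of figure 5 and the neutral-net definition above can be combined in a small sketch. Everything here is illustrative rather than taken from the cited models: the function names are hypothetical, the contact energy is assumed to be −1 per HH contact (the source says only that HH contacts are favourable and all others carry zero energy), and a sequence is taken to encode a structure only if that structure is its unique lowest-energy conformation.

```python
from itertools import product


def conformations(n):
    """All n-residue (n >= 2) square-lattice conformations, one per
    rotation/reflection class (first bond along +x, first turn towards +y)."""
    out = []

    def extend(path, bent):
        if len(path) == n:
            out.append(tuple(path))
            return
        x, y = path[-1]
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if not bent and dy == -1:
                continue  # skip mirror images
            nxt = (x + dx, y + dy)
            if nxt not in path:  # self-avoidance
                path.append(nxt)
                extend(path, bent or dy != 0)
                path.pop()

    extend([(0, 0), (1, 0)], False)
    return out


def contacts(conf):
    """Pairs (i, j) of residues that are lattice neighbours but not bonded."""
    return frozenset(
        (i, j)
        for i in range(len(conf))
        for j in range(i + 2, len(conf))
        if abs(conf[i][0] - conf[j][0]) + abs(conf[i][1] - conf[j][1]) == 1
    )


def native_structure(seq, maps):
    """Index of the unique lowest-energy conformation, or None if degenerate.
    Assumed energy: -1 per HH contact, 0 for every other contact."""
    energies = [-sum(seq[i] == "H" and seq[j] == "H" for i, j in m) for m in maps]
    e_min = min(energies)
    ground = [k for k, e in enumerate(energies) if e == e_min]
    return ground[0] if len(ground) == 1 else None


def neutral_sets(n):
    """Map each encodable structure index to the set of sequences encoding it."""
    confs = conformations(n)
    maps = [contacts(c) for c in confs]
    sets = {}
    for letters in product("HP", repeat=n):
        seq = "".join(letters)
        k = native_structure(seq, maps)
        if k is not None:
            sets.setdefault(k, set()).add(seq)
    return confs, sets


def neutral_nets(seqs):
    """Split a neutral set into nets connected by single-point substitutions."""
    remaining = set(seqs)
    nets = []
    while remaining:
        stack = [remaining.pop()]
        net = set(stack)
        while stack:
            s = stack.pop()
            for i in range(len(s)):
                t = s[:i] + ("P" if s[i] == "H" else "H") + s[i + 1:]
                if t in remaining:
                    remaining.remove(t)
                    net.add(t)
                    stack.append(t)
        nets.append(net)
    return nets
```

For a four-residue chain, the only conformation with a contact is the U-shaped one, so the four sequences with H at both ends form a single neutral net. Longer chains (for example `neutral_sets(10)`) give multiple structures and nets, at rapidly growing cost.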
However, no structural change to the fold of the evolving lattice protein was seen during the simulation. Thus, the scope of protein evolution in these models [330,338] is largely limited to the

as in real proteins (reviewed in [116,204]). However, certain modified, ‘shifted’ forms of MJ potentials that are similar to table VI of [329] with prominent repulsive energies [331] do not embody this biophysical property (reviewed in [158,204]), making it problematic to interpret some of the predictions from such models. For instance, in some cases, a shifted-MJ potential may lead to nominally charged residues instead of hydrophobic residues occupying the core of a model protein [114,204,331]. These limitations notwithstanding, all aforementioned lattice models provided useful evolutionary insights. Different approaches are often complementary because they tackle different aspects of protein evolution. However, caution should always be used to take these models’ limitations into account so as not to over-interpret model predictions. Major earlier progress of lattice models of protein evolution can be found in several reviews [292,332–334]. We now recall briefly some of the significant early efforts before highlighting recent advances.

[158,162,299], consistent with experimental observations that only a tiny percentage of random amino acid sequences (much lower than 2%) can fold and/or have a biological function [116,272]. Experimentally, it is very rare for folded and/or functional sequences to arise in random sequence search, but binary HP patterning can help with artificial design of such sequences [318,319]. Second, the small fraction of model HP sequences that have a unique lowest-energy 2D structure exhibit statistics of HP patterns similar to that of real proteins [320,321].
Third, many of the lowest-energy 2D HP structures are highly compact but not maximally compact, as for real globular proteins [158,299], and a significant fraction of compact structures are not encodable by any HP sequence as its unique native structure [158,162]. The latter observation may bear on the question of whether the currently known set of globular protein folds is nearly complete in its coverage of all physically possible compact folds, as discussed in §3.1 [286–290], but one has to also keep in mind that the HP model interaction potential is less heterogeneous, and thus entails fewer encodable structures, than model potentials that contain repulsive interactions or otherwise more heterogeneous interactions [158,322,323]. Fourth, the 2D HP lattice model provides sequences that act like evolutionary bridges [161,178,204,239,299] (see §3.5), encode for autonomous folding units [292,316,324], and exhibit homology-like behaviours [295], all similar to properties observed in real proteins. Two likely physical reasons underlie the similarity between the sequence–structure mapping of the 2D HP model and that of real globular proteins. First, the HP model potential captures the prominent effect of hydrophobic interactions in protein folding [44]. Similar to the hydrophobic effects operating in real proteins, the model potential leads to folded structures with a hydrophobic core and mostly polar residues on the surface (see examples in figure 5). Because the surface-core ratio of folded structures on the 2D square lattice with chain lengths approximately 16 is similar to the surface-core ratio of 3D folded structures with chain length approximately 150 [317], the energetics of short-chain 2D HP models should bear resemblance to that of real proteins with approximately 150 amino acid residues.
Second, as has been argued [324], although the simple potentials of the HP model and its two-letter variants, and for that matter even 20-letter lattice potentials, are not sufficient to capture more detailed thermodynamic [325] and/or kinetic [326] properties of protein folding, the HP potential may still provide a useful caricature of the mapping between sequence and folded structure of real globular proteins because of the ‘consistency principle’ [327] or ‘principle of minimal frustration’ [328]. These principles stipulate consistency or near-consistency among various energy terms that contribute to the stability of natural proteins. Therefore, in this perspective, the folded state of a protein—at least for a natural, evolved protein—is expected to be a lowest-energy structure for a hydrophobic interaction potential similar to the one prescribed by the HP model, although other interaction types may provide it with additional stabilization. Besides the HP and HP-like models, many of the 20-letter models adopt interaction energies derived from the PDB-based Miyazawa–Jernigan (MJ) statistical potential [329]. The original MJ potential (Table V of [329]), as was used in [40,153,296,330], embodies the hydrophobic effect, thus it tends to place non-polar residues in the folded protein core

recently, Zhang and co-workers [99] developed a related lattice approach to model evolution of protein–protein interactions that offers an explanation of slow evolution of highly expressed proteins in terms of stronger constraints on these proteins to avoid misinteractions.

3.3. Energy, entropy, fitness, protein neutral nets and fitness/mortality landscapes

Protein evolution can be formulated in terms of population dynamics on a fitness or adaptive landscape in which a fitness function assigns a fitness value to each protein sequence, with evolving populations migrating to areas of higher fitness over time [348–350]. Certain parallels may be drawn between fitness landscapes defined on sequence spaces and energy landscapes defined on protein conformational spaces. Mathematically, both protein sequence and conformational spaces are high-dimensional. As for protein energy landscapes, fitness landscapes are intrinsically high-dimensional constructs, even though fitness and energy landscapes are often depicted as 1D profiles or 2D surfaces for metaphorical, conceptual visualization [34,254,317,348,351]. With this in mind, it is important to focus on the quantitative fitness function itself and not to over-interpret the picturesque 1D and 2D landscape representations [351]. The biological fitness concept has been compared to the physical quantity of negative energy or negative free energy [352]. There is an obvious analogy between the sequence-space distribution of steady-state evolutionary population and the conformational-space distribution of equilibrium population. Just as lower-energy states and higher-entropy macrostates are more favoured in statistical mechanical systems, higher-fitness sequences and phenotypes encoded by more sequences are expected to be more populated during evolution. Under certain limiting conditions, a direct quantitative correspondence can be made between fitness and energy, as well as between population size and inverse temperature [352]. In general, however, there is an important difference between the analogous roles of energy and fitness: insofar as a statistical mechanical system is ergodic [353] and total population is conserved, the equilibrium population of a given state is determined by its energy in accordance with the Boltzmann distribution.
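For reference, the Boltzmann distribution invoked here has the standard textbook form below. The symbols ($p_i$ for the equilibrium population of state $i$, $E_i$ for its energy, $k_{\mathrm B}$ for Boltzmann's constant, $T$ for absolute temperature and $Z$ for the partition function) are notation chosen for this note and do not appear in the review.

```latex
p_i = \frac{e^{-E_i / k_{\mathrm{B}} T}}{Z},
\qquad
Z = \sum_j e^{-E_j / k_{\mathrm{B}} T}
```

In the analogy described in the text, fitness plays a role akin to negative energy and population size a role akin to the inverse temperature $1/k_{\mathrm B}T$; the precise mapping and the limiting conditions under which it holds are those of [352] and are not reproduced here.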
By contrast, for evolution, total population can be increased by reproduction and decreased by lethal mutations. As a result, the steady-state population of a given sequence in a large evolving population is determined not only by the reproductive fitness of the sequence but also by its connectivities to neighbouring viable sequences. This issue will be addressed in the discussion on mutational robustness in §3.4. Studies of molecular evolution indicate that evolutionary pathways on fitness landscapes are subjected to various constraints [351,354]. The topographies of fitness landscapes of real proteins are far different, on average, from those postulated theoretically by random assignments of fitness [351]. This observation underscores the importance of biophysical considerations in constructing model fitness landscapes. As an illustration, figure 6 shows the sequence space of a 2D HP model (figure 6a), part of its neutral net organization (figure 6b), and a model fitness landscape that identifies fitness with native stability (figure 6c). As discussed above, the biophysics embodied by the HP model endows it with several essential protein-like properties. For instance,

development of new binding abilities through modifications of surface residues, while the evolved protein retains its original structural scaffold. For larger structural changes, one expects that it would be harder for a more stable protein to evolve to a specific new fold, because a more stable sequence is farther away from the edge of the neutral net where it can switch to another fold (§2.7.2). More recently, the binding between 2D square-lattice model proteins and lattice ligands was used to investigate short-term versus long-term evolutionary success.
Interestingly, this study by Feldman and co-workers indicates that although the model evolution process is stochastic, long-term evolutionary success (as determined by the stability of the evolving protein and its binding with a given ligand) is ‘surprisingly predictable’ from the founding sequence of a lineage. In this lattice protein model, long-term survivability is only partially determined by short-term fitness, i.e. short-term adaptive success does not guarantee long-term survival of a lineage [296]. 2D lattice models have been applied to compare the evolutionary effects of point mutations and recombination [152,324,335]. An initial study showed that crossover of two encoding sequences is an effective way of producing a new encoding sequence, suggesting that local sequence patterns are important for determining whether the full protein sequence can fold to a unique structure [324]. This theoretical prediction is consistent with subsequent experiments on β-lactamase indicating that, for a given number of amino acid substitutions, recombined variants are much more likely to retain function than variants generated by random point mutations [339]. In another simulation study, evolutionary dynamics that admits both point mutations and recombination leads to a much higher concentration of population in the prototype sequence than if evolution proceeds via point mutations alone, suggesting a significant role of recombination in the prototype-like behaviours of natural proteins [152]. Lattice models have also been used to investigate how mutagenesis data may be exploited to improve forcefields for protein structure prediction [340,341].
In addition, they have provided insights into the relationship between the native contact pattern of a target structure and its designability [342], the degree to which evolutionary selection at the molecular and/or organism level has led to the observed scale-free distribution of protein structure similarities [307,334], and the biophysical basis [343,344] of methods that use evolutionary/mutational information as a probe to reveal long-range energetic coupling in proteins [26]. An exciting recent development of 3D maximally compact lattice protein models is their application to the study of fitness and population dynamics of model cells containing multiple proteins. One model assumes that all proteins in a model cell are essential and that the fitness of the cell depends on the stability of the least stable proteins it contains. Real-life-like properties such as preferred folds and protein families readily emerge from these simple assumptions [308]. Introduction of various protein–protein interactions to this modelling set-up by Shakhnovich and co-workers [345] has provided further rationalization for the emergence of species-like collections of model cells with very similar sequence make-ups, an increased rate of mutation in stress response [346], as well as a trade-off between strengthening functional interactions and avoidance of misinteractions as observed in experimental proteomic data [347]. More

[Figure 6 panel titles and labels: (a) foldable sequence space; (b) neutral networks; promiscuity and evolvability; stability; sequence space; thermodynamic stability and mutational robustness]

Figure 6. The space of foldable HP sequences gives a glimpse of what real protein sequence space might look like.
(a) A representation of the adjacency network of approximately 25 000 HP sequences of length 18, each with either a unique ground-state structure or up to six different but equally populated ground-state structures. Only a small minority of sequences is not connected to the dominant ‘supernet’. (b) Local sequence neighbourhoods form clusters of sequences that fold to the same ground-state structure. These clusters are called neutral nets or neutral networks, each of which is shown in a different colour here. Some of the neutral nets overlap. (c) Thermodynamic stabilities of the folded ground states of the sequences (stability increases in the downward direction) are plotted using a schematic planar representation of sequence space as in (b). A distinct funnel-like organization for each neutral net is apparent. These sequence-space funnels are referred to as superfunnels [137]. Typically, there is a single prototype sequence per neutral net that is thermodynamically maximally stable and also tolerates the highest number of neutral mutations (i.e. those resulting in a sequence within the neutral net). The ground-state structure in the HP model is shown for each neutral net. Sequences with high stability (at the bottom of the landscape) have a clear tendency to also have more neutral mutations and thus higher mutational robustness. In contrast, promiscuity and multi-functionality are likely encoded in less stable sequences, including sequences with two or more ground-state structures. These sequences are located further away from the prototype sequence but closer to other neutral nets; hence they have a higher evolvability, meaning a higher probability of acquiring new functions through further mutations (see also figure 8). The network layouts here and in subsequent figures were constructed in such a way that the lengths of the edges roughly reflect the Hamming distances between the sequences connected by the edges [239].
Because of the large number of sequences depicted, the network drawings in the present figure convey only a heuristic impression of the sequence connections. More detailed descriptions of some of the neutral nets shown in figures 5–9 are provided in [137,239].

consideration of all model HP sequences that fold uniquely to the native structure in figure 5a [137,204] shows that certain sites are conserved, and that the probability that a mutation is viable is site-dependent (figure 5b), echoing the context-dependent substitution rates of amino acids in real proteins (§1). It is also noteworthy that because the main biophysical driving forces in protein folding are different from those in RNA secondary structure formation, the global organization of neutral nets of globular proteins is probably very different [161,355] from that of RNA [356–359]. It is conventional to associate increasing fitness with upward movement on fitness landscapes [348]. For the model fitness landscape in figure 6c, however, we have chosen to represent increasing fitness with downward movement. As discussed elsewhere, such landscapes may be referred to as ‘inverse-fitness’ or ‘mortality’ landscapes [324]. We prefer such depictions because our inescapable experience with gravity on Earth makes it easier for us to appreciate natural driving forces that point downward than ones that point upward. In this respect, mortality landscapes may offer a better metaphor for the drive by natural selection towards lower mortality and thus higher fitness. In analogy with the most favoured protein conformation having the lowest free energy in the conformational energy landscape, the fittest sequence is seen as situated at the bottom of an attractive basin with lowest mortality in the sequence-space mortality landscape [324].

3.4. Mutational robustness, sequence-space topology and population evolution

Experiments showed that many natural globular proteins are robust to mutations [40,53,55,202,360].
The observation that proteins with diverse primary sequences can be grouped into families with very similar structure and function gives further illustration of this robustness [361]. A folded structure that has a larger neutral net is more designable [160,322]. Compared to a less designable structure, it is expected to be more robust against mutations and thus more likely to be populated by evolution. From a biophysical standpoint, robustness may be viewed as a form of entropy in sequence space [306]. As discussed above, an idea of sequence-space ‘entropy’ (§3.3) is useful for analysing possible origins of marginal stability of natural globular proteins [135] (§2.4). There are at least two biophysical reasons for the observed mutational robustness of natural globular proteins. First, a significant part of the stabilization of specific secondary structures in proteins comes from backbone hydrogen

[Figure 6 panel label: (c). Figure 7 panel titles: (a) no selection for stability; (b) selection for stability]

Figure 7. Interplay of network-topology and fitness effects on evolutionary population. In this figure, the 132 HP model sequences in an extended neutral net are depicted as nodes on a planar representation of sequence space (light blue surface), with each edge denoting a single-point mutation. The sequences either fold to the native structure in figure 5b uniquely or fold to that structure as well as at most five other ground-state structures [137]. Mutations leading to a sequence outside the net are deemed lethal in the present consideration. Here, steady-state distribution of evolutionary populations (i.e. at mutation–selection balance) of an infinite, asexually reproducing sequence population within the neutral net is computed without (a) or with (b) positive selection for native stability.
The distributions of steady-state populations are shown by the histograms in logarithmic scales with the insets showing the same distributions in linear scales. Fractional populations of sequences that fold uniquely and sequences that fold to multiple ground-state structures are shown by vertical bars in lighter and darker colours, respectively. To better highlight the overall trend, different vertical scales are used in (a,b); and only fractional populations more than 0.1% are plotted. (a) With no selection for native stability, all sequences in the net have equal fitness. In this case, the uneven distribution in steady-state populations originates entirely from network topology. Among all 132 sequences, the prototype sequence has the highest steady-state population that encompasses, however, only 3.31% of total population. (b) When fitness is proportional to native stability, the majority of the steady-state population (78.3%) is concentrated at the prototype sequence. The fitness functions used for (a) and (b) here correspond to the single-structure fitness function defined in [239] with u = 0, t = 1 and u = t = 1, respectively.

may work in concert with sequence-space topology to beget a dominant population at the prototype sequence. This example also illustrates how mutational robustness of natural globular proteins may have arisen by selection for native stability without a direct selection on robustness itself [364].

bonding. With proline as the only exception, amino acid substitutions do not change the ability of backbone atoms to form hydrogen bonds. While the nature of the amino acid sidechain has an effect on the preferred backbone dihedral angles and interactions with surrounding atoms, an α-helix or a β-sheet can be formed by many different combinations of amino acids along the primary sequence [362,363].
Second, the biophysics of hydrophobicity results in a rough grouping of a globular protein’s amino acids into non-polar residues in a largely conserved hydrophobic core and polar surface residues. Surface residues with higher solvent accessibility are more tolerant to mutations [22] and thus contribute to mutational robustness because, on an individual basis, they are less crucial for stability (see lattice model example in figure 5b). Mutational robustness is not determined solely by the number of sequences encoding a certain structure or property. Another important factor that contributes to mutational robustness of any given sequence is its connectivity with other viable sequences. The pattern of mutational connections among sequences has also been referred to as sequence-space topology [137,151]. Among sequences with the same fitness, analytical and simulation studies have shown that evolving populations tend to prefer sequences that have more viable mutants. Fundamentally, this relative concentration of steady-state evolutionary population at sequences that are mutationally more stable (i.e. robust) arises from the fact that they lose less population to lethal mutations than sequences that are mutationally less robust [137,151,292]. As long as the evolving population N is sufficiently large and satisfies μN ≫ 1, where μ is the mutation rate [151], the phenomenon is independent of μ and N [137,151]. A lattice-model example that illustrates how sequence-space topology affects evolutionary population under the μN ≫ 1 condition is provided in figure 7. It should be noted, however, that when the μN ≫ 1 condition is not satisfied, the significance of sequence-space topology on evolutionary population diminishes [151,153]. In the limit of μN ≪ 1, evolutionary population dynamics on a neutral net becomes a random walk without regard to the relative abundance of connectivities of different sequences [151].
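The topology effect described above can be sketched numerically for an effectively infinite population with equal fitness on every viable sequence. The five-sequence network, the node names and the function `steady_state` below are hypothetical illustrations, not the networks of figures 6 and 7. The sketch assumes that, in each generation, an individual mutates with probability `mu` to one of `n_mutants` equally likely single-point mutants, and that mutants falling outside the net die. Under these assumptions the update is linear, so the normalized steady state is the leading eigenvector of the network's adjacency matrix and does not depend on `mu`.

```python
def steady_state(neighbours, mu, n_mutants, tol=1e-12, max_iter=200_000):
    """Iterate mutation and renormalization until the population is stationary.

    neighbours: dict mapping each viable sequence to the set of its viable
                single-point mutants (edges must be listed in both directions).
    mu:         per-generation probability that an individual mutates.
    n_mutants:  number of possible single-point mutants per sequence, viable
                or not; mutants outside the net are treated as lethal.
    """
    nodes = sorted(neighbours)
    pop = {s: 1.0 / len(nodes) for s in nodes}
    for _ in range(max_iter):
        new = {}
        for s in nodes:
            stay = (1.0 - mu) * pop[s]
            arrive = (mu / n_mutants) * sum(pop[t] for t in neighbours[s])
            new[s] = stay + arrive
        total = sum(new.values())  # below 1: the remainder was lost off-net
        new = {s: x / total for s, x in new.items()}
        if max(abs(new[s] - pop[s]) for s in nodes) < tol:
            return new
        pop = new
    return pop

# Hypothetical neutral net: "hub" has three viable neighbours, "c" has two,
# and the remaining sequences have one each.
toy_net = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub", "d"},
    "d": {"c"},
}

low = steady_state(toy_net, mu=0.05, n_mutants=6)
high = steady_state(toy_net, mu=0.5, n_mutants=6)
for s in sorted(toy_net):
    print(s, round(low[s], 4), round(high[s], 4))
```

The two mutation rates give the same distribution, with the best-connected sequence carrying the largest share even though all sequences are equally fit.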
Figure 7a demonstrates that the role of fitness in steady-state evolutionary population distribution does not correspond directly to that of energy in equilibrium population distribution. Microstates having the same energy have equal equilibrium populations in a canonical ensemble [353]. In contrast, although all sequences are assumed to be equally fit in figure 7a, their steady-state evolutionary populations are different when μN ≫ 1 because of sequence-space topology, with the prototype sequence that has maximum native stability and is also mutationally most robust achieving the highest population among sequences belonging to this sequence-space superfunnel [137]. However, this relative concentration of population at the prototype sequence is weak [137,152,153,335]. Thus, sequence-space topology alone is probably insufficient to account for the experimentally observed dominance of prototype-like sequences in natural proteins [139–142]. This recognition led to the argument in §2.4.2 that the distribution of evolutionary population in natural proteins has been driven significantly by selection for native thermodynamic and/or kinetic stability. Here, using the same explicit-chain lattice modelling set-up, we show in figure 7b how a selection for stability

[Figure 8 labels: ground-state stability; unique ground state; sequence space; neutral network (two networks); sequences 1, 2, 3 and 4; most stable; stability; conformational space]

Figure 8. Mutational paths can be guided by selection for hidden conformational states: a more detailed view of the sequence-space fitness/mortality landscapes (top) of two of the HP model neutral nets in figure 6, with the conformational energy landscapes (bottom) of four adjacent sequences highlighted to illustrate the concept of excited-state selection [178,203,239]. Idealized folding funnels at the bottom provide a schematic depiction of the relative populations of the blue and red structures.
From left to right: sequence 1 folds predominantly into the blue native structure with only a negligible population (less than 1%) of the red structure. Now, a mutation that does not change the ground-state structure of sequence 1 produces sequence 2, which has an increased population of the red structure, to around 4%. At this population level, the red structure may be able to perform some new selectable function. Such selection can promote further mutations that produce a sequence that populates the blue and red structures with equal probabilities, i.e. a bi-stable protein (sequence 3; see figure 4). One further mutation can then switch the relative stabilities of the red and blue structure in sequences 1 and 2, finally leaving the red structure as the new native state (sequence 4). This theoretical perspective offers a biophysical rationalization for several recent results from directed evolution and NMR experiments [173,202]. Sequences 1–4 in this model have different fitness values if fitness is strictly proportional to ground-state native stability. However, if fitness does not increase above a certain level of native stability, sequences 1–4 can be equally fit or nearly so [239] (figure 9).

advantage of excited-state selection is much more efficient than a process that applies selection pressure only on the dominant function [203]. In this perspective, bi-stable or multi-stable proteins at the periphery of neutral networks (as exemplified by the overlapping regions in the lattice-model example in figure 6b) and proteins that sacrifice stability for functional promiscuity, as is the case in some antibodies [387], should be more evolvable towards new functions underpinned by significantly different native structures than sequences with high mutational robustness and thermodynamic stability. The latter sequences, however, can be more evolvable towards new functions that are still based upon the original structural scaffold [330,338].

3.6.
Adaptive conflicts: evolution under constraints

While selection of latent traits can be an efficient route to new function, such an evolutionary process raises a basic

Mutational robustness is often discussed in conjunction with evolvability, which characterizes the ability of a biological system to evolve new traits [164,365–368]. Robustness and evolvability have been seen as opposing evolutionary forces, with the former impeding and the latter promoting evolutionary innovation. However, a network-based view of mutational robustness and evolvability indicates that they are not mutually exclusive [369,370]. Although mutational robustness implies that the sequence-space distance to any specific alternative phenotype is large (§§2.7.2 and 3.2.3), general evolvability can be enhanced by the mutational robustness afforded by a larger neutral network because different positions on such a network are likely to be mutationally close to many different phenotypes [294,371,372], as has been demonstrated experimentally for the evolution of enzyme functions [55,202,265,338]. These observations led to an understanding that seemingly neutral mutations can dramatically alter the future potential of a protein to evolve towards new functions. The hidden evolutionary potential of a neutral network is embodied in a wide variety of latent phenotypes that were not under selection originally but are present biophysically, because these mutational variations do not affect the main function of the protein. Several studies have shown how the co-option of neutrally evolved properties can allow adaptation towards new functions under the shadow of a dominant existing function. For example, enzyme promiscuity, which refers to low-affinity binding of molecules resembling an enzyme’s main ligand target [373–375], has been demonstrated to be a potent mechanism for adaptation [264].
In the same vein, latent evolvable traits have also been identified in the evolution of steroid receptor specificity [376], allostery [201], gene regulation [377] and metabolic networks [378]. The term ‘exaptation’ (or ‘spandrels’ [138]; §2.4.1) has been coined for such latent traits that arise by chance and may or may not evolve to have a new function [379]. Apart from point mutations, mobile genetic elements are likely to play a crucial role in providing exaptations [380–382]. Each genome appears to constantly produce transcribed and translated ‘proto-genes’ that arise by chance, some of which may be exapted by evolution for a certain function [383]. It follows that a major part of the enhancement of evolvability by mutational robustness is based on the evolutionary potential provided by conformational dynamics at the level of a single sequence when excited-state structures of a protein [173,384–386] (§2.9) with beneficial function come under natural selection. Selection of such a promiscuous function rewards mutations that further stabilize the beneficial excited state. In this scenario, a protein can retain its original ground-state native structure while at the same time stabilizing an excited-state structure incrementally, thus maintaining continuous viability during evolutionary transitions. Eventually, the protein may first become bi-stable then switch to the selected excited-state conformation as its dominant structure (see lattice-model example in figure 8) or switch to the new dominant structure directly, as has been observed in protein design experiments (figure 4). Consistent with experiment [202], lattice-model studies [178,203] indicate that an evolutionary process that takes

[Figure 8 label: least stable]

3.5.
Evolvability, hidden states and promiscuous functions

[Figure 8 label: degenerate ground state]

Evolution of bi-stable and more generally multi-stable proteins is an efficient means to meet new functional needs; but multifunctional proteins often serve as evolutionary intermediates rather than long-term solutions. Explicit-chain biophysical models suggest that multi-stable globular proteins are mutationally much less robust than globular proteins with an essentially unique native structure [161,178,239]. This trend is readily seen in the 2D HP lattice model example in figure 9, which shows that the sequence-space ‘entropy’ of bi-stable sequences (magenta area) is much smaller than that of sequences folding uniquely to either one of the two structures encoded by the bi-stable sequences (blue and red areas). This is a prediction that should be testable experimentally, for example, by using the recently designed bi-stable proteins [65,174,206] (figure 4). In this picture, the short-term advantage of bi-stability/bi-functionality is expected to give way eventually to an alternative sequence-space arrangement that is mutationally more robust, provided that the gene encoding the bi-stable protein is duplicated at some point in the evolutionary process. Subfunctionalization, or functional divergence after gene duplication, is a ubiquitous phenomenon in evolution [389,390]. For example, in an experimental study, the reconstructed common ancestor of the fluorescent proteins in corals that emit either red or green light was found to emit

[Figure 9 stage labels, in extracted order: ΔŴ = 0.398; high population of bi-stable proteins; ΔŴ = 0.09; gene duplications fixed due to dosage effect; steady state: subfunctionalization; ΔŴ = 0.008]

Figure 9. The simulated evolution, with gene duplication, of an essentially infinite HP sequence population under an adaptive conflict of two selection pressures.
Here, four stages of the evolutionary dynamics are shown by representative changes in the distribution of evolutionary population and the change ΔŴ in average population fitness from one stage to the next. The two adjacent neutral nets (blue and red) are the same as those in figure 8. Distributions of population are plotted on logarithmic scales in the same style as figure 7. Initially, before the native structure of the red network is selected, nearly the entire population occupies the most stable HP sequence of the blue network. After selection pressure is imposed simultaneously for both the blue and red structures (figure 8), the red structure becomes a selectable promiscuous function in the model. After a number of generations, bi-stable proteins (magenta) appear as high-fitness evolutionary intermediates that are prone to undergo gene duplication. Duplicated bi-stable sequences that have maximum fitness in the model (because of an assumed beneficial dosage increase) then slowly give way to equally fit subfunctionalized (functionally diverged) gene pairs occupying those regions of the neutral networks with the highest mutational robustness. In other words, this model shows that some subfunctionalization processes can be non-adaptive in that subfunctionalization can be driven by sequence-space topology even when the total fitness of a duplicated pair of bi-stable proteins is the same as that of a pair of prototype sequences each folding uniquely to one or the other selected structures, as in this model. Consistent with this perspective, and as indicated by the last ΔŴ values in this figure, the final-stage optimization of population distribution around the two prototype sequences provides only a very small increase in fitness value. Details of this model can be found in [239].

3.7.
Protein divergence driven by gene duplication and mutational robustness

[Figure 9 stage label: commencement of selection for promiscuous function]

question regarding biophysical limits on the degree of multifunctionality or promiscuity that can be carried within one protein molecule. Multi-functionality also bears the danger of creating an adaptive conflict. Such a conflict can emerge whenever adaptation on the same gene is driven by two or more different or even mutually exclusive functional requirements. In the extreme case of viruses with severely constrained genome sizes, adaptive conflict can arise from overlapping open reading frames encoding different proteins within the same DNA sequence [388]. For a multifunctional protein, adaptive conflict arises when enhancing one subfunction impedes another subfunction. As far as adaptive conflicts in a protein are concerned, if multi-functionality is realized by the presence of different binding interfaces on the protein surface, a small number of such interfaces may coexist, limited by such factors as the protein size and/or surface area and the number of surface hydrophobic residues that can be tolerated without causing misfolding. Binding interfaces can also overlap, using some but not all of the same residues for different ligands [128,129]. In that case, an adaptive conflict may be anticipated since increasing the binding affinity for one interface through mutations may interfere with the binding affinity of the second interface, and vice versa. If multi-functionality is underpinned by bi-stability or multi-stability, i.e. the coexistence of and dynamic inter-conversion between alternative functional conformations (§2.7.2), it is expected biophysically that only a narrow capacity for accommodating several distinct conformations can exist in the lifetime of a globular protein and perhaps a somewhat higher capacity for doing so in disordered proteins.
During evolution, a general mechanism for resolving adaptive conflict is offered by gene duplication and subfunctionalization [389], which we will discuss briefly in §3.7.

3.8. Epistasis and co-evolution of interacting amino acid residues

The phenomenon of epistasis, referring originally to non-additivity of genetic effects caused by gene interactions (e.g. [393]), can also manifest within a protein molecule. Mutational effects on stability or function at different sites of a protein can be non-additive when the sites are energetically coupled [394]. A consequence of this biophysical property is that the overall evolutionary effect of multiple mutations can depend on the order in which the mutations are made. For the same given set of mutations, it may be that one temporal order of mutations is evolutionarily favoured because it entails a monotonic increase in fitness, whereas another order of mutations is disfavoured because it involves an intermediate step that decreases fitness. Several studies have demonstrated this type of epistatic behaviour in proteins and its constraints on evolutionary pathways [42,53,300,395–398]. For instance, experiments on adenylate kinase indicate that a double mutant with higher stability can only be obtained via one mutation path [300]. When the order of the two mutations was changed, the protein could not fold. This type of behaviour is readily observed in simple biophysical models of protein evolution, as is illustrated by the HP model example in figure 5c. An implication of epistasis is that the propensity for a viable mutation at a certain site in a protein structure may not be fixed. Rather it should depend on the preceding mutations at other sites. This phenomenon is illustrated by a recent simulation study of the evolution of purple acid phosphatase by generating random mutations in the structure [58].
Based on stability calculations in the model, the propensities for all amino acids at selected solvent-exposed and buried sites were determined. The results indicated considerable variations of these propensities because stability effects of amino acid substitutions at a given site change with time as the evolutionary process progresses. The simulation showed that an enforced destabilizing mutation (which could arise in real proteins owing to functional constraints such as the need to preserve an active site) can be compensated by subsequent mutations, thus increasing the future viability of the already-mutated residue at that same site and rendering the reverse mutation detrimental and therefore less probable [58]. Although this model probably overestimated native stability as well as stability effects of mutations and underestimated the probability for misfolding because of its simplified treatment of the unfolded state (see discussion above on marginal stability; §2.4.2), it offers an excellent elucidation of the ‘holistic’ nature of intra-protein interactions and the biophysical forces that govern the mutational effect on stability and how it may depend on the temporal order in which the mutations occur [58]. Epistatic effects are common but not universal. Strong epistasis arising from significant evolutionary shifts in the stability effects of mutations as envisioned in [58] may even be rare [57]. In the example of the influenza nucleoprotein, an experimental analysis of mutations in a set of homologues showed that stability effects of mutations with no clear functional benefit are largely conserved across homologues, mostly additive and exhibit no aforementioned [58] strong dependence on temporal order [57]. It has been argued that mutations in viral proteins in general—which probably have evolved to buffer deleterious mutations—are not likely to exhibit strong epistasis [399]. 
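The order dependence of mutational effects discussed in this section can be illustrated with a minimal two-site sketch. The fitness values, the mutation labels `A` and `B` and both function names are hypothetical and are not taken from the adenylate kinase, phosphatase or nucleoprotein studies cited here. The code enumerates every order in which a set of mutations can be introduced, keeps the orders along which fitness never decreases, and reports the pairwise non-additivity.

```python
from itertools import permutations

def accessible_orders(fitness, mutations):
    """Return the mutation orders along which fitness never decreases."""
    orders = []
    for order in permutations(mutations):
        genotype = frozenset()          # start from the unmutated protein
        viable = True
        for m in order:
            mutated = genotype | {m}
            if fitness[mutated] < fitness[genotype]:
                viable = False          # this step lowers fitness
                break
            genotype = mutated
        if viable:
            orders.append(order)
    return orders

def pairwise_epistasis(fitness, a, b):
    """Deviation of the double mutant from additivity of the single mutants."""
    f = lambda *ms: fitness[frozenset(ms)]
    return f(a, b) - f(a) - f(b) + f()

# Hypothetical fitness (e.g. native stability in arbitrary units) of the
# starting protein, the two single mutants and the double mutant.
fitness = {
    frozenset(): 1.0,
    frozenset({"A"}): 1.2,   # mildly beneficial on its own
    frozenset({"B"}): 0.7,   # deleterious on its own
    frozenset({"A", "B"}): 1.5,
}

print(accessible_orders(fitness, ("A", "B")))
print(round(pairwise_epistasis(fitness, "A", "B"), 3))
```

In this toy case only the order A then B is accessible: A acts as an ‘enabling’ mutation, whereas introducing B first passes through a less fit intermediate.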
Whenever the functional benefits outweigh the cost of destabilization of a mutation, strong epistatic effects are more likely to follow [21]. Nonetheless, a weaker form of epistasis can occur even if stability effects of mutations are conserved because different temporal orders of stability changes can result in drastically different survivabilities. For instance, a recent experimental study of the 39 mutations on the nucleoprotein of the influenza virus between years 1968 and 2007 identified several mutations that decrease the stability of the protein significantly when introduced individually to the starting 1968 protein, thus suggesting strongly that these mutations were preceded by ‘enabling’ mutations that increase native stability. An inferred evolutionary trajectory was constructed based on the stability constraints [42]. Epistasis has also been revealed by studying disease-causing single mutations in humans and comparing them with compensated mutations that do not cause disease in other species [400–402]. One estimate indicates that 80% of pathological mutations result in protein stability changes [403]. When the compensated pathological mutations are compared against uncompensated pathological mutations, compensated mutations are mostly found at solvent-exposed positions and the amino acid substitutions are ‘milder’,

both red and green light [391]. Subfunctionalization of a duplicated multi-functional gene is probably a more efficient evolutionary route than neofunctionalization, which necessitates evolution of new function in a duplicated gene from scratch. However, in the subfunctionalization route, this process can be facilitated by selection on latent traits (§3.6) before gene duplication [239]. In the lexicon of fitness landscape, gene duplication amounts to doubling the number of dimensions of sequence space and thus may be viewed as an ‘extradimensional bypass mechanism’ for resolving adaptive conflicts [392].
Functional or structural divergence can be driven by an increase in fitness when a pair of identical bi-stable sequences is transformed into two subfunctionalized sequences. Generally speaking, such an increase in fitness is biophysically plausible because each of the subfunctionalized sequences may afford a higher kinetic stability [81] to one or the other functional structure than that provided by a bi-stable sequence [239]. However, divergence does not always have to be adaptive. Even if the fitness of a subfunctionalized pair is identical to the fitness of a pair of bi-stable sequences, the inherent tendency of protein evolution towards higher mutational robustness can still drive subfunctionalization. Biophysically, an ensemble of protein structures subjected to more overlapping functional constraints is likely to be more restricted in sequence space, resulting in low mutational robustness, as is exemplified in figure 9. When the constraints are lifted through gene duplication or changes in the environmental selection pressures, evolution will naturally favour mutations that result in higher mutational robustness even if there is no gain in functional fitness for the proteins in the process. This scenario is illustrated by the model evolutionary dynamics in figure 9. Simulation results summarized in this figure indicate that after an adaptive pressure to simultaneously select two structures is imposed, the evolving proteins first attempt a short-term resolution of the adaptive conflict using bi-stable sequences with low mutational robustness. Subsequently, upon gene duplication, a process of divergence that is essentially neutral ensues, with each copy of the originally bi-stable protein evolving towards the central, high-robustness region of one of the two neighbouring neutral networks [239] (figure 9).

3.9.
Fitness landscapes for multiple phenotypic properties A fundamental evolutionary question is why different proteins evolve at different rates. What makes some proteins less likely to accept new mutations than others? Can the different evolutionary rates be explained in terms of the biophysical constraints on mutations as outlined in figure 3? One hypothesis is that proteins that are functionally more important are more conserved, because the cell cannot risk their function to be compromised in any way, even slightly. Some authors have linked the evolutionary rate to the position of a protein in the protein–protein interaction network, finding that ‘hub’ proteins involved in many interactions evolve more slowly [406]. Empirically, evolutionary rate was found to be most strongly anticorrelated with the expression level [407]. This trend is not inconsistent with the functional argument. A protein is essential to an organism if the organism fails to survive when the gene encoding for the protein is deleted from its genome [408,409]. Many essential proteins are highly expressed, as the cell needs a constant supply for its most basic and vital functions. However, is the slower evolutionary rate of highly expressed proteins a result of the importance of their functions or a more direct consequence of their high concentrations in the cell? Several biophysical mechanisms have been proposed for the latter scenario. Here we summarize briefly two mechanisms that are based, respectively, on protein misfolding and protein misinteraction, noting however that an explanation in terms of mRNA folding rate has also been put forth recently [410]. Multiple mechanisms can be at play because the proposed mechanisms are not mutually exclusive. All proteins have to avoid misfolding. 
Taking the population of a protein sequence as a whole, a protein that is abundantly populated provides more opportunity for the formation of detrimental misfolded structures than a protein that is sparsely populated; thus the constraint imposed by misfolding avoidance is stronger on protein sequences with higher populations. A similar consideration applies to the misinteractions, which will be discussed further below. Accordingly, the need to prevent or at least minimize misfolding caused by translation error has been proposed as a major constraint on the evolution of highly expressed proteins [411–414], leading to slower evolution. Consistent with this picture, highly expressed proteins are selected to be more robust against translation errors by using synonymous codons with the smallest chance of producing nonsynonymous changes as a result of translation errors. Apart from translation errors, the need to avoid misfolding of the properly translated protein also constitutes a strong evolutionary constraint on highly expressed proteins, resulting in preferential usage of amino acid residues that minimize misfolding [413]. These restrictive requirements lead to proteins that are both slowly evolving and thermodynamically more stable [414,415]. Another probable biophysical constraint behind the anticorrelation between protein expression level and evolutionary rate is the need for a protein to avoid misinteraction with other proteins [99]. This selection pressure affects primarily surface residues that can potentially participate in interactions between the protein and other molecules. Thus, its effects are to some degree complementary to those arising from the need for folding stability, kinetic accessibility of the folded structure

Natural protein evolution takes place in a highly wired, interacting molecular system.
Ultimately, therefore, studies of protein evolution have to take into account a complex molecular context [392] (see §3.11). The expanded concept of molecular phenotype discussed in §2.9 is an attempt towards a better account of this biophysical reality. In this regard, highly simplified yet promising explicit-chain biophysical models have recently been developed to study protein evolution in the context of protein– protein interactions [99,347] (§3.2.3) We have also summarized explicit-chain biophysical models that take into consideration the ensemble nature of a protein’s conformational phenotype, and how these models can provide insights into selection of promiscuous function, bi-stability and structural divergence (§3.2). With the understanding that the fitness of a protein sequence depends not only on its ground-state native structure, but also on its entire conformational distribution as well as potential functional interactions and detrimental misinteractions with other molecules, a critical issue in modelling is how to assess contributions from different phenotypic properties to the overall fitness. For example, in the bi-stable fitness landscape in figures 8 and 9, the combined fitness is taken as the sum of two fitness values, one for each structure [178,239]. Although this modelling scheme is useful for illustrating general principles, it would be too simplistic when applied to real-life situations. Different forms of multifunctionality may require different rules for combining fitness contributions. Ideally, a fitness function should include not only positive contributions from selected biophysical properties, but also account for the negative effects on folding rates and misfolding, as well as aggregation and misinteraction. Thus, protein evolution in general entails a multi-factorial optimization problem where only something like a Pareto optimality can be achieved, i.e. 
a satisfaction of multiple optimality criteria just above a minimum threshold of optimality for each criterion [405]. One simple example would be the trade-off between thermodynamic stability of a folded protein and the need for conformational dynamics in biochemical functions and degradation. Both these criteria probably cannot be fully satisfied, but a Pareto optimality may be achieved such that the protein is stable enough to maintain the same fold yet flexible enough to allow binding. Constructing more realistic fitness functions will be a challenging task. Genomic information is abundant; but pinpointing mutational impact on cellular function by experiment is often daunting. Theoretical/computational investigations can assist greatly in this endeavour by developing more comprehensive models that account for various biophysical constraints on protein evolution. Two recent examples will be discussed in §§3.10 and 3.11 to illustrate how incorporating information about protein–protein interactions into biophysical models can advance understanding of experimentally observed evolutionary patterns.

3.10. Biophysical links between protein expression level and evolutionary rate

entailing, for example, smaller changes in hydrophobicity [400–402]. These experimental trends are consistent with the biophysical principles of protein structure and stability expounded here. A more detailed discussion of the biophysical/structural basis of epistasis and compensatory evolution is available in a recent review [404].

As emphasized above, proteins do not act in isolation in living organisms; hence a full understanding of the function and evolution of a protein should take into account its interactions with other biomolecules and metabolites [392,416]. It would be daunting to account for these interactions in all their complexity at the molecular level.
Using simple explicit-chain protein models, conceptual advances were made in elucidating how the biophysical constraint of misinteraction avoidance might impact protein evolution [99,347] (§3.10). However, investigators have to rely upon abstract descriptions of protein interactions, using model parameters extracted from experimental data on binding and on the effects of enzymatic activities on biochemical reaction rates. With an increasing repertoire of genomic data, this approach has produced significant progress. For instance, the recent mapping of a realistic network of DNA sequences bound by the same transcription factor [417] has afforded new evidence in support of the idea that a large genotype network enhances both mutational robustness and evolvability (§3.5). Important advances have also been made by taking an abstract approach in the study of metabolic networks [418]. Notably, a reductive evolution algorithm was applied to determine minimal viable genomes for E. coli [419]. In principle, the effect of a mutation on metabolism is difficult to predict, because it affects not only the activity of the mutated protein but also many downstream events. Yet metabolic networks are often found to be robust against perturbations such as gene deletions and loss-of-function mutations because of ‘distributed robustness’, i.e. an ability of the network to compensate for the local perturbation by systemic adjustments [420]. In silico metabolic networks have also shed light on the evolution of specialist versus generalist enzymes. By analysing a model network of E. coli metabolism, it was found that specialists are very efficient at catalysing single metabolic reaction steps, responsible for a high metabolic throughput, and often essential to the cell. These functional roles necessitate more regulation of its activity. Consequently, specialists require a much higher degree of maintenance than generalists that are promiscuous and multi-functional. This model study 4. 
Outlook: enriching the biophysics of protein evolution

As this review emphasizes, evolution is ultimately a physicochemical process that cannot be fully comprehended without biophysics. Likewise, because evolution happened under biophysical constraints, evolutionary information can help decipher aspects of protein biophysics that are still too complex or too costly to be tackled by first-principles physical or chemical methods. In closing, we provide a few further examples to showcase the productive research directions in which future progress will probably be made by this synergistic approach.

4.1. Synergy between biophysics and the study of protein evolution

Perhaps the most direct way to access the change in biophysical properties during the long evolutionary history of natural proteins is to perform experiments on ‘resurrected’ ancestral proteins. Recently, several putative ancestral protein sequences have been constructed computationally using common phylogenetic methods and then synthesized in the laboratory [82,391,424–428]. We have mentioned thioredoxin as one of the proteins that were studied in this manner in the discussion on kinetic stability (§2.2). Another interesting case is the reconstruction of the evolution of steroid receptors [376,429], revealing that ancient steroid receptors were already able to bind to the hormone aldosterone. But aldosterone only became available to

3.11. Protein evolution in the context of functional networks

thus offered an explanation for why specialists have not replaced all the generalists in real organisms [421]. A more recent computational study used flux balance analysis [422] and random re-wiring of a realistic model metabolic network [423] to study evolution of a model cell under a selection pressure to survive on a given carbon source [378].
The simulation showed that selection for one carbon source also allows the model cell to survive on a number of other carbon sources that were not selected for. This finding demonstrated that metabolic systems embody latent evolutionary potentials, and that beneficial traits can arise non-adaptively through exaptations [379] in the absence of selection at the level of metabolic network [378], as is the case we have seen at the molecular level (§3.5). Although it is currently not feasible to apply explicit-chain models of proteins in the simulation of a realistic cellular metabolic network, a recent evolutionary population dynamics study was able to incorporate energetic information of explicit-chain continuum (off-lattice) models for 10 proteins from the folate biosynthetic pathway [59]. The study considers a population of 1000 model cells. The fitness of each model cell is taken to be the total metabolic output of the model biosynthetic pathway minus the number of misfolded proteins, with both of these quantities dependent upon the thermodynamic folded-state stabilities of the proteins. Protein stabilities are in turn computed using a biophysical potential function. Simulation results from this model provide a protein-based molecular biophysical rationalization for the distribution of stabilizing and destabilizing mutations and other experimentally observed patterns of polymorphisms [59].

and avoidance of misfolding. The latter selection pressures affect primarily buried residues but can also affect surface residues. Misinteractions may be caused by the same exposed hydrophobic surface residues that are part of the functional protein–protein interactions, leading to an adaptive conflict between increasing the strength of functional interactions and avoiding misinteractions [347]. This conflict can result in further constraints that limit the viability of mutants of highly expressed proteins.
As a consequence of these biophysical constraints, evolution might have increased the proportion of functional monomeric proteins with hydrophilic surfaces, reduced the abundance of functional multi-chain complexes, weakened the strengths of functional interactions, or increased the degree of disordered protein interactions to minimize exposed hydrophobics while still allowing many interaction partners [230,347]. In other words, such strategies might have contributed to the evolution of interaction network topologies that can better alleviate the conflict between functional interactions and misinteractions [110].

under investigation (figure 1). This enriched methodology provides more accurate evolutionary inference than approaches that do not consider the conformational context of the mutation sites, but it requires the 3D structure of the protein in question. Hopefully, with better protein structure prediction techniques [437,438], it may be possible to apply structure-based phylogenetic reconstruction methods routinely by starting with sequence information alone.

The advances summarized above exemplify how biophysics can assist in the study of evolution. In the following, we describe briefly three examples in which evolutionary data have assisted in biophysical studies of proteins. The first example concerns prediction of mutational change in protein stability, ΔΔG. As described in §2.1, a number of biophysical methods for ΔΔG prediction exist but are limited in various respects. In this context, it has been shown recently that a Bayesian method for inferring ΔΔG values of individual mutations from the evolutionary information embodied in homologous sequences can achieve accuracy exceeding pure biophysical methods and sequence-based consensus approaches. The method was applied to predict stabilizing mutations for influenza haemagglutinin.
Subsequent experiment demonstrated that some of the mutations do allow a temperature-sensitive virus to grow at a higher temperature, attesting to the utility of this evolution-based method in improving biophysical understanding [439]. Another example that we have mentioned briefly is the detection of protein sectors from coevolution data (§1). Evolutionarily, protein sectors are largely independent of one another even though they are parts of the same protein. Amino acid residues within a sector are physically connected in the folded structure and are correlated evolutionarily [27] (figure 2). Sectors constitute sparse networks of co-evolving amino acid residues comprising only a minority of the residues in a protein. A recent high-throughput saturation point mutagenesis study of a PDZ domain (1577 mutations were tested) showed that sector positions are functionally less tolerant to mutation than non-sector positions [440]. These observations suggest that coevolution data can be used in general to gain insight into the biophysics of functional binding. Coevolution data have also been applied to predict biophysical interactions in proteins, as mentioned briefly in §1 [28 –32]. A computational algorithm has recently been developed to use pure sequence information to predict contacts within a protein structural domain. The approach is useful for predicting the native structure when sequence data are abundant but a structure has not been determined experimentally for a protein [31]. Even more interestingly, and going beyond earlier seminal findings [26], it was found that sequence information can reveal residue interactions that are not present in the PDB structure, including interactions between structural domains [31] as well as interactions involved in alternative conformational states with evolutionarily conserved functional significance [29]. 
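A much simpler statistic than the methods cited above conveys the basic idea of reading residue couplings from an alignment: the mutual information between two alignment columns. The sketch below is an illustrative assumption and is not the algorithm of [31]. The toy alignment and the function names are hypothetical, and mutual information does not separate direct from indirect correlations, which is one reason why more elaborate approaches are used in practice.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two aligned columns of equal length."""
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


def rank_column_pairs(alignment):
    """Score every pair of columns and sort from most to least co-varying."""
    columns = list(zip(*alignment))
    scores = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            scores.append(((i, j), mutual_information(columns[i], columns[j])))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical four-sequence alignment: columns 0 and 1 change together,
# column 2 is invariant, column 3 varies independently of column 0.
toy_alignment = ["AKLE", "AKLD", "GRLE", "GRLD"]
ranked = rank_column_pairs(toy_alignment)
print(ranked[0])  # ((0, 1), 1.0)
```

In this toy case the perfectly co-varying pair of columns receives one bit of mutual information and every other pair receives zero. Real alignments need corrections for phylogenetic relatedness and finite sampling that are omitted here.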
Most recently, coevolutionary information of several protein families has been applied to determine a theoretical sequence-space

4.2. Evolutionary protein biophysics: evolutionary information benefiting biophysical studies of proteins

mammalian cells much later during evolution, indicating once again that latent functions, or exaptations (§3.5), can play important roles in protein evolution [376]. The study of these putative ancestral proteins indicates further that epistatic interactions within the structure of hormone receptors have led to surprisingly irreversible evolutionary pathways [430]. The utility of an ancestral reconstruction, however, is only as good as the accuracy of the phylogenetic relationships it assumes. As mentioned earlier, mutation rates depend not only on amino acid residue type, but also on the conformational context of the site being mutated; but many common phylogenetic methods do not account for this dependence (§1). One of the structural properties that exhibit significant correlation with mutation rate is conformational diversity. The conformational diversity exhibited by a single protein sequence is also reflected in the conformational diversity among the sequences in the family to which the protein belongs [431]. Local packing density of an amino acid position, which correlates negatively with local backbone conformational diversity (flexibility), was also found to correlate negatively with evolution rate. In other words, amino acid positions that have a lower local packing density and are more flexible locally tend to evolve faster. The correlation of evolutionary rate with local packing density is even stronger than that with solvent accessibility [432].
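To make the notion of local packing density more tangible, the sketch below computes a weighted contact number for each residue, namely the sum of inverse squared distances to all other residues. This is one commonly used proxy and is offered here as an assumption; the text above does not specify which packing measure was used in [432]. The coordinates and the function name are hypothetical.

```python
import math


def weighted_contact_numbers(coords):
    """For each point, sum 1/r^2 over all other points (a packing-density proxy)."""
    wcn = []
    for i, a in enumerate(coords):
        total = 0.0
        for j, b in enumerate(coords):
            if i != j:
                total += 1.0 / math.dist(a, b) ** 2
        wcn.append(total)
    return wcn


# Hypothetical coordinates of three residues (e.g. C-alpha atoms) on a line.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(weighted_contact_numbers(toy_coords))  # [1.25, 2.0, 1.25]
```

The central residue has the highest value because it has two close neighbours, so under the trend described above it would be expected to evolve more slowly than the two terminal ones.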
More recently, it has been suggested further that this strong correlation is a reflection of a fundamental relationship between evolutionary rate and the energetic stress caused by random mutations since average mutational stress is proportional to local packing density in an idealized elastic network model of protein structure [433]. This general trend is also consistent with the observation that IDRs typically evolve faster than globular proteins except when the IDR site is involved in the binding of multiple partners (§2.8.2). Taken together, the findings summarized above contradict another group’s earlier finding that evolutionary rate is negatively correlated with conformational diversity [434,435]. The correlation coefficients computed from the limited dataset considered in the earlier study were, however, all very small [434]. As mentioned in §1, a recent method makes use of the information about solvent exposure of residues in a known protein structure to identify sites in a protein-coding gene that have undergone positive or negative evolutionary selection [23,24]. Although synonymous mutations are not necessarily neutral [95–97,436], the ratio of non-synonymous over synonymous codon replacement rates, ω = dN/dS, is commonly used to detect adaptation in a given phylogenetic tree of related gene sequences. In such studies, most mutations do not show any signs of adaptation, with ω ≈ 1, and are thus considered ‘neutral’; ω > 1 is usually taken to indicate positive selection but negative selection is difficult to decipher from ω values. The predictive power of this method hinges on an accurate discrimination between neutral, adaptive and conservative ω signatures. Recent studies have shown that by considering solvent-exposed areas of mutation sites, this discrimination can be improved in a protein-specific manner.
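As a minimal illustration of how ω values are read, the following sketch computes ω from hypothetical rates and applies the conventional thresholds. The function names, the tolerance band around 1 and the example rates are assumptions made here for illustration. Real analyses estimate dN and dS with codon substitution models on a phylogeny rather than by direct division, and the solvent-exposure refinement described above is not represented.

```python
def omega(dn, ds):
    """Ratio of non-synonymous to synonymous replacement rates."""
    if ds <= 0:
        raise ValueError("dS must be positive for the ratio to be defined")
    return dn / ds


def classify_omega(w, tolerance=0.1):
    """Label an omega value; the tolerance band is an arbitrary illustrative choice."""
    if w > 1 + tolerance:
        return "positive"
    if w < 1 - tolerance:
        # Conventionally read as purifying selection, although the text notes
        # that negative selection is hard to decipher from omega alone.
        return "below 1"
    return "neutral"


# Hypothetical rates for three sites.
example_sites = {"site_1": (0.30, 0.10), "site_2": (0.21, 0.20), "site_3": (0.02, 0.10)}
labels = {name: classify_omega(omega(dn, ds)) for name, (dn, ds) in example_sites.items()}
print(labels)
```

A protein-specific version of this idea would replace the single fixed band with expectations that depend on the structural context of each site.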
In this new approach, instead of using the same underlying assumptions for every protein family, as has been commonly practiced, biophysical consideration of solvent exposure is used to customize and improve the accuracy of the model for the specific protein

4.3. Role of theory and computation

Theoretical/computational methods are an integral part of the biophysical study of protein evolution (§§3.2–3.4). Quantitative biophysical modelling is indispensable in the formulation of concepts, rationalization of existing experimental data, discovery of novel hypotheses and provision of predictions for subsequent experimental testing. Recent models not only addressed evolution of individual proteins but have also begun to take into account the interaction and metabolic networks in model organisms (§3.11). More complex models tend to be richer in that they have the capacity to provide non-trivial predictions that are not immediately obvious from the modelling set-up. Nonetheless, even simple explicit-chain models that serve largely to confirm expected trends are useful for conceptualizing how known evolutionary behaviours might have arisen from the physical forces that govern protein properties. Despite improved model sophistication and the tremendous recent increase in computational power, models of protein evolution have to rely on representations that are coarse-grained and thus the models are inevitably highly simplified caricatures of the complex real situation they seek to mimic. In general, model predictions are sensitive to the assumed parameter values; but a precise correspondence between these values and physical reality often cannot be readily established. With these limitations in mind, theoreticians should strive to perform more exploration, as controls, of multiple parameter sets and alternative modelling set-ups that are biophysically plausible. Enhanced efforts in scenario classification are needed in general to better delineate the logical relationship between the assumptions and predictions of any given model.

5. Concluding remarks

The aim of this review is to provide a broad sketch of the fundamental biophysical forces that both enable and constrain protein evolution. Starting with the effects of mutations on protein stability, folding kinetics, interactions, functional dynamics, promiscuous functions, conformational switch and conformational disorder, these findings are then linked to broader evolutionary themes including the global and local organization of protein sequence and structure space, simple models of the protein sequence–structure mapping, fitness/mortality landscapes, sequence-space topology and mutational robustness, adaptive conflict and its possible resolution by selection of promiscuous function and subfunctionalization driven by mutational robustness, evolvability, epistasis and intra-cellular networks. We have highlighted advances made through computational models, especially simple exact and other explicit-chain models of protein evolution, because many insightful discoveries in biophysics of protein evolution were pioneered through simple, coarse-grained modelling of biological or biophysical processes that are too complex to be studied in atomistic details. As far as simplified models are concerned, explicit-chain models with biophysics-based interactions enjoy a clear physical advantage over theories that contain little or no biophysical consideration of protein structure and dynamics. We have also summarized several recent experimental advances that bear on the biophysics of evolution, as many questions that have arisen from theory and simulation can only be answered definitively by further experiments. Even so, this review touches upon only a small fraction of all the exciting discoveries that have been made lately.
Looking into the future, we expect to witness increasing collaboration between the fields of biophysics and evolution as well as between theory/computation and experiment to decipher many aspects of the evolutionary forces that have been shaping the biological roles of proteins.

4.4. Evolution within and across protein families and superfamilies

As outlined above, much progress has been made in experimental and theoretical studies of evolution within a protein family or superfamily. Compared to mutational changes that convert one protein fold to another, mutational changes that maintain essentially the same folded structure are computationally less costly to simulate; their study is more amenable to experimental techniques such as directed evolution, and can also benefit from the availability of abundant genomic data. It is more challenging to study fold-altering protein evolution. Nonetheless, notable recent experimental advances have been made in the design and structural characterization of bi-stable proteins and conformational switches (§2.7; figure 4). Understanding structural innovation entails biophysical accounting of conformational diversity (which underpins functional promiscuity) as well as the plasticity, or ensemble nature of molecular phenotype (§2.9). Here, we have placed considerable emphasis on this broader perspective of protein evolution, although theoretical investigation in this area is only in its infancy (§§2.7 and 3.5–3.7; figures 4 and 8). We hope to witness more advances in this direction. It is exciting to better understand not only how proteins evolved within a family or a superfamily but also, more fundamentally, how the structural families originated in the first place.

‘selection temperature’ Tsel that can be related to the Tf/Tg ratio between the folding and glass-transition temperatures in protein folding [441]. Because Tf/Tg is a biophysical measure of folding cooperativity [36,254], this latest analysis [441] demonstrates remarkably how fundamental biophysical principles can be revealed by evolutionary data. All in all, the examples in this section illustrate a general productive research approach of evolutionary protein biophysics, in which the current deluge of sequence-based evolutionary data is harnessed to extract important biophysical/structural information to improve understanding of protein function.

Acknowledgements. We thank Jesse Bloom, Xavier de la Cruz, Julie Forman-Kay, Alessandro Laio, Austin Meyer, Marc Ostermeier, Jose Sanchez-Ruiz, Andreas Wagner and Claus Wilke for helpful discussions. H.S.C. wishes to take this opportunity to thank Erich Bornberg-Bauer specially for a fruitful and pleasurable collaboration on evolutionary studies over many years. Part of this work was presented at the 2014 Meeting of the Society for Molecular Biology and Evolution (San Juan, Puerto Rico) by T.S., who gratefully acknowledges a travel award he received from the Canadian Institutes of Health Research (CIHR) Training Program in ‘Protein Folding and Interaction Dynamics: Principles and Diseases’ at the University of Toronto.

Funding statement. This work was supported by a CIHR grant to H.S.C. and the computational resource provided by SciNet of Compute Canada.

References

16. Harms MJ, Thornton JW. 2013 Evolutionary biochemistry: revealing the historical and physical causes of protein properties. Nat. Rev. Genet. 14, 559–571. (doi:10.1038/nrg3540)
17. Bordner AJ, Mittelmann HD. 2014 A new formulation of protein evolutionary models that account for structural constraints. Mol. Biol. Evol. 31, 736–749. (doi:10.1093/molbev/mst240)
18. Rodrigue N, Philippe H. 2010 Mechanistic revisions of phenomenological modeling strategies in molecular evolution. Trends Genet. 26, 248–252.
(doi:10.1016/j.tig.2010.04.001) 19. Kleinman CL, Rodrigue N, Lartillot N, Philippe H. 2010 Statistical potentials for improved structurally constrained evolutionary models. Mol. Biol. Evol. 27, 1546 –1560. (doi:10.1093/molbev/msq047) 20. Le SQ, Gascuel O. 2010 Accounting for solvent accessibility and secondary structure in protein phylogenetics is clearly beneficial. Syst. Biol. 59, 277 –287. (doi:10.1093/sysbio/syq002) 21. DePristo MA, Weinreich DM, Hartl DL. 2005 Missense meanderings in sequence space: a biophysical view of protein evolution. Nat. Rev. Genet. 6, 678 –687. (doi:10.1038/nrg1672) 22. Franzosa EA, Xia Y. 2009 Structural determinants of protein evolution are context-sensitive at the residue level. Mol. Biol. Evol. 26, 2387–2395. (doi:10.1093/molbev/msp146) 23. Scherrer MP, Meyer AG, Wilke CO. 2012 Modeling coding-sequence evolution within the context of residue solvent accessibility. BMC Evol. Biol. 12, 179. (doi:10.1186/1471-2148-12-179) 24. Meyer AG, Wilke CO. 2013 Integrating sequence variation and protein structure to identify sites under selection. Mol. Biol. Evol. 30, 36 –44. (doi:10. 1093/molbev/mss217) 25. Stevens J, Corper AL, Basler CF, Taubenberger JK, Palese P, Wilson IA. 2004 Structure of the uncleaved human H1 hemagglutinin from the extinct 1918 influenza virus. Science 303, 1866–1870. (doi:10. 1126/science.1093373) 26. Lockless SW, Ranganathan R. 1999 Evolutionarily conserved pathways of energetic connectivity in protein families. Science 286, 295 –299. (doi:10. 1126/science.286.5438.295) 27. Halabi N, Rivoire O, Leibler S, Ranganathan R. 2009 Protein sectors: evolutionary units of threedimensional structure. Cell 138, 774–786. (doi:10. 1016/j.cell.2009.07.038) 28. Morcos F et al. 2010 Modeling conformational ensembles of slow functional motions in Pin1-WW. PLoS Comput. Biol. 6, e1001015. (doi:10.1371/ journal.pcbi.1001015) 29. Morcos F, Jana B, Hwa T, Onuchic JN. 
2013 Coevolutionary signals across protein lineages help capture multiple protein conformations. Proc. Natl Acad. Sci. USA 110, 205 33 –205 38. (doi:10.1073/ pnas.1315625110) 30. Schug A, Weigt M, Onuchic JN, Hwa T, Szurmant H. 2009 High-resolution protein complexes from 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. integrating genomic information with molecular simulation. Proc. Natl Acad. Sci. USA 106, 22 124– 22 129. (doi:10.1073/pnas.0912100106) Morcos F et al. 2011 Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl Acad. Sci. USA 108, E1293– E1301. (doi:10.1073/pnas. 1111471108) Dago AE, Schug A, Procaccini A, Hoch JA, Weigt M, Szurmant H. 2012 Structural basis of histidine kinase autophosphorylation deduced by integrating genomics, molecular dynamics, and mutagenesis. Proc. Natl Acad. Sci. USA 109, E1733– E1742. (doi:10.1073/pnas.1201301109) Leopold PE, Montal M, Onuchic JN. 1992 Protein folding funnels: a kinetic approach to the sequencestructure relationship. Proc. Natl Acad. Sci. USA 89, 8721– 8725. (doi:10.1073/pnas.89.18.8721) Dill KA, Chan HS. 1997 From Levinthal to pathways to funnels. Nat. Struct. Biol. 4, 10– 19. (doi:10. 1038/nsb0197-10) Onuchic JN, Wolynes PG. 2004 Theory of protein folding. Curr. Opin. Struct. Biol. 14, 70 –75. (doi:10. 1016/j.sbi.2004.01.009) Chan HS, Zhang Z, Wallin S, Liu Z. 2011 Cooperativity, local-nonlocal coupling, and nonnative interactions: principles of protein folding from coarse-grained models. Annu. Rev. Phys. Chem. 62, 301 –326. (doi:10.1146/annurev-physchem032210-103405) Koshland DE. 1958 Application of a theory of enzyme specificity to protein synthesis. Proc. Natl Acad. Sci. USA 44, 98–104. (doi:10.1073/pnas.44.2.98) Csermely P, Palotai R, Nussinov R. 2010 Induced fit, conformational selection and independent dynamic segments: an extended view of binding events. Trends Biochem. Sci. 35, 539–546. (doi:10.1016/j. 
tibs.2010.04.009) Bellotti V, Stoppini M, Mangione PP, Fornasieri A, Min L, Merlini G, Ferri G. 1996 Structural and functional characterization of three human immunoglobulin kappa light chains with different pathological implications. Biochim. Biophys. Acta 1317, 161–167. (doi:10.1016/S0925-4439(96)00049-X) Bloom JD, Silberg JJ, Wilke CO, Drummond DA, Adami C, Arnold FH. 2005 Thermodynamic prediction of protein neutrality. Proc. Natl Acad. Sci. USA 102, 606– 611. (doi:10.1073/pnas. 0406744102) Mayer S, Rüdiger S, Ang HC, Joerger AC, Fersht AR. 2007 Correlation of levels of folded recombinant p53 in Escherichia coli with thermodynamic stability in vitro. J. Mol. Biol. 372, 268–276. (doi:10.1016/j. jmb.2007.06.044) Gong LI, Suchard MA, Bloom JD. 2013 Stabilitymediated epistasis constrains the evolution of an influenza protein. eLife 2, e00631. (doi:10.7554/ eLife.00631) Kauzmann W. 1959 Some factors in the interpretation of protein denaturation. Adv. Protein J. R. Soc. Interface 11: 20140419 4. Dunham I et al. 2012 An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57 –74. (doi:10.1038/nature11247) Vogel C, Bashton M, Kerrison ND, Chothia C, Teichmann SA. 2004 Structure, function and evolution of multidomain proteins. Curr. Opin. Struct. Biol. 14, 208–216. (doi:10.1016/j.sbi.2004.03.011) Moore AD, Björklund AK, Ekman D, Bornberg-Bauer E, Elofsson A. 2008 Arrangements in the modular evolution of proteins. Trends Biochem. Sci. 33, 444–451. (doi:10.1016/j.tibs.2008.05.008) Bornberg-Bauer E, Albà MM. 2013 Dynamics and adaptive benefits of modular protein evolution. Curr. Opin. Struct. Biol. 23, 459–466. (doi:10.1016/ j.sbi.2013.02.012) Söding J, Lupas AN. 2003 More than the sum of their parts: on the evolution of proteins from peptides. BioEssays 25, 837–846. (doi:10.1002/ bies.10321) Höcker B, Claren J, Sterner R. 2004 Mimicking enzyme evolution by generating new (ba)8-barrels from (ba)4-half-barrels. Proc. Natl Acad. Sci. 
USA 101, 16 448–16 453. (doi:10.1073/pnas.0405832101) Carbone MN, Arnold FH. 2007 Engineering by homologous recombination: exploring sequence and function within a conserved fold. Curr. Opin. Struct. Biol. 17, 454–459. (doi:10.1016/j.sbi.2007.08.005) Höcker B. 2013 Engineering chimaeric proteins from fold fragments: ‘hopeful monsters’ in protein design. Biochem. Soc. Trans. 41, 1137– 1140. (doi:10.1042/BST20130099) Smith MA, Romero PA, Wu T, Brustad EM, Arnold FH. 2013 Chimeragenesis of distantly-related proteins by noncontiguous recombination. Protein Sci. 22, 231– 238. (doi:10.1002/pro.2202) Henikoff S, Henikoff JG. 1992 Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA 89, 10 915 –10 919. (doi:10.1073/ pnas.89.22.10915) Dayhoff M, Schwartz R, Orcutt B. 1978 A model of evolutionary change in proteins. In Atlas of protein sequence and structure (ed MO Dayhoff ), pp. 345– 352. Washington, DC: National Biomedical Research Foundation. Worth CL, Gong S, Blundell TL. 2009 Structural and functional constraints in the evolution of protein families. Nat. Rev. Mol. Cell Biol. 10, 709 –720. (doi:10.1038/nrm2762) Wilke CO. 2012 Bringing molecules back into molecular evolution. PLoS Comput. Biol. 8, e1002572. (doi:10.1371/journal.pcbi.1002572) Liberles DA et al. 2012 The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci. 21, 769 –785. (doi:10.1002/ pro.2071) Studer RA, Dessailly BH, Orengo CA. 2013 Residue mutations and their impact on protein structure and function: detecting beneficial and pathogenic changes. Biochem. J. 449, 581– 594. (doi:10.1042/ BJ20121221) rsif.royalsocietypublishing.org 1. 26 Downloaded from rsif.royalsocietypublishing.org on August 27, 2014 45. 46. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 73. Shortle D. 1989 Probing the determinants of protein folding and stability with amino acid substitutions. J. Biol. Chem. 264, 5315–5318. 74. Jackson SE, ElMasry N, Fersht AR. 
1993 Structure of the hydrophobic core in the transition state for folding of chymotrypsin inhibitor 2: a critical test of the protein engineering method of analysis. Biochemistry 32, 11 270–11 278. (doi:10.1021/ bi00093a002) 75. Lawrence C, Kuge J, Ahmad K, Plaxco KW. 2010 Investigation of an anomalously accelerating substitution in the folding of a prototypical twostate protein. J. Mol. Biol. 403, 446–458. (doi:10. 1016/j.jmb.2010.08.049) 76. Viguera AR, Vega C, Serrano L. 2002 Unspecific hydrophobic stabilization of folding transition states. Proc. Natl Acad. Sci. USA 99, 5349–5354. (doi:10. 1073/pnas.072387799) 77. Zarrine-Afsar A, Wallin S, Neculai AM, Neudecker P, Howell PL, Davidson AR, Chan HS. 2008 Theoretical and experimental demonstration of the importance of specific nonnative interactions in protein folding. Proc. Natl Acad. Sci. USA 105, 9999–10 004. (doi:10.1073/pnas.0801874105) 78. Debès C, Wang M, Caetano-Anollés G, Gräter F. 2013 Evolutionary optimization of protein folding. PLoS Comput. Biol. 9, e1002861. (doi:10.1371/ journal.pcbi.1002861) 79. Mirny LA, Shakhnovich EI. 1999 Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J. Mol. Biol. 291, 177–196. (doi:10. 1006/jmbi.1999.2911) 80. Di Nardo AA, Larson SM, Davidson AR. 2003 The relationship between conservation, thermodynamic stability, and function in the SH3 domain hydrophobic core. J. Mol. Biol. 333, 641–655. (doi:10.1016/j.jmb.2003.08.035) 81. Sanchez-Ruiz JM. 2010 Protein kinetic stability. Biophys. Chem. 148, 1–15. (doi:10.1016/j.bpc. 2010.02.004) 82. Inglés-Prieto A, Ibarra-Molero B, Delgado-Delgado A, Perez-Jimenez R, Fernandez JM, Gaucher EA, Sanchez-Ruiz JM, Gavira JA. 2013 Conservation of protein structure over four billion years. Structure 21, 1690 –1697. (doi:10.1016/j.str.2013.06.020) 83. Godoy-Ruiz R, Ariza F, Rodriguez-Larrea D, PerezJimenez R, Ibarra-Molero B, Sanchez-Ruiz JM. 
2006 Natural selection for kinetic stability is a likely origin of correlations between mutational effects on protein energetics and frequencies of amino acid occurrences in sequence alignments. J. Mol. Biol. 362, 966–978. (doi:10.1016/j.jmb.2006.07.065) 84. Gsponer J, Hopearuoho H, Whittaker SB-M, Spence GR, Moore GR, Paci E, Radford SE, Vendruscolo MH. 2006 Determination of an ensemble of structures representing the intermediate state of the bacterial immunity protein Im7. Proc. Natl Acad. Sci. USA 103, 99– 104. (doi:10.1073/pnas.0508667102) 85. Kato H, Feng H, Bai Y. 2007 The folding pathway of T4 lysozyme: the high-resolution structure and folding of a hidden intermediate. J. Mol. Biol. 365, 870–880. (doi:10.1016/j.jmb.2006.10.047) 27 J. R. Soc. Interface 11: 20140419 47. 58. Pollock DD, Thiltgen G, Goldstein RA. 2012 Amino acid coevolution induces an evolutionary Stokes shift. Proc. Natl Acad. Sci. USA 109, E1352– E1359. (doi:10.1073/pnas.1120084109) 59. Serohijos AWR, Shakhnovich EI. 2014 Contribution of selection for protein folding stability in shaping the patterns of polymorphisms in coding regions. Mol. Biol. Evol. 31, 165–176. (doi:10.1093/molbev/ mst189) 60. Benedix A, Becker CM, de Groot BL, Caflisch A, Böckmann RA. 2009 Predicting free energy changes using structural ensembles. Nat. Methods 6, 3 –4. (doi:10.1038/nmeth0109-3) 61. Willis JR, Briney BS, DeLuca SL, Crowe JE, Meiler J. 2013 Human germline antibody gene segments encode polyspecific antibodies. PLoS Comput. Biol. 9, e1003045. (doi:10.1371/journal.pcbi.1003045) 62. Howell SC, Inampudi KK, Bean DP, Wilson CJ. 2014 Understanding thermal adaptation of enzymes through the multistate rational design and stability prediction of 100 adenylate kinases. Structure 22, 218 –229. (doi:10.1016/j.str.2013.10.019) 63. Cordes MHJ, Sauer RT. 1999 Tolerance of a protein to multiple polar-to-hydrophobic surface substitutions. Protein Sci. 8, 318–325. (doi:10. 1110/ps.8.2.318) 64. 
Gu H, Doshi N, Kim DE, Simons KT, Santiago JV, Nauli S, Baker D. 1999 Robustness of protein folding kinetics to surface hydrophobic substitutions. Protein Sci. 8, 2734–2741. (doi:10.1110/ps.8.12.2734) 65. Cordes MHJ, Burton RE, Walsh NP, McKnight CJ, Sauer RT. 2000 An evolutionary bridge to a new protein fold. Nat. Struct. Biol. 7, 1129–1132. (doi:10.1038/81985) 66. Seeliger D, de Groot BL. 2010 Protein thermostability calculations using alchemical free energy simulations. Biophys. J. 98, 2309–2316. (doi:10.1016/j.bpj.2010.01.051) 67. Allison JR, Bergeler M, Hansen N, van Gunsteren WF. 2011 Current computer modeling cannot explain why two highly similar sequences fold into different structures. Biochemistry 50, 10 965–10 973. (doi:10. 1021/bi2015663) 68. Hansen N, Allison JR, Hodel FH, van Gunsteren WF. 2013 Relative free enthalpies for point mutations in two proteins with highly similar sequences but different folds. Biochemistry 52, 4962–4970. (doi:10.1021/bi400272q) 69. Roy A, Perez A, Dill KA, Maccallum JL. 2014 Computing the relative stabilities and the perresidue components in protein conformational changes. Structure 22, 168–175. (doi:10.1016/j.str. 2013.10.015) 70. Baker D. 2000 A surprising simplicity to protein folding. Nature 405, 39 –42. (doi:10.1038/ 35011000) 71. Brockwell DJ, Radford SE. 2007 Intermediates: ubiquitous species on folding energy landscapes? Curr. Opin. Struct. Biol. 17, 30 –37. (doi:10.1016/j. sbi.2007.01.003) 72. Matthews CR, Hurle MR. 1987 Mutant sequences as probes of protein folding mechanisms. BioEssays 6, 254 –257. (doi:10.1002/bies.950060603) rsif.royalsocietypublishing.org 44. Chem. 14, 1 –63. (doi:10.1016/S0065-3233(08) 60608-7) Dill KA. 1990 Dominant forces in protein folding. Biochemistry 29, 7133–7155. (doi:10.1021/ bi00483a001) Rost B. 2001 Review: protein secondary structure prediction continues to rise. J. Struct. Biol. 134, 204–218. (doi:10.1006/jsbi.2001.4336) Guerois R, Nielsen JE, Serrano L. 
2002 Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J. Mol. Biol. 320, 369 –387. (doi:10.1016/S00222836(02)00442-4) Capriotti E, Fariselli P, Casadio R. 2005 I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res. 33, W306 –W310. (doi:10.1093/nar/gki375) Parthiban V, Gromiha MM, Schomburg D. 2006 CUPSAT: prediction of protein stability upon point mutations. Nucleic Acids Res. 34, W239– W242. (doi:10.1093/nar/gkl190) Yin S, Ding F, Dokholyan NV. 2007 Eris: an automated estimator of protein stability. Nat. Methods 4, 466– 467. (doi:10.1038/nmeth 0607-466) Wang Q, Canutescu AA, Dunbrack RL. 2008 SCWRL and MolIDE: computer programs for side-chain conformation prediction and homology modeling. Nat. Protoc. 3, 1832–1847. (doi:10.1038/nprot. 2008.184) Dehouck Y, Grosfils A, Folch B, Gilis D, Bogaerts P, Rooman M. 2009 Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: poPMuSiC-2.0. Bioinformatics 25, 2537 –2543. (doi:10.1093/bioinformatics/btp445) Kellogg EH, Leaver-Fay A, Baker D. 2011 Role of conformational sampling in computing mutationinduced changes in protein structure and stability. Proteins 79, 830–838. (doi:10.1002/prot.22921) Bershtein S, Segal M, Bekerman R, Tokuriki N, Tawfik DS. 2006 Robustness-epistasis link shapes the fitness landscape of a randomly drifting protein. Nature 444, 929 –932. (doi:10.1038/nature05385) Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik DS. 2007 The stability effects of protein mutations appear to be universally distributed. J. Mol. Biol. 369, 1318 –1332. (doi:10.1016/j.jmb. 2007.03.069) Bershtein S, Goldin K, Tawfik DS. 2008 Intense neutral drifts yield robust and evolvable consensus proteins. J. Mol. Biol. 379, 1029 –1044. (doi:10. 1016/j.jmb.2008.04.024) Bloom JD, Nayak JS, Baltimore D. 
2011 A computational-experimental approach identifies mutations that enhance surface expression of an oseltamivir-resistant influenza neuraminidase. PLoS ONE 6, e22201. (doi:10.1371/journal.pone. 0022201) Ashenberg O, Gong LI, Bloom JD. 2013 Mutational effects on stability are largely conserved during protein evolution. Proc. Natl Acad. Sci. USA 110, 21 071–21 076. (doi:10.1073/pnas.1314781111) Downloaded from rsif.royalsocietypublishing.org on August 27, 2014 117. Tompa P, Tusnády GE, Cserzo M, Simon I. 2001 Prion protein: evolution caught en route. Proc. Natl Acad. Sci. USA 98, 4431–4436. (doi:10.1073/pnas. 071308398) 118. Baldwin AJ et al. 2011 Metastability of native proteins and the phenomenon of amyloid formation. J. Am. Chem. Soc. 133, 14 160–14 163. (doi:10.1021/ ja2017703) 119. Harrison PM, Chan HS, Prusiner SB, Cohen FE. 2001 Conformational propagation with prion-like characteristics in a simple model of protein folding. Protein Sci. 10, 819– 835. (doi:10.1110/ps.38701) 120. Lopes A, Sacquin-Mora S, Dimitrova V, Laine E, Ponty Y, Carbone A. 2013 Protein–protein interactions in a crowded environment: an analysis via cross-docking simulations and evolutionary information. PLoS Comput. Biol. 9, e1003369. (doi:10.1371/journal.pcbi.1003369) 121. Andreeva A, Murzin AG. 2006 Evolution of protein fold in the presence of functional constraints. Curr. Opin. Struct. Biol. 16, 399 –408. (doi:10.1016/j.sbi. 2006.04.003) 122. Tokuriki N, Tawfik DS. 2009 Protein dynamism and evolvability. Science 324, 203– 207. (doi:10.1126/ science.1169375) 123. Yomo T, Saito S, Sasai M. 1999 Gradual development of protein-like global structures through functional selection. Nat. Struct. Biol. 6, 743–746. (doi:10.1038/11512) 124. Nagao C, Terada TP, Yomo T, Sasai M. 2005 Correlation between evolutionary structural development and protein folding. Proc. Natl Acad. Sci. USA 102, 18 950–18 955. (doi:10.1073/pnas. 0509163102) 125. Perica T, Chothia C, Teichmann SA. 
2012 Evolution of oligomeric state through geometric coupling of protein interfaces. Proc. Natl Acad. Sci. USA 109, 8127– 8132. (doi:10.1073/pnas.1120028109) 126. Chen J, Sawyer N, Regan L. 2013 Protein-protein interactions: general trends in the relationship between binding affinity and interfacial buried surface area. Protein Sci. 22, 510– 515. (doi:10. 1002/pro.2230) 127. Levy ED. 2010 A simple definition of structural regions in proteins and its use in analyzing interface evolution. J. Mol. Biol. 403, 660– 670. (doi:10. 1016/j.jmb.2010.09.028) 128. Davis FP, Sali A. 2010 The overlap of small molecule and protein binding sites within families of protein structures. PLoS Comput. Biol. 6, e1000668. (doi:10. 1371/journal.pcbi.1000668) 129. Dasgupta B, Nakamura H, Kinjo AR. 2011 Distinct roles of overlapping and non-overlapping regions of hub protein interfaces in recognition of multiple partners. J. Mol. Biol. 411, 713– 727. (doi:10.1016/ j.jmb.2011.06.027) 130. Levin KB, Dym O, Albeck S, Magdassi S, Keeble AH, Kleanthous C, Tawfik DS. 2009 Following evolutionary paths to protein–protein interactions with high affinity and selectivity. Nat. Struct. Mol. Biol. 16, 1049– 1055. (doi:10.1038/nsmb.1670) 131. Privalov PL, Gill SJ. 1988 Stability of protein structure and hydrophobic interaction. Adv. Protein 28 J. R. Soc. Interface 11: 20140419 100. Ingram VM. 1957 Gene mutations in human haemoglobin: the chemical difference between normal and sickle cell haemoglobin. Nature 180, 326 –328. (doi:10.1038/180326a0) 101. Pauling L, Itano H, Singer S, Wells I. 1949 Sickle cell anemia, a molecular disease. Science 110, 543 –548. (doi:10.1126/science.110.2865.543) 102. Meyer V et al. 2014 Single mutations in tau modulate the populations of fibril conformers through seed selection. Angew. Chem. Int. Ed. Engl. 53, 1590–1593. (doi:10.1002/anie.201308473) 103. Schuster-Böckler B, Bateman A. 2008 Protein interactions in human genetic diseases. Genome Biol. 9, R9. 
(doi:10.1186/gb-2008-9-1-r9) 104. Wang X, Wei X, Thijssen B, Das J, Lipkin SM, Yu H. 2012 Three-dimensional reconstruction of protein networks provides insight into human genetic disease. Nat. Biotechnol. 30, 159–164. (doi:10. 1038/nbt.2106) 105. Ellis RJ, Minton AP. 2006 Protein aggregation in crowded environments. Biol. Chem. 387, 485–497. (doi:10.1515/BC.2006.064) 106. Gershenson A, Gierasch LM. 2010 Protein folding in the cell: challenges and progress. Curr. Opin. Struct. Biol. 21, 32 –41. (doi:10.1016/j.sbi.2010.11.001) 107. Dill KA, Ghosh K, Schmit JD. 2011 Physical limits of cells and proteomes. Proc. Natl Acad. Sci. USA 108, 17 876– 17 882. (doi:10.1073/pnas.1114477108) 108. Levy ED, De S, Teichmann SA. 2012 Cellular crowding imposes global constraints on the chemistry and evolution of proteomes. Proc. Natl Acad. Sci. USA 109, 20 461–20 466. (doi:10.1073/ pnas.1209312109) 109. Sarkar M, Smith AE, Pielak GJ. 2013 Impact of reconstituted cytosol on protein stability. Proc. Natl Acad. Sci. USA 110, 19 342–19 347. (doi:10.1073/ pnas.1312678110) 110. Johnson ME, Hummer G. 2011 Nonspecific binding limits the number of proteins in a cell and shapes their interaction networks. Proc. Natl Acad. Sci. USA 108, 603 –608. (doi:10.1073/pnas.1010954108) 111. Yue K, Dill KA. 1992 Inverse protein folding problem: designing polymer sequences. Proc. Natl Acad. Sci. USA 89, 4163 –4167. (doi:10.1073/pnas. 89.9.4163) 112. Isogai Y. 2006 Native protein sequences are designed to destabilize folding intermediates. Biochemistry 45, 2488 –2492. (doi:10.1021/ bi0523714) 113. Sali A, Shakhnovich EI, Karplus M. 1994 Kinetics of protein folding. A lattice model study of the requirements for folding to the native state. J. Mol. Biol. 235, 1614 –1636. (doi:10.1006/jmbi. 1994.1110) 114. Broglia RA, Tiana G, Roman HH, Vigezzi E, Shakhnovich EI. 1999 Stability of designed proteins against mutations. Phys. Rev. Lett. 82, 4727–4730. (doi:10.1103/PhysRevLett.82.4727) 115. 
Chan HS, Shimizu S, Kaya H. 2004 Cooperativity principles in protein folding. Methods Enzymol. 380, 350 –379. (doi:10.1016/S0076-6879(04)80016-8) 116. Chan HS. 1999 Folding alphabets. Nat. Struct. Biol. 6, 994– 996. (doi:10.1038/14876) rsif.royalsocietypublishing.org 86. Dalessio PM, Boyer JA, McGettigan JL, Ropson IJ. 2005 Swapping core residues in homologous proteins swaps folding mechanism. Biochemistry 44, 3082–3090. (doi:10.1021/bi048125u) 87. Valastyan JS, Lindquist S. 2014 Mechanisms of protein-folding diseases at a glance. Dis. Model Mech. 7, 9–14. (doi:10.1242/dmm.013474) 88. Ciryam P, Tartaglia GG, Morimoto RI, Dobson CM, Vendruscolo MH. 2013 Widespread aggregation and neurodegenerative diseases are associated with supersaturated proteins. Cell Rep. 5, 1–10. (doi:10. 1016/j.celrep.2013.09.043) 89. Jahn TR, Parker MJ, Homans SW, Radford SE. 2006 Amyloid formation under physiological conditions proceeds via a native-like folding intermediate. Nat. Struct. Mol. Biol. 13, 195–201. (doi:10.1038/ nsmb1058) 90. Surguchev A, Surguchov A. 2010 Conformational diseases: looking into the eyes. Brain Res. Bull. 81, 12 –24. (doi:10.1016/j.brainresbull.2009.09.015) 91. Das P, King JA, Zhou R. 2011 Aggregation of gcrystallins associated with human cataracts via domain swapping at the C-terminal b-strands. Proc. Natl Acad. Sci. USA 108, 10 514–10 519. (doi:10. 1073/pnas.1019152108) 92. Ji F, Jung J, Koharudin LMI, Gronenborn AM. 2013 The human W42R gD-crystallin mutant structure provides a link between congenital and age-related cataracts. J. Biol. Chem. 288, 99 –109. (doi:10.1074/ jbc.M112.416354) 93. Baskakov IV, Legname G, Prusiner SB, Cohen FE. 2001 Folding of prion protein to its native a-helical conformation is under kinetic control. J. Biol. Chem. 276, 19 687– 19 690. (doi:10.1074/jbc.C100180200) 94. Lobkovsky AE, Wolf YI, Koonin EV. 2010 Universal distribution of protein evolution rates as a consequence of protein folding physics. Proc. Natl Acad. Sci. 
USA 107, 2983 –2988. (doi:10.1073/pnas. 0910445107) 95. Ciryam P, Morimoto RI, Vendruscolo MH, Dobson CM, O’Brien EP. 2013 In vivo translation rates can substantially delay the cotranslational folding of the Escherichia coli cytosolic proteome. Proc. Natl Acad. Sci. USA 110, E132 –E140. (doi:10.1073/pnas. 1213624110) 96. Sander IM, Chaney JL, Clark PL. 2014 Expanding Anfinsen’s principle: contributions of synonymous codon selection to rational protein design. J. Am. Chem. Soc. 136, 858–861. (doi:10.1021/ja411302m) 97. Tsai C-J, Sauna ZE, Kimchi-Sarfaty C, Ambudkar SV, Gottesman MM, Nussinov R. 2008 Synonymous mutations and ribosome stalling can lead to altered folding pathways and distinct minima. J. Mol. Biol. 383, 281 –291. (doi:10.1016/j.jmb. 2008.08.012) 98. Nooren IMA, Thornton JM. 2003 Diversity of proteinprotein interactions. EMBO J. 22, 3486 –3492. (doi:10.1093/emboj/cdg359) 99. Yang J-R, Liao B-Y, Zhuang S-M, Zhang J. 2012 Protein misinteraction avoidance causes highly expressed proteins to evolve slowly. Proc. Natl Acad. Sci. USA 109, E831 –E840. (doi:10.1073/pnas. 1117408109) Downloaded from rsif.royalsocietypublishing.org on August 27, 2014 133. 134. 136. 137. 138. 139. 140. 141. 142. 143. 144. 145. 161. Bornberg-Bauer E. 1997 How are model protein structures distributed in sequence space? Biophys. J. 73, 2393 –2403. (doi:10.1016/S0006-3495(97) 78268-7) 162. Chan HS, Dill KA. 1991 ‘Sequence space soup’ of proteins and copolymers. J. Chem. Phys. 95, 3775– 3787. (doi:10.1063/1.460828) 163. Kim YE, Hipp MS, Bracher A, Hayer-Hartl M, Hartl FU. 2013 Molecular chaperone functions in protein folding and proteostasis. Annu. Rev. Biochem. 82, 323–355. (doi:10.1146/annurev-biochem-060208092442) 164. Wagner GP, Altenberg L. 1996 Perspective: complex adaptations and the evolution of evolvability. Evolution 50, 967–976. (doi:10.2307/2410639) 165. Tokuriki N, Tawfik DS. 2009 Chaperonin overexpression promotes genetic variation and enzyme evolution. 
Nature 459, 668–673. (doi:10. 1038/nature08009) 166. Wyganowski KT, Kaltenbach M, Tokuriki N. 2013 GroEL/ES buffering and compensatory mutations promote protein evolution by stabilizing folding intermediates. J. Mol. Biol. 425, 3403–3414. (doi:10.1016/j.jmb.2013.06.028) 167. Bogumil D, Dagan T. 2010 Chaperonin-dependent accelerated substitution rates in prokaryotes. Genome Biol. Evol. 2, 602–608. (doi:10.1093/gbe/ evq044) 168. Warnecke T, Hurst LD. 2010 GroEL dependency affects codon usage–support for a critical role of misfolding in gene evolution. Mol. Syst. Biol. 6, 340. (doi:10.1038/msb.2009.94) 169. O’Brien EP, Vendruscolo M, Dobson CM. 2014 Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates. Nat. Commun. 5, 2988. (doi:10.1038/ncomms3988) 170. Cetinbaş M, Shakhnovich EI. 2013 Catalysis of protein folding by chaperones accelerates evolutionary dynamics in adapting cell populations. PLoS Comput. Biol. 9, e1003269. (doi:10.1371/ journal.pcbi.1003269) 171. Kim H, Abeysirigunawarden SC, Chen K, Mayerle M, Ragunathan K, Luthey-Schulten Z, Ha T, Woodson SA. 2014 Protein-guided RNA dynamics during early ribosome assembly. Nature 506, 334–338. (doi:10. 1038/nature13039) 172. Seo M-H, Park J, Kim E, Hohng S, Kim H-S. 2014 Protein conformational dynamics dictate the binding affinity for a ligand. Nat. Commun. 5, 3724. (doi:10.1038/ncomms4724) 173. Bouvignies G et al. 2011 Solution structure of a minor and transiently formed state of a T4 lysozyme mutant. Nature 477, 111 –114. (doi:10.1038/ nature10349) 174. Meier S, Jensen PR, David CN, Chapman J, Holstein TW, Grzesiek S, Özbek S. 2007 Continuous molecular evolution of protein-domain structures by single amino acid changes. Curr. Biol. 17, 173–178. (doi:10.1016/j.cub.2006.10.063) 175. Alexander PA, He Y, Chen Y, Orban J, Bryan PN. 2009 A minimal sequence code for switching protein structure and function. Proc. Natl Acad. Sci. 29 J. 
R. Soc. Interface 11: 20140419 135. 146. Plaxco KW, Simons KT, Baker D. 1998 Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol. 277, 985–994. (doi:10.1006/jmbi.1998.1645) 147. Miller EJ, Fischer KF, Marqusee S. 2002 Experimental evaluation of topological parameters determining protein-folding rates. Proc. Natl Acad. Sci. USA 99, 10 359 –10 363. (doi:10.1073/pnas. 162219099) 148. Koga N, Tatsumi-Koga R, Liu G, Xiao R, Acton TB, Montelione GT, Baker D. 2012 Principles for designing ideal protein structures. Nature 491, 222 –227. (doi:10.1038/nature11600) 149. Shortle D, Chan HS, Dill KA. 1992 Modeling the effects of mutations on the denatured states of proteins. Protein Sci. 1, 201–215. (doi:10.1002/pro. 5560010202) 150. Arnold FH, Wintrode PL, Miyazaki K, Gershenson A. 2001 How enzymes adapt: lessons from directed evolution. Trends Biochem. Sci. 26, 100 –106. (doi:10.1016/S0968-0004(00)01755-2) 151. Van Nimwegen E, Crutchfield JP, Huynen M. 1999 Neutral evolution of mutational robustness. Proc. Natl Acad. Sci. USA 96, 9716–9720. (doi:10.1073/ pnas.96.17.9716) 152. Xia Y, Levitt M. 2002 Roles of mutation and recombination in the evolution of protein thermodynamics. Proc. Natl Acad. Sci. USA 99, 10 382– 10 387. (doi:10.1073/pnas.162097799) 153. Bloom JD, Raval A, Wilke CO. 2007 Thermodynamics of neutral protein evolution. Genetics 175, 255 –266. (doi:10.1534/genetics.106.061754) 154. Watters AL, Deka P, Corrent C, Callender D, Varani G, Sosnick T, Baker D. 2007 The highly cooperative folding of small naturally occurring proteins is likely the result of natural selection. Cell. 128, 613–624. (doi:10.1016/j.cell.2006.12.042) 155. Zhang Z, Chan HS. 2010 Competition between native topology and nonnative interactions in simple and complex folding kinetics of natural and designed proteins. Proc. Natl Acad. Sci. USA 107, 2920 –2925. (doi:10.1073/pnas.0911844107) 156. Zhang Z, Chan HS. 
2009 Native topology of the designed protein Top7 is not conducive to cooperative folding. Biophys. J. 96, L25– L27. (doi:10.1016/j.bpj.2008.11.004) 157. Badasyan A, Liu Z, Chan HS. 2008 Probing possible downhill folding: native contact topology likely places a significant constraint on the folding cooperativity of proteins with≏40 residues. J. Mol. Biol. 384, 512–530. (doi:10.1016/j.jmb.2008.09.023) 158. Chan HS, Dill KA. 1996 Comparing folding codes for proteins and polymers. Proteins 24, 335 –344. (doi:10.1002/(SICI)1097-0134(199603)24:3,335:: AID-PROT6.3.0.CO;2-F) 159. Govindarajan S, Goldstein RA. 1995 Searching for foldable protein structures using optimized energy functions. Biopolymers 36, 43 –51. (doi:10.1002/ bip.360360105) 160. Li H, Helling R, Tang C, Wingreen NS. 1996 Emergence of preferred structures in a simple model of protein folding. Science 273, 666–669. (doi:10. 1126/science.273.5275.666) rsif.royalsocietypublishing.org 132. Chem. 39, 191– 234. (doi:10.1016/S00653233(08)60377-0) Pace CN. 2001 Polar group burial contributes more to protein stability than nonpolar group burial. Biochemistry 40, 310– 313. (doi:10.1021/bi001574j) Zavodszky P, Kardos J, Svingor A, Petsko GA. 1998 Adjustment of conformational flexibility is a key event in the thermal adaptation of proteins. Proc. Natl Acad. Sci. USA 95, 7406 –7411. (doi:10.1073/ pnas.95.13.7406) Beadle BM, Shoichet BK. 2002 Structural bases of stability –function tradeoffs in enzymes. J. Mol. Biol. 321, 285–296. (doi:10.1016/S0022-2836(02) 00599-5) Taverna DM, Goldstein RA. 2002 Why are proteins marginally stable? Proteins 46, 105 –109. (doi:10. 1002/prot.10016) Goldstein RA. 2011 The evolution and evolutionary consequences of marginal thermostability in proteins. Proteins 79, 1396– 1407. (doi:10.1002/ prot.22964) Bornberg-Bauer E, Chan HS. 1999 Modeling evolutionary landscapes: mutational stability, topology, and superfunnels in sequence space. Proc. Natl Acad. Sci. USA 96, 10 689–10 694. (doi:10. 
1073/pnas.96.19.10689) Gould SJ, Lewontin RC. 1979 The spandrels of San Marco and the panglossian paradigm: a critique of the adaptationist programme. Proc. R. Soc. Lond. B 205, 581–598. (doi:10.1098/rspb.1979.0086) Shortle D, Stites WE, Meeker AK. 1990 Contributions of the large hydrophobic amino acids to the stability of staphylococcal nuclease. Biochemistry 29, 8033–8041. (doi:10.1021/bi00487a007) Green SM, Meeker AK, Shortle D. 1992 Contributions of the polar, uncharged amino acids to the stability of staphylococcal nuclease: evidence for mutational effects on the free energy of the denatured state. Biochemistry 31, 5717–5728. (doi:10.1021/bi00140a005) Meeker AK, Garcia-Moreno B, Shortle D. 1996 Contributions of the ionizable amino acids to the stability of staphylococcal nuclease. Biochemistry 35, 6443–6449. (doi:10.1021/bi960171+) Itzhaki LS, Otzen DE, Fersht AR. 1995 The structure of the transition state for folding of chymotrypsin inhibitor 2 analysed by protein engineering methods: evidence for a nucleation-condensation mechanism for protein folding. J. Mol. Biol. 254, 260–288. (doi:10.1006/jmbi.1995.0616) Serrano L, Day AG, Fersht AR. 1993 Step-wise mutation of barnase to binase. A procedure for engineering increased stability of proteins and an experimental analysis of the evolution of protein stability. J. Mol. Biol. 233, 305–312. (doi:10.1006/ jmbi.1993.1508) Zhao H, Arnold FH. 1999 Directed evolution converts subtilisin E into a functional equivalent of thermitase. Protein Eng. Des. Sel. 12, 47 –53. (doi:10.1093/protein/12.1.47) Kuhlman B, Dantas G, Ireton GC, Varani G, Stoddard BL, Baker D. 2003 Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368. (doi:10.1126/science.1089427) Downloaded from rsif.royalsocietypublishing.org on August 27, 2014 177. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 190. 191. 192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207. 208. 209. 210. 211. 212. 
mutational drift of an enzyme. HFSP J. 1, 67–78. (doi:10.2976/1.2739115)
Wroe R, Chan HS, Bornberg-Bauer E. 2007 A structural model of latent evolutionary potentials underlying neutral networks in proteins. HFSP J. 1, 79–87. (doi:10.2976/1.2739116)
Chan HS, Kaya H, Shimizu S. 2002 Computational methods for protein folding: scaling a hierarchy of complexities. In Current topics in computational molecular biology (eds T Jiang, Y Xu, MQ Zhang), pp. 403–447. Cambridge, MA: The MIT Press.
Cordes MHJ, Walsh NP, Mcknight CJ, Sauer RT. 1999 Evolution of a protein fold in vitro. Science 284, 325–327. (doi:10.1126/science.284.5412.325)
He Y, Chen Y, Alexander PA, Bryan PN, Orban J. 2012 Mutational tipping points for switching protein folds and functions. Structure 20, 283–291. (doi:10.1016/j.str.2011.11.018)
Stewart KL, Dodds ED, Wysocki VH, Cordes MHJ. 2013 A polymetamorphic protein. Protein Sci. 22, 641–649. (doi:10.1002/pro.2248)
Minor DL, Kim PS. 1996 Context-dependent secondary structure formation of a designed protein sequence. Nature 380, 730–734. (doi:10.1038/380730a0)
Ambroggio XI, Kuhlman B. 2006 Design of protein conformational switches. Curr. Opin. Struct. Biol. 16, 525–530. (doi:10.1016/j.sbi.2006.05.014)
Dagliyan O et al. 2013 Rational design of a ligand-controlled protein conformational switch. Proc. Natl Acad. Sci. USA 110, 6800–6804. (doi:10.1073/pnas.1218319110)
Meier S, Özbek S. 2007 A biological cosmos of parallel universes: does protein structural plasticity facilitate evolution? BioEssays 29, 1095–1104. (doi:10.1002/bies.20661)
Bryan PN, Orban J. 2010 Proteins that switch folds. Curr. Opin. Struct. Biol. 20, 482–488. (doi:10.1016/j.sbi.2010.06.002)
James LC, Roversi P, Tawfik DS. 2003 Antibody multispecificity mediated by conformational diversity. Science 299, 1362–1367. (doi:10.1126/science.1079731)
Franco OL. 2011 Peptide promiscuity: an evolutionary concept for plant defense. FEBS Lett. 585, 995–1000.
(doi:10.1016/j.febslet.2011.03.008)
Caines MEC, Bichel K, Price AJ, McEwan WA, Towers GJ, Willett BJ, Freund SMV, James LC. 2012 Diverse HIV viruses are targeted by a conformationally dynamic antiviral. Nat. Struct. Mol. Biol. 19, 411–416. (doi:10.1038/nsmb.2253)
Dunker AK et al. 2001 Intrinsically disordered protein. J. Mol. Graph Model 19, 26–59. (doi:10.1016/S1093-3263(00)00138-8)
Tompa P. 2002 Intrinsically unstructured proteins. Trends Biochem. Sci. 27, 527–533. (doi:10.1016/S0968-0004(02)02169-2)
Gunasekaran K, Tsai CJ, Kumar S, Zanuy D, Nussinov R. 2003 Extended disordered proteins: targeting function with less scaffold. Trends Biochem. Sci. 28, 81–85. (doi:10.1016/S0968-0004(03)00003-3)
Dyson HJ, Wright PE. 2005 Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol. 6, 197–208. (doi:10.1038/nrm1589)
FEBS Lett. 583, 1692–1698. (doi:10.1016/j.febslet.2009.03.019)
Zayner JP, Antoniou C, French AR, Hause RJ, Sosnick TR. 2013 Investigating models of protein function and allostery with a widespread mutational analysis of a light-activated protein. Biophys. J. 105, 1027–1036. (doi:10.1016/j.bpj.2013.07.010)
Weinkam P, Chen YC, Pons J, Sali A. 2013 Impact of mutations on the allosteric conformational equilibrium. J. Mol. Biol. 425, 647–661. (doi:10.1016/j.jmb.2012.11.041)
Keskin O, Jernigan RL, Bahar I. 2000 Proteins with similar architecture exhibit similar large-scale dynamic behavior. Biophys. J. 78, 2093–2106. (doi:10.1016/S0006-3495(00)76756-7)
Micheletti C, Lattanzi G, Maritan A. 2002 Elastic properties of proteins: insight on the folding process and evolutionary selection of native structures. J. Mol. Biol. 321, 909–921. (doi:10.1016/S0022-2836(02)00710-6)
Zheng W, Brooks BR, Thirumalai D. 2006 Low-frequency normal modes that describe allosteric transitions in biological nanomachines are robust to sequence variations. Proc. Natl Acad. Sci. USA 103, 7664–7669.
(doi:10.1073/pnas.0510426103)
Zen A, Carnevale V, Lesk AM, Micheletti C. 2008 Correspondences between low-energy modes in enzymes: dynamics-based alignment of enzymatic functional families. Protein Sci. 17, 918–929. (doi:10.1110/ps.073390208)
Taketomi H, Ueda Y, Gō N. 1975 Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. Int. J. Pept. Protein Res. 7, 445–459. (doi:10.1111/j.1399-3011.1975.tb02465.x)
Schug A, Whitford PC, Levy Y, Onuchic JN. 2007 Mutations as trapdoors to two competing native conformations of the Rop-dimer. Proc. Natl Acad. Sci. USA 104, 17 674–17 679. (doi:10.1073/pnas.0706077104)
Micheletti C. 2013 Comparing proteins by their internal dynamics: exploring structure-function relationships beyond static structural alignments. Phys. Life Rev. 10, 1–26. (doi:10.1016/j.plrev.2012.10.009)
Liu Y, Bahar I. 2012 Sequence evolution correlates with structural dynamics. Mol. Biol. Evol. 29, 2253–2263. (doi:10.1093/molbev/mss097)
Peracchi A, Mozzarelli A. 2011 Exploring and exploiting allostery: models, evolution, and drug targeting. Biochim. Biophys. Acta Proteins Proteomics. 1814, 922–933. (doi:10.1016/j.bbapap.2010.10.008)
Boehr DD, Nussinov R, Wright PE. 2009 The role of dynamic conformational ensembles in biomolecular recognition. Nat. Chem. Biol. 5, 789–796. (doi:10.1038/nchembio.232)
Coyle SM, Flores J, Lim WA. 2013 Exploitation of latent allostery enables the evolution of new modes of MAP kinase regulation. Cell 154, 875–887. (doi:10.1016/j.cell.2013.07.019)
Amitai G, Gupta RD, Tawfik DS. 2007 Latent evolutionary potentials under the neutral
USA 106, 21 149–21 154. (doi:10.1073/pnas.0906408106)
Anderson WJ, Van Dorn LO, Ingram WM, Cordes MHJ. 2011 Evolutionary bridges to new protein folds: design of C-terminal Cro protein chameleon sequences. Protein Eng. Des. Sel. 24, 765–771.
(doi:10.1093/protein/gzr027)
Zhuravlev PI, Papoian GA. 2010 Protein functional landscapes, dynamics, allostery: a tortuous path towards a universal theoretical framework. Q. Rev. Biophys. 3, 1–38. (doi:10.1017/S0033583510000119)
Sikosek T, Bornberg-Bauer E, Chan HS. 2012 Evolutionary dynamics on protein bi-stability landscapes can potentially resolve adaptive conflicts. PLoS Comput. Biol. 8, e1002659. (doi:10.1371/journal.pcbi.1002659)
Tuinstra RL, Peterson FC, Kutlesa S, Elgin ES, Kron MA, Volkman BF. 2008 Interconversion between two unrelated protein folds in the lymphotactin native state. Proc. Natl Acad. Sci. USA 105, 5057–5062. (doi:10.1073/pnas.0709518105)
Luo X, Tang Z, Xia G, Wassmann K, Matsumoto T, Rizo J, Yu H. 2004 The Mad2 spindle checkpoint protein has two distinct natively folded states. Nat. Struct. Mol. Biol. 11, 338–345. (doi:10.1038/nsmb748)
Andersen JF, Ding XD, Balfour C, Shokhireva TK, Champagne DE, Walker FA, Montfort WR. 2000 Kinetics and equilibria in ligand binding by nitrophorins 1–4: evidence for stabilization of a nitric oxide-ferriheme complex through a ligand-induced conformational trap. Biochemistry 39, 10 118–10 131. (doi:10.1021/bi000766b)
Ådén J, Verma A, Schug A, Wolf-Watz M. 2012 Modulation of a pre-existing conformational equilibrium tunes adenylate kinase activity. J. Am. Chem. Soc. 134, 16 562–16 570. (doi:10.1021/ja3032482)
Burmann BM, Knauer SH, Sevostyanova A, Schweimer K, Mooney RA, Landick R, Artsimovitch I, Rösch P. 2012 An α helix to β barrel domain switch transforms the transcription factor RfaH into a translation factor. Cell 150, 291–303. (doi:10.1016/j.cell.2012.05.042)
Di Russo NV, Estrin DA, Martí MA, Roitberg AE. 2012 pH-Dependent conformational changes in proteins and their effect on experimental pKas: the case of Nitrophorin 4. PLoS Comput. Biol. 8, e1002761. (doi:10.1371/journal.pcbi.1002761)
Monod J, Wyman J, Changeux J-P. 1965 On the nature of allosteric transitions: a plausible model. J. Mol. Biol.
12, 88–118. (doi:10.1016/S0022-2836(65)80285-6)
Hilser VJ, Wrabl JO, Motlagh HN. 2012 Structural and energetic basis of allostery. Annu. Rev. Biophys. 41, 585–609. (doi:10.1146/annurev-biophys-050511-102319)
Nussinov R, Tsai C-J. 2013 Allostery in disease and in drug discovery. Cell 153, 293–305. (doi:10.1016/j.cell.2013.03.034)
Laskowski RA, Gerick F, Thornton JM. 2009 The structural basis of allosteric regulation in proteins.
249. Henzler-Wildman K, Kern D. 2007 Dynamic personalities of proteins. Nature 450, 964–972. (doi:10.1038/nature06522)
250. Lange OF et al. 2008 Recognition dynamics up to microseconds revealed from an RDC-derived ubiquitin ensemble in solution. Science 320, 1471–1475. (doi:10.1126/science.1157092)
251. Choy WY, Forman-Kay JD. 2001 Calculation of ensembles of structures representing the unfolded state of an SH3 domain. J. Mol. Biol. 308, 1011–1032. (doi:10.1006/jmbi.2001.4750)
252. Lindorff-Larsen K, Best RB, Depristo MA, Dobson CM, Vendruscolo MH. 2005 Simultaneous determination of protein structure and dynamics. Nature 433, 128–132. (doi:10.1038/nature03199)
253. Frauenfelder H, Sligar SG, Wolynes PG. 1991 The energy landscapes and motions of proteins. Science 254, 1598–1603. (doi:10.1126/science.1749933)
254. Bryngelson JD, Onuchic JN, Socci ND, Wolynes PG. 1995 Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins 21, 167–195. (doi:10.1002/prot.340210302)
255. Wrabl JO, Gu J, Liu T, Schrank TP, Whitten ST, Hilser VJ. 2011 The role of protein conformational fluctuations in allostery, function, and evolution. Biophys. Chem. 159, 129–141. (doi:10.1016/j.bpc.2011.05.020)
256. Bastolla U, Porto M, Roman HE. 2013 The emerging dynamic view of proteins: protein plasticity in allostery, evolution and self-assembly. Biochim. Biophys. Acta 1834, 817–819. (doi:10.1016/j.bbapap.2013.03.016)
257. Chevin L-M, Lande R, Mace GM.
2010 Adaptation, plasticity, and extinction in a changing environment: towards a predictive theory. PLoS Biol. 8, e1000357. (doi:10.1371/journal.pbio.1000357)
258. Sato K, Ito Y, Yomo T, Kaneko K. 2003 On the relation between fluctuation and response in biological systems. Proc. Natl Acad. Sci. USA 100, 14 086–14 090. (doi:10.1073/pnas.2334996100)
259. Chen T, Vernazobres D, Yomo T, Bornberg-Bauer E, Chan HS. 2010 Evolvability and single-genotype fluctuation in phenotypic properties: a simple heteropolymer model. Biophys. J. 98, 2487–2496. (doi:10.1016/j.bpj.2010.02.046)
260. Wagner A. 2014 Mutational robustness accelerates the origin of novel RNA phenotypes through phenotypic plasticity. Biophys. J. 106, 955–965. (doi:10.1016/j.bpj.2014.01.003)
261. Rosenberg SM. 2001 Evolving responsively: adaptive mutation. Nat. Rev. Genet. 2, 504–515. (doi:10.1038/35080556)
262. Earl DJ, Deem MW. 2004 Evolvability is a selectable trait. Proc. Natl Acad. Sci. USA 101, 11 531–11 536. (doi:10.1073/pnas.0404656101)
263. Jeffery CJ. 1999 Moonlighting proteins. Trends Biochem. Sci. 24, 8–11.
264. Khersonsky O, Tawfik DS. 2010 Enzyme promiscuity: a mechanistic and evolutionary perspective. Annu. Rev. Biochem. 79, 471–505. (doi:10.1146/annurev-biochem-030409-143718)
265. Aharoni A, Gaidukov L, Khersonsky O, McQ Gould S, Roodveldt C, Tawfik DS. 2005 The ‘evolvability’ of
235. Mittag T, Marsh JA, Grishaev A, Orlicky S, Lin H, Sicheri F, Tyers M, Forman-Kay JD. 2010 Structure/function implications in a dynamic complex of the intrinsically disordered Sic1 with the Cdc4 subunit of an SCF ubiquitin ligase. Structure 18, 494–506. (doi:10.1016/j.str.2010.01.020)
236. Song J, Ng SC, Tompa P, Lee KAW, Chan HS. 2013 Polycation-π interactions are a driving force for molecular recognition by an intrinsically disordered oncoprotein family. PLoS Comput. Biol. 9, e1003239. (doi:10.1371/journal.pcbi.1003239)
237. Kimura M.
1968 Evolutionary rate at the molecular level. Nature 217, 624–626. (doi:10.1038/217624a0)
238. Ohta T. 1973 Slightly deleterious mutant substitutions in evolution. Nature 246, 96–98. (doi:10.1038/246096a0)
239. Sikosek T, Chan HS, Bornberg-Bauer E. 2012 Escape from adaptive conflict follows from weak functional trade-offs and mutational robustness. Proc. Natl Acad. Sci. USA 109, 14 888–14 893. (doi:10.1073/pnas.1115620109)
240. Brown CJ, Takayama S, Campen AM, Vise P, Marshall TW, Oldfield CJ, Williams CJ, Dunker AK. 2002 Evolutionary rate heterogeneity in proteins with long disordered regions. J. Mol. Evol. 55, 104–110. (doi:10.1007/s00239-001-2309-6)
241. Nilsson J, Grahn M, Wright APH. 2011 Proteome-wide evidence for enhanced positive Darwinian selection within intrinsically disordered regions in proteins. Genome Biol. 12, R65. (doi:10.1186/gb-2011-12-7-r65)
242. Huang H, Sarai A. 2012 Analysis of the relationships between evolvability, thermodynamics, and the functions of intrinsically disordered proteins/regions. Comput. Biol. Chem. 41, 51–57. (doi:10.1016/j.compbiolchem.2012.10.001)
243. Marsh JA, Teichmann SA. 2014 Parallel dynamics and evolution: protein conformational fluctuations and assembly reflect evolutionary changes in sequence and structure. BioEssays 36, 209–218. (doi:10.1002/bies.201300134)
244. Brown CJ, Johnson AK, Daughdrill GW. 2010 Comparing models of evolution for ordered and disordered proteins. Mol. Biol. Evol. 27, 609–621. (doi:10.1093/molbev/msp277)
245. Moesa HA, Wakabayashi S, Nakai K, Patil A. 2012 Chemical composition is maintained in poorly conserved intrinsically disordered regions and suggests a means for their classification. Mol. Biosyst. 8, 3262–3273. (doi:10.1039/c2mb25202c)
246. Brown CJ, Johnson AK, Dunker AK, Daughdrill GW. 2011 Evolution and disorder. Curr. Opin. Struct. Biol. 21, 441–446. (doi:10.1016/j.sbi.2011.02.005)
247.
Nash P, Tang X, Orlicky S, Chen Q, Gertler FB, Mendenhall MD, Sicheri F, Pawson T, Tyers M. 2001 Multisite phosphorylation of a CDK inhibitor sets a threshold for the onset of DNA replication. Nature 414, 514–521. (doi:10.1038/35107009)
248. Mittermaier A, Kay LE. 2006 New tools provide new insights in NMR studies of protein dynamics. Science 312, 224–228. (doi:10.1126/science.1124964)
220. Fuxreiter M, Simon I, Bondos S. 2011 Dynamic protein-DNA recognition: beyond what can be seen. Trends Biochem. Sci. 36, 415–423. (doi:10.1016/j.tibs.2011.04.006)
221. Marsh JA, Teichmann SA, Forman-Kay JD. 2012 Probing the diverse landscape of protein flexibility and binding. Curr. Opin. Struct. Biol. 22, 643–650. (doi:10.1016/j.sbi.2012.08.008)
222. Uversky VN. 2013 A decade and a half of protein intrinsic disorder: biology still waits for physics. Protein Sci. 22, 693–724. (doi:10.1002/pro.2261)
223. Monsellier E, Chiti F. 2007 Prevention of amyloid-like aggregation as a driving force of protein evolution. EMBO Rep. 8, 737–742. (doi:10.1038/sj.embor.7401034)
224. Greenwald J, Riek R. 2012 On the possible amyloid origin of protein folds. J. Mol. Biol. 421, 417–426. (doi:10.1016/j.jmb.2012.04.015)
225. Trifonov EN. 2000 Consensus temporal order of amino acids and evolution of the triplet code. Gene 261, 139–151. (doi:10.1016/S0378-1119(00)00476-5)
226. Mannige RV, Brooks CL, Shakhnovich EI. 2012 A universal trend among proteomes indicates an oily last common ancestor. PLoS Comput. Biol. 8, e1002839. (doi:10.1371/journal.pcbi.1002839)
227. Romero P, Obradovic Z, Li X, Garner EC, Brown CJ, Dunker AK. 2001 Sequence complexity of disordered protein. Proteins 42, 38–48. (doi:10.1002/1097-0134(20010101)42:1<38::AID-PROT50>3.0.CO;2-3)
228. Rauscher S, Baud S, Miao M, Keeley FW, Pomès R. 2006 Proline and glycine control protein self-organization into elastomeric or amyloid fibrils. Structure 14, 1667–1676. (doi:10.1016/j.str.2006.09.008)
229.
Liu Z, Huang Y. 2014 Advantages of proteins being disordered. Protein Sci. 23, 539–550. (doi:10.1002/pro.2443)
230. Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN. 2005 Flexible nets. The roles of intrinsic disorder in protein interaction networks. FEBS J. 272, 5129–5148. (doi:10.1111/j.1742-4658.2005.04948.x)
231. Cumberworth A, Lamour G, Babu MM, Gsponer J. 2013 Promiscuity as a functional trait: intrinsically disordered regions as central players of interactomes. Biochem. J. 454, 361–369. (doi:10.1042/BJ20130545)
232. Borg M, Mittag T, Pawson T, Tyers M, Forman-Kay JD, Chan HS. 2007 Polyelectrostatic interactions of disordered ligands suggest a physical basis for ultrasensitivity. Proc. Natl Acad. Sci. USA 104, 9650–9655. (doi:10.1073/pnas.0702580104)
233. Tompa P, Fuxreiter M. 2008 Fuzzy complexes: polymorphism and structural disorder in protein-protein interactions. Trends Biochem. Sci. 33, 2–8. (doi:10.1016/j.tibs.2007.10.003)
234. Mittag T, Orlicky S, Choy W-Y, Tang X, Lin H, Sicheri F, Kay LE, Tyers M, Forman-Kay JD. 2008 Dynamic equilibrium engagement of a polyvalent ligand with a single-site receptor. Proc. Natl Acad. Sci. USA 105, 17 772–17 777. (doi:10.1073/pnas.0809222105)
proteins. J. R. Soc. Interface 10, 20130026. (doi:10.1098/rsif.2013.0026)
Lau KF, Dill KA. 1990 Theory for protein mutability and biogenesis. Proc. Natl Acad. Sci. USA 87, 638–642. (doi:10.1073/pnas.87.2.638)
Lipman DJ, Wilbur WJ. 1991 Modelling neutral and selective evolution of protein folding. Proc. R. Soc. Lond. B 245, 7–11. (doi:10.1098/rspb.1991.0081)
Holzgräfe C, Irbäck A, Troein C. 2011 Mutation-induced fold switching among lattice proteins. J. Chem. Phys. 135, 195101. (doi:10.1063/1.
3660691)
Miller C, Davlieva M, Wilson C, White KI, Couñago R, Wu G, Myers JC, Wittung-Stafshede P, Shamoo Y. 2010 Experimental evolution of adenylate kinase reveals contrasting strategies toward protein thermostability. Biophys. J. 99, 887–896. (doi:10.1016/j.bpj.2010.04.076)
Hirst JD. 1999 The evolutionary landscape of functional model proteins. Protein Eng. Des. Sel. 12, 721–726. (doi:10.1093/protein/12.9.721)
Blackburne BP, Hirst JD. 2001 Evolution of functional model proteins. J. Chem. Phys. 115, 1935. (doi:10.1063/1.1383051)
Burke S, Elber R. 2011 Super folds, networks, and barriers. Proteins 80, 463–470. (doi:10.1002/prot.23212)
Noirel J, Simonson T. 2008 Neutral evolution of proteins: the superfunnel in sequence space and its relation to mutational robustness. J. Chem. Phys. 129, 185104. (doi:10.1063/1.2992853)
Bastolla U, Roman HE, Vendruscolo MH. 1999 Neutral evolution of model proteins: diffusion in sequence space and overdispersion. J. Theor. Biol. 200, 49–64. (doi:10.1006/jtbi.1999.0975)
Taverna DM, Goldstein RA. 2002 Why are proteins so robust to site mutations? J. Mol. Biol. 315, 479–484. (doi:10.1006/jmbi.2001.5226)
Deeds EJ, Shakhnovich EI. 2005 The emergence of scaling in sequence-based physical models of protein evolution. Biophys. J. 88, 3905–3911. (doi:10.1529/biophysj.104.051433)
Zeldovich KB, Chen P, Shakhnovich BE, Shakhnovich EI. 2007 A first-principles model of early evolution: emergence of gene families, species, and preferred protein folds. PLoS Comput. Biol. 3, e139. (doi:10.1371/journal.pcbi.0030139)
Chan HS, Dill KA. 1989 Compact polymers. Macromolecules 22, 4559–4573. (doi:10.1021/ma00202a031)
Pande VS, Joerg C, Grosberg AY, Tanaka T. 1994 Enumerations of the Hamiltonian walks on a cubic sublattice. J. Phys. A Math. Gen. 27, 6231–6236. (doi:10.1088/0305-4470/27/18/030)
Lee JH, Kim S-Y, Lee J. 2011 Parallel algorithm for calculation of the exact partition function of a lattice polymer. Comput. Phys. Commun. 182, 1027–1033.
(doi:10.1016/j.cpc.2011.01.004)
Schram RD, Schiessel H. 2013 Exact enumeration of Hamiltonian walks on the 4×4×4 cube and applications to protein folding. J. Phys. A Math. Theor. 46, 485001. (doi:10.1088/1751-8113/46/48/485001)
282. Maritan A, Micheletti C, Trovato A, Banavar JR. 2000 Optimal shapes of compact strings. Nature 406, 287–290. (doi:10.1038/35018538)
283. Gregoret LM, Cohen FE. 1991 Protein folding. Effect of packing density on chain conformation. J. Mol. Biol. 219, 109–122. (doi:10.1016/0022-2836(91)90861-Y)
284. Hunt NG, Gregoret LM, Cohen FE. 1994 The origins of protein secondary structure. Effects of packing density and hydrogen bonding studied by a fast conformational search. J. Mol. Biol. 241, 214–225. (doi:10.1006/jmbi.1994.1490)
285. Yee DP, Chan HS, Havel TF, Dill KA. 1994 Does compactness induce secondary structure in proteins? A study of poly-alanine chains computed by distance geometry. J. Mol. Biol. 241, 557–573. (doi:10.1006/jmbi.1994.1531)
286. Zhang Y, Hubner IA, Arakaki AK, Shakhnovich EI, Skolnick J. 2006 On the origin and highly likely completeness of single-domain protein structures. Proc. Natl Acad. Sci. USA 103, 2605–2610. (doi:10.1073/pnas.0509379103)
287. Taylor WR, Chelliah V, Hollup SM, MacDonald JT, Jonassen I. 2009 Probing the ‘dark matter’ of protein fold space. Structure 17, 1244–1252. (doi:10.1016/j.str.2009.07.012)
288. Cossio P, Trovato A, Pietrucci F, Seno F, Maritan A, Laio A. 2010 Exploring the universe of protein structures beyond the Protein Data Bank. PLoS Comput. Biol. 6, e1000957. (doi:10.1371/journal.pcbi.1000957)
289. Dai L, Zhou Y. 2011 Characterizing the existing and potential structural space of proteins by large-scale multiple loop permutations. J. Mol. Biol. 408, 585–595. (doi:10.1016/j.jmb.2011.02.056)
290. Skolnick J, Zhou H, Brylinski M. 2012 Further evidence for the likely completeness of the library of solved single domain protein structures. J. Phys.
Chem. B 116, 6654–6664. (doi:10.1021/jp211052j)
291. Skolnick J, Gao M. 2013 Interplay of physics and evolution in the likely origin of protein biochemical function. Proc. Natl Acad. Sci. USA 110, 9344–9349. (doi:10.1073/pnas.1300011110)
292. Chan HS, Bornberg-Bauer E. 2002 Perspectives on protein evolution from simple exact models. Appl. Bioinform. 1, 121–144.
293. Xia Y, Levitt M. 2004 Funnel-like organization in sequence space determines the distributions of protein stability and folding rate preferred by evolution. Proteins 55, 107–114. (doi:10.1002/prot.10563)
294. Greenbury SF, Johnston IG, Louis AA, Ahnert SE. 2014 A tractable genotype-phenotype map modelling the self-assembly of protein quaternary structure. J. R. Soc. Interface 11, 20140249. (doi:10.1098/rsif.2014.0249)
295. Moreno-Hernández S, Levitt M. 2012 Comparative modeling and protein-like features of hydrophobic-polar models on a two-dimensional lattice. Proteins 80, 1683–1693. (doi:10.1002/prot.24067)
296. Palmer ME, Moudgil A, Feldman MW. 2013 Long-term evolution is surprisingly predictable in lattice
266. promiscuous protein functions. Nat. Genet. 37, 73–76. (doi:10.1038/ng1482)
Hou J, Sims GE, Zhang C, Kim S-H. 2003 A global representation of the protein fold space. Proc. Natl Acad. Sci. USA 100, 2386–2390. (doi:10.1073/pnas.2628030100)
Sippl MJ. 2009 Fold space unlimited. Curr. Opin. Struct. Biol. 19, 312–320. (doi:10.1016/j.sbi.2009.03.010)
Caetano-Anollés G, Wang M, Caetano-Anollés D, Mittenthal JE. 2009 The origin, evolution and structure of the protein world. Biochem. J. 417, 621–637. (doi:10.1042/BJ20082063)
Caetano-Anollés G, Kim KM, Caetano-Anollés D. 2012 The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis. J. Mol. Evol. 74, 1–34. (doi:10.1007/s00239-011-9480-1)
Osadchy M, Kolodny R. 2011 Maps of protein structure space reveal a fundamental relationship between protein structure and function.
Proc. Natl Acad. Sci. USA 108, 12 301–12 306. (doi:10.1073/pnas.1102727108)
Minary P, Levitt M. 2008 Probing protein fold space with a simplified model. J. Mol. Biol. 375, 920–933. (doi:10.1016/j.jmb.2007.10.087)
Keefe AD, Szostak JW. 2001 Functional proteins from a random-sequence library. Nature 410, 715–718. (doi:10.1038/35070613)
Povolotskaya IS, Kondrashov FA. 2010 Sequence space and the ongoing expansion of the protein universe. Nature 465, 922–926. (doi:10.1038/nature09105)
Murzin AG, Brenner SE, Hubbard T, Chothia C. 1995 SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540. (doi:10.1006/jmbi.1995.0159)
Wolf YI, Grishin NV, Koonin EV. 2000 Estimating the number of protein folds and families from complete genome data. J. Mol. Biol. 299, 897–905. (doi:10.1006/jmbi.2000.3786)
Govindarajan S, Recabarren R, Goldstein RA. 1999 Estimating the total number of protein folds. Proteins 35, 408–414. (doi:10.1002/(SICI)1097-0134(19990601)35:4<408::AID-PROT4>3.0.CO;2-A)
Coulson AFW, Moult J. 2002 A unifold, mesofold, and superfold model of protein fold use. Proteins 46, 61–71. (doi:10.1002/prot.10011)
Kolodny R, Pereyaslavets L, Samson AO, Levitt M. 2013 On the universe of protein folds. Annu. Rev. Biophys. 42, 559–582. (doi:10.1146/annurev-biophys-083012-130432)
Godzik A. 2011 Metagenomics and the protein universe. Curr. Opin. Struct. Biol. 21, 398–403. (doi:10.1016/j.sbi.2011.03.010)
Chan HS, Dill KA. 1990 The effects of internal constraints on the configurations of chain molecules. J. Chem. Phys. 92, 3118–3135. (doi:10.1063/1.458605)
Chan HS, Dill KA. 1990 Origins of structure in globular proteins. Proc. Natl Acad. Sci. USA 87, 6388–6392. (doi:10.1073/pnas.87.16.6388)
345. Heo M, Kang L, Shakhnovich EI. 2009 Emergence of species in evolutionary ‘simulated annealing’. Proc. Natl Acad. Sci. USA 106, 1869–1874.
(doi:10.1073/pnas.0809852106)
346. Heo M, Shakhnovich EI. 2010 Interplay between pleiotropy and secondary selection determines rise and fall of mutators in stress response. PLoS Comput. Biol. 6, e1000710. (doi:10.1371/journal.pcbi.1000710)
347. Heo M, Maslov S, Shakhnovich EI. 2011 Topology of protein interaction network shapes protein abundances and strengths of their functional and nonspecific interactions. Proc. Natl Acad. Sci. USA 108, 4258–4263. (doi:10.1073/pnas.1009392108)
348. Wright S. 1932 The roles of mutations, inbreeding, crossbreeding and selection in evolution. In Proc. 6th Int. Congr. Genet, vol. 1, pp. 356–366. Menasha, WI: Brooklyn Botanical Garden.
349. Kauffman S, Levin S. 1987 Towards a general theory of adaptive walks on rugged landscapes. J. Theor. Biol. 128, 11–45. (doi:10.1016/S0022-5193(87)80029-2)
350. Voigt CA, Kauffman S, Wang ZG. 2000 Rational evolutionary design: the theory of in vitro protein evolution. Adv. Protein Chem. 55, 79–160. (doi:10.1016/S0065-3233(01)55003-2)
351. Carneiro M, Hartl DL. 2010 Adaptive landscapes and protein evolution. Proc. Natl Acad. Sci. USA 107, 1747–1751. (doi:10.1073/pnas.0906192106)
352. Sella G, Hirsh AE. 2005 The application of statistical physics to evolutionary biology. Proc. Natl Acad. Sci. USA 102, 9541–9546. (doi:10.1073/pnas.0501865102)
353. Pathria R. 1980 Statistical mechanics. Oxford, UK: Pergamon Press.
354. Lobkovsky AE, Wolf YI, Koonin EV. 2013 Quantifying the similarity of monotonic trajectories in rough and smooth fitness landscapes. Mol. Biosyst. 9, 1627–1631. (doi:10.1039/c3mb25553k)
355. Ferrada E, Wagner A. 2012 A comparison of genotype–phenotype maps for RNA and proteins. Biophys. J. 102, 1916–1925. (doi:10.1016/j.bpj.2012.01.047)
356. Fontana W, Schuster P. 1987 A computer model of evolutionary optimization. Biophys. Chem. 26, 123–147. (doi:10.1016/0301-4622(87)80017-0)
357.
Fontana W, Stadler P, Bornberg-Bauer E, Griesmacher T, Hofacker I, Tacker M, Tarazona P, Weinberger E, Schuster P. 1993 RNA folding and combinatory landscapes. Phys. Rev. E 47, 2083–2099. (doi:10.1103/PhysRevE.47.2083)
358. Schuster P, Fontana W, Stadler PF, Hofacker IL. 1994 From sequences to shapes and back: a case study in RNA secondary structures. Proc. R. Soc. Lond. B 255, 279–284. (doi:10.1098/rspb.1994.0040)
359. Ancel LW, Fontana W. 2000 Plasticity, evolvability, and modularity in RNA. J. Exp. Zool. 288, 242–283. (doi:10.1002/1097-010X(20001015)288:3<242::AID-JEZ5>3.0.CO;2-O)
360. Guo HH, Choe J, Loeb LA. 2004 Protein tolerance to random amino acid change. Proc. Natl Acad. Sci. USA 101, 9205–9210. (doi:10.1073/pnas.0403255101)
361. Punta M et al. 2012 The Pfam protein families database. Nucleic Acids Res. 40, D290–D301. (doi:10.1093/nar/gkr1065)
329. Miyazawa S, Jernigan RL. 1985 Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18, 534–552. (doi:10.1021/ma00145a039)
330. Bloom JD, Wilke CO, Arnold FH, Adami C. 2004 Stability and the evolvability of function in a model protein. Biophys. J. 86, 2758–2764. (doi:10.1016/S0006-3495(04)74329-5)
331. Abkevich VI, Gutin AM, Shakhnovich EI. 1994 Specific nucleus as the transition state for protein folding: evidence from the lattice model. Biochemistry 33, 10 026–10 036. (doi:10.1021/bi00199a029)
332. Xia Y, Levitt M. 2004 Simulating protein evolution in sequence and structure space. Curr. Opin. Struct. Biol. 14, 202–207. (doi:10.1016/j.sbi.2004.03.001)
333. Goldstein RA. 2008 The structure of protein evolution and the evolution of protein structure. Curr. Opin. Struct. Biol. 18, 170–177. (doi:10.1016/j.sbi.2008.01.006)
334. Zeldovich KB, Shakhnovich EI. 2008 Understanding protein evolution: from protein physics to Darwinian selection. Annu. Rev. Phys. Chem. 59, 105–127.
(doi:10.1146/annurev.physchem.58.032806.104449)
335. Blackburne BP, Hirst JD. 2005 Population dynamics simulations of functional model proteins. J. Chem. Phys. 123, 154907. (doi:10.1063/1.2056545)
336. Maynard Smith J. 1970 Natural selection and the concept of a protein space. Nature 225, 563–564. (doi:10.1038/225563a0)
337. Bastolla U, Vendruscolo MH, Roman HE. 2000 Structurally constrained protein evolution: results from a lattice simulation. Eur. Phys. J. B 15, 385–397. (doi:10.1007/s100510051140)
338. Bloom JD, Labthavikul ST, Otey CR, Arnold FH. 2006 Protein stability promotes evolvability. Proc. Natl Acad. Sci. USA 103, 5869–5874. (doi:10.1073/pnas.0510098103)
339. Drummond DA, Silberg JJ, Meyer MM, Wilke CO, Arnold FH. 2005 On the conservative nature of intragenic recombination. Proc. Natl Acad. Sci. USA 102, 5380–5385. (doi:10.1073/pnas.0500729102)
340. Cui Y, Wong W. 2000 Multiple-sequence information provides protection against mis-specified potential energy functions in the lattice model of proteins. Phys. Rev. Lett. 85, 5242–5245. (doi:10.1103/PhysRevLett.85.5242)
341. Nanda V, DeGrado WF. 2005 Automated use of mutagenesis data in structure prediction. Proteins 59, 454–466. (doi:10.1002/prot.20382)
342. England JL, Shakhnovich BE, Shakhnovich EI. 2003 Natural selection of more designable folds: a mechanism for thermophilic adaptation. Proc. Natl Acad. Sci. USA 100, 8727–8731. (doi:10.1073/pnas.1530713100)
343. Noivirt-Brik O, Unger R, Horovitz A. 2009 Analysing the origin of long-range interactions in proteins using lattice models. BMC Struct. Biol. 9, 4. (doi:10.1186/1472-6807-9-4)
344. Liu Z, Chen J, Thirumalai D. 2009 On the accuracy of inferring energetic coupling between distant sites in protein families from evolutionary imprints: illustrations using lattice model. Proteins 77, 823–831. (doi:10.1002/prot.22498)
313. Lee J. 2004 Exact partition function zeros of two-dimensional lattice polymers. J.
Korean Phys. Soc. 44, 617–620. (doi:10.3938/jkps.44.617)
314. Clisby N, Liang R, Slade G. 2007 Self-avoiding walk enumeration via the lace expansion. J. Phys. A Math. Theor. 40, 10 973–11 017. (doi:10.1088/1751-8113/40/36/003)
315. Yue K, Fiebig KM, Thomas PD, Chan HS, Shakhnovich EI, Dill KA. 1995 A test of lattice protein folding algorithms. Proc. Natl Acad. Sci. USA 92, 325–329. (doi:10.1073/pnas.92.1.325)
316. Irbäck A, Troein C. 2002 Enumerating designing sequences in the HP model. J. Biol. Phys. 28, 1–15. (doi:10.1023/A:1016225010659)
317. Dill KA, Bromberg S, Yue K, Fiebig KM, Yee DP, Thomas PD, Chan HS. 1995 Principles of protein folding—a perspective from simple exact models. Protein Sci. 4, 561–602. (doi:10.1002/pro.5560040401)
318. Kamtekar S, Schiffer J, Xiong H, Babik J, Hecht M. 1993 Protein design by binary patterning of polar and nonpolar amino acids. Science 262, 1680–1685. (doi:10.1126/science.8259512)
319. Urvoas A, Valerio-Lepiniec M, Minard P. 2012 Artificial proteins from combinatorial approaches. Trends Biotechnol. 30, 512–520. (doi:10.1016/j.tibtech.2012.06.001)
320. Irbäck A, Sandelin E. 2000 On hydrophobicity correlations in protein chains. Biophys. J. 79, 2252–2258. (doi:10.1016/S0006-3495(00)76472-1)
321. Irbäck A, Peterson C, Potthast F. 1996 Evidence for nonrandom hydrophobicity structures in protein chains. Proc. Natl Acad. Sci. USA 93, 9533–9538. (doi:10.1073/pnas.93.18.9533)
322. Buchler NE, Goldstein RA. 1999 Effect of alphabet size and foldability requirements on protein structure designability. Proteins 34, 113–124. (doi:10.1002/(SICI)1097-0134(19990101)34:1<113::AID-PROT9>3.0.CO;2-J)
323. Wroe R, Bornberg-Bauer E, Chan HS. 2005 Comparing folding codes in simple heteropolymer models of protein evolutionary landscape: robustness of the superfunnel paradigm. Biophys. J. 88, 118–131. (doi:10.1529/biophysj.104.050369)
324. Cui Y, Wong WH, Bornberg-Bauer E, Chan HS.
2002 Recombinatoric exploration of novel folded structures: a heteropolymer-based model of protein evolutionary landscapes. Proc. Natl Acad. Sci. USA 99, 809–814. (doi:10.1073/pnas.022240299) 325. Chan HS. 2000 Modeling protein density of states: additive hydrophobic effects are insufficient for calorimetric two-state cooperativity. Proteins 40, 543–571. (doi:10.1002/1097-0134(20000901) 40:4,543::AID-PROT20.3.0.CO;2-O) 326. Chan HS. 1998 Protein folding. Matching speed and locality. Nature 392, 761– 763. (doi:10.1038/33808) 327. Gō N. 1983 Theoretical studies of protein folding. Annu. Rev. Biophys. Bioeng. 12, 183–210. (doi:10. 1146/annurev.bb.12.060183.001151) 328. Bryngelson JD, Wolynes PG. 1987 Spin glasses and the statistical mechanics of protein folding. Proc. Natl Acad. Sci. USA 84, 7524 –7528. (doi:10.1073/ pnas.84.21.7524) Downloaded from rsif.royalsocietypublishing.org on August 27, 2014 397. 398. 399. 400. 401. 402. 403. 404. 405. 406. 407. 408. 409. 410. 411. 412. evolution by conformational epistasis. Science 317, 1544– 1548. (doi:10.1126/science.1142819) Breen MS, Kemena C, Vlasov PK, Notredame C, Kondrashov FA. 2012 Epistasis as the primary factor in molecular evolution. Nature 490, 535 –538. (doi:10.1038/nature11510) Soylemez O, Kondrashov FA. 2012 Estimating the rate of irreversibility in protein evolution. Genome Biol. Evol. 4, 1213–1222. (doi:10.1093/gbe/evs096) Pollock DD, Goldstein RA. 2014 Strong evidence for protein epistasis, weak evidence against it. Proc. Natl Acad. Sci. USA 111, E1450. (doi:10.1073/pnas. 1401112111) Ferrer-Costa C, Orozco M, de la Cruz X. 2004 Sequence-based prediction of pathological mutations. Proteins 57, 811 –819. (doi:10.1002/ prot.20252) Ferrer-Costa C, Orozco M, de la Cruz X. 2007 Characterization of compensated mutations in terms of structural and physico-chemical properties. J. Mol. Biol. 365, 249– 256. (doi:10.1016/j.jmb. 2006.09.053) Baresić A, Hopcroft LEM, Rogers HH, Hurst JM, Martin ACR. 
2010 Compensated pathogenic deviations: analysis of structural effects. J. Mol. Biol. 396, 19– 30. (doi:10.1016/j.jmb.2009.11.002) Wang Z, Moult J. 2003 Three-dimensional structural location and molecular functional effects of missense SNPs in the T cell receptor Vbeta domain. Proteins 53, 748 –757. (doi:10.1002/prot.10522) Ivankov DN, Finkelstein AV, Kondrashov FA. 2014 A structural perspective of compensatory evolution. Curr. Opin. Struct. Biol. 26, 104– 112. (doi:10.1016/ j.sbi.2014.05.004) Shoval O, Sheftel H, Shinar G, Hart Y, Ramote O, Mayo A, Dekel E, Kavanagh K, Alon U. 2012 Evolutionary trade-offs, Pareto optimality, and the geometry of phenotype space. Science 336, 1157– 1160. (doi:10.1126/science.1217405) Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW. 2002 Evolutionary rate in the protein interaction network. Science 296, 750 –752. (doi:10. 1126/science.1068696) Pál C, Papp B, Hurst L. 2001 Highly expressed genes in yeast evolve slowly. Genetics 71, 416–417. Giaever G et al. 2002 Functional profiling of the Saccharomyces cerevisiae genome. Nature 418, 387–391. (doi:10.1038/nature00935) Zhang R, Lin Y. 2009 DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res. 37, D455 –D458. (doi:10.1093/ nar/gkn858) Park C, Chen X, Yang J, Zhang J. 2013 Differential requirements for mRNA folding partially explain why highly expressed proteins evolve slowly. Proc. Natl Acad. Sci. USA 110, E678– E686. (doi:10.1073/ pnas.1218066110) Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005 Why highly expressed proteins evolve slowly. Proc. Natl Acad. Sci. USA 102, 14 338 –14 343. (doi:10.1073/pnas.0504070102) Drummond DA, Wilke CO. 2008 Mistranslationinduced protein misfolding as a dominant constraint 34 J. R. Soc. Interface 11: 20140419 380. Brosius J, Gould SJ. 1992 On ‘genomenclature’: a comprehensive (and respectful) taxonomy for pseudogenes and other ‘junk DNA’. Proc. Natl Acad. Sci. USA 89, 10 706 –10 710. 
(doi:10.1093/ oxfordjournals.molbev.a025627) 381. Brosius J. 1999 RNAs from all categories generate retrosequences that may be exapted as novel genes or regulatory elements. Gene 238, 115–134. (doi:10.1016/S0378-1119(99)00227-9) 382. Krull M, Brosius J, Schmitz J. 2005 Alu-SINE exonization: en route to protein-coding function. Mol. Biol. Evol. 22, 1702–1711. (doi:10.1093/ molbev/msi164) 383. Carvunis A-R et al. 2012 Proto-genes and de novo gene birth. Nature 487, 370–374. (doi:10.1038/ nature11184) 384. Fraser JS, Clarkson MW, Degnan SC, Erion R, Kern D, Alber T. 2009 Hidden alternative structures of proline isomerase essential for catalysis. Nature 462, 669 –673. (doi:10.1038/nature08615) 385. Vallurupalli P, Bouvignies G, Kay LE. 2012 Studying ‘invisible’ excited protein states in slow exchange with a major state conformation. J. Am. Chem. Soc. 134, 8148–8161. (doi:10.1021/ja3001419) 386. Sekhar A, Kay LE. 2013 NMR paves the way for atomic level descriptions of sparsely populated, transiently formed biomolecular conformers. Proc. Natl Acad. Sci. USA 110, 12 867–12 874. (doi:10. 1073/pnas.1305688110) 387. Dimitrov JD, Kaveri SV, Lacroix-Desmazes S. 2014 Thermodynamic stability contributes to immunoglobulin specificity. Trends Biochem. Sci. 39, 221 –226. (doi:10.1016/j.tibs.2014.02.010) 388. Sabath N, Wagner A, Karlin D. 2012 Evolution of viral proteins originated de novo by overprinting. Mol. Biol. Evol. 29, 3767–3780. (doi:10.1093/ molbev/mss179) 389. Conant GC, Wolfe KH. 2008 Turning a hobby into a job: how duplicated genes find new functions. Nat. Rev. Genet. 9, 938–950. (doi:10.1038/nrg2482) 390. Soskine M, Tawfik DS. 2010 Mutational effects and the evolution of new protein functions. Nat. Rev. Genet. 11, 572 –582. (doi:10.1038/nrg2808) 391. Ugalde JA, Chang BSW, Matz MV. 2004 Evolution of coral pigments recreated. Science 305, 1433. (doi:10.1126/science.1099597) 392. Gutiérrez J, Maere S. 
2014 Modeling the evolution of molecular systems from a mechanistic perspective. Trends Plant Sci. 19, 292–303. (doi:10. 1016/j.tplants.2014.03.004) 393. Corbett-Detig RB, Zhou J, Clark AG, Hartl DL, Ayroles JF. 2013 Genetic incompatibilities are widespread within species. Nature 504, 135–137. (doi:10.1038/ nature12678) 394. Wells JA. 1990 Additivity of mutational effects in proteins. Biochemistry 29, 8509 –8517. (doi:10. 1021/bi00489a001) 395. Weinreich DM, Delaney NF, Depristo MA, Hartl DL. 2006 Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312, 111 –114. (doi:10.1126/science.1123539) 396. Ortlund EA, Bridgham JT, Redinbo MR, Thornton JW. 2007 Crystal structure of an ancient protein: rsif.royalsocietypublishing.org 362. Thorne JL, Goldman N, Jones DT. 1996 Combining protein evolution and secondary structure. Mol. Biol. Evol. 13, 666 –673. (doi:10.1093/oxfordjournals. molbev.a025627) 363. Goldman N, Thorne JL, Jones DT. 1998 Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 149, 445–458. 364. Massey SE. 2010 Pseudaptations and the emergence of beneficial traits. In Evolutionary biology— concepts, molecular and morphological evolution (ed. P Pontarotti), pp. 81– 98. Berlin, Germany: Springer. 365. Kitano H. 2004 Biological robustness. Nat. Rev. Genet. 5, 826–837. (doi:10.1038/nrg1471) 366. Masel J, Siegal ML. 2009 Robustness: mechanisms and consequences. Trends Genet. 25, 395 –403. (doi:10.1016/j.tig.2009.07.005) 367. Masel J, Trotter MV. 2010 Robustness and evolvability. Trends Genet. 26, 406–414. (doi:10. 1016/j.tig.2010.06.002) 368. Rorick MM, Wagner GP. 2011 Protein structural modularity and robustness are associated with evolvability. Genome Biol. Evol. 3, 456–475. (doi:10.1093/gbe/evr046) 369. Wagner A. 2008 Neutralism and selectionism: a network-based reconciliation. Nat. Rev. Genet. 9, 965–974. (doi:10.1038/nrg2473) 370. Wagner A. 
2008 Robustness and evolvability: a paradox resolved. Proc. R. Soc. B 275, 91– 100. (doi:10.1098/rspb.2007.1137) 371. Draghi JA, Parsons TL, Wagner GP, Plotkin JB. 2010 Mutational robustness can facilitate adaptation. Nature 463, 353 –355. (doi:10.1038/nature08694) 372. Bornberg-Bauer E, Kramer L. 2010 Robustness versus evolvability: a paradigm revisited. HFSP J. 4, 105–108. (doi:10.2976/1.3404403) 373. Nobeli I, Favia AD, Thornton JM. 2009 Protein promiscuity and its implications for biotechnology. Nat. Biotechnol. 27, 157 –167. (doi:10.1038/ nbt1519) 374. Babtie A, Tokuriki N, Hollfelder F. 2010 What makes an enzyme promiscuous? Curr. Opin. Chem. Biol. 14, 200–207. (doi:10.1016/j.cbpa.2009. 11.028) 375. Schreiber G, Keating AE. 2011 Protein binding specificity versus promiscuity. Curr. Opin. Struct. Biol. 21, 50 –61. (doi:10.1016/j.sbi.2010.10.002) 376. Bridgham JT, Carroll SM, Thornton JW. 2006 Evolution of hormone-receptor complexity by molecular exploitation. Science 312, 97 –101. (doi:10.1126/science.1123348) 377. Rebeiz M, Jikomes N, Kassner VA, Carroll SB. 2011 Evolutionary origin of a novel gene expression pattern through co-option of the latent activities of existing regulatory sequences. Proc. Natl Acad. Sci. USA 108, 10 036–10 043. (doi:10.1073/pnas. 1105937108) 378. Barve A, Wagner A. 2013 A latent capacity for evolutionary innovation through exaptation in metabolic systems. Nature 500, 203–206. (doi:10. 1038/nature12301) 379. Gould S, Vrba E. 1982 Exaptation-a missing term in the science of form. Paleobiology 8, 4–15. Downloaded from rsif.royalsocietypublishing.org on August 27, 2014 414. 415. 417. 418. 419. 420. 421. 422. 423. 433. 434. 435. 436. 437. 438. 439. 440. 441. protein sequence evolutionary divergence: local packing density versus solvent exposure. Mol. Biol. Evol. 31, 135– 139. (doi:10.1093/molbev/mst178) Huang T-T, del Valle Marcos ML, Hwang J-K, Echave J. 
2014 A mechanistic stress model of protein evolution accounts for site-specific evolutionary rates and their relationship with packing density and flexibility. BMC Evol. Biol. 14, 78. (doi:10.1186/ 1471-2148-14-78) Javier Zea D, Miguel Monzon A, Fornasari MS, Marino-Buslje C, Parisi G. 2013 Protein conformational diversity correlates with evolutionary rate. Mol. Biol. Evol. 30, 1500–1503. (doi:10.1093/ molbev/mst065) Juritz E, Palopoli N, Fornasari MS, Fernandez-Alberti S, Parisi G. 2013 Protein conformational diversity modulates sequence divergence. Mol. Biol. Evol. 30, 79– 87. (doi:10.1093/molbev/mss080) Firnberg E, Labonte JW, Gray JJ, Ostermeier M. 2014 A comprehensive, high-resolution map of a gene’s fitness landscape. Mol. Biol. Evol. 31, 1581 –1592. (doi:10.1093/molbev/msu081) Mariani V, Kiefer F, Schmidt T, Haas J, Schwede T. 2011 Assessment of template based protein structure predictions in CASP9. Proteins 79, 37 –58. (doi:10.1002/prot.23177) Moult J, Fidelis K, Kryshtafovych A, Tramontano A. 2011 Critical assessment of methods of protein structure prediction (CASP)–round IX. Proteins 79, 1–5. (doi:10.1002/prot.23200) Bloom JD, Glassman MJ. 2009 Inferring stabilizing mutations from protein phylogenies: application to influenza hemagglutinin. PLoS Comput. Biol. 5, e1000349. (doi:10.1371/journal. pcbi.1000349) McLaughlin RN, Poelwijk FJ, Raman A, Gosal WS, Ranganathan R. 2012 The spatial architecture of protein function and adaptation. Nature 491, 138–142. (doi:10.1038/nature11500) Morcos F, Schafer NP, Cheng RR, Onuchic JN, Wolynes PG. 2014 Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection. Proc. Natl Acad. Sci. USA (doi:10.1073/pnas.1413575111) 35 J. R. Soc. Interface 11: 20140419 416. 424. Konno A, Kitagawa A, Watanabe M, Ogawa T, Shirai T. 2011 Tracing protein evolution through ancestral structures of fish galectin. Structure 19, 711–721. (doi:10.1016/j.str.2011.02.014) 425. Perez-Jimenez R et al. 
2011 Single-molecule paleoenzymology probes the chemistry of resurrected enzymes. Nat. Struct. Mol. Biol. 18, 592 –596. (doi:10.1038/nsmb.2020) 426. Voordeckers K, Brown CA, Vanneste K, van der Zande E, Voet A, Maere S, Verstrepen KJ. 2012 Reconstruction of ancestral metabolic enzymes reveals molecular mechanisms underlying evolutionary innovation through gene duplication. PLoS Biol. 10, e1001446. (doi:10.1371/journal.pbio. 1001446) 427. Hobbs JK, Shepherd C, Saul DJ, Demetras NJ, Haaning S, Monk CR, Daniel RM, Arcus VL. 2012 On the origin and evolution of thermophily: reconstruction of functional precambrian enzymes from ancestors of Bacillus. Mol. Biol. Evol. 29, 825 –835. (doi:10.1093/molbev/msr253) 428. Risso VA, Gavira JA, Mejia-Carmona DF, Gaucher EA, Sanchez-Ruiz JM. 2013 Hyperstability and substrate promiscuity in laboratory resurrections of Precambrian b-lactamases. J. Am. Chem. Soc. 135, 2899 –2902. (doi:10.1021/ja311630a) 429. Harms MJ, Eick GN, Goswami D, Colucci JK, Griffin PR, Ortlund EA, Thornton JW. 2013 Biophysical mechanisms for large-effect mutations in the evolution of steroid hormone receptors. Proc. Natl Acad. Sci. USA 110, 11 475–11 480. (doi:10.1073/ pnas.1303930110) 430. Bridgham JT, Ortlund EA, Thornton JW. 2009 An epistatic ratchet constrains the direction of glucocorticoid receptor evolution. Nature 461, 515 –519. (doi:10.1038/nature08249) 431. Friedland GD, Lakomek N-A, Griesinger C, Meiler J, Kortemme T. 2009 A correspondence between solution-state dynamics of an individual protein and the sequence and conformational diversity of its family. PLoS Comput. Biol. 5, e1000393. (doi:10. 1371/journal.pcbi.1000393) 432. Yeh S-W, Liu J-W, Yu S-H, Shih C-H, Hwang J-K, Echave J. 2014 Site-specific structural constraints on rsif.royalsocietypublishing.org 413. on coding-sequence evolution. Cell 134, 341–352. (doi:10.1016/j.cell.2008.05.042) Yang J-R, Zhuang S-M, Zhang J. 
2010 Impact of translational error-induced and error-free misfolding on the rate of protein evolution. Mol. Syst. Biol. 6, 421. (doi:10.1038/msb.2010.78) Serohijos AWR, Rimas Z, Shakhnovich EI. 2012 Protein biophysics explains why highly abundant proteins evolve slowly. Cell Rep. 2, 249 –256. (doi:10.1016/j.celrep.2012.06.022) Drummond DA, Wilke CO. 2009 The evolutionary consequences of erroneous protein synthesis. Nat. Rev. Genet. 10, 715–724. (doi:10.1038/nrg2662) Papp B, Notebaart RA, Pál C. 2011 Systems-biology approaches for predicting genomic evolution. Nat. Rev. Genet. 12, 591–602. (doi:10.1038/nrg3033) Payne JL, Wagner A. 2014 The robustness and evolvability of transcription factor binding sites. Science 343, 875–877. (doi:10.1126/science. 1249046) McCloskey D, Palsson BØ, Feist AM. 2013 Basic and applied uses of genome-scale metabolic network reconstructions of Escherichia coli. Mol. Syst. Biol. 9, 661. (doi:10.1038/msb.2013.18) Pál C, Papp B, Lercher MJ, Csermely P, Oliver SG, Hurst LD. 2006 Chance and necessity in the evolution of minimal metabolic networks. Nature 440, 667–670. (doi:10.1038/nature04568) Wagner A. 2005 Distributed robustness versus redundancy as causes of mutational robustness. BioEssays 27, 176– 188. (doi:10.1002/bies.20170) Nam H, Lewis NE, Lerman JA, Lee D-H, Chang RL, Kim D, Palsson BO. 2012 Network context and selection in the evolution to enzyme specificity. Science 337, 1101 –1104. (doi:10.1126/science. 1216861) Varma A, Palsson BO. 1994 Metabolic flux balancing: basic concepts, scientific and practical use. Nat. Biotechnol. 12, 994–998. (doi:10.1038/ nbt1094-994) Barve A, Rodrigues JFM, Wagner A. 2012 Superessential reactions in metabolic networks. Proc. Natl Acad. Sci. USA 109, E1121 –E1130. (doi:10.1073/pnas.1113065109)