doi:10.1016/j.jmb.2006.07.092
J. Mol. Biol. (2006) 362, 1004–1024
Mis-translation of a Computationally Designed Protein
Yields an Exceptionally Stable Homodimer: Implications
for Protein Engineering and Evolution
Gautam Dantas 1, Alexander L. Watters 2, Bradley M. Lunde 3,
Ziad M. Eletr 7, Nancy G. Isern 8, Toby Roseman 4, Jan Lipfert 9,
Sebastian Doniach 9, Martin Tompa 4, Brian Kuhlman 7,
Barry L. Stoddard 10, Gabriele Varani 1,5 and David Baker 1,6 ⁎
1 Department of Biochemistry, University of Washington, Seattle 98195, USA
2 Department of Molecular and Cellular Biology, University of Washington, Seattle 98195, USA
3 Bio-Molecular Structure and Design Program, University of Washington, Seattle 98195, USA
4 Department of Computer Science and Engineering, University of Washington, Seattle 98195, USA
5 Department of Chemistry, University of Washington, Seattle 98195, USA
6 Howard Hughes Medical Institute, University of Washington, Seattle 98195, USA
We recently used computational protein design to create an extremely
stable, globular protein, Top7, with a sequence and fold not observed
previously in nature. Since Top7 was created in the absence of genetic
selection, it provides a rare opportunity to investigate aspects of the cellular
protein production and surveillance machinery that are subject to natural
selection. Here we show that a portion of the Top7 protein corresponding to
the final 49 C-terminal residues is efficiently mis-translated and accumulates at high levels in Escherichia coli. We used circular dichroism, size-exclusion chromatography, small-angle X-ray scattering, analytical ultracentrifugation, and NMR spectroscopy to show that the resulting
C-terminal fragment (CFr) protein adopts a compact, extremely stable,
homo-dimeric structure. Based on the solution structure, we engineered an
even more stable variant of CFr by disulfide-induced covalent circularisation that should be an excellent platform for design of novel functions. The
accumulation of high levels of CFr exposes the high error rate of the protein
translation machinery. The rarity of correspondingly stable fragments in
natural proteins coupled with the observation that high quality ribosome
binding sites are found to occur within E. coli protein-coding regions
significantly less often than expected by random chance implies a stringent
evolutionary pressure against protein sub-fragments that can independently fold into stable structures. The symmetric self-association between
two identical mis-translated CFr sub-domains to generate an extremely
stable structure parallels a mechanism for natural protein-fold evolution by
modular recombination of protein sub-structures.
© 2006 Elsevier Ltd. All rights reserved.
7 Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, NC 27599, USA
*Corresponding author
Keywords: mistranslation; protein-fold evolution; protein sub-fragments;
NMR structure; protein engineering
Present address: G. Dantas, Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA
02115, USA.
Abbreviations used: AUC, analytical ultra-centrifugation; D2O, deuterium oxide; ESI, electrospray-ionization; MS,
mass spectroscopy; CFr, C-terminal fragment; GuHCl, guanidinium hydrochloride; HSQC, heteronuclear
single-quantum coherence; NaPi, sodium phosphate; NOE(SY), nuclear Overhauser effect (spectroscopy); MALDI-TOF,
matrix-assisted laser desorption ionization - time of flight; Rg, radius of gyration; RMSD, root-mean-squared deviation;
SASA, solvent accessible surface area; SAXS, small-angle X-ray scattering; SD, Shine–Dalgarno; TOCSY, total correlation
spectroscopy.
E-mail address of the corresponding author: dabaker@u.washington.edu
0022-2836/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
CFr: Super-Stable Sub-fragment of Top7
8 EMSL High Field Molecular Resonance Facility, PNNL, Richland, WA 99352, USA
9 Department of Physics, Stanford University, Stanford, CA 94305, USA
10 Division of Basic Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
Introduction
The last decade has seen tremendous advances in
the field of computational protein design. In silico
protein sequence and structure optimisation algorithms have been successfully applied to completely
redesign and thermodynamically stabilise naturally
occurring protein structures,1,2 to create novel3 and
thermodynamically stabilised enzymes,4 to redesign
protein–protein5,6 and protein–ligand7 interactions
and to create extremely stable new protein structures.8,9 Structural validation in many cases has
confirmed the high-resolution accuracy of the
design.1,4–6,8–10 The accurate identification of extremely low energy regions of the protein sequence
structure landscape is further validated by the
finding that these designed proteins often achieve
thermodynamic stabilities greater than those
reported for any naturally occurring proteins.2,9
An obvious application of these exceptionally
stable proteins is the generation of longer-lasting
designer proteins and therapeutics.11 However,
while exceptional protein stability would have
advantages in resistance to proteolysis and unfolding, there may also be biological costs once these
proteins are expressed or delivered in the cell. It is
therefore of considerable interest to investigate how
computationally designed proteins are handled by
the cellular protein production and surveillance
machinery.
Translation processes often lead to faulty protein
products, due to inappropriate translation initiation,
ribosomal processivity errors, or missense errors
where the mRNA transcript is erroneously
decoded.12–15 The overwhelming majority of these
mis-translated proteins fail to assume native-like
conformations, and are cleared from the cell by
post-translational processes that involve a functional cooperation between molecular chaperones
assisting in folding and the proteasome system.15–17
Aberrant protein translation products that fold into
stable substructures can evade cellular surveillance
mechanisms and their subsequent accumulation can
significantly damage or kill cells.18–21 These phenomena are implicated in the pathology of a large
number of diseases, including diabetes, cancer, and
many neurodegenerative disorders.22–24 Since
exceptionally stable computationally designed proteins are created in the absence of specific evolutionary pressure, they provide a rare opportunity to
reveal aspects of the cellular protein production and
surveillance machinery that are subject to natural
selection.
Figure 1. Mis-translation of Top7. (a) Coomassie-stained SDS-PAGE gel of Top7 protein variants ATG1ATT,
wild-type and GTG48GTT (lanes 1–3, respectively). (b)
ESI-MS spectrum of Top7_ATG1ATT. (c) Top7 protein (top
lines) and DNA (bottom lines) sequence, with primary
and alternate initiation codons highlighted (colours
match the peaks from (b)). Degenerate Shine–Dalgarno
sequences are highlighted in red.
We recently generated an extremely stable, small,
globular protein, called Top7, with a sequence
and fold not observed previously in nature, using
purely computational techniques.9 Biophysical and
structural analysis of Top7 demonstrated the high-resolution accuracy of our design. Here we show
that a portion of the Top7 protein corresponding to
the final 49 C-terminal residues is efficiently mis-translated in Escherichia coli. The solution structure
of the resulting C-terminal fragment (CFr) protein
reveals a compact, stable, homo-dimeric structure.
Further stabilisation of CFr by disulfide-induced
covalent circularisation yields a super-stable miniature protein that can serve as a robust scaffold for
further protein engineering. The rarity of correspondingly stable fragments in natural proteins
suggests evolution selects against protein fragments
that can form stably folded structures.
Results
During the purification of the computationally
designed Top7 protein, a strong band corresponding
to a molecular mass of ∼6.5 kDa was consistently
observed on SDS-PAGE gels. This band was
observed in addition to the Top7 band (∼12.5 kDa)
and remained even after Ni+ affinity chromatography (Figure 1(a), lane 2). A subsequent anion-exchange purification step, however, was sufficient
to isolate only the full-length Top7 as observed on
SDS-PAGE and further confirmed by electrospray-ionization mass spectroscopy (ESI-MS), thereby
allowing complete biophysical and structural characterisation of the pure Top7 protein.9 In order to
study the kinetic folding landscape of Top7, it
nonetheless became clear that many mutant variants
of the protein would need to be generated, and
hence a practical interest arose in identifying and
removing the lower molecular mass band. Since this
smaller protein was retained in high yield following
the Ni+ affinity purification step, it was most likely a
fragment of full-length Top7 that contained the C-terminal 6xHis tag and was either a product of
proteolytic cleavage or of mis-translation.
Proteolysis or mis-translation?
To investigate the possibility that the Top7 sub-fragment was a proteolytic product, Top7 bacterial
cell lysates were incubated at room temperature for
up to three days in the presence and absence of
protease inhibitors. Full-length Top7 was observed
by SDS-PAGE in the supernatant fraction at relatively equal concentrations at all incubation times.
Surprisingly, the ∼6.5 kDa Top7 sub-fragment band
was also observed in all supernatant fractions, also
at relatively equal concentrations at all incubation
times (data not shown). Since no appreciable degradation of Top7 was observed in vitro under
conditions where many natural proteins show significant degradation,25 and no enrichment of
the sub-fragment was observed with increasing
incubation time, it seemed unlikely that the sub-fragment was generated by Top7 proteolysis.
Matrix-assisted laser desorption ionization - time
of flight (MALDI-TOF)-MS analysis of Ni+-affinity
purified Top7 confirmed that a species of ∼6613 Da
was present in addition to full-length protein (data
not shown). The predicted molecular mass corresponded to a product ∼30 Da larger than a
polypeptide starting at Val48 and ∼120 Da smaller
than a polypeptide starting at Arg47. The sub-fragment was subsequently isolated from full-length Top7 by anion-exchange chromatography
and analysed by N-terminal MS sequencing. The
first six residues were found to be Met-Arg-Ile-Ser-Ile-Thr, corresponding to a Met followed by the
sequence Arg49 to Thr53 of Top7. Methionine is
∼30 Da larger than valine and hence a Top7 fragment starting with a Val48Met mutation matches
the MALDI-TOF-MS predicted molecular mass.
Since the plasmid coding for full-length Top7 did
not contain this internal mutation, these results
Table 1. Statistics of ribosome binding sites in E. coli protein-coding regions
Threshold a            S b            C c       μ d       σ d    z-score e
1000th                 3.18 × 10⁻⁴    900       1238      36     −9.4
1500th                 1.13 × 10⁻⁴    2507      3311      59     −13.6
2000th                 4.29 × 10⁻⁵    5589      7144      88     −17.7
2500th                 2.02 × 10⁻⁵    9830      11,944    112    −18.9
CCAGGTTcaaGTG          2.96 × 10⁻⁶    26,790    29,931    189    −16.6
GAAGGTTTTG             2.99 × 10⁻⁷    51,583    56,234    267    −17.4
ACAGGGGgctaaacgcGTG    2.68 × 10⁻⁵    8069      9972      98     −19.3
a The first four rows correspond to the 1000th, 1500th, 2000th, and 2500th best-scoring E. coli upstream ribosome binding sites,
respectively. The last three rows correspond to the alternative translation initiation sites within Top7 (Figure 1(c)), as they appear in 5′ to
3′ order. The Shine–Dalgarno sequence is shown in upper case and the start codon in bold.
b The score, S, is a product of the probability at each position in a putative ribosomal binding site, using the model described in
Supplementary Data, Table S1.
c C is the number of sites within real protein-coding regions with scores at least S.
d The mean, μ, and standard deviation, σ, of the number of sites found in 300 random shufflings of the protein-coding regions,
respectively, with scores at least S.
e The z-score is defined as the number of standard deviations separating C and μ.
suggested that the sub-fragment might be a
product of mis-translation of the Top7 mRNA
starting at amino acid position 48.
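The mass arithmetic behind this assignment can be checked with a short sketch. It assumes standard average residue masses (these values are not given in the text) and uses only the ∼6613 Da figure reported above; all variable names are illustrative.

```python
# Average residue masses in Da (standard textbook values; an assumption,
# not numbers taken from the paper).
RESIDUE_MASS = {"Val": 99.13, "Met": 131.19, "Arg": 156.19}

OBSERVED = 6613.0  # approximate MALDI-TOF-MS mass of the sub-fragment (Da)

# Replacing the first residue, Val48, by Met adds the Met-Val mass difference.
met_minus_val = RESIDUE_MASS["Met"] - RESIDUE_MASS["Val"]

# If the observed species is the Val48Met fragment, a fragment starting at
# Val48 would be lighter by that difference, and one starting at Arg47 would
# carry one extra Arg residue on top of the Val48-start fragment.
val48_start = OBSERVED - met_minus_val
arg47_start = val48_start + RESIDUE_MASS["Arg"]

print(f"Met - Val difference:              {met_minus_val:.1f} Da")
print(f"Observed minus Val48-start mass:   {OBSERVED - val48_start:.1f} Da")
print(f"Arg47-start mass minus observed:   {arg47_start - OBSERVED:.1f} Da")
```

Under these assumptions the two offsets come out close to the ∼30 Da and ∼120 Da differences quoted above, which is what one would expect for a fragment that begins with Met in place of Val48.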
In prokaryotes, two key sequence features guide
the ribosome to initiate translation from a specific
location on mRNA: an AUG or AUG-cognate initiation codon, and the five to nine nucleotide ribosomal
binding sequence (Shine–Dalgarno (SD) sequence)
found three to 13 nucleotides upstream of the
initiation codon.13 The Val48 codon in the Top7
gene sequence is GTG. While >90% of E. coli translation is initiated at ATG, a small fraction of
translation initiation occurs at GTG (8%), TTG
(1%), and in one known case at ATT.26 Could the
Val(GTG)48 be the site (and cause) of mis-translation? To test this idea, we generated two single point
mutants of the Top7 gene: a silent codon change from
GTG to GTT at Val48 (GTG48GTT), and an N-terminal codon change from ATG to ATT to
substitute the N-terminal Met with Ile (ATG1ATT).
Since GTT has never been observed as a translation
initiation codon, mis-translation from Val48 should
be abrogated in this context, allowing translation of
only the full-length product. The ATT variant at
position one should disrupt translation of full-length
Top7, but should not affect translation of the sub-fragment. Each of these variants was expressed, Ni+
affinity purified, and visualised with SDS-PAGE
(Figure 1(a)). The GTG48GTT variant shows no
observable expression of the ∼6.5 kDa sub-fragment
band (lane 3). The ATG1ATT variant shows significant reduction of the full-length Top7, but expression of the sub-fragment was essentially
unaffected (lane 1). These variants were further analysed by ESI-MS, which confirmed the SDS-PAGE
results (Figure 1(b)). However, the MS results for
ATG1ATT also suggested that at least two other
minor species of intermediate molecular mass
between full-length Top7 and the ∼6.5 kDa sub-fragment were present in the preparation. The
predicted molecular masses for these two species
match well to Top7 fragments beginning at Val8
(GTG) and Leu33 (TTG), both of which are coded for
by potential alternate initiation codons (Figure 1(c)).
In fact, zooming in on the 6–15 kDa region in the
SDS-PAGE gels after increased protein staining also
showed the presence of faint bands between Top7
and the ∼6.5 kDa fragment. Analysis of the Top7
gene sequence revealed that degenerate versions of
the E. coli ribosomal binding site (Shine–Dalgarno,
SD) sequence are also present just upstream of all
three identified Top7 mis-translation sites, and might
also contribute to mis-translation (Figure 1(c)). To
test whether the SD sequence was critical for mis-translation of the sub-fragment starting at Val48, we
generated another point mutant of the wild-type
gene that changes codon 44 from GGG to TCT, which
should disrupt the putative SD sequence (ACAGGGG to ACAGTCT) without changing the Val(GTG)48 initiation codon. The GGG44TCT variant was
identical to the GTG48GTT variant in the observed
ablation of the sub-fragment and no observed effect
on translation of the full-length Top7 (SDS-PAGE
and ESI-MS, data not shown). This result indicates
that both translation initiation features are critical for
the efficient mis-translation of the ∼6.5 kDa fragment
of Top7.
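The two-feature requirement described above (a permissive start codon plus an upstream SD-like element) can be illustrated with a toy scan. This is a minimal sketch, not the scoring model used in the paper (which is described in Supplementary Data, Table S1): the `SD_CORE` motif, the gap window and the function name are assumptions made for illustration. The wild-type sequence is the Val48 site as listed in Table 1, and the two mutants follow the GGG44TCT and GTG48GTT changes described in the text.

```python
START_CODONS = ("ATG", "GTG", "TTG")
SD_CORE = "AGG"  # assumption: a minimal Shine-Dalgarno-like core, for illustration only


def candidate_starts(dna, min_gap=3, max_gap=13):
    """Return (position, codon, gap) for each start codon that has an SD-like
    core ending `gap` nucleotides upstream, with min_gap <= gap <= max_gap."""
    dna = dna.upper()
    hits = []
    for pos in range(len(dna) - 2):
        codon = dna[pos:pos + 3]
        if codon not in START_CODONS:
            continue
        for gap in range(min_gap, max_gap + 1):
            core_start = pos - gap - len(SD_CORE)
            if core_start < 0:
                break  # ran off the 5' end of the sequence
            if dna[core_start:core_start + len(SD_CORE)] == SD_CORE:
                hits.append((pos, codon, gap))
                break  # report the closest core only
    return hits


WILD_TYPE = "ACAGGGGgctaaacgcGTG"     # Val48 site as written in Table 1
SD_MUTANT = "ACAGTCTgctaaacgcGTG"     # GGG44TCT: SD-like element disrupted
START_MUTANT = "ACAGGGGgctaaacgcGTT"  # GTG48GTT: start codon removed

for name, seq in [("wild-type", WILD_TYPE),
                  ("GGG44TCT", SD_MUTANT),
                  ("GTG48GTT", START_MUTANT)]:
    print(name, candidate_starts(seq))
```

In this toy model only the wild-type sequence yields a candidate site, mirroring the observation that removing either feature ablates the sub-fragment. A real analysis would use a position-specific scoring model rather than an exact-match core.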
Figure 2. Biophysical characterisation of CFr and SS.CFr. (a) The far-ultraviolet (UV) CD spectrum of 25 μM
CFr, 25 μM SS.CFr and 20 μM Top7 in 25 mM Tris–HCl
(pH 8.0) at varying temperatures and concentrations of
GuHCl. (b) CD signal at 220 nm as a function of
temperature and GuHCl concentration for 12 μM CFr in
25 mM Tris–HCl (pH 8.0). (c) CD signal at 220 nm as a
function of GuHCl concentration for multiple concentrations of CFr, SS.CFr, and Top7 in 25 mM Tris–HCl (pH 8.0)
at 25 °C.
If evolution has selected against corresponding
mis-translations in natural genes, one would expect to observe a reduced frequency of translation
initiation sequence features within the coding region
of natural genes when compared to the frequency
expected by random chance. Accordingly, we have
computed the frequency of initiation codons in the
context of an SD sequence within the 4237 annotated
protein-coding regions of the E. coli genome, and
compared them to the expected frequency if the
codons were randomly permuted. Table 1 shows the
results of this comparison for seven different thresholds that might reasonably be used to define what
constitutes a “high-scoring” ribosomal binding site.
The first four of these thresholds correspond to the
1000th, 1500th, 2000th, and 2500th best-scoring
upstream ribosome binding sites, respectively,
from the 2912 annotated E. coli genes which have
at least 20 bp of non-coding DNA upstream of their
start codons. The last three rows correspond to the
scores of the alternative translation initiation sites
within the Top7 gene (Figure 1(c)), as they appear in
5′ to 3′ order. For all seven thresholds S of Table 1,
the number of observed instances within the real
E. coli protein-coding regions with scores at least S
(shown in column 3) is far below its expectation in
randomly shuffled coding regions (shown in column 4). A standard measure of this difference is the
z-score (shown in column 6), which is the number of
standard deviations by which the observed and
expected values differ. These results support the
theory that evolution has selected against genetic
features that would allow for mis-translation of
protein sub-fragments.
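The z-scores can be recomputed directly from the counts reported in Table 1. The sketch below simply applies the definition given in the table footnote, z = (C − μ)/σ, to the tabulated values; the data structure and function name are illustrative.

```python
# (threshold label, observed count C, shuffled mean mu, shuffled s.d. sigma),
# copied from Table 1.
TABLE_1 = [
    ("1000th", 900, 1238, 36),
    ("1500th", 2507, 3311, 59),
    ("2000th", 5589, 7144, 88),
    ("2500th", 9830, 11944, 112),
    ("CCAGGTTcaaGTG", 26790, 29931, 189),
    ("GAAGGTTTTG", 51583, 56234, 267),
    ("ACAGGGGgctaaacgcGTG", 8069, 9972, 98),
]


def z_score(observed, mean, sd):
    """Number of standard deviations separating the observed count from the mean."""
    return (observed - mean) / sd


for label, c, mu, sigma in TABLE_1:
    print(f"{label:>20}: z = {z_score(c, mu, sigma):6.1f}")
```

Recomputed this way, the first six rows match the tabulated z-scores to one decimal place; the last row comes out slightly more negative than the tabulated −19.3, presumably because the published μ and σ are themselves rounded.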
Biophysical characterisation of CFr
The sequence of the ∼6.5 kDa fragment of Top7
begins at a boundary between secondary structure
elements in the Top7 structure and includes strands
3, 4 and 5, as well as helix 2 of Top7. This fragment is
translated at high levels, is expressed in the soluble
fraction, does not aggregate significantly, and is as
resistant to cellular proteases as Top7. These results
strongly suggest that this fragment has intrinsic
stability and structure. For further analysis, a separate gene construct that codes for the ∼6.5 kDa
C-terminal fragment (CFr) of Top7 was made as
described in Materials and Methods. Like Top7,
the CFr protein can be obtained with high yield
(25 mg/l) and purity (>99%) from the soluble
fraction of the bacterial lysate. ESI-MS confirmed
that a full-length protein of 7036 Da was isolated;
this mass is within 0.1 Da of its theoretical molecular
mass (Supplementary Data, Figure S1A).
Circular dichroism spectra strongly suggest that
CFr is folded with α/β secondary structure, comparable in relative composition to Top7 (Figure 2(a)).
CFr secondary structure appears unchanged at 98 °C
or in 3 M guanidine-hydrochloride (GuHCl), but the
CD spectrum of the protein is consistent with an
unfolded polypeptide at 7 M GuHCl. In the presence
of intermediate GuHCl concentrations (4.3 M), CFr
unfolds cooperatively with temperature (Figure
2(b)), displaying remarkably high thermal stability,
comparable to Top7. CFr also displays co-operative
unfolding by GuHCl-induced chemical denaturation
(Figure 2(c)). However, unlike Top7, CFr appears to
be more stable with increasing protein concentration.
These concentration dependent effects are generally
indicative of the presence of quaternary structure
during the unfolding transition. This was confirmed
by gel filtration analysis of CFr at 25 μM and 1.2 mM;
the protein resolves as a single peak with a molecular mass corresponding to a CFr dimer (data not shown).
Figure 3. Small-angle X-ray scattering (SAXS) profiles of CFr and SS.CFr. Kratky plots (s²·I versus s) for (a) CFr and (b)
SS.CFr as a function of GuHCl concentration are, from bottom to top, 1 M (blue), 2 M (green), 3 M (red), 4 M (light blue),
5 M (purple), 6 M (light brown), 6.5 M (black), and 7 M (blue) GuHCl for both CFr and SS.CFr. The last profile for CFr is at
8 M (green) GuHCl and the last two profiles for SS.CFr are at 7.5 M (green) and 8 M (red) GuHCl. Profiles are vertically
offset for clarity. ((b), inset) Superimposed profiles for CFr (black) and SS.CFr (red) at 1 M GuHCl.
For a more robust characterisation of its
unfolding behaviour and oligomeric state, CFr was
analysed by small-angle X-ray scattering (SAXS) and
analytical ultra-centrifugation (AUC). SAXS profiles
of 2 mM CFr exhibit a single peak characteristic of a
folded protein up to 5 M GuHCl, whereas the
profiles at 7 M and 8 M GuHCl are indicative of a
completely unfolded protein (Figure 3(a)). AUC
scans of 35 μM–97 μM CFr show the protein to be
dimeric at 0 M and 4 M GuHCl (where it appears
folded by CD and SAXS), and monomeric at 7 M
GuHCl (where it appears unfolded by CD and SAXS)
(Figure 4(a) and (c)). These results suggest that CFr is
an obligate dimer; the folded monomer is essentially
never populated and the denaturation may be
represented as an equilibrium transition between
folded dimer and unfolded monomer. If this model is
correct, the analysis of unfolding curves at different
protein concentrations should result in similar
values for ΔG° or Kd (see Materials and Methods
for a description of this fitting procedure). Indeed,
the ΔG° fit values are the same within experimental
error for the different folding experiments: 26.4 kcal/
mol (108 μM CFr), 25.5 kcal/mol (62 μM CFr), and
25.5 kcal/mol (5 μM CFr), confirming that CFr exists
as an obligate dimer. A ΔG° value of 25.5 kcal/mol
corresponds to a dissociation constant (Kd) of ∼200
zeptoM (10⁻²¹ M).
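The relation between ΔG° and Kd quoted above, and the protein-concentration dependence expected for a folded dimer in equilibrium with unfolded monomers, can be sketched as follows. This is not the fitting procedure of Materials and Methods; it assumes the standard relation Kd = exp(−ΔG°/RT), a temperature of 25 °C, and a simple two-state model N2 ⇌ 2U. Function names are illustrative.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def kd_from_dg(dg_kcal_per_mol, temp_k=298.15):
    """Equilibrium constant (M) for N2 <-> 2U from the unfolding free energy.
    The 25 C default temperature is an assumption."""
    return math.exp(-dg_kcal_per_mol / (R_KCAL * temp_k))


def fraction_unfolded(kd, total_monomer_conc):
    """Fraction of chains unfolded for N2 <-> 2U.

    With Kd = [U]^2 / [N2] and Pt = 2[N2] + [U] (monomer units, in M),
    the unfolded fraction f = [U]/Pt satisfies 2*Pt*f^2 + Kd*f - Kd = 0.
    """
    pt = total_monomer_conc
    return (-kd + math.sqrt(kd * kd + 8.0 * kd * pt)) / (4.0 * pt)


kd_water = kd_from_dg(25.5)
print(f"Kd for 25.5 kcal/mol: {kd_water:.1e} M")

# Concentration dependence at a fixed, purely illustrative Kd of 1 uM:
for conc in (5e-6, 62e-6, 108e-6):
    print(f"{conc * 1e6:5.0f} uM protein: fraction unfolded = "
          f"{fraction_unfolded(1e-6, conc):.2f}")
```

For a fixed equilibrium constant the unfolded fraction falls as total protein concentration rises, which is the behaviour that distinguishes a dimeric two-state transition from a monomeric one.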
Figure 4. Analytical ultra-centrifugation (AUC) studies of CFr and SS.CFr. Selected equilibrium sedimentation
profiles for (a) CFr and (b) SS.CFr collected at 30,000 rpm, 20 °C at protein concentrations of 59 μM–66 μM in solvent
containing 4 M (black circles) or 7 M (red circles) GuHCl. The fitted weight-averaged molecular mass (Mr) was
determined using a global fit to nine equilibrium scans collected at three protein concentrations and three speeds (see
Materials and Methods). (c) Fitted Mr versus concentration of GuHCl plot. Fitted Mr values were determined as
described above for CFr (black circles) and SS.CFr (red circles) at varying concentrations of denaturant. Horizontal lines
represent predicted monomer/dimer molecular masses for CFr, 7,037/14,074 (black, broken), and SS.CFr, 7,241/14,482
(red, dotted-dashed).
1D 1H spectra and 2D 1H-15N heteronuclear
single-quantum coherence (HSQC) spectra of CFr
exhibit the features of a rigid well-folded protein
(Figure 5), with well-dispersed and sharp peaks.
Notably, the HSQC spectrum contains a single set of
cross-peaks for each NH in the protein. Since CFr is a
dimer, this result implies fully symmetric association. Solution structures of symmetric protein
dimers are difficult to determine using conventional
nuclear-Overhauser effect (NOE)-guided NMR techniques, because it is very difficult to distinguish
between intra and inter-subunit NOEs. We employed asymmetric isotope labelling of the protein,
in combination with isotope editing techniques27,28
to resolve intra-subunit NOEs from inter-subunit
NOEs in CFr and determine the symmetric homodimer solution structure.
Determination of the NMR structure of CFr
Protein backbone and side-chain assignments
were obtained as described in the Materials and
Methods. Structure determination was conducted in
a two-step process, a fully automated iterative step
dominated by NOE-derived distance constraints for
generating models of a single subunit of CFr (CFrA),
followed by a partly automated iterative step for
building the symmetric homo-dimer model using
manually assigned interfacial-NOE constraints. In
the final calculation 100 structures were generated,
of which the top 20 (Figure 6) had an average target
function of 1.20(±0.11) Å² (Table 2) and an ensemble
RMSD value of 0.33(±0.10) Å over backbone atoms
and 0.75(±0.09) Å over heavy-atoms in residues 3
through 51 in both subunits (Table 3). There were no
distance constraints violated by more than 0.1 Å and
no angle constraints violated by more than 1°. When
the ensemble was analysed with ProcheckNMR,29
99.2% of all dihedral angles were found in the
allowed regions of the Ramachandran plot (Table 3).
The small number of disallowed dihedral angles are
all found for residues in the linker region (Glu2 and
Gly52–His58).
Figure 5. 1H-15N HSQC spectrum of CFr. The HSQC
spectrum of ∼1 mM 15N-CFr in 50 mM phosphate buffer
(pH 7.0), recorded at 298 K and 500 MHz.81 Peaks
are labelled with the one-letter amino acid code and sequence number.
CFr structure
Each of the two subunits of the CFr dimer adopts
the same fold observed for the corresponding
sequence in Top7, one helix packed on a three-stranded, antiparallel β-sheet (Figure 7; Top7 in
purple, CFrA in green). The subunits form a
symmetric antiparallel dimer, with all interfacial
residues contributed by the first strand of the β-sheet and by the helix (Figure 8). The two subunits
have virtually identical structures with an RMSD
value of 0.41 Å over backbone atoms and 0.81 Å over
all atoms (best NMR model, residues 3–51). Each
subunit is also extremely similar to the corresponding
portion of the Top7 crystal structure with an average
backbone RMSD value of 1.12 Å (Figure 7). These
deviations are as likely to reflect inaccuracies in the
models as genuine structural differences. The largest
deviation is in the hairpin between the second and
third strand of the β-sheet (Asp40-Gly41-Asp42 in
CFr); ignoring these residues improves the Top7 to
CFr backbone RMSD value to 0.91 Å. The backbone
NH of Gly41 is the only amide not observed in the
HSQC spectrum, suggesting this loop is flexible in
solution. Significantly, it is also not visible in the
HSQC spectrum of the Top7 protein (data not shown).
The CFr dimer interface buries a total of 1457 Å² of
solvent-accessible surface area (SASA), which
accounts for about 19% of the surface of each
subunit (Figure 8(a); interface carbon atoms in
green and yellow). Ten residues on the β-sheet and
ten on the helix (Figure 8(b); green or yellow
cartoons and sticks) contribute to the CFr interface,
and interestingly, these residues are buried to a very
similar extent in the Top7 structure (data not
shown). The CFr dimer interface is an extension of
the individual CFr subunit hydrophobic cores; the
strands of the two subunits form an extended six-stranded antiparallel β-sheet, stabilised by backbone hydrogen bonds across the interface between
the first strands of both subunits. Of particular note
is a pair of strong symmetric inter-subunit hydrogen
bonds formed between the backbone NH of Ser7
on one subunit and backbone carbonyl of Ser7 on
the other subunit: this NH remains very strongly
protected after prolonged D2O exchange. The tight
packing observed between buried β-sheet residues
interacting across the dimer interface (Val4, Ile6, Ile8,
and Ala10 in both subunits) appears identical to the
inter-strand side-chain packing observed within
each subunit (this “continuous sheet core” is
illustrated in Figure 8(c); AB_00_SHEET). Similar
tight packing is also observed between helical side-chains interacting across the interface (Figure 8(c);
AB_00_HELIX). Two symmetric aromatic clusters
are formed between Phe19 on one subunit and
Phe27 and Tyr32 on the other subunit, where the
edge of the Phe19 aromatic ring stacks against the
faces of the other two aromatics. Another strong
interaction is a set of symmetric hydrogen bonds
between the hydroxyl of Tyr32 on one subunit and
the carboxyl moiety of Glu15 on the other subunit,
which form an interfacial stitch at the helical caps.
Figure 6. NMR-generated structures of CFr. The top 20 NMR models from the final CFr structure calculation are
shown as ribbons. Each model is superimposed on the average backbone coordinates for residues 3–51 (structured region,
separate colour for each model) in both chains from the entire ensemble. The structured regions have an ensemble RMSD
value of 0.33 Å over the backbone atoms and 0.75 Å over all heavy atoms. Residues from the unstructured tails (52–58) are
coloured in grey.
Table 2. NMR experimental constraints for CFr (residues 2–58)
A. Monomer calculation
Unique NOE distance constraints (first round/final round) a
  Total                                        1915/1116
  Intra-residue and Sequential ([i–j] ≤ 1)     1312/645
  Medium-range (1 ≤ [i–j] ≤ 5)                 513/198
  Long-range ([i–j] ≥ 5)                       90/273
Dihedral angle constraints b                   76
Hydrogen bond constraints                      16 (8 H-bonds)
Total number of constraints                    1208
Number of constraints per residue              21.2
Long-range constraints per residue             4.8
B. Dimer calculation c
Intra-subunit NOE constraints                  2232
Inter-subunit NOE constraints                  46
Dihedral angle constraints b                   130
Hydrogen bond constraints                      36 (18 H-bonds)
Total number of constraints                    2444
Residual constraint violations b
  Distance violations, 0.2–0.5 Å               0
  Distance violations, >0.5 Å                  0
  Van der Waals violations, 0.2–0.5 Å          2
  Van der Waals violations, >0.5 Å             0
  Max. violation (Å)                           0.33
  Dihedral angle violations, 1–10°             0
  Dihedral angle violations, >10°              0
CYANA target function (first round/final round)
  NOEASSIGN monomer calculation                107.7 Å²/5.2 Å²
  Final dimer calculation                      -/1.2 Å²
a First and final round refer to statistics from the NOEASSIGN
macro in Cyana2.0.
b Dihedral angle constraints were generated from TALOS.66
c All dimer restraints and violations are twofold redundant due
to the symmetric nature of the structure (see Materials and
Methods for details).
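The totals and per-residue figures in Table 2 follow from the individual constraint counts. The short check below reproduces that arithmetic from the final-round values; the dictionary keys are illustrative labels, and the residue count assumes the range 2–58 stated in the table title.

```python
# Final-round constraint counts copied from Table 2.
monomer = {"noe": 1116, "dihedral": 76, "hbond": 16}
dimer = {"intra_noe": 2232, "inter_noe": 46, "dihedral": 130, "hbond": 36}
long_range_noe = 273

n_residues = 58 - 2 + 1  # residues 2-58 inclusive

monomer_total = sum(monomer.values())
dimer_total = sum(dimer.values())

print(f"Monomer total constraints:     {monomer_total}")
print(f"Dimer total constraints:       {dimer_total}")
print(f"Constraints per residue:       {monomer_total / n_residues:.1f}")
print(f"Long-range NOEs per residue:   {long_range_noe / n_residues:.1f}")
```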
Backbone dynamics
Further evidence for the structural stability of the
CFr protein is provided by the measurements of 15N
T1, T2 and 1H-15N heteronuclear nuclear Overhauser
enhancements (NOEs) that were measured by
standard techniques described in the Materials and
Methods. The results (Figure 9) show relatively
uniform and featureless values with only relatively
small variations across the sequence, with the
exception of the unfolded C-terminal tail. The
heteronuclear NOE is high, as expected for a domain
rigid on the ns–ps time scale of motion, and the
average value of T2 and T1 are consistent with a
protein of about 10–15 kDa, the size of the CFr
dimer. Even in loops, the values of T2 are nearly
constant, with the exception of residue Gly41 that
appears to be exchange broadened.
Further stabilisation by disulfide circularisation
of CFr
The high thermodynamic stability of the CFr
structure makes it an ideal candidate as a scaffold
for further design of novel or improved functions.
Since functional design often involves making
amino acid mutations that sacrifice thermodynamic
stability, design on an extremely stable template
should allow, at least in principle, for a larger
number of “functionalising” mutations. We investigated the possibility of further stabilising CFr by the
simple method of disulfide-induced protein circularisation. Since the NMR structure shows that the
N and C termini of each subunit are next to each
other, we chose positions at the end of both termini
to add single cysteine residues such that their thiol
groups could be within disulfide-forming distance.
Formation of a disulfide bond between these two
terminal cysteine residues should yield a covalently
circularised form of each CFr subunit. The corresponding SS.CFr clone was generated and the
protein purified as described in Materials and
Methods.
ESI-MS showed that SS.CFr was isolated as a
7241 Da species, which corresponds to a single
completely oxidised intra-molecular disulfide
bond per subunit (within 0.1 Da of predicted Mr;
Supplementary Data, Figure S1B). The CD wavelength scan of SS.CFr appears identical to CFr
(Figure 2(a)), and the SAXS scattering profiles of
the two proteins are indistinguishable at low
denaturant concentrations (Figure 3(b), inset), suggesting the disulfide has not perturbed protein
secondary or tertiary structure. The CD chemical
denaturation profile of SS.CFr (Figure 2(c)) shows it
to be dramatically stabilised over CFr; the protein
begins to unfold only at 6.5 M GuHCl and appears
to still be in the unfolding transition at 8.2 M GuHCl.
The SAXS profile of 2 mM SS.CFr also indicates that
the protein is still in the unfolding transition at 8 M
GuHCl (Figure 3(b)). In comparison, both CFr and
Top7 are almost completely unfolded by 6.5 M
GuHCl (Figure 2(c)). Like CFr, SS.CFr also shows
protein concentration dependence in its chemical
denaturation, suggesting that it too exists as an
obligate dimer. AUC scans confirm that SS.CFr
(33 μM–105 μM) is predominantly dimeric from
0 M to 5 M GuHCl, but a small fraction of the
monomeric form appears as the protein begins to
unfold between 6 M and 7 M GuHCl (Figure 4(b)
and (c)). These results indicate that SS.CFr is
stabilised over CFr, and it is likely to be one of the
most stable proteins reported regardless of class or
size.
Table 3. Structural statistics for CFr dimer
A. RMSD from averaged structure (Å) a (structured region, residues 3–51 in both chains)
    Backbone atoms               0.33
    All heavy atoms              0.75
B. PROCHECK-NMR analysis a (all residues in both chains)
    Most favoured regions (%)    80.3
    Additionally allowed (%)     17.3
    Generously allowed (%)       1.6
    Disallowed (%)               0.8
a Structural statistics reported are based on analysis of the best 20 conformers of 100 generated by CYANA.
Figure 7. Comparison of the Top7 and CFr structures. (a) and (b) Ribbon diagrams of residues 3–51 from one subunit
of the CFr NMR structure (green) superimposed on the corresponding region of the Top7 X-ray structure (purple). The
backbone RMSD value over these residues is 1.12 Å. The two diagrams are related by a 90° rotation around the vertical
axis in the plane of the page.
The SS.CFr construct was crystallized in an
attempt to determine a higher resolution structure
than that which was achieved for CFr by NMR
spectroscopy. However, extensive crystallization
trials and subsequent screening of specimens at the
Advanced Light Source (ALS) yielded crystals that
diffract to only 3.6 Å resolution, which provides no
higher structural resolution than the NMR structure
reported above. A strong molecular replacement
solution to the phase problem was found, which
generated models displaying relative subunit orientations and packing that agree well with the NMR-derived structure. Additionally, difference maps
calculated after molecular replacement demonstrate
the presence of a disulfide bond bridging the N and
C termini of the engineered construct, confirming
this additional aspect of the design cycle.
Discussion
Initiation is usually the rate-limiting step of
translation under normal conditions,21,26 and ample
evidence exists for regulation of protein synthesis at
this step.13,14 The significant bias in nucleotide
frequencies observed in the translation initiation
region of natural genes30–32 suggests a stringent
evolutionary selection for strong translation initiation
signals at the sequence level. In an analysis of 30
complete prokaryotic genomes, a significant positive
correlation was observed between the strength of the
SD sequence and predicted expression level of a gene,
such that highly expressed genes were much more
likely to have a strong SD sequence than average
genes.33 Mutational analyses of translation initiation
regions of a variety of genes have confirmed that
disruption of the start codon or the SD sequence
adversely affects translation efficiency and accuracy
of initiation at the proper start codon.34–36 Since
appropriate initiation sequences are clearly important
for efficient translation of normal genes, it should
follow that similar sequences are avoided within the
coding regions of genes to prevent mis-translation of
sub-gene fragments. We have shown that the CFr
fragment can be efficiently translated from within the
Top7 gene due to the fortuitous presence of an
initiation codon and a degenerate SD sequence at
appropriate positions within the coding region of
the Top7 gene, and that removal of either sequence
feature is sufficient to completely abrogate CFr mis-translation, without affecting translation of the full-length Top7 protein. If evolution has selected against
corresponding mis-translations in natural genes, one
would expect to observe a reduced frequency of
translation initiation sequence features within the
coding region of natural genes when compared to
the frequency expected by random chance. Saito &
Tomita have shown conclusively that in both
eukaryotes and prokaryotes, the frequencies of
AUG triplets just upstream and downstream of the
natural initiation codon are significantly lower than
expected by random chance, which “is likely due to
negative selectional pressure, since protein mistranslation is evolutionarily disadvantageous.”37
Figure 8. Details of the CFr NMR structure. (a) Seven views of the two subunits of CFr shown in surface
representation. Interfacial carbon atoms are coloured green in subunit A and yellow in subunit B, and all other atoms are
in CPK colour. Starting with the centre model of the dimer, the three models to the left (subunit A) and to the right
(subunit B) show the dimer opening like a book. (b) Three views of CFr subunits, with the dimer model in the centre
opened like a book (left: subunit A in green, right: subunit B in yellow). The centre model shows a ribbon representation of
the two subunits with interfacial regions coloured in green and yellow. The flanking models show the interfacial side-chains as green or yellow sticks. Surface representations are overlaid with 80% transparency to show orientation relative
to (a). (c) Specific interactions between the subunit interfaces are highlighted in the right (helices) and left (sheet) panels.
Backbone secondary structure is represented as ribbons and side-chains are represented as sticks. The model in the centre
of the panel is another ribbon representation of the dimer. The numerical suffix in each model label represents the degree
of rotation from the centre model (in (a) and (b)) around the vertical axis in the plane of the page (e.g. B+90 is subunit B
rotated 90° from the orientation of the dimer). All straight dotted arrows between models represent translations in the
plane of the page. All curved arrows between models represent rotations around the vertical axis in the plane of the page.
Figure 9. Backbone Dynamics of CFr. (a) 15N T1 measurements; (b) 15N T2 measurements; (c) 15N HetNOE measurements.
We extended this analysis to the complete coding
regions of genes, and observed that high-quality
ribosome binding sites, defined as a start codon in
the context of a strong SD sequence, are found to
occur significantly less often within E. coli coding
sequences than expected by random chance. These
new results provide quantitative evidence to support the theory that evolution has selected against
genetic features that would allow for mis-translation of protein sub-fragments. Additionally, the
probabilistic model we implemented to quantify the
E. coli translation initiation motifs correctly identifies the three sites of mis-translation we observed
experimentally in the Top7 gene, with the dominant
observed mis-translation product (CFr) scoring the
highest with our model. The model also correctly
predicts that either of our two independent sets of
experimentally observed CFr-ablating mutations
(initiation codon or SD sequence) would reduce the
probability of CFr mis-translation to zero. The
model may be useful for identifying and removing
translation initiation motifs from within any gene to
be expressed in E. coli.
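The idea of screening a gene for internal translation initiation motifs can be illustrated with a minimal sketch. It is not the probabilistic model described in this work (whose profiles are given in Supplementary Data, Table S1); instead it assumes a much cruder stand-in: an in-frame candidate start codon preceded, within an assumed spacer window, by a 7-mer that differs from the optimal Shine–Dalgarno sequence TAAGGAG in at most a chosen number of positions. The function name, the start codon set, the spacer window and the mismatch threshold are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

SD_CONSENSUS = "TAAGGAG"               # optimal Shine-Dalgarno 7-mer named in the text
START_CODONS = ("ATG", "GTG", "TTG")   # assumption: candidate start codons
SPACER_RANGE = range(4, 10)            # assumption: nt between the SD 7-mer and the start codon


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def internal_rbs_candidates(cds: str, max_mismatch: int = 2) -> List[Tuple[int, str, int, int]]:
    """Return (codon_position, codon, sd_position, sd_mismatches) for each in-frame
    internal start codon that has an SD-like 7-mer upstream.

    Positions are 0-based nucleotide indices; the annotated start at index 0 is skipped.
    """
    cds = cds.upper()
    hits = []
    for pos in range(3, len(cds) - 2, 3):          # walk the reading frame codon by codon
        codon = cds[pos:pos + 3]
        if codon not in START_CODONS:
            continue
        best = None
        for spacer in SPACER_RANGE:
            sd_start = pos - spacer - len(SD_CONSENSUS)
            if sd_start < 0:
                continue
            mm = mismatches(cds[sd_start:sd_start + 7], SD_CONSENSUS)
            if mm <= max_mismatch and (best is None or mm < best[1]):
                best = (sd_start, mm)
        if best is not None:
            hits.append((pos, codon, best[0], best[1]))
    return hits


# Toy sequence (not the Top7 gene): a perfect SD 7-mer five nt upstream of an in-frame ATG.
toy = "ATGGCT" + "TAAGGAG" + "CCCCC" + "ATG" + "GCTTAA"
print(internal_rbs_candidates(toy))   # [(18, 'ATG', 6, 0)]
```

In this toy case, changing either the internal ATG or the upstream 7-mer removes the hit, which mirrors in spirit the two independent sets of ablating mutations described above.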
Despite the observed genetic evolutionary selection against mis-translation of internal protein
fragments, many newly synthesized natural polypeptides are products of aberrant translation
reactions.12–15 This is because in addition to inappropriate translation initiation, an aberrant protein
product may be produced by ribosomal processivity
errors (such as ribosomal slipping, hopping, or drop-off) or missense errors where the mRNA transcript is
erroneously decoded.14 The overwhelming majority
of these mis-translated proteins fail to assume
native-like conformations and are cleared from the
cell by post-translational processes that involve a
functional cooperation between molecular chaperones assisting in folding and the proteasome
system.15–17 Super-stable protein fragments like
CFr that can fold with native-like tertiary structure
would challenge this cellular surveillance machinery, and hence would be expected to be under
negative evolutionary selection. When alternatively
translated proteins are stable enough to evade the
cellular surveillance machinery, they can compete
with the natural isoform to function in a dominant-negative fashion, as in the case of the HIV-1 Gag
protein.38 A significant number of human disease
pathologies involve mechanisms that implicate
protein fragments that are a result of an error in
translation of the native protein, including fragments
of C/EBPα in acute myeloid leukemia,39 GATA1
in Down syndrome-related leukemia,40 c-myc in
Burkitt's lymphoma,41 and lyl-1 in T cell acute lymphoblastic leukemia.42 The evidence for the rarity of
super-stable protein sub-fragments also comes from
the large body of work on limited proteolysis of
natural proteins which has revealed that, with a
few notable exceptions where independently folded
stable native-like fragments are observed,43,44
most proteolytic fragments are either completely
unfolded or mis-folded, or adopt only partially
folded states that require complementation with
fragments corresponding to the rest of the protein to
adopt rigid native-like protein structure.45–47 Due to
their low stability and/or conformational flexibility,
most if not all of these fragments would be expected
to be cleared by molecular chaperones and the proteasome before they could challenge their full-length counterparts.15–17,48 Cellular homeostasis
would be challenged only if the fragments were too
stable or were being selectively overproduced (as
in the case of cellular immortalization leading to
cancer). This latter mechanism was also demonstrated in experiments where protein fragments
of Ile-tRNA synthetase were overexpressed in vivo,
causing dominant lethality to host cells, presumably
due to fragment-induced mis-folding of the full-length protein.49 Since stable protein sub-fragments
clearly stand to disrupt homeostasis by challenging
the cellular surveillance system, we propose that
evolution has selected against protein structures that
can yield stable sub-fragments that can adopt native-like conformations.
The simplest level of evolutionary selection,
perhaps, is against extreme thermodynamic stability
of any protein. We have shown that both Top79 and
CFr display thermodynamic stability profiles significantly higher than most, if not all, natural
proteins of similar shape and size. In the design of
the novel sequence and topology of Top7, every
amino acid was selected to stabilize the final folded
structure, in the absence of any functional constraints. By contrast, nature selects proteins to fulfill
very specific functions in a time-dependent fashion,
and hence natural proteins need only be just stable
enough to fulfill their function, after which they are
cleared away by the proteasomal degradation
machinery. It is reasonable to expect that extremely
stable proteins (like Top7) have a higher probability
of containing independently stable sub-structures
than proteins of lower stability (most natural proteins). In addition to this intrinsic probability (which
nature can select against), however, we show that
the Top7 protein contains specific sequence and
structural features that increase the ability of its sub-fragment, CFr, to achieve a stable, rigid, native-like
structure, and hence suggest aspects of protein
structure that may be under evolutionary control.
First, the Top7 topology has a low contact order; the
primary sequence separation between most structural amino acid neighbours is low. This allows Top7
to be stabilized by largely local interactions, significantly increasing the probability that contiguous
sequence fragments can adopt independently stable
tertiary structures. Second, the buried hydrophobic
residues in the Top7 core have a high level of
sequence symmetry. Of the residues in close contact
between the two helices, the first helix contributes
three leucine residues and an isoleucine, while the
second helix contributes two leucine residues, an
isoleucine, and a valine. In the β-sheet, two core
isoleucine residues on the third strand in Top7 (Ile6
and Ile8 in CFr) interact with two valine residues
from the first strand in Top7. This high level of
sequence symmetry allows the CFr fragment to
effectively mimic the packing of the Top7 core by
self-associating into a symmetric homodimer. This
mechanism has been previously observed in the
proteolytically derived C-terminal fragment 255–
316 of thermolysin, which also adopts a symmetric
homodimeric structure, with the dimer interface
effectively mimicking interactions from the core of
the parent protein.43 Finally, the interacting surfaces
on the two helices of Top7 have no large protrusions
or intrusions and no interdigitation of side-chains,
allowing the self-interaction in the CFr dimer to be
as viable as the heterologous interaction with the
N-terminal portion of Top7. In addition to highlighting protein structural features that might be
under evolutionary selection, our observations
provide guidelines for synthetic protein engineers
who either wish to avoid super-stable protein
progeny or conversely wish to create protein folds
that can yield stable sub-fragments for the purpose
of functional regulation of the full-length protein.
This balance between the danger and utility of protein sub-fragments leads us to the final evolutionary
implication of our analysis.
It has been suggested that many natural single
domain protein structures that have a high internal
sequence and structural symmetry (such as ribonuclease inhibitor and proteins containing ankyrin or
HEAT repeats) may have arisen by duplication of a
single ancestral gene-product that initially formed
homo-multimers of identical chains, which were
gradually replaced by single polypeptide chains
encoding multiple repeats.50,51 The formation
of the CFr dimer from a fragment of Top7 may
parallel this natural protein-fold evolution by modular recombination of stable protein sub-structures.
On the surface, this might seem to contradict the
theory that evolution has selected against stable
protein sub-structures. However, analyses of most
modern repeat-containing proteins show that the
internal interaction surfaces of the repeats have
evolved to be inter-dependent, such that in isolation,
a single repeat unit cannot fold into an independent
stable structure.50–52 In fact, these observations
suggest that autonomously folded ancient peptides
evolved to associate interdependently into modern
larger monomeric proteins with diverse functions,
but in turn the ancestral peptide components of
modern proteins were selected to lose their ability to
fold autonomously to prevent protein fragments
from interfering with the structure and function of
parent domains. Evidence for the delicate nature of
this evolutionary balance is clearly implied by the
numerous aforementioned disease states caused by
the selective stabilization of fragment isoforms of
natural proteins.39–42 Submission of the CFr structure to the DALI server53 finds 122 natural protein
domains with significant structural homology
(Z-score > 2.0) to the CFr template. In many of
these cases, the CFr subunits are found to be homologous to multiple non-overlapping parts of the
same protein (e.g. the E. coli acriflavine resistance
protein pump54 has four distinct regions of homology to CFr), suggesting that a CFr-like module
could have played a role in natural protein-fold
evolution.
In addition to the evolutionary implications of the
mis-translation and subsequent structural characterisation of CFr and SS.CFr, these extremely stable
proteins also serve a potentially significant practical
utility as novel scaffolds for further protein design.
Their extremely high thermodynamic stability
should allow, in principle, for their employment
in industrial applications where most proteins
would be rapidly degraded, such as at 100 °C or at
extremely high denaturant concentrations.55,56 Polypeptides of this length (∼50 amino acids) can also
routinely and cheaply be produced in high yield and
purity by chemical synthesis (as opposed to bacterial
expression).57–59 Chemical synthesis has the distinct
advantage over bacterial expression of allowing for
the efficient and selective covalent modification of
amino acids and/or the covalent addition of non-amino acid functional groups to the polypeptide
chain, allowing for the potential design of extremely chemically diverse nano-scale protein
machines.59,60 The symmetric homo-dimeric nature
of CFr and SS.CFr can provide an additional benefit
as a scaffold, in that a singly functionalised monomer will yield a doubly functionalised macromolecular unit. Interestingly, the scorpion toxin fold
family61 has a similar overall architecture to a CFr
monomer (one helix packed on a three-stranded
antiparallel sheet), and has been successfully employed as a protein engineering scaffold.56,61 However, all scorpion toxin fold proteins have six
cysteine residues that participate in three specific
internal disulfide bonds which are required for the
protein to fold accurately, whereas CFr (with no
disulfides) and SS.CFr (with only one disulfide) fold
into extremely stable structures bereft of these extra
internal covalent constraints. Our current efforts
using CFr and SS.CFr as scaffolds include their
design for the presentation of epitope-peptides for
production of antibodies against HIV, and their
functionalisation with peroxide-activating catalysts
for bioremediation.
Materials and Methods
Protein expression and purification
The gene coding for the CFr protein sequence (amino
acid residues Val48 through Gly95 in Top7) was PCR
amplified from the Top7 gene sequence and cloned into
plasmid pET29b(+) (Novagen). The CFr protein has the
sequence: MERVRISITARTKKEAEKFAAILIKVFAELGYNDINVTWDGDTVTVEGQLEGGSLEHHHHHH.
The SS.CFr gene construct was generated by PCR amplifying the CFr construct using oligonucleotide primers
that add a Cys-Glu sequence at position 3 and change
Glu51 to Cys, and sub-cloning this fragment back into
pET29b(+). The SS.CFr protein has the sequence: MECERVRISITARTKKEAEKFAAILIKVFAELGYNDINVTWDGDTVTVEGQLCGGSLEHHHHHH. Point mutants of
Top7 (ATG1ATT, GTG48GTT, and GGG44TCT) were generated using the QuikChange Site-Directed Mutagenesis
kit (Stratagene).
The 6× histidine-tagged proteins were expressed in the
BL21(DE3)pLysS strain of E. coli. Cells were grown in LB
media at 37 °C to an A600 of 0.6, induced with 1 mM
isopropyl-β-D-thiogalactopyranoside (IPTG), and cells were
harvested after another 4–5 h of growth at 37 °C. Harvested
cells were lysed by sonication, and soluble protein collected
after centrifugation of cellular debris. Soluble protein was
purified on a Ni2+ affinity column (Pharmacia) followed by
10⁴-fold dialysis against 25 mM Tris–HCl (pH 8.0). The
protein was further purified on a QFF anion exchange
column (Pharmacia) with a 50 mM to 600 mM NaCl
gradient in 25 mM Tris–HCl (pH 8.0), followed by a final
10⁴-fold dialysis against 25 mM Tris–HCl (pH 8.0) (or
50 mM sodium phosphate (pH 7.0) for NMR). To ensure
complete disulfide formation, anion-exchange purified
SS.CFr was oxidised in the presence of 20 mM potassium
ferricyanide [K3Fe(CN)6] for 10 min at room temperature,
prior to the final dialysis steps. Protein identity and purity
were determined by SDS-PAGE and ESI-MALDI mass
spectroscopy. Protein concentrations were determined by
UV absorbance at 280 nm with extinction coefficients
calculated using the ExPASy Protparam tool†.
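As a worked illustration of the concentration determination, the sketch below applies the commonly used sequence-based estimate of the 280 nm extinction coefficient (5500 M⁻¹ cm⁻¹ per Trp, 1490 per Tyr, 125 per cystine, the coefficients ProtParam is documented to use) followed by the Beer–Lambert law. The function names are hypothetical, the absorbance reading is an invented example value, and the numbers are computed here from the sequences listed above rather than quoted from the authors.

```python
def extinction_coefficient_280(sequence: str, n_cystine: int = 0) -> int:
    """Estimate the molar extinction coefficient at 280 nm (M^-1 cm^-1) from sequence.

    n_cystine is the number of disulfide bonds assumed to be formed.
    """
    seq = sequence.upper()
    return 5500 * seq.count("W") + 1490 * seq.count("Y") + 125 * n_cystine


def concentration_molar(a280: float, ext_coeff: float, path_cm: float = 1.0) -> float:
    """Beer-Lambert law: c = A / (epsilon * l)."""
    return a280 / (ext_coeff * path_cm)


CFR = "MERVRISITARTKKEAEKFAAILIKVFAELGYNDINVTWDGDTVTVEGQLEGGSLEHHHHHH"
SS_CFR = "MECERVRISITARTKKEAEKFAAILIKVFAELGYNDINVTWDGDTVTVEGQLCGGSLEHHHHHH"

eps_cfr = extinction_coefficient_280(CFR)                   # one Trp, one Tyr
eps_ss_cfr = extinction_coefficient_280(SS_CFR, n_cystine=1)  # oxidised form, one disulfide

print(eps_cfr, eps_ss_cfr)                  # 6990 7115
print(concentration_molar(0.35, eps_cfr))   # molar concentration for a hypothetical A280 of 0.35
```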
For NMR studies, uniformly 15N and 15N/13C labelled
samples were prepared by growing bacteria in
M9 minimal media supplemented with 0.5 g/l of [15N]
NH4Cl and 2 g/l of [13C]glucose (Spectra Isotope).
Purification was identical to that executed for the
unlabelled samples. For 12C/13C filtered NOESY experiments, equimolar amounts of 15N12C and 15N13C samples
were mixed, the protein was then denatured in 8 M
GuHCl with overnight mixing to ensure complete monomerisation, dialysed back into 50 mM NaPi (pH 7.0) to
allow refolding and dimerisation, lyophilised, and
brought up in 100% D2O.
† http://us.expasy.org/tools/protparam.html
Limited proteolysis
Bacterial cells containing over-expressed Top7 were
lysed by three freeze-thaw cycles in the presence or
absence of protease inhibitors (1 mM PMSF, 1 mM
benzamidine). These two lysates were then divided into
four equal fractions, which were incubated at room
temperature for 2, 4, 24, and 72 h, respectively. After the
incubation period, the lysates were centrifuged and
separated into supernatant and pellet, which were
subsequently visualised by SDS-PAGE.
Statistical analysis of E. coli ribosome binding site
motifs
There are three steps in the measurement of ribosome
binding sites within E. coli protein-coding regions: (1) infer
a probabilistic model M of the true upstream ribosome
binding sites; (2) use the model M to measure the observed
number of high-scoring ribosome binding sites within real
protein-coding regions; and (3) use the model M to
measure the expected number μ of high-scoring ribosome
binding sites and its standard deviation σ in randomly
generated coding regions.
For the first step, we adopted the simple iterative
approach of Kibler & Hampson.62 The E. coli genome
contains 2912 annotated genes each with at least 20 bp of
non-coding DNA upstream of their start codons. From this
training set T of 2912 upstream sequences each of length
20 bp, we extracted all 7-mers that differ in at most one
position from TAAGGAG, known to be the optimal
Shine–Dalgarno sequence,63 since it is the reverse complement of the 3′ end of E. coli's 16S rRNA. The initial Shine–
Dalgarno profile P was formed from this collection of 7-mers. This profile is a 4 × 7 matrix whose columns give the
probability distributions for each of the seven positions of
the Shine–Dalgarno sequence. The profile P can then be
used to score any 7-mer. We then iterated the following
process until convergence. (1) Extract from the training set
T the 2000 7-mers that score highest according to the
profile P. (2) Use these 2000 7-mers to compute a new
profile P. When this process converges, P should be a good
approximation to the distribution of true Shine–Dalgarno
sequences.
Using the 2000 highest scoring matches of this
converged profile P in the training set T, we computed
the 4 × 3 profile of the start codons and the probability
distribution of the distance separating the Shine–Dalgarno 7-mer from the start codon. These two profiles and
the distance distribution are shown in Supplementary
Data, Table S1. Together, the three distributions given in
Table S1 comprise the probabilistic model M described
above, and can be used to score any ribosome binding
site.
The next step was to use M to measure the observed
number of high-scoring ribosome binding sites within E.
coli's 4237 annotated protein-coding regions. For any
score threshold S (such as the thresholds shown in Table
1), the model M of Supplementary Data, Table S1 can be
used to count the number of sequences internal to
protein-coding regions with scores at least S. We insisted
that the internal “start codon” that matches Supplementary Data, Table S1(c) occur in the correct reading frame,
so that translation could proceed in that open reading
frame.
Finally, we repeated this counting process in “random”
coding regions. We simulated random coding regions by
randomly permuting the codons of all of E. coli's real
protein-coding regions. We kept each codon intact so as to
preserve E. coli's natural codon biases. We repeated this
random codon shuffling process 300 times, computing the
means and standard deviations in Table 1 over these 300
trials.
Size exclusion (gel filtration) chromatography
Size exclusion chromatography was carried out using
an analytical Superdex-75 column (Amersham Pharmacia)
with the Pharmacia FPLC system (GP-250 gradient
programmer, P-500 Pump). Protein samples at concentrations used for NMR (600 μM–1.2 mM) or CD (5–100 μM)
were equilibrated in 20 mM EDTA, 25 mM Tris (pH 8.0) at
25 °C, and run on the Superdex-75 column at 1 ml/min.
Small-angle X-ray scattering (SAXS)
SAXS measurements were carried out at the BESSRC-CAT beamline 12-ID at the Advanced Photon Source
(Argonne, IL). Immediately before data collection, the
samples were centrifuged for 10 min at 11,000g. The
measurements were performed at 25(±1) °C in a custom-made, thermostated flow cell64 at a flow rate of ∼1 ml/min
and a photon energy of 12 keV. For each condition, a total of
40 measurements of 1.0 s integration time each were taken.
All data were image-corrected and circularly averaged
after data taking. The 40 profiles for each condition were
averaged, and appropriate buffer scattering profiles were
subtracted for background correction. There were no signs
of radiation damage. Measurements were performed at
varying concentrations of GuHCl in 25 mM Tris (pH 8.0), at
a protein concentration of 14.5 mg/ml, unless otherwise
noted.
Changes in protein conformation monitored by SAXS
are represented as Kratky plots, which are graphs of s²·I(s)
as a function of s, where s is the momentum transfer vector
(s = 2sin(θ)/λ, where λ = 1 Å is the X-ray wavelength and 2θ
is the scattering angle). Porod's Law states that for large s
the scattering from an object with a well defined surface
falls approximately as s⁻⁴,65 which leads to a decrease as s⁻²
in the Kratky representation for large s. Well-folded
proteins therefore have a characteristic peak in the Kratky
plot. Scattering from a random polymer falls like s⁻¹, which
leads to a linear rise at high s in the Kratky plot for
unfolded proteins.66
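The Kratky transformation and the two limiting behaviours described above can be sketched in a few lines. This is an illustration only: the intensities are idealised power-law tails (s⁻⁴ and s⁻¹) rather than measured profiles, and the function names and the s grid are hypothetical.

```python
import math


def momentum_transfer(two_theta_rad: float, wavelength_angstrom: float = 1.0) -> float:
    """s = 2 sin(theta) / lambda, where 2*theta is the scattering angle."""
    theta = two_theta_rad / 2.0
    return 2.0 * math.sin(theta) / wavelength_angstrom


def kratky(s_values, intensities):
    """Kratky representation: s^2 * I(s) for each point of a scattering profile."""
    return [s * s * i for s, i in zip(s_values, intensities)]


# Idealised high-s tails (arbitrary units), not experimental data.
s_grid = [0.02 + 0.005 * k for k in range(10)]
compact_tail = [s ** -4 for s in s_grid]   # Porod-like decay for a compact object
chain_tail = [s ** -1 for s in s_grid]     # decay assumed in the text for a random polymer

kratky_compact = kratky(s_grid, compact_tail)   # falls as s^-2
kratky_chain = kratky(s_grid, chain_tail)       # rises linearly with s
```

With real data, the compact case shows a peak at intermediate s followed by this decay, whereas the unfolded case keeps rising at high s.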
Analytical ultra-centrifugation (AUC)
Sedimentation equilibrium studies on CFr and SS.CFr
were conducted in a Beckman XL-A analytical ultracentrifuge using 12 mm Epon charcoal-filled centerpieces
containing six channels. Studies on each protein were
conducted at three concentrations in 25 mM Tris (pH 8.0) in
the presence of 0, 3, 4, 5, 6, and 7 M guanidine
hydrochloride (GuHCl). Centerpiece sample channels
were filled with 110 μl of protein sample and reference
channels were filled with 120 μl of matched solvent.
All scans were conducted at 20 °C at an absorbance wavelength of 280 nm. Protein concentrations were determined
using scans conducted at 3000 rpm and low, intermediate,
and high protein concentrations that fell in the range of 33–
39 μM, 59–66 μM, and 89–105 μM. Scans were collected at
three rotor speeds, 25,000, 30,000, and 45,000 rpm, using
equilibration times of 10 h for each speed. This equilibration time was deemed sufficient by identical absorbance
scans collected after 8 and 10 h at each speed.
Solvent densities were determined at 20 °C using an
Anton Paar DMA 5000 densitometer. Triplicate measurements were collected and averaged for 25 mM Tris (pH
8.0) in the presence of varying concentrations of GuHCl
(Supplementary Data, Table S2). The partial specific
volumes of CFr and SS.CFr (0.733 and 0.730 ml/g,
respectively) were determined at 25 °C from amino acid
composition67 and adjusted to 20 °C.68
Data analysis was performed using Beckman XL-A
Data Analysis Software, version 4.0. Individual equilibrium scans were fit to a single ideal species model using
non-linear least-squares analysis to determine a weight-averaged molecular mass, Mr. During this analysis, the
baseline offset was allowed to float; if it was found to be
> ± 0.08, it was fixed to zero so that the goodness of fit
could be assessed for each case. Analysis of residuals to
the fit allowed for detection of aggregation or non-ideal
behavior in a few scans. Following this analysis, global
fits were performed across the three protein concentrations and three speeds (nine scans) to re-determine Mr.
The residuals, typically small (< ± 0.02) and random, and
baseline offsets (typically < ± 0.04) were most often
improved during the global data analysis.
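For readers unfamiliar with the single ideal species model, the following minimal sketch shows its standard form, A(r) = A0·exp[M(1 − v̄ρ)ω²(r² − r0²)/(2RT)] + baseline, and recovers M from simulated data. It is not the Beckman software procedure: it uses a log-linear regression with a known baseline instead of non-linear least squares with a floating baseline, and it assumes a solvent density of 1.0 g/ml (the measured densities are in Supplementary Data, Table S2). The mass used for the simulation is simply twice the 7241 Da subunit mass reported above; all function names are hypothetical.

```python
import math

R_ERG = 8.314e7  # gas constant in erg/(mol K); cgs units so that radii are in cm


def angular_velocity(rpm: float) -> float:
    """Convert rotor speed in rpm to rad/s."""
    return rpm * 2.0 * math.pi / 60.0


def ideal_species_absorbance(r_cm, a0, r0_cm, mass, vbar, rho, rpm, temp_k, baseline=0.0):
    """Equilibrium absorbance at radius r for one ideal species of molar mass `mass` (g/mol)."""
    sigma = mass * (1.0 - vbar * rho) * angular_velocity(rpm) ** 2 / (2.0 * R_ERG * temp_k)
    return a0 * math.exp(sigma * (r_cm ** 2 - r0_cm ** 2)) + baseline


def fit_mass(radii, absorbances, vbar, rho, rpm, temp_k, baseline=0.0):
    """Estimate the molar mass from the slope of ln(A - baseline) versus r^2."""
    x = [r * r for r in radii]
    y = [math.log(a - baseline) for a in absorbances]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return slope * 2.0 * R_ERG * temp_k / ((1.0 - vbar * rho) * angular_velocity(rpm) ** 2)


# Simulated scan for a dimer of 2 x 7241 Da at 25,000 rpm and 20 degrees C (illustrative values).
radii = [6.9 + 0.025 * k for k in range(11)]
scan = [ideal_species_absorbance(r, 0.2, 6.9, 14482.0, 0.730, 1.0, 25000.0, 293.15) for r in radii]
fitted_mass = fit_mass(radii, scan, 0.730, 1.0, 25000.0, 293.15)
```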
Circular dichroism
CD data were collected on an Aviv 62A DS spectrometer. Far-UV CD wavelength scans (260 nm–195 nm) at
varying protein concentrations (10 μM–25 μM), guanidinium hydrochloride (GuHCl) concentrations (0–8.3 M),
and temperatures (0–98 °C) were collected in a 1 mm
path length cuvette. GuHCl-induced protein denaturation was followed by the change in ellipticity at 220 nm
in a 1 cm path length cuvette, using a Microlab titrator
(Hamilton) for denaturant mixing. Temperature was
maintained at 25 °C with a Peltier device. Temperature-induced protein denaturation was followed by the
change in ellipticity at 220 nm in a 2 mm path length
cuvette. All CD data were converted to mean residue
ellipticity. The dimer dissociation constant (Kd) and the
free energy of unfolding ($\Delta G_U^{\mathrm{H_2O}}$) were calculated
according to the procedure described by Kuhlman and
co-workers,69 where chemical denaturation curves were
fit to an equilibrium model between unfolded monomer
(U) and folded dimer (F):
$$F_2 \overset{K_d}{\rightleftharpoons} 2U$$
where:
$$\exp\left(-\frac{\Delta G^{\circ}}{RT}\right) = K_d = \frac{[U]^2}{[F_2]} = \frac{2 P_t f_u^2}{1 - f_u}$$
where Pt is the total protein concentration, fu is fraction
of unfolded protein, R is the gas constant and T is the
temperature. The final equation used to fit the circular
dichroism data (θ) takes the form:
$$\theta([\mathrm{GuHCl}]) = (\theta_U - \theta_F) \cdot f_u + \theta_F$$
where:
$$f_u = 0.5\left(-a + \sqrt{a^2 + 4a}\right); \qquad a = \frac{\exp\left(-\Delta G^{\circ}/RT\right)}{2 P_t}$$
and ΔG° and the circular dichroism signal of folded (θF)
and unfolded (θU) protein are assumed to vary linearly
with denaturant concentration:
$$\Delta G^{\circ}([\mathrm{GuHCl}]) = \Delta G^{\circ}(0\,\mathrm{M\ GuHCl}) + m \cdot [\mathrm{GuHCl}]$$
$$\theta_U([\mathrm{GuHCl}]) = \theta_U(0\,\mathrm{M\ GuHCl}) + a \cdot [\mathrm{GuHCl}]$$
$$\theta_F([\mathrm{GuHCl}]) = \theta_F(0\,\mathrm{M\ GuHCl}) + b \cdot [\mathrm{GuHCl}]$$
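A minimal sketch of this folded dimer to unfolded monomer model is given below, assuming energies in kcal/mol and concentrations in mol/l. The fraction unfolded is computed as 2/(1 + √(1 + 4/a)), which is algebraically the same positive root of fu² + a·fu − a = 0 as the expression for fu in the text but avoids cancellation when a is large. The baseline slopes are named slope_u and slope_f to keep them distinct from the dimensionless a, and every numerical value in the example is an invented illustration, not a fitted parameter from this work.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def fraction_unfolded(delta_g: float, total_protein_m: float, temp_k: float = 298.15) -> float:
    """Fraction of unfolded monomer for the F2 <-> 2U equilibrium."""
    a = math.exp(-delta_g / (R_KCAL * temp_k)) / (2.0 * total_protein_m)
    if a == 0.0:                      # numerical underflow: essentially fully folded
        return 0.0
    return 2.0 / (1.0 + math.sqrt(1.0 + 4.0 / a))


def cd_signal(gu_m, total_protein_m, dg0, m, theta_u0, slope_u, theta_f0, slope_f, temp_k=298.15):
    """Predicted CD signal at denaturant concentration gu_m with linear baselines."""
    delta_g = dg0 + m * gu_m                      # linear dependence of the free energy
    fu = fraction_unfolded(delta_g, total_protein_m, temp_k)
    theta_u = theta_u0 + slope_u * gu_m           # unfolded baseline
    theta_f = theta_f0 + slope_f * gu_m           # folded baseline
    return (theta_u - theta_f) * fu + theta_f


# Illustrative curve: hypothetical dG0 = 20 kcal/mol, m = -3 kcal/(mol M), 10 uM protein.
curve = [cd_signal(gu, 1e-5, 20.0, -3.0, -2.0, 0.0, -20.0, 0.0) for gu in range(0, 9)]

# Protein concentration dependence: at fixed delta_g, more protein means less unfolding.
fu_low = fraction_unfolded(5.0, 1e-5)
fu_high = fraction_unfolded(5.0, 1e-4)
```

Because Pt enters a, the midpoint of the transition shifts with protein concentration, which is the signature of a coupled unfolding and dissociation equilibrium.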
Nuclear magnetic resonance spectroscopy
All CFr samples were prepared for NMR experiments in
Shigemi susceptibility-matched NMR tubes, at 0.7 mM–
1.0 mM concentration in H2O solution containing 10%
D2O or in 100% D2O, 50 mM sodium phosphate (pH 7.0).
All experiments were recorded at 298 K unless otherwise
specified. Triple resonance NMR experiments were
collected on a Bruker Avance 500 MHz spectrometer
equipped with a TXI HCN triple resonance probe with
triple axis gradients. Three-dimensional 15N-edited
NOESY spectra and two-dimensional NOESY and TOCSY
datasets were recorded on a Bruker Avance 750 MHz
spectrometer equipped with a TXI HCN triple resonance
probe with z-axis gradient. Three-dimensional 13C-edited
NOESY and two and three-dimensional 12C/13C-filtered
NOESY spectra were recorded at Environmental Molecular Sciences Laboratory (EMSL) at PNNL in Richland,
WA using a Varian 600 MHz spectrometer equipped with a
cryoprobe. Data were processed with NMRPipe70 and
analyzed with SPARKY‡.
Backbone amide 1H and 15N, Cα, C=O and side chain
Cβ resonances were assigned using 1H-15N HSQC,
HNCO, HNCACB, CBCA(CO)NH, HBHA(CO)NH, HN
(CO)CA and 3D 15N edited TOCSY experiments.71 Over
98% of the backbone N, (N)H, C(O), Cα and Cβ nuclei for
residues 2–58 could be assigned (no assignments were
possible for the N-terminal methionine and for the last
four histidine residues at the C terminus). Side-chain
assignments were obtained by analysis of 3D HCCH-TOCSY and 3D 13C-edited NOESY experiments. Aromatic
side-chain assignments were obtained from two-dimensional NOESY and TOCSY spectra recorded in D2O
buffers. Side-chain 1H and 13C resonances were >92%
assigned, whereas the aromatic side-chains (Phe, Tyr, Trp)
were >68% assigned. Gln/Asn NH2 were 100% assigned
while Arg Nε and guanidinium groups and Lys NH3
remain unassigned. The spectra used in deriving distance
constraints included 3D 15N-edited NOESY and 3D
13C-edited NOESY, 2D NOESY in H2O (80 ms and
120 ms mixing) and 2D NOESY in D2O (120 ms mixing)
recorded at 750 MHz. Additionally, inter-subunit distance
constraints were derived from 2D and 3D 12C/13C-filtered
NOESY spectra.27,28
Protein structure determination by NMR
Structure determination was conducted in a two-step
process using the program CYANA 2.0,72 a fully automated iterative step for generating models of the monomeric unit of CFr, followed by a partly automated iterative
step for building the symmetric homo-dimer model with
manually assigned interfacial constraints. Fully automated structure determination of the CFr dimer was not
possible in CYANA because the symmetric nature of the
dimer made it impossible for the program to distinguish
between inter-subunit and intra-subunit NOEs. The
experimental NMR data used for structural analysis
included the NOESY peak lists derived from the 3D
15N- and 13C-edited NOESY data together with the 2D
NOESY data collected in both H2O and D2O. In
addition, the 2D and 3D 12C/13C-filtered NOESY peak
‡ Goddard, T. D. & Kneller, D. G. (2005). SPARKY 3.111.
University of California, San Francisco (run on Windows or
Linux workstations).
lists were added prior to the second step. Hydrogen
bonding constraints derived from slow amide exchange
data (as described below), and Φ–Ψ angle constraints
generated from chemical shift data using the program
TALOS73 were also used. The NOESY peak lists used as
input for automated analysis with CYANA were
generated automatically using the program SPARKY
based on the chemical shift list generated in the assignment
process. Peak volumes were calculated using
SPARKY's Gaussian integration tool. Slowly exchanging
amides were identified by lyophilizing the protein from
H2O, then dissolving it in D2O and acquiring 2D 1H-15N
HSQC spectra at 30 min and 50 h after dissolving in
D2O. Hydrogen bond donors were identified by the
presence of an amide peak in the 2D 1H-15N HSQC
spectrum recorded at 30 min. The corresponding acceptors were identified by visualizing PDB files obtained
from CYANA in Rasmol 2.7.174 to identify carbonyl
groups that were at a distance of approximately 2.0 Å
from slow exchanging hydrogen atoms. Each step of
structural refinement in CYANA was performed with
and without these hydrogen-bonding constraints.
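The acceptor search described above was carried out by visual inspection in Rasmol. A programmatic analogue might look like the following sketch, which pairs each slowly exchanging amide proton with the carbonyl oxygen whose distance is closest to 2.0 Å. The function name, the coordinate dictionaries, and the ±0.3 Å tolerance are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def find_acceptors(amide_h, carbonyl_o, target=2.0, tol=0.3):
    """Pair slow-exchanging amide protons with candidate carbonyl acceptors.

    amide_h    : dict, residue number -> (x, y, z) of the amide proton (Å)
    carbonyl_o : dict, residue number -> (x, y, z) of the carbonyl oxygen (Å)
    target     : expected H...O distance in Å (approximately 2.0 Å in the text)
    tol        : allowed deviation from target in Å (assumed value)

    Returns a dict mapping donor residue -> acceptor residue.
    """
    pairs = {}
    for donor, h_xyz in amide_h.items():
        best = None  # (acceptor residue, deviation from target)
        for acceptor, o_xyz in carbonyl_o.items():
            if acceptor == donor:
                continue  # skip the residue's own carbonyl
            deviation = abs(math.dist(h_xyz, o_xyz) - target)
            if deviation <= tol and (best is None or deviation < best[1]):
                best = (acceptor, deviation)
        if best is not None:
            pairs[donor] = best[0]
    return pairs
```

Donors with no oxygen inside the tolerance window are simply left out of the result, which mirrors leaving a hydrogen bond constraint undefined when no plausible acceptor is visible in the model.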
For the structure determination of a single subunit of
CFr (i.e. one chain from the symmetric homo-dimer or
CFrA), 3873 NOE peaks (many of them repetitions of
the same peak observed in different spectra) were
semi-automatically generated from 3D 15N- and 13C-edited
NOESY and 2D NOESY spectra in H2O and D2O, using
the program SPARKY. In addition, 76 dihedral constraints were generated with the program TALOS and 32
hydrogen bond constraints were generated by analysis
of D2O protection experiments. The NOEASSIGN macro
in CYANA was used to automatically assign >92% of the
NOE input peaks. Together with the dihedral and
hydrogen bond constraints, the 3783 initial cross-peaks
yielded 1116 unique distance constraints that were used
in the final CFrA structure calculation. In the final
calculation, 100 structures were generated, of which the
top 20 structures had an average target function value of
5.24(±0.08) Å2 and an ensemble RMSD value of
0.24(±0.08) Å over backbone atoms and 0.76(±0.14) Å
over heavy atoms in residues 3 through 51. There were
16 distance constraint violations between 0.1 Å and 0.25 Å
and two angle constraint violations of between 33° and 36°.
In the next step of refinement, results from the CFrA
structure calculation were combined with inter-subunit
NOE data from the 2D and 3D 12C/13C filtered NOESYs to
determine the CFr dimer structure. The CFrA sequence
list, chemical shift list and the intra-subunit distance
constraints derived from the last round of CYANA were
duplicated to generate an equivalent copy of data for a
second chain labelled CFrB. A flexible 60 Å tether was
introduced between the C terminus of CFrA and the N
terminus of CFrB to allow each monomer to refine
separately during the calculation while also allowing a
generous range of motion for relative inter-subunit
re-orientation. A total of 23 inter-subunit NOEs were
assigned by manually inspecting the 2D and 3D 12C/13C-filtered
NOESYs in SPARKY, based on the earlier intra-subunit
backbone and side-chain assignments and the
intra-subunit NOE assignments from the CYANA runs.
All inter-subunit NOE assignments were made as double
assignments between equivalent pairs of interacting
nuclei between CFrA and CFrB (i.e. an interaction
assigned between nucleus X on CFrA with nucleus Y on
CFrB automatically implied the same interaction between
nucleus Y on CFrA with nucleus X on CFrB). Peak volumes
calculated by SPARKY's Gaussian integration tool were
converted into upper distance constraints in CYANA by
setting the ratio of volumes to upper distance constraints
equal to that obtained in the automated intra-subunit
NOE assignment step. This inter-subunit upper distance
constraint list was then used in combination with the
CFrA and CFrB chemical shift lists and intra-subunit
distance constraint lists as input for a single round of
structure calculation that consisted of 100 separate
simulated annealing runs using torsion angle dynamics.
Similar structure calculations were also run with
duplicated CFrA hydrogen-bonding constraints and
TALOS-derived dihedral angle constraints, including hydrogen
bonds that were observed across the interface. All violated
constraints were investigated and were removed or
modified only if it appeared that they had been misassigned (intra-subunit instead of inter-subunit) or poorly
integrated. Unassigned NOEs from the CFrA automated
structure calculation were also investigated at this stage to
assign them, if possible, as inter-subunit NOEs. Two cycles
of this type of refinement were sufficient to obtain
structures with appropriate target function values, tight
ensemble convergence and no distance or dihedral
violations. The only violation after the final CYANA run
was the same single intra-residue close atom contact in
each monomer (Ile35 CG2 to C(O) violated by 0.33 Å). The
quality of the final structure was evaluated with
ProcheckNMR.29 Experimental constraints and structural
statistics are reported in Tables 2 and 3, respectively.
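The double-assignment rule for inter-subunit NOEs described above (an X on CFrA to Y on CFrB contact implies the Y on CFrA to X on CFrB contact) can be sketched in Python as follows. The tuple layout, chain labels, and function name are hypothetical, and the sketch does not reproduce the CYANA constraint file format.

```python
def symmetrize(constraints):
    """Expand inter-subunit constraints into symmetry-related pairs.

    constraints : iterable of (atom_x, atom_y, upper_limit), where each atom
                  is a (residue_number, atom_name) tuple and upper_limit is
                  the upper distance bound in Å. atom_x is taken on chain A
                  and atom_y on chain B.

    Returns a list of (("A", res, name), ("B", res, name), upper_limit)
    containing every input constraint plus its symmetry mate, without
    duplicates.
    """
    seen = set()
    expanded = []
    for atom_x, atom_y, upper in constraints:
        # The original assignment and its mirror image across the dimer axis.
        for on_a, on_b in ((atom_x, atom_y), (atom_y, atom_x)):
            if (on_a, on_b) in seen:
                continue
            seen.add((on_a, on_b))
            expanded.append((("A",) + on_a, ("B",) + on_b, upper))
    return expanded
```

A contact between the same nucleus on both chains is its own symmetry mate, so it is emitted only once.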
Solvent accessible surface area (SASA)
SASA was calculated using the program NACCESS.§
SASA buried in the dimer interface (ΔSASA) was calculated
as:
$$\Delta\mathrm{SASA} = (\mathrm{CFrA}_{\mathrm{SASA}} + \mathrm{CFrB}_{\mathrm{SASA}}) - \mathrm{CFrAB}_{\mathrm{SASA}}$$
where $\mathrm{CFrA}_{\mathrm{SASA}}$ and $\mathrm{CFrB}_{\mathrm{SASA}}$ are the SASA for each
subunit treated separately, and $\mathrm{CFrAB}_{\mathrm{SASA}}$ is the SASA for
the dimer structure. Interfacial residues are defined as any
amino acid that loses >1 Å2 SASA when the dimer is
compared to the individual subunits.
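The buried-area bookkeeping defined above can be sketched in Python as follows, assuming per-residue SASA values have already been obtained (for example, parsed from NACCESS output; the parsing itself is not shown). The function name and dictionary layout are hypothetical.

```python
def buried_sasa(free_a, free_b, dimer_a, dimer_b, cutoff=1.0):
    """Compute buried interface area and list interfacial residues.

    free_a, free_b   : dict, residue number -> SASA (Å^2) of each subunit
                       treated separately
    dimer_a, dimer_b : dict, residue number -> SASA (Å^2) of the same
                       residues within the dimer
    cutoff           : minimum SASA loss (Å^2) for a residue to count as
                       interfacial (>1 Å^2 in the text)

    Returns (delta_sasa, {"A": [...], "B": [...]}).
    """
    total_free = sum(free_a.values()) + sum(free_b.values())
    total_dimer = sum(dimer_a.values()) + sum(dimer_b.values())
    delta_sasa = total_free - total_dimer

    interface = {}
    for chain, free, dimer in (("A", free_a, dimer_a), ("B", free_b, dimer_b)):
        interface[chain] = sorted(
            res for res in free if free[res] - dimer[res] > cutoff
        )
    return delta_sasa, interface
```

Summing the per-residue SASA of both chains within the dimer gives the dimer SASA, so the first returned value corresponds to ΔSASA as defined above.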
Measurements of 15N nuclear relaxation rates and
15N-1H heteronuclear NOEs
Standard pulse sequences were used to measure the 15N
T1, T2 and heteronuclear NOEs.75,76 All experiments
utilize pulsed-field gradients for coherence selection,
reduction of artefacts and sensitivity enhancement. In
the CPMG sequence of the T2 experiment, 1H 180° pulses
were applied for elimination of cross-correlation between
1H-15N dipolar and 15N CSA relaxation mechanisms.77 A
delay of 0.75 ms was inserted between successive
applications of 15N 180° with 1H 180° pulses applied
every 4 ms in the CPMG pulse train. Spectra were
recorded with 112 complex points in the indirect dimension
and with spectral widths of 1822.49 Hz and 6009.6 Hz in the
15N and 1H dimensions, respectively. Delays of 0.030,
0.060, 0.100, 0.150, 0.220, 0.310, 0.420, and 0.550 s were
used for the T1 experiments. T2 values were measured
from spectra recorded with delays of 0.008, 0.016, 0.024,
0.032, 0.048, 0.064, 0.080, 0.096, and 0.120 s. The relaxation
delay was 1.9 s for each experimental set. For the
§ Hubbard, S. J. & Thornton, J. M. (1993). NACCESS.
Department of Biochemistry and Molecular Biology,
University College London.
heteronuclear NOE measurements, a pair of spectra was
recorded with and without proton saturation that was
achieved by application of 1H 120° pulses every 5 ms.
Spectra recorded with proton saturation utilized a 2 s
recycle delay followed by a 3 s period of saturation, while
those recorded in the absence of saturation employed a
recycle delay of 5 s.
All spectra were processed using NMRPipe/NMRDraw software with polynomial baseline correction after
multiplication with cosine-bell window functions. Linear
prediction was applied in the indirect dimension to
increase the number of complex points in that dimension
to 224 in the T1, T2 and heteronuclear NOE experiments,
followed by zero filling to generate 512 points. Peak
heights were calculated for every assigned peak in the T1
and T2 spectra and fitted to an exponential curve using
the SPARKY relaxation fit software.∥ T1 and T2 values were
determined from the decay curves using the equation:
$$I(\tau) = I(0)\exp(-\tau / T_{1,2})$$
where I(0) is the initial peak intensity and τ is the delay
time. The error estimates for the rate constants reflect the
likely error of the best fit from the parameters obtained for
a perfect exponential decay. Average values and errors are
reported in Results.
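The single-exponential decay above can be fitted in several ways. The sketch below uses a log-linear least-squares fit, a simpler stand-in for the SPARKY relaxation fit tool rather than a description of what that tool does internally. The function name is hypothetical, and the intensities in the usage note are synthetic.

```python
import math


def fit_decay(delays, heights):
    """Fit I(tau) = I(0) * exp(-tau / T) by linear regression on ln(I).

    delays  : relaxation delays tau in seconds
    heights : peak heights at each delay (must be positive)

    Returns (I0, T), where T is T1 or T2 in seconds.
    """
    n = len(delays)
    log_heights = [math.log(h) for h in heights]
    mean_x = sum(delays) / n
    mean_y = sum(log_heights) / n
    sxx = sum((x - mean_x) ** 2 for x in delays)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(delays, log_heights))
    slope = sxy / sxx              # slope of ln(I) versus tau is -1/T
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -1.0 / slope
```

As a usage example, the T1 delays listed above can be combined with synthetic peak heights generated from an assumed T1 of 0.5 s: `fit_decay([0.030, 0.060, 0.100, 0.150, 0.220, 0.310, 0.420, 0.550], heights)` then recovers that value. With noisy data, a log-linear fit weights the late, low-intensity points more heavily than a direct nonlinear fit would, so the two approaches can give slightly different results.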
Heteronuclear NOE values were calculated from the
ratio of peak heights with and without proton saturation.
Errors in these measurements were estimated from the
plane base noise in 2D 1H-15N-HSQC spectra recorded
with and without proton saturation.
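The heteronuclear NOE calculation above can be sketched in Python as follows. The ratio follows the text directly. The error formula is standard first-order propagation of the baseline noise in the two spectra, which is an assumption about how the noise estimate was converted into an error and is not stated in the paper. The function name and argument names are hypothetical.

```python
import math


def het_noe(height_sat, height_ref, noise_sat, noise_ref):
    """Heteronuclear NOE and its propagated uncertainty for one residue.

    height_sat : peak height with proton saturation
    height_ref : peak height without proton saturation
    noise_sat  : baseline noise of the saturated spectrum
    noise_ref  : baseline noise of the reference spectrum

    Returns (noe, error).
    """
    noe = height_sat / height_ref
    # First-order propagation for a ratio of two independent measurements
    # (assumed error model).
    relative = math.sqrt(
        (noise_sat / height_sat) ** 2 + (noise_ref / height_ref) ** 2
    )
    return noe, abs(noe) * relative
```

Peaks with heights comparable to the noise level give large relative errors under this model, which is one reason weak or overlapped peaks are often excluded from such analyses.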
X-ray crystallography
SS.CFr was crystallized in hanging drops (1 μl of protein
solution at 20 mg/ml with 1 μl of well solution). The well
solutions ranged from 30%–40% (v/v) methyl-2,4-pentanediol (MPD), 6% PEG-4K and 0.1 M of Na-Hepes (pH 6.9).
The protein crystals grew within two to six days and were
between 50 μm and 200 μm on a side. Since MPD is a
cryoprotectant at 30–40%, crystals were dunked in fresh well
solution and directly flash-frozen in liquid nitrogen. With
this treatment, the crystals diffracted in a tetragonal space
group (P43212) with unit cell dimensions a = 58.3 Å,
b = 58.3 Å, c = 96.7 Å. A single wavelength (0.9793 Å) native
data set was collected to 3.6 Å resolution on beam-line 5.4.1
at the ALS (Advanced Light Source, Lawrence Berkeley
Laboratory, Berkeley) using a four panel ADSC CCD area
detector. Data were processed and scaled using HKL2000.78
The phases for the SS.CFr dataset were solved by
molecular replacement (MR) with the program EPMR¶.
Residues Glu2–Leu50 in both subunits of the CFr NMR
structure (best NMR model) were used as the search
model. The two subunits were input as separate chains to
allow for relative rigid-body re-orientation. The correlation coefficient for the initial MR search, using data to
4.0 Å resolution, was 0.58, versus background of 0.36.
Further structural refinement against the model-derived
MR phases was attempted with model building in
simulated annealing composite-omit maps in XtalView,79
along with rigid-body refinement, torsion-angle based
∥ Goddard, T. D. & Kneller, D. G. (2005). SPARKY 3.111.
University of California, San Francisco.
¶ Kissinger, C. R. & Gehlhaar, D. K. (1997). EPMR: a
program for crystallographic molecular replacement by
evolutionary search. Agouron Pharmaceuticals, La Jolla,
CA.
simulated annealing, and conjugate-gradient based minimization in CNS.80
Protein Data Bank and BioMagRes database
accession numbers
The coordinates and corresponding NMR constraint
files for 20 NMR-derived CFr structures have been
deposited with the RCSB Protein Data Bankᵃ under the
identifier code 2GJH, and the chemical shift list
corresponding to this structure determination has been
deposited in the BioMagRes Databaseᵇ under the
accession code 7101.
Acknowledgements
We acknowledge the expert assistance of Steve
Reichow, Tom Leeper, and Kate Godin in NMR data
collection and processing, and modelling and
refinement of the CFr structure; Priti Deka for help
with NMR dynamics analysis of CFr; Juan Pizarro
and Django Sussman for help with crystallographic
data collection and processing; Soenke Seifert for
help with SAXS data collection; Mark DePristo for
insightful comments about mechanisms of protein
evolution; the facilities at NMRFAM (Madison, WI,
supported by NIH) and PNNL (Richland, WA,
supported by DOE) for access to NMR instrumentation;
and the facilities at the Advanced Light
Source (Berkeley, CA, supported by DOE) and the
Advanced Photon Source (Argonne, IL, supported
by DOE) for access to their synchrotron-source
X-ray beamlines. This work is supported in part by
grants from NIH-NIGMS (to G.V.) and NIH and
HHMI (to D.B.).
Supplementary Data
Supplementary data associated with this article
can be found, in the online version, at
doi:10.1016/j.jmb.2006.07.092
ᵃ http://www.rcsb.org/pdb/
ᵇ http://www.bmrb.wisc.edu
References
1. Dahiyat, B. I. & Mayo, S. L. (1997). De novo protein
design: fully automated sequence selection. Science,
278, 82–87.
2. Dantas, G., Kuhlman, B., Callender, D., Wong, M. &
Baker, D. (2003). A large scale test of computational
protein design: folding and stability of nine completely
redesigned globular proteins. J. Mol. Biol. 332,
449–460.
3. Dwyer, M. A., Looger, L. L. & Hellinga, H. W. (2004).
Computational design of a biologically active enzyme.
Science, 304, 1967–1971.
4. Korkegian, A., Black, M. E., Baker, D. & Stoddard,
B. L. (2005). Computational thermostabilization of an
enzyme. Science, 308, 857–860.
5. Kortemme, T., Joachimiak, L. A., Bullock, A. N.,
Schuler, A. D., Stoddard, B. L. & Baker, D. (2004).
Computational redesign of protein-protein interaction
specificity. Nature Struct. Mol. Biol. 11, 371–379.
6. Chevalier, B. S., Kortemme, T., Chadsey, M. S., Baker,
D., Monnat, R. J. & Stoddard, B. L. (2002). Design,
activity, and structure of a highly specific artificial
endonuclease. Mol. Cell. 10, 895–905.
7. Looger, L. L., Dwyer, M. A., Smith, J. J. & Hellinga,
H. W. (2003). Computational design of receptor and
sensor proteins with novel functions. Nature, 423,
185–190.
8. Harbury, P. B., Plecs, J. J., Tidor, B., Alber, T. & Kim,
P. S. (1998). High-resolution protein design with
backbone freedom. Science, 282, 1462–1467.
9. Kuhlman, B., Dantas, G., Ireton, G. C., Varani, G.,
Stoddard, B. L. & Baker, D. (2003). Design of a novel
globular protein fold with atomic-level accuracy.
Science, 302, 1364–1368.
10. Dobson, N., Dantas, G., Baker, D. & Varani, G. (2006).
High-resolution structural validation of the computational
redesign of human U1A protein. Structure, 14,
847–856.
11. Dahiyat, B. I. (1999). In silico design for protein
stabilization. Curr. Opin. Biotechnol. 10, 387–390.
12. DePristo, M. A., Weinreich, D. M. & Hartl, D. L. (2005).
Missense meanderings in sequence space: a biophysical
view of protein evolution. Nature Rev. Genet. 6,
678–687.
13. Kozak, M. (1999). Initiation of translation in
prokaryotes and eukaryotes. Gene, 234, 187–208.
14. Kurland, C. G. (1992). Translational accuracy and the
fitness of bacteria. Annu. Rev. Genet. 26, 29–50.
15. McClellan, A. J., Tam, S., Kaganovich, D. & Frydman,
J. (2005). Protein quality control: chaperones
culling corrupt conformations. Nature Cell. Biol. 7,
736–741.
16. Vabulas, R. M. & Hartl, F. U. (2005). Protein synthesis
upon acute nutrient restriction relies on proteasome
function. Science, 310, 1960–1963.
17. McClellan, A. J., Scott, M. D. & Frydman, J. (2005).
Folding and quality control of the VHL tumor
suppressor proceed through distinct chaperone
pathways. Cell, 121, 739–748.
18. Cazzola, M. & Skoda, R. C. (2000). Translational
pathophysiology: a novel molecular mechanism of
human disease. Blood, 95, 3280–3288.
19. Bence, N. F., Sampat, R. M. & Kopito, R. R. (2001).
Impairment of the ubiquitin-proteasome system by
protein aggregation. Science, 292, 1552–1555.
20. Horwich, A. (2002). Protein aggregation in disease: a
role for folding intermediates forming specific
multimeric interactions. J. Clin. Invest. 110, 1221–1232.
21. Kozak, M. (2002). Pushing the limits of the scanning
mechanism for initiation of translation. Gene, 299,
1–34.
22. Dobson, C. M. (1999). Protein misfolding, evolution
and disease. Trends Biochem. Sci. 24, 329–332.
23. Cohen, F. E. & Kelly, J. W. (2003). Therapeutic
approaches to protein-misfolding diseases. Nature,
426, 905–909.
24. Selkoe, D. J. (2003). Folding proteins in fatal ways.
Nature, 426, 900–904.
25. Maurizi, M. R. (1992). Proteases and protein
degradation in Escherichia coli. Experientia, 48, 178–201.
26. Gualerzi, C. O. & Pon, C. L. (1990). Initiation of mRNA
translation in prokaryotes. Biochemistry, 29, 5881–5889.
27. Folkers, P. J. M., Folmer, R. H. A., Konings, R. N. H.
& Hilbers, C. W. (1993). Overcoming the ambiguity
problem encountered in the analysis of nuclear
overhauser magnetic-resonance spectra of symmetrical dimer proteins. J. Amer. Chem. Soc. 115,
3798–3799.
28. Zwahlen, C., Legault, P., Vincent, S. J. F., Greenblatt, J.,
Konrat, R. & Kay, L. E. (1997). Methods for measurement of intermolecular NOEs by multinuclear NMR
spectroscopy: application to a bacteriophage lambda
N-peptide/boxB RNA complex. J. Amer. Chem. Soc.
119, 6711–6721.
29. Laskowski, R. J., Macarthur, M. W., Moss, D. S. &
Thornton, J. M. (1993). PROCHECK: a program to
check the stereochemical quality of protein structures.
J. Appl. Crystallog. 26, 283–291.
30. Sakai, H., Imamura, C., Osada, Y., Saito, R., Washio, T.
& Tomita, M. (2001). Correlation between Shine–
Dalgarno sequence conservation and codon usage of
bacterial genes. J. Mol. Evol. 52, 164–170.
31. Stenstrom, C. M., Holmgren, E. & Isaksson, L. A.
(2001). Cooperative effects by the initiation codon and
its flanking regions on translation initiation. Gene, 273,
259–265.
32. Yamagishi, K., Oshima, T., Masuda, Y., Ara, T.,
Kanaya, S. & Mori, H. (2002). Conservation of
translation initiation sites based on dinucleotide
frequency and codon usage in Escherichia coli K-12
(W3110): non-random distribution of A/T-rich
sequences immediately upstream of the translation
initiation codon. DNA Res. 9, 19–24.
33. Ma, J., Campbell, A. & Karlin, S. (2002). Correlations
between Shine-Dalgarno sequences and gene features
such as predicted expression levels and operon
structures. J. Bacteriol. 184, 5733–5745.
34. Kozak, M. (1984). Selection of initiation sites by
eucaryotic ribosomes: effect of inserting AUG triplets
upstream from the coding sequence for preproinsulin.
Nucl. Acids Res. 12, 3873–3893.
35. Spanjaard, R. A. & van Duin, J. (1989). Translational
reinitiation in the presence and absence of a Shine and
Dalgarno sequence. Nucl. Acids Res. 17, 5501–5507.
36. de Smit, M. H. & van Duin, J. (1990). Control of
prokaryotic translational initiation by mRNA secondary structure. Prog. Nucl. Acid Res. Mol. Biol. 38, 1–35.
37. Saito, R. & Tomita, M. (1999). On negative selection
against ATG triplets near start codons in eukaryotic
and prokaryotic genomes. J. Mol. Evol. 48, 213–217.
38. Schubert, U., Ott, D. E., Chertova, E. N., Welker, R.,
Tessmer, U., Princiotta, M. F. et al. (2000). Proteasome
inhibition interferes with gag polyprotein processing,
release, and maturation of HIV-1 and HIV-2. Proc. Natl
Acad. Sci. USA, 97, 13057–13062.
39. Pabst, T., Mueller, B. U., Zhang, P., Radomska,
H. S., Narravula, S., Schnittger, S. et al. (2001).
Dominant-negative mutations of CEBPA, encoding
CCAAT/enhancer binding protein-alpha (C/EBPalpha), in acute myeloid leukemia. Nature Genet. 27,
263–270.
40. Wechsler, J., Greene, M., McDevitt, M. A., Anastasi, J.,
Karp, J. E., Le Beau, M. M. & Crispino, J. D. (2002).
Acquired mutations in GATA1 in the megakaryoblastic leukemia of Down syndrome. Nature Genet. 32,
148–152.
41. Hann, S. R., King, M. W., Bentley, D. L., Anderson, C. W.
& Eisenman, R. N. (1988). A non-AUG translational
initiation in c-myc exon 1 generates an N-terminally
distinct protein whose synthesis is disrupted in Burkitt's
lymphomas. Cell, 52, 185–195.
42. Mellentin, J. D., Smith, S. D. & Cleary, M. L. (1989).
lyl-1, a novel gene altered by chromosomal translocation
in T cell leukemia, codes for a protein with a
helix-loop-helix DNA binding motif. Cell, 58, 77–83.
43. Rico, M., Jimenez, M. A., Gonzalez, C., De Filippis, V.
& Fontana, A. (1994). NMR solution structure of the
C-terminal fragment 255-316 of thermolysin: a dimer
formed by subunits having the native structure.
Biochemistry, 33, 14834–14847.
44. Tasayco, M. L. & Carey, J. (1992). Ordered self-assembly
of polypeptide fragments to form nativelike
dimeric trp repressor. Science, 255, 594–597.
45. Fontana, A., de Laureto, P. P., Spolaore, B., Frare, E.,
Picotti, P. & Zambonin, M. (2004). Probing protein
structure by limited proteolysis. Acta Biochim. Pol. 51,
299–321.
46. Wu, L. C., Grandori, R. & Carey, J. (1994). Autonomous subdomains in protein folding. Protein Sci. 3,
369–371.
47. Philipp, S., Kim, Y. M., Durr, I., Wenzl, G., Vogt, M. &
Flecker, P. (1998). Mutational analysis of disulfide
bonds in the trypsin-reactive subdomain of a
Bowman-Birk-type inhibitor of trypsin and chymotrypsin–cooperative versus autonomous refolding
of subdomains. Eur. J. Biochem. 251, 854–862.
48. Goldberg, A. L. (2003). Protein degradation and
protection against misfolded or damaged proteins.
Nature, 426, 895–899.
49. Michaels, J. E., Schimmel, P., Shiba, K. & Miller, W. T.
(1996). Dominant negative inhibition by fragments of
a monomeric enzyme. Proc. Natl Acad. Sci. USA, 93,
14452–14455.
50. Lupas, A. N., Ponting, C. P. & Russell, R. B. (2001). On
the evolution of protein folds: are similar motifs in
different protein folds the result of convergence,
insertion, or relics of an ancient peptide world?
J. Struct. Biol. 134, 191–203.
51. Andrade, M. A., Perez-Iratxeta, C. & Ponting, C. P.
(2001). Protein repeats: structures, functions, and
evolution. J. Struct. Biol. 134, 117–131.
52. Grishin, N. V. (2001). Fold change in evolution of
protein structures. J. Struct. Biol. 134, 167–185.
53. Holm, L. & Sander, C. (1995). Dali: a network tool for
protein structure comparison. Trends Biochem. Sci. 20,
478–480.
54. Yu, E. W., McDermott, G., Zgurskaya, H. I., Nikaido,
H. & Koshland, D. E., Jr (2003). Structural basis of
multiple drug-binding capacity of the AcrB multidrug
efflux pump. Science, 300, 976–980.
55. Bloom, J. D., Silberg, J. J., Wilke, C. O., Drummond,
D. A., Adami, C. & Arnold, F. H. (2005). Thermodynamic prediction of protein neutrality. Proc. Natl
Acad. Sci. USA, 102, 606–611.
56. Martin, L. & Vita, C. (2000). Engineering novel
bioactive mini-proteins from small size natural and
de novo designed scaffolds. Curr. Protein Pept. Sci. 1,
403–430.
57. Schnolzer, M. & Kent, S. B. (1992). Constructing
proteins by dovetailing unprotected synthetic peptides: backbone-engineered HIV protease. Science, 256,
221–225.
58. Dawson, P. E., Muir, T. W., Clark-Lewis, I. & Kent, S. B.
(1994). Synthesis of proteins by native chemical
ligation. Science, 266, 776–779.
59. Kochendoerfer, G. G. (2001). Chemical protein synthesis methods in drug discovery. Curr. Opin. Drug
Discov. Devel. 4, 205–214.
60. Kochendoerfer, G. G., Chen, S. Y., Mao, F., Cressman,
S., Traviglia, S., Shao, H. et al. (2003). Design and
chemical synthesis of a homogeneous polymer-modified
erythropoiesis protein. Science, 299, 884–887.
61. Vita, C., Roumestand, C., Toma, F. & Menez, A.
(1995). Scorpion toxins as natural scaffolds for
protein engineering. Proc. Natl Acad. Sci. USA, 92,
6404–6408.
62. Kibler, D. & Hampson, S. (2002). Characterizing the
E. coli shine-dalgarno site: probability matrices and
weight matrices. International Conference on
Mathematical and Engineering Techniques in Medicine and
Biological Science - METMBS 2002. CSREA Press, Las
Vegas, NV, USA.
63. Shine, J. & Dalgarno, L. (1974). The 3′-terminal
sequence of Escherichia coli 16S ribosomal RNA:
complementarity to nonsense triplets and ribosome
binding sites. Proc. Natl Acad. Sci. USA, 71,
1342–1346.
64. Lipfert, J., Millett, I. S., Seifert, S. & Doniach, S. (2006).
Sample holder for small-angle X-ray scattering static
and flow cell measurements. Rev. of Sci. Instrum. 77,
046108-1–046108-3.
65. Glatter, O. & Kratky, O. (1982). Small Angle X-ray
Scattering. Academic Press, London.
66. Doniach, S. (2001). Changes in biomolecular
conformation seen by small angle X-ray scattering. Chem.
Rev. 101, 1763–1778.
67. Cohn, E. J. & Edsall, J. T. (1943). Proteins, Amino Acids
and Peptides as Ions and Dipolar Ions. Reinhold
Publishing Corporation, New York.
68. Laue, T. M. (1992). Short column sedimentation
equilibrium analysis for characterization of
macromolecules in solution. Spinco Business Unit, Palo
Alto, CA.
69. Kuhlman, B., O'Neill, J. W., Kim, D. E., Zhang, K. Y. &
Baker, D. (2001). Conversion of monomeric protein L
to an obligate dimer by computational protein design.
Proc. Natl Acad. Sci. USA, 98, 10687–10691.
70. Delaglio, F., Grzesiek, S., Vuister, G. W., Zhu, G.,
Pfeifer, J. & Bax, A. (1995). NMRPipe: a multidimensional
spectral processing system based on UNIX
pipes. J. Biomol. NMR, 6, 277–293.
71. Sattler, M., Schleucher, J. & Griesinger, C. (1999).
Heteronuclear multidimensional NMR experiments
for the structure determination of proteins in solution
employing pulsed field gradients. Prog. Nucl. Magn.
Reson. Spectr. 34, 93–158.
72. Guntert, P. (2003). Automated NMR protein structure
calculation. Prog. Nucl. Magn. Reson. Spectr. 43,
105–125.
73. Cornilescu, G., Delaglio, F. & Bax, A. (1999). Protein
backbone angle restraints from searching a database
for chemical shift and sequence homology. J. Biomol.
NMR, 13, 289–302.
74. Sayle, R. A. & Milner-White, E. J. (1995). RASMOL:
biomolecular graphics for all. Trends Biochem. Sci. 20,
374.
75. Farrow, N. A., Muhandiram, R., Singer, A. U., Pascal,
S. M., Kay, C. M., Gish, G. et al. (1994). Backbone
dynamics of a free and phosphopeptide-complexed
Src homology 2 domain studied by 15N NMR
relaxation. Biochemistry, 33, 5984–6003.
76. Deka, P., Rajan, P. K., Perez-Canadillas, J. M. & Varani,
G. (2005). Protein and RNA dynamics play key roles
in determining the specific recognition of GU-rich
polyadenylation regulatory elements by human
Cstf-64 protein. J. Mol. Biol. 347, 719–733.
77. Boyd, J., Hommel, U. & Campbell, I. D. (1990).
Influence of cross-correlation between dipolar and
anisotropic chemical-shift relaxation mechanisms
upon longitudinal relaxation rates of N-15 in
macromolecules. Chem. Phys. Letters, 175, 477–482.
78. Otwinowski, Z. & Minor, W. (1997). Processing of
X-ray diffraction data collected in oscillation mode.
Methods Enzymol. 276, 307–326.
79. McRee, D. E. (1999). A versatile program for manipulating atomic coordinates and electron density.
J. Struct. Biol. 125, 156–165.
80. Brunger, A. T., Adams, P. D., Clore, G. M., DeLano,
W. L., Gros, P., Grosse-Kunstleve, R. W. et al. (1998).
Crystallography and NMR system: A new software
suite for macromolecular structure determination.
Acta Crystallog. sect. D, 54, 905–921.
81. Mori, S., Abeygunawardana, C., Johnson, M. O. & van
Zijl, P. C. (1995). Improved sensitivity of HSQC spectra
of exchanging protons at short interscan delays using a
new fast HSQC (FHSQC) detection scheme that avoids
water saturation. J. Magn. Reson. ser. B, 108, 94–98.
82. Kohn, J. E., Millett, I. S., Jacob, J., Zagrovic, B., Dillon,
T. M., Cingel, N. et al. (2004). Random-coil behavior
and the dimensions of chemically unfolded proteins.
Proc. Natl Acad. Sci. USA, 101, 12491–12496.
Edited by F. Schmid
(Received 17 May 2006; received in revised form 21 July 2006; accepted 29 July 2006)
Available online 4 August 2006