Protein structure
Predictive methods and
experimental methodologies
Sources for this lecture
Bioinformatics (course text: Baxevanis and Ouellette)
Chapter 8: Predictive methods using protein sequences (Ofran and Rost) 198-219
Chapter 9: Protein structure prediction and analysis (Wishart) 224-247
Chapter 12: Creation and analysis of protein multiple sequence alignments (Barton)
pgs 333-336; (Required reading during weeks 3-4 on multiple sequence alignment)
Proteins: Structures and molecular properties (Creighton)
Introduction to Protein Structure (Branden & Tooze)
Much of the text in the slides that follow is either drawn verbatim
or paraphrased from these texts.
Topics Covered
Methods for solving protein structures experimentally
Overview of protein structure: primary, secondary, tertiary,
and quaternary
Overview of protein folding
Protein structure classification resources
CATH
SCOP
The Structural Genomics Initiative
The importance of protein structure
Bioinformatics is much more than just sequence
analysis; many of the most interesting and exciting
applications in bioinformatics today actually are
concerned with structure analysis.
The origins of bioinformatics actually lie in the field of
structural biology.
Proteins are perhaps the most complex chemical entities in
nature. No other class of molecule exhibits the variety
and irregularity in shape, size, texture and mobility that
can be found in proteins.
Baxevanis & Ouellette (Ch. 9, p.224, Wishart)
Primary, Secondary, Tertiary and Quaternary Structure
Secondary structure
α helix first described by
Linus Pauling in 1951
Avg length: 10 residues
(3 turns)
range: from 4 to over 40
residues
good helix formers:
A, E, L, M
very poor formers:
P, G, Y, S
most common location:
on surface, one side
buried, one side exposed
transmembrane helices
have almost entirely
hydrophobic side chains
(Branden & Tooze)
http://www.web-books.com/MoBio/Free/Ch2C4.htm
Amphipathic
alpha helix
http://www.web-books.com/MoBio/Free/Ch2C4.htm
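The "one side buried, one side exposed" pattern of an amphipathic helix can be made quantitative with a hydrophobic moment. The sketch below is an illustration, not something taken from the cited texts. It assumes the Kyte-Doolittle hydropathy scale and an ideal alpha helix with 100 degrees of rotation per residue (3.6 residues per turn). The function name and both test sequences are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (an assumed choice; other scales exist).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(seq, degrees_per_residue=100.0):
    """Mean hydrophobic moment of seq, treated as an ideal helix."""
    delta = math.radians(degrees_per_residue)
    # Each residue contributes a vector pointing outward from the helix axis,
    # scaled by its hydropathy.
    x = sum(KD[aa] * math.cos(i * delta) for i, aa in enumerate(seq))
    y = sum(KD[aa] * math.sin(i * delta) for i, aa in enumerate(seq))
    return math.hypot(x, y) / len(seq)


# Hypothetical sequences: alternating Leu/Lys blocks versus all-Leu.
amphipathic = "LKKLLKKLLKKLLKKL"
uniform = "LLLLLLLLLLLLLLLLLL"  # 18 residues = 5 full turns

print(hydrophobic_moment(amphipathic))
print(hydrophobic_moment(uniform))
```

A large moment means hydrophobic residues cluster on one face of the helix. The all-leucine helix is hydrophobic on every face, so its moment is close to zero, which is the pattern expected for a transmembrane helix.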
Beta strand
http://www.web-books.com/MoBio/Free/Ch2C4.htm
Beta sheet
http://www.web-books.com/MoBio/Free/Ch2C4.htm
Super-secondary structures
Somewhat short structural segments
These may play key functional roles
Helix-turn-helix and EF-hand
http://chemistry.umeche.maine.edu/CHY431/Proteins8.html
Supersecondary structure:
Helix-turn-helix
4-helical bundles
More about 4-helical bundles
Coiled Coil
A small selection of common folds
TM proteins: Porin & Rhodopsin
http://www.mhl.soton.ac.uk/public/research/projects/current/rhodopsin/index.html
Flavodoxin
Thioredoxin
TIM Barrel
Rossmann fold
Repeat structures: LRR
LRR structures
across the Tree
of Life
Repeat structures: TPR
Structure Classification Databases
SCOP and CATH
CATH
CATH classification
Structural Classification of
Proteins (SCOP)
and the Astral datasets
Evolution of structure and function
Extend function prediction through inclusion of
structure prediction and analysis
Anti-fungal defensin
(Radish)
Drosomycin
(Drosophila)
Scorpion toxin
SCOP and ASTRAL
SCOP is a database providing a hierarchy of protein structural domains
divided into Class, Fold, Superfamily and Family
Domains are defined by SCOP primarily by evolutionary evidence
(found in proteins of different domain architectures)
Domains are placed into classes based on similar secondary structure
content (e.g., all alpha, all beta, alpha+beta, etc)
Domains in the same SCOP fold are asserted to have a similar
topology, but their actual homology (evolutionary relationship) may not
be known
Domains in the same SCOP superfamily are asserted by SCOP to
have a common ancestor (evidence may appear obscure to some
biologists, and is based on structural and sequence analyses)
Domains in the same SCOP family generally have a high functional
similarity
ASTRAL is a subset of SCOP, used to assess protein structure
prediction methods
See, e.g., PDB40: a subset of SCOP domains such that no pair has greater than
40% sequence identity
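The idea behind a PDB40-style subset can be sketched as a greedy redundancy filter. This is a minimal illustration and not the procedure ASTRAL actually uses. It assumes the domain sequences are already aligned to equal length (with "-" for gaps) and that identity is counted over columns where neither sequence has a gap. All names and sequences below are made up.

```python
def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def nonredundant_subset(aligned, max_identity=40.0):
    """Greedily keep domains so that no kept pair exceeds max_identity."""
    kept = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[k]) <= max_identity for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy alignment.
toy = {
    "d1": "ACDEFGHIKL",
    "d2": "ACDEFGHIKV",  # differs from d1 at one position
    "d3": "ACDEWWWWWW",  # shares only the first four positions with d1
    "d4": "MNPQRSTVWY",
}

print(nonredundant_subset(toy))
```

The result of a greedy filter depends on the order in which domains are visited, so a real dataset needs an explicit rule for which representative to keep.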
The complexity of protein structure
John Kendrew, upon solving a low-resolution structure of
myoglobin in 1958:
"Perhaps the most remarkable features of the molecule
are its complexity and its lack of symmetry.
The arrangement seems to be almost totally lacking in
the kind of regularities which one instinctively
anticipates, and it is more complicated than has been
predicted by any theory of protein structure."
Reported in Introduction to Protein Structure (Branden & Tooze)
Experimental methods for solving
protein 3D structure
Experimental determination of
protein structure
X-ray crystallography
NMR spectroscopy
Many experimental issues:
Some proteins are hard to solve (e.g., membrane proteins)
Long proteins normally hard to solve, and are often divided into
individual domains
Getting the domain boundaries correct is hard
Numerous other problems (which is why predicting protein
structure is useful)
X-ray crystallography
Most accurate; can be applied to larger proteins
Oldest method; first structure (myoglobin) determined in the late 1950s
(Kendrew et al., 1958). More than 20K structures solved to date
Method:
Small protein crystals (measuring <1 mm) exposed to X-ray beams
X-rays (which have a wavelength of 1-2 Angstroms) are scattered or
diffracted by the protein atoms in the crystal
Diffraction pattern appears as tens of thousands of tiny spots arrayed in
complex circular patterns
XYZ coordinates determined from these diffraction patterns, along with
intensity and phase information
Molecular replacement (comparative models based on homologous
structures) can facilitate that process.
Baxevanis & Ouellette (Ch. 9, Wishart)
Limitations of X-ray crystallography
Structures are solved in an artificial solid-state (crystalline) environment
unlike the natural (liquid) environment of cell.
Structures can be altered by crystal packing and solvent exclusion effects.
Some regions cannot be resolved properly (especially very mobile
regions)
these can be left as gaps in the 3D structure, or simply appear fuzzy
R factor: measure of agreement (actually divergence: smaller is better)
between calculated structure and experimental data
0.25 (for good protein structures of normal size)
0.05 (for small molecules)
Completely wrong structures can have an R factor of 0.59
It is not unusual for many protein structures to have some errors,
ambiguities or inaccuracies in atomic positions (+/- 0.5 Angstroms), or
to be missing atoms and residues (e.g., TrwB, PDB 1E9RA)
Baxevanis & Ouellette (Ch. 9, Wishart)
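The R factor described above is conventionally computed as the summed absolute difference between observed and calculated structure-factor amplitudes, divided by the summed observed amplitudes. The sketch below assumes the calculated amplitudes have already been scaled to the observed ones. The amplitude values are made up for illustration.

```python
def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|); smaller is better."""
    if len(f_obs) != len(f_calc):
        raise ValueError("need one calculated amplitude per observed amplitude")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


# Hypothetical amplitudes for five reflections (not real data).
f_obs = [100.0, 80.0, 60.0, 40.0, 20.0]
f_calc = [90.0, 85.0, 50.0, 45.0, 20.0]

print(r_factor(f_obs, f_calc))
```

A perfect model gives R = 0, and larger values mean larger disagreement with the data. This is why the slide calls R a measure of divergence.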
NMR spectroscopy
Much newer: first NMR structure in 1983
Allows biologists to study structure and dynamics of molecules in liquid state
(or near-physiological environment)
Structures solved by measuring how radio waves are absorbed by atomic
nuclei
Absorption measurement allows the determination of how much nuclear
magnetism is transferred from one atom (or nucleus) to another
Magnetization transfer measured through chemical shifts, J-couplings and nuclear
Overhauser effects
Measured parameters define a set of approximate structural constraints that are fed
into a constraint minimization calculation (distance geometry or simulated annealing)
Result is an ensemble of 15-50 structures that satisfy the experimental
constraints
These multiple structures are overlaid/superimposed on each other to produce
"blurrograms"
NMR result is potentially more reflective of true solution behavior of proteins;
most proteins seem to exist in an ensemble of slightly different configurations
Baxevanis & Ouellette (Ch. 9, Wishart)
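The "blur" in an overlaid NMR ensemble can be summarized per atom as the root-mean-square deviation from that atom's mean position across the models. The sketch below assumes the models are already superimposed and list the same atoms in the same order. The function name and coordinates are hypothetical.

```python
import math


def per_atom_spread(models):
    """RMS deviation of each atom from its mean position across models."""
    n_models = len(models)
    n_atoms = len(models[0])
    spread = []
    for a in range(n_atoms):
        # Mean position of atom a over all models.
        mean = [sum(m[a][k] for m in models) / n_models for k in range(3)]
        # Mean squared distance of atom a from that mean position.
        msd = sum(
            sum((m[a][k] - mean[k]) ** 2 for k in range(3)) for m in models
        ) / n_models
        spread.append(math.sqrt(msd))
    return spread


# Hypothetical ensemble: 3 models, 2 atoms each (x, y, z in Angstroms).
ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, -1.0, 0.0)],
]

print(per_atom_spread(ensemble))
```

Atoms with a small spread are well defined by the constraints. A large spread can reflect either real mobility or too few constraints in that region.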
Limitations of NMR spectroscopy
Size limitations: maximum of 30 kD (~250 aa)
Solubility of molecule
cannot be applied to membrane proteins
Expensive: requires special isotopically labeled
molecules
Inherently less precise
Baxevanis & Ouellette (Ch. 9, Wishart)
Structural superposition
Superposing protein structures can reveal startling
structural similarities not apparent from primary
sequence comparison
Structural superposition is the basis for evaluating
many alignment methods
Structural superposition has a great degree of
variability across methods
Also: proteins can be solved in a bound or unbound
conformation
The two conformations can be significantly different
Flexible superposition is important
Popular methods include: DALI, CE, VAST, Structal
Storing and retrieving protein structures
The Protein Data Bank (PDB)
First electronic database in bioinformatics
Set up at Brookhaven National Laboratory by Walter Hamilton in 1971
7 protein structures at database initiation
Coordinates stored and distributed on punch cards and computer tape
Currently
22K structures (as of October 23, 2005) (~46K as of October 2007)
Coordinate distribution and deposition is electronic (via the world wide web)
Moved to the Research Collaboratory for Structural Bioinformatics (RCSB) in
1998
Primary archival center for experimentally determined 3D structures of proteins,
nucleic acids, carbohydrates and complexes
Separate repository for theoretical models
Baxevanis & Ouellette (Ch. 9, Wishart)
http://www.usm.maine.edu/~rhodes/ModQual/index.html
Structural Genomics Initiative
From Andras Fiser
Albert Einstein College of Medicine
Why is it useful to know the structure of a
protein, not only its sequence?
The 3D structure is more informative than sequence because patterns in space are
frequently more recognizable than patterns in sequence
Evolution tends to conserve function, and
function depends more directly on structure
than on sequence, so structure is more
conserved in evolution than sequence.
Andras Fiser, Albert Einstein College of Medicine
Why Protein Structure Prediction?
Year 2005
Sequences: 2,300,000
Structures: 29,000
We know the experimental 3D structure for
~1% of the protein sequences
Andras Fiser, Albert Einstein College of Medicine
Principles of Protein Structure
[Figure: a sequence (GFCHIKAYTRLIMVG) is related to structure through folding (ab initio prediction) and through evolution (fold recognition, comparative modeling). Labeled examples: Anacystis nidulans, Anabaena 7120, Chondrus crispus, Desulfovibrio vulgaris.]
Andras Fiser, Albert Einstein College of Medicine
Protein structure modeling
Ab initio prediction
Applicable to any sequence
Not very accurate (>4 Ang RMSD)
Attempted for proteins of <100 residues
Accuracy and applicability are limited by our understanding of the
protein folding problem
Comparative Modeling
Applicable only to sequences that share recognizable similarity to a
template structure
Fairly accurate (<3 Ang RMSD), typically comparable to a
low-resolution X-ray experiment
Not limited by size
Accuracy and applicability are limited by the number of known folds
Andras Fiser, Albert Einstein College of Medicine
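The accuracy figures above are root-mean-square deviations (RMSD) between equivalent atoms of a model and a reference structure. The sketch below assumes the two coordinate sets are already optimally superimposed and list equivalent atoms in the same order. It does not perform the superposition itself, and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical C-alpha coordinates in Angstroms (not real structures).
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
reference = [(0.0, 3.0, 0.0), (1.0, 3.0, 0.0), (2.0, 3.0, 0.0)]

print(rmsd(model, reference))
```

In practice the reported RMSD is the minimum over rigid-body rotations and translations, so a superposition step comes before this calculation.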
What makes comparative modeling possible
I. A small difference in the sequence makes a small
difference in the structure
II. Protein structures are clustered into fold families
Andras Fiser, Albert Einstein College of Medicine
Structural Genomics
Characterize most protein sequences (red) based on related
known structures (green).
The number of families is
much smaller than the number
of proteins
Andras Fiser, Albert Einstein College of Medicine
Structural Genomics
Definition: The aim of structural genomics is to put every
protein sequence within a modeling distance of a
known protein structure.
Size of the problem:
There are a few thousand domain fold families.
There are ~20,000 sequence families (30% sequence identity).
Solution:
Determine protein structures for as many different
families as possible.
Model the rest of the family members using comparative
modeling
Andras Fiser, Albert Einstein College of Medicine
Protein folding
Information required for folding is (mostly)
contained in the primary sequence
Early on, proteins were shown to fold into their native structures
in isolation
This led to the belief that structure is determined by sequence
alone (Anfinsen, 1973)
Over the last decade, a significant number of proteins have
been shown to not fold properly in the test tube (e.g., requiring
the assistance of chaperonins)
Nevertheless, the native 3D structure is assumed to be in some
energetic minimum
This led to the development of ab initio folding methods
Baxevanis & Ouellette (Ch. 9, Wishart)
Folding pathways
Evidence that local structure segments form first, and
then pack against each other to form 3D fold
Exploited in protein fold prediction, Rosetta method
Simons, Bonneau, Ruczinski & Baker (1999). Ab initio Protein
Structure Prediction of CASP III Targets Using ROSETTA. Proteins
Semi-stable structural intermediates on folding pathway
to lowest-energy conformation
Prof. Susan Marqusee, Berkeley
Baxevanis & Ouellette (Ch. 9, Wishart)
Structural studies have provided
insights into protein folding
When high-resolution studies of myoglobin became available, Kendrew
noticed that amino acids in the interior had almost exclusively
hydrophobic side chains
Over time, structural studies have shown the following:
The main driving force for folding water-soluble globular proteins is to
pack hydrophobic side chains into the interior, thus creating a
hydrophobic core and a hydrophilic surface
Bringing hydrophobic side chains into the interior requires that the highly polar
main chain (with one hydrogen bond donor, NH, and one hydrogen bond
acceptor, C=O) is brought along
In a hydrophobic environment, these main-chain polar groups must be
neutralized by the formation of hydrogen bonds
This problem is solved elegantly by the formation of regular secondary
structure in the interior of the molecule
Internal cavities are usually occupied by water molecules that hydrogen-bond
to internal polar groups
From Introduction to Protein Structure (Branden & Tooze)
Hierarchical descriptions of proteins
(follows the folding process)
Primary structure: the amino acid sequence
Secondary structure: regular local structure of linear segments of polypeptide chains
(Creighton)
Helix (~35% of residues): subtypes: α, π and 3_10
Beta sheet (~25% of residues)
Both types predicted by Linus Pauling (Corey and Pauling, 1953;
α helix first described by Pauling in 1951)
Other less common structures:
Beta turns
3_10 helices
Ω loops
Remaining unclassifiable regions termed random coil or unstructured regions
http://www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm
Tertiary structure: Overall topology of the folded polypeptide chain (Creighton)
Mediated by hydrophobic interactions between distant parts of protein
Quaternary structure: Aggregation of the separate polypeptide chains of a protein
(Creighton)
Baxevanis & Ouellette (Ch. 9, p.224, Wishart)
Folded conformations of globular
proteins
Most proteins are globular: natural proteins in solution are much
smaller in their dimensions than comparable polypeptides with
random or repetitive conformations and have roughly spherical
shapes
Denaturation: Most proteins are robust to changes in their
environment, until they (somewhat literally) fall apart:
Most proteins are robust to changes in temperature, pH and
pressure, exhibiting little or no change until a point is reached at
which there is a sudden change and loss of biological function
Denaturing proteins has been used to explore folding pathways
e.g., Understanding how proteins fold: the lysozyme story so far. Dobson CM, Evans
PA, Radford SE. Trends Biochem Sci. 1994
Creighton, Proteins Ch. 6
Structural domains
Folded structures of most small proteins are roughly spherical and
remarkably compact
Proteins with >200 aa tend to consist of two or more structural units, called
domains
Domains interact to varying extents, but less extensively than do
structural elements within domains
Some domain detection tools make use of this pattern, looking for
covariation between positions as evidence of interaction, and lack of
covariation as evidence of domain boundaries
Nagarajan and Yona, Automatic prediction of protein domains from sequence
information using a hybrid learning system. Bioinformatics 2004
Domains may not always be well segregated; some proteins have
multiple domains with two or three polypeptide connections between
domains
See, for example, the SCOP interleaved domains
Domains may also be connected by flexible linker regions
Creighton, Proteins Ch. 6
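Covariation between two alignment positions, as mentioned above, is often measured with mutual information. The sketch below is a generic illustration and not the method of Nagarajan and Yona. It assumes each column is given as a string with one residue per aligned sequence. The example columns are made up.

```python
import math
from collections import Counter


def column_mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    if n == 0 or n != len(col_j):
        raise ValueError("columns must be non-empty and of equal length")
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical columns from an 8-sequence alignment.
col_a = "AAAALLLL"
covarying = "DDDDKKKK"    # changes whenever col_a changes
independent = "DKDKDKDK"  # unrelated to col_a

print(column_mutual_information(col_a, covarying))
print(column_mutual_information(col_a, independent))
```

High values between positions suggest they interact within a domain. A stretch of positions with uniformly low values against the rest of the sequence is the kind of signal read as a possible domain boundary.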
Structural domains (contd)
Defining domains is a subjective process, done in different ways
by different people
Domains are most evident by their compactness
Expressed quantitatively as the ratio of the surface area of a domain to
the surface area of a sphere with the same volume
Observed values are 1.65 +/- 0.08
Course of polypeptide backbone through domain is irregular, but
generally follows moderately straight course through the domain
and then makes a U-turn to re-cross the domain
Overall impression: segments of somewhat stiff polypeptide chain
interspersed with relatively tight turns or bends (almost always on
the molecule's surface)
Compared to behavior of a fire hose dropped in one spot
Creighton, Proteins Ch. 6
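The compactness measure above (domain surface area divided by the surface area of a sphere of equal volume) follows from the sphere formulas V = (4/3)πr³ and A = 4πr². The sketch below assumes the surface area and volume of the domain are already known. The function names are hypothetical, and the cube is only a sanity check.

```python
import math


def sphere_area_for_volume(volume):
    """Surface area of the sphere that encloses the given volume."""
    radius = (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)
    return 4.0 * math.pi * radius ** 2


def compactness_ratio(surface_area, volume):
    """Ratio of a domain's surface area to that of an equal-volume sphere."""
    return surface_area / sphere_area_for_volume(volume)


# Sanity checks with simple shapes (not protein data).
sphere_r2 = compactness_ratio(4.0 * math.pi * 2.0 ** 2, 4.0 / 3.0 * math.pi * 2.0 ** 3)
unit_cube = compactness_ratio(6.0, 1.0)

print(sphere_r2)
print(unit_cube)
```

A perfect sphere gives a ratio of 1, and any other shape gives a larger value. On this scale, the observed value of about 1.65 means a domain has roughly 65% more surface than the most compact possible shape.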
Driving forces in protein folding
Complex combination of local and global
forces
Local forces drive secondary structure formation
Repulsion between hydrophobic side chains of some
amino acids and hydrophilic backbone of protein chain
(intra-molecular)
Interaction between side chains and surrounding solvent
Subcellular environment (e.g., membrane, secreted, etc.)
Pauling et al 1951
Baxevanis & Ouellette (Ch. 9, Wishart)
Summary of driving forces in protein
folding
Hydrophobicity
Hydrophobic residues need to be shielded from solvent
Polar residues to the outside, hydrophobic to the inside
Stronger interactions
Hydrogen bonds, disulfide bridges
Weak interactions
Van der Waals, electrostatic, etc
Recommended reading: Proteins (Thomas Creighton).
Global effects on protein fold
Long-range interactions (repulsive or
attractive) between distant parts of structure
These can override local effects
E.g., chameleon protein:
11 amino acids adopt helical structure in one region, and
the same 11 amino acids adopt beta strand in another.
Minor & Kim, 1996
Baxevanis & Ouellette (Ch. 9, Wishart)