Lecture 03 Protein Sequence Analysis
[Figure: Systems Biology; Quantitative Proteomics; Protein Sequencing; Structural Proteomics]
• Each amino acid can be represented by a single-letter code
Amino Acids & Peptide bonds
https://www2.chemistry.msu.edu/faculty/reusch/VirtTxtJml/protein2.htm
But proteins are 3D, Right?
Peptide bond (Planar)
The Dihedral Angles
To visualize the dihedral angle of four atoms:
1. Look down the second bond vector
[Figure: backbone dihedral angles; labels: ω angle, α, phi (φ)]
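A minimal sketch of how a dihedral angle can be computed from four atom positions, assuming Cartesian coordinates are available. The function names and the test points are illustrative and not from the lecture. The idea mirrors the slide: project the first and third bonds onto the plane perpendicular to the second bond, then measure the signed angle between the projections.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points p0-p1-p2-p3."""
    b0 = sub(p0, p1)          # first bond, pointing away from the central bond
    b1 = sub(p2, p1)          # second (central) bond: the viewing axis
    b2 = sub(p3, p2)          # third bond
    norm = math.sqrt(dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Remove the components along the central bond (projection onto its normal plane)
    v = sub(b0, tuple(dot(b0, b1) * x for x in b1))
    w = sub(b2, tuple(dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))

# Hypothetical coordinates: central bond along z
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))   # cis arrangement
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))  # trans arrangement
```

For a protein backbone, phi would use the atoms C(i-1), N(i), Cα(i), C(i), and omega would use Cα(i), C(i), N(i+1), Cα(i+1).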
Right-handed Alpha Helix: If you hold it pointing away from you and it twists clockwise moving
away, it is right-handed; otherwise it is left-handed. These models are mirror images and cannot
be converted into each other by rotation. The helix of normal DNA is right-handed.
Alpha-helix and beta-strand regions
Data as in (Lovell et al. 2003), showing about 100,000 data points for several proteins/amino acids.
[Figure: plot with labelled regions: Left-handed Alpha Helices, Right-handed Alpha Helices]
http://www.ocf.berkeley.edu/~asiegel/posts/?p=24
Protein Sequence?
• So each amino acid can be represented by a single-letter code, and a protein by a string of such letters
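As a small illustration of the single-letter representation, the sketch below maps the standard three-letter residue names to their one-letter codes. The helper name `to_one_letter` and the example residues are made up for illustration.

```python
# Standard three-letter to one-letter amino acid codes
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def to_one_letter(residues):
    """Convert a list of three-letter residue names into a one-letter sequence string."""
    return "".join(THREE_TO_ONE[name.capitalize()] for name in residues)

print(to_one_letter(["Met", "Gln", "Val", "Lys"]))  # MQVK
```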
Protein Sequencing – Edman Degradation
• Edman degradation starts from the N-terminus and
removes one amino acid at a time (details next).
• Drawbacks:
• Restricted to about 60 residues
• Laborious: ~50 aa/day
(http://en.wikibooks.org/wiki/Structural_Biochemistry/Proteins/Protein_sequence_determination_techniques)
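The cycle-by-cycle logic can be mimicked in a toy simulation: each cycle releases the current N-terminal residue and leaves a peptide that is one residue shorter. This is only a sketch of the bookkeeping, not of the chemistry; the function name and the `max_cycles` parameter (set to the 60-residue limit mentioned above) are illustrative assumptions.

```python
def edman_degradation(peptide, max_cycles=60):
    """Yield (cycle, released_residue, remaining_peptide), working from the N-terminus."""
    remaining = peptide
    for cycle in range(1, max_cycles + 1):
        if not remaining:
            break
        released, remaining = remaining[0], remaining[1:]
        yield cycle, released, remaining

for cycle, residue, rest in edman_degradation("MQVK"):
    print(cycle, residue, rest)
```

A peptide longer than `max_cycles` is only read up to that limit, which reflects why longer proteins have to be cleaved into shorter pieces first.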
Edman Cycle
MS-based Proteomics
• Objective: Large-scale determination of gene and
cellular function directly at the protein level
How?
A mass spectrometer ionizes molecules, sorts the ions by their
mass-to-charge (m/z) ratio, and records the relative abundance
at each m/z.
Figure - Mass Spectrometer Workflow
Physics behind Mass Spectrometry
www.mhhe.com
m/z Ratio
• Moving charged particles in a magnetic field
experience a force proportional to their charge Q:
Force ∝ Q
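To see why this leads to separation by m/z, the sketch below uses the standard Lorentz force for a velocity perpendicular to a uniform field, F = Q·v·B, and the resulting circular path radius r = m·v / (Q·B). These formulas are textbook physics rather than something stated on the slide, and the function names and numeric values are illustrative assumptions.

```python
AMU_KG = 1.66053906660e-27   # kilograms per atomic mass unit
E_CHARGE = 1.602176634e-19   # coulombs per elementary charge

def lorentz_force(charge_state, speed_m_s, field_t):
    """Force (N) on an ion moving perpendicular to a uniform magnetic field."""
    return charge_state * E_CHARGE * speed_m_s * field_t

def path_radius(mass_amu, charge_state, speed_m_s, field_t):
    """Radius (m) of the circular path: r = m*v / (Q*B)."""
    return (mass_amu * AMU_KG * speed_m_s) / (charge_state * E_CHARGE * field_t)

# Two ions with the same m/z (1000) follow the same path at equal speed
print(path_radius(1000.0, 1, 1.0e5, 1.0))
print(path_radius(2000.0, 2, 1.0e5, 1.0))
```

At equal speed the radius depends only on the ratio of mass to charge, which is why the instrument reports m/z rather than mass.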
• Mass Analyzer
• Separates the ions according to their m/z
• Detector
• Selected ions then hit the detector
• Spectrum Assembly
• Proteomics software interfaced with the mass spectrometer assembles the spectra
FT-ICR Development
Richard Smith at PNNL
Protein Sequence Databases
• Searching for a protein by ID
http://www.uniprot.org/
Peptide Databases
• PeptideAtlas
• PepBank (Harvard)
• Cancer Peptide and Protein Database (CPPD)
• Antibacterial Peptides (Antibiofilms)
• Antiviral Peptides (Anti-HIV)
• Antifungal Peptides
• Antiparasitic Peptides (Antimalaria)
• Anticancer Peptides
• Anti-protist Peptides
• Insecticidal Peptides
• Spermicidal Peptides
• Chemotactic peptides
• Wound healing
• Antioxidant peptides
• Protease inhibitors
MS Spectral Data Processing – Charge State Deconvolution
• Charge needs to be estimated before the mass can be calculated from m/z ratios
• 1 kg = 6.022e+26 amu
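A minimal sketch of why the charge comes first: adjacent isotopic peaks of the same ion are roughly 1 Da apart in mass, so they appear about 1/z apart on the m/z axis, and the neutral mass then follows from m/z and z. The sketch assumes positive-mode ions charged by added protons; the constants, function names and example peaks are illustrative assumptions.

```python
PROTON_MASS = 1.007276     # Da, mass of one charge-carrying proton
ISOTOPE_SPACING = 1.00335  # Da, approximate 13C - 12C mass difference

def charge_from_isotope_spacing(mz_a, mz_b):
    """Estimate the charge state from two adjacent isotopic peaks."""
    return round(ISOTOPE_SPACING / abs(mz_b - mz_a))

def neutral_mass(mz, charge):
    """Neutral mass (Da) of an ion observed at m/z with the given charge."""
    return charge * (mz - PROTON_MASS)

z = charge_from_isotope_spacing(500.0, 500.5)
print(z, neutral_mass(500.0, z))
```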
Activity
Mass Isotopic Distributions
Calculating Isotopic Mass Distributions of
• NH3
• CH4
• C2O2NH5
Calculate the peaks and their intensities in each case and plot them!
Cookie Point (0.25)
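One possible way to approach this activity is to convolve the isotope abundances of every atom in the formula. The sketch below works at nominal (integer) mass resolution and uses rounded natural abundances; the function names are illustrative, and plotting is left out (the resulting dictionary could be passed to any plotting library).

```python
from collections import defaultdict

# Approximate natural isotopic abundances, keyed by nominal mass
ISOTOPES = {
    "H": {1: 0.999885, 2: 0.000115},
    "C": {12: 0.9893, 13: 0.0107},
    "N": {14: 0.99636, 15: 0.00364},
    "O": {16: 0.99757, 17: 0.00038, 18: 0.00205},
}

def convolve(dist_a, dist_b):
    """Combine two mass -> probability distributions."""
    out = defaultdict(float)
    for mass_a, p_a in dist_a.items():
        for mass_b, p_b in dist_b.items():
            out[mass_a + mass_b] += p_a * p_b
    return dict(out)

def isotopic_distribution(formula):
    """formula: dict of element -> atom count, e.g. {"C": 1, "H": 4}."""
    dist = {0: 1.0}
    for element, count in formula.items():
        for _ in range(count):
            dist = convolve(dist, ISOTOPES[element])
    return dict(sorted(dist.items()))

molecules = {
    "NH3": {"N": 1, "H": 3},
    "CH4": {"C": 1, "H": 4},
    "C2O2NH5": {"C": 2, "O": 2, "N": 1, "H": 5},
}
for name, formula in molecules.items():
    dist = isotopic_distribution(formula)
    base = max(dist.values())
    print(name)
    for mass, prob in dist.items():
        if prob / base >= 1e-4:  # hide negligible peaks
            print(f"  nominal mass {mass}: relative intensity {100 * prob / base:.3f}%")
```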
Excerpt from a Mass Spectrum
Hurdles in Application of MS
1. Hard ionization techniques
3. Search Algorithms
• Isotopic envelope deconvolution
• Post-translational modifications detection
Soft Ionization to the Rescue!
The Nobel Prize in Chemistry 2002 was awarded "for the development of
methods for identification and structure analyses of biological macromolecules"
with one half jointly to John B. Fenn and Koichi Tanaka "for their development of
soft desorption ionisation methods for mass spectrometric analyses of biological
macromolecules" and the other half to Kurt Wüthrich "for his development of
nuclear magnetic resonance spectroscopy for determining the three-dimensional
structure of biological macromolecules in solution".
High Resolution Mass Spec!
Fig. SPECTRUM GUIs. The set of graphical user interfaces in SPECTRUM created using MATLAB GUIDE to undertake the search process and
visualize results. (A) Main SPECTRUM GUI to provide general search parameters, (B) GUI to tune intact protein mass, (C) GUI to provide PST
search parameters, (D) GUI to include special fragmentation ions in the search process and (E) GUI to specify instrument based chemical
modification. (F) GUI to tailor final scoring scheme, and (G-H) GUIs to provide the user with brief as well as detailed results.
Overview
• A toolbox for protein identification from top-down proteomics
data built using MATLAB
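Feature comparison of top-down proteomics tools (column headers missing in the source):
pTop: ✗ ✓ ✓ ✓ ✓ ✗ ✓ ✓ ✓ ✗
MSPathFinder: ✗ ✓ ✓ ✓ ✓ ✗ ✓ ✓ ✗ ✓
MASH Suite Pro: ✗ ✗ ✓ ✓ ✓ ✓ ✓ ✓ ✓, plus ✓ (2 types)
ProsightPC: ✗ ✓ ✗ ✓ ✓ ✗ ✓ ✓ ✗, plus ✓ (7 types)
TopPIC: ✗ ✗ ✗ ✗ ✓ ✗ ✓ ✓ ✗, plus ✓ (4 types)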
Results – Case Study on HeLa H4 Histone
• Case Study I – Evaluation of SPECTRUM Search with Known Target Protein
Results Comparison for HeLa Dataset
Search Parameters:
  (a) PST Search: Disabled; Scoring Component: In silico; Blind PTM Search: Disabled
  (b) PST Search: Disabled; Scoring Component: In silico; Blind PTM Search: Enabled
  (c) PST Search: Enabled; Scoring Component: In silico; Blind PTM Search: Disabled
Proteins Identified:       3   3   3   2     1   0
Not Reported:              0   0   0   2     2   10
Search Time (in seconds):  15  24  16  2350  15  13
* ProSight PC v4.0, TopPIC v1.1, pTop v1.2. PST: Peptide Sequence Tag
Results – Case Study E. coli Dataset
• Case Study II – Evaluation of SPECTRUM Search with Unknown Target Protein
Results Comparison for E. coli Dataset
• With PSTs/TagSearch: SPECTRUM, MSPathFinder, pTop
• Without PSTs/TagSearch: SPECTRUM, MSPathFinder, TopPIC
Note: Benchmarking was performed on a desktop machine with an Intel® Core i7 7700 @ 4.2 GHz and 32 GB RAM
Results – Case Study E. coli Dataset
Validation/benchmarking of the platform was performed against:
1. Published datasets
2. TDP tools (ProSight PC, pTop, TopPIC, MSPathFinder)
[Figure: successive database filtering stages]
Filtered DB: P62837, P62838, P62839, Q6ZPJ3, P21734, Q02159, P62837
Filtered DB: P62837, P62839, Q6ZPJ3, P21734
P. ID | Score
Filtered DB: 1263.796108, 1304.861342, 1433.914879, 1477.96237, 1535.004263, 1562.021644
Things to watch out for in MS data
1. Intact Protein/Peptide Mass
2. Charge States
3. Relative Abundances
1. MS1 Mass Tuning and Scoring
Estimating Intact Mass
Flowchart: tuning the intact protein mass
1. Start: get the monoisotopic MS spectra (MS1, MS2, Int).
2. Generate 2-tuples from the MS2 peaks: {(m/z_i, Int_i); (m/z_j, Int_j)}, with i = 1:n, j = i:n, n = 1:size(MS2).
3. For each 2-tuple (i, j), compute the tuple sum TS_k = m/z_i + m/z_j and Int_avg_k = (Int_i + Int_j)/2, for k = 1:all 2-tuple sums.
4. Filter TS using the user-defined tolerance (Tol_frag) to obtain the filtered TS.
5. Create a scanning window (width = 1 Dalton).
6. Initialize the scanning window position with the smallest TS_k.
7. Count the TS_k falling within the scanning window for every shift and store the respective TS_Count values.
8. Incrementally shift the scanning window by a user-defined step size.
9. Scanning window reached end of TS? NO: return to step 7. YES: continue.
10. Obtain the maximum value from TS_Count and select the corresponding scanning window.
11. From the selected window, obtain TS_k and Int_avg_k.
12. Compute the intensity-weighted average of the elements in the selected window:
$$TunedMass = \frac{\sum_{i=1}^{m} TS_{k_i} \cdot Int_{avg_{k_i}}}{\sum_{i=1}^{m} Int_{avg_{k_i}}}$$
13. Obtain the tuned intact protein mass. End.
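The intact-mass tuning flowchart can be sketched roughly as follows. This is an illustrative reading, not SPECTRUM's actual code: in particular, the sketch assumes the tolerance filter keeps tuple sums within Tol_frag of the MS1 intact mass, which the flowchart does not spell out, and the function name, parameter names and defaults are hypothetical.

```python
def tune_intact_mass(ms1_mass, ms2_peaks, tol_frag=50.0, window=1.0, step=0.1):
    """ms2_peaks: list of (m/z, intensity) monoisotopic peaks with positive intensities."""
    # Tuple sums TS_k and averaged intensities for all pairs with j >= i
    sums = []
    for i in range(len(ms2_peaks)):
        for j in range(i, len(ms2_peaks)):
            (mz_i, int_i), (mz_j, int_j) = ms2_peaks[i], ms2_peaks[j]
            sums.append((mz_i + mz_j, (int_i + int_j) / 2.0))
    # ASSUMPTION: the filter keeps sums within tol_frag of the MS1 intact mass
    sums = sorted(s for s in sums if abs(s[0] - ms1_mass) <= tol_frag)
    if not sums:
        return None
    lowest, highest = sums[0][0], sums[-1][0]
    best = []
    shift = 0
    # Slide the window from the smallest sum until it passes the largest one
    while lowest + shift * step <= highest:
        start = lowest + shift * step
        members = [s for s in sums if start <= s[0] < start + window]
        if len(members) > len(best):  # keep the most populated window
            best = members
        shift += 1
    total_intensity = sum(intensity for _, intensity in best)
    return sum(ts * intensity for ts, intensity in best) / total_intensity

# Hypothetical peaks: two complementary fragment pairs summing to about 10000
peaks = [(3000.0, 10.0), (7000.2, 10.0), (4000.0, 20.0), (6000.1, 20.0)]
print(tune_intact_mass(10001.0, peaks, tol_frag=5.0))
```

The intuition is that complementary fragment pairs sum to roughly the same value, so the most crowded window marks the most plausible intact mass.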
Intuitively Scoring Tuned Masses
• As a first step in the protein search, the protein database is
filtered for proteins matching the MW reported in
the experimental data
$$M_{Score} = \frac{1}{\sqrt{(M_{Exp} - M_{Thr})^{2}}}$$
What we do in SPECTRUM?
$$Mass_{diff} = |Mass_{experimental} - Mass_{theoretical}|$$
$$Score_{WPMW} = \begin{cases} 1, & Mass_{diff} = 0 \\ \dfrac{1}{2^{Mass_{diff}}}, & 0 < Mass_{diff} \le Thr \\ 0, & Mass_{diff} > Thr \end{cases}$$
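A small sketch of this piecewise molecular-weight score, reading the middle branch as 1 / 2^Mass_diff (an assumption about the flattened formula; it is the reading that stays continuous with the score of 1 at zero difference). The function name and example masses are illustrative.

```python
def mw_score(mass_experimental, mass_theoretical, threshold):
    """Piecewise molecular-weight score in [0, 1]."""
    mass_diff = abs(mass_experimental - mass_theoretical)
    if mass_diff == 0:
        return 1.0
    if mass_diff <= threshold:
        return 1.0 / (2.0 ** mass_diff)  # ASSUMPTION: exponent form of the middle branch
    return 0.0

print(mw_score(11300.0, 11300.0, 5.0))  # exact match
print(mw_score(11301.0, 11300.0, 5.0))  # 1 Da off
print(mw_score(11310.0, 11300.0, 5.0))  # outside the threshold
```

Unlike the intuitive 1/|difference| score, this form is bounded and defined for an exact match.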
2. Peptide Sequence Tags
• Upon obtaining scores of all proteins in the protein
database, we filter the database for “candidate
proteins”
Flowchart: extracting peptide sequence tags (PSTs)
1. Start: get the monoisotopic mass for standard amino acids (L_Mass_AA).
2. If PTM selected? YES: add the modified mass of amino acids for the selected PTM(s) to the list L_Mass_AA.
3. For each 2-tuple (i, j), compute the difference Mass_diff = m/z_i - m/z_j.
4. If Mass_AA - Tol ≤ Mass_diff ≤ Mass_AA + Tol? YES: for each value compute Error = |Mass_diff - Mass_AA| and Int_avgk = Int_i + Int_j.
5. Store Tag_AA, Error, Int_avgk and Tag_Order.
6. Join Tag_AA if Tag_Order shows that they are consecutive.
7. Create PST(s) of Length_range(length_min, length_max).
8. Compute the score for each PST.
9. End.
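The tag-extraction idea can be sketched as below: a mass difference between two peaks that matches an amino acid mass within a tolerance gives a one-letter tag, and tags that share a peak are joined into longer PSTs. This is a simplified illustration rather than SPECTRUM's implementation; the function names, the tolerance and the example peaks are hypothetical, and PTM-modified masses are not included.

```python
# Monoisotopic residue masses (Da); I is omitted because it has the same mass as L
AA_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def find_hops(peaks, tol=0.02):
    """Return (i, j, residue, error, summed intensity) for peak pairs one residue apart."""
    peaks = sorted(peaks)
    hops = []
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            mass_diff = peaks[j][0] - peaks[i][0]
            for residue, mass_aa in AA_MASS.items():
                error = abs(mass_diff - mass_aa)
                if error <= tol:
                    hops.append((i, j, residue, error, peaks[i][1] + peaks[j][1]))
    return hops

def build_tags(hops, length_min=1, length_max=3):
    """Join consecutive hops into (tag, per-residue errors) with length in the given range."""
    by_start = {}
    for hop in hops:
        by_start.setdefault(hop[0], []).append(hop)
    tags = []

    def extend(path):
        if len(path) >= length_min:
            tags.append(("".join(h[2] for h in path), [h[3] for h in path]))
        if len(path) < length_max:
            # A hop is consecutive when it starts at the peak where the path ends
            for nxt in by_start.get(path[-1][1], []):
                extend(path + [nxt])

    for hop in hops:
        extend([hop])
    return tags

# Hypothetical peak list: a ladder separated by M, Q and V residue masses
peaks = [(100.0, 10.0), (231.04049, 20.0), (359.09907, 15.0), (458.16748, 5.0)]
print(sorted(tag for tag, _ in build_tags(find_hops(peaks))))
```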
Scoring Sequence Tags - I
• Sequence Tag Examples: ‘M’, ‘MQ’, and ‘QV’ etc
• Scoring Philosophy:
• The lengthier the tag, the better,
• The smaller the RMSE, the better,
• The more abundant the better!
Scoring Sequence Tags - II
• If a candidate protein matches ‘n’ PSTs, then its
score can be given by:
[Equation: a sum over the matched PSTs, indexed from i = 0 to n; the summand is not legible in the source]
Scoring Sequence Tags - III
• So, what is the RMSE for a specific sequence tag ‘i’ of length
‘n’?
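As a hedged illustration of the RMSE for one tag, the sketch below uses the usual root-mean-square definition and assumes the per-residue errors are the |Mass_diff - Mass_AA| values recorded while the tag was built. The function name and example errors are illustrative.

```python
import math

def tag_rmse(errors):
    """Root-mean-square error over the per-residue mass errors of one sequence tag."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical errors (Da) for a tag of length 3
print(tag_rmse([0.004, 0.010, 0.002]))
```

A smaller value means the observed peak spacings sit closer to the theoretical residue masses, in line with the scoring philosophy above.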
Cookie Point: How to cater for abundance? (0.25)
3. Post-translational Modifications
Flowchart (excerpt): generate fragments using the user-specified fragmentation technique → compute the mass for each fragment → assemble the in silico spectrum (Frag_thr) → End
$$Score\_MW = \frac{1}{2^{MW_{PDiff}}}, \quad 0 < |MW_{PDiff}| \le Thr$$
$$Insilico\ Score = \frac{\text{No. of matches}}{\text{Number of Experimental Fragments}}$$
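A rough sketch of the in silico scoring step, assuming b/y-type fragmentation with singly protonated fragments and treating the fragment threshold as a matching tolerance in Da. Both are assumptions made for illustration; the function names, the tolerance value and the example peptide are hypothetical.

```python
WATER = 18.010565   # Da
PROTON = 1.007276   # Da
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def by_ions(sequence):
    """Singly protonated b-ion then y-ion m/z values for an unmodified sequence."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    b_ions, y_ions = [], []
    running = 0.0
    for mass in masses[:-1]:            # N-terminal prefixes
        running += mass
        b_ions.append(running + PROTON)
    running = 0.0
    for mass in reversed(masses[1:]):   # C-terminal suffixes
        running += mass
        y_ions.append(running + WATER + PROTON)
    return b_ions + y_ions

def insilico_score(experimental_mz, theoretical_mz, frag_tol=0.02):
    """Number of matched experimental fragments divided by the number of experimental fragments."""
    matches = sum(
        1 for mz in experimental_mz
        if any(abs(mz - theo) <= frag_tol for theo in theoretical_mz)
    )
    return matches / len(experimental_mz)

theoretical = by_ions("MQVK")
experimental = [theoretical[0], theoretical[3], 500.0]  # two real fragments plus one noise peak
print(insilico_score(experimental, theoretical))
```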