5BBB0226 Bioinformatics Principles
–
Next Generation Sequencing Read Mapping
and Multiple Sequence Alignments
Reiner.Schulz@kcl.ac.uk
Medical and Molecular Genetics Department
Concepts
● Revision: Optimal Pairwise Sequence Alignments
– Building blocks of Multiple Sequence Alignments
● Next Generation Sequencing: Need for Speed
– kmer Indexing
– Spaceefficient Indexing
● Multiple Sequence Alignments (MSAs): Utility
– Scoring MSAs
– Generating MSAs: Progressive Alignment
– Improvement Strategies: Iterative Refinement, Consistency, Protein
Structure, Consensus MetaMSAs
– Genomic MSAs: Challenges
NeedlemanWunsch (NW) & SmithWaterman (SW)
● S4,3, S5,3, S4,4: best scores for paths
ending at nodes (4,3), (5,3), (4,4)
(alignments of GATT v GAT,
GATTA v GAT, GATT v GATA)
➔ S5,4: best score for a path ending at
node (5,4)
(alignment(s) of GATTA v GATA)
= best score achievable by
extending above paths, or
(SW only!) starting over
= max(S5,3+score(“_”),
S4,3+score(5th,4th), S4,4+score(“_”),
● NW base cases: S0,j = j score( “_”), (SW only!) 0 )
Si,0 = i score( “_”) ● Remember scoremaximising
● SW base cases: S0,j = Si,0 = 0 option(s) for backtracing!
Backtracing Revisited: NW
● Generate global alignment by
backtracing path of score
maximising options from
bottomleft node (= score of
optimal alignment(s))
to node at i=j=0
● Each backtracing step is a
column in the alignment,
starting with rightmost
column
● Branch point in path:
GATTACA GATTACA
multiple optimal global
GA_TACA GAT_ACA
alignments
Backtracing Revisited: SW
● Generate local alignment by
backtracing path of score
maximising options from node
with maximum score until
reaching a node with score 0
● Each backtracing step is a
column in the alignment,
starting with rightmost
column
● Multiple nodes with maximum
score and/or branch point in
TAC TAC TACTA
path: multiple optimal local
TAC TAC TAC_A
alignments
Next Generation Sequencing (NGS)
● ~107 to 109 DNA sequences (reads)
– length ~100 to 300 bp
– error prone: 0.1 – 10% error rate (depending on platform)
● Genomic/Transcriptomic origin of each read?
● SmithWaterman: N O(nm) time complexity
N: number of reads, n: read length,
m: genome/transcriptome length
● 107 300 bp human (genome: 3x109) reads: ~1019
● At 1010/s (very fast CPU!): 109 s > 30 years!
The Lab/Index Book
Manageable with indices!
Be better than this!
Indices Speed Up Searches
The Genome/Read Index Idea
● Words: kmers, short (k<20) sequences of fixed length
● Genome index: kmers → positions in genome
– For each read, use index to look up genome positions (exact
matches) of kmers in the read: < O(n log n)
– Align read to genome around match positions using DP: O(n 2)
– Or, “stitch” together alignment from successive matches of
kmers and similar kmers: O(n)
● Read index: kmers → reads
– At each position in genome, look up reads with matching kmers
– As above
● Time and space of index construction amortises
Genome Index + Dynamic Programming
Read
k=6-mers: ATTGCAGGTAACTTATAGGATACA
Index lookup provides
match positions:
Match Match
Genome GCTTAACTATTGCA GGTATAG GATACAGATCAGG
Use DP to fill in alignment
around exact matches:
Sequence Search and Alignment by
Hashing Algorithm (SSAHA)
● For illustration: superseded by more sophisticated algorithms
● Database D of sequences (e.g., chromosomes):
1:GTGACGTCACTCTGAGGATCCCCTGGGTGTGG
2:GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT
3:GGATCCCCTGTCCTCTCTGTCACATA
● Query sequence Q of length n (e.g., NGS read):
TGCAACAT
● SSAHA finds “Qlike sequences” in D in O(n log n)
– Much faster than O(nm) where m is size of D
● 107 300 bp reads: ~2.5 10 10 operations (minutes, not years)
– NOT guaranteed (but likely) to find optimal alignments!
Ning, Cox & Mullikan. Genome Res 2001
SSAHA Index Construction
1:GTGACGTCACTCTGAGGATCCCCTGGGTGTGG
2:GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT
3:GGATCCCCTGTCCTCTCTGTCACATA
● Mapping of each kmer to list of its
positions in D
– Number of kmers: 4k
● 2mers (w) in example
– E(w): hash function returning 2bit
encoding of kmer as integer (hash
value) = row in table with list
● A: 00, C: 01, G: 10, T: 11
● GA: 8=1000b = 1*23+0*22+0*21+0*20
– Positions: (sequence, position)
● Two scans of nonoverlapping
(to save space) kmers in D
1)Frequency: length of list
2)Filling in positions
Alignment using Q:TGCAACAT
SSAHA Index
● For each position t in Q, add positions p in D of
kmer starting at t to list of hits H
– Hit = (sequence, shift = p t, p)
● Sort H by sequence and shift → M
– O(Kn log Kn); K is small: (m/k)/4k = expected
#occurrences in D of each kmer (nk+1) in Q
● Scan M for “runs” of same sequence and
same shift (+/ x: allowance for indels)
● Sort hits in each run by position
● Identify alignment “seeds”, i.e., regions of exact
matches in an alignment: same shift and successive
positions k bases apart
● Alignment: extend seeds if possible and connect by
gaps, or use seeds for constrained DP
Q: TGCAACAT
2:GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT
Constrained Dynamic Programming
Run: (2,7,7) (2,7,11) (2,7,13)
no (2,7,9) hit due to T/C mismatch: two seeds, (2,7,7) and (2,7,11) (2,7,13)
t: 0123456
Q: TGTAACAT
2:GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT
p:12345678911111…
01234…
● Finding seeds is quick
● Seeds reduce the number of
possible paths/alignments:
fewer operations = less time
● But optimal alignments may
correspond to unconstrained
paths, i.e., may miss them;
However, that is unlikely!
Alignment using
Spaced Seed Index
● Mapping of spaced seed pairs to reads
– e.g., 16 bp reads, 6 pairs of 4 bp seeds:
tolerates 2 mismatches (errors, SNPs)
● Scan genome: seed pairs at each
position → index → matching reads
● Alignments: extend hits (e.g., DP)
● O(n log n): building index for n reads
● Limitations: fixed short read length,
large index for large n (>10GB)
– split read set, run on multiple CPUs
(example of data parallelisation)
Pevsner 2015 Fig. 5.18
BurrowsWheeler Transform (BWT)
● Reversible reordering L of
characters in a text T
– in order of their suffixes
➔ L: long runs of same character:
• common words like “the” in an English
text
• repeats in a genome (large fraction)
➔ efficient compression of L
FM Index
● L and F from BWT, plus small
auxilliary lists
– Ferragina and Manzini: “Fulltext
Minutespace”
● Human genome: 2GB
● Match finding using FM index
nontrivial but fast: O(n)
● No fixed or limited read length
– NGS reads becoming longer
● Programmes: Bowtie2,
Pevsner 2015 Fig. 5.18 BWAMEM, HISAT2, ...
Multiple Sequence Alignment (MSA)
● Valid MSA for sequences k: 01234567
GATTACA, GATACA and GATAA: i: 01234567
● m sequences: columns of GATTACA
mtuples j: 01233456
GAT_ACA
● No column can be all “_” l: 01233445
● Monotonicity property GAT_A_A
analogous to pairwise alignments
Utility of MSAs
● Identification of variable and conserved regions in proteins
of a family
– Inference of protein domains
● Detecting distant homology of two sequences via other, “in
between” sequences
● Reconstruction of phylogenetic trees: evolutionary
relationship of sequences
● RNA: 2ndary structure prediction
● Genomes: conserved noncoding elements, inference of
synteny (conserved order of regions between genomes) or
genome rearrangements
Number of Possible MSAs
● Astronomical
Number of
sequences
Length of
sequences
Slowinksi JB. Mol Phylogen Evol 1998
Scoring MSAs
● Not obvious how
● Common: Sum of Pairs (SP)
1st column:
score(A1,A2) + score(A1,G3) + score(A1,_4) +
score(A1,_5) + score(A2,G3) + score(A2,_4) +
score (A2,_5) + score(G3,_4) + score(G3,_5) + score(_4,_5)
+ 2nd column etc.
Sum of Pairs: a Good Measure?
● Alignment column: A, A, A, C
● Sum of pairs: 3x score(A,A) + 3x score(A,C)
● What if sequences are related like this:
● 1 A→C mutation in common
ancestor sequence explains data
● SP prone to penalise mutations
multiple times
An Optimal MSA using DP
● Generalisation from 2 n1=2, n2=2, n3=3
to >2 sequences
● Guaranteed to find an
optimal MSA
● Impractical: O(nm) for m
sequences of length n
• n=100bp, m=40: #operations
= #particles in universe
➔ All MSA algorithms use
heuristics
Progressive Alignment Heuristic Idea
● Compute all optimal global pairwise alignments
e.g., for sequences (1) ACG, (2) CGA and (3) GAC
● Merge them consistently: once a gap always a gap
● Result depends on merge order!
(1&2) ACG_ (1&3) _ACG (2&3) CGA_
_CGA GAC_ _GAC
_ACG_ ACG__ __ACG
__CGA _CGA_ CGA__
GAC__ __GAC _GAC_
Determining Merge Order, Part I
● From optimal global pairwise alignments,
derive pairwise distances between sequences
● Simple: #mismatches / (#mismatches + #matches)
– Underestimates distances (ignores gaps) QKLMN
_KLVN
=1/4
● Better: alignment score S to distance transformation
(S – Sshuffled) / ((SAvA + SBvB)/2 – Sshuffled)
– Sshuffled: mean score of optimal alignment of shuffled sequences
– SAvA, SBvB: score of aligning A and B, respectively, with itself
Determining Merge Order, Part I
● Pairwise distance matrix example:
Determining Merge Order, Part II
● Distance matrix → Algorithm → ”Guide” Tree
– e.g., NeighbourJoining algorithm
● Objective: Path lengths in tree = Distances
Path length from human Hbb to human Hba
=.081+.226+.216+.055=.578 ≈ Distance = .59
Merging Alignments in Tree Order
In order of path length:
1
1. human v horse Hba
3
2 • aligning two sequences
4
2. human v horse Hbb
5
3. 1 v 2
6
• merging alignments
4. 3 v whale myoglobin
• adding a sequence to an
alignment
5. 4 v lamprey globin
6. 5 v lupus globin
Merging/Adding to Alignments
● Analogous to pairwise sequence alignment:
alignment columns are the “sequence characters”
– Optimal solution by Dynamic Programming in O(nm)
● Score of matching two columns = average of scores of
pairwise character (incl. “_”) alignments
score(1st, 2nd) = (score(A,C) + score(A,A) + ATA TCAFE
score(A,A) + score(A,G) + score(C,C) + CTA TAT_E
score(C,A) + score(C,A) + score(C,G)) / 8 TATF_
● Similar sequences “redundant” but each adds AGTFD
to score: undue influence on MSA
Weighing Sequences
● Weight: length of path from Guide Tree root
– Branch length divided between sequences under branch:
downweighs sets of similar sequences
● Lupus globin weight
= .442
● Human Hba weight
= .055 + .216 /2 +
.016 /4 + .015 /5 +
.062 /6 = .192
Using Sequence Weights for Scoring
● Previously:
average of scores of pairwise character alignments
● Now: weighted average ATA TCAFE
score(1st, 2nd) = (w1,1 w2,1 score(A,C) + CTA TAT_E
w1,1 w2,2 score(A,A) + w1,1 w2,3 score(A,A) + TATF_
w1,1 w2,4 score(A,G) + w1,2 w2,1 score(C,C) + AGTFD
w1,2 w2,2 score(C,A) + w1,2 w2,3 score(C,A) +
w1,2 w2,4 score(C,G)) / 8
● Classic implementation: ClustalW by Thompson et al. 1994
– superseded by other, improved MSA programmes
Progressive Alignment: Summary
(1) Sequences → Optimal global pairwise alignments
(2) Alignment scores → Pairwise distance matrix
(3) Distance matrix → Guide tree
(4) Guide tree → Sequence weights
(5) Merging/adding to alignments:
Guide tree → Order,
Sequence weights → Weighing scores during DP
Progressive Alignment: Caveats
● Heuristic: no guarantee of finding optimal MSA
– Optimality? Max sumofpairs or another measure?
● Once a gap always a gap:
propagation of initial errors, compounded during merging
● Result depends on:
– Initial optimal global pairwise alignments:
arbitrary choice among multiple pairwise alignments
– Guide tree merge order and sequence weights:
depend on tree construction algorithm
– Other parameter choices: substitution matrices, gap penalties
● Sequences <25% identical: poor MSAs
● Manual correction often required: e.g., JalView
Improvements: Iterative Refinement
● Cut branch of guide tree: bipartition MSA
● Realign, score (e.g., sumofpairs): keep if better
● Repeat for every
branch
● Can overcome
initial error
propagation issue
● Implemented by programmes like MAFFT and MUSCLE
Improvements: Consistency
● “Library”: optimal global pairwise alignments
➔ “Transitive” alignments: 3way global alignments
➔ Positionspecific scores (PSSs) for aligning ith character in X with jth
character in Y, directly or via a character in Z: PSS(Xi,Yj)
= #(Xi,Yj) in library + #(Xi,Zk,Yj) in transitive alignments
● Recompute library using PSSs: “extended library”
TCAFE
– Fixes inconsistencies between original
optimal pairwise alignments (initial errors)
TAT_E
ATA TATF_
● Progressive alignment using PSSs
CTA AGTFD
score(1st, 2nd) = (w1,1 w2,1 PSS(A1,C1) + ...
● Sumofpairs of MSA using PSSs = degree of consistency with
extended library
Consistency: Extending a Library
Library +Transitive Alignments
SeqA GARFIELD THE LAST FAT CAT SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT SeqB GARFIELD THE FAST CAT
SeqA GARFIELD THE LAST FAT CAT SeqA GARFIELD THE LAST FAT CAT
SeqC GARFIELD THE VERY FAST CAT SeqC GARFIELD THE VERY FAST CAT
SeqA GARFIELD THE LAST FAT CAT SeqB GARFIELD THE FAST CAT
SeqD THE FAT CAT SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT SeqD THE FAT CAT
SeqC GARFIELD THE VERY FAST CAT SeqB GARFIELD THE FAST CAT
SeqB GARFIELD THE FAST CAT Positionspecific scores
SeqD THE FAT CAT
... SeqA GARFIELD THE LAST FA T CAT
PSS 22222222 333 1111 22 2 222
● Implemented by SeqB GARFIELD THE FAST CAT
programmes like Extended library:
ProbCons and TCOFFEE SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
...
Improvements: Protein Structure
● Protein structures more conserved than sequences
● Consistencybased MSA with library including
optimal pairwise alignments of structures
– Optimality: minimise distances between residues in 3D
● Implemented by TCOFFEE Expresso module
– Structure entered or, if available, retrieved from PDB
● Also useful for checking quality of MSAs:
TCOFFEE iRMSDAPDB
Improvements: MetaMSA
● ~50 different approaches/programmes for MSA
● No one approach is optimal
– speed v accuracy tradeoff
➔ Generate MSAs with variety of programmes
➔ Derive consistencybased consensus MSA
● Implemented by MCOFFEE
– includes MAFFT, ProbCons, MUSCLE, TCOFFEE etc.
MSA Programme Comparison
● From TCOFFEE online documentation:
Genomic MSA: Challenges
● Progressive alignment approaches, modified to account
for genomespecific challenges:
● >1000fold longer sequences
– protein: <1000 residues v genome: >1M nucleotides
● Low sequence conservation in noncoding regions
● High content of repeated subsequences
– transposons, retrotransposons, telomeric/centromeric repeats
● Largescale rearrangements
– deletions, duplications, inversions, translocations
Genomic MSA: UCSC and Ensembl
● Databases of MSAs of 100s of vertebrate genomes
– integrated into genome browsers
● UCSC: Multiz MSA + phastCons/phyloP
quantification of conservation
● Ensembl: Enredo syntenic/colinear regions
+ Pecan MSA + Ortheus ancestral genome
reconstruction (EPO)
Acknowledgments
● Mona Singh
● Ben Langmead
● Cédric Notredame