[go: up one dir, main page]

0% found this document useful (0 votes)
89 views42 pages

Multiple Sequence Alignments - Bioinformatics

This document discusses techniques for mapping next generation sequencing reads to reference genomes and performing multiple sequence alignments. It covers pairwise sequence alignment using Needleman-Wunsch and Smith-Waterman algorithms. It then discusses challenges with mapping large numbers of short reads from next generation sequencing to genomes using k-mer indexing approaches and data structures like the Burrows-Wheeler transform to enable efficient search. The document concludes by discussing scoring and construction of multiple sequence alignments.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
89 views42 pages

Multiple Sequence Alignments - Bioinformatics

This document discusses techniques for mapping next generation sequencing reads to reference genomes and performing multiple sequence alignments. It covers pairwise sequence alignment using Needleman-Wunsch and Smith-Waterman algorithms. It then discusses challenges with mapping large numbers of short reads from next generation sequencing to genomes using k-mer indexing approaches and data structures like the Burrows-Wheeler transform to enable efficient search. The document concludes by discussing scoring and construction of multiple sequence alignments.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 42

5BBB0226 Bioinformatics Principles


Next Generation Sequencing Read Mapping 
and Multiple Sequence Alignments

Reiner.Schulz@kcl.ac.uk
Medical and Molecular Genetics Department
Concepts
● Revision: Optimal Pairwise Sequence Alignments
– Building blocks of Multiple Sequence Alignments
● Next Generation Sequencing: Need for Speed
– k­mer Indexing
– Space­efficient Indexing
● Multiple Sequence Alignments (MSAs): Utility
– Scoring MSAs
– Generating MSAs: Progressive Alignment
– Improvement Strategies: Iterative Refinement, Consistency, Protein 
Structure, Consensus Meta­MSAs
– Genomic MSAs: Challenges
Needleman­Wunsch (NW) & Smith­Waterman (SW)
● S4,3, S5,3, S4,4: best scores for paths 
ending at nodes (4,3), (5,3), (4,4) 
(alignments of GATT v GAT,
GATTA v GAT, GATT v GATA)
➔ S5,4: best score for a path ending at 
node (5,4) 
(alignment(s) of GATTA v GATA)
= best score achievable by 
extending above paths, or 
(SW only!) starting over
= max(S5,3+score(“_”), 
S4,3+score(5th,4th), S4,4+score(“_”), 
● NW base cases: S0,j = j score( “_”),  (SW only!) 0 )
Si,0 = i score( “_”) ● Remember score­maximising 

● SW base cases: S0,j =  Si,0 = 0 option(s) for backtracing!
Backtracing Revisited: NW
● Generate global alignment by 
backtracing path of score­
maximising options from 
bottom­left node (= score of 
optimal alignment(s)) 
to node at i=j=0
● Each backtracing step is a 
column in the alignment, 
starting with right­most 
column
● Branch point in path: 
GATTACA GATTACA
multiple optimal global 
GA_TACA GAT_ACA
alignments
Backtracing Revisited: SW
● Generate local alignment by 
backtracing path of score­
maximising options from node 
with maximum score until 
reaching a node with score 0
● Each backtracing step is a 
column in the alignment, 
starting with right­most 
column
● Multiple nodes with maximum 
score and/or branch point in 
TAC TAC TACTA
path: multiple optimal local 
TAC TAC TAC_A
alignments
Next Generation Sequencing (NGS)
● ~107 to 109 DNA sequences (reads)
– length ~100 to 300 bp
– error prone: 0.1 – 10% error rate (depending on platform)
● Genomic/Transcriptomic origin of each read?
● Smith­Waterman: N O(nm) time complexity
N: number of reads, n: read length, 
m: genome/transcriptome length
● 107 300 bp human (genome: 3x109) reads: ~1019
● At 1010/s (very fast CPU!): 109 s > 30 years!
The Lab/Index Book

Manageable with indices!

Be better than this!


Indices Speed Up Searches
The Genome/Read Index Idea
● Words: k­mers, short (k<20) sequences of fixed length
● Genome index: k­mers → positions in genome
– For each read, use index to look up genome positions (exact 
matches) of k­mers in the read: < O(n log n)
– Align read to genome around match positions using DP: O(n 2)
– Or, “stitch” together alignment from successive matches of 
k­mers and similar k­mers: O(n)
● Read index: k­mers → reads
– At each position in genome, look up reads with matching k­mers
– As above
● Time and space of index construction amortises
Genome Index + Dynamic Programming
Read

k=6-mers: ATTGCAGGTAACTTATAGGATACA

Index lookup provides


match positions:
Match Match

Genome GCTTAACTATTGCA GGTATAG GATACAGATCAGG

Use DP to fill in alignment


around exact matches:
Sequence Search and Alignment by 
Hashing Algorithm (SSAHA)
● For illustration: superseded by more sophisticated algorithms
● Database D of sequences (e.g., chromosomes):
1:GTGACGTCACTCTGAGGATCCCCTGGGTGTGG
2:GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT
3:GGATCCCCTGTCCTCTCTGTCACATA
● Query sequence Q of length n (e.g., NGS read):
TGCAACAT
● SSAHA finds “Q­like sequences” in D in O(n log n)
– Much faster than O(nm) where m is size of D
● 107 300 bp reads: ~2.5 10 10 operations (minutes, not years)

– NOT guaranteed (but likely) to find optimal alignments!

Ning, Cox & Mullikan. Genome Res 2001


SSAHA Index Construction
1:GTGACGTCACTCTGAGGATCCCCTGGGTGTGG
2:GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT
3:GGATCCCCTGTCCTCTCTGTCACATA
● Mapping of each k­mer to list of its 
positions in D
– Number of k­mers: 4k
● 2­mers (w) in example
– E(w): hash function returning 2­bit 
encoding of k­mer as integer (hash 
value) = row in table with list
● A: 00, C: 01, G: 10, T: 11
● GA: 8=1000b = 1*23+0*22+0*21+0*20
– Positions: (sequence, position)
● Two scans of non­overlapping 
(to save space) k­mers in D
1)Frequency: length of list
2)Filling in positions
Alignment using  Q:TGCAACAT

SSAHA Index
● For each position t in Q, add positions p in D of 
k­mer starting at t to list of hits H
– Hit = (sequence, shift = p ­ t, p)
● Sort H by sequence and shift → M
–  O(Kn log Kn); K is small: (m/k)/4k = expected 
#occurrences in D of each k­mer (n­k+1) in Q
● Scan M for “runs” of same sequence and 
same shift (+/­ x: allowance for indels)
● Sort hits in each run by position
● Identify alignment “seeds”, i.e., regions of exact 
matches in an alignment: same shift and successive 
positions k bases apart
● Alignment: extend seeds if possible and connect by 
gaps, or use seeds for constrained DP
Q:      TGCAACAT
2:GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT
Constrained Dynamic Programming
Run: (2,7,7) (2,7,11) (2,7,13)
no (2,7,9) hit due to T/C mismatch: two seeds, (2,7,7) and (2,7,11) (2,7,13)
t:      0123456
Q:      TGTAACAT
2:GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT
p:12345678911111…
           01234…
● Finding seeds is quick
● Seeds reduce the number of 
possible paths/alignments: 
fewer operations = less time
● But optimal alignments may 
correspond to unconstrained 
paths, i.e., may miss them;
However, that is unlikely!
Alignment using  
Spaced Seed Index
● Mapping of spaced seed pairs to reads
– e.g., 16 bp reads, 6 pairs of 4 bp seeds:
tolerates 2 mismatches (errors, SNPs)
● Scan genome: seed pairs at each 
position → index → matching reads
● Alignments: extend hits (e.g., DP)
● O(n log n): building index for n reads
● Limitations: fixed short read length, 
large index for large n (>10GB)
– split read set, run on multiple CPUs 
(example of data parallelisation)
Pevsner 2015 Fig. 5.18
Burrows­Wheeler Transform (BWT)
● Reversible reordering L of 
characters in a text T
– in order of their suffixes
➔ L: long runs of same character:
• common words like “the” in an English 
text
• repeats in a genome (large fraction)
➔ efficient compression of L
FM Index
● L and F from BWT, plus small 
auxilliary lists
– Ferragina and Manzini: “Full­text 
Minute­space”
● Human genome: 2GB
● Match finding using FM index 
non­trivial but fast: O(n)
● No fixed or limited read length
– NGS reads becoming longer
● Programmes: Bowtie2, 
Pevsner 2015 Fig. 5.18 BWA­MEM, HISAT2, ... 
Multiple Sequence Alignment (MSA)
● Valid MSA for sequences  k: 01234567
GATTACA, GATACA and GATAA: i: 01234567
● m sequences: columns of  GATTACA
m­tuples j: 01233456
GAT_ACA
● No column can be all “_” l: 01233445
● Monotonicity property  GAT_A_A
analogous to pairwise alignments
Utility of MSAs
● Identification of variable and conserved regions in proteins 
of a family
– Inference of protein domains
● Detecting distant homology of two sequences via other, “in 
between” sequences
● Reconstruction of phylogenetic trees: evolutionary 
relationship of sequences
● RNA: 2nd­ary structure prediction
● Genomes: conserved non­coding elements, inference of 
synteny (conserved order of regions between genomes) or 
genome rearrangements
Number of Possible MSAs
● Astronomical

Number of
sequences

Length of
sequences

Slowinksi JB. Mol Phylogen Evol 1998


Scoring MSAs
● Not obvious how
● Common: Sum of Pairs (SP)
1st column: 
score(A1,A2) + score(A1,G3) + score(A1,_4) + 
score(A1,_5) + score(A2,G3) + score(A2,_4) + 
score (A2,_5) + score(G3,_4) + score(G3,_5) + score(_4,_5)
+ 2nd column etc.
Sum of Pairs: a Good Measure?
● Alignment column: A, A, A, C
● Sum of pairs: 3x score(A,A) + 3x score(A,C)
● What if sequences are related like this:
● 1 A→C mutation in common
ancestor sequence explains data
● SP prone to penalise mutations
multiple times
An Optimal MSA using DP
● Generalisation from 2  n1=2, n2=2, n3=3

to >2 sequences
● Guaranteed to find an 
optimal MSA
● Impractical: O(nm) for m 
sequences of length n
• n=100bp, m=40: #operations 
= #particles in universe
➔ All MSA algorithms use
heuristics
Progressive Alignment Heuristic Idea
● Compute all optimal global pairwise alignments
e.g., for sequences (1) ACG, (2) CGA and (3) GAC
● Merge them consistently: once a gap always a gap
● Result depends on merge order!
(1&2) ACG_ (1&3) _ACG (2&3) CGA_
_CGA GAC_ _GAC

_ACG_ ACG__ __ACG


__CGA _CGA_ CGA__
GAC__ __GAC _GAC_
Determining Merge Order, Part I
● From optimal global pairwise alignments, 
derive pairwise distances between sequences
● Simple: #mismatches / (#mismatches + #matches)
– Underestimates distances (ignores gaps) QKLMN
_KLVN
=1/4
● Better: alignment score S to distance transformation
(S – Sshuffled) / ((SAvA + SBvB)/2 – Sshuffled)
– Sshuffled: mean score of optimal alignment of shuffled sequences
– SAvA, SBvB: score of aligning A and B, respectively, with itself
Determining Merge Order, Part I
● Pairwise distance matrix example:
Determining Merge Order, Part II
● Distance matrix → Algorithm → ”Guide” Tree
– e.g., Neighbour­Joining algorithm
● Objective: Path lengths in tree = Distances

Path length from human Hbb to human Hba


=.081+.226+.216+.055=.578 ≈ Distance = .59
Merging Alignments in Tree Order
In order of path length:
1
1. human v horse Hba
3
2 • aligning two sequences
4
2. human v horse Hbb
5
3. 1 v 2 
6
• merging alignments
4. 3 v whale myoglobin
• adding a sequence to an 
alignment
5. 4 v lamprey globin
6. 5 v lupus globin
Merging/Adding to Alignments
● Analogous to pairwise sequence alignment: 
alignment columns are the “sequence characters”
– Optimal solution by Dynamic Programming in O(nm)
● Score of matching two columns = average of scores of 
pairwise character (incl. “_”) alignments
score(1st, 2nd) = (score(A,C) + score(A,A) + ATA TCAFE
score(A,A) + score(A,G) + score(C,C) + CTA TAT_E
score(C,A) + score(C,A) + score(C,G)) / 8 TATF_
● Similar sequences “redundant” but each adds  AGTFD
to score: undue influence on MSA
Weighing Sequences
● Weight: length of path from Guide Tree root
– Branch length divided between sequences under branch:
down­weighs sets of similar sequences
● Lupus globin weight 
= .442
● Human Hba weight 
= .055 + .216 /2 + 
.016 /4 + .015 /5 + 
.062 /6 = .192
Using Sequence Weights for Scoring
● Previously: 
average of scores of pairwise character alignments
● Now: weighted average ATA TCAFE
score(1st, 2nd) = (w1,1 w2,1 score(A,C) +  CTA TAT_E
w1,1 w2,2 score(A,A) + w1,1 w2,3 score(A,A) +  TATF_
w1,1 w2,4 score(A,G) + w1,2 w2,1 score(C,C) + AGTFD
w1,2 w2,2 score(C,A) + w1,2 w2,3 score(C,A) + 
w1,2 w2,4 score(C,G)) / 8
● Classic implementation: ClustalW by Thompson et al. 1994
– superseded by other, improved MSA programmes
Progressive Alignment: Summary
(1) Sequences → Optimal global pairwise alignments
(2) Alignment scores → Pairwise distance matrix
(3) Distance matrix → Guide tree
(4) Guide tree → Sequence weights
(5) Merging/adding to alignments: 
Guide tree → Order, 
Sequence weights → Weighing scores during DP
Progressive Alignment: Caveats
● Heuristic: no guarantee of finding optimal MSA
– Optimality? Max sum­of­pairs or another measure?
● Once a gap always a gap: 
propagation of initial errors, compounded during merging
● Result depends on:
– Initial optimal global pairwise alignments: 
arbitrary choice among multiple pairwise alignments
– Guide tree merge order and sequence weights: 
depend on tree construction algorithm
– Other parameter choices: substitution matrices, gap penalties
● Sequences <25% identical: poor MSAs
● Manual correction often required: e.g., JalView
Improvements: Iterative Refinement
● Cut branch of guide tree: bi­partition MSA
● Re­align, score (e.g., sum­of­pairs): keep if better

● Repeat for every
branch
● Can overcome 
initial error
propagation issue
● Implemented by programmes like MAFFT and MUSCLE
Improvements: Consistency
● “Library”: optimal global pairwise alignments
➔ “Transitive” alignments: 3­way global alignments
➔ Position­specific scores (PSSs) for aligning ith character in X with jth 
character in Y, directly or via a character in Z: PSS(Xi,Yj) 
= #(Xi,Yj) in library + #(Xi,Zk,Yj) in transitive alignments
● Recompute library using PSSs: “extended library”
TCAFE
– Fixes inconsistencies between original 
optimal pairwise alignments (initial errors)
TAT_E
ATA TATF_
● Progressive alignment using PSSs
CTA AGTFD
score(1st, 2nd) = (w1,1 w2,1 PSS(A1,C1) + ...
● Sum­of­pairs of MSA using PSSs = degree of consistency with 
extended library
Consistency: Extending a Library
Library +Transitive Alignments
SeqA GARFIELD THE LAST FAT CAT SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ­­­ SeqB GARFIELD THE FAST CAT ­­­
SeqA GARFIELD THE LAST FA­T CAT SeqA GARFIELD THE LAST FA­T CAT
SeqC GARFIELD THE VERY FAST CAT SeqC GARFIELD THE VERY FAST CAT
SeqA GARFIELD THE LAST FAT CAT SeqB GARFIELD THE ­­­­ FAST CAT
SeqD ­­­­­­­­ THE ­­­­ FAT CAT SeqA GARFIELD THE LAST FA­T CAT
SeqB GARFIELD THE ­­­­ FAST CAT SeqD ­­­­­­­­ THE ­­­­ FA­T CAT
SeqC GARFIELD THE VERY FAST CAT SeqB GARFIELD THE ­­­­ FAST CAT

SeqB GARFIELD THE ­­­­ FAST CAT Position­specific scores
SeqD ­­­­­­­­ THE ­­­­ FA­T CAT
... SeqA GARFIELD THE LAST FA T CAT
 PSS 22222222 333 1111 22 2 222

● Implemented by  SeqB GARFIELD THE      FAST CAT

programmes like Extended library:
ProbCons and T­COFFEE SeqA GARFIELD THE LAST FA­T CAT
SeqB GARFIELD THE ­­­­ FAST CAT
...
Improvements: Protein Structure
● Protein structures more conserved than sequences
● Consistency­based MSA with library including 
optimal pairwise alignments of structures
– Optimality: minimise distances between residues in 3D
● Implemented by T­COFFEE Expresso module
– Structure entered or, if available, retrieved from PDB
● Also useful for checking quality of MSAs: 
T­COFFEE iRMSD­APDB
Improvements: Meta­MSA
● ~50 different approaches/programmes for MSA
● No one approach is optimal
– speed v accuracy trade­off
➔ Generate MSAs with variety of programmes
➔ Derive consistency­based consensus MSA
● Implemented by M­COFFEE
– includes MAFFT, ProbCons, MUSCLE, T­COFFEE etc.
MSA Programme Comparison
● From T­COFFEE online documentation:
Genomic MSA: Challenges
● Progressive alignment approaches, modified to account 
for genome­specific challenges:
● >1000­fold longer sequences
– protein: <1000 residues v genome: >1M nucleotides
● Low sequence conservation in non­coding regions
● High content of repeated subsequences
– transposons, retrotransposons, telomeric/centromeric repeats
● Large­scale rearrangements
– deletions, duplications, inversions, translocations
Genomic MSA: UCSC and Ensembl
● Databases of MSAs of 100s of vertebrate genomes
– integrated into genome browsers
● UCSC: Multiz MSA + phastCons/phyloP 
quantification of conservation
● Ensembl: Enredo syntenic/co­linear regions
+ Pecan MSA + Ortheus ancestral genome 
reconstruction (EPO)
Acknowledgments
● Mona Singh
● Ben Langmead
● Cédric Notredame

You might also like