Global Alignment: Ben Langmead
Global Alignment: Ben Langmead
Ben Langmead
Transitions are A ↔ G
and C ↔ T changes
Transversions are A C,
A T, C G, G T
GGGTAGCGGGTTTAAC
||||| ||||||||||
GGGTAACGGGTTTAAC
Human substitution rate ≈ 1 in 1,000
GGGTAGCGGGTTTAAC
||||| |||||||||
GGGTA--GGGTTTAAC
Small-gap rate is ≈ 1 in 3,000
Wanted: keep basic edit distance idea and algorithm, but give
different weights to different events according to likelihood
Penalty function
A C G T -
A 0 4 2 4 8 2 Transitions (A G, C T)
C 4 0 4 2 8 4 Transversions
G 2 4 0 4 8
8 Gaps
T 4 2 4 0 8
- 8 8 8 8
Global alignment
Pj 1 Pi 1
Let D[0, j] = k=0 s( , y[k]), and let D[i, 0] = k=0 s(x[k], )
8
< D[i 1, j] + s(x[i 1], )
Otherwise, let D[i, j] = min D[i, j 1] + s( , y[j 1])
:
D[i 1, j 1] + s(x[i 1], y[j 1])
s(a, b) assigns a cost to a particular gap or substitution
A C G T -
A 0 4 2 4 8 2 Transitions (A G, C T)
C 4 0 4 2 8
s(a, b) : 4 Transversions (everything else)
G 2 4 0 4 8
T 4 2 4 0 8 8 Gaps
- 8 8 8 8
Global alignment: implementation
http://bit.ly/CG_DP_Global
Global alignment: implementation
def exampleCost(xc, yc):
""" Cost function assigning 0 to match, 2 to transition, 4 to
transversion, and 8 to a gap """
if xc == yc: return 0 # match
if xc == '-' or yc == '-': return 8 # gap
minc, maxc = min(xc, yc), max(xc, yc)
if minc == 'A' and maxc == 'G': return 2 # transition
elif minc == 'C' and maxc == 'T': return 2 # transition
return 4 # transversion
A C G T -
A 0 4 2 4 8
C 4 0 4 2 8
G 2 4 0 4 8
T 4 2 4 0 8
- 8 8 8 8
http://bit.ly/CG_DP_Global
Global alignment: dynamic programming
D = zeros((len(x)+1, len(y)+1), dtype=int)
globalAlignment for j in range(1, len(y)+1):
D[0, j] = D[0, j-1] + s('-', y[j-1])
initialization: for i in range(1, len(x)+1):
D[i, 0] = D[i-1, 0] + s(x[i-1], '-')
ϵ T A T G T C A T G C
ϵ 0 8 16 24 32 40 48 56 64 72 80 s(a, b)
T 8 A C G T -
A 0 4 2 4 8
A 16 C 4 0 4 2 8
C 24 G 2 4 0 4 8
T 4 2 4 0 8
G 32 - 8 8 8 8
T 40
C 48
A 56
G 64
C 72
Global alignment: dynamic programming
for i in range(1, len(x)+1):
globalAlignment for j in range(1, len(y)+1):
D[i, j] = min(D[i-1, j-1] + s(x[i-1], y[j-1]), # diagonal
loop: D[i-1, j ] + s(x[i-1], '-'), # vertical
D[i , j-1] + s('-', y[j-1])) # horizontal
ϵ T A T G T C A T G C
ϵ 0 8 16 24 32 40 48 56 64 72 80 s(a, b)
T 8 0 8 16 24 32 40 48 56 64 72 A C G T -
A 0 4 2 4 8
A 16 8 0 8 16 24 32 40 48 56 64 C 4 0 4 2 8
C 24 16 8 ? G 2 4 0 4 8
T 4 2 4 0 8
G 32 - 8 8 8 8
T 40
C 48
A 56
G 64
C 72
Global alignment: dynamic programming
for i in range(1, len(x)+1):
globalAlignment for j in range(1, len(y)+1):
D[i, j] = min(D[i-1, j-1] + s(x[i-1], y[j-1]), # diagonal
loop: D[i-1, j ] + s(x[i-1], '-'), # vertical
D[i , j-1] + s('-', y[j-1])) # horizontal
ϵ T A T G T C A T G C
ϵ 0 8 16 24 32 40 48 56 64 72 80 s(a, b)
T 8 0 8 16 24 32 40 48 56 64 72 A C G T -
A 0 4 2 4 8
A 16 8 0 8 16 24 32 40 48 56 64 C 4 0 4 2 8
C 24 16 8 2 10 18 24 32 40 48 56 G 2 4 0 4 8
T 4 2 4 0 8
G 32 24 16 10 2 10 18 26 34 40 48 - 8 8 8 8
T 40 32 24 16 10 2 10 18 26 34 42
C 48 40 32 24 18 10 2 10 18 26 34
A 56 48 40 32 26 18 10 2 10 18 26
G 64 56 48 40 32 26 18 10 6 10 18
C 72 64 56 48 40 34 26 18 12 10 10 Optimal global
alignment value
Global alignment: getting the alignment
ϵ T A T G T C A T G C
ϵ 0 8 16 24 32 40 48 56 64 72 80
T 8 0 8 16 24 32 40 48 56 64 72
A 16 8 0 8 16 24 32 40 48 56 64
C 24 16 8 2 10 18 24 32 40 48 56
G 32 24 16 10 2 10 18 26 34 40 48
T 40 32 24 16 10 2 10 18 26 34 42 TACGTCA-GC
C 48 40 32 24 18 10 2 10 18 26 34 || |||| ||
TATGTCATGC
A 56 48 40 32 26 18 10 2 10 18 26 +2 +8
(transition) (gap)
G 64 56 48 40 32 26 18 10 6 10 18
C 72 64 56 48 40 34 26 18 12 10 10
Global alignment: summary
FillIng matrix is O(mn) space and time, yields global alignment value
negative positive
Matrix is symmetric
Amino acids