[go: up one dir, main page]

0% found this document useful (0 votes)
89 views15 pages

Global Alignment: Ben Langmead

This document discusses global alignment and generalizing edit distance. It introduces the idea of assigning different costs or penalties to different sequence differences, like transitions versus transversions or gaps versus substitutions. It presents an implementation of global alignment using dynamic programming that allows a custom penalty function. The algorithm runs in O(mn) time and space and traceback returns the optimal alignment in O(m+n) time. Scoring functions aim to reflect expected mutational events and biological interchangeability, though simple functions are often used for computational efficiency.

Uploaded by

mohit mishra
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
89 views15 pages

Global Alignment: Ben Langmead

This document discusses global alignment and generalizing edit distance. It introduces the idea of assigning different costs or penalties to different sequence differences, like transitions versus transversions or gaps versus substitutions. It presents an implementation of global alignment using dynamic programming that allows a custom penalty function. The algorithm runs in O(mn) time and space and traceback returns the optimal alignment in O(m+n) time. Scoring functions aim to reflect expected mutational events and biological interchangeability, though simple functions are often used for computational efficiency.

Uploaded by

mohit mishra
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

Global Alignment

Ben Langmead

Department of Computer Science

Please sign guestbook (www.langmead-lab.org/teaching-materials)


to tell me briefly how you are using the slides. For original Keynote
files, email me (ben.langmead@gmail.com).
Generalizing edit distance

What if it doesn’t make sense for every edit to cost 1?

If you compare two human genomes, you see some kinds of


sequence differences more often than others
Generalizing edit distance

Transitions are A ↔ G
and C ↔ T changes

Transversions are A C,
A T, C G, G T

For random mutations,


transitions should be half as
frequent as transversions...

...but if you compare two humans, transition


to transversion ratio (ti/tv) is ~2.1
Generalizing edit distance

GGGTAGCGGGTTTAAC
||||| ||||||||||
GGGTAACGGGTTTAAC
Human substitution rate ≈ 1 in 1,000

GGGTAGCGGGTTTAAC
||||| |||||||||
GGGTA--GGGTTTAAC
Small-gap rate is ≈ 1 in 3,000

Wanted: keep basic edit distance idea and algorithm, but give
different weights to different events according to likelihood
Penalty function

A C G T -
A 0 4 2 4 8 2 Transitions (A G, C T)
C 4 0 4 2 8 4 Transversions
G 2 4 0 4 8
8 Gaps
T 4 2 4 0 8
- 8 8 8 8
Global alignment
Pj 1 Pi 1
Let D[0, j] = k=0 s( , y[k]), and let D[i, 0] = k=0 s(x[k], )
8
< D[i 1, j] + s(x[i 1], )
Otherwise, let D[i, j] = min D[i, j 1] + s( , y[j 1])
:
D[i 1, j 1] + s(x[i 1], y[j 1])
s(a, b) assigns a cost to a particular gap or substitution

A C G T -
A 0 4 2 4 8 2 Transitions (A G, C T)

C 4 0 4 2 8
s(a, b) : 4 Transversions (everything else)
G 2 4 0 4 8
T 4 2 4 0 8 8 Gaps
- 8 8 8 8
Global alignment: implementation

from numpy import zeros

def globalAlignment(x, y, s):


""" Calculate global alignment value of sequences x and y using
dynamic programming. Return global alignment value. """
D = zeros((len(x)+1, len(y)+1), dtype=int)
for j in range(1, len(y)+1): Use of new
D[0, j] = D[0, j-1] + s('-', y[j-1])
for i in range(1, len(x)+1): penalty function
D[i, 0] = D[i-1, 0] + s(x[i-1], '-')
for i in range(1, len(x)+1):
for j in range(1, len(y)+1):
D[i, j] = min(D[i-1, j-1] + s(x[i-1], y[j-1]), # diagonal
D[i-1, j ] + s(x[i-1], '-'), # vertical
D[i , j-1] + s('-', y[j-1])) # horizontal
return D, D[len(x), len(y)]

Similar to edit distance

http://bit.ly/CG_DP_Global
Global alignment: implementation
def exampleCost(xc, yc):
""" Cost function assigning 0 to match, 2 to transition, 4 to
transversion, and 8 to a gap """
if xc == yc: return 0 # match
if xc == '-' or yc == '-': return 8 # gap
minc, maxc = min(xc, yc), max(xc, yc)
if minc == 'A' and maxc == 'G': return 2 # transition
elif minc == 'C' and maxc == 'T': return 2 # transition
return 4 # transversion

A C G T -
A 0 4 2 4 8
C 4 0 4 2 8
G 2 4 0 4 8
T 4 2 4 0 8
- 8 8 8 8

http://bit.ly/CG_DP_Global
Global alignment: dynamic programming
D = zeros((len(x)+1, len(y)+1), dtype=int)
globalAlignment for j in range(1, len(y)+1):
D[0, j] = D[0, j-1] + s('-', y[j-1])
initialization: for i in range(1, len(x)+1):
D[i, 0] = D[i-1, 0] + s(x[i-1], '-')

ϵ T A T G T C A T G C
ϵ 0 8 16 24 32 40 48 56 64 72 80 s(a, b)
T 8 A C G T -
A 0 4 2 4 8
A 16 C 4 0 4 2 8
C 24 G 2 4 0 4 8
T 4 2 4 0 8
G 32 - 8 8 8 8
T 40
C 48
A 56
G 64
C 72
Global alignment: dynamic programming
for i in range(1, len(x)+1):
globalAlignment for j in range(1, len(y)+1):
D[i, j] = min(D[i-1, j-1] + s(x[i-1], y[j-1]), # diagonal
loop: D[i-1, j ] + s(x[i-1], '-'), # vertical
D[i , j-1] + s('-', y[j-1])) # horizontal

ϵ T A T G T C A T G C
ϵ 0 8 16 24 32 40 48 56 64 72 80 s(a, b)
T 8 0 8 16 24 32 40 48 56 64 72 A C G T -
A 0 4 2 4 8
A 16 8 0 8 16 24 32 40 48 56 64 C 4 0 4 2 8
C 24 16 8 ? G 2 4 0 4 8
T 4 2 4 0 8
G 32 - 8 8 8 8
T 40
C 48
A 56
G 64
C 72
Global alignment: dynamic programming
for i in range(1, len(x)+1):
globalAlignment for j in range(1, len(y)+1):
D[i, j] = min(D[i-1, j-1] + s(x[i-1], y[j-1]), # diagonal
loop: D[i-1, j ] + s(x[i-1], '-'), # vertical
D[i , j-1] + s('-', y[j-1])) # horizontal

ϵ T A T G T C A T G C
ϵ 0 8 16 24 32 40 48 56 64 72 80 s(a, b)
T 8 0 8 16 24 32 40 48 56 64 72 A C G T -
A 0 4 2 4 8
A 16 8 0 8 16 24 32 40 48 56 64 C 4 0 4 2 8
C 24 16 8 2 10 18 24 32 40 48 56 G 2 4 0 4 8
T 4 2 4 0 8
G 32 24 16 10 2 10 18 26 34 40 48 - 8 8 8 8
T 40 32 24 16 10 2 10 18 26 34 42
C 48 40 32 24 18 10 2 10 18 26 34
A 56 48 40 32 26 18 10 2 10 18 26
G 64 56 48 40 32 26 18 10 6 10 18
C 72 64 56 48 40 34 26 18 12 10 10 Optimal global
alignment value
Global alignment: getting the alignment

Traceback works just as it did for edit distance

ϵ T A T G T C A T G C
ϵ 0 8 16 24 32 40 48 56 64 72 80
T 8 0 8 16 24 32 40 48 56 64 72
A 16 8 0 8 16 24 32 40 48 56 64
C 24 16 8 2 10 18 24 32 40 48 56
G 32 24 16 10 2 10 18 26 34 40 48
T 40 32 24 16 10 2 10 18 26 34 42 TACGTCA-GC
C 48 40 32 24 18 10 2 10 18 26 34 || |||| ||
TATGTCATGC
A 56 48 40 32 26 18 10 2 10 18 26 +2 +8
(transition) (gap)
G 64 56 48 40 32 26 18 10 6 10 18
C 72 64 56 48 40 34 26 18 12 10 10
Global alignment: summary

Matrix-filling dynamic programming algorithm is O(mn) time and space

FillIng matrix is O(mn) space and time, yields global alignment value

Traceback is O(m + n) time, yields optimal alignment


Global alignment: scoring functions

Where do these penalty functions come from? A C G T -


A 0 4 2 4 8
C 4 0 4 2 8
G 2 4 0 4 8
T 4 2 4 0 8
- 8 8 8 8
They can be based on:

Expected frequency of the different mutational events

How interchangeable are the alternatives are from a biological perspective

Does the substitution change the shape or function of the molecule

Prevalence of simple (linear, constant, affine) gap penalties is mostly because


that’s what we can do efficiently, as discussed in HW4

One occasionally sees more general (e.g. convex) gap penalties


BLOSUM62 Some amino acid substitutions have a smaller impact on
structure & function than others. BLOSUM62 elements
are, roughly speaking, log-odds of observing these
substitutions between two highly related proteins

Rare; larger effect on Common; modest effect


structure/function on structure/function

negative positive

Matrix is symmetric

Amino acids

You might also like