Lecture 3
Lecture 3
Introduction to Bioinformatics
Lecture 3
Mohammed El-Kebir
September 6, 2024
Course Announcements
Instructor:
• Mohammed El-Kebir (melkebir) HW1: To be released on 09/13
Office hours: Wednesdays, 3:30-4:30pm in and due 09/21 at 11:59pm
Siebel 3216 (not today)
TA:
• Nicole (nsdong2): Mondays 4-5pm in CS Tutoring Center (Siebel basement)
• Claire (czchou2): Tuesdays 2-3pm in CS Tutoring Center (Siebel basement)
• Mrinmoy (mroddur2): Thursdays 2-3pm in Siebel 2219
2
Outline
1. Edit distance
2. Review elementary graph theory
3. Manhattan Tourist problem
4. Longest/shortest paths in DAGs
5. Global alignment
6. Fitting alignment
Reading:
• Jones and Pevzner. Chapters 2.7-2.9 and 6.1-6.4, 6.6
• Lecture notes
3
Alignment
An alignment between two strings v (of m characters) and w (of n characters)
is a two row matrix where the first row contains the characters of v in order,
the second row contains the characters of w in order, and spaces may be
interspersed throughout each.
Input Output
v: KITTEN (m = 6) v: K - I T T E N -
w: SITTING (n = 7) w: S I - T T I N G
w: AGCGTAC...
𝑖−1 𝑖
prefix of 𝐯 of length 𝑖 𝐯# : A T - G T T T
prefix of 𝐰 of length 𝑗 𝐰$ : A G C G T - C
𝑗−1 𝑗
Optimal substructure:
Edit distance obtained from edit distance of prefix of string.
5
Computing Edit Distance – Dynamic Programming 8
>
> 0, if i = 0 and j = 0,
>
>
… 𝐯" … - … 𝐯" … 𝐯" >
>
deletion insertion mismatch match <d[i 1, j] + 1, if i > 0,
… - … 𝐰! … 𝐰! … 𝐰! d[i, j] = min d[i, j 1] + 1, if j > 0,
>
>
>
> d[i 1, j 1] + 1, if i > 0, j > 0 and vi 6= wj ,
>
>
:d[i 1, j 1], if i > 0, j > 0 and vi = wj .
W A T C G
V 0 1 2 3 4
0 0 1 2 3 4
A T - G T
A 1 1 0 1 2 3 A T C G -
T 2 2 1 0 1 2
G 3 3 2 1 1 1
A T G T
A T C G
T 4 4 3 2 2 2
6
Computing Edit Distance – Your turn! 8
> 0, if i = 0 and j = 0,
… 𝐯" … -
𝑖 − 1, 𝑗 − 1 𝑖 − 1, 𝑗 >
>
>
>
… -
deletion … 𝐰!
insertion >
<d[i 1, j] + 1, if i > 0,
0
or d[i, j] = min d[i, j 1] + 1, if j > 0,
1 1 >
>
>
> d[i 1, j 1] + 1, if i > 0, j > 0 and vi 6= wj ,
… 𝐯" … 𝐯" >
>
… 𝐰!
mismatch … 𝐰!
match 1 :d[i 1, j 1], if i > 0, j > 0 and vi = wj .
𝑖, 𝑗 − 1 𝑖, 𝑗
W C A R W A T E W A R E
V 0 1 2 3 V 0 1 2 3 V 0 1 2 3
0 0 0
C 1 C 1 C 1
A 2 A 2 A 2
T 3 T 3 T 3
W C A R W A T E W A R E
V 0 1 2 3 V 0 1 2 3 V 0 1 2 3
0 0 1 2 3 0 0 1 2 3 0 0 1 2 3
C 1 1 0 1 2 C 1 1 1 2 3 C 1 1 1 2 3
A 2 2 1 0 1 A 2 2 1 2 3 A 2 2 1 2 3
T 3 3 2 1 1 T 3 3 2 1 2 T 3 3 2 2 3
A 1 1 0 1 2 3
Value 1 2 3 4 5 6 7
Min # coins 1 2 1 2 1 2 3 T 2 2 1 0 1 2
G 3 3 2 1 1 1
T 4 4 3 2 2 2
• Both have optimal substructure and can be solved using dynamic programming
• These are examples of a more general problem!
9
Review of Graph Theory
• Graph 𝐺 = (𝑉, 𝐸)
• Vertices 𝑉 = {𝑣% , … , 𝑣" }
• Edges 𝐸 = {(𝑣# , 𝑣$ ), … }
Chicago
Bloomington
Champaign-Urbana
Indianapolis
St. Louis
10
Review of Graph Theory
• Directed Graph 𝐺 = (𝑉, 𝐸)
• Vertices 𝑉 = {𝑣% , … , 𝑣" }
• Directed edges 𝐸 = {(𝑣# , 𝑣$ ), … }
Chicago
Bloomington
Champaign-Urbana
Indianapolis
St. Louis
11
Review of Graph Theory
• Directed Graph 𝐺 = (𝑉, 𝐸)
• Vertices 𝑉 = {𝑣% , … , 𝑣" }
• Directed edges 𝐸 = {(𝑣# , 𝑣$ ), … }
• Path is a sequence of vertices and edges
Chicago
that connect them
Bloomington
Champaign-Urbana
Indianapolis
St. Louis
12
Review of Graph Theory
• Directed Graph 𝐺 = (𝑉, 𝐸)
• Vertices 𝑉 = {𝑣% , … , 𝑣" }
• Directed edges 𝐸 = {(𝑣# , 𝑣$ ), … }
• Path is a sequence of vertices and edges
Chicago
that connect them
130
• Edges can be weighted
140
Bloomington
50
Champaign-Urbana
180
170 150 Indianapolis
St. Louis
13
Manhattan Tourist Problem
Begin
A tourist in Manhattan * *
wants to visit the
maximum number of *
attractions (*) by
* *
traveling on a path (only
* *
eastward and southward) *
from start to end
*
* * *
End
14
Manhattan Tourist Problem
A tourist in Manhattan
wants to visit the Begin
maximum number of 2 1
attractions (*) by
traveling on a path (only 1 1 1
eastward and southward)
from start to end 1 2
1
May be more than 1 5
attraction on a street. 1 1 3
Add weights! End
15
Manhattan Tourist Problem
0 1 2 3 4
begin j coordinate
3 2 4 0
0 0 3 5 9
Manhattan Tourist 1 0 2 4 3
Problem (MTP): 3 2 4 2
1 13
Given a weighted,
1
directed grid graph G 4 6 5 2
i coordinate
with two vertices “begin” 2
0 7 3
15
4
19
and “end”, find the
maximum weight path in 4 4 5 2 1
5 6 8 5 3
1 3 2 2 end
4 23
16
Manhattan Tourist Problem – Exhaustive Algorithm
0 1 2 3 4
Check all paths begin 3 2 4 0
j coordinate
0 0 3 5 9
Question: 1 0 2 4 3
1
4 6 5 2
i coordinate
0 7 3 4
2 15 19
4 4 5 2 1
3 3 0 2
3
20
5 6 8 5 3
1 3 2 2 end
4 23
17
Manhattan Tourist Problem – Greedy Algorithm
1 2 5
begin
5 3 10 5
2 1 5
3 5 3 1
better path!
2 3 4
promising start, 0 0 5 2
but leads to bad 22
0 0 0
choices! end
18
18
Manhattan Tourist Problem – Optimal Substructure
1 2 5
begin
5 3 10 5
2 1 5
0 0 5 2
0 0 0
end best score to end
best score to this point 18 22
19
Manhattan Tourist Problem – Optimal Substructure
𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
1 2 5
begin
5 3 10 5
2 1 5
best score to
Question: What is the recurrence? 3 5 3 1 this point
2 3 4
20
0 0 5 2
0 0 0
end best score to
best score to this point 18 22 end
1 2 5
begin
5 3 10 5
2 1 5
8 best score to
>
<0, if i = 0 and j = 0, 3 5 3 1 this point
s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0, 2 3 4
>
: 20
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 0 5 2
0 0 0
end best score to
best score to this point 18 22 end
22
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1
source 8
>
<0, if i = 0 and j = 0,
1 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1
i 5
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
23
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1 2
source 8
>
<0, if i = 0 and j = 0,
1 2 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1 3
i 5 3
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
2 𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 7 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
3
2
8
24
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1 2 3
source 8
>
<0, if i = 0 and j = 0,
1 2 5 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1 3 8
i 5 3 10
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
2 1 𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 7 13 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
3 5
2
2
8 12
0
3
8
25
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1 2 3
source 8
>
<0, if i = 0 and j = 0,
1 2 5 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1 3 8
i 5 3 10 5
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
2 1 5 𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 7 13 18 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
3 5 3
2 3
2
8 12 16
0 0
0
3
8 12
26
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1 2 3
source 8
>
<0, if i = 0 and j = 0,
1 2 5 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1 3 8
i 5 3 10 5
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
2 1 5 𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 7 13 18 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
3 5 3 1
2 3 4
2
8 12 16 20
0 0 5
0 0
3
8 12 21
27
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1 2 3
source 8
>
<0, if i = 0 and j = 0,
1 2 5 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1 3 8
i 5 3 10 5
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
2 1 5 𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 7 13 18 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
3 5 3 1
2 3 4
2
8 12 16 20
0 0 5 2
0 0 0
3
8 12 21 22
28
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1 2 3
source 8
>
<0, if i = 0 and j = 0,
1 2 5 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1 3 8
i 5 3 10 5
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
2 1 5 𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 7 13 18 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
3 5 3 1
3
0 0 0 Question: Implementation?
8 12 21 22
S3,3 = 22
29
Manhattan Is Not a Perfect Grid
A2 A3
8
>
<s[A1 ] + w[A1 , B],
s[B] = max s[A2 ] + w[A2 , B],
>
:
s[A3 ] + w[A3 , B].
30
Manhattan Is Not a Perfect Grid, It’s a Directed Graph
pred 𝑖, 𝑗
𝐺 = (𝑉, 𝐸) is a directed
acyclic graph (DAG) Each edge is evaluated
with nonnegative edges once: 𝑂( 𝐸 ) time
weights 𝑤 ∶ 𝐸 → ℝ.
𝑖, 𝑗
s[0, 0] = 0
0 0 0 0
s[i, j] = max {s[i , j ] + w[(i , j ), (i, j)]}
(i0 ,j 0 ) 2 pred(i,j)
31
Dynamic Programming as a Graph Problem
Begin
Manhattan Tourist Problem: * *
Every path in directed graph is a possible * * *
tourist path. Find maximum weight path. *
* *
Running time: 𝑂 𝑚𝑛 = 𝑂( 𝐸 )
*
* * *
End
G 3
T 4
33
What About the Edit Distance Problem?
W A T C G Edit Distance problem: Given
V 0 1 2 3 4
strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " ,
compute the minimum number
0 O O O O O 𝑑(𝐯, 𝐰) of elementary operations
A 1 O O O O O to transform 𝐯 into 𝐰.
T 2 O O O O O
Edit graph is a weighed, directed grid
G 3 O O O O O graph 𝐺 = (𝑉, 𝐸) with source vertex
T 4 O O O O O
(0, 0) and target vertex (𝑚, 𝑛). Each
edge (𝑖, 𝑗) has weight [𝑖, 𝑗] corresponding
𝐯" deletion - insertion 𝐯" mismatch 𝐯" match to edit cost: deletion (1), insertion (1),
- 𝐰! 𝐰! 𝐰! mismatch (1) and match (0).
34
What About the Edit Distance Problem?
W A T C G
V 0 1 2 3 4
0 O O O O O
A 1 O O O O O
Alignment is a path from (0, 0) to (𝑚, 𝑛)
T 2 O O O O O
Edit graph is a weighed, directed grid
G 3 O O O O O graph 𝐺 = (𝑉, 𝐸) with source vertex
T 4 O O O O O
(0, 0) and target vertex (𝑚, 𝑛). Each
edge (𝑖, 𝑗) has weight [𝑖, 𝑗] corresponding
𝐯" deletion - insertion 𝐯" mismatch 𝐯" match to edit cost: deletion (1), insertion (1),
- 𝐰! 𝐰! 𝐰! mismatch (1) and match (0).
35
What About the Edit Distance Problem?
W A T C G Edit Distance problem: Given edit
V 0 1 2 3 4
graph 𝐺 = (𝑉, 𝐸), with edge
weights c ∶ 𝐸 → 0,1 . Find
0 O O O O O shortest path from (0, 0) to (𝑚, 𝑛).
A 1 O O O O O
Alignment is a path from (0, 0) to (𝑚, 𝑛)
T 2 O O O O O
Edit graph is a weighed, directed grid
G 3 O O O O O graph 𝐺 = (𝑉, 𝐸) with source vertex
T 4 O O O O O
(0, 0) and target vertex (𝑚, 𝑛). Each
edge (𝑖, 𝑗) has weight [𝑖, 𝑗] corresponding
𝐯" deletion - insertion 𝐯" mismatch 𝐯" match to edit cost: deletion (1), insertion (1),
- 𝐰! 𝐰! 𝐰! mismatch (1) and match (0).
36
Shortest Path vs Longest Path
• Change graph, edit graph and the MTP grid are directed graphs G.
37
Shortest Path vs Longest Path
• Shortest path in directed graphs can be found efficiently (Dijkstra, Bellman-
Ford, Floyd-Warshall algorithms)
• Longest path in direct graphs cannot be found efficiently (NP-hard).
• Change graph, edit graph and MTP grid graph are directed acylic graphs
(DAGs).
• No directed cycles.
Question: What’s the relation between absence of directed cycles and optimal substructure?
38
Weighted Edit Distance
𝑑[𝑖, 𝑗] is the edit distance of 𝐯# and 𝐰$ ,
where 𝐯# is prefix of 𝐯 of length 𝑖 and 𝐰$ is prefix of 𝐰 of length 𝑗
8 deletion
… 𝐯$
>
> d[i 1, j] + 1, … -
>
<d[i, j 1] + 1, insertion
… -
… 𝐰#
A T C G
w 0 1 2 3 4
V 8
> 0, if i = 0 and j = 0,
>
>
>
>
0 >
<d[i 1, j] + 1, if i > 0,
d[i, j] = min d[i, j 1] + 1, if j > 0,
A >
1 >
>
> d[i 1, j 1] + 2, if i > 0, j > 0 and vi 6= wj ,
>
>
:d[i 1, j 1], if i > 0, j > 0 and vi = wj .
G 2
T 3
40
Weighted Edit Distance – Practice Problem
• Compute weighted edit distance between 𝐯 = AGT and 𝐰 = ATCT.
A T C G
w 0 1 2 3 4
V 8
> 0, if i = 0 and j = 0,
>
>
>
>
0 0 1 2 3 4 >
<d[i 1, j] + 1, if i > 0,
d[i, j] = min d[i, j 1] + 1, if j > 0,
A >
1 1 0 1 2 3 >
>
> d[i 1, j 1] + 2, if i > 0, j > 0 and vi 6= wj ,
>
>
:d[i 1, j 1], if i > 0, j > 0 and vi = wj .
G 2 2 1 2 3 2
T 3 3 2 1 2 3
41
Edit Distance – Additional Insights
• An alignment corresponds to a series of elementary operations
44
Edit Distance is a Distance Function
Edit distance 𝑑(𝐯, 𝐰) is the minimum number of elementary operations to
transform 𝐯 ∈ Σ ∗ into 𝐰 ∈ Σ ∗ .
Proof: Let 𝐮, 𝐯, 𝐰 ∈ Σ ∗ .
i. 𝑑 𝐯, 𝐰 ≥ 0 [non-negativity]
Edit distance is defined by an alignment. This in turn uniquely determines
a series of elementary operations, each with cost either 0 (match) or 1
(otherwise). Thus, 𝑑 𝐯, 𝐰 ≥ 0.
45
Edit Distance is a Distance Function
Edit distance 𝑑(𝐯, 𝐰) is the minimum number of elementary operations to
transform 𝐯 ∈ Σ ∗ into 𝐰 ∈ Σ ∗ .
Proof: Let 𝐮, 𝐯, 𝐰 ∈ Σ ∗ .
ii. 𝑑 𝐯, 𝐰 = 0 if and only if 𝐯 = 𝐰 [identity of indiscernibles]
(=>) By the premise, 𝑑 𝐯, 𝐰 = 0. By definition, the optimal alignment can only consist
of operations with cost 0. That is, the alignment consist of only matches. Thus, 𝐯 = 𝐰.
(<=) By the premise, 𝐯 = 𝐰. Thus, there exists an alignment where every pair of
columns is a match. This means that |𝐯| = |𝐰| and each letter 𝑣0 equals 𝑤0 (where 𝑖 ∈
[|𝐯|]). Moreover, only the match operations have cost 0, the other operations have
cost 1. Hence, this is the optimal alignment with cost 𝑑 𝐯, 𝐰 = 0.
46
Edit Distance is a Distance Function
Edit distance 𝑑(𝐯, 𝐰) is the minimum number of elementary operations to
transform 𝐯 ∈ Σ ∗ into 𝐰 ∈ Σ ∗ .
Proof: Let 𝐮, 𝐯, 𝐰 ∈ Σ ∗ .
iii. 𝑑 𝐯, 𝐰 = 𝑑(𝐰, 𝐯) [symmetry]
Let 𝐀 = [𝑎0,1 ] be the optimal alignment corresponding to 𝑑 𝐯, 𝐰 , i.e. 𝐀 is an 2 × 𝑘
matrix where 𝑘 ∈ {max( 𝐯 , 𝐰 ), … , 𝐯 + 𝐰 }. Define the function 𝑓 𝐀 = 𝐁 such
that 𝐁 is obtained by interchanging the two rows of 𝐀. Since the cost of any insertion,
deletion and mismatch is 1, we have that alignment 𝐁 has cost 𝑑 𝐯, 𝐰 . The existence
of an alignment from 𝐰 to 𝐯 with cost less than 𝑑 𝐯, 𝐰 , yields a contradiction as it
implies that 𝐀 is not an optimal alignment from 𝐯 to 𝐰. Hence, 𝑑 𝐰, 𝐯 = 𝑑 𝐯, 𝐰 .
47
Edit Distance is a Distance Function
Edit distance 𝑑(𝐯, 𝐰) is the minimum number of elementary operations to
transform 𝐯 ∈ Σ ∗ into 𝐰 ∈ Σ ∗ .
Proof: Let 𝐮, 𝐯, 𝐰 ∈ Σ ∗ .
iv. 𝑑 𝐯, 𝐰 ≤ 𝑑 𝐯, 𝐮 + 𝑑(𝐮, 𝐰) [triangle inequality]
Assume for a contradiction that 𝑑 𝐯, 𝐰 > 𝑑 𝐯, 𝐮 + 𝑑(𝐮, 𝐰). Let 𝑆 be the sequence
of elementary operations for transforming 𝐯 into 𝐮. Let 𝑆′ be the sequence of
elementary operations for transforming 𝐮 into 𝐰. Note that 𝑑 𝐯, 𝐮 = |𝑆| and
𝑑 𝐮, 𝐰 = |𝑆′|. Concatenate 𝑆 and 𝑆′ and remove redundant operations, yielding
sequence 𝑆′′. By definition, 𝑆 22 ≤ 𝑆 + 𝑆 2 . We can obtain an alignment of 𝐯 and
𝐰 from 𝑆′′ with cost 𝑆 22 ≤ 𝑑 𝐯, 𝐮 + 𝑑(𝐮, 𝐰). This yields a contradiction with
𝑑 𝐯, 𝐰 > 𝑑 𝐯, 𝐮 + 𝑑(𝐮, 𝐰) being the cost of the optimal alignment of 𝐯 and 𝐰.
48
Outline
1. Edit distance recap
2. Global alignment
3. Fitting alignment
4. Local alignment
5. Gapped alignment
Reading:
• Jones and Pevzner. Chapters 6.6, 6.8 and 6.9
49
Biological Sequence Alignment
• Weighted edit distance: find W A T C G
alignment with minimum V 0 1 2 3 4
distance
• Shortest path in weighted 0 O O O O O
edit graph
A 1 O O O O O
• Sequence alignment: find
alignment with maximum T 2 O O O O O
similarity G 3 O O O O O
• Longest path in weighted
edit graph T 4 O O O O O
• Score function:
𝛿∶ Σ ∪ − 3→ℝ $% deletion - insertion $% mismatch $% match
- "# "# "#
G T
51
Scoring Matrices
Transitions: interchanges among purines
(two rings) or pyrimidines (one ring)
• A <--> G
𝛿 A T C G -
• C <--> T A 1 -2 -2 -1 -1
T -2 1 -1 -2 -1
Transversions: interchanges between purines
(two rings) and pyrimidines (one ring) C -2 -1 1 -2 -1
• A <--> C, A <--> T G -1 -2 -2 1 -1
• G <--> C, G <--> T
- -1 -1 -1 -1 −∞
Transitions more likely than transversions!
52
Global Alignment – Needleman-Wunsch Algorithm
Global Alignment problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " and scoring
function 𝛿, find alignment with maximum score.
54
NGS Characterized by Short Reads
… CATTCAGTAG …
… AGCCATTAG …
… GGTAGTTAG … … GGTAAACTAG …
… TATAATTAG … … CGTACCTAG …
Genome
Next-generation 10-100’s million short reads
Millions -billions
DNA sequencing Short read: 100 nucleotides
nucleotides
Allow for inexact matches due to: Human reference genome is
• Sequencing errors 3,300,000,000 nucleotides, while a
short read is 100 nucleotides.
• Polymorphisms/mutations in Global sequence alignment will not
reference genome work!
Question: How to account for discrepancy
between lengths of reference and short read?
55
Fitting Alignment
For short read alignment, we want to align complete short read 𝐯 ∈
Σ ! to substring of reference genome 𝐰 ∈ Σ " . Note that 𝑚 ≪ 𝑛.
𝐯 ∈ Σ4
𝐰 ∈ Σ5
𝐯 ∈ Σ4
𝐰 ∈ Σ5
57
Fitting Alignment – Naive Approach
Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " and scoring
function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global
alignment score 𝑠 ∗ among all global alignments of 𝐯 and all substrings of 𝐰
𝐯 ∈ Σ4
𝐰 ∈ Σ5
58
Fitting Alignment – Dynamic Programming
Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " and scoring
function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global
alignment score 𝑠 ∗ among all global alignments of 𝐯 and all substrings of 𝐰
8
𝐯\𝐰 0 T A C G G C >
> 0, if i = 0,
>
<s[i 1, j] + (v , ), if i > 0,
0 s[i, j] = max
i
>
> s[i, j 1] + ( , wj ), if i > 0 and j > 0,
>
:
A s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
s⇤ = max{s[m, 0], . . . , s[m, n]}
G
59
Fitting Alignment – Dynamic Programming
Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " and scoring
function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global
alignment score 𝑠 ∗ among all global alignments of 𝐯 and all substrings of 𝐰
8
0, Start anywhere on first row if
𝐯\𝐰 0 T A C G G C >
> i = 0,
>
<s[i 1, j] + (v , ), if i > 0,
0 s[i, j] = max
i
>
> s[i, j 1] + ( , wj ), if i > 0 and j > 0,
>
:
A s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
s⇤ = max{s[m, 0], . . . , s[m, n]} End anywhere on last row
G
60
Fitting Alignment – Dynamic Programming
Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " and scoring
function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global
alignment score 𝑠 ∗ among all global alignments of 𝐯 and all substrings of 𝐰
8
0, Start anywhere on first row if
𝐯\𝐰 0 T A C G G C >
> i = 0,
>
<s[i 1, j] + (v , ), if i > 0,
0 s[i, j] = max
i
>
> s[i, j 1] + ( , wj ), if i > 0 and j > 0,
>
:
A s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
s⇤ = max{s[m, 0], . . . , s[m, n]} End anywhere on last row
G
Question: Let match score be 1,
G mismatch/indel score be -1. What is 𝑠 ∗ ?
𝐯 A - G G Question: Same scores. What is optimal
𝐰 A C G G global alignment and score?
61
Fitting Alignment – Dynamic Programming
• Online:
https://valiec.github.io/AlignmentVisualizer/index.html
8
𝐯\𝐰 0 T A C G G C >
> 0, if i = 0,
>
<s[i 1, j] + (v , ), if i > 0,
0 s[i, j] = max
i
>
> s[i, j 1] + ( , wj ), if i > 0 and j > 0,
>
:
A s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
s⇤ = max{s[m, 0], . . . , s[m, n]}
G
Question: Let match score be 1,
G mismatch/indel score be -1. What is 𝑠 ∗ ?
𝐯 A - G G Question: Same scores. What is optimal
𝐰 A C G G global alignment and score?
62