[go: up one dir, main page]

0% found this document useful (0 votes)
22 views62 pages

Lecture 3

Uploaded by

chunfeng277
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views62 pages

Lecture 3

Uploaded by

chunfeng277
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 62

CS 466

Introduction to Bioinformatics
Lecture 3
Mohammed El-Kebir
September 6, 2024
Course Announcements
Instructor:
• Mohammed El-Kebir (melkebir) HW1: To be released on 09/13
Office hours: Wednesdays, 3:30-4:30pm in and due 09/21 at 11:59pm
Siebel 3216 (not today)

Piazza / Gradescope: Midterm:


• https://piazza.com/illinois/fall2024/cs466 week of 10/09 – 10/11,
• Gradescope entry code: ‘VDEE52’ details to follow

TA:
• Nicole (nsdong2): Mondays 4-5pm in CS Tutoring Center (Siebel basement)
• Claire (czchou2): Tuesdays 2-3pm in CS Tutoring Center (Siebel basement)
• Mrinmoy (mroddur2): Thursdays 2-3pm in Siebel 2219
2
Outline
1. Edit distance
2. Review elementary graph theory
3. Manhattan Tourist problem
4. Longest/shortest paths in DAGs
5. Global alignment
6. Fitting alignment

Reading:
• Jones and Pevzner. Chapters 2.7-2.9 and 6.1-6.4, 6.6
• Lecture notes
3
Alignment
An alignment between two strings v (of m characters) and w (of n characters)
is a two row matrix where the first row contains the characters of v in order,
the second row contains the characters of w in order, and spaces may be
interspersed throughout each.

Input Output

v: KITTEN (m = 6) v: K - I T T E N -
w: SITTING (n = 7) w: S I - T T I N G

Question: Is this a good alignment?


Answer: Count the number of insertion, deletions, substitutions.
4
Computing Edit Distance
Edit Distance problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " , compute the
minimum number 𝑑(𝐯, 𝐰) of elementary operations to transform 𝐯 into 𝐰.

v: ATGTTAT... deletion insertion mismatch match

w: AGCGTAC...
𝑖−1 𝑖
prefix of 𝐯 of length 𝑖 𝐯# : A T - G T T T
prefix of 𝐰 of length 𝑗 𝐰$ : A G C G T - C
𝑗−1 𝑗
Optimal substructure:
Edit distance obtained from edit distance of prefix of string.
5
Computing Edit Distance – Dynamic Programming 8
>
> 0, if i = 0 and j = 0,
>
>
… 𝐯" … - … 𝐯" … 𝐯" >
>
deletion insertion mismatch match <d[i 1, j] + 1, if i > 0,
… - … 𝐰! … 𝐰! … 𝐰! d[i, j] = min d[i, j 1] + 1, if j > 0,
>
>
>
> d[i 1, j 1] + 1, if i > 0, j > 0 and vi 6= wj ,
>
>
:d[i 1, j 1], if i > 0, j > 0 and vi = wj .
W A T C G
V 0 1 2 3 4

0 0 1 2 3 4
A T - G T
A 1 1 0 1 2 3 A T C G -
T 2 2 1 0 1 2

G 3 3 2 1 1 1
A T G T
A T C G
T 4 4 3 2 2 2

6
Computing Edit Distance – Your turn! 8
> 0, if i = 0 and j = 0,
… 𝐯" … -
𝑖 − 1, 𝑗 − 1 𝑖 − 1, 𝑗 >
>
>
>
… -
deletion … 𝐰!
insertion >
<d[i 1, j] + 1, if i > 0,
0
or d[i, j] = min d[i, j 1] + 1, if j > 0,
1 1 >
>
>
> d[i 1, j 1] + 1, if i > 0, j > 0 and vi 6= wj ,
… 𝐯" … 𝐯" >
>
… 𝐰!
mismatch … 𝐰!
match 1 :d[i 1, j 1], if i > 0, j > 0 and vi = wj .
𝑖, 𝑗 − 1 𝑖, 𝑗

W C A R W A T E W A R E
V 0 1 2 3 V 0 1 2 3 V 0 1 2 3

0 0 0

C 1 C 1 C 1

A 2 A 2 A 2

T 3 T 3 T 3

𝑑 𝐜𝐚𝐭, 𝐜𝐚𝐫 = 𝑑 𝐜𝐚𝐭, 𝐚𝐭𝐞 = 𝑑 𝐜𝐚𝐭, 𝐚𝐫𝐞 = 7


Computing Edit Distance – Your turn! 8
> 0, if i = 0 and j = 0,
… 𝐯" … -
𝑖 − 1, 𝑗 − 1 𝑖 − 1, 𝑗 >
>
>
>
… -
deletion … 𝐰!
insertion >
<d[i 1, j] + 1, if i > 0,
0
or d[i, j] = min d[i, j 1] + 1, if j > 0,
1 1 >
>
>
> d[i 1, j 1] + 1, if i > 0, j > 0 and vi 6= wj ,
… 𝐯" … 𝐯" >
>
… 𝐰!
mismatch … 𝐰!
match 1 :d[i 1, j 1], if i > 0, j > 0 and vi = wj .
𝑖, 𝑗 − 1 𝑖, 𝑗

W C A R W A T E W A R E
V 0 1 2 3 V 0 1 2 3 V 0 1 2 3

0 0 1 2 3 0 0 1 2 3 0 0 1 2 3

C 1 1 0 1 2 C 1 1 1 2 3 C 1 1 1 2 3

A 2 2 1 0 1 A 2 2 1 2 3 A 2 2 1 2 3

T 3 3 2 1 1 T 3 3 2 1 2 T 3 3 2 2 3

𝑑 𝐜𝐚𝐭, 𝐜𝐚𝐫 = 1 𝑑 𝐜𝐚𝐭, 𝐚𝐭𝐞 = 2 𝑑 𝐜𝐚𝐭, 𝐚𝐫𝐞 = 3 8


Change Problem and Edit distance
W A T C G
V 0 1 2 3 4
Make M cents using minimum
number of 1, 3 and 5 cent coins. 0 0 1 2 3 4

A 1 1 0 1 2 3
Value 1 2 3 4 5 6 7
Min # coins 1 2 1 2 1 2 3 T 2 2 1 0 1 2

G 3 3 2 1 1 1

T 4 4 3 2 2 2

• Both have optimal substructure and can be solved using dynamic programming
• These are examples of a more general problem!
9
Review of Graph Theory
• Graph 𝐺 = (𝑉, 𝐸)
• Vertices 𝑉 = {𝑣% , … , 𝑣" }
• Edges 𝐸 = {(𝑣# , 𝑣$ ), … }

Chicago

Bloomington

Champaign-Urbana

Indianapolis

St. Louis

10
Review of Graph Theory
• Directed Graph 𝐺 = (𝑉, 𝐸)
• Vertices 𝑉 = {𝑣% , … , 𝑣" }
• Directed edges 𝐸 = {(𝑣# , 𝑣$ ), … }

Chicago

Bloomington

Champaign-Urbana

Indianapolis

St. Louis

11
Review of Graph Theory
• Directed Graph 𝐺 = (𝑉, 𝐸)
• Vertices 𝑉 = {𝑣% , … , 𝑣" }
• Directed edges 𝐸 = {(𝑣# , 𝑣$ ), … }
• Path is a sequence of vertices and edges
Chicago
that connect them

Bloomington

Champaign-Urbana

Indianapolis

St. Louis

12
Review of Graph Theory
• Directed Graph 𝐺 = (𝑉, 𝐸)
• Vertices 𝑉 = {𝑣% , … , 𝑣" }
• Directed edges 𝐸 = {(𝑣# , 𝑣$ ), … }
• Path is a sequence of vertices and edges
Chicago
that connect them
130
• Edges can be weighted
140
Bloomington
50
Champaign-Urbana

180
170 150 Indianapolis

St. Louis

13
Manhattan Tourist Problem

Begin
A tourist in Manhattan * *
wants to visit the
maximum number of *
attractions (*) by
* *
traveling on a path (only
* *
eastward and southward) *
from start to end
*
* * *
End

14
Manhattan Tourist Problem
A tourist in Manhattan
wants to visit the Begin
maximum number of 2 1
attractions (*) by
traveling on a path (only 1 1 1
eastward and southward)
from start to end 1 2
1
May be more than 1 5
attraction on a street. 1 1 3
Add weights! End

15
Manhattan Tourist Problem
0 1 2 3 4
begin j coordinate
3 2 4 0
0 0 3 5 9

Manhattan Tourist 1 0 2 4 3

Problem (MTP): 3 2 4 2
1 13
Given a weighted,
1
directed grid graph G 4 6 5 2

i coordinate
with two vertices “begin” 2
0 7 3
15
4
19
and “end”, find the
maximum weight path in 4 4 5 2 1

G from “begin” to “end”. 3


3 3 0 2
20

5 6 8 5 3

1 3 2 2 end
4 23
16
Manhattan Tourist Problem – Exhaustive Algorithm
0 1 2 3 4
Check all paths begin 3 2 4 0
j coordinate
0 0 3 5 9

Question: 1 0 2 4 3

How many paths? 1


3 2 4
13
2

1
4 6 5 2

i coordinate
0 7 3 4
2 15 19

4 4 5 2 1
3 3 0 2
3
20

5 6 8 5 3

1 3 2 2 end
4 23
17
Manhattan Tourist Problem – Greedy Algorithm
1 2 5
begin

5 3 10 5

2 1 5

3 5 3 1
better path!
2 3 4

promising start, 0 0 5 2
but leads to bad 22
0 0 0
choices! end
18

18
Manhattan Tourist Problem – Optimal Substructure
1 2 5
begin

5 3 10 5

2 1 5

best score to this


3 5 3 1 point
2 3 4
20

0 0 5 2

0 0 0
end best score to end
best score to this point 18 22

19
Manhattan Tourist Problem – Optimal Substructure
𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)

1 2 5
begin

5 3 10 5

2 1 5

best score to
Question: What is the recurrence? 3 5 3 1 this point
2 3 4
20

0 0 5 2

0 0 0
end best score to
best score to this point 18 22 end

• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between 𝑖 − 1, 𝑗 and 𝑖, 𝑗


• 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between 𝑖, 𝑗 − 1 and 𝑖, 𝑗
20
Manhattan Tourist Problem – Optimal Substructure
𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)

1 2 5
begin

5 3 10 5

2 1 5
8 best score to
>
<0, if i = 0 and j = 0, 3 5 3 1 this point
s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0, 2 3 4
>
: 20
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 0 5 2

0 0 0
end best score to
best score to this point 18 22 end

• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between 𝑖 − 1, 𝑗 and 𝑖, 𝑗


• 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between 𝑖, 𝑗 − 1 and 𝑖, 𝑗
21
MTP – Solving Recurrence using Dynamic Programming
𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0
source 8
>
<0, if i = 0 and j = 0,
s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0

• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between


𝑖 − 1, 𝑗 and 𝑖, 𝑗
• 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗

22
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1
source 8
>
<0, if i = 0 and j = 0,
1 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1
i 5
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗

23
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1 2
source 8
>
<0, if i = 0 and j = 0,
1 2 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1 3
i 5 3
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
2 𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 7 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
3

2
8

24
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1 2 3
source 8
>
<0, if i = 0 and j = 0,
1 2 5 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1 3 8
i 5 3 10
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
2 1 𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 7 13 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
3 5

2
2
8 12
0

3
8

25
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1 2 3
source 8
>
<0, if i = 0 and j = 0,
1 2 5 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1 3 8
i 5 3 10 5
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
2 1 5 𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 7 13 18 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
3 5 3

2 3
2
8 12 16
0 0

0
3
8 12

26
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1 2 3
source 8
>
<0, if i = 0 and j = 0,
1 2 5 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1 3 8
i 5 3 10 5
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
2 1 5 𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 7 13 18 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
3 5 3 1

2 3 4
2
8 12 16 20
0 0 5

0 0
3
8 12 21

27
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1 2 3
source 8
>
<0, if i = 0 and j = 0,
1 2 5 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1 3 8
i 5 3 10 5
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
2 1 5 𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 7 13 18 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
3 5 3 1

2 3 4
2
8 12 16 20
0 0 5 2

0 0 0
3
8 12 21 22

28
MTP – Solving Recurrence using Dynamic Programming
j 𝑠[𝑖, 𝑗] is the best score for path to coordinate (𝑖, 𝑗)
0 1 2 3
source 8
>
<0, if i = 0 and j = 0,
1 2 5 s[i, j] = max s[i 1, j] + w[(i 1, j), (i, j)] if i > 0,
0 >
:
s[i, j 1] + w[(i, j 1), (i, j)] if j > 0.
0 1 3 8
i 5 3 10 5
• 𝑤[ 𝑖 − 1, 𝑗 , (𝑖, 𝑗)] weight of street between
2 1 5 𝑖 − 1, 𝑗 and 𝑖, 𝑗
1
5 7 13 18 • 𝑤[ 𝑖, 𝑗 − 1 , (𝑖, 𝑗)] weight of street between
𝑖, 𝑗 − 1 and 𝑖, 𝑗
3 5 3 1

2 3 4 Let 𝑚 be the number of rows and 𝑛 be


2
8 12 16 20 the number of columns.
0 0 5 2
Running time: 𝑂(𝑚𝑛)

3
0 0 0 Question: Implementation?
8 12 21 22
S3,3 = 22
29
Manhattan Is Not a Perfect Grid
A2 A3

What about diagonals?


A1 B

8
>
<s[A1 ] + w[A1 , B],
s[B] = max s[A2 ] + w[A2 , B],
>
:
s[A3 ] + w[A3 , B].

30
Manhattan Is Not a Perfect Grid, It’s a Directed Graph

pred 𝑖, 𝑗
𝐺 = (𝑉, 𝐸) is a directed
acyclic graph (DAG) Each edge is evaluated
with nonnegative edges once: 𝑂( 𝐸 ) time
weights 𝑤 ∶ 𝐸 → ℝ.
𝑖, 𝑗

s[0, 0] = 0
0 0 0 0
s[i, j] = max {s[i , j ] + w[(i , j ), (i, j)]}
(i0 ,j 0 ) 2 pred(i,j)

31
Dynamic Programming as a Graph Problem
Begin
Manhattan Tourist Problem: * *
Every path in directed graph is a possible * * *
tourist path. Find maximum weight path. *
* *
Running time: 𝑂 𝑚𝑛 = 𝑂( 𝐸 )
*
* * *
End

Change Problem: Make M cents using


minimum number of coins 𝐜 = 1, 3, 5 .
Every path in directed graph is a possible
change. Find shortest path.
Running time: 𝑂 𝑀𝑛 = 𝑂( 𝐸 )
32
What About the Edit Distance Problem?
W A T C G Edit Distance problem: Given
V 0 1 2 3 4
strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " ,
compute the minimum number
0 𝑑(𝐯, 𝐰) of elementary operations
A 1 to transform 𝐯 into 𝐰.
T 2

G 3

T 4

𝐯" deletion - insertion 𝐯" mismatch 𝐯" match


- 𝐰! 𝐰! 𝐰!

33
What About the Edit Distance Problem?
W A T C G Edit Distance problem: Given
V 0 1 2 3 4
strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " ,
compute the minimum number
0 O O O O O 𝑑(𝐯, 𝐰) of elementary operations
A 1 O O O O O to transform 𝐯 into 𝐰.
T 2 O O O O O
Edit graph is a weighed, directed grid
G 3 O O O O O graph 𝐺 = (𝑉, 𝐸) with source vertex
T 4 O O O O O
(0, 0) and target vertex (𝑚, 𝑛). Each
edge (𝑖, 𝑗) has weight [𝑖, 𝑗] corresponding
𝐯" deletion - insertion 𝐯" mismatch 𝐯" match to edit cost: deletion (1), insertion (1),
- 𝐰! 𝐰! 𝐰! mismatch (1) and match (0).

34
What About the Edit Distance Problem?
W A T C G
V 0 1 2 3 4

0 O O O O O

A 1 O O O O O
Alignment is a path from (0, 0) to (𝑚, 𝑛)
T 2 O O O O O
Edit graph is a weighed, directed grid
G 3 O O O O O graph 𝐺 = (𝑉, 𝐸) with source vertex
T 4 O O O O O
(0, 0) and target vertex (𝑚, 𝑛). Each
edge (𝑖, 𝑗) has weight [𝑖, 𝑗] corresponding
𝐯" deletion - insertion 𝐯" mismatch 𝐯" match to edit cost: deletion (1), insertion (1),
- 𝐰! 𝐰! 𝐰! mismatch (1) and match (0).

35
What About the Edit Distance Problem?
W A T C G Edit Distance problem: Given edit
V 0 1 2 3 4
graph 𝐺 = (𝑉, 𝐸), with edge
weights c ∶ 𝐸 → 0,1 . Find
0 O O O O O shortest path from (0, 0) to (𝑚, 𝑛).
A 1 O O O O O
Alignment is a path from (0, 0) to (𝑚, 𝑛)
T 2 O O O O O
Edit graph is a weighed, directed grid
G 3 O O O O O graph 𝐺 = (𝑉, 𝐸) with source vertex
T 4 O O O O O
(0, 0) and target vertex (𝑚, 𝑛). Each
edge (𝑖, 𝑗) has weight [𝑖, 𝑗] corresponding
𝐯" deletion - insertion 𝐯" mismatch 𝐯" match to edit cost: deletion (1), insertion (1),
- 𝐰! 𝐰! 𝐰! mismatch (1) and match (0).

36
Shortest Path vs Longest Path
• Change graph, edit graph and the MTP grid are directed graphs G.

• Change problem and Edit Distance problem are minimization problems.


• Find shortest path in G from source to sink.

• Manhattan Tourist problem is a maximization problem.


• Find longest path in G from source to sink.

37
Shortest Path vs Longest Path
• Shortest path in directed graphs can be found efficiently (Dijkstra, Bellman-
Ford, Floyd-Warshall algorithms)
• Longest path in direct graphs cannot be found efficiently (NP-hard).

• Change graph, edit graph and MTP grid graph are directed acylic graphs
(DAGs).
• No directed cycles.

• Longest path problem in a DAG can


solved efficiently by dynamic programming directed cycle

Question: What’s the relation between absence of directed cycles and optimal substructure?
38
Weighted Edit Distance
𝑑[𝑖, 𝑗] is the edit distance of 𝐯# and 𝐰$ ,
where 𝐯# is prefix of 𝐯 of length 𝑖 and 𝐰$ is prefix of 𝐰 of length 𝑗
8 deletion
… 𝐯$

>
> d[i 1, j] + 1, … -

>
<d[i, j 1] + 1, insertion
… -
… 𝐰#

d[i, j] = min mismatch … 𝐯$


>
> d[i 1, j 1] + 1, if vi 6= wj , … 𝐰#
>
: … 𝐯$
d[i 1, j 1], if vi = wj . … 𝐰#

Replace +1 with different penalties for different types of edits.


39
Weighted Edit Distance – Practice Problem
• Compute weighted edit distance between 𝐯 = AGT and 𝐰 = ATCT.

A T C G
w 0 1 2 3 4
V 8
> 0, if i = 0 and j = 0,
>
>
>
>
0 >
<d[i 1, j] + 1, if i > 0,
d[i, j] = min d[i, j 1] + 1, if j > 0,
A >
1 >
>
> d[i 1, j 1] + 2, if i > 0, j > 0 and vi 6= wj ,
>
>
:d[i 1, j 1], if i > 0, j > 0 and vi = wj .
G 2

T 3

40
Weighted Edit Distance – Practice Problem
• Compute weighted edit distance between 𝐯 = AGT and 𝐰 = ATCT.

A T C G
w 0 1 2 3 4
V 8
> 0, if i = 0 and j = 0,
>
>
>
>
0 0 1 2 3 4 >
<d[i 1, j] + 1, if i > 0,
d[i, j] = min d[i, j 1] + 1, if j > 0,
A >
1 1 0 1 2 3 >
>
> d[i 1, j 1] + 2, if i > 0, j > 0 and vi 6= wj ,
>
>
:d[i 1, j 1], if i > 0, j > 0 and vi = wj .
G 2 2 1 2 3 2

T 3 3 2 1 2 3

41
Edit Distance – Additional Insights
• An alignment corresponds to a series of elementary operations

Examples from http://profs.scienze.univr.it/~liptak/ACB/files/StringDistance_6up.pdf 42


Edit Distance – Additional Insights
• An alignment corresponds to a series of elementary operations

• But not every series of elementary operations corresponds to an alignment! Why?

Examples from http://profs.scienze.univr.it/~liptak/ACB/files/StringDistance_6up.pdf 43


Distance Function / Metric

A distance function (metric) on a set 𝑋 is a function 𝑑 ∶ 𝑋 × 𝑋 → ℝ


s.t. for all 𝑥, 𝑦, 𝑧 ∈ 𝑋:
i. 𝑑 𝑥, 𝑦 ≥ 0 [non-negativity]
ii. 𝑑 𝑥, 𝑦 = 0 if and only if 𝑥 = 𝑦 [identity of indiscernibles]
iii. 𝑑 𝑥, 𝑦 = 𝑑(𝑦, 𝑥) [symmetry]
iv. 𝑑 𝑥, 𝑦 ≤ 𝑑 𝑥, 𝑧 + 𝑑(𝑧, 𝑦) [triangle inequality]

Question: Is edit distance a distance function?

44
Edit Distance is a Distance Function
Edit distance 𝑑(𝐯, 𝐰) is the minimum number of elementary operations to
transform 𝐯 ∈ Σ ∗ into 𝐰 ∈ Σ ∗ .

Claim: edit distance is a distance function.

Proof: Let 𝐮, 𝐯, 𝐰 ∈ Σ ∗ .
i. 𝑑 𝐯, 𝐰 ≥ 0 [non-negativity]
Edit distance is defined by an alignment. This in turn uniquely determines
a series of elementary operations, each with cost either 0 (match) or 1
(otherwise). Thus, 𝑑 𝐯, 𝐰 ≥ 0.

45
Edit Distance is a Distance Function
Edit distance 𝑑(𝐯, 𝐰) is the minimum number of elementary operations to
transform 𝐯 ∈ Σ ∗ into 𝐰 ∈ Σ ∗ .

Claim: edit distance is a distance function.

Proof: Let 𝐮, 𝐯, 𝐰 ∈ Σ ∗ .
ii. 𝑑 𝐯, 𝐰 = 0 if and only if 𝐯 = 𝐰 [identity of indiscernibles]
(=>) By the premise, 𝑑 𝐯, 𝐰 = 0. By definition, the optimal alignment can only consist
of operations with cost 0. That is, the alignment consist of only matches. Thus, 𝐯 = 𝐰.
(<=) By the premise, 𝐯 = 𝐰. Thus, there exists an alignment where every pair of
columns is a match. This means that |𝐯| = |𝐰| and each letter 𝑣0 equals 𝑤0 (where 𝑖 ∈
[|𝐯|]). Moreover, only the match operations have cost 0, the other operations have
cost 1. Hence, this is the optimal alignment with cost 𝑑 𝐯, 𝐰 = 0.
46
Edit Distance is a Distance Function
Edit distance 𝑑(𝐯, 𝐰) is the minimum number of elementary operations to
transform 𝐯 ∈ Σ ∗ into 𝐰 ∈ Σ ∗ .

Claim: edit distance is a distance function.

Proof: Let 𝐮, 𝐯, 𝐰 ∈ Σ ∗ .
iii. 𝑑 𝐯, 𝐰 = 𝑑(𝐰, 𝐯) [symmetry]
Let 𝐀 = [𝑎0,1 ] be the optimal alignment corresponding to 𝑑 𝐯, 𝐰 , i.e. 𝐀 is an 2 × 𝑘
matrix where 𝑘 ∈ {max( 𝐯 , 𝐰 ), … , 𝐯 + 𝐰 }. Define the function 𝑓 𝐀 = 𝐁 such
that 𝐁 is obtained by interchanging the two rows of 𝐀. Since the cost of any insertion,
deletion and mismatch is 1, we have that alignment 𝐁 has cost 𝑑 𝐯, 𝐰 . The existence
of an alignment from 𝐰 to 𝐯 with cost less than 𝑑 𝐯, 𝐰 , yields a contradiction as it
implies that 𝐀 is not an optimal alignment from 𝐯 to 𝐰. Hence, 𝑑 𝐰, 𝐯 = 𝑑 𝐯, 𝐰 .
47
Edit Distance is a Distance Function
Edit distance 𝑑(𝐯, 𝐰) is the minimum number of elementary operations to
transform 𝐯 ∈ Σ ∗ into 𝐰 ∈ Σ ∗ .

Claim: edit distance is a distance function.

Proof: Let 𝐮, 𝐯, 𝐰 ∈ Σ ∗ .
iv. 𝑑 𝐯, 𝐰 ≤ 𝑑 𝐯, 𝐮 + 𝑑(𝐮, 𝐰) [triangle inequality]
Assume for a contradiction that 𝑑 𝐯, 𝐰 > 𝑑 𝐯, 𝐮 + 𝑑(𝐮, 𝐰). Let 𝑆 be the sequence
of elementary operations for transforming 𝐯 into 𝐮. Let 𝑆′ be the sequence of
elementary operations for transforming 𝐮 into 𝐰. Note that 𝑑 𝐯, 𝐮 = |𝑆| and
𝑑 𝐮, 𝐰 = |𝑆′|. Concatenate 𝑆 and 𝑆′ and remove redundant operations, yielding
sequence 𝑆′′. By definition, 𝑆 22 ≤ 𝑆 + 𝑆 2 . We can obtain an alignment of 𝐯 and
𝐰 from 𝑆′′ with cost 𝑆 22 ≤ 𝑑 𝐯, 𝐮 + 𝑑(𝐮, 𝐰). This yields a contradiction with
𝑑 𝐯, 𝐰 > 𝑑 𝐯, 𝐮 + 𝑑(𝐮, 𝐰) being the cost of the optimal alignment of 𝐯 and 𝐰.
48
Outline
1. Edit distance recap
2. Global alignment
3. Fitting alignment
4. Local alignment
5. Gapped alignment

Reading:
• Jones and Pevzner. Chapters 6.6, 6.8 and 6.9

49
Biological Sequence Alignment
• Weighted edit distance: find W A T C G
alignment with minimum V 0 1 2 3 4
distance
• Shortest path in weighted 0 O O O O O
edit graph
A 1 O O O O O
• Sequence alignment: find
alignment with maximum T 2 O O O O O
similarity G 3 O O O O O
• Longest path in weighted
edit graph T 4 O O O O O
• Score function:
𝛿∶ Σ ∪ − 3→ℝ $% deletion - insertion $% mismatch $% match
- "# "# "#

Question: What is an example of 𝛿? 𝛿(𝑣# , −) 𝛿(−, 𝑤$ ) 𝛿(𝑣# , 𝑤$ ) 50


Scoring Matrices
A C
Transitions: interchanges among purines
(two rings) or pyrimidines (one ring)
• A <--> G
• C <--> T

Transversions: interchanges between purines


(two rings) and pyrimidines (one ring)
• A <--> C, A <--> T
• G <--> C, G <--> T

Transitions more likely than transversions!

G T
51
Scoring Matrices
Transitions: interchanges among purines
(two rings) or pyrimidines (one ring)
• A <--> G
𝛿 A T C G -
• C <--> T A 1 -2 -2 -1 -1
T -2 1 -1 -2 -1
Transversions: interchanges between purines
(two rings) and pyrimidines (one ring) C -2 -1 1 -2 -1
• A <--> C, A <--> T G -1 -2 -2 1 -1
• G <--> C, G <--> T
- -1 -1 -1 -1 −∞
Transitions more likely than transversions!

52
Global Alignment – Needleman-Wunsch Algorithm
Global Alignment problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " and scoring
function 𝛿, find alignment with maximum score.

• An alignment is a source-to-sink path in the edit graph


• An alignment 𝐀 = [𝑎#,$ ] is a 2 × 𝑘 matrix s.t. (i) 𝑘 = {max 𝑚, 𝑛 , … , 𝑚 + 𝑛},
(ii) 𝑎#,$ ∈ Σ ∪ − and (iii) there is no 𝑗 ∈ [𝑘] where 𝑎%,$ = 𝑎(,$ = −
8
>
> 0, if i = 0 and j = 0,
>
<s[i 1, j] + (v , ),
i if i > 0, deletion
s[i, j] = max
>
> s[i, j 1] + ( , wj ), if j > 0, insertion
>
:
s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0. match/
mismatch
53
Demonstration
• http://rna.informatik.uni-
freiburg.de/Teaching/index.jsp?toolName=Needleman-Wunsch

• 𝐯 = ATGTTAT and 𝐰 = ATCGTAC. 𝛿 A T C G -


A 1 -2 -2 -1 -1
T -2 1 -1 -2 -1
C -2 -1 1 -2 -1
G -1 -2 -2 1 -1
- -1 -1 -1 -1 −∞

54
NGS Characterized by Short Reads
… CATTCAGTAG …
… AGCCATTAG …
… GGTAGTTAG … … GGTAAACTAG …

… TATAATTAG … … CGTACCTAG …
Genome
Next-generation 10-100’s million short reads
Millions -billions
DNA sequencing Short read: 100 nucleotides
nucleotides
Allow for inexact matches due to: Human reference genome is
• Sequencing errors 3,300,000,000 nucleotides, while a
short read is 100 nucleotides.
• Polymorphisms/mutations in Global sequence alignment will not
reference genome work!
Question: How to account for discrepancy
between lengths of reference and short read?
55
Fitting Alignment
For short read alignment, we want to align complete short read 𝐯 ∈
Σ ! to substring of reference genome 𝐰 ∈ Σ " . Note that 𝑚 ≪ 𝑛.

𝐯 ∈ Σ4
𝐰 ∈ Σ5

Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " and


scoring function 𝛿, find a alignment of 𝐯 and a substring of 𝐰 with
maximum global alignment score 𝑠 ∗ among all global alignments of
𝐯 and all substrings of 𝐰
56
Fitting Alignment – Naive Approach
Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " and scoring
function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global
alignment score 𝑠 ∗ among all global alignments of 𝐯 and all substrings of 𝐰

𝐯 ∈ Σ4
𝐰 ∈ Σ5

• Consider all contiguous non-empty substrings of 𝐰, defined by 1 ≤ 𝑖 ≤ 𝑗 ≤ 𝑛


• How many?

57
Fitting Alignment – Naive Approach
Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " and scoring
function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global
alignment score 𝑠 ∗ among all global alignments of 𝐯 and all substrings of 𝐰

𝐯 ∈ Σ4
𝐰 ∈ Σ5

• Consider all contiguous non-empty substrings of 𝐰, defined by 1 ≤ 𝑖 ≤ 𝑗 ≤ 𝑛


"
• How many? Answer: 𝑛 + (
• What are their total lengths?
• What is the running time?

58
Fitting Alignment – Dynamic Programming
Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " and scoring
function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global
alignment score 𝑠 ∗ among all global alignments of 𝐯 and all substrings of 𝐰
8
𝐯\𝐰 0 T A C G G C >
> 0, if i = 0,
>
<s[i 1, j] + (v , ), if i > 0,
0 s[i, j] = max
i
>
> s[i, j 1] + ( , wj ), if i > 0 and j > 0,
>
:
A s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
s⇤ = max{s[m, 0], . . . , s[m, n]}
G

59
Fitting Alignment – Dynamic Programming
Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " and scoring
function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global
alignment score 𝑠 ∗ among all global alignments of 𝐯 and all substrings of 𝐰
8
0, Start anywhere on first row if
𝐯\𝐰 0 T A C G G C >
> i = 0,
>
<s[i 1, j] + (v , ), if i > 0,
0 s[i, j] = max
i
>
> s[i, j 1] + ( , wj ), if i > 0 and j > 0,
>
:
A s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
s⇤ = max{s[m, 0], . . . , s[m, n]} End anywhere on last row
G

60
Fitting Alignment – Dynamic Programming
Fitting Alignment problem: Given strings 𝐯 ∈ Σ ! and 𝐰 ∈ Σ " and scoring
function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global
alignment score 𝑠 ∗ among all global alignments of 𝐯 and all substrings of 𝐰
8
0, Start anywhere on first row if
𝐯\𝐰 0 T A C G G C >
> i = 0,
>
<s[i 1, j] + (v , ), if i > 0,
0 s[i, j] = max
i
>
> s[i, j 1] + ( , wj ), if i > 0 and j > 0,
>
:
A s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
s⇤ = max{s[m, 0], . . . , s[m, n]} End anywhere on last row
G
Question: Let match score be 1,
G mismatch/indel score be -1. What is 𝑠 ∗ ?
𝐯 A - G G Question: Same scores. What is optimal
𝐰 A C G G global alignment and score?
61
Fitting Alignment – Dynamic Programming
• Online:
https://valiec.github.io/AlignmentVisualizer/index.html
8
𝐯\𝐰 0 T A C G G C >
> 0, if i = 0,
>
<s[i 1, j] + (v , ), if i > 0,
0 s[i, j] = max
i
>
> s[i, j 1] + ( , wj ), if i > 0 and j > 0,
>
:
A s[i 1, j 1] + (vi , wj ), if i > 0 and j > 0.
s⇤ = max{s[m, 0], . . . , s[m, n]}
G
Question: Let match score be 1,
G mismatch/indel score be -1. What is 𝑠 ∗ ?
𝐯 A - G G Question: Same scores. What is optimal
𝐰 A C G G global alignment and score?
62

You might also like