Introduction to Bioinformatics
Lecture 9
A Markov Chain Model
• Nucleotide frequencies in the human genome (%):
     A      C      G      T
   29.5   20.4   20.5   29.6
A Markov Chain Model
• Traditionally, the end of a sequence is not modelled
• we can also have an explicit end state, which allows the model to represent
– a distribution over sequences of different lengths
– preferences for ending sequences with certain
symbols
Markov Chain Model: Definition
• a Markov chain model is defined by
– a set of states
• some states emit symbols
• other states are silent
– (e.g. the begin and end states)
– a set of transitions with associated
probabilities
• the transitions emanating from a given state
define a distribution over the possible next
states
Markov Chain Model: Property
• given some sequence x of length L, we can ask how
probable the sequence is given our model
• for any probabilistic model of sequences, we can write
this probability as
$\Pr(x) = \Pr(x_L, x_{L-1}, \ldots, x_1)$
$\;\;\;\;\;\;\;\;\; = \Pr(x_L \mid x_{L-1}, \ldots, x_1)\,\Pr(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots \Pr(x_1)$
• key property of a (1st order) Markov chain: the
probability of each xi depends only on the value of xi-1
$\Pr(x) = \Pr(x_L \mid x_{L-1})\,\Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1)\,\Pr(x_1) = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$
The Probability of a Sequence for a Given
Markov Chain Model
$\Pr(\mathrm{cggt}) = \Pr(c)\,\Pr(g \mid c)\,\Pr(g \mid g)\,\Pr(t \mid g)\,\Pr(\mathrm{end} \mid t)$
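A minimal sketch, in Python, of how such a product is evaluated in practice; the transition values below are illustrative placeholders rather than the lecture's estimated parameters, and the optional end-state transition is left out.

```python
# Illustrative first-order transition probabilities a[prev][next]; these numbers
# are placeholders, NOT the parameters estimated in the lecture. 'B' is the begin state.
a = {
    'B': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
    'a': {'a': 0.30, 'c': 0.20, 'g': 0.28, 't': 0.22},
    'c': {'a': 0.32, 'c': 0.30, 'g': 0.08, 't': 0.30},
    'g': {'a': 0.25, 'c': 0.25, 'g': 0.30, 't': 0.20},
    't': {'a': 0.18, 'c': 0.24, 'g': 0.29, 't': 0.29},
}

def markov_chain_prob(x, a):
    """Pr(x) = a_{B x1} * prod_{i=2}^{L} a_{x_{i-1} x_i}  (end state omitted)."""
    p = a['B'][x[0]]                      # begin state -> first symbol
    for prev, cur in zip(x, x[1:]):       # a_{x_{i-1} x_i} for i = 2..L
        p *= a[prev][cur]
    return p

print(markov_chain_prob("cggt", a))
```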
Markov Chain Model: Notation
• the transition parameters can be denoted by $a_{x_{i-1} x_i}$, where
  $a_{x_{i-1} x_i} = \Pr(x_i \mid x_{i-1})$
• similarly, we can denote the probability of a sequence x as
  $a_{B x_1} \prod_{i=2}^{L} a_{x_{i-1} x_i} = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$
  where $a_{B x_1}$ represents the transition from the begin state
CpG Islands
• Written "CpG" to distinguish the dinucleotide from a C≡G base pair
• CpG dinucleotides are rarer than would be expected
from the independent probabilities of C and G.
– Reason: When CpG occurs, C is typically chemically
modified by methylation and there is a relatively high
chance of methyl-C mutating into T
• A CpG island is a region where CpG dinucleotides
are much more abundant than elsewhere.
• High CpG frequency may be biologically significant;
e.g., may signal promoter region (“start” of a gene).
Markov Chain for Discrimination
• suppose we want to distinguish CpG islands from
other sequence regions
• given sequences from CpG islands, and sequences
from other regions, we can construct
– a model to represent CpG islands (model +)
– a null model to represent other regions (model -)
• can then score a test sequence by:
$\mathrm{score}(x) = \log \frac{\Pr(x \mid \text{model}^+)}{\Pr(x \mid \text{model}^-)}$
Markov Chain for Discrimination
• parameters estimated for + and - models
– human sequences containing 48 CpG islands
– 60,000 nucleotides
• transition probabilities calculated for both models
Markov Chain for Discrimination
• Calculated the log-odds ratio
$\mathrm{score}(x) = \log \frac{\Pr(x \mid \text{model}^+)}{\Pr(x \mid \text{model}^-)} = \sum_{i=1}^{L} \log \frac{a^+_{x_{i-1} x_i}}{a^-_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$
• the $\beta_{x_{i-1} x_i}$ are the log-likelihood ratios of the corresponding transition probabilities
β        A        C        G        T
A     -0.740    0.419    0.580   -0.803
C     -0.913    0.302    1.812   -0.685
G     -0.624    0.461    0.331   -0.730
T     -1.169    0.573    0.393   -0.679
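A minimal sketch of how a test sequence is scored with the β table above; the table values come from the slide, while the function name and test sequences are only illustrative.

```python
# log-likelihood ratios beta[x_{i-1}][x_i], copied from the table above
beta = {
    'A': {'A': -0.740, 'C': 0.419, 'G': 0.580, 'T': -0.803},
    'C': {'A': -0.913, 'C': 0.302, 'G': 1.812, 'T': -0.685},
    'G': {'A': -0.624, 'C': 0.461, 'G': 0.331, 'T': -0.730},
    'T': {'A': -1.169, 'C': 0.573, 'G': 0.393, 'T': -0.679},
}

def cpg_score(x):
    """score(x) = sum_i beta[x_{i-1}][x_i]; positive favours the CpG-island (+) model."""
    return sum(beta[prev][cur] for prev, cur in zip(x, x[1:]))

print(cpg_score("CGCGCG"))   # strongly positive -> CpG-island-like
print(cpg_score("ATATAT"))   # negative -> background-like
```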
Markov Chain for Discrimination
• Solid bars represent non-CpG regions
• Dotted bars represent CpG islands
• Errors could be due to inadequate modelling or mislabelling
A simple Hidden Markov Model (HMM)
• given, say, a T in our input sequence, which state emitted it?
Why Hidden ?
• we’ll distinguish between the observed parts
of a problem and the hidden parts
• in the Markov chain models it is clear which
state accounts for each part of the observed
sequence
• in the model above, there are multiple states
that could account for each part of the
observed sequence
– this is the hidden part of the problem
– the essential difference between Markov
chain and Hidden Markov model
Hidden Markov Models
• Components:
– Observed variables
• Emitted symbols
– Hidden variables
– Relationships between them
• Represented by a graph with transition
probabilities
• Goal: Find the most likely explanation for the
observed variables
Notations in HMM
• States are decoupled from symbols
• x is the sequence of symbols emitted by model
– xi is the symbol emitted at time i
• A path, $\pi$, is a sequence of states
  – The i-th state in $\pi$ is $\pi_i$
• $a_{kr}$ is the probability of making a transition from state k to state r:
  $a_{kr} = \Pr(\pi_i = r \mid \pi_{i-1} = k)$
• $e_k(b)$ is the probability that symbol b is emitted when in state k:
  $e_k(b) = \Pr(x_i = b \mid \pi_i = k)$
The occasionally dishonest casino
• A casino uses a fair die most of the time, but
occasionally switches to a loaded one
– Fair die: Prob(1) = Prob(2) = ... = Prob(6) = 1/6
– Loaded die: Prob(1) = Prob(2) = ... = Prob(5) = 1/10, Prob(6) = 1/2
– These are the emission probabilities
• Transition probabilities
– Prob(Fair → Loaded) = 0.01
– Prob(Loaded → Fair) = 0.2
– Transitions between states obey a Markov process
An HMM for occasionally dishonest casino
[State diagram]
  Fair:    e(1) = e(2) = ... = e(6) = 1/6
  Loaded:  e(1) = e(2) = ... = e(5) = 1/10,  e(6) = 1/2
  Transitions $a_{kl}$:  Fair→Fair = 0.99,  Fair→Loaded = 0.01,  Loaded→Fair = 0.2,  Loaded→Loaded = 0.80
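As a sketch, the parameters in the diagram can be written down directly in Python; the dictionary layout and the 0.5/0.5 begin-state split (taken from the worked example later in the lecture) are implementation choices, not part of the model itself.

```python
states = ('F', 'L')                          # Fair, Loaded

# transition probabilities a_kl
trans = {
    'F': {'F': 0.99, 'L': 0.01},
    'L': {'F': 0.20, 'L': 0.80},
}

# emission probabilities e_k(b) for die faces 1..6
emit = {
    'F': {b: 1 / 6 for b in range(1, 7)},
    'L': {**{b: 1 / 10 for b in range(1, 6)}, 6: 1 / 2},
}

# begin-state transitions a_0k (assumed 0.5/0.5, matching the worked example below)
start = {'F': 0.5, 'L': 0.5}
```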
The occasionally dishonest casino
• Known:
– The structure of the model
– The transition probabilities
• Hidden: What the casino did
– FFFFFLLLLLLLFFFF...
• Observable: The series of die
tosses
– 3415256664666153...
• What we must infer:
– When was a fair die used?
– When was a loaded one used?
• The answer is a sequence
FFFFFFFLLLLLLFFF...
Three Important Questions
• How likely is a given sequence?
– the Forward algorithm
• What is the most probable “path” for
generating a given sequence?
– the Viterbi algorithm
• How can we learn the HMM parameters
given a set of sequences?
– the Baum-Welch (Forward-Backward)
algorithm
How Likely is a Given Sequence?
The probability that the path $\pi_1, \pi_2, \ldots, \pi_L$ is taken and the sequence $x_1, x_2, \ldots, x_L$ is generated:
  $\Pr(x_1,\ldots,x_L \mid \pi_1,\ldots,\pi_L) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$
(assuming begin/end are the only silent
states on path)
The occasionally dishonest casino
$x = x_1, x_2, x_3 = 6, 2, 6$

$\pi^{(1)} = FFF$:
  $\Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \cdot \tfrac{1}{6} \cdot 0.99 \cdot \tfrac{1}{6} \cdot 0.99 \cdot \tfrac{1}{6} \approx 0.00227$

$\pi^{(2)} = LLL$:
  $\Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \cdot 0.5 \cdot 0.8 \cdot 0.1 \cdot 0.8 \cdot 0.5 = 0.008$

$\pi^{(3)} = LFL$:
  $\Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \cdot 0.5 \cdot 0.2 \cdot \tfrac{1}{6} \cdot 0.01 \cdot 0.5 \approx 0.0000417$
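A minimal sketch that reproduces the three joint probabilities above, reusing the `start`, `trans`, and `emit` dictionaries from the earlier casino-HMM sketch (no end state is modelled):

```python
def joint_prob(x, path, start, trans, emit):
    """Pr(x, pi) = a_{0 pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i pi_{i+1}}."""
    p = start[path[0]]                              # a_{0 pi_1}
    for i, (sym, state) in enumerate(zip(x, path)):
        p *= emit[state][sym]                       # e_{pi_i}(x_i)
        if i + 1 < len(path):
            p *= trans[state][path[i + 1]]          # a_{pi_i pi_{i+1}}
    return p

x = (6, 2, 6)
for path in ("FFF", "LLL", "LFL"):
    print(path, joint_prob(x, path, start, trans, emit))
# FFF ≈ 0.00227, LLL = 0.008, LFL ≈ 0.0000417
```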
Making the inference
• Model assigns a probability to each explanation of the
observation:
P(326|FFL)
• Maximum Likelihood: Determine which
explanation is most likely
– Find the path most likely to have produced
the observed sequence
• Total probability: Determine probability
that observed sequence was produced by the
HMM
– Consider all paths that could have produced
the observed sequence
How Likely is a Given Sequence?
• for a single path $\pi$, the probability that the sequence x is generated is
  $\Pr(x_1,\ldots,x_L \mid \pi_1,\ldots,\pi_L) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$
• the probability over all such paths is
  $\Pr(x_1,\ldots,x_L) = \sum_{\pi} \Pr(x_1,\ldots,x_L, \pi)$
• but the number of paths can be exponential in the length of the sequence...
• the Forward algorithm enables us to compute this efficiently
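A minimal sketch of the Forward algorithm for the casino HMM, assuming the `states`, `start`, `trans`, and `emit` dictionaries from the earlier sketch; it replaces the exponential sum over paths with an O(L·K²) recursion (no end state modelled).

```python
def forward_prob(x, states, start, trans, emit):
    """Pr(x) = sum over all paths pi of Pr(x, pi), via the Forward recursion."""
    # f[k] = Pr(x_1..x_i, pi_i = k); initialise with the first symbol
    f = {k: start[k] * emit[k][x[0]] for k in states}
    for sym in x[1:]:
        f = {k: emit[k][sym] * sum(f[r] * trans[r][k] for r in states)
             for k in states}
    return sum(f.values())          # marginalise over the final state

print(forward_prob((6, 2, 6), states, start, trans, emit))
```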
The most probable path
The most likely path $\pi^*$ satisfies
  $\pi^* = \operatorname{argmax}_{\pi} \Pr(x, \pi)$
To find $\pi^*$, consider all possible ways the last symbol of x could have been emitted.
Let $v_k(i)$ be the probability of the most likely path $\pi_1, \ldots, \pi_i$ emitting $x_1, \ldots, x_i$ such that $\pi_i = k$.
Then
  $v_k(i) = e_k(x_i) \max_{r} \{ v_r(i-1)\, a_{rk} \}$
The Viterbi Algorithm
• Initialization (i = 0):
  $v_0(0) = 1,\quad v_k(0) = 0 \text{ for } k > 0$
• Recursion (i = 1, ..., L): for each state k
  $v_k(i) = e_k(x_i) \max_{r} \{ v_r(i-1)\, a_{rk} \}$
• Termination:
  $\Pr(x, \pi^*) = \max_{k} \{ v_k(L)\, a_{k0} \}$
To find $\pi^*$, use trace-back, as in dynamic programming
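A minimal sketch of the Viterbi recursion above, again assuming the casino-HMM dictionaries from the earlier sketch; the end-state term $a_{k0}$ is omitted since that model has no end state.

```python
def viterbi(x, states, start, trans, emit):
    """Return (Pr(x, pi*), pi*) for the most probable state path pi*."""
    v = {k: start[k] * emit[k][x[0]] for k in states}    # v_k(1)
    back = []                                            # trace-back pointers
    for sym in x[1:]:
        ptr, nxt = {}, {}
        for k in states:
            r_best = max(states, key=lambda r: v[r] * trans[r][k])
            ptr[k] = r_best
            nxt[k] = emit[k][sym] * v[r_best] * trans[r_best][k]
        back.append(ptr)
        v = nxt
    # termination (no a_{k0} term) and trace-back
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return v[last], ''.join(reversed(path))
```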
Viterbi: Example
        i = 0    x_1 = 6             x_2 = 2                               x_3 = 6
B       1        0                   0                                     0
F       0        (1/6)(1/2) = 1/12   (1/6)·max{(1/12)·0.99, (1/4)·0.2}     (1/6)·max{0.01375·0.99, 0.02·0.2}
                                     = 0.01375                             = 0.00226875
L       0        (1/2)(1/2) = 1/4    (1/10)·max{(1/12)·0.01, (1/4)·0.8}    (1/2)·max{0.01375·0.01, 0.02·0.8}
                                     = 0.02                                = 0.008
$v_k(i) = e_k(x_i) \max_{r} \{ v_r(i-1)\, a_{rk} \}$
[State diagram as before: Fair (emissions 1/6 each; 0.99 self-transition, 0.01 to Loaded) and Loaded (emissions 1/10 for 1–5, 1/2 for 6; 0.80 self-transition, 0.2 to Fair)]
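For reference, the trellis above can be reproduced with the `viterbi` sketch from the previous slide (same assumed `start`/`trans`/`emit` dictionaries):

```python
# v_F(3) ≈ 0.00226875 and v_L(3) = 0.008, so the most probable path for 6, 2, 6 is LLL
print(viterbi((6, 2, 6), states, start, trans, emit))   # -> (0.008, 'LLL')
```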
Viterbi gets it right more often than not
The numbers in the first rows show 300 rolls of the die. The second rows show which die was actually used for each roll. The third rows show the Viterbi algorithm's prediction.