Forward-Backward Activation Algorithm For Hierarchical Hidden Markov Models
Abstract
Hierarchical Hidden Markov Models (HHMMs) are sophisticated stochastic models that enable us to capture a hierarchical context characterization of sequence data. However, existing HHMM parameter estimation methods require at least O(TN^{2D}) time for model inference, where D is the depth of the hierarchy, N is the number of states in each level, and T is the sequence length. In this paper, we propose a new inference method for HHMMs whose time complexity is O(TN^{D+1}). The key idea of our algorithm is the application of the forward-backward algorithm to state activation probabilities. The notion of a state activation, which offers a simple formalization of the hierarchical transition behavior of HHMMs, enables us to conduct model inference efficiently. We present experiments demonstrating that our proposed method estimates HHMM parameters more efficiently than existing methods such as the flattening method and the Gibbs sampling method.
1 Introduction
Latent structure analysis of sequence data is an important technique for many applications such as speech recognition, bioinformatics, and natural language processing. Hidden Markov Models (HMMs) play a key role in solving these problems. HMMs assume a single Markov chain of hidden states as the latent structure of sequence data. Because of this simple assumption, HMMs tend to capture only local context patterns of sequence data. Hierarchical Hidden Markov Models (HHMMs) are stochastic models that assume hierarchical Markov chains of hidden states as the latent structure of sequence data [3]. HHMMs have a hierarchical state transition mechanism that yields the capability of capturing global and local sequence patterns at various granularities. By their nature, HHMMs are applicable to problems of many kinds, including handwritten letter recognition [3], information extraction from documents [11], musical pitch structure modeling [12], video structure modeling [13], and human activity modeling [8, 6].
For conventional HMMs, we can conduct unsupervised learning efficiently using the forward-backward algorithm, a kind of dynamic programming [9]. In situations where few or no supervised data are available, the existence of an efficient unsupervised learning algorithm is a salient advantage of using HMMs. Unsupervised learning of HHMMs is an important technique, as it is for HMMs. In this paper, we discuss unsupervised learning techniques for HHMMs. We introduce a key notion, the activation probability, to formalize the hierarchical transition mechanism naturally. Using this notion, we propose a new exact inference algorithm with lower time complexity than existing methods.
The remainder of the paper is organized as follows. In section 2, we overview HHMMs. In section
3, we survey HHMM parameter estimation techniques proposed to date. In section 4, we introduce
our parameter estimation algorithm. Section 5 presents experiments to show the effectiveness of our
algorithm. We conclude our discussion in section 6.
2 Hierarchical Hidden Markov Models
Figure 1: (left) Dynamic Bayesian network of the HHMM. (top-right) Tree representation of the
HHMM state space. (bottom-right) State identification by the absolute path of the tree.
Table 1: Notation for HHMMs.
D : Depth of the hierarchy
N : Number of states in each level
Ω^d : Set of values taken by the absolute path state at level d
Z_t^d ∈ Ω^d : Absolute path state at time t and level d
F_t^d ∈ {0, 1} : Termination indicator at time t and level d
O_t ∈ {1, ..., V} : Observation at time t
A_{dij} : State transition probability from state Z_t^d = i to state Z_{t+1}^d = j at level d
A_{diEnd} : Termination probability of the Markov chain at level d from state Z_t^d = i
π_{di} : Initial state probability of state Z^d = i at level d
B_{iv} : Output probability of observation v with Z^D = i
Table 1 presents the notation used for the HHMM description. We use the notation of the absolute path state Z^d rather than Q^d throughout the paper. Therefore, we define compatible notations for the model parameters. Whereas the conventional notation π_k^d(j) denotes the initial state probability of Q^d = j when Q^{1:d-1} = k, we aggregate Q^d and Q^{1:d-1} into Q^{1:d} = Z^d and define π_{di} as the initial state probability of Z^d = i. Similarly, we define A_{dij} as the state transition probability from Z^d = i to j. Note that Σ_{i'∈sib(i)} π_{di'} = 1 and Σ_{j'∈sib(i)∪{End}} A_{dij'} = 1.
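To make these constraints concrete, the following is a minimal Python sketch of a container for the parameter set {A, π, B}; the class name, array layouts, and random initialization are our own illustration, not part of the paper.

import numpy as np

# Hypothetical container for HHMM parameters (layouts are ours).
# A level-d state is an absolute path (q_1, ..., q_d), q_k ∈ {0, ..., N-1},
# stored as a parent index in {0, ..., N^(d-1)-1} plus a child index.
class HHMMParams:
    def __init__(self, D, N, V, seed=0):
        rng = np.random.default_rng(seed)
        self.D, self.N, self.V = D, N, V
        # pi[d][p, i]: initial probability of child i under parent path p,
        # so each row sums to 1 (Σ_{i'∈sib(i)} π_{di'} = 1)
        self.pi = {d: self._norm(rng.random((N**(d - 1), N)))
                   for d in range(1, D + 1)}
        # A[d][p, i, j]: transition among the N siblings under parent p,
        # with an extra End column j = N (Σ_{j'∈sib(i)∪{End}} A_{dij'} = 1)
        self.A = {d: self._norm(rng.random((N**(d - 1), N, N + 1)))
                  for d in range(1, D + 1)}
        # B[p, i, v]: output probability of observation v at bottom path (p, i)
        self.B = self._norm(rng.random((N**(D - 1), N, V)))

    @staticmethod
    def _norm(x):
        return x / x.sum(axis=-1, keepdims=True)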
3 Existing Parameter Estimation Methods
The first work on HHMMs [3] proposed the generalized Baum-Welch algorithm. This algorithm is based on the inside-outside algorithm used for inference of probabilistic context-free grammars. This method takes O(T^3) time, which is not practical for long sequence data.
A more efficient approach is the flattening method [7]. The hierarchical state sequence can be reduced to a single sequence of the bottom-level absolute path states {Z_1^D, ..., Z_T^D}. If we regard Z^D as a flat HMM state, then we can conduct the inference using the forward-backward algorithm with O(TN^{2D}) time complexity, since |Ω^D| = N^D. Notice that the flat state Z^D can transit to any other flat state, so we cannot apply efficient algorithms for HMMs with sparse transition matrices. In the flattening method, we must place a weak constraint on the HHMM parameters, called minimally self-referential (MinSR) [12], which prohibits self-transitions at the higher levels, i.e. A_{dii} = 0 for 1 ≤ d ≤ D − 1. The MinSR constraint enables us to identify the path connecting two flat states uniquely. This property is necessary for estimating HHMM parameters with the flattening method.
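As a small illustration (ours, not from the paper) of the state space blow-up in the flattening view, the following snippet shows how a bottom-level absolute path maps to a single flat state index, and why a dense flat transition matrix has N^{2D} entries.

# Flattening view: each bottom-level absolute path maps to one of
# N**D flat states by reading the path as a base-N number.
def flat_index(path, N):
    idx = 0
    for q in path:
        idx = idx * N + q
    return idx

D, N = 3, 4
assert flat_index((3, 2, 1), N) == 3 * N**2 + 2 * N + 1
print(f"flat states: {N**D}, dense transition entries: {N**(2*D)}")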
We also discuss a sampling approach as an alternative parameter estimation technique. Gibbs sampling is often used for parameter estimation of probabilistic models including latent variables [4]. We can estimate HMM parameters using a Gibbs sampler that samples each hidden state iteratively. This method is applicable to inference of HHMMs in a straightforward manner on the flat HMM. This straightforward approach, called the Direct Gibbs Sampler (DGS), takes O(TN^D) time for a single iteration.
The convergence of the posterior distribution under DGS is known to be extremely slow for HMMs [10] because DGS ignores long-range time dependencies. Chib [2] introduced an alternative method, called the Forward-Backward Gibbs Sampler (FBS), which calculates forward probabilities in advance. FBS then samples hidden states from the end of the sequence with respect to the forward probabilities. The FBS method requires more computation for a single iteration than DGS does, but it brings the posterior of the hidden states to its stationary distribution in fewer iterations [10].
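For reference, here is a minimal sketch of FBS for a flat HMM with K states; the function and variable names are ours, and the sketch assumes a row-stochastic K x K transition matrix A, a K x V emission matrix B, and an initial distribution pi.

import numpy as np

# Forward filtering, then backward sampling of the hidden state sequence.
def fb_sample(pi, A, B, obs, rng=np.random.default_rng(0)):
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                    # forward pass: p(z_t | O_{1:t})
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    z = np.empty(T, dtype=int)
    z[-1] = rng.choice(K, p=alpha[-1])
    for t in range(T - 2, -1, -1):           # backward pass: sample z_t
        w = alpha[t] * A[:, z[t + 1]]        # ∝ p(z_t | z_{t+1}, O_{1:t})
        z[t] = rng.choice(K, p=w / w.sum())
    return z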
Heller et al. [5] proposed Infinite Hierarchical Hidden Markov Models (IHHMMs), which can have an infinitely large depth because the dependency between states at different levels is weakened. They proposed an inference method for IHHMMs based on a blocked Gibbs sampler whose sampling unit is a state sequence from t = 1 to T at a single level. This inference takes only O(TD) time for a single iteration. In HHMMs, however, the states in each level are strongly dependent, so resampling a state at an intermediate level forces all lower states into a completely different behavior. Therefore, applying this blocked Gibbs sampler to HHMMs is impractical in terms of convergence speed.
4 Forward-Backward Activation Algorithm
In this section, we introduce a new parameter estimation algorithm for HHMMs, which theoretically has O(TN^{D+1}) time complexity. The basic idea of our algorithm is a decomposition of the flat transition probability distribution p(Z_{t+1}^D | Z_t^D), which the flattening method calculates directly for all pairs of the flat states. We can rewrite the flat transition probability distribution as a sum over the two cases in which the Markov chain at level D terminates or not, as follows.

p(Z_{t+1}^D | Z_t^D) = p(Z_{t+1}^D | Z_t^D, F_t^D = 0) p(F_t^D = 0 | Z_t^D)
                     + p(Z_{t+1}^D | Z_{t+1}^{D-1}, F_t^D = 1) p(Z_{t+1}^{D-1} | Z_t^{D-1}, F_t^D = 1) p(F_t^D = 1 | Z_t^D)

The first term corresponds to the direct transition without the Markov chain termination. The actual computational complexity for calculating this term is O(N^{D+1}) because the direct transition is permitted only between sibling states, i.e. A_{Dij} = 0 if j ∉ sib(i). The second term, corresponding to the case in which the Markov chain terminates at level D, contains two factors: the upper-level transition probability p(Z_{t+1}^{D-1} | Z_t^{D-1}, F_t^D = 1) and the state initialization probability for the terminated Markov chain p(Z_{t+1}^D | Z_{t+1}^{D-1}, F_t^D = 1). We attempt to compute these probability distributions efficiently in a dynamic programming manner.
The transition probability at level d has the form p(Z_{t+1}^d | Z_t^d, F_t^{d+1} = 1). We define the ending activation e_t^d as the condition of the transition probability from Z_t^d, formally:

p(e_t^d = i) = p(Z_t^d = i, F_t^{d+1} = 1)   (if i ≠ null and d < D)
p(e_t^d = i) = p(Z_t^d = i)                  (if i ≠ null and d = D)
p(e_t^d = null) = p(F_t^{d+1} = 0)

The null value in e_t^d indicates that the Markov chain at level d + 1 does not terminate at time t.
The state initialization probability for level d + 1 has the form p(Z_t^{d+1} | Z_t^d, F_{t-1}^{d+1} = 1). We define the beginning activation b_t^d as the condition of the state initialization probability from Z_t^d, formally:

p(b_t^d = i) = p(Z_t^d = i, F_{t-1}^{d+1} = 1)   (if i ≠ null and d < D and t > 1)
p(b_t^d = i) = p(Z_t^d = i)                      (if i ≠ null and (d = D or t = 1))
p(b_t^d = null) = p(F_{t-1}^{d+1} = 0)

The null value in b_t^d indicates that the Markov chain at level d + 1 does not terminate at time t − 1.
Using these notations, we can represent the flat transition with propagations of activation probabilities as shown in figure 2 (left), because p(Z_{t+1}^D | Z_t^D) = p(b_{t+1}^D | e_t^D). This representation naturally describes the decomposition of the flat transition probability distribution discussed above, and it enables us to apply the decomposition recursively to all levels. We can derive the conditional probability distributions of e_t^d and b_{t+1}^d as

p(e_t^d = i | e_t^{d+1}) = Σ_{c∈child(i)} p(e_t^{d+1} = c) A_{(d+1)cEnd}   (if i ≠ null)
p(e_t^d = null | e_t^{d+1}) = Σ_{c∈Ω^{d+1}} p(e_t^{d+1} = c)(1 − A_{(d+1)cEnd}) + p(e_t^{d+1} = null)

p(b_{t+1}^d = i | e_t^d, b_{t+1}^{d-1}) = p(b_{t+1}^{d-1} = parent(i)) π_{di} + Σ_{j∈sib(i)} p(e_t^d = j) A_{dji}   (if i ≠ null)
p(b_{t+1}^d = null | e_t^d, b_{t+1}^{d-1}) = p(e_t^d = null)
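As an illustration, the following Python sketch (our own vectorization, under the array layouts assumed in the earlier sketch) implements exactly these two recursions for a single step from time t to t + 1; it does not handle the persistence of non-terminating chains, only the propagation of activation mass.

import numpy as np

# e_prob[d] holds p(e_t^d = i) over level-d paths, shape (N^(d-1), N);
# e_null[d] holds p(e_t^d = null); A, A_end, pi follow the earlier layouts.
def propagate(e_prob, A, A_end, pi, D, N):
    e_null = {D: 0.0}
    for d in range(D - 1, 0, -1):        # bottom-up: ending activations
        # p(e^d = i) = Σ_{c∈child(i)} p(e^{d+1} = c) A_{(d+1)cEnd}
        e_prob[d] = (e_prob[d + 1] * A_end[d + 1]).sum(-1).reshape(N**(d - 1), N)
        # p(e^d = null): mass of chains below level d + 1 that do not terminate
        e_null[d] = (e_prob[d + 1] * (1 - A_end[d + 1])).sum() + e_null[d + 1]
    b_prob, b_parent = {}, np.zeros((1, 1))  # no re-entry above the top level
    for d in range(1, D + 1):            # top-down: beginning activations
        # p(b^d = i) = p(b^{d-1} = parent(i)) π_{di} + Σ_{j∈sib(i)} p(e^d = j) A_{dji}
        init = b_parent.reshape(-1, 1) * pi[d]
        trans = np.einsum('pj,pji->pi', e_prob[d], A[d][..., :N])
        b_prob[d] = init + trans
        b_parent = b_prob[d]
    return b_prob, e_null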
In the following subsections, we show the efficient inference algorithm and the parameter estimation algorithm using the activation probabilities.
We can translate the DBN of HHMMs in figure 1 (left) equivalently into a simpler DBN using activation probabilities. The translated DBN is portrayed in figure 2 (right). The inference algorithm proposed herein is based on a forward-backward calculation over this DBN. We define the forward activation probability α and the backward activation probability β as follows.

α_{e_t^d}(i) = p(e_t^d = i, O_{1:t})
α_{b_t^d}(i) = p(b_t^d = i, O_{1:t-1})
β_{e_t^d}(i) = p(O_{t+1:T}, F_T^1 = 1 | e_t^d = i)
β_{b_t^d}(i) = p(O_{t:T}, F_T^1 = 1 | b_t^d = i)
Figure 2: (left) Propagation of activation probabilities for calculating the flat transition probability
from time t to t + 1. (right) Equivalent DBN of the HHMM using activation probabilities.
Using α and β, the posterior activation probabilities given the whole observation sequence are obtained as

p(e_t^d = i | O_{1:T}, F_T^1 = 1) ∝ p(e_t^d = i, O_{1:t}) p(O_{t+1:T}, F_T^1 = 1 | e_t^d = i) = α_{e_t^d}(i) β_{e_t^d}(i)
p(b_t^d = i | O_{1:T}, F_T^1 = 1) ∝ p(b_t^d = i, O_{1:t-1}) p(O_{t:T}, F_T^1 = 1 | b_t^d = i) = α_{b_t^d}(i) β_{b_t^d}(i)

The inference of the flat state p(Z_t^D | O_{1:T}, F_T^1 = 1) is identical to that of the bottom-level activation probability p(e_t^D | O_{1:T}, F_T^1 = 1). We can calculate the likelihood of the whole observation as follows.

p(O_{1:T}, F_T^1 = 1) = Σ_{i∈Ω^1} p(e_T^1 = i, O_{1:T}) p(F_T^1 = 1 | e_T^1 = i) = Σ_{i∈Ω^1} α_{e_T^1}(i) β_{e_T^1}(i)
Algorithm 2 Calculate backward activation probabilities
1: for t = T to 1 do
2:   if t = T then
3:     β_{e_T^1}(i ∈ Ω^1) = A_{1iEnd}
4:     for d = 2 to D do
5:       β_{e_T^d}(i ∈ Ω^d) = β_{e_T^{d-1}}(parent(i)) A_{diEnd}
6:     end for
7:   else
8:     β_{e_t^1}(i ∈ Ω^1) = Σ_{j∈sib(i)} β_{b_{t+1}^1}(j) A_{1ij}
9:     for d = 2 to D do
10:      β_{e_t^d}(i ∈ Ω^d) = β_{e_t^{d-1}}(parent(i)) A_{diEnd} + Σ_{j∈sib(i)} β_{b_{t+1}^d}(j) A_{dij}
11:    end for
12:  end if
13:  β_{b_t^D}(i ∈ Ω^D) = β_{e_t^D}(i) B_{iO_t}
14:  for d = D − 1 to 1 do
15:    β_{b_t^d}(i ∈ Ω^d) = Σ_{c∈child(i)} β_{b_t^{d+1}}(c) π_{(d+1)c}
16:  end for
17: end for
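The following NumPy sketch renders Algorithm 2 under the array layouts assumed in the earlier sketches; it is our own illustration, not the authors' code.

import numpy as np

# beta_e[t][d] and beta_b[t][d] have shape (N^(d-1), N); A[d] is
# (N^(d-1), N, N+1) with End in column N; pi[d] is (N^(d-1), N);
# B is (N^(D-1), N, V); obs holds observation indices for t = 0..T-1.
def backward_activation(A, pi, B, obs, D, N):
    T = len(obs)
    beta_e = [dict() for _ in range(T)]
    beta_b = [dict() for _ in range(T)]
    for t in range(T - 1, -1, -1):
        for d in range(1, D + 1):
            # termination term: the parent's beta_e spread over its children
            par = 1.0 if d == 1 else beta_e[t][d - 1].reshape(-1, 1)
            term = par * A[d][..., N]                        # lines 3, 5
            if t < T - 1:
                # sibling-transition term Σ_{j∈sib(i)} β_b(j) A_{dij}
                sib = np.einsum('pij,pj->pi', A[d][..., :N], beta_b[t + 1][d])
                term = sib if d == 1 else term + sib         # lines 8, 10
            beta_e[t][d] = term
        beta_b[t][D] = beta_e[t][D] * B[..., obs[t]]         # line 13
        for d in range(D - 1, 0, -1):                        # lines 14-16
            beta_b[t][d] = (beta_b[t][d + 1] * pi[d + 1]).sum(-1).reshape(N**(d - 1), N)
    return beta_e, beta_b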
Using the forward and backward activation probabilities, we can estimate HHMM parameters efficiently in the EM framework. In the EM algorithm, the function Q(θ, θ̄) is defined, where θ is the parameter set before updating and θ̄ is the parameter set after updating, as described below.

Q(θ, θ̄) = Σ_Y p_θ(Y | X) log p_θ̄(X, Y)

In this equation, X represents the set of observed variables and Y the set of latent variables. The difference in log-likelihood between the models of θ and θ̄ is known to be greater than Q(θ, θ̄) − Q(θ, θ) [1]. For this reason, we can increase the likelihood monotonically by selecting a new parameter θ̄ that maximizes the function Q. For HHMMs, the set of parameters is θ = {A, π, B}, the set of observed variables is X = {O_{1:T}, F_T^1 = 1}, and the set of latent variables is Y = {Z_{1:T}^{1:D}, F_{1:T-1}^{1:D}}. Therefore, the function Q can be represented as shown below.

Q(θ, θ̄) ∝ Σ_{Z_{1:T}^{1:D}, F_{1:T-1}^{1:D}} p_θ(O_{1:T}, F_T^1 = 1, Z_{1:T}^{1:D}, F_{1:T-1}^{1:D}) log p_θ̄(O_{1:T}, F_T^1 = 1, Z_{1:T}^{1:D}, F_{1:T-1}^{1:D})   (1)
The joint probability of the observed variables and the latent variables is given below, where F_t^{D+1} is regarded as identically 1.

p_θ(O_{1:T}, F_T^1 = 1, Z_{1:T}^{1:D}, F_{1:T-1}^{1:D})
  = Π_{d=1}^D π_{dZ_1^d} · Π_{t=1}^{T-1} Π_{d=1}^D (A_{dZ_t^d End})^{F_t^d} (A_{dZ_t^d Z_{t+1}^d})^{F_t^{d+1}(1−F_t^d)} (π_{dZ_{t+1}^d})^{F_t^d} · Π_{d=1}^D A_{dZ_T^d End} · Π_{t=1}^T B_{Z_t^D O_t}
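As a readable restatement of the indicator exponents, the following sketch (ours, with hypothetical layouts carried over from the earlier sketches) evaluates this complete-data probability for one assignment of Z and F.

import numpy as np

# Z[t] is the absolute path (q_1, ..., q_D) at time t, 0-based; F[t][d] ∈
# {0, 1}, with F_t^{D+1} = 1 encoded by the `d == D` check below.
def prefix(Z_t, d, N):
    p = 0
    for q in Z_t[:d - 1]:     # flatten the level-(d-1) prefix (the parent)
        p = p * N + q
    return p

def complete_loglik(Z, F, A, pi, B, obs, D, N):
    T = len(obs)
    ll = sum(np.log(pi[d][prefix(Z[0], d, N), Z[0][d - 1]]) for d in range(1, D + 1))
    for t in range(T - 1):
        for d in range(1, D + 1):
            p, i = prefix(Z[t], d, N), Z[t][d - 1]
            if F[t][d] == 1:                  # exponent F_t^d: End, then re-init
                ll += np.log(A[d][p, i, N])
                ll += np.log(pi[d][prefix(Z[t + 1], d, N), Z[t + 1][d - 1]])
            elif d == D or F[t][d + 1] == 1:  # exponent F_t^{d+1}(1 − F_t^d)
                ll += np.log(A[d][p, i, Z[t + 1][d - 1]])
    ll += sum(np.log(A[d][prefix(Z[-1], d, N), Z[-1][d - 1], N]) for d in range(1, D + 1))
    ll += sum(np.log(B[prefix(Z[t], D, N), Z[t][D - 1], obs[t]]) for t in range(T))
    return ll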
We substitute the joint probability above into equation (1), integrate out the irrelevant variables, and organize the result around each parameter. Thereby, we obtain the following.

Q(θ, θ̄) ∝ Σ_{d=1}^D Σ_{i∈Ω^d} g_{πdi} log π̄_{di} + Σ_{d=1}^D Σ_{i∈Ω^d} Σ_{j∈sib(i)∪{End}} g_{Adij} log Ā_{dij} + Σ_{i∈Ω^D} Σ_{v=1}^V g_{Biv} log B̄_{iv}

Therein, g_{πdi}, g_{Adij}, and g_{Biv} are given by equations (2)-(5). They are calculable using the forward and backward activation probabilities.

g_{πdi} = α_{b_1^d}(i) β_{b_1^d}(i) + Σ_{t=1}^{T-1} α_{b_{t+1}^{d-1}}(parent(i)) π_{di} β_{b_{t+1}^d}(i)   (2)

g_{AdiEnd} = Σ_{t=1}^{T-1} α_{e_t^d}(i) A_{diEnd} β_{e_t^{d-1}}(parent(i)) + α_{e_T^d}(i) β_{e_T^d}(i)   (3)

g_{Adij} = Σ_{t=1}^{T-1} α_{e_t^d}(i) A_{dij} β_{b_{t+1}^d}(j)   (4)

g_{Biv} = Σ_{t:O_t=v} α_{e_t^D}(i) β_{e_t^D}(i)   (5)

Consequently, we can calculate the updated parameters using α and β. The time complexity of a single EM iteration is O(TN^{D+1}), identical to that of the forward and backward activation probability calculations.
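The resulting M-step is a simple normalization of the expected counts; the following sketch (ours, with the count arrays laid out as in the earlier sketches) shows the update.

# Each updated parameter is the expected count g normalized over its
# constraint set: siblings for π, siblings plus End for A, and the
# vocabulary for B. The counts are assumed accumulated via eqs. (2)-(5).
def m_step(g_pi, g_A, g_B):
    pi_new = {d: g / g.sum(-1, keepdims=True) for d, g in g_pi.items()}
    A_new = {d: g / g.sum(-1, keepdims=True) for d, g in g_A.items()}
    B_new = g_B / g_B.sum(-1, keepdims=True)
    return pi_new, A_new, B_new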
5 Experiments
First, we experimentally confirm that the forward-backward activation algorithm yields parameter estimates exactly identical to those of the flattening method. Recall that the flattening method requires the MinSR constraint on the HHMM parameter set (see section 3). We compare three parameter estimation algorithms: our forward-backward activation algorithm for a MinSR HHMM (FBA with MinSR), the same algorithm for an HHMM without MinSR (FBA w/o MinSR), and the flattening method (FFB). The training dataset consists of 5 sequences of length 10, artificially generated by a MinSR HHMM with a biased parameter set. We execute the three algorithms and examine the log-likelihood achieved at each iteration.
Table 2: Log-likelihood achieved at each iteration.
Iteration        1        2        3        4        5        10       50       100
FBA w/o MinSR    -773.47  -672.44  -668.50  -631.30  -610.63  -577.33  -457.66  -447.90
FBA with MinSR   -773.89  -672.47  -670.40  -643.62  -614.98  -573.84  -453.09  -448.52
FFB              -773.89  -672.47  -670.40  -643.62  -614.98  -573.84  -453.09  -448.52

Table 2 presents the result. FBA with MinSR and FFB achieve identical log-likelihood throughout the training. This result provides experimental evidence that our algorithm estimates HHMM parameters exactly as the flattening method does. Furthermore, FBA enables us to conduct parameter estimation for HHMMs that have non-zero self-transition parameters.
To evaluate the computational costs empirically, we compare four methods of HHMM parameter estimation. Two are based on the EM algorithm, with inference by the forward-backward activation algorithm (FBA) and by the flattening forward-backward method (FFB). The other two are based on a sampling approach: direct Gibbs sampling on the flat HMM (DGS) and forward-backward activation sampling (FBAS). FBAS is a straightforward application of the forward-backward sampling scheme to the translated DBN presented in figure 2: we first calculate the forward activation probabilities and then sample the state activation variables from e_T^1 to b_1^1 in backward order with respect to the forward activation probabilities. We evaluate the four methods on three aspects: execution time, convergence speed, and scalability with the state space size. We apply each method to four different HHMMs of (D = 3, N = 3), (D = 3, N = 4), (D = 4, N = 3), and (D = 4, N = 4). We examine the log-likelihood of the training dataset achieved at each iteration to ascertain the learning convergence. As a training dataset, we use 100 documents from the Reuters corpus as word sequences. The dataset includes 36,262 words in all, with a 4,899-word vocabulary.
Figure 3 presents the log-likelihood of the training data, with the horizontal axis showing the logarithmically scaled execution time. We also measure the average execution time for a single iteration. From these results, we can say primarily that FBA outperforms FFB in terms of execution time. The improvement is remarkable especially for HHMMs with a large state space, because FBA has lower time complexity in N and D than FFB has.
Figure 3: Convergence of log-likelihood on the Reuters corpus training data. Log-likelihood (vertical) is plotted against log-scaled execution time (horizontal) to show the execution time each algorithm needs to converge. (top-left) HHMM of D = 3, N = 3. (top-right) D = 3, N = 4. (bottom-left) D = 4, N = 3. (bottom-right) D = 4, N = 4.
The results show that the likelihood convergence of DGS is much slower than that of the other methods. The execution time of DGS for a single iteration is less than that of the other methods, but this cannot compensate for the slow convergence. In contrast, FBAS achieves a likelihood competitive with FBA. These results suggest that FBAS may be appropriate in some situations, because FBAS finds a better solution than FBA in some of the runs.
6 Conclusion
In this work, we proposed a new inference algorithm for HHMMs based on the activation probability. The results show that the performance of our proposed algorithm surpasses that of existing methods. The forward-backward activation algorithm described herein enables us to conduct unsupervised parameter learning at a practical computational cost for HHMMs with larger state spaces.
References
[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.
[2] S. Chib. Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics, 1996.
[3] S. Fine, Y. Singer, and N. Tishby. The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 1998.
[4] T. Griffiths and M. Steyvers. Finding scientific topics. Proc. the National Academy of Sciences of the United States of America, 2004.
[5] K. Heller, Y. Teh, and D. Gorur. Infinite hierarchical hidden Markov models. In Proc. International Conference on Artificial Intelligence and Statistics, 2009.
[6] S. Luhr, H. Bui, S. Venkatesh, and G. West. Recognition of human activity through hierarchical stochastic learning. In Proc. Pervasive Computing and Communication, 2003.
[7] K. Murphy and M. Paskin. Linear time inference in hierarchical HMMs. In Proc. Neural Information Processing Systems, 2001.
[8] N. Nguyen, D. Phung, and S. Venkatesh. Learning and detecting activities from movement trajectories using the hierarchical hidden Markov models. In Proc. Computer Vision and Pattern Recognition, 2005.
[9] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 1989.
[10] S. Scott. Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association, 2002.
[11] M. Skounakis, M. Craven, and S. Ray. Hierarchical hidden Markov models for information extraction. In Proc. International Joint Conference on Artificial Intelligence, 2003.
[12] M. Weiland, A. Smaill, and P. Nelson. Learning musical pitch structures with hierarchical hidden Markov models. In Proc. Journees Informatiques Musicales, 2005.
[13] L. Xie, S. Chang, A. Divakaran, and H. Sun. Learning hierarchical hidden Markov models for video structure discovery. Technical report, Columbia University, 2002.