A Latent Manifold Markovian Dynamics
Gaussian Process
Sotirios P. Chatzis and Dimitrios Kosmopoulos
Abstract— In this paper, we propose a Gaussian process (GP)
model for analysis of nonlinear time series. Formulation of our
model is based on the consideration that the observed data
are functions of latent variables, with the associated mapping
between observations and latent representations modeled through
GP priors. In addition, to capture the temporal dynamics in the
modeled data, we assume that subsequent latent representations
depend on each other on the basis of a hidden Markov prior
imposed over them. Derivation of our model is performed by
marginalizing out the model parameters in closed form using
GP priors for observation mappings, and appropriate stick-breaking priors for the latent variable (Markovian) dynamics.
This way, we eventually obtain a nonparametric Bayesian model
for dynamical systems that accounts for uncertainty in the
modeled data. We provide efficient inference algorithms for
our model on the basis of a truncated variational Bayesian
approximation. We demonstrate the efficacy of our approach
considering a number of applications dealing with real-world
data, and compare it with the related state-of-the-art approaches.
Index Terms— Gaussian process (GP), latent manifold,
Markovian dynamics, stick-breaking process, variational Bayes.
I. INTRODUCTION

There is a wide variety of generative models used to
perform analysis of nonlinear time series [1]. Approaches
based on hidden Markov models (HMMs) and linear dynamical systems (LDS) are quite ubiquitous in the current literature
due to their simplicity, efficiency, and generally satisfactory
performance in many applications. More expressive models,
such as switching LDS and nonlinear dynamical systems
(NLDS), have also been proposed; however, these approaches
are faced with difficulties in terms of their learning and
inference algorithms, due to the large number of parameters that must be estimated and the correspondingly large amounts of training data required [1].
Recently, a nonparametric Bayesian approach designed to
resolve these issues of NLDS, namely the Gaussian process
dynamical model (GPDM), was introduced in [2]. This
approach is fully defined by a set of low-dimensional representations of the observed data, with both the observation and dynamical processes learned by means of Gaussian
process (GP) regression [3]. This GP-based formulation of the
model gives rise to a nonparametric Bayesian nature, which
removes the need to select a large number of parameters associated with function approximators, while retaining the power
of nonlinear dynamics and observation. GPDM is essentially
an extension of the GP latent variable model (GPLVM) of [4],
which models the joint distribution of the observed data and
their representation in a low-dimensional latent space through
a GP prior. GPDM extends GPLVM by augmenting it with a
model of temporal dynamics captured through imposition of
a dedicated GP prior. This way, GPDM allows for not only
obtaining predictions about future data, but also regularizing
the latent space to allow for more effective modeling of
temporal dynamics.
Despite the merits of GPDM, a significant drawback of this model lies in the need for its inference algorithm to obtain maximum a posteriori (MAP) estimates of its parameters through type-II maximum-likelihood [2] (performed by means
of scaled conjugate gradient descent [5]). This formulation poses a significant bottleneck for GPDM, due to both the high computational costs entailed and the possibility of obtaining bad estimates when the algorithm gets stuck in poor local maxima in cases of limited training datasets.
In addition, to increase computational efficiency, GPDM
imposes an oversimplistic spherical Gaussian prior over its
model of temporal dynamics, which probably undermines
its data modeling capacity. Finally, the use of GP priors to
describe the temporal dynamics between the latent variables
of the model leads to significant computational overheads, as it
gives rise to calculations that entail inverting very large Gram
matrices [3].
To resolve these issues, in this paper, we propose a flexible
generative model for modeling sequential data by means of
nonparametric component densities. Formulation of our proposed model is based on the assumption that, when modeling
sequential data, each observation in a given sequence is related
to a vector in a latent space, and is generated through a latent
nonlinear function that maps the latent space to the space of
observations. We use a GP prior to infer this unknown mapping
function from the data in a flexible manner.
In addition, the latent vectors that generate the observed
sequential data are assumed to possess strong temporal interdependencies; to capture these dependencies, we assume that
these latent variables are generated from an HMM in the
manifold of latent variables. Specifically, we assume a latent
space HMM with infinite hidden states, and use flexible
stick-breaking priors [6], [7] to infer its hidden state dynamics;
this formulation allows us to automatically determine the required number of states in this latent manifold HMM in a data-driven fashion. We dub our approach the latent manifold hidden Markov GP (LM$^2$GP) model. Inference for our model is performed by means of an efficient truncated variational Bayesian algorithm [8], [9]. We evaluate the efficacy of our approach in several applications, dealing with sequential data clustering, classification, and generation. A high-level illustration of the conceptual configuration of our model is shown in Fig. 1.

[Fig. 1. Intuitive illustration of the generative construction of our model.]

The remainder of this paper is organized as follows. In Section II, we provide a brief presentation of the theoretical background of the proposed method. Initially, we present the Dirichlet process (DP) and its function as a prior in nonparametric Bayesian models; furthermore, we provide a brief summary of the GPDM model. In Section III, we introduce the proposed LM$^2$GP model, and derive efficient model inference algorithms based on the variational Bayesian framework. In Section IV, we conduct the experimental evaluation of our proposed model, considering a number of applications dealing with several real-world datasets, and we compare its performance with state-of-the-art related approaches. Finally, in Section V, we conclude and summarize this paper.

II. PRELIMINARIES

A. Dirichlet Process

DP models were first introduced in [7]. A DP is characterized by a base distribution $G_0$ and a positive scalar $\alpha$, usually referred to as the innovation parameter, and is denoted as $\mathrm{DP}(\alpha, G_0)$. Essentially, a DP is a distribution placed over a distribution. Let us suppose we randomly draw a sample distribution $G$ from a DP, and, subsequently, we independently draw $M$ random variables $\{\theta_m^*\}_{m=1}^{M}$ from $G$
$$G|\alpha, G_0 \sim \mathrm{DP}(\alpha, G_0) \qquad (1)$$
$$\theta_m^*|G \sim G, \quad m = 1, \dots, M. \qquad (2)$$
Integrating out $G$, the joint distribution of the variables $\{\theta_m^*\}_{m=1}^{M}$ can be shown to exhibit a clustering effect. Specifically, given the first $M-1$ samples of $G$, $\{\theta_m^*\}_{m=1}^{M-1}$, it can be shown that a new sample $\theta_M^*$ is either 1) drawn from the base distribution $G_0$ with probability $\alpha/(\alpha + M - 1)$ or 2) selected from the existing draws, according to a multinomial allocation, with probabilities proportional to the number of the previous draws with the same allocation [10]. Let $\{\theta_c\}_{c=1}^{C}$ be the set of distinct values taken by the variables $\{\theta_m^*\}_{m=1}^{M-1}$. Denoting as $\nu_c^{M-1}$ the number of values in $\{\theta_m^*\}_{m=1}^{M-1}$ that equal $\theta_c$, the distribution of $\theta_M^*$ given $\{\theta_m^*\}_{m=1}^{M-1}$ can be shown to be of the form [10]
$$p(\theta_M^*|\{\theta_m^*\}_{m=1}^{M-1}, \alpha, G_0) = \frac{\alpha}{\alpha + M - 1}\, G_0 + \sum_{c=1}^{C} \frac{\nu_c^{M-1}}{\alpha + M - 1}\, \delta_{\theta_c} \qquad (3)$$
where $\delta_{\theta_c}$ denotes the distribution concentrated at a single point $\theta_c$. These results illustrate two key properties of the DP scheme. First, the innovation parameter $\alpha$ plays a key role in determining the number of distinct parameter values. A larger $\alpha$ induces a higher tendency of drawing new parameters from the base distribution $G_0$; indeed, as $\alpha \to \infty$, we get $G \to G_0$. Conversely, as $\alpha \to 0$, all $\{\theta_m^*\}_{m=1}^{M}$ tend to cluster to a single random variable. Second, the more often a parameter is shared, the more likely it will be shared in the future.

A characterization of the (unconditional) distribution of the random variable $G$ drawn from a DP, $\mathrm{DP}(\alpha, G_0)$, is provided by the stick-breaking construction in [6]. Consider two infinite collections of independent random variables $v = [v_c]_{c=1}^{\infty}$ and $\{\theta_c\}_{c=1}^{\infty}$, where the $v_c$ are drawn from a Beta distribution, and the $\theta_c$ are independently drawn from the base distribution $G_0$. The stick-breaking representation of $G$ is then given by
$$G = \sum_{c=1}^{\infty} \varpi_c(v)\, \delta_{\theta_c} \qquad (4)$$
where
$$p(v_c) = \mathrm{Beta}(1, \alpha) \qquad (5)$$
$$\varpi_c(v) = v_c \prod_{j=1}^{c-1} (1 - v_j) \in [0, 1] \qquad (6)$$
and
$$\sum_{c=1}^{\infty} \varpi_c(v) = 1. \qquad (7)$$
Under the stick-breaking representation of the DP, the atoms $\theta_c$, drawn independently from the base distribution $G_0$, can be seen as the parameters of the component distributions of a mixture model comprising an unbounded number of component densities, with mixing proportions $\varpi_c(v)$.
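To make the construction of (4)-(7) concrete, the following sketch draws a truncated set of stick-breaking weights and atoms with NumPy. It is an illustrative example, not code from the paper; the truncation level, the innovation parameter value, and the standard-normal base distribution are arbitrary choices for illustration.

```python
import numpy as np

def sample_stick_breaking(alpha, C, rng):
    """Draw C truncated stick-breaking weights for a DP(alpha, G0)."""
    v = rng.beta(1.0, alpha, size=C)            # v_c ~ Beta(1, alpha), cf. Eq. (5)
    v[-1] = 1.0                                 # truncation: assign all remaining stick mass to the last piece
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                        # varpi_c(v) = v_c * prod_{j<c}(1 - v_j), cf. Eq. (6)

rng = np.random.default_rng(0)
weights = sample_stick_breaking(alpha=2.0, C=20, rng=rng)
atoms = rng.standard_normal(20)                 # theta_c drawn i.i.d. from an assumed base G0 = N(0, 1)
print(weights.sum())                            # equals 1 by construction, cf. Eq. (7)
```

A larger innovation parameter spreads the mass over more sticks, mirroring the discussion of the role of $\alpha$ above.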
B. GP Dynamical Model
1) GP Models: Let us begin with a brief description
of GP regression. Consider an observation space X ; a
GP f (x), x ∈ X , is defined as a collection of random
variables, any finite number of which have a joint Gaussian
distribution [3]. We typically use the notation
$$f(x) \sim \mathcal{GP}(m(x), k(x, x')) \qquad (8)$$
where m(x) is the mean function of the process and k(x, x ′ ) is
the covariance function of the process. Usually, for simplicity,
and without any loss of generality, the mean of the process
is taken to be zero, m(x) = 0, although this is not necessary.
Concerning selection of the covariance function, a large variety
of kernel functions k(x, x ′ ) might be employed, depending
on the application considered [3]. Eventually, the real process
f (x) drawn from a GP with mean zero and kernel function
k(x, x ′ ) follows a Gaussian distribution with
p( f |x) = N ( f |0, k(x, x)).
(9)
Let us suppose a set of independent and identically distributed samples D = {(x i , yi )|i = 1, . . . , N}, with the
d-dimensional variables x i being the observations related to a
modeled phenomenon, and the scalars yi being the associated
target values. The goal of a regression model is, given a new
observation x ∗ , to predict the corresponding target value y∗ ,
based on the information contained in the training set D. The
basic notion behind GP regression consists in the assumption
that the observable (training) target values y in a considered
regression problem can be expressed as the superposition of a
GP over the input space X , f (x), and an independent white
Gaussian noise
$$y = f(x) + \epsilon \qquad (10)$$
where $f(x)$ is given by (8), and
$$p(\epsilon) = \mathcal{N}(\epsilon|0, \sigma^2). \qquad (11)$$
Under this regard, the joint normality of the training target values $Y = [y_i]_{i=1}^{N}$ and some unknown target value $y_*$, approximated by the value $f_*$ of the postulated GP evaluated at the observation point $x_*$, is a Gaussian of the form [3]
$$\begin{bmatrix} Y \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0,\; \begin{bmatrix} K(X,X) + \sigma^2 I_N & k(x_*) \\ k(x_*)^T & k(x_*, x_*) \end{bmatrix}\right) \qquad (12)$$
where
$$k(x_*) \triangleq [k(x_1, x_*), \dots, k(x_N, x_*)]^T \qquad (13)$$
$X = [x_i]_{i=1}^{N}$, $I_N$ is the $N \times N$ identity matrix, and $K(X,X)$ is the matrix of the covariances between the $N$ training data points (Gram matrix)
$$K(X,X) \triangleq \begin{bmatrix} k(x_1,x_1) & k(x_1,x_2) & \dots & k(x_1,x_N) \\ k(x_2,x_1) & k(x_2,x_2) & \dots & k(x_2,x_N) \\ \vdots & \vdots & & \vdots \\ k(x_N,x_1) & k(x_N,x_2) & \dots & k(x_N,x_N) \end{bmatrix}. \qquad (14)$$
From (12), and conditioning on the available training samples, we can derive the expression of the model predictive distribution, yielding
$$p(f_*|x_*, \mathcal{D}) = \mathcal{N}(f_*|\mu_*, \sigma_*^2) \qquad (15)$$
$$\mu_* = k(x_*)^T \left(K(X,X) + \sigma^2 I_N\right)^{-1} Y \qquad (16)$$
$$\sigma_*^2 = \sigma^2 - k(x_*)^T \left(K(X,X) + \sigma^2 I_N\right)^{-1} k(x_*) + k(x_*, x_*). \qquad (17)$$
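As a concrete illustration of (15)-(17), the sketch below computes the GP predictive mean and variance in NumPy. The RBF kernel, length-scale, noise level, and toy data are illustrative assumptions only; they are not choices made in the paper.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 l^2)); k(x, x) = 1."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X, Y, X_star, noise_var=0.1):
    """Predictive mean and variance of Eqs. (16)-(17) for each test point."""
    K_noisy = rbf_kernel(X, X) + noise_var * np.eye(len(X))   # K(X, X) + sigma^2 I_N
    k_star = rbf_kernel(X, X_star)                            # k(x_*), Eq. (13)
    alpha = np.linalg.solve(K_noisy, Y)
    mean = k_star.T @ alpha                                   # Eq. (16)
    v = np.linalg.solve(K_noisy, k_star)
    var = noise_var + 1.0 - np.einsum('ij,ij->j', k_star, v)  # Eq. (17) with k(x_*, x_*) = 1
    return mean, var

X = np.linspace(0.0, 5.0, 20)[:, None]
Y = np.sin(X).ravel() + 0.1 * np.random.default_rng(1).standard_normal(20)
mu, var = gp_predict(X, Y, np.array([[2.5]]))
```

Solving the linear system rather than forming the explicit inverse is only a numerical convenience; the quantities computed are exactly those of (16)-(17).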
2) GP Latent Variable Models: Building upon the GP model, the GPLVM is essentially a GP whose input variables $x$ are considered latent variables rather than observed ones. Specifically, GPLVM considers that the $y \in \mathbb{R}^D$ are
observed multidimensional variables, while the latent vectors
x are variables lying in some lower dimensional manifold that
generates the observed variables y through an unknown latent
mapping f , modeled through a GP. This way, we have
$$p(y|x) = \prod_{d=1}^{D} \mathcal{N}(y_d|0, k_d(x, x)) \qquad (18)$$
which, considering a set of observations $Y = [y_n]_{n=1}^{N}$, obtains
$$p(Y|X) = \prod_{d=1}^{D} \mathcal{N}(Y_d|0, K_d(X,X)) \qquad (19)$$
where $k_d(\cdot,\cdot)$ is the kernel of the model for the $d$th observed dimension, $X = [x_n]_{n=1}^{N}$, $Y_d = [y_{nd}]_{n=1}^{N}$, and $K_d(X,X)$ is the Gram matrix (with inputs $X$) corresponding to the $d$th dimension of the modeled data. GPLVM learns the values
of the latent variables x corresponding to the observed data y
through the maximization of the model marginal likelihood,
i.e., in a MAP fashion. As shown in [4], GPLVM can be
viewed as a GP-based nonparametric Bayesian extension of
probabilistic principal component analysis [11], [12], a popular
method for latent manifold modeling of high-dimensional data.
3) Dynamic GPLVMs: Inspired by GPLVM, GPDM performs modeling of sequential data by considering a GP prior as in GPLVM, and introducing an additional model of temporal interdependencies between successive latent vectors $x$. Specifically, considering a sequence of observed data $Y = [y_n]_{n=1}^{N}$, where $y_n = [y_{nd}]_{d=1}^{D}$, with corresponding latent manifold projections $X = [x_n]_{n=1}^{N}$, $x_n \in \mathbb{R}^Q$, GPDM models the dependencies between the latent and observed data in (19), while also considering a GP-based model of the interdependencies between the latent vectors $x_n$. In detail, this latter model initially postulates
$$p(X) = p(x_1) \int \prod_{n=2}^{N} p(x_n|x_{n-1}; A)\, p(A)\, dA \qquad (20)$$
which, assuming
$$p(x_n|x_{n-1}; A) = \mathcal{N}(x_n|A x_{n-1}, \sigma^2) \qquad (21)$$
and a simplistic isotropic Gaussian prior on the columns of the parameters matrix $A$, yields a GP prior of the form
$$p(X) = \frac{p(x_1)}{\sqrt{(2\pi)^{Q(N-1)} |K_X(X,X)|^{Q}}} \exp\left(-\frac{1}{2}\,\mathrm{tr}\!\left(K_X(X,X)^{-1} X_{2:N} X_{2:N}^T\right)\right) \qquad (22)$$
where $K_X(X,X)$ is the Gram matrix
$$K_X(X,X) \triangleq \begin{bmatrix} k(x_1,x_1) & k(x_1,x_2) & \dots & k(x_1,x_{N-1}) \\ k(x_2,x_1) & k(x_2,x_2) & \dots & k(x_2,x_{N-1}) \\ \vdots & \vdots & & \vdots \\ k(x_{N-1},x_1) & k(x_{N-1},x_2) & \dots & k(x_{N-1},x_{N-1}) \end{bmatrix}. \qquad (23)$$
Estimation of the set of (dynamically interdependent) latent
variables X of the GPDM is performed by means of type-II
maximum-likelihood, similar to GPLVM.
As we observe, GPDM utilizes a rather simplistic modeling
approach for the dependencies between the latent variables x,
which is based on a linear model of dependencies, as in (21),
combined with a rather implausible spherical prior over A.
This model formulation is clearly restrictive, and limits the
level of complexity of the temporal structure that GPDM can
capture. In addition, the requirement of MAP estimation of
the latent variables x ∈ R Q results in high computational
requirements. This is aggravated even more by the form
of (22), which gives rise to computations entailing inverting
the Gram matrix $K_X(X,X)$, in addition to computing the inverse
of the matrices K d (X, X). Our proposed model is designed to
ameliorate these issues, as we shall discuss next.
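Before moving on, the computational structure of (22) can be made explicit with the short sketch below, which evaluates the GPDM dynamics log-prior for a given latent trajectory: every evaluation requires inverting an $(N-1)\times(N-1)$ Gram matrix built from the trajectory itself. The RBF dynamics kernel, the jitter term, and the standard-normal prior assumed for $x_1$ are illustrative choices, not the paper's.

```python
import numpy as np

def gpdm_dynamics_logprior(X, length_scale=1.0):
    """log p(X) of Eq. (22) for an N x Q latent trajectory X (illustrative kernel choices)."""
    N, Q = X.shape
    X_in, X_out = X[:-1], X[1:]                                   # kernel inputs x_1..x_{N-1}, targets X_{2:N}
    d2 = ((X_in[:, None, :] - X_in[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2 / length_scale**2) + 1e-6 * np.eye(N - 1)  # dynamics Gram matrix, Eq. (23), plus jitter
    _, logdet = np.linalg.slogdet(K)
    quad = np.trace(np.linalg.solve(K, X_out @ X_out.T))            # tr(K^{-1} X_{2:N} X_{2:N}^T)
    log_p_x1 = -0.5 * (X[0] @ X[0]) - 0.5 * Q * np.log(2 * np.pi)   # assumed N(0, I) prior on x_1
    return log_p_x1 - 0.5 * Q * (N - 1) * np.log(2 * np.pi) - 0.5 * Q * logdet - 0.5 * quad

X = np.random.default_rng(2).standard_normal((50, 3))
print(gpdm_dynamics_logprior(X))   # each call solves against an (N-1) x (N-1) Gram matrix
```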
III. PROPOSED APPROACH

A. Model Definition

Let us consider a sequence of observations with strong temporal interdependencies $Y = [y_n]_{n=1}^{N}$. Similar to GPLVM, we assume a latent lower dimensional manifold that generates the observation space, and a smooth nonlinear mapping from the latent space to the observations space. To infer this nonlinear mapping, we postulate a GP prior, similar to GPLVM. In addition though, we also postulate a model capable of capturing the temporal dynamics in the modeled data. For this purpose, we prefer to work in the lower dimensional manifold of the latent vectors generating our observations. Specifically, we consider a generative nonparametric Bayesian model for these latent coordinates: we assume that they are generated from an HMM comprising infinite (latent) states. To make formulation of this latent manifold infinite-state HMM possible, we impose suitable stick-breaking priors over its state-transition dynamics, similar to [13].

In particular, formulation of our model commences by considering a smooth nonlinear (generative) mapping of the latent vectors $x_n$ to the observed ones $y_n$, described by means of a GP, with the addition of some additive white noise. Given these assumptions, we yield a likelihood function of the form
$$p(y_n|x_n) = \prod_{d=1}^{D} \mathcal{N}(y_{nd}|f_d(x_n), \beta^{-1}) \qquad (24)$$
where $\beta$ is the white noise precision, and $f_d(x_n)$ is modeled by means of a GP prior, such that
$$p(f_d|X) = \mathcal{N}(f_d|0, K_d(X,X)) \qquad (25)$$
where $f_d$ is the vector of the $f_d(x_n)\ \forall n$, i.e., $f_d \triangleq [f_d(x_n)]_{n=1}^{N}$, and $K_d(X,X)$ is the Gram matrix pertaining to the $d$th dimension of the observations space, with kernel hyperparameters set $\psi_d$.

Subsequently, we consider that the latent manifold projections $x$ are generated by an HMM comprising infinite states. Let us introduce the set of variables $Z = \{z_{nc}\}_{n,c=1}^{N,\infty}$, with $z_{nc} = 1$ if $x_n$ is considered to be generated from the $c$th HMM model state, $z_{nc} = 0$ otherwise. Then, our introduced generative model for the latent vectors $x_n$ comprises the assumptions [13]
$$p(x_n|z_{nc}=1) = \mathcal{N}(x_n|\mu_c, R_c^{-1}) \qquad (26)$$
$$p(z_{nj}=1|z_{n-1,i}=1;\, v_i^{\varpi}) = \varpi_j(v_i^{\varpi}), \quad n > 1 \qquad (27)$$
$$\varpi_j(v_i^{\varpi}) = v_{ij}^{\varpi} \prod_{k=1}^{j-1}\left(1 - v_{ik}^{\varpi}\right) \in [0, 1] \qquad (28)$$
$$p(z_{1i}=1|v^{\pi}) = \pi_i(v^{\pi}) \qquad (29)$$
$$\pi_i(v^{\pi}) = v_i^{\pi} \prod_{j=1}^{i-1}\left(1 - v_j^{\pi}\right) \in [0, 1] \qquad (30)$$
with
$$\sum_{j=1}^{\infty} \varpi_j(v_i^{\varpi}) = 1 \quad \forall i \qquad (31)$$
$$\sum_{j=1}^{\infty} \pi_j(v^{\pi}) = 1. \qquad (32)$$

For the stick variables $\{v_i^{\varpi}\}_{i=1}^{\infty}$, where $v_i^{\varpi} = [v_{ij}^{\varpi}]_{j=1}^{\infty}$, and $v^{\pi} = [v_i^{\pi}]_{i=1}^{\infty}$, we impose Beta priors [13] of the form
$$p(v_{ij}^{\varpi}) = \mathrm{Beta}(1, \alpha_i^{\varpi}) \quad \forall j \qquad (33)$$
$$p(v_i^{\pi}) = \mathrm{Beta}(1, \alpha^{\pi}) \quad \forall i. \qquad (34)$$
In essence, this parameterization of the imposed Beta priors gives rise to a DP-based prior construction for the transition dynamics of our model [6].

Finally, due to the effect of the innovation parameters on the number of effective latent Markov chain states, we also impose Gamma priors over them
$$p(\alpha_i^{\varpi}) = \mathcal{G}(\alpha_i^{\varpi}|\eta_1^{\varpi}, \eta_2^{\varpi}) \qquad (35)$$
$$p(\alpha^{\pi}) = \mathcal{G}(\alpha^{\pi}|\eta_1^{\pi}, \eta_2^{\pi}). \qquad (36)$$
We also impose suitable conjugate priors over the mean and precision parameters of the (Markov chain-)state-conditional likelihoods of the latent vectors. Specifically, we choose normal-Wishart priors, yielding
$$p(\mu_c, R_c) = \mathcal{NW}(\mu_c, R_c|\lambda_c, m_c, \omega_c, \Omega_c). \qquad (37)$$
This completes the definition of our proposed LM$^2$GP model. A plate diagram of our model is shown in Fig. 2.

[Fig. 2. Plate diagram representation of the LM$^2$GP model. The arrows represent conditional dependencies. The plates indicate independent copies for states $i$ and $j$.]
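The generative process of (24)-(37) can be summarized, under a finite truncation, by the following minimal sketch: it draws an SB-HMM state sequence, Gaussian latent coordinates per state, and per-dimension GP function values over the sampled latent points. All dimensionalities, the Beta(1, 1) stick variables, the state means and precisions, and the unit-length-scale RBF kernel are illustrative assumptions, not the hyperparameter choices of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
N, Q, D, C = 200, 2, 5, 4            # sequence length, latent dim, observed dim, truncated state count
beta = 100.0                          # observation noise precision, cf. Eq. (24)

def stick_break(v):
    """Turn Beta stick variables into weights, cf. Eqs. (28) and (30), under truncation."""
    v = v.copy(); v[..., -1] = 1.0
    rest = np.concatenate([np.ones(v.shape[:-1] + (1,)),
                           np.cumprod(1.0 - v[..., :-1], axis=-1)], axis=-1)
    return v * rest

pi = stick_break(rng.beta(1.0, 1.0, size=C))            # initial-state weights pi_i(v^pi)
Pi = stick_break(rng.beta(1.0, 1.0, size=(C, C)))       # transition weights varpi_j(v_i^varpi)
mu = 3.0 * rng.standard_normal((C, Q))                  # state means mu_c (arbitrary values)

z = np.empty(N, dtype=int)
z[0] = rng.choice(C, p=pi)
for n in range(1, N):
    z[n] = rng.choice(C, p=Pi[z[n - 1]])
X = mu[z] + 0.3 * rng.standard_normal((N, Q))           # x_n | z_n, Eq. (26), with an assumed isotropic R_c

# GP mapping from latent to observed space: f_d ~ N(0, K_d(X, X)), cf. Eqs. (24)-(25)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2) + 1e-6 * np.eye(N)
F = rng.multivariate_normal(np.zeros(N), K, size=D).T   # one GP draw per observed dimension
Y = F + rng.standard_normal((N, D)) / np.sqrt(beta)     # y_nd = f_d(x_n) + white noise
```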
B. Inference Algorithm
Inference for nonparametric models can be conducted under
a Bayesian setting, typically by means of variational Bayes [9],
or Markov chain Monte Carlo techniques [14]. Here, we prefer
a variational Bayesian approach, due to its considerably better
scalability in terms of computational costs, which becomes of
major importance when having to deal with large data corpora.
Our variational Bayesian inference algorithm for the LM$^2$GP model comprises derivation of a family of variational posterior distributions $q(\cdot)$, which approximate the true posterior distribution over the infinite sets $Z$, $\{\mu_c, R_c\}_{c=1}^{\infty}$, $v^{\pi}$, $\{v_i^{\varpi}\}_{i=1}^{\infty}$, and $\{\alpha_i^{\varpi}\}_{i=1}^{\infty}$, as well as the parameter $\alpha^{\pi}$, the latent mapping function values $f_d = [f_d(x_n)]_{n=1}^{N}$, and the latent mappings $X = [x_n]_{n=1}^{N}$. Clearly, Bayesian inference is not tractable under this setting, since we are dealing with an infinite number of parameters.
For this reason, we employ a common strategy in the
literature of Bayesian nonparametrics, formulated on the basis
of a truncation of the stick-breaking process [9]. Specifically,
we fix a value $C$ and we let the variational posteriors over the $v_{ij}^{\varpi}$ and the $v_i^{\pi}$ have the properties $q(v_{iC}^{\varpi} = 1) = 1$, $\forall i = 1, \dots, C$, and $q(v_C^{\pi} = 1) = 1$. In other words, we set $\pi_c(v^{\pi})$ and $\varpi_c(v_i^{\varpi})$ equal to zero for $c > C$. Note
that, under this setting, the treated LM2 GP model involves
full stick-breaking priors; truncation is not imposed on the
model itself, but only on the variational distribution to allow
for tractable inference. Hence, the truncation level C is a
variational parameter that can be freely set, and not part of
the prior model specification.
Let $W \triangleq \{X, Z, v^{\pi}, \alpha^{\pi}, \{f_d\}_{d=1}^{D}, \{v_c^{\varpi}, \alpha_c^{\varpi}, \mu_c, R_c\}_{c=1}^{C}\}$ be the set of all the parameters of the LM$^2$GP model over which a prior distribution has been imposed, and $\Xi$ be the set of the hyperparameters of the model priors and kernel functions.
Variational Bayesian inference introduces an arbitrary distribution $q(W)$ to approximate the actual posterior $p(W|\Xi, Y)$ that is computationally intractable, yielding [8]
$$\log p(Y) = \mathcal{L}(q) + \mathrm{KL}(q\|p) \qquad (38)$$
where
$$\mathcal{L}(q) = \int dW\, q(W) \log\frac{p(Y, W|\Xi)}{q(W)} \qquad (39)$$
and $\mathrm{KL}(q\|p)$ stands for the Kullback–Leibler (KL) divergence between the (approximate) variational posterior $q(W)$ and the actual posterior $p(W|\Xi, Y)$. Since KL divergence is nonnegative, $\mathcal{L}(q)$ forms a strict lower bound of the log evidence, and would become exact if $q(W) = p(W|\Xi, Y)$. Hence, by maximizing this lower bound $\mathcal{L}(q)$ (variational free energy) so that it becomes as tight as possible, not only do we minimize the KL-divergence between the true and the variational posterior, but we also implicitly integrate out the unknowns $W$ [8].

To facilitate variational Bayesian inference for our model, we assume that the posterior factorizes similar to the prior of our model (mean-field approximation) [15], [16]. This way, the variational free energy $\mathcal{L}(q)$ yields (ignoring constant terms)
$$\begin{aligned}
\mathcal{L}(q) = {} & \sum_{d=1}^{D} \iint dX\, df_d\, q(X)\, q(f_d) \left[\log\frac{p(f_d|X)}{q(f_d)} + \sum_{n=1}^{N} \log p(y_{nd}|f_d(x_n), \beta^{-1})\right] \\
& + \sum_{c=1}^{C} \iint d\mu_c\, dR_c\, q(\mu_c, R_c) \log\frac{p(\mu_c, R_c|\lambda_c, m_c, \omega_c, \Omega_c)}{q(\mu_c, R_c)} \\
& + \int d\alpha^{\pi}\, q(\alpha^{\pi}) \log\frac{p(\alpha^{\pi}|\eta_1^{\pi}, \eta_2^{\pi})}{q(\alpha^{\pi})} + \sum_{c=1}^{C-1} \int dv_c^{\pi}\, q(v_c^{\pi}) \log\frac{p(v_c^{\pi}|\alpha^{\pi})}{q(v_c^{\pi})} \\
& + \sum_{c=1}^{C-1} \int d\alpha_c^{\varpi}\, q(\alpha_c^{\varpi}) \log\frac{p(\alpha_c^{\varpi}|\eta_1^{\varpi}, \eta_2^{\varpi})}{q(\alpha_c^{\varpi})} + \sum_{c=1}^{C} \sum_{c'=1}^{C-1} \int dv_{cc'}^{\varpi}\, q(v_{cc'}^{\varpi}) \log\frac{p(v_{cc'}^{\varpi}|\alpha_c^{\varpi})}{q(v_{cc'}^{\varpi})} \\
& + \sum_{i=1}^{C} \sum_{j=1}^{C} \sum_{n=2}^{N} q(z_{nj}=1|z_{n-1,i}=1) \int dv_i^{\varpi}\, q(v_i^{\varpi}) \log\frac{p(z_{nj}=1|z_{n-1,i}=1;\, v_i^{\varpi})}{q(z_{nj}=1|z_{n-1,i}=1)} \\
& + \sum_{c=1}^{C} q(z_{1c}=1) \int dv^{\pi}\, q(v^{\pi}) \log\frac{p(z_{1c}=1|v^{\pi})}{q(z_{1c}=1)} \\
& + \sum_{c=1}^{C} \sum_{n=1}^{N} q(z_{nc}=1) \iiint dx_n\, d\mu_c\, dR_c\, q(x_n)\, q(\mu_c, R_c) \log\frac{p(x_n|\mu_c, R_c)}{q(x_n)}. \qquad (40)
\end{aligned}$$
Derivation of the variational posterior distribution q(W )
involves maximization of the variational free energy L(q)
over each one of the factors of q(W ) in turn, holding the
others fixed, in an iterative manner [15]. On each iteration,
apart from variational posterior updating, we also update the
estimates of the model hyperparameters in $\Xi$, by maximization
of the variational free energy L(q) over each one of them. By
construction, this iterative consecutive updating of the model
is guaranteed to monotonically and maximally increase the
free energy L(q) [17].
1) Variational Posteriors: Let us denote as $\langle \cdot \rangle_{q(\cdot)}$ the posterior expectation of a quantity, that is, the quantity's mean
value considering the entailed random variables follow the
variational posteriors q(·). In the following, all these posterior means can be computed analytically, except for the
expectations with respect to the posterior q(x n ) over the
latent manifold representations of our modeled data. We shall
elaborate on computation of these latter quantities later in this
section.
From (40), we obtain the following variational (approximate) posteriors over the parameters of our model.
1) For the $q(v_i^{\varpi})$, we have
$$q(v_{ij}^{\varpi}) = \mathrm{Beta}(\tilde{\beta}_{ij}^{\varpi}, \hat{\beta}_{ij}^{\varpi}) \qquad (41)$$
where
$$\tilde{\beta}_{ij}^{\varpi} = 1 + \sum_{n=2}^{N} q(z_{nj}=1|z_{n-1,i}=1) \qquad (42)$$
$$\hat{\beta}_{ij}^{\varpi} = \langle \alpha_i^{\varpi}\rangle_{q(\alpha_i^{\varpi})} + \sum_{\varrho=j+1}^{C} \sum_{n=2}^{N} q(z_{n\varrho}=1|z_{n-1,i}=1). \qquad (43)$$

2) Similarly, for the $q(v_i^{\pi})$, we have
$$q(v_i^{\pi}) = \mathrm{Beta}(\tilde{\beta}_i^{\pi}, \hat{\beta}_i^{\pi}) \qquad (44)$$
where
$$\tilde{\beta}_i^{\pi} = 1 + q(z_{1i}=1) \qquad (45)$$
$$\hat{\beta}_i^{\pi} = \langle \alpha^{\pi}\rangle_{q(\alpha^{\pi})} + \sum_{\varrho=i+1}^{C} q(z_{1\varrho}=1). \qquad (46)$$

3) For the $q(\alpha_i^{\varpi})$, we have
$$q(\alpha_i^{\varpi}) = \mathcal{G}(\alpha_i^{\varpi}|\tilde{\epsilon}_i^{\varpi}, \hat{\epsilon}_i^{\varpi}) \qquad (47)$$
where
$$\tilde{\epsilon}_i^{\varpi} = \eta_1^{\varpi} + C - 1 \qquad (48)$$
$$\hat{\epsilon}_i^{\varpi} = \eta_2^{\varpi} - \sum_{j=1}^{C-1}\left[\psi(\hat{\beta}_{ij}^{\varpi}) - \psi(\tilde{\beta}_{ij}^{\varpi} + \hat{\beta}_{ij}^{\varpi})\right]. \qquad (49)$$

4) For the $q(\alpha^{\pi})$, we have
$$q(\alpha^{\pi}) = \mathcal{G}(\alpha^{\pi}|\tilde{\varepsilon}^{\pi}, \hat{\varepsilon}^{\pi}) \qquad (50)$$
where
$$\tilde{\varepsilon}^{\pi} = \eta_1^{\pi} + C - 1 \qquad (51)$$
$$\hat{\varepsilon}^{\pi} = \eta_2^{\pi} - \sum_{i=1}^{C-1}\left[\psi(\hat{\beta}_i^{\pi}) - \psi(\tilde{\beta}_i^{\pi} + \hat{\beta}_i^{\pi})\right]. \qquad (52)$$

5) The posteriors $q(\mu_c, R_c)$ are approximated in the following form:
$$q(\mu_c, R_c) = \mathcal{NW}(\mu_c, R_c|\tilde{\lambda}_c, \tilde{m}_c, \tilde{\omega}_c, \tilde{\Omega}_c) \qquad (53)$$
where we introduce the notation
$$\bar{x}_c \triangleq \frac{\sum_{n=1}^{N} q(z_{nc}=1)\, \langle x_n\rangle_{q(x_n)}}{\sum_{n=1}^{N} q(z_{nc}=1)} \qquad (54)$$
and, we have
$$\tilde{\lambda}_c = \lambda_c + \sum_{n=1}^{N} q(z_{nc}=1) \qquad (55)$$
$$\tilde{\omega}_c = \omega_c + \sum_{n=1}^{N} q(z_{nc}=1) \qquad (56)$$
$$\tilde{\Omega}_c = \Omega_c + \sum_{n=1}^{N} q(z_{nc}=1)\left(\langle x_n\rangle_{q(x_n)} - \bar{x}_c\right)\left(\langle x_n\rangle_{q(x_n)} - \bar{x}_c\right)^T + \frac{\lambda_c \sum_{n=1}^{N} q(z_{nc}=1)}{\lambda_c + \sum_{n=1}^{N} q(z_{nc}=1)}\, (m_c - \bar{x}_c)(m_c - \bar{x}_c)^T \qquad (57)$$
$$\tilde{m}_c = \frac{\lambda_c m_c + \bar{x}_c \sum_{n=1}^{N} q(z_{nc}=1)}{\lambda_c + \sum_{n=1}^{N} q(z_{nc}=1)}. \qquad (58)$$
In the above expressions, $\langle x_n\rangle_{q(x_n)}$ is the (posterior) expectation of the latent variable $x_n$ given its posterior $q(x_n)$.

6) Regarding the posteriors over the latent functions $f_d$, we have
$$q(f_d) = \mathcal{N}(f_d|\hat{\mu}_d, \Sigma_d) \qquad (59)$$
where
$$\Sigma_d = \left(\langle K_d(X,X)^{-1}\rangle_{q(X)} + \beta I\right)^{-1} \qquad (60)$$
$$\hat{\mu}_d = \beta\, \Sigma_d Y_d \qquad (61)$$
$$Y_d \triangleq [y_{nd}]_{n=1}^{N} \qquad (62)$$
and $\langle K_d(X,X)^{-1}\rangle_{q(X)}$ is the posterior mean of the inverse Gram matrix $K_d(X,X)^{-1}$ given the variational posteriors $q(x_n)\ \forall n$.

7) Similarly, the posteriors over the indicator variables $Z$ yield
$$q(Z) \propto \pi_{\delta_1}^* \prod_{n=1}^{N-1} \varpi_{\delta_n \delta_{n+1}}^* \prod_{n=1}^{N} p^*(x_n|\mu_{\delta_n}, R_{\delta_n}) \qquad (63)$$
where
$$\pi_c^* \triangleq \exp\left(\langle \log \pi_c(v^{\pi})\rangle_{q(v^{\pi})}\right) \qquad (64)$$
$$\varpi_{ij}^* \triangleq \exp\left(\langle \log \varpi_j(v_i^{\varpi})\rangle_{q(v_i^{\varpi})}\right) \qquad (65)$$
$$p^*(x_t|\mu_i, R_i) \triangleq \exp\left(\langle \log p(x_t|\mu_i, R_i)\rangle_{q(\mu_i, R_i),\, q(x_t)}\right) \qquad (66)$$
and
$$\delta_n \triangleq \arg_c(z_{nc} = 1).$$
From (63), and comparing this expression with the
corresponding expressions of standard HMMs [18], it
is easy to observe that computation of the probabilities
q(z n j = 1|z n−1,i = 1), and q(z nc = 1), which constitute
the variational posterior q(Z ), can be easily performed
by means of the forward–backward algorithm for simple HMMs trained by means of maximum-likelihood,
exactly as described in [19]; specifically, it suffices to
run forward–backward for a simple HMM model with
its optimized values (point estimates) of the Markov
chain probabilities set equal to the posterior expected
values $\pi_c^*$, $\varpi_{ij}^*$ in (64) and (65), and its state-conditional likelihoods set equal to the expectations $p^*(x_t|\mu_i, R_i)$.
8) Finally, regarding the latent variable posteriors $q(x_n)$, optimization over $\mathcal{L}(q)$ yields
$$\log q(x_n) = \sum_{c=1}^{C} q(z_{nc}=1)\, \langle \log p(x_n|z_{nc}=1)\rangle_{q(\mu_c, R_c)} + \sum_{d=1}^{D} \langle \log p(f_d|0, K_d(X,X))\rangle_{q(f_d)}. \qquad (67)$$
From (67), it becomes apparent that model configuration
is not conjugate when it comes to the latent variables x n .
As a consequence, the model does not yield a closed-form
expression for the posteriors q(x n ). A repercussion of this fact
is that the entailed posterior expectations with respect to q(x n )
cannot be computed in an analytical fashion, wherever they
appear in the inference algorithm of our model (namely, the quantities $\langle K_d(X,X)^{-1}\rangle_{q(X)}$ and $\langle x_n\rangle_{q(x_n)}$). To resolve this
issue, we can either resort to a deterministic approximation
of the posteriors q(x n ), e.g., by the application of Laplace
approximation, or perform sampling by means of MCMC. Our
investigations have shown that Laplace approximation does not
perform very well for our model. For this reason, in this paper,
we opt for the latter alternative.
Specifically, for this purpose, we draw samples from
the variational posterior (67) using hybrid Monte Carlo
(HMC) [20]. HMC provides an efficient method to draw
samples from the variational posterior distribution q(x n ) by
performing a physical simulation of an energy-conserving
system to generate proposal moves. In detail, we add kinetic
energy variables, p, for each latent dimension, while the
expression of −logq(x n ) is used to specify a potential energy
over the latent variables. Each HMC step involves sampling
p and carrying out a physics-based simulation using leap-frog
discretization. The final state of the simulation is accepted or
rejected based on the Metropolis–Hastings algorithm. HMC
sampling of x n requires computation of the gradient of q(x n ),
which can be analytically performed in a straightforward
manner.
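For illustration, the sketch below implements one HMC transition of the kind described above, with a leap-frog simulation followed by a Metropolis–Hastings accept/reject step. The standard-normal toy target, the step size, and the gradient function are placeholders; in the actual algorithm the potential energy would be $-\log q(x_n)$ from (67) and its gradient from (74) in the Appendix.

```python
import numpy as np

def hmc_step(x, neg_log_prob, grad, step, n_leapfrog, rng):
    """One HMC transition: leap-frog simulation plus Metropolis-Hastings accept/reject."""
    p = rng.standard_normal(x.shape)                  # sample auxiliary (kinetic energy) momenta
    x_new = x.copy()
    p_new = p - 0.5 * step * grad(x)                  # initial half momentum step
    for _ in range(n_leapfrog):
        x_new = x_new + step * p_new                  # full position step
        p_new = p_new - step * grad(x_new)            # full momentum step
    p_new = p_new + 0.5 * step * grad(x_new)          # correct the last momentum update to a half step
    h_old = neg_log_prob(x) + 0.5 * (p @ p)           # total energy before the move
    h_new = neg_log_prob(x_new) + 0.5 * (p_new @ p_new)
    accept = np.log(rng.uniform()) < h_old - h_new    # Metropolis-Hastings test
    return (x_new, True) if accept else (x, False)

# Toy target: -log q(x) = 0.5 ||x||^2 up to a constant (standard normal).
neg_log_prob = lambda x: 0.5 * (x @ x)
grad = lambda x: x
rng = np.random.default_rng(4)
x = np.zeros(3)
samples = []
for _ in range(100):
    x, _ = hmc_step(x, neg_log_prob, grad, step=0.1, n_leapfrog=25, rng=rng)
    samples.append(x)
```

The number of leap-frog steps above mirrors the value reported in the experimental setup of Section IV; the step size here is purely illustrative.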
2) Hyperparameter Selection: Regarding the values of the hyperparameters of the model priors, we set $m_c = 0$, $\lambda_c = 1$, $\omega_c = 10$, and $\Omega_c = 100 I$. Regarding the
hyperparameters of the kernel functions kd (·, ·), we estimate
them by performing maximization of the model variational
free energy L(q) over each one of them. Computation of
the variational free energy L(q) and its gradient requires
derivation of the entailed posterior expectations in (40),
which can be obtained analytically, except for the expectations with respect to the posteriors q(x n ). These latter
quantities can be obtained by means of MCMC, utilizing
the samples of x n , previously drawn by HMC sampling.
Given the fact that none of the sought quantities yields
closed-form estimators, to perform $\mathcal{L}(q)$ maximization, we resort to the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm [21].
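As an illustration of this optimization step, the sketch below uses SciPy's L-BFGS-B routine to optimize an objective given its gradient. The quadratic toy objective is only a stand-in for $-\mathcal{L}(q)$ as a function of the kernel hyperparameters, whose actual gradient expression is given in the Appendix.

```python
import numpy as np
from scipy.optimize import minimize

TARGET = np.array([0.3, 1.2])          # arbitrary optimum of the toy objective

def neg_free_energy(phi):
    """Toy stand-in for -L(q) as a function of kernel hyperparameters phi."""
    return 0.5 * np.sum((phi - TARGET) ** 2)

def neg_free_energy_grad(phi):
    return phi - TARGET

phi0 = np.zeros(2)                      # initial hyperparameter values
res = minimize(neg_free_energy, phi0, jac=neg_free_energy_grad, method="L-BFGS-B")
print(res.x)                            # maximizer of the (toy) free energy
```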
C. Predictive Density
Let us now consider the problem of computing the probability of a given sequence of observations $Y^* = [y_n^*]_{n=1}^{N^*}$ with respect to a trained LM$^2$GP model. This quantity is useful,
e.g., in applying our model to (sequence-level) classification
applications. For this purpose, we need to derive the predictive
density of the model
$$p(Y^*|Y) = \iint p(Y^*|X^*; Y, X)\, p(X^*|X, Y)\, p(X|Y)\, dX\, dX^*. \qquad (68)$$
To begin with, the expression of $p(Y^*|X^*; Y, X)$ is analogous to the predictive density expression of standard GP regression, yielding
$$p(Y^*|X^*; Y, X) = \prod_{n=1}^{N^*} \prod_{d=1}^{D} \mathcal{N}(y_{nd}^*|a_{nd}^*, \sigma_{nd}^{*2}) \qquad (69)$$
$$a_{nd}^* = k_d(x_n^*)^T \left(K_d(X,X) + \beta^{-1} I\right)^{-1} Y_d \qquad (70)$$
$$\sigma_{nd}^{*2} = -k_d(x_n^*)^T \left(K_d(X,X) + \beta^{-1} I\right)^{-1} k_d(x_n^*) + k_d(x_n^*, x_n^*) \qquad (71)$$
where
$$k_d(x_n^*) \triangleq [k_d(x_1, x_n^*), \dots, k_d(x_N, x_n^*)]^T. \qquad (72)$$
Now, regarding the computation of the entailed expectation
of p(Y ∗ |X ∗ ; Y, X) with respect to the distributions p(X|Y )
and $p(X^*|X, Y)$, we proceed as follows: first, we substitute $p(X|Y)$ with its approximation (variational posterior) $q(X) = \prod_{n=1}^{N} q(x_n)$. This way, the corresponding expectation
can be approximated by making use of the samples of the
latent variables x n previously drawn through HMC. Finally, to
obtain the expectations with respect to the density p(X ∗ |X, Y ),
we resort to Gibbs sampling from the inferred latent-manifold
infinite-state HMM. This obtains a set of samples of X ∗ from
the latent manifold, which can be used to approximate the
sought (remaining) expectation with respect to p(X ∗ |X, Y ).
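Such a Monte Carlo approximation of (68) can be organized as in the sketch below, which averages the GP predictive likelihood of (69)-(72) over pre-drawn samples of the latent trajectories. The sample arrays, the unit-length-scale RBF kernel, and the noise precision are placeholders for quantities produced by the inference procedure (HMC samples of $X$, Gibbs samples of $X^*$ from the latent HMM); this is a sketch of the computation, not the paper's code.

```python
import numpy as np
from scipy.stats import norm

def rbf(A, B):
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def log_pred_density(Y_star, Y, X_samples, X_star_samples, beta=100.0):
    """Monte Carlo estimate of log p(Y*|Y), averaging the GP predictive of Eq. (69) over latent samples."""
    D = Y.shape[1]
    logs = []
    for X, X_star in zip(X_samples, X_star_samples):
        K = rbf(X, X) + np.eye(len(X)) / beta                        # K_d(X, X) + beta^{-1} I, cf. Eq. (70)
        k_star = rbf(X, X_star)
        solve = np.linalg.solve(K, np.column_stack([Y, k_star]))
        mean = k_star.T @ solve[:, :D]                               # a*_{nd}, Eq. (70)
        var = 1.0 + 1.0 / beta - np.einsum('ij,ij->j', k_star, solve[:, D:])  # Eq. (71) plus observation noise
        logs.append(norm.logpdf(Y_star, mean, np.sqrt(var)[:, None]).sum())
    return np.logaddexp.reduce(logs) - np.log(len(logs))              # log of the Monte Carlo average

# Hypothetical pre-drawn samples (shapes only; in practice they come from the HMC / Gibbs steps above).
rng = np.random.default_rng(5)
Y, Y_star = rng.standard_normal((50, 3)), rng.standard_normal((10, 3))
X_samples = [rng.standard_normal((50, 2)) for _ in range(5)]
X_star_samples = [rng.standard_normal((10, 2)) for _ in range(5)]
print(log_pred_density(Y_star, Y, X_samples, X_star_samples))
```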
D. Relations to Existing Approaches
As we have already discussed, our model is related to the
GPDM method of [2]. The main difference between this paper
and GPDM is that the latter uses a simplistic linear model of
temporal dynamics combined with a spherical GP prior over
the latent variables. In contrast, our approach employs a more
flexible generative construction, which considers latent variable emission from a latent stick-breaking HMM (SB-HMM).
In addition, one can also observe that our method essentially
reduces to a GPLVM model by setting the truncation threshold
C equal to one, i.e., C = 1.
IV. EXPERIMENTS
Here, we experimentally evaluate our method in three
scenarios: 1) unsupervised segmentation (frame-level clustering) of sequential data; 2) supervised segmentation
(frame-level classification) of sequential data; and 3) (whole)
sequence classification.
In the context of our unsupervised sequence segmentation
experiments, we estimate the assignment of the data points
(frames) of each sequence to the discovered latent classes
(model states) using q(Z ), and compare the estimated values
with the available ground-truth sequence segmentation. To
compute the optimal assignment of the discovered class labels
to the available ground-truth class labels, we resort to the
Munkres assignment algorithm. This set of experiments makes
it thus possible for us to evaluate the quality of the latent
temporal dynamics discovered by our model.
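For reference, the optimal matching between discovered state labels and ground-truth labels can be computed with the Hungarian (Munkres) algorithm as in the sketch below; the label arrays shown are dummy placeholders, not data from the experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_labels(predicted, truth, n_classes):
    """Remap discovered cluster labels onto ground-truth labels so as to maximize agreement."""
    cost = np.zeros((n_classes, n_classes))
    for p, t in zip(predicted, truth):
        cost[p, t] -= 1                        # negative counts: the Hungarian solver minimizes
    rows, cols = linear_sum_assignment(cost)   # Munkres assignment
    mapping = dict(zip(rows, cols))
    remapped = np.array([mapping[p] for p in predicted])
    return remapped, float(np.mean(remapped != np.asarray(truth)))    # relabeled predictions, error rate

predicted = np.array([0, 0, 1, 1, 2, 2, 2])    # dummy frame-level cluster assignments
truth     = np.array([1, 1, 0, 0, 2, 2, 1])    # dummy ground-truth frame labels
remapped, err = match_labels(predicted, truth, n_classes=3)
```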
On the other hand, the considered supervised sequence segmentation experiments additionally allow us to evaluate
the quality of the drawn latent manifold representations of
the modeled data. Specifically, in this set of experiments, the
drawn latent manifold representations x n are subsequently fed
to a simple multiclass classifier (specifically, a multiclass linear
support vector machine [22]), which is trained to discover
the correct (frame-level) classes by being presented with the
generated manifold representations x n . We expect that a good
performance of our model under such an experimental setup
would indicate high-quality representational capabilities for
both the generated latent manifold representations x n and the
temporal dynamics discovered by our model. Furthermore, in
the context of this set of experiments, we also explore whether
our model could be used to obtain a deep learning architecture,
by stacking multiple layers of models, each one fed with
the latent state representations generated from the previous
layer (and the first layer fed with the actual observations). We
believe that, by stacking multiple layers of models, we might
be capable of extracting more complex, higher level temporal
dynamics encoded in the final-layer latent representations x n ,
thus allowing for eventually obtaining increased supervised
(frame-level) classification performance.
Finally, in the case of the sequence classification experiments, we train one model for each one of the considered
classes. Evaluation is performed by finding for each test
sequence the model that yields the highest predictive probability, and assigning it to the corresponding class.
To obtain some comparative results, in our experiments,
apart from our method, we also evaluate GPLVM [4] and
GPDM models [2], as well as other state-of-the-art approaches
relevant to each scenario. In our experiments, we make use of
the large-scale linear support vector machine (SVM) library
of [23], and the HMM-SVM implementation of Thorsten
Joachims. In all our experiments, we use RBF kernels for all
the evaluated GP-based models. Finally, HMC is run for 100
iterations, with 25 leap-frog steps at each iteration, and with a step length equal to $0.001\sqrt{N}$, where $N$ is the number of modeled data points.
A. Unsupervised Sequence Segmentation
1) Workflow Recognition: We first consider an unsupervised
sequence segmentation (frame-level clustering) application.
For this purpose, we use a public benchmark dataset involving
action recognition of humans, namely the workflow recognition (WR) database [24]. Specifically, we use the first two
workflows pertaining to car assembly (see [24] for more details).

[TABLE I: Unsupervised sequence segmentation: workflow recognition. Error rates (%) for optimal latent manifold dimensionality.]

The frame-level tasks to recognize in an unsupervised
fashion in these workflows are the following:
1) worker 1 picks up part 1 from rack 1 (upper) and places
it on the welding cell, and the mean duration is 8–10 s;
2) workers 1 and 2 pick part 2a from rack 2 and place it
on the welding cell;
3) workers 1 and 2 pick part 2b from rack 3 and place it
on the welding cell;
4) worker 2 picks up spare parts 3a and 3b from rack 4
and places them on the welding cell;
5) worker 2 picks up spare part 4 from rack 1 and places
it on the welding cell;
6) workers 1 and 2 pick up part 5 from rack 5 and place
it on the welding cell.
Feature extraction is performed as follows: to extract
the spatiotemporal variations, we use pixel change
history images to capture the motion history [25], and
compute the complex Zernike moments A00 , A11, A20 ,
A22 , A31 , A33 , A40 , A42 , A44 , A51 , A53 , A55 , A60 , A62 ,
A64 , A66 , for each of which we compute the norm and the
angle. In addition, the center of gravity and the area of the
found blobs are also used. In total, this feature extraction
procedure results in 31-D observation vectors. Zernike
moments are calculated in rectangular regions of interest
of approximately 15K pixels in each image to limit the
processing and allow real-time feature extraction (performed
at a rate of approximately 50–60 frames per second). In our
experiments, we use a total of 40 sequences representing full
assembly cycles and containing at least one of the considered
behaviors, with each sequence being approximately 1K
frames long. Frame annotation has been performed manually.
In Table I, we illustrate the obtained error rates for optimal
selection of the latent manifold dimensionality of our method,
as well as the GPDM and GPLVM methods. Regarding the
GPDM and GPLVM methods, clustering was performed by
presenting the generated latent subspace representations to
an SB-HMM [13]. All these results are means and variances
over 10 repetitions of the experiment. As a baseline, we also
evaluate SB-HMMs presented with the original (observed)
data. As we observe, our approach yields a clear advantage
over the competition; we also observe that all the other latent
manifold models yield inferior performance compared with
the baseline SB-HMM. This result is a clear indication of the
much improved capability of our approach to capture latent
temporal dynamics in the modeled data.
[Fig. 3. Unsupervised sequence segmentation: workflow recognition. Error rate fluctuation with latent manifold dimensionality.]
[Fig. 4. Workflow recognition. Average execution time (per sequence) of the inference algorithm.]
[Fig. 5. Unsupervised sequence segmentation: Honeybee. Error rate fluctuation with latent manifold dimensionality.]
[TABLE II: Unsupervised sequence segmentation: Honeybee. Error rates (%) for optimal latent manifold dimensionality.]

In addition, in Fig. 3, we show how model performance changes as a function of the postulated latent manifold dimensionality, for the LM$^2$GP, GPDM, and GPLVM methods.
We observe that the model performance increases with manifold dimensionality in a consistent fashion. We also observe
that our model benefits the most by an increase in latent
manifold dimensionality. Furthermore, we run the Student’s-t
test on all pairs of performances across the evaluated methods,
to assess the statistical significance of the obtained differences;
we obtain that all performance differences are deemed statistically significant.
Finally, in Fig. 4, we show an indicative empirical comparison of computational times, for processing one of the
sequences in our dataset (means and error bars over all the
used sequences). As we observe, GPDM requires the longest
time between the compared methods; our method is faster than
GPDM, while imposing a reasonable increase in computational
costs compared with GPLVM. This difference from GPDM
was expected, since our method involves one fewer Gram matrix compared with GPDM.
2) Honeybee Dataset: Furthermore, we perform a second
unsupervised sequence segmentation experiment using the
Honeybee dataset [26]; it contains video sequences of honeybees, which communicate the location and distance to a
food source through a dance that takes place within the hive.
The dance can be decomposed into three different movement
patterns that must be recognized by the evaluated algorithms:
1) waggle; 2) right turn; and 3) left turn. During the waggle
dance, the bee moves roughly in a straight line while rapidly
shaking its body from left to right; the duration and orientation
of this phase correspond to the distance and the orientation to
the food source. At the endpoint of a waggle dance, the bee
turns in a clockwise or counter-clockwise direction to form a
turning dance.
Our dataset consists of six video sequences with lengths
1058, 1125, 1054, 757, 609, and 814 frames, respectively. The
bees were visually tracked, and their locations and head angles
were recorded. This resulted in obtaining 4-D frame-level feature vectors comprising the 2-D location of the bee and the sine
and cosine of its head angle. Once the sequence observations
were obtained, the trajectories were preprocessed as in [27].
Specifically, the trajectory sequences were rotated so that the
waggle dances had head angle measurements centered about
zero radian. The sequences were then translated to center at
(0, 0), and the 2-D coordinates were scaled to the [−1, 1]
range. Aligning the waggle dances was possible by looking at
the high-frequency portions of the head angle measurements.
Following the suggestion of [26], the data were smoothed
using a Gaussian FIR pulse-shaping filter with 0.5-dB
bandwidth-symbol time.
In our evaluations, we adopt the experimental setup of
the previous experiment. In Fig. 5, we show how model
performance changes as a function of the postulated latent
manifold dimensionality, for the LM2 GP, GPDM, and GPLVM
methods. In Table II, we provide the error rates of all the
evaluated models for optimal latent manifold dimensionality
(wherever applicable). These results are means and variances
over 10 repetitions of the experiment. We again observe that
both GPDM and GPLVM are outperformed by the baseline
SB-HMM. Our method works better than all the alternatives.
[TABLE III: Supervised sequence segmentation: workflow recognition. Error rates (%) for optimal latent manifold dimensionality.]
[Fig. 6. Supervised sequence segmentation: workflow recognition. Error rate fluctuation with latent manifold dimensionality.]
[Fig. 7. Supervised sequence segmentation: workflow recognition. Error rate fluctuation with latent manifold dimensionality for two-layer deep learning architectures.]

Finally, we run the Student's-t test on all pairs of performances across the evaluated methods, to assess the statistical
significance of the obtained differences; we obtain that all
performance differences are deemed statistically significant,
except for the differences between the optimal performance
of GPLVM and the optimal performance of GPDM (obtained
for three and four latent features, respectively).
B. Supervised Sequence Segmentation
1) Workflow Recognition: Here, we use the same dataset as
in Section IV-A1, but with the goal to perform supervised
sequence segmentation using the obtained latent manifold
representations of the modeled data. For this purpose, we first
use our approach to obtain the latent manifold representations
of the considered datasets. Subsequently, we use half of
the resulting representations (corresponding to all the modeled classes) for initial training of a (frame-level) multiclass
linear SVM classifier [22], and keep the rest for testing
of the obtained classifier. Apart from our method, we also
evaluate GPDM and GPLVM under the same experimental
setup. In addition, as a baseline, we evaluate the HMM-SVM
approach of [28] (presented with the original data, not the
latent encodings).
In Table III, we provide the obtained classification error
rates for the considered models. These results correspond
to optimal latent manifold dimensionality selection for the
LM2 GP, GPDM, and GPLVM methods, and constitute means
and variances over 10 repetitions of the experiment. As we
observe, our method outperforms the competition. We also
observe that GPDM and GPLVM yield essentially identical results; this fact indicates that GPDM does not capture
richer temporal dynamics in our data compared with GPLVM.
Another significant finding from these results is that the
state-of-the-art HMM-SVM approach yields a considerably
inferior performance compared with the evaluated latent variable models. This result seems to indicate that extracting
latent manifold representations of the modeled data is much
more effective for classification purposes than directly using
the observed data and assuming temporal dynamics in the
observations space. In addition, in Fig. 6, we show how model
performance changes with the number of latent features for
the LM2 GP, GPDM, and GPLVM methods. We observe that
all methods yield their optimal performance with 10 latent
features.
Furthermore, we turn to our deep learning scenario, examining how model performance changes if we add a second layer
of models (with the same latent dimensionality), fed with the
latent manifold representations generated by the original models. We present the output (latent manifold representations)
of the 2-layer models as input to a multiclass linear SVM
classifier, and we evaluate the classification performance of
the so-obtained classifier in the considered tasks.
In Fig. 7, we illustrate the classification error rates of the
resulting models as a function of latent manifold dimensionality. As we observe, two-layer deep learning architectures
yield a significant improvement over the corresponding single-layer architectures when the latent manifold dimensionality is
low; however, performance for high latent manifold dimensionality turns out to be similar in the cases of GPDM and
our model, and substantially worse in the case of GPLVM.
In our view, this result indicates that our model encodes
richer and more consistent temporal dynamics information
in the obtained latent manifold representations, which can
be combined in a hierarchical manner to extract even more
complex temporal dynamics.
Finally, the Student’s-t test on all pairs of performances
across the evaluated methods obtains here some very interesting results.
1) Our method yields statistically significant differences
from GPDM and GPLVM for latent manifold dimensionality between 10 and 20. However, for more or
less latent dimensions, the obtained differences are not
statistically significant.
2) GPDM and GPLVM are comparable, with no statistically significant differences whatsoever.
3) Models with two layers have statistically significant performance differences with respect to similar one-layer
models in cases of low latent manifold dimensionality.
However, these differences quickly become insignificant as the number of latent dimensions increases.

[TABLE IV: Supervised sequence segmentation: Honeybee. Error rates (%) for optimal latent manifold dimensionality.]
[Fig. 8. Supervised sequence segmentation: Honeybee. Error rate fluctuation with latent manifold dimensionality.]
[Fig. 9. Supervised sequence segmentation: Honeybee. Error rate fluctuation with latent manifold dimensionality for two-layer deep learning architectures.]

2) Honeybee Dataset: Furthermore, we perform additional
frame-level classification experiments using the Honeybee
dataset described in Section IV-A2. In our evaluations, we
adopt the experimental setup of the previous experiment. First,
we consider a shallow modeling scenario. In Table IV, we
provide the obtained classification error rates for the considered models. These results correspond to optimal latent
manifold dimensionality selection for the LM2 GP, GPDM,
and GPLVM methods, and are means and variances over
10 repetitions of the experiment.
As we observe, our method outperforms the competition.
We also observe that GPDM works better than GPLVM.
Another significant finding from these results is that the
state-of-the-art HMM-SVM approach yields again inferior
performance compared with the evaluated latent variable
models. This result corroborates our intuition that extracting
latent manifold representations of the modeled data is much
more effective for classification purposes than directly using
the observed data and assuming temporal dynamics in the
observations space. In addition, in Fig. 8, we show how model performance changes with the number of latent features for the LM$^2$GP, GPDM, and GPLVM methods. We observe a consistent performance improvement with latent manifold dimensionality. This is not unexpected, since we are dealing with models trained on low-dimensional data using some thousands of training points.

Furthermore, we turn to our deep learning scenario, examining how model performance changes if we add a second layer of models (with the same latent dimensionality), fed with the latent manifold representations generated by the original models. In Fig. 9, we show the classification error rates of the resulting models as a function of latent manifold dimensionality. As we observe, two-layer deep learning architectures yield improved performance for low latent manifold dimensionality (2-D latent space); however, any performance gains quickly disappear as latent dimensionality increases, with GPLVM performance eventually becoming inferior to the shallow modeling scenario, similar to the previous experiment.

Finally, the Student's-t test in this scenario finds that all pairs of differences are statistically significant, except for the case of one-layer versus two-layer model comparison, where the assumption of statistical significance is rejected for models with four latent dimensions.

C. Sequence-Level Classification

1) Learning Music-to-Dance Mappings: Finally, we evaluate our method in sequence-level classification (classification
of whole sequences). Initially, we consider the problem of
learning music-to-dance mappings. In this experiment, the
observed sequences presented to our model constitute the
chroma features extracted from a collection of music clips.
Chroma analysis [29] is an interesting and powerful representation for music audio in which the entire spectrum is projected
onto 12 bins representing the 12 distinct semitones (or chroma)
of the musical octave. Since, in music, notes exactly one
octave apart are perceived as particularly similar, knowing the
distribution of chroma even without the absolute frequency
(i.e., the original octave) can give useful musical information
about the audio, and may even reveal perceived musical
similarity that is not apparent in the original spectra [30].
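For reference, chroma features of this kind can be extracted with a standard audio library as in the sketch below; librosa is assumed to be available, and the clip path is a placeholder (it is not one of the files used in this paper).

```python
import librosa

# Load an audio clip (placeholder path) and compute a 12-bin chromagram, one vector per frame.
audio, sr = librosa.load("clip.wav", sr=22050)
chroma = librosa.feature.chroma_stft(y=audio, sr=sr, hop_length=512)   # shape: (12, n_frames)
frames = chroma.T                                                      # sequence of 12-D observations
```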
In our experiments, we use a dataset of 600 music
clips; they are split into sets of 100 clips pertaining to
each one of the dance classes: Waltz, Tango, Foxtrot, Cha
Cha, Quickstep, and Samba.1 We preprocess these clips
as described above to obtain chroma features; this way, we
eventually obtain sequences of 12-D observations, each one
35K–184K frames long. The clips are further segmented into
approximately 8K frames long subsequences to form our
dataset. We use half of our data for model training, and the rest
for testing. We train one model for each one of the six classes
we want to recognize. Classification of our test sequences
is conducted by computing the sequence predictive densities
with respect to the model of each class, and assigning each
sequence to the class the model of which yields the highest
predictive density. Apart from our method, we also evaluate
SB-HMMs [13], GPLVMs [4], and GPDMs [2] under the same
experimental setup.

¹Music clips were downloaded from http://www.ballroomdancers.com/Music/Default.asp?Tab=2; we converted them to WAV format for our experiments.

[TABLE V: Sequence-level classification: learning music-to-dance mappings. Error rates (%) for optimal latent manifold dimensionality.]
[TABLE VI: Sequence-level classification: bimanual gesture recognition. Error rates (%) for optimal latent manifold dimensionality.]
[Fig. 10. Sequence-level classification: learning music-to-dance mappings. Error rate fluctuation with latent manifold dimensionality.]
[Fig. 11. Sequence-level classification: bimanual gesture recognition. Error rate fluctuation with latent manifold dimensionality.]
The obtained error rates for optimal latent manifold dimensionality (wherever applicable) are depicted in Table V. These
results are means and variances over 10 repetitions of the
experiment. As we observe, our approach works much better
than the competition. We also observe that, in this experiment,
GPDM does actually work better than GPLVM. In Fig. 10,
we show how average model performance changes with latent
manifold dimensionality. It becomes apparent from this graph
that the modeled data contain a great deal of redundancy, since
all models yield a performance deterioration for high latent
space dimensionality. Finally, the Student’s-t test finds that all
performance differences are statistically significant.
2) Bimanual Gesture Recognition: Furthermore, we perform evaluations considering the problem of bimanual gesture recognition. For this purpose, we experiment with the
American Sign Language gestures for the words: against, aim,
balloon, bandit, cake, chair, computer, concentrate, cross, deaf,
explore, hunt, knife, relay, reverse, and role. The used dataset
was obtained from four different persons executing each one of
these gestures and comprises 40 videos per gesture; 30 of these
videos are used for training and the rest for model evaluation.
From this dataset, we extracted several features representing
the relative position of the hands and the face in the images,
as well as the shape of the respective skin regions, by means
of the complex Zernike moments [31], as described in [18].
This way, each used video comprises 1K–4K frames of 12-D
feature vectors used in our experiments.
For each one of these 16 gestures, we fitted one model to
recognize it. The obtained error rates (means and variances
over 10 experiment repetitions) for optimal latent manifold
dimensionality (wherever applicable) are depicted in Table VI.
As we observe, our approach works much better than the competition. Furthermore, in Fig. 11, we show how average model
performance changes with latent manifold dimensionality. We
observe that all models yield a performance increase for
moderate latent space dimensionality. Finally, the Student’s-t
test finds that all performance differences are statistically
significant.
V. CONCLUSION
In this paper, we proposed a method for sequential data
modeling that leverages the strengths of GPs, allowing for
more flexibly capturing temporal dynamics compared with
the existing nonparametric Bayesian approaches. Our method
considers a latent manifold representation of the modeled data,
and chooses to postulate a model of temporal dependencies
on this latent manifold. Temporal dependencies in our model
are captured through consideration of infinite-state Markovian
dynamics, and imposition of stick-breaking priors over the
entailed Markov chain probabilities. Inference for our model
was performed by means of an efficient variational Bayesian
algorithm.
As we showed through experimental evaluations, our
approach is suitable for unsupervised sequence segmentation
(frame-level clustering), supervised sequence segmentation
(frame-level classification), and whole sequence classification
(sequence-level operation). We evaluated our approach in all
these scenarios using real-world datasets, and observed that
our method yields very competitive results, outperforming
popular, recently proposed related approaches, e.g., GPDM
and GPLVM.
Finally, we examined whether our method can also be employed to obtain a deep learning architecture, by stacking multiple layers of LM2 GP models, each fed with the latent manifold representations generated by the previous layer (the first layer being fed with the observable data); specifically, we experimented with two-layer architectures. As we observed, our method seems to yield much more significant gains in such a scenario than GPDM- or GPLVM-based models, especially for low-dimensional manifold assumptions.
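To illustrate the data flow of this two-layer stacking scheme, we provide the following minimal sketch. The LatentLayer class is a crude PCA stand-in used only to show how the second layer is trained on the latent representations produced by the first; it is not the LM2 GP model proposed in this paper.

```python
# Sketch of the two-layer stacking scheme: layer 2 is trained on the latent
# representations generated by layer 1. LatentLayer is a simple PCA stand-in
# for an LM2GP layer, used here only to make the data flow explicit.
import numpy as np

class LatentLayer:
    """Placeholder layer: linear (PCA) projection standing in for LM2GP."""
    def __init__(self, latent_dim: int):
        self.latent_dim = latent_dim

    def fit(self, X: np.ndarray) -> "LatentLayer":
        self.mean_ = X.mean(axis=0)
        # Principal directions via SVD of the centered data.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.latent_dim]
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        return (X - self.mean_) @ self.components_.T

def fit_two_layer_stack(X: np.ndarray, dims=(10, 4)):
    """Train a two-layer stack, feeding layer 2 with layer-1 latent codes."""
    layer1 = LatentLayer(dims[0]).fit(X)      # layer 1: observed frames
    Z1 = layer1.transform(X)                  # latent manifold of layer 1
    layer2 = LatentLayer(dims[1]).fit(Z1)     # layer 2: trained on Z1
    return layer1, layer2

# Toy usage on randomly generated frames (stacked row-wise).
frames = np.random.randn(500, 12)
l1, l2 = fit_two_layer_stack(frames)
print(l2.transform(l1.transform(frames)).shape)  # (500, 4)
```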
Our future research endeavors mainly focus on addressing two open questions. The first concerns the possibility of sharing the precision matrices between close hidden states. A question that must be answered is what proximity criterion could be used for this purpose in the context of our model. Would, e.g., comparing the values of the latent vectors x_n generated from different states provide a valid proximity criterion? In such a case, what type of proximity measure would be most effective?
The second open question we want to investigate is the
possibility of obtaining stacked denoising autoencoders [32]
for sequential data modeling using our LM2 GP model as
the main building block. Stacked denoising autoencoders are deep learning architectures that, apart from extracting the most informative latent subspace representations of the observed data, are also capable of denoising them, with the observations considered to be contaminated with noise at each phase of the model training algorithm. GPLVM models have already been used
as building blocks for obtaining stacked denoising autoencoder architectures with great success [33]. However, existing
formulations are not designed for capturing and exploiting
temporal dependencies in the modeled data. In this paper, we
investigated the utility of LM2 GP as the main building block
for obtaining simple deep learning architectures, and obtained
some promising results. How would our model perform in
the context of a stacked denoising autoencoder framework?
This is a question that remains to be addressed in our future
work.
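To make the denoising principle concrete, the following sketch trains a single linear map on corrupted-to-clean input pairs with a closed-form (ridge-regression) solution; it is used purely to illustrate the idea of reconstructing clean data from noise-contaminated inputs, and it is not the GP-based construction discussed above.

```python
# Minimal sketch of the denoising principle underlying stacked denoising
# autoencoders [32]: inputs are corrupted with noise and a mapping is trained
# to reconstruct the clean inputs. A single linear layer with a closed-form
# ridge-regression solution is used here purely for illustration.
import numpy as np

def fit_linear_denoiser(X: np.ndarray, noise_std: float = 0.1,
                        ridge: float = 1e-3) -> np.ndarray:
    """Learn W minimizing ||X - X_noisy W||^2 + ridge * ||W||^2."""
    X_noisy = X + noise_std * np.random.randn(*X.shape)  # corrupt the inputs
    d = X.shape[1]
    return np.linalg.solve(X_noisy.T @ X_noisy + ridge * np.eye(d),
                           X_noisy.T @ X)

# Toy usage: the learned map approximately recovers the clean data.
X = np.random.randn(1000, 12)
W = fit_linear_denoiser(X)
X_new_noisy = X + 0.1 * np.random.randn(*X.shape)
print("mean reconstruction error:", np.mean((X_new_noisy @ W - X) ** 2))
```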
We shall publish the source code pertaining to our method at:
http://www.cut.ac.cy/eecei/staff/sotirios.chatzis/?languageId=2.
A PPENDIX
In this Appendix, we provide, for the sake of completeness,
the expressions of the derivatives of the variational free
energy L(q) over the kernel hyperparameters, required for
hyperparameter optimization by means of L-BFGS, as well
as the expressions of the derivatives of q(x n ) with respect
to the latent vectors x n , required for HMC sampling
from q(x n ).
Regarding the derivatives of L(q) with respect to the kernel hyperparameters, say ϕ_d, d = 1, . . . , D, we have

$$
\frac{\partial \mathcal{L}(q)}{\partial \varphi_d} = -\frac{1}{2}\left\langle \left(\hat{\mu}_d \hat{\mu}_d^{T} + \hat{\Sigma}_d\right) \frac{\partial K_d(X,X)^{-1}}{\partial \varphi_d} \right\rangle_{q(X)} + \frac{1}{2}\left\langle K_d(X,X)\, \frac{\partial K_d(X,X)^{-1}}{\partial \varphi_d} \right\rangle_{q(X)} \tag{73}
$$
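For illustration of how these derivatives are consumed, the following sketch shows a generic L-BFGS optimization loop over kernel hyperparameters using SciPy; the objective and gradient below are hypothetical placeholders (a simple quadratic surrogate) standing in for an implementation of −L(q) and of the negative of (73), so that the example runs as-is.

```python
# Generic L-BFGS hyperparameter optimization sketch. neg_free_energy() and
# neg_free_energy_grad() are placeholders (a quadratic surrogate) for -L(q)
# and for the negative of the gradient in (73).
import numpy as np
from scipy.optimize import minimize

def neg_free_energy(log_phi: np.ndarray) -> float:
    # Placeholder objective: would evaluate -L(q) at phi = exp(log_phi).
    return float(np.sum((log_phi - 1.0) ** 2))

def neg_free_energy_grad(log_phi: np.ndarray) -> np.ndarray:
    # Placeholder gradient: would evaluate the negative of (73),
    # chain-ruled through the exp() reparameterization.
    return 2.0 * (log_phi - 1.0)

# Optimize in log-space so the kernel hyperparameters remain positive.
log_phi0 = np.zeros(3)
result = minimize(neg_free_energy, log_phi0, jac=neg_free_energy_grad,
                  method="L-BFGS-B")
print("optimized hyperparameters:", np.exp(result.x))
```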
Similarly, the derivatives of log q(x_n) with respect to the latent vectors x_n yield

$$
\frac{\partial \log q(x_n)}{\partial x_n} = \sum_{c} q(z_{nc}=1)\left(\tilde{\omega}_c \tilde{m}_c - x_n^{T}\tilde{\omega}_c\right) - \frac{1}{2}\sum_{d}\left(\hat{\mu}_d \hat{\mu}_d^{T} + \hat{\Sigma}_d\right)\frac{\partial K_d(X,X)^{-1}}{\partial x_n} + \frac{1}{2}\sum_{d} K_d(X,X)\,\frac{\partial K_d(X,X)^{-1}}{\partial x_n} \tag{74}
$$
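The gradient in (74) drives the leapfrog integrator of the HMC sampler. The following is a minimal, self-contained HMC sketch in which log_q() and grad_log_q() are toy stand-ins (a standard normal density), not the actual q(x_n) of our model.

```python
# Minimal Hamiltonian Monte Carlo sketch illustrating how the gradient in
# (74) would be used. log_q() and grad_log_q() are toy stand-ins.
import numpy as np

def log_q(x):        # stand-in for log q(x_n)
    return -0.5 * np.sum(x ** 2)

def grad_log_q(x):   # stand-in for the gradient in (74)
    return -x

def hmc_step(x, step_size=0.1, n_leapfrog=20, rng=np.random.default_rng()):
    """One HMC transition targeting the density exp(log_q)."""
    p = rng.standard_normal(x.shape)               # sample auxiliary momentum
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_q(x_new)   # initial half momentum step
    for _ in range(n_leapfrog):
        x_new += step_size * p_new                 # full position step
        p_new += step_size * grad_log_q(x_new)     # full momentum step
    p_new -= 0.5 * step_size * grad_log_q(x_new)   # undo the extra half step
    # Metropolis acceptance based on the change in the Hamiltonian.
    h_old = -log_q(x) + 0.5 * np.sum(p ** 2)
    h_new = -log_q(x_new) + 0.5 * np.sum(p_new ** 2)
    return x_new if np.log(rng.uniform()) < h_old - h_new else x

# Toy usage: draw 1000 samples from the stand-in density.
x = np.zeros(3)
samples = []
for _ in range(1000):
    x = hmc_step(x)
    samples.append(x)
print("sample mean:", np.mean(samples, axis=0))
```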
R EFERENCES
[1] C. M. Bishop, Neural Networks for Pattern Recognition. Phoenix, AZ,
USA: Clarendon, 1995.
[2] J. M. Wang, D. J. Fleet, and A. Hertzmann, “Gaussian process dynamical
models for human motion,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 30, no. 2, pp. 283–298, Feb. 2008.
[3] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine
Learning. Cambridge, MA, USA: MIT Press, 2006.
[4] N. Lawrence, “Probabilistic non-linear principal component analysis
with Gaussian process latent variable models,” J. Mach. Learn. Res.,
vol. 6, pp. 1783–1816, Nov. 2005.
[5] M. F. Moller, “A scaled conjugate gradient algorithm for fast supervised
learning,” Neural Netw., vol. 6, no. 4, pp. 525–533, 1993.
[6] J. Sethuraman, “A constructive definition of the Dirichlet prior,” Statist.
Sinica, vol. 2, no. 2, pp. 639–650, 1994.
[7] T. Ferguson, “A Bayesian analysis of some nonparametric problems,”
Ann. Statist., vol. 1, no. 2, pp. 209–230, 1973.
[8] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, “An introduction to variational methods for graphical models,” in Learning in
Graphical Models, M. Jordan, Ed. Boston, MA, USA: Kluwer, 1998,
pp. 105–162.
[9] D. M. Blei and M. I. Jordan, “Variational inference for Dirichlet process
mixtures,” Bayesian Anal., vol. 1, no. 1, pp. 121–144, 2006.
[10] D. Blackwell and J. MacQueen, “Ferguson distributions via Pólya urn
schemes,” Ann. Statist., vol. 1, no. 2, pp. 353–355, 1973.
[11] M. Tipping and C. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural Comput., vol. 11, no. 2, pp. 443–482,
1999.
[12] J. Zhao and Q. Jiang, “Probabilistic PCA for t distributions,”
Neurocomputing, vol. 69, nos. 16–18, pp. 2217–2226, Oct. 2006.
[13] J. Paisley and L. Carin, “Hidden Markov models with stick-breaking
priors,” IEEE Trans. Signal Process., vol. 57, no. 10, pp. 3905–3917,
Oct. 2009.
[14] Y. Qi, J. W. Paisley, and L. Carin, “Music analysis using hidden
Markov mixture models,” IEEE Trans. Signal Process., vol. 55, no. 11,
pp. 5209–5224, Nov. 2007.
[15] D. Chandler, Introduction to Modern Statistical Mechanics. New York,
NY, USA: Oxford University Press, 1987.
[16] D. Blei and M. Jordan, “Variational methods for the Dirichlet process,”
in Proc. 21st Int. Conf. Mach. Learn., New York, NY, USA, Jul. 2004,
pp. 12–19.
[17] P. Muller and F. Quintana, “Nonparametric Bayesian data analysis,”
Statist. Sci., vol. 19, no. 1, pp. 95–110, 2004.
[18] S. P. Chatzis, D. I. Kosmopoulos, and T. A. Varvarigou, “Robust sequential data modeling using an outlier tolerant hidden Markov model,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 31, no. 9, pp. 1657–1669, Sep.
2009.
[19] L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[20] R. M. Neal, “Probabilistic inference using Markov chain Monte Carlo
methods,” Dept. Comput. Sci., Univ. Toronto, Tech. Rep. CRG-TR-93-1,
1993.
[21] D. Liu and J. Nocedal, “On the limited memory BFGS method for large scale optimization,” Math. Program. B, vol. 45, no. 3, pp. 503–528, 1989.
[22] V. N. Vapnik, Statistical Learning Theory. New York, NY, USA: Wiley,
1998.
[23] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin,
“LIBLINEAR: A library for large linear classification,” J. Mach. Learn.
Res., vol. 9, pp. 1871–1874, Jan. 2008.
[24] A. Voulodimos et al., “A threefold dataset for activity and workflow
recognition in complex industrial environments,” IEEE MultiMedia,
vol. 19, no. 3, pp. 42–52, Jul./Sep. 2012.
[25] D. Kosmopoulos and S. Chatzis, “Robust visual behavior recognition,” IEEE Signal Process. Mag., vol. 27, no. 5, pp. 34–45,
Sep. 2010.
[26] S. M. Oh, J. M. Rehg, T. Balch, and F. Dellaert, “Learning and
inferring motion patterns using parametric segmental switching linear
dynamic systems,” Int. J. Comput. Vis., vol. 77, nos. 1–3, pp. 103–124,
2008.
[27] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky, “Nonparametric Bayesian learning of switching linear dynamical systems,” in Proc.
Neural Inf. Process. Syst., 2009, pp. 1–3.
[28] Y. Altun, I. Tsochantaridis, and T. Hofmann, “Hidden Markov support
vector machines,” in Proc. ICML, 2004, pp. 1–9.
[29] M. A. Bartsch and G. H. Wakefield, “To catch a chorus: Using chromabased representations for audio thumbnailing,” in Proc. IEEE Workshop
Appl. Signal Process. Audio Acoust., Jun. 2001, pp. 15–18.
[30] H. Jensen, M. G. Christensen, D. Ellis, and S. H. Jensen, “A tempoinsensitive distance measure for cover song identification based on
chroma features,” in Proc. ICASSP, 2008, pp. 2209–2212.
[31] R. Mukundan and K. R. Ramakrishnan, Moment Functions in Image
Analysis: Theory and Applications. Singapore: World Sci., 1998.
[32] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol,
“Stacked denoising autoencoders: Learning useful representations in a
deep network with a local denoising criterion,” J. Mach. Learn. Res.,
vol. 11, pp. 3371–3408, Jan. 2010.
[33] J. Snoek, R. P. Adams, and H. Larochelle, “Nonparametric guidance of
autoencoder representations using label information,” J. Mach. Learn.
Res., vol. 13, pp. 2567–2588, Sep. 2012.
Sotirios P. Chatzis received the M.Eng. (Hons.)
degree in electrical and computer engineering and
the Ph.D. degree in machine learning from the
National Technical University of Athens, Athens,
Greece, in 2005 and 2008, respectively.
He was a Post-Doctoral Fellow with the University
of Miami, Coral Gables, FL, USA, from 2009 to
2010. He was a Post-Doctoral Researcher with the
Department of Electrical and Electronic Engineering, Imperial College London, London, U.K., from
2010 to 2012. He is currently an Assistant Professor
with the Department of Electrical Engineering, Computer Engineering and
Informatics, Cyprus University of Technology, Limassol, Cyprus. He has
authored more than 40 papers in the most prestigious journals and conferences
of the research field in his first seven years as a Researcher. His current
research interests include machine learning theory and methodologies with
a special focus on hierarchical Bayesian models, Bayesian nonparametrics,
quantum statistics, and neuroscience. His Ph.D. research was supported by
the Bodossaki Foundation, Greece, and the Greek Ministry for Economic
Development.
Dr. Chatzis was a recipient of the Dean’s scholarship for Ph.D. studies,
being the best performing Ph.D. student of the class.
Dimitrios Kosmopoulos received the B.Eng. degree in
electrical and computer engineering and the Ph.D.
degree from the National Technical University of
Athens, Athens, Greece, in 1997 and 2002, respectively.
He was a Research Assistant Professor with the
Department of Computer Science, Rutgers University, New Brunswick, NJ, USA, and an Adjunct
Assistant Professor with the Department of Computer Science and Engineering, University of Texas
at Arlington, Arlington, TX, USA. His former collaborations include the Institute of Informatics and Telecommunications,
National Center for Scientific Research Demokritos, Athens, and the National
Technical University of Athens. In the same period, he was with the University
of Central Greece, Lamia, Greece, University of Peloponnese, Sparti, Greece,
and the Technical Educational Institute of Athens, Egaleo, Greece. He is
currently an Assistant Professor with the Department of Applied Informatics
and Multimedia, Technical Educational Institute of Crete, Heraklion, Greece.
He has published more than 70 papers in the field of computer/robotic vision,
machine learning, and signal processing.