A Latent Manifold Markovian Dynamics Gaussian Process

Sotirios P. Chatzis and Dimitrios Kosmopoulos

Abstract— In this paper, we propose a Gaussian process (GP) model for the analysis of nonlinear time series. Formulation of our model is based on the consideration that the observed data are functions of latent variables, with the associated mapping between observations and latent representations modeled through GP priors. In addition, to capture the temporal dynamics in the modeled data, we assume that subsequent latent representations depend on each other on the basis of a hidden Markov prior imposed over them. Derivation of our model is performed by marginalizing out the model parameters in closed form, using GP priors for the observation mappings and appropriate stick-breaking priors for the latent variable (Markovian) dynamics. This way, we eventually obtain a nonparametric Bayesian model for dynamical systems that accounts for uncertainty in the modeled data. We provide efficient inference algorithms for our model on the basis of a truncated variational Bayesian approximation. We demonstrate the efficacy of our approach on a number of applications dealing with real-world data, and compare it with related state-of-the-art approaches.

Index Terms— Gaussian process (GP), latent manifold, Markovian dynamics, stick-breaking process, variational Bayes.

Manuscript received July 21, 2013; revised February 24, 2014; accepted March 8, 2014. Date of publication March 21, 2014; date of current version December 16, 2014. S. P. Chatzis is with the Department of Electrical and Computer Engineering, Cyprus University of Technology, Limassol 3036, Cyprus (e-mail: soteri0s@me.com). D. Kosmopoulos is with the Technological Educational Institute of Crete, Heraklion 71004, Greece (e-mail: dkosmo@gmail.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2014.2311073

I. INTRODUCTION

THERE is a wide variety of generative models used to perform analysis of nonlinear time series [1]. Approaches based on hidden Markov models (HMMs) and linear dynamical systems (LDS) are quite ubiquitous in the current literature due to their simplicity, efficiency, and generally satisfactory performance in many applications. More expressive models, such as switching LDS and nonlinear dynamical systems (NLDS), have also been proposed; however, these approaches face difficulties in their learning and inference algorithms, due to the large number of parameters that must be estimated and the correspondingly large amounts of training data required [1]. Recently, a nonparametric Bayesian approach designed to resolve these issues of NLDS, namely the Gaussian process dynamical model (GPDM), was introduced in [2]. This approach is fully defined by a set of low-dimensional representations of the observed data, with both the observation and dynamical processes learned by means of Gaussian process (GP) regression [3]. This GP-based formulation of the model gives rise to a nonparametric Bayesian nature, which removes the need to select a large number of parameters associated with function approximators, while retaining the power of nonlinear dynamics and observation.
GPDM is essentially an extension of the GP latent variable model (GPLVM) of [4], which models the joint distribution of the observed data and their representation in a low-dimensional latent space through a GP prior. GPDM extends GPLVM by augmenting it with a model of temporal dynamics captured through imposition of a dedicated GP prior. This way, GPDM allows for not only obtaining predictions about future data, but also regularizing the latent space to allow for more effective modeling of temporal dynamics. Despite the merits of GPDM, a significant drawback of this model consists in the need of its inference algorithm to obtain maximum a posteriori (MAP) estimates of its parameters through type-II maximum-likelihood [2] (performed by means of scaled conjugate gradient descent [5]). This formulation poses a significant bottleneck to GPDM, due to both the entailed high computational costs, as well as the possibility of obtaining bad estimates due to the algorithm getting stuck to poor local maxima in cases of limited training datasets. In addition, to increase computational efficiency, GPDM imposes an oversimplistic spherical Gaussian prior over its model of temporal dynamics, which probably undermines its data modeling capacity. Finally, the use of GP priors to describe the temporal dynamics between the latent variables of the model leads to significant computational overheads, as it gives rise to calculations that entail inverting very large Gram matrices [3]. To resolve these issues, in this paper, we propose a flexible generative model for modeling sequential data by means of nonparametric component densities. Formulation of our proposed model is based on the assumption that, when modeling sequential data, each observation in a given sequence is related to a vector in a latent space, and is generated through a latent nonlinear function that maps the latent space to the space of observations. We use a GP prior to infer this unknown mapping function from the data in a flexible manner. In addition, the latent vectors that generate the observed sequential data are assumed to possess strong temporal interdependencies; to capture these dependencies, we assume that these latent variables are generated from an HMM in the manifold of latent variables. Specifically, we assume a latent space HMM with infinite hidden states, and use flexible stick-breaking priors [6], [7] to infer its hidden state dynamics; 2162-237X © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. CHATZIS AND KOSMOPOULOS: LATENT MANIFOLD MARKOVIAN DYNAMICS GP 71 base distribution G 0 with probability (α/α + M − 1) or 2) is selected from the existing draws, according to a multinomial allocation, with probabilities proportional to the number of the previous draws with the same allocation [10]. Let {c }C c=1 M−1 be the set of distinct values taken by the variables {∗m }m=1 . M−1 Denoting as νcM−1 the number of values in {∗m }m=1 that M−1 equals to c , the distribution of ∗M given {∗m }m=1 can be shown to be of the form [10] M−1 , α, G 0 ) p(∗M |{∗m }m=1 Fig. 1. = Intuitive illustration of the generative construction of our model. this formulation allows us to automatically determine the required number of states in this latent manifold HMM in a data-driven fashion. We dub our approach the latent manifold hidden Markov GP (LM2 GP) model. 
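For intuition, the generative construction just outlined (cf. Fig. 1) can be simulated directly. The sketch below is our own illustration, not the authors' implementation: it replaces the infinite stick-breaking HMM with a small finite number of states, draws the latent coordinates from state-conditional Gaussians with isotropic covariances, and then draws each observed dimension from a zero-mean GP with an RBF kernel evaluated at the latent coordinates, plus white noise. All sizes, kernel choices, and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper).
N, Q, D, C = 200, 2, 5, 3          # frames, latent dims, observed dims, HMM states

# 1) Markovian dynamics on the latent manifold: a C-state hidden Markov chain.
pi = np.full(C, 1.0 / C)                         # initial state probabilities
A = np.full((C, C), 0.05 / (C - 1))              # "sticky" transition matrix
np.fill_diagonal(A, 0.95)
z = np.empty(N, dtype=int)
z[0] = rng.choice(C, p=pi)
for n in range(1, N):
    z[n] = rng.choice(C, p=A[z[n - 1]])

# 2) State-conditional Gaussian emission of the latent coordinates x_n.
mu = rng.normal(scale=3.0, size=(C, Q))          # state means
X = mu[z] + rng.normal(scale=0.3, size=(N, Q))   # isotropic state covariances

# 3) GP mapping from the latent manifold to each observed dimension,
#    f_d ~ N(0, K(X, X)), plus white observation noise of precision beta.
def rbf_kernel(A_, B_, ell=1.0, sf2=1.0):
    d2 = ((A_[:, None, :] - B_[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

K = rbf_kernel(X, X) + 1e-6 * np.eye(N)          # jitter for numerical stability
beta = 100.0
Y = np.column_stack([
    rng.multivariate_normal(np.zeros(N), K)
    + rng.normal(scale=beta ** -0.5, size=N)
    for _ in range(D)
])
print(z[:20], Y.shape)                           # sampled state path prefix and (N, D) data
```

Inference in the actual model works in the opposite direction: given only Y, it recovers the latent coordinates, the mapping, and the (unbounded) state dynamics.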
Inference for our model is performed by means of an efficient truncated variational Bayesian algorithm [8], [9]. We evaluate the efficacy of our approach in several applications, dealing with sequential data clustering, classification, and generation. A high-level illustration of the conceptual configuration of our model is shown in Fig. 1. The remainder of this paper is organized as follows. In Section II, we provide a brief presentation of the theoretical background of the proposed method. Initially, we present the Dirichlet process (DP) and its function as a prior in nonparametric Bayesian models; furthermore, we provide a brief summary of the GPDM model. In Section III, we introduce the proposed LM2 GP model, and derive efficient model inference algorithms based on the variational Bayesian framework. In Section IV, we conduct the experimental evaluation of our proposed model, considering a number of applications dealing with several real-world datasets, and we compare its performance with state-of-the-art-related approaches. Finally, in Section V, we conclude and summarize this paper. C  ν M−1 α c G0 + δ α+ M −1 α+M −1 c (3) c=1 where δc denotes the distribution concentrated at a single point c . These results illustrate two key properties of the DP scheme. First, the innovation parameter α plays a key role in determining the number of distinct parameter values. A larger α induces a higher tendency of drawing new parameters from the base distribution G 0 ; indeed, as α → ∞, we get G → G 0 . M tend to cluster to a single Conversely, as α → 0, all {∗m }m=1 random variable. Second, the more often a parameter is shared, the more likely it will be shared in the future. A characterization of the (unconditional) distribution of the random variable G drawn from a DP, DP(α, G 0 ), is provided by the stick-breaking construction in [6]. Consider two infinite collections of independent random variables v = [v c ]∞ c=1 and {c }∞ , where the v are drawn from a Beta distribution, and c c=1 the c are independently drawn from the base distribution G 0 . The stick-breaking representation of G is then given by G= ∞  ̟c (v)δc (4) c=1 where p(v c ) = Beta(1, α) c−1  (1 − v j ) ∈ [0, 1] ̟c (v) = v c (5) (6) j =1 and ∞  II. P RELIMINARIES A. Dirichlet Process c=1 DP models were first introduced in [7]. A DP is characterized by a base distribution G 0 and a positive scalar α, usually referred to as the innovation parameter, and is denoted as DP(α, G 0 ). Essentially, a DP is a distribution placed over a distribution. Let us suppose we randomly draw a sample distribution G from a DP, and, subsequently, we independently M from G draw M random variables {∗m }m=1 G|α, G 0 ∼ DP(α, G 0 ) ∗m |G ∼ G, m = 1, . . . , M. (1) (2) Integrating out G, the joint distribution of the variables M {∗m }m=1 can be shown to exhibit a clustering effect. SpecifM−1 ically, given the first M − 1 samples of G, {∗m }m=1 , it can ∗ be shown that a new sample  M is either 1) drawn from the ̟c (v) = 1. (7) Under the stick-breaking representation of the DP, the atoms c , drawn independently from the base distribution G 0 , can be seen as the parameters of the component distributions of a mixture model comprising an unbounded number of component densities, with mixing proportions ̟c (v). B. GP Dynamical Model 1) GP Models: Let us begin with a brief description of GP regression. 
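Before proceeding, the stick-breaking construction in (4)-(7) above is straightforward to simulate. The short sketch below is our own illustration, with arbitrary values of the innovation parameter α and a truncation at C = 50 components; it shows how α controls how many components carry appreciable weight (the atoms themselves would be drawn i.i.d. from the base distribution G0).

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(alpha, C, rng):
    """Truncated stick-breaking draw: w_c = v_c * prod_{j<c}(1 - v_j), v_c ~ Beta(1, alpha)."""
    v = rng.beta(1.0, alpha, size=C)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # length of stick left before break c
    return v * remaining

for alpha in (0.5, 2.0, 10.0):
    w = stick_breaking(alpha, C=50, rng=rng)
    # Small alpha concentrates mass on a few components; large alpha spreads it out.
    print(f"alpha={alpha:5.1f}  total mass in 50 sticks={w.sum():.4f}  "
          f"components with weight > 0.01: {(w > 0.01).sum()}")
```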
Consider an observation space X ; a GP f (x), x ∈ X , is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution [3]. We typically use the notation f (x) ∼ GP(m(x), k(x, x)) (8) 72 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 1, JANUARY 2015 where m(x) is the mean function of the process and k(x, x ′ ) is the covariance function of the process. Usually, for simplicity, and without any loss of generality, the mean of the process is taken to be zero, m(x) = 0, although this is not necessary. Concerning selection of the covariance function, a large variety of kernel functions k(x, x ′ ) might be employed, depending on the application considered [3]. Eventually, the real process f (x) drawn from a GP with mean zero and kernel function k(x, x ′ ) follows a Gaussian distribution with p( f |x) = N ( f |0, k(x, x)). (9) Let us suppose a set of independent and identically distributed samples D = {(x i , yi )|i = 1, . . . , N}, with the d-dimensional variables x i being the observations related to a modeled phenomenon, and the scalars yi being the associated target values. The goal of a regression model is, given a new observation x ∗ , to predict the corresponding target value y∗ , based on the information contained in the training set D. The basic notion behind GP regression consists in the assumption that the observable (training) target values y in a considered regression problem can be expressed as the superposition of a GP over the input space X , f (x), and an independent white Gaussian noise y = f (x) + ǫ (10) where f (x) is given by (8), and   p(ǫ) = N ǫ|0, σ 2 . (11) Under this regard, the joint normality of the training target N values Y = [yi ]i=1 and some unknown target value y∗ , approximated by the value f ∗ of the postulated GP evaluated at the observation point x ∗ , is a Gaussian of the form [3]     Y k(x ∗ ) K (X, X) + σ 2 I N (12) N 0, f∗ k(x ∗ )T k(x ∗ , x ∗ ) where k(x ∗ )  [k(x 1 , x ∗ ), . . . , k(x N , x ∗ )]T (13) N , I X = [x i ]i=1 N is the N × N identity matrix, and K is the matrix of the covariances between the N training data points (Gram matrix) ⎤ ⎡ k(x 1 , x 1 ) k(x 1 , x 2 ) . . . k(x 1 , x N ) ⎢ k(x 2 , x 1 ) k(x 2 , x 2 ) . . . k(x 2 , x N ) ⎥ ⎥ ⎢ K (X, X)  ⎢ ⎥. (14) .. .. .. ⎦ ⎣ . . . k(x N , x 1 ) k(x N , x 2 ) . . . k(x N , x N ) From (12), and conditioning on the available training samples, we can derive the expression of the model predictive distribution, yielding (15) p( f ∗ |x ∗ , D) = N ( f ∗ |µ∗ , σ∗2 ) T 2 −1 (16) µ∗ = k(x ∗ ) (K (X, X) + σ I N ) Y −1  2 2 T 2 σ∗ = σ − k(x ∗ ) K (X, X) + σ I N × k(x ∗ ) + k(x ∗ , x ∗ ). (17) 2) GP Latent Variable Models: Building upon the GP model, the GPLVM is essentially a GP, and the input variables x of which are considered latent variables rather than observed ones. Specifically, GPLVM considers that the y ∈ R D are observed multidimensional variables, while the latent vectors x are variables lying in some lower dimensional manifold that generates the observed variables y through an unknown latent mapping f , modeled through a GP. This way, we have p( y|x) = D  d=1 N (yd |0, kd (x, x)) (18) N , obtains which, considering a set of observations Y = [ yn ]n=1 p(Y |X) = D  d=1 N (Yd |0, K d (X, X)) (19) where kd (·, ·) is the kernel of the model for the dth observed N N dimension, X = [x n ]n=1 , Yd = [ynd ]n=1 , and K d (X, X) is the Gram matrix (with inputs X) corresponding to the dth dimension of the modeled data. 
GPLVM learns the values of the latent variables x corresponding to the observed data y through the maximization of the model marginal likelihood, i.e., in a MAP fashion. As shown in [4], GPLVM can be viewed as a GP-based nonparametric Bayesian extension of probabilistic principal component analysis [11], [12], a popular method for latent manifold modeling of high-dimensional data. 3) Dynamic GPLVMs: Inspired from GPLVM, GPDM performs modeling of sequential data by considering a GP prior as in GPLVM, and introducing an additional model of temporal interdependencies between successive latent vectors x. SpecifN , ically, considering a sequence of observed data Y = [ yn ]n=1 D where yn = [ynd ]d=1 , with corresponding latent manifold N , x n ∈ R Q , GPDM models the projections X = [x n ]n=1 dependencies between the latent and observed data in (19), while also considering a GP-based model of the interdependencies between the latent vectors x n . In detail, this latter model initially postulates p(X) = p(x 1 ) ˆ  N n=2 p(x n |x n−1 ; A) p( A)d A (20) which, assuming p(x n |x n−1 ; A) = N (x n | Ax n−1 , σ 2 ) (21) and a simplistic isotropic Gaussian prior on the columns of the parameters matrix A, yields a GP prior of the form p(x 1 ) p(X) =  Q(N−1) (2π) |(X, X)| Q  1 T × exp − tr((X, X)−1 X 2:N X 2:N ) 2 (22) CHATZIS AND KOSMOPOULOS: LATENT MANIFOLD MARKOVIAN DYNAMICS GP where (X, X) is the Gram matrix (X, X) ⎡ k(x 1 , x 1 ) ⎢ k(x 2 , x 1 ) ⎢ ⎢ .. ⎣ . k(x 1 , x 2 ) . . . k(x 2 , x 2 ) . . . .. . k(x N−1 , x 1 ) k(x N−1 , x 2 ) . . . k(x 1 , x N−1 ) k(x 2 , x N−1 ) .. . ⎤ ⎥ ⎥ ⎥. ⎦ k(x N−1 , x N−1 ) (23) Estimation of the set of (dynamically interdependent) latent variables X of the GPDM is performed by means of type-II maximum-likelihood, similar to GPLVM. As we observe, GPDM utilizes a rather simplistic modeling approach for the dependencies between the latent variables x, which is based on a linear model of dependencies, as in (21), combined with a rather implausible spherical prior over A. This model formulation is clearly restrictive, and limits the level of complexity of the temporal structure that GPDM can capture. In addition, the requirement of MAP estimation of the latent variables x ∈ R Q results in high computational requirements. This is aggravated even more by the form of (22), which gives rise to computations entailing inverting the Gram matrix (X, X), in addition to computing the inverse of the matrices K d (X, X). Our proposed model is designed to ameliorate these issues, as we shall discuss next. III. P ROPOSED A PPROACH where f d is the vector of the f d (x n ) ∀n, i.e., f d  N [ f d (x n )]n=1 , and K d (X, X) is the Gram matrix pertaining to the dth dimension of the observations space, with kernel hyperparameters set ψd . Subsequently, we consider that the latent manifold projections x are generated by an HMM comprising infinite states. N,∞ Let us introduce the set of variables Z = {z nc }n,c=1 , with z nc = 1, if x n is considered to be generated from the cth HMM model state, z nc = 0 otherwise. Then, our introduced generative model for the latent vectors x n comprises the assumptions [13] (26) p(x n |z nc = 1) = N (x n |µc , R −1 c ) ̟ ) = ̟ (v ), n > 1 (27) p(z n j = 1|z n−1,i = 1; v ̟ j i i ̟ j (v ̟ i ) = v i̟j j −1  ̟ (1 − v ik ) ∈ [0, 1] k=1 π p(z 1i = 1|; v ) = πi (v π ) i−1  (1 − v πj ) ∈ [0, 1] πi (v π ) = v iπ Let us consider a sequence of observations with strong N temporal interdependencies Y = [ yn ]n=1 . 
Similar to GPLVM, we assume a latent lower dimensional manifold that generates the observation space, and a smooth nonlinear mapping from the latent space to the observations space. To infer this nonlinear mapping, we postulate a GP prior, similar to GPLVM. In addition though, we also postulate a model capable of capturing the temporal dynamics in the modeled data. For this purpose, we prefer to work in the lower dimensional manifold of the latent vectors generating our observations. Specifically, we consider a generative nonparametric Bayesian model for these latent coordinates: we assume that they are generated from an HMM comprising infinite (latent) states. To make formulation of this latent manifold infinite-state HMM possible, we impose suitable stick-breaking priors over its state-transition dynamics, similar to [13]. In particular, formulation of our model commences by considering a smooth nonlinear (generative) mapping of the latent vectors x n to the observed ones yn , described by means of a GP, with the addition of some additive white noise. Given these assumptions, we yield a likelihood function of the form  p yn |x n = D  d=1 N (ynd | f d (x n ), β −1 ) (24) where β is the white noise precision, and f d (x n ) is modeled by means of a GP prior, such that p( f d |X) = N ( f d |0, K d (X, X)) (25) (28) (29) (30) j =1 with ∞  ̟ j (v ̟ i ) = 1 ∀i (31) π j (v π ) = 1. (32) j =1 ∞  j =1 A. Model Definition  73 ∞ ̟ ̟ ∞ For the stick variables {v ̟ i }i=1 , where v i = [v i j ] j =1 , and v π = [v iπ ]∞ i=1 , we impose Beta priors [13] of the form p(v i̟j ) = Beta(1, αi̟ ) ∀ j p(v iπ ) = Beta(1, α π ) ∀i. (33) (34) In essence, this parameterization of the imposed Beta priors gives rise to a DP-based prior construction for the transition dynamics of our model [6]. Finally, due to the effect of the innovation parameters on the number of effective latent Markov chain states, we also impose Gamma priors over them p(αi̟ ) = G(αi̟ |η1̟ , η2̟ ) π p(α ) = G(α π |η1π , η2π ). (35) (36) We also impose suitable conjugate priors over the mean and precision parameters of the (Markov chain-)state-conditional likelihoods of the latent vectors. Specifically, we choose normal-Wishart priors, yielding p(µc , R c ) = N W(µc , R c |λc , mc , ωc , c ). (37) This completes the definition of our proposed LM2 GP model. A plate diagram of our model is shown in Fig. 2. B. Inference Algorithm Inference for nonparametric models can be conducted under a Bayesian setting, typically by means of variational Bayes [9], or Markov chain Monte Carlo techniques [14]. Here, we prefer 74 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 1, JANUARY 2015 is nonnegative, L(q) forms a strict lower bound of the log evidence, and would become exact if q(W ) = p(W |, Y ). Hence, by maximizing this lower bound L(q) (variational free energy) so that it becomes as tight as possible, not only do we minimize the KL-divergence between the true and the variational posterior, but we also implicitly integrate out the unknowns W [8]. To facilitate variational Bayesian inference for our model, we assume that the posterior factorizes similar to the prior of our model (mean-field approximation) [15], [16]. This way, the variational free energy L(q) yields (ignoring constant terms) Fig. 2. Plate diagram representation of the LM2 GP model. The arrows represent conditional dependencies. The plates indicate independent copies for states i and j. 
a variational Bayesian approach, due to its considerably better scalability in terms of computational costs, which becomes of major importance when having to deal with large data corpora. Our variational Bayesian inference algorithm for the LM2 GP model comprises derivation of a family of variational posterior distributions q(.), which approximate the true posπ terior distribution over the infinite sets Z , {µc , R c }∞ c=1 , v , ∞ , and {α ̟ }∞ , as well the parameters α π , the latent {v ̟ } i i=1 i i=1 N , and the latent mapping function values f d = [ f d (x n )]n=1 N mappings X = [x n ]n=1 . Apparently, Bayesian inference is not tractable under this setting, since we are dealing with an infinite number of parameters. For this reason, we employ a common strategy in the literature of Bayesian nonparametrics, formulated on the basis of a truncation of the stick-breaking process [9]. Specifically, we fix a value C and we let the variational posterior over ̟ = 1) = 1, the v i̟j , and the v iπ have the property q(v iC π ∀i = 1, . . . , C, and q(v C = 1) = 1. In other words, we set πc (v π ) and ̟c (v ̟ i ) equal to zero for c > C. Note that, under this setting, the treated LM2 GP model involves full stick-breaking priors; truncation is not imposed on the model itself, but only on the variational distribution to allow for tractable inference. Hence, the truncation level C is a variational parameter that can be freely set, and not part of the prior model specification. D , {v ̟ , α ̟ , µ , R }C } be Let W  {X, Z , v π , α π , { f d }d=1 c c=1 c c c the set of all the parameters of the LM2 GP model over which a prior distribution has been imposed, and  be the set of the hyperparameters of the model priors and kernel functions. Variational Bayesian inference introduces an arbitrary distribution q(W ) to approximate the actual posterior p(W |, Y ) that is computationally intractable, yielding [8] log p(Y ) = L(q) + KL(q|| p) (38) where L(q) = ˆ dW q(W )log p(Y, W |) q(W ) (39) and KL(q|| p) stands for the Kullback–Leibler (KL) divergence between the (approximate) variational posterior q(W ) and the actual posterior p(W |, Y ). Since KL divergence D ˆ ˆ   p( f d |X) d Xd f d q(X)q( f d ) log L(q) = q( f d ) d=1  N  log p(ynd | f d (x n ), β −1 ) + n=1 + C ˆ ˆ  dµc d Rc q(µc , R c ) c=1 p(µc , R c |λc , mc , ωc , c ) × log q(µc , R c )  ˆ p(α π |η1π , η2π ) π π + dα q(α ) log q(α π )  ˆ C−1  p(v cπ |α π ) π π + dv c q(v c )log q(v cπ ) c=1  C−1 ˆ p(αc̟ |η1̟ , η2̟ ) ̟ ̟ + dαc q(αc ) log q(αc̟ ) c=1 C−1 ̟ |α ̟ )  ˆ p(v cc ′ c ̟ ̟ + )log q(v dv cc ′ cc′ ̟) q(v ′ ′ cc + c =1 C  N C   i=1 j =1 n=2 × ˆ + C  + q(z n j = 1|z n−1,i = 1) ̟ dv ̟ i q(v i )log q(z 1c = 1) c=1 C  N  c=1 n=1 ˆ p(z n j = 1|z n−1,i = 1; v ̟ i ) q(z n j = 1|z n−1,i = 1) dv π q(v π )log q(z nc = 1) × q(x n )q(µc , R c )log ˆ ˆ ˆ p(z 1c = 1|v π ) q(z 1c = 1) dx n dµc d Rc p(x n |µc , R c ) . q(x n ) (40) Derivation of the variational posterior distribution q(W ) involves maximization of the variational free energy L(q) over each one of the factors of q(W ) in turn, holding the others fixed, in an iterative manner [15]. On each iteration, apart from variational posterior updating, we also update the estimates of the model hyperparameters in , by maximization of the variational free energy L(q) over each one of them. By construction, this iterative consecutive updating of the model is guaranteed to monotonically and maximally increase the free energy L(q) [17]. 
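The decomposition in (38)-(39) can be verified numerically on a toy discrete model, which may help make the variational argument concrete. The example below is entirely ours: it uses a single binary latent variable so that both L(q) and the KL term can be evaluated exactly, and confirms that L(q) + KL(q||p) equals log p(Y) for any choice of q, with the bound becoming tight when q matches the true posterior.

```python
import numpy as np

# Toy joint p(W, Y) with a single binary latent W and one fixed observation Y.
# The numbers are arbitrary; they only need to define a valid joint distribution.
p_W = np.array([0.3, 0.7])                 # prior p(W)
p_Y_given_W = np.array([0.9, 0.2])         # likelihood p(Y = y_obs | W)

p_joint = p_W * p_Y_given_W                # p(W, y_obs)
log_evidence = np.log(p_joint.sum())       # log p(y_obs)
posterior = p_joint / p_joint.sum()        # exact p(W | y_obs)

def free_energy_and_kl(q):
    """L(q) = E_q[log p(W, Y) - log q(W)]  and  KL(q || p(W | Y))."""
    L = np.sum(q * (np.log(p_joint) - np.log(q)))
    KL = np.sum(q * (np.log(q) - np.log(posterior)))
    return L, KL

for q0 in (0.1, 0.5, posterior[0]):
    q = np.array([q0, 1.0 - q0])
    L, KL = free_energy_and_kl(q)
    # L + KL always reproduces log p(Y); KL vanishes when q equals the true posterior.
    print(f"q(W=0)={q0:.3f}  L(q)={L:+.4f}  KL={KL:.4f}  "
          f"L+KL={L + KL:+.4f}  log p(Y)={log_evidence:+.4f}")
```

Maximizing L(q) is therefore equivalent to minimizing the KL divergence to the intractable posterior, which is what the coordinate-wise updates described above accomplish.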
CHATZIS AND KOSMOPOULOS: LATENT MANIFOLD MARKOVIAN DYNAMICS GP 1) Variational Posteriors: Let us denote as . q(·) the posterior expectation of a quantity, that is the quantity’s mean value considering the entailed random variables follow the variational posteriors q(·). In the following, all these posterior means can be computed analytically, except for the expectations with respect to the posterior q(x n ) over the latent manifold representations of our modeled data. We shall elaborate on computation of these latter quantities later in this section. From (40), we obtain the following variational (approximate) posteriors over the parameters of our model. 1) For the q(v ̟ i ), we have q(v i̟j ) = Beta(β̃i̟j , β̂i̟j ) β̃i̟j = 1 + N  n=2 (41) q(z n j = 1|z n−1,i = 1) (42) C N     q(z nρ = 1|z n−1,i = 1). β̂i̟j = αi̟ q(α ̟ ) + i 2) Similar, for the q(v iπ ) β̃iπ ̺= j +1 n=2 q(v π ), 75 (55) and, we have ω̃c = ωc + ˜c =  q(z nc = 1) n=1 N λc n=1 q(z nc = 1) N λc + n=1 q(z nc = 1) N λc mc + x̄ c n=1 q(z nc = 1) m̃c = . N λc + n=1 q(z nc = 1) (46) 3) For the q(αi̟ ), we have q(αi̟ ) = G(αi̟ |ǫ̃i̟ , ǫ̂i̟ ) q( f d ) = N ( f d |µ̂d ,  d ) d = (47)  +C −1 (48) C−1   ψ(β̂i̟j ) − ψ(β̃i̟j + β̂i̟j ) . (49) − j =1 4) For the q(α π ), we have q(α π ) = G(α π |ε̃π , ε̂π ) (50) where K d (X, X)−1  q(X ) + βI −1 q(Z ) ∝ πδ∗1 N−1  ̟δ∗n δn+1 N  n=1 n=1 (52) where    πc∗  exp logπc (v π ) q(vπ ) (64)   ̟ ̟i∗j  exp log̟ j (v ̟ (65) i )q(v i )    p∗ (x t |µi , R i )  exp log p(x t |µi , R i ) q(µ ,R ),q(x ) n=1  q(z nc = 1) x n q(x n ) − x̄ c (54)  xn q(x n ) i t and δn  arg(z nc = 1). c 5) The posteriors q(µc , R c ) are approximated in the following form:   ˜c (53) q(µc , R c ) = N W µc , R c |λ̃c , m̃c , ω̃c ,  N  (63) (66) (51) i=1 where we introduce the notation N q(z nc = 1) x n q(x n ) x̄ c  n=1 N n=1 q(z nc = 1) (61) p∗ (x n |µδn , R δn ) i ε̃π = η1π + C − 1 C−1   ε̂π = η2π − ψ(β̂iπ ) − ψ(β̃iπ + β̂iπ ) . c  (60) µ̂d = β d IYd (62)   N Yd  [ynd ]n=1 , and K d (X, X)−1 q(X ) is the posterior mean of the inverse Gram matrix K d (X, X)−1 given the variational posteriors q(x n ) ∀n. 7) Similarly, the posteriors over the indicator variables Z yield where ǫ̂i̟ = η2̟ (59) In the above expressions, x n q(xn ) is the (posterior) expectation of the latent variable x n given its posterior q(x n ). 6) Regarding the posteriors over the latent functions f d , we have (45) ̺=i+1 = (58) n=1 (44) = 1 + q(z 1i = 1) η1̟ (mc − x̄ c ) (mc − x̄ c )T (57) where = Beta(β̃iπ , β̂iπ ) C    β̂iπ = α π q(α π ) + q(z 1ρ = 1). ǫ̃i̟ (56) + c + c N  q(z nc = 1) λ̃c = λc + (43) we have N  − x̄ c T From (63), and comparing this expression with the corresponding expressions of standard HMMs [18], it is easy to observe that computation of the probabilities q(z n j = 1|z n−1,i = 1), and q(z nc = 1), which constitute the variational posterior q(Z ), can be easily performed by means of the forward–backward algorithm for simple HMMs trained by means of maximum-likelihood, exactly as described in [19]; specifically, it suffices to run forward–backward for a simple HMM model with its optimized values (point estimates) of the Markov chain probabilities set equal to the posterior expected 76 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 1, JANUARY 2015 values πc∗ , ̟i∗j in (64) and (65), and its state-conditional likelihoods set equal to the expectations p∗ (x t |µi , R i ). 
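To make the last point concrete, a minimal scaled forward-backward routine of the kind required is sketched below. It is a generic implementation of ours, not the authors' code: its inputs would be the posterior-expected initial probabilities, transition probabilities, and state-conditional likelihoods of (64)-(66), here replaced by arbitrary numbers, and it returns the marginals q(z_nc = 1) and the pairwise quantities needed for the updates above.

```python
import numpy as np

def forward_backward(pi_star, A_star, lik):
    """Scaled forward-backward recursions for an HMM.

    pi_star: (C,) initial weights; A_star: (C, C) transition weights;
    lik: (N, C) state-conditional likelihood of each frame.
    Returns gamma[n, c] ~ q(z_nc = 1) and xi[n, i, j] ~ q(z_{n,i} = 1, z_{n+1,j} = 1).
    """
    N, C = lik.shape
    alpha = np.zeros((N, C))
    beta = np.zeros((N, C))
    scale = np.zeros(N)

    alpha[0] = pi_star * lik[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for n in range(1, N):                      # forward pass with per-frame rescaling
        alpha[n] = (alpha[n - 1] @ A_star) * lik[n]
        scale[n] = alpha[n].sum()
        alpha[n] /= scale[n]

    beta[-1] = 1.0
    for n in range(N - 2, -1, -1):             # backward pass using the same scale factors
        beta[n] = (A_star @ (lik[n + 1] * beta[n + 1])) / scale[n + 1]

    gamma = alpha * beta                       # already normalized by the scaling
    xi = (alpha[:-1, :, None] * A_star[None]
          * (lik[1:] * beta[1:])[:, None, :]
          / scale[1:, None, None])
    return gamma, xi

# Toy usage with arbitrary expected parameters and likelihoods.
rng = np.random.default_rng(0)
C, N = 3, 8
pi_star = np.array([0.6, 0.3, 0.1])
A_star = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])
lik = rng.uniform(0.1, 1.0, size=(N, C))
gamma, xi = forward_backward(pi_star, A_star, lik)
print(np.allclose(gamma.sum(axis=1), 1.0), np.allclose(xi.sum(axis=(1, 2)), 1.0))
```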
8) Finally, regarding the latent variable posteriors q(x n ), optimization over L(q) yields logq(x n ) = C  c=1 + q(z nc = 1) log p(x n |z nc = 1) D   d=1  log p( f d |0, K d (X, X)) q( f q(µc ,R c ) d) . (67) From (67), it becomes apparent that model configuration is not conjugate when it comes to the latent variables x n . As a consequence, the model does not yield a closed-form expression for the posteriors q(x n ). A repercussion of this fact is that the entailed posterior expectations with respect to q(x n ) cannot be computed in an analytical fashion, wherever they appear in the  inference algorithm of our model (namely the quantities K d (X, X)−1 q(xn ) and x n q(x n ) ). To resolve this issue, we can either resort to a deterministic approximation of the posteriors q(x n ), e.g., by the application of Laplace approximation, or perform sampling by means of MCMC. Our investigations have shown that Laplace approximation does not perform very well for our model. For this reason, in this paper, we opt for the latter alternative. Specifically, for this purpose, we draw samples from the variational posterior (67) using hybrid Monte Carlo (HMC) [20]. HMC provides an efficient method to draw samples from the variational posterior distribution q(x n ) by performing a physical simulation of an energy-conserving system to generate proposal moves. In detail, we add kinetic energy variables, p, for each latent dimension, while the expression of −logq(x n ) is used to specify a potential energy over the latent variables. Each HMC step involves sampling p and carrying out a physics-based simulation using leap-frog discretization. The final state of the simulation is accepted or rejected based on the Metropolis–Hastings algorithm. HMC sampling of x n requires computation of the gradient of q(x n ), which can be analytically performed in a straightforward manner. 2) Hyperparameter Selection: Regarding the values of the hyperparameters of the model priors, we set mc = 0, λc = 1, ωc = 10, and c = 100 I . Regarding the hyperparameters of the kernel functions kd (·, ·), we estimate them by performing maximization of the model variational free energy L(q) over each one of them. Computation of the variational free energy L(q) and its gradient requires derivation of the entailed posterior expectations in (40), which can be obtained analytically, except for the expectations with respect to the posteriors q(x n ). These latter quantities can be obtained by means of MCMC, utilizing the samples of x n , previously drawn by HMC sampling. Given the fact that none of the sought quantities yields closed-form estimators, to perform L(q) maximization, we resort to the L-Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [21]. C. Predictive Density Let us now consider the problem of computing the probaN∗ with bility of a given sequence of observations Y ∗ = [ y∗n ]n=1 2 respect to a trained LM GP model. This quantity is useful, e.g., in applying our model to (sequence-level) classification applications. For this purpose, we need to derive the predictive density of the model ˆˆ p(Y ∗ |Y ) = p(Y ∗ |X ∗ ; Y, X) p(X ∗ |X, Y ) p(X|Y )d Xd X ∗ . (68) To begin with, the expression of p(Y ∗ |X ∗ ; Y, X) is analogous to the predictive density expression of standard GP regression, yielding ∗ ∗ ∗ p(Y |X ; Y, X) = D N   n=1 d=1  ∗ 2 ∗ ∗ ) , σnd |and N (ynd (69)  −1 ∗ and = k d (x ∗n )T K d (X, X) + β −1 Yd (70)   −1  ∗ 2 σnd = −kd (x ∗n )T K d (X, X) + β −1 kd (x ∗n ) + kd (x ∗ , x ∗ ) (71) where kd (x ∗n )  [kd (x 1 , x ∗n ), . . 
. , kd (x N , x ∗n )]T. (72) Now, regarding the computation of the entailed expectation of p(Y ∗ |X ∗ ; Y, X) with respect to the distributions p(X|Y ) and p(X ∗ |X, Y ), we proceed as follows: first, we substitute p(X|Y  N) with its approximation (variational posterior) q(X) = n=1 q(x n ). This way, the corresponding expectation can be approximated by making use of the samples of the latent variables x n previously drawn through HMC. Finally, to obtain the expectations with respect to the density p(X ∗ |X, Y ), we resort to Gibbs sampling from the inferred latent-manifold infinite-state HMM. This obtains a set of samples of X ∗ from the latent manifold, which can be used to approximate the sought (remaining) expectation with respect to p(X ∗ |X, Y ). D. Relations to Existing Approaches As we have already discussed, our model is related to the GPDM method of [2]. The main difference between this paper and GPDM is that the latter uses a simplistic linear model of temporal dynamics combined with a spherical GP prior over the latent variables. In contrast, our approach employs a more flexible generative construction, which considers latent variable emission from a latent stick-breaking HMM (SB-HMM). In addition, one can also observe that our method essentially reduces to a GPLVM model by setting the truncation threshold C equal to one, i.e., C = 1. IV. E XPERIMENT Here, we experimentally evaluate our method in three scenarios: 1) unsupervised segmentation (frame-level clustering) of sequential data; 2) supervised segmentation (frame-level classification) of sequential data; and 3) (whole) sequence classification. CHATZIS AND KOSMOPOULOS: LATENT MANIFOLD MARKOVIAN DYNAMICS GP In the context of our unsupervised sequence segmentation experiments, we estimate the assignment of the data points (frames) of each sequence to the discovered latent classes (model states) using q(Z ), and compare the estimated values with the available ground-truth sequence segmentation. To compute the optimal assignment of the discovered class labels to the available ground-truth class labels, we resort to the Munkres assignment algorithm. This set of experiments makes it thus possible for us to evaluate the quality of the latent temporal dynamics discovered by our model. On the other hand, the considered supervised sequence segmentation experiments allow for us to additionally evaluate the quality of the drawn latent manifold representations of the modeled data. Specifically, in this set of experiments, the drawn latent manifold representations x n are subsequently fed to a simple multiclass classifier (specifically, a multiclass linear support vector machine [22]), which is trained to discover the correct (frame-level) classes by being presented with the generated manifold representations x n . We expect that a good performance of our model under such an experimental setup would indicate high-quality representational capabilities for both the generated latent manifold representations x n and the temporal dynamics discovered by our model. Furthermore, in the context of this set of experiments, we also explore whether our model could be used to obtain a deep learning architecture, by stacking multiple layers of models, each one fed with the latent state representations generated from the previous layer (and the first layer fed with the actual observations). 
We believe that, by stacking multiple layers of models, we might be capable of extracting more complex, higher level temporal dynamics encoded in the final-layer latent representations x n , thus allowing for eventually obtaining increased supervised (frame-level) classification performance. Finally, in the case of the sequence classification experiments, we train one model for each one of the considered classes. Evaluation is performed by finding for each test sequence the model that yields the highest predictive probability, and assigning it to the corresponding class. To obtain some comparative results, in our experiments, apart from our method, we also evaluate GPLVM [4] and GPDM models [2], as well as other state-of-the-art approaches relevant to each scenario. In our experiments, we make use of the large-scale linear support vector machine (SVM) library of [23], and the HMM-SVM implementation of Thorsten Joachims. In all our experiments, we use RBF kernels for all the evaluated GP-based models. Finally, HMC is run for 100 iterations, with 25 leap-frog √ steps at each iteration, and with a step-length equal to 0.001 N , where N is the number of modeled data points. A. Unsupervised Sequence Segmentation 1) Workflow Recognition: We first consider an unsupervised sequence segmentation (frame-level clustering) application. For this purpose, we use a public benchmark dataset involving action recognition of humans, namely the workflow recognition (WR) database [24]. Specifically, we use the first two workflows pertaining to car assembly (see [24] for more 77 TABLE I U NSUPERVISED S EQUENCE S EGMENTATION : W ORKFLOW R ECOGNITION . E RROR R ATES (%) FOR O PTIMAL L ATENT M ANIFOLD D IMENSIONALITY details). The frame-level tasks to recognize in an unsupervised fashion in these workflows are the following: 1) worker 1 picks up part 1 from rack 1 (upper) and places it on the welding cell, and the mean duration is 8–10 s; 2) workers 1 and 2 pick part 2a from rack 2 and place it on the welding cell; 3) workers 1 and 2 pick part 2b from rack 3 and place it on the welding cell; 4) worker 2 picks up spare parts 3a and 3b from rack 4 and places them on the welding cell; 5) worker 2 picks up spare part 4 from rack 1 and places it on the welding cell; 6) workers 1 and 2 pick up part 5 from rack 5 and place it on the welding cell. Feature extraction is performed as follows: to extract the spatiotemporal variations, we use pixel change history images to capture the motion history [25], and compute the complex Zernike moments A00 , A11, A20 , A22 , A31 , A33 , A40 , A42 , A44 , A51 , A53 , A55 , A60 , A62 , A64 , A66 , for each of which we compute the norm and the angle. In addition, the center of gravity and the area of the found blobs are also used. In total, this feature extraction procedure results in 31-D observation vectors. Zernike moments are calculated in rectangular regions of interest of approximately 15K pixels in each image to limit the processing and allow real-time feature extraction (performed at a rate of approximately 50–60 frames per second). In our experiments, we use a total of 40 sequences representing full assembly cycles and containing at least one of the considered behaviors, with each sequence being approximately 1K frames long. Frame annotation has been performed manually. In Table I, we illustrate the obtained error rates for optimal selection of the latent manifold dimensionality of our method, as well as the GPDM and GPLVM methods. 
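For reference, the error rates reported in these unsupervised experiments require matching the anonymous states discovered by each model to the ground-truth classes; the Munkres (Hungarian) assignment step mentioned earlier can be realized as in the generic SciPy-based sketch below (our own code, with made-up label vectors for illustration).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error_rate(true_labels, predicted_states):
    """Frame-level error after optimally mapping discovered states to ground-truth classes."""
    true_labels = np.asarray(true_labels)
    predicted_states = np.asarray(predicted_states)
    classes = np.unique(true_labels)
    states = np.unique(predicted_states)
    # Contingency table: how many frames of class k fall in discovered state c.
    counts = np.zeros((len(states), len(classes)), dtype=int)
    for i, s in enumerate(states):
        for j, k in enumerate(classes):
            counts[i, j] = np.sum((predicted_states == s) & (true_labels == k))
    # Munkres/Hungarian assignment maximizing the number of correctly matched frames.
    row, col = linear_sum_assignment(-counts)
    matched = counts[row, col].sum()
    return 1.0 - matched / len(true_labels)

# Toy usage: three ground-truth classes, four discovered states.
truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
states = np.array([3, 3, 1, 0, 0, 0, 2, 2, 2, 1])
print(f"error rate = {clustering_error_rate(truth, states):.2%}")
```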
Regarding the GPDM and GPLVM methods, clustering was performed by presenting the generated latent subspace representations to an SB-HMM [13]. All these results are means and variances over 10 repetitions of the experiment. As a baseline, we also evaluate SB-HMMs presented with the original (observed) data. As we observe, our approach yields a clear advantage over the competition; we also observe that all the other latent manifold models yield inferior performance compared with the baseline SB-HMM. This result is a clear indication of the much improved capability of our approach to capture latent temporal dynamics in the modeled data. In addition, in Fig. 3, we show how model performance changes as a function of the postulated latent manifold dimen- 78 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 1, JANUARY 2015 Fig. 3. Unsupervised sequence segmentation: workflow recognition. Error rate fluctuation with latent manifold dimensionality. Fig. 5. Unsupervised sequence segmentation: Honeybee. Error rate fluctuation with latent manifold dimensionality. TABLE II U NSUPERVISED S EQUENCE S EGMENTATION : H ONEYBEE . E RROR R ATES (%) FOR O PTIMAL L ATENT M ANIFOLD D IMENSIONALITY Fig. 4. Workflow Recognition. Average execution time (per sequence) of inference algorithm. sionality, for the LM2 GP, GPDM, and GPLVM methods. We observe that the model performance increases with manifold dimensionality in a consistent fashion. We also observe that our model benefits the most by an increase in latent manifold dimensionality. Furthermore, we run the Student’s-t test on all pairs of performances across the evaluated methods, to assess the statistical significance of the obtained differences; we obtain that all performance differences are deemed statistically significant. Finally, in Fig. 4, we show an indicative empirical comparison of computational times, for processing one of the sequences in our dataset (means and error bars over all the used sequences). As we observe, GPDM requires the longest time between the compared methods; our method is faster than GPDM, while imposing a reasonable increase in computational costs compared with GPLVM. This difference from GPDM was expected, since our method involves one less Gram matrix compared with GPDM. 2) Honeybee Dataset: Furthermore, we perform a second unsupervised sequence segmentation experiment using the Honeybee dataset [26]; it contains video sequences of honeybees, which communicate the location and distance to a food source through a dance that takes place within the hive. The dance can be decomposed into three different movement patterns that must be recognized by the evaluated algorithms: 1) waggle; 2) right turn; and 3) left turn. During the waggle dance, the bee moves roughly in a straight line while rapidly shaking its body from left to right; the duration and orientation of this phase correspond to the distance and the orientation to the food source. At the endpoint of a waggle dance, the bee turns in a clockwise or counter-clockwise direction to form a turning dance. Our dataset consists of six video sequences with lengths 1058, 1125, 1054, 757, 609, and 814 frames, respectively. The bees were visually tracked, and their locations and head angles were recorded. This resulted in obtaining 4-D frame-level feature vectors comprising the 2-D location of the bee and the sine and cosine of its head angle. Once the sequence observations were obtained, the trajectories were preprocessed as in [27]. 
Specifically, the trajectory sequences were rotated so that the waggle dances had head angle measurements centered about zero radian. The sequences were then translated to center at (0, 0), and the 2-D coordinates were scaled to the [−1, 1] range. Aligning the waggle dances was possible by looking at the high-frequency portions of the head angle measurements. Following the suggestion of [26], the data were smoothed using a Gaussian FIR pulse-shaping filter with 0.5-dB bandwidth-symbol time. In our evaluations, we adopt the experimental setup of the previous experiment. In Fig. 5, we show how model performance changes as a function of the postulated latent manifold dimensionality, for the LM2 GP, GPDM, and GPLVM methods. In Table II, we provide the error rates of all the evaluated models for optimal latent manifold dimensionality (wherever applicable). These results are means and variances over 10 repetitions of the experiment. We again observe that both GPDM and GPLVM are outperformed by the baseline SB-HMM. Our method works better than all the alternatives. Finally, we run the Student’s-t test on all pairs of perfor- CHATZIS AND KOSMOPOULOS: LATENT MANIFOLD MARKOVIAN DYNAMICS GP 79 TABLE III S UPERVISED S EQUENCE S EGMENTATION : W ORKFLOW R ECOGNITION . E RROR R ATES (%) FOR O PTIMAL L ATENT M ANIFOLD D IMENSIONALITY Fig. 7. Supervised sequence segmentation: workflow recognition. Error rate fluctuation with latent manifold dimensionality for two-layer deep learning architectures. Fig. 6. Supervised sequence segmentation: workflow recognition. Error rate fluctuation with latent manifold dimensionality. mances across the evaluated methods, to assess the statistical significance of the obtained differences; we obtain that all performance differences are deemed statistically significant, except for the differences between the optimal performance of GPLVM and the optimal performance of GPDM (obtained for three and four latent features, respectively). B. Supervised Sequence Segmentation 1) Workflow Recognition: Here, we use the same dataset as in Section IV-A1, but with the goal to perform supervised sequence segmentation using the obtained latent manifold representations of the modeled data. For this purpose, we first use our approach to obtain the latent manifold representations of the considered datasets. Subsequently, we use half of the resulting representations (corresponding to all the modeled classes) for initial training of a (frame-level) multiclass linear SVM classifier [22], and keep the rest for testing of the obtained classifier. Apart from our method, we also evaluate GPDM and GPLVM under the same experimental setup. In addition, as a baseline, we evaluate the HMM-SVM approach of [28] (presented with the original data, not the latent encodings). In Table III, we provide the obtained classification error rates for the considered models. These results correspond to optimal latent manifold dimensionality selection for the LM2 GP, GPDM, and GPLVM methods, and constitute means and variances over 10 repetitions of the experiment. As we observe, our method outperforms the competition. We also observe that GPDM and GPLVM yield essentially identical results; this fact indicates that GPDM does not capture richer temporal dynamics in our data compared with GPLVM. Another significant finding from these results is that the state-of-the-art HMM-SVM approach yields a considerably inferior performance compared with the evaluated latent variable models. 
This result seems to indicate that extracting latent manifold representations of the modeled data is much more effective for classification purposes than directly using the observed data and assuming temporal dynamics in the observations space. In addition, in Fig. 6, we show how model performance changes with the number of latent features for the LM2 GP, GPDM, and GPLVM methods. We observe that all methods yield their optimal performance with 10 latent features. Furthermore, we turn to our deep learning scenario, examining how model performance changes if we add a second layer of models (with the same latent dimensionality), fed with the latent manifold representations generated by the original models. We present the output (latent manifold representations) of the 2-layer models as input to a multiclass linear SVM classifier, and we evaluate the classification performance of the so-obtained classifier in the considered tasks. In Fig. 7, we illustrate the classification error rates of the resulting models as a function of latent manifold dimensionality. As we observe, two-layer deep learning architectures yield a significant improvement over the corresponding singlelayer architectures when the latent manifold dimensionality is low; however, performance for high latent manifold dimensionality turns out to be similar in the cases of GPDM and our model, and substantially worse in the case of GPLVM. To our perception, this result indicates that our model encodes richer and more consistent temporal dynamics information in the obtained latent manifold representations, which can be combined in a hierarchical manner to extract even more complex temporal dynamics. Finally, the Student’s-t test on all pairs of performances across the evaluated methods obtains here some very interesting results. 1) Our method yields statistically significant differences from GPDM and GPLVM for latent manifold dimensionality between 10 and 20. However, for more or less latent dimensions, the obtained differences are not statistically significant. 2) GPDM and GPLVM are comparable, with no statistically significant differences, whatsoever. 3) Models with two layers have statistically significant performance differences with respect to similar one-layer models in cases of low latent manifold dimensionality. 80 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 1, JANUARY 2015 TABLE IV S UPERVISED S EQUENCE S EGMENTATION : H ONEYBEE . E RROR R ATES (%) FOR O PTIMAL L ATENT M ANIFOLD D IMENSIONALITY Fig. 8. Supervised sequence segmentation: Honeybee. Error rate fluctuation with latent manifold dimensionality. observations space. In addition, in Fig. 8, we show how model performance changes with the number of latent features for the LM2 GP, GPDM, and GPLVM methods. We observe a consistent performance improvement with latent manifold dimensionality. This is not unexpected, since we are dealing with models trained on low-dimensional data using some thousands of training points. Furthermore, we turn to our deep learning scenario, examining how model performance changes if we add a second layer of models (with the same latent dimensionality), fed with the latent manifold representations generated by the original models. In Fig. 9, we show the classification error rates of the resulting models as a function of latent manifold dimensionality. 
As we observe, two-layer deep learning architectures yield improved performance for low latent manifold dimensionality (2-D latent space); however, any performance gains quickly disappear as latent dimensionality increases, with GPLVM performance eventually becoming inferior to the shallow modeling scenario, similar to the previous experiment. Finally, the Student’s-t test in this scenario finds that all pairs of differences are statistically significant, except for the case of one-layer versus two-layer model comparison, where the assumption of statistical significance is rejected for models with four latent dimensions. C. Sequence-Level Classification Fig. 9. Supervised sequence segmentation: Honeybee. Error rate fluctuation with latent manifold dimensionality for two-layer deep learning architectures. However, these differences quickly become insignificant as the number of latent dimensions increases. 2) Honeybee Dataset: Furthermore, we perform additional frame-level classification experiments using the Honeybee dataset described in Section IV-A2. In our evaluations, we adopt the experimental setup of the previous experiment. First, we consider a shallow modeling scenario. In Table IV, we provide the obtained classification error rates for the considered models. These results correspond to optimal latent manifold dimensionality selection for the LM2 GP, GPDM, and GPLVM methods, and are means and variances over 10 repetitions of the experiment. As we observe, our method outperforms the competition. We also observe that GPDM works better than GPLVM. Another significant finding from these results is that the state-of-the-art HMM-SVM approach yields again inferior performance compared with the evaluated latent variable models. This result corroborates our intuition that extracting latent manifold representations of the modeled data is much more effective for classification purposes than directly using the observed data and assuming temporal dynamics in the 1) Learning Music-to-Dance Mappings: Finally, we evaluate our method in sequence-level classification (classification of whole sequences). Initially, we consider the problem of learning music-to-dance mappings. In this experiment, the observed sequences presented to our model constitute the chroma features extracted from a collection of music clips. Chroma analysis [29] is an interesting and powerful representation for music audio in which the entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave. Since, in music, notes exactly one octave apart are perceived as particularly similar, knowing the distribution of chroma even without the absolute frequency (i.e., the original octave) can give useful musical information about the audio, and may even reveal perceived musical similarity that is not apparent in the original spectra [30]. In our experiments, we use a dataset of 600 music clips; they are split into sets of 100 clips pertaining to each one of the dance classes: Waltz, Tango, Foxtrot, Cha Cha, Quickstep, and Samba.1 We preprocess these clips as described above to obtain chroma features; this way, we eventually obtain sequences of 12-D observations, each one 35K–184K frames long. The clips are further segmented into approximately 8K frames long subsequences to form our dataset. We use half of our data for model training, and the rest for testing. We train one model for each one of the six classes we want to recognize. 
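For completeness, the 12-bin chroma representation described above can be obtained with standard audio tooling; the fragment below is a hypothetical preprocessing sketch using the librosa library (the paper does not state which implementation was used, and the file name is made up), producing one 12-D chroma observation per analysis frame.

```python
import librosa
import numpy as np

def chroma_sequence(wav_path, sr=22050, hop_length=512):
    """Load a clip and return an (n_frames, 12) sequence of chroma vectors."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length)
    return chroma.T                      # one 12-D observation per frame

# Hypothetical usage on one of the (converted) WAV clips.
seq = chroma_sequence("waltz_001.wav")
print(seq.shape)                         # (n_frames, 12)
```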
Classification of our test sequences is conducted by computing the sequence predictive densities with respect to the model of each class, and assigning each 1 Music clips were downloaded from: http://www.ballroomdancers.com/Music/ Default.asp?Tab=2; we converted them to WAV format for our experiments. CHATZIS AND KOSMOPOULOS: LATENT MANIFOLD MARKOVIAN DYNAMICS GP 81 TABLE V S EQUENCE -L EVEL C LASSIFICATION : L EARNING M USIC - TO -D ANCE TABLE VI S EQUENCE -L EVEL C LASSIFICATION : B IMANUAL G ESTURE M APPINGS . E RROR R ATES (%) FOR O PTIMAL L ATENT R ECOGNITION . E RROR R ATES (%) FOR O PTIMAL L ATENT M ANIFOLD D IMENSIONALITY M ANIFOLD D IMENSIONALITY Fig. 10. Sequence-level classification: learning music-to-dance mappings. Error rate fluctuation with latent manifold dimensionality. Fig. 11. Sequence-level classification: bimanual gesture recognition. Error rate fluctuation with latent manifold dimensionality. sequence to the class the model of which yields the highest predictive density. Apart from our method, we also evaluate SB-HMMs [13], GPLVMs [4], and GPDMs [2] under the same experimental setup. The obtained error rates for optimal latent manifold dimensionality (wherever applicable) are depicted in Table V. These results are means and variances over 10 repetitions of the experiment. As we observe, our approach works much better than the competition. We also observe that, in this experiment, GPDM does actually work better than GPLVM. In Fig. 10, we show how average model performance changes with latent manifold dimensionality. It becomes apparent from this graph that the modeled data contain a great deal of redundancy, since all models yield a performance deterioration for high latent space dimensionality. Finally, the Student’s-t test finds that all performance differences are statistically significant. 2) Bimanual Gesture Recognition: Furthermore, we perform evaluations considering the problem of bimanual gesture recognition. For this purpose, we experiment with the American Sign Language gestures for the words: against, aim, balloon, bandit, cake, chair, computer, concentrate, cross, deaf, explore, hunt, knife, relay, reverse, and role. The used dataset was obtained from four different persons executing each one of these gestures and comprises 40 videos per gesture; 30 of these videos are used for training and the rest for model evaluation. From this dataset, we extracted several features representing the relative position of the hands and the face in the images, as well as the shape of the respective skin regions, by means of the complex Zernike moments [31], as described in [18]. This way, each used video comprises 1K–4K frames of 12-D feature vectors used in our experiments. For each one of these 16 gestures, we fitted one model to recognize it. The obtained error rates (means and variances over 10 experiment repetitions) for optimal latent manifold dimensionality (wherever applicable) are depicted in Table VI. As we observe, our approach works much better than the competition. Furthermore, in Fig. 11, we show how average model performance changes with latent manifold dimensionality. We observe that all models yield a performance increase for moderate latent space dimensionality. Finally, the Student’s-t test finds that all performance differences are statistically significant. V. 
C ONCLUSION In this paper, we proposed a method for sequential data modeling that leverages the strengths of GPs, allowing for more flexibly capturing temporal dynamics compared with the existing nonparametric Bayesian approaches. Our method considers a latent manifold representation of the modeled data, and chooses to postulate a model of temporal dependencies on this latent manifold. Temporal dependencies in our model are captured through consideration of infinite-state Markovian dynamics, and imposition of stick-breaking priors over the entailed Markov chain probabilities. Inference for our model was performed by means of an efficient variational Bayesian algorithm. As we showed through experimental evaluations, our approach is suitable for unsupervised sequence segmentation (frame-level clustering), supervised sequence segmentation (frame-level classification), and whole sequence classification (sequence-level operation). We evaluated our approach in all these scenarios using real-world datasets, and observed that our method yields very competitive results, outperforming popular, recently proposed related approaches, e.g., GPDM and GPLVM. 82 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 1, JANUARY 2015 Finally, we examined whether our method can be also employed to obtain a deep learning architecture, by stacking multiple (layers of) LM2 GP models, each one fed with the latent manifold representations generated from the previous layer (and the first one fed with the observable data); specifically, we experimented with two-layer architectures. As we observed, our method seems to yield much more significant gains in such a scenario than GPDM- or GPLVMbased models, especially for low-dimensional manifold assumptions. Our future research endeavors in this line of research mainly focus on addressing two open questions: the first one concerns the possibility of sharing the precision matrices between close hidden states. A question that must be answered is what proximity criterion we could use for this purpose in the context of our model. Would, e.g., comparing the values of the latent vectors x n generated from different states provide a valid proximity criterion? In such a case, what type of proximity measure would be more effective? The second open question we want to investigate is the possibility of obtaining stacked denoising autoencoders [32] for sequential data modeling using our LM2 GP model as the main building block. Stacked denoising autoencoders are deep learning architectures that apart from extracting the most informative latent subspace representations of observed data are also capable of denoising the observed data, which are considered contaminated with noise at each phase of the model training algorithm. GPLVM models have already been used as building blocks for obtaining stacked denoising autoencoder architectures with great success [33]. However, existing formulations are not designed for capturing and exploiting temporal dependencies in the modeled data. In this paper, we investigated the utility of LM2 GP as the main building block for obtaining simple deep learning architectures, and obtained some promising results. How would our model perform in the context of a stacked denoising autoencoder framework? This is a question that remains to be addressed in our future work. We shall publish source codes pertaining to our method at: http://www.cut.ac.cy/eecei/staff/sotirios.chatzis/?languageId=2. 
APPENDIX

In this Appendix, we provide (for completeness' sake) the expressions of the derivatives of the variational free energy L(q) over the kernel hyperparameters, required for hyperparameter optimization by means of L-BFGS, as well as the expressions of the derivatives of log q(x_n) with respect to the latent vectors x_n, required for HMC sampling from q(x_n).

Regarding the derivatives of L(q) with respect to the kernel hyperparameters, say \varphi_d, d = 1, ..., D, we have

\frac{\partial \mathcal{L}(q)}{\partial \boldsymbol{\varphi}_d} = -\frac{1}{2}\,\mathrm{tr}\!\left[\left(\hat{\boldsymbol{\mu}}_d \hat{\boldsymbol{\mu}}_d^{T} + \boldsymbol{\Sigma}_d\right)\left\langle \frac{\partial \boldsymbol{K}_d(\boldsymbol{X},\boldsymbol{X})^{-1}}{\partial \boldsymbol{\varphi}_d}\right\rangle_{q(\boldsymbol{X})}\right] + \frac{1}{2}\,\mathrm{tr}\!\left[\left\langle \boldsymbol{K}_d(\boldsymbol{X},\boldsymbol{X})\,\frac{\partial \boldsymbol{K}_d(\boldsymbol{X},\boldsymbol{X})^{-1}}{\partial \boldsymbol{\varphi}_d}\right\rangle_{q(\boldsymbol{X})}\right]. \quad (73)

Similarly, the derivatives of log q(x_n) with respect to the latent vectors x_n yield

\frac{\partial \log q(\boldsymbol{x}_n)}{\partial \boldsymbol{x}_n} = \sum_{c} q(z_{nc}=1)\,\tilde{\omega}_c \tilde{\boldsymbol{\Lambda}}_c\left(\tilde{\boldsymbol{m}}_c - \boldsymbol{x}_n\right) - \frac{1}{2}\sum_{d}\mathrm{tr}\!\left[\left(\hat{\boldsymbol{\mu}}_d \hat{\boldsymbol{\mu}}_d^{T} + \boldsymbol{\Sigma}_d\right)\frac{\partial \boldsymbol{K}_d(\boldsymbol{X},\boldsymbol{X})^{-1}}{\partial \boldsymbol{x}_n}\right] + \frac{1}{2}\sum_{d}\mathrm{tr}\!\left[\boldsymbol{K}_d(\boldsymbol{X},\boldsymbol{X})\,\frac{\partial \boldsymbol{K}_d(\boldsymbol{X},\boldsymbol{X})^{-1}}{\partial \boldsymbol{x}_n}\right]. \quad (74)
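As a brief illustration of how these derivative expressions are used in practice, the following minimal sketch couples an implementation of L(q) and its gradient (73) with an off-the-shelf L-BFGS routine. The function free_energy_and_grad is an assumed, user-supplied callable (not the code we shall release) returning the pair (L(q), dL(q)/dphi) for a given hyperparameter vector.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_kernel_hyperparameters(free_energy_and_grad, phi_init):
    """Maximize the variational free energy L(q) over kernel hyperparameters
    using L-BFGS.  `free_energy_and_grad(phi)` is assumed to return the tuple
    (L(q), dL/dphi); both are negated because scipy performs minimization."""
    def negated_objective(phi):
        value, grad = free_energy_and_grad(phi)
        return -value, -np.asarray(grad)

    result = minimize(negated_objective, phi_init, jac=True, method="L-BFGS-B")
    return result.x
```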
REFERENCES

[1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Clarendon, 1995.
[2] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models for human motion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 283–298, Feb. 2008.
[3] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[4] N. Lawrence, "Probabilistic non-linear principal component analysis with Gaussian process latent variable models," J. Mach. Learn. Res., vol. 6, pp. 1783–1816, Nov. 2005.
[5] M. F. Moller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Netw., vol. 6, no. 4, pp. 525–533, 1993.
[6] J. Sethuraman, "A constructive definition of Dirichlet priors," Statist. Sinica, vol. 4, no. 2, pp. 639–650, 1994.
[7] T. Ferguson, "A Bayesian analysis of some nonparametric problems," Ann. Statist., vol. 1, no. 2, pp. 209–230, 1973.
[8] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, "An introduction to variational methods for graphical models," in Learning in Graphical Models, M. Jordan, Ed. Boston, MA, USA: Kluwer, 1998, pp. 105–162.
[9] D. M. Blei and M. I. Jordan, "Variational inference for Dirichlet process mixtures," Bayesian Anal., vol. 1, no. 1, pp. 121–144, 2006.
[10] D. Blackwell and J. MacQueen, "Ferguson distributions via Pólya urn schemes," Ann. Statist., vol. 1, no. 2, pp. 353–355, 1973.
[11] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analyzers," Neural Comput., vol. 11, no. 2, pp. 443–482, 1999.
[12] J. Zhao and Q. Jiang, "Probabilistic PCA for t distributions," Neurocomputing, vol. 69, nos. 16–18, pp. 2217–2226, Oct. 2006.
[13] J. Paisley and L. Carin, "Hidden Markov models with stick-breaking priors," IEEE Trans. Signal Process., vol. 57, no. 10, pp. 3905–3917, Oct. 2009.
[14] Y. Qi, J. W. Paisley, and L. Carin, "Music analysis using hidden Markov mixture models," IEEE Trans. Signal Process., vol. 55, no. 11, pp. 5209–5224, Nov. 2007.
[15] D. Chandler, Introduction to Modern Statistical Mechanics. New York, NY, USA: Oxford University Press, 1987.
[16] D. Blei and M. Jordan, "Variational methods for the Dirichlet process," in Proc. 21st Int. Conf. Mach. Learn., New York, NY, USA, Jul. 2004, pp. 12–19.
[17] P. Muller and F. Quintana, "Nonparametric Bayesian data analysis," Statist. Sci., vol. 19, no. 1, pp. 95–110, 2004.
[18] S. P. Chatzis, D. I. Kosmopoulos, and T. A. Varvarigou, "Robust sequential data modeling using an outlier tolerant hidden Markov model," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 9, pp. 1657–1669, Sep. 2009.
[19] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[20] R. M. Neal, "Probabilistic inference using Markov chain Monte Carlo methods," Dept. Comput. Sci., Univ. Toronto, Tech. Rep. CRG-TR-93-1, 1993.
[21] D. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Math. Program. B, vol. 45, no. 3, pp. 503–528, 1989.
[22] V. N. Vapnik, Statistical Learning Theory. New York, NY, USA: Wiley, 1998.
[23] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," J. Mach. Learn. Res., vol. 9, pp. 1871–1874, Jan. 2008.
[24] A. Voulodimos et al., "A threefold dataset for activity and workflow recognition in complex industrial environments," IEEE MultiMedia, vol. 19, no. 3, pp. 42–52, Jul./Sep. 2012.
[25] D. Kosmopoulos and S. Chatzis, "Robust visual behavior recognition," IEEE Signal Process. Mag., vol. 27, no. 5, pp. 34–45, Sep. 2010.
[26] S. M. Oh, J. M. Rehg, T. Balch, and F. Dellaert, "Learning and inferring motion patterns using parametric segmental switching linear dynamic systems," Int. J. Comput. Vis., vol. 77, nos. 1–3, pp. 103–124, 2008.
[27] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky, "Nonparametric Bayesian learning of switching linear dynamical systems," in Proc. Neural Inf. Process. Syst., 2009, pp. 1–3.
[28] Y. Altun, I. Tsochantaridis, and T. Hofmann, "Hidden Markov support vector machines," in Proc. ICML, 2004, pp. 1–9.
[29] M. A. Bartsch and G. H. Wakefield, "To catch a chorus: Using chroma-based representations for audio thumbnailing," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., Jun. 2001, pp. 15–18.
[30] H. Jensen, M. G. Christensen, D. Ellis, and S. H. Jensen, "A tempo-insensitive distance measure for cover song identification based on chroma features," in Proc. ICASSP, 2008, pp. 2209–2212.
[31] R. Mukundan and K. R. Ramakrishnan, Moment Functions in Image Analysis: Theory and Applications. Singapore: World Sci., 1998.
[32] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, pp. 3371–3408, 2010.
[33] J. Snoek, R. P. Adams, and H. Larochelle, "Nonparametric guidance of autoencoder representations using label information," J. Mach. Learn. Res., vol. 13, pp. 2567–2588, Sep. 2012.

Sotirios P. Chatzis received the M.Eng. (Hons.) degree in electrical and computer engineering and the Ph.D. degree in machine learning from the National Technical University of Athens, Athens, Greece, in 2005 and 2008, respectively. He was a Post-Doctoral Fellow with the University of Miami, Coral Gables, FL, USA, from 2009 to 2010, and a Post-Doctoral Researcher with the Department of Electrical and Electronic Engineering, Imperial College London, London, U.K., from 2010 to 2012. He is currently an Assistant Professor with the Department of Electrical Engineering, Computer Engineering and Informatics, Cyprus University of Technology, Limassol, Cyprus. He has authored more than 40 papers in the most prestigious journals and conferences of his research field in his first seven years as a researcher.
His current research interests include machine learning theory and methodologies, with a special focus on hierarchical Bayesian models, Bayesian nonparametrics, quantum statistics, and neuroscience. His Ph.D. research was supported by the Bodossaki Foundation, Greece, and the Greek Ministry for Economic Development. Dr. Chatzis was a recipient of the Dean's scholarship for Ph.D. studies, as the best-performing Ph.D. student of his class.

Dimitrios Kosmopoulos received the B.Eng. degree in electrical and computer engineering and the Ph.D. degree from the National Technical University of Athens, Athens, Greece, in 1997 and 2002, respectively. He was a Research Assistant Professor with the Department of Computer Science, Rutgers University, New Brunswick, NJ, USA, and an Adjunct Assistant Professor with the Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX, USA. His former collaborations include the Institute of Informatics and Telecommunications, National Center for Scientific Research Demokritos, Athens, and the National Technical University of Athens. In the same period, he was with the University of Central Greece, Lamia, Greece, the University of Peloponnese, Sparti, Greece, and the Technological Educational Institute of Athens, Egaleo, Greece. He is currently an Assistant Professor with the Department of Applied Informatics and Multimedia, Technological Educational Institute of Crete, Heraklion, Greece. He has published more than 70 papers in the fields of computer/robotic vision, machine learning, and signal processing.