1. Introduction
Over the last five years, the data science community has devoted significant attention to stochastic optimisation in Riemannian manifolds. This was initiated by Bonnabel, who proved the convergence of the Riemannian stochastic gradient method [1]. Later on, the rate of convergence of this method was studied in detail, under various convexity assumptions on the cost function [2]. More recently, asymptotic efficiency of the averaged Riemannian stochastic gradient method was proved in [3]. Previously, for the specific problem of computing Riemannian means, several results on the convergence and asymptotic normality of Riemannian stochastic optimisation methods had been obtained [4,5]. The framework of stochastic optimisation in Riemannian manifolds is far-reaching, and encompasses applications to principal component analysis, dictionary learning, and tensor decomposition, to give only a few examples [6,7,8].
The present work moves in a different direction, focusing on recursive estimation in Riemannian manifolds. While recursive estimation is a special case of stochastic optimisation, it has its own geometric structure, given by the Fisher information metric. Here, several original results are introduced, which show how this geometric structure can be exploited to design Riemannian stochastic optimisation algorithms that compute fast, asymptotically efficient, recursive estimates of a statistical parameter belonging to a Riemannian manifold. For the first time in the literature, these results extend the classical results of [9,10] from the Euclidean context to the Riemannian context.
The mathematical problem considered in the present work is formulated in Section 2. It involves a parameterised statistical model P of probability distributions P_θ, where the statistical parameter θ belongs to a Riemannian manifold Θ. Given independent observations with distribution P_θ* for some θ* in Θ, the aim is to estimate the unknown parameter θ*. In principle, this is done by minimising a statistical divergence function D(θ), which measures the dissimilarity between P_θ and P_θ*. Taking advantage of the observations, there are two approaches to minimising D(θ): stochastic minimisation, which leads to recursive estimation, and empirical minimisation, which leads to classical techniques, such as maximum-likelihood estimation [11,12].
The original results obtained in the present work are stated in Section 3. In particular, these are Propositions 2, 4, and 5. Overall, these propositions show that recursive estimation, which requires fewer computational resources than maximum-likelihood estimation, can still achieve the same optimal performance, characterised by asymptotic efficiency [13,14].
To summarise these propositions, consider a sequence of recursive estimates θ̂_n, computed using a Riemannian stochastic optimisation algorithm with decreasing step sizes (n is the number of observations already processed by the algorithm). Informally, under assumptions which guarantee that θ* is an attractive local minimum of D(θ), and that the algorithm is neither too noisy nor too unstable in the neighborhood of θ*, Proposition 2 states that, with an adequate choice of step sizes, the θ̂_n achieve a fast non-asymptotic rate of convergence to θ*. Precisely, the expectation of the squared Riemannian distance between θ̂_n and θ* is O(1/n). This is called a fast rate, because it is the best achievable for any step sizes proportional to n^{-q} with q in (1/2, 1] [9,15]. Here, this rate is obtained without any convexity assumptions, for twice differentiable D(θ). It would still hold for non-differentiable, but strongly convex, D(θ) [2].
Proposition 4 states that the distribution of the θ̂_n becomes asymptotically normal, centred at θ*, when n grows increasingly large, and also characterises the corresponding asymptotic covariance matrix. This proposition is proved using a novel linearisation technique, which also plays a central role in [3].
Proposition 5 states that, if the Riemannian manifold Θ is equipped with the Fisher information metric of the statistical model P, then Riemannian gradient descent with respect to this information metric, when used to minimise D(θ), computes recursive estimates θ̂_n which are asymptotically efficient, achieving the optimal asymptotic rate of convergence given by the Cramér-Rao lower bound. This is illustrated with a numerical application to the recursive estimation of elliptically contoured distributions, in Section 4.
The end result of Proposition 5 is asymptotic efficiency, achieved using the Fisher information metric. In [3], an alternative route to asymptotic efficiency is proposed, using the averaged Riemannian stochastic gradient method. This method does not require any prior knowledge of the Fisher information metric, but has an additional computational cost, which comes from computing on-line Riemannian averages.
The proofs of Propositions 2, 4, and 5 are detailed in Section 6, and in Appendix A and Appendix B. Necessary background about the Fisher information metric (in short, this will be called the information metric) is recalled in Appendix C. Before going on, the reader should note that the summation convention of differential geometry is used throughout the following, when working in local coordinates.
2. Problem Statement
Let P = { P_θ ; θ in Θ } be a statistical model, with parameter space Θ and sample space X. To each θ in Θ, the model P associates a probability distribution P_θ on X. Here, Θ is a Riemannian manifold of dimension d, and X is any measurable space. The Riemannian metric of Θ will be denoted ⟨·,·⟩, with its Riemannian distance d(·,·). In general, the metric of Θ is not the information metric of the model P.
Let x_1, x_2, … be i.i.d. random variables, defined on a complete probability space, with values in X. While their common distribution is unknown, it is assumed to belong to the model P. That is, it is equal to P_θ* for some θ* in Θ, to be called the true parameter.
Consider the following problem: how to obtain fast, asymptotically efficient, recursive estimates θ̂_n of the true parameter θ*, based on observations of the random variables x_n? The present work proposes to solve this problem through a detailed study of the decreasing-step-size algorithm, similar to the one in [1], which computes

θ̂_{n+1} = Exp_{θ̂_n}( γ_{n+1} u(θ̂_n, x_{n+1}) )    (1a)

starting from an initial guess θ̂_0.
This algorithm has three ingredients. First, Exp denotes the Riemannian exponential map of the metric of Θ [16]. Second, the step sizes (γ_n ; n = 1, 2, …) are strictly positive, decreasing, and verify the usual conditions for stochastic approximation [10,17]

∑_{n≥1} γ_n = +∞   and   ∑_{n≥1} γ_n² < +∞    (1b)

Third, u(θ, x) is a continuous vector field on Θ for each x in X, which generalises the classical concept of score statistic [13,18]. It will become clear, from the results given in Section 3, that the solution of the above-stated problem depends on the choice of each one of these three ingredients.
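For concreteness, the recursion (1a) can be sketched in a few lines of generic code. In the sketch below, exp_map and u are user-supplied callables standing for the exponential map and the vector field; all names are illustrative placeholders, not taken from the paper, and the step sizes γ_n = a/n^q satisfy (1b) whenever 1/2 < q ≤ 1.

```python
import numpy as np

def recursive_estimate(x_stream, theta0, exp_map, u, a=1.0, q=1.0):
    """Minimal sketch of the recursion (1a), under illustrative assumptions.

    exp_map(theta, v): Riemannian exponential map at theta, applied to a tangent vector v.
    u(theta, x): the vector field playing the role of a score statistic.
    Step sizes gamma_n = a / n**q verify (1b) whenever 1/2 < q <= 1.
    """
    theta = theta0
    for n, x in enumerate(x_stream, start=1):
        gamma = a / n**q                        # decreasing step size
        theta = exp_map(theta, gamma * u(theta, x))   # one step of (1a)
    return theta
```

Only the choice of exp_map, of the step sizes, and of u distinguishes one instance of the algorithm from another, which is exactly the point made in the preceding paragraph.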
A priori knowledge about the model P is injected into Algorithm (1a) using a divergence function D(θ) (note that θ* is unknown, though). As defined in [19], this is a positive function, equal to zero if and only if θ = θ*, and with positive definite Hessian at θ = θ*. Since one expects that minimising D(θ) will lead to estimating θ*, it is natural to require that

E_{θ*}[ u(θ, x_n) ] = -∇D(θ)    (1c)

In other words, that u(θ, x_n) is an unbiased estimator of minus the Riemannian gradient of D(θ). With u(θ, x_n) chosen according to (1c), Algorithm (1a) is a Riemannian stochastic gradient descent, of the form considered in [1,2,3]. However, as explained in Remark 2, (1c) may be replaced by the weaker condition (9), which states that D(θ) is a Lyapunov function of Algorithm (1a), without affecting the results in Section 3. In this sense, Algorithm (1a) is more general than Riemannian stochastic gradient descent.
In practice, a suitable choice of D(θ) is often the Kullback-Leibler divergence [20],

D(θ) = E_{θ*}[ -log L(θ) ]    (2a)

where P_θ is absolutely continuous with respect to P_θ*, with Radon-Nikodym derivative L(θ) = dP_θ/dP_θ* (the likelihood function). Indeed, if D(θ) is chosen to be the Kullback-Leibler divergence, then (1c) is satisfied by

u(θ, x) = ∇ log L(θ)(x)    (2b)

which, in many practical situations, can be evaluated directly, without any knowledge of θ*.
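To make (2b) concrete, consider the simplest Euclidean special case (not treated in the paper, and used here purely for illustration): Θ = R^d with the usual metric, and P_θ the Gaussian distribution with mean θ and identity covariance. Then ∇ log L(θ)(x) = x - θ, which indeed involves no knowledge of θ*, and (1a) reduces to a plain stochastic gradient recursion.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
theta_star = np.array([1.0, -2.0, 0.5])    # true parameter (unknown to the algorithm)

def u(theta, x):
    # Score of the unit-covariance Gaussian model N(theta, I):
    # grad_theta log p(x; theta) = x - theta.  No knowledge of theta_star is needed.
    return x - theta

theta = np.zeros(d)                                 # initial guess
for n in range(1, 50001):
    x = theta_star + rng.standard_normal(d)         # observation drawn from P_{theta_star}
    theta = theta + (1.0 / n) * u(theta, x)         # (1a) with Exp = vector addition, gamma_n = 1/n
print(theta)                                        # close to theta_star
```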
Before stating the main results in the following Section 3, it may be helpful to recall some general background on recursive estimation [10]. For simplicity, let D(θ) be the Kullback-Leibler divergence (2a). The problem of estimating the true parameter θ* is equivalent to the problem of finding a global minimum of D(θ). Of course, this problem cannot be tackled directly, since D(θ) cannot be computed without knowledge of θ*. There exist two routes around this difficulty.
The first route is empirical minimisation, which replaces the expectation in (2a) with an empirical mean over observed data. Given the first n observations, instead of minimising D(θ), one minimises the empirical divergence

D̂_n(θ) = -(1/n) ∑_{k=1}^{n} log L(θ)(x_k)    (3)

where L is the likelihood function of (2a). Now, given the minus sign ahead of the sum in (3), it is clear that minimising D̂_n amounts to maximising the sum of log-likelihoods. Thus, one is led to the method of maximum-likelihood estimation.
It is well-known that maximum-likelihood estimation, under general regularity conditions, is asymptotically efficient [13]. Roughly, this means the maximum-likelihood estimator has the least possible asymptotic variance, equal to the inverse of the Fisher information. On the other hand, as the number n of observations grows, it becomes increasingly difficult to deal with the empirical divergence D̂_n of Equation (3). In the process of searching for the minimum of D̂_n, each evaluation of this function, or of its derivatives, involves a number of operations that grows with n, ultimately becoming impractical.
Aiming to avoid this difficulty, the second route, recursive estimation, is based on observation-driven updates, following the general scheme of Algorithm (1a). In this scheme, each new recursive estimate θ̂_{n+1} is computed using only the new observation x_{n+1}. Therefore, as the number of observations grows very large, the computational effort required to process each new observation remains the same.
The main results in the following section show that recursive estimation can achieve the same asymptotic performance as maximum-likelihood estimation, as the number n of observations grows large. As a word of caution, it should be mentioned that recursive estimation does not, in general, fare better than maximum-likelihood estimation for moderate values of n and for models with a small number of parameters. However, it is a very desirable substitute for maximum-likelihood estimation in models with a large number of parameters, which typically require a very large number of observations in order to be estimated correctly.
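As an elementary illustration of the contrast between the two routes (again in the Euclidean Gaussian-mean case, which is not from the paper): with γ_n = 1/n, the recursive update reproduces the maximum-likelihood estimator, namely the sample mean, while processing each observation at constant cost.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = np.array([0.3, -1.2])
xs = theta_star + rng.standard_normal((1000, 2))   # the first n = 1000 observations

theta = np.zeros(2)
for n, x in enumerate(xs, start=1):
    theta = theta + (x - theta) / n        # recursive estimate: O(d) work per observation

mle = xs.mean(axis=0)                      # batch maximum-likelihood estimate: O(n d) work
print(np.allclose(theta, mle))             # True: both equal the running sample mean
```

In richer models the two estimators no longer coincide for finite n, but, as shown in Section 3, they share the same asymptotic behaviour.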
3. Main Results
The motivation of the following Propositions 1 to 5 is to provide general conditions which guarantee that Algorithm (1a) computes fast, asymptotically efficient, recursive estimates θ̂_n of the true parameter θ*. In the statement of these propositions, it is implicitly assumed that conditions (1b) and (1c) are verified. Moreover, the following assumptions are considered.
- (d1) the divergence function D(θ) has an isolated stationary point at θ = θ*, and Lipschitz gradient in a neighborhood of this point.
- (d2) this stationary point is moreover attractive: D(θ) is twice differentiable at θ = θ*, with positive definite Hessian at this point.
- (u1) in a neighborhood of θ*, the function E_{θ*}‖u(θ, x_n)‖² is uniformly bounded.
- (u2) in a neighborhood of θ*, the function E_{θ*}‖u(θ, x_n)‖⁴ is uniformly bounded.
For Assumption (d1), the definition of a Lipschitz vector field on a Riemannian manifold may be found in [21]. For Assumptions (u1) and (u2), ‖·‖ denotes the Riemannian norm.
Assumptions (u1) and (u2) are so-called moment control assumptions. They imply that the noise in Algorithm (1a) does not cause the iterates θ̂_n to diverge, and are also crucial to proving the asymptotic normality of these iterates.
Consider a neighborhood of θ* which verifies (d1), (u1), and (u2). Without loss of generality, it is assumed that this neighborhood is compact and convex (see the definition of convexity in [16,22]). Then, it admits a system of normal coordinates (θ^α ; α = 1, …, d) with origin at θ*. With respect to these coordinates, denote the components of u(θ, x_n) by u^α(θ, x_n).
When (d2) is verified, denote the components of the Hessian of D(θ) at θ = θ* by

H_αβ = ∂²D/∂θ^α ∂θ^β evaluated at θ = θ*

Then, the matrix H = (H_αβ) is positive definite [23]. Denote by λ its smallest eigenvalue.
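In a simulation study, where θ* is known and D can be evaluated, the matrix H and its smallest eigenvalue λ may be approximated numerically, for instance by central differences in the normal coordinates. A rough sketch, assuming a callable D_normal that returns D as a function of the normal coordinates (a hypothetical helper, introduced only for illustration):

```python
import numpy as np

def hessian_at_origin(D_normal, d, eps=1e-4):
    """Central-difference Hessian of D_normal at the origin of normal coordinates."""
    H = np.zeros((d, d))
    e = np.eye(d) * eps
    for a in range(d):
        for b in range(d):
            H[a, b] = (D_normal(e[a] + e[b]) - D_normal(e[a] - e[b])
                       - D_normal(-e[a] + e[b]) + D_normal(-e[a] - e[b])) / (4 * eps**2)
    return 0.5 * (H + H.T)   # symmetrise against round-off

# Toy check: if D were quadratic with Hessian diag(2, 5), lambda would be 2.
D_quad = lambda c: c[0]**2 + 2.5 * c[1]**2
H = hessian_at_origin(D_quad, d=2)
lam = np.linalg.eigvalsh(H).min()
print(H, lam)   # approximately [[2, 0], [0, 5]] and 2
```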
Propositions 1 to 5 require the condition that the recursive estimates θ̂_n are stable, which means that all the θ̂_n lie in this neighborhood, almost surely. The need for this condition is discussed in Remark 3. Note that, if θ̂_n lies in this neighborhood, then it is determined by its normal coordinates (θ̂^α_n).
Proposition 1 (consistency). Assume (d1) and (u1) are verified, and the recursive estimates θ̂_n are stable. Then, θ̂_n converges to θ*, almost surely.
Proposition 2 (mean-square rate). Assume (d1), (d2), and (u1) are verified, the recursive estimates θ̂_n are stable, and the step sizes are γ_n = a/n, where 2λa > 1. Then

E[ d²(θ̂_n, θ*) ] = O(n^{-1})    (5)

Proposition 3 (almost-sure rate). Assume the conditions of Proposition 2 are verified. Then, the corresponding almost-sure rate of convergence (6) holds true.

Proposition 4 (asymptotic normality). Assume the conditions of Proposition 2, as well as (u2), are verified. Then, the distribution of the re-scaled coordinates n^{1/2} θ̂^α_n converges to a centred d-variate normal distribution, with covariance matrix Σ given by Lyapunov's equation, whose coefficients are determined by the step-size constant a, the Hessian matrix H, and the second-order moments of the u^α(θ*, x_n).

Proposition 5 (asymptotic efficiency). Assume the Riemannian metric of Θ coincides with the information metric of the model P, and let D(θ) be the Kullback-Leibler divergence (2a). Further, assume (d1), (d2), (u1), and (u2) are verified, the recursive estimates θ̂_n are stable, and the step sizes are γ_n = a/n. Then,
(i) the rates of convergence (5) and (6) hold true.
(ii) if a > 1/2, the distribution of the re-scaled coordinates n^{1/2} θ̂^α_n converges to a centred d-variate normal distribution, with a covariance matrix depending on a.
(iii) if a = 1, and u(θ, x) is given by (2b), then this covariance matrix is the identity matrix, and the recursive estimates θ̂_n are asymptotically efficient.
(iv) corresponding rates of convergence also hold.

The following remarks are concerned with the scope of Assumptions (d1), (d2), (u1), and (u2), and with the applicability of Propositions 1 to 5.
Remark 1. Assumptions (d2), (u1), and (u2) do not depend on the Riemannian metric of Θ. Precisely, if they are verified for one Riemannian metric on Θ, then they are verified for any Riemannian metric on Θ. Moreover, if the function D(θ) is of class C², then the same is true for (d1). In this case, Propositions 1 to 5 apply for any Riemannian metric on Θ, so that the choice of the metric is a purely practical matter, to be decided according to applications.
Remark 2. The conclusion of Proposition 1 continues to hold if (1c) is replaced by the weaker condition (9). Then, it is even possible to preserve Propositions 2, 3, and 4, provided (d2) is replaced by the assumption that the mean vector field E_{θ*}[u(θ, x_n)] has an attractive stationary point at θ = θ*. This generalisation of Propositions 1 to 4 can be achieved following essentially the same approach as laid out in Section 6. However, in the present work, it will not be carried out in detail.

Remark 3. The condition that the recursive estimates θ̂_n are stable is standard in all prior work on stochastic optimisation in manifolds [1,2,3]. In practice, this condition can be enforced by replacing Algorithm (1a) with a so-called projected, or truncated, algorithm. This is identical to (1a), except that θ̂_{n+1} is projected back onto the neighborhood of θ* whenever it falls outside of this neighborhood [10,17]; a minimal sketch of this device is given after the present remark. On the other hand, if the θ̂_n are not required to be stable, but (d1) and (u1) are replaced by the global assumptions,
- (d1') D(θ) has compact level sets and globally Lipschitz gradient.
- (u1') E_{θ*}‖u(θ, x_n)‖² ≤ C for some constant C and for all θ in Θ.
then, applying the same arguments as in the proof of Proposition 1, it follows that the θ̂_n converge to the set of stationary points of D(θ), almost surely.
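The projection device mentioned in Remark 3 admits a simple generic implementation: whenever the candidate iterate leaves a fixed geodesic ball contained in the neighborhood of θ*, it is pulled back onto that ball along the geodesic from the ball's centre. A minimal sketch, assuming user-supplied exponential map, logarithm map, and Riemannian norm (all names illustrative, not from the paper):

```python
def truncated_step(theta, x, gamma, u, exp_map, log_map, norm, theta_ref, radius):
    """One step of a projected (truncated) variant of Algorithm (1a)."""
    candidate = exp_map(theta, gamma * u(theta, x))   # ordinary update (1a)
    v = log_map(theta_ref, candidate)                 # tangent vector at theta_ref pointing to the candidate
    dist = norm(theta_ref, v)                         # Riemannian distance from theta_ref to the candidate
    if dist <= radius:
        return candidate                              # candidate stays inside the ball: keep it
    return exp_map(theta_ref, (radius / dist) * v)    # otherwise, pull it back onto the ball
```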
Remark 4. From (ii) and (iii) of Proposition 5, it follows that the distribution of n d²(θ̂_n, θ*) converges to a χ²-distribution with d degrees of freedom. This provides a practical means of confirming the asymptotic efficiency of the recursive estimates θ̂_n.
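Remark 4 suggests a straightforward Monte Carlo diagnostic. The sketch below applies it to the Euclidean Gaussian-mean example used earlier (for which the Euclidean metric is the information metric, so that the setting of Proposition 5 applies); scipy is used only to provide the reference χ² distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d, n_obs, n_runs = 3, 1000, 4000
theta_star = np.array([1.0, -2.0, 0.5])

theta = np.zeros((n_runs, d))                    # one recursive estimate per Monte Carlo run
for n in range(1, n_obs + 1):
    x = theta_star + rng.standard_normal((n_runs, d))
    theta += (x - theta) / n                     # update (1a) with gamma_n = 1/n, Exp = vector addition

stat = n_obs * np.sum((theta - theta_star) ** 2, axis=1)   # n * d^2(theta_hat_n, theta_star)
# Under asymptotic efficiency, stat should follow a chi-squared law with d degrees of freedom.
print(stats.kstest(stat, stats.chi2(df=d).cdf))
```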
4. Application: Estimation of ECD
Here, the conclusion of Proposition 5 is illustrated by applying Algorithm (1a) to the estimation of elliptically contoured distributions (ECD) [24,25]. Precisely, in the notation of Section 2, let Θ be the space of m × m real positive definite matrices, and let X = R^m. Moreover, let each distribution P_Σ, for Σ in Θ, have probability density function

p(x ; Σ) ∝ det(Σ)^{-1/2} exp[ h( x^⊤ Σ^{-1} x ) ]    (10)

where the function h is fixed, has negative values, and is decreasing, and ⊤ denotes the transpose. Then, P_Σ is called an ECD with scatter matrix Σ. To begin, let x_1, x_2, … be i.i.d. random vectors in R^m, with distribution P_Σ* given by (10), and consider the problem of estimating the true scatter matrix Σ*. The standard approach to this problem is based on maximum-likelihood estimation [25,26]. An original approach, based on recursive estimation, is now introduced using Algorithm (1a).
As in Proposition 5, the parameter space Θ will be equipped with the information metric of the statistical model P just described. In [27], it is proved that this information metric is an affine-invariant metric on Θ. In other words, it is of the general form [28]

⟨v, v⟩_Σ = α tr[ (Σ^{-1} v)² ] + β tr²( Σ^{-1} v ),   v in T_Σ Θ    (11)

parameterised by constants α and β, where tr denotes the trace and tr² the squared trace, and where T_Σ Θ denotes the tangent space at Σ to the manifold Θ. Precisely [27], for the information metric of the model P, the constants α and β are determined by a further constant C, given by the expectation (11c), which involves the m × m identity matrix I and the derivative h′ of h. This expression of the information metric can now be used to specify Algorithm (1a).
First, since the information metric is affine-invariant, it is enough to recall that all affine-invariant metrics on Θ have the same Riemannian exponential map [25,29],

Exp_Σ(v) = Σ^{1/2} exp( Σ^{-1/2} v Σ^{-1/2} ) Σ^{1/2}

where exp denotes the matrix exponential. Second, as in (ii) of Proposition 5, choose step sizes of the form γ_n = a/n. Third, as in (iii) of Proposition 5, let u(Σ, x) be the vector field on Θ given by (2b),

u(Σ, x) = ∇ log L(Σ)(x)    (12)

where ∇ denotes the gradient with respect to the information metric, and L(Σ) is the likelihood ratio, equal to the density p(x ; Σ) divided by p(x ; Σ*). Now, replacing (12) into (1a) defines an original algorithm for recursive estimation of the true scatter matrix Σ*.
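In the Gaussian case h(t) = -t/2 (a special case of (10), used here only as an illustration), the information metric is of the form (11) with α = 1/2 and β = 0, and the gradient (12) reduces to u(Σ, x) = x x^⊤ - Σ. A minimal sketch of the resulting recursion is given below, under these assumptions; the helper names are illustrative, and the small offset in the step sizes only guards against instability during the first few updates, in the spirit of Remark 3.

```python
import numpy as np
from scipy.linalg import expm, sqrtm, inv

def exp_spd(Sigma, V):
    """Riemannian exponential map of an affine-invariant metric on positive definite matrices."""
    S = np.real(sqrtm(Sigma))
    S_inv = inv(S)
    return S @ expm(S_inv @ V @ S_inv) @ S

rng = np.random.default_rng(3)
m = 3
A = rng.standard_normal((m, m))
Sigma_star = A @ A.T + np.eye(m)            # true scatter matrix (Gaussian case, h(t) = -t/2)

Sigma = np.eye(m)                           # initial guess
for n in range(1, 10001):
    x = rng.multivariate_normal(np.zeros(m), Sigma_star)
    u = np.outer(x, x) - Sigma              # information-metric gradient of the log-likelihood (Gaussian case)
    gamma = 1.0 / (n + 20)                  # ~ 1/n step sizes; the offset avoids large early steps
    Sigma = exp_spd(Sigma, gamma * u)       # update (1a)

print(np.linalg.norm(Sigma - Sigma_star) / np.linalg.norm(Sigma_star))  # relative error, small for large n
```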
To apply this algorithm in practice, one may evaluate u(Σ, x) via the following steps. First, compute the gradient of log L(Σ) with respect to the affine-invariant metric of [29], which corresponds to α = 1 and β = 0 in (11); by direct calculation from (10), this gradient has a closed-form expression in terms of Σ, x, and the derivative h′. Then, after introducing two further constants, which depend on α and β, u(Σ, x) can be evaluated from the orthogonal decomposition of this gradient.
Figure 1 and Figure 2 below display numerical results from an application to Kotz-type distributions, which correspond to a particular choice of the function h in (10), and of the constant C in (11c) [24,27]. These figures were generated from a large number of Monte Carlo runs of the algorithm defined by (1a) and (12), with random initialisation, for specific values of the model parameters; essentially the same numerical results could be observed for any other choice of these values.
Figure 1 confirms the fast non-asymptotic rate of convergence (5), stated in (i) of Proposition 5. On a log-log scale, it shows the empirical mean of the squared Riemannian distance d²(Σ̂_n, Σ*), computed over the Monte Carlo runs, as a function of n. This decreases with a constant negative slope equal to -1, as expected from the O(n^{-1}) rate (5). Here, the Riemannian distance d(·,·) induced by the information metric (11) is given by [28]

d²(Σ, S) = α tr[ ( log(Σ^{-1/2} S Σ^{-1/2}) )² ] + β tr²[ log(Σ^{-1/2} S Σ^{-1/2}) ]

where log denotes the symmetric matrix logarithm [30].
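Assuming the affine-invariant distance written above, a short numerical helper (illustrative, not taken from the paper) can be used to evaluate it; for commuting matrices it reduces to a log-Euclidean expression, which provides a quick sanity check.

```python
import numpy as np
from scipy.linalg import logm, sqrtm, inv

def spd_squared_distance(Sigma, S, alpha=1.0, beta=0.0):
    """Squared affine-invariant distance between positive definite matrices,
    for a metric of the form (11) with constants alpha and beta (illustrative helper)."""
    R = np.real(sqrtm(inv(Sigma)))
    log_term = np.real(logm(R @ S @ R))            # log( Sigma^{-1/2} S Sigma^{-1/2} )
    return alpha * np.trace(log_term @ log_term) + beta * np.trace(log_term) ** 2

Sigma = np.diag([1.0, 2.0, 3.0])
S = np.diag([2.0, 2.0, 6.0])
# For commuting matrices the distance reduces to a log-Euclidean expression:
print(spd_squared_distance(Sigma, S), np.sum(np.log([2.0, 1.0, 2.0]) ** 2))  # both about 0.961
```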
Figure 2 confirms the asymptotic efficiency of the recursive estimates Σ̂_n, stated in (iii) of Proposition 5, using Remark 4. It shows a kernel density estimate of n d²(Σ̂_n, Σ*), for a fixed large value of n (solid blue curve). This agrees with a χ²-distribution with 28 degrees of freedom (dotted red curve), where 28 is indeed the dimension of the parameter space Θ in the present application.
5. Conclusions
Recursive estimation is a subject that is over fifty years old [10], with its foundation in the general theory of stochastic optimisation [9,15]. Its applications are very wide-ranging, as they cover areas from control theory to machine learning [17].
With the increasing role of Riemannian manifolds in statistical inference and machine learning, it was natural to generalise the techniques of stochastic optimisation from Euclidean space to Riemannian manifolds. Indeed, this started with the work of Bonnabel [1], which prompted a series of successive works, such as [2,3].
These works have mostly sought to directly adapt classical results, known in Euclidean space, which concern optimal rates of convergence to a unique attractive minimum of a cost function. The present work also belongs to this line of thinking. It shows that, when dealing with a recursive estimation problem in which the unknown statistical parameter belongs to a differentiable manifold, equipping this manifold with the information metric of the underlying statistical model leads to optimal algorithm performance, which is moreover automatic (it does not involve parameter tuning).
The results obtained in the present work (as well as in [2,3]) suffer from inherent limitations. Indeed, being focused only on convergence to a unique attractive minimum, they do not tackle the following important open problems:
- stability of stochastic optimisation algorithms: finding verifiable conditions which ensure that a stochastic optimisation algorithm remains within a compact set. A more general form of this problem is computing the probability that a stochastic optimisation algorithm exits a certain neighborhood of a stationary point (whether attractive or not) within a finite number of iterations.
- non-asymptotic performance of stochastic optimisation algorithms: this involves computing explicitly the outcome which the algorithm is able to achieve after a given finite number of iterations. This provides a much stronger theoretical guarantee to the user than standard results, which only compute a rate of convergence.
These problems have attracted much attention, and generated well-known results, when considered in the Euclidean case [31,32], but they remain open in the context of Riemannian manifolds. They involve a much richer interaction between Riemannian geometry and stochastic optimisation, due to their global nature.