
Article

A Geometric Approach to Average Problems on Multinomial and Negative Multinomial Models

Mingming Li, Huafei Sun and Didong Li
1 School of Mathematics and Statistics, Beijing Institute of Technology, Beijing 100081, China
2 Beijing Key Laboratory on MCAACI, Beijing Institute of Technology, Beijing 100081, China
3 Department of Mathematics, Duke University, Durham, NC 27708, USA
* Author to whom correspondence should be addressed.
Entropy 2020, 22(3), 306; https://doi.org/10.3390/e22030306
Submission received: 12 February 2020 / Revised: 29 February 2020 / Accepted: 5 March 2020 / Published: 8 March 2020
(This article belongs to the Special Issue Information Geometry III)

Abstract: This paper is concerned with the formulation and computation of average problems on the multinomial and negative multinomial models. It can be deduced that the multinomial and negative multinomial models admit complementary geometric structures. Firstly, we investigate these geometric structures by deriving expressions for some fundamental geometric quantities, such as the Fisher-Riemannian metrics, α-connections and α-curvatures. Then, we proceed to consider some average methods based on these geometric structures. Specifically, we study the formulation and computation of the midpoint of two points and the Karcher mean of multiple points. In conclusion, we find some parallel results for the average problems on these two complementary models.

1. Introduction

The concept of an average of a set of points within a given geometric structure appears throughout mathematical research, and the subject has developed significantly since the introduction of the Karcher mean [1]. Among the various structures to be studied, the standard simplex presents itself as an interesting framework, since it can be directly connected to the parameter spaces of various probability distributions, such as the two models discussed here, the multinomial and negative multinomial models. There are already some recent works involving the statistical modeling of the probability simplex [2]. In the present work, we provide alternative modelings by considering the multinomial and negative multinomial models with classical methods of information geometry.
As a newly developed theory, information geometry supplies various measures of the discrepancy between two probability distributions. In addition to some standard distance functions, divergence functions are intended for measuring the asymmetric proximity of probability distributions on an appropriate statistical model. Some geometric quantities, such as a Riemannian metric and a pair of dual connections, can be readily induced from a divergence function by its higher-order derivatives [3,4]. In this way, the geometric structures of various parametric statistical models can be studied specifically [5,6,7], and investigations of some particular parametric models can be found in [8,9]. Among these models, we find the multinomial model and the negative multinomial model especially interesting, in the sense that they can be seen as a pair of complementary model spaces. The multinomial model is well known as a spherical space of positive constant curvature [10], while the negative multinomial model is found to be a hyperbolic space of negative constant curvature [11].
To be more specific, the motivation of the present paper is twofold. Firstly, we aim at clarifying the complementary geometric structures of the multinomial and negative multinomial models. The main results about these geometric structures, involving quantities such as the Fisher-Riemannian metrics, α-connections and α-curvatures, are collected in Section 3, most of which can be derived in a standard way. In particular, this paper extends the isometric representation results for the multinomial model to the negative multinomial model, obtaining new insight into the complementary structures of these two models, as illustrated by Table 1. Secondly, the original purpose of the formulation and computation of average problems is approached by utilizing these geometric structures. To this end, we propose a generalized concept of midpoints for two points and a computation scheme for the Karcher mean of multiple points. For the midpoints, we generalize the Chernoff points in the literature [12] to some wider parametrized classes. For the Karcher mean, as there are many algorithmic results [13,14] for general manifolds, this paper mainly contributes to addressing some practical issues, such as the choice of initial points and the iteration computation, which yields effective solving methods via the geometric structures of the multinomial and negative multinomial models. The results about these average methods are presented in Section 4.

2. Preliminaries

For the sake of clarity, we summarize in this section some preliminary knowledge about information divergence functions (more details can be found in [15]).
Given a particular parametric statistical model $\mathcal{M} = \{p_\theta \mid \theta \in \Theta\}$, for our purpose we mainly consider invariant divergences that satisfy the property of information monotonicity, as mentioned in [15]. A typical kind of invariant divergence is given in the form of the well-defined f-divergence as
$$D_f(p_\theta \,\|\, p_\xi) = \int_{\mathcal{X}} p_\theta(x)\, f\!\left(\frac{p_\xi(x)}{p_\theta(x)}\right)\mathrm{d}x,$$
where f is a convex function satisfying $f(1) = 0$ and $f''(1) = 1$.
A commonly used class of f-divergences is given by the α-divergence $D^{(\alpha)}$, with
$$f(u) = \begin{cases} \dfrac{4}{1-\alpha^2}\left(1 - u^{\frac{1+\alpha}{2}}\right), & \alpha \neq \pm 1,\\[4pt] -\log u, & \alpha = -1,\\[4pt] u\log u, & \alpha = 1. \end{cases}$$
Particularly, for $\alpha = -1$, the divergence $D^{(-1)}$ is usually called the Kullback–Leibler divergence, which we denote by $D_{KL}$; for $\alpha = 0$, $D^{(0)}$ is usually called the squared Hellinger distance, which we denote by $H^2$; and for $\alpha = 3$, $D^{(3)}$ is often related to the chi-square statistic.
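For readers who want to experiment numerically, here is a minimal sketch (ours, not part of the paper; the function name alpha_divergence and the test vectors are illustrative) that evaluates the α-divergence of two finite discrete distributions directly from the defining f-divergence sum; the branches at α = ±1 reproduce the Kullback–Leibler divergence and its dual.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """D^(alpha)(p || q) for finite discrete distributions,
    computed from the f-divergence sum  sum_x p(x) f(q(x)/p(x))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    u = q / p
    if alpha == -1.0:                          # Kullback-Leibler divergence
        return float(np.sum(p * (-np.log(u))))
    if alpha == 1.0:                           # dual Kullback-Leibler divergence
        return float(np.sum(q * np.log(u)))
    f = lambda u: 4.0 / (1.0 - alpha**2) * (1.0 - u**((1.0 + alpha) / 2.0))
    return float(np.sum(p * f(u)))

p, q = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]
print(alpha_divergence(p, q, -1.0))   # KL(p || q)
print(alpha_divergence(p, q, 0.0))    # squared Hellinger distance 4(1 - sum sqrt(pq))
```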
While our later results are mainly related to the α-divergence, for comparison we briefly mention another f-divergence called the exponential divergence (see §12.3.6 of [6]), which we denote by E, with $f(u) = \frac{1}{2}\log^2 u$.
For any divergence function D on a statistical manifold, we can construct another divergence function $D^*$, called the dual divergence, by swapping the arguments ([16]):
$$D^*(p_\theta \,\|\, p_\xi) = D(p_\xi \,\|\, p_\theta).$$
The dual divergence of an f-divergence $D_f$ is again an f-divergence, $D_f^* = D_{f^*}$, with $f^*(u) = u f\!\left(\frac{1}{u}\right)$. Particularly, the dual divergence of the α-divergence $D^{(\alpha)}$ is just $D^{(-\alpha)}$.
Since a divergence is generally not symmetric, a symmetric divergence $D_s$ can be constructed from an asymmetric D by averaging it with its dual:
$$D_s = \frac{1}{2}(D + D^*).$$
A Riemannian metric g can be induced by a divergence D as
$$g\!\left(\partial_{\theta^i}, \partial_{\theta^j}\right) = g_{ij}(\theta) = \left.\frac{\partial^2}{\partial \xi^i\,\partial \xi^j}\right|_{\xi=\theta} D(p_\theta \,\|\, p_\xi), \qquad (1)$$
which is equivalent to the usual Fisher-Riemannian metric in the case when D is an f-divergence.
Furthermore, an affine connection ∇ is induced by the divergence D with connection coefficients
$$g\!\left(\nabla_{\partial_{\theta^i}}\partial_{\theta^j}, \partial_{\theta^k}\right) = \Gamma_{ij,k}(\theta) = -\left.\frac{\partial^3}{\partial\theta^i\,\partial\theta^j\,\partial\xi^k}\right|_{\xi=\theta} D(p_\theta \,\|\, p_\xi). \qquad (2)$$
Similarly, another affine connection $\nabla^*$ can be obtained by replacing D with the dual divergence $D^*$ in the above formula. Thus, with the primal connection ∇ and the dual connection $\nabla^*$, the statistical manifold admits a dual structure $(\mathcal{M}, g, \nabla, \nabla^*)$, which is called dually flat if both ∇ and $\nabla^*$ are flat.
As a well-known result [15], the primal connection induced by an f-divergence $D_f$ is the same as the usual α-connection $\nabla^{(\alpha)}$ with $\alpha = 3 + 2f'''(1)$, while the induced dual connection is $\nabla^{(-\alpha)}$. Particularly, one can check that the primal connection induced by the α-divergence $D^{(\alpha)}$ is exactly the α-connection $\nabla^{(\alpha)}$.

3. Geometric Structure of Multinomial and Negative Multinomial Models

3.1. Basic Information Geometric Structure

In this subsection, we present the parametric formulations of the multinomial and negative multinomial models, respectively. Then some basic results about divergences and geometric structures are derived for both models.

3.1.1. Multinomial Model

Consider the n-dimensional multinomial model $\mathcal{M}_N$ consisting of (n+1)-nomial distributions with probability mass function given by
$$p_\theta(x_0, x_1, \ldots, x_n) = \frac{N!}{x_0!\,x_1!\cdots x_n!}\,\theta_0^{x_0}\theta_1^{x_1}\cdots\theta_n^{x_n}, \qquad (3)$$
where $\sum_{i=0}^n x_i = N$ and the parametrization is given by $\theta = (\theta_1, \ldots, \theta_n)$ with $\theta_0 = 1 - \sum_{i=1}^n \theta_i$.
We can rewrite Equation (3) as
$$p_\theta(x_0, x_1, \ldots, x_n) = \frac{N!}{x_0!\,x_1!\cdots x_n!}\exp\left(\sum_{i=1}^n x_i\log\frac{\theta_i}{\theta_0} + N\log\theta_0\right).$$
By some general knowledge about exponential distribution families [17], we see that the multinomial model $\mathcal{M}_N$ admits the natural parameters
$$\eta_i = \log\frac{\theta_i}{\theta_0}, \quad i = 1, \ldots, n, \qquad (4)$$
and the potential function
$$\psi(\theta) = -N\log\theta_0 = N\log\left(1 + \sum_{i=1}^n \exp\eta_i\right),$$
from which we also obtain the expectation parameters
$$\partial\psi/\partial\eta_i = N\left(1 + \sum_{j=1}^n \exp\eta_j\right)^{-1}\exp\eta_i = N\theta_i, \quad i = 1, \ldots, n. \qquad (5)$$
Next, we have the following result by direct calculation.
Proposition 1. 
The divergences introduced in Section 2 are obtained for $\mathcal{M}_N$ as follows:
the Kullback–Leibler divergence
$$D_{KL}(p_\theta \,\|\, p_\xi) = N\sum_{i=0}^n \theta_i\log\frac{\theta_i}{\xi_i};$$
the squared Hellinger distance
$$H^2(p_\theta, p_\xi) = 4\left(1 - \Big(\sum_{i=0}^n \sqrt{\theta_i\xi_i}\Big)^{\!N}\right);$$
the α-divergence
$$D^{(\alpha)}(p_\theta \,\|\, p_\xi) = \frac{4}{1-\alpha^2}\left(1 - \Big(\sum_{i=0}^n \theta_i^{\frac{1-\alpha}{2}}\xi_i^{\frac{1+\alpha}{2}}\Big)^{\!N}\right);$$
and the exponential divergence
$$E(p_\theta \,\|\, p_\xi) = \frac{N(N-1)}{2}\Big(\sum_{i=0}^n \theta_i\log\frac{\theta_i}{\xi_i}\Big)^{\!2} + \frac{N}{2}\sum_{i=0}^n \theta_i\log^2\frac{\theta_i}{\xi_i}.$$
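As a quick sanity check (our own illustration, not from the paper; the helper names are ours), the closed-form Kullback–Leibler expression of Proposition 1 can be compared against a brute-force sum over the finite support of the multinomial distribution:

```python
from itertools import product
from math import factorial, prod, log

def multinomial_pmf(x, theta, N):
    # N!/(x_0! ... x_n!) * theta_0^{x_0} ... theta_n^{x_n}
    return factorial(N) / prod(factorial(k) for k in x) * prod(t**k for t, k in zip(theta, x))

def support(m, N):
    # all m-tuples of nonnegative integers summing to N
    return [x for x in product(range(N + 1), repeat=m) if sum(x) == N]

theta, xi, N = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2], 4
brute = sum(multinomial_pmf(x, theta, N) * log(multinomial_pmf(x, theta, N) / multinomial_pmf(x, xi, N))
            for x in support(3, N))
closed = N * sum(t * log(t / s) for t, s in zip(theta, xi))
print(brute, closed)   # the two values agree up to floating-point rounding
```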
From any one of these expressions, by using Equation (1), we obtain the Fisher-Riemannian metric matrix as
$$g_{ij}(\theta) = \frac{N}{\theta_0} + \frac{N}{\theta_i}\delta_{ij}, \qquad (6)$$
where $\delta_{ij} = 1$ if $i = j$ and 0 otherwise. Then, the inverse matrix of g can also be obtained as
$$g^{ij}(\theta) = \frac{\theta_i}{N}\delta_{ij} - \frac{\theta_i\theta_j}{N}. \qquad (7)$$
Furthermore, by direct verification with the metric expression of Equation (6), we have the following well-known result.
Theorem 1 
([18]). An isometry is established between the multinomial model $\mathcal{M}_N$ and the n-sphere within the non-negative orthant of the Euclidean space $\mathbb{R}^{n+1}$ by the parametric mapping
$$(\theta_0, \theta_1, \ldots, \theta_n) \mapsto 2\sqrt{N}\left(\theta_0^{1/2}, \theta_1^{1/2}, \ldots, \theta_n^{1/2}\right). \qquad (8)$$
Consequently, via this isometry, the Fisher-Riemannian geodesic distance between two parameters θ and ξ of $\mathcal{M}_N$ is given by
$$d(\theta, \xi) = 2\sqrt{N}\arccos\left(\sum_{i=0}^n \sqrt{\theta_i\xi_i}\right). \qquad (9)$$
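In code, the distance of Equation (9) is a one-liner; the sketch below (ours, with an illustrative function name) clips the inner sum to guard against rounding slightly outside [-1, 1].

```python
import numpy as np

def fisher_rao_distance_multinomial(theta, xi, N=1):
    """Geodesic distance 2*sqrt(N)*arccos(sum_i sqrt(theta_i xi_i)) of Equation (9);
    theta and xi are full probability vectors (theta_0, ..., theta_n)."""
    a = np.sum(np.sqrt(np.asarray(theta, float) * np.asarray(xi, float)))
    return 2.0 * np.sqrt(N) * np.arccos(np.clip(a, -1.0, 1.0))

print(fisher_rao_distance_multinomial([0.2, 0.3, 0.5], [0.4, 0.4, 0.2]))
```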
Referring to some basic differential geometrical concepts (see [19]), we can derive some further consequent results as follows.
Corollary 1. 
The n-dimensional multinomial model $\mathcal{M}_N$ with the Fisher-Riemannian metric is a Riemannian manifold of constant sectional curvature $K = \frac{1}{4N}$ and scalar curvature $S = \frac{n(n-1)}{4N}$. Furthermore, for a unit-speed geodesic γ in $\mathcal{M}_N$, the normal Jacobi fields along γ are precisely the linear combinations of vector fields of the form $J(t) = \sin\!\big(\tfrac{t}{2\sqrt{N}}\big)E(t)$ or $J(t) = \cos\!\big(\tfrac{t}{2\sqrt{N}}\big)E(t)$, where E is any parallel normal vector field along γ.

3.1.2. Negative Multinomial Model

Consider the n-dimensional negative multinomial model $\mathcal{NM}_M$ consisting of negative (n+1)-nomial distributions with probability mass function given by
$$p_\theta(x_1, \ldots, x_n) = \frac{\Gamma(M + x_1 + \cdots + x_n)}{\Gamma(M)\,x_1!\cdots x_n!}\,\theta_0^M\theta_1^{x_1}\cdots\theta_n^{x_n}, \qquad (10)$$
where $M > 0$, $x_1, \ldots, x_n \geq 0$ and the parametrization is given by $\theta = (\theta_1, \ldots, \theta_n)$ with $\theta_0 = 1 - \sum_{i=1}^n \theta_i$.
With the rewritten form of Equation (10) as
$$p_\theta(x_1, \ldots, x_n) = \frac{\Gamma(M + x_1 + \cdots + x_n)}{\Gamma(M)\,x_1!\cdots x_n!}\exp\left(\sum_{i=1}^n x_i\log\theta_i + M\log\theta_0\right),$$
we find that the negative multinomial model $\mathcal{NM}_M$ admits the natural parameters
$$\eta_i = \log\theta_i, \quad i = 1, \ldots, n, \qquad (11)$$
and the potential function
$$\psi(\theta) = -M\log\theta_0 = -M\log\left(1 - \sum_{i=1}^n \exp\eta_i\right),$$
from which we also obtain the expectation parameters
$$\partial\psi/\partial\eta_i = M\left(1 - \sum_{j=1}^n \exp\eta_j\right)^{-1}\exp\eta_i = \frac{M\theta_i}{\theta_0}. \qquad (12)$$
Again, we derive the following result by direct calculation.
Proposition 2. 
The divergences introduced in Section 2 are obtained for $\mathcal{NM}_M$ as follows:
the Kullback–Leibler divergence
$$D_{KL}(p_\theta \,\|\, p_\xi) = M\log\frac{\theta_0}{\xi_0} + M\sum_{i=1}^n \frac{\theta_i}{\theta_0}\log\frac{\theta_i}{\xi_i};$$
the squared Hellinger distance
$$H^2(p_\theta, p_\xi) = 4\left(1 - \left(\frac{\sqrt{\theta_0\xi_0}}{1 - \sum_{i=1}^n \sqrt{\theta_i\xi_i}}\right)^{\!M}\right);$$
the α-divergence
$$D^{(\alpha)}(p_\theta \,\|\, p_\xi) = \frac{4}{1-\alpha^2}\left(1 - \left(\frac{\theta_0^{\frac{1-\alpha}{2}}\xi_0^{\frac{1+\alpha}{2}}}{1 - \sum_{i=1}^n \theta_i^{\frac{1-\alpha}{2}}\xi_i^{\frac{1+\alpha}{2}}}\right)^{\!M}\right);$$
and the exponential divergence
$$E(p_\theta \,\|\, p_\xi) = \frac{M^2}{2}\left(\log\frac{\theta_0}{\xi_0} + \sum_{i=1}^n \frac{\theta_i}{\theta_0}\log\frac{\theta_i}{\xi_i}\right)^{\!2} + \frac{M}{2}\left(\sum_{i=1}^n \frac{\theta_i}{\theta_0}\log^2\frac{\theta_i}{\xi_i} + \Big(\sum_{i=1}^n \frac{\theta_i}{\theta_0}\log\frac{\theta_i}{\xi_i}\Big)^{\!2}\right).$$
Next, applying Equation (1) to the divergences in the foregoing proposition, we obtain the Fisher-Riemannian metric matrix as
$$g_{ij} = \frac{M}{\theta_0^2} + \frac{M}{\theta_0\theta_i}\delta_{ij}, \qquad (13)$$
and its inverse matrix as
$$g^{ij} = \frac{\theta_0}{M}\left(\theta_i\delta_{ij} - \theta_i\theta_j\right). \qquad (14)$$
Furthermore, by direct verification with the metric expression of Equation (13), we have the following result parallel to Theorem 1.
Theorem 2. 
An isometry is established between the negative multinomial model $\mathcal{NM}_M$ and the n-hyperbola within the non-negative orthant of the Minkowski space $\mathbb{R}^{1,n}$ by the parametric mapping
$$(\theta_0, \theta_1, \ldots, \theta_n) \mapsto 2\sqrt{M}\,\theta_0^{-1/2}\left(1, \theta_1^{1/2}, \ldots, \theta_n^{1/2}\right). \qquad (15)$$
Consequently, via this isometry, the Fisher-Riemannian geodesic distance between two parameters θ and ξ of $\mathcal{NM}_M$ is given by
$$d(\theta, \xi) = 2\sqrt{M}\,\operatorname{arcosh}\left(\frac{1 - \sum_{i=1}^n \sqrt{\theta_i\xi_i}}{\sqrt{\theta_0\xi_0}}\right). \qquad (16)$$
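The hyperbolic counterpart of the previous distance sketch reads as follows (again ours; here theta holds only (θ_1, ..., θ_n) and θ_0 is recovered as 1 minus their sum, matching the parametrization above).

```python
import numpy as np

def fisher_rao_distance_neg_multinomial(theta, xi, M=1):
    """Geodesic distance of Equation (16) on the negative multinomial model."""
    theta, xi = np.asarray(theta, float), np.asarray(xi, float)
    t0, x0 = 1.0 - theta.sum(), 1.0 - xi.sum()
    arg = (1.0 - np.sum(np.sqrt(theta * xi))) / np.sqrt(t0 * x0)
    return 2.0 * np.sqrt(M) * np.arccosh(max(arg, 1.0))   # arg >= 1 up to rounding

print(fisher_rao_distance_neg_multinomial([0.1, 0.2], [0.15, 0.25]))
```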
Again, some further consequent results are obtained as follows.
Corollary 2. 
The n-dimensional negative multinomial model $\mathcal{NM}_M$ with the Fisher-Riemannian metric is a Riemannian manifold of constant sectional curvature $K = -\frac{1}{4M}$ and scalar curvature $S = -\frac{n(n-1)}{4M}$. Furthermore, for a unit-speed geodesic γ in $\mathcal{NM}_M$, the normal Jacobi fields along γ are precisely the linear combinations of vector fields of the form $J(t) = \sinh\!\big(\tfrac{t}{2\sqrt{M}}\big)E(t)$ or $J(t) = \cosh\!\big(\tfrac{t}{2\sqrt{M}}\big)E(t)$, where E is any parallel normal vector field along γ.

3.2. Dual Structures

In this section, we derive the α-connection coefficients and curvatures of the multinomial and negative multinomial models. While some basic results are equivalent to those in [11], our calculation is performed directly in the original parameters in order to give a clear presentation of the results.
The α-connection coefficients can be obtained by applying the calculation of Equation (2) to the α-divergence $D^{(\alpha)}$, as mentioned in Section 2. However, an easier derivation of the α-connection $\nabla^{(\alpha)}$ is given in the form of a linear combination of the mixture connection $\nabla^{(m)} = \nabla^{(-1)}$ and the exponential connection $\nabla^{(e)} = \nabla^{(1)}$. With the mixture connection coefficients obtained from the Kullback–Leibler divergence as
$$\Gamma^{(m)}_{ij,k}(\theta) = -\left.\frac{\partial^3}{\partial\theta^i\,\partial\theta^j\,\partial\xi^k}\right|_{\xi=\theta} D_{KL}(p_\theta \,\|\, p_\xi), \qquad (17)$$
and the exponential connection coefficients obtained from the dual Kullback–Leibler divergence as
$$\Gamma^{(e)}_{ij,k}(\theta) = -\left.\frac{\partial^3}{\partial\theta^i\,\partial\theta^j\,\partial\xi^k}\right|_{\xi=\theta} D^*_{KL}(p_\theta \,\|\, p_\xi), \qquad (18)$$
the α-connection coefficients are given by
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = \frac{1+\alpha}{2}\Gamma^{(e)}_{ij,k}(\theta) + \frac{1-\alpha}{2}\Gamma^{(m)}_{ij,k}(\theta). \qquad (19)$$
Next, the α-curvature tensor $R^{(\alpha)}$ is defined by
$$R^{(\alpha)}(X, Y)Z = \nabla^{(\alpha)}_X\nabla^{(\alpha)}_Y Z - \nabla^{(\alpha)}_Y\nabla^{(\alpha)}_X Z - \nabla^{(\alpha)}_{[X,Y]} Z,$$
where $[X, Y]$ denotes the Lie bracket of X and Y. By the duality condition (see [20])
$$X g(Y, Z) = g\!\left(\nabla^{(\alpha)}_X Y, Z\right) + g\!\left(Y, \nabla^{(-\alpha)}_X Z\right),$$
one can check that the following identity holds:
$$g\!\left(R^{(\alpha)}(\partial_i, \partial_j)\partial_j, \partial_i\right) = \partial_i\Gamma^{(\alpha)}_{jj,i} - \partial_j\Gamma^{(\alpha)}_{ij,i} + \sum_{k,l} g^{kl}\left(\Gamma^{(\alpha)}_{ij,k}\Gamma^{(-\alpha)}_{ji,l} - \Gamma^{(\alpha)}_{jj,k}\Gamma^{(-\alpha)}_{ii,l}\right), \qquad (20)$$
where $\partial_{\theta^i}$ is shortened as $\partial_i$.
At last, the α-sectional curvature spanned by two tangent vectors $\partial_i$ and $\partial_j$ ($i \neq j$) is determined by
$$K^{(\alpha)}(\partial_i, \partial_j) = \frac{g\!\left(R^{(\alpha)}(\partial_i, \partial_j)\partial_j, \partial_i\right)}{g_{ii}g_{jj} - g_{ij}^2}. \qquad (21)$$

3.2.1. Multinomial Model

For the multinomial model $\mathcal{M}_N$, by applying Equations (17)–(19) with the Kullback–Leibler divergence in Proposition 1, we have the mixture connection coefficients
$$\Gamma^{(m)}_{ij,k}(\theta) = 0, \qquad (22)$$
the exponential connection coefficients
$$\Gamma^{(e)}_{ij,k}(\theta) = \frac{N}{\theta_0^2} - \frac{N}{\theta_i^2}\delta_{ijk}, \qquad (23)$$
where $\delta_{ijk} = 1$ if $i = j = k$ and 0 otherwise, and the α-connection coefficients
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = \frac{1+\alpha}{2}\Gamma^{(e)}_{ij,k}(\theta).$$
Furthermore, by Equations (6), (7) and (20), we obtain
$$g\!\left(R^{(\alpha)}(\partial_i, \partial_j)\partial_j, \partial_i\right) = \frac{(1-\alpha^2)N}{4}\left(\frac{1}{\theta_0\theta_i} + \frac{1}{\theta_0\theta_j} + \frac{1}{\theta_i\theta_j}\right),$$
$$g_{ii}g_{jj} - g_{ij}^2 = N^2\left(\frac{1}{\theta_0\theta_i} + \frac{1}{\theta_0\theta_j} + \frac{1}{\theta_i\theta_j}\right).$$
Thus, via Equation (21), we recover the following result.
Theorem 3 
([11]). The multinomial model $\mathcal{M}_N$ admits constant α-sectional curvature
$$K^{(\alpha)} = \frac{1-\alpha^2}{4N}.$$

3.2.2. Negative Multinomial Model

For the negative multinomial model $\mathcal{NM}_M$, by applying Equations (17)–(19) with the Kullback–Leibler divergence in Proposition 2, we have the mixture connection coefficients
$$\Gamma^{(m)}_{ij,k}(\theta) = \frac{M}{\theta_0^2}\left(\frac{2}{\theta_0} + \frac{\delta_{ik} + \delta_{jk}}{\theta_k}\right) = (g_{ik} + g_{jk})/\theta_0,$$
the exponential connection coefficients
$$\Gamma^{(e)}_{ij,k}(\theta) = -\frac{M}{\theta_0\theta_i^2}\delta_{ijk} - \frac{M}{\theta_0^2\theta_i}\delta_{ij} = -g_{ik}\delta_{ij}/\theta_i,$$
and the α-connection coefficients
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = \frac{1+\alpha}{2}\Gamma^{(e)}_{ij,k}(\theta) + \frac{1-\alpha}{2}\Gamma^{(m)}_{ij,k}(\theta).$$
Furthermore, by Equations (13), (14) and (20), we have
$$g\!\left(R^{(\alpha)}(\partial_i, \partial_j)\partial_j, \partial_i\right) = -\frac{(1-\alpha^2)M}{4\theta_0^2}\left(\frac{1}{\theta_0\theta_i} + \frac{1}{\theta_0\theta_j} + \frac{1}{\theta_i\theta_j}\right),$$
$$g_{ii}g_{jj} - g_{ij}^2 = \frac{M^2}{\theta_0^2}\left(\frac{1}{\theta_0\theta_i} + \frac{1}{\theta_0\theta_j} + \frac{1}{\theta_i\theta_j}\right).$$
Again, via Equation (21), we recover another parallel result.
Theorem 4 
([11]). The negative multinomial model $\mathcal{NM}_M$ admits constant α-sectional curvature
$$K^{(\alpha)} = -\frac{1-\alpha^2}{4M}.$$
For clarity, we summarize these results about the complementary geometric structures of the multinomial and negative multinomial models in Table 1.

4. Geometric Average Methods on Multinomial and Negative Multinomial Models

In this section, we present some average methods induced by the geometry of the multinomial and negative multinomial models.
Firstly, we consider the particular case when the to-be-averaged set consists of only two points. In this case, the problem is to find a method of computing a midpoint in some geometric sense. Next, via some techniques related to the Karcher mean, we consider the general case with a set of multiple points.

4.1. Midpoints of Two Points

In this subsection, again within the multinomial and negative multinomial model, we study a particular class of midpoints named Chernoff points. The original Chernoff point, which is motivated by the application of computing the best error exponent for the Bayesian hypothesis testing problem, is determined as the intersection point of an exponential geodesic and a mixture bisector [21]. Furthermore, there are three other generalized Chernoff points proposed by [12].
To present a further generalization, here we formulate the concepts of the α-geodesic and the α-bisector determined by two probability distributions $p_\theta$ and $p_{\theta'}$ of a parametric statistical model $\mathcal{M}$.
Definition 1. 
The α-geodesic is determined by the geodesic equation of the α-connection $\nabla^{(\alpha)}$ as
$$G^{(\alpha)}(p_\theta, p_{\theta'}) = \left\{p_{\theta(t)} \in \mathcal{M} \,\middle|\, \theta(0) = \theta,\ \theta(1) = \theta',\ \nabla^{(\alpha)}_{\dot\theta(t)}\dot\theta(t) = 0,\ t \in [0, 1]\right\},$$
where $\dot\theta(t)$ denotes the velocity vector of the curve $\theta(t)$.
Particularly, since an exponential family model is ±1-flat, as can be directly seen from Equations (22) and (23) for our case, the exponential geodesic for α = 1 can be determined by linear interpolation of the natural parameters, while the mixture geodesic for α = −1 can be determined by linear interpolation of the expectation parameters.
Definition 2. 
The α-bisector is determined by the equi-divergence identity of the α-divergence $D^{(\alpha)}$ as
$$\mathrm{Bi}^{(\alpha)}(p_\theta, p_{\theta'}) = \left\{p_{\theta''} \in \mathcal{M} \,\middle|\, D^{(\alpha)}(p_{\theta''}\,\|\,p_\theta) = D^{(\alpha)}(p_{\theta''}\,\|\,p_{\theta'})\right\}. \qquad (24)$$
Particularly, the exponential bisector for α = 1 is determined in terms of the dual Kullback–Leibler divergence, while the mixture bisector for α = −1 is determined in terms of the Kullback–Leibler divergence.
Then, we can generalize the notion of Chernoff points suggested by [12] as follows.
Definition 3. 
Two types of generalized Chernoff points are given by the intersection points with parameter α:
$$CP_I^{(\alpha)}(p_\theta, p_{\theta'}) = G^{(\alpha)}(p_\theta, p_{\theta'}) \cap \mathrm{Bi}^{(-\alpha)}(p_\theta, p_{\theta'}),$$
$$CP_{II}^{(\alpha)}(p_\theta, p_{\theta'}) = G^{(\alpha)}(p_\theta, p_{\theta'}) \cap \mathrm{Bi}^{(\alpha)}(p_\theta, p_{\theta'}).$$
Thus, the Chernoff points already proposed in previous works can be recovered by setting α = ± 1 in Definition 3.
The existence of these intersection points is assured by the intermediate value property of the determining Equation (24), since replacing θ″ by θ and by θ′, respectively, yields two opposite inequalities, due to the non-negativity of the α-divergence.
While uniqueness can be proven for an exponential family model if α = ±1 (see [12]), we conjecture that it still holds in general, but this is not pursued in the present paper.
To elucidate what we have mentioned earlier about the application of the original Chernoff point ($CP_I^{(1)}$ in our notation) to the binary Bayesian hypothesis testing problem, we present here the upper bound of the probability of error of the Bayesian decision suggested by [21] as
$$\exp\left(-D_{KL}(p_{\theta^*}\,\|\,p_\theta)\right) = \exp\left(-D_{KL}(p_{\theta^*}\,\|\,p_{\theta'})\right), \qquad (25)$$
where $p_{\theta^*}$ denotes the Chernoff point $CP_I^{(1)}(p_\theta, p_{\theta'})$.
The overlapping case of the two classes of generalized Chernoff points is given by α = 0. For the multinomial and negative multinomial models, by comparing the 0-divergence, i.e., the squared Hellinger distance, with the Fisher-Riemannian geodesic distance presented in Section 3.1, we find that the generalized Chernoff point $CP_I^{(0)}(p_\theta, p_{\theta'}) = CP_{II}^{(0)}(p_\theta, p_{\theta'})$ is exactly the unique Fisher-Riemannian geodesic midpoint between $p_\theta$ and $p_{\theta'}$ within both models.
Next, we summarize some specific results about Chernoff points for both models as follows.

4.1.1. Multinomial Model

For the multinomial model $\mathcal{M}_N$, although the geodesic equation of the α-geodesic can be written down explicitly in general, simple closed-form geodesic expressions are available at least for α = ±1.
Proposition 3. 
The exponential and mixture geodesics connecting two probability distributions $p_\theta$ and $p_{\theta'}$ of the multinomial model $\mathcal{M}_N$ are given by
$$G^{(1)}(p_\theta, p_{\theta'}) = \left\{p_{\theta(t)} \in \mathcal{M}_N \,\middle|\, \theta_i(t) = \frac{\theta_i^{1-t}(\theta_i')^t}{\sum_{j=0}^n \theta_j^{1-t}(\theta_j')^t},\ i = 0, \ldots, n,\ t \in [0, 1]\right\},$$
$$G^{(-1)}(p_\theta, p_{\theta'}) = \left\{p_{\theta(t)} \in \mathcal{M}_N \,\middle|\, \theta_i(t) = (1-t)\theta_i + t\theta_i',\ i = 0, \ldots, n,\ t \in [0, 1]\right\}.$$
Proof. 
As already mentioned, the exponential and mixture geodesics can be easily obtained by linear interpolation of the natural and expectation parameters given by Equations (4) and (5), respectively. □
By using the expressions of α -divergences presented in Proposition 1, the α -bisectors are directly obtained as follows.
Proposition 4. 
The α-bisectors between $p_\theta$ and $p_{\theta'}$ within the multinomial model $\mathcal{M}_N$ are given by the following equations:
$$\mathrm{Bi}^{(\alpha)}(p_\theta, p_{\theta'}) = \left\{p_{\theta''} \in \mathcal{M}_N \,\middle|\, \sum_{i=0}^n (\theta_i'')^{\frac{1-\alpha}{2}}\Big[\theta_i^{\frac{1+\alpha}{2}} - (\theta_i')^{\frac{1+\alpha}{2}}\Big] = 0\right\}, \quad \alpha \neq \pm 1,$$
$$\mathrm{Bi}^{(1)}(p_\theta, p_{\theta'}) = \left\{p_{\theta''} \in \mathcal{M}_N \,\middle|\, \sum_{i=0}^n (\theta_i - \theta_i')\log\theta_i'' = \sum_{i=0}^n (\theta_i\log\theta_i - \theta_i'\log\theta_i')\right\},$$
$$\mathrm{Bi}^{(-1)}(p_\theta, p_{\theta'}) = \left\{p_{\theta''} \in \mathcal{M}_N \,\middle|\, \sum_{i=0}^n \theta_i''\log\frac{\theta_i}{\theta_i'} = 0\right\}.$$
Combining the previous two propositions, we have the following result about the determining equations for the four particular Chernoff points with α = ± 1 .
Theorem 5. 
The determining equations for the Chernoff points with α = ±1 of $p_\theta$ and $p_{\theta'}$ within the multinomial model $\mathcal{M}_N$ are expressed in terms of the argument $t \in [0, 1]$ of the corresponding geodesics as follows:
1. For $CP_I^{(1)}(p_\theta, p_{\theta'})$,
$$\sum_{i=0}^n \theta_i^{1-t}(\theta_i')^t\log\frac{\theta_i}{\theta_i'} = 0;$$
2. For $CP_I^{(-1)}(p_\theta, p_{\theta'})$,
$$\sum_{i=0}^n (\theta_i - \theta_i')\log\big((1-t)\theta_i + t\theta_i'\big) = \sum_{i=0}^n (\theta_i\log\theta_i - \theta_i'\log\theta_i');$$
3. For $CP_{II}^{(1)}(p_\theta, p_{\theta'})$,
$$t = \frac{D_{KL}(p_{\theta'}\,\|\,p_\theta)}{D_{KL}(p_\theta\,\|\,p_{\theta'}) + D_{KL}(p_{\theta'}\,\|\,p_\theta)};$$
4. For $CP_{II}^{(-1)}(p_\theta, p_{\theta'})$,
$$t = \frac{D_{KL}(p_\theta\,\|\,p_{\theta'})}{D_{KL}(p_\theta\,\|\,p_{\theta'}) + D_{KL}(p_{\theta'}\,\|\,p_\theta)}.$$
With the Kullback–Leibler divergence $D_{KL}$ given by Proposition 1, the two points of the second type, $CP_{II}^{(1)}$ and $CP_{II}^{(-1)}$, are already in explicit form. In contrast, the two points of the first type, $CP_I^{(1)}$ and $CP_I^{(-1)}$, are to be solved by numerical methods such as simple bisection, as suggested in [12].
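The following sketch (ours; it assumes the determining equations of Theorem 5, and N cancels in both equations, so it is omitted) solves the first-type point CP_I^(1) by bisection and evaluates the explicit second-type point CP_II^(1).

```python
import numpy as np

def chernoff_point_I_e(theta, theta_p, tol=1e-12):
    """CP_I^(1): solve sum_i theta_i^(1-t) theta'_i^t log(theta_i/theta'_i) = 0 by bisection
    (h(0) >= 0 >= h(1) by non-negativity of KL), then evaluate the exponential geodesic at t."""
    theta, theta_p = np.asarray(theta, float), np.asarray(theta_p, float)
    h = lambda t: np.sum(theta**(1 - t) * theta_p**t * np.log(theta / theta_p))
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) > 0 else (lo, mid)
    t = 0.5 * (lo + hi)
    g = theta**(1 - t) * theta_p**t
    return g / g.sum()                          # normalize back onto the simplex

def chernoff_point_II_e(theta, theta_p):
    """CP_II^(1): explicit t = KL(theta'||theta) / (KL(theta||theta') + KL(theta'||theta))."""
    theta, theta_p = np.asarray(theta, float), np.asarray(theta_p, float)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    t = kl(theta_p, theta) / (kl(theta, theta_p) + kl(theta_p, theta))
    g = theta**(1 - t) * theta_p**t
    return g / g.sum()

theta, theta_p = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
print(chernoff_point_I_e(theta, theta_p))
print(chernoff_point_II_e(theta, theta_p))
```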
For the Fisher-Riemannian geodesic midpoint, we have the following result.
Theorem 6. 
The Fisher-Riemannian geodesic midpoint $p_{\theta^*}$ between $p_\theta$ and $p_{\theta'}$ in the multinomial model $\mathcal{M}_N$ is determined by
$$\theta_i^* = \frac{\big(\sqrt{\theta_i} + \sqrt{\theta_i'}\big)^2}{2 + 2\sum_{j=0}^n \sqrt{\theta_j\theta_j'}}, \quad i = 0, \ldots, n.$$
Proof. 
Let F be the isometry given by Equation (8). Denote the linear midpoint $\frac{F(\theta) + F(\theta')}{2}$ of the two image points by $x^* \in \mathbb{R}^{n+1}$. Then we normalize $x^*$ back to the n-sphere as $u^* = 2\sqrt{N}\,x^*/\|x^*\|_e$, where the Euclidean norm $\|x\|_e = \big(\sum_{i=0}^n x_i^2\big)^{1/2}$ is used. At last, the required midpoint is obtained as the inverse image point $\theta^* = F^{-1}(u^*)$. □
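For completeness, a direct implementation of Theorem 6 (ours; N = 1 and illustrative names), which also verifies that the returned point is equidistant from the two endpoints under the distance of Equation (9):

```python
import numpy as np

def fisher_midpoint_multinomial(theta, theta_p):
    """Geodesic midpoint of Theorem 6 (full probability vectors, any n)."""
    theta, theta_p = np.asarray(theta, float), np.asarray(theta_p, float)
    num = (np.sqrt(theta) + np.sqrt(theta_p))**2
    return num / (2.0 + 2.0 * np.sum(np.sqrt(theta * theta_p)))

d = lambda a, b: 2.0 * np.arccos(np.clip(np.sum(np.sqrt(a * b)), -1.0, 1.0))  # Equation (9), N = 1
theta, theta_p = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
mid = fisher_midpoint_multinomial(theta, theta_p)
print(d(mid, theta), d(mid, theta_p))   # equal: the midpoint is equidistant from both endpoints
```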
To illustrate these notions for the multinomial model $\mathcal{M}_N$, we present a numerical example as follows. The two parameters θ and θ′ are taken as the empirical probability vectors of the first and second 100 decimal digits of π, respectively. The parameters of these two points and the resulting Chernoff points are summarized in Table 2.
As we can see, the pair of points $CP_I^{(1)}$ and $CP_{II}^{(1)}$ admits a certain similarity, as both lie on the same exponential geodesic; the same holds for the pair $CP_I^{(-1)}$ and $CP_{II}^{(-1)}$, which both lie on the same mixture geodesic. The Fisher-Riemannian geodesic midpoint $CP^{(0)}$ can be considered as a medium version among these Chernoff points.
To be mentioned particularly, the upper bound of the probability of error given by Equation (25) is obtained via $CP_I^{(1)}$ as $0.9862^N$. Thus, we can choose N sufficiently large so that the probability of error falls below any given threshold.

4.1.2. Negative Multinomial Model

For the negative multinomial model $\mathcal{NM}_M$, we can again give the geodesic expressions for the exponential and mixture cases.
Proposition 5. 
The exponential and mixture geodesics connecting two probability distributions $p_\theta$ and $p_{\theta'}$ of the negative multinomial model $\mathcal{NM}_M$ are given by
$$G^{(1)}(p_\theta, p_{\theta'}) = \left\{p_{\theta(t)} \in \mathcal{NM}_M \,\middle|\, \theta_i(t) = \theta_i^{1-t}(\theta_i')^t,\ i = 1, \ldots, n,\ t \in [0, 1]\right\},$$
$$G^{(-1)}(p_\theta, p_{\theta'}) = \left\{p_{\theta(t)} \in \mathcal{NM}_M \,\middle|\, \theta_i(t) = \frac{(1-t)\theta_i/\theta_0 + t\theta_i'/\theta_0'}{(1-t)/\theta_0 + t/\theta_0'},\ i = 1, \ldots, n,\ t \in [0, 1]\right\}.$$
Proof. 
By using Equations (11) and (12), the exponential and mixture geodesics can be easily obtained by the linear interpolation of the natural and expectation parameters, respectively. □
By using the expressions of α -divergences presented in Proposition 2, the α -bisectors are directly obtained as follows.
Proposition 6. 
The α-bisectors between $p_\theta$ and $p_{\theta'}$ within the negative multinomial model $\mathcal{NM}_M$ are given by the following equations:
$$\mathrm{Bi}^{(\alpha)}(p_\theta, p_{\theta'}) = \left\{p_{\theta''} \in \mathcal{NM}_M \,\middle|\, \sum_{i=1}^n (\theta_i'')^{\frac{1-\alpha}{2}}\Big[(\theta_i'\theta_0)^{\frac{1+\alpha}{2}} - (\theta_i\theta_0')^{\frac{1+\alpha}{2}}\Big] = \theta_0^{\frac{1+\alpha}{2}} - (\theta_0')^{\frac{1+\alpha}{2}}\right\}, \quad \alpha \neq \pm 1,$$
$$\mathrm{Bi}^{(1)}(p_\theta, p_{\theta'}) = \left\{p_{\theta''} \in \mathcal{NM}_M \,\middle|\, \sum_{i=1}^n \Big(\frac{\theta_i}{\theta_0} - \frac{\theta_i'}{\theta_0'}\Big)\log\theta_i'' = \sum_{i=0}^n \Big(\frac{\theta_i}{\theta_0}\log\theta_i - \frac{\theta_i'}{\theta_0'}\log\theta_i'\Big)\right\},$$
$$\mathrm{Bi}^{(-1)}(p_\theta, p_{\theta'}) = \left\{p_{\theta''} \in \mathcal{NM}_M \,\middle|\, \sum_{i=0}^n \theta_i''\log\frac{\theta_i}{\theta_i'} = 0\right\}.$$
Combining the previous two propositions, we have the following result about the determining equations for the four particular Chernoff points with α = ± 1 .
Theorem 7. 
The determining equations for the Chernoff points with α = ±1 of $p_\theta$ and $p_{\theta'}$ within the negative multinomial model $\mathcal{NM}_M$ are expressed in terms of the argument $t \in [0, 1]$ of the corresponding geodesics as follows:
1. For $CP_I^{(1)}(p_\theta, p_{\theta'})$,
$$\log\frac{\theta_0}{\theta_0'} + \sum_{i=1}^n \theta_i^{1-t}(\theta_i')^t\left(\log\frac{\theta_i}{\theta_i'} - \log\frac{\theta_0}{\theta_0'}\right) = 0;$$
2. For $CP_I^{(-1)}(p_\theta, p_{\theta'})$,
$$\sum_{i=1}^n \Big(\frac{\theta_i}{\theta_0} - \frac{\theta_i'}{\theta_0'}\Big)\log\frac{(1-t)\theta_i/\theta_0 + t\theta_i'/\theta_0'}{(1-t)/\theta_0 + t/\theta_0'} = \sum_{i=0}^n \Big(\frac{\theta_i}{\theta_0}\log\theta_i - \frac{\theta_i'}{\theta_0'}\log\theta_i'\Big);$$
3. For $CP_{II}^{(1)}(p_\theta, p_{\theta'})$,
$$t = \frac{D_{KL}(p_{\theta'}\,\|\,p_\theta)}{D_{KL}(p_\theta\,\|\,p_{\theta'}) + D_{KL}(p_{\theta'}\,\|\,p_\theta)};$$
4. For $CP_{II}^{(-1)}(p_\theta, p_{\theta'})$,
$$t = \frac{D_{KL}(p_\theta\,\|\,p_{\theta'})}{D_{KL}(p_\theta\,\|\,p_{\theta'}) + D_{KL}(p_{\theta'}\,\|\,p_\theta)}.$$
As can be seen, the two points of the second type, $CP_{II}^{(1)}$ and $CP_{II}^{(-1)}$, take the same explicit form as before, except that the Kullback–Leibler divergence $D_{KL}$ is now given by Proposition 2. Again, the two points of the first type, $CP_I^{(1)}$ and $CP_I^{(-1)}$, can be solved by numerical methods.
For the Fisher-Riemannian geodesic midpoint, we have the following result, complementary to the previous one.
Theorem 8. 
The Fisher-Riemannian geodesic midpoint $p_{\theta^*}$ between $p_\theta$ and $p_{\theta'}$ in the negative multinomial model $\mathcal{NM}_M$ is determined by
$$\theta_i^* = \left(\frac{\sqrt{\theta_0'\theta_i} + \sqrt{\theta_0\theta_i'}}{\sqrt{\theta_0} + \sqrt{\theta_0'}}\right)^{\!2}, \quad i = 1, \ldots, n.$$
Proof. 
Let F be the isometry given by Equation (15). Denote the linear midpoint $\frac{F(\theta) + F(\theta')}{2}$ of the two image points by $x^* \in \mathbb{R}^{1,n}$. Then we normalize $x^*$ back to the n-hyperbola as $u^* = 2\sqrt{M}\,x^*/\|x^*\|_m$, where the Minkowski norm $\|x\|_m = \big(x_0^2 - \sum_{i=1}^n x_i^2\big)^{1/2}$ is used. At last, the required midpoint is obtained as the inverse image point $\theta^* = F^{-1}(u^*)$. □
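The parallel sketch for Theorem 8 (again ours; theta stores (θ_1, ..., θ_n), and θ_0 is recovered internally):

```python
import numpy as np

def fisher_midpoint_neg_multinomial(theta, theta_p):
    """Geodesic midpoint of Theorem 8; theta_0 = 1 - sum(theta)."""
    theta, theta_p = np.asarray(theta, float), np.asarray(theta_p, float)
    t0, t0p = 1.0 - theta.sum(), 1.0 - theta_p.sum()
    num = np.sqrt(t0p * theta) + np.sqrt(t0 * theta_p)
    return (num / (np.sqrt(t0) + np.sqrt(t0p)))**2

print(fisher_midpoint_neg_multinomial([0.1, 0.2], [0.15, 0.25]))
```

Combined with the distance sketch after Theorem 2, one can check numerically that the returned point is equidistant from both endpoints.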
A numerical illustration for the negative multinomial model $\mathcal{NM}_M$ is presented as follows. The two parameters θ and θ′ are taken as the empirical probability vectors of the decimal digits of π within the first and second 10 appearances of “0”, respectively. The parameters of these two points and the resulting Chernoff points are summarized in Table 3.
Again, the pair of points $CP_I^{(1)}$ and $CP_{II}^{(1)}$ admits a certain similarity, and so does the pair $CP_I^{(-1)}$ and $CP_{II}^{(-1)}$. The Fisher-Riemannian geodesic midpoint $CP^{(0)}$ again serves as a medium version among these Chernoff points.
The upper bound of the probability of error given by Equation (25) is obtained via $CP_I^{(1)}$ as $0.8789^M$. Thus, M needs to be chosen sufficiently large so that the probability of error falls below some threshold value.

4.2. Karcher Means of Multiple Points

A natural generalization of the Fisher-Riemannian geodesic midpoint between two points is given by the Karcher mean among multiple points.
Let $\mathcal{M}$ be a metric space and S a set of points in $\mathcal{M}$. Define a criterion function $f: \mathcal{M} \to \mathbb{R}$ by
$$f(x) := \frac{1}{2|S|}\sum_{p \in S} d(x, p)^2,$$
where $d(\cdot,\cdot)$ is the distance function and $|S|$ is the number of points of S. If the minimizer of the function f exists and is unique, then it is called the Karcher mean of S on $\mathcal{M}$. If d is the distance induced by a Riemannian metric on $\mathcal{M}$, then the negative gradient vector field of f is found to be the usual average of the corresponding points in the tangent space ([1]):
$$-\nabla f(x) = \frac{1}{|S|}\sum_{p \in S} \exp_x^{-1}(p), \qquad (26)$$
where $\exp_x^{-1}$ is the inverse of the Riemannian exponential map at x. In view of this, the Karcher mean can alternatively be understood as a point at which the above vector field vanishes.
The Karcher mean may not be unique unless all points lie in a geodesically convex region. For example, there are infinitely many geodesic midpoints between two antipodal points on a sphere. However, for model spaces such as the open half-sphere and hyperbolic space, there are existing results assuring the existence and uniqueness of the Karcher mean ([22]). Thus, by virtue of Theorems 1 and 2, we conclude that the concept of a Karcher mean is well defined on the multinomial and negative multinomial models.
Now we focus on the computation of the Karcher mean on these two models. The Karcher mean of two points admits a closed-form expression as the Fisher-Riemannian geodesic midpoint presented previously, but for multiple points we can only expect to obtain a numerical solution of the Karcher mean.
By virtue of Equation (26), there is a Riemannian gradient iteration algorithm with locally superlinear convergence in general ([23]):
$$x_{i+1} = \exp_{x_i}\left(\frac{1}{|S|}\sum_{p \in S} \exp_{x_i}^{-1}(p)\right). \qquad (27)$$
However, this general algorithm is difficult to apply in practice unless proper representations of the models are derived. In our case, as we have prepared enough geometric representation results for the multinomial and negative multinomial models in Section 3, we still have to address two practical issues: the choice of initial points and the computation of the Riemannian exponential map $\exp_{x_i}$ and its inverse $\exp_{x_i}^{-1}$.

4.2.1. Initial Points

Let S be a set of parameters to be averaged in either the model $\mathcal{M}_N$ or $\mathcal{NM}_M$. We present a heuristic approach, motivated by the proofs of Theorems 1 and 2, to provide an initial point choice. The main procedure is as follows (here N = M = 1 is assumed, as the basic ideas are unchanged up to scale):
  • Set the average of the isometry images, $x^* := \frac{1}{|S|}\sum_{\theta \in S} F(\theta)$;
  • Set the normalized vector $u^* := 2x^*/\|x^*\|$; the parameter of the initial point is then given by $\theta^* := F^{-1}(u^*)$.
For the model $\mathcal{M}_N$, the isometry F is given by Equation (8) and the norm $\|\cdot\|$ is the Euclidean norm $\|\cdot\|_e$ from the proof of Theorem 6. For the model $\mathcal{NM}_M$, the isometry F is given by Equation (15) and the norm $\|\cdot\|$ is the Minkowski norm $\|\cdot\|_m$ from the proof of Theorem 8.

4.2.2. Computation of Riemannian Exponential and its Inverse

Within each of the models $\mathcal{M}_N$ and $\mathcal{NM}_M$ (again N = M = 1 is assumed), the Riemannian exponential map and its inverse can be computed in an easy-to-manipulate way. Besides the isometry F and the norm $\|\cdot\|$ given as before, we also need an inner product $\langle\cdot,\cdot\rangle$, taken to be the Euclidean inner product $\langle x, y\rangle = \sum_{i=0}^n x_iy_i$ for the model $\mathcal{M}_N$ and the Minkowski inner product $\langle x, y\rangle = x_0y_0 - \sum_{i=1}^n x_iy_i$ for the model $\mathcal{NM}_M$.
To compute $\exp_\theta^{-1}(\xi)$, we have the following steps:
  • Set a tangent vector by orthogonal projection,
$$x := F(\xi) - \langle F(\xi), F(\theta)\rangle\,F(\theta)/4;$$
  • Scale the vector by the geodesic distance,
$$u := d(\theta, \xi)\,x/\|x\|,$$
where the geodesic distance $d(\cdot,\cdot)$ is given by Equation (9) for $\mathcal{M}_N$ and by Equation (16) for $\mathcal{NM}_M$.
Thus, we use the resulting vector u to represent $\exp_\theta^{-1}(\xi)$.
To compute $\exp_\theta(u)$, we need to:
  • Set the angle $\alpha := \|u\|/2$;
  • For $\mathcal{M}_N$, express the corresponding point on the sphere as
$$x := F(\theta)\cos\alpha + 2\|u\|^{-1}u\sin\alpha;$$
for $\mathcal{NM}_M$, express the corresponding point on the hyperbola as
$$x := F(\theta)\cosh\alpha + 2\|u\|^{-1}u\sinh\alpha;$$
  • Set the parameter by the isometry, $\xi := F^{-1}(x)$.
Thus, the resulting parameter ξ is obtained as $\exp_\theta(u)$.
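Putting Sections 4.2.1 and 4.2.2 together, here is a compact end-to-end sketch for the multinomial model with N = 1 (ours; points are stored as full probability vectors including θ_0, and all function names are illustrative). The negative multinomial case is analogous, with cosh/sinh and the Minkowski inner product in place of cos/sin and the Euclidean one.

```python
import numpy as np

F     = lambda th: 2.0 * np.sqrt(th)            # isometry onto the radius-2 sphere (Equation (8))
F_inv = lambda x:  (x / 2.0)**2
dist  = lambda th, xi: 2.0 * np.arccos(np.clip(np.sum(np.sqrt(th * xi)), -1.0, 1.0))

def log_map(th, xi):
    """exp_theta^{-1}(xi): project the chord onto the tangent space, rescale to geodesic length."""
    x = F(xi) - np.dot(F(xi), F(th)) * F(th) / 4.0
    nx = np.linalg.norm(x)
    return np.zeros_like(th) if nx == 0.0 else dist(th, xi) * x / nx

def exp_map(th, u):
    """exp_theta(u) on the radius-2 sphere; the angle is arclength / radius = |u| / 2."""
    nu = np.linalg.norm(u)
    if nu == 0.0:
        return th
    a = nu / 2.0
    return F_inv(F(th) * np.cos(a) + 2.0 * u * np.sin(a) / nu)

def karcher_mean(S, n_iter=10):
    xbar = np.mean([F(th) for th in S], axis=0)              # initial point of Section 4.2.1
    th = F_inv(2.0 * xbar / np.linalg.norm(xbar))
    for _ in range(n_iter):
        grad = np.mean([log_map(th, p) for p in S], axis=0)  # negative gradient, Equation (26)
        th = exp_map(th, grad)                               # gradient iteration, Equation (27)
    return th

S = [np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2]), np.array([0.3, 0.5, 0.2])]
print(karcher_mean(S))
```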

4.2.3. Numerical Example

Now, we test the above algorithm for solving the Karcher mean with a numerical example. The data set S is chosen as containing 10 empirical probability vectors from the first to the tenth 100 decimal digits of π.
To illustrate the goodness of each iteration, we present the norm of the negative gradient vector field of Equation (26) at each iteration point, as shown in Table 4. Within each model, column (a) shows the iteration results starting from the initial point chosen in the aforementioned way, while column (b) provides, for comparison, the iteration results with the initial point chosen as the usual Euclidean mean.
As we can see, all four iterations shown here converge rapidly within the first two steps, and our choice of initial points is apparently better than the usual choice of Euclidean means. In conclusion, this example, to some extent, shows the effectiveness of our computation scheme for the Karcher mean within the multinomial and negative multinomial models.

5. Conclusions

In this paper, we have studied various information geometric properties based on divergence functions for the multinomial and negative multinomial models. The derived expressions of fundamental geometric quantities, such as the Fisher-Riemannian metric, the isometric representation and the α-curvature, make it clear that these two models can be put together into a complementary view. With the aid of these geometric structures, we investigated the average problems on these two models: we proposed the concept of generalized Chernoff points as midpoints of two points and presented determining equations for them, and we provided an effective computation scheme for the Karcher mean of multiple points on the multinomial and negative multinomial models.

Author Contributions

Conceptualization, M.L. and H.S.; funding acquisition, H.S.; supervision, H.S.; validation, M.L., H.S. and D.L.; writing—original draft, M.L.; writing—review and editing, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 61179031).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Karcher, H. Riemannian center of mass and mollifier smoothing. Commun. Pure Appl. Math. 1977, 30, 509–541.
  2. Nielsen, F.; Sun, K. Clustering in Hilbert's projective geometry: The case studies of the probability simplex and the elliptope of correlation matrices. Geom. Struct. Inf. 2019.
  3. Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat. 1983, 11, 793–803.
  4. Eguchi, S. A characterization of second order efficiency in a curved exponential family. Ann. Inst. Stat. Math. 1984, 36, 199–206.
  5. Arwini, K.A.; Dodson, C.T. Information Geometry: Near Randomness and Near Independence; Springer: Berlin, Germany, 2008.
  6. Calin, O.; Udriste, C. Geometric Modeling in Probability and Statistics; Springer International Publishing: Berlin, Germany, 2014.
  7. Amari, S. Information Geometry and Its Applications; Springer: Tokyo, Japan, 2016.
  8. Zhang, Z.; Sun, H.; Zhong, F. Information geometry of the power inverse Gaussian distribution. Appl. Sci. 2007, 9, 194–203.
  9. Zhong, F.; Sun, H.; Zhang, Z. The geometry of the Dirichlet manifold. J. Korean Math. Soc. 2008.
  10. Amari, S. Differential-Geometrical Methods in Statistics; Springer: Berlin, Germany, 1985.
  11. Takano, K. Exponential families admitting almost complex structures. SUT J. Math. 2010, 46, 1–21.
  12. Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett. 2013, 20, 269–272.
  13. Ying, S.; Qin, H.; Peng, Y.; Wen, Z. Compute Karcher means on SO(n) by the geometric conjugate gradient method. Neurocomputing 2016, 215, 169–174.
  14. Fiori, S.; Tanaka, T. An algorithm to compute averages on matrix Lie groups. IEEE Trans. Signal Process. 2010, 57, 4734–4743.
  15. Amari, S.; Cichocki, A. Information geometry of divergence functions. Bull. Pol. Acad. Sci. 2010, 58, 183–195.
  16. Ay, N.; Amari, S. A novel approach to canonical divergence within information geometry. Entropy 2015, 17, 8111–8129.
  17. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: Oxford, UK, 2000.
  18. Kass, R.E.; Vos, P.W. Geometrical Foundations of Asymptotic Inference; Wiley: Hoboken, NJ, USA, 1997.
  19. Lee, J.M. Riemannian Manifolds; Springer: Berlin, Germany, 1997.
  20. Kurose, T. Dual connections and affine geometry. Math. Z. 1990, 203, 115–121.
  21. Nielsen, F. Hypothesis testing, information divergence and computational geometry. Geom. Sci. Inf. 2013.
  22. Kendall, W.S. Probability, convexity and harmonic maps with small image I: Uniqueness and fine existence. Proc. Lond. Math. Soc. 1990, 61, 371–406.
  23. Buss, S.R.; Fillmore, J.P. Spherical averages and applications to spherical splines and interpolation. ACM Trans. Graph. 2001, 20, 95–126.
Table 1. Complementary geometric structures.

Model	Isometric Space	Sectional Curvature	α-Sectional Curvature
$\mathcal{M}_N$	n-sphere within the nonnegative orthant	$\frac{1}{4N}$	$\frac{1-\alpha^2}{4N}$
$\mathcal{NM}_M$	n-hyperbola within the nonnegative orthant	$-\frac{1}{4M}$	$-\frac{1-\alpha^2}{4M}$
Table 2. Chernoff points of the multinomial model (%).

	0	1	2	3	4	5	6	7	8	9
θ	8.00	8.00	12.00	11.00	10.00	8.00	9.00	8.00	12.00	14.00
θ′	11.00	12.00	12.00	8.00	12.00	12.00	7.00	4.00	13.00	9.00
CP_I^(1)	9.49	9.91	12.17	9.53	11.10	9.91	8.06	5.76	12.66	11.41
CP_I^(-1)	9.52	10.02	12.00	9.48	11.01	10.02	7.99	5.98	12.51	11.47
CP_II^(1)	9.48	9.89	12.17	9.55	11.08	9.89	8.07	5.78	12.65	11.44
CP_II^(-1)	9.54	10.05	12.00	9.46	11.02	10.05	7.98	5.95	12.51	11.44
CP^(0)	9.51	9.97	12.08	9.51	11.05	9.97	8.02	5.87	12.58	11.44
Table 3. Chernoff points of the negative multinomial model (%).

	0	1	2	3	4	5	6	7	8	9
θ	8.62	8.62	12.93	11.21	9.48	7.76	8.62	6.90	13.79	12.07
θ′	10.99	12.09	10.99	6.59	14.29	12.09	6.59	4.40	12.09	9.89
CP_I^(1)	10.99	10.11	11.98	8.73	11.50	9.56	7.60	5.58	12.96	10.99
CP_I^(-1)	9.73	10.25	12.02	9.04	11.74	9.79	7.67	5.72	12.99	11.05
CP_II^(1)	10.91	10.00	12.04	8.87	11.36	9.43	7.66	5.66	13.01	11.06
CP_II^(-1)	9.80	10.35	11.96	8.90	11.88	9.92	7.61	5.65	12.94	10.98
CP^(0)	10.36	10.18	12.00	8.89	11.62	9.67	7.63	5.65	12.98	11.02
Table 4. Iteration for the Karcher mean illustrated by $\|\nabla f\|$.

	$\mathcal{M}_N$		$\mathcal{NM}_M$	
Iteration	(a)	(b)	(a)	(b)
0	8.21 × 10^{-5}	1.07 × 10^{-2}	3.05 × 10^{-3}	1.85 × 10^{-1}
1	4.62 × 10^{-7}	6.28 × 10^{-5}	1.74 × 10^{-4}	1.20 × 10^{-2}
2	2.68 × 10^{-9}	3.90 × 10^{-7}	1.01 × 10^{-5}	7.88 × 10^{-4}

