
DIFFERENTIAL AND INTEGRAL GEOMETRY IN STATISTICAL INFERENCE

O. E. Barndorff-Nielsen¹

1. Introduction
2. Review and Preliminaries
3. Transformation Models
4. Transformation Submodels
5. Maximum Estimation and Transformation Models
6. Observed Geometries
7. Expansion of $c|\hat\jmath|^{1/2}\bar L$
8. Exponential Transformation Models
9. Appendix 1
10. Appendix 2
11. Appendix 3
12. References

¹Department of Theoretical Statistics, Institute of Mathematics, University of Aarhus, Aarhus, Denmark
1. INTRODUCTION

This paper gives an account of some of the recent developments in statistical inference in which concepts and results from integral and differential geometry have been instrumental.
A great many important contributions to the field of integral and
differential geometry in statistics are not discussed or even referred to here,
but a rather comprehensive overview of the field can be obtained from the mate-
rial compiled in the present volume and from the survey paper by Barndorff-
Nielsen, Cox and Reid (1986).
Section 2 reviews pertinent parts of statistics and of integral
and differential geometry, and introduces some of the terminology and notation
that will be used in the rest of the paper.
A considerable part of the material in sections 3, 4, 5 and 8 and
in the appendices, which are mainly concerned with the systematic theory of
transformation models and exponential transformation models, has not been pub-
lished elsewhere.
Sections 6 and 7 describe a theory of "observed geometries" and its relation to an asymptotic expansion of the formula $c|\hat\jmath|^{1/2}\bar L$ for the conditional distribution of the maximum likelihood estimator; the results there are mostly taken from Barndorff-Nielsen (1986a). Briefly speaking, the observed geometries on the parameter space of a statistical model consist of a Riemannian metric and an associated one-parameter family of affine connections, constructed from the observed information matrix and from an auxiliary statistic a chosen such that (ω̂,a), where ω̂ denotes the maximum likelihood estimator of the parameter of the model, is minimal sufficient. The observed geometries and the closely related expansion of $c|\hat\jmath|^{1/2}\bar L$ form a parallel to the "expected geometries"
and the associated conditional Edgeworth expansions for curved exponential
families studied primarily by Amari (cf., in particular, Amari 1985, 1986), but
with some essential differences. In particular, the developments in sections 6
and 7 are, in a sense, closer to the actual data and they do not require inte-
grations over the sample space; instead they employ "mixed derivatives of the
log model function." Furthermore, whereas the studies of expected geometries
have been largely concerned with curved exponential families the approach taken
here makes it equally natural to consider other parametric models, and in par-
ticular transformation models. The viewpoint of conditional inference has been
instrumental for the constructions in question. However, the observed geometri-
cal calculus, as discussed in section 6, does not require the employment of
exact or approximate ancillaries.
The observed geometries provide examples of the concept of
statistical manifolds discussed by Lauritzen (1986).
Throughout the paper examples are given to illustrate the general
results.
2. REVIEW AND PRELIMINARIES

We shall consider parametrized statistical models M specified by (X, p(x;ω), Ω) where X is the sample space, Ω is the parameter space and p(x;ω) is the model function, i.e. p(x;ω) = dP_ω/dμ for some dominating measure μ. The dimension of the parameter ω will usually be denoted by d and we write ω in coordinate form as (ω¹,...,ω^d). Generic coordinates of ω will be indicated as ω^r, ω^s, ω^t, etc.
The present section is organized in a number of subsections and it
serves two purposes: to provide a survey of previous results and to set the
stage for the developments in the following sections.
Combinants. It is useful to have a term for functions which depend
on both the observation x and the parameter ω and we shall call any such func-
tion a combinant.
Jacobians. Our vectors are row vectors and we denote transposition of a matrix by an asterisk *. If f is a differentiable transformation of a space Y then the Jacobian matrix ∂f/∂y* of f at y ∈ Y is also denoted by J_f(y), while we write $\mathbf{J}_f(y)$ for the Jacobian determinant, i.e. $\mathbf{J}_f = |J_f|$. When appropriate we interpret $\mathbf{J}_f(y)$ as an absolute value, without explicitly stating this. We shall repeatedly use the fact that for differentiable transformations f and g we have

$$J_{f\circ g}(y) = J_g(y)\,J_f(g(y)) \qquad (2.1)$$

and hence

$$\mathbf{J}_{f\circ g}(y) = \mathbf{J}_f(g(y))\,\mathbf{J}_g(y). \qquad (2.2)$$
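As a quick numerical sanity check of (2.1) and (2.2), the sketch below uses two arbitrarily chosen smooth maps of R² (not taken from the text). Note that with the column-vector convention used by numerical libraries the matrix factors of the chain rule appear in the order J_f(g(y)) J_g(y); the determinant identity (2.2) is unaffected.

```python
import numpy as np

# Hypothetical smooth maps of R^2, chosen only for illustration.
def f(y):
    return np.array([y[0] + y[1]**2, y[0] * y[1]])

def g(y):
    return np.array([np.exp(y[0]), y[0] - y[1]])

def jacobian(fun, y, eps=1e-6):
    """Numerical Jacobian matrix of fun at y (column-vector convention)."""
    J = np.empty((2, 2))
    for i in range(2):
        e = np.zeros(2); e[i] = eps
        J[:, i] = (fun(y + e) - fun(y - e)) / (2 * eps)
    return J

y = np.array([0.3, 1.7])
J_fg = jacobian(lambda u: f(g(u)), y)            # Jacobian of f o g at y
J_chain = jacobian(f, g(y)) @ jacobian(g, y)     # chain rule, cf. (2.1)
assert np.allclose(J_fg, J_chain, atol=1e-4)
# Determinants multiply, cf. (2.2):
assert np.isclose(np.linalg.det(J_fg),
                  np.linalg.det(jacobian(f, g(y))) * np.linalg.det(jacobian(g, y)),
                  atol=1e-4)
```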


Foliations. A partition of a manifold of dimension k into submanifolds, all of dimension m < k, is called a foliation and the submanifolds are said to be the leaves of the foliation.
A dimension-reducing statistical hypothesis may often, in a natural
way, be viewed as a leaf of an associated foliation of the parameter space Ω.
Likelihood. We let L = L(ω) = L(ω;x) denote an arbitrary version of the likelihood function for ω and we set ℓ = log L. Furthermore, we write ∂_r = ∂/∂ω^r, and ℓ_r = ∂_r ℓ, ℓ_rs = ∂_r ∂_s ℓ, etc. The observed information is the matrix

$$\jmath(\omega) = -[\ell_{rs}] \qquad (2.3)$$

and the expected information is

$$i(\omega) = E_\omega\,\jmath(\omega). \qquad (2.4)$$

The inverse matrices of j and i are referred to as observed and expected formation, respectively.
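For illustration, a minimal sketch computing (2.3) and (2.4), assuming an i.i.d. exponential model with rate ω (an assumption of this sketch, not an example from the text):

```python
import numpy as np

# i.i.d. exponential model with rate w: l(w) = n log w - w * sum(x).
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=50)
n, s = x.size, x.sum()

loglik = lambda w: n * np.log(w) - w * s
w_hat = n / s                        # maximum likelihood estimate

# Observed information j(w_hat) = -l''(w_hat), by a central second difference:
eps = 1e-5
j_hat = -(loglik(w_hat + eps) - 2 * loglik(w_hat) + loglik(w_hat - eps)) / eps**2
print(j_hat, n / w_hat**2)           # analytically j(w) = n / w^2
# Here l'' does not depend on the data, so i(w) = E_w j(w) = n / w^2 as well,
# and observed and expected formation both equal w^2 / n.
```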
Suppose the minimal sufficient statistic t for M is of dimension k. We then speak of M as a (k,d)-model (d being the dimension of the parameter ω). Let (ω̂,a) be a one-to-one transformation of t, where ω̂ is the maximum likelihood estimator of ω and a, of dimension k−d, is an auxiliary statistic.
In most applications it will be essential to choose a so as to be distribution constant, either exactly or to the relevant asymptotic order. Then a is ancillary and according to the conditionality principle the conditional model for ω̂ given a is considered the appropriate basis for inference on ω. However, unless explicitly stated, distribution constancy of a is not assumed in the following.
There will be no loss of generality in viewing the log likelihood ℓ = ℓ(ω) in its dependence on the observation x as being a function of the minimal sufficient (ω̂,a) only. Henceforth we shall think of ℓ in this manner and we will indicate this by writing

ℓ = ℓ(ω;ω̂,a).
Similarly, in the case of observed information we write

j = j(ω;ω̂,a),

etc. It turns out to be of interest to consider the function

$$\breve\ell(\omega) = \breve\ell(\omega;a) = \ell(\omega;\omega,a), \qquad (2.5)$$

obtained from ℓ(ω;ω̂,a) by substituting ω for ω̂. Similarly we write

$$\breve\jmath(\omega) = \breve\jmath(\omega;a) = \jmath(\omega;\omega,a). \qquad (2.6)$$
For a general parametric model p(x;ω) and for a general auxiliary a, a conditional probability function p*(ω̂;ω|a) for ω̂ given a may be defined by

$$p^*(\hat\omega;\omega|a) = c\,|\hat\jmath|^{1/2}\bar L \qquad (2.7)$$

where L̄ is the normed likelihood function, i.e.

$$\bar L = p(x;\omega)/p(x;\hat\omega),$$

and where c = c(ω,a) is a norming constant determined so as to make the integral of (2.7) with respect to ω̂ equal to 1.
Suppose now that a is approximately or exactly distribution constant. Then the probability function p*(ω̂;ω|a), given by (2.7), is to be considered as an approximation to the conditional probability function p(ω̂;ω|a) of the maximum likelihood estimator ω̂ given a, cf. Barndorff-Nielsen (1980, 1983). In general, p*(ω̂;ω|a) is simple to calculate since it only requires knowledge of standard likelihood quantities plus an integration over the sample space to determine the norming constant c. Moreover, to sufficient accuracy this norming constant can often be approximated by $(2\pi)^{-d/2}$, where d is the dimension of ω; and a more refined approximation to c solely in terms of mixed derivatives of the log model function is also available, cf. the next subsection and section 7. In a great number of cases, including virtually all transformation models, p*(ω̂;ω|a) is, in fact, equal to p(ω̂;ω|a). Furthermore, outside these exactness cases one often has an asymptotic relation of the form

$$p(\hat\omega;\omega|a) = p^*(\hat\omega;\omega|a)\{1 + O(n^{-3/2})\} \qquad (2.8)$$

uniformly in ω̂ for √n(ω̂−ω) bounded, where n denotes the sample size. This holds, in particular, for (k,d) exponential models. For more details and further discussion, see Barndorff-Nielsen (1980, 1983, 1984, 1985, 1986a,b) and Barndorff-Nielsen and Blaesild (1984).
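The exactness phenomenon is easy to observe numerically. A minimal sketch, assuming an i.i.d. exponential model with rate ω (a scale-transformation model; the values n = 5 and ω = 2 are arbitrary): renormalizing $c|\hat\jmath|^{1/2}\bar L$ over ω̂ reproduces the exact density of ω̂, and c is close to $(2\pi)^{-1/2}$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma

n, w = 5, 2.0   # sample size and true rate (illustrative values)

def p_star_unnormed(w_hat):
    # |j_hat|^(1/2) * Lbar with j_hat = n / w_hat^2 and
    # Lbar = L(w)/L(w_hat) for l(w) = n log w - w * n / w_hat.
    Lbar = (w / w_hat)**n * np.exp(-(w - w_hat) * n / w_hat)
    return np.sqrt(n) / w_hat * Lbar

c = 1.0 / quad(p_star_unnormed, 1e-3, 80.0)[0]   # norming constant c

def p_exact(w_hat):
    # w_hat = n / S with S ~ Gamma(n, scale 1/w); transform the gamma density.
    return gamma.pdf(n / w_hat, a=n, scale=1.0 / w) * n / w_hat**2

for wh in (1.0, 2.0, 4.0):
    print(c * p_star_unnormed(wh), p_exact(wh))  # the two columns agree
print(c, (2 * np.pi) ** -0.5)                    # c is close to (2 pi)^(-1/2)
```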
Expansion of $c\hat\jmath^{1/2}\bar L$ in the single-parameter case. Suppose ω is one-dimensional. From formulas (4.2) and (4.5) of Barndorff-Nielsen and Cox (1984) we have

$$c\hat\jmath^{1/2}\bar L = \varphi(\hat\omega-\omega;\hat\jmath)\,\{1 + C_1 + A_1(\hat\omega-\omega) + A_2(\hat\omega-\omega)\}\{1 + O(n^{-3/2})\}. \qquad (2.9)$$

Here φ(w;γ) denotes the probability density function of the normal distribution with mean 0 and variance γ⁻¹. Furthermore, C₁, A₁ and A₂ are given by

$$C_1 = \tfrac{1}{24}\{-3U_4 + 12U_{3,1} - 5U_3^2 + 24U_{2,1}U_3 - 24U_{2,1}^2 - 12U_{2,2}\} \qquad (2.10)$$

and

$$A_1(u) = P_1(u)U_{2,1} + P_2(u)U_3,$$
$$A_2(u) = P_3(u)U_{2,2} + P_4(u)U_{2,1}^2 + P_5(u)U_4 + P_6(u)U_{3,1} + P_7(u)U_3^2 + P_8(u)U_{2,1}U_3,$$

where P_i(u), i = 1,...,8, are polynomials, the explicit forms of which are given in Barndorff-Nielsen (1985), and where U_v = U_{v,0} and U_{v,s} are defined as

$$U_{v,s} = \hat\jmath^{-(v+s)/2}\,\partial_{;}^{\,s}\{\ell^{(v)}(\omega;\hat\omega,a)\}\big|_{\omega=\hat\omega}, \qquad v = 1,2,3,\dots,\quad s = 0,1,2,\dots,$$

ℓ^(v) denoting the v-th order derivative of ℓ = ℓ(ω;ω̂,a) with respect to ω and $\partial_{;}^{\,s}$ indicating differentiation s times with respect to ω̂. Note that, in the repeated sampling situation, U_{v,s} is of order $O(n^{-(v+s)/2+1})$. Hence the quantities C₁, A₁ and A₂ are of order $O(n^{-1})$, $O(n^{-1/2})$ and $O(n^{-1})$, respectively.
Integration of (2.7) yields an approximation to the conditional distribution of the likelihood ratio statistic

$$w = 2\{\ell(\hat\omega) - \ell(\omega_0)\} \qquad (2.11)$$

for testing a dimension reducing hypothesis Ω₀ of Ω. In particular, if Ω₀ is a point hypothesis, Ω₀ = {ω₀}, we have

$$p^*(w;\omega_0|a) = c\,e^{-w/2}\int_{\hat\omega|w,a}|\hat\jmath|^{1/2}\,d\hat\omega \qquad (2.12)$$

as an approximation to p(w;ω₀|a), the integral being over the set of values of ω̂ consistent with the given values of w and a. (The leading term of (2.9) together with (2.12) yields the usual χ² approximation for w. For a connection to Bartlett adjustment factors see Barndorff-Nielsen and Cox (1984).)
Furthermore, (2.9) may be integrated termwise to obtain expansions for the conditional distribution function of ω̂ and, by inversion, for confidence limits for ω, correct to order $O(n^{-3/2})$, conditionally as well as unconditionally, cf. Barndorff-Nielsen (1985). The resulting expressions allow one to carry out "conditional inference without conditioning and without integration."
For extensions to the case of multidimensional parameters see
section 7.
Reparametrization. A basic form of invariance is parametrization invariance of statistical procedures (though parametrization equivariance might be a more proper term). If we think of an inference frame as consisting of the data in conjunction with the model and a particular parametrization of the model, and of a statistical procedure π as a method which leads from the inference frame to a conclusion formulated in terms of the parametrization of the inference frame, then parametrization invariance may be formally specified as commutativity of the diagram

                        reparametrization
    inference frame ----------------------> inference frame
          |                                       |
          | procedure π                           | procedure π
          v                                       v
      conclusion    ---------------------->   conclusion
                        reparametrization

In words, the procedure π is parametrization invariant if changing the inference base by shifting to another parametrization and then applying π yields the same conclusion as first applying π and then translating the conclusion so as to be expressed in terms of the new parametrization. (We might describe a parametrization invariant procedure as a 0-th order generalized tensor.) Maximum likelihood estimation and likelihood ratio testing are instances of parametrization invariant procedures.
Example 2.1. Consider any log-likelihood function ℓ(ω) of a one-dimensional parameter ω. Define the functions r^[v] = r^[v](ω), v = 1,2,..., recursively by

$$r^{[1]}(\omega) = \ell^{(1)}(\omega)/i(\omega)^{1/2},$$
$$r^{[v]} = \{dr^{[v-1]}/d\omega\}/i(\omega)^{1/2}, \qquad v = 2,3,\dots,$$

and set r̂^[v] = r^[v](ω̂). The derivatives r̂^[v] are parametrization invariant, i.e. r̂^[v] takes the same value whatever the parametrization employed.
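A numerical check of this invariance for v = 1, assuming the exponential-rate model ℓ(ω) = n log ω − ωs with i(ω) = n/ω², and the hypothetical reparametrization ψ = log ω:

```python
import numpy as np

n, s = 20, 25.0          # hypothetical data summary: n observations, sum s

def r1_omega(w):         # r[1] in the omega-parametrization
    lp = n / w - s                       # l'(w)
    return lp / np.sqrt(n / w**2)        # l'(w) / i(w)^(1/2)

def r1_psi(psi):         # r[1] in the psi = log(omega) parametrization
    w = np.exp(psi)
    lp = (n / w - s) * w                 # dl/dpsi = l'(w) dw/dpsi
    i_psi = (n / w**2) * w**2            # i transforms as a (0,2) tensor
    return lp / np.sqrt(i_psi)

w0 = 0.5
print(r1_omega(w0), r1_psi(np.log(w0)))  # identical values
```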
While parametrization invariance is clearly a desirable property, there are a number of useful, and virtually indispensable, statistical methods which do not have this property. Thus procedures which rely on the asymptotic normality of the maximum likelihood estimator, such as the Wald test or standard ways of setting confidence intervals in non-linear regression problems, are mostly not parametrization invariant. In such cases of non-invariance particular caution must be exercised, as demonstrated for instance for the Wald test by Hauck and Donner (1977) and Vaeth (1985).
We shall be interested in how various quantities behave under reparametrizations of the model M. Let ψ, of dimension d, be the parameter of some parametrization of M alternative to that indicated by ω. Coordinates of ψ will be denoted by ψ^ρ, ψ^σ, etc., and we write ∂_ρ for ∂/∂ψ^ρ and

$$\omega^r_{/\rho} = \partial_\rho\,\omega^r, \qquad \omega^r_{/\rho\sigma} = \partial_\rho\partial_\sigma\,\omega^r,$$

etc. Furthermore, we write ℓ(ψ) for the log likelihood under the parametrization by ψ, though formally this is in conflict with the notation ℓ(ω), and correspondingly we let ℓ_ρ = ∂_ρ ℓ(ψ), etc.; similarly for other parameter dependent quantities. Finally, the symbol ^ over such a quantity indicates that the maximum likelihood estimate has been substituted for the parameter.
Using this notation and adopting the summation convention that if a suffix occurs repeatedly in a single expression then summation over that suffix is understood, we have

$$\ell_\rho = \ell_r\,\omega^r_{/\rho},$$
$$\ell_{\rho\sigma} = \ell_{rs}\,\omega^r_{/\rho}\omega^s_{/\sigma} + \ell_r\,\omega^r_{/\rho\sigma}, \qquad (2.13)$$
$$\ell_{\rho\sigma\tau} = \ell_{rst}\,\omega^r_{/\rho}\omega^s_{/\sigma}\omega^t_{/\tau} + \ell_{rs}\,\omega^r_{/\rho\sigma}\omega^s_{/\tau}[3] + \ell_r\,\omega^r_{/\rho\sigma\tau}, \qquad (2.14)$$

etc., where [3] signifies a sum of three similar terms determined by permutation of the indices ρ,σ,τ. On substituting ω̂ for ω in (2.13) we obtain the well-known relation

$$\hat\jmath_{\rho\sigma} = \hat\jmath_{rs}\,\hat\omega^r_{/\rho}\hat\omega^s_{/\sigma},$$

which, now by substitution of ω for ω̂, may be reexpressed as

$$\breve\jmath_{\rho\sigma} = \breve\jmath_{rs}\,\omega^r_{/\rho}\omega^s_{/\sigma} \qquad (2.15)$$

or, written more explicitly,

$$\breve\jmath_{\rho\sigma}(\psi;a) = \breve\jmath_{rs}(\omega;a)\,\frac{\partial\omega^r}{\partial\psi^\rho}\frac{\partial\omega^s}{\partial\psi^\sigma}.$$

Equation (2.15) shows that $\breve\jmath$ is a metric tensor on M, for any given value of the auxiliary statistic a. Moreover, in wide generality $\breve\jmath$ will be positive definite on M, and we assume henceforth that this is the case. In fact, for any ω ∈ Ω, $\breve\jmath(\omega)$ is observed information evaluated at the maximum likelihood point, which is generally positive definite (though counterexamples do exist).
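The transformation rule is easy to verify numerically at the maximum likelihood point; a sketch, again assuming the exponential-rate model (not an example from the text) and ψ = log ω:

```python
import numpy as np

n, s = 20, 25.0
w_hat = n / s                  # MLE of omega; psi_hat = log(w_hat)

def info(loglik, theta, eps=1e-5):
    # observed information -l'' by a central second difference
    return -(loglik(theta + eps) - 2*loglik(theta) + loglik(theta - eps)) / eps**2

j_w   = info(lambda w: n*np.log(w) - w*s, w_hat)
j_psi = info(lambda p: n*p - np.exp(p)*s, np.log(w_hat))
dw_dpsi = w_hat                # dw/dpsi = e^psi = w at the MLE
print(j_psi, j_w * dw_dpsi**2) # agree: j transforms as a (0,2) tensor
```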
Let $A(\omega) = [A^{r_1\dots r_p}_{s_1\dots s_q}(\omega)]$ be an array depending on ω, where each of the p + q indices runs from 1 to d. Then A is said to be a (p,q) tensor, or a tensor of contravariant rank p and covariant rank q, if under reparametrization from ω to ψ it obeys the transformation law

$$A^{\rho_1\dots\rho_p}_{\sigma_1\dots\sigma_q}(\psi) = A^{r_1\dots r_p}_{s_1\dots s_q}(\omega)\,\psi^{\rho_1}_{/r_1}\cdots\psi^{\rho_p}_{/r_p}\,\omega^{s_1}_{/\sigma_1}\cdots\omega^{s_q}_{/\sigma_q}.$$
Example 2.2. A covariant tensor of rank q is given by

$$E_\omega\{\ell_{r_1}\cdots\ell_{r_q}\}.$$

In particular, the expected information i is a (0,2) tensor.
The inverse [i^{rs}] of i = [i_{rs}] is a contravariant second order tensor.
The (outer) product of two tensors $A^{r_1r_2\dots}_{s_1s_2\dots}$ and $B^{t_1t_2\dots}_{u_1u_2\dots}$ is defined as the array C given by

$$C^{r_1r_2\dots t_1t_2\dots}_{s_1s_2\dots u_1u_2\dots} = A^{r_1r_2\dots}_{s_1s_2\dots}\,B^{t_1t_2\dots}_{u_1u_2\dots}.$$

This product is again a tensor, of rank (p′+p″, q′+q″) if (p′,q′) and (p″,q″) are the ranks of A and B.
Lower rank tensors may be derived from higher rank tensors by contraction, i.e. by pairwise identification of upper and lower indices (which implies a summation).
The parameter space as a manifold. The parameter space Ω may be viewed as a (pseudo-) Riemannian manifold with (pseudo-) metric determined by a metric tensor φ, i.e. φ is a rank 2 covariant, regular and symmetric tensor. The associated Riemannian connection $\mathring\nabla$ is determined by the Christoffel symbols $\mathring\Gamma^t_{rs} = \mathring\Gamma_{rsu}\,\phi^{tu}$, where

$$\mathring\Gamma_{rst} = \tfrac12\{\partial_r\phi_{st} + \partial_s\phi_{rt} - \partial_t\phi_{rs}\}. \qquad (2.16)$$

If ∇ is any affine connection with connection symbols $\Gamma^t_{rs}$ then these symbols satisfy

$$\Gamma_{rst} = \Gamma^u_{rs}\,\phi_{ut} \qquad (2.17)$$

and the transformation law

$$\Gamma^\tau_{\rho\sigma}(\psi) = \{\Gamma^t_{rs}(\omega)\,\omega^r_{/\rho}\omega^s_{/\sigma} + \omega^t_{/\rho\sigma}\}\,\psi^\tau_{/t}. \qquad (2.18)$$

On the other hand, any set of functions $[\Gamma^t_{rs}]$ which satisfy the law (2.18) constitute the connection symbols of an affine connection on Ω. It follows that all affine connections on Ω are of the form

$$\bar\Gamma^t_{rs} = \Gamma^t_{rs} + S^t_{rs} \qquad (2.19)$$

where the $S^t_{rs}$ are characterized by the transformation law

$$S^\tau_{\rho\sigma}(\psi) = S^t_{rs}(\omega)\,\omega^r_{/\rho}\omega^s_{/\sigma}\,\psi^\tau_{/t}. \qquad (2.20)$$

If, for a given metric tensor φ, we define Γ_rst and S_rst by

$$\Gamma_{rst} = \Gamma^u_{rs}\,\phi_{ut} \quad\text{and}\quad S_{rst} = S^u_{rs}\,\phi_{ut},$$

then (2.18), (2.19) and (2.20) are equivalent to, respectively,

$$\Gamma_{\rho\sigma\tau}(\psi) = \Gamma_{rst}(\omega)\,\omega^r_{/\rho}\omega^s_{/\sigma}\omega^t_{/\tau} + \phi_{tu}\,\omega^u_{/\rho\sigma}\omega^t_{/\tau}, \qquad (2.21)$$

$$\bar\Gamma_{rst} = \Gamma_{rst} + S_{rst} \qquad (2.22)$$

and

$$S_{\rho\sigma\tau}(\psi) = S_{rst}(\omega)\,\omega^r_{/\rho}\omega^s_{/\sigma}\omega^t_{/\tau}. \qquad (2.23)$$

Thus, in particular, [S_rst] is a tensor.
Suppose β ↦ ω(β) is a mapping of full rank from an open subset B of a Euclidean space of dimension d₀ < d into Ω. Then this mapping is said to be an immersion of B in Ω. We denote coordinates of β by β^a, β^b, etc. If φ is a metric tensor on Ω then the metric tensor on B induced from Ω by the immersion is defined by

$$\phi_{ab}(\beta) = \phi_{rs}(\omega)\,\omega^r_{/a}\omega^s_{/b}. \qquad (2.24)$$

If $\Gamma^t_{rs}(\omega)$ is a connection on Ω and if $\Gamma_{rst} = \Gamma^u_{rs}\phi_{ut}$ then the induced connection on B is defined by $\Gamma^c_{ab}(\beta) = \Gamma_{abd}(\beta)\,\phi^{cd}(\beta)$ and by

$$\Gamma_{abc}(\beta) = \Gamma_{rst}(\omega)\,\omega^r_{/a}\omega^s_{/b}\omega^t_{/c} + \phi_{tu}\,\omega^u_{/ab}\omega^t_{/c}. \qquad (2.25)$$

Let G be a group acting smoothly on the parameter space. A metric tensor φ is said to be (G-)invariant if

$$\phi_{rs}(\omega) = \phi_{r's'}(g\omega)\,\frac{\partial(g\omega)^{r'}}{\partial\omega^r}\,\frac{\partial(g\omega)^{s'}}{\partial\omega^s}, \qquad g\in G. \qquad (2.26)$$

For a given g let a new parametrization be introduced by ψ = gω. From the transformation law for tensors it follows that φ is invariant if and only if

$$\phi_{\rho\sigma}(\psi) = \phi_{rs}(g\omega), \qquad g\in G. \qquad (2.27)$$

(On the left hand side the tensor is expressed in ψ coordinates, on the right hand side in ω coordinates.) Similarly, a connection Γ is said to be invariant if

$$\Gamma^\tau_{\rho\sigma}(\psi) = \Gamma^t_{rs}(g\omega), \qquad g\in G. \qquad (2.28)$$

The pseudo-Riemannian connection derived from an invariant metric tensor is invariant.
In generalization of (2.27) an arbitrary covariant tensor $A_{r_1\dots r_q}$ is said to be (G-)invariant if

$$A_{\rho_1\dots\rho_q}(\psi) = A_{r_1\dots r_q}(g\omega), \qquad g\in G.$$

If $\Gamma^t_{rs}$ is a G-invariant connection and if φ and S_rst are G-invariant tensors, with φ being a metric tensor, then $\bar\Gamma$ defined by

$$\bar\Gamma^t_{rs} = \Gamma^t_{rs} + \phi^{tu}S_{rsu}$$

is a G-invariant connection.
Now, let φ be the information tensor i on Ω. Then (2.16) takes the form

$$\mathring\Gamma_{rst} = E\{\ell_{rs}\ell_t\} + \tfrac12 E\{\ell_r\ell_s\ell_t\}.$$

Obviously,

$$T_{rst} = E\{\ell_r\ell_s\ell_t\} \qquad (2.29)$$

satisfies (2.23) and hence, for any real α an affine connection is defined by

$$\overset{\alpha}{\Gamma}_{rst} = E\{\ell_{rs}\ell_t\} + \frac{1-\alpha}{2}\,E\{\ell_r\ell_s\ell_t\}. \qquad (2.30)$$

These are the α-connections introduced and studied by Chentsov (1972) and Amari (1982a,b, 1985, 1986).

However, we shall be mainly concerned with another type of connec-


tion, determined from observed information, more specifically from the metric
tensor 3-, see sections 6-8. We refer to i and 3- as expected and observed in-
formation metric on ίi, respectively.
Suppose, as above, that ψ:$ -* ω is an immersion of B in Ω. The
submodel NL of ^ obtained by restricting ω to lie in Ω = ψ(B) has expected
information

i(3) = ^ H ω ) ^ . (2.31)

Thus 1(3) equals the Riemannian metric induced from the metric i(ω) on Ω to
the imbedded submanifold Ω Q . Furthermore, the α-connection of the model ML
equals the connection on Ω Q induced from the α-connection on Ω , by the general
construction (2.25).
The measures on Ω defined by

$$|i(\omega)|^{1/2}\,d\omega \qquad (2.32)$$

and

$$|\breve\jmath(\omega)|^{1/2}\,d\omega \qquad (2.33)$$

are both geometric measures, relative to the expected and observed information metric, respectively. Note that (2.33) depends on the value of the auxiliary statistic a. We shall speak of (2.32) and (2.33) as expected and observed information measure, respectively. It is an important property of these measures that they are parametrization invariant. This property follows from the fact that i and $\breve\jmath$ are covariant tensors of rank 2. As a consequence we have that $c|\hat\jmath|^{1/2}\bar L$ (of (2.7)) is parametrization invariant.
Invariant measures. A measure μ on X is said to be invariant with respect to a group G acting on X if gμ = μ for all g ∈ G.
Invariant measures, when they exist, may often be constructed from a quasi-invariant measure, as follows. A measure μ on X is called quasi-invariant with multiplier χ = χ(g,x) if g⁻¹μ and μ are mutually absolutely continuous for every g ∈ G and if

$$d(g^{-1}\mu)(x) = \chi(g,x)\,d\mu(x).$$

Furthermore, define a function m on X to be a modulator with associated multiplier χ(g,x) if m is positive and

$$m(gx) = \chi(g,x)\,m(x).$$

Then, if μ^× is quasi-invariant with multiplier χ(g,x) and if m is a modulator with the same multiplier we have that

$$\mu = m^{-1}\mu^\times$$

is an invariant measure on X.
As quasi-invariance is clearly a very weak property, the problem in constructing invariant measures lies mainly in finding appropriate modulators. It is usually possible to specify the modulators in terms of Jacobians.
In particular, in applications it is often the case that X is an open subset of a Euclidean space. By the standard theorem on transformation of integrals, Lebesgue measure λ on X is then quasi-invariant with multiplier $\mathbf{J}_{\gamma(g)}(x)$. Under mild conditions an invariant measure on X is then given by

$$d\mu(x) = \mathbf{J}_{\gamma(z)}(u)^{-1}\,d\lambda(x). \qquad (2.34)$$

Here $\mathbf{J}_{\gamma(g)}$ denotes the Jacobian determinant of the mapping γ(g) of X onto itself determined by g ∈ G, and (z,u) constitutes an orbital decomposition of x, i.e. (z,u) is a one-to-one transformation of x such that u ∈ X and u is maximal invariant while z ∈ G and x = zu. For a more detailed discussion see section 3 and appendix 1.
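A minimal sketch of this construction, assuming the scale group acting on X = (0,∞) by gx = g·x (so that the invariant measure is the familiar dx/x):

```python
import numpy as np
from scipy.integrate import quad

# Lebesgue measure is quasi-invariant here with multiplier chi(g, x) = g,
# and m(x) = x is a modulator with the same multiplier (m(gx) = g m(x)),
# so dmu = dx / x should be invariant: it is the Haar measure of the group.
def integral_mu(f):
    return quad(lambda x: f(x) / x, 1e-12, np.inf, limit=200)[0]

f = lambda x: np.exp(-np.log(x)**2)      # an arbitrary integrable test function
for g in (0.5, 2.0, 7.3):
    print(integral_mu(lambda x: f(g * x)), integral_mu(f))  # equal for all g
```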
Transformation models. Let G be a group acting on the sample space X. If the class P of probability measures given by the statistical model is invariant under the induced action of G on the set of all probability measures on X then the model is called a composite transformation model, and if P consists of a single orbit we use the term transformation model. For a composite transformation model, G acts on P and we may, of course, equally think of G as acting on the parameter space Ω. A parameter (function) λ which is maximal invariant under this action is said to be an index parameter. Virtually all composite transformation models of interest have the property that after minimal sufficient reduction (and possibly after deletion of a null set from X) there exists a subgroup K of G such that K is the isotropy group for a point on every one of the orbits of X and of Ω. Each of these orbits is then isomorphic to the homogeneous space G/K = {gK : g ∈ G} of left cosets of K.
For a transformation model the information measures (2.32) and (2.33) are invariant measures relative to the action of G on Ω induced from the action of G on X via the maximum likelihood estimator ω̂, which is an equivariant mapping from X to Ω. This action is the same as the above-mentioned action of G on P ≅ Ω and also the same as the natural action of G on G/K ≅ Ω.
It follows that relative to information measure on Ω the formula (2.7) for the conditional distribution of ω̂ is simply cL̄. From this it may be shown that, with the auxiliary a as the maximal invariant statistic, p*(ω̂;ω|a) is exactly equal to p(ω̂;ω|a).
These results are shown in outline in Barndorff-Nielsen (1983). A more general statement will be derived in section 5.
Exponential models. A (k,d) exponential model has model function of the form

$$p(x;\omega) = \exp\{\theta(\omega)\cdot t(x) - \kappa(\theta(\omega)) - h(x)\}. \qquad (2.35)$$

Here k is the order of the model (2.35) and is equal to the common dimension of the vectors θ(ω) and t(x), while d denotes the dimension of the parameter ω. The full exponential model generated by (2.35) has model function

$$p(x;\theta) = \exp\{\theta\cdot t(x) - \kappa(\theta) - h(x)\} \qquad (2.36)$$

and κ(θ) is the cumulant transform of the canonical statistic t = t(x). From the viewpoint of inference on ω there is no restriction in assuming x = t, since t is minimal sufficient, and we shall often do so. We set τ = τ(θ) = E_θ t, i.e. τ is the mean value parameter of (2.36), and we write T for τ(int Θ), where Θ denotes the canonical parameter domain of the full model (2.36).
Let f be a real differentiable function defined on an open subset of R^k. The Legendre transform f^T of f is defined by

$$f^T(y) = x\cdot y - f(x)$$

where

$$y = (Df)(x) = \frac{\partial f}{\partial x}(x).$$

The Legendre transform is a useful tool in studying various dualistic aspects of exponential models (cf. Barndorff-Nielsen (1978a), Barndorff-Nielsen and Blaesild (1983a)).
In particular, we may use the Legendre transform to define the dual likelihood function $\overset{-1}{\ell}$ of (2.35) by

$$\overset{-1}{\ell}(\omega) = \hat\theta\cdot\tau(\omega) - \hat\ell(\tau(\omega)). \qquad (2.37)$$

Here, and elsewhere, ^ as top index indicates maximum likelihood estimation under the full model. Further, in this connection we take ℓ̂ as the sup-log-likelihood function of (2.36), and then ℓ̂ is, in fact, the Legendre transform of κ. Note that for τ = τ(θ) ∈ T we have ℓ̂(τ) = θ·τ − κ(θ). An inference methodology, parallel to that of likelihood inference for exponential families, may be developed from the dual likelihood (2.37). The estimates, tests and confidence regions discussed by Amari and others under the name of α = −1 (or mixture) procedures are, essentially, part of the dual likelihood methodology.
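A small numerical illustration that the sup-log-likelihood is the Legendre transform of κ, assuming the Poisson model with κ(θ) = e^θ (for which the closed form is ℓ̂(t) = t log t − t):

```python
import numpy as np
from scipy.optimize import minimize_scalar

kappa = lambda th: np.exp(th)

def legendre_kappa(t):
    # kappa^T(t) = sup_theta { theta * t - kappa(theta) }
    res = minimize_scalar(lambda th: -(th * t - kappa(th)),
                          bounds=(-20.0, 20.0), method='bounded')
    return -res.fun

for t in (0.5, 1.0, 3.0, 10.0):
    print(legendre_kappa(t), t * np.log(t) - t)   # the two values agree
```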
More generally, based on Amari's concepts of α-geometry and α-divergence, one may for each α ∈ [−1,1] introduce an "α-likelihood" $\overset{\alpha}{L}$ by

$$\overset{\alpha}{L}(\omega) = \overset{\alpha}{L}(\omega;t) = \exp\{-D_\alpha(\hat\theta,\theta(\omega))\} \qquad (2.38)$$

where

$$D_\alpha(\theta_1,\theta_2) = \int f_\alpha\!\left(\frac{p(x;\theta_1)}{p(x;\theta_2)}\right)p(x;\theta_2)\,d\mu(x). \qquad (2.39)$$

Here p(x;θ) is given by (2.36) and the function f_α is defined as

$$f_\alpha(x) = \begin{cases} x\log x, & \alpha = 1\\[2pt] \dfrac{4}{1-\alpha^2}\{1 - x^{(1+\alpha)/2}\}, & -1<\alpha<1\\[2pt] -\log x, & \alpha = -1.\end{cases} \qquad (2.40)$$

Letting $\overset{\alpha}{\ell} = \log\overset{\alpha}{L}$ we have, in particular,

$$\overset{1}{\ell}(\theta) = \bar\ell(\theta) = -I(\hat\theta,\theta) = \theta\cdot t - \kappa(\theta) - \hat\ell(t) \qquad (2.41)$$

and

$$\overset{-1}{\ell}(\theta) = -I(\theta,\hat\theta) = \hat\theta\cdot\tau - \hat\ell(\tau) - \kappa(\hat\theta) \qquad (2.42)$$

where I denotes the discrimination information. Furthermore, for −1 < α < 1,

$$\overset{\alpha}{\ell}(\theta) = \frac{4}{1-\alpha^2}\left[\exp\left\{\kappa\Bigl(\tfrac{1+\alpha}{2}\hat\theta + \tfrac{1-\alpha}{2}\theta\Bigr) - \tfrac{1+\alpha}{2}\kappa(\hat\theta) - \tfrac{1-\alpha}{2}\kappa(\theta)\right\} - 1\right].$$
Affine subsets of Θ are simple from the likelihood viewpoint while, correspondingly, affine subsets of T are simple in dual likelihood theory. Dual affine foliations, of Θ and T respectively, are therefore of some particular interest. Such foliations have been studied in Barndorff-Nielsen and Blaesild (1983a); see also Barndorff-Nielsen and Blaesild (1983b).
Suppose that the auxiliary component a of (ω̂,a) is approximately or exactly distribution constant, i.e. a is ancillary. For instance, a may be the affine ancillary or the directed log likelihood ratio statistic, as defined in Barndorff-Nielsen (1980, 1986b). We may think of the partitions generated, respectively, by a and ω̂ as foliations of T, to be called the ancillary foliation and the maximum likelihood foliation. (Amari's ancillary subspaces are then, in the present terminology and for α = 1, leaves of the maximum likelihood foliation.)
Exponential transformation models. A model M which is both transformational and exponential is called an exponential transformation model. For such models we have the following structure theorem (Barndorff-Nielsen, Blaesild, Jensen and Jorgensen (1982), Eriksen (1984b)).

Theorem 2.1. Let M be an exponential transformation model with acting group G. Suppose X is locally compact and that t is continuous. Furthermore, suppose that G is locally compact and acts continuously on X.
Then there exists, uniquely, a k-dimensional representation A(g) of G and k-dimensional vectors B(g) and β(g) such that

$$t(gx) = t(x)A(g) + B(g) \qquad (2.43)$$

$$\theta(g) = \theta(e)A(g^{-1})^* + \beta(g) \qquad (2.44)$$

where e ∈ G denotes the identity element. Furthermore, the full exponential model generated by M is invariant under G, and G* = {[A(g⁻¹)*, β(g)] : g ∈ G} is a group of affine transformations of R^k leaving Θ and int Θ invariant, in such a way that

$$\theta(gP) = \theta(P)A(g^{-1})^* + \beta(g), \qquad g\in G,\ P\in P.$$

Dually, Ḡ = {[A(g),B(g)] : g ∈ G} is a group of affine transformations leaving C = cl conv t(X) as well as T = τ(int Θ) invariant. Finally, let δ be the function given by

$$\delta(g) = a(\theta(e))\,a(\theta(g))^{-1}\exp(-\theta(g)\cdot B(g)) \qquad (2.45)$$

(here a(θ) = exp{−κ(θ)}, the norming constant of the full model). We then have

$$a(\theta(gP)) = a(\theta(P))\,\delta(g)^{-1}\exp(-\theta(gP)\cdot B(g)). \qquad (2.46)$$
Exponential transformation models that are full are a rarity. However, important examples of such models are provided by the family of Wishart distributions and the transformational submodels of this.
In general, then, an exponential transformation model M is a curved exponential model. It is seen from the above theorem that the full model M̄ generated by M is a composite transformation model and that, correspondingly, P̄ (and, hence, Θ and T) is a foliated manifold with M as a leaf. It seems of interest to study how the leaves of this foliation are related geometric-statistically. Exponential transformation models of type (k,d), and in particular those of type (2,1), have been studied in some detail by Eriksen (1984a,c). In the first of these papers the Jordan normal form of a matrix is an important tool.
Many of the classical differentiable manifolds with their associated acting Lie groups are carriers of interesting exponential transformation models. Instances of this are compiled in table 2.1.
Analogies between exponential models and transformation models. There are some intriguing analogies between exponential models and transformation models.

Example 2.3. Under a d-dimensional location parameter model, with ω as the location parameter and for a fixed value of the (ancillary) configuration statistic, the possible score functions are horizontal translates of each other.
On the other hand, under a (k,d) exponential model, with ω as a component of the canonical parameter and provided the complementary part of the canonical statistic is a cut, the possible score functions are vertical translates of each other. (For details, see Barndorff-Nielsen (1982).)
Example 2.4. Suppose ω is one-dimensional. If ω is the location parameter of a location model then the correction term C₁, given by (2.10), takes the simple form

$$C_1 = \frac{1}{24}\Bigl\{3\,\hat\ell^{(4)}/\hat\jmath^{\,2} + 5\,\hat\ell^{(3)2}/\hat\jmath^{\,3}\Bigr\}.$$

Exactly the same expression is obtained for a (1,1) exponential model with ω as the canonical parameter. (This was noted in Barndorff-Nielsen and Cox (1984).)
Maximum estimation. Suppose that for a certain class of models we have an estimation procedure according to which the estimate ω̂ of ω is obtained by maximizing a positive function M = M(ω) = M(ω;x) with respect to ω. Let m = log M and suppose that

$$\hat{\mathcal K} = -[\partial_r\partial_s m](\hat\omega) \qquad (2.47)$$

is positive definite. We shall then say that we have a maximum estimation procedure. Maximum likelihood estimation and dual maximum likelihood estimation (where m(ω) = $\overset{-1}{\ell}(\omega) = \hat\theta\cdot\tau(\omega) - \hat\ell(\tau(\omega))$, cf. (2.37)) are examples of this. More generally, minimum contrast estimation, as discussed by Eguchi (1983), is of this type.
Suppose that M depends on x through the minimal sufficient statistic only and let a be an auxiliary statistic such that (ω̂,a) is minimal sufficient. In generalization of (2.7) we may consider

$$p^*(\hat\omega;\omega|a) = c\,|\hat{\mathcal K}|^{1/2}\bar M \qquad (2.48)$$

as a possible approximation to p(ω̂;ω|a). Here M̄ = M(ω)/M(ω̂) and c is a norming constant, determined so as to make the integral of the right hand side of (2.48) with respect to ω̂ equal to 1.
It will be shown in section 5 that (2.48) is exactly equal to p(ω̂;ω|a) for a considerable range of cases.
Finally, it may be noted that by an argument of analogy it would seem rather natural to consider the modification of (2.48) in which the function M is substituted for the likelihood function L. While this approach is not without interest, its general asymptotic degree of accuracy is only O(n⁻¹ᐟ²) in comparison with O(n⁻¹) or O(n⁻³ᐟ²) for (2.48). Also, for transformation models this modification is exact in exceptional cases only.
[Table 2.1: classical manifolds, their acting Lie groups, and the associated exponential transformation models.]
3. TRANSFORMATION MODELS

Transformation models were introduced in section 2. For any x ∈ X the set Gx = {gx : g ∈ G} of points traversed by x under the action of G is termed the orbit of x. The sample space X is thus partitioned into disjoint orbits, and if on each orbit we select a point u, to be called the orbit representative, then any point x in X can be determined by specifying the representative u of Gx and an element z ∈ G such that x = zu. In this way x has, as it were, been expressed in new coordinates (z,u) and we speak of (z,u) as an orbital decomposition of x.
The orbit representative, or any one-to-one transformation thereof, is a maximal invariant - and hence ancillary - statistic, and inference under the model proceeds by first conditioning on that statistic.
The action of G on a space X is said to be transitive if X consists of a single orbit and free if for any pair g and h of different elements of G we have gx ≠ hx for every x ∈ X. Note that after conditioning on a maximal invariant statistic u we have a transitive action of G on the conditional sample space. For any x ∈ X the set G_x = {g : gx = x} is a subgroup, called the isotropy group of x. The space X is said to be of constant orbit type if it is possible to select the orbit representatives u so that G_u is the same for all u.
The situation is particularly transparent if the action of G on the sample space X is free. Then for given x and u there is only one choice of z ∈ G such that x = zu, and X is thus representable as a product space of the form U × G, where U is the subset of X consisting of the orbit representatives u. Note that u and z as functions of x are, respectively, invariant and equivariant, i.e.

$$u(gx) = u(x), \qquad z(gx) = g\,z(x).$$

It is often feasible to construct an orbital decomposition by first finding an equivariant mapping z from X onto G and then defining the orbit representative u for x by

$$u = z^{-1}x.$$

In particular, the maximum likelihood estimate ĝ of g is equivariant, and may be used as z provided ĝ(x) exists uniquely for every x ∈ X and ĝ(X) = G. In this case, G's action on P must also be free.
However, we shall need to treat more general cases where the actions of G on X and on P are not necessarily free.
Let H and K be subsets of G. We say that these constitute a factorization of G if G is uniquely factorizable as

$$G = HK$$

in the sense that to each element g ∈ G there exists a unique pair (h,k) ∈ H × K such that g = hk. We speak of a left factorization if, in addition, K is a subgroup of G, and similarly for right factorization. If a factorization is both left and right then G is said to be the product of the groups H and K. An important example of such a product is afforded by the well-known unique factorization of a regular n × n matrix A into a product UT of an orthogonal matrix U and a lower triangular matrix T with positive diagonal elements, i.e., using standard notations for matrix groups, GL(n) is the product of O(n) and T₊(n).
A relevant left factorization is often generated in the following way. Let P be a member of the family P of probability measures for a transformation model M, and let K be the isotropy group G_P, i.e.

$$K = \{g\in G : gP = P\}.$$

For each P′ ∈ P we may select an element h of G such that P′ = hP, and letting H be the set consisting of these elements we have a (left) factorization G = HK. (In a more technical wording, the elements h are representatives of the left cosets of K.) Note that $G_{hP} = hG_Ph^{-1}$, and that the action of G on P is free if and only if K consists of the identity element alone. The quantity h parametrizes P.
Suppose G = HK is a factorization of this kind. For most transformation models of interest, if the action of G on X is not free then there exists an orbital decomposition (z,u) of x with z ∈ H and such that for every u the isotropy group G_u equals K and, furthermore, if z and z′ are different elements of H then zu ≠ z′u.
Example 3.1. Hyperboloid model. This model (Barndorff-Nielsen (1978b), Jensen (1981)) is analogous to the von Mises-Fisher model but pertains to observations x on the unit hyperboloid H^{k-1} of R^k, i.e.

$$H^{k-1} = \{x : x*x = 1,\ x_0 > 0\}$$

where x = (x₀,x₁,...,x_{k-1}) and * denotes the non-definite scalar product of vectors in R^k which is given by

$$x*y = x_0y_0 - x_1y_1 - \dots - x_{k-1}y_{k-1}.$$

The analogue of the orthogonal group O(k) is the so-called pseudo-orthogonal group O(1,k−1), which is the subgroup of GL(k) with matrix representation

$$O(1,k-1) = \{U : U\mathbf{1}U^* = \mathbf{1}\}$$

where 𝟏 denotes the k × k diagonal matrix

$$\mathbf{1} = \mathrm{diag}(1,-1,\dots,-1).$$

For k = 4 this is the Lorentz group of relativistic physics. Topologically, the group O(1,k−1) has four connected components, of which one is a subgroup of O(1,k−1) and is defined by
$$SO^{+}(1,k-1) = \{U\in O(1,k-1) : \det U = 1,\ u_{00} > 0\}$$

(the elements of U are denoted by u_{ij}, i and j = 0,1,...,k−1). This subgroup is called the special pseudo-orthogonal group and it acts on H^{k-1} by (U,x) → xU* (vector-matrix multiplication). The points of H^{k-1} can be expressed in hyperbolic-spherical coordinates as

$$x_0 = \cosh u,$$
$$x_1 = \sinh u\,\cos v_1,$$
$$x_2 = \sinh u\,\sin v_1\cos v_2,$$
$$\vdots$$
$$x_{k-1} = \sinh u\,\sin v_1\cdots\sin v_{k-2},$$

and an invariant measure μ on H^{k-1}, relative to the action of SO⁺(1,k−1), is specified by

$$d\mu = \sinh^{k-2}u\;\sin^{k-3}v_1\cdots\sin v_{k-3}\;du\,dv_1\cdots dv_{k-2}. \qquad (3.1)$$

The hyperboloid model function, relative to the invariant measure (3.1) on H^{k-1}, is

$$p(x;\xi,\lambda) = a_k(\lambda)\,e^{-\lambda\,\xi*x} \qquad (3.2)$$

where the parameters ξ and λ, called the mean direction and the precision, satisfy ξ ∈ H^{k-1} and λ > 0, and where

$$a_k(\lambda) = \lambda^{k/2-1}/\{(2\pi)^{k/2-1}\,2K_{k/2-1}(\lambda)\} \qquad (3.3)$$

with $K_{k/2-1}$ a Bessel function.
For any fixed λ, the hyperboloid distributions (3.2) constitute a transformation model under the action of SO⁺(1,k−1), and the induced action on the parameter space is (U,ξ) → ξU* (vector-matrix multiplication). The isotropy group K of the element ξ₀ = (1,0,...,0) may be identified with SO(k−1). Furthermore, SO⁺(1,k−1) can be factored as

$$SO^{+}(1,k-1) = HK = H\,SO(k-1)$$
where the matrix representation of h ∈ H is

$$h = \begin{pmatrix} x_0 & x_1 & \cdots & x_{k-1}\\ x_1 & 1+\dfrac{x_1^2}{1+x_0} & \cdots & \dfrac{x_1x_{k-1}}{1+x_0}\\ \vdots & \vdots & \ddots & \vdots\\ x_{k-1} & \dfrac{x_{k-1}x_1}{1+x_0} & \cdots & 1+\dfrac{x_{k-1}^2}{1+x_0}\end{pmatrix} \qquad (3.4)$$

for x = (x₀,x₁,...,x_{k-1}) varying over H^{k-1}. In relativity theory a Lorentz transformation of the type (3.4) is termed a "pure Lorentz transformation" or a "boost." (It may be noted that SO⁺(1,k−1) can equally be factored as KH with the same K and H as above.)
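A numerical check of (3.4), assuming k = 4 and a randomly chosen point x of H³ (an illustrative choice): the matrix h is pseudo-orthogonal and carries ξ₀ = (1,0,...,0) to x.

```python
import numpy as np

def boost(x):
    # The symmetric boost matrix of (3.4).
    k = len(x)
    h = np.empty((k, k))
    h[0, :] = x
    h[:, 0] = x
    h[1:, 1:] = np.eye(k - 1) + np.outer(x[1:], x[1:]) / (1 + x[0])
    return h

rng = np.random.default_rng(1)
y = rng.normal(size=3)
x = np.concatenate(([np.sqrt(1 + y @ y)], y))      # a point on H^3
one = np.diag([1.0, -1.0, -1.0, -1.0])             # the diagonal matrix "1"

h = boost(x)
assert np.allclose(h @ one @ h.T, one)             # h lies in O(1,3)
assert np.allclose(np.array([1.0, 0, 0, 0]) @ h.T, x)  # h maps xi_0 to x
print("boost checks passed")
```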
We have already mentioned the concept of equivariance of a mapping from X onto G. More generally, if s is a mapping of X onto a space S and if s(x) = s(x′) implies s(gx) = s(gx′) for x,x′ ∈ X and all g ∈ G then s is said to be equivariant. In this case we may define an action of G on S by gs = s(gx) for s = s(x) and for any x ∈ X, and we speak of this as the action induced by s. In the applications to be discussed later, S is typically the parameter domain under some parametrization of the model and s is the maximum likelihood estimator, which is automatically equivariant.
We are now ready to state the results which constitute the main tools of the theory of transformation models.
Subject to mild topological regularity conditions (for details, see Barndorff-Nielsen, Blaesild, Jensen and Jorgensen (1982)) we have
Lemma 3.1. Let u be an invariant statistic with range space U = u(X), let s be an equivariant statistic with range space S = s(X), and assume that the induced action of G on S is transitive. Furthermore, let μ be an invariant measure on X. Then we have (s,u)(X) = S × U and

$$(s,u)(\mu) = \nu\times\rho$$

where ν is an invariant measure on S and ρ is some measure on U.
Suppose r, s and t are statistics on X (in general vector-valued). The symbol r ⊥ s | t is used to indicate that r and s are conditionally independent given t.
Theorem 3.1. Let the notations and assumptions be as in lemma 3.1, and suppose that the transformation model has a model function p(x;g) relative to an invariant measure μ on X such that p(x) = p(x;e) is of the form

$$p(x) = q(u)\,r(s,w) \qquad (3.5)$$

for some functions q and r and some invariant statistic w which is a function of u.
Then the following conclusions are valid.
(i) The model function p(x;g) is of the form

$$p(x;g) = q(u)\,r(g^{-1}s,w), \qquad (3.6)$$

and hence the statistic (s,w) is sufficient.
(ii) We have s ⊥ u | w.
(iii) The invariant statistic u has probability function

$$p(u) = q(u)\int r(s,w)\,d\nu(s) \qquad \langle\rho\rangle. \qquad (3.7)$$

(iv) The conditional probability function of s given w is

$$p(s;g|w) = c(w)\,r(g^{-1}s,w) \qquad \langle\nu\rangle \qquad (3.8)$$

where c(w) is a norming constant.
It should be noted that the theorem covers the case where no sufficient reduction is available (take q constant and w = u) as well as the case where s - typically the maximum likelihood estimator - is sufficient (take w degenerate). Note also that theorem 3.1 does not assume that the action of G is free. If, however, the action is free and if (z,u) is an orbital decomposition of x then the theorem applies with s = z.

Example 3.2. Hyperboloid model (continued). Let x₁,...,x_n be a sample from the hyperboloid distribution (3.2) and let x = (x₁,...,x_n) and x₊ = x₁ + ... + x_n. Considering λ as fixed, theorem 3.1 applies with u as the maximal invariant statistic, s = x₊/√(x₊*x₊) and w = √(x₊*x₊). In particular, it turns out that the conditional distribution of s given w (or, equivalently, given u) is again a hyperboloid distribution, with mean direction ξ and precision wλ. This is in complete analogy with the von Mises-Fisher situation, and accordingly s and w are termed the mean direction and the resultant length of the sample. For details and further results see Jensen (1981) and Barndorff-Nielsen, Blaesild, Jensen and Jorgensen (1982).
Lemma 3.1 and theorem 3.1 are formulated in terms of invariant dominating measures on X and S. In applications, however, the probability functions are ordinarily expressed relative to Lebesgue measure - or, more generally, relative to geometric measure when the underlying space is a differentiable manifold. It is therefore important to have a formula which gives the relation between the two types of dominating measure.
Let γ be an action of G on a space Y and suppose Y has constant orbit type under this action. Then there exists a subgroup K of G, a subset H of G and an orbital decomposition (z,u) of y ∈ Y such that G_u = K and z ∈ H for every y. We assume that H can be chosen so that HK constitutes a (left) factorization of G. If Y is a differentiable manifold and if γ acts differentiably on Y then an invariant measure μ on Y can typically be constructed from geometric measure λ on Y by means of Jacobians. In particular, if Y is an open subset of some Euclidean space R^r, so that λ is Lebesgue measure, then μ defined by

$$d\mu(y) = \mathbf{J}_{\gamma(z)}(u)^{-1}\,d\lambda(y) \qquad (3.9)$$

will be invariant; here $\mathbf{J}_{\gamma(g)}$ denotes the Jacobian determinant of the mapping γ(g) of Y onto itself. A proof of this is sketched in appendix 1.
Example 3.3. Hyperboloid model (continued). We show here how the invariant measure (3.1) on the unit hyperboloid H^{k-1} may be derived from Lebesgue measure. For simplicity, suppose k = 3. The manifold H² is in one-to-one smooth correspondence with R² through the mapping

$$\phi(x_0,x_1,x_2) = (x_1,x_2),$$

and we start by finding an invariant measure on R². The action of SO⁺(1,2) on H² is given by (U,x) → xU* and the induced action on R² is therefore of the form (U,y) → φ(φ⁻¹(y)U*). These actions are transitive, and if we take u = (0,0) as the orbit representative of R² and let z be the boost

$$z = \begin{pmatrix} y_0 & y_1 & y_2\\ y_1 & 1+\dfrac{y_1^2}{1+y_0} & \dfrac{y_1y_2}{1+y_0}\\ y_2 & \dfrac{y_1y_2}{1+y_0} & 1+\dfrac{y_2^2}{1+y_0}\end{pmatrix}, \qquad y_0 = \sqrt{1+y_1^2+y_2^2}, \qquad (3.10)$$

then (z,u) constitutes an orbital decomposition of y ∈ R² of the type required for the use of formula (3.9). Letting γ denote the action of SO⁺(1,2) on R², one finds that $\mathbf{J}_{\gamma(z)}(u) = \sqrt{1+y_1^2+y_2^2}$ and hence the measure

$$d\mu(y) = (1+y_1^2+y_2^2)^{-1/2}\,dy$$

is an invariant measure on R². Shifting to hyperbolic-spherical coordinates (u,v) for (y₁,y₂) this measure is transformed to (3.1) with k = 3.
Below and in sections 4 and 5 we shall draw several important conclusions from lemma 3.1 and theorem 3.1. Various other applications may be found in Barndorff-Nielsen, Blaesild, Jensen and Jorgensen (1982).

Corollary 3.1. Let G = HK be a left factorization of G such that K is the isotropy group of P. Thus the likelihood function depends on g through h only. Suppose theorem 3.1 applies with S = H and let L(h) = L(h;x) be any version of the likelihood function. Then, the conditional probability function of s given w may be expressed in terms of the likelihood function as

$$p(s;h|w) = c(w)\,\frac{L(h)}{L(s)} \qquad \langle\nu\rangle. \qquad (3.11)$$

In formula (3.11) the likelihood function changes with the value of s. However, an alternative expression for the conditional probability function is available which employs only the single observed likelihood function. Suppose for simplicity that K consists of the identity element alone, so that S = G. Further, let x₀ denote the observed point in X and write L₀(g) for L(g;x₀). Also, for specificity, let the action of G on S = G be the so-called left action of G on itself, i.e. a g ∈ G acts on a point s ∈ S simply by multiplying s on the left by g, in the group theoretic sense. (Thus, the two possible interpretations of the symbol gs coincide.) The situation here specified occurs, in particular, if the action of G on X is free and if s is the group component of an orbital decomposition of x. Setting s₀ = s(x₀) and w₀ = w(x₀), we are interested in the conditional distribution of s given w = w₀ and by (3.6) and (3.11) this may be written as

$$p(s;g|w_0) = c(w_0)\,\frac{L_0(s_0s^{-1}g)}{L_0(s_0)} \qquad \langle\alpha\rangle,$$

the invariant measure being denoted here by α, as a standard notation for left invariant measure on G. This formula, which generalizes a similar expression for the location-scale model due to Fisher (1934), shows how the "shape and position" of the conditional distribution of s is simply determined by the observed likelihood function and the observed s₀, respectively.
Formula (3.11), however, besides being slightly more general, seems more directly applicable in practice.
4. TRANSFORMATIONAL SUBMODELS

Let M be a transformation model with acting group G. If P₀ is any of the probability measures in M and if G₀ is a subgroup of G then P₀ = {gP₀ : g ∈ G₀} defines a transformation submodel M₀ of M. For a given G₀ the collection of such submodels typically constitutes a foliation of M.
Suppose G is a Lie group, as is usually the case. The one-parameter subgroups of G are then in one-to-one correspondence with TG_e, the tangent space of G at the identity element e, and this in turn is in one-to-one correspondence with the Lie algebra 𝔤 of left invariant vector fields on G. More generally, each subalgebra 𝔥 of the Lie algebra of G determines a connected subgroup H of G whose Lie algebra is 𝔥 (cf., for instance, Boothby (1975), chapter 4, theorem 8.7). If A ∈ TG_e, the one-parameter subgroup of G determined by A is of the form {exp(tA) : t ∈ R}. In general, the subgroup of G determined by r linearly independent elements A₁,...,A_r of TG_e may be represented as the group of finite products of the exponentials exp(t₁A₁),...,exp(t_rA_r), t₁,...,t_r ∈ R.
Example 4.1. Let M be a location-scale model,

$$p(x_1,\dots,x_n;\mu,\sigma) = \sigma^{-n}\prod_{i=1}^n f(\sigma^{-1}(x_i-\mu)). \qquad (4.1)$$

Here G is the affine group with elements [μ,σ] which may be represented by 2 × 2 matrices

$$\begin{pmatrix} 1 & 0\\ \mu & \sigma\end{pmatrix},$$

the group operation being then ordinary matrix multiplication. The Lie algebra of G, or equivalently TG_e, is represented as the set of 2 × 2 matrices of the form

$$A = \begin{pmatrix} 0 & 0\\ b & a\end{pmatrix}.$$

We have

$$e^{tA} = I + tA + \tfrac{1}{2!}t^2A^2 + \cdots = \begin{pmatrix} 1 & 0\\ \frac{b}{a}(e^{ta}-1) & e^{ta}\end{pmatrix},$$

where the last expression is to be interpreted in the limiting sense if a = 0.
There are therefore four different types of submodels. Specifically, letting (μ₀,σ₀) denote an arbitrary value of (μ,σ) and taking P₀ as the corresponding measure (4.1) we have:
(i) If a = 0 then M₀ is a pure location model.
(ii) If a ≠ 0, b = 0 and μ₀ = 0 then M₀ is a pure scale model.
(iii) If a ≠ 0, b = 0 and μ₀ ≠ 0 then M₀ may be characterized as the submodel of M for which the coefficient of variation μ/σ is constant and equal to μ₀/σ₀.
(iv) If both a and b are different from 0 then M₀ may be characterized as the submodel of M for which σ⁻¹(μ + b/a) is constant and equal to c₀ = σ₀⁻¹(μ₀ + b/a); i.e., if we let c = b/a then M₀ is determined by

$$\sigma^{-1}(\mu + c) = c_0. \qquad (4.2)$$

Letting F denote the distribution function of f we can express (4.2) as the condition that (μ,σ) is such that −c is the F(−c₀)-quantile of the distribution σ⁻¹f(σ⁻¹(x−μ)).
The above example is prototypical in the sense that G is generally a subgroup of the general linear group GL(m) for some m, and TG_e may be represented as a linear subset of the set M(m) of all m × m matrices.
Example 4.2. Hyperboloid model. The model function of the hyperboloid model with k = 3 and a known precision parameter λ may be written as

$$p(u,v;\chi,\phi) = (2\pi)^{-1}\lambda e^{\lambda}\sinh u\;e^{-\lambda\{\cosh\chi\cosh u - \sinh\chi\sinh u\cos(v-\phi)\}} \qquad (4.3)$$

where u > 0, v ∈ [0,2π) and χ ≥ 0, φ ∈ [0,2π). The generating group G = SO⁺(1,2) may be represented as the subgroup of GL(3) whose elements are of the form

$$\begin{pmatrix} 1&0&0\\ 0&\cos\psi&\sin\psi\\ 0&-\sin\psi&\cos\psi\end{pmatrix}\begin{pmatrix} \cosh\chi&\sinh\chi&0\\ \sinh\chi&\cosh\chi&0\\ 0&0&1\end{pmatrix}\begin{pmatrix} 1+\frac{\zeta^2}{2}&-\frac{\zeta^2}{2}&\zeta\\ \frac{\zeta^2}{2}&1-\frac{\zeta^2}{2}&\zeta\\ \zeta&-\zeta&1\end{pmatrix} \qquad (4.4)$$

where −∞ < ζ < ∞. This determines the so-called Iwasawa decomposition (cf., for instance, Barut and Raczka (1980), chapter 3) of SO⁺(1,2) into the product of three subgroups, the three factors in (4.4) being the generic elements of the respective subgroups. It follows that TG_e is the linear subspace of M(3) generated by the linearly independent elements
$$E_1 = \begin{pmatrix} 0&0&0\\ 0&0&1\\ 0&-1&0\end{pmatrix},\qquad E_2 = \begin{pmatrix} 0&1&0\\ 1&0&0\\ 0&0&0\end{pmatrix},\qquad E_3 = \begin{pmatrix} 0&0&1\\ 0&0&0\\ 1&0&0\end{pmatrix}.$$

Each of the three subgroups of the Iwasawa decomposition generates a transformational foliation of the hyperboloid model given by (4.3), as discussed in general terms above. In particular, the group determined by the third factor in (4.4) yields, when applied to the distribution (4.3) with χ = φ = 0, the following one-parameter submodel of the hyperboloid model:

$$p(u,v;\zeta) = (2\pi)^{-1}\lambda e^{\lambda}\sinh u\;e^{-\lambda\{(1+\frac{\zeta^2}{2})\cosh u - \frac{\zeta^2}{2}\sinh u\cos v - \zeta\sinh u\sin v\}}.$$

The general form of the one-parameter subgroups of SO⁺(1,2) is

$$\exp\left\{t\begin{pmatrix} 0&a&b\\ a&0&c\\ b&-c&0\end{pmatrix}\right\}$$

where a, b, c are fixed real numbers.


5. MAXIMUM ESTIMATION AND TRANSFORMATION MODELS

We shall be concerned with those situations in which there exists an invariant measure μ on X that dominates P, where P = {gP : g ∈ G} is transformational. Letting

$$p(x;g) = \frac{d\,gP}{d\mu}(x)$$

and writing p(x) for p(x;e) we have

$$p(x;g) = p(g^{-1}x) \qquad \langle\mu\rangle.$$

In most cases of interest the model has the following additional structure (possibly after deletion of a null set from X, cf. also section 3). There exists a left factorization G = HK of G, a K-invariant function f on X, and an orbital decomposition (ĥ,u) of x such that:
(i) G_u = K for all u and, furthermore, G_P = K. Hence, in particular, H may be viewed as the parameter space of the model.
(ii) For every x ∈ X the function m(h) = f(h⁻¹x) has a unique maximum on H and the maximum point is ĥ.
(iii) H may be viewed as an open subset of some Euclidean space R^d and, for each fixed x ∈ X, the function m is twice continuously differentiable on H and the matrix 𝒦 = 𝒦(h) given by

$$\mathcal K(h) = -\frac{\partial^2 m}{\partial h^*\,\partial h}(h)$$

is positive definite.
In these circumstances we have:
Proposition 5.1. The maximum estimator ĥ is an equivariant mapping


of X onto H and the action of G on H induced by ĥ coincides with the natural action of G on H. Furthermore, if the mapping x → (ĥ,u) is proper then there exists an invariant measure ν on H, and for any fixed u such a measure is given by

$$d\nu(h) = |\breve{\mathcal K}(h)|^{1/2}\,dh \qquad (5.1)$$

where dh indicates the differential of Lebesgue measure on H.
Here H is considered as an open subset of R^d, in accordance with (iii).
Proof. The equivariance of ĥ follows immediately from (ii). Obviously, there is a one-to-one correspondence between the family of left cosets G/K = {gK : g ∈ G} and H. Let p be the mapping from G/K to H which establishes this correspondence. The natural action φ of G on G/K is given by

$$\varphi:\ (g,\bar gK)\ \mapsto\ g\bar gK$$

and we have to show that when this action is transferred to H by p it coincides with the action γ of G on H induced by ĥ. In other words, we must verify that for any g ∈ G the diagram

            p
    G/K --------> H
      |           |
  φ(g)|           |γ(g)                (5.2)
      v           v
    G/K --------> H
            p

commutes. Let η be the mapping from G to H that sends a g ∈ G into the uniquely determined h ∈ H such that g = hk for some k ∈ K. For any ĥ = ĥ(x) in H we have that γ(g)ĥ = ĥ(gx) is determined by

$$f(\{\hat h(gx)\}^{-1}gx) \ge f(h^{-1}gx), \qquad h\in H. \qquad (5.3)$$

Now, by the K-invariance of f,

$$f(h^{-1}gx) = f((g^{-1}h)^{-1}x) = f(\eta(g^{-1}h)^{-1}x)$$

and here η(g⁻¹h) ranges over all of H when h ranges over H. Hence (5.3) may be rewritten as

$$f(\eta(g^{-1}\hat h(gx))^{-1}x) \ge f(h^{-1}x), \qquad h\in H,$$

i.e., by (ii),

$$\eta(g^{-1}\hat h(gx)) = \hat h(x)$$

or, equivalently,

$$\hat h(gx)K = g\,\hat h(x)K,$$

and this, precisely, expresses the commutativity of (5.2), since p⁻¹(h) = hK.
When the mapping x → (ĥ,u) is proper the subgroup K is compact because K = G_u. Hence there exists an invariant measure on H, cf. appendix 1. That $|\breve{\mathcal K}|^{1/2}dh$ is such a measure follows from (3.9) and formula (5.10) below.
In particular, then, there is only one action of G on H at play, namely γ, and

$$\gamma(g)h = \eta(gh). \qquad (5.4)$$

Now, let h → ω be an arbitrary reparametrization of the model and let m(ω) = m(h(ω)) and

$$\mathcal K(\omega) = \mathcal K(\omega;u) = -\frac{\partial^2 m}{\partial\omega^*\,\partial\omega}(\omega;u). \qquad (5.5)$$

This matrix is a (0,2) tensor on Ω.
We shall now show that

$$\breve{\mathcal K}(h) = \breve{\mathcal K}(h;u) = J_{\gamma(h)}(e)^{-1*}\,\breve{\mathcal K}(e;u)\,J_{\gamma(h)}(e)^{-1}. \qquad (5.6)$$

Here the unit element e is to be thought of as a point in H.
We have

$$m(h) = f(h^{-1}x) = f(h^{-1}\hat h u) = f(\{\eta(\hat h^{-1}h)\}^{-1}u),$$

where, again, we have used the K-invariance of f. Thus, with η as the projection mapping defined above and m₀(h) = f(h⁻¹u), we obtain

$$\frac{\partial m}{\partial h}(h) = \frac{\partial m_0}{\partial h}(\eta(\hat h^{-1}h))\,\frac{\partial\eta(\hat h^{-1}h)}{\partial h} \qquad (5.7)$$

and

$$\frac{\partial^2 m}{\partial h^*\partial h}(h) = \frac{\partial\eta(\hat h^{-1}h)^*}{\partial h^*}\,\frac{\partial^2 m_0}{\partial h^*\partial h}(\eta(\hat h^{-1}h))\,\frac{\partial\eta(\hat h^{-1}h)}{\partial h} + \frac{\partial m_0}{\partial h}(\eta(\hat h^{-1}h))\,\frac{\partial^2\eta(\hat h^{-1}h)}{\partial h^*\partial h}. \qquad (5.8)$$

In these expressions we have, since η(ĥ⁻¹h) = γ(ĥ⁻¹)h, that

$$\frac{\partial\eta(\hat h^{-1}h)}{\partial h} = J_{\gamma(\hat h^{-1})}(h). \qquad (5.9)$$

On inserting ĥ for h in (5.7), (5.8) and (5.9) (whereby (5.7) becomes 0) and combining with (2.1) we obtain (5.6).
From (5.6) we may draw two important conclusions.
First, taking determinants we have

$$|\breve{\mathcal K}(h;u)|^{1/2} = \mathbf{J}_{\gamma(h)}(e)^{-1}\,|\breve{\mathcal K}(e;u)|^{1/2} \qquad (5.10)$$

and this, by (3.9) and the tensorial nature of 𝒦, implies that $|\breve{\mathcal K}(\omega)|^{1/2}d\omega$ is an invariant measure on Ω. In connection with formula (5.10) it may be noted that

$$\mathbf{J}_{\gamma(h)}(e) = \mathbf{J}_{\delta(h)}(e)$$

where δ denotes left action of the group G on itself. A proof of this latter formula is given in appendix 2.
Secondly, the tensor $\breve{\mathcal K}(\omega)$ is found to be G-invariant, whatever the value of the auxiliary u. In fact, by (5.4) we have, for any h₀ ∈ H and g ∈ G,

$$\gamma(\gamma(g)h)h_0 = \gamma(g)\circ\gamma(h)h_0.$$

Consequently, by (2.1),

$$J_{\gamma(\gamma(g)h)}(h_0) = J_{\gamma(h)}(h_0)\,J_{\gamma(g)}(\gamma(h)h_0),$$

and this together with (5.6) and (2.26) establishes the invariance.
In particular, observed information $\breve\jmath$ determines a G-invariant Riemannian metric on the parameter space. The expected information metric i can also be shown to be G-invariant.
From proposition 5.1 and corollary 3.1 we find
Corollary 5.1. The model function $p^*(\hat\omega;\omega|u) = c|\hat{\mathcal K}|^{1/2}\bar M$ is exactly equal to p(ω̂;ω|u).
By taking m of (ii) equal to the log likelihood function ℓ this corollary specializes to theorem 4.1 of Barndorff-Nielsen (1983).
Suppose, in particular, that the model is an exponential transformation model. Then the above theory applies with $m(\omega) = \overset{\alpha}{\ell}(\omega)$. The essential property to check is that $\overset{\alpha}{\ell}(\omega;t(x))$ is of the form f(h⁻¹x). This follows simply from the definition of $\overset{\alpha}{\ell}$ and theorem 2.1.
6. OBSERVED GEOMETRIES

In section 2 we briefly reviewed how the parameter space of the model M may be set up as a manifold with expected information i as Riemannian metric tensor and with an associated family of affine connections, the α-connections (2.30). We shall now discuss a similar type of geometries on the parameter space, related to observed information and depending on the choice of the auxiliary statistic a which together with the maximum likelihood estimator ω̂ constitutes a minimal sufficient statistic for M. These latter geometries are termed observed geometries (Barndorff-Nielsen, 1986a). In applications to statistical inference questions it will usually be appropriate to take a to be ancillary, but a great part of what we shall discuss does not require distribution constancy of a and, unless explicitly stated otherwise, the auxiliary a is considered arbitrary (except for the implicit smoothness properties).
Let an auxiliary a be chosen. We may now take partial derivatives of ℓ = ℓ(ω;ω̂,a) with respect to the coordinates ω^r of ω as well as with respect to ω̂^r. Letting $\partial_{;r} = \partial/\partial\hat\omega^r$ we introduce the notation

$$\ell_{r_1\dots r_p;s_1\dots s_q} = \partial_{r_1}\cdots\partial_{r_p}\,\partial_{;s_1}\cdots\partial_{;s_q}\,\ell \qquad (6.1)$$

and refer to these quantities as mixed derivatives of the log model function. The function of ω and a obtained from (6.1) by substituting ω for ω̂ will be denoted by $\breve\ell_{r_1\dots r_p;s_1\dots s_q}$. Thus, for instance,

$$\breve\ell_{rs;t} = \breve\ell_{rs;t}(\omega;a) = \ell_{rs;t}(\omega;\omega,a).$$

More generally, for any combinant g of the form g(ω;ω̂,a) we write

$$\breve g = \breve g(\omega;a) = g(\omega;\omega,a).$$

This is in consistency with the notation # introduced by (2.6). The observed


geometries, to be discussed, are expressed in terms of the mixed derivatives
(6 2 )
\ r s s
rΓ..rpSsΓ..sq
So are the terms of an asymptotic expansion of (2.7), cf. section 7.
Given the observed value of a the observed information tensor ^, of
(2.6), defines the parameter space of M^ as a Riemannian manifold. The Rieman-
nian connection determined by a- has connection symbols -°t
F given by°t
& =

f
rst * ''"At " Vrs * Vrt'
Employing the notation established above we have 9.6- = -*c + -Jc. +9 etc.
u rs rsu rs,u
so that

As we shall now show, the quantity

p = _(} + > . t [3]) (6.4)

is a covariant tensor of rank 3, i.e.

*pστ "T rst ω /p ω /σ ω /τ


(6
' 5)
First, from (2.14) we have

ω ω ω + + ω ω [3]
/p /σ /τ rS /Pσ /τ '

τ
Further, from (2.13) we obtain, on differentiating with respect to ψ and then
substituting parameter for estimate,

= + ω ω ω + ω ω (6 7
V,τ rs;t /p /σ /τ V,t /Pσ /τ' '

Finally, differentiating the likelihood equation

we find
Differential and Integral Geometry in Statistical Inference 137

*rs
or

Combination of (6.4), (6.6), (6.7) and (6.9) yields (6.5).


It follows from the tensorial nature of ? and from (6.3) and (6.9)
α
that for any real α an affine connection ? on M^may be defined by

f
rs ~ * *Vsu
with

In particular, we have
1 -1 = (6J1)
V,rs
where to obtain the latter expression we have used

which follows on differentiation of (6.8). It may also be noted that


1 - 1 1 - 1
t rs rts str str rts
and
a l.l Ί "1

α
The connections -f, which we shall refer to as the observed α-con-
α
nections, are analogues of the expected α-connections r given by (2.30). The
α α
analogy between r and -F becomes more apparent by rewriting the skewness tensor
(2.29) as
T E{1
rst = " rst
the validity of which follows on differentiation of the formula
E{1
rs + V s
}= (6J2)
°'
which, in turn, may be compared to (6.8).
Under the specifications of a of primary statistical interest one
138 0. E. Barndorff-Nielsen

has that, in broad generality, the observed geometries converge to the corre-
sponding expected geometries as the sample size tends to infinity.
For (k,k) exponential models

p(x θ) = a(θ)b(x)e θ # t ( x ) (6.13)

no auxiliary statistic is involved since θ is minimal sufficient, and we find


α α
j- = i and £ = r, αεR.
Let i,j,k,... be indices for the coordinates of θ, t and τ, using
upper indices for θ and lower indices for t and τ.
In the case of a curved exponential model (2.35), we have

\ = (t-τ)Ί.θ}Γ (6.14)

and, letting Θ denote the maximum likelihood estimator of θ under the full model
generated by (2 35), the relation + = ? takes the form
r, s rs
V , s ( ω ) = κ ij ( θ ) θ jr*/s

" κ 1J ( θ > θ /r θ /s - (*-^i


Furthermore,

• -<1jk(θ)θ/rθ/sθ/t

rst i j
;rs^t=irst (6.17)

and
( 61 8 )
^ rs-Ίj^/t^/rs-'rsf

I t is also to be noted that, under mild regularity conditions, the quantities


& and ^possess asymptotic expansions the f i r s t terms of which are given by

and

rst " { Ίjk θ ;rs θ /t θ /\ [ 3 ]


} a λ + (6 20)
Wx '
Differential and Integral Geometry in Statistical Inference 139

where a λ , λ = l,...,k-d, are the coordinates of the auxiliary statistic a. For


instance, in the repeated sampling situation and letting a« denote the affine
ancillary, as defined in Barndorff-Nielsen (1980), we may take a = n a and
the expansions (6.19) and (6.20) are asymptotic in powers of n"*5. (For further
comparison with Amari (1982a) it may be noted that the coefficient in the first
e e
order correction term of (6.19) may be written as θ /ir s θ / λ κ i- j - ; = n^ s λ where H
is Amari's notation for the exponential curvature, or α-curvature with α = 1, of
the curved exponential model viewed as a manifold imbedded in the full (k,k)
model.)
For a transformation model we find

l Γ (h;x) = T r

(cf. the more general formula (5.7)) and hence

(6 21)

(6.22)

where, for a r = a/ahΓ and a r = a/ahr,

A^ = a s n r (h" 1 l

so that

S
" ~γ(h)

while
B
st = 3 s V
B
s;t

B
st
140 0. E. Barndorff-Nielsen

Furthermore, to write the coefficients of 1 , c ,.,(e;u) in (6.21) and (6.22) as


r s K*
indicated we have used the relation
vΛh"Ίh)L = -3 ς η Γ (h" Ί h)| Λ . (6.24)
s
h=h s h=h
Formula (6.24) is proved in appendix 3.
We now briefly consider four examples. In the first three the
model is transformational and the auxiliary statistic a is taken to be the max-
imal invariant statistic, and thus a is exactly ancillary. In the fourth ex-
ample a is only approximately ancillary. Examples 6.1, 6.3 and 6.4 concern
curved exponential models whereas the model in example 6.2 - the location-scale
model - is exponential only if the error distribution is normal.
Example 6.1. Constant normal fractile. For known αε(0,l) and
cε(-oo5oo)5 let N denote the class of normal distributions having the real
"~ΪDt , C
number c as α-fractile, i.e.

N . = {N(μ,σ2):(c-μ)/σ = u },

where u denotes the α-fractile of the standard normal distribution, and let
xΊ,...,x be a sample from a distribution in N . The model for x = (x,,... ,x,J
1 n —α,c I n
thus defined is a (2,1) exponential model, except for u = 0 when it is a (1,1)
model. Henceforth we suppose that u =)= 0, i.e. α f h. The model is also a
transformation model relative to the subgroup G of the group of one-dimensional
affine transformations given by
G = ί[c(l - λ),λ]:λ>0},
the group operation being
[c(l - λ),λ][c(l - λ'),λ'] = [c(Ί - λλ'),λλ']
and the action of G on the sample space being
[c(Ί - λ),λ](x r ...,x n ) = (c(Ί - λ) + λx r ...,c(l - λ) + λx n ).
(Note that G is isomorphic to the multiplicative group.)
Letting
a = (x - c)/s',
Differential and Integral Geometry in Statistical Inference 141

where x = (x, +...+ x n )/n and

s n ^^x. x; ,

we have that a is maximal invariant and, parametrizing the model by ζ = log σ,


that the maximum likelihood estimate is
ζ = log(bs')
where
b = b(a) = (u /2)a + / l + {(u / 2 ) 2 + l}a 2 .

Furthermore, (ζ,a) is a one-to-one transformation of the minimal sufficient


statistic (x,s') and a is exactly ancillary.
The log likelihood function may be written as

l(ς) = l ( ζ ; ζ , a ) = n[ζ - ζ - ^ { b - 2 e 2 ( ^ ζ ) + (u α + a t f V " 5 ) 2 } ]

from which it is evident that the model for ζ given a is a location model.
Indicating differentiation with respect to ζ and ζ by subscripts ς
and ζ, respectively, we find

l ς = n{-l + b - 2 e 2 ( ^ ζ ) + ab" 1 (u α + a t f V ^ e ^ }

and hence
ϊ = n{2b" 2 + ab" ] (u α + 2ab" Ί )}

= n { 4 b
^ζζζ "2 + a b
"1( u
α
+ 4 a b
"]) }

+ r = -n{4b"2 + ab" Ί (u α + 4ab" ] )} = *

.- = n{4b"^ + ab"'(u + 4ab"')} = -p= -^


ζ jζζ ot

and the observed skewness tensor is

Jc = n{8b" 2 + 2ab" 1 (u α + 43b" 1 )}.

Note also that


α 1

We mention in passing that another normal submodel, that specified


142 0. E. Barndorff-Nielsen

by a known coefficient of variation μ/σ, has properties similar to those ex-


hibited by example 6.1.
Example 6.2. Location-scale model. Let data x consist of a sample
x,,...,x from a location-scale model, i.e. the model function is
n x.-μ
Π
p(x;μ,σ) = σσ"

for some known probability density function f. We assume that {x:f(x)>0} is an


open interval and that g = -log f has a positive and continuous second order
derivative on that interval. This ensures that the maximum likelihood estimate
(μ,σ) exists uniquely with probability 1 (cf., for instance, Burridge (1981)).
Taking as the auxiliary a Fisher's configuration statistic
X Ί -μ X -μ
a = (a r ...,a n ) =

which is an exact ancillary, we find

-2
V(a Σa g"(a
3-(μ,σ) = σ
Σa g"(a ) n+Σa2g"(a

and, in an obvious notation,

f '(a,)


= -σ"3{2n + 4zal2 g"(a.)
i
+ za?g"'(a.)}
l i

μμμ

-3
yyσ
Differential and Integral Geometry in Statistical Inference 143

=σ 3{4n
Kao "
Furthermore,

*Wo

Example 6.3. HyperboΊoid model. Let (u-. ,v,),... , (u ,v ) be a


sample from the hyperboioid distribution (4.3) and suppose the precision λ is
known. The resultant length is

a = { ( Σ cosh u Ί ) - (Σ sinh u. cos v..) - (Σ sinh u^ sin v^) 2 }^

and a is maximal invariant after minimal sufficient reduction. Furthermore,


the maximum likelihood estimate (χ,ί) of (χ,φ) exists uniquely, with probabil-
ity 1, (a,χ,φ) is minimal sufficient and the conditional distribution of (χ,ψ)
given the ancillary a is again hyperboloidic, as in (4.3) but with u, v and λ
replaced by χ, ψ and aλ. It follows that the log likelihood function is

l(x»φ) = Kχ 5 Φ;x,φ 9 a) = -aλ{coshχ coshχ - sinh x sinhχ cos(φ-φ)}

and hence
α α α α
2 = -F =f = -F =0
XXX XXΦ XΦX ΦΦΦ
α
¥ A Λ = aλ cosh x sinh χ
xΦΦ
α
-F Λ A = -aλ cosh x sinh χ,
ΦΦx
whatever the value of α. Thus, in this case, the α-geometries are identical.
We note again that whereas the auxiliary statistic a is taken so
as to be ancillary in the various examples discussed here - exactly distribu-
144 0. E. Barndorff-Nielsen

tion constant in the three examples above and asymptotically distribution con-
stant in the one to follow - ancillarity is no prerequisite for the general
theory of observed geometries.
Furthermore, let a be any statistic which depends on the minimal
sufficient statistic t, say, only and suppose that the mapping from t to (ω,a)
is defined and one-to-one on some subset T~ of the full range X of values of t
though not, perhaps, on all of ]_. We can then endow the model M^ with observed
geometries, in the manner described above, for values of t in T~. The
next example illustrates this point.
The above considerations allow us to deal with questions of non-
uniqueness and nonexistence of maximum likelihood estimates and nonexistence of
exact ancillaries, especially in asymptotic considerations.
Example 6.4. Inverse Gaussian - Gaussian model. Let x( ) and y( )
2
be independent Brownian motions with a common diffusion coefficient σ = 1 and
drift coefficients μ>0 and ξ, respectively. We observe the process x( ) till it
first hits a level x«>0 and at the time u when this happens we record the value
v = y(u) of the second process. The joint distribution of u and v is then
given by
p(u,v;μ,ξ)

Suppose that (u, s v^),... ,(u ,v ) is a sample from the distribution


(6.25) and let t = (ΰ,v) where ΰ and v are the arithmetic means of the observa-
tions. Then t is minimal sufficient and follows a distribution similar to
(6.25), specifically
p(ΰ,v;y,ξ)

= (2π)" Ί x o ne ° G" 2 e 2
° e 2 2
. (6.26)

Now, assume ξ equal to μ. The model (6.26) is then a (2,1) exponential model,
still with t as minimal sufficient statistic. The maximum likelihood estimate
of μ is undefined if t^T^ where
Differential and Integral Geometry in Statistical Inference 145

IQ = it = (ΰ,v):x0 + v > 0}

fhereas for tεT^, μ exists uniquely and is given by


-1
y = ^(x 0 + v) ΰ . (6.27)

he event t^T^ happens with a probability that decreases exponentially fast with
he sample size n and may therefore be ignored for most statistical purposes.
Defining, formally, μ to be given by (6.27) even for t^T^ and let-
ing
a = Φ"(ΰ;2nxQ,2 n μ 2 ) ,

here Φ ( ;x»ψ) denotes the distribution function of the inverse Gaussian dis-
ribution with density function

φ-(x χ.Φ) = ( 2 π ) " ^ e ^ x " 3 / 2 e - ^ x " 1 + * x > (6.28)

e have that the mapping t -> (μ,a) is one-to-one from X = {t = (ΰ,v):ΰ>0} onto
-oo,+») x (0,oo) and that a is asymptotically ancillary and has the property
hat p*(μ;μ|a) =c|j p L approximates the actual conditional density of μ given
to order 0 ( n ~ 3 / 2 ) , cf. Barndorff-Nielsen (1984).
Letting Φ ( ;x»ψ) denote the inverse function of φ"( ;χ,ψ) we may
rite the log likelihood function for μ as

-2
=
Π { ( X Q+ V)μ - Uμ }

= nΦ_(a;2nx 2 ,2nμ 2 ) {2μμ-μ2} (6.29)

rom this we find


= -2nΦ (a;2nx 2 2nί 2 )

o that
Xg 92nμ )

+ = 0
μyy
nd
14 6 0. E. Barndorff-Nielsen

^μμ μ = 8n
^(φ" ° W O U ;2nx2,2nμ2)

1 -1
=s = -h $
μμμ μμμ

where Φ~ denotes the derivative of Φ"(x;χ,ψ) with respect to ψ. By the well-


known result (Shuster (1968))

φ"(x;χ,ψ) = φ ( ψ V - χhx'h) +

where Φ is the distribution function of the standard normal distribution, Φ "


Ψ
could be expressed in terms of Φ and ψ = Φ 1 .
7. EXPANSION OF c l j l ^ L

We shall derive an asymptotic expansion of (2.7), by Taylor expan-


sion of cIjI L in ω around ω, for fixed value of the auxiliary a. The various
terms of this expansion are given by mixed derivatives (cf. (6.2)) of the log
model function. It should be noted that for arbitrary choice of the auxiliary
statistic a the quantity c|j|E constitutes a probability (density) function on
the domain of variation of ω and the expansions below are valid. However,
c|j|[ furnishes an approximation to the actual conditional distribution of ω
given a, as discussed in section 2, only for suitable ancillary specification
of a.
To expand c|j| L in ω around ω we first write E as exp{l-ΐ} and
expand 1 in ω around ω. By Taylor's formula,

1-1= Σ VX
V>>
(ω-ώ) Ί ...(ω-ω) v (8ΓΓ ...3 l)(ω)
v=2 ll ΓΓvv

whence, expanding each of the terms (d ...a l)(ω) around ω,


r
Ί v
1-1

oo , vv rΊ r
izJJ_ (u- ω ) Ί ...(ω-ω) V
= Σ
v=2
. Σ X P J
(ω-ω) Sl ...(ω-ω) Sp 3. . . . 3 . \
O η d l
..._ .
η l
(7.1)
O μ
1 p 1 v

Consequently, writing δ for ω-ω and 6 "' for (ω-ω) (ω-ω) ..., we have

147
148 0. E. Barndorff-Nielsen

Next, we wish to expand log{|j|/|j|Ϋ in ω around ω. To do this we observe


that if A is a d x d matrix whose elements a depend on ω then

atlog|A| = |AΓ ] 3 t |A|

where a denotes the (r,s)-element of the inverse of A. Furthermore, using

which is obtained by differentiating a a u s = ό S with respect to ω and solving


for a Γ S , we find
Vu log|A| = -aVraSV\avvΛars + asl\Vrs.

It follows that

-" t U { *Γ S ί + r s t u + + r s t ; u + * r s u ; t + + r s ; t u )

(7.3)

By means of (7.2) and (7.3) we therefore find

) dd/2
/ 2
= (2π) cφ d (ω-ω;aΉl + A ] + A 2 + ...} (7.4)

where Φ.( a-) denotes the density function of the d-dimensional normal distribu-
tion with mean 0 and precision (i.e. inverse variance-covariance matrix) a- and
where
A + +
l •- " V ^ W \^ ^St(+rs;t+ I *rst) (7
"5)
and
A2 = ± [- 3δ

1
^ » s s n
rs t r s t M vw u
Differential and Integral Geometry in Statistical Inference 149

8
*rst;u

χs;t + |^st)(+uv;w+|+uvw)], (7.6)

A^ and A 2 being of order Oίn""15) and 0(n ), respectively, under ordinary repeat-
ed sampling.
By integration of (7.4) with respect to ω we obtain

(2π) d / 2 c = 1 + C 1 + ... , (7.7)

where C-. is obtained from A« by changing the sign of A« and making the sub-
stitutions

δ rstu

the 3 and 15 terms in the two latter expressions being obtained by appropriate
permutations of the indices (thus, for example, <s r s t u -> j - r s ^ t u + > r t ^ s u +

Combination o f ( 7 . 4 ) and ( 7 . 7 ) f i n a l l y yields

c | j | ^ L = φ ( ω - ω ; ί ) { l + A1 + ( A g + C ^ + . . . } (7.8)

with an error term which in wide generality is of order 0(n-3/2 ) under repeated
sampling. In comparison with an Edgeworth expansion it may be noted that the
expansion (7.8) is in terms of mixed derivatives of the log model function,
rather than in terms of cumulants, and that the error of (7.8) is relative,
rather than absolute.
In particular, under repeated sampling and if the auxiliary statis-
tic is (approximately or exactly) ancillary such that
3/2
p(ω;ω|a) = p*(ω;ω|a){l + 0(n" )}
(cf. section 2) we generally have
150 0. E. Barndorff-Nielsen

p(ω;ω|a) = Φd(ω-ω;*){! + A1 + (A2 + C,) + 0(n" 3 / 2 )}. (7.9)

For one-parameter models, i.e. for d = 1, the expansion (7.8) with


A-., A 2 and C-, as given above reduces to the expansion (2.9). In Barndorff-
Nielsen and Cox (1984) a relation valid to order 0(n-3/2 ) was established, for
general d, between the norming constant c of (2.7) and the Bartlett adjustment
factors for likelihood ratio tests of hypotheses about ω. By means of this rel-
ation such adjustment factors may be simply calculated from the above expression
for C-j.
Example 7.1. Suppose M_ is a (k,k) exponential model with model
function (6.13). Then the expression for C . takes the form

Cr
_ 1 r0 rs tu /o ru sv tw ,o rs tu vw λ 1
l " 24{ 3 κ rstu κ κ " κ rst κ uvw ( 2 κ κ κ + 3 κ κ κ ) }

where, for a r = a/8θr and κ(θ) = -log a(θ),


κ
rs... = V s •••κ ( θ )
and where κ Γ S is the inverse matrix of K .
From (7.8) we find the following expansion for the mean value of ω:

tω =ω + μ . + y^ +

where y? is of order 0(n" ), yί is of order 0(n ), and


α . .αr.St, ,.αr.St" /-,1 Π N
μ
l " '** * + r;st = -1* *
(7J0)
^str
Hence, from (7.8) and writing δ1 for δ-μ,,

Φd(ω - ω -

= Φd(ω - ω - μ Ί ;j)Π + ^ Γ S t ( δ ' ;£) (^$.t + | + r s t ) + ..->. (7.Π)


r
-1 T"rn
where the error term is of order 0(n" ) and where h ( ;3") denotes the

tensorial Hermite polynomial (as defined by Amari and Kumon ( 1 9 8 3 ) ) , r e l a t i v e

to the tensor ί . Using (6.10) we may rewrite


-1/3 the l a s t quantity in (7.11) as
2 (7.12)
Differential and Integral Geometry in Statistical Inference 151

where

Since
hΓSt(S';j) = δ ' W * - / V ^ ] (7.14)
we find
h r S t ( δ ' ; ^ r $ t =0

and hence (7.11) reduces to


Λ , -1/3
c|j| L = φ.(ω
u
Λ
- ω - y Ίi;a ){l - hht (δ' j ) - rsL
P . + ...}, (7.15)

the error term being 0(n" ).


Suppose, in particular, that the model is an exponential (k,d)
model. We may then compare (7.15) with the Edgeworth expansion for an effi-
cient, bias adjusted estimate of ω given an ancillary statistic, provided by
formulas (3.33) and (3.25) in Amari and Kumon (1983). It appears that h
"1/3 " 1 / 3 abc
(δ' j-) z t of (7.15) is the counterpart of Amari and Kumon's r . h -
P u. m

^ab ^ h + H xa^a'iK ' ^^us ^-^^ offers some simplification over the cor-
a κ

responding expression provided by the Amari and Kumon paper.


Note that, again by the symmetry of (7.14), if
-1/3
*rst[3] = 0 (7.16)
for all r,s,t then the first order correction term in (7.15) is 0. Further-
ex
more, for any one-parameter model M^ the quantity % with α = -1/3, can be made
to vanish by choosing that parametrization for which ω is the geodesic coordin-
ate for the -1/3 observed conditional connection. (Note that generally this
parametrization will depend on the value of the ancillary a.) An analogous
result holds for the Edgeworth expansion derived by Amari and Kumon (1983),
referred to above. The parametrization making the α = -1/3 expected connection
α

r vanish has the interpretation of a skewness reducing parametrization, cf.


Kass (1984).
8. EXPONENTIAL TRANSFORMATION MODELS

Suppose M^ is an exponential transformation model and that the full


exponential model M generated by M is regular. By theorem 2.1 the group G acts
affinely on T = τ(θ), and Lebesgue measure on T is quasi-invariant (in fact,
relatively invariant) with multiplier |A(g)|. Assuming, furthermore, that N[
and G have the structure discussed in section 3 with ίg:|A(g)| = 1} c K we find,
since the mapping g •> A(g) is a representation of G, that

|A(h(gx))| = |A(g)||A(h(x))|.

Thus m(x) = |A(fi)| is a modulator and

dv(h) = |A(h)|"Ίdh (8.1)

is an invariant measure on H (cf. appendix 1).


Again by theorem 2.1 the log likelihood function is of the form

l(h) = {θ(e)A(h" ] h)* + Bf(h"Ίh)>-w - κ(θ(e)A(h~ Ί h)* + §ί(h"Ίh)) (8.2)

where w = t(u) = h" t.


Some interesting special cases are
(i) B( ) or &(•) or both are 0. Then δ( ) of (2.45) is a multi-
plier (i.e. a homomorphism of G into (R + , )) Furthermore, if &(•) = 0 and if
(2.35) is an exponential representation of M_ relative to an invariant dominat-
ing measure on X^ then b(x) is a modulator.
(ii) The norming constant a(θ(g)) does not depend on g. If in
addition B(g) does not depend on g, which implies that B( ) = 0, then the con-
ditional distribution of h given w is, on account of the exactness of (2.7),

152
Differential and Integral Geometry in Statistical Inference 153

p(h;h|w) = c ( w ) | j | * e θ ( h " l h ) w (8.3)

where the norming constant does not depend on h.


Note that the form (8.3) is preserved under repeated sampling, i.e.
the conditional distribution of h is of the same "type" whatever the sample
size.
The von Mises-Fisher model for directional data with fixed precision
has this structure with w equal to the resultant length r, and as is well-
known the conditional model given r is also of this type, irrespective of
sample size. Other examples are provided by the hyperboloid model with fixed
precision and by the class or r-dimensional normal distributions with mean 0
and precision Δ such that |Δ| = 1.
(iii) M is a (k,k-l) model.
For simplicity we now assume that M_ has all the above-mentioned
properties. There is then little further restriction in supposing that M^ is of
the form
p(x,θ) = b M e x p ί - a λ e ^ h ^ h Γ ^ e ^ } (8.4)

where λ is the index parameter, a is maximal invariant and e, and e , are


known nonrandom vectors. For (8.4) the log likelihood function is

l(h) = -aλe Ί A(h" Ί h)e* Ί (8.5)

where we have written A for A" . Hence

(8.6)

where Λr is given by (6.23). In this case, then, the conditional observed


geometries (^( ;λ,a),*( ;λ,a)) are all "proportional" for fixed α, with aλ as
the proportionality factor. The geometric leaves of the foliation of M^, deter-
mined as the partition of M_ generated by the index parameter λ, are thus highly
similar. In this connection see example 6.3.
APPENDIX 1

Construction of invariant measures


One may usefully generalize the concepts of invariant and relatively
invariant measures as follows. Let a measure μ on X_ be called quasi-invariant
with multiplier χ = χ(g,x) if gy and y are mutually absolutely continuous for
e\/ery gεG and if
d ( g Λ ) ( χ ) = χ(g 5 x)dy(x).

Furthermore, define a function m on X to be a modulator with associated multi-


plier χ(g,x) if m is positive and
m(gχ) = χ(g,x)m(x). (Al.l)
Then, if y x is quasi-invariant with multiplier χ(g,x) and if m is a modulator
satisfying (Al.l) we have that
y = m" 1 y x (AΊ.2)
is an invariant measure on _X.
In particular, to verify that the measure y defined by (3.9) is
invariant one just has to show that m(y) = J (z\(u) is a modulator with associ-
ated multiplier J (q\(y) because, by the standard theorem on transformation of
integrals, Lebesgue measure λ is quasi-invariant with multiplier J ( q )(y)
Corresponding to the factorization G = HK there are unique factorizations g = hk
and gz = hk and, using repeatedly the assumption that K = G for every orbit
representative u, we find

154
Differential and Integral Geometry in Statistical Inference 155

In the last step we have used the fact that

J ( k )(υ) = 1 for every kεK. (AT.3)

To see the validity of (A1.3) one needs only note that for fixed u the mapping
k -> J ,. x(u) is a multiplier on K and since K is compact this must be the
trivial multiplier 1. Actually, (A1.3) is a necessary and sufficient condition
for the existence of an invariant measure on _Y. This may be concluded from
Kurita (1959), cf. also Santalό (1979), section 10.3.
APPENDIX 2

An equality of Jacobians under left factorizations


Lemma. Let G = HK be a left factorization of G (as discussed in

sections 3 and 5 ) , let γ denote the natural action of G on H and let δ denote

Then J '(h)(e) = J δ ( h ) ^
for a11 hεH#
left action of G on itself.
Proof. Let g = hk denote an arbitrary element of G. Writing g
symbolically as (h,k) and employing the mappings η and ζ defined by

η:g + h ζ:g -> k

we have, for any h'εH,


δ(h')g = δ(h')(h,k) = (η(h l h) ί ζ (h'hk))
and hence the differential of δ(h')g is

0
ah
Dδ(h')(g) =
9ζ(h'hk) 3ζ(h'hk)
ah 9k

from which we find, using η(h'h) = γ(h')h and ζ(h'k) = k,

J (e) =
δ(h')

J
γ(h')(e)

156
APPENDIX 3

An inversion result
The validity of formula (6.24) is established by the following
Lemma. Let G = HK be a left factorization of the group G with the
associated mapping η:g = hk -> h (as discussed in sections 3 and 5). Further-
more, let h1 denote an arbitrary element of H. Then

3n(h~ ] h')* aη(h'"Ίh)* (A3.1)


h=h' h=h'

Proof. The mapping h -> η(h" h 1 ) may be composed of the three


mappings h -> h1 h, g -+ g" and η, as indicated in the following diagram

where i indicates the inversion g -> g-1'. This diagram of mappings between dif-
ferentiate manifolds induces a corresponding diagram for the associated dif-
ferential mappings between the tangent spaces of the manifolds, namely

157
158 0. E. Barndorff-Nielsen

"Hi —> TG .

Di

TH
n(hI-1h)

From this latter diagram and from the well-known relation


(Di)(e) = -I,
where I indicates the identity matrix, formula (A3.1) may be read off immediate-
ly.

Acknowledgements
I am much indebted to Poul Svante Eriksen, Peter Jupp, Steffen L.
Lauritzen, Hans Anton Salomonsen and Jorgen Tornehave for helpful discussions*
and to Lars Smedegaard Andersen for a careful checking of the manuscript.
REFERENCES

Amari, S.-I. (1982a). Differential geometry of curved exponential families -


curvatures and information loss. Ann. Statist. 10, 357-385.
Amari, S.-I. (1982b). Geometrical theory of asymptotic ancillarity and condi-
tional inference. Biometrika 69, 1-17.
Amari, S.-I. (1985). Differential-Geometric Methods in Statistics. Lecture
Notes in Statistics 28, Springer, New York.
Amari, S.-I. (1986). Differential geometrical theory of statistics - towards
new developments. This volume.
Amari, S.-I. and Kumon, M. (1983). Differential geometry of Edgeworth expansion
in curved exponential family. Ann. Inst. Statist. Math. 35, 1-24.
Barndorff-Nielsen, 0. E. (1978a). Information and Exponential Families.
Wiley, Chichester.
Barndorff-Nielsen, 0. E. (1978b). Hyperbolic distributions and distributions on
hyperbolae. Scand. J. Statist. 5_, 151-157.
Barndorff-Nielsen, 0. E. (1980). Conditionality resolutions. Biometrika 67,
293-310.
Barndorff-Nielsen, 0. E. (1982). Contribution to the discussion of R. J.
Buehler: Some ancillary statistics and their properties. J. Amer.
Statist. Assoc. 77, 590-591.
Barndorff-Nielsen, 0. E. (1983). On a formula for the distribution of the maxi-
mum likelihood estimator. Biometrika 70, 343-365.
Barndorff-Nielsen, 0. E. (1984). On conditionality resolution and the likeli-
hood ratio for curved exponential families. Scand. J. Statist. 11, 157-

159
160 0. E. Barndorff-Nielsen

170. Amendment Scand. J. Statist. 12. (1985).


Barndorff-Nielsen, 0. E. (1985). Confidence limits from c|j|^C in the single-
parameter case. Scand. J. Statist. ^ 2 , 83-87.
Barndorff-Nielsen, 0. E. (1986a). Likelihood and observed geometries. Ann.
Statist. 14, 856-873.
Barndorff-Nielsen, 0. E. (1986b). Inference on full or partial parameters
based on the standardized signed log likelihood ratio. Biometrika 73,
307-322.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1983a). Exponential models with
affine dual foliations. Ann. Statist. 11, 753-769.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1983b). Reproductive exponential
families. Ann. Statist. 11, 770-732.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1984). Combination of reproductive
models. Research Report 107, Dept. Theor. Statist., Aarhus University.
Barndorff-Nielsen, 0. E., Blaesild, P., Jensen, J. L. and Jorgensen, B. (1982).
Exponential transformation models. Proc. R. Soc. A 379, 41-65.
Barndorff-Nielsen, 0. E. and Cox, D. R. (1984). Bartlett adjustments to the
likelihood ratio statistic and the distribution of the maximum likelihood
estimator. J. R. Statist. Soc. B 46, 483-495.
Barndorff-Nielsen, 0. E., Cox. D. R. and Reid, N. (1986). The role of differen-
tial geometry in statistical theory. Int. Statist. Review 54, 83-96.
Barut, A. 0. and Raczka, R. (1980). Theory of Group Representations and Appli-
cations. Polish Scientific Publishers, Warszawa.
Boothby, W. M. (1975). An Introduction to Differentiate Manifolds and
Riemannian Geometry. Academic Press, New York.
Burridge, J. (1981). A note on maximum likelihood estimation for regression
models using grouped data. J. R. Statist. Soc. B 43, 41-45.
Chentsov, N. N. (1972). Statistical Decision Rules and Optimal Inference.
(In Russian.) Moscow, Nauka. English translation (1982). Translation of
Mathematical Monographs Vol. 53. American Mathematical Society, Providence,
Rhode Island.
1 6 1
Differential and Integral Geometry in Statistical Inference

Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a


curved exponential family. Ann. Statist. Y\_, 793-803.
Eriksen, P. S. (1984a). (k,l) exponential transformation models. Scand. J.
Statist. 21, 129-145.
Eriksen, P. S. (1984b). A note on the structure theorem for exponential trans-
formation models. Research Report 101, Dept. Theor. Statist., Aarhus
University.
Eriksen, P. S. (1984c). Existence and uniqueness of the maximum likelihood
estimator in exponential transformation models. Research Report 103,
Dept. Theor. Statist., Aarhus University.
Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proc.
Roy. Soc. A 144, 285-307.
Hauck, W. W. and Donner, A. (1977). Wald's test as applied to hypotheses in
logit analysis. J. Amer. Statist. Ass. 72, 851-853. Corrigendum:
J. Amer. Statist. Ass. 75^ (1980), 482.
Jensen, J. L. (1981). On the hyperboloid distribution. Scand. J. Statist. 8,
193-206.
Kurita, M. (1959). On the volume in homogeneous spaces. Nagoya Math. J. 15,
201-217.
Lauritzen, S. L. (1986). Statistical manifolds. This volume.
Santalό, L. A. (1979). Integral Geometry and Geometric Probability. Encyclo-
pedia of Mathematics and Its Applications. Vol. 1, Addison-Wesley, London.
Shuster, J. J. (1968). A note on the inverse Gaussian distribution function.
J. Amer. Statist. Assoc. 63, 1514-1516.
Vaeth, M. (1985). On the use of Wald's test in exponential families. Int.
Statist. Review 53, 199-214.

You might also like