67577 Intro. to Machine Learning                                  Fall semester, 2008/9
Lecture 3: Maximum Likelihood / Maximum Entropy Duality
Lecturer: Amnon Shashua                                           Scribe: Amnon Shashua
In the previous lecture we defined the principle of Maximum Likelihood (ML): suppose we have random variables X_1, ..., X_n that form a random sample from a discrete distribution whose joint probability distribution is P(x | θ), where x = (x_1, ..., x_n) is a vector in the sample and θ is a parameter from some parameter space Θ (which could be a discrete set of values, say class membership). When P(x | θ) is considered as a function of θ it is called the likelihood function. The ML principle is to select the value of θ that maximizes the likelihood function over the observations (training set) x_1, ..., x_m. If the observations are sampled i.i.d. (a common, not always valid, assumption), then the ML principle is to maximize:
θ* = argmax_θ ∏_{i=1}^m P(x_i | θ) = argmax_θ log ∏_{i=1}^m P(x_i | θ) = argmax_θ ∑_{i=1}^m log P(x_i | θ),
where, due to the product nature of the problem, it is more convenient to maximize the log-likelihood. We will take a closer look today at the ML principle by introducing a key element known as the relative entropy measure between distributions.
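As a small illustration of the principle (not from the lecture; the Bernoulli family and the data below are assumed for the example), here is a minimal Python sketch that maximizes the log-likelihood over a grid of candidate parameter values:

import numpy as np

# A minimal sketch (illustration only): ML estimation of a Bernoulli parameter
# theta by maximizing the log-likelihood over a grid of candidate values.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # assumed i.i.d. observations
thetas = np.linspace(0.01, 0.99, 99)            # candidate parameter values
log_lik = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])
theta_ml = thetas[np.argmax(log_lik)]
print(theta_ml)                                  # approximately 0.75, the sample mean

The maximizer lands at the sample mean, which is the closed-form ML estimate for a Bernoulli parameter.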
3.1 ML and Empirical Distribution
The ML principle states that the empirical distribution of an i.i.d. sequence of examples is the closest possible (in terms of relative entropy, which will be defined later) to the true distribution. To make this statement clear, let X be a set of symbols {a_1, ..., a_n} and let P(a | θ) be the probability (belonging to a parametric family with parameter θ) of drawing a symbol a ∈ X. Let x_1, ..., x_m be a sequence of symbols drawn i.i.d. according to P. The occurrence frequency f(a) measures the number of draws of the symbol a:

f(a) = |{i : x_i = a}|,
and let the empirical distribution be defined by
P̂(a) = f(a) / ∑_{a'∈X} f(a') = (1/‖f‖_1) f(a) = (1/m) f(a).
The joint probability P(x_1, ..., x_m | θ) is equal to the product ∏_i P(x_i | θ), which according to the definitions above is equal to:

P(x_1, ..., x_m | θ) = ∏_{i=1}^m P(x_i | θ) = ∏_{a∈X} P(a | θ)^{f(a)}.
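As a quick numerical check (illustration only; the alphabet, the model P(a | θ) and the i.i.d. sample below are assumed), the identity ∏_i P(x_i | θ) = ∏_{a∈X} P(a | θ)^{f(a)} can be verified directly:

import numpy as np
from collections import Counter

# A quick numerical check (illustration only) of the identity
#   prod_i P(x_i | theta) = prod_{a in X} P(a | theta)^f(a)
# for an assumed alphabet, model and i.i.d. sample.
alphabet = ['a1', 'a2', 'a3']
P = {'a1': 0.5, 'a2': 0.3, 'a3': 0.2}            # assumed model probabilities P(a | theta)
x = ['a1', 'a3', 'a1', 'a2', 'a1', 'a3']          # assumed sample

f = Counter(x)                                    # occurrence frequencies f(a)
p_hat = {a: f[a] / len(x) for a in alphabet}      # empirical distribution

lhs = np.prod([P[xi] for xi in x])
rhs = np.prod([P[a] ** f[a] for a in alphabet])
print(np.isclose(lhs, rhs))                       # True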
The ML principle is therefore equivalent to the optimization problem:
max_{P∈Q} ∏_{a∈X} P(a | θ)^{f(a)}          (3.1)

where Q = {q ∈ R^n : q ≥ 0, ∑_i q_i = 1} denotes the set of n-dimensional probability vectors
(the probability simplex). Let p_i stand for P(a_i | θ) and f_i stand for f(a_i). Since argmax_x z(x) = argmax_x ln z(x), and given that ln ∏_i p_i^{f_i} = ∑_i f_i ln p_i, the solution to this problem can be found by setting the partial derivative of the Lagrangian to zero:
P
L(p, , ) =
n
X
fi ln pi (
i=1
pi 1)
i pi ,
where is the Lagrange multiplier associated with the equality constraint i pi 1 = 0 and i 0
are the Lagrange multipliers associated with the inequality constraints pi 0. We also have the
complementary slackness condition that sets i = 0 if pi > 0.
After setting the partial derivative with respect to p_i to zero we get:

p_i = f_i / (λ + μ_i).
Assume for now that f_i > 0 for i = 1, ..., n. Then from complementary slackness we must have μ_i = 0 (because p_i > 0). We are therefore left with the result p_i = (1/λ) f_i. Following the constraint ∑_i p_i = 1 we obtain λ = ∑_i f_i. As a result we obtain P(a | θ) = P̂(a). In case f_i = 0 we can use the convention 0 ln 0 = 0 and from continuity arrive at p_i = 0.
We have arrived at the following theorem:
Theorem 1 The empirical distribution estimate P̂ is the unique Maximum Likelihood estimate of the probability model Q on the occurrence frequency f(·).
This seems like an obvious result, but it actually runs deep because the result holds for a very particular (and non-intuitive at first glance) distance measure between non-negative vectors. Let dist(f, p) be some distance measure between the two vectors. The result above states that:

P̂ = argmin_p dist(f, p)  s.t.  p ≥ 0, ∑_i p_i = 1,          (3.2)

for some (family?) of distance measures dist(·). It turns out that there is only one² such distance measure, known as the relative-entropy, which satisfies the ML result stated above.
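As a quick numerical sanity check of Theorem 1 (an illustration only, with assumed frequencies f), one can compare the log-likelihood ∑_i f_i ln p_i of the empirical distribution against random points on the simplex:

import numpy as np

# A sanity check of Theorem 1 (illustration only, with assumed frequencies f):
# the empirical distribution f / ||f||_1 maximizes the log-likelihood
# sum_i f_i * ln(p_i) over the probability simplex.
rng = np.random.default_rng(0)
f = np.array([3.0, 1.0, 2.0, 4.0])                # assumed occurrence frequencies
p_hat = f / f.sum()                                # empirical distribution

def log_lik(p):
    return np.sum(f * np.log(p))

candidates = rng.dirichlet(np.ones(len(f)), size=10000)   # random points on the simplex
print(all(log_lik(p_hat) >= log_lik(p) for p in candidates))   # True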
3.2 Relative Entropy
The relative-entropy (RE) measure D(x||y) between two non-negative vectors x, y ∈ R^n is defined as:

D(x||y) = ∑_{i=1}^n x_i ln(x_i / y_i) − ∑_i x_i + ∑_i y_i.
² Not exactly; the picture is a bit more complex. Csiszár's 1972 measures dist(p, f) = ∑_i f_i φ(p_i / f_i) will satisfy eqn. 3.2 provided that (φ')^{−1} is an exponential. However, dist(f, p) (the positions of the parameters are switched) will not do it, whereas the relative entropy will satisfy eqn. 3.2 regardless of the order of the parameters p, f.
In the definition we use the convention that 0 ln(0/0) = 0 and, based on continuity, that 0 ln(0/y) = 0 and x ln(x/0) = ∞. When x, y are also probability vectors, i.e., belong to Q, then D(x||y) = ∑_i x_i ln(x_i/y_i) is also known as the Kullback-Leibler divergence. The RE measure is not a distance metric, as it is not symmetric, D(x||y) ≠ D(y||x), and does not satisfy the triangle inequality. Nevertheless, it has several interesting properties which make it a fundamental measure in statistical inference.
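A minimal sketch of the RE measure as defined above (illustration only; the 0 ln 0 = 0 convention is handled explicitly and the example vectors are assumed):

import numpy as np

# A minimal sketch (illustration only) of the relative-entropy measure between
# two non-negative vectors, using the convention 0 ln(0/y) = 0.
def relative_entropy(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mask = x > 0                                   # terms with x_i = 0 contribute zero
    return np.sum(x[mask] * np.log(x[mask] / y[mask])) - x.sum() + y.sum()

x = np.array([0.5, 0.3, 0.2])
y = np.array([0.25, 0.25, 0.5])
print(relative_entropy(x, y), relative_entropy(y, x))   # non-negative and not symmetric
print(relative_entropy(x, x))                            # 0.0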
The relative entropy is always non-negative and is zero if and only if x = y. This comes about
from the log-sum inequality:
∑_i x_i ln(x_i / y_i) ≥ (∑_i x_i) ln(∑_i x_i / ∑_i y_i).
Thus,

D(x||y) ≥ (∑_i x_i) ln(∑_i x_i / ∑_i y_i) − ∑_i x_i + ∑_i y_i = x̄ ln(x̄/ȳ) − x̄ + ȳ,

where x̄ = ∑_i x_i and ȳ = ∑_i y_i. But a ln(a/b) ≥ a − b for a, b ≥ 0: indeed a ln(a/b) ≥ a − b iff ln(a/b) ≥ 1 − (b/a), which follows from the inequality ln(x + 1) > x/(x + 1) (which holds for x > −1 and x ≠ 0). Applying this with a = x̄ and b = ȳ gives D(x||y) ≥ 0. We can state the following theorem:
Theorem 2 Let f ≥ 0 be the occurrence frequency on a training sample. P̂ ∈ Q is a ML estimate iff

P̂ = argmin_p D(f||p)  s.t.  p ≥ 0, ∑_i p_i = 1.
Proof:

D(f||p) = −∑_i f_i ln p_i + ∑_i f_i ln f_i − ∑_i f_i + 1,

and

argmin_p D(f||p) = argmax_p ∑_i f_i ln p_i = argmax_p ln ∏_i p_i^{f_i}.
There are two (related) interesting points to make here. First, from the proof of Thm. 1 we observe that the non-negativity constraint p ≥ 0 need not be enforced: as long as f ≥ 0 (which holds by definition), the closest p to f under the constraint ∑_i p_i = 1 must come out non-negative. Second, the fact that the closest point p to f comes out as a scaling of f (which is by definition the empirical distribution P̂) arises because of the relative-entropy measure. For example, if we had used a least-squares distance measure ‖f − p‖² the result would not be a scaling of f. In other words, we are looking for a projection of the vector f onto the probability simplex, i.e., the intersection of the hyperplane x^⊤ 1 = 1 and the non-negative orthant x ≥ 0. Under relative-entropy the projection is simply a scaling of f (and this is why we do not need to enforce non-negativity). Under least-squares, a projection onto the hyperplane x^⊤ 1 = 1 could take us out of the non-negative orthant (see Fig. 3.1 for an illustration). So relative-entropy is special in that regard: it not only provides the ML estimate, but also simplifies the optimization process³ (something which will be more noticeable when we handle a latent class model next lecture).
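The contrast between the two projections can be seen numerically; the following sketch (illustration only, with an assumed vector f) computes the relative-entropy projection, which is a scaling of f, and the least-squares projection onto the hyperplane x^⊤ 1 = 1, which here leaves the non-negative orthant:

import numpy as np

# A small sketch (illustration only, with an assumed vector f) contrasting the two
# projections onto the hyperplane x^T 1 = 1:
#  - under relative-entropy the minimizer is the scaling f / ||f||_1,
#  - under least-squares the minimizer may leave the non-negative orthant.
f = np.array([0.05, 0.1, 2.5])

p_re = f / f.sum()                                 # relative-entropy projection (a scaling of f)

n = len(f)
p_ls = f - ((f.sum() - 1.0) / n) * np.ones(n)      # closed-form least-squares projection

print(p_re)                                        # inside the probability simplex
print(p_ls)                                        # [-0.5, -0.45, 1.95]: negative coordinates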
3.3 Maximum Entropy and Duality ML/MaxEnt
The relative-entropy measure is not symmetric, thus we expect different outcomes of the optimization min_x D(x||y) compared to min_y D(x||y).
³ The fact that non-negativity comes for free does not apply to all class (distribution) models. This point will be refined in the next lecture.
[Figure 3.1: Projection of a non-negative vector f onto the hyperplane ∑_i x_i − 1 = 0. Under relative-entropy the projection P̂ is a scaling of f (and thus lives in the probability simplex). Under least-squares the projection p₂ lives outside of the probability simplex, i.e., could have negative coordinates.]
The latter of the two, i.e., min_{P∈Q} D(P₀||P), where P₀ is some empirical evidence and Q is some model, provides the ML estimation. For example, in the next lecture we will consider Q to be the set of low-rank joint distributions (called the latent class model) and see how the ML solution (via relative-entropy minimization) can be found.
Let H(p) = −∑_i p_i ln p_i denote the entropy function. With regard to min_x D(x||y) we can state the following observation:
Claim 1  argmin_{p∈Q} D(p || (1/n)1) = argmax_{p∈Q} H(p).

Proof:

D(p || (1/n)1) = ∑_i p_i ln p_i + (∑_i p_i) ln(n) = ln(n) − H(p),

which follows from the condition ∑_i p_i = 1.
In other words, the closest distribution to uniform is achieved by maximizing the entropy. To make this interesting we need to add constraints. Consider a linear constraint on p such as ∑_i α_i p_i = β. To be concrete, consider a die with six faces thrown many times, and suppose we wish to estimate the probabilities p_1, ..., p_6 given only the average ∑_i i p_i. Say the average is 3.5, which is what one would expect from an unbiased die. Laplace's principle of insufficient reasoning calls for assuming uniformity unless there is additional information (a controversial assumption in some cases). In other words, if we have no information except that each p_i ≥ 0 and that ∑_i p_i = 1, we should choose the uniform distribution since we have no reason to choose any other distribution. Thus, employing Laplace's principle, we would say that if the average is 3.5 then the most likely distribution is the uniform one. What if β = 4.2? This kind of problem can be stated as an optimization problem:
max_p H(p)  s.t.  ∑_i p_i = 1,  ∑_i α_i p_i = β,
where α_i = i and β = 4.2. We now have two constraints, and with the aid of Lagrange multipliers (say μ for ∑_i p_i = 1 and λ for ∑_i α_i p_i = β) we can arrive at the result:

p_i = e^{μ−1} e^{λ α_i}.

Note that because of the exponential p_i ≥ 0, so again non-negativity comes for free⁴. Following the constraint ∑_i p_i = 1 we get e^{μ−1} = 1 / ∑_i e^{λ α_i}, from which we obtain:

p_i = (1/Z) e^{λ α_i},

where Z (a function of λ) is a normalization factor and λ needs to be set by using β (see later).
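Setting λ by using β can also be done numerically. The following sketch (illustration only) uses simple bisection, exploiting the fact that the mean of the resulting distribution is monotone increasing in λ, to solve the die example with β = 4.2:

import numpy as np

# A numerical sketch (illustration only) of the die example: find lambda such that
# p_i proportional to exp(lambda * i), i = 1..6, has mean beta = 4.2.  The mean is
# monotone increasing in lambda, so simple bisection suffices.
alpha = np.arange(1, 7)                            # alpha_i = i (the six faces)
beta = 4.2

def mean_of(lam):
    w = np.exp(lam * alpha)
    p = w / w.sum()
    return p @ alpha

lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if mean_of(mid) < beta else (lo, mid)

lam = 0.5 * (lo + hi)
p = np.exp(lam * alpha); p /= p.sum()
print(lam, p, p @ alpha)                           # the MaxEnt distribution with mean ~4.2

For β = 3.5 the same procedure returns λ ≈ 0, i.e., the uniform distribution, matching Laplace's principle.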
There is nothing special about the uniform distribution, thus we could be seeking a probability vector p as close as possible to some prior probability p₀ under the constraints above:

min_p D(p || p₀)  s.t.  ∑_i p_i = 1,  ∑_i α_i p_i = β,

with the result:

p_i = (1/Z) p_{0i} e^{λ α_i}.
We could also consider adding more linear constraints on p of the form ∑_i f_ij p_i = b_j, j = 1, ..., k. The result would be:

p_i = (1/Z) p_{0i} exp(∑_{j=1}^k λ_j f_ij).
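A small sketch (illustration only; the prior p₀, the features f_ij and the multipliers λ_j below are all assumed, not taken from the lecture) of evaluating a distribution of this form and its feature expectations E_p[f_j]:

import numpy as np

# A small sketch (illustration only; prior p0, features f_ij and multipliers lambda_j
# are all assumed) of evaluating a Gibbs distribution and its feature expectations.
n, k = 5, 2
p0 = np.ones(n) / n                                # prior (here: uniform)
F = np.array([[1, 0], [0, 1], [1, 1], [2, 0], [0, 2]], dtype=float)   # F[i, j] = f_ij
lam = np.array([0.3, -0.7])                        # assumed multipliers lambda_j

w = p0 * np.exp(F @ lam)                           # unnormalized weights p0_i * exp(sum_j lambda_j f_ij)
p = w / w.sum()                                    # dividing by Z, the normalization factor
print(p, F.T @ p)                                  # the distribution and E_p[f_j], j = 1..k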
Probability distributions of this exponential form are called Gibbs Distributions. In practical applications the linear constraints on p could arise from average information about the system, such as the temperature of a fluid (where the p_i are the probabilities of the particles moving at various velocities), rainfall data or general environmental data (where the p_i represent the probability of finding animal colonies at discrete locations in a 3D map). A constraint of the form ∑_i f_ij p_i = b_j states that the expectation E_p[f_j] should be equal to the empirical expectation b_j = E_P̂[f_j], where P̂ is either uniform or given as input. Let

P = {p ∈ R^n : p ≥ 0, ∑_i p_i = 1, E_p[f_j] = E_P̂[f_j], j = 1, ..., k},
and
Q = {q ∈ R^n : q is a Gibbs distribution}.
We could therefore consider looking for the ML solution for the parameters λ_1, ..., λ_k of the Gibbs distribution:

min_{q∈Q} D(P̂ || q),

where, if P̂ is uniform, then min D(P̂ || q) can be replaced by max ∑_i ln q_i (because D((1/n)1 || x) = −ln(n) − (1/n)∑_i ln x_i).
As it turns out, MaxEnt and ML are duals of each other and the intersection of the two sets P ∩ Q contains only a single point, which solves both problems.
Theorem 3 The following are equivalent:
MaxEnt: q* = argmin_{p∈P} D(p || p₀)
ML: q* = argmin_{q∈Q} D(P̂ || q)
q* ∈ P ∩ Q

⁴ Any measure of the class dist(p, p₀) = ∑_i p_{0i} φ(p_i / p_{0i}) minimized under linear constraints will satisfy the result p_i ≥ 0, provided that (φ')^{−1} is an exponential.
In practice, the duality theorem is used to recover the parameters of the Gibbs distribution using the ML route (the second line in the theorem above); the algorithm for doing so is known as the iterative scaling algorithm (which we will not get into).