67577 Intro. to Machine Learning                                  Fall semester, 2008/9
Lecture 3: Maximum Likelihood / Maximum Entropy Duality
Lecturer: Amnon Shashua                                           Scribe: Amnon Shashua
In the previous lecture we defined the principle of Maximum Likelihood (ML): suppose we have random variables X_1, ..., X_n that form a random sample from a discrete distribution whose joint probability distribution is P(x | θ), where x = (x_1, ..., x_n) is a vector in the sample and θ is a parameter from some parameter space Θ (which could be a discrete set of values, say class membership). When P(x | θ) is considered as a function of θ it is called the likelihood function. The ML principle is to select the value of θ that maximizes the likelihood function over the observations (training set) x_1, ..., x_m. If the observations are sampled i.i.d. (a common, not always valid, assumption), then the ML principle is to maximize:
θ* = argmax_θ ∏_{i=1}^m P(x_i | θ) = argmax_θ log ∏_{i=1}^m P(x_i | θ) = argmax_θ ∑_{i=1}^m log P(x_i | θ),
where, due to the product nature of the problem, it is more convenient to maximize the log-likelihood. We will take a closer look today at the ML principle by introducing a key element known as the relative entropy measure between distributions.
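As a small illustration of the principle (not from the lecture; the Bernoulli family and the data below are assumed for the example), here is a minimal Python sketch that maximizes the log-likelihood over a grid of candidate parameter values:

import numpy as np

# A minimal sketch (illustration only): ML estimation of a Bernoulli parameter
# theta by maximizing the log-likelihood over a grid of candidate values.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # assumed i.i.d. observations
thetas = np.linspace(0.01, 0.99, 99)            # candidate parameter values
log_lik = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])
theta_ml = thetas[np.argmax(log_lik)]
print(theta_ml)                                  # approximately 0.75, the sample mean

The maximizer lands at the sample mean, which is the closed-form ML estimate for a Bernoulli parameter.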
3.1 ML and Empirical Distribution
The ML principle states that the empirical distribution of an i.i.d. sequence of examples is the closest possible (in terms of relative entropy, which will be defined later) to the true distribution. To make this statement clear, let X be a set of symbols {a_1, ..., a_n} and let P(a | θ) be the probability (belonging to a parametric family with parameter θ) of drawing a symbol a ∈ X. Let x_1, ..., x_m be a sequence of symbols drawn i.i.d. according to P. The occurrence frequency f(a) measures the number of draws of the symbol a:

f(a) = |{i : x_i = a}|,
and let the empirical distribution be defined by
P̂(a) = f(a) / ∑_{a'∈X} f(a') = (1/‖f‖_1) f(a) = (1/m) f(a).
The joint probability P(x_1, ..., x_m | θ) is equal to the product ∏_i P(x_i | θ), which according to the definitions above is equal to:

P(x_1, ..., x_m | θ) = ∏_{i=1}^m P(x_i | θ) = ∏_{a∈X} P(a | θ)^{f(a)}.
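As a quick numerical check (illustration only; the alphabet, the model P(a | θ) and the i.i.d. sample below are assumed), the identity ∏_i P(x_i | θ) = ∏_{a∈X} P(a | θ)^{f(a)} can be verified directly:

import numpy as np
from collections import Counter

# A quick numerical check (illustration only) of the identity
#   prod_i P(x_i | theta) = prod_{a in X} P(a | theta)^f(a)
# for an assumed alphabet, model and i.i.d. sample.
alphabet = ['a1', 'a2', 'a3']
P = {'a1': 0.5, 'a2': 0.3, 'a3': 0.2}            # assumed model probabilities P(a | theta)
x = ['a1', 'a3', 'a1', 'a2', 'a1', 'a3']          # assumed sample

f = Counter(x)                                    # occurrence frequencies f(a)
p_hat = {a: f[a] / len(x) for a in alphabet}      # empirical distribution

lhs = np.prod([P[xi] for xi in x])
rhs = np.prod([P[a] ** f[a] for a in alphabet])
print(np.isclose(lhs, rhs))                       # True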
The ML principle is therefore equivalent to the optimization problem:
max_{P∈Q} ∏_{a∈X} P(a | θ)^{f(a)}          (3.1)

where Q = {q ∈ R^n : q ≥ 0, ∑_i q_i = 1} denotes the set of n-dimensional probability vectors
(the probability simplex). Let p_i stand for P(a_i | θ) and f_i stand for f(a_i). Since argmax_x z(x) = argmax_x ln z(x), and given that ln ∏_i p_i^{f_i} = ∑_i f_i ln p_i, the solution to this problem can be found by setting the partial derivative of the Lagrangian to zero:
P
L(p, , ) =
n
X
fi ln pi (
i=1
pi 1)
i pi ,
where is the Lagrange multiplier associated with the equality constraint i pi 1 = 0 and i 0
are the Lagrange multipliers associated with the inequality constraints pi 0. We also have the
complementary slackness condition that sets i = 0 if pi > 0.
After setting the partial derivative with respect to p_i to zero we get:

p_i = f_i / (λ + μ_i).
Assume for now that f_i > 0 for i = 1, ..., n. Then from complementary slackness we must have μ_i = 0 (because p_i > 0). We are therefore left with the result p_i = (1/λ) f_i. Following the constraint ∑_i p_i = 1 we obtain λ = ∑_i f_i. As a result we obtain P(a | θ) = P̂(a). In case f_i = 0 we can use the convention 0 ln 0 = 0 and from continuity arrive at p_i = 0.
We have arrived at the following theorem:
Theorem 1 The empirical distribution estimate P̂ is the unique Maximum Likelihood estimate of the probability model Q on the occurrence frequency f(·).
This seems like an obvious result, but it actually runs deep because the result holds for a very particular (and non-intuitive at first glance) distance measure between non-negative vectors. Let dist(f, p) be some distance measure between the two vectors. The result above states that:

P̂ = argmin_p dist(f, p)  s.t.  p ≥ 0, ∑_i p_i = 1,          (3.2)

for some (family?) of distance measures dist(·). It turns out that there is only one² such distance measure, known as the relative-entropy, which satisfies the ML result stated above.
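As a quick numerical sanity check of Theorem 1 (an illustration only, with assumed frequencies f), one can compare the log-likelihood ∑_i f_i ln p_i of the empirical distribution against random points on the simplex:

import numpy as np

# A sanity check of Theorem 1 (illustration only, with assumed frequencies f):
# the empirical distribution f / ||f||_1 maximizes the log-likelihood
# sum_i f_i * ln(p_i) over the probability simplex.
rng = np.random.default_rng(0)
f = np.array([3.0, 1.0, 2.0, 4.0])                # assumed occurrence frequencies
p_hat = f / f.sum()                                # empirical distribution

def log_lik(p):
    return np.sum(f * np.log(p))

candidates = rng.dirichlet(np.ones(len(f)), size=10000)   # random points on the simplex
print(all(log_lik(p_hat) >= log_lik(p) for p in candidates))   # True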
3.2 Relative Entropy
The relative-entropy (RE) measure D(x||y) between two non-negative vectors x, y ∈ R^n is defined as:

D(x||y) = ∑_{i=1}^n x_i ln(x_i / y_i) − ∑_i x_i + ∑_i y_i.
² Not exactly; the picture is a bit more complex. Csiszár's 1972 measures dist(p, f) = ∑_i f_i φ(p_i / f_i) will satisfy eqn. 3.2 provided that (φ')^{−1} is an exponential. However, dist(f, p) (the positions of the parameters are switched) will not do it, whereas the relative entropy will satisfy eqn. 3.2 regardless of the order of the parameters p, f.
In the definition we use the convention that 0 ln(0/0) = 0 and, based on continuity, that 0 ln(0/y) = 0 and x ln(x/0) = ∞. When x, y are also probability vectors, i.e., belong to Q, then D(x||y) = ∑_i x_i ln(x_i/y_i) is also known as the Kullback-Leibler divergence. The RE measure is not a distance metric, as it is not symmetric, D(x||y) ≠ D(y||x), and does not satisfy the triangle inequality. Nevertheless, it has several interesting properties which make it a fundamental measure in statistical inference.
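A minimal sketch of the RE measure as defined above (illustration only; the 0 ln 0 = 0 convention is handled explicitly and the example vectors are assumed):

import numpy as np

# A minimal sketch (illustration only) of the relative-entropy measure between
# two non-negative vectors, using the convention 0 ln(0/y) = 0.
def relative_entropy(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mask = x > 0                                   # terms with x_i = 0 contribute zero
    return np.sum(x[mask] * np.log(x[mask] / y[mask])) - x.sum() + y.sum()

x = np.array([0.5, 0.3, 0.2])
y = np.array([0.25, 0.25, 0.5])
print(relative_entropy(x, y), relative_entropy(y, x))   # non-negative and not symmetric
print(relative_entropy(x, x))                            # 0.0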
The relative entropy is always non-negative and is zero if and only if x = y. This comes about
from the log-sum inequality:
∑_i x_i ln(x_i / y_i) ≥ (∑_i x_i) ln(∑_i x_i / ∑_i y_i).
Thus,

D(x||y) ≥ (∑_i x_i) ln(∑_i x_i / ∑_i y_i) − ∑_i x_i + ∑_i y_i = x̄ ln(x̄/ȳ) − x̄ + ȳ,

where x̄ = ∑_i x_i and ȳ = ∑_i y_i. But a ln(a/b) ≥ a − b for a, b ≥ 0: indeed a ln(a/b) ≥ a − b iff ln(a/b) ≥ 1 − (b/a), which follows from the inequality ln(x + 1) > x/(x + 1) (which holds for x > −1 and x ≠ 0). Applying this with a = x̄ and b = ȳ gives D(x||y) ≥ 0. We can state the following theorem:
Theorem 2 Let f ≥ 0 be the occurrence frequency on a training sample. P̂ ∈ Q is a ML estimate iff

P̂ = argmin_p D(f||p)  s.t.  p ≥ 0, ∑_i p_i = 1.
Proof:

D(f||p) = −∑_i f_i ln p_i + ∑_i f_i ln f_i − ∑_i f_i + 1,

and

argmin_p D(f||p) = argmax_p ∑_i f_i ln p_i = argmax_p ln ∏_i p_i^{f_i}.
There are two (related) interesting points to make here. First, from the proof of Thm. 1 we observe that the non-negativity constraint p ≥ 0 need not be enforced: as long as f ≥ 0 (which holds by definition), the closest p to f under the constraint ∑_i p_i = 1 must come out non-negative. Second, the fact that the closest point p to f comes out as a scaling of f (which is by definition the empirical distribution P̂) arises because of the relative-entropy measure. For example, if we had used a least-squares distance measure ‖f − p‖² the result would not be a scaling of f. In other words, we are looking for a projection of the vector f onto the probability simplex, i.e., the intersection of the hyperplane x^⊤ 1 = 1 and the non-negative orthant x ≥ 0. Under relative-entropy the projection is simply a scaling of f (and this is why we do not need to enforce non-negativity). Under least-squares, a projection onto the hyperplane x^⊤ 1 = 1 could take us out of the non-negative orthant (see Fig. 3.1 for an illustration). So relative-entropy is special in that regard: it not only provides the ML estimate, but also simplifies the optimization process³ (something which will be more noticeable when we handle a latent class model next lecture).
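The contrast between the two projections can be seen numerically; the following sketch (illustration only, with an assumed vector f) computes the relative-entropy projection, which is a scaling of f, and the least-squares projection onto the hyperplane x^⊤ 1 = 1, which here leaves the non-negative orthant:

import numpy as np

# A small sketch (illustration only, with an assumed vector f) contrasting the two
# projections onto the hyperplane x^T 1 = 1:
#  - under relative-entropy the minimizer is the scaling f / ||f||_1,
#  - under least-squares the minimizer may leave the non-negative orthant.
f = np.array([0.05, 0.1, 2.5])

p_re = f / f.sum()                                 # relative-entropy projection (a scaling of f)

n = len(f)
p_ls = f - ((f.sum() - 1.0) / n) * np.ones(n)      # closed-form least-squares projection

print(p_re)                                        # inside the probability simplex
print(p_ls)                                        # [-0.5, -0.45, 1.95]: negative coordinates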
3.3 Maximum Entropy and Duality ML/MaxEnt
The relative-entropy measure is not symmetric, thus we expect different outcomes of the optimization min_x D(x||y) compared to min_y D(x||y).
³ The fact that non-negativity comes for free does not apply to all class (distribution) models. This point will be refined in the next lecture.
[Figure 3.1: Projection of a non-negative vector f onto the hyperplane ∑_i x_i − 1 = 0. Under relative-entropy the projection P̂ is a scaling of f (and thus lives in the probability simplex). Under least-squares the projection p₂ lives outside of the probability simplex, i.e., could have negative coordinates.]
The latter of the two, i.e., min_{P∈Q} D(P₀||P), where P₀ is some empirical evidence and Q is some model, provides the ML estimation. For example, in the next lecture we will consider Q to be the set of low-rank joint distributions (called the latent class model) and see how the ML solution (via relative-entropy minimization) can be found.
Let H(p) = −∑_i p_i ln p_i denote the entropy function. With regard to min_x D(x||y) we can state the following observation:
Claim 1  argmin_{p∈Q} D(p || (1/n)1) = argmax_{p∈Q} H(p).

Proof:

D(p || (1/n)1) = ∑_i p_i ln p_i + (∑_i p_i) ln(n) = ln(n) − H(p),

which follows from the condition ∑_i p_i = 1.
In other words, the closest distribution to uniform is achieved by maximizing the entropy. To make this interesting we need to add constraints. Consider a linear constraint on p such as ∑_i α_i p_i = β. To be concrete, consider a die with six faces thrown many times, and suppose we wish to estimate the probabilities p_1, ..., p_6 given only the average ∑_i i p_i. Say the average is 3.5, which is what one would expect from an unbiased die. Laplace's principle of insufficient reasoning calls for assuming uniformity unless there is additional information (a controversial assumption in some cases). In other words, if we have no information except that each p_i ≥ 0 and that ∑_i p_i = 1, we should choose the uniform distribution since we have no reason to choose any other distribution. Thus, employing Laplace's principle, we would say that if the average is 3.5 then the most likely distribution is the uniform one. What if β = 4.2? This kind of problem can be stated as an optimization problem:
max_p H(p)  s.t.  ∑_i p_i = 1,  ∑_i α_i p_i = β,
where α_i = i and β = 4.2. We now have two constraints, and with the aid of Lagrange multipliers (say μ for ∑_i p_i = 1 and λ for ∑_i α_i p_i = β) we can arrive at the result:

p_i = e^{μ−1} e^{λ α_i}.

Note that because of the exponential p_i ≥ 0, so again non-negativity comes for free⁴. Following the constraint ∑_i p_i = 1 we get e^{μ−1} = 1 / ∑_i e^{λ α_i}, from which we obtain:

p_i = (1/Z) e^{λ α_i},

where Z (a function of λ) is a normalization factor and λ needs to be set by using β (see later).
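Setting λ by using β can also be done numerically. The following sketch (illustration only) uses simple bisection, exploiting the fact that the mean of the resulting distribution is monotone increasing in λ, to solve the die example with β = 4.2:

import numpy as np

# A numerical sketch (illustration only) of the die example: find lambda such that
# p_i proportional to exp(lambda * i), i = 1..6, has mean beta = 4.2.  The mean is
# monotone increasing in lambda, so simple bisection suffices.
alpha = np.arange(1, 7)                            # alpha_i = i (the six faces)
beta = 4.2

def mean_of(lam):
    w = np.exp(lam * alpha)
    p = w / w.sum()
    return p @ alpha

lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if mean_of(mid) < beta else (lo, mid)

lam = 0.5 * (lo + hi)
p = np.exp(lam * alpha); p /= p.sum()
print(lam, p, p @ alpha)                           # the MaxEnt distribution with mean ~4.2

For β = 3.5 the same procedure returns λ ≈ 0, i.e., the uniform distribution, matching Laplace's principle.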
There is nothing special about the uniform distribution, thus we could be seeking a probability vector p as close as possible to some prior probability p₀ under the constraints above:

min_p D(p || p₀)  s.t.  ∑_i p_i = 1,  ∑_i α_i p_i = β,

with the result:

p_i = (1/Z) p_{0i} e^{λ α_i}.
We could also consider adding more linear constraints on p of the form ∑_i f_ij p_i = b_j, j = 1, ..., k. The result would be:

p_i = (1/Z) p_{0i} exp(∑_{j=1}^k λ_j f_ij).
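A small sketch (illustration only; the prior p₀, the features f_ij and the multipliers λ_j below are all assumed, not taken from the lecture) of evaluating a distribution of this form and its feature expectations E_p[f_j]:

import numpy as np

# A small sketch (illustration only; prior p0, features f_ij and multipliers lambda_j
# are all assumed) of evaluating a Gibbs distribution and its feature expectations.
n, k = 5, 2
p0 = np.ones(n) / n                                # prior (here: uniform)
F = np.array([[1, 0], [0, 1], [1, 1], [2, 0], [0, 2]], dtype=float)   # F[i, j] = f_ij
lam = np.array([0.3, -0.7])                        # assumed multipliers lambda_j

w = p0 * np.exp(F @ lam)                           # unnormalized weights p0_i * exp(sum_j lambda_j f_ij)
p = w / w.sum()                                    # dividing by Z, the normalization factor
print(p, F.T @ p)                                  # the distribution and E_p[f_j], j = 1..k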
Probability distributions of this exponential form are called Gibbs Distributions. In practical applications the linear constraints on p could arise from average information about the system, such as the temperature of a fluid (where the p_i are the probabilities of the particles moving at various velocities), rainfall data or general environmental data (where the p_i represent the probability of finding animal colonies at discrete locations in a 3D map). A constraint of the form ∑_i f_ij p_i = b_j states that the expectation E_p[f_j] should be equal to the empirical expectation b_j = E_P̂[f_j], where P̂ is either uniform or given as input. Let

P = {p ∈ R^n : p ≥ 0, ∑_i p_i = 1, E_p[f_j] = E_P̂[f_j], j = 1, ..., k},
and
Q = {q ∈ R^n : q is a Gibbs distribution}.
We could therefore consider looking for the ML solution for the parameters λ_1, ..., λ_k of the Gibbs distribution:

min_{q∈Q} D(P̂ || q),

where, if P̂ is uniform, then min D(P̂ || q) can be replaced by max ∑_i ln q_i (because D((1/n)1 || x) = −ln(n) − (1/n)∑_i ln x_i).
As it turns out, MaxEnt and ML are duals of each other and the intersection of the two sets P ∩ Q contains only a single point, which solves both problems.
Theorem 3 The following are equivalent:
MaxEnt: q* = argmin_{p∈P} D(p || p₀)
ML: q* = argmin_{q∈Q} D(P̂ || q)
q* ∈ P ∩ Q

⁴ Any measure of the class dist(p, p₀) = ∑_i p_{0i} φ(p_i / p_{0i}) minimized under linear constraints will satisfy the result p_i ≥ 0, provided that (φ')^{−1} is an exponential.
In practice, the duality theorem is used to recover the parameters of the Gibbs distribution using the ML route (the second line in the theorem above); the algorithm for doing so is known as the iterative scaling algorithm (which we will not get into).