AMATH 423/523 Mathematical Analysis in Biology and Medicine Winter, 2023
2.3 Maximum entropy principle
2.3.1 Different entropy functions for different random variables
For a real scientific problem with a large state space $S$, obtaining the full $\nu$ from data is not really feasible; it is only a gedankenexperiment. More realistically, one measures the mean value of an observable that returns a value $g_k$ when the system is in state $k \in S$:
$$g: S \to \mathbb{R}, \quad \text{with values } g_1, g_2, \cdots, g_n.$$
Such a function is called a random variable in the mathematical theory of probability. The empirical mean value is related to the counting frequency:
$$\overline{g}(K) = \frac{k_1 g_1 + k_2 g_2 + \cdots + k_n g_n}{K} = \sum_{i=1}^{n} \nu_i g_i.$$
In the limit of $K \to \infty$, since all the $\nu_i \to p_i$, we have
$$\lim_{K\to\infty} \overline{g}(K) = \sum_{i=1}^{n} g_i p_i = \mathbb{E}[g]. \tag{8}$$
Again, just as we asked the statistical question "what is the probability of observing $\nu$", one can ask the statistical question "what is the probability of observing $\overline{g}(K)$":
$$P\Big\{\overline{g}(K) \in (x, x+dx]\Big\} = \;? \tag{9}$$
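Before turning to the large-deviation question in Eq. (9), the law-of-large-numbers statement in Eq. (8) is easy to see in a minimal simulation. The three-state distribution $p$ and observable $g$ below are illustrative assumptions, not from the text:

```python
import numpy as np

# A minimal sketch of Eq. (8): the empirical mean of an observable g
# converges to E[g] as the sample size K grows. The three-state
# distribution p and observable g are illustrative assumptions.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # state probabilities p_1, p_2, p_3
g = np.array([1.0, 2.0, 5.0])   # observable values g_1, g_2, g_3

def empirical_mean(K):
    """Draw K iid states and return (k1*g1 + ... + kn*gn)/K."""
    states = rng.choice(len(p), size=K, p=p)
    return g[states].mean()

exact = float(p @ g)             # E[g] = sum_i g_i p_i = 2.1
for K in (100, 10_000, 1_000_000):
    print(K, empirical_mean(K))  # approaches 2.1 as K grows
```

The fluctuations of the empirical mean around $\mathbb{E}[g]$ shrink like $K^{-1/2}$; Eq. (9) asks how fast the probability of a fixed deviation vanishes.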
2.3.2 Contraction principle
First, let us learn a very important mathematical fact. Consider two positive numbers $a, b > 0$, and let $N \to \infty$. Then it is obvious that
$$\lim_{N\to\infty} \Big( e^{-aN} + e^{-bN} \Big) = 0.$$
But more interestingly,
$$-\lim_{N\to\infty} \frac{1}{N} \ln\Big( e^{-aN} + e^{-bN} \Big) = \min\{a, b\}. \tag{10}$$
To prove this, without loss of generality let us assume that $a < b$, i.e., $(b - a) > 0$. We see that
$$\ln\Big( e^{-aN} + e^{-bN} \Big) = \ln e^{-aN} + \ln\Big( 1 + e^{-(b-a)N} \Big).$$
Therefore, in the limit of $N \to \infty$ the second term goes to zero, and we have
$$-\lim_{N\to\infty} \frac{1}{N} \ln\Big( e^{-aN} + e^{-bN} \Big) = a.$$
Prof. Hong Qian 12 Thursday 12th January, 2023, 14:01
We note that the limit in Eq. (10) is of the indeterminate form $\frac{\infty}{\infty}$. So by l'Hospital's rule we can also have
$$-\lim_{N\to\infty} \frac{\ln\big( e^{-aN} + e^{-bN} \big)}{N} = \lim_{N\to\infty} \frac{a e^{-aN} + b e^{-bN}}{e^{-aN} + e^{-bN}} = \min\{a, b\}.$$
The result in (10) is actually valid for any $a$ and $b$, positive or negative. Even more:
$$-\lim_{N\to\infty} \frac{1}{N} \ln \int_{x_0}^{x_1} e^{-a(x)N}\, dx = \inf_{x\in[x_0,x_1]} a(x).$$
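The fact in Eq. (10), including its validity for negative exponents, is easy to check numerically. A small sketch, using `numpy.logaddexp` to evaluate $\ln(e^u + e^v)$ without underflow:

```python
import numpy as np

# Numerical check of Eq. (10): -(1/N) ln(e^{-aN} + e^{-bN}) -> min{a, b}.
# np.logaddexp(u, v) computes ln(e^u + e^v) stably, even when e^u underflows.
def rate(a, b, N):
    return -np.logaddexp(-a * N, -b * N) / N

for N in (10, 100, 1000):
    print(N, rate(2.0, 3.0, N))      # approaches min{2, 3} = 2

# The result also holds when an exponent is negative:
print(rate(-1.0, 4.0, 1000))         # approaches min{-1, 4} = -1
```

The larger exponent's contribution decays like $e^{-(b-a)N}$ relative to the smaller one's, so the slower-decaying term alone sets the rate.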
We now turn to the answer to the question in Eq. (9). As $K \to \infty$, in the limit of "big data", we have
$$P\Big\{\overline{g}(K) \in (x, x+dx]\Big\} = \int_{\nu\cdot g = x} e^{-K H(\nu\|p)}\, d\nu = e^{-K\varphi(x)}, \tag{11a}$$
$$\text{where} \quad \varphi(x) = \inf_{\nu}\Big\{ H(\nu\|p) \;\Big|\; \nu\cdot g = x \Big\}. \tag{11b}$$
This mathematical result can be understood very intuitively. With a given value of $x$ being "observed" as the empirical mean value of the random variable $g = (g_1, g_2, \cdots, g_n)$:
$$\sum_{i=1}^{n} \nu_i g_i = x, \tag{12}$$
not every $\nu$ in the probability simplex is compatible with the value $x$. In fact, among all the $\nu$ that are compatible with Eq. (12), the $\nu$ with the smallest $H(\nu\|p)$ has the largest probability: this is the $\nu$ in the sample data set that "had produced" the observed $x$, since after a measurement there is no probability, only missing information. See Figure 1 for an illustration.
The relation between $\varphi(x)$ and $H(\nu\|p)$ is called the contraction principle in mathematics, and the maximum entropy principle in physics and engineering.
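A two-state sketch makes the contraction in Eq. (11b) concrete: with $n = 2$ and $g = (0, 1)$, the constraint $\nu\cdot g = x$ pins $\nu$ down uniquely, so $\varphi(x)$ is simply $H(\nu(x)\|p)$. The reference distribution $p$ below is an illustrative assumption:

```python
import numpy as np

# Two-state illustration of Eq. (11b): with g = (0, 1), the constraint
# nu . g = x forces nu = (1 - x, x), so phi(x) = H(nu(x) || p) directly.
p = np.array([0.6, 0.4])   # assumed reference distribution
g = np.array([0.0, 1.0])   # observable; nu . g equals nu_2

def relative_entropy(nu, p):
    """H(nu || p) = sum_i nu_i ln(nu_i / p_i), with the convention 0 ln 0 = 0."""
    nz = nu > 0
    return float(np.sum(nu[nz] * np.log(nu[nz] / p[nz])))

def phi(x):
    nu = np.array([1.0 - x, x])     # the unique feasible nu
    return relative_entropy(nu, p)

print(phi(0.4))   # 0.0: x = E[g] is the typical value, no "surprise"
print(phi(0.9))   # > 0: P{ mean near 0.9 } ~ e^{-K phi(0.9)}, exponentially small
```

For $n > 2$ the constraint leaves a whole face of the simplex feasible, and the infimum over that face is what the contraction principle computes.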
3 Constrained optimization and duality
3.1 Lagrange duality
3.1.1 Lagrange multiplier and saddle point
We shall use an example to illustrate this. Consider $f(x, y) = a_1 x^2 + a_2 y^2$, where $a_1, a_2 > 0$, under the constraint $g(x, y) = b_0 + b_1 x + b_2 y = 0$. Substituting $y = -(b_0 + b_1 x)/b_2$,
$$\inf_{x,y}\Big\{ f(x, y) \;\Big|\; g(x, y) = 0 \Big\} = \inf_{x}\left\{ a_1 x^2 + \frac{a_2 (b_0 + b_1 x)^2}{b_2^2} \right\} = a_1 x^{*2} + \frac{a_2 (b_0 + b_1 x^*)^2}{b_2^2} = \frac{a_1 a_2 b_0^2}{a_1 b_2^2 + a_2 b_1^2},$$
Figure 1: The space of all possible empirical frequencies $\nu$ is a probability simplex, shown in orange. When an empirical mean value $\nu\cdot g = x$ is observed, only the $\nu$ along a red line are possible. The blue circles are the contour lines of the entropy function $\Phi(\nu)$, with center $\nu = p$ and corresponding $\Phi = 0$. The tangent point between a red line and a blue circle gives the $\nu^*(x)$ that has the minimum $\Phi$ value along the red line. At $\nu^*(x)$, $\nabla\Phi$ is in the direction of $g$. This is the geometric interpretation of the method of Lagrange multipliers.
with optimal
$$x^* = -\frac{a_2 b_0 b_1}{a_1 b_2^2 + a_2 b_1^2}, \quad \text{and} \quad y^* = -\frac{a_1 b_0 b_2}{a_1 b_2^2 + a_2 b_1^2}.$$
By the method of Lagrange multipliers, the Lagrangian function is
$$L(x, y, z) = a_1 x^2 + a_2 y^2 - z\big( b_0 + b_1 x + b_2 y \big),$$
and its Hessian matrix of curvatures is
$$\begin{pmatrix} 2a_1 & 0 & b_1 \\ 0 & 2a_2 & b_2 \\ b_1 & b_2 & 0 \end{pmatrix},$$
with determinant $-2(a_1 b_2^2 + a_2 b_1^2) < 0$ for any $b$'s that are not both zero.
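The closed-form minimizer above can be verified numerically by a brute-force scan along the constraint line; the coefficient values below are illustrative:

```python
import numpy as np

# Verify the constrained minimum of f = a1 x^2 + a2 y^2 subject to
# b0 + b1 x + b2 y = 0 by scanning along the constraint line.
# The coefficient values are illustrative assumptions.
a1, a2, b0, b1, b2 = 1.0, 2.0, 3.0, 1.0, 1.0

D = a1 * b2**2 + a2 * b1**2
x_star = -a2 * b0 * b1 / D          # closed-form optimal x*
y_star = -a1 * b0 * b2 / D          # closed-form optimal y*
f_star = a1 * a2 * b0**2 / D        # closed-form minimum value

x = np.linspace(-10.0, 10.0, 200_001)   # parametrize the line by x
y = -(b0 + b1 * x) / b2
f = a1 * x**2 + a2 * y**2

print(x_star, y_star, f_star)                        # -2.0 -1.0 6.0
print(abs(f.min() - f_star) < 1e-6)                  # grid scan agrees
print(abs(b0 + b1 * x_star + b2 * y_star) < 1e-12)   # constraint holds
```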
Therefore, according to the method of Lagrange multipliers, the constrained optimization problem in Eq. (11b) becomes
$$\begin{aligned}
\varphi(x) &= \inf_{\nu}\Big\{ H(\nu\|p) \;\Big|\; \nu\cdot g = x \Big\} \\
&= \sup_{y} \inf_{\nu}\Big\{ H(\nu\|p) - y\big( \nu\cdot g - x \big) \Big\} \\
&= \sup_{y}\Big\{ \inf_{\nu}\big\{ H(\nu\|p) - y\,\nu\cdot g \big\} + xy \Big\} \\
&= \sup_{y}\Big\{ xy - \psi(y) \Big\},
\end{aligned} \tag{13}$$
in which we have introduced a new function:
$$\psi(y) = -\inf_{\nu}\Big\{ H(\nu\|p) - y\,\nu\cdot g \Big\} = \sup_{\nu}\Big\{ y\,\nu\cdot g - H(\nu\|p) \Big\}. \tag{14}$$
Then, following the theory of Legendre-Fenchel duality, we have
$$\psi(y) = \sup_{x}\Big\{ xy - \varphi(x) \Big\}. \tag{15}$$
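The duality in Eqs. (13)-(15) can be checked numerically, reusing the two-state system ($p$ and $g$ are illustrative assumptions). For this example, $\psi(y)$ evaluates in closed form to $\ln \sum_j p_j e^{y g_j}$ (Eq. (14), via Eq. (17) below with $\mu = y g$), and a grid supremum over $y$ recovers $\varphi(x)$:

```python
import numpy as np

# Check phi(x) = sup_y { x y - psi(y) } (Eq. (13)) against the direct
# computation of phi for a two-state system. Here psi(y) = ln sum_j p_j e^{y g_j},
# the closed form of Eq. (14) (cf. Eq. (17) with mu = y g).
p = np.array([0.6, 0.4])   # assumed reference distribution
g = np.array([0.0, 1.0])   # observable; nu . g equals nu_2

ys = np.linspace(-30.0, 30.0, 60_001)
psi = np.log(p[0] * np.exp(ys * g[0]) + p[1] * np.exp(ys * g[1]))

def phi_dual(x):
    """sup over a grid of y of x*y - psi(y)."""
    return float(np.max(x * ys - psi))

def phi_direct(x):
    """H(nu(x) || p) for the unique feasible nu = (1 - x, x)."""
    nu = np.array([1.0 - x, x])
    nz = nu > 0
    return float(np.sum(nu[nz] * np.log(nu[nz] / p[nz])))

for x in (0.2, 0.4, 0.7):
    print(x, phi_dual(x), phi_direct(x))   # the two columns agree
```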
3.1.2 Legendre-Fenchel transform (LFT) and duality
The function $\psi(y)$ in (15) is called the Legendre-Fenchel transform (LFT) of $\varphi(x)$. Then the function $\varphi(x)$ in (13) is the LFT of $\psi(y)$. The pair $\varphi(x)$ and $\psi(y)$ is known as a Legendre-Fenchel dual pair. It naturally arises in a constrained optimization problem, leading to a low-dimensional structure, the green curves in Fig. 1, that is embedded in the higher-dimensional space of $\nu$. The independent variables $x$ and $y$ are called conjugate variables to each other.
3.1.3 Conjugate variables and Lagrange-Gibbs equation
Formally, the Lagrange function is defined as
$$L[\nu, y] = H(\nu\|p) - y\big( \nu\cdot g - x \big).$$
Carrying out the partial derivatives with respect to all of $\nu_1, \nu_2, \cdots, \nu_n$ can be expressed using the notation of differentiation $d\nu$:
$$d_\nu L[\nu, y] = d_\nu H(\nu\|p) - y\, g\cdot d\nu = 0. \tag{16}$$
Very interestingly, a relation called the Gibbs equation in physics has exactly this form. In the latter case, $H$ is called the Gibbs entropy, $y^{-1}$ is the temperature, and $g\cdot d\nu$ is the mechanical work. The meaning of the equation is related to the First Law of Thermodynamics, i.e., energy conservation.
3.2 Linear constraints and full LFT
If function ψ(y), the LFT of the entropy function for the empirical mean value g, φ(x), plays a
key role in the theory, one naturally asks what is the LFT of the entropy function H(ν∥p) for
the empirical counting frequency ν?
Let us compute the LFT of $H(\nu\|p)$:
$$\sup_{\nu}\Big\{ \nu\cdot\mu - H(\nu\|p) \Big\} = \sup_{\nu}\left\{ -\sum_{i=1}^{n} \nu_i \ln\left( \frac{\nu_i}{p_i e^{\mu_i}} \sum_{j=1}^{n} p_j e^{\mu_j} \right) \right\} + \ln\left( \sum_{j=1}^{n} p_j e^{\mu_j} \right) = \ln \sum_{j=1}^{n} p_j e^{\mu_j}, \tag{17}$$
with the optimal $\nu^*$, at which the first supremum in (17) vanishes:
$$\nu_i^* = \frac{p_i e^{\mu_i}}{\sum_{j=1}^{n} p_j e^{\mu_j}}. \tag{18}$$
Very interestingly again, a relation called the Boltzmann relation in physics has exactly this form. In the latter case, $-\mu_i$ is the mechanical energy of state $i$ in units of $k_B T$, where $T$ is the temperature and $k_B$ is a constant named after Boltzmann. The $p_i$ are assumed to be equal, which is known as the principle of equal a priori probability.
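Eqs. (17) and (18) can be checked directly: evaluating $\nu\cdot\mu - H(\nu\|p)$ at the Boltzmann-form $\nu^*$ gives exactly $\ln \sum_j p_j e^{\mu_j}$, and any other $\nu$ on the simplex gives less. The values of $p$ and $\mu$ below are illustrative assumptions:

```python
import numpy as np

# Check Eqs. (17)-(18): the LFT of H(nu||p) equals ln(sum_j p_j e^{mu_j}),
# attained at the Boltzmann distribution nu*_i = p_i e^{mu_i} / Z.
rng = np.random.default_rng(1)
p = np.array([0.2, 0.3, 0.5])     # assumed reference distribution
mu = np.array([1.0, -0.5, 0.3])   # illustrative conjugate variables mu_i

def H(nu, p):
    """Relative entropy H(nu || p), with the convention 0 ln 0 = 0."""
    nz = nu > 0
    return float(np.sum(nu[nz] * np.log(nu[nz] / p[nz])))

Z = float(np.sum(p * np.exp(mu)))
nu_star = p * np.exp(mu) / Z       # Eq. (18)

value_at_star = float(nu_star @ mu - H(nu_star, p))
print(value_at_star)               # equals ln Z, per Eq. (17)
print(np.log(Z))

# Any other nu in the simplex gives a smaller value:
for _ in range(5):
    nu = rng.dirichlet(np.ones(3))
    assert nu @ mu - H(nu, p) <= np.log(Z) + 1e-12
```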