
Generalized Linear Models and Exponential Family

DeepEigen | Course: AI-1.0X: Machine Learning | Year: 2019 | Instructor: Sanjeev Sharma

1 Exponential Family
This section describes a family of probability distributions known as the exponential family of distributions. Many
commonly used probability distributions belong to the exponential family. A probability density
function (pdf) or probability mass function (pmf) p(y; η), for y ∈ R^l and η ∈ R^n, is in the exponential family if it
can be written in the form
p(y; η) = (1/Z(η)) λ(y) exp{η⊺S(y)}                         (1)
        = exp{−log(Z(η))} λ(y) exp{η⊺S(y)}
        = λ(y) exp{η⊺S(y) − A(η)},                          (2)

where

Z(η) = ∫_{y∈R^l} λ(y) exp{η⊺S(y)} dy                        (3)
A(η) = log(Z(η))                                            (4)

Here Z(η) is known as the partition function and A(η) is known as the log-partition function (also called the
cumulant function). The partition function Z(η) (equivalently, the factor 1/Z(η) = exp(−A(η))) plays the role of the
normalization constant that ensures p(y; η) integrates or sums to 1 over y. Here η is called the natural parameter
(also the canonical parameter); S(y) is the vector of sufficient statistics; and λ(y) is a scaling factor.
A fixed choice of S(y), A(η), and λ(y) defines a family of distributions that is parametrized by η. By varying
η we get different distributions within the family. The next sections show that the Bernoulli and the Gaussian
distributions are examples of exponential family distributions.
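As a small illustration of this parametrization, the following Python sketch (our own helper, not from any library) evaluates p(y; η) from user-supplied λ(y), S(y), and A(η):

    import numpy as np

    def exp_family_pdf(y, eta, lam, S, A):
        # p(y; eta) = lam(y) * exp(eta^T S(y) - A(eta))
        return lam(y) * np.exp(np.dot(eta, S(y)) - A(eta))

The Bernoulli and Gaussian examples below amount to particular choices of lam, S, and A.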

1.1 Bernoulli and Gaussian Distributions as Exponential Family


Earlier we covered linear regression and logistic regression (classification). In linear regression, we had
p(y|x; w) ∼ N(µ, σ²), with µ = w⊺ϕ(x). In classification (logistic regression), we considered p(y|x; w) ∼
Bernoulli(µ), with µ = 1/(1 + e^{−w⊺ϕ(x)}) = Sigmoid(w⊺ϕ(x)). In this section, we show that the Bernoulli
and the Gaussian distributions are examples of exponential family distributions.

1.1.1 Bernoulli
The Bernoulli distribution B for y ∈ {0, 1}, with mean µ, is B(y; µ) = µ^y (1 − µ)^{1−y}. Here p(y = 1; µ) = µ, and
p(y = 0; µ) = 1 − µ. By varying µ, we obtain Bernoulli distributions with different means. We now show that the
Bernoulli distribution can be written as an exponential family distribution. We have

p(y; µ) = µ^y (1 − µ)^{1−y}
        = exp{y log(µ) + (1 − y) log(1 − µ)}
        = exp{ y log(µ/(1 − µ)) + log(1 − µ) }              (5)
        = λ(y) exp{η⊺S(y) − A(η)}.                          (6)

By comparing eq.(5) and eq.(6) we obtain:

λ(y) = 1
η = log(µ/(1 − µ))   ⟹   µ = 1/(1 + e^{−η}) = Sigmoid(η)    (7)
S(y) = y
A(η) = −log(1 − µ) = log(1/(1 − µ)) = log(1 + e^η).

Thus, the Bernoulli distribution is a member of the exponential family.
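As a quick numerical check (a sketch using NumPy; the variable names are ours), the exponential family form with η = log(µ/(1 − µ)), S(y) = y, A(η) = log(1 + e^η), and λ(y) = 1 reproduces the Bernoulli pmf:

    import numpy as np

    mu = 0.3
    eta = np.log(mu / (1 - mu))          # natural parameter
    A = np.log(1 + np.exp(eta))          # log-partition function
    for y in (0, 1):
        p_exp_family = np.exp(eta * y - A)              # lambda(y) = 1
        p_bernoulli = mu**y * (1 - mu)**(1 - y)
        assert np.isclose(p_exp_family, p_bernoulli)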

1.1.2 Gaussian
The univariate Gaussian (Normal) distribution of y ∈ R, with mean µ and variance σ², i.e., y ∼ N(µ, σ²), is:

p(y; µ, σ²) = (1/(√(2π)σ)) exp{−(y − µ)²/(2σ²)}
            = (1/(√(2π)σ)) exp{−(1/(2σ²))y² + (µ/σ²)y − (1/(2σ²))µ²}
            = (1/(√(2π)σ)) exp{ [µ/σ², −1/(2σ²)] · [y, y²] − µ²/(2σ²) }                 (8)
            = ( 1/(√(2π)σ exp(µ²/(2σ²))) ) exp{ [µ/σ², −1/(2σ²)] · [y, y²] },           (9)

where "·" denotes the inner product; the leading factor in (9) plays the role of 1/Z(η), and the remaining factor is λ(y) exp(η⊺S(y)) with λ(y) = 1,

which gives us that


η = [µ/σ², −1/(2σ²)]⊺                                        (10)
S(y) = [y, y²]⊺                                              (11)
λ(y) = 1                                                     (12)
Z(η) = √(2π) σ exp{µ²/(2σ²)}                                 (13)
A(η) = log(Z(η))
     = (1/2) log(2π) + log(σ) + µ²/(2σ²)
     = (1/2) log(2π) + log((−2η_2)^{−1/2}) − η_1²/(4η_2)     [using σ² = −1/(2η_2) and µ = −η_1/(2η_2)]
     = (1/2) log(2π) − (1/2) log(−2η_2) − η_1²/(4η_2)        (14)

Recall that in linear regression, the variance σ² played no role in the maximum likelihood estimate of
the learnable parameters (weight vector) w, and it had no effect on the mapping function F_w(ϕ(x)). Thus, we
can ignore σ², or assume σ² = 1, when representing the univariate Gaussian as an exponential family distribution.
We will therefore now represent the univariate Gaussian as an exponential family with σ² = 1. With σ² = 1, we have
p(y; η) = (1/√(2π)) exp{−(1/2)(y − µ)²}
        = (1/√(2π)) exp(−y²/2) · exp{µy − µ²/2},             (15)

where the first factor is λ(y) and the second is exp{η⊺S(y) − A(η)},

which gives us
λ(y) = (1/√(2π)) exp(−y²/2)                                  (16)
η = µ                                                        (17)
S(y) = y                                                     (18)
A(η) = µ²/2 = η²/2   ⟹   Z(η) = exp(η²/2)                   (19)
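A similar sketch (assuming σ² = 1; the variable names are ours) checks that λ(y) exp{ηS(y) − A(η)} matches the N(µ, 1) density:

    import numpy as np

    mu, y = 1.5, 0.7
    eta, A = mu, mu**2 / 2                        # eq. (17), (19)
    lam = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)  # eq. (16)
    p_exp_family = lam * np.exp(eta * y - A)
    p_reference = np.exp(-(y - mu)**2 / 2) / np.sqrt(2 * np.pi)   # N(mu, 1) density
    assert np.isclose(p_exp_family, p_reference)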

1.2 Log Partition Function


Note that in Section 1 we mentioned that the log-partition function A(η) is also known as the cumulant function.
This is because the log-partition function can be used to generate the cumulants of the sufficient statistics,
thereby justifying the name cumulant function.
Here we briefly discuss what moments (of a random variable with some distribution) and cumulants are. For a
1-dimensional random variable x with some distribution p(x), the k-th moment, represented as m_k, is E_{x∼p}[x^k].
Cumulants are certain non-linear combinations of moments, and they arise naturally when analyzing sums of
independent random variables. Here we ignore the discussion of cumulant generating functions and only
discuss the first 4 cumulants of a random variable x. We represent the k-th cumulant as c_k, for a 1-dimensional
random variable x with m_k as its k-th moment.

c_1 = m_1 = E[x]                                         (mean)
c_2 = m_2 − m_1² = E[x²] − E[x]²                         (variance)
c_3 = m_3 − 3m_1m_2 + 2m_1³                              (skewness)
c_4 = m_4 − 3m_2² − 4m_1m_3 + 12m_1²m_2 − 6m_1⁴          (kurtosis)
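For instance, the sketch below estimates the first four cumulants from sample moments for an arbitrary 1-dimensional distribution (a Monte Carlo illustration; nothing here is specific to the exponential family):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=200_000)     # any 1-d distribution works
    m1, m2, m3, m4 = (np.mean(x**k) for k in (1, 2, 3, 4))
    c1 = m1                                          # mean
    c2 = m2 - m1**2                                  # variance
    c3 = m3 - 3*m1*m2 + 2*m1**3
    c4 = m4 - 3*m2**2 - 4*m1*m3 + 12*m1**2*m2 - 6*m1**4
    print(c1, c2, c3, c4)    # for Exp(scale=2): roughly 2, 4, 16, 96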

In these notes, we refrain from discussing any additional properties of cumulants; interested readers
can find more details in a good statistics textbook. We will now see how the derivatives of A(η) provide the
cumulants. We will only compute the first two cumulants. We have A(η) = log(Z(η)), which gives A(η) =
log{∫ λ(y) exp[η⊺S(y)] dy}. We will assume that η is a scalar (this was the case for the Bernoulli distribution
(Section 1.1.1) and the univariate Gaussian (Section 1.1.2) with σ² = 1). Thus, A(η) = log ∫ λ(y) exp(ηS(y)) dy. The first
derivative of A(η) is:
dA(η)/dη = d/dη [ log ∫ λ(y) exp(ηS(y)) dy ]
         = ( 1 / ∫ λ(y) exp(ηS(y)) dy ) · d/dη ∫ λ(y) exp(ηS(y)) dy
         = ∫ S(y) λ(y) exp(ηS(y)) dy / exp(A(η))             (20)

In eq. (20) we just used the definition of A(η), i.e., ∫ λ(y) exp(ηS(y)) dy = Z(η) = exp(A(η)). Proceeding further, we get:

dA(η)/dη = ∫ S(y) λ(y) exp(ηS(y)) dy / exp(A(η))
         = ∫ S(y) λ(y) exp(ηS(y) − A(η)) dy
         = ∫ S(y) [ λ(y) exp(ηS(y) − A(η)) ] dy      (the bracketed factor is p(y; η))
         = ∫ S(y) p(y; η) dy                          (21)
         = E[S(y)]                                    (22)

where we have used the definition of the expectation of a function F of a random variable x ∼ p(x), which is given
by E_{x∼p(x)}[F(x)] = ∫ F(x)p(x) dx, in eq. (21) to obtain eq. (22). Thus, the first derivative of the log-partition function
A(η) gives us the mean (expected value) of the sufficient statistic. We now compute the second derivative,
making use of the first derivative computed above:

d²A(η)/dη² = d/dη ∫ S(y) λ(y) exp(ηS(y) − A(η)) dy
           = ∫ S(y) λ(y) exp[ηS(y) − A(η)] (S(y) − A′(η)) dy
           = ∫ S(y)² λ(y) exp[ηS(y) − A(η)] dy − A′(η) ∫ S(y) λ(y) exp[ηS(y) − A(η)] dy
           = ∫ S(y)² p(y; η) dy − A′(η) ∫ S(y) p(y; η) dy
           = E[S(y)²] − E[S(y)] E[S(y)]               (using A′(η) = E[S(y)] from eq. (22))
           = E[S(y)²] − E[S(y)]²                      (23)
           = var[S(y)]                                (definition of variance)        (24)

Thus, the second derivative of the log-partition function is the variance of the sufficient statistic S(y). It can be
easily verified that, for the multivariate case, we have

∂²A(η)/(∂η_i ∂η_j) = E[S_i(y) S_j(y)] − E[S_i(y)] E[S_j(y)]       (25)

where S_k(y) denotes the k-th component of the vector of sufficient statistics S(y). We thus get:

∇²A(η) = cov[S(y)].                                   (26)

Note that earlier we discussed that a twice-differentiable function f(x) : R^n → R is convex iff ∇²f(x) ⪰ 0,
i.e., the Hessian is a positive semidefinite matrix. From eq. (26) we have ∇²A(η) = cov[S(y)], and covariance
matrices are positive (semi-)definite; thus the log-partition function is a convex function.
Next we work through the Bernoulli and Gaussian examples to show how the first derivative of the
log-partition function computes the mean.

1.2.1 Examples: Bernoulli and Gaussian
In the Bernoulli distribution, we have A(η) = log(1 + e^η). The first derivative gives us

dA(η)/dη = d/dη log(1 + e^η)
         = e^η / (1 + e^η)
         = 1 / (1 + e^{−η})                           (27)
         = µ                                          (28)

where eq. (28) follows from eq. (7) and eq. (27). Note that the sufficient statistic is S(y) = y, and µ is the expected value
of y, a Bernoulli random variable (E[y] = p(y = 1; µ) × 1 + p(y = 0; µ) × 0 = µ). Furthermore, the second derivative
of A(η) is

d²A(η)/dη² = d/dη ( 1/(1 + e^{−η}) ) = d/dη Sigmoid(η)
           = Sigmoid(η)(1 − Sigmoid(η))
           = µ(1 − µ)                                 (29)
           = var[S(y)]   (= var[y]; y ∼ B(µ))         (30)
In the case of the univariate Gaussian distribution with σ² = 1, we have A(η) = η²/2. Thus,

dA(η)/dη = η = µ                                      (31)
d²A(η)/dη² = 1                                        (32)

Now let us calculate the derivatives for the univariate Gaussian with variance σ². In this case, A(η) =
(1/2) log(2π) − (1/2) log(−2η_2) − η_1²/(4η_2). Note that the sufficient statistic S(y) is a vector, [y, y²]⊺.
Taking the derivative with respect to η_1 gives:

∂A(η)/∂η_1 = ∂/∂η_1 [ (1/2) log(2π) − (1/2) log(−2η_2) − η_1²/(4η_2) ]
           = −2η_1/(4η_2) = −(µ/σ²) / (2 · (−1/(2σ²)))
           = µ

which is the mean of the first component of the sufficient statistic, i.e., of y. Furthermore, the second derivative with
respect to η_1 is:

∂²A(η)/∂η_1² = ∂/∂η_1 [ −2η_1/(4η_2) ]                (33)
             = −1/(2η_2) = −1/(2 · (−1/(2σ²)))        (34)
             = σ²                                     (35)

which is the variance of the first component, y, of the sufficient statistic.
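These analytical results can also be checked numerically. The sketch below differentiates A(η) = log(1 + e^η) by finite differences for the Bernoulli case and compares against µ and µ(1 − µ) (a rough check, not a derivation):

    import numpy as np

    def A(eta):                                  # Bernoulli log-partition function
        return np.log(1 + np.exp(eta))

    eta, h = 0.4, 1e-5
    mu = 1 / (1 + np.exp(-eta))
    dA = (A(eta + h) - A(eta - h)) / (2 * h)              # first derivative
    d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # second derivative
    assert np.isclose(dA, mu, atol=1e-6)
    assert np.isclose(d2A, mu * (1 - mu), atol=1e-4)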

1.3 Sufficient Statistic
In this section we only briefly discuss what a sufficient statistic is. We do not cover the factorization theorem
extensively and entirely omit the discussion of minimal sufficient statistics. Interested readers can read more about
sufficient statistics and the factorization theorem in a good statistics textbook.

Let {y^1, ..., y^n}, y^i ∈ R^N, be generated from some distribution p(y; θ), i.e., {y^1, ..., y^n} ∼ p(y; θ), where θ is some
unknown parameter of the distribution, which is not necessarily a random variable (in the Bayesian setting, we
typically assume a distribution on the parameters, in which case θ ∼ p(θ)). Furthermore, θ might be a parameter
that we would like to learn. For example, if y is a Bernoulli random variable with mean µ, i.e., y ∼ B(y; µ),
with p(y = 1; µ) = µ, then given the sample {y^1, ..., y^n} we might wish to estimate µ.

What is a statistic?
A statistic is any real-valued function, T = r(y^1, ..., y^n), of the observations {y^1, ..., y^n}. T should not involve any
unknown parameters. For example:

T_1 = mean{y^1, ..., y^n}
T_2 = max{y^1, ..., y^n}
T_3 = 100
T_4 = y^1 + γ

Here the first three examples, T_1, T_2, T_3, are all valid statistics. However, if γ in T_4 is unknown, then T_4 is not a
statistic. The third example is a bit strange, as it ignores the random sample, but it is still a statistic.

1.3.1 Sufficiency
Informally, we say T is a sufficient statistic (for parameter θ) if knowing the value of T is as good as knowing the
entire random sample for estimating the value of θ. That is, the random sample contains no information about
the parameter θ beyond what is contained in T. We can put this definition more formally: T(y) is a sufficient
statistic for θ if the conditional distribution of y given T(y) does not depend on θ.

Putting it mathematically:

Definition 1 Suppose {y^1, ..., y^n} ∼ p(y; θ). T is a sufficient statistic for θ if the conditional distribution of
{y^1, ..., y^n} given T = t is independent of θ for each t:

p(y^1, ..., y^n | T = t; θ) = p(y^1, ..., y^n | T = t)   ⟹   T sufficient for θ.

To see this, we can consider an example of a Bernoulli random variable y ∈ {0, 1}, y ∼ B(µ).

Example 1 Let {y^1, ..., y^n} be IID Bernoulli trials, with p(y^i = 1; µ) = µ, i = 1, ..., n. Let T = Σ_{i=1}^n y^i = t. We
will see that T is a sufficient statistic for µ.

Proof: Using the definition of conditional probability, we have

p(y^1, ..., y^n | T = t) = p(y^1, ..., y^n, T = t) / p(T = t)
                         = µ^t (1 − µ)^{n−t} / [ C(n, t) µ^t (1 − µ)^{n−t} ]
                         = 1 / C(n, t),

where C(n, t) is the binomial coefficient (n choose t). The conditional distribution does not involve µ, which proves
that Σ_{i=1}^n y^i is a sufficient statistic for µ.
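The sketch below illustrates this empirically: for two different values of µ it estimates, by simulation, the probability of one particular sequence given T = t; both estimates are close to 1/C(n, t), independent of µ (a Monte Carlo illustration, not a proof):

    import numpy as np
    from math import comb

    rng = np.random.default_rng(0)
    n, t = 5, 2
    target = np.array([1, 1, 0, 0, 0])                 # one particular sequence with sum t

    for mu in (0.2, 0.7):
        samples = rng.binomial(1, mu, size=(200_000, n))
        cond = samples[samples.sum(axis=1) == t]       # condition on T = t
        p_seq = np.mean((cond == target).all(axis=1))
        print(mu, p_seq, 1 / comb(n, t))               # p_seq is about 0.1 for both values of mu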

It is difficult, however, to use Definition 1 directly in general to check whether a statistic is sufficient, or to
find a sufficient statistic. There is a theorem that enables us to find sufficient statistics.

Theorem 1 (Factorization Theorem): Let {x^1, ..., x^n} be a random sample with joint density p(x^1, ..., x^n; θ).
A statistic T(x) = r(x^1, ..., x^n) is sufficient if and only if the joint density can be factored as follows:

p(x^1, ..., x^n; θ) = f(x^1, ..., x^n) g(r(x^1, ..., x^n); θ)            (36)

where the functions f and g are non-negative. The function f may depend on x but does not depend on θ.
The function g depends on θ and depends on the observed values of the samples x^i, i = 1, ..., n, only through the value
of the sufficient statistic T(x).

Let us consider an example of the factorization theorem.

Example 2 Let x^1, ..., x^n ∼ N(µ, σ²). Let x^n denote the random sample set, i.e., x^n = {x^1, ..., x^n}. Let
x̂ = (1/n) Σ_i x^i. Further, assume that σ² is known. We have

p(x^n; µ, σ²) = (1/(2πσ²)^{n/2}) exp{ −Σ_i (x^i − µ)² / (2σ²) }
             = (1/(2πσ²)^{n/2}) exp{ −Σ_i (x^i − x̂ + x̂ − µ)² / (2σ²) }
             = (1/(2πσ²)^{n/2}) exp{ −[ Σ_i (x^i − x̂)² + n(x̂ − µ)² − 2(x̂ − µ) Σ_i (x^i − x̂) ] / (2σ²) }
             = (1/(2πσ²)^{n/2}) exp{ −[ Σ_i (x^i − x̂)² + n(x̂ − µ)² − 2(x̂ − µ)(nx̂ − nx̂) ] / (2σ²) }
             = (1/(2πσ²)^{n/2}) exp{ −[ Σ_i (x^i − x̂)² + n(x̂ − µ)² ] / (2σ²) }
             = [ (1/(2πσ²)^{n/2}) exp{ −Σ_i (x^i − x̂)² / (2σ²) } ] × exp{ −n(x̂ − µ)² / (2σ²) },

where the first factor is f(x^n) and the second is g(T(x^n); µ) with T(x^n) = x̂.

Thus, x̂ is a sufficient statistic for µ when σ² is known.

For the general case (σ² unknown), we have:

p(x^n; µ, σ²) = (1/(2πσ²)^{n/2}) exp{ −Σ_i (x^i − µ)² / (2σ²) }
             = (1/(2πσ²)^{n/2}) exp{ −Σ_i [ (x^i)² + µ² − 2µx^i ] / (2σ²) }
             = (1/(2πσ²)^{n/2}) exp{ −(1/(2σ²)) Σ_i (x^i)² + (µ/σ²) Σ_i x^i − (n/(2σ²)) µ² }      (37)

Thus, by the factorization theorem, T = ( Σ_i x^i, Σ_i (x^i)² ) is a sufficient statistic for θ = (µ, σ²).
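As a small illustration (a sketch; the variable names are ours), the MLE of (µ, σ²) for a Gaussian sample can be computed from the sufficient statistics (Σ_i x^i, Σ_i (x^i)²) alone, without revisiting the individual samples:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=3.0, scale=2.0, size=10_000)

    T1, T2, n = x.sum(), (x**2).sum(), len(x)     # sufficient statistics and sample size
    mu_mle = T1 / n
    var_mle = T2 / n - mu_mle**2                  # E[x^2] - E[x]^2
    print(mu_mle, var_mle)                        # close to 3.0 and 4.0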

We omit the detailed discussion of the factorization theorem and minimal sufficient statistics. Interested readers can
find more information in a good statistics textbook.

2 Generalized Linear Models
Consider the linear regression example presented in earlier notes. There, we assumed that we have a dataset
D = {x^i, y^i}, i = 1, ..., m, with x^i ∈ R^k, y^i ∈ R, and ϕ(x) ∈ R^N. The goal is then to learn a mapping function F_w(ϕ(x)) : R^N →
R using the data, which can be used to predict the target value ŷ given a new input data point x̂. Furthermore,
we assumed that the dataset has some noise ϵ^i, i.e., each of the data points in D follows the relation:
y^i = F_w(ϕ(x^i)) + ϵ^i.
Furthermore, we assumed that F_w(ϕ(x)) is linear in its input, which gave us a linear (in the features) predictor:
y^i = ϕ(x^i)⊺w + ϵ^i,
where each ϵ^i is IID. If we assume that ϵ^i ∼ N(0, σ²), then this results in least-squares regression, i.e., squared L₂ (L₂²) error
minimization over the dataset D with respect to the parameters w, with optimal (MLE) solution

w_mle = argmin_w ‖y − Φw‖₂²                            (38)

where Φ ∈ R^{m×N} is the feature matrix and y ∈ R^m is the vector of target values, as discussed in the earlier lectures.
The conditional distribution of y given x is thus a Normal distribution with mean F_w(ϕ(x)) and variance σ², i.e.,
y|x; w ∼ N(F_w(ϕ(x)), σ²). If we instead assume that the noise terms are IID and follow a Laplacian distribution with mean 0 and
scale (diversity) parameter b, i.e., ϵ^i ∼ L(0, b), then the maximum likelihood estimate results in minimization of the L₁ error on
the dataset, i.e., the following optimization problem:

w_mle = argmin_w ‖y − Φw‖₁                             (39)

We saw earlier, in the regularization discussion, that assuming a Laplacian prior on the parameter vector w results
in L₁-regularized regression. Following similar steps, computing the likelihood of the parameters and
maximizing the log-likelihood (or minimizing the negative log-likelihood) results in (39). As we saw in the
regularized regression example, there is no closed-form solution for (39), and the solution to the optimization
problem is obtained via sub-gradient descent. In the regression model, we assumed that the noise is additive,
Laplacian or Gaussian.
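For reference, the least-squares problem (38) can be solved in closed form; a minimal sketch (Φ and y are synthetic stand-ins here, not from any particular dataset):

    import numpy as np

    rng = np.random.default_rng(0)
    m, N = 100, 3
    Phi = rng.normal(size=(m, N))                        # feature matrix
    w_true = np.array([1.0, -2.0, 0.5])
    y = Phi @ w_true + rng.normal(scale=0.1, size=m)     # targets with additive Gaussian noise

    w_mle, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # argmin_w ||y - Phi w||_2^2
    print(w_mle)                                         # close to w_true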
In the classification example, using logistic regression, we considered y ∈ {0, 1} to be a Bernoulli random
variable with mean µ, i.e., y ∼ B(µ). The conditional distribution of y given x is a Bernoulli distribution, with
p(y = 1|x; µ) = µ. Generalized Linear Models (GLMs) are a statistical framework that unifies regression and
classification. Furthermore, GLMs allow us to easily consider other probability models, not just the Laplacian,
Gaussian, or Bernoulli.
We have already seen that the Gaussian and Bernoulli distributions can both be represented as exponential family distributions.
We will now see that both linear regression and logistic regression are special cases of GLMs. Furthermore, we
will use GLMs to derive a learning algorithm known as Softmax Regression.

2.1 Constructing GLMs


Note that, in both the classification and regression problems, the goal was to predict the conditional distribution
of y given x, where the distribution is parametrized by some learnable parameters w. Again, consider a problem
(classification or regression) where we would like to predict y given x. Also, recall that instead of learning
the mapping directly on x, we learn a mapping on ϕ(x). We will represent an exponential family distribution
with natural parameter η as ξ(η). Deriving a GLM for this problem requires the following three assumptions, or, more
accurately, design choices:

• p(y|x; w) ∼ ξ(η); i.e., the distribution of y given x, parametrized by w, follows (or can be represented as)
an exponential family distribution with natural parameter η, with conditional mean µ, E[y|x] = µ, where
η = Ψ⁻¹(µ) for some real-valued function Ψ(·).


• η = w⊺ϕ(x), where ϕ(x) is the non-linear feature mapping ϕ(x) : x ∈ R^k → R^N. This is a design
choice in Generalized Linear Models (from which they derive their name). If η is a vector, then η_i = W_i⊺ϕ(x).

• µ = Ψ(η) = Ψ(w⊺ϕ(x)), where Ψ is some real-valued function; given x, our goal is to predict the expected value of the
sufficient statistic, E[S(y)|x]. In the case of the Bernoulli and Gaussian (with known σ²) distributions, S(y) = y. Thus,
in the cases we consider, the goal is to predict E[y|x] = µ. We further assume that µ = F_w(ϕ(x)).
The goal, eventually, is to learn the mapping function F_w(ϕ(x)) that predicts E[y|x] = µ.

2.1.1 GLMs: Linear Regression with L₂² Loss


In linear regression with the sum-of-squared-errors, or squared L₂-norm (L₂²), loss function, we learned earlier that
we assume additive Gaussian noise in the dataset, so the conditional distribution of y given x is Gaussian.
To show that linear regression with L₂² error can be expressed as a GLM, we assume that y is continuous,
y ∈ R, and that p(y|x) is a Normal distribution N(µ, σ²) (µ may depend on x). Thus, we let
ξ(η) be the Gaussian distribution. Furthermore, we saw earlier that (with known σ²) S(y) = y, and that
for the Gaussian distribution η = µ. Thus, we have:

E[S(y)|x] = E[y|x]
          = µ = F_w(ϕ(x))
          = η
          = w⊺ϕ(x).                                    (40)

Thus, we get F_w(ϕ(x)) = w⊺ϕ(x) = µ, which was exactly the case in the linear regression example. This proves
that linear regression with L₂² loss is a special case of GLMs.

2.1.2 GLMs: Logistic Regression


We now show that logistic regression is a special case of GLMs. We assume that y ∈ {0, 1}. Thus, it is natural to
assume a Bernoulli distribution with some mean µ, i.e., y ∼ B(µ). We thus model y|x ∼ B(µ), i.e., the conditional
distribution of y given x is a Bernoulli distribution with some mean µ. Also, recall that if y|x; w ∼ B(µ), then E[y|x] = µ
(E[y|x; w] = 1 × p(y = 1|x; w) + 0 × p(y = 0|x; w) = µ). Furthermore, as we saw in Section 1.1.1, for the
Bernoulli distribution S(y) = y and µ = Sigmoid(η). Again, we assume that µ = F_w(ϕ(x)) = E[y|x; w]. Thus,
we have:

E[S(y)|x] = E[y|x]
          = µ = F_w(ϕ(x))
          = Sigmoid(η)
          = 1/(1 + e^{−η})
          = 1/(1 + e^{−w⊺ϕ(x)})                        (41)
          = Sigmoid(w⊺ϕ(x)).                           (42)

Thus, we get F_w(ϕ(x)) = Sigmoid(w⊺ϕ(x)), which was exactly the case in the logistic regression example. This
proves that logistic regression is a special case of GLMs.
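In code, the resulting GLM prediction for the Bernoulli case is just the sigmoid of the linear score. A minimal sketch (the feature map and the learned parameters here are hypothetical placeholders):

    import numpy as np

    def phi(x):                              # example feature map: bias plus raw inputs
        return np.concatenate(([1.0], x))

    def predict_proba(w, x):
        # GLM mean for the Bernoulli case: mu = Sigmoid(w^T phi(x)) = p(y = 1 | x; w)
        return 1.0 / (1.0 + np.exp(-(w @ phi(x))))

    w = np.array([0.2, -1.0, 0.7])           # hypothetical learned parameters
    print(predict_proba(w, np.array([0.5, 2.0])))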

2.2 GLMs: Softmax Regression


We will end the topic of exponential family distributions and Generalized Linear Models by deriving a learning
algorithm known as Softmax Regression. Softmax regression generalizes logistic regression to multi-class classification
tasks, i.e., when y is non-binary. Let us consider an example where y ∈ {1, 2, ..., q}. This is
useful when the input x can have more than two possible labels/classes.

In logistic regression, y ∈ {0, 1}, and the (stochastic) functional mapping F_w(ϕ(x)) gives us the conditional
probabilities p(y = 1|x; w) = µ and p(y = 0|x; w) = 1 − µ. We assumed that there are only two classes, 0 and 1.
Softmax regression generalizes this to q classes, y ∈ {1, ..., q}. To summarize softmax regression: we compute
the conditional probability of y given x using the Softmax function, also known as the normalized exponential
function. The conditional probabilities of y|x (parametrized by W ∈ R^{q×N}) are given by the functional mapping
F_W(ϕ(x)) = p(y|x; W):

W = [ (w^1)⊺ ; (w^2)⊺ ; ... ; (w^q)⊺ ]     (rows stacked)                    (43)

p(y = j|x; W) = exp(ϕ(x)⊺w^j) / Σ_i exp(ϕ(x)⊺w^i)                            (44)
              = exp(W_j ϕ(x)) / Σ_i exp(W_i ϕ(x))                            (45)

where (w^j)⊺ = W_j ∈ R^N is the j-th row of the matrix W. We can now use gradient descent or Newton's method
to maximize the likelihood of the parameters, to compute W⋆ = W_MLE. Furthermore, w^q = 0, as we will see
later. We will now construct a GLM to derive the formulation of the softmax regression learning algorithm.

2.2.1 Softmax Regression as GLM


y is discrete and can take multiple values, so we model it with a multinomial distribution. The multinomial
distribution over the q possible values of y can be parametrized using q parameters, µ_1, µ_2, ..., µ_q, where µ_i = p(y =
i|x; W). The µ_i, i = 1, ..., q, satisfy:

Σ_{j=1}^q µ_j = 1.                                                           (46)

Using constraint (46), the number of parameters can be reduced to q − 1, since µ_q = 1 − Σ_{j=1}^{q−1} µ_j. For notational
convenience in the derivation below, we will write µ_q = 1 − Σ_{j=1}^{q−1} µ_j, but note that µ_q is not a parameter;
it can be computed from µ_1, ..., µ_{q−1}. In the case of logistic regression, the sufficient statistic was S(y) = y. To express a
multinomial distribution as an exponential family distribution, we define S(y) ∈ R^{q−1} as follows:

S(1) = [1, 0, ..., 0]⊺;  S(2) = [0, 1, 0, ..., 0]⊺;  ...;  S(q − 1) = [0, ..., 0, 1]⊺;  S(q) = [0, 0, ..., 0]⊺.     (47)

We can represent S(y) more compactly using an indicator function. An indicator function I{E} is equal
to 1 if the event E is true, and equal to 0 if the event E is false. We can thus write the sufficient statistic S(y)
as:

S(y) = [ I{y = 1}, I{y = 2}, ..., I{y = q − 1} ]⊺.                           (48)

Our goal is now to compute the conditional expectation of the sufficient statistic S(y) given x. We can see that
the expected value of the sufficient statistic, which is a (q − 1)-dimensional vector, is given by the µ_i, i = 1, ..., q − 1.

We can show this mathematically, for the i-th component of the sufficient statistic, S(y)_i, as:

E[S(y)_i | x] = E[I{y = i} | x] = p(y = i|x) = µ_i                           (49)

Thus, our goal is to learn the stochastic functional mapping F_W(ϕ(x)), which computes the vector of conditional
probabilities µ_i = p(y = i|x; W), i = 1, ..., q. Note that we will only compute µ_i, i = 1, ..., q − 1, and then
µ_q = 1 − Σ_{i=1}^{q−1} µ_i. This mapping function will be the Softmax function, or normalized exponential function. We
will now express the multinomial distribution as an exponential family distribution ξ(η) with natural parameter
η ∈ R^{q−1}. Hereafter µ = [µ_1, ..., µ_q]⊺, where µ_q can be computed from µ_1, ..., µ_{q−1} as discussed earlier. We have

p(y; µ) = Π_{i=1}^q p(y = i; µ)^{I{y=i}}                                     (50)
        = Π_{i=1}^q µ_i^{I{y=i}}                                             (51)
        = ( Π_{i=1}^{q−1} µ_i^{I{y=i}} ) µ_q^{I{y=q}}                        (52)
        = ( Π_{i=1}^{q−1} µ_i^{S(y)_i} ) µ_q^{1 − Σ_{j=1}^{q−1} S(y)_j}      (53)
        = exp{ Σ_{i=1}^{q−1} S(y)_i log(µ_i) + (1 − Σ_{j=1}^{q−1} S(y)_j) log(µ_q) }     (54)
        = exp{ Σ_{i=1}^{q−1} S(y)_i log(µ_i/µ_q) + log(µ_q) }                (55)
        = λ(y) exp{η⊺S(y) − A(η)}                                            (56)

which gives us

η = [ log(µ_1/µ_q), log(µ_2/µ_q), ..., log(µ_{q−1}/µ_q) ]⊺                   (57)
A(η) = −log(µ_q)                                                             (58)
λ(y) = 1.                                                                    (59)

Thus, the multinomial distribution belongs to the exponential family with the above parameters. For
convenience, we can define η ∈ R^q with η_q = log(µ_q/µ_q) = 0. From the equations above, we can now obtain µ as:

η_i = log(µ_i/µ_q)   ⟹   µ_q e^{η_i} = µ_i.                                  (60)

Summing over i = 1, ..., q, we get:

Σ_{i=1}^q µ_q e^{η_i} = Σ_{i=1}^q µ_i = 1                                     (61)
⟹ µ_q = 1 / Σ_{i=1}^q e^{η_i}                                                (62)

Substituting the result (eq. 62) back into eq. 60 gives us:

µ_i = µ_q e^{η_i}   ⟹   µ_i = e^{η_i} / Σ_{j=1}^q e^{η_j}                    (63)

This functional mapping µ = Ψ(η), i.e., µ_i = e^{η_i} / Σ_{j=1}^q e^{η_j}, from η to µ is known as the Softmax function, as discussed
in the earlier section; see eq. 44 and eq. 45. To obtain the result in eq. 45, we use assumption 2 of GLMs,
discussed in Section 2.1, i.e., that η is a linear function of the input x, or equivalently of the feature representation
ϕ(x). Thus, we have η_i = ϕ(x)⊺w^i, i = 1, ..., q − 1, where the w^i (the parameters for the i-th class/label) are
the parameters of the model. We also define w^q = 0, a vector of 0s, which gives η_q = (w^q)⊺ϕ(x) = 0, as discussed
earlier.
We can summarize this as follows:

µ_i = e^{ϕ(x)⊺w^i} / Σ_{j=1}^q e^{ϕ(x)⊺w^j}                                  (64)
µ_q = 1 / Σ_{j=1}^q e^{ϕ(x)⊺w^j}                                             (65)

where w^i ∈ R^N, i = 1, ..., q − 1, are the model parameters, and w^q = 0 (not a model parameter, mentioned only for
convenience) is an N-dimensional vector with all components equal to 0. We can define a q × N matrix W (whose last
row is not a model parameter and is included only for convenience), which contains all these model parameters:

W = [ (w^1)⊺ ; (w^2)⊺ ; ... ; (w^{q−1})⊺ ; 0⊺ ] ∈ R^{q×N}                    (66)
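A minimal sketch of eqs. (64)-(66): a softmax over the linear scores, with the last row of W held at zero (the helper names are ours):

    import numpy as np

    def softmax_probs(W, phi_x):
        # W has shape (q, N); its last row is fixed to zero and is not a free parameter.
        scores = W @ phi_x                  # eta_i = phi(x)^T w^i, with eta_q = 0
        scores -= scores.max()              # numerical stability; the probabilities are unchanged
        e = np.exp(scores)
        return e / e.sum()                  # mu_i = e^{eta_i} / sum_j e^{eta_j}

    q, N = 4, 3
    W = np.zeros((q, N))
    W[:-1] = np.random.default_rng(0).normal(size=(q - 1, N))
    print(softmax_probs(W, np.array([1.0, 0.5, -0.2])))   # a probability vector summing to 1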

Furthermore, our functional mapping F_W(ϕ(x)) is given by:

F_W(ϕ(x)) = E[S(y)|x]
          = E[ [I{y = 1}, I{y = 2}, I{y = 3}, ..., I{y = q − 1}]⊺ | x ]                                            (67)
          = [ µ_1, µ_2, µ_3, ..., µ_{q−1} ]⊺                                                                        (68)
          = ( 1 / Σ_{j=1}^q e^{ϕ(x)⊺w^j} ) [ e^{ϕ(x)⊺w^1}, e^{ϕ(x)⊺w^2}, e^{ϕ(x)⊺w^3}, ..., e^{ϕ(x)⊺w^{q−1}} ]⊺      (69)

Thus, the functional mapping F_W(ϕ(x)) : R^N → [0, 1]^{q−1} gives us the class probabilities. It is thus a stochastic
function, which generates the conditional probabilities of y|x. Although it does not directly provide µ_q, µ_q can be
computed simply as µ_q = 1 − Σ_{j=1}^{q−1} µ_j. Furthermore, with our notation for W, whose last row is a vector of 0s, we
can assume that F_W(ϕ(x)) produces a q-dimensional output, F_W(ϕ(x)) : R^N → [0, 1]^q:

F_W(ϕ(x)) = ( 1 / Σ_{j=1}^q e^{ϕ(x)⊺w^j} ) [ e^{ϕ(x)⊺w^1}, e^{ϕ(x)⊺w^2}, e^{ϕ(x)⊺w^3}, ..., e^{ϕ(x)⊺w^{q−1}}, 1 ]⊺     (70)

We thus obtain, after deriving softmax regression using GLMs, the same result as in eq. 45. This
finishes our discussion of Generalized Linear Models.

3 Softmax Regression
Having presented the softmax regression learning algorithm, which can be used for multi-class classification tasks,
we will now see how to learn its parameters using the gradient descent algorithm. Assume that we have a
dataset D = {x^i, y^i}, i = 1, ..., m, with x ∈ R^k and y ∈ {1, 2, 3, ..., K}. Given the feature function ϕ(x) : R^k → R^N, the goal is
to learn F_W(ϕ(x)) : R^N → [0, 1]^K, where [0, 1]^K represents a K-dimensional vector with each component
belonging to the interval [0, 1]. The conditional probabilities are p(y = i|x; W) = µ_i = e^{W_i ϕ(x)} / Σ_{j=1}^K e^{W_j ϕ(x)}, where W_i is the i-th
row of the matrix W ∈ R^{K×N}.
We can compute the maximum likelihood estimate (MLE) of the parameters W by maximizing the likelihood
function L(W), as we did in logistic regression:

L(W) = Π_{i=1}^m p(y^i|x^i; W)                                                        (71)
LL(W) = log L(W) = Σ_{i=1}^m log p(y^i|x^i; W)                                        (72)
      = Σ_{i=1}^m log Π_{j=1}^K µ_{ij}^{I{y^i=j}}                                     (73)
      = Σ_{i=1}^m log Π_{j=1}^K ( e^{W_j ϕ(x^i)} / Σ_{s=1}^K e^{W_s ϕ(x^i)} )^{I{y^i=j}}     (74)

where W_l is a row vector (the l-th row of matrix W), which is (w^l)⊺, l = 1, ..., K. We can maximize the log-likelihood
function LL(W) to obtain the MLE estimate of the parameters W. This can be done using algorithms like
gradient descent or Newton's method.

3.1 Maximizing Log Likelihood: Gradient Descent


We have the log-likelihood function LL(W) given in eq. 74. To compute the update rule for the gradient descent algorithm,
we have to compute the gradients of the log-likelihood function, ∇_W LL(W), with respect to the parameters
w^j, j = 1, ..., K. We represent ϕ(x^i)⊺w^j as z_i^j to simplify the notation.

Thus we get:

LL(W) = Σ_{i=1}^m log Π_{j=1}^K ( e^{z_i^j} / Σ_{s=1}^K e^{z_i^s} )^{I{y^i=j}}        (75)

We can use the chain rule of differentiation to compute ∇_{w^j} LL(W) as:

∇_{w^j} LL(W) = ∂LL(W)/∂z^j × ∂z^j/∂w^j.                                              (76)

Furthermore,

∂z^j/∂w^j = ∂(ϕ(x)⊺w^j)/∂w^j = ϕ(x).                                                  (77)

Having all the basic mathematical expressions in place, we can now compute the gradient ∇_{w^j} LL(W). There
are two cases: (i) y^i = j and (ii) y^i ≠ j.

Case 1: y^i = j

∇_{w^j} LL(W) = Σ_{i=1}^m ∂/∂z_i^j [ log ( e^{z_i^j} / Σ_{s=1}^K e^{z_i^s} )^{I{y^i=j}} ] × ∂z_i^j/∂w^j                                  (78)
             = Σ_{i=1}^m ∂/∂z_i^j [ log ( e^{z_i^j} / Σ_{s=1}^K e^{z_i^s} )^{I{y^i=j}} ] × ϕ(x^i)                                         (79)
             = Σ_{i=1}^m ∂/∂z_i^j [ log ( e^{z_i^j} / Σ_{s=1}^K e^{z_i^s} ) ] × ϕ(x^i)                                                    (80)
             = Σ_{i=1}^m ( Σ_{s=1}^K e^{z_i^s} / e^{z_i^j} ) ∂/∂z_i^j ( e^{z_i^j} / Σ_{s=1}^K e^{z_i^s} ) × ϕ(x^i)                         (81)
             = Σ_{i=1}^m ( Σ_{s=1}^K e^{z_i^s} / e^{z_i^j} ) [ e^{z_i^j}/Σ_{s=1}^K e^{z_i^s} + e^{z_i^j} ∂/∂z_i^j ( 1/Σ_{s=1}^K e^{z_i^s} ) ] ϕ(x^i)     (82)
             = Σ_{i=1}^m ( Σ_{s=1}^K e^{z_i^s} / e^{z_i^j} ) [ e^{z_i^j}/Σ_{s=1}^K e^{z_i^s} − (e^{z_i^j})²/(Σ_{s=1}^K e^{z_i^s})² ] ϕ(x^i)              (83)
             = Σ_{i=1}^m ( 1 − e^{z_i^j} / Σ_{s=1}^K e^{z_i^s} ) ϕ(x^i)                                                                   (84)
             = Σ_{i=1}^m ( 1 − e^{ϕ(x^i)⊺w^j} / Σ_{s=1}^K e^{ϕ(x^i)⊺w^s} ) ϕ(x^i)                                                         (85)

Case 2: y^i = a ≠ j

∇_{w^j} LL(W) = Σ_{i=1}^m ( Σ_{s=1}^K e^{z_i^s} / e^{z_i^a} ) ∂/∂z_i^j ( e^{z_i^a} / Σ_{s=1}^K e^{z_i^s} ) × ϕ(x^i)         (86)
             = Σ_{i=1}^m ( Σ_{s=1}^K e^{z_i^s} / e^{z_i^a} ) e^{z_i^a} ∂/∂z_i^j ( 1 / Σ_{s=1}^K e^{z_i^s} ) × ϕ(x^i)         (87)
             = Σ_{i=1}^m ( Σ_{s=1}^K e^{z_i^s} / e^{z_i^a} ) [ − e^{z_i^a} e^{z_i^j} / (Σ_{s=1}^K e^{z_i^s})² ] ϕ(x^i)       (88)
             = Σ_{i=1}^m ( −e^{z_i^j} / Σ_{s=1}^K e^{z_i^s} ) ϕ(x^i)                                                         (89)
             = Σ_{i=1}^m ( −e^{ϕ(x^i)⊺w^j} / Σ_{s=1}^K e^{ϕ(x^i)⊺w^s} ) ϕ(x^i)                                               (90)

Having derived the results for both cases, we can now combine them to obtain the gradient ∇_{w^j} LL(W).
By combining eq. 85 and eq. 90, we get:

∇_{w^j} LL(W) = Σ_{i=1}^m ( 1 − e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^K e^{ϕ(x^i)⊺w^s} ) I{y^i = j} ϕ(x^i) + Σ_{i=1}^m ( −e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^K e^{ϕ(x^i)⊺w^s} ) I{y^i ≠ j} ϕ(x^i)     (91)

We can simplify it further as:

∇_{w^j} LL(W) = Σ_{i=1}^m [ ( 1 − e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^K e^{ϕ(x^i)⊺w^s} ) I{y^i = j} − ( e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^K e^{ϕ(x^i)⊺w^s} ) I{y^i ≠ j} ] ϕ(x^i)                    (92)
             = Σ_{i=1}^m [ I{y^i = j} − ( e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^K e^{ϕ(x^i)⊺w^s} ) ( I{y^i ≠ j} + I{y^i = j} ) ] ϕ(x^i)                                                     (93)
             = Σ_{i=1}^m [ I{y^i = j} − e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^K e^{ϕ(x^i)⊺w^s} ] ϕ(x^i)       (since I{y^i ≠ j} + I{y^i = j} = 1)                                           (94)
             = Σ_{i=1}^m [ δ_{ij} − e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^K e^{ϕ(x^i)⊺w^s} ] ϕ(x^i)                                                                                         (95)

where δ_{ij} = I{y^i = j}. Thus, we can compute the parameters using the batch gradient descent algorithm, which
uses all the training data at once to compute one update of the parameters. We can instead use mini-batch or
stochastic gradient descent, which updates the parameters after observing every training data point:

∇_{w^j} LL(W; D_i) = ( δ_{ij} − e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^K e^{ϕ(x^i)⊺w^s} ) ϕ(x^i)               (96)

where ∇_{w^j} LL(W; D_i) is the gradient with respect to w^j, j = 1, ..., K, for the i-th training data point in the dataset
D. Eq. 96 gives the stochastic gradient descent update rule for the parameters w^j.
Furthermore, as discussed earlier, w^K = 0. Thus, all the other weights can be initialized to random
values, and w^K can be initialized to 0; after every stochastic gradient descent step, w^K would otherwise be shifted to
some value ≠ 0. We can therefore, after every step of stochastic gradient descent, project w^K back to 0, or simply never update it.
This completes our discussion of batch gradient descent and stochastic gradient descent for softmax
regression.
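Putting eq. (96) together, the following sketch trains softmax regression by stochastic gradient ascent on the log-likelihood, keeping the last row of W at zero (the synthetic dataset, feature matrix, and learning rate are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    m, N, K = 500, 4, 3
    Phi = rng.normal(size=(m, N))              # phi(x^i) stacked row-wise
    y = rng.integers(0, K, size=m)             # labels in {0, ..., K-1} (0-indexed version of {1, ..., K})

    W = np.zeros((K, N))                       # last row stays 0 (not a free parameter)
    lr = 0.1
    for epoch in range(20):
        for i in rng.permutation(m):
            scores = W @ Phi[i]
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()               # mu_j = e^{phi^T w^j} / sum_s e^{phi^T w^s}
            delta = np.zeros(K)
            delta[y[i]] = 1.0                  # delta_ij = I{y^i = j}
            W[:-1] += lr * np.outer(delta - probs, Phi[i])[:-1]   # eq. (96), one SGD (ascent) step
            # the last row is never updated, so it remains the zero vector

On this synthetic data the fit is not meaningful; the same loop applies to real features and labels.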

