0% found this document useful (0 votes)

32 views15 pages

Generalized Linear Models and Exponential Family

Uploaded by

sabarimooc

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

32 views15 pages

Generalized Linear Models and Exponential Family

Uploaded by

sabarimooc

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

Generalized Linear Models and Exponential Family

DeepEigen∥ Course: AI-1.0X: Machine Learning | Year: 2019 | Instructor: Sanjeev Sharma

1 Exponential Family
This section describes a family of probability distributions known as the exponential family of distributions. Many
of the majorly used probability distribution functions belong to the exponential family. A probability density
function (pdf) or probability mass function (pmf) p(y; η), for y ∈ Rl and η ∈ Rn , is in the exponential family if it
can be written in the form
1
p(y; η) = λ(y) exp{η ⊺ S(y)}, (1)
Z(η)
= exp{− log(Z(η))}λ(y) exp {η ⊺ S(y)} ,
| {z }
A(η)
= λ(y) exp{η ⊺ S(y) − A(η)}, (2)

where
∫
Z(η) = λ(y) exp{η ⊺ S(y)} dy (3)
l
y∈R
A(η) = log(Z(η)) (4)

Here Z(η) is known as the partition function and A(η) is known as the log-particition function (also called
cumulant function). Partition function, Z(η) (consequently, exp(−A(η))), essentially plays the role of normaliza-
tion constant that makes sure that p(y; η) integrates or sums over y to 1. Here η is called the natural parameter
(it is also called the canonical parameter); S(y) is a vector of suﬃcinet statistic; and λ(y) is scaling constant.
A fixed choice of S(y), A(η), and λ(y) defines a family of distributions, that is parametrized by η. By varying
η we get diﬀerent distributions within the family. The next sections show that the Bernoulli and the Gaussian
distributions are examples of exponential family distributions.

1.1 Bernoulli and Gaussian Distributions as Exponential Family

Earlier we covered linear regression and logistic regression (classification). In the linear regression, we had
p(y|x; w) ∼ N (µ, σ 2 ), with µ = w⊺ ϕ(x). In the classification (logistic regression), we considered p(y|x; w) ∼
⊺
Bernoulli(µ), with µ = 1/{1 + e−w ϕ(x) } = Sigmoid(w⊺ ϕ(x)). In this section, we will now show that the Bernoulli
and the Gaussiation distributions are examples of exponential family distributions.

1.1.1 Bernoulli
The Bernoulli distribution (B) for y ∈ {0, 1}, with mean µ is: B(y; µ) = µy (1 − µ)1−y . Here p(y = 1; µ) = µ, and
p(y = 0; µ) = 1 − µ. By varying µ, we obtain Bernoulli distributions with diﬀerent means. We now show that the
Bernoulli distribution can be written as an exponential family distribution. We have,

p(y; µ) = µy (1 − µ)1−y
= exp{y log(µ) + (1 − y) log(1 − µ)}
{ ( ) }
µ
= exp y log + log(1 − µ) (5)
1−µ
= λ(y) exp {η ⊺ S(y) − A(η)} . (6)

1
1.1 Bernoulli and Gaussian Distributions as Exponential Family 1 EXPONENTIAL FAMILY
By comparing eq.(5) and eq.(6) we obtain:

λ(y) = 1
( )
µ 1
η = log =⇒ µ = = Sigmoid(η) (7)
1−µ 1 + e−η
S(y) = y
( )
1
A(η) = − log(1 − µ) = log = log(1 + eη ).
1−µ

Thus, Bernoulli distribution is an example of the exponential family distributions.

1.1.2 Gaussian
The univariate Gaussian (Normal) distribution, of y ∈ R with mean µ and variance σ 2 , i.e., y ∼ N (µ, σ 2 ), is:
{ }
1 1
p(y; µ, σ 2 ) = √ exp − 2 (y − µ)2
2πσ 2σ
{ }
1 1 2 µ 1 2
= √ exp − 2 y + 2 y − 2 µ
2πσ 2σ σ 2σ
{[ ]⊺ [ ] }
µ 2
1 y µ
= √ − 2
σ 2
exp (8)
2πσ − 2σ1 2 y2 2σ
{[ ]⊺ [ ]}
µ
1 y
= √ ( 2 ) exp σ 2
(9)
2πσ exp 2σ2µ − 2σ2
1
y2
| {z }
1
Z(η)
λ(y) exp(η ⊺ S(y))

which gives us that

[ ]
µ
η = σ2 (10)
−1
2σ 2
[ ]
y
S(y) = (11)
y2
λ(y) = 1 (12)
{ 2 }
√ µ
Z(η) = 2πσ exp (13)
2σ 2
A(η) = log(Z(η))
1 µ2
= log(2π) + log(σ) + 2
2 (√ 2σ )
1 −1 µ2
= log(2π) + log (−2) 2 + 2
2 2σ 2σ
1 1 η2
A(η) = log(2π) − log (−2η2 ) − 1 (14)
2 2 4η2

Recall that in linear regression, we saw that the variance σ 2 played no role in the maximum likelihood estimate of
the learnable parameters (weight vector) w, and also it had no eﬀect on the mapping function F w (ϕ(x)). Thus, we

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 2

Emails: sanjeevsharma@swaayatt-robots.com
⃝
R⃝c Swaayatt Robots | ⃝ c DeepEigen
Cite it as: Sanjeev Sharma, "Lecture Notes on Generalized Linear Models and Exponential Family". Swaayatt Robots, 2019.
1.2 Log Partition Function 1 EXPONENTIAL FAMILY
2 2
can ignore σ , or assume σ = 1 when representing the univariate Gaussian as an exponential family distribution.
Thus, we will now represent univariate Gaussian as an exponential family, with σ 2 = 1. With σ 2 = 1, we have
( )
1 1
p(y; η) = √ exp − (y − µ) 2
2π 2
( 2) ( )
1 −y µ2
= √ exp exp µy − (15)
2π 2 2
| {z }
λ(y) exp(η ⊺ S(y)−A(η))

which gives us
( 2)
1 y
λ(y) = √ exp − (16)
2π 2
η = µ (17)
S(y) = y (18)
( 2)
µ2 η2 η
A(η) = = =⇒ Z(η) = exp (19)
2 2 2

1.2 Log Partition Function

Note that in Section-(1) we mentioned that the log-partition function A(η), is also known as the cumulant func-
tion. This is because the log-partition function can be used to generate the cumulants of the suﬃcient statistics,
thereby justifying the name cumulant function.
Here we briefly discuss what moments (of a random variable with some distribution) and cumulants are. For a
1-dimensional random variable x, with some distribution p(x), the k th moment, represented as mk is Ex∼p [xk ].
Cumulants are certain non-linear combinations of moments, and they arise naturally when analyzing sums of
independent random varilable. Here we will ignore the discussion of cumulant generation functions. Here we only
discuss the first 4 cumulants of a random variable x. We will represent the k th cumulant as ck , for a 1-dimensional
random varilable x with mk as its k th moment.

c1 = m1 = E[x] −mean
2
c 2 = m2 − m21 = E[x ] − E[x]
2
−variance
c3 = m3 − 3m1 m2 + 2m31 −skeweness
c 4 = m4 − 3m22 − 4m1 m3 + 12m21 m2 − 6m41 −kurtosis

In these notes, we refrain from discussing any additional properties of cumulants in general, and interested readers
can find more details in some good statistics text-book. We will now see how the derivatives of A(η) provide us
cumulants.
∫ We will only compute the first two cumulants. We have A(η) = log(Z(η)), which gives us A(η) =
log{ λ(y) exp[η ⊺ S(y)]dy}. We will assume that η is scalar – this was the case for ∫Bernoulli distribution (Section-
1.1.1) and univariate Gaussian (Section-1.1.2) with σ 2 = 1. Thus, A(η) = log λ(y) exp(ηS(y))dy. The first
derivative of A(η) is:
{ ∫ }
dA(η) d
= log λ(y) exp (ηS(y)) dy
dη dη
∫
1 1
= ∫ λ(y) exp (ηS(y)) dy
λ(y) exp (ηS(y)) dy dη
∫
S(y)λ(y) exp (ηS(y)) dy
= (20)
exp(A(η))

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 3

Emails: sanjeevsharma@swaayatt-robots.com
⃝
R⃝c Swaayatt Robots | ⃝ c DeepEigen
Cite it as: Sanjeev Sharma, "Lecture Notes on Generalized Linear Models and Exponential Family". Swaayatt Robots, 2019.
1.2 Log Partition Function 1 EXPONENTIAL FAMILY
in eq.(20) we just used the definition of A(η). Proceeding further, we get:
∫
dA(η) S(y)λ(y) exp (ηS(y)) dy
=
dη exp A(η)
∫
= S(y)λ(y) exp (ηS(y) − A(η)) dy
∫
= S(y) λ(y) exp (ηS(y) − A(η)) dy
| {z }
p(y;η)
∫
= S(y)p(y; η)dy (21)
= E[S(y)] (22)

where we have used ∫the definition of expectation of a function F of some random variable x ∼ p(x), which is given
by Ex∼p(x) [F (x)] = F (x)p(x)dx, in eq.(21) to obtain eq.(22). Thus, the first derivate of the log-partition function
A(η) provides us the mean (expected value) of the suﬃcient statistic. Now we will compute the second derivative,
making use of the first derivative computed above:
∫
d2 A(η) d
= S(y)λ(y) exp(ηS(y) − A(η))dy
dη 2 dη
∫
= S(y)λ(y) exp[ηS(y) − A(η)](S(y) − A′ (η))dy
∫ ∫
= S(y)2 λ(y) exp[ηS(y) − A(η)]dy − A′ (η) S(y)λ(y) exp[ηS(y) − A(η)]dy
∫ ∫
= S(y)2 p(y; η)dy − A′ (η) S(y)p(y; η)dy
∫ ∫
= S(y)2 λ(y) exp[ηS(y) − A(η)]dy − A′ (η) S(y)λ(y) exp[ηS(y) − A(η)]dy
| {z }
E[S(y)]

= E[S(y) ] − E[S(y)] E[S(y)]

= E[S(y)2 ] − E[S(y)]2 (23)

= var[S(y)] definition of variance (24)

Thus, the second derivative of the log-partition function is the variance of the suﬃcient statistic S(y). It can be
easily verified, that for the multivariate case, we will have

∂ 2 A(η)
= E[Si (y)Sj (y)] − E[Si (y)] E[Sj (y)] (25)
∂ηi ∂ηj

where Sk (y) denotes the k th component of the vector of suﬃcient statistic S(y). We thus get:

∇2 A(η) = cov[S(y)]. (26)

Note that earlier we had discussed that a twice diﬀerentiable function f (x) : Rn → R is convex iﬀ ∇2 f (x) ⪰ 0,
i.e. the Hessian is a positive semidefinite matrix. From eq.(26), we have that ∇2 A(η) = cov[S(y)], and thus the
log-partition function is a convex-function. Where we have used the property of covariance matrices, i.e., they are
positive (semi-) definite matrices.
Next we will see the Bernoulli and Gaussian distributions examples to show how the first derivate of the
log-partition function computes the mean.

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 4

Emails: sanjeevsharma@swaayatt-robots.com
⃝
R⃝c Swaayatt Robots | ⃝ c DeepEigen
Cite it as: Sanjeev Sharma, "Lecture Notes on Generalized Linear Models and Exponential Family". Swaayatt Robots, 2019.
1.2 Log Partition Function 1 EXPONENTIAL FAMILY
1.2.1 Examples: Bernoulli and Gaussian
In Bernoulli distribution, we have A(η) = log(1 + eη ). The first derivative gives us

dA(η) d
= log(1 + eη )
dη dη
eη
=
1 + eη
1
= (27)
1 + e−η
= µ (28)

where eq.(28) follows from eq.(7) and eq.(27). Note that suﬃcient statistic S(y) = y, and µ is the expected value
of y, a Bernoulli random variable (E[y] = p(y = 1; µ) × 1 + p(y = 0; µ) × 0 = µ). Furterhmore, the second derivative
of A(η) is

d2 A(η) d 1 d
= = sigmoid(η)
dη 2 dη 1 + e−η dη
= sigmoid(η)(1 − sigmoid(η))
= µ(1 − µ) (29)
= var[S(y)] (= var[y]; y ∼ B(µ)) (30)
η2
In case of univariate Gaussian distribution, with σ 2 = 1, we hvae A(η) = 2 . Thus,

dA(η)
= η=µ (31)
dη
2
d A(η)
= 1 (32)
dη 2

Now let us calculate the derivate for univariate Gaussian with variance σ 2 . In this case, A(η) = 1
2 log(2π) −
η12
1
2 log(−2η2 ) − 4η2 . Note that the suﬃcient statistic, S(y) is a vector, [y, y 2 ]⊺ . By taking the derivate with respect
to η1 gives:
{ }
∂A(η) ∂ 1 1 η12
= log(2π) − log(−2η2 ) −
∂η1 ∂η1 2 2 4η2
µ
−2η1 −2 2
= = −4σ
4η2 2σ 2
= µ

which is the mean of the first component of the suﬃcient statistic, i.e., of y. Furthermore, the second derivate with
respect to η1 is:

∂ 2 A(η) ∂ −2η1
= (33)
∂η12 ∂η1 4η2
−1 −1
= = −1 (34)
2η2 2. 2σ2
= σ2 (35)

which is the variance of the first compoment, y, of the suﬃcient statistic.

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 5

Emails: sanjeevsharma@swaayatt-robots.com
⃝
R⃝c Swaayatt Robots | ⃝ c DeepEigen
Cite it as: Sanjeev Sharma, "Lecture Notes on Generalized Linear Models and Exponential Family". Swaayatt Robots, 2019.
1.3 Sufficient Statistic 1 EXPONENTIAL FAMILY
1.3 Sufficient Statistic
In this section we only briefly discuss what is Sufficient Statistic. We do not cover the factorization theorem ex-
tensively and entirely omit the discussion on minimal sufficient statistic. Interested readers can read more about
sufficient statistic and factor theorem in a good statistic textbook.

Let {y 1 , ..., y n }; y i ∈ RN , is generated from some distribution p(y; θ), i.e., {y 1 , ..., y n } ∼ p(y; θ), where θ is some
unknown parameter of the distribution, which may not necessarily be a random variable (in Bayesian setting, we
typically assume a distribution on the parameters, in which case θ ∼ p(θ)). Furthermore, θ might be a parameter
that perhaps we would like to learn. For example, if y is a Bernoulli random variable with mean µ, i.e., y ∼ B(y; µ),
with p(y = 1; µ) = µ. Given the sample {y 1 , ..., y n }, perhaps we would like to estimate µ.

What is a statistic?
A statistic is any real valued function, T = r(y 1 , ..., y n ), of the observation {y 1 , ..., y n }. T should not have any
unknown parameters. For example:

T1 = mean{y 1 , ..., y n }
T2 = max{y 1 , ..., y n }
T3 = 100
T4 = y 1 + γ

Here the first 3 examples, T1 , T2 , T3 , are all valid statistics. However, in T4 , if γ is unknown, then it is not a
statistic. The third example is a bit strange example, as it ignored the random sample, but it still is a statistic.

1.3.1 Sufficiency
Informally, we say T is sufficient statistic (for parameter θ) if knowing the value of T is as good as knowing the
entire random sample, for estimating the value of θ. That is, the random sample contains no information about
the parameter θ beyond what is contained in T . We can put this definition more formally: T (y) is a sufficient
statistic for θ, if the conditional distribution of y given T (y) does not depend on θ.

Putting if mathematically:

Definition 1 Suppose {y 1 , ..., y n } ∼ p(y; θ). T is suﬃcient statistic for θ if the conditional distribution of
{y 1 , ..., y n } given T = t is independent of θ for each t:

p(y 1 , ..., y n |T = t; θ) = p(y 1 , ..., y n |T = t) =⇒ T suﬃcient for θ.

To see this, we can consider an example of a Bernoulli random variable y ∈ {0, 1} ∼ B(µ).
∑n
Example 1 Let {y 1 , ..., y n } be IID Bernoulli trials, with p(y i = 1; µ) = µ, i = 1, ..., n. Let T = i=1 y
i = t. We
will see that T is suﬃcient statistic for µ.

Proof: Using Bayes theorem, we have

p(x1 , ..., xn )
p(x1 , ..., xn |T = t) =
p(T = t)
µt (1 − µ)1−t 1
= (n) = (n )
t µ (1 − µ)
t 1−t
t
∑n i
The conditional distribution does not involve µ, which proves i=1 y is suﬃcient statistic for θ.

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 6

Emails: sanjeevsharma@swaayatt-robots.com
⃝
R⃝c Swaayatt Robots | ⃝ c DeepEigen
Cite it as: Sanjeev Sharma, "Lecture Notes on Generalized Linear Models and Exponential Family". Swaayatt Robots, 2019.
1.3 Sufficient Statistic 1 EXPONENTIAL FAMILY
It is difficult, however, to use this defition-1 for general cases to check if a statistic is a sufficient statistic, or to
find a sufficient statistic. There is a theorem that enables us to find sufficient statistics.

Theorem 1 Factorization Theorem: Let {x1 , ..., xn } be a set of random sample with joint density p(x1 , ..., xn ; θ).
A statistic T (x) = r(x1 , ..., xn ) is suﬃcient if an only if the joint density can be factored as follows:

p(x1 , ..., xn ; θ) = f (x1 , ..., xn )g(r(x1 , ..., xn ); θ) (36)

where the functions f and g are non-negative functions. Function f may depend on x but does not depend of θ.
The function g depends on θ and depends on observed values of the samples xi ; i = 1, ..., n only through the value
of suﬃcient statistic T (x).

Lets consider an example to see the factorization theorem.

Example 2 Let x1 , ..., xn ∼ N (µ, σ 2 ). Let xn represent the random sample set, i.e, xn = {x1 , ..., xn }. Let
1 ∑ i
x̂ = n i x . Further, assume that σ 2 is known. We have
{ ∑ i 2} { ∑ i 2}
1 i (x − µ) 1 i (x − x̂ + x̂ − µ)
n 2
p(x ; µ, σ ) = n exp − = n exp −
(2πσ 2 ) 2 2σ 2 (2πσ 2 ) 2 2σ 2
{ ∑ i ∑ i }
i (x − x̂) + n(x̂ − µ) − 2(x̂ − µ) i (x − x̂)
2 2
1
= n exp −
(2πσ 2 ) 2 2σ 2
{ ∑ i }
i (x − x̂) + n(x̂ − µ) − 2(x̂ − µ)(nx̂ − nx̂)
2 2
1
= n exp −
(2πσ 2 ) 2 2σ 2
{ ∑ i 2}
i (x − x̂) + n(x̂ − µ)
2
1
= n exp −
(2πσ 2 ) 2 2σ 2
{ ∑ } { }
1 (xi − x̂)2 n(x̂ − µ)2
= n exp − i × exp −
(2πσ 2 ) 2 2σ 2 2σ 2
{ ∑ i 2} { }
1 i (x − x̂) n(x̂ − µ)2
= n exp − × exp −
(2πσ 2 ) 2 2σ 2 2σ 2
| {z } | {z }
f (xn ) g(T (xn );µ)

Thus, x̂ is a suﬃcient statistic for µ, when σ 2 is known.

For general case, we have:

{ ∑ i 2} { ∑ i 2 i}
i (x − µ) i (x ) + µ − 2µx
2
1 1
n 2
p(x ; µ, σ ) = n exp − = n exp −
(2πσ 2 ) 2 2σ 2 (2πσ 2 ) 2 2σ 2
{ }
1 1 ∑ i 2 µ ∑ i n
= n exp − 2 (x ) + 2 x − 2 µ2
(2πσ 2 ) 2 2σ σ 2σ
i i
(37)
(∑ 2
∑ i )2
)
Thus, by factorization theorem, we have that T = ix , i (x is suﬃcient statistic for θ = (µ, σ 2 ).

We omit the detailed discussion on factorization theorem, and minimal suﬃcient statistic. Interested readers can
find more information in a good statistical notebook.

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 7

Emails: sanjeevsharma@swaayatt-robots.com
⃝
R⃝c Swaayatt Robots | ⃝ c DeepEigen
Cite it as: Sanjeev Sharma, "Lecture Notes on Generalized Linear Models and Exponential Family". Swaayatt Robots, 2019.
2 GENERALIZED LINEAR MODELS
2 Generalized Linear Models
Consider the linear regression example presented in earlier notes. There, we assumed that we have a dataset
D{xi , y i }; i = 1, ..., m; xi ∈ Rk , y i ∈ R, ϕ(x) ∈ RN . The goal is then to learn a mapping function F w (ϕ(x)) : RN →
R, using the data which can be used to predict the target values ŷ given a new input data point x̂. Furthermore,
we assumed that the dataset has some noise ϵi , i.e., each of the input data-points in D follows the relation:
y i = Fw (ϕ(xi )) + ϵi .
Furthermore, we assumed that F w (ϕ(x)) is linear in its input, which gave us linear (in features) predictor:
y i = ϕ(xi )⊺ w + ϵi
where each of the ϵ is IID. If we assume that ϵi N (0, σ 2 ), then it results in least squares regression, or L22 error
minimization over the dataset D with respect to the parameters w, with optimal (MLE) solution
wmle = argmin ∥ y −Φw∥22 (38)
w

Where Φ ∈ Rm × RN is the feature matrix, y ∈ Rm is a vector of target values, as discussed in the earlier lectures.
The conditional distribution of y given x is thus a Normal distribution with mean F w (ϕ(x)), and variance σ 2 , i.e.,
y|x; w ∼ N (F w (ϕ(x)), σ 2 ). If we assume that the noises are IID and have Lapcacian distribution with mean 0 and
diversion b, i.e., ϵi ∼ L(0, b), then the maximum likelihood estimate will result in minimization of the L1 error on
the dataset, i.e, following optimization problem:
wmle = argmin ∥ y −Φw∥1 (39)
w

We have seen earlier in regularization discussion that assuming Laplacian prior on the parameter vector w results
in L1 regularized regression. Following similar steps, by computing the likelihood of the parameters and the
maximimizing the log-likelihood or minimizing the negative log-likelihood will result in (39). As we saw in the
regularized regression example, there is no closed form solution for (39) and the solution to the optimization
problem is obatained via sub-gradient descent. In the regression model, we assumed that the noise is additive,
Laplacian or Gaussian.
In the classification example, using logistic regression, we considered that y ∈ {0, 1} is a Bernoulli random
variable, with mean µ, i.e., y ∼ B(µ). The conditional distribution of y given x is is a Bernoulli distribution, with
p(y = 1|x; µ) = µ. Generalized Linear Models (GLMs) are a statistical framework, a way, to unify regression and
classification. Furthermore, GLMs allow us to easily consider other probability models, and not just Laplacian,
Gaussian or Bernoulli.
We have already seen that Gaussian and Bernoulli both can be represented as Exponential Family distributions.
We will now see that Both the linear regession and logistic regression are special cases of GLMs. Furthermore, we
will use GLMs to derive a learning algorithm known as Softmax Regression.

2.1 Constructing GLMs

Note that, in both the classification and regression problem, the goal was to predict the conditional distribution
of y given x, where the distribution is parametrized by some learnable parameters w. Again, consider a problem
(classification or regression), where again we would like to predict y given x. Also, recall that instead of learning
the mapping directly on x, we learn a mapping on ϕ(x). We will represent an exponential family distribution
with natural parameter η as ξ(η). Driving GLM for this problem will require following three assumptions, or more
accurately, design choices:

• p(y|x; w) ∼ ξ(η); i.e., the distribution of y given x, parametrized by w, follows or can be represented as
an exponential family distribution with natural parameters η, with conditional mean µ, E[y|x] = µ, where
η = Ψ−1 (µ), where Ψ(.) is some real-function.

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 8

Emails: sanjeevsharma@swaayatt-robots.com
⃝
R⃝c Swaayatt Robots | ⃝ c DeepEigen
Cite it as: Sanjeev Sharma, "Lecture Notes on Generalized Linear Models and Exponential Family". Swaayatt Robots, 2019.
2.2 GLMs: Softmax Regression 2 GENERALIZED LINEAR MODELS
⊺
• η = w ϕ(x), where ϕ(x) is the non-linear feature function mapping ϕ(x) : x ∈ Rk → RN . This is a design
choise, in Generalized Linear Models (deriving their name). If η is a vector, then ηi = Wi⊺ ϕ(x).
• µ = Ψ(η) = Ψ(w⊺ Φ(x)), where Ψ is some real function; given x, our goal is to predict expected value of the
suﬃcient statistic S(y)|x. In case of Bernoulli and Gaussian (with known σ 2 ) distributions, S(y) = y. Thus,
in the cases we consider, the goal would be to predict E[y|x] = µ. We further assume that µ = F w (ϕ(x)).
The goal, eventually, is then to predict the mapping function F w (ϕ(x)), that predicts the E[y|x] = µ.

2.1.1 GLMs: Linear Regression with L22 Loss

In the linear regression with sum of suqared errors, or L2 -Norm Squared (L22 ) loss function, we learned earlier that
we assume an additive Gaussian noise in the dataset. Furthermore, the E[y|x] is a Gaussian distribution.
To show that Linear Regression with L22 error can be expressed as GLMs, we will assume that y is continuous,
y ∈ R, and we assume that that p(y|x) is a Normal distribution, N (µ, σ 2 ) (µ may depend on x). Thus, we will let
ξ(η) be the Gaussian distribution. Furthermore, we earlier saw that (with known σ 2 ), S(y) = y. Also, we saw that
for Gaussian distribution, η = µ. Thus, we have:
E[S(y)|x] = E[y|x]
= µ = Fw (ϕ(x))
= η
= w⊺ ϕ(x). (40)
Thus, we get that F w (ϕ(x)) = w⊺ ϕ(x) = µ, which was exactly the case in Linear Regression example. This proves
that the Linear Regression with L22 loss is a special case of GLMs.

2.1.2 GLMs: Logistic Regression

We now show that Logistic Regression is a special case of GLMs. We assume that y ∈ {0, 1}. Thus it is natural to
assum a Bernoulli distribution, with some mean µ, i.e., y ∼ B(µ). We thus model y|x ∼ B(µ), i.e., the conditional
distribution of y|x is a Bernoulli distribution with some mean µ. Also, recall that if y|x; w ∼ B(µ), then E[y|x] = µ
{E[y|x; w] = 1 × p(y = 1|x; w) + 0 × p(y = 0|x; w) = µ}. Furthermore, as we saw in section-1.1.1, that in case of
Bernoulli distribution, S(y) = y, and µ = Sigmoid(η). Again, we assume that µ = F w (ϕ(x)) = E[y|x; w]. Thus,
we have:
E[S(y)|x] = E[y|x]
= µ = Fw (ϕ(x))
= Sigmoid(η)
1
=
1 + e−η
1
= (41)
1 + e ⊺ ϕ(x)
−w

= Sigmoid(w⊺ ϕ(x)). (42)

Thus, we get that F w (ϕ(x)) = Sigmoid(w⊺ ϕ(x)), which was exactly the case in Logistic Regression example. This
proves that Logistic Regression is an special case of GLMs.

2.2 GLMs: Softmax Regression

We will end the topic of Exponential Family distribution and Generalized Linear Models by deriving a learning
algorithm known as Softmax Regression. Softmax Regression generalizes Logistic Regression for multi-class clas-
sification tasks, i.e., when y is non-binary. Lets consider an example where y ∈ {1, 2, ..., q}. For example, this is
useful when the input variable x can have more than two possible labels/classes, when classifying it.

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 9

Emails: sanjeevsharma@swaayatt-robots.com
⃝
R⃝c Swaayatt Robots | ⃝ c DeepEigen
Cite it as: Sanjeev Sharma, "Lecture Notes on Generalized Linear Models and Exponential Family". Swaayatt Robots, 2019.
2.2 GLMs: Softmax Regression 2 GENERALIZED LINEAR MODELS
In logistic regression, y ∈ {0, 1}, and the (stochastic-) functional mapping F w (ϕ(x)) gives us the conditional
probability p(y = 1|x; w) = µ, p(y = 0|x; 0) = 1 − µ. We assumed that there are only two classes, 0 and 1.
Softmax Regression generalizes it to q-classes, y ∈ {1, ..., q}. To summarize Softmax Regression, we compute
the conditional probability y|x using a Softmax Function, or also known as the normalized exponential
function. The conditional probabilitities y|x (parametrized by W ∈ Rq×N ) are given by the functional mapping
(F W (ϕ(x)) = p(y|x; W )):
 
—(w1 )⊺ —
—(w2 )⊺ —
 
W =  ..  (43)
 . 
—(wq )⊺ —
exp(ϕ(x)⊺ wj )
p(y = j|x; W ) = ∑ ⊺ i
(44)
i exp(ϕ(x) w )
exp(Wj ϕ(x))
= ∑ (45)
i exp(Wi ϕ(x))

where (wj )⊺ = Wj ∈ RN is the j th row of the matrix W . We can now use the gradient descent or Newton’s Method
to maximize the likelihood of the parameters, to compute W ⋆ = WMLE . Furthermore, wq = 0, which we will see
later. We will now construct a GLM to derive the formulations for Softmax Regression learning algorithm.

2.2.1 Softmax Regression as GLM

y is discrete and can have multiple values, and thus we will model it as a Multinomial distribution. The multinomial
distribution over q possible values of y can be parametrized using q parameters, µ1 , µ2 , ..., µq , where µi = p(y =
i|x; W ). Also µi ; i = 1, ..., q, satsify:
∑q
µj = 1. (46)
j=1
∑q−1
Using constraint-46, the number of parameters can be reduced to q − 1, as µq = 1 − j=1 µj . For notational
∑q−1
convenience in the derivation below, we will assume that µq = 1 − j=1 µj , but note that µq is not a parameter,
and it can computed by µ1 , ..., µq−1 . In case of Logistic regression, the suﬃcient statistic S(y) = y. To express a
multinomial distribution as an exponential family distribution we define S(y) ∈ Rq−1 as follows:
       
1 0 0 0
0 1 0 0
       
S(y = 1) =  .  ; S(2) =  .  , . . . , S(q − 1) =  .  , S(q) =  .  . (47)
.
. .
. .
.  .. 
0 0 1 0

We can further represent S(y) more eﬀectively using an Indicator function. An indicator function I{E}, is equal
to 1 if the event E is true, and is equal to 0 if the event E is false. We can thus write the suﬃcient statistic S(y)
more compactly as:
 
I{y = 1}
 I{y = 2} 
 
S(y) =  .. . (48)
 . 
I{y = q − 1}

Our goals is now to compute the conditional expectation of the Suﬃcient Statistic S(y) given x. We can see that
the expected value of the suﬃcient statistic, which is a q − 1 dimensional vector, is given by the µi , ; i = 1, ..., q − 1.

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 10

Emails: sanjeevsharma@swaayatt-robots.com
⃝
R⃝c Swaayatt Robots | ⃝ c DeepEigen
Cite it as: Sanjeev Sharma, "Lecture Notes on Generalized Linear Models and Exponential Family". Swaayatt Robots, 2019.
2.2 GLMs: Softmax Regression 2 GENERALIZED LINEAR MODELS
th
We can show this mathematically, for the i component of the suﬃcient statistic S(y)i , as:

E[S(y)i |x] = E[I{y = i}|x] = p(y = i|x) = µi (49)

Thus, our goal is to learn the stochastic functional mapping F W (ϕ(x)), which computes the vector of conditional
∑q−1µi = p(y = i|x; W ); i = 1, ..., q. Note that we will only compute µi ; i = 1, ..., q − 1, and then
probabilities
µq = 1 − i=1 µi . This mapping function will be the Softmax function or the normalized exponential function. We
will now express the multinomial distribution as an Exponential Family Distribution (ξ(η)) with natural parameter
η ∈ Rq−1 . Hereafter µ = [µ1 , ..., µq ]⊺ , where µq can be computed using µ1 , ..., µq−1 as discussed earlier. We have
∏
q
p(y; µ) = p(y = 1; µ)I{y=i} (50)
i=1
∏
q
I{y=i}
= µi (51)
i=1
{q−1 }
∏ I{y=i}
= µi µI{y=q}
q (52)
i=1
{q−1 } ∑q−1
∏ S(y) 1− j=1 S(y)j
= µi i µq (53)
i=1
   
∑
q−1 ∑
q−1
= exp  S(y)i log(µi ) + 1 − S(y)j  log(µq ) (54)
i=1 j=1
( q−1 ( ) )
∑ µi
= exp S(y)i log + log(µq ) (55)
µq
i=1
= λ(y) exp{η ⊺ S(y) − A(η)} (56)

which gives us
 ( ) 
log µµ1q
  ( )
 
 log µµ1q

η =   (57)
.. 
 
 (. )
µ
log µq−1
q

A(η) = − log(µq ) (58)

λ(y) = 1. (59)

Thus, the multinomial distribution belongs the the exponential family distribution with above parameters. For
µ
convenience, we can define η ∈ Rq with ηq = log µqq = 0. From the equations above, we can now obtain µ as:
( )
µi
ηi = log =⇒ µq eηi = µi . (60)
µq
By summing over all the i = 1, ..., q, we get:
∑
q ∑
q
µ q e ηi = µi = 1 (61)
i=1 i=1
1
=⇒ µq = ∑q ηi
(62)
i=1 e

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 11

Emails: sanjeevsharma@swaayatt-robots.com
⃝
R⃝c Swaayatt Robots | ⃝ c DeepEigen
Cite it as: Sanjeev Sharma, "Lecture Notes on Generalized Linear Models and Exponential Family". Swaayatt Robots, 2019.
2.2 GLMs: Softmax Regression 2 GENERALIZED LINEAR MODELS
Substituting the result (eq-62) back into eq-60 gives us:
e ηi
µi = µq eηi =⇒ µi = ∑q ηj
(63)
j=1 e
η
This functional mapping, µ = Ψ(η), i.e., µi = ∑qe i , from η to µ is known as Softmax Function, as discussed
j=1 e ηj
in earlier section, see eq-44 and eq-45. To obtain the results of eq-45, we use the assumption number 2 in GLMs,
discussed in section-2.1, i.e., η is a linear function of the input x or equivalently of the feature function representation
of x, ϕ(x). Thus, we have that ηi = ϕ(x)⊺ wi , i = 1, ..., q − 1, where wi are the parameters for the ith class/label are
the parameters of the model. We also define wq = 0 a vector of 0s, which gives us ηq = wq⊺ ϕ(x) = 0, as discussed
earlier.
We can summarize it as follows:
⊺ i
eϕ(x) w
µi = ∑q ϕ(x)⊺ wj
(64)
j=1 e
1
µq = ∑q ϕ(x)⊺ wj
(65)
j=1 e

where wi ∈ RN ; i = 1, ..., q − 1 are the model parameters, and wq = 0 (not a model parameter, mentioned only for
convenience) is an N -dimensional vector with all its components being 0. We can define an q × N matrix W (last
row is not a model parameter, represented only for convenience), which contains all these model parameters as:
 
—(w1 )⊺ —
—(w2 )⊺ —
 
W = ..  (66)
 . 
—0⊺ — q×N

Furthermore, our functional mapping F W (ϕ(x)) is given by:

FW (ϕ(x)) = E[S(y)|x]
  
I{y = 1}
 I{y = 2}  
  
  
= E  I{y = 3}  x (67)
 ..  
 .  
I{y = q − 1}
 
µ1
 µ2 
 
 µ3 
=   (68)
 .. 
 . 
µq−1
 ϕ(x)⊺ w1

e
∑q
 j=1 eϕ(x)⊺ wj 
 ⊺ 2 
 ∑ eϕ(x) w ⊺ j 
 q eϕ(x) w 
 j=1 ⊺ 3 
 ϕ(x) w 
=  ∑qe ϕ(x)⊺ wj  (69)
 j=1 e 
 
 .. 
 . 
 ϕ(x)⊺ wq−1 
e
∑q ⊺ wj
j=1 eϕ(x)

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 12

Emails: sanjeevsharma@swaayatt-robots.com
⃝
R⃝c Swaayatt Robots | ⃝ c DeepEigen
Cite it as: Sanjeev Sharma, "Lecture Notes on Generalized Linear Models and Exponential Family". Swaayatt Robots, 2019.
3 SOFTMAX REGRESSION
Thus, the functional mapping F W (ϕ(x)) : R → N
[0, 1]q−1
gives us the class probabilities. It is thus a stochastic
function, which generates the conditional probabilities y|x. Althoug it doesn’t directly provide µq but it can be
∑
computed simply as µq = 1 − q−1 j=1 µj . Furthermore, with our notation of W , whose last row is a vector of 0s, we
can assume that the F W (ϕ(x)) produces q dimensional output as, F W (ϕ(x)) : RN → [0, 1]q :
 ⊺ 1

eϕ(x) w
∑
 qj=1 eϕ(x)⊺ wj 
 ⊺ 2 
 ∑ eϕ(x) w 
 q eϕ(x)⊺ wj 
 j=1 ⊺ 3 
 eϕ(x) w 
 ∑q ϕ(x)⊺ wj 

FW (ϕ(x)) =  j=1 e  (70)

 .. 
 . 
 ϕ(x)⊺ wq−1 
 ∑e 
 q eϕ(x)⊺ wj 
 j=1 
1
∑q ⊺ j
j=1 eϕ(x) w

We thus obtain the same result, after deriving the Softmax Regression using GLMs, as mentioned in eq-45. This
finishes our discussion on Generalized Linear Models.

3 Softmax Regression
Having presented the Softmax Regression learning algorithm, which can be used for multi-class classification tasks,
we will now see how we can learn the parameters using the gradient descent algorithm. Assume that we have a
dataset D{xi , y i }; i = 1, ..., m; x ∈ Rk , y ∈ {1, 2, 3, ..., K}. Given the feature function ϕ(x) : Rk → RN , the goal is
to learn F W (ϕ(x)) : RN → [0, 1]K , where [0, 1]K represents a K-dimensional vector, with each component of the
W ϕ(x)
vector beglonging to set [0, 1]. The conditional probabilities are p(y|x; W ) = µi = ∑Ke i Wj ϕ(x) , where Wi is the ith
j=1 e
K ×N
row of the matrix W ∈ R .
We can compute the maximum likelihood estimate (MLE) of the parameters W by maximizing the likelihood
function L(W ), as we did in Logistic Regression:
∏
m
L(W ) = p(y i |xi ; W ) (71)
i=1
∑m
LL(W ) = log L(W ) = log p(y i |xi ; W ) (72)
i=1
∑m K
∏ I{y i =j}
= log µij (73)
i=1 j=1
K
( )I{yi =j}
∑
m ∏ eWj ϕ(x )
i

= log ∑K Ws ϕ(xi )
(74)
i=1 j=1 s=1 e
⊺
where Wl is a row vector (lth -row of matrix W ), which is (wl ) , l = 1, ..., K. We can maximize the log-likelihood
function, LL(W ), to obtain the MLE estimate of the parameters W . This can be done using algorithms like
Gradient Descent or Netwon’s Method.

3.1 Maximizing Log Likelihood: Gradient Descent

We have the log-likelihood function LL(W ) given in eq-74. To compute the update rule for gradient descent algo-
rithm, we have to compute the gradients of the log-likelihood funtions, ∇W LL(W ), with respect to the parameters
wi ; i = 1, ..., K. We can represent ϕ(xi )⊺ wj as zij to simplify the notations.

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 13

Emails: sanjeevsharma@swaayatt-robots.com
⃝
R⃝c Swaayatt Robots | ⃝ c DeepEigen
Cite it as: Sanjeev Sharma, "Lecture Notes on Generalized Linear Models and Exponential Family". Swaayatt Robots, 2019.
3.1 Maximizing Log Likelihood: Gradient Descent 3 SOFTMAX REGRESSION
Thus we get:
K
( )I{yi =j}
∑m ∏ j
ez i
LL(W ) = log ∑K zis
(75)
i=1 j=1 s=1 e

We can use the chain rule of diﬀerentiation to compute the ∇wj LL(W ) as:

∂ LL(W ) ∂z j
∇wj LL(W ) = × . (76)
∂z j ∂wj
Furthermore
∂z j ϕ(x)⊺ wj
= = ϕ(x). (77)
∂wj ∂wj
Having all the basic mathematical expressions in place, we can now compute the gradient ∇wj LL(W ). Also, there
would be two cases: (i) y i = j and (ii) y i ̸= j.

Case-1: y i = j
 
ϕ(xi )
(  )I{yi =j} z}|{ 
m  K 
∂ ∑ log 
∏ e zij
 ∂zij  
∇wj LL(W ) =  ∑K z s × j
(78)
∂z j  s=1 e i ∂w 
i=1 j=1

  ( )I{yi =j}  
∑
m ∏K zij
 ∂ e
= log  ∑K z s  × ϕ(xi ) (79)
∂z j s=1 e i
i=1 j=1
( [( )] )
∑m
∂
j
ez i
= log ∑K z s × ϕ(xi ) (80)
∂z j s=1 e i
i=1
∑ ( )
∑m K s
zi
ez i
j
s=1 e ∂
= j j ∑K s × ϕ(xi ) (81)
eiz ∂z s=1 e
z i
i=1
∑ ( )
∑ K ez i
m s
e
j
zi j ∂ 1
s=1 z i
= ∑K z s + e i j ∑K z s ϕ(x ) (82)
e zij e i ∂z e i
i=1 s=1 s=1
 
m ∑
∑ K ez i  ez i s j j j
ez i ez i 
 ∑K z s − (∑
s=1 i
= j )2  ϕ(x ) (83)
e z i e i K z s
s=1 e
i=1 s=1 i

( )
∑m j
ez i
= 1 − ∑K z s ϕ(xi ) (84)
s=1 e
i
i=1
( )
∑m i
eϕ(x ) w
⊺ j

= 1 − ∑K ϕ(xi ) (85)
e ϕ(xi )⊺ ws
i=1 s=1

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 14

m ∑K
( )
∑ e zis
∂ e zia
∇wj LL(W ) = s=1
∑ × ϕ(xi ) (86)
e zia ∂z j K s=1 e zis
i=1
m ∑K
( )
∑ e zis
a ∂ 1
= s=1
zia
ezi j ∑K z s × ϕ(xi ) (87)
e ∂z s=1 e
i
i=1
 
∑m ∑K z s z a zj
s=1 e  e ie i 
i
− (∑
i
= z a )2  ϕ(x ) (88)
i=1
e i K
e z s
i
s=1

∑
m
−ezi
j
i
= ∑K z s ϕ(x ) (89)
i=1 s=1 e i

∑m
−eϕ(x ) w
i ⊺ j
i
= ∑K ϕ(xi )⊺ ws ϕ(x ) (90)
i=1 s=1 e

Having derived the results for both the cases, we can now combine the results to obtain the gradient ∇wj LL(W ).
By combining eq-85 and eq-90, we get:
[m ( ) ]
∑ i ⊺ j
eϕ(x ) w ∑m
−e ϕ(xi )⊺ wj
∇wj LL(W ) = 1 − ∑K I{y i = j} + ∑K ϕ(xi )⊺ ws I{y ̸= j} ϕ(x )
i i
(91)
e ϕ(xi )⊺ ws e
i=1 s=1 i=1 s=1

We can simplify it further as:

[( ) ]
∑
m
eϕ(x )
i ⊺ wj
eϕ(x )
i ⊺ wj

∇wj LL(W ) = 1 − ∑K I{y = j} − ∑K

i
I{y ̸= j} ϕ(xi )
i
(92)
i=1 s=1 eϕ(xi )⊺ ws s=1 eϕ(xi )⊺ ws
  
∑
m ϕ(xi )⊺ wj
= I{y i = j} − ∑ e I{y i ̸= j} + I{y i = j} ϕ(xi ) (93)
K | {z }
i=1 s=1 eϕ(xi )⊺ ws
1
[ ]
∑
m
eϕ(x )
i ⊺ wj

= I{y = j} − ∑K
i
ϕ(xi ) (94)
ϕ(xi )⊺ ws
i=1 s=1 e
[ ]
∑m
eϕ(x ) w
i ⊺ j

= δij − ∑K ϕ(xi ) (95)

e ϕ(xi )⊺ ws
i=1 s=1

Where δij = I{y i = j}. Thus, we can compute the parameters, using the Batch-Gradient Descent algorithm, which
uses all the training data at once to compute one update for the parameters. We can instead used mini-batch or
stochastic gradient descent, which updates the parameters after observing every training data point as:
( )
e ϕ(xi )⊺ wj
∇wj LL(W ; Di ) = δij − ∑K ϕ(xi ) (96)
e ϕ(xi )⊺ ws
s=1

Where ∇wj LL(W ; Di ) is the gradient with respect to wj ; j = 1, ..., K for the ith training data-point in the dataset
D. Eq-96 gives the update rule for the stochastic gradient descent algorith, for the parameters wj .
Furthermore, as discussed earlier wK = 0. Thus, all the other weights can be initially initialized to random
values, and wK can be initialized to 0 and then after every stochastic gradient descent step, wK will be shifted to
some value ̸= 0. We can thus, after every step of stochastic gradient descent, project wK to 0, or simply reset them
to 0. This completes our discussion of batch gradient descent, and stochastic gradient descent for the Softmax
Regression.

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 15

Lecture 11
No ratings yet
Lecture 11
6 pages
Generalized Linear Models
No ratings yet
Generalized Linear Models
23 pages
Probabilistic Learning and Generalized Linear Models (GLMS)
No ratings yet
Probabilistic Learning and Generalized Linear Models (GLMS)
38 pages
W5 Probability Distributions - 2
No ratings yet
W5 Probability Distributions - 2
19 pages
5 - GLM
No ratings yet
5 - GLM
38 pages
Exponential Family Distributions
No ratings yet
Exponential Family Distributions
17 pages
CPSC 440: Advanced Machine Learning: Exponential Families
No ratings yet
CPSC 440: Advanced Machine Learning: Exponential Families
15 pages
Lecture 02
No ratings yet
Lecture 02
41 pages
Conjugacy and The Exponential Family
No ratings yet
Conjugacy and The Exponential Family
6 pages
Financial Series Econometrics
No ratings yet
Financial Series Econometrics
7 pages
Probabilistic Machine Learning: Exponential Families
No ratings yet
Probabilistic Machine Learning: Exponential Families
19 pages
Lec12 GLM ExponentialFamilies
No ratings yet
Lec12 GLM ExponentialFamilies
24 pages
Advanced Statistical Models Lecture
No ratings yet
Advanced Statistical Models Lecture
67 pages
Lecture Notes 1: Brief Review of Basic Probability (Casella and Berger Chapters 1-4)
100% (1)
Lecture Notes 1: Brief Review of Basic Probability (Casella and Berger Chapters 1-4)
14 pages
Research Article: Estimating The Reliability Function For A Family of Exponentiated Distributions
No ratings yet
Research Article: Estimating The Reliability Function For A Family of Exponentiated Distributions
11 pages
CQF January 2016 M5S8 Workings Annotated
No ratings yet
CQF January 2016 M5S8 Workings Annotated
7 pages
Physics Lecture Notes: Random Systems
No ratings yet
Physics Lecture Notes: Random Systems
17 pages
Cheatsheet PDF
100% (1)
Cheatsheet PDF
4 pages
Advanced Statistical Distributions
No ratings yet
Advanced Statistical Distributions
13 pages
A Two Parameter Distribution Obtained by
No ratings yet
A Two Parameter Distribution Obtained by
15 pages
An Asymmetric Generalization of Gaussian and Lapla
No ratings yet
An Asymmetric Generalization of Gaussian and Lapla
29 pages
Generalized Linear Models-1
No ratings yet
Generalized Linear Models-1
29 pages
15 Exponential Families
No ratings yet
15 Exponential Families
33 pages
Lecture04 CH 04 ContinuousDistributions Baron Inf Stats FA24
No ratings yet
Lecture04 CH 04 ContinuousDistributions Baron Inf Stats FA24
46 pages
Generalized Linear Models
No ratings yet
Generalized Linear Models
109 pages
Statistics - Lecture 7
No ratings yet
Statistics - Lecture 7
47 pages
CQF ML Lab Estimating Default Probability With Logistic Regression
No ratings yet
CQF ML Lab Estimating Default Probability With Logistic Regression
7 pages
Exponential and Location-Scale Families
No ratings yet
Exponential and Location-Scale Families
19 pages
θ, then the probability density function for Y, θ), can be written as  y∣=exp  ybcd  y θ) is called the natural −m  n y ,
No ratings yet
θ, then the probability density function for Y, θ), can be written as  y∣=exp  ybcd  y θ) is called the natural −m  n y ,
6 pages
Problems and Solutions 4 PDF
No ratings yet
Problems and Solutions 4 PDF
67 pages
Exp Family
No ratings yet
Exp Family
7 pages
Lec 38
No ratings yet
Lec 38
8 pages
Exponential Families: Peter D. Hoff September 26, 2013
No ratings yet
Exponential Families: Peter D. Hoff September 26, 2013
9 pages
w6 - Statistical Modelling
No ratings yet
w6 - Statistical Modelling
24 pages
Statistical Machine Learning 1665832214
No ratings yet
Statistical Machine Learning 1665832214
55 pages
Common Life Distributions: 40 Probability Distributions Used in Reliability Engineering
No ratings yet
Common Life Distributions: 40 Probability Distributions Used in Reliability Engineering
9 pages
Characteristic Function Exercises
No ratings yet
Characteristic Function Exercises
10 pages
Queuing Theory PDF
No ratings yet
Queuing Theory PDF
38 pages
Generalised Linear Models and Bayesian Statistics
No ratings yet
Generalised Linear Models and Bayesian Statistics
35 pages
Probability and Statistics Cheat Sheet
100% (3)
Probability and Statistics Cheat Sheet
28 pages
SMM Not 3
No ratings yet
SMM Not 3
11 pages
Probability Distributions & Theorems
No ratings yet
Probability Distributions & Theorems
18 pages
College Statistics
No ratings yet
College Statistics
244 pages
Bstract: Date: August 16, 2005. Giller Investments Research Note: 20031222/1
No ratings yet
Bstract: Date: August 16, 2005. Giller Investments Research Note: 20031222/1
7 pages
Exponential Family: X H A X T X F X X X F
No ratings yet
Exponential Family: X H A X T X F X X X F
3 pages
Cheat Sheet
No ratings yet
Cheat Sheet
5 pages
Equation Sheet
No ratings yet
Equation Sheet
2 pages
Cumulants
No ratings yet
Cumulants
6 pages
Categorical Notes Ch4
No ratings yet
Categorical Notes Ch4
40 pages
PT2425 Cheatsheet Updatedv2
No ratings yet
PT2425 Cheatsheet Updatedv2
5 pages
15 GLM
No ratings yet
15 GLM
32 pages
Advanced Statistics
100% (1)
Advanced Statistics
131 pages
MIT18 S096F13 Lecnote3
No ratings yet
MIT18 S096F13 Lecnote3
7 pages
Va Tu SC 06
No ratings yet
Va Tu SC 06
23 pages
Halat
No ratings yet
Halat
12 pages
Exponential Family
No ratings yet
Exponential Family
45 pages
Exponential Families
No ratings yet
Exponential Families
42 pages
Data Structure Exam Solutions
No ratings yet
Data Structure Exam Solutions
299 pages
FEM 2063 - Data Analytics: CHAPTER 4: Classifications
100% (2)
FEM 2063 - Data Analytics: CHAPTER 4: Classifications
76 pages
032) Statistical Thermodynamics or Gases
No ratings yet
032) Statistical Thermodynamics or Gases
18 pages
Factor Analysis
No ratings yet
Factor Analysis
44 pages
TOC: IPU Intro and Unit1
No ratings yet
TOC: IPU Intro and Unit1
33 pages
Report Formula Summary Functions: Parentgroupval and Prevgroupval
No ratings yet
Report Formula Summary Functions: Parentgroupval and Prevgroupval
2 pages
FJ ORD
No ratings yet
FJ ORD
14 pages
Fake Profile Identification in Social Network Using Machine Learning and NLP
No ratings yet
Fake Profile Identification in Social Network Using Machine Learning and NLP
8 pages
Obfuscate
No ratings yet
Obfuscate
2 pages
Banker's Algorithm in Operating System
67% (3)
Banker's Algorithm in Operating System
4 pages
Deep Learning with Keras Guide
No ratings yet
Deep Learning with Keras Guide
12 pages
Practice Questions 2
No ratings yet
Practice Questions 2
1 page
Special Theory of Relativity (Solution)
No ratings yet
Special Theory of Relativity (Solution)
5 pages
Extended Applications of Linear Algebra in Electrical Engineering
No ratings yet
Extended Applications of Linear Algebra in Electrical Engineering
5 pages
MEVD-104 DIGITAL SIGNAL PROCESSING RGPV Dec-2011
No ratings yet
MEVD-104 DIGITAL SIGNAL PROCESSING RGPV Dec-2011
2 pages
A Deep Reinforcement-Learning Approach For Inverse Kinematics Solution of A High Degree of Freedom Robotic Manipulator
No ratings yet
A Deep Reinforcement-Learning Approach For Inverse Kinematics Solution of A High Degree of Freedom Robotic Manipulator
17 pages
Grade 11 Statistics Worksheet
No ratings yet
Grade 11 Statistics Worksheet
2 pages
Type 1: N Jobs × 2 Machine Problems: Sequencing
No ratings yet
Type 1: N Jobs × 2 Machine Problems: Sequencing
2 pages
07 DictionaryCoding
No ratings yet
07 DictionaryCoding
37 pages
Determination of Optimum SLA Process Parameters of H-Shaped Parts
No ratings yet
Determination of Optimum SLA Process Parameters of H-Shaped Parts
7 pages
Quiz 3
No ratings yet
Quiz 3
2 pages
Computer Aided Civil Eng - 2024 - Yao - Deep Reinforcement Learning Based Active Mass Driver Decoupled Control Framework
No ratings yet
Computer Aided Civil Eng - 2024 - Yao - Deep Reinforcement Learning Based Active Mass Driver Decoupled Control Framework
24 pages
Using Machine Learning Algorithms To Detect Milk Quality (#1180639) - 2673216
No ratings yet
Using Machine Learning Algorithms To Detect Milk Quality (#1180639) - 2673216
12 pages
Power BI Week 2
No ratings yet
Power BI Week 2
43 pages
Unit-19 Assignment brief 2 - đã chuyển đổi
No ratings yet
Unit-19 Assignment brief 2 - đã chuyển đổi
26 pages
Soft Computing Manual
No ratings yet
Soft Computing Manual
14 pages
Midterm Examination Advanced Mathematics For Me
No ratings yet
Midterm Examination Advanced Mathematics For Me
3 pages
Unit 3 - Binary
No ratings yet
Unit 3 - Binary
10 pages
ORno 6
No ratings yet
ORno 6
1 page
Convolutional Layer Examples
No ratings yet
Convolutional Layer Examples
69 pages

Generalized Linear Models and Exponential Family

Uploaded by

Generalized Linear Models and Exponential Family

Uploaded by

Generalized Linear Models and Exponential Family

1.1 Bernoulli and Gaussian Distributions as Exponential Family

Thus, Bernoulli distribution is an example of the exponential family distributions.

which gives us that

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 2

1.2 Log Partition Function

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 3

= E[S(y) ] − E[S(y)] E[S(y)]

= E[S(y)2 ] − E[S(y)]2 (23)

∇2 A(η) = cov[S(y)]. (26)

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 4

which is the variance of the first compoment, y, of the suﬃcient statistic.

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 5

p(y 1 , ..., y n |T = t; θ) = p(y 1 , ..., y n |T = t) =⇒ T suﬃcient for θ.

Proof: Using Bayes theorem, we have

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 6

p(x1 , ..., xn ; θ) = f (x1 , ..., xn )g(r(x1 , ..., xn ); θ) (36)

Lets consider an example to see the factorization theorem.

Thus, x̂ is a suﬃcient statistic for µ, when σ 2 is known.

For general case, we have:

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 7

2.1 Constructing GLMs

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 8

2.1.1 GLMs: Linear Regression with L22 Loss

2.1.2 GLMs: Logistic Regression

= Sigmoid(w⊺ ϕ(x)). (42)

2.2 GLMs: Softmax Regression

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 9

2.2.1 Softmax Regression as GLM

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 10

E[S(y)i |x] = E[I{y = i}|x] = p(y = i|x) = µi (49)

A(η) = − log(µq ) (58)

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 11

Furthermore, our functional mapping F W (ϕ(x)) is given by:

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 12

3.1 Maximizing Log Likelihood: Gradient Descent

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 13

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 14

We can simplify it further as:

∇wj LL(W ) = 1 − ∑K I{y = j} − ∑K

= δij − ∑K ϕ(xi ) (95)

Sanjeev Sharma, Founder and CEO, Swaayatt Robots 15

You might also like