Generalized Linear Models and Exponential Family
DeepEigen | Course: AI-1.0X: Machine Learning | Year: 2019 | Instructor: Sanjeev Sharma
1 Exponential Family
This section describes a family of probability distributions known as the exponential family of distributions. Many commonly used probability distributions belong to the exponential family. A probability density function (pdf) or probability mass function (pmf) p(y; η), for y ∈ R^l and η ∈ R^n, is in the exponential family if it can be written in the form
p(y; η) = (1/Z(η)) λ(y) exp{η⊺ S(y)}                                   (1)
        = exp{−log(Z(η))} λ(y) exp{η⊺ S(y)}
        = λ(y) exp{η⊺ S(y) − A(η)},                                    (2)
where
Z(η) = ∫_{y∈R^l} λ(y) exp{η⊺ S(y)} dy                                  (3)
A(η) = log(Z(η))                                                       (4)
Here Z(η) is known as the partition function and A(η) is known as the log-partition function (also called the cumulant function). The partition function Z(η) (equivalently, the factor exp(−A(η)) = 1/Z(η)) plays the role of a normalization constant, making sure that p(y; η) integrates or sums to 1 over y. Here η is called the natural parameter (it is also called the canonical parameter); S(y) is the vector of sufficient statistics; and λ(y) is a scaling constant.
A fixed choice of S(y), A(η), and λ(y) defines a family of distributions that is parametrized by η; by varying η we get different distributions within the family. The next sections show that the Bernoulli and the Gaussian distributions are examples of exponential family distributions.
1.1 Bernoulli and Gaussian Distributions as Exponential Family
1.1.1 Bernoulli
The Bernoulli distribution (B) for y ∈ {0, 1}, with mean µ, is B(y; µ) = µ^y (1 − µ)^{1−y}. Here p(y = 1; µ) = µ, and p(y = 0; µ) = 1 − µ. By varying µ, we obtain Bernoulli distributions with different means. We now show that the Bernoulli distribution can be written as an exponential family distribution. We have

p(y; µ) = µ^y (1 − µ)^{1−y}
        = exp{y log(µ) + (1 − y) log(1 − µ)}
        = exp{y log(µ/(1 − µ)) + log(1 − µ)}                            (5)
        = λ(y) exp{η⊺ S(y) − A(η)}.                                     (6)
By comparing eq.(5) and eq.(6) we obtain:
λ(y) = 1
η = log(µ/(1 − µ))  ⟹  µ = 1/(1 + e^{−η}) = Sigmoid(η)                  (7)
S(y) = y
A(η) = −log(1 − µ) = log(1/(1 − µ)) = log(1 + e^η).
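To make eqs.(5)–(7) concrete, here is a small numerical check (my addition, in Python; not part of the original derivation) that the exponential-family form with η = log(µ/(1−µ)), S(y) = y, λ(y) = 1, and A(η) = log(1 + e^η) reproduces the Bernoulli pmf:

```python
import numpy as np

def bernoulli_pmf(y, mu):
    return mu**y * (1 - mu)**(1 - y)

def exp_family_pmf(y, mu):
    eta = np.log(mu / (1 - mu))          # natural parameter, eq.(7)
    A = np.log(1 + np.exp(eta))          # log-partition function
    return np.exp(eta * y - A)           # lambda(y) = 1, S(y) = y

for mu in (0.2, 0.5, 0.9):
    for y in (0, 1):
        assert np.isclose(bernoulli_pmf(y, mu), exp_family_pmf(y, mu))
print("Bernoulli matches its exponential-family form.")
```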
1.1.2 Gaussian
The univariate Gaussian (Normal) distribution of y ∈ R with mean µ and variance σ², i.e., y ∼ N(µ, σ²), is:

p(y; µ, σ²) = (1/(√(2π) σ)) exp{−(1/(2σ²))(y − µ)²}
            = (1/(√(2π) σ)) exp{−(1/(2σ²)) y² + (µ/σ²) y − (1/(2σ²)) µ²}
            = (1/(√(2π) σ)) exp{[µ/σ², −1/(2σ²)]⊺ [y, y²] − µ²/(2σ²)}          (8)
            = [1/(√(2π) σ exp(µ²/(2σ²)))] exp{[µ/σ², −1/(2σ²)]⊺ [y, y²]}       (9)

where the first factor in eq.(9) is 1/Z(η) and the second factor is λ(y) exp(η⊺ S(y)), with η = [µ/σ², −1/(2σ²)]⊺ and S(y) = [y, y²]⊺.
Recall that in linear regression, we saw that the variance σ² played no role in the maximum likelihood estimate of the learnable parameters (weight vector) w, and it also had no effect on the mapping function F_w(ϕ(x)). Thus, we can set σ² = 1 for simplicity, which gives us
λ(y) = (1/√(2π)) exp(−y²/2)                                             (16)
η = µ                                                                   (17)
S(y) = y                                                                (18)
A(η) = µ²/2 = η²/2  ⟹  Z(η) = exp(η²/2)                                 (19)
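As a quick sanity check (my addition), the parameters in eqs.(16)–(19) can be verified numerically against the N(µ, 1) density:

```python
import numpy as np

def gaussian_pdf(y, mu):
    return np.exp(-0.5 * (y - mu)**2) / np.sqrt(2 * np.pi)

def exp_family_pdf(y, mu):
    lam = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)   # lambda(y), eq.(16)
    eta = mu                                       # eq.(17)
    A = eta**2 / 2                                 # eq.(19)
    return lam * np.exp(eta * y - A)               # S(y) = y, eq.(18)

ys = np.linspace(-4.0, 4.0, 9)
for mu in (-1.0, 0.0, 2.5):
    assert np.allclose(gaussian_pdf(ys, mu), exp_family_pdf(ys, mu))
print("N(mu, 1) matches its exponential-family form.")
```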
The derivatives of the log-partition function A(η) generate the cumulants of the sufficient statistic. In terms of the raw moments m_k = E[x^k], the first four cumulants are:

c_1 = m_1 = E[x]                                    (mean)
c_2 = m_2 − m_1² = E[x²] − E[x]²                    (variance)
c_3 = m_3 − 3m_1 m_2 + 2m_1³                        (skewness)
c_4 = m_4 − 3m_2² − 4m_1 m_3 + 12m_1² m_2 − 6m_1⁴   (kurtosis)
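As a hedged numerical illustration of these formulas (my addition), one can estimate the raw moments of a sample and convert them to cumulants; for an Exponential distribution with scale b, the cumulants are known to be c_k = (k−1)! b^k, which the estimates should approximate:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)

# Sample estimates of the raw moments m_k = E[x^k].
m1, m2, m3, m4 = (np.mean(x**k) for k in (1, 2, 3, 4))

c1 = m1
c2 = m2 - m1**2
c3 = m3 - 3*m1*m2 + 2*m1**3
c4 = m4 - 3*m2**2 - 4*m1*m3 + 12*m1**2*m2 - 6*m1**4

# For Exponential(b=2): c1 = 2, c2 = 4, c3 = 16, c4 = 96 (approximately).
print(c1, c2, c3, c4)
```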
In these notes, we refrain from discussing any additional properties of cumulants in general; interested readers can find more details in a good statistics textbook. We will now see how the derivatives of A(η) provide us the cumulants. We will only compute the first two cumulants. We have A(η) = log(Z(η)), which gives us A(η) = log{∫ λ(y) exp[η⊺ S(y)] dy}. We will assume that η is a scalar – this was the case for the Bernoulli distribution (Section-1.1.1) and the univariate Gaussian (Section-1.1.2) with σ² = 1. Thus, A(η) = log ∫ λ(y) exp(ηS(y)) dy. The first derivative of A(η) is:
dA(η)/dη = d/dη log{∫ λ(y) exp(ηS(y)) dy}
         = [1/∫ λ(y) exp(ηS(y)) dy] · d/dη ∫ λ(y) exp(ηS(y)) dy
         = ∫ S(y) λ(y) exp(ηS(y)) dy / exp(A(η))                        (20)
         = ∫ S(y) λ(y) exp(ηS(y) − A(η)) dy = ∫ S(y) p(y; η) dy         (21)
         = E_{y∼p(y;η)}[S(y)]                                           (22)

where we have used the definition of the expectation of a function F of some random variable x ∼ p(x), which is given by E_{x∼p(x)}[F(x)] = ∫ F(x) p(x) dx, in eq.(21) to obtain eq.(22). Thus, the first derivative of the log-partition function A(η) provides us the mean (expected value) of the sufficient statistic. Now we will compute the second derivative, making use of the first derivative computed above:
d²A(η)/dη² = d/dη ∫ S(y) λ(y) exp(ηS(y) − A(η)) dy
           = ∫ S(y) λ(y) exp[ηS(y) − A(η)] (S(y) − A′(η)) dy
           = ∫ S(y)² λ(y) exp[ηS(y) − A(η)] dy − A′(η) ∫ S(y) λ(y) exp[ηS(y) − A(η)] dy
           = ∫ S(y)² p(y; η) dy − A′(η) ∫ S(y) p(y; η) dy
           = E[S(y)²] − A′(η) E[S(y)]
           = E[S(y)²] − E[S(y)]²
Thus, the second derivative of the log-partition function is the variance of the sufficient statistic S(y). It can easily be verified that, for the multivariate case, we have

∂²A(η)/(∂η_i ∂η_j) = E[S_i(y) S_j(y)] − E[S_i(y)] E[S_j(y)]             (25)

where S_k(y) denotes the k-th component of the vector of sufficient statistics S(y). We thus get:

∇²A(η) = cov[S(y)]                                                      (26)

Note that earlier we had discussed that a twice differentiable function f(x): R^n → R is convex iff ∇²f(x) ⪰ 0, i.e., the Hessian is a positive semidefinite matrix. From eq.(26), we have that ∇²A(η) = cov[S(y)], and thus the log-partition function is a convex function, where we have used the property of covariance matrices, i.e., that they are positive (semi-)definite matrices.
Next, we revisit the Bernoulli and Gaussian distributions to show how the first derivative of the log-partition function computes the mean.
dA(η)/dη = d/dη log(1 + e^η)
         = e^η/(1 + e^η)
         = 1/(1 + e^{−η})                                               (27)
         = µ                                                            (28)
where eq.(28) follows from eq.(7) and eq.(27). Note that the sufficient statistic is S(y) = y, and µ is the expected value of y, a Bernoulli random variable (E[y] = p(y = 1; µ) × 1 + p(y = 0; µ) × 0 = µ). Furthermore, the second derivative of A(η) is
d²A(η)/dη² = d/dη 1/(1 + e^{−η}) = d/dη sigmoid(η)
           = sigmoid(η)(1 − sigmoid(η))
           = µ(1 − µ)                                                   (29)
           = var[S(y)]   (= var[y]; y ∼ B(µ))                           (30)
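A small finite-difference check (my addition) confirms eqs.(27)–(30) numerically: differentiating A(η) = log(1 + e^η) recovers the Bernoulli mean and variance.

```python
import numpy as np

def A(eta):
    # Bernoulli log-partition function, A(eta) = log(1 + e^eta)
    return np.log(1 + np.exp(eta))

eta, h = 0.7, 1e-4
mu = 1 / (1 + np.exp(-eta))                          # sigmoid(eta), eq.(7)

dA = (A(eta + h) - A(eta - h)) / (2 * h)             # central first difference
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # central second difference

assert np.isclose(dA, mu, atol=1e-6)                 # first cumulant:  E[y] = mu
assert np.isclose(d2A, mu * (1 - mu), atol=1e-5)     # second cumulant: var[y]
```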
In the case of the univariate Gaussian distribution with σ² = 1, we have A(η) = η²/2. Thus,

dA(η)/dη = η = µ                                                        (31)
d²A(η)/dη² = 1                                                          (32)
Now let us calculate the derivative for the univariate Gaussian with variance σ². In this case, A(η) = (1/2) log(2π) − (1/2) log(−2η_2) − η_1²/(4η_2). Note that the sufficient statistic S(y) is now a vector, [y, y²]⊺. Taking the derivative with respect to η_1 gives:

∂A(η)/∂η_1 = ∂/∂η_1 {(1/2) log(2π) − (1/2) log(−2η_2) − η_1²/(4η_2)}
           = −2η_1/(4η_2) = (−2µ/σ²)/(−2/σ²)
           = µ
which is the mean of the first component of the sufficient statistic, i.e., of y. Furthermore, the second derivative with respect to η_1 is:
∂²A(η)/∂η_1² = ∂/∂η_1 (−2η_1/(4η_2))                                    (33)
             = −1/(2η_2) = −1/(2 · (−1/(2σ²)))                          (34)
             = σ²                                                       (35)
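The same finite-difference check (my addition) works for this two-parameter case, recovering µ and σ² from A(η) in the natural parameters η_1 = µ/σ², η_2 = −1/(2σ²):

```python
import numpy as np

def A(eta1, eta2):
    # Gaussian log-partition function in natural parameters.
    return 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-2 * eta2) - eta1**2 / (4 * eta2)

mu, sigma2 = 1.5, 0.8
eta1, eta2 = mu / sigma2, -1 / (2 * sigma2)
h = 1e-4

dA = (A(eta1 + h, eta2) - A(eta1 - h, eta2)) / (2 * h)
d2A = (A(eta1 + h, eta2) - 2 * A(eta1, eta2) + A(eta1 - h, eta2)) / h**2

assert np.isclose(dA, mu)                    # dA/d(eta_1) = mu, eqs.(33)-(35)
assert np.isclose(d2A, sigma2, atol=1e-5)    # d^2A/d(eta_1)^2 = sigma^2
```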
1.3 Sufficient Statistics
Let {y^1, ..., y^n}, y^i ∈ R^N, be generated from some distribution p(y; θ), i.e., {y^1, ..., y^n} ∼ p(y; θ), where θ is some unknown parameter of the distribution, which is not necessarily a random variable (in the Bayesian setting, we typically assume a distribution on the parameters, in which case θ ∼ p(θ)). Furthermore, θ might be a parameter that we would like to learn. For example, suppose y is a Bernoulli random variable with mean µ, i.e., y ∼ B(y; µ), with p(y = 1; µ) = µ. Given the sample {y^1, ..., y^n}, we may wish to estimate µ.
What is a statistic?
A statistic is any real-valued function, T = r(y^1, ..., y^n), of the observations {y^1, ..., y^n}. T must not involve any unknown parameters. For example:
T_1 = mean{y^1, ..., y^n}
T_2 = max{y^1, ..., y^n}
T_3 = 100
T_4 = y^1 + γ
Here the first three examples, T_1, T_2, T_3, are all valid statistics. However, in T_4, if γ is unknown, then T_4 is not a statistic. The third example is somewhat strange, as it ignores the random sample, but it is still a statistic.
1.3.1 Sufficiency
Informally, we say T is a sufficient statistic (for a parameter θ) if knowing the value of T is as good as knowing the entire random sample for estimating the value of θ. That is, the random sample contains no information about the parameter θ beyond what is contained in T. We can put this definition more formally: T(y) is a sufficient statistic for θ if the conditional distribution of y given T(y) does not depend on θ. Putting it mathematically:

Definition 1 Suppose {y^1, ..., y^n} ∼ p(y; θ). T is a sufficient statistic for θ if the conditional distribution of {y^1, ..., y^n} given T = t is independent of θ for each t.
To see this, we can consider an example of a Bernoulli random variable y ∈ {0, 1}, y ∼ B(µ).

Example 1 Let {y^1, ..., y^n} be IID Bernoulli trials, with p(y^i = 1; µ) = µ, i = 1, ..., n. Let T = Σ_{i=1}^{n} y^i = t. We will see that T is a sufficient statistic for µ:

p(y^1, ..., y^n | T = t) = p(y^1, ..., y^n)/p(T = t)
                         = µ^t (1 − µ)^{n−t} / [C(n, t) µ^t (1 − µ)^{n−t}]
                         = 1/C(n, t)

The conditional distribution does not involve µ, which proves that Σ_{i=1}^{n} y^i is a sufficient statistic for µ.
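An empirical illustration of Example 1 (my addition): simulating Bernoulli sequences for two different values of µ and conditioning on T = t yields the same uniform distribution 1/C(n, t) over sequences in both cases.

```python
import numpy as np
from collections import Counter
from math import comb

def conditional_counts(mu, n=4, t=2, trials=200_000, seed=0):
    # Sample Bernoulli(mu) sequences of length n; keep those with sum == t.
    rng = np.random.default_rng(seed)
    seqs = (rng.random((trials, n)) < mu).astype(int)
    kept = seqs[seqs.sum(axis=1) == t]
    return Counter(map(tuple, kept))

for mu in (0.3, 0.8):
    counts = conditional_counts(mu)
    total = sum(counts.values())
    print(f"mu={mu}: p(sequence | T=2) ~",
          sorted(round(c / total, 3) for c in counts.values()),
          "; theory: 1/C(4,2) =", round(1 / comb(4, 2), 3))
```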
Theorem 1 (Factorization Theorem) Let {x^1, ..., x^n} be a set of random samples with joint density p(x^1, ..., x^n; θ). A statistic T(x) = r(x^1, ..., x^n) is sufficient if and only if the joint density can be factored as follows:

p(x^1, ..., x^n; θ) = f(x^1, ..., x^n) g(T(x); θ)

where the functions f and g are non-negative functions. The function f may depend on x but does not depend on θ. The function g depends on θ and depends on the observed values of the samples x^i, i = 1, ..., n, only through the value of the sufficient statistic T(x).
Example 2 Let x^1, ..., x^n ∼ N(µ, σ²). Let x^n represent the random sample set, i.e., x^n = {x^1, ..., x^n}, and let x̂ = (1/n) Σ_i x^i. Further, assume that σ² is known. We have

p(x^n; µ, σ²) = (1/(2πσ²)^{n/2}) exp{−Σ_i (x^i − µ)²/(2σ²)}
              = (1/(2πσ²)^{n/2}) exp{−Σ_i (x^i − x̂ + x̂ − µ)²/(2σ²)}
              = (1/(2πσ²)^{n/2}) exp{−[Σ_i (x^i − x̂)² + n(x̂ − µ)² − 2(x̂ − µ) Σ_i (x^i − x̂)]/(2σ²)}
              = (1/(2πσ²)^{n/2}) exp{−[Σ_i (x^i − x̂)² + n(x̂ − µ)² − 2(x̂ − µ)(nx̂ − nx̂)]/(2σ²)}
              = (1/(2πσ²)^{n/2}) exp{−[Σ_i (x^i − x̂)² + n(x̂ − µ)²]/(2σ²)}
              = (1/(2πσ²)^{n/2}) exp{−Σ_i (x^i − x̂)²/(2σ²)} × exp{−n(x̂ − µ)²/(2σ²)}

where the first factor is f(x^n) and the second factor is g(T(x^n); µ), with T(x^n) = x̂.
We omit a detailed discussion of the factorization theorem and minimal sufficient statistics; interested readers can find more information in a good statistics textbook.
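A quick numeric check (my addition) of the key algebraic step in Example 2, Σ_i (x^i − µ)² = Σ_i (x^i − x̂)² + n(x̂ − µ)², which is what lets the joint density factor into f(x^n) g(x̂; µ):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=50)
mu, n = 3.0, x.size
xbar = x.mean()

lhs = np.sum((x - mu)**2)
rhs = np.sum((x - xbar)**2) + n * (xbar - mu)**2  # cross term vanishes
assert np.isclose(lhs, rhs)
```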
2 Generalized Linear Models
Here Φ ∈ R^{m×N} is the feature matrix and y ∈ R^m is a vector of target values, as discussed in the earlier lectures. The conditional distribution of y given x is thus a Normal distribution with mean F_w(ϕ(x)) and variance σ², i.e., y|x; w ∼ N(F_w(ϕ(x)), σ²). If we instead assume that the noise terms are IID and have a Laplacian distribution with mean 0 and scale b, i.e., ϵ^i ∼ L(0, b), then the maximum likelihood estimate will result in minimization of the L1 error on the dataset, i.e., the following optimization problem:
w_mle = argmin_w ∥y − Φw∥_1                                             (39)
We have seen earlier, in the regularization discussion, that assuming a Laplacian prior on the parameter vector w results in L1-regularized regression. Following similar steps, computing the likelihood of the parameters and then maximizing the log-likelihood (or minimizing the negative log-likelihood) results in (39). As we saw in the regularized regression example, there is no closed-form solution for (39), and the solution to the optimization problem is obtained via sub-gradient descent. In the regression model, we assumed that the noise is additive, Laplacian or Gaussian.
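As a minimal sketch (my addition; the notes only state that a sub-gradient method is used), eq.(39) can be solved by iterating w ← w − α · (−Φ⊺ sign(y − Φw)), where −Φ⊺ sign(y − Φw) is a subgradient of the L1 loss; the step size and toy data below are assumptions for illustration.

```python
import numpy as np

def l1_regression_subgradient(Phi, y, lr=1e-3, steps=5000):
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        residual = y - Phi @ w
        subgrad = -Phi.T @ np.sign(residual)   # subgradient of ||y - Phi w||_1
        w -= lr * subgrad
    return w

# Toy usage with a hypothetical design matrix and Laplacian noise:
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = Phi @ w_true + rng.laplace(scale=0.1, size=100)
print(l1_regression_subgradient(Phi, y))       # close to w_true
```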
In the classification example, using logistic regression, we considered y ∈ {0, 1} to be a Bernoulli random variable with mean µ, i.e., y ∼ B(µ). The conditional distribution of y given x is a Bernoulli distribution, with p(y = 1|x; µ) = µ. Generalized Linear Models (GLMs) are a statistical framework that unifies regression and classification. Furthermore, GLMs allow us to easily consider other probability models, not just the Laplacian, Gaussian, or Bernoulli.
We have already seen that both the Gaussian and the Bernoulli can be represented as exponential family distributions. We will now see that both linear regression and logistic regression are special cases of GLMs. Furthermore, we will use GLMs to derive a learning algorithm known as Softmax Regression.
• p(y|x; w) ∼ ξ(η); i.e., the distribution of y given x, parametrized by w, follows (or can be represented as) an exponential family distribution with natural parameter η and conditional mean µ = E[y|x], where η = Ψ^{−1}(µ) for some real-valued function Ψ(·).
where (w^j)⊺ = W_j ∈ R^N is the j-th row of the matrix W. We can now use gradient descent or Newton's Method to maximize the likelihood of the parameters, to compute W⋆ = W_MLE. Furthermore, w^q = 0, as we will see later. We will now construct a GLM to derive the formulation of the Softmax Regression learning algorithm.
We can represent S(y) more compactly using an indicator function. An indicator function I{E} is equal to 1 if the event E is true, and equal to 0 if the event E is false. We can thus write the sufficient statistic S(y) as:

S(y) = [I{y = 1}, I{y = 2}, ..., I{y = q − 1}]⊺                         (48)
Our goal is now to compute the conditional expectation of the sufficient statistic S(y) given x. The expected value of the sufficient statistic, which is a (q − 1)-dimensional vector, is given by the µ_i, i = 1, ..., q − 1. Thus, our goal is to learn the stochastic functional mapping F_W(ϕ(x)), which computes the vector of conditional probabilities µ_i = p(y = i|x; W), i = 1, ..., q. Note that we will only compute µ_i for i = 1, ..., q − 1, and then µ_q = 1 − Σ_{i=1}^{q−1} µ_i. This mapping function will be the Softmax function, or the normalized exponential function. We will now express the multinomial distribution as an exponential family distribution (ξ(η)) with natural parameter η ∈ R^{q−1}. Hereafter µ = [µ_1, ..., µ_q]⊺, where µ_q can be computed from µ_1, ..., µ_{q−1} as discussed earlier. We have
p(y; µ) = ∏_{i=1}^{q} p(y = i; µ)^{I{y=i}}                              (50)
        = ∏_{i=1}^{q} µ_i^{I{y=i}}                                      (51)
        = {∏_{i=1}^{q−1} µ_i^{I{y=i}}} µ_q^{I{y=q}}                     (52)
        = {∏_{i=1}^{q−1} µ_i^{S(y)_i}} µ_q^{1 − Σ_{j=1}^{q−1} S(y)_j}   (53)
        = exp{Σ_{i=1}^{q−1} S(y)_i log(µ_i) + (1 − Σ_{j=1}^{q−1} S(y)_j) log(µ_q)}   (54)
        = exp{Σ_{i=1}^{q−1} S(y)_i log(µ_i/µ_q) + log(µ_q)}             (55)
        = λ(y) exp{η⊺ S(y) − A(η)}                                      (56)
which gives us, reading off eq.(55),

η = [log(µ_1/µ_q), log(µ_2/µ_q), ..., log(µ_{q−1}/µ_q)]⊺                (57)
λ(y) = 1                                                                (58)
A(η) = −log(µ_q)                                                        (59)
Thus, the multinomial distribution belongs to the exponential family with the above parameters. For convenience, we can extend η to R^q with η_q = log(µ_q/µ_q) = 0. From the equations above, we can now obtain µ as:

η_i = log(µ_i/µ_q)  ⟹  µ_q e^{η_i} = µ_i.                               (60)
By summing over all i = 1, ..., q, we get:

Σ_{i=1}^{q} µ_q e^{η_i} = Σ_{i=1}^{q} µ_i = 1                           (61)
⟹  µ_q = 1/Σ_{i=1}^{q} e^{η_i}                                          (62)
where w^i ∈ R^N, i = 1, ..., q − 1, are the model parameters, and w^q = 0 (not a model parameter, mentioned only for convenience) is an N-dimensional vector with all its components being 0. We can define a q × N matrix W (whose last row is not a model parameter and is represented only for convenience), which contains all these model parameters:

    | —(w^1)⊺— |
W = | —(w^2)⊺— |                                                        (66)
    |    ...   |
    |  —0⊺—    | (q × N)
F_W(ϕ(x)) = E[S(y)|x]
          = E[[I{y = 1}, I{y = 2}, I{y = 3}, ..., I{y = q − 1}]⊺ | x]   (67)
          = [µ_1, µ_2, µ_3, ..., µ_{q−1}]⊺                              (68)
          = [e^{ϕ(x)⊺w^1}/Σ_{j=1}^{q} e^{ϕ(x)⊺w^j},
             e^{ϕ(x)⊺w^2}/Σ_{j=1}^{q} e^{ϕ(x)⊺w^j},
             ...,
             e^{ϕ(x)⊺w^{q−1}}/Σ_{j=1}^{q} e^{ϕ(x)⊺w^j}]⊺                (69)
We thus obtain the same result, after deriving the Softmax Regression using GLMs, as mentioned in eq-45. This
finishes our discussion on Generalized Linear Models.
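A small sketch (my addition) of the mapping in eq.(69): µ is the softmax of the logits ϕ(x)⊺w^i, with the last parameter vector w^q fixed to zero; the weights below are hypothetical.

```python
import numpy as np

def softmax_predict(W, phi_x):
    """W: (q-1, N) matrix with rows w^1..w^{q-1}; the q-th row is implicitly 0."""
    logits = np.concatenate([W @ phi_x, [0.0]])   # append eta_q = 0
    logits -= logits.max()                        # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum()                            # mu_1, ..., mu_q

W = np.array([[1.0, -0.5], [0.2, 0.3]])           # q = 3 classes, N = 2 features
print(softmax_predict(W, np.array([0.5, 1.0])))   # probabilities summing to 1
```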
3 Softmax Regression
Having presented the Softmax Regression learning algorithm, which can be used for multi-class classification tasks, we will now see how we can learn the parameters using the gradient descent algorithm. Assume that we have a dataset D = {x^i, y^i}, i = 1, ..., m, with x ∈ R^k and y ∈ {1, 2, 3, ..., K}. Given the feature function ϕ(x): R^k → R^N, the goal is to learn F_W(ϕ(x)): R^N → [0, 1]^K, where [0, 1]^K represents a K-dimensional vector with each component belonging to the set [0, 1]. The conditional probabilities are

p(y = i|x; W) = µ_i = e^{W_i ϕ(x)}/Σ_{j=1}^{K} e^{W_j ϕ(x)},

where W_i is the i-th row of the matrix W ∈ R^{K×N}.
We can compute the maximum likelihood estimate (MLE) of the parameters W by maximizing the likelihood
function L(W ), as we did in Logistic Regression:
L(W) = ∏_{i=1}^{m} p(y^i |x^i; W)                                       (71)
LL(W) = log L(W) = Σ_{i=1}^{m} log p(y^i |x^i; W)                       (72)
      = Σ_{i=1}^{m} log ∏_{j=1}^{K} (µ_{ij})^{I{y^i = j}}               (73)
      = Σ_{i=1}^{m} log ∏_{j=1}^{K} (e^{W_j ϕ(x^i)}/Σ_{s=1}^{K} e^{W_s ϕ(x^i)})^{I{y^i = j}}   (74)
where W_l is a row vector (the l-th row of matrix W), which is (w^l)⊺, l = 1, ..., K. We can maximize the log-likelihood function LL(W) to obtain the MLE estimate of the parameters W. This can be done using algorithms like Gradient Descent or Newton's Method.
We can use the chain rule of differentiation to compute ∇_{w^j} LL(W). Writing z^j = ϕ(x)⊺w^j for the logits (the quantity differentiated in eq.(77) below), we have:

∇_{w^j} LL(W) = ∂LL(W)/∂z^j × ∂z^j/∂w^j.                                (76)

Furthermore,

∂z^j/∂w^j = ∂(ϕ(x)⊺w^j)/∂w^j = ϕ(x).                                    (77)
Having all the basic mathematical expressions in place, we can now compute the gradient ∇_{w^j} LL(W). There are two cases: (i) y^i = j and (ii) y^i ≠ j.

Case-1: y^i = j

∇_{w^j} LL(W) = Σ_{i=1}^{m} ∂/∂z_i^j [log ∏_{j=1}^{K} (e^{z_i^j}/Σ_{s=1}^{K} e^{z_i^s})^{I{y^i = j}}] × ∂z_i^j/∂w^j,  with ∂z_i^j/∂w^j = ϕ(x^i)   (78)
= Σ_{i=1}^{m} ∂/∂z_i^j [log ∏_{j=1}^{K} (e^{z_i^j}/Σ_{s=1}^{K} e^{z_i^s})^{I{y^i = j}}] × ϕ(x^i)   (79)
= Σ_{i=1}^{m} ∂/∂z_i^j [log(e^{z_i^j}/Σ_{s=1}^{K} e^{z_i^s})] × ϕ(x^i)   (80)
= Σ_{i=1}^{m} (Σ_{s=1}^{K} e^{z_i^s}/e^{z_i^j}) ∂/∂z_i^j (e^{z_i^j}/Σ_{s=1}^{K} e^{z_i^s}) × ϕ(x^i)   (81)
= Σ_{i=1}^{m} (Σ_{s=1}^{K} e^{z_i^s}/e^{z_i^j}) [e^{z_i^j}/Σ_{s=1}^{K} e^{z_i^s} + e^{z_i^j} ∂/∂z_i^j (1/Σ_{s=1}^{K} e^{z_i^s})] ϕ(x^i)   (82)
= Σ_{i=1}^{m} (Σ_{s=1}^{K} e^{z_i^s}/e^{z_i^j}) [e^{z_i^j}/Σ_{s=1}^{K} e^{z_i^s} − (e^{z_i^j})²/(Σ_{s=1}^{K} e^{z_i^s})²] ϕ(x^i)   (83)
= Σ_{i=1}^{m} (1 − e^{z_i^j}/Σ_{s=1}^{K} e^{z_i^s}) ϕ(x^i)   (84)
= Σ_{i=1}^{m} (1 − e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^{K} e^{ϕ(x^i)⊺w^s}) ϕ(x^i)   (85)
Case-2: y^i = a ≠ j

∇_{w^j} LL(W) = Σ_{i=1}^{m} (Σ_{s=1}^{K} e^{z_i^s}/e^{z_i^a}) ∂/∂z_i^j (e^{z_i^a}/Σ_{s=1}^{K} e^{z_i^s}) × ϕ(x^i)   (86)
= Σ_{i=1}^{m} (Σ_{s=1}^{K} e^{z_i^s}/e^{z_i^a}) e^{z_i^a} ∂/∂z_i^j (1/Σ_{s=1}^{K} e^{z_i^s}) × ϕ(x^i)   (87)
= Σ_{i=1}^{m} −(Σ_{s=1}^{K} e^{z_i^s}/e^{z_i^a}) (e^{z_i^a} e^{z_i^j}/(Σ_{s=1}^{K} e^{z_i^s})²) ϕ(x^i)   (88)
= Σ_{i=1}^{m} (−e^{z_i^j}/Σ_{s=1}^{K} e^{z_i^s}) ϕ(x^i)   (89)
= Σ_{i=1}^{m} (−e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^{K} e^{ϕ(x^i)⊺w^s}) ϕ(x^i)   (90)
Having derived the results for both cases, we can now combine them to obtain the gradient ∇_{w^j} LL(W). By combining eq-85 and eq-90, we get:

∇_{w^j} LL(W) = Σ_{i=1}^{m} [(1 − e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^{K} e^{ϕ(x^i)⊺w^s}) I{y^i = j} + (−e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^{K} e^{ϕ(x^i)⊺w^s}) I{y^i ≠ j}] ϕ(x^i)   (91)
= Σ_{i=1}^{m} [I{y^i = j} − e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^{K} e^{ϕ(x^i)⊺w^s}] ϕ(x^i)   (94)
= Σ_{i=1}^{m} [δ_{ij} − e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^{K} e^{ϕ(x^i)⊺w^s}] ϕ(x^i)   (95)
where δ_{ij} = I{y^i = j}. Thus, we can compute the parameters using the Batch Gradient Descent algorithm, which uses all the training data at once to compute one update of the parameters. We can instead use mini-batch or stochastic gradient descent, which updates the parameters after observing every training data point, as:
∇_{w^j} LL(W; D_i) = (δ_{ij} − e^{ϕ(x^i)⊺w^j}/Σ_{s=1}^{K} e^{ϕ(x^i)⊺w^s}) ϕ(x^i)   (96)
where ∇_{w^j} LL(W; D_i) is the gradient with respect to w^j, j = 1, ..., K, for the i-th training data point in the dataset D. Eq-96 gives the update rule of the stochastic gradient descent algorithm for the parameters w^j.
Furthermore, as discussed earlier, w^K = 0. Thus, all the other weights can be initialized to random values, and w^K can be initialized to 0; after every stochastic gradient descent step, w^K will be shifted to some value ≠ 0. We can thus, after every step of stochastic gradient descent, project w^K back to 0, or simply reset it to 0. This completes our discussion of batch gradient descent and stochastic gradient descent for Softmax Regression.
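To tie the section together, here is a compact stochastic-gradient-ascent sketch of eq.(96) (my illustration, with hypothetical toy data and 0-indexed labels, whereas the notes use {1, ..., K}): for each sample i, the gradient with respect to w^j is (δ_{ij} − µ_j) ϕ(x^i), and after every step w^K is reset to 0 as discussed above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sgd_softmax(X, y, K, lr=0.1, epochs=20, seed=0):
    """X: (m, N) rows phi(x^i); y: labels in {0, ..., K-1}."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    W = rng.normal(scale=0.01, size=(K, N))
    W[K - 1] = 0.0                                   # w^K is not a free parameter
    for _ in range(epochs):
        for i in rng.permutation(m):
            mu = softmax(W @ X[i])                   # class probabilities
            delta = np.zeros(K); delta[y[i]] = 1.0   # one-hot indicator of y^i
            W += lr * np.outer(delta - mu, X[i])     # ascent step, eq.(96)
            W[K - 1] = 0.0                           # project w^K back to 0
    return W

# Toy usage with three well-separated Gaussian clusters:
rng = np.random.default_rng(1)
centers = np.array([[3.0, 0.0], [0.0, 3.0], [-3.0, -3.0]])
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in centers])
y = np.repeat([0, 1, 2], 100)
W = sgd_softmax(X, y, K=3)
preds = np.array([softmax(W @ x).argmax() for x in X])
print("train accuracy:", (preds == y).mean())
```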