Introduction to Machine Learning
Softmax Regression
Learning goals
Know softmax regression
Understand that softmax regression is a generalization of logistic regression
FROM LOGISTIC REGRESSION ...
Remember logistic regression ($\mathcal{Y} = \{0, 1\}$): We combined the hypothesis space of linear functions, transformed by the logistic function $s(z) = \frac{1}{1 + \exp(-z)}$, i.e.

$$\mathcal{H} = \left\{ \pi : \mathcal{X} \to \mathbb{R} \;\middle|\; \pi(x) = s(\theta^\top x) \right\},$$

with the Bernoulli (logarithmic) loss:

$$L(y, \pi(x)) = -y \log(\pi(x)) - (1 - y) \log(1 - \pi(x)).$$
Remark: We suppress the intercept term for better readability. The intercept term can easily be included via $\theta^\top \tilde{x}$ with $\theta \in \mathbb{R}^{p+1}$ and $\tilde{x} = (1, x)$.
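As a small, purely illustrative sketch (not part of the original slides; the parameter and feature values below are made up), the logistic function and the Bernoulli loss for a single observation could be computed as follows:

```python
import numpy as np

def sigmoid(z):
    # logistic function s(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_loss(y, pi):
    # L(y, pi(x)) = -y * log(pi(x)) - (1 - y) * log(1 - pi(x))
    return -y * np.log(pi) - (1 - y) * np.log(1 - pi)

theta = np.array([0.5, -1.2])     # hypothetical parameter vector (intercept suppressed)
x = np.array([2.0, 1.0])          # hypothetical feature vector

pi = sigmoid(theta @ x)           # pi(x) = s(theta^T x)
print(pi, bernoulli_loss(1, pi))  # loss if the true label is y = 1
```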
... TO SOFTMAX REGRESSION
There is a straightforward generalization to the multiclass case: instead of a single linear discriminant function, we have $g$ linear discriminant functions

$$f_k(x) = \theta_k^\top x, \quad k = 1, 2, \ldots, g,$$

each indicating the confidence in class $k$.
The $g$ score functions are transformed into $g$ probability functions by the softmax function $s : \mathbb{R}^g \to \mathbb{R}^g$,

$$\pi_k(x) = s(f(x))_k = \frac{\exp(\theta_k^\top x)}{\sum_{j=1}^g \exp(\theta_j^\top x)},$$

instead of the logistic function used for $g = 2$. The probabilities are well-defined: $\sum_{k=1}^g \pi_k(x) = 1$ and $\pi_k(x) \in [0, 1]$ for all $k$.
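A minimal numerical sketch of this transformation (the parameter matrix and input below are hypothetical, chosen only for illustration):

```python
import numpy as np

def softmax(z):
    # s(z)_k = exp(z_k) / sum_j exp(z_j)
    e = np.exp(z)
    return e / e.sum()

# hypothetical parameters: one theta_k per class (g = 3 classes, p = 2 features)
Theta = np.array([[ 0.5, -1.0],
                  [ 0.2,  0.3],
                  [-0.7,  0.8]])
x = np.array([1.0, 2.0])

scores = Theta @ x           # f_k(x) = theta_k^T x, one score per class
probs = softmax(scores)      # pi_k(x)
print(probs, probs.sum())    # well-defined: entries in [0, 1], summing to 1
```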
... TO SOFTMAX REGRESSION
The softmax function is a generalization of the logistic function.
For g = 2, the logistic function and the softmax function are
equivalent.
Instead of the Bernoulli loss, we use the multiclass logarithmic loss

$$L(y, \pi(x)) = -\sum_{k=1}^g \mathbb{1}\{y = k\} \log(\pi_k(x)).$$
Note that the softmax function is a “smooth” approximation of the arg max operation, e.g. $s((1, 1000, 2)^\top) \approx (0, 1, 0)^\top$ (it picks out the 2nd element).
Furthermore, it is invariant to constant offsets $c$ in the input:

$$s(f(x) + c)_k = \frac{\exp(\theta_k^\top x + c)}{\sum_{j=1}^g \exp(\theta_j^\top x + c)} = \frac{\exp(\theta_k^\top x) \cdot \exp(c)}{\sum_{j=1}^g \exp(\theta_j^\top x) \cdot \exp(c)} = s(f(x))_k.$$
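This invariance is also what makes softmax easy to compute in a numerically stable way: subtracting $\max_k f_k(x)$ from all scores changes nothing mathematically but avoids overflow in $\exp(\cdot)$. A minimal sketch with hypothetical values, together with the multiclass log loss for one observation:

```python
import numpy as np

def softmax_stable(z):
    # offset invariance: s(z - max(z)) = s(z), but exp() can no longer overflow
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def multiclass_log_loss(y, probs):
    # L(y, pi(x)) = -sum_k 1{y = k} log(pi_k(x)); classes encoded as 0, ..., g-1 here
    return -np.log(probs[y])

scores = np.array([1.0, 1000.0, 2.0])
probs = softmax_stable(scores)
print(probs)                          # ~ (0, 1, 0): the "smooth arg max" behaviour
print(multiclass_log_loss(1, probs))  # nearly zero loss if the true class is the 2nd one
```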
LOGISTIC VS. SOFTMAX REGRESSION
| | Logistic Regression | Softmax Regression |
|---|---|---|
| $\mathcal{Y}$ | $\{0, 1\}$ | $\{1, 2, \ldots, g\}$ |
| Discriminant fun. | $f(x) = \theta^\top x$ | $f_k(x) = \theta_k^\top x,\ k = 1, 2, \ldots, g$ |
| Probabilities | $\pi(x) = \frac{1}{1 + \exp(-\theta^\top x)}$ | $\pi_k(x) = \frac{\exp(\theta_k^\top x)}{\sum_{j=1}^g \exp(\theta_j^\top x)}$ |
| $L(y, \pi(x))$ | Bernoulli / logarithmic loss: $-y \log(\pi(x)) - (1 - y) \log(1 - \pi(x))$ | Multiclass logarithmic loss: $-\sum_{k=1}^g \mathbb{1}\{y = k\} \log(\pi_k(x))$ |
LOGISTIC VS. SOFTMAX REGRESSION
We can schematically depict softmax regression as follows:
[Figure: schematic depiction of softmax regression]
LOGISTIC VS. SOFTMAX REGRESSION
Further comments:
We can now, for instance, calculate gradients of the empirical risk and optimize it with standard numerical optimization software.

Softmax regression has an unusual property in that it has a “redundant” set of parameters: if we subtract a fixed vector from all $\theta_k$, the predictions do not change at all. Hence, our model is “over-parameterized”: for any hypothesis we might fit, there are multiple parameter vectors that give rise to exactly the same hypothesis function. This also implies that the minimizer of $\mathcal{R}_{\text{emp}}(\theta)$ above is not unique! Hence, a numerical trick is to set $\theta_g = 0$ and only optimize the other $\theta_k$. This does not restrict our hypothesis space; the constrained problem is still convex, and there now exists exactly one parameter vector for every hypothesis.

A similar approach is used in many ML models: multiclass LDA, naive Bayes, neural networks and boosting.
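A sketch combining both comments above (toy data, names, and settings are all made up for illustration): the empirical risk is handed to a standard numerical optimizer, and the redundancy is removed by fixing $\theta_g = 0$ and optimizing only the remaining $g - 1$ parameter vectors.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, g = 200, 2, 3
X = rng.normal(size=(n, p))            # toy feature matrix
y = rng.integers(0, g, size=n)         # toy labels, encoded as 0, ..., g-1

def emp_risk(theta_flat):
    # theta_g is fixed to 0; only (g - 1) parameter vectors are optimized
    Theta = np.vstack([theta_flat.reshape(g - 1, p), np.zeros(p)])
    scores = X @ Theta.T                                   # n x g matrix of f_k(x)
    scores = scores - scores.max(axis=1, keepdims=True)    # offset invariance
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), y].mean()              # averaged multiclass log loss

res = minimize(emp_risk, x0=np.zeros((g - 1) * p), method="BFGS")
print(res.fun)   # minimized empirical risk; res.x contains theta_1, ..., theta_{g-1}
```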
SOFTMAX: LINEAR DISCRIMINANT FUNCTIONS
Softmax regression gives us a linear classifier.
The softmax function

$$s(z)_k = \frac{\exp(z_k)}{\sum_{j=1}^g \exp(z_j)}$$

is a rank-preserving function, i.e., the ranks among the elements of the vector $z$ are the same as among the elements of $s(z)$. This is because softmax transforms all scores by the (rank-preserving) $\exp(\cdot)$ and divides each element by the same normalizing constant.

Thus, the softmax function has a monotonic, rank-preserving inverse on its image (unique up to the constant offset discussed above).

Applying this inverse to $\pi_k(x) = \frac{\exp(\theta_k^\top x)}{\sum_{j=1}^g \exp(\theta_j^\top x)}$ recovers the linear scores $f_k(x) = \theta_k^\top x$ (up to that offset).

Thus, softmax regression is a linear classifier.
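A quick numerical check of the rank-preservation argument (the scores are random and purely hypothetical): the class with the largest softmax probability is always the class with the largest linear score, so the predicted class depends on the linear scores only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # stable, thanks to the offset invariance
    return e / e.sum()

rng = np.random.default_rng(1)
for _ in range(100):
    scores = rng.normal(size=4)    # hypothetical f_k(x) for g = 4 classes
    probs = softmax(scores)
    # rank preservation: the ordering of probabilities equals the ordering of scores
    assert (np.argsort(probs) == np.argsort(scores)).all()
    assert np.argmax(probs) == np.argmax(scores)
print("arg max over probabilities == arg max over linear scores")
```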
GENERALIZING SOFTMAX REGRESSION
Instead of simple linear discriminant functions, we could use any model that outputs $g$ scores

$$f_k(x) \in \mathbb{R}, \quad k = 1, 2, \ldots, g.$$

We can choose a multiclass loss and optimize the score functions $f_k$, $k \in \{1, \ldots, g\}$, by multivariate minimization. The scores can be transformed into probabilities by the softmax function.
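A sketch of this idea under made-up names: any model that maps $x$ to $g$ real-valued scores can be plugged into the same softmax + multiclass log loss pipeline; here a tiny hand-written nonlinear score function (not from the slides) stands in for “any model”.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def scores_nonlinear(x, W1, W2):
    # any model producing g real-valued scores f_k(x) works;
    # here: one hidden tanh layer, purely as an illustrative stand-in
    return W2 @ np.tanh(W1 @ x)

rng = np.random.default_rng(2)
W1 = rng.normal(size=(5, 2))    # hypothetical weights: 2 features -> 5 hidden units
W2 = rng.normal(size=(3, 5))    # 5 hidden units -> g = 3 scores

x, y = np.array([0.3, -1.0]), 2
probs = softmax(scores_nonlinear(x, W1, W2))   # scores -> probabilities
print(probs, -np.log(probs[y]))                # same multiclass log loss as before
```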
GENERALIZING SOFTMAX REGRESSION
For example, for a neural network (note that softmax regression is also a neural network with no hidden layers):

[Figure: schematic depiction of the network]

Remark: For more details about neural networks, please refer to the lecture Deep Learning.