
Introduction to Machine Learning

Softmax Regression

Learning goals
Know softmax regression
Understand that softmax regression is a generalization of logistic regression
FROM LOGISTIC REGRESSION ...
Remember logistic regression (Y = {0, 1}): We combined the hypothesis space of linear functions, transformed by the logistic function s(z) = 1 / (1 + exp(−z)), i.e.

H = { π : X → R | π(x) = s(θ⊤x) },

with the Bernoulli (logarithmic) loss:

L(y, π(x)) = −y log(π(x)) − (1 − y) log(1 − π(x)).

Remark: We suppress the intercept term for better readability. The intercept term can be easily included via θ⊤x̃, θ ∈ R^(p+1), x̃ = (1, x).
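A minimal NumPy sketch of this setup; θ, x and y below are made-up illustrative values, not from the slides:

```python
import numpy as np

def logistic(z):
    # s(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def pi(theta, x):
    # hypothesis: pi(x) = s(theta^T x), intercept suppressed as in the remark
    return logistic(theta @ x)

def bernoulli_loss(y, prob):
    # L(y, pi(x)) = -y log(pi(x)) - (1 - y) log(1 - pi(x))
    return -y * np.log(prob) - (1 - y) * np.log(1 - prob)

theta = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
prob = pi(theta, x)
print(prob, bernoulli_loss(1, prob), bernoulli_loss(0, prob))
```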



... TO SOFTMAX REGRESSION
There is a straightforward generalization to the multiclass case: Instead of a single linear discriminant function we have g linear discriminant functions

fk(x) = θk⊤x, k = 1, 2, ..., g,

each indicating the confidence in class k.

The g score functions are transformed into g probability functions by the softmax function s : R^g → R^g,

πk(x) = s(f(x))k = exp(θk⊤x) / Σ_{j=1}^{g} exp(θj⊤x),

instead of the logistic function for g = 2. The probabilities are well-defined: Σ_{k=1}^{g} πk(x) = 1 and πk(x) ∈ [0, 1] for all k.
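A small NumPy sketch of these definitions; the parameter matrix Theta (one row per θk) and the input x are arbitrary illustrative values:

```python
import numpy as np

def scores(Theta, x):
    # f_k(x) = theta_k^T x for k = 1, ..., g; Theta has shape (g, p)
    return Theta @ x

def softmax(z):
    # pi_k(x) = exp(z_k) / sum_j exp(z_j)
    e = np.exp(z)
    return e / e.sum()

Theta = np.array([[ 1.0, -0.5],
                  [ 0.2,  0.3],
                  [-1.0,  0.8]])   # g = 3 classes, p = 2 features
x = np.array([1.5, -2.0])

probs = softmax(scores(Theta, x))
print(probs, probs.sum())          # entries in [0, 1], summing to 1
```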



... TO SOFTMAX REGRESSION
The softmax function is a generalization of the logistic function. For g = 2, the logistic function and the softmax function are equivalent.

Instead of the Bernoulli loss, we use the multiclass logarithmic loss

L(y, π(x)) = − Σ_{k=1}^{g} 1{y = k} log(πk(x)).

Note that the softmax function is a “smooth” approximation of the arg max operation, so s((1, 1000, 2)⊤) ≈ (0, 1, 0)⊤ (picks out the 2nd element!).

Furthermore, it is invariant to constant offsets in the input:

s(f(x) + c)k = exp(θk⊤x + c) / Σ_{j=1}^{g} exp(θj⊤x + c) = (exp(θk⊤x) · exp(c)) / (Σ_{j=1}^{g} exp(θj⊤x) · exp(c)) = s(f(x))k.
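Both properties are easy to verify numerically; subtracting the maximum score before exponentiating exploits exactly this offset invariance and is the standard trick to avoid overflow in exp. A sketch with illustrative score vectors:

```python
import numpy as np

def softmax(z):
    # subtracting max(z) uses the invariance s(f(x) + c) = s(f(x))
    # and keeps exp() from overflowing for large scores
    e = np.exp(z - np.max(z))
    return e / e.sum()

def multiclass_log_loss(y, probs):
    # L(y, pi(x)) = -sum_k 1{y = k} log(pi_k(x)); here y is a 0-based class index
    return -np.log(probs[y])

print(softmax(np.array([1.0, 1000.0, 2.0])))    # ~ (0, 1, 0): smooth arg max
print(softmax(np.array([1.0, 2.0, 3.0])))
print(softmax(np.array([1.0, 2.0, 3.0]) + 5.0)) # identical: offset invariance

probs = softmax(np.array([1.0, 2.0, 3.0]))
print(multiclass_log_loss(2, probs))            # loss if the true class is the third one
```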



LOGISTIC VS. SOFTMAX REGRESSION

                     Logistic Regression                     Softmax Regression

Y                    {0, 1}                                  {1, 2, ..., g}

Discriminant fun.    f(x) = θ⊤x                              fk(x) = θk⊤x, k = 1, 2, ..., g

Probabilities        π(x) = 1 / (1 + exp(−θ⊤x))              πk(x) = exp(θk⊤x) / Σ_{j=1}^{g} exp(θj⊤x)

L(y, π(x))           Bernoulli / logarithmic loss:           Multiclass logarithmic loss:
                     −y log(π(x)) − (1 − y) log(1 − π(x))    −Σ_{k=1}^{g} [y = k] log(πk(x))


LOGISTIC VS. SOFTMAX REGRESSION
We can schematically depict softmax regression as follows:



LOGISTIC VS. SOFTMAX REGRESSION
Further comments:

We can now, for instance, calculate gradients and optimize this with standard numerical optimization software.

Softmax regression has an unusual property in that it has a “redundant” set of parameters. If we subtract a fixed vector from all θk, the predictions do not change at all. Hence, our model is “over-parameterized”: for any hypothesis we might fit, there are multiple parameter vectors that give rise to exactly the same hypothesis function. This also implies that the minimizer of Remp(θ) above is not unique! Hence, a numerical trick is to set θg = 0 and only optimize the other θk. This does not restrict our hypothesis space, but it removes the redundancy, so there is now exactly one parameter vector for every hypothesis (see the sketch below).

A similar approach is used in many ML models: multiclass LDA, naive Bayes, neural networks and boosting.
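A quick numerical check of this redundancy and of the θg = 0 trick; Theta, x and the shift vector are arbitrary made-up values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
Theta = rng.normal(size=(3, 4))      # g = 3 classes, p = 4 features
x = rng.normal(size=4)
shift = rng.normal(size=4)           # fixed vector subtracted from every theta_k

p_orig    = softmax(Theta @ x)
p_shifted = softmax((Theta - shift) @ x)       # identical predictions: redundancy
p_pinned  = softmax((Theta - Theta[-1]) @ x)   # theta_g = 0: canonical representative

print(np.allclose(p_orig, p_shifted), np.allclose(p_orig, p_pinned))  # True True
```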



SOFTMAX: LINEAR DISCRIMINANT FUNCTIONS
Softmax regression gives us a linear classifier.

The softmax function s(z)k = exp(zk) / Σ_{j=1}^{g} exp(zj) is a rank-preserving function, i.e. the ranks among the elements of the vector z are the same as among the elements of s(z). This is because softmax transforms all scores by exp(·) (rank-preserving) and divides each element by the same normalizing constant.

Thus, the softmax function can be inverted (up to the constant offset it is invariant to), and this inverse is also monotonic and rank-preserving. Applying it to πk(x) = exp(θk⊤x) / Σ_{j=1}^{g} exp(θj⊤x) recovers fk(x) = θk⊤x up to a constant shared by all classes.

Thus, softmax regression is a linear classifier.
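A short check of this argument: taking the log of the softmax output recovers the linear scores up to a common additive constant, so the ranks (and hence the predicted class) are unchanged. Theta and x are the same illustrative values as above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

Theta = np.array([[ 1.0, -0.5],
                  [ 0.2,  0.3],
                  [-1.0,  0.8]])
x = np.array([1.5, -2.0])

f = Theta @ x                  # linear scores f_k(x) = theta_k^T x
p = softmax(f)

print(np.argsort(f), np.argsort(p))   # same ordering: rank-preserving
print(np.log(p) - f)                  # constant vector: inverse up to an offset
```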



GENERALIZING SOFTMAX REGRESSION
Instead of simple linear discriminant functions we could use any model that outputs g scores

fk(x) ∈ R, k = 1, 2, ..., g.

We can choose a multiclass loss and optimize the score functions fk, k ∈ {1, ..., g}, by multivariate minimization. The scores can be transformed to probabilities by the softmax function.
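A sketch of this recipe for the linear score functions, using scipy.optimize.minimize as the standard numerical optimizer mentioned earlier; the toy data is random and purely illustrative, and any other differentiable score model could be plugged into emp_risk instead:

```python
import numpy as np
from scipy.optimize import minimize

g, p = 3, 2
rng = np.random.default_rng(1)
X = rng.normal(size=(100, p))            # toy inputs
y = rng.integers(0, g, size=100)         # toy labels in {0, ..., g-1}

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def emp_risk(theta_flat):
    Theta = theta_flat.reshape(g, p)                  # scores f_k(x) = theta_k^T x
    P = softmax_rows(X @ Theta.T)                     # n x g class probabilities
    return -np.mean(np.log(P[np.arange(len(y)), y]))  # multiclass log loss

res = minimize(emp_risk, np.zeros(g * p), method="BFGS")
print(res.fun)                           # minimized empirical risk
```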



GENERALIZING SOFTMAX REGRESSION
For example, the g scores could be produced by a neural network (note that softmax regression is also a neural network with no hidden layers), as in the sketch below.
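A minimal NumPy sketch of this idea, with random placeholder weights rather than a trained network; removing the hidden layer reduces it to softmax regression:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(2)
p, h, g = 4, 8, 3                    # input dim, hidden units, classes
W1, b1 = rng.normal(size=(h, p)), np.zeros(h)
W2, b2 = rng.normal(size=(g, h)), np.zeros(g)

def scores(x):
    # neural-network score functions f_k(x); with no hidden layer this
    # would simply be Theta @ x, i.e. softmax regression
    hidden = np.tanh(W1 @ x + b1)
    return W2 @ hidden + b2

x = rng.normal(size=p)
print(softmax(scores(x)))            # class probabilities for one input
```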

Remark: For more details about neural networks please refer to the
lecture Deep Learning.
