Deep Learning Summer Term 2025

http://ml3.leuphana.de/lectures/summer25/DL
Machine Learning Group, Leuphana University of Lüneburg
Soham Majumder (soham.majumder@leuphana.de)

Exercise 5
Discussion date: 02.06.2025

Task 10 Multiclass Classification

Part 1
Let $x \in \mathbb{R}^d$ be a vector. The softmax function $\operatorname{softmax} : \mathbb{R}^d \to (0, 1)^d$ is given by
\[
p = \operatorname{softmax}(x) = \frac{1}{\sum_{j=1}^{d} \exp(x_j)}
\begin{pmatrix} \exp(x_1) \\ \exp(x_2) \\ \vdots \\ \exp(x_d) \end{pmatrix}
\]
and returns a probability distribution $p$, i.e.,
\[
p_j = \frac{\exp(x_j)}{\sum_{k=1}^{d} \exp(x_k)} \geq 0
\]
and $\sum_{j=1}^{d} p_j = 1$.
A suitable loss function is the cross-entropy loss. It is given by
\[
H(p, y) = -\sum_{j=1}^{d} y_j \log(p_j),
\]
where $y$ is a one-hot encoded target vector and $p$ is the output of the softmax layer.
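For concreteness, a minimal PyTorch sketch (the example logits and one-hot target below are made up for illustration) evaluates both definitions directly:

    import torch

    x = torch.tensor([2.0, 1.0, 0.1])        # example logits (made up)
    p = torch.exp(x) / torch.exp(x).sum()    # softmax as defined above
    y = torch.tensor([1.0, 0.0, 0.0])        # one-hot target

    print(p, p.sum())                        # components lie in (0, 1) and sum to 1
    print(-(y * torch.log(p)).sum())         # cross-entropy H(p, y)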

(i) Show that the derivative of the softmax function with respect to $x$ is
\[
\frac{\partial p_j}{\partial x_i} = p_j (\delta_{ij} - p_i),
\]
where $\delta_{ij}$ is 1 if $i = j$ and 0 otherwise.

(ii) Show that the derivative of the cross-entropy loss in combination with the softmax function with respect to $x$ is
\[
\frac{\partial H(p, y)}{\partial x} = p - y.
\]

Hint: For the first part, you should do a case distinction ($i = j$ and $i \neq j$) of $\frac{\partial p_j}{\partial x_i}$. In the second part, you need the chain rule when considering $\frac{\partial \log p_j}{\partial x_i}$.

Solution
(i) For $i = j$ we have
\[
\begin{aligned}
\frac{\partial p_j}{\partial x_i}
&= \frac{\exp(x_j) \left( \sum_k \exp(x_k) \right) - \exp(x_j) \exp(x_i)}{\left( \sum_k \exp(x_k) \right)^2} && \text{(quotient rule)} \\
&= \frac{\exp(x_j)}{\sum_k \exp(x_k)} \cdot \frac{\left( \sum_k \exp(x_k) \right) - \exp(x_i)}{\sum_k \exp(x_k)} && \text{(separate } \exp(x_j) \text{)} \\
&= \frac{\exp(x_j)}{\sum_k \exp(x_k)} \cdot \left( \frac{\sum_k \exp(x_k)}{\sum_k \exp(x_k)} - \frac{\exp(x_i)}{\sum_k \exp(x_k)} \right) \\
&= p_j \cdot (1 - p_i)
\end{aligned}
\]
and for $i \neq j$ we get
\[
\begin{aligned}
\frac{\partial p_j}{\partial x_i}
&= \frac{0 - \exp(x_j) \exp(x_i)}{\left( \sum_k \exp(x_k) \right)^2} \\
&= \frac{\exp(x_j)}{\sum_k \exp(x_k)} \cdot \frac{-\exp(x_i)}{\sum_k \exp(x_k)} \\
&= p_j \cdot (0 - p_i).
\end{aligned}
\]
Combined, that yields
\[
\frac{\partial p_j}{\partial x_i} = p_j (\delta_{ij} - p_i),
\]
where $\delta_{ij}$ is 1 if $i = j$ and 0 otherwise.
(ii)
\[
\begin{aligned}
\frac{\partial H(p, y)}{\partial x_i}
&= \frac{\partial}{\partial x_i} \left( -\sum_{j=1}^{d} y_j \log(p_j) \right)
= -\sum_{j=1}^{d} y_j \frac{\partial \log(p_j)}{\partial x_i}
= -\sum_{j=1}^{d} y_j \frac{\partial \log(p_j)}{\partial p_j} \frac{\partial p_j}{\partial x_i} \\
&= -\sum_{j=1}^{d} y_j \frac{1}{p_j} \, p_j (\delta_{ij} - p_i)
= -\sum_{j=1}^{d} y_j (\delta_{ij} - p_i)
= -\sum_{j=1}^{d} \left( y_j \delta_{ij} - y_j p_i \right) \\
&= -\sum_{j=1}^{d} y_j \delta_{ij} + p_i \sum_{j=1}^{d} y_j
= -y_i + p_i
\end{aligned}
\]
Here, we used the fact that $y$ is a one-hot encoded target vector, hence $\sum_{j=1}^{d} y_j = 1$. Stacking the components $-y_i + p_i$ over $i = 1, \ldots, d$ gives the claimed gradient $\frac{\partial H(p, y)}{\partial x} = p - y$.

Part 2
The log-linear model for logistic regression allows us to derive the softmax function to model the probabilities in multiclass classification. For a problem with $c$ classes, start by writing the log-probability of each class as a linear function of the inputs and the partition ("normalization") term $-\log Z$:
\[
\begin{aligned}
\log P(Y = 1 \mid X = x) &= w_1 x + b_1 - \log Z, \\
\log P(Y = 2 \mid X = x) &= w_2 x + b_2 - \log Z, \\
&\;\;\vdots \\
\log P(Y = c \mid X = x) &= w_c x + b_c - \log Z.
\end{aligned}
\]
Using $\sum_{j=1}^{c} P(Y = j \mid X = x) = 1$, show how this model is equivalent to modeling the class probabilities with the softmax function.

Solution
First, we rewrite the log-linear models as probabilities by exponentiating both sides:
\[
\begin{aligned}
P(Y = 1 \mid X = x) &= \frac{1}{Z} \exp(w_1 x + b_1), \\
P(Y = 2 \mid X = x) &= \frac{1}{Z} \exp(w_2 x + b_2), \\
&\;\;\vdots \\
P(Y = c \mid X = x) &= \frac{1}{Z} \exp(w_c x + b_c).
\end{aligned}
\]
We can now determine $Z$ by using $\sum_{j=1}^{c} P(Y = j \mid X = x) = 1$:
\[
\begin{aligned}
1 &= \frac{1}{Z} \sum_{j=1}^{c} \exp(w_j x + b_j) \\
Z &= \sum_{j=1}^{c} \exp(w_j x + b_j) && \text{(multiplying both sides by } Z\text{)}
\end{aligned}
\]
Thus,
\[
P(Y = i \mid X = x) = \frac{\exp(w_i x + b_i)}{\sum_{j=1}^{c} \exp(w_j x + b_j)} = p_i,
\]
where $p_i$ is the $i$-th component of $\operatorname{softmax}\big((w_1 x + b_1, w_2 x + b_2, \ldots, w_c x + b_c)^\top\big)$.
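To connect the derivation to code, a small sketch (with randomly drawn weights $w_j$ and biases $b_j$, chosen only for illustration) confirms that explicitly normalizing the exponentiated scores reproduces torch.softmax applied to the logits $w_j x + b_j$:

    import torch

    torch.manual_seed(0)
    c, d = 5, 3                       # number of classes, input dimension
    W = torch.randn(c, d)             # rows are the weight vectors w_j
    b = torch.randn(c)
    x = torch.randn(d)

    logits = W @ x + b                # (w_1 x + b_1, ..., w_c x + b_c)
    Z = torch.exp(logits).sum()       # partition term
    p_manual = torch.exp(logits) / Z  # explicit normalization as in the derivation

    print(torch.allclose(p_manual, torch.softmax(logits, dim=0)))  # expected: True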

Task 11 Multiclass Classification with PyTorch
(i) Read the classification tutorial from the PyTorch documentation.∗
(ii) Adapt the code and implement a 10-class classifier for the MNIST data set based on the
tutorial you just read. Use the CrossEntropyLoss and the Adam optimizer.
(iii) Add a dropout layer† between the fully connected linear layers of the classifier. Test several
values of the dropout probability p and report on the train and test accuracy. Use a learning
rate of 0.01 and train for at least 15 epochs. Which p works best?
(iv) Visualize the learned filters of the convolutional layers. You can access the weights via
net.conv1.weight.data.cpu().numpy()
(v) Take a training sample and manually apply every operation of the forward pass. Take a look
at the intermediate results.

Solution
The code is provided as solution5.ipynb.
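For orientation, here is a minimal sketch of the kind of network and training setup the tasks describe. The layer sizes, the exact dropout placement, and the synthetic sample are illustrative assumptions, not the reference solution from solution5.ipynb:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MnistNet(nn.Module):
        """Small CNN for 10-class MNIST classification (illustrative sketch)."""
        def __init__(self, p_dropout=0.5):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 6, 5)            # MNIST images are 1x28x28
            self.conv2 = nn.Conv2d(6, 16, 5)
            self.fc1 = nn.Linear(16 * 4 * 4, 120)
            self.dropout = nn.Dropout(p=p_dropout)     # dropout between the FC layers, task (iii)
            self.fc2 = nn.Linear(120, 10)

        def forward(self, x):
            x = F.max_pool2d(F.relu(self.conv1(x)), 2) # -> 6x12x12
            x = F.max_pool2d(F.relu(self.conv2(x)), 2) # -> 16x4x4
            x = torch.flatten(x, 1)
            x = self.dropout(F.relu(self.fc1(x)))
            return self.fc2(x)                         # logits, as expected by CrossEntropyLoss

    net = MnistNet(p_dropout=0.5)
    criterion = nn.CrossEntropyLoss()                  # applies log-softmax internally
    optimizer = torch.optim.Adam(net.parameters(), lr=0.01)

    # Task (iv): learned filters of the first convolutional layer.
    filters = net.conv1.weight.data.cpu().numpy()      # shape (6, 1, 5, 5)

    # Task (v): manually apply every operation of the forward pass to one sample
    # and inspect the intermediate results.
    sample = torch.randn(1, 1, 28, 28)                 # placeholder for a training image
    h1 = F.max_pool2d(F.relu(net.conv1(sample)), 2)
    h2 = F.max_pool2d(F.relu(net.conv2(h1)), 2)
    logits = net.fc2(net.dropout(F.relu(net.fc1(torch.flatten(h2, 1)))))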

∗ https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
† https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html#torch.nn.Dropout
