
Deep Learning (CS-878)

Fall – 2024
Muhammad Naseer Bajwa
Assistant Professor,
Faculty of Computing, SEECS
Co-Director,
Deep Learning Lab, NCAI
NUST, Islamabad
naseer.bajwa@seecs.edu.pk

Overview of this week’s lecture

Artificial Neural Networks

- What is a perceptron?

- Setting up a logistic regression problem

- Loss function for logistic regression

- Cost functions

- Optimisation Algorithm

A perceptron is a mathematical model of a biological neuron

[Diagram: inputs x₁, x₂, x₃, x₄ are scaled by weights w₁, w₂, w₃, w₄ and combined into the weighted sum ∑ᵢ₌₁ⁿ wᵢxᵢ + b, which produces the output ŷ.]

https://today.ucsd.edu/story/why_are_neuron_axons_long_and_spindly
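
To make the diagram concrete, here is a minimal NumPy sketch of a single perceptron's forward pass (the input values, weights, bias, and step activation are illustrative assumptions, not taken from the slides):

import numpy as np

# Hypothetical inputs and parameters for a single perceptron
x = np.array([0.5, -1.0, 2.0, 0.0])   # inputs x1..x4
w = np.array([0.2, 0.4, -0.1, 0.7])   # weights w1..w4
b = 0.1                               # bias

z = np.dot(w, x) + b                  # weighted sum: sum_i w_i * x_i + b
y_hat = 1.0 if z >= 0 else 0.0        # step activation as the classic perceptron output
print(z, y_hat)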

A set of perceptrons, receiving the same input, makes a layer

- Each perceptron/node in a layer receives the same input but has its own weights and biases.

- Different learnable parameters applied to the same input help each node learn (hopefully) a distinct pattern/feature from the input.

- For each input x of size nₓ, the learnable parameters are weights w ∈ ℝ^nₓ and a bias b ∈ ℝ.

- The weighted sum z of the input and parameters can be calculated as

  z = wᵀx + b

- The predicted output ŷ is obtained by applying an activation function g to z:

  ŷ = g(wᵀx + b)

[Diagram: two perceptrons form a layer; both receive the same inputs x₁…x₄, but each applies its own weights (w₁…w₄ and w₅…w₈) and bias (b₁ and b₂) to its own weighted sum ∑ᵢ₌₁ⁿ wᵢxᵢ + b, producing the outputs ŷ₁ and ŷ₂.]
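
Stacking each node's weight vector as one row of a matrix expresses the whole layer in a single line. A minimal sketch, assuming a layer of two nodes on four inputs (all numbers are made up for illustration):

import numpy as np

x = np.array([0.5, -1.0, 2.0, 0.0])        # the same input goes to every node

# One row of weights and one bias per node (values are illustrative)
W = np.array([[0.2, 0.4, -0.1, 0.7],       # node 1: weights w1..w4
              [0.3, -0.2, 0.5, 0.1]])      # node 2: weights w5..w8
b = np.array([0.1, -0.3])                  # biases b1, b2

z = W @ x + b                              # z1 = w.x + b1, z2 = w.x + b2
y_hat = 1.0 / (1.0 + np.exp(-z))           # e.g. a sigmoid activation g(z)
print(y_hat)                               # two outputs, one per node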

Logistic Regression is the simplest of classification models

- Let's take an RGB image and determine if the image contains a cat.

- Each training example will be of the form (x, y), with x ∈ ℝ^nₓ and y ∈ {0, 1}.

- There are a total of m training examples, {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)}.

- Each channel in RGB has a size of 4 × 4 (in reality, the size will be much larger).

- Unroll all channels into a feature vector of size 4 × 4 × 3 = nₓ.

- Stack all inputs x⁽ⁱ⁾ as column vectors into a matrix X ∈ ℝ^(nₓ×m).

- Similarly, Y = [y⁽¹⁾, y⁽²⁾, …, y⁽ᵐ⁾].

- The goal of logistic regression is to learn, from the pairs (x, y), a function that produces

  ŷ = P(y = 1 | x)

[Matrix X: each column holds one unrolled example, so entry xᵢ⁽ʲ⁾ is the i-th feature of the j-th example; the columns run from x⁽¹⁾ to x⁽ᵐ⁾ and the rows from 1 to nₓ.]
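
As a concrete illustration of this setup, here is a small NumPy sketch that unrolls a batch of 4 × 4 × 3 images into feature vectors and stacks them as columns of X (the random images, labels, and the value of m are placeholders):

import numpy as np

m = 5                                      # number of training examples (placeholder)
images = np.random.rand(m, 4, 4, 3)        # m tiny RGB images of size 4 x 4 x 3
labels = np.random.randint(0, 2, size=m)   # y in {0, 1}: cat / not cat (placeholder)

n_x = 4 * 4 * 3                            # length of the unrolled feature vector
X = images.reshape(m, n_x).T               # each column of X is one unrolled image
Y = labels.reshape(1, m)                   # row vector of labels

print(X.shape, Y.shape)                    # (48, 5) and (1, 5)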

Loss function determines how close ŷ and y are

Given {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)}, we want ŷ⁽ⁱ⁾ ≈ y⁽ⁱ⁾

Loss(ŷ, y) = difference between expected and predicted outputs

- Which loss function is suitable?

- A suitable loss function should be differentiable on its entire range and convex.

- Mean Absolute Error (MAE), or L1 loss?

  MAE = (1/m) ∑ᵢ₌₁ᵐ |ŷ⁽ⁱ⁾ − y⁽ⁱ⁾|

- Mean Squared Error (MSE), or L2 loss?

  MSE = (1/m) ∑ᵢ₌₁ᵐ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²

Negative Log Likelihood is the most preferred loss for binary logistic regression

- Negative Log Likelihood is the loss function used for logistic regression.

  L(ŷ, y) = −[y log ŷ + (1 − y) log(1 − ŷ)]

- If y = 1,

  L = −log ŷ

  To minimize the loss, log ŷ needs to be large, which means ŷ needs to be large.

- Similarly, if y = 0,

  L = −log(1 − ŷ)

  To minimize the loss, log(1 − ŷ) needs to be large, which means ŷ needs to be small.
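
A minimal NumPy sketch of this loss for a single example, with two illustrative predictions that show the behaviour described above (the numbers are assumptions, not from the slides):

import numpy as np

def nll_loss(y_hat, y, eps=1e-12):
    """Negative log likelihood for one example; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(nll_loss(0.9, 1))   # small loss: y = 1 and y_hat is large
print(nll_loss(0.9, 0))   # large loss: y = 0 but y_hat is large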

Cost function is just the average loss function

  J(w, b) = (1/m) ∑ᵢ₌₁ᵐ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)

  J(w, b) = −(1/m) ∑ᵢ₌₁ᵐ [y⁽ⁱ⁾ log ŷ⁽ⁱ⁾ + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾)]

- Why do we take the average of the loss function?

- What if we don't take the negative of the log likelihood?

- What if we don't take the log of the likelihood?
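
The cost is just the mean of the per-example loss over the m training examples; a small sketch with hypothetical predictions and labels:

import numpy as np

# Predictions and labels for m = 4 hypothetical training examples
Y_hat = np.array([0.9, 0.2, 0.7, 0.4])
Y     = np.array([1,   0,   1,   0  ])

# Cost J(w, b): the average negative log likelihood over all examples
J = -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
print(J)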

Optimisation means learning optimal model parameters

- The objective of an optimisation algorithm is to find optimal parameters to minimize the loss.

  w := w − α ∂J(w, b)/∂w

  b := b − α ∂J(w, b)/∂b

  α represents the learning rate.

- Repeat the process until the algorithm converges.

- Which optimisation algorithm is better?
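
These update rules translate directly into code. A sketch of one gradient-descent step, assuming the gradients dw and db have already been computed (as in the vectorised backpropagation later in the lecture):

# One gradient-descent update; alpha is the learning rate
def update_parameters(w, b, dw, db, alpha=0.01):
    w = w - alpha * dw      # w := w - alpha * dJ(w, b)/dw
    b = b - alpha * db      # b := b - alpha * dJ(w, b)/db
    return w, b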

Gradient Descent is an accurate but slow optimisation algorithm

- Gradient Descent is an old, time-tested but very slow optimisation algorithm.

- It requires computation of gradients with respect to all training samples.

- How to fix it?

- Stochastic Gradient Descent

  - Less accurate but more time and memory efficient; it can do more with the same resources.

  - Updates are made based on a randomly selected input sample.

  - Minibatches can help smooth the optimisation path of SGD and take advantage of parallelisation on GPUs.

https://www.youtube.com/watch?v=UmathvAKj80
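
A structural sketch of minibatch SGD, assuming a helper compute_gradients(w, b, X_batch, Y_batch) that returns (dw, db) as in the update algorithm later in the lecture; the batch size and number of epochs are illustrative choices:

import numpy as np

def minibatch_sgd(w, b, X, Y, compute_gradients, alpha=0.01, batch_size=32, epochs=10):
    """X has shape (n_x, m); Y has shape (1, m). compute_gradients is an assumed helper."""
    m = X.shape[1]
    for epoch in range(epochs):
        order = np.random.permutation(m)              # shuffle examples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]     # one randomly selected minibatch
            dw, db = compute_gradients(w, b, X[:, idx], Y[:, idx])
            w = w - alpha * dw                        # parameter update on the minibatch
            b = b - alpha * db
    return w, b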

Computation graphs help organise computations

- Let's say we want to calculate the function f = 3(a + bc)

[Graph: b and c feed a multiplication node u = bc; a and u feed an addition node v = a + u; v feeds a final multiplication node f = 3v.]

- For a = 5, b = 3, and c = 2, f evaluates to 33.

- How will f be affected by changes in a, b, and c?
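
A small sketch of this graph in code: the forward pass evaluates f node by node, and the backward pass answers the question above by applying the chain rule (for this function, ∂f/∂a = 3, ∂f/∂b = 3c, ∂f/∂c = 3b):

def forward(a, b, c):
    u = b * c          # first node:  u = bc
    v = a + u          # second node: v = a + u
    f = 3 * v          # final node:  f = 3v
    return f, u, v

def backward(a, b, c):
    df_dv = 3                      # f = 3v
    dv_da, dv_du = 1, 1            # v = a + u
    du_db, du_dc = c, b            # u = bc
    # Chain rule back through the graph
    return df_dv * dv_da, df_dv * dv_du * du_db, df_dv * dv_du * du_dc

f, u, v = forward(5, 3, 2)
print(f)                           # 33
print(backward(5, 3, 2))           # (3, 6, 9): df/da, df/db, df/dc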

Computational graphs also help understand optimisation intuitively

- Consider the following neural network.

[Graph: inputs x₁, x₂ with weights w₁, w₂ and bias b feed a sum node z = w₁x₁ + w₂x₂ + b, followed by a sigmoid node ŷ = σ(z) and the loss L(ŷ, y).]

- Backpropagating through this graph, with dz denoting ∂L/∂z:

  ∂L/∂w₁ = x₁ dz

  ∂L/∂w₂ = x₂ dz
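
A quick sketch (not from the slides) that checks the first of these derivatives numerically for one example, assuming the sigmoid activation and the negative log likelihood loss, so that dz = ŷ − y:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical parameters and a single training example
w, b = np.array([0.3, -0.2]), 0.1
x, y = np.array([1.5, -0.7]), 1.0

# Analytical gradient from the graph: dz = y_hat - y, dL/dw1 = x1 * dz
y_hat = sigmoid(np.dot(w, x) + b)
print("analytical:", x[0] * (y_hat - y))

# Numerical check: finite difference on w1
eps = 1e-6
w_plus = w.copy(); w_plus[0] += eps
print("numerical: ", (loss(w_plus, b, x, y) - loss(w, b, x, y)) / eps)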

The following algorithm updates the parameters

- For the computation graph on the last slide and the cross-entropy loss:

  Initialise J = 0, dw₁ = 0, dw₂ = 0, db = 0
  For i = 1 to m:
      z⁽ⁱ⁾ = wᵀx⁽ⁱ⁾ + b
      ŷ⁽ⁱ⁾ = σ(z⁽ⁱ⁾)
      J += −[y⁽ⁱ⁾ log ŷ⁽ⁱ⁾ + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾)]
      dz⁽ⁱ⁾ = ŷ⁽ⁱ⁾ − y⁽ⁱ⁾
      dw₁ += x₁⁽ⁱ⁾ dz⁽ⁱ⁾
      dw₂ += x₂⁽ⁱ⁾ dz⁽ⁱ⁾
      db += dz⁽ⁱ⁾
  J = J/m,  dw₁ = dw₁/m,  dw₂ = dw₂/m,  db = db/m

- Drawback of such an implementation?

- Explicit for loops: inefficient

- Solution?

- Vectorisation
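
For reference, a direct, runnable Python version of this loop, kept deliberately un-vectorised to show the inefficiency; the two weight derivatives dw₁, dw₂ are generalised to a single vector dw, and X holds one example per column as set up earlier:

import numpy as np

def loop_gradients(w, b, X, Y):
    """w is a 1-D weight vector of length n_x; X is (n_x, m); Y is (1, m)."""
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros(n_x), 0.0
    for i in range(m):
        z_i = np.dot(w, X[:, i]) + b                  # z(i) = w.T x(i) + b
        y_hat_i = 1.0 / (1.0 + np.exp(-z_i))          # y_hat(i) = sigma(z(i))
        y_i = Y[0, i]
        J += -(y_i * np.log(y_hat_i) + (1 - y_i) * np.log(1 - y_hat_i))
        dz_i = y_hat_i - y_i
        dw += X[:, i] * dz_i                          # dw_k += x_k(i) dz(i) for every k
        db += dz_i
    return J / m, dw / m, db / m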

Vectorisation helps the code run faster

For z = wᵀx + b, with w, x ∈ ℝ^nₓ:

With Vectorisation

z = np.dot(w.T, x) + b

With Loops

z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b
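
A small sketch that times the two versions above on a vector of one million features (the size and the exact timings are illustrative and will vary by machine):

import time
import numpy as np

n_x = 1_000_000
w = np.random.rand(n_x)
x = np.random.rand(n_x)
b = 0.5

t0 = time.time()
z_vec = np.dot(w.T, x) + b          # vectorised version
t1 = time.time()

z_loop = 0.0
for i in range(n_x):                # explicit loop version
    z_loop += w[i] * x[i]
z_loop += b
t2 = time.time()

print(f"vectorised: {t1 - t0:.4f} s, loop: {t2 - t1:.4f} s")
print("same result:", np.isclose(z_vec, z_loop))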

Vectorise more to make the code even faster

[Diagram: the row vector z = [z⁽¹⁾ z⁽²⁾ … z⁽ᵐ⁾] is computed as wᵀX + [b b … b], where wᵀ ∈ ℝ^(1×nₓ), X ∈ ℝ^(nₓ×m) holds the examples x⁽¹⁾ … x⁽ᵐ⁾ as its columns, z, b ∈ ℝᵐ, and the same bias b is broadcast into every position.]

With Vectorisation

Z = np.dot(w.T, X) + b
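
A minimal shape check of this vectorised forward pass, assuming nₓ = 48 (a 4 × 4 × 3 image) and m = 10 examples (both numbers are placeholders):

import numpy as np

n_x, m = 48, 10
X = np.random.rand(n_x, m)         # one unrolled example per column
w = np.random.rand(n_x, 1)         # weight column vector
b = 0.1                            # scalar bias, broadcast across all m columns

Z = np.dot(w.T, X) + b             # shape (1, m): one z per training example
print(Z.shape)                     # (1, 10)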

Backpropagation using vectorisation is also efficient

dZ = [dz⁽¹⁾, dz⁽²⁾, …, dz⁽ᵐ⁾]

Ŷ = [ŷ⁽¹⁾, ŷ⁽²⁾, …, ŷ⁽ᵐ⁾]

Y = [y⁽¹⁾, y⁽²⁾, …, y⁽ᵐ⁾]

dZ = Ŷ − Y

db = (1/m) ∑ dZ

dw = (1/m) X dZᵀ
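
Putting the vectorised forward and backward passes together, a sketch of one full gradient computation and update (not the lecturer's code; shapes follow the conventions above, with X of shape (nₓ, m) and Y of shape (1, m)):

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def gradient_step(w, b, X, Y, alpha=0.01):
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(w.T, X) + b)       # forward pass, shape (1, m)
    dZ = Y_hat - Y                            # dZ = Y_hat - Y, shape (1, m)
    dw = np.dot(X, dZ.T) / m                  # dw = (1/m) X dZ^T, shape (n_x, 1)
    db = np.sum(dZ) / m                       # db = (1/m) sum(dZ), a scalar
    return w - alpha * dw, b - alpha * db     # gradient-descent update

# Example usage on random placeholder data
n_x, m = 48, 100
X = np.random.rand(n_x, m)
Y = np.random.randint(0, 2, size=(1, m))
w, b = np.zeros((n_x, 1)), 0.0
w, b = gradient_step(w, b, X, Y)
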
Do you have any problem?

Some material (images, tables, text, etc.) in this presentation has been borrowed from different books, lecture notes, and the web. The original contents solely belong to their owners and are used in this presentation only for clarifying various educational concepts. Any copyright infringement is not at all intended.
