CS-878 Lecture-02 Logistic Regression
Fall – 2024
Muhammad Naseer Bajwa
Assistant Professor,
Faculty of Computing, SEECS
Co-Director,
Deep Learning Lab, NCAI
NUST, Islamabad
naseer.bajwa@seecs.edu.pk
Overview of this week’s lecture
- What is a perceptron?
- Cost functions
- Optimisation algorithms
A perceptron is a mathematical model of a biological neuron
https://today.ucsd.edu/story/why_are_neuron_axons_long_and_spindly
[Figure: a perceptron with inputs $x_1, \dots, x_4$, weights $w_1, \dots, w_4$, and bias $b$ feeding a weighted sum that produces the output $\hat{y}$.]
$\hat{y} = \sum_{i=1}^{n} w_i x_i + b$
A set of perceptrons, receiving the same input, makes a layer
[Figure: a layer of perceptrons sharing the same inputs $x_1, \dots, x_4$; each perceptron has its own weights and bias, e.g. the second one computes $\hat{y}_2 = \sum_{i=1}^{n} w_i x_i + b_2$.]
The output of each perceptron can be calculated as
$z = \boldsymbol{w}^T \boldsymbol{x} + b$
$\hat{y} = g(\boldsymbol{w}^T \boldsymbol{x} + b)$
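A minimal NumPy sketch of this forward pass; the sigmoid used for the activation $g$ and the numeric values are illustrative assumptions, not part of the slides.

import numpy as np

def perceptron_forward(x, w, b, g=lambda z: 1.0 / (1.0 + np.exp(-z))):
    # y_hat = g(w^T x + b) for a single perceptron.
    z = np.dot(w, x) + b      # weighted sum of inputs plus bias
    return g(z)               # activation applied to the pre-activation

# Example with four inputs, as in the diagram above.
x = np.array([0.5, -1.0, 2.0, 0.1])
w = np.array([0.2, 0.4, -0.3, 0.8])
b = 0.1
print(perceptron_forward(x, w, b))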
Logistic Regression is the simplest of classification models
- Let’s take an RGB image and determine if the image contains a cat.
- Each channel in RGB has a size of 4 × 4 (in reality, the size will be much larger).
- Each image is unrolled into a feature vector $x \in \mathbb{R}^{n_x}$ (here $n_x = 4 \times 4 \times 3 = 48$); stacking the $m$ training examples as columns gives the data matrix
$X = \begin{bmatrix} x_1^{(1)} & \cdots & x_1^{(j)} & \cdots & x_1^{(m)} \\ \vdots & & \vdots & & \vdots \\ x_i^{(1)} & \cdots & x_i^{(j)} & \cdots & x_i^{(m)} \\ \vdots & & \vdots & & \vdots \\ x_{n_x}^{(1)} & \cdots & x_{n_x}^{(j)} & \cdots & x_{n_x}^{(m)} \end{bmatrix}$
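A short sketch, assuming randomly generated 4 × 4 × 3 images in place of a real dataset, of how each image is unrolled into a feature vector and the m examples are stacked into X.

import numpy as np

m = 5                        # number of training images (illustrative)
H, W, C = 4, 4, 3            # 4 x 4 pixels, 3 colour channels
n_x = H * W * C              # 48 features per image

images = np.random.rand(m, H, W, C)   # stand-in for real images

# Unroll every image into a vector and stack the vectors as columns of X.
X = images.reshape(m, n_x).T           # shape (n_x, m)
print(X.shape)                          # (48, 5)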
Loss function determines how close $\hat{y}$ and $y$ are
$\text{Loss}(\hat{y}, y) = \text{difference between the expected and predicted outputs}$
- A suitable loss function should be differentiable over its entire range and convex.
Negative Log Likelihood is the preferred loss for binary logistic regression
- Negative Log Likelihood is the loss function used for logistic regression.
$L(\hat{y}, y) = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$
- If $y = 1$, then $L = -\log \hat{y}$. To minimise the loss, $\log \hat{y}$ needs to be large, which means $\hat{y}$ needs to be large (close to 1).
- Similarly, if $y = 0$, then $L = -\log(1 - \hat{y})$. To minimise the loss, $1 - \hat{y}$ needs to be large, which means $\hat{y}$ needs to be small (close to 0).
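A small NumPy sketch of this loss; the eps clipping is an added safeguard against log(0), not something the slide specifies.

import numpy as np

def nll_loss(y_hat, y, eps=1e-12):
    # Negative log likelihood for a single example.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 1: loss is small when y_hat is close to 1, large when it is close to 0.
print(nll_loss(0.9, 1), nll_loss(0.1, 1))
# y = 0: the opposite behaviour.
print(nll_loss(0.1, 0), nll_loss(0.9, 0))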
Cost function is just the average loss function
$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L\big(\hat{y}^{(i)}, y^{(i)}\big)$
$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \big[\, y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log\big(1 - \hat{y}^{(i)}\big) \,\big]$
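A minimal sketch of the cost as the average per-example loss; the predictions and labels are illustrative values.

import numpy as np

def cost(y_hat, y, eps=1e-12):
    # J(w, b): average negative log likelihood over m examples.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return losses.mean()

y_hat = np.array([0.9, 0.2, 0.7])
y     = np.array([1.0, 0.0, 1.0])
print(cost(y_hat, y))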
Optimisation means learning optimal model parameters
$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$
$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$
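A minimal sketch of one gradient-descent update following these rules; the learning rate and the gradient values are illustrative.

import numpy as np

def gd_step(w, b, dw, db, alpha=0.01):
    # One update: parameters move against the gradient of the cost.
    w = w - alpha * dw     # w := w - alpha * dJ/dw
    b = b - alpha * db     # b := b - alpha * dJ/db
    return w, b

w, b = np.array([0.5, -0.3]), 0.0
dw, db = np.array([0.1, -0.2]), 0.05    # illustrative gradients
print(gd_step(w, b, dw, db))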
Gradient Descent is an accurate but slow optimisation algorithm
https://www.youtube.com/watch?v=UmathvAKj80
- Stochastic Gradient Descent (SGD) is less accurate but more time and memory efficient; it can do more with the same resources.
- Minibatches can help smooth the optimisation path of SGD and take
advantage of parallelisation on GPUs.
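A sketch of one epoch of minibatch SGD, assuming examples are stored as columns of X and using the vectorised gradient formulas derived later in this lecture; the batch size, learning rate, and random data are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_sgd_epoch(X, Y, w, b, alpha=0.1, batch_size=32, seed=0):
    # One shuffled pass over the data, updating (w, b) after each minibatch.
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    order = rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        Xb, Yb = X[:, idx], Y[:, idx]
        mb = Xb.shape[1]
        A = sigmoid(w.T @ Xb + b)       # predictions on the minibatch
        dZ = A - Yb
        dw = (Xb @ dZ.T) / mb           # shape (n_x, 1)
        db = dZ.sum() / mb
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Illustrative data: 200 random examples with 48 features each.
n_x, m = 48, 200
X = np.random.rand(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
w, b = np.zeros((n_x, 1)), 0.0
w, b = minibatch_sgd_epoch(X, Y, w, b)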
Computation graphs help organise computations
[Figure: computation graph with inputs $a$, $b$, $c$: a multiplication node computes $u = bc$, an addition node computes $v = a + u$, and a final node computes $f = 3v$.]
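A small worked sketch of a forward pass through this graph followed by a backward pass via the chain rule; the numeric inputs are illustrative.

# Forward pass: evaluate each node left to right.
a, b, c = 5.0, 3.0, 2.0
u = b * c          # u = bc
v = a + u          # v = a + u
f = 3 * v          # f = 3v

# Backward pass: propagate df/d(node) right to left with the chain rule.
df_dv = 3.0                 # f = 3v
df_du = df_dv * 1.0         # v = a + u  ->  dv/du = 1
df_da = df_dv * 1.0         # dv/da = 1
df_db = df_du * c           # u = bc     ->  du/db = c
df_dc = df_du * b           # du/dc = b

print(f, df_da, df_db, df_dc)   # 33.0 3.0 6.0 9.0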
Computational graphs also help us understand optimisation intuitively
[Figure: computation graph for logistic regression with inputs $x_1, x_2$, weights $w_1, w_2$, and bias $b$.]
$z = w_1 x_1 + w_2 x_2 + b, \qquad \hat{y} = \sigma(z), \qquad \text{Loss} = L(\hat{y}, y)$
$\frac{\partial L}{\partial w_1} = x_1\, dz, \qquad \frac{\partial L}{\partial w_2} = x_2\, dz, \qquad \frac{\partial L}{\partial b} = dz, \qquad \text{where } dz = \frac{\partial L}{\partial z} = \hat{y} - y$
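A brief sketch that checks the analytic gradient $\partial L / \partial w_1 = x_1(\hat{y} - y)$ against a finite-difference estimate; all numeric values are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w1, w2, b, x1, x2, y = 0.3, -0.5, 0.1, 1.2, -0.7, 1.0

# Analytic gradient from the computation graph: dL/dw1 = x1 * (y_hat - y).
y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
analytic = x1 * (y_hat - y)

# Numerical gradient by central differences.
eps = 1e-6
numeric = (loss(w1 + eps, w2, b, x1, x2, y) - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)

print(analytic, numeric)   # the two should agree closely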
The following algorithm updates the parameters
- For the computation graph on the last slide and the cross-entropy loss:
- Initialise $J = 0,\ dw_1 = 0,\ dw_2 = 0,\ db = 0$
- For $i = 1$ to $m$:
  - $z^{(i)} = w^T x^{(i)} + b$
  - $\hat{y}^{(i)} = \sigma(z^{(i)})$
  - $J \mathrel{+}= -\big[\, y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \,\big]$
  - $dz^{(i)} = \hat{y}^{(i)} - y^{(i)}$
  - $dw_1 \mathrel{+}= x_1^{(i)}\, dz^{(i)}$
  - $dw_2 \mathrel{+}= x_2^{(i)}\, dz^{(i)}$
  - $db \mathrel{+}= dz^{(i)}$
- $J = J/m$
- $dw_1 = dw_1/m$
- $dw_2 = dw_2/m$
- $db = db/m$
- What is the drawback of such an implementation?
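A direct Python transcription of the loop above for two features, followed by one gradient-descent update with the averaged gradients; the function name and the toy data are illustrative, not from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step_loops(x1, x2, y, w1, w2, b, alpha=0.1):
    # One pass over m examples with explicit loops, mirroring the slide.
    m = len(y)
    J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
    for i in range(m):
        z = w1 * x1[i] + w2 * x2[i] + b
        y_hat = sigmoid(z)
        J += -(y[i] * np.log(y_hat) + (1 - y[i]) * np.log(1 - y_hat))
        dz = y_hat - y[i]
        dw1 += x1[i] * dz
        dw2 += x2[i] * dz
        db += dz
    J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
    # Gradient-descent update with the averaged gradients.
    w1, w2, b = w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db
    return J, w1, w2, b

x1 = np.array([0.5, 1.5, -0.3]); x2 = np.array([1.0, -2.0, 0.7])
y  = np.array([1.0, 0.0, 1.0])
print(train_step_loops(x1, x2, y, w1=0.0, w2=0.0, b=0.0))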
Vectorisation helps the code run faster
For $z = \boldsymbol{w}^T \boldsymbol{x} + b$, with $\boldsymbol{w}, \boldsymbol{x} \in \mathbb{R}^{n_x}$:
With Loops
z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b
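A rough timing sketch comparing the explicit loop with np.dot on one large example; the vector length and the measured times are illustrative and will vary by machine.

import time
import numpy as np

n_x = 100_000
w = np.random.rand(n_x)
x = np.random.rand(n_x)
b = 0.5

# Explicit loop.
t0 = time.perf_counter()
z_loop = 0.0
for i in range(n_x):
    z_loop += w[i] * x[i]
z_loop += b
t_loop = time.perf_counter() - t0

# Vectorised dot product.
t0 = time.perf_counter()
z_vec = np.dot(w, x) + b
t_vec = time.perf_counter() - t0

print(z_loop, z_vec)     # same value up to floating-point error
print(t_loop, t_vec)     # the vectorised version is typically much faster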
Vectorise more to make the code even faster
[Figure: the data matrix $X \in \mathbb{R}^{n_x \times m}$, whose $j$-th column is the training example $x^{(j)}$.]
$\boldsymbol{z} = [\, z^{(1)}\ \ z^{(2)}\ \cdots\ z^{(m)} \,] = \boldsymbol{w}^T X + [\, b\ \ b\ \cdots\ b \,]$
with $\boldsymbol{z} \in \mathbb{R}^{m}$, $\boldsymbol{w}^T \in \mathbb{R}^{1 \times n_x}$, $X \in \mathbb{R}^{n_x \times m}$, and the scalar bias $b$ broadcast across all $m$ examples (so $\boldsymbol{b} \in \mathbb{R}^{m}$).
With Vectorisation
Z = np.dot(w.T, X) + b
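A minimal sketch of the fully vectorised forward pass and cost over all m examples at once; the random data and the shapes used here are illustrative.

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

n_x, m = 48, 200
X = np.random.rand(n_x, m)                      # illustrative data
Y = (np.random.rand(1, m) > 0.5).astype(float)
w = np.zeros((n_x, 1))
b = 0.0

Z = np.dot(w.T, X) + b    # shape (1, m); b is broadcast across examples
A = sigmoid(Z)            # predictions y_hat for every example
J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
print(Z.shape, A.shape, J)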
Backpropagation using vectorisation is also efficient
$dZ = [\, dz^{(1)},\ dz^{(2)},\ \dots,\ dz^{(m)} \,]$
$\hat{Y} = [\, \hat{y}^{(1)},\ \hat{y}^{(2)},\ \dots,\ \hat{y}^{(m)} \,], \qquad Y = [\, y^{(1)},\ y^{(2)},\ \dots,\ y^{(m)} \,]$
$dZ = \hat{Y} - Y$
$db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)}$
$dw = \frac{1}{m} X\, dZ^{T}$
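A minimal sketch of one fully vectorised training iteration (forward pass, the backward formulas above, and a gradient-descent update); the data and learning rate are illustrative.

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

n_x, m, alpha = 48, 200, 0.1
X = np.random.rand(n_x, m)                      # illustrative data
Y = (np.random.rand(1, m) > 0.5).astype(float)
w, b = np.zeros((n_x, 1)), 0.0

A = sigmoid(np.dot(w.T, X) + b)   # forward pass, shape (1, m)
dZ = A - Y                        # dZ = Y_hat - Y
dw = np.dot(X, dZ.T) / m          # shape (n_x, 1)
db = np.sum(dZ) / m               # scalar
w = w - alpha * dw                # gradient-descent update
b = b - alpha * db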
Do you have any questions?
EOP