CS-878 Lecture-02 Logistic Regression
Fall – 2024
Muhammad Naseer Bajwa
Assistant Professor,
Faculty of Computing, SEECS
Co-Director,
Deep Learning Lab, NCAI
NUST, Islamabad
naseer.bajwa@seecs.edu.pk
Overview of this week’s lecture
- What is a perceptron?
- Cost functions
- Optimisation algorithms
A perceptron is a mathematical model of a biological neuron
https://today.ucsd.edu/story/why_are_neuron_axons_long_and_spindly
[Figure: a perceptron with inputs $x_1, \dots, x_4$, weights $w_1, \dots, w_4$, and bias $b$ feeding a weighted sum that produces the output $\hat{y}$.]
$\hat{y} = \sum_{i=1}^{n} w_i x_i + b$
A set of perceptrons, receiving the same input, makes a layer
[Figure: a layer of perceptrons sharing the same inputs $x_1, \dots, x_4$; each perceptron has its own weights and bias, e.g. the second one computes $\hat{y}_2 = \sum_{i=1}^{n} w_i x_i + b_2$.]
The output of each perceptron can be calculated as
$z = \boldsymbol{w}^T \boldsymbol{x} + b$
$\hat{y} = g(\boldsymbol{w}^T \boldsymbol{x} + b)$
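A minimal NumPy sketch of this forward pass; the sigmoid used for the activation $g$ and the numeric values are illustrative assumptions, not part of the slides.

import numpy as np

def perceptron_forward(x, w, b, g=lambda z: 1.0 / (1.0 + np.exp(-z))):
    # y_hat = g(w^T x + b) for a single perceptron.
    z = np.dot(w, x) + b      # weighted sum of inputs plus bias
    return g(z)               # activation applied to the pre-activation

# Example with four inputs, as in the diagram above.
x = np.array([0.5, -1.0, 2.0, 0.1])
w = np.array([0.2, 0.4, -0.3, 0.8])
b = 0.1
print(perceptron_forward(x, w, b))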
Logistic Regression is the simplest of classification models
- Let’s take an RGB image and determine if the image contains a cat.
- Each channel in RGB has a size of 4 × 4 (in reality, the size will be much larger).
- Each image is unrolled into a feature vector $x \in \mathbb{R}^{n_x}$ (here $n_x = 4 \times 4 \times 3 = 48$); stacking the $m$ training examples as columns gives the data matrix
$X = \begin{bmatrix} x_1^{(1)} & \cdots & x_1^{(j)} & \cdots & x_1^{(m)} \\ \vdots & & \vdots & & \vdots \\ x_i^{(1)} & \cdots & x_i^{(j)} & \cdots & x_i^{(m)} \\ \vdots & & \vdots & & \vdots \\ x_{n_x}^{(1)} & \cdots & x_{n_x}^{(j)} & \cdots & x_{n_x}^{(m)} \end{bmatrix}$
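A short sketch, assuming randomly generated 4 × 4 × 3 images in place of a real dataset, of how each image is unrolled into a feature vector and the m examples are stacked into X.

import numpy as np

m = 5                        # number of training images (illustrative)
H, W, C = 4, 4, 3            # 4 x 4 pixels, 3 colour channels
n_x = H * W * C              # 48 features per image

images = np.random.rand(m, H, W, C)   # stand-in for real images

# Unroll every image into a vector and stack the vectors as columns of X.
X = images.reshape(m, n_x).T           # shape (n_x, m)
print(X.shape)                          # (48, 5)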
Loss function determines how close $\hat{y}$ and $y$ are
$\text{Loss}(\hat{y}, y) = \text{difference between the expected and predicted outputs}$
- A suitable loss function should be differentiable over its entire range and convex.
Negative Log Likelihood is the preferred loss for binary logistic regression
- Negative Log Likelihood is the loss function used for logistic regression.
$L(\hat{y}, y) = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$
- If $y = 1$, then $L = -\log \hat{y}$. To minimise the loss, $\log \hat{y}$ needs to be large, which means $\hat{y}$ needs to be large (close to 1).
- Similarly, if $y = 0$, then $L = -\log(1 - \hat{y})$. To minimise the loss, $1 - \hat{y}$ needs to be large, which means $\hat{y}$ needs to be small (close to 0).
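A small NumPy sketch of this loss; the eps clipping is an added safeguard against log(0), not something the slide specifies.

import numpy as np

def nll_loss(y_hat, y, eps=1e-12):
    # Negative log likelihood for a single example.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 1: loss is small when y_hat is close to 1, large when it is close to 0.
print(nll_loss(0.9, 1), nll_loss(0.1, 1))
# y = 0: the opposite behaviour.
print(nll_loss(0.1, 0), nll_loss(0.9, 0))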
Cost function is just the average loss function
$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L\big(\hat{y}^{(i)}, y^{(i)}\big)$
$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \big[\, y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log\big(1 - \hat{y}^{(i)}\big) \,\big]$
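A minimal sketch of the cost as the average per-example loss; the predictions and labels are illustrative values.

import numpy as np

def cost(y_hat, y, eps=1e-12):
    # J(w, b): average negative log likelihood over m examples.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return losses.mean()

y_hat = np.array([0.9, 0.2, 0.7])
y     = np.array([1.0, 0.0, 1.0])
print(cost(y_hat, y))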
Optimisation means learning optimal model parameters
$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$
$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$
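A minimal sketch of one gradient-descent update following these rules; the learning rate and the gradient values are illustrative.

import numpy as np

def gd_step(w, b, dw, db, alpha=0.01):
    # One update: parameters move against the gradient of the cost.
    w = w - alpha * dw     # w := w - alpha * dJ/dw
    b = b - alpha * db     # b := b - alpha * dJ/db
    return w, b

w, b = np.array([0.5, -0.3]), 0.0
dw, db = np.array([0.1, -0.2]), 0.05    # illustrative gradients
print(gd_step(w, b, dw, db))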
Gradient Descent is an accurate but slow optimisation algorithm
https://www.youtube.com/watch?v=UmathvAKj80
- Stochastic Gradient Descent (SGD) is less accurate but more time and memory efficient; it can do more with the same resources.
- Minibatches can help smooth the optimisation path of SGD and take
advantage of parallelisation on GPUs.
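A sketch of one epoch of minibatch SGD, assuming examples are stored as columns of X and using the vectorised gradient formulas derived later in this lecture; the batch size, learning rate, and random data are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_sgd_epoch(X, Y, w, b, alpha=0.1, batch_size=32, seed=0):
    # One shuffled pass over the data, updating (w, b) after each minibatch.
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    order = rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        Xb, Yb = X[:, idx], Y[:, idx]
        mb = Xb.shape[1]
        A = sigmoid(w.T @ Xb + b)       # predictions on the minibatch
        dZ = A - Yb
        dw = (Xb @ dZ.T) / mb           # shape (n_x, 1)
        db = dZ.sum() / mb
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Illustrative data: 200 random examples with 48 features each.
n_x, m = 48, 200
X = np.random.rand(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
w, b = np.zeros((n_x, 1)), 0.0
w, b = minibatch_sgd_epoch(X, Y, w, b)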
Computation graphs help organise computations
[Figure: computation graph with inputs $a$, $b$, $c$: a multiplication node computes $u = bc$, an addition node computes $v = a + u$, and a final node computes $f = 3v$.]
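A small worked sketch of a forward pass through this graph followed by a backward pass via the chain rule; the numeric inputs are illustrative.

# Forward pass: evaluate each node left to right.
a, b, c = 5.0, 3.0, 2.0
u = b * c          # u = bc
v = a + u          # v = a + u
f = 3 * v          # f = 3v

# Backward pass: propagate df/d(node) right to left with the chain rule.
df_dv = 3.0                 # f = 3v
df_du = df_dv * 1.0         # v = a + u  ->  dv/du = 1
df_da = df_dv * 1.0         # dv/da = 1
df_db = df_du * c           # u = bc     ->  du/db = c
df_dc = df_du * b           # du/dc = b

print(f, df_da, df_db, df_dc)   # 33.0 3.0 6.0 9.0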
Computational graphs also help us understand optimisation intuitively
[Figure: computation graph for logistic regression with inputs $x_1, x_2$, weights $w_1, w_2$, and bias $b$.]
$z = w_1 x_1 + w_2 x_2 + b, \qquad \hat{y} = \sigma(z), \qquad \text{Loss} = L(\hat{y}, y)$
$\frac{\partial L}{\partial w_1} = x_1\, dz, \qquad \frac{\partial L}{\partial w_2} = x_2\, dz, \qquad \frac{\partial L}{\partial b} = dz, \qquad \text{where } dz = \frac{\partial L}{\partial z} = \hat{y} - y$
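A brief sketch that checks the analytic gradient $\partial L / \partial w_1 = x_1(\hat{y} - y)$ against a finite-difference estimate; all numeric values are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w1, w2, b, x1, x2, y = 0.3, -0.5, 0.1, 1.2, -0.7, 1.0

# Analytic gradient from the computation graph: dL/dw1 = x1 * (y_hat - y).
y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
analytic = x1 * (y_hat - y)

# Numerical gradient by central differences.
eps = 1e-6
numeric = (loss(w1 + eps, w2, b, x1, x2, y) - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)

print(analytic, numeric)   # the two should agree closely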
The following algorithm updates the parameters
- For the computation graph on the last slide and the cross-entropy loss:
- Initialise $J = 0,\ dw_1 = 0,\ dw_2 = 0,\ db = 0$
- For $i = 1$ to $m$:
  - $z^{(i)} = w^T x^{(i)} + b$
  - $\hat{y}^{(i)} = \sigma(z^{(i)})$
  - $J \mathrel{+}= -\big[\, y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \,\big]$
  - $dz^{(i)} = \hat{y}^{(i)} - y^{(i)}$
  - $dw_1 \mathrel{+}= x_1^{(i)}\, dz^{(i)}$
  - $dw_2 \mathrel{+}= x_2^{(i)}\, dz^{(i)}$
  - $db \mathrel{+}= dz^{(i)}$
- $J = J/m$
- $dw_1 = dw_1/m$
- $dw_2 = dw_2/m$
- $db = db/m$
- What is the drawback of such an implementation?
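A direct Python transcription of the loop above for two features, followed by one gradient-descent update with the averaged gradients; the function name and the toy data are illustrative, not from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step_loops(x1, x2, y, w1, w2, b, alpha=0.1):
    # One pass over m examples with explicit loops, mirroring the slide.
    m = len(y)
    J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
    for i in range(m):
        z = w1 * x1[i] + w2 * x2[i] + b
        y_hat = sigmoid(z)
        J += -(y[i] * np.log(y_hat) + (1 - y[i]) * np.log(1 - y_hat))
        dz = y_hat - y[i]
        dw1 += x1[i] * dz
        dw2 += x2[i] * dz
        db += dz
    J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
    # Gradient-descent update with the averaged gradients.
    w1, w2, b = w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db
    return J, w1, w2, b

x1 = np.array([0.5, 1.5, -0.3]); x2 = np.array([1.0, -2.0, 0.7])
y  = np.array([1.0, 0.0, 1.0])
print(train_step_loops(x1, x2, y, w1=0.0, w2=0.0, b=0.0))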
Vectorisation helps the code run faster
For $z = \boldsymbol{w}^T \boldsymbol{x} + b$, with $\boldsymbol{w}, \boldsymbol{x} \in \mathbb{R}^{n_x}$:
With Loops
z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b
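A rough timing sketch comparing the explicit loop with np.dot on one large example; the vector length and the measured times are illustrative and will vary by machine.

import time
import numpy as np

n_x = 100_000
w = np.random.rand(n_x)
x = np.random.rand(n_x)
b = 0.5

# Explicit loop.
t0 = time.perf_counter()
z_loop = 0.0
for i in range(n_x):
    z_loop += w[i] * x[i]
z_loop += b
t_loop = time.perf_counter() - t0

# Vectorised dot product.
t0 = time.perf_counter()
z_vec = np.dot(w, x) + b
t_vec = time.perf_counter() - t0

print(z_loop, z_vec)     # same value up to floating-point error
print(t_loop, t_vec)     # the vectorised version is typically much faster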
Vectorise more to make the code even faster
[Figure: the data matrix $X \in \mathbb{R}^{n_x \times m}$, whose $j$-th column is the training example $x^{(j)}$.]
$\boldsymbol{z} = [\, z^{(1)}\ \ z^{(2)}\ \cdots\ z^{(m)} \,] = \boldsymbol{w}^T X + [\, b\ \ b\ \cdots\ b \,]$
with $\boldsymbol{z} \in \mathbb{R}^{m}$, $\boldsymbol{w}^T \in \mathbb{R}^{1 \times n_x}$, $X \in \mathbb{R}^{n_x \times m}$, and the scalar bias $b$ broadcast across all $m$ examples (so $\boldsymbol{b} \in \mathbb{R}^{m}$).
With Vectorisation
Z = np.dot(w.T, X) + b
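A minimal sketch of the fully vectorised forward pass and cost over all m examples at once; the random data and the shapes used here are illustrative.

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

n_x, m = 48, 200
X = np.random.rand(n_x, m)                      # illustrative data
Y = (np.random.rand(1, m) > 0.5).astype(float)
w = np.zeros((n_x, 1))
b = 0.0

Z = np.dot(w.T, X) + b    # shape (1, m); b is broadcast across examples
A = sigmoid(Z)            # predictions y_hat for every example
J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
print(Z.shape, A.shape, J)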
Backpropagation using vectorisation is also efficient
$dZ = [\, dz^{(1)},\ dz^{(2)},\ \dots,\ dz^{(m)} \,]$
$\hat{Y} = [\, \hat{y}^{(1)},\ \hat{y}^{(2)},\ \dots,\ \hat{y}^{(m)} \,], \qquad Y = [\, y^{(1)},\ y^{(2)},\ \dots,\ y^{(m)} \,]$
$dZ = \hat{Y} - Y$
$db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)}$
$dw = \frac{1}{m} X\, dZ^{T}$
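A minimal sketch of one fully vectorised training iteration (forward pass, the backward formulas above, and a gradient-descent update); the data and learning rate are illustrative.

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

n_x, m, alpha = 48, 200, 0.1
X = np.random.rand(n_x, m)                      # illustrative data
Y = (np.random.rand(1, m) > 0.5).astype(float)
w, b = np.zeros((n_x, 1)), 0.0

A = sigmoid(np.dot(w.T, X) + b)   # forward pass, shape (1, m)
dZ = A - Y                        # dZ = Y_hat - Y
dw = np.dot(X, dZ.T) / m          # shape (n_x, 1)
db = np.sum(dZ) / m               # scalar
w = w - alpha * dw                # gradient-descent update
b = b - alpha * db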
Do you have any questions?
EOP