18/11/2021
Supervised learning (Part Ⅱ)
Outline
- Linear regression
• Model representation
• Cost function
• Gradient Descent
- Logistic regression
• Generative vs discriminative classifiers
• Hypothesis representation
• Cost function
Linear regression
Regression: An example
Regression: predicting a real-valued output.
Training set → Learning Algorithm → Hypothesis
Size of house → Hypothesis → Estimated price
House pricing prediction
[Scatter plot: Price ($) in 1000's against Size in m^2]
Training set
Size in m^2 (x)    Price ($) in 1000's (y)
2104               460
1416               232
1534               315
852                178
…                  …
(m = 47)

• Notation:
  • m = number of training examples
  • x = input variable / features
  • y = output variable / target variable
  • (x, y) = one training example
  • (x^(i), y^(i)) = the i-th training example
• Examples: x^(1) = 2104,  x^(2) = 1416,  y^(1) = 460
Model representation
Training set → Learning Algorithm → Hypothesis h
Size of house → h(x) → Estimated price
[Plot: Price ($) in 1000's against Size in m^2, with the hypothesis drawn as a straight line through the data]
This is univariate linear regression: linear regression with one variable.
Cost function
Training set (same as above, m = 47):
Size in m^2 (x)    Price ($) in 1000's (y)
2104               460
1416               232
1534               315
852                178
…                  …

• Hypothesis: h_θ(x) = θ_0 + θ_1 x
• θ_0, θ_1: parameters/weights
• How do we choose θ_0, θ_1?
Cost function
[Three plots: different choices of θ_0, θ_1 give different straight-line fits h_θ(x) to the same data]
Cost function
• Idea: choose θ_0, θ_1 so that h_θ(x) is close to y for our training examples (x, y).

    J(θ_0, θ_1) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2

• J(θ_0, θ_1) is called the cost function.
[Plot: Price ($) in 1000's against Size in m^2, with a candidate straight-line hypothesis h_θ(x)]
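As a concrete illustration, here is a minimal NumPy sketch of this cost function; the four housing examples come from the training-set table above, and the θ values are arbitrary placeholders.

```python
import numpy as np

# Four examples from the training-set table (size in m^2, price in 1000's of $)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])

def h(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = (1/(2m)) * sum((h(x) - y)^2)."""
    m = len(y)
    return np.sum((h(theta0, theta1, x) - y) ** 2) / (2 * m)

print(cost(0.0, 0.2, x, y))    # J for one (arbitrary) choice of parameters
print(cost(50.0, 0.15, x, y))  # a different choice gives a different cost
```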
Cost function (simplified)

Original:
• Hypothesis: h_θ(x) = θ_0 + θ_1 x
• Parameters: θ_0, θ_1
• Cost function: J(θ_0, θ_1) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2
• Goal: minimize over θ_0, θ_1:  J(θ_0, θ_1)

Simplified (fix θ_0 = 0):
• Hypothesis: h_θ(x) = θ_1 x
• Parameters: θ_1
• Cost function: J(θ_1) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2
• Goal: minimize over θ_1:  J(θ_1)
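To see how J(θ_1) behaves, a small sketch using a toy dataset (1,1), (2,2), (3,3) in the spirit of the plots on the following slides (this toy data is an assumption, chosen so the minimum sits at θ_1 = 1):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # toy inputs
y = np.array([1.0, 2.0, 3.0])  # toy targets: y = x, so theta1 = 1 is optimal
m = len(y)

def J(theta1):
    """Simplified cost with theta0 fixed to 0: J(theta1) = (1/(2m)) sum((theta1*x - y)^2)."""
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

for t in [0.0, 0.5, 1.0, 1.5]:
    print(f"J({t}) = {J(t):.3f}")
# J(1.0) = 0 is the minimum; moving theta1 away from 1 increases the cost.
```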
Cost function
[Left: training data with the simplified hypothesis h_θ(x) = θ_1 x plotted for a fixed θ_1 (a function of x). Right: J(θ_1) plotted as a function of θ_1; each choice of θ_1 contributes one point on the J(θ_1) curve. Repeated on the following slides for several values of θ_1; the minimum of J(θ_1) corresponds to the best-fitting line.]
Cost function
• Hypothesis: h_θ(x) = θ_0 + θ_1 x
• Parameters: θ_0, θ_1
• Cost function: J(θ_0, θ_1) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2
• Goal: minimize over θ_0, θ_1:  J(θ_0, θ_1)
[Surface and contour plots of J(θ_0, θ_1) over the (θ_0, θ_1) plane]
How do we find parameters θ_0, θ_1 that minimize J(θ_0, θ_1)?
Gradient descent
Have some function J(θ_0, θ_1).
Want min over θ_0, θ_1 of J(θ_0, θ_1).
Outline:
• Start with some θ_0, θ_1
• Keep changing θ_0, θ_1 to reduce J(θ_0, θ_1) until we hopefully end up at a minimum
Gradient descent
Repeat until convergence {
    θ_j := θ_j − α ∂J(θ_0, θ_1)/∂θ_j    (for j = 0 and j = 1)
}
α: learning rate (step size)
∂J(θ_0, θ_1)/∂θ_j: partial derivative (rate of change of J with respect to θ_j)
Gradient descent
Correct: simultaneous update
    temp0 := θ_0 − α ∂J(θ_0, θ_1)/∂θ_0
    temp1 := θ_1 − α ∂J(θ_0, θ_1)/∂θ_1
    θ_0 := temp0
    θ_1 := temp1

Incorrect:
    temp0 := θ_0 − α ∂J(θ_0, θ_1)/∂θ_0
    θ_0 := temp0
    temp1 := θ_1 − α ∂J(θ_0, θ_1)/∂θ_1    (this gradient now uses the already-updated θ_0)
    θ_1 := temp1
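A minimal sketch of why the order matters. The quadratic cost used here is a stand-in chosen so that both partial derivatives depend on both parameters; it is not the regression cost from these slides.

```python
# Stand-in cost (not the course's J): J(t0, t1) = (t0 + t1)^2,
# with dJ/dt0 = dJ/dt1 = 2*(t0 + t1).
alpha = 0.1

# Correct: simultaneous update -- both gradients use the OLD parameters.
theta0, theta1 = 1.0, 2.0
temp0 = theta0 - alpha * 2 * (theta0 + theta1)
temp1 = theta1 - alpha * 2 * (theta0 + theta1)
theta0, theta1 = temp0, temp1
print(theta0, theta1)   # 0.4 1.4

# Incorrect: sequential update -- theta1's gradient sees the already-updated theta0.
theta0, theta1 = 1.0, 2.0
theta0 = theta0 - alpha * 2 * (theta0 + theta1)   # 0.4
theta1 = theta1 - alpha * 2 * (theta0 + theta1)   # 2 - 0.1*2*(0.4 + 2.0) = 1.52
print(theta0, theta1)   # 0.4 1.52 -- not the same as the simultaneous update
```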
[Plot of J(θ_1) against θ_1: to the left of the minimum the slope ∂J(θ_1)/∂θ_1 < 0, so the update θ_1 := θ_1 − α ∂J(θ_1)/∂θ_1 increases θ_1; to the right the slope is > 0, so the update decreases θ_1. Either way θ_1 moves toward the minimum.]
Learning rate
[Figures illustrating the effect of the learning rate α on the gradient descent steps]
Gradient descent for linear regression
Repeat until convergence {
    θ_j := θ_j − α ∂J(θ_0, θ_1)/∂θ_j    (for j = 0 and j = 1)
}
• Linear regression model:
    h_θ(x) = θ_0 + θ_1 x
    J(θ_0, θ_1) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2
Computing the partial derivatives
• ∂J(θ_0, θ_1)/∂θ_j = ∂/∂θ_j [ (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2 ]
                    = ∂/∂θ_j [ (1/(2m)) Σ_{i=1}^{m} (θ_0 + θ_1 x^(i) − y^(i))^2 ]
• j = 0:  ∂J(θ_0, θ_1)/∂θ_0 = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
• j = 1:  ∂J(θ_0, θ_1)/∂θ_1 = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
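As a sanity check on these two formulas, a short sketch compares the analytic partial derivatives with finite-difference approximations of J (the data and θ values are arbitrary):

```python
import numpy as np

x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
m = len(y)

def J(t0, t1):
    return np.sum((t0 + t1 * x - y) ** 2) / (2 * m)

t0, t1, eps = 10.0, 0.1, 1e-4

# Analytic partial derivatives from the formulas above.
d0 = np.sum(t0 + t1 * x - y) / m
d1 = np.sum((t0 + t1 * x - y) * x) / m

# Central finite-difference approximations of the same derivatives.
d0_num = (J(t0 + eps, t1) - J(t0 - eps, t1)) / (2 * eps)
d1_num = (J(t0, t1 + eps) - J(t0, t1 - eps)) / (2 * eps)

print(d0, d0_num)  # the two values should agree closely
print(d1, d1_num)
```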
Gradient descent for linear regression
Repeat until convergence {
    θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
    θ_1 := θ_1 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
}
Update θ_0 and θ_1 simultaneously.
Batch gradient descent
• "Batch": each step of gradient descent uses all m training examples.
Repeat until convergence {
    θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
    θ_1 := θ_1 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
}
(m: number of training examples)
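Putting the update rules together, a minimal batch-gradient-descent sketch for the one-feature housing data; the learning rate, the iteration budget, and the rescaling of the size feature (dividing by 1000 so the updates are stable) are assumptions, not values from the slides.

```python
import numpy as np

# One-feature training data from the table (size rescaled to 1000's of m^2 for stability).
x = np.array([2104.0, 1416.0, 1534.0, 852.0]) / 1000.0
y = np.array([460.0, 232.0, 315.0, 178.0])
m = len(y)

alpha = 0.1                # learning rate (assumed)
theta0, theta1 = 0.0, 0.0

for _ in range(2000):                      # fixed iteration budget instead of a convergence test
    h = theta0 + theta1 * x                # hypothesis on all m examples ("batch")
    grad0 = np.sum(h - y) / m
    grad1 = np.sum((h - y) * x) / m
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous update

print(theta0, theta1)                      # fitted intercept and slope
print(theta0 + theta1 * 1.5)               # predicted price (in 1000's of $) for a 1500 m^2 house
```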
Training dataset with one feature
Size in m^2 (x)    Price ($) in 1000's (y)
2104               460
1416               232
1534               315
852                178
…                  …

h_θ(x) = θ_0 + θ_1 x
Multiple features (input variables)
Size in m^2 (x_1)   Number of bedrooms (x_2)   Number of floors (x_3)   Age of home in years (x_4)   Price ($) in 1000's (y)
2104                5                          1                        45                           460
1416                3                          2                        40                           232
1534                3                          2                        30                           315
852                 2                          1                        36                           178
…                   …                          …                        …                            …

Notation:
• n = number of features
• x^(i) = input features of the i-th training example
• x_j^(i) = value of feature j in the i-th training example
Examples: x^(2) = ?    x_3^(2) = ?
Multiple features (input variables)
Hypothesis
Previously (one feature): h_θ(x) = θ_0 + θ_1 x
Now (n features): h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_n x_n
Matrix representation
h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_n x_n
• For convenience of notation, define x_0 = 1 (x_0^(i) = 1 for all examples).
• x = [x_0, x_1, …, x_n]^T ∈ R^(n+1),   θ = [θ_0, θ_1, …, θ_n]^T ∈ R^(n+1)
• h_θ(x) = θ_0 x_0 + θ_1 x_1 + ⋯ + θ_n x_n = θ^T x
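A small NumPy sketch of the θ^T x form, using the first row of the multi-feature table with x_0 = 1 prepended; the θ values are arbitrary placeholders.

```python
import numpy as np

# First row of the multi-feature table, with x0 = 1 prepended:
# [x0, size, bedrooms, floors, age]
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])

# Placeholder parameter vector theta in R^(n+1) (values are arbitrary).
theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])

h = theta @ x        # h_theta(x) = theta^T x
print(h)             # same as 80 + 0.1*2104 + 10*5 + 3*1 - 2*45
```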
Gradient descent
• Previously (n = 1):
Repeat until convergence {
    θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
    θ_1 := θ_1 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x^(i)
}
• New algorithm (n ≥ 1):
Repeat until convergence {
    θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)
}
Simultaneously update θ_j for j = 0, 1, …, n (with x_0^(i) = 1).
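A vectorized sketch of this update over all θ_j at once, using the four rows of the table above; the learning rate, iteration count, and the per-column scaling are assumptions.

```python
import numpy as np

# Design matrix: columns are [x0=1, size, bedrooms, floors, age]; targets are prices.
X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460.0, 232.0, 315.0, 178.0])
m, n_plus_1 = X.shape

# Crude feature scaling (see the next slide) so one learning rate works for all features.
scale = X.max(axis=0)
Xs = X / scale

alpha, theta = 0.3, np.zeros(n_plus_1)
for _ in range(5000):
    grad = Xs.T @ (Xs @ theta - y) / m   # (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i), for every j
    theta = theta - alpha * grad          # simultaneous update of all theta_j

print(theta / scale)                      # parameters expressed on the original feature scale
```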
Gradient descent in practice: Feature scaling
• Idea: make sure features are on a similar scale (e.g., −1 ≤ x_j ≤ 1); see the sketch after this slide.
• E.g., x_1 = size (0–2000 m^2), x_2 = number of bedrooms (1–5).
[Contour plots of the cost function over the parameters: with unscaled features the contours are elongated and gradient descent converges slowly; with scaled features they are more even and descent converges faster.]
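A minimal sketch of one common way to do this (mean normalization to zero mean and unit standard deviation; the slide only asks for features on a similar scale, so this exact recipe is an assumption):

```python
import numpy as np

# Two features on very different scales: size in m^2 and number of bedrooms.
size     = np.array([2104.0, 1416.0, 1534.0, 852.0])
bedrooms = np.array([5.0, 3.0, 3.0, 2.0])

def standardize(v):
    """Rescale a feature to zero mean and unit standard deviation."""
    return (v - v.mean()) / v.std()

print(standardize(size))       # values now roughly in [-1.5, 1.5]
print(standardize(bedrooms))   # same rough range, so gradient descent treats both features evenly
```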
Gradient descent in practice: Learning rate
• α too small: slow convergence.
• α too large: may not converge.
• To choose α, try a range of values, e.g., 0.001, …, 0.01, …, 0.1, …, 1 (see the sketch below).
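A sketch of this tuning loop: run a few gradient-descent iterations for each candidate α and watch whether the cost decreases. It reuses the one-feature housing setup; the candidate list and iteration count are assumptions.

```python
import numpy as np

x = np.array([2104.0, 1416.0, 1534.0, 852.0]) / 1000.0   # rescaled size
y = np.array([460.0, 232.0, 315.0, 178.0])
m = len(y)

def run_gd(alpha, iters=100):
    """Run `iters` gradient-descent steps and return the final cost J."""
    t0 = t1 = 0.0
    for _ in range(iters):
        h = t0 + t1 * x
        t0, t1 = t0 - alpha * np.sum(h - y) / m, t1 - alpha * np.sum((h - y) * x) / m
    return np.sum((t0 + t1 * x - y) ** 2) / (2 * m)

for alpha in [0.001, 0.01, 0.1, 1.0]:
    print(f"alpha={alpha}: J after 100 iterations = {run_gd(alpha):.3g}")
# Too small an alpha barely reduces J; too large an alpha makes J blow up.
```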
Logistic regression
Logistic regression: An example
[Plot: Malignant? (1 = Yes, 0 = No) against Tumor Size]
This is a classification problem.
If we use linear regression?
[Plot: Malignant? (1 = Yes, 0 = No) against Tumor Size, with a straight line h_θ(x) = θ^T x fit to the data]
• Threshold the classifier output at 0.5:
  • If h_θ(x) ≥ 0.5, predict "y = 1"
  • If h_θ(x) < 0.5, predict "y = 0"
If we use linear regression?
Classification: y = 1 or y = 0
h_θ(x) = θ^T x (from linear regression) can be > 1 or < 0
Logistic regression: 0 ≤ h_θ(x) ≤ 1
Despite its name, logistic regression is a classification algorithm.
Classification
• Learn h: X -> Y
  – X: features
  – Y: target classes
• Suppose you know P(Y|X) exactly; how should you classify?
  – Bayes classifier: y* = h_Bayes(x) = argmax_y P(Y = y | X = x)
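A tiny sketch of the Bayes classifier rule for a case where P(Y|X = x) is known and given as a table; the class names and probabilities are made up for illustration.

```python
# Hypothetical known conditional distribution P(Y = y | X = x) for one input x.
p_y_given_x = {"benign": 0.3, "malignant": 0.7}

# Bayes classifier: pick the class with the highest conditional probability.
y_star = max(p_y_given_x, key=p_y_given_x.get)
print(y_star)   # "malignant"
```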
Generative vs. Discriminative Classifiers - Intuition
• Generative classifier, e.g., Naïve Bayes:
  • Assume some functional form for P(X|Y) and P(Y)
  • Estimate the parameters of P(X|Y) and P(Y) directly from the training data
  • Use Bayes' rule to calculate P(Y|X = x)
  • This is a 'generative' model:
    • Indirect computation of P(Y|X) through Bayes' rule
    • A probabilistic model of each class
• Discriminative classifier, e.g., Logistic Regression:
  • Assume some functional form for P(Y|X)
  • Estimate the parameters of P(Y|X) directly from the training data
  • This is a 'discriminative' model:
    • Directly learns P(Y|X)
    • Focuses on the decision boundary
Logistic Regression: Hypothesis
• In logistic regression, we learn the conditional distribution P(y|x).
• Let p_y(x; θ) be our estimate of P(y|x), where θ is a vector of adjustable parameters.
• Assume there are two classes, y = 0 and y = 1, and
    p_1(x; θ) = 1 / (1 + e^(−θ^T x))
    p_0(x; θ) = 1 − 1 / (1 + e^(−θ^T x))
• This is equivalent to
    log( p_1(x; θ) / p_0(x; θ) ) = θ^T x
• That is, the log odds of class 1 is a linear function of x.
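A quick numerical check of this equivalence; the θ and x values below are arbitrary.

```python
import numpy as np

theta = np.array([-1.0, 0.8, 0.5])   # arbitrary parameters (theta_0 is the intercept)
x     = np.array([1.0, 2.0, -1.0])   # arbitrary input with x_0 = 1

z  = theta @ x                        # theta^T x
p1 = 1.0 / (1.0 + np.exp(-z))         # p_1(x; theta)
p0 = 1.0 - p1                         # p_0(x; theta)

print(np.log(p1 / p0), z)             # the log odds equals theta^T x (up to rounding)
```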
Hypothesis representation
• Want 0 ≤ h_θ(x) ≤ 1
• h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^(−z))
• g is called the sigmoid function (or logistic function)
[Plot: the sigmoid g(z) against z]
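A minimal sigmoid sketch, just to make the shape of g concrete:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(z, sigmoid(z))
# g(0) = 0.5; g(z) -> 0 as z -> -inf and g(z) -> 1 as z -> +inf.
```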
Interpretation of hypothesis output
• h_θ(x) = estimated probability that y = 1 on input x
• Example: if x = [x_0; x_1] = [1; tumorSize] and h_θ(x) = 0.7,
  tell the patient there is a 70% chance of the tumor being malignant.
Logistic regression
h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^(−z)) and z = θ^T x
[Plot: the sigmoid g(z); g(z) ≥ 0.5 exactly when z ≥ 0]
Suppose we predict "y = 1" if h_θ(x) ≥ 0.5, i.e. when z = θ^T x ≥ 0,
and predict "y = 0" if h_θ(x) < 0.5, i.e. when z = θ^T x < 0.
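A sketch of this decision rule; since g(z) ≥ 0.5 exactly when z ≥ 0, the prediction only needs the sign of θ^T x. The parameters and inputs below are arbitrary.

```python
import numpy as np

def predict(theta, x):
    """Predict y = 1 if h_theta(x) = g(theta^T x) >= 0.5, i.e. if theta^T x >= 0."""
    return 1 if theta @ x >= 0 else 0

theta = np.array([-3.0, 1.0, 1.0])                  # arbitrary parameters
print(predict(theta, np.array([1.0, 1.0, 1.0])))    # theta^T x = -1 -> predict 0
print(predict(theta, np.array([1.0, 2.0, 2.0])))    # theta^T x = +1 -> predict 1
```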
Decision boundary
[Plot: training examples plotted by Tumor Size (x_1) and Age (x_2), separated by a linear decision boundary]
E.g., h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2)
• Predict "y = 1" if θ_0 + θ_1 x_1 + θ_2 x_2 ≥ 0; the set of points where θ_0 + θ_1 x_1 + θ_2 x_2 = 0 is the decision boundary.
Cost Function
Training set with m examples: { (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m)) }
x = [x_0, x_1, …, x_n]^T ∈ R^(n+1),  x_0 = 1,  y ∈ {0, 1}
h_θ(x) = 1 / (1 + e^(−θ^T x))
How do we choose the parameters θ?
Reminder: Cost function for Linear Regression
J(θ) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2 = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i)),
where Cost(h_θ(x), y) = (1/2) (h_θ(x) − y)^2.
(With the sigmoid hypothesis this squared-error cost is non-convex in θ, which is why logistic regression uses a different cost function.)
Cost function for Logistic Regression
Cost(h_θ(x), y) = −log(h_θ(x))        if y = 1
                  −log(1 − h_θ(x))    if y = 0
[Plots of −log(h_θ(x)) and −log(1 − h_θ(x)) for h_θ(x) between 0 and 1]
Logistic regression cost function
• Cost(h_θ(x), y) = −log(h_θ(x))        if y = 1
                    −log(1 − h_θ(x))    if y = 0
• Written in one line: Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))
  • If y = 1: Cost(h_θ(x), y) = −log(h_θ(x))
  • If y = 0: Cost(h_θ(x), y) = −log(1 − h_θ(x))
Logistic regression
J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i))
     = −(1/m) Σ_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
Learning: fit the parameters θ by solving min_θ J(θ).
Prediction: given a new x, output h_θ(x) = 1 / (1 + e^(−θ^T x)).
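A minimal sketch of J(θ) for logistic regression on a tiny made-up dataset; the data and θ values are arbitrary, and a small clip keeps log(0) out of the way.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Cross-entropy cost J(theta) = -(1/m) sum[y log h + (1 - y) log(1 - h)]."""
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1.0 - eps)        # avoid log(0)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Tiny made-up dataset: one feature plus the x0 = 1 column.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

print(cost(np.array([0.0, 0.0]), X, y))    # log(2) ~ 0.693 for an uninformed theta
print(cost(np.array([-4.0, 2.0]), X, y))   # a theta that separates the data has lower cost
```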
Reference
• Andrew Y. Ng. Machine Learning. Stanford University.