
18/11/2021

Supervised learning (Part II)

Outline

- Linear regression
• Model representation
• Cost function
• Gradient Descent
- Logistic regression
• Generative vs discriminative classifiers
• Hypothesis representation
• Cost function


Linear regression

Regression: An example
Regression predicts a real-valued output. The training set is fed to a learning algorithm, which outputs a hypothesis h; given the size of a house, h returns an estimated price.

(Diagram: Training set -> Learning Algorithm -> Hypothesis h; Size of house -> h -> Estimated price.)


House pricing prediction

(Scatter plot: price ($ in 1000's, 100-400) against size in m^2 (500-2500).)

Training set
Size in m^2 (x)    Price ($) in 1000's (y)
2104               460
1416               232
1534               315
852                178
...                ...
(m = 47 examples in total)

• Notation:
• m = number of training examples
• x = input variable / feature
• y = output variable / target variable
• (x, y) = one training example
• (x^(i), y^(i)) = the i-th training example
Examples: x^(1) = 2104, x^(2) = 1416, y^(1) = 460
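To make the notation concrete, a minimal sketch is given below, assuming Python with NumPy (the slides do not prescribe a language); only the four table rows are used.

```python
import numpy as np

# Training examples from the table (size in m^2, price in $1000's).
x = np.array([2104.0, 1416.0, 1534.0, 852.0])   # inputs x
y = np.array([460.0, 232.0, 315.0, 178.0])      # targets y

m = len(x)            # m = number of training examples (4 here, 47 on the slide)
print(m)              # -> 4
print(x[0], y[0])     # x^(1) = 2104.0, y^(1) = 460.0 (Python indexes from 0)
```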


Model representation

The training set is fed to a learning algorithm, which outputs a hypothesis h. Given the size of a house, h(x) returns the estimated price (plot: price in $1000's against size in m^2, with the fitted line h).

With a single input feature this is univariate linear regression.

Cost function
Size in m^2 (x)    Price ($) in 1000's (y)
2104               460
1416               232
1534               315
852                178
...                ...
(m = 47)

• Hypothesis: h_θ(x) = θ_0 + θ_1 x
• θ_0, θ_1: parameters/weights

How do we choose the θ's?
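Before answering that, a small sketch of evaluating the hypothesis for a given choice of parameters; the function name h and the θ values are illustrative, not from the slides.

```python
import numpy as np

def h(x, theta0, theta1):
    """Univariate linear hypothesis h_theta(x) = theta_0 + theta_1 * x."""
    return theta0 + theta1 * np.asarray(x)

# Example: with theta_0 = 0 and theta_1 = 0.2 (arbitrary values), a 2104 m^2
# house would be priced at about 0 + 0.2 * 2104 = 420.8 (in $1000's).
print(h(2104.0, 0.0, 0.2))   # -> approximately 420.8
```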


Cost function

(Three example fits: different choices of θ_0, θ_1 give different lines through the same data points.)

Cost function

• Idea: choose θ_0, θ_1 so that h_θ(x) is close to y for our training examples (x, y).

J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2

J(θ_0, θ_1) is the cost function (squared error cost). (Plot: price in $1000's against size in m^2 with the fitted line.)
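A hedged sketch of this cost function in Python/NumPy; the helper name compute_cost and the toy data in the check are assumptions for illustration.

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared-error cost J(theta_0, theta_1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    x, y = np.asarray(x), np.asarray(y)
    m = len(y)
    predictions = theta0 + theta1 * x          # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy check: for the points (1,1), (2,2), (3,3), the choice theta_0 = 0,
# theta_1 = 1 fits them exactly, so the cost is 0.
print(compute_cost([1, 2, 3], [1, 2, 3], 0.0, 1.0))   # -> 0.0
```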


Cost function (simplified)

Full model:
• Hypothesis: h_θ(x) = θ_0 + θ_1 x
• Parameters: θ_0, θ_1
• Cost function: J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2
• Goal: minimize_{θ_0, θ_1} J(θ_0, θ_1)

Simplified model (set θ_0 = 0):
• Hypothesis: h_θ(x) = θ_1 x
• Parameter: θ_1
• Cost function: J(θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2
• Goal: minimize_{θ_1} J(θ_1)

Cost function

(A sequence of figures for the simplified model: on the left, h_θ(x) = θ_1 x is plotted against the data for one value of θ_1; on the right, the corresponding value of J(θ_1) is marked. Sweeping θ_1 traces out a bowl-shaped curve J(θ_1), whose minimum corresponds to the best-fitting line.)
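That bowl shape can be reproduced with a short sweep over θ_1; the three data points below are illustrative, chosen so that θ_1 = 1 fits them exactly.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])     # illustrative data with a perfect fit at theta_1 = 1
y = np.array([1.0, 2.0, 3.0])
m = len(y)

for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    J = np.sum((theta1 * x - y) ** 2) / (2 * m)   # J(theta_1) for h(x) = theta_1 * x
    print(f"theta_1 = {theta1:.1f}  ->  J = {J:.3f}")
# J is smallest (0.0) at theta_1 = 1.0 and grows as theta_1 moves away in
# either direction, tracing the bowl-shaped curve sketched on the slides.
```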


Cost function
• Hypothesis: h_θ(x) = θ_0 + θ_1 x
• Parameters: θ_0, θ_1
• Cost function: J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2
• Goal: minimize_{θ_0, θ_1} J(θ_0, θ_1)

Cost function

(Figure: J(θ_0, θ_1) plotted over both parameters.)


How do we find good θ_0, θ_1 that minimize J(θ_0, θ_1)?

Gradient descent

Have some function J(θ_0, θ_1)
Want min_{θ_0, θ_1} J(θ_0, θ_1)

Outline:
• Start with some θ_0, θ_1
• Keep changing θ_0, θ_1 to reduce J(θ_0, θ_1), until we hopefully end up at a minimum



Gradient descent

Repeat until convergence {
    θ_j := θ_j − α ∂J(θ_0, θ_1)/∂θ_j    (simultaneously for j = 0 and j = 1)
}

α: learning rate (step size)
∂J(θ_0, θ_1)/∂θ_j: partial derivative (rate of change of J)


Gradient descent

Correct (simultaneous update):
temp0 := θ_0 − α ∂J(θ_0, θ_1)/∂θ_0
temp1 := θ_1 − α ∂J(θ_0, θ_1)/∂θ_1
θ_0 := temp0
θ_1 := temp1

Incorrect:
temp0 := θ_0 − α ∂J(θ_0, θ_1)/∂θ_0
θ_0 := temp0
temp1 := θ_1 − α ∂J(θ_0, θ_1)/∂θ_1    (this now uses the already-updated θ_0)
θ_1 := temp1
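A minimal sketch of the correct simultaneous update; the function names and the toy gradient (for J(θ_0, θ_1) = θ_0² + θ_1², not the regression cost) are assumptions for illustration.

```python
def step(theta0, theta1, grad_fn, alpha):
    """One simultaneous gradient-descent step.
    grad_fn(theta0, theta1) returns (dJ/dtheta0, dJ/dtheta1)."""
    g0, g1 = grad_fn(theta0, theta1)   # both partials are computed from the OLD parameters
    temp0 = theta0 - alpha * g0
    temp1 = theta1 - alpha * g1
    return temp0, temp1                # only now are both parameters replaced

# Illustrative gradient of J(t0, t1) = t0^2 + t1^2 (not the regression cost):
grad = lambda t0, t1: (2 * t0, 2 * t1)
print(step(1.0, 2.0, grad, alpha=0.1))   # -> (0.8, 1.6)
```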

(Figure: the curve J(θ_1). Where ∂J(θ_1)/∂θ_1 < 0 the update increases θ_1, and where ∂J(θ_1)/∂θ_1 > 0 it decreases θ_1, so each step moves θ_1 toward the minimum.)


Learning rate

(Figure: effect of the learning rate α on the gradient descent steps.)

Gradient descent for linear regression

Repeat until convergence {
    θ_j := θ_j − α ∂J(θ_0, θ_1)/∂θ_j    (for j = 0 and j = 1)
}

• Linear regression model
  h_θ(x) = θ_0 + θ_1 x
  J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2


Computing partial derivatives

∂J(θ_0, θ_1)/∂θ_j = ∂/∂θ_j [ (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2 ]
                  = ∂/∂θ_j [ (1/2m) Σ_{i=1}^{m} (θ_0 + θ_1 x^(i) − y^(i))^2 ]

• j = 0:  ∂J(θ_0, θ_1)/∂θ_0 = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
• j = 1:  ∂J(θ_0, θ_1)/∂θ_1 = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x^(i)

Gradient descent for linear regression

Repeat until convergence {
    θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
    θ_1 := θ_1 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x^(i)
}

Update θ_0 and θ_1 simultaneously.
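Putting both update rules into a loop gives the full algorithm. A sketch under assumed defaults (a fixed iteration count instead of an explicit convergence test, illustrative α and data):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for h_theta(x) = theta_0 + theta_1 * x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = (theta0 + theta1 * x) - y                 # h_theta(x^(i)) - y^(i)
        grad0 = np.sum(error) / m                          # dJ/dtheta_0
        grad1 = np.sum(error * x) / m                      # dJ/dtheta_1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous
    return theta0, theta1

# Toy data generated from y = 1 + 2x: gradient descent should recover roughly (1, 2).
print(gradient_descent([0, 1, 2, 3], [1, 3, 5, 7]))
```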


Batch gradient descent

• "Batch": each step of gradient descent uses all m training examples (m = number of training examples).

Repeat until convergence {
    θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
    θ_1 := θ_1 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x^(i)
}

Training dataset with one feature

Size in m^2 (x)    Price ($) in 1000's (y)
2104               460
1416               232
1534               315
852                178
...                ...

h_θ(x) = θ_0 + θ_1 x


Multiple features (input variables)

Size in m^2 (x_1)   Bedrooms (x_2)   Floors (x_3)   Age in years (x_4)   Price ($) in 1000's (y)
2104                5                1              45                   460
1416                3                2              40                   232
1534                3                2              30                   315
852                 2                1              36                   178
...                 ...              ...            ...                  ...

Notation:
• n = number of features
• x^(i) = input features of the i-th training example
• x_j^(i) = value of feature j in the i-th training example
(Exercise: read x^(i) and x_j^(i) for a particular row of the table.)

Multiple features (input variables)

Hypothesis
Previously: h_θ(x) = θ_0 + θ_1 x
Now: h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_n x_n


Matrix representation

h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_n x_n

• For convenience of notation, define x_0 = 1 (x_0^(i) = 1 for all examples).
• x = [x_0, x_1, x_2, ⋯, x_n]ᵀ ∈ R^(n+1)    θ = [θ_0, θ_1, θ_2, ⋯, θ_n]ᵀ ∈ R^(n+1)
• h_θ(x) = θ_0 x_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_n x_n = θᵀx
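With x_0 = 1 prepended, θᵀx is a single dot product. A sketch using the first table row for the features; the θ values are made up, not fitted.

```python
import numpy as np

x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])     # [x_0, x_1, ..., x_4] with x_0 = 1
theta = np.array([80.0, 0.1, 5.0, 10.0, -1.0])  # illustrative parameters only

h = theta @ x     # h_theta(x) = theta^T x
print(h)          # -> 80 + 0.1*2104 + 5*5 + 10*1 - 1*45 = 280.4 (in $1000's)
```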

Gradient descent

• Previously (n = 1):
Repeat until convergence {
    θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
    θ_1 := θ_1 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x^(i)
}

• New algorithm (n ≥ 1):
Repeat until convergence {
    θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)
}
Simultaneously update θ_j for j = 0, 1, ⋯, n.
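The new update can also be written over a design matrix X whose rows are the training examples (with the x_0 = 1 column included). A sketch; the toy data and hyperparameters are illustrative.

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.1, num_iters=2000):
    """Batch gradient descent for h_theta(x) = theta^T x with n >= 1 features.
    X has shape (m, n+1) and already contains the x_0 = 1 column."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        error = X @ theta - y                      # h_theta(x^(i)) - y^(i), for all i
        theta = theta - alpha * (X.T @ error) / m  # simultaneous update of every theta_j
    return theta

# Toy data generated from y = 1 + 2*x1 + 3*x2; expect theta close to [1, 2, 3].
X = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]], float)
y = X @ np.array([1.0, 2.0, 3.0])
print(gradient_descent_multi(X, y))
```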


Gradient descent in practice: Feature scaling

• Idea: make sure features are on a similar scale (e.g., −1 ≤ x_j ≤ 1).
• E.g., x_1 = size (0-2000 m^2), x_2 = number of bedrooms (1-5).

(Figure: contours of J(θ) over θ_1, θ_2, elongated when the features have very different scales and closer to circular after scaling, so gradient descent takes a more direct path to the minimum.)
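The slides only ask that features be put on a similar scale; one common way to do this (a choice, not something the slides specify) is mean normalization, sketched below for the first two features of the table.

```python
import numpy as np

X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])          # columns: size in m^2, number of bedrooms

mu = X.mean(axis=0)                    # per-feature mean
sigma = X.std(axis=0)                  # per-feature spread (std; the range also works)
X_scaled = (X - mu) / sigma            # each feature now has mean 0 and a similar scale

print(X_scaled.round(2))
```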

Gradient descent in practice: Learning rate

• α too small: slow convergence
• α too large: J(θ) may not converge
• To choose α, try
  0.001, …, 0.01, …, 0.1, …, 1
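One way to apply this advice: run a few iterations with each candidate α and watch J(θ). A sketch with illustrative toy data and iteration count.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])        # toy data from y = 1 + 2x
m = len(y)

for alpha in [0.001, 0.01, 0.1, 1.0]:      # candidate learning rates from the slide
    t0, t1 = 0.0, 0.0
    for _ in range(50):                    # a few iterations are enough to see the trend
        err = (t0 + t1 * x) - y
        t0, t1 = t0 - alpha * err.sum() / m, t1 - alpha * (err * x).sum() / m
    J = np.sum(((t0 + t1 * x) - y) ** 2) / (2 * m)
    print(f"alpha = {alpha:<5}  J after 50 iterations = {J:.4g}")
# Too small an alpha barely reduces J; too large an alpha makes J blow up.
```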


Logistic regression


Logistic regression: An example

Is a tumor malignant (1 = Yes) or not (0 = No), given its size? A classification problem.

(Plot: malignant? (0/1) against tumor size.)


If we use linear regression?

Fit h_θ(x) = θᵀx to the (tumor size, malignant?) data and threshold the classifier output at 0.5:
• If h_θ(x) ≥ 0.5, predict "y = 1"
• If h_θ(x) < 0.5, predict "y = 0"

(Plot: malignant? (0/1) against tumor size, with the fitted line.)

If we use linear regression?

Classification: y = 1 or y = 0

h_θ(x) = θᵀx (from linear regression) can be > 1 or < 0, which is awkward when the targets are only 0 and 1.

Logistic regression instead guarantees 0 ≤ h_θ(x) ≤ 1.

Despite its name, logistic regression is actually used for classification.


Classification

• Learn: h: X -> Y
  – X – features
  – Y – target classes
• Suppose you know P(Y|X) exactly; how should you classify?
  – Bayes classifier:
    y* = h_bayes(x) = argmax_y P(Y = y | X = x)

Generative vs. Discriminative Classifiers - Intuition

• Generative classifier, e.g., Naïve Bayes:
  • Assume some functional form for P(X|Y), P(Y)
  • Estimate the parameters of P(X|Y), P(Y) directly from the training data
  • Use Bayes' rule to calculate P(Y|X = x)
  • This is a 'generative' model
    • Indirect computation of P(Y|X) through Bayes' rule
    • Probabilistic model of each class

• Discriminative classifier, e.g., Logistic Regression:
  • Assume some functional form for P(Y|X)
  • Estimate the parameters of P(Y|X) directly from the training data
  • This is a 'discriminative' model
    • Directly learns P(Y|X)
    • Focuses on the decision boundary


Logistic Regression: Hypothesis

• In logistic regression, we learn the conditional distribution P(y|x).
• Let p_y(x; θ) be our estimate of P(y|x), where θ is a vector of adjustable parameters.
• Assume there are two classes, y = 0 and y = 1, and

  p_1(x; θ) = 1 / (1 + e^(−θᵀx))        p_0(x; θ) = 1 − 1 / (1 + e^(−θᵀx))

• This is equivalent to

  log( p_1(x; θ) / p_0(x; θ) ) = θᵀx

  (Check: p_1 / p_0 = [1 / (1 + e^(−θᵀx))] / [e^(−θᵀx) / (1 + e^(−θᵀx))] = e^(θᵀx), whose log is θᵀx.)

• That is, the log odds of class 1 is a linear function of x.

Hypothesis representation

• Want h_θ(x) = g(θᵀx), where
  g(z) = 1 / (1 + e^(−z))

• g is called the sigmoid function, or logistic function.

(Plot: the S-shaped curve g(z).)
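A minimal sketch of the sigmoid in Python/NumPy:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid / logistic function g(z) = 1 / (1 + e^(-z)), with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

print(sigmoid(0))                      # -> 0.5
print(sigmoid([-6, -2, 0, 2, 6]))      # near 0 for large negative z, near 1 for large positive z
```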


Interpretation of hypothesis output

• h_θ(x) = estimated probability that y = 1 on input x

• Example: if x = [x_0; x_1] = [1; tumorSize] and h_θ(x) = 0.7, tell the patient that there is a 70% chance of the tumor being malignant.

Logistic regression

h_θ(x) = g(θᵀx),  g(z) = 1 / (1 + e^(−z)),  z = θᵀx

Suppose we predict "y = 1" if h_θ(x) ≥ 0.5, i.e. when z = θᵀx ≥ 0,
and predict "y = 0" if h_θ(x) < 0.5, i.e. when z = θᵀx < 0.
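A sketch of this decision rule; the parameter values are illustrative, not fitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """Predict y = 1 when h_theta(x) = g(theta^T x) >= 0.5, i.e. when theta^T x >= 0."""
    z = np.dot(theta, x)               # z = theta^T x
    return 1 if sigmoid(z) >= 0.5 else 0

theta = np.array([-1.0, 2.0])          # illustrative parameters only
print(predict(theta, np.array([1.0, 0.2])))   # z = -0.6 < 0  -> predict 0
print(predict(theta, np.array([1.0, 0.9])))   # z =  0.8 >= 0 -> predict 1
```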


Decision boundary

E.g., two features x_1 = tumor size, x_2 = age (plot: age against tumor size, with the two classes).

• Predict "y = 1" if θᵀx ≥ 0; the points where θᵀx = 0 form the decision boundary.

Cost Function

Training set with m examples:
{ (x^(1), y^(1)), (x^(2), y^(2)), ⋯, (x^(m), y^(m)) }

x = [x_0, x_1, ⋯, x_n]ᵀ,  x_0 = 1,  y ∈ {0, 1}

h_θ(x) = 1 / (1 + e^(−θᵀx))

How to choose parameters θ?


Reminder: Cost function for Linear Regression

J(θ) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2 = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i)),
where Cost(h_θ(x), y) = (1/2) (h_θ(x) − y)^2.

Cost function for Logistic Regression

Cost(h_θ(x), y) = −log(h_θ(x))       if y = 1
                  −log(1 − h_θ(x))   if y = 0

(Plots: each branch of the cost as h_θ(x) ranges from 0 to 1.)


Logistic regression cost function

• Cost(h_θ(x), y) = −log(h_θ(x))       if y = 1
                    −log(1 − h_θ(x))   if y = 0

• Compact form: Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))

• If y = 1: Cost(h_θ(x), y) = −log(h_θ(x))
• If y = 0: Cost(h_θ(x), y) = −log(1 − h_θ(x))

Logistic regression

J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i))
     = −(1/m) Σ_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

Learning: fit the parameters θ by solving min_θ J(θ).
Prediction: given a new x, output h_θ(x) = 1 / (1 + e^(−θᵀx)).
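A minimal end-to-end sketch. The gradient used below, (1/m) Σ (h_θ(x^(i)) − y^(i)) x_j^(i), is not derived on these slides but is the standard gradient of this J and has the same form as in linear regression; the data, α and iteration count are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / len(y)

def fit(X, y, alpha=0.5, num_iters=5000):
    """Gradient descent on J(theta); the gradient is (1/m) * X^T (h - y)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta = theta - alpha * grad          # simultaneous update of every theta_j
    return theta

# Made-up 1-D example: x_1 is tumor size (x_0 = 1), label 1 for the larger tumors.
X = np.array([[1, 1.0], [1, 2.0], [1, 3.0], [1, 4.0], [1, 5.0], [1, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit(X, y)
print(cost(theta, X, y))                 # cost is small after training
print(sigmoid(X @ theta).round(2))       # predicted probabilities rise with tumor size
```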


Reference

• Andrew Y. Ng. Machine Learning. Stanford University.
