
Lecture III.3: Logistic Regression

III.3.1. Introduction to Classification


Let’s now talk about the classification problem. This is just like the regression
problem, except that the value y we want to predict is a discrete random
variable instead of a continuous one.
We can state the classification problem as follows.
Given a training dataset:

$$D = \{(x_1, y_1);\ (x_2, y_2);\ \dots;\ (x_n, y_n)\}$$

where:

x_i are the input variables,

y_i are the corresponding labels, which belong to a value set T (T can be {1, 2, ..., K} or another discrete value set).


Our goal is to find a prediction function f(x) that accurately predicts the
label y (belonging to the value set T) for an unseen data point x.

Let’s begin with the binary classification problem, where y can only take two values,
0 and 1. For instance, if we are trying to build a spam classifier for email, then x^{(i)}
may be some features of an email, and y is 1 if the email is spam and 0
otherwise.
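As a concrete illustration, here is a minimal sketch (not from the lecture) of what such a binary training set might look like in Python; the feature values are invented purely for illustration.

```python
# A toy binary classification dataset: each x is a feature vector for one
# email, and each y is its label (1 = spam, 0 = not spam).
# The features (word counts, sender flag, ...) are made up for illustration.
X = [
    [3.0, 1.0, 0.0],   # e.g. count of "free", count of "$$$", known sender?
    [0.0, 0.0, 1.0],
    [5.0, 2.0, 0.0],
    [1.0, 0.0, 1.0],
]
y = [1, 0, 1, 0]       # labels from the value set T = {0, 1}

D = list(zip(X, y))    # the training dataset D = {(x_i, y_i)}
print(D[0])            # ([3.0, 1.0, 0.0], 1)
```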

III.3.2. Logistic Regression


Let’s begin by choosing a new hypothesis function h_θ(x). In this model, our
hypothesis function will be:



$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

where:

$$g(z) = \frac{1}{1 + e^{-z}}$$

This function is called the logistic function or the sigmoid function.
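As a quick illustration, here is a minimal Python sketch of the sigmoid and the corresponding hypothesis function; the function and variable names are my own, not from the lecture.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def h_theta(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x) for one example x."""
    z = sum(t * xi for t, xi in zip(theta, x))  # theta^T x
    return sigmoid(z)

# The output is always strictly between 0 and 1, so it can be read as a probability.
print(sigmoid(-10), sigmoid(0), sigmoid(10))  # ~0.0000454, 0.5, ~0.99995
```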

III.3.2.1. Why choose the logistic function?


Let’s compare the behaviour of several candidate activation functions, shown as coloured curves in the lecture figure (not reproduced here).

The yellow curve represents linear regression. Its output is unbounded, so it is
not suitable for this problem (we could clip it with a simple rule: if y > 1, set
y = 1, and if y < 0, set y = 0). However, this is still not a good choice, because
linear regression is sensitive to noise; the figure gives an example of this sensitivity.

The red curve represents a hard threshold (close in spirit to PLA, the perceptron
learning algorithm). PLA also does not work well for this problem, since our data
is not linearly separable (this will be discussed later).

Therefore, the blue and green curves, which are smooth and bounded, are much better
suited to our problem.
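To make the comparison concrete, here is a small sketch (my own, not from the lecture) contrasting an unbounded linear score, a hard threshold at zero, and the logistic function on the same inputs.

```python
import math

def linear(z):
    return z                       # unbounded: can be far outside [0, 1]

def hard_threshold(z):
    return 1.0 if z >= 0 else 0.0  # jumps abruptly, no notion of confidence

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # smooth and always in (0, 1)

for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"z={z:+.1f}  linear={linear(z):+.2f}  "
          f"threshold={hard_threshold(z):.0f}  sigmoid={sigmoid(z):.3f}")
```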



III.3.2.2. Logistic Regression under Probabilistic Interpretation
In this section, we are working with a binary classification problem, so let’s
assume that:

$$P(y = 1 \mid x; \theta) = h_\theta(x)$$

$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

Note that this can be written more compactly as

$$p(y \mid x; \theta) = \big(h_\theta(x)\big)^{y}\,\big(1 - h_\theta(x)\big)^{1 - y}$$
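As a quick check (not spelled out in the lecture), substituting the two possible values of y recovers the two cases above:

$$p(1 \mid x; \theta) = \big(h_\theta(x)\big)^{1}\big(1 - h_\theta(x)\big)^{0} = h_\theta(x), \qquad p(0 \mid x; \theta) = \big(h_\theta(x)\big)^{0}\big(1 - h_\theta(x)\big)^{1} = 1 - h_\theta(x)$$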

Assuming that the m training examples were generated independently, we can write
the likelihood of the parameters as:

$$L(\theta) = p(y \mid X; \theta) = \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big) = \prod_{i=1}^{m} \big(h_\theta(x^{(i)})\big)^{y^{(i)}} \big(1 - h_\theta(x^{(i)})\big)^{1 - y^{(i)}}$$

As before, it will be easier to maximize the log-likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[\, y^{(i)} \log h(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h(x^{(i)})\big) \right]$$
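Here is a minimal sketch (my own notation, not from the lecture) of how this log-likelihood could be computed for a small dataset:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h_theta(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = h_theta(theta, x_i)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return total

# Toy example: two features per point, labels in {0, 1}.
X = [[1.0, 2.0], [2.0, 0.5], [0.5, 1.5]]
y = [1, 0, 1]
print(log_likelihood([0.1, -0.2], X, y))
```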

How do we maximize the likelihood? As in our derivation for linear
regression, we can use gradient ascent.

Our update formula is:

$$\theta_j := \theta_j + \alpha \frac{\partial}{\partial \theta_j} \ell(\theta)$$



Note that we are maximizing the function rather than minimizing it; therefore, we
move in the direction of the gradient (i.e., along the direction of the derivative),
which is why the update rule uses a plus sign.

Let’s start by working with just one training example (x, y) and take derivatives to
derive the stochastic gradient ascent rule:

$$\frac{\partial}{\partial \theta_j} \ell(\theta) = \left( y \cdot \frac{1}{g(\theta^T x)} - (1 - y) \cdot \frac{1}{1 - g(\theta^T x)} \right) \cdot \frac{\partial}{\partial \theta_j} g(\theta^T x)$$

Here we will use a useful property of the logistic function:

$$g'(z) = g(z)\,\big(1 - g(z)\big)$$
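For completeness (this step is not worked out in the lecture notes above), the property follows directly from the chain rule:

$$g'(z) = \frac{d}{dz}\,\frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = g(z)\,\big(1 - g(z)\big),$$

since $1 - g(z) = \dfrac{e^{-z}}{1 + e^{-z}}$.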

Therefore:

$$\frac{\partial}{\partial \theta_j} g(\theta^T x) = g(\theta^T x)\,\big(1 - g(\theta^T x)\big)\,x_j$$

Then:

$$\frac{\partial}{\partial \theta_j} \ell(\theta) = y\,\big(1 - g(\theta^T x)\big)\,x_j - (1 - y)\,g(\theta^T x)\,x_j$$



$$= \big(y - g(\theta^T x)\big)\,x_j = \big(y - h_\theta(x)\big)\,x_j$$

In conclusion, our stochastic gradient ascent rule is:

$$\theta_j := \theta_j + \alpha\,\big(y^{(i)} - h_\theta(x^{(i)})\big)\,x_j^{(i)}$$

This looks identical to the LMS update rule; however, it is not the
same algorithm, because h_θ(x^{(i)}) is now defined as a non-linear function of θ^T x^{(i)}.
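Putting the pieces together, here is a minimal sketch of the stochastic gradient ascent loop described above; the learning rate, epoch count, and toy data are my own choices, not values from the lecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h_theta(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def sga_train(X, y, alpha=0.1, epochs=100):
    """Stochastic gradient ascent on the log-likelihood:
    theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij
    """
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - h_theta(theta, x_i)
            theta = [t + alpha * error * x_ij for t, x_ij in zip(theta, x_i)]
    return theta

# Toy data with a constant 1.0 as the first feature (bias term).
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 0, 1, 1]
theta = sga_train(X, y)
print(theta, [round(h_theta(theta, x), 2) for x in X])
```

A label can then be predicted by thresholding h_θ(x) at 0.5.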

It is surprising that we end up with the same update rule for two
different learning algorithms and learning problems.
Is this a coincidence, or is there a deeper reason behind it?
We will answer this when we get to generalized linear models (GLMs).
