Lecture III.3 : Logistic Regression
III.3.1. Introduction to Classification
Let’s now talk about the classification problem. This is just like the regression problem, except that the value y we now want to predict is a discrete random variable instead of a continuous one.
We can state our classification problem like this:
Given a training dataset:
D = {(x1 , y1 ); (x2 , y2 ); … ; (xn , yn )}
where:
xi are the input variables;
yi are the corresponding labels, which belong to a value set T (T can be {1; 2; …; n} or some other discrete value set).
Our goal is to find a prediction function f(x) that accurately predicts the label y (which belongs to the value set T) of an unseen datapoint x.
Let’s begin with the binary classification problem, where y can only take two values, 0 and 1. For instance, if we are trying to build a spam classifier for email, then x(i) may be some features of an email, and y is 1 if it’s a spam email and 0 otherwise.
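As a purely illustrative sketch, a tiny binary-labeled dataset for the spam example might look like this in Python; the specific feature encoding is an assumption for illustration, not part of the lecture:

```python
# Hypothetical toy dataset for binary classification (spam filtering).
# Each feature vector could encode e.g. [num_links, num_shouting_words];
# y = 1 marks spam, y = 0 marks a normal email.
dataset = [
    ([5.0, 12.0], 1),  # many links and shouting words -> spam
    ([0.0, 1.0], 0),   # a normal email
    ([7.0, 9.0], 1),
    ([1.0, 0.0], 0),
]

xs = [x for x, _ in dataset]  # input variables x_i
ys = [y for _, y in dataset]  # labels y_i
labels = set(ys)              # the value set T = {0, 1}
```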
III.3.2. Logistic Regression
Let’s begin by choosing a new hypothesis function hθ(x). In this model, our hypothesis function will be:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

where:

$$g(z) = \frac{1}{1 + e^{-z}}$$

This function is called the logistic function or the sigmoid function.
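The sigmoid and the hypothesis hθ(x) = g(θᵀx) can be sketched in a few lines of Python (a minimal illustration; the function names are our own):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta: list, x: list) -> float:
    """Hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))
```

Note that g(0) = 0.5, g(z) tends toward 1 as z grows large, and toward 0 as z grows very negative, so the output can always be read as a probability.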
III.3.2.1. Why choose the logistic function?
Let’s compare the graphs of several candidate activation functions.
The yellow one represents linear regression. This line is unbounded, so it’s not suitable for this problem (though we could fix that with a simple clipping rule: if y > 1, set y = 1, and if y < 0, set y = 0). However, this is still not a good choice, as linear regression is sensitive to noise.
Here is an example: [figure: a linear fit skewed by noisy points]
The red one represents the hard threshold (which looks close to PLA). PLA also doesn’t work efficiently on this problem, since our data isn’t linearly separable (this will be discussed later).
Therefore, the blue and green lines seem much more suitable for our problem.
III.3.2.2. Logistic Regression under Probabilistic Interpretation
In this section, we are working with a binary classification problem, so let’s
assume that:
$$P(y = 1 \mid x; \theta) = h_\theta(x)$$
$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$
Note that this can be written more compactly as
$$p(y \mid x; \theta) = (h_\theta(x))^{y}\,(1 - h_\theta(x))^{1-y}$$
Assuming that the m training examples were generated independently, we can write the likelihood of the parameters as:

$$L(\theta) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$$
As before, it will be easier to maximize the log-likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h(x^{(i)})\right) \right]$$
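The log-likelihood translates directly into code; here is a minimal Python sketch (function names are our own, not from the notes):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(theta, xs, ys):
    """l(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

Each term is the log of a probability, so ℓ(θ) is always ≤ 0, and maximizing it pushes the predicted probabilities toward the observed labels.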
How do we maximize the likelihood ? Similar to our derivation in the case of linear
regression, we can use gradient ascent.
Our update formula:
$$\theta_j := \theta_j + \alpha \frac{\partial}{\partial \theta_j} \ell(\theta)$$
Note that we are maximizing our function rather than minimizing it; therefore, the update moves in the direction of the gradient (that is, along the direction of steepest ascent).
Let’s start by working with just one training example (x, y) and take derivatives to derive the stochastic gradient ascent rule:

$$\frac{\partial}{\partial \theta_j} \ell(\theta) = \left( y \cdot \frac{1}{g(\theta^T x)} - (1 - y) \cdot \frac{1}{1 - g(\theta^T x)} \right) \cdot \frac{\partial}{\partial \theta_j} g(\theta^T x)$$
Here, we will use a useful property of the logistic function:

$$g'(z) = g(z)\,(1 - g(z))$$
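This identity can be sanity-checked numerically against a central finite difference (an illustrative sketch, not part of the original notes):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative: g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

def numeric_grad(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)
```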
Therefore:

$$\frac{\partial}{\partial \theta_j} g(\theta^T x) = g(\theta^T x)\,(1 - g(\theta^T x))\,x_j$$
Then :
$$\frac{\partial}{\partial \theta_j} \ell(\theta) = y\,(1 - g(\theta^T x))\,x_j - (1 - y)\,g(\theta^T x)\,x_j = (y - g(\theta^T x))\,x_j = (y - h_\theta(x))\,x_j$$
In conclusion, our stochastic gradient ascent rule is :
$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$
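Putting the rule together, a minimal stochastic gradient ascent loop might look like this in Python (the learning rate, epoch count, and toy dataset are arbitrary illustrative choices, not from the notes):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sga_logistic(xs, ys, alpha=0.1, epochs=200, seed=0):
    """Stochastic gradient ascent for logistic regression:
    theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_i_j."""
    rng = random.Random(seed)
    theta = [0.0] * len(xs[0])
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit training examples in random order
        for i in idx:
            h = sigmoid(sum(t * xj for t, xj in zip(theta, xs[i])))
            for j in range(len(theta)):
                theta[j] += alpha * (ys[i] - h) * xs[i][j]
    return theta

# Toy linearly separable data, with an intercept feature x_0 = 1.
xs = [[1.0, 2.0], [1.0, 3.0], [1.0, -2.0], [1.0, -3.0]]
ys = [1, 1, 0, 0]
theta = sga_logistic(xs, ys)
preds = [1 if sigmoid(sum(t * xj for t, xj in zip(theta, x))) > 0.5 else 0
         for x in xs]
```

Each inner step moves θ a little in the direction that makes the observed label more probable; the residual (y − hθ(x)) shrinks as the fit improves.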
This looks identical to the LMS update rule; however, it is not the same algorithm, because $h_\theta(x^{(i)})$ is now defined as a non-linear function of $\theta^T x^{(i)}$.
It’s surprising that we end up with the same update rule for two different learning algorithms and learning problems.
Is this a coincidence, or is there a deeper reason behind it?
We’ll answer this when we get to GLMs (Generalized Linear Models).