Logistic Regression
Logistic Regression
Discriminative Classifiers
Logistic
Regression
Logistic Regression
by contrast:
imagenet imagenet
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?
Logistic
Regression
Classification Reminder
Positive/negative sentiment
Spam/not spam
Authorship attribution
(Hamilton or Madison?)
Alexander Hamilton
Text Classification: definition
Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2,…, cJ}
1
y = s (z) = z
(5
1+e
20
Idea of logistic regression
P(y = 1) = s (w · xx +
+b)
b)
11
=
1 + exp
exp(( (w (w··xx+
+b))
b))
P(y =
P(y = 0) = 11 s
0) = (w··xx+
s(w +b)b)
11
= 11 1 + exp ( (w · x + b))
=
1 + exp ( (w · x + b))
exp
exp (( (w
(w ··xx++ b))
b))
=
= 1 + exp ( (w · x + b))
1 + exp ( (w · x + b))
P(y = 0) = 1 s (w · x + b)
P(y 1) s x b)
The sigmoid function has the property
= = (w · +
By the way: 1 = 1
1
= s (x) = s ( x) 1 + exp ( (w · x
1 + exp ( (w · x1+ b))
exp ( (w · x + b))
=
so we could also have expressed
P(y = 0) = 1 s (w · x + b) P(y = 0) =
as s ( (w ·1x+ b)).( (w · x + b
+exp
Now we have an algorithm 1 that given an instance x computes the
P(y = 1|x). How = 1 do The we sigmoid
make a function has For
decision? the property
a test instance x, we sa
1 + exp ( (w · x + b)) Because
probability P(y = exp 1|x)( is(wmore
· x +than
b)) .5, and no otherwise. We call .5
1 s (x) = s ( x)
boundary: = (5.5)
1 + exp ( (w · x + b))
so we could⇢ also have expressed P(y = 0) as s ( (w · x
tion has the property Now we 1 ifanP(y
have = 1|x)that
algorithm > 0.5given an instance
ŷ =
P(y = 1|x). How 0 otherwise
do we make a decision? For a test i
1 s (x) = s ( x) P(y = 1|x) is more than .5, (5.6)
probability and no otherwi
5.1.1 Example:decision
boundary sentiment classification
boundary:
How do we make a decision? For a test instance x, we sa
(y = 1|x) is more
Turning than .5, andinto
a probability no otherwise.
a classifierWe call .5 t
⇢
1 if P(y = 1|x) > 0.5
ŷ =
0 otherwise
⇢
1 if P(y = 1|x) > 0.5 if w·x+b > 0
ŷ =
0 otherwise if w·x+b ≤ 0
sentiment classification
e. Suppose we are doing binary sentiment classification
Classification in Logistic Regression
Logistic
Regression
Logistic Regression: a text example
on sentiment classification
Logistic
Regression
Sentiment example: does y=1 or y=0?
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .
29
1 if P(y = 1|x) > 0.5
ŷ x= =2
2 0 otherwise
x3=1
5.1.1 It'sExample:
hokey . Theresentiment
are virtually classification
no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
Let’s have
great an example.
. Another nice Suppose wemusic
touch is the are doing
. I wasbinary sentiment
overcome with theclassification
urge to get offon
movie the
review
couch text,
andand
startwe would. like
dancing to know
It sucked mewhether to assign
in , and it'll do the the
samesentiment
to you . class
+ or to a review document doc. We’ll represent each input observation by the 6
x4=3
x1input
features x1 ...x6 of the =3 shown
x5=0in the following
x6=4.19 table; Fig. 5.2 shows the features
in a sample mini test document.
Figure 5.2 A sample mini test document showing the extracted features in the vector x.
Var Definition Value in Fig. 5.2
x1 these
Given count(positive lexicon)
6 features and 2 doc)
the input review x, P(+|x)3 and P( |x) can be com-
putedx2usingcount(negative
⇢Eq. 5.5: lexicon) 2 doc) 2
1 if “no” 2 doc
x 1|x) = s (w · x + b)
p(+|x) = P(Y 0= otherwise
3 1
x4 count(1st
⇢ = s
and 2nd pronouns
([2.5, doc)
5.0, 21.2, 3 · [3, 2, 1, 3, 0, 4.19] + 0.1)
0.5, 2.0, 0.7]
1 if “!”=2 doc
s (.833)
x5 0
0 otherwise
= 0.70 (5.6)
x6 log(word count of doc) ln(66) = 4.19 30
x2the following
tures x1 ...x6 of the input shown in count(negative
⇢ table; Fig. lexicon) thedoc)
5.2 shows 2 features
Classifying sentiment
a sample mini test document.
x
for input x 1 if “no” 2 doc
3
Var Definition 0 otherwise
Value in Fig. 5.2
x1 x4 2 doc)
count(positive lexicon) count(1st
⇢ and3 2nd pronouns 2 doc)
x2 count(negative lexicon) 2 doc)1 if “!” 22doc
⇢
1 if “no” 2 doc 5
x
x3 0 otherwise
1
0 otherwise x6 log(word count of doc)
x4 count(1st
⇢ and 2nd pronouns 2 doc) 3
1 if “!” 2 doc
x5 0
0 otherwise
x6 log(word countLet’s assume for the moment
of doc) ln(66) =that
4.19we’ve alrea
for each of these features, and that the 6 weights
Suppose ware= [2.5, 5.0, 1.2, 0.5, 2.0, 0.7], while b = 0.1. (
et’s assume for the moment that we’ve already learned a real-valued weight for
bhow
= 0.1the weights are learned.) The weight w1 , for e
ch of these features, and that the 6 weights corresponding to the 6 features are
31
Figure 5.2 1 mini test5document showing
A sample 6 the extracted features in the vector x.
Classifying
Figure 5.2
sentiment for input x
A sample mini test document showing the extracted features in the vector x.
Given these 6 features and the input review x, P(+|x) and P( |x) can be com-
puted usingthese
Given Eq. 5.5:
6 features and the input review x, P(+|x) and P( |x) can be com-
puted using Eq. 5.5:
p(+|x) = P(Y = 1|x) = s (w · x + b)
p(+|x) = P(Y = 1|x) = s ([2.5,
(w · x + 5.0,
b) 1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
= s (.833)
([2.5, 5.0, 1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
s (.833)
= 0.70 (5.6)
0.70s (w · x + b)
p( |x) = P(Y = 0|x) = 1 (5.6)
1 s (w · x + b)
p( |x) = P(Y = 0|x) = 0.30
= 0.30
Logistic regression is commonly applied to all sorts of NLP tasks, and any property
of the input
Logistic can be aisfeature.
regression commonlyConsider
appliedthetotask
all of period
sorts disambiguation:
of NLP tasks, and any deciding
property
if
of athe
period
input is
canthe
beend of a sentence
a feature. Considerorthe
part ofofa period
task word, by classifying each
disambiguation: period
deciding
32
erty of the input can be a feature. Consider the task of period disambiguation:
We can build features for logistic regression for
deciding if a period is the end of a sentence or part of a word, by classifying each
any classification task: period disambiguation
period into one of two classes EOS (end-of-sentence) and not-EOS. We might use
features like x1 below expressing that the current word is lower case and the class
is EOS (perhaps with a positive weight), or that the current word is in our abbrevia-
End ofwith
tions dictionary (“Prof.”) and the class is EOS (perhaps sentence
a negative weight). A
feature canThis ends ainquite
also express a period.
complex combination of properties. For example a
The house
period following at 465
a upper cased word Main St.to is
is a likely new.
be an EOS, but if the word itself is
St. and the previous word is capitalized, then the period is likely part of a shortening
of the word street. Not end
⇢
1 if “Case(wi ) = Lower”
x1 =
0 otherwise
⇢
1 if “wi 2 AcronymDict”
x2 =
0 otherwise
⇢
1 if “wi = St. & Case(wi 1 ) = Cap”
x3 =
0 otherwise 33
, which is just whatinwe
Classification want for
(binary) a probability.
logistic Because
regression: it is
summary
but has a sharp slope toward the ends, it tends to squash outlier
Given:
And it’s differentiable, which as we’ll see in Section 5.8 will
◦ a set of classes: (+ sentiment,- sentiment)
◦ a vector x of features [x1, x2, …, xn]
e. If we◦apply the sigmoid to the sum of the weighted features,
x1= count( "awesome")
ween 0 and
◦ x2 1. To make itofawords
= log(number probability, we just need to make
in review)
s, p(y◦=A 1) and wp(y
vector = 0), sum
of weights to 1.w2,
[w1, We can
…, do
wn]this as follows:
◦ wi for each feature fi
P(y = 1) = s (w · x + b)
1
=
1 + e (w·x+b)
Learning: Cross-Entropy Loss
Logistic
Regression
Wait, where did the W’s come from?
Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, "!
We want to set w and b to minimize the distance between our
estimate "! (i) and the true y(i).
• We need a distance estimator: a loss function or a cost
function
• We need an optimization algorithm to update w and b to
minimize the loss.
36
Learning components
A loss function:
◦ cross-entropy loss
An optimization algorithm:
◦ stochastic gradient descent
The distance between "! and y
noting:
take the log of both sides.
if y=1, This will to
this simplifies turn
"! out to be handy mathema
n’t hurt us; whatever values maximize a probability will also maxim
if y=0, this simplifies to 1- "!
e probability:
he probability p(y|x) that our classifier produces for one observa
y 1 y
wingDeriving in mind that ifp(y|x)
(keepingcross-entropy y=1, Eq.
= ŷ
5.9 (1 ŷ)
loss for a single observation
simplifies x
to ŷ; if y=0,
s to 1 ŷ):
Now weGoal:
takemaximize
the log ofprobability
both sides.of This
the correct
will turn labeloutp(y|x)
to be handy m
nd doesn’t hurt us; whatever
Maximize: p(y|x) = ŷ y (1maximize
values ŷ)1 y a probability will also
og of the probability:
Now take the log of both sides (mathematically handy)
take the log of both sides. This will turn ⇥ y out to be handy
⇤ mathema
n’t hurt us; whatever log
Maximize: p(y|x)
values maximize (1 ŷ)1 ywill also maxim
= log aŷprobability
e probability: = y log ŷ + (1 y) log(1 ŷ)
⇥ y 1 y
⇤
Eq. 5.10 log p(y|x)
describes
Whatever values = log ŷ log
a logmaximize
likelihood that
(1 ŷ)
should
p(y|x) will be
alsomaximized.
maximize In or
p(y|x)
nto loss function (something thatŷ +
we(1need
= y log y) to minimize),
log(1 ŷ) we’ll just
Eq. 5.10. The result is the cross-entropy loss LCE :
and doesn’t hurt us; whatever values maximize a probability will also maximiz
Now weNow wethe
take takelog
theof
logboth
of both sides.
sides. This Thiswill
willturn
turn out
out to
tobe
behandy
handy mathe
ma
logDeriving
of the probability:
cross-entropy
and doesn’t hurtwhatever loss
us; whatever for
values a single
maximize observation
a probability
probability will xalso
also max
and doesn’t hurt us; values ⇥ maximize a⇤ will m
log of the probability: y 1 y
log of the probability: log p(y|x) = log ŷ (1 ŷ)
Goal: maximize probability of the correct label p(y|x)
⇥ y 1 y
⇤
= y log
log p(y|x) = ŷlog
+⇥(1ŷ (1 y) log(1
ŷ) ŷ)
⇤ (
Maximize: log p(y|x) = =logy logŷ yŷ(1 ŷ) 1 y
+ (1 y) log(1 ŷ)
Eq. 5.10 describes a log likelihood that should be maximized. In order to turn
= y log ŷ + (1 y) log(1 ŷ)
into loss
Eq.function (something
5.10 describes a logthat we needthat
likelihood to should
minimize), we’ll just flip
be maximized. the sig
In order t
Eq.Now
5.10.flip
intoThesign
loss toisturn
result this into that
the(something
function a loss:
cross-entropy loss something to minimize
LCE :to minimize),
we need we’ll just flip th
Eq. 5.10Eq.describes
5.10. The aresult
log islikelihood that should
the cross-entropy loss LCE be: maximized. In orde
intoCross-entropy
loss function loss
= (because
(something
LCE (ŷ, y) is =
that
log p(y|x) formula
we [y logfor
need to cross-entropy(y,
ŷ +minimize),
(1 y) log(1we’ll "! )) fli(
ŷ)] just
Eq. 5.10. The result
Minimize: LCEis(ŷ,the cross-entropy
y) = log p(y|x) =loss [y Llog
CE :ŷ + (1 y) log(1 ŷ)]
Finally, we can plug in the definition of ŷ = s (w · x + b):
Or, plugging
Finally, we in
candefinition of ":
!
LCE (ŷ, y) = log p(y|x) =of ŷ =[yslog
plug in the definition (w · x + b):
ŷ + (1 y) log(1 ŷ)
LCE (ŷ, y) = [y log s (w · x + b) + (1 y) log (1 s (w · x + b))] (
LCE (ŷ, y) = [y log s (w · x + b) + (1 y) log (1 s (w · x + b))
Let's see if this works for our sentiment example
We want loss to be:
• smaller if the model estimate is close to correct
• bigger if model is confused
Let's first suppose the true label of this is y=1 (positive)
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another nice
touch is the music . I was overcome with the urge to get off the couch and
start dancing . It sucked me in , and it'll do the same to you .
x4=3
x1=3 x5=0 x6=4.19
Let'sFigure
see 5.2 if thisminiworks
A sample test documentfor our
showing sentiment
the extracted example
features in the vector x.
Given these 6 features and the input review x, P(+|x) and P( |x) can be com-
True value is y=1. How well is our model doing?
puted using Eq. 5.5:
Logistic
Regression
Stochastic Gradient Descent
Logistic
Regression
th gradient descent is to find the optimal weights: minim
Our goal: minimize the loss
ve defined for the model. In Eq. 5.13 below, we’ll explici
the loss
Let'sfunction L is that
make explicit parameterized by the
the loss function weights, whic
is parameterized
by weights
e learning !=(w,b)as q (in the case of logistic regressio
in general
s to find
• theAndset of represent
we’ll weights which minimizes
#" as f (x; θ ) to makethetheloss functi
mples: dependence on θ more obvious
We want the weights that minimize the loss, averaged
over all examples:
m
X
1
q̂ = argmin LCE ( f (x(i) ; q ), y(i) )
q m
i=1
Intuition of gradient descent
How do I get to the bottom of this river canyon?
w1 wmin w
0 (goal)
Let's first visualize for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function
Loss
slope of loss at w1
is negative
So we'll move positive
w1 wmin w
0 (goal)
Let's first visualize for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function
Loss
one step
of gradient
slope of loss at w1 descent
is negative
So we'll move positive
w1 wmin w
0 (goal)
Gradients
The gradient of a function of many variables is a
vector pointing in the direction of the greatest
increase in a function.
t+1 t d
w =w h L( f (x; w), y)
dw
’s extend the intuition from a function of one scalar variable
Now let's consider N dimensions
We want to know where in the N-dimensional space
(of the N parameters that make up θ ) we should
move.
The gradient is just such a vector; it expresses the
directional components of the sharpest slope along
each of the N dimensions.
of a 2-dimensional gradient vector taken at the red point.
Imagine 2 dimensions, w and b
Cost(w,b)
Visualizing the
gradient vector at
the red point
It has two
dimensions shown
in the x-y plane
b
w
Real gradients
Are much longer; lots and lots of weights
For each dimension wi the gradient component i
tells us the slope with respect to that variable.
◦ “How much would a small change in wi influence the
total loss function L?”
◦ We express the slope as a partial derivative ∂ of the loss
∂wi
The gradient is then defined as a vector of these
partials.
ach dimension wi , we express the slope as a partial derivative ∂ wi of th
n. The gradient is then defined as a vector of these partials. We’ll repre
The gradient
q ) to make the dependence on q more 2 ∂obvious: 3
∂ w L( f (x; q ), y)
We’ll represent "! as f (x; θ ) to make
6 the∂
1 dependence on 7 θ more
obvious: 6
2 ∂ w L( f (x; q ), y)37
—q L( f (x; q ), y)) = 6 ∂2
6 ∂ w1 L( f (x;. q ), y) 7
7
4
6 ∂ .
. 75
6 ∂ w∂ 2 L( f (x; q ), y)7
—q L( f (x; q ), y)) = 6 6 ∂ wn L( f ..(x; q ), y)
7
7
4 . 5
∂
nal equation for updating q based on ∂ wnthe (x; q ), y)is thus
L( fgradient
The final equation for updating θ based on the gradient is thus
final equation for updating q based on the gradient is thus
qt+1 = qt h—L( f (x; q ), y)
Gradient for Logistic Regression
4.1 What
The are
Gradient for Logistic
these partial Regression
derivatives for logistic regression?
te q , we need a definition for the gradient —L( f (x; q ), y). Re
order to update q , we need a definition for the gradient —L( f (x; q ), y). Recal
ession, the cross-entropy loss function is:
logistic regression, the cross-entropy loss function is:
The loss function
y) =LCE (ŷ,[yy)log=s (w ·x+
[y log b)· x++(1
s (w y) log
b) + (1 (1(1 ss(w
y) log (w ·· xx+ b))]
+b))]
It turns
that theoutderivative
that the derivative
of thisoffunction
this function
forfor oneobservation
one observation vecto
vec
. 5.18The
(theelegant derivative
reader of
canthis
seefunction
Section(see textbook 5.8 for derivation)
erested reader can see Section 5.8 for the derivation of
interested 5.8 for the derivation
ofthis
thisequa
eq
∂ LCE (ŷ, y)
∂ LCE (ŷ, y) ∂ w = [s (w · x + b) y]x j
=j [s (w · x + b) y]x j
∂ wthat
Note in Eq. 5.18 j the gradient with respect to a single weight w j represe
y intuitive
5.18 that value: the difference
the gradient with between
respectthe
totrue y and our
a single estimated
weight w reprŷ =
j
function S TOCHASTIC G RADIENT D ESCENT(L(), f (), x, y) returns q
# where: L is the loss function
# f is a function parameterized by q
# x is the set of training inputs x(1) , x(2) , ..., x(m)
# y is the set of training outputs (labels) y(1) , y(2) , ..., y(m)
q 0
repeat til done # see caption
For each training tuple (x(i) , y(i) ) (in random order)
1. Optional (for reporting): # How are we doing on this tuple?
Compute ŷ (i) = f (x(i) ; q ) # What is our estimated output ŷ?
Compute the loss L(ŷ (i) , y(i) ) # How far off is ŷ(i) ) from the true output y(i) ?
2. g —q L( f (x(i) ; q ), y(i) ) # How should we move q to maximize loss?
3. q q hg # Go the other way instead
return q
Figure 5.5 The stochastic gradient descent algorithm. Step 1 (computing the loss) is used
Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by algorithm from
supervision (like regular parameters), they are
chosen by algorithm designer.
Stochastic Gradient Descent
Logistic
Regression
Stochastic Gradient Descent:
An example and more details
Logistic
Regression
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume 3 parameters (2 weights and 1 bias) in Θ0 are zero:
w1 = w2 = b = 0
η = 0.1
5.4.1 The Gradient for Logistic Regression
Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning
Example of gradient descent
al equation for updating
Inrate
order updateqq ,based
h isto0.1: we needon the gradient
a definition is thus
for the gradient
for logistic regression, the cross-entropy loss function w
—L( f (x; q ), y). Re
is:1 = w2 = b = 0;
Update stepLfor update
CE (ŷ, y) =
θ is:
w1 = w2 = b = 0
[y log s (w · x + b) + (1 x1(1= 3;s (w
y) log x2· x=+2 b))]
h = 0.1
Stochastic gradient
Now that we descent we
have a gradient, is an onlinethealgorithm
compute that vector
new parameter minimizes
q 1 by the loss
moving
5.4.1 The Gradient for Logistic Regression
Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning
Example of gradient descent
al equation for updating
Inrate
order updateqq ,based
h isto0.1: we needon the gradient
a definition is thus
for the gradient
for logistic regression, the cross-entropy loss function w
—L( f (x; q ), y). Re
is:1 = w2 = b = 0;
Update stepLfor update
CE (ŷ, y) =
θ is:
w1 = w2 = b = 0
[y log s (w · x + b) + (1 x1(1= 3;s (w
y) log x2· x=+2 b))]
h = 0.1
Stochastic gradient
Now that we descent we
have a gradient, is an onlinethealgorithm
compute that vector
new parameter minimizes
q 1 by the loss
moving
5.4.1 The Gradient for Logistic Regression
Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning
Example of gradient descent
al equation for updating
Inrate
order updateqq ,based
h isto0.1: we needon the gradient
a definition is thus
for the gradient
for logistic regression, the cross-entropy loss function w
—L( f (x; q ), y). Re
is:1 = w2 = b = 0;
Update stepLfor update
CE (ŷ, y) =
θ is:
w1 = w2 = b = 0
[y log s (w · x + b) + (1 x1(1= 3;s (w
y) log x2· x=+2 b))]
h = 0.1
Stochastic gradient
Now that we descent we
have a gradient, is an onlinethealgorithm
compute that vector
new parameter minimizes
q 1 by the loss
moving
5.4.1 The Gradient for Logistic Regression
Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning
Example of gradient descent
al equation for updating
Inrate
order updateqq ,based
h isto0.1: we needon the gradient
a definition is thus
for the gradient
for logistic regression, the cross-entropy loss function w
—L( f (x; q ), y). Re
is:1 = w2 = b = 0;
Update stepLfor update
CE (ŷ, y) =
θ is:
w1 = w2 = b = 0
[y log s (w · x + b) + (1 x1(1= 3;s (w
y) log x2· x=+2 b))]
h = 0.1
Stochastic gradient
Now that we descent we
have a gradient, is an onlinethealgorithm
compute that vector
new parameter minimizes
q 1 by the loss
moving
5.4.1 The Gradient for Logistic Regression
Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning
Example of gradient descent
al equation for updating
Inrate
order updateqq ,based
h isto0.1: we needon the gradient
a definition is thus
for the gradient
for logistic regression, the cross-entropy loss function w
—L( f (x; q ), y). Re
is:1 = w2 = b = 0;
Update stepLfor update
CE (ŷ, y) =
θ is:
w1 = w2 = b = 0
[y log s (w · x + b) + (1 x1(1= 3;s (w
y) log x2· x=+2 b))]
h = 0.1
Stochastic gradient
Now that we descent we
have a gradient, is an onlinethealgorithm
compute that vector
new parameter minimizes
q 1 by the loss
moving
6 ∂ w L( f (x; q ), y)7
— L( f (x; q ), y)) = 6
In our mini example 2 7
there are three parameters, so the gradient vector has 3 dimen-
(5.15)
q
Example
r mini example
for w1 , w22, and
there
b.
are
sions,
We
of
for
can
w1gradient
three
, 6
w
4
2
parameters,
,
compute
and b. We
the
.. descent
.
can so
first
the
computegradient
gradient
7
the first vector
gradient
5 as follows:
has
as 3 dimen-
follows:
∂ L (ŷ,y) 3 2 3 2 3 2 3 2 3
CE
∂ b) y)x1
∂w
CE
1 (s (w · x +
2 — = 6 ∂ L (ŷ,y) 7 = 43(s (w2· x∂+wb) L(
n y)x
f (x;
5
q3
4
(s
), y)
(0)
2
= (s (0) 1)x
1)x 1
5 = 34
0.5x 1 1.5
0.5x 5 = 43 1.0 5
2
w,b 4 5 2 2 2
(s (w · x + b) y)x1 s (w · x +(s
∂ w2
∂ LCE (ŷ,y) y 1)x1 s (0) 1 0.5x1
b)(0) 0.5 1.5 0.5
= 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
∂b
on for updating q based on the gradient is thus
s (wthat
Now we Now
· x + b) have
y thata we havesa (0)
gradient, we1compute
gradient, we compute the new
0.5
the new parameter
parameter 0.5q 1vector
vector by moving
q 0 0in the opposite direction from the gradient:
θ1 by moving θ in the opposite direction from the gradient:
2 3 2 3 2 3
that we have a gradient, we compute the w new
1 parameter
1.5 vector
.15 q 1 by moving
q
the opposite = q h—L( f
q
t+1directiont from the gradient:
1
= 4qw),
(x; y) h 4 1.0 5 =η 4= .10.1;5
2 5 (5.16)
b 0.5 .05
2 3 2 3 2 3
So afterwone 1.5 descent, .15
1 step of gradient the weights have shifted to be: w1 = .15,
q 1w=2= 2 5b = h
4.1,wand 4 1.0 5 = 4 .1 5
.05.
Note that this observation x happened to be a positive example. We would expect
b 0.5 .05
that after seeing more negative examples with high counts of negative words, that
the weight w2 would shift to have a negative value.
ter one step of gradient descent, the weights have shifted to be: w = .15,
6 ∂ w L( f (x; q ), y)7
— L( f (x; q ), y)) = 6
In our mini example 2 7
there are three parameters, so the gradient vector has 3 dimen-
(5.15)
q
Example
r mini example
for w1 , w22, and
there
b.
are
sions,
We
of
for
can
w1gradient
three
, 6
w
4
2
parameters,
,
compute
and b. We
the
.. descent
.
can so
first
the
computegradient
gradient
7
the first vector
gradient
5 as follows:
has
as 3 dimen-
follows:
∂ L (ŷ,y) 3 2 3 2 3 2 3 2 3
CE
∂ b) y)x1
∂w
CE
1 (s (w · x +
2 — = 6 ∂ L (ŷ,y) 7 = 43(s (w2· x∂+wb) L(
n y)x
f (x;
5
q3
4
(s
), y)
(0)
2
= (s (0) 1)x
1)x 1
5 = 34
0.5x 1 1.5
0.5x 5 = 43 1.0 5
2
w,b 4 5 2 2 2
(s (w · x + b) y)x1 s (w · x +(s
∂ w2
∂ LCE (ŷ,y) y 1)x1 s (0) 1 0.5x1
b)(0) 0.5 1.5 0.5
= 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
∂b
on for updating q based on the gradient is thus
s (wthat
Now we Now
· x + b) have
y thata we havesa (0)
gradient, we1compute
gradient, we compute the new
0.5
the new parameter
parameter 0.5q 1vector
vector by moving
q 0 0in the opposite direction from the gradient:
θ1 by moving θ in the opposite direction from the gradient:
2 3 2 3 2 3
that we have a gradient, we compute the w new
1 parameter
1.5 vector
.15 q 1 by moving
q
the opposite = q h—L( f
q
t+1directiont from the gradient:
1
= 4qw),
(x; y) h 4 1.0 5 =η 4= .10.1;5
2 5 (5.16)
b 0.5 .05
2 3 2 3 2 3
So afterwone 1.5 descent, .15
1 step of gradient the weights have shifted to be: w1 = .15,
q 1w=2= 2 5b = h
4.1,wand 4 1.0 5 = 4 .1 5
.05.
Note that this observation x happened to be a positive example. We would expect
b 0.5 .05
that after seeing more negative examples with high counts of negative words, that
the weight w2 would shift to have a negative value.
ter one step of gradient descent, the weights have shifted to be: w = .15,
6 ∂ w L( f (x; q ), y)7
— L( f (x; q ), y)) = 6
In our mini example 2 7
there are three parameters, so the gradient vector has 3 dimen-
(5.15)
q
Example
r mini example
for w1 , w22, and
there
b.
are
sions,
We
of
for
can
w1gradient
three
, 6
w
4
2
parameters,
,
compute
and b. We
the
.. descent
.
can so
first
the
computegradient
gradient
7
the first vector
gradient
5 as follows:
has
as 3 dimen-
follows:
∂ L (ŷ,y) 3 2 3 2 3 2 3 2 3
CE
∂ b) y)x1
∂w
CE
1 (s (w · x +
2 — = 6 ∂ L (ŷ,y) 7 = 43(s (w2· x∂+wb) L(
n y)x
f (x;
5
q3
4
(s
), y)
(0)
2
= (s (0) 1)x
1)x 1
5 = 34
0.5x 1 1.5
0.5x 5 = 43 1.0 5
2
w,b 4 5 2 2 2
(s (w · x + b) y)x1 s (w · x +(s
∂ w2
∂ LCE (ŷ,y) y 1)x1 s (0) 1 0.5x1
b)(0) 0.5 1.5 0.5
= 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
∂b
on for updating q based on the gradient is thus
s (wthat
Now we Now
· x + b) have
y thata we havesa (0)
gradient, we1compute
gradient, we compute the new
0.5
the new parameter
parameter 0.5q 1vector
vector by moving
q 0 0in the opposite direction from the gradient:
θ1 by moving θ in the opposite direction from the gradient:
2 3 2 3 2 3
that we have a gradient, we compute the w new
1 parameter
1.5 vector
.15 q 1 by moving
q
the opposite = q h—L( f
q
t+1directiont from the gradient:
1
= 4qw),
(x; y) h 4 1.0 5 =η 4= .10.1;5
2 5 (5.16)
b 0.5 .05
2 3 2 3 2 3
So afterwone 1.5 descent, .15
1 step of gradient the weights have shifted to be: w1 = .15,
q 1w=2= 2 5b = h
4.1,wand 4 1.0 5 = 4 .1 5
.05.
Note that this observation x happened to be a positive example. We would expect
b 0.5 .05
that after seeing more negative examples with high counts of negative words, that
the weight w2 would shift to have a negative value.
ter one step of gradient descent, the weights have shifted to be: w = .15,
6 ∂ w L( f (x; q ), y)7
— L( f (x; q ), y)) = 6
In our mini example 2 7
there are three parameters, so the gradient vector has 3 dimen-
(5.15)
q
Example
r mini example
for w1 , w22, and
there
b.
are
sions,
We
of
for
can
w1gradient
three
, 6
w
4
2
parameters,
,
compute
and b. We
the
.. descent
.
can so
first
the
computegradient
gradient
7
the first vector
gradient
5 as follows:
has
as 3 dimen-
follows:
∂ L (ŷ,y) 3 2 3 2 3 2 3 2 3
CE
∂ b) y)x1
∂w
CE
1 (s (w · x +
2 — = 6 ∂ L (ŷ,y) 7 = 43(s (w2· x∂+wb) L(
n y)x
f (x;
5
q3
4
(s
), y)
(0)
2
= (s (0) 1)x
1)x 1
5 = 34
0.5x 1 1.5
0.5x 5 = 43 1.0 5
2
w,b 4 5 2 2 2
(s (w · x + b) y)x1 s (w · x +(s
∂ w2
∂ LCE (ŷ,y) y 1)x1 s (0) 1 0.5x1
b)(0) 0.5 1.5 0.5
= 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
∂b
on for updating q based on the gradient is thus
s (wthat
Now we Now
· x + b) have
y thata we havesa (0)
gradient, we1compute
gradient, we compute the new
0.5
the new parameter
parameter 0.5q 1vector
vector by moving
q 0 0in the opposite direction from the gradient:
θ1 by moving θ in the opposite direction from the gradient:
2 3 2 3 2 3
that we have a gradient, we compute the w new
1 parameter
1.5 vector
.15 q 1 by moving
q
the opposite = q h—L( f
q
t+1directiont from the gradient:
1
= 4qw),
(x; y) h 4 1.0 5 =η 4= .10.1;5
2 5 (5.16)
b 0.5 .05
2 3 2 3 2 3
So afterwone 1.5 descent, .15
1 step of gradient the weights have shifted to be: w1 = .15,
q 1w=2= 2 5b = h
4.1,wand 4 1.0 5 = 4 .1 5
.05.
Note that this observation x happened to be a positive example. We would expect
b 0.5 .05
that after seeing more negative examples with high counts of negative words, that
the weight w2 would shift to have a negative value.
ter oneNote
stepthat enough negative
of gradient descent,examples wouldhave
the weights eventually
shiftedmake w2 negative
to be: w = .15,
Mini-batch training
Stochastic gradient descent chooses a single
random example at a time.
That can result in choppy movements
More common to compute gradient over batches of
training instances.
Batch training: entire dataset
Mini-batch training: m examples (512, or 1024)
Regularization
Logistic
Regression
Overfitting
A model that perfectly match the training data has a
problem.
It will also overfit to the data, modeling noise
◦ A random word that perfectly predicts y (it happens to
only occur in one class) will get a very high weight.
◦ Failing to generalize to a test set without this word.
A good model should be able to generalize
Overfitting
Models that are too powerful can overfit the data
◦ Fitting the details of the training data so exactly that the
model doesn't generalize well to the test set
◦ How to avoid overfitting?
◦ Regularization in logistic regression
◦ Dropout in neural networks
79
o the unseen test set, but a model that overfits will have poor generalization
Regularization
o avoid overfitting, a new regularization term R(q ) is added to the object
on in Eq. 5.13, resulting in the following objective for a batch of m exa
slightlyArewritten
solutionfrom
for Eq.
overfitting
5.13 to be maximizing log probability rather th
Addand
mizing loss, a regularization
removing the m1 term R(θ) todoesn’t
term which the loss function
affect the argmax):
(for now written as maximizing logprob rather than minimizing loss)
X m
q̂ = argmax (i) (i)
log P(y |x ) aR(q ) (5.
q i=1
Idea: choose an R(θ) that penalizes large weights
ew regularization term R(q ) is used to penalize large weights. Thus a sett
◦ fitting the data well with lots of big weights not as good
weights that matches
as fitting thethe training
data a littledata
lessperfectly— but uses
well, with small many weights w
weights
values to do so—will be penalized more than a setting that matches the dat
ess well, but does so using smaller weights. There are two common ways
high
to dovalues to dobe
so—will so—will be penalized
penalized more thanmorea than a setting
setting that matches
that matches the the
datadaa
ell, lessL2
littlebut does Regularization
well, butusing
so does so
smaller (= ridge
using smaller
weights.weights.regression)
There There
are twoare common
two common way
ways to
scompute this regularization
regularization term R(qterm ). L2R(q ). L2 regularization
regularization is a quadratic
is a quadratic function
function o
the weight values, named because it uses the (square of the) L2 norm of the wei
values, The named sum because
of theitsquares
uses the of(square
the of the) L2 norm of the weigh
weights
values. The L2 norm, ||q ||2 , is the same as the Euclidean distance of the vecto
L2 norm,
from theThe ||q ||If2 ,qisconsists
origin. the sameof n the Euclidean
asweights, then: distance of the vector q
name is because this is the (square of the)
gin. If q consists of n weights, then:
L2 norm ||θ||2, = Euclidean distance
n of θ to the origin.
X
2
R(q ) = ||qn||
X2 = q j2 (5
R(q ) = ||q ||22 = q j2 j=1 (5.23
j=1
The L2 regularized objective function becomes:
L2 regularized objective "
function: #
larized objective function becomes:
Xm n
X
q̂ ="argmax log P(y(i)
# ) |x (i)
a q j2 (5
mq
X n
X
i=1 j=1
q i=1 j=1
i=1 j=1
L1 Regularization
regularization is a linear function(= lasso
of the regression)
weight values, named after the L1 n
ularization is a linear function of the weight values, named after the L1 n
||1 , the sum of the absolute values of the weights, or Manhattan distance
the sumdistance
nhattan of the absolute
is the valuesyou’d
distance of thehave
weights,
to walk Manhattan
orbetween two distance
points in a
The sum of the (absolute value of the) weights
htan distance
a street gridislike
theNewdistance
York):you’d have to walk between two points in a
Named
treet grid afterYork):
like New the L1 norm ||W||1, = sum of the absolute
n
values of the weights, = Manhattan
X
n
distance
R(q ) = ||q ||1 =
X |qi | (
R(q ) = ||q ||1 = |qi |
i=1 (
i=1
L1 regularized objective function becomes:
L1 regularized objective function:
regularized objective function
" mbecomes: #
X n
X
q̂ = argmax " m log P(y(i) |x(i)#) aX
n |q j | (
q
X
q̂ = argmax log P(y(i) |x(i) )
1=i a j=1|q
j| (
Multinomial Logistic
Logistic Regression
Regression
Multinomial Logistic Regression
Often we need more than 2 classes
◦ Positive/negative/neutral
◦ Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
◦ Classify emergency SMSs into different actionable classes
If >2 classes we use multinomial logistic regression
= Softmax regression
= Multinomial logit
= (defunct names : Maximum entropy modeling or MaxEnt
So "logistic regression" will just mean binary (2 output classes)
84
Recall that that the cross-entropy loss for binary logistic regression
LCE(",y)=
! -logp(y|x)=-[ylog "! +(1-y)log(1- ")]
!
)
=- % "*+,-"#* =-log "#$ [where c is the true class)
&'(
=-logP(y=c|x)
Multinomial Logistic Regression
The probability of everything must still sum to 1
P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1
Pk z
The softmax of an input vector
maxThe
of denominator
an input vector
i=1 e is
i
= z[z=
z used to [zz1 , z2 , ..., zall
normalize
,
1 2 , ..., z k ] is
k ] thus
is
the thusa vector
values a itself:
vector
into itself:
probabilities.
Thus for example given a vector:
" #
" exp1.1,
z = [0.6, exp
(z1 ) 1.5, 1.2, 3.2,
(z2 )1.1] exp (zk ) #
softmax(z) = Pk , Pk , ..., Pk (5.31)
exp (zi=1
1 )exp (zi ) exp (z2(z) i )
i=1 exp exp
i=1 exp (zi(z
) k)
max(z)
the result softmax(z)
= Pisk , Pk , ..., Pk
i=1 exp (zi ) i=1 exp (zi )
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010] i=1 exp (zi ) 87
ze zi softmaxze
2iszidefined as: eze
k zi
For a vector z of dimensionality
i=1e 1the
k, e
i=1 i=1
softmax(z) = Pk , Pk , ..., Pk
ThePsoftmax function i=1 e zi
i=1 ezi
i=1 e zi
k z
nominator i=1softmax(z
e is used
i
i ) = to P
normalize
exp (zi ) all
1 the
i kvalues into proba
(5.30)
xample givenP
◦ Turns
a a vector
kvector: z
z = [z ,z
1 2 ,...,zk ] kof k arbitrary
j=1 exp (z j )
values into probabilities
enominator e is used to normalize all the values into prob
i=1
i
example givenofaanvector:
The softmax input vector z = [z1 , z2 , ..., zk ] is thus a vector itself:
z = [0.6, 1.1, 1.5, 1.2, 3.2, 1.1]
" #
z = [0.6,
exp1.1,
(z1 ) 1.5,exp
1.2,
(z23.2,
) 1.1]exp (zk )
oftmax(z) is
softmax(z) = Pk , Pk , ..., Pk (5.31)
i=1 exp (zi ) i=1 exp (zi ) i=1 exp (zi )
softmax(z) is
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
ike the sigmoid, the input to the softmax will be the dot product b 88
vector w and an input vector x (plus a bias). But now we’ll need sepa
ectors (and bias) for each of the K classes.
Softmax in multinomial logistic regression
exp (wc · x + bc )
p(y = c|x) = k (5
X
exp (w j · x + b j )
j=1