Logistic Regression

The document discusses logistic regression as a discriminative classifier used in machine learning for classification tasks, contrasting it with generative classifiers like Naive Bayes. It outlines the components of a probabilistic machine learning classifier, the training and testing phases of logistic regression, and how to interpret the output as probabilities using the sigmoid function. Additionally, it provides examples of sentiment classification and the importance of feature weights in determining class predictions.

Background: Generative and Discriminative Classifiers
Logistic Regression

Logistic Regression

Important analytic tool in natural and social sciences
Baseline supervised machine learning tool for classification
Also the foundation of neural networks
Generative and Discriminative Classifiers

Naive Bayes is a generative classifier

by contrast:

Logistic regression is a discriminative classifier
Generative and Discriminative Classifiers

Suppose we're distinguishing cat from dog images.

Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image: how cat-y is this image?

Also build a model for dog images.
Now given a new image: run both models and see which one fits better.
Discriminative Classifier

Just try to distinguish dogs from cats:
Oh look, dogs have collars!
Let's ignore everything else.
Generative vs Discriminative Classifiers

Formally, recall that naive Bayes assigns a class c to a document d not by directly
computing P(c|d) but by computing a likelihood and a prior:

Naive Bayes:
    ĉ = argmax_{c∈C} P(d|c) P(c)        (likelihood × prior)    (5.1)

A generative model like naive Bayes makes use of this likelihood term, which
expresses how to generate the features of a document if we knew it was of class c.

By contrast, a discriminative model in this text categorization scenario attempts
to directly compute the posterior P(c|d). Perhaps it will learn to assign high
weight to document features that directly improve its ability to discriminate
between possible classes:

Logistic Regression:
    ĉ = argmax_{c∈C} P(c|d)             (posterior)
Components of a probabilistic machine learning classifier

Given m input/output pairs (x(i), y(i)):

1. A feature representation of the input. For each input observation x(i), a
   vector of features [x1, x2, ..., xn]. Feature j for input x(i) is xj, more
   completely xj(i), or sometimes fj(x).
2. A classification function that computes ŷ, the estimated class, via p(y|x),
   like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic gradient
   descent.
The two phases of logistic regression

Training: we learn weights w and b using stochastic gradient descent and
cross-entropy loss.

Test: given a test example x we compute p(y|x) using the learned weights w
and b, and return whichever label (y = 1 or y = 0) has higher probability.
Background: Generative and Discriminative Classifiers
Logistic Regression

Classification in Logistic Regression
Logistic Regression
Classification Reminder

Positive/negative sentiment
Spam/not spam
Authorship attribution (Hamilton or Madison?)
Text Classification: definition

Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2, ..., cJ}

Output: a predicted class ŷ ∈ C
Binary Classification in Logistic Regression

Given a series of input/output pairs:
◦ (x(i), y(i))

For each observation x(i):
◦ We represent x(i) by a feature vector [x1, x2, ..., xn]
◦ We compute an output: a predicted class ŷ(i) ∈ {0, 1}
Features in logistic regression

• For feature xi, weight wi tells us how important xi is
• xi = "review contains 'awesome'":  wi = +10
• xj = "review contains 'abysmal'":  wj = -10
• xk = "review contains 'mediocre'": wk = -2
Logistic Regression for one observation x

Input observation: vector x = [x1, x2, ..., xn]
Weights: one per feature: w = [w1, w2, ..., wn]
◦ Sometimes we call the weights θ = [θ1, θ2, ..., θn]
Output: a predicted class ŷ ∈ {0, 1}
(multinomial logistic regression: ŷ ∈ {0, 1, 2, 3, 4})
How do we do classification?

The classifier first multiplies each feature xi by its weight wi, sums up the
weighted features, and adds the bias term b (another real number that's added
to the weighted inputs). The resulting single number z expresses the weighted
evidence for the class:

    z = Σ_{i=1}^{n} wi xi + b

In this book we'll represent such sums using the dot product notation. The dot
product of two vectors a and b, written a · b, is the sum of the products of
the corresponding elements of each vector. Thus, equivalently to the sum above:

    z = w · x + b

If this sum is high, we say y = 1; if low, then y = 0.
But we want a probabilistic classifier

We need to formalize "sum is high".
We'd like a principled classifier that gives us a probability, just like
Naive Bayes did.
We want a model that can tell us:
    p(y=1|x; θ)
    p(y=0|x; θ)
The problem: z isn't a probability, it's just a number!

Since weights are real-valued, z ranges from -∞ to ∞.

Solution: use a function of z that goes from 0 to 1. To create a probability,
we'll pass z through the sigmoid function, σ(z). The sigmoid (named because it
looks like an s) is also called the logistic function, and gives logistic
regression its name.

The very useful sigmoid or logistic function:

    y = σ(z) = 1 / (1 + e^(-z)) = 1 / (1 + exp(-z))

(For the rest of the book, we'll use the notation exp(x) to mean e^x.) The
sigmoid has a number of advantages: it takes a real-valued number and maps it
into the range (0, 1), forcing z to be a legal probability. It is nearly
linear around 0, but outlier values get squashed toward 0 or 1.
Idea of logistic regression

We'll compute w·x + b
And then we'll pass it through the sigmoid function:
    σ(w·x + b)
And we'll just treat it as a probability
Making probabilities with sigmoids

The two cases, p(y = 1) and p(y = 0), must sum to 1. We can do this as follows:

    P(y = 1) = σ(w · x + b)
             = 1 / (1 + exp(-(w · x + b)))

    P(y = 0) = 1 - σ(w · x + b)
             = 1 - 1 / (1 + exp(-(w · x + b)))
             = exp(-(w · x + b)) / (1 + exp(-(w · x + b)))
By the way: the sigmoid function has the property

    1 - σ(x) = σ(-x)

so we could also have expressed P(y = 0) = 1 - σ(w · x + b) as σ(-(w · x + b)).

Now we have an algorithm that, given an instance x, computes the probability
P(y = 1|x). How do we make a decision? For a test instance x, we say yes if
the probability P(y = 1|x) is more than .5, and no otherwise. We call .5 the
decision boundary:

    ŷ = 1 if P(y = 1|x) > 0.5
        0 otherwise
Turning a probability into a classifier

    ŷ = 1 if P(y = 1|x) > 0.5
        0 otherwise

0.5 here is called the decision boundary.
The probabilistic classifier

    P(y = 1) = σ(w · x + b) = 1 / (1 + e^(-(w·x+b)))
    P(y = 0) = 1 - σ(w · x + b) = e^(-(w·x+b)) / (1 + e^(-(w·x+b)))

[Figure: P(y=1) plotted as a function of w·x + b: the sigmoid curve]

Turning a probability into a classifier

    ŷ = 1 if P(y = 1|x) > 0.5,   i.e., if w·x + b > 0
        0 otherwise,             i.e., if w·x + b ≤ 0
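Putting the pieces together, a small sketch of the resulting binary
classifier, assuming nothing beyond the formulas above:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w: list[float], b: float, x: list[float]) -> int:
    """Return ŷ = 1 if P(y=1|x) = σ(w·x + b) > 0.5, else 0.

    Equivalently: return 1 exactly when w·x + b > 0.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) > 0.5 else 0
```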
Classification in Logistic Regression
Logistic Regression

Logistic Regression: a text example on sentiment classification
Logistic Regression
Sentiment example: does y=1 or y=0?

It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .

Example: sentiment classification

Let's have an example. Suppose we are doing binary sentiment classification on
movie review text, and we would like to know whether to assign the sentiment
class + or - to a review document doc. We'll represent each input observation
by the 6 features x1...x6 of the input shown in the following table; Fig. 5.2
shows the features in a sample mini test document:

It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another nice
touch is the music . I was overcome with the urge to get off the couch and
start dancing . It sucked me in , and it'll do the same to you .

(x1=3, x2=2, x3=1, x4=3, x5=0, x6=4.19)

Figure 5.2  A sample mini test document showing the extracted features in the vector x.

Var  Definition                                  Value in Fig. 5.2
x1   count(positive lexicon words ∈ doc)        3
x2   count(negative lexicon words ∈ doc)        2
x3   1 if "no" ∈ doc, 0 otherwise               1
x4   count(1st and 2nd person pronouns ∈ doc)   3
x5   1 if "!" ∈ doc, 0 otherwise                0
x6   log(word count of doc)                     ln(66) = 4.19
Classifying sentiment for input x

Let's assume for the moment that we've already learned a real-valued weight
for each of these features, and that the 6 weights corresponding to the 6
features are w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7], while b = 0.1. (We'll
discuss later how the weights are learned.) The weight w1, for example,
indicates how important the count of positive lexicon words is to a positive
sentiment decision.

Given these 6 features and the input review x, P(+|x) and P(-|x) can be
computed using Eq. 5.5:

    p(+|x) = P(Y = 1|x) = σ(w · x + b)
           = σ([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
           = σ(.833)
           = 0.70                                                    (5.6)

    p(-|x) = P(Y = 0|x) = 1 - σ(w · x + b)
           = 0.30
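As a quick check of this arithmetic, a sketch using the same numbers as above:

```python
import math

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
b = 0.1
x = [3, 2, 1, 3, 0, 4.19]

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # ≈ 0.833
p_pos = 1 / (1 + math.exp(-z))                 # σ(z) ≈ 0.70
print(round(z, 3), round(p_pos, 2), round(1 - p_pos, 2))
```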
Logistic regression is commonly applied to all sorts of NLP tasks, and any
property of the input can be a feature. Consider the task of period
disambiguation: deciding if a period is the end of a sentence or part of a
word, by classifying each period into one of two classes, EOS
(end-of-sentence) and not-EOS.
We can build features for logistic regression for any classification task:
period disambiguation

We might use features like x1 below expressing that the current word is lower
case and the class is EOS (perhaps with a positive weight), or that the
current word is in our abbreviations dictionary ("Prof.") and the class is EOS
(perhaps with a negative weight). A feature can also express a quite complex
combination of properties. For example a period following an upper cased word
is likely to be an EOS, but if the word itself is St. and the previous word is
capitalized, then the period is likely part of a shortening of the word street.

End of sentence:  This ends in a period.
Not end:          The house at 465 Main St. is new.

    x1 = 1 if Case(wi) = Lower, 0 otherwise
    x2 = 1 if wi ∈ AcronymDict, 0 otherwise
    x3 = 1 if wi = "St." and Case(wi-1) = Cap, 0 otherwise
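A sketch of extracting these three features in Python; ACRONYM_DICT here is a
made-up stand-in for a real abbreviation dictionary:

```python
ACRONYM_DICT = {"Prof.", "Dr.", "Mr.", "Mrs.", "St."}  # hypothetical abbreviation list

def period_features(words: list[str], i: int) -> list[int]:
    """Features x1-x3 above for the period-bearing token words[i]."""
    w = words[i]
    x1 = 1 if w[:1].islower() else 0                     # current word is lower case
    x2 = 1 if w in ACRONYM_DICT else 0                   # word is in our abbreviation dict
    x3 = 1 if w == "St." and i > 0 and words[i - 1][:1].isupper() else 0
    return [x1, x2, x3]

print(period_features(["The", "house", "at", "465", "Main", "St."], 5))  # [0, 1, 1]
```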
Classification in (binary) logistic regression: summary

The sigmoid output is between 0 and 1, which is just what we want for a
probability. It is nearly linear around 0 but has a sharp slope toward the
ends, so it tends to squash outlier values. And it's differentiable, which
will be handy for learning.

Given:
◦ a set of classes: (+ sentiment, - sentiment)
◦ a vector x of features [x1, x2, ..., xn]
  ◦ x1 = count("awesome")
  ◦ x2 = log(number of words in review)
◦ a vector w of weights [w1, w2, ..., wn]
  ◦ wi for each feature fi

Compute:
    P(y = 1) = σ(w · x + b) = 1 / (1 + e^(-(w·x+b)))
Learning: Cross-Entropy Loss
Logistic Regression
Wait, where did the w's come from?

Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, ŷ.
We want to set w and b to minimize the distance between our estimate ŷ(i) and
the true y(i).
• We need a distance estimator: a loss function or a cost function
• We need an optimization algorithm to update w and b to minimize the loss.
Learning components

A loss function:
◦ cross-entropy loss

An optimization algorithm:
◦ stochastic gradient descent
The distance between ŷ and y

We want to know how far the classifier output:
    ŷ = σ(w·x+b)
is from the true output:
    y  [= either 0 or 1]

We'll call this difference:
    L(ŷ, y) = how much ŷ differs from the true y
Intuition of negative log likelihood loss = cross-entropy loss

A case of conditional maximum likelihood estimation:
We choose the parameters w, b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x
Deriving cross-entropy loss for a single observation x

This loss is the negative log likelihood loss, generally called the
cross-entropy loss. Let's derive this loss function, applied to a single
observation x. We'd like weights that maximize the probability of the correct
label p(y|x).

Goal: maximize the probability of the correct label p(y|x).

Since there are only 2 discrete outcomes (0 or 1), this is a Bernoulli
distribution, and we can express the probability p(y|x) from our classifier
(the thing we want to maximize) as

    p(y|x) = ŷ^y (1 - ŷ)^(1-y)

noting:
    if y = 1, this simplifies to ŷ
    if y = 0, this simplifies to 1 - ŷ

Now take the log of both sides. This will turn out to be handy mathematically,
and doesn't hurt us: whatever values maximize a probability will also maximize
the log of the probability.

Maximize:
    log p(y|x) = log [ ŷ^y (1 - ŷ)^(1-y) ]
               = y log ŷ + (1 - y) log(1 - ŷ)                        (5.10)

Eq. 5.10 describes a log likelihood that should be maximized. In order to turn
this into a loss function (something that we need to minimize), we'll just
flip the sign on Eq. 5.10. The result is the cross-entropy loss L_CE (so
called because this is the formula for the cross-entropy between y and ŷ):

Minimize:
    L_CE(ŷ, y) = -log p(y|x) = -[y log ŷ + (1 - y) log(1 - ŷ)]

Finally, we can plug in the definition of ŷ = σ(w · x + b):

    L_CE(ŷ, y) = -[y log σ(w · x + b) + (1 - y) log(1 - σ(w · x + b))]
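A direct transcription of this loss into Python, as a sketch (the eps clamp
guards against log(0), an implementation detail not part of the derivation):

```python
import math

def cross_entropy_loss(y_hat: float, y: int, eps: float = 1e-12) -> float:
    """L_CE(ŷ, y) = -[y log ŷ + (1 - y) log(1 - ŷ)]."""
    y_hat = min(max(y_hat, eps), 1 - eps)  # keep ŷ strictly inside (0, 1)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Sentiment example from these slides: ŷ = σ(.833) ≈ 0.70
print(round(cross_entropy_loss(0.70, 1), 2))  # 0.36: model fairly sure and right
print(round(cross_entropy_loss(0.70, 0), 2))  # 1.2:  same estimate, but wrong
```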
Let's see if this works for our sentiment example

We want loss to be:
• smaller if the model estimate is close to correct
• bigger if the model is confused

Let's first suppose the true label of this is y = 1 (positive):

It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another nice
touch is the music . I was overcome with the urge to get off the couch and
start dancing . It sucked me in , and it'll do the same to you .

Given these 6 features and the input review x, P(+|x) and P(-|x) can be
computed using Eq. 5.5.

True value is y = 1. How well is our model doing?

    p(+|x) = P(Y = 1|x) = σ(w · x + b)
           = σ([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
           = σ(.833)
           = 0.70

Pretty well! What's the loss? (We'll use log to mean natural log when the
base is not specified.) Since y = 1, the right side of the equation drops out:

    L_CE(ŷ, y) = -[y log σ(w · x + b) + (1 - y) log(1 - σ(w · x + b))]
               = -[log σ(w · x + b)]
               = -log(.70)
               = .36
Suppose the true value instead was y = 0.

By contrast, let's pretend instead that the example in Fig. 5.2 was actually
negative, i.e., y = 0 (perhaps the reviewer went on to say "But bottom line,
the movie is terrible! I beg you not to see it!"). In this case our model is
confused and we'd want the loss to be higher. Now if we plug y = 0 and
1 - σ(w · x + b) = .30 into Eq. 5.12, the left side of the equation drops out:

    L_CE(ŷ, y) = -[y log σ(w · x + b) + (1 - y) log(1 - σ(w · x + b))]
               = -[log(1 - σ(w · x + b))]
               = -log(.30)
               = 1.2

The loss when the model was right (if true y = 1), .36, is lower than the loss
when the model was wrong (if true y = 0), 1.2.

Sure enough, the loss was bigger when the model was wrong!
Cross-Entropy Loss
Logistic Regression

Stochastic Gradient Descent
Logistic Regression
Our goal: minimize the loss

The goal with gradient descent is to find the optimal weights: minimize the
loss function we've defined for the model.

Let's make explicit that the loss function is parameterized by the weights,
which in machine learning we refer to as θ (in the case of logistic
regression, θ = (w, b)). And we'll represent ŷ as f(x; θ) to make the
dependence on θ more obvious.

We want the weights that minimize the loss, averaged over all examples:

    θ̂ = argmin_θ (1/m) Σ_{i=1}^{m} L_CE(f(x^(i); θ), y^(i))
Intuition of gradient descent

How do I get to the bottom of this river canyon?

Look around me 360°
Find the direction of steepest slope down
Go that way
Our goal: minimize the loss

For logistic regression, the loss function is convex:
• A convex function has just one minimum
• Gradient descent starting from any point is guaranteed to find the minimum
• (Loss for neural networks is non-convex)
Let's first visualize for a single scalar w

Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function.

[Figure: loss plotted over w. Should we move right or left from w1? The slope
of the loss at w1 is negative, so we move w in the positive direction: one
step of gradient descent toward wmin (the goal).]
Gradients

The gradient of a function of many variables is a vector pointing in the
direction of the greatest increase in the function.

Gradient Descent: Find the gradient of the loss function at the current point
and move in the opposite direction.
How much do we move in that direction?

The value of the gradient (the slope in our example, d/dw L(f(x; w), y)),
weighted by a learning rate η.
• Higher learning rate means move w faster

    w^(t+1) = w^(t) - η (d/dw) L(f(x; w), y)

Now let's extend the intuition from a function of one scalar variable w.
Now let's consider N dimensions
We want to know where in the N-dimensional space
(of the N parameters that make up θ ) we should
move.
The gradient is just such a vector; it expresses the
directional components of the sharpest slope along
each of the N dimensions.
Imagine 2 dimensions, w and b

[Figure: Cost(w, b) surface, visualizing the 2-dimensional gradient vector
taken at the red point; its two components are shown in the w-b plane.]
Real gradients

Real gradients are much longer: lots and lots of weights.
For each dimension wi the gradient component i tells us the slope with
respect to that variable:
◦ "How much would a small change in wi influence the total loss function L?"
◦ We express the slope as a partial derivative ∂L/∂wi of the loss.
The gradient is then defined as a vector of these partials.
The gradient

We'll represent ŷ as f(x; θ) to make the dependence on θ more obvious:

    ∇_θ L(f(x; θ), y) = [ ∂L(f(x;θ),y)/∂w1,
                          ∂L(f(x;θ),y)/∂w2,
                          ...,
                          ∂L(f(x;θ),y)/∂wn ]

The final equation for updating θ based on the gradient is thus

    θ^(t+1) = θ^t - η ∇L(f(x; θ), y)
The Gradient for Logistic Regression

What are these partial derivatives for logistic regression? In order to
update θ, we need a definition for the gradient ∇L(f(x; θ), y). Recall that
for logistic regression, the cross-entropy loss function is:

    L_CE(ŷ, y) = -[y log σ(w · x + b) + (1 - y) log(1 - σ(w · x + b))]

It turns out that the derivative of this function for one observation vector
x is Eq. 5.18 (the interested reader can see Section 5.8 of the textbook for
the derivation of this elegant equation):

    ∂L_CE(ŷ, y)/∂wj = [σ(w · x + b) - y] xj                          (5.18)

Note in Eq. 5.18 that the gradient with respect to a single weight wj
represents a very intuitive value: the difference between the true y and our
estimated ŷ = σ(w · x + b) for that observation, multiplied by the
corresponding input value xj.
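In code, Eq. 5.18 is one line per parameter; a sketch, treating the bias as a
weight whose input is fixed at 1:

```python
import math

def gradient(w: list[float], b: float, x: list[float], y: int):
    """Gradient of L_CE for one observation: dL/dwj = (σ(w·x+b) - y)·xj."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    err = 1 / (1 + math.exp(-z)) - y      # ŷ - y
    dw = [err * xi for xi in x]           # one component per weight
    db = err                              # bias component (its input is 1)
    return dw, db
```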
function STOCHASTIC GRADIENT DESCENT(L(), f(), x, y) returns θ
  # where: L is the loss function
  # f is a function parameterized by θ
  # x is the set of training inputs x(1), x(2), ..., x(m)
  # y is the set of training outputs (labels) y(1), y(2), ..., y(m)

  θ ← 0
  repeat til done                          # see caption
    For each training tuple (x(i), y(i)) (in random order)
      1. Optional (for reporting):         # How are we doing on this tuple?
         Compute ŷ(i) = f(x(i); θ)         # What is our estimated output ŷ?
         Compute the loss L(ŷ(i), y(i))    # How far off is ŷ(i) from the true output y(i)?
      2. g ← ∇_θ L(f(x(i); θ), y(i))       # How should we move θ to maximize loss?
      3. θ ← θ - ηg                        # Go the other way instead
  return θ

Figure 5.5  The stochastic gradient descent algorithm. Step 1 (computing the
loss) is used only to report how we are doing; the update itself needs only
steps 2 and 3.
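A runnable Python sketch of this loop for binary logistic regression; the
stopping rule ("repeat til done") is simplified to a fixed number of epochs,
and the dataset is made up for illustration:

```python
import math
import random

def sgd(xs, ys, eta=0.1, epochs=100):
    """Stochastic gradient descent for binary logistic regression (cf. Fig. 5.5)."""
    w, b = [0.0] * len(xs[0]), 0.0                      # θ ← 0
    data = list(zip(xs, ys))
    for _ in range(epochs):                             # "repeat til done"
        random.shuffle(data)                            # tuples in random order
        for x, y in data:
            z = sum(wj * xj for wj, xj in zip(w, x)) + b
            err = 1 / (1 + math.exp(-z)) - y            # ŷ - y
            # θ ← θ - ηg, with g = (ŷ - y)·xj per weight and (ŷ - y) for the bias
            w = [wj - eta * err * xj for wj, xj in zip(w, x)]
            b -= eta * err
    return w, b

# Tiny made-up dataset: [pos-lexicon count, neg-lexicon count] -> label
xs = [[3.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 2.0]]
ys = [1, 1, 0, 0]
print(sgd(xs, ys))
```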
Hyperparameters

The learning rate η is a hyperparameter:
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long

Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by the algorithm from supervision (like regular
  parameters), they are chosen by the algorithm designer.
Stochastic Gradient Descent
Logistic Regression

Stochastic Gradient Descent: An example and more details
Logistic Regression
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume 3 parameters (2 weights and 1 bias) in Θ0 are zero:
w1 = w2 = b = 0
η = 0.1
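Here is a short sketch that carries out this single update numerically; the
slides that follow derive the same numbers by hand:

```python
import math

w, b = [0.0, 0.0], 0.0      # θ0: w1 = w2 = b = 0
x, y = [3.0, 2.0], 1        # one positive training example
eta = 0.1                   # learning rate η

z = sum(wi * xi for wi, xi in zip(w, x)) + b       # z = 0
err = 1 / (1 + math.exp(-z)) - y                   # σ(0) - 1 = -0.5
grad = [err * x[0], err * x[1], err]               # [-1.5, -1.0, -0.5]

w = [w[0] - eta * grad[0], w[1] - eta * grad[1]]   # [0.15, 0.1]
b -= eta * grad[2]                                 # 0.05
print(w, b)
```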
Example of gradient descent

Let's assume the initial weights and bias in θ0 are all set to 0, and the
initial learning rate η is 0.1:

    w1 = w2 = b = 0;   x1 = 3;   x2 = 2;   η = 0.1

The update step for θ is:

    θ^(t+1) = θ^t - η ∇_θ L(f(x^(i); θ), y^(i))

The single update step requires that we compute the gradient, multiplied by
the learning rate. In our mini example there are three parameters, so the
gradient vector has 3 dimensions, for w1, w2, and b. We can compute the first
gradient as follows:

    ∇_{w,b} = [ ∂L_CE(ŷ,y)/∂w1, ∂L_CE(ŷ,y)/∂w2, ∂L_CE(ŷ,y)/∂b ]
            = [ (σ(w·x+b) - y)·x1, (σ(w·x+b) - y)·x2, σ(w·x+b) - y ]
            = [ (σ(0) - 1)·x1, (σ(0) - 1)·x2, σ(0) - 1 ]
            = [ -0.5·x1, -0.5·x2, -0.5 ]
            = [ -1.5, -1.0, -0.5 ]

Now that we have a gradient, we compute the new parameter vector θ1 by moving
θ0 in the opposite direction from the gradient (η = 0.1):

    θ1 = [w1, w2, b] - η·[-1.5, -1.0, -0.5] = [.15, .1, .05]         (5.16)

So after one step of gradient descent, the weights have shifted to be:
w1 = .15, w2 = .1, and b = .05.

Note that this observation x happened to be a positive example. We would
expect that after seeing more negative examples with high counts of negative
words, the weight w2 would shift to have a negative value: enough negative
examples would eventually make w2 negative.
Mini-batch training

Stochastic gradient descent chooses a single random example at a time.
That can result in choppy movements.
It is more common to compute the gradient over batches of training instances:
Batch training: the entire dataset
Mini-batch training: m examples (e.g., 512 or 1024)
Regularization
Logistic Regression
Overfitting

A model that perfectly matches the training data has a problem.
It will also overfit to the data, modeling noise:
◦ A random word that perfectly predicts y (it happens to only occur in one
  class) will get a very high weight, failing to generalize to a test set
  without this word.
A good model should be able to generalize.
Overfitting

Models that are too powerful can overfit the data:
◦ Fitting the details of the training data so exactly that the model doesn't
  generalize well to the test set
How to avoid overfitting?
◦ Regularization in logistic regression
◦ Dropout in neural networks
Regularization

A good model generalizes to the unseen test set, but a model that overfits
will have poor generalization. A solution for overfitting: add a
regularization term R(θ) to the loss function (for now written as maximizing
log probability rather than minimizing loss; removing the 1/m term doesn't
affect the argmax):

    θ̂ = argmax_θ Σ_{i=1}^{m} log P(y^(i)|x^(i)) - αR(θ)             (5.22)

Idea: choose an R(θ) that penalizes large weights.
◦ Fitting the data well with lots of big weights is not as good as fitting the
  data a little less well, with small weights.
A setting of the weights that matches the training data perfectly, but uses
many weights with high values to do so, will be penalized more than a setting
that matches the data a little less well, but does so using smaller weights.
There are two common ways to compute this regularization term R(θ).
L2 Regularization (= ridge regression)

L2 regularization is a quadratic function of the weight values, named because
it uses the (square of the) L2 norm of the weight values. The L2 norm,
||θ||2, is the same as the Euclidean distance of the vector θ from the
origin. If θ consists of n weights, then:

    R(θ) = ||θ||2² = Σ_{j=1}^{n} θj²                                 (5.23)

The L2 regularized objective function becomes:

    θ̂ = argmax_θ [ Σ_{i=1}^{m} log P(y^(i)|x^(i)) ] - α Σ_{j=1}^{n} θj²
L1 Regularization (= lasso regression)

L1 regularization is a linear function of the weight values, named after the
L1 norm ||W||1, the sum of the absolute values of the weights, or Manhattan
distance (the Manhattan distance is the distance you'd have to walk between
two points in a city with a street grid like New York):

    R(θ) = ||θ||1 = Σ_{i=1}^{n} |θi|

The L1 regularized objective function becomes:

    θ̂ = argmax_θ [ Σ_{i=1}^{m} log P(y^(i)|x^(i)) ] - α Σ_{j=1}^{n} |θj|
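A sketch of the two penalty terms in code (α and the weight values here are
made up for illustration):

```python
def l2_penalty(theta: list[float]) -> float:
    """R(θ) = ||θ||2² = Σ θj²  (ridge)."""
    return sum(t * t for t in theta)

def l1_penalty(theta: list[float]) -> float:
    """R(θ) = ||θ||1 = Σ |θj|  (lasso)."""
    return sum(abs(t) for t in theta)

theta = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
alpha = 0.01
# The regularized objective subtracts α·R(θ) from the log likelihood.
print(alpha * l2_penalty(theta), alpha * l1_penalty(theta))
```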
Multinomial Logistic Regression
Logistic Regression

Multinomial Logistic Regression

Often we need more than 2 classes:
◦ Positive/negative/neutral
◦ Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
◦ Classify emergency SMSs into different actionable classes

If >2 classes we use multinomial logistic regression
= Softmax regression
= Multinomial logit
= (defunct names: maximum entropy modeling or MaxEnt)

So "logistic regression" will just mean binary (2 output classes).
Recall that the cross-entropy loss for binary logistic regression is

    L_CE(ŷ, y) = -log p(y|x) = -[y log ŷ + (1 - y) log(1 - ŷ)]

The loss function for multinomial logistic regression generalizes this to K
classes:

    L_CE(ŷ, y) = -Σ_{k=1}^{K} yk log ŷk
               = -log ŷc          (where c is the true class)
               = -log P(y = c|x)
Multinomial Logistic Regression

The probability of everything must still sum to 1:
    P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1

Need a generalization of the sigmoid called the softmax:
◦ Takes a vector z = [z1, z2, ..., zk] of k arbitrary values
◦ Outputs a probability distribution
  ◦ each value in the range [0, 1]
  ◦ all the values summing to 1
The softmax function

Like the sigmoid, the softmax is an exponential function. It takes a vector
z = [z1, z2, ..., zk] of k arbitrary values and maps them to a probability
distribution, with each value in the range (0, 1), and all the values summing
to 1. For a vector z of dimensionality k, the softmax is defined as:

    softmax(zi) = exp(zi) / Σ_{j=1}^{k} exp(zj)      for 1 ≤ i ≤ k   (5.30)

The softmax of an input vector z = [z1, z2, ..., zk] is thus a vector itself:

    softmax(z) = [ exp(z1)/Σ_{i=1}^{k} exp(zi),
                   exp(z2)/Σ_{i=1}^{k} exp(zi),
                   ...,
                   exp(zk)/Σ_{i=1}^{k} exp(zi) ]                     (5.31)

The denominator Σ_{i=1}^{k} exp(zi) is used to normalize all the values into
probabilities. Thus for example given a vector:

    z = [0.6, 1.1, -1.5, 1.2, 3.2, -1.1]

the result softmax(z) is

    [0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
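A direct Python sketch of the softmax, reproducing the example above (the
max-shift is a standard numerical-stability trick, not part of the formula on
the slide):

```python
import math

def softmax(z: list[float]) -> list[float]:
    """softmax(z)[i] = exp(z[i]) / Σj exp(z[j])."""
    m = max(z)                             # shift for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.6, 1.1, -1.5, 1.2, 3.2, -1.1]
print([round(p, 4) for p in softmax(z)])
# [0.0548, 0.0904, 0.0067, 0.0999, 0.7382, 0.01]
```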
Softmax in multinomial logistic regression

Like the sigmoid, the input to the softmax is the dot product between a
weight vector w and an input vector x (plus a bias). But now we'll need
separate weight vectors (and bias) for each of the k classes:

    p(y = c|x) = exp(wc · x + bc) / Σ_{j=1}^{k} exp(wj · x + bj)     (5.33)

Like the sigmoid, the softmax has the property of squashing values toward 0
or 1: if one of the inputs is larger than the others, it will tend to push
its probability toward 1 and suppress the probabilities of the smaller inputs.
For gradient descent we don't need the loss itself, we need its gradient. The
gradient for a single example turns out to be very similar to the gradient
for binary logistic regression: (ŷ - y)x.

Multinomial Logistic Regression
Logistic Regression