3 Linear Classifiers
January 21, 2003

[Figure: sample points from Class A and Class B, some unclassified points, and a decision line separating the two classes]
– Least squares methods (if classes not linearly separable)

The Perceptron Algorithm

• Update weights: w(t + 1) = w(t) + ρ (y(t) − ŷ(t)) x(t)
  · ρ > 0 is learning rate
  · (y(t) − ŷ(t)) moves weights toward correct prediction for x(t)
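A minimal sketch of this mistake-driven step in NumPy (the function names, the {0, 1} labels, and the default ρ are illustrative assumptions, not from the slides):

import numpy as np

def predict(w, x):
    # threshold unit: yhat(t) = 1 if w . x > 0, else 0
    return 1 if np.dot(w, x) > 0 else 0

def perceptron_step(w, x, y, rho=0.1):
    # w(t+1) = w(t) + rho * (y(t) - yhat(t)) * x(t)
    # rho > 0 is the learning rate; (y - yhat) is 0 when the prediction is
    # correct and +1/-1 on a mistake, so the weights move only after an
    # error, and they move toward a correct prediction for x(t)
    yhat = predict(w, x)
    return w + rho * (y - yhat) * x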
The Perceptron Algorithm
Example

[Figure: in the (x1, x2) plane, points from classes ω1 (y(t) = 1) and ω2 (y(t) = 0), with w0 = 0; the current weight vector w(t) and "our dec. line"; the example x(t); the updated weight vector w(t+1) and "our new dec. line"; and the optimal weight vector w* and "opt. dec. line"]

The Perceptron Algorithm
Intuition

• Compromise between correctiveness and conservativeness
  – Correctiveness: Tendency to improve on x(t) if prediction error made
  – Conservativeness: Tendency to keep w(t + 1) close to w(t)

• Use cost function that measures both:

  U(w) = ‖w(t+1) − w(t)‖₂²  +  η (y(t) − w(t+1) · x(t))²
         [conservative]         [corrective]

       = Σ_{i=1}^ℓ (w_i(t+1) − w_i(t))²  +  η ( y(t) − Σ_{i=1}^ℓ w_i(t+1) x_i(t) )²
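A small NumPy sketch of this cost, mainly to make the two terms explicit (the function name and arguments are illustrative):

import numpy as np

def perceptron_cost(w_old, w_new, x, y, eta):
    # U(w) = ||w(t+1) - w(t)||^2 + eta * (y(t) - w(t+1) . x(t))^2
    conservative = np.sum((w_new - w_old) ** 2)      # stay close to w(t)
    corrective = eta * (y - np.dot(w_new, x)) ** 2   # do well on x(t)
    return conservative + corrective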
• Take gradient w.r.t. w(t + 1) and set to 0, approximating w(t + 1) by w(t) inside the sum (as in the EG derivation below):

  0 = 2 (w_i(t+1) − w_i(t)) − 2η ( y(t) − Σ_{i=1}^ℓ w_i(t) x_i(t) ) x_i(t),

  which yields

  w_i(t+1) = w_i(t) + η ( y(t) − Σ_{i=1}^ℓ w_i(t) x_i(t) ) x_i(t)

• For real-valued output, can replace threshold function on sum with
  – Identity function: f(x) = x
  – Sigmoid function: e.g. f(x) = 1 / (1 + exp(−ax))
  – Hyperbolic tangent: e.g. f(x) = c tanh(ax)
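A sketch of this additive update with a pluggable output function; with f = identity it is exactly the rule above, while the sigmoid/tanh variants (and all names and defaults) are illustrative assumptions:

import numpy as np

def identity(s):
    return s

def sigmoid(s, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * s))

def tanh_output(s, a=1.0, c=1.0):
    return c * np.tanh(a * s)

def additive_step(w, x, y, eta=0.1, f=identity):
    # w_i(t+1) = w_i(t) + eta * (y(t) - yhat(t)) * x_i(t),
    # where yhat(t) = f(sum_i w_i(t) x_i(t))
    yhat = f(np.dot(w, x))
    return w + eta * (y - yhat) * x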
Winnow/Exponentiated Gradient

[Figure: a linear threshold unit with inputs x1, ..., xℓ and weights w1, ..., wℓ, plus a constant input 1 with weight w0; the unit computes Σ_i w_i x_i and outputs y(t) = 1 (class ω1) if the sum > 0, and y(t) = 0 (class ω2) otherwise]

• May also use +1 and −1

• Same as Perceptron, but update weights:

  w_i(t+1) = w_i(t) exp(−2η (ŷ(t) − y(t)) x_i(t))

• If y(t), ŷ(t) ∈ {0, 1} ∀t, then set η = (ln α)/2 (α > 1) and get Winnow:

  w_i(t+1) = w_i(t) / α^{x_i(t)}   if ŷ(t) = 1, y(t) = 0
  w_i(t+1) = w_i(t) α^{x_i(t)}     if ŷ(t) = 0, y(t) = 1
  w_i(t+1) = w_i(t)                if ŷ(t) = y(t)

Winnow/Exponentiated Gradient
Intuition

• Measure distance in cost function with unnormalized relative entropy:

  U(w) = Σ_{i=1}^ℓ ( w_i(t) − w_i(t+1) + w_i(t+1) ln (w_i(t+1) / w_i(t)) )   [conservative]
         + η (y(t) − w(t+1) · x(t))²                                         [corrective]

• Take gradient w.r.t. w(t + 1) and set to 0:

  0 = ln (w_i(t+1) / w_i(t)) − 2η ( y(t) − Σ_{i=1}^ℓ w_i(t+1) x_i(t) ) x_i(t)

• Approximate with

  0 = ln (w_i(t+1) / w_i(t)) − 2η ( y(t) − Σ_{i=1}^ℓ w_i(t) x_i(t) ) x_i(t),

  which yields

  w_i(t+1) = w_i(t) exp(−2η (ŷ(t) − y(t)) x_i(t))
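A sketch of the multiplicative step and its Winnow specialization (assuming {0, 1} values and a prediction ŷ supplied by the caller; names and the default α are illustrative):

import numpy as np

def eg_step(w, x, y, yhat, eta):
    # w_i(t+1) = w_i(t) * exp(-2 * eta * (yhat(t) - y(t)) * x_i(t))
    return w * np.exp(-2.0 * eta * (yhat - y) * x)

def winnow_step(w, x, y, yhat, alpha=2.0):
    # the same step with eta = ln(alpha)/2 and y, yhat in {0, 1}:
    # divide active weights by alpha on a false positive,
    # multiply them by alpha on a false negative, else leave unchanged
    return w * alpha ** ((y - yhat) * x)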
• Winnow and EG update weights by multiplying by a positive constant: impossible to change sign
  – Weight vectors restricted to one quadrant

• Solution: Maintain weight vectors w⁺(t) and w⁻(t)
  – Predict ŷ(t) = (w⁺(t) − w⁻(t)) · x(t)
  – Update:

      r_i⁺(t) = exp(−2η (ŷ(t) − y(t)) x_i(t) U)
      r_i⁻(t) = 1 / r_i⁺(t)

      w_i⁺(t+1) = U · w_i⁺(t) r_i⁺(t) / Σ_{j=1}^ℓ ( w_j⁺(t) r_j⁺(t) + w_j⁻(t) r_j⁻(t) )

    (and symmetrically for w_i⁻(t+1), using r_i⁻(t) in the numerator)
  – U and the denominator normalize the weights for the proof of the error bound

Kivinen & Warmuth, "Additive Versus Exponentiated Gradient Updates for Linear Prediction." Information and Computation, 132(1):1–64, Jan. 1997. [see web page]

• Winnow and EG are multiplicative weight update schemes versus additive weight update schemes, e.g. Perceptron

• Winnow and EG work well when most attributes (features) are irrelevant, i.e. optimal weight vector w* is sparse (many 0 entries)

• E.g. x_i ∈ {0, 1}, x's are labelled by a monotone k-disjunction over ℓ attributes, k ≪ ℓ
  – Remaining ℓ − k attributes are irrelevant
  – E.g. x5 ∨ x9 ∨ x12, ℓ = 150, k = 3
  – For disjunctions, number of on-line prediction mistakes is O(k log ℓ) for Winnow and worst-case Ω(kℓ) for Perceptron
  – So in worst case, need exponentially fewer updates for training in Winnow than Perceptron

• Other bounds exist for real-valued inputs and outputs
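A sketch of one EG± trial under these definitions, where U is the fixed total weight (all names are illustrative):

import numpy as np

def eg_pm_step(w_pos, w_neg, x, y, eta, U=1.0):
    # predict with the difference of the two nonnegative weight vectors,
    # then update both multiplicatively and renormalize to total weight U
    yhat = np.dot(w_pos - w_neg, x)
    r_pos = np.exp(-2.0 * eta * (yhat - y) * x * U)
    r_neg = 1.0 / r_pos
    z = np.sum(w_pos * r_pos + w_neg * r_neg)        # normalizing denominator
    return U * w_pos * r_pos / z, U * w_neg * r_neg / z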
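To make the disjunction example concrete, here is a small simulation in the style of Littlestone's fixed-threshold Winnow (weights start at 1, threshold ℓ, α = 2); it differs from the learned-threshold unit above and is only meant to illustrate why irrelevant attributes cost little:

import numpy as np

rng = np.random.default_rng(0)
ell = 150
relevant = (5, 9, 12)                     # target concept: x5 OR x9 OR x12

w, theta, alpha = np.ones(ell), float(ell), 2.0
mistakes = 0
for _ in range(2000):                     # online trials on random Boolean inputs
    x = rng.integers(0, 2, size=ell)
    y = int(any(x[i - 1] for i in relevant))
    yhat = int(np.dot(w, x) >= theta)
    if yhat != y:
        mistakes += 1
        # promote active weights on a false negative, demote on a false positive;
        # relevant weights are never demoted (they are 0 whenever y = 0)
        w = w * alpha ** ((y - yhat) * x)

print("mistakes:", mistakes)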
Non-Linearly Separable Classes

• Kernel functions/SVMs

• Pocket algorithm (p. 63) guarantees convergence to a best hyperplane

• Winnow's & EG's agnostic results

• Least squares methods (Sec. 3.4)

Non-Linearly Separable Classes
Winnow's Agnostic Results

• Loss bound related to performance of best classifier and total distance under ‖·‖₁ that feature vectors must be moved to make best classifier perfect [Littlestone, COLT '91]
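The pocket idea is easy to sketch: run perceptron updates as usual but keep ("pocket") the best weight vector seen so far. Gallant's original rates candidates by their run of consecutive correct predictions; the simpler variant below rates them by training-set error, and all names and defaults are illustrative:

import numpy as np

def errors(w, X, Y):
    # number of training points misclassified by the 0/1 threshold unit
    return int(np.sum((X @ w > 0).astype(int) != Y))

def pocket_perceptron(X, Y, rho=0.1, epochs=50):
    d = X.shape[1]
    w = np.zeros(d)
    best_w, best_err = w.copy(), errors(w, X, Y)
    for _ in range(epochs):
        for x, y in zip(X, Y):
            yhat = 1 if np.dot(w, x) > 0 else 0
            if yhat != y:
                w = w + rho * (y - yhat) * x
                err = errors(w, X, Y)
                if err < best_err:        # pocket the improvement
                    best_w, best_err = w.copy(), err
    return best_w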
* Note that here w(t) is weight before trial t. In book it is weight after trial t.

* The extra 1 is added so threshold can be placed in w.
Multiclass learning
Kessler's Construction (cont'd)

Multiclass learning
Error-Correcting Output Codes (ECOC)