Tutorial: PART 1
Optimization for machine learning
Elad Hazan
Princeton University
+ help from Sanjeev Arora, Yoram Singer
ML paradigm
Machine: input $a \in \mathbb{R}^d$, output $b = f_{\mathrm{parameters}}(a)$
(e.g., a chair/car label, or a distribution over labels)
This tutorial - training the machine:
• Efficiency
• Generalization
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline/online/stochastic gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second order methods,
non-convex optimization
Will NOT touch upon:
• Parallelism/distributed computation (asynchronous optimization, HOGWILD, etc.)
• Bayesian inference in graphical models
• Markov chain Monte Carlo
• Partial information and bandit algorithms
Mathematical optimization
Input: function $f: K \mapsto \mathbb{R}$, for $K \subseteq \mathbb{R}^d$
Output: minimizer $x \in K$, such that $f(x) \le f(y) \ \ \forall y \in K$
Accessing f? (values, differentials, …)
Generally NP-hard, given full access to function.
What is Optimization
But generally speaking... we're screwed:
! Local (non-global) minima of f
! All kinds of constraints (even restricting to continuous functions): h(x) = sin(2πx) = 0
[3D surface plot of a highly non-convex function omitted; credit: Duchi (UC Berkeley), Convex Optimization for Machine Learning]
Learning = optimization over data (a.k.a. Empirical Risk Minimization)
Fitting the parameters of the model ("training") = optimization problem:
$$\arg\min_{x \in \mathbb{R}^d} \ \frac{1}{m} \sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$
m = # of examples, (a, b) = (features, labels), d = dimension
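To make the objective concrete, here is a minimal numpy sketch of evaluating a regularized empirical risk; the squared loss, the ℓ2 regularizer, and all names are illustrative choices, not fixed by the slides:

```python
import numpy as np

def erm_objective(x, A, b, lam=0.1):
    """(1/m) * sum_i l_i(x, a_i, b_i) + R(x), with squared loss and
    R(x) = lam * ||x||^2 as illustrative choices."""
    losses = (A @ x - b) ** 2          # per-example loss l_i(x, a_i, b_i)
    return losses.mean() + lam * np.dot(x, x)

# m = 100 examples of dimension d = 5, synthetic for illustration
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))          # rows are feature vectors a_i
b = rng.normal(size=100)               # labels b_i
print(erm_objective(np.zeros(5), A, b))
```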
Example: linear classification
Given a sample $S = \{(a_1, b_1), \ldots, (a_m, b_m)\}$,
find a hyperplane (through the origin w.l.o.g.)
such that:
$$x = \arg\min_{\|x\| \le 1} \ \#\text{ of mistakes} = \arg\min_{\|x\| \le 1} \ \big|\{\, i \ \text{s.t.}\ \mathrm{sign}(x^\top a_i) \ne b_i \,\}\big|$$
$$= \arg\min_{\|x\| \le 1} \ \frac{1}{m} \sum_i \ell(x, a_i, b_i) \quad \text{for } \ \ell(x, a_i, b_i) = \begin{cases} 1 & \mathrm{sign}(x^\top a_i) \ne b_i \\ 0 & \mathrm{sign}(x^\top a_i) = b_i \end{cases}$$
NP-hard!
Sum of signs → global optimization NP-hard!
but locally verifiable…
Local property that ensures global optimality?
Convexity
A function $f: \mathbb{R}^d \mapsto \mathbb{R}$ is convex if and only if:
$$f\left(\tfrac{1}{2} x + \tfrac{1}{2} y\right) \le \tfrac{1}{2} f(x) + \tfrac{1}{2} f(y)$$
• Informally: smiley ☺
• Alternative definition:
$$f(y) \ge f(x) + \nabla f(x)^\top (y - x)$$
Convex sets
Set K is convex if and only if:
$$x, y \in K \ \Rightarrow\ \left(\tfrac{1}{2} x + \tfrac{1}{2} y\right) \in K$$
Convex relaxations for linear (& kernel)
classification
The hard problem: $x = \arg\min_{\|x\| \le 1} \ \big|\{\, i \ \text{s.t.}\ \mathrm{sign}(x^\top a_i) \ne b_i \,\}\big|$
Relax with convex loss functions of the margin, $\ell(x, a_i, b_i) = \ell(x^\top a_i \cdot b_i)$:
1. Ridge / linear regression: $\ell(x^\top a_i, b_i) = (x^\top a_i - b_i)^2$
2. SVM (hinge loss): $\ell(x^\top a_i, b_i) = \max\{0,\ 1 - b_i \, x^\top a_i\}$
3. Logistic regression: $\ell(x^\top a_i, b_i) = \log\left(1 + e^{-b_i \, x^\top a_i}\right)$
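For concreteness, a small sketch evaluating the three surrogate losses on one example (function names and data are my own, illustrative only):

```python
import numpy as np

def ridge_loss(x, a, b):     # (x^T a - b)^2
    return (x @ a - b) ** 2

def hinge_loss(x, a, b):     # max{0, 1 - b * x^T a}
    return max(0.0, 1.0 - b * (x @ a))

def logistic_loss(x, a, b):  # log(1 + exp(-b * x^T a))
    return np.log1p(np.exp(-b * (x @ a)))

x = np.array([0.5, -0.2])
a, b = np.array([1.0, 3.0]), -1.0    # one (features, label) example
print(ridge_loss(x, a, b), hinge_loss(x, a, b), logistic_loss(x, a, b))
```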
We have: cast learning as mathematical optimization,
argued convexity is algorithmically important
Next → algorithms!
Gradient descent, constrained set
$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
where $[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$
[figure: iterates $p_1, p_2, p_3, \ldots$ projected back into K, converging to $p^*$]
Convergence of gradient descent
$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$:
$$f\left(\frac{1}{T} \sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$
Where:
• G = upper bound on norm of gradients: $\|\nabla f(x_t)\| \le G$
• D = diameter of constraint set: $\forall x, y \in K, \ \|x - y\| \le D$
Proof: $y_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$, $\quad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$
1. Observation 1:
$$\|x^* - y_{t+1}\|^2 = \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2$$
2. Observation 2 (this is the Pythagorean theorem for projections):
$$\|x^* - x_{t+1}\|^2 \le \|x^* - y_{t+1}\|^2$$
Thus:
$$\|x^* - x_{t+1}\|^2 \le \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 G^2$$
And hence:
$$f\Big(\frac{1}{T} \sum_t x_t\Big) - f(x^*) \le \frac{1}{T} \sum_t \big[f(x_t) - f(x^*)\big] \le \frac{1}{T} \sum_t \nabla f(x_t)^\top (x_t - x^*)$$
$$\le \frac{1}{T} \sum_t \left[ \frac{1}{2\eta}\big(\|x^* - x_t\|^2 - \|x^* - x_{t+1}\|^2\big) + \frac{\eta}{2} G^2 \right] \le \frac{1}{T \cdot 2\eta} D^2 + \frac{\eta}{2} G^2 \le \frac{DG}{\sqrt{T}}$$
Recap
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$:
$$f\left(\frac{1}{T} \sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$
Thus, to get an $\epsilon$-approximate solution, apply $O\!\left(\frac{1}{\epsilon^2}\right)$ gradient iterations.
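A minimal sketch of projected gradient descent in Python, using a Euclidean ball as K so the projection has a closed form; the quadratic objective, radius, and the bounds D and G below are illustrative assumptions, not from the slides:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

def projected_gd(grad, x0, eta, T, radius=1.0):
    """Projected gradient descent; returns the average iterate,
    which the theorem bounds: f(avg) <= min f + DG/sqrt(T)."""
    x, iterates = x0, []
    for _ in range(T):
        y = x - eta * grad(x)          # gradient step
        x = project_ball(y, radius)    # project back into K
        iterates.append(x)
    return np.mean(iterates, axis=0)

# Illustrative objective: f(x) = ||x - c||^2 with c outside the unit ball.
c = np.array([2.0, 0.0])
grad = lambda x: 2 * (x - c)
D, G = 2.0, 6.0                        # diameter of K, gradient bound on K
T = 1000
eta = D / (G * np.sqrt(T))             # step size from the theorem
print(projected_gd(grad, np.zeros(2), eta, T))   # ≈ [1, 0], the constrained minimizer
```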
Gradient Descent - caveat
For ERM problems
$$\arg\min_{x \in \mathbb{R}^d} \ \frac{1}{m} \sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$
1. Gradient depends on all data
2. What about generalization?
Next few slides:
Simultaneous optimization and generalization
→ Faster optimization! (single example per iteration)
Statistical (PAC) learning
Nature: i.i.d. samples from distribution D over
$A \times B = \{(a, b)\}$
Learner: sees examples $(a_1, b_1), \ldots, (a_m, b_m)$, outputs hypothesis $h \in H = \{h_1, h_2, \ldots, h_N\}$
Loss, e.g. $\ell(h, (a, b)) = (h(a) - b)^2$
$$\mathrm{err}(h) = \mathbb{E}_{(a,b) \sim D}\big[\ell(h, (a, b))\big]$$
Hypothesis class $H: X \to Y$ is learnable if $\forall \epsilon, \delta > 0$ there exists an algorithm s.t. after seeing m
examples, for $m = \mathrm{poly}(\delta, \epsilon, \mathrm{dimension}(H))$,
it finds h s.t. w.p. $1 - \delta$:
$$\mathrm{err}(h) \le \min_{h^* \in H} \mathrm{err}(h^*) + \epsilon$$
More powerful setting:
Online Learning in Games
Iteratively, for t = 1, 2, …, T:
Player: $h_t \in H$
Adversary: $(a_t, b_t) \in A \times B$
Loss $\ell(h_t, (a_t, b_t))$
Goal: minimize (average, expected) regret:
$$\frac{1}{T} \left[ \sum_t \ell(h_t, (a_t, b_t)) - \min_{h^* \in H} \sum_t \ell(h^*, (a_t, b_t)) \right] \ \xrightarrow{\ T \to \infty\ }\ 0$$
Vanishing regret → generalization in PAC setting! (online2batch)
From this point onwards: $f_t(x) = \ell(x, a_t, b_t)$ = loss for one example
Can we minimize regret efficiently?
Online gradient descent [Zinkevich ‘05]
$$y_{t+1} = x_t - \eta \nabla f_t(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
Theorem: $\mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^*} \sum_t f_t(x^*) = O(\sqrt{T})$
Analysis
$\nabla_t := \nabla f_t(x_t)$
Observation 1:
$$\|y_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$
Observation 2 (Pythagoras):
$$\|x_{t+1} - x^*\| \le \|y_{t+1} - x^*\|$$
Thus:
$$\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$
Convexity:
$$\sum_t \big[f_t(x_t) - f_t(x^*)\big] \le \sum_t \nabla_t^\top (x_t - x^*)$$
$$\le \frac{1}{2\eta} \sum_t \left( \|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2 \right) + \frac{\eta}{2} \sum_t \|\nabla_t\|^2$$
$$\le \frac{1}{2\eta} \|x_1 - x^*\|^2 + \frac{\eta}{2} T G^2 = O(\sqrt{T})$$
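A minimal online-gradient-descent sketch against adversarial linear losses; the box constraint set and random sign costs (echoing the lower-bound example that follows) are illustrative assumptions:

```python
import numpy as np

def online_gd(loss_grads, x0, D, G):
    """Online gradient descent over K = [-1, 1]^d with eta = D/(G*sqrt(T)).
    Returns the iterates x_1, ..., x_T played against the loss sequence."""
    T = len(loss_grads)
    eta = D / (G * np.sqrt(T))
    x, plays = np.asarray(x0, dtype=float), []
    for grad_t in loss_grads:
        plays.append(x.copy())
        y = x - eta * grad_t(x)        # gradient step on today's loss
        x = np.clip(y, -1.0, 1.0)      # Euclidean projection onto the box
    return plays

# Loss sequence f_t(x) = c_t . x with random sign costs
rng = np.random.default_rng(1)
signs = rng.choice([-1.0, 1.0], size=500)
grads = [lambda x, s=s: np.array([s]) for s in signs]
plays = online_gd(grads, x0=[0.0], D=2.0, G=1.0)

regret = sum(s * x[0] for s, x in zip(signs, plays)) - min(signs.sum(), -signs.sum())
print(regret / len(signs))             # average regret, shrinks like 1/sqrt(T)
```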
Lower bound
$$\mathrm{Regret} = \Omega(\sqrt{T})$$
• 2 loss functions, T iterations:
• $K = [-1, 1]$, $f_1(x) = x$, $f_2(x) = -x$
• Second expert loss = first × (−1)
• Expected loss = 0 (any algorithm)
• Regret (compared to either −1 or 1):
$$\mathbb{E}\left[\,\big|\#1\text{'s} - \#(-1)\text{'s}\big|\,\right] = \Omega(\sqrt{T})$$
Stochastic gradient descent
Learning problem: $\arg\min_{x \in \mathbb{R}^d} F(x) = \mathbb{E}_{(a_i, b_i)}\big[\ell_i(x, a_i, b_i)\big]$
Random example: $f_t(x) = \ell_i(x, a_i, b_i)$
1. We have proved (for any sequence of $\nabla_t$):
$$\frac{1}{T} \sum_t \nabla_t^\top x_t \le \min_{x^* \in K} \frac{1}{T} \sum_t \nabla_t^\top x^* + \frac{DG}{\sqrt{T}}$$
2. Taking (conditional) expectation:
$$\mathbb{E}\left[ F\Big(\frac{1}{T} \sum_t x_t\Big) \right] - \min_{x^* \in K} F(x^*) \le \mathbb{E}\left[ \frac{1}{T} \sum_t \nabla_t^\top (x_t - x^*) \right] \le \frac{DG}{\sqrt{T}}$$
One example per step, same convergence as GD, & gives direct generalization!
(formally needs martingales)
Stochastic vs. full gradient descent: $O\!\left(\frac{d}{\epsilon^2}\right)$ vs. $O\!\left(\frac{md}{\epsilon^2}\right)$ total running time for $\epsilon$ generalization error.
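A minimal SGD sketch: one random example per step, returning the average iterate as in the analysis above; the least-squares loss, ball constraint, and synthetic data are illustrative assumptions:

```python
import numpy as np

def sgd(A, b, T, eta, radius=1.0):
    """SGD for F(x) = E_i[(a_i . x - b_i)^2] over the ball ||x|| <= radius:
    one random example per step, returning the average iterate."""
    rng = np.random.default_rng(0)
    m, d = A.shape
    x, avg = np.zeros(d), np.zeros(d)
    for t in range(T):
        i = rng.integers(m)                  # sample a single example
        g = 2 * (A[i] @ x - b[i]) * A[i]     # stochastic gradient of f_t
        y = x - eta * g
        norm = np.linalg.norm(y)             # project back onto the ball
        x = y if norm <= radius else (radius / norm) * y
        avg += (x - avg) / (t + 1)           # running average of iterates
    return avg

rng = np.random.default_rng(2)
A = rng.normal(size=(1000, 10))
x_true = 0.1 * rng.normal(size=10)
b = A @ x_true + 0.01 * rng.normal(size=1000)
print(sgd(A, b, T=5000, eta=0.01))           # approaches x_true
```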
Regularization &
Gradient Descent++
Why “regularize”?
• Statistical learning theory /
Occam’s razor:
# of examples needed to learn a
hypothesis class ~ its “dimension”
• VC dimension
• Fat-shattering dimension
• Rademacher width
• Margin/norm of linear/kernel classifier
• PAC theory: Regularization <-> reduce complexity
• Regret minimization: Regularization <-> stability
Minimize regret: best-in-hindsight
$$\mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*)$$
• Most natural, Follow-The-Leader (FTL):
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x)$$
• Provably works [Kalai-Vempala '05]:
$$x_t' = \arg\min_{x \in K} \sum_{i=1}^{t} f_i(x) = x_{t+1}$$
• So if $x_t \approx x_{t+1}$, we get a regret bound
• But the instability $\|x_t - x_{t+1}\|$ can be large! (see the sketch below)
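The sketch below makes the instability concrete: with linear losses on K = [−1, 1] that alternate sign, the FTL leader flips between the endpoints every round and suffers regret linear in T (the construction is illustrative, in the spirit of the lower-bound example earlier):

```python
import numpy as np

# FTL on K = [-1, 1] with linear losses f_t(x) = c_t * x: a small
# tie-breaking first loss, then alternating +/-1. The leader flips
# endpoints every round and pays ~1 per step -- linear regret.
T = 100
c = np.array([0.5] + [(-1.0) ** t for t in range(1, T)])
x, ftl_loss = 0.0, 0.0
for t in range(T):
    ftl_loss += c[t] * x
    x = -1.0 if c[: t + 1].sum() > 0 else 1.0   # leader: argmin of cumulative loss
best = -abs(c.sum())                             # best fixed point in hindsight
print(ftl_loss - best)                           # ~T, i.e. regret grows linearly
```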
Fixing FTL: Follow-The-Regularized-Leader
(FTRL)
• Linearize: replace $f_t$ by a linear function, $\nabla f_t(x_t)^\top x$
• Add regularization:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla_i^\top x + \frac{1}{\eta} R(x)$$
• R(x) is a strongly convex function, ensures stability:
$$\nabla_t^\top (x_t - x_{t+1}) = O(\eta)$$
FTRL vs. gradient descent
• $R(x) = \frac{1}{2} \|x\|^2$:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \Pi_K\left( -\eta \sum_{i=1}^{t-1} \nabla f_i(x_i) \right)$$
• Essentially OGD: starting with $y_1 = 0$, for t = 1, 2, …
$$x_t = \Pi_K(y_t), \qquad y_{t+1} = y_t - \eta \nabla f_t(x_t)$$
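A quick numeric check of this equivalence (illustrative): for linear losses the linearization is exact, so FTRL with $R(x) = \frac{1}{2}\|x\|^2$ and the lazy projected recursion above produce identical iterates:

```python
import numpy as np

def proj(y, r=1.0):
    """Euclidean projection onto the ball of radius r."""
    n = np.linalg.norm(y)
    return y if n <= r else (r / n) * y

# Linear losses f_t(x) = g_t . x, so grad f_t(x_t) = g_t for any x_t.
rng = np.random.default_rng(3)
grads = rng.normal(size=(50, 3))
eta = 0.1

# Lazy recursion: x_t = proj(y_t), y_{t+1} = y_t - eta * g_t
y, lazy = np.zeros(3), []
for g in grads:
    lazy.append(proj(y))
    y = y - eta * g

# FTRL form: x_t = proj(-eta * sum_{i<t} g_i) for R(x) = ||x||^2 / 2
ftrl = [proj(-eta * grads[:t].sum(axis=0)) for t in range(len(grads))]
print(np.allclose(lazy, ftrl))   # True: the two updates coincide
```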
FTRL vs. Multiplicative Weights
• Experts setting: $K = \Delta_d$, distributions over experts
• $f_t(x) = c_t^\top x$, where $c_t$ is the vector of losses
• $R(x) = \sum_i x_i \log x_i$: negative entropy
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \exp\left( -\eta \sum_{i=1}^{t-1} c_i \right) / Z_t$$
(entrywise exponential; $Z_t$ is the normalization constant)
• Gives the Multiplicative Weights method!
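A minimal sketch of the resulting Multiplicative Weights update in Python; the loss vectors and step size are illustrative:

```python
import numpy as np

def multiplicative_weights(cost_vectors, eta):
    """FTRL with negative-entropy regularizer over the simplex:
    x_t proportional to exp(-eta * sum of past cost vectors)."""
    cum = np.zeros(len(cost_vectors[0]))
    plays = []
    for c in cost_vectors:
        w = np.exp(-eta * cum)       # entrywise exponential
        plays.append(w / w.sum())    # normalize by Z_t
        cum += c                     # accumulate losses
    return plays

costs = [np.array([0.3, 0.9]), np.array([0.8, 0.1]), np.array([0.2, 0.7])]
for x in multiplicative_weights(costs, eta=0.5):
    print(x)                          # weight shifts toward the cheaper expert
```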
FTRL ⇔ Online Mirror Descent
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x)$$
Bregman Projection:
$$\Pi_K^R(y) = \arg\min_{x \in K} B_R(x \| y), \qquad B_R(x \| y) := R(x) - R(y) - \nabla R(y)^\top (x - y)$$
$$x_t = \Pi_K^R(y_t), \qquad y_{t+1} = (\nabla R)^{-1}\left( \nabla R(y_t) - \eta \nabla f_t(x_t) \right)$$
Adaptive Regularization: AdaGrad
• Consider a generalized linear model, where the prediction is a function of $a^\top x$:
$$\nabla f_t(x) = \ell'(a_t, b_t, x)\, a_t$$
• OGD update: $x_{t+1} = x_t - \eta \nabla_t = x_t - \eta\, \ell'(a_t, b_t, x)\, a_t$
• All features are treated equally in updating the parameter vector
• In typical text classification tasks, feature vectors $a_t$ are very sparse → slow learning!
• Adaptive regularization: per-feature learning rates
Optimal regularization
• The general RFTL form
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x) + \frac{1}{\eta} R(x)$$
• Which regularizer to pick?
• AdaGrad: treat this as a learning problem! Family of regularizations:
$$R(x) = \|x\|_A^2 \ \ \text{s.t.}\ \ A \succeq 0, \ \mathrm{Trace}(A) = d$$
• Objective in matrix world: best regret in hindsight!
AdaGrad (diagonal form)
• Set $x_1 \in K$ arbitrarily
• For t = 1, 2, …:
1. use $x_t$, obtain $f_t$
2. compute $x_{t+1}$ as follows:
$$G_t = \mathrm{diag}\left( \sum_{i=1}^{t} \nabla f_i(x_i) \nabla f_i(x_i)^\top \right)$$
$$y_{t+1} = x_t - \eta\, G_t^{-1/2} \nabla f_t(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} \ (y_{t+1} - x)^\top G_t (y_{t+1} - x)$$
• Regret bound [Duchi, Hazan, Singer '10]:
$$O\left( \sum_i \sqrt{\sum_t \nabla_{t,i}^2} \right), \ \text{can be } \sqrt{d} \text{ better than SGD}$$
• Infrequently occurring, or small-scale, features have small influence
on regret (and therefore, on convergence to the optimal parameter)
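A minimal diagonal-AdaGrad sketch in Python. For simplicity it omits the projection onto K in the $G_t$ norm that the slides include; eps is a standard numerical safeguard, and the data is illustrative:

```python
import numpy as np

def adagrad_diag(grad_fns, x0, eta, eps=1e-8):
    """Diagonal AdaGrad. This sketch is unconstrained; the slides'
    version also projects onto K in the G_t norm."""
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)                      # diag of sum of grad outer products
    for grad_t in grad_fns:
        g = grad_t(x)
        s += g * g                            # accumulate squared gradients
        x = x - eta * g / (np.sqrt(s) + eps)  # per-feature step sizes
    return x

# Imbalanced-feature example: the second feature has much smaller-scale
# gradients; AdaGrad's per-feature normalization compensates for it.
rng = np.random.default_rng(4)
x_true = np.array([1.0, -1.0])
A = rng.normal(size=(500, 2)) * np.array([1.0, 0.05])
grad_fns = [lambda x, a=a: 2 * (a @ x - a @ x_true) * a for a in A]
print(adagrad_diag(grad_fns, x0=[0.0, 0.0], eta=0.5))   # approaches x_true
```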
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline/stochastic/online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second order methods,
non-convex optimization