Tutorial: PART 1
Optimization for machine learning
Elad Hazan
Princeton University
+ help from Sanjeev Arora, Yoram Singer
ML paradigm
Machine: input $a \in \mathbb{R}^d$, output $b = f_{\mathrm{parameters}}(a)$
(e.g., a chair/car label, or a distribution over labels)
This tutorial - training the machine:
• Efficiency
• Generalization
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline/online/stochastic gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second order methods,
non-convex optimization
Will NOT touch upon:
• Parallelism/distributed computation (asynchronous optimization, HOGWILD, etc.)
• Bayesian inference in graphical models
• Markov chain Monte Carlo
• Partial information and bandit algorithms
Mathematical optimization
Input: function $f: K \mapsto \mathbb{R}$, for $K \subseteq \mathbb{R}^d$
Output: minimizer $x \in K$, such that $f(x) \le f(y) \ \ \forall y \in K$
Accessing f? (values, differentials, …)
Generally NP-hard, given full access to function.
What is Optimization
But generally speaking... we're screwed:
! Local (non-global) minima of f
! All kinds of constraints (even restricting to continuous functions): h(x) = sin(2πx) = 0
[3D surface plot of a highly non-convex function omitted; credit: Duchi (UC Berkeley), Convex Optimization for Machine Learning]
Learning = optimization over data (a.k.a. Empirical Risk Minimization)
Fitting the parameters of the model ("training") = optimization problem:
$$\arg\min_{x \in \mathbb{R}^d} \ \frac{1}{m} \sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$
m = # of examples, (a, b) = (features, labels), d = dimension
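To make the objective concrete, here is a minimal numpy sketch of evaluating a regularized empirical risk; the squared loss, the ℓ2 regularizer, and all names are illustrative choices, not fixed by the slides:

```python
import numpy as np

def erm_objective(x, A, b, lam=0.1):
    """(1/m) * sum_i l_i(x, a_i, b_i) + R(x), with squared loss and
    R(x) = lam * ||x||^2 as illustrative choices."""
    losses = (A @ x - b) ** 2          # per-example loss l_i(x, a_i, b_i)
    return losses.mean() + lam * np.dot(x, x)

# m = 100 examples of dimension d = 5, synthetic for illustration
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))          # rows are feature vectors a_i
b = rng.normal(size=100)               # labels b_i
print(erm_objective(np.zeros(5), A, b))
```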
Example: linear classification
Given a sample $S = \{(a_1, b_1), \ldots, (a_m, b_m)\}$,
find a hyperplane (through the origin w.l.o.g.)
such that:
$$x = \arg\min_{\|x\| \le 1} \ \#\text{ of mistakes} = \arg\min_{\|x\| \le 1} \ \big|\{\, i \ \text{s.t.}\ \mathrm{sign}(x^\top a_i) \ne b_i \,\}\big|$$
$$= \arg\min_{\|x\| \le 1} \ \frac{1}{m} \sum_i \ell(x, a_i, b_i) \quad \text{for } \ \ell(x, a_i, b_i) = \begin{cases} 1 & \mathrm{sign}(x^\top a_i) \ne b_i \\ 0 & \mathrm{sign}(x^\top a_i) = b_i \end{cases}$$
NP-hard!
Sum of signs → global optimization NP-hard!
but locally verifiable…
Local property that ensures global optimality?
Convexity
A function $f: \mathbb{R}^d \mapsto \mathbb{R}$ is convex if and only if:
$$f\left(\tfrac{1}{2} x + \tfrac{1}{2} y\right) \le \tfrac{1}{2} f(x) + \tfrac{1}{2} f(y)$$
• Informally: smiley ☺
• Alternative definition:
$$f(y) \ge f(x) + \nabla f(x)^\top (y - x)$$
Convex sets
Set K is convex if and only if:
$$x, y \in K \ \Rightarrow\ \left(\tfrac{1}{2} x + \tfrac{1}{2} y\right) \in K$$
Convex relaxations for linear (& kernel)
classification
The hard problem: $x = \arg\min_{\|x\| \le 1} \ \big|\{\, i \ \text{s.t.}\ \mathrm{sign}(x^\top a_i) \ne b_i \,\}\big|$
Relax with convex loss functions of the margin, $\ell(x, a_i, b_i) = \ell(x^\top a_i \cdot b_i)$:
1. Ridge / linear regression: $\ell(x^\top a_i, b_i) = (x^\top a_i - b_i)^2$
2. SVM (hinge loss): $\ell(x^\top a_i, b_i) = \max\{0,\ 1 - b_i \, x^\top a_i\}$
3. Logistic regression: $\ell(x^\top a_i, b_i) = \log\left(1 + e^{-b_i \, x^\top a_i}\right)$
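For concreteness, a small sketch evaluating the three surrogate losses on one example (function names and data are my own, illustrative only):

```python
import numpy as np

def ridge_loss(x, a, b):     # (x^T a - b)^2
    return (x @ a - b) ** 2

def hinge_loss(x, a, b):     # max{0, 1 - b * x^T a}
    return max(0.0, 1.0 - b * (x @ a))

def logistic_loss(x, a, b):  # log(1 + exp(-b * x^T a))
    return np.log1p(np.exp(-b * (x @ a)))

x = np.array([0.5, -0.2])
a, b = np.array([1.0, 3.0]), -1.0    # one (features, label) example
print(ridge_loss(x, a, b), hinge_loss(x, a, b), logistic_loss(x, a, b))
```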
We have: cast learning as mathematical optimization,
argued convexity is algorithmically important
Next → algorithms!
Gradient descent, constrained set
$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
where $[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$
[figure: iterates $p_1, p_2, p_3, \ldots$ projected back into K, converging to $p^*$]
Convergence of gradient descent
$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$:
$$f\left(\frac{1}{T} \sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$
Where:
• G = upper bound on norm of gradients: $\|\nabla f(x_t)\| \le G$
• D = diameter of constraint set: $\forall x, y \in K, \ \|x - y\| \le D$
Proof: $y_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$, $\quad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$
1. Observation 1:
$$\|x^* - y_{t+1}\|^2 = \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2$$
2. Observation 2 (this is the Pythagorean theorem for projections):
$$\|x^* - x_{t+1}\|^2 \le \|x^* - y_{t+1}\|^2$$
Thus:
$$\|x^* - x_{t+1}\|^2 \le \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 G^2$$
And hence:
$$f\Big(\frac{1}{T} \sum_t x_t\Big) - f(x^*) \le \frac{1}{T} \sum_t \big[f(x_t) - f(x^*)\big] \le \frac{1}{T} \sum_t \nabla f(x_t)^\top (x_t - x^*)$$
$$\le \frac{1}{T} \sum_t \left[ \frac{1}{2\eta}\big(\|x^* - x_t\|^2 - \|x^* - x_{t+1}\|^2\big) + \frac{\eta}{2} G^2 \right] \le \frac{1}{T \cdot 2\eta} D^2 + \frac{\eta}{2} G^2 \le \frac{DG}{\sqrt{T}}$$
Recap
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$:
$$f\left(\frac{1}{T} \sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$
Thus, to get an $\epsilon$-approximate solution, apply $O\!\left(\frac{1}{\epsilon^2}\right)$ gradient iterations.
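A minimal sketch of projected gradient descent in Python, using a Euclidean ball as K so the projection has a closed form; the quadratic objective, radius, and the bounds D and G below are illustrative assumptions, not from the slides:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

def projected_gd(grad, x0, eta, T, radius=1.0):
    """Projected gradient descent; returns the average iterate,
    which the theorem bounds: f(avg) <= min f + DG/sqrt(T)."""
    x, iterates = x0, []
    for _ in range(T):
        y = x - eta * grad(x)          # gradient step
        x = project_ball(y, radius)    # project back into K
        iterates.append(x)
    return np.mean(iterates, axis=0)

# Illustrative objective: f(x) = ||x - c||^2 with c outside the unit ball.
c = np.array([2.0, 0.0])
grad = lambda x: 2 * (x - c)
D, G = 2.0, 6.0                        # diameter of K, gradient bound on K
T = 1000
eta = D / (G * np.sqrt(T))             # step size from the theorem
print(projected_gd(grad, np.zeros(2), eta, T))   # ≈ [1, 0], the constrained minimizer
```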
Gradient Descent - caveat
For ERM problems
$$\arg\min_{x \in \mathbb{R}^d} \ \frac{1}{m} \sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$
1. Gradient depends on all data
2. What about generalization?
Next few slides:
Simultaneous optimization and generalization
→ Faster optimization! (single example per iteration)
Statistical (PAC) learning
Nature: i.i.d. samples from distribution D over
$A \times B = \{(a, b)\}$
Learner: sees examples $(a_1, b_1), \ldots, (a_m, b_m)$, outputs hypothesis $h \in H = \{h_1, h_2, \ldots, h_N\}$
Loss, e.g. $\ell(h, (a, b)) = (h(a) - b)^2$
$$\mathrm{err}(h) = \mathbb{E}_{(a,b) \sim D}\big[\ell(h, (a, b))\big]$$
Hypothesis class $H: X \to Y$ is learnable if $\forall \epsilon, \delta > 0$ there exists an algorithm s.t. after seeing m
examples, for $m = \mathrm{poly}(\delta, \epsilon, \mathrm{dimension}(H))$,
it finds h s.t. w.p. $1 - \delta$:
$$\mathrm{err}(h) \le \min_{h^* \in H} \mathrm{err}(h^*) + \epsilon$$
More powerful setting:
Online Learning in Games
Iteratively, for t = 1, 2, …, T:
Player: $h_t \in H$
Adversary: $(a_t, b_t) \in A \times B$
Loss $\ell(h_t, (a_t, b_t))$
Goal: minimize (average, expected) regret:
$$\frac{1}{T} \left[ \sum_t \ell(h_t, (a_t, b_t)) - \min_{h^* \in H} \sum_t \ell(h^*, (a_t, b_t)) \right] \ \xrightarrow{\ T \to \infty\ }\ 0$$
Vanishing regret → generalization in PAC setting! (online2batch)
From this point onwards: $f_t(x) = \ell(x, a_t, b_t)$ = loss for one example
Can we minimize regret efficiently?
Online gradient descent [Zinkevich ‘05]
$$y_{t+1} = x_t - \eta \nabla f_t(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
Theorem: $\mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^*} \sum_t f_t(x^*) = O(\sqrt{T})$
Analysis
$\nabla_t := \nabla f_t(x_t)$
Observation 1:
$$\|y_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$
Observation 2 (Pythagoras):
$$\|x_{t+1} - x^*\| \le \|y_{t+1} - x^*\|$$
Thus:
$$\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$
Convexity:
$$\sum_t \big[f_t(x_t) - f_t(x^*)\big] \le \sum_t \nabla_t^\top (x_t - x^*)$$
$$\le \frac{1}{2\eta} \sum_t \left( \|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2 \right) + \frac{\eta}{2} \sum_t \|\nabla_t\|^2$$
$$\le \frac{1}{2\eta} \|x_1 - x^*\|^2 + \frac{\eta}{2} T G^2 = O(\sqrt{T})$$
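A minimal online-gradient-descent sketch against adversarial linear losses; the box constraint set and random sign costs (echoing the lower-bound example that follows) are illustrative assumptions:

```python
import numpy as np

def online_gd(loss_grads, x0, D, G):
    """Online gradient descent over K = [-1, 1]^d with eta = D/(G*sqrt(T)).
    Returns the iterates x_1, ..., x_T played against the loss sequence."""
    T = len(loss_grads)
    eta = D / (G * np.sqrt(T))
    x, plays = np.asarray(x0, dtype=float), []
    for grad_t in loss_grads:
        plays.append(x.copy())
        y = x - eta * grad_t(x)        # gradient step on today's loss
        x = np.clip(y, -1.0, 1.0)      # Euclidean projection onto the box
    return plays

# Loss sequence f_t(x) = c_t . x with random sign costs
rng = np.random.default_rng(1)
signs = rng.choice([-1.0, 1.0], size=500)
grads = [lambda x, s=s: np.array([s]) for s in signs]
plays = online_gd(grads, x0=[0.0], D=2.0, G=1.0)

regret = sum(s * x[0] for s, x in zip(signs, plays)) - min(signs.sum(), -signs.sum())
print(regret / len(signs))             # average regret, shrinks like 1/sqrt(T)
```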
Lower bound
$$\mathrm{Regret} = \Omega(\sqrt{T})$$
• 2 loss functions, T iterations:
• $K = [-1, 1]$, $f_1(x) = x$, $f_2(x) = -x$
• Second expert loss = first × (−1)
• Expected loss = 0 (any algorithm)
• Regret (compared to either −1 or 1):
$$\mathbb{E}\left[\,\big|\#1\text{'s} - \#(-1)\text{'s}\big|\,\right] = \Omega(\sqrt{T})$$
Stochastic gradient descent
Learning problem: $\arg\min_{x \in \mathbb{R}^d} F(x) = \mathbb{E}_{(a_i, b_i)}\big[\ell_i(x, a_i, b_i)\big]$
Random example: $f_t(x) = \ell_i(x, a_i, b_i)$
1. We have proved (for any sequence of $\nabla_t$):
$$\frac{1}{T} \sum_t \nabla_t^\top x_t \le \min_{x^* \in K} \frac{1}{T} \sum_t \nabla_t^\top x^* + \frac{DG}{\sqrt{T}}$$
2. Taking (conditional) expectation:
$$\mathbb{E}\left[ F\Big(\frac{1}{T} \sum_t x_t\Big) \right] - \min_{x^* \in K} F(x^*) \le \mathbb{E}\left[ \frac{1}{T} \sum_t \nabla_t^\top (x_t - x^*) \right] \le \frac{DG}{\sqrt{T}}$$
One example per step, same convergence as GD, & gives direct generalization!
(formally needs martingales)
Stochastic vs. full gradient descent: $O\!\left(\frac{d}{\epsilon^2}\right)$ vs. $O\!\left(\frac{md}{\epsilon^2}\right)$ total running time for $\epsilon$ generalization error.
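A minimal SGD sketch: one random example per step, returning the average iterate as in the analysis above; the least-squares loss, ball constraint, and synthetic data are illustrative assumptions:

```python
import numpy as np

def sgd(A, b, T, eta, radius=1.0):
    """SGD for F(x) = E_i[(a_i . x - b_i)^2] over the ball ||x|| <= radius:
    one random example per step, returning the average iterate."""
    rng = np.random.default_rng(0)
    m, d = A.shape
    x, avg = np.zeros(d), np.zeros(d)
    for t in range(T):
        i = rng.integers(m)                  # sample a single example
        g = 2 * (A[i] @ x - b[i]) * A[i]     # stochastic gradient of f_t
        y = x - eta * g
        norm = np.linalg.norm(y)             # project back onto the ball
        x = y if norm <= radius else (radius / norm) * y
        avg += (x - avg) / (t + 1)           # running average of iterates
    return avg

rng = np.random.default_rng(2)
A = rng.normal(size=(1000, 10))
x_true = 0.1 * rng.normal(size=10)
b = A @ x_true + 0.01 * rng.normal(size=1000)
print(sgd(A, b, T=5000, eta=0.01))           # approaches x_true
```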
Regularization &
Gradient Descent++
Why “regularize”?
• Statistical learning theory /
Occam’s razor:
# of examples needed to learn a
hypothesis class ~ its “dimension”
• VC dimension
• Fat-shattering dimension
• Rademacher width
• Margin/norm of linear/kernel classifier
• PAC theory: Regularization <-> reduce complexity
• Regret minimization: Regularization <-> stability
Minimize regret: best-in-hindsight
$$\mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*)$$
• Most natural, Follow-The-Leader (FTL):
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x)$$
• Provably works [Kalai-Vempala '05]:
$$x_t' = \arg\min_{x \in K} \sum_{i=1}^{t} f_i(x) = x_{t+1}$$
• So if $x_t \approx x_{t+1}$, we get a regret bound
• But the instability $\|x_t - x_{t+1}\|$ can be large! (see the sketch below)
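The sketch below makes the instability concrete: with linear losses on K = [−1, 1] that alternate sign, the FTL leader flips between the endpoints every round and suffers regret linear in T (the construction is illustrative, in the spirit of the lower-bound example earlier):

```python
import numpy as np

# FTL on K = [-1, 1] with linear losses f_t(x) = c_t * x: a small
# tie-breaking first loss, then alternating +/-1. The leader flips
# endpoints every round and pays ~1 per step -- linear regret.
T = 100
c = np.array([0.5] + [(-1.0) ** t for t in range(1, T)])
x, ftl_loss = 0.0, 0.0
for t in range(T):
    ftl_loss += c[t] * x
    x = -1.0 if c[: t + 1].sum() > 0 else 1.0   # leader: argmin of cumulative loss
best = -abs(c.sum())                             # best fixed point in hindsight
print(ftl_loss - best)                           # ~T, i.e. regret grows linearly
```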
Fixing FTL: Follow-The-Regularized-Leader
(FTRL)
• Linearize: replace $f_t$ by a linear function, $\nabla f_t(x_t)^\top x$
• Add regularization:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla_i^\top x + \frac{1}{\eta} R(x)$$
• R(x) is a strongly convex function, ensures stability:
$$\nabla_t^\top (x_t - x_{t+1}) = O(\eta)$$
FTRL vs. gradient descent
• $R(x) = \frac{1}{2} \|x\|^2$:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \Pi_K\left( -\eta \sum_{i=1}^{t-1} \nabla f_i(x_i) \right)$$
• Essentially OGD: starting with $y_1 = 0$, for t = 1, 2, …
$$x_t = \Pi_K(y_t), \qquad y_{t+1} = y_t - \eta \nabla f_t(x_t)$$
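A quick numeric check of this equivalence (illustrative): for linear losses the linearization is exact, so FTRL with $R(x) = \frac{1}{2}\|x\|^2$ and the lazy projected recursion above produce identical iterates:

```python
import numpy as np

def proj(y, r=1.0):
    """Euclidean projection onto the ball of radius r."""
    n = np.linalg.norm(y)
    return y if n <= r else (r / n) * y

# Linear losses f_t(x) = g_t . x, so grad f_t(x_t) = g_t for any x_t.
rng = np.random.default_rng(3)
grads = rng.normal(size=(50, 3))
eta = 0.1

# Lazy recursion: x_t = proj(y_t), y_{t+1} = y_t - eta * g_t
y, lazy = np.zeros(3), []
for g in grads:
    lazy.append(proj(y))
    y = y - eta * g

# FTRL form: x_t = proj(-eta * sum_{i<t} g_i) for R(x) = ||x||^2 / 2
ftrl = [proj(-eta * grads[:t].sum(axis=0)) for t in range(len(grads))]
print(np.allclose(lazy, ftrl))   # True: the two updates coincide
```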
FTRL vs. Multiplicative Weights
• Experts setting: $K = \Delta_d$, distributions over experts
• $f_t(x) = c_t^\top x$, where $c_t$ is the vector of losses
• $R(x) = \sum_i x_i \log x_i$: negative entropy
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \exp\left( -\eta \sum_{i=1}^{t-1} c_i \right) / Z_t$$
(entrywise exponential; $Z_t$ is the normalization constant)
• Gives the Multiplicative Weights method!
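A minimal sketch of the resulting Multiplicative Weights update in Python; the loss vectors and step size are illustrative:

```python
import numpy as np

def multiplicative_weights(cost_vectors, eta):
    """FTRL with negative-entropy regularizer over the simplex:
    x_t proportional to exp(-eta * sum of past cost vectors)."""
    cum = np.zeros(len(cost_vectors[0]))
    plays = []
    for c in cost_vectors:
        w = np.exp(-eta * cum)       # entrywise exponential
        plays.append(w / w.sum())    # normalize by Z_t
        cum += c                     # accumulate losses
    return plays

costs = [np.array([0.3, 0.9]), np.array([0.8, 0.1]), np.array([0.2, 0.7])]
for x in multiplicative_weights(costs, eta=0.5):
    print(x)                          # weight shifts toward the cheaper expert
```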
FTRL ⇔ Online Mirror Descent
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x)$$
Bregman Projection:
$$\Pi_K^R(y) = \arg\min_{x \in K} B_R(x \| y), \qquad B_R(x \| y) := R(x) - R(y) - \nabla R(y)^\top (x - y)$$
$$x_t = \Pi_K^R(y_t), \qquad y_{t+1} = (\nabla R)^{-1}\left( \nabla R(y_t) - \eta \nabla f_t(x_t) \right)$$
Adaptive Regularization: AdaGrad
• Consider a generalized linear model, where the prediction is a function of $a^\top x$:
$$\nabla f_t(x) = \ell'(a_t, b_t, x)\, a_t$$
• OGD update: $x_{t+1} = x_t - \eta \nabla_t = x_t - \eta\, \ell'(a_t, b_t, x)\, a_t$
• All features are treated equally in updating the parameter vector
• In typical text classification tasks, feature vectors $a_t$ are very sparse → slow learning!
• Adaptive regularization: per-feature learning rates
Optimal regularization
• The general RFTL form
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x) + \frac{1}{\eta} R(x)$$
• Which regularizer to pick?
• AdaGrad: treat this as a learning problem! Family of regularizations:
$$R(x) = \|x\|_A^2 \ \ \text{s.t.}\ \ A \succeq 0, \ \mathrm{Trace}(A) = d$$
• Objective in matrix world: best regret in hindsight!
AdaGrad (diagonal form)
• Set $x_1 \in K$ arbitrarily
• For t = 1, 2, …:
1. use $x_t$, obtain $f_t$
2. compute $x_{t+1}$ as follows:
$$G_t = \mathrm{diag}\left( \sum_{i=1}^{t} \nabla f_i(x_i) \nabla f_i(x_i)^\top \right)$$
$$y_{t+1} = x_t - \eta\, G_t^{-1/2} \nabla f_t(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} \ (y_{t+1} - x)^\top G_t (y_{t+1} - x)$$
• Regret bound [Duchi, Hazan, Singer '10]:
$$O\left( \sum_i \sqrt{\sum_t \nabla_{t,i}^2} \right), \ \text{can be } \sqrt{d} \text{ better than SGD}$$
• Infrequently occurring, or small-scale, features have small influence
on regret (and therefore, on convergence to the optimal parameter)
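A minimal diagonal-AdaGrad sketch in Python. For simplicity it omits the projection onto K in the $G_t$ norm that the slides include; eps is a standard numerical safeguard, and the data is illustrative:

```python
import numpy as np

def adagrad_diag(grad_fns, x0, eta, eps=1e-8):
    """Diagonal AdaGrad. This sketch is unconstrained; the slides'
    version also projects onto K in the G_t norm."""
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)                      # diag of sum of grad outer products
    for grad_t in grad_fns:
        g = grad_t(x)
        s += g * g                            # accumulate squared gradients
        x = x - eta * g / (np.sqrt(s) + eps)  # per-feature step sizes
    return x

# Imbalanced-feature example: the second feature has much smaller-scale
# gradients; AdaGrad's per-feature normalization compensates for it.
rng = np.random.default_rng(4)
x_true = np.array([1.0, -1.0])
A = rng.normal(size=(500, 2)) * np.array([1.0, 0.05])
grad_fns = [lambda x, a=a: 2 * (a @ x - a @ x_true) * a for a in A]
print(adagrad_diag(grad_fns, x0=[0.0, 0.0], eta=0.5))   # approaches x_true
```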
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline/stochastic/online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second order methods,
non-convex optimization