ED5340 - Data Science: Theory and Practise
L15 - Optimization for multiple variables
Ramanathan Muthuganapathy (https://ed.iitm.ac.in/~raman)
Course web page: https://ed.iitm.ac.in/~raman/datascience.html
Moodle page: Available at https://courses.iitm.ac.in/
Unconstrained optimization
• Single variable (e.g. min J(w), with J(w) = w², J(w) = w³, or J(w) = w² + 54/w)
• Multivariable (e.g. min J(w1, w2) = (w1 − 2)² + (w2 − 2)²)
• n-dimensional multivariable (e.g. J(w1, w2, …, wn) = (w1 − 2)² + (w2 − 2)² + … + (wn − 2)²)
• min J(w1, w2, …, wn), as in the sketch below
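These example objectives are easy to write down in code; a minimal Python sketch (the function names J_single, J_two and J_n are my own labels, not from the course material):

```python
import numpy as np

def J_single(w):
    return w**2 + 54.0 / w            # single-variable example

def J_two(w1, w2):
    return (w1 - 2)**2 + (w2 - 2)**2  # two-variable example

def J_n(w):
    w = np.asarray(w, dtype=float)    # n-dimensional example
    return np.sum((w - 2)**2)

print(J_two(2, 2), J_n([2, 2, 2, 2]))   # both 0.0, i.e. the minimum value
```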
Surface plot
J(w1, w2) = (w1 − 2)² + (w2 − 2)²
[Surface plot of J(w1, w2)]
Contour plot / level set / height function
J(w1, w2) = (w1 − 2)² + (w2 − 2)²
• Two points on a contour have the same J value
• Imagine cutting the surface with J-planes at different J-values
[Contour plot of J(w1, w2)]
Demo using SrfPlots.py
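SrfPlots.py itself is not reproduced here; the sketch below is an assumed re-creation of such a demo, drawing the surface and contour plots of J(w1, w2) = (w1 − 2)² + (w2 − 2)² with Matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

def J(w1, w2):
    return (w1 - 2)**2 + (w2 - 2)**2

w1, w2 = np.meshgrid(np.linspace(-4, 8, 100), np.linspace(-4, 8, 100))
fig = plt.figure(figsize=(10, 4))

ax1 = fig.add_subplot(1, 2, 1, projection='3d')   # surface plot
ax1.plot_surface(w1, w2, J(w1, w2), cmap='viridis')

ax2 = fig.add_subplot(1, 2, 2)                    # contour plot / level sets
cs = ax2.contour(w1, w2, J(w1, w2), levels=15)
ax2.clabel(cs, inline=True)

plt.show()
```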
Optimality criteria - multiple variables
• min J(w)
• The value of w for which the function J(w) has the least (minimum) value
• Local minimum
Gradient - multiple variables
min J(w1, w2) = (w1 − 2)² + (w2 − 2)² - Partial derivatives
• J(w1, w2) = (w1 − 2)² + (w2 − 2)²
• ∂J/∂w1 - partial derivative of J(w1, w2) with respect to w1
• ∂J/∂w2 - partial derivative of J(w1, w2) with respect to w2
• ∇J(w1, w2) = (∂J/∂w1, ∂J/∂w2), written ∇J(w1, w2) or grad. J
• NOTE: grad. J is a vector (see the sketch below).
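A small sketch of this gradient in Python, with a central-difference check (the helper names and the step size h are my own choices):

```python
import numpy as np

def J(w):
    w1, w2 = w
    return (w1 - 2)**2 + (w2 - 2)**2

def grad_J(w):
    """Analytic gradient: the vector of the two partial derivatives."""
    w1, w2 = w
    return np.array([2.0 * (w1 - 2), 2.0 * (w2 - 2)])

def grad_numeric(f, w, h=1e-6):
    """Numerical gradient by central differences, for checking."""
    w = np.asarray(w, dtype=float)
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

print(grad_J([0.0, 0.0]))            # [-4. -4.]
print(grad_numeric(J, [0.0, 0.0]))   # close to [-4. -4.]
```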
Gradient - multiple variables
What is ∇J(w1, w2) or grad. J? Consider f(x, y, z)
• Surface f(x, y, z) = c
• Any curve (x(t), y(t), z(t)) lying on the surface, so f(x(t), y(t), z(t)) = c
• Differentiating: (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt) + (∂f/∂z)(dz/dt) = 0
• (∂f/∂x, ∂f/∂y, ∂f/∂z) · (dx/dt, dy/dt, dz/dt) = 0
• ∇f · (x′(t), y′(t), z′(t)) = 0 (checked numerically in the sketch below)
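The same identity can be checked numerically for the two-variable J: its level sets are circles about (2, 2), and ∇J at a point of such a circle is perpendicular to the circle's tangent (a small sketch; the level value c and the parameter value t are arbitrary choices):

```python
import numpy as np

# For J(w1, w2) = (w1-2)^2 + (w2-2)^2, the level set J = c is a circle
# of radius sqrt(c) centred at (2, 2).
c = 4.0
t = 0.7                                    # arbitrary parameter value
w = np.array([2 + np.sqrt(c) * np.cos(t),  # point on the level curve J = c
              2 + np.sqrt(c) * np.sin(t)])
tangent = np.array([-np.sqrt(c) * np.sin(t),   # derivative of the parametrisation
                     np.sqrt(c) * np.cos(t)])
grad = np.array([2 * (w[0] - 2), 2 * (w[1] - 2)])

print(np.dot(grad, tangent))   # ~0: the gradient is normal to the level curve
```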
Gradient - multiple variables
What is ∇J(w1, w2) or grad. J? Consider f(x, y, z)
• ∇f is the grad. f and (x′(t), y′(t), z′(t)) is the tangent vector.
Gradient - multiple variables
What is ∇J(w1, w2) or grad. J? Consider f(x, y, z)
• Take another curve (blue)
• ∇f · (x′(t), y′(t), z′(t)) = 0
• Dot product
• ∇f is perpendicular to the set of tangents at that point.
Gradient - multiple variables
What is ∇J(w1, w2) or grad. J? Consider f(x, y, z)
• ∇f · (x′(t), y′(t), z′(t)) = 0
• Dot product
• ∇f is perpendicular to the set of tangents at that point.
• ∇f is the normal vector at a point.
Gradient at a point
What is ∇J(w1, w2) or grad. J? - Back to our notation
• ∇J(w1, w2) is a normal vector
Hessian Matrix
min J(w1, w2) = (w1 − 2)² + (w2 − 2)² - Second partial derivatives
• ∂²J/∂w1² = ∂/∂w1 (∂J/∂w1)
• ∂²J/∂w2² = ∂/∂w2 (∂J/∂w2)
• ∂²J/∂w1∂w2 = ∂/∂w1 (∂J/∂w2)
• ∂²J/∂w2∂w1 = ∂/∂w2 (∂J/∂w1), computed numerically in the sketch below
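The four second partials can also be approximated numerically; a minimal finite-difference sketch (the helper name and step size h are my own choices):

```python
import numpy as np

def J(w):
    w1, w2 = w
    return (w1 - 2)**2 + (w2 - 2)**2

def hessian_numeric(f, w, h=1e-4):
    """Matrix of second partial derivatives by central differences."""
    w = np.asarray(w, dtype=float)
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * h * h)
    return H

print(hessian_numeric(J, [2.0, 2.0]))   # approximately [[2, 0], [0, 2]]
```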
Hessian Matrix
Matrix of second partial derivatives
H = [ ∂²J/∂w1²     ∂²J/∂w1∂w2
      ∂²J/∂w2∂w1   ∂²J/∂w2²   ]
Optimality Criteria for Multiple Variables
min J(w1, w2) = (w1 − 2)² + (w2 − 2)²
• ∇J(w1, w2) = 0; get w* = (w1*, w2*)
• The Hessian H should be positive definite at w* for a minimum
• The Hessian H should be negative definite at w* for a maximum
• At a saddle point, H is indefinite
Optimality Criteria for Multiple Variables
How to find the type for H? (Use LA)
• H is positive definite if all the eigenvalues are > 0 (all λi > 0)
• H is negative definite if all the eigenvalues are < 0 (all λi < 0)
• H is indefinite if some eigenvalues are > 0 and some are < 0 (see the sketch below)
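A short eigenvalue check in NumPy corresponding to these rules (a sketch; the function name and tolerance are my own choices):

```python
import numpy as np

def classify_hessian(H, tol=1e-10):
    """Classify a symmetric Hessian by its eigenvalues."""
    eigvals = np.linalg.eigvalsh(H)      # H is symmetric, so eigvalsh applies
    if np.all(eigvals > tol):
        return eigvals, "positive definite (local minimum)"
    if np.all(eigvals < -tol):
        return eigvals, "negative definite (local maximum)"
    return eigvals, "indefinite (saddle point) or semi-definite"

print(classify_hessian(np.array([[2.0, 0.0], [0.0, 2.0]])))   # example Hessian
print(classify_hessian(np.array([[2.0, 0.0], [0.0, -2.0]])))  # e.g. for w1^2 - w2^2
```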
Example
min J(w1, w2) = (w1 − 2)² + (w2 − 2)²
• ∇J(w1, w2) = 0; get w* = (w1*, w2*)
• ∂J/∂w1 = 2(w1 − 2)
• ∂J/∂w2 = 2(w2 − 2)
• Critical point: w* = (w1*, w2*) = (2, 2), verified symbolically in the sketch below
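The same critical point can be obtained symbolically, for example with SymPy (a minimal sketch, assuming SymPy is available):

```python
import sympy as sp

w1, w2 = sp.symbols('w1 w2')
J = (w1 - 2)**2 + (w2 - 2)**2

grad = [sp.diff(J, w1), sp.diff(J, w2)]   # [2*w1 - 4, 2*w2 - 4]
critical = sp.solve(grad, [w1, w2])       # {w1: 2, w2: 2}
print(grad, critical)
```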
Example
Compute Hessian at (2, 2)
• ∂²J/∂w1² = 2
• ∂²J/∂w2² = 2
• ∂²J/∂w1∂w2 = 0
• ∂²J/∂w2∂w1 = 0
Hessian Matrix
Matrix of second partial derivatives
H = [ 2  0
      0  2 ]

Eigenvalues?

H is then _____________ definite and hence the point
w* = (w1*, w2*) = (2, 2) is a local _________________
CW: Do a similar exercise for
J(w1, w2) = w1² − w2²
Unidirectional search
J(w1, w2) = (w1 − 2)² + (w2 − 2)²
• Starting point w^s = (w1^s, w2^s) = (−4, −4)
• Search direction S (a vector)
• w* = w^s + αS
• Bracketing method to find α
• Fine tuning with interval halving (or golden section search etc.), as in the sketch below
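A minimal sketch of one unidirectional search along a fixed direction S (the bracketing step size, tolerance and helper names are my own choices; golden-section search is used here for the fine-tuning stage):

```python
import numpy as np

def J(w):
    return (w[0] - 2)**2 + (w[1] - 2)**2

def phi(alpha, w_s, S):
    """J restricted to the line w_s + alpha * S (a one-variable function)."""
    return J(w_s + alpha * S)

def bracket_alpha(w_s, S, step=0.1):
    """Crude bracketing: keep doubling until phi starts increasing."""
    a, b = 0.0, step
    while phi(b, w_s, S) < phi(a, w_s, S):
        a, b = b, 2 * b
    return 0.0, b                # the minimum along S lies within [0, b]

def golden_section(w_s, S, a, b, tol=1e-6):
    """Refine the bracket with golden-section search."""
    gr = (np.sqrt(5) - 1) / 2
    c, d = b - gr * (b - a), a + gr * (b - a)
    while abs(b - a) > tol:
        if phi(c, w_s, S) < phi(d, w_s, S):
            b = d
        else:
            a = c
        c, d = b - gr * (b - a), a + gr * (b - a)
    return (a + b) / 2

w_s = np.array([-4.0, -4.0])     # starting point from the slide
S = np.array([1.0, 1.0])         # an assumed search direction
a, b = bracket_alpha(w_s, S)
alpha = golden_section(w_s, S, a, b)
print(alpha, w_s + alpha * S)    # alpha ~ 6, landing at (2, 2) for this direction
```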
Unidirectional search - Issues
J(w1, w2) = (w1 − 2)² + (w2 − 2)²
• Starting point
• Search direction S (a vector)
Gradient at a point
What is ∇J(w1, w2) or grad. J? - Back to our notation
• ∇J(w1, w2) is a normal vector
Gradient at a point
Traveling along grad. J
• If you travel along the direction of the grad. J, what happens to J?
Gradient descent
Traveling along -grad. J or − ∇J(w1, w2)
• We should travel along − ∇J
Potential directions and steepest descent
Traveling along -grad. J or − ∇J(w1, w2)
• Let d be such that ∇J · d is -ve (a descent direction)
• Let d be such that d = − ∇J
• Among unit directions, ∇J · d = ||∇J|| cos θ is most negative (cos θ = − 1) when d points along − ∇J
• Hence − ∇J is the steepest descent direction!
• Steepest (Cauchy's) Gradient Descent
[Figure: a descent direction d, with ∇J and − ∇J at the point]
Algorithm - Gradient descent
Traveling along -grad. J or − ∇J(w1, w2)
• Starting point w* = (w1*, w2*)
• Compute J and − ∇J at w*_k = w*
• Update the w's
• w*_{k+1} = w*_k − αk ∇J
• Check the stopping criteria
• Else continue the iteration (see the sketch below)
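Putting the steps together, a minimal sketch of steepest descent for this J with a constant step length (the function names and the values of alpha, eps1 and max_iter are my own choices):

```python
import numpy as np

def J(w):
    return (w[0] - 2)**2 + (w[1] - 2)**2

def grad_J(w):
    return np.array([2.0 * (w[0] - 2), 2.0 * (w[1] - 2)])

def steepest_descent(w0, alpha=0.1, eps1=1e-6, max_iter=100):
    """Steepest (Cauchy's) gradient descent with a constant step length."""
    w = np.asarray(w0, dtype=float)
    for k in range(max_iter):
        g = grad_J(w)
        if np.linalg.norm(g) <= eps1:      # stopping criterion 1
            break
        w = w - alpha * g                  # update: w_{k+1} = w_k - alpha_k * grad J
    return w, k

w_star, iters = steepest_descent([-4.0, -4.0])
print(w_star, iters)    # converges towards (2, 2)
```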
Algorithm - Update step
w*_{k+1} = w*_k − αk ∇J
• Update the w's:
• w1^{k+1} = w1^k − αk ∂J/∂w1
• w2^{k+1} = w2^k − αk ∂J/∂w2
• Compute J and − ∇J at w*_{k+1}
• Finding αk:
• Unidirectional search (or)
• Make it a constant (the learning rate in ML)
Algorithm - Stopping criteria
1. if ||∇J(w*_k)|| ≤ ϵ1
2. if |∇J(w*_{k+1}) · ∇J(w*_k)| ≤ ϵ2
3. if ||w*_{k+1} − w*_k|| / ||w*_k|| ≤ ϵ1
4. if the number of iterations exceeds a predefined constant (k > 100, say)
• NOTE: Compute 1 or 4 before the update and 2 or 3 after (helpers for 2 and 3 are sketched below)
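Criteria 2 and 3, which are checked after an update, can be written as small helper functions (a sketch; the tolerances and names are my own choices):

```python
import numpy as np

def grad_alignment_small(grad_new, grad_old, eps2=1e-8):
    """Criterion 2: successive gradients have (near-)zero dot product."""
    return abs(np.dot(grad_new, grad_old)) <= eps2

def relative_change_small(w_new, w_old, eps1=1e-6):
    """Criterion 3: relative change in the iterate is small."""
    return np.linalg.norm(w_new - w_old) / np.linalg.norm(w_old) <= eps1
```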
Himmelblau function
J(w1, w2) = (w1² + w2 − 11)² + (w1 + w2² − 7)²
[Surface/contour plot of the Himmelblau function]
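The same steepest-descent sketch can be run on the Himmelblau function, which has four local minima. The analytic gradient below follows from the definition above; the start point and step length are my own choices (with these values the iteration settles near one of the minima, around (3, 2)):

```python
import numpy as np

def J(w):
    w1, w2 = w
    return (w1**2 + w2 - 11)**2 + (w1 + w2**2 - 7)**2

def grad_J(w):
    w1, w2 = w
    return np.array([4 * w1 * (w1**2 + w2 - 11) + 2 * (w1 + w2**2 - 7),
                     2 * (w1**2 + w2 - 11) + 4 * w2 * (w1 + w2**2 - 7)])

w = np.array([0.0, 0.0])   # illustrative starting point
alpha = 0.01               # small constant step length
for k in range(1000):
    g = grad_J(w)
    if np.linalg.norm(g) <= 1e-6:
        break
    w = w - alpha * g

print(w, J(w))   # lands on one of Himmelblau's four minima, e.g. near (3, 2)
```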
Recap - Single vs Multiple
w*_{k+1} = w*_k − αk ∇J
[Figure: the update on a single-variable J(w) (axis w) alongside the two-variable update in the (w1, w2) plane, moving along − ∇J]
Optimization strategies
Variations in (steepest) gradient descent
• Constant step length α
• Adaptive step length (αk) using line search
• Stochastic gradient descent (see the sketch below)
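A minimal sketch of stochastic gradient descent, assuming an objective that is a mean over samples, J(w) = (1/N) Σᵢ (xᵢ·w − yᵢ)², so each step can use the gradient of one randomly chosen sample; the data and settings below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # synthetic inputs
w_true = np.array([2.0, 2.0])
y = X @ w_true + 0.01 * rng.normal(size=200)

w = np.zeros(2)
alpha = 0.05                             # constant learning rate (my choice)
for k in range(2000):
    i = rng.integers(len(y))             # pick one sample at random
    g = 2 * (X[i] @ w - y[i]) * X[i]     # gradient of that sample's squared error
    w = w - alpha * g

print(w)   # close to w_true = (2, 2)
```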
Line Search
[w1* w2*] = [w1^s w2^s] + αS
[w1* w2*] = [w1^s w2^s] + α(− ∇J)
[Figure: steepest gradient descent from [w1^s w2^s] along − ∇J, with the line search sampling α between α = 0 and α = 10 (e.g. in steps of α = 0.01)]
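A short sketch of the line search in this picture, evaluating J along − ∇J over a grid of α values (the grid limits and spacing follow the figure; for this quadratic J the best α along the line is 0.5):

```python
import numpy as np

def J(w):
    return (w[0] - 2)**2 + (w[1] - 2)**2

def grad_J(w):
    return np.array([2.0 * (w[0] - 2), 2.0 * (w[1] - 2)])

# Sample phi(alpha) = J(w_s + alpha * (-grad J)) and keep the best alpha.
w_s = np.array([-4.0, -4.0])
S = -grad_J(w_s)
alphas = np.arange(0.0, 10.0, 0.01)
phi = np.array([J(w_s + a * S) for a in alphas])
best = alphas[np.argmin(phi)]
print(best, w_s + best * S)   # alpha = 0.5 jumps straight to (2, 2) for this J
```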