Optimization
1 Motivation
In data analysis, optimization methods are used to fit models to the available data. Model
parameters are chosen by maximizing or minimizing a cost function, such as the likeli-
hood of the data given the parameters or the error achieved by the model on a training
dataset. In these notes we will introduce some of the basic methods that can be applied to
unconstrained optimization problems of the form
min_x f(x). (1)
2 Optimization in 1D
The optimization methods that we will describe in the following sections start from an
arbitrary point and try to make progress towards the minimum of the function of interest
by using the local characteristics of the function, such as the slope or the curvature. This
local information is captured by the derivatives of the function.
Definition 2.1 (Derivative). The derivative of a real-valued function f : R → R at x ∈ R
is defined as
f'(x) := lim_{h→0} (f(x + h) − f(x))/h. (2)
Higher order derivatives are defined recursively,
f''(x) := (f'(x))' , (3)
f'''(x) := (f''(x))' , (4)
i.e. the second derivative of a function is the derivative of the first derivative, the third
derivative is the derivative of the second derivative and so on.
A function is said to be n-times continuously differentiable if all its derivatives up to the nth
exist and are continuous.
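To make Definition 2.1 concrete, here is a minimal numerical sketch in Python with NumPy (the test function and the values of h are illustrative choices, not part of the notes): the difference quotient approaches the true derivative as h shrinks.

import numpy as np

def difference_quotient(f, x, h):
    # Finite-h version of the limit in (2).
    return (f(x + h) - f(x)) / h

f = np.exp            # illustrative test function, with f'(x) = exp(x)
x = 1.0
for h in [1e-1, 1e-3, 1e-5]:
    approx = difference_quotient(f, x, h)
    print(h, approx, abs(approx - np.exp(x)))   # the error shrinks with h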
Figure 1: A function f(y) and its linear approximation fx1(y) at a point x.
Geometrically, the derivative at a point x corresponds to the slope of the line that is tangent
to the curve (y, f (y)) at x. This line is a linear approximation to the function, as
illustrated in Figure 1.
Definition 2.2 (First-order approximation). The first-order or linear approximation of a differentiable function f : R → R at x ∈ R is
fx1(y) := f(x) + f'(x)(y − x). (5)
It follows immediately from the definition of derivative that the linear function fx1 (y) becomes
an arbitrarily good approximation of f as we approach x, even if we divide the error by the
distance to x.
Lemma 2.3. The linear approximation fx1 : R → R at x ∈ R of a differentiable function
f : R → R satisfies
lim_{y→x} (f(y) − fx1(y))/(x − y) = 0. (6)
The value of the derivative provides valuable information about whether the value of the
function is increasing or decreasing.
Lemma 2.4 (Local monotonicity). Let f : R → R be a differentiable function. If f'(x) > 0, there is an ε > 0 such that f(y) > f(x) for all y ∈ (x, x + ε) and f(y) < f(x) for all y ∈ (x − ε, x); if f'(x) < 0, the inequalities are reversed.
Figure 2: Illustration of condition (7) in Definition 2.5. The curve corresponding to the function
must lie below any chord joining two of its points.
In general, information about the local behavior of a function does not necessarily reflect its
global behavior. We could move in a direction in which the function decreases only to find
a local minimum that is much larger than the global minimum of the function. This is not
the case for convex functions.
Definition 2.5 (Convexity). A function is convex if for any x, y ∈ R and any θ ∈ (0, 1),
θf (x) + (1 − θ) f (y) ≥ f (θx + (1 − θ) y) . (7)
The function is strictly convex if the inequality is strict whenever x ≠ y,
θf (x) + (1 − θ) f (y) > f (θx + (1 − θ) y) . (8)
A function f is said to be concave if −f is convex, and strictly concave if −f is strictly
convex.
Condition (7) is illustrated in Figure 2. The curve corresponding to the function must
lie below any chord joining two of its points. This is a general condition that applies to
functions that are not differentiable. The following lemma, proved in Section A.1, provides
an equivalent condition when the function is differentiable: its curve must lie above all of its
linear approximations (for example, the function in Figure 1 is convex).
Lemma 2.6. A differentiable function f is convex if and only if for all x, y ∈ R
f (y) ≥ fx1 (y) (9)
and strictly convex if and only if for all x ≠ y
f(y) > fx1(y). (10)
An immediate corollary is that for a convex function, any point at which the derivative is
zero is a global minimum.
Corollary 2.7. If a differentiable function f is convex and f'(x) = 0, then f(y) ≥ f(x) for every y ∈ R.
This means that if we are trying to minimize a convex function, once we locate a point at
which the derivative of the function is zero we are done!
For functions that are twice-differentiable, convexity is related to the curvature of the func-
tion. Curvature is quantified by the second derivative of the function. The following lemma,
proved in Section A.2 of the appendix, establishes that such functions are convex if and only
if their curvature is always nonnegative.
Lemma 2.8. A 2-times continuously differentiable function f is convex if and only if f''(x) ≥ 0 for all x ∈ R. It is strictly convex if f''(x) > 0 for all x ∈ R (the converse does not hold: f(x) = x^4 is strictly convex even though f''(0) = 0).
For twice-differentiable functions we can also define a quadratic approximation at x, fx2(y) := f(x) + f'(x)(y − x) + (1/2) f''(x)(y − x)^2. This approximation becomes arbitrarily good as we approach x, even if we divide the error by the squared distance between x and y. We omit the proof, which follows from basic calculus.
Lemma 2.10. The quadratic approximation fx2 : R → R at x ∈ R of a twice continuously
differentiable function f : R → R satisfies
lim_{y→x} (f(y) − fx2(y))/(x − y)^2 = 0. (14)
Figure 3: A function f(y) and its quadratic approximation fx2(y) at a point x.
In order to explain how gradient descent works, we will first introduce a 1D-version which
we call derivative descent (this is not a standard term). The idea is to start at an arbitrary
point x0 and move in the direction in which the function decreases.
Algorithm 2.11 (Derivative descent).
Input: A differentiable function f : R → R, its derivative f', a step size α and a stopping threshold ε.
Output: An estimate of the minimum of the function.
1. Set the initial point x0 to an arbitrary value.
2. For i = 1, 2, . . . compute
xi = xi−1 − α f'(xi−1) (15)
until |f'(xi)| ≤ ε.
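The following is a minimal sketch of Algorithm 2.11 in Python. The function names, default values and the test function in the usage example are illustrative choices, not part of the algorithm's specification.

def derivative_descent(f_prime, x0, alpha, eps=1e-6, max_iters=10_000):
    # Derivative descent: repeatedly step against the derivative.
    x = x0
    for _ in range(max_iters):
        g = f_prime(x)
        if abs(g) <= eps:          # stopping condition |f'(x_i)| <= eps
            break
        x = x - alpha * g          # x_i = x_{i-1} - alpha * f'(x_{i-1})
    return x

# Example: f(x) = (x - 3)^2, with f'(x) = 2 (x - 3); the minimum is at x = 3.
print(derivative_descent(lambda x: 2 * (x - 3), x0=0.0, alpha=0.1))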
Note that the size of the step we take in each iteration is proportional to the magnitude of
the derivative at that point. This means that we will make rapid progress in regions where
|f 0 | is large and slower progress when |f 0 | is small. For functions that are locally quadratic,
the derivative gradually becomes smaller as we approach the minimum. For such functions,
taking a step that is proportional to |f 0 | decreases the magnitude of the steps and allows the
algorithm to converge to the minimum. Figure 4 shows the progress of derivative descent
when applied to a quadratic for two values of the constant α. If α is small, we make slower
progress towards the minimum, but if it is very large we might repeatedly overshoot the
minimum.
α=2 α=7
Figure 4: Derivative descent applied to a quadratic function for two different values of α. In both
cases the iterations converge to the minimum.
Derivative descent only uses first-order information about the function. It does not take into
account its curvature. If we know that the function is well approximated by a quadratic,
we could adapt the step size that we take accordingly. This is the idea behind Newton’s
method. The following simple lemma shows that a convex quadratic is minimized in a single step when the step size is matched to its curvature.
Lemma 2.12. Derivative descent finds the global minimum of a convex quadratic
q(x) := (a/2) x^2 + bx + c (16)
if we set
α = 1/q''(x0) = 1/a. (17)
Proof. The global minimum of a quadratic function can be found by setting the derivative
q'(x) = ax + b (18)
to zero, which yields the minimum x∗ = −b/a. The first step of derivative descent with a step size of 1/a always reaches x∗:
x1 = x0 − α q'(x0) (19)
= x0 − (ax0 + b)/a (20)
= x∗. (21)
Corollary 2.13. For a 2-times continuously differentiable function f : R → R with f''(x0) > 0, the minimum of the quadratic approximation of f at x0 is attained at
x1 = x0 − f'(x0)/f''(x0). (22)
Newton's method applies this step repeatedly: starting from an arbitrary initial point x0, for i = 1, 2, . . . compute
xi = xi−1 − f'(xi−1)/f''(xi−1) (23)
until |f'(xi)| ≤ ε.
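A minimal sketch of this iteration in Python follows, with an illustrative convex test function that is not taken from the notes.

def newton_1d(f_prime, f_second, x0, eps=1e-8, max_iters=100):
    # Newton's method in 1D: divide the derivative by the curvature.
    x = x0
    for _ in range(max_iters):
        g = f_prime(x)
        if abs(g) <= eps:
            break
        x = x - g / f_second(x)    # x_i = x_{i-1} - f'(x_{i-1}) / f''(x_{i-1})
    return x

# Example: f(x) = x^4 + x^2 is convex, with minimum at x = 0.
f_prime = lambda x: 4 * x**3 + 2 * x
f_second = lambda x: 12 * x**2 + 2
print(newton_1d(f_prime, f_second, x0=2.0))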
When the function is well approximated by a quadratic, Newton’s method typically
converges more rapidly than derivative descent. As expected, for a quadratic it converges in
one step as shown in Figure 5. Figures 6 and 7 show the result of applying both algorithms
to a convex function that is not a quadratic. Note how the step size of derivative descent
depends on the slope of the function at each point. Figure 7 illustrates how Newton’s method
moves to the minimum of the quadratic approximation of the function in each step.
Both derivative descent and Newton’s method will usually converge for well-behaved con-
vex functions (this statement can be made rigorous, but proofs of convergence for these
algorithms are beyond the scope of these notes). Figure 8 shows an example where they
are applied to a nonconvex function that has local minima. Both algorithms converge to
different local minima depending on the initialization.
Figure 5: When applied to a quadratic function, Newton’s method converges in one step.
α = 0.05 α = 0.2
Figure 6: Derivative descent applied to a convex function for two different values of α. In both
cases the iterations converge to the minimum.
Quadratic approximation
Figure 7: Newton’s method applied to a convex function. The image on the right shows the
quadratic approximations to the function at each iteration.
Figure 8: Derivative descent and Newton’s method applied to a nonconvex function with two
local minima. The algorithms converge to different minima depending on the initialization.
3.1 Gradient, Hessian and convexity
As we saw in the previous section, the derivative quantifies the variation of functions defined
on R. However, when the domain of a function is Rn instead, the function may vary dif-
ferently in different directions. The directional derivative of the function quantifies the
variation in a fixed direction: for a unit-norm vector u ∈ Rn, the directional derivative of f : Rn → R at x ∈ Rn is
fu'(x) := lim_{h→0} (f(x + hu) − f(x))/h. (24)
Higher-order directional derivatives in the same direction are computed recursively,
fu''(x) := lim_{h→0} (fu'(x + hu) − fu'(x))/h. (25)
The partial derivatives of a function are its directional derivatives along the coordinate axes, ∂f(x)/∂xi := fei'(x), where ei is the ith standard basis vector. Higher-order partial derivatives are defined recursively, as in the 1D case.
The gradient is a vector that contains the partial derivatives of the function at a certain
point.
Definition 3.3 (Gradient). The gradient of a function f : Rn → R is
∇f(x) = [ ∂f(x)/∂x1 , ∂f(x)/∂x2 , . . . , ∂f(x)/∂xn ]T. (29)
More intuitively, the gradient encodes the variation of the function in every direction: for any unit-norm vector u ∈ Rn, the directional derivative equals fu'(x) = ∇f(x)T u. We omit the proof of this fact, which uses basic multivariable calculus. An important corollary is that ∇f(x)/||∇f(x)||2 is the direction of maximum positive variation of the function, and −∇f(x)/||∇f(x)||2 the direction of maximum negative variation.
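As a sanity check, the relation fu'(x) = ∇f(x)T u can be verified numerically. The function and the direction in the sketch below are illustrative choices.

import numpy as np

def f(x):
    return x[0]**2 + 3 * x[0] * x[1]      # example: f(x) = x1^2 + 3 x1 x2

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, 2.0])
u = np.array([3.0, 4.0]) / 5.0            # unit-norm direction
h = 1e-6
directional = (f(x + h * u) - f(x)) / h   # finite-difference directional derivative
print(directional, grad_f(x) @ u)         # the two values agree up to O(h)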
Figure 9: Contour lines of a function f : R2 → R. The gradients at different points are represented
by black arrows, which are orthogonal to the contour lines.
The condition for convexity in higher dimensions is remarkably similar to the definition in
1D: the hypersurface (z, f (z)) must stay below any chord between two points (x, f (x)) and
(y, f (y)). For differentiable functions, as in one dimension, an equivalent condition is for
the linear approximation to remain below the hypersurface.
Lemma 3.9. A differentiable function f : Rn → R is convex if and only if for every x, y ∈
Rn
f (y) ≥ f (x) + ∇f (x)T (y − x) . (38)
It is strictly convex if and only if for every x ≠ y
f(y) > f(x) + ∇f(x)T (y − x). (39)
We omit the proof, which is similar to that of the 1D case. An immediate corollary is that
for a convex function, any point at which the gradient is zero is a global minimum and if
the function is strictly convex, the point is the only minimum.
Corollary 3.10. If a differentiable function f is convex and ∇f(x) = 0, then for any y ∈ Rn
f(y) ≥ f(x). (40)
If f is strictly convex, then for any y ≠ x
f(y) > f(x). (41)
The Hessian matrix ∇2f(x) of a function f : Rn → R contains its second-order partial derivatives, [∇2f(x)]ij := ∂²f(x)/∂xi∂xj. If a function has a Hessian matrix at every point, we say that the function is twice differentiable.
The Hessian matrix encodes the curvature of the function in every direction. It follows from
the definition (and some basic multivariable calculus) that
fu''(x) = uT ∇2f(x) u, (43)
i.e. the second directional derivative of the function can be computed from the quadratic
form induced by the Hessian matrix. The following lemma shows how to obtain the local
directions of maximum and minimum curvature of the function.
Positive definite Negative definite Neither
Figure 10: Quadratic forms induced by a positive definite matrix (left), a negative definite matrix
(center) and a matrix that is neither positive nor negative definite (right).
Lemma 3.12. The direction of maximum curvature of f at x is the eigenvector of the Hessian ∇2f(x) associated with its largest eigenvalue, and the direction of minimum curvature is the eigenvector associated with its smallest eigenvalue.

Proof. ∇2f(x) is symmetric, so the lemma follows from Theorem 6.2 in Lecture Notes 9 and (43).
Example 3.13 (Quadratic forms). If all the eigenvalues of a symmetric matrix A ∈ Rn×n are nonnegative, the matrix is said to be positive semidefinite. If they are all positive, the matrix is positive definite. If the eigenvalues are all nonpositive, the matrix is negative semidefinite. If they are all negative, the matrix is negative definite.
The quadratic form
q(x) := xT Ax + bT x + c (44)
induced by a symmetric matrix A has nonnegative curvature in every direction if A is positive semidefinite: the Hessian of the quadratic form equals 2A, so by (43) the curvature in any direction u is uT (2A) u ≥ 0. Similarly, the curvature is always positive if A is positive definite and always negative if A is negative definite.
Figure 10 illustrates this with quadratic forms in two dimensions.
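In practice, one way to classify a symmetric matrix is to inspect the signs of its eigenvalues, for instance with NumPy; the helper below and the example matrices are illustrative.

import numpy as np

def classify(A, tol=1e-10):
    # Classify a symmetric matrix by the signs of its eigenvalues.
    eigvals = np.linalg.eigvalsh(A)
    if np.all(eigvals > tol):
        return "positive definite"
    if np.all(eigvals >= -tol):
        return "positive semidefinite"
    if np.all(eigvals < -tol):
        return "negative definite"
    if np.all(eigvals <= tol):
        return "negative semidefinite"
    return "neither"

print(classify(np.array([[2.0, 0.0], [0.0, 1.0]])))    # positive definite
print(classify(np.array([[1.0, 0.0], [0.0, -2.0]])))   # neither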
Lemma 3.14. A twice-differentiable function f : Rn → R is convex if and only if for every x ∈ Rn, the Hessian matrix ∇2f(x) is positive semidefinite. The function is strictly convex if ∇2f(x) is positive definite for every x ∈ Rn (the converse of this second statement does not hold in general).
The proof of the lemma, which we omit, is similar to that of the 1D case.
As in 1D we can define a quadratic approximation of a twice-differentiable function at each
point.
Definition 3.15 (Second-order approximation). The second-order or quadratic approxima-
tion of f at x is
fx2(y) := f(x) + ∇f(x)T (y − x) + (1/2)(y − x)T ∇2f(x)(y − x). (45)
The quadratic approximation fx2(y) becomes an arbitrarily good approximation of f as we approach x, even if we divide the error by the squared distance between x and y. We omit the proof, which follows from multivariable calculus.
Lemma 3.16. The quadratic approximation fx2 : Rn → R at x ∈ Rn of a twice continuously differentiable function f : Rn → R satisfies
lim_{y→x} (f(y) − fx2(y))/||y − x||_2^2 = 0. (46)
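Lemma 3.16 can be checked numerically on a concrete (illustrative) function: the approximation error divided by the squared distance shrinks as y approaches x.

import numpy as np

f = lambda x: np.exp(x[0]) + x[0] * x[1]**2                 # illustrative function
grad = lambda x: np.array([np.exp(x[0]) + x[1]**2, 2 * x[0] * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 2 * x[1]],
                           [2 * x[1],     2 * x[0]]])

x = np.array([0.5, -1.0])
d = np.array([1.0, 2.0])
for t in [1e-1, 1e-2, 1e-3]:
    y = x + t * d
    quad = f(x) + grad(x) @ (y - x) + 0.5 * (y - x) @ hess(x) @ (y - x)
    print(t, abs(f(y) - quad) / np.linalg.norm(y - x)**2)   # ratio tends to 0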
Gradient descent is arguably the most important optimization algorithm. The idea is simple,
and very similar to the 1D version that we dubbed derivative descent. By Corollary 3.5 the
direction of −∇f is the direction of steepest descent (gradient descent is also called steepest
descent in the literature).
Algorithm 3.17 (Gradient descent).
Input: A differentiable function f : Rn → R, its gradient ∇f, a step size α and a stopping threshold ε.
Output: An estimate of the minimum of the function.
1. Set the initial point x0 to an arbitrary value.
2. For i = 1, 2, . . . compute
xi = xi−1 − α ∇f(xi−1) (47)
until ||∇f(xi)||2 ≤ ε.
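A minimal sketch of Algorithm 3.17 in Python/NumPy follows. The quadratic cost in the usage example and the default parameter values are illustrative choices.

import numpy as np

def gradient_descent(grad_f, x0, alpha, eps=1e-6, max_iters=10_000):
    # Gradient descent: step in the direction of -grad f.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:       # stopping condition ||grad f(x_i)|| <= eps
            break
        x = x - alpha * g                  # x_i = x_{i-1} - alpha * grad f(x_{i-1})
    return x

# Example: f(x) = 0.5 x^T A x + b^T x with A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 1.0])
print(gradient_descent(lambda x: A @ x + b, x0=np.zeros(2), alpha=0.1))
print(np.linalg.solve(A, -b))              # exact minimizer -A^{-1} b, for comparison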
α=1 α=6
Figure 11: Gradient descent applied to a quadratic function for two different values of α. In both
cases the iterations converge to the minimum.
Figure 11 shows the results of applying the method to a quadratic function. As in the 1D case, choosing the step size involves a tradeoff between making slow progress and repeatedly overshooting the minimum.
Newton’s method consists of modifying the step in gradient descent using curvature information. The global minimum x∗ of a convex quadratic form
q(x) := (1/2) xT Ax + bT x + c (48)
is −A−1 b. As a result, starting from any point x0 the minimum x∗ can be found by computing
x1 = x0 − A−1 ∇q(x0) = x0 − A−1 (Ax0 + b) = −A−1 b. (49)
Note that multiplying by the inverse of the Hessian not only changes the magnitude of the
step but also its direction.
Newton’s method iteratively minimizes the quadratic approximation of the function: at each iteration it computes xi = xi−1 − ∇2f(xi−1)−1 ∇f(xi−1).
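A minimal sketch of this iteration, again in Python/NumPy and on an illustrative strictly convex function; the Newton step is computed by solving a linear system instead of forming the inverse Hessian explicitly.

import numpy as np

def newton(grad_f, hess_f, x0, eps=1e-8, max_iters=50):
    # Newton's method: x_i = x_{i-1} - hess^{-1} grad, via a linear solve.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - np.linalg.solve(hess_f(x), g)
    return x

# Example: f(x) = exp(x1) + exp(x2) + 0.5 ||x||^2 is strictly convex.
grad = lambda x: np.exp(x) + x
hess = lambda x: np.diag(np.exp(x)) + np.eye(2)
print(newton(grad, hess, x0=np.array([1.0, -2.0])))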
Figure 12: When applied to a quadratic function, Newton’s method converges in one step.
As expected, when applied to a quadratic function, Newton’s method converges in one step.
This is shown in Figure 12. Figure 13 shows the result of applying gradient descent and
Newton’s method to a convex function f : R2 → R.
Let us assume that our cost function of interest can be decomposed into the sum of m cost functions,
f(x) = (1/m) Σ_{i=1}^{m} fi(x). (51)
An important example is the squared `2 error achieved by a model on a training set. In this
case, each fi is the error for a single example in the training set. The aim of stochastic-
optimization algorithms is to optimize the whole cost function as the different fi are revealed.
Gradient descent Newton’s method
Figure 13: Gradient descent and Newton’s method applied to a convex function. Both algorithms
converge to the minimum.
Intuitively, this allows us to refine the model parameters as more data become available. By far,
the most popular algorithm for stochastic optimization is stochastic gradient descent, which
consists of taking a gradient-descent step for each piece of data that becomes available.
Starting from an arbitrary initial point x0, for i = 1, 2, . . . , m it computes
xi = xi−1 − α ∇fi(xi−1),
i.e. a gradient step that only uses the ith component of the cost function.
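Below is a minimal sketch of stochastic gradient descent for the least-squares example mentioned above, in which each fi is the squared error on a single training example; the synthetic data and parameter values are illustrative.

import numpy as np

def sgd_least_squares(A, y, alpha, n_epochs=20):
    # One gradient step per training example (a_i, y_i), cycling over the data.
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(n_epochs):
        for i in np.random.permutation(m):
            residual = A[i] @ x - y[i]        # error on the i-th example
            x = x - alpha * residual * A[i]   # gradient of f_i(x) = (a_i^T x - y_i)^2 / 2
    return x

# Synthetic data: y = A x_true + small noise.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + 0.01 * rng.standard_normal(200)
print(sgd_least_squares(A, y, alpha=0.01))    # close to x_true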
A Proofs

A.1 Proof of Lemma 2.6
First we show that convexity implies that (9) holds. If f is convex then for any x, y ∈ R and
any 0 ≤ θ ≤ 1
θ (f (y) − f (x)) + f (x) ≥ f (x + θ (y − x)) . (53)
Rearranging the terms we have
f(y) ≥ (f(x + θ(y − x)) − f(x))/θ + f(x). (54)
Setting h = θ (y − x), this implies
f(y) ≥ ((f(x + h) − f(x))/h) (y − x) + f(x). (55)
Taking the limit when h → 0 yields
f(y) ≥ f'(x)(y − x) + f(x) = fx1(y). (56)
To complete the proof, we show that if (9) holds, then the function is convex. Indeed, let
z = θx + (1 − θ) y, then by (9)
f (x) ≥ f 0 (z) (x − z) + f (z) (57)
= f 0 (z) (1 − θ) (x − y) + f (z) (58)
f (y) ≥ f 0 (z) (y − z) + f (z) (59)
= f 0 (z) θ (y − x) + f (z) (60)
Multiplying (58) by θ, then (60) by 1 − θ and summing the inequalities, we obtain
θf (x) + (1 − θ) f (y) ≥ f (θx + (1 − θ) y) . (61)
A.2 Proof of Lemma 2.8

The second derivative is nonnegative everywhere if and only if the first derivative is nondecreasing, because f'' is the derivative of f'.
First we prove that if the function is convex then the derivative is nondecreasing. By
Lemma 2.6, if the function is convex then for any x, y ∈ R such that y > x
f (x) ≥ f 0 (y) (x − y) + f (y) , (62)
f (y) ≥ f 0 (x) (y − x) + f (x) . (63)
Rearranging, we obtain
f'(y)(y − x) ≥ f(y) − f(x) ≥ f'(x)(y − x), (64)
so f'(y) ≥ f'(x) whenever y > x, i.e. the derivative is nondecreasing.
To prove the converse, recall that by the mean value theorem, for any x < y there exists γ ∈ [x, y] such that
f'(γ) = (f(y) − f(x))/(y − x). (65)
For arbitrary x, y, θ ∈ R, such that y > x and 0 < θ < 1, let z = θy + (1 − θ) x. Since
y > z > x, there exist γ1 ∈ [x, z] and γ2 ∈ [z, y] such that
f'(γ1) = (f(z) − f(x))/(z − x), (66)
f'(γ2) = (f(y) − f(z))/(y − z). (67)