Chap01 Introduction
Prerequisites: no formal ones, but the class will be fairly fast paced. If you fall short anywhere, it's certainly possible to catch up; don't hesitate to talk to us
Evaluation:
• 5 homeworks
• 2 little tests
• 1 project (can enroll for 9 units with no project)
• Many easy quizzes
Scribing: sign up to scribe one lecture per semester, on the course
website (multiple scribes per lecture). Can bump up your grade in
boundary cases
Optimization problems are ubiquitous in Statistics and
Machine Learning
Presumably, other people have already figured out how to solve
$$P: \quad \min_{x \in D} \; f(x)$$
So why bother?
Example: algorithms for the 2d fused lasso
The 2d fused lasso or 2d total variation denoising problem is:
$$\min_{\theta} \; \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \sum_{(i,j) \in E} |\theta_i - \theta_j|$$
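Before turning to specialized algorithms, note that a generic modeling tool can already prototype this problem. Below is a minimal sketch using CVXPY on an image grid, where the edge set E joins horizontally and vertically adjacent pixels; the grid size, noise level, and value of λ are made-up illustration choices, not taken from the slides.

```python
import numpy as np
import cvxpy as cp

# Toy 2d fused lasso: y is a noisy m x m image, and E joins each pixel
# to its horizontal and vertical neighbors (illustration values only).
rng = np.random.default_rng(0)
m = 20
truth = np.zeros((m, m))
truth[5:15, 5:15] = 1.0                        # a square bump
y = truth + 0.3 * rng.standard_normal((m, m))

lam = 1.0
theta = cp.Variable((m, m))
fit = 0.5 * cp.sum_squares(y - theta)
# sum of |theta_i - theta_j| over all neighboring pairs (i, j) in E
tv = (cp.sum(cp.abs(theta[1:, :] - theta[:-1, :])) +
      cp.sum(cp.abs(theta[:, 1:] - theta[:, :-1])))
cp.Problem(cp.Minimize(fit + lam * tv)).solve()
theta_hat = theta.value                        # denoised image
```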
[Figures: solutions to the 2d fused lasso problem, shown over a sequence of slides]
What’s the message here?
Example: testing changepoints from the 1d fused lasso
In the 1d fused lasso or 1d total variation denoising problem,
$$\min_{\theta} \; \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \sum_{i=1}^{n-1} |\theta_i - \theta_{i+1}|$$
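As with the 2d problem, a few lines of CVXPY suffice to prototype this. A minimal sketch on a made-up piecewise-constant signal (the data and λ below are illustrative, not the data set shown in the figures):

```python
import numpy as np
import cvxpy as cp

# Toy 1d fused lasso / total variation denoising (illustrative data).
rng = np.random.default_rng(0)
n = 100
signal = np.concatenate([np.ones(40), np.zeros(30), 0.7 * np.ones(30)])
y = signal + 0.2 * rng.standard_normal(n)

lam = 0.41
theta = cp.Variable(n)
obj = 0.5 * cp.sum_squares(y - theta) + lam * cp.sum(cp.abs(theta[1:] - theta[:-1]))
cp.Problem(cp.Minimize(obj)).solve()

# Changepoints are the locations where the piecewise-constant solution jumps
changepoints = np.nonzero(np.abs(np.diff(theta.value)) > 1e-6)[0] + 1
```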
[Figure: three 1d fused lasso solutions, at λ = 25, λ = 0.62, and λ = 0.41]
Let’s look at the solution at λ = 0.41 a little more closely
How can we test the significance of detected changepoints? Say, at location 11?
Idea: take the average of the data in region A minus the average in region B, and compare this to what we would expect if the signal were flat.
[Figure: the solution at λ = 0.41, plotted over locations 0 to 100, with regions A, B, C marked]
[Figures: two data sets simulated from a flat signal, with the same regions A, B, C marked]
But it took 1222 simulated data sets to get one reference data set!
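Here is a rough sketch of why the naive approach is so costly: we repeatedly draw data from a flat (null) signal and discard every draw in which the fused lasso fails to put a changepoint at the location under test. The helper function, noise model, and jump-detection threshold below are assumptions for illustration.

```python
import numpy as np
import cvxpy as cp

def fused_lasso_1d(y, lam):
    """Solve the 1d fused lasso with a generic solver (illustration only)."""
    theta = cp.Variable(len(y))
    obj = 0.5 * cp.sum_squares(y - theta) + lam * cp.sum(cp.abs(theta[1:] - theta[:-1]))
    cp.Problem(cp.Minimize(obj)).solve()
    return theta.value

rng = np.random.default_rng(1)
n, lam, loc = 100, 0.41, 11
tries = 0
while True:
    tries += 1
    y_null = rng.standard_normal(n)      # data from a flat (null) signal
    jumps = np.nonzero(np.abs(np.diff(fused_lasso_1d(y_null, lam))) > 1e-6)[0] + 1
    if loc in jumps:                     # accept only if a changepoint lands at `loc`
        break
print(f"{tries} simulated data sets for one reference data set")
```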
The role of optimization: if we understand the 1d fused lasso, i.e.,
the way it selects changepoints (stems from KKT conditions), then
we can come up with a reference distribution without simulating
[Figure: the solution at λ = 0.41, with p-value = 0.359 for the changepoint at location 11]
This leads to rigorous significance tests (Hyun et al., 2016).
Central concept: convexity
Convex sets and functions

Convex set: $C \subseteq \mathbb{R}^n$ such that
$$x, y \in C \;\Longrightarrow\; tx + (1 - t)y \in C \quad \text{for all } 0 \le t \le 1$$

Convex function: $f : \mathbb{R}^n \to \mathbb{R}$ such that $\mathrm{dom}(f) \subseteq \mathbb{R}^n$ is convex, and
$$f(tx + (1 - t)y) \le t\,f(x) + (1 - t)\,f(y) \quad \text{for } 0 \le t \le 1 \text{ and all } x, y \in \mathrm{dom}(f)$$

[Figures: simple convex and nonconvex sets; the graph of a convex function, where the chord between $(x, f(x))$ and $(y, f(y))$ lies above the graph]
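The definition can be sanity-checked numerically by sampling random chords. The sketch below uses log-sum-exp, a standard convex function, as f; the dimension, number of samples, and tolerance are arbitrary choices.

```python
import numpy as np

def f(x):
    # log-sum-exp, a standard convex function on R^n
    return np.log(np.sum(np.exp(x)))

rng = np.random.default_rng(0)
for _ in range(10_000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    t = rng.uniform()
    # f(tx + (1-t)y) <= t f(x) + (1-t) f(y), up to floating-point slack
    assert f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-9
print("convexity inequality held on all sampled chords")
```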
Convex optimization problems

Optimization problem:
$$\begin{aligned}
\min_{x \in D} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \le 0, \; i = 1, \dots, m \\
& h_j(x) = 0, \; j = 1, \dots, r
\end{aligned}$$

Here $D = \mathrm{dom}(f) \cap \bigcap_{i=1}^{m} \mathrm{dom}(g_i) \cap \bigcap_{j=1}^{r} \mathrm{dom}(h_j)$, the common domain of all the functions.

This is a convex optimization problem provided $f$ and $g_i$, $i = 1, \dots, m$ are convex, and $h_j$, $j = 1, \dots, r$ are affine:
$$h_j(x) = a_j^T x + b_j, \quad j = 1, \dots, r$$
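This standard form maps directly onto a modeling language. Below is a minimal CVXPY sketch with one convex inequality constraint and affine equality constraints; the particular f, g_1, A, and b are made up for illustration.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r = 5, 2
A = rng.standard_normal((r, n))    # rows a_j^T of the affine constraints
b = rng.standard_normal(r)         # offsets b_j

x = cp.Variable(n)
f = cp.sum_squares(x - 1)          # convex objective f(x)
constraints = [
    cp.norm(x, 1) - 3 <= 0,        # g_1(x) <= 0, convex
    A @ x + b == 0,                # h_j(x) = a_j^T x + b_j = 0, affine
]
cp.Problem(cp.Minimize(f), constraints).solve()
```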
Local minima are global minima

For convex optimization problems, local minima are global minima. Formally, if $x$ is feasible ($x \in D$, and satisfies all constraints) and minimizes $f$ in a local neighborhood,
$$f(x) \le f(y) \quad \text{for all feasible } y \text{ with } \|x - y\|_2 \le \rho,$$
then
$$f(x) \le f(y) \quad \text{for all feasible } y$$
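The proof is a one-line application of convexity and is worth seeing once; here is a sketch in the notation above.

```latex
% Proof sketch: local minima are global minima for convex problems.
\begin{aligned}
&\text{Take any feasible } y \text{, and let } z = (1 - t)\,x + t\,y \text{ for } t \in (0, 1] \\
&\text{small enough that } \|x - z\|_2 \le \rho. \text{ Then } z \text{ is feasible, since the} \\
&\text{feasible set of a convex problem is convex, and} \\
&\qquad f(x) \le f(z) \le (1 - t)\,f(x) + t\,f(y), \\
&\text{using local optimality and then convexity of } f.
 \text{ Rearranging gives } f(x) \le f(y).
\end{aligned}
```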
This is a very useful fact and will save us a lot of trouble!
[Figure: example convex and nonconvex functions]