
Introduction: Why Optimization?

Lecturer: Ryan Tibshirani


Convex Optimization 10-725/36-725
Course setup

Welcome to the course on Convex Optimization, with a focus on its ties to Statistics and Machine Learning!

Basic administrative details:


• Instructors: Javier Peña, Ryan Tibshirani
• Teaching assistants: Alnur Ali, Christoph Dann, Sangwon
Hyun, Mariya Toneva, Han Zhao
• Course website:
http://www.stat.cmu.edu/~ryantibs/convexopt/
• We will use Piazza for announcements and discussions
• We will use Blackboard just as a gradebook

Prerequisites: no formal ones, but class will be fairly fast-paced

Assume working knowledge of/proficiency with:


• Real analysis, calculus, linear algebra
• Core problems in Stats/ML
• Programming (Matlab, Python, R ...)
• Data structures, computational complexity
• Formal mathematical thinking

If you fall short on any one of these things, it’s certainly possible to
catch up; but don’t hesitate to talk to us

Evaluation:
• 5 homeworks
• 2 little tests
• 1 project (can enroll for 9 units with no project)
• Many easy quizzes

Project: something useful/interesting with optimization. Groups of 2 or 3, milestones throughout the semester, details to come

Quizzes: due at midnight the day of each lecture. Should be very short, very easy if you've attended lecture ...
Scribing: sign up to scribe one lecture per semester, on the course
website (multiple scribes per lecture). Can bump up your grade in
boundary cases

Lecture videos: see links on course website. These are supposed to be helpful supplements, not replacements! Best to attend lectures

Auditors: welcome, please audit rather than just sitting in

Most important: work hard and have fun!

Optimization problems are ubiquitous in Statistics and
Machine Learning

Optimization problems underlie most everything we do in Statistics and Machine Learning. In many courses, you learn how to translate a conceptual idea into an optimization problem

$$P : \ \min_{x \in D} f(x)$$

Examples of this? Examples of the contrary?

This course: how to solve P, and also why this is important
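As one concrete illustration (my example, not spelled out on this slide): the conceptual idea "estimate a sparse linear model from data $(X, y)$" translates into the lasso problem

$$P_{\text{lasso}} : \ \min_{\beta \in \mathbb{R}^p} \ \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1,$$

an instance of $P$ with decision variable $x = \beta$, domain $D = \mathbb{R}^p$, and $f$ the penalized least squares criterion.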

Presumably, other people have already figured out how to solve

$$P : \ \min_{x \in D} f(x)$$

So why bother?

Many reasons. Here are two:


1. Different algorithms can perform better or worse for different
problems P (sometimes drastically so)
2. Studying P can actually give you a deeper understanding of
the statistical procedure in question

Optimization is a very active field. It moves quickly, but there is still much room for progress, especially at the intersection with Statistics and ML

Example: algorithms for the 2d fused lasso
The 2d fused lasso or 2d total variation denoising problem is:
$$\min_{\theta} \ \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \sum_{(i,j) \in E} |\theta_i - \theta_j|$$

This fits a piecewise constant function over an image, given data $y_i$, $i = 1, \ldots, n$, at the pixels; here $E$ is the set of edges joining neighboring pixels in the 2d grid
[Figure: three image panels showing the true image, the noisy data, and the 2d fused lasso solution]
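For concreteness, here is a minimal sketch of how this problem could be posed with a generic modeling tool (CVXPY) on a small synthetic grid. This is my illustration, not code from the course; the grid size, noise level, and $\lambda$ are arbitrary assumptions, and the specialized algorithms on the next slides are what you would actually want at realistic image sizes.

    # Minimal sketch (assumptions: NumPy, SciPy, CVXPY available; tiny 20x20
    # image, lambda = 1). A generic solver, not the course's reference code.
    import numpy as np
    import scipy.sparse as sp
    import cvxpy as cp

    nrow, ncol = 20, 20                       # n = nrow * ncol pixels
    rng = np.random.default_rng(0)
    truth = np.zeros((nrow, ncol))
    truth[5:15, 5:15] = 1.0                   # piecewise constant "true image"
    y = (truth + 0.5 * rng.standard_normal((nrow, ncol))).ravel()

    def grid_edges(nrow, ncol):
        """Edge set E of the 2d grid graph: each pixel and its right/down neighbor."""
        idx = np.arange(nrow * ncol).reshape(nrow, ncol)
        horiz = np.column_stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()])
        vert = np.column_stack([idx[:-1, :].ravel(), idx[1:, :].ravel()])
        return np.vstack([horiz, vert])

    edges = grid_edges(nrow, ncol)
    m = edges.shape[0]
    # Sparse difference operator D: row k is +1 at edges[k, 0], -1 at edges[k, 1]
    D = sp.csr_matrix((np.tile([1.0, -1.0], m),
                       (np.repeat(np.arange(m), 2), edges.ravel())),
                      shape=(m, nrow * ncol))

    lam = 1.0
    theta = cp.Variable(nrow * ncol)
    objective = 0.5 * cp.sum_squares(y - theta) + lam * cp.norm1(D @ theta)
    cp.Problem(cp.Minimize(objective)).solve()
    theta_hat = theta.value.reshape(nrow, ncol)   # piecewise constant estimate

The objective here matches the display above, with the sum over $E$ written as $\|D\theta\|_1$ for a sparse difference matrix $D$.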
Our problem: $$\min_{\theta} \ \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \sum_{(i,j) \in E} |\theta_i - \theta_j|$$

[Figure: estimates produced by three algorithms, each after a fixed amount of computation]

• Specialized ADMM, 20 iterations
• Proximal gradient descent, 1000 iterations
• Coordinate descent, 10K cycles

(Last two from the dual)
What’s the message here?

So what’s the right conclusion here?

Is the alternating direction method of multipliers (ADMM) simply a better method than proximal gradient descent or coordinate descent? ... No

In fact, different algorithms will work better in different situations. We'll learn the details throughout the course

In the 2d fused lasso problem:


• Specialized ADMM: fast (structured subproblems)
• Proximal gradient: slow (poor conditioning)
• Coordinate descent: slow (large active set)

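To make one of the algorithms named above concrete, here is a minimal sketch of generic proximal gradient descent, illustrated on the lasso, whose prox is elementwise soft-thresholding. This is my own illustration, not course code; the fixed step size $1/L$ and the iteration budget are assumptions.

    # Sketch of proximal gradient descent for  min_beta (1/2)||y - X beta||_2^2
    # + lam * ||beta||_1.  Assumed: fixed step 1/L, fixed iteration count.
    import numpy as np

    def soft_threshold(z, t):
        """Prox of t * ||.||_1: elementwise soft-thresholding."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def prox_gradient_lasso(X, y, lam, num_iters=1000):
        n, p = X.shape
        beta = np.zeros(p)
        L = np.linalg.eigvalsh(X.T @ X).max()    # Lipschitz constant of the gradient
        step = 1.0 / L
        for _ in range(num_iters):
            grad = X.T @ (X @ beta - y)          # gradient step on the smooth part
            beta = soft_threshold(beta - step * grad, step * lam)   # prox step
        return beta

    # Tiny usage example on synthetic data
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    beta_true = np.zeros(20)
    beta_true[:3] = [2.0, -1.0, 1.5]
    y = X @ beta_true + 0.1 * rng.standard_normal(100)
    beta_hat = prox_gradient_lasso(X, y, lam=5.0)

The same two-step pattern, a gradient step on the smooth part followed by a prox step on the nonsmooth part, is what "proximal gradient descent" refers to in the comparison above.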
Example: testing changepoints from the 1d fused lasso
In the 1d fused lasso or 1d total variation denoising problem
$$\min_{\theta} \ \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \sum_{i=1}^{n-1} |\theta_i - \theta_{i+1}|$$

the parameter $\lambda \geq 0$ is called a tuning parameter. As $\lambda$ decreases, we see more changepoints in the solution $\hat{\theta}$
[Figure: three panels showing the data and the 1d fused lasso solution at $\lambda = 25$, $\lambda = 0.62$, and $\lambda = 0.41$; smaller $\lambda$ gives more changepoints]
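To see this effect numerically, here is a small sketch (my illustration, not from the slides) that solves the 1d fused lasso with a generic solver at a few values of $\lambda$ and counts changepoints in $\hat{\theta}$; the synthetic signal, noise level, and the tolerance used to declare a changepoint are all assumptions.

    # Minimal sketch (assumptions: NumPy and CVXPY available; a changepoint is
    # declared wherever consecutive fitted values differ by more than a tolerance).
    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(1)
    n = 100
    signal = np.concatenate([np.zeros(10), np.ones(40), np.zeros(50)])
    y = signal + 0.25 * rng.standard_normal(n)

    def fused_lasso_1d(y, lam):
        theta = cp.Variable(len(y))
        obj = 0.5 * cp.sum_squares(y - theta) + lam * cp.norm1(cp.diff(theta))
        cp.Problem(cp.Minimize(obj)).solve()
        return theta.value

    for lam in [25.0, 0.62, 0.41]:
        theta_hat = fused_lasso_1d(y, lam)
        n_changepoints = int(np.sum(np.abs(np.diff(theta_hat)) > 1e-4))
        print(f"lambda = {lam:5.2f}: {n_changepoints} changepoints")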
Let's look at the solution at $\lambda = 0.41$ a little more closely

[Figure: data and fitted solution, with the segments between detected changepoints labeled A, B, and C]

How can we test the significance of detected changepoints? Say, at location 11?

Classically: take the average of data points in region A minus the average in B, compare this to what we expect if the signal was flat

But this is incorrect, because location 11 was selected based on the data, so of course the difference in averages looks high!
What we want to do: compare our observed difference to that in
reference (null) data, in which the signal was flat and we happen
to select the same location 11 (and 50)
[Figure: two panels, the observed data (abs. difference ≈ 0.088) and a reference data set (abs. difference ≈ 0.072), each divided into regions A, B, C at the selected changepoints]
But it took 1222 simulated data sets to get one reference data set!
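A naive way to produce such reference data is rejection sampling: simulate flat (null) signals, fit the 1d fused lasso to each, and keep only the draws that happen to select the same changepoint locations. The sketch below is my illustration of that idea, not the authors' code; the noise level, $\lambda$, the target changepoint set, and the cap on attempts are all assumptions. Its inefficiency is exactly the point of the 1222-to-1 count above.

    # Sketch of naive rejection sampling for reference (null) data sets.
    # Assumptions: flat null signal with known noise level, fixed lambda, and
    # changepoints read off wherever consecutive fitted values differ by > tol.
    import numpy as np
    import cvxpy as cp

    def fused_lasso_1d(y, lam):
        theta = cp.Variable(len(y))
        obj = 0.5 * cp.sum_squares(y - theta) + lam * cp.norm1(cp.diff(theta))
        cp.Problem(cp.Minimize(obj)).solve()
        return theta.value

    def changepoints(theta_hat, tol=1e-4):
        return set(np.flatnonzero(np.abs(np.diff(theta_hat)) > tol))

    n, sigma, lam = 100, 0.25, 0.41
    target = {10, 49}        # hypothetical locations selected in the observed data
    rng = np.random.default_rng(2)

    accepted, tries = [], 0
    while len(accepted) < 1 and tries < 2000:      # cap the number of attempts
        tries += 1
        y_null = sigma * rng.standard_normal(n)    # data from a flat signal
        if changepoints(fused_lasso_1d(y_null, lam)) == target:
            accepted.append(y_null)

    print(f"{tries} simulated data sets, {len(accepted)} accepted as reference data")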
The role of optimization: if we understand the 1d fused lasso, i.e.,
the way it selects changepoints (stems from KKT conditions), then
we can come up with a reference distribution without simulating
[Figure: data and fitted solution with p-values for two detected changepoints, p-value = 0.000 at one location and p-value = 0.359 at another]

Accordingly, we can efficiently conduct rigorous significance tests (Hyun et al., 2016)
Central concept: convexity

Historically, linear programs were the focus in optimization

Initially, it was thought that the important distinction was between linear and nonlinear optimization problems. But some nonlinear problems turned out to be much harder than others ...

Now it is widely recognized that the right distinction is between convex and nonconvex problems

Your supplementary textbooks for the course: Boyd and Vandenberghe (2004), and Rockafellar (1970)
Convex sets and functions

Convex set: $C \subseteq \mathbb{R}^n$ such that

$$x, y \in C \ \Longrightarrow \ tx + (1-t)y \in C \ \text{ for all } 0 \leq t \leq 1$$

Convex function: $f : \mathbb{R}^n \rightarrow \mathbb{R}$ such that $\mathrm{dom}(f) \subseteq \mathbb{R}^n$ is convex, and

$$f(tx + (1-t)y) \leq t f(x) + (1-t) f(y) \ \text{ for } 0 \leq t \leq 1 \text{ and all } x, y \in \mathrm{dom}(f)$$

[Figures from Boyd and Vandenberghe (2004): simple examples of convex and nonconvex sets (their Figure 2.2), and the graph of a convex function, where the chord between any two points on the graph lies above the graph (their Figure 3.1)]
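As a quick numerical companion to the definition (my illustration, not from the slides): sample random pairs $x, y$ and weights $t$, and check the inequality $f(tx + (1-t)y) \leq t f(x) + (1-t) f(y)$. Random sampling can refute convexity but can never certify it.

    # Spot-check of the convexity inequality at random points; a failure
    # disproves convexity, while passing every trial proves nothing.
    import numpy as np

    def appears_convex(f, dim, trials=10000, scale=5.0, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(trials):
            x = scale * rng.standard_normal(dim)
            y = scale * rng.standard_normal(dim)
            t = rng.uniform()
            if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
                return False       # found a violating triple (x, y, t)
        return True                # no violation found (not a proof)

    print(appears_convex(lambda x: np.sum(x ** 2), dim=3))        # convex: True
    print(appears_convex(lambda x: np.sin(np.sum(x)), dim=3))     # nonconvex: likely False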
Convex optimization problems

Optimization problem:

$$\begin{aligned}
\min_{x \in D} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \leq 0, \quad i = 1, \ldots, m \\
& h_j(x) = 0, \quad j = 1, \ldots, p
\end{aligned}$$

Here $D = \mathrm{dom}(f) \cap \bigcap_{i=1}^{m} \mathrm{dom}(g_i) \cap \bigcap_{j=1}^{p} \mathrm{dom}(h_j)$, the common domain of all the functions

This is a convex optimization problem provided the functions $f$ and $g_i$, $i = 1, \ldots, m$ are convex, and $h_j$, $j = 1, \ldots, p$ are affine:

$$h_j(x) = a_j^T x + b_j, \quad j = 1, \ldots, p$$
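Here is a small, self-contained instance of this standard form written with CVXPY (my illustrative example, not from the slides): a convex quadratic $f$, one convex inequality constraint $g_1$, and one affine equality constraint $h_1$; the particular data $c$, $a$, $b$ are arbitrary.

    # Minimal sketch of a convex problem in the standard form above:
    #   f(x)   = ||x - c||_2^2          (convex objective)
    #   g_1(x) = ||x||_1 - 1 <= 0       (convex inequality constraint)
    #   h_1(x) = a^T x + b  = 0         (affine equality constraint)
    import numpy as np
    import cvxpy as cp

    n = 5
    c = np.arange(1, n + 1, dtype=float)
    a = np.ones(n)
    b = -0.5

    x = cp.Variable(n)
    objective = cp.Minimize(cp.sum_squares(x - c))
    constraints = [cp.norm1(x) - 1 <= 0,     # g_1(x) <= 0
                   a @ x + b == 0]           # h_1(x) = 0
    prob = cp.Problem(objective, constraints)
    prob.solve()

    print("optimal value:", prob.value)
    print("optimal x:", x.value)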
Local minima are global minima
For convex optimization problems, local minima are global minima
Formally, if $x$ is feasible ($x \in D$, and satisfies all constraints) and minimizes $f$ in a local neighborhood,

$$f(x) \leq f(y) \ \text{ for all feasible } y \text{ with } \|x - y\|_2 \leq \rho,$$

then

$$f(x) \leq f(y) \ \text{ for all feasible } y$$


This is a very useful ●
● ●
fact and will save us ●

a lot of trouble!
● ● ●●

Convex Nonconvex
18
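A tiny numerical illustration of this fact (mine, with a made-up convex quadratic, an assumed step size, and a fixed iteration count): gradient descent started from very different points lands on the same minimizer, because any local minimum of a convex problem is global.

    # Gradient descent on a smooth convex quadratic from several random starts;
    # every run converges to the same global minimizer (the solution of A x = b).
    import numpy as np

    A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite, so f is convex
    b = np.array([1.0, -2.0])

    def grad(x):
        # gradient of f(x) = (1/2) x^T A x - b^T x
        return A @ x - b

    rng = np.random.default_rng(0)
    step = 0.1
    for trial in range(3):
        x = 10.0 * rng.standard_normal(2)    # a very different starting point each time
        for _ in range(2000):
            x = x - step * grad(x)
        print(f"start {trial}: converged to {np.round(x, 4)}")

    print("closed-form minimizer:", np.round(np.linalg.solve(A, b), 4))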
