
Introduction: Why Optimization?

Lecturer: Ryan Tibshirani


Convex Optimization 10-725/36-725
Course setup

Welcome to the course on Convex Optimization, with a focus on its ties to Statistics and Machine Learning!

Basic administrative details:


• Instructors: Javier Peña, Ryan Tibshirani
• Teaching assistants: Alnur Ali, Christoph Dann, Sangwon
Hyun, Mariya Toneva, Han Zhao
• Course website:
http://www.stat.cmu.edu/~ryantibs/convexopt/
• We will use Piazza for announcements and discussions
• We will use Blackboard just as a gradebook

Prerequisites: no formal ones, but class will be fairly fast-paced

Assume working knowledge of/proficiency with:


• Real analysis, calculus, linear algebra
• Core problems in Stats/ML
• Programming (Matlab, Python, R ...)
• Data structures, computational complexity
• Formal mathematical thinking

If you fall short on any one of these things, it’s certainly possible to
catch up; but don’t hesitate to talk to us

Evaluation:
• 5 homeworks
• 2 little tests
• 1 project (can enroll for 9 units with no project)
• Many easy quizzes

Project: something useful/interesting with optimization. Groups of 2 or 3, milestones throughout the semester, details to come

Quizzes: due at midnight the day of each lecture. Should be very short, very easy if you've attended lecture ...
Scribing: sign up to scribe one lecture per semester, on the course
website (multiple scribes per lecture). Can bump up your grade in
boundary cases

Lecture videos: see links on course website. These are supposed to be helpful supplements, not replacements! Best to attend lectures

Auditors: welcome, please audit rather than just sitting in

Most important: work hard and have fun!

Optimization problems are ubiquitous in Statistics and
Machine Learning

Optimization problems underlie most everything we do in Statistics and Machine Learning. In many courses, you learn how to translate a conceptual idea into an optimization problem

$$P : \ \min_{x \in D} f(x)$$

Examples of this? Examples of the contrary?

This course: how to solve P, and also why this is important
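As one concrete illustration (my example, not spelled out on this slide): the conceptual idea "estimate a sparse linear model from data $(X, y)$" translates into the lasso problem

$$P_{\text{lasso}} : \ \min_{\beta \in \mathbb{R}^p} \ \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1,$$

an instance of $P$ with decision variable $x = \beta$, domain $D = \mathbb{R}^p$, and $f$ the penalized least squares criterion.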

Presumably, other people have already figured out how to solve

$$P : \ \min_{x \in D} f(x)$$

So why bother?

Many reasons. Here are two:


1. Different algorithms can perform better or worse for different
problems P (sometimes drastically so)
2. Studying P can actually give you a deeper understanding of
the statistical procedure in question

Optimization is a very active field. It moves quickly, but there is still much room for progress, especially at the intersection with Statistics and ML

Example: algorithms for the 2d fused lasso
The 2d fused lasso or 2d total variation denoising problem is:
$$\min_{\theta} \ \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \sum_{(i,j) \in E} |\theta_i - \theta_j|$$

This fits a piecewise constant function over an image, given data $y_i$, $i = 1, \ldots, n$, at the pixels; here $E$ is the set of edges joining neighboring pixels in the 2d grid
[Figure: three image panels showing the true image, the noisy data, and the 2d fused lasso solution]
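For concreteness, here is a minimal sketch of how this problem could be posed with a generic modeling tool (CVXPY) on a small synthetic grid. This is my illustration, not code from the course; the grid size, noise level, and $\lambda$ are arbitrary assumptions, and the specialized algorithms on the next slides are what you would actually want at realistic image sizes.

    # Minimal sketch (assumptions: NumPy, SciPy, CVXPY available; tiny 20x20
    # image, lambda = 1). A generic solver, not the course's reference code.
    import numpy as np
    import scipy.sparse as sp
    import cvxpy as cp

    nrow, ncol = 20, 20                       # n = nrow * ncol pixels
    rng = np.random.default_rng(0)
    truth = np.zeros((nrow, ncol))
    truth[5:15, 5:15] = 1.0                   # piecewise constant "true image"
    y = (truth + 0.5 * rng.standard_normal((nrow, ncol))).ravel()

    def grid_edges(nrow, ncol):
        """Edge set E of the 2d grid graph: each pixel and its right/down neighbor."""
        idx = np.arange(nrow * ncol).reshape(nrow, ncol)
        horiz = np.column_stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()])
        vert = np.column_stack([idx[:-1, :].ravel(), idx[1:, :].ravel()])
        return np.vstack([horiz, vert])

    edges = grid_edges(nrow, ncol)
    m = edges.shape[0]
    # Sparse difference operator D: row k is +1 at edges[k, 0], -1 at edges[k, 1]
    D = sp.csr_matrix((np.tile([1.0, -1.0], m),
                       (np.repeat(np.arange(m), 2), edges.ravel())),
                      shape=(m, nrow * ncol))

    lam = 1.0
    theta = cp.Variable(nrow * ncol)
    objective = 0.5 * cp.sum_squares(y - theta) + lam * cp.norm1(D @ theta)
    cp.Problem(cp.Minimize(objective)).solve()
    theta_hat = theta.value.reshape(nrow, ncol)   # piecewise constant estimate

The objective here matches the display above, with the sum over $E$ written as $\|D\theta\|_1$ for a sparse difference matrix $D$.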
Our problem: $$\min_{\theta} \ \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \sum_{(i,j) \in E} |\theta_i - \theta_j|$$

[Figure: estimates produced by three algorithms, each after a fixed amount of computation]

• Specialized ADMM, 20 iterations
• Proximal gradient descent, 1000 iterations
• Coordinate descent, 10K cycles

(Last two from the dual)
What’s the message here?

So what’s the right conclusion here?

Is the alternating direction method of multipliers (ADMM) simply a better method than proximal gradient descent or coordinate descent? ... No

In fact, different algorithms will work better in different situations. We'll learn the details throughout the course

In the 2d fused lasso problem:


• Specialized ADMM: fast (structured subproblems)
• Proximal gradient: slow (poor conditioning)
• Coordinate descent: slow (large active set)

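To make one of the algorithms named above concrete, here is a minimal sketch of generic proximal gradient descent, illustrated on the lasso, whose prox is elementwise soft-thresholding. This is my own illustration, not course code; the fixed step size $1/L$ and the iteration budget are assumptions.

    # Sketch of proximal gradient descent for  min_beta (1/2)||y - X beta||_2^2
    # + lam * ||beta||_1.  Assumed: fixed step 1/L, fixed iteration count.
    import numpy as np

    def soft_threshold(z, t):
        """Prox of t * ||.||_1: elementwise soft-thresholding."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def prox_gradient_lasso(X, y, lam, num_iters=1000):
        n, p = X.shape
        beta = np.zeros(p)
        L = np.linalg.eigvalsh(X.T @ X).max()    # Lipschitz constant of the gradient
        step = 1.0 / L
        for _ in range(num_iters):
            grad = X.T @ (X @ beta - y)          # gradient step on the smooth part
            beta = soft_threshold(beta - step * grad, step * lam)   # prox step
        return beta

    # Tiny usage example on synthetic data
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    beta_true = np.zeros(20)
    beta_true[:3] = [2.0, -1.0, 1.5]
    y = X @ beta_true + 0.1 * rng.standard_normal(100)
    beta_hat = prox_gradient_lasso(X, y, lam=5.0)

The same two-step pattern, a gradient step on the smooth part followed by a prox step on the nonsmooth part, is what "proximal gradient descent" refers to in the comparison above.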
Example: testing changepoints from the 1d fused lasso
In the 1d fused lasso or 1d total variation denoising problem
$$\min_{\theta} \ \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \sum_{i=1}^{n-1} |\theta_i - \theta_{i+1}|$$

the parameter $\lambda \geq 0$ is called a tuning parameter. As $\lambda$ decreases, we see more changepoints in the solution $\hat{\theta}$
[Figure: three panels showing the data and the 1d fused lasso solution at $\lambda = 25$, $\lambda = 0.62$, and $\lambda = 0.41$; smaller $\lambda$ gives more changepoints]
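To see this effect numerically, here is a small sketch (my illustration, not from the slides) that solves the 1d fused lasso with a generic solver at a few values of $\lambda$ and counts changepoints in $\hat{\theta}$; the synthetic signal, noise level, and the tolerance used to declare a changepoint are all assumptions.

    # Minimal sketch (assumptions: NumPy and CVXPY available; a changepoint is
    # declared wherever consecutive fitted values differ by more than a tolerance).
    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(1)
    n = 100
    signal = np.concatenate([np.zeros(10), np.ones(40), np.zeros(50)])
    y = signal + 0.25 * rng.standard_normal(n)

    def fused_lasso_1d(y, lam):
        theta = cp.Variable(len(y))
        obj = 0.5 * cp.sum_squares(y - theta) + lam * cp.norm1(cp.diff(theta))
        cp.Problem(cp.Minimize(obj)).solve()
        return theta.value

    for lam in [25.0, 0.62, 0.41]:
        theta_hat = fused_lasso_1d(y, lam)
        n_changepoints = int(np.sum(np.abs(np.diff(theta_hat)) > 1e-4))
        print(f"lambda = {lam:5.2f}: {n_changepoints} changepoints")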
Let's look at the solution at $\lambda = 0.41$ a little more closely

[Figure: data and fitted solution, with the segments between detected changepoints labeled A, B, and C]

How can we test the significance of detected changepoints? Say, at location 11?

Classically: take the average of data points in region A minus the average in B, compare this to what we expect if the signal was flat

But this is incorrect, because location 11 was selected based on the data, so of course the difference in averages looks high!
What we want to do: compare our observed difference to that in
reference (null) data, in which the signal was flat and we happen
to select the same location 11 (and 50)
[Figure: two panels, the observed data (abs. difference ≈ 0.088) and a reference data set (abs. difference ≈ 0.072), each divided into regions A, B, C at the selected changepoints]
But it took 1222 simulated data sets to get one reference data set!
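A naive way to produce such reference data is rejection sampling: simulate flat (null) signals, fit the 1d fused lasso to each, and keep only the draws that happen to select the same changepoint locations. The sketch below is my illustration of that idea, not the authors' code; the noise level, $\lambda$, the target changepoint set, and the cap on attempts are all assumptions. Its inefficiency is exactly the point of the 1222-to-1 count above.

    # Sketch of naive rejection sampling for reference (null) data sets.
    # Assumptions: flat null signal with known noise level, fixed lambda, and
    # changepoints read off wherever consecutive fitted values differ by > tol.
    import numpy as np
    import cvxpy as cp

    def fused_lasso_1d(y, lam):
        theta = cp.Variable(len(y))
        obj = 0.5 * cp.sum_squares(y - theta) + lam * cp.norm1(cp.diff(theta))
        cp.Problem(cp.Minimize(obj)).solve()
        return theta.value

    def changepoints(theta_hat, tol=1e-4):
        return set(np.flatnonzero(np.abs(np.diff(theta_hat)) > tol))

    n, sigma, lam = 100, 0.25, 0.41
    target = {10, 49}        # hypothetical locations selected in the observed data
    rng = np.random.default_rng(2)

    accepted, tries = [], 0
    while len(accepted) < 1 and tries < 2000:      # cap the number of attempts
        tries += 1
        y_null = sigma * rng.standard_normal(n)    # data from a flat signal
        if changepoints(fused_lasso_1d(y_null, lam)) == target:
            accepted.append(y_null)

    print(f"{tries} simulated data sets, {len(accepted)} accepted as reference data")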
The role of optimization: if we understand the 1d fused lasso, i.e.,
the way it selects changepoints (stems from KKT conditions), then
we can come up with a reference distribution without simulating
[Figure: data and fitted solution with p-values for two detected changepoints, p-value = 0.000 at one location and p-value = 0.359 at another]

Accordingly, we can efficiently conduct rigorous significance tests (Hyun et al., 2016)
Central concept: convexity

Historically, linear programs were the focus in optimization

Initially, it was thought that the important distinction was between linear and nonlinear optimization problems. But some nonlinear problems turned out to be much harder than others ...

Now it is widely recognized that the right distinction is between convex and nonconvex problems

Your supplementary textbooks for the course: Boyd and Vandenberghe (2004), and Rockafellar (1970)
Convex sets and functions

Convex set: $C \subseteq \mathbb{R}^n$ such that

$$x, y \in C \ \Longrightarrow \ tx + (1-t)y \in C \ \text{ for all } 0 \leq t \leq 1$$

Convex function: $f : \mathbb{R}^n \rightarrow \mathbb{R}$ such that $\mathrm{dom}(f) \subseteq \mathbb{R}^n$ is convex, and

$$f(tx + (1-t)y) \leq t f(x) + (1-t) f(y) \ \text{ for } 0 \leq t \leq 1 \text{ and all } x, y \in \mathrm{dom}(f)$$

[Figures from Boyd and Vandenberghe (2004): simple examples of convex and nonconvex sets (their Figure 2.2), and the graph of a convex function, where the chord between any two points on the graph lies above the graph (their Figure 3.1)]
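As a quick numerical companion to the definition (my illustration, not from the slides): sample random pairs $x, y$ and weights $t$, and check the inequality $f(tx + (1-t)y) \leq t f(x) + (1-t) f(y)$. Random sampling can refute convexity but can never certify it.

    # Spot-check of the convexity inequality at random points; a failure
    # disproves convexity, while passing every trial proves nothing.
    import numpy as np

    def appears_convex(f, dim, trials=10000, scale=5.0, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(trials):
            x = scale * rng.standard_normal(dim)
            y = scale * rng.standard_normal(dim)
            t = rng.uniform()
            if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
                return False       # found a violating triple (x, y, t)
        return True                # no violation found (not a proof)

    print(appears_convex(lambda x: np.sum(x ** 2), dim=3))        # convex: True
    print(appears_convex(lambda x: np.sin(np.sum(x)), dim=3))     # nonconvex: likely False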
Convex optimization problems

Optimization problem:

$$\begin{aligned}
\min_{x \in D} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \leq 0, \quad i = 1, \ldots, m \\
& h_j(x) = 0, \quad j = 1, \ldots, p
\end{aligned}$$

Here $D = \mathrm{dom}(f) \cap \bigcap_{i=1}^{m} \mathrm{dom}(g_i) \cap \bigcap_{j=1}^{p} \mathrm{dom}(h_j)$, the common domain of all the functions

This is a convex optimization problem provided the functions $f$ and $g_i$, $i = 1, \ldots, m$ are convex, and $h_j$, $j = 1, \ldots, p$ are affine:

$$h_j(x) = a_j^T x + b_j, \quad j = 1, \ldots, p$$
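Here is a small, self-contained instance of this standard form written with CVXPY (my illustrative example, not from the slides): a convex quadratic $f$, one convex inequality constraint $g_1$, and one affine equality constraint $h_1$; the particular data $c$, $a$, $b$ are arbitrary.

    # Minimal sketch of a convex problem in the standard form above:
    #   f(x)   = ||x - c||_2^2          (convex objective)
    #   g_1(x) = ||x||_1 - 1 <= 0       (convex inequality constraint)
    #   h_1(x) = a^T x + b  = 0         (affine equality constraint)
    import numpy as np
    import cvxpy as cp

    n = 5
    c = np.arange(1, n + 1, dtype=float)
    a = np.ones(n)
    b = -0.5

    x = cp.Variable(n)
    objective = cp.Minimize(cp.sum_squares(x - c))
    constraints = [cp.norm1(x) - 1 <= 0,     # g_1(x) <= 0
                   a @ x + b == 0]           # h_1(x) = 0
    prob = cp.Problem(objective, constraints)
    prob.solve()

    print("optimal value:", prob.value)
    print("optimal x:", x.value)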
Local minima are global minima
For convex optimization problems, local minima are global minima
Formally, if $x$ is feasible ($x \in D$, and satisfies all constraints) and minimizes $f$ in a local neighborhood,

$$f(x) \leq f(y) \ \text{ for all feasible } y \text{ with } \|x - y\|_2 \leq \rho,$$

then

$$f(x) \leq f(y) \ \text{ for all feasible } y$$


This is a very useful ●
● ●
fact and will save us ●

a lot of trouble!
● ● ●●

Convex Nonconvex
18
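A tiny numerical illustration of this fact (mine, with a made-up convex quadratic, an assumed step size, and a fixed iteration count): gradient descent started from very different points lands on the same minimizer, because any local minimum of a convex problem is global.

    # Gradient descent on a smooth convex quadratic from several random starts;
    # every run converges to the same global minimizer (the solution of A x = b).
    import numpy as np

    A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite, so f is convex
    b = np.array([1.0, -2.0])

    def grad(x):
        # gradient of f(x) = (1/2) x^T A x - b^T x
        return A @ x - b

    rng = np.random.default_rng(0)
    step = 0.1
    for trial in range(3):
        x = 10.0 * rng.standard_normal(2)    # a very different starting point each time
        for _ in range(2000):
            x = x - step * grad(x)
        print(f"start {trial}: converged to {np.round(x, 4)}")

    print("closed-form minimizer:", np.round(np.linalg.solve(A, b), 4))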
