
Convex optimization - Homework 3

1. Second-order methods for the dual problem

1. Consider the LASSO (Least Absolute Shrinkage and Selection Operator) problem

minimize (1/2)||Xw − y||_2^2 + λ||w||_1

in the variable w ∈ R^d, with X ∈ R^{n×d}, y ∈ R^n and λ > 0 the regularization
parameter.

We can rewrite the problem as

minimize (1/2)||z − y||_2^2 + λ||w||_1
subject to z = Xw

The Lagrangian of this problem is

L(w, z, v) = (1/2)||z − y||_2^2 + λ||w||_1 + v^T (Xw − z).

As the Lagrangian is separable in w and z,

inf_{w,z} L(w, z, v) = inf_w (λ||w||_1 + v^T X w) + inf_z ((1/2)||z − y||_2^2 − v^T z)

For the first part:

(1/2)||z − y||_2^2 − v^T z = (1/2) z^T z − (y + v)^T z + (1/2) y^T y

argmin_z ((1/2) z^T z − (y + v)^T z + (1/2) y^T y) = y + v

inf_z ((1/2)||z − y||_2^2 − v^T z) = −(1/2)||v||_2^2 − y^T v

And for the second part:

inf_w (λ||w||_1 + (X^T v)^T w) = Σ_{i=1}^d inf_{w_i} (λ|w_i| + (X^T v)_i w_i)

inf_{w_i} (λ|w_i| + (X^T v)_i w_i) = 0 if |(X^T v)_i| ≤ λ, −∞ otherwise

inf_w (λ||w||_1 + (X^T v)^T w) = 0 if ||X^T v||_∞ ≤ λ, −∞ otherwise
We deduce that:

inf_{w,z} L(w, z, v) = −(1/2)||v||_2^2 − y^T v when ||X^T v||_∞ ≤ λ

Thus, the dual problem is

maximize −(1/2)||v||_2^2 − y^T v
subject to ||X^T v||_∞ ≤ λ

We can rewrite this problem as

minimize v^T Q v + p^T v
subject to Av ≤ b

in the variable v ∈ R^n, with Q = (1/2) I_n, p = y, A = (X, −X)^T and b = λ1_{2d},
where 1_{2d} ∈ R^{2d} is the all-ones vector.
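As a concrete illustration, here is a minimal sketch of this reformulation in Python (the function name and interface are illustrative, not part of the assignment):

```python
import numpy as np

def dual_qp_data(X, y, lam):
    """Build (Q, p, A, b) for the dual QP:  minimize v^T Q v + p^T v  s.t.  A v <= b."""
    n, d = X.shape
    Q = 0.5 * np.eye(n)          # Q = (1/2) I_n
    p = y                        # p = y
    A = np.vstack([X.T, -X.T])   # encodes ||X^T v||_inf <= lam as two-sided inequalities
    b = lam * np.ones(2 * d)     # b = lam * 1_{2d}
    return Q, p, A, b
```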

3. We can take v_0 = 0, which satisfies the initial condition: it is strictly feasible since ||X^T v_0||_∞ = 0 < λ.

We notice that the larger µ is, the more the objective function decreases at each
iteration; accordingly, the number of iterations decreases as µ grows.

Figure 1: The objective function vs the number of iterations for different µ

Figure 2: The gap in a semilog scale

Thus, an appropriate choice is µ = 50.
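For reference, here is a minimal sketch of a log-barrier method for this QP. The outer update t ← µt matches the role of µ above; the fixed centering budget and the purely feasibility-based backtracking are simplifications of this sketch, not details taken from the assignment:

```python
import numpy as np

def barrier_method(Q, p, A, b, v0, mu=50.0, t0=1.0, tol=1e-6):
    """Log-barrier method for  minimize v^T Q v + p^T v  s.t.  A v <= b."""
    m = A.shape[0]
    v, t = v0.astype(float).copy(), t0
    while m / t > tol:                          # m/t bounds the duality gap
        for _ in range(50):                     # damped Newton centering steps
            s = b - A @ v                       # slacks, kept strictly positive
            grad = t * (2 * Q @ v + p) + A.T @ (1.0 / s)
            hess = 2 * t * Q + A.T @ ((1.0 / s**2)[:, None] * A)
            dv = np.linalg.solve(hess, -grad)
            if -grad @ dv / 2 < 1e-9:           # Newton decrement small: centered
                break
            step = 1.0                          # backtrack until strictly feasible
            while np.min(b - A @ (v + step * dv)) <= 0:
                step *= 0.5
            v = v + step * dv
        t *= mu                                 # increase t by the factor mu
    return v
```

With the data from dual_qp_data, the initial point v_0 = np.zeros(n) is strictly feasible since b = λ1 > 0.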

2. First-order methods for the primal problem

1. The function f(w) = (1/2)||Xw − y||_2^2 + λ||w||_1 is not differentiable on the set
{w ∈ R^d | ∃i ∈ {1, ..., d}, w_i = 0}; everywhere else,

∇f(w) = X^T X w − X^T y + λ(sgn(w_i))_{1≤i≤d}.

By setting g(w) = X^T X w − X^T y + λ(sgn(w_i))_{1≤i≤d}, with sgn(0) = 0, we have

∀z ∈ R^d, f(z) − f(w) − g(w)^T (z − w) ≥ 0,

so we can take g as a subgradient of f.
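A minimal sketch of this objective and subgradient (the function names are illustrative):

```python
import numpy as np

def lasso_obj(w, X, y, lam):
    """f(w) = (1/2) ||Xw - y||_2^2 + lam * ||w||_1"""
    r = X @ w - y
    return 0.5 * (r @ r) + lam * np.abs(w).sum()

def lasso_subgrad(w, X, y, lam):
    """g(w) = X^T X w - X^T y + lam * sgn(w), with sgn(0) = 0 (np.sign(0) is 0)."""
    return X.T @ (X @ w - y) + lam * np.sign(w)
```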

Figure 3: The objective function vs the number of iterations for different strategies

Figure 4: The gap in a semilog scale for different strategies

• Strategy 1: Constant step size α_k = h

• Strategy 2: Constant step length α_k = h/||g_k||_2

• Strategy 3: Square summable but not summable α_k = h/k

• Strategy 4: Nonsummable diminishing α_k = h/√k
We can see that the fourth strategy is the fastest, while the first and the second are
really slow. However, if we continue to iterate long enough, we notice that the first
two strategies are more precise, even though they do not converge.
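A sketch of the subgradient method under these four rules, reusing the lasso_obj and lasso_subgrad helpers above (keeping the best iterate seen so far is standard practice here, since the objective is not monotone along subgradient steps):

```python
import numpy as np

def subgradient_method(X, y, lam, strategy=4, h=1e-3, iters=1000):
    """Subgradient method for the LASSO primal with the four step-size rules."""
    w = np.zeros(X.shape[1])
    w_best, f_best = w.copy(), lasso_obj(w, X, y, lam)
    for k in range(1, iters + 1):
        g = lasso_subgrad(w, X, y, lam)
        if strategy == 1:
            alpha = h                          # constant step size
        elif strategy == 2:
            alpha = h / np.linalg.norm(g)      # constant step length
        elif strategy == 3:
            alpha = h / k                      # square summable but not summable
        else:
            alpha = h / np.sqrt(k)             # nonsummable diminishing
        w = w - alpha * g
        f = lasso_obj(w, X, y, lam)
        if f < f_best:                         # track the best iterate seen so far
            w_best, f_best = w.copy(), f
    return w_best, f_best
```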
2. The function has the form f(w) = (1/2) w^T X^T X w − y^T X w + (1/2) y^T y + λ||w||_1.
If we write the function componentwise, we get:

f(w) = (1/2) Σ_{i=1}^d Σ_{j=1}^d (X^T X)_{ij} w_i w_j − Σ_{i=1}^d (X^T y)_i w_i + (1/2) y^T y + λ Σ_{i=1}^d |w_i|.

Since X^T X is symmetric, a more convenient form is:

f(w) = (1/2) Σ_{i=1}^d (X^T X)_{ii} w_i^2 + Σ_{i=1}^d Σ_{j=1}^{i−1} (X^T X)_{ij} w_i w_j − Σ_{i=1}^d (X^T y)_i w_i + (1/2) y^T y + λ Σ_{i=1}^d |w_i|.

Here, for each i ∈ {1, ..., d}, if we fix the w_j with j ≠ i, then f_i(w_i) = f(w) takes
the form

f_i(w_i) = α_i w_i^2 + β_i w_i + γ_i |w_i| + δ_i

with α_i = (1/2)(X^T X)_{ii}, β_i = Σ_{j≠i} (X^T X)_{ij} w_j − (X^T y)_i, γ_i = λ and
δ_i = (1/2) Σ_{j≠i} Σ_{k≠i} (X^T X)_{kj} w_k w_j − Σ_{j≠i} (X^T y)_j w_j + (1/2) y^T y + λ Σ_{j≠i} |w_j|.

We can say that α_i, γ_i > 0. Thus, the function f_i(w_i) = α_i w_i^2 + β_i w_i + γ_i |w_i| + δ_i
is convex and tends to +∞ at ±∞. Moreover:

f_i'(w_i) = 2α_i w_i + β_i + γ_i if w_i > 0
f_i'(w_i) = 2α_i w_i + β_i − γ_i if w_i < 0

(the function is not differentiable at 0).

We can thus deduce that:

argmin_{w_i} f_i(w_i) = 0 if |β_i| ≤ γ_i
argmin_{w_i} f_i(w_i) = (−β_i − γ_i)/(2α_i) if β_i < −γ_i
argmin_{w_i} f_i(w_i) = (−β_i + γ_i)/(2α_i) if β_i > γ_i

We can thus set at each iteration w_i^{(k+1)} = argmin_{w_i} f_i(w_i) and w_j^{(k+1)} = w_j^{(k)}
for j ≠ i, cycling over i.
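A minimal sketch of this cyclic coordinate descent (the function name is illustrative; it assumes the columns of X are nonzero, so that α_i > 0):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_cycles=250):
    """Cyclic coordinate descent using the closed-form coordinate minimizer above."""
    d = X.shape[1]
    G = X.T @ X                  # Gram matrix X^T X
    c = X.T @ y                  # X^T y
    w = np.zeros(d)
    for _ in range(n_cycles):
        for i in range(d):
            alpha = 0.5 * G[i, i]
            beta = G[i] @ w - G[i, i] * w[i] - c[i]   # sum_{j!=i} (X^T X)_{ij} w_j - (X^T y)_i
            gamma = lam
            if abs(beta) <= gamma:
                w[i] = 0.0
            elif beta < -gamma:
                w[i] = (-beta - gamma) / (2 * alpha)
            else:
                w[i] = (-beta + gamma) / (2 * alpha)
    return w
```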

We notice that the coordinate descent method is faster than the sub-gradient
method. Indeed, while the sub-gradient method does not converge but oscillates
around p*, the coordinate descent method converges in a certain number of steps
(approximately 250 to reach a gap ε = 10^−3).

Figure 5: The objective function vs the number of iterations for the coordinate descent

Figure 6: The gap in a semilog scale for the coordinate descent

Figure 7: Comparison of the gap for the 2 methods
