Machine Learning Course - CS-433
Least Squares
Sept 24, 2024
Martin Jaggi
Last updated on: September 24, 2024
credits to Mohammad Emtiyaz Khan & Rüdiger Urbanke
Motivation
In rare cases, one can compute the optimum of the cost function analytically. Linear regression using a mean-squared error cost function is one such case. Here the solution can be obtained explicitly, by solving a linear system of equations. These equations are sometimes called the normal equations. This method is one of the most popular methods for data fitting. It is called least squares.
To derive the normal equations, we first show that the problem is convex. We then use the optimality conditions for convex functions (see the previous lecture notes on optimization). I.e., at the optimum parameter, call it $w^\star$, it must be true that the gradient of the cost function is 0. I.e.,
$$\nabla L(w^\star) = 0.$$
This is a system of $D$ equations.
Normal Equations
Recall that the cost function for linear regression with mean-squared error is given by
$$L(w) = \frac{1}{2N} \sum_{n=1}^{N} \left(y_n - x_n^\top w\right)^2 = \frac{1}{2N} (y - Xw)^\top (y - Xw),$$
where
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad
X = \begin{bmatrix}
x_{11} & x_{12} & \ldots & x_{1D} \\
x_{21} & x_{22} & \ldots & x_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \ldots & x_{ND}
\end{bmatrix}.$$
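As a concrete illustration, here is a minimal NumPy sketch, assuming synthetic data and illustrative variable names, that evaluates the cost in both the summation form and the matrix form and checks that they agree:

import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 3
X = rng.normal(size=(N, D))      # N data points, D features
y = rng.normal(size=N)           # targets
w = rng.normal(size=D)           # an arbitrary weight vector

# Summation form: (1/2N) * sum_n (y_n - x_n^T w)^2
L_sum = sum((y[n] - X[n] @ w) ** 2 for n in range(N)) / (2 * N)

# Matrix form: (1/2N) * (y - Xw)^T (y - Xw)
e = y - X @ w
L_mat = (e @ e) / (2 * N)

print(L_sum, L_mat)              # the two values agree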
We claim that this cost function is convex in $w$. There are several ways of proving this:

1. Simplest way: observe that $L$ is naturally represented as the sum (with positive coefficients) of the simple terms $(y_n - x_n^\top w)^2$. Further, each of these simple terms is the composition of a linear function with a convex function (the square function). Therefore, each of these simple terms is convex and hence the sum is convex.
2. Directly verify the definition, that for any $\lambda \in [0, 1]$ and any $w, w'$,
$$L(\lambda w + (1 - \lambda) w') \le \lambda L(w) + (1 - \lambda) L(w').$$
Computation:
$$\text{LHS} - \text{RHS} = -\frac{\lambda(1-\lambda)}{2N} \, \|X(w - w')\|_2^2,$$
which indeed is non-positive.
3. We can compute the second derivative (the Hessian) and show that it is positive semidefinite (all its eigenvalues are non-negative). For the present case a computation shows that the Hessian has the form
$$\frac{1}{N} X^\top X.$$
This matrix is indeed positive semidefinite since its non-zero eigenvalues are the squares of the non-zero singular values of the matrix $X$. (A small numerical check of this claim is sketched right after this list.)
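A minimal numerical check of the Hessian claim, assuming synthetic data (variable names are illustrative):

import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 4
X = rng.normal(size=(N, D))

H = X.T @ X / N                  # the Hessian of L(w); it does not depend on w
eigvals = np.linalg.eigvalsh(H)  # eigenvalues of the symmetric matrix H

# All eigenvalues are non-negative (up to numerical precision),
# so H is positive semidefinite and L is convex.
print(eigvals)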
Now that we know that the function is convex, let us find its minimum. If we take the gradient of this expression with respect to the weight vector $w$, we get
$$\nabla L(w) = -\frac{1}{N} X^\top (y - Xw).$$
If we set this expression to 0, we get the normal equations for linear regression,
$$X^\top \underbrace{(y - Xw)}_{\text{error}} = 0.$$
This is again a system of $D$ equations in the $D$ entries of $w$.
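As a sanity check on the gradient formula, the following sketch, assuming synthetic data with illustrative names, compares the analytic gradient $-\frac{1}{N}X^\top(y - Xw)$ with a finite-difference approximation:

import numpy as np

rng = np.random.default_rng(2)
N, D = 30, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

def L(w):
    e = y - X @ w
    return (e @ e) / (2 * N)

grad_analytic = -X.T @ (y - X @ w) / N

# Central finite differences, coordinate by coordinate
eps = 1e-6
I = np.eye(D)
grad_fd = np.array([(L(w + eps * I[d]) - L(w - eps * I[d])) / (2 * eps)
                    for d in range(D)])

print(np.max(np.abs(grad_analytic - grad_fd)))  # small (finite-difference error)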
Geometric Interpretation
The error is orthogonal to all columns of $X$.

The span of $X$ is the space spanned by the columns of $X$. Every element of the span can be written as $u = Xw$ for some choice of $w$. Which element of $\text{span}(X)$ shall we take? The normal equations tell us that the optimum choice for $u$, call it $u^\star$, is that element so that $y - u^\star$ is orthogonal to $\text{span}(X)$. In other words, we should pick $u^\star$ to be equal to the projection of $y$ onto $\text{span}(X)$. The following figure illustrates this (taken from Bishop's book):
[Figure (from Bishop's book): the target vector $y \in \mathbb{R}^N$ is projected onto the subspace $S = \text{span}(X)$; the projection $Xw^\star$ is the element of the span closest to $y$, i.e. it minimizes the error norm $\|y - Xw\|$.]
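The orthogonality statement can be checked numerically; a minimal sketch, assuming synthetic data with illustrative names:

import numpy as np

rng = np.random.default_rng(3)
N, D = 40, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Solve the normal equations X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# The residual (error) is numerically orthogonal to every column of X,
# so X w_star is the projection of y onto span(X).
residual = y - X @ w_star
print(X.T @ residual)            # approximately the zero vector in R^D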
Least Squares
The matrix $X^\top X \in \mathbb{R}^{D \times D}$ is called the Gram matrix. If it is invertible, we can multiply the normal equations by the inverse of the Gram matrix from the left to get a closed-form expression for the minimum:
$$w^\star = (X^\top X)^{-1} X^\top y.$$
We can use this model to predict a new value for an unseen datapoint (test point) $x_m$:
$$\hat{y}_m := x_m^\top w^\star = x_m^\top (X^\top X)^{-1} X^\top y.$$
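A minimal sketch of the closed-form solution and a prediction for one test point, assuming synthetic data and illustrative names; in practice we solve the linear system rather than form the inverse explicitly (see the additional notes at the end):

import numpy as np

rng = np.random.default_rng(4)
N, D = 100, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

# w* = (X^T X)^{-1} X^T y, computed by solving the normal equations
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for an unseen test point x_m
x_m = rng.normal(size=D)
y_hat_m = x_m @ w_star
print(w_star, y_hat_m)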
Invertibility and Uniqueness
Note that the Gram matrix $X^\top X \in \mathbb{R}^{D \times D}$ is invertible if and only if $X$ has full column rank, or in other words $\text{rank}(X) = D$.

Proof: To see this, assume first that $\text{rank}(X) < D$. Then there exists a non-zero vector $u$ so that $Xu = 0$. It follows that $X^\top X u = 0$, and so $\text{rank}(X^\top X) < D$. Therefore, $X^\top X$ is not invertible.
Conversely, assume that $X^\top X$ is not invertible. Hence, there exists a non-zero vector $v$ so that $X^\top X v = 0$. It follows that
$$0 = v^\top X^\top X v = (Xv)^\top (Xv) = \|Xv\|^2.$$
This implies that $Xv = 0$, i.e., $\text{rank}(X) < D$.
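To illustrate this equivalence, a small sketch with a deliberately rank-deficient design matrix (synthetic data, illustrative names):

import numpy as np

rng = np.random.default_rng(5)
N, D = 50, 4
X = rng.normal(size=(N, D))
X[:, 3] = 2.0 * X[:, 0]                  # make one column a multiple of another

# X does not have full column rank ...
print(np.linalg.matrix_rank(X))          # 3 < D
# ... and, equivalently, the Gram matrix X^T X is singular
print(np.linalg.matrix_rank(X.T @ X))    # also 3 < D
print(np.linalg.cond(X.T @ X))           # enormous condition number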
Rank Deficiency and Ill-Conditioning
Unfortunately, in practice, $X$ is often rank deficient.

• If $D > N$, we always have $\text{rank}(X) < D$ (since row rank = column rank). This is the over-parameterized setting.

• If $D \le N$, but some of the columns $x_{:d}$ are (nearly) collinear, then the matrix is ill-conditioned, leading to numerical issues when solving the linear system.

Can we solve least squares if $X$ is rank deficient? Yes, using a linear system solver for $X^\top X w = X^\top y$.
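For example, np.linalg.lstsq (an SVD-based least-squares solver) handles a rank-deficient $X$ and returns the minimum-norm minimizer; a brief sketch with synthetic data and illustrative names:

import numpy as np

rng = np.random.default_rng(6)
N, D = 10, 20                    # over-parameterized: D > N, so rank(X) < D
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# The Gram matrix X^T X is singular, so the closed-form inverse does not
# exist; the SVD-based solver still returns a minimizer of ||y - Xw||^2
# (the one of minimum norm).
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(rank)                      # at most N = 10
print(np.linalg.norm(y - X @ w)) # essentially zero: exact fit when D > N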
Summary of Linear Regression
We have studied three types of methods:
1. Grid Search

2. Iterative Optimization Algorithms: (Stochastic) Gradient Descent

3. Least Squares: closed-form solution, for linear MSE
Additional Notes
Solving linear systems
There are many ways to solve a linear system Xw = y, but it usually involves a decomposition of the matrix X, such as the QR or LU decomposition, which are very robust. Matlab's backslash operator and also NumPy's linalg package implement this in just one line:

w = np.linalg.solve(X, y)
It is important to never invert a matrix to solve a linear system, as this would incur a cost at least three times the cost of using a linear solver. For more, see this blog post: https://gregorygundersen.com/blog/2020/12/09/matrix-inversion/.
For a robust implementation, see Sec. 7.5.2 of Kevin Murphy’s book.
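To make the recommendation concrete, a small sketch (synthetic data, illustrative names) that contrasts solving the normal equations with a linear solver against forming the explicit inverse; both give the same answer here, but the solver is cheaper and numerically more stable:

import numpy as np

rng = np.random.default_rng(7)
N, D = 1000, 50
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.01 * rng.normal(size=N)

A, b = X.T @ X, X.T @ y

w_solve = np.linalg.solve(A, b)  # factorize and solve: preferred
w_inv = np.linalg.inv(A) @ b     # explicit inverse: discouraged

# The two agree here, but the solver is cheaper and numerically more
# stable, in particular when A is ill-conditioned.
print(np.max(np.abs(w_solve - w_inv)))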
Closed-form solution for MAE
Can you derive a closed-form solution for the 1-parameter model when using the MAE cost function?
See this short article: http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/.
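As a hint, consider the simplest one-parameter case of a constant model $\hat{y}_n = w$ (an assumption for this sketch): there the MAE minimizer turns out to be the median of the targets. A small numerical check with synthetic data and illustrative names:

import numpy as np

rng = np.random.default_rng(8)
y = rng.normal(size=101)

def mae(w):
    return np.mean(np.abs(y - w))   # MAE of the constant model y_hat = w

# Scan candidate constants and compare the minimizer with the median
grid = np.linspace(y.min(), y.max(), 10001)
w_best = grid[np.argmin([mae(w) for w in grid])]
print(w_best, np.median(y))         # the minimizer is (close to) the median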