A Comprehensive Guide to Machine Learning
Soroush Nasiriany, Garrett Thomas, William Wang, Alex Yang, Jennifer Listgarten, Anant Sahai
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
November 18, 2019
About
CS 189 is the Machine Learning course at UC Berkeley. In this guide we have created a comprehensive course guide in order to share our knowledge with students and the general public, and hopefully draw the interest of students from other universities to Berkeley's Machine Learning curriculum.
This guide was started by CS 189 TAs Soroush Nasiriany and Garrett Thomas in Fall 2017, with the assistance of William Wang and Alex Yang.
We owe gratitude to Professors Anant Sahai, Stella Yu, and Jennifer Listgarten, as this book is heavily inspired by their lectures. In addition, we are indebted to Professor Jonathan Shewchuk
for his machine learning notes, from which we drew inspiration.
The latest version of this document can be found either at http://www.eecs189.org/ or http://snasiriany.me/cs189/. Please report any mistakes to the staff, and contact the authors if you wish to redistribute this document.
Notation
Notation   | Meaning
R          | set of real numbers
R^n        | set (vector space) of n-tuples of real numbers, endowed with the usual inner product
R^{m×n}    | set (vector space) of m-by-n matrices
δ_ij       | Kronecker delta, i.e. δ_ij = 1 if i = j, 0 otherwise
∇f(x)      | gradient of the function f at x
∇²f(x)     | Hessian of the function f at x
p(X)       | distribution of random variable X
p(x)       | probability density/mass function evaluated at x
E[X]       | expected value of random variable X
Var(X)     | variance of random variable X
Cov(X, Y)  | covariance of random variables X and Y
Other notes:
• Vectors and matrices are in bold (e.g. x, A). This is true for vectors in R^n as well as for vectors in general vector spaces. We generally use Greek letters for scalars and capital Roman letters for matrices and random variables.
• We assume that vectors are column vectors, i.e. that a vector in R^n can be interpreted as an n-by-1 matrix. As such, taking the transpose of a vector is well-defined (and produces a row vector, which is a 1-by-n matrix).

Contents
1 Regression I
1.1 Ordinary Least Squares
1.2 Ridge Regression
1.3 Feature Engineering
1.4 Hyperparameters and Validation
2 Regression II
2.1 MLE and MAP for Regression (Part I)
2.2 Bias-Variance Tradeoff
2.3 Multivariate Gaussians
2.4 MLE and MAP for Regression (Part II)
2.5 Kernels and Ridge Regression
2.6 Sparse Least Squares
2.7 Total Least Squares
3 Dimensionality Reduction
3.1 Principal Component Analysis
3.2 Canonical Correlation Analysis
4 Beyond Least Squares: Optimization and Neural Networks
4.1 Nonlinear Least Squares
4.2 Optimization
4.3 Gradient Descent
4.4 Line Search
4.5 Convex Optimization
4.6 Newton's Method
4.7 Gauss-Newton Algorithm
4.8 Neural Networks
4.9 Training Neural Networks
5 Classification
5.1 Generative vs. Discriminative Classification
5.2 Least Squares Support Vector Machine
5.3 Logistic Regression
5.4 Gaussian Discriminant Analysis
5.5 Support Vector Machines
5.6 Duality
5.7 Nearest Neighbor Classification
6 Clustering
6.1 K-means Clustering
6.2 Mixture of Gaussians
6.3 Expectation Maximization (EM) Algorithm
7 Decision Tree Learning
7.1 Decision Trees
7.2 Random Forests
7.3 Boosting
8 Deep Learning
8.1 Convolutional Neural Networks
8.2 CNN Architectures
8.3 Visualizing and Understanding CNNs
Chapter 1
Regression I
Our goal in machine learning is to extract a relationship from data. In regression tasks, this relationship takes the form of a function y = f(x), where y ∈ R is some quantity that can be predicted from an input x ∈ R^d, which should for the time being be thought of as some collection of numerical measurements. The true relationship f is unknown to us, and our aim is to recover it as well as we can from data. Our end product is a function ŷ = h(x), called the hypothesis, that should approximate f. We assume that we have access to a dataset D = {(x_i, y_i)}_{i=1}^n, where each pair (x_i, y_i) is an example (possibly noisy or otherwise approximate) of the input-output mapping to be learned. Since learning arbitrary functions is intractable, we restrict ourselves to some hypothesis class H of allowable functions. More specifically, we typically employ a parametric model, meaning that there is some finite-dimensional vector w ∈ R^d, the elements of which are known as parameters or weights, that controls the behavior of the function. That is,

    h_w(x) = g(x, w)

for some other function g. The hypothesis class is then the set of all functions induced by the possible choices of the parameters w:

    H = {h_w | w ∈ R^d}

After designating a cost function L, which measures how poorly the predictions ŷ of the hypothesis match the true output y, we can proceed to search for the parameters that best fit the data by minimizing this function:

    w* = argmin_w L(w)
1.1 Ordinary Least Squares
Ordinary least squares (OLS) is one of the simplest regression problems, but it is well-understood and practically useful. It is a linear regression problem, which means that we take h_w to be of the form h_w(x) = x^T w. We want

    y_i ≈ ŷ_i = h_w(x_i) = x_i^T w
for each i= 1,...,n. This set of equations can be written in matrix form as
    [ x_1^T ]       [ y_1 ]
    [  ...  ] w  ≈  [ ... ]
    [ x_n^T ]       [ y_n ]

that is, Xw ≈ y. In words, the matrix X ∈ R^{n×d} has the input datapoint x_i^T as its ith row. This matrix is sometimes called the design matrix. Usually n > d, meaning that there are more datapoints than measurements.
There will in general be no exact solution to the equation y = Xw (even if the data were perfect, consider how many equations and variables there are), but we can find an approximate solution by minimizing the sum (or equivalently, the mean) of the squared errors:

    min_w Σ_{i=1}^n (x_i^T w − y_i)²  =  min_w ‖Xw − y‖²

Now that we have formulated an optimization problem, we want to go about solving it. We will see that the particular structure of OLS allows us to compute a closed-form expression for a globally optimal solution, which we denote w*_OLS.
Approach 1: Vector calculus
Calculus is the primary mathematical workhorse for studying the optimization of differentiable functions. Recall the following important result: if L : R^d → R is continuously differentiable, then any local optimum w* satisfies ∇L(w*) = 0. In the OLS case,
    L(w) = ‖Xw − y‖²
         = (Xw − y)^T (Xw − y)
         = (Xw)^T Xw − (Xw)^T y − y^T Xw + y^T y
         = w^T X^T X w − 2 w^T X^T y + y^T y

Using the following results from matrix calculus

    ∇_x (a^T x) = a
    ∇_x (x^T A x) = (A + A^T) x

the gradient of L is easily seen to be

    ∇L(w) = ∇_w ( w^T X^T X w − 2 w^T X^T y + y^T y )
          = ∇_w ( w^T X^T X w ) − 2 ∇_w ( w^T X^T y ) + ∇_w ( y^T y )
          = 2 X^T X w − 2 X^T y

where in the last line we have used the symmetry of X^T X to simplify X^T X + (X^T X)^T = 2 X^T X.
Setting the gradient to 0, we conclude that any optimum w*_OLS satisfies

    X^T X w*_OLS = X^T y

If X is full rank, then X^T X is as well (assuming n > d), so we can solve for the unique solution

    w*_OLS = (X^T X)^{-1} X^T y
Note: Although we write (X^T X)^{-1}, in practice one would not actually compute the inverse; it is more numerically stable to solve the linear system of equations above (e.g. with Gaussian elimination).
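As a concrete illustration (not from the original notes), here is a minimal NumPy sketch that fits OLS on synthetic data, solving the normal equations as a linear system rather than forming the inverse explicitly; the data and variable names are assumptions made for this example.

```python
import numpy as np

# Synthetic data: y = X w_true + noise (assumed setup for illustration only).
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X w = X^T y as a linear system.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalently, use a dedicated least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_ols, w_lstsq)  # both should be close to w_true
```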
In this derivation we have used the condition ∇L(w*) = 0, which is a necessary but not sufficient condition for optimality. We found a critical point, but in general such a point could be a local minimum, a local maximum, or a saddle point. Fortunately, in this case the objective function is convex, which implies that any critical point is indeed a global minimum. To show that L is convex, it suffices to compute the Hessian of L, which in this case is

    ∇²L(w) = 2 X^T X

and show that this is positive semi-definite:

    ∀w,  w^T (2 X^T X) w = 2 (Xw)^T Xw = 2 ‖Xw‖² ≥ 0
Approach 2: Orthogonal projection
There is also a linear algebraic way to arrive at the same solution: orthogonal projections.

Recall that if V is an inner product space and S a subspace of V, then any v ∈ V can be decomposed uniquely in the form

    v = v_S + v_⊥

where v_S ∈ S and v_⊥ ∈ S^⊥. Here S^⊥ is the orthogonal complement of S, i.e. the set of vectors that are perpendicular to every vector in S.

The orthogonal projection onto S, denoted P_S, is the linear operator that maps v to v_S in the decomposition above. An important property of the orthogonal projection is that

    ‖v − P_S v‖ ≤ ‖v − s‖

for all s ∈ S, with equality if and only if s = P_S v. That is,

    P_S v = argmin_{s ∈ S} ‖v − s‖
Proof. By the Pythagorean theorem,

    ‖v − s‖² = ‖(v − P_S v) + (P_S v − s)‖² = ‖v − P_S v‖² + ‖P_S v − s‖² ≥ ‖v − P_S v‖²

where the middle equality uses that v − P_S v ∈ S^⊥ and P_S v − s ∈ S, with equality holding if and only if ‖P_S v − s‖² = 0, i.e. s = P_S v. Taking square roots on both sides gives ‖v − s‖ ≥ ‖v − P_S v‖ as claimed (since norms are nonnegative). ∎

Here is a visual representation of the argument above:
In the OLS case,

    w*_OLS = argmin_w ‖Xw − y‖²

But observe that the set of vectors that can be written Xw for some w ∈ R^d is precisely the range of X, which we know to be a subspace of R^n, so

    min_{s ∈ range(X)} ‖s − y‖² = min_{w ∈ R^d} ‖Xw − y‖²

By pattern matching with the earlier optimality statement about P_S, we observe that P_range(X) y = X w*_OLS, where w*_OLS is any optimum for the right-hand side. The projected point X w*_OLS is always unique, but if X is full rank (again assuming n > d), then the optimum w*_OLS is also unique (as expected). This is because X being full rank means that the columns of X are linearly independent, in which case there is a one-to-one correspondence between w and Xw.
To solve for w*_OLS, we need the following fact:

    null(X^T) = range(X)^⊥

Since we are projecting onto range(X), the orthogonality condition for optimality is that y − Py ⊥ range(X), i.e. y − X w*_OLS ∈ null(X^T). This leads to the equation

    X^T (y − X w*_OLS) = 0

which is equivalent to

    X^T X w*_OLS = X^T y

as before.
1.2 Ridge Regression
While ordinary least squares can be used for solving linear least squares problems, it falls short due to numerical instability and generalization issues. Numerical instability arises when the features of the data are close to collinear (leading to linearly dependent feature columns), causing the input matrix X to lose its rank or have singular values that are very close to 0. Why are small singular values bad? Let us illustrate this via the singular value decomposition (SVD) of X:

    X = U Σ V^T

where U ∈ R^{n×n}, Σ ∈ R^{n×d}, V ∈ R^{d×d}. In the context of OLS, we must have that X^T X is invertible,
or equivalently, rank(X^T X) = rank(X^T) = rank(X) = d. Assuming that X and X^T are full column rank d, we can express the SVD of X as

    X = U [ Σ_d ] V^T
          [  0  ]

where Σ_d ∈ R^{d×d} is a diagonal matrix with strictly positive entries. Now let's try to expand the (X^T X)^{-1} term in OLS using the SVD of X:

    (X^T X)^{-1} = ( V [Σ_d  0] U^T U [ Σ_d ] V^T )^{-1}
                                      [  0  ]
                 = ( V Σ_d² V^T )^{-1}
                 = V Σ_d^{-2} V^T

This means that (X^T X)^{-1} will have singular values that are the squared inverses of the singular values of X, potentially leading to extremely large singular values when the singular values of X are close to 0. Such excessively large singular values can be very problematic for numerical stability purposes. In addition, abnormally large values in the optimal w would prevent OLS from generalizing to unseen data.
There is a very simple solution to these issues: penalize the entries of w from becoming too large. We can do this by adding a penalty term constraining the norm of w. For a fixed, small scalar λ > 0, we now have:

    min_w ‖Xw − y‖² + λ‖w‖²

Note that the λ in our objective function is a hyperparameter that measures the sensitivity to the values in w. Just like the degree in polynomial features, λ is a value that we must choose through validation. Let's expand the terms of the objective function:
    L(w) = ‖Xw − y‖² + λ‖w‖²
         = w^T X^T X w − 2 w^T X^T y + y^T y + λ w^T w

Finally, take the gradient of the objective and find the value of w that achieves 0 for the gradient:

    ∇_w L(w) = 0
    2 X^T X w − 2 X^T y + 2λw = 0
    (X^T X + λI) w = X^T y
    w*_ridge = (X^T X + λI)^{-1} X^T y
This value is guaranteed to achieve the (unique) global minimum, because the objective function is strongly convex. To show that L is strongly convex, it suffices to compute the Hessian of L, which in this case is

    ∇²L(w) = 2 X^T X + 2λI

and show that this is positive definite (PD):

    ∀w ≠ 0,  w^T (X^T X + λI) w = (Xw)^T Xw + λ w^T w = ‖Xw‖² + λ‖w‖² > 0
Since the Hessian is positive definite, we can equivalently say that the eigenvalues of the Hessian are strictly positive and that the objective function is strongly convex. A useful property of strongly convex functions is that they have a unique optimum point, so the solution to ridge regression is unique. We cannot make such guarantees about ordinary least squares, because the corresponding Hessian could have eigenvalues that are 0. Let us explore the case in OLS when the Hessian has a 0 eigenvalue. In this context, the term X^T X is not invertible, but this does not imply that no solution exists! In OLS, there always exists a solution; when the Hessian is PD that solution is unique, and when the Hessian is merely PSD, there are infinitely many solutions. (There always exists a solution to the expression X^T X w = X^T y, because the range of X^T X and the range of X^T are equivalent; since X^T y lies in the range of X^T, it must equivalently lie in the range of X^T X, and therefore there always exists a w that satisfies the equation X^T X w = X^T y.)
The technique we just described is known as ridge regression. Note that now the expression X^T X + λI is invertible, regardless of the rank of X. Let's find (X^T X + λI)^{-1} through the SVD:

    (X^T X + λI)^{-1} = ( V Σ_d² V^T + λ V V^T )^{-1}
                      = ( V (Σ_d² + λI) V^T )^{-1}
                      = V (Σ_d² + λI)^{-1} V^T

Now with our slight tweak, the matrix X^T X + λI has become full rank and thus invertible. The singular values have become 1/(σ_i² + λ) and 1/λ, meaning that the singular values are guaranteed to be at most 1/λ, solving our numerical instability issues. Furthermore, we have partially solved the overfitting issue. By penalizing the norm of w, we encourage the weights corresponding to relevant features that capture the main structure of the true model, and penalize the weights corresponding to complex features that only serve to fine-tune the model and fit noise in the data.
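The following is a small NumPy sketch (not from the original notes; the nearly collinear synthetic data is an assumption made for the demonstration) of the ridge solution (X^T X + λI)^{-1} X^T y.

```python
import numpy as np

# Two nearly collinear features make X^T X close to singular (assumed for the demo).
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=n)])
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=n)

lam = 0.1
d = X.shape[1]
# Ridge solution (X^T X + lambda I)^{-1} X^T y, computed as a linear solve.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge)  # stays moderate, whereas the unregularized solve can blow up
```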
1.3 Feature Engineering
We've seen that the least-squares optimization problem

    min_w ‖Xw − y‖²

represents the "best-fit" linear model, by projecting y onto the subspace spanned by the columns of X. However, the true input-output relationship y = f(x) may be nonlinear, so it is useful to consider nonlinear models as well. It turns out that we can still do this under the framework of linear least-squares, by augmenting the data with new features. In particular, we devise some function φ : R^ℓ → R^d, called a feature map, that maps each raw data point x ∈ R^ℓ into a vector of features φ(x). The hypothesis function then writes

    h_w(x) = Σ_{j=1}^d w_j φ_j(x) = w^T φ(x)

Note that the resulting model is still linear with respect to the features, but it is nonlinear with respect to the original data if φ is nonlinear. The component functions φ_j are sometimes called basis functions because our hypothesis is a linear combination of them. In the simplest case, we could just use the components of x as features (i.e. φ_i(x) = x_i), but in general it is helpful to disambiguate the features of an example from the example's entries.

We can then use least-squares to estimate the weights w, just as before. To do this, we replace the original data matrix X ∈ R^{n×ℓ} by Φ ∈ R^{n×d}, which has φ(x_i)^T as its ith row:

    min_w ‖Φw − y‖²
Example: Fitting Ellipses
Let's use least-squares to estimate the parameters of an ellipse from data.
Assume that we have n data points D = {(x_{1,i}, x_{2,i})}_{i=1}^n, which may be noisy (i.e. could be off the actual orbit). Our goal is to determine the relationship between x_1 and x_2.

We assume that the ellipse from which the points were generated has the form

    w_1 x_1² + w_2 x_2² + w_3 x_1 x_2 + w_4 x_1 + w_5 x_2 = 1

where the coefficients w_1, ..., w_5 are the parameters we wish to estimate.

We formulate the problem with least-squares:

    min_w ‖Φw − 1‖²

where 1 is the all-ones vector in R^n and

    Φ = [ x_{1,1}²  x_{2,1}²  x_{1,1} x_{2,1}  x_{1,1}  x_{2,1} ]
        [   ...       ...          ...          ...      ...   ]
        [ x_{1,n}²  x_{2,n}²  x_{1,n} x_{2,n}  x_{1,n}  x_{2,n} ]

In this case, the feature map φ is given by

    φ(x) = (x_1², x_2², x_1 x_2, x_1, x_2)

Note that there is no "target" vector y here, so this is not a traditional regression problem, but it still fits into the framework of least-squares.
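As a hedged illustration (not in the original text), here is how the ellipse fit above might look in NumPy; the sampled ellipse, noise level, and variable names are assumptions made for the example.

```python
import numpy as np

# Noisy points on the ellipse x1^2/9 + x2^2/2.25 = 1 (assumed ground truth for the demo).
rng = np.random.default_rng(2)
t = rng.uniform(0.0, 2.0 * np.pi, size=200)
x1 = 3.0 * np.cos(t) + 0.05 * rng.normal(size=t.size)
x2 = 1.5 * np.sin(t) + 0.05 * rng.normal(size=t.size)

# Feature map phi(x) = (x1^2, x2^2, x1*x2, x1, x2); the "target" is the constant 1.
Phi = np.column_stack([x1**2, x2**2, x1 * x2, x1, x2])
target = np.ones_like(x1)

w, *_ = np.linalg.lstsq(Phi, target, rcond=None)
print(w)  # roughly (1/9, 1/2.25, 0, 0, 0) for this axis-aligned ellipse
```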
Polynomial Features
The example above demonstrates an important class of features known as polynomial features.
Remember that a polynomial is a linear combination of monomial basis terms. Monomials can be classified in two ways, by their degree and dimension:

    Dimension \ Degree |  1         |  2                     |  3
    1 (univariate)     |  x         |  x²                    |  x³
    2 (bivariate)      |  x_1, x_2  |  x_1², x_1 x_2, x_2²   |  x_1³, x_1² x_2, x_1 x_2², x_2³
A big reason we care about polynomial features is that any smooth function can be approximated arbitrarily closely by some polynomial.² For this reason, polynomials are said to be universal approximators.

One downside of polynomials is that as their degree increases, their number of terms increases rapidly. Specifically, one can use a "stars and bars" style combinatorial argument³ to show that a polynomial of degree d in ℓ variables has

    (ℓ+d choose ℓ) = (ℓ+d)! / (ℓ! d!)

terms. To get an idea for how quickly this quantity grows, consider a few examples:

    (ℓ+d)!/(ℓ! d!) | d=1 | d=3  | d=5    | d=10      | d=25
    ℓ=1            | 2   | 4    | 6      | 11        | 26
    ℓ=3            | 4   | 20   | 56     | 286       | 3276
    ℓ=5            | 6   | 56   | 252    | 3003      | 142506
    ℓ=10           | 11  | 286  | 3003   | 184756    | 183579396
    ℓ=25           | 26  | 3276 | 142506 | 183579396 | 126410606437752
Later we will learn about the kernel trick, a clever mathematical method that allows us to
circumvent this rapidly growing cost in certain cases.
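A quick way to reproduce the counts in the table above (this check is an addition, not part of the notes): the number of terms is the binomial coefficient ℓ+d choose ℓ.

```python
from math import comb

# Number of monomials of degree at most d in l variables: comb(l + d, l).
for l in (1, 3, 5, 10, 25):
    print(l, [comb(l + d, l) for d in (1, 3, 5, 10, 25)])
# The last entry, comb(50, 25) = 126410606437752, matches the bottom-right cell of the table.
```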
1.4 Hyperparameters and Validation
As above, consider a hypothesis of the form

    h_w(x) = Σ_{j=1}^d w_j φ_j(x) = w^T φ(x)
² Taylor's theorem gives more precise statements about the approximation error.
³ We count the number of distinct monomials of degree at most d in ℓ variables x_1, ..., x_ℓ, or equivalently, the number of distinct monomials of degree exactly d in ℓ+1 variables x_0, x_1, ..., x_ℓ. Every monomial has the form x_0^{d_0} x_1^{d_1} ⋯ x_ℓ^{d_ℓ} with d_0 + d_1 + ⋯ + d_ℓ = d. This corresponds to an arrangement of d stars and ℓ bars, where the number of stars between consecutive bars (or the ends of the expression) gives the degree of that ordered variable. The number of unique ways to arrange these stars and bars is the number of ways to choose the positions of the ℓ bars out of the total ℓ+d slots, i.e. ℓ+d choose ℓ. (You could also pick the positions of the d stars out of the total ℓ+d slots; the expression is symmetric in ℓ and d.)
Observe that the model order d is not one of the decision variables being optimized when we fit to the data. For this reason d is called a hyperparameter. We might say more specifically that it is a model hyperparameter, since it determines the structure of the model.

For another example, recall ridge regression, in which we add an ℓ² penalty on the parameters:

    min_w ‖Xw − y‖² + λ‖w‖²

The regularization weight λ is also a hyperparameter, as it is fixed during the minimization above. However λ, unlike the previously discussed hyperparameter d, is not a part of the model. Rather, it is an aspect of the optimization procedure used to fit the model, so we say it is an optimization hyperparameter. Hyperparameters tend to fall into one of these two categories.

Since hyperparameters are not determined by the data-fitting optimization procedure, how should we choose their values? A suitable answer to this question requires some discussion of the different types of error at play.
Types of Error

We have seen that it is common to minimize some measure of how poorly our hypothesis fits the data we have, but what we actually care about is how well the hypothesis predicts future data. Let us try to formally distinguish the various types of error. Assume that the data are distributed according to some (unknown) distribution D, and that we have a loss function L : R × R → R, which is to measure the error between the true output y and our estimate ŷ = h(x). The risk (or true error) of a particular hypothesis h ∈ H is the expected loss over the whole data distribution:

    R(h) = E_{(x,y)~D}[ L(h(x), y) ]

Ideally, we would find the hypothesis that minimizes the risk, i.e.

    h* = argmin_{h ∈ H} R(h)

However, computing this expectation is impossible because we do not have access to the true data distribution. Rather, we have access to samples (x_i, y_i) ~ D. These enable us to approximate the real problem we care about by minimizing the empirical risk (or training error)

    R̂(h) = (1/n) Σ_{i=1}^n L(h(x_i), y_i)
But since we have a finite number of samples, the hypothesis that performs the best on the training data is not necessarily the best on the whole data distribution. In particular, if we both train and evaluate the hypothesis using the same data points, the training error will be a very biased estimate of the true error, since the hypothesis has been chosen specifically to perform well on those points. This phenomenon is sometimes referred to as "data incest".

A common solution is to set aside some portion (say 30%) of the data, to be called the validation set, which is disjoint from the training set and not allowed to be used when fitting the model.

[Figure: the data split into a validation set and a training set]
We can use this validation set to estimate the true error by the validation error

    R̂_val(h) = (1/n_val) Σ_{i=1}^{n_val} L(h(x_i^val), y_i^val)

where n_val is the number of points in the validation set. With this estimate, we have a simple method for choosing hyperparameter values: try a bunch of configurations of the hyperparameters and choose the one that yields the lowest validation error.
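To make the recipe concrete, here is a minimal sketch (synthetic data; the fit_ridge and mse helpers are hypothetical names introduced only for this example) of choosing the ridge hyperparameter λ by validation error.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    """Mean squared error of the predictions Xw against y."""
    r = X @ w - y
    return float(r @ r) / len(y)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=200)

X_train, y_train = X[:140], y[:140]   # ~70% used for training
X_val, y_val = X[140:], y[140:]       # ~30% held out, never used for fitting

candidates = [0.0, 0.01, 0.1, 1.0, 10.0]
val_errors = {lam: mse(X_val, y_val, fit_ridge(X_train, y_train, lam))
              for lam in candidates}
best_lam = min(val_errors, key=val_errors.get)
print(val_errors, best_lam)
```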
The effect of hyperparameters on error
Note that as we add more features to a linear model, training error can only decrease. This is because the optimizer can set w_i = 0 if feature i cannot be used to reduce training error.

[Figure: training error decreases monotonically with model order]
Adding more features tends to reduce true error as long as the additional features are useful predictors of the output. However, if we keep adding features, these begin to fit noise in the training data instead of the true signal, causing true error to actually increase. This phenomenon is known as overfitting.

[Figure: true error vs. model order, decreasing at first and then increasing due to overfitting]
The validation error tracks the true error reasonably well as long as the validation set is sufficiently large. The regularization hyperparameter λ has a somewhat different effect on training error. Observe that if λ = 0, we recover the exact OLS problem, which is directly minimizing the training error. As λ increases, the optimizer places less emphasis on the training error and more emphasis on reducing the magnitude of the parameters. This leads to a degradation in training error as λ grows.

[Figure: training error increases as the regularization weight λ grows]
Cross-validation
Setting aside a validation set works well, but comes at a cost, since we cannot use the validation data for training. Since having more data generally improves the quality of the trained model, we may prefer not to let that data go to waste, especially if we have little data to begin with and/or collecting more data is expensive. Cross-validation is an alternative to having a dedicated validation set.
k-fold cross-validation works as follows:

1. Shuffle the data and partition it into k equally-sized (or as equal as possible) blocks.

2. For i = 1, ..., k:
   • Train the model on all the data except block i.
   • Evaluate the model (i.e. compute the validation error) using block i.

   [Figure: in each fold, one block is held out for validation and the remaining blocks are used for training]

3. Average the k validation errors; this is our final estimate of the true error.
Observe that, although every datapoint is used for evaluation at some time or another, the model is always evaluated on a different set of points than it was trained on, thereby cleverly avoiding the "data incest" problem mentioned earlier.
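Here is a short sketch of the procedure above (not from the notes; the fit_ols and mse helper names, the choice of an OLS model, and the defaults are assumptions made for the example).

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares fit (hypothetical helper for this sketch)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def mse(X, y, w):
    r = X @ w - y
    return float(r @ r) / len(y)

def k_fold_cv_error(X, y, k=5, seed=0):
    """Average validation error over k folds (steps 1-3 from the text)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)   # step 1: shuffle ...
    folds = np.array_split(idx, k)                     # ... and partition into k blocks
    errors = []
    for i in range(k):                                 # step 2: hold out block i
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_ols(X[train_idx], y[train_idx])        # train on all other blocks
        errors.append(mse(X[val_idx], y[val_idx], w))  # evaluate on block i
    return float(np.mean(errors))                      # step 3: average the k errors
```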
Note also that this process (except for the shuffling and partitioning) must be repeated for every hyperparameter configuration we wish to test. This is the principal drawback of k-fold cross-validation as compared to using a held-out validation set: there is roughly k times as much computation required. This is not a big deal for the relatively small linear models that we've seen so far, but it can be prohibitively expensive when the model takes a long time to train, as is the case in the Big Data regime or when using neural networks.
Chapter 2

Regression II
2.1 MLE and MAP for Regression (Part I)
So far, we've explored two approaches to the regression framework, ordinary least squares and ridge regression:

    w*_OLS = argmin_w ‖y − Xw‖²
    w*_ridge = argmin_w ‖y − Xw‖² + λ‖w‖²

One question that arises is why we specifically use the ℓ² norm to measure the error of our predictions, and to penalize the model parameters. We will justify this design choice by exploring the statistical interpretations of regression: namely, we will employ Gaussians, MLE and MAP to validate what we've done so far through a different lens.
Probabilistic Model
In the context of supervised learning, we assume that there exists a true underlying model mapping inputs to outputs:

    x ↦ f(x)

The true model is unknown to us, and our goal is to find a hypothesis model that best represents the true model. The only information that we have about the true model is via a dataset

    D = {(x_i, y_i)}_{i=1}^n

where x_i ∈ R^d is the input and y_i ∈ R is the observation, a noisy version of the true output f(x_i):

    Y_i = f(x_i) + Z_i

We assume that x_i is a fixed value (which implies that f(x_i) is fixed as well), while Z_i is a random variable (which implies that Y_i is a random variable as well). We always assume that Z_i has zero mean, because otherwise there would be systematic bias in our observations. The Z_i's could be Gaussian, uniform, Laplacian, etc. In most contexts, we assume that they are independent identically distributed (i.i.d.) Gaussians: Z_i ~ N(0, σ²). We can therefore say that Y_i is a random variable whose probability distribution is given by

    Y_i ~ N(f(x_i), σ²)
Now that we have defined the model and data, we wish to find a hypothesis model h_θ (parameterized by θ) that best captures the relationships in the data, while possibly taking into account prior beliefs that we have about the true model. We can represent this as a probability problem, where the goal is to find the optimal model that maximizes our probability.

Maximum Likelihood Estimation

In Maximum Likelihood Estimation (MLE), the goal is to find the hypothesis model that maximizes the probability of the data. If we parameterize the set of hypothesis models with θ, we can express the problem as

    θ_MLE = argmax_θ L(θ; D) = p(data = D | true model = h_θ)

The quantity L(θ) that we are maximizing is also known as the likelihood, hence the term MLE. Substituting our representation of D we have

    θ_MLE = argmax_θ L(θ; X, y) = p(y_1, ..., y_n | x_1, ..., x_n, θ)
Note that we implicitly condition on the x_i's, because we treat them as fixed values of the data. The only randomness in our data comes from the y_i's (since they are noisy versions of the true values f(x_i)). We can further simplify the problem by working with the log likelihood ℓ(θ; X, y) = log L(θ; X, y):

    θ_MLE = argmax_θ L(θ; X, y) = argmax_θ ℓ(θ; X, y)

With logs we are still working with the same problem, because logarithms are monotonic functions. In other words, we have that:

    P(A) < P(B)  ⟺  log P(A) < log P(B)
Let's decompose the log likelihood:

    ℓ(θ; X, y) = log p(y_1, ..., y_n | x_1, ..., x_n, θ) = log Π_{i=1}^n p(y_i | x_i, θ) = Σ_{i=1}^n log p(y_i | x_i, θ)

We decoupled the probabilities from each datapoint because their corresponding noise components are independent. Note that the logs allow us to work with sums rather than products, simplifying the problem; this is one reason why the log likelihood is such a powerful tool. Each individual term p(y_i | x_i, θ) comes from a Gaussian

    Y_i | θ ~ N(h_θ(x_i), σ²)
Continuing with logs:

    θ_MLE = argmax_θ ℓ(θ; X, y)                                                     (2.1)
          = argmax_θ Σ_{i=1}^n log p(y_i | x_i, θ)                                  (2.2)
          = argmax_θ Σ_{i=1}^n [ −(y_i − h_θ(x_i))² / (2σ²) − log(σ√(2π)) ]         (2.3)
          = argmin_θ Σ_{i=1}^n (y_i − h_θ(x_i))²
    P(Z ∈ T(R')) = ∫ ⋯ ∫_{R'} f_Z(Ru) |det(R)| du_1 du_2 ⋯ du_n

Note the change of variables in the last step: we sum over the squares in U space instead of the parallelograms in Z space.

So far, we have shown that (for any dimension n)

    P(U ∈ R') = ∫ ⋯ ∫_{R'} f_U(u) du_1 du_2 ⋯ du_n

and

    P(Z ∈ T(R')) = ∫ ⋯ ∫_{R'} f_Z(Ru) |det(R)| du_1 du_2 ⋯ du_n
Notice that these two probabilities are equivalent! The probability that U takes a value in R' must equal the probability that the transformed random vector Z = RU takes a value in the transformed region T(R').

Therefore, we can say that

    P(U ∈ R') = ∫ ⋯ ∫_{R'} f_U(u) du_1 ⋯ du_n = ∫ ⋯ ∫_{R'} f_Z(Ru) |det(R)| du_1 ⋯ du_n = P(Z ∈ T(R'))

We conclude that

    f_U(u) = f_Z(Ru) |det(R)|

An almost identical argument will allow us to state that

    f_Z(z) = f_U(R^{-1}z) |det(R^{-1})| = (1 / |det(R)|) f_U(R^{-1}z)
Since the densities of the U_i's are i.i.d. standard Gaussians, and U = R^{-1}Z, we can write the joint density function of Z as

    f_Z(z) = (1 / |det(R)|) f_U(R^{-1}z)
           = (1 / |det(R)|) · (1 / (√(2π))^n) · exp( −‖R^{-1}z‖² / 2 )
           = (1 / |det(R)|) · (1 / (√(2π))^n) · exp( −z^T (RR^T)^{-1} z / 2 )

Note that RR^T is simply the covariance matrix of Z:

    Cov[Z] = E[ZZ^T] = E[RUU^T R^T] = R E[UU^T] R^T = R I R^T = RR^T

Thus the density function of Z can be written as

    f_Z(z) = (1 / (|det(R)| (√(2π))^n)) · exp( −z^T Σ^{-1} z / 2 )

where Σ = RR^T.
Furthermore, we know that

    |det(Σ)| = |det(RR^T)| = |det(R) det(R^T)| = |det(R) · det(R)| = |det(R)|²

and therefore

    f_Z(z) = (1 / (√(det(Σ)) (√(2π))^n)) · exp( −z^T Σ^{-1} z / 2 )
Estimating Gaussians from Data
For a particular multivariate Gaussian distribution N(μ, Σ), if we do not have the true mean and covariance μ, Σ, then our best bet is to use MLE to estimate them empirically with i.i.d. samples x_1, x_2, ..., x_n:

    μ̂ = (1/n) Σ_{i=1}^n x_i,    Σ̂ = (1/n) Σ_{i=1}^n (x_i − μ̂)(x_i − μ̂)^T
Note that the above formulas are not necessarily trivial and must be formally proven using MLE. Just to present a glimpse of the process, let's prove that these formulas hold for the case where we are dealing with 1-d data points. For notation purposes, assume that D = {x_1, x_2, ..., x_n} is the set of all training data points that belong to class k. Note that the data points are i.i.d. Our goal is to solve the following MLE problem:

    argmax_{μ, σ²} P(x_1, x_2, ..., x_n | μ, σ²) = argmax_{μ, σ²} Π_{i=1}^n P(x_i | μ, σ²)

Note that the objective above is not jointly convex, so we cannot simply take derivatives and set them to 0! Instead, we work with the negative log likelihood and decompose the minimization over σ² and μ into a nested optimization problem:

    min_{σ, μ} Σ_{i=1}^n [ (x_i − μ)² / (2σ²) + ln(σ) ]  =  min_σ min_μ Σ_{i=1}^n [ (x_i − μ)² / (2σ²) + ln(σ) ]
The optimization problem has been decomposed into an inner problem that optimizes for μ given a fixed σ², and an outer problem that optimizes for σ² given the optimal value μ̂. Let's first solve the inner optimization problem. Given a fixed σ², the objective is convex in μ, so we can simply take a partial derivative w.r.t. μ and set it equal to 0:

    ∂/∂μ Σ_{i=1}^n [ (x_i − μ)² / (2σ²) + ln(σ) ] = Σ_{i=1}^n −(x_i − μ)/σ² = 0  ⟹  μ̂ = (1/n) Σ_{i=1}^n x_i
Having solved the inner optimization problem, we now have that the outer problem is

    min_σ Σ_{i=1}^n [ (x_i − μ̂)² / (2σ²) + ln(σ) ]

Note that this objective is not convex in σ, so we must instead find the critical point of the objective that minimizes the objective. Assuming that σ > 0, the critical points are:

• σ → 0: assuming that not all of the points x_i are equal to μ̂, there are two terms that are at odds with each other: a 1/σ² term that blows up to ∞, and a ln(σ) term that blows down to −∞ as σ → 0. Note that the 1/σ² term blows up at a faster rate, so we conclude that

    lim_{σ→0} Σ_{i=1}^n [ (x_i − μ̂)² / (2σ²) + ln(σ) ] = ∞

• Points at which the derivative w.r.t. σ is 0:

    ∂/∂σ Σ_{i=1}^n [ (x_i − μ̂)² / (2σ²) + ln(σ) ] = Σ_{i=1}^n [ −(x_i − μ̂)²/σ³ + 1/σ ] = 0  ⟹  σ̂² = (1/n) Σ_{i=1}^n (x_i − μ̂)²

We conclude that the optimal point is

    μ̂ = (1/n) Σ_{i=1}^n x_i,    σ̂² = (1/n) Σ_{i=1}^n (x_i − μ̂)²
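As a quick numerical check (an addition, not part of the notes): for 1-d data the MLE formulas above are just the empirical mean and the empirical (1/n) variance. The synthetic sample below is an assumption made for the illustration.

```python
import numpy as np

# Synthetic 1-d samples from N(2, 3^2) (assumed for the demo).
rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=3.0, size=10000)

mu_hat = x.mean()                        # (1/n) sum_i x_i
sigma2_hat = ((x - mu_hat) ** 2).mean()  # (1/n) sum_i (x_i - mu_hat)^2
print(mu_hat, sigma2_hat)                # close to 2.0 and 9.0
```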
Isocontours
Let's try to understand in detail how to visualize a multivariate Gaussian distribution. For simplicity, let's consider a zero-mean Gaussian distribution N(0, Σ), which just leaves us with the covariance matrix Σ. Since Σ is a symmetric, positive semidefinite matrix, we can decompose it by the spectral theorem into Σ = VΛV^T, where the columns of V form an orthonormal basis in R^d, and Λ is a diagonal matrix with real, non-negative values. We wish to find its level set, or simply the set of all points x such that the probability density f(x) evaluates to a fixed constant k. This is equivalent to the level set ln(f(x)) = ln(k), which further reduces to

    x^T Σ^{-1} x = c

for some constant c. Without loss of generality, assume that this constant is 1. The level set x^T Σ^{-1} x = 1 is an ellipsoid with axes v_1, v_2, ..., v_d, with lengths √λ_1, √λ_2, ..., √λ_d respectively. Each axis of the ellipsoid is the vector √λ_i v_i, and we can verify that

    (√λ_i v_i)^T Σ^{-1} (√λ_i v_i) = λ_i v_i^T Σ^{-1} v_i = λ_i v_i^T (λ_i^{-1} v_i) = v_i^T v_i = 1
The entries of Λ dictate how elongated or shrunk the distribution is along each direction. In the case of isotropic distributions, the entries of Λ are all identical, meaning that the axes of the ellipsoid form a circle. In the case of anisotropic distributions, the entries of Λ are not necessarily identical, meaning that the resulting ellipsoid may be elongated/shrunken and also rotated.

Figure 2.1: Isotropic (left) vs. anisotropic (right) contours are ellipsoids with axes √λ_i v_i. Images courtesy of Professor Shewchuk's notes.
Properties
Let's state some well-known properties of multivariate Gaussians. Given a jointly Gaussian (JG) random vector Z ~ N(μ_Z, Σ_Z), the linear transformation AZ (where A is an appropriately dimensioned constant matrix) is also JG:

    AZ ~ N(Aμ_Z, AΣ_Z A^T)

We can derive the mean and covariance of AZ using the linearity of expectations:

    μ_{AZ} = E[AZ] = A E[Z] = Aμ_Z

and

    Σ_{AZ} = E[(AZ − E[AZ])(AZ − E[AZ])^T]
           = E[A(Z − E[Z])(Z − E[Z])^T A^T]
           = A E[(Z − E[Z])(Z − E[Z])^T] A^T
           = A Σ_Z A^T

Note that the statements above did not rely on the fact that Z is JG, so this reasoning applies to all random vectors. We know that AZ is JG itself, because it can be expressed as a linear transformation of i.i.d. Gaussians: AZ = ARU.

Now suppose that we have the partition Z = [X; Y] whose distribution is given by Z ~ N(μ_Z, Σ_Z), with

    μ_Z = [ μ_X ]        Σ_Z = [ Σ_XX  Σ_XY ]
          [ μ_Y ]              [ Σ_YX  Σ_YY ]

It turns out that the marginal distribution of the individual random vector X (and Y) is JG:

    X ~ N(μ_X, Σ_XX)
However, the converse is not necessarily true: if X and Y are each individually JG, it is not necessarily the case that [X; Y] is JG! To see why, let's suppose that X and Y are individually JG. Thus, we can express each as a linear transformation of i.i.d. Gaussian random variables:

    X = R_X U_X,    Y = R_Y U_Y

We would expect that the expression for the joint distribution would be JG:

    [ X ]  =  [ R_X   0  ] [ U_X ]
    [ Y ]     [  0   R_Y ] [ U_Y ]

However, since we cannot guarantee that the entries of U_X are independently distributed from the entries of U_Y, we cannot conclude that the joint distribution is JG. If the entries are independently distributed, then we would be able to conclude that the joint distribution is JG.
Let's now transition back to our discussion of Z. The conditional distribution of X given Y (and vice versa) is also JG:

    X | Y ~ N( μ_X + Σ_XY Σ_YY^{-1} (Y − μ_Y),  Σ_XX − Σ_XY Σ_YY^{-1} Σ_YX )

If X and Y are uncorrelated (that is, if Σ_XY = Σ_YX = 0), we can say that they are independent. Namely, the conditional distribution of X given Y does not depend on Y:

    X | Y ~ N( μ_X + 0·Σ_YY^{-1}(Y − μ_Y),  Σ_XX − 0·Σ_YY^{-1}·0 ) = N(μ_X, Σ_XX)

This also follows from the multivariate Gaussian pdf: when Σ_XY = 0, the joint covariance is block diagonal and the density factors as

    f_{X,Y}(x, y) = f_X(x) f_Y(y)

Note the significance of this statement. Given any two general random vectors, we cannot necessarily say "if they are uncorrelated, then they are independent". However, in the case of random vectors from the same JG joint distribution, we can make this claim.
2.4 MLE and MAP for Regression (Part II)
The power of probabilistic thinking is that it allows us a way to model situations that arise and adapt our approaches in a reasonably principled way. This is particularly true when it comes to incorporating information about the situation that comes from the physical context of the data-gathering process. In this note, we will explore what happens as we vary our assumptions about the noise in our data and the priors for our parameters, as well as the "importance" of certain training points.

So far we have used MLE and MAP to justify the optimization formulation of OLS and ridge regression, respectively. The MLE formulation assumes that the observation Y_i is a noisy version of the true underlying output:

    Y_i = f(x_i) + Z_i
where the noise for each datapoint is crucially i.i.d. The MAP formulation assumes that each model parameter W_j is distributed according to an i.i.d. Gaussian prior:

    W_j ~ N(μ_j, σ_h²)

So far, we have restricted ourselves to the case when the noise/parameters are i.i.d.:

    Z ~ N(0, σ²I),    W ~ N(μ_W, σ_h² I)

However, what about the case when the Z_i's/W_j's are non-identical or dependent on one another? We would like to explore the case when the observation noise and underlying parameters are jointly Gaussian with arbitrary individual covariance matrices, but are independent of each other:

    Z ~ N(0, Σ_Z),    W ~ N(μ_W, Σ_W)

It turns out that via a change of coordinates, we can reduce these non-i.i.d. problems back to the i.i.d. case and solve them using the original techniques we used to solve OLS and ridge regression! Changing coordinates is a powerful tool in thinking about machine learning.
Weighted Least Squares
The basic idea of weighted least squares is the following: we place more emphasis on the loss contributed from certain data points over others; that is, we care more about fitting some data points over others. It turns out that this weighted perspective is very useful as a building block when we go beyond traditional least-squares problems.
Optimization View
From an optimization perspective, the problem can be expressed as

    w_WLS = argmin_w Σ_{i=1}^n ω_i (y_i − x_i^T w)²

This objective is the same as OLS, except that each term in the sum is weighted by a positive coefficient ω_i. As always, we can vectorize this problem:

    w_WLS = argmin_w (y − Xw)^T Ω (y − Xw)

where the ith row of X is x_i^T, and Ω ∈ R^{n×n} is a diagonal matrix with Ω_ii = ω_i.

We rewrite the WLS objective as an OLS objective:

    w_WLS = argmin_w (y − Xw)^T Ω (y − Xw)
          = argmin_w (y − Xw)^T Ω^{1/2} Ω^{1/2} (y − Xw)
          = argmin_w (Ω^{1/2} y − Ω^{1/2} Xw)^T (Ω^{1/2} y − Ω^{1/2} Xw)
          = argmin_w ‖Ω^{1/2} y − Ω^{1/2} Xw‖²
This formulation is identical to OLS except that we have scaled the data matrix and the observation vector by Ω^{1/2}, and we conclude that

    w_WLS = ( (Ω^{1/2}X)^T (Ω^{1/2}X) )^{-1} (Ω^{1/2}X)^T Ω^{1/2} y = (X^T Ω X)^{-1} X^T Ω y
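A minimal NumPy sketch (synthetic data and weights assumed) confirming that the rescaling view and the closed form (X^T Ω X)^{-1} X^T Ω y agree:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=n)
omega = rng.uniform(0.5, 2.0, size=n)          # positive per-point weights (assumed)

# OLS on data rescaled by Omega^{1/2}.
Xs = np.sqrt(omega)[:, None] * X
ys = np.sqrt(omega) * y
w_rescaled, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

# Closed form with Omega = diag(omega).
w_closed = np.linalg.solve(X.T @ (omega[:, None] * X), X.T @ (omega * y))
print(np.allclose(w_rescaled, w_closed))       # True
```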
Probabilistic View
As in MLE, we assume that our observations y are noisy, but now suppose that some of the y_i's are more noisy than others. How can we take this into account in our learning algorithm so we can get a better estimate of the weights? Our probabilistic model looks like

    Y_i = x_i^T w + Z_i

where the Z_i's are still independent Gaussian random variables, but not necessarily identical: Z_i ~ N(0, σ_i²). Jointly, we have that Z ~ N(0, Σ_Z), where

    Σ_Z = [ σ_1²   0    ...   0   ]
          [  0    σ_2²  ...   0   ]
          [ ...               ... ]
          [  0     0    ...  σ_n² ]

We can morph the problem into an MLE one by scaling the data to make sure all the Z_i's are identically distributed, by dividing by σ_i:

    Y_i / σ_i = x_i^T w / σ_i + Z_i / σ_i

Note that the scaled noise entries are now i.i.d.:

    Z_i / σ_i ~ N(0, 1)
Jointly, we can express this change of coordinates as

    Σ_Z^{-1/2} Y ~ N( Σ_Z^{-1/2} Xw,  Σ_Z^{-1/2} Σ_Z Σ_Z^{-1/2} ) = N( Σ_Z^{-1/2} Xw,  I )

This change of variables is sometimes called the reparameterization trick. Now that the noise is i.i.d. under the change of coordinates, we can rewrite our original problem as a scaled MLE problem:

    w_MLE = argmin_w Σ_{i=1}^n (y_i − x_i^T w)² / (2σ_i²) + n log √(2π)
          = argmin_w Σ_{i=1}^n (y_i − x_i^T w)² / σ_i²

The MLE estimate of this scaled problem is equivalent to the WLS estimate of the original problem:

    w_WLS = ( (Σ_Z^{-1/2}X)^T (Σ_Z^{-1/2}X) )^{-1} (Σ_Z^{-1/2}X)^T Σ_Z^{-1/2} y = (X^T Σ_Z^{-1} X)^{-1} X^T Σ_Z^{-1} y
As long as no σ_i is 0, Σ_Z is invertible. Note that ω_i from the optimization perspective is directly related to σ_i² from the probabilistic perspective: ω_i = 1/σ_i², or at the level of matrices, Ω = Σ_Z^{-1}. As the variance σ_i² of the noise corresponding to data point i decreases, the weight ω_i increases: we are more concerned about fitting data point i because it is likely to match the true underlying de-noised point. Inversely, as the variance σ_i² increases, the weight ω_i decreases: we are less concerned about fitting data point i because it is noisy and should not be trusted.
Generalized Least Squares
Now let's consider the case when the noise random variables are dependent on one another. We have

    Y = Xw + Z

where Z is now a jointly Gaussian random vector. That is,

    Z ~ N(0, Σ_Z),    Y ~ N(Xw, Σ_Z)

This problem is known as generalized least squares. Our goal is to maximize the probability of our data over the set of possible w's:

    w_GLS = argmax_w (1 / ((√(2π))^n √(det(Σ_Z)))) · exp( −(1/2)(y − Xw)^T Σ_Z^{-1} (y − Xw) )
          = argmin_w (y − Xw)^T Σ_Z^{-1} (y − Xw)

The optimization problem is therefore given by

    w_GLS = argmin_w (y − Xw)^T Σ_Z^{-1} (y − Xw)

Since Σ_Z is symmetric, we can decompose it into its eigenstructure using the spectral theorem:

    Σ_Z = Q [ σ_1²   0    ...   0   ] Q^T
            [  0    σ_2²  ...   0   ]
            [ ...               ... ]
            [  0     0    ...  σ_n² ]
where Q is orthonormal. As before with weighted least squares, our goal is to find an appropriate linear transformation so that we can reduce the problem into the i.i.d. case.

Consider

    Σ_Z^{-1/2} = Q [ 1/σ_1    0    ...    0   ] Q^T
                   [   0    1/σ_2  ...    0   ]
                   [  ...                ...  ]
                   [   0      0    ...  1/σ_n ]

We can scale the data to morph the problem into an MLE problem with i.i.d. noise variables, by premultiplying the data matrix X and the observation vector y by Σ_Z^{-1/2}. Jointly, we can express this change of coordinates as

    Σ_Z^{-1/2} Y ~ N( Σ_Z^{-1/2} Xw,  Σ_Z^{-1/2} Σ_Z Σ_Z^{-1/2} ) = N( Σ_Z^{-1/2} Xw,  I )

Consequently, in a very similar fashion to the independent noise problem, the MLE of the scaled dependent noise problem is

    w_GLS = (X^T Σ_Z^{-1} X)^{-1} X^T Σ_Z^{-1} y
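The sketch below (not from the notes) checks the GLS closed form against an explicit whitening step; it whitens with a Cholesky factor L satisfying L L^T = Σ_Z, a different but equally valid whitening transformation than the symmetric square root used above, and the synthetic covariance is an assumption.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 200, 2
X = rng.normal(size=(n, d))
A = rng.normal(size=(n, n))
Sigma = A @ A.T / n + 0.1 * np.eye(n)        # an arbitrary PD noise covariance (assumed)
L = np.linalg.cholesky(Sigma)                # L L^T = Sigma
y = X @ np.array([2.0, -3.0]) + L @ rng.normal(size=n)   # correlated noise

# Closed form: (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y.
Sigma_inv = np.linalg.inv(Sigma)
w_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

# Whitened version: premultiply X and y by L^{-1}, then run ordinary least squares.
Xw = np.linalg.solve(L, X)
yw = np.linalg.solve(L, y)
w_white, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
print(np.allclose(w_gls, w_white))           # True
```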
“Ridge Regression” with Dependent Parameters
In the ordinary least squares (OLS) statistical model, we assume that the output Y is a linear function of the input, plus some Gaussian noise. We take this one step further in MAP estimation, where we assume that the weights are a random variable. The new statistical model is

    Y = XW + Z

where Y and Z are n-dimensional random vectors, W is a d-dimensional random vector, and X is a fixed n × d matrix. Note that random vectors are not notationally distinguished from matrices here, so keep in mind what each symbol represents.

We have seen that ridge regression can be derived by assuming a prior distribution on W in which the W_j are i.i.d. (univariate) Gaussian, or equivalently,

    W ~ N(0, I)

But more generally, we can allow W to be any multivariate Gaussian:

    W ~ N(μ_W, Σ_W)
Recall that we can rewrite a multivariate Gaussian variable as an affine transformation of a standard Gaussian variable:

    W = Σ_W^{1/2} V + μ_W,    V ~ N(0, I)

Plugging this parameterization into our previous statistical model gives

    Y = X(Σ_W^{1/2} V + μ_W) + Z

But this can be re-written as

    Y − Xμ_W = X Σ_W^{1/2} V + Z

which we see has the form of the statistical problem that underlies traditional ridge regression with λ = 1, and therefore

    v̂ = ( Σ_W^{1/2} X^T X Σ_W^{1/2} + I )^{-1} Σ_W^{1/2} X^T (y − Xμ_W)
However, V is not what we care about; we need to convert back to the actual weights W in order to make predictions. Since W is completely determined by V (assuming fixed mean and covariance),

    ŵ = Σ_W^{1/2} v̂ + μ_W
      = μ_W + Σ_W^{1/2} ( Σ_W^{1/2} X^T X Σ_W^{1/2} + I )^{-1} Σ_W^{1/2} X^T (y − Xμ_W)
      = μ_W + ( X^T X + Σ_W^{-1} )^{-1} X^T (y − Xμ_W)

Note that there are two terms: the prior mean μ_W, plus another term that depends on both the data and the prior. The positive-definite precision matrix of W's prior (Σ_W^{-1}) controls how the data-fit error affects our estimate. This is called Tikhonov regularization in the literature and generalizes ridge regularization.
To gain intuition, let us consider the simplified case where

    Σ_W = [ σ_1²   0    ...   0   ]
          [  0    σ_2²  ...   0   ]
          [ ...               ... ]
          [  0     0    ...  σ_d² ]

When the prior variance σ_j² for dimension j is large, the prior is telling us that W_j may take on a wide range of values. Thus we do not want to penalize that dimension as much, preferring to let the data fit sort it out. And indeed the corresponding entry in Σ_W^{-1} will be small, as desired. Conversely, if σ_j² is small, there is little variance in the value of W_j, so W_j ≈ μ_j. As such, we penalize the magnitude of the data-fit contribution to W_j more heavily.

If all the σ_j² are the same, then we have traditional ridge regularization.
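A small sketch (synthetic data, an assumed diagonal prior, and unit-variance noise) of the Tikhonov estimate μ_W + (X^T X + Σ_W^{-1})^{-1} X^T (y − Xμ_W):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 100, 3
X = rng.normal(size=(n, d))
mu_W = np.array([0.0, 1.0, 0.0])               # assumed prior mean
prior_var = np.array([10.0, 0.01, 1.0])        # large variance = weak penalty, small = strong
Sigma_W_inv = np.diag(1.0 / prior_var)

w_true = np.array([1.5, 1.4, -0.5])
y = X @ w_true + 0.3 * rng.normal(size=n)

w_map = mu_W + np.linalg.solve(X.T @ X + Sigma_W_inv, X.T @ (y - X @ mu_W))
print(w_map)  # the tightly constrained second coordinate is pulled toward its prior mean 1.0
```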
Alternative derivation: directly conditioning jointly Gaussian random variables
In an explicitly probabilistic perspective, MAP with colored noise (and known X) can be expressed as

    U, V ~ N(0, I)  i.i.d.                                     (2.6)

    [ Y ]  =  [ R_Z   X R_W ] [ U ]                            (2.7)
    [ W ]     [  0     R_W  ] [ V ]

where R_Z and R_W relate the standard Gaussians to Z and W, respectively. Note that R_W appears twice because our model assumes Y = XW + noise, so if W = R_W V, then we must have Y = X R_W V + noise.

We want to find the posterior W | Y = y. The formulation above makes it relatively easy to find the posterior of Y conditioned on W (see below), but not vice-versa. So let's pretend instead that we have independent U', V' ~ N(0, I) and

    [ W ]  =  [ A   B ] [ U' ]
    [ Y ]     [ 0   D ] [ V' ]
Now W | Y = y is straightforward. Since V' = D^{-1}Y, the conditional mean and variance of W | Y = y can be computed as follows:

    E[W | Y = y] = E[AU' + BV' | Y = y]
                 = E[AU' | Y = y] + E[BD^{-1}Y | Y = y]
                 = A E[U'] + E[BD^{-1}Y | Y = y]
                 = 0 + BD^{-1}y
                 = BD^{-1}y

    Var(W | Y = y) = E[(W − E[W])(W − E[W])^T | Y = y]
                   = E[(AU' + BD^{-1}Y − BD^{-1}Y)(AU' + BD^{-1}Y − BD^{-1}Y)^T | Y = y]
                   = E[(AU')(AU')^T | Y = y]
                   = E[AU'(U')^T A^T]
                   = A E[U'(U')^T] A^T
                   = A A^T

In both cases above where we drop the conditioning on Y, we are using the fact that U' is independent of V' (and thus independent of Y = DV'). Therefore

    W | Y = y ~ N( BD^{-1}y,  AA^T )
Recall that a Gaussian distribution is completely specified by its mean and covariance matrix. We see that the covariance matrix of the joint distribution is

    Cov( [ W ] )  =  [ A   B ] [ A   B ]^T  =  [ AA^T + BB^T   BD^T ]  =  [ Σ_W    Σ_WY ]
         [ Y ]       [ 0   D ] [ 0   D ]       [    DB^T       DD^T ]     [ Σ_YW   Σ_Y  ]

Matching the corresponding terms, we can express the conditional mean and variance of W | Y = y in terms of these (cross-)covariance matrices:

    BD^{-1}y = BD^T D^{-T} D^{-1} y = (BD^T)(DD^T)^{-1} y = Σ_WY Σ_Y^{-1} y

    AA^T = AA^T + BB^T − BB^T
         = AA^T + BB^T − BD^T D^{-T} D^{-1} DB^T
         = AA^T + BB^T − (BD^T)(DD^T)^{-1} DB^T
         = Σ_W − Σ_WY Σ_Y^{-1} Σ_YW
We can then apply the same reasoning to the original setup:

    Cov( [ Y ] )  =  [ R_Z R_Z^T + X R_W R_W^T X^T    X R_W R_W^T ]  =  [ Σ_Y    Σ_YW ]
         [ W ]       [       R_W R_W^T X^T             R_W R_W^T  ]     [ Σ_WY   Σ_W  ]

Therefore, after defining Σ_Z = R_Z R_Z^T, we can read off

    Σ_W = R_W R_W^T
    Σ_Y = Σ_Z + X Σ_W X^T
    Σ_YW = X Σ_W
    Σ_WY = Σ_W X^T

Plugging this into our estimator yields

    ŵ = E[W | Y = y] = Σ_WY Σ_Y^{-1} y = Σ_W X^T ( Σ_Z + X Σ_W X^T )^{-1} y
One may be concerned because this expression does not take the form we expect: the inverted matrix is hitting y directly, unlike in other solutions we've seen. Although this form will turn out to be quite informative when we introduce the idea of the kernel trick in machine learning, it is still disconcertingly different from what we are used to.

However, by using a lot of algebra together with the Woodbury matrix identity¹, it turns out that we can rewrite this expression as

    ŵ = ( X^T Σ_Z^{-1} X + Σ_W^{-1} )^{-1} X^T Σ_Z^{-1} y

which looks more familiar. In fact, you can recognize this as the general solution when we have both a generic Gaussian prior on the parameters and colored noise in the observations.
Summary of Linear Gaussian Statistical Models
We have seen a number of related linear models, with varying assumptions about the randomness in the observations and the weights. We summarize these below:

    Prior on W \ Noise Z |  Z ~ N(0, I)                                    |  Z ~ N(0, Σ_Z)
    No prior             |  (X^T X)^{-1} X^T y                             |  (X^T Σ_Z^{-1} X)^{-1} X^T Σ_Z^{-1} y
    W ~ N(0, λ^{-1} I)   |  (X^T X + λI)^{-1} X^T y                        |  (X^T Σ_Z^{-1} X + λI)^{-1} X^T Σ_Z^{-1} y
    W ~ N(μ_W, Σ_W)      |  μ_W + (X^T X + Σ_W^{-1})^{-1} X^T (y − Xμ_W)   |  μ_W + (X^T Σ_Z^{-1} X + Σ_W^{-1})^{-1} X^T Σ_Z^{-1} (y − Xμ_W)
2.5 Kernels and Ridge Regression
In ridge regression, we are given a vector y ∈ R^n and a matrix X ∈ R^{n×ℓ}, where n is the number of training points and ℓ is the dimension of the raw data points. In most settings we don't want to work with just the raw feature space, so we augment features to the data points and replace X with Φ ∈ R^{n×d}, whose ith row is φ_i^T = φ(x_i)^T, where φ(x_i) ∈ R^d. Then we solve a well-defined optimization problem that involves Φ and y, over the parameters w ∈ R^d. Note the problem that arises here. If we have polynomial features of degree at most p in the raw ℓ-dimensional space, then there are d = (ℓ+p choose p) terms that we need to optimize, which can be very, very large (much larger than the number of training points n). Wouldn't it be useful if, instead of solving an optimization problem over d variables, we could solve an equivalent problem over (potentially much smaller) n variables, and achieve a computational runtime independent of the number of augmented features? As it turns out, the concept of kernels (in addition to a technique called the kernel trick) will allow us to achieve this goal. Recall the solution to ridge regression:

    w* = ( Φ^T Φ + λI )^{-1} Φ^T y

This operation involves calculating Φ^T Φ, which is a d × d matrix and takes O(d²n) time to compute. The matrix inversion operation takes an additional O(d³) time to compute. What we would really like is to have an n × n matrix that takes O(n³) time to invert. Here's a simple observation: if we flip the order of Φ^T and Φ, we end up with an n × n matrix ΦΦ^T. In fact, the matrix ΦΦ^T has a very intuitive meaning: it is the matrix of inner products between all of the augmented datapoints, which in loose terms measures the "similarity" among the datapoints and captures their relationships. Now let's see if we could somehow express the solution to ridge regression using the matrix ΦΦ^T.
¹ Woodbury matrix identity: (A + UCV)^{-1} = A^{-1} − A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
For simplicity of notation, let's revert back to using X instead of Φ (pretend that we are only working with raw features; our analysis of kernel ridge regression still holds if we use just the raw features). Rearranging the terms of the original ridge regression solution, we have

    w = ( X^T X + λI )^{-1} X^T y
    ( X^T X + λI ) w = X^T y
    X^T X w + λw = X^T y
    λw = X^T y − X^T X w
    w = (1/λ) ( X^T y − X^T X w )
    w = X^T ( (1/λ)(y − Xw) )
which says that whatever w is, it is some linear combination of the training points x_i (because anything of the form X^T v is a linear combination of the columns of X^T, which are the training points). To find w it suffices to find v, where w = X^T v.

Recall that the relationship we have to satisfy is X^T X w + λw = X^T y. Let's assume that we had v, and just substitute X^T v in for all the w's:

    X^T X (X^T v) + λ (X^T v) = X^T y
    X^T X X^T v + X^T (λv) = X^T y
    X^T ( X X^T v + λv ) = X^T ( y )
We can't yet isolate v and have a closed-form solution for it, but we can make the observation that if we found a v such that we had

    X X^T v + λv = y

that would imply that this v also satisfies the above equation. Note that we did not "cancel the X^T's on both sides of the equation." We saw that having v satisfy one equation implied that it satisfied the other as well. So, indeed, we can isolate v in this new equation:

    ( X X^T + λI ) v = y  ⟹  v = ( X X^T + λI )^{-1} y

and have that the v which satisfies this equation will be such that X^T v equals w. We conclude that the optimal w is

    w* = X^T v* = X^T ( X X^T + λI )^{-1} y

Recall that previously, we derived ridge regression and ended up with

    w* = ( X^T X + λI )^{-1} X^T y

In fact, these two are equivalent expressions! The question that now arises is which expression should you pick? Which is more efficient to calculate? We will answer this question after we introduce kernels.
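A quick numerical confirmation (an addition, not part of the notes) that the two expressions coincide, in an assumed regime with more features than points where the n × n form is the cheaper one:

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 50, 200                      # d > n: the n x n solve is the cheaper one
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lam = 0.5

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # d x d system
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)     # n x n system
print(np.allclose(w_primal, w_dual))                             # True
```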
Linear Algebra Derivation
The previous derivation involved using some intuitive manipulations to achieve the desired answer. Let's formalize our derivation using more principled arguments from linear algebra and optimization. Before we do so, we must first introduce the Fundamental Theorem of Linear Algebra (FTLA): Suppose that there is a matrix (linear map) X that maps R^ℓ to R^n. Denote N(X) as the nullspace of X, and R(X) as the range of X. Then the following properties hold:

1. N(X) ⊕⊥ R(X^T) = R^ℓ, and N(X^T) ⊕⊥ R(X) = R^n by symmetry. The symbol ⊕ indicates that we are taking a direct sum of N(X) and R(X^T), which means that for every u ∈ R^ℓ there exist unique elements u_1 ∈ N(X) and u_2 ∈ R(X^T) such that u = u_1 + u_2. Furthermore, the symbol ⊥ indicates that N(X) and R(X^T) are orthogonal subspaces.

2. N(X^T X) = N(X), and N(X X^T) = N(X^T) by symmetry.

3. R(X^T X) = R(X^T), and R(X X^T) = R(X) by symmetry.
Here's where FTLA comes in, in the context of kernel ridge regression. We know that we can express any w ∈ R^ℓ as a unique combination w = w_1 + w_2, where w_1 ∈ R(X^T) and w_2 ∈ N(X). Equivalently, we can express this as w = X^T v + r, where v ∈ R^n and r ∈ N(X). Now, instead of optimizing over w ∈ R^ℓ, we can optimize over v ∈ R^n and r ∈ R^ℓ, which equates to optimizing over n + ℓ variables. However, as we shall see, the optimization over r will be trivial, so we just have to optimize an n-dimensional problem.

We know that w = X^T v + r, where v ∈ R^n and r ∈ N(X). Let's now solve ridge regression by optimizing over the variables v and r instead of w:

    v*, r* = argmin_{v ∈ R^n, r ∈ N(X)} ‖Xw − y‖² + λ‖w‖²
           = argmin_{v ∈ R^n, r ∈ N(X)} ‖X(X^T v + r) − y‖² + λ‖X^T v + r‖²
           = argmin_{v ∈ R^n, r ∈ N(X)} ‖X X^T v + Xr − y‖² + λ‖X^T v + r‖²
           = argmin_{v ∈ R^n, r ∈ N(X)} ( v^T X X^T X X^T v − 2 v^T X X^T y + y^T y ) + λ ( v^T X X^T v + 2 (Xr)^T v + r^T r )
           = argmin_{v ∈ R^n, r ∈ N(X)} ( v^T X X^T X X^T v − 2 v^T X X^T y ) + λ ( v^T X X^T v + r^T r )
We crossed out Xr and 2(Xr)^T v because r ∈ N(X) and therefore Xr = 0. Now we are optimizing a function L(v, r), which is jointly convex in v and r, because its Hessian is PSD. Let's show that this is indeed the case:

    ∇²_r L(v, r) = 2λI ⪰ 0
    ∇_r ∇_v L(v, r) = ∇_v ∇_r L(v, r) = 0
    ∇²_v L(v, r) = 2 X X^T X X^T + 2λ X X^T ⪰ 0

Since the cross terms of the Hessian are 0, it suffices that ∇²_v L(v, r) and ∇²_r L(v, r) are PSD to establish joint convexity. With joint convexity established, we can set the gradient to 0 w.r.t. r and v and obtain the global minimum:

    ∇_r L(v*, r*) = 2λ r* = 0  ⟹  r* = 0