
Lecture Notes for STAT240 (Robust and Nonparametric Statistics)

Jacob Steinhardt
Last updated: April 1, 2021

[Lecture 1]

1 What is this course about?


Consider the process of building a statistical or machine learning model. We typically first collect training
data, then fit a model to that data, and finally use the model to make predictions on new test data.
In theory and in practice, we generally rely on the train and test data coming from the same distribution,
or at least being closely related in some way. However, there are several ways this could fail to be the case:
1. The data collection process itself could be noisy and thus not reflect the actual underlying signal we
wish to learn. For instance, there could be human error in labelling or annotation, or measurement
error due to imperfect sensors.
2. There could be distributional shift, due to changes in the world over time or because we seek to deploy
the model in some new situation (a language model trained on news articles but deployed on twitter).
There might also be noise e.g. due to a sensor failing.
Robustness concerns what we should do when the train and test distribution are not the same, for any of
the reasons above. There are two underlying perspectives influencing the choice of material in this course.
First, we are generally interested in worst-case rather than average-case robustness. For instance, when
handling data collection errors we will avoid modeling the errors as random noise and instead build procedures
that are robust to any errors within some allowed family. We prefer this because average-case robustness
requires the errors to satisfy a single, specific distribution for robustness guarantees to be meaningful, while a
goal of robustness is to handle unanticipated situations that are difficult to model precisely in advance.
Second, we will study robustness in high-dimensional settings. Many natural approaches to robustness
that work in low dimensions fail in high dimensions. For instance, the median is a robust estimate of the
mean in one dimension, but the per-coordinate median is a poor robust estimator when the dimension is large (its error grows as √d in d dimensions). We will see that more sophisticated estimators can substantially improve on this first attempt.
Complementary to robustness is the idea of model mis-specification. When the true distribution p∗ lies
within our model family, many robustness issues are less severe: we can rely on the model to extrapolate to
new settings, and we can often get well-calibrated uncertainty estimates. This motivates the second focus of
the course, nonparametric modeling, where we consider broad function classes (e.g. all smooth functions)
that are more likely to be correctly specified. Another connection between nonparametrics and robustness is
that we often want robust methods to work for any distribution within some large, infinite-dimensional class.

Overarching framework. Most robustness questions can be cast in the following way: We let p∗ denote the true test distribution we wish to estimate, and assume that training data X1, . . . , Xn is sampled i.i.d. from some distribution p̃ such that D(p̃, p∗) ≤ ε according to some discrepancy D. We also assume that p∗ ∈ G, which encodes the distributional assumptions we make (e.g. that p∗ has bounded moments or tails, which is typically necessary for robust estimation to be possible). We benchmark an estimator θ̂(X1, . . . , Xn) according to some cost L(p∗, θ̂) (the test error). The diagram in Figure 1 illustrates this.
This framework captures both of the examples discussed at the beginning. However, it will be profitable to think about each case separately due to their different emphases:

[Figure 1 diagram: the test distribution p∗ and train distribution p̃ satisfy D(p∗, p̃) ≤ ε and are subject to the distributional assumptions G; samples X1, . . . , Xn drawn from p̃ yield estimated parameters θ̂(X1, . . . , Xn), which are evaluated by the cost L(p∗, θ̂).]
Figure 1: Framework for studying robust learning.

1. For corrupted training data, we think of p̃ as being corrupted and p∗ as being nice.
2. For distributional shift, we think of p̃ and p∗ as both being nice (but different).
Additionally, since both p̃ and p∗ are nice for distributional shift, we should have greater ambitions and seek
to handle larger differences between train and test than in the corruption cases.

Training robustness. Designing robust estimators for training corruptions usually involves reasoning
about what the real data “might have” looked like. This could involve operations such as removing outliers,
smoothing points away from extremes, etc. Unfortunately, many intuitive algorithms in low dimensions
achieve essentially trivial bounds in high dimensions. We will show how to achieve more meaningful bounds,
focusing on three aspects:
1. good dependence of the error on the dimension,
2. good finite-sample bounds,
3. computational tractability.
Each of these aspects turns out to require new machinery and we will devote roughly equal space to each.

Distributional shift. For distributional shift, we often seek invariant features or structure that can transfer
information from the train to test distributions. We can also counteract distribution shift by training on more
diverse data. Finally, we can use model uncertainty to infer out-of-distribution error, but these inferences can
be fraught if the model is mis-specified.

Table 1: Comparison of different robust settings.

                      Train robustness        Distributional shift
                      p∗ nice                 p∗ and p̃ both nice
                      undo corruptions        invariant features
                                              diverse training data

Nonparametric modeling. This brings us to nonparametric methods. The simplest way to be non-
parametric is through the model–using a rich class such as smooth functions, or neural networks. This
mitigates model mis-specification but raises new statistical challenges. Familiar phenomena such as the
central limit theorem cease to hold, and error rates are instead governed by the eigenvalue spectrum of an
infinite-dimensional kernel. Aside from the model, we can also be nonparametric at inference time; an example
is the bootstrap, which constructs confidence intervals that are more robust to model mis-specification than
classical parametric tests.

Figure 2: Possible corruptions to be robust to. Left: data contains outliers. Middle: outputs are perturbed
(process noise); Right: inputs are perturbed (measurement error).

Summary. We will cover each of training time robustness, distribution shift, and nonparametric modeling,
drawing connections as we go. The first third will focus on training time, also building up the statistical
machinery needed to prove generalization bounds. Then, we will shift focus to model mis-specification, and to
nonparametric methods as a remedy. These include kernel regression, the bootstrap, and neural networks, and we will see both how they mitigate model mis-specification and how to analyze their generalization performance. Finally, we will turn to distribution shift, and see that nonparametric models are often robust
to distribution shift when trained on the right data. Training robustness will receive a largely theoretical
treatment, while for nonparametrics and distribution shift we will see a mix of theoretical and empirical
results.

2 Training Time Robustness


We will start our investigation with training time robustness. As in Figure 1, we observe samples X1, . . . , Xn from a corrupted training distribution p̃, whose relationship to the true (test) distribution is controlled by the constraint D(p̃, p∗) ≤ ε. We additionally constrain p∗ ∈ G, which encodes our distributional assumptions.
Note that this setting corresponds to an oblivious adversary that can only apply corruptions at the population level (changing p∗ to p̃); we can also consider a more powerful adaptive adversary that can apply corruptions to the samples themselves. Such an adversary is called adaptive because it is allowed to adapt to the random draw of the sample points X1, . . . , Xn. Formally defining adaptive adversaries is somewhat technical and we defer this to later.
Figure 2 illustrates several ways in which a training distribution could be corrupted. In the left panel, an ε fraction of real points have been replaced by outliers. This can be modeled by the discrepancy D(p, q) = TV(p, q), where TV is the total variation distance:

TV(p, q) := sup{|p(E) − q(E)| : E is a measurable event}.   (1)

If p and q both have densities, then an equivalent characterization is TV(p, q) = (1/2) ∫ |p(x) − q(x)| dx.
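For distributions on a finite support, both characterizations of TV are easy to compute numerically; the following is a minimal sketch (assuming numpy, with an illustrative helper name) showing that replacing an ε-fraction of the mass with outliers moves a distribution by exactly ε in TV.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions
    given as probability vectors on a common finite support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # Equals sup_E |p(E) - q(E)|: the optimal event is E = {x : p(x) > q(x)}.
    return 0.5 * np.abs(p - q).sum()

# Replacing an eps-fraction of a distribution with outliers moves it by eps in TV.
eps = 0.1
p_star = np.array([0.5, 0.5, 0.0])                               # true distribution
p_tilde = (1 - eps) * p_star + eps * np.array([0.0, 0.0, 1.0])   # eps mass of outliers
print(tv_distance(p_star, p_tilde))                              # 0.1
```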

In the middle and right panels of Figure 2, either the inputs or outputs have been moved slightly. Both
operations can be modeled using Wasserstein distances (also called earthmover distances), which we will
discuss later. For now, however, we will focus on the case of handling outliers. Thus for the next several
sections our discrepancy will be the total variation distance D = TV.

2.1 Robustness to Outliers in 1 Dimension


First consider mean estimation in one dimension: we observe n data points x1 , . . . , xn ∈ R drawn from p̃,
and our goal is to estimate the mean µ = Ex∼p∗ [x] of p∗ . Accordingly our loss is L(p∗ , θ) = |θ − µ(p∗ )|.
The following histogram illustrates a possible dataset, where the height of each bar represents the number
of points with a given value:

Are the red points outliers? Or part of the real data? Depending on the conclusion, the estimated mean
could vary substantially. Without further assumptions on the data-generating distribution p∗ , we cannot rule
out either case. This brings us to an important principle:
With no assumptions on the distribution p∗ , robust estimation is impossible.
In particular, we must make assumptions that are strong enough to reject sufficiently extreme points as
outliers, or else even a small fraction of such points can dominate the estimate of the mean. For simplicity,
here and in the next several sections we will assume that we directly observe the training distribution p̃ rather
than samples x1:n from p̃. This allows us to avoid analyzing finite-sample concentration, which requires
introducing additional technical tools that we will turn to in Section 2.5.

Assumption: bounded variance. One possible assumption is that p∗ has bounded variance: E_{x∼p∗}[(x − µ)²] ≤ σ² for some parameter σ. We take G = Gcov(σ) to be the set of distributions satisfying this constraint.
Under this assumption, we can estimate µ to within error O(σ√ε) under TV-perturbations of size ε. Indeed, consider the following procedure:

Algorithm 1 TrimmedMean
1: Remove the upper and lower (2ε)-quantiles from p̃ (so 4ε mass is removed in total).
2: Let p̃2 denote the new distribution after re-normalizing, and return the mean of p̃2.
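Algorithm 1 is stated at the population level; the following is a minimal finite-sample sketch of the same idea (assuming numpy, with an illustrative function name), in which the population quantiles are replaced by empirical quantiles.

```python
import numpy as np

def trimmed_mean(x, eps):
    """Finite-sample analogue of Algorithm 1: drop the data below the
    (2*eps)-quantile and above the (1 - 2*eps)-quantile, then average."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [2 * eps, 1 - 2 * eps])
    kept = x[(x >= lo) & (x <= hi)]
    return kept.mean()

# Example: 5% of the samples are replaced by gross outliers at 100.
rng = np.random.default_rng(0)
n, eps = 10_000, 0.05
x = rng.normal(loc=1.0, scale=1.0, size=n)
x[: int(eps * n)] = 100.0            # adversarial corruption
print(np.mean(x))                    # badly biased (around 5.9)
print(trimmed_mean(x, eps))          # close to the true mean 1.0
```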

To analyze Algorithm 1, we will make use of a strengthened version of Chebyshev’s inequality, which we
recall here (see Section B.1 for a proof):
Lemma 2.1 (Chebyshev inequality). Suppose that p has mean µ and variance σ². Then, P_{X∼p}[X ≥ µ + σ/√δ] ≤ δ. Moreover, if E is any event with probability at least δ, then |E_{X∼p}[X | E] − µ| ≤ σ√(2(1 − δ)/δ).
The first part, which is the standard Chebyshev inequality, says that it is unlikely for a point to be more
than a few standard deviations away from µ. The second part says that any large population of points must
have a mean close to µ. This second property, which is called resilience, is central to robust estimation, and
will be studied in more detail in Section 2.4.
With Lemma 2.1 in hand, we can prove the following fact about Algorithm 1:

Proposition 2.2. Assume that TV(p̃, p∗) ≤ ε ≤ 1/8. Then the output µ̂ of Algorithm 1 satisfies |µ̂ − µ| ≤ 9σ√ε.

Proof. If TV(p̃, p∗) ≤ ε, then we can get from p∗ to p̃ by adding an ε-fraction of points (outliers) and deleting an ε-fraction of the original points.
First note that all outliers that exceed the ε-quantile of p∗ are removed by Algorithm 1. Therefore, all non-removed outliers lie within σ/√ε of the mean µ by Chebyshev's inequality.
Second, we and the adversary together remove at most a 5ε-fraction of the mass in p∗. Applying Lemma 2.1 with δ = 1 − 5ε, the mean of the remaining good points lies within σ√(10ε/(1 − 5ε)) of µ.
Now let ε′ be the fraction of remaining points which are bad, and note that ε′ ≤ ε/(1 − 4ε). The mean of all the remaining points differs from µ by at most ε′·σ√(1/ε) + (1 − ε′)·σ√(10ε/(1 − 5ε)), which is at most (1 + √10)·√(ε/(1 − 4ε))·σ. This is in turn at most 9σ√ε assuming that ε ≤ 1/8.

Figure 3: The outliers can lie at distance √d without being detected, skewing the mean by ε√d.


Optimality. The O(σ√ε) dependence is optimal, because the adversary can themselves apply the same trimming procedure we do, and in general this will shift the mean of a bounded covariance distribution by O(σ√ε) while keeping the covariance bounded.

Alternate assumptions. The key fact driving the proof of Proposition 2.2 is that any (1 − ε)-fraction of the good points has mean at most O(σ√ε) away from the true mean due to Chebyshev's inequality (Lemma 2.1), which makes use of the bound σ² on the variance. Any other bound on the deviation from the mean would yield an analogous result. For instance, if p∗ has bounded kth moment, then the O(σ√ε) in Lemma 2.1 can be improved to O(σ_k ε^{1−1/k}), where (σ_k)^k is a bound on the kth moment; in this case Algorithm 1 will estimate µ with a correspondingly improved error of O(σ_k ε^{1−1/k}).

2.2 Problems in High Dimensions


In the previous section, we saw how to robustly estimate the mean of a 1-dimensional dataset assuming the true data had bounded variance. Our estimator worked by removing data points that are too far away from the mean, and then returning the mean of the remaining points.
It is tempting to apply this same idea in higher dimensions—for instance, removing points that are far
away from the mean in `2 -distance. Unfortunately, this incurs large error in high dimensions.
To see why, consider the following simplified example. The distribution p∗ over the true data is an isotropic Gaussian N(µ, I), with unknown mean µ and independent variance 1 in every coordinate. In this case, the typical distance ‖xi − µ‖2 of a sample xi from the mean µ is roughly √d, since there are d coordinates and xi differs from µ by roughly 1 in every coordinate. (In fact, ‖xi − µ‖2 can be shown to concentrate around √d with high probability.) This means that the outliers can lie at a distance √d from µ without being detected, thus shifting the mean by Θ(ε√d); Figure 3 depicts this. Therefore, filtering based on ℓ2 distance incurs an error of at least ε√d. This dimension-dependent √d factor often renders bounds meaningless.
In fact, the situation is even worse; not only are the bad points no further from the mean than the good points in ℓ2-distance, they actually have the same probability density under the true data-generating distribution N(µ, I). There is thus no procedure that measures each point in isolation and can avoid the √d factor in the error.
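A small simulation of this phenomenon (assuming numpy): an ε-fraction of points placed at distance √d along a fixed direction have ℓ2 norms indistinguishable from those of the clean samples, yet they shift the empirical mean by roughly ε√d.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 20_000, 1000, 0.05

X = rng.normal(size=(n, d))                # clean samples from N(0, I)
k = int(eps * n)
v = np.zeros(d); v[0] = 1.0
X[:k] = np.sqrt(d) * v                     # outliers at distance sqrt(d) along v

norms = np.linalg.norm(X, axis=1)
print(norms[:k].mean(), norms[k:].mean())  # both roughly sqrt(d) ~ 31.6
print(np.linalg.norm(X.mean(axis=0)))      # mean shifted by about eps*sqrt(d) ~ 1.6
```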
This leads us to an important take-away: In high dimensions, outliers can substantially perturb the mean
while individually looking innocuous. To handle this, we will instead need to analyze entire populations of
outliers at once. In the next section we will do this using minimum distance functionals, which will allow us
to avoid the dimension-dependent error.

[Lecture 2]

2.3 Minimum Distance Functionals
In the previous section we saw that simple approaches to handling outliers in high-dimensional data, such as the trimmed mean, incur a √d error. We will avoid this error using minimum distance functionals, an idea which seems to have first appeared in Donoho and Liu (1988).
Definition 2.3 (Minimum distance functional). For a family G and discrepancy D, the minimum distance functional is

θ̂(p̃) = θ∗(q) := argmin_θ L(q, θ), where q = argmin_{q∈G} D(q, p̃).   (2)

In other words, θ̂ is obtained by first projecting p̃ onto G under D, and then outputting the optimal parameters for the resulting distribution.
An attractive property of the minimum-distance functional is that it does not depend on the perturbation
level . More importantly, it satisfies the following cost bound in terms of the modulus of continuity of G:
Proposition 2.4. Suppose D is a pseudometric. Then the cost L(p∗, θ̂(p̃)) of the minimum distance functional is at most the maximum loss between any pair of distributions in G of distance at most 2ε:

m(G, 2ε, D, L) := sup_{p,q∈G: D(p,q)≤2ε} L(p, θ∗(q)).   (3)

The quantity m is called the modulus of continuity because, if we think of L(p, θ∗ (q)) as a discrepancy
between distributions, then m is the constant of continuity between L and D when restricted to pairs of
nearby distributions in G.
Specialize again to the case D = TV and L(p∗, θ) = ‖θ − µ(p∗)‖2 (here we allow p∗ to be a distribution over R^d rather than just R). Then the modulus is sup_{p,q∈G: TV(p,q)≤2ε} ‖µ(p) − µ(q)‖2. As a concrete example, let G be the family of Gaussian distributions with unknown mean µ and identity covariance. For this family, the TV distance is essentially linear in the difference in means:

Lemma 2.5. Let N(µ, I) denote a Gaussian distribution with mean µ and identity covariance. Then

min(u/2, 1)/√(2π) ≤ TV(N(µ, I), N(µ′, I)) ≤ min(u/√(2π), 1),   (4)

where u = ‖µ − µ′‖2.
Proof. By rotational and translational symmetry, it suffices to consider the case of one-dimensional Gaussians N(−u/2, 1) and N(u/2, 1). Then we have that

TV(N(−u/2, 1), N(u/2, 1)) = (1/(2√(2π))) ∫_{−∞}^{∞} |e^{−(t+u/2)²/2} − e^{−(t−u/2)²/2}| dt   (5)
 (i)= (1/√(2π)) ∫_{−u/2}^{u/2} e^{−t²/2} dt.   (6)

(The equality (i) is a couple lines of algebra, but is easiest to see by drawing a graph of the two Gaussians and cancelling out most of the probability mass.)
For the lower bound, note that e^{−t²/2} ≥ 1/2 if |t| ≤ 1.
For the upper bound, similarly note that e^{−t²/2} ≤ 1 for all t ∈ R, and also that the entire integral must be at most 1 because the integrand is the density of a standard Gaussian, whose total integral is 1.
Lemma 2.5 allows us to compute the modulus for Gaussians:

Corollary 2.6. Let Ggauss be the family of isotropic Gaussians, D = TV, and L the difference in means as above. Then m(Ggauss, ε, D, L) ≤ 2√(2π) ε whenever ε ≤ 1/(2√(2π)).

In particular, by Proposition 2.4 the minimum distance functional achieves error O(ε) for Gaussian distributions when ε ≤ 1/(2√(2π)). This improves substantially on the ε√d error of the trimmed mean estimator from Section 2.2. We have achieved our goal, at least for Gaussians.

More general families. Taking G to be Gaussians is restrictive, as it assumes that p∗ has a specific
parametric form—counter to our goal of being robust! However, the modulus m is bounded for much more
general families. As one example, we can take the distributions with bounded covariance (compare to
Proposition 2.2):
Lemma 2.7. Let Gcov(σ) be the family of distributions whose covariance matrix Σ satisfies Σ ⪯ σ²I. Then m(Gcov(σ), ε) = O(σ√ε).
Proof. Let p, q ∈ Gcov(σ) be such that TV(p, q) ≤ ε. This means that we can get from p to q by first deleting ε mass from p and then adding ε new points to end up at q. Put another way, there is a distribution r that can be reached from both p and q by deleting ε mass (and then renormalizing). In fact, this distribution is exactly

r = min(p, q) / (1 − TV(p, q)).   (7)

Since r can be obtained from both p and q by deletions, we can make use of the following multi-dimensional analogue of Chebyshev's inequality (Lemma 2.1):

Lemma 2.8 (Chebyshev in R^d). Suppose that p has mean µ and covariance Σ, where Σ ⪯ σ²I. Then, if E is any event with probability at least δ, we have ‖E_{X∼p}[X | E] − µ‖2 ≤ σ√(2(1 − δ)/δ).

As a consequence, we have ‖µ(r) − µ(p)‖2 ≤ σ√(2ε/(1 − ε)) and ‖µ(r) − µ(q)‖2 ≤ σ√(2ε/(1 − ε)) (since r can be obtained from either p or q by conditioning on an event of probability 1 − ε). By the triangle inequality and assuming ε ≤ 1/2, we have ‖µ(p) − µ(q)‖2 ≤ 4σ√ε, as claimed.
As a consequence, the minimum distance functional robustly estimates the mean of bounded covariance distributions with error O(σ√ε), generalizing the 1-dimensional bound obtained by the trimmed mean.
In Lemma 2.7, the two key properties we needed were:
• The midpoint property of TV distance (i.e., that there existed an r that was a deletion of p and q).
• The bounded tails guaranteed by Chebyshev’s inequality.
If we replace bounded covariance distributions with any other family that has tails bounded in a similar way, then the minimum distance functional will similarly yield good bounds. A general family of distributions satisfying this property is the family of resilient distributions, which we turn to next.

2.4 Resilience
Here we generalize Lemma 2.7 to prove that the modulus of continuity m is bounded for a general family of distributions containing Gaussians, sub-Gaussians, bounded covariance distributions, and many others. The main observation is that in the proof of Lemma 2.7, all we needed was that the tails of distributions in G were bounded, in the sense that deleting an ε-fraction of the points could not substantially change the mean. This motivates the following definition:
Definition 2.9. A distribution p over R^d is said to be (ρ, ε)-resilient (with respect to some norm ‖·‖) if

‖E_{X∼p}[X | E] − E_{X∼p}[X]‖ ≤ ρ for all events E with p(E) ≥ 1 − ε.   (8)

We let G_TV(ρ, ε) denote the family of (ρ, ε)-resilient distributions.
We observe that Gcov(σ) ⊂ G_TV(σ√(2ε/(1 − ε)), ε) for all ε by Lemma 2.8; in other words, bounded covariance distributions are resilient. We can also show that Ggauss ⊂ G_TV(2ε√(log(1/ε)), ε), so Gaussians are resilient as well.
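In one dimension, the worst-case event in Definition 2.9 simply deletes an ε-fraction of the points from one tail (this is the content of Lemma 2.14 below, specialized to R), so the resilience parameter of a sample can be estimated by sorting. The following sketch (assuming numpy, with an illustrative function name) compares a light-tailed and a heavy-tailed sample.

```python
import numpy as np

def empirical_resilience(x, eps):
    """Estimate the 1-d resilience parameter rho(eps) of a sample: the
    largest shift of the mean obtainable by deleting an eps-fraction of
    the points (the worst case deletes one tail)."""
    x = np.sort(np.asarray(x, dtype=float))
    n, k = len(x), int(np.floor(eps * len(x)))
    mu = x.mean()
    drop_top = x[: n - k].mean()      # delete the upper eps tail
    drop_bottom = x[k:].mean()        # delete the lower eps tail
    return max(abs(drop_top - mu), abs(drop_bottom - mu))

rng = np.random.default_rng(0)
eps = 0.05
gauss = rng.normal(size=100_000)              # light tails: rho ~ eps*sqrt(2 log(1/eps)) ~ 0.12
heavy = rng.standard_t(df=2.5, size=100_000)  # heavier tails: noticeably larger rho
print(empirical_resilience(gauss, eps), empirical_resilience(heavy, eps))
```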
Resilient distributions always have bounded modulus:

Theorem 2.10. The modulus of continuity of G_TV satisfies the bound

m(G_TV(ρ, ε), ε) ≤ 2ρ   (9)

whenever ε < 1/2.
7
Proof. As in Lemma 2.7, the key idea is that any two distributions p, q that are close in TV have a midpoint distribution r = min(p, q)/(1 − TV(p, q)) that is a deletion of both distributions. This midpoint distribution connects the two distributions, and it follows from the triangle inequality that the modulus of G_TV is bounded. We illustrate this idea in Figure 4 and make it precise below.

[Figure 4 diagram: distributions p and q with TV(p, q) ≤ ε; the midpoint r = min(p, q)/(1 − TV(p, q)) satisfies r ≤ p/(1 − ε) and r ≤ q/(1 − ε); since p, q ∈ G_TV, ‖µ(p) − µ(r)‖ ≤ ρ and ‖µ(q) − µ(r)‖ ≤ ρ, hence ‖µ(p) − µ(q)‖ ≤ 2ρ.]

Figure 4: Midpoint distribution r helps bound the modulus for G_TV.

Recall that

m(G_TV(ρ, ε), ε) = sup_{p,q∈G_TV(ρ,ε): TV(p,q)≤ε} ‖µ(p) − µ(q)‖.   (10)

From TV(p, q) ≤ ε, we know that r = min(p, q)/(1 − TV(p, q)) can be obtained from either p or q by conditioning on an event of probability at least 1 − ε. It then follows from p, q ∈ G_TV(ρ, ε) that ‖µ(p) − µ(r)‖ ≤ ρ and similarly ‖µ(q) − µ(r)‖ ≤ ρ. Thus by the triangle inequality ‖µ(p) − µ(q)‖ ≤ 2ρ, which yields the desired result.
We have seen so far that resilient distributions have bounded modulus, and that both Gaussian and bounded covariance distributions are resilient. The bound on the modulus for Gcov that is implied by resilience is optimal (O(σ√ε)), while for Ggauss it is optimal up to log factors (O(ε√(log(1/ε))) vs. O(ε)).
In fact, Gaussians are a special case and resilience yields an essentially optimal bound at least for most
non-parametric families of distributions. As one family of examples, consider distributions with bounded
Orlicz norm:
Definition 2.11 (Orlicz norm). A function ψ : R≥0 → R≥0 is an Orlicz function if ψ is convex, non-
decreasing, and satisfies ψ(0) = 0, ψ(x) → ∞ as x → ∞. For an Orlicz function ψ, the Orlicz norm or
ψ-norm of a random variable X is defined as
    
‖X‖_ψ := inf{ t > 0 : E_p[ψ(|X|/t)] ≤ 1 }.   (11)

We let Gψ (σ) denote the family of distributions with kX − E[X]kψ ≤ σ.


As special cases, we say that a random variable X ∼ p is sub-Gaussian with parameter σ if ‖⟨X − E_p[X], v⟩‖_{ψ2} ≤ σ whenever ‖v‖2 ≤ 1, where ψ2(x) = e^{x²} − 1. We define a sub-exponential random variable similarly via the function ψ1(x) = e^x − 1.
Definition 2.11 applies to distributions on R, but we can generalize this to distributions on Rd by taking
one-dimensional projections:
Definition 2.12 (Orlicz norm in Rd ). For a random variable X ∈ Rd and Orlicz function ψ, we define the
d-dimensional ψ-norm as

‖X‖_ψ := inf{ t > 0 : ‖⟨X, v⟩‖_ψ ≤ t whenever ‖v‖2 ≤ 1 }.   (12)

We let Gψ (σ) denote the distributions with bounded ψ-norm as in Definition 2.11.

Thus a distribution has bounded ψ-norm if each of its 1-dimensional projections does. As an example,
Gcov (σ) = Gψ (σ) for ψ(x) = x2 , so Orlicz norms generalize bounded covariance. It is also possible to generalize
Definition 2.12 to norms other than the `2 -norm, which we will see in an exercise.
Functions with bounded Orlicz norm are resilient:

Lemma 2.13. The family Gψ(σ) is contained in G_TV(2σεψ^{−1}(1/ε), ε) for all 0 < ε < 1/2.

Proof. Without loss of generality assume E[X] = 0. For any event E with p(E) = 1 − ε′ ≥ 1 − ε, denote its complement as E^c. We then have

‖E_{X∼p}[X | E]‖2 (i)= (ε′/(1 − ε′)) ‖E_{X∼p}[X | E^c]‖2   (13)
 = (ε′/(1 − ε′)) sup_{‖v‖2≤1} E_{X∼p}[⟨X, v⟩ | E^c]   (14)
 (ii)≤ (ε′/(1 − ε′)) sup_{‖v‖2≤1} σψ^{−1}(E_{X∼p}[ψ(|⟨X, v⟩|/σ) | E^c])   (15)
 (iii)≤ (ε′/(1 − ε′)) sup_{‖v‖2≤1} σψ^{−1}(E_{X∼p}[ψ(|⟨X, v⟩|/σ)]/ε′)   (16)
 (iv)≤ (ε′/(1 − ε′)) σψ^{−1}(1/ε′) ≤ 2σεψ^{−1}(1/ε),   (17)

as was to be shown. Here (i) is because (1 − ε′)E[X | E] + ε′E[X | E^c] = 0. Meanwhile (ii) is by convexity of ψ, (iii) is by non-negativity of ψ, and (iv) is the assumed ψ-norm bound.
As a consequence, the modulus m of Gψ(σ) is O(σεψ^{−1}(1/ε)), and hence the minimum distance functional estimates the mean with error O(σεψ^{−1}(1/ε)). Note that for ψ(x) = x² this reproduces our result for bounded covariance. For ψ(x) = x^k we get error O(σε^{1−1/k}) when a distribution has kth moments bounded by σ^k. Similarly, for sub-Gaussian distributions we get error O(σε√(log(1/ε))). We will show in an exercise that the error bound implied by Lemma 2.13 is optimal for any Orlicz function ψ.

Further properties and dual norm perspective. Having seen several examples of resilient distributions,
we now collect some basic properties of resilience, as well as a dual perspective that is often fruitful. First,
we can make the connection between resilience and tails even more precise with the following lemma:
Lemma 2.14. For a fixed vector v, let τ_ε(v) denote the ε-quantile of ⟨x − µ, v⟩: P_{x∼p}[⟨x − µ, v⟩ ≥ τ_ε(v)] = ε. Then, p is (ρ, ε)-resilient in a norm ‖·‖ if and only if the ε-tail of p has bounded mean when projected onto any dual unit vector v:

E_p[⟨x − µ, v⟩ | ⟨x − µ, v⟩ ≥ τ_ε(v)] ≤ ((1 − ε)/ε) ρ whenever ‖v‖∗ ≤ 1.   (18)

In particular, the ε-quantile satisfies τ_ε(v) ≤ ((1 − ε)/ε) ρ.

In other words, if we project onto any unit vector v in the dual norm, the ε-tail of x − µ must have mean at most ((1 − ε)/ε) ρ. Lemma 2.14 is proved in Section C.
The intuition for Lemma 2.14 is the picture in Figure 5, which is helpful to keep in mind more generally: letting µ̂ = E[X | E], if we have ‖µ̂ − µ‖ = ρ, then there must be some dual norm unit vector v such that ⟨µ̂ − µ, v⟩ = ρ and ‖v‖∗ = 1. Moreover, for such a v, ⟨µ̂ − µ, v⟩ will be largest when E consists of the (1 − ε)-fraction of points for which ⟨X − µ, v⟩ is largest. Therefore, resilience reduces to a 1-dimensional problem along each of the dual unit vectors v.
A related result establishes that for ε = 1/2, resilience in a norm is equivalent to having bounded first moments in the dual norm (see Section D for a proof):

Lemma 2.15. Suppose that p is (ρ, 1/2)-resilient in a norm ‖·‖, and let ‖·‖∗ be the dual norm. Then p has 1st moments bounded by 2ρ: E_{x∼p}[|⟨x − µ, v⟩|] ≤ 2ρ‖v‖∗ for all v ∈ R^d.
Conversely, if p has 1st moments bounded by ρ, it is (2ρ, 1/2)-resilient.

[Figure 5 diagram: µ, µ̂, and a dual unit vector v with ‖µ̂ − µ‖ = ⟨µ̂ − µ, v⟩.]

Figure 5: The optimal set T discards the smallest |S| elements projected onto a dual unit vector v.


Recap. We saw that the error of the trimmed mean grew as √d in d dimensions, and introduced an alternative estimator, the minimum distance functional, that achieves better error. Specifically, it achieves error 2ρ for the family of (ρ, ε)-resilient distributions, which includes all distributions with bounded Orlicz norm (including bounded covariance, bounded moments, and sub-Gaussians).
The definition of resilience is important not just as an analysis tool. Without it, we would need a different estimator for each of the cases of bounded covariance, sub-Gaussian, etc., since the minimum distance functional depends on the family G. Instead, we can always project onto the resilient family G_TV(ρ, ε) and be confident that this will typically yield an optimal error bound. The only complication is that projection still depends on the parameters ρ and ε; however, we can do without knowledge of either one of the parameters as long as we know the other.

[Lecture 3]

2.5 Concentration Inequalities


So far we have only considered the infinite-data limit where we directly observe p̃; but in general we would
like to analyze what happens in finite samples where we only observe X1 , . . . , Xn sampled independently
from p̃. In order to do this, we will want to be able to formalize statements such as “if we take the average of
a large number of samples, it converges to the population mean”. In order to do this, we will need a set of
mathematical tools called concentration inequalities. A proper treatment of concentration could itself occupy
an entire course, but we will cover the ideas here that are most relevant for our later analyses. See Boucheron
et al. (2003), Boucheron et al. (2013), or Ledoux (2001) for more detailed expositions. Terence Tao also has
some well-written lecture notes.
Concentration inequalities usually involve two steps:
1. We establish concentration for a single random variable, in terms of some property of that random
variable.

2. We show that the property composes nicely for products of independent random variables.
A prototypical example (covered below) is showing that (1) a random variable has at most a 1/t² probability of being t standard deviations from its mean; and (2) the standard deviation of a sum of n i.i.d. random variables is √n times the standard deviation of a single variable.
The simplest concentration inequality is Markov’s inequality. Consider the following question:

A slot machine has an expected pay-out of $5 (and its payout is always non-negative). What can
we say about the probability that it pays out at least $100?
We observe that the probability must be at most 0.05, since a 0.05 chance of a $100 payout would by itself
already contribute $5 to the expected value. Moreover, this bound is achievable by taking a slot machine
that pays $0 with probability 0.95 and $100 with probability 0.05. Markov’s inequality is the generalization
of this observation:

Theorem 2.16 (Markov's inequality). Let X be a non-negative random variable with mean µ. Then, P[X ≥ t·µ] ≤ 1/t.

Markov's inequality accomplishes our first goal of establishing concentration for a single random variable, but it has two issues: first, the 1/t tail bound decays too slowly in many cases (we instead would like exponentially decaying tails); second, Markov's inequality doesn't compose well and so doesn't accomplish our second goal.
We can address both issues by applying Markov’s inequality to some transformed random variable. For
instance, applying Markov’s inequality to the random variable Z = (X − µ)2 yields the stronger Chebyshev
inequality:
Theorem 2.17 (Chebyshev's inequality). Let X be a real-valued random variable with mean µ and variance σ². Then, P[|X − µ| ≥ t·σ] ≤ 1/t².

Proof. Since Z = (X − µ)² is non-negative, we have that P[Z ≥ t²·σ²] ≤ 1/t² by Markov's inequality. Taking the square root gives P[|X − µ| ≥ t·σ] ≤ 1/t², as was to be shown.

Chebyshev's inequality improves the 1/t dependence to 1/t². But more importantly, it gives a bound in terms of a quantity (the variance σ²) that composes nicely:
Lemma 2.18 (Additivity of variance). Let X1 , . . . , Xn be pairwise independent random variables, and let
Var[X] denote the variance of X. Then,

Var[X1 + · · · + Xn ] = Var[X1 ] + · · · + Var[Xn ]. (19)

Proof. It suffices by induction to prove this for two random variables. Without loss of generality assume
that both variables have mean zero. Then we have Var[X + Y ] = E[(X + Y )2 ] = E[X 2 ] + E[Y 2 ] + 2E[XY ] =
Var[X] + Var[Y ] + 2E[X]E[Y ] = Var[X] + Var[Y ], where the second-to-last step uses pairwise independence.
Chebyshev's inequality together with Lemma 2.18 allows us to show that an average of i.i.d. random variables converges to its mean at a 1/√n rate:

Corollary 2.19. Suppose X1, . . . , Xn are drawn i.i.d. from p, where p has mean µ and variance σ². Also let S = (1/n)(X1 + · · · + Xn). Then, P[|S − µ|/σ ≥ t/√n] ≤ 1/t².
Proof. Lemma 2.18 implies that Var[S] = σ 2 /n, from which the result follows by Chebyshev’s inequality.
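A quick numerical check of Corollary 2.19 (assuming numpy): even for a skewed distribution, the empirical probability that the sample mean deviates by t·σ/√n stays below the 1/t² bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t, sigma = 100, 20_000, 3.0, 1.0

# Exponential(1) shifted to mean 0: variance 1, but skewed (non-Gaussian).
samples = rng.exponential(scale=1.0, size=(trials, n)) - 1.0
means = samples.mean(axis=1)

dev = t * sigma / np.sqrt(n)
print(np.mean(np.abs(means) >= dev))   # empirical tail probability (a few in a thousand)
print(1 / t**2)                        # Chebyshev bound: ~0.111
```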

Higher moments. Chebyshev’s inequality gives bounds in terms of the second moment of X − µ. Can we
do better by considering higher moments such as the 4th moment? Supposing that E[(X − µ)4 ] ≤ τ 4 , we do
get the analogous bound P[|X − µ| ≥ t · τ ] ≤ 1/t4 . However, the 4th moment doesn’t compose as nicely as
the variance; if X and Y are two independent mean-zero random variables, then we have

E[(X + Y )4 ] = E[X 4 ] + E[Y 4 ] + 6E[X 2 ]E[Y 2 ], (20)

where the E[X 2 ]E[Y 2 ] can’t be easily dealt with. It is possible to bound higher moments under composition,
for instance using the Rosenthal inequality which states that
E[|Σ_i X_i|^p] ≤ O(p)^p Σ_i E[|X_i|^p] + O(√p)^p (Σ_i E[X_i²])^{p/2}   (21)

for independent random variables X_i. Note that the first term on the right-hand side typically grows as n·O(p)^p while the second term typically grows as O(√(pn))^p.
We will typically not take the Rosenthal approach and instead work with an alternative, nicer object
called the moment generating function:
m_X(λ) := E[exp(λ(X − µ))].   (22)

For independent random variables, the moment generating function composes via the identity m_{X1+···+Xn}(λ) = Π_{i=1}^n m_{Xi}(λ). Applying Markov's inequality to the moment generating function yields the Chernoff bound:

Theorem 2.20 (Chernoff bound). For a random variable X with moment generating function m_X(λ), we have

P[X − µ ≥ t] ≤ inf_{λ≥0} m_X(λ) e^{−λt}.   (23)

Proof. By Markov's inequality, P[X − µ ≥ t] = P[exp(λ(X − µ)) ≥ exp(λt)] ≤ E[exp(λ(X − µ))]/exp(λt), which is equal to m_X(λ)e^{−λt} by the definition of m_X. Taking the infimum over λ ≥ 0 yields the claimed bound.

Sub-exponential and sub-Gaussian distributions. An important special case is sub-exponential ran-


dom variables; recall these are random variables satisfying E[exp(|X − µ|/σ)] ≤ 2. For these, applying the
Chernoff bound with λ = 1/σ yields P[X − µ ≥ t] ≤ 2e−t/σ .
Another special case is sub-Gaussian random variables (those satisfying E[exp((X − µ)²/σ²)] ≤ 2). In this case, using the inequality ab ≤ a²/4 + b², we have

m_X(λ) = E[exp(λ(X − µ))] ≤ E[exp(λ²σ²/4 + (X − µ)²/σ²)] ≤ 2 exp(λ²σ²/4).   (24)

The factor of 2 is pesky, and actually we can get the more convenient bound m_X(λ) ≤ exp(3λ²σ²/2) (Rivasplata, 2012). Plugging this into the Chernoff bound yields P[X − µ ≥ t] ≤ exp(3λ²σ²/2 − λt); minimizing over λ gives the optimized bound P[X − µ ≥ t] ≤ exp(−t²/(6σ²)).
Sub-Gaussians are particularly convenient because the bound m_X(λ) ≤ exp(3λ²σ²/2) composes well. Let X1, . . . , Xn be independent sub-Gaussians with constants σ1, . . . , σn. Then we have m_{X1+···+Xn}(λ) ≤ exp(3λ²(σ1² + · · · + σn²)/2). We will use this to bound the behavior of sums of bounded random variables using Hoeffding's inequality:¹

Theorem 2.21 (Hoeffding's inequality). Let X1, . . . , Xn be independent zero-mean random variables lying in [−M, M], and let S = (1/n)(X1 + · · · + Xn). Then, P[S ≥ t] ≤ exp(−ln(2)nt²/(6M²)) ≤ exp(−nt²/(9M²)).

Proof. First, note that each Xi is sub-Gaussian with parameter σ = M/√(ln 2), since E[exp(Xi²/σ²)] ≤ exp(M²/σ²) = exp(ln(2)) = 2. We thus have m_{Xi}(λ) ≤ exp(3λ²M²/(2 ln 2)), and so by the multiplicativity of moment generating functions we obtain m_S(λ) ≤ exp(3λ²M²/(2n ln 2)). Plugging into Chernoff's bound and optimizing λ as before yields P[S ≥ t] ≤ exp(−ln(2)nt²/(6M²)) as claimed.

Hoeffding's inequality shows that a sum of independent random variables converges to its mean at a 1/√n rate, with tails that decay as fast as a Gaussian as long as each of the individual variables is bounded. Compare this to the 1/t² decay that we obtained earlier through Chebyshev's inequality.
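The following sketch (assuming numpy) checks Theorem 2.21 on random signs; the empirical tail is far below the stated bound, consistent with the suboptimal constants noted in the footnote.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, M, t = 200, 50_000, 1.0, 0.2

# Bounded, zero-mean variables: independent random signs in {-1, +1}.
X = rng.choice([-1.0, 1.0], size=(trials, n))
S = X.mean(axis=1)

print(np.mean(S >= t))                   # empirical tail probability (~0.002)
print(np.exp(-n * t**2 / (9 * M**2)))    # bound from Theorem 2.21, loose constants (~0.41)
```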

Cumulants. The moment generating function is a convenient tool because it multiplies over independent
random variables. However, its existence requires that X already have thin tails, since E[exp(λX)] must be
finite. For heavy-tailed distributions a (laborious) alternative is to use cumulants.
The cumulant function is defined as

K_X(λ) := log E[exp(λX)].   (25)

Note this is the log of the moment-generating function. Taking the log is convenient because now we have additivity: K_{X+Y}(λ) = K_X(λ) + K_Y(λ) for independent X, Y. Cumulants are obtained by writing K_X(λ) as a power series:

K_X(λ) = Σ_{n=1}^∞ (κ_n(X)/n!) λ^n.   (26)

¹Most of the constants presented here are suboptimal; we have focused on giving simpler proofs at the expense of sharp constants.

When E[X] = 0, the first few values of κ_n are:

κ1(X) = 0,   (27)
κ2(X) = E[X²],   (28)
κ3(X) = E[X³],   (29)
κ4(X) = E[X⁴] − 3E[X²]²,   (30)
κ5(X) = E[X⁵] − 10E[X³]E[X²],   (31)
κ6(X) = E[X⁶] − 15E[X⁴]E[X²] − 10E[X³]² + 30E[X²]³.   (32)

Since K is additive, each of the κ_n is as well. Thus while we ran into the issue that E[(X + Y)⁴] ≠ E[X⁴] + E[Y⁴], it is the case that κ4(X + Y) = κ4(X) + κ4(Y) as long as X and Y are independent. By going back and forth between moments and cumulants it is possible to obtain tail bounds even if only some of the moments exist. However, this can be arduous and Rosenthal's inequality is probably the better route in such cases.
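As a small illustration (assuming numpy, with an illustrative helper name), the sketch below estimates κ2 through κ4 of a sample from the central-moment formulas above and checks additivity over independent summands numerically.

```python
import numpy as np

def cumulants(x):
    """Second through fourth cumulants of a sample, via (28)-(30)."""
    c = np.asarray(x, dtype=float)
    c = c - c.mean()
    m2, m3, m4 = (np.mean(c**k) for k in (2, 3, 4))
    return np.array([m2, m3, m4 - 3 * m2**2])

rng = np.random.default_rng(0)
n = 2_000_000
X = rng.exponential(size=n) - 1.0    # kappa_2 = 1, kappa_3 = 2, kappa_4 = 6
Y = rng.normal(size=n)               # kappa_2 = 1, kappa_3 = 0, kappa_4 = 0

print(cumulants(X), cumulants(Y))
print(cumulants(X + Y))              # approximately the sum of the two, by additivity
```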

[Lecture 4]

2.5.1 Applications of concentration inequalities


Having developed the machinery above, we next apply it to a few concrete problems to give a sense of how to use it. A key lemma which we will use repeatedly is the union bound, which states that if E1, . . . , En are events with probabilities π1, . . . , πn, then the probability of E1 ∪ · · · ∪ En is at most π1 + · · · + πn. A corollary is that if n events each have probability ≪ 1/n, then there is a large probability that none of the events occur.

Maximum of sub-Gaussians. Suppose that X1, . . . , Xn are mean-zero sub-Gaussian with parameter σ, and let Y = max_{i=1,...,n} Xi. How large is Y? We will show the following:

Lemma 2.22. The random variable Y is O(σ√(log(n/δ))) with probability 1 − δ.

Proof. By the Chernoff bound for sub-Gaussians, we have that P[Xi ≥ σ√(6 log(n/δ))] ≤ exp(−log(n/δ)) = δ/n. Thus by the union bound, the probability that any of the Xi exceeds σ√(6 log(n/δ)) is at most δ. Thus with probability at least 1 − δ we have Y ≤ σ√(6 log(n/δ)), as claimed.
Lemma 2.22 illustrates a typical proof strategy: We first decompose the event we care about as a union of simpler events, then show that each individual event holds with high probability by exploiting independence. As long as the “failure probability” of a single event is much smaller than the inverse of the number of events, we obtain a meaningful bound. In fact, this strategy can be employed even for an infinite number of events by discretizing to an “ε-net”, as we will see below:
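A short simulation of Lemma 2.22 for standard Gaussians (assuming numpy): the expected maximum of n independent N(0, 1) variables grows roughly like √(2 log n).

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1_000, 10_000]:
    Y = rng.normal(size=(500, n)).max(axis=1)     # 500 draws of the max of n Gaussians
    print(n, Y.mean(), np.sqrt(2 * np.log(n)))    # empirical mean of the max vs sqrt(2 log n)
```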

Eigenvalue of random matrix. Let X1, . . . , Xn be independent zero-mean sub-Gaussian variables in R^d with parameter σ, and let M = (1/n) Σ_{i=1}^n Xi Xi^⊤. How large is ‖M‖, the maximum eigenvalue of M? We will show:

Lemma 2.23. The maximum eigenvalue ‖M‖ is O(σ²·(1 + d/n + log(1/δ)/n)) with probability 1 − δ.
Proof. The maximum eigenvalue can be expressed as

‖M‖ = sup_{‖v‖2≤1} v^⊤Mv = sup_{‖v‖2≤1} (1/n) Σ_{i=1}^n |⟨Xi, v⟩|².   (33)

The quantity inside the sup is attractive to analyze because it is an average of independent random variables. Indeed, we have

E[exp((n/σ²) v^⊤Mv)] = E[exp(Σ_{i=1}^n |⟨Xi, v⟩|²/σ²)]   (34)
 = Π_{i=1}^n E[exp(|⟨Xi, v⟩|²/σ²)] ≤ 2^n,   (35)

where the last step follows by sub-Gaussianity of ⟨Xi, v⟩. The Chernoff bound then gives P[v^⊤Mv ≥ t] ≤ 2^n exp(−nt/σ²).
If we were to follow the same strategy as Lemma 2.22, the next step would be to union bound over v. Unfortunately, there are infinitely many v so we cannot do this directly. Fortunately, we can get by with only considering a large but finite number of v; we will construct a finite subset N_{1/4} of the unit ball such that

sup_{v∈N_{1/4}} v^⊤Mv ≥ (1/2) sup_{‖v‖2≤1} v^⊤Mv.   (36)

Our construction follows Section 5.2.2 of Vershynin (2010). Let N_{1/4} be a maximal set of points in the unit ball such that ‖x − y‖2 ≥ 1/4 for all distinct x, y ∈ N_{1/4}. We observe that |N_{1/4}| ≤ 9^d; this is because the balls of radius 1/8 around each point in N_{1/4} are disjoint and contained in a ball of radius 9/8.
To establish (36), let v maximize v^⊤Mv over ‖v‖2 ≤ 1, and let u be the closest point to v in N_{1/4}, so that ‖v − u‖2 ≤ 1/4 (such a u exists by the maximality of N_{1/4}). Then

|v^⊤Mv − u^⊤Mu| = |v^⊤M(v − u) + u^⊤M(v − u)|   (37)
 ≤ (‖v‖2 + ‖u‖2)‖M‖‖v − u‖2   (38)
 ≤ 2·‖M‖·(1/4) = ‖M‖/2.   (39)
Since v^⊤Mv = ‖M‖, we obtain |‖M‖ − u^⊤Mu| ≤ ‖M‖/2, whence u^⊤Mu ≥ ‖M‖/2, which establishes (36). We are now ready to apply the union bound: recall that from the Chernoff bound on v^⊤Mv, we had P[v^⊤Mv ≥ t] ≤ 2^n exp(−nt/σ²), so

P[sup_{v∈N_{1/4}} v^⊤Mv ≥ t] ≤ 9^d·2^n·exp(−nt/σ²).   (40)

Setting this quantity equal to δ and solving for t, we obtain

t = (σ²/n)·(n log(2) + d log(9) + log(1/δ)) = O(σ²·(1 + d/n + log(1/δ)/n)),   (41)

as was to be shown.
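A numerical illustration of Lemma 2.23 (assuming numpy) for standard Gaussian data; the comparison value (1 + √(d/n))² is the classical Gaussian random-matrix prediction, included only as a reference point and not taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2_000, 500
X = rng.normal(size=(n, d))             # rows are independent N(0, I_d), so sigma = O(1)
M = X.T @ X / n
top = np.linalg.eigvalsh(M)[-1]         # largest eigenvalue of the sample covariance
print(top, (1 + np.sqrt(d / n)) ** 2)   # both about 2.25, i.e. O(1 + d/n)
```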

VC dimension. Our final example will be important in the following section; it concerns how quickly a family of events with certain geometric structure converges to its expectation. Let H be a collection of functions f : X → {0, 1}, and define the VC dimension vc(H) to be the maximum d for which there are points x1, . . . , xd such that (f(x1), . . . , f(xd)) can take on all 2^d possible values. For instance:
• If X = R and H = {I[x ≥ τ] | τ ∈ R} is the family of threshold functions, then vc(H) = 1.
• If X = R^d and H = {I[⟨x, v⟩ ≥ τ] | v ∈ R^d, τ ∈ R} is the family of half-spaces, then vc(H) = d + 1.
Additionally, for a point set S = {x1, . . . , xn}, let V_H(S) denote the number of distinct values of (f(x1), . . . , f(xn)) as f ranges over H, and let V_H(n) = max{V_H(S) : |S| = n}. Thus the VC dimension is exactly the maximum n such that V_H(n) = 2^n.
We will show the following:
Proposition 2.24. Let H be a family of functions with vc(H) = d, and let X1, . . . , Xn ∼ p be i.i.d. random variables over X. For f : X → {0, 1}, let ν_n(f) = (1/n)|{i | f(Xi) = 1}| and let ν(f) = p(f(X) = 1). Then

sup_{f∈H} |ν_n(f) − ν(f)| ≤ O(√((d + log(1/δ))/n))   (42)

with probability 1 − δ.

We will prove a weaker result that has a d log(n) factor instead of d, and which bounds the expected
value rather than giving a probability 1 − δ bound. The log(1/δ) tail bound follows from McDiarmid’s
inequality, which is a standard result in a probability course but requires tools that would take us too far
afield. Removing the log(n) factor is slightly more involved and uses a tool called chaining.
Proof of Proposition 2.24. The importance of the VC dimension for our purposes lies in the Sauer-Shelah lemma:

Lemma 2.25 (Sauer-Shelah). Let d = vc(H). Then V_H(n) ≤ Σ_{k=0}^d (n choose k) ≤ 2n^d.


It is tempting to union bound over the at most VH (n) distinct values of (f (X1 ), . . . , f (Xn )); however,
this doesn’t work because revealing X1 , . . . , Xn uses up all of the randomness in the problem and we have no
randomness left from which to get a concentration inequality! We will instead have to introduce some new
randomness using a technique called symmetrization.
Regarding the expectation, let X1′, . . . , Xn′ be independent copies of X1, . . . , Xn and let ν_n′(f) denote the version of ν_n(f) computed with the Xi′. Then we have

E_X[sup_{f∈H} |ν_n(f) − ν(f)|] ≤ E_{X,X′}[sup_{f∈H} |ν_n(f) − ν_n′(f)|]   (43)
 = E_{X,X′}[sup_{f∈H} (1/n)|Σ_{i=1}^n f(Xi) − f(Xi′)|].   (44)

We can create our new randomness by noting that since Xi and Xi′ are identically distributed, f(Xi) − f(Xi′) has the same distribution as s_i(f(Xi) − f(Xi′)), where s_i is a random sign variable that is ±1 with equal probability. Introducing these variables and continuing the inequality, we thus have

E_{X,X′}[sup_{f∈H} (1/n)|Σ_{i=1}^n f(Xi) − f(Xi′)|] = E_{X,X′,s}[sup_{f∈H} (1/n)|Σ_{i=1}^n s_i(f(Xi) − f(Xi′))|].   (45)

We now have enough randomness to exploit the Sauer-Shelah lemma. If we fix X and X′, note that the quantities f(Xi) − f(Xi′) take values in [−1, 1] and collectively can take on at most V_H(n)² = O(n^{2d}) values. But for fixed X, X′, the quantities s_i(f(Xi) − f(Xi′)) are independent, zero-mean, bounded random variables, and hence for fixed f we have P[Σ_i s_i(f(Xi) − f(Xi′)) ≥ t] ≤ exp(−t²/(9n)) by Hoeffding's inequality. Union bounding over the O(n^{2d}) effectively distinct f, we obtain

P_s[sup_{f∈H} |Σ_i s_i(f(Xi) − f(Xi′))| ≥ t | X, X′] ≤ O(n^{2d}) exp(−t²/(9n)).   (46)

This is small as long as t ≫ √(nd log n), so (45) is O(√(d log n / n)), as claimed.

A particular consequence of Proposition 2.24 is the Dvoretzky-Kiefer-Wolfowitz inequality:

Proposition 2.26 (DKW inequality). For a distribution p on R and i.i.d. samples X1, . . . , Xn ∼ p, define the empirical cumulative distribution function as F_n(x) = (1/n) Σ_{i=1}^n I[Xi ≤ x], and the population cumulative distribution function as F(x) = p(X ≤ x). Then P[sup_{x∈R} |F_n(x) − F(x)| ≥ t] ≤ 2e^{−2nt²}.
This follows from applying Proposition 2.24 to the family of threshold functions.
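A simulation of the DKW inequality for Uniform(0, 1) samples (assuming numpy); the supremum is approximated on a fine grid, which is accurate enough for a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 500, 5_000, 0.05
grid = np.linspace(0, 1, 1001)
F = grid                                    # CDF of Uniform(0, 1)

sups = np.empty(trials)
for i in range(trials):
    x = rng.uniform(size=n)
    Fn = (x[:, None] <= grid).mean(axis=0)  # empirical CDF evaluated on the grid
    sups[i] = np.abs(Fn - F).max()

print(np.mean(sups >= t))                   # empirical exceedance probability (a bit below the bound)
print(2 * np.exp(-2 * n * t**2))            # DKW bound: 2 exp(-2 n t^2) ~ 0.164
```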

[Lecture 5]

2.6 Finite-Sample Analysis


Now that we have developed tools for analyzing statistical concentration, we will use these to analyze
the finite-sample behavior of robust estimators. Recall that we previously studied the minimum distance
functional defined as

θ̂(p̃) = θ∗(q), where q = argmin_{q∈G} TV(q, p̃).   (47)

This projects onto the set G under TV distance and outputs the optimal parameters for the projected
distribution.
The problem with the minimum distance functional defined above is that projection under TV usually doesn't make sense for finite samples! For instance, suppose that p is a Gaussian distribution and let p_n and p_n′ be the empirical distributions of two different sets of n samples. Then TV(p_n, p) = TV(p_n, p_n′) = 1 almost surely. This is because samples from a continuous probability distribution will almost surely be distinct, and TV distance doesn't give credit for being “close”: the TV distance between two point masses at 1 and 1.000001 is still 1.²
To address this issue, we will consider two solutions. The first solution is to relax the distance. Intuitively, the issue is that the TV distance is too strong: it reports a large distance even between a population distribution p and the finite-sample distribution p_n. We will replace the distance TV with a more forgiving distance TṼ and use the minimum distance functional corresponding to this relaxed distance. To show that projection under TṼ still works, we will need to check that the modulus m(G, ε) is still small after we replace TV with TṼ, and we will also need to check that the distance TṼ(p, p_n) between p and its empirical distribution is small with high probability. We do this below in Section 2.6.1.
An alternative to relaxing the distance from TV to TṼ is to expand the destination set from G to some M ⊃ G, such that even though p is not close to the empirical distribution p_n, some element of M is close to p_n. Another advantage to expanding the destination set is that projecting onto G may not be computationally tractable, while projecting onto some larger set M can sometimes be done efficiently. We will see how to statistically analyze this modified projection algorithm in Section 2.6.2, and study the computational feasibility of projecting onto a set M starting in Section 2.7.

2.6.1 Relaxing the Distance


Here we instantiate the first solution of replacing TV with some TṼ in the projection algorithm. The following lemma shows the properties we need TṼ to satisfy:

Lemma 2.27. Suppose that TṼ is a (pseudo-)metric such that TṼ(p, q) ≤ TV(p, q) for all p, q. If we assume that p∗ ∈ G and TV(p∗, p̃) ≤ ε, then the error of the minimum distance functional (2) with D = TṼ is at most m(G, 2ε′, TṼ, L), where ε′ = ε + TṼ(p̃, p̃_n).

Proof. By Proposition 2.4 we already know that the error is bounded by m(G, 2TṼ(p∗, p̃_n), TṼ, L). Since TṼ is a pseudometric, by the triangle inequality we have TṼ(p∗, p̃_n) ≤ TṼ(p∗, p̃) + TṼ(p̃, p̃_n). Finally, TṼ(p∗, p̃) ≤ TV(p∗, p̃) ≤ ε by assumption.
∗ ∗
TV(p , p̃) ≤ TV(p , p̃) by assumption.
f

Lemma 2.27 shows that we need TV


f to satisfy two properties: TV(p̃,
f p̃n ) should be small, and the modulus
m(G, , TV)
f should not be too much larger than m(G, , TV).
For mean estimation (where recall L(p, θ) = kθ − µ(p)k2 ), we will use the following TV:
f

f H (p, q) def
TV = sup |PX∼p [f (X) ≥ τ ] − PX∼q [f (X) ≥ τ ]|. (48)
f ∈H,τ ∈R

(Note the similarity to the distance in Proposition 2.24; we will make use of this later.) We will make the
def
particular choice H = Hlin , where Hlin = {x 7→ hv, xi | v ∈ Rd }.
f H is indeed upper-bounded by TV, since TV(p, q) = sup |p(E) − q(E)| is the supremum
First note that TV E
over all events E, and (48) takes a supremum over a subset of events. The intuition for taking the particular
family H is that linear projections of our data contain all information needed to recover the mean, so perhaps
it is enough for distributions to be close only under these projections.

Bounding the modulus. To formalize this intuition, we prove the following mean crossing lemma:

Lemma 2.28. Suppose that p and q are two distributions such that TṼ_H(p, q) ≤ ε. Then for any f ∈ H, there are distributions r_p ≤ p/(1 − ε) and r_q ≤ q/(1 − ε) such that E_{X∼r_p}[f(X)] ≥ E_{Y∼r_q}[f(Y)].

²We will later study the W1 distance, which does give credit for being close.

Proof. We will prove the stronger statement that f(X) under r_p stochastically dominates f(Y) under r_q. Starting from p and q, we delete ε probability mass corresponding to the smallest points of f(X) in p to get r_p, and delete ε probability mass corresponding to the largest points of f(Y) in q to get r_q. Since TṼ_H(p, q) ≤ ε we have

sup_{τ∈R} |P_{X∼p}(f(X) ≥ τ) − P_{Y∼q}(f(Y) ≥ τ)| ≤ ε,   (49)

which implies that P_{r_p}(f(X) ≥ τ) ≥ P_{r_q}(f(Y) ≥ τ) for all τ ∈ R. Hence, r_p stochastically dominates r_q and E_{r_p}[f(X)] ≥ E_{r_q}[f(Y)].
Mean crossing lemmas such as Lemma 2.28 help us bound the modulus of relaxed distances for the family of resilient distributions. In this case we have the following corollary:

Corollary 2.29. For the family G_TV(ρ, ε) of (ρ, ε)-resilient distributions and L(p, θ) = ‖θ − µ(p)‖2, we have

m(G_TV(ρ, ε), ε, TṼ_{H_lin}) ≤ 2ρ.   (50)

Compare to Theorem 2.10, where we showed that m(G_TV(ρ, ε), ε, TV) ≤ 2ρ. Thus as long as Theorem 2.10 is tight, relaxing from TV to TṼ_{H_lin} doesn't increase the modulus at all!

Proof of Corollary 2.29. Let p, q ∈ G_TV(ρ, ε) be such that TṼ(p, q) ≤ ε. Take v = argmax_{‖v‖2=1} v^⊤(E_p[X] − E_q[X]), so that E_p[v^⊤X] − E_q[v^⊤X] = ‖E_p[X] − E_q[X]‖2. It follows from Lemma 2.28 that there exist r_p ≤ p/(1 − ε), r_q ≤ q/(1 − ε) such that

E_{r_p}[v^⊤X] ≤ E_{r_q}[v^⊤X].   (51)

Furthermore, from p, q ∈ G_TV(ρ, ε), we have

E_p[v^⊤X] − E_{r_p}[v^⊤X] ≤ ρ,   (52)
E_{r_q}[v^⊤X] − E_q[v^⊤X] ≤ ρ.   (53)

Then,

‖E_p[X] − E_q[X]‖2 = E_p[v^⊤X] − E_q[v^⊤X]   (54)
 = (E_p[v^⊤X] − E_{r_p}[v^⊤X]) + (E_{r_p}[v^⊤X] − E_{r_q}[v^⊤X]) + (E_{r_q}[v^⊤X] − E_q[v^⊤X])   (55)
 ≤ ρ + 0 + ρ = 2ρ,   (56)

which shows the modulus is small as claimed.

[Figure 6 diagram: the means satisfy µ_{p1} ≤ µ_{r_{p2}} ≤ µ_{r_{p1}} ≤ µ_{p2} on a line.]

Figure 6: Illustration of the mean crossing lemma. For any distributions p1, p2 that are close under TṼ, we can truncate the ε-tails of each distribution to make their means cross.

Bounding the distance to the empirical distribution. Now that we have bounded the modulus, it remains to bound the distance TṼ(p̃, p̃_n). Note that TṼ(p̃, p̃_n) is exactly the quantity bounded in equation (42) of Proposition 2.24; we thus have that TṼ_H(p̃, p̃_n) ≤ O(√((vc(H) + log(1/δ))/n)) with probability 1 − δ. Here vc(H) is the VC dimension of the family of threshold functions {x ↦ I[f(x) ≥ τ] | f ∈ H, τ ∈ R}. So, for H = H_lin all we need to do is bound the VC dimension of the family of halfspace functions on R^d.
We claimed earlier that this VC dimension is d + 1, but we prove it here for completeness. We will show that no set of points x1, . . . , x_{d+2} ∈ R^d can be shattered into all 2^{d+2} possible subsets using halfspaces. For any such points we can find multipliers a1, . . . , a_{d+2} ∈ R, not all zero, such that

Σ_{i=1}^{d+2} a_i x_i = 0,  Σ_{i=1}^{d+2} a_i = 0.   (57)

Let S+ = {i | a_i > 0} and S− = {i | a_i < 0}. We will show that the convex hulls of {x_i}_{i∈S+} and {x_i}_{i∈S−} intersect. Consequently, there is no vector v and threshold τ such that ⟨x_i, v⟩ ≥ τ iff i ∈ S+. (This is because both a halfspace and its complement are convex, so if we let H_{v,τ} denote the halfspace, it is impossible to have S+ ⊂ H_{v,τ}, S− ⊂ H_{v,τ}^c, and conv(S+) ∩ conv(S−) ≠ ∅.)
To prove that the convex hulls intersect, note that we have

(1/A) Σ_{i∈S+} a_i x_i = (1/A) Σ_{i∈S−} (−a_i) x_i,   (58)

where A = Σ_{i∈S+} a_i = Σ_{i∈S−} (−a_i). But the left-hand side lies in conv(S+) while the right-hand side lies in conv(S−), so the convex hulls do indeed intersect.
This shows that x1, . . . , x_{d+2} cannot be shattered, so vc(H_lin) ≤ d + 1. Combining this with Proposition 2.24, we obtain:

Proposition 2.30. With probability 1 − δ, we have TṼ_{H_lin}(p̃, p̃_n) ≤ O(√((d + log(1/δ))/n)).
0
Combining this with Corollary 2.29 and Lemma 2.27, we see that projecting onto G_TV(ρ, 2ε′) under TṼ_{H_lin} performs well in finite samples, for ε′ = ε + O(√(d/n)). For instance, if G has bounded covariance we achieve error O(√(ε + √(d/n))); if G is sub-Gaussian we achieve error Õ(ε + √(d/n)); and in general if G has bounded ψ-norm we achieve error O((ε + √(d/n)) ψ^{−1}(1/(ε + √(d/n)))) ≤ O((ε + √(d/n)) ψ^{−1}(1/ε)).
This analysis is slightly sub-optimal, as the best lower bound we are aware of is Ω(εψ^{−1}(1/ε) + √(d/n)), i.e. the ψ^{−1}(1/ε) coefficient in the dependence on n shouldn't be there. However, it is accurate as long as ε is large compared to √(d/n).

Connection to Tukey median. A classical robust estimator for the mean is the Tukey median, which solves the problem

min_µ max_{v∈R^d} |P_{X∼p̃_n}[⟨X, v⟩ ≥ ⟨µ, v⟩] − 1/2|.   (59)

[Note: this definition is slightly wrong as it does not behave gracefully when there is a point mass at µ.]
It is instructive to compare this to projection under TṼ, which corresponds to

min_{q∈G} max_{v∈R^d, τ∈R} |P_{X∼p̃_n}[⟨X, v⟩ ≥ τ] − P_{X∼q}[⟨X, v⟩ ≥ τ]|.   (60)

The differences are: (1) the Tukey median only minimizes over the mean rather than the full distribution q; (2) it only considers the threshold ⟨µ, v⟩ rather than all thresholds τ; and (3) it assumes that the median of any one-dimensional projection ⟨X, v⟩ is equal to its mean (which is why we subtract 1/2 in (59)). Distributions satisfying this final property are said to be unskewed.
For unskewed distributions with “sufficient probability mass” near the mean, the Tukey median yields a
robust estimator. In fact, it can be robust even if the true distribution has heavy tails (and hence is not
resilient), by virtue of leveraging the unskewed property. We will explore this in an exercise.
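The following is a brute-force sketch (assuming numpy, with an illustrative function name) of objective (59): it restricts the candidate means µ to the data points and the directions v to a random finite set, so it is only a rough approximation of the Tukey median, but it illustrates the robustness to a cluster of far-away outliers.

```python
import numpy as np

def approx_tukey_median(X, n_dirs=100, seed=0):
    """Brute-force approximation of (59): restrict candidates mu to the data
    points and directions v to random unit vectors, then minimize
    max_v |P_n[<X, v> >= <mu, v>] - 1/2|."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    V = rng.normal(size=(n_dirs, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # random unit directions
    proj = X @ V.T                                  # (n, n_dirs) projections
    scores = np.zeros(n)
    for j in range(n):                              # candidate mu = X[j]
        frac = (proj >= proj[j]).mean(axis=0)       # P_n[<X, v> >= <mu, v>] per direction
        scores[j] = np.abs(frac - 0.5).max()
    return X[np.argmin(scores)]

rng = np.random.default_rng(1)
n, eps = 300, 0.1
X = rng.normal(size=(n, 2))
X[: int(eps * n)] += np.array([20.0, 20.0])          # a cluster of far-away outliers
print(X.mean(axis=0))                                # pulled toward the outliers (~[2, 2])
print(approx_tukey_median(X))                        # stays near the true mean (0, 0)
```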
[Lectures 6-7]

18
2.6.2 Expanding the Set
In Section 2.6.1 we saw how to resolve the issue with TV projection by relaxing to a weaker distance TV. f
We will now study an alternate approach, based on expanding the destination set G to a larger set M. For
this approach we will need to reference the “true empirical distribution” p∗n . What we mean by this is the
following: Whenever TV(p∗ , p̃) ≤ , we know that p∗ and p̃ are identical except for some event E of probability
. Therefore we can sample from p̃ as follows:
1. Draw a sample from X ∼ p∗ .

2. Check if E holds; if it does, replace X with a sample from the conditional distribution p̃|E .
3. Otherwise leave X as-is.
Thus we can interpret a sample from p̃ as having a 1 −  chance of being “from” p∗ . More generally, we
can construct the empirical distribution p̃n by first constructing the empirical distribution p∗n coming from
p∗ , then replacing Binom(n, ) of the points with samples from p̃|E . Formally, we have created a coupling
between the random variables p∗n and p̃n such that TV(p∗n , p̃n ) is distributed as n1 Binom(n, ).
Let us return to expanding the set from G to M. For this to work, we need three properties to hold:
• M is large enough: minq∈M TV(q, p∗n ) is small with high probability.

• The empirical loss L(p∗n , θ) is a good approximation to the population loss L(p∗ , θ).
• The modulus is still bounded: minp,q∈M:TV(p,q)≤2 L(p, θ∗ (q)) is small.
In fact, it suffices for M to satisfy a weaker property; we only need the “generalized modulus” to be small
relative to some G 0 ⊂ M:

Proposition 2.31. For a set G 0 ⊂ M, define the generalized modulus of continuity as


def
m(G 0 , M, 2) = min L(p, θ∗ (q)). (61)
p∈G 0 ,q∈M:TV(p,q)≤2

Assume that the true empirical distribution p∗n lies in G 0 with probability 1 − δ. Then the minimum distance
functional projecting under TV onto M has empirical error L(p∗n , θ̂) at most m(G 0 , M, 20 ) with probability at
least 1 − δ − P[Binom(, n) ≥ 0 n].

Proof. Let 0 = TV(p∗n , p̃n ), which is Binom(, n)-distributed. If p∗n lies in G 0 , then since G 0 ⊂ M we know
that p̃n has distance at most 0 from M, and so the projected distribution q satisfies TV(q, p̃n ) ≤ 0 and hence
TV(q, p∗n ) ≤ 20 . It follows from the definition that L(p∗n , θ̂) = L(p∗n , θ∗ (q)) ≤ m(G 0 , M, 20 ).

A useful bound on the binomial tail is that P[Binom(, n) ≥ 2n] ≤ exp(−n/3). In particular the empirical
error is at most m(G 0 , M, 4) with probability at least 1 − δ − exp(−n/3).

Application: bounded kth moments. First suppose that the distribution p∗ has bounded kth moments,
i.e. Gmom,k (σ) = {p | kpkψ ≤ σ}, where ψ(x) = xk . When k > 2, the empirical distribution p∗n will not have
bouned kth moments until n ≥ Ω(dk/2 ). This is because if we take a single sample x1 ∼ p and let v be a
unit vector in the direction of x1 − µ, then Ex∼p∗n [hx − µ, vik ] ≥ n1 kx1 − µkk2 ' dk/2 /n, since the norm of

kx1 − µk2 is typically d.
Consequently, it is necessary to expand the set and we will choose G 0 = M = GTV (ρ, ) for ρ = O(σ1−1/k )
to be the set of resilience distributions with appropriate parameters ρ and . We already know that the
modulus of M is bounded by O(σ1−1/k ), so the hard part is showing that the empirical distribution p∗n lies
in M with high probability.
As noted above, we cannot hope to prove that p∗n has bounded moments except when n = Ω(dk/2 ), which
is too large. We will instead show that certain truncated moments of p∗n are bounded as soon as n = Ω(d),

19
and that these truncated moments suffice to show resilience. Specifically, if ψ(x) = xk is the Orlicz function
for the kth moments, we will define the truncated function

xk

: x ≤ x0
ψ̃(x) = k−1 (62)
kx0 (x − x0 ) + xk0 : x > x0

In other words, ψ̃ is equal to ψ for x ≤ x0 , and is the best linear lower bound to ψ for x > x0 . Note that ψ̃ is
L-Lipschitz for L = kxk−10 . We will eventually take x0 = (k k−1 )−1/k and hence L = (1/)(k−1)/k . Using a
symmetrization argument, we will bound the truncated supkvk2 ≤1 Ep∗n [ψ̃(|hx − µ, vi|/σ)].
Proposition 2.32. Let X1 , . . . , Xn ∼ p∗ , where p∗ ∈ Gmom,k (σ). Then,
 n   k  r k
1X |hXi − µ, vi| dk
EX1 ,...,Xn ∼p∗ sup ψ̃ − U (v) ≤ O 2L , (63)
kvk2 ≤1 n i=1 σ n

where L = kxk−1
0 and U (v) is a function satisfying U (v) ≤ 1 for all v.
Before proving Proposition 2.32, let us interpret its significance. Take x0 = (k k−1 )−1/k and hence L =
1−1/k
 . Take n large enough so that the right-hand-side of (63) is at most 1, which requires n ≥ Ω(kd/2−2/k ).
We then obtain a high-probability bound on the ψ̃-norm of p∗n , i.e. the ψ̃-norm is at most O(δ −1/k ) with
probability 1 − δ. This implies that p∗n is resilient with parameter ρ = σψ̃ −1 (O(δ −1/k )/) = 2σ1−1/k . A
useful bound on ψ̃ −1 is ψ̃(−1)(z) ≤ x0 + z/L, and since x0 ≤ (1/)−1/k and L = (1/)(k−1)/k in our case, we
have
ρ ≤ O(σ1−1/k δ −1/k ) with probability 1 − δ.
This matches the population-bound of O(σ1−1/k ), and only requires kd/2−2/k samples, in contrast to the
d/2 samples required before. Indeed, this sample complexity dependence is optimal (other than the factor of
−1/k
k); the only
p drawback is that we do not get exponential tails (we instead obtain tails of δ , which is worse
than the log(1/δ) from before).
Now we discuss some ideas that are needed in the proof. We would like to somehow exploit the fact that ψ̃
is L-Lipschitz to prove concentration. We can do so with the following keystone result in probability theory:

Theorem 2.33 (Ledoux-Talagrand Contraction). Let φ : R → R be an L-Lipschitz function such that


φ(0) = 0. Then for any convex, increasing function g and Rademacher variables 1:n ∼ {±1}, we have
n
X n
X
E1:n [g(sup i φ(ti ))] ≤ E1:n [g(L sup i ti )]. (64)
t∈T i=1 t∈T i=1

Let us interpret this result. We should think of the ti as a quantity such as hxi − µ, vi, where
P abstracting
to ti yields generality
P and notational simplicity. Theorem 2.33 says that if we let Y = sup t∈T i i φ(ti ) and
Z = L supt∈T i i ti , then E[g(Y )] ≤ E[g(Z)] for all convex increasing functions g. When this holds we say
that Y stochastically dominates Z in second order ; intuitively, it is equivalent to saying that Z has larger
mean than Y and greater variation around its mean. For distributions supported on just two points, we can
formalize this as follows:
Lemma 2.34 (Two-point stochastic dominance). Let Y take values y1 and y2 with probability 12 , and Z take
values z1 and z2 with probability 12 . Then Z stochastically dominates Y (in second order) if and only if

z1 + z2 y1 + y2
≥ and max(z1 , z2 ) ≥ max(y1 , y2 ). (65)
2 2
Proof. Without loss of generality assume z2 ≥ z1 and y2 ≥ y1 . We want to show that E[g(Y )] ≤ E[g(Z)] for
all convex increasing g if and only if (65) holds. We first establish necessity of (65). Take g(x) = x, then we
require E[Y ] ≤ E[Z], which is the first condition in (65). Taking g(x) = max(x − z2 , 0) yields E[g(Z)] = 0
and E[g(Y )] ≥ 12 max(y2 − z2 , 0), so E[g(Y )] ≤ E[g(Z)] implies that y2 ≤ z2 , which is the second condition in
(65).

20
We next establish sufficiency, by conjuring up appropriate weights for Jensen’s inequality. We have

y2 − z1 z2 − y2  z (y − z ) + z (z − y ) 
2 2 1 1 2 2
g(z2 ) + g(z1 ) ≥ g = g(y2 ), (66)
z2 − z1 z2 − z1 z2 − z1
z2 − y2 y2 − z1  z (z − y ) + z (y − z ) 
2 2 2 1 2 1
g(z2 ) + g(z1 ) ≥ g = g(z1 + z2 − y2 ) ≥ g(y1 ). (67)
z2 − z1 z2 − z1 z2 − z1

Here the first two inequalities are Jensen while the last is by the first condition in (65) together with the
monotonicity of g. Adding these together yields g(z2 ) + g(z1 ) ≥ g(y2 ) + g(y1 ), or E[g(Z)] ≥ E[g(Y )], as
−z1
desired. We need only check that the weights yz22 −z 1
and zz22 −y
−z1 are positive. The second weight is positive
2

by the assumption z2 ≥ y2 . The first weight could be negative if y2 < z1 , meaning that both y1 and y2 are
smaller than both z1 and z2 . But in this case, the inequality E[g(Y )] ≤ E[g(Z)] trivially holds by monotonicity
of g. This completes the proof.
We are now ready to prove Theorem 2.33.
Proof of Theorem 2.33. Without loss of generality we may take L = 1. Our strategy will be to iteratively
apply an inequality for a single i to replace all the φ(ti ) with ti one-by-one. The inequality for a single i is
the following:
Lemma 2.35. For any 1-Lipschitz function φ with φ(0) = 0, any collection T of ordered pairs (a, b), and
any convex increasing function g, we have

E∼{−1,+1} [g( sup a + φ(b))] ≤ E∼{−1,+1} [g( sup a + b)]. (68)


(a,b)∈T (a,b)∈T

To prove this, let (a+ , b+ ) attain the sup of a + φ(b) for  = +1, and (a− , b− ) attain the sup for  = −1.
We will check the conditions of Lemma 2.34 for

y1 = a− − φ(b− ), (69)
y2 = a+ + φ(b+ ), (70)
z1 = max(a− − b− , a+ − b+ ), (71)
z2 = max(a− + b− , a+ + b+ ). (72)

(Note that z1 and z2 are lower-bounds on the right-hand-side sup for  = −1, +1 respectively.)
First we need max(y1 , y2 ) ≤ max(z1 , z2 ). But max(z1 , z2 ) = max(a− + |b− |, a+ + |b+ |) ≥ max(a− −
φ(b− ), a+ + φ(b+ )) = max(y1 , y2 ). Here the inequality follows since φ(b) ≤ |b| since φ is Lipschitz and
φ(0) = 0.
Second we need y1 +y2
2
≤ z1 +z
2 . We have z1 + z2 ≥ max((a− − b− ) + (a+ + b+ ), (a− + b− ) + (a+ − b+ )) =
2

+ −b− |
a+ + a− + |b+ − b− |, so it suffices to show that a+ +a− +|b
2 ≥ a+ +a− +φ(b
2
+ )−φ(b− )
. This exactly reduces to
φ(b+ ) − φ(b− ) ≤ |b+ − b− |, which again follows since φ is Lipschitz. This completes the proof of the lemma.
Now to prove the general proposition we observe that if g(x) is convex in x, so is g(x + t) for any t. We

21
then proceed by iteratively applying Lemma 2.35:
n
X n−1
X
E1:n [g(sup i φ(ti ))] = E1:n−1 [En [g(sup i φ(ti ) +n φ(tn )) | 1:n−1 ]] (73)
t∈T i=1 t∈T i=1 | {z }
| {z } φ(b)
a
n−1
X
≤ E1:n−1 [En [g(sup i φ(ti ) + n tn ) | 1:n−1 ]] (74)
t∈T i=1
n−1
X
= E1:n [[g(sup i φ(ti ) + n tn )] (75)
t∈T i=1

..
. (76)
n
X
≤ E1:n [g(sup 1 φ(t1 ) + i ti )] (77)
t∈T i=2
n
X
≤ E1:n [g(sup i ti )], (78)
t∈T i=1

which completes the proof.


Let us return now to bounding the truncated moments in Proposition 2.32.
Proof of Proposition 2.32. We start with a symmetrization argument. Let µψ̃ = EX∼p∗ [ψ̃(|hX − µ, vi|/σ)],
and note that µψ̃ ≤ µψ ≤ 1. Now, by symmetrization we have
 n   k
1X |hXi − µ, vi|
EX1 ,...,Xn ∼p∗ sup ψ̃ − µψ̃ (79)
kvk2 ≤1 n i=1 σ
 n   |hX − µ, vi|   |hX 0 − µ, vi|  k 
1X i i
≤ EX,X 0 ∼p, sup i ψ̃ − ψ̃ (80)
kvk2 ≤1 n i=1 σ σ
 n k 
1X  |hXi − µ, vi| 
≤ 2k EX∼p, sup i ψ̃ . (81)
kvk2 ≤1 n i=1 σ

Here the first inequality adds and subtracts the mean, the second applies symmetrization, while the third
uses the fact that optimizing a single v for both X and X 0 is smaller than optimizing v separately for each
(and that the expectations of the expressions with X and X 0 are equal to each other in that case).
We now apply Ledoux-Talagrand contraction. Invoking Theorem 2.33 with g(x) = |x|k , φ(x) = ψ̃(|x|)
and ti = hXi − µ, vi|/σ, we obtain
 n  |hX − µ, vi|  k  n ki
1X i
h 1X
EX∼p, sup i ψ̃ ≤ (L/σ)k EX∼p, sup i hXi − µ, vi (82)
kvk2 ≤1 n i=1 σ kvk2 ≤1 n i=1
h 1X n ki
k
= (L/σ) EX∼p, i (Xi − µ) . (83)
n i=1 2

Pn
We are thus finally left to bound EX∼p, [k i=1 i (Xi − µ)kk ]. Here we will use Khintchine’s inequality, which
says that X
Ak kzk2 ≤ E [| i zi |k ]1/k ≤ Bk kzk2 , (84)
i

where Ak is Θ(1) and Bk is Θ( k) for k ≥ 1. Applying this in our case, we obtain
n
X n
X
EX, [k i (Xi − µ)kk2 ] ≤ O(1)k EX,,0 [| i hXi − µ, 0 i|k ]. (85)
i=1 i=1

22
Next apply Rosenthal’s inequality (Eq. 21), which yields that

Xn n
X √ n
X
EX, [ i hXi − µ, 0 i|k | 0 ] ≤ O(k)k EX, [|hXi − µ, 0 i|k | 0 ] + O( k)k ( E[hXi − µ, 0 i|2 ])k/2 (86)
i=1 i=1 i=1

≤ O(k)k · nσ k k0 kk2 + O( kn)k σ k k0 kk2 (87)
√ √
= O(σk d)k n + O(σ kd)k nk/2 , (88)

where the last step uses that k0 k2 = d and the second-to-last step uses the bounded moments of X. As
long as n  k k/(k−2) the latter term dominates and hence plugging back into we conclude that
n
X √
EX, [k i (Xi − µ)kk2 ]1/k = O(σ kdn). (89)
i=1
p
Thus bounds the symmetrized truncated moments in (82-83) by O(L kd/n)k , and plugging back into (81)
completes the proof.

Application: isotropic Gaussians. Next take Ggauss to be the family of isotropic Gaussians N (µ, I). We
saw earlier that the modulus m(Ggauss , ) was O() for the mean estimation loss L(p, θ) = kθ − µ(p)k2 . Thus
projecting onto Ggauss yields error O() for mean estimation in the limit of infinite samples, but doesn’t work
for finite samples since the TV distance to Ggauss will always be 1.
Instead we will project onto the set Gcov (σ) = {p | kE[(X − µ)(X − µ)> ]k ≤ σ 2 }, for σ 2 = O(1 + d/n +
log(1/δ)/n). We already saw in Lemma 2.23 that when p∗ is √ (sub-)Gaussian the empirical distribution p∗n
lies within this set. But the modulus of Gcov only decays as O( ), which is worse than the O() dependence
that we had in infinite samples! How can we resolve this issue?
We will let Giso be the family of distributions whose covariance is not only bounded, but close to the
identity, and where moreover this holds for all (1 − )-subsets:
def p
Giso (σ1 , σ2 ) = {p | kEr [X − µ]k2 ≤ σ1 and kEr [(X − µ)(X − µ)> − Ik ≤ (σ2 )2 , whenever r ≤ }. (90)
1−

p
The following
p improvement on Lemma 2.23 implies that pn ∈ G iso (σ 1 , σ2 ) for σ 1 = O( log(1/)) and
σ2 = O(  log(1/)). [Note: the lemma below is wrong as stated. To be fixed.]
Lemma 2.36. Suppose that X1 , . . . , Xn are drawn independently from a sub-Gaussian distribution with
sub-Gaussian parameter σ, mean 0, and identity covariance. Then, with probability 1 − δ we have
n
1 X   d + log(1/δ) 
Xi Xi> − I ≤ O σ 2 ·  log(1/) + , and (91)
|S| n
i∈S
n
r
1 X   p d + log(1/δ) 
Xi ≤ O σ ·  log(1/) + (92)
|S| 2 n
i∈S

for all subsets S ⊆ {1, . . . , n} with |S| ≥ (1 − )n. In particular, if n  d/(2 log(1/)) then δ ≤
exp(−cn log(1/)) for some constant c.
∗ 0
0
We will return p of Lemma 2.36 later. For now, note that this means 0that pn ∈ G for
p to the proof
G = Giso (O( log(1/), O(  log(1/))), at least for large enough n. Furthermore, G ⊂ M for M =
Gcov (1 + O( log(1/))).
Now we bound the generalized modulus of continuity:
p
Lemma 2.37. Suppose that√p ∈ Giso (σ1 , σ2 ) and q ∈ Gcov ( 1 + σ22 ), and furthermore TV(p, q) ≤ . Then
kµ(p) − µ(q)k2 ≤ O(σ1 + σ2  + ).

23
Proof. Take the midpoint distribution r = min(p,q) 0
1− , and write q = (1 − )r + q . We will bound kµ(r) − µ(q)k2
(note that kµ(r) − µ(p)k2 is already bounded since p ∈ Giso ). We have that
Covq [X] = (1 − )Er [(X − µq )(X − µq )> ] + Eq0 [(X − µq )(X − µq )> ] (93)
> >
= (1 − )(Covr [X] + (µq − µr )(µq − µr ) ) + Eq0 [(X − µq )(X − µ)q) ] (94)
> >
 (1 − )(Covr [X] + (µq − µr )(µq − µr ) ) + (µq − µq0 )(µq − µq0 ) . (95)
(1−)2
A computation yields µq − µq0 =  (µq − µr ). Plugging this into (95) and simplifying, we obtain that

Covq [X]  (1 − )(Covr [X] + (1/)(µq − µr )(µq − µr )> ). (96)


Now since Covr [X]  (1 − σ22 )I, we have kCovq [X]k ≥ (1 − )(1 − σ22 ) + (1/)kµq − µr k22 . But by assumption

kCovq [X]k ≤ 1+σ22 . Combining these yields that kµr −µq k22 ≤ (2σ22 ++σ22 ), and so kµr −µq k2 ≤ O(+σ2 ),
which gives the desired result.
In conclusion, projecting onto Gcov (1 + O( log(1/)))
p under TV distance gives a robust mean estimator
for isotropic Gaussians, which achieves error O( log(1/)). √ This is slightly worse than the optimal O()
bound but improves over the naı̈ve analysis that only gave O( ).
Another advantage of projecting onto Gcov is that, as we will see in Section 2.7, this projection can be
done computationally efficiently.

Proof of Lemma 2.36. TBD


[Lecture 8]

2.7 Efficient Algorithms


We now turn our attention to efficient algorithms. Recall that previously we considered minimum distance
functionals projecting onto sets G and M under distances TV and TV. f Here we will show how to approximately
project onto the set Gcov (σ), the family of bounded covariance distributions, under TV distance. The basic
idea is to write down a (non-convex) optimization problem that tries to find the projection, and then show
that the cost landscape of this optimization is nice enough that all local minima are within a constant factor
of the global minimum.
To study efficient computation we need a way of representing the distributions p̃ and p∗ . To do this we
will suppose that p̃ is the empirical distribution over n points x1 , . . . , xn , while p∗ is the empirical distribution
over some subset S of these points with |S| ≥ (1 − )n. Thus in particular p∗ is an -deletion of p̃.
min(p∗ ,p̃) p̃
Before we assumed that TV(p∗ , p̃) ≤ , but taking p0 = 1−TV(p 0
∗ ,p̃) , we have p ≤ 1− and kCovp0 [X]k ≤

σ2 p̃
1− ≤ 2σ 2 whenever kCovp∗ [X]k ≤ σ 2 . Therefore, taking p∗ ≤ 1− is equivalent to the TV corruption model
from before for our present purposes.
We will construct an efficient algorithm that, given p̃, outputs a distribution q such that TV(q, p∗ ) ≤ O()
and kCovq [X]k2 ≤ O(σ 2 ). This is similar to the minimum distance functional, in that it finds a distribution
close to p∗ with bounded covariance; the main difference is that q need not be the projection of p̃ onto Gcov ,
and also the covariance of q is bounded by O(σ 2 ) instead of σ 2 . However, the modulus of continuity bound
from before says that any distribution q that is near p∗ and has bounded covariance will approximate the
mean of p∗ . Specifically, we have (by tracking the constants in our previous midpoint argument):
q q
kΣq k kΣp k
Lemma 2.38. If TV(p, q) ≤ , then kµ(p) − µ(q)k2 ≤ 1− + 1− .

We will prove Lemma 2.38 at the end of the section.


The main result of this section is the following:
Proposition 2.39. Suppose p̃ and p∗ are empirical distributions as above with p∗ ≤ p̃/(1 − ), and further
suppose that kCovp∗ [X]k ≤ σ 2 and  < 1/3. Then given p̃ (but not p∗ ), there is an algorithm with runtime
1− 2 2
poly(n, d) that outputs a q with TV(p∗ , q) ≤ 1−

σ . In addition, kµ(q) − µ(p∗ )k2 ≤

and kCovq [X]k ≤ 1−3

4(1−2)
1−3 σ.

24
Note that the bound on kµ(p∗ ) − µ(q)k2 follows directly from the modulus bound on Gcov (σ) together
with the property TV(p∗ , q) ≤ 1−

.
The algorithm, MinCovL2, underlying Proposition 2.39 is given below; it maintains a weighted distribution
q, which places weight qi on point xi . It then computes the weighted mean and covariance, picking the
weights that minimize the norm of the covariance.

Algorithm 2 MinCovL2
1: Input: x1 , . . . , xn ∈ Rd .
2: Find any stationary point q of the optimization problem:
n
X
min sup qi hv, xi − µq i2 , (97)
q kvk2 ≤1 i=1
X
s.t. µq = qi xi ,
i
X 1
q ≥ 0, qi = 1, qi ≤
i
(1 − )n

3: Output µ̂q , the empirical mean for the stationary point q.

1
The intuition behind Algorithm 2 is as follows: the constraint qi ≤ (1−)n ensures that q is an -deletion
of the uniform distribution over x1 , . . . , xn . Then, subject to that constraint, Algorithm 2 seeks to minimize
the weighted covariance matrix: note the objective is exactly kΣq k2 .
Algorithm 2 is non-trivial to analyze, because although the constraint set is convex, the objective is
non-convex: both qi and µq are linear in q, and so the overall objective (even for a fixed v) is thus a non-convex
cubic in q. On the other hand, for any “reasonable” choice of q, µq should be close to µp∗ . If we apply this
approximation–substituting µp∗ for µq –then the objective becomes convex again. So the main idea behind
the proof is to show that this substitution can be (approximately) done.
Before getting into that, we need to understand what stationary points of (117) look like. In general,
a stationary point is one where the gradient is either zero, or where the point is at the boundary of the
constraint set and the gradient points outward into the infeasible region for the constraints.
However, the supremum over v can lead to a non-differentiable function (e.g. max(x1 , x2 ) is non-
differentiable when x1 = x2 ), so these conditions need to be refined slightly to handle that. We can use
something called a “Clarke subdifferential”, to show that the preceding conditions at least hold for some v
that maximizes the supremum:
Lemma 2.40. Suppose that q is a stationary point of (117). Then, for any feasible p, there exists a v ∈ Rd
with kvk2 = 1 such that
Eq [hv, X − µq i2 ] ≤ Ep [hv, X − µq i2 ]. (98)
Moreover, v is a maximizer of the left-hand-side, i.e. v > Σq v = kΣq k.
Proof. Let Fv (q) = Eq [hv, X − µq i2 ]. First, compute ∇Fv (q). We have
n
∂ ∂ X
F (q) = qj hv, xj − µq i2 (99)
∂qi ∂qi j=1
n
X  ∂µ
q
= hv, xi − µq i2 + 2 qj hv, xj − µq i (100)
j=1
∂qi

= hv, xi − µq i2 , (101)

where the last equality is because j qj (xj − µq ) = 0. Consequently, ∇Fv (q)i = hv, xi − µq i2 .
P

Now, let F (q) = maxkvk2 =1 Fv (q) = kΣq k. If the Pmaximizing v is unique and equal to v , then ∇F (q) =
∇Fv (q), and q is a stationary point if and only if i (qi − pi )∇Fv (q)i ≤ 0 for all feasible p, or equivalently
∗ ∗

Eq [hv ∗ , X − µq i2 ] − Ep [hv ∗ , X − µq i2 ] ≤ 0, which is exactly the condition (98).

25
Suppose (the harder case) that the maximizing v is not unique. Then F is not differentiable at q, but
P Clark subdifferential is the convex hull of ∇Fv (q) for all maximizing v’s.
the P Stationarity implies that
i (qi − pi )gi ≤ 0 for some g in this convex hull, and thus by convexity that i (qi − pi )∇Fv (q)i ≤ 0 for

some maximizing v ∗ . This gives us the same desired condition as before and thus completes the proof.
Given Lemma 2.40, we are in a better position to analyze Algorithm 2. In particular, for any p (we will
eventually take the global minimizer p of (117)), Lemma 2.40 yields

kCovq k = Eq [hv, X − µq i2 ] (102)


2
≤ Ep [hv, X − µq i ] (103)
2 2
= Ep [hv, X − µp i ] + hv, µp − µq i (104)
≤ kCovp k + kµp − µq k22 . (105)

The kµp − µq k22 quantifies the “error due to non-convexity”–recall that if we replace µq with a fixed µp
in (117), the problem becomes convex, and hence any stationary point would be a global minimizer. The
distance kµp − µq k22 is how much we pay for this discrepancy.
Fortunately, µp −µq is small, precisely due to the modulus of continuity! We can show that
q any feasible
q p, q
 q kΣ k p kΣ k
for (117) satisfies TV(p, q) ≤ 1− (see Lemma 2.41), hence Lemma 2.38 gives kµp − µq k2 ≤ 1−2 + 1−2 .
Plugging back in to (105), we obtain
 
q q 2
kCovq k ≤ kCovp k + kΣp k + kΣq k . (106)
1 − 2
p
For fixed kCovp k, we can view this as a quadratic inequality in kΣp k. Solving the quadratic then yields
 1 −  2
kCovq k ≤ kCovp k. (107)
1 − 3
In particular, taking p to be the global minimum p of (117), we have kCovp k ≤ kCovp∗ k ≤ σ 2 , so kCovq k ≤
 2
1−
1−3 σ 2 . Plugging back into Lemma 2.38 again, we then have

1 − 2 √
r
  1− 
kµq − µ k2 ≤
p∗ σ+ σ =2 σ, (108)
1 − 2 1 − 3 1 − 3
which proves Proposition 2.39.

2.7.1 Lower Bound (Breakdown at  = 1/3)


The 1 − 3 in the denominator of our bound means that Proposition 2.39 becomes vacuous once  ≥ 1/3. Is
this necessary? We will show that it indeed is.
Specifically, when  = 1/3, it is possible to have:
1 1
p∗ = δ−a + δ0 , (109)
2 2
1 1 1
p̃ = δ−a + δ0 + δb , (110)
3 3 3
1 1
q = δ0 + δb , (111)
2 2
where q is a stationary point no matter how large b is. In particular, µq = 2b can be arbitrarily far away from
the true mean of − a2 .
PnTo see this more2
intuitively, note that an equivalent minimization problem to (117) would be to minimize
i=1 qi (xi − µ) with respect to both qi and µ (since the minimizer for fixed q is always at µ = µq ). Therefore,
a stationary point is one such that:
• The qi are concentrated on the (1 − )n smallest values of (xi − µ)2

26
• µ is equal to µq
The distribution q clearly satisfies this: we have µq = b/2, and both 0 and b are closer to µq than −a is.
This also shows why the breakdown point is 1/3 and not smaller. If  were slightly smaller than 1/3, then
some of the mass of q would have to remain on δ−a . Then as b increases, the mean µq would increase more
slowly, and eventually −a would be closer to µq than b.

2.7.2 Auxiliary Lemmas


Proof of Lemma 2.38. Note thatq the proof of Lemma 2.1 implies that if E is an event with q(E) ≥ 1 − ,

then kEq [X] − Eq [X | E]k ≤ kΣq k 1− . Now if q, p satisfy TV(p, q) ≤ , there is a midpoint r that is an
q

-deletion of both p and q. Applying the preceding result, we thus have kEq [X] − Er [X]k2 ≤ kΣq k 1− .
q

Similarly kEp [X] − Er [X]k2 ≤ kΣp k 1− . The result then follows by the triangle inequality.

Lemma 2.41. Suppose that q, q 0 are both -deletions of a distribution p. Then TV(q, q 0 ) ≤ 
1− .

Proof. Conceptually, the reason Lemma 2.41 is true is that q 0 can be obtained from q by first adding an
-fraction of points (to get to p), then deleting an -fraction. Since TV ≤  exactly allows an -fraction of of
additions and deletions, this should yield the result. The reason we get 1− 
is because q and q 0 are only a
(1 − )-“fraction” of p, so the -deletions are more like 1− -deletions relative to q and q 0 .


To be more formal, for any set A, note that we have


p(A) r(A)
q(A) ≤ and q 0 (A) ≤ . (112)
1− 1−
Also, using Ac instead of A, we also get
p(A) −  p(A) − 
q(A) ≥ and q 0 (A) ≥ . (113)
1− 1−
Combining these inequalities yields
 + (1 − )q 0 (A) 
q(A) ≤ ≤ + q 0 (A), (114)
1− 1−
and similarly q 0 (A) ≤ 1−

+ q(A), which together implies |q(A) − q 0 (A)| ≤ 
1− . Since this holds for all A, we
obtain our TV distance bound.

[Lecture 9]

2.8 Approximate Eigenvectors in Other Norms


Algorithm 2 is specific to the `2 -norm. Let us suppose that we care about recovering an estimate µ̂ such that
kµ − µ̂k is small in some norm other than `2 (such as the `1 -norm, which may be more appropriate for some
combinatorial problems).
√ It turns out that an analog of bounded covariance is sufficient to enable estimation
with the typical O(σ ) error, as long as we can approximately solve the analogous eigenvector problem. To
formalize this, we will make use of the dual norm:
Definition 2.42. Given a norm k · k, the dual norm k · k∗ is defined as

kuk∗ = sup hu, vi. (115)


kvk2 ≤1

As some examples, the dual of the `2 -norm is itself, the dual of the `1 -norm is the `∞ -norm, and the dual
of the `∞ -norm is the `1 -norm. An important property (we omit the proof) is that the dual of the dual is the
original norm:
Proposition 2.43. If k · k is a norm on a finite-dimensional vector space, then k · k∗∗ = k · k.

27
For a more complex example: let kvk(k) be the sum of the k largest coordinates of v (in absolute value).
Then the dual of k · k(k) is max(kuk∞ , kuk1 /k). This can be seen by noting that the vertices of the constraint
set {u | kuk∞ ≤ 1, kuk1 ≤ k} are exactly the k-sparse {−1, 0, +1}-vectors.
Let Gcov (σ, k · k) denote the family of distributions satisfying maxkvk∗ ≤1 v > Covp [X]v ≤ σ 2 . Then Gcov is
resilient exactly analogously to the `2 -case:
q
p 2
Proposition 2.44. If p ∈ Gcov (σ, k · k) and r ≤ 1− , then kµ(r) − µ(p)k ≤ 1− σ. In other words, all

distributions in Gcov (σ, k · k) are (, O(σ ))-resilient.
Proof. We have that kµ(r) − µ(p)k = hµ(r) − µ(p), vi for some vector v with kvk∗ = 1. The result then
follows by resilience for the one-dimensional distribution hX, vi for X ∼ p.
When p∗ ∈ Gcov (σ, k · k), we will design efficient algorithms analogous to Algorithm 2. The main difficulty
is that in norms other than `2 , it is generally not possible to exactly solve the optimization problem
maxkvk∗ ≤1 v > Σ̂c v that is used in Algorithm 2. We instead make use of a κ-approximate relaxation:

Definition 2.45. A set C of positive semidefinie matrices is a κ-approximate relaxation if C contains vv >
for all v satisfying kvk∗ ≤ 1, and furthermore

sup hM, Σi ≤ κ sup v > Σv for all Σ  0. (116)


M ∈C kvk∗ ≤1

Thus a κ-approximate relaxation over-approximates hvv > , Σi, but also it underapproximates it within a
factor of κ. Given such a relaxation, we have the following analog to Algorithm 2:

Algorithm 3 MinCovNorm
1: Input: x1 , . . . , xn ∈ Rd and κ-approximate relaxation C.
2: Find any stationary point q of the optimization problem:
n
X
min sup qi (xi − µq )> M (xi − µq ), (117)
q M ∈C i=1
X
s.t. µq = qi xi ,
i
X 1
q ≥ 0, qi = 1, qi ≤
i
(1 − )n

3: Output µ̂q , the empirical mean for the stationary point q.


Algorithm 3 outputs an estimate of the mean with error O(σ κ). The proof is almost exactly the same
as Algorithm 2; the main difference is that we need to ensure that hΣp , M i, the inner product of M with the
true covariance, is not too large. This is where we use the κ-approximation property. We leave the detailed
proof as an exercise, and focus on how to construct a κ-approximate relaxation C.

Semidefinite
Pd programming. As a concrete example, suppose that we wish to estimate µ in the `1 -norm
kvk = j=1 |vj |. The dual norm is the `∞ -norm, and hence our goal is to approximately solve the optimization
problem
maximize v > Σv subject to kvk∞ ≤ 1. (118)
The issue with (118) is that it is not concave in v because of the quadratic function v > Σv. However, note
that v > Σv = hΣ, vv > i. Therefore, if we replace v with the variable M = vv > , then we can re-express the
optimization problem as

maximize hΣ, M i subject to Mjj ≤ 1 for all j, M  0, rank(M ) = 1. (119)

Here the first constraint is a translation of kvk∞ ≤ 1, while the latter two constrain M to be of the form vv > .

28
This is almost convex in M , except for the constraint rank(M ) = 1. If we omit this constraint, we obtain
the optimization

maximize hΣ, M i
subject to Mjj = 1 for all j,
M  0. (120)

Note that here we replace the constraint Mjj ≤ 1 with Mjj = 1; this can be done because the maximizer of
(120) will always have Mjj = 1 for all j. For brevity we often write this constraint as diag(M ) = 1.
The problem (120) is a special instance of a semidefinite program and can be solved in polynomial time (in
general, a semidefinite program allows arbitrary linear inequality or positive semidefinite constraints between
linear functions of the decision variables; we discuss this more below).
The optimizer M ∗ of (120) will always satisfy hΣ, M ∗ i ≥ supkvk∞ ≤1 v > Σv because and v with kvk∞ ≤ 1
yields a feasible M . The key is to show that it is not too much larger than this. This turns out to be a
fundamental fact in the theory of optimization called Grothendieck’s inequality:
Theorem 2.46. If Σ  0, then the value of (120) is at most π
2 supkvk∞ ≤1 v > Σv.
See Alon and Naor (2004) for a very well-written exposition on Grothendieck’s inequality and its relation
to optimization algorithms. In that text we also see that a version of Theorem 2.46 holds even when Σ is not
positive semidefinite or indeed even square. Here we produce a proof based on [todo: cite] for the semidefinite
case.
Proof of Theorem 2.46. The proof involves two key relations. To describe the first, given a matrix X let
arcsin[X] denote the matrix whose i, j entry is arcsin(Xij ) (i.e. we apply arcsin element-wise). Then we have
(we will show this later)
2
max v > Σv = max hΣ, arcsin[X]i. (121)
kvk∞ ≤1 X0,diag(X)=1 π

The next relation is that


arcsin[X]  X. (122)
Together, these imply the approximation ratio, because we then have
π
max hΣ, M i ≤ max hΣ, arcsin[M ]i = max v > Σv. (123)
M 0,diag(M )=1 M 0,diag(M )=1 2 kvk∞ ≤1

We will therefore focus on establishing (121) and (122).


To establish (121), we will show that any X with X  0, diag(X) = 1 can be used to produce a probability
distribution over vectors v such that E[v > Σv] = π2 hΣ, arcsin[X]i.
First, by Graham/Cholesky decomposition we know that there exist vectors ui such that Mij = hui , uj i
for all i, j. In particular, Mii = 1 implies that the ui have unit norm. We will then construct the vector v by
taking vi = sign(hui , gi) for a Gaussian random variable g ∼ N (0, I).
We want to show that Eg [vi vj ] = π2 arcsin(hui , uj i). For this it helps to reason in the two-dimensional
space spanned by vi and vj . Then vi vj = −1 if the hyperplane induced by g cuts between ui and uj , and +1
if it does not. Letting θ be the angle between ui and uj , we then have P[vj vj = −1] = πθ and hence

θ θ 2 π 2
Eg [vi vj ] = (1 − ) − = ( − θ) = arcsin(hui , uj i), (124)
π π π 2 π
as desired. Therefore, we can always construct a distribution over v for which E[v > Σv] = π2 hΣ, arcsin[M ]i,
hence the right-hand-side of (121) is at most the left-hand-side. For the other direction, note that the
maximizing v on the left-hand-side is always a {−1, +1} vector by convexity of v > Σv, and for any such vector
we have π2 arcsin[vv > ] = vv > . Thus the left-hand-side is at most the right-hand-side, and so the equality
(121) indeed holds.
We now turn our attention to establishing (122). For this, let X k denote the matrix whose i, j entry is
k
Xij (we take element-wise power). We require the following lemma:

29
k
Lemma 2.47. For all k ∈ {1, 2, . . .}, if X  0 then X  0.
Proof. The matrix X k is a submatrix of X ⊗k , where (X ⊗k )i1 ···ik ,j1 ···jk = Xi1 ,j1 · · · Xik ,jk . We can verify
that X ⊗k  0 (its eigenvalues are λi1 · · · λik where λi are the eigenvalues of X), hence so is X k since
submatrices of PSD matrices are PSD.
z 2n+1
P∞
With this in hand, we also make use of the Taylor series for arcsin(z): arcsin(z) = n=0 (2(2n)! n n!)2 2n+1 =

z3
z+ 6 + · · · . Then we have

X (2n)! 1 (2n+1)
arcsin[X] = X + n n!)2 2n + 1
X  X, (125)
n=1
(2

as was to be shown. This completes the proof.


Alternate proof (by Mihaela Curmei): We can also show that X k  0 more directly. Specifically,
we will  0 then A B  0, from which the result follows by induction. To show this let
P show that if A, B P
A = i λi ui u>
i and B = >
j νj vj vj and observe that
X X
A B=( λi ui u>
i ) ( νj vj vj> ) (126)
i j
X
= λi νj (ui u>
i ) (vj vj> ) (127)
i,j
X
= λi νj (ui vj )(ui v j )> , (128)
i,j
|{z} | {z }
≥0 0

from which the claim follows. Here the key step is that for rank-one matrices the operation behaves nicely:
(ui u>
i ) (vj vj> ) = (ui vj )(ui vj )> .

[Bonus Material (Optional)]

2.9 Semidefinite Programming and Sum-of-Squares


In the previous subsection, we saw how to approximately solve maxkvk∞ ≤1 v > Σv via the semidefinite program
defined by maxM 0,diag(M )=1 hM, Σi. In this section we will cover semidefinite programming in more detail,
and build up to sum-of-squares programming, which will be used to achieve error O(1−1/k ) when p∗ has
“certifiably bounded” kth moments (recall that we earlier achieved error O(1−1/k ) for bounded kth moments
but did not have an efficient algorithm).
A semidefinite program is an optimization problem of the form

maximize hA, Xi (129)


subject to X  0,
hX, B1 i ≤ c1 ,
..
.
hX, Bm i ≤ cm .

Here hX, Y i = tr(X T Y ) = ij Xij Yij is the inner product between matrices, which is the same as the
P

elementwise dot product when considered as n2 -dimensional vectors.


Here the matrix A specifies the objective of the program, while (Bj , cj ) specify linear inequality constraints.
We additionally have the positive semidefinite cone constraint that X  0, meaning that X must be symmetric
with only non-negative eigenvalues. Each of A and B1 , . . . , Bm are n × n matrices while the cj are scalars.
We can equally well minimize as maximize by replacing A with −A.

30
While (129) is the canonical form for a semidefinite program, problems that are seemingly more complex
can be reduced to this form. For one, we can add linear equality constraints as two-sided inequality constraints.
In addition, we can replace X  0 with L(X)  0 for any linear function L, by using linear equality constraints
to enforce the linear relations implied by L. Finally, we can actually include any  number of constraints

L1 (X) 0
L1 (X)  0, Lk (X)  0, since this is e.g. equivalent to the single constraint when
0 L2 (X)
k = 2. As an example of these observations, the following (arbitrarily-chosen) optimization problem is also a
semidefinite program:

minimize a> x + hA1 , M i + hA2 , Y i (130)


x,M,Y

subject to M + Y  Σ
diag(M ) = 1
tr(Y ) ≤ 1
Y 0
1 x>
 
0
x M

(As a brief aside, the constraint [1 x> ; x M ]  0 is equivalent to xx>  M which is in turn equivalent to
x> M −1 x ≤ 1 and M  0.)

Semidefinite constraints as quadratic polynomials. An alternative way of viewing the constraint


M  0 is that the polynomial pM (v) = v > M v is non-negative for all v ∈ Rd . More generally, if we have a
> > 0
non-hogoneous
  pM,y,c (v) = v M v + y v + c, we have pM,y,c (v) ≥ 0 for all v if and only if M  0
polynomial
>
c y /2
for M 0 =  0.
y/2 M
This polynomial perspective is helpful for solving eigenvalue-type problems. For instance, kM k ≤ λ if and
only if v > M v ≤ λkvk22 for all v, which is eqvuialent to asking that v > (λI − M )v ≥ 0 for all v. Thus kM k
can be expressed as the solution to

minimize λ (131)
subject to λI − M  0 (equivalently, v > (λI − M )v ≥ 0 for all v)

We thus begin to see a relationship between moments and polynomial non-negativity constraints.

Higher-degree polynomials. It is tempting to generalize the polynomial approach to higher moments.


For instance, M4 (p) denote the 4th moment tensor of p, i.e. the unique symmetric tensor such that

hM4 , v ⊗4 i = Ex∼p [hx − µ, vi4 ]. (132)

Note we can equivalently express hM4 , v ⊗4 i = ijkl (M4 )ijkl vi vj vk vl , and hence (M4 )ijkl = E[(xi − µ)(xj −
P
µ)(xk − µ)(xl − µ)].
def
A distribution p has bounded 4th moment if and only if hM, v ⊗4 i ≤ λkvk42 for all v. Letting pM (v) =
hM, v ⊗4 i, we thus can express the 4th moment of p as the polynomial program

minimize λ (133)
subject to λ(v12 + ··· + vd2 )2 − pM (v) ≥ 0 for all v ∈ R d

Unfortunately, in constrast to (130), (133) is NP-hard to solve in general. We will next see a way to
approximate (133) via a technique called sum-of-squares programming, which is a way of approximately
reducing polynomial programs such as (133) to a large but polynomial-size semidefinite program.

31
Warm-up: certifying non-negativity over R. Consider the one-dimensional polynomial

q(x) = 2x4 + 2x3 − x2 + 5 (134)

Is it the case that q(x) ≥ 0 for all x? If so, how would we check this?
What if I told you that we had
1 1
q(x) = (2x2 + x − 3)2 + (3x + 1)2 (135)
2 2
Then, it is immediate that q(x) ≥ 0 for all x, since it is a (weighted) sum of squares.
How can we construct such decompositions of q? First observe that we can re-write q as the matrix
function  >   
1 5 0 0 1
q(x) =  x   0 −1 1   x  . (136)
x2 0 1 2 x2
| {z }
M

On the other hand, the sum-of-squares decomposition for q implies that we can also write
 >    >    >   
1 −3 −3 1 1 1
1 1
q(x) =  x    1   1  +  3   3    x  , (137)
 
2 2
x2 2 2 0 0 x2

i.e. we can decompose the matrix M defining q(x) = [1; x; x2 ]> M [1; x;2 ] into a non-negative combination of
rank-one outer products, which is true if and only if M  0.
There is one problem with this, which is that despite our successful decomposition of q, M is self-evidently
 instance, M22 = −1.)
not positive semidefinite! (For  The issue is that the matrix M defining q(x) is not
5 0 −a
unique. Indeed, any M (a) =  0 2a − 1 1  would give rise to the same q(x), and a sum-of-squares
−a 1 2
decomposition merely implies that M (a)  0 for some a. Thus, we obtain the following characterization:
k
X
q(x) is a sum of squares qj (x)2 ⇐⇒ M (a)  0 for some a ∈ R. (138)
j=1

For the particular decomposition above we took a = 3.

Sum-of-squares in two dimensions. We can generalize the insights to higher-dimensional problems.


Suppose for instance that we wish to check whether q(x, y) = a40 x4 + a31 x3 y + a22 x2 y 2 + a13 xy 3 + a04 y 4 +
a30 x3 + a21 x2 y + a12 xy 2 + a03 y 3 + a20 x2 + a11 xy + a02 y 2 + a10 x + a01 y + a00 is non-negative for all x, y.
Again, this is hard-to-check, but we can hope to check the sufficient condition that q is a sum-of-squares,
which we will express as q sos 0. As before this is equivalent to checking that a certain matrix is positive
semidefinite. Observe that
 2 > 
−b0
 2 
x a40 a31 /2 −b a30 /2 −c x
0 00  
 xy   a31 /2 a22 + 2b a 13 /2 a 21 /2 + c −c −c xy 
0 00  
 2    2 
 y   −b a13 /2 a04 a21 /2 + c a03 /2 −b   y 
q(x, y) = 
 x   a30 /2 a21 /2 + c a21 /2 + c0 a20 + 2b0 a11 /2 + c00 a10 /2   x 
   (139)
−c0 a11 /2 + c00 a02 + 2b00 a01 /2   y 
     
 y   −c a03 /2
0 00 00
1 −b −c −b a10 /2 a01 /2 a00 1

for any b, b0 , b00 , c, c0 , c00 . Call the above expression M (b, b0 , b00 , c, c0 , c00 ), which is linear in each of its variables.
Then we have q sos 0 if and only if M (b, b0 , b00 , c, c0 , c00 )  0 for some setting of the bs and cs.

32
Sum-of-squares in arbitrary dimensions. In general, if we have a polynomial q(x1 , . . . , xd ) in d variables,
which has degree 2t, then we can embed it as some matrix M (b) (for decision variables b that capture the
symmetries in M as above), and the dimensionality of M will be the number of monomials of degree at most
t which turns out to be d+tt = O((d + t)t ).
The upshot is that any constraint of the form q sos 0, where q is linear in the decision variables, is a
semidefinite constraint in disguise. Thus, we can solve any program of the form

maximize c> y (140)


y

subject to q1 sos 0, . . . , qm sos 0,

where the qj are linear in the decision variables y. (And we are free to throw in any additional linear inequality
or semidefinite constraints as well.) We refer to such optimization problems as sum-of-squares programs, in
analogy to semidefinite programs.

Sum-of-squares for kth moment. Return again to the kth moment problem. As a polynomial program
we sought to minimize λ such that λ(v12 + · · · + vd2 )k/2 − hM2k , v ⊗2k i was a non-negative polynomial. It is
then natural to replace the non-negativity constraint with the constraint that λkvkk2 − hM2k , v ⊗2k i sos 0.
However, we actually have a bit more flexibility and it turns out that the best program to use is

minimize λ (141)
⊗2k
subject to λ − hM2k , v i+ (kvk22 − 1)q(v) sos 0 for some q of degree at most 2k − 2

Note that the family of all such q can be linearly parameterized and so the above is indeed a sum-of-
squares program. It is always at least as good as the previous program because we can take q(v) =
λ(1 + kvk22 + · · · + kvk2k−2
2 ).
When the solution λ∗ to (141) is at most σ 2k for M2k (p), we say that p has 2kth moment certifiably
bounded by σ 2k . In this case a variation on the filtering algorithm achieves error O(σ1−1/2k ). We will not
discuss this in detail, but the main issue we need to resolve to obtain a filtering algorithm is to find some
appropriate tensor T such that hT, M2k i = λ∗ and T “looks like” the expectation of v ⊗2k for some probability
distribution over the unit sphere. Then we can filter using τi = hT, (xi − µ̂)⊗2k i.
To obtain T requires computing the dual of (141), which requires more optimization theory than we
have assumed from the reader, but it can be done in polynomial time. We refer to the corresponding T
as a pseudomoment matrix. Speaking very roughly, T has all properties of a moment matrix that can be
“proved using only sum-of-squares inequalities”, which includes all properties that we needed for the filtering
algorithm to work. We will henceforth ignore the issue of T and focus on assumptions on p that ensure that
M2k (p) is certifiably bounded. The main such assumption is the Poincaré inequality, which we cover in the
next section.

2.10 Sum-of-Squares Certifiably from the Poincaré inequality


We now turn our attention to bounding the value of (141). Ignoring finite-sample issues, our goal is to
def
identify assumptions on p such that M2k (p) = EX∼p [(X − µ)⊗2k ] yields a small value for (141).
Before doing so, we will introduce some machinery for establishing bounds on (141). The main idea is
that of a sum-of-squares proof:
Definition 2.48. A polynomial inequality p(v) ≤ q(v) has a sum-of-squares proof if q(v) − p(v) sos 0. We
will also denote this as q(v) sos p(v) or p(v) sos q(v).
The usefulness of this perspective is that the relation sos satisfies many of the same properties as ≤:
• If p1 sos p2 and p2 sos p3 , then p1 sos p3 .
• If p1 sos q1 and p2 sos q2 , then p1 + p2 sos q1 + q2 .
• If p1 sos 0 and p2 sos 0, then p1 p2 sos 0.

33
• If p1 sos p2 , q1 sos q2 , and p2 , q1 sos 0, then p1 q1 sos p2 q2 .
• Moreover, many “standard” inequalities such as Cauchy-Schwarz and Hölder have sum-of-squares proofs.
Using these, we can often turn a normal proof that p ≤ q into a sum-of-squares proof that p  q as long as
we give sum-of-squares proofs for a small number of key steps.
For concreteness, we will prove the last two claims properties above. We first prove that p1 , p2 sos 0 =⇒
p1 p2 sos 0. Indeed we have
X X X
p1 (v)p2 (v) = ( p1i (v)2 )( p2j (v)2 ) = (p1 i(v)p2j (v))2 sos 0 (142)
i j ij

Next we prove that p1 sos p2 , q1 sos q2 , and p2 , q1 sos 0 implies p1 q2 sos p2 q2 . This is because

p2 q2 − p1 q1 = p2 (q2 − q1 ) + (p2 − p1 )q1 sos 0, (143)

where the second relation uses p2 , q2 − q1 sos 0 and p2 − p1 , q1 sos 0 together with the previous result.
In view of this, we can reframe bounding (141) as the following goal:
Goal: Find a sum-of-squares proof that hM2k (p), v ⊗2k i sos λkvk2k
2 .

Certifiability for Gaussians. We now return to the assumptions needed on p that will enable us to
provide the desired sum-of-squares proof. Let us start by observing that a sum-of-squares proof exists for any
Gaussian distribution: If p = N (µ, Σ), then

hM2k (N (µ, Σ)), v ⊗2k i = hM2k (N (0, I)), (Σ1/2 v)⊗2k i (144)
k
Y
(2i − 1) hI, (Σ1/2 v)⊗2k i

= (145)
i=1
k
Y
(2i − 1) kΣ1/2 vk2k

= 2 (146)
i=1
≤ (2k) kΣkk kvk2k
k
2 , (147)

so we may take λ = (2kkΣk)k . (Here I denotes the identity tensor that is 1 along the diagonal and zero
elsewhere.) Therefore normal distributions have certifiably bounded moments, but the proof above heavily
exploited the rotational symmetry of a normal distribution. We can provide similar proofs for other highly
symmetric distributions (such as the uniform distribution on the hypercube), but these are unsatisfying as
they only apply under very specific distributional assumptions. We would like more general properties that
yield certifiably bounded moments.

Poincaré inequality. The property we will use is the Poincaré inequality. A distribution p on Rd is said
to satisfy the Poincaré inequality with parameter σ if

Varx∼p [f (x)] ≤ σ 2 Ex∼p [k∇f (x)k22 ] (148)

for all differentiable functions f : Rd → R. This is a “global to local property”–it says that for any function
that for any function f that varies under p, that variation can be picked up in terms of local variation (the
gradient). In particular, it says that p doesn’t have any “holes” (regines with low probability density that lie
between two regions of high probability density). Indeed, suppose that A and B were two disjoint convex
regions with p(A) = p(B) = 12 . Then p cannot satisfy the Poincaré inequality with any constant, since there
is a function that is 1 on A, 0 on B, and constant on both A and B.
Below are some additional examples and properties of Poincaré distributions:
• A one-dimensional Gaussian N (µ, σ 2 ) is Poincaré with constant σ.
• If p, q are σ-Poincaré then their product p × q is σ-Poincaré. In particular a multivariate Gausssian
N (µ, σ 2 I) is σ-Poincaré.

34
• If X√∼ p is σ-Poincaré and A is a linear map, then AX is (σkAk)-Poincaré. In particular, aX1 + aX2
is ( a2 + b2 σ)-Poincaré when X1 and X2 are both σ-Poincaré, and N (µ, Σ) is kΣk1/2 -Poincaré.
• More generally, if X ∼ p is σ-Poincaré and f is L-Lipschitz, then f (X) is (σL)-Poincaré.
Together these imply that Poincaré distributions contain multivariate Gaussians, arbitrary Lipschitz functions
of Gaussians, and independent sums of such distributions. The above properties (except the initial Gaussian
property) are all straightforward computations. Let us next state two substantially deeper results:
• If p is σ-strongly log-concave (meaning that the log-probability density log p(x) satisfies ∇2 log p(x) 
− σ12 I), then p is σ-Poincaré (Bakry and Émery, 1985).
2
• Suppose that
√ the support of X ∼ p has `2 -radius at most R, and let Z = N (0, τ I) for τ ≥ 2R. Then
X + Z is (τ e)-Poincaré (Bardet et al., 2018).
Thus Poincaré encompasses all strongly log-concave densities, and effectively any product of bounded random
variables (after adding Gaussian noise, which we can always do ourselves).
It is instructive to compare Poincaré to the sub-Gaussian property that we have so far relied on. Poincaré
is neither strictly stronger or weaker than sub-Gaussian, but it is stronger than sub-exponential (we will see
this below). In general, we should think of Poincaré as being substantially stronger than sub-exponential: it
implies that not only is the distribution itself sub-exponential, but so is any Lipschitz function of the density.
As an example, consider the random variable (X, Y ) ∈ Rd where X ∼ N (0, I) and Y = X for a
Rademacher random P variable . Then (X, Y ) is sub-Gaussian, but not Poincaré with good constant: if we
take f (X, Y ) = i Xi Yi , then f is with high probability close to either +d or −d, so Var[f (X, Y )] ≈ d2 .
However, ∇f (X, Y ) = (Y1 , . . . , Yd , X1 , . . . , Xd ) and so k∇f (X, Y )k22 is close to 2d with
√ high probability. Thus
while the sub-Gaussian constant is O(1), the Poincaré constant in this case is Ω( d).

Consequences of Poincaré. So far we have seen conditions that imply Poincaré, but we would also like
to derive consequences of this property. Below are some of the most useful ones:
• If X ∼ p is σ-Poincaré, then Lipschitz functions concentrate: P[|f (x) − E[f (x)]| ≥ t] ≤ 6 exp(−t/(σL))
for any L-Lipschitz f .
• As a corollary, we have volume expansion: For any set A, let A be the set of points within `2 -distance
 of A. Then p(A)p(Ac ) ≤ 36 exp(−/σ).
This second property implies, for instance, that if p(A) ≥ δ, then almost all points will be within distance
O(σ log(1/δ)) of A.
To prove the second property, let f (x) = min(inf y∈A kx − yk2 .). Then f is Lipschitz, is 0 on A, and is 
on Ac . Let µ be the mean of f (X). Since f is sub-exponential we have p(A) = p(f (X) = 0) ≤ 6 exp(−µ/σ),
and p(Ac ) = p(f (X) = ) ≤ 6 exp(−( − µ))/σ). Multiplying these together yields the claimed result.
The most important property for our purposes, however, will be the following:
Theorem 2.49. Suppose that p is σ-Poincaré and let f be a differentiable function such that E[∇j f (X)] = 0
for j = 1, . . . , k − 1. Then there is a universal constant Ck such that Var[f (X)] ≤ Ck σ 2k E[k∇k f (X)k2F ].
Note that k = 1 is the original Poincaré property, so we can think of Theorem 2.49 as a generalization of
k
Poincaré to higher derivatives. Note also that ∇k f (X) is a tensor in Rd ; the notation k∇k f (X)k2F denotes
k
the squared Frobenius norm of ∇ f (X), i.e. the sum of the squares of its entries.
Theorem 2.49, while it may appear to be a simple generalization of the Poincaré property, is a deep
result that was established in Adamczak and Wolff (2015), building on work of Latala (2006). We will use
Theorem 2.49 in the sequel to construct our sum-of-squares proofs.

Sum-of-squares proofs for Poincaré distributions. Here we will construct sum-of-squares proofs that
def
M2k (v) = Ep [hx − µ, vi2k ] sos Ck0 σ 2k kvk2k 0
2 whenever p is σ-Poincaré, for some universal constants Ck . We

35
will exhibit the proof for k = 1, 2, 3 (the proof extends to larger k and the key ideas appear already by k = 3).
We introduce the notation

Mk = E[(x − µ)⊗k ], (149)


⊗k k
Mk (v) = hMk , v i = E[hx − µ, vi ]. (150)

Proof for k = 1. We wish to show that Ep [hx − µ, vi2 ] sos σ 2 kvk22 . To do this take fv (x) = hx, vi. Then
the Poincaré inequality applied to fv yields

Ep [hx − µ, vi2 ] = Var[fv (x)] ≤ σ 2 E[k∇fv (x)k22 ] = σ 2 E[kvk22 ] = σ 2 kvk22 . (151)

Thus M2 (v) ≤ σ 2 kvk22 (this is just saying that Poincaré distributions have bounded covariance). This
property has a sum-of-squares proof because it is equivalent to σ 2 I − M2  0, and we know that all positive
semidefiniteness relations are sum-of-squares certifiable.
Proof for k = 2. Extending to k = 2, it makes sense to try fv (x) = hx − µ, vi2 . Then we have
∇fv (x) = 2hx − µ, viv and hence E[∇fv (x)] = 0. We also have ∇2 fv (x) = 2v ⊗ v. Thus applying
Theorem 2.49 we obtain
Var[fv (x)] ≤ C2 σ 4 E[k2v ⊗ vk2F ] = 4C2 σ 4 kvk42 . (152)
We also have Var[fv (x)] = E[hx − µ, vi4 ] − E[hx − µ, vi2 ]2 = M4 (v) − M2 (v)2 . Thus

M4 (v) = (M4 (v) − M2 (v)2 ) + M2 (v)2 (153)


4
≤ 4C2 σ kvk42 +σ 4
kvk42 = (4C2 + 1)σ 4
kvk42 . (154)

This shows that the fourth moment is bounded, but how can we construct a sum-of-squares proof? We
already have that M2 (v)2 sos σ 4 kvk42 (by 0 sos M2 (v) sos σ 2 kvk22 and the product property). Therefore
we focus on bounding M4 (v) − M2 (v)2 = Var[fv (x)].
For this we will apply Theorem 2.49 to a modified version of fv (x). For a matrix A, let fA (x) =
(x − µ)> A(x − µ) = hA, (x − µ)⊗2 i. Then fv (x) = fA (x) for A = vv > . By the same calculations as above we
have E[∇fA (x)] = 0 and ∇2 fA (x) = 2A. Thus by Theorem 2.49 we have

Var[fA (x)] ≤ C2 σ 4 E[k2Ak2F ] = 4C2 σ 4 kAk2F . (155)

On the other hand, we have Var[fA (x)] = hM4 , A ⊗ Ai − hM2 , Ai2 = hM4 − M2 ⊗ M2 , A ⊗ Ai. Thus (155)
implies that
hM4 − M2 ⊗ M2 , A ⊗ Ai ≤ 4C2 σ 4 kAk2F . (156)
2 2
Another way of putting this is that M4 − M2 ⊗ M2 , when considered as a matrix in Rd ×d , is smaller than
4C2 σ 4 I in the semidefinite ordering. Hence 4C2 σ 4 I − (M4 − M2 ⊗ M2 )  0 and so 4C2 σ 4 kvk42 − hM4 − M2 ⊗
M2 , v ⊗4 i sos 0, giving us our desired sum-of-squares proof. To recap, we have:

M4 (v) = (M4 (v) − M2 (v)2 ) + M2 (v)2 (157)


4
sos 4C2 σ kvk42 +σ 4
kvk42 = (4C2 + 1)σ 4
kvk42 , (158)

so we can take C20 = 4C2 + 1.


Proof for k = 3. Inspired by the k = 1, 2 cases, we try fv (x) = hx − µ, vi3 . However, this choice runs into
problems, because ∇fv (x) = 3hx − µ, vi2 v and so E[∇fv (x)] = 3M2 (v)v 6= 0. We instead should take

fv (x) = hx − µ, vi3 − 3M2 (v)hx − µ, vi, which yields (159)


2
E[∇fv (x)] = E[3hx − µ, vi v − 3M2 (v)v] = 0, (160)
2
E[∇ fv (x)] = E[6hx − µ, vi(v ⊗ v)] = 0, (161)
3
∇ fv (x) = 6(v ⊗ v ⊗ v). (162)

Applying Theorem 2.49 to fv (x) yields

Var[fv (x)] ≤ C3 σ 6 k6(v ⊗ v ⊗ v)k2F = 36C3 σ 6 kvk62 . (163)

36
We can additionally compute

Var[fv (x)] = E[(hx − µ, vi3 − 3M2 (v)hx − µ, vi)2 ] − E[hx − µ, vi3 − 3M2 (v)hx − µ, vi]2 (164)
3 2
= M6 (v) − 6M2 (v)M4 (v) + 9M2 (v) − M3 (v) . (165)

Since our goal is to bound M6 (v), we re-arrange to obtain

M6 (v) = Var[fv (x)] + 6M2 (v)M4 (v) + M3 (v)2 − 9M2 (v)2 (166)
≤ 36C3 σ 6
kvk62 + 6(σ 2
kvk22 )(C20 σ 4 kvk42 ) 2
+ M3 (v) + 0 (167)

We can also use Hölder’s inequality to obtain M3 (v)2 ≤ M2 (v)M4 (v), which yields an overall bound of
M6 (v) ≤ (36C3 + 12C20 )σ 6 kvk62 .
We now turn this into a sum-of-squares proof. We need to show the following four relations:

(i) Var[fv (x)] sos 36C3 σ 6 kvk62 , (ii) M2 (v)M4 (v) sos (σ 2 kvk22 )(C20 σ 4 kvk42 ), (168)
2
(iii) M3 (v) sos M2 (v)M4 (v), (iv) − 9M2 (v) sos 0. (169)

The relation (ii) again follows by the product property of sos , while −9M2 (v)2 sos 0 is direct because
M2 (v)2 is already a square. We will show in an exercise that the Hölder inequality in (iii) has a sum-of-squares
proof, and focus on (i).
3
The relation (i) holds for reasons analogous to the k = 2 case. For a symmetric tensor A ∈ Rd , let
⊗3 2
fA (x) = hA, (x − µ) − 3M2 ⊗ (x − µ)i. Then just as before we have E[∇fA (x)] = 0, E[∇ fA (x)] = 0, and
so Var[fA (x)] ≤ 36C3 σ 6 kAk2F , which implies that3

M6 − 6M2 ⊗ M4 + 9M2 ⊗ M2 ⊗ M2 − M3 ⊗ M3  36C3 σ 6 I, (170)

and hence Var[fv (x)] sos 36C3 σ 6 kvk62 (again because semidefinite relations have sum-of-squares proofs).
In summary, we have M6 (v) sos (36C3 + 12C20 )σ 6 kvk62 , as desired.
Generalizing to higher k. For higher k the proof is essentially the same. What is needed is a function fv (x)
whose first k − 1 derivates all have zero mean. This always exists and is unique up to scaling by constants. For
instance, when k = 4 we can take fv (x) = hx−µ, vi4 −6M2 (v)hx−µ, vi2 −4M3 (v)hx−µ, vi−M4 (v)+6M2 (v)2 .
This appears somewhat clunky but is a special case of a combinatorial sum. For the general case, let Tk be
the set of all integer tuples (i0 , i1 , . . .) such that i0 ≥ 0, is ≥ 2 for s > 0, and i0 + i1 + · · · = k. Then the
general form is
 
X
r k
fv,k (x) = (−1) hx − µ, vii0 Mi1 (v)Mi2 (v) · · · Mir (v). (171)
i0 · · · ir
(i0 ,...,ir )∈Tk

The motivation for this formula is that it is the solution to ∇fv,k (x) = kfv,k−1 (x)v. Using fv,k , one can
construct sum-of-squares proofs by applying Theorem 2.49 to the analogous fA,k function as before, and then
use induction, the product rule, and Hölder’s inequality as in the k = 3 case.

[Lectures 10-11]

3 Resilience Beyond Mean Estimation


We have so far focused primarily on mean estimation, first considering information theoretic and then
algorithmic issues. We now turn back to information theoretic issues with a focus on generalizing our results
from mean estimation to other statistical problems.
Let us recall our general setup: for true (test) distribution p∗ and corrupted (train) distribution p̃,
we observe samples X1 , . . . , Xn from p̃ (oblivious contamination, although we can also consider adaptive
3 Actually this is not quite true because we only bound Var[f (x)] for symmetric tensors A. What is true is that this holds if
A
we symmetrize the left-hand-side of (170), which involves averaging over all ways of splitting M2 and M4 over the 3 copies of Rd
in Rd×d×d .

37
contamination as in Section 2.6.2). We wish to estimate a parameter θ and do so via en estimator θ̂ =
θ̂(X1 , . . . , Xn ). Our goal is to construct an estimator such that L(p∗ , θ̂) is small according to a given loss
function L. This was summarized in Figure 1 from Section 1.
As before, we will start by allowing our estimator θ̂ to directly access the population distribution p̃
rather than samples. Thus we wish to control the error L(p∗ , θ̂(p̃)). Since this is hopeless without further
assumptions, we assume that D(p∗ , p̃) ≤  for some distance D, and that p∗ lies in some family G.
For now we continue to take D = TV and focus on more general losses L, corresponding to tasks beyond
mean estimation. Two key examples will be:
• Second moment estimation in spectral norm, which corresponds to the loss L(p, S) = kEp [XX > ] −
Sk.
• Linear regression, which corresponds to the loss L(p, θ) = Ex,y∼p [(y − θ> x)2 − (y − θ∗ (p)> x)2 ]. Note
that here L measures the excess predictive loss so that L(p, θ∗ (p)) = 0.
As in the mean estimation case, we will define the modulus of continuity and the family of resilience
distributions, and derive sufficient conditions for resilience.

Modulus of continuity. The modulus of continuity generalizes straightforwardly from the mean estimation
case. We define
m(G, 2, L) = sup L(p, θ∗ (q)). (172)
p,q∈G:TV(p,q)≤2

As before, the modulus m upper-bounds the minimax loss. Specifically, consider the projection estimator that
outputs θ̂(p̃) = θ∗ (q) for any q ∈ G with TV(p̃, q) ≤ . Then the error of θ̂ is at most m because TV(q, p∗ ) ≤ 2
and p∗ , q ∈ G.

Resilience. Generalizing resilience requires more care. Recall that for mean estimation the set of (ρ, )-
resilient distributions was
TV def  p
Gmean (ρ, ) = p | kEr [X] − Ep [X]k ≤ ρ for all r ≤ . (173)
1−
We saw in Section 2.4 that robust mean estimation is possible for the family Gmean of resilient distributions; the
two key ingredients were the existence of a midpoint distribution and the triangle inequality for L(p, θ∗ (q)) =
kµp − µq k. We now extend the definition of resilience to arbitrary cost functions L(p, θ) that may not satisfy
the triangle inequality. The general definition below imposes two conditions: (1) the parameter θ∗ (p) should
p p
do well on all distributions r ≤ 1− , and (2) any parameter that does well on some r ≤ 1− also does well on
p. We measure performance on r with a bridge function B(r, θ), which is often the same as the loss L but
need not be.

Definition 3.1 (G TV (ρ1 , ρ2 , )). Given an arbitrary loss function L(p, θ), we define G TV (ρ1 , ρ2 , ) = G↓TV (ρ1 , )∩
G↑TV (ρ1 , ρ2 , ), where:

G↓TV (ρ1 , ) , {p | sup B(r, θ∗ (p)) ≤ ρ1 }, (174)


p
r≤ 1−
p
G↑TV (ρ1 , ρ2 , ) , {p | for all θ, r ≤ , (B(r, θ) ≤ ρ1 ⇒ L(p, θ) ≤ ρ2 )}, (175)
1−
The function B(p, θ) is an arbitrary cost function that serves the purpose of bridging.

If we take B(p, θ) = L(p, θ) = kEp [X] − Eθ [X]k, ρ2 = 2ρ1 , then this exactly reduces to the resilient set
TV TV
Gmean (ρ1 , ) for mean estimation. To see the reduction, note that Gmean is equivalent to G↓TV in Equation
(174). Thus we only need to show that G↑TV is a subset of G↓TV . By our choice of B, L and ρ2 , the implication
condition in G↑TV follows from the triangle inequality..
We will show that G TV is not too big by bounding its modulus of continuity, and that it is not too small
by exhibiting reasonable sufficient conditions for resilience.

38
TV(p1 , p2 ) ≤ η
p1 p2

p1 p2
r≤ 1−η
r≤ 1−η

min(p1 ,p2 )
r= 1−TV(p1 ,p2 )

p2 ∈ G↑TV
p1 ∈ G↓TV B(r, θ∗ (p1 )) ≤ ρ1 L(p2 , θ∗ (p1 )) ≤ ρ2

Figure 7: Midpoint distribution helps bridge the modulus for G TV .

Not too big: bounding m. We show that the designed G TV (ρ1 , ρ2 , ) has small modulus of continuity
(and thus population minimax limit) in the following theorem:
Theorem 3.2. For G TV (ρ1 , ρ2 , ) in Definition 3.1, we have m(G TV (ρ1 , ρ2 , ), ) ≤ ρ2 .
Proof. As illustrated in Figure 7, we still rely on the midpoint distribution r to bridge the modulus. Consider
p1 p2
any p1 , p2 satisfying TV(p1 , p2 ) ≤ . Then there is a midpoint r such that r ≤ 1− and r ≤ 1− . From
the fact that p1 ∈ G TV (ρ1 , ρ2 , ) ⊂ G↓TV (ρ1 , ), we have B(r, θ∗ (p1 )) ≤ ρ1 . From this and the fact that
p2 ∈ G TV (ρ1 , ρ2 , ) ⊂ G↑TV (ρ1 , ρ2 , ), we then have L(p2 , θ∗ (p1 )) ≤ ρ2 . Since p1 and p2 are arbitrary, this
bounds the modulus of continuity by ρ2 .

Not too small: concrete examples. We next show that G TV yields sensible conditions for second moment
estimation and linear regression. We start with second moment estimation:
Proposition 3.3. Let B(p, S) = L(p, S) = kEp [XX > ] − Sk, and let p be a distribution on Rd such that
p ∈ Gmom,k )(σ), i.e. p∗ has bounded kth moments. Then assuming k > 2, we have p ∈ G TV (ρ, 2ρ, ) for
ρ = O(σ 2 1−2/k ).
This is essentially the same statement as for mean estimation, except with σ 2 1−2/k instead of σ1−1/k .
Proof. First we show that p ∈ G ↓ (ρ, ), for which we need to show that
p
kEr [XX > ] − Ep [XX > ]k ≤ ρ for all r ≤ . (176)
1−

Letting Y = XX > , this asks that Y is resilient in operator norm, which in turn asks that hY, Zi is resilient for
any kZk∗ ≤ 1, where k · k∗ is dual to the operator norm. Recalling that the operator norm is the maximum
P k · k∗ is the nuclear norm, or the sum of the singular values. Thus for
singular value, it turns out that
Z = U ΛV > we have kZk∗ = i Λii . (Proving this duality requires some non-trivial but very useful matrix
inequalities that we provide at the end of this section.)
Conveniently, the extreme points of the nuclear norm ball are exactly rank-one matrices of the form
±vv > where kvk2 = 1. Thus we exactly need that hv, Xi2 is resilience for all v. Fortunately we have that
E[|hv, Xi2 − E[hv, Xi2 ]|k/2 ] ≤ E[|hv, Xi|k ] ≤ σ k , so p is (ρ1 , )-resilient with ρ1 = σ 2 1−2/k , which gives that
p ∈ G↓.
Next we need to show that p ∈ G ↑ . We want
p
kEr [XX > ] − Sk ≤ ρ1 =⇒ kEp [XX > ] − Sk ≤ ρ2 whenever r ≤ , (177)
1−

but this is the same as ρ2 − ρ1 ≤ kEr [XX > ] − Ep [XX > ]k, and we already know that the right-hand-side is
bounded by ρ1 , so we can take ρ2 = 2ρ1 , which proves the claimed result.
We move on to linear regression. In the proof for second moment estimation, we saw that p ∈ G ↑ was
essentially implied by p ∈ G ↓ . This was due to the symmetry of the second moment loss together with the

39
triangle inequality for k · k, two properties that we don’t have in general. The proof for second moment
estimation will require somewhat more different proofs for G ↑ and G ↓ . For simplicity we state the result only
for fourth moments:
Proposition 3.4. For a distribution p on Rd × R, let B(p, θ) = L(p, θ) = Ep [(y − hθ, xi)2 − (y − hθ∗ (p), xi)2 ].
Let Z = Y − hθ∗ (p), Xi and suppose that the following two conditions holds:

Ep [XZ 2 X > ]  σ 2 E[XX > ], (178)


4 2 2
Ep [hX, vi ] ≤ κEp [hX, vi ] for all v. (179)
1
Then p ∈ G TV (ρ, 5ρ, ) for ρ = 2σ 2  as long as (κ − 1) ≤ 6 and  ≤ 18 .

Let us interpret the two conditions. First, as long as X and Z are independent (covariates are independent
of noise), we have Ep [XZ 2 X > ] = E[Z 2 ]E[XX > ], so in that case σ 2 is exactly a bound on the noise Z. Even
when X and Z are not independent, the first condition holds when Z has bounded 4th moment.
The second condition is a hypercontractivity condition stating that the fourth moments of X should not
be too large compared to the second moments. It is a bit unusual from the perspective of mean estimation,
because it does not require X to be well-concentrated, but only well-concentrated relative to its variance. For
regression, this condition makes sense because κ bounds how close √ the covariates are to being rank-deficient
(the worst-case is roughly an -mass at some arbitrary distance t/ , which would have second moment t2
and fourth moment t4 /, so we roughly want κ < 1/). We will show later that such a hypercontractivity
condition is needed, i.e. simply assuming sub-Gaussianity (without making it relative to the variance) allows
for distributions that are hard to robustly estimate due to the rank-deficiency issue.

Proof. First note that L(p, θ) = (θ − θ∗ (p))> Sp (θ − θ∗ (p)), where Sp = Ep [XX > ], and analogously for L(r, θ).
At a high level our strategy will be to show that θ∗ (r) ≈ θ∗ (p) and Sr ≈ Sp , and then use this to establish
membership in G ↓ and G ↑ .
We first use the hypercontractivity condition to show that Sr ≈ Sp . We have

1
q
Er [hv, Xi2 ] ≥ Ep [hv, Xi2 ] − Varp [hv, Xi2 ] (180)
1−
1
q
= Ep [hv, Xi2 ] − (Ep [hv, Xi4 ] − Ep [hv, Xi2 ]2 ) (181)
1−
1 p
≥ Ep [hv, Xi2 ] − (κ − 1)Ep [hv, Xi2 ] (182)
1−
1 p
= (1 − (κ − 1))Ep [hv, Xi2 ]. (183)
1−
1
p 1
p
Thus Sr  (1− 1− (κ − 1))Sp , and similarly Sr  (1+ 1− (κ − 1))Sp . Assuming  ≤ 18 and (κ−1) ≤ 61 ,
1
p p
we have 1− (κ − 1) ≤ 87 1/6 < 12 , and so 21 Sp  Sr  32 Sp .
We now turn to G ↑ and G ↓ . A useful relation is θ∗ (p) = Sp−1 Ep [XY ], and θ∗ (r) − θ∗ (p) = Sr−1 Er [XZ].
To prove that p ∈ G ↓ we need to show that (θ∗ (r) − θ∗ (p))> Sr (θ∗ (r) − θ∗ (p)) is small. We have
3 ∗
(θ∗ (r) − θ∗ (p))> Sr (θ∗ (r) − θ∗ (p)) ≤ (θ (r) − θ∗ (p))> Sp (θ∗ (r) − θ∗ (p)) (184)
2
3 3
= Er [XZ]> Sp−1 Er [XZ] = kEr [Sp−1/2 XZ] − Ep [Sp−1/2 XZ]k22 . (185)
2 2
−1/2
This final condition calls for Sp XZ to be resilient, and bounded variance of this distribution can be seen to
3σ 2 
exactly correspond to the condition E[XZ 2 X > ]  σ 2 E[XX > ]. Thus we have resilience with ρ = 2(1−) 2
2 ≤ 2σ 

(since  < 18 ).
Now we turn to G ↑ . We want that (θ − θ∗ (r))> Sr (θ − θ∗ (r)) ≤ ρ implies (θ − θ∗ (p))> Sp (θ − θ∗ (p)) ≤ 5ρ.

40
By the triangle inequality we have
q q q
(θ − θ∗ (p))> Sp (θ − θ∗ (p)) ≤ (θ − θ∗ (r))> Sp (θ − θ∗ (r)) + (θ∗ (r) − θ∗ (p))> Sp (θ∗ (r) − θ∗ (p)) (186)
q p
≤ 2(θ − θ∗ (r))> Sr (θ − θ∗ (r)) + (4/3)σ 2  (187)
p p √ √ p p
≤ 2ρ + (4/3)σ 2  = ρ( 2 + 2/3) ≤ 5ρ, (188)

which completes the proof.

Lower bound. TBD

Proving that nuclear norm is dual to operator norm. Here we establish a series of matrix inequalities
that are useful more broadly, and use these to analyze the nuclear norm. The first allows us to reduce dot
products of arbitrary matrices to symmetric PSD matrices:
Proposition 3.5. For any (rectangular) matrices A, B of equal dimensions, we have

hA, Bi2 ≤ h(A> A)1/2 , (B > B)1/2 ih(AA> )1/2 , (BB > )1/2 i. (189)

In a sense, this is like a “matrix Cauchy-Schwarz”.


(AA> )1/2
 
A
Proof. We first observe that  0. This is because, if A = U ΛV > is the singular
A> (A> A)1/2
U ΛU > U ΛV >
 
value decomposition, we can write the above matrix as , which is PSD because it can be
V ΛU > V ΛV >
factorized as [U ; V ]Λ[U ; V ]> . More generally this is true if we multiply (AA> )1/2 by λ and (A> A)1/2 by λ1 .
We therefore have
λ(AA> )1/2 λ(BB > )1/2
   
A −B
, ≥ 0, (190)
A> 1 >
λ (A A)
1/2
−B > 1 >
λ (B B)
1/2

since both terms in the inner product are PSD. This gives λ2 h(AA> )1/2 , (BB > )1/2 i+ λ12 h(A> A)1/2 , (B > B)1/2 i ≥
2hA, Bi. Optimizing λ yields the claimed result.
Next we show:
Theorem 3.6. If A and B are matrices of the same dimensions with (sorted) lists of singular values
σ1 , . . . , σn and τ1 , . . . , τn , then
n
X
hA, Bi ≤ σi τi . (191)
i=1

This says that the dot product between two matrices is bounded by the dot product between their sorted
singular values.
Proof. By Proposition 3.5, it suffices to show this in the case that A and B are both PSD and σ, τ are the
eigenvalues. Actually we will only need A and B to be symmetric (which implies that, oddly, the inequality
can hold even if some of the σi and τi are negative).
By taking similarity transforms we can assume Pwithout loss P
of generality that A = diag(σ1 , . . . , σn ) with
n n
σ1 ≥ σ2 ≥ · · · ≥ σn . We thus wish to prove that i=1 σi Bii ≤ i=1 σi τi , where τi are the eigenvalues of B.
We make use of the following lemma:
Pk Pk
Lemma 3.7. For all 1 ≤ k ≤ n, we have i=1 Bii ≤ i=1 τi .

41
Pk
Proof. Let Bk be the k × k top-left submatrix of B. Then i=1 Bii = tr(Bk ) is the sum of the eigenvalues
of Bk . We will show that the jth largest eigenvalue of Bk is smaller than the jth largest eigenvalue of B
(this is a special case of the Cauchy interlacing theorem). We prove this using the min-max formulation of
eigenvalues: λi (M ) = minW :dim(W )=i−1 maxv∈W ⊥ ,kvk2 ≤1 v > M v. Let W ∗ be the W that attains the min for
λj (B), and let Pk denote projection onto the first k coordinates. We have

λj (Bk ) = min max v > Bk v (192)


W :dim(W )=i−1 v∈W ⊥ :kvk2 ≤1

≤ max (Pk v)> Bk (Pk v) (193)


v∈(W ∗ )⊥ :kvk2 ≤1

≤ max v > Bv = λj (B), (194)


v∈(W ∗ )⊥ :kvk2 ≤1

which proves the lemma.


Now with the lemma in hand we observe that, if we let σn+1 = 0 for convenience, we have
n
X n
X
σi Bii = (σi − σi+1 )(B11 + · · · + Bii ) (195)
i=1 i=1
Xn
≤ (σi − σi+1 )(τ1 + · · · + τi ) (196)
i=1
Xn
= σi τi , (197)
i=1

which yields the desired result. In the above algebra we have used Abel summation, which is the discrete
version of integration by parts.
Now that we have Theorem 3.6 in hand, we can easily analyze the operator and nuclear norms. Letting
~σ (A) denote the vector of non-decreasing singular values of A, we have

hY, Zi ≤ h~σ (Y ), ~σ (Z)i ≤ k~σ (Y )k∞ k~σ (Z)k1 . (198)

This shows that the dual of the operator norm is at most the nuclear norm, since k~σ (Z)k1 is the nuclear
norm of Z. But we can achieve equality when Y = U ΛV > by taking Z = u1 v1> (then kZk∗ = 1 while
hY, Zi = Λ11 = kY k). So operator and nuclear norm are indeed dual to each other.

[Lecture 12]

3.1 Efficient Algorithm for Robust Regression


We now turn to the question of efficient algorithms, focusing on linear regression (we will address finite-sample
issues later). Recall that information-theoretically, we found that two conditions are sufficient to imply
resilience:
• Hypercontractivity: For all v, Ex∼p [hx, vi4 ] ≤ κEx∼p [hx, vi2 ]2 .
• Bounded noise: Ex∼p [xz 2 x> ]  σ 2 Ex∼p [xx> ].
As for mean estimation under bounded covariance, our strategy will be to write down a non-convex optimization
problem that tries to find small κ and σ, then show that this problem can be approximately solved. Specifically,
let
Ex∼q [hx, vi4 ]
F1 (q) = sup , and (199)
v Ex∼q [hx, vi2 ]2
Ex∼q [hx, vi2 (y − hθ(q), xi)2 ]
F2 (q) = sup . (200)
v Ex∼q [hx, vi2 ]

42
Then we seek to find a q such that F1 (q) ≤ κ, F2 (q) ≤ σ 2 , and q ∈ ∆n, , where ∆n, is the set of -deletions
of p.
However, there are a couple wrinkles. While with mean estimation we could minimize the objective with
gradient descent, in this case we will need to use quasigradient descent–following a direction that is not the
gradient, but that we can show makes progress towards the optimum. The rough reason for this is that, since
p appears on both the left- and right-hand sides of the inequalities above, the gradients become quite messy,
e.g. ∇F1 (q) has a mix of positive and negative terms:

hxi , vi4 Eq [hx, vi4 ]hxi , vi2


∇F1 (q)i = − 2 , (201)
Ex∼q [hx, vi2 ]2 Eq [hx, vi2 ]3

and it isn’t clear that following them will not land us at bad local minima. To address this, we instead
construct a simpler quasigradient for F1 and F2 :

Eq [hx, vi4 ]
g1 (xi ; q) = hxi , vi4 , where v = arg max , (202)
kvk2 =1 Eq [hx, vi2 ]2
Eq [hx, vi2 (y − hθ∗ (q), xi)2 ]
g2 (xi ; q) = hxi , vi2 (yi − hθ∗ (q), xi i)2 , where v = arg max . (203)
kvk2 =1 Eq [hx, vi2 ]

We will then follow g1 until F1 is small, and then follow g2 until F2 is small.
The other wrinkle is that computationally, the hypercontractivity condition is difficult to certify, because
E [hx,vi4 ]
it involves maximizing Epp[hx,vi2 ]2 , which is no longer a straightforward eigenvalue problem as in the mean
estimation case. We’ve already seen this sort of difficulty before–for norms beyond the `2 -norm, we had to use
SDP relaxations and Grothendieck’s inequality in order to get a constant factor relaxation. Here, there is also
an SDP relaxation called the sum-of-squares relaxation, but it doesn’t always give a constant factor relaxation.
We’ll mostly ignore this issue and assume that we can find the maximizing v for hypercontractivity.
We are now ready to define our efficient algorithm for linear regression, Algorithm 4. It is closely analogous
to the algorithm for mean estimation (Algorithm 2), but specifies the gradient steps more explicitly.

Algorithm 4 QuasigradientDescentLinReg
1: Input: (x1 , y1 ), . . . , (xn , yn ) ∈ Rd × R.
2: Initialize q ∈ ∆n, arbitrarily.
3: while F1 (q) ≥ 2κ or F2 (q) ≥ 4σ 2 do
4: if F1 (q) ≥ 2κ then
5: Find the unit vector v that maximizes Eq [hx, vi4 ]/Eq [hx, vi2 ]2 .
6: Take a projected gradient step in the direction (g1 )i = hxi , vi4 .
7: else Pn Pn
8: Compute the empirical least squares regressor: θ∗ (q) = ( i=1 qi xi x> i )
−1
( i=1 qi xi yi ).
9: Find the unit vector v that maximizes Eq [hx, vi2 (y − hθ∗ (q)xi)2 ]/Eq [hx, vi2 ].
10: Take a projected gradient step in the direction (g2 )i = hxi , vi2 (yi − hθ∗ (q), xi i)2 .
11: end if
12: end while
13: Output θ∗ (q).

Analyzing Algorithm 4 enjoys the following loss bound:


Proposition 3.8. Suppose that a set S of (1 − )n of the xi satisfy:

EpS [hx, vi4 ] ≤ κEpS [hx, vi2 ]2 ∀v ∈ Rd , and EpS [(y − hθ∗ (pS ), xi)2 xx> ]  σ 2 EpS [xx> ]. (204)

Then assuming κ ≤ 1
80 , Algorithm 4 terminates and its output has excess loss L(pS , θ∗ (q)) ≤ 40σ 2 .
To prove Proposition 3.8, we need a few ideas. The first is a result from optimization justifying the use of
the quasigradients:

43
Lemma 3.9 (Informal). Asymptotically, the iterates of Algorithm 4 (or any other low-regret algorithm)
satisfy the conditions
EX∼q [gj (X; q)] ≤ EX∼pS [gj (X; q)] for j = 1, 2. (205)
We will establish this more formally later, and for now assume that (205) holds. Then, asuming this,
we will show that q is both hypercontractive and has bounded noise with constants κ0 , σ 0 that are only a
constant factor worse than κ and σ.
First, we will show this for hypercontractivity:
1
Lemma 3.10. Suppose that EX∼q [g1 (X; q)] ≤ EX∼pS [g1 (X; q)] and that κ ≤ 80 . Then q is hypercontractive
with parameter κ0 = 1.5κ.
Proof. Take the maximizing v such that g1 (xi ; q) = hxi , vi4 . To establish hypercontractivity, we want to
show that Eq [hx, vi4 ] is small while Eq [hx, vi2 ]2 is large. Note the quasigradient condition gives us that
Eq [hx, vi4 ] ≤ EpS [hx, vi4 ]; so we will mainly focus on showing that Eq [hx, vi2 ] is large, and in fact not much

smaller than EpS [hx, vi2 ]. Specifically, by the fact that TV(q, pS ) ≤ 1− , together with resilience applied to
2
hx, vi , we have
   21
|Eq [hx, vi2 ] − EpS [hx, vi2 ]| ≤ (Eq [hx, vi 4
] + EpS [hx, vi4
]) (206)
(1 − 2)2
(i)  2  12
≤ 2
EpS [hx, vi4 ] (207)
(1 − 2)
(ii)  2κ  12
≤ EpS [hx, vi2 ], (208)
(1 − 2)2

where (i) uses the quasigradient condition and (ii) uses hypercontractivity for p pS . Now assuming that
1 1
κ ≤ 80 (and hence also  ≤ 80 ), the coefficient on the right-hand-side is at most (1/40)/(1 − 1/40)2 < 61 .
Consequently |Eq [hx, vi2 ] − EpS [hx, vi]2 | ≤ 61 EpS [hx, vi2 ], and re-arranging then yields

5
Eq [hx, vi2 ] ≥ Ep [hx, vi2 ]. (209)
6 S
But we already have Eq [hx, vi4 ] ≤ EpS [hx, vi2 ], and so the ratio Eq [hx, vi2 ]/Eq [hx, vi2 ]2 is at most (6/5)2 the
same ratio under pS , and in particular at most (6/5)2 κ ≤ 1.5κ.
Next, we will show this for bounded noise assuming that hypercontractivity holds:
1
Lemma 3.11. Suppose that F1 (q) ≤ 2κ and that EX∼q [g2 (X; q)] ≤ EX∼pS [g2 (X; q)], and that κ ≤ 80 . Then
q has bounded noise with parameter (σ 0 )2 = 4σ 2 , and furthermore satisfies L(pS , θ∗ (q)) ≤ 40σ 2 .
Proof. Again take the maximizing v such that g2 (xi ; q) = hxi , vi2 (yi − hθ∗ (q), xi i)2 . We want to show that q
has bounded noise, or in other words that Eq [g2 (x; q)] is small relative to Eq [hx, vi2 ]. By the quasigradient
assumption, we have

Eq [g2 (x; q)] + Eq [hx, vi2 (y − hθ∗ (q), xi)2 ] (210)


2 ∗ 2
≤ EpS [hx, vi (y − hθ (q), xi) ]. (211)

Intuitively, we want to use the bounded noise condition for pS to upper-bound the right-hand-side. The
problem is that the term inside the expectation contains θ∗ (q), rather than θ∗ (pS ). But we can handle this
using the AM-RMS inequality. Specifically, we have

EpS [hx, vi2 (y − hθ∗ (q), xi)2 ] ≤ 2 EpS [hx, vi2 (y − hθ∗ (pS ), xi)2 ] + EpS [hx, vi2 hθ∗ (q) − θ∗ (pS ), xi2 ] . (212)

| {z } | {z }
(a) (b)

We will bound (a) and (b) in turn. To bound (a) note that by the bounded noise condition we simply have
(a) ≤ σ 2 EpS [hx, vi2 ].

44
To bound (b), let R = EpS [hθ∗ (q) − θ∗ (pS ), xi2 ] = L(pS , θ∗ (q)) be the excess loss of θ∗ (q) under pS . We
will upper-bound (b) in terms of R, and then apply resilience to get a bound for R in terms of itself. Solving
the resulting inequality will provide an absolute bound on R and hence also on (b).
More specifically, we have
(i)  1/4  1/2
EpS [hx, vi2 hθ∗ (q) − θ∗ (pS ), xi2 ] ≤ EpS [hx, vi4 EpS [hθ∗ (q) − θ∗ (pS ), xi4 ] (213)
(ii)   
≤ κ EpS [hx, vi2 ] EpS [hθ∗ (q) − θ∗ (pS ), xi2 (214)
2
= κREpS [hx, vi ]. (215)

Here (i) is Cauchy-Schwarz and (ii) invokes hypercontractivity of pS . Combining the bounds on (a) and (b)
and plugging back in to (212), we obtain

EpS [hx, vi2 (y − hθ∗ (q), xi)2 ] ≤ 2(σ 2 + κR)EpS [hx, vi2 ]. (216)

Remember that we would like a bound such as the above, but with expectations taken with respect to q
instead of pS . For the left-hand-side, we can directly move to Eq [·] using the quasigradient assumption. For
the right-hand-side, since F1 (q) ≤ 2κ, the same argument as in (206)-(209) yields (with modified constants)
that Eq [hx, vi2 ] ≥ 45 EpS [hx, vi2 ]. Applying both of these, we have that

Eq [hx, vi2 (y − hθ∗ (q), xi)2 ] ≤ 2.5(σ 2 + κR)Eq [hx, vi2 ]. (217)

This establishes bounded noise with parameter (σ 0 )2 = 2.5(σ 2 + κR). By assumption, we also have
hypercontractivity with parameter κ0 = 2κ. We are not yet done, because we do not know R. But
recall that R is the excess loss, and so by the resilience conditions for linear regression (Proposition 3.4) we
have
R ≤ 5ρ(κ0 , σ 0 ) ≤ 10(σ 0 )2  = 25(σ 2 + κR), (218)
as long as  ≤ 1
8 and κ0  = 2κ ≤ 16 . Re-arranging, we have

25σ 2 
R≤ ≤ 40σ 2 , (219)
1 − 25κ

since κ ≤ 1
80 . Plugging back into σ 0 , we also have (σ 0 )2 ≤ 2.5σ 2 (1 + 40κ) ≤ 4σ 2 , as claimed.

Quasigradient bounds via low-regret algorithms. Combining Lemmas 3.9, 3.10, and 3.11 together
yields Proposition 3.8. However, we still need to formalize Lemma 3.9, showing that we can drive the
quasigradients to be small. We can do so with the idea of low-regret online optimization algorithms.
An online optimization algorithm is one that takes a sequence of losses `1 (·), `2 (·) rather than a fixed loss
`. In traditional optimization, we have a single ` with parameter w, and seek to produce iterates w1 , w2 , . . .
such that `(wT ) − `(w∗ ) → 0 as T → ∞. In online optimization algorithms, we instead consider the regret,
defined as
T
1X
RegretT = max `t (wt ) − `t (w). (220)
w T
t=1

This is the average excess loss compared to the best fixed w, picked in hindsight. We then seek to produce
iterates wt such that RegretT → 0 as T → ∞. Note that when `t = ` is fixed for all t, this is exactly the
same as traditional optimization.
Remarkably, for most “nice” loss functions `t , it is possible to ensure that RegretT → 0; in fact, projected
gradient descent with an appropriate step size will achieve this.
How does this relate to quasigradients? For any quasigradient g(X; q), define the loss function `t (q) =
Ex∼q [g(x; qt )]. Even though this loss depends on qt , the regret is well-defined:

1 1
RegretT = max Ex∼qt [g(x; qt )] − Ex∼q0 [g(x; qt )] ≥ Ex∼qt [g(x; qt )] − Ex∼pS [g(x; qt )]. (221)
0 q T T

45
PT PT
In particular, as long as RegretT → 0, we asymptotically have that T1 t=1 Ex∼qt [g(x; qt )] ≤ t=1 Ex∼pS [g(x; qt )],
so eventually the quasigradient bound Ex∼q [g(x; q)] ≤ Ex∼pS [g(x; q)] must (almost) hold for one of the qt .
This is enough to show that, for any fixed quasigradient, a statement such as Lemma 3.9 holds4 . However,
we want both g1 and g2 to be small simultaneously.
There are two ways to handle this. The first, crude way is to first use g1 until we have hypercontractivity,
then take the resulting q as our new p̃ (so that we restrict to -deletions of q) and running gradient descent
with g2 until we have bounded noise. This uses the fact that -deletions of hypercontractive distributions are
still hypercontractive, but yields worse constants (since we need 2-deletions instead of -deletions).
A slicker approach is to alternate between g1 and g2 as in Algorithm 4. Note that we then still
(asymptotically) have
T T
1X X
Ex∼qt [gjt (x; qt )] ≤ Ex∼pS [gjt (x; qt )], (222)
T t=1 t=1

where jt ∈ {1, 2} denotes the choice of quasigradient at iteration t of Algorithm 4.


Next note that asymptotically, only a vanishingly small fraction of jt can equal 1, since we only take
quasigradient steps in g1 when F1 (q) ≥ 2κ, and in these cases Eqt [g1 (x; qt )] is quite a bit larger than
EpS [g1 (x; qt )], since if they were equal we would have F1 (q) ≤ 1.5κ. Therefore, eventually almost all of the
quasigradient steps are with respect to g2 , and so low regret of the entire sum implies low regret of g2 . We
therefore both have F1 (qt ) ≤ 2κ and the quasigradient condition for g2 :
Lemma 3.12 (Formal version of Lemma 3.9). Suppose that |g1 (xi , q)| ≤ B and |g2 (xi , q)| ≤ B for all i,
where B is at most polynomially-large in the problem parameters. Then for any δ, within polynomially many
steps Algorithm 4 generates an iterate qt such that F1 (qt ) ≤ 2κ and Egt [g2 (x, qt )] ≤ EpS [g2 (x, qt )] + δ.
Combining Lemma 3.9 with Lemma 3.11 then yields the desired Proposition 3.8.

[Lectures 13-14]

4 Resilience Beyond TV Distance


We now turn our attention to distances other than the distance D = TV that we have considered so far. The
family of distances we will consider are called Wasserstein distances. Given a cost function c(x, y) (which is
usually assumed to be a metric), we define the distance Wc (p, q) between two distributions p and q as

Wc (p, q) = inf Ex,y∼π [c(x, y)] (223)


π
Z Z
subject to π(x, y)dy = p(x), π(x, y)dx = q(y). (224)

This definition is a bit abstruse so let us unpack it. The decision variable π is called a coupling between p
and q, and can be thought of as a way of matching points in p with points in q (π(x, y) is the amount of
mass in p(x) that is matched to q(y)). The Wasserstein distance is then the minimum cost coupling (i.e.,
minimum cost matching) between p and q. Some special cases include:
• c(x, y) = I[x =
6 y]. Then Wc is the total variation distance, with the optimal coupling being π(x, x) =
min(p(x), q(x)) (the off-diagonal π(x, y) can be arbitrary as long as the total mass adds up correctly).
• c(x, y) = kx − yk2 . Then Wc is the earth-mover distance—the average amount that we need to move
points around to “move” p to q.
• c(x, y) = kx − yk0 . Then Wc is the average number of coordinates we need to change to move p to q.
• c(x, y) = kx − ykα
2 , for α ∈ [0, 1]. This is still a metric and interpolates between TV and earthmover
distance.
4 We have to be a bit careful because outliers could make g(x; q) arbitrarily large, which violates the assumptions needed to

achieve low regret. This can be addressed with a pre-filtering step that removes data points that are obviously too large to be
inliers, but we will not worry about this here.

46
There are a couple key properties of Wasserstein distance we will want to use. The first is that Wc is a metric
if c is:
Proposition 4.1. Suppose that c is a metric. Then Wc is also a metric.
Proof. TBD

The second, called Kantorovich-Rubinstein duality, provides an alternate definition of Wc distance in


terms of functions that are Lipschitz under c, meaning that |f (x) − f (y)| ≤ c(x, y).
Theorem 4.2 (Kantorovich-Rubinstein). Call a function f Lipschitz in c if |f (x) − f (y)| ≤ c(x, y) for all
x, y, and let L(c) denote the space of such functions. If c is a metric, then we have

Wc (p, q) = sup Ex∼p [f (x)] − Ex∼q [f (x)]. (225)


f ∈L(c)

As a special case, take c(x, y) = I[x =6 y] (corresponding to TV distance). Then f ∈ L(c) if and only if
|f (x) − f (y)| ≤ 1 for all x =
6 y. By translating f , we can equivalently take the supremum over all f mapping
to [0, 1]. This says that
TV(p, q) = sup Ep [f (x)] − Eq [f (x)], (226)
f :X →[0,1]

which recovers the definition of TV in terms of the maximum difference in probability of any event E.
As another special case, take c(x, y) = kx − yk2 . Then the supremum is over all 1-Lipschitz functions (in
the usual sense).
In the next section, we will see how to generalize the definition of resilience to any Wasserstein distance.

4.1 Resilience for Wasserstein distances


We show how to extend the idea of resilience to Wasserstein distances Wc . Recall that for TV distance, we
showed that resilient sets have bounded modulus m; this crucially relied on the midpoint property that any
p1 , p2 have a midpoint r obtained via deletions of p1 or p2 . In other words, we used the fact that any TV
perturbation can be decomposed into a “friendly” operation (deletion) and its opposite (addition). We think
of deletion as friendlier than addition, as the latter can move the mean arbitrarily far by adding probability
mass at infinity.
To extend this to other Wasserstein distances, we need to identify a similar way of decomposing a
Wasserstein perturbation into a friendly perturbation and its inverse. Unfortunately, deletion is closely tied
to the TV distance in particular. To get around this, we use the following re-interpretation: Deletion is
equivalent to movement towards the mean under TV. More precisely:
µ̂ is a possible mean of an -deletion of p if and only if some r with mean µ̂ can be obtained from
p by moving points towards µ̂ with TV distance at most .
This is more easily seen in the following diagram:

µp µr

47
Here we can equivalently either delete the left tail of p or shift all of its mass to µr ; both yield a modified
distribution with the same mean µr . Thus we can more generally say that a perturbation is friendly if it only
moves probability mass towards the mean. This motivates the following definition:
Definition 4.3 (Friendly perturbation). For a distribution p over X , fix a function f : X → R. A distribution
r is an -friendly perturbation of p for f under Wc if there is a coupling π between X ∼ p and Y ∼ r such
that:
• The cost (Eπ [c(X, Y )]) is at most .
• All points move towards the mean of r: f (Y ) is between f (X) and Er [f (Y )] almost surely.
Note that friendliness is defined only in terms of one-dimensional functions f : X → R; we will see how to
handle higher-dimensional objects later. Intuitively, a friendly perturbation is a distribution r for which there
exists a coupling that ‘squeezes’ p to µr .
The key property of deletion in the TV case was the existence of a midpoint: for any two distributions
that are within  in TV, one can find another distribution that is an -deletion of both distributions. We
would like to show the analogous result for Wc –i.e. that if Wc (p, q) ≤  then there exists an r that is an
-friendly perturbation of both p and q for the function f .
The intuitive reason this is true is that any coupling between two one-dimensional distributions can be
separated into two stages: in one stage all the mass only moves towards some point, in the other stage all the
mass moves away from that point. This is illustrated in Figure 8.

µp1 µr µp2

Figure 8: Illustration of midpoint lemma. For any distributions p1 , p2 that are close under Wc , the coupling
between p1 and p2 can be split into couplings πp1 ,r , πp2 ,r such that p1 , p2 only move towards µr under the
couplings. We do this by “stopping” the movement from p1 to p2 at µr .

To formalize this intuitive argument, we need a mild topological property:


Assumption 4.4 (Intermediate value property). For any x and y and any u with f (x) < u < f (y), there is
some z satisfying f (z) = u and max(c(x, z), c(z, y)) ≤ c(x, y).
This holds for any f if c = I[x 6= y] (TV distance), and for any continuous f if c is a path metric (a metric
with “nice” paths between points, which includes the `2 -distance). Under this assumption we can prove the
desired midpoint lemma:
Lemma 4.5 (Midpoint lemma for Wc ). Suppose Assumption 4.4 holds. Then for any p1 and p2 such that
Wc (p1 , p2 ) ≤  and any f , there exists a distribution r that is an -friendly perturbation of both p1 and p2
with respect to f .
Proof. Given any two points x and y, without loss of generality we assume f (x) ≤ f (y). Define

min(f (x), f (y)), u ≤ min(f (x), f (y))

sxy (u) = u, u ∈ [f (x), f (y)] (227)

max(f (x), f (y)), u ≥ max(f (x), f (y)).

48
If we imagine u increasing from −∞ to +∞, we can think of sxy as a “slider” that tries to be as close to
u as possible while remaining between f (x) and f (y).
By Assumption 4.4, there must exist some point z such that max(c(x, z), c(z, y)) ≤ c(x, y) and f (z) =
sxy (u). Call this point zxy (u).
Given a coupling π(x, y) from p1 to p2 , if we map y to zxy (u), we obtain a coupling π1 (x, z) to some
distribution r(u), which by construction satisfies the squeezing property, except that it squeezes towards
u rather than towards the mean µ(u) = EX∼r(u) [f (X)]. However, note that u − µ(u) is a continuous,
monotonically non-decreasing function (since u − sxy (u) is non-decreasing) that ranges from −∞ to +∞. It
follows that there is a u∗ with µ(u∗ ) = u∗ . Then the couplings to r(u∗ ) squeeze towards its mean µ(u∗ ).
Moreover, E(X,Z)∼π1 [c(X, Z)] ≤ E(X,Y )∼π [c(X, Y )] = Wc (p1 , p2 ). The coupling π1 therefore also has small
enough cost, and so is a friendly perturbation. Similarly, the coupling π2 mapping y to zxy (u∗ ) satisfies the
squeezing property and has small enough cost by the same argument.

Defining resilience: warm-up. With Lemma 4.5 in hand, we generalize resilience to Wasserstein distances
by saying that a distribution is resilient if Er [f (X)] is close to Ep [f (X)] for every η-friendly perturbation r
and every function f lying within some appropriate family F. For now, we will focus on second moment
estimation under Wk·k2 (we consider second moment estimation because mean estimation is trivial under
Wk·k2 ). This corresponds to the loss function

L(p, S) = kEx∼p [xx> ] − Sk. (228)

For notational convenience we also typically denote Wk·k2 as W1 .


For the loss L(p, S), we will take our family F to be all functions of the form fv (x) = hx, vi2 with kvk2 = 1.
Thus we define the (ρ, )-resilient distributions under W1 as
W1
Gsec (ρ, ) = {p | |Er [hx, vi2 ] − Ep [hx, vi2 ]| ≤ ρ whenever r is -friendly under hx, vi2 and kvk2 = 1}. (229)
W1
Note the twist in the definition of Gsec –the allowed r depends on the current choice of v, since friendliness is
specific to the function fv = hx, vi2 , which is different from deletions in the TV case.
W1
We will first show that Gsec has small modulus, then derive sufficient moment conditions for p to be
(ρ, )-resilient.
W1
Proposition 4.6. The set of (ρ, )-resilient distributions for W1 has modulus m(Gsec (ρ, 2), ) ≤ 2ρ.
Proof. For a distribution q, let Sq = Eq [xx> ]. Suppose that p1 , p2 ∈ Gsec W1
(ρ, ) and W1 (p1 , p2 ) ≤ 2. For any
v, by Lemma 4.5, there exists an r that is a (2)-friendly perturbation of both p1 and p2 with respect to hx, vi2 .
We conclude that |Epi [hx, vi2 ] − Er [hx, vi2 ]| ≤ ρ for i = 1, 2, and hence |Ep1 [hx, vi2 ] − Ep2 [hx, vi2 ]| ≤ 2ρ, which
can be written as |v > (Sp1 − Sp2 )v| ≤ 2ρ. Taking the sup over kvk2 = 1 yields kSp1 − Sp2 k ≤ 2ρ. Since
L(p1 , θ∗ (p2 )) = kSp1 − Sp2 k, this gives the desired modulus bound.

Sufficient conditions for W1 -resilience. Recall that for mean estimation under TV perturbation, any
distribution with bounded
√ ψ-norm was (O(ψ −1 (1/)), )-resilient. In particular, bounded covariance dis-
tributions were (O( ), )-resilient. We have an analogous result for W1 -resilience, but with a modified ψ
function:
Proposition 4.7. Let ψ be an Orlicz function, and define ψ̃(x) = xψ(2x). Suppose that X (not X − µ) has
bounded ψ̃-norm: Ep [ψ̃(|v > X|/σ)] ≤ 1 for all unit vectors v. Also assume that the second moment of p is at
most σ 2 . Then p is (ρ, ) resilient for ρ = max(σψ −1 ( 2σ 2
 ), 4 + 2σ).

Let us interpret Proposition 4.7 before giving the proof. Take for instance ψ(x) = x2 . Then Proposition
√ 4.7
asks for the 3rd moment to be bounded by σ 3 /4. In that case we have ρ = σψ −1 (2σ/) = 2σ 3/2 1/2 . If
the units seem weird, remember that  has units of distance (before it was unitless) and hence σ 3/2 1/2 has
quadratic units, which matches the second moment estimation task.
More generally, taking ψ(x) = xk , we ask for a (k + 1)st moment bound and get error O(σ 1+1/k 1−1/k ).
We now turn to proving Proposition 4.7. A helpful auxiliary lemma (here and later) proves a way to use
Orlicz norm bounds:

49
Lemma 4.8. Let p and q be two distributions over X , g : X → R be any function, c be a non-negative cost
function, and ψ be an Orlicz function. Then for any coupling πp,q between p and q and any σ > 0 we have

Eπp,q c(X, Y )ψ |g(X)−g(Y )|  !



−1 σc(X,Y )
|EX∼p [g(X)] − EY ∼q [g(Y )]| ≤ σEπp,q [c(X, Y )]ψ . (230)
Eπp,q [c(X, Y )]

Proof. Note that |Ep [g(X)] − Eq [g(Y )]| = |Eπ [g(X) − g(Y )]|. We weight the coupling π by the cost c to
obtain a new probability measure π 0 (x, y) = c(x, y)π(x, y)/E[c(x, y)]. We apply Jensen’s inequality under π 0
as follows:
 E [g(X) − g(Y )]   h c(X, Y ) g(X) − g(Y ) i 
π
ψ = ψ Eπ · (231)
σEπ [c(X, Y )] E[c(X, Y )] σc(X, Y )
 h g(X) − g(Y ) i 
= ψ Eπ0 (232)
σc(X, Y )
h  |g(X) − g(Y )| i
≤ Eπ0 ψ (233)
σc(X, Y )
h |g(X) − g(Y )| i
= Eπ c(X, Y )ψ /Eπ [c(X, Y )]. (234)
σc(X, Y )
Inverting ψ yields the desired result.
Proof of Proposition 4.7. We apply Lemma 4.8 with q = r an -friendly perturbation of p under hx, vi2 , and
g = hx, vi2 ; we will also use cost c0 (x, y) = |v > (x − y)|, which satisfies c0 (x, y) ≤ c(x, y). Taking π to be the
-friendly coupling (under c, not c0 ) between p and r yields
 |hx,vi2 −hy,vi2 |  !
E π |hx − y, vi|ψ σ|hx−y,vi|
|Ep [hx, vi2 ] − Er [hx, vi2 ]| ≤ σψ −1 (235)

  !
−1 Eπ |hx − y, vi|ψ |hx, vi + hy, vi|/σ
= σψ . (236)


Now we will split into two cases. First, we observe that the worst-case friendly perturbation will either move
all of the hx, vi2 upwards, or all of the hx, vi2 downwards, since otherwise we could take just the upwards part
or just the downwards part and perturb the mean further. In other words, we either have (i) hx, vi2 ≥ hy, vi2
for all (x, y) ∈ supp(π) with x 6= y, or (ii) hx, vi2 ≤ hy, vi2 for all (x, y) ∈ supp(π) with x 6= y. We analyze
each case in turn.
Case (i): y moves downwards. In this case we can use the bounds |hx − y, vi| ≤ 2|hx, vi| and |hx + y, vi| ≤
2|hx, vi| together with (236) to conclude that
 h  2|hx, vi| i 
|Ep [hx, vi2 ] − Er [hx, vi2 ]| ≤ σψ −1 Eπ 2|hx, vi|ψ / (237)
σ
 h  |hx, ri| i 
= σψ −1 Ep 2σ ψ̃ / (238)
σ
≤ σψ −1 (2σ/), (239)
where the final inequality is by bounded Orlicz norm of p.
Case (ii): y moved upwards. In this case by friendliness we have that |hy, vi|2 ≤ v > Sr v whenever
(x, y) ∈ supp(π) and y 6= x. Thus
p
|hx − y, vi|ψ(|hx, vi + hy, vi|/σ) ≤ |hx − y, vi|ψ(2|hy, vi|/σ) ≤ |hx − y, vi|ψ(2 v > Sr v/σ). (240)
for all (x, y) ∈ supp(π). Plugging back into (236) yields
p
|Ep [hx, vi2 ] − Er [hx, vi2 ]| ≤ σψ −1 (Eπ [|hx − y, vi|ψ(2 v > Sr v/σ)]/) (241)
p
≤ σψ −1 ( · ψ(2 v > Sr v/σ)/) (242)
p
= σ · 2v > Sr v/σ = 2 v > Sr v. (243)

50
Here the final inequality is because Eπ [|hx − y, vi|] ≤ Eπ [c(x, y)] ≤  under p the coupling. Comparing
the left-hand-side to the final right-hand-side yields |v > Sp v − v > Sr v|√≤ 2 v > Sr v. Thus defining ∆ =
|v > Sp v − v > Sr v| and using the fact that v > Sp v ≤ σ 2 , we obtain ∆ ≤ 2 ∆ + σ 2 , which implies (after solving
the quadratic) that ∆ ≤ 42 + 2σ.
Thus overall we have |Ep [hx, vi2 ] − Er [hx, vi2 ]| ≤ max(σψ −1 (2σ/), 42 + 2σ), as was to be shown.

4.2 Other Results


Our understanding of robust estimation under Wc distances is still rudimentary. Below are a couple of known
results, but many of these may be improved or extended in the near future (perhaps by you!).
The most straightforward extension is from second moment estimation to kth moment estimation. In that
case instead of using ψ̃(x) = xψ(2x), we use ψ̃(x) = xψ(kxk−1 ). Essentially the same proof goes through.
We can also extend to more general loss functions L(p, θ), as long as L is a convex function of p for fixed
θ (this holds e.g. for any L(p, θ) = Ex∼p [`(θ; x)], since these loss functions are linear in p and hence also
convex). Here the main challenge is defining an appropriate family F of functions for which to consider
friendly perturbations. For second moment estimation our family F was motivated by the obsevation that
L(p, S) = sup{|Ep [fv (x)] − Eq [fv (x)]| | fv (x) = hx, vi2 , kvk2 = 1}, but such linear structure need not hold in
general. But we can still exploit linear structure by looking at subgradients of the loss. In particular, we can
take the Fenchel-Moreau representation

L(p, θ) = sup Ex∼p [f (x)] − L∗ (f, θ), (244)


f ∈Fθ

which exists for some Fθ and L∗ whenever L(p, θ) is convex in p. The family Fθ is roughly the family of
subgradients of L(p, θ) as p varies for fixed θ. In this case we obtain conditions G↓ and G↑ as before, asking
that
Er [f (x)] − L∗ (f, θ∗ (p)) ≤ ρ1 for all f ∈ Fθ∗ (p) and -friendly r, (↓)
and furthermore

L(p, θ) ≤ ρ2 if for every f ∈ Fθ there is an -friendly r such that Er [f (x)] − L∗ (f, θ) ≤ ρ1 . (↑)

Note that for the second condition (G↓ ), we allow the perturbation r to depend on the current function f . If
r was fixed this would closely match the old definition, but we can only do that for deletions since in general
even the set of feasible r depends on f .
Using this, we can (after sufficient algebra) derive sufficient conditions for robust linear regression under
W1 , for conditions similar to the hypercontractivity condition from before. This will be a challenge problem
on the homework.
Finally, we can define a W̃1 similar to TV, ˜ but our understanding of it is far more rudimentary. In
particular, known analyses do not seem to yield the correct finite-sample rates (for instance, the rate of
convergence includes an n−1/3 term that seems unlikely to actually exist).

5 Inference under Model Mis-specification


5.1 Model Mis-specification in Generalized Linear Models
5.2 Robust Inference via the Bootstrap
[Lecture 17]

5.3 Robust Inference via Partial Specification


In the previous section we saw how using a non-parametric inference method–the bootstrap–allowed us to
avoid the pitfalls of mis-specified parametric models. Next we will explore a different idea, called partial
specification or robust standard errors. Here we stay within a parametric model, but we derive algebraic

51
formulas that hold even when the particular parametric model is wrong, as long as certain “orthogonality
assumptions” are true.
Specifically, we will consider linear regression, deriving standard error estimates via typical parametric
confidence regions as with GLMs. We will see that these are brittle, but that they are primarily brittle
because they implicitly assume that certain equations hold. If we instead explicitly subtitute the right-hand
side of those equations, we get better confidence intervals that hold under fewer assumptions. As a bonus,
we’ll be able to study how linear regression performs under distribution shift.

Starting point: linear response with Gaussian errors. In the simplest setting, suppose that we
completely believe our model:

Y = hβ, Xi + Z, where Z ∼ N (0, σ 2 I). (245)

We observe samples (x1 , y1 ), . . . , (xn , yn ) ∼ p. Suppose that we estimate β using the ordinary least squares
estimator:
n n n
1X X X
β̂ = arg min (yi − hβ, xi i)2 = ( xi x>
i )−1
xi yi . (246)
β n i=1 i=1 i=1
Pn
Define S = n1 i=1 xi x> >
i . Then since yi = xi β + zi , we can further write

Xn n
X
β̂ = ( xi x>
i )−1
( xi x>
i β + xi zi ) (247)
i=1 i=1
Xn
= (nS)−1 (nSβ + xi zi ) (248)
i=1
n
1 −1 X
=β+ S xi zi . (249)
n i=1

From this we see that, conditional on the xi , β̂ − β is a zero-mean Gaussian distribution. Its covariance
matrix is given by
n
1 −1 X 2 > −1 σ 2 −1
S E[z i | x i ]x i xi S = S . (250)
n2 i=1
n

Confidence regions. The above calculation shows that the error β̂ − β is exactly Gaussian with covariance
2
matrix σn S −1 (at least assuming the errors zi are i.i.d. Gaussian). Thus the (parametric) confidence region
for β̂ − β would be an ellipsoid with shape S −1 and radius depending p on σ, n, and the significance level α of
the test. As a specific consequence, the standard error for βi is σ (S −1 )ii /n. This is the standard error
estimate returned by default in most software packages.
Of course, this all so far rests on the assumption of Gaussian error. Can we do better?

Calculation from moment assumptions. It turns out that our calculation above relied only on condi-
tional moments of the errors, rather than Gaussianity. We will show this explicitly by doing the calculations
more carefully. Re-using steps above, we have that
n
1 −1 X
β̂ − β = S xi zi . (251)
n i=1

In particular, assuming that the (xi , yi ) are i.i.d., we have


n
1 −1 X
E[β̂ − β | x1 , . . . , xn ] = S xi E[zi | xi ] = S −1 b, (252)
n i=1

def 1 Pn
where b = n i=1 xi E[zi | xi ].

52
In particular, as long as E[Z | X] = 0 for all X, β̂ is an unbiased estimator for X. In fact, since this only
needs to hold on average, as long as E[ZX] = 0 (covariates uncorrelated with noise) then E[β̂ − β] = 0, and
E[β̂ − β | x1:n ] converges to zero as n → ∞. This yields an insight that is important more generally:
Orindary least squares yields an unbiased estimate of β whenever the covariates X and noise Z
are uncorrelated.
This partly explains the success of OLS compared to other alternatives (e.g. penalizing the absolute error or
fourth power of the error). While OLS might initially look like the maximum likelihood estimator under
Gaussian errors, it yields consistent estimates of β under much weaker assumptions. Minimizing the fourth
power of the error requires stronger assumptions for consistency, while minimizing the absolute error would
yield a different condition in terms of medians rather than expectations.
Next we turn to the covariance of β̂. Assuming again that the (xi , yi ) are independent across i, we have
n
1 X
Cov[β̂ | x1:n ] = Cov[ S −1 xi zi | x1:n ] (253)
n i=1
n
1 −1 X
= S xi Cov[zi , zj | xi , xj ]x>
j S
−1
(254)
n2 i,j=1
n
1 −1 X
= S xi Var[zi | xi ]x>
i S
−1
, (255)
n2 i=1
Pn
where the final line is because zi , zj are independent for i 6= j. If we define Ω = n1 i=1 xi Var[zi | xi ]x>
i , then
the final term becomes n1 S −1 ΩS −1 .
This quantity can be upper-bounded under much weaker assumptions than Gaussianity. If we, for instance,
2
merely assume that Var[zi | xi ] ≤ σ 2 for all i, then we have that Ω  σ 2 S and hence Cov[β̂ | x1:n ]  σn S −1 .
Even better, this quantity can be estimated from data. Let u2i = (yi − β̂ > xi )2 . This is a downward-biased,
but asymptotically unbiased, estimate for Var[zi | xi ] (it would be unbiased if we used β instead of β̂).
Therefore, form the matrix
n
1X
Ω̂n = xi u2i x>
i . (256)
n i=1
1 −1
Then nS Ω̂n S −1
can be used to generate confidence regions and standard errors. In particular, the standard
q
error estimate for βi is (S −1 Ω̂n S −1 )ii /n. This is called the robust standard error or heteroskedacity-
consistent standard error.
There are a couple of simple improvements on this estimate. The first is a “degrees of freedom” correction:
we know that u2i is downward-biased, and it is more downward-biased the larger the dimension d (because
1
then β̂ can more easily overfit). We often instead use n−d S −1 Ω̂n S −1 , which corrects for this.
A fancier correction, based on the jacknnife, first corrects the errors ui , via
1 > −1
u0i = ui /(1 − κi ), with κi = x S xi .
n i
Pn
We obtain a corresponding Ω0n = 1
n i=1 xi (u0i )2 x>
i , and the matrix for the standard errors is

n
1 −1 0 1X
S (Ωn − ζζ > )S −1 , where ζ = xi u0i .
n n i=1

1
The main difference is that each ui gets a different correction factor 1−κi
(which is however roughly equal to
n
n−d ) and also that we subtract off the mean ζ. There is some evidence that this more complicated estimator
works better when the sample size is small, see for instance MacKinnon and White (1985).

53
Out-of-distribution error. Now suppose that we wish to estimate the error on test samples x̄1:m drawn
from a distribution p̄ 6= p. Start again with the Gaussian assumption that y = β > x + z. Pm
1 >
The expected error on sample x̄i (over test noise z̄i ) is σ 2 + hβ̂ − β, x̄i i2 . If we let S̄ = m i=1 x̄i x̄i , and
let EZ denote the expectation with respect to the training noise z1 , . . . , zn , then the overall average expected
error (conditional on x1:n , x̄1:m ) is
m m
1 X > 1 X
σ 2 + EZ [ (x̄i (β − β̂))2 ] = σ 2 + h x̄i x̄> >
i , EZ [(β − β̂)(β − β̂) ]i (257)
m i=1 m i=1
σ2
= σ 2 + hS̄, S −1 i (258)
n
 1 
+ σ 1 + hS̄, S −1 i .
2
(259)
n
This shows that the error depends on the divergence between the second moment matrices of p(x) and p̄(x):
• When p(x) = p̄(x), then hS̄, S −1 i = tr(S̄S −1 ) ≈ tr(I) = d, so the error decays as d
n.

• If S is low-rank and is missing any directions that appear in S̄, then the error is infinite. This makes
sense, as we have no way of estimating β along the missing directions, and we need to be able to
estimate β in those directions to get good error under p̄. We can get non-infinite bounds if we further
assume some norm bound on β; e.g. if kβk2 is bounded then the missing directions only contribute
some finite error.
• On the other hand, if S is full-rank but S̄ is low-rank, then we still achieve finite error. For instance,
suppose that S = I is the identity, and S̄ = kd P is a projection matrix onto a k-dimensional subspace,
scaled to have trace d. Then we get a sample complexity of nd , although if we had observed samples
with second moment matrix S̄ at training time, we would have gotten a better sample complexity of nk .
• In general it is always better for S to be bigger. This is partially an artefact of the noise σ 2 being the
same for all X, so we would always rather have X be as far out as possible since it pins down β more
effectively. If the noise was proportional to kXkF (for instance) then the answer would be different.

Robust OOD error estimate. We can also estimate the OOD error even when the Gaussian assumption
doesn’t hold, using
Pm the same idea as for robust standard errors. Letting z̄i be the noise for x̄i , the squared
1 2
error is then m j=1 (hβ − β̂, x̄ i i + z̄i ) , and computing the expectation given x1:n , x̄1:m yields
m
1 X
E[ (hβ − β̂, x̄i i + z̄i )2 | x1:n , x̄1:m ] (260)
m j=1
m
1 X >
= x̄ E[(β − β̂)(β − β̂)> | x1:n ]x̄i + 2x̄> 2
i E[β − β̂ | x1:n ]E[z̄i | xi ] + E[z̄i | x̄i ] (261)
m i=1 i
1 m
D  E D E 1 X
= S̄, S −1 Ω + bb> S −1 + 2 b̄, S −1 b + E[z̄i2 | x̄i ]. (262)
n m j=1

To interpret this expression, first assume that the true Pm model is “actually linear”, meaning that b = b̄ = 0.
Then the expression reduces to n1 hS̄, S −1 ΩS −1 i + m
1
j=1 E[z̄ 2
|
i xi ]. The second term is the intrinsic variance
1 −1
in the data, while the first term is similar to the n hS̄, S i term from before, but accounts for correlation
between X and the variation in Z.
If the model is not actually linear, then we need to decide how to define β (since the optimal β is then
no longer independent of the distribution). In that case a natural choice is to let β be the minimizer under
the training distribution, in which case b → 0 as n → ∞ and thus the hb̄, S −1 bi term conveniently becomes
asymptotically negligible. The twist is that E[z̄i2 | x̄i ] now measures not just the intrinsic variance but also
the departure from linearity, and could be quite large if the linear extrapolation away from the training points
ends up being poor.

54
Partial specification. In general, we see that we can actually form good estimates of the mean-squared
error on p̄ making only certain moment assumptions (e.g. b = b̄ = 0) rather than needing to assume the
Gaussian model is correct. This idea is called partial specification, where rather than making assumptions that
are stringent enough to specify a parametric family, we make weaker assumptions that are typically insufficient
to even yield a likelihood, but show that our estimates are still valid under those weaker assumptions. The
weaker the assumptions, the more happy we are. Of course b = b̄ = 0 is still fairly strong, but much better
than Gaussianity. The goal of partial specification aligns with our earlier desire to design estimators for the
entire family of resilient distributions, rather than specific parametric classes. We will study other variants of
partial specification later in the course, in the context of clustering algorithms.

[Lecture 18]

5.4 Partial Specification and Agnostic Clustering


We next study the idea of partial specification for clustering. Our setting for clustering will be the following:
• There are k unknown distributions p1 , . . . , pk .
• We observe points x1 , . . . , xn , such that a fraction αj of the points xi are drawn from pj .
Generally the αj are not known but we have a lower bound on αmin = minkj=1 αj . In clustering we have two
goals:
• Parameter recovery: We wish to estimate some parameter of the pj (usually their means).
• Cluster recovery: We wish to determine for each point xi which cluster pj it was drawn from.
In the simplest setting, we assume that each of the pj has a known parametric form (for instance, each pj is
a Gaussian with unknown mean and variance). In the agnostic setting, we do not assume a parametric form
for the pj but instead only assume e.g. bounded moments. In the robust setting, we allow some fraction  of
the points to be arbitrary outliers (so α1 + · · · + αk = 1 − ).
Partial specification thus corresponds to the agnostic setting. Clustering is a particularly interesting setting
for studying partial specification because some algorithms that work in the simple setting fail completely in
the agnostic setting. Below we will first study the simple setting and give an algorithm based on the method
of moments, then turn our attention to the agnostic setting. In the agnostic setting, resilience will appear
once again as an information-theoretically sufficient condition enabling clustering. Finally, we will turn our
attention to efficient algorithms. In many cases the agnostic algorithms will work even in the robust agnostic
setting.

5.4.1 Clustering Mixtures of Gaussians


Here we assume that each pj = N (µj , Σj ). Thus we can treat each xi as being drawn from p =
Pk
j=1 αj N (µj , Σj ). This is a parametric model with parameters (αj , µj , Σj ), so (at least in the limit
of infinite data) a sufficient condition for exact parameter recovery is for the model to be identifiable, meaning
Pk Pk
that if j=1 αj N (µj , Σj ) = j=1 αj0 N (µ0j , Σ0j ), then αj = αj0 , µj = µ0j , and Σj = Σ0j .5
As stated, the model is never identifiable because we can always permute the (αj , µj , Σj ) and obtain an
identical distribution. What we actually care about is identifiability up to permutation: if pα,µ,Σ = pα0 ,µ0 ,Σ0
0
then αj = ασ(j) , µj = µ0σ(j) , and Σj = Σ0σ(j) for some permutation σ.
We have the following result:
Proposition 5.1. As long as the orders pairs (µj , Σj ) are all distinct, the parameters (αj , µj , Σj ) are
identifiable up to permutation.
5 We also need to worry about the case where k 6= k0 , but for simplicity we ignore this.

55
Proof. This is equivalent to showing that the functions fµ,Σ (x) defining the pdf of a Gaussian are all linearly
independent (i.e., there is no non-trivial finite combination that yields the zero function). We will start by
showing this in one dimension. So, suppose for the sake of contradiction that
m
X √
cj exp(−(x − µj )2 /2σj2 )/ 2πσ 2 = 0, (263)
j=1

where the cj are all non-zero. Then integrating (263) against the function exp(λx) and using the formula for
the moment generating function of a Gaussian, we obtain
m
X 1
cj exp( σj2 λ2 + µj λ) = 0. (264)
j=1
2

1 2
Let σmax = maxm 2
j=1 σj , then dividing the above equation by exp( 2 σmax λ ) and taking λ → ∞, we see that
only those j such that σj = σmax affect the limit. If S is the set of such indices j, we obtain
X
cj exp(µj λ) = 0, (265)
j∈S

i.e. there is a linear relation between the functions gµj (λ) = exp(µj λ). But this is impossible, because as long
as the µj are distinct, the largest µj will always dominate the limit of the linear relation as λ → ∞, and so
we must have cj = 0 for that j, a contradiction.
It remains to extend to the n-dimensional case. Suppose there was a linear relation among the PDFs of
n-dimensional Gaussians with distinct parameters. Then if we project to a random 1-dimensional subspace,
the corresponding marginals (which are linear functions of the n-dimensional PDFs) are also each Gaussian,
and have distinct parameters with probability 1. This is again a contradiction since we already know that
distinct 1-dimensional Gaussians cannot satisfy any non-trivial linear relation.
Proposition 5.1 shows that we can recover the parameters exactly in the limit of infinite data, but it
doesn’t say anything about finite-sample rates. However, asymptotically, as long as the log-likelihood function
is locally quadratic around the true parameters,
√ we can use tools from asymptotic statistics to show that we
approach the true parameters at a 1/ n rate.

Recovery from moments. Proposition 5.1 also leaves open the question of efficient computation. In
practice we would probably use k-means or EM, but another algorithm is based on the method of moments.
It has the virtue of being provably efficient, but is highly brittle to mis-specification.
The idea is that the first, second,Pand third moments give a system of equations that can be solved for
the parameters (α, µ, Σ): letting p = j αj N (µj , Σj ), we have

k
X
Ep [X] = αj µj , (266)
j=1
k
X
Ep [X ⊗ X] = αj (µj µ>
j + Σj ), (267)
j=1
k
X
Ep [X ⊗ X ⊗ X] = αj (µ⊗3
j + 3 Sym(µj ⊗ Σj )), (268)
j=1

where Sym(X)i1 i2 i3 = 16 (Xi1 i2 i3 + Xi1 i3 i2 + Xi2 i1 i3 + Xi2 i3 i1 + Xi3 i1 i2 + Xi3 i2 i1 ).


In d dimensions, this yields d + d+1 + d+2 ≈ d3 /6 equations and k(1 + d + d+1
   2
2 3 2 ) ≈ kd /2 unknowns.
Thus as long as d > 3k we might hope that these equations have a unique (up to permutation) solution for
(α, µ, Σ). As an even more special case, if we assume that the covariance matrices are all diagonal, then
we only have approximately 2kd unknowns, and the equations have a solution whenever the µj are linearly
independent. We can moreover find this solution via an efficient algorithm called the tensor power method,

56
which is a generalization of the power method for matrices, and the
√ rate of convergence is polynomial in k, d,
and the condition number of certain matrices (and decays as 1/ n).
However, this method is very brittle—it relies on exact algebraic moment relations of Gaussians, so even
small departures from the assumptions (like moving from Gaussian to sub-Gaussian) will likely break the
algorithm. This is one nice thing about the agnostic clustering setting—it explicitly reveals the brittleness of
algorithms like the one above, and (as we shall see) shows why other algorithms such as k-means are likely to
perform better in practice.

Cluster recovery. An important point is that even in this favorable setting, exact cluster recovery is
impossible. This is because even if the Gaussians are well-separated, there is some small probability that a
sample ends up being near the center of a different Gaussian.
To measure this quantitatively, assume for simplicity that Σj = σ 2 I for all j (all Gaussians are isotropic
with the same variance), and suppose also that the µj are known exactly and that we assign each point x to
the cluster that minimizes kx − µj k2 .6 Then the error in cluster recovery is exactly the probability that a
sample from µj ends up closer to some other sample µj 0 , which is
k
X k
X X
αj Px∼N (µj ,σ2 I) [kx − µj k2 > kx − µj 0 k2 for some j 0 6= j] ≤ αj Φ(kµj − µj 0 k/σ) (269)
j=1 j=1 j 0 6=j

≤ kΦ(∆/σ), (270)
p
where ∆ = minj 0 6=j kµj − µj 0 k2 and Φ is the normal CDF. As long as ∆  log(k/), the cluster error will
be at most . Note that the cluster error depends on a separation condition stipulating that the cluster
centers are all sufficiently far apart. Moreover, we need greater separation if there are more total clusters
(albeit at a slowly-growing rate in the Gaussian case).

5.4.2 Clustering Under Resilience


The mixture of Gaussians case is unsatisfying because data are unlikely to actually be Gaussian mixtures in
practice, yet common algorithms like k-means still do a good job at clustering data. We therefore move to the
agnostic setting, and show that we only need the distributions to be resilient in order to cluster successfully.
We will start by proving an even stronger result—that if a set of points contains a (ρ, α)-resilient subset
S of size αn, then it is possible to output an estimate µ̂ that is close to the true mean µ of S, regardless
of the other (1 − α)n points. As stated, this is impossible, since there could be O(1/α) identical clusters
in the data. So what we will actually show is a list-decoding result—that it is possible to output O(1/α)
“candidates” µ̂l such that one of them is close to the mean of S:
Proposition 5.2. Suppose that a set of points S̃ = {x1 , . . . , xn } contains a (ρ, α/4)-resilient set S with
mean µ. Then if |S| ≥ αn (even if α < 12 ), it is possible to output m ≤ α2 candidates µ̂1 , . . . , µ̂m such that
kµ̂j − µk ≤ 8ρ
α for some j.

Proof. The basic intuition is that we can cover the points in S̃ by resilient sets S10 , . . . , S2/α
0
of size α2 n. Then
by the pigeonhole principle, the resilient set S must have large overlap with at least one of the S 0 , and hence
have similar mean. This is captured in Figure 9 below.
The main difference is that S and S 0 may have relatively small overlap (in a roughly α-fraction of elements).
We thus need to care about resilience when the subset T is small compared to S. The following lemma relates
resilience on large sets to resilience on small sets:
Lemma 5.3. For any 0 <  < 1, a distribution/set is (ρ, )-resilient if and only if it is ( 1−
 ρ, 1 − )-resilient.

This was already proved in Appendix C as part of Lemma 2.14. Given Lemma 5.3, we can prove
Proposition 5.2 with a similar triangle inequality argument to how we showed that resilient sets have small
modulus of continuity. However, we now need to consider multiple resilient sets Si rather than a single S 0 .
6 This is not quite optimal, in reality we would want to assign based on kx − µ k2 /σ 2 + log α , but we consider this simpler
j 2 j
assignment for simplicity.

57
Figure 9: If we cover S̃ by resilient sets, at least one of the sets S 0 has large intersection with S.

Suppose S is (ρ, α4 )-resilient around µ–and thus also ( α4 ρ, 1 − α4 )-resilient by Lemma 5.3–and let S1 , . . . , Sm
be a maximal collection of subsets of [n] such that:
α
1. |Sj | ≥ 2n for all j.
2. Sj is ( α4 ρ, 1 − α2 )-resilient (with mean µj ).
3. Sj ∩ Sj 0 = ∅ for all j 6= j 0 .
Clearly m ≤ α2 . We claim that S has large intersection with at least one of the Sj and hence µj is close
to µ. By maximality of the collection {Sj }m j=1 , it must be that S0 = S\(S1 ∪ · · · ∪ Sm ) cannot be added to
the collection. First suppose that |S0 | ≥ 2 n. Then S0 is ( α4 ρ, 1 − α2 )-resilient (because any subset of α2 |S0 |
α

points in S0 is a subset of at least α4 |S| points in S). This contradicts the maximality of {Sj }m j=1 , so we must
have |S0 | < α2 n.
Now, this implies that |S ∩ (S1 ∪ · · · ∪ Sm )| ≥ α2 n, so by pigeonhole we must have |S ∩ Sj | ≥ α2 |Sj | for
some j. Letting T = S ∩ Sj as before, we find that |T | ≥ α2 |Sj | ≥ α4 |S| and hence by resilience of Sj and S
we have kµ − µj k ≤ 2 · ( α4 ρ) = α8 ρ by the same triangle inequality argument as before.

Better bounds for well-separated clusters. Proposition 5.2 is powerful because it holds under very
minimal conditions (we do not need to assume anything about separation of clusters or even about any of the
clusters other than the one we are estimating). However, its guarantees are also minimal—we only know that
we get approximate parameter recovery in the list-decoding model, and cannot say anything about cluster
recovery. We next obtain a stronger bound assuming that the data can actually be separated into clusters
(with a small fraction of outliers) and that the means are well-separated. This stronger result both gives
cluster recovery, and gives better bounds for parameter recovery:
Proposition 5.4. Suppose that a set of points {x1 , . . . , xn } can be partitioned into k sets C1 , . . . , Ck of size
α1 n, . . . , αk n, together with a fraction n of outliers ( = 1 − (α1 + · · · + αk )), where 2 ≤ α = minkk=1 αj .
Further suppose that
• Each cluster is (ρ1 , )-resilient and (ρ2 , 2/α)-resilient.
4ρ1
• The means are well-separated: ∆ >  where ∆ = minj6=j 0 kµj − µj 0 k2 .

Then we can output clusters Ĉ1 , . . . , Ĉk such that:


• |Cj 4Ĉj | ≤ O(/α)|Cj | (cluster recovery)

• The mean µ̂j of Ĉj satisfies kµ̂j − µj k2 ≤ 2ρ2 (parameter recovery).


Proof. We will construct a covering by resilient sets as before, but this time make use of the fact that we
know the data can be approximately partitioned into clusters. Specifically, let S1 , . . . , Sk be a collection of k
sets such that:
• |Sl | ≥ αn
• The Sl are disjoint and contain all but n points.

58
• Each Sl is (ρ1 , )-resilient.
We know that such a collection exists because we can take the Cj themselves. Now call a set S “j-like” if it
contains at least αj (/α)|S| points from Cj . We claim that each Sl is j-like for exactly one j. Indeed, by
pigeonhole it must be j-like for at least one j since /α ≤ 1/2 < 1.
In the other direction, note that if S if j-like then S ∩ Cj contains at least (αj /α) of the points in S,
and at least (|S|/n)(/α) ≥  of the points in Cj . Thus by resilience of both sets, the means of both S and
Cj are within ρ1 of the mean of S ∩ Cj and hence within 2ρ 1 of each other. In summary, kµj − µS k2 ≤ 2ρ 1 .
Now if S were j-like and also j 0 -like, then we would have kµj − µj 0 k2 ≤ 4ρ 1 , which contradicts the separation
assumption.
Since Sl is j-like for a unique j, it contains at most (/α)|Sl | points from any of the other Cj 0 , together
with at most n outliers. Moreover, since the other Sl0 are not j-like, Sl is missing at most αj (/α)n points
from Cj . Thus Sl ∩ Cj is missing at most 2/α|Sl | points from Sl and at most /α|Cj | points from Cj . By
resilience their means are thus within 2ρ2 of each other, as claimed.

5.5 Efficient Clustering Under Bounded Covariance


We saw that resilience is information-theoretically sufficient for agnostic clustering, but we would also like to
develop efficient algorithms for clustering. This is based on work in Kumar and Kannan (2010) and Awasthi
and Sheffet (2012), although we will get a slightly slicker argument by using the machinery on resilience that
we’ve developed so far.
As before, we will need a strong assumption than resilience. Specifically, we will assume that each cluster
had bounded covariance and that the clusters are well-separated:
Theorem 5.5. Suppose that the data points x1 , . . . , xn can be split into k clusters C1 , . . . , Ck with sizes
αn , · · · , αk n and means µ1 , . . . , µk , and moreover that the following covariance and separation conditions
hold:

• |C1j | i∈Cj (xi − µj )(xi − µj )>  σ 2 I for each cluster Cj ,


P


• ∆ ≥ 36σ/ α, where ∆ = minj6=j 0 kµj − µj 0 k2 .

Then there is a polynomial-time algorithm outputting candidate clusters Ĉ1 , . . . , Ĉk and means µ̂1 , . . . , µ̂k
such that:
• |Cj 4Ĉj | = O(σ 2 /α∆2 ) (cluster recovery), and

• kµj − µ̂j k2 = O(σ 2 /α∆) (parameter recovery).


The basic idea behind the algorithm is to project each of the points xi onto the span of the top k singular
vectors of the data matrix X = [x1 · · · xn ]. Let Pk be the projection operator onto this space. Then since
the points P xi lie in only a k-dimensional space instead of a d-dimensional space, they are substantially
easier to cluster. The algorithm itself has three core steps and an optional step:

1. Project points xi to P xi .
2. Form initial clusters based on the P xi .
3. Compute the means of each of these clusters.
4. Optionally run any number of steps of k-means in the original space of xi , initialized with the computed
means from the previous step.
We will provide more formal psuedocode later [NOTE: TBD]. For now, we focus on the analysis, which has
two stages: (1) showing that the initial clustering from the first two steps is “nice enough”, and (2) showing
that this niceness is preserved by the k-means iterations in the second two steps.

59
Analyzing the projection. We start by analyzing the geometry of the points Pk xi . The following lemma
shows that the projected clusters are still well-separated and have small covariance:
Lemma
p 5.6. The projected √ points Pk xi satisfy the covariance and separation conditions with parameters σ
and ∆2 − 4σ 2 /α ≥ 35σ/ α:
1 X p
(P xi − P µj )(P xi − P µj )>  σ 2 I and kP µj − P µj 0 k2 ≥ ∆2 − 4σ 2 /α. (271)
|Cj |
i∈Cj

In other words, the covariance condition is preserved, and separation is only decreased slightly.
Proof. The covariance condition is preserved because the covariance matrix of the projected points for cluster
j is Pk Σj Pk , where Σj is the un-projected covariance matrix. This evidently has smaller singular values than
Σk .
The separation condition requires more detailed analysis. We start by showing that there is not much in
the orthogonal component (I − Pk )xi . Indeed, we have that the top singular value of (I − Pk )xi is at most σ:
n
1X
S= ((I − Pk )xi )((I − Pk )xi )>  σ 2 I (272)
n i=1

This is because Pk minimizes this top singular value among all k-dimensional projection matrices, and if we
take the projection Qk onto the space spanned by the means µ1 , . . . , µk , we have
n k
1X X αj X
((I − Qk )xi )((I − Qk )xj )> = ((I − Qk )xi )((I − Qk )xi )> (273)
n i=1 j=1
|C j |
i∈Cj
k
X αj X
= ((I − Qk )(xi − µj ))((I − Qk )(xi − µj ))> (274)
j=1
|C j |
i∈Cj
k k
X αj X X
 (xi − µj )(xi − µj )>  αj σ 2 I = σ 2 I. (275)
j=1
|C j | j=1
i∈Cj

Given this, we know that the projections (I − Pk )µj must be small, since otherwise we have
n
1X
v > Sv = h(I − Pk )xi , vi2 (276)
n i=1
αj X
≥ h(I − Pk )xi , vi2 (277)
|Cj |
i∈Cj
D 1 X E2
≥ αj (I − Pk )xi , v (278)
|Cj |
i∈Cj

= αj h(I − Pk )µj , vi2 . (279)


Consequently h(I − Pk )µj , vi2 ≤ σ 2 /αj and hence (taking
√ v to align with (I − Pk )µj ) we have k(I − Pk )µj k2 ≤

σ/ αj . In particular k(I − Pk )(µj − µj 0 )k2 ≤ 2σ/ α.
Now, by the Pythagorean theorem we have
kPk (µj − µj 0 )k22 = kµj − µj 0 k22 − k(I − Pk )(µj − µj 0 )k22 ≥ ∆2 − 4σ 2 /α, (280)
p
and hence the projected means are separated by at least ∆2 − 4σ 2 /α, as was to be shown.

Analyzing the initial clustering. We now analyze the √ initial clustering. Call a point i a proto-center if
there are at least α2 n projected points within distance 3σ k of Pk xi , and call the set of these nearby points
the associated proto-cluster.
We will show that the proto-clusters are nearly pure (have few points not from Cj ) using a similar
argument as when we analyzed resilient clustering. As before, call a proto-cluster j-like if there are at least
αj α
4 n points from Cj in the proto-cluster.

60
Lemma 5.7. Each proto-cluster is j-like for exactly one j.
Proof. We know that it is j-like for at least one j by the Pigeonhole principle (if not, then the proto-cluster
has at most α4 n points in total, contradicting its size of at least α2 n). So suppose for the sake of√contradiction
that it is both j-like and j 0 -like. By resilience, the mean of the points
√ from Cj is at most 2σ/ α away from
Pk µj , and√similarly the mean of the points from Cj 0 is at most 2σ/ α away from Pk µj 0 . Since the cluster has
√ √
radius 3σ k ≤ 3σ/ α, this implies that kPk (µj − µj 0 )k2 ≤ 10σ/ α, contradicting the separation condition
for the projected means. Thus no proto-cluster can be j-like for multiple j, which proves the lemma.
Now since each proto-cluster is j-like for exactly one j, at least half of the points must come from that
proto-cluster.
At this point we are essentially done if all we care about is constructing an efficient algorithm for cluster
recovery (but not parameter recovery), since if we just extend each proto-cluster by O(σ) we are guaranteed
to contain almost all of the points in its corresponding cluster, while still containing very few points from any
other cluster (assuming the data are well-separated). However, parameter recovery is a bit trickier because we
need to make sure that the small number of points from other clusters don’t mess up the mean of the cluster.
The difficulty is that while we have control over the projected distances, and can recover the projected centers
Pk µj well, we need to somehow get back to the original centers µj .
The key here is that for each proto-cluster, the P xi are all close to each other, and the missing component
(I − Pk )xi has bounded covariance. Together, these imply that the proto-cluster is resilient—deleting
√ an
-fraction of points can change the mean by at most O(σ) in the Pk component, and O(σ ) in the (I − Pk )
component. In fact, we have:
Lemma 5.8. Let B be a proto-cluster with mean ν. Then
1 X
(xi − ν)(xi − ν)>  11σ 2 /α. (281)
|B|
i∈B

In particular, if B is j-like then we have kµj − νk2 ≤ 9σ/ α.

Proof. The covariance bound is because the covariance of the xi are bounded in norm by at most 3σ k
in the Pk component and hence can contribute at most 9σ 2 k ≤ 9σ 2 /α to the covariance, while we get an
additional 2σ 2 /α in an orthogonal direction because the overall second moment of the (I − Pk )xi is σ 2 and
the i ∈ B contribute to at least an α2 fraction of that.
Now, this implies that B is resilient, while we already have that Cj is resilient. Since B √ ∩ Cj contains
√ at
least√half the points in both B and Cj , this gives that their means are close–within distance 2( 11+1)σ/ α <
9σ/ α.

Analyzing k-means. We next show that k-means iterations preserve certain important invariants. We
will call the assigned means µ̂j R-close if kµ̂j − µj k2 ≤ R for all j, and we will call the assigned clusters Ĉj
-close if |Cj 4Ĉj | ≤ |Cj | for all j. We will show that if the means are R-close then the clusters are -close
for some  = f (R), and that the resulting new means are then g(R)-close. If R is small enough then we will
also have g(R) < R so that we obtain an invariant.
Let ∆jj 0 = kµj − µj 0 k2 , so ∆jj 0 ≥ ∆. We will show that if the µ̂j are R-close, then few points in Cj can
end up in Ĉj 0 . Indeed, if xi ends up in Ĉj 0 then we must have kxi − µ̂j 0 k22 ≤ kxi − µ̂j k22 , which after some
re-arrangement yields
1
hxi − µ̂j , µ̂j 0 − µ̂j i ≥ hµ̂j 0 − µ̂j , µ̂j 0 − µ̂j i. (282)
4
Applying the covariance bound and Chebyshev’s inequality along the vector v = µ̂j 0 − µ̂j , we see that the
4σ 2 4σ 2 4σ 2
fraction of points in Cj that end up in Ĉj 0 is at most kµ̂j −µ̂ 2 ≤ (∆ 0 −2R)2 ≤ (∆−2R)2 . In total this means
j 0k 2 jj
4σ 2 n 4kσ 2 |Cj |
that at most (∆−2R)2 points from other clusters end up in Ĉj , while at most (∆−2R)2 points from Cj end up
4kσ 2 4σ 2 8σ 2
in other clusters. Thus we have  ≤ (∆−2R)2 + α(∆−2R)2 ≤ α(∆−2R)2 , so we can take

8σ 2
f (R) = . (283)
α(∆ − 2R)2

61
2
Now suppose that γjj 0 |Cj | points in Cj are assigned to Ĉj 0 , where we must have γjj 0 ≤ (∆ 4σ 2 . By
jj 0 −2R)
√ √
resilience, the mean of these points is within σ/ γjj 0 of µj and hence within ∆jj 0 + σ/ γjj 0 of µj 0 . In total,
then, these points can shift the mean µ̂j 0 by at most

γjj 0 αj n(∆jj 0 + σ/ γjj 0 ) 2αj  4σ 2 ∆jj 0 2σ 2  4α  2σ 2 ∆
j σ2 
1 ≤ + ≤ + . (284)
2 αn
α (∆jj 0 − 2R)2 ∆jj 0 − 2R α (∆ − 2R)2 ∆ − 2R

4kσ 2
At the same time, the (∆−2R)2 fraction of points that are missing from Cj 0 can change its mean by at most

4σ 2 k
∆−2R . Thus in total we have

4σ 2 √ 1 2∆  8σ 2 (∆ − R)
kµ̂j 0 − µj 0 k2 ≤ · k+ + ≤ . (285)
∆ − 2R α α(∆ − 2R) α(∆ − 2R)2
8σ 2 (∆−R)
In particular we can take g(R) = α(∆−2R)2 .
2 2 √
As long as R ≤ ∆/4 we have√ g(R) ≤ 24σ
and f (R) ≤ 32σ
α∆ α∆2 , as claimed. Since our initial R is 9σ/ α,
this works as long as ∆ ≥ 36σ/ α, which completes the proof.

[Lecture 19]

6 Nonparametric Learning in Hilbert spaces


We will now shift focus to nonparametric learning, where we seek to estimate a function lying within some
infinite-dimensional space. We will start by considering linear regression in reproducing kernel Hilbert spaces
(RKHSs). This is the same as ordinary or ridge regression, except that all of the underlying objects are
infinite-dimensional. It turns out that to handle this infinite dimensionality, it is helpful to consider a different
basis representation for linear regression, in terms of the function itself rather than its parameters. This will
be our starting point.

6.1 Preliminaries: Parameter vs. Function View


We will start by considering
Pn the usual task of ordinary linear regression: given (x1 , y1 ), . . . , (xn , yn ), find β
such that L(β) = n1 i=1 (yi − hβ, xi i)2 is minimized.
Letting X be the n × p matrix of covariates and y the n × 1 matrix of outputs, the solution β̂ of this
problem can be expressed as
n n
1X X
β ∗ = (X > X)−1 X > y, or β ∗ = ( xi x>
i )
−1
xi yi . (286)
n i=1 i=1
| {z }
S

In addition, for any other value of β, the excess loss relative to β ∗ can be expressed as

L(β) − L(β ∗ ) = (β − β ∗ )> S(β − β ∗ ), (287)

where as above S is the second moment matrix of the covariates.


λ 2
Finally, we can consider instead ridge regression, where we minimize Lλ (β) = n kβk2 + L(β). In this case
the minimizer is equal to
n
λ 1X
β̂λ = (λI + X > X)−1 X > y = ( I + S)−1 ( xi yi ). (288)
n n i=1

This is the standard view of linear regression and what we will call the parameter view, since it focuses on
the parameters β. But it is also helpful to adopt a different view, which we’ll the function view.
Pn The function
view instead focuses on the learned function fβ (x) = β > x. Thus we instead have L(f ) = n1 i=1 (yi − f (xi ))2

62
(assume for now that f is linear in x; by convention L(f ) = ∞ for all other such f , but the derivations below
will focus on the case where L(f ) is finite).
The excess loss is particularly easy to write down in the function view:
n
1X 1
L(f ) − L(f ∗ ) = (f (xi ) − f ∗ (xi ))2 = kf − f ∗ k22 , (289)
n i=1 n

where the `2 -norm treats f as an n-dimensional vector of values on the xi .


What is f ∗ = fβ ∗ in this case? Again expressing f ∗ as an n-dimensional vector, we have
fβ ∗ = Xβ ∗ = X(X > X)−1 X > y, (290)
or in other words fβ ∗ is the projection of y onto the p-dimensional linear span of the xi , which makes sense.
For ridge regression, we would also like to express kβk22 in terms of f . This is a bit trickier, but noting
that β = (X > X)−1 X > f , a formula that works is
kβk22 = f > X(X > X)−2 X > f = f > K † f, (291)
where K = XX > is the n × n kernel matrix (Kij = hxi , xj i) and † denotes pseudoinverse. As this is also an
important norm for f , we will give it a name, kf kK :
kf k2K = f > K † f if f lies in the span of K, and ∞ otherwise. (292)
Now we can write down ridge regression in the function view:
λ 1
fˆλ = arg min kf k2K + kf − yk22 (293)
f n n
=⇒ λK † fˆλ + (fˆλ − y) = 0 (294)
=⇒ fˆλ = (λK † + I)−1 y (295)
=⇒ fˆλ = K(λI + K)−1 y. (296)
This is also called kernel ridge regression, since it depends only on the kernel matrix K.
A final issue is how to evaluate f on a new data point x. It helps to first define kx = X > x, the n × 1
matrix of inner products between x and the xi . We then have x = (X > )† kx and β = X † f and so
f (x) = β > x (297)
= f > (X † )> X † kx (298)
> > † > †
= f (XX ) kx = f K kx . (299)

Thus for instance fˆλ (x) = kx> K † K(λI + K)−1 y.

6.2 The infinite-dimensional case


We are interested in nonparametric regression, meaning regression over infinite-dimensional families of
functions. The function view on regression is particularly useful in this case, because we can dispense with
worrying about vectors xi and focus only on the kernel matrix K. In fact, we will simply posit that there
exists a kernel function k(x, x0 ), such that for any finite set T = {x1 , . . . , xn } the matrix K[T ] defined by
Kij = k(xi , xj ) is positive semidefinite, and strictly positive definite whenever the xi are distinct. The
corresponding function space, called the reproducing kernel Hilbert space of k, is defined as
H = {f | sup kf kK[T ] < ∞}, (300)
T

with corresponding norm kf kH = supT kf kK[T ] .


The strict positive definiteness actually implies that k represents an infinite-dimensional space, since for a
p-dimensional space K[T ] would have rank at most p and thus eventually be only semidefinite.
To show that H is non-empty (and more generally that the definitions all make sense), we need two
lemmas. First, evaluating kf kK on a larger subset T always increases the norm:

63
Lemma 6.1 (Restrictions are contractions). For any f and any sets T ⊆ T 0 , we have kf kK[T ] ≤ kf kK[T 0 ] .
Secondly, despite this, we can extend a function f from any subset T to all of X without increasing the
norm:
Lemma 6.2 (Isometric extension). Let T = {x1 , . . . , xn } and let y1 , . . . , yn ∈ R be arbitrary target values.
Then there exists an f ∈ H such that f (xi ) = yi for all i, and moreover kf k2H = kyk2K = y > K −1 y.
Proof of Lemma 6.2. Define f (x) = kx> K −1 y. Since kxi is the ith row of K, kx>i K −1 is the ith standard basis
vector, so f (xi ) = yi .
It remains to show that f has bounded norm. For sets T = {x1 , . . . , xn } and T 0 = {x01 , . . . , x0m }, let
K[T, T 0 ] be the (rectangular) kernel matrix where Kij = k(xi , x0j ). Then f evaluated on T 0 is equal to
K[T, T 0 ]> K[T, T ]−1 y. For notational ease define K11 = K[T, T ], K12 = K[T, T 0 ], etc. Thus
kf k2K[T 0 ] = y > K[T, T ]−1 K[T, T 0 ]K[T 0 , T 0 ]−1 K[T 0 , T ]K[T, T ]−1 y (301)
−1 −1 −1
= y > K11 K12 K22 K21 K11 y. (302)
−1 −1 −1 −1 −1
Recall we wish to show this is at most y > K11 y. It suffices to show that K11  K22 K21 K11  K11 , or
K12
−1 K11 K12
K12 K22 K21  K11 . But this follows from the Schur complement lemma, since is positive
K21 K22
definite.
 >  −1  
−1 y1 K11 K12 y1
Proof of Lemma 6.1. Under the notation above we want to show that y1> K11 y1 ≤ .
y2 K12 K22 y2
K11 + λ1 I
   
0 K11 K12
This follow by noting that for large λ we have  and then inverting and
0 λ2 I K21 K22
taking limits.
Remark 6.3. This route to defining H is different from the typical treatment which instead focuses on
the representer theorem. That theorem starts with the infinite-dimensional space and then shows that the
minimum-norm interpolator for a set T lies within the span of K[T ]. Here we take essentially the reverse
approach, starting with an interpolating function and showing that it can be isometrically embedded in H. I
personally prefer the approach above, as it makes it much cleaner to define both kf kH and H itself.
Note that Lemma 6.2 also shows that kernel ridge regression Pn can be expressed intrinsically, without
explicit reference to the set T : fˆλ = arg minf nλ kf k2H + n1 i=1 (f (xi ) − yi )2 . This is because we can always
find on f on T = {x1 , . . . , xn } such that kf kH = kf kK[T ] .
Finally, since K is invertible, kernel ridge regression has an even simpler form: fˆλ (x) = kx> (λI + K)−1 y.
If we set λ = 0 then we recover the isometric extension f of y in Lemma 6.2.
[Lecture 20]

6.3 Bias-variance decomposition


Next we analyze the statistical error of kernel ridge regression. Let us suppose that for a fixed set T =
{x1 , . . . , xn }, we observe points (xi , yi ), where yi = f ∗ (xi )+i , i ∼ N (0, σ 2 ). Then as in the finite-dimensional
case, we can analyze the bias and variance of an estimated function fˆ. For the ordinary Pnleast squares estimator
fˆ0 , this is trivial, since fˆ0 (xi ) = yi and so the bias is zero while the variance is E[ n1 i=1 (yi − f ∗ (xi ))2 ] = σ 2 ,
independent of n. This makes sense, as fˆ0 interpolates the data and is thus completely sensitive to noise.
Therefore consider instead the ridge estimator fˆλ = K(λI + K)−1 y. We have E [y] = f ∗ while Var [y] =
σ In×n . Then E[fˆλ ] = K(λI + K)−1 f ∗ and so the bias-squared, n1 kE[fˆλ ] − f ∗ k22 , is
2

1 λ2
Bias2 = kK(λI + K)−1 f ∗ − f ∗ k22 = k(λI + K)−1 f ∗ k22 . (303)
n n
A useful but loose upper bound can be obtained via k(λI + K)−1 f ∗ k22 ≤ λ1 (f ∗ )> K −1 f ∗ = λ1 kf ∗ k2H , so the
bias is at most nλ kf k2H assuming f lies in H (we will later relax this assumption, and indeed show that it
must be relaxed to get the optimal rates).

64
Pn
Next, the variance is 1
n i=1 Var [f (xi )] = E [ n1 kK(λI + K)−1 k22 ], or

σ2
Var = tr((λI + K)−1 K 2 (λI + K)−1 ). (304)
n
2
Again, a somewhat loose upper bound is σλ (tr(K)/n).
Thus, assuming both tr(K)/n
p and kf k2H are bounded, if we optimize λ the sum of the bias and variance

can be bounded by √ n
kf kH tr(K)/n. One should generally think of tr(K)/n as being roughly constant

as n grows, so this gives a 1/ n rate of convergence for the squared error. This is slower than the usual
parametric rate of 1/n (for the squared error), but that is perhaps not surprising given we are in infinite
dimensions. However, more unusual is that the exponent on n does not seem to depend at all on the geometry
of H. This is surprising, as we know for instance that it is much easier to estimate smooth functions than
functions that are merely Lipschitz, and in fact the exponent in n is different in both cases (as it turns out,
log(n) 1
n versus n2/3 for functions on [0, 1]).

Prior on f . The issue is with the “right prior” to place on f . From a Bayesian perspective, kernel ridge
2
regression is optimal when β is drawn from a Gaussian prior: β ∼ N (0, σλ I). From a function perspective
2
this corresponds to the Gaussian process prior f ∼ N (0, σλ K). In this case the expected norm-squared
2 2
is E[kβk22 ] = σλ tr(I) = σλ p. This is a problem when p = ∞! A similar problem shows up with f :
2 2
E[kf k2H ] = E[f > K −1 f ] = σλ tr(K −1 K) = σλ n.
We therefore need a better bound on the bias that doesn’t rely on kf k2H , which is unbounded. Returning
to (303), under a Gaussian process prior f ∼ N (0, ρ2 K) we have
n
2 λ2 ∗
−2 ∗ λ 2 ρ2 −2 λρ2 X λµj
E[Bias ] = E[f (λI + K) f ] = tr((λI + K) K) = , (305)
n n n j=1 (λ + µj )2

σ2
Pn µ2j
where µj are the eigenvalues of K. Similarly the variance is n j=1 (λ+µj )2 . Thus the overall error is

n n
λρ2 X λµj σ2 X µ2j
E[Error] = + . (306)
n j=1 (λ + µj )2 n j=1 (λ + µj )2

λµ µ µ2 µ2
We can bound (306) using the inequalities (λ+µjj )2 ≤ min(1, λj ) and (λ+µj j )2 ≤ min(1, λ2j ). Sort the µj in
descending order and let J = max{j | µj > λ}. Then these inequalities yields

(λρ2 + σ 2 )J λρ2 X µj σ 2 X µ2j


E[Error] ≤ + + . (307)
n n λ n λ2
j>J j>J

Interpretation for power law decay. Suppose for concreteness that µj = nj −α ; this power law decay is
common empirically and also shows up in theoretical examples such as learning Lipschitz functions. Then
J = (λ/n)−1/α . Moreover, the sums can be approximated as

1 ∞ −α (λ/n)−1/α
X µj Z
1 J
≈ x dx = J 1−α = = , (308)
λ λ J λ(α − 1) α−1 α−1
j>J
X µ2j Z ∞
1 1 (λ/n)−1/α
2
≈ 2
x−2α dx = 2 J 1−2α = . (309)
λ λ J λ (2α − 1) 2α − 1
j>J

Assuming α > 1, the overall expression (307) is thus bounded by


 (λ/n)1−1/α ρ2 σ 2 (λ/n)−1/α 
E[Error(λ)] ≤ O + . (310)
α−1 n

65
σ2
Now take the optimal λ∗ = ρ2 . Then we obtain the final bound of
 1 1  α−1 1
E[Error(λ∗ )] ≤ O n α −1 σ 2 α ρ2 α . (311)
α−1

Thus for instance when α = 2 we get a 1/ n rate, but as α → ∞ we approach a 1/n rate.
Note that for sub-optimal choices of regularizer, we still get a bound but with a potentially worse
dependence on n. For instance if we took λ∗ ∝ nβ , for β > 0, then our rate would be suboptimal by a
nβ(α−1)/α factor in n due to the over-regularization.

6.4 Generalizing from train to test


So far our error bounds have all been on the original sample points x1 , . . . , xn . But we generally care
about how well we do on new points drawn from p. We will analyze this next, although in this case it is
helpfulPto switch back to the parameter rather than the function view. Recall that there we had the matrix
n
S = n1 i=1 xi x> ∗ > ∗
i , and we will also define S = Ex∼p [xx ]. Then letting L be the population loss, we have
∗ ∗ ∗ ∗ > ∗ ∗
L (β) − L (β ) = (β − β ) S (β − β ).
The ridge regression estimator β̂λ is still equal to n1 ( nλ I + S)−1 X > (Xβ ∗ + ), so E [β̂λ ] = ( nλ I + S)−1 Sβ ∗ .
Define the notation λn = λ/n. The bias is δ > S ∗ δ, where δ = E[β̂λ ] − β ∗ , which works out to
Bias2 = λ2n (β ∗ )> (λn I + S)−1 S ∗ (λn I + S)−1 β ∗ , (312)
which under the Gaussian process prior is
E[Bias2 ] = λ2n ρ2 tr((λn I + S)−1 S ∗ (λn I + S)−1 ). (313)
Now, note that this is almost the expression we would get from running ridge regression directly on the test
distribution–the only difference is that (λn I + S ∗ )−1 is replaced with (λn I + S)−1 in two places.
The variance can be computed similarly as
1
Var = E [k(S ∗ )1/2 (λn I + S)−1 X > k22 ] (314)
n2
σ2
= tr((S ∗ )1/2 (λn I + S)−1 S(λn I + S)−1 (S ∗ )1/2 ). (315)
n
Here, the expression is trickier because we have both S and (λn I + S)−1 . But since S(λn I + S)−1  I, we
can apply the further bound
σ2
Var ≤ tr((S ∗ )1/2 (λn I + S)−1 (S ∗ )1/2 ). (316)
n
Relative to this bound, again the only difference is replacing (λn I + S ∗ )−1 with (λn I + S)−1 . Using these
results, we can obtain the following generalization bound:
Proposition 6.4. Suppose that the true and empirical covariance matrices satisfy S ∗  2S + λn I. Then,
letting µ∗j denote the jth eigenvalues of S ∗ , the generalization error satisfies
X λn µ∗j 2σ 2 X µ∗j
E [L∗ (β̂λ )] − L∗ (β ∗ ) ≤ 4λn ρ2 ∗ + (317)
j
(λn + µj ) 2 n j λn + µ∗j
 2σ 2  X µ∗j 
≤ 4λn ρ2 + min 1, . (318)
n j
λn

Note that relative to the kernel case, µ∗j is n times smaller, since it is an eigenvalue of S rather than K.
σ2
This expression is similar to what we achieved in (307), and when µ∗j = j −α setting λn = nρ2 will yield
1
the same power law rate as before of n . α −1

The key question is for what values of λn the condition S ∗  2S + λn I is satisfied. This requires the
empirical covariance matrix to concentrate to the population covariance. To analyze this, we will make use of
the following result of Koltchinskii and Lounici (2017):

66
Theorem 6.5 (Koltchinskii and Lounici). If the xi are Gaussian with covariance S ∗ , then the empirical
covariance matrix satisfies r
 tr(S ∗ ) tr(S ∗ )kS ∗ k 

kS − Sk ≤ max , (319)
n n
p
with high probability. In particular, if S ∗ is p × p and p ≤ n then we have kS ∗ − Sk ≤ kS ∗ k p/n with high
probability.

To apply Theorem 6.5, assume without loss of generality that S ∗ is diagonalized and consider the case that
the jth eigenvalue µ∗j = j −α . We will split S into two blocks: S≤m (the top left m × m matrix corresponding
to the top m eigenvalues of S ∗ ) and S>m (the complementary
p bottom right block).

For the top left block, we have kS≤m − S≤m k ≤ m/n by Theorem 6.5. The bottom right block is more

) = j>m j −α ≈ α−1 1
m1−α . Assuming α is bounded away from 1 we then
P
complicated, but note that tr(S>m
q
∗ 1−2α
have kS>m − S>m k ≤ m n .
∗ ∗ ∗
Now we will optimize m and aim for the conditions kS≤m − S≤m k ≥ 12 λmin (S≤m ), kS>m − S>m k ≤ λn
∗ ∗
to hold, which are enough to imply S≤m  2S≤m + λI and S>m  2S>m + λI (we also have to worry about
the off-diagonal blocks but will handle this in an exercise). These conditions hold when
p
m/n  m−α , (320)
p
m1−2α /n  λn . (321)
1/(2α+1)
Solving the first yields m  n1/(2α+1) , implying λn  n−2α/(1+2α) = n n .

Now, recall that the optimal λn was λn , where λ∗ is constant in n and hence λn ≈ 1/n. To achieve good
generalization according to the above bound, we need λn to be at least n1/(2α+1) times larger than that. For
instance, when α = 2 this implies λn must be at least n1/5 times larger than the optimal choice, and yields a
corresponding rate of n−2/5 instead of the n−1/2 we had before.
It turns out, however, that this analysis is loose. By applying the bound with many blocks rather than
just two, we can do better. We will address this in an exercise as well.
Finally, the conditions in Theorem 6.5 can be relaxed somewhat, as discussed in Koltchinskii and Lounici
(2017). In particular, we don’t need perfect Gaussianity and can get by with appropriate sub-Gaussian
assumptions.

[Lecture 21]

6.5 Eigenvalues and Mercer’s theorem


It is often helpful to represent the above computations in terms of the eigenbasis of k. Since k is infinite-
dimensional, we need some conditions to ensure that it actually has an eigenbasis. These are given by Mercer’s
theorem:
Theorem 6.6 (Mercer). Suppose that k : X × X is a positive definite kernel, that k is continuous, and that
X is compact and equipped with a finite measure ν that is supported on all of X . Define
Z
Tk f (x) = k(x, s)f (s)dν(s). (322)
X

Then there is an orthonormal basis of eigenfunctions of Tk , e1 , e2 , . . ., with corresponding eigenvalues µ1 , µ2 , . . .,


such that

X
k(x, x0 ) = µm em (x)em (x0 ). (323)
m=1

Note that Rthe eigenfunctions will be different depending on the choice of ν, and orthogonality is with
respect to ν: el (x)em (x)dν(x) = δlm . There are thus three distinct inner products floating around: the
one underlying the norm kf kH , the one underlying the function error kf − yk2 , and the one associated with

67
ν. This can get a bit confusing, but it helps to think of kf kH and kf k2 as the “important” norms, while ν
specifies a non-intrinsic norm that is however useful for computational purposes.
Speaking of these computational
P∞ purposes, let us see what R Mercer’s theorem buys us. Any function f can
now be represented as f (x) = m=1 cm em (x), where cm = X f (x)em (x)dν(x). Under this representation, it
P∞ c2
turns out that kf k2H = m=1 µm m
. This might seem surprising, since the right-hand side depends on ν but
the left-hand side does not. We explain this with the following result:
P∞
Lemma 6.7. Let ν be any finite measure that is supported on all of X . Then kf k2H = m=1 µ1m ( X f (x)em (x)dν(x))2 .
R

Proving Lemma 6.7


Heuristic proof of Lemma 6.7. Consider any finite partition of X into sets X1 , . . . , Xn , with Pn aRrepresenta-
tive xi ∈ Xi for each set. We will take increasingly fine-grained partitions such that i=1 Xi |f (x0 ) −
Pn
f (xi )|dν(x0 ) → 0, so in particular e.g. i=1 f (xi )ν(Xi ) → X f (x)dν(x).
R

Now we define (fν )i = ν(Xi )1/2 f (xi ) and (Kν )ij = ν(Xi )1/2 ν(Xj )1/2 k(xi , xj ). We can check that for
T = {x1 , . . . , xn }, f > K −1 f = fν> Kν−1 fν , independent of the choice of ν. This is the first hint that the choice
of ν doesn’t really matter. Now note that
>
ν(X1 )1/2 em (x1 ) ν(X1 )1/2 em (x1 )
 

X −1
fν> Kν−1 fν = fν> µm  ···  ···  fν (324)
m=1 ν(Xn )1/2 em (xn ) ν(Xn )1/2 em (xn )
 1/2
 1/2
>
∞ ν(X 1 ) e m (x 1 ) ν(X1 ) e m (x 1 )
(i)  X 1  
≈ fν> ···  ···  fν (325)
m=1 m
µ
ν(Xn )1/2 em (xn ) ν(Xn )1/2 em (xn )

X 1 X n
2
= ν(Xi )f (xi )em (xi ) (326)
µ
m=1 m i=1
∞ Z
(ii) X 1
≈ ( f (x)em (x)dν(x))2 . (327)
m=1
µm X

Here (i) is the main “heuristic” step, where we are inverting the matrix on the premise P that the different vectors
1/2 1/2
[ν(X
R i )e m (x i )] are nearly orthogonal, since h[ν(X i ) e l (x i )], [ν(X i ) em (x i )]i = i ν(Xi )el (xi )em (xi ) ≈
X
e l (x)e m (x)dν(x) = 0. The step (ii) also needs to be made rigorous but is essentially just taking a limit.
Now, we byPdefinitionRhave kf k2H ≥ f > K −1 f = fν> Kν−1 fν , which by the above can be made arbitrarily

close to A = m=1 µ1m ( X f (x)em (x)dν(x))2 by taking sufficiently fine partitions of {Xi }. This proves
one direction of the lemma, i.e. that A ≤ kf k2H . To prove the other direction, note that for any set
T = {x̂1 , . . . , x̂n }, we can build our partitions such that they contain the x̂i as a subset of the representatives
xi in the partition. In this case, f [T ]> K[T ]−1 f [T ] ≤ f > K −1 f = fν> Kν−1 fν by Lemma 6.1. Taking the limit
then yields f [T ]> K[T ]−1 f [T ] ≤ A. Since this holds for all T , we thus have kf k2H ≤ A, which proves the other
direction and completes the lemma.

[Lecture 22]

7 Domain Adaptation under Covariate Shift


We now shift focus again, to a type of perturbation called covariate shift. We work in a classification or
regression setting where we wish to predict y from x, and make the assumption that p̃(y | x) and p∗ (y | x)
are the same (the labeling function doesn’t change between train and test):
Assumption 7.1 (Covariate Shift). For a train distribution p̃ and test distribution p∗ , we assume that
p̃(y | x) = p∗ (y | x) for all x.

68
Thus the only thing that changes between train and test is the distribution of the covariates x (hence
the name covariate shift). We furthermore assume that we observe labeled samples (x1 , y1 ), . . . , (xn , yn ) ∼ p̃,
together with unlabeled samples x̄1 , . . . , x̄m ∼ p∗ . In the language of our previous setting, we could say that
D(p̃, p∗ ) = kp̃(y | x) − p∗ (y | x)k∞ ,  = 0, and G = {p | p(x) = p0 (x)} for some distribution p0 (obtained via
the unlabeled samples from p∗ ).
Beyond covariate shift, we will need to make some additional assumption, since if p̃(x) and p∗ (x) have
disjoint supports then the assumption that p̃(y | x) = p∗ (y | x) is meaningless. We will explore two different
assumptions:
1. Either we assume the p̃(x) and p∗ (x) are known and not too different from each other, or
2. We assume that the model family is realizable: p∗ (y | x) = pθ (y | x) for some θ.

This will lead to two different techniques: importance weighting and uncertainty estimation. We will also see
how to construct a “doubly robust” estimator that works as long as at least one of the assumptions holds.

7.1 Importance weighting


First assume that p̃(x) and p∗ (x) are known. (We can generally at least attempt to estimate them from
unlabeled data, although if our model family is misspecified then our estimates might be poor.)
In a traditional setting, to minimize the loss on p̃(x) we would minimize

E(x,y)∼p̃ [`(θ; x, y)], (328)

where ` is the loss function for either classification or regression. We can approximate this via the samples
from p̃ as
n
1X
`(θ; xi , yi ). (329)
n i=1
To handle covariate shift we would like to instead minimize the expectation over p∗ , but unfortunately we
can’t do this because we don’t have any outputs y drawn from p∗ . The key insight that lets us get around
this is the following identity:
h i h p∗ (x) i
E(x,y)∼p∗ `(θ; x, y) = E(x,y)∼p̃ `(θ; x, y) . (330)
p̃(x)

Taking this identity as given for the moment, we can then approximate the expectation over p∗ via samples
from p̃ as follows:
n
1 X p∗ (xi )
`(θ; xi , yi ). (331)
n i=1 p̃(xi )

This quantity is called the propensity-weighted training loss 7 , because each training sample is weighted by
how much more it looks like a sample from p∗ than from p̃.
To prove the identity, we make use of the covariate shift assumption:
h p∗ (x, y) i
E(x,y)∼p∗ [`(θ; x, y)] = E(x,y)∼p̃ `(θ; x, y) (332)
p̃(x, y)
h p∗ (x) p∗ (y | x) i
= E(x,y)∼p̃ `(θ; x, y) (333)
p̃(x) p̃(y | x)
h p∗ (x) i
= E(x,y)∼p̃ `(θ; x, y) , (334)
p̃(x)

where the final equality is by the covariate shift assumption.


7 Also sometimes called the importance-weighted loss.

69
p∗ (x)
Variance of the estimator. Even if p̃(x) can be computed, the importance weighted estimator could

have high variance. This is because the weights pp̃(x
(xi )
i)
could be large or potentially infinite.
For convenience assume that `(θ; x, y) ≤ B for all θ, x, y. We can compute (or rather, bound) the variance
as follows:
p∗ (x)
Varp̃ [ `(θ; x, y)] = Ep̃ [(p∗ (x)/p̃(x))2 `(θ; x, y)2 ] − Ep∗ [`(θ; x, y)]2 (335)
p̃(x)
≤ Ep̃ [(p∗ (x)/p̃(x))2 ]B 2 (336)
∗ 2
= (Dχ2 (p̃kp ) + 1)B , (337)

where Dχ2 is the χ2 -divergence:

(p∗ (x) − p̃(x))2 p∗ (x)2


Z Z
def
Dχ2 (p̃kp∗ ) = dx = dx − 1. (338)
p̃(x) p̃(x)

The variance of the propensity-weighted loss is thus more or less controlled by the χ2 -divergence between
source and target. To gain some intuition for how χ2 behaves, first note that is is always larger than KL
divergence (in the reverse direction, though I’m not sure the order of arguments is canonical):

p∗ (x)
Z
Dkl (p∗ kp̃) = p∗ (x) log dx (339)
p̃(x)
p∗ (x) − p̃(x)
Z
≤ p∗ (x) dx (340)
p̃(x)
Z ∗ 2
p (x)
= dx − 1 = Dχ2 (p̃kp∗ ). (341)
p̃(x)

Additionally, the χ2 -divergence between two Gaussians is exponential in the difference between their means.
To see this, let Z denote the normalization constant of an isotropic Gaussian, and write
Z
0 1 1
Dχ2 (N (µ, I), N (µ , I)) = −1 + exp( kx − µ0 k22 − kx − µk22 )dx (342)
Z 2
Z
1 1
= −1 + exp( (−kxk22 − (2µ0 − 4µ)> x + kµ0 k22 − 2kµk22 )) (343)
Z 2
Z
1 1
= −1 + exp( (−kx + (µ0 − 2µ)k22 + kµ0 − 2µk22 + kµ0 k22 − 2kµk22 )) (344)
Z 2
= −1 + exp(kµ k2 + kµk22 − 2µ> µ0 ) = −1 + exp(kµ − µ0 k22 ).
0 2
(345)

This is bad news for propensity weighting, since the weights blow up exponentially as the distributions move
apart.

Connection to causal inference. Propensity weighting can also be used in the context of causal inference.
Here we have a patient with covariates X, with treatment condition T (usually T ∈ {0, 1}), and outcome Y .
Our goal is to estimate the treatment effect, which, roughly speaking, is E[Y | T = 1] − E[Y | T = 0] (this
is wrong as stated and will be remedied below). We will see below how to do this by letting p∗0 and p∗1 be
the distributions where T = 0 and T = 1, respectively. However, first we need to set up the problem more
carefully.
To set the problem up more carefully, we use the potential outcomes framework. In this framework there
are actually two variables, Y (0) and Y (1), which are what the outcome would have been if we had set T = 0
or T = 1, respectively. This is potentially different from the distribution of the outcome conditional on T ,
since there could be factors that correlate T with Y (for instance, if T is smoking and Y is lung cancer, there
could be some gene that causes one to both be more likely to smoke and more likely to get lung cancer
that accounts for the strong empirical correlation between T and Y ; this was an actual objection raised by
Fisher!).

70
Of course, there are plenty of factors that create correlation between T and Y in an observational setting,
for instance sicker patients are more likely to be treated aggressively. We are okay with this as long as these
factors are observed as part of the covariates X. This leads us to the unconfoundedness assumption:
Assumption 7.2 (Unconfoundedness). The distribution (X, T, Y (0), Y (1)) is said to be unconfounded if
Y (0), Y (1) ⊥
⊥ T | X. In other words, treatment and outcome should be independent conditional on the
covariates X.
The main challenge in the potential outcomes framework is that we only observe (X, T, Y (T )). In other
words, we only observe the outcome for the treatment T that was actually applied, which makes it difficult to
estimate E[Y (1)] or E[Y (0)]. We will deal with this by treating estimating E[Y (1)] as a domain adaptation
problem, and using propensity weighting. First note that, by unconfoundedness, we have

Ep̃ [Y (1)] = EX∼p̃ [Ep̃ [Y (1) | X]] (346)


= EX∼p̃ [Ep̃ [Y (1) | X, T = 1]] (347)
= Ep∗1 [Y (T )], (348)

where we define p∗1 such that p∗1 (x, t, y) = p̃(x)I[t = 1]p̃(y | x, t = 1); this has the same distribution over x as
p̃, but the treatment t = 1 is always applied. Since p̃(y | x, t) = p∗ (y | x, t) almost surely, the covariate shift
assumption holds. We can thus estimate the expectation under p∗1 via propensity weighting:
h p∗ (X, T ) i
Ep∗1 [Y (T )] = Ep̃ 1 Y (T ) (349)
p̃(X, T )
h p∗ (T | X) i
= Ep̃ 1 Y (T ) (350)
p̃(T | X)
h I[T = 1] i
= Ep̃ Y (T ) . (351)
p̃(T | X)
A similar calculation holds for computing Ep̃ [Y (0)], for the distribution p∗0 (x, t, y) = p̃(x)I[t = 0]p̃(y | x, t = 0).
Together, we have that
h I[T = 1] I[T = 0]  i
Ep̃ [Y (1) − Y (0)] = Ep̃ − Y (T ) . (352)
p̃(T | X) p̃(T | X)
Since the right-hand-side is in terms of Y (T ), it only involves observable quantities, and can be estimated
from samples as long as p̃(T | X) is known. This estimator is called inverse propensity weighting because it
involves dividing by the propensity weights p̃(T | X).
In the next section, we will explore an improvement on inverse propensity weighting called a doubly-robust
estimator.

[Lecture 23]

7.2 Doubly-Robust Estimators


Recall that in the previous section we defined the inverse propensity weighted estimator
h I[T = 1] I[T = 0]  i
Ep̃ [Y (1) − Y (0)] = Ep̃ − Y (T ) . (353)
p̃(T | X) p̃(T | X)
To actually estimate the left-hand-side, we take the empirical average over n samples.
There are a couple of downsides of this estimator. One is that the variance of this estimator can be large.
Specifically, we can compute it as
1 h 1 1 i 
Ep̃ Y (1)2 + Y (0)2 − Ep̃ [Y (1) − Y (0)]2 . (354)
n p̃(T = 1 | X) p̃(T = 0 | X)
If the propensity weights are near zero then the variance explodes (similarly to the issue with χ2 -divergence
that we saw earlier).

71
Another issue is that estimating the propensity weights themselves is non-trivial, and if we use the wrong
propensity weights, then the estimate could be arbitrarily wrong.
We will explore an idea that partially mitigates both issues; it reduces the variance when the propensity
weights are correct (although doesn’t avoid the exploding variance issue), and in some cases it produces a
correct estimate even if the propensity weights are wrong.
The basic idea is as follows: suppose that we have some prediction Ȳ (1, X), Ȳ (0, X) of what will happen
under T = 1, T = 0 conditioned on X. Since these predictions only require knowing X and not T , an
alternate estimate of the treatment effect can be obtained by adding and subtracting Ȳ :
Ep̃ [Y (1) − Y (0)] = Ep̃ [Ȳ (1, X) − Ȳ (0, X)] + Ep̃ [(Y (1) − Ȳ (1, X)) − (Y (0) − Ȳ (0, X))] (355)
h I[T = 1] I[T = 0]  i
= Ep̃ [Ȳ (1, X) − Ȳ (0, X)] + Ep̃ − (Y (T ) − Ȳ (T, X)) . (356)
p̃(T | X) p̃(T | X)
In other words, we first use our prediction Ȳ to form a guess of the average treatment effect, then use inverse
propensity weighting to correct the guess so as to obtain an unbiased estimate. This can yield substantial
improvements when Y (T ) − Ȳ (T, X) is much smaller in magnitude than Y (T ). For instance, a patient’s
cholesterol after taking a cholesterol-reducing drug is still highly-correlated with their initial cholesterol, so in
that case we can take Ȳ (T, X) to be the pre-treatment cholesterol level. Even though this is independent of
T it can substantially reduce the variance of the estimate! (We will formally bound the variance below.)

Bias of the estimate. Call the prediction Ȳ unbiased if E[Y | X, T ] = Ȳ (T, X). The first key property of
(356) is that it is unbiased as long as either Ȳ is unbiased, or the propensity weights are correct. Indeed,
if the prediction is unbiased then the first term is the average treatment effect while the second term is
zero. Conversely, if the propensity weights are correct then the second term exactly estimates the difference
between the predicted and true treatment effect. Correspondingly, (356) is called a doubly-robust estimator.
We can actually say more about the bias. Suppose that instead of the true propensity weights, we have an
incorrect guess p̂(t | x). Then the bias of the estimate is the difference between Ep̃ [Y (1) − Y (0)] and (356),
which is
h I[T = 1] I[T = 0]  i
Ep̃ [Y (1) − Y (0)] − E[Ȳ (1, X) − Ȳ (0, X)] − Ep̃ − (Y (T ) − Ȳ (T, X)) (357)
p̂(T | X) p̂(T | X)
h  I[T = 1]   I[T = 0] i
= Ep̃ (Y (1) − Ȳ (1, X)) 1 − + (Y (0) − Ȳ (0, X)) 1 − . (358)
p̂(t = 1 | X) p̂(t = 0 | X)
Focusing on the first term, and using the independence of T and Y conditioned on X, we have
h  I[T = 1] i
Ep̃ (Y (1) − Ȳ (1, X)) 1 − (359)
p̂(t = 1 | X)
h  p̃(t = 1 | X)  i
= Ep̃ (E[Y (1) | X] − Ȳ (1, X)) 1 − |X (360)
p̂(t = 1 | X)
h p̃(t = 1 | X) 2 i1/2
≤ Ep̃ [(E[Y (1) | X] − Ȳ (1, X))2 ]1/2 Ep̃ 1 − , (361)
p̂(t = 1 | X)
meaning that the bias of the estimator is the product of the biases of Ȳ and p̂ (measured as the expected
squared errors in (361)).

Variance of the estimate. We can obtain a somewhat similar relation for the variance. Usually the
variance of Ȳ (1, X) − Ȳ (0, X) is small compared to the propensity-weighted term, so again focusing on the
T = 1 case we have
h I[T = 1] i h p̃(t = 1 | X) i
Var (Y (1) − Ȳ (1, X)) ≤ E EY [(Y (1) − Ȳ (1, X))2 | X] . (362)
p̂(t = 1 | X) p̂(t = 1 | X)2
The variance is substantially reduced when Y (1) is close to Ȳ (1, X). We cannot always hope for this, e.g. if
Y has a large amount of intrinsic variance even conditioned on X. But in many cases even trivial Ȳ can
predict most of the variance in Y —for instance, for any chronic disease the patient’s post-treatment status is
well-predicted by their pre-treatment status. And the value of a stock tomorrow is well-predicted by its value
today.

72
Semi-parametric estimation. It may seem difficult to estimate Ȳ (·, X) and p̃(t = 1 | X), since any
parametric model could be mis-specified and lead to biased estimates. One idea is to estimate these both
non-parametrically, and then apply the doubly-robust estimator above. This is an instance of semi-parametric
estimation, because while we estimate Ȳ and p̃(t | X) non-parametrically, the doubly-robust estimator itself
is parametric (i.e a simple sample estimate of the mean), and in some cases we obtain non-parametric rates.
This is explored in detail in Nie and Wager (2017) for estimating conditional average treatment effects; we
describe the basic idea here. Since the squared error in an estimate is Bias2 + Variance/n, the bias will
dominate the error in the doubly-robust estimator as long as the variance doesn’t explode (of course, the
variance can often explode if the propensity weights are too close to 0 or 1, and the following idea won’t help
in that case).
We saw above that the bias of the doubly-robust estimator is the product of the biases in Ȳ and p̂,
which are both given as expected squared errors between the true and estimated value. In non-parametric
estimation, we typically get convergence rates of O(n−α ) for some α < 1/2 (note that α = 1/2 is what we
typically get for parametric estimation). The parameter α typically depends on the dimension of the problem
and the smoothness of the function class we wish to estimate. Since the doubly-robust bias is the product
of the biases, we end up with a bias of O(n−2α ) as long as Ȳ and p̂ each converge at a n−α rate. As long
as α > 1/4, this√yields a parametric rate (the variance term will then asymptotically dominate as it only
converges at 1/ n).

References
Radoslaw Adamczak and Pawel Wolff. Concentration inequalities for non-lipschitz functions with bounded derivatives
of higher order. Probability Theory and Related Fields, 162(3-4):531–586, 2015.

Noga Alon and Assaf Naor. Approximating the cut-norm via grothendieck’s inequality. In Proceedings of the thirty-sixth
annual ACM symposium on Theory of computing, pages 72–80. ACM, 2004.

Pranjal Awasthi and Or Sheffet. Improved spectral-norm bounds for clustering. In Approximation, Randomization,
and Combinatorial Optimization. Algorithms and Techniques, pages 37–49. Springer, 2012.

Dominique Bakry and Michel Émery. Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84, pages
177–206. Springer, 1985.

Jean-Baptiste Bardet, Nathaël Gozlan, Florent Malrieu, Pierre-André Zitt, et al. Functional inequalities for gaussian
convolutions of compactly supported measures: explicit bounds and dimension dependence. Bernoulli, 24(1):
333–353, 2018.

Stéphane Boucheron, Gábor Lugosi, and Olivier Bousquet. Concentration inequalities. In Summer School on Machine
Learning, pages 208–240. Springer, 2003.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of
independence. Oxford university press, 2013.

David L Donoho and Richard C Liu. The” automatic” robustness of minimum distance functionals. The Annals of
Statistics, 16(2):552–586, 1988.

Vladimir Koltchinskii and Karim Lounici. Concentration inequalities and moment bounds for sample covariance
operators. Bernoulli, 23(1):110–133, 2017.

Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In 2010 IEEE 51st
Annual Symposium on Foundations of Computer Science, pages 299–308. IEEE, 2010.

Rafal Latala. Estimates of moments and tails of gaussian chaoses. The Annals of Probability, 34(6):2315–2331, 2006.

Michel Ledoux. The concentration of measure phenomenon. Number 89. American Mathematical Soc., 2001.

James G MacKinnon and Halbert White. Some heteroskedasticity-consistent covariance matrix estimators with
improved finite sample properties. Journal of econometrics, 29(3):305–325, 1985.

Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint
arXiv:1712.04912, 2017.

73
Omar Rivasplata. Subgaussian random variables: An expository note. 2012.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027,
2010.

A Properties of Statistical Discrepancies


A.1 Total variation distance
A.2 Wasserstein distance

B Concentration Inequalities
B.1 Proof of Chebyshev’s inequality (Lemma 2.1)
Let I[E] denote the indicator that E occurs. Then we have

|EX∼p [X | E] − µ| = |EX∼p [(X − µ)I[E]]|/P[E] (363)


q
≤ EX∼p [(X − µ)2 ] · EX∼p [I[E]2 ]/P[E] (364)
p p
≤ σ 2 · P[E]/P[E] = σ/ P[E]. (365)
√ √ p
In particular, if we let E0 be the event that X ≥ µ + σ/ δ, we get that σ/ δ ≤ σ/ P[E0 ], and hence
P[E0 ] ≤ δ, which proves the first part of the lemma. √ p
For the second part, if P[E] ≤ 12 then (365) already implies the desired result since σ/ δ ≤ σ 2(1 − δ)/δ
when δ ≤ 12 . If P[E] ≥ 21 , then consider the same argument applied to ¬E (the event that E does not occur).
We get

1 − P[E]
|EX∼p [X | E] − µ| = |EX∼p [X | ¬E] − µ| (366)
P[E]
1 − P[E] p
≤ · σ/ 1 − P[E]. (367)
P[E]
√ p
Again the result follows since σ 1 − δ/δ ≤ σ 2(1 − δ)/δ when δ ≥ 12 .

B.2 Proof of d-dimensional Chebyshev’s inequality (Lemma 2.8)

C Proof of Lemma 2.14


Since (ρ, )-resilience is equivalent to ( 1− 1−
 ρ, 1 − )-resilience, it suffices to show that (1 − ,  ρ)-resilience
is equivalent to (18). Suppose that E is an event with probability , and let v be such that kvk∗ = 1 and
hE[X − µ | E], vi = kE[X − µ | E]k. Then we have

kE[X − µ | E]k = hE[X − µ | E], vi (368)


= hE[hX − µ, vi | E] (369)
(i)
≤ E[hX − µ, vi | hX − µ, vi ≥ τ (v)] (370)
(18) 1−
≤ ρ. (371)

Here (i) is because hX − µ, vi is at least as large for the -quantile as for any other event E of probability .
This shows that (18) implies (1 − , 1− ρ)-resilience. For the other direction, given any v let Ev denote the

74
event that hX − µ, vi ≥ τ (v). Then Ev has probability  and hence

E[hX − µ, vi | hX − µ, vi ≥ τ (v)] = E[hX − µ, vi | Ev ] (372)


= hE[X − µ | Ev ], vi (373)
(ii)
≤ kE[X − µ | Ev ]k (374)
(iii) 1−
≤ ρ, (375)

where (ii) is Hölder’s inequality and (iii) invokes resilience. Therefore, resilience implies (18), so the two
properties are equivalent, as claimed.

D Proof of Lemma 2.15


Let E+ be the event that hxi −µ, vi is positive, and E− the event that it is non-negative. Then P[E+ ]+P[E− ] =
1, so at least one of E+ and E− has probablity at least 12 . Without loss of generality assume it is E+ . Then
we have

Ex∼p [|hx − µ, vi|] = 2Ex∼p [max(hx − µ, vi, 0)] (376)


= 2P[E+ ]Ex∼p [hx − µ, vi | E+ ] (377)
≤ 2P[E+ ]kEx∼p [x − µ | E+ ]k ≤ 2ρ, (378)

where the last step invokes resilience applies to E+ together with P[E+ ] ≤ 1. Conversely, if p has bounded
1st moments then

E[hX − µ, vi | hX − µ, vi ≥ τ1/2 (v)] ≤ E[|hX − µ, vi|]/P[hX − µ, vi ≥ τ1/2 (v)] (379)


= 2E[|hX − µ, vi|] ≤ 2ρ, (380)

so p is (2ρ, 21 )-resilient by Lemma 2.14.

75

You might also like