Robust Nonparametric Methods 2
Jacob Steinhardt
Last updated: April 1, 2021
[Lecture 1]
Overarching framework. Most robustness questions can be cast in the following way: We let p∗ denote
the true test distribution we wish to estimate, and assume that training data X1 , . . . , Xn is sampled i.i.d. from
some distribution p̃ such that D(p̃, p∗) ≤ ε according to some discrepancy D. We also assume that p∗ ∈ G,
which encodes the distributional assumptions we make (e.g. that p∗ has bounded moments or tails, which
is typically necessary for robust estimation to be possible). We benchmark an estimator θ̂(X1 , . . . , Xn )
according to some cost L(p∗ , θ̂) (the test error). The diagram in Figure 1 illustrates this.
This framework captures both of the examples discussed at the beginning. However, it will be profitable to
think about each case separately due to different emphases:
Figure 1: The robustness framework: the true distribution p∗ lies in G (the distributional assumptions), the train distribution p̃ satisfies D(p∗, p̃) ≤ ε, and the samples X1, . . . , Xn are drawn from p̃.
1. For corrupted training data, we think of p̃ as being corrupted and p∗ as being nice.
2. For distributional shift, we think of p̃ and p∗ as both being nice (but different).
Additionally, since both p̃ and p∗ are nice for distributional shift, we should have greater ambitions and seek
to handle larger differences between train and test than in the corruption cases.
Training robustness. Designing robust estimators for training corruptions usually involves reasoning
about what the real data “might have” looked like. This could involve operations such as removing outliers,
smoothing points away from extremes, etc. Unfortunately, many intuitive algorithms in low dimensions
achieve essentially trivial bounds in high dimensions. We will show how to achieve more meaningful bounds,
focusing on three aspects:
1. good dependence of the error on the dimension,
2. good finite-sample bounds,
3. computational tractability.
Each of these aspects turns out to require new machinery and we will devote roughly equal space to each.
Distributional shift. For distributional shift, we often seek invariant features or structure that can transfer
information from the train to test distributions. We can also counteract distribution shift by training on more
diverse data. Finally, we can use model uncertainty to infer out-of-distribution error, but these inferences can
be fraught if the model is mis-specified.
Nonparametric modeling. This brings us to nonparametric methods. The simplest way to be nonparametric is through the model: using a rich class such as smooth functions, or neural networks. This
mitigates model mis-specification but raises new statistical challenges. Familiar phenomena such as the
central limit theorem cease to hold, and error rates are instead governed by the eigenvalue spectrum of an
infinite-dimensional kernel. Aside from the model, we can also be nonparametric at inference time; an example
is the bootstrap, which constructs confidence intervals that are more robust to model mis-specification than
classical parametric tests.
Figure 2: Possible corruptions to be robust to. Left: data contains outliers. Middle: outputs are perturbed
(process noise); Right: inputs are perturbed (measurement error).
Summary. We will cover each of training time robustness, distribution shift, and nonparametric modeling,
drawing connections as we go. The first third will focus on training time, also building up the statistical
machinery needed to prove generalization bounds. Then, we will shift focus to model mis-specification, and to
nonparametric methods as a remedy. These include kernel regression, the bootstrap, and neural networks; we will see both how they mitigate model mis-specification and how to analyze their generalization
performance. Finally, we will turn to distribution shift, and see that nonparametric models are often robust
to distribution shift when trained on the right data. Training robustness will receive a largely theoretical
treatment, while for nonparametrics and distribution shift we will see a mix of theoretical and empirical
results.
If p and q both have densities then an equivalent characterization is TV(p, q) = (1/2) ∫ |p(x) − q(x)| dx.
In the middle and right panels of Figure 2, either the inputs or outputs have been moved slightly. Both
operations can be modeled using Wasserstein distances (also called earthmover distances), which we will
discuss later. For now, however, we will focus on the case of handling outliers. Thus for the next several
sections our discrepancy will be the total variation distance D = TV.
Are the red points outliers? Or part of the real data? Depending on the conclusion, the estimated mean
could vary substantially. Without further assumptions on the data-generating distribution p∗ , we cannot rule
out either case. This brings us to an important principle:
With no assumptions on the distribution p∗ , robust estimation is impossible.
In particular, we must make assumptions that are strong enough to reject sufficiently extreme points as
outliers, or else even a small fraction of such points can dominate the estimate of the mean. For simplicity,
here and in the next several sections we will assume that we directly observe the training distribution p̃ rather
than samples x1:n from p̃. This allows us to avoid analyzing finite-sample concentration, which requires
introducing additional technical tools that we will turn to in Section 2.5.
Assumption: bounded variance. One possible assumption is that p∗ has bounded variance: E_{x∼p∗}[(x − µ)²] ≤ σ² for some parameter σ. We take G = Gcov(σ) to be the set of distributions satisfying this constraint.
Under this assumption, we can estimate µ to within error O(σ√ε) under TV-perturbations of size ε.
Indeed, consider the following procedure:
Algorithm 1 TrimmedMean
1: Remove the upper and lower (2ε)-quantiles from p̃ (so 4ε mass is removed in total).
2: Let p̃2 denote the new distribution after re-normalizing, and return the mean of p̃2 .
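As a sanity check, here is a finite-sample sketch of Algorithm 1; the function name and the toy data are ours, and on samples we drop a 2ε fraction of points from each side rather than trimming quantiles of a density.

```python
import numpy as np

def trimmed_mean(xs, eps):
    """Finite-sample sketch of Algorithm 1: sort the points, drop the
    top and bottom 2*eps fraction, and average what remains."""
    xs = np.sort(np.asarray(xs, dtype=float))
    k = int(np.floor(2 * eps * len(xs)))  # points dropped from each side
    return xs[k:len(xs) - k].mean()

# 96 clean points with mean 0.5, plus a 4% fraction of gross outliers.
data = [0.0] * 48 + [1.0] * 48 + [1000.0] * 4
print(np.mean(data))             # 40.48, badly skewed by the outliers
print(trimmed_mean(data, 0.05))  # 0.525, close to the clean mean 0.5
```

Even a 4% fraction of corrupted points ruins the plain mean, while the trimmed mean stays within O(σ√ε) of the clean mean, as Proposition 2.2 below guarantees.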
To analyze Algorithm 1, we will make use of a strengthened version of Chebyshev’s inequality, which we
recall here (see Section B.1 for a proof):
Lemma 2.1 (Chebyshev inequality). Suppose that p has mean µ and variance σ². Then, P_{X∼p}[X ≥ µ + σ/√δ] ≤ δ. Moreover, if E is any event with probability at least δ, then |E_{X∼p}[X | E] − µ| ≤ σ√(2(1 − δ)/δ).
The first part, which is the standard Chebyshev inequality, says that it is unlikely for a point to be more
than a few standard deviations away from µ. The second part says that any large population of points must
have a mean close to µ. This second property, which is called resilience, is central to robust estimation, and
will be studied in more detail in Section 2.4.
With Lemma 2.1 in hand, we can prove the following fact about Algorithm 1:
Proposition 2.2. Assume that TV(p̃, p∗) ≤ ε ≤ 1/8. Then the output µ̂ of Algorithm 1 satisfies |µ̂ − µ| ≤ 9σ√ε.
Proof. If TV(p̃, p∗) ≤ ε, then we can get from p∗ to p̃ by adding an ε-fraction of points (outliers) and deleting an ε-fraction of the original points.
First note that all outliers that exceed the ε-quantile of p∗ are removed by Algorithm 1. Therefore, all non-removed outliers lie within σ/√ε of the mean µ by Chebyshev's inequality.
Second, we and the adversary together remove at most a 5ε-fraction of the mass in p∗. Applying Lemma 2.1 with δ = 1 − 5ε, the mean of the remaining good points lies within σ√(10ε/(1 − 5ε)) of µ.
Now let ε′ be the fraction of remaining points which are bad, and note that ε′ ≤ ε/(1 − 4ε). The mean of all the remaining points differs from µ by at most ε′ · σ√(1/ε) + (1 − ε′) · σ√(10ε/(1 − 5ε)) ≤ σ√ε/(1 − 4ε) + σ√(10ε/(1 − 5ε)). This is in turn at most 9σ√ε assuming that ε ≤ 1/8.
Figure 3: The outliers can lie at distance √d without being detected, skewing the mean by ε√d.
Optimality. The O(σ√ε) dependence is optimal, because the adversary can themselves apply the same trimming procedure we do, and in general this will shift the mean of a bounded covariance distribution by O(σ√ε) while keeping the covariance bounded.
Alternate assumptions. The key fact driving the proof of Proposition 2.2 is that any (1 − ε)-fraction of the good points has mean at most O(σ√ε) away from the true mean due to Chebyshev's inequality (Lemma 2.1), which makes use of the bound σ² on the variance. Any other bound on the deviation from the mean would yield an analogous result. For instance, if p∗ has bounded kth moment, then the O(σ√ε) in Lemma 2.1 can be improved to O(σk ε^(1−1/k)), where (σk)^k is a bound on the kth moment; in this case Algorithm 1 will estimate µ with a correspondingly improved error of O(σk ε^(1−1/k)).
[Lecture 2]
2.3 Minimum Distance Functionals
In the previous section we saw that simple approaches to handling outliers in high-dimensional data, such as the trimmed mean, incur a √d error. We will avoid this error using minimum distance functionals, an idea
which seems to have first appeared in Donoho and Liu (1988).
Definition 2.3 (Minimum distance functional). For a family G and discrepancy D, the minimum distance functional is
θ̂(p̃) = θ∗(q) = arg min_θ L(q, θ), where q = arg min_{q∈G} D(q, p̃). (2)
In other words, θ̂ is the parameters obtained by first projecting p̃ onto G under D, and then outputting the
optimal parameters for the resulting distribution.
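As an illustration, here is a toy numerical sketch of the minimum distance functional with D = TV and G a family of unit-variance Gaussians, after discretizing onto a histogram; the function name, the bin edges, and the candidate grid are our illustrative choices, not part of the definition.

```python
import numpy as np

def mdf_mean(samples, mus, bins):
    """Toy minimum distance functional: project the empirical histogram
    onto the discretized family {N(mu, 1) : mu in mus} under TV, and
    return the mean of the projected distribution."""
    counts, edges = np.histogram(samples, bins=bins)
    emp = counts / counts.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_mu, best_tv = None, float("inf")
    for mu in mus:
        model = np.exp(-0.5 * (centers - mu) ** 2)
        model = model / model.sum()           # N(mu, 1) on the grid
        tv = 0.5 * np.abs(emp - model).sum()  # TV on the discretization
        if tv < best_tv:
            best_mu, best_tv = mu, tv
    return best_mu

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(2.0, 1.0, 2000), np.full(200, 9.0)])
mu_hat = mdf_mean(samples, mus=np.linspace(0, 4, 81),
                  bins=np.linspace(-4, 10, 57))
```

Despite roughly 9% of the mass sitting at x = 9, the projection step picks a Gaussian with mean close to the true value 2, whereas the raw sample mean is noticeably pulled upward.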
An attractive property of the minimum-distance functional is that it does not depend on the perturbation
level ε. More importantly, it satisfies the following cost bound in terms of the modulus of continuity of G:
Proposition 2.4. Suppose D is a pseudometric. Then the cost L(p∗, θ̂(p̃)) of the minimum distance functional is at most the maximum loss between any pair of distributions in G of distance at most 2ε:
L(p∗, θ̂(p̃)) ≤ m(G, 2ε, D, L), where m(G, 2ε, D, L) = sup { L(p, θ∗(q)) : p, q ∈ G, D(p, q) ≤ 2ε }. (3)
The quantity m is called the modulus of continuity because, if we think of L(p, θ∗ (q)) as a discrepancy
between distributions, then m is the constant of continuity between L and D when restricted to pairs of
nearby distributions in G.
Specialize again to the case D = TV and L(p∗, θ) = ‖θ − µ(p∗)‖2 (here we allow p∗ to be a distribution over Rd rather than just R). Then the modulus is sup_{p,q∈G : TV(p,q)≤2ε} ‖µ(p) − µ(q)‖2. As a concrete example, let G be the family of Gaussian distributions with unknown mean µ and identity covariance. For this family, the TV distance is essentially linear in the difference in means:
Lemma 2.5. Let N(µ, I) denote a Gaussian distribution with mean µ and identity covariance. Then
min(u/2, 1)/√(2π) ≤ TV(N(µ, I), N(µ′, I)) ≤ min(u/√(2π), 1), (4)
where u = ‖µ − µ′‖2.
Proof. By rotational and translational symmetry, it suffices to consider the case of one-dimensional Gaussians N(−u/2, 1) and N(u/2, 1). Then we have that
TV(N(−u/2, 1), N(u/2, 1)) = (1/(2√(2π))) ∫_{−∞}^{∞} |e^{−(t+u/2)²/2} − e^{−(t−u/2)²/2}| dt (5)
 (i)= (1/√(2π)) ∫_{−u/2}^{u/2} e^{−t²/2} dt. (6)
(The equality (i) is a couple lines of algebra, but is easiest to see by drawing a graph of the two Gaussians and cancelling out most of the probability mass.)
For the lower bound, note that e^{−t²/2} ≥ 1/2 if |t| ≤ 1.
For the upper bound, similarly note that e^{−t²/2} ≤ 1 for all t ∈ R, and also that the entire integral must be at most 1 because it is the integral of the probability density of a Gaussian.
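The integral in equation (6) is the standard normal mass of [−u/2, u/2], i.e. erf(u/(2√2)), which gives a quick numerical check of the sandwich in Lemma 2.5 (the helper name is ours):

```python
import math

def tv_gauss(u):
    """TV(N(-u/2, 1), N(u/2, 1)) via equation (6): the standard normal
    mass of [-u/2, u/2], which equals erf(u / (2 * sqrt(2)))."""
    return math.erf(u / (2 * math.sqrt(2)))

for u in [0.1, 0.5, 1.0, 2.0, 5.0]:
    lo = min(u / 2, 1) / math.sqrt(2 * math.pi)
    hi = min(u / math.sqrt(2 * math.pi), 1)
    assert lo <= tv_gauss(u) <= hi  # the sandwich of Lemma 2.5
```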
Lemma 2.5 allows us to compute the modulus for Gaussians:
Corollary 2.6. Let Ggauss be the family of isotropic Gaussians, D = TV, and L the difference in means as above. Then m(Ggauss, ε, D, L) ≤ 2√(2π) ε whenever ε ≤ 1/(2√(2π)).
More general families. Taking G to be Gaussians is restrictive, as it assumes that p∗ has a specific
parametric form—counter to our goal of being robust! However, the modulus m is bounded for much more
general families. As one example, we can take the distributions with bounded covariance (compare to
Proposition 2.2):
Lemma 2.7. Let Gcov(σ) be the family of distributions whose covariance matrix Σ satisfies Σ ⪯ σ²I. Then m(Gcov(σ), ε) = O(σ√ε).
Proof. Let p, q ∈ Gcov(σ) be such that TV(p, q) ≤ ε. This means that we can get from p to q by first deleting ε mass from p and then adding new points to end up at q. Put another way, there is a distribution r that can be reached from both p and q by deleting mass (and then renormalizing). In fact, this distribution is exactly
r = min(p, q)/(1 − TV(p, q)). (7)
Since r can be obtained from both p and q by deletions, we can make use of the following multi-dimensional analogue of Chebyshev's inequality (Lemma 2.1):
Lemma 2.8 (Chebyshev in Rd). Suppose that p has mean µ and covariance Σ, where Σ ⪯ σ²I. Then, if E is any event with probability at least δ, we have ‖E_{X∼p}[X | E] − µ‖2 ≤ σ√(2(1 − δ)/δ).
As a consequence, we have ‖µ(r) − µ(p)‖2 ≤ σ√(2ε/(1 − ε)) and ‖µ(r) − µ(q)‖2 ≤ σ√(2ε/(1 − ε)) (since r can be obtained from either p or q by conditioning on an event of probability 1 − ε). By the triangle inequality and assuming ε ≤ 1/2, we have ‖µ(p) − µ(q)‖2 ≤ 4σ√ε, as claimed.
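The midpoint construction in equation (7) is easy to check on a pair of discrete distributions (the specific numbers below are arbitrary choices of ours):

```python
import numpy as np

# Two distributions on {0, ..., 4} with TV(p, q) = 0.2.
p = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
q = np.array([0.2, 0.3, 0.2, 0.1, 0.2])
tv = 0.5 * np.abs(p - q).sum()            # = 0.2

# Midpoint distribution r from equation (7).
r = np.minimum(p, q) / (1 - tv)

assert np.isclose(r.sum(), 1.0)           # r is a probability distribution
assert np.all(r <= p / (1 - tv) + 1e-12)  # r is a deletion of p ...
assert np.all(r <= q / (1 - tv) + 1e-12)  # ... and also a deletion of q
```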
As a consequence, the minimum distance functional robustly estimates the mean of bounded covariance distributions with error O(σ√ε), generalizing the 1-dimensional bound obtained by the trimmed mean.
In Lemma 2.7, the two key properties we needed were:
• The midpoint property of TV distance (i.e., that there exists an r that is a deletion of both p and q).
• The bounded tails guaranteed by Chebyshev’s inequality.
If we replace bounded covariance distributions with any other family that has tails bounded in a similar way, then the minimum distance functional will similarly yield good bounds. A general family of distributions satisfying this property is the family of resilient distributions, which we turn to next.
2.4 Resilience
Here we generalize Lemma 2.7 to prove that the modulus of continuity m is bounded for a general family of distributions containing Gaussians, sub-Gaussians, bounded covariance distributions, and many others. The main observation is that in the proof of Lemma 2.7, all we needed was that the tails of distributions in G were bounded, in the sense that deleting an ε-fraction of the points could not substantially change the mean.
This motivates the following definition:
Definition 2.9. A distribution p over Rd is said to be (ρ, ε)-resilient (with respect to some norm ‖ · ‖) if
‖E_{X∼p}[X | E] − E_{X∼p}[X]‖ ≤ ρ for all events E with p(E) ≥ 1 − ε. (8)
We let GTV(ρ, ε) denote the family of (ρ, ε)-resilient distributions.
We observe that Gcov(σ) ⊂ GTV(σ√(2ε/(1 − ε)), ε) for all ε by Lemma 2.8; in other words, bounded covariance distributions are resilient. We can also show that Ggauss ⊂ GTV(2ε√(log(1/ε)), ε), so Gaussians are resilient as well.
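In one dimension the worst event in Definition 2.9 deletes a tail, so the resilience of a sample can be computed directly; the function name and the floor-based fraction below are our choices.

```python
import numpy as np

def resilience_1d(xs, eps):
    """Worst-case shift of the mean over events that keep a (1 - eps)
    fraction of a 1-d sample; the worst event deletes one tail."""
    xs = np.sort(np.asarray(xs, dtype=float))
    n, k = len(xs), int(np.floor(eps * len(xs)))
    mu = xs.mean()
    return max(abs(xs[: n - k].mean() - mu),   # delete the top tail
               abs(xs[k:].mean() - mu))        # delete the bottom tail

print(resilience_1d(np.arange(100), 0.1))                   # 5.0
print(resilience_1d(np.append(np.zeros(99), 1000.0), 0.1))  # 10.0
```

A light-tailed sample has small resilience parameter, while a single huge point (second example) dominates it, matching the intuition that resilience measures how much the tails can move the mean.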
Resilient distributions always have bounded modulus:
Theorem 2.10. The modulus of continuity of GTV satisfies the bound
m(GTV(ρ, 2ε), 2ε) ≤ 2ρ (9)
whenever ε < 1/2.
Proof. As in Lemma 2.7, the key idea is that any two distributions p, q that are close in TV have a midpoint distribution r = min(p, q)/(1 − TV(p, q)) that is a deletion of both distributions. This midpoint distribution connects the two distributions, and it follows from the triangle inequality that the modulus of GTV is bounded. We illustrate this idea in Figure 4 and make it precise below.
Figure 4: If TV(p, q) ≤ ε, the midpoint r = min(p, q)/(1 − TV(p, q)) satisfies r ≤ p/(1 − ε) and r ≤ q/(1 − ε), so it is a deletion of both p and q.
From TV(p, q) ≤ 2ε, we know that r = min(p, q)/(1 − TV(p, q)) can be obtained from either p or q by conditioning on an event of probability 1 − 2ε. It then follows from p, q ∈ GTV(ρ, 2ε) that ‖µ(p) − µ(r)‖ ≤ ρ and similarly ‖µ(q) − µ(r)‖ ≤ ρ. Thus by the triangle inequality ‖µ(p) − µ(q)‖ ≤ 2ρ, which yields the desired result.
We have seen so far that resilient distributions have bounded modulus, and that both Gaussian and bounded covariance distributions are resilient. The bound on the modulus for Gcov that is implied by resilience is optimal (O(σ√ε)), while for Ggauss it is optimal up to log factors (O(ε√(log(1/ε))) vs. O(ε)).
In fact, Gaussians are a special case and resilience yields an essentially optimal bound at least for most
non-parametric families of distributions. As one family of examples, consider distributions with bounded
Orlicz norm:
Definition 2.11 (Orlicz norm). A function ψ : R≥0 → R≥0 is an Orlicz function if ψ is convex, non-decreasing, and satisfies ψ(0) = 0 and ψ(x) → ∞ as x → ∞. For an Orlicz function ψ, the Orlicz norm or ψ-norm of a random variable X is defined as
‖X‖ψ := inf { t > 0 : E_p[ψ(|X|/t)] ≤ 1 }. (11)
Definition 2.12. We let Gψ(σ) denote the family of distributions p over Rd such that ‖⟨X − µ, v⟩‖ψ ≤ σ for all unit vectors v (‖v‖2 ≤ 1), where X ∼ p and µ is the mean of p. (12)
Thus a distribution has bounded ψ-norm if each of its 1-dimensional projections does. As an example, Gcov(σ) = Gψ(σ) for ψ(x) = x², so Orlicz norms generalize bounded covariance. It is also possible to generalize Definition 2.12 to norms other than the ℓ2-norm, which we will see in an exercise.
Distributions with bounded Orlicz norm are resilient:
Lemma 2.13. The family Gψ(σ) is contained in GTV(2σεψ⁻¹(1/ε), ε) for all 0 < ε < 1/2.
Proof. Without loss of generality assume E[X] = 0. For any event E with p(E) = 1 − ε′ ≥ 1 − ε, denote its complement as E^c. We then have
‖E_{X∼p}[X | E]‖2 (i)= (ε′/(1 − ε′)) ‖E_{X∼p}[X | E^c]‖2 (13)
 = (ε′/(1 − ε′)) sup_{‖v‖2≤1} E_{X∼p}[⟨X, v⟩ | E^c] (14)
 (ii)≤ (ε′/(1 − ε′)) sup_{‖v‖2≤1} σψ⁻¹(E_{X∼p}[ψ(|⟨X, v⟩|/σ) | E^c]) (15)
 (iii)≤ (ε′/(1 − ε′)) sup_{‖v‖2≤1} σψ⁻¹(E_{X∼p}[ψ(|⟨X, v⟩|/σ)]/ε′) (16)
 (iv)≤ (ε′/(1 − ε′)) σψ⁻¹(1/ε′) ≤ 2σεψ⁻¹(1/ε), (17)
as was to be shown. Here (i) is because (1 − ε′)E[X | E] + ε′E[X | E^c] = 0. Meanwhile (ii) is by convexity of ψ, (iii) is by non-negativity of ψ, and (iv) is the assumed ψ-norm bound; the final inequality uses ε′ ≤ ε < 1/2 together with the fact that εψ⁻¹(1/ε) is non-decreasing in ε.
As a consequence, the modulus m of Gψ(σ) is O(σεψ⁻¹(1/ε)), and hence the minimum distance functional estimates the mean with error O(σεψ⁻¹(1/ε)). Note that for ψ(x) = x² this reproduces our result for bounded covariance. For ψ(x) = x^k we get error O(σε^(1−1/k)) when a distribution has kth moments bounded by σ^k. Similarly, for sub-Gaussian distributions we get error O(σε√(log(1/ε))). We will show in an exercise that the error bound implied by Lemma 2.13 is optimal for any Orlicz function ψ.
Further properties and dual norm perspective. Having seen several examples of resilient distributions,
we now collect some basic properties of resilience, as well as a dual perspective that is often fruitful. First,
we can make the connection between resilience and tails even more precise with the following lemma:
Lemma 2.14. For a fixed vector v, let τε(v) denote the ε-quantile of ⟨x − µ, v⟩: P_{x∼p}[⟨x − µ, v⟩ ≥ τε(v)] = ε. Then, p is (ρ, ε)-resilient in a norm ‖ · ‖ if and only if the ε-tail of p has bounded mean when projected onto any dual unit vector v:
E_p[⟨x − µ, v⟩ | ⟨x − µ, v⟩ ≥ τε(v)] ≤ ((1 − ε)/ε) ρ whenever ‖v‖∗ ≤ 1. (18)
In particular, the ε-quantile satisfies τε(v) ≤ ((1 − ε)/ε) ρ.
In other words, if we project onto any unit vector v in the dual norm, the ε-tail of x − µ must have mean at most ((1 − ε)/ε) ρ. Lemma 2.14 is proved in Section C.
The intuition for Lemma 2.14 is the following picture, which is helpful to keep in mind more generally: letting µ̂ = E[X | E], if we have ‖µ̂ − µ‖ = ρ, then there must be some dual norm unit vector v such that ⟨µ̂ − µ, v⟩ = ρ and ‖v‖∗ = 1. Moreover, for such a v, ⟨µ̂ − µ, v⟩ will be largest when E consists of the (1 − ε)-fraction of points for which ⟨X − µ, v⟩ is largest. Therefore, resilience reduces to a 1-dimensional problem along each of the dual unit vectors v.
A related result establishes that for ε = 1/2, resilience in a norm is equivalent to having bounded first moments in the dual norm (see Section D for a proof):
Lemma 2.15. Suppose that p is (ρ, 1/2)-resilient in a norm ‖ · ‖, and let ‖ · ‖∗ be the dual norm. Then p has 1st moments bounded by 2ρ: E_{x∼p}[|⟨x − µ, v⟩|] ≤ 2ρ‖v‖∗ for all v ∈ Rd.
Conversely, if p has 1st moments bounded by ρ, it is (2ρ, 1/2)-resilient.
Figure 5: The optimal set T discards the smallest |S| elements projected onto a dual unit vector v.
Recap. We saw that the error of the trimmed mean grew as √d in d dimensions, and introduced an alternative estimator, the minimum distance functional, that achieves better error. Specifically, it achieves error 2ρ for the family of (ρ, ε)-resilient distributions, which includes all distributions with bounded Orlicz norm (including bounded covariance, bounded moments, and sub-Gaussians).
The definition of resilience is important not just as an analysis tool. Without it, we would need a different estimator for each of the cases of bounded covariance, sub-Gaussian, etc., since the minimum distance functional depends on the family G. Instead, we can always project onto the resilient family GTV(ρ, ε) and be confident that this will typically yield an optimal error bound. The only complication is that the projection still depends on the parameters ρ and ε; however, we can do without knowledge of either one of the parameters as long as we know the other.
[Lecture 3]
Concentration inequalities are typically established in two steps:
1. We show that a single random variable concentrates, via a tail bound on its deviation from the mean.
2. We show that the property composes nicely for sums of independent random variables.
A prototypical example (covered below) is showing that (1) a random variable has at most a 1/t² probability of being t standard deviations from its mean; and (2) the standard deviation of a sum of n i.i.d. random variables is √n times the standard deviation of a single variable.
The simplest concentration inequality is Markov’s inequality. Consider the following question:
A slot machine has an expected pay-out of $5 (and its payout is always non-negative). What can
we say about the probability that it pays out at least $100?
We observe that the probability must be at most 0.05, since a 0.05 chance of a $100 payout would by itself
already contribute $5 to the expected value. Moreover, this bound is achievable by taking a slot machine
that pays $0 with probability 0.95 and $100 with probability 0.05. Markov’s inequality is the generalization
of this observation:
Theorem 2.16 (Markov's inequality). Let X be a non-negative random variable with mean µ. Then, P[X ≥ t · µ] ≤ 1/t.
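The slot machine above is exactly the extremal case of Theorem 2.16 with t = 20; a two-line check:

```python
# Extremal slot machine: $0 w.p. 0.95, $100 w.p. 0.05, so the mean is $5.
payouts, probs = [0.0, 100.0], [0.95, 0.05]
mean = sum(x * p for x, p in zip(payouts, probs))

assert abs(mean - 5.0) < 1e-12
# Markov with t = 20: P[X >= 20 * mean] <= 1/20 = 0.05, met with equality.
assert abs(probs[1] - mean / 100) < 1e-12
```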
Markov’s inequality accomplishes our first goal of establishing concentration for a single random variable,
but it has two issues: first, the 1t tail bound decays too slowly in many cases (we instead would like
exponentially decaying tails); second, Markov’s inequality doesn’t compose well and so doesn’t accomplish
our second goal.
We can address both issues by applying Markov’s inequality to some transformed random variable. For
instance, applying Markov’s inequality to the random variable Z = (X − µ)2 yields the stronger Chebyshev
inequality:
Theorem 2.17 (Chebyshev's inequality). Let X be a real-valued random variable with mean µ and variance σ². Then, P[|X − µ| ≥ t · σ] ≤ 1/t².
Proof. Since Z = (X − µ)² is non-negative, we have that P[Z ≥ t² · σ²] ≤ 1/t² by Markov's inequality. Taking the square root gives P[|X − µ| ≥ t · σ] ≤ 1/t², as was to be shown.
Chebyshev’s inequality improves the 1/t dependence to 1/t2 . But more importantly, it gives a bound in
terms of a quantity (the variance σ 2 ) that composes nicely:
Lemma 2.18 (Additivity of variance). Let X1 , . . . , Xn be pairwise independent random variables, and let
Var[X] denote the variance of X. Then,
Proof. It suffices by induction to prove this for two random variables. Without loss of generality assume that both variables have mean zero. Then we have Var[X + Y] = E[(X + Y)²] = E[X²] + E[Y²] + 2E[XY] = Var[X] + Var[Y] + 2E[X]E[Y] = Var[X] + Var[Y], where the second-to-last step uses pairwise independence.
Chebyshev’s inequality together with Lemma √ 2.18 together allow us to show that an average of i.i.d.
random variables converges to its mean at a 1/ n rate:
Corollary 2.19. Suppose X1 , . . . , Xn are drawn i.i.d.
√ from p, where p has mean µ and variance σ 2 . Also
let S = n1 (X1 + · · · + Xn ). Then, P[|S − µ|/σ ≥ t/ n] ≤ 1/t2 .
Proof. Lemma 2.18 implies that Var[S] = σ²/n, from which the result follows by Chebyshev's inequality.
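Additivity of variance can be verified exactly for integer-valued distributions, since the pmf of an independent sum is the convolution of the individual pmfs; the helper names and the toy pmf are our choices.

```python
import numpy as np

def convolve_pmf(pmf, n):
    """Exact pmf of the sum of n i.i.d. copies of an integer-valued
    random variable supported on 0, 1, 2, ..."""
    out = np.array([1.0])
    for _ in range(n):
        out = np.convolve(out, pmf)
    return out

def var_of(pmf):
    vals = np.arange(len(pmf))
    mu = (vals * pmf).sum()
    return ((vals - mu) ** 2 * pmf).sum()

pmf, n = np.array([0.2, 0.5, 0.3]), 8
# Var[X1 + ... + Xn] = n * sigma^2, hence Var[S] = sigma^2 / n on averaging.
assert np.isclose(var_of(convolve_pmf(pmf, n)), n * var_of(pmf))
```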
Higher moments. Chebyshev's inequality gives bounds in terms of the second moment of X − µ. Can we do better by considering higher moments such as the 4th moment? Supposing that E[(X − µ)⁴] ≤ τ⁴, we do get the analogous bound P[|X − µ| ≥ t · τ] ≤ 1/t⁴. However, the 4th moment doesn't compose as nicely as the variance; if X and Y are two independent mean-zero random variables, then we have
E[(X + Y)⁴] = E[X⁴] + 6E[X²]E[Y²] + E[Y⁴], (20)
where the E[X²]E[Y²] term can't be easily dealt with. It is possible to bound higher moments under composition, for instance using the Rosenthal inequality, which states that
E[|Σᵢ Xᵢ|^p] ≤ O(p)^p · Σᵢ E[|Xᵢ|^p] + O(√p)^p · (Σᵢ E[Xᵢ²])^(p/2) (21)
for independent mean-zero random variables Xᵢ. Note that the first term on the right-hand side typically grows as n · O(p)^p while the second term typically grows as O(√(pn))^p.
We will typically not take the Rosenthal approach and instead work with an alternative, nicer object called the moment generating function:
mX(λ) := E[exp(λ(X − µ))]. (22)
For independent random variables, the moment generating function composes via the identity mX1+···+Xn(λ) = ∏ᵢ mXᵢ(λ). Applying Markov's inequality to the moment generating function yields the Chernoff bound:
Theorem 2.20 (Chernoff bound). For a random variable X with moment generating function mX(λ), we have
P[X − µ ≥ t] ≤ inf_{λ≥0} mX(λ) e^(−λt). (23)
Proof. By Markov's inequality, P[X − µ ≥ t] = P[exp(λ(X − µ)) ≥ exp(λt)] ≤ E[exp(λ(X − µ))]/exp(λt), which is equal to mX(λ)e^(−λt) by the definition of mX. Taking the infimum over λ yields the claimed bound.
The factor of 2 is pesky, and in fact we can get the more convenient bound mX(λ) ≤ exp(3λ²σ²/2) (Rivasplata, 2012). Plugging this into the Chernoff bound yields P[X − µ ≥ t] ≤ exp(3λ²σ²/2 − λt); minimizing over λ gives the optimized bound P[X − µ ≥ t] ≤ exp(−t²/(6σ²)).
Sub-Gaussians are particularly convenient because the bound mX(λ) ≤ exp(3λ²σ²/2) composes well. Let X1, . . . , Xn be independent sub-Gaussians with constants σ1, . . . , σn. Then we have mX1+···+Xn(λ) ≤ exp(3λ²(σ1² + · · · + σn²)/2). We will use this to bound the behavior of sums of bounded random variables using Hoeffding's inequality:1
Theorem 2.21 (Hoeffding's inequality). Let X1, . . . , Xn be independent zero-mean random variables lying in [−M, M], and let S = (1/n)(X1 + · · · + Xn). Then, P[S ≥ t] ≤ exp(−ln(2) nt²/6M²) ≤ exp(−nt²/9M²).
Proof. First, note that each Xᵢ is sub-Gaussian with parameter σ = M/√(ln 2), since E[exp(Xᵢ²/σ²)] ≤ exp(M²/σ²) = exp(ln 2) = 2. We thus have mXᵢ(λ) ≤ exp(3λ²M²/(2 ln 2)), and so by the multiplicativity of moment generating functions we obtain mS(λ) ≤ exp(3λ²M²/(2n ln 2)). Plugging into the Chernoff bound and optimizing λ as before yields P[S ≥ t] ≤ exp(−ln(2) nt²/6M²) as claimed.
√Hoeffding’s inequality shows that a sum of independent random variables converges to its mean at a
1/ n rate, with tails that decay as fast as a Gaussian as long as each of the individual variables is bounded.
Compare this to the 1/t2 decay that we obtained earlier through Chebyshev’s inequality.
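For Rademacher (±1) signs the tail of the sum can be computed exactly from binomial coefficients, which lets us check the (deliberately loose) constants of Theorem 2.21 (the helper name is ours):

```python
import math

def rademacher_sum_tail(n, s):
    """Exact P[s_1 + ... + s_n >= s] for independent uniform +-1 signs."""
    k_min = (n + s + 1) // 2   # need at least this many +1 signs
    return sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

n, s = 100, 30                           # i.e. S = (sum)/n >= t with t = 0.3
exact = rademacher_sum_tail(n, s)
bound = math.exp(-n * (s / n) ** 2 / 9)  # Theorem 2.21 with M = 1
assert exact <= bound                    # the bound holds, with lots of room
```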
Cumulants. The moment generating function is a convenient tool because it multiplies over independent
random variables. However, its existence requires that X already have thin tails, since E[exp(λX)] must be
finite. For heavy-tailed distributions a (laborious) alternative is to use cumulants.
The cumulant function is defined as
KX(λ) := log E[exp(λX)]. (25)
Note this is the log of the moment-generating function. Taking the log is convenient because now we have additivity: KX+Y(λ) = KX(λ) + KY(λ) for independent X, Y. Cumulants are obtained by writing KX(λ) as a power series:
KX(λ) = Σ_{n=1}^∞ (κn(X)/n!) λⁿ. (26)
1 Most of the constants presented here are suboptimal; we have focused on giving simpler proofs at the expense of sharp
constants.
When E[X] = 0, the first few values of κn are:
κ1(X) = 0, (27)
κ2(X) = E[X²], (28)
κ3(X) = E[X³], (29)
κ4(X) = E[X⁴] − 3E[X²]², (30)
κ5(X) = E[X⁵] − 10E[X³]E[X²], (31)
κ6(X) = E[X⁶] − 15E[X⁴]E[X²] − 10E[X³]² + 30E[X²]³. (32)
Since K is additive, each of the κn is as well. Thus while we ran into the issue that E[(X + Y)⁴] ≠ E[X⁴] + E[Y⁴], it is the case that κ4(X + Y) = κ4(X) + κ4(Y) as long as X and Y are independent. By going back and forth between moments and cumulants it is possible to obtain tail bounds even if only some of the moments exist. However, this can be arduous, and Rosenthal's inequality is probably the better route in such cases.
[Lecture 4]
Lemma 2.23. The maximum eigenvalue ‖M‖ is O(σ² · (1 + d/n + log(1/δ)/n)) with probability 1 − δ.
Proof. The maximum eigenvalue can be expressed as
‖M‖ = sup_{‖v‖2≤1} v⊤Mv = sup_{‖v‖2≤1} (1/n) Σᵢ₌₁ⁿ |⟨Xᵢ, v⟩|². (33)
The quantity inside the sup is attractive to analyze because it is an average of independent random variables. Indeed, we have
E[exp((n/σ²) v⊤Mv)] = E[exp(Σᵢ₌₁ⁿ |⟨Xᵢ, v⟩|²/σ²)] (34)
 = ∏ᵢ₌₁ⁿ E[exp(|⟨Xᵢ, v⟩|²/σ²)] ≤ 2ⁿ, (35)
where the last step follows by sub-Gaussianity of ⟨Xᵢ, v⟩. The Chernoff bound then gives P[v⊤Mv ≥ t] ≤ 2ⁿ exp(−nt/σ²).
If we were to follow the same strategy as Lemma 2.22, the next step would be to union bound over v. Unfortunately, there are infinitely many v so we cannot do this directly. Fortunately, we can get by with only considering a large but finite number of v; we will construct a finite subset N1/4 of the unit ball such that
sup_{v∈N1/4} v⊤Mv ≥ (1/2) sup_{‖v‖2≤1} v⊤Mv. (36)
Our construction follows Section 5.2.2 of Vershynin (2010). Let N1/4 be a maximal set of points in the unit ball such that ‖x − y‖2 ≥ 1/4 for all distinct x, y ∈ N1/4. We observe that |N1/4| ≤ 9^d; this is because the balls of radius 1/8 around each point in N1/4 are disjoint and contained in a ball of radius 9/8.
To establish (36), let v maximize v⊤Mv over ‖v‖2 ≤ 1, and let u be the closest point to v in N1/4, so that ‖v − u‖2 ≤ 1/4 by maximality of N1/4. Then
|v⊤Mv − u⊤Mu| = |v⊤M(v − u) + u⊤M(v − u)| (37)
 ≤ (‖v‖2 + ‖u‖2) ‖M‖ ‖v − u‖2 (38)
 ≤ 2 · ‖M‖ · (1/4) = ‖M‖/2. (39)
Since v⊤Mv = ‖M‖, we obtain |‖M‖ − u⊤Mu| ≤ ‖M‖/2, whence u⊤Mu ≥ ‖M‖/2, which establishes (36). We are now ready to apply the union bound: recall that from the Chernoff bound on v⊤Mv, we had P[v⊤Mv ≥ t] ≤ 2ⁿ exp(−nt/σ²), so
P[ sup_{v∈N1/4} v⊤Mv ≥ t ] ≤ 9^d 2ⁿ exp(−nt/σ²). (40)
Taking t = O(σ²(1 + d/n + log(1/δ)/n)) makes the right-hand side at most δ; combining with (36) then yields the lemma.
VC dimension. Our final example will be important in the following section; it concerns how quickly
a family of events with certain geometric structure converges to its expectation. Let H be a collection of
functions f : X → {0, 1}, and define the VC dimension vc(H) to be the maximum d for which there are points x1, . . . , xd such that (f(x1), . . . , f(xd)) can take on all 2^d possible values. For instance:
• If X = R and H = {I[x ≥ τ ] | τ ∈ R} is the family of threshold functions, then vc(H) = 1.
• If X = Rd and H = {I[hx, vi ≥ τ ] | v ∈ Rd , τ ∈ R} is the family of half-spaces, then vc(H) = d + 1.
Additionally, for a point set S = {x1, . . . , xn}, let VH(S) denote the number of distinct values of (f(x1), . . . , f(xn)) as f ranges over H, and let VH(n) = max{VH(S) : |S| = n}. Thus the VC dimension is exactly the maximum n such that VH(n) = 2^n.
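For the threshold family from the first bullet, VH(n) can be enumerated directly: on n points only n + 1 of the 2^n labelings are realizable, so no two points are shattered and vc(H) = 1 (the helper function is ours):

```python
def threshold_labelings(points):
    """All distinct labelings (I[x >= tau] per point) realizable by the
    threshold family H = {I[x >= tau]} on a fixed point set."""
    pts = sorted(points)
    taus = pts + [pts[-1] + 1]   # one representative threshold per gap
    return {tuple(int(x >= tau) for x in pts) for tau in taus}

pts = [0.0, 1.0, 2.0, 3.0]
labelings = threshold_labelings(pts)
assert len(labelings) == len(pts) + 1   # V_H(n) = n + 1, far below 2^n
assert (1, 0, 0, 0) not in labelings    # a smaller point can never be
                                        # labeled 1 while a larger one is 0
```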
We will show the following:
Proposition 2.24. Let H be a family of functions with vc(H) = d, and let X1, . . . , Xn ∼ p be i.i.d. random variables over X. For f : X → {0, 1}, let νn(f) = (1/n)|{i | f(Xᵢ) = 1}| and let ν(f) = p(f(X) = 1). Then
sup_{f∈H} |νn(f) − ν(f)| ≤ O(√((d + log(1/δ))/n)) (42)
with probability 1 − δ.
We will prove a weaker result that has a d log(n) factor instead of d, and which bounds the expected
value rather than giving a probability 1 − δ bound. The log(1/δ) tail bound follows from McDiarmid’s
inequality, which is a standard result in a probability course but requires tools that would take us too far
afield. Removing the log(n) factor is slightly more involved and uses a tool called chaining.
Proof of Proposition 2.24. The importance of the VC dimension for our purposes lies in the Sauer-Shelah
lemma:
Lemma 2.25 (Sauer-Shelah). Let d = vc(H). Then VH(n) ≤ Σ_{k=0}^d (n choose k) ≤ 2n^d.
It is tempting to union bound over the at most VH (n) distinct values of (f (X1 ), . . . , f (Xn )); however,
this doesn’t work because revealing X1 , . . . , Xn uses up all of the randomness in the problem and we have no
randomness left from which to get a concentration inequality! We will instead have to introduce some new
randomness using a technique called symmetrization.
Regarding the expectation, let $X_1', \dots, X_n'$ be independent copies of $X_1, \dots, X_n$ and let $\nu_n'(f)$ denote the version of $\nu_n(f)$ computed with the $X_i'$. Then we have
$$\mathbb{E}_X\Big[\sup_{f \in H} |\nu_n(f) - \nu(f)|\Big] \le \mathbb{E}_{X,X'}\Big[\sup_{f \in H} |\nu_n(f) - \nu_n'(f)|\Big] \qquad (43)$$
$$= \mathbb{E}_{X,X'}\Big[\sup_{f \in H} \Big|\frac{1}{n}\sum_{i=1}^n f(X_i) - f(X_i')\Big|\Big]. \qquad (44)$$
We can create our new randomness by noting that since Xi and Xi0 are identically distributed, f (Xi ) − f (Xi0 )
has the same distribution as si (f (Xi ) − f (Xi0 )), where si is a random sign variable that is ±1 with equal
probability. Introducing these variables and continuing the inequality, we thus have
$$\mathbb{E}_{X,X'}\Big[\sup_{f \in H} \Big|\frac{1}{n}\sum_{i=1}^n f(X_i) - f(X_i')\Big|\Big] = \mathbb{E}_{X,X',s}\Big[\sup_{f \in H} \Big|\frac{1}{n}\sum_{i=1}^n s_i\big(f(X_i) - f(X_i')\big)\Big|\Big]. \qquad (45)$$
We now have enough randomness to exploit the Sauer-Shelah lemma. If we fix $X$ and $X'$, note that the quantities $f(X_i) - f(X_i')$ take values in $[-1, 1]$ and collectively can take on at most $V_H(n)^2 = O(n^{2d})$ values. But for fixed $X, X'$, the quantities $s_i(f(X_i) - f(X_i'))$ are independent, zero-mean, bounded random variables, and hence for fixed $f$ we have $P[\sum_i s_i(f(X_i) - f(X_i')) \ge t] \le \exp(-t^2/9n)$ by Hoeffding's inequality. Union bounding over the $O(n^{2d})$ effectively distinct $f$, we obtain
$$P_s\Big[\sup_{f \in H}\Big|\sum_i s_i\big(f(X_i) - f(X_i')\big)\Big| \ge t \,\Big|\, X, X'\Big] \le O(n^{2d})\exp(-t^2/9n). \qquad (46)$$
This is small as long as $t \gg \sqrt{nd\log n}$, so (45) is $O(\sqrt{d\log n/n})$, as claimed.
[Lecture 5]
This projects onto the set G under TV distance and outputs the optimal parameters for the projected
distribution.
The problem with the minimum distance functional defined above is that projection under TV usually
doesn’t make sense for finite samples! For instance, suppose that p is a Gaussian distribution and let pn and
p0n be the empirical distributions of two different sets of n samples. Then TV(pn , p) = TV(pn , p0n ) = 1 almost
surely. This is because samples from a continuous probability distribution will almost surely be distinct,
and TV distance doesn't give credit for being “close”: the TV distance between two point masses at 1 and 1.000001 is still 1.²
To address this issue, we will consider two solutions. The first solution is to relax the distance. Intuitively,
the issue is that the TV distance is too strong—it reports a large distance even between a population
distribution p and the finite-sample distribution pn . We will replace the distance TV with a more forgiving
distance $\widetilde{TV}$ and use the minimum distance functional corresponding to this relaxed distance. To show that projection under $\widetilde{TV}$ still works, we will need to check that the modulus $m(G, \epsilon)$ is still small after we replace TV with $\widetilde{TV}$, and we will also need to check that the distance $\widetilde{TV}(p, p_n)$ between $p$ and its empirical distribution is small with high probability. We do this below in Section 2.6.1.
An alternative to relaxing the distance from TV to $\widetilde{TV}$ is to expand the destination set from G to some
M ⊃ G, such that even though p is not close to the empirical distribution pn , some element of M is close to
pn . Another advantage to expanding the destination set is that projecting onto G may not be computationally
tractable, while projecting onto some larger set M can sometimes be done efficiently. We will see how
to statistically analyze this modified projection algorithm in Section 2.6.2, and study the computational
feasibility of projecting onto a set M starting in Section 2.7.
Proof. By Proposition 2.4 we already know that the error is bounded by $m(G, 2\widetilde{TV}(p^*, \tilde p_n), \widetilde{TV}, L)$. Since $\widetilde{TV}$ is a pseudometric, by the triangle inequality we have $\widetilde{TV}(p^*, \tilde p_n) \le \widetilde{TV}(p^*, \tilde p) + \widetilde{TV}(\tilde p, \tilde p_n)$. Finally, $\widetilde{TV}(p^*, \tilde p) \le TV(p^*, \tilde p)$ by assumption.
$$\widetilde{TV}_H(p, q) \stackrel{\text{def}}{=} \sup_{f \in H,\, \tau \in \mathbb{R}} \big|P_{X \sim p}[f(X) \ge \tau] - P_{X \sim q}[f(X) \ge \tau]\big|. \qquad (48)$$
(Note the similarity to the distance in Proposition 2.24; we will make use of this later.) We will make the particular choice $H = H_{lin}$, where $H_{lin} \stackrel{\text{def}}{=} \{x \mapsto \langle v, x\rangle \mid v \in \mathbb{R}^d\}$.
First note that $\widetilde{TV}_H$ is indeed upper-bounded by TV, since $TV(p, q) = \sup_E |p(E) - q(E)|$ is the supremum over all events $E$, and (48) takes a supremum over a subset of events. The intuition for taking the particular family $H_{lin}$ is that linear projections of our data contain all information needed to recover the mean, so perhaps it is enough for distributions to be close only under these projections.
Bounding the modulus. To formalize this intuition, we prove the following mean crossing lemma:

Lemma 2.28. Suppose that $p$ and $q$ are two distributions such that $\widetilde{TV}_H(p, q) \le \epsilon$. Then for any $f \in H$, there are distributions $r_p \le \frac{p}{1-\epsilon}$ and $r_q \le \frac{q}{1-\epsilon}$ such that $\mathbb{E}_{X \sim r_p}[f(X)] \ge \mathbb{E}_{Y \sim r_q}[f(Y)]$.
2 We will later study the W1 distance, which does give credit for being close.
16
Proof. We will prove the stronger statement that $f(X)$ under $r_p$ stochastically dominates $f(Y)$ under $r_q$. Starting from $p$ and $q$, we delete the $\epsilon$ probability mass corresponding to the smallest values of $f(X)$ in $p$ to get $r_p$, and delete the $\epsilon$ probability mass corresponding to the largest values of $f(Y)$ in $q$ to get $r_q$. Since $\widetilde{TV}_H(p, q) \le \epsilon$ we have
$$\sup_{\tau \in \mathbb{R}} \big|P_{X \sim p}(f(X) \ge \tau) - P_{Y \sim q}(f(Y) \ge \tau)\big| \le \epsilon, \qquad (49)$$
which implies that $P_{r_p}(f(X) \ge \tau) \ge P_{r_q}(f(Y) \ge \tau)$ for all $\tau \in \mathbb{R}$. Hence, $r_p$ stochastically dominates $r_q$ and $\mathbb{E}_{r_p}[f(X)] \ge \mathbb{E}_{r_q}[f(Y)]$.
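The truncation in the proof can be carried out explicitly for equal-weight samples on the line; the following sketch (illustrative, with hypothetical data) deletes the $\epsilon$-fraction of smallest values of one sample and the $\epsilon$-fraction of largest values of the other and checks that the truncated means cross.

```python
# 1-d illustration of the truncation in Lemma 2.28: delete an eps-fraction of
# the smallest points of p (pushing its mean up) and of the largest points of q
# (pushing its mean down). Sketch with equal-weight samples, not general measures.
def truncated_mean_upper(sample, eps):
    """Mean after deleting the eps-fraction smallest points."""
    s = sorted(sample)
    k = int(eps * len(s))
    return sum(s[k:]) / len(s[k:])

def truncated_mean_lower(sample, eps):
    """Mean after deleting the eps-fraction largest points."""
    s = sorted(sample)
    k = int(eps * len(s))
    kept = s[:len(s) - k] if k > 0 else s
    return sum(kept) / len(kept)

p = [0.1 * i for i in range(100)]           # mean 4.95
q = [0.1 * i + 0.05 for i in range(100)]    # shifted copy, mean 5.00
eps = 0.1
# Untruncated, mean(q) > mean(p); after truncation the order flips:
print(truncated_mean_upper(p, eps) >= truncated_mean_lower(q, eps))  # True
```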
Mean crossing lemmas such as Lemma 2.28 help us bound the modulus of relaxed distances for the family
of resilient distributions. In this case we have the following corollary:
Corollary 2.29. For the family $G_{TV}(\rho, \epsilon)$ of $(\rho, \epsilon)$-resilient distributions and $L(p, \theta) = \|\theta - \mu(p)\|_2$, we have
$$m\big(G_{TV}(\rho, \epsilon), \epsilon, \widetilde{TV}_{H_{lin}}\big) \le 2\rho. \qquad (50)$$
Compare to Theorem 2.10, where we showed that $m(G_{TV}, \epsilon, TV) \le \rho$. Thus as long as Theorem 2.10 is tight, relaxing from TV to $\widetilde{TV}_{H_{lin}}$ increases the modulus by at most a factor of 2!
Figure 6: Illustration of the mean crossing lemma. For any distributions $p_1, p_2$ that are close under $\widetilde{TV}$, we can truncate the $\epsilon$-tails of each distribution to make their means cross.
Bounding the distance to the empirical distribution. Now that we have bounded the modulus, it remains to bound the distance $\widetilde{TV}(\tilde p, \tilde p_n)$. Note that $\widetilde{TV}_H(\tilde p, \tilde p_n)$ is exactly the quantity bounded in equation (42) of Proposition 2.24; we thus have $\widetilde{TV}_H(\tilde p, \tilde p_n) \le O\big(\sqrt{\frac{vc(H) + \log(1/\delta)}{n}}\big)$ with probability $1 - \delta$. Here $vc(H)$ is the VC dimension of the family of threshold functions $\{x \mapsto \mathbb{I}[f(x) \ge \tau] \mid f \in H, \tau \in \mathbb{R}\}$. So, for $H = H_{lin}$ all we need to do is bound the VC dimension of the family of halfspace functions on $\mathbb{R}^d$.
We claimed earlier that this VC dimension is $d + 1$, but we prove it here for completeness. We will show that no set of points $x_1, \dots, x_{d+2} \in \mathbb{R}^d$ can be shattered into all $2^{d+2}$ possible subsets using halfspaces. For any such points we can find multipliers $a_1, \dots, a_{d+2} \in \mathbb{R}$, not all zero, such that
$$\sum_{i=1}^{d+2} a_i x_i = 0, \qquad \sum_{i=1}^{d+2} a_i = 0 \qquad (57)$$
(such multipliers exist because the $d+2$ vectors $(x_i, 1) \in \mathbb{R}^{d+1}$ must be linearly dependent). Let $S_+ = \{i \mid a_i > 0\}$ and $S_- = \{i \mid a_i < 0\}$. We will show that the convex hulls of $\{x_i\}_{i \in S_+}$ and $\{x_i\}_{i \in S_-}$ intersect. Consequently, there is no vector $v$ and threshold $\tau$ such that $\langle x_i, v\rangle \ge \tau$ iff $i \in S_+$. (This is because both a halfspace and its complement are convex, so if we let $H_{v,\tau}$ denote the halfspace, it is impossible to have $S_+ \subset H_{v,\tau}$, $S_- \subset H_{v,\tau}^c$, and $\mathrm{conv}(S_+) \cap \mathrm{conv}(S_-) \ne \emptyset$.)
To prove that the convex hulls intersect, note that we have
$$\frac{1}{A}\sum_{i \in S_+} a_i x_i = \frac{1}{A}\sum_{i \in S_-} (-a_i) x_i, \qquad (58)$$
where $A = \sum_{i \in S_+} a_i = \sum_{i \in S_-} (-a_i)$. But the left-hand side lies in $\mathrm{conv}(S_+)$ while the right-hand side lies in $\mathrm{conv}(S_-)$, so the convex hulls do indeed intersect.
This shows that $x_1, \dots, x_{d+2}$ cannot be shattered, so $vc(H_{lin}) \le d + 1$. Combining this with Proposition 2.24, we obtain:

Proposition 2.30. With probability $1 - \delta$, we have $\widetilde{TV}_{H_{lin}}(\tilde p, \tilde p_n) \le O\big(\sqrt{\frac{d + \log(1/\delta)}{n}}\big)$.
Combining this with Corollary 2.29 and Lemma 2.27, we see that projecting onto $G_{TV}(\rho, 2\epsilon')$ under $\widetilde{TV}_{H_{lin}}$ performs well in finite samples, for $\epsilon' = \epsilon + O(\sqrt{d/n})$. For instance, if $G$ has bounded covariance we achieve error $O\big(\sqrt{\epsilon + \sqrt{d/n}}\big)$; if $G$ is sub-Gaussian we achieve error $\tilde O\big(\epsilon + \sqrt{d/n}\big)$; and in general if $G$ has bounded $\psi$-norm we achieve error $O\Big(\big(\epsilon + \sqrt{d/n}\big)\,\psi^{-1}\Big(\tfrac{1}{\epsilon + \sqrt{d/n}}\Big)\Big) \le O\big((\epsilon + \sqrt{d/n})\,\psi^{-1}(1/\epsilon)\big)$.
This analysis is slightly sub-optimal, as the best lower bound we are aware of is $\Omega\big(\epsilon\,\psi^{-1}(1/\epsilon) + \sqrt{d/n}\big)$, i.e. the $\psi^{-1}(1/\epsilon)$ coefficient in the dependence on $n$ shouldn't be there. However, it is accurate as long as $\epsilon$ is large compared to $\sqrt{d/n}$.
Connection to Tukey median. A classical robust estimator for the mean is the Tukey median, which solves the problem
$$\min_{\mu} \max_{v \in \mathbb{R}^d} \Big|P_{X \sim \tilde p_n}[\langle X, v\rangle \ge \langle \mu, v\rangle] - \tfrac{1}{2}\Big| \qquad (59)$$
[Note: this definition is slightly wrong as it does not behave gracefully when there is a point mass at µ.]
It is instructive to compare this to projection under $\widetilde{TV}_{H_{lin}}$, which corresponds to
$$\min_{q \in G} \sup_{v \in \mathbb{R}^d,\, \tau \in \mathbb{R}} \big|P_{X \sim \tilde p_n}[\langle X, v\rangle \ge \tau] - P_{Y \sim q}[\langle Y, v\rangle \ge \tau]\big|.$$
The differences are: (1) the Tukey median only minimizes over the mean rather than the full distribution $q$; (2) it only considers the threshold $\langle \mu, v\rangle$ rather than all thresholds $\tau$; and (3) it assumes that the median of any one-dimensional projection $\langle X, v\rangle$ is equal to its mean (which is why we subtract $\frac{1}{2}$ in (59)). Distributions satisfying this final property are said to be unskewed.
For unskewed distributions with “sufficient probability mass” near the mean, the Tukey median yields a
robust estimator. In fact, it can be robust even if the true distribution has heavy tails (and hence is not
resilient), by virtue of leveraging the unskewed property. We will explore this in an exercise.
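In one dimension the objective (59) reduces to $|P[X \ge \mu] - \frac{1}{2}|$, which is minimized by the median. The following sketch (illustrative, with hypothetical data) shows that the resulting estimate barely moves under an $\epsilon$-fraction of gross corruptions.

```python
# In one dimension the Tukey objective (59) reduces to |P[X >= mu] - 1/2|,
# minimized by the median. Quick robustness sketch (not from the notes).
def tukey_objective_1d(sample, mu):
    return abs(sum(x >= mu for x in sample) / len(sample) - 0.5)

def median(sample):
    s = sorted(sample)
    return s[len(s) // 2]

clean = [i / 100 for i in range(100)]       # roughly uniform on [0, 1]
corrupted = clean[:90] + [1000.0] * 10      # eps = 0.1 gross outliers
m = median(corrupted)
print(abs(m - median(clean)) < 0.1)         # True: the median barely moves
# The median attains (near-)zero Tukey objective on the corrupted sample:
print(tukey_objective_1d(corrupted, m) <= 0.01)   # True
```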
[Lectures 6-7]
2.6.2 Expanding the Set
In Section 2.6.1 we saw how to resolve the issue with TV projection by relaxing to a weaker distance $\widetilde{TV}$.
We will now study an alternate approach, based on expanding the destination set G to a larger set M. For
this approach we will need to reference the “true empirical distribution” p∗n . What we mean by this is the
following: Whenever $TV(p^*, \tilde p) \le \epsilon$, we know that $p^*$ and $\tilde p$ are identical except for some event $E$ of probability $\epsilon$. Therefore we can sample from $\tilde p$ as follows:
1. Draw a sample from X ∼ p∗ .
2. Check if E holds; if it does, replace X with a sample from the conditional distribution p̃|E .
3. Otherwise leave X as-is.
Thus we can interpret a sample from $\tilde p$ as having a $1 - \epsilon$ chance of being “from” $p^*$. More generally, we can construct the empirical distribution $\tilde p_n$ by first constructing the empirical distribution $p^*_n$ coming from $p^*$, then replacing $\mathrm{Binom}(n, \epsilon)$ of the points with samples from $\tilde p|_E$. Formally, we have created a coupling between the random variables $p^*_n$ and $\tilde p_n$ such that $TV(p^*_n, \tilde p_n)$ is distributed as $\frac{1}{n}\mathrm{Binom}(n, \epsilon)$.
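The coupling described above can be sketched directly. The specific choices below ($p^*$ a standard normal, the corrupted conditional a point mass at 10) are hypothetical and purely for illustration.

```python
# Sketch of the coupling: each of the n draws comes from p*, except on an event
# of probability eps, where it is replaced by a draw from the corrupted
# conditional. The choices of p* and the corruption are hypothetical.
import random

def sample_coupled(n, eps, seed=0):
    rng = random.Random(seed)
    clean, corrupted, n_replaced = [], [], 0
    for _ in range(n):
        x = rng.gauss(0, 1)          # draw from p*
        clean.append(x)
        if rng.random() < eps:       # the event E occurs
            corrupted.append(10.0)   # draw from p~ conditioned on E
            n_replaced += 1
        else:
            corrupted.append(x)
    return clean, corrupted, n_replaced

clean, corrupted, k = sample_coupled(10000, 0.05)
# TV between the two empirical distributions is k/n with k ~ Binom(n, eps):
print(sum(c != d for c, d in zip(clean, corrupted)) == k)  # True
print(abs(k / 10000 - 0.05) < 0.02)                        # True
```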
Let us return to expanding the set from G to M. For this to work, we need three properties to hold:
• M is large enough: minq∈M TV(q, p∗n ) is small with high probability.
• The empirical loss L(p∗n , θ) is a good approximation to the population loss L(p∗ , θ).
• The modulus is still bounded: $\max_{p,q \in M \,:\, TV(p,q) \le 2\epsilon} L(p, \theta^*(q))$ is small.
In fact, it suffices for $M$ to satisfy a weaker property; we only need the “generalized modulus” to be small relative to some $G' \subset M$:

Assume that the true empirical distribution $p^*_n$ lies in $G'$ with probability $1 - \delta$. Then the minimum distance functional projecting under TV onto $M$ has empirical error $L(p^*_n, \hat\theta)$ at most $m(G', M, 2\epsilon')$ with probability at least $1 - \delta - P[\mathrm{Binom}(n, \epsilon) \ge \epsilon' n]$.
Proof. Let $\epsilon' = TV(p^*_n, \tilde p_n)$, which is $\frac{1}{n}\mathrm{Binom}(n, \epsilon)$-distributed. If $p^*_n$ lies in $G'$, then since $G' \subset M$ we know that $\tilde p_n$ has distance at most $\epsilon'$ from $M$, and so the projected distribution $q$ satisfies $TV(q, \tilde p_n) \le \epsilon'$ and hence $TV(q, p^*_n) \le 2\epsilon'$. It follows from the definition that $L(p^*_n, \hat\theta) = L(p^*_n, \theta^*(q)) \le m(G', M, 2\epsilon')$.

A useful bound on the binomial tail is that $P[\mathrm{Binom}(n, \epsilon) \ge 2\epsilon n] \le \exp(-\epsilon n/3)$. In particular the empirical error is at most $m(G', M, 4\epsilon)$ with probability at least $1 - \delta - \exp(-\epsilon n/3)$.
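The stated tail bound is a standard multiplicative Chernoff bound; here is a quick stdlib check against the exact binomial tail (illustrative, with arbitrary $n$ and $\epsilon$).

```python
# Numerical check of P[Binom(n, eps) >= 2*eps*n] <= exp(-eps*n/3), a standard
# multiplicative Chernoff bound. Pure-stdlib sketch with arbitrary parameters.
from math import comb, exp, ceil

def binom_upper_tail(n, p, k):
    """Exact P[Binom(n, p) >= k]."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n, eps = 200, 0.05
tail = binom_upper_tail(n, eps, ceil(2 * eps * n))
bound = exp(-eps * n / 3)
print(tail <= bound)   # True
```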
Application: bounded kth moments. First suppose that the distribution $p^*$ has bounded $k$th moments, i.e. $p^* \in G_{mom,k}(\sigma) = \{p \mid \|p\|_\psi \le \sigma\}$, where $\psi(x) = x^k$. When $k > 2$, the empirical distribution $p^*_n$ will not have bounded $k$th moments until $n \ge \Omega(d^{k/2})$. This is because if we take a single sample $x_1 \sim p$ and let $v$ be a unit vector in the direction of $x_1 - \mu$, then $\mathbb{E}_{x \sim p^*_n}[\langle x - \mu, v\rangle^k] \ge \frac{1}{n}\|x_1 - \mu\|_2^k \approx d^{k/2}/n$, since the norm $\|x_1 - \mu\|_2$ is typically $\sqrt d$.

Consequently, it is necessary to expand the set, and we will choose $G' = M = G_{TV}(\rho, \epsilon)$ for $\rho = O(\sigma\epsilon^{1-1/k})$ to be the set of resilient distributions with appropriate parameters $\rho$ and $\epsilon$. We already know that the modulus of $M$ is bounded by $O(\sigma\epsilon^{1-1/k})$, so the hard part is showing that the empirical distribution $p^*_n$ lies in $M$ with high probability.

As noted above, we cannot hope to prove that $p^*_n$ has bounded moments except when $n = \Omega(d^{k/2})$, which is too large. We will instead show that certain truncated moments of $p^*_n$ are bounded as soon as $n = \Omega(d)$,
and that these truncated moments suffice to show resilience. Specifically, if $\psi(x) = x^k$ is the Orlicz function for the $k$th moments, we will define the truncated function
$$\tilde\psi(x) = \begin{cases} x^k & : x \le x_0 \\ k x_0^{k-1}(x - x_0) + x_0^k & : x > x_0. \end{cases} \qquad (62)$$
In other words, $\tilde\psi$ is equal to $\psi$ for $x \le x_0$, and is the best linear lower bound to $\psi$ for $x > x_0$. Note that $\tilde\psi$ is $L$-Lipschitz for $L = k x_0^{k-1}$. We will eventually take $x_0 = (k^{k-1}\epsilon)^{-1/k}$ and hence $L = (1/\epsilon)^{(k-1)/k}$. Using a symmetrization argument, we will bound the truncated moments $\sup_{\|v\|_2 \le 1} \mathbb{E}_{p^*_n}[\tilde\psi(|\langle x - \mu, v\rangle|/\sigma)]$.
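The truncated function (62) is easy to implement and check numerically; the sketch below (with arbitrary $k$ and $x_0$) verifies that $\tilde\psi$ lower-bounds $\psi$ and that its slopes never exceed $L = kx_0^{k-1}$.

```python
# Sketch of the truncated Orlicz function psi-tilde from (62): x^k up to x0,
# then its tangent line beyond. We check it lower-bounds x^k and is
# L-Lipschitz for L = k * x0^(k-1). Arbitrary illustrative parameters.
def psi_tilde(x, k, x0):
    if x <= x0:
        return x ** k
    return k * x0 ** (k - 1) * (x - x0) + x0 ** k

k, x0 = 4, 1.5
L = k * x0 ** (k - 1)
xs = [i * 0.01 for i in range(500)]  # grid on [0, 5)
# psi_tilde <= psi everywhere (the tangent of a convex function lies below it):
print(all(psi_tilde(x, k, x0) <= x ** k + 1e-9 for x in xs))   # True
# finite-difference slopes never exceed L:
slopes = [(psi_tilde(xs[i + 1], k, x0) - psi_tilde(xs[i], k, x0)) / 0.01
          for i in range(len(xs) - 1)]
print(max(slopes) <= L + 1e-6)                                 # True
```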
Proposition 2.32. Let $X_1, \dots, X_n \sim p^*$, where $p^* \in G_{mom,k}(\sigma)$. Then
$$\mathbb{E}_{X_1,\dots,X_n \sim p^*}\Big[\sup_{\|v\|_2 \le 1} \Big|\frac{1}{n}\sum_{i=1}^n \tilde\psi\Big(\frac{|\langle X_i - \mu, v\rangle|}{\sigma}\Big) - U(v)\Big|^k\Big] \le O\Big(2L\sqrt{\frac{dk}{n}}\Big)^k, \qquad (63)$$
where $L = k x_0^{k-1}$ and $U(v)$ is a function satisfying $U(v) \le 1$ for all $v$.
Before proving Proposition 2.32, let us interpret its significance. Take $x_0 = (k^{k-1}\epsilon)^{-1/k}$ and hence $L = (1/\epsilon)^{(k-1)/k}$. Take $n$ large enough so that the right-hand side of (63) is at most 1, which requires $n \ge \Omega(kd/\epsilon^{2-2/k})$. We then obtain a high-probability bound on the $\tilde\psi$-norm of $p^*_n$, i.e. the $\tilde\psi$-norm is at most $O(\delta^{-1/k})$ with probability $1 - \delta$. This implies that $p^*_n$ is resilient with parameter $\rho = \sigma\epsilon\,\tilde\psi^{-1}(O(\delta^{-1/k})/\epsilon)$. A useful bound on $\tilde\psi^{-1}$ is $\tilde\psi^{-1}(z) \le x_0 + z/L$, and since $x_0 \le (1/\epsilon)^{1/k}$ and $L = (1/\epsilon)^{(k-1)/k}$ in our case, we have $\rho \le O(\sigma\epsilon^{1-1/k}\delta^{-1/k})$ with probability $1 - \delta$.

This matches the population bound of $O(\sigma\epsilon^{1-1/k})$, and only requires $kd/\epsilon^{2-2/k}$ samples, in contrast to the $d^{k/2}$ samples required before. Indeed, this sample complexity dependence is optimal (other than the factor of $k$); the only drawback is that we do not get exponential tails (we instead obtain tails of $\delta^{-1/k}$, which is worse than the $\sqrt{\log(1/\delta)}$ from before).
Now we discuss some ideas that are needed in the proof. We would like to somehow exploit the fact that $\tilde\psi$ is $L$-Lipschitz to prove concentration. We can do so with the following keystone result in probability theory:

Theorem 2.33 (Ledoux-Talagrand contraction). Let $\phi : \mathbb{R} \to \mathbb{R}$ be an $L$-Lipschitz function with $\phi(0) = 0$, let $T \subseteq \mathbb{R}^n$, and let $\epsilon_1, \dots, \epsilon_n$ be independent random signs. Then for any convex increasing function $g$,
$$\mathbb{E}_\epsilon\Big[g\Big(\sup_{t \in T} \sum_{i=1}^n \epsilon_i \phi(t_i)\Big)\Big] \le \mathbb{E}_\epsilon\Big[g\Big(L \sup_{t \in T} \sum_{i=1}^n \epsilon_i t_i\Big)\Big].$$
Let us interpret this result. We should think of $t_i$ as a quantity such as $\langle x_i - \mu, v\rangle$, where abstracting to $t_i$ yields generality and notational simplicity. Theorem 2.33 says that if we let $Y = \sup_{t \in T} \sum_i \epsilon_i \phi(t_i)$ and $Z = L \sup_{t \in T} \sum_i \epsilon_i t_i$, then $\mathbb{E}[g(Y)] \le \mathbb{E}[g(Z)]$ for all convex increasing functions $g$. When this holds we say that $Z$ stochastically dominates $Y$ in second order; intuitively, it is equivalent to saying that $Z$ has larger mean than $Y$ and greater variation around its mean. For distributions supported on just two points, we can formalize this as follows:
Lemma 2.34 (Two-point stochastic dominance). Let $Y$ take values $y_1$ and $y_2$ with probability $\frac{1}{2}$ each, and $Z$ take values $z_1$ and $z_2$ with probability $\frac{1}{2}$ each. Then $Z$ stochastically dominates $Y$ (in second order) if and only if
$$\frac{z_1 + z_2}{2} \ge \frac{y_1 + y_2}{2} \quad \text{and} \quad \max(z_1, z_2) \ge \max(y_1, y_2). \qquad (65)$$
Proof. Without loss of generality assume $z_2 \ge z_1$ and $y_2 \ge y_1$. We want to show that $\mathbb{E}[g(Y)] \le \mathbb{E}[g(Z)]$ for all convex increasing $g$ if and only if (65) holds. We first establish necessity of (65). Taking $g(x) = x$, we require $\mathbb{E}[Y] \le \mathbb{E}[Z]$, which is the first condition in (65). Taking $g(x) = \max(x - z_2, 0)$ yields $\mathbb{E}[g(Z)] = 0$ and $\mathbb{E}[g(Y)] \ge \frac{1}{2}\max(y_2 - z_2, 0)$, so $\mathbb{E}[g(Y)] \le \mathbb{E}[g(Z)]$ implies that $y_2 \le z_2$, which is the second condition in (65).
We next establish sufficiency, by conjuring up appropriate weights for Jensen's inequality. We have
$$\frac{y_2 - z_1}{z_2 - z_1}\, g(z_2) + \frac{z_2 - y_2}{z_2 - z_1}\, g(z_1) \ge g\Big(\frac{z_2(y_2 - z_1) + z_1(z_2 - y_2)}{z_2 - z_1}\Big) = g(y_2), \qquad (66)$$
$$\frac{z_2 - y_2}{z_2 - z_1}\, g(z_2) + \frac{y_2 - z_1}{z_2 - z_1}\, g(z_1) \ge g\Big(\frac{z_2(z_2 - y_2) + z_1(y_2 - z_1)}{z_2 - z_1}\Big) = g(z_1 + z_2 - y_2) \ge g(y_1). \qquad (67)$$
Here the first two inequalities are Jensen while the last is by the first condition in (65) together with the monotonicity of $g$. Adding these together yields $g(z_2) + g(z_1) \ge g(y_2) + g(y_1)$, or $\mathbb{E}[g(Z)] \ge \mathbb{E}[g(Y)]$, as desired. We need only check that the weights $\frac{y_2 - z_1}{z_2 - z_1}$ and $\frac{z_2 - y_2}{z_2 - z_1}$ are nonnegative. The second weight is nonnegative by the assumption $z_2 \ge y_2$. The first weight could be negative if $y_2 < z_1$, meaning that both $y_1$ and $y_2$ are smaller than both $z_1$ and $z_2$. But in this case, the inequality $\mathbb{E}[g(Y)] \le \mathbb{E}[g(Z)]$ trivially holds by monotonicity of $g$. This completes the proof.
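Lemma 2.34 can also be checked numerically on two-point distributions; the convex increasing test functions below are hand-picked examples, not from the notes.

```python
# Numerical check of Lemma 2.34: when (65) holds, E[g(Y)] <= E[g(Z)] for a few
# convex increasing test functions g. Sketch with hand-picked two-point laws.
def dominates(y1, y2, z1, z2, gs):
    return all(0.5 * (g(y1) + g(y2)) <= 0.5 * (g(z1) + g(z2)) + 1e-12
               for g in gs)

tests = [lambda x: x,
         lambda x: max(x, 0.0),
         lambda x: max(x - 1.0, 0.0) ** 2,
         lambda x: 2.718281828 ** x]
# (65) holds: mean(Z) = 1 >= mean(Y) = 0.5 and max(Z) = 3 >= max(Y) = 2
print(dominates(-1.0, 2.0, -1.0, 3.0, tests))   # True
# (65) fails on the max condition: max(Y) = 5 > max(Z) = 3
print(dominates(0.0, 5.0, 2.0, 3.0, tests))     # False
```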
We are now ready to prove Theorem 2.33.
Proof of Theorem 2.33. Without loss of generality we may take L = 1. Our strategy will be to iteratively
apply an inequality for a single i to replace all the φ(ti ) with ti one-by-one. The inequality for a single i is
the following:
Lemma 2.35. For any 1-Lipschitz function $\phi$ with $\phi(0) = 0$, any collection $T$ of ordered pairs $(a, b)$, and any convex increasing function $g$, we have
$$\mathbb{E}_\epsilon\Big[g\Big(\sup_{(a,b) \in T} a + \epsilon\,\phi(b)\Big)\Big] \le \mathbb{E}_\epsilon\Big[g\Big(\sup_{(a,b) \in T} a + \epsilon\, b\Big)\Big], \qquad (68)$$
where $\epsilon$ is a uniformly random sign.
To prove this, let $(a_+, b_+)$ attain the sup of $a + \epsilon\phi(b)$ for $\epsilon = +1$, and $(a_-, b_-)$ attain the sup for $\epsilon = -1$. We will check the conditions of Lemma 2.34 for
$$y_1 = a_- - \phi(b_-), \qquad (69)$$
$$y_2 = a_+ + \phi(b_+), \qquad (70)$$
$$z_1 = \max(a_- - b_-,\, a_+ - b_+), \qquad (71)$$
$$z_2 = \max(a_- + b_-,\, a_+ + b_+). \qquad (72)$$
(Note that $z_1$ and $z_2$ are lower bounds on the right-hand-side sup for $\epsilon = -1, +1$ respectively.)
First we need $\max(y_1, y_2) \le \max(z_1, z_2)$. But $\max(z_1, z_2) = \max(a_- + |b_-|, a_+ + |b_+|) \ge \max(a_- - \phi(b_-), a_+ + \phi(b_+)) = \max(y_1, y_2)$. Here the inequality follows since $\pm\phi(b) \le |b|$, because $\phi$ is 1-Lipschitz and $\phi(0) = 0$.
Second we need $\frac{y_1 + y_2}{2} \le \frac{z_1 + z_2}{2}$. We have $z_1 + z_2 \ge \max\big((a_- - b_-) + (a_+ + b_+),\, (a_- + b_-) + (a_+ - b_+)\big) = a_+ + a_- + |b_+ - b_-|$, so it suffices to show that $\frac{a_+ + a_- + |b_+ - b_-|}{2} \ge \frac{a_+ + a_- + \phi(b_+) - \phi(b_-)}{2}$. This exactly reduces to $\phi(b_+) - \phi(b_-) \le |b_+ - b_-|$, which again follows since $\phi$ is Lipschitz. This completes the proof of the lemma.
Now to prove the general proposition we observe that if g(x) is convex in x, so is g(x + t) for any t. We
then proceed by iteratively applying Lemma 2.35:
$$\mathbb{E}_{\epsilon_{1:n}}\Big[g\Big(\sup_{t \in T} \sum_{i=1}^n \epsilon_i \phi(t_i)\Big)\Big] = \mathbb{E}_{\epsilon_{1:n-1}}\Big[\mathbb{E}_{\epsilon_n}\Big[g\Big(\underbrace{\sup_{t \in T} \sum_{i=1}^{n-1} \epsilon_i \phi(t_i)}_{a} + \epsilon_n \underbrace{\phi(t_n)}_{\phi(b)}\Big) \,\Big|\, \epsilon_{1:n-1}\Big]\Big] \qquad (73)$$
$$\le \mathbb{E}_{\epsilon_{1:n-1}}\Big[\mathbb{E}_{\epsilon_n}\Big[g\Big(\sup_{t \in T} \sum_{i=1}^{n-1} \epsilon_i \phi(t_i) + \epsilon_n t_n\Big) \,\Big|\, \epsilon_{1:n-1}\Big]\Big] \qquad (74)$$
$$= \mathbb{E}_{\epsilon_{1:n}}\Big[g\Big(\sup_{t \in T} \sum_{i=1}^{n-1} \epsilon_i \phi(t_i) + \epsilon_n t_n\Big)\Big] \qquad (75)$$
$$\vdots \qquad (76)$$
$$\le \mathbb{E}_{\epsilon_{1:n}}\Big[g\Big(\sup_{t \in T} \epsilon_1 \phi(t_1) + \sum_{i=2}^{n} \epsilon_i t_i\Big)\Big] \qquad (77)$$
$$\le \mathbb{E}_{\epsilon_{1:n}}\Big[g\Big(\sup_{t \in T} \sum_{i=1}^{n} \epsilon_i t_i\Big)\Big]. \qquad (78)$$
Here the first inequality adds and subtracts the mean, the second applies symmetrization, while the third
uses the fact that optimizing a single v for both X and X 0 is smaller than optimizing v separately for each
(and that the expectations of the expressions with X and X 0 are equal to each other in that case).
We now apply Ledoux-Talagrand contraction. Invoking Theorem 2.33 with $g(x) = |x|^k$, $\phi(x) = \tilde\psi(|x|)$, and $t_i = \langle X_i - \mu, v\rangle/\sigma$, we obtain
$$\mathbb{E}_{X \sim p,\, \epsilon}\Big[\sup_{\|v\|_2 \le 1} \Big|\frac{1}{n}\sum_{i=1}^n \epsilon_i \tilde\psi\Big(\frac{|\langle X_i - \mu, v\rangle|}{\sigma}\Big)\Big|^k\Big] \le (L/\sigma)^k\, \mathbb{E}_{X \sim p,\, \epsilon}\Big[\sup_{\|v\|_2 \le 1} \Big|\frac{1}{n}\sum_{i=1}^n \epsilon_i \langle X_i - \mu, v\rangle\Big|^k\Big] \qquad (82)$$
$$= (L/\sigma)^k\, \mathbb{E}_{X \sim p,\, \epsilon}\Big[\Big\|\frac{1}{n}\sum_{i=1}^n \epsilon_i (X_i - \mu)\Big\|_2^k\Big]. \qquad (83)$$
We are thus finally left to bound $\mathbb{E}_{X,\epsilon}[\|\sum_{i=1}^n \epsilon_i(X_i - \mu)\|_2^k]$. Here we will use Khintchine's inequality, which says that
$$A_k \|z\|_2 \le \mathbb{E}_\epsilon\Big[\Big|\sum_i \epsilon_i z_i\Big|^k\Big]^{1/k} \le B_k \|z\|_2, \qquad (84)$$
where $A_k$ is $\Theta(1)$ and $B_k$ is $\Theta(\sqrt k)$ for $k \ge 1$. Applying this in our case (with $\epsilon' \in \{-1, +1\}^d$ an independent vector of random signs and $z = \sum_i \epsilon_i(X_i - \mu)$), we obtain
$$\mathbb{E}_{X,\epsilon}\Big[\Big\|\sum_{i=1}^n \epsilon_i(X_i - \mu)\Big\|_2^k\Big] \le O(1)^k\, \mathbb{E}_{X,\epsilon,\epsilon'}\Big[\Big|\sum_{i=1}^n \epsilon_i \langle X_i - \mu, \epsilon'\rangle\Big|^k\Big]. \qquad (85)$$
Next apply Rosenthal's inequality (Eq. 21), which yields that
$$\mathbb{E}_{X,\epsilon}\Big[\Big|\sum_{i=1}^n \epsilon_i\langle X_i - \mu, \epsilon'\rangle\Big|^k \,\Big|\, \epsilon'\Big] \le O(k)^k \sum_{i=1}^n \mathbb{E}\big[|\langle X_i - \mu, \epsilon'\rangle|^k \mid \epsilon'\big] + O(\sqrt k)^k \Big(\sum_{i=1}^n \mathbb{E}\big[\langle X_i - \mu, \epsilon'\rangle^2\big]\Big)^{k/2} \qquad (86)$$
$$\le O(k)^k \cdot n\sigma^k \|\epsilon'\|_2^k + O(\sqrt{kn})^k \sigma^k \|\epsilon'\|_2^k \qquad (87)$$
$$= O(\sigma k\sqrt d)^k\, n + O(\sigma\sqrt{kd})^k\, n^{k/2}, \qquad (88)$$
where the last step uses that $\|\epsilon'\|_2 = \sqrt d$ and the second-to-last step uses the bounded moments of $X$. As long as $n \gg k^{k/(k-2)}$ the latter term dominates, and hence plugging back into (85) we conclude that
$$\mathbb{E}_{X,\epsilon}\Big[\Big\|\sum_{i=1}^n \epsilon_i(X_i - \mu)\Big\|_2^k\Big]^{1/k} = O(\sigma\sqrt{kdn}). \qquad (89)$$
This bounds the symmetrized truncated moments in (82)-(83) by $O(L\sqrt{kd/n})^k$, and plugging back into (81) completes the proof.
Application: isotropic Gaussians. Next take $G_{gauss}$ to be the family of isotropic Gaussians $N(\mu, I)$. We saw earlier that the modulus $m(G_{gauss}, \epsilon)$ was $O(\epsilon)$ for the mean estimation loss $L(p, \theta) = \|\theta - \mu(p)\|_2$. Thus projecting onto $G_{gauss}$ yields error $O(\epsilon)$ for mean estimation in the limit of infinite samples, but doesn't work for finite samples since the TV distance to $G_{gauss}$ will always be 1.

Instead we will project onto the set $G_{cov}(\sigma) = \{p \mid \|\mathbb{E}[(X - \mu)(X - \mu)^\top]\| \le \sigma^2\}$, for $\sigma^2 = O(1 + d/n + \log(1/\delta)/n)$. We already saw in Lemma 2.23 that when $p^*$ is (sub-)Gaussian the empirical distribution $p^*_n$ lies within this set. But the modulus of $G_{cov}$ only decays as $O(\sqrt\epsilon)$, which is worse than the $O(\epsilon)$ dependence that we had in infinite samples! How can we resolve this issue?
We will let $G_{iso}$ be the family of distributions whose covariance is not only bounded, but close to the identity, and where moreover this holds for all $(1-\epsilon)$-subsets:
$$G_{iso}(\sigma_1, \sigma_2) \stackrel{\text{def}}{=} \Big\{p \,\Big|\, \|\mathbb{E}_r[X - \mu]\|_2 \le \sigma_1\epsilon \text{ and } \|\mathbb{E}_r[(X - \mu)(X - \mu)^\top] - I\| \le (\sigma_2)^2\epsilon,\ \text{whenever } r \le \tfrac{p}{1-\epsilon}\Big\}. \qquad (90)$$
The following improvement on Lemma 2.23 implies that $p^*_n \in G_{iso}(\sigma_1, \sigma_2)$ for $\sigma_1 = O(\sqrt{\log(1/\epsilon)})$ and $\sigma_2 = O(\sqrt{\log(1/\epsilon)})$. [Note: the lemma below is wrong as stated. To be fixed.]
Lemma 2.36. Suppose that $X_1, \dots, X_n$ are drawn independently from a sub-Gaussian distribution with sub-Gaussian parameter $\sigma$, mean 0, and identity covariance. Then, with probability $1 - \delta$ we have
$$\Big\|\frac{1}{|S|}\sum_{i \in S} X_i X_i^\top - I\Big\| \le O\Big(\sigma^2 \cdot \Big(\epsilon\log(1/\epsilon) + \frac{d + \log(1/\delta)}{n}\Big)\Big), \text{ and} \qquad (91)$$
$$\Big\|\frac{1}{|S|}\sum_{i \in S} X_i\Big\|_2 \le O\Big(\sigma \cdot \Big(\epsilon\sqrt{\log(1/\epsilon)} + \sqrt{\frac{d + \log(1/\delta)}{n}}\Big)\Big) \qquad (92)$$
for all subsets $S \subseteq \{1, \dots, n\}$ with $|S| \ge (1 - \epsilon)n$. In particular, if $n \gg d/(\epsilon^2\log(1/\epsilon))$ then we can take $\delta \le \exp(-c\epsilon n\log(1/\epsilon))$ for some constant $c$.
We will return to the proof of Lemma 2.36 later. For now, note that this means that $p^*_n \in G'$ for $G' = G_{iso}(O(\sqrt{\log(1/\epsilon)}), O(\sqrt{\log(1/\epsilon)}))$, at least for large enough $n$. Furthermore, $G' \subset M$ for $M = G_{cov}(1 + O(\epsilon\log(1/\epsilon)))$.
Now we bound the generalized modulus of continuity:

Lemma 2.37. Suppose that $p \in G_{iso}(\sigma_1, \sigma_2)$ and $q \in G_{cov}(\sqrt{1 + \epsilon\sigma_2^2})$, and furthermore $TV(p, q) \le \epsilon$. Then $\|\mu(p) - \mu(q)\|_2 \le O(\sigma_1\epsilon + \sigma_2\epsilon + \epsilon)$.
Proof. Take the midpoint distribution $r = \frac{\min(p,q)}{1-\epsilon}$, and write $q = (1-\epsilon)r + \epsilon q'$. We will bound $\|\mu(r) - \mu(q)\|_2$ (note that $\|\mu(r) - \mu(p)\|_2$ is already bounded since $p \in G_{iso}$). We have that
$$\mathrm{Cov}_q[X] = (1-\epsilon)\,\mathbb{E}_r[(X - \mu_q)(X - \mu_q)^\top] + \epsilon\,\mathbb{E}_{q'}[(X - \mu_q)(X - \mu_q)^\top] \qquad (93)$$
$$= (1-\epsilon)\big(\mathrm{Cov}_r[X] + (\mu_q - \mu_r)(\mu_q - \mu_r)^\top\big) + \epsilon\,\mathbb{E}_{q'}[(X - \mu_q)(X - \mu_q)^\top] \qquad (94)$$
$$\succeq (1-\epsilon)\big(\mathrm{Cov}_r[X] + (\mu_q - \mu_r)(\mu_q - \mu_r)^\top\big) + \epsilon\,(\mu_q - \mu_{q'})(\mu_q - \mu_{q'})^\top. \qquad (95)$$
A computation yields $\mu_q - \mu_{q'} = -\frac{1-\epsilon}{\epsilon}(\mu_q - \mu_r)$. Plugging this into (95) and simplifying, we obtain that
$\frac{\sigma^2}{1-\epsilon} \le 2\sigma^2$ whenever $\|\mathrm{Cov}_{p^*}[X]\| \le \sigma^2$. Therefore, taking $p^* \le \frac{\tilde p}{1-\epsilon}$ is equivalent to the TV corruption model from before for our present purposes.
We will construct an efficient algorithm that, given p̃, outputs a distribution q such that TV(q, p∗ ) ≤ O()
and kCovq [X]k2 ≤ O(σ 2 ). This is similar to the minimum distance functional, in that it finds a distribution
close to p∗ with bounded covariance; the main difference is that q need not be the projection of p̃ onto Gcov ,
and also the covariance of q is bounded by O(σ 2 ) instead of σ 2 . However, the modulus of continuity bound
from before says that any distribution q that is near p∗ and has bounded covariance will approximate the
mean of p∗ . Specifically, we have (by tracking the constants in our previous midpoint argument):
Lemma 2.38. If $TV(p, q) \le \epsilon$, then $\|\mu(p) - \mu(q)\|_2 \le \sqrt{\frac{\epsilon\|\Sigma_p\|}{1-\epsilon}} + \sqrt{\frac{\epsilon\|\Sigma_q\|}{1-\epsilon}}$.
Note that the bound on $\|\mu(p^*) - \mu(q)\|_2$ follows directly from the modulus bound on $G_{cov}(\sigma)$ together with the property $TV(p^*, q) \le \frac{\epsilon}{1-\epsilon}$.
The algorithm, MinCovL2, underlying Proposition 2.39 is given below; it maintains a weighted distribution
q, which places weight qi on point xi . It then computes the weighted mean and covariance, picking the
weights that minimize the norm of the covariance.
Algorithm 2 MinCovL2
1: Input: $x_1, \dots, x_n \in \mathbb{R}^d$.
2: Find any stationary point $q$ of the optimization problem:
$$\min_q\ \sup_{\|v\|_2 \le 1}\ \sum_{i=1}^n q_i \langle v, x_i - \mu_q\rangle^2, \qquad (97)$$
$$\text{s.t.}\quad \mu_q = \sum_i q_i x_i, \qquad q \ge 0,\quad \sum_i q_i = 1,\quad q_i \le \frac{1}{(1-\epsilon)n}.$$
The intuition behind Algorithm 2 is as follows: the constraint $q_i \le \frac{1}{(1-\epsilon)n}$ ensures that $q$ is an $\epsilon$-deletion of the uniform distribution over $x_1, \dots, x_n$. Then, subject to that constraint, Algorithm 2 seeks to minimize the norm of the weighted covariance matrix: note the objective is exactly $\|\Sigma_q\|_2$.
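Algorithm 2 itself requires finding a stationary point of the non-convex problem (97); a simplified filter-style cousin of the same idea (greedily zeroing out the weight of the point most extreme along the top eigendirection of the weighted covariance) is easy to sketch. This is an illustration of the principle, not Algorithm 2, and the data-generating choices below are hypothetical.

```python
# Simplified filter-style cousin of MinCovL2 (not Algorithm 2 itself): instead
# of finding a stationary point of (97), greedily remove the point most extreme
# along the top eigenvector of the current weighted covariance.
import numpy as np

def filter_mean(x, eps):
    n, d = x.shape
    q = np.ones(n) / n
    for _ in range(int(eps * n)):
        mu = q @ x
        cov = (q[:, None] * (x - mu)).T @ (x - mu)
        _, V = np.linalg.eigh(cov)
        v = V[:, -1]                      # top eigendirection of Sigma_q
        scores = ((x - mu) @ v) ** 2
        scores[q == 0] = -np.inf
        q[int(np.argmax(scores))] = 0.0   # delete the most extreme point
        q = q / q.sum()
    return q @ x

rng = np.random.default_rng(0)
n, d, eps = 500, 20, 0.1
x = rng.normal(size=(n, d))
x[: int(eps * n)] += 10.0                 # shift an eps-fraction of points
naive = x.mean(axis=0)
robust = filter_mean(x, 2 * eps)          # allow removing up to 2*eps*n points
print(np.linalg.norm(naive) > 2.0)        # True: naive mean is badly biased
print(np.linalg.norm(robust) < 1.0)       # True: filtered mean is near 0
```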
Algorithm 2 is non-trivial to analyze, because although the constraint set is convex, the objective is
non-convex: both qi and µq are linear in q, and so the overall objective (even for a fixed v) is thus a non-convex
cubic in $q$. On the other hand, for any “reasonable” choice of $q$, $\mu_q$ should be close to $\mu_{p^*}$. If we apply this approximation, substituting $\mu_{p^*}$ for $\mu_q$, then the objective becomes convex again. So the main idea behind the proof is to show that this substitution can be (approximately) done.
Before getting into that, we need to understand what stationary points of (97) look like. In general, a stationary point is one where the gradient is either zero, or where the point is at the boundary of the constraint set and the gradient points outward into the infeasible region for the constraints. However, the supremum over $v$ can lead to a non-differentiable function (e.g. $\max(x_1, x_2)$ is non-differentiable when $x_1 = x_2$), so these conditions need to be refined slightly to handle that. We can use something called a “Clarke subdifferential” to show that the preceding conditions at least hold for some $v$ that maximizes the supremum:

Lemma 2.40. Suppose that $q$ is a stationary point of (97). Then, for any feasible $p$, there exists a $v \in \mathbb{R}^d$ with $\|v\|_2 = 1$ such that
$$\mathbb{E}_q[\langle v, X - \mu_q\rangle^2] \le \mathbb{E}_p[\langle v, X - \mu_q\rangle^2]. \qquad (98)$$
Moreover, $v$ is a maximizer of the left-hand side, i.e. $v^\top \Sigma_q v = \|\Sigma_q\|$.
Proof. Let $F_v(q) = \mathbb{E}_q[\langle v, X - \mu_q\rangle^2]$. First, compute $\nabla F_v(q)$. We have
$$\frac{\partial}{\partial q_i} F_v(q) = \frac{\partial}{\partial q_i} \sum_{j=1}^n q_j \langle v, x_j - \mu_q\rangle^2 \qquad (99)$$
$$= \langle v, x_i - \mu_q\rangle^2 - 2\sum_{j=1}^n q_j \langle v, x_j - \mu_q\rangle\Big\langle v, \frac{\partial \mu_q}{\partial q_i}\Big\rangle \qquad (100)$$
$$= \langle v, x_i - \mu_q\rangle^2, \qquad (101)$$
where the last equality is because $\sum_j q_j(x_j - \mu_q) = 0$. Consequently, $\nabla F_v(q)_i = \langle v, x_i - \mu_q\rangle^2$.

Now, let $F(q) = \max_{\|v\|_2 = 1} F_v(q) = \|\Sigma_q\|$. If the maximizing $v$ is unique and equal to $v^*$, then $\nabla F(q) = \nabla F_{v^*}(q)$, and $q$ is a stationary point if and only if $\sum_i (q_i - p_i)\nabla F_{v^*}(q)_i \le 0$ for all feasible $p$, or equivalently
Suppose (the harder case) that the maximizing $v$ is not unique. Then $F$ is not differentiable at $q$, but the Clarke subdifferential is the convex hull of $\nabla F_v(q)$ for all maximizing $v$'s. Stationarity implies that $\sum_i (q_i - p_i) g_i \le 0$ for some $g$ in this convex hull, and thus by convexity that $\sum_i (q_i - p_i)\nabla F_{v^*}(q)_i \le 0$ for some maximizing $v^*$. This gives us the same desired condition as before and thus completes the proof.
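The gradient computation (99)-(101) can be checked by finite differences along directions that preserve $\sum_i q_i = 1$, where the directional derivative should equal the difference of the per-point values $\langle v, x_i - \mu_q\rangle^2$. A small sketch (illustrative, with random data):

```python
# Finite-difference check of (99)-(101): along the direction e_1 - e_2 (which
# preserves sum(q) = 1), the directional derivative of F_v(q) equals
# <v, x_1 - mu_q>^2 - <v, x_2 - mu_q>^2. Illustrative sketch.
import numpy as np

def F(q, x, v):
    mu = q @ x
    return q @ ((x - mu) @ v) ** 2

rng = np.random.default_rng(1)
n, d = 6, 3
x = rng.normal(size=(n, d))
q = np.ones(n) / n
v = rng.normal(size=d)

h = 1e-6
dq = np.zeros(n)
dq[0], dq[1] = 1.0, -1.0                  # direction staying on sum(q) = 1
numeric = (F(q + h * dq, x, v) - F(q - h * dq, x, v)) / (2 * h)
mu = q @ x
analytic = ((x[0] - mu) @ v) ** 2 - ((x[1] - mu) @ v) ** 2
print(abs(numeric - analytic) < 1e-4)     # True
```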
Given Lemma 2.40, we are in a better position to analyze Algorithm 2. In particular, for any $p$ (we will eventually take the global minimizer of (97)), Lemma 2.40 yields

The $\|\mu_p - \mu_q\|_2^2$ term quantifies the “error due to non-convexity”: recall that if we replace $\mu_q$ with a fixed $\mu_p$ in (97), the problem becomes convex, and hence any stationary point would be a global minimizer. The distance $\|\mu_p - \mu_q\|_2^2$ is how much we pay for this discrepancy.
Fortunately, $\mu_p - \mu_q$ is small, precisely due to the modulus of continuity! We can show that any feasible $p, q$ for (97) satisfy $TV(p, q) \le \frac{\epsilon}{1-\epsilon}$ (see Lemma 2.41), hence Lemma 2.38 gives $\|\mu_p - \mu_q\|_2 \le \sqrt{\frac{\epsilon\|\Sigma_p\|}{1-2\epsilon}} + \sqrt{\frac{\epsilon\|\Sigma_q\|}{1-2\epsilon}}$. Plugging back in to (105), we obtain
$$\|\mathrm{Cov}_q\| \le \|\mathrm{Cov}_p\| + \frac{\epsilon}{1-2\epsilon}\Big(\sqrt{\|\Sigma_p\|} + \sqrt{\|\Sigma_q\|}\Big)^2. \qquad (106)$$
For fixed $\|\mathrm{Cov}_p\|$, we can view this as a quadratic inequality in $\sqrt{\|\Sigma_q\|}$. Solving the quadratic then yields
$$\|\mathrm{Cov}_q\| \le \frac{1-2\epsilon}{1-3\epsilon}\,\|\mathrm{Cov}_p\|. \qquad (107)$$
In particular, taking $p$ to be the global minimizer of (97), we have $\|\mathrm{Cov}_p\| \le \|\mathrm{Cov}_{p^*}\| \le \sigma^2$, so $\|\mathrm{Cov}_q\| \le \frac{1-2\epsilon}{1-3\epsilon}\sigma^2$. Plugging back into Lemma 2.38 again, we then have
$$\|\mu_q - \mu_{p^*}\|_2 \le \sqrt{\frac{\epsilon}{1-2\epsilon}}\,\sigma + \sqrt{\frac{\epsilon}{1-2\epsilon}}\cdot\sqrt{\frac{1-2\epsilon}{1-3\epsilon}}\,\sigma \le 2\sqrt{\frac{\epsilon}{1-3\epsilon}}\,\sigma, \qquad (108)$$
which proves Proposition 2.39.
• µ is equal to µq
The distribution q clearly satisfies this: we have µq = b/2, and both 0 and b are closer to µq than −a is.
This also shows why the breakdown point is 1/3 and not smaller. If $\epsilon$ were slightly smaller than 1/3, then some of the mass of $q$ would have to remain on $\delta_{-a}$. Then as $b$ increases, the mean $\mu_q$ would increase more slowly, and eventually $-a$ would be closer to $\mu_q$ than $b$.
Lemma 2.41. Suppose that $q, q'$ are both $\epsilon$-deletions of a distribution $p$. Then $TV(q, q') \le \frac{\epsilon}{1-\epsilon}$.

Proof. Conceptually, the reason Lemma 2.41 is true is that $q'$ can be obtained from $q$ by first adding an $\epsilon$-fraction of points (to get to $p$), then deleting an $\epsilon$-fraction. Since $TV \le \epsilon$ exactly allows an $\epsilon$-fraction of additions and deletions, this should yield the result. The reason we get $\frac{\epsilon}{1-\epsilon}$ is because $q$ and $q'$ are only a $(1-\epsilon)$-“fraction” of $p$, so the $\epsilon$-deletions are more like $\frac{\epsilon}{1-\epsilon}$-deletions relative to $q$ and $q'$.
[Lecture 9]
As some examples, the dual of the `2 -norm is itself, the dual of the `1 -norm is the `∞ -norm, and the dual
of the `∞ -norm is the `1 -norm. An important property (we omit the proof) is that the dual of the dual is the
original norm:
Proposition 2.43. If k · k is a norm on a finite-dimensional vector space, then k · k∗∗ = k · k.
For a more complex example: let kvk(k) be the sum of the k largest coordinates of v (in absolute value).
Then the dual of k · k(k) is max(kuk∞ , kuk1 /k). This can be seen by noting that the vertices of the constraint
set {u | kuk∞ ≤ 1, kuk1 ≤ k} are exactly the k-sparse {−1, 0, +1}-vectors.
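The claimed duality for the top-$k$ norm can be verified by brute force over the vertices of the constraint set in small dimension (an illustrative check, not from the notes):

```python
# Brute-force check that the dual of ||v||_(k) (sum of k largest |v_i|) has
# unit ball {||u||_inf <= 1, ||u||_1 <= k}: the dual-norm sup over that
# polytope is attained at k-sparse sign vectors. Illustrative small example.
from itertools import combinations, product

def topk_norm(v, k):
    return sum(sorted((abs(x) for x in v), reverse=True)[:k])

def dual_ball_sup(v, k):
    """sup of <u, v> over the k-sparse {-1, 0, +1} vertices."""
    d = len(v)
    best = 0.0
    for support in combinations(range(d), k):
        for signs in product([-1, 1], repeat=k):
            u = [0.0] * d
            for idx, s in zip(support, signs):
                u[idx] = float(s)
            best = max(best, sum(ui * vi for ui, vi in zip(u, v)))
    return best

v, k = [3.0, -1.0, 0.5, -2.0], 2
print(abs(dual_ball_sup(v, k) - topk_norm(v, k)) < 1e-9)   # True (both are 5.0)
```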
Let $G_{cov}(\sigma, \|\cdot\|)$ denote the family of distributions satisfying $\max_{\|v\|_* \le 1} v^\top \mathrm{Cov}_p[X]\, v \le \sigma^2$. Then $G_{cov}$ is resilient exactly analogously to the $\ell_2$-case:

Proposition 2.44. If $p \in G_{cov}(\sigma, \|\cdot\|)$ and $r \le \frac{p}{1-\epsilon}$, then $\|\mu(r) - \mu(p)\| \le \sqrt{\frac{2\epsilon}{1-\epsilon}}\,\sigma$. In other words, all distributions in $G_{cov}(\sigma, \|\cdot\|)$ are $(O(\sigma\sqrt\epsilon), \epsilon)$-resilient.

Proof. We have that $\|\mu(r) - \mu(p)\| = \langle \mu(r) - \mu(p), v\rangle$ for some vector $v$ with $\|v\|_* = 1$. The result then follows by resilience for the one-dimensional distribution $\langle X, v\rangle$ for $X \sim p$.
When $p^* \in G_{cov}(\sigma, \|\cdot\|)$, we will design efficient algorithms analogous to Algorithm 2. The main difficulty is that in norms other than $\ell_2$, it is generally not possible to exactly solve the optimization problem $\max_{\|v\|_* \le 1} v^\top \hat\Sigma v$ that is used in Algorithm 2. We instead make use of a $\kappa$-approximate relaxation:
Definition 2.45. A set $C$ of positive semidefinite matrices is a $\kappa$-approximate relaxation if $C$ contains $vv^\top$ for all $v$ satisfying $\|v\|_* \le 1$, and furthermore
$$\sup_{M \in C} \langle M, \Sigma\rangle \le \kappa \sup_{\|v\|_* \le 1} \langle vv^\top, \Sigma\rangle \quad \text{for all } \Sigma \succeq 0.$$
Thus a $\kappa$-approximate relaxation over-approximates $\langle vv^\top, \Sigma\rangle$, but also it underapproximates it within a factor of $\kappa$. Given such a relaxation, we have the following analog to Algorithm 2:
Algorithm 3 MinCovNorm
1: Input: $x_1, \dots, x_n \in \mathbb{R}^d$ and $\kappa$-approximate relaxation $C$.
2: Find any stationary point $q$ of the optimization problem:
$$\min_q\ \sup_{M \in C}\ \sum_{i=1}^n q_i (x_i - \mu_q)^\top M (x_i - \mu_q), \qquad (117)$$
$$\text{s.t.}\quad \mu_q = \sum_i q_i x_i, \qquad q \ge 0,\quad \sum_i q_i = 1,\quad q_i \le \frac{1}{(1-\epsilon)n}.$$
Algorithm 3 outputs an estimate of the mean with error $O(\sigma\sqrt{\kappa\epsilon})$. The proof is almost exactly the same as for Algorithm 2; the main difference is that we need to ensure that $\langle \Sigma_p, M\rangle$, the inner product of $M$ with the true covariance, is not too large. This is where we use the $\kappa$-approximation property. We leave the detailed proof as an exercise, and focus on how to construct a $\kappa$-approximate relaxation $C$.
Semidefinite programming. As a concrete example, suppose that we wish to estimate $\mu$ in the $\ell_1$-norm $\|v\| = \sum_{j=1}^d |v_j|$. The dual norm is the $\ell_\infty$-norm, and hence our goal is to approximately solve the optimization problem
$$\text{maximize } v^\top \Sigma v \ \text{ subject to } \|v\|_\infty \le 1. \qquad (118)$$
The issue with (118) is that it is not concave in $v$ because of the quadratic function $v^\top \Sigma v$. However, note that $v^\top \Sigma v = \langle \Sigma, vv^\top \rangle$. Therefore, if we replace $v$ with the variable $M = vv^\top$, then we can re-express the optimization problem as
$$\text{maximize } \langle \Sigma, M \rangle \quad \text{subject to } M_{jj} \le 1 \text{ for all } j, \quad M \succeq 0, \quad \mathrm{rank}(M) = 1. \quad (119)$$
Here the first constraint is a translation of $\|v\|_\infty \le 1$, while the latter two constrain $M$ to be of the form $vv^\top$.
This is almost convex in $M$, except for the constraint $\mathrm{rank}(M) = 1$. If we omit this constraint, we obtain the optimization
$$\text{maximize } \langle \Sigma, M \rangle \quad \text{subject to } M_{jj} = 1 \text{ for all } j, \quad M \succeq 0. \quad (120)$$
Note that here we replace the constraint $M_{jj} \le 1$ with $M_{jj} = 1$; this can be done because the maximizer of (120) will always have $M_{jj} = 1$ for all $j$. For brevity we often write this constraint as $\mathrm{diag}(M) = 1$.
The problem (120) is a special instance of a semidefinite program and can be solved in polynomial time (in
general, a semidefinite program allows arbitrary linear inequality or positive semidefinite constraints between
linear functions of the decision variables; we discuss this more below).
The optimizer $M^*$ of (120) will always satisfy $\langle \Sigma, M^* \rangle \ge \sup_{\|v\|_\infty \le 1} v^\top \Sigma v$, because any $v$ with $\|v\|_\infty \le 1$ yields a feasible $M$. The key is to show that it is not too much larger than this. This turns out to be a fundamental fact in the theory of optimization called Grothendieck's inequality:

Theorem 2.46. If $\Sigma \succeq 0$, then the value of (120) is at most $\frac{\pi}{2} \sup_{\|v\|_\infty \le 1} v^\top \Sigma v$.
See Alon and Naor (2004) for a very well-written exposition of Grothendieck's inequality and its relation to optimization algorithms. There we also see that a version of Theorem 2.46 holds even when $\Sigma$ is not positive semidefinite, or indeed not even square. Here we give a proof based on [todo: cite] for the semidefinite case.
Proof of Theorem 2.46. The proof involves two key relations. To describe the first, given a matrix $X$ let $\arcsin[X]$ denote the matrix whose $i,j$ entry is $\arcsin(X_{ij})$ (i.e. we apply $\arcsin$ element-wise). Then we have (we will show this later)
$$\max_{\|v\|_\infty \le 1} v^\top \Sigma v = \max_{X \succeq 0, \, \mathrm{diag}(X) = 1} \frac{2}{\pi} \langle \Sigma, \arcsin[X] \rangle. \quad (121)$$
The second relation is that
$$\arcsin[X] \succeq X \quad \text{whenever } X \succeq 0 \text{ and } \mathrm{diag}(X) = 1. \quad (122)$$
Given these, the theorem follows: since $\Sigma \succeq 0$ and $\arcsin[X] \succeq X$, the value of (120) satisfies $\max_X \langle \Sigma, X \rangle \le \max_X \langle \Sigma, \arcsin[X] \rangle = \frac{\pi}{2} \max_{\|v\|_\infty \le 1} v^\top \Sigma v$.
To establish (121), we use the following randomized rounding: given $X \succeq 0$ with $\mathrm{diag}(X) = 1$, write $X_{ij} = \langle u_i, u_j \rangle$ for unit vectors $u_1, \ldots, u_d$, sample $g \sim \mathcal{N}(0, I)$, and set $v_i = \mathrm{sign}(\langle g, u_i \rangle)$. If $\theta$ is the angle between $u_i$ and $u_j$, then $v_i \ne v_j$ with probability $\theta/\pi$, so
$$\mathbb{E}_g[v_i v_j] = \Big(1 - \frac{\theta}{\pi}\Big) - \frac{\theta}{\pi} = \frac{2}{\pi}\Big(\frac{\pi}{2} - \theta\Big) = \frac{2}{\pi} \arcsin(\langle u_i, u_j \rangle), \quad (124)$$
as desired. Therefore, we can always construct a distribution over $v$ for which $\mathbb{E}[v^\top \Sigma v] = \frac{2}{\pi} \langle \Sigma, \arcsin[X] \rangle$, hence the right-hand side of (121) is at most the left-hand side. For the other direction, note that the maximizing $v$ on the left-hand side is always a $\{-1, +1\}$ vector by convexity of $v^\top \Sigma v$, and for any such vector we have $\frac{2}{\pi} \arcsin[vv^\top] = vv^\top$. Thus the left-hand side is at most the right-hand side, and so the equality (121) indeed holds.
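The rounding identity (124) can be checked by simulation. The following sketch (illustrative only; the vectors, inner product, and sample size are arbitrary choices) estimates $\mathbb{E}[v_i v_j]$ for the sign rounding and compares it to $\frac{2}{\pi}\arcsin(\langle u_i, u_j \rangle)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unit vectors with a known inner product rho.
rho = 0.6
u1 = np.array([1.0, 0.0])
u2 = np.array([rho, np.sqrt(1 - rho**2)])

# Sign rounding: v_i = sign(<g, u_i>) for Gaussian g.
g = rng.standard_normal((500_000, 2))
v1 = np.sign(g @ u1)
v2 = np.sign(g @ u2)

lhs = np.mean(v1 * v2)               # Monte Carlo estimate of E[v1 v2]
rhs = (2 / np.pi) * np.arcsin(rho)   # (2/pi) arcsin(<u1, u2>)
assert abs(lhs - rhs) < 0.01
print(lhs, rhs)
```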
We now turn our attention to establishing (122). For this, let $X^{\circ k}$ denote the matrix whose $i,j$ entry is $X_{ij}^k$ (we take element-wise powers). We require the following lemma:
Lemma 2.47. For all $k \in \{1, 2, \ldots\}$, if $X \succeq 0$ then $X^{\circ k} \succeq 0$.

Proof. The matrix $X^{\circ k}$ is a principal submatrix of $X^{\otimes k}$, where $(X^{\otimes k})_{i_1 \cdots i_k, j_1 \cdots j_k} = X_{i_1 j_1} \cdots X_{i_k j_k}$. We can verify that $X^{\otimes k} \succeq 0$ (its eigenvalues are $\lambda_{i_1} \cdots \lambda_{i_k}$, where the $\lambda_i$ are the eigenvalues of $X$), hence so is $X^{\circ k}$, since principal submatrices of PSD matrices are PSD.
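Lemma 2.47 is easy to test numerically. The sketch below (with an arbitrarily chosen random PSD matrix) checks that element-wise powers preserve positive semidefiniteness:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
X = A @ A.T                                    # an arbitrary PSD matrix

for k in range(1, 6):
    Xk = X**k                                  # element-wise (Hadamard) power
    eigmin = np.linalg.eigvalsh(Xk).min()
    assert eigmin >= -1e-8 * np.abs(Xk).max()  # PSD up to numerical error
```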
With this in hand, we also make use of the Taylor series for $\arcsin$:
$$\arcsin(z) = \sum_{n=0}^\infty \frac{(2n)!}{(2^n n!)^2 (2n+1)} z^{2n+1} = z + \frac{z^3}{6} + \cdots.$$
Then we have
$$\arcsin[X] = X + \sum_{n=1}^\infty \frac{(2n)!}{(2^n n!)^2 (2n+1)} X^{\circ(2n+1)} \succeq X, \quad (125)$$
from which the claim follows. Here the key step is that for rank-one matrices the Hadamard (element-wise) product $\circ$ behaves nicely: $(u_i u_i^\top) \circ (v_j v_j^\top) = (u_i \circ v_j)(u_i \circ v_j)^\top$. (Here $\langle X, Y \rangle = \mathrm{tr}(X^\top Y) = \sum_{ij} X_{ij} Y_{ij}$ is the inner product between matrices, which is the same as the Euclidean inner product of $X$ and $Y$ flattened into vectors.)
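The conclusion $\arcsin[X] \succeq X$ can likewise be spot-checked numerically for a random Gram matrix with unit diagonal (an illustrative sketch; the dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((8, 8))
U /= np.linalg.norm(U, axis=1, keepdims=True)
# Gram matrix of unit vectors: PSD with diag(X) = 1. Clip to [-1, 1] so that
# floating-point round-off cannot push entries outside arcsin's domain.
X = np.clip(U @ U.T, -1.0, 1.0)

gap = np.arcsin(X) - X             # element-wise arcsin
eigmin = np.linalg.eigvalsh(gap).min()
assert eigmin >= -1e-10            # arcsin[X] - X is PSD
```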
While (129) is the canonical form for a semidefinite program, problems that are seemingly more complex can be reduced to this form. For one, we can add linear equality constraints as two-sided inequality constraints. In addition, we can replace $X \succeq 0$ with $L(X) \succeq 0$ for any linear function $L$, by using linear equality constraints to enforce the linear relations implied by $L$. Finally, we can actually include any number of constraints $L_1(X) \succeq 0, \ldots, L_k(X) \succeq 0$, since this is e.g. equivalent to the single constraint
$$\begin{bmatrix} L_1(X) & 0 \\ 0 & L_2(X) \end{bmatrix} \succeq 0$$
when $k = 2$. As an example of these observations, the following (arbitrarily chosen) family of constraints also defines a semidefinite program (for any linear objective):
$$\text{subject to } M + Y \succeq \Sigma, \quad \mathrm{diag}(M) = 1, \quad \mathrm{tr}(Y) \le 1, \quad Y \succeq 0, \quad \begin{bmatrix} 1 & x^\top \\ x & M \end{bmatrix} \succeq 0.$$
(As a brief aside, the constraint $\begin{bmatrix} 1 & x^\top \\ x & M \end{bmatrix} \succeq 0$ is equivalent to $xx^\top \preceq M$, which is in turn equivalent to $x^\top M^{-1} x \le 1$ and $M \succeq 0$.)
For example, for a symmetric matrix $M$ (such as the second-moment matrix of a distribution $p$), the maximum of $v^\top M v$ over unit vectors is the value of the semidefinite program
$$\text{minimize } \lambda \quad \text{subject to } \lambda I - M \succeq 0 \quad (\text{equivalently, } v^\top (\lambda I - M) v \ge 0 \text{ for all } v). \quad (131)$$
We thus begin to see a relationship between moments and polynomial non-negativity constraints. Now consider the fourth-moment tensor $M_4$, with entries $(M_4)_{ijkl} = \mathbb{E}[(x - \mu)_i (x - \mu)_j (x - \mu)_k (x - \mu)_l]$; note that we can equivalently express $\langle M_4, v^{\otimes 4} \rangle = \sum_{ijkl} (M_4)_{ijkl} v_i v_j v_k v_l$.
A distribution $p$ has $4$th moment bounded by $\lambda$ if and only if $\langle M_4, v^{\otimes 4} \rangle \le \lambda \|v\|_2^4$ for all $v$. Letting $p_M(v) \stackrel{\mathrm{def}}{=} \langle M_4, v^{\otimes 4} \rangle$, we can thus express the $4$th moment of $p$ as the polynomial program
$$\text{minimize } \lambda \quad \text{subject to } \lambda (v_1^2 + \cdots + v_d^2)^2 - p_M(v) \ge 0 \text{ for all } v \in \mathbb{R}^d. \quad (133)$$
Unfortunately, in contrast to (130), (133) is NP-hard to solve in general. We will next see a way to approximate (133) via a technique called sum-of-squares programming, which is a way of approximately reducing polynomial programs such as (133) to a large but polynomial-size semidefinite program.
Warm-up: certifying non-negativity over $\mathbb{R}$. Consider the one-dimensional polynomial
$$q(x) = 2x^4 + 2x^3 - x^2 + 5. \quad (134)$$
Is it the case that $q(x) \ge 0$ for all $x$? If so, how would we check this?
What if I told you that we had
$$q(x) = \frac{1}{2}(2x^2 + x - 3)^2 + \frac{1}{2}(3x + 1)^2? \quad (135)$$
Then, it is immediate that q(x) ≥ 0 for all x, since it is a (weighted) sum of squares.
How can we construct such decompositions of $q$? First observe that we can re-write $q$ as the matrix function
$$q(x) = \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix}^\top \underbrace{\begin{bmatrix} 5 & 0 & 0 \\ 0 & -1 & 1 \\ 0 & 1 & 2 \end{bmatrix}}_{M} \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix}. \quad (136)$$
On the other hand, the sum-of-squares decomposition for $q$ implies that we can also write
$$q(x) = \frac{1}{2} \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix}^\top \begin{bmatrix} -3 \\ 1 \\ 2 \end{bmatrix} \begin{bmatrix} -3 \\ 1 \\ 2 \end{bmatrix}^\top \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix} + \frac{1}{2} \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix}^\top \begin{bmatrix} 1 \\ 3 \\ 0 \end{bmatrix} \begin{bmatrix} 1 \\ 3 \\ 0 \end{bmatrix}^\top \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix}, \quad (137)$$
i.e. we can decompose the matrix $M$ defining $q(x) = [1;\, x;\, x^2]^\top M [1;\, x;\, x^2]$ into a non-negative combination of rank-one outer products, which is possible if and only if $M \succeq 0$.
There is one problem with this: despite our successful decomposition of $q$, $M$ is self-evidently not positive semidefinite! (For instance, $M_{22} = -1$.) The issue is that the matrix $M$ defining $q(x)$ is not unique. Indeed, any
$$M(a) = \begin{bmatrix} 5 & 0 & -a \\ 0 & 2a - 1 & 1 \\ -a & 1 & 2 \end{bmatrix}$$
would give rise to the same $q(x)$, and a sum-of-squares decomposition merely implies that $M(a) \succeq 0$ for some $a$. Thus, we obtain the following characterization:
$$q(x) \text{ is a sum of squares } \sum_{j=1}^k q_j(x)^2 \iff M(a) \succeq 0 \text{ for some } a \in \mathbb{R}. \quad (138)$$
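The warm-up example can be verified numerically. The sketch below checks the decomposition (135), checks that every $M(a)$ represents $q$, and confirms that $M(0)$ (the original $M$) is not PSD while $M(3)$, the choice induced by (137), is:

```python
import numpy as np

xs = np.linspace(-3.0, 3.0, 101)
q = 2 * xs**4 + 2 * xs**3 - xs**2 + 5          # q(x) from (134)
sos = 0.5 * (2 * xs**2 + xs - 3) ** 2 + 0.5 * (3 * xs + 1) ** 2
assert np.allclose(q, sos)                      # the decomposition (135)

def M(a):
    # The family of matrices representing q in the basis [1; x; x^2].
    return np.array([[5.0, 0.0, -a], [0.0, 2 * a - 1.0, 1.0], [-a, 1.0, 2.0]])

z = np.stack([np.ones_like(xs), xs, xs**2])
for a in (0.0, 3.0):
    assert np.allclose(np.einsum("in,ij,jn->n", z, M(a), z), q)

# M(0) is not PSD, but M(3) is (it is exactly the matrix built from (137)).
assert np.linalg.eigvalsh(M(0.0)).min() < 0
assert np.linalg.eigvalsh(M(3.0)).min() >= -1e-10
```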
for any $b, b', b'', c, c', c''$. Call the above expression $M(b, b', b'', c, c', c'')$, which is linear in each of its variables. Then we have $q \succeq_{\mathrm{sos}} 0$ if and only if $M(b, b', b'', c, c', c'') \succeq 0$ for some setting of the $b$s and $c$s.
Sum-of-squares in arbitrary dimensions. In general, if we have a polynomial $q(x_1, \ldots, x_d)$ in $d$ variables which has degree $2t$, then we can embed it as some matrix $M(b)$ (for decision variables $b$ that capture the symmetries in $M$, as above), and the dimensionality of $M$ will be the number of monomials of degree at most $t$, which turns out to be $\binom{d+t}{t} = O((d+t)^t)$.
The upshot is that any constraint of the form $q \succeq_{\mathrm{sos}} 0$, where $q$ is linear in the decision variables, is a semidefinite constraint in disguise. Thus, we can solve any program of the form
$$\text{minimize } c^\top y \quad \text{subject to } q_j(x; y) \succeq_{\mathrm{sos}} 0 \text{ for } j = 1, \ldots, m,$$
where the $q_j$ are linear in the decision variables $y$. (And we are free to throw in any additional linear inequality or semidefinite constraints as well.) We refer to such optimization problems as sum-of-squares programs, in analogy to semidefinite programs.
Sum-of-squares for $k$th moments. Return again to the $k$th moment problem. As a polynomial program, we sought to minimize $\lambda$ such that $\lambda(v_1^2 + \cdots + v_d^2)^k - \langle M_{2k}, v^{\otimes 2k} \rangle$ was a non-negative polynomial. It is then natural to replace the non-negativity constraint with the constraint that $\lambda \|v\|_2^{2k} - \langle M_{2k}, v^{\otimes 2k} \rangle \succeq_{\mathrm{sos}} 0$. However, we actually have a bit more flexibility, and it turns out that the best program to use is
$$\text{minimize } \lambda \quad \text{subject to } \lambda - \langle M_{2k}, v^{\otimes 2k} \rangle + (\|v\|_2^2 - 1)\, q(v) \succeq_{\mathrm{sos}} 0 \text{ for some } q \text{ of degree at most } 2k - 2. \quad (141)$$
Note that the family of all such $q$ can be linearly parameterized, and so the above is indeed a sum-of-squares program. It is always at least as good as the previous program because we can take $q(v) = \lambda(1 + \|v\|_2^2 + \cdots + \|v\|_2^{2k-2})$.
When the solution $\lambda^*$ to (141) is at most $\sigma^{2k}$ for $M_{2k}(p)$, we say that $p$ has $2k$th moment certifiably bounded by $\sigma^{2k}$. In this case a variation on the filtering algorithm achieves error $O(\sigma \epsilon^{1 - 1/2k})$. We will not discuss this in detail, but the main issue we need to resolve to obtain a filtering algorithm is to find some appropriate tensor $T$ such that $\langle T, M_{2k} \rangle = \lambda^*$ and $T$ "looks like" the expectation of $v^{\otimes 2k}$ for some probability distribution over the unit sphere. Then we can filter using $\tau_i = \langle T, (x_i - \hat{\mu})^{\otimes 2k} \rangle$.
To obtain T requires computing the dual of (141), which requires more optimization theory than we
have assumed from the reader, but it can be done in polynomial time. We refer to the corresponding T
as a pseudomoment matrix. Speaking very roughly, T has all properties of a moment matrix that can be
“proved using only sum-of-squares inequalities”, which includes all properties that we needed for the filtering
algorithm to work. We will henceforth ignore the issue of T and focus on assumptions on p that ensure that
M2k (p) is certifiably bounded. The main such assumption is the Poincaré inequality, which we cover in the
next section.
• If $p_1, p_2 \succeq_{\mathrm{sos}} 0$, then $p_1 p_2 \succeq_{\mathrm{sos}} 0$.
• If $p_1 \preceq_{\mathrm{sos}} p_2$, $q_1 \preceq_{\mathrm{sos}} q_2$, and $p_2, q_1 \succeq_{\mathrm{sos}} 0$, then $p_1 q_1 \preceq_{\mathrm{sos}} p_2 q_2$.
• Moreover, many "standard" inequalities such as Cauchy-Schwarz and Hölder have sum-of-squares proofs.
Using these, we can often turn a normal proof that $p \le q$ into a sum-of-squares proof that $p \preceq_{\mathrm{sos}} q$, as long as we give sum-of-squares proofs for a small number of key steps.
For concreteness, we will prove the first two properties above. We first prove that $p_1, p_2 \succeq_{\mathrm{sos}} 0 \implies p_1 p_2 \succeq_{\mathrm{sos}} 0$. Indeed we have
$$p_1(v) p_2(v) = \Big(\sum_i p_{1i}(v)^2\Big)\Big(\sum_j p_{2j}(v)^2\Big) = \sum_{i,j} (p_{1i}(v)\, p_{2j}(v))^2 \succeq_{\mathrm{sos}} 0. \quad (142)$$
Next we prove that $p_1 \preceq_{\mathrm{sos}} p_2$, $q_1 \preceq_{\mathrm{sos}} q_2$, and $p_2, q_1 \succeq_{\mathrm{sos}} 0$ together imply $p_1 q_1 \preceq_{\mathrm{sos}} p_2 q_2$. This is because
$$p_2 q_2 - p_1 q_1 = p_2 (q_2 - q_1) + (p_2 - p_1) q_1 \succeq_{\mathrm{sos}} 0, \quad (143)$$
where the second relation uses $p_2, q_2 - q_1 \succeq_{\mathrm{sos}} 0$ and $p_2 - p_1, q_1 \succeq_{\mathrm{sos}} 0$ together with the previous result.
In view of this, we can reframe bounding (141) as the following goal:

Goal: Find a sum-of-squares proof that $\langle M_{2k}(p), v^{\otimes 2k} \rangle \preceq_{\mathrm{sos}} \lambda \|v\|_2^{2k}$.
Certifiability for Gaussians. We now return to the assumptions needed on $p$ that will enable us to provide the desired sum-of-squares proof. Let us start by observing that a sum-of-squares proof exists for any Gaussian distribution: If $p = \mathcal{N}(\mu, \Sigma)$, then
$$\langle M_{2k}(\mathcal{N}(\mu, \Sigma)), v^{\otimes 2k} \rangle = \langle M_{2k}(\mathcal{N}(0, I)), (\Sigma^{1/2} v)^{\otimes 2k} \rangle \quad (144)$$
$$= \prod_{i=1}^k (2i - 1) \cdot \langle I, (\Sigma^{1/2} v)^{\otimes 2k} \rangle \quad (145)$$
$$= \prod_{i=1}^k (2i - 1) \cdot \|\Sigma^{1/2} v\|_2^{2k} \quad (146)$$
$$\le (2k)^k \|\Sigma\|^k \|v\|_2^{2k}, \quad (147)$$
so we may take $\lambda = (2k \|\Sigma\|)^k$. (Here $I$ denotes the $k$-fold tensor power of the identity matrix, viewed as a $2k$-tensor, so that $\langle I, u^{\otimes 2k} \rangle = \|u\|_2^{2k}$.) Therefore normal distributions have certifiably bounded moments, but the proof above heavily
exploited the rotational symmetry of a normal distribution. We can provide similar proofs for other highly
symmetric distributions (such as the uniform distribution on the hypercube), but these are unsatisfying as
they only apply under very specific distributional assumptions. We would like more general properties that
yield certifiably bounded moments.
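For $k = 2$ the Gaussian computation above can be checked exactly. The sketch below (dimension and test vector are arbitrary choices) builds $M_4$ for $\mathcal{N}(0, I)$ via Wick's theorem and verifies the value $3\|v\|_2^4$ from (146), well within the certified bound $\lambda = (2k\|\Sigma\|)^k = 16$:

```python
import numpy as np

d = 5
rng = np.random.default_rng(0)

# M4 for N(0, I) by Wick's theorem:
# E[x_i x_j x_k x_l] = d_ij d_kl + d_ik d_jl + d_il d_jk.
I = np.eye(d)
M4 = (np.einsum("ij,kl->ijkl", I, I)
      + np.einsum("ik,jl->ijkl", I, I)
      + np.einsum("il,jk->ijkl", I, I))

v = rng.standard_normal(d)
m4 = np.einsum("ijkl,i,j,k,l->", M4, v, v, v, v)

# <M4, v^{tensor 4}> = 3 ||v||^4 (the product of (2i - 1) for i <= 2 is 3),
# which sits well within the bound lambda = 16.
assert np.isclose(m4, 3 * np.linalg.norm(v) ** 4)
assert m4 <= 16 * np.linalg.norm(v) ** 4
```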
Poincaré inequality. The property we will use is the Poincaré inequality. A distribution $p$ on $\mathbb{R}^d$ is said to satisfy the Poincaré inequality with parameter $\sigma$ if
$$\mathrm{Var}_p[f(X)] \le \sigma^2 \, \mathbb{E}_p[\|\nabla f(X)\|_2^2] \quad (148)$$
for all differentiable functions $f : \mathbb{R}^d \to \mathbb{R}$. This is a "global to local" property: it says that for any function $f$ that varies under $p$, that variation can be picked up in terms of local variation (the gradient). In particular, it says that $p$ doesn't have any "holes" (regions with low probability density that lie between two regions of high probability density). Indeed, suppose that $A$ and $B$ were two disjoint convex regions with $p(A) = p(B) = \frac{1}{2}$. Then $p$ cannot satisfy the Poincaré inequality with any constant, since there is a function that is $1$ on $A$, $0$ on $B$, and constant on both $A$ and $B$: it has nonzero variance but (almost surely) zero gradient.
Below are some additional examples and properties of Poincaré distributions:
• A one-dimensional Gaussian $\mathcal{N}(\mu, \sigma^2)$ is Poincaré with constant $\sigma$.
• If $p, q$ are $\sigma$-Poincaré then their product $p \times q$ is $\sigma$-Poincaré. In particular a multivariate Gaussian $\mathcal{N}(\mu, \sigma^2 I)$ is $\sigma$-Poincaré.
• If $X \sim p$ is $\sigma$-Poincaré and $A$ is a linear map, then $AX$ is $(\sigma \|A\|)$-Poincaré. In particular, $a X_1 + b X_2$ is $(\sqrt{a^2 + b^2}\, \sigma)$-Poincaré when $X_1$ and $X_2$ are independent and both $\sigma$-Poincaré, and $\mathcal{N}(\mu, \Sigma)$ is $\|\Sigma\|^{1/2}$-Poincaré.
• More generally, if $X \sim p$ is $\sigma$-Poincaré and $f$ is $L$-Lipschitz, then $f(X)$ is $(\sigma L)$-Poincaré.
Together these imply that Poincaré distributions contain multivariate Gaussians, arbitrary Lipschitz functions
of Gaussians, and independent sums of such distributions. The above properties (except the initial Gaussian
property) are all straightforward computations. Let us next state two substantially deeper results:
• If $p$ is $\sigma$-strongly log-concave (meaning that the log-probability density satisfies $\nabla^2 \log p(x) \preceq -\frac{1}{\sigma^2} I$), then $p$ is $\sigma$-Poincaré (Bakry and Émery, 1985).
• Suppose that the support of $X \sim p$ has $\ell_2$-radius at most $R$, and let $Z = \mathcal{N}(0, \tau^2 I)$ for $\tau \ge 2R$. Then $X + Z$ is $(\tau\sqrt{e})$-Poincaré (Bardet et al., 2018).
Thus Poincaré encompasses all strongly log-concave densities, and effectively any product of bounded random
variables (after adding Gaussian noise, which we can always do ourselves).
It is instructive to compare Poincaré to the sub-Gaussian property that we have so far relied on. Poincaré is neither strictly stronger nor weaker than sub-Gaussian, but it is stronger than sub-exponential (we will see this below). In general, we should think of Poincaré as being substantially stronger than sub-exponential: it implies that not only is the distribution itself sub-exponential, but so is any Lipschitz function of $X$.
As an example, consider the random variable $(X, Y) \in \mathbb{R}^{2d}$ where $X \sim \mathcal{N}(0, I)$ and $Y = sX$ for a Rademacher random variable $s$. Then $(X, Y)$ is sub-Gaussian, but not Poincaré with good constant: if we take $f(X, Y) = \sum_i X_i Y_i$, then $f$ is with high probability close to either $+d$ or $-d$, so $\mathrm{Var}[f(X, Y)] \approx d^2$. However, $\nabla f(X, Y) = (Y_1, \ldots, Y_d, X_1, \ldots, X_d)$ and so $\|\nabla f(X, Y)\|_2^2$ is close to $2d$ with high probability. Thus while the sub-Gaussian constant is $O(1)$, the Poincaré constant in this case is $\Omega(\sqrt{d})$.
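This gap is easy to see by simulation. The sketch below (dimension and sample size are arbitrary choices) estimates both sides of the Poincaré inequality for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 100_000

X = rng.standard_normal((n, d))
s = rng.choice([-1.0, 1.0], size=(n, 1))   # Rademacher sign
Y = s * X

f = np.sum(X * Y, axis=1)                  # f(X, Y) = sum_i X_i Y_i = s * ||X||^2
grad_sq = np.sum(X**2 + Y**2, axis=1)      # ||grad f||^2 = ||Y||^2 + ||X||^2

# Var[f] ~ d^2 while E||grad f||^2 ~ 2d, so sigma >= sqrt(Var[f] / E||grad f||^2)
# grows like sqrt(d).
poincare_lb = np.sqrt(f.var() / grad_sq.mean())
assert f.var() > 0.8 * d**2
assert abs(grad_sq.mean() - 2 * d) < 0.1 * d
assert poincare_lb > 0.9 * np.sqrt(d / 2)
print(poincare_lb)
```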
Consequences of Poincaré. So far we have seen conditions that imply Poincaré, but we would also like
to derive consequences of this property. Below are some of the most useful ones:
• If $X \sim p$ is $\sigma$-Poincaré, then Lipschitz functions concentrate: $\mathbb{P}[|f(X) - \mathbb{E}[f(X)]| \ge t] \le 6 \exp(-t / (\sigma L))$ for any $L$-Lipschitz $f$.
• As a corollary, we have volume expansion: For any set $A$, let $A_\epsilon$ be the set of points within $\ell_2$-distance $\epsilon$ of $A$. Then $p(A)\, p(A_\epsilon^c) \le 36 \exp(-\epsilon / \sigma)$.
This second property implies, for instance, that if p(A) ≥ δ, then almost all points will be within distance
O(σ log(1/δ)) of A.
To prove the second property, let $f(x) = \min(\inf_{y \in A} \|x - y\|_2, \epsilon)$. Then $f$ is $1$-Lipschitz, is $0$ on $A$, and is $\epsilon$ on $A_\epsilon^c$. Let $\mu$ be the mean of $f(X)$. Since $f(X)$ is sub-exponential, we have $p(A) \le p(f(X) = 0) \le 6 \exp(-\mu / \sigma)$, and $p(A_\epsilon^c) \le p(f(X) = \epsilon) \le 6 \exp(-(\epsilon - \mu) / \sigma)$. Multiplying these together yields the claimed result.
The most important property for our purposes, however, will be the following:

Theorem 2.49. Suppose that $p$ is $\sigma$-Poincaré and let $f$ be a differentiable function such that $\mathbb{E}[\nabla^j f(X)] = 0$ for $j = 1, \ldots, k - 1$. Then there is a universal constant $C_k$ such that $\mathrm{Var}[f(X)] \le C_k \sigma^{2k} \, \mathbb{E}[\|\nabla^k f(X)\|_F^2]$.
Note that $k = 1$ is the original Poincaré property, so we can think of Theorem 2.49 as a generalization of Poincaré to higher derivatives. Note also that $\nabla^k f(X)$ is a tensor in $\mathbb{R}^{d^k}$; the notation $\|\nabla^k f(X)\|_F^2$ denotes the squared Frobenius norm of $\nabla^k f(X)$, i.e. the sum of the squares of its entries.
Theorem 2.49, while it may appear to be a simple generalization of the Poincaré property, is a deep
result that was established in Adamczak and Wolff (2015), building on work of Latala (2006). We will use
Theorem 2.49 in the sequel to construct our sum-of-squares proofs.
Sum-of-squares proofs for Poincaré distributions. Here we will construct sum-of-squares proofs that $M_{2k}(v) \stackrel{\mathrm{def}}{=} \mathbb{E}_p[\langle x - \mu, v \rangle^{2k}] \preceq_{\mathrm{sos}} C_k' \sigma^{2k} \|v\|_2^{2k}$ whenever $p$ is $\sigma$-Poincaré, for some universal constants $C_k'$. We will exhibit the proof for $k = 1, 2, 3$ (the proof extends to larger $k$, and the key ideas appear already by $k = 3$). We introduce the notation
$$M_j(v) \stackrel{\mathrm{def}}{=} \mathbb{E}_p[\langle x - \mu, v \rangle^j].$$
Proof for $k = 1$. We wish to show that $\mathbb{E}_p[\langle x - \mu, v \rangle^2] \preceq_{\mathrm{sos}} \sigma^2 \|v\|_2^2$. To do this take $f_v(x) = \langle x, v \rangle$. Then the Poincaré inequality applied to $f_v$ yields
$$M_2(v) = \mathrm{Var}_p[\langle x, v \rangle] \le \sigma^2 \, \mathbb{E}_p[\|\nabla f_v(x)\|_2^2] = \sigma^2 \|v\|_2^2.$$
Thus $M_2(v) \le \sigma^2 \|v\|_2^2$ (this is just saying that Poincaré distributions have bounded covariance). This property has a sum-of-squares proof because it is equivalent to $\sigma^2 I - M_2 \succeq 0$, and we know that all positive semidefiniteness relations are sum-of-squares certifiable.
Proof for $k = 2$. Extending to $k = 2$, it makes sense to try $f_v(x) = \langle x - \mu, v \rangle^2$. Then we have $\nabla f_v(x) = 2 \langle x - \mu, v \rangle v$ and hence $\mathbb{E}[\nabla f_v(x)] = 0$. We also have $\nabla^2 f_v(x) = 2 v \otimes v$. Thus applying Theorem 2.49 we obtain
$$\mathrm{Var}[f_v(x)] \le C_2 \sigma^4 \, \mathbb{E}[\|2 v \otimes v\|_F^2] = 4 C_2 \sigma^4 \|v\|_2^4. \quad (152)$$
We also have $\mathrm{Var}[f_v(x)] = \mathbb{E}[\langle x - \mu, v \rangle^4] - \mathbb{E}[\langle x - \mu, v \rangle^2]^2 = M_4(v) - M_2(v)^2$. Thus
$$M_4(v) = \mathrm{Var}[f_v(x)] + M_2(v)^2 \le 4 C_2 \sigma^4 \|v\|_2^4 + \sigma^4 \|v\|_2^4.$$
This shows that the fourth moment is bounded, but how can we construct a sum-of-squares proof? We already have that $M_2(v)^2 \preceq_{\mathrm{sos}} \sigma^4 \|v\|_2^4$ (by $0 \preceq_{\mathrm{sos}} M_2(v) \preceq_{\mathrm{sos}} \sigma^2 \|v\|_2^2$ and the product property). Therefore we focus on bounding $M_4(v) - M_2(v)^2 = \mathrm{Var}[f_v(x)]$.
For this we will apply Theorem 2.49 to a modified version of $f_v(x)$. For a matrix $A$, let $f_A(x) = (x - \mu)^\top A (x - \mu) = \langle A, (x - \mu)^{\otimes 2} \rangle$. Then $f_v(x) = f_A(x)$ for $A = vv^\top$. By the same calculations as above we have $\mathbb{E}[\nabla f_A(x)] = 0$ and $\nabla^2 f_A(x) = 2A$. Thus by Theorem 2.49 we have
$$\mathrm{Var}[f_A(x)] \le 4 C_2 \sigma^4 \|A\|_F^2. \quad (155)$$
On the other hand, we have $\mathrm{Var}[f_A(x)] = \langle M_4, A \otimes A \rangle - \langle M_2, A \rangle^2 = \langle M_4 - M_2 \otimes M_2, A \otimes A \rangle$. Thus (155) implies that
$$\langle M_4 - M_2 \otimes M_2, A \otimes A \rangle \le 4 C_2 \sigma^4 \|A\|_F^2. \quad (156)$$
Another way of putting this is that $M_4 - M_2 \otimes M_2$, when considered as a matrix in $\mathbb{R}^{d^2 \times d^2}$, is smaller than $4 C_2 \sigma^4 I$ in the semidefinite ordering. Hence $4 C_2 \sigma^4 I - (M_4 - M_2 \otimes M_2) \succeq 0$ and so $4 C_2 \sigma^4 \|v\|_2^4 - \langle M_4 - M_2 \otimes M_2, v^{\otimes 4} \rangle \succeq_{\mathrm{sos}} 0$, giving us our desired sum-of-squares proof. To recap, we have
$$M_4(v) = \mathrm{Var}[f_v(x)] + M_2(v)^2 \preceq_{\mathrm{sos}} 4 C_2 \sigma^4 \|v\|_2^4 + \sigma^4 \|v\|_2^4 = C_2' \sigma^4 \|v\|_2^4, \quad \text{where } C_2' = 4 C_2 + 1.$$
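As a concrete instance of (155): when $p = \mathcal{N}(0, I)$, one can show $\mathrm{Var}[f_A(x)] = 2\|A\|_F^2$ exactly for symmetric $A$, and a quick Monte Carlo estimate confirms this. (A sketch; the constant $2$ is specific to the Gaussian case and is not the general Poincaré constant $4C_2\sigma^4$.)

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 200_000

A = rng.standard_normal((d, d))
A = (A + A.T) / 2                         # a random symmetric matrix

x = rng.standard_normal((n, d))
f = np.einsum("ni,ij,nj->n", x, A, x)     # f_A(x) = x^T A x (mu = 0 here)

# For x ~ N(0, I): Var[f_A(x)] = 2 ||A||_F^2 exactly; the Monte Carlo
# estimate should match within a few percent.
target = 2 * np.linalg.norm(A) ** 2
assert abs(f.var() - target) < 0.05 * target
```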
Proof for $k = 3$. For $k = 3$ we take $f_v(x) = \langle x - \mu, v \rangle^3 - 3 M_2(v) \langle x - \mu, v \rangle$, which is constructed so that $\mathbb{E}[\nabla f_v(x)] = 0$ and $\mathbb{E}[\nabla^2 f_v(x)] = 0$; since $\nabla^3 f_v(x) = 6\, v \otimes v \otimes v$, Theorem 2.49 gives $\mathrm{Var}[f_v(x)] \le C_3 \sigma^6 \|6 v^{\otimes 3}\|_F^2 = 36 C_3 \sigma^6 \|v\|_2^6$. We can additionally compute
$$\mathrm{Var}[f_v(x)] = \mathbb{E}[(\langle x - \mu, v \rangle^3 - 3 M_2(v) \langle x - \mu, v \rangle)^2] - \mathbb{E}[\langle x - \mu, v \rangle^3 - 3 M_2(v) \langle x - \mu, v \rangle]^2 \quad (164)$$
$$= M_6(v) - 6 M_2(v) M_4(v) + 9 M_2(v)^3 - M_3(v)^2. \quad (165)$$
Rearranging and applying our bounds,
$$M_6(v) = \mathrm{Var}[f_v(x)] + 6 M_2(v) M_4(v) + M_3(v)^2 - 9 M_2(v)^3 \quad (166)$$
$$\le 36 C_3 \sigma^6 \|v\|_2^6 + 6 (\sigma^2 \|v\|_2^2)(C_2' \sigma^4 \|v\|_2^4) + M_3(v)^2 + 0. \quad (167)$$
We can also use Hölder's inequality to obtain $M_3(v)^2 \le M_2(v) M_4(v)$, which yields an overall bound of $M_6(v) \le (36 C_3 + 12 C_2') \sigma^6 \|v\|_2^6$.
We now turn this into a sum-of-squares proof. We need to show the following four relations:
$$\text{(i) } \mathrm{Var}[f_v(x)] \preceq_{\mathrm{sos}} 36 C_3 \sigma^6 \|v\|_2^6, \qquad \text{(ii) } M_2(v) M_4(v) \preceq_{\mathrm{sos}} (\sigma^2 \|v\|_2^2)(C_2' \sigma^4 \|v\|_2^4), \quad (168)$$
$$\text{(iii) } M_3(v)^2 \preceq_{\mathrm{sos}} M_2(v) M_4(v), \qquad \text{(iv) } -9 M_2(v)^3 \preceq_{\mathrm{sos}} 0. \quad (169)$$
The relation (ii) again follows by the product property of $\preceq_{\mathrm{sos}}$, while $-9 M_2(v)^3 \preceq_{\mathrm{sos}} 0$ is direct because $M_2(v)^3 = M_2(v) \cdot M_2(v)^2$, where $M_2(v)^2$ is already a square and $M_2(v) \succeq_{\mathrm{sos}} 0$. We will show in an exercise that the Hölder inequality in (iii) has a sum-of-squares proof, and focus on (i).
The relation (i) holds for reasons analogous to the $k = 2$ case. For a symmetric tensor $A \in \mathbb{R}^{d^3}$, let $f_A(x) = \langle A, (x - \mu)^{\otimes 3} - 3 M_2 \otimes (x - \mu) \rangle$. Then just as before we have $\mathbb{E}[\nabla f_A(x)] = 0$, $\mathbb{E}[\nabla^2 f_A(x)] = 0$, and so $\mathrm{Var}[f_A(x)] \le 36 C_3 \sigma^6 \|A\|_F^2$. As in the $k = 2$ case, this means that the corresponding matrix in $\mathbb{R}^{d^3 \times d^3}$ is dominated by $36 C_3 \sigma^6 I$ in the semidefinite ordering, and hence $\mathrm{Var}[f_v(x)] \preceq_{\mathrm{sos}} 36 C_3 \sigma^6 \|v\|_2^6$ (again because semidefinite relations have sum-of-squares proofs).
In summary, we have $M_6(v) \preceq_{\mathrm{sos}} (36 C_3 + 12 C_2') \sigma^6 \|v\|_2^6$, as desired.
Generalizing to higher $k$. For higher $k$ the proof is essentially the same. What is needed is a function $f_v(x)$ whose first $k - 1$ derivatives all have zero mean. This always exists and is unique up to scaling by constants. For instance, when $k = 4$ we can take $f_v(x) = \langle x - \mu, v \rangle^4 - 6 M_2(v) \langle x - \mu, v \rangle^2 - 4 M_3(v) \langle x - \mu, v \rangle - M_4(v) + 6 M_2(v)^2$. This appears somewhat clunky but is a special case of a combinatorial sum. For the general case, let $T_k$ be the set of all integer tuples $(i_0, i_1, \ldots, i_r)$ such that $i_0 \ge 0$, $i_s \ge 2$ for $s > 0$, and $i_0 + i_1 + \cdots + i_r = k$. Then the general form is
$$f_{v,k}(x) = \sum_{(i_0, \ldots, i_r) \in T_k} (-1)^r \binom{k}{i_0 \cdots i_r} \langle x - \mu, v \rangle^{i_0} M_{i_1}(v) M_{i_2}(v) \cdots M_{i_r}(v). \quad (171)$$
The motivation for this formula is that it is the solution to ∇fv,k (x) = kfv,k−1 (x)v. Using fv,k , one can
construct sum-of-squares proofs by applying Theorem 2.49 to the analogous fA,k function as before, and then
use induction, the product rule, and Hölder’s inequality as in the k = 3 case.
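The defining property of $f_{v,k}$ (vanishing means of its first $k-1$ derivatives) is an algebraic identity in the moments, so it can be checked exactly against the empirical distribution of any sample. The sketch below does this for the $k = 4$ polynomial in one dimension (the choice of exponential data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(size=1000)   # any distribution works: the identity is algebraic
y = y - y.mean()                 # center so that M_1 = 0
M = {j: np.mean(y**j) for j in range(2, 5)}

# k = 4: f(Y) = Y^4 - 6 M2 Y^2 - 4 M3 Y - M4 + 6 M2^2, with Y = <x - mu, v> in 1-d.
# Its first three derivatives in Y have mean zero under the empirical distribution.
d1 = 4 * y**3 - 12 * M[2] * y - 4 * M[3]
d2 = 12 * y**2 - 12 * M[2]
d3 = 24 * y

for deriv in (d1, d2, d3):
    assert abs(deriv.mean()) < 1e-8
```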
[Lectures 10-11]
We observe samples $X_1, \ldots, X_n$ drawn from a corrupted distribution $\tilde{p}$ (arising, for instance, from contamination as in Section 2.6.2). We wish to estimate a parameter $\theta$, and do so via an estimator $\hat\theta = \hat\theta(X_1, \ldots, X_n)$. Our goal is to construct an estimator such that $L(p^*, \hat\theta)$ is small according to a given loss function $L$. This was summarized in Figure 1 from Section 1.
As before, we will start by allowing our estimator $\hat\theta$ to directly access the population distribution $\tilde{p}$ rather than samples. Thus we wish to control the error $L(p^*, \hat\theta(\tilde{p}))$. Since this is hopeless without further assumptions, we assume that $D(p^*, \tilde{p}) \le \epsilon$ for some distance $D$, and that $p^*$ lies in some family $\mathcal{G}$.
For now we continue to take D = TV and focus on more general losses L, corresponding to tasks beyond
mean estimation. Two key examples will be:
• Second moment estimation in spectral norm, which corresponds to the loss $L(p, S) = \|\mathbb{E}_p[X X^\top] - S\|$.
• Linear regression, which corresponds to the loss $L(p, \theta) = \mathbb{E}_{x,y \sim p}[(y - \theta^\top x)^2 - (y - \theta^*(p)^\top x)^2]$. Note that here $L$ measures the excess predictive loss, so that $L(p, \theta^*(p)) = 0$.
As in the mean estimation case, we will define the modulus of continuity and the family of resilience
distributions, and derive sufficient conditions for resilience.
Modulus of continuity. The modulus of continuity generalizes straightforwardly from the mean estimation case. We define
$$m(\mathcal{G}, 2\epsilon, L) = \sup_{p, q \in \mathcal{G} : \mathrm{TV}(p, q) \le 2\epsilon} L(p, \theta^*(q)). \quad (172)$$
As before, the modulus $m$ upper-bounds the minimax loss. Specifically, consider the projection estimator that outputs $\hat\theta(\tilde{p}) = \theta^*(q)$ for any $q \in \mathcal{G}$ with $\mathrm{TV}(\tilde{p}, q) \le \epsilon$. Then the error of $\hat\theta$ is at most $m$ because $\mathrm{TV}(q, p^*) \le 2\epsilon$ and $p^*, q \in \mathcal{G}$.
Resilience. Generalizing resilience requires more care. Recall that for mean estimation the set of $(\rho, \epsilon)$-resilient distributions was
$$\mathcal{G}_{\mathrm{mean}}^{\mathrm{TV}}(\rho, \epsilon) \stackrel{\mathrm{def}}{=} \Big\{ p \ \Big|\ \|\mathbb{E}_r[X] - \mathbb{E}_p[X]\| \le \rho \text{ for all } r \le \frac{p}{1 - \epsilon} \Big\}. \quad (173)$$
We saw in Section 2.4 that robust mean estimation is possible for the family $\mathcal{G}_{\mathrm{mean}}^{\mathrm{TV}}$ of resilient distributions; the two key ingredients were the existence of a midpoint distribution and the triangle inequality for $L(p, \theta^*(q)) = \|\mu_p - \mu_q\|$. We now extend the definition of resilience to arbitrary cost functions $L(p, \theta)$ that may not satisfy the triangle inequality. The general definition below imposes two conditions: (1) the parameter $\theta^*(p)$ should do well on all distributions $r \le \frac{p}{1-\epsilon}$, and (2) any parameter that does well on some $r \le \frac{p}{1-\epsilon}$ also does well on $p$. We measure performance on $r$ with a bridge function $B(r, \theta)$, which is often the same as the loss $L$ but need not be.
Definition 3.1 ($\mathcal{G}^{\mathrm{TV}}(\rho_1, \rho_2, \epsilon)$). Given an arbitrary loss function $L(p, \theta)$, we define $\mathcal{G}^{\mathrm{TV}}(\rho_1, \rho_2, \epsilon) = \mathcal{G}_\downarrow^{\mathrm{TV}}(\rho_1, \epsilon) \cap \mathcal{G}_\uparrow^{\mathrm{TV}}(\rho_1, \rho_2, \epsilon)$, where:
$$\mathcal{G}_\downarrow^{\mathrm{TV}}(\rho_1, \epsilon) \stackrel{\mathrm{def}}{=} \Big\{ p \ \Big|\ B(r, \theta^*(p)) \le \rho_1 \text{ for all } r \le \frac{p}{1-\epsilon} \Big\}, \quad (174)$$
$$\mathcal{G}_\uparrow^{\mathrm{TV}}(\rho_1, \rho_2, \epsilon) \stackrel{\mathrm{def}}{=} \Big\{ p \ \Big|\ \text{for all } \theta \text{ and all } r \le \frac{p}{1-\epsilon}: \ B(r, \theta) \le \rho_1 \implies L(p, \theta) \le \rho_2 \Big\}. \quad (175)$$
If we take $B(p, \theta) = L(p, \theta) = \|\mathbb{E}_p[X] - \theta\|$ and $\rho_2 = 2\rho_1$, then this exactly reduces to the resilient set $\mathcal{G}_{\mathrm{mean}}^{\mathrm{TV}}(\rho_1, \epsilon)$ for mean estimation. To see the reduction, note that $\mathcal{G}_{\mathrm{mean}}^{\mathrm{TV}}$ is equivalent to $\mathcal{G}_\downarrow^{\mathrm{TV}}$ in Equation (174). Thus we only need to show that $\mathcal{G}_\downarrow^{\mathrm{TV}}$ is a subset of $\mathcal{G}_\uparrow^{\mathrm{TV}}$. By our choice of $B$, $L$, and $\rho_2$, the implication condition in $\mathcal{G}_\uparrow^{\mathrm{TV}}$ follows from the triangle inequality.
We will show that G TV is not too big by bounding its modulus of continuity, and that it is not too small
by exhibiting reasonable sufficient conditions for resilience.
[Figure 7: $p_1$ and $p_2$ with $\mathrm{TV}(p_1, p_2) \le \eta$ share the midpoint $r = \frac{\min(p_1, p_2)}{1 - \mathrm{TV}(p_1, p_2)}$, which satisfies $r \le \frac{p_1}{1-\eta}$ and $r \le \frac{p_2}{1-\eta}$. Since $p_1 \in \mathcal{G}_\downarrow^{\mathrm{TV}}$ we get $B(r, \theta^*(p_1)) \le \rho_1$, and since $p_2 \in \mathcal{G}_\uparrow^{\mathrm{TV}}$ this yields $L(p_2, \theta^*(p_1)) \le \rho_2$.]
Not too big: bounding $m$. We show that the designed $\mathcal{G}^{\mathrm{TV}}(\rho_1, \rho_2, \epsilon)$ has small modulus of continuity (and thus population minimax limit) in the following theorem:

Theorem 3.2. For $\mathcal{G}^{\mathrm{TV}}(\rho_1, \rho_2, \epsilon)$ in Definition 3.1, we have $m(\mathcal{G}^{\mathrm{TV}}(\rho_1, \rho_2, \epsilon), \epsilon) \le \rho_2$.

Proof. As illustrated in Figure 7, we still rely on the midpoint distribution $r$ to bridge the modulus. Consider any $p_1, p_2$ satisfying $\mathrm{TV}(p_1, p_2) \le \epsilon$. Then there is a midpoint $r$ such that $r \le \frac{p_1}{1-\epsilon}$ and $r \le \frac{p_2}{1-\epsilon}$. From the fact that $p_1 \in \mathcal{G}^{\mathrm{TV}}(\rho_1, \rho_2, \epsilon) \subset \mathcal{G}_\downarrow^{\mathrm{TV}}(\rho_1, \epsilon)$, we have $B(r, \theta^*(p_1)) \le \rho_1$. From this and the fact that $p_2 \in \mathcal{G}^{\mathrm{TV}}(\rho_1, \rho_2, \epsilon) \subset \mathcal{G}_\uparrow^{\mathrm{TV}}(\rho_1, \rho_2, \epsilon)$, we then have $L(p_2, \theta^*(p_1)) \le \rho_2$. Since $p_1$ and $p_2$ are arbitrary, this bounds the modulus of continuity by $\rho_2$.
Not too small: concrete examples. We next show that $\mathcal{G}^{\mathrm{TV}}$ yields sensible conditions for second moment estimation and linear regression. We start with second moment estimation:

Proposition 3.3. Let $B(p, S) = L(p, S) = \|\mathbb{E}_p[X X^\top] - S\|$, and let $p$ be a distribution on $\mathbb{R}^d$ such that $p \in \mathcal{G}_{\mathrm{mom},k}(\sigma)$, i.e. $p$ has bounded $k$th moments. Then assuming $k > 2$, we have $p \in \mathcal{G}^{\mathrm{TV}}(\rho, 2\rho, \epsilon)$ for $\rho = O(\sigma^2 \epsilon^{1 - 2/k})$.

This is essentially the same statement as for mean estimation, except with $\sigma^2 \epsilon^{1-2/k}$ instead of $\sigma \epsilon^{1-1/k}$.
Proof. First we show that $p \in \mathcal{G}_\downarrow(\rho, \epsilon)$, for which we need to show that
$$\|\mathbb{E}_r[X X^\top] - \mathbb{E}_p[X X^\top]\| \le \rho \quad \text{for all } r \le \frac{p}{1 - \epsilon}. \quad (176)$$
Letting $Y = X X^\top$, this asks that $Y$ is resilient in operator norm, which in turn asks that $\langle Y, Z \rangle$ is resilient for any $\|Z\|_* \le 1$, where $\|\cdot\|_*$ is dual to the operator norm. Recalling that the operator norm is the maximum singular value, it turns out that $\|\cdot\|_*$ is the nuclear norm, or the sum of the singular values. Thus for $Z = U \Lambda V^\top$ we have $\|Z\|_* = \sum_i \Lambda_{ii}$. (Proving this duality requires some non-trivial but very useful matrix inequalities that we provide at the end of this section.)
Conveniently, the extreme points of the nuclear norm ball are exactly rank-one matrices of the form $\pm vv^\top$ where $\|v\|_2 = 1$. Thus we exactly need that $\langle v, X \rangle^2$ is resilient for all $v$. Fortunately we have that $\mathbb{E}[|\langle v, X \rangle^2 - \mathbb{E}[\langle v, X \rangle^2]|^{k/2}] \le \mathbb{E}[|\langle v, X \rangle|^k] \le \sigma^k$, so $p$ is $(\rho_1, \epsilon)$-resilient with $\rho_1 = \sigma^2 \epsilon^{1 - 2/k}$, which gives that $p \in \mathcal{G}_\downarrow$.
Next we need to show that $p \in \mathcal{G}_\uparrow$. We want
$$\|\mathbb{E}_r[X X^\top] - S\| \le \rho_1 \implies \|\mathbb{E}_p[X X^\top] - S\| \le \rho_2 \quad \text{whenever } r \le \frac{p}{1 - \epsilon}, \quad (177)$$
but by the triangle inequality it suffices that $\|\mathbb{E}_r[X X^\top] - \mathbb{E}_p[X X^\top]\| \le \rho_2 - \rho_1$, and we already know that the left-hand side is bounded by $\rho_1$, so we can take $\rho_2 = 2 \rho_1$, which proves the claimed result.
We move on to linear regression. In the proof for second moment estimation, we saw that $p \in \mathcal{G}_\uparrow$ was essentially implied by $p \in \mathcal{G}_\downarrow$. This was due to the symmetry of the second moment loss together with the triangle inequality for $\|\cdot\|$, two properties that we don't have in general. The proof for linear regression will accordingly require somewhat different arguments for $\mathcal{G}_\uparrow$ and $\mathcal{G}_\downarrow$. For simplicity we state the result only for fourth moments:
Proposition 3.4. For a distribution $p$ on $\mathbb{R}^d \times \mathbb{R}$, let $B(p, \theta) = L(p, \theta) = \mathbb{E}_p[(y - \langle \theta, x \rangle)^2 - (y - \langle \theta^*(p), x \rangle)^2]$. Let $Z = Y - \langle \theta^*(p), X \rangle$ and suppose that the following two conditions hold:
$$\mathbb{E}_p[X Z^2 X^\top] \preceq \sigma^2 \, \mathbb{E}_p[X X^\top] \quad \text{(bounded noise)}, \quad (178)$$
$$\mathbb{E}_p[\langle v, X \rangle^4] \le \kappa \, \mathbb{E}_p[\langle v, X \rangle^2]^2 \text{ for all } v \quad \text{(hypercontractivity)}. \quad (179)$$
Then if $\epsilon \le \frac{1}{8}$ and $\epsilon(\kappa - 1) \le \frac{1}{6}$, we have $p \in \mathcal{G}^{\mathrm{TV}}(\rho, 5\rho, \epsilon)$ for $\rho = 2\sigma^2\epsilon$.
Let us interpret the two conditions. First, as long as $X$ and $Z$ are independent (covariates are independent of noise), we have $\mathbb{E}_p[X Z^2 X^\top] = \mathbb{E}[Z^2] \, \mathbb{E}[X X^\top]$, so in that case $\sigma^2$ is exactly a bound on the noise $Z$. Even when $X$ and $Z$ are not independent, the first condition holds when $Z$ has bounded 4th moment.
The second condition is a hypercontractivity condition stating that the fourth moments of $X$ should not be too large compared to the second moments. It is a bit unusual from the perspective of mean estimation, because it does not require $X$ to be well-concentrated, but only well-concentrated relative to its variance. For regression, this condition makes sense because $\kappa$ bounds how close the covariates are to being rank-deficient (the worst case is roughly an $\epsilon$-mass at some arbitrary distance $t/\sqrt{\epsilon}$, which would have second moment $t^2$ and fourth moment $t^4/\epsilon$, so we roughly want $\kappa < 1/\epsilon$). We will show later that such a hypercontractivity condition is needed, i.e. simply assuming sub-Gaussianity (without making it relative to the variance) allows for distributions that are hard to robustly estimate due to the rank-deficiency issue.
Proof. First note that $L(p, \theta) = (\theta - \theta^*(p))^\top S_p (\theta - \theta^*(p))$, where $S_p = \mathbb{E}_p[X X^\top]$, and analogously for $L(r, \theta)$. At a high level our strategy will be to show that $\theta^*(r) \approx \theta^*(p)$ and $S_r \approx S_p$, and then use this to establish membership in $\mathcal{G}_\downarrow$ and $\mathcal{G}_\uparrow$.
We first use the hypercontractivity condition to show that $S_r \approx S_p$. We have
$$\mathbb{E}_r[\langle v, X \rangle^2] \ge \mathbb{E}_p[\langle v, X \rangle^2] - \frac{\sqrt{\epsilon}}{1 - \epsilon} \sqrt{\mathrm{Var}_p[\langle v, X \rangle^2]} \quad (180)$$
$$= \mathbb{E}_p[\langle v, X \rangle^2] - \frac{\sqrt{\epsilon}}{1 - \epsilon} \sqrt{\mathbb{E}_p[\langle v, X \rangle^4] - \mathbb{E}_p[\langle v, X \rangle^2]^2} \quad (181)$$
$$\ge \mathbb{E}_p[\langle v, X \rangle^2] - \frac{\sqrt{\epsilon}}{1 - \epsilon} \sqrt{\kappa - 1} \, \mathbb{E}_p[\langle v, X \rangle^2] \quad (182)$$
$$= \Big(1 - \frac{\sqrt{\epsilon(\kappa - 1)}}{1 - \epsilon}\Big) \mathbb{E}_p[\langle v, X \rangle^2]. \quad (183)$$
Thus $S_r \succeq (1 - \frac{\sqrt{\epsilon(\kappa - 1)}}{1 - \epsilon}) S_p$, and similarly $S_r \preceq (1 + \frac{\sqrt{\epsilon(\kappa - 1)}}{1 - \epsilon}) S_p$. Assuming $\epsilon \le \frac{1}{8}$ and $\epsilon(\kappa - 1) \le \frac{1}{6}$, we have $\frac{\sqrt{\epsilon(\kappa - 1)}}{1 - \epsilon} \le \frac{8}{7} \sqrt{1/6} < \frac{1}{2}$, and so $\frac{1}{2} S_p \preceq S_r \preceq \frac{3}{2} S_p$.
We now turn to $\mathcal{G}_\uparrow$ and $\mathcal{G}_\downarrow$. A useful relation is $\theta^*(p) = S_p^{-1} \mathbb{E}_p[X Y]$, and $\theta^*(r) - \theta^*(p) = S_r^{-1} \mathbb{E}_r[X Z]$. To prove that $p \in \mathcal{G}_\downarrow$ we need to show that $(\theta^*(r) - \theta^*(p))^\top S_r (\theta^*(r) - \theta^*(p))$ is small. We have
$$(\theta^*(r) - \theta^*(p))^\top S_r (\theta^*(r) - \theta^*(p)) \le \frac{3}{2} (\theta^*(r) - \theta^*(p))^\top S_p (\theta^*(r) - \theta^*(p)) \quad (184)$$
$$= \frac{3}{2} \mathbb{E}_r[X Z]^\top S_p^{-1} \mathbb{E}_r[X Z] = \frac{3}{2} \|\mathbb{E}_r[S_p^{-1/2} X Z] - \mathbb{E}_p[S_p^{-1/2} X Z]\|_2^2. \quad (185)$$
(Note that $\mathbb{E}_p[X Z] = 0$ by the optimality of $\theta^*(p)$.) This final expression calls for $S_p^{-1/2} X Z$ to be resilient, and bounded variance of this distribution can be seen to exactly correspond to the condition $\mathbb{E}[X Z^2 X^\top] \preceq \sigma^2 \mathbb{E}[X X^\top]$. Thus we have resilience with $\rho = \frac{3 \sigma^2 \epsilon}{2(1 - \epsilon)^2} \le 2 \sigma^2 \epsilon$ (since $\epsilon \le \frac{1}{8}$).
Now we turn to $\mathcal{G}_\uparrow$. We want that $(\theta - \theta^*(r))^\top S_r (\theta - \theta^*(r)) \le \rho$ implies $(\theta - \theta^*(p))^\top S_p (\theta - \theta^*(p)) \le 5\rho$. By the triangle inequality we have
$$\sqrt{(\theta - \theta^*(p))^\top S_p (\theta - \theta^*(p))} \le \sqrt{(\theta - \theta^*(r))^\top S_p (\theta - \theta^*(r))} + \sqrt{(\theta^*(r) - \theta^*(p))^\top S_p (\theta^*(r) - \theta^*(p))} \quad (186)$$
$$\le \sqrt{2 (\theta - \theta^*(r))^\top S_r (\theta - \theta^*(r))} + \sqrt{(4/3) \sigma^2 \epsilon} \quad (187)$$
$$\le \sqrt{2\rho} + \sqrt{(4/3) \sigma^2 \epsilon} = \sqrt{\rho} \big(\sqrt{2} + \sqrt{2/3}\big) \le \sqrt{5\rho}, \quad (188)$$
which completes the proof.
Proving that the nuclear norm is dual to the operator norm. Here we establish a series of matrix inequalities that are useful more broadly, and use these to analyze the nuclear norm. The first allows us to reduce dot products of arbitrary matrices to symmetric PSD matrices:

Proposition 3.5. For any (rectangular) matrices $A, B$ of equal dimensions, we have
$$\langle A, B \rangle^2 \le \langle (A^\top A)^{1/2}, (B^\top B)^{1/2} \rangle \, \langle (A A^\top)^{1/2}, (B B^\top)^{1/2} \rangle. \quad (189)$$
Proof. Using the singular value decomposition, one can check that matrices of the form $\begin{bmatrix} (A A^\top)^{1/2} & A \\ A^\top & (A^\top A)^{1/2} \end{bmatrix}$ are positive semidefinite. Consequently, for any $\lambda > 0$,
$$\Big\langle \begin{bmatrix} \lambda^2 (A A^\top)^{1/2} & A \\ A^\top & \lambda^{-2} (A^\top A)^{1/2} \end{bmatrix}, \begin{bmatrix} (B B^\top)^{1/2} & -B \\ -B^\top & (B^\top B)^{1/2} \end{bmatrix} \Big\rangle \ge 0, \quad (190)$$
since both terms in the inner product are PSD. This gives $\lambda^2 \langle (A A^\top)^{1/2}, (B B^\top)^{1/2} \rangle + \lambda^{-2} \langle (A^\top A)^{1/2}, (B^\top B)^{1/2} \rangle \ge 2 \langle A, B \rangle$. Optimizing over $\lambda$ yields the claimed result.
Next we show:
Theorem 3.6. If $A$ and $B$ are matrices of the same dimensions with (sorted) lists of singular values $\sigma_1, \ldots, \sigma_n$ and $\tau_1, \ldots, \tau_n$, then
$$\langle A, B \rangle \le \sum_{i=1}^n \sigma_i \tau_i. \quad (191)$$
This says that the dot product between two matrices is bounded by the dot product between their sorted singular values.

Proof. By Proposition 3.5, it suffices to show this in the case that $A$ and $B$ are both PSD and $\sigma, \tau$ are the eigenvalues. Actually we will only need $A$ and $B$ to be symmetric (which implies that, oddly, the inequality can hold even if some of the $\sigma_i$ and $\tau_i$ are negative).
By taking similarity transforms we can assume without loss of generality that $A = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$. We thus wish to prove that $\sum_{i=1}^n \sigma_i B_{ii} \le \sum_{i=1}^n \sigma_i \tau_i$, where the $\tau_i$ are the eigenvalues of $B$. We make use of the following lemma:
Lemma 3.7. For all $1 \le k \le n$, we have $\sum_{i=1}^k B_{ii} \le \sum_{i=1}^k \tau_i$.
Proof. Let Bₖ be the k × k top-left submatrix of B. Then ∑_{i=1}^k Bᵢᵢ = tr(Bₖ) is the sum of the eigenvalues of Bₖ. We will show that the jth largest eigenvalue of Bₖ is smaller than the jth largest eigenvalue of B (this is a special case of the Cauchy interlacing theorem). We prove this using the min-max formulation of eigenvalues: λᵢ(M) = min_{W : dim(W) = i−1} max_{v ∈ W⊥, ‖v‖₂ ≤ 1} vᵀMv. Let W∗ be the W that attains the min for λⱼ(B), and let Pₖ denote projection onto the first k coordinates. We have
λⱼ(Bₖ) ≤ max{vᵀBv : v ∈ (PₖW∗)⊥, ‖v‖₂ ≤ 1, supp(v) ⊆ [k]} ≤ λⱼ(B), which yields the desired result. Returning to the theorem, we write σᵢ = σₙ + ∑_{j=i}^{n−1}(σⱼ − σⱼ₊₁) and apply Lemma 3.7 to each partial sum; this is Abel summation, which is the discrete version of integration by parts.
Now that we have Theorem 3.6 in hand, we can easily analyze the operator and nuclear norms. Letting σ⃗(A) denote the vector of sorted singular values of A, we have ⟨Y, Z⟩ ≤ ⟨σ⃗(Y), σ⃗(Z)⟩ ≤ ‖σ⃗(Y)‖∞ ‖σ⃗(Z)‖₁. This shows that the dual of the operator norm is at most the nuclear norm, since ‖σ⃗(Z)‖₁ is the nuclear norm of Z (and ‖σ⃗(Y)‖∞ the operator norm of Y). But we can achieve equality when Y = UΛVᵀ by taking Z = u₁v₁ᵀ (then ‖Z‖∗ = 1 while ⟨Y, Z⟩ = Λ₁₁ = ‖Y‖). So operator and nuclear norm are indeed dual to each other.
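As a quick numerical sanity check of Theorem 3.6 (not part of the original notes), one can compare ⟨A, B⟩ against the dot product of sorted singular values for random matrices; the sketch below assumes numpy is available:

```python
import numpy as np

rng = np.random.default_rng(0)

def check_trace_inequality(m, n):
    # Draw two random rectangular matrices of the same shape.
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((m, n))
    # <A, B> = tr(A^T B), the entrywise dot product.
    inner = np.sum(A * B)
    # Singular values (numpy returns them sorted in decreasing order).
    sigma = np.linalg.svd(A, compute_uv=False)
    tau = np.linalg.svd(B, compute_uv=False)
    bound = np.sum(sigma * tau)  # right-hand side of (191)
    return inner, bound

inner, bound = check_trace_inequality(5, 3)
assert inner <= bound + 1e-9  # von Neumann's trace inequality holds
```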
[Lecture 12]
Then we seek to find a q such that F₁(q) ≤ κ, F₂(q) ≤ σ², and q ∈ ∆ₙ,ε, where ∆ₙ,ε is the set of ε-deletions of p.
However, there are a couple of wrinkles. While for mean estimation we could minimize the objective with gradient descent, in this case we will need to use quasigradient descent: following a direction that is not the gradient, but that we can show makes progress towards the optimum. The rough reason for this is that, since p appears on both the left- and right-hand sides of the inequalities above, the gradients become quite messy (e.g. ∇F₁(q) has a mix of positive and negative terms), and it isn't clear that following them will not land us at bad local minima. To address this, we instead construct simpler quasigradients for F₁ and F₂:
g₁(xᵢ; q) = ⟨xᵢ, v⟩⁴, where v = arg max_{‖v‖₂=1} Eq[⟨x, v⟩⁴] / Eq[⟨x, v⟩²]²,  (202)
g₂(xᵢ; q) = ⟨xᵢ, v⟩²(yᵢ − ⟨θ∗(q), xᵢ⟩)², where v = arg max_{‖v‖₂=1} Eq[⟨x, v⟩²(y − ⟨θ∗(q), x⟩)²] / Eq[⟨x, v⟩²].  (203)
We will then follow g1 until F1 is small, and then follow g2 until F2 is small.
The other wrinkle is computational: the hypercontractivity condition is difficult to certify, because it involves maximizing Ep[⟨x, v⟩⁴]/Ep[⟨x, v⟩²]², which is no longer a straightforward eigenvalue problem as in the mean estimation case. We've already seen this sort of difficulty before: for norms beyond the ℓ₂-norm, we had to use SDP relaxations and Grothendieck's inequality in order to get a constant factor relaxation. Here, there is also an SDP relaxation, called the sum-of-squares relaxation, but it doesn't always give a constant factor relaxation. We'll mostly ignore this issue and assume that we can find the maximizing v for hypercontractivity.
We are now ready to define our efficient algorithm for linear regression, Algorithm 4. It is closely analogous
to the algorithm for mean estimation (Algorithm 2), but specifies the gradient steps more explicitly.
Algorithm 4 QuasigradientDescentLinReg
1: Input: (x₁, y₁), . . . , (xₙ, yₙ) ∈ Rᵈ × R.
2: Initialize q ∈ ∆ₙ,ε arbitrarily.
3: while F₁(q) ≥ 2κ or F₂(q) ≥ 4σ² do
4:   if F₁(q) ≥ 2κ then
5:     Find the unit vector v that maximizes Eq[⟨x, v⟩⁴]/Eq[⟨x, v⟩²]².
6:     Take a projected gradient step in the direction (g₁)ᵢ = ⟨xᵢ, v⟩⁴.
7:   else
8:     Compute the empirical least squares regressor: θ∗(q) = (∑_{i=1}^n qᵢxᵢxᵢᵀ)⁻¹(∑_{i=1}^n qᵢxᵢyᵢ).
9:     Find the unit vector v that maximizes Eq[⟨x, v⟩²(y − ⟨θ∗(q), x⟩)²]/Eq[⟨x, v⟩²].
10:    Take a projected gradient step in the direction (g₂)ᵢ = ⟨xᵢ, v⟩²(yᵢ − ⟨θ∗(q), xᵢ⟩)².
11:  end if
12: end while
13: Output θ∗(q).
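A minimal Python sketch of Algorithm 4 may help make the quasigradient steps concrete. This is not the full algorithm: finding the exact maximizing v is hard (as discussed above), so this sketch substitutes the top eigenvector of the weighted second moment matrix as a heuristic direction, uses a fixed iteration budget, and projects onto the weight set by simple clipping and renormalization rather than an exact projection onto ∆ₙ,ε; all thresholds and step sizes are illustrative.

```python
import numpy as np

def quasigradient_linreg(X, y, eps, kappa=0.5, lr=0.01, iters=200):
    """Sketch of QuasigradientDescentLinReg on data (X, y)."""
    n, d = X.shape
    q = np.full(n, 1.0 / n)  # start at the uniform distribution

    def theta_star(q):
        # Weighted least squares regressor under q.
        Sq = (X * q[:, None]).T @ X
        return np.linalg.solve(Sq, (X * q[:, None]).T @ y)

    def top_direction(q):
        # Heuristic stand-in for the maximizing v (top eigenvector of the
        # weighted second moment matrix); the true step needs the
        # hypercontractivity maximizer, which is hard to compute exactly.
        Sq = (X * q[:, None]).T @ X
        _, V = np.linalg.eigh(Sq)
        return V[:, -1]

    for _ in range(iters):
        v = top_direction(q)
        proj = X @ v
        F1 = q @ proj**4 / max(q @ proj**2, 1e-12) ** 2
        if F1 >= 2 * kappa:
            g = proj**4              # quasigradient g1
        else:
            resid = y - X @ theta_star(q)
            g = proj**2 * resid**2   # quasigradient g2
        # Gradient step on q, then a crude projection back onto
        # (a superset of) the epsilon-deletion set: q_i <= 1/((1-eps)n).
        q = np.clip(q - lr * g / max(g.max(), 1e-12),
                    0.0, 1.0 / ((1 - eps) * n))
        q = q / q.sum()
    return theta_star(q), q
```

On data with a small fraction of corrupted rows, the weights q of the outlying points are driven down over the iterations, and the output θ∗(q) is the weighted least squares fit under the final weights.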
Proposition 3.8. Suppose that pS is hypercontractive with bounded noise:
EpS[⟨x, v⟩⁴] ≤ κ EpS[⟨x, v⟩²]² ∀v ∈ Rᵈ, and EpS[(y − ⟨θ∗(pS), x⟩)² xxᵀ] ⪯ σ² EpS[xxᵀ].  (204)
Then assuming κε ≤ 1/80, Algorithm 4 terminates and its output has excess loss L(pS, θ∗(q)) ≤ 40σ²ε.
To prove Proposition 3.8, we need a few ideas. The first is a result from optimization justifying the use of
the quasigradients:
43
Lemma 3.9 (Informal). Asymptotically, the iterates of Algorithm 4 (or any other low-regret algorithm) satisfy the conditions
E_{X∼q}[gⱼ(X; q)] ≤ E_{X∼pS}[gⱼ(X; q)] for j = 1, 2.  (205)
We will establish this more formally later, and for now assume that (205) holds. Then, assuming this, we will show that q is both hypercontractive and has bounded noise with constants κ′, σ′ that are only a constant factor worse than κ and σ.
First, we will show this for hypercontractivity:
Lemma 3.10. Suppose that E_{X∼q}[g₁(X; q)] ≤ E_{X∼pS}[g₁(X; q)] and that κε ≤ 1/80. Then q is hypercontractive with parameter κ′ = 1.5κ.
Proof. Take the maximizing v such that g₁(xᵢ; q) = ⟨xᵢ, v⟩⁴. To establish hypercontractivity, we want to show that Eq[⟨x, v⟩⁴] is small while Eq[⟨x, v⟩²]² is large. Note the quasigradient condition gives us that Eq[⟨x, v⟩⁴] ≤ EpS[⟨x, v⟩⁴]; so we will mainly focus on showing that Eq[⟨x, v⟩²] is large, and in fact not much smaller than EpS[⟨x, v⟩²]. Specifically, by the fact that TV(q, pS) ≤ ε/(1 − ε), together with resilience applied to ⟨x, v⟩², we have
|Eq[⟨x, v⟩²] − EpS[⟨x, v⟩²]| ≤ (ε^{1/2}/(1 − 2ε)²) (Eq[⟨x, v⟩⁴] + EpS[⟨x, v⟩⁴])^{1/2}  (206)
 (i) ≤ ((2ε)^{1/2}/(1 − 2ε)²) EpS[⟨x, v⟩⁴]^{1/2}  (207)
 (ii) ≤ ((2κε)^{1/2}/(1 − 2ε)²) EpS[⟨x, v⟩²],  (208)
where (i) uses the quasigradient condition and (ii) uses hypercontractivity for pS. Now assuming that κε ≤ 1/80 (and hence also ε ≤ 1/80), the coefficient on the right-hand side is at most (1/40)^{1/2}/(1 − 1/40)² < 1/6. Consequently |Eq[⟨x, v⟩²] − EpS[⟨x, v⟩²]| ≤ (1/6) EpS[⟨x, v⟩²], and re-arranging then yields
Eq[⟨x, v⟩²] ≥ (5/6) EpS[⟨x, v⟩²].  (209)
But we already have Eq[⟨x, v⟩⁴] ≤ EpS[⟨x, v⟩⁴], and so the ratio Eq[⟨x, v⟩⁴]/Eq[⟨x, v⟩²]² is at most (6/5)² times the same ratio under pS, and in particular at most (6/5)²κ ≤ 1.5κ.
Next, we will show this for bounded noise assuming that hypercontractivity holds:
Lemma 3.11. Suppose that F₁(q) ≤ 2κ and that E_{X∼q}[g₂(X; q)] ≤ E_{X∼pS}[g₂(X; q)], and that κε ≤ 1/80. Then q has bounded noise with parameter (σ′)² = 4σ², and furthermore satisfies L(pS, θ∗(q)) ≤ 40σ²ε.
Proof. Again take the maximizing v such that g₂(xᵢ; q) = ⟨xᵢ, v⟩²(yᵢ − ⟨θ∗(q), xᵢ⟩)². We want to show that q has bounded noise, or in other words that Eq[g₂(x; q)] is small relative to Eq[⟨x, v⟩²]. By the quasigradient assumption, we have
Eq[⟨x, v⟩²(y − ⟨θ∗(q), x⟩)²] ≤ EpS[⟨x, v⟩²(y − ⟨θ∗(q), x⟩)²].
Intuitively, we want to use the bounded noise condition for pS to upper-bound the right-hand-side. The
problem is that the term inside the expectation contains θ∗ (q), rather than θ∗ (pS ). But we can handle this
using the AM-RMS inequality. Specifically, we have
EpS[⟨x, v⟩²(y − ⟨θ∗(q), x⟩)²] ≤ 2(EpS[⟨x, v⟩²(y − ⟨θ∗(pS), x⟩)²] + EpS[⟨x, v⟩²⟨θ∗(q) − θ∗(pS), x⟩²]),  (212)
where we label the first term on the right-hand side (a) and the second (b).
We will bound (a) and (b) in turn. To bound (a) note that by the bounded noise condition we simply have (a) ≤ σ² EpS[⟨x, v⟩²].
To bound (b), let R = EpS [hθ∗ (q) − θ∗ (pS ), xi2 ] = L(pS , θ∗ (q)) be the excess loss of θ∗ (q) under pS . We
will upper-bound (b) in terms of R, and then apply resilience to get a bound for R in terms of itself. Solving
the resulting inequality will provide an absolute bound on R and hence also on (b).
More specifically, we have
EpS[⟨x, v⟩² ⟨θ∗(q) − θ∗(pS), x⟩²] (i) ≤ EpS[⟨x, v⟩⁴]^{1/2} EpS[⟨θ∗(q) − θ∗(pS), x⟩⁴]^{1/2}  (213)
 (ii) ≤ κ EpS[⟨x, v⟩²] EpS[⟨θ∗(q) − θ∗(pS), x⟩²]  (214)
 = κR EpS[⟨x, v⟩²].  (215)
Here (i) is Cauchy-Schwarz and (ii) invokes hypercontractivity of pS. Combining the bounds on (a) and (b) and plugging back into (212), we obtain
EpS[⟨x, v⟩²(y − ⟨θ∗(q), x⟩)²] ≤ 2(σ² + κR) EpS[⟨x, v⟩²].  (216)
Remember that we would like a bound such as the above, but with expectations taken with respect to q instead of pS. For the left-hand side, we can directly move to Eq[·] using the quasigradient assumption. For
the right-hand side, since F₁(q) ≤ 2κ, the same argument as in (206)-(209) yields (with modified constants) that Eq[⟨x, v⟩²] ≥ (4/5) EpS[⟨x, v⟩²]. Applying both of these, we have that
Eq[⟨x, v⟩²(y − ⟨θ∗(q), x⟩)²] ≤ 2.5(σ² + κR) Eq[⟨x, v⟩²].  (217)
This establishes bounded noise with parameter (σ′)² = 2.5(σ² + κR). By assumption, we also have hypercontractivity with parameter κ′ = 2κ. We are not yet done, because we do not know R. But recall that R is the excess loss, and so by the resilience conditions for linear regression (Proposition 3.4) we have
R ≤ 5ρ(κ′, σ′) ≤ 10(σ′)²ε = 25(σ² + κR)ε,  (218)
as long as ε ≤ 1/8 and κ′ε = 2κε ≤ 1/6. Re-arranging, we have
R ≤ 25σ²ε/(1 − 25κε) ≤ 40σ²ε,  (219)
since κε ≤ 1/80. Plugging back into σ′, we also have (σ′)² ≤ 2.5σ²(1 + 40κε) ≤ 4σ², as claimed.
Quasigradient bounds via low-regret algorithms. Combining Lemmas 3.9, 3.10, and 3.11 together
yields Proposition 3.8. However, we still need to formalize Lemma 3.9, showing that we can drive the
quasigradients to be small. We can do so with the idea of low-regret online optimization algorithms.
An online optimization algorithm is one that takes a sequence of losses ℓ₁(·), ℓ₂(·), . . . rather than a fixed loss ℓ. In traditional optimization, we have a single ℓ with parameter w, and seek to produce iterates w₁, w₂, . . . such that ℓ(w_T) − ℓ(w∗) → 0 as T → ∞. In online optimization algorithms, we instead consider the regret, defined as
Regret_T = max_w (1/T) ∑_{t=1}^T (ℓₜ(wₜ) − ℓₜ(w)).  (220)
This is the average excess loss compared to the best fixed w, picked in hindsight. We then seek to produce iterates wₜ such that Regret_T → 0 as T → ∞. Note that when ℓₜ = ℓ is fixed for all t, this is exactly the same as traditional optimization.
Remarkably, for most “nice” loss functions ℓₜ, it is possible to ensure that Regret_T → 0; in fact, projected gradient descent with an appropriate step size will achieve this.
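To make the low-regret claim concrete, here is a tiny sketch (not from the notes) of projected online gradient descent on the interval [0, 1] with squared losses ℓₜ(w) = (w − bₜ)²; with a step size proportional to 1/√t, the average regret against the best fixed comparator goes to zero:

```python
import math, random

random.seed(0)
T = 20000
targets = [random.random() for _ in range(T)]  # b_t, revealed online

w = 0.5
iterates = []
for t, b in enumerate(targets, start=1):
    iterates.append(w)
    grad = 2 * (w - b)           # gradient of (w - b_t)^2 at w_t
    w = w - grad / math.sqrt(t)  # step size ~ 1/sqrt(t)
    w = min(1.0, max(0.0, w))    # project back onto [0, 1]

# Regret against the best fixed w in hindsight (the empirical mean,
# which minimizes the total squared loss).
best = sum(targets) / T
loss_alg = sum((wt - b) ** 2 for wt, b in zip(iterates, targets))
loss_best = sum((best - b) ** 2 for b in targets)
avg_regret = (loss_alg - loss_best) / T
assert avg_regret < 0.01  # average regret is vanishing
```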
How does this relate to quasigradients? For any quasigradient g(X; q), define the loss function ℓₜ(q) = E_{x∼q}[g(x; qₜ)]. Even though this loss depends on qₜ, the regret is well-defined:
Regret_T = max_{q′} (1/T) ∑_{t=1}^T (E_{x∼qₜ}[g(x; qₜ)] − E_{x∼q′}[g(x; qₜ)]) ≥ (1/T) ∑_{t=1}^T (E_{x∼qₜ}[g(x; qₜ)] − E_{x∼pS}[g(x; qₜ)]).  (221)
In particular, as long as Regret_T → 0, we asymptotically have that (1/T) ∑_{t=1}^T E_{x∼qₜ}[g(x; qₜ)] ≤ (1/T) ∑_{t=1}^T E_{x∼pS}[g(x; qₜ)], so eventually the quasigradient bound E_{x∼q}[g(x; q)] ≤ E_{x∼pS}[g(x; q)] must (almost) hold for one of the qₜ.
This is enough to show that, for any fixed quasigradient, a statement such as Lemma 3.9 holds.⁴ However, we want both g₁ and g₂ to be small simultaneously.
There are two ways to handle this. The first, crude way is to first use g₁ until we have hypercontractivity, then take the resulting q as our new p̃ (so that we restrict to ε-deletions of q) and run gradient descent with g₂ until we have bounded noise. This uses the fact that ε-deletions of hypercontractive distributions are still hypercontractive, but yields worse constants (since we end up with 2ε-deletions instead of ε-deletions).
A slicker approach is to alternate between g₁ and g₂ as in Algorithm 4. Note that we then still (asymptotically) have
(1/T) ∑_{t=1}^T E_{x∼qₜ}[g_{jₜ}(x; qₜ)] ≤ (1/T) ∑_{t=1}^T E_{x∼pS}[g_{jₜ}(x; qₜ)],  (222)
where jₜ ∈ {1, 2} indicates which quasigradient was used at step t.
[Lectures 13-14]
This definition is a bit abstruse so let us unpack it. The decision variable π is called a coupling between p
and q, and can be thought of as a way of matching points in p with points in q (π(x, y) is the amount of
mass in p(x) that is matched to q(y)). The Wasserstein distance is then the minimum cost coupling (i.e.,
minimum cost matching) between p and q. Some special cases include:
• c(x, y) = I[x ≠ y]. Then Wc is the total variation distance, with the optimal coupling being π(x, x) = min(p(x), q(x)) (the off-diagonal π(x, y) can be arbitrary as long as the total mass adds up correctly).
• c(x, y) = ‖x − y‖₂. Then Wc is the earth-mover distance—the average amount that we need to move points around to “move” p to q.
• c(x, y) = ‖x − y‖₀. Then Wc is the average number of coordinates we need to change to move p to q.
• c(x, y) = ‖x − y‖₂^α, for α ∈ [0, 1]. This is still a metric and interpolates between TV and earth-mover distance.
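For two empirical distributions on the line with equally many points, the earth-mover coupling simply matches sorted samples; the following small illustration (assuming equal sample sizes; not from the notes) computes W₁ this way:

```python
def w1_empirical(xs, ys):
    """W1 distance between two equal-size empirical distributions on R.

    In one dimension, the optimal coupling under c(x, y) = |x - y| is
    the monotone (sorted) matching.
    """
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Shifting every point by 0.5 moves all mass a distance of 0.5:
print(w1_empirical([0.0, 1.0, 2.0], [0.5, 1.5, 2.5]))  # → 0.5
```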
4 We have to be a bit careful because outliers could make g(x; q) arbitrarily large, which violates the assumptions needed to
achieve low regret. This can be addressed with a pre-filtering step that removes data points that are obviously too large to be
inliers, but we will not worry about this here.
There are a couple key properties of Wasserstein distance we will want to use. The first is that Wc is a metric
if c is:
Proposition 4.1. Suppose that c is a metric. Then Wc is also a metric.
Proof. TBD
As a special case, take c(x, y) = I[x ≠ y] (corresponding to TV distance). Then f ∈ L(c) if and only if |f(x) − f(y)| ≤ 1 for all x ≠ y. By translating f, we can equivalently take the supremum over all f mapping to [0, 1]. This says that
TV(p, q) = sup_{f : X → [0,1]} Ep[f(x)] − Eq[f(x)],  (226)
which recovers the definition of TV in terms of the maximum difference in probability of any event E.
As another special case, take c(x, y) = ‖x − y‖₂. Then the supremum is over all 1-Lipschitz functions (in the usual sense).
In the next section, we will see how to generalize the definition of resilience to any Wasserstein distance.
(Figure: a distribution p and a friendly perturbation of it, with the means µp and µr marked.)
Here we can equivalently either delete the left tail of p or shift all of its mass to µr ; both yield a modified
distribution with the same mean µr . Thus we can more generally say that a perturbation is friendly if it only
moves probability mass towards the mean. This motivates the following definition:
Definition 4.3 (Friendly perturbation). For a distribution p over X, fix a function f : X → R. A distribution r is an ε-friendly perturbation of p for f under Wc if there is a coupling π between X ∼ p and Y ∼ r such that:
• The cost Eπ[c(X, Y)] is at most ε.
• All points move towards the mean of r: f(Y) is between f(X) and Er[f(Y)] almost surely.
Note that friendliness is defined only in terms of one-dimensional functions f : X → R; we will see how to
handle higher-dimensional objects later. Intuitively, a friendly perturbation is a distribution r for which there
exists a coupling that ‘squeezes’ p to µr .
The key property of deletion in the TV case was the existence of a midpoint: for any two distributions that are within ε in TV, one can find another distribution that is an ε-deletion of both distributions. We would like to show the analogous result for Wc, i.e. that if Wc(p, q) ≤ ε then there exists an r that is an ε-friendly perturbation of both p and q for the function f.
The intuitive reason this is true is that any coupling between two one-dimensional distributions can be
separated into two stages: in one stage all the mass only moves towards some point, in the other stage all the
mass moves away from that point. This is illustrated in Figure 8.
Figure 8: Illustration of midpoint lemma. For any distributions p1 , p2 that are close under Wc , the coupling
between p1 and p2 can be split into couplings πp1 ,r , πp2 ,r such that p1 , p2 only move towards µr under the
couplings. We do this by “stopping” the movement from p1 to p2 at µr .
If we imagine u increasing from −∞ to +∞, we can think of sxy as a “slider” that tries to be as close to
u as possible while remaining between f (x) and f (y).
By Assumption 4.4, there must exist some point z such that max(c(x, z), c(z, y)) ≤ c(x, y) and f (z) =
sxy (u). Call this point zxy (u).
Given a coupling π(x, y) from p1 to p2 , if we map y to zxy (u), we obtain a coupling π1 (x, z) to some
distribution r(u), which by construction satisfies the squeezing property, except that it squeezes towards
u rather than towards the mean µ(u) = EX∼r(u) [f (X)]. However, note that u − µ(u) is a continuous,
monotonically non-decreasing function (since u − sxy (u) is non-decreasing) that ranges from −∞ to +∞. It
follows that there is a u∗ with µ(u∗ ) = u∗ . Then the couplings to r(u∗ ) squeeze towards its mean µ(u∗ ).
Moreover, E(X,Z)∼π1 [c(X, Z)] ≤ E(X,Y )∼π [c(X, Y )] = Wc (p1 , p2 ). The coupling π1 therefore also has small
enough cost, and so is a friendly perturbation. Similarly, the coupling π2 mapping y to zxy (u∗ ) satisfies the
squeezing property and has small enough cost by the same argument.
Defining resilience: warm-up. With Lemma 4.5 in hand, we generalize resilience to Wasserstein distances by saying that a distribution is resilient if Er[f(X)] is close to Ep[f(X)] for every η-friendly perturbation r and every function f lying within some appropriate family F. For now, we will focus on second moment estimation under W_{‖·‖₂} (we consider second moment estimation because mean estimation is trivial under W_{‖·‖₂}). This corresponds to the loss function
Sufficient conditions for W₁-resilience. Recall that for mean estimation under TV perturbation, any distribution with bounded ψ-norm was (O(εψ⁻¹(1/ε)), ε)-resilient. In particular, bounded covariance distributions were (O(√ε), ε)-resilient. We have an analogous result for W₁-resilience, but with a modified ψ function:
Proposition 4.7. Let ψ be an Orlicz function, and define ψ̃(x) = xψ(2x). Suppose that X (not X − µ) has bounded ψ̃-norm: Ep[ψ̃(|vᵀX|/σ)] ≤ 1 for all unit vectors v. Also assume that the second moment of p is at most σ². Then p is (ρ, ε)-resilient for ρ = max(σεψ⁻¹(2σ/ε), 4ε² + 2σε).
Let us interpret Proposition 4.7 before giving the proof. Take for instance ψ(x) = x². Then Proposition 4.7 asks for the 3rd moment to be bounded by σ³/4. In that case we have ρ = σεψ⁻¹(2σ/ε) = √2 σ^{3/2} ε^{1/2}. If the units seem weird, remember that ε has units of distance (before it was unitless) and hence σ^{3/2}ε^{1/2} has quadratic units, which matches the second moment estimation task.
More generally, taking ψ(x) = xᵏ, we ask for a (k+1)st moment bound and get error O(σ^{1+1/k} ε^{1−1/k}).
We now turn to proving Proposition 4.7. A helpful auxiliary lemma (used here and later) provides a way to use Orlicz norm bounds:
Lemma 4.8. Let p and q be two distributions over X, g : X → R be any function, c be a non-negative cost function, and ψ be an Orlicz function. Then for any coupling π_{p,q} between p and q and any σ > 0 we have
|Ep[g(X)] − Eq[g(Y)]| ≤ σ Eπ[c(X, Y)] ψ⁻¹( Eπ[ c(X, Y) ψ( |g(X) − g(Y)| / (σ c(X, Y)) ) ] / Eπ[c(X, Y)] ).
Proof. Note that |Ep[g(X)] − Eq[g(Y)]| = |Eπ[g(X) − g(Y)]|. We weight the coupling π by the cost c to obtain a new probability measure π′(x, y) = c(x, y)π(x, y)/Eπ[c(x, y)]. We apply Jensen's inequality under π′ as follows:
ψ( Eπ[g(X) − g(Y)] / (σ Eπ[c(X, Y)]) ) = ψ( Eπ[ (c(X, Y)/Eπ[c(X, Y)]) · ((g(X) − g(Y))/(σ c(X, Y))) ] )  (231)
 = ψ( Eπ′[ (g(X) − g(Y))/(σ c(X, Y)) ] )  (232)
 ≤ Eπ′[ ψ( |g(X) − g(Y)|/(σ c(X, Y)) ) ]  (233)
 = Eπ[ c(X, Y) ψ( |g(X) − g(Y)|/(σ c(X, Y)) ) ] / Eπ[c(X, Y)].  (234)
Inverting ψ yields the desired result.
Proof of Proposition 4.7. We apply Lemma 4.8 with q = r an ε-friendly perturbation of p for the function ⟨x, v⟩², and g(x) = ⟨x, v⟩²; we will also use the cost c′(x, y) = |vᵀ(x − y)|, which satisfies c′(x, y) ≤ c(x, y). Taking π to be the ε-friendly coupling (under c, not c′) between p and r yields
|Ep[⟨x, v⟩²] − Er[⟨x, v⟩²]| ≤ σε ψ⁻¹( Eπ[ |⟨x − y, v⟩| ψ( |⟨x, v⟩² − ⟨y, v⟩²| / (σ|⟨x − y, v⟩|) ) ] / ε )  (235)
 = σε ψ⁻¹( Eπ[ |⟨x − y, v⟩| ψ( |⟨x, v⟩ + ⟨y, v⟩| / σ ) ] / ε ).  (236)
Now we will split into two cases. First, we observe that the worst-case friendly perturbation will either move all of the ⟨x, v⟩² upwards, or all of the ⟨x, v⟩² downwards, since otherwise we could take just the upwards part or just the downwards part and perturb the mean further. In other words, we either have (i) ⟨x, v⟩² ≥ ⟨y, v⟩² for all (x, y) ∈ supp(π) with x ≠ y, or (ii) ⟨x, v⟩² ≤ ⟨y, v⟩² for all (x, y) ∈ supp(π) with x ≠ y. We analyze each case in turn.
Case (i): y moves downwards. In this case we can use the bounds |⟨x − y, v⟩| ≤ 2|⟨x, v⟩| and |⟨x + y, v⟩| ≤ 2|⟨x, v⟩| together with (236) to conclude that
|Ep[⟨x, v⟩²] − Er[⟨x, v⟩²]| ≤ σε ψ⁻¹( Eπ[ 2|⟨x, v⟩| ψ(2|⟨x, v⟩|/σ) ] / ε )  (237)
 = σε ψ⁻¹( Ep[ 2σ ψ̃(|⟨x, v⟩|/σ) ] / ε )  (238)
 ≤ σε ψ⁻¹(2σ/ε),  (239)
where the final inequality is by the bounded Orlicz norm of p.
Case (ii): y moves upwards. In this case by friendliness we have that ⟨y, v⟩² ≤ vᵀSᵣv whenever (x, y) ∈ supp(π) and y ≠ x. Thus
|⟨x − y, v⟩| ψ(|⟨x, v⟩ + ⟨y, v⟩|/σ) ≤ |⟨x − y, v⟩| ψ(2|⟨y, v⟩|/σ) ≤ |⟨x − y, v⟩| ψ(2√(vᵀSᵣv)/σ)  (240)
for all (x, y) ∈ supp(π). Plugging back into (236) yields
|Ep[⟨x, v⟩²] − Er[⟨x, v⟩²]| ≤ σε ψ⁻¹( Eπ[ |⟨x − y, v⟩| ψ(2√(vᵀSᵣv)/σ) ] / ε )  (241)
 ≤ σε ψ⁻¹( ε · ψ(2√(vᵀSᵣv)/σ) / ε )  (242)
 = σε · 2√(vᵀSᵣv)/σ = 2ε√(vᵀSᵣv).  (243)
Here the final inequality is because Eπ[|⟨x − y, v⟩|] ≤ Eπ[c(x, y)] ≤ ε under the coupling. Comparing the left-hand side to the final right-hand side yields |vᵀSₚv − vᵀSᵣv| ≤ 2ε√(vᵀSᵣv). Thus defining ∆ = |vᵀSₚv − vᵀSᵣv| and using the fact that vᵀSₚv ≤ σ², we obtain ∆ ≤ 2ε√(∆ + σ²), which implies (after solving the quadratic) that ∆ ≤ 4ε² + 2σε.
Thus overall we have |Ep[⟨x, v⟩²] − Er[⟨x, v⟩²]| ≤ max(σεψ⁻¹(2σ/ε), 4ε² + 2σε), as was to be shown.
which exists for some Fθ and L∗ whenever L(p, θ) is convex in p. The family Fθ is roughly the family of subgradients of L(p, θ) as p varies for fixed θ. In this case we obtain conditions G↓ and G↑ as before, asking that
Er[f(x)] − L∗(f, θ∗(p)) ≤ ρ₁ for all f ∈ F_{θ∗(p)} and ε-friendly r,  (↓)
and furthermore
L(p, θ) ≤ ρ₂ if for every f ∈ Fθ there is an ε-friendly r such that Er[f(x)] − L∗(f, θ) ≤ ρ₁.  (↑)
Note that for the second condition (↑), we allow the perturbation r to depend on the current function f. If r were fixed this would closely match the old definition, but we can only do that for deletions since in general even the set of feasible r depends on f.
Using this, we can (after sufficient algebra) derive sufficient conditions for robust linear regression under
W1 , for conditions similar to the hypercontractivity condition from before. This will be a challenge problem
on the homework.
Finally, we can define a W̃₁ similar to T̃V, but our understanding of it is far more rudimentary. In particular, known analyses do not seem to yield the correct finite-sample rates (for instance, the rate of convergence includes an n^{−1/3} term that seems unlikely to actually exist).
formulas that hold even when the particular parametric model is wrong, as long as certain “orthogonality
assumptions” are true.
Specifically, we will consider linear regression, deriving standard error estimates via typical parametric
confidence regions as with GLMs. We will see that these are brittle, but that they are primarily brittle
because they implicitly assume that certain equations hold. If we instead explicitly substitute the right-hand
side of those equations, we get better confidence intervals that hold under fewer assumptions. As a bonus,
we’ll be able to study how linear regression performs under distribution shift.
Starting point: linear response with Gaussian errors. In the simplest setting, suppose that we completely believe our model:
y = ⟨β, x⟩ + z, where z ∼ N(0, σ²) independently of x.
We observe samples (x1 , y1 ), . . . , (xn , yn ) ∼ p. Suppose that we estimate β using the ordinary least squares
estimator:
β̂ = arg min_β (1/n) ∑_{i=1}^n (yᵢ − ⟨β, xᵢ⟩)² = (∑_{i=1}^n xᵢxᵢᵀ)⁻¹ (∑_{i=1}^n xᵢyᵢ).  (246)
Define S = (1/n) ∑_{i=1}^n xᵢxᵢᵀ. Then since yᵢ = xᵢᵀβ + zᵢ, we can further write
β̂ = (∑_{i=1}^n xᵢxᵢᵀ)⁻¹ (∑_{i=1}^n (xᵢxᵢᵀβ + xᵢzᵢ))  (247)
 = (nS)⁻¹ (nSβ + ∑_{i=1}^n xᵢzᵢ)  (248)
 = β + (1/n) S⁻¹ ∑_{i=1}^n xᵢzᵢ.  (249)
From this we see that, conditional on the xᵢ, β̂ − β is a zero-mean Gaussian distribution. Its covariance matrix is given by
(1/n²) S⁻¹ ( ∑_{i=1}^n E[zᵢ² | xᵢ] xᵢxᵢᵀ ) S⁻¹ = (σ²/n) S⁻¹.  (250)
Confidence regions. The above calculation shows that the error β̂ − β is exactly Gaussian with covariance matrix (σ²/n)S⁻¹ (at least assuming the errors zᵢ are i.i.d. Gaussian). Thus the (parametric) confidence region for β̂ − β would be an ellipsoid with shape S⁻¹ and radius depending on σ, n, and the significance level α of the test. As a specific consequence, the standard error for βᵢ is σ√((S⁻¹)ᵢᵢ/n). This is the standard error estimate returned by default in most software packages.
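The classical standard errors above take only a few lines to compute directly; the following sketch (with numpy and synthetic data of our own invention) estimates σ from the residuals and forms σ√((S⁻¹)ᵢᵢ/n):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.standard_normal((n, d))
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + 0.5 * rng.standard_normal(n)  # homoskedastic Gaussian noise

S = X.T @ X / n
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS estimate
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - d))  # estimate of sigma

S_inv = np.linalg.inv(S)
se = sigma_hat * np.sqrt(np.diag(S_inv) / n)  # classical standard errors
```

Since S ≈ I for this data, each entry of se should come out near 0.5/√n ≈ 0.022.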
Of course, this all so far rests on the assumption of Gaussian error. Can we do better?
Calculation from moment assumptions. It turns out that our calculation above relied only on conditional moments of the errors, rather than Gaussianity. We will show this explicitly by doing the calculations more carefully. Re-using steps above, we have that
β̂ − β = (1/n) S⁻¹ ∑_{i=1}^n xᵢzᵢ.  (251)
Taking expectations conditional on the xᵢ, this gives
E[β̂ − β | x₁:ₙ] = S⁻¹ b,  (252)
where b := (1/n) ∑_{i=1}^n xᵢ E[zᵢ | xᵢ].
In particular, as long as E[Z | X] = 0 for all X, β̂ is an unbiased estimator of β. In fact, since this only needs to hold on average, as long as E[ZX] = 0 (covariates uncorrelated with noise) then E[β̂ − β] = 0, and E[β̂ − β | x₁:ₙ] converges to zero as n → ∞. This yields an insight that is important more generally:
Ordinary least squares yields an unbiased estimate of β whenever the covariates X and noise Z are uncorrelated.
This partly explains the success of OLS compared to other alternatives (e.g. penalizing the absolute error or fourth power of the error). While OLS might initially look like the maximum likelihood estimator under Gaussian errors, it yields consistent estimates of β under much weaker assumptions. Minimizing the fourth power of the error requires stronger assumptions for consistency, while minimizing the absolute error would yield a different condition in terms of medians rather than expectations.
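The point that OLS only needs uncorrelated noise is easy to check by simulation; in this sketch (not from the notes) the noise is a centered exponential, which is skewed and decidedly non-Gaussian, yet the one-dimensional OLS slope still concentrates around the truth:

```python
import random

random.seed(0)

def ols_slope(n, beta=2.0):
    # One-dimensional OLS: beta_hat = sum(x*y) / sum(x*x).
    num = den = 0.0
    for _ in range(n):
        x = random.gauss(0.0, 1.0)
        z = random.expovariate(1.0) - 1.0  # mean-zero but skewed noise
        y = beta * x + z
        num += x * y
        den += x * x
    return num / den

est = ols_slope(50000)
assert abs(est - 2.0) < 0.05  # consistent despite non-Gaussian errors
```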
Next we turn to the covariance of β̂. Assuming again that the (xᵢ, yᵢ) are independent across i, we have
Cov[β̂ | x₁:ₙ] = Cov[ (1/n) S⁻¹ ∑_{i=1}^n xᵢzᵢ | x₁:ₙ ]  (253)
 = (1/n²) S⁻¹ ( ∑_{i,j=1}^n xᵢ Cov[zᵢ, zⱼ | xᵢ, xⱼ] xⱼᵀ ) S⁻¹  (254)
 = (1/n²) S⁻¹ ( ∑_{i=1}^n xᵢ Var[zᵢ | xᵢ] xᵢᵀ ) S⁻¹,  (255)
where the final line is because zᵢ, zⱼ are independent for i ≠ j. If we define Ω = (1/n) ∑_{i=1}^n xᵢ Var[zᵢ | xᵢ] xᵢᵀ, then the final term becomes (1/n) S⁻¹ Ω S⁻¹.
This quantity can be upper-bounded under much weaker assumptions than Gaussianity. If we, for instance, merely assume that Var[zᵢ | xᵢ] ≤ σ² for all i, then we have that Ω ⪯ σ²S and hence Cov[β̂ | x₁:ₙ] ⪯ (σ²/n)S⁻¹.
Even better, this quantity can be estimated from data. Let uᵢ² = (yᵢ − β̂ᵀxᵢ)². This is a downward-biased, but asymptotically unbiased, estimate for Var[zᵢ | xᵢ] (it would be unbiased if we used β instead of β̂). Therefore, form the matrix
Ω̂ₙ = (1/n) ∑_{i=1}^n xᵢ uᵢ² xᵢᵀ.  (256)
Then (1/n) S⁻¹ Ω̂ₙ S⁻¹ can be used to generate confidence regions and standard errors. In particular, the standard error estimate for βᵢ is √((S⁻¹Ω̂ₙS⁻¹)ᵢᵢ/n). This is called the robust standard error or heteroskedasticity-consistent standard error.
There are a couple of simple improvements on this estimate. The first is a “degrees of freedom” correction: we know that uᵢ² is downward-biased, and it is more downward-biased the larger the dimension d (because then β̂ can more easily overfit). We often instead use (1/(n − d)) S⁻¹ Ω̂ₙ S⁻¹, which corrects for this.
A fancier correction, based on the jackknife, first corrects the errors uᵢ, via
u′ᵢ = uᵢ/(1 − κᵢ), with κᵢ = (1/n) xᵢᵀ S⁻¹ xᵢ.
We obtain a corresponding Ω′ₙ = (1/n) ∑_{i=1}^n xᵢ (u′ᵢ)² xᵢᵀ, and the matrix for the standard errors is
(1/n) S⁻¹ (Ω′ₙ − ζζᵀ) S⁻¹, where ζ = (1/n) ∑_{i=1}^n xᵢ u′ᵢ.
The main difference is that each uᵢ gets a different correction factor 1/(1 − κᵢ) (which is however roughly equal to n/(n − d)) and also that we subtract off the mean ζ. There is some evidence that this more complicated estimator works better when the sample size is small; see for instance MacKinnon and White (1985).
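The robust (sandwich) standard errors and the degrees-of-freedom correction above can be sketched as follows, using numpy and synthetic heteroskedastic data of our own invention (the jackknife variant would only change how the residuals are rescaled):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2
X = rng.standard_normal((n, d))
beta = np.array([1.0, -1.0])
# Heteroskedastic noise: the variance grows with ||x||.
z = rng.standard_normal(n) * np.linalg.norm(X, axis=1)
y = X @ beta + z

S = X.T @ X / n
S_inv = np.linalg.inv(S)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ beta_hat                          # residuals u_i

# Omega_hat = (1/n) sum_i x_i u_i^2 x_i^T, the "meat" of the sandwich.
Omega_hat = (X * (u**2)[:, None]).T @ X / n

# Robust covariance with the 1/(n-d) degrees-of-freedom correction.
cov_robust = S_inv @ Omega_hat @ S_inv / (n - d)
se_robust = np.sqrt(np.diag(cov_robust))

# Classical (homoskedastic) standard errors, for comparison.
sigma2_hat = u @ u / (n - d)
se_classical = np.sqrt(sigma2_hat * np.diag(S_inv) / n)
```

On heteroskedastic data like this the two estimates differ, and only the robust one remains asymptotically valid.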
Out-of-distribution error. Now suppose that we wish to estimate the error on test samples x̄₁:ₘ drawn from a distribution p̄ ≠ p. Start again with the Gaussian assumption that y = βᵀx + z.
The expected error on sample x̄ᵢ (over test noise z̄ᵢ) is σ² + ⟨β̂ − β, x̄ᵢ⟩². If we let S̄ = (1/m) ∑_{i=1}^m x̄ᵢx̄ᵢᵀ, and let E_Z denote the expectation with respect to the training noise z₁, . . . , zₙ, then the overall average expected error (conditional on x₁:ₙ, x̄₁:ₘ) is
σ² + E_Z[ (1/m) ∑_{i=1}^m (x̄ᵢᵀ(β − β̂))² ] = σ² + ⟨ (1/m) ∑_{i=1}^m x̄ᵢx̄ᵢᵀ, E_Z[(β − β̂)(β − β̂)ᵀ] ⟩  (257)
 = σ² + (σ²/n) ⟨S̄, S⁻¹⟩  (258)
 = σ² (1 + (1/n) ⟨S̄, S⁻¹⟩).  (259)
This shows that the error depends on the divergence between the second moment matrices of p(x) and p̄(x):
• When p(x) = p̄(x), then ⟨S̄, S⁻¹⟩ = tr(S̄S⁻¹) ≈ tr(I) = d, so the error decays as d/n.
• If S is low-rank and is missing any directions that appear in S̄, then the error is infinite. This makes sense, as we have no way of estimating β along the missing directions, and we need to be able to estimate β in those directions to get good error under p̄. We can get non-infinite bounds if we further assume some norm bound on β; e.g. if ‖β‖₂ is bounded then the missing directions only contribute some finite error.
• On the other hand, if S is full-rank but S̄ is low-rank, then we still achieve finite error. For instance, suppose that S = I is the identity, and S̄ = (d/k)P where P is a projection matrix onto a k-dimensional subspace, scaled to have trace d. Then we get a sample complexity of d/n, although if we had observed samples with second moment matrix S̄ at training time, we would have gotten a better sample complexity of k/n.
• In general it is always better for S to be bigger. This is partially an artefact of the noise σ² being the same for all X, so we would always rather have X be as far out as possible since it pins down β more effectively. If the noise was proportional to ‖X‖ (for instance) then the answer would be different.
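The quantity ⟨S̄, S⁻¹⟩ is easy to examine numerically; this sketch (numpy, with a made-up covariate shift) compares the in-distribution case, where it is ≈ d, against a test distribution stretched along one direction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 2000, 2000, 5

X_train = rng.standard_normal((n, d))
S = X_train.T @ X_train / n

# In-distribution test set: same covariance as training.
X_test_same = rng.standard_normal((m, d))
# Shifted test set: one direction is stretched by a factor of 3.
X_test_shift = rng.standard_normal((m, d)) * np.array([3.0, 1, 1, 1, 1])

S_inv = np.linalg.inv(S)
for X_test in (X_test_same, X_test_shift):
    S_bar = X_test.T @ X_test / m
    # <S_bar, S^{-1}> = tr(S_bar S^{-1}); the excess test error in (258)
    # is sigma^2/n times this quantity.
    print(np.trace(S_bar @ S_inv))
```

The first trace comes out near d = 5, while the stretched test distribution inflates it substantially.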
Robust OOD error estimate. We can also estimate the OOD error even when the Gaussian assumption doesn't hold, using the same idea as for robust standard errors. Letting z̄ᵢ be the noise for x̄ᵢ, the squared error is then (1/m) ∑_{i=1}^m (⟨β − β̂, x̄ᵢ⟩ + z̄ᵢ)², and computing the expectation given x₁:ₙ, x̄₁:ₘ yields
E[ (1/m) ∑_{i=1}^m (⟨β − β̂, x̄ᵢ⟩ + z̄ᵢ)² | x₁:ₙ, x̄₁:ₘ ]  (260)
 = (1/m) ∑_{i=1}^m ( x̄ᵢᵀ E[(β − β̂)(β − β̂)ᵀ | x₁:ₙ] x̄ᵢ + 2 x̄ᵢᵀ E[β − β̂ | x₁:ₙ] E[z̄ᵢ | x̄ᵢ] + E[z̄ᵢ² | x̄ᵢ] )  (261)
 = ⟨ S̄, S⁻¹((1/n)Ω + bbᵀ)S⁻¹ ⟩ − 2⟨ b̄, S⁻¹b ⟩ + (1/m) ∑_{i=1}^m E[z̄ᵢ² | x̄ᵢ],  (262)
where b̄ = (1/m) ∑_{i=1}^m x̄ᵢ E[z̄ᵢ | x̄ᵢ].
To interpret this expression, first assume that the true model is “actually linear”, meaning that b = b̄ = 0. Then the expression reduces to (1/n)⟨S̄, S⁻¹ΩS⁻¹⟩ + (1/m) ∑_{i=1}^m E[z̄ᵢ² | x̄ᵢ]. The second term is the intrinsic variance in the data, while the first term is similar to the (1/n)⟨S̄, S⁻¹⟩ term from before, but accounts for correlation between X and the variation in Z.
If the model is not actually linear, then we need to decide how to define β (since the optimal β is then no longer independent of the distribution). In that case a natural choice is to let β be the minimizer under the training distribution, in which case b → 0 as n → ∞ and thus the ⟨b̄, S⁻¹b⟩ term conveniently becomes asymptotically negligible. The twist is that E[z̄ᵢ² | x̄ᵢ] now measures not just the intrinsic variance but also the departure from linearity, and could be quite large if the linear extrapolation away from the training points ends up being poor.
Partial specification. In general, we see that we can actually form good estimates of the mean-squared
error on p̄ making only certain moment assumptions (e.g. b = b̄ = 0) rather than needing to assume the
Gaussian model is correct. This idea is called partial specification, where rather than making assumptions that
are stringent enough to specify a parametric family, we make weaker assumptions that are typically insufficient
to even yield a likelihood, but show that our estimates are still valid under those weaker assumptions. The
weaker the assumptions, the happier we are. Of course b = b̄ = 0 is still fairly strong, but much better
than Gaussianity. The goal of partial specification aligns with our earlier desire to design estimators for the
entire family of resilient distributions, rather than specific parametric classes. We will study other variants of
partial specification later in the course, in the context of clustering algorithms.
[Lecture 18]
Proof. This is equivalent to showing that the functions fµ,Σ (x) defining the pdf of a Gaussian are all linearly
independent (i.e., there is no non-trivial finite combination that yields the zero function). We will start by
showing this in one dimension. So, suppose for the sake of contradiction that
∑_{j=1}^m cⱼ exp(−(x − µⱼ)²/(2σⱼ²)) / √(2πσⱼ²) = 0,  (263)
where the cj are all non-zero. Then integrating (263) against the function exp(λx) and using the formula for
the moment generating function of a Gaussian, we obtain
m
X 1
cj exp( σj2 λ2 + µj λ) = 0. (264)
j=1
2
1 2
Let σmax = maxm 2
j=1 σj , then dividing the above equation by exp( 2 σmax λ ) and taking λ → ∞, we see that
only those j such that σj = σmax affect the limit. If S is the set of such indices j, we obtain
X
cj exp(µj λ) = 0, (265)
j∈S
i.e. there is a linear relation between the functions gµj (λ) = exp(µj λ). But this is impossible, because as long
as the µj are distinct, the largest µj will always dominate the limit of the linear relation as λ → ∞, and so
we must have cj = 0 for that j, a contradiction.
It remains to extend to the n-dimensional case. Suppose there was a linear relation among the PDFs of
n-dimensional Gaussians with distinct parameters. Then if we project to a random 1-dimensional subspace,
the corresponding marginals (which are linear functions of the n-dimensional PDFs) are also each Gaussian,
and have distinct parameters with probability 1. This is again a contradiction since we already know that
distinct 1-dimensional Gaussians cannot satisfy any non-trivial linear relation.
Proposition 5.1 shows that we can recover the parameters exactly in the limit of infinite data, but it
doesn’t say anything about finite-sample rates. However, asymptotically, as long as the log-likelihood function
is locally quadratic around the true parameters, we can use tools from asymptotic statistics to show that we
approach the true parameters at a 1/√n rate.
Recovery from moments. Proposition 5.1 also leaves open the question of efficient computation. In
practice we would probably use k-means or EM, but another algorithm is based on the method of moments.
It has the virtue of being provably efficient, but is highly brittle to mis-specification.
The idea is that the first, second, and third moments give a system of equations that can be solved for
the parameters (α, µ, Σ): letting p = Σ_j αj N(µj, Σj), we have

Ep[X] = Σ_{j=1}^k αj µj, (266)
Ep[X ⊗ X] = Σ_{j=1}^k αj (µj µj⊤ + Σj), (267)
Ep[X ⊗ X ⊗ X] = Σ_{j=1}^k αj (µj⊗³ + 3 Sym(µj ⊗ Σj)), (268)
These equations can be solved via tensor decomposition, which is a generalization of the power method
for matrices; the rate of convergence is polynomial in k, d, and the condition number of certain matrices
(and decays as 1/√n).
However, this method is very brittle—it relies on exact algebraic moment relations of Gaussians, so even
small departures from the assumptions (like moving from Gaussian to sub-Gaussian) will likely break the
algorithm. This is one nice thing about the agnostic clustering setting—it explicitly reveals the brittleness of
algorithms like the one above, and (as we shall see) shows why other algorithms such as k-means are likely to
perform better in practice.
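As a quick sanity check on the moment identities above, the following sketch (toy one-dimensional mixture with hypothetical weights, means, and variances) compares the closed-form first and second moments against Monte Carlo estimates:

```python
import random

random.seed(0)
alphas = [0.3, 0.7]    # mixture weights (toy values)
mus = [-2.0, 1.0]      # component means
sigmas = [0.5, 1.0]    # component standard deviations

# Closed-form moments from (266)-(267), specialized to one dimension:
# E[X] = sum_j alpha_j mu_j,  E[X^2] = sum_j alpha_j (mu_j^2 + sigma_j^2)
m1 = sum(a * m for a, m in zip(alphas, mus))
m2 = sum(a * (m * m + s * s) for a, m, s in zip(alphas, mus, sigmas))

# Monte Carlo estimates of the same moments
n = 200_000
samples = [random.gauss(*((mus[0], sigmas[0]) if random.random() < alphas[0]
                          else (mus[1], sigmas[1]))) for _ in range(n)]
mc1 = sum(samples) / n
mc2 = sum(x * x for x in samples) / n
```

With these toy parameters the closed forms give E[X] = 0.1 and E[X²] = 2.675, and the Monte Carlo estimates agree to within sampling error.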
Cluster recovery. An important point is that even in this favorable setting, exact cluster recovery is
impossible. This is because even if the Gaussians are well-separated, there is some small probability that a
sample ends up being near the center of a different Gaussian.
To measure this quantitatively, assume for simplicity that Σj = σ²I for all j (all Gaussians are isotropic
with the same variance), and suppose also that the µj are known exactly and that we assign each point x to
the cluster that minimizes kx − µjk₂.⁶ Then the error in cluster recovery is exactly the probability that a
sample from µj ends up closer to some other center µj′, which is

Σ_{j=1}^k αj P_{x∼N(µj,σ²I)}[kx − µjk₂ > kx − µj′k₂ for some j′ ≠ j] ≤ Σ_{j=1}^k αj Σ_{j′≠j} Φ̄(kµj − µj′k₂/2σ) (269)
≤ k Φ̄(∆/2σ), (270)

where ∆ = min_{j′≠j} kµj − µj′k₂ and Φ̄(t) = P[N(0, 1) > t] is the Gaussian tail probability. As long as
∆ ≳ σ√log(k/ε), the cluster error will be at most ε. Note that the cluster error depends on a separation
condition stipulating that the cluster centers are all sufficiently far apart. Moreover, we need greater
separation if there are more total clusters (albeit at a slowly-growing rate in the Gaussian case).
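A toy one-dimensional simulation consistent with the tail bound above (two equally weighted clusters; σ, ∆ are hypothetical values, and the Gaussian tail is computed via the complementary error function):

```python
import math
import random

random.seed(0)
sigma, delta, k = 1.0, 3.0, 2   # toy variance, separation, and cluster count
gauss_tail = lambda t: 0.5 * math.erfc(t / math.sqrt(2))  # P[N(0,1) > t]

# Empirical misassignment rate: sample from the cluster centered at 0 and
# assign to the nearest of the two centers {0, delta}.
n = 100_000
errors = 0
for _ in range(n):
    x = random.gauss(0.0, sigma)
    if abs(x - delta) < abs(x - 0.0):   # assigned to the wrong center
        errors += 1
rate = errors / n
bound = k * gauss_tail(delta / (2 * sigma))   # the k * tail(Delta / 2 sigma) bound
```

The empirical misassignment rate should sit near the single-pair tail probability and below the k-fold union bound.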
Proof. The basic intuition is that we can cover the points in S̃ by resilient sets S′₁, . . . , S′_{2/α} of size (α/2)n. Then
by the pigeonhole principle, the resilient set S must have large overlap with at least one of the S′, and hence
have similar mean. This is captured in Figure 9 below.
The main difference is that S and S′ may have relatively small overlap (in a roughly α-fraction of elements).
We thus need to care about resilience when the subset T is small compared to S. The following lemma relates
resilience on large sets to resilience on small sets:
Lemma 5.3. For any 0 < ε < 1, a distribution/set is (ρ, ε)-resilient if and only if it is ((1−ε)/ε · ρ, 1 − ε)-resilient.
This was already proved in Appendix C as part of Lemma 2.14. Given Lemma 5.3, we can prove
Proposition 5.2 with a similar triangle inequality argument to how we showed that resilient sets have small
modulus of continuity. However, we now need to consider multiple resilient sets Si rather than a single S 0 .
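The duality in Lemma 5.3 can be checked exactly by brute force on a small point set (toy values below): subsets keeping an ε-fraction are complements of subsets keeping a (1 − ε)-fraction, so the worst-case mean deviations are related by exactly the factor (1 − ε)/ε.

```python
from itertools import combinations

xs = [0.0, 0.5, 1.0, 2.0, 3.0, 10.0]   # toy data set (hypothetical values)
n, eps = len(xs), 1.0 / 3.0
mu = sum(xs) / n

def max_dev(size):
    # worst-case deviation of a size-`size` subset mean from the overall mean
    return max(abs(sum(c) / size - mu) for c in combinations(xs, size))

rho_large = max_dev(round((1 - eps) * n))   # subsets keeping a (1-eps)-fraction
rho_small = max_dev(round(eps * n))         # subsets keeping an eps-fraction
# Lemma 5.3 duality: the two deviations differ by exactly (1-eps)/eps
assert abs(rho_small - (1 - eps) / eps * rho_large) < 1e-9
```

Here the worst ε-fraction subset is {3, 10} and the worst (1 − ε)-fraction subset is its complement, so the two maxima are attained at complementary sets, as the proof of the lemma suggests.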
⁶ This is not quite optimal; in reality we would want to assign based on kx − µjk₂²/σ² − 2 log αj, but we consider this simpler
assignment for simplicity.
Figure 9: If we cover S̃ by resilient sets, at least one of the sets S 0 has large intersection with S.
Suppose S is (ρ, α/4)-resilient around µ (and thus also ((4/α)ρ, 1 − α/4)-resilient by Lemma 5.3), and let S₁, . . . , Sm
be a maximal collection of subsets of [n] such that:
1. |Sj| ≥ (α/2)n for all j.
2. Sj is ((4/α)ρ, 1 − α/2)-resilient (with mean µj).
3. Sj ∩ Sj′ = ∅ for all j ≠ j′.
Clearly m ≤ 2/α. We claim that S has large intersection with at least one of the Sj and hence µj is close
to µ. By maximality of the collection {Sj}_{j=1}^m, it must be that S₀ = S\(S₁ ∪ · · · ∪ Sm) cannot be added to
the collection. First suppose that |S₀| ≥ (α/2)n. Then S₀ is ((4/α)ρ, 1 − α/2)-resilient (because any subset of (α/2)|S₀|
points in S₀ is a subset of at least (α/4)|S| points in S). This contradicts the maximality of {Sj}_{j=1}^m, so we must
have |S₀| < (α/2)n.
Now, this implies that |S ∩ (S₁ ∪ · · · ∪ Sm)| ≥ (α/2)n, so by pigeonhole we must have |S ∩ Sj| ≥ (α/2)|Sj| for
some j. Letting T = S ∩ Sj as before, we find that |T| ≥ (α/2)|Sj| ≥ (α/4)|S| and hence by resilience of Sj and S
we have kµ − µjk ≤ 2 · (4/α)ρ = (8/α)ρ by the same triangle inequality argument as before.
Better bounds for well-separated clusters. Proposition 5.2 is powerful because it holds under very
minimal conditions (we do not need to assume anything about separation of clusters or even about any of the
clusters other than the one we are estimating). However, its guarantees are also minimal—we only know that
we get approximate parameter recovery in the list-decoding model, and cannot say anything about cluster
recovery. We next obtain a stronger bound assuming that the data can actually be separated into clusters
(with a small fraction of outliers) and that the means are well-separated. This stronger result both gives
cluster recovery, and gives better bounds for parameter recovery:
Proposition 5.4. Suppose that a set of points {x₁, . . . , xn} can be partitioned into k sets C₁, . . . , Ck of size
α₁n, . . . , αkn, together with a fraction εn of outliers (ε = 1 − (α₁ + · · · + αk)), where 2ε ≤ α = min_{j=1}^k αj.
Further suppose that
• Each cluster is (ρ₁, ε)-resilient and (ρ₂, 2ε/α)-resilient.
• The means are well-separated: ∆ > 4ρ₁/ε, where ∆ = min_{j≠j′} kµj − µj′k₂.
• Each Sl is (ρ₁, ε)-resilient.
We know that such a collection exists because we can take the Cj themselves. Now call a set S “j-like” if it
contains at least αj(ε/α)|S| points from Cj. We claim that each Sl is j-like for exactly one j. Indeed, by
pigeonhole it must be j-like for at least one j since ε/α ≤ 1/2 < 1.
In the other direction, note that if S is j-like then S ∩ Cj contains at least an (αjε/α)-fraction of the points in S,
and at least a (|S|/n)(ε/α) ≥ ε fraction of the points in Cj. Thus by resilience of both sets, the means of both S and
Cj are within ρ₁/ε of the mean of S ∩ Cj and hence within 2ρ₁/ε of each other. In summary, kµj − µSk₂ ≤ 2ρ₁/ε.
Now if S were j-like and also j′-like, then we would have kµj − µj′k₂ ≤ 4ρ₁/ε, which contradicts the separation
assumption.
Since Sl is j-like for a unique j, it contains at most (ε/α)|Sl| points from any of the other Cj′, together
with at most εn outliers. Moreover, since the other Sl′ are not j-like, Sl is missing at most αj(ε/α)n points
from Cj. Thus Sl ∩ Cj is missing at most (2ε/α)|Sl| points from Sl and at most (ε/α)|Cj| points from Cj. By
resilience their means are thus within 2ρ₂ of each other, as claimed.
• ∆ ≥ 36σ/√α, where ∆ = min_{j≠j′} kµj − µj′k₂.
Then there is a polynomial-time algorithm outputting candidate clusters Ĉ₁, . . . , Ĉk and means µ̂₁, . . . , µ̂k
such that:
• |Cj △ Ĉj| = O(σ²/(α∆²)) · |Cj| (cluster recovery), and
1. Project points xi to Pk xi.
2. Form initial clusters based on the Pk xi.
3. Compute the means of each of these clusters.
4. Optionally run any number of steps of k-means in the original space of xi, initialized with the computed
means from the previous step.
We will provide more formal pseudocode later [NOTE: TBD]. For now, we focus on the analysis, which has
two stages: (1) showing that the initial clustering from the first two steps is “nice enough”, and (2) showing
that this niceness is preserved by the k-means iterations in the last two steps.
Analyzing the projection. We start by analyzing the geometry of the points Pk xi. The following lemma
shows that the projected clusters are still well-separated and have small covariance:
Lemma 5.6. The projected points Pk xi satisfy the covariance and separation conditions with parameters σ
and √(∆² − 4σ²/α) ≥ 35σ/√α:

(1/|Cj|) Σ_{i∈Cj} (Pk xi − Pk µj)(Pk xi − Pk µj)⊤ ⪯ σ²I and kPk µj − Pk µj′k₂ ≥ √(∆² − 4σ²/α). (271)

In other words, the covariance condition is preserved, and separation is only decreased slightly.
Proof. The covariance condition is preserved because the covariance matrix of the projected points for cluster
j is Pk Σj Pk, where Σj is the un-projected covariance matrix. This evidently has smaller singular values than
Σj.
The separation condition requires more detailed analysis. We start by showing that there is not much in
the orthogonal component (I − Pk)xi. Indeed, the top singular value of the second moment of the (I − Pk)xi
is at most σ²:

S = (1/n) Σ_{i=1}^n ((I − Pk)xi)((I − Pk)xi)⊤ ⪯ σ²I. (272)
This is because Pk minimizes this top singular value among all k-dimensional projection matrices, and if we
take the projection Qk onto the space spanned by the means µ₁, . . . , µk, we have

(1/n) Σ_{i=1}^n ((I − Qk)xi)((I − Qk)xi)⊤ = Σ_{j=1}^k (αj/|Cj|) Σ_{i∈Cj} ((I − Qk)xi)((I − Qk)xi)⊤ (273)
= Σ_{j=1}^k (αj/|Cj|) Σ_{i∈Cj} ((I − Qk)(xi − µj))((I − Qk)(xi − µj))⊤ (274)
⪯ Σ_{j=1}^k (αj/|Cj|) Σ_{i∈Cj} (xi − µj)(xi − µj)⊤ ⪯ Σ_{j=1}^k αj σ²I ⪯ σ²I. (275)
Given this, we know that the projections (I − Pk)µj must be small, since otherwise the quadratic form v⊤Sv
would be large: for any unit vector v,

v⊤Sv = (1/n) Σ_{i=1}^n ⟨(I − Pk)xi, v⟩² (276)
≥ (αj/|Cj|) Σ_{i∈Cj} ⟨(I − Pk)xi, v⟩² (277)
≥ αj ⟨(1/|Cj|) Σ_{i∈Cj} (I − Pk)xi, v⟩². (278)

Since v⊤Sv ≤ σ² for every unit v, taking v along the projected-out cluster mean shows that k(I − Pk)µjk₂ ≲ σ/√α.
Hence k(I − Pk)(µj − µj′)k₂² ≤ 4σ²/α, and by the Pythagorean theorem kPk µj − Pk µj′k₂² ≥ ∆² − 4σ²/α, which
is the separation claim.
Analyzing the initial clustering. We now analyze the initial clustering. Call a point i a proto-center if
there are at least (α/2)n projected points within distance 3σ√k of Pk xi, and call the set of these nearby points
the associated proto-cluster.
We will show that the proto-clusters are nearly pure (have few points not from Cj) using a similar
argument as when we analyzed resilient clustering. As before, call a proto-cluster j-like if there are at least
αj(α/4)n points from Cj in the proto-cluster.
Lemma 5.7. Each proto-cluster is j-like for exactly one j.
Proof. We know that it is j-like for at least one j by the pigeonhole principle (if not, then the proto-cluster
has at most (α/4)n points in total, contradicting its size of at least (α/2)n). So suppose for the sake of contradiction
that it is both j-like and j′-like. By resilience, the mean of the points from Cj is at most 2σ/√α away from
Pk µj, and similarly the mean of the points from Cj′ is at most 2σ/√α away from Pk µj′. Since the cluster has
radius 3σ√k ≤ 3σ/√α, this implies that kPk(µj − µj′)k₂ ≤ 10σ/√α, contradicting the separation condition
for the projected means. Thus no proto-cluster can be j-like for multiple j, which proves the lemma.
Now since each proto-cluster is j-like for exactly one j, at least half of its points must come from Cj.
At this point we are essentially done if all we care about is constructing an efficient algorithm for cluster
recovery (but not parameter recovery), since if we just extend each proto-cluster by O(σ) we are guaranteed
to contain almost all of the points in its corresponding cluster, while still containing very few points from any
other cluster (assuming the data are well-separated). However, parameter recovery is a bit trickier because we
need to make sure that the small number of points from other clusters don’t mess up the mean of the cluster.
The difficulty is that while we have control over the projected distances, and can recover the projected centers
Pk µj well, we need to somehow get back to the original centers µj .
The key here is that for each proto-cluster, the Pk xi are all close to each other, and the missing component
(I − Pk)xi has bounded covariance. Together, these imply that the proto-cluster is resilient: deleting an
ε-fraction of points can change the mean by at most O(σ) in the Pk component, and O(σ√ε) in the (I − Pk)
component. In fact, we have:
Lemma 5.8. Let B be a proto-cluster with mean ν. Then

(1/|B|) Σ_{i∈B} (xi − ν)(xi − ν)⊤ ⪯ (11σ²/α) I. (281)

In particular, if B is j-like then we have kµj − νk₂ ≤ 9σ/√α.
Proof. The covariance bound holds because the xi are within distance 3σ√k of the proto-center in the Pk
component and hence can contribute at most 9σ²k ≤ 9σ²/α to the covariance, while we get an additional
2σ²/α in the orthogonal directions because the overall second moment of the (I − Pk)xi is at most σ² and
the i ∈ B constitute at least an α/2 fraction of all points.
Now, this implies that B is resilient, while we already have that Cj is resilient. Since B ∩ Cj contains at
least half the points in both B and Cj, this gives that their means are close: within distance 2(√11 + 1)σ/√α <
9σ/√α.
Analyzing k-means. We next show that k-means iterations preserve certain important invariants. We
will call the assigned means µ̂j R-close if kµ̂j − µjk₂ ≤ R for all j, and we will call the assigned clusters Ĉj
ε-close if |Cj △ Ĉj| ≤ ε|Cj| for all j. We will show that if the means are R-close then the clusters are ε-close
for some ε = f(R), and that the resulting new means are then g(R)-close. If R is small enough then we will
also have g(R) < R so that we obtain an invariant.
Let ∆jj′ = kµj − µj′k₂, so ∆jj′ ≥ ∆. We will show that if the µ̂j are R-close, then few points in Cj can
end up in Ĉj′. Indeed, if xi ends up in Ĉj′ then we must have kxi − µ̂j′k₂² ≤ kxi − µ̂jk₂², which after some
re-arrangement yields

⟨xi − µ̂j, µ̂j′ − µ̂j⟩ ≥ ½⟨µ̂j′ − µ̂j, µ̂j′ − µ̂j⟩. (282)

Applying the covariance bound and Chebyshev’s inequality along the vector v = µ̂j′ − µ̂j, we see that the
fraction of points in Cj that end up in Ĉj′ is at most 4σ²/kµ̂j − µ̂j′k₂² ≤ 4σ²/(∆jj′ − 2R)² ≤ 4σ²/(∆ − 2R)².
In total this means that at most 4σ²n/(∆ − 2R)² points from other clusters end up in Ĉj, while at most
4kσ²|Cj|/(∆ − 2R)² points from Cj end up in other clusters. Thus we have ε ≤ 4kσ²/(∆ − 2R)² + 4σ²/(α(∆ − 2R)²) ≤
8σ²/(α(∆ − 2R)²), so we can take

f(R) = 8σ²/(α(∆ − 2R)²). (283)
Now suppose that γjj′|Cj| points in Cj are assigned to Ĉj′, where we must have γjj′ ≤ 4σ²/(∆jj′ − 2R)². By
resilience, the mean of these points is within σ/√γjj′ of µj and hence within ∆jj′ + σ/√γjj′ of µj′. In total,
then, these points can shift the mean µ̂j′ by at most

γjj′ αj n (∆jj′ + σ/√γjj′) / ((α/2)n) ≤ (2αj/α)(4σ²∆jj′/(∆jj′ − 2R)² + 2σ²/(∆jj′ − 2R))
≤ (2αj/α)(4σ²∆/(∆ − 2R)² + 2σ²/(∆ − 2R)). (284)

At the same time, the 4kσ²/(∆ − 2R)² fraction of points that are missing from Cj′ can change its mean by at most
4σ²√k/(∆ − 2R). Thus in total (summing over j ≠ j′ and using Σj αj ≤ 1) we have

kµ̂j′ − µj′k₂ ≤ (4σ²/(∆ − 2R)) · (√k + 1/α + 2∆/(α(∆ − 2R))) ≤ 8σ²(∆ − R)/(α(∆ − 2R)²). (285)

In particular we can take g(R) = 8σ²(∆ − R)/(α(∆ − 2R)²).
As long as R ≤ ∆/4 we have g(R) ≤ 24σ²/(α∆) and f(R) ≤ 32σ²/(α∆²), as claimed. Since our initial R is 9σ/√α,
this works as long as ∆ ≥ 36σ/√α, which completes the proof.
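As a numeric sanity check on the fixed-point argument (σ and α below are arbitrary toy values), one can verify that the initial radius 9σ/√α equals ∆/4 at the separation threshold, and that iterating R ← g(R) contracts the mean error below the claimed 24σ²/(α∆) level:

```python
import math

sigma, alpha = 1.0, 0.1                 # toy parameters
Delta = 36 * sigma / math.sqrt(alpha)   # separation threshold from the theorem

def g(R):
    # contraction map for the mean error after one k-means iteration, per (285)
    return 8 * sigma**2 * (Delta - R) / (alpha * (Delta - 2 * R) ** 2)

R = 9 * sigma / math.sqrt(alpha)        # initial closeness from Lemma 5.8
for _ in range(6):
    R_next = g(R)
    assert R_next < R                   # each iteration contracts the radius
    R = R_next
assert R <= 24 * sigma**2 / (alpha * Delta) + 1e-9
```

With these values the radius drops from about 28.5 to below 1 within a few iterations, illustrating why a handful of k-means steps suffice once the invariant holds.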
[Lecture 19]
In addition, for any other value of β, the excess loss relative to β∗ can be expressed as
L(β) − L(β∗) = (β − β∗)⊤Σ̂(β − β∗), where Σ̂ = (1/n) Σ_{i=1}^n xi xi⊤.
This is the standard view of linear regression and what we will call the parameter view, since it focuses on
the parameters β. But it is also helpful to adopt a different view, which we’ll call the function view. The function
view instead focuses on the learned function fβ(x) = β⊤x. Thus we instead have L(f) = (1/n) Σ_{i=1}^n (yi − f(xi))²
(assume for now that f is linear in x; by convention L(f) = ∞ for all other f, but the derivations below
will focus on the case where L(f) is finite).
The excess loss is particularly easy to write down in the function view:

L(f) − L(f∗) = (1/n) Σ_{i=1}^n (f(xi) − f∗(xi))² = (1/n) kf − f∗k₂², (289)
Lemma 6.1 (Restrictions are contractions). For any f and any sets T ⊆ T 0 , we have kf kK[T ] ≤ kf kK[T 0 ] .
Secondly, despite this, we can extend a function f from any subset T to all of X without increasing the
norm:
Lemma 6.2 (Isometric extension). Let T = {x₁, . . . , xn} and let y₁, . . . , yn ∈ R be arbitrary target values.
Then there exists an f ∈ H such that f(xi) = yi for all i, and moreover kfk²H = kyk²K = y⊤K⁻¹y.
Proof of Lemma 6.2. Define f(x) = k_x⊤K⁻¹y, where (k_x)_i = k(x, xi). Since k_{xi} is the ith row of K,
k_{xi}⊤K⁻¹ is the ith standard basis vector, so f(xi) = yi.
It remains to show that f has bounded norm. For sets T = {x₁, . . . , xn} and T′ = {x′₁, . . . , x′m}, let
K[T, T′] be the (rectangular) kernel matrix with entries k(xi, x′j). Then f evaluated on T′ is equal to
K[T, T′]⊤K[T, T]⁻¹y. For notational ease define K₁₁ = K[T, T], K₁₂ = K[T, T′], etc. Thus

kfk²_{K[T′]} = y⊤K[T, T]⁻¹K[T, T′]K[T′, T′]⁻¹K[T′, T]K[T, T]⁻¹y (301)
= y⊤K₁₁⁻¹K₁₂K₂₂⁻¹K₂₁K₁₁⁻¹y. (302)

Recall we wish to show this is at most y⊤K₁₁⁻¹y. It suffices to show that K₁₁⁻¹K₁₂K₂₂⁻¹K₂₁K₁₁⁻¹ ⪯ K₁₁⁻¹, or
K₁₂K₂₂⁻¹K₂₁ ⪯ K₁₁. But this follows from the Schur complement lemma, since [K₁₁ K₁₂; K₂₁ K₂₂] is positive
definite.
Proof of Lemma 6.1. Under the notation above we want to show that

y₁⊤K₁₁⁻¹y₁ ≤ [y₁; y₂]⊤ [K₁₁ K₁₂; K₂₁ K₂₂]⁻¹ [y₁; y₂].

This follows by noting that for any λ₁ > 0 and sufficiently large λ₂ we have

[K₁₁ + λ₁I, 0; 0, λ₂I] ⪰ [K₁₁ K₁₂; K₂₁ K₂₂],

and then inverting both sides and taking limits (λ₂ → ∞, λ₁ → 0).
Remark 6.3. This route to defining H is different from the typical treatment which instead focuses on
the representer theorem. That theorem starts with the infinite-dimensional space and then shows that the
minimum-norm interpolator for a set T lies within the span of K[T ]. Here we take essentially the reverse
approach, starting with an interpolating function and showing that it can be isometrically embedded in H. I
personally prefer the approach above, as it makes it much cleaner to define both kf kH and H itself.
Note that Lemma 6.2 also shows that kernel ridge regression can be expressed intrinsically, without
explicit reference to the set T: fˆλ = argmin_f (λ/n)kfk²H + (1/n) Σ_{i=1}^n (f(xi) − yi)². This is because we can always
find an f agreeing with any given values on T = {x₁, . . . , xn} such that kfkH = kfk_{K[T]}.
Finally, since K is invertible, kernel ridge regression has an even simpler form: fˆλ(x) = k_x⊤(λI + K)⁻¹y.
If we set λ = 0 then we recover the isometric extension f of y in Lemma 6.2.
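To see the closed form concretely, the following sketch (RBF kernel with unit bandwidth and a handful of toy 1-D points; all values hypothetical) checks that f̂₀(x) = k_x⊤K⁻¹y interpolates the data, while λ > 0 shrinks the fitted values:

```python
import numpy as np

xs = np.array([-2.0, -1.2, -0.5, 0.3, 1.1, 2.4])   # toy training inputs
ys = np.sin(xs)                                     # toy targets

kfun = lambda a, b: np.exp(-0.5 * (a - b) ** 2)     # RBF kernel, bandwidth 1
K = kfun(xs[:, None], xs[None, :])

def f_hat(x, lam):
    # kernel ridge regression: f_hat(x) = k_x^T (lam I + K)^{-1} y
    kx = kfun(xs, x)
    return kx @ np.linalg.solve(lam * np.eye(len(xs)) + K, ys)

interp = np.array([f_hat(x, 0.0) for x in xs])   # lambda = 0: interpolates
ridge = np.array([f_hat(x, 1.0) for x in xs])    # lambda > 0: shrinks toward 0
```

With λ = 0 the fitted values match the targets exactly (up to numerical error), recovering the isometric extension of Lemma 6.2; with λ = 1 the fitted values have strictly smaller norm than the targets.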
[Lecture 20]
Bias² = (1/n) kK(λI + K)⁻¹f∗ − f∗k₂² = (λ²/n) k(λI + K)⁻¹f∗k₂². (303)

A useful but loose upper bound can be obtained via k(λI + K)⁻¹f∗k₂² ≤ (1/λ)(f∗)⊤K⁻¹f∗ = (1/λ)kf∗k²H, so the
bias is at most (λ/n)kf∗k²H assuming f∗ lies in H (we will later relax this assumption, and indeed show that it
must be relaxed to get the optimal rates).
Next, the variance is (1/n) Σ_{i=1}^n Var[fˆλ(xi)], which works out to

Var = (σ²/n) tr((λI + K)⁻¹K²(λI + K)⁻¹). (304)

Again, a somewhat loose upper bound is (σ²/λ)(tr(K)/n).
Thus, assuming both tr(K)/n and kf∗k²H are bounded, if we optimize λ the sum of the bias and variance
can be bounded by (2σ/√n) kf∗kH √(tr(K)/n). One should generally think of tr(K)/n as being roughly constant
as n grows, so this gives a 1/√n rate of convergence for the squared error. This is slower than the usual
parametric rate of 1/n (for the squared error), but that is perhaps not surprising given we are in infinite
dimensions. However, more unusual is that the exponent on n does not seem to depend at all on the geometry
of H. This is surprising, as we know for instance that it is much easier to estimate smooth functions than
functions that are merely Lipschitz, and in fact the exponent in n is different in the two cases (as it turns out,
log(n)/n versus 1/n^{2/3} for functions on [0, 1]).
Prior on f. The issue is with the “right prior” to place on f. From a Bayesian perspective, kernel ridge
regression is optimal when β is drawn from a Gaussian prior: β ∼ N(0, (σ²/λ)I). From a function perspective
this corresponds to the Gaussian process prior f ∼ N(0, (σ²/λ)K). In this case the expected norm-squared
is E[kβk₂²] = (σ²/λ) tr(I) = (σ²/λ)p. This is a problem when p = ∞! A similar problem shows up with f:
E[kfk²H] = E[f⊤K⁻¹f] = (σ²/λ) tr(K⁻¹K) = (σ²/λ)n.
We therefore need a better bound on the bias that doesn’t rely on kfk²H, which is unbounded. Returning
to (303), under a Gaussian process prior f∗ ∼ N(0, ρ²K) we have

E[Bias²] = (λ²/n) E[(f∗)⊤(λI + K)⁻²f∗] = (λ²ρ²/n) tr((λI + K)⁻²K) = (λρ²/n) Σ_{j=1}^n λµj/(λ + µj)², (305)

where µj are the eigenvalues of K. Similarly the variance is (σ²/n) Σ_{j=1}^n µj²/(λ + µj)². Thus the overall error is

E[Error] = (λρ²/n) Σ_{j=1}^n λµj/(λ + µj)² + (σ²/n) Σ_{j=1}^n µj²/(λ + µj)². (306)

We can bound (306) using the inequalities λµj/(λ + µj)² ≤ min(1, µj/λ) and µj²/(λ + µj)² ≤ min(1, µj²/λ²). Sort
the µj in descending order and let J = max{j | µj > λ}. Then these inequalities yield

E[Error] ≤ (λρ²/n)(J + Σ_{j>J} µj/λ) + (σ²/n)(J + Σ_{j>J} µj²/λ²). (307)
Interpretation for power law decay. Suppose for concreteness that µj = n j^{−α}; this power law decay is
common empirically and also shows up in theoretical examples such as learning Lipschitz functions. Then
J = (λ/n)^{−1/α}. Moreover, the sums can be approximated as

Σ_{j>J} µj/λ ≈ (n/λ) ∫_J^∞ x^{−α} dx = n J^{1−α}/(λ(α − 1)) = J/(α − 1) = (λ/n)^{−1/α}/(α − 1), (308)
Σ_{j>J} µj²/λ² ≈ (n²/λ²) ∫_J^∞ x^{−2α} dx = n² J^{1−2α}/(λ²(2α − 1)) = (λ/n)^{−1/α}/(2α − 1). (309)
Now take the optimal λ∗ = σ²/ρ². Then we obtain the final bound of

E[Error(λ∗)] ≤ O(1/(α − 1)) · n^{1/α − 1} σ^{2(α−1)/α} ρ^{2/α}. (311)

Thus for instance when α = 2 we get a 1/√n rate, but as α → ∞ we approach a 1/n rate.
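The α = 2 rate can be checked numerically by evaluating the exact bias-plus-variance expression (306) under the power-law spectrum µj = n j^{−α} (toy values σ = ρ = 1, so the optimal regularizer is λ∗ = 1): quadrupling n should roughly halve the error.

```python
def krr_error(lam, n, alpha=2.0, sigma2=1.0, rho2=1.0):
    # bias + variance from (306) for the power-law spectrum mu_j = n * j^(-alpha)
    mus = [n * j ** -alpha for j in range(1, n + 1)]
    bias = (lam * rho2 / n) * sum(lam * m / (lam + m) ** 2 for m in mus)
    var = (sigma2 / n) * sum(m * m / (lam + m) ** 2 for m in mus)
    return bias + var

# a 1/sqrt(n) rate means the error ratio between n and 4n is about 2
ratio = krr_error(1.0, 1000) / krr_error(1.0, 4000)
```

The observed ratio is close to 2, matching the n^{−1/2} prediction of (311) for α = 2.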
Note that for sub-optimal choices of regularizer, we still get a bound but with a potentially worse
dependence on n. For instance if we took λ ∝ λ∗nᵝ for some β > 0, then our rate would be suboptimal by an
n^{β(α−1)/α} factor in n due to the over-regularization.
Note that relative to the kernel case, µ∗j is n times smaller, since it is an eigenvalue of S rather than K.
This expression is similar to what we achieved in (307), and when µ∗j = j^{−α}, setting λn = σ²/(nρ²) will yield
the same power law rate as before of n^{1/α − 1}.
The key question is for what values of λn the condition S∗ ⪯ 2S + λnI is satisfied. This requires the
empirical covariance matrix to concentrate to the population covariance. To analyze this, we will make use of
the following result of Koltchinskii and Lounici (2017):
Theorem 6.5 (Koltchinskii and Lounici). If the xi are Gaussian with covariance S∗, then the empirical
covariance matrix satisfies

kS∗ − Sk ≤ max{ tr(S∗)/n, √(tr(S∗)kS∗k/n) } (319)

with high probability. In particular, if S∗ is p × p and p ≤ n then we have kS∗ − Sk ≤ kS∗k√(p/n) with high
probability.
To apply Theorem 6.5, assume without loss of generality that S∗ is diagonalized and consider the case that
the jth eigenvalue is µ∗j = j^{−α}. We will split S into two blocks: S≤m (the top left m × m matrix corresponding
to the top m eigenvalues of S∗) and S>m (the complementary bottom right block).
For the top left block, we have kS≤m − S∗≤mk ≲ √(m/n) by Theorem 6.5. The bottom right block is more
complicated, but note that tr(S∗>m) = Σ_{j>m} j^{−α} ≈ m^{1−α}/(α − 1). Assuming α is bounded away from 1 we then
have kS>m − S∗>mk ≲ √(m^{1−2α}/n).
Now we will optimize m and aim for the conditions kS≤m − S∗≤mk ≤ ½λmin(S∗≤m) and kS>m − S∗>mk ≤ λn
to hold, which are enough to imply S∗≤m ⪯ 2S≤m + λnI and S∗>m ⪯ 2S>m + λnI (we also have to worry about
the off-diagonal blocks but will handle this in an exercise). These conditions hold when

√(m/n) ≲ m^{−α}, (320)
√(m^{1−2α}/n) ≲ λn. (321)

Solving the first yields m ≲ n^{1/(2α+1)}, implying λn ≳ n^{−2α/(1+2α)} = n^{1/(2α+1)}/n.
Now, recall that the optimal regularizer was λn = λ∗/n, where λ∗ is constant in n, and hence λn ≈ 1/n. To achieve good
generalization according to the above bound, we need λn to be at least n^{1/(2α+1)} times larger than that. For
instance, when α = 2 this implies λn must be at least n^{1/5} times larger than the optimal choice, which yields a
corresponding rate of n^{−2/5} instead of the n^{−1/2} we had before.
It turns out, however, that this analysis is loose. By applying the bound with many blocks rather than
just two, we can do better. We will address this in an exercise as well.
Finally, the conditions in Theorem 6.5 can be relaxed somewhat, as discussed in Koltchinskii and Lounici
(2017). In particular, we don’t need perfect Gaussianity and can get by with appropriate sub-Gaussian
assumptions.
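A quick simulation (toy dimensions, identity population covariance) illustrating the p ≤ n case of Theorem 6.5; the bound kS∗ − Sk ≤ kS∗k√(p/n) should hold up to a small constant factor (here a hypothetical factor of 3):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 10_000
X = rng.standard_normal((n, p))         # Gaussian data with S* = I
S = X.T @ X / n                         # empirical covariance matrix
err = np.linalg.norm(S - np.eye(p), 2)  # operator-norm deviation from S*
bound = 3 * np.sqrt(p / n)              # ||S*|| sqrt(p/n) up to a constant
```

For these dimensions the deviation comes out around 2√(p/n), comfortably inside the constant-factor bound, consistent with the theorem's √(p/n) scaling.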
[Lecture 21]
Note that the eigenfunctions will be different depending on the choice of ν, and orthogonality is with
respect to ν: ∫ el(x)em(x)dν(x) = δlm. There are thus three distinct inner products floating around: the
one underlying the norm kfkH, the one underlying the function error kf − yk₂, and the one associated with
ν. This can get a bit confusing, but it helps to think of kfkH and kfk₂ as the “important” norms, while ν
specifies a non-intrinsic norm that is however useful for computational purposes.
Speaking of these computational purposes, let us see what Mercer’s theorem buys us. Any function f can
now be represented as f(x) = Σ_{m=1}^∞ cm em(x), where cm = ∫_X f(x)em(x)dν(x). Under this representation, it
turns out that kfk²H = Σ_{m=1}^∞ cm²/µm. This might seem surprising, since the right-hand side depends on ν but
the left-hand side does not. We explain this with the following result:
Lemma 6.7. Let ν be any finite measure that is supported on all of X. Then kfk²H = Σ_{m=1}^∞ (1/µm)(∫_X f(x)em(x)dν(x))².

Now we define (fν)i = ν(Xi)^{1/2} f(xi) and (Kν)ij = ν(Xi)^{1/2}ν(Xj)^{1/2} k(xi, xj). We can check that for
T = {x₁, . . . , xn}, f⊤K⁻¹f = fν⊤Kν⁻¹fν, independent of the choice of ν. This is the first hint that the choice
of ν doesn’t really matter. Now let vm = (ν(X₁)^{1/2}em(x₁), . . . , ν(Xn)^{1/2}em(xn))⊤ and note that

fν⊤Kν⁻¹fν = fν⊤ (Σ_{m=1}^∞ µm vm vm⊤)⁻¹ fν (324)
≈(i) fν⊤ (Σ_{m=1}^∞ (1/µm) vm vm⊤) fν (325)
= Σ_{m=1}^∞ (1/µm) (Σ_{i=1}^n ν(Xi) f(xi) em(xi))² (326)
≈(ii) Σ_{m=1}^∞ (1/µm) (∫_X f(x)em(x)dν(x))². (327)
Here (i) is the main “heuristic” step, where we are inverting the matrix on the premise that the different vectors
vm = [ν(Xi)^{1/2}em(xi)]_i are nearly orthonormal, since ⟨[ν(Xi)^{1/2}el(xi)]_i, [ν(Xi)^{1/2}em(xi)]_i⟩ = Σ_i ν(Xi)el(xi)em(xi) ≈
∫_X el(x)em(x)dν(x) = δlm. The step (ii) also needs to be made rigorous but is essentially just taking a limit.
Now, by definition we have kfk²H ≥ f⊤K⁻¹f = fν⊤Kν⁻¹fν, which by the above can be made arbitrarily
close to A = Σ_{m=1}^∞ (1/µm)(∫_X f(x)em(x)dν(x))² by taking sufficiently fine partitions {Xi}. This proves
one direction of the lemma, i.e. that A ≤ kfk²H. To prove the other direction, note that for any set
T = {x̂₁, . . . , x̂n}, we can build our partitions such that they contain the x̂i as a subset of the representatives
xi of the partition. In this case, f[T]⊤K[T]⁻¹f[T] ≤ f⊤K⁻¹f = fν⊤Kν⁻¹fν by Lemma 6.1. Taking the limit
then yields f[T]⊤K[T]⁻¹f[T] ≤ A. Since this holds for all T, we thus have kfk²H ≤ A, which proves the other
direction and completes the lemma.
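The claim that f⊤K⁻¹f = fν⊤Kν⁻¹fν for any choice of ν is a finite-dimensional identity that is easy to verify directly (random positive definite K, function values f, and positive cell masses ν below are all arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)        # random positive definite kernel matrix
f = rng.standard_normal(n)         # function values at the representatives x_i
nu = rng.uniform(0.1, 2.0, size=n) # arbitrary positive cell masses nu(X_i)

D = np.diag(np.sqrt(nu))
f_nu = D @ f                       # (f_nu)_i = nu(X_i)^{1/2} f(x_i)
K_nu = D @ K @ D                   # (K_nu)_ij = nu(X_i)^{1/2} nu(X_j)^{1/2} K_ij

lhs = f @ np.linalg.solve(K, f)
rhs = f_nu @ np.linalg.solve(K_nu, f_nu)
```

The equality holds exactly because the diagonal scaling cancels: (DKD)⁻¹ = D⁻¹K⁻¹D⁻¹.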
[Lecture 22]
Thus the only thing that changes between train and test is the distribution of the covariates x (hence
the name covariate shift). We furthermore assume that we observe labeled samples (x1 , y1 ), . . . , (xn , yn ) ∼ p̃,
together with unlabeled samples x̄1 , . . . , x̄m ∼ p∗ . In the language of our previous setting, we could say that
D(p̃, p∗) = kp̃(y | x) − p∗(y | x)k∞, ε = 0, and G = {p | p(x) = p₀(x)} for some distribution p₀ (obtained via
the unlabeled samples from p∗).
Beyond covariate shift, we will need to make some additional assumption, since if p̃(x) and p∗ (x) have
disjoint supports then the assumption that p̃(y | x) = p∗ (y | x) is meaningless. We will explore two different
assumptions:
1. Either we assume the p̃(x) and p∗ (x) are known and not too different from each other, or
2. We assume that the model family is realizable: p∗ (y | x) = pθ (y | x) for some θ.
This will lead to two different techniques: importance weighting and uncertainty estimation. We will also see
how to construct a “doubly robust” estimator that works as long as at least one of the assumptions holds.
In the absence of distribution shift, we would simply minimize the expected loss E_{(x,y)∼p̃}[ℓ(θ; x, y)],
where ℓ is the loss function for either classification or regression. We can approximate this via the samples
from p̃ as
(1/n) Σ_{i=1}^n ℓ(θ; xi, yi). (329)
To handle covariate shift we would like to instead minimize the expectation over p∗ , but unfortunately we
can’t do this because we don’t have any outputs y drawn from p∗ . The key insight that lets us get around
this is the following identity:
E_{(x,y)∼p∗}[ℓ(θ; x, y)] = E_{(x,y)∼p̃}[(p∗(x)/p̃(x)) ℓ(θ; x, y)]. (330)
Taking this identity as given for the moment, we can then approximate the expectation over p∗ via samples
from p̃ as follows:
(1/n) Σ_{i=1}^n (p∗(xi)/p̃(xi)) ℓ(θ; xi, yi). (331)
This quantity is called the propensity-weighted training loss, because each training sample is weighted by
how much more it looks like a sample from p∗ than from p̃.
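A minimal simulation of this identity (toy Bernoulli covariates and a hypothetical loss depending only on x): reweighting samples from p̃ by p∗(x)/p̃(x) recovers the expectation under p∗.

```python
import random

random.seed(0)
p_train, p_test = 0.3, 0.7             # P[x = 1] under p-tilde and p-star
loss = lambda x: 2.0 * x + 1.0         # toy loss depending only on x

target = p_test * loss(1) + (1 - p_test) * loss(0)   # exact E_{p*}[loss]

n = 200_000
total = 0.0
for _ in range(n):
    x = 1 if random.random() < p_train else 0
    w = (p_test / p_train) if x == 1 else ((1 - p_test) / (1 - p_train))
    total += w * loss(x)               # propensity-weighted sample
estimate = total / n
```

The weighted average over training samples converges to the test-distribution expectation, even though every sample was drawn from p̃.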
To prove the identity, we make use of the covariate shift assumption:

E_{(x,y)∼p∗}[ℓ(θ; x, y)] = E_{(x,y)∼p̃}[(p∗(x, y)/p̃(x, y)) ℓ(θ; x, y)] (332)
= E_{(x,y)∼p̃}[(p∗(x)/p̃(x)) · (p∗(y | x)/p̃(y | x)) ℓ(θ; x, y)] (333)
= E_{(x,y)∼p̃}[(p∗(x)/p̃(x)) ℓ(θ; x, y)], (334)

where the final equality uses the covariate shift assumption p∗(y | x) = p̃(y | x).
Variance of the estimator. Even if p∗(x)/p̃(x) can be computed, the importance-weighted estimator could
have high variance. This is because the weights p∗(xi)/p̃(xi) could be large or potentially infinite.
For convenience assume that ℓ(θ; x, y) ≤ B for all θ, x, y. We can compute (or rather, bound) the variance
as follows:

Var_{p̃}[(p∗(x)/p̃(x)) ℓ(θ; x, y)] = E_{p̃}[(p∗(x)/p̃(x))² ℓ(θ; x, y)²] − E_{p∗}[ℓ(θ; x, y)]² (335)
≤ E_{p̃}[(p∗(x)/p̃(x))²] B² (336)
= (Dχ²(p̃kp∗) + 1) B², (337)
The variance of the propensity-weighted loss is thus more or less controlled by the χ²-divergence between
source and target. To gain some intuition for how χ² behaves, first note that it is always larger than the KL
divergence (in the reverse direction, though I’m not sure the order of arguments is canonical):

Dkl(p∗kp̃) = ∫ p∗(x) log(p∗(x)/p̃(x)) dx (339)
≤ ∫ p∗(x) · (p∗(x) − p̃(x))/p̃(x) dx (340)
= ∫ p∗(x)²/p̃(x) dx − 1 = Dχ²(p̃kp∗). (341)
Additionally, the χ²-divergence between two Gaussians is exponential in the squared distance between their means.
To see this, let Z denote the normalization constant of an isotropic Gaussian, and write
\begin{align}
D_{\chi^2}(\mathcal{N}(\mu, I), \mathcal{N}(\mu', I)) &= -1 + \frac{1}{Z} \int \exp\Big(\frac{1}{2}\|x - \mu'\|_2^2 - \|x - \mu\|_2^2\Big)\, dx \tag{342}\\
&= -1 + \frac{1}{Z} \int \exp\Big(\frac{1}{2}\big({-\|x\|_2^2} - (2\mu' - 4\mu)^\top x + \|\mu'\|_2^2 - 2\|\mu\|_2^2\big)\Big)\, dx \tag{343}\\
&= -1 + \frac{1}{Z} \int \exp\Big(\frac{1}{2}\big({-\|x + (\mu' - 2\mu)\|_2^2} + \|\mu' - 2\mu\|_2^2 + \|\mu'\|_2^2 - 2\|\mu\|_2^2\big)\Big)\, dx \tag{344}\\
&= -1 + \exp(\|\mu'\|_2^2 + \|\mu\|_2^2 - 2\mu^\top \mu') = -1 + \exp(\|\mu - \mu'\|_2^2). \tag{345}
\end{align}
This is bad news for propensity weighting, since the weights blow up exponentially as the distributions move
apart.
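A small Monte Carlo experiment makes the blow-up concrete (a one-dimensional illustration, not from the notes): the empirical variance of the importance weights between N(0, 1) and N(δ, 1) tracks the exp(δ²) − 1 prediction of (345), and grows rapidly with δ.

```python
import numpy as np

def weight_variance(delta, n=1_000_000, seed=0):
    """Empirical variance of the weights p*(x)/p_tilde(x), for
    p_tilde = N(0, 1) and p* = N(delta, 1), over samples from p_tilde."""
    x = np.random.default_rng(seed).normal(0.0, 1.0, size=n)
    w = np.exp(-0.5 * (x - delta) ** 2 + 0.5 * x ** 2)  # density ratio
    return float(w.var())

# Theory (Eq. 345): Var of the weights equals exp(delta^2) - 1.
for delta in [0.5, 1.0, 1.5]:
    print(delta, weight_variance(delta), np.exp(delta ** 2) - 1.0)
```

Note that the Monte Carlo estimate itself becomes unreliable for large δ, for exactly the same heavy-weight reason.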
Connection to causal inference. Propensity weighting can also be used in the context of causal inference.
Here we have a patient with covariates X, with treatment condition T (usually T ∈ {0, 1}), and outcome Y .
Our goal is to estimate the treatment effect, which, roughly speaking, is E[Y | T = 1] − E[Y | T = 0] (this
is wrong as stated and will be remedied below). We will see below how to do this by letting p∗0 and p∗1 be
the distributions where T = 0 and T = 1, respectively. However, first we need to set up the problem more
carefully.
To set the problem up more carefully, we use the potential outcomes framework. In this framework there
are actually two variables, Y (0) and Y (1), which are what the outcome would have been if we had set T = 0
or T = 1, respectively. This is potentially different from the distribution of the outcome conditional on T ,
since there could be factors that correlate T with Y (for instance, if T is smoking and Y is lung cancer, there
could be some gene that causes one to both be more likely to smoke and more likely to get lung cancer
that accounts for the strong empirical correlation between T and Y ; this was an actual objection raised by
Fisher!).
Of course, there are plenty of factors that create correlation between T and Y in an observational setting,
for instance sicker patients are more likely to be treated aggressively. We are okay with this as long as these
factors are observed as part of the covariates X. This leads us to the unconfoundedness assumption:
Assumption 7.2 (Unconfoundedness). The distribution (X, T, Y(0), Y(1)) is said to be unconfounded if
Y(0), Y(1) ⊥⊥ T | X. In other words, treatment and outcome should be independent conditional on the
covariates X.
The main challenge in the potential outcomes framework is that we only observe (X, T, Y(T)). In other
words, we only observe the outcome for the treatment T that was actually applied, which makes it difficult to
estimate E[Y(1)] or E[Y(0)]. We will deal with this by treating estimating E[Y(1)] as a domain adaptation
problem, and using propensity weighting. First note that, by unconfoundedness, we have
\begin{align}
\mathbb{E}_{\tilde{p}}[Y(1)] &= \mathbb{E}_{\tilde{p}}\big[\mathbb{E}[Y(1) \mid X]\big] \tag{346}\\
&= \mathbb{E}_{\tilde{p}}\big[\mathbb{E}[Y(1) \mid X, T = 1]\big] \tag{347}\\
&= \mathbb{E}_{p_1^*}[Y(T)], \tag{348}
\end{align}
where we define p∗1 such that p∗1(x, t, y) = p̃(x) I[t = 1] p̃(y | x, t = 1); this has the same distribution over x as
p̃, but the treatment t = 1 is always applied. Since p̃(y | x, t) = p∗1(y | x, t) almost surely, the covariate shift
assumption holds. We can thus estimate the expectation under p∗1 via propensity weighting:
\begin{align}
\mathbb{E}_{p_1^*}[Y(T)] &= \mathbb{E}_{\tilde{p}}\Big[\frac{p_1^*(X, T)}{\tilde{p}(X, T)}\, Y(T)\Big] \tag{349}\\
&= \mathbb{E}_{\tilde{p}}\Big[\frac{p_1^*(T \mid X)}{\tilde{p}(T \mid X)}\, Y(T)\Big] \tag{350}\\
&= \mathbb{E}_{\tilde{p}}\Big[\frac{\mathbb{I}[T = 1]}{\tilde{p}(T \mid X)}\, Y(T)\Big]. \tag{351}
\end{align}
A similar calculation holds for computing Ep̃[Y(0)], for the distribution p∗0(x, t, y) = p̃(x) I[t = 0] p̃(y | x, t = 0).
Together, we have that
\[
\mathbb{E}_{\tilde{p}}[Y(1) - Y(0)] = \mathbb{E}_{\tilde{p}}\Big[\Big(\frac{\mathbb{I}[T = 1]}{\tilde{p}(T \mid X)} - \frac{\mathbb{I}[T = 0]}{\tilde{p}(T \mid X)}\Big)\, Y(T)\Big]. \tag{352}
\]
Since the right-hand-side is in terms of Y (T ), it only involves observable quantities, and can be estimated
from samples as long as p̃(T | X) is known. This estimator is called inverse propensity weighting because it
involves dividing by the propensity weights p̃(T | X).
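The estimator (352) is easy to simulate. The sketch below uses an invented confounded data-generating process (true treatment effect 2, propensities assumed known); IPW recovers the effect where the naive difference of group means does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Confounded synthetic data: patients with larger x are treated more often,
# and x also raises the outcome. True treatment effect is 2.
x = rng.normal(0.0, 1.0, size=n)
e = 1.0 / (1.0 + np.exp(-x))          # known propensity p_tilde(T = 1 | X)
t = rng.random(n) < e                 # treatment indicator I[T = 1]
y = 2.0 * t + x + rng.normal(0.0, 1.0, size=n)

# Inverse propensity weighting, Eq. (352):
ipw_estimate = np.mean((t / e - (~t) / (1.0 - e)) * y)

# Naive comparison, biased upward because treated patients have larger x:
naive_estimate = y[t].mean() - y[~t].mean()
```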
In the next section, we will explore an improvement on inverse propensity weighting called a doubly-robust
estimator.
[Lecture 23]
Another issue is that estimating the propensity weights themselves is non-trivial, and if we use the wrong
propensity weights, then the estimate could be arbitrarily wrong.
We will explore an idea that partially mitigates both issues: it reduces the variance when the propensity
weights are correct (although it doesn't avoid the exploding-variance issue), and in some cases it produces a
correct estimate even if the propensity weights are wrong.
The basic idea is as follows: suppose that we have some prediction Ȳ(1, X), Ȳ(0, X) of what will happen
under T = 1, T = 0 conditioned on X. Since these predictions only require knowing X and not T, an
alternate estimate of the treatment effect can be obtained by adding and subtracting Ȳ:
\begin{align}
\mathbb{E}_{\tilde{p}}[Y(1) - Y(0)] &= \mathbb{E}_{\tilde{p}}[\bar{Y}(1, X) - \bar{Y}(0, X)] + \mathbb{E}_{\tilde{p}}[(Y(1) - \bar{Y}(1, X)) - (Y(0) - \bar{Y}(0, X))] \tag{355}\\
&= \mathbb{E}_{\tilde{p}}[\bar{Y}(1, X) - \bar{Y}(0, X)] + \mathbb{E}_{\tilde{p}}\Big[\Big(\frac{\mathbb{I}[T = 1]}{\tilde{p}(T \mid X)} - \frac{\mathbb{I}[T = 0]}{\tilde{p}(T \mid X)}\Big)(Y(T) - \bar{Y}(T, X))\Big]. \tag{356}
\end{align}
In other words, we first use our prediction Ȳ to form a guess of the average treatment effect, then use inverse
propensity weighting to correct the guess so as to obtain an unbiased estimate. This can yield substantial
improvements when Y(T) − Ȳ(T, X) is much smaller in magnitude than Y(T). For instance, a patient's
cholesterol after taking a cholesterol-reducing drug is still highly correlated with their initial cholesterol, so in
that case we can take Ȳ(T, X) to be the pre-treatment cholesterol level. Even though this choice is independent of
T, it can substantially reduce the variance of the estimate! (We will formally bound the variance below.)
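Continuing the same kind of synthetic example (invented data-generating process, true effect 2, propensities assumed known), the doubly-robust correction (356) can be sketched as follows; the outcome model Ȳ here is deliberately biased, yet the propensity-weighted correction repairs the estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

x = rng.normal(0.0, 1.0, size=n)                 # covariate (confounder)
e = 1.0 / (1.0 + np.exp(-x))                     # true propensity p_tilde(T=1 | X)
t = rng.random(n) < e
y = 2.0 * t + x + rng.normal(0.0, 1.0, size=n)   # true treatment effect = 2

def ybar(tt, xx):
    """A deliberately biased outcome model: assumes an effect of 1.5."""
    return 1.5 * tt + xx

# Doubly-robust estimate, Eq. (356): model-based guess + IPW correction.
guess = np.mean(ybar(1, x) - ybar(0, x))         # = 1.5, biased on its own
correction = np.mean((t / e - (~t) / (1.0 - e)) * (y - ybar(t, x)))
dr_estimate = guess + correction                 # corrects back toward 2
```

Because the residuals y − Ȳ(T, X) are small, the propensity-weighted correction term has much lower variance than weighting the raw outcomes.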
Bias of the estimate. Call the prediction Ȳ unbiased if E[Y | X, T ] = Ȳ (T, X). The first key property of
(356) is that it is unbiased as long as either Ȳ is unbiased, or the propensity weights are correct. Indeed,
if the prediction is unbiased then the first term is the average treatment effect while the second term is
zero. Conversely, if the propensity weights are correct then the second term exactly estimates the difference
between the predicted and true treatment effect. Correspondingly, (356) is called a doubly-robust estimator.
We can actually say more about the bias. Suppose that instead of the true propensity weights, we have an
incorrect guess p̂(t | x). Then the bias of the estimate is the difference between Ep̃[Y(1) − Y(0)] and (356),
which is
\begin{align}
&\mathbb{E}_{\tilde{p}}[Y(1) - Y(0)] - \mathbb{E}[\bar{Y}(1, X) - \bar{Y}(0, X)] - \mathbb{E}_{\tilde{p}}\Big[\Big(\frac{\mathbb{I}[T = 1]}{\hat{p}(T \mid X)} - \frac{\mathbb{I}[T = 0]}{\hat{p}(T \mid X)}\Big)(Y(T) - \bar{Y}(T, X))\Big] \tag{357}\\
&= \mathbb{E}_{\tilde{p}}\Big[(Y(1) - \bar{Y}(1, X))\Big(1 - \frac{\mathbb{I}[T = 1]}{\hat{p}(t = 1 \mid X)}\Big) - (Y(0) - \bar{Y}(0, X))\Big(1 - \frac{\mathbb{I}[T = 0]}{\hat{p}(t = 0 \mid X)}\Big)\Big]. \tag{358}
\end{align}
Focusing on the first term, and using the independence of T and Y conditioned on X, we have
\begin{align}
&\mathbb{E}_{\tilde{p}}\Big[(Y(1) - \bar{Y}(1, X))\Big(1 - \frac{\mathbb{I}[T = 1]}{\hat{p}(t = 1 \mid X)}\Big)\Big] \tag{359}\\
&= \mathbb{E}_{\tilde{p}}\Big[(\mathbb{E}[Y(1) \mid X] - \bar{Y}(1, X))\Big(1 - \frac{\tilde{p}(t = 1 \mid X)}{\hat{p}(t = 1 \mid X)}\Big)\Big] \tag{360}\\
&\le \mathbb{E}_{\tilde{p}}\big[(\mathbb{E}[Y(1) \mid X] - \bar{Y}(1, X))^2\big]^{1/2}\, \mathbb{E}_{\tilde{p}}\Big[\Big(1 - \frac{\tilde{p}(t = 1 \mid X)}{\hat{p}(t = 1 \mid X)}\Big)^2\Big]^{1/2} \tag{361}
\end{align}
by Cauchy–Schwarz, meaning that the bias of the estimator is the product of the biases of Ȳ and p̂ (measured
as the expected squared errors in (361)).
Variance of the estimate. We can obtain a somewhat similar relation for the variance. Usually the
variance of Ȳ(1, X) − Ȳ(0, X) is small compared to the propensity-weighted term, so again focusing on the
T = 1 case we have
\[
\mathrm{Var}\Big[\frac{\mathbb{I}[T = 1]}{\hat{p}(t = 1 \mid X)}\, (Y(1) - \bar{Y}(1, X))\Big] \le \mathbb{E}\Big[\frac{\tilde{p}(t = 1 \mid X)}{\hat{p}(t = 1 \mid X)^2}\, \mathbb{E}_Y[(Y(1) - \bar{Y}(1, X))^2 \mid X]\Big]. \tag{362}
\]
The variance is substantially reduced when Y (1) is close to Ȳ (1, X). We cannot always hope for this, e.g. if
Y has a large amount of intrinsic variance even conditioned on X. But in many cases even trivial Ȳ can
predict most of the variance in Y —for instance, for any chronic disease the patient’s post-treatment status is
well-predicted by their pre-treatment status. And the value of a stock tomorrow is well-predicted by its value
today.
Semi-parametric estimation. It may seem difficult to estimate Ȳ (·, X) and p̃(t = 1 | X), since any
parametric model could be mis-specified and lead to biased estimates. One idea is to estimate these both
non-parametrically, and then apply the doubly-robust estimator above. This is an instance of semi-parametric
estimation, because while we estimate Ȳ and p̃(t | X) non-parametrically, the doubly-robust estimator itself
is parametric (i.e. a simple sample estimate of the mean), and in some cases we obtain parametric rates.
This is explored in detail in Nie and Wager (2017) for estimating conditional average treatment effects; we
describe the basic idea here. Since the squared error in an estimate is Bias² + Variance/n, the bias will
dominate the error in the doubly-robust estimator as long as the variance doesn’t explode (of course, the
variance can often explode if the propensity weights are too close to 0 or 1, and the following idea won’t help
in that case).
We saw above that the bias of the doubly-robust estimator is the product of the biases in Ȳ and p̂,
which are both given as expected squared errors between the true and estimated value. In non-parametric
estimation, we typically get convergence rates of O(n−α ) for some α < 1/2 (note that α = 1/2 is what we
typically get for parametric estimation). The parameter α typically depends on the dimension of the problem
and the smoothness of the function class we wish to estimate. Since the doubly-robust bias is the product
of the biases, we end up with a bias of O(n^{−2α}) as long as Ȳ and p̂ each converge at an n^{−α} rate. As long
as α > 1/4, this yields a parametric rate (the variance term will then asymptotically dominate, as it only
converges at 1/√n).
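The rate arithmetic can be checked numerically (the particular α below is illustrative, not a claim about any specific estimator):

```python
# If Ybar and p_hat each converge at n^(-alpha), the doubly-robust bias is
# O(n^(-2*alpha)); for alpha > 1/4 this falls below the parametric n^(-1/2).
n = 10_000
alpha = 0.3                       # e.g. a typical nonparametric rate
bias = n ** (-2 * alpha)          # n^(-0.6)
variance_term = n ** (-0.5)       # n^(-0.5)
print(bias < variance_term)       # True: bias is asymptotically negligible
```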
References
Radoslaw Adamczak and Pawel Wolff. Concentration inequalities for non-Lipschitz functions with bounded derivatives
of higher order. Probability Theory and Related Fields, 162(3-4):531–586, 2015.
Noga Alon and Assaf Naor. Approximating the cut-norm via Grothendieck's inequality. In Proceedings of the thirty-sixth
annual ACM symposium on Theory of computing, pages 72–80. ACM, 2004.
Pranjal Awasthi and Or Sheffet. Improved spectral-norm bounds for clustering. In Approximation, Randomization,
and Combinatorial Optimization. Algorithms and Techniques, pages 37–49. Springer, 2012.
Dominique Bakry and Michel Émery. Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84, pages
177–206. Springer, 1985.
Jean-Baptiste Bardet, Nathaël Gozlan, Florent Malrieu, and Pierre-André Zitt. Functional inequalities for Gaussian
convolutions of compactly supported measures: explicit bounds and dimension dependence. Bernoulli, 24(1):
333–353, 2018.
Stéphane Boucheron, Gábor Lugosi, and Olivier Bousquet. Concentration inequalities. In Summer School on Machine
Learning, pages 208–240. Springer, 2003.
Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of
independence. Oxford university press, 2013.
David L Donoho and Richard C Liu. The "automatic" robustness of minimum distance functionals. The Annals of
Statistics, 16(2):552–586, 1988.
Vladimir Koltchinskii and Karim Lounici. Concentration inequalities and moment bounds for sample covariance
operators. Bernoulli, 23(1):110–133, 2017.
Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In 2010 IEEE 51st
Annual Symposium on Foundations of Computer Science, pages 299–308. IEEE, 2010.
Rafal Latala. Estimates of moments and tails of Gaussian chaoses. The Annals of Probability, 34(6):2315–2331, 2006.
Michel Ledoux. The concentration of measure phenomenon. Number 89. American Mathematical Soc., 2001.
James G MacKinnon and Halbert White. Some heteroskedasticity-consistent covariance matrix estimators with
improved finite sample properties. Journal of Econometrics, 29(3):305–325, 1985.
Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint
arXiv:1712.04912, 2017.
Omar Rivasplata. Subgaussian random variables: An expository note. 2012.
Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027,
2010.
B Concentration Inequalities
B.1 Proof of Chebyshev’s inequality (Lemma 2.1)
Let I[E] denote the indicator that E occurs. Then we have
\begin{align}
|\mathbb{E}_{X \sim p}[X \mid E] - \mu| &= \frac{1 - \mathbb{P}[E]}{\mathbb{P}[E]}\, |\mathbb{E}_{X \sim p}[X \mid \neg E] - \mu| \tag{366}\\
&\le \frac{1 - \mathbb{P}[E]}{\mathbb{P}[E]} \cdot \frac{\sigma}{\sqrt{1 - \mathbb{P}[E]}}. \tag{367}
\end{align}
Again the result follows since $\sigma\sqrt{1-\delta}/\delta \le \sigma\sqrt{2(1-\delta)/\delta}$ when $\delta \ge \frac{1}{2}$.
event that hX − µ, vi ≥ τ (v). Then Ev has probability and hence
where the last step applies resilience to E+ together with P[E+ ] ≤ 1. Conversely, if p has bounded
1st moments then