arXiv:1802.10575v1 [math.ST] 28 Feb 2018
Near-Optimal Sample Complexity Bounds for Maximum
Likelihood Estimation of Multivariate Log-concave Densities
Timothy Carpenter
Ohio State University
carpenter.454@osu.edu
Ilias Diakonikolas∗
University of Southern California
diakonik@usc.edu
Anastasios Sidiropoulos†
University of Illinois at Chicago
sidiropo@uic.edu
Alistair Stewart
University of Southern California
stewart.al@gmail.com
December 2, 2021
Abstract
We study the problem of learning multivariate log-concave densities with respect to a global loss
function. We obtain the first upper bound on the sample complexity of the maximum likelihood estimator
(MLE) for a log-concave density on Rd , for all d ≥ 4. Prior to this work, no finite sample upper bound
was known for this estimator in more than 3 dimensions.
In more detail, we prove that for any d ≥ 1 and ǫ > 0, given Õ_d((1/ǫ)^{(d+3)/2}) samples drawn from
an unknown log-concave density f0 on R^d, the MLE outputs a hypothesis h that with high probability
is ǫ-close to f0 in squared Hellinger loss. A sample complexity lower bound of Ω_d((1/ǫ)^{(d+1)/2}) was
previously known for any learning algorithm that achieves this guarantee. We thus establish that the
sample complexity of the log-concave MLE is near-optimal, up to an Õ(1/ǫ) factor.
∗ Supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.
† Supported by NSF Award CCF-1453472 (CAREER) and NSF grant CCF-1423230.
1 Introduction
1.1 Background
The general task of estimating a probability distribution under certain qualitative assumptions about the
shape of its probability density function has a long history in statistics, dating back to the pioneering work
of Grenander [Gre56] who analyzed the maximum likelihood estimator of a univariate monotone density.
Since then, shape constrained density estimation has been a very active research area with a rich literature in
mathematical statistics and, more recently, in computer science. A wide range of shape constraints have been
studied, including unimodality, convexity and concavity, k-modality, log-concavity, and k-monotonicity.
The reader is referred to [BBBB72] for a summary of the early work and to [GJ14] for a recent book on the
subject. (See Section 1.3 for a succinct summary of prior work.) The majority of the literature has studied
the univariate (one-dimensional) setting, which is by now fairly well-understood for a range of distributions.
On the other hand, the multivariate setting and specifically the regime of fixed dimension is significantly
more challenging and poorly understood for many natural distribution families.
In this work, we focus on the family of multivariate log-concave distributions. A distribution on Rd
is log-concave if the logarithm of its probability density function is concave (see Definition 1.1). Log-concave distributions constitute a rich non-parametric family encompassing a range of fundamental distributions, including uniform, normal, exponential, logistic, extreme value, Laplace, Weibull, Gamma, Chi
and Chi-Squared, and Beta distributions (see, e.g., [BB05]). Due to their fundamental nature and appealing
properties, log-concave distributions have been studied in a range of fields including economics [An95],
probability theory [SW14], computer science [LV07], and geometry [Sta89].
The problem of density estimation for log-concave distributions is of central importance in the area of
non-parametric shape constrained estimation [Wal09, SW14, Sam17] and has received significant attention
during the past decade in statistics [CSS10, DR09, DW16, CS13, KS16, BD18, HW16] and theoretical
computer science [CDSS13, CDSS14a, ADLS17, CDGR16, DKS16a, DKS17].
1.2 Our Results and Comparison to Prior Work
In this work, we analyze the global convergence rate of the maximum likelihood estimator (MLE) of a
multivariate log-concave density. Formally, we study the following fundamental question:
How many samples are information-theoretically sufficient so that the MLE of an arbitrary
log-concave density on Rd learns the underlying density, within squared Hellinger loss ǫ?
Perhaps surprisingly, despite significant effort within the statistics community on analyzing the log-concave
MLE, our understanding of its finite sample performance in constant dimension has remained poor. The
only result prior to this work that addressed the sample complexity of the MLE in more than one dimension
is by Kim and Samworth [KS16]. Specifically, [KS16] obtained the following results:
(1) a sample complexity lower bound of Ω_d((1/ǫ)^{(d+1)/2}) for d ∈ Z+ that applies to any estimator, and
(2) a sample complexity upper bound for the log-concave MLE for d ≤ 3.
Prior to our work, no finite sample upper bound was known for the log-concave MLE even for d = 4.
In recent related work, Diakonikolas, Kane, and Stewart [DKS17] established a finite sample complexity
upper bound for learning multivariate log-concave densities under global loss functions. Specifically, the
estimator analyzed in [DKS17] uses Õ_d((1/ǫ)^{(d+5)/2}) samples¹ and learns a log-concave density on R^d
within squared Hellinger loss ǫ, with high probability. We remark that the upper bound of [DKS17] was
¹ The Õ(·) notation hides logarithmic factors in its argument.
obtained by analyzing an estimator that is substantially different than the log-concave MLE. Moreover, the
analysis in [DKS17] has no implications on the performance of the MLE. Interestingly, some of the technical
tools employed in [DKS17] will be useful in our current setting.
Due to the fundamental nature of the MLE, understanding its performance merits investigation in its
own right. In particular, the log-concave MLE has an intriguing geometric structure that is a topic of current
investigation [CSS10, RSU17]. The output of the log-concave MLE satisfies several desirable properties
that may not be automatically satisfied by surrogate estimators. These include the log-concavity of the
hypothesis, the paradigm of log-concave projections and their continuity in Wasserstein distance, affine
equivariance, one-dimensional characterization, and adaptation (see, e.g., [Sam17]). An additional motivation comes from a recent conjecture (see, e.g., [Wel15]) that for 4-dimensional log-concave densities the
MLE may have sub-optimal sample complexity. These facts provide strong motivation for characterizing
the sample complexity of the log-concave MLE in any dimension.
To formally state our results, we will need some terminology. The squared Hellinger distance between
two density functions f, g : R^d → R+ is defined as h^2(f, g) = (1/2) · ∫_{R^d} (√f(x) − √g(x))^2 dx.
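For concreteness, the squared Hellinger distance can be approximated by simple quadrature. The snippet below is our own illustration (not part of the paper): it compares a midpoint-rule estimate against the known closed form for two univariate Gaussians.

```python
import math

def sq_hellinger(f, g, lo=-30.0, hi=30.0, m=80_000):
    """Approximate h^2(f, g) = (1/2) * integral of (sqrt(f) - sqrt(g))^2 dx
    by the midpoint rule on [lo, hi]."""
    dx = (hi - lo) / m
    total = 0.0
    for i in range(m):
        x = lo + (i + 0.5) * dx
        total += (math.sqrt(f(x)) - math.sqrt(g(x))) ** 2 * dx
    return 0.5 * total

def normal_pdf(mu, sigma):
    c = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return lambda x: c * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def sq_hellinger_gauss(mu1, s1, mu2, s2):
    """Closed form for two Gaussians:
    h^2 = 1 - sqrt(2*s1*s2/(s1^2+s2^2)) * exp(-(mu1-mu2)^2 / (4*(s1^2+s2^2)))."""
    v = s1 * s1 + s2 * s2
    return 1.0 - math.sqrt(2 * s1 * s2 / v) * math.exp(-((mu1 - mu2) ** 2) / (4 * v))

h2_num = sq_hellinger(normal_pdf(0, 1), normal_pdf(1, 2))
h2_exact = sq_hellinger_gauss(0, 1, 1, 2)
```

The two values agree to several decimal places, confirming the quadrature is a faithful stand-in for the definition.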
We now define our two main objects of study:
Definition 1.1 (Log-concave Density). A probability density function f : R^d → R+, d ∈ Z+, is called
log-concave if there exists an upper semi-continuous concave function φ : R^d → [−∞, ∞) such that
f(x) = e^{φ(x)} for all x ∈ R^d. We will denote by Fd the set of upper semi-continuous, log-concave densities
with respect to the Lebesgue measure on R^d.
Definition 1.2 (Log-concave MLE). Let f0 ∈ Fd and X1, . . . , Xn be iid samples from f0. The maximum
likelihood estimator, fˆn, is the density fˆn ∈ Fd which maximizes (1/n) Σ_{i=1}^n log(f(Xi)) over all f ∈ Fd.
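Computing the actual non-parametric MLE over Fd requires dedicated optimization software. As a toy stand-in (our own illustration, not the paper's estimator), the sketch below maximizes the empirical log-likelihood over a small hypothetical grid of log-concave candidates, namely centered Gaussians N(0, σ^2):

```python
import math, random

def avg_log_likelihood(samples, sigma):
    """Empirical log-likelihood (1/n) * sum_i log f(X_i) for the
    N(0, sigma^2) density, which is log-concave."""
    n = len(samples)
    const = -math.log(sigma * math.sqrt(2 * math.pi))
    return sum(const - 0.5 * (x / sigma) ** 2 for x in samples) / n

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(20_000)]

# Grid of candidate log-concave densities; pick the likelihood maximizer.
grid = [0.25 * k for k in range(2, 17)]  # sigma in {0.5, 0.75, ..., 4.0}
sigma_hat = max(grid, key=lambda s: avg_log_likelihood(samples, s))
```

With samples drawn from N(0, 1), the grid maximizer lands at σ ≈ 1, mirroring the fact that the MLE concentrates around the true density as n grows.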
We can now state our main result:
Theorem 1.3 (Main Result). Fix d ∈ Z+ and 0 < ǫ < 1. Let n = Õ_d((1/ǫ)^{(d+3)/2}). For any f0 ∈ Fd,
with probability at least 9/10 over the n samples from f0, we have that h^2(fˆn, f0) ≤ ǫ.
See Theorem 3.1 for a more detailed statement. The aforementioned lower bound of [KS16] implies that
our upper bound is tight up to an Õ_d(ǫ^{−1}) multiplicative factor.
1.3 Related Work
Shape constrained density estimation is a vibrant research field within mathematical statistics. Statistical research in this area started in the 1950s and has seen a recent surge of research activity, in part due to the ubiquity of structured distributions in various domains. The standard method used in statistics to address density
estimation problems of this form is the MLE. See [Bru58, Rao69, Weg70, HP76, Gro85, Bir87a, Bir87b,
Fou97, CT04, BW07, JW09, DR09, BRW09, GW09, BW10, KM10, Wal09, CS13, KS16, BD18, HW16]
for a partial list of works analyzing the MLE for various distribution families. During the past decade, there
has been a large body of work on shape constrained density estimation in computer science with a focus on
both sample and computational efficiency [DDS12a, DDS12b, DDO+ 13, CDSS13, CDSS14a, CDSS14b,
ADH+ 15, ADLS17, DKS16c, DKS16d, DDKT16, DKS16b, VV16, DKS17].
Density estimation of log-concave densities has been extensively investigated. The univariate case is
by now well understood [DL01, CDSS14a, ADLS17, KS16, HW16]. For example, it is known [KS16,
HW16] that Θ(ǫ^{−5/4}) samples are necessary and sufficient to learn an arbitrary log-concave density over R
within squared Hellinger loss ǫ. Moreover, the MLE is sample-efficient [KS16, HW16] and attains certain
adaptivity properties [KGS16]. A recent line of work in computer science [CDSS13, CDSS14a, ADLS17,
CDGR16, DKS16a] gave efficient algorithms for log-concave density estimation under the total variation
distance.
Density estimation of multivariate log-concave densities is poorly understood. A line of work [CSS10,
DR09, DW16, CS13, BD18] has obtained a complete understanding of the global consistency properties of
the MLE for any dimension d. However, both the rate of convergence of the MLE and the minimax rate of
convergence remain unknown. For dimension d ≤ 3, [KS16] show that the MLE is sample near-optimal
(within logarithmic factors) under the squared Hellinger distance. [KS16] also prove bracketing entropy
lower bounds suggesting that the MLE may be sub-optimal for d > 3 (also see [Wel15]).
1.4 Technical Overview
Here we provide a brief overview of our proof in tandem with a comparison to prior work. We start by noting
that the previously known sample complexity upper bound of the log-concave MLE for d ≤ 3 [KS16] was
obtained by bounding from above the bracketing entropy of the class. As we explain below, our argument
is more direct, making essential use of the VC inequality (Theorem 2.1), a classical result from empirical
process theory. In contrast to prior work on log-concave density estimation [KS16, DKS17] which relied
on approximations to (log)-concave functions, we start by considering approximations to convex sets. Let
f0 be the target log-concave density. We show (Lemma 3.4) that given sufficiently many samples from f0 ,
with high probability, for any convex set C the empirical mass of C and the probability mass of C under f0
are close to each other. We then leverage this structural lemma to analyze the error in the log-likelihood of
log-concave densities, using the fact that the superlevel sets of a log-concave density are convex.
We remark that our aforementioned structural result (Lemma 3.4) crucially requires the assumption of
the log-concavity of f0 . Naively, one may think that this lemma follows directly from the VC inequality.
Recall however that the VC-dimension of the family of convex sets is infinite, even in the plane. For example, for the uniform distribution over the unit circle, a similar result does not hold for any finite number of
samples, and so we need to use the fact that f0 is log-concave. To prove our lemma, we consider judicious
approximations of the convex set C with convex polytopes using known results from convex geometry. In
more detail, we approximate C from the inside and from the outside by polytopes from a family of bounded VC dimension whose probability mass under f0 is close to that of C.
For any log-concave density f , the probabilities of any superlevel set are close under the empirical
distribution and f0 . If log f were bounded, then that would mean that the empirical log-likelihood of f
and the log-likelihood of f under f0 were close. Unfortunately, for any density f , log f is unbounded from
below. To deal with this issue, we instead consider log(max(f, pmin)), for a carefully chosen probability
value pmin such that the contribution of the density below pmin can be ignored when f is close to f0. If we
can bound the range of log(max(f, pmin )), we can show that its expectation under f0 and its empirical
version are close to each other (see Lemma 3.7). To bound the range, we show that if the maximum value
of f is much larger than the maximum of f0 , then f has small probability mass outside a set A of small
volume; since A has small volume, we see many samples outside it, and so the empirical log-likelihood of
f is smaller than the empirical log-likelihood of f0 . Using this fact, we can show that for the MLE fˆn the
expectation of log(max(fˆn , pmin )) is large under f0 and then that fˆn is close in Hellinger distance to f0 .
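This truncation step is easy to probe numerically. The sketch below is our own illustration, with a standard normal standing in for the density and an arbitrary cutoff z: replacing ln f by ln(max(f, pmin)) changes the expected log-likelihood only negligibly when pmin lies far below the bulk of the density.

```python
import math

def truncated_ll_gap(z_cut, lo=-12.0, hi=12.0, m=100_000):
    """For the standard normal f (max value M = 1/sqrt(2*pi)), compare
    E_f[ln f(X)] with E_f[ln max(f(X), p_min)], p_min = M * exp(-z_cut),
    via midpoint quadrature. Returns (E_ln_f, E_ln_trunc)."""
    M = 1.0 / math.sqrt(2 * math.pi)
    p_min = M * math.exp(-z_cut)
    dx = (hi - lo) / m
    e_full = e_trunc = 0.0
    for i in range(m):
        x = lo + (i + 0.5) * dx
        fx = M * math.exp(-0.5 * x * x)
        e_full += fx * math.log(fx) * dx
        e_trunc += fx * math.log(max(fx, p_min)) * dx
    return e_full, e_trunc

e_full, e_trunc = truncated_ll_gap(z_cut=30.0)
```

Here E[ln f] = −(1/2) ln(2π) − 1/2 exactly, and the truncated version differs from it by an amount far below any ǫ of interest, which is precisely why the contribution below pmin can be ignored.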
1.5 Organization
After setting up the required preliminaries in Section 2, in Section 3 we present the proof of our main result,
modulo the proof of our main lemma (Lemma 3.4). In Section 4, we give a slightly weaker version of
Lemma 3.4 that has a significantly simpler proof. In Section 5, we present the proof of Lemma 3.4. Finally,
we conclude with a few open problems in Section 6.
2 Preliminaries
Notation and Definitions. For m ∈ Z+, we denote [m] def= {1, . . . , m}. Let f : R^d → R be a Lebesgue
measurable function. We will use f(A) to denote ∫_A f(x)dx. A Lebesgue measurable function f : R^d → R
is a probability density function (pdf) if f(x) ≥ 0 for all x ∈ R^d and ∫_{R^d} f(x)dx = 1. Let f, g : R^d → R+
be probability density functions. The total variation distance between f and g is defined as
dTV(f, g) = sup_S |f(S) − g(S)|, where the supremum is over all Lebesgue measurable subsets of the
domain. We have that dTV(f, g) = (1/2) · ‖f − g‖_1 = (1/2) · ∫_{R^d} |f(x) − g(x)|dx. The Kullback-Leibler
(KL) divergence from g to f is defined as KL(f||g) = ∫_{−∞}^{∞} f(x) ln(f(x)/g(x)) dx.
For f : A → B and A′ ⊆ A, the restriction of f to A′ is the function f|A′ : A′ → B. For y ∈ R+ and
f : R^d → R+ we denote by Lf(y) def= {x ∈ R^d | f(x) ≥ y} its superlevel sets. If f is log-concave, Lf(y)
is a convex set for all y ∈ R+. For a function f : R^d → R+, we will denote by Mf its maximum value.
The VC inequality. We start by recalling the notion of VC dimension. We say that a set X ⊆ Rd is
shattered by a collection A of subsets of Rd , if for every Y ⊆ X there exists A ∈ A such that A ∩ X = Y .
The VC dimension of a family A of subsets of Rd is defined to be the maximum cardinality of a subset
X ⊆ Rd that is shattered by A. If there is a shattered subset of size s for all s ∈ Z+ , then we say that the
VC dimension of A is ∞.
The empirical distribution, fn, corresponding to a density f : R^d → R+ is the discrete probability
measure defined by fn(A) = (1/n) · Σ_{i=1}^n 1_A(Xi), where the Xi are iid samples drawn from f and 1_S is
the characteristic function of the set S. Let f : R^d → R be a Lebesgue measurable function. Given a family
A of measurable subsets of R^d, we define the A-norm of f by ‖f‖_A = sup_{A∈A} |f(A)|. The VC inequality
states the following:
Theorem 2.1 (VC inequality, see [DL01], p. 31). Let f : R^d → R+ be a probability density function and
fn be the empirical distribution obtained after drawing n samples from f. Let A be a family of subsets over
R^d with VC dimension V. Then E[‖f − fn‖_A] ≤ C√(V/n), for some universal constant C > 0.
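To make the A-norm concrete, take A to be the family of halflines (−∞, t], which has VC dimension 1; then ‖f − fn‖_A is the classical Kolmogorov distance. The sketch below (our own illustration, not from the paper) checks that the empirical deviation is on the order of √(V/n):

```python
import math, random

def a_norm_halflines(samples, true_cdf):
    """||f - f_n||_A for A = {(-inf, t] : t in R} (VC dimension 1):
    the Kolmogorov distance between the true and empirical CDFs."""
    xs = sorted(samples)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        c = true_cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        worst = max(worst, abs(c - i / n), abs(c - (i + 1) / n))
    return worst

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

random.seed(1)
n = 10_000
dist = a_norm_halflines([random.gauss(0, 1) for _ in range(n)], std_normal_cdf)
```

With V = 1 and n = 10,000 the theorem predicts a typical deviation of order √(1/n) = 0.01, and the observed distance is of exactly that magnitude.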
We will also require a high probability version of the VC inequality which can be obtained using the
following standard uniform convergence bound:
Theorem 2.2 (see [DL01], p. 17). Let A be a family of subsets over R^d and fn be the empirical distribution
of n samples from the density f : R^d → R+. Let X be the random variable ‖f − fn‖_A. Then for all δ > 0,
we have that Pr[X − E[X] > δ] ≤ e^{−2nδ^2}.
Approximating Convex Sets by Polytopes. We make use of the following quantitative bounds of [GMR95]
that provide volume approximation for any convex body by an inscribed and a circumscribed convex polytope, respectively, each with a bounded number of facets:
Theorem 2.3. For any convex body K ⊆ R^d and ℓ sufficiently large, there exists a convex polytope
P ⊆ K with at most ℓ facets such that vol(K \ P) ≤ (κd/ℓ^{2/(d−1)}) · vol(K), where κ > 0 is a universal constant.
Similarly, there exists a convex polytope P′ with K ⊆ P′ and at most ℓ facets such that
vol(P′ \ K) ≤ (κd/ℓ^{2/(d−1)}) · vol(K).
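For intuition about the rate in Theorem 2.3, in the plane (d = 2) the inscribed regular ℓ-gon of the unit disk already exhibits the ℓ^{−2/(d−1)} = ℓ^{−2} behavior; the following check (our own, using the closed-form polygon area) verifies that the relative volume gap behaves like (2π^2/3)/ℓ^2:

```python
import math

def inscribed_polygon_gap(ell):
    """Relative volume gap vol(K \\ P) / vol(K) for K the unit disk (d = 2)
    and P the inscribed regular ell-gon; Theorem 2.3 predicts
    O(1) / ell^{2/(d-1)} = O(1) / ell^2 for d = 2."""
    area_polygon = 0.5 * ell * math.sin(2 * math.pi / ell)
    return (math.pi - area_polygon) / math.pi

gap_100 = inscribed_polygon_gap(100)
gap_200 = inscribed_polygon_gap(200)
```

Multiplying the gap by ℓ^2 gives approximately 2π^2/3 ≈ 6.58 for both values of ℓ, in line with the theorem's ℓ^{−2/(d−1)} scaling.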
3 Main Result: Proof of Theorem 1.3
The following theorem is a more detailed version of Theorem 1.3 and is the main result of this paper:
Theorem 3.1. Fix d ∈ Z+ and 0 < ǫ, τ < 1. Let n = O(((d^2/ǫ) ln^3(d/(ǫτ)))^{(d+3)/2}). For any f0 ∈ Fd,
with probability at least 1 − τ over the n samples from f0, we have that h^2(fˆn, f0) ≤ ǫ.
This section is devoted to the proof of Theorem 3.1, which follows from Lemma 3.14. We will require
a sequence of intermediate lemmas and claims.
We summarize the notation that will appear throughout this proof. We use f0 ∈ Fd to denote the
target log-concave density. We denote by fn the empirical density obtained after drawing n iid samples
X1 , . . . , Xn from f0 and by fˆn the corresponding MLE. Given d ∈ Z+ and 0 < ǫ, τ < 1, for concreteness,
we will denote:
N1 def= Θ(((d^2/ǫ) ln^3(d/(ǫτ)))^{(d+3)/2}),
for a sufficiently large universal constant in the big-Θ notation. We will establish that N1 is an upper bound
on the desired sample complexity of the MLE. Moreover, we will denote
z def= ln(100n^4/τ^2),  S def= Lf0(Mf0 e^{−z}),  δ def= ǫ/(32 ln(100n^4/τ^2)),  and  pmin def= Mf0/(100n^4/τ^2).
We start by establishing an upper bound on the volume of superlevel sets:
Lemma 3.2 (see, e.g., [DKS17], p. 8). Let f ∈ Fd with maximum value Mf. Then for all z′ ≥ 1, we have
vol(Lf(Mf e^{−z′})) ≤ O(z′^d/Mf), and Pr_{X∼f}[f(X) ≤ Mf e^{−z′}] ≤ O(d)^d e^{−z′/2}.
We defer this proof to Appendix A. We use Lemma 3.2 to get a bound on the volume of the superlevel
set that contains all the samples with high probability:
Corollary 3.3. For n ≥ N1, we have that:
(a) vol(S) = O(z^d/Mf0), and
(b) Pr_{X∼f0}[f0(X) ≤ Mf0/(100n^4/τ^2)] ≤ τ/(3n). In particular, with probability at least 1 − τ/3, all
samples X1, . . . , Xn from f0 are in S.
Proof. From Lemma 3.2, we have that vol(S) = vol(Lf0(Mf0 e^{−z})) ≤ O(z^d/Mf0). Also from Lemma 3.2,
we have that Pr_{X∼f0}[f0(X) ≤ Mf0/(100n^4/τ^2)] ≤ τ/(3n), if we assume a sufficiently large constant is
selected in the definition of N1. Taking a union bound over all samples, we get that with probability at least
1 − τ/3, all of the n samples are in S, as required.
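As a quick numerical sanity check (our own, with f0 a standard normal in one dimension and illustrative values n = 1000, τ = 0.1): for the paper's choice of z, the superlevel set S = Lf0(Mf0 e^{−z}) contains every sample with overwhelming probability.

```python
import math, random

random.seed(2)
n, tau = 1000, 0.1
z = math.log(100 * n**4 / tau**2)   # the paper's choice of z
# For the standard normal f0, M_f0 = 1/sqrt(2*pi) and
# f0(x) >= M_f0 * exp(-z)  iff  x*x/2 <= z,  i.e.  |x| <= sqrt(2z).
radius = math.sqrt(2 * z)
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
inside = sum(1 for x in samples if abs(x) <= radius)
```

Here z ≈ 37, so S reaches out to roughly 8.6 standard deviations; the chance that even one of the 1000 samples escapes it is astronomically small.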
We can now state our main lemma establishing an upper bound on the error of approximating the probability of every convex set:
Lemma 3.4. For n ≥ N1 , we have that with probability at least 1 − τ /3 over the choice of X1 , . . . , Xn
drawn from f0 , for any convex set C ⊂ Rd it holds that |f0 (C) − fn (C)| ≤ δ.
The proof of Lemma 3.4 is deferred to Section 5. In Section 4, we establish a weaker version of this
lemma that requires more samples but has a simpler proof. Combining Lemma 3.4 with the observation that
for any log-concave density f and t > 0 we have that Lf (t) is convex, we obtain the following corollary:
Corollary 3.5. Let n ≥ N1. Conditioning on the event of Lemma 3.4, we have that for any f ∈ Fd and for
any t ≥ 0 it holds that |Pr_{X∼f0}[f(X) ≥ t] − Pr_{X∼fn}[f(X) ≥ t]| < δ.
We will require the following technical claim, which follows from standard properties of Lebesgue
integration (see Appendix A):
Lemma 3.6. Let g, h : R^d → R be pdfs, and φ : R^d → R. If E_{Y∼g}[φ(Y)] and E_{Y∼h}[φ(Y)] are both finite, then
|E_{Y∼g}[φ(Y)] − E_{Y∼h}[φ(Y)]| ≤ ∫_{−∞}^{∞} |Pr_{Y∼g}[φ(Y) < x] − Pr_{Y∼h}[φ(Y) < x]| dx.
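For φ the identity and two hypothetical Gaussian densities playing the roles of g and h (our own illustration), the right-hand side of Lemma 3.6 is the L1 distance between the two CDFs, and the bound can be confirmed numerically:

```python
import math

def normal_cdf(x, mu, s):
    return 0.5 * (1.0 + math.erf((x - mu) / (s * math.sqrt(2))))

def cdf_gap_integral(mu1, s1, mu2, s2, lo=-30.0, hi=30.0, m=60_000):
    """Right-hand side of Lemma 3.6 for phi = identity:
    integral of |Pr_g[Y < x] - Pr_h[Y < x]| dx, by the midpoint rule."""
    dx = (hi - lo) / m
    return sum(abs(normal_cdf(lo + (i + 0.5) * dx, mu1, s1)
                   - normal_cdf(lo + (i + 0.5) * dx, mu2, s2)) * dx
               for i in range(m))

# g = N(0, 1), h = N(0.5, 1.5^2); phi = identity, so the left-hand side
# is just the difference of the means.
lhs = abs(0.0 - 0.5)
rhs = cdf_gap_integral(0.0, 1.0, 0.5, 1.5)
```

The inequality holds with room to spare; in fact for the identity test function the right-hand side is the Wasserstein-1 distance, which always dominates the difference of means.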
Our next lemma establishes a useful upper bound on the empirical error of the truncated likelihood of
any log-concave density:
Lemma 3.7. Let n ≥ N1 and f ∈ Fd with maximum value Mf . For all ρ ∈ (0, Mf ], conditioning on the
event of Corollary 3.5, we have
|EX∼f0 [ln(max(f (X), ρ))] − EX∼fn [ln(max(f (X), ρ))]| ≤ δ · ln(Mf /ρ) .
Proof. Letting h = f0, g = fn, and φ(x) = ln(max(f(x), ρ)), by Lemma 3.6 we have
|E_{X∼f0}[ln(max(f(X), ρ))] − E_{X∼fn}[ln(max(f(X), ρ))]|
≤ ∫_{−∞}^{∞} |Pr_{X∼f0}[ln(max(f(X), ρ)) < t] − Pr_{X∼fn}[ln(max(f(X), ρ)) < t]| dt
= ∫_{−∞}^{ln Mf} |Pr_{X∼f0}[max(ln f(X), ln ρ) < t] − Pr_{X∼fn}[max(ln f(X), ln ρ) < t]| dt
= ∫_{ln ρ}^{ln Mf} |Pr_{X∼f0}[ln f(X) < t] − Pr_{X∼fn}[ln f(X) < t]| dt
= ∫_{ln ρ}^{ln Mf} |Pr_{X∼f0}[f(X) < e^t] − Pr_{X∼fn}[f(X) < e^t]| dt
= ∫_{ln ρ}^{ln Mf} |Pr_{X∼f0}[f(X) ≥ e^t] − Pr_{X∼fn}[f(X) ≥ e^t]| dt.
Since we conditioned on the event of Corollary 3.5, we have |Pr_{X∼f0}[f(X) ≥ t] − Pr_{X∼fn}[f(X) ≥ t]| ≤ δ
for all t ≥ 0. Therefore, we have that
|E_{X∼f0}[ln(max(f(X), ρ))] − E_{X∼fn}[ln(max(f(X), ρ))]| ≤ ∫_{ln ρ}^{ln Mf} δ dt = δ · (ln Mf − ln ρ),
which concludes the proof.
For f0 itself, we can use Hoeffding’s inequality to get a bound on the empirical error of its likelihood:
Lemma 3.8. Let n ≥ N1. Conditioning on the event of Corollary 3.3, with probability at least 1 − τ/3 over
X1, . . . , Xn, we have that
|(1/n) Σ_{i=1}^n ln f0(Xi) − E_{X∼f0}[ln f0(X)]| ≤ ǫ/8.
Proof. Recall that z def= ln(100n^4/τ^2), S def= Lf0(Mf0 e^{−z}), and pmin def= Mf0/(100n^4/τ^2). Note that for
any x ∈ S, we have f0(x) ≥ pmin by construction. Since we have conditioned on the event of Corollary 3.3
holding, it follows that for each i ∈ [n], f0(Xi) ≥ pmin. Therefore, letting ρ def= pmin, we have
|(1/n) Σ_{i=1}^n ln(f0(Xi)) − E_{X∼f0}[ln f0(X)]| = |(1/n) Σ_{i=1}^n ln(max(f0(Xi), ρ)) − E_{X∼f0}[ln f0(X)]|
≤ |(1/n) Σ_{i=1}^n ln(max(f0(Xi), ρ)) − E_{X∼f0}[ln(max(f0(X), ρ))]|
  + |E_{X∼f0}[ln(max(f0(X), ρ))] − E_{X∼f0}[ln f0(X)]|
≤ |(1/n) Σ_{i=1}^n ln(max(f0(Xi), ρ)) − E_{X∼f0}[ln(max(f0(X), ρ))]|
  + ∫_{−∞}^{ln ρ} Pr[ln f0(X) ≤ T] dT.   (1)
By Hoeffding's inequality we have
Pr[ |(1/n) Σ_{i=1}^n ln(max(f0(Xi), ρ)) − E_{X∼f0}[ln(max(f0(X), ρ))]| > ǫ/16 ]
≤ 2 exp(−2n^2(ǫ/16)^2 / (n · (ln Mf0 − ln ρ)^2))
≤ 2 exp(−nǫ^2/16^2 / (ln(100n^4/τ^2))^2)
≤ τ/3.   (since n ≥ N1)   (2)
Next we have
∫_{−∞}^{ln ρ} Pr_{X∼f0}[ln f0(X) ≤ T] dT ≤ ∫_0^{∞} Pr_{X∼f0}[ln f0(X) ≤ ln ρ − y] dy   (setting y = ln ρ − T)
≤ ∫_0^{∞} O(d)^d (ρ/Mf0)^{1/2} e^{−y/2} dy   (by Lemma 3.2)
= ∫_0^{∞} O(d)^d (τ/(10n^2)) e^{−y/2} dy   (since ρ = pmin = Mf0/(100n^4/τ^2))
≤ 2 · O(d)^d · τ/(10n^2)
≤ ǫ/16.   (since n ≥ N1)   (3)
By applying (2) and (3) to bound (1) from above, with probability at least 1 − τ/3 we have that
|(1/n) Σ_{i=1}^n ln f0(Xi) − E_{X∼f0}[ln f0(X)]| ≤ ǫ/8,
which concludes the proof.
The following simple lemma shows that the MLE is supported in the convex hull of the samples:
Lemma 3.9. Let n ≥ 1. Let X1 , . . . , Xn be samples drawn from f0 , and C be the convex hull of these
samples. Then, for all x ∈ Rd \ C, we have fˆn (x) = 0.
Proof. Suppose there exists x ∈ R^d \ C such that fˆn(x) > 0. Then, we have that Lfˆn(fˆn(x)) \ C ≠ ∅,
and thus ∫_{R^d \ C} fˆn(x)dx > 0. From this, it follows that ∫_C fˆn(x)dx < 1, and so there exists some α > 1
such that α ∫_C fˆn(x)dx = 1. Let ĝn : C → R be such that ĝn = α · fˆn|C. Since C is a convex set and
∫_C ĝn(x)dx = 1, we have that ĝn is a log-concave density. Observe that
(1/n) Σ_{i=1}^n log(ĝn(Xi)) = (1/n) Σ_{i=1}^n log(α fˆn(Xi)) > (1/n) Σ_{i=1}^n log(fˆn(Xi)),   (4)
where we used that α > 1. By definition, fˆn maximizes (1/n) Σ_{i=1}^n log(f(Xi)) over all log-concave densities
f, which contradicts (4). Therefore, for all x ∈ R^d \ C, we have that fˆn(x) = 0.
We need to truncate the likelihood at a density small enough to be ignored for f close to f0 . This
motivates the following definition:
Definition 3.10. We define f′ : R^d → R such that f′(x) def= max{pmin, fˆn(x)}.
We show that this truncation and renormalization does not affect the MLE fˆn by much:
Lemma 3.11. Let n ≥ N1. Let g(x) def= α · f′(x)|S, α ∈ R+, be such that ∫_S g(x)dx = 1. Conditioning on
the event of Corollary 3.3, we have the following:
(a) 1 − ǫ/32 ≤ α ≤ 1, and
(b) dTV(g, fˆn) ≤ 3ǫ/64.
Proof. We start by proving (a). By the definition of g and Lemma 3.9, we have that α = α ∫_S fˆn(x)dx ≤
α ∫_S f′(x)dx = ∫_S g(x)dx = 1, i.e., α ≤ 1. Furthermore, by the definition of pmin and Corollary 3.3, we
have
pmin · vol(S) ≤ (Mf0/(100n^4/τ^2)) · O((ln(100n^4/τ^2))^d)/Mf0 ≤ ǫ/32,   (5)
and therefore
1 = ∫_S g(x)dx ≤ α (∫_S pmin dx + ∫_S fˆn(x)dx) ≤ α(pmin · vol(S) + 1) ≤ α(ǫ/32 + 1).
From this it follows that α ≥ 1/(1 + ǫ/32) ≥ 1 − ǫ/32. We have
dTV(g, fˆn) = (1/2) ∫_{R^d} |g(x) − fˆn(x)|dx = (1/2) ∫_S |g(x) − fˆn(x)|dx,   (6)
since g is defined on S and fˆn is supported in S by Lemma 3.9. We can then write
(1/2) ∫_S |g(x) − fˆn(x)|dx = (1/2) ∫_S |α f′(x) − fˆn(x)|dx
≤ (1/2) (∫_S |α − 1| · fˆn(x)dx + pmin · vol(S))
≤ (|α − 1|/2) ∫_S fˆn(x)dx + ǫ/32   (from (5))
≤ |1 − α|/2 + ǫ/32 ≤ 3ǫ/64,
which completes the proof.
To deal with the dependence on the maximum value of f in Lemma 3.7, we need to bound the maximum
value of the MLE.
Lemma 3.12. Let n ≥ N1. Let X1, . . . , Xn be samples drawn from f0. Then conditioning on the events
of Corollary 3.5 and Lemma 3.8, for any f ∈ Fd with maximum value Mf such that ln(Mf/pmin) ≥
4 ln(100n^4/τ^2), we have (1/n) Σ_{i=1}^n ln f(Xi) < (1/n) Σ_{i=1}^n ln f0(Xi).
Proof. This lemma holds because a density f with a large Mf must be small outside a set of small volume,
and most of the samples will fall outside that set. Let
γ = exp(2((1/n) Σ_{i=1}^n ln f0(Xi) − (1/2) ln Mf − 1))
and A = Lf(γ).
If we have that vol(A) · Mf0 ≤ 1/3, then it follows that f0(A) ≤ 1/3. Since f is log-concave, A is a convex
set, and since we condition on Corollary 3.5 holding, we have with probability 1 that |f0(A) − fn(A)| <
δ < 1/6. Therefore, we have that fn(A) < 1/2, in which case at least 1/2 of the samples X1, . . . , Xn are
not contained within A. Thus, we have that
(1/n) Σ_{i=1}^n ln f(Xi) ≤ (1/2) ln γ + (1/2) ln Mf
= (1/2) · 2((1/n) Σ_{i=1}^n ln f0(Xi) − (1/2) ln Mf − 1) + (1/2) ln Mf
< (1/n) Σ_{i=1}^n ln f0(Xi).
Now we check to see how large Mf must be to ensure that vol(A) · Mf0 ≤ 1/3. We have that
vol(A) · Mf0 = vol(Lf(γ)) · Mf0
= vol(Lf(Mf · exp((2/n) Σ_{i=1}^n ln f0(Xi) − 2 − 2 ln Mf))) · Mf0
≤ (Mf0/Mf) · O(2 − (2/n) Σ_{i=1}^n ln f0(Xi) + 2 ln Mf)^d.   (by Lemma 3.2)
Since we condition on the event of Lemma 3.8 holding, we have with probability 1 that
(1/n) Σ_{i=1}^n ln f0(Xi) ≥ E_{X∼f0}[ln f0(X)] − ǫ ≥ ln pmin − ǫ,
and so we have that
vol(A) · Mf0 ≤ (Mf0/Mf) · O(2 + 2 ln Mf − 2 ln pmin + 2ǫ)^d
< (Mf0/Mf) · O(2 ln Mf − 2 ln Mf0 + 3 ln(100n^4/τ^2))^d.
The following claim follows by a simple calculation (see Appendix A):
Claim 3.13. If ln(Mf/Mf0) ≥ 3 ln(100n^4/τ^2), then vol(A) · Mf0 ≤ 1/3.
Therefore, for ln(Mf/Mf0) ≥ 3 ln(100n^4/τ^2) we have that (1/n) Σ_{i=1}^n ln f(Xi) < (1/n) Σ_{i=1}^n ln f0(Xi)
and
ln Mf − ln pmin = ln(Mf/Mf0) + ln(100n^4/τ^2) ≥ 4 ln(100n^4/τ^2),
concluding the proof.
We have now reached the final result of this section, from which Theorem 3.1 directly follows. Combining previous lemmas, we show that the likelihood under f0 of the truncated MLE is close to that of f0
and so they are close in KL divergence, which leads to a bound in the Hellinger distance of the MLE itself:
Lemma 3.14. Let n ≥ N1 . Let X1 , . . . , Xn be samples drawn from f0 . With probability at least 1 − τ , we
have that h2 (f0 , fˆn ) ≤ ǫ.
Proof. In this lemma, we will apply Lemmas 3.7, 3.8, 3.11, and 3.12. By examining the conditions of these
lemmas, it is easy to see that with probability at least 1 − τ they all hold. We henceforth condition on this
event.
Let X1, . . . , Xn be samples drawn from f0, and let fˆn be as in Definition 1.2. Let g and f′ be as defined in
Lemma 3.11 and Definition 3.10, and let S be as defined in Corollary 3.3. Then we have that
E_{X∼f0}[ln g(X)] = E_{X∼f0}[ln(α f′(X))]
≥ E_{X∼f0}[ln f′(X)] − ǫ/16   (since α > 1 − ǫ/32)
≥ E_{X∼f0}[ln(max{fˆn(X), pmin})] − ǫ/16
≥ E_{X∼fn}[ln(max{fˆn(X), pmin})] − 3ǫ/16   (by Lemmas 3.7 and 3.12)
≥ (1/n) Σ_{i=1}^n ln fˆn(Xi) − 3ǫ/16
≥ (1/n) Σ_{i=1}^n ln f0(Xi) − 3ǫ/16
≥ E_{X∼f0}[ln f0(X)] − 5ǫ/16.   (using Lemma 3.8)
Thus, we obtain that
KL(f0||g) = E_{X∼f0}[ln f0(X)] − E_{X∼f0}[ln g(X)] ≤ 5ǫ/16.   (7)
For the next derivation, we use that the Hellinger distance is related to the total variation distance and the
Kullback-Leibler divergence in the following way: for probability density functions k1, k2 : R^d → R, we have that
h^2(k1, k2) ≤ dTV(k1, k2) and h^2(k1, k2) ≤ KL(k1||k2). Therefore, we have that
h(f0, fˆn) ≤ h(f0, g) + h(g, fˆn)
≤ KL(f0||g)^{1/2} + dTV(g, fˆn)^{1/2}   (by (7) and Lemma 3.11)
= (5ǫ/16)^{1/2} + (3ǫ/64)^{1/2}
≤ ǫ^{1/2},
concluding the proof.
4 Warmup for the Proof of Lemma 3.4
For the sake of exposition of the main ideas used in the proof of Lemma 3.4, we first prove Lemma 4.2,
which achieves a weaker bound on the sample complexity, but has a significantly simpler proof. Let us
first give a brief, and somewhat imprecise, overview of the proof of Lemma 4.2. The high-level goal is to
approximate some convex set C ⊂ R^d by a set belonging to a family of low VC dimension. We can then
obtain the desired bound using Theorem 2.1. To that end, we compute inner and outer approximations,
C^{in} and C^{out}, of C via polyhedral sets with a small number of facets. By Lemma 4.1, we can argue that
the VC dimension of this family is low. We therefore obtain that f0 and fn are close on the inner and outer
approximations of C. It remains to argue that the total difference between f0 and fn on C^{out} \ C^{in} is also
small. It thus suffices to bound the volume of C^{out} \ C^{in}. This can be achieved by first defining a set
S ⊂ R^d that excludes the tail of f0. Since f0 is log-concave, we can show that S has small volume. The final
bound is obtained by restricting the above argument to C ∩ S.
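The quantity |f0(C) − fn(C)| controlled by this argument is easy to probe by direct simulation. Below is our own illustration, with f0 a standard two-dimensional Gaussian and C the unit disk, where the true mass has a closed form:

```python
import math, random

random.seed(3)
n = 20_000
# f0: standard 2-D Gaussian; C: the unit disk (a convex set).
# True mass: Pr[X1^2 + X2^2 <= 1] = 1 - exp(-1/2), since the squared
# radius follows the chi-squared law with 2 degrees of freedom.
true_mass = 1.0 - math.exp(-0.5)
hits = sum(1 for _ in range(n)
           if random.gauss(0, 1) ** 2 + random.gauss(0, 1) ** 2 <= 1.0)
empirical_mass = hits / n
```

The empirical mass of this fixed convex set matches the true mass to within a few parts in a thousand; the content of Lemma 3.4 is that, for log-concave f0, such closeness holds uniformly over all convex sets simultaneously.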
Throughout this section, we define N2 def= Θ(2^{O(d)} (d^{2d+3}/ǫ)^{(d+5)/2} (ln(d^{d+1}/(ǫτ)))^{d+1}).
.
We will require the following simple fact:
Lemma 4.1 (see [ASE92]). Let h, d ∈ Z+ , and let A be the set of all convex polytopes in Rd with at most
h facets. Then, the VC dimension of A is at most 2(d + 1)h log((d + 1)h).
The main result of this section is the following:
Lemma 4.2. Let n ≥ N2. With probability at least 1 − 3τ/10 over the choice of X1, . . . , Xn, for any convex
set C ⊂ R^d it holds that |f0(C) − fn(C)| < δ.
Proof. Recall that z def= ln(100n^4/τ^2) and S def= Lf0(Mf0 e^{−z}). Let C ⊂ R^d be a convex set, and let
C′ def= C ∩ S. Since f0 is log-concave, it follows that S is convex, and thus C′ is also convex.
Let E1 be the event that all samples X1, . . . , Xn lie in S. By Corollary 3.3 we have
Pr_{X1,...,Xn∼f0}[E1] = Pr_{fn}[E1] ≥ 1 − τ/10.   (8)
Conditioned on E1 occurring, we have, with probability 1, fn(C) = fn(C′). In other words,
Pr_{fn}[fn(C \ C′) = 0 | E1] = 1.   (9)
From Corollary 3.3, we have Pr_{X∼f0}[f0(X) ≤ Mf0/(100n^4/τ^2)] ≤ τ/(10n), and therefore
f0(C \ C′) ≤ f0(R^d \ S) ≤ τ/(10n) ≤ δ/5.   (10)
Combining (8), (9), and (10), we get
Pr_{fn}[|f0(C \ C′) − fn(C \ C′)| ≤ δ/5] ≥ Pr_{fn}[|f0(C \ C′) − fn(C \ C′)| ≤ δ/5 | E1] · Pr_{fn}[E1]
≥ Pr_{fn}[fn(C \ C′) = 0 | E1] · Pr[E1]
≥ 1 − τ/10.   (11)
Let A be the set of convex polytopes in R^d with at most H = (10κdz^d/δ)^{(d−1)/2} facets, where κ is the
universal constant in Theorem 2.3. By Theorem 2.3, there exist convex polytopes T, T′ ∈ A, with T ⊆
C′ ⊆ T′, such that
vol(C′ \ T) ≤ (κd/(10κdz^d/δ)) vol(C′) = (δ/(10z^d)) vol(C′) ≤ (δ/(10z^d)) vol(S) ≤ δ/(10Mf0),   (12)
and
vol(T′ \ C′) ≤ (κd/(10κdz^d/δ)) vol(C′) = (δ/(10z^d)) vol(C′) ≤ (δ/(10z^d)) vol(S) ≤ δ/(10Mf0).
Therefore, since Mf0 is the maximum value of f0, we have
f0(C′ \ T) ≤ vol(C′ \ T) · Mf0 ≤ δ/10,   (13)
and
f0(T′ \ C′) ≤ vol(T′ \ C′) · Mf0 ≤ δ/10.   (14)
(14)
Noting that E[|f0 (T ) − fn (T )|] ≤ E[||f0 − fn ||A ], by Theorem 2.1 we have for some universal constant α
that
r
αV
E[|f0 (T ) − fn (T )|] ≤
.
n
The following claim is obtained via a simple calculation (see Appendix A):
q
Claim 4.3. For n ≥ N2 , we have that αV
n ≤ δ/10.
Let E2 be the event that |f0 (T ) − fn (T )| ≤ δ/5. By Claim 4.3 and Theorem 2.2 we have
Prfn [E2 ] = 1 − Prfn [|f0 (T ) − fn (T )| > δ/5]
≥ 1 − Prfn [|f0 (T ) − fn (T )| − E[|f0 (T ) − fn (T )|] > δ/10]
≥ 1 − e−2n(δ/10)
2
≥ 1 − τ /10.
(15)
Similarly, we define E3 to be the event that |f0 (T ′ ) − fn (T ′ )| ≤ δ/5. Arguing as above, we obtain
Prfn [E3] ≥ 1 − e^{−2n(δ/10)^2} ≥ 1 − τ/10.   (16)
For any choice of samples X1 , . . . , Xn , we have
fn(C′) ≥ fn(T)   (since T ⊆ C′)
≥ f0(C′) − f0(C′ \ T) − |f0(T) − fn(T)|
≥ f0(C′) − δ/10 − |f0(T) − fn(T)|.   (by (12))   (17)
Thus, we get
Prfn [fn(C′) ≥ f0(C′) − 3δ/10] ≥ Prfn [|f0(T) − fn(T)| ≤ δ/5]   (by (17))
≥ Prfn [|f0(T) − fn(T)| ≤ δ/5 | E2] · Prfn [E2]
≥ 1 − τ/10.   (by (15))   (18)
In a similar way, using that C ′ ⊆ T ′ , for any choice of the samples X1 , . . . , Xn , we have
fn(C′) ≤ f0(C′) + δ/10 + |f0(T′) − fn(T′)|.   (by (14))   (19)
It therefore follows that
Prfn [fn(C′) ≤ f0(C′) + 3δ/10] ≥ Prfn [|f0(T′) − fn(T′)| ≤ δ/5]   (by (19))
≥ Prfn [|f0(T′) − fn(T′)| ≤ δ/5 | E3] · Prfn [E3]
≥ 1 − τ/10.   (by (16))   (20)
By (18) and (20) and the union bound, we obtain
Prfn [|fn(C′) − f0(C′)| ≤ 3δ/10] ≥ 1 − 2τ/10.   (21)
Combining (11) and (21) we get
Prfn [|fn(C) − f0(C)| ≤ 2δ/5] ≥ Prfn [(|fn(C \ C′) − f0(C \ C′)| ≤ δ/5) ∧ (|fn(C′) − f0(C′)| ≤ 3δ/10)]
≥ 1 − τ/10 − 2τ/10
= 1 − 3τ/10,
which concludes the proof.
5 Proof of Lemma 3.4
We are now ready to prove the main technical part of our work, which is Lemma 3.4. The proof builds
upon the argument used in the proof of Lemma 4.2, which achieves a weaker sample complexity bound.
Recall that in the proof of Lemma 4.2 we use inner and outer polyhedral approximations of C, restricted to some appropriate bounded set S ⊂ Rd. The main difference in the proof of Lemma 3.4 is that we now use roughly O(log n) inner and outer polyhedral approximations of the intersections of C with different superlevel sets of f0. We need slightly more samples due to the higher number of facets, and consequently the higher VC dimension, of the resulting approximations. However, since we use a finer discretization of the values of f0, we incur lower error in total.
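To make the discretization concrete, the following sketch works through a hypothetical one-dimensional instance, taking f0 to be the standard normal density (which is log-concave): the superlevel set Si = {x : f0(x) ≥ Mf0 e^{−i}} is the interval [−√(2i), √(2i)], the mass outside SL can be computed in closed form, and with high probability every sample lands inside SL. All parameter values below are illustrative choices, not quantities fixed by the proof.

```python
import math
import random

random.seed(0)
n, tau = 1000, 0.5

# L chosen as in the proof so that Pr[X not in S_L] <= tau/(10n).
L = math.ceil(math.log(100 * n**4 / tau**2))

# For f0 = standard normal: f0(x) >= M e^{-i}  iff  x^2 <= 2i,
# so S_i = [-sqrt(2i), sqrt(2i)].
radius = math.sqrt(2 * L)

# Exact mass outside S_L: Pr[|X| > sqrt(2L)] = erfc(sqrt(L)).
tail = math.erfc(math.sqrt(L))
assert tail <= tau / (10 * n)

# With these parameters, all n samples lie in S_L (the event E1).
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
inside = all(abs(x) <= radius for x in samples)
assert inside
```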
The following lemma is implicit in [DKS17]. We reproduce its proof for completeness in Appendix A.
Lemma 5.1. Let L, H ∈ Z+. Define the set AH,L, elements of which are defined by the following process: starting with L convex polytopes, each with at most H facets, all combinations of intersections and unions of these polytopes are elements of AH,L. If V is the VC dimension of AH,L, then V / log(V) = O(dLH).
We are now prepared to present the proof of Lemma 3.4. Let
Si = Lf0(Mf0 e^{−i}),
and let S0 = ∅. Let L be the minimum L′ ∈ Z+ such that PrX∼f0 [X ∉ SL′] ≤ τ/(10n). Let E1 be the event that all samples X1, . . . , Xn lie in SL. We have that
PrX1,...,Xn∼f0 [E1] = Prfn [E1] ≥ 1 − τ/10.   (22)
Note that L ≤ z := ln(100n^4/τ^2): by Lemma 3.2, we have that PrX∼f0 [f0(X) ≤ Mf0 e^{−z}] = O(d)^d e^{−z/2}, and thus
PrX∼f0 [X ∉ Sz] = PrX∼f0 [f0(X) < Mf0 e^{−z}] ≤ τ/(10n).
For a convex set C ⊆ Rd and for all i ∈ [L], let Ci = C ∩ Si.
Note that, conditioned on E1 occurring, we have with probability 1 that fn(C) = fn(CL). In other words,
Prfn [fn(C \ CL) = 0 | E1] = 1.   (23)
Furthermore, by our choice of L we have f0(Rd \ SL) ≤ τ/(10n), and therefore
f0(C \ CL) ≤ τ/(10n) ≤ δ/5.   (24)
Combining (22), (23), and (24), we have
Prfn [|f0(C \ CL) − fn(C \ CL)| ≤ δ/5] ≥ Prfn [|f0(C \ CL) − fn(C \ CL)| ≤ δ/5 | E1] · Prfn [E1]
≥ Prfn [fn(C \ CL) = 0 | E1] · Prfn [E1]
≥ 1 − τ/10.   (25)
Using Theorem 2.3, for i ∈ [L] let Pi^in, Pi^out be convex polytopes with H = (10κd/δ)^{(d−1)/2} facets, where κ is the universal constant from Theorem 2.3, such that Pi^in ⊆ Ci ⊆ Pi^out,
vol(Ci \ Pi^in) ≤ δ · vol(Ci)/10 ≤ δ · vol(Si)/10,   (26)
and
vol(Pi^out \ Ci) ≤ δ · vol(Ci)/10 ≤ δ · vol(Si)/10.   (27)
Let
C^in = ∪i∈[L] Pi^in.
For any i ∈ [L], let Pi^S be a convex polytope with at most H facets such that Pi^S ⊆ Si and vol(Si \ Pi^S) ≤ δ · vol(Si)/10.
Let
S_{i}′ = ∪1≤j≤i Pj^S
and S_{0}′ = ∅. Let
C^out = ∪i∈[L] (Pi^out \ S_{i−1}′).
We will now show that C^in and C^out satisfy the following conditions:
1. C^in ⊆ CL ⊆ C^out.
2. f0(C^out \ CL) < δ/2.
3. f0(CL \ C^in) < δ/2.
Figure 1: Constructing C in . For each set Si , a convex polytope approximating C ∩ Si from the inside is
found, and C in is formed by taking the union of these convex polytopes.
First, we consider C^in. Since Pi^in ⊆ Ci ⊆ CL for all i ∈ [L], it follows that ∪i∈[L] Pi^in = C^in ⊆ CL.
Observe that by the above definitions, we have that
(CL \ C^in) ∩ (Si \ S_{i−1}) ⊆ (CL \ C^in) \ S_{i−1} ⊆ (CL \ Pi^in) \ S_{i−1}.   (28)
From (28), we therefore have
CL \ C^in = ∪i∈[L] (CL \ C^in) ∩ (Si \ S_{i−1}) ⊆ ∪i∈[L] (Ci \ Pi^in) \ S_{i−1},   (29)
and so
f0(CL \ C^in) ≤ Σ_{i∈[L]} f0((Ci \ Pi^in) \ S_{i−1})   (by (29))
≤ Σ_{i∈[L]} vol((Ci \ Pi^in) \ S_{i−1}) · Mf0 e^{−(i−1)}
≤ Σ_{i∈[L]} vol(Ci \ Pi^in) · Mf0 e^{−(i−1)}
≤ Σ_{i∈[L]} (δ/10) · vol(Si) · Mf0 e^{−(i−1)}   (by (26))
≤ (δ/10) Σ_{i∈[L]} vol(Lf0(Mf0 e^{−i})) · Mf0 e^{−(i−1)}
≤ (δ/10) ∫_0^{Mf0} vol(Lf0(y)) dy < δ/2.   (30)
Now we consider C^out. Let x ∈ CL. Then there exists i ∈ [L] such that x ∈ Si and x ∉ S_{i−1}. Thus x ∈ Pi^out and x ∉ S_{i−1}′, from which we have that x ∈ C^out = ∪i∈[L] (Pi^out \ S_{i−1}′). Therefore CL ⊆ C^out.
Let y ∈ C^out \ CL. From the definition of C^out, there must exist some i ∈ [L] such that y ∈ Pi^out \ S_{i−1}′. If y ∈ Pi^out \ Ci, we are done. Suppose that y ∉ Pi^out \ Ci. Since we have that y ∈ Pi^out, we must also have that y ∈ Ci. But Ci ⊆ CL, and we began with y ∈ C^out \ CL, which is a contradiction. Therefore,
C^out \ CL ⊆ ∪i∈[L] (Pi^out \ Ci).   (31)
Figure 2: Constructing C^out. For each set Si, a convex polytope approximating Si from the inside is found (Pi^S, see row (a)), and a convex polytope approximating C ∩ Si from the outside is found (Pi^out, see row (b)). For each i, the set Pi^out \ (∪_{j=1}^{i−1} Pj^S) is constructed (see row (c)), and the union of these sets finishes the construction of C^out (see row (d)).
Thus, we have that
f0(C^out \ CL) ≤ Σ_{i∈[L]} f0(Pi^out \ Ci)   (by (31))
≤ Σ_{i∈[L]} vol(Pi^out \ Ci) · Mf0 e^{−(i−1)}
≤ Σ_{i∈[L]} (δ/10) · vol(Si) · Mf0 e^{−(i−1)}   (by (27))
≤ (δ/10) Σ_{i∈[L]} vol(Lf0(Mf0 e^{−i})) · Mf0 e^{−(i−1)}
≤ (δ/10) ∫_0^{Mf0} vol(Lf0(y)) dy < δ/2.   (32)
We define the set A, elements of which are defined by the following process: starting with 2L convex polytopes, each with at most H facets, all combinations of intersections and unions of these convex polytopes are elements of A. Then for any convex set C with C^in, C^out as defined above, we have that C^out, C^in ∈ A. From Lemma 5.1, we have that if V is the VC dimension of A, then
V / ln(V) = O(dLH).
Using Theorem 2.1, we have for some universal constant α that
E[|f0(C^in) − fn(C^in)|] ≤ E[||f0 − fn||A] ≤ √(αV/n).   (33)
The following claim is obtained via a simple calculation (see Appendix A):
Claim 5.2. For n ≥ N1 we have that √(αV/n) ≤ δ/10.
Let E2 be the event that |f0(C^in) − fn(C^in)| ≤ δ/5. Then by (33), Claim 5.2, and Theorem 2.2, we have that
Prfn [E2] = 1 − Prfn [|f0(C^in) − fn(C^in)| > δ/5]
≥ 1 − Prfn [|f0(C^in) − fn(C^in)| − E[|f0(C^in) − fn(C^in)|] > δ/10]
≥ 1 − e^{−2n(δ/10)^2}
≥ 1 − τ/10.   (34)
Let E3 be the event that |f0(C^out) − fn(C^out)| ≤ δ/5. A nearly identical argument shows that
Prfn [E3] = 1 − Prfn [|f0(C^out) − fn(C^out)| > δ/5] ≥ 1 − τ/10.   (35)
Claim 5.3. We have that Prfn [|fn (CL ) − f0 (CL )| ≤ 7δ/10] ≥ 1 − τ /5.
This claim follows from (30), (32), (34), and (35). The full proof can be found in Appendix A.
Combining (25) and Claim 5.3, we get
Prfn [|fn(C) − f0(C)| ≤ δ] ≥ Prfn [(|fn(C \ CL) − f0(C \ CL)| ≤ δ/5) ∧ (|fn(CL) − f0(CL)| ≤ 7δ/10)]
≥ 1 − τ/10 − τ/5
= 1 − 3τ/10,
which concludes the proof.
6 Conclusions
In this paper, we gave the first sample complexity upper bound for the MLE of multivariate log-concave
densities on Rd , for any d ≥ 4. Our upper bound agrees with the previously known lower bound up to a
multiplicative factor of Õd (ǫ−1 ).
A number of open problems remain: What is the optimal sample complexity of the multivariate log-concave MLE? In particular, is the log-concave MLE sample-optimal for d ≥ 4? Does the multivariate log-concave MLE have similar adaptivity properties as in one dimension? And is there a polynomial-time algorithm to compute it?
References
[ADH+ 15] J. Acharya, I. Diakonikolas, C. Hegde, J. Li, and L. Schmidt. Fast and near-optimal algorithms
for approximating distributions by histograms. In Proceedings of the 34th ACM Symposium on
Principles of Database Systems, PODS 2015, pages 249–263, 2015.
[ADLS17] J. Acharya, I. Diakonikolas, J. Li, and L. Schmidt. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, pages 1278–1289, 2017. Available at https://arxiv.org/abs/1506.00671.
[An95]
M. Y. An. Log-concave probability distributions: Theory and statistical testing. Technical
Report Economics Working Paper Archive at WUSTL, Washington University at St. Louis,
1995.
[ASE92]
N. Alon, J. Spencer, and P. Erdos. The Probabilistic Method. Wiley-Interscience, New York,
1992.
[BB05]
M. Bagnoli and T. Bergstrom. Log-concave probability and its applications. Economic Theory,
26(2):pp. 445–469, 2005.
[BBBB72] R.E. Barlow, D.J. Bartholomew, J.M. Bremner, and H.D. Brunk. Statistical Inference under
Order Restrictions. Wiley, New York, 1972.
[BD18]
F. Balabdaoui and C. R. Doss. Inference for a two-component mixture of symmetric distributions under log-concavity. Bernoulli, 24(2):1053–1071, 05 2018.
[Bir87a]
L. Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk. Annals
of Statistics, 15(3):995–1012, 1987.
[Bir87b]
L. Birgé. On the risk of histograms for estimating decreasing densities. Annals of Statistics,
15(3):1013–1022, 1987.
[Bru58]
H. D. Brunk. On the estimation of parameters restricted by inequalities. The Annals of Mathematical Statistics, 29(2):pp. 437–454, 1958.
[BRW09]
F. Balabdaoui, K. Rufibach, and J. A. Wellner. Limit distribution theory for maximum likelihood estimation of a log-concave density. The Annals of Statistics, 37(3):pp. 1299–1331,
2009.
[BW07]
F. Balabdaoui and J. A. Wellner. Estimation of a k-monotone density: Limit distribution theory
and the spline connection. The Annals of Statistics, 35(6):pp. 2536–2564, 2007.
[BW10]
F. Balabdaoui and J. A. Wellner. Estimation of a k-monotone density: characterizations, consistency and minimax lower bounds. Statistica Neerlandica, 64(1):45–70, 2010.
[CDGR16] C. L. Canonne, I. Diakonikolas, T. Gouleakis, and R. Rubinfeld. Testing shape restrictions of
discrete distributions. In STACS, pages 25:1–25:14, 2016.
[CDSS13] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Learning mixtures of structured distributions
over discrete domains. In SODA, pages 1380–1394, 2013.
[CDSS14a] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Efficient density estimation via piecewise
polynomial approximation. In STOC, pages 604–613, 2014.
[CDSS14b] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Near-optimal density estimation in near-linear time using variable-width histograms. In NIPS, pages 1844–1852, 2014.
[CS13]
Y. Chen and R. J. Samworth. Smoothed log-concave maximum likelihood estimation with
applications. Statist. Sinica, 23:1373–1398, 2013.
[CSS10]
M. Cule, R. Samworth, and M. Stewart. Maximum likelihood estimation of a multidimensional log-concave density. Journal of the Royal Statistical Society: Series B, 72:545–
607, 2010.
[CT04]
K.S. Chan and H. Tong. Testing for multimodality with dependent data. Biometrika, 91(1):113–
123, 2004.
[DDKT16] C. Daskalakis, A. De, G. Kamath, and C. Tzamos. A size-free CLT for poisson multinomials
and its applications. In Proceedings of the 48th Annual ACM Symposium on the Theory of
Computing, STOC ’16, 2016.
[DDO+ 13] C. Daskalakis, I. Diakonikolas, R. O’Donnell, R.A. Servedio, and L. Tan. Learning Sums of
Independent Integer Random Variables. In FOCS, pages 217–226, 2013.
[DDS12a]
C. Daskalakis, I. Diakonikolas, and R.A. Servedio. Learning k-modal distributions via testing.
In SODA, pages 1371–1385, 2012.
[DDS12b]
C. Daskalakis, I. Diakonikolas, and R.A. Servedio. Learning Poisson Binomial Distributions.
In STOC, pages 709–728, 2012.
[DKS16a]
I. Diakonikolas, D. M. Kane, and A. Stewart. Efficient Robust Proper Learning of Log-concave
Distributions. Arxiv report, 2016.
[DKS16b]
I. Diakonikolas, D. M. Kane, and A. Stewart. The fourier transform of poisson multinomial
distributions and its algorithmic applications. In Proceedings of STOC’16, 2016.
[DKS16c]
I. Diakonikolas, D. M. Kane, and A. Stewart. Optimal learning via the fourier transform
for sums of independent integer random variables. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, pages 831–849, 2016. Full version available at
https://arxiv.org/abs/1505.00662.
[DKS16d]
I. Diakonikolas, D. M. Kane, and A. Stewart. Properly learning poisson binomial distributions
in almost polynomial time. In Proceedings of the 29th Conference on Learning Theory, COLT
2016, pages 850–878, 2016. Full version available at https://arxiv.org/abs/1511.04066.
[DKS17]
I. Diakonikolas, D. M. Kane, and A. Stewart. Learning multivariate log-concave distributions.
In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 711–727, 2017.
[DL01]
L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer, 2001.
[DR09]
L. Dumbgen and K. Rufibach. Maximum likelihood estimation of a log-concave density and its
distribution function: Basic properties and uniform consistency. Bernoulli, 15(1):40–68, 2009.
[DW16]
C. R. Doss and J. A. Wellner. Global rates of convergence of the mles of log-concave and
s-concave densities. Ann. Statist., 44(3):954–981, 06 2016.
[Fou97]
A.-L. Fougères. Estimation de densités unimodales. Canadian Journal of Statistics, 25:375–
387, 1997.
[GJ14]
P. Groeneboom and G. Jongbloed. Nonparametric Estimation under Shape Constraints: Estimators, Algorithms and Asymptotics. Cambridge University Press, 2014.
[GMR95]
Y. Gordon, M. Meyer, and S. Reisner. Constructing a polytope to approximate a convex body.
Geometriae Dedicata, 57(2):217–222, 1995.
[Gre56]
U. Grenander. On the theory of mortality measurement. Skand. Aktuarietidskr., 39:125–153,
1956.
[Gro85]
P. Groeneboom. Estimating a monotone density. In Proc. of the Berkeley Conference in Honor
of Jerzy Neyman and Jack Kiefer, pages 539–555, 1985.
[GW09]
F. Gao and J. A. Wellner. On the rate of convergence of the maximum likelihood estimator of
a k-monotone density. Science in China Series A: Mathematics, 52:1525–1538, 2009.
[HP76]
D. L. Hanson and G. Pledger. Consistency in concave regression. The Annals of Statistics,
4(6):pp. 1038–1050, 1976.
[HW16]
Q. Han and J. A. Wellner. Approximation and estimation of s-concave densities via renyi
divergences. Ann. Statist., 44(3):1332–1359, 06 2016.
[JW09]
H. K. Jankowski and J. A. Wellner. Estimation of a discrete monotone density. Electronic
Journal of Statistics, 3:1567–1605, 2009.
[KGS16]
A. Kim, A. Guntuboyina, and R. J. Samworth. Adaptation in log-concave density estimation.
ArXiv e-prints, 2016. Available at http://arxiv.org/abs/1609.00861.
[KM10]
R. Koenker and I. Mizera. Quasi-concave density estimation. Ann. Statist., 38(5):2998–3027,
2010.
[KS16]
A. K. H. Kim and R. J. Samworth. Global rates of convergence in log-concave density estimation. Ann. Statist., 44(6):2756–2779, 12 2016. Available at http://arxiv.org/abs/1404.2298.
[LV07]
L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms.
Random Structures and Algorithms, 30(3):307–358, 2007.
[Rao69]
B.L.S. Prakasa Rao. Estimation of a unimodal density. Sankhya Ser. A, 31:23–36, 1969.
[RSU17]
E. Robeva, B. Sturmfels, and C. Uhler. Geometry of Log-Concave Density Estimation. ArXiv
e-prints, 2017. Available at https://arxiv.org/abs/1704.01910.
[Sam17]
R. J. Samworth. Recent progress in log-concave density estimation. ArXiv e-prints, 2017.
[Sta89]
R. P. Stanley. Log-concave and unimodal sequences in algebra, combinatorics, and geometry.
Annals of the New York Academy of Sciences, 576(1):500–535, 1989.
[SW14]
A. Saumard and J. A. Wellner. Log-concavity and strong log-concavity: A review. Statist.
Surv., 8:45–114, 2014.
[VV16]
G. Valiant and P. Valiant. Instance optimal learning of discrete distributions. In Proceedings of
the Forty-eighth Annual ACM Symposium on Theory of Computing, STOC ’16, pages 142–155,
2016.
[Wal09]
G. Walther. Inference and modeling with log-concave distributions. Stat. Science, 24:319–327,
2009.
[Weg70]
E.J. Wegman. Maximum likelihood estimation of a unimodal density. I. and II. Ann. Math.
Statist., 41:457–471, 2169–2174, 1970.
[Wel15]
J. A. Wellner. Nonparametric estimation of s-concave and log-concave densities: an alternative to maximum likelihood. Talk given at European Meeting of Statisticians, Amsterdam,
2015. Available at https://www.stat.washington.edu/jaw/RESEARCH/TALKS/EMS-2015.1rev1.pdf.
A Deferred Proofs
A.1 Proof of Lemma 3.2
Let R = Lf(Mf/e). Then using the fact that if y ≤ Mf/e then R ⊆ Lf(y), we have that
1 = ∫_{R+} vol(Lf(y)) dy ≥ ∫_{0≤y≤Mf/e} vol(Lf(y)) dy ≥ ∫_{0≤y≤Mf/e} vol(R) dy = (Mf/e) · vol(R).   (36)
Suppose that f(x) ≥ Mf e^{−z}, for some x ∈ Rd. By the definition of log-concavity we have f(x/z) ≥ f(0)^{(z−1)/z} f(x)^{1/z}. We may assume w.l.o.g. that f attains its maximum at 0, and thus f(0) = Mf. By the assumption we get f(x/z) ≥ Mf^{(z−1)/z} (Mf/e^z)^{1/z} = Mf^{(z−1)/z} Mf^{1/z}/e = Mf/e. Thus x/z ∈ R, and so x ∈ zR. Therefore Lf(Mf e^{−z}) ⊆ zR. Thus by (36) we get
vol(Lf(Mf e^{−z})) ≤ vol(zR) ≤ O(z^d) · vol(R) = O(z^d/Mf),   (37)
which proves the first part of the assertion.
It remains to prove the second part. We have
PrX∼f [f(X) ≤ Mf e^{−z}] ≤ ∫_0^{Mf e^{−z}} vol(Lf(y)) dy
= ∫_z^∞ vol(Lf(Mf e^{−x})) Mf e^{−x} dx   (setting y = Mf e^{−x})
≤ ∫_z^∞ O(x^d/Mf) Mf e^{−x} dx   (by (37))
= ∫_z^∞ O(x^d e^{−x}) dx
≤ ∫_z^∞ O(d)^d e^{−x/2} dx   (since e^{x/2} ≥ (x/2)^d/d!)
= O(d)^d e^{−z/2},
which concludes the proof.
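As a numeric sanity check (an illustration assuming f is the one-dimensional standard normal density, not part of the proof), both parts of the lemma can be verified in closed form: here Lf(Mf e^{−z}) = [−√(2z), √(2z)] and R = Lf(Mf/e) = [−√2, √2], so the containment Lf(Mf e^{−z}) ⊆ zR amounts to √(2z) ≤ z√2, and the tail probability is PrX∼f [f(X) ≤ Mf e^{−z}] = erfc(√z) ≤ e^{−z/2}.

```python
import math

# 1-d Gaussian instance: f(x) = phi(x), M_f = phi(0).
# L_f(M_f e^{-z}) = [-sqrt(2z), sqrt(2z)] and R = [-sqrt(2), sqrt(2)].
for z in range(1, 41):
    # Containment L_f(M_f e^{-z}) ⊆ zR from the first part of the proof.
    assert math.sqrt(2 * z) <= z * math.sqrt(2)
    # Tail bound: Pr[f(X) <= M_f e^{-z}] = erfc(sqrt(z)) <= e^{-z/2}.
    assert math.erfc(math.sqrt(z)) <= math.exp(-z / 2)
```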
A.2 Proof of Lemma 3.6
We begin with a few common definitions and observations. If X is a random variable defined on a probability
space (Ω, Σ, P ), then the expected value E[X] of X is defined as the Lebesgue integral
E[X] = ∫_Ω X(ω) dP(ω).
Next, we define two functions
X+ (ω) = max(X(ω), 0)
and
X− (ω) = − min(X(ω), 0).
We observe that these functions are both measurable (and therefore also random variables), and that E[X] =
E[X+] − E[X−]. Finally, we observe that if X : Ω → R≥0 ∪ {∞} is a non-negative random variable, then
E[X] = ∫_0^∞ Pr[X > x] dx.
Similarly, if X : Ω → R≤0 ∪ {−∞} is a non-positive random variable, then
E[X] = −∫_{−∞}^0 Pr[X < x] dx.
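The layer-cake identity E[X] = ∫_0^∞ Pr[X > x] dx can be checked exactly on a small discrete example (a stand-in distribution chosen purely for illustration), where the integrand Pr[X > x] is a step function and the integral is a finite sum of rectangle areas:

```python
from fractions import Fraction

# X uniform on {0, 1, 2, 3}: E[X] = 3/2.
vals = [0, 1, 2, 3]
mean = Fraction(sum(vals), len(vals))

# Pr[X > x] is constant on each interval [x, x+1), x = 0, 1, 2,
# so the integral is the sum of these step heights.
integral = sum(Fraction(sum(1 for v in vals if v > x), len(vals))
               for x in range(max(vals)))

assert integral == mean == Fraction(3, 2)
```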
Applying the definitions and observations of the previous paragraph, we have the following derivation:
EY∼g[φ(Y)] − EY∼h[φ(Y)]
= (EY∼g[φ(Y)+] − EY∼g[φ(Y)−]) − (EY∼h[φ(Y)+] − EY∼h[φ(Y)−])
= (EY∼g[φ(Y)+] + EY∼g[−φ(Y)−]) − (EY∼h[φ(Y)+] + EY∼h[−φ(Y)−])
= (∫_0^∞ PrY∼g[φ(Y)+ > x] dx − ∫_{−∞}^0 PrY∼g[−φ(Y)− < x] dx)
  − (∫_0^∞ PrY∼h[φ(Y)+ > x] dx − ∫_{−∞}^0 PrY∼h[−φ(Y)− < x] dx)
= (∫_0^∞ PrY∼g[φ(Y) > x] dx − ∫_{−∞}^0 PrY∼g[φ(Y) < x] dx)
  − (∫_0^∞ PrY∼h[φ(Y) > x] dx − ∫_{−∞}^0 PrY∼h[φ(Y) < x] dx)
= ∫_0^∞ (PrY∼g[φ(Y) > x] − PrY∼h[φ(Y) > x]) dx − ∫_{−∞}^0 (PrY∼g[φ(Y) < x] − PrY∼h[φ(Y) < x]) dx
= ∫_0^∞ ((1 − PrY∼g[φ(Y) < x]) − (1 − PrY∼h[φ(Y) < x])) dx − ∫_{−∞}^0 (PrY∼g[φ(Y) < x] − PrY∼h[φ(Y) < x]) dx
= ∫_0^∞ (PrY∼h[φ(Y) < x] − PrY∼g[φ(Y) < x]) dx − ∫_{−∞}^0 (PrY∼g[φ(Y) < x] − PrY∼h[φ(Y) < x]) dx
≤ ∫_0^∞ |PrY∼g[φ(Y) < x] − PrY∼h[φ(Y) < x]| dx + ∫_{−∞}^0 |PrY∼g[φ(Y) < x] − PrY∼h[φ(Y) < x]| dx
= ∫_{−∞}^∞ |PrY∼g[φ(Y) < x] − PrY∼h[φ(Y) < x]| dx.
A symmetric argument shows that
EY∼h[φ(Y)] − EY∼g[φ(Y)] ≤ ∫_{−∞}^∞ |PrY∼h[φ(Y) < x] − PrY∼g[φ(Y) < x]| dx,
concluding the proof.
A.3 Proof of Claim 3.13
Recall that
vol(A) · Mf0 ≤ (Mf0/Mf) · O((2 + 2 ln Mf − 2 ln pmin + 2ǫ)^d)
< (Mf0/Mf) · O((2 ln Mf − 2 ln Mf0 + 3 ln(100n^4/τ^2))^d).
We search for Mf such that vol(A) · Mf0 ≤ 1/3. It is sufficient for Mf to satisfy, for some constant c > 1,
(Mf0/Mf) · c(2 ln Mf − 2 ln Mf0 + 3 ln n^6)^d ≤ 1/3
ln(Mf0/Mf) + ln(c(2 ln(Mf/Mf0) + 3 ln n^6)^d) ≤ ln(1/3)
ln(Mf0/Mf) + ln c + ln((2 ln(Mf/Mf0) + 3 ln n^6)^d) ≤ ln(1/3)
ln((2 ln(Mf/Mf0) + 3 ln n^6)^d) + ln(3c) ≤ ln(Mf/Mf0)
d ln(2 ln(Mf/Mf0) + 3 ln(100n^4/τ^2)) + ln(c) ≤ ln(Mf/Mf0).   (38)
If we have Mf such that ln(Mf /Mf0 ) ≥ 3 ln(n4 100/τ 2 ), and a sufficiently large constant is chosen for N1
so that ln(c) ≤ ln(n4 100/τ 2 ), then (38) becomes
d ln(6 ln(Mf/Mf0)) ≤ 2 ln(Mf/Mf0).   (39)
The next inequality is equivalent to (39):
(6 ln(Mf/Mf0))^{d/2} ≤ Mf/Mf0.
We note that the derivative of (6 ln x)^{d/2} is
6^{d/2} d (ln x)^{d/2−1} / (2x).
We also note that for x = (36d)^{d/2+1} (ln(36d))^{d/2+1} we have that
6^{d/2} d (ln x)^{d/2−1} / (2x) ≤ 1
and
(6 ln x)^{d/2} = (6(d/2 + 1) ln(36d ln(36d)))^{d/2} ≤ 6^{d/2} · d^{d/2} · (2 ln(36d))^{d/2} ≤ x.
Therefore, assuming sufficiently large constants are chosen in the definition of N1, if
ln(Mf/Mf0) ≥ 3 ln(100n^4/τ^2),
then vol(A) · Mf0 ≤ 1/3.
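The specific choice x = (36d)^{d/2+1}(ln(36d))^{d/2+1} made above can also be checked numerically; a minimal sketch verifying that it satisfies (6 ln x)^{d/2} ≤ x, the inequality equivalent to (39), over a range of dimensions (the range itself is an arbitrary illustrative choice):

```python
import math

# Verify (6 ln x)^{d/2} <= x at x = (36 d ln(36 d))^{d/2 + 1}.
for d in range(1, 21):
    x = (36 * d * math.log(36 * d)) ** (d / 2 + 1)
    assert (6 * math.log(x)) ** (d / 2) <= x
```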
A.4 Proof of Claim 4.3
By Lemma 4.1 we have that the VC dimension of A is V ≤ 2(d + 1)H ln((d + 1)H), and so V ≤ (10κ)^{(d+1)/2} d^{(d+5)/2} ((ln(100n^4/τ^2))^d / δ)^{(d+1)/2}. Noting that E[|f0(T) − fn(T)|] ≤ E[||f0 − fn||A], by Theorem 2.1 we get that
E[|f0(T) − fn(T)|] ≤ √(O(V)/n) ≤ √(O((10κ)^{(d+1)/2} d^{(d+5)/2} ((ln(100n^4/τ^2))^d / δ)^{(d+1)/2}) / n).
For the next part we want that E[|f0(T) − fn(T)|] ≤ δ/10. This holds when
n = Ω( ((d/ǫ)(ln(100n^4/τ^2))^{d+1})^{(d+5)/2} ).
If n ≥ b c^{d+1} ((d^{2d+3}/ǫ)(ln(d^{d+1}/(ǫτ)))^{d+1})^{(d+5)/2} for some constants b > 1, c ≥ 100 ln c, then we have
(d/ǫ)(ln(100n^4/τ^2))^{d+1} ≤ (d^{2d+3}/ǫ)(100 ln c)^{d+1} (ln(d^{d+1}/(ǫτ)))^{d+1}
≤ c^{d+1} (d^{2d+3}/ǫ)(ln(d^{d+1}/(ǫτ)))^{d+1},
and therefore n = Ω( ((d/ǫ)(ln(100n^4/τ^2))^{d+1})^{(d+5)/2} ) as desired. Therefore, for n ≥ N2 we have
E[|f0(T) − fn(T)|] ≤ δ/10.   (40)
A.5 Proof of Lemma 5.1
Consider an arbitrary set T of t points in Rd . We wish to bound the number of possible distinct sets that can
be obtained by the intersection of T with a set in AH,L . We note that AH,L can also be constructed in the
following manner: Take an arrangement consisting of at most H ·L hyperplanes. This arrangement partitions
Rd into a set of components. Then, the union of subsets of these components are elements of AH,L . Any
halfspace can be perturbed, without changing its intersection with T , so that its boundary intersects d′ + 1
points in T , where d′ ≤ d is the dimension of the affine subspace spanned by T . Any such subset uniquely
determines the intersection of the halfspace with T . Therefore, the number of possible intersections with a
set of size t is at most O(t)d . It follows then that the number of possible intersections of any A ∈ AH,L
and any set of size t is at most (O(t)d )LH ≤ O(t)dLH . If A has VC dimension t, then is must be that
O(t)dLH ≥ 2t , and therefore t/ log(t) = O(dLH).
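The last step inverts a relation of the form O(t)^{dLH} ≥ 2^t to conclude t = O(dLH · log(dLH)). A small numeric sketch of this inversion (with the constant hidden in O(t) set to 10, an arbitrary illustrative choice): writing k = dLH, it finds the largest t with (10t)^k ≥ 2^t and confirms that t/log2(t) stays within a small constant factor of k.

```python
from math import log2

def max_shatterable(k: int) -> int:
    """Largest t with (10 t)^k >= 2^t (the shattering condition, k = dLH)."""
    t = 2
    while (10 * t) ** k >= 2 ** t:
        t += 1
    return t - 1

for k in range(1, 16):
    t = max_shatterable(k)
    # t / log t = O(k); the factor 3 is a generous empirical constant.
    assert t / log2(t) <= 3 * k
```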
A.6 Proof of Claim 5.2
Recalling that L = ln(100n^4/τ^2) and H = (10κd/δ)^{(d−1)/2}, we have that
V / ln(V) = O(d · ln(100n^4/τ^2) · (10κd/δ)^{(d−1)/2}) = O((10κ)^{(d−1)/2} d^{(d+1)/2} ln(100n^4/τ^2) / δ^{(d−1)/2}).   (41)
We note that
ln((10κ)^{(d−1)/2} d^{(d+3)/2} (ln(100n^4/τ^2))^2 / δ^{(d−1)/2}) ≤ ((d − 1)/2) · ln((10κ) d^3 (ln(100n^4/τ^2))^6 / δ)
≤ d ln((10κ) d^3 (ln(100n^4/τ^2))^6 / δ)
≤ cd ln(ln(100n^4/τ^2))
for some sufficiently large constant c. Therefore, letting
V = O((10κ)^{(d−1)/2} d^{(d+3)/2} (ln(100n^4/τ^2))^2 / δ^{(d−1)/2})
satisfies (41). Therefore, we have that
√(αV/n) = √(α · O((10κ)^{(d−1)/2} d^{(d+3)/2} (ln(100n^4/τ^2))^2 / δ^{(d−1)/2}) / n),
and thus when
n = Ω((10κ)^{(d−1)/2} d^{(d+3)/2} (ln(100n^4/τ^2))^{(d+7)/2} / ǫ^{(d+3)/2})   (42)
we have that √(αV/n) ≤ δ/10. To simplify (42), we note that
d^{(d+3)/2} (ln(100n^4/τ^2))^{(d+7)/2} / ǫ^{(d+3)/2} ≤ ((d/ǫ)(ln(100n^4/τ^2))^2)^{(d+3)/2}.
Thus, if we let n = (c(d^2/ǫ)(ln(d/(ǫτ)))^3)^{(d+3)/2} for some large constant c, then we have that
ln(100n^4/τ^2) = 2(d + 3) ln(c(d^2/ǫ)(ln(d/(ǫτ)))^3) + ln(100/τ^2)
≤ c′ d ln(d/(ǫτ))
for some large constant c′. Thus, assuming a sufficiently large constant is chosen, for n ≥ N1 we have that (42) holds, and therefore
√(αV/n) ≤ δ/10.   (43)
A.7 Proof of Claim 5.3
For any choice of the samples X1, . . . , Xn, we have
fn(CL) ≥ fn(C^in)   (since C^in ⊆ CL)
≥ f0(C^in) − |f0(C^in) − fn(C^in)|
= f0(CL) − f0(CL \ C^in) − |f0(C^in) − fn(C^in)|
≥ f0(CL) − δ/2 − |f0(C^in) − fn(C^in)|.   (by (30))   (44)
Thus
Prfn [fn(CL) ≥ f0(CL) − 7δ/10] ≥ Prfn [|f0(C^in) − fn(C^in)| ≤ δ/5]   (by (44))
≥ Prfn [|f0(C^in) − fn(C^in)| ≤ δ/5 | E2] · Prfn [E2]
≥ 1 − τ/10.   (by (34))   (45)
Similarly, for any choice of the samples X1, . . . , Xn, we have
fn(CL) ≤ fn(C^out)   (since CL ⊆ C^out)
≤ f0(C^out) + |f0(C^out) − fn(C^out)|
= f0(CL) + f0(C^out \ CL) + |f0(C^out) − fn(C^out)|
≤ f0(CL) + δ/2 + |f0(C^out) − fn(C^out)|.   (by (32))   (46)
Thus
Prfn [fn(CL) ≤ f0(CL) + 7δ/10] ≥ Prfn [|f0(C^out) − fn(C^out)| ≤ δ/5]   (by (46))
≥ Prfn [|f0(C^out) − fn(C^out)| ≤ δ/5 | E3] · Prfn [E3]
≥ 1 − τ/10.   (by (35))   (47)
By (45) and (47) and the union bound, we obtain
Prfn [|fn (CL ) − f0 (CL )| ≤ 7δ/10] ≥ 1 − τ /5,
concluding the proof.