arXiv:1802.10575v1 [math.ST] 28 Feb 2018

Near-Optimal Sample Complexity Bounds for Maximum Likelihood Estimation of Multivariate Log-concave Densities

Timothy Carpenter (Ohio State University, carpenter.454@osu.edu)
Ilias Diakonikolas* (University of Southern California, diakonik@usc.edu)
Anastasios Sidiropoulos† (University of Illinois at Chicago, sidiropo@uic.edu)
Alistair Stewart (University of Southern California, stewart.al@gmail.com)

December 2, 2021

Abstract

We study the problem of learning multivariate log-concave densities with respect to a global loss function. We obtain the first upper bound on the sample complexity of the maximum likelihood estimator (MLE) for a log-concave density on $\mathbb{R}^d$, for all $d \geq 4$. Prior to this work, no finite sample upper bound was known for this estimator in more than 3 dimensions. In more detail, we prove that for any $d \geq 1$ and $\epsilon > 0$, given $\tilde{O}_d\big((1/\epsilon)^{(d+3)/2}\big)$ samples drawn from an unknown log-concave density $f_0$ on $\mathbb{R}^d$, the MLE outputs a hypothesis $h$ that with high probability is $\epsilon$-close to $f_0$ in squared Hellinger loss. A sample complexity lower bound of $\Omega_d\big((1/\epsilon)^{(d+1)/2}\big)$ was previously known for any learning algorithm that achieves this guarantee. We thus establish that the sample complexity of the log-concave MLE is near-optimal, up to an $\tilde{O}(1/\epsilon)$ factor.

* Supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.
† Supported by NSF Award CCF-1453472 (CAREER) and NSF grant CCF-1423230.

1 Introduction

1.1 Background

The general task of estimating a probability distribution under certain qualitative assumptions about the shape of its probability density function has a long history in statistics, dating back to the pioneering work of Grenander [Gre56], who analyzed the maximum likelihood estimator of a univariate monotone density. Since then, shape constrained density estimation has been a very active research area with a rich literature in mathematical statistics and, more recently, in computer science. A wide range of shape constraints have been studied, including unimodality, convexity and concavity, k-modality, log-concavity, and k-monotonicity. The reader is referred to [BBBB72] for a summary of the early work and to [GJ14] for a recent book on the subject. (See Section 1.3 for a succinct summary of prior work.)

The majority of the literature has studied the univariate (one-dimensional) setting, which is by now fairly well understood for a range of distributions. On the other hand, the multivariate setting, and specifically the regime of fixed dimension, is significantly more challenging and poorly understood for many natural distribution families.

In this work, we focus on the family of multivariate log-concave distributions. A distribution on $\mathbb{R}^d$ is log-concave if the logarithm of its probability density function is concave (see Definition 1.1). Log-concave distributions constitute a rich non-parametric family encompassing a range of fundamental distributions, including the uniform, normal, exponential, logistic, extreme value, Laplace, Weibull, Gamma, Chi, Chi-squared, and Beta distributions (see, e.g., [BB05]). Due to their fundamental nature and appealing properties, log-concave distributions have been studied in a range of fields including economics [An95], probability theory [SW14], computer science [LV07], and geometry [Sta89].
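As a purely illustrative aside (not part of the original paper), log-concavity of several of the families listed above can be checked numerically by verifying that the discrete second differences of the log-density are non-positive on a grid; a heavy-tailed density such as the Cauchy fails this test. The sketch below assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy import stats

def looks_log_concave(logpdf, grid):
    """Check concavity of log f via discrete second differences on a grid."""
    vals = logpdf(grid)
    second_diff = vals[:-2] - 2 * vals[1:-1] + vals[2:]
    return bool(np.all(second_diff <= 1e-9))

grid = np.linspace(-8, 8, 2001)
candidates = {
    "normal": stats.norm().logpdf,
    "logistic": stats.logistic().logpdf,
    "Laplace": stats.laplace().logpdf,
    "Cauchy (not log-concave)": stats.cauchy().logpdf,
}
for name, lp in candidates.items():
    print(f"{name:>25s}: {looks_log_concave(lp, grid)}")
```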
The problem of density estimation for log-concave distributions is of central importance in the area of non-parametric shape constrained estimation [Wal09, SW14, Sam17] and has received significant attention during the past decade in statistics [CSS10, DR09, DW16, CS13, KS16, BD18, HW16] and theoretical computer science [CDSS13, CDSS14a, ADLS17, CDGR16, DKS16a, DKS17].

1.2 Our Results and Comparison to Prior Work

In this work, we analyze the global convergence rate of the maximum likelihood estimator (MLE) of a multivariate log-concave density. Formally, we study the following fundamental question:

How many samples are information-theoretically sufficient so that the MLE of an arbitrary log-concave density on $\mathbb{R}^d$ learns the underlying density within squared Hellinger loss $\epsilon$?

Perhaps surprisingly, despite significant effort within the statistics community on analyzing the log-concave MLE, our understanding of its finite sample performance in constant dimension has remained poor. The only result prior to this work that addressed the sample complexity of the MLE in more than one dimension is by Kim and Samworth [KS16]. Specifically, [KS16] obtained the following results: (1) a sample complexity lower bound of $\Omega_d\big((1/\epsilon)^{(d+1)/2}\big)$ for $d \in \mathbb{Z}_+$ that applies to any estimator, and (2) a sample complexity upper bound for the log-concave MLE for $d \leq 3$. Prior to our work, no finite sample upper bound was known for the log-concave MLE even for $d = 4$.

In recent related work, Diakonikolas, Kane, and Stewart [DKS17] established a finite sample complexity upper bound for learning multivariate log-concave densities under global loss functions. Specifically, the estimator analyzed in [DKS17] uses $\tilde{O}_d\big((1/\epsilon)^{(d+5)/2}\big)$ samples (the $\tilde{O}(\cdot)$ notation hides logarithmic factors in its argument) and learns a log-concave density on $\mathbb{R}^d$ within squared Hellinger loss $\epsilon$, with high probability. We remark that the upper bound of [DKS17] was obtained by analyzing an estimator that is substantially different from the log-concave MLE. Moreover, the analysis in [DKS17] has no implications on the performance of the MLE. Interestingly, some of the technical tools employed in [DKS17] will be useful in our current setting.

Due to the fundamental nature of the MLE, understanding its performance merits investigation in its own right. In particular, the log-concave MLE has an intriguing geometric structure that is a topic of current investigation [CSS10, RSU17]. The output of the log-concave MLE satisfies several desirable properties that may not be automatically satisfied by surrogate estimators. These include the log-concavity of the hypothesis, the paradigm of log-concave projections and their continuity in Wasserstein distance, affine equivariance, one-dimensional characterization, and adaptation (see, e.g., [Sam17]). An additional motivation comes from a recent conjecture (see, e.g., [Wel15]) that for 4-dimensional log-concave densities the MLE may have sub-optimal sample complexity. These facts provide strong motivation for characterizing the sample complexity of the log-concave MLE in any dimension.

To formally state our results, we will need some terminology. The squared Hellinger distance between two density functions $f, g : \mathbb{R}^d \to \mathbb{R}_+$ is defined as $h^2(f, g) = (1/2) \cdot \int_{\mathbb{R}^d} \big(\sqrt{f(x)} - \sqrt{g(x)}\big)^2 \, dx$.
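To make the loss function concrete, here is a small numerical sketch (not from the paper) that evaluates the squared Hellinger distance just defined by one-dimensional quadrature and compares it against the known closed form for two unit-variance Gaussians; NumPy and SciPy are assumed.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def squared_hellinger(pdf_f, pdf_g, lo=-np.inf, hi=np.inf):
    """h^2(f, g) = (1/2) * integral of (sqrt(f) - sqrt(g))^2, by quadrature."""
    integrand = lambda x: (np.sqrt(pdf_f(x)) - np.sqrt(pdf_g(x))) ** 2
    val, _ = quad(integrand, lo, hi)
    return 0.5 * val

f = stats.norm(loc=0.0, scale=1.0).pdf
g = stats.norm(loc=0.5, scale=1.0).pdf
print("h^2(N(0,1), N(0.5,1)) by quadrature:", squared_hellinger(f, g))
# Closed form for two unit-variance Gaussians: 1 - exp(-(mu1 - mu2)^2 / 8).
print("closed form:                        ", 1 - np.exp(-0.25 / 8))
```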
We now define our two main objects of study:

Definition 1.1 (Log-concave Density). A probability density function $f : \mathbb{R}^d \to \mathbb{R}_+$, $d \in \mathbb{Z}_+$, is called log-concave if there exists an upper semi-continuous concave function $\phi : \mathbb{R}^d \to [-\infty, \infty)$ such that $f(x) = e^{\phi(x)}$ for all $x \in \mathbb{R}^d$. We will denote by $\mathcal{F}_d$ the set of upper semi-continuous, log-concave densities with respect to the Lebesgue measure on $\mathbb{R}^d$.

Definition 1.2 (Log-concave MLE). Let $f_0 \in \mathcal{F}_d$ and $X_1, \ldots, X_n$ be iid samples from $f_0$. The maximum likelihood estimator, $\hat{f}_n$, is the density $\hat{f}_n \in \mathcal{F}_d$ which maximizes $\frac{1}{n} \sum_{i=1}^n \log(f(X_i))$ over all $f \in \mathcal{F}_d$.

We can now state our main result:

Theorem 1.3 (Main Result). Fix $d \in \mathbb{Z}_+$ and $0 < \epsilon < 1$. Let $n = \tilde{O}_d\big((1/\epsilon)^{(d+3)/2}\big)$. For any $f_0 \in \mathcal{F}_d$, with probability at least $9/10$ over the $n$ samples from $f_0$, we have that $h^2(\hat{f}_n, f_0) \leq \epsilon$.

See Theorem 3.1 for a more detailed statement. The aforementioned lower bound of [KS16] implies that our upper bound is tight up to an $\tilde{O}_d(\epsilon^{-1})$ multiplicative factor.
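As a rough illustration of Definition 1.2 in one dimension (this is not the estimator or any algorithm analyzed in the paper), one can compute a grid-discretized log-concave MLE by maximizing the empirical log-likelihood over grid values of $\log f$ constrained to be concave and to integrate to at most one. The sketch below assumes cvxpy with an exponential-cone-capable solver; the grid size, the numerical floor, and the nearest-grid-point assignment are arbitrary choices made for illustration.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
samples = rng.normal(size=200)                 # f0 = N(0,1), which is log-concave

# Grid discretization of Definition 1.2: phi[j] ~ log f at grid point t[j].
m = 200
t = np.linspace(samples.min() - 1.0, samples.max() + 1.0, m)
h = t[1] - t[0]
phi = cp.Variable(m)

# Nearest grid point of each sample; selection matrix for the empirical log-likelihood.
idx = np.clip(np.round((samples - t[0]) / h).astype(int), 0, m - 1)
A = np.zeros((len(samples), m))
A[np.arange(len(samples)), idx] = 1.0

constraints = [
    phi[:-2] - 2 * phi[1:-1] + phi[2:] <= 0,   # log-concavity on the grid
    cp.sum(cp.exp(phi)) * h <= 1,              # (sub)normalization; tight at the optimum
    phi >= -20,                                # harmless numerical floor
]
problem = cp.Problem(cp.Maximize(cp.sum(A @ phi) / len(samples)), constraints)
problem.solve()
f_hat = np.exp(phi.value)
print("estimated total mass:", f_hat.sum() * h, "  max of f_hat:", f_hat.max())
```

The normalization constraint is tight at the optimum, since otherwise all of $\phi$ could be shifted up by a constant while strictly increasing the objective.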
1.3 Related Work

Shape constrained density estimation is a vibrant research field within mathematical statistics. Statistical research in this area started in the 1950s and has seen a recent surge of research activity, in part due to the ubiquity of structured distributions in various domains. The standard method used in statistics to address density estimation problems of this form is the MLE. See [Bru58, Rao69, Weg70, HP76, Gro85, Bir87a, Bir87b, Fou97, CT04, BW07, JW09, DR09, BRW09, GW09, BW10, KM10, Wal09, CS13, KS16, BD18, HW16] for a partial list of works analyzing the MLE for various distribution families. During the past decade, there has been a large body of work on shape constrained density estimation in computer science with a focus on both sample and computational efficiency [DDS12a, DDS12b, DDO+13, CDSS13, CDSS14a, CDSS14b, ADH+15, ADLS17, DKS16c, DKS16d, DDKT16, DKS16b, VV16, DKS17].

Density estimation of log-concave densities has been extensively investigated. The univariate case is by now well understood [DL01, CDSS14a, ADLS17, KS16, HW16]. For example, it is known [KS16, HW16] that $\Theta(\epsilon^{-5/4})$ samples are necessary and sufficient to learn an arbitrary log-concave density over $\mathbb{R}$ within squared Hellinger loss $\epsilon$. Moreover, the MLE is sample-efficient [KS16, HW16] and attains certain adaptivity properties [KGS16]. A recent line of work in computer science [CDSS13, CDSS14a, ADLS17, CDGR16, DKS16a] gave efficient algorithms for log-concave density estimation under the total variation distance.

Density estimation of multivariate log-concave densities is poorly understood. A line of work [CSS10, DR09, DW16, CS13, BD18] has obtained a complete understanding of the global consistency properties of the MLE for any dimension $d$. However, both the rate of convergence of the MLE and the minimax rate of convergence remain unknown. For dimension $d \leq 3$, [KS16] show that the MLE is sample near-optimal (within logarithmic factors) under the squared Hellinger distance. [KS16] also prove bracketing entropy lower bounds suggesting that the MLE may be sub-optimal for $d > 3$ (also see [Wel15]).

1.4 Technical Overview

Here we provide a brief overview of our proof in tandem with a comparison to prior work. We start by noting that the previously known sample complexity upper bound of the log-concave MLE for $d \leq 3$ [KS16] was obtained by bounding from above the bracketing entropy of the class. As we explain below, our argument is more direct, making essential use of the VC inequality (Theorem 2.1), a classical result from empirical process theory.

In contrast to prior work on log-concave density estimation [KS16, DKS17], which relied on approximations to (log-)concave functions, we start by considering approximations to convex sets. Let $f_0$ be the target log-concave density. We show (Lemma 3.4) that, given sufficiently many samples from $f_0$, with high probability, for any convex set $C$ the empirical mass of $C$ and the probability mass of $C$ under $f_0$ are close to each other. We then leverage this structural lemma to analyze the error in the log-likelihood of log-concave densities, using the fact that the superlevel sets of a log-concave density are convex.

We remark that our aforementioned structural result (Lemma 3.4) crucially requires the assumption of log-concavity of $f_0$. Naively, one may think that this lemma follows directly from the VC inequality. Recall, however, that the VC dimension of the family of convex sets is infinite, even in the plane. For example, for the uniform distribution over the unit circle, a similar result does not hold for any finite number of samples, and so we need to use the fact that $f_0$ is log-concave. To prove our lemma, we consider judicious approximations of the convex set $C$ by convex polytopes, using known results from convex geometry. In more detail, we approximate the convex set $C$ from the inside and from the outside by sets that come from a family of bounded VC dimension and whose probabilities under $f_0$ are close to that of $C$.

For any log-concave density $f$, the probabilities of any superlevel set are close under the empirical distribution and $f_0$. If $\log f$ were bounded, this would mean that the empirical log-likelihood of $f$ and the log-likelihood of $f$ under $f_0$ were close. Unfortunately, for any density $f$, $\log f$ is unbounded from below. To deal with this issue, we instead consider $\log(\max(f, p_{\min}))$, for some carefully chosen probability value $p_{\min}$ such that we can ignore the contribution of the density below $p_{\min}$ if $f$ is close to $f_0$. If we can bound the range of $\log(\max(f, p_{\min}))$, we can show that its expectation under $f_0$ and its empirical version are close to each other (see Lemma 3.7). To bound the range, we show that if the maximum value of $f$ is much larger than the maximum of $f_0$, then $f$ has small probability mass outside a set $A$ of small volume; since $A$ has small volume, most of the samples fall outside it, and so the empirical log-likelihood of $f$ is smaller than the empirical log-likelihood of $f_0$. Using this fact, we can show that for the MLE $\hat{f}_n$ the expectation of $\log(\max(\hat{f}_n, p_{\min}))$ is large under $f_0$, and then that $\hat{f}_n$ is close in Hellinger distance to $f_0$.

1.5 Organization

After setting up the required preliminaries in Section 2, in Section 3 we present the proof of our main result, modulo the proof of our main lemma (Lemma 3.4). In Section 4, we give a slightly weaker version of Lemma 3.4 that has a significantly simpler proof. In Section 5, we present the proof of Lemma 3.4. Finally, we conclude with a few open problems in Section 6.
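The unit-circle example mentioned in the technical overview is easy to see in simulation: for the uniform distribution on the circle (which is not log-concave, indeed not even absolutely continuous), the convex hull $C$ of the samples always has empirical mass $1$, while fresh draws from the distribution essentially never land in $C$, so no analogue of Lemma 3.4 can hold. A quick sketch follows (illustrative only; it assumes SciPy's Qhull bindings, and the joggle option is used because exactly cocircular points are a degenerate input for Qhull).

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(1)

def circle_points(n):
    theta = rng.uniform(0, 2 * np.pi, n)
    return np.column_stack([np.cos(theta), np.sin(theta)])

n = 500
train = circle_points(n)
# C = convex hull of the samples; every sample is a vertex of C, so f_n(C) = 1 by construction.
hull = Delaunay(train, qhull_options="QJ")       # "QJ" joggles the degenerate cocircular input

fresh = circle_points(200_000)                   # fresh draws to estimate the true mass f_0(C)
frac_inside = np.mean(hull.find_simplex(fresh) >= 0)
print("empirical mass f_n(C) = 1.0 by construction")
print("estimated true mass f_0(C) ~", frac_inside)   # essentially 0, no matter how large n is
```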
2 Preliminaries

Notation and Definitions. For $m \in \mathbb{Z}_+$, we denote $[m] \stackrel{\mathrm{def}}{=} \{1, \ldots, m\}$. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a Lebesgue measurable function. We will use $f(A)$ to denote $\int_A f(x) \, dx$. A Lebesgue measurable function $f : \mathbb{R}^d \to \mathbb{R}$ is a probability density function (pdf) if $f(x) \geq 0$ for all $x \in \mathbb{R}^d$ and $\int_{\mathbb{R}^d} f(x) \, dx = 1$.

Let $f, g : \mathbb{R}^d \to \mathbb{R}_+$ be probability density functions. The total variation distance between $f$ and $g$ is defined as $d_{\mathrm{TV}}(f, g) = \sup_S |f(S) - g(S)|$, where the supremum is over all Lebesgue measurable subsets of the domain. We have that $d_{\mathrm{TV}}(f, g) = (1/2) \cdot \|f - g\|_1 = (1/2) \cdot \int_{\mathbb{R}^d} |f(x) - g(x)| \, dx$. The Kullback-Leibler (KL) divergence from $g$ to $f$ is defined as $\mathrm{KL}(f \| g) = \int_{-\infty}^{\infty} f(x) \ln \frac{f(x)}{g(x)} \, dx$.

For $f : A \to B$ and $A' \subseteq A$, the restriction of $f$ to $A'$ is the function $f|_{A'} : A' \to B$. For $y \in \mathbb{R}_+$ and $f : \mathbb{R}^d \to \mathbb{R}_+$, we denote by $L_f(y) \stackrel{\mathrm{def}}{=} \{x \in \mathbb{R}^d \mid f(x) \geq y\}$ its superlevel sets. If $f$ is log-concave, $L_f(y)$ is a convex set for all $y \in \mathbb{R}_+$. For a function $f : \mathbb{R}^d \to \mathbb{R}_+$, we will denote by $M_f$ its maximum value.

The VC inequality. We start by recalling the notion of VC dimension. We say that a set $X \subseteq \mathbb{R}^d$ is shattered by a collection $\mathcal{A}$ of subsets of $\mathbb{R}^d$ if for every $Y \subseteq X$ there exists $A \in \mathcal{A}$ such that $A \cap X = Y$. The VC dimension of a family $\mathcal{A}$ of subsets of $\mathbb{R}^d$ is defined to be the maximum cardinality of a subset $X \subseteq \mathbb{R}^d$ that is shattered by $\mathcal{A}$. If there is a shattered subset of size $s$ for all $s \in \mathbb{Z}_+$, then we say that the VC dimension of $\mathcal{A}$ is $\infty$.

The empirical distribution, $f_n$, corresponding to a density $f : \mathbb{R}^d \to \mathbb{R}_+$ is the discrete probability measure defined by $f_n(A) = (1/n) \cdot \sum_{i=1}^n \mathbb{1}_A(X_i)$, where the $X_i$ are iid samples drawn from $f$ and $\mathbb{1}_S$ is the characteristic function of the set $S$. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a Lebesgue measurable function. Given a family $\mathcal{A}$ of measurable subsets of $\mathbb{R}^d$, we define the $\mathcal{A}$-norm of $f$ by $\|f\|_{\mathcal{A}} = \sup_{A \in \mathcal{A}} |f(A)|$. The VC inequality states the following:

Theorem 2.1 (VC inequality, see [DL01], p. 31). Let $f : \mathbb{R}^d \to \mathbb{R}_+$ be a probability density function and $f_n$ be the empirical distribution obtained after drawing $n$ samples from $f$. Let $\mathcal{A}$ be a family of subsets over $\mathbb{R}^d$ with VC dimension $V$. Then $\mathbb{E}[\|f - f_n\|_{\mathcal{A}}] \leq C \sqrt{V/n}$, for some universal constant $C > 0$.

We will also require a high probability version of the VC inequality, which can be obtained using the following standard uniform convergence bound:

Theorem 2.2 (see [DL01], p. 17). Let $\mathcal{A}$ be a family of subsets over $\mathbb{R}^d$ and $f_n$ be the empirical distribution of $n$ samples from the density $f : \mathbb{R}^d \to \mathbb{R}_+$. Let $X$ be the random variable $\|f - f_n\|_{\mathcal{A}}$. Then for all $\delta > 0$, we have that $\Pr[X - \mathbb{E}[X] > \delta] \leq e^{-2n\delta^2}$.

Approximating Convex Sets by Polytopes. We make use of the following quantitative bounds of [GMR95] that provide volume approximation for any convex body by an inscribed and a circumscribed convex polytope, respectively, with a bounded number of facets:

Theorem 2.3. For any convex body $K \subseteq \mathbb{R}^d$ and $\ell$ sufficiently large, there exists a convex polytope $P \subseteq K$ with at most $\ell$ facets such that $\mathrm{vol}(K \setminus P) \leq \frac{\kappa d}{\ell^{2/(d-1)}} \, \mathrm{vol}(K)$, where $\kappa > 0$ is a universal constant. Similarly, there exists a convex polytope $P'$ with $K \subseteq P'$ and at most $\ell$ facets such that $\mathrm{vol}(P' \setminus K) \leq \frac{\kappa d}{\ell^{2/(d-1)}} \, \mathrm{vol}(K)$.
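For intuition about the $\ell^{-2/(d-1)}$ rate in Theorem 2.3, consider the planar case $d = 2$, where the inscribed regular $\ell$-gon of the unit disk has area $(\ell/2)\sin(2\pi/\ell)$. The short computation below (illustrative only, not from the paper; NumPy assumed) shows the relative volume gap decaying like $1/\ell^2$, i.e., like $\ell^{-2/(d-1)}$.

```python
import numpy as np

# Inscribed regular l-gon in the unit disk: area = (l/2) * sin(2*pi/l).
# Theorem 2.3 with d = 2 predicts vol(K \ P) / vol(K) = O(1 / l^{2/(d-1)}) = O(1/l^2).
for l in [4, 8, 16, 32, 64, 128]:
    poly_area = 0.5 * l * np.sin(2 * np.pi / l)
    rel_gap = (np.pi - poly_area) / np.pi
    print(f"facets = {l:4d}   relative volume gap = {rel_gap:.2e}   gap * l^2 = {rel_gap * l**2:.3f}")
```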
3 Main Result: Proof of Theorem 1.3

The following theorem is a more detailed version of Theorem 1.3 and is the main result of this paper:

Theorem 3.1. Fix $d \in \mathbb{Z}_+$ and $0 < \epsilon, \tau < 1$. Let $n = O\big(((d^2/\epsilon) \ln^3(d/(\epsilon\tau)))^{(d+3)/2}\big)$. For any $f_0 \in \mathcal{F}_d$, with probability at least $1 - \tau$ over the $n$ samples from $f_0$, we have that $h^2(\hat{f}_n, f_0) \leq \epsilon$.

This section is devoted to the proof of Theorem 3.1, which follows from Lemma 3.14. We will require a sequence of intermediate lemmas and claims. We summarize the notation that will appear throughout this proof. We use $f_0 \in \mathcal{F}_d$ to denote the target log-concave density. We denote by $f_n$ the empirical density obtained after drawing $n$ iid samples $X_1, \ldots, X_n$ from $f_0$, and by $\hat{f}_n$ the corresponding MLE.

Given $d \in \mathbb{Z}_+$ and $0 < \epsilon, \tau < 1$, for concreteness we will denote
$$N_1 \stackrel{\mathrm{def}}{=} \Theta\big( ((d^2/\epsilon) \ln^3(d/(\epsilon\tau)))^{(d+3)/2} \big),$$
for a sufficiently large universal constant in the big-$\Theta$ notation. We will establish that $N_1$ is an upper bound on the desired sample complexity of the MLE. Moreover, we will denote
$$z \stackrel{\mathrm{def}}{=} \ln(100 n^4/\tau^2), \qquad S \stackrel{\mathrm{def}}{=} L_{f_0}(M_{f_0} e^{-z}), \qquad \delta \stackrel{\mathrm{def}}{=} \epsilon/(32 \ln(100 n^4/\tau^2)), \qquad p_{\min} \stackrel{\mathrm{def}}{=} M_{f_0}/(100 n^4/\tau^2).$$

We start by establishing an upper bound on the volume of superlevel sets:

Lemma 3.2 (see, e.g., [DKS17], p. 8). Let $f \in \mathcal{F}_d$ with maximum value $M_f$. Then for all $z' \geq 1$, we have $\mathrm{vol}(L_f(M_f e^{-z'})) \leq O(z'^d/M_f)$, and $\Pr_{X \sim f}[f(X) \leq M_f e^{-z'}] \leq O(d)^d e^{-z'/2}$.

We defer this proof to Appendix A. We use Lemma 3.2 to get a bound on the volume of the superlevel set that contains all the samples with high probability:

Corollary 3.3. For $n \geq N_1$, we have that: (a) $\mathrm{vol}(S) = O(z^d/M_{f_0})$, and (b) $\Pr_{X \sim f_0}[f_0(X) \leq M_{f_0}/(100 n^4/\tau^2)] \leq \tau/(3n)$. In particular, with probability at least $1 - \tau/3$, all samples $X_1, \ldots, X_n$ from $f_0$ are in $S$.

Proof. From Lemma 3.2, we have that $\mathrm{vol}(S) = \mathrm{vol}(L_{f_0}(M_{f_0} e^{-z})) \leq O(z^d/M_{f_0})$. Also from Lemma 3.2, we have that $\Pr_{X \sim f_0}[f_0(X) \leq M_{f_0}/(100 n^4/\tau^2)] \leq \tau/(3n)$, if we assume a sufficiently large constant is selected in the definition of $N_1$. Taking a union bound over all samples, we get that with probability at least $1 - \tau/3$, all of the $n$ samples are in $S$, as required.

We can now state our main lemma, establishing an upper bound on the error of approximating the probability of every convex set:

Lemma 3.4. For $n \geq N_1$, we have that with probability at least $1 - \tau/3$ over the choice of $X_1, \ldots, X_n$ drawn from $f_0$, for any convex set $C \subset \mathbb{R}^d$ it holds that $|f_0(C) - f_n(C)| \leq \delta$.

The proof of Lemma 3.4 is deferred to Section 5. In Section 4, we establish a weaker version of this lemma that requires more samples but has a simpler proof. Combining Lemma 3.4 with the observation that for any log-concave density $f$ and $t > 0$ we have that $L_f(t)$ is convex, we obtain the following corollary:

Corollary 3.5. Let $n \geq N_1$. Conditioning on the event of Lemma 3.4, we have that for any $f \in \mathcal{F}_d$ and for any $t \geq 0$ it holds that $|\Pr_{X \sim f_0}[f(X) \geq t] - \Pr_{X \sim f_n}[f(X) \geq t]| < \delta$.

We will require the following technical claim, which follows from standard properties of Lebesgue integration (see Appendix A):

Lemma 3.6. Let $g, h : \mathbb{R}^d \to \mathbb{R}$ be pdfs, and $\phi : \mathbb{R} \to \mathbb{R}$. If $\mathbb{E}_{Y \sim g}[\phi(Y)]$ and $\mathbb{E}_{Y \sim h}[\phi(Y)]$ are both finite, then $|\mathbb{E}_{Y \sim g}[\phi(Y)] - \mathbb{E}_{Y \sim h}[\phi(Y)]| \leq \int_{-\infty}^{\infty} |\Pr_{Y \sim g}[\phi(Y) < x] - \Pr_{Y \sim h}[\phi(Y) < x]| \, dx$.

Our next lemma establishes a useful upper bound on the empirical error of the truncated likelihood of any log-concave density:

Lemma 3.7. Let $n \geq N_1$ and $f \in \mathcal{F}_d$ with maximum value $M_f$. For all $\rho \in (0, M_f]$, conditioning on the event of Corollary 3.5, we have
$$|\mathbb{E}_{X \sim f_0}[\ln(\max(f(X), \rho))] - \mathbb{E}_{X \sim f_n}[\ln(\max(f(X), \rho))]| \leq \delta \cdot \ln(M_f/\rho).$$

Proof. Letting $h = f_0$, $g = f_n$, and $\phi(x) = \ln(\max(f(x), \rho))$, by Lemma 3.6 we have
$$\begin{aligned}
|\mathbb{E}_{X \sim f_0}[\ln(\max(f(X), \rho))] &- \mathbb{E}_{X \sim f_n}[\ln(\max(f(X), \rho))]| \\
&\leq \int_{-\infty}^{\infty} |\Pr_{X \sim f_0}[\ln(\max(f(X), \rho)) < t] - \Pr_{X \sim f_n}[\ln(\max(f(X), \rho)) < t]| \, dt \\
&= \int_{\ln \rho}^{\ln M_f} |\Pr_{X \sim f_0}[\max(\ln f(X), \ln \rho) < t] - \Pr_{X \sim f_n}[\max(\ln f(X), \ln \rho) < t]| \, dt \\
&= \int_{\ln \rho}^{\ln M_f} |\Pr_{X \sim f_0}[\ln f(X) < t] - \Pr_{X \sim f_n}[\ln f(X) < t]| \, dt \\
&= \int_{\ln \rho}^{\ln M_f} |\Pr_{X \sim f_0}[f(X) < e^t] - \Pr_{X \sim f_n}[f(X) < e^t]| \, dt \\
&= \int_{\ln \rho}^{\ln M_f} |\Pr_{X \sim f_0}[f(X) \geq e^t] - \Pr_{X \sim f_n}[f(X) \geq e^t]| \, dt.
\end{aligned}$$
Since we conditioned on the event of Corollary 3.5, we have $|\Pr_{X \sim f_0}[f(X) \geq t] - \Pr_{X \sim f_n}[f(X) \geq t]| \leq \delta$ for all $t \geq 0$. Therefore, we have that
$$|\mathbb{E}_{X \sim f_0}[\ln(\max(f(X), \rho))] - \mathbb{E}_{X \sim f_n}[\ln(\max(f(X), \rho))]| \leq \int_{\ln \rho}^{\ln M_f} \delta \, dt = \delta \cdot (\ln M_f - \ln \rho),$$
which concludes the proof.
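Lemma 3.4 and Corollary 3.5 above are the engine of this section. As a quick empirical illustration (not from the paper, and of course restricted to two convenient sub-families of convex sets rather than all of them), the sketch below draws samples from a two-dimensional standard Gaussian and compares empirical and true masses over centered balls and half-planes, for which $f_0(C)$ has a closed form; NumPy and SciPy are assumed.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 20000
X = rng.standard_normal((n, 2))        # f0 = standard Gaussian on R^2 (log-concave)

# Centered balls: f0(ball of radius r) = 1 - exp(-r^2 / 2) for the 2-d standard normal.
radii = np.linspace(0.1, 4.0, 200)
r2 = (X ** 2).sum(axis=1)
ball_dev = [abs((1 - np.exp(-r * r / 2)) - np.mean(r2 <= r * r)) for r in radii]

# Half-planes {x_1 <= t}: f0 mass is the standard normal CDF at t.
ts = np.linspace(-3.5, 3.5, 200)
half_dev = [abs(norm.cdf(t) - np.mean(X[:, 0] <= t)) for t in ts]

print("max deviation over centered balls:", max(ball_dev))
print("max deviation over half-planes:   ", max(half_dev))
print("for comparison, 1/sqrt(n) =       ", n ** -0.5)
```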
For $f_0$ itself, we can use Hoeffding's inequality to get a bound on the empirical error of its likelihood:

Lemma 3.8. Let $n \geq N_1$. Conditioning on the event of Corollary 3.3, with probability at least $1 - \tau/3$ over $X_1, \ldots, X_n$, we have that
$$\Big| \frac{1}{n} \sum_{i=1}^n \ln f_0(X_i) - \mathbb{E}_{X \sim f_0}[\ln f_0(X)] \Big| \leq \epsilon/8.$$

Proof. Recall that $z \stackrel{\mathrm{def}}{=} \ln(100 n^4/\tau^2)$, $S \stackrel{\mathrm{def}}{=} L_{f_0}(M_{f_0} e^{-z})$, and $p_{\min} \stackrel{\mathrm{def}}{=} M_{f_0}/(100 n^4/\tau^2)$. Note that for any $x \in S$, we have $f_0(x) \geq p_{\min}$ by construction. Since we have conditioned on the event of Corollary 3.3 holding, it follows that for each $i \in [n]$, $f_0(X_i) \geq p_{\min}$. Therefore, letting $\rho \stackrel{\mathrm{def}}{=} p_{\min}$, we have
$$\begin{aligned}
\Big| \frac{1}{n} \sum_{i=1}^n \ln(f_0(X_i)) - \mathbb{E}_{X \sim f_0}[\ln f_0(X)] \Big|
&= \Big| \frac{1}{n} \sum_{i=1}^n \ln(\max(f_0(X_i), \rho)) - \mathbb{E}_{X \sim f_0}[\ln f_0(X)] \Big| \\
&\leq \Big| \frac{1}{n} \sum_{i=1}^n \ln(\max(f_0(X_i), \rho)) - \mathbb{E}_{X \sim f_0}[\ln(\max(f_0(X), \rho))] \Big| \\
&\qquad + |\mathbb{E}_{X \sim f_0}[\ln(\max(f_0(X), \rho))] - \mathbb{E}_{X \sim f_0}[\ln f_0(X)]| \\
&\leq \Big| \frac{1}{n} \sum_{i=1}^n \ln(\max(f_0(X_i), \rho)) - \mathbb{E}_{X \sim f_0}[\ln(\max(f_0(X), \rho))] \Big| + \int_{-\infty}^{\ln \rho} \Pr[\ln f_0(X) \leq T] \, dT. \quad (1)
\end{aligned}$$
By Hoeffding's inequality we have
$$\Pr\Big[ \Big| \frac{1}{n} \sum_{i=1}^n \ln(\max(f_0(X_i), \rho)) - \mathbb{E}_{X \sim f_0}[\ln(\max(f_0(X), \rho))] \Big| > \frac{\epsilon}{16} \Big]
\leq 2 \exp\Big( \frac{-2 n^2 (\epsilon/16)^2}{n \cdot (\ln M_{f_0} - \ln \rho)^2} \Big)
\leq 2 \exp\Big( \frac{-n \epsilon^2/16^2}{(\ln(100 n^4/\tau^2))^2} \Big)
\leq \tau/3, \quad (2)$$
where the last inequality uses $n \geq N_1$. Next we have
$$\begin{aligned}
\int_{-\infty}^{\ln \rho} \Pr_{X \sim f_0}[\ln f_0(X) \leq T] \, dT
&\leq \int_0^{\infty} \Pr_{X \sim f_0}[\ln f_0(X) \leq \ln \rho - y] \, dy \qquad \text{(setting $y = \ln \rho - T$)} \\
&\leq \int_0^{\infty} O(d)^d (\rho/M_{f_0})^{1/2} e^{-y/2} \, dy \qquad \text{(by Lemma 3.2)} \\
&= O(d)^d \frac{\tau}{10 n^2} \int_0^{\infty} e^{-y/2} \, dy \qquad \text{(since $\rho = p_{\min} = M_{f_0}/(100 n^4/\tau^2)$)} \\
&\leq 2 \cdot O(d)^d \frac{\tau}{10 n^2} \leq \epsilon/16, \qquad \text{(since $n \geq N_1$)}
\end{aligned} \quad (3)$$
By applying (2) and (3) to bound (1) from above, with probability at least $1 - \tau/3$ we have that
$$\Big| \frac{1}{n} \sum_{i=1}^n \ln f_0(X_i) - \mathbb{E}_{X \sim f_0}[\ln f_0(X)] \Big| \leq \epsilon/8,$$
which concludes the proof.
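Lemma 3.8 is an instance of a familiar phenomenon: the empirical log-likelihood of the true density concentrates around its expectation. A minimal sketch (not from the paper; NumPy assumed) for $f_0 = N(0,1)$, where $\mathbb{E}_{X \sim f_0}[\ln f_0(X)] = -(1 + \ln(2\pi))/2$:

```python
import numpy as np

# Empirical log-likelihood of f0 vs. its expectation, for f0 = N(0,1) in one dimension.
rng = np.random.default_rng(3)
expected = -0.5 * (1 + np.log(2 * np.pi))        # E[ln f0(X)] = negative differential entropy

for n in [100, 1000, 10000, 100000]:
    x = rng.standard_normal(n)
    empirical = np.mean(-0.5 * x ** 2 - 0.5 * np.log(2 * np.pi))
    print(f"n = {n:6d}   |empirical - expected| = {abs(empirical - expected):.4f}")
```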
The following simple lemma shows that the MLE is supported in the convex hull of the samples:

Lemma 3.9. Let $n \geq 1$. Let $X_1, \ldots, X_n$ be samples drawn from $f_0$, and $C$ be the convex hull of these samples. Then, for all $x \in \mathbb{R}^d \setminus C$, we have $\hat{f}_n(x) = 0$.

Proof. Suppose there exists $x \in \mathbb{R}^d \setminus C$ such that $\hat{f}_n(x) > 0$. Then we have that $L_{\hat{f}_n}(\hat{f}_n(x)) \setminus C \neq \emptyset$, and thus $\int_{\mathbb{R}^d \setminus C} \hat{f}_n(x) \, dx > 0$. From this, it follows that $\int_C \hat{f}_n(x) \, dx < 1$, and so there exists some $\alpha > 1$ such that $\alpha \int_C \hat{f}_n(x) \, dx = 1$. Let $\hat{g}_n : C \to \mathbb{R}$ be such that $\hat{g}_n = \alpha \cdot \hat{f}_n|_C$. Since $C$ is a convex set and $\int_C \hat{g}_n(x) \, dx = 1$, we have that $\hat{g}_n$ is a log-concave density. Observe that
$$\frac{1}{n} \sum_{i=1}^n \log(\hat{g}_n(X_i)) = \frac{1}{n} \sum_{i=1}^n \log(\alpha \hat{f}_n(X_i)) > \frac{1}{n} \sum_{i=1}^n \log(\hat{f}_n(X_i)), \quad (4)$$
where we used that $\alpha > 1$. By definition, $\hat{f}_n$ maximizes $\frac{1}{n} \sum_{i=1}^n \log(f(X_i))$ over all log-concave densities $f$, which contradicts (4). Therefore, for all $x \in \mathbb{R}^d \setminus C$, we have that $\hat{f}_n(x) = 0$.

We need to truncate the likelihood at a density small enough to be ignored for $f$ close to $f_0$. This motivates the following definition:

Definition 3.10. We define $f' : \mathbb{R}^d \to \mathbb{R}$ such that $f'(x) \stackrel{\mathrm{def}}{=} \max\{p_{\min}, \hat{f}_n(x)\}$.

We show that this truncation and renormalization does not affect the MLE $\hat{f}_n$ by much:

Lemma 3.11. Let $n \geq N_1$. Let $g(x) \stackrel{\mathrm{def}}{=} \alpha \cdot f'(x)|_S$, $\alpha \in \mathbb{R}_+$, be such that $\int_S g(x) \, dx = 1$. Conditioning on the event of Corollary 3.3, we have the following: (a) $1 - \epsilon/32 \leq \alpha \leq 1$, and (b) $d_{\mathrm{TV}}(g, \hat{f}_n) \leq 3\epsilon/64$.

Proof. We start by proving (a). By the definition of $g$ and Lemma 3.9, we have that $\alpha = \alpha \int_S \hat{f}_n(x) \, dx \leq \alpha \int_S f'(x) \, dx = \int_S g(x) \, dx = 1$, i.e., $\alpha \leq 1$. Furthermore, by the definition of $p_{\min}$ and Corollary 3.3, we have
$$p_{\min} \cdot \mathrm{vol}(S) \leq \frac{M_{f_0}}{100 n^4/\tau^2} \cdot \frac{O((\ln(100 n^4/\tau^2))^d)}{M_{f_0}} \leq \epsilon/32, \quad (5)$$
and therefore
$$1 = \int_S g(x) \, dx \leq \alpha \Big( \int_S p_{\min} \, dx + \int_S \hat{f}_n(x) \, dx \Big) \leq \alpha (p_{\min} \cdot \mathrm{vol}(S) + 1) \leq \alpha (\epsilon/32 + 1).$$
From this it follows that $\alpha \geq 1/(1 + \epsilon/32) \geq 1 - \epsilon/32$.

For (b), we have
$$d_{\mathrm{TV}}(g, \hat{f}_n) = \frac{1}{2} \int_{\mathbb{R}^d} |g(x) - \hat{f}_n(x)| \, dx = \frac{1}{2} \int_S |g(x) - \hat{f}_n(x)| \, dx, \quad (6)$$
since $g$ is defined on $S$ and $\hat{f}_n$ is supported in $S$ by Lemma 3.9. We can then write
$$\begin{aligned}
\frac{1}{2} \int_S |g(x) - \hat{f}_n(x)| \, dx
&= \frac{1}{2} \int_S |\alpha f'(x) - \hat{f}_n(x)| \, dx \\
&\leq \frac{1}{2} \Big( |\alpha - 1| \cdot \int_S \hat{f}_n(x) \, dx + p_{\min} \cdot \mathrm{vol}(S) \Big) \\
&\leq \frac{|\alpha - 1|}{2} \int_S \hat{f}_n(x) \, dx + \epsilon/32 \qquad \text{(from (5))} \\
&\leq \frac{|1 - \alpha|}{2} + \epsilon/32 \leq 3\epsilon/64,
\end{aligned}$$
which completes the proof.

To deal with the dependence on the maximum value of $f$ in Lemma 3.7, we need to bound the maximum value of the MLE.

Lemma 3.12. Let $n \geq N_1$. Let $X_1, \ldots, X_n$ be samples drawn from $f_0$. Then, conditioning on the events of Corollary 3.5 and Lemma 3.8, for any $f \in \mathcal{F}_d$ with maximum value $M_f$ such that $\ln(M_f/p_{\min}) \geq 4 \ln(100 n^4/\tau^2)$, we have $\frac{1}{n} \sum_{i=1}^n \ln f(X_i) < \frac{1}{n} \sum_{i=1}^n \ln f_0(X_i)$.

Proof. This lemma holds because a density $f$ with a large $M_f$ is small outside a set of small volume, and most of the samples will fall outside that set. Let
$$\gamma = \exp\Big( 2 \Big( \frac{1}{n} \sum_{i=1}^n \ln f_0(X_i) - \frac{1}{2} \ln M_f - 1 \Big) \Big)$$
and $A = L_f(\gamma)$. If we have that $\mathrm{vol}(A) \cdot M_{f_0} \leq 1/3$, then it follows that $f_0(A) \leq 1/3$. Since $f$ is log-concave, $A$ is a convex set, and since we condition on Corollary 3.5 holding, we have with probability 1 that $|f_0(A) - f_n(A)| < \delta < 1/6$. Therefore, we have that $f_n(A) < 1/2$, in which case at least $1/2$ of the samples $X_1, \ldots, X_n$ are not contained within $A$. Thus, we have that
$$\frac{1}{n} \sum_{i=1}^n \ln f(X_i) \leq \frac{1}{2} \ln \gamma + \frac{1}{2} \ln M_f = \frac{1}{2} \cdot 2 \Big( \frac{1}{n} \sum_{i=1}^n \ln f_0(X_i) - \frac{1}{2} \ln M_f - 1 \Big) + \frac{1}{2} \ln M_f < \frac{1}{n} \sum_{i=1}^n \ln f_0(X_i).$$
Now we check how large $M_f$ must be to ensure that $\mathrm{vol}(A) \cdot M_{f_0} \leq 1/3$. We have that
$$\mathrm{vol}(A) \cdot M_{f_0} = \mathrm{vol}(L_f(\gamma)) \cdot M_{f_0} = \mathrm{vol}\Big( L_f\Big( M_f \cdot \exp\Big( \frac{2}{n} \sum_{i=1}^n \ln f_0(X_i) - 2 - 2 \ln M_f \Big) \Big) \Big) \cdot M_{f_0} \leq \frac{M_{f_0}}{M_f} \cdot O\Big( \Big( 2 - \frac{2}{n} \sum_{i=1}^n \ln f_0(X_i) + 2 \ln M_f \Big)^d \Big),$$
where the last step is by Lemma 3.2. Since we condition on the event of Lemma 3.8 holding, we have with probability 1 that
$$\frac{1}{n} \sum_{i=1}^n \ln f_0(X_i) \geq \mathbb{E}_{X \sim f_0}[\ln f_0(X)] - \epsilon \geq \ln p_{\min} - \epsilon,$$
and so we have that
$$\mathrm{vol}(A) \cdot M_{f_0} \leq \frac{M_{f_0}}{M_f} \cdot O\big( (2 + 2 \ln M_f - 2 \ln p_{\min} + 2\epsilon)^d \big) < \frac{M_{f_0}}{M_f} \cdot O\big( (2 \ln M_f - 2 \ln M_{f_0} + 3 \ln(100 n^4/\tau^2))^d \big).$$
Let S be as defined in Corollary 3.3 Then we have that EX∼f0 [ln g(X)] = EX∼f0 [ln(αf ′ (X))] ≥ EX∼f0 [ln f ′ (X)] − ǫ/16 ≥ EX∼f [ln(max{fˆn (X), pmin })] − ǫ/16 (since a > 1 − ǫ/32) 0 ≥ EX∼fn [ln(max{fˆn (X), pmin })] − 3ǫ/16 1X ˆ ln fn (Xi ) − 3ǫ/16 ≥ n i 1X ≥ ln f0 (Xi ) − 3ǫ/16 n (by Lemmas 3.7 and 3.12) i ≥ EX∼f0 [ln f0 (X)] − 5ǫ/16. (using Lemma 3.8) Thus, we obtain that KL(f0 ||g) = EX∼f0 [ln f0 (X)] − EX∼f0 [ln g(X)] ≤ 5ǫ/16. (7) For the next derivation, we use that the Hellinger distance is related to the total variation distance and the Kullback-Leibler divergence in the following way: For probability functions k1 , k2 : Rd → R, we have that h2 (k1 , k2 ) ≤ dTV (k1 , k2 ) and h2 (k1 , k2 ) ≤ KL(k1 ||k2 ). Therefore, we have that h(f0 , fˆn ) ≤ h(f0 , g) + h(g, fˆn ) ≤ KL(f0 ||g)1/2 + dTV (g, fˆn )1/2 = (5ǫ/16)1/2 + (3ǫ/64)1/2 ≤ǫ 1/2 (by (7) and Lemma 3.11) , concluding the proof. 10 4 Warmup for the Proof of Lemma 3.4 For the sake of exposition of the main ideas used in the proof of Lemma 3.4, we first prove Lemma 4.2, which achieves a weaker bound on the sample complexity, but has a significantly simpler proof. Let us first give a brief, and somewhat imprecise, overview of the proof of Lemma 4.2. The high-level goal is to approximate some convex set C ⊂ Rd by some set, belonging to a family of low VC dimension. We then can obtain the desired bound using Theorem 2.1. To that end, we compute inner and outer approximations, C in and C out , of C via polyhedral sets with a small number of facets. By Lemma 4.1, we can argue that the VC dimension of this family is low. We therefore obtain that f0 and fn are close on the inner and outer approximations of C. It remains to argue that the total difference between f0 and fn in C out \ C in is also small. It thus suffices to bound the volume of C out \ C in . This can be achieved by first defining some set S ⊂ Rd that excludes the tail of f0 . Since f0 is logconcave, we can show that S has small volume. The final bound is obtained by restricting the above argument on C ∩ S. (d+5)/2 def Throughout this section, we define N2 = Θ 2O(d) (d(2d+3) /ǫ)(ln(d(d+1) /(ǫτ )))(d+1) . We will require the following simple fact: Lemma 4.1 (see [ASE92]). Let h, d ∈ Z+ , and let A be the set of all convex polytopes in Rd with at most h facets. Then, the VC dimension of A is at most 2(d + 1)h log((d + 1)h). The main result of this section is the following: Lemma 4.2. Let n ≥ N2 . With probability at least 1 − set C ⊂ Rd it holds that |f0 (C) − fn (C)| < δ. def 3τ 10 over the choice of X1 , . . . , Xn , for any convex def Proof. Recall that z = ln(100n4 /τ 2 ) and S = Lf0 (Mf0 e−z ). Let C ⊂ Rd be a convex set, and let C ′ = C ∩ S. Since f0 is log-concave, it follows that S is convex, and thus C ′ is also convex. Let E1 be the event that all samples X1 , . . . , Xn lie in S. By Corollary 3.3 we have PrX1 ,...,Xn ∼f0 [E1 ] = Prfn [E1 ] ≥ 1 − τ /10. (8) Conditioned on E1 occurring, we have, with probability 1, fn (C) = fn (C ′ ). In other words Prfn [fn (C \ C ′ ) = 0|E1 ] = 1. (9) From Corollary 3.3, we have PrX∼f0 [f0 (X) ≤ Mf0 /(100n4 /τ 2 )] ≤ τ /(10n), and therefore f0 (C \ C ′ ) ≤ f0 (Rd \ S) ≤ τ /(10n) ≤ δ/5. (10) Combining (8), (9), and (10), we get     Prfn |f0 (C \ C ′ ) − fn (C \ C ′ )| ≤ δ/5 ≥ Prfn |f0 (C \ C ′ ) − fn (C \ C ′ )| ≤ δ/5|E1 · Prfn [E1 ]   ≥ Prfn fn (C \ C ′ ) = 0|E1 · Pr[E1 ] ≥ 1 − τ /10. (11) Let A be the set of convex polytopes in Rd with at most H = (10κdz d /δ)(d−1)/2 facets, where κ is the universal constant in Theorem 2.3. 
4 Warmup for the Proof of Lemma 3.4

For the sake of exposition of the main ideas used in the proof of Lemma 3.4, we first prove Lemma 4.2, which achieves a weaker bound on the sample complexity but has a significantly simpler proof. Let us first give a brief, and somewhat imprecise, overview of the proof of Lemma 4.2. The high-level goal is to approximate some convex set $C \subset \mathbb{R}^d$ by some set belonging to a family of low VC dimension. We can then obtain the desired bound using Theorem 2.1. To that end, we compute inner and outer approximations, $C^{\mathrm{in}}$ and $C^{\mathrm{out}}$, of $C$ via polyhedral sets with a small number of facets. By Lemma 4.1, we can argue that the VC dimension of this family is low. We therefore obtain that $f_0$ and $f_n$ are close on the inner and outer approximations of $C$. It remains to argue that the total difference between $f_0$ and $f_n$ on $C^{\mathrm{out}} \setminus C^{\mathrm{in}}$ is also small. It thus suffices to bound the volume of $C^{\mathrm{out}} \setminus C^{\mathrm{in}}$. This can be achieved by first defining some set $S \subset \mathbb{R}^d$ that excludes the tail of $f_0$. Since $f_0$ is log-concave, we can show that $S$ has small volume. The final bound is obtained by restricting the above argument to $C \cap S$.

Throughout this section, we define $N_2 \stackrel{\mathrm{def}}{=} \Theta\big( 2^{O(d)} \big( (d^{2d+3}/\epsilon) (\ln(d^{d+1}/(\epsilon\tau)))^{d+1} \big)^{(d+5)/2} \big)$.

We will require the following simple fact:

Lemma 4.1 (see [ASE92]). Let $h, d \in \mathbb{Z}_+$, and let $\mathcal{A}$ be the set of all convex polytopes in $\mathbb{R}^d$ with at most $h$ facets. Then, the VC dimension of $\mathcal{A}$ is at most $2(d+1)h \log((d+1)h)$.

The main result of this section is the following:

Lemma 4.2. Let $n \geq N_2$. With probability at least $1 - \frac{3\tau}{10}$ over the choice of $X_1, \ldots, X_n$, for any convex set $C \subset \mathbb{R}^d$ it holds that $|f_0(C) - f_n(C)| < \delta$.

Proof. Recall that $z \stackrel{\mathrm{def}}{=} \ln(100 n^4/\tau^2)$ and $S \stackrel{\mathrm{def}}{=} L_{f_0}(M_{f_0} e^{-z})$. Let $C \subset \mathbb{R}^d$ be a convex set, and let $C' \stackrel{\mathrm{def}}{=} C \cap S$. Since $f_0$ is log-concave, it follows that $S$ is convex, and thus $C'$ is also convex. Let $E_1$ be the event that all samples $X_1, \ldots, X_n$ lie in $S$. By Corollary 3.3 we have
$$\Pr_{X_1, \ldots, X_n \sim f_0}[E_1] = \Pr_{f_n}[E_1] \geq 1 - \tau/10. \quad (8)$$
Conditioned on $E_1$ occurring, we have, with probability 1, $f_n(C) = f_n(C')$. In other words,
$$\Pr_{f_n}[f_n(C \setminus C') = 0 \mid E_1] = 1. \quad (9)$$
From Corollary 3.3, we have $\Pr_{X \sim f_0}[f_0(X) \leq M_{f_0}/(100 n^4/\tau^2)] \leq \tau/(10n)$, and therefore
$$f_0(C \setminus C') \leq f_0(\mathbb{R}^d \setminus S) \leq \tau/(10n) \leq \delta/5. \quad (10)$$
Combining (8), (9), and (10), we get
$$\Pr_{f_n}[|f_0(C \setminus C') - f_n(C \setminus C')| \leq \delta/5] \geq \Pr_{f_n}[|f_0(C \setminus C') - f_n(C \setminus C')| \leq \delta/5 \mid E_1] \cdot \Pr_{f_n}[E_1] \geq \Pr_{f_n}[f_n(C \setminus C') = 0 \mid E_1] \cdot \Pr[E_1] \geq 1 - \tau/10. \quad (11)$$
Let $\mathcal{A}$ be the set of convex polytopes in $\mathbb{R}^d$ with at most $H = (10 \kappa d z^d/\delta)^{(d-1)/2}$ facets, where $\kappa$ is the universal constant in Theorem 2.3. By Theorem 2.3, there exist convex polytopes $T, T' \in \mathcal{A}$, with $T \subseteq C' \subseteq T'$, such that
$$\mathrm{vol}(C' \setminus T) \leq \frac{\kappa d}{10 \kappa d z^d/\delta} \, \mathrm{vol}(C') \leq \frac{\delta}{10 z^d} \, \mathrm{vol}(S) \leq \frac{\delta}{10 M_{f_0}}, \quad (12)$$
and
$$\mathrm{vol}(T' \setminus C') \leq \frac{\kappa d}{10 \kappa d z^d/\delta} \, \mathrm{vol}(C') \leq \frac{\delta}{10 z^d} \, \mathrm{vol}(S) \leq \frac{\delta}{10 M_{f_0}}. \quad (13)$$
Therefore, since $M_{f_0}$ is the maximum value of $f_0$, we have
$$f_0(C' \setminus T) \leq \mathrm{vol}(C' \setminus T) \cdot M_{f_0} \leq \delta/10,$$
and
$$f_0(T' \setminus C') \leq \mathrm{vol}(T' \setminus C') \cdot M_{f_0} \leq \delta/10. \quad (14)$$
Noting that $\mathbb{E}[|f_0(T) - f_n(T)|] \leq \mathbb{E}[\|f_0 - f_n\|_{\mathcal{A}}]$, by Theorem 2.1 we have, for some universal constant $\alpha$, that
$$\mathbb{E}[|f_0(T) - f_n(T)|] \leq \sqrt{\frac{\alpha V}{n}}.$$
The following claim is obtained via a simple calculation (see Appendix A):

Claim 4.3. For $n \geq N_2$, we have that $\sqrt{\alpha V/n} \leq \delta/10$.

Let $E_2$ be the event that $|f_0(T) - f_n(T)| \leq \delta/5$. By Claim 4.3 and Theorem 2.2 we have
$$\Pr_{f_n}[E_2] = 1 - \Pr_{f_n}[|f_0(T) - f_n(T)| > \delta/5] \geq 1 - \Pr_{f_n}[|f_0(T) - f_n(T)| - \mathbb{E}[|f_0(T) - f_n(T)|] > \delta/10] \geq 1 - e^{-2n(\delta/10)^2} \geq 1 - \tau/10. \quad (15)$$
Similarly, we define $E_3$ to be the event that $|f_0(T') - f_n(T')| \leq \delta/5$. Arguing as above, we obtain
$$\Pr_{f_n}[E_3] \geq 1 - e^{-2n(\delta/10)^2} \geq 1 - \tau/10. \quad (16)$$
For any choice of samples $X_1, \ldots, X_n$, we have
$$f_n(C') \geq f_n(T) \geq f_0(C') - f_0(C' \setminus T) - |f_0(T) - f_n(T)| \geq f_0(C') - \frac{\delta}{10} - |f_0(T) - f_n(T)|, \quad (17)$$
where the first inequality uses $T \subseteq C'$ and the last uses (12). Thus, we get
$$\Pr_{f_n}[f_n(C') \geq f_0(C') - 3\delta/10] \geq \Pr_{f_n}[|f_0(T) - f_n(T)| \leq \delta/5] \geq \Pr_{f_n}[|f_0(T) - f_n(T)| \leq \delta/5 \mid E_2] \cdot \Pr_{f_n}[E_2] \geq 1 - \tau/10, \quad (18)$$
where the first inequality is by (17) and the last is by (15). In a similar way, using that $C' \subseteq T'$, for any choice of the samples $X_1, \ldots, X_n$, we have
$$f_n(C') \leq f_0(C') + \frac{\delta}{10} + |f_0(T') - f_n(T')|, \quad (19)$$
by (14). It therefore follows that
$$\Pr_{f_n}[f_n(C') \leq f_0(C') + 3\delta/10] \geq \Pr_{f_n}[|f_0(T') - f_n(T')| \leq \delta/5] \geq \Pr_{f_n}[|f_0(T') - f_n(T')| \leq \delta/5 \mid E_3] \cdot \Pr_{f_n}[E_3] \geq 1 - \tau/10, \quad (20)$$
where the first inequality is by (19) and the last is by (16). By (18) and (20) and the union bound, we obtain
$$\Pr_{f_n}[|f_n(C') - f_0(C')| \leq 3\delta/10] \geq 1 - 2\tau/10. \quad (21)$$
Combining (11) and (21) we get
$$\Pr_{f_n}[|f_n(C) - f_0(C)| \leq 2\delta/5] \geq \Pr_{f_n}[(|f_n(C \setminus C') - f_0(C \setminus C')| \leq \delta/5) \wedge (|f_n(C') - f_0(C')| \leq 3\delta/10)] \geq 1 - \tau/10 - 2\tau/10 \geq 1 - 3\tau/10,$$
which concludes the proof.

5 Proof of Lemma 3.4

We are now ready to prove the main technical part of our work, which is Lemma 3.4. The proof builds upon the argument used in the proof of Lemma 4.2, which achieves a weaker sample complexity bound. Recall that in the proof of Lemma 4.2 we use inner and outer polyhedral approximations of $C$, restricted to some appropriate bounded $S \subset \mathbb{R}^d$. The main difference in the proof of Lemma 3.4 is that we now use roughly $O(\log n)$ inner and outer polyhedral approximations of intersections of $C$ with different superlevel sets of $f_0$. We need slightly more samples due to the higher number of facets, and consequently higher VC dimension, of the resulting approximations. However, since we use a finer discretization of the values of $f_0$, we incur lower error in total.

The following lemma is implicit in [DKS17]. We reproduce its proof for completeness in Appendix A.

Lemma 5.1. Let $L, H \in \mathbb{Z}_+$. We define the set $\mathcal{A}_{H,L}$, whose elements are defined by the following process: starting with $L$ convex polytopes, each with at most $H$ facets, all combinations of intersections and unions of these polytopes are elements of $\mathcal{A}_{H,L}$. If $V$ is the VC dimension of $\mathcal{A}_{H,L}$, then $V/\log(V) = O(dLH)$.
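Before diving into the proof, it may help to get a feel for the magnitudes involved. The sketch below (not from the paper; all universal constants are simply dropped, so only the scaling is meaningful) evaluates $\delta$, the rough number of levels $L$, the facet count $H$, and the product $dLH$ that controls the VC dimension via Lemma 5.1, together with the sample size $n \sim N_1$:

```python
import numpy as np

def magnitudes(d, eps, tau):
    """Rough magnitudes of the quantities in Sections 3 and 5 (constants ignored)."""
    n = (d**2 / eps * np.log(d / (eps * tau)) ** 3) ** ((d + 3) / 2)   # ~ N_1
    delta = eps / (32 * np.log(100 * n**4 / tau**2))                   # delta
    L = np.log(100 * n**4 / tau**2)                                    # ~ number of superlevel sets
    H = (10 * d / delta) ** ((d - 1) / 2)                              # facets per polytope (kappa dropped)
    return n, delta, L, d * L * H                                      # d*L*H ~ V / log V by Lemma 5.1

for d in [2, 3, 4]:
    n, delta, L, v_over_logv = magnitudes(d, eps=0.1, tau=0.1)
    print(f"d={d}:  n ~ {n:.2e}   delta ~ {delta:.2e}   L ~ {L:.1f}   dLH ~ {v_over_logv:.2e}")
```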
We are now prepared to present the proof of Lemma 3.4. Let $S_i = L_{f_0}(M_{f_0} e^{-i})$ for $i \in \mathbb{Z}_+$, and let $S_0 = \emptyset$. Let $L$ be the minimum $L' \in \mathbb{Z}_+$ such that $\Pr_{X \sim f_0}[X \notin S_{L'}] \leq \frac{\tau}{10n}$, and let $E_1$ be the event that all samples $X_1, \ldots, X_n$ lie in $S_L$. We have that
$$\Pr_{X_1, \ldots, X_n \sim f_0}[E_1] = \Pr_{f_n}[E_1] \geq 1 - \tau/10. \quad (22)$$
Moreover, $L \leq \ln(100 n^4/\tau)$: by Lemma 3.2, we have that $\Pr_{X \sim f_0}[f_0(X) \leq M_{f_0} e^{-z'}] \leq O(d)^d e^{-z'/2}$, and thus, for $z' = \ln(100 n^4/\tau)$ and $n \geq N_1$,
$$\Pr_{X \sim f_0}[X \notin S_{z'}] = \Pr_{X \sim f_0}[f_0(X) < M_{f_0} e^{-z'}] \leq \frac{\tau}{10n}.$$
For a convex set $C \subseteq \mathbb{R}^d$, for all $i \in [L]$, let $C_i = C \cap S_i$. Note that, conditioned on $E_1$ occurring, we have with probability 1 that $f_n(C) = f_n(C_L)$. In other words,
$$\Pr_{f_n}[f_n(C \setminus C_L) = 0 \mid E_1] = 1. \quad (23)$$
Furthermore, by our choice of $L$ we have $f_0(C \setminus C_L) \leq f_0(\mathbb{R}^d \setminus S_L) \leq \frac{\tau}{10n}$, and therefore
$$f_0(C \setminus C_L) \leq \frac{\tau}{10n} \leq \delta/5. \quad (24)$$
Combining (22), (23), and (24), we have
$$\Pr_{f_n}[|f_0(C \setminus C_L) - f_n(C \setminus C_L)| \leq \delta/5] \geq \Pr_{f_n}[|f_0(C \setminus C_L) - f_n(C \setminus C_L)| \leq \delta/5 \mid E_1] \cdot \Pr_{f_n}[E_1] \geq \Pr_{f_n}[f_n(C \setminus C_L) = 0 \mid E_1] \cdot \Pr_{f_n}[E_1] \geq 1 - \tau/10. \quad (25)$$

Using Theorem 2.3, for $i \in [L]$ let $P_i^{\mathrm{in}}, P_i^{\mathrm{out}}$ be convex polytopes with at most $H = (10 \kappa d/\delta)^{(d-1)/2}$ facets, where $\kappa$ is the universal constant from Theorem 2.3, such that $P_i^{\mathrm{in}} \subseteq C_i \subseteq P_i^{\mathrm{out}}$,
$$\mathrm{vol}(C_i \setminus P_i^{\mathrm{in}}) \leq \delta \cdot \mathrm{vol}(C_i)/10 \leq \delta \cdot \mathrm{vol}(S_i)/10, \quad (26)$$
and
$$\mathrm{vol}(P_i^{\mathrm{out}} \setminus C_i) \leq \delta \cdot \mathrm{vol}(C_i)/10 \leq \delta \cdot \mathrm{vol}(S_i)/10. \quad (27)$$
Let
$$C^{\mathrm{in}} = \bigcup_{i \in [L]} P_i^{\mathrm{in}}.$$
For any $i \in [L]$, let $P_i^S$ be a convex polytope with at most $H$ facets such that $P_i^S \subseteq S_i$ and $\mathrm{vol}(S_i \setminus P_i^S) \leq \delta \cdot \mathrm{vol}(S_i)/10$. Let
$$S_i' = \bigcup_{1 \leq j \leq i} P_j^S$$
and $S_0' = \emptyset$. Let
$$C^{\mathrm{out}} = \bigcup_{i \in [L]} (P_i^{\mathrm{out}} \setminus S_{i-1}').$$
We will now show that $C^{\mathrm{in}}$ and $C^{\mathrm{out}}$ satisfy the following conditions:

1. $C^{\mathrm{in}} \subseteq C_L \subseteq C^{\mathrm{out}}$.
2. $f_0(C^{\mathrm{out}} \setminus C_L) < \delta/2$.
3. $f_0(C_L \setminus C^{\mathrm{in}}) < \delta/2$.

[Figure 1: Constructing $C^{\mathrm{in}}$. For each set $S_i$, a convex polytope approximating $C \cap S_i$ from the inside is found, and $C^{\mathrm{in}}$ is formed by taking the union of these convex polytopes.]

First, we consider $C^{\mathrm{in}}$. Since $P_i^{\mathrm{in}} \subseteq C_i \subseteq C_L$ for all $i \in [L]$, it follows that $C^{\mathrm{in}} = \bigcup_{i \in [L]} P_i^{\mathrm{in}} \subseteq C_L$. Observe that, by the above definitions, we have that
$$(C_L \setminus C^{\mathrm{in}}) \cap (S_i \setminus S_{i-1}) \subseteq (C_L \setminus C^{\mathrm{in}}) \setminus S_{i-1} \subseteq (C_L \setminus P_i^{\mathrm{in}}) \setminus S_{i-1}. \quad (28)$$
From (28), we therefore have
$$C_L \setminus C^{\mathrm{in}} = \bigcup_{i \in [L]} \big( (C_L \setminus C^{\mathrm{in}}) \cap (S_i \setminus S_{i-1}) \big) \subseteq \bigcup_{i \in [L]} \big( (C_i \setminus P_i^{\mathrm{in}}) \setminus S_{i-1} \big), \quad (29)$$
and so
$$\begin{aligned}
f_0(C_L \setminus C^{\mathrm{in}}) &\leq \sum_{i \in [L]} f_0((C_i \setminus P_i^{\mathrm{in}}) \setminus S_{i-1}) \qquad \text{(by (29))} \\
&\leq \sum_{i \in [L]} \mathrm{vol}((C_i \setminus P_i^{\mathrm{in}}) \setminus S_{i-1}) \, M_{f_0} e^{-(i-1)} \\
&\leq \sum_{i \in [L]} \mathrm{vol}(C_i \setminus P_i^{\mathrm{in}}) \, M_{f_0} e^{-(i-1)} \\
&\leq \sum_{i \in [L]} (\delta/10) \, \mathrm{vol}(S_i) \, M_{f_0} e^{-(i-1)} \qquad \text{(by (26))} \\
&\leq (\delta/10) \sum_{i \in [L]} \mathrm{vol}(L_{f_0}(M_{f_0} e^{-i})) \, M_{f_0} e^{-(i-1)} \\
&\leq (\delta/10) \int_0^{M_{f_0}} \mathrm{vol}(L_{f_0}(y)) \, dy < \delta/2.
\end{aligned} \quad (30)$$

Now we consider $C^{\mathrm{out}}$. Let $x \in C_L$. Then there exists $i \in [L]$ such that $x \in S_i$ and $x \notin S_{i-1}$. Thus $x \in P_i^{\mathrm{out}}$ and $x \notin S_{i-1}'$, from which we have that $x \in C^{\mathrm{out}} = \bigcup_{i \in [L]} (P_i^{\mathrm{out}} \setminus S_{i-1}')$. Therefore $C_L \subseteq C^{\mathrm{out}}$.

Let $y \in C^{\mathrm{out}} \setminus C_L$. From the definition of $C^{\mathrm{out}}$, there must exist some $i \in [L]$ such that $y \in P_i^{\mathrm{out}} \setminus S_{i-1}'$. If $y \in P_i^{\mathrm{out}} \setminus C_i$, we are done. Suppose that $y \notin P_i^{\mathrm{out}} \setminus C_i$. Since we have that $y \in P_i^{\mathrm{out}}$, we must also have that $y \in C_i$. But $C_i \subseteq C_L$, and we began with $y \in C^{\mathrm{out}} \setminus C_L$, which is a contradiction. Therefore,
$$C^{\mathrm{out}} \setminus C_L \subseteq \bigcup_{i \in [L]} \big( P_i^{\mathrm{out}} \setminus C_i \big). \quad (31)$$

[Figure 2: Constructing $C^{\mathrm{out}}$. For each set $S_i$, a convex polytope $P_i^S$ approximating $S_i$ from the inside is found (row (a)), and a convex polytope $P_i^{\mathrm{out}}$ approximating $C \cap S_i$ from the outside is found (row (b)). For each $i$, the set $P_i^{\mathrm{out}} \setminus (\cup_{j=1}^{i-1} P_j^S)$ is constructed (row (c)), and the union of these sets completes the construction of $C^{\mathrm{out}}$ (row (d)).]
Thus, we have that
$$\begin{aligned}
f_0(C^{\mathrm{out}} \setminus C_L) &\leq \sum_{i \in [L]} f_0(P_i^{\mathrm{out}} \setminus C_i) \qquad \text{(by (31))} \\
&\leq \sum_{i \in [L]} \mathrm{vol}(P_i^{\mathrm{out}} \setminus C_i) \, M_{f_0} e^{-(i-1)} \\
&\leq \sum_{i \in [L]} (\delta/10) \, \mathrm{vol}(S_i) \, M_{f_0} e^{-(i-1)} \qquad \text{(by (27))} \\
&\leq (\delta/10) \sum_{i \in [L]} \mathrm{vol}(L_{f_0}(M_{f_0} e^{-i})) \, M_{f_0} e^{-(i-1)} \\
&\leq (\delta/10) \int_0^{M_{f_0}} \mathrm{vol}(L_{f_0}(y)) \, dy < \delta/2.
\end{aligned} \quad (32)$$

We define the set $\mathcal{A}$, whose elements are defined by the following process: starting with $2L$ convex polytopes, each with at most $H$ facets, all combinations of intersections and unions of these convex polytopes are elements of $\mathcal{A}$. Then, for any convex set $C$ with $C^{\mathrm{in}}, C^{\mathrm{out}}$ as defined above, we have that $C^{\mathrm{out}}, C^{\mathrm{in}} \in \mathcal{A}$. From Lemma 5.1, we have that if $V$ is the VC dimension of $\mathcal{A}$, then $V/\ln(V) = O(dLH)$. Using Theorem 2.1, we have, for some universal constant $\alpha$, that
$$\mathbb{E}[|f_0(C^{\mathrm{in}}) - f_n(C^{\mathrm{in}})|] \leq \mathbb{E}[\|f_0 - f_n\|_{\mathcal{A}}] \leq \sqrt{\frac{\alpha V}{n}}. \quad (33)$$
The following claim is obtained via a simple calculation (see Appendix A):

Claim 5.2. For $n \geq N_1$ we have that $\sqrt{\alpha V/n} \leq \delta/10$.

Let $E_2$ be the event that $|f_0(C^{\mathrm{in}}) - f_n(C^{\mathrm{in}})| \leq \delta/5$. Then by (33), Claim 5.2, and Theorem 2.2, we have that
$$\Pr_{f_n}[E_2] = 1 - \Pr_{f_n}[|f_0(C^{\mathrm{in}}) - f_n(C^{\mathrm{in}})| > \delta/5] \geq 1 - \Pr_{f_n}[|f_0(C^{\mathrm{in}}) - f_n(C^{\mathrm{in}})| - \mathbb{E}[|f_0(C^{\mathrm{in}}) - f_n(C^{\mathrm{in}})|] > \delta/10] \geq 1 - e^{-2n(\delta/10)^2} \geq 1 - \tau/10. \quad (34)$$
Let $E_3$ be the event that $|f_0(C^{\mathrm{out}}) - f_n(C^{\mathrm{out}})| \leq \delta/5$; a nearly identical argument shows that
$$\Pr_{f_n}[E_3] = 1 - \Pr_{f_n}[|f_0(C^{\mathrm{out}}) - f_n(C^{\mathrm{out}})| > \delta/5] \geq 1 - \tau/10. \quad (35)$$

Claim 5.3. We have that $\Pr_{f_n}[|f_n(C_L) - f_0(C_L)| \leq 7\delta/10] \geq 1 - \tau/5$.

This claim follows from (30), (32), (34), and (35). The full proof can be found in Appendix A. Combining (25) and Claim 5.3, we get
$$\Pr_{f_n}[|f_n(C) - f_0(C)| \leq \delta] \geq \Pr_{f_n}[(|f_n(C \setminus C_L) - f_0(C \setminus C_L)| \leq \delta/5) \wedge (|f_n(C_L) - f_0(C_L)| \leq 7\delta/10)] \geq 1 - \frac{\tau}{10} - \frac{\tau}{5} \geq 1 - 3\tau/10,$$
which concludes the proof.

6 Conclusions

In this paper, we gave the first sample complexity upper bound for the MLE of multivariate log-concave densities on $\mathbb{R}^d$, for any $d \geq 4$. Our upper bound agrees with the previously known lower bound up to a multiplicative factor of $\tilde{O}_d(\epsilon^{-1})$. A number of open problems remain: What is the optimal sample complexity of the multivariate log-concave MLE? In particular, is the log-concave MLE sample-optimal for $d \geq 4$? Does the multivariate log-concave MLE have similar adaptivity properties as in one dimension? And is there a polynomial time algorithm to compute it?

References

[ADH+15] J. Acharya, I. Diakonikolas, C. Hegde, J. Li, and L. Schmidt. Fast and near-optimal algorithms for approximating distributions by histograms. In Proceedings of the 34th ACM Symposium on Principles of Database Systems, PODS 2015, pages 249-263, 2015.

[ADLS17] J. Acharya, I. Diakonikolas, J. Li, and L. Schmidt. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, pages 1278-1289, 2017. Available at https://arxiv.org/abs/1506.00671.

[An95] M. Y. An. Log-concave probability distributions: Theory and statistical testing. Technical Report, Economics Working Paper Archive at WUSTL, Washington University at St. Louis, 1995.

[ASE92] N. Alon, J. Spencer, and P. Erdős. The Probabilistic Method. Wiley-Interscience, New York, 1992.

[BB05] M. Bagnoli and T. Bergstrom. Log-concave probability and its applications. Economic Theory, 26(2):445-469, 2005.

[BBBB72] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical Inference under Order Restrictions. Wiley, New York, 1972.

[BD18] F. Balabdaoui and C. R. Doss. Inference for a two-component mixture of symmetric distributions under log-concavity. Bernoulli, 24(2):1053-1071, 2018.

[Bir87a] L. Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk. Annals of Statistics, 15(3):995-1012, 1987.

[Bir87b] L. Birgé.
On the risk of histograms for estimating decreasing densities. Annals of Statistics, 15(3):1013–1022, 1987. [Bru58] H. D. Brunk. On the estimation of parameters restricted by inequalities. The Annals of Mathematical Statistics, 29(2):pp. 437–454, 1958. [BRW09] F. Balabdaoui, K. Rufibach, and J. A. Wellner. Limit distribution theory for maximum likelihood estimation of a log-concave density. The Annals of Statistics, 37(3):pp. 1299–1331, 2009. [BW07] F. Balabdaoui and J. A. Wellner. Estimation of a k-monotone density: Limit distribution theory and the spline connection. The Annals of Statistics, 35(6):pp. 2536–2564, 2007. 18 [BW10] F. Balabdaoui and J. A. Wellner. Estimation of a k-monotone density: characterizations, consistency and minimax lower bounds. Statistica Neerlandica, 64(1):45–70, 2010. [CDGR16] C. L. Canonne, I. Diakonikolas, T. Gouleakis, and R. Rubinfeld. Testing shape restrictions of discrete distributions. In STACS, pages 25:1–25:14, 2016. [CDSS13] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Learning mixtures of structured distributions over discrete domains. In SODA, pages 1380–1394, 2013. [CDSS14a] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Efficient density estimation via piecewise polynomial approximation. In STOC, pages 604–613, 2014. [CDSS14b] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Near-optimal density estimation in nearlinear time using variable-width histograms. In NIPS, pages 1844–1852, 2014. [CS13] Y. Chen and R. J. Samworth. Smoothed log-concave maximum likelihood estimation with applications. Statist. Sinica, 23:1373–1398, 2013. [CSS10] M. Cule, R. Samworth, and M. Stewart. Maximum likelihood estimation of a multidimensional log-concave density. Journal of the Royal Statistical Society: Series B, 72:545– 607, 2010. [CT04] K.S. Chan and H. Tong. Testing for multimodality with dependent data. Biometrika, 91(1):113– 123, 2004. [DDKT16] C. Daskalakis, A. De, G. Kamath, and C. Tzamos. A size-free CLT for poisson multinomials and its applications. In Proceedings of the 48th Annual ACM Symposium on the Theory of Computing, STOC ’16, 2016. [DDO+ 13] C. Daskalakis, I. Diakonikolas, R. O’Donnell, R.A. Servedio, and L. Tan. Learning Sums of Independent Integer Random Variables. In FOCS, pages 217–226, 2013. [DDS12a] C. Daskalakis, I. Diakonikolas, and R.A. Servedio. Learning k-modal distributions via testing. In SODA, pages 1371–1385, 2012. [DDS12b] C. Daskalakis, I. Diakonikolas, and R.A. Servedio. Learning Poisson Binomial Distributions. In STOC, pages 709–728, 2012. [DKS16a] I. Diakonikolas, D. M. Kane, and A. Stewart. Efficient Robust Proper Learning of Log-concave Distributions. Arxiv report, 2016. [DKS16b] I. Diakonikolas, D. M. Kane, and A. Stewart. The fourier transform of poisson multinomial distributions and its algorithmic applications. In Proceedings of STOC’16, 2016. [DKS16c] I. Diakonikolas, D. M. Kane, and A. Stewart. Optimal learning via the fourier transform for sums of independent integer random variables. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, pages 831–849, 2016. Full version available at https://arxiv.org/abs/1505.00662. [DKS16d] I. Diakonikolas, D. M. Kane, and A. Stewart. Properly learning poisson binomial distributions in almost polynomial time. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, pages 850–878, 2016. Full version available at https://arxiv.org/abs/1511.04066. 19 [DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. 
Learning multivariate log-concave distributions. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 711–727, 2017. [DL01] L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer, 2001. [DR09] L. Dumbgen and K. Rufibach. Maximum likelihood estimation of a log-concave density and its distribution function: Basic properties and uniform consistency. Bernoulli, 15(1):40–68, 2009. [DW16] C. R. Doss and J. A. Wellner. Global rates of convergence of the mles of log-concave and s-concave densities. Ann. Statist., 44(3):954–981, 06 2016. [Fou97] A.-L. Fougères. Estimation de densités unimodales. Canadian Journal of Statistics, 25:375– 387, 1997. [GJ14] P. Groeneboom and G. Jongbloed. Nonparametric Estimation under Shape Constraints: Estimators, Algorithms and Asymptotics. Cambridge University Press, 2014. [GMR95] Y. Gordon, M. Meyer, and S. Reisner. Constructing a polytope to approximate a convex body. Geometriae Dedicata, 57(2):217–222, 1995. [Gre56] U. Grenander. On the theory of mortality measurement. Skand. Aktuarietidskr., 39:125–153, 1956. [Gro85] P. Groeneboom. Estimating a monotone density. In Proc. of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, pages 539–555, 1985. [GW09] F. Gao and J. A. Wellner. On the rate of convergence of the maximum likelihood estimator of a k-monotone density. Science in China Series A: Mathematics, 52:1525–1538, 2009. [HP76] D. L. Hanson and G. Pledger. Consistency in concave regression. The Annals of Statistics, 4(6):pp. 1038–1050, 1976. [HW16] Q. Han and J. A. Wellner. Approximation and estimation of s-concave densities via renyi divergences. Ann. Statist., 44(3):1332–1359, 06 2016. [JW09] H. K. Jankowski and J. A. Wellner. Estimation of a discrete monotone density. Electronic Journal of Statistics, 3:1567–1605, 2009. [KGS16] A. Kim, A. Guntuboyina, and R. J. Samworth. Adaptation in log-concave density estimation. ArXiv e-prints, 2016. Available at http://arxiv.org/abs/1609.00861. [KM10] R. Koenker and I. Mizera. Quasi-concave density estimation. Ann. Statist., 38(5):2998–3027, 2010. [KS16] A. K. H. Kim and R. J. Samworth. Global rates of convergence in log-concave density estimation. Ann. Statist., 44(6):2756–2779, 12 2016. Available at http://arxiv.org/abs/1404.2298. [LV07] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures and Algorithms, 30(3):307–358, 2007. [Rao69] B.L.S. Prakasa Rao. Estimation of a unimodal density. Sankhya Ser. A, 31:23–36, 1969. [RSU17] E. Robeva, B. Sturmfels, and C. Uhler. Geometry of Log-Concave Density Estimation. ArXiv e-prints, 2017. Available at https://arxiv.org/abs/1704.01910. 20 [Sam17] R. J. Samworth. Recent progress in log-concave density estimation. ArXiv e-prints, 2017. [Sta89] R. P. Stanley. Log-concave and unimodal sequences in algebra, combinatorics, and geometry. Annals of the New York Academy of Sciences, 576(1):500–535, 1989. [SW14] A. Saumard and J. A. Wellner. Log-concavity and strong log-concavity: A review. Statist. Surv., 8:45–114, 2014. [VV16] G. Valiant and P. Valiant. Instance optimal learning of discrete distributions. In Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing, STOC ’16, pages 142–155, 2016. [Wal09] G. Walther. Inference and modeling with log-concave distributions. Stat. Science, 24:319–327, 2009. [Weg70] E.J. Wegman. Maximum likelihood estimation of a unimodal density. I. and II. Ann. Math. Statist., 41:457–471, 2169–2174, 1970. [Wel15] J. A. 
Wellner. Nonparametric estimation of s-concave and log-concave densities: an alternative to maximum likelihood. Talk given at European Meeting of Statisticians, Amsterdam, 2015. Available at https://www.stat.washington.edu/jaw/RESEARCH/TALKS/EMS-2015.1rev1.pdf. A Deferred Proofs A.1 Proof of Lemma 3.2 Let R = Lf (Mf /e). Then using the fact that if y ≤ Mf /e then R ⊆ Lf (y), we have that Z Z Z Mf 1= vol(Lf (y))dy ≥ vol(Lf (y))dy ≥ vol(R)dy = · vol(R) e R+ 0≤y≤Mf /e 0≤y≤Mf /e (36) Suppose that f (x) ≥ Mf e−z , for some x ∈ Rd . By the definition of log-concavity we have f (x/z) ≥ f (0)(z−1)/z f (x)1/z . We may assume w.l.o.g. that f has mean 0, and thus f (0) = Mf . By the assumption (z−1)/z (z−1)/z 1/z we get f (x/z) ≥ Mf (Mf /ez )1/z = Mf Mf /e = Mf /e. Thus x/z ∈ R, and so x ∈ zR. Therefore Lf (Mf e−z ) ⊆ zR. Thus by (36) we get vol(Lf (Mf e−z )) ≤ vol(zR) ≤ O(z d ) · vol(R) = O(z d /Mf ), which proved the first part of the assertion. 21 (37) It remains to prove the second part. We have PrX∼f [f (X) ≤ Mf e −z ]≤ = Mf e−z Z vol(Lf (y))dy Z0 ∞ z ≤ = ≤ Z z Z ∞ ∞ Zz ∞ vol(Lf (Mf e−x ))Mf e−x dx (setting y = Mf e−x ) O(xd /Mf )Mf e−x dx (by (37)) O(xd e−x )dx O(d)d e−x/2 dx (since ex/2 ≥ (x/2)d /d!) z = O(d)d e−z/2 , which concludes the proof. A.2 Proof of Lemma 3.6 We begin with a few common definitions and observations. If X is a random variable defined on a probability space (Ω, Σ, P ), then the expected value E[X] of X is defined as the Lebesgue integral Z E[X] = X(ω)dP (ω). Ω Next, we define two functions X+ (ω) = max(X(ω), 0) and X− (ω) = − min(X(ω), 0). We observe that these functions are both measurable (and therefore also random variables), and that E[X] = E[X+ ] − E[X− ]. Finally, we observe that if X : Ω → R≥0 ∪ {∞} is a non-negative random variable then Z ∞ E[X] = Pr[X > x]dx. 0 Similarly, if X : Ω → R≥0 ∪ {−∞} is a non-positive random variable then E[X] = − Z 0 Pr[X < x]dx. −∞ 22 Applying the definitions and observations of the previous paragraph, we have the following derivation: EY ∼g [φ(Y )] − EY ∼h [φ(Y )] = (EY ∼g [φ(Y )+ ] − EY ∼g [φ(Y )− ]) − (EY ∼h [φ(Y )+ ] − EY ∼h [φ(Y )− ]) = (EY ∼g [φ(Y )+ ] + EY ∼g [−φ(Y )− ]) − (EY ∼h [φ(Y )+ ] + EY ∼h [−φ(Y )− ])  Z ∞ Z 0 PrY ∼g [−φ(Y )− < x]dx PrY ∼g [φ(Y )+ > x]dx + = −∞ 0 Z ∞  Z 0 − PrY ∼h [φ(Y )+ > x]dx + PrY ∼h [−φ(Y )− < x]dx 0 −∞  Z ∞ Z 0 PrY ∼g [φ(Y ) < x]dx PrY ∼g [φ(Y ) > x]dx + = 0 −∞  Z ∞ Z 0 PrY ∼h [φ(Y ) < x]dx PrY ∼h [φ(Y ) > x]dx + − −∞ Z ∞ 0 PrY ∼g [φ(Y ) > x] − PrY ∼h [φ(Y ) > x]dx = 0 Z 0 + PrY ∼g [φ(Y ) < x] − PrY ∼h [φ(Y ) < x]dx Z ∞−∞ (1 − PrY ∼g [φ(Y ) < x]) − (1 − PrY ∼h [φ(Y ) < x])dx = 0 Z 0 PrY ∼g [φ(Y ) < x] − PrY ∼h [φ(Y ) < x]dx + Z ∞−∞ = PrY ∼h [φ(Y ) < x] − PrY ∼g [φ(Y ) < x])dx 0 Z 0 + PrY ∼g [φ(Y ) < x] − PrY ∼h [φ(Y ) < x]dx −∞ Z ∞ |PrY ∼g [φ(Y ) < x] − PrY ∼h [φ(Y ) < x]| dx ≤ 0 Z 0 + |PrY ∼g [φ(Y ) < x] − PrY ∼h [φ(Y ) < x]| dx −∞ Z ∞ = |PrY ∼g [φ(Y ) < x] − PrY ∼h [φ(Y ) < x]| dx. −∞ A symmetric argument shows that EY ∼h [φ(Y )] − EY ∼g [φ(Y )] ≤ Z ∞ |PrY ∼h [φ(Y ) < x] − PrY ∼g [φ(Y ) < x]| dx, −∞ concluding the proof. A.3 Proof of Claim 3.13 Recall that   Mf 0 · O (2 + 2 ln Mf − 2 ln pmin +2ǫ)d Mf  d  Mf 0 < . · O 2 ln Mf − 2 ln Mf0 + 3 ln(n4 100/τ 2 ) Mf vol(A) · Mf0 ≤ 23 We search for Mf such that vol(A) · Mf0 ≤ 1/3. 
It is sufficient for Mf to satisfy, for some constant c > 1, d Mf0 /Mf · c 2 ln Mf − 2 ln Mf0 + 3 ln n6 ≤ 1/3  d  ln (Mf0 /Mf ) · c 2 ln(Mf /Mf0 ) + 3 ln n6 ≤ ln(1/3)   d ln (Mf0 /Mf ) + ln c + ln 2 ln(Mf /Mf0 ) + 3 ln n6 ≤ ln(1/3)   d + ln(3c) ≤ ln (Mf /Mf0 ) ln 2 ln(Mf /Mf0 ) + 3 ln n6  d ln 2 ln(Mf /Mf0 ) + 3 ln(n4 100/τ 2 ) + ln(c) ≤ ln (Mf /Mf0 ) . (38) If we have Mf such that ln(Mf /Mf0 ) ≥ 3 ln(n4 100/τ 2 ), and a sufficiently large constant is chosen for N1 so that ln(c) ≤ ln(n4 100/τ 2 ), then (38) becomes d ln (6 ln(Mf /Mf0 )) ≤ 2 ln (Mf /Mf0 ) . (39) The next inequality is equivalent to (39): (6 ln(Mf /Mf0 ))d/2 ≤ Mf /Mf0 We note that the derivative of (6 ln x)d/2 is 6d/2 d(ln x)d/2−1 . 2x We also note that for x = (36d)d/2+1 (ln(36d))d/2+1 we have that 6d/2 d(ln x)d/2−1 ≤ 1. 2x and (6 ln x)d/2 = [6(d/2 + 1) ln(36d ln(36d)]d/2 = 6d/2 · dd/2 · [2 ln(36d)]d/2 ≤ x. Therefore, assuming sufficiently large constants are chosen in the definition of N1 , if ln(Mf /Mf0 ) ≥ 3 ln(n4 100/τ 2 ) then vol(A) · Mf0 ≤ 1/3. A.4 Proof of Claim 4.3 By Lemma 4.1 we have that the VC dimension of A is V ≤ 2(d + 1)H ln((d + 1)H), and so V ≤ (10κ)(d+1)/2 d(d+5)/2 (ln(100n4 /τ 2 ))d /δ)(d+1)/2 . Noting that E[|f0 (T ) − fn (T )|] ≤ E[||f0 − fn ||A ], by Theorem 2.1 we get that r O(V ) E[|f0 (T ) − fn (T )|] ≤ s n  O (10κ)(d+1)/2 d(d+5)/2 (ln(100n4 /τ 2 ))d /δ)(d+1)/2 . ≤ n 24 For the next part we want that E[|f0 (T ) − fn (T )|] ≤ δ/10. This holds when  (d+5)/2 n = Ω (d/ǫ)(ln(100n4 /τ 2 ))(d+1) If n ≥ b cd+1 (d(2d+3) /ǫ)(ln(d(d+1) /(ǫτ )))(d+1) have (d+5)/2 for some constants b > 1, c ≥ 100 ln c, then we  (d+1) (d/ǫ)(ln(100n4 /τ 2 ))(d+1) ≤ (d(2d+3) /ǫ)(100 ln c)(d+1) ln(d(d+1) /(ǫτ )) ≤ cd+1 (d(2d+3) /ǫ)(ln(d(d+1) /(ǫτ )))(d+1) and therefore n = Ω (d/ǫ)(ln(100n4 /τ 2 )(d+1) (d+5)/2 as desired. Therefore, for n ≥ N2 we have E[|f0 (T ) − fn (T )|] ≤ δ/10. (40) A.5 Proof of Lemma 5.1 Consider an arbitrary set T of t points in Rd . We wish to bound the number of possible distinct sets that can be obtained by the intersection of T with a set in AH,L . We note that AH,L can also be constructed in the following manner: Take an arrangement consisting of at most H ·L hyperplanes. This arrangement partitions Rd into a set of components. Then, the union of subsets of these components are elements of AH,L . Any halfspace can be perturbed, without changing its intersection with T , so that its boundary intersects d′ + 1 points in T , where d′ ≤ d is the dimension of the affine subspace spanned by T . Any such subset uniquely determines the intersection of the halfspace with T . Therefore, the number of possible intersections with a set of size t is at most O(t)d . It follows then that the number of possible intersections of any A ∈ AH,L and any set of size t is at most (O(t)d )LH ≤ O(t)dLH . If A has VC dimension t, then is must be that O(t)dLH ≥ 2t , and therefore t/ log(t) = O(dLH). A.6 Proof of Claim 5.2 Recalling that L = ln(100n4 /τ 2 ) and H = (10κd/δ)(d−1)/2 , we have that   V / ln(V ) = O d · ln(100n4 /τ 2 ) · (10κd/δ)(d−1)/2   = O (10κ)(d−1)/2 d(d+1)/2 ln(100n4 /τ 2 )/δ (d−1)/2 . We note that   d−1  ln (10κ)(d−1)/2 d(d+3)/2 (ln(100n4 /τ 2 ))2 /δ (d−1)/2 ≤ ln (10κ)d3 (ln(100n4 /τ 2 ))6 /δ 2  ≤ d ln (10κ)d3 (ln(100n4 /τ 2 ))6 /δ ≤ cd ln(ln(100n4 /τ 2 )) for some sufficiently large constant c. Therefore, letting   V = O (10κ)(d−1)/2 d(d+3)/2 (ln(100n4 /τ 2 ))2 /δ (d−1)/2 satisfies (41). 
Therefore, we have that s r  α · O (10κ)(d−1)/2 d(d+3)/2 (ln(100n4 /τ 2 ))2 /δ (d−1)/2 αV = , n n 25 (41) and thus when we have that q αV n   n = Ω (10κ)(d−1)/2 d(d+3)/2 (ln(100n4 /τ 2 ))(d+7)/2 /ǫ(d+3)/2 (42) ≤ δ/10. To simplify (42), we note that the d(d+3)/2 (ln(100n4 /τ 2 ))(d+7)/2 /ǫ(d+3)/2 ≤ (d/ǫ)(ln(100n4 /τ 2 ))2 (d+3)/2 . Thus, if we let n = (c(d2 /ǫ)(ln(d/(ǫτ )))3 )(d+3)/2 for some large constant c, then we have that d+3 ln(100c(d2 /ǫ)(ln(d/(ǫτ )))3 ) + ln(1/τ 2 ) 2 ≤ c′ d ln(d/(ǫτ )) ln(100n4 /τ 2 ) = for some large constant c′ . Thus, assuming a sufficiently large constant is chosen, for n ≥ N1 we have that (42) holds, and therefore r αV ≤ δ/10. (43) n A.7 Proof of Claim 5.3 For any choice of the samples X1 , . . . , Xn , we have fn (CL ) ≥ fn (C in ) (since C in ⊆ CL ) ≥ f0 (C in ) − |f0 (C in ) − fn (C in )| = f0 (CL ) − f0 (CL \ C in ) − |f0 (C in ) − fn (C in )| δ ≥ f0 (CL ) − − |f0 (C in ) − fn (C in )|. 2 (by (30)) (44) Thus Prfn [fn (CL ) ≥ f0 (CL ) − 7δ/10] ≥ Prfn [|f0 (C in ) − fn (C in )| ≤ δ/5] in (by (44)) in ≥ Prfn [|f0 (C ) − fn (C )| ≤ δ/5|E2 ] · Prfn [E2 ] ≥ 1 − τ /10. (by (34)) (45) Similarly, for any choice of the samples X1 , . . . , Xn , we have fn (CL ) ≤ fn (C out ) (since CL ⊆ C out ) ≤ f0 (C out ) + |f0 (C out ) − fn (C out )| = f0 (CL ) + f0 (C out \ CL ) − |f0 (C out ) − fn (C out )| δ ≤ f0 (CL ) + + |f0 (C out ) − fn (C out )|. 2 (by (32)) (46) Thus Prfn [fn (CL ) ≤ f0 (CL ) + 7δ/10] ≥ Prfn [|f0 (C out ) − fn (C out )| ≤ δ/5] ≥ Prfn [|f0 (C out ) − fn (C out (by (46)) )| ≤ δ/5|E3 ] · Prfn [E3 ] ≥ 1 − τ /10. (by (35)) (47) By (45) and (47) and the union bound, we obtain Prfn [|fn (CL ) − f0 (CL )| ≤ 7δ/10] ≥ 1 − τ /5, concluding the proof. 26