Consistent Spectral Clustering in Sparse
Consistent Spectral Clustering in Sparse
24 January 2025
Abstract
High-order clustering aims to classify objects in multiway datasets that are preva-
lent in various fields such as bioinformatics, social network analysis, and recommen-
dation systems. These tasks often involve data that is sparse and high-dimensional,
presenting significant statistical and computational challenges. This paper introduces
a tensor block model specifically designed for sparse integer-valued data tensors. We
propose a simple spectral clustering algorithm augmented with a trimming step to mit-
igate noise fluctuations, and identify a density threshold that ensures the algorithm’s
consistency. Our approach models sparsity using a sub-Poisson noise concentration
framework, accommodating heavier than sub-Gaussian tails. Remarkably, this natural
class of tensor block models is closed under aggregation across arbitrary modes. Con-
sequently, we obtain a comprehensive framework for evaluating the tradeoff between
signal loss and noise reduction during data aggregation. The analysis is based on a
novel concentration bound for sparse random Gram matrices. The theoretical findings
are illustrated through simulation experiments.
Keywords: latent block model, stochastic block model, almost exact recovery, weak con-
sistency, multiway clustering, higher-order network, hypergraph
1 Introduction
Multiway clustering is a statistical problem where an observed data array Y ∈ Rn1 ×···×nd
of order d is analyzed by grouping (i.e., clustering) together similar entities ik = 1, . . . , nk
corresponding to slices Y:···:ik :···: along some of the modes k = 1, . . . , d. An example of a
higher-order data is multitissue gene expression data [23, 32, 40], where the three modes
correspond to individuals, genes and tissues. An important special case of a multiway array
is a binary array with equal number of entities in each mode Y ∈ {0, 1}n×···×n . Such an array
can be seen as a hypergraph on n nodes such that an entry Y(i1 , . . . , id ) = 1 indicates the
1
presence of a hyperedge between the nodes i1 , . . . , id . In this context, Y is sometimes called
an adjacency tensor. Hypergraphs have been used to model ternary relations between genes,
which can be beneficial in revealing biological regulatory mechanisms [26].
The main objective is to determine when an underlying cluster structure can be recovered
from data. In short, this paper introduces a spectral clustering algorithm, proves that the
algorithm clusters correctly almost every entity assuming a statistical model, and develops
the necessary random matrix theory to prove it.
The assumed statistical model is an integer-valued tensor block model Y = X + E ∈
n1 ×···×nd
Z , where Y is an observed d-way data array, X = EY is a nonrandom signal array and
E = Y − EY is a random noise array with independent entries. Each mode k = 1, . . . , d has a
cluster structure represented by a cluster assignment functions zk : {1, . . . , nk } → {1, . . . , rk }
(cluster of ik along mode k is zk (ik )) so that the signal entries depend only on the clusters
of the indices, i.e., X (i1 , . . . , id ) = ρS(z1 (i1 ), . . . , zd (id )) for some scaling parameter ρ ∈ R
and a core array S ∈ [−1, 1]r1 ×···×rd . The aim of multiway clustering is to recover zk up to
permutation of cluster labels from the observation Y. In this setting, we may consider the
core array as fixed and ρ may vary as the size of the data array increases. In the case of
binary data, ρ controls the expected number of ones in Y so that large values of ρ correspond
to dense tensor. The theoretical question of interest is to find how small ρ can be for the
recovery to be possible and computationally feasible.
As the main motivation comes from hypergraphs, the focus is mostly on binary arrays.
However, the developed theory includes also other more general integer-valued distributions
such as Poisson distributions. When analyzing tensor block models, this paper assumes in-
dependent entries in data array Y. This simplifies many arguments, but excludes symmetric
data tensors that satisfy Y(i1 . . . id ) = Y(iπ(1) . . . iπ(d) ) for all permutation π on {1, . . . , d}.
Such tensors arise from undirected hypergraphs, for example.
This paper introduces a spectral clustering algorithm which flattens the data array into
a wide nk × l̸=k nl -matrix, after which it reduces the dimension of the row vectors from
Q
Q
l̸=k nl to the number of clusters rk via singular value decomposition, and finally clusters the
low-dimensional row vectors with a k-means clustering algorithm. The main contribution of
this paper is showing weak consistency of this algorithm when the density parameter satisfies
ρ ≫ ( l nl )−1/2 . Based on both simulations and theoretical results, this paper also argues
Q
2
n
a d-uniform hypergraph is impossible for density parameter ρ ≤ c nlog d−1 with a sufficiently
small constant c. Furthermore, they showed that fast exact clustering is possible under
n
certain assumptions on the core array S for density parameter ρ ≥ C nlog d−1 with a sufficiently
large constant C. Essentially, their algorithm reduces the higher-order array to a matrix
by summing over higher-order modes to obtain an aggregate matrix with entries Y (i, j) =
i3 ,...,id Y(i, j, i3 , . . . , id ) counting the number of common hyperedges between pairs of nodes.
P
After this reduction, the algorithm performs spectral clustering on Y . A downside of this
algorithm is that the matrix Y needs to be informative. If this assumption fails, i.e., if the
matrix Y is not informative, then the cluster structure cannot be inferred from Y .
This motivates to study when and how the cluster structure can be recovered from the
data array when the cluster structure is unidentifiable from the aggregate matrix. In the
context of multilayer networks, Lei and Lin [29] showed that an algorithm similar to the
1/2
algorithm proposed in this paper is successful for ρ ≫ lognd/2 n . Ke, Shi and Xia [25] proved
an analogous result in hypergraph degree-corrected block models. We improve this bound by
√ n
a factor of log n. This threshold is somewhat unexpected as it differs from nlog d−1 substantially
n
for d ≥ 3 and raises the question what happens in regime nlog d−1 ≪ ρ ≪ n
−d/2
. By relying
on a conjecture, Lei, Zhang and Zhu [31] showed that every polynomial-time algorithm fails
1
asymptotically for ρ ≤ 2nd/2 log 1.4 . That is, there is a statistical-computational gap meaning
n
that there is a regime for which it is possible to recover the clusters statistically, but not in
a computationally feasible way.
A similar but sharper statistical-computational gap has been established for a subgaussian
tensor block model in [21]. This suggests a similar phenomenon, that exact clustering is
n
statistically possible for ρ ≥ C nlog d−1 with a sufficiently large constant C and computationally
1.2 Organization
This paper is structured as follows. Section 2 discusses the used tensor formalism and statis-
tical model. Section 3 presents a spectral clustering algorithm and states the main theorem
describing sufficient conditions for the consistency of the algorithm. Section 4 demonstrates
the main results with numerical experiments. Section 5 discusses related literature. Finally,
Section 6 gives an overview of the intermediate results needed to prove the main theorem.
Technical arguments are postponed to Appendices A, B, C and D.
3
2 Tensor block model
2.1 Notation
Here R denotes the set of real numbers, R≥0 the set of nonnegative real numbers, Z the set of
integers and Z≥0 the set of nonnegative integers. We denote [n] = {1, . . . , n}. The indicator
of a statement A is denoted by I{A}, i.e., I{A} = 1 if A is true and I{A} = 0 otherwise. The
natural logarithm is denoted by log. The minimum and maximum of x, y ∈ R are denoted
by x ∧ y and x ∨ y respectively. The positive part of x ∈ R is denoted by (x)+ = x ∨ 0.
Given nonnegative sequences xn and yn , we write xn ≪ yn (or xn = o(yn )) if xn /yn → 0,
and xn ≲ yn (or xn = O(yn )) if the sequence xn /yn is bounded. We also write xn ≍ yn (or
xn = Θ(yn )) if xn ≲ yn and yn ≲ xn , and xn ∼ yn to indicate xn /yn → 1.
A tensor of order d is an array A ∈ Rn1 ×···×nd , where the dimensions k = 1, . . . , d
are called the modes of the tensor A. A slice of a tensor A ∈ Rn1 ×···×nd is denoted by
A:···:ik :···: ∈ Rn1 ×···×nk−1 ×nk+1 ×...nd . In the case of matrices (d = 2), Ai: denotes the i:th row
vector and A the j:th column vector of A. The Frobenius norm of a tensor is denoted by
qP :j
2 n1 ×···×nd
∥A∥F = i1 ,...,id Ai1 ...id . The elementwise product of tensors A, B ∈ R is denoted
qP
n 2
by A ⊙ B. The 2-norm of a vector u ∈ R is denoted by ∥u∥ = i ui and its 1-norm
is denoted by ∥u∥1 = i |ui |. The operator norm of a matrix A ∈ Rn×m is denoted by
P
(1) (d)
A = C ×1 F (1) ×2 · · · ×d F (d) ,
X
Ai1 ,...,id = Cj1 ,...,jd Fi1 j1 . . . Fid jd ,
j1 ,...,jd
where C ∈ Rm1 ×···×md is called the core tensor and F (k) ∈ Rnk ×mk are called factor matrices.
In the literature as in [13], the matrices F (k) are sometimes assumed to have orthonormal
columns ((F (k) )T F (k) = I), which is not assumed in this work.
n1 ···nd
n ×
The k-matricization of a tensor A ∈ Rn1 ×···×nd is defined as a matrix matk (A) ∈ R k nk
so that the row index gives the index of the k:th mode and the column index gives the rest.
More formally, the entries of a matricized tensor are given by matk (A)ik j(i1 ,...,ik−1 ,ik+1 ,...,id ) =
Ai1 ,...,id , where j : [n1 ] × · · · × [nk−1 ] × [nk+1 ] × · · · × [nd ] → [n1 · · · nd /nk ] is some bijec-
tive enumeration map depending on the chosen convention. A possible convention is the
4
lexicographic order, i.e.,
Ai1 ,...,id = mat1 (A)i1 ,i2 +n2 (i3 −1)+···+n2 ...nd−1 (id −1) .
As stated in [13], the k-mode product can be seen as the matrix product via matk (A ×k
B) = B matk A, successive products can be calculated as (A ×k B) ×k C = A ×k CB and
(A ×k B) ×l C = (A ×l C) ×k B for k ̸= l.
where ρ ≥ 0 is the density, S ∈ [−1, 1]r1 ×···×rd is the normalized core tensor, and z1 ∈ [n1 ]r1 ,
..., zd ∈ [nd ]rd are the cluster membership vectors. Here we say that Y is a sample from
TBM(ρ, S, z1 , . . . , zd ). In this model X = EY is called the signal tensor, and E = Y − EY
the noise tensor. We say that the cluster (also known as the community or the block) of
ik ∈ [nk ] along mode k is lk if zk (ik ) = lk .
In this work, we are interested in large and sparse TBMs. A large TBM is a sequence of
TBMs indexed by ν = 1, 2, . . . so that the order d is fixed and the data tensor grows according
to n1,ν · · · nd,ν → ∞ as ν → ∞. A sequence of integer-valued TBMs (i.e., Yν ∈ Zn1,ν ×···×nd,ν )
is considered sparse if maxi1 ,...,id E|Yν,i1 ...id | → 0 as ν → ∞.
In a TBM, the main objective is to recover the underlying cluster structure (z1 , . . . , zd )
from the data Y. Figure 1 illustrates this statistical problem. The loss of an estimated
cluster structure is measured with a misclassification rate defined as follows.
Definition 2.2. Given a labeling function z : [n] → [r] and an estimate ẑ : [n] → [r], the
misclassification rate is defined as
n
1X
ℓ(z, ẑ) = min I{π(ẑ(i)) ̸= z(i)}.
permutation π:[r]→[r] n
i=1
From now on, we usually omit the index ν used in Definition 2.2. That is, instead of
writing xν ≫ yν we may write x ≫ y. Weak consistency is indicated by P(ℓ(z, ẑ) ≥ ε) → 0
for all ε > 0, and strong consistency is indicated by P(ℓ(z, ẑ) = 0) → 1.
5
(a) X (b) Clustered X
Figure 1: In this example, the data tensor Y is a 40×40×40-binary tensor and the underlying
signal X has a symmetric core tensor S and the cluster structures are the same in every mode
(z1 = z2 = z3 ). Black dots represent value 1, and zeros are not drawn.
To control the amount of noise, the separation of clusters, and the amount of data per
cluster, we will often assume the following.
Assumption 2.1 (MGF bound). An integrable random variable X is said to satisfy the
MGF bound with variance proxy σ 2 ≥ 0 if
2 (e|λ| −1−|λ|)
Eeλ(X−EX) ≤ eσ for all λ ∈ R. (2.1)
Assumption 2.3 (Balanced clusters). A TBM with membership vectors zk is said to have
balanced clusters if |zk−1 {lk }| ≍ nk /rk for all lk and k. Then a constant α ∈ (0, 1) (not
depending on ν) satisfying |zk−1 {lk }| ≥ αnk /rk for all lk and k is called a cluster balance
coefficient.
Assumption 2.1 is stronger than the subexponential assumption but weaker than the
subgaussian assumption. For example, Poisson distributions satisfy Assumption 2.1 but have
heavier than Gaussian tails. For more details about what kind of random variables satisfy
6
the MGF bound, see Lemma D.1. We refer to [39] for further details on subexponential and
subgaussian distributions.
The term variance proxy is motivated by the fact that Var(X) ≤ σ 2 for every random
variable X satisfying Assumption 2.1. This follows from writing inequality (2.1) with Taylor
2 2 2 3 2 2
series 1 + λ Var(X)
2
+ O(λ3 ) ≤ eσ λ /2+O(λ ) = 1 + σ 2λ + O(λ3 ).
Example 2.1 (Bernoulli TBM). Consider a TBM with Bernoulli distributed data en-
d
tries Yi1 ...id = Ber(Xi1 ...id ) and a core tensor ρS, where ρ ∈ [0, 1] is a scalar control-
ling sparsity and S ∈ [0, 1]r1 ×···×rd . Then the data entries have variances Var(Yi1 ...id ) =
ρSz1 (i1 )...zd (id ) (1 − ρSz1 (i1 )...zd (id ) ) ≤ ρ and the noise satisfies Assumption 2.1 with variance
proxy ρ (Lemma D.1:(ii)).
Example 2.2 (Poisson TBM). Consider a TBM with Poisson distributed data entries
d
Yi1 ...id = Poi(Xi1 ...id ) and a core tensor ρS, where ρ ∈ [0, 1] is a scalar controlling sparsity
and S ∈ [0, 1]r1 ×···×rd . Then the data entries have variances Var(Yi1 ...id ) = ρSz1 (i1 )...zd (id ) ≤ ρ
and the noise satisfies Assumption 2.1 with variance proxy ρ (Lemma D.1:(vi)).
3 Main results
3.1 Algorithm
Algorithm 1 represents a spectral clustering algorithm for recovering a latent cluster structure
along mode k. To estimate the cluster membership vectors of all modes, it may be run d
times, once for each mode.
Compute the best rank-rk approximation A = Û Λ̂Û T + Û⊥ Λ̂⊥ Û⊥T ≈ Û Λ̂Û T , where Λ̂
contains the rk largest eigenvalues in absolute value
Cluster the rows of Û Λ̂ into rk clusters by finding a labeling function ẑk and
centroids θ̂1 , . . . , θ̂rk ∈ Rrk satisfying
j z̆,θ̆ j
7
To motivate Algorithm 1, consider a sample Y from TBM(ρ, S, z1 , . . . , zd ) (Definition 2.1).
(k)
Define membership matrices Z (k) ∈ {0, 1}nk ×rk by Zik jk = I(zk (ik ) = jk ) for each mode
(k)
k = 1, . . . , d. That is, Zik jk indicates whether or not the cluster of the index ik along the
mode k is jk . Then the signal tensor admits a Tucker decomposition
X = ρS ×1 Z (1) ×2 · · · ×d Z (d) .
From linear algebraic perspective, clustering amounts to inferring a low-rank Tucker decom-
position from a perturbed data tensor. This motivates to estimate X by approximating Y
with a low-rank Tucker decomposition, i.e., to minimize the distance
where the matrices Ŭ (k) ∈ Rnk ×rk are imposed to have orthonormal columns and S̆ ∈
Rr1 ×···×rd is a tensor. Unfortunately, minimizing (3.1) is known to be NP-hard in gen-
eral, already when the core dimensions are ones [22]. Nonetheless, fast algorithms have
been developed to approximately minimize (3.1). These include higher-order singular value
decomposition (HOSVD) [13], higher-order orthogonal iteration (HOOI) [14] and simple vari-
ations of these such as sequentially truncated HOSVD [38]. These are based on observing
that the matricization matk Y is a noisy version of its expectation
matk X = Z (k) matk ρS ×1 Z (1) ×2 · · · ×k−1 Z (k−1) ×k+1 Z (k+1) ×k+2 · · · ×d Z (d)
that has rank at most rk . This motivates to estimate matk X with a rank-rk approximation
of matk Y, or alternatively, estimate (matk X )(matk X )T with a rank-rk approximation of
(matk Y)(matk Y)T . HOSVD repeats this rank approximation over each mode and HOOI
iterates this type of computation multiple times. Algorithm 1, in turn, essentially does this
rank approximation only once before the final clustering step. Although simple, this step is
crucial for HOSVD as it provides the very first initialization.
After dimension reduction, Algorithm 1 clusters the low-dimensional vectors by solving
k-means minimization task quasi-optimally. Exact minimization is known to be NP-hard
in general [15], but quasi-optimal solutions can be found with fast implementations such as
k-means++ with quasi-optimality constant O(log r), where r is the number of clusters [2].
Since k-means++ is a randomized algorithm, it is guaranteed to be quasi-optimal only in
expectation rather than always.
Two key differences distinguish Algorithm 1 from standard spectral clustering. First, the
diagonal entries of the Gram matrix (matk Y)(matk Y)T are zeroed giving a hollow Gram
matrix. This modification has been considered already in [29]. Second, the obtained hollow
Gram matrix is trimmed by removing carefully selected rows and columns. In the case of
nonnegative data, those with too large L1 -norms are selected.
8
3.2 Consistency
The following theorem presents the main result of the paper. It confirms the weak consistency
of Algorithm 1 for data sampled from a sparse integer-valued TBM (Definition 2.1).
Theorem 3.1 (Weak consistency). Let Y ∈ Zn1 ×···×nd be a sample from TBM(ρ, S, z1 , . . . , zd )
with fixed d. Assume that the entries of the noise tensor E satisfy the MGF bound with vari-
ance proxy ρ (Assumption 2.1) and E|Ei1 ...id | ≤ ρ. Assume also separated clusters along
mode k with separation δ > 0 (Assumption 2.2) and balanced clusters with cluster balance
coefficient α (Assumption 2.3). Consider parameter regime
3/2
rk nk log nk
√ ∨ ≪ ρ ≪ n−1
k (n1 · · · nd /nk )
−ε
(3.2)
n1 · · · nd n1 · · · nd
2/3
for some constant ε > 0 and rk ≪ nk . Then there exists a constant C such that if
Ctrim ≥ C ∨ ε−1 , then the estimated cluster membership vector ẑk given by Algorithm 1 on
mode k is weakly consistent. Furthermore, the probability bound
!!2
−3
rk 1 ∨ ε + Ctrim 1
P ℓ(zk , ẑk ) ≥ s + CQ d− 21 2
√ +
α δ n1 · · · nd ρ nk
1 q
2 −1∨ε−1
≤ Ctrim e−n1 ···nd ρ + 3n−1
k + Cn−1
k e
s
2
holds eventually, when s is chosen to satisfy e−n1 ···nd ρ + n−1 −1
k ≪ s ≪ rk .
3.3 Aggregation
High-order data is commonly analyzed by aggregating the data tensor into a lower-order
data tensor. Theorem 3.2 shows that an aggregated TBM is also a TBM and the variance
proxy of the noise entries scales accordingly. As a consequence, Corollary 3.3 shows that
aggregating is beneficial for sparse data as long as the signal tensor remains well separated.
nd′ +1 · · · nd ρ and normalized core tensor Sj′1 ...jd′ = n ′ 1···nd id′ +1 ,...,id Sj1 ...jd′ zd′ +1 (id′ +1 )...zd (id ) .
P
d +1
Furthermore, the entries of the noise tensor satisfy the MGF bound with variance proxy ρ′
and E|Ei′1 ...id′ | ≤ ρ′ .
9
Proof. The MGF bound is a direct consequence of Lemma D.1:(iii). By the triangle inequal-
ity, E|Ei′1 ...id′ | = E| id′ +1 ,...,id Ei1 ...id | ≤ nd′ +1 · · · nd ρ.
P
■
Corollary 3.3 (Weak consistency in Aggregated TBM). Let Y ∈ Zn1 ×···×nd be a sample
from TBM(ρ, S, z1 , . . . , zd ) fixed d. Assume that the entries of the noise tensor E satisfy the
MGF bound with variance proxy ρ (Assumption 2.1) and E|Ei1 ...id | ≤ ρ. Let Y ′ ∈ Zn1 ×···×nd′
be an aggregated data tensor as in Theorem 3.2 and assume that it has separated clusters
along mode k (Assumption 2.2) and balanced clusters (Assumption 2.3). Consider parameter
regime
3/2
rk nk log nk 1
√ ∨ ≪ρ≪ (3.3)
n1 · · · nd′ nd′ +1 · · · nd n1 · · · nd nk nd′ +1 · · · nd (n1 . . . nd′ /nk )ε
2/3
for some constant ε > 0 and rk ≪ nk . Then there exists a constant C such that if
Ctrim ≥ C ∨ ε−1 , then Algorithm 1 applied to mode k of Y ′ is weakly consistent.
Proof. The claim follows from Theorem 3.1 which is applicable by Theorem 3.2. ■
4 Numerical experiments
This section presents numerical experiments to demonstrate the performance of Algorithm 1
in various sparsity regimes.
4.1 Setup
We focus on third-order tensors. The signal tensor is chosen to be symmetric so that the mode
chosen for matricization does not matter. That is, the normalized core tensor S ∈ R2×2×2
is symmetric, the observation tensor is a cube Y ∈ Zn×n×n , and every mode has the same
membership vector z ∈ {1, 2}n . The number of clusters is fixed to r = 2 for visualization
purposes, as the low-dimensional projections are then 2-dimensional. The two clusters are
equal sized and have constant degrees, i.e.,
X X n2 ρ X
E Y i1 i2 i3 = ρSz(i1 )z(i2 )z(i3 ) = Sz(i1 )st
i2 i3 i2 i 3 4 s,t
10
Figure 2: Symmetric core tensors S used in simulation experiments. Threeway 2 × 2 × 2
arrays are represented as cubes, where black color corresponds to value 1 and white color
corresponds to value 0. The cube on the left depicts the core tensor with a noninformative
aggregate matrix and the cube on the right with an informative aggregate matrix.
is independent of z(i1 ). This implies that aggregating the data into an order-1 tensor loses
information. Similar to hypergraphs, the data tensor is assumed to be binary Y ∈ {0, 1}n×n×n
so that all entries are independent Bernoulli random variables (see Example 2.1). In a single
experiment, the normalized core tensor S is fixed, and the unnormalized core tensor is ρS,
where ρ is allowed to vary. Now, the model can be written as
d
Yi1 i2 i3 = Ber(ρSz(i1 )z(i2 )z(i3 ) ).
We investigate two instances of the normalized core tensor S, one with informative aggrega-
tion and one with noninformative. Recall that aggregating one mode yields TBM(ρ′ , S ′ , z, z)
with a density parameter ρ′ = nρ and a normalized corei tensor Sl′1 l2 = n1 i3 Sl1 l2 z(i3 ) =
P
h
1
2
(Sl1 l2 1 + Sl1 l2 2 ) (Theorem 3.2). By writing S = S::1 S::2 , the two considered normalized
core tensors and their aggregated versions are
" # " #
1 0 0 0 1 0 0 1
Sinformative = , Snoninformative = ,
0 0 0 1 0 1 1 0
" # " #
1/2 0 ′ 1/2 1/2
Sinformative = , Snoninformative = .
0 1/2 1/2 1/2
11
line is fitted in logarithmic coordinates to estimate the phase transition and this is compared
to the theoretical line. In detail, the phase transition line is estimated via logistic regression
by first predicting whether the proportion of correctly clustered nodes is sufficiently high
(e.g. 0.9) given (log n, log ρ) and then the learned decision boundary is used as the estimate
for the phase transition line. Although this might not be an optimal estimator of the phase
transition line, visually this produces meaningful results. A line with theoretical slope −γ is
drawn to visualize how well the simulated phase transition line fits to the theory.
The following four algorithms are compared.
(i) Hollow SVD (Algorithm 1 with a known parameter ρ and a trimming constant Ctrim =
3). By Theorem 3.1, we expect the phase transition line to have γ = 1.5.
(ii) SVD. A simple spectral clustering which is otherwise the same as Algorithm 1 but
without trimming and diagonal masking. We expect the phase transition line to have
γ ≈ 1.33. Namely, the diagonal entries of the data Gram matrix (mat1 Y)(mat1 Y)T
have standard deviations
sX Yi2 =Yi1 i2 i3 √
1 i2 i3
SD(((mat1 Y)(mat1 Y)T )i1 i1 ) = SD Yi21 i2 i3 =
X
Var(Yi21 i2 i3 ) ≍ n ρ,
i2 ,i3 i2 ,i3
when ρ ≪ n−4/3 . Here (∗) follows from the fact that the operator norm and Frobenius
norm of low-rank matrices are comparable, see for example Corollary 2.4.3 and Problem
P2.4.7 in [20].
(iii) HSC. A simplification of the high-order spectral clustering (HSC) proposed in [21].
First, HSC calculates a low-rank approximation of the data tensor with HOSVD [13]
and essentially one iteration of HOOI [14]. In our simplification, HOSVD matricizes
the data tensor and calculates its SVD only once since as the underlying signal tensor is
symmetric. This will slightly speed up the computations. Then the algorithm clusters
the first mode of the low-rank tensor with k-means++. Since HSC is initialized with
a simple rank-approximation (via eigenvalue decomposition of (mat1 Y)(mat1 Y)T ), it
is expected not to improve γ. That is, we expect γ ≈ 1.33.
(iv) Aggregate SVD. Spectral clustering from an aggregate matrix, which is a simplified
version of an algorithm proposed in [43]. In this experiment, the algorithm computes
the aggregate matrix Aij = k Yijk , removes rows and columns with too large L1 norms
P
12
(trimming threshold Ctrim n2 ρ with Ctrim = 3), calculates the best rank-2 approximation,
and then clusters the rows with k-means++ [2]. In the original paper, the data tensor
is symmetric and hence there are small differences in calculating A. Furthermore,
the actual clustering step is not implemented with k-means++ and there is a final
refinement step, but morally these algorithms are the same. We expect the phase
transition line to have γ = 2 if the aggregate matrix is informative.
4.2 Results
Figure 3 shows a comparison of the four algorithms when the aggregate matrix is not infor-
mative. As expected, spectral clustering on the aggregate matrix gives poor results, whereas
the other three algorithms show a clear phase transition. The observed phase transition is
in line with the expected phase transition. The refining steps in HSC do not improve the
exponent γ compared to the plain SVD, but they do improve the clustering performance sig-
nificantly. Although the estimated γ-exponents differ from the expected exponents slightly,
they would be hard to distinguish from the expected exponents visually.
Figure 4 shows a comparison of the four algorithms when the aggregate matrix is in-
formative. As expected, spectral clustering on the aggregate matrix gives the best results.
The observed phase transition is in line with the expected phase transition. Although the
estimated γ-exponents differ from the expected exponents slightly, they would be hard to
distinguish from the expected exponents visually.
The only difference is that the entries with value 0 have been changed to the value 0.5. If
the entries were zero, then any two row vectors of matk Y from different clusters would be
orthogonal. This, in turn, usually makes the projected row vectors to lie on two orthogonal
lines representing different clusters, and the projected row vectors would overlap with each
other in the visualization. Furthermore, when testing HSC, the initialization algorithm is
changed from simple SVD to the proposed hollow SVD, because without any trimming step
13
Hollow SVD 1.0 SVD 1.0
0.9 0.9
Clustering accuracy
Clustering accuracy
Density ( )
Density ( )
10 2
0.8 10 2
0.8
0.7 0.7
0.6 0.6
0.5 0.5
30 60 100 180 30 60 100 180
Number of nodes (n) Number of nodes (n)
Clustering accuracy
Density ( )
Density ( )
10 2
0.8 10 2
0.8
0.7 0.7
0.6 0.6
0.5 0.5
30 60 100 180 30 60 100 180
Number of nodes (n) Number of nodes (n)
Figure 3: Comparison of the four algorithms when the aggregate matrix is not informative.
The number of nodes varies logarithmically between 30 and 180, and the sparsity parameter
varies logarithmically between 0.00185 and 0.02722. The dashed black line is the fitted
phase transition line with theoretical slope (γHollow SVD = 1.5, γSVD = 1.33, γHSC = 1.33).
The estimated slopes given by logistic regression are γ̂Hollow SVD = 1.43, γ̂SVD = 1.29 and
γ̂HSC = 1.26.
near the phase transition, there would be “outliers” with significantly larger norms, which
would force the visualizations to focus on the outliers.
Figure 5 shows the obtained projections from the initialization algorithm (hollow SVD)
and HSC for four different sparsity parameters. As the sparsity parameter decreases, the
clusters merge in the projections. It may occur that the initialization provides somewhat
meaningful projections, but the k-means algorithm cannot detect the clusters. However, such
initialization might be informative enough for HSC to improve the projection and cluster the
vectors. However, as expected, once the initialization is dominated by noise, HSC cannot
cluster the vectors.
14
Hollow SVD 1.0 SVD 1.0
0.9 0.9
Clustering accuracy
Clustering accuracy
10 2 10 2
Density ( )
Density ( )
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
30 60 100 180 30 60 100 180
Number of nodes (n) Number of nodes (n)
Clustering accuracy
10 2 10 2
Density ( )
Density ( )
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
30 60 100 180 30 60 100 180
Number of nodes (n) Number of nodes (n)
Figure 4: Comparison of the four algorithms when the aggregate matrix is informative. The
number of nodes varies logarithmically between 30 and 180, and the sparsity parameter
varies logarithmically between 0.00131 and 0.01925. The dashed black line is the fitted
phase transition line with theoretical slope (γHollow SVD = 1.5, γSVD ≈ 1.33, γHSC ≈ 1.33,
γAggregate SVD = 2). The estimated slopes given by logistic regression are γ̂Hollow SVD = 1.43,
γ̂SVD = 1.31, γ̂HSC = 1.24 and γ̂Aggregate SVD = 2.00.
5 Discussion
5.1 Clustering
This section discusses related theoretical research on multiway clustering with emphasis on
recent developments on hypergraphs, and temporal and multilayer networks.
Zhang and Tan [43] considered clustering of a d-uniform hypergraph on n nodes, which
is comparable to a TBM of order d with Bernoulli distributed entries. The nodes have a
fixed prior distribution over clusters and the signal is assumed to have a core tensor ρS with
n
fixed S. They show that strong consistency is impossible, when ρ ≤ c nlog d−1 for sufficiently
small constant c. Furthermore, they showed that a spectral clustering algorithm applied
n
to an aggregate matrix Ai1 i2 = i3 ...id Yi1 ...id achieves strong consistency for ρ ≥ C nlog
P
d−1 for
sufficiently large constant C, assuming that EA is informative (see [43] for further details
on the constants c, C). The matrix A can be considered as an adjacency matrix since for a
symmetric tensor Y with Yi1 ...id = 0 whenever ik = il for some k ̸= l (undirected d-uniform
15
= 0.0074 = 0.0057 = 0.0045 = 0.004
accuracy = 0.91 accuracy = 0.79 accuracy = 0.71 accuracy = 0.555
Figure 5: Projected observations with (a) hollow SVD and (b) HSC initialized with hollow
SVD. The number of nodes is fixed to n = 200 and the sparsity parameter ρ is varied around
the phase transition. Sparsity decreases from left to right. Accuracy (1 − ℓ) of k-means
clustering is also recorded.
That is, up to a multiplicative constant (which for d ≤ 3 is simply 1), Ai1 i2 counts the number
of common hyperedges between nodes i1 and i2 . Visually, this is equivalent to transforming
a hypergraph to a (weighted) graph such that every hyperedge {i1 , . . . , id } contributes to an
edge {j1 , j2 } ⊂ {i1 , . . . , id }, j1 ̸= j2 . Stephan and Zhu [35] showed that a spectral clustering
algorithm based on nonbacktracking walks on a hypergraph achieves weak reconstruction
(i.e., is better than clustering every node to the largest cluster), if ρ ≳ n1−d and EA is
informative. However, these results do not address the case when EA is not informative.
Figure 1 visualizes an example of a tensor, for which the adjacency matrix would not be
informative.
There has been research on multiway clustering with a noninformative adjacency ma-
trix A. Lei, Chen and Lynch [28] analyzed an undirected multilayer network model, which
can be formulated as a TBM with Y ∈ {0, 1}m×n×n with no cluster structure on the first
16
mode (or z1 (i) = i, i.e., every index forms one cluster) and identical cluster structures on the
second and third modes z2 = z3 = z, r2 = r3 = r. Each slice Yi:: corresponds to a symmetric
adjacency matrix of an undirected network layer in a multilayer network. They consider a
clustering algorithm solving a least-squares problem
with a membership matrix Z̆ ∈ {0, 1}n×r and a core tensor S̆ ∈ [0, 1]m×r×r . They did not
give an algorithm finding an exact or provably approximate solution, but they did provide
an algorithm trying to solve the optimization problem. They showed that Ẑ achieves weak
3/2
n
consistency, when ρ ≫ log nm1/2
. Lei and Lin [29] analyzed a similar problem and proposed
an algorithm similar to Algorithm 1. Specifically, their algorithm masks out the diagonal
(which they call debiasing or bias adjusting) but it does not remove the nodes with too large
1/2
L1 -norms. Their algorithm is guaranteed to be weakly consistent for ρ ≫ log nm(n+m)1/2 . Ke, Shi
and Xia [25] proved an analogous result for hypergraph degree-corrected block models. Su,
Guo, Chang and Yang [36] extended these results and ideas to directed multilayer networks.
The proof requires a matrix concentration inequality based on classical matrix Bernstein’s
inequalities (see for example [37]). However, matrix Bernstein’s inequalities, which may be
based on a deep theorem by Lieb depending on the proof, suffer from a logarithmic factor
1/2
leading to the threshold log nm(n+m)
1/2 instead of nm11/2 .
Lei, Zhang and Zhu [31] studied computational and information theoretical limits of the
multilayer network model corresponding to a TBM with Y ∈ {0, 1}m×n×n with no cluster
1
structure on the first mode. They showed that weak consistency is impossible for ρ ≪ nm and
1
the maximum likelihood estimator is weakly consistent for ρ ≫ nm . Furthermore, based on
a conjecture, they showed that a polynomial-time algorithm cannot be weakly consistent for
ρ ≤ 2nm1/21log1.4 n . Kunisky [27] studied the impossibility of detecting a presence of a cluster
structure in symmetric tensors with low coordinate degree functions (LCDF). LCDFs are
arbitrary linear combinations of functions depending on at most D entries of a vector, in
our case the data tensor. Here the coordinate degree D is argued to roughly correspond
to algorithms requiring computation time eΘ(D) . Although the developed theory is more
general, in the our special case Theorem 1.13 states that the detection task is impossible
d d−2
if ρ ≤ cn− 2 D− 2 for a sufficiently small constant c and D ≤ cn. Since polynomial-time
d d−2
algorithms correspond to D ≍ ln n, the threshold corresponds roughly to ρ ≤ cn− 2 log− 2 n.
Moreover, Kunisky showed that the detection task becomes easier when an aggregated tensor
1
remains informative, each aggregated mode decreasing the density threshold by n− 2 . This
agrees with our Corollary 3.3.
Similar but slightly sharper computational gaps have been established in subgaussian
noise. Han, Luo, Wang, and Zhang [21] analyzed multiway clustering on a TBM with sub-
gaussian noise. Their algorithm is initialized with a slightly refined version of a HOSVD
17
algorithm, which involves calculating eigenvalue decompositions of (matk Y)(matk Y)T with-
out removing the diagonals or rows and columns with too large L1 -norms. They showed
that HOSVD-based of initialization is weakly consistent for ρ2 /σ 2 ≫ n−d/2 , where σ is the
largest subgaussian norm of Ei1 ...id (i.e., σ 2 is the largest variance proxy). When HOSVD-
based initialization is given to an iterative higher-order Lloyd algorithm, they showed that
the clustering is strongly consistent if ρ2 /σ 2 ≥ Cn−d/2 for sufficiently large constant C.
Furthermore, they show that strong consistency is not achievable for ρ2 /σ 2 ≤ cn1−d with a
small enough constant c while Wang and Zeng [41] show that maximum likelihood estimator
is strongly consistent for ρ2 /σ 2 ≥ Cn1−d with a sufficiently large constant C. Notice that
this threshold is similar to Theorem 3.1, namely both σ 2 and ρ are an upper bounds of the
variance of the data entries and setting σ 2 = ρ yields the same weak consistency condition
ρ ≫ n−d/2 . However, subgaussian analysis with Bernoulli distributed entries does not give
σ 2 ≍ ρ but
1 − 2ρ 1
σ2 = 1−ρ ∼
2 log ρ 2 log ρ1
as ρ → 0 by Theorem 2.1 in [6]. For example, by setting ρ = n−γ with fixed γ gives
σ 2 ≍ log−1 n which is substantially larger than ρ = n−γ . Nonetheless, it seems that the
n
analogous statistical-computational gap nlog 1
d−1 ≲ ρ ≪ nd/2 is present in clustering a binary
TBM.
18
deep theorem by Lieb depending on the proof, suffer from a logarithmic factor making the
matrix bounds slightly suboptimal.
In the case of subgaussian TBM, Han, Luo, Wang, and Zhang [21] approach by analyzing
concentration of a singular subspace of a possibly wide random matrix (i.e., a vector subspace
spanned by the first r right or left singular vectors corresponding to the r largest singular
values of a random matrix). For this purpose, the classical Davis–Kahan–Wedin theorem is
insufficient as it provides a common error bound for both left and right singular subspaces.
Hence, they rely on more sophisticated perturbation bounds developed by Cai and Zhang
[7] providing different bounds for left and right singular subspaces which is relevant for
particularly tall and wide matrices. The probabilistic analysis is based on ε-nets and Hanson–
Wright inequality (see for example Chapter 6 in [39]).
6 Proofs
This section presents the proof of the main theorem (Theorem 3.1) in three subsections.
Section 6.1 analyzes spectral clustering with deterministic error bounds, Section 6.2 studies
concentration of sparse matrices and Section 6.3 proves the main theorem. As the analysis
of matrix concentration is quite involved, the detailed proofs of Section 6.2 are postponed
to the appendix.
Theorem 6.1. Let X ∈ Rn×m be a matrix with r distinct rows represented by Xi: = Sz(i): ,
where S ∈ Rr×m and z ∈ [r]n . Let Y ∈ Rn×m be another matrix and X̂ be its best rank-r
approximation. Let ẑ be the output of a quasi-optimal k-means clustering algorithm applied
to the rows of X̂, i.e., find a labeling function ẑ : [n] → [r] and centroids θ̂1 , . . . , θ̂r ∈ Rm
satisfying
∥X̂j: − θ̂ẑj ∥2 ≤ Q min ∥X̂j: − θ̆z̆j ∥2
X X
j z̆,θ̆ j
r∥Y −X∥2
with some relaxation parameter Q > 1. Assume that |z −1 {k}| > 128Q ∆2 op for all k,
where ∆ = min{∥Xi: − Xj: ∥ | Xi: ̸= Xj: }. Then the misclassification rate (recall Definition
19
2.2) has an upper bound
r∥Y − X∥2op
ℓ(z, ẑ) ≤ 128Q .
n∆2
Proof. Define E = Y − X. Calculate a singular value decomposition Y = Û Σ̂V̂ T + Û⊥ Σ̂⊥ V̂⊥T
and define X̂ = Û Σ̂V̂ T , which is the best r-rank approximation of Y (in statistical appli-
cations X̂ is used to estimate X). Since the rank of X̂ is at most r and the rank of X is
at most r, the rank of their difference X̂ − X is at most 2r (every vector u of the image of
X̂ − X can be written as u = u1 + u2 , where u1 ∈ im X̂ is in an r-dimensional subspace
and u2 ∈ im X is in another r-dimensional subspace). By Corollary 2.4.3 in [20], the Frobe-
nius norm and the operator norm of the rank-2r matrix X̂ − X are comparable according
∥X̂ − X∥2F ≤ 2r∥X̂ − X∥2op . Now by the Eckart–Young–Mirsky theorem [17] and Weyl’s
inequality [42], we have
EYM
∥X̂ − X∥2F ≤ 2r∥X̂ − X∥2op ≤ 2r(∥X̂ − Y ∥op + ∥E∥op )2 ≤ 2r(σr+1 (Y ) + ∥E∥op )2
Weyl
≤ 2r(σr+1 (X) + ∥E∥op + ∥E∥op )2 ≤ 8r∥E∥2op .
This shows that the estimate X̂ is close to X in Frobenius norm. Denote the estimated
clusters by ẑj ∈ [r], j = 1, . . . , n and their centers by θ̂k ∈ Rm , k = 1, . . . , r. By quasi-
optimality, we have
∥X̂j: − θ̂ẑj ∥2 ≤ Q min ∥X̂j: − θ̆z̆j ∥2 ≤ Q ∥X̂j: − Xj: ∥2 = Q∥X̂ − X∥2F ≤ 8Qr∥E∥2 .
X X X
j z̆,θ̆ j j
This shows that the row vectors of the estimate X̂ are close to the estimated cluster means
on average. Now we can estimate the distances between the true cluster means (rows of S)
and the estimated cluster means (θ̂ẑj ):
j j
For any t > 0, Markov’s inequality (with respect to the counting measure) gives
j ∥θ̂ẑj − Xj ∥ 2 32Qr∥E∥2
P
2 2
|{j | ∥θ̂ẑj − Xj ∥ ≥ t }| ≤ ≤ .
t2 t2
20
only if ∥θ̂ẑi − Szj ∥ < ∆/2. This gives a well-defined mapping π : ẑ(A) → [r], π(ẑi ) = zi (if
ẑi = ẑj , then ∥θ̂ẑi − Szj ∥ < ∆/2 and hence zi = zj ). However, π is not bijective in general,
when balanced cluster sizes are not assumed.
If z|A : A → [r] is not a surjection, i.e., there exists k ∈ [r] such that zi ̸= k for
every i ∈ A, then z −1 {k} ⊂ Ac , |z −1 {k}| ≤ |Ac |. This is a contradiction if we assume
32Qr∥E∥2op
that t2
< |z −1 {k}| for every k. Under this assumption, z|A is surjective and hence
π : ẑ(A) → [r] is surjective (if k ∈ [r], there exists i ∈ A such that k = zi = π(ẑi )). The
surjectivity of π gives |ẑ(A)| ≥ |[r]| = r implying ẑ(A) = [r] and making π a bijection. The
bound for the misclassification rate follows now from the estimate ℓ(z, ẑ) ≤ |Ac |/n ■
As Theorem 6.1 shows, clustering from a low-rank approximation works well when the
operator norm of the noise matrix is sufficiently small and the clusters are sufficiently sepa-
rated in the signal. The following result (Lemma 6.2) addresses the latter issue by estimating
the separation of clusters ∆ in Theorem 6.1 under Assumptions 2.3 and 2.2. The former
issue is addressed in Section 6.2.
Lemma 6.2. Consider a TBM with a signal tensor X . Assume separated clusters along
mode k with separation δ > 0 (Assumption 2.2) and balanced clusters with cluster balance
coefficient α (Assumption 2.3). Then
1
T αd− 2 δ 2 T
min ∥((matk X )(matk X ) )i: − ((matk X )(matk X ) )j: ∥ ≥ √ √ n1 · · · nd ρ2 .
i,j:zi ̸=zj 2 nk rk
Proof. Fix k ∈ [d] and let ik , i′k ∈ [nk ] be entities from different clusters zk (ik ) ̸= zk (i′k ).
Denote l−k = (l1 , . . . , lk−1 , lk+1 , . . . , ld ) ∈ [r1 . . . rd /rk ] and j−k = (j1 , . . . , jk−1 , jk+1 , . . . , jd ) ∈
r1 ...rd r ...r
× 1r d
[n1 · · · nd /nk ]. Define a diagonal matrix D ∈ R rk k with nonnegative diagonal entries
|z1−1 {l1 }|···|zd−1 {ld }|
Dl−k l−k = |zk−1 {lk }|
. Now
× ((matk S)zk (ik ),l−k − (matk S)zk (i′k ),l−k )(matk S)lk ,l−k .
By balanced clusters with cluster balance coefficient α > 0 (Assumption 2.3) , the last term
21
is bounded from below by
2
ρ4 αnk X X
≥ Dl l ((matk S)zk (ik ),l−k − (matk S)zk (i′k ),l−k )(matk S)lk ,l−k
rk lk l−k −k −k
ρ4 αnk
= ∥((matk S)D(matk S)T )zk (ik ): − ((matk S)D(matk S)T )zk (i′k ): ∥2 .
rk
Next, we observe that for any matrix A and indices i ̸= i′ , Cauchy–Schwarz inequality gives
∥Ai: − Ai′ : ∥2 = ∥(ei − ei′ )T A∥2 = (ei − ei′ )T AAT (ei − ei′ )
√ √
≤ 2∥(ei − ei′ )T AAT ∥ = 2∥(AAT )i: − (AAT )i′ : ∥.
√
Applying this inequality to matrix (matk S) D gives now a lower bound
Theorem 6.3. Let X ∈ Rn×m be a random matrix with independent centered entries (EXij =
22
0) satisfying the MGF bound with variance proxy σ 2 (Assumption 2.1). Define a trimmed
version of the matrix as
Xij ,
∥Xi: ∥1 ∨ ∥X:j ∥1 ≤ Ctrim (n ∨ m)σ 2 ,
Xij′ =
0, otherwise.
for all
log3 m
!
t≥C 1∨ .
log3 (1/6enσ 2 )
Proof. See Appendix C. ■
23
for all s > 0 and
24 log2 m
t ≥ 24 ∨ .
log2 (1/6enσ 2 )
Proof. See Appendix C. ■
By Lemma 6.5, there exists an absolute constant C1 such that setting Ctrim ≥ C1 (1 ∨ ε−2 )
gives
nk X
1 X
P I |(matk Y)il (matk Y)j ′ l | ≥ Ctrim n1 · · · nd ρ2 ≥ s
nk i=1 j ′ ,l:j ′ ̸=i
1 q
−n1 ···nd ρ2 −1
≤ Ctrim e + 3nk .
s
By Theorem 6.4, there exists an absolute constant C2 such that
P ∥((matk Y)(matk Y)T ⊙ M − (matk X )(matk X )T ) ⊙ N ∥op ≥ u
1/3
≤ C2 n−1
k e
−t
for all t ≥ 1 ∨ ε−3 ,
where
√ n1 · · · nd ρ2
u = C2 (t + Ctrim ) n1 · · · nd ρ + .
nk
Choose s so that nk s ≪ αnk /rk ≤ minl |zk−1 {l}| to ensure that eventually every cluster is
2
present also after masking with N . More specifically, we may choose e−n1 ···nd ρ + n−1
k ≪ s ≪
2
2 rk3 −1
rk−1 since rk e−n1 ···nd ρ ≤ n1 ···nd ρ2
≪ 1 and rk
nk
≪ nk3 ≪ 1. Define
24
By Theorem 6.1, there exists an absolute constant C3 such that
!
rk u 2
P ℓ(zk , ẑk ) ≥ s + C3 Q
nk ∆2
nk X
1 X
≤ P I |(matk Y)il (matk Y)j ′ l | ≥ Ctrim n1 · · · nd ρ2 ≥ s
nk i=1 j ′ ,l:j ′ ̸=i
+ P ∥((matk Y)(matk Y)T ⊙ M − (matk X )(matk X )T ) ⊙ N ∥op ≥ u
1 q
−n1 ···nd ρ2 −t1/3
≤ Ctrim e + 3nk + C2 n−1
−1
k e
s
2
when minl |zk−1 {l}| ≥ C3 Q r∆
ku
2 . By Lemma 6.2, there exists a constant C4 such that ∆ ≥
1
αd− 2 δ 2
C4√
nk rk
n1 · · · nd ρ2 . Now
√ √ n1 ···nd ρ2
rk u C2 (t + Ctrim ) n1 · · · nd ρ + nk
√ ≤ 1
nk ∆ C4 αd− 2 δ 2
n · · · n ρ2
rk 1 d
!
rk C2 (t + Ctrim ) 1
= 1 √ + .
C4 αd− 2 δ 2 n1 · · · nd ρ nk
2 3/2
Now minl |zk−1 {l}| ≥ C3 Q r∆
ku
2 holds eventually by assumptions ρ ≫ rk (n1 · · · nd )−1/2 and
3/2
nk ≫ rk , namely
αnk
minl |zk−1 {l}| rk
2 ≥ 2
C3 Q r∆ ku
rk C2 (t+Ctrim ) 1
2
C3 Qnk 1 √
n1 ···nd ρ
+ nk
C4 αd− 2 δ 2
!2 −1
C3 Qr3 C2 (t + Ctrim ) 1
= 2 2d−1k 4 √ + ≫ 1.
C4 α δ α n1 · · · nd ρ nk
2
Set t = 1 ∨ ε−3 to be a constant. Recall e−n1 ···nd ρ + n−1 −1
k ≪ s ≪ rk . Now we have
!!2
rk C2 (t + Ctrim ) 1
ℓ(zk , ẑk )
P ≥ s + C3 Q 1 √ +
C4 αd− 2 δ 2 n1 · · · nd ρ nk
| {z }
≪1
1 q
−n1 ···nd ρ2 −t1/3
≤ Ctrim e + 3nk + C2 n−1
−1
k e ,
s
| {z }
≪1
that is, ℓ(zk , ẑk ) → 0 in probability. This concludes the proof of Theorem 3.1. ■
25
Acknowledgment We thank Bogumił Kamiński and Paul Van Dooren for stimulating
discussions and helpful comments.
References
[1] K. Alaluusua, K. Avrachenkov, V. B. R. Kumar, and L. Leskelä. “Multilayer hyper-
graph clustering using the aggregate similarity matrix”. In: Algorithms and Models for
the Web Graph (WAW). 2023, pp. 83–98.
[2] D. Arthur and S. Vassilvitskii. “k-means++: the advantages of careful seeding”. In:
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms
(SODA ’07). New Orleans, Louisiana, 2007, pp. 1027–1035.
[3] K. Avrachenkov, M. Dreveton, and L. Leskelä. “Community recovery in non-binary
and temporal stochastic block models”. In: arXiv:2008.04790 (2022).
[4] G. Bennett. “Probability Inequalities for the Sum of Independent Random Variables”.
In: Journal of the American Statistical Association 57.297 (1962), pp. 33–45.
[5] L. Brusa and C. Matias. “Model-based clustering in simple hypergraphs through a
stochastic blockmodel”. In: Scandinavian Journal of Statistics 51.4 (2024), pp. 1661–
1684.
[6] V. Buldygin and K. Moskvichova. “The sub-Gaussian norm of a binary random vari-
able”. In: Theory of Probability and Mathematical Statistics 86 (2013), pp. 33–49.
[7] T. T. Cai and A. Zhang. “Rate-Optimal Perturbation Bounds for Singular Subspaces
with Applications to High-Dimensional Statistics”. In: Annals of Statistics 46.1 (2018),
pp. 60–89.
[8] H. Chernoff. “A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on
the sum of Observations”. In: Annals of Mathematical Statistics 23.4 (1952), pp. 493–
507.
[9] I. E. Chien, C.-Y. Lin, and I.-H. Wang. “On the Minimax Misclassification Ratio of
Hypergraph Community Detection”. In: IEEE Transactions on Information Theory
65.12 (2019), pp. 8095–8118.
[10] P. Chodrow, N. Eikmeier, and J. Haddock. “Nonbacktracking Spectral Clustering of
Nonuniform Hypergraphs”. In: SIAM Journal on Mathematics of Data Science 5.2
(2023), pp. 251–279.
[11] H. Cramér. “Sur un nouveau théorème-limite de la théorie des probabilités”. In: Col-
loque consacré à la théorie des probabilités. Actualités Scientifiques et Industrielles 736
(1938), pp. 2–23.
26
[12] G. Dai, Z. Su, and H. Wang. “Tail Bounds on the Spectral Norm of Sub-Exponential
Random Matrices”. In: arXiv:2212.07600 (2022).
[13] L. De Lathauwer, B. De Moor, and J. Vandewalle. “A Multilinear Singular Value
Decomposition”. In: SIAM Journal on Matrix Analysis and Applications 21.4 (2000),
pp. 1253–1278.
[14] L. De Lathauwer, B. De Moor, and J. Vandewalle. “On the Best Rank-1 and Rank-(R1
,R2 ,. . .,RN ) Approximation of Higher-Order Tensors”. In: SIAM Journal on Matrix
Analysis and Applications 21.4 (2000), pp. 1324–1342.
[15] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. “Clustering Large Graphs
via the Singular Value Decomposition”. In: Machine learning 56 (2004), pp. 9–33.
[16] I. Dumitriu, H. Wang, and Y. Zhu. “Partial recovery and weak consistency in the
non-uniform hypergraph stochastic block model”. In: Combinatorics, Probability and
Computing 34.1 (2025), pp. 1–51.
[17] C. Eckart and G. Young. “The approximation of one matrix by another of lower rank”.
In: Psychometrika 1.3 (1936), pp. 211–218.
[18] U. Feige and E. Ofek. “Spectral techniques applied to sparse random graphs”. In:
Random Structures & Algorithms 27.2 (2005), pp. 251–275.
[19] J. Friedman, J. Kahn, and E. Szemerédi. “On the second eigenvalue of random regular
graphs”. In: Proceedings of the Twenty-First Annual ACM Symposium on Theory of
Computing (STOC ’89). Seattle, Washington, USA: Association for Computing Ma-
chinery, 1989, pp. 587–598.
[20] G. H. Golub and C. F. Van Loan. Matrix Computations. 4th ed. JHU press, 2013.
[21] R. Han, Y. Luo, M. Wang, and A. R. Zhang. “Exact Clustering in Tensor Block Model:
Statistical Optimality and Computational Limit”. In: Journal of the Royal Statistical
Society Series B: Statistical Methodology 84.5 (2022), pp. 1666–1698.
[22] C. J. Hillar and L.-H. Lim. “Most tensor problems are NP-hard”. In: Journal of the
ACM 60.6 (2013), pp. 1–39.
[23] V. Hore, A. Viñuela, A. Buil, J. Knight, M. I. McCarthy, K. Small, and J. Marchini.
“Tensor decomposition for multiple-tissue gene expression experiments”. In: Nature
Genetics 48.9 (2016), pp. 1094–1100.
[24] J. Jin. “Fast community detection by score”. In: Annals of Statistics 43.1 (2015),
pp. 57–89.
[25] Z. T. Ke, F. Shi, and D. Xia. “Community Detection for Hypergraph Networks via
Regularized Tensor Power Iteration”. In: arXiv:1909.06503 (2019).
27
[26] Y. Kong and T. Yu. “A hypergraph-based method for large-scale dynamic correlation
study at the transcriptomic scale”. In: BMC Genomics 20.397 (2019).
[27] D. Kunisky. “Low coordinate degree algorithms II: Categorical signals and generalized
stochastic block models”. In: arXiv:2412.21155 (2024).
[28] J. Lei, K. Chen, and B. Lynch. “Consistent community detection in multi-layer network
data”. In: Biometrika 107.1 (2020), pp. 61–73.
[29] J. Lei and K. Z. Lin. “Bias-Adjusted Spectral Clustering in Multi-Layer Stochastic
Block Models”. In: Journal of the American Statistical Association 118.544 (2023),
pp. 2433–2445.
[30] J. Lei and A. Rinaldo. “Consistency of spectral clustering in stochastic block models”.
In: Annals of Statistics 43.1 (2015), pp. 215–237.
[31] J. Lei, A. R. Zhang, and Z. Zhu. “Computational and statistical thresholds in multi-
layer stochastic block models”. In: Annals of Statistics 52.5 (2024), pp. 2431–2455.
[32] M. Melé, P. G. Ferreira, F. Reverter, D. S. DeLuca, J. Monlong, M. Sammeth, T. R.
Young, J. M. Goldmann, D. D. Pervouchine, T. J. Sullivan, R. Johnson, A. V. Segrè, S.
Djebali, A. Niarchou, The GTEx Consortium, F. A. Wright, T. Lappalainen, M. Calvo,
G. Getz, E. T. Dermitzakis, K. G. Ardlie, and R. Guigó. “The human transcriptome
across tissues and individuals”. In: Science 348.6235 (2015), pp. 660–665.
[33] S. Paul and Y. Chen. “Spectral and matrix factorization methods for consistent com-
munity detection in multi-layer networks”. In: Annals of Statistics 48.1 (2020), pp. 230–
250.
[34] N. Ruggeri, M. Contisciani, F. Battiston, and C. De Bacco. “Community detection in
large hypergraphs”. In: Science Advances 9.28 (2023), eadg9159.
[35] L. Stephan and Y. Zhu. “Sparse random hypergraphs: non-backtracking spectra and
community detection”. In: Information and Inference: A Journal of the IMA 13.1
(2024), iaae004.
[36] W. Su, X. Guo, X. Chang, and Y. Yang. “Spectral co-clustering in multi-layer directed
networks”. In: Computational Statistics & Data Analysis 198 (2024), p. 107987.
[37] J. A. Tropp. “User-Friendly Tail Bounds for Sums of Random Matrices”. In: Founda-
tions of Computational Mathematics 12 (2012), pp. 389–434.
[38] N. Vannieuwenhoven, R. Vandebril, and K. Meerbergen. “A New Truncation Strategy
for the Higher-Order Singular Value Decomposition”. In: SIAM Journal on Scientific
Computing 34.2 (2012), pp. 1027–1052.
[39] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data
Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge
University Press, 2018.
28
[40] M. Wang, J. Fischer, and Y. S. Song. “Three-way clustering of multi-tissue multi-
individual gene expression data using semi-nonnegative tensor decomposition”. In:
Annals of Applied Statistics 13.2 (2019), pp. 1103–1127.
[41] M. Wang and Y. Zeng. “Multiway clustering via tensor block models”. In: Advances
in Neural Information Processing Systems (NeurIPS) 32 (2019).
[42] H. Weyl. “Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Dif-
ferentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung)”.
In: Mathematische Annalen 71.4 (1912), pp. 441–479.
[43] Q. Zhang and V. Y. F. Tan. “Exact Recovery in the General Hypergraph Stochastic
Block Model”. In: IEEE Transactions on Information Theory 69.1 (2023), pp. 453–471.
29
A Proof of Theorem 6.3
Lemma A.1 asserts that in many cases, the largest L1 -norms of rows of a random matrix can
be bounded with high probability. Furthermore, in a boundary case, almost every row can
be bounded with high probability. Although it is not needed to prove Theorem 6.3, Lemma
A.1 shows that only a small fraction of rows and columns are removed in Theorem 6.3 to
obtain the desired norm bound. Lemma A.1 is also needed to prove Theorem 6.4. Lemma
A.2 is used to prove Theorem 6.3 bounding the operator norm of a random matrix, from
which rows and columns with too large L1 -norms are removed. The proofs mainly follow
[18], which focuses on Bernoulli distributed entries (random graphs). The proof technique
originates from [19], where an analogous result is given for random regular graphs.
Lemma A.1. Let X ∈ Rn×m be a random matrix with independent centered entries satisfying
the MGF bound with variance proxy σ 2 (Assumption 2.1). Define centered L1 -norms Si =
Pm
j=1 (|Xij | − E|Xij |). Then for any t ≥ 0, we have
n n
! !
X
2
o en enmσ 2
P I Si ≥ (1 + t)mσ ≥ ≤ exp − mσ2 /8 t
i=1 emσ2 /8 8e
≤ (en)−t , mσ 2 ≤ 8 log en,
q
P max Si ≥ (1 + t) 8mσ 2 log en ≤ (en)−t , mσ 2 ≥ 8 log en.
i=1,...,n
That is, if mσ 2 ≥ 8 log en, then the largest L1 -norm is bounded by maxi ∥Xi: ∥1 ≤
√
maxi E∥Xi: ∥1 + C mσ 2 log en with high probability as n → ∞, and if 1 ≪ mσ 2 ≤ 8 log en,
then the fraction of “large” L1 -norms n1 ni=1 I{∥Xi: ∥1 ≥ E∥Xi: ∥1 + 2mσ 2 } ≪ 1 tends to zero
P
with high probability as n → ∞, and if C ≤ mσ 2 ≤ 8 log en for some constant C, then the
fraction of “large” L1 -norms n1 ni=1 I{∥Xi: ∥1 ≥ E∥Xi: ∥1 + 2mσ 2 } ≤ e1−C/8 is at most of con-
P
stant order with high probability as n → ∞, but by setting C large enough, this fraction can
be made arbitrarily small. Under an extra assumption E|Xij | ≲ σ 2 , we have E∥Xi: ∥1 ≲ mσ 2
and the L1 -norm bounds are in the order of mσ 2 .
2 λ
Proof. By Lemma D.1:(v),
P we have Eeλ(|Xij |−E|Xij |) ≤ e2σ (e −1−λ) for all λ ≥ 0. By indepen-
dence, EeλSi = Eeλ j (|Xij |−E|Xij |) = j Eeλ(|Xij |−E|Xij |) ≤ e2mσ (e −1−λ) for all λ ≥ 0. Assume
Q 2 λ
that k ≥ i P(Si ≥ t). By Bennett’s inequality (D.1) and Bernstein’s inequality (D.3) in
P
30
Lemma D.2, we have
n n n
! !
X X X
P I{Si ≥ t} ≥ k = P (I{Si ≥ t} − P{Si ≥ t}) ≥ k − P{Si ≥ t}
i=1 i=1 i=1
Pn !k
(D.1)
−
Pn
P{Si ≥t} e i=1 P{Si ≥ t}
≤ e i=1
k
Pn !k
(∗) e i=1 P{Si ≥ t}
≤
k
!!!k
(D.3)en t2 3t
≤ exp − 2
∧
k 8mσ 4
!!!
2
en t 3t
= exp k log − ∧ .
k 8mσ 2 4
Now
n
en 4 log en
( !) !
en
r
X
k
P I Si ≥ (1 + t) 8mσ 2 log ∨ ≥ k ≤ exp −k log t .
i=1 k 3 k
Notice that the assumption k ≥ i P(Si ≥ t) is needed for valid argumentation, but it
P
can be seen in (∗) that the probability upper bound becomes greater than one and hence
trivial, when k < i P(Si ≥ t). Therefore, the assumption k ≥ i P(Si ≥ t) does not
P P
en en
for all k ≥ 1. Choose k = 1 ∨ 2
emσ /8
so that if mσ 2 ≤ 8 log en, then k = 2
emσ /8
and
en 4 log en mσ 2
r
∨
8mσ 2 log k
= mσ 2 ∨ = mσ 2 ,
k 3 ! 6
n n
!
2
o en enmσ
I Si ≥ (1 + t)mσ ≥ mσ2 /8 ≤ exp − mσ2 /8 t ≤ (en)−t ,
2
X
P
i=1 e 8e
31
and if mσ 2 ≥ 8 log en, then k = 1 and
en 4 log en 4 log en
r q
k
8mσ 2 log ∨ = 8mσ 2 log en ∨
k 3 s3
q 4 mσ 2 log en
≤ 8mσ 2 log en ∨
3 8
q
= 8mσ 2 log en,
n ! n
q q !
[ X
P Si ≥ (1 + t) 8mσ 2 log en =P I Si ≥ (1 + t) 8mσ 2 log en ≥ 1
i=1 i=1
−t
≤ (en) .
The following lemma in its original form [18] bounds sums of the form i∈S j∈T |Xij |,
P P
where X ∈ Rn×m is a random matrix and S ⊂ [n] and T ⊂ [m]. When X corresponds to an
adjacency matrix of a graph, then this sum counts the number of (directed) edges from the
vertex set S to the vertex set T . However, here the statement is modified to draw a clearer
connection to the operator norm ∥X∥op = sup∥u∥,∥v∥≤1 i,j ui vj Xij .
P
Lemma A.2. Let X ∈ Rn×m be a random matrix with independent centered entries (EXij =
2 |λ|
0) satisfying EeλXij ≤ eσ (e −1−|λ|) for all λ ∈ R. For subsets S ⊂ [n] and T ⊂ [m],
define X(S, T ) = max{ i∈S j∈T ui vj Xij | ui , vj ∈ {−1, +1}} and a function Cσ2 (t) =
P P
σ2 +t
2 eσ 2
e−σ σ 2 +t
. Then
( ! !!)
[ 2en 2em
P − log C|S||T |σ2 (X(S, T )) ≥ (1 + t) |S| log + |T | log
S,T |S| |T |
≤ (e2 nm)−t , t ≥ 1,
̸ S ⊂ [n], u ∈ {±1}S , ∅ =
Proof. If ∅ = ̸ T ⊂ [m] and v ∈ {±1}T , then Lemma D.1:(iii)
2 |λ|
gives moment condition E exp(λ i∈S,j∈T ui vj Xij ) ≤ e|S||T |σ (e −1−|λ|) for all λ ∈ R. For any
P
32
k
n en
inequality (D.1) and the inequality k
≤ k
gives
[n o [ [ X X
P X(S, T ) ≥ t|S|,|T | = P uvX ≥
i j ij t|S|,|T |
S,T S,T u∈{±1}S ,v∈{±1}T i∈S j∈T
X X XX
≤ P ui vj Xij ≥ t|S|,|T |
S,T u∈{±1}S ,v∈{±1}T i∈S j∈T
n X
X m X X X X XX
= P ui vj Xij ≥ tkl
k=1 l=1 S∈([n]) T ∈([m]) u∈{±1}S v∈{±1}T i∈S j∈T
k l
n X
m
! ! !klσ2 +tkl
X n m k l −klσ2 eklσ 2
≤ 2 2 e
k=1 l=1 k l klσ 2 + tkl
| {z }
Cklσ2 (tkl )
n X
m
2en k 2em l
X
≤ Cklσ2 (tkl ).
k=1 l=1 k l
Let us fix t ≥ 1. Since the bound Cklσ2 : [0, ∞) → [0, 1] is a continuous function start-
ing from 1 and decreasing to 0 in the limit, we can find tkl > 0 such that Cklσ2 (tkl ) =
k l −(1+t)
2en 2em
k l
. Since the bound Cklσ2 is a strictly decreasing function, we have an
equivalence X(S, T ) ≥ tkl ⇐⇒ Cklσ2 (X(S, T )) ≤ Cklσ2 (tkl ) (notice that X(S, T ) ≥ 0 is
always nonnegative, because whenever uT Xv < 0, we have (−u)T Xv > 0). Therefore
( ! !!)
[ 2en 2em
P − log C|S||T |σ2 (X(S, T )) ≥ (1 + t) |S| log + |T | log
S,T |S| |T |
[n o
= P − log C|S||T |σ2 (X(S, T )) ≥ − log C|S||T |σ2 (t|S|,|T | )
S,T
n m k l
XX 2en 2em
≤ Cklσ2 (tkl )
k=1 l=1 k l
n
!kt m !lt
X k X l
= .
k=1 2en l=1 2em
33
2t
2
2en
for 2 ≤ k ≤ n yielding
n
!kt t 2t
k 1 2
X
≤ + (n − 1)+
k=1 2en 2en 2en
t≥1 2−t 1
≤ t
+n
(en) (en)1+t
2−t + e−1
=
(en)t
1
≤ .
(en)t
Hence,
n
!kt m !lt
X k X l 1
≤
k=1 2en l=1 2em (e2 nm)t
Proof of Theorem 6.3. Since the operator norm of a submatrix is less than the operator
norm of the whole matrix, without loss of generality, we may assume that n = m (the
smaller dimension can be extended to the larger dimension by filling with zero-entries, which
does not affect the claimed bound of ∥X∥op nor the probability bound). The proof is divided
into several parts. The first part discusses the general idea and the proof task, and the
remaining parts consist of completing the tasks.
Discretization. Let 0 < ε < 1/2 and S′ε be an ε-net of the n−1-dimensional unit sphere
Sn−1 of size |S′ε | ≤ (2ε−1 + 1)n (Corollary 4.2.13 in [39]). Define a new ε-net
so that |Sε | ≤ 2n ·(2ε−1 +1)n = (4ε−1 +2)n and whenever (u1 , . . . , un ) ∈ Sε , also (u1 , . . . , −ui , . . . , un ) ∈
Sε . Define a constant Cε = 2 log(4ε−1 + 2) (so that |Sε × Sε | ≤ eCε n ). By Exercise 4.4.3 in
[39], we have
1 1
∥X ′ ∥ ≤ max uT X ′ v = ui vj Xij′ .
X
max
1 − 2ε u,v∈Sε 1 − 2ε u,v∈S ε
i,j
Applying a Chernoff bound directly on sums i,j ui vj Xij′ will work sufficiently well only when
P
|ui vj | is sufficiently small (proven later). Hence, the proof also needs another argument for
heavier terms.
Discretize (0, 1] into bins by fixing any sequences 1 = a1 ≥ a2 ≥ . . . and 1 = b1 ≥ b2 ≥ . . .
to be used as bin boundaries. Let u, v ∈ B n be vectors with at most unit length. Let
us bin the values into sets Sk = {i | |ui | ∈ (ak+1 , ak ]}, a1 = 1 of size sk = |Sk | and
Tl = {j | |vj | ∈ (bl+1 , bl ]}, b1 = 1 of size tl = |Tl |. This is analogous to histograms, where the
34
entries of the vectors correspond to the data, Sk and Tl are the bins with endpoints ak and
bl and frequencies/counts sk and tl . Let us emphasize that sk and Sk depend on the vector u
and similarly, tl and Tl depend on the vector v, although this dependence is not written out
explicitly. The endpoints ak , bl , in turn, are fixed and do not depend on the vectors. Now
The intuition is, that X ′ (Sk , Tl ) concentrates better for larger bins Sk and Tl . From this
point of view, bins should be chosen as large as possible. However, the larger the bins, the
more information about the distribution of the values of u and v is lost. From this point of
view, bins should be chosen as small as possible. It turns out that preserving information
about the norms is important:
ak
That is, larger ratios ak+1 preserve the norm worse. This indicates that ak can decrease at
most geometrically/exponentially with a common ratio a = supk ak /ak+1 , and thus geomet-
rically/exponentially decreasing intervals might be the optimal choice (in [18], the ratio was
chosen to be a = 2). Thus, choose ak = a1−k to get k sk a2k ≤ a2 . Similarly, define bk = b1−k
P
1
∥X ′ ∥op ≤ ui vj Xij′
X
max
1 − 2ε u,v∈Sε i,j
1
ui vj Xij′ + ui vj Xij′
X X X X
= max
1 − 2ε u,v∈Sε (k,l)∈L i∈Sk ,j∈Tl / i∈Sk ,j∈Tl
(k,l)∈L
1
ak bl X ′ (Sk , Tl ) ,
X X X
≤ max ui vj Xij + max
1 − 2ε u,v∈Sε (k,l)∈L i∈Sk ,j∈Tl u,v∈Sε
(k,l)∈L
/
where L is a set of indices called light couples (defined later). Here, the last inequality
follows from the symmetry condition of the ε-net Sε . Namely, although it may be that
35
0 = ui vj Xij′ > ui vj Xij , we do have
ui vj Xij′
X X
max
u,v∈Sε
(k,l)∈L i∈Sk ,j∈Tl
vj Xij I(X:j′ ̸= 0)
X X
≤ max ui
u,v∈Sε
i j:∃(k,l)∈L:i∈Sk ,j∈Tl
vj Xij I(X:j′ ̸= 0)
X X
= max ui
u,v∈Sε
i j:∃(k,l)∈L:i∈Sk ,j∈Tl
vj I(X:j′ ̸= 0)
X X
= max ui Xij
u,v∈Sε
j i:∃(k,l)∈L:i∈Sk ,j∈Tl
X X
≤ max vj ui Xij
u,v∈Sε
j i:∃(k,l)∈L:i∈Sk ,j∈Tl
X X
= max vj ui Xij
u,v∈Sε
j i:∃(k,l)∈L:i∈Sk ,j∈Tl
X X
= max ui vj Xij .
u,v∈Sε
(k,l)∈L i∈Sk ,j∈Tl
When needed, we may also bound X ′ (S, T ) ≤ X(S, T ). Motivated by Bennett’s inequality
σ2 +t t
2 eσ 2 eσ 2
(D.1), define a function Cσ2 (t) = e−σ σ 2 +t
(recall the upper bound Cσ2 (t) ≤ t
)
and events
( ! !!)
[ 2en 2en
E1 = − log C|S||T |σ2 (X(S, T )) ≥ (1 + t) |S| log + |T | log
S,T |S| |T |
( ! !!)
[ X(S, T ) 2en 2en
⊃ X(S, T ) log ≥ (1 + t) |S| log + |T | log ,
S,T e|S||T |σ 2 |S| |T |
[ X X √
E2 = ui vj Xij ≥ (1 + t)CL nσ 2 ,
u,v∈Sε (k,l)∈L i∈Sk ,j∈Tl
E = E1 ∪ E2 .
It remains to prove the desired upper bound for ∥X ′ ∥op by bounding the sums (k,l)∈L i∈Sk ,j∈Tl ui vj Xij
P P
P c
and (k,l)∈L
/ ak bl X(Sk , Tl ) under E .
1. Light couples. Define the set of light couples as L = {(k, l) | ak bl ≤ τ } for some
36
threshold value τ > 0. By Lemma D.1:(iii), we have
X X ui vj X X u2i vj2 σ 2 |λ|
E exp λ Xij ≤ exp (e − 1 − |λ|)
τ τ 2
(k,l)∈L i∈Sk ,j∈Tl | {z } (k,l)∈L i∈S k ,j∈Tl
∈[−1,1]
!
σ2
≤ exp 2 (e|λ| − 1 − |λ|) .
τ
Choose τ such that when the condition t ≥ 3σ 2 /τ holds as equality, thenqthe probability
9σ 2
bound becomes trivial, that is, 0 = Cε n − 9σ 2 /4τ 2 , or equivalently, τ = 4C εn
= 2√3σ
Cε n
.
3σ 2
√ 2
Now τ = 2 Cε nσ and
[ X X q
P(E2 ) = P ui vj Xij ≥ (1 + t)2 Cε nσ 2
u,v∈Sε (k,l)∈L i∈Sk ,j∈Tl
≤ exp (−tCε n) .
√
In summary, by defining a constant CL = 2 Cε , a threshold parameter τ = CL3σ√n and the set
of light couples L = {(k, l) | ak bl ≤ τ }, we have an upper bound (k,l)∈L i∈Sk ,j∈Tl ui vj Xij ≤
P P
√
(1 + t)CL nσ 2 in E2c and P(E2 ) ≤ e−Cε nt .
2. Bounded L1 -norm. Define the first set of heavy couples as H1 = {(k, l) ∈ / L | abkl ≥
37
√ bl
√
nσ 2 or ak
≥ nσ 2 }. Then
1
ak bl X ′ (Sk , Tl ) ≤ √ a2k |Xij′ |
X X X
a √
k,l: k ≥ nσ 2
nσ 2 k,l:
ak √
≥ nσ 2 i∈Sk ,j∈Tl
bl bl
1 X 2 X
≤√ ak |Xij′ |
nσ k2
i∈Sk ,j
1 X 2
≤√ ak sk max∥Xi:′ ∥1
nσ k2 i
a2
≤√ max∥Xi:′ ∥1
nσ 2 i
√
≤ Ctrim a2 nσ 2 .
Hence,
√
ak bl X ′ (Sk , Tl ) ≤ Ctrim (a2 + b2 ) nσ 2 .
X
(k,l)∈H1
σ2
ak bl X(Sk , Tl ) ≤ e2 a2k sk b2l tl
X X
(k,l)∈H2
τ (k,l)∈H2
σ2
≤ e2 a2 b2
τ
2 2 2√
CL e a b
= nσ 2 .
3
38
Under E1c , we have
X X sk log 2en
sk
X tl log 2en
tl
ak bl X(Sk , Tl ) ≤ (1 + t) ak bl X(Sk ,Tl )
+ ak b l X(Sk ,Tl )
k,l k,l log esk tl σ 2 k,l log esk tl µ
X 2en X 2en
≤ (1 + t) ak bl sk log + ak bl tl log .
k,l sk k,l tl
By symmetry, it suffices to bound only the first sum; the second sum can be bounded with
similar arguments. By geometric summation formula, we have
log 2en
sk
X bl log 2en
sk
sk a2k
X X
s k ak b l =
k,l log X(S k ,Tl )
esk tl σ 2 k l ak log X(Sk ,Tl )
es t σ 2
k l
log 2ensk
sk a2k
X X
≤ X(Sk ,Tl )
bl
k ak minl log esk tl σ2 l
a2 bl log 2en
sk
≤ max .
1 − b−1 k,l ak minl log X(S k ,Tl )
esk tl σ 2
It remains to bound the terms of the geometric sums. By assuming log 2en ≤ CH3 log X(S k ,Tl )
esk tl σ 2
√ sk
and H1c , we have abkl ≤ nσ 2 (assumption H1c ) and consequently the desired upper bound
bl log 2en
sk
√
X(Sk ,Tl ) ≤ CH3 nσ 2 . Here the value of the constant CH3 will play an important role
ak minl log
esk tl σ 2
later, so it will be fixed then.
Let us study other ways of bounding the terms of the geometric sum by assuming
now log 2ensk
≥ CH3 log X(S k ,Tl )
esk tl σ 2
/ H2c , that is,
. By combining this with assumption (k, l) ∈
2
X(Sk , Tl ) ≥ e2 sk tl ak bl στ , we get
1
2en e2 sk tl ak bl σ 2 eak bl
CH3
≥ =
sk esk tl σ 2 τ τ
1
τ 2en CH3
=⇒ bl ≤
eak sk
log 2en
sk τ
1
2en CH3 2en
=⇒ bl ≤ 2 log .
ak eak sk sk
39
In summary, by defining the third set of heavy couples as
(k, l) ∈
/ L ∪ H1 ∪ H2 ,
1
≤ CH3 log X(S k ,Tl )
CH3 CL e 2
H3 = (k, l) log 2en
sk esk tl σ 2
or 2en
sk
CH3
log 2en
sk
≤ 3
ak n, ,
1
log 2en ≤ CH3 log X(S k ,Tl ) 2en CH3
log 2en ≤ CH3 CL e 2
tl esk tl σ 2
or tl tl 3
bl n
2
a b2
√
we have (k,l)∈H3 ak bl X(Sk , Tl ) ≤ (1 + t) 1−b CH3 nσ 2 .
P
−1 + 1−a−1
5. Remaining heavy couples. Let us consider the remaining couples (k, l) ∈ / L∪
H1 ∪ H2 ∪ H3 . Without loss of generality, since (k, l) ∈ / H3 , we may assume that log 2en
sk
≥
CH3 log X(S k ,Tl )
esk tl σ 2
and
1
CH3 CL e 2 2en CH3 2en
ak n ≤ log
3 sk sk
1 1 !
2en CH3 2en CH4
= CH4 log
sk sk
1 1
2en CH3 + CH4
≤ CH4 ,
sk
1
1− − C1
2en 3CH4 2en
CH3 H4
≤ 2
sk CH3 CL eak n sk
6CH4 1
≤ .
CH3 CL sk a2k
−1 −1 −1 1 CH3
By setting 1 − CH3 − CH4 = CH3 , that is, CH4 = −1
1−2CH3
= CH3 −2
(as for example CH3 =
CH4 = 3) we have
1
H3c
2en
CH3
2
X X
ak bl X(Sk , Tl ) ≤ ak bl esk tl σ
k,l k,l sk
H3c 6eCH4 2 X 1
≤ σ ak bl sk tl
CH3 CL k,l sk a2k
6eCH4 2 X 2 X 1
= σ bl tl
CH3 CL l k ak b l
Lc6eCH4 2 2 τ −1
≤ σ b
CH3 CL 1 − a−1
2eCH4 b2 √ 2
= nσ .
CH3 (1 − a−1 )
40
CH3
constant CH4 = CH3 −2
, we have
2eCH4 √ 2
!
X a2 b2
ak bl X(Sk , Tl ) ≤ + nσ .
(k,l)∈H4
1 − b−1 1 − a−1 CH3
1 √ 2 2
√ CL e2 a2 b2 √ 2
≤ 2 2
(1 + t)CL nσ + Ctrim (a + b ) nσ + nσ
1 − 2ε | {z } | {z } 3 {z
L H1
| }
H2
√
!
2 2
a b
+ (1 + t) −1
+ −1
CH3 nσ 2
1−b 1−a
| {z }
H3
2eCH4 √ 2
! !
a2 b2
+ + nσ
1 − b−1 1 − a−1 CH3
| {z }
H4
1 CL e2 a2 b2
= (1 + t)CL + Ctrim (a2 + b2 ) +
1 − 2ε 3
2 2
! !√
a b 2e
+ + (1 + t)CH3 + nσ 2
1 − b−1 1 − a−1 CH3 − 2
a=b=2, √
CH3 =CH4 =3 1 q 32e2 Cε
= (1 + t)2 Cε + 8Ctrim +
1 − 2ε 3
√
!
+ 16 (3(1 + t) + 2e) nσ 2
√
! !
1 q 32e2 q
= (2 Cε + 48)t + 8Ctrim + 2 + Cε + 48 + 32e nσ 2
1 − 2ε 3
1
q
= −1
2 2 log(4ε + 2) + 48 t + 8Ctrim
1 − 2ε
√
! !
32e2 q −1
+ 2+ 2 log(4ε + 2) + 48 + 32e nσ 2
3
ε=0.05 √
≤ (56t + 8.09Ctrim + 417) nσ 2 .
41
With Lemma A.2, we obtain a tail bound
√
P(∥X ′ ∥op ≥ (56t + 8.09Ctrim + 417) nσ 2 )
≤ P(E1 ∪ E2 )
Lemma A.2
≤ exp(−t log(e2 n2 )) + exp(−tCε n)
≤ 2 exp(−t(Cε n ∧ 2 log en)).
To further simplify the probability bound, we notice that with an inequality 2 log en =
2(1 + log n) ≤ 2elog n = 2n ≤ Cε n, we have Cε n ∧ 2 log en = 2 log en. ■
Lemma B.1. Let X ∈ Rn×m be a random matrix with independent centered entries satisfying
the MGF bound with variance proxy σ 2 (Assumption 2.1) and let A ∈ Rn×m be a deterministic
matrix with uniformly bounded entries Aij ≤ σ 2 . Then
√
2 n
P ∥AX T ∥F ≥ 34(1 + t)nσ 2 mσ 2 ∨ ≤ , t ≥ 1.
3 (en)t
√
Especially, if mσ 2 ≳ 1, then ∥AX T ∥F ≲ nσ 2 mσ 2 with high probability.
Proof. Discretization. For a vector ∥u∥ ≤ 1, define sets Sk = {i | 2−k−1 < |ui | ≤ 2−k },
k = 0, . . . , K − 1, SK = {i | |ui | ≤ 2−K } of size sk = |Sk | and values ak = 2−k ≥ max{|ui | |
i ∈ Sk }. Then
a2k 2
sk a2k = u2i ≤ 4,
X X
u ≤4
2 i
k < K,
i∈Sk ui i∈Sk
sK a2K ≤ n4−K .
n
Choose K = ⌊ log
log 4
⌋ so that also sK a2K ≤ n4−K ≤ 4 and n/4 ≤ 4K ≤ n. This implies
sk ≤ 4a−2
k = 4
k+1
for all k and
K K X K−1
sk a2k = a2k ≤ sK a2K + 4 u2i ≤ 4 + 4 ≤ 8.
X X X X
42
Now for any vector x ∈ Rn , we have
X XX X X X
xi ui = xi ui ≤ ak max S xi v i = ak x(Sk ).
i k i∈Sk k v∈{−1,1} k
i∈Sk k
| {z }
=x(Sk )
Bounding (AX T )i: (S). For any nonempty subset S ⊂ [n], we have
P P
where j i′ ∈S Aij Xi′ j ui′ is a sum of independent random variables. By Bernstein’s inequal-
ity (D.3), we have a tail bound
n n
n X m
n o X X Aij ts
(AX T )i: (S) ≥
[ [ X X X
P t|S| ≤ P
vi′2
X ′
ij ≥
2
i=1 ∅̸=S⊂[n] i=1 s=1 S∈([n]) v∈{−1,+1}S
′ σ
i ∈S j=1 | {z }
σ
s
∈[−1,1]
n
! !!
2 2
X n s (ts /σ ) 3(ts /σ 2 )
≤n 2 exp − ∧
s=1 s 4msσ 2 4
n
!!
X 2en t2s 3ts
≤n exp s log − 6
∧ 2 .
s=1 s 4msσ 4σ
43
2en
4σ 2 s log t2s
q
Now ts = 4ms2 σ 6 log 2en
s
∨ 3
s
satisfies s log 2en
s
= 4msσ 6
∧ 3ts
4σ 2
so that
n n
!!
[ [ n
T
o X 2en t2s 3ts
P (AX )i: (S) ≥ (1 + t)ts ≤n exp s log − (1 + t) 6
∧ 2
i=1 ∅̸=S⊂[n] s=1 s 4msσ 4σ
n
2en
X
=n exp −ts log
s=1 s
n ts
s
X
=n
s=1 2en
1 t 2 2t
(∗) !
≤n + (n − 1)
2en 2en
t
1 1+t
!
t≥1 1 1
≤ n + (n − 1)
2 en en
t t !
1 1 1 1
≤n +
2 en e en
n
≤ ,
(en)t
s
s
where inequality (∗) holds, because the function s 7→ 2en is decreasing (the derivative of
s s s
its logarithm s log 2en is negative log 2en + 1 = log 2n , s ≤ n).
Norm bound. Define a rare event
n n o
(AX T )i: (S) ≥ (1 + t)t|S| .
[ [
E=
i=1 ∅̸=S⊂[n]
The aim is to bound the norms of the row vectors ∥(AX T )i: ∥ under a high-probability event
E c . When E c holds, inequality (B.1) gives
v
u n
∥AX T ∥F =
uX
t ∥(AX T )
i: ∥
2
i=1
√
ak (AX T )i: (Sk )
X
≤ n max max
i S0 ,...,SK
k
√ X
≤ n max (1 + t) ak tsk .
S0 ,...,SK
k
44
2en
q 4σ 2 sk log
Evaluating tsk = 4ms2k σ 6 log 2en
sk
∨ 3
sk
gives an upper bound
s
2 2en
T
√ X
2 6 2en 4σ sk log sk
∥AX ∥F = (1 + t) n max 4msk σ log ∨ ak
S0 ,...,SK
k sk 3
√ 2 X √
!
2 2en
≤ 2(1 + t) nσ max mσ ∨ 2 ak sk log
S0 ,...,SK
k 3 sk
√
√ 2
2en
= 2(1 + t) nσ 2
X
mσ 2 ∨ max ak sk log .
3 S0 ,...,SK k sk
2
where the inequality (∗) follows from increasingness of a function s 7→ s log2 4es n as it has a
2 2 2
nonnegative derivative log2 4es n − 2 log 4es n = (log 4es n ) log 4n
s
≥ 0 for s ≤ 4n. With a help
of Wolfram, the last term is equal to
√
32 log(e2 n)(2K+1 − 1) − log(4)2(2K K − 2K + 1) .
45
Hence, we conclude that
K
2en √
≤ 2K+1 32 log(e2 n) − log(4)(K − 1)
X
sk ak log
k=0 sk
√ e2 n
= 2K+1 32 log K−1
4
√ 4e2 n
= 2K+1 32 log K
4
(∗∗) √ 4e2 n
≤ 2 32n log
√ n
2
= 2 32n log(4e ),
2
where the inequality (∗∗) follows from increasingness of a function s 7→ s log 4es2n as it has a
2 √
nonnegative derivative log 16es2 n − 2 = log 16n
s2
≥ 0 for s ≤ n.
Summary. Combining these results gives that the high-probability event E c implies
√
√ 2 √
∥AX T ∥F ≤ 2(1 + t) nσ 2 mσ 2 ∨ 2 32n log(4e2 )
3
√ 2 2
√ 2
= 16 2 log(4e )(1 + t)nσ mσ ∨
2
√
3
2
2
≤ 34(1 + t)nσ mσ 2 ∨ .
3
■
Notice that in previous lemma, if |Aij | = Var(Xij ) = σ 2 , then the variance of a single entry
of AX T is Var((AX T )ij ) = mσ 6 and the expected squared Frobenius norm is E∥AX T ∥2F =
n2 σ 4 mσ 2 , which makes the high-probability bound rate-optimal.
46
Proof. The first inequality follows from E∥(XX T )⊙I∥2op = E maxi |(XX T )ii |2 ≥ E(XX T )2ii ≥
Var((XX T )ii ) = mVar(Xij2 ), where the variance follows from an elementary calculation
For the second statement, first notice that the matrix (XX T ) ⊙ M is centered, namely the
diagonal entries are zero and for i ̸= i′ , we have
Therefore,
Var((XX T )ii′ )
X
=
i′ :i′ ̸=i
X X
= Var Xij Xi′ j
i′ :i′ ̸=i j
X X
= Var (Xij Xi′ j )
i′ :i′ ̸=i j
E(Xij2 Xi2′ j )
X X
=
i′ :i′ ̸=i j
EXij2 EXi2′ j
X X
=
i′ :i′ ̸=i j
= (n − 1)mp2 (1 − p)2 .
This suggests that XX T − EXX T consists of two parts, a diagonal and an off-diagonal
√
part. For small p ≪ 1, we expect the norm of the diagonal to be of order mp and the
√
norm of the off-diagonal to be of order nmp. If this holds, then the diagonal dominates
√ √
the off-diagonal, when mp ≫ nmp, or equivalently, p ≪ n−1 . When we consider square
matrices m = n, we rarely consider p ≪ n−1 , but when m ≫ n, we may be interested in the
regime p ≪ n−1 and this phenomenon will become significant. The rest of this subsection
√
aims at proving that the off-diagonal part is bounded from above by O( nmp), or in a
47
√
slightly more general setting, O( nmσ 2 ).
Lemma C.2 bounds a product X1 X2T of two independent matrices X1 , X2 with indepen-
dent entries. The key observation is that the entries of X1 X2T are conditionally independent
entries given X2 , when X2 is sufficiently sparse. This allows us to apply Theorem 6.3. Lemma
C.3 applies a well-known decoupling technique (see for example Chapter 6 in [39]) to analyze
a product (X − EX)X T as if X and X T were independent. The decoupling technique applies
to expectations of convex functions of (X − EX)X T which is then converted to a tail bound
in the proof. Theorem 6.4 finally bounds off-diagonal part of the Gram matrix XX T . The
desired norm bound is obtained by removing certain rows and columns, and Lemma 6.5
asserts that only a few rows and columns are removed. The bound presented here is derived
with a simple application of Markov’s inequality, and consequently the bound is worse than
the bound in Lemma A.1.
Lemma C.2. Let X1 ∈ Rn1 ×m , X2 ∈ Zn2 ×m be independent random matrices with indepen-
dent entries, n = n1 + n2 ≤ m. Assume that X1 is centered and its entries satisfy the MGF
bound with variance proxy σ 2 (Assumption 2.1) and E|(X1 )ij | ≤ σ 2 . For the other matrix, as-
sume that the entries of X2 satisfy the MGF bound with variance proxy σ 2 and E|(X2 )ij | ≤ σ 2 .
Furthermore, assume that 8 log en ≤ mσ 2 , nσ 2 ≤ 3−1 e−2 imposing m ≥ 24e2 n log en. Define
a product matrix Y = X1 X2T ∈ Rn1 ×n2 and its trimmed version Y ′ as
j ′ ,k |(X1 )ik (X2 )j ′ k | ∨ i′ ,k |(X1 )i′ k (X2 )jk | ≤ Ctrim nmσ 4 ,
P P
Yij ,
Yij′ =
0, otherwise.
(i) Then
√ √ !
t tnmσ 4 √ √
4
+ 2n− t/2 6
X
P |(X1 )il (X2 )jl | ≥ tnmσ ≤ √ exp − √
j,l 6 40 6
for all
24 log2 m
t ≥ 24 ∨ .
log2 (1/3enσ 2 )
for all
512 log3 m
t ≥ 53/2 ∨ .
log3 (1/3enσ 2 )
In particular, if nσ 2 ≲ m−c for some arbitrarily small constant c, then ∥Y ′ ∥op ≲
√ √
nmσ 2 with high probability as n → ∞, and if nσ 2 ≲ 1, then ∥Y ′ ∥op ≲ log3 (m) nmσ 2
48
with high probability as n → ∞.
P∞ (k)
Proof. Denote n = n1 + n2 . Decompose the integer matrix as X2 = k=1 X2 , where
(k)
matrices X2 are defined recursively by
k−1
!!
(k) X X (k′ )
X2 = arg min X2 − X + X2 , k≥1
X∈{0,±1}n2 ×m :maxj ∥X:j ∥1 ≤1 ij k′ =1 ij
with arbitrary tie breaks as there might not be unique minimizers. As an example of such a
decomposition, consider the following small integer matrix
3 −2 1
1 −1 1
1 −1 0
1 0 0
0 0 −1 = 0 0 0 + 0 0 −1 + 0 0 0 .
0 1 0 0 0 0 0 0 0 0 1 0
That is, the components are matrices with entries being in {−1, 0, 1}, each column having
at most one nonzero entry and the k:th component picks “the k:th one of each column” if it
(k)
exists. Since each column has at most one nonzero entry, the supports of the rows of X2
are disjoint.
Notice that the summands are eventually zero matrices as each entry of X2 is finite
almost surely. Let Kmax denote the number of nonzero components (which is random as X2
is random). By Lemma D.1:(v), we have
2 (eλ −1−λ) 2 (eλ −1−λ)
Eeλ(|X2 (i,j)−EX2 (i,j)|−E|X2 (i,j)−EX2 (i,j)|) ≤ e2σ ≤ e3σ for all λ ≥ 0
and consequently
P 2 (eλ −1−λ)
Eeλ i
(|X2 (i,j)−EX2 (i,j)|−E|X2 (i,j)−EX2 (i,j)|)
≤ Ee3n2 σ for all λ ≥ 0.
49
Now for any kmax ≥ 3n2 σ 2 , Bennett’s inequality (D.1) gives a tail bound
P (Kmax ≥ kmax )
(n )
m X
[ 2
The resulting upper bound holds also for 0 < kmax ≤ 3n2 σ 2 as the upper bound becomes
trivial.
(k)
Conditioned on X2 , the product Y (k) = X1 (X2 )T consists of independent entries (since
(k)
the rows of X2 have disjoint supports) and by Lemma D.1:(iii), they satisfy the moment
condition
(k) (k)
λYij λ(X1 (X2 )T )ij
E e |X2 = E e |X2
(k)
≤ exp ∥(X2 )j: ∥1 σ 2 (e|λ| − 1 − |λ|)
2 |λ|
≤ exp max
′
∥(X2 )j ′ : ∥1 σ (e − 1 − |λ|)
j
2σ 2 (eλ −1−λ)
e for all λ ≥ 0 (Lemma D.1:(v)). Hence,
X
E exp λ |(X1 )il | − E|(X1 )il | | X2
(k)
l:∃j:(X2 )jl ̸=0
2 λ
≤ exp 2n max
′
∥(X2 ) ∥1 σ (e − 1 − λ) j′:
j
50
for all λ ≥ 0. The next objective is to bound maxj ′ ∥(X2 )j ′ : ∥1 .
For any t ≥ 1, assumptions E|(X2 )ij | ≤ σ 2 and 8 log en ≤ mσ 2 and Lemma A.1 give a
tail bound
P max∥(X2 )j: ∥1 ≥ 5tmσ 2
j
≤ P max∥(X2 − EX2 )j: ∥1 + ∥(EX2 )j: ∥1 ≥ 5tmσ 2
j
≤ P max∥(X2 − EX2 )j: ∥1 ≥ 4tmσ 2
j
[ q
≤ P ∥(X2 − EX2 )j: ∥1 ≥ E∥(X2 − EX2 )j: ∥1 + (1 + t) 8mσ 2 log en
j
(A.1)
≤ (en)−t .
That is, under a high-probability event i {∥(X2 )j: ∥1 < 5tmσ 2 }, we have the moment condi-
T
tions
(k)
E e λYij
| X2 ≤ exp 5tmσ 4 (e|λ| − 1 − |λ|)
for all λ ≥ 0.
51
(i) For any t ≥ 1, Bernstein’s inequality (D.3) gives
≥ 6kmax tnmσ 4
X
P |(X1 )il (X2 )jl |
j,l
!
X X (k) 4 2
≤ P |(X1 )il (X2 )jl | ≥ 6tnmσ , max∥(X2 )i: ∥1 < 5tmσ , Kmax < kmax
i
k≤kmax j,l
+ P max∥(X2 )i: ∥1 ≥ 5tmσ 2 + P (Kmax ≥ kmax )
i
!
4 2
X X
≤ P |(X1 )il | ≥ 6tnmσ , max∥(X2 )i: ∥1 < 5tmσ
i
k≤kmax (k)
j,l:(X2 )jl ̸=0
!kmax
−t 3eσ 2
+ (en) +m
kmax
" " # !
|(X1 )il | | X2 + tnmσ 4 | X2
X X X
≤ E P |(X1 )il | ≥ E
k≤kmax (k)
l:∃j:(X2 )jl ̸=0
(k)
l:∃j:(X2 )jl ̸=0
# !kmax
3enσ 2
−t
× I max∥(X2 )i: ∥1 < 5tmσ 2
+ (en) +m n−kmax
i kmax
!! !k
(D.3) t2 n2 m2 σ 8 3tnmσ 4 −t 3enσ 2 max −kmax
≤ kmax exp − ∧ + (en) + m n
4 · 10tnmσ 4 4 kmax
! !
tnmσ 4 −t kmax
= kmax exp − + (en) + exp log m − kmax log n−kmax .
40 3enσ 2
2 log m t 2
By choosing kmax = ⌊t⌋ ≥ 2 ∨ log(1/3enσ 2 ) so that 2 ≤ t − 1 ≤ kmax ≤ t and t ≥ 6enσ
52
′
Ctrim gives Ctrim = 5tCtrim since
n2 n2 X
m n2 X
m
(k) (k) (k)
|(X1 )il (X2 )jl | ≤ Ctrim nmσ 4 ,
X X X
∥Yi: ∥1 = |Yij | = (X1 )il (X2 )jl ≤
j=1 j=1 l=1 j=1 l=1
n1 n1 Xm n1 X m
(k) (k) (k)
|(X1 )il (X2 )jl | ≤ Ctrim nmσ 4 .
X X X
∥Y:j ∥1 = |Yij | = (X1 )il (X2 )jl ≤
i=1 i=1 l=1 i=1 l=1
By Theorem 6.3, there exists a constant C such that for all t ≥ 5, we have a tail bound
Ctrim √
! !
′
P ∥Y ∥op ≥ Ckmax t + √ nmσ 2
5t
√
! !
C trim
≤ P ∥Y ′ ∥op ≥ Ckmax t + √ nmσ 2 , Kmax ≤ kmax , max∥(X2 )i: ∥1 ≤ 5tmσ 2
5t i
+ P (Kmax ≥ kmax ) + P max∥(X2 )i: ∥1 ≥ 5tmσ 2
i
s
k[
max
t Ctrim q
≤ EP ∥(Y (k) )′ ∥op ≥C + n(5tmσ 4 ) | X2
k=1 5 5t
( ) !kmax
3en2 σ 2
2
+ (en)−t
\
×I {∥(X2 )i: ∥1 < 5tmσ } +m
i kmax
√ 3en2 σ 2
!kmax
−2
≤ 2kmax (en) t/5
+m + (en)−t
kmax
√ 3en2 σ 2
!kmax
≤ 3kmax (en)−2 t/5
+m .
kmax
√ √ 8 log m 8 log m
√
t
√ √
By choosing kmax = ⌊ t⌋ ≥ 5∨ log(1/3enσ 2 ) ≥ 2∨ log(1/3enσ 2 ) so that 2 ≤ t−1 ≤ kmax ≤ t
√
and t ≥ 6e2 nσ 2 (recall the assumption nσ 2 ≤ 3−1 e−2 ), we obtain
Ctrim √
! !
P ∥Y ′ ∥op ≥ C t3/2 + √ nmσ 2
5
√ !√t/2
√ 6enσ 2
≤ 3 t(en)−2 t/5 + m √
t
√ √
s
√ t −2 t/5 √
t 1
√
t
≤ 3 5n−2 t/5 e + elog m− 4 log 3enσ2 − 4
5
√ −2 −√t/5 √
≤ 3 5n e + m−1 e− t/4
√
3 5 + 1 −√t/4
≤ e
n
8 3/2 1/3
≤ e−(t ) /4 .
n
53
83 log3 m
The claim follows by writing the assumption on t as t3/2 ≥ 53/2 ∨ log3 (1/3enσ 2 )
. ■
Lemma C.3 (Decoupling). Let X ∈ Zn×m be a random matrix with independent integer-
valued entries that satisfy Assumption 2.1 with variance proxy σ 2 and E|Xij | ≤ σ 2 . Fur-
thermore, assume that 4 log en ≤ mσ 2 , nσ 2 ≤ 6−1 e−2 , imposing m ≥ 24e2 n log en. Define
a mask M = 11T − I ∈ Rn×n , product matrix Y = ((X − EX)X T ) ⊙ M ∈ Rn×n , trimming
mask N ∈ Rn×n as
X
|(Xi′ k − EXi′ k )Xjk |Mi′ j ≤ Ctrim nmσ 4
X
Nij = I |(Xik − EXik )Xj ′ k |Mij ′ ∨
j ′ ,k i′ ,k
and a trimmed product Y ′ = Y ⊙ N . Then there exists an absolute constant C such that
√ 1/3
P ∥Y ′ ∥op ≥ C(t + Ctrim ) nmσ 2 ≤ Cn−1 e−t
for all
log3 m
!
t≥C 1∨ .
log3 (1/6enσ 2 )
√
In particular, if nσ 2 ≲ m−c for some arbitrarily small constant c, then ∥Y ′ ∥op ≲ nmσ 2
√
with high probability as n → ∞, and if nσ 2 ≲ 1, then ∥Y ′ ∥op ≲ log3 (m) nmσ 2 with high
probability as n → ∞.
On the diagonal, we have ((XI: − EXI: )(XI c : )T )ii = 0, that is, ((X − EX)X T ) ⊙ M =
4EI (XI: − EXI: )(XI c : )T . Since operator norm is a convex function in the space of matrices,
Jensen’s inequality gives
Similarly, for any convex and nondecreasing function f : R≥0 → R≥0 , Jensen’s inequality
54
gives
Ef ∥((X − EX)X T ) ⊙ M ⊙ N ∥op = Ef ∥4EI ((XI: − EXI: )XITc : ) ⊙ N ∥op
≤ EI Ef 4∥((XI: − EXI: )XITc : ) ⊙ N ∥op .
The interchange of the order of the expectations E and EI is justified most evidently when
EI = 2−n I⊂[n] is written as a finite sum. The gain of this argument is the independence of
P
the submatrices XI: and XI c : . Namely, now Lemma C.2 can be applied to the product (XI: −
EXI: )XITc : with variance proxy 2σ 2 (the number 2 is needed to have E|(XI: − EXI: )ij | ≤ 2σ 2 ).
′
Consequently, the trimming constant in Lemma C.2 becomes Ctrim = Ctrim /2. Define Z =
′ 3
∥Y ∥ ′ 3/2 512 log m
√
8C nmσ 2
− Ctrim − 5 ∨ log3 (1/6enσ2 ) , where C is the universal constant from Lemma
+
p
C.2:(ii). Since f (x) = (x − 1)+ is a nondecreasing convex function f : R≥0 → R≥0 for p ≥ 1,
by Lemma C.2, the absolute moments of Z can be estimated from above as
EZ p
!!p
4∥((XI: − EXI: )XITc : ) ⊙ N ∥ ′ 3/2 512 log3 m
≤ EI E √ − C trim − 5 ∨
8C nmσ 2 log3 (1/6enσ 2 ) +
3
∥((XI: − EXI: )XITc : )
! !
Z ∞
⊙ N∥ 512 log m ′
= EI ptp−1 P √ ≥ 53/2 ∨ + t + Ctrim dt
0 C nm2σ 2 log3 (1/3en2σ 2 )
Lemma C.2
Z ∞
−1 1/3 /4
≤ 8n ptp−1 e−t dt
Z0 ∞
= 8n−1 p43(p−1) u3(p−1) e−u 3 · 43 u2 du
0
−1 3p
= 8n 4 3pΓ(3p)
= 8n−1 43p Γ(3p + 1)
for any p ≥ 1. For 1/3 ≤ p ≤ 1, this argument does not work as x 7→ xp is not convex. Now
for any |λ| < 4−1 , geometric summation formula gives
2 ∞
λk Z k/3 λk
!
λZ 1/3
EZ k/3
X X
E e − =
k=0 k! k=3 k!
∞
λk k
≤ 8n−1
X
4 Γ(k + 1)
k=3 k!
∞
= 8n−1 (4λ)k
X
k=3
(4λ)3
= 8n−1 .
1 − 4λ
55
By Markov’s inequality, this gives a tail bound
∞ ∞
λk k/3 X λk k/3
!
X
P(Z ≥ t) ≤ P Z ≥ t
k=3 k! k=3 k!
∞
!−1
−1 (4λ)3 X λk k/3
≤ 8n t .
1 − 4λ k=3 k!
k
To simplify the upper bound, let us study the relation between the series ∞ x P
k=3 k! and the
exponential function ex . Denote the Taylor polynomials of the exponential function by
i
pk (x) = ki=0 xi! . Since the derivatives of the polynomials satisfy p′k+1 = pk , for any x ≥ 0
P
and k ≥ 1, we have
Now
λ=8−1 , !−1
t0 =λ−3 p2 (1) 1/3 /8
−3 −1
P(Z ≥ t) ≤ 16 · 2 n 1− e−t
e
1/3 /8
≤ 25n−1 e−t , t ≥ 83 = 512.
512 log3 m √
( ! ! )
{Z ≥ t} = ∥Y ′ ∥ ≥ 8C 53/2 ∨ 3 2
′
+ t + Ctrim nmσ 2
log (1/6enσ )
√
⊃ {∥Y ′ ∥ ≥ 8C(2t + Ctrim
′
) nmσ 2 }
√
⊃ {∥Y ′ ∥ ≥ 16C(t + Ctrim ) nmσ 2 },
512 log3 m
when t ≥ 53/2 ∨ log3 (1/6enσ 2 )
. ■
Proof of Theorem 6.4. Let t ≥ 0. Since E|Xij − EXij | ≤ 2E|Xij | ≤ 2σ 2 and mσ 2 ≥ 8 log en,
56
applying Lemma A.1 gives
n q !
2
≤ (en)−t .
[
P ∥(X − EX)i: ∥1 ≥ 2mσ + (1 + t) 8mσ 2 log en
i=1
Tn n √ o
Under a high-probability event i=1 ∥(X − EX)i: ∥1 < 2mσ 2 + (1 + t) 8mσ 2 log en , we have
X X X
|(X − EX)ik Xjk |Mij Nij ≤ |Xik Xjk |Mij Nij + |(EXik )(X − EX)jk |Mij Nij
j,k j,k j,k
X
+ |(EX)ik (EX)jk |Mij Nij
j,k
j,k
and similarly
X X X
|(X − EX)ik Xjk |Mij Nij ≤ |Xik Xjk |Mij Nij + |(EXik )(X − EX)jk |Mij Nij
i,k i,k i,k
X
+ |(EX)ik (EX)jk |Mij Nij
i,k
i,k
In the following calculation, constant C is allowed to vary from line to line. By Lemma
1/3
C.3 and Lemma B.1, with probability of at least 1 − Cn−1 e−t − (en)−t − n(en)−t ≥
57
1/3
1 − Cn−1 e−t , triangle inequality gives
Proof of Lemma 6.5. Let t ≥ 0. Since E|Xij − EXij | ≤ 2E|Xij | ≤ 2σ 2 and mσ 2 ≥ 8 log en,
applying Lemma A.1 gives
n n
!
o
2 2
[
P ∥(X − EX)i: ∥1 ≥ 2mσ + (1 + t)mσ
i=1
n q !
2
[
≤P ∥(X − EX)i: ∥1 ≥ 2mσ + (1 + t) 8mσ 2 log en
i=1
−t
≤ (en) .
Tn
Under a high-probability event i=1 {∥(X − EX)i: ∥1 < (3 + t)mσ 2 }, we have
X n X
X m
|(EXik )(X − EX) |M j′k ij ′ ≤ |EXik ||Xjk − EXjk |
j ′ ,k j=1 k=1
≤ nσ 2 max∥(X − EX)j: ∥1
j
≤ (3 + t)nmσ 4 .
24 log2 m
Suppose t ≥ 24 ∨ log2 (1/6enσ 2 )
. By triangle inequality and Lemma C.2:(i) (with 2σ 2 ), we
58
obtain
X X X
|Xik Xj ′ k |Mij ′ ≤ |(X − EX)ik Xj ′ k |Mij ′ + |(EXik )(X − EX)j ′ k |Mij ′
j ′ ,k j ′ ,k j ′ ,k
X
+ |(EXik )(EXj ′ k )|
j ′ ,k
D Bennett’s inequality
This subsection studies the concentration of a sum of independent random variables satis-
fying Assumption 2.1. Lemma D.1 shows ways to construct random variables satisfying the
assumption and Lemma D.2 collects well-known tail bounds for such random variables.
Lemma D.1.
(i) If X ∈ [0, 1] almost surely, then X satisfies Condition (2.1) with variance proxy EX.
(ii) If X ∈ [−1, 1] almost surely and EX = 0, then X satisfies Condition (2.1) with variance
proxy EX 2 .
(iii) If X1 , X2 are independent random variables satisfying Condition (2.1) with variance
proxies σ12 , σ22 respectively and a1 , a2 ∈ [−1, 1], then X = a1 X1 + a2 X2 satisfies Condi-
tion (2.1) with variance proxy a21 σ12 + a22 σ22 .
(iv) If X is a centered random variable satisfying Condition (2.1) with variance proxy σ 2
and ξ is a random variable independent of X satisfying |ξ| ≤ 1 almost surely, then their
product ξX satisfies Condition (2.1) with variance proxy σ 2 .
(v) If X satisfies Condition (2.1) with variance proxy σ 2 , then |X − EX| satisfies the
condition with variance proxy 2σ 2 at least for λ ≥ 0, that is,
2 (eλ −1−λ)
Eeλ(|X−EX|−E|X−EX|) ≤ e2σ for all λ ≥ 0.
59
Proof of Lemma D.1. (i) Let X ∈ [0, 1] be a bounded nonnegative random variable. By
convexity of the exponential function (linear interpolation of convex function t 7→ et
between points (0, e0 ) and (λ, eλ )), we have
Here the second inequality (*) follows from the well-known inequality 1 + x ≤ ex . Since
λk P∞ |λ|k
eλ − 1 − λ = ∞ k=2 k! ≤
|λ|
k=2 k! = e − 1 − |λ|, random variable X satisfies Condition
P
(iii) If X1 , X2 are independent and satisfy Condition (2.1) with variance proxies σ12 and σ22
respectively and |a1 |, |a2 | ≤ 1, then
Eeλ((a1 X1 +a2 X2 )−E(a1 X1 +a2 X2 )) = Eea1 λ(X1 −EX1 ) Eea2 λ(X2 −EX2 )
2 |a1 λ| −1−|a λ|) 2 |a2 λ| −1−|a λ|)
≤ eσ1 (e 1
eσ2 (e 2
This shows that a1 X1 + a2 X2 satisfies Condition (2.1) with variance proxy a21 σ12 + a22 σ22 .
(iv) If X is a centered random variable satisfying Condition (2.1) with variance proxy σ 2
and ξ is a random variable independent of X satisfying |ξ| ≤ 1 almost surely, then their
product ξX satisfies EξX = EξEX = 0 and
(v) If X is a centered random variable satisfying Condition (2.1) with variance proxy σ 2 ,
60
then for any λ ≥ 0, we have
≤ 1 + |e−λE|X|
{z } E (e
λX
− 1 − λX) ∨ (e−λX − 1 + λX)
| {z } | {z }
≤1 ≥0 ≥0
(vi) By part (i), the statement holds for Bernoulli-variables. By part (iii), the statement
holds for binomially distributed variables as they can be represented as sums of in-
dependent and identically distributed Bernoulli-variables. If X is Poisson-distributed,
then
∞
(EX)k −EX λk
Eeλ(X−EX) = e−λEX EeλX = e−λEX
X
e e
k=0 k!
∞
−(1+λ)EX (EXeλ )k λ
= e−(1+λ)EX eEXe
X
=e
k=0 k!
λ −1−λ) |λ| −1−|λ|)
= eEX(e ≤ eEX(e .
Inequality (D.1) is known as Bennett’s inequality [4] and inequalities (D.2) and (D.3) are
known as Bernstein’s inequalities. The general method of using moment generating functions
to derive the probability tail bounds is often attributed to Chernoff [8] or Cramér [11].
61
Proof of Lemma D.2. Markov’s inequality gives
P(X ≥ t) = inf P(eλX ≥ eλt ) ≤ inf e−λt EeλX ≤ exp inf −λt + σ 2 (eλ − 1 − λ) .
λ≥0 λ≥0 λ≥0
Bennett’s inequality (D.1) follows from optimizing the infimum by setting λ = log(1 + t/σ 2 ),
namely now
2 λ
P(X ≥ t) ≤ exp inf −λt + σ (e − 1 − λ)
λ≥0
−t
t t t
λ=log(1+t/σ 2 ) 2
= 1+ 2 exp σ 1 + 2 − 1 − log 1 + 2
σ σ σ
−σ2 −t !σ2 +t !σ2 +t
t σ2 eσ 2
t t −σ 2
= 1+ 2 e = e =e .
σ σ2 + t σ2 + t
namely now
!σ2 +t
eσ 2
−σ 2 2 λ
e = exp inf −λt + σ (e − 1 − λ)
σ2 + t λ≥0
!!
σ 2 λ2 /2
≤ exp inf −λt +
λ≥0 1 − λ/3
and the tail bound can be derived in similar spirit (Exercise 2.8.5 and Exercise 2.8.6. in
[39]). Inequality (D.3) simply follows from
! ! !!
t2 /2 t2 /2 t2 3t
exp − 2 ≤ exp − = exp − ∧ .
σ + t/3 (2σ 2 ) ∨ (2t/3) 4σ 2 4
To apply the most practical but weakest form of Bernstein’s inequality, the following
simple lemma is sometimes useful.
Lemma D.3. If f, g : R≥0 → R≥0 are strictly increasing bijections, then f ∧ g is a strictly
increasing bijection with an inverse given by (f ∧ g)−1 = f −1 ∨ g −1 .
Proof. If 0 ≤ x1 < x2 , then f (x1 ) < f (x2 ) and g(x1 ) < g(x2 ) giving f (x1 ) ∧ g(x1 ) ≤ f (x1 ) <
f (x2 ) and f (x1 ) ∧ g(x1 ) ≤ g(x1 ) < g(x2 ), that is, f (x1 ) ∧ g(x1 ) < f (x2 ) ∧ g(x2 ) and f ∧ g is
strictly increasing.
62
Suppose f (x) ∧ g(x) = y. Then either f (x) = y or g(x) = y and hence f −1 (y) = x or
g −1 (y) = x. If f (x) ≥ g(x) = y, then f −1 (y) ≤ x = g −1 (y) (since f is increasing) and hence
x = f −1 (y) ∨ g −1 (y). Similarly, if y = f (x) ≤ g(x), then g −1 (y) ≤ x = f −1 (y) (since g is
increasing) and hence x = f −1 (y) ∨ g −1 (y). This shows that x = f −1 (y) ∨ g −1 (y). ■
In Bernstein’s inequality (D.3), choose f (t) = t2 /4σ 2 and g(t) = 3t/4. To find a specific
t for which the probability bound attains some given value e−s , solve s = f (t) ∧ g(t) by
√ 4s
t = f −1 (s) ∨ g −1 (s) = 4σ 2 s ∨ .
3
That is, solve both s = f (t) and s = g(t) and then choose the maximum of the solutions.
In Lemma D.2, Bernstein’s inequalities show that the probability tends to zero, if t ≫
σ ∨1, which might not be easily visible from Bennett’s inequality, although it is the sharpest.
Bennett’s inequality does show, however, that the probability tends to zero, if t ≫ σ 2 . This
conclusion is beneficial, if σ 2 is particularly small. In that case, the first inequality may be
estimated further to a more practical form
!t
eσ 2 t
P(X ≥ t) ≤ = exp −t log 2 .
t eσ
63