0% found this document useful (0 votes)

23 views63 pages

Consistent Spectral Clustering in Sparse

This paper presents a spectral clustering algorithm for high-order clustering in sparse tensor block models, addressing challenges in multiway datasets common in fields like bioinformatics and social network analysis. The authors introduce a tensor block model for sparse integer-valued data and demonstrate the algorithm's consistency through a novel concentration bound for sparse random Gram matrices. The findings are supported by simulations, and the paper discusses the tradeoff between signal loss and noise reduction during data aggregation.

Uploaded by

Abhaas

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views63 pages

Consistent Spectral Clustering in Sparse

Uploaded by

Abhaas

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 63

Consistent spectral clustering in sparse

tensor block models

Ian Välimaa and Lasse Leskelä
arXiv:2501.13820v1 [math.ST] 23 Jan 2025

24 January 2025

Abstract
High-order clustering aims to classify objects in multiway datasets that are preva-
lent in various fields such as bioinformatics, social network analysis, and recommen-
dation systems. These tasks often involve data that is sparse and high-dimensional,
presenting significant statistical and computational challenges. This paper introduces
a tensor block model specifically designed for sparse integer-valued data tensors. We
propose a simple spectral clustering algorithm augmented with a trimming step to mit-
igate noise fluctuations, and identify a density threshold that ensures the algorithm’s
consistency. Our approach models sparsity using a sub-Poisson noise concentration
framework, accommodating heavier than sub-Gaussian tails. Remarkably, this natural
class of tensor block models is closed under aggregation across arbitrary modes. Con-
sequently, we obtain a comprehensive framework for evaluating the tradeoff between
signal loss and noise reduction during data aggregation. The analysis is based on a
novel concentration bound for sparse random Gram matrices. The theoretical findings
are illustrated through simulation experiments.

Keywords: latent block model, stochastic block model, almost exact recovery, weak con-
sistency, multiway clustering, higher-order network, hypergraph

1 Introduction
Multiway clustering is a statistical problem where an observed data array Y ∈ Rn1 ×···×nd
of order d is analyzed by grouping (i.e., clustering) together similar entities ik = 1, . . . , nk
corresponding to slices Y:···:ik :···: along some of the modes k = 1, . . . , d. An example of a
higher-order data is multitissue gene expression data [23, 32, 40], where the three modes
correspond to individuals, genes and tissues. An important special case of a multiway array
is a binary array with equal number of entities in each mode Y ∈ {0, 1}n×···×n . Such an array
can be seen as a hypergraph on n nodes such that an entry Y(i1 , . . . , id ) = 1 indicates the

1
presence of a hyperedge between the nodes i1 , . . . , id . In this context, Y is sometimes called
an adjacency tensor. Hypergraphs have been used to model ternary relations between genes,
which can be beneficial in revealing biological regulatory mechanisms [26].
The main objective is to determine when an underlying cluster structure can be recovered
from data. In short, this paper introduces a spectral clustering algorithm, proves that the
algorithm clusters correctly almost every entity assuming a statistical model, and develops
the necessary random matrix theory to prove it.
The assumed statistical model is an integer-valued tensor block model Y = X + E ∈
n1 ×···×nd
Z , where Y is an observed d-way data array, X = EY is a nonrandom signal array and
E = Y − EY is a random noise array with independent entries. Each mode k = 1, . . . , d has a
cluster structure represented by a cluster assignment functions zk : {1, . . . , nk } → {1, . . . , rk }
(cluster of ik along mode k is zk (ik )) so that the signal entries depend only on the clusters
of the indices, i.e., X (i1 , . . . , id ) = ρS(z1 (i1 ), . . . , zd (id )) for some scaling parameter ρ ∈ R
and a core array S ∈ [−1, 1]r1 ×···×rd . The aim of multiway clustering is to recover zk up to
permutation of cluster labels from the observation Y. In this setting, we may consider the
core array as fixed and ρ may vary as the size of the data array increases. In the case of
binary data, ρ controls the expected number of ones in Y so that large values of ρ correspond
to dense tensor. The theoretical question of interest is to find how small ρ can be for the
recovery to be possible and computationally feasible.
As the main motivation comes from hypergraphs, the focus is mostly on binary arrays.
However, the developed theory includes also other more general integer-valued distributions
such as Poisson distributions. When analyzing tensor block models, this paper assumes in-
dependent entries in data array Y. This simplifies many arguments, but excludes symmetric
data tensors that satisfy Y(i1 . . . id ) = Y(iπ(1) . . . iπ(d) ) for all permutation π on {1, . . . , d}.
Such tensors arise from undirected hypergraphs, for example.
This paper introduces a spectral clustering algorithm which flattens the data array into
a wide nk × l̸=k nl -matrix, after which it reduces the dimension of the row vectors from
Q
Q
l̸=k nl to the number of clusters rk via singular value decomposition, and finally clusters the
low-dimensional row vectors with a k-means clustering algorithm. The main contribution of
this paper is showing weak consistency of this algorithm when the density parameter satisfies
ρ ≫ ( l nl )−1/2 . Based on both simulations and theoretical results, this paper also argues
Q

without a rigorous proof that this algorithm fails for ρ ≪ ( l nl )−1/2 .

1.1 Related literature

Recent research on clustering high-order data has seen considerable attention, particularly
in the study of hypergraphs [1, 5, 10, 16, 25, 27, 34, 35, 43], as well as temporal and
multilayer networks [3, 28, 29, 31, 33, 36]. To ease the comparison, let us consider cubic
arrays n = n1 = · · · = nd . Zhang and Tan [43] recently showed that exact clustering of

2
n
a d-uniform hypergraph is impossible for density parameter ρ ≤ c nlog d−1 with a sufficiently

small constant c. Furthermore, they showed that fast exact clustering is possible under
n
certain assumptions on the core array S for density parameter ρ ≥ C nlog d−1 with a sufficiently

large constant C. Essentially, their algorithm reduces the higher-order array to a matrix
by summing over higher-order modes to obtain an aggregate matrix with entries Y (i, j) =
i3 ,...,id Y(i, j, i3 , . . . , id ) counting the number of common hyperedges between pairs of nodes.
P

After this reduction, the algorithm performs spectral clustering on Y . A downside of this
algorithm is that the matrix Y needs to be informative. If this assumption fails, i.e., if the
matrix Y is not informative, then the cluster structure cannot be inferred from Y .
This motivates to study when and how the cluster structure can be recovered from the
data array when the cluster structure is unidentifiable from the aggregate matrix. In the
context of multilayer networks, Lei and Lin [29] showed that an algorithm similar to the
1/2
algorithm proposed in this paper is successful for ρ ≫ lognd/2 n . Ke, Shi and Xia [25] proved
an analogous result in hypergraph degree-corrected block models. We improve this bound by
√ n
a factor of log n. This threshold is somewhat unexpected as it differs from nlog d−1 substantially
n
for d ≥ 3 and raises the question what happens in regime nlog d−1 ≪ ρ ≪ n
−d/2
. By relying
on a conjecture, Lei, Zhang and Zhu [31] showed that every polynomial-time algorithm fails
1
asymptotically for ρ ≤ 2nd/2 log 1.4 . That is, there is a statistical-computational gap meaning
n
that there is a regime for which it is possible to recover the clusters statistically, but not in
a computationally feasible way.
A similar but sharper statistical-computational gap has been established for a subgaussian
tensor block model in [21]. This suggests a similar phenomenon, that exact clustering is
n
statistically possible for ρ ≥ C nlog d−1 with a sufficiently large constant C and computationally

feasible for ρ ≥ Cn−d/2 with a sufficiently large constant C.

1.2 Organization
This paper is structured as follows. Section 2 discusses the used tensor formalism and statis-
tical model. Section 3 presents a spectral clustering algorithm and states the main theorem
describing sufficient conditions for the consistency of the algorithm. Section 4 demonstrates
the main results with numerical experiments. Section 5 discusses related literature. Finally,
Section 6 gives an overview of the intermediate results needed to prove the main theorem.
Technical arguments are postponed to Appendices A, B, C and D.

3
2 Tensor block model

2.1 Notation
Here R denotes the set of real numbers, R≥0 the set of nonnegative real numbers, Z the set of
integers and Z≥0 the set of nonnegative integers. We denote [n] = {1, . . . , n}. The indicator
of a statement A is denoted by I{A}, i.e., I{A} = 1 if A is true and I{A} = 0 otherwise. The
natural logarithm is denoted by log. The minimum and maximum of x, y ∈ R are denoted
by x ∧ y and x ∨ y respectively. The positive part of x ∈ R is denoted by (x)+ = x ∨ 0.
Given nonnegative sequences xn and yn , we write xn ≪ yn (or xn = o(yn )) if xn /yn → 0,
and xn ≲ yn (or xn = O(yn )) if the sequence xn /yn is bounded. We also write xn ≍ yn (or
xn = Θ(yn )) if xn ≲ yn and yn ≲ xn , and xn ∼ yn to indicate xn /yn → 1.
A tensor of order d is an array A ∈ Rn1 ×···×nd , where the dimensions k = 1, . . . , d
are called the modes of the tensor A. A slice of a tensor A ∈ Rn1 ×···×nd is denoted by
A:···:ik :···: ∈ Rn1 ×···×nk−1 ×nk+1 ×...nd . In the case of matrices (d = 2), Ai: denotes the i:th row
vector and A the j:th column vector of A. The Frobenius norm of a tensor is denoted by
qP :j
2 n1 ×···×nd
∥A∥F = i1 ,...,id Ai1 ...id . The elementwise product of tensors A, B ∈ R is denoted
qP
n 2
by A ⊙ B. The 2-norm of a vector u ∈ R is denoted by ∥u∥ = i ui and its 1-norm
is denoted by ∥u∥1 = i |ui |. The operator norm of a matrix A ∈ Rn×m is denoted by
P

∥A∥op = sup∥u∥≤1 ∥Au∥.

Following notation conventions of [13], the k-mode product of a tensor A ∈ Rn1 ×···×nd by
a matrix B ∈ Rm×nk is defined as a tensor A ×k B ∈ Rn1 ×···×nk−1 ×m×nk+1 ×···×nd with entries
X
(A ×k B)i1 ...ik−1 jik+1 ...id = Ai1 ...id Bjik .
ik

A Tucker decomposition of a tensor A ∈ Rn1 ×···×nd is a product representation

(1) (d)
A = C ×1 F (1) ×2 · · · ×d F (d) ,
X
Ai1 ,...,id = Cj1 ,...,jd Fi1 j1 . . . Fid jd ,
j1 ,...,jd

where C ∈ Rm1 ×···×md is called the core tensor and F (k) ∈ Rnk ×mk are called factor matrices.
In the literature as in [13], the matrices F (k) are sometimes assumed to have orthonormal
columns ((F (k) )T F (k) = I), which is not assumed in this work.
n1 ···nd
n ×
The k-matricization of a tensor A ∈ Rn1 ×···×nd is defined as a matrix matk (A) ∈ R k nk
so that the row index gives the index of the k:th mode and the column index gives the rest.
More formally, the entries of a matricized tensor are given by matk (A)ik j(i1 ,...,ik−1 ,ik+1 ,...,id ) =
Ai1 ,...,id , where j : [n1 ] × · · · × [nk−1 ] × [nk+1 ] × · · · × [nd ] → [n1 · · · nd /nk ] is some bijec-
tive enumeration map depending on the chosen convention. A possible convention is the

4
lexicographic order, i.e.,

Ai1 ,...,id = mat1 (A)i1 ,i2 +n2 (i3 −1)+···+n2 ...nd−1 (id −1) .

As stated in [13], the k-mode product can be seen as the matrix product via matk (A ×k
B) = B matk A, successive products can be calculated as (A ×k B) ×k C = A ×k CB and
(A ×k B) ×l C = (A ×l C) ×k B for k ̸= l.

2.2 Statistical model

Definition 2.1. The tensor block model (TBM) is a statistical model where the entries of
an observed data tensor Y ∈ Rn1 ×···×nd are independent and

EYi1 ,...,id = ρSz1 (i1 ),...,zd (id ) ,

where ρ ≥ 0 is the density, S ∈ [−1, 1]r1 ×···×rd is the normalized core tensor, and z1 ∈ [n1 ]r1 ,
..., zd ∈ [nd ]rd are the cluster membership vectors. Here we say that Y is a sample from
TBM(ρ, S, z1 , . . . , zd ). In this model X = EY is called the signal tensor, and E = Y − EY
the noise tensor. We say that the cluster (also known as the community or the block) of
ik ∈ [nk ] along mode k is lk if zk (ik ) = lk .

In this work, we are interested in large and sparse TBMs. A large TBM is a sequence of
TBMs indexed by ν = 1, 2, . . . so that the order d is fixed and the data tensor grows according
to n1,ν · · · nd,ν → ∞ as ν → ∞. A sequence of integer-valued TBMs (i.e., Yν ∈ Zn1,ν ×···×nd,ν )
is considered sparse if maxi1 ,...,id E|Yν,i1 ...id | → 0 as ν → ∞.
In a TBM, the main objective is to recover the underlying cluster structure (z1 , . . . , zd )
from the data Y. Figure 1 illustrates this statistical problem. The loss of an estimated
cluster structure is measured with a misclassification rate defined as follows.

Definition 2.2. Given a labeling function z : [n] → [r] and an estimate ẑ : [n] → [r], the
misclassification rate is defined as
n
1X
ℓ(z, ẑ) = min I{π(ẑ(i)) ̸= z(i)}.
permutation π:[r]→[r] n
i=1

An estimator Y 7→ ẑ(Y ) is said to be weakly consistent if ℓ(zν , ẑ(Yν )) → 0 in probability as

ν → ∞. An estimator Y 7→ ẑ(Y ) is said to be strongly consistent if ℓ(zν , ẑ(Yν )) = 0 with
high probability as ν → ∞.

From now on, we usually omit the index ν used in Definition 2.2. That is, instead of
writing xν ≫ yν we may write x ≫ y. Weak consistency is indicated by P(ℓ(z, ẑ) ≥ ε) → 0
for all ε > 0, and strong consistency is indicated by P(ℓ(z, ẑ) = 0) → 1.

5
(a) X (b) Clustered X

(c) Y (d) Clustered Y

Figure 1: In this example, the data tensor Y is a 40×40×40-binary tensor and the underlying
signal X has a symmetric core tensor S and the cluster structures are the same in every mode
(z1 = z2 = z3 ). Black dots represent value 1, and zeros are not drawn.

To control the amount of noise, the separation of clusters, and the amount of data per
cluster, we will often assume the following.

Assumption 2.1 (MGF bound). An integrable random variable X is said to satisfy the
MGF bound with variance proxy σ 2 ≥ 0 if
2 (e|λ| −1−|λ|)
Eeλ(X−EX) ≤ eσ for all λ ∈ R. (2.1)

Assumption 2.2 (Separation). The clusters of TBM(ρ, S, z1 , . . . , zd ) are said to be sep-

arated along mode k, if there exists a constant δk > 0 (not depending on ν) such that
2
minl̸=l′ ∥(matk S) l: −(matk S)l′ : ∥
r1 ···rd /rk
≥ δk2 . The number δk is called a separation of mode k.

Assumption 2.3 (Balanced clusters). A TBM with membership vectors zk is said to have
balanced clusters if |zk−1 {lk }| ≍ nk /rk for all lk and k. Then a constant α ∈ (0, 1) (not
depending on ν) satisfying |zk−1 {lk }| ≥ αnk /rk for all lk and k is called a cluster balance
coefficient.

Assumption 2.1 is stronger than the subexponential assumption but weaker than the
subgaussian assumption. For example, Poisson distributions satisfy Assumption 2.1 but have
heavier than Gaussian tails. For more details about what kind of random variables satisfy

6
the MGF bound, see Lemma D.1. We refer to [39] for further details on subexponential and
subgaussian distributions.
The term variance proxy is motivated by the fact that Var(X) ≤ σ 2 for every random
variable X satisfying Assumption 2.1. This follows from writing inequality (2.1) with Taylor
2 2 2 3 2 2
series 1 + λ Var(X)
2
+ O(λ3 ) ≤ eσ λ /2+O(λ ) = 1 + σ 2λ + O(λ3 ).

Example 2.1 (Bernoulli TBM). Consider a TBM with Bernoulli distributed data en-
d
tries Yi1 ...id = Ber(Xi1 ...id ) and a core tensor ρS, where ρ ∈ [0, 1] is a scalar control-
ling sparsity and S ∈ [0, 1]r1 ×···×rd . Then the data entries have variances Var(Yi1 ...id ) =
ρSz1 (i1 )...zd (id ) (1 − ρSz1 (i1 )...zd (id ) ) ≤ ρ and the noise satisfies Assumption 2.1 with variance
proxy ρ (Lemma D.1:(ii)).

Example 2.2 (Poisson TBM). Consider a TBM with Poisson distributed data entries
d
Yi1 ...id = Poi(Xi1 ...id ) and a core tensor ρS, where ρ ∈ [0, 1] is a scalar controlling sparsity
and S ∈ [0, 1]r1 ×···×rd . Then the data entries have variances Var(Yi1 ...id ) = ρSz1 (i1 )...zd (id ) ≤ ρ
and the noise satisfies Assumption 2.1 with variance proxy ρ (Lemma D.1:(vi)).

3 Main results

3.1 Algorithm
Algorithm 1 represents a spectral clustering algorithm for recovering a latent cluster structure
along mode k. To estimate the cluster membership vectors of all modes, it may be run d
times, once for each mode.

Algorithm 1. Hollow spectral clustering

input: Data tensor Y ∈ Rn1 ×···×nd , mode k, number of clusters rk , trimming
threshold Ctrim ρ2 , relaxation constant Q > 1
output: Cluster membership vector ẑk : [nk ] → [rk ]
Set A to be the Gram matrix (matk Y)(matk Y)T with zeroed diagonal
Set Aij ← 0 if j ′ ,k:j ′ ̸=i |(matk Y)ik (matk Y)j ′ k | ≥ Ctrim ρ2 n1 · · · nd or
P
2
i′ ,k:i′ ̸=j |(matk Y)i′ k (matk Y)jk | ≥ Ctrim ρ n1 · · · nd
P

Compute the best rank-rk approximation A = Û Λ̂Û T + Û⊥ Λ̂⊥ Û⊥T ≈ Û Λ̂Û T , where Λ̂
contains the rk largest eigenvalues in absolute value
Cluster the rows of Û Λ̂ into rk clusters by finding a labeling function ẑk and
centroids θ̂1 , . . . , θ̂rk ∈ Rrk satisfying

∥(Û Λ̂)j: − θ̂ẑj ∥2 ≤ Q min ∥(Û Λ̂)j: − θ̆z̆j ∥2

X X

j z̆,θ̆ j

for example with k-means++ [2]

7
To motivate Algorithm 1, consider a sample Y from TBM(ρ, S, z1 , . . . , zd ) (Definition 2.1).
(k)
Define membership matrices Z (k) ∈ {0, 1}nk ×rk by Zik jk = I(zk (ik ) = jk ) for each mode
(k)
k = 1, . . . , d. That is, Zik jk indicates whether or not the cluster of the index ik along the
mode k is jk . Then the signal tensor admits a Tucker decomposition

X = ρS ×1 Z (1) ×2 · · · ×d Z (d) .

From linear algebraic perspective, clustering amounts to inferring a low-rank Tucker decom-
position from a perturbed data tensor. This motivates to estimate X by approximating Y
with a low-rank Tucker decomposition, i.e., to minimize the distance

∥Y − S̆ ×1 Ŭ (1) ×2 · · · ×d Ŭ (d) ∥F , (3.1)

where the matrices Ŭ (k) ∈ Rnk ×rk are imposed to have orthonormal columns and S̆ ∈
Rr1 ×···×rd is a tensor. Unfortunately, minimizing (3.1) is known to be NP-hard in gen-
eral, already when the core dimensions are ones [22]. Nonetheless, fast algorithms have
been developed to approximately minimize (3.1). These include higher-order singular value
decomposition (HOSVD) [13], higher-order orthogonal iteration (HOOI) [14] and simple vari-
ations of these such as sequentially truncated HOSVD [38]. These are based on observing
that the matricization matk Y is a noisy version of its expectation

matk X = Z (k) matk ρS ×1 Z (1) ×2 · · · ×k−1 Z (k−1) ×k+1 Z (k+1) ×k+2 · · · ×d Z (d)

that has rank at most rk . This motivates to estimate matk X with a rank-rk approximation
of matk Y, or alternatively, estimate (matk X )(matk X )T with a rank-rk approximation of
(matk Y)(matk Y)T . HOSVD repeats this rank approximation over each mode and HOOI
iterates this type of computation multiple times. Algorithm 1, in turn, essentially does this
rank approximation only once before the final clustering step. Although simple, this step is
crucial for HOSVD as it provides the very first initialization.
After dimension reduction, Algorithm 1 clusters the low-dimensional vectors by solving
k-means minimization task quasi-optimally. Exact minimization is known to be NP-hard
in general [15], but quasi-optimal solutions can be found with fast implementations such as
k-means++ with quasi-optimality constant O(log r), where r is the number of clusters [2].
Since k-means++ is a randomized algorithm, it is guaranteed to be quasi-optimal only in
expectation rather than always.
Two key differences distinguish Algorithm 1 from standard spectral clustering. First, the
diagonal entries of the Gram matrix (matk Y)(matk Y)T are zeroed giving a hollow Gram
matrix. This modification has been considered already in [29]. Second, the obtained hollow
Gram matrix is trimmed by removing carefully selected rows and columns. In the case of
nonnegative data, those with too large L1 -norms are selected.

8
3.2 Consistency
The following theorem presents the main result of the paper. It confirms the weak consistency
of Algorithm 1 for data sampled from a sparse integer-valued TBM (Definition 2.1).

Theorem 3.1 (Weak consistency). Let Y ∈ Zn1 ×···×nd be a sample from TBM(ρ, S, z1 , . . . , zd )
with fixed d. Assume that the entries of the noise tensor E satisfy the MGF bound with vari-
ance proxy ρ (Assumption 2.1) and E|Ei1 ...id | ≤ ρ. Assume also separated clusters along
mode k with separation δ > 0 (Assumption 2.2) and balanced clusters with cluster balance
coefficient α (Assumption 2.3). Consider parameter regime
3/2
rk nk log nk
√ ∨ ≪ ρ ≪ n−1
k (n1 · · · nd /nk )
−ε
(3.2)
n1 · · · nd n1 · · · nd

2/3
for some constant ε > 0 and rk ≪ nk . Then there exists a constant C such that if
Ctrim ≥ C ∨ ε−1 , then the estimated cluster membership vector ẑk given by Algorithm 1 on
mode k is weakly consistent. Furthermore, the probability bound
 !!2 
−3
rk 1 ∨ ε + Ctrim 1
P ℓ(zk , ẑk ) ≥ s + CQ d− 21 2
√ + 
α δ n1 · · · nd ρ nk
1 q

2 −1∨ε−1
≤ Ctrim e−n1 ···nd ρ + 3n−1
k + Cn−1
k e
s
2
holds eventually, when s is chosen to satisfy e−n1 ···nd ρ + n−1 −1
k ≪ s ≪ rk .

Under extra assumptions n1 = · · · = nd = n and rk ≍ 1, Condition (3.2) is equivalent to

d ≥ 3 and n−d/2 ≪ ρ ≪ n−1−ε(d−1) .

3.3 Aggregation
High-order data is commonly analyzed by aggregating the data tensor into a lower-order
data tensor. Theorem 3.2 shows that an aggregated TBM is also a TBM and the variance
proxy of the noise entries scales accordingly. As a consequence, Corollary 3.3 shows that
aggregating is beneficial for sparse data as long as the signal tensor remains well separated.

Theorem 3.2 (Aggregation). Let Y ∈ Zn1 ×···×nd be a sample from TBM(ρ, S, z1 , . . . , zd ).

Assume that the entries of the noise tensor E satisfy the MGF bound with variance proxy
ρ (Assumption 2.1) and E|Ei1 ...id | ≤ ρ. Then an aggregate data tensor Y ′ ∈ Zn1 ×···×nd′ with
entries Yi′1 ...id′ = id′ +1 ,...,id Yi1 ...id is a sample from TBM(ρ′ , S ′ , z1 , . . . , zd′ ) with density ρ′ =
P

nd′ +1 · · · nd ρ and normalized core tensor Sj′1 ...jd′ = n ′ 1···nd id′ +1 ,...,id Sj1 ...jd′ zd′ +1 (id′ +1 )...zd (id ) .
P
d +1
Furthermore, the entries of the noise tensor satisfy the MGF bound with variance proxy ρ′
and E|Ei′1 ...id′ | ≤ ρ′ .

9
Proof. The MGF bound is a direct consequence of Lemma D.1:(iii). By the triangle inequal-
ity, E|Ei′1 ...id′ | = E| id′ +1 ,...,id Ei1 ...id | ≤ nd′ +1 · · · nd ρ.
P
■

Corollary 3.3 (Weak consistency in Aggregated TBM). Let Y ∈ Zn1 ×···×nd be a sample
from TBM(ρ, S, z1 , . . . , zd ) fixed d. Assume that the entries of the noise tensor E satisfy the
MGF bound with variance proxy ρ (Assumption 2.1) and E|Ei1 ...id | ≤ ρ. Let Y ′ ∈ Zn1 ×···×nd′
be an aggregated data tensor as in Theorem 3.2 and assume that it has separated clusters
along mode k (Assumption 2.2) and balanced clusters (Assumption 2.3). Consider parameter
regime
3/2
rk nk log nk 1
√ ∨ ≪ρ≪ (3.3)
n1 · · · nd′ nd′ +1 · · · nd n1 · · · nd nk nd′ +1 · · · nd (n1 . . . nd′ /nk )ε

2/3
for some constant ε > 0 and rk ≪ nk . Then there exists a constant C such that if
Ctrim ≥ C ∨ ε−1 , then Algorithm 1 applied to mode k of Y ′ is weakly consistent.

Proof. The claim follows from Theorem 3.1 which is applicable by Theorem 3.2. ■

Example 3.1. Under extra assumptions n1 = · · · = nd = n and rk ≍ 1, condition (3.3) is

′ ′ ′
equivalent to n−(d−d /2) ≪ ρ ≪ n−(1+d−d )−ε(d −1) , where the lower bound is orders of magni-
tude smaller than n−d/2 for the nonaggregated model in Theorem 3.1. That is, aggregating
allows to handle much sparser data as it improves the density threshold by a factor n−1/2
per aggregated mode. However, Assumption 2.2 for an aggregated TBM is stronger as it
considers separation in S ′ .

4 Numerical experiments
This section presents numerical experiments to demonstrate the performance of Algorithm 1
in various sparsity regimes.

4.1 Setup
We focus on third-order tensors. The signal tensor is chosen to be symmetric so that the mode
chosen for matricization does not matter. That is, the normalized core tensor S ∈ R2×2×2
is symmetric, the observation tensor is a cube Y ∈ Zn×n×n , and every mode has the same
membership vector z ∈ {1, 2}n . The number of clusters is fixed to r = 2 for visualization
purposes, as the low-dimensional projections are then 2-dimensional. The two clusters are
equal sized and have constant degrees, i.e.,

X X n2 ρ X
E Y i1 i2 i3 = ρSz(i1 )z(i2 )z(i3 ) = Sz(i1 )st
i2 i3 i2 i 3 4 s,t

10
Figure 2: Symmetric core tensors S used in simulation experiments. Threeway 2 × 2 × 2
arrays are represented as cubes, where black color corresponds to value 1 and white color
corresponds to value 0. The cube on the left depicts the core tensor with a noninformative
aggregate matrix and the cube on the right with an informative aggregate matrix.

is independent of z(i1 ). This implies that aggregating the data into an order-1 tensor loses
information. Similar to hypergraphs, the data tensor is assumed to be binary Y ∈ {0, 1}n×n×n
so that all entries are independent Bernoulli random variables (see Example 2.1). In a single
experiment, the normalized core tensor S is fixed, and the unnormalized core tensor is ρS,
where ρ is allowed to vary. Now, the model can be written as
d
Yi1 i2 i3 = Ber(ρSz(i1 )z(i2 )z(i3 ) ).

We investigate two instances of the normalized core tensor S, one with informative aggrega-
tion and one with noninformative. Recall that aggregating one mode yields TBM(ρ′ , S ′ , z, z)
with a density parameter ρ′ = nρ and a normalized corei tensor Sl′1 l2 = n1 i3 Sl1 l2 z(i3 ) =
P
h
1
2
(Sl1 l2 1 + Sl1 l2 2 ) (Theorem 3.2). By writing S = S::1 S::2 , the two considered normalized
core tensors and their aggregated versions are
" # " #
1 0 0 0 1 0 0 1
Sinformative = , Snoninformative = ,
0 0 0 1 0 1 1 0
" # " #
1/2 0 ′ 1/2 1/2
Sinformative = , Snoninformative = .
0 1/2 1/2 1/2

These are visualized in Figure 2.

Different algorithms are tested for various pairs (n, ρ) and the proportion of correctly
clustered nodes is recorded. Since the phase transition is expected to be at ρ = Cn−γ for
some constant C and exponent γ depending on the algorithm, in logarithmic coordinates, this
corresponds to a linear equation log ρ = C −γ log n with slope −γ. Hence, the algorithms are
tested on a grid of pairs (n, ρ), where values of n and ρ are spaced logarithmically. Then, a

11
line is fitted in logarithmic coordinates to estimate the phase transition and this is compared
to the theoretical line. In detail, the phase transition line is estimated via logistic regression
by first predicting whether the proportion of correctly clustered nodes is sufficiently high
(e.g. 0.9) given (log n, log ρ) and then the learned decision boundary is used as the estimate
for the phase transition line. Although this might not be an optimal estimator of the phase
transition line, visually this produces meaningful results. A line with theoretical slope −γ is
drawn to visualize how well the simulated phase transition line fits to the theory.
The following four algorithms are compared.

(i) Hollow SVD (Algorithm 1 with a known parameter ρ and a trimming constant Ctrim =
3). By Theorem 3.1, we expect the phase transition line to have γ = 1.5.

(ii) SVD. A simple spectral clustering which is otherwise the same as Algorithm 1 but
without trimming and diagonal masking. We expect the phase transition line to have
γ ≈ 1.33. Namely, the diagonal entries of the data Gram matrix (mat1 Y)(mat1 Y)T
have standard deviations
 
sX Yi2 =Yi1 i2 i3 √
1 i2 i3
SD(((mat1 Y)(mat1 Y)T )i1 i1 ) = SD  Yi21 i2 i3  =
X
Var(Yi21 i2 i3 ) ≍ n ρ,
i2 ,i3 i2 ,i3

which dominate the operator norm of the signal Gram matrix

v  2
u
(∗)
uX X
T T
∥(mat1 X )(mat1 X ) ∥op ≍ ∥(mat1 X )(mat1 X ) ∥F = Xi1 i2 i3 Xi′1 i2 i3  ≍ n3 ρ2 ,
u 
t
i1 ,i′1 i2 ,i3

when ρ ≪ n−4/3 . Here (∗) follows from the fact that the operator norm and Frobenius
norm of low-rank matrices are comparable, see for example Corollary 2.4.3 and Problem
P2.4.7 in [20].

(iii) HSC. A simplification of the high-order spectral clustering (HSC) proposed in [21].
First, HSC calculates a low-rank approximation of the data tensor with HOSVD [13]
and essentially one iteration of HOOI [14]. In our simplification, HOSVD matricizes
the data tensor and calculates its SVD only once since as the underlying signal tensor is
symmetric. This will slightly speed up the computations. Then the algorithm clusters
the first mode of the low-rank tensor with k-means++. Since HSC is initialized with
a simple rank-approximation (via eigenvalue decomposition of (mat1 Y)(mat1 Y)T ), it
is expected not to improve γ. That is, we expect γ ≈ 1.33.

(iv) Aggregate SVD. Spectral clustering from an aggregate matrix, which is a simplified
version of an algorithm proposed in [43]. In this experiment, the algorithm computes
the aggregate matrix Aij = k Yijk , removes rows and columns with too large L1 norms
P

12
(trimming threshold Ctrim n2 ρ with Ctrim = 3), calculates the best rank-2 approximation,
and then clusters the rows with k-means++ [2]. In the original paper, the data tensor
is symmetric and hence there are small differences in calculating A. Furthermore,
the actual clustering step is not implemented with k-means++ and there is a final
refinement step, but morally these algorithms are the same. We expect the phase
transition line to have γ = 2 if the aggregate matrix is informative.

4.2 Results
Figure 3 shows a comparison of the four algorithms when the aggregate matrix is not infor-
mative. As expected, spectral clustering on the aggregate matrix gives poor results, whereas
the other three algorithms show a clear phase transition. The observed phase transition is
in line with the expected phase transition. The refining steps in HSC do not improve the
exponent γ compared to the plain SVD, but they do improve the clustering performance sig-
nificantly. Although the estimated γ-exponents differ from the expected exponents slightly,
they would be hard to distinguish from the expected exponents visually.
Figure 4 shows a comparison of the four algorithms when the aggregate matrix is in-
formative. As expected, spectral clustering on the aggregate matrix gives the best results.
The observed phase transition is in line with the expected phase transition. Although the
estimated γ-exponents differ from the expected exponents slightly, they would be hard to
distinguish from the expected exponents visually.

4.3 HSC initialization

Previous simulations demonstrate that HSC does not improve the exponent γ of its initial-
ization algorithm. This subsection studies visually, how HSC depends on its initialization
algorithm, and how it is possible that for some pairs of n and ρ, the initialization algorithm
does not cluster the nodes successfully while HSC does.
Similar to the experiment with a noninformative aggregate matrix, we consider a sym-
metric normalized core tensor
" #
1 0.5 0.5 1
S= .
0.5 1 1 0.5

The only difference is that the entries with value 0 have been changed to the value 0.5. If
the entries were zero, then any two row vectors of matk Y from different clusters would be
orthogonal. This, in turn, usually makes the projected row vectors to lie on two orthogonal
lines representing different clusters, and the projected row vectors would overlap with each
other in the visualization. Furthermore, when testing HSC, the initialization algorithm is
changed from simple SVD to the proposed hollow SVD, because without any trimming step

13
Hollow SVD 1.0 SVD 1.0
0.9 0.9

Clustering accuracy

Clustering accuracy
Density ( )

Density ( )
10 2
0.8 10 2
0.8
0.7 0.7
0.6 0.6
0.5 0.5
30 60 100 180 30 60 100 180
Number of nodes (n) Number of nodes (n)

HSC 1.0 Aggregate SVD 1.0

0.9
Clustering accuracy 0.9

Clustering accuracy
Density ( )

Density ( )
10 2
0.8 10 2
0.8
0.7 0.7
0.6 0.6
0.5 0.5
30 60 100 180 30 60 100 180
Number of nodes (n) Number of nodes (n)

Figure 3: Comparison of the four algorithms when the aggregate matrix is not informative.
The number of nodes varies logarithmically between 30 and 180, and the sparsity parameter
varies logarithmically between 0.00185 and 0.02722. The dashed black line is the fitted
phase transition line with theoretical slope (γHollow SVD = 1.5, γSVD = 1.33, γHSC = 1.33).
The estimated slopes given by logistic regression are γ̂Hollow SVD = 1.43, γ̂SVD = 1.29 and
γ̂HSC = 1.26.

near the phase transition, there would be “outliers” with significantly larger norms, which
would force the visualizations to focus on the outliers.
Figure 5 shows the obtained projections from the initialization algorithm (hollow SVD)
and HSC for four different sparsity parameters. As the sparsity parameter decreases, the
clusters merge in the projections. It may occur that the initialization provides somewhat
meaningful projections, but the k-means algorithm cannot detect the clusters. However, such
initialization might be informative enough for HSC to improve the projection and cluster the
vectors. However, as expected, once the initialization is dominated by noise, HSC cannot
cluster the vectors.

14
Hollow SVD 1.0 SVD 1.0
0.9 0.9

Clustering accuracy

Clustering accuracy
10 2 10 2
Density ( )

Density ( )
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
30 60 100 180 30 60 100 180
Number of nodes (n) Number of nodes (n)

HSC 1.0 Aggregate SVD 1.0

0.9
Clustering accuracy 0.9

Clustering accuracy
10 2 10 2
Density ( )

Density ( )
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
30 60 100 180 30 60 100 180
Number of nodes (n) Number of nodes (n)

Figure 4: Comparison of the four algorithms when the aggregate matrix is informative. The
number of nodes varies logarithmically between 30 and 180, and the sparsity parameter
varies logarithmically between 0.00131 and 0.01925. The dashed black line is the fitted
phase transition line with theoretical slope (γHollow SVD = 1.5, γSVD ≈ 1.33, γHSC ≈ 1.33,
γAggregate SVD = 2). The estimated slopes given by logistic regression are γ̂Hollow SVD = 1.43,
γ̂SVD = 1.31, γ̂HSC = 1.24 and γ̂Aggregate SVD = 2.00.

5 Discussion

5.1 Clustering
This section discusses related theoretical research on multiway clustering with emphasis on
recent developments on hypergraphs, and temporal and multilayer networks.
Zhang and Tan [43] considered clustering of a d-uniform hypergraph on n nodes, which
is comparable to a TBM of order d with Bernoulli distributed entries. The nodes have a
fixed prior distribution over clusters and the signal is assumed to have a core tensor ρS with
n
fixed S. They show that strong consistency is impossible, when ρ ≤ c nlog d−1 for sufficiently

small constant c. Furthermore, they showed that a spectral clustering algorithm applied
n
to an aggregate matrix Ai1 i2 = i3 ...id Yi1 ...id achieves strong consistency for ρ ≥ C nlog
P
d−1 for

sufficiently large constant C, assuming that EA is informative (see [43] for further details
on the constants c, C). The matrix A can be considered as an adjacency matrix since for a
symmetric tensor Y with Yi1 ...id = 0 whenever ik = il for some k ̸= l (undirected d-uniform

15
= 0.0074 = 0.0057 = 0.0045 = 0.004
accuracy = 0.91 accuracy = 0.79 accuracy = 0.71 accuracy = 0.555

(a) Hollow SVD

= 0.0074 = 0.0057 = 0.0045 = 0.004

accuracy = 1.0 accuracy = 0.935 accuracy = 0.79 accuracy = 0.52

(b) Hollow HSC

Figure 5: Projected observations with (a) hollow SVD and (b) HSC initialized with hollow
SVD. The number of nodes is fixed to n = 200 and the sparsity parameter ρ is varied around
the phase transition. Sparsity decreases from left to right. Accuracy (1 − ℓ) of k-means
clustering is also recorded.

hypergraph), A is a symmetric matrix with upper-diagonal entries

X X X
Ai1 i2 = Yi1 ...id = (d − 2)! Yi1 ...id = (d − 2)! I{{i1 , i2 } ⊂ e}, i1 < i2 .
i3 ,...,id i3 <···<id hyperedge e

That is, up to a multiplicative constant (which for d ≤ 3 is simply 1), Ai1 i2 counts the number
of common hyperedges between nodes i1 and i2 . Visually, this is equivalent to transforming
a hypergraph to a (weighted) graph such that every hyperedge {i1 , . . . , id } contributes to an
edge {j1 , j2 } ⊂ {i1 , . . . , id }, j1 ̸= j2 . Stephan and Zhu [35] showed that a spectral clustering
algorithm based on nonbacktracking walks on a hypergraph achieves weak reconstruction
(i.e., is better than clustering every node to the largest cluster), if ρ ≳ n1−d and EA is
informative. However, these results do not address the case when EA is not informative.
Figure 1 visualizes an example of a tensor, for which the adjacency matrix would not be
informative.
There has been research on multiway clustering with a noninformative adjacency ma-
trix A. Lei, Chen and Lynch [28] analyzed an undirected multilayer network model, which
can be formulated as a TBM with Y ∈ {0, 1}m×n×n with no cluster structure on the first

16
mode (or z1 (i) = i, i.e., every index forms one cluster) and identical cluster structures on the
second and third modes z2 = z3 = z, r2 = r3 = r. Each slice Yi:: corresponds to a symmetric
adjacency matrix of an undirected network layer in a multilayer network. They consider a
clustering algorithm solving a least-squares problem

(Ẑ, Ŝ) = arg min∥Y − S ×2 Z̆ ×3 Z̆∥2F

Z̆, S̆

with a membership matrix Z̆ ∈ {0, 1}n×r and a core tensor S̆ ∈ [0, 1]m×r×r . They did not
give an algorithm finding an exact or provably approximate solution, but they did provide
an algorithm trying to solve the optimization problem. They showed that Ẑ achieves weak
3/2
n
consistency, when ρ ≫ log nm1/2
. Lei and Lin [29] analyzed a similar problem and proposed
an algorithm similar to Algorithm 1. Specifically, their algorithm masks out the diagonal
(which they call debiasing or bias adjusting) but it does not remove the nodes with too large
1/2
L1 -norms. Their algorithm is guaranteed to be weakly consistent for ρ ≫ log nm(n+m)1/2 . Ke, Shi
and Xia [25] proved an analogous result for hypergraph degree-corrected block models. Su,
Guo, Chang and Yang [36] extended these results and ideas to directed multilayer networks.
The proof requires a matrix concentration inequality based on classical matrix Bernstein’s
inequalities (see for example [37]). However, matrix Bernstein’s inequalities, which may be
based on a deep theorem by Lieb depending on the proof, suffer from a logarithmic factor
1/2
leading to the threshold log nm(n+m)
1/2 instead of nm11/2 .
Lei, Zhang and Zhu [31] studied computational and information theoretical limits of the
multilayer network model corresponding to a TBM with Y ∈ {0, 1}m×n×n with no cluster
1
structure on the first mode. They showed that weak consistency is impossible for ρ ≪ nm and
1
the maximum likelihood estimator is weakly consistent for ρ ≫ nm . Furthermore, based on
a conjecture, they showed that a polynomial-time algorithm cannot be weakly consistent for
ρ ≤ 2nm1/21log1.4 n . Kunisky [27] studied the impossibility of detecting a presence of a cluster
structure in symmetric tensors with low coordinate degree functions (LCDF). LCDFs are
arbitrary linear combinations of functions depending on at most D entries of a vector, in
our case the data tensor. Here the coordinate degree D is argued to roughly correspond
to algorithms requiring computation time eΘ(D) . Although the developed theory is more
general, in the our special case Theorem 1.13 states that the detection task is impossible
d d−2
if ρ ≤ cn− 2 D− 2 for a sufficiently small constant c and D ≤ cn. Since polynomial-time
d d−2
algorithms correspond to D ≍ ln n, the threshold corresponds roughly to ρ ≤ cn− 2 log− 2 n.
Moreover, Kunisky showed that the detection task becomes easier when an aggregated tensor
1
remains informative, each aggregated mode decreasing the density threshold by n− 2 . This
agrees with our Corollary 3.3.
Similar but slightly sharper computational gaps have been established in subgaussian
noise. Han, Luo, Wang, and Zhang [21] analyzed multiway clustering on a TBM with sub-
gaussian noise. Their algorithm is initialized with a slightly refined version of a HOSVD

17
algorithm, which involves calculating eigenvalue decompositions of (matk Y)(matk Y)T with-
out removing the diagonals or rows and columns with too large L1 -norms. They showed
that HOSVD-based of initialization is weakly consistent for ρ2 /σ 2 ≫ n−d/2 , where σ is the
largest subgaussian norm of Ei1 ...id (i.e., σ 2 is the largest variance proxy). When HOSVD-
based initialization is given to an iterative higher-order Lloyd algorithm, they showed that
the clustering is strongly consistent if ρ2 /σ 2 ≥ Cn−d/2 for sufficiently large constant C.
Furthermore, they show that strong consistency is not achievable for ρ2 /σ 2 ≤ cn1−d with a
small enough constant c while Wang and Zeng [41] show that maximum likelihood estimator
is strongly consistent for ρ2 /σ 2 ≥ Cn1−d with a sufficiently large constant C. Notice that
this threshold is similar to Theorem 3.1, namely both σ 2 and ρ are an upper bounds of the
variance of the data entries and setting σ 2 = ρ yields the same weak consistency condition
ρ ≫ n−d/2 . However, subgaussian analysis with Bernoulli distributed entries does not give
σ 2 ≍ ρ but
1 − 2ρ 1
σ2 = 1−ρ ∼
2 log ρ 2 log ρ1
as ρ → 0 by Theorem 2.1 in [6]. For example, by setting ρ = n−γ with fixed γ gives
σ 2 ≍ log−1 n which is substantially larger than ρ = n−γ . Nonetheless, it seems that the
n
analogous statistical-computational gap nlog 1
d−1 ≲ ρ ≪ nd/2 is present in clustering a binary

TBM.

5.2 Random matrix norm bounds

Proving matrix concentration inequalities for a sparse Bernoulli matrix is more involved than
for a subgaussian matrix. Kahn and Szemerédi [19] bound the second largest eigenvalue of
an adjacency matrix of a random regular graph (random graphs with constant degree) and
Feige and Ofek [18] repeat the same arguments for Erdős–Rényi graphs (symmetric adjacency
matrix with lower diagonal having independent and identically distributed entries). Lei and
Rinaldo [30] extend Feige’s and Ofek’s argument to bound the operator norm of a centered
symmetric binary matrices with independent but possibly not identically distributed entries.
With this they are able to show consistency of a spectral clustering algorithm under a
stochastic block model, which is similar to a TBM with d = 2. Chien, Lin, and Wang [9]
extend this further to a centered adjacency matrix of a hypergraph stochastic block model
(similar to binary TBM) to show strong consistency of a spectral clustering algorithm. Zhang
and Tan [43] relax some parametric assumptions made in [9]. We also apply the proof
techniques developed in [19] and [18].
In the context of clustering multilayer networks, Lei, Chen and Lynch [28] adapt Feige’s
and Ofek’s argument to tensor setting. Lei and Lin [29] consider alternative approach by
applying matrix concentration inequality based on classical matrix Bernstein’s inequalities
(see for example [37]). However, matrix Bernstein’s inequalities, which may be based on a

18
deep theorem by Lieb depending on the proof, suffer from a logarithmic factor making the
matrix bounds slightly suboptimal.
In the case of subgaussian TBM, Han, Luo, Wang, and Zhang [21] approach by analyzing
concentration of a singular subspace of a possibly wide random matrix (i.e., a vector subspace
spanned by the first r right or left singular vectors corresponding to the r largest singular
values of a random matrix). For this purpose, the classical Davis–Kahan–Wedin theorem is
insufficient as it provides a common error bound for both left and right singular subspaces.
Hence, they rely on more sophisticated perturbation bounds developed by Cai and Zhang
[7] providing different bounds for left and right singular subspaces which is relevant for
particularly tall and wide matrices. The probabilistic analysis is based on ε-nets and Hanson–
Wright inequality (see for example Chapter 6 in [39]).

6 Proofs
This section presents the proof of the main theorem (Theorem 3.1) in three subsections.
Section 6.1 analyzes spectral clustering with deterministic error bounds, Section 6.2 studies
concentration of sparse matrices and Section 6.3 proves the main theorem. As the analysis
of matrix concentration is quite involved, the detailed proofs of Section 6.2 are postponed
to the appendix.

6.1 Deterministic analysis

The following theorem states that a low-rank approximation of a data matrix allows efficient
clustering, when the noise level measured with operator norm is sufficiently small. The
proof is similar to the proofs of Lemma 4 in [43], Theorem 3 in [21], Lemma 5.3 in [30],
and Theorem 2.2 in [24]. However, [43], [21] and [24] included this step into consistency
theorems involving statistical models not appropriate for our purposes and [30] is restricted
to clustering from square matrices.

Theorem 6.1. Let X ∈ Rn×m be a matrix with r distinct rows represented by Xi: = Sz(i): ,
where S ∈ Rr×m and z ∈ [r]n . Let Y ∈ Rn×m be another matrix and X̂ be its best rank-r
approximation. Let ẑ be the output of a quasi-optimal k-means clustering algorithm applied
to the rows of X̂, i.e., find a labeling function ẑ : [n] → [r] and centroids θ̂1 , . . . , θ̂r ∈ Rm
satisfying
∥X̂j: − θ̂ẑj ∥2 ≤ Q min ∥X̂j: − θ̆z̆j ∥2
X X

j z̆,θ̆ j

r∥Y −X∥2
with some relaxation parameter Q > 1. Assume that |z −1 {k}| > 128Q ∆2 op for all k,
where ∆ = min{∥Xi: − Xj: ∥ | Xi: ̸= Xj: }. Then the misclassification rate (recall Definition

19
2.2) has an upper bound
r∥Y − X∥2op
ℓ(z, ẑ) ≤ 128Q .
n∆2
Proof. Define E = Y − X. Calculate a singular value decomposition Y = Û Σ̂V̂ T + Û⊥ Σ̂⊥ V̂⊥T
and define X̂ = Û Σ̂V̂ T , which is the best r-rank approximation of Y (in statistical appli-
cations X̂ is used to estimate X). Since the rank of X̂ is at most r and the rank of X is
at most r, the rank of their difference X̂ − X is at most 2r (every vector u of the image of
X̂ − X can be written as u = u1 + u2 , where u1 ∈ im X̂ is in an r-dimensional subspace
and u2 ∈ im X is in another r-dimensional subspace). By Corollary 2.4.3 in [20], the Frobe-
nius norm and the operator norm of the rank-2r matrix X̂ − X are comparable according
∥X̂ − X∥2F ≤ 2r∥X̂ − X∥2op . Now by the Eckart–Young–Mirsky theorem [17] and Weyl’s
inequality [42], we have

EYM
∥X̂ − X∥2F ≤ 2r∥X̂ − X∥2op ≤ 2r(∥X̂ − Y ∥op + ∥E∥op )2 ≤ 2r(σr+1 (Y ) + ∥E∥op )2
Weyl
≤ 2r(σr+1 (X) + ∥E∥op + ∥E∥op )2 ≤ 8r∥E∥2op .

This shows that the estimate X̂ is close to X in Frobenius norm. Denote the estimated
clusters by ẑj ∈ [r], j = 1, . . . , n and their centers by θ̂k ∈ Rm , k = 1, . . . , r. By quasi-
optimality, we have

∥X̂j: − θ̂ẑj ∥2 ≤ Q min ∥X̂j: − θ̆z̆j ∥2 ≤ Q ∥X̂j: − Xj: ∥2 = Q∥X̂ − X∥2F ≤ 8Qr∥E∥2 .
X X X

j z̆,θ̆ j j

This shows that the row vectors of the estimate X̂ are close to the estimated cluster means
on average. Now we can estimate the distances between the true cluster means (rows of S)
and the estimated cluster means (θ̂ẑj ):

∥Xj: − θ̂ẑj ∥2 ≤ 2 ∥Xj: − X̂j: ∥2 + ∥X̂j: − θ̂ẑj ∥2 ≤ 32Qr∥E∥2 .

X X

j j

For any t > 0, Markov’s inequality (with respect to the counting measure) gives

j ∥θ̂ẑj − Xj ∥ 2 32Qr∥E∥2
P
2 2
|{j | ∥θ̂ẑj − Xj ∥ ≥ t }| ≤ ≤ .
t2 t2

Consider the set A = {j ∈ [n] | ∥θ̂ẑj − Xj ∥ < t}. If zi ̸= zj for i, j ∈ A, then

∥θ̂ẑi − Szj ∥ ≥ ∥Szi − Szj ∥ − ∥Szi − θ̂ẑi ∥ > ∆ − t.

If zi = zj for some i, j ∈ A, then ∥θ̂ẑi − Szj ∥ < t. Consider t satisfying ∆ − t = t, or

equivalently, t = ∆/2. Previous calculation shows, that for every i, j ∈ A, zi = zj if and

20
only if ∥θ̂ẑi − Szj ∥ < ∆/2. This gives a well-defined mapping π : ẑ(A) → [r], π(ẑi ) = zi (if
ẑi = ẑj , then ∥θ̂ẑi − Szj ∥ < ∆/2 and hence zi = zj ). However, π is not bijective in general,
when balanced cluster sizes are not assumed.
If z|A : A → [r] is not a surjection, i.e., there exists k ∈ [r] such that zi ̸= k for
every i ∈ A, then z −1 {k} ⊂ Ac , |z −1 {k}| ≤ |Ac |. This is a contradiction if we assume
32Qr∥E∥2op
that t2
< |z −1 {k}| for every k. Under this assumption, z|A is surjective and hence
π : ẑ(A) → [r] is surjective (if k ∈ [r], there exists i ∈ A such that k = zi = π(ẑi )). The
surjectivity of π gives |ẑ(A)| ≥ |[r]| = r implying ẑ(A) = [r] and making π a bijection. The
bound for the misclassification rate follows now from the estimate ℓ(z, ẑ) ≤ |Ac |/n ■

As Theorem 6.1 shows, clustering from a low-rank approximation works well when the
operator norm of the noise matrix is sufficiently small and the clusters are sufficiently sepa-
rated in the signal. The following result (Lemma 6.2) addresses the latter issue by estimating
the separation of clusters ∆ in Theorem 6.1 under Assumptions 2.3 and 2.2. The former
issue is addressed in Section 6.2.

Lemma 6.2. Consider a TBM with a signal tensor X . Assume separated clusters along
mode k with separation δ > 0 (Assumption 2.2) and balanced clusters with cluster balance
coefficient α (Assumption 2.3). Then
1
T αd− 2 δ 2 T
min ∥((matk X )(matk X ) )i: − ((matk X )(matk X ) )j: ∥ ≥ √ √ n1 · · · nd ρ2 .
i,j:zi ̸=zj 2 nk rk

Proof. Fix k ∈ [d] and let ik , i′k ∈ [nk ] be entities from different clusters zk (ik ) ̸= zk (i′k ).
Denote l−k = (l1 , . . . , lk−1 , lk+1 , . . . , ld ) ∈ [r1 . . . rd /rk ] and j−k = (j1 , . . . , jk−1 , jk+1 , . . . , jd ) ∈
r1 ...rd r ...r
× 1r d
[n1 · · · nd /nk ]. Define a diagonal matrix D ∈ R rk k with nonnegative diagonal entries
|z1−1 {l1 }|···|zd−1 {ld }|
Dl−k l−k = |zk−1 {lk }|
. Now

∥((matk X )(matk X )T )ik : − ((matk X )(matk X )T )i′k : ∥2

 2
X X
=  ((matk X )i j − (matk X )i′k j−k ))(matk X )jk j−k 
k −k
jk j−k


= ρ4 |zk−1 {lk }| −1

|z1−1 {l1 }| . . . |zk−1 −1
{lk+1 }| . . . |zd−1 {ld }|
X X
{lk−1 }||zk+1
lk l−k
2

× ((matk S)zk (ik ),l−k − (matk S)zk (i′k ),l−k )(matk S)lk ,l−k  .

By balanced clusters with cluster balance coefficient α > 0 (Assumption 2.3) , the last term

21
is bounded from below by
 2
ρ4 αnk X  X
≥ Dl l ((matk S)zk (ik ),l−k − (matk S)zk (i′k ),l−k )(matk S)lk ,l−k 
rk lk l−k −k −k
ρ4 αnk
= ∥((matk S)D(matk S)T )zk (ik ): − ((matk S)D(matk S)T )zk (i′k ): ∥2 .
rk

Next, we observe that for any matrix A and indices i ̸= i′ , Cauchy–Schwarz inequality gives

∥Ai: − Ai′ : ∥2 = ∥(ei − ei′ )T A∥2 = (ei − ei′ )T AAT (ei − ei′ )
√ √
≤ 2∥(ei − ei′ )T AAT ∥ = 2∥(AAT )i: − (AAT )i′ : ∥.
√
Applying this inequality to matrix (matk S) D gives now a lower bound

∥((matk X )(matk X )T )ik : − ((matk X )(matk X )T )i′k : ∥2

ρ4 αnk √ √
≥ ∥((matk S) D)zk (ik ): − ((matk S) D)zk (i′k ): ∥4
2rk
 2
ρ4 αnk  Y αnk′ 
≥ ∥(matk S)zk (ik ): − (matk S)zk (i′k ): ∥4 .
2rk k′ ̸=k
rk ′

By Assumption 2.2, we obtain

1
T T αd− 2 δ 2
∥((matk X )(matk X ) )ik : − ((matk X )(matk X ) )i′k : ∥ ≥ √ √ n1 · · · nd ρ 2 .
2 nk rk

6.2 Sparse matrices

Theorem 6.3 bounds the operator norm of a random matrix, from which rows and columns
with too large L1 -norms are removed. This result along with its proof is a mild generalization
of Theorem 1.2 in [18]. Namely, we relax the assumption of binary entries to Assumption 2.1.
In a similar spirit, Theorem 6.4 analyzes concentration of a trimmed product XX T , where
the entries of the random matrix X ∈ Zn×m are independent and centered. The focus is
only on the off-diagonal entries, because the diagonal and off-diagonal entries concentrate
at different rates. The proof is based on decoupling and an observation that the entries of
XX T are nearly independent, eventually allowing us to apply Theorem 6.3. This argument,
however, relies on having a sufficiently sparse matrix X. Finally, Lemma 6.5 asserts that
only a small fraction of rows and columns are removed.

Theorem 6.3. Let X ∈ Rn×m be a random matrix with independent centered entries (EXij =

22
0) satisfying the MGF bound with variance proxy σ 2 (Assumption 2.1). Define a trimmed
version of the matrix as

Xij ,

∥Xi: ∥1 ∨ ∥X:j ∥1 ≤ Ctrim (n ∨ m)σ 2 ,
Xij′ = 
0, otherwise.

Then for all t ≥ 1,

q
′
P ∥X ∥op ≥ (28t + 8.09Ctrim + 417) (n ∨ m)σ 2 ≤ 2(e(n ∨ m))−2t .

Proof. See Appendix A. ■

Theorem 6.4. Let X ∈ Zn×m , n ≤ m be a random matrix with independent entries satis-
fying the MGF bound with variance proxy σ 2 (Assumption 2.1) and E|Xij | ≤ σ 2 . Further-
more, assume that 8 log en ≤ mσ 2 , nσ 2 ≤ 6−1 e−2 imposing m ≥ 48e2 n log en. Define a mask
M = 11T − I ∈ Rn×n and a trimming mask N as
 
X 
|Xi′ k Xjk |Mi′ j ≤ Ctrim nmσ 4 .
X
Nij = I |Xik Xj ′ k |Mij ′ ∨
j ′ ,k i′ ,k

Then there exists an absolute constant C such that

√ 1/3
P ∥((XX T ) ⊙ M − EXEX T ) ⊙ N ∥ ≥ C(t + Ctrim ) nmσ 2 + mσ 4 ≤ Cn−1 e−t

for all
log3 m
!
t≥C 1∨ .
log3 (1/6enσ 2 )
Proof. See Appendix C. ■

In particular, if nσ 2 ≲ m−c for some arbitrarily small constant c, then ∥((XX T ) ⊙ M −

√
EXEX T ) ⊙ N ∥ ≲ nmσ 2 + mσ 4 with high probability as n → ∞, and if nσ 2 ≲ 1, then
√
∥((XX T ) ⊙ M − EXEX T ) ⊙ N ∥ ≲ log3 (m) nmσ 2 + mσ 4 with high probability as n → ∞.
Lemma 6.5. Let X ∈ Zn×m , n ≤ m be a random matrix with independent entries satisfying
the MGF bound with variance proxy σ 2 (Assumption 2.1) and E|Xij | ≤ σ 2 . Furthermore,
assume that 8 log en ≤ mσ 2 , nσ 2 ≤ 6−1 e−2 imposing m ≥ 48e2 n log en. Define a mask
M = 11T − I ∈ Rn×n . Then
   
n X
1 
I |Xik Xj ′ k |Mij ′ ≥ (4 + 2t)nmσ 4  ≥ s
X
P
n i=1 j ′ ,k
√
1 √
! !
tnmσ 4 √ √
− t/2 6
≤ t exp − √ + 3n
s 10 6

23
for all s > 0 and
24 log2 m
t ≥ 24 ∨ .
log2 (1/6enσ 2 )
Proof. See Appendix C. ■

In particular, if nmσ 4 → ∞ and n → ∞, then one can choose s → 0 to approach zero

sufficiently slowly to obtain a high-probability event.

6.3 Proof of Theorem 3.1

First, notice that the inequalities 8 log enk ≤ n1 ···n
nk
dρ
and nk ρ ≤ 6−1 e−2 needed in Theorem
6.4 and Lemma 6.5 hold eventually. Define a diagonal-resetting mask M ∈ Rnk ×nk and a
trimming mask N ∈ Rnk ×nk by Mij = I{i ̸= j} and
 
 X 
|(matk Y)i′ l (matk Y)jl | ≤ Ctrim n1 · · · nd ρ2 .
X
Nij = I |(matk Y)il (matk Y)j ′ l | ∨
j ′ ,l:j ′ ̸=i i′ ,l:i′ ̸=j
 

By Lemma 6.5, there exists an absolute constant C1 such that setting Ctrim ≥ C1 (1 ∨ ε−2 )
gives
   
nk  X
1 X 
P I |(matk Y)il (matk Y)j ′ l | ≥ Ctrim n1 · · · nd ρ2 ≥ s
nk i=1 j ′ ,l:j ′ ̸=i 

1 q

−n1 ···nd ρ2 −1
≤ Ctrim e + 3nk .
s
By Theorem 6.4, there exists an absolute constant C2 such that

P ∥((matk Y)(matk Y)T ⊙ M − (matk X )(matk X )T ) ⊙ N ∥op ≥ u
1/3
≤ C2 n−1
k e
−t
for all t ≥ 1 ∨ ε−3 ,

where
√ n1 · · · nd ρ2
u = C2 (t + Ctrim ) n1 · · · nd ρ + .
nk
Choose s so that nk s ≪ αnk /rk ≤ minl |zk−1 {l}| to ensure that eventually every cluster is
2
present also after masking with N . More specifically, we may choose e−n1 ···nd ρ + n−1
k ≪ s ≪
2
2 rk3 −1
rk−1 since rk e−n1 ···nd ρ ≤ n1 ···nd ρ2
≪ 1 and rk
nk
≪ nk3 ≪ 1. Define

∆= min ∥((matk X )(matk X )T )ik : − ((matk X )(matk X )T )i′k : ∥.

ik ,i′k :zk (i)̸=zk (i′ )

24
By Theorem 6.1, there exists an absolute constant C3 such that
!
rk u 2
P ℓ(zk , ẑk ) ≥ s + C3 Q
nk ∆2
   
nk  X
1 X 
≤ P I |(matk Y)il (matk Y)j ′ l | ≥ Ctrim n1 · · · nd ρ2 ≥ s
nk i=1 j ′ ,l:j ′ ̸=i 

+ P ∥((matk Y)(matk Y)T ⊙ M − (matk X )(matk X )T ) ⊙ N ∥op ≥ u
1 q

−n1 ···nd ρ2 −t1/3
≤ Ctrim e + 3nk + C2 n−1
−1
k e
s
2
when minl |zk−1 {l}| ≥ C3 Q r∆
ku
2 . By Lemma 6.2, there exists a constant C4 such that ∆ ≥
1
αd− 2 δ 2
C4√
nk rk
n1 · · · nd ρ2 . Now

√ √ n1 ···nd ρ2
rk u C2 (t + Ctrim ) n1 · · · nd ρ + nk
√ ≤ 1
nk ∆ C4 αd− 2 δ 2
n · · · n ρ2
rk 1 d
!
rk C2 (t + Ctrim ) 1
= 1 √ + .
C4 αd− 2 δ 2 n1 · · · nd ρ nk
2 3/2
Now minl |zk−1 {l}| ≥ C3 Q r∆
ku
2 holds eventually by assumptions ρ ≫ rk (n1 · · · nd )−1/2 and
3/2
nk ≫ rk , namely
αnk
minl |zk−1 {l}| rk
2 ≥ 2
C3 Q r∆ ku

rk C2 (t+Ctrim ) 1
2
C3 Qnk 1 √
n1 ···nd ρ
+ nk
C4 αd− 2 δ 2
 !2 −1
C3 Qr3 C2 (t + Ctrim ) 1
=  2 2d−1k 4 √ +  ≫ 1.
C4 α δ α n1 · · · nd ρ nk

2
Set t = 1 ∨ ε−3 to be a constant. Recall e−n1 ···nd ρ + n−1 −1
k ≪ s ≪ rk . Now we have
 
!!2 
rk C2 (t + Ctrim ) 1

 
ℓ(zk , ẑk )
P ≥ s + C3 Q 1 √ + 
 C4 αd− 2 δ 2 n1 · · · nd ρ nk 

| {z }
≪1
1 q

−n1 ···nd ρ2 −t1/3
≤ Ctrim e + 3nk + C2 n−1
−1
k e ,
s
| {z }
≪1

that is, ℓ(zk , ẑk ) → 0 in probability. This concludes the proof of Theorem 3.1. ■

25
Acknowledgment We thank Bogumił Kamiński and Paul Van Dooren for stimulating
discussions and helpful comments.

References
[1] K. Alaluusua, K. Avrachenkov, V. B. R. Kumar, and L. Leskelä. “Multilayer hyper-
graph clustering using the aggregate similarity matrix”. In: Algorithms and Models for
the Web Graph (WAW). 2023, pp. 83–98.
[2] D. Arthur and S. Vassilvitskii. “k-means++: the advantages of careful seeding”. In:
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms
(SODA ’07). New Orleans, Louisiana, 2007, pp. 1027–1035.
[3] K. Avrachenkov, M. Dreveton, and L. Leskelä. “Community recovery in non-binary
and temporal stochastic block models”. In: arXiv:2008.04790 (2022).
[4] G. Bennett. “Probability Inequalities for the Sum of Independent Random Variables”.
In: Journal of the American Statistical Association 57.297 (1962), pp. 33–45.
[5] L. Brusa and C. Matias. “Model-based clustering in simple hypergraphs through a
stochastic blockmodel”. In: Scandinavian Journal of Statistics 51.4 (2024), pp. 1661–
1684.
[6] V. Buldygin and K. Moskvichova. “The sub-Gaussian norm of a binary random vari-
able”. In: Theory of Probability and Mathematical Statistics 86 (2013), pp. 33–49.
[7] T. T. Cai and A. Zhang. “Rate-Optimal Perturbation Bounds for Singular Subspaces
with Applications to High-Dimensional Statistics”. In: Annals of Statistics 46.1 (2018),
pp. 60–89.
[8] H. Chernoff. “A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on
the sum of Observations”. In: Annals of Mathematical Statistics 23.4 (1952), pp. 493–
507.
[9] I. E. Chien, C.-Y. Lin, and I.-H. Wang. “On the Minimax Misclassification Ratio of
Hypergraph Community Detection”. In: IEEE Transactions on Information Theory
65.12 (2019), pp. 8095–8118.
[10] P. Chodrow, N. Eikmeier, and J. Haddock. “Nonbacktracking Spectral Clustering of
Nonuniform Hypergraphs”. In: SIAM Journal on Mathematics of Data Science 5.2
(2023), pp. 251–279.
[11] H. Cramér. “Sur un nouveau théorème-limite de la théorie des probabilités”. In: Col-
loque consacré à la théorie des probabilités. Actualités Scientifiques et Industrielles 736
(1938), pp. 2–23.

26
[12] G. Dai, Z. Su, and H. Wang. “Tail Bounds on the Spectral Norm of Sub-Exponential
Random Matrices”. In: arXiv:2212.07600 (2022).
[13] L. De Lathauwer, B. De Moor, and J. Vandewalle. “A Multilinear Singular Value
Decomposition”. In: SIAM Journal on Matrix Analysis and Applications 21.4 (2000),
pp. 1253–1278.
[14] L. De Lathauwer, B. De Moor, and J. Vandewalle. “On the Best Rank-1 and Rank-(R1
,R2 ,. . .,RN ) Approximation of Higher-Order Tensors”. In: SIAM Journal on Matrix
Analysis and Applications 21.4 (2000), pp. 1324–1342.
[15] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. “Clustering Large Graphs
via the Singular Value Decomposition”. In: Machine learning 56 (2004), pp. 9–33.
[16] I. Dumitriu, H. Wang, and Y. Zhu. “Partial recovery and weak consistency in the
non-uniform hypergraph stochastic block model”. In: Combinatorics, Probability and
Computing 34.1 (2025), pp. 1–51.
[17] C. Eckart and G. Young. “The approximation of one matrix by another of lower rank”.
In: Psychometrika 1.3 (1936), pp. 211–218.
[18] U. Feige and E. Ofek. “Spectral techniques applied to sparse random graphs”. In:
Random Structures & Algorithms 27.2 (2005), pp. 251–275.
[19] J. Friedman, J. Kahn, and E. Szemerédi. “On the second eigenvalue of random regular
graphs”. In: Proceedings of the Twenty-First Annual ACM Symposium on Theory of
Computing (STOC ’89). Seattle, Washington, USA: Association for Computing Ma-
chinery, 1989, pp. 587–598.
[20] G. H. Golub and C. F. Van Loan. Matrix Computations. 4th ed. JHU press, 2013.
[21] R. Han, Y. Luo, M. Wang, and A. R. Zhang. “Exact Clustering in Tensor Block Model:
Statistical Optimality and Computational Limit”. In: Journal of the Royal Statistical
Society Series B: Statistical Methodology 84.5 (2022), pp. 1666–1698.
[22] C. J. Hillar and L.-H. Lim. “Most tensor problems are NP-hard”. In: Journal of the
ACM 60.6 (2013), pp. 1–39.
[23] V. Hore, A. Viñuela, A. Buil, J. Knight, M. I. McCarthy, K. Small, and J. Marchini.
“Tensor decomposition for multiple-tissue gene expression experiments”. In: Nature
Genetics 48.9 (2016), pp. 1094–1100.
[24] J. Jin. “Fast community detection by score”. In: Annals of Statistics 43.1 (2015),
pp. 57–89.
[25] Z. T. Ke, F. Shi, and D. Xia. “Community Detection for Hypergraph Networks via
Regularized Tensor Power Iteration”. In: arXiv:1909.06503 (2019).

27
[26] Y. Kong and T. Yu. “A hypergraph-based method for large-scale dynamic correlation
study at the transcriptomic scale”. In: BMC Genomics 20.397 (2019).
[27] D. Kunisky. “Low coordinate degree algorithms II: Categorical signals and generalized
stochastic block models”. In: arXiv:2412.21155 (2024).
[28] J. Lei, K. Chen, and B. Lynch. “Consistent community detection in multi-layer network
data”. In: Biometrika 107.1 (2020), pp. 61–73.
[29] J. Lei and K. Z. Lin. “Bias-Adjusted Spectral Clustering in Multi-Layer Stochastic
Block Models”. In: Journal of the American Statistical Association 118.544 (2023),
pp. 2433–2445.
[30] J. Lei and A. Rinaldo. “Consistency of spectral clustering in stochastic block models”.
In: Annals of Statistics 43.1 (2015), pp. 215–237.
[31] J. Lei, A. R. Zhang, and Z. Zhu. “Computational and statistical thresholds in multi-
layer stochastic block models”. In: Annals of Statistics 52.5 (2024), pp. 2431–2455.
[32] M. Melé, P. G. Ferreira, F. Reverter, D. S. DeLuca, J. Monlong, M. Sammeth, T. R.
Young, J. M. Goldmann, D. D. Pervouchine, T. J. Sullivan, R. Johnson, A. V. Segrè, S.
Djebali, A. Niarchou, The GTEx Consortium, F. A. Wright, T. Lappalainen, M. Calvo,
G. Getz, E. T. Dermitzakis, K. G. Ardlie, and R. Guigó. “The human transcriptome
across tissues and individuals”. In: Science 348.6235 (2015), pp. 660–665.
[33] S. Paul and Y. Chen. “Spectral and matrix factorization methods for consistent com-
munity detection in multi-layer networks”. In: Annals of Statistics 48.1 (2020), pp. 230–
250.
[34] N. Ruggeri, M. Contisciani, F. Battiston, and C. De Bacco. “Community detection in
large hypergraphs”. In: Science Advances 9.28 (2023), eadg9159.
[35] L. Stephan and Y. Zhu. “Sparse random hypergraphs: non-backtracking spectra and
community detection”. In: Information and Inference: A Journal of the IMA 13.1
(2024), iaae004.
[36] W. Su, X. Guo, X. Chang, and Y. Yang. “Spectral co-clustering in multi-layer directed
networks”. In: Computational Statistics & Data Analysis 198 (2024), p. 107987.
[37] J. A. Tropp. “User-Friendly Tail Bounds for Sums of Random Matrices”. In: Founda-
tions of Computational Mathematics 12 (2012), pp. 389–434.
[38] N. Vannieuwenhoven, R. Vandebril, and K. Meerbergen. “A New Truncation Strategy
for the Higher-Order Singular Value Decomposition”. In: SIAM Journal on Scientific
Computing 34.2 (2012), pp. 1027–1052.
[39] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data
Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge
University Press, 2018.

28
[40] M. Wang, J. Fischer, and Y. S. Song. “Three-way clustering of multi-tissue multi-
individual gene expression data using semi-nonnegative tensor decomposition”. In:
Annals of Applied Statistics 13.2 (2019), pp. 1103–1127.
[41] M. Wang and Y. Zeng. “Multiway clustering via tensor block models”. In: Advances
in Neural Information Processing Systems (NeurIPS) 32 (2019).
[42] H. Weyl. “Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Dif-
ferentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung)”.
In: Mathematische Annalen 71.4 (1912), pp. 441–479.
[43] Q. Zhang and V. Y. F. Tan. “Exact Recovery in the General Hypergraph Stochastic
Block Model”. In: IEEE Transactions on Information Theory 69.1 (2023), pp. 453–471.

29
A Proof of Theorem 6.3
Lemma A.1 asserts that in many cases, the largest L1 -norms of rows of a random matrix can
be bounded with high probability. Furthermore, in a boundary case, almost every row can
be bounded with high probability. Although it is not needed to prove Theorem 6.3, Lemma
A.1 shows that only a small fraction of rows and columns are removed in Theorem 6.3 to
obtain the desired norm bound. Lemma A.1 is also needed to prove Theorem 6.4. Lemma
A.2 is used to prove Theorem 6.3 bounding the operator norm of a random matrix, from
which rows and columns with too large L1 -norms are removed. The proofs mainly follow
[18], which focuses on Bernoulli distributed entries (random graphs). The proof technique
originates from [19], where an analogous result is given for random regular graphs.

Lemma A.1. Let X ∈ Rn×m be a random matrix with independent centered entries satisfying
the MGF bound with variance proxy σ 2 (Assumption 2.1). Define centered L1 -norms Si =
Pm
j=1 (|Xij | − E|Xij |). Then for any t ≥ 0, we have

n n
! !
X
2
o en enmσ 2
P I Si ≥ (1 + t)mσ ≥ ≤ exp − mσ2 /8 t
i=1 emσ2 /8 8e
≤ (en)−t , mσ 2 ≤ 8 log en,
q
P max Si ≥ (1 + t) 8mσ 2 log en ≤ (en)−t , mσ 2 ≥ 8 log en.
i=1,...,n

That is, if mσ 2 ≥ 8 log en, then the largest L1 -norm is bounded by maxi ∥Xi: ∥1 ≤
√
maxi E∥Xi: ∥1 + C mσ 2 log en with high probability as n → ∞, and if 1 ≪ mσ 2 ≤ 8 log en,
then the fraction of “large” L1 -norms n1 ni=1 I{∥Xi: ∥1 ≥ E∥Xi: ∥1 + 2mσ 2 } ≪ 1 tends to zero
P

with high probability as n → ∞, and if C ≤ mσ 2 ≤ 8 log en for some constant C, then the
fraction of “large” L1 -norms n1 ni=1 I{∥Xi: ∥1 ≥ E∥Xi: ∥1 + 2mσ 2 } ≤ e1−C/8 is at most of con-
P

stant order with high probability as n → ∞, but by setting C large enough, this fraction can
be made arbitrarily small. Under an extra assumption E|Xij | ≲ σ 2 , we have E∥Xi: ∥1 ≲ mσ 2
and the L1 -norm bounds are in the order of mσ 2 .
2 λ
Proof. By Lemma D.1:(v),
P we have Eeλ(|Xij |−E|Xij |) ≤ e2σ (e −1−λ) for all λ ≥ 0. By indepen-
dence, EeλSi = Eeλ j (|Xij |−E|Xij |) = j Eeλ(|Xij |−E|Xij |) ≤ e2mσ (e −1−λ) for all λ ≥ 0. Assume
Q 2 λ

that k ≥ i P(Si ≥ t). By Bennett’s inequality (D.1) and Bernstein’s inequality (D.3) in
P

30
Lemma D.2, we have
n n n
! !
X X X
P I{Si ≥ t} ≥ k = P (I{Si ≥ t} − P{Si ≥ t}) ≥ k − P{Si ≥ t}
i=1 i=1 i=1
Pn !k
(D.1)
−
Pn
P{Si ≥t} e i=1 P{Si ≥ t}
≤ e i=1
k
Pn !k
(∗) e i=1 P{Si ≥ t}
≤
k
!!!k
(D.3)en t2 3t
≤ exp − 2
∧
k 8mσ 4
!!!
2
en t 3t
= exp k log − ∧ .
k 8mσ 2 4

Now
n
en 4 log en
( !) !
en
r
X
k
P I Si ≥ (1 + t) 8mσ 2 log ∨ ≥ k ≤ exp −k log t .
i=1 k 3 k

Notice that the assumption k ≥ i P(Si ≥ t) is needed for valid argumentation, but it
P

can be seen in (∗) that the probability upper bound becomes greater than one and hence
trivial, when k < i P(Si ≥ t). Therefore, the assumption k ≥ i P(Si ≥ t) does not
P P

need to be considered further. Notice also, that the function k 7→ k log en

k
is increasing for
en n
1 ≤ k ≤ n since it has a nonnegative derivative log k − 1 = log k ≥ 0. Furthermore, since
it is impossible for a sum of n indicators to exceed n, we have
n
en 4 log en
( r !) !
≥ k ≤ (en)−t
X
k
P I Si ≥ (1 + t) 2
8mσ log ∨
i=1 k 3

en en
for all k ≥ 1. Choose k = 1 ∨ 2
emσ /8
so that if mσ 2 ≤ 8 log en, then k = 2
emσ /8
and

en 4 log en mσ 2
r
∨
8mσ 2 log k
= mσ 2 ∨ = mσ 2 ,
k 3 ! 6
n n
!
2
o en enmσ
I Si ≥ (1 + t)mσ ≥ mσ2 /8 ≤ exp − mσ2 /8 t ≤ (en)−t ,
2
X
P
i=1 e 8e

31
and if mσ 2 ≥ 8 log en, then k = 1 and

en 4 log en 4 log en
r q
k
8mσ 2 log ∨ = 8mσ 2 log en ∨
k 3 s3
q 4 mσ 2 log en
≤ 8mσ 2 log en ∨
3 8
q
= 8mσ 2 log en,
n ! n
q q !
[ X
P Si ≥ (1 + t) 8mσ 2 log en =P I Si ≥ (1 + t) 8mσ 2 log en ≥ 1
i=1 i=1
−t
≤ (en) .

The following lemma in its original form [18] bounds sums of the form i∈S j∈T |Xij |,
P P

where X ∈ Rn×m is a random matrix and S ⊂ [n] and T ⊂ [m]. When X corresponds to an
adjacency matrix of a graph, then this sum counts the number of (directed) edges from the
vertex set S to the vertex set T . However, here the statement is modified to draw a clearer
connection to the operator norm ∥X∥op = sup∥u∥,∥v∥≤1 i,j ui vj Xij .
P

Lemma A.2. Let X ∈ Rn×m be a random matrix with independent centered entries (EXij =
2 |λ|
0) satisfying EeλXij ≤ eσ (e −1−|λ|) for all λ ∈ R. For subsets S ⊂ [n] and T ⊂ [m],
define X(S, T ) = max{ i∈S j∈T ui vj Xij | ui , vj ∈ {−1, +1}} and a function Cσ2 (t) =
P P
σ2 +t
2 eσ 2
e−σ σ 2 +t
. Then
 ( ! !!)
[ 2en 2em
P − log C|S||T |σ2 (X(S, T )) ≥ (1 + t) |S| log + |T | log 
S,T |S| |T |
≤ (e2 nm)−t , t ≥ 1,

where S ⊂ [n] and T ⊂ [m] are nonempty index sets.

̸ S ⊂ [n], u ∈ {±1}S , ∅ =
Proof. If ∅ = ̸ T ⊂ [m] and v ∈ {±1}T , then Lemma D.1:(iii)
2 |λ|
gives moment condition E exp(λ i∈S,j∈T ui vj Xij ) ≤ e|S||T |σ (e −1−|λ|) for all λ ∈ R. For any
P

tkl ≥ 0 indexed by k = 1, . . . , n and l = 1, . . . , m, applying the union bound, Bennett’s

32
k
n en
inequality (D.1) and the inequality k
≤ k
gives
    
[n o [ [ X X 
P X(S, T ) ≥ t|S|,|T |  = P uvX ≥
i j ij t|S|,|T | 
 
S,T S,T u∈{±1}S ,v∈{±1}T i∈S j∈T
 
X X XX
≤ P ui vj Xij ≥ t|S|,|T | 
S,T u∈{±1}S ,v∈{±1}T i∈S j∈T
 
n X
X m X X X X XX
= P ui vj Xij ≥ tkl 
k=1 l=1 S∈([n]) T ∈([m]) u∈{±1}S v∈{±1}T i∈S j∈T
k l

n X
m
! ! !klσ2 +tkl
X n m k l −klσ2 eklσ 2
≤ 2 2 e
k=1 l=1 k l klσ 2 + tkl
| {z }
Cklσ2 (tkl )
n X
m
2en k 2em l
X
≤ Cklσ2 (tkl ).
k=1 l=1 k l

Let us fix t ≥ 1. Since the bound Cklσ2 : [0, ∞) → [0, 1] is a continuous function start-
ing from 1 and decreasing to 0 in the limit, we can find tkl > 0 such that Cklσ2 (tkl ) =
k l −(1+t)
2en 2em
k l
. Since the bound Cklσ2 is a strictly decreasing function, we have an
equivalence X(S, T ) ≥ tkl ⇐⇒ Cklσ2 (X(S, T )) ≤ Cklσ2 (tkl ) (notice that X(S, T ) ≥ 0 is
always nonnegative, because whenever uT Xv < 0, we have (−u)T Xv > 0). Therefore
 ( ! !!)
[ 2en 2em
P − log C|S||T |σ2 (X(S, T )) ≥ (1 + t) |S| log + |T | log 
S,T |S| |T |
 
[n o
= P − log C|S||T |σ2 (X(S, T )) ≥ − log C|S||T |σ2 (t|S|,|T | ) 
S,T
n m k l
XX 2en 2em
≤ Cklσ2 (tkl )
k=1 l=1 k l
n
!kt m !lt
X k X l
= .
k=1 2en l=1 2em

Since x 7→ (x/2en)x = exp(x log(x/2en)) is decreasing for 1 ≤ x ≤ n (the derivative of

kt
k
x 7→ x log(x/2en) is nonpositive log(x/2en)+1 = log(x/2n) ≤ 0), we may estimate 2en
≤

33
2t
2
2en
for 2 ≤ k ≤ n yielding

n
!kt t 2t
k 1 2
X
≤ + (n − 1)+
k=1 2en 2en 2en
t≥1 2−t 1
≤ t
+n
(en) (en)1+t
2−t + e−1
=
(en)t
1
≤ .
(en)t

Hence,

n
!kt m !lt
X k X l 1
≤
k=1 2en l=1 2em (e2 nm)t

and the claim follows. ■

Proof of Theorem 6.3. Since the operator norm of a submatrix is less than the operator
norm of the whole matrix, without loss of generality, we may assume that n = m (the
smaller dimension can be extended to the larger dimension by filling with zero-entries, which
does not affect the claimed bound of ∥X∥op nor the probability bound). The proof is divided
into several parts. The first part discusses the general idea and the proof task, and the
remaining parts consist of completing the tasks.
Discretization. Let 0 < ε < 1/2 and S′ε be an ε-net of the n−1-dimensional unit sphere
Sn−1 of size |S′ε | ≤ (2ε−1 + 1)n (Corollary 4.2.13 in [39]). Define a new ε-net

Sε = {(s1 u1 , . . . , sn un ) | u ∈ S′ε , s ∈ {1, −1}n } ⊃ S′ε

so that |Sε | ≤ 2n ·(2ε−1 +1)n = (4ε−1 +2)n and whenever (u1 , . . . , un ) ∈ Sε , also (u1 , . . . , −ui , . . . , un ) ∈
Sε . Define a constant Cε = 2 log(4ε−1 + 2) (so that |Sε × Sε | ≤ eCε n ). By Exercise 4.4.3 in
[39], we have
1 1
∥X ′ ∥ ≤ max uT X ′ v = ui vj Xij′ .
X
max
1 − 2ε u,v∈Sε 1 − 2ε u,v∈S ε
i,j

Applying a Chernoff bound directly on sums i,j ui vj Xij′ will work sufficiently well only when
P

|ui vj | is sufficiently small (proven later). Hence, the proof also needs another argument for
heavier terms.
Discretize (0, 1] into bins by fixing any sequences 1 = a1 ≥ a2 ≥ . . . and 1 = b1 ≥ b2 ≥ . . .
to be used as bin boundaries. Let u, v ∈ B n be vectors with at most unit length. Let
us bin the values into sets Sk = {i | |ui | ∈ (ak+1 , ak ]}, a1 = 1 of size sk = |Sk | and
Tl = {j | |vj | ∈ (bl+1 , bl ]}, b1 = 1 of size tl = |Tl |. This is analogous to histograms, where the

34
entries of the vectors correspond to the data, Sk and Tl are the bins with endpoints ak and
bl and frequencies/counts sk and tl . Let us emphasize that sk and Sk depend on the vector u
and similarly, tl and Tl depend on the vector v, although this dependence is not written out
explicitly. The endpoints ak , bl , in turn, are fixed and do not depend on the vectors. Now

ui vj Xij′ = ui vj Xij′ ≤ xi yj Xij′

X X X X X
ak b l max
S
x∈{±1} k ,y∈{±1}Tl
i,j k,l i∈Sk ,j∈Tl k,l i∈Sk ,j∈Tl
| {z }
X ′ (Sk ,Tl ) (A.1)
ak bl X ′ (Sk , Tl ).
X
=
k,l

The intuition is, that X ′ (Sk , Tl ) concentrates better for larger bins Sk and Tl . From this
point of view, bins should be chosen as large as possible. However, the larger the bins, the
more information about the distribution of the values of u and v is lost. From this point of
view, bins should be chosen as small as possible. It turns out that preserving information
about the norms is important:

a2k a2k X 2 a2k

sk a2k = u2i
X XX
≤ sup u i = sup .
k k i∈Sk u2i 2
k ak+1 i
2
k ak+1

ak
That is, larger ratios ak+1 preserve the norm worse. This indicates that ak can decrease at
most geometrically/exponentially with a common ratio a = supk ak /ak+1 , and thus geomet-
rically/exponentially decreasing intervals might be the optimal choice (in [18], the ratio was
chosen to be a = 2). Thus, choose ak = a1−k to get k sk a2k ≤ a2 . Similarly, define bk = b1−k
P

for some b > 1.

By combining these two ideas, we arrive at

1
∥X ′ ∥op ≤ ui vj Xij′
X
max
1 − 2ε u,v∈Sε i,j
 
1
ui vj Xij′ + ui vj Xij′ 
X X X X
= max 
1 − 2ε u,v∈Sε (k,l)∈L i∈Sk ,j∈Tl / i∈Sk ,j∈Tl
(k,l)∈L
 
1 
ak bl X ′ (Sk , Tl ) ,
X X X
≤ max ui vj Xij + max
1 − 2ε u,v∈Sε (k,l)∈L i∈Sk ,j∈Tl u,v∈Sε
(k,l)∈L
/

where L is a set of indices called light couples (defined later). Here, the last inequality
follows from the symmetry condition of the ε-net Sε . Namely, although it may be that

35
0 = ui vj Xij′ > ui vj Xij , we do have

ui vj Xij′
X X
max
u,v∈Sε
(k,l)∈L i∈Sk ,j∈Tl

ui I(Xi:′ ̸= 0) vj Xij I(X:j′ ̸= 0)

X X
= max
u,v∈Sε
i j:∃(k,l)∈L:i∈Sk ,j∈Tl

vj Xij I(X:j′ ̸= 0)
X X
≤ max ui
u,v∈Sε
i j:∃(k,l)∈L:i∈Sk ,j∈Tl

vj Xij I(X:j′ ̸= 0)
X X
= max ui
u,v∈Sε
i j:∃(k,l)∈L:i∈Sk ,j∈Tl

vj I(X:j′ ̸= 0)
X X
= max ui Xij
u,v∈Sε
j i:∃(k,l)∈L:i∈Sk ,j∈Tl

X X
≤ max vj ui Xij
u,v∈Sε
j i:∃(k,l)∈L:i∈Sk ,j∈Tl
X X
= max vj ui Xij
u,v∈Sε
j i:∃(k,l)∈L:i∈Sk ,j∈Tl
X X
= max ui vj Xij .
u,v∈Sε
(k,l)∈L i∈Sk ,j∈Tl

When needed, we may also bound X ′ (S, T ) ≤ X(S, T ). Motivated by Bennett’s inequality
σ2 +t t
2 eσ 2 eσ 2
(D.1), define a function Cσ2 (t) = e−σ σ 2 +t
(recall the upper bound Cσ2 (t) ≤ t
)
and events
( ! !!)
[ 2en 2en
E1 = − log C|S||T |σ2 (X(S, T )) ≥ (1 + t) |S| log + |T | log
S,T |S| |T |
( ! !!)
[ X(S, T ) 2en 2en
⊃ X(S, T ) log ≥ (1 + t) |S| log + |T | log ,
S,T e|S||T |σ 2 |S| |T |
 
[  X X √ 
E2 = ui vj Xij ≥ (1 + t)CL nσ 2 ,
 
u,v∈Sε (k,l)∈L i∈Sk ,j∈Tl

E = E1 ∪ E2 .
It remains to prove the desired upper bound for ∥X ′ ∥op by bounding the sums (k,l)∈L i∈Sk ,j∈Tl ui vj Xij
P P
P c
and (k,l)∈L
/ ak bl X(Sk , Tl ) under E .
1. Light couples. Define the set of light couples as L = {(k, l) | ak bl ≤ τ } for some

36
threshold value τ > 0. By Lemma D.1:(iii), we have
  
 
  X X ui vj  X X u2i vj2 σ 2 |λ|
E exp λ Xij  ≤ exp  (e − 1 − |λ|)
  
  τ  τ 2
(k,l)∈L i∈Sk ,j∈Tl | {z } (k,l)∈L i∈S k ,j∈Tl
∈[−1,1]
!
σ2
≤ exp 2 (e|λ| − 1 − |λ|) .
τ

For any t ≥ 3σ 2 /τ , Bernstein’s inequality (D.3) gives

    
[  X X  X X X ui vj t
P ui vj Xij ≥ t  ≤ P Xij ≥ 
u,v∈Sε

(k,l)∈L i∈Sk ,j∈Tl

u,v∈Sε (k,l)∈L i∈Sk ,j∈Tl
τ τ
!!
X (t/τ )2 3(t/τ )
≤ exp − ∧
u,v∈Sε 4σ 2 /τ 2 4
t t 3

≤ exp Cε n − 2
∧
4 σ τ
3t

t≥3σ 2 /τ
= exp Cε n −
4τ

Choose τ such that when the condition t ≥ 3σ 2 /τ holds as equality, thenqthe probability
9σ 2
bound becomes trivial, that is, 0 = Cε n − 9σ 2 /4τ 2 , or equivalently, τ = 4C εn
= 2√3σ
Cε n
.
3σ 2
√ 2
Now τ = 2 Cε nσ and
  
[  X X q 
P(E2 ) = P  ui vj Xij ≥ (1 + t)2 Cε nσ 2 
 
u,v∈Sε (k,l)∈L i∈Sk ,j∈Tl

≤ exp (−tCε n) .
√
In summary, by defining a constant CL = 2 Cε , a threshold parameter τ = CL3σ√n and the set
of light couples L = {(k, l) | ak bl ≤ τ }, we have an upper bound (k,l)∈L i∈Sk ,j∈Tl ui vj Xij ≤
P P
√
(1 + t)CL nσ 2 in E2c and P(E2 ) ≤ e−Cε nt .
2. Bounded L1 -norm. Define the first set of heavy couples as H1 = {(k, l) ∈ / L | abkl ≥

37
√ bl
√
nσ 2 or ak
≥ nσ 2 }. Then

1
ak bl X ′ (Sk , Tl ) ≤ √ a2k |Xij′ |
X X X
a √
k,l: k ≥ nσ 2
nσ 2 k,l:
ak √
≥ nσ 2 i∈Sk ,j∈Tl
bl bl

1 X 2 X
≤√ ak |Xij′ |
nσ k2
i∈Sk ,j
1 X 2
≤√ ak sk max∥Xi:′ ∥1
nσ k2 i

a2
≤√ max∥Xi:′ ∥1
nσ 2 i
√
≤ Ctrim a2 nσ 2 .

Hence,
√
ak bl X ′ (Sk , Tl ) ≤ Ctrim (a2 + b2 ) nσ 2 .
X

(k,l)∈H1

3. Successful concentration. Define the second set of heavy couples H2 = {(k, l) ∈ /

σ 2 3σ
L ∪ H1 | X(Sk , Tl ) ≤ e2 τ sk tl ak bl } (recall from light couples that τ = CL √n ). This is weaker
assumption than a well-behaved concentration X(Sk , Tl ) ≤ σ 2 e2 sk tl , namely in Lc , we have
2
ak bl ≥ τ and consequently σ 2 e2 sk tl ≤ e2 στ sk tl ak bl . Now

σ2
ak bl X(Sk , Tl ) ≤ e2 a2k sk b2l tl
X X

(k,l)∈H2
τ (k,l)∈H2

σ2
≤ e2 a2 b2
τ
2 2 2√
CL e a b
= nσ 2 .
3

4. Geometric sum. In H2c , we have X(Sk , Tl ) ≥ σ 2 e2 sk tl . Therefore log X(S k ,Tl )

esk tl σ 2
≥ 1 is
c
positive and condition E1 can be written as
 
1+t 2en 2en
 
E1c ⊂
\
X(Sk , Tl ) ≤ sk log + tl log .
(k,l)∈H2c
 log X(S k ,Tl )
esk tl σ 2
sk tl 

38
Under E1c , we have
 
X X sk log 2en
sk
X tl log 2en
tl
ak bl X(Sk , Tl ) ≤ (1 + t)  ak bl X(Sk ,Tl )
+ ak b l X(Sk ,Tl )

k,l k,l log esk tl σ 2 k,l log esk tl µ
 
X 2en X 2en 
≤ (1 + t)  ak bl sk log + ak bl tl log .
k,l sk k,l tl

By symmetry, it suffices to bound only the first sum; the second sum can be bounded with
similar arguments. By geometric summation formula, we have

log 2en
sk
X bl log 2en
sk
sk a2k
X X
s k ak b l =
k,l log X(S k ,Tl )
esk tl σ 2 k l ak log X(Sk ,Tl )
es t σ 2
k l
 
log 2ensk
sk a2k 
X X
≤ X(Sk ,Tl )
bl 
k ak minl log esk tl σ2 l
a2 bl log 2en
sk
≤ max .
1 − b−1 k,l ak minl log X(S k ,Tl )
esk tl σ 2

It remains to bound the terms of the geometric sums. By assuming log 2en ≤ CH3 log X(S k ,Tl )
esk tl σ 2
√ sk
and H1c , we have abkl ≤ nσ 2 (assumption H1c ) and consequently the desired upper bound
bl log 2en
sk
√
X(Sk ,Tl ) ≤ CH3 nσ 2 . Here the value of the constant CH3 will play an important role
ak minl log
esk tl σ 2
later, so it will be fixed then.
Let us study other ways of bounding the terms of the geometric sum by assuming
now log 2ensk
≥ CH3 log X(S k ,Tl )
esk tl σ 2
/ H2c , that is,
. By combining this with assumption (k, l) ∈
2
X(Sk , Tl ) ≥ e2 sk tl ak bl στ , we get
1
2en e2 sk tl ak bl σ 2 eak bl

CH3
≥ =
sk esk tl σ 2 τ τ
1
τ 2en CH3

=⇒ bl ≤
eak sk
log 2en
sk τ
1
2en CH3 2en
=⇒ bl ≤ 2 log .
ak eak sk sk

To obtain the desired bound for the geometric sum, we assume

1
τ

2en

CH3 2en √
log ≤ CH3 nσ 2
ea2k sk sk
1 √
2en 2en CH3 ea2k nσ 2 CH3 CL ea2k n

CH3
⇐⇒ log ≤ = .
sk sk τ 3

39
In summary, by defining the third set of heavy couples as

(k, l) ∈
/ L ∪ H1 ∪ H2 ,
 

 

 1 
≤ CH3 log X(S k ,Tl )
 
CH3 CL e 2
H3 = (k, l) log 2en
sk esk tl σ 2
or 2en
sk
CH3
log 2en
sk
≤ 3
ak n, ,
 1 

log 2en ≤ CH3 log X(S k ,Tl ) 2en CH3
log 2en ≤ CH3 CL e 2
 

tl esk tl σ 2
or tl tl 3
bl n 

2
a b2
√
we have (k,l)∈H3 ak bl X(Sk , Tl ) ≤ (1 + t) 1−b CH3 nσ 2 .
P
−1 + 1−a−1
5. Remaining heavy couples. Let us consider the remaining couples (k, l) ∈ / L∪
H1 ∪ H2 ∪ H3 . Without loss of generality, since (k, l) ∈ / H3 , we may assume that log 2en
sk
≥
CH3 log X(S k ,Tl )
esk tl σ 2
and

1
CH3 CL e 2 2en CH3 2en

ak n ≤ log
3 sk sk
1 1 !
2en CH3 2en CH4

= CH4 log
sk sk
1 1
2en CH3 + CH4

≤ CH4 ,
sk
1
1− − C1
2en 3CH4 2en

CH3 H4
≤ 2
sk CH3 CL eak n sk
6CH4 1
≤ .
CH3 CL sk a2k
−1 −1 −1 1 CH3
By setting 1 − CH3 − CH4 = CH3 , that is, CH4 = −1
1−2CH3
= CH3 −2
(as for example CH3 =
CH4 = 3) we have
1
H3c
2en

CH3
2
X X
ak bl X(Sk , Tl ) ≤ ak bl esk tl σ
k,l k,l sk
H3c 6eCH4 2 X 1
≤ σ ak bl sk tl
CH3 CL k,l sk a2k
6eCH4 2 X 2 X 1
= σ bl tl
CH3 CL l k ak b l
Lc6eCH4 2 2 τ −1
≤ σ b
CH3 CL 1 − a−1
2eCH4 b2 √ 2
= nσ .
CH3 (1 − a−1 )

In summary, by defining the fourth set of heavy couples as H4 = (L ∪ H1 ∪ H2 ∪ H3 )c and a

40
CH3
constant CH4 = CH3 −2
, we have

2eCH4 √ 2
!
X a2 b2
ak bl X(Sk , Tl ) ≤ + nσ .
(k,l)∈H4
1 − b−1 1 − a−1 CH3

Summary. Under E c , we have

 
1 
∥X ′ ∥op ak bl X ′ (Sk , Tl )
X X X
≤ max ui vj Xij + max
1 − 2ε u,v∈Sε (k,l)∈L i∈Sk ,j∈Tl u,v∈Sε
(k,l)∈L
/

1 √ 2 2
√ CL e2 a2 b2 √ 2
≤ 2 2
(1 + t)CL nσ + Ctrim (a + b ) nσ + nσ
1 − 2ε | {z } | {z } 3 {z
L H1
| }
H2
√
!
2 2
a b
+ (1 + t) −1
+ −1
CH3 nσ 2
1−b 1−a
| {z }
H3
2eCH4 √ 2
! !
a2 b2
+ + nσ
1 − b−1 1 − a−1 CH3
| {z }
H4
1 CL e2 a2 b2
= (1 + t)CL + Ctrim (a2 + b2 ) +
1 − 2ε 3
2 2
! !√
a b 2e
+ + (1 + t)CH3 + nσ 2
1 − b−1 1 − a−1 CH3 − 2
a=b=2, √
CH3 =CH4 =3 1 q 32e2 Cε
= (1 + t)2 Cε + 8Ctrim +
1 − 2ε 3
√
!
+ 16 (3(1 + t) + 2e) nσ 2
√
! !
1 q 32e2 q
= (2 Cε + 48)t + 8Ctrim + 2 + Cε + 48 + 32e nσ 2
1 − 2ε 3
1
q
= −1
2 2 log(4ε + 2) + 48 t + 8Ctrim
1 − 2ε
√
! !
32e2 q −1
+ 2+ 2 log(4ε + 2) + 48 + 32e nσ 2
3
ε=0.05 √
≤ (56t + 8.09Ctrim + 417) nσ 2 .

41
With Lemma A.2, we obtain a tail bound
√
P(∥X ′ ∥op ≥ (56t + 8.09Ctrim + 417) nσ 2 )
≤ P(E1 ∪ E2 )
Lemma A.2
≤ exp(−t log(e2 n2 )) + exp(−tCε n)
≤ 2 exp(−t(Cε n ∧ 2 log en)).

To further simplify the probability bound, we notice that with an inequality 2 log en =
2(1 + log n) ≤ 2elog n = 2n ≤ Cε n, we have Cε n ∧ 2 log en = 2 log en. ■

B Proof of Lemma B.1

Lemma B.1 provides a specialized norm bound needed in a small part of an argument in
Appendix C. The proof is based on a chaining argument developed in [12].

Lemma B.1. Let X ∈ Rn×m be a random matrix with independent centered entries satisfying
the MGF bound with variance proxy σ 2 (Assumption 2.1) and let A ∈ Rn×m be a deterministic
matrix with uniformly bounded entries Aij ≤ σ 2 . Then
√
2 n

P ∥AX T ∥F ≥ 34(1 + t)nσ 2 mσ 2 ∨ ≤ , t ≥ 1.
3 (en)t
√
Especially, if mσ 2 ≳ 1, then ∥AX T ∥F ≲ nσ 2 mσ 2 with high probability.

Proof. Discretization. For a vector ∥u∥ ≤ 1, define sets Sk = {i | 2−k−1 < |ui | ≤ 2−k },
k = 0, . . . , K − 1, SK = {i | |ui | ≤ 2−K } of size sk = |Sk | and values ak = 2−k ≥ max{|ui | |
i ∈ Sk }. Then

a2k 2
sk a2k = u2i ≤ 4,
X X
u ≤4
2 i
k < K,
i∈Sk ui i∈Sk

sK a2K ≤ n4−K .

n
Choose K = ⌊ log
log 4
⌋ so that also sK a2K ≤ n4−K ≤ 4 and n/4 ≤ 4K ≤ n. This implies
sk ≤ 4a−2
k = 4
k+1
for all k and

K K X K−1
sk a2k = a2k ≤ sK a2K + 4 u2i ≤ 4 + 4 ≤ 8.
X X X X

k=0 k=0 i∈Sk k=0 i∈Sk

42
Now for any vector x ∈ Rn , we have
X XX X X X
xi ui = xi ui ≤ ak max S xi v i = ak x(Sk ).
i k i∈Sk k v∈{−1,1} k
i∈Sk k
| {z }
=x(Sk )

Since ∥x∥ = sup∥u∥≤1

P
i xi ui , we have
X
∥x∥ ≤ max ak x(Sk ) (B.1)
S0 ,...,SK
k

where S0 , . . . , SK ⊂ [n] are assumed to satisfy k sk a2k ≤ 8, sk a2k ≤ 4.

Bounding (AX T )i: (S). For any nonempty subset S ⊂ [n], we have

(AX T )i: (S) =

XX
max Aij Xi′ j ui′ ,
u∈{−1,+1}S j i′ ∈S

P P
where j i′ ∈S Aij Xi′ j ui′ is a sum of independent random variables. By Bernstein’s inequal-
ity (D.3), we have a tail bound
 
 
n n
n X m
n o X X Aij ts 
(AX T )i: (S) ≥
[ [ X X X
P t|S|  ≤ P
 vi′2
X ′
ij ≥ 
2

i=1 ∅̸=S⊂[n] i=1 s=1 S∈([n]) v∈{−1,+1}S
′ σ
i ∈S j=1 | {z }
σ
s
∈[−1,1]
n
! !!
2 2
X n s (ts /σ ) 3(ts /σ 2 )
≤n 2 exp − ∧
s=1 s 4msσ 2 4
n
!!
X 2en t2s 3ts
≤n exp s log − 6
∧ 2 .
s=1 s 4msσ 4σ

43
2en
4σ 2 s log t2s
q
Now ts = 4ms2 σ 6 log 2en
s
∨ 3
s
satisfies s log 2en
s
= 4msσ 6
∧ 3ts
4σ 2
so that
 
n n
!!
[ [ n
T
o X 2en t2s 3ts
P (AX )i: (S) ≥ (1 + t)ts  ≤n exp s log − (1 + t) 6
∧ 2
i=1 ∅̸=S⊂[n] s=1 s 4msσ 4σ
n
2en
X
=n exp −ts log
s=1 s
n ts
s
X
=n
s=1 2en
1 t 2 2t
(∗) !
≤n + (n − 1)
2en 2en
t
1 1+t
!
t≥1 1 1

≤ n + (n − 1)
2 en en
t t !
1 1 1 1
≤n +
2 en e en
n
≤ ,
(en)t
s
s
where inequality (∗) holds, because the function s 7→ 2en is decreasing (the derivative of
s s s
its logarithm s log 2en is negative log 2en + 1 = log 2n , s ≤ n).
Norm bound. Define a rare event
n n o
(AX T )i: (S) ≥ (1 + t)t|S| .
[ [
E=
i=1 ∅̸=S⊂[n]

The aim is to bound the norms of the row vectors ∥(AX T )i: ∥ under a high-probability event
E c . When E c holds, inequality (B.1) gives
v
u n
∥AX T ∥F =
uX
t ∥(AX T )
i: ∥
2
i=1
√
ak (AX T )i: (Sk )
X
≤ n max max
i S0 ,...,SK
k
√ X
≤ n max (1 + t) ak tsk .
S0 ,...,SK
k

44
2en
q 4σ 2 sk log
Evaluating tsk = 4ms2k σ 6 log 2en
sk
∨ 3
sk
gives an upper bound
 s  
2 2en
T
√ X
2 6 2en 4σ sk log sk  
∥AX ∥F = (1 + t) n max   4msk σ log ∨ ak
S0 ,...,SK
k sk 3
√ 2 X √
!
2 2en

≤ 2(1 + t) nσ max mσ ∨ 2 ak sk log
S0 ,...,SK
k 3 sk
√
√ 2

2en
= 2(1 + t) nσ 2
X
mσ 2 ∨ max ak sk log .
3 S0 ,...,SK k sk

The above sum is estimated by recognizing it as an inner product ⟨x, y⟩ = k sk xk yk , namely

now Cauchy–Schwarz inequality and subadditivity of the square root yield

v v
K uK uK
2en 2en
log2
X uX uX
2t
sk ak log ≤ t s a
k k s k
k=0 sk k=0 k=0 sk
v
√ u K
4e2 n
log2
uX
≤ 8t s k
k=0 sk
v
(∗) √ u X
uK
4e2 n
≤ 8t 4k+1 log2
k=0 4k+1
√ X K
e2 n
≤ 32 2k log k
k=0 4
√ K K
!
2 k k
X X
= 32 log(e n) 2 − log(4) k2 ,
k=0 k=0

2
where the inequality (∗) follows from increasingness of a function s 7→ s log2 4es n as it has a
2 2 2
nonnegative derivative log2 4es n − 2 log 4es n = (log 4es n ) log 4n
s
≥ 0 for s ≤ 4n. With a help
of Wolfram, the last term is equal to
√
32 log(e2 n)(2K+1 − 1) − log(4)2(2K K − 2K + 1) .

45
Hence, we conclude that
K
2en √
≤ 2K+1 32 log(e2 n) − log(4)(K − 1)
X
sk ak log
k=0 sk
√ e2 n
= 2K+1 32 log K−1
4
√ 4e2 n
= 2K+1 32 log K
4
(∗∗) √ 4e2 n
≤ 2 32n log
√ n
2
= 2 32n log(4e ),
2
where the inequality (∗∗) follows from increasingness of a function s 7→ s log 4es2n as it has a
2 √
nonnegative derivative log 16es2 n − 2 = log 16n
s2
≥ 0 for s ≤ n.
Summary. Combining these results gives that the high-probability event E c implies
√
√ 2 √

∥AX T ∥F ≤ 2(1 + t) nσ 2 mσ 2 ∨ 2 32n log(4e2 )
3
√ 2 2
√ 2

= 16 2 log(4e )(1 + t)nσ mσ ∨
2

√
3
2

2
≤ 34(1 + t)nσ mσ 2 ∨ .
3
■

Notice that in previous lemma, if |Aij | = Var(Xij ) = σ 2 , then the variance of a single entry
of AX T is Var((AX T )ij ) = mσ 6 and the expected squared Frobenius norm is E∥AX T ∥2F =
n2 σ 4 mσ 2 , which makes the high-probability bound rate-optimal.

C Proof of Theorem 6.4

This appendix analyzes concentration of a product XX T , where the entries of the random
matrix X ∈ Rn×m are independent and centered. The focus is only on the off-diagonal
entries, because the diagonal and off-diagonal entries concentrate at different rates, as the
following simple lemma suggests.

Lemma C.1. Let X ∈ Rn×m , m ≥ n be a random matrix with independent centered

d
Bernoulli-distributed entries Xij = Ber(p) − p. Define a mask M = 11T − I ∈ Rn×n and
denote elementwise matrix product by ⊙ so that XX T = (XX T ) ⊙ I + (XX T ) ⊙ M . Then
q q
E∥(XX T ) ⊙ I∥2op ≥ SD((XX T )ii ) = mp(1 − p)(1 − 2p)2 ,
q q q
E∥(XX T ) ⊙ M ∥2op ≥ E∥((XX T ) ⊙ M )i: ∥2 = (n − 1)mp(1 − p).

46
Proof. The first inequality follows from E∥(XX T )⊙I∥2op = E maxi |(XX T )ii |2 ≥ E(XX T )2ii ≥
Var((XX T )ii ) = mVar(Xij2 ), where the variance follows from an elementary calculation

Var(Xij2 ) = EXij4 − (EXij2 )2

= p(1 − p)4 + (1 − p)(−p)4 − p2 (1 − p)2
= p(1 − p)((1 − p)3 + p3 − p + p2 )
= p(1 − p)(1 − 4p + 4p2 )
= p(1 − p)(1 − 2p)2 .

For the second statement, first notice that the matrix (XX T ) ⊙ M is centered, namely the
diagonal entries are zero and for i ̸= i′ , we have

E((XX T ) ⊙ M )ii′ = E Xij XjiT ′ =

X X X
EXij Xi′ j = EXij EXi′ j = 0.
j j j

Therefore,

E∥(XX T ) ⊙ M ∥2op ≥ E∥((XX T ) ⊙ M )i: ∥2

((XX T ) ⊙ M )2ii′
X
=E
i′

Var((XX T )ii′ )
X
=
i′ :i′ ̸=i
 
X X
= Var  Xij Xi′ j 
i′ :i′ ̸=i j
X X
= Var (Xij Xi′ j )
i′ :i′ ̸=i j

E(Xij2 Xi2′ j )
X X
=
i′ :i′ ̸=i j

EXij2 EXi2′ j
X X
=
i′ :i′ ̸=i j

= (n − 1)mp2 (1 − p)2 .

This suggests that XX T − EXX T consists of two parts, a diagonal and an off-diagonal
√
part. For small p ≪ 1, we expect the norm of the diagonal to be of order mp and the
√
norm of the off-diagonal to be of order nmp. If this holds, then the diagonal dominates
√ √
the off-diagonal, when mp ≫ nmp, or equivalently, p ≪ n−1 . When we consider square
matrices m = n, we rarely consider p ≪ n−1 , but when m ≫ n, we may be interested in the
regime p ≪ n−1 and this phenomenon will become significant. The rest of this subsection
√
aims at proving that the off-diagonal part is bounded from above by O( nmp), or in a

47
√
slightly more general setting, O( nmσ 2 ).
Lemma C.2 bounds a product X1 X2T of two independent matrices X1 , X2 with indepen-
dent entries. The key observation is that the entries of X1 X2T are conditionally independent
entries given X2 , when X2 is sufficiently sparse. This allows us to apply Theorem 6.3. Lemma
C.3 applies a well-known decoupling technique (see for example Chapter 6 in [39]) to analyze
a product (X − EX)X T as if X and X T were independent. The decoupling technique applies
to expectations of convex functions of (X − EX)X T which is then converted to a tail bound
in the proof. Theorem 6.4 finally bounds off-diagonal part of the Gram matrix XX T . The
desired norm bound is obtained by removing certain rows and columns, and Lemma 6.5
asserts that only a few rows and columns are removed. The bound presented here is derived
with a simple application of Markov’s inequality, and consequently the bound is worse than
the bound in Lemma A.1.

Lemma C.2. Let X1 ∈ Rn1 ×m , X2 ∈ Zn2 ×m be independent random matrices with indepen-
dent entries, n = n1 + n2 ≤ m. Assume that X1 is centered and its entries satisfy the MGF
bound with variance proxy σ 2 (Assumption 2.1) and E|(X1 )ij | ≤ σ 2 . For the other matrix, as-
sume that the entries of X2 satisfy the MGF bound with variance proxy σ 2 and E|(X2 )ij | ≤ σ 2 .
Furthermore, assume that 8 log en ≤ mσ 2 , nσ 2 ≤ 3−1 e−2 imposing m ≥ 24e2 n log en. Define
a product matrix Y = X1 X2T ∈ Rn1 ×n2 and its trimmed version Y ′ as

j ′ ,k |(X1 )ik (X2 )j ′ k | ∨ i′ ,k |(X1 )i′ k (X2 )jk | ≤ Ctrim nmσ 4 ,
P P
Yij ,

Yij′ =
0, otherwise.

(i) Then
 √  √ !
t tnmσ 4 √ √
4
+ 2n− t/2 6
X
P  |(X1 )il (X2 )jl | ≥ tnmσ ≤ √ exp − √
j,l 6 40 6

for all
24 log2 m
t ≥ 24 ∨ .
log2 (1/3enσ 2 )

(ii) There exists an absolute constant C such that

√ 1/3 /4
P ∥Y ′ ∥op ≥ C (t + Ctrim ) nmσ 2 ≤ 8n−1 e−t

for all
512 log3 m
t ≥ 53/2 ∨ .
log3 (1/3enσ 2 )
In particular, if nσ 2 ≲ m−c for some arbitrarily small constant c, then ∥Y ′ ∥op ≲
√ √
nmσ 2 with high probability as n → ∞, and if nσ 2 ≲ 1, then ∥Y ′ ∥op ≲ log3 (m) nmσ 2

48
with high probability as n → ∞.
P∞ (k)
Proof. Denote n = n1 + n2 . Decompose the integer matrix as X2 = k=1 X2 , where
(k)
matrices X2 are defined recursively by

k−1
!!
(k) X X (k′ )
X2 = arg min X2 − X + X2 , k≥1
X∈{0,±1}n2 ×m :maxj ∥X:j ∥1 ≤1 ij k′ =1 ij

with arbitrary tie breaks as there might not be unique minimizers. As an example of such a
decomposition, consider the following small integer matrix

3 −2 1
 
1 −1 1
 
1 −1 0
  
1 0 0
0 0 −1 = 0 0 0 + 0 0 −1 + 0 0 0 .
       

0 1 0 0 0 0 0 0 0 0 1 0

That is, the components are matrices with entries being in {−1, 0, 1}, each column having
at most one nonzero entry and the k:th component picks “the k:th one of each column” if it
(k)
exists. Since each column has at most one nonzero entry, the supports of the rows of X2
are disjoint.
Notice that the summands are eventually zero matrices as each entry of X2 is finite
almost surely. Let Kmax denote the number of nonzero components (which is random as X2
is random). By Lemma D.1:(v), we have
2 (eλ −1−λ) 2 (eλ −1−λ)
Eeλ(|X2 (i,j)−EX2 (i,j)|−E|X2 (i,j)−EX2 (i,j)|) ≤ e2σ ≤ e3σ for all λ ≥ 0

and consequently
P 2 (eλ −1−λ)
Eeλ i
(|X2 (i,j)−EX2 (i,j)|−E|X2 (i,j)−EX2 (i,j)|)
≤ Ee3n2 σ for all λ ≥ 0.

49
Now for any kmax ≥ 3n2 σ 2 , Bennett’s inequality (D.1) gives a tail bound

P (Kmax ≥ kmax )
 (n )
m X
[ 2

= P |X2 (i, j)| ≥ kmax 

j=1 i=1
m n
!
X X2

≤ P |X2 (i, j) − EX2 (i, j)| + |EX2 (i, j)| ≥ kmax

The resulting upper bound holds also for 0 < kmax ≤ 3n2 σ 2 as the upper bound becomes
trivial.
(k)
Conditioned on X2 , the product Y (k) = X1 (X2 )T consists of independent entries (since
(k)
the rows of X2 have disjoint supports) and by Lemma D.1:(iii), they satisfy the moment
condition

(k) (k)
λYij λ(X1 (X2 )T )ij
E e |X2 = E e |X2

(k)
≤ exp ∥(X2 )j: ∥1 σ 2 (e|λ| − 1 − |λ|)

2 |λ|
≤ exp max
′
∥(X2 )j ′ : ∥1 σ (e − 1 − |λ|)
j

for all λ ∈ R. |(X1 )il | consists of at most

P
Conditioned on X2 , the sum (k)
l:∃j:(X2 )jl ̸=0
(k) (k)
j,l |(X2 )jl | ≤ n maxj ∥(X2 )j: ∥1 independent random variables satisfying Eeλ(|(X1 )il |−E|(X1 )il |) ≤
P

2σ 2 (eλ −1−λ)
e for all λ ≥ 0 (Lemma D.1:(v)). Hence,
   
X
E exp λ |(X1 )il | − E|(X1 )il | | X2 
   
(k)
l:∃j:(X2 )jl ̸=0

2 λ
≤ exp 2n max
′
∥(X2 ) ∥1 σ (e − 1 − λ) j′:
j

50
for all λ ≥ 0. The next objective is to bound maxj ′ ∥(X2 )j ′ : ∥1 .
For any t ≥ 1, assumptions E|(X2 )ij | ≤ σ 2 and 8 log en ≤ mσ 2 and Lemma A.1 give a
tail bound

P max∥(X2 )j: ∥1 ≥ 5tmσ 2
j

≤ P max∥(X2 − EX2 )j: ∥1 + ∥(EX2 )j: ∥1 ≥ 5tmσ 2
j

≤ P max∥(X2 − EX2 )j: ∥1 ≥ 4tmσ 2
j
 
[ q
≤ P ∥(X2 − EX2 )j: ∥1 ≥ E∥(X2 − EX2 )j: ∥1 + (1 + t) 8mσ 2 log en 
j
(A.1)
≤ (en)−t .

That is, under a high-probability event i {∥(X2 )j: ∥1 < 5tmσ 2 }, we have the moment condi-
T

tions
(k)
E e λYij
| X2 ≤ exp 5tmσ 4 (e|λ| − 1 − |λ|)

for all λ ∈ R and

   

|(X1 )il | − E|(X1 )il | | X2  ≤ exp 10tnmσ 4 (eλ − 1 − λ)
X
E exp λ
   
(k)
l:∃j:(X2 )jl ̸=0

for all λ ≥ 0.

51
(i) For any t ≥ 1, Bernstein’s inequality (D.3) gives
 

≥ 6kmax tnmσ 4 
X
P  |(X1 )il (X2 )jl |
j,l
!
X X (k) 4 2
≤ P |(X1 )il (X2 )jl | ≥ 6tnmσ , max∥(X2 )i: ∥1 < 5tmσ , Kmax < kmax
i
k≤kmax j,l

+ P max∥(X2 )i: ∥1 ≥ 5tmσ 2 + P (Kmax ≥ kmax )
i
!
4 2
X X
≤ P |(X1 )il | ≥ 6tnmσ , max∥(X2 )i: ∥1 < 5tmσ
i
k≤kmax (k)
j,l:(X2 )jl ̸=0
!kmax
−t 3eσ 2
+ (en) +m
kmax
" " # !
|(X1 )il | | X2 + tnmσ 4 | X2
X X X
≤ E P |(X1 )il | ≥ E
k≤kmax (k)
l:∃j:(X2 )jl ̸=0
(k)
l:∃j:(X2 )jl ̸=0
# !kmax
3enσ 2

−t
× I max∥(X2 )i: ∥1 < 5tmσ 2
+ (en) +m n−kmax
i kmax
!! !k
(D.3) t2 n2 m2 σ 8 3tnmσ 4 −t 3enσ 2 max −kmax
≤ kmax exp − ∧ + (en) + m n
4 · 10tnmσ 4 4 kmax
! !
tnmσ 4 −t kmax
= kmax exp − + (en) + exp log m − kmax log n−kmax .
40 3enσ 2

2 log m t 2
By choosing kmax = ⌊t⌋ ≥ 2 ∨ log(1/3enσ 2 ) so that 2 ≤ t − 1 ≤ kmax ≤ t and t ≥ 6enσ

(recall the assumption nσ 2 ≤ 3−1 e−2 ≤ (3e)−1 ), we obtain

P ∥Y:1 ∥1 ≥ 6t2 nmσ 4
!
tnmσ 4 t t

≤ t exp − + (en)−t + exp log m − log n−t/2
40 2 6enσ 2
!
tnmσ 4 t 1

−t
≤ t exp − + (en) + exp log m − log 2
n−t/2
40 2 3enσ
!
tnmσ 4
≤ t exp − + 2n−t/2
40
√ √
√
!
6t2 6t2 nmσ 4 2
√
= √ exp − √ + 2n− 6t /2 6 .
6 40 6
2
log m
The claim follows by writing the assumption on t as 6t2 ≥ 24 ∨ log24
2
(1/3enσ 2 )
.
(ii) Notice that the trimming constant in Theorem 6.3 differs from Ctrim . Namely,
(k) (k)
the trimming constant in the theorem has to be chosen to satisfy ∥Yi: ∥1 ∨ ∥Y:j ∥1 ≤
′
Ctrim n(5tmσ 4 ) for indices i, j that are not masked out. Comparing this to trimming with

By Theorem 6.3, there exists a constant C such that for all t ≥ 5, we have a tail bound

Ctrim √
! !
′
P ∥Y ∥op ≥ Ckmax t + √ nmσ 2
5t
√
! !
C trim
≤ P ∥Y ′ ∥op ≥ Ckmax t + √ nmσ 2 , Kmax ≤ kmax , max∥(X2 )i: ∥1 ≤ 5tmσ 2
5t i

+ P (Kmax ≥ kmax ) + P max∥(X2 )i: ∥1 ≥ 5tmσ 2
i
   s   
k[
max 
t Ctrim  q 
≤ EP  ∥(Y (k) )′ ∥op ≥C + n(5tmσ 4 ) | X2 

k=1 5 5t 
( ) !kmax
3en2 σ 2
2
+ (en)−t
\
×I {∥(X2 )i: ∥1 < 5tmσ } +m
i kmax
√ 3en2 σ 2
!kmax
−2
≤ 2kmax (en) t/5
+m + (en)−t
kmax
√ 3en2 σ 2
!kmax
≤ 3kmax (en)−2 t/5
+m .
kmax
√ √ 8 log m 8 log m
√
t
√ √
By choosing kmax = ⌊ t⌋ ≥ 5∨ log(1/3enσ 2 ) ≥ 2∨ log(1/3enσ 2 ) so that 2 ≤ t−1 ≤ kmax ≤ t
√
and t ≥ 6e2 nσ 2 (recall the assumption nσ 2 ≤ 3−1 e−2 ), we obtain

Ctrim √
! !
P ∥Y ′ ∥op ≥ C t3/2 + √ nmσ 2
5
√ !√t/2
√ 6enσ 2
≤ 3 t(en)−2 t/5 + m √
t
√ √
s
√ t −2 t/5 √
t 1
√
t
≤ 3 5n−2 t/5 e + elog m− 4 log 3enσ2 − 4
5
√ −2 −√t/5 √
≤ 3 5n e + m−1 e− t/4
√
3 5 + 1 −√t/4
≤ e
n
8 3/2 1/3
≤ e−(t ) /4 .
n

53
83 log3 m
The claim follows by writing the assumption on t as t3/2 ≥ 53/2 ∨ log3 (1/3enσ 2 )
. ■

Lemma C.3 (Decoupling). Let X ∈ Zn×m be a random matrix with independent integer-
valued entries that satisfy Assumption 2.1 with variance proxy σ 2 and E|Xij | ≤ σ 2 . Fur-
thermore, assume that 4 log en ≤ mσ 2 , nσ 2 ≤ 6−1 e−2 , imposing m ≥ 24e2 n log en. Define
a mask M = 11T − I ∈ Rn×n , product matrix Y = ((X − EX)X T ) ⊙ M ∈ Rn×n , trimming
mask N ∈ Rn×n as
 
X 
|(Xi′ k − EXi′ k )Xjk |Mi′ j ≤ Ctrim nmσ 4
X
Nij = I |(Xik − EXik )Xj ′ k |Mij ′ ∨
j ′ ,k i′ ,k
 

and a trimmed product Y ′ = Y ⊙ N . Then there exists an absolute constant C such that
√ 1/3
P ∥Y ′ ∥op ≥ C(t + Ctrim ) nmσ 2 ≤ Cn−1 e−t

for all
log3 m
!
t≥C 1∨ .
log3 (1/6enσ 2 )
√
In particular, if nσ 2 ≲ m−c for some arbitrarily small constant c, then ∥Y ′ ∥op ≲ nmσ 2
√
with high probability as n → ∞, and if nσ 2 ≲ 1, then ∥Y ′ ∥op ≲ log3 (m) nmσ 2 with high
probability as n → ∞.

Proof. Let δi = Ber(1/2), i = 1 . . . n be independent Bernoulli random variables and define

disjoint random index sets I = {i | δi = 1} and I c = {i | δi = 0}. Define a random submatrix
XI: with the convention (XI: )ij = Xij I{i ∈ I} = δi Xij so that XI: and X have the same
dimensions. Define also XI c : similarly. For any i ̸= j, we have elementwise equalities

((X − EX)X T )ij = 4EI δi (1 − δj )((X − EX)X T )ij

X
= 4EI δi (X − EX)ik (1 − δj )Xjk
k
X
= 4EI (XI: − EXI: )ik (XI c : )jk
k

= 4EI ((XI: − EXI: )(XI c : )T )ij .

On the diagonal, we have ((XI: − EXI: )(XI c : )T )ii = 0, that is, ((X − EX)X T ) ⊙ M =
4EI (XI: − EXI: )(XI c : )T . Since operator norm is a convex function in the space of matrices,
Jensen’s inequality gives

∥((X − EX)X T ) ⊙ M ⊙ N ∥op ≤ 4EI ∥(XI: − EXI: )(XI c : )T ∥op .

Similarly, for any convex and nondecreasing function f : R≥0 → R≥0 , Jensen’s inequality

54
gives

Ef ∥((X − EX)X T ) ⊙ M ⊙ N ∥op = Ef ∥4EI ((XI: − EXI: )XITc : ) ⊙ N ∥op

≤ EI Ef 4∥((XI: − EXI: )XITc : ) ⊙ N ∥op .

The interchange of the order of the expectations E and EI is justified most evidently when
EI = 2−n I⊂[n] is written as a finite sum. The gain of this argument is the independence of
P

the submatrices XI: and XI c : . Namely, now Lemma C.2 can be applied to the product (XI: −
EXI: )XITc : with variance proxy 2σ 2 (the number 2 is needed to have E|(XI: − EXI: )ij | ≤ 2σ 2 ).
′
Consequently, the trimming constant in Lemma C.2 becomes Ctrim = Ctrim /2. Define Z =
′ 3

∥Y ∥ ′ 3/2 512 log m
√
8C nmσ 2
− Ctrim − 5 ∨ log3 (1/6enσ2 ) , where C is the universal constant from Lemma
+
p
C.2:(ii). Since f (x) = (x − 1)+ is a nondecreasing convex function f : R≥0 → R≥0 for p ≥ 1,
by Lemma C.2, the absolute moments of Z can be estimated from above as

EZ p
!!p
4∥((XI: − EXI: )XITc : ) ⊙ N ∥ ′ 3/2 512 log3 m
≤ EI E √ − C trim − 5 ∨
8C nmσ 2 log3 (1/6enσ 2 ) +
3
∥((XI: − EXI: )XITc : )
! !
Z ∞
⊙ N∥ 512 log m ′
= EI ptp−1 P √ ≥ 53/2 ∨ + t + Ctrim dt
0 C nm2σ 2 log3 (1/3en2σ 2 )
Lemma C.2
Z ∞
−1 1/3 /4
≤ 8n ptp−1 e−t dt
Z0 ∞
= 8n−1 p43(p−1) u3(p−1) e−u 3 · 43 u2 du
0
−1 3p
= 8n 4 3pΓ(3p)
= 8n−1 43p Γ(3p + 1)

for any p ≥ 1. For 1/3 ≤ p ≤ 1, this argument does not work as x 7→ xp is not convex. Now
for any |λ| < 4−1 , geometric summation formula gives
2 ∞
λk Z k/3 λk
!
λZ 1/3
EZ k/3
X X
E e − =
k=0 k! k=3 k!
∞
λk k
≤ 8n−1
X
4 Γ(k + 1)
k=3 k!
∞
= 8n−1 (4λ)k
X

k=3
(4λ)3
= 8n−1 .
1 − 4λ

55
By Markov’s inequality, this gives a tail bound
∞ ∞
λk k/3 X λk k/3
!
X
P(Z ≥ t) ≤ P Z ≥ t
k=3 k! k=3 k!
∞
!−1
−1 (4λ)3 X λk k/3
≤ 8n t .
1 − 4λ k=3 k!
k
To simplify the upper bound, let us study the relation between the series ∞ x P
k=3 k! and the
exponential function ex . Denote the Taylor polynomials of the exponential function by
i
pk (x) = ki=0 xi! . Since the derivatives of the polynomials satisfy p′k+1 = pk , for any x ≥ 0
P

and k ≥ 1, we have

d ex ex pk (x) − ex p′k (x) (pk (x) − pk−1 (x))ex xk ex

= = = ≥ 0.
dx pk (x) pk (x)2 pk (x)2 k!pk (x)2
pk (x0 ) x
This implies that ex /pk (x) ≥ ex0 /pk (x0 ) or pk (x) ≤ e x0
e , whenever x ≥ x0 ≥ 0. Therefore,
for any t ≥ t0 , we have
!−1  −1
∞ 1/3
λk k/3 λt1/3 λt1/3 p2 (λt0 ) λt1/3
− p2 (λt1/3 ))−1 ≤ e
X
t = (e − 1/3 e 
k=3 k! eλt0
 −1
1/3
p2 (λt0 )  −λt1/3
= 1 − 1/3 e .
eλt0

Now
λ=8−1 , !−1
t0 =λ−3 p2 (1) 1/3 /8
−3 −1
P(Z ≥ t) ≤ 16 · 2 n 1− e−t
e
1/3 /8
≤ 25n−1 e−t , t ≥ 83 = 512.

The claim follows now from

512 log3 m √
( ! ! )
{Z ≥ t} = ∥Y ′ ∥ ≥ 8C 53/2 ∨ 3 2
′
+ t + Ctrim nmσ 2
log (1/6enσ )
√
⊃ {∥Y ′ ∥ ≥ 8C(2t + Ctrim
′
) nmσ 2 }
√
⊃ {∥Y ′ ∥ ≥ 16C(t + Ctrim ) nmσ 2 },

512 log3 m
when t ≥ 53/2 ∨ log3 (1/6enσ 2 )
. ■

Proof of Theorem 6.4. Let t ≥ 0. Since E|Xij − EXij | ≤ 2E|Xij | ≤ 2σ 2 and mσ 2 ≥ 8 log en,

56
applying Lemma A.1 gives
n q !
2
≤ (en)−t .
[
P ∥(X − EX)i: ∥1 ≥ 2mσ + (1 + t) 8mσ 2 log en
i=1

≤ Ctrim nmσ 4 + |(EXik )(X − EX)jk | + nmσ 4

j,k

≤ (1 + Ctrim )nmσ 4 + nσ 2 max∥(X − EX)j: ∥1

j
 s 
8 log en 
≤ 3 + Ctrim + (1 + t) 2
nmσ 4 .
mσ

≤ Ctrim nmσ 4 + |(EXik )(X − EX)jk | + nmσ 4

i,k

≤ (1 + Ctrim )nmσ 4 + nσ 2 max∥(X − EX)j: ∥1

j
 s 
8 log en 
≤ 3 + Ctrim + (1 + t) 2
nmσ 4 .
mσ

In the following calculation, constant C is allowed to vary from line to line. By Lemma
1/3
C.3 and Lemma B.1, with probability of at least 1 − Cn−1 e−t − (en)−t − n(en)−t ≥

57
1/3
1 − Cn−1 e−t , triangle inequality gives

∥((XX T ) ⊙ M − EXEX T ) ⊙ N ∥op

≤ ∥((X − EX)X T ) ⊙ M ⊙ N ∥op + ∥((EX)(X − EX)T ) ⊙ M ⊙ N ∥op
+ ∥(EXEX T ) ⊙ I ⊙ N ∥op
 s 
Lemma C.3 8 log en  √
≤ C t + 3 + Ctrim + (1 + t) nmσ 2
mσ 2
+ ∥((EX)(X − EX)T ) ⊙ M ∥F + mσ 4
√
≤ C(t + Ctrim ) nmσ 2 + ∥(EX)(X − EX)T ∥F + mσ 4
Lemma B.1 √ √
≤ C(t + Ctrim ) nmσ 2 + Ctnσ 2 mσ 2 + mσ 4
nσ 2 ≤6−1 e−2 √
≤ C(t + Ctrim ) nmσ 2 + mσ 4 .

Proof of Lemma 6.5. Let t ≥ 0. Since E|Xij − EXij | ≤ 2E|Xij | ≤ 2σ 2 and mσ 2 ≥ 8 log en,
applying Lemma A.1 gives
n n
!
o
2 2
[
P ∥(X − EX)i: ∥1 ≥ 2mσ + (1 + t)mσ
i=1
n q !
2
[
≤P ∥(X − EX)i: ∥1 ≥ 2mσ + (1 + t) 8mσ 2 log en
i=1
−t
≤ (en) .
Tn
Under a high-probability event i=1 {∥(X − EX)i: ∥1 < (3 + t)mσ 2 }, we have

X n X
X m
|(EXik )(X − EX) |M j′k ij ′ ≤ |EXik ||Xjk − EXjk |
j ′ ,k j=1 k=1

≤ nσ 2 max∥(X − EX)j: ∥1
j

≤ (3 + t)nmσ 4 .

24 log2 m
Suppose t ≥ 24 ∨ log2 (1/6enσ 2 )
. By triangle inequality and Lemma C.2:(i) (with 2σ 2 ), we

≤ 4tnmσ 4 + (3 + t)nmσ 4 + nmσ 4

= (4 + 5t)nmσ 4

with probability of at least

√ √ √
√
! !
−t t tnmσ 4 √ √
− t/2 6 tnmσ 4 √ √
1 − (en) − √ exp − √ − 2n ≥ 1 − t exp − √ − 3n− t/2 6 .
6 10 6 10 6

The claim follows from Markov’s inequality. ■

D Bennett’s inequality
This subsection studies the concentration of a sum of independent random variables satis-
fying Assumption 2.1. Lemma D.1 shows ways to construct random variables satisfying the
assumption and Lemma D.2 collects well-known tail bounds for such random variables.
Lemma D.1.
(i) If X ∈ [0, 1] almost surely, then X satisfies Condition (2.1) with variance proxy EX.

(ii) If X ∈ [−1, 1] almost surely and EX = 0, then X satisfies Condition (2.1) with variance
proxy EX 2 .

(iii) If X1 , X2 are independent random variables satisfying Condition (2.1) with variance
proxies σ12 , σ22 respectively and a1 , a2 ∈ [−1, 1], then X = a1 X1 + a2 X2 satisfies Condi-
tion (2.1) with variance proxy a21 σ12 + a22 σ22 .

(iv) If X is a centered random variable satisfying Condition (2.1) with variance proxy σ 2
and ξ is a random variable independent of X satisfying |ξ| ≤ 1 almost surely, then their
product ξX satisfies Condition (2.1) with variance proxy σ 2 .

(v) If X satisfies Condition (2.1) with variance proxy σ 2 , then |X − EX| satisfies the
condition with variance proxy 2σ 2 at least for λ ≥ 0, that is,
2 (eλ −1−λ)
Eeλ(|X−EX|−E|X−EX|) ≤ e2σ for all λ ≥ 0.

(vi) If a random variable X has a binomial or a Poisson distribution, then it satisfies

Condition (2.1) with variance proxy EX.

59
Proof of Lemma D.1. (i) Let X ∈ [0, 1] be a bounded nonnegative random variable. By
convexity of the exponential function (linear interpolation of convex function t 7→ et
between points (0, e0 ) and (λ, eλ )), we have

Eeλ(X−EX) = e−λEX Ee(1−X)·0+X·λ ≤ e−λEX E((1 − X)e0 + Xeλ )

= e−λEX (1 − EX + EXeλ ) = e−λEX (1 + EX(eλ − 1))
(∗) λ −1) λ −1−λ)
≤ e−λEX eEX(e = eEX(e .

Here the second inequality (*) follows from the well-known inequality 1 + x ≤ ex . Since
λk P∞ |λ|k
eλ − 1 − λ = ∞ k=2 k! ≤
|λ|
k=2 k! = e − 1 − |λ|, random variable X satisfies Condition
P

(2.1) with variance proxy EX.

(ii) If X ∈ [−1, 1] is a bounded centered random variable, then

∞ ∞
λk EX k |λ|k EX 2
EeλX = 1 +
X X
≤1+
k=2 k! k=2 k!
2 (e|λ| −1−|λ|)
= 1 + EX 2 (e|λ| − 1 − |λ|) ≤ eEX .

(iii) If X1 , X2 are independent and satisfy Condition (2.1) with variance proxies σ12 and σ22
respectively and |a1 |, |a2 | ≤ 1, then

Eeλ((a1 X1 +a2 X2 )−E(a1 X1 +a2 X2 )) = Eea1 λ(X1 −EX1 ) Eea2 λ(X2 −EX2 )
2 |a1 λ| −1−|a λ|) 2 |a2 λ| −1−|a λ|)
≤ eσ1 (e 1
eσ2 (e 2

(∗) 2 2 |λ| −1−|λ|) 2 2 |λ| −1−|λ|)

≤ ea1 σ1 (e ea2 σ2 (e
2 2 2 2 |λ| −1−|λ|)
= e(a1 σ1 +a2 σ2 )(e ,

where inequality (*) follows from

∞ ∞
|ai λ|k |ai |≤1 X a2i |λ|k
e|ai λ| − 1 − |ai λ| = = a2i (e|λ| − 1 − |λ|).
X
≤
k=2 k! k=2 k!

This shows that a1 X1 + a2 X2 satisfies Condition (2.1) with variance proxy a21 σ12 + a22 σ22 .

2 (e|λξ| −1−|λξ|) 2 (e|λ| −1−|λ|)

EeλξX = EE(eλξX | ξ) ≤ Eeσ ≤ eσ .

(v) If X is a centered random variable satisfying Condition (2.1) with variance proxy σ 2 ,

60
then for any λ ≥ 0, we have
 

Eeλ(|X|−E|X|) = e−λE|X|  λ|X|

 
1 + λE|X| +E(e − 1 − λ|X|)

| {z }
≤eλE|X|

≤ 1 + |e−λE|X|
{z } E (e
λX
− 1 − λX) ∨ (e−λX − 1 + λX)
| {z } | {z }
≤1 ≥0 ≥0

≤ 1 + 1 · E(eλX − 1 − λX) + (e−λX − 1 + λX)

2 (eλ −1−λ) 2 (eλ −1−λ)
≤ −1 + EeλX + Ee−λX ≤ −1 + 2eσ ≤ e2σ .

(vi) By part (i), the statement holds for Bernoulli-variables. By part (iii), the statement
holds for binomially distributed variables as they can be represented as sums of in-
dependent and identically distributed Bernoulli-variables. If X is Poisson-distributed,
then
∞
(EX)k −EX λk
Eeλ(X−EX) = e−λEX EeλX = e−λEX
X
e e
k=0 k!
∞
−(1+λ)EX (EXeλ )k λ
= e−(1+λ)EX eEXe
X
=e
k=0 k!
λ −1−λ) |λ| −1−|λ|)
= eEX(e ≤ eEX(e .

In this context, a concentration inequality known as Bernstein’s inequality can be for-

mulated as follows.
2 (eλ −1−λ)
Lemma D.2. Let X be a random variable satisfying EeλX ≤ eσ for all λ ≥ 0. Then
for all t ≥ 0, its tail probability has upper bounds
!σ2 +t
eσ 2
−σ 2
P (X ≥ t) ≤ e (D.1)
σ2 + t
!
t2 /2
≤ exp − 2 (D.2)
σ + t/3
!!
t2 3t
≤ exp − ∧ . (D.3)
4σ 2 4

Inequality (D.1) is known as Bennett’s inequality [4] and inequalities (D.2) and (D.3) are
known as Bernstein’s inequalities. The general method of using moment generating functions
to derive the probability tail bounds is often attributed to Chernoff [8] or Cramér [11].

61
Proof of Lemma D.2. Markov’s inequality gives

P(X ≥ t) = inf P(eλX ≥ eλt ) ≤ inf e−λt EeλX ≤ exp inf −λt + σ 2 (eλ − 1 − λ) .
λ≥0 λ≥0 λ≥0

Bennett’s inequality (D.1) follows from optimizing the infimum by setting λ = log(1 + t/σ 2 ),
namely now

2 λ
P(X ≥ t) ≤ exp inf −λt + σ (e − 1 − λ)
λ≥0
−t
t t t

λ=log(1+t/σ 2 ) 2
= 1+ 2 exp σ 1 + 2 − 1 − log 1 + 2
σ σ σ
−σ2 −t !σ2 +t !σ2 +t
t σ2 eσ 2

t t −σ 2
= 1+ 2 e = e =e .
σ σ2 + t σ2 + t

Bernstein’s inequality (D.2) follows from

∞ ∞ ∞
λk λk λ2 X λk λ2 /2
eλ − 1 − λ =
X X
≤ k−2
= = 0 ≤ λ < 3,
k=2 k! k=2 2 · 3 2 k=0 3k 1 − λ/3

namely now
!σ2 +t
eσ 2

−σ 2 2 λ
e = exp inf −λt + σ (e − 1 − λ)
σ2 + t λ≥0
!!
σ 2 λ2 /2
≤ exp inf −λt +
λ≥0 1 − λ/3

and the tail bound can be derived in similar spirit (Exercise 2.8.5 and Exercise 2.8.6. in
[39]). Inequality (D.3) simply follows from
! ! !!
t2 /2 t2 /2 t2 3t
exp − 2 ≤ exp − = exp − ∧ .
σ + t/3 (2σ 2 ) ∨ (2t/3) 4σ 2 4

To apply the most practical but weakest form of Bernstein’s inequality, the following
simple lemma is sometimes useful.

Lemma D.3. If f, g : R≥0 → R≥0 are strictly increasing bijections, then f ∧ g is a strictly
increasing bijection with an inverse given by (f ∧ g)−1 = f −1 ∨ g −1 .

Proof. If 0 ≤ x1 < x2 , then f (x1 ) < f (x2 ) and g(x1 ) < g(x2 ) giving f (x1 ) ∧ g(x1 ) ≤ f (x1 ) <
f (x2 ) and f (x1 ) ∧ g(x1 ) ≤ g(x1 ) < g(x2 ), that is, f (x1 ) ∧ g(x1 ) < f (x2 ) ∧ g(x2 ) and f ∧ g is
strictly increasing.

62
Suppose f (x) ∧ g(x) = y. Then either f (x) = y or g(x) = y and hence f −1 (y) = x or
g −1 (y) = x. If f (x) ≥ g(x) = y, then f −1 (y) ≤ x = g −1 (y) (since f is increasing) and hence
x = f −1 (y) ∨ g −1 (y). Similarly, if y = f (x) ≤ g(x), then g −1 (y) ≤ x = f −1 (y) (since g is
increasing) and hence x = f −1 (y) ∨ g −1 (y). This shows that x = f −1 (y) ∨ g −1 (y). ■

In Bernstein’s inequality (D.3), choose f (t) = t2 /4σ 2 and g(t) = 3t/4. To find a specific
t for which the probability bound attains some given value e−s , solve s = f (t) ∧ g(t) by
√ 4s
t = f −1 (s) ∨ g −1 (s) = 4σ 2 s ∨ .
3
That is, solve both s = f (t) and s = g(t) and then choose the maximum of the solutions.
In Lemma D.2, Bernstein’s inequalities show that the probability tends to zero, if t ≫
σ ∨1, which might not be easily visible from Bennett’s inequality, although it is the sharpest.
Bennett’s inequality does show, however, that the probability tends to zero, if t ≫ σ 2 . This
conclusion is beneficial, if σ 2 is particularly small. In that case, the first inequality may be
estimated further to a more practical form
!t
eσ 2 t

P(X ≥ t) ≤ = exp −t log 2 .
t eσ

This is justified at least when eσ 2 ≤ σ 2 + t, but in the case eσ 2 ≥ σ 2 + t ≥ t this further

approximation gives a trivial upper bound for the probability, because probabilities are
always between 0 and 1. Notice, that when t ≫ σ 2 , the tail probability decreases more
rapidly than given by Bernstein’s inequalities (D.2) and (D.3).

Xie 等 - 2023 - An Efficient Spectral Clustering Algorithm Based o
No ratings yet
Xie 等 - 2023 - An Efficient Spectral Clustering Algorithm Based o
11 pages
Spectral Approach (BU)
No ratings yet
Spectral Approach (BU)
2 pages
Spectral Clustering Survey
No ratings yet
Spectral Clustering Survey
12 pages
Path-Based Spectral Clustering: Guarantees, Robustness To Outliers, and Fast Algorithms
No ratings yet
Path-Based Spectral Clustering: Guarantees, Robustness To Outliers, and Fast Algorithms
66 pages
The Latest Research Progress On Spectral Clustering
No ratings yet
The Latest Research Progress On Spectral Clustering
10 pages
High-Dimensional Clustering and Signal Recovery Under Block Signals
No ratings yet
High-Dimensional Clustering and Signal Recovery Under Block Signals
35 pages
Spectral Clustering 2
No ratings yet
Spectral Clustering 2
39 pages
2092 On Spectral Clustering Analysis and An Algorithm
No ratings yet
2092 On Spectral Clustering Analysis and An Algorithm
8 pages
Dynamic Tensor Clustering
No ratings yet
Dynamic Tensor Clustering
46 pages
Semi-Supervised Spectral Clustering Using Shared Nearest Neighbor For Data With Different Shape and Density
No ratings yet
Semi-Supervised Spectral Clustering Using Shared Nearest Neighbor For Data With Different Shape and Density
8 pages
澳大利亚悉尼科技大学利用质量与距离峰值快速自主聚类，开发出Torque Clustering算法，实现无参数化高效聚类
No ratings yet
澳大利亚悉尼科技大学利用质量与距离峰值快速自主聚类，开发出Torque Clustering算法，实现无参数化高效聚类
14 pages
ML Assignment 2
No ratings yet
ML Assignment 2
6 pages
Adaptive Fuzzy C-Means With Graph Embedding
No ratings yet
Adaptive Fuzzy C-Means With Graph Embedding
12 pages
Outer-Points Shaver-Robust Graph-Based Clustering Via Node Cutting
No ratings yet
Outer-Points Shaver-Robust Graph-Based Clustering Via Node Cutting
13 pages
Clustering
No ratings yet
Clustering
82 pages
Approximation Algorithms For Tensor Clustering
No ratings yet
Approximation Algorithms For Tensor Clustering
15 pages
I Jcs It 2015060141
No ratings yet
I Jcs It 2015060141
5 pages
Paper-2 Clustering Algorithms in Data Mining A Review
No ratings yet
Paper-2 Clustering Algorithms in Data Mining A Review
7 pages
ML Unit - IV
No ratings yet
ML Unit - IV
56 pages
Model-Based Tensor Low-Rank Clustering
No ratings yet
Model-Based Tensor Low-Rank Clustering
12 pages
19 Ejs1533
No ratings yet
19 Ejs1533
32 pages
A Comprehensive Survey On Deep Clustering - Taxonomy, Challenges, and Future Directions
No ratings yet
A Comprehensive Survey On Deep Clustering - Taxonomy, Challenges, and Future Directions
35 pages
Handbook of Cluster Analysis: C. Hennig, M. Meila, F. Murtagh, R. Rocci (Eds.)
No ratings yet
Handbook of Cluster Analysis: C. Hennig, M. Meila, F. Murtagh, R. Rocci (Eds.)
28 pages
AAAI24-Enhancing Ensemble Clustering With Adaptive High-Order Topological Weights
No ratings yet
AAAI24-Enhancing Ensemble Clustering With Adaptive High-Order Topological Weights
9 pages
Learning Spectral Clustering
No ratings yet
Learning Spectral Clustering
8 pages
UNIT5
No ratings yet
UNIT5
60 pages
Hypergraph Convolutional Neural Network-Based Clustering Technique
No ratings yet
Hypergraph Convolutional Neural Network-Based Clustering Technique
9 pages
Machine Learning
No ratings yet
Machine Learning
15 pages
On Clustering Binary Data: Tao Li Shenghuo Zhu
No ratings yet
On Clustering Binary Data: Tao Li Shenghuo Zhu
5 pages
Spectral Algorithms
No ratings yet
Spectral Algorithms
110 pages
n25 PDF
No ratings yet
n25 PDF
8 pages
SPINEX-Clustering: Similarity-Based Predictions With Explainable Neighbors Exploration For Clustering Problems
No ratings yet
SPINEX-Clustering: Similarity-Based Predictions With Explainable Neighbors Exploration For Clustering Problems
54 pages
Matrix Clustering for Data Scientists
No ratings yet
Matrix Clustering for Data Scientists
37 pages
(2008) A Survey of Kernel and Spectral Methods For Clustering
No ratings yet
(2008) A Survey of Kernel and Spectral Methods For Clustering
38 pages
Scalable Fuzzy Clustering Method
No ratings yet
Scalable Fuzzy Clustering Method
12 pages
Spectral Clustering: X Through The Parameter W 0. The Resulting
No ratings yet
Spectral Clustering: X Through The Parameter W 0. The Resulting
7 pages
Lecture 12 - Unsupervised Learning - Shoould Be Marged
No ratings yet
Lecture 12 - Unsupervised Learning - Shoould Be Marged
31 pages
Math 118: Mathematical Methods of Data Theory: Lecture 9: Graphs and Spectral Clustering
No ratings yet
Math 118: Mathematical Methods of Data Theory: Lecture 9: Graphs and Spectral Clustering
11 pages
Chapter 3 Unsupervised Learning
No ratings yet
Chapter 3 Unsupervised Learning
45 pages
Luxburg07 Tutorial 4488
No ratings yet
Luxburg07 Tutorial 4488
32 pages
Multiview Fuzzy Clustering via Anchor Graph
No ratings yet
Multiview Fuzzy Clustering via Anchor Graph
12 pages
Giu 2719 65 22376 2025-02-17T23 42 29
No ratings yet
Giu 2719 65 22376 2025-02-17T23 42 29
37 pages
Sem232 LA CC07 Group08
No ratings yet
Sem232 LA CC07 Group08
23 pages
Best-4 - Topological Gradient Based Competitive Learning
No ratings yet
Best-4 - Topological Gradient Based Competitive Learning
12 pages
Clustering Examples
No ratings yet
Clustering Examples
47 pages
Clustering Basics
No ratings yet
Clustering Basics
39 pages
5 - Clustering
No ratings yet
5 - Clustering
13 pages
Clustering
No ratings yet
Clustering
75 pages
Full Text 01
No ratings yet
Full Text 01
77 pages
Unsupervised Learning & Clustering
No ratings yet
Unsupervised Learning & Clustering
102 pages
K Medoids
No ratings yet
K Medoids
101 pages
Laude 2018 Discrete Continuous
No ratings yet
Laude 2018 Discrete Continuous
13 pages
Community Detection
No ratings yet
Community Detection
9 pages
Mathematics of Signals, Networks, and Learning
No ratings yet
Mathematics of Signals, Networks, and Learning
68 pages
Data Mining Cluster Analysis: Basic Concepts and Algorithms
No ratings yet
Data Mining Cluster Analysis: Basic Concepts and Algorithms
26 pages
Dual Autoencoder Clustering
No ratings yet
Dual Autoencoder Clustering
10 pages
A Theoretical Analysis of Density Peaks Clustering and The Component-Wise Peak-Finding Algorithm
No ratings yet
A Theoretical Analysis of Density Peaks Clustering and The Component-Wise Peak-Finding Algorithm
12 pages
Differentiable Deep Clustering With Cluster Size Constraints
No ratings yet
Differentiable Deep Clustering With Cluster Size Constraints
8 pages
Knowledge Management Framework - Accenture
No ratings yet
Knowledge Management Framework - Accenture
21 pages
Power Point Chapter - 10
No ratings yet
Power Point Chapter - 10
4 pages
Ethtre
No ratings yet
Ethtre
5 pages
Pastore Drycleaners Has Capacity To Clean Up
No ratings yet
Pastore Drycleaners Has Capacity To Clean Up
4 pages
8500 - Eng Locatelli
No ratings yet
8500 - Eng Locatelli
285 pages
Conspiracy Theories & Crime Families
100% (3)
Conspiracy Theories & Crime Families
45 pages
2024 GKS-U Application Forms
No ratings yet
2024 GKS-U Application Forms
16 pages
CE-5542 4542 Syllabus EarthqkEng F2019
No ratings yet
CE-5542 4542 Syllabus EarthqkEng F2019
4 pages
QSK38,50 MCRS Fuel System Pres
No ratings yet
QSK38,50 MCRS Fuel System Pres
66 pages
KC Tools Aug-Oct 2014 Catalogue
No ratings yet
KC Tools Aug-Oct 2014 Catalogue
5 pages
Cable Structures for Architecture Students
100% (1)
Cable Structures for Architecture Students
13 pages
Service: Passat (NMS - US) 2012
0% (1)
Service: Passat (NMS - US) 2012
9 pages
Đáp Án Bu I 20
No ratings yet
Đáp Án Bu I 20
4 pages
PL/SQL Control Structures Guide
100% (1)
PL/SQL Control Structures Guide
50 pages
Chen Et Al 2024 The Promise and Challenges of Inverted Perovskite Solar Cells
No ratings yet
Chen Et Al 2024 The Promise and Challenges of Inverted Perovskite Solar Cells
78 pages
John Rockefeller Quotes
0% (1)
John Rockefeller Quotes
41 pages
Plant Layout Design Seminar Evaluation
No ratings yet
Plant Layout Design Seminar Evaluation
17 pages
Catalogo Yim S.A.C. 2021
100% (1)
Catalogo Yim S.A.C. 2021
170 pages
Big Data Testing: Strategies & Steps
No ratings yet
Big Data Testing: Strategies & Steps
1 page
Vatsal Changoiwala - Bajaj Auto Ltd.
No ratings yet
Vatsal Changoiwala - Bajaj Auto Ltd.
12 pages
Sub-System Design 5
100% (1)
Sub-System Design 5
177 pages
Answers To Cost Accounting Chapter 8
57% (14)
Answers To Cost Accounting Chapter 8
4 pages
Creating Introduction, Scope and Limitation, and Significant of The Study
No ratings yet
Creating Introduction, Scope and Limitation, and Significant of The Study
22 pages
10 400 KV PANTO Isolator R2 Jan12
No ratings yet
10 400 KV PANTO Isolator R2 Jan12
39 pages
Fluid Mechanics-I Lab Manual
No ratings yet
Fluid Mechanics-I Lab Manual
28 pages
Calibration Report for Engineers
No ratings yet
Calibration Report for Engineers
1 page
The Application of The JORC Code
No ratings yet
The Application of The JORC Code
3 pages
Skills Audit M2 - D2 Reflective Essay Template Co
No ratings yet
Skills Audit M2 - D2 Reflective Essay Template Co
8 pages
Business Credit Application Form
No ratings yet
Business Credit Application Form
2 pages
FHM Singapore - September 2014
No ratings yet
FHM Singapore - September 2014
124 pages