
Article

Computational Information Geometry for Binary Classification of High-Dimensional Random Tensors †

1 Laboratory of Signals and Systems (L2S), Department of Signals and Statistics, University of Paris-Sud, 91400 Orsay, France
2 Computer Science Department LIX, École Polytechnique, 91120 Palaiseau, France
3 Sony Computer Science Laboratories, Tokyo 141-0022, Japan
* Author to whom correspondence should be addressed.
† The results presented in this work have been partially published in the 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017 and the 2017 25th European Association for Signal Processing (EUSIPCO), Kos, Greece, 28 August–2 September 2017.
Entropy 2018, 20(3), 203; https://doi.org/10.3390/e20030203
Submission received: 25 January 2018 / Revised: 13 March 2018 / Accepted: 14 March 2018 / Published: 17 March 2018

Abstract

Evaluating the performance of Bayesian classification in a high-dimensional random tensor is a fundamental problem, usually difficult and under-studied. In this work, we consider two Signal to Noise Ratio (SNR)-based binary classification problems of interest. Under the alternative hypothesis, i.e., for a non-zero SNR, the observed signals are either a noisy rank-$R$ tensor admitting a $Q$-order Canonical Polyadic Decomposition (CPD) with large factors of size $N_q \times R$, $1 \le q \le Q$, where $R, N_q \to \infty$ with $R^{1/Q}/N_q$ converging towards a finite constant, or a noisy tensor admitting a TucKer Decomposition (TKD) of multilinear $(M_1, \dots, M_Q)$-rank with large factors of size $N_q \times M_q$, $1 \le q \le Q$, where $N_q, M_q \to \infty$ with $M_q/N_q$ converging towards a finite constant. The classification of the random entries (coefficients) of the core tensor in the CPD/TKD is hard to study since the exact derivation of the minimal Bayes' error probability is mathematically intractable. To circumvent this difficulty, the Chernoff Upper Bound (CUB) for large SNR and the Fisher information at low SNR are derived and studied, based on information geometry theory. The tightest CUB is reached for the value minimizing the error exponent, denoted by $s^\star$. In general, due to the asymmetry of the $s$-divergence, the Bhattacharyya Upper Bound (BUB), i.e., the Chernoff information calculated at $s = 1/2$, cannot solve this problem effectively, and one has to rely on a costly numerical optimization strategy to find $s^\star$. However, thanks to powerful random matrix theory tools, a simple analytical expression of $s^\star$ is provided with respect to the SNR in the two schemes considered. This work shows that the BUB is the tightest bound at low SNR, but that this property no longer holds at higher SNR.

1. Introduction

1.1. State-of-the-Art and Problem Statement

Evaluating the performance limit for the “Gaussian information plus noise” binary classification problem is a challenging research topic, see for instance [1,2,3,4,5,6,7]. Given a binary hypothesis problem, the Bayes' decision rule is based on the principle of the largest posterior probability. Specifically, the Bayesian detector chooses the alternative hypothesis $\mathcal{H}_1$ if $\Pr(\mathcal{H}_1 | \mathbf{y}) > \Pr(\mathcal{H}_0 | \mathbf{y})$ for a given $N$-dimensional measurement vector $\mathbf{y}$, and the null hypothesis $\mathcal{H}_0$ otherwise. Consequently, the optimal decision rule can often only be derived at the price of a costly numerical computation of the log posterior-odds ratio [3], since an exact calculation of the minimal Bayes' error probability $P_e^{(N)}$ is often intractable [3,8]. To circumvent this problem, it is standard to exploit well-known bounds on $P_e^{(N)}$ based on information theory [9,10,11,12,13]. In particular, the Chernoff information [14,15] characterizes, asymptotically in $N$, the exponential rate of decay of $P_e^{(N)}$. The Chernoff information turns out to be very useful in many practically important problems, for instance distributed sparse detection [16], sparse support recovery [17], energy detection [18], multi-input and multi-output (MIMO) radar processing [19,20], network secrecy [21], angular resolution limits in array processing [22], and detection performance for informed communication systems [23], just to name a few. In addition, the Chernoff information bound is tightest when the $s$-divergence is minimized over the parameter $s \in (0,1)$. Generally, this step requires solving an optimization problem numerically [24] and often leads to a complicated and uninformative expression for the optimal value of $s$. To circumvent this difficulty, the simplified choice $s = 1/2$ is often used, corresponding to the well-known Bhattacharyya divergence [13], at the price of a less accurate prediction of $P_e^{(N)}$. In information geometry, the parameter $s$ is often called $\alpha$, and the $s$-divergence is the so-called Chernoff $\alpha$-divergence [24].
Tensor decomposition theory is a timely and prominent research topic [25,26]. When confronting the problem of extracting useful information from a massive and multidimensional volume of measurements, tensors have proved extremely relevant. In the standard literature, two main families of tensor decomposition are prominent, namely the Canonical Polyadic Decomposition (CPD) [26] and the Tucker decomposition (TKD)/HOSVD (High-Order SVD) [27,28]. These approaches are two possible multilinear generalizations of the Singular Value Decomposition (SVD). The CPD is a natural generalization to tensors of the usual concept of matrix rank. The tensorial/canonical rank of a $P$-order tensor is the minimal positive integer, say $R$, of unit-rank tensors that must be summed up for perfect recovery, a unit-rank tensor being the outer product of $P$ vectors. In addition, the CPD has remarkable uniqueness properties [26] and involves only a reduced number of free parameters due to the constraint of minimality on $R$. Unfortunately, unlike the matrix case, the set of tensors with fixed (tensorial) rank is not closed [29,30]. This singularity implies that the computation of the CPD is a mathematically ill-posed problem. As a consequence, its numerical computation remains non-trivial and is usually done using suboptimal iterative algorithms [31]. Note that this problem can sometimes be avoided by exploiting natural hidden structures in the physical model [32]. The TKD [28] and the HOSVD [27] are two popular decompositions that serve as alternatives to the CPD. In this setting, an alternative definition of rank is required, since the tensorial rank based on the CPD is no longer appropriate. The standard definition of the multilinear rank is the set of positive integers $\{R_1, \dots, R_P\}$ where each integer $R_p$ is the usual rank of the $p$-th mode unfolding. Following the Eckart-Young theorem at each mode level [33], this construction is non-iterative, optimal at each mode level, and practical; it is also well suited to real-time [34] or adaptive [35] computation. However, in general, the low (multilinear) rank tensor obtained by this procedure is suboptimal [27]. More precisely, for tensors of order strictly greater than two, a generalization of the Eckart-Young theorem does not exist.
The classification performance of a multilinear tensor following the CPD or the TKD can be derived and studied. It is interesting to note that classification theory for tensors is very much under-studied. To the best of our knowledge, only the publication [36] tackles this problem, in the context of radar multidimensional data detection. A major difference with that publication is that its analysis is based on the performance of a low-rank detector after matched filtering.
More precisely, we consider two cases where the observations are either (1) a noisy rank-$R$ tensor admitting a $Q$-order CPD with large factors of size $N_q \times R$, $1 \le q \le Q$, where $R, N_q \to \infty$ with $R^{1/Q}/N_q$ converging towards a finite constant, or (2) a noisy tensor admitting a TKD of multilinear $(M_1, \dots, M_Q)$-rank with large factors of size $N_q \times M_q$, $1 \le q \le Q$, where $N_q, M_q \to \infty$ with $M_q/N_q$ converging towards a finite constant. A standard approach for zero-mean independent Gaussian core and noise tensors is to define the Signal to Noise Ratio by $\mathrm{SNR} = \sigma_s^2/\sigma^2$, where $\sigma_s^2$ and $\sigma^2$ are the variances of the entries of the vectorized core and noise tensors, respectively. The binary classification can then be described in the following way:
Under the null hypothesis $\mathcal{H}_0$, $\mathrm{SNR} = 0$, meaning that the observed tensor contains only noise. Conversely, the alternative hypothesis $\mathcal{H}_1$ corresponds to $\mathrm{SNR} \neq 0$, meaning that a multilinear signal of interest is present. There is a lack of contributions dealing with classification performance for tensors: since the exact derivation of the error probability is intractable, the performance of the classification of the core tensor random entries is hard to evaluate. To circumvent this difficulty, based on computational information geometry, we consider the Chernoff Upper Bound (CUB) and the Fisher information in the context of massive measurement vectors. The error exponent is minimized at $s^\star$, which corresponds to the tightest reachable CUB. In general, due to the asymmetry of the $s$-divergence, the Bhattacharyya Upper Bound (BUB), i.e., the Chernoff information calculated at $s = 1/2$, cannot solve this problem effectively. As a consequence, one has to rely on a costly numerical optimization strategy to find $s^\star$. However, thanks to the so-called Random Matrix Theory (RMT), we provide simple analytical expressions of $s^\star$ for the different Signal to Noise Ratio (SNR) regimes. For low SNR, analytical expressions of the Fisher information are given. Note that the analysis of the Fisher information in the RMT context has only been studied in recent contributions [37,38,39], for parameter estimation. For larger SNR, simple analytic expressions of the CUB for the CPD and the TKD are provided.
Random Matrix Theory (RMT) has attracted both mathematicians and physicists since random matrices were first introduced in mathematical statistics by Wishart in 1928 [40]. The subject started to gain prominence when Wigner [41] introduced the concept of statistical distribution of nuclear energy levels, but it took until 1955 before Wigner [42] introduced ensembles of random matrices. Since then, many important results in RMT have been developed and analyzed, see for instance [43,44,45,46] and the references therein. Over the last two decades, research on RMT has been published constantly.
Finally, let us underline that many arguments of this paper differ from the works presented in [47,48]. In [47], we tackled the detection problem for matrix-type data using the Chernoff Upper Bound in the double asymptotic regime. In [48], we addressed the detection problem for tensor data by analyzing the Chernoff Upper Bound, assuming that the tensor follows the Canonical Polyadic Decomposition (CPD), and we gave some analysis of the Chernoff Upper Bound when the rank of the tensor is much smaller than its dimensions. Since [47,48] are conference papers, some proofs were omitted due to limited space. Therefore, this full paper shares with [47,48] the Information Geometry tools ($s$-divergence, Chernoff Upper Bound, Fisher information, etc.), but completes [48] in a more general asymptotic regime. Moreover, in this work we give new analyses in both scenarios (small and large SNR), which [48] did not, and we consider the important and difficult new tensor scenario of the Tucker decomposition. This is, in our view, the main difference, because the CPD is a particular case of the more general Tucker decomposition: in the CPD, the core tensor is assumed to be diagonal.

1.2. Paper Organisation

The organization of the paper is as follows. The second section introduces some definitions, the tensor models, and the Marchenko-Pastur distribution from random matrix theory. The third section is devoted to the Chernoff information for the binary hypothesis test. The fourth section gives the main results on the Fisher information and the Chernoff bound. Numerical simulation results are given in the fifth section. We conclude our work and give some perspectives in Section 6. Finally, the proofs of the paper can be found in the appendices.

2. Algebra of Tensors and Random Matrix Theory (RMT)

In this section, we introduce some useful definitions from tensor algebra and from the spectral theory of large random matrices.

2.1. Multilinear Functions

2.1.1. Preliminary Definitions

Definition 1.
The Kronecker product of matrices $\mathbf{X}$ and $\mathbf{Y}$, of size $I \times J$ and $K \times N$ respectively, is given by
$$\mathbf{X} \otimes \mathbf{Y} = \begin{bmatrix} [\mathbf{X}]_{11}\mathbf{Y} & \cdots & [\mathbf{X}]_{1J}\mathbf{Y} \\ \vdots & \ddots & \vdots \\ [\mathbf{X}]_{I1}\mathbf{Y} & \cdots & [\mathbf{X}]_{IJ}\mathbf{Y} \end{bmatrix} \in \mathbb{R}^{(IK)\times(JN)}.$$
We have $\mathrm{rank}\{\mathbf{X} \otimes \mathbf{Y}\} = \mathrm{rank}\{\mathbf{X}\} \times \mathrm{rank}\{\mathbf{Y}\}$.
Definition 2.
The vectorization $\mathrm{vec}(\mathcal{X})$ of a tensor $\mathcal{X} \in \mathbb{R}^{M_1 \times \cdots \times M_Q}$ is a vector $\mathbf{x} \in \mathbb{R}^{M_1 M_2 \cdots M_Q}$ defined as
$$[\mathbf{x}]_h = [\mathcal{X}]_{m_1, \dots, m_Q}$$
where $h = m_1 + \sum_{k=2}^{Q} (m_k - 1)\, M_1 M_2 \cdots M_{k-1}$.
Definition 3.
The $q$-mode product, denoted by $\times_q$, between a tensor $\mathcal{X} \in \mathbb{R}^{M_1 \times \cdots \times M_Q}$ and a matrix $\mathbf{U} \in \mathbb{R}^{K \times M_q}$ is the tensor $\mathcal{X} \times_q \mathbf{U} \in \mathbb{R}^{M_1 \times \cdots \times M_{q-1} \times K \times M_{q+1} \times \cdots \times M_Q}$ with
$$[\mathcal{X} \times_q \mathbf{U}]_{m_1, \dots, m_{q-1}, k, m_{q+1}, \dots, m_Q} = \sum_{m_q = 1}^{M_q} [\mathcal{X}]_{m_1, \dots, m_Q}\, [\mathbf{U}]_{k, m_q}$$
where $1 \le k \le K$.
Definition 4.
The $q$-mode unfolding matrix $\mathbf{X}_{(q)} = \mathrm{unfold}_q(\mathcal{X})$, of size $M_q \times \prod_{k=1, k \neq q}^{Q} M_k$, of a tensor $\mathcal{X} \in \mathbb{R}^{M_1 \times \cdots \times M_Q}$ is defined according to
$$[\mathbf{X}_{(q)}]_{m_q, h} = [\mathcal{X}]_{m_1, \dots, m_Q}$$
where $h = 1 + \sum_{k=1, k \neq q}^{Q} (m_k - 1) \prod_{v=1, v \neq q}^{k-1} M_v$.
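As a quick illustration of Definitions 2-4, the following minimal NumPy sketch implements the vectorization, the $q$-mode product and the $q$-mode unfolding; the helper names, the sizes and the column-major index convention are our own illustrative choices, not taken from the paper.
```python
import numpy as np

def vec(X):
    # Definition 2: stack the entries of X with the first index varying fastest.
    return X.reshape(-1, order='F')

def mode_product(X, U, q):
    # Definition 3: contract the q-th mode of X (0-based q here) with the rows of U.
    Xq = np.moveaxis(X, q, 0)                       # bring mode q to the front
    shp = Xq.shape
    Y = U @ Xq.reshape(shp[0], -1)                  # multiply along that mode
    return np.moveaxis(Y.reshape((U.shape[0],) + shp[1:]), 0, q)

def unfold(X, q):
    # Definition 4: q-mode unfolding of size M_q x prod_{k != q} M_k.
    return np.moveaxis(X, q, 0).reshape(X.shape[q], -1, order='F')

if __name__ == "__main__":
    X = np.random.randn(3, 4, 5)
    U = np.random.randn(6, 4)
    print(vec(X).shape, mode_product(X, U, 1).shape, unfold(X, 2).shape)
```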

2.1.2. Canonical Polyadic Decomposition (CPD)

The rank-$R$ CPD of a $Q$-order tensor is defined according to
$$\mathcal{X} = \sum_{r=1}^{R} s_r \, \underbrace{\boldsymbol{\phi}_r^{(1)} \circ \cdots \circ \boldsymbol{\phi}_r^{(Q)}}_{\mathcal{X}_r} \quad \mathrm{with} \quad \mathrm{rank}\{\mathcal{X}_r\} = 1$$
where $\circ$ is the outer product [25], $\boldsymbol{\phi}_r^{(q)} \in \mathbb{R}^{N_q \times 1}$ and $s_r$ is a real scalar.
An equivalent formulation using the $q$-mode product defined in Definition 3 is
$$\mathcal{X} = \mathcal{S} \times_1 \boldsymbol{\Phi}^{(1)} \times_2 \cdots \times_Q \boldsymbol{\Phi}^{(Q)}$$
where $\mathcal{S}$ is the $R \times \cdots \times R$ diagonal core tensor with $[\mathcal{S}]_{r,\dots,r} = s_r$ and $\boldsymbol{\Phi}^{(q)} = [\boldsymbol{\phi}_1^{(q)} \cdots \boldsymbol{\phi}_R^{(q)}]$ is the $q$-th factor matrix, of size $N_q \times R$.
The $q$-mode unfolding matrix of tensor $\mathcal{X}$ is given by
$$\mathbf{X}_{(q)} = \boldsymbol{\Phi}^{(q)} \mathbf{S} \left( \boldsymbol{\Phi}^{(Q)} \odot \cdots \odot \boldsymbol{\Phi}^{(q+1)} \odot \boldsymbol{\Phi}^{(q-1)} \odot \cdots \odot \boldsymbol{\Phi}^{(1)} \right)^T$$
where $\mathbf{S} = \mathrm{diag}(\mathbf{s})$ with $\mathbf{s} = [s_1, \dots, s_R]^T$ and $\odot$ stands for the Khatri-Rao product [25].
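The short NumPy sketch below (all sizes and names are illustrative) builds a small rank-$R$ CPD tensor from random factors and checks the 1-mode unfolding formula $\mathbf{X}_{(1)} = \boldsymbol{\Phi}^{(1)} \mathbf{S} (\boldsymbol{\Phi}^{(3)} \odot \boldsymbol{\Phi}^{(2)})^T$, assuming a column-major unfolding convention.
```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product (Khatri-Rao): column r is kron(A[:, r], B[:, r]).
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

rng = np.random.default_rng(5)
N, R = (4, 5, 6), 3
factors = [rng.standard_normal((n, R)) for n in N]
s = rng.standard_normal(R)

# Sum of R unit-rank (outer product) terms weighted by s_r.
X = np.zeros(N)
for r in range(R):
    X += s[r] * np.multiply.outer(np.multiply.outer(factors[0][:, r],
                                                    factors[1][:, r]),
                                  factors[2][:, r])

X1 = factors[0] @ np.diag(s) @ khatri_rao(factors[2], factors[1]).T
print(np.allclose(X.reshape(N[0], -1, order='F'), X1))   # expected: True
```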

2.1.3. Tucker Decomposition (TKD)

The Tucker model of a $Q$-order tensor is defined according to
$$\mathcal{X} = \sum_{m_1=1}^{M_1} \sum_{m_2=1}^{M_2} \cdots \sum_{m_Q=1}^{M_Q} s_{m_1 m_2 \cdots m_Q} \, \boldsymbol{\phi}_{m_1}^{(1)} \circ \boldsymbol{\phi}_{m_2}^{(2)} \circ \cdots \circ \boldsymbol{\phi}_{m_Q}^{(Q)}$$
where $\boldsymbol{\phi}_{m_q}^{(q)} \in \mathbb{R}^{N_q \times 1}$, $q = 1, \dots, Q$, and $s_{m_1 m_2 \cdots m_Q}$ is a real scalar.
The $q$-mode product form of $\mathcal{X}$ is similar to the CPD case; however, the $q$-mode unfolding matrix of tensor $\mathcal{X}$ is slightly different:
$$\mathbf{X}_{(q)} = \boldsymbol{\Phi}^{(q)} \mathbf{S}_{(q)} \left( \boldsymbol{\Phi}^{(Q)} \otimes \cdots \otimes \boldsymbol{\Phi}^{(q+1)} \otimes \boldsymbol{\Phi}^{(q-1)} \otimes \cdots \otimes \boldsymbol{\Phi}^{(1)} \right)^T$$
where $\mathbf{S}_{(q)} \in \mathbb{R}^{M_q \times M_1 M_2 \cdots M_{q-1} M_{q+1} \cdots M_Q}$ is the $q$-mode unfolding matrix of the core tensor $\mathcal{S}$, $\boldsymbol{\Phi}^{(q)} = [\boldsymbol{\phi}_1^{(q)} \cdots \boldsymbol{\phi}_{M_q}^{(q)}] \in \mathbb{R}^{N_q \times M_q}$, and $\otimes$ stands for the Kronecker product. See Figure 1.
Following these definitions, we note that the CPD and TKD scenarios imply that the vector $\mathbf{x}$ in Equation (11) is related to a structured linear system with either $\boldsymbol{\Phi} = \boldsymbol{\Phi}^{(Q)} \odot \cdots \odot \boldsymbol{\Phi}^{(1)}$ (CPD, Khatri-Rao structure) or $\boldsymbol{\Phi} = \boldsymbol{\Phi}^{(Q)} \otimes \cdots \otimes \boldsymbol{\Phi}^{(1)}$ (TKD, Kronecker structure).
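The following NumPy sketch (dimensions are illustrative, not from the paper) builds a small Tucker tensor by successive $q$-mode products and checks the vectorized Kronecker relation used later, $\mathrm{vec}(\mathcal{X}) = (\boldsymbol{\Phi}^{(Q)} \otimes \cdots \otimes \boldsymbol{\Phi}^{(1)})\,\mathrm{vec}(\mathcal{S})$, with a column-major $\mathrm{vec}$.
```python
import numpy as np

Q = 3
M = (2, 3, 4)          # multilinear rank (core dimensions)
N = (5, 6, 7)          # observed dimensions
rng = np.random.default_rng(0)
S = rng.standard_normal(M)
Phis = [rng.standard_normal((N[q], M[q])) for q in range(Q)]

# Build X = S x_1 Phi^(1) x_2 Phi^(2) x_3 Phi^(3) (Definition 3).
X = S
for q, U in enumerate(Phis):
    X = np.moveaxis(np.tensordot(U, X, axes=(1, q)), 0, q)

vecX = X.reshape(-1, order='F')
Phi = Phis[2]
for U in (Phis[1], Phis[0]):
    Phi = np.kron(Phi, U)          # Phi = Phi^(3) ⊗ Phi^(2) ⊗ Phi^(1)
print(np.allclose(vecX, Phi @ S.reshape(-1, order='F')))   # expected: True
```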

2.2. The Marchenko-Pastur Distribution

The Marchenko-Pastur distribution was introduced half a century ago, in 1967 [45], and plays a key role in a number of high-dimensional signal processing problems. To help the reader, we introduce in this section some fundamental results concerning large empirical covariance matrices. Let $(\mathbf{v}_n)_{n=1,\dots,N}$ be a sequence of i.i.d. zero-mean Gaussian random $M$-dimensional vectors for which $\mathbb{E}(\mathbf{v}_n \mathbf{v}_n^T) = \sigma^2 \mathbf{I}_M$. We consider the empirical covariance matrix
$$\frac{1}{N} \sum_{n=1}^{N} \mathbf{v}_n \mathbf{v}_n^T$$
which can also be written as
$$\frac{1}{N} \sum_{n=1}^{N} \mathbf{v}_n \mathbf{v}_n^T = \mathbf{W}_N \mathbf{W}_N^T$$
where $\mathbf{W}_N = \frac{1}{\sqrt{N}} [\mathbf{v}_1, \dots, \mathbf{v}_N]$. $\mathbf{W}_N$ is thus a Gaussian matrix with independent identically distributed $\mathcal{N}(0, \frac{\sigma^2}{N})$ entries. When $N \to +\infty$ while $M$ remains fixed, the matrix $\mathbf{W}_N \mathbf{W}_N^T$ converges towards $\sigma^2 \mathbf{I}_M$ in the spectral norm sense. In the high-dimensional asymptotic regime defined by
$$M \to +\infty, \quad N \to +\infty, \quad c_N = \frac{M}{N} \to c > 0,$$
it is well understood that $\mathbf{W}_N \mathbf{W}_N^T - \sigma^2 \mathbf{I}_M$ does not converge towards $\mathbf{0}$. In particular, the empirical distribution $\hat{\nu}_N = \frac{1}{M} \sum_{m=1}^{M} \delta_{\hat{\lambda}_{m,N}}$ of the eigenvalues $\hat{\lambda}_{1,N} \ge \cdots \ge \hat{\lambda}_{M,N}$ of $\mathbf{W}_N \mathbf{W}_N^T$ does not converge towards the Dirac measure at point $\lambda = \sigma^2$. More precisely, we denote by $\nu_{c,\sigma^2}$ the Marchenko-Pastur distribution of parameters $(c, \sigma^2)$, defined as the probability measure
$$\nu_{c,\sigma^2}(d\lambda) = \delta_0 \left[ 1 - \frac{1}{c} \right]^{+} + \frac{\sqrt{(\lambda - \lambda^{-})(\lambda^{+} - \lambda)}}{2 \sigma^2 c \pi \lambda} \, \mathbb{1}_{[\lambda^{-}, \lambda^{+}]}(\lambda) \, d\lambda$$
with $\lambda^{-} = \sigma^2 (1 - \sqrt{c})^2$ and $\lambda^{+} = \sigma^2 (1 + \sqrt{c})^2$. Then, the following result holds.
Theorem 1 
([45]). The empirical eigenvalue distribution $\hat{\nu}_N$ converges weakly almost surely towards $\nu_{c,\sigma^2}$ when both $M$ and $N$ converge towards $+\infty$ in such a way that $c_N = \frac{M}{N}$ converges towards $c > 0$. Moreover, it holds that
$$\hat{\lambda}_{1,N} \to \sigma^2 (1 + \sqrt{c})^2 \quad \text{a.s.},$$
$$\hat{\lambda}_{\min(M,N),N} \to \sigma^2 (1 - \sqrt{c})^2 \quad \text{a.s.}$$
We also observe that Theorem 1 remains valid if $\mathbf{W}_N$ is not Gaussian, provided its i.i.d. elements have a finite fourth-order moment (see e.g., [43]). Theorem 1 means that when the ratio $\frac{M}{N}$ is not small enough, the eigenvalues of the empirical spatial covariance matrix of a temporally and spatially white noise tend to spread out around the noise variance, and that almost surely, for $N$ large enough, all the eigenvalues are located in a neighbourhood of the interval $[\lambda^{-}, \lambda^{+}]$. See Figure 2 and Figure 3.
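As a complement to Figures 2 and 3, here is a small NumPy/Matplotlib sketch (dimensions, seed and bin count are arbitrary choices) that draws one realization of the empirical eigenvalue histogram of $\mathbf{W}_N\mathbf{W}_N^T$ and overlays the Marchenko-Pastur density of parameters $(c, \sigma^2)$.
```python
import numpy as np
import matplotlib.pyplot as plt

M, N, sigma2 = 256, 1024, 1.0
c = M / N
W = np.random.randn(M, N) * np.sqrt(sigma2 / N)    # i.i.d. N(0, sigma^2/N) entries
eig = np.linalg.eigvalsh(W @ W.T)

lam_m = sigma2 * (1 - np.sqrt(c)) ** 2             # left edge  lambda^-
lam_p = sigma2 * (1 + np.sqrt(c)) ** 2             # right edge lambda^+
lam = np.linspace(lam_m + 1e-6, lam_p - 1e-6, 400)
density = np.sqrt((lam_p - lam) * (lam - lam_m)) / (2 * np.pi * sigma2 * c * lam)

plt.hist(eig, bins=40, density=True, alpha=0.5, label='empirical eigenvalues')
plt.plot(lam, density, 'r', label='Marchenko-Pastur density')
plt.legend(); plt.show()
```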

3. Classification in a Computational Information Geometry (CIG) Framework

3.1. Formulation Based on a SNR -Type Criterion

We denote $\mathrm{SNR} = \sigma_s^2/\sigma^2$ and $p_i(\cdot) = p(\cdot\,|\,\mathcal{H}_i)$ with $i\in\{0,1\}$. The binary classification of the random signal is based on the equi-probable binary hypothesis test
$$\mathcal{H}_0: p_0(\mathbf{y}_N;\boldsymbol{\Phi},\mathrm{SNR}=0) = \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_0), \qquad \mathcal{H}_1: p_1(\mathbf{y}_N;\boldsymbol{\Phi},\mathrm{SNR}\neq 0) = \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_1)$$
where $\boldsymbol{\Sigma}_0 = \sigma^2\mathbf{I}_N$ and $\boldsymbol{\Sigma}_1 = \sigma^2\big(\mathrm{SNR}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\mathbf{I}_N\big)$. The null hypothesis data-space ($\mathcal{H}_0$) is defined as $\mathcal{X}_0 = \mathcal{X}\setminus\mathcal{X}_1$, where
$$\mathcal{X}_1 = \left\{\mathbf{y}_N : \Lambda(\mathbf{y}_N) = \log\frac{p_1(\mathbf{y}_N)}{p_0(\mathbf{y}_N)} > \tau\right\}$$
is the alternative hypothesis ($\mathcal{H}_1$) data-space. Following the above expression, the log-likelihood ratio test statistic $\Lambda(\mathbf{y}_N)$ and the binary classification threshold $\tau$ are given by
$$\Lambda(\mathbf{y}_N) = \frac{\mathbf{y}_N^T\,\boldsymbol{\Phi}\big(\boldsymbol{\Phi}^T\boldsymbol{\Phi}+\mathrm{SNR}^{-1}\times\mathbf{I}\big)^{-1}\boldsymbol{\Phi}^T\,\mathbf{y}_N}{\sigma^2}, \qquad \tau = \log\det\big(\mathrm{SNR}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\mathbf{I}_N\big)$$
where $\det(\cdot)$ and $\log(\cdot)$ are the determinant and the natural logarithm, respectively.

3.2. The Expected Log-likelihood Ratio in Geometry Perspective

We note that the estimated hypothesis $\hat{\mathcal{H}}$ is associated to $p(\mathbf{y}_N\,|\,\hat{\mathcal{H}}) = \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})$. Therefore, the expected log-likelihood ratio is defined by
$$\mathbb{E}_{\mathbf{y}_N|\hat{\mathcal{H}}}\,\Lambda(\mathbf{y}_N) = \int_{\mathcal{X}} p(\mathbf{y}_N\,|\,\hat{\mathcal{H}})\log\frac{p_1(\mathbf{y}_N)}{p_0(\mathbf{y}_N)}\,d\mathbf{y}_N = \mathrm{KL}(\hat{\mathcal{H}}\,\|\,\mathcal{H}_0) - \mathrm{KL}(\hat{\mathcal{H}}\,\|\,\mathcal{H}_1) = \frac{1}{\sigma^2}\mathrm{Tr}\Big[\big(\boldsymbol{\Phi}^T\boldsymbol{\Phi}+\mathrm{SNR}^{-1}\mathbf{I}\big)^{-1}\boldsymbol{\Phi}^T\boldsymbol{\Sigma}\boldsymbol{\Phi}\Big]$$
where
$$\mathrm{KL}(\hat{\mathcal{H}}\,\|\,\mathcal{H}_i) = \int_{\mathcal{X}} p(\mathbf{y}_N\,|\,\hat{\mathcal{H}})\log\frac{p(\mathbf{y}_N\,|\,\hat{\mathcal{H}})}{p_i(\mathbf{y}_N)}\,d\mathbf{y}_N$$
is the Kullback-Leibler Divergence (KLD) [10]. The expected log-likelihood ratio test thus admits a simple geometric characterization as the difference of two KLDs [8]. However, it is often difficult to evaluate the performance of the test via the minimal Bayes' error probability $P_e^{(N)}$, since this probability cannot be determined analytically in closed form [3,8].
The minimal Bayes' error probability conditionally on the vector $\mathbf{y}_N$ is defined as
$$\Pr(\mathrm{Error}\,|\,\mathbf{y}_N) = \frac{1}{2}\min\{P_{1,0}, P_{0,1}\}$$
where $P_{i,j} = \Pr(\mathcal{H}_i\,|\,\mathbf{y}_N \in \mathcal{X}_j)$.

3.3. CUB

According to [24], the relation between the Chernoff Upper Bound and the (average) minimal Bayes' error probability $P_e^{(N)} = \mathbb{E}\big[\Pr(\mathrm{Error}\,|\,\mathbf{y}_N)\big]$ is given by
$$P_e^{(N)} \le \frac{1}{2}\exp\big[-\tilde{\mu}_N(s)\big]$$
where the (Chernoff) $s$-divergence for $s\in(0,1)$ is given by
$$\tilde{\mu}_N(s) = -\log M_{\Lambda(\mathbf{y}_N|\mathcal{H}_1)}(-s)$$
in which $M_X(t) = \mathbb{E}\exp[t\times X]$ is the moment generating function (mgf) of the variable $X$. The error exponent, denoted by $\tilde{\mu}(s)$, is given by the Chernoff information, which is an asymptotic characterization of the exponential decay of the minimal Bayes' error probability. The error exponent is derived thanks to Stein's lemma according to [13]:
$$\lim_{N\to\infty}\frac{-\log P_e^{(N)}}{N} = \lim_{N\to\infty}\frac{\tilde{\mu}_N(s^\star)}{N} \stackrel{\mathrm{def.}}{=} \tilde{\mu}(s^\star).$$
Since the parameter $s\in(0,1)$ is free, the CUB can be tightened by optimizing over this parameter:
$$s^\star = \arg\max_{s\in(0,1)}\tilde{\mu}(s).$$
Finally, using Equations (5) and (7), the Chernoff Upper Bound (CUB) is obtained. Instead of solving Equation (7), the Bhattacharyya Upper Bound (BUB) is obtained from Equation (5) by fixing $s = 1/2$. Therefore, we have the following ordering:
$$P_e^{(N)} \le \frac{1}{2}\exp\big[-\tilde{\mu}_N(s^\star)\big] \le \frac{1}{2}\exp\big[-\tilde{\mu}_N(1/2)\big].$$
Lemma 1.
The log-moment generating function of Equation (6), for the test of Equation (4), is given by
$$\tilde{\mu}_N(s) = -\frac{1-s}{2}\log\det\big(\mathrm{SNR}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\mathbf{I}\big) + \frac{1}{2}\log\det\big(\mathrm{SNR}\times(1-s)\,\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\mathbf{I}\big).$$
Proof. 
See Appendix A. ◻
From now on, to simplify the presentation and the numerical results later on, we denote by
$$\mu_N(s) = -\tilde{\mu}_N(s), \qquad \mu(s) = -\tilde{\mu}(s),$$
for all $s\in[0,1]$, the opposites of the $s$-divergence and of its limit.
Remark 1.
The functions $\mu_N(s)$ and $\mu(s)$ are negative, since the $s$-divergence $\tilde{\mu}_N(s)$ is positive for all $s\in[0,1]$.

3.4. Fisher Information

In the small deviation regime, we assume that $\delta_{\mathrm{SNR}}$ is a small deviation of the SNR. The new binary hypothesis test is
$$\mathcal{H}_0: \mathbf{y}\,|\,\delta_{\mathrm{SNR}}=0 \sim \mathcal{N}\big(\mathbf{0},\boldsymbol{\Sigma}(0)\big), \qquad \mathcal{H}_1: \mathbf{y}\,|\,\delta_{\mathrm{SNR}}\neq 0 \sim \mathcal{N}\big(\mathbf{0},\boldsymbol{\Sigma}(\delta_{\mathrm{SNR}})\big)$$
where $\boldsymbol{\Sigma}(x) = x\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\mathbf{I}$. The $s$-divergence in the small SNR deviation scenario is written as
$$\mu_N(s) = \frac{1-s}{2}\log\det\boldsymbol{\Sigma}(\delta_{\mathrm{SNR}}) - \frac{1}{2}\log\det\boldsymbol{\Sigma}\big(\delta_{\mathrm{SNR}}\times(1-s)\big).$$
Lemma 2.
The $s$-divergence in the small deviation regime can be approximated according to
$$\frac{\mu_N(s)}{N} \underset{\delta_{\mathrm{SNR}}\ll 1}{\approx} (s-1)\,s\times\frac{(\delta_{\mathrm{SNR}})^2}{2}\times\frac{J_F(0)}{N}$$
where the Fisher information [3] is given by
$$J_F(x) = \frac{1}{2}\mathrm{Tr}\Big(\big(\mathbf{I}+x\,\boldsymbol{\Phi}\boldsymbol{\Phi}^T\big)^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T\big(\mathbf{I}+x\,\boldsymbol{\Phi}\boldsymbol{\Phi}^T\big)^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T\Big).$$
Proof. 
See Appendix B. ◻
According to Lemma 2, the optimal $s$-value at low SNR is $s^\star \underset{\delta_{\mathrm{SNR}}\ll 1}{\approx} \frac{1}{2}$. On the contrary, the optimal $s$-value for larger SNR is given by the following lemma.
Lemma 3.
In the case of large SNR, we have
$$s^\star \underset{\mathrm{SNR}\gg 1}{\approx} 1 - \frac{1}{\log\mathrm{SNR} + \frac{1}{K}\sum_{n=1}^{K}\log\lambda_n}$$
where $(\lambda_n)_{n=1,\dots,K}$ are the non-zero eigenvalues of $\boldsymbol{\Phi}\boldsymbol{\Phi}^T$ and $K = \mathrm{rank}\{\boldsymbol{\Phi}\}$.
Proof. 
See Appendix C. ◻
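To make Lemmas 1 and 3 concrete, the following NumPy sketch (the size of $\boldsymbol{\Phi}$, the seed and the 40 dB SNR are arbitrary choices, not taken from the paper) evaluates the $s$-divergence of Lemma 1 on a grid of $s$, locates its maximizer numerically, and compares it with the large-SNR closed form of Lemma 3; the two printed values should be close.
```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 300, 60
Phi = rng.standard_normal((N, K)) / np.sqrt(N)     # i.i.d. N(0, 1/N) entries
lam = np.linalg.eigvalsh(Phi.T @ Phi)              # the K non-zero eigenvalues of Phi Phi^T
SNR = 10 ** (40 / 10)                              # 40 dB

def s_divergence(s):
    # Lemma 1: mu~_N(s) = -(1-s)/2 log det(SNR*Phi*Phi^T + I)
    #                     + 1/2 log det((1-s)*SNR*Phi*Phi^T + I)
    return (-0.5 * (1 - s) * np.sum(np.log1p(SNR * lam))
            + 0.5 * np.sum(np.log1p((1 - s) * SNR * lam)))

grid = np.linspace(1e-4, 1 - 1e-4, 20000)
s_num = grid[np.argmax([s_divergence(s) for s in grid])]
s_approx = 1 - 1 / (np.log(SNR) + np.mean(np.log(lam)))   # Lemma 3
print(s_num, s_approx)                                    # close at large SNR
```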

4. Computational Information Geometry for Classification

4.1. Formulation of the Observation Vector as a Structured Linear Model

The measurement tensor is a noisy $Q$-order tensor of size $N_1\times\cdots\times N_Q$ that can be expressed as
$$\mathcal{Y} = \mathcal{X} + \mathcal{N}$$
where $\mathcal{N}$ is the noise tensor, whose entries are assumed to be centered i.i.d. Gaussian, i.e., $[\mathcal{N}]_{n_1,\dots,n_Q}\sim\mathcal{N}(0,\sigma^2)$, and the signal tensor $\mathcal{X}$ follows either the CPD or the TKD of Section 2.1.2 and Section 2.1.3, respectively. The vectorization of Equation (10) is given by
$$\mathbf{y}_N = \mathrm{vec}(\mathbf{Y}_{(1)}) = \mathbf{x} + \mathbf{n}$$
where $\mathbf{n} = \mathrm{vec}(\mathbf{N}_{(1)})$ and $\mathbf{x} = \mathrm{vec}(\mathbf{X}_{(1)})$. Note that $\mathbf{Y}_{(1)}$, $\mathbf{N}_{(1)}$ and $\mathbf{X}_{(1)}$ are the first (1-mode) unfolding matrices, given by Definition 4, of tensors $\mathcal{Y}$, $\mathcal{N}$ and $\mathcal{X}$.
• When the tensor $\mathcal{X}$ follows a $Q$-order CPD with canonical rank $R$, we have
$$\mathbf{x} = \mathrm{vec}\Big(\boldsymbol{\Phi}^{(1)}\mathbf{S}\big(\boldsymbol{\Phi}^{(Q)}\odot\cdots\odot\boldsymbol{\Phi}^{(2)}\big)^T\Big) = \boldsymbol{\Phi}\,\mathbf{s}$$
where $\boldsymbol{\Phi} = \boldsymbol{\Phi}^{(Q)}\odot\cdots\odot\boldsymbol{\Phi}^{(1)}$ is an $N\times R$ structured matrix, $\mathbf{s} = [s_1\cdots s_R]^T$ with $s_r\sim\mathcal{N}(0,\sigma_s^2)$ i.i.d., and $N = N_1\cdots N_Q$.
• When the tensor $\mathcal{X}$ follows a $Q$-order TKD of multilinear rank $\{M_1,\dots,M_Q\}$, we have
$$\mathbf{x} = \mathrm{vec}\Big(\boldsymbol{\Phi}^{(1)}\mathbf{S}_{(1)}\big(\boldsymbol{\Phi}^{(Q)}\otimes\cdots\otimes\boldsymbol{\Phi}^{(2)}\big)^T\Big) = \boldsymbol{\Phi}\,\mathrm{vec}(\mathcal{S})$$
where $\boldsymbol{\Phi} = \boldsymbol{\Phi}^{(Q)}\otimes\cdots\otimes\boldsymbol{\Phi}^{(1)}$ is an $N\times M$ structured matrix with $M = M_1\cdots M_Q$, and $\mathrm{vec}(\mathcal{S})$ is the vectorization of the core tensor $\mathcal{S}$ with $s_{m_1,\dots,m_Q}\sim\mathcal{N}(0,\sigma_s^2)$ i.i.d.

4.2. The CPD Case

We recall that in the CPD case, $\boldsymbol{\Phi} = \boldsymbol{\Phi}^{(Q)}\odot\cdots\odot\boldsymbol{\Phi}^{(1)}$, where the $(\boldsymbol{\Phi}^{(q)})_{q=1,\dots,Q}$ are matrices of size $N_q\times R$. In the following, we assume that the matrices $\boldsymbol{\Phi}^{(q)}$, $q=1,\dots,Q$, are random matrices with i.i.d. Gaussian $\mathcal{N}(0,\frac{1}{N_q})$ entries. We evaluate the behavior of $\frac{\mu_N(s)}{N}$ when the $(N_q)_{q=1,\dots,Q}$ converge towards $+\infty$ at the same rate and $\frac{R}{N}$ converges towards a non-zero limit.
Result 1.
In the asymptotic regime where $N_1,\dots,N_Q$ converge towards $+\infty$ at the same rate and where $R\to+\infty$ in such a way that $c_R = \frac{R}{N}$ converges towards a finite constant $c>0$, it holds that
$$\frac{\mu_N(s)}{N} \xrightarrow{a.s.} \mu(s) = \frac{1-s}{2}\,\Psi_c(\mathrm{SNR}) - \frac{1}{2}\,\Psi_c\big((1-s)\times\mathrm{SNR}\big)$$
with $\xrightarrow{a.s.}$ standing for almost sure convergence,
$$\Psi_c(x) = \log\left(1+\frac{2c}{u(x)+(1-c)}\right) + c\log\left(1+\frac{2}{u(x)-(1-c)}\right) - \frac{4c}{x\big(u(x)^2-(1-c)^2\big)}$$
and $u(x) = \frac{1}{x}+\sqrt{\left(\frac{1}{x}+\lambda_c^{+}\right)\left(\frac{1}{x}+\lambda_c^{-}\right)}$, where $\lambda_c^{\pm} = (1\pm\sqrt{c})^2$.
Proof. 
See Appendix D. ◻
Remark 2.
In [49], the Central Limit Theorem (CLT) for the linear eigenvalue statistics of the tensor version of the sample covariance matrix, i.e., of $\boldsymbol{\Phi}\boldsymbol{\Phi}^T$ with $\boldsymbol{\Phi} = \boldsymbol{\Phi}^{(2)}\odot\boldsymbol{\Phi}^{(1)}$ (tensor order $Q = 2$), is established.
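To make Result 1 concrete, the following NumPy sketch (dimensions, seed and the value of $x$ playing the role of the SNR are arbitrary choices) evaluates $\Psi_c$ as written above and compares it with the empirical quantity $\frac{1}{N}\log\det(x\,\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\mathbf{I})$ for a Khatri-Rao structured $\boldsymbol{\Phi}$; under this reading of the formula, the two printed numbers should agree up to finite-size fluctuations.
```python
import numpy as np

def psi_c(x, c):
    lam_m, lam_p = (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2
    u = 1/x + np.sqrt((1/x + lam_p) * (1/x + lam_m))
    return (np.log(1 + 2*c/(u + (1 - c)))
            + c*np.log(1 + 2/(u - (1 - c)))
            - 4*c/(x*(u**2 - (1 - c)**2)))

rng = np.random.default_rng(1)
Nq = (12, 14, 16)
N = int(np.prod(Nq))
R = N // 4                                     # c = R/N = 1/4
factors = [rng.standard_normal((n, R)) / np.sqrt(n) for n in Nq]
Phi = factors[0]
for F in factors[1:]:
    # Khatri-Rao product: column-wise Kronecker product.
    Phi = np.einsum('ir,jr->ijr', F, Phi).reshape(-1, R)
lam = np.linalg.eigvalsh(Phi.T @ Phi)          # non-zero eigenvalues of Phi Phi^T
x = 5.0                                        # plays the role of the SNR
print(np.sum(np.log1p(x * lam)) / N, psi_c(x, R / N))   # should be close
```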

4.2.1. Small SNR Deviation Scenario

In this section, we assume that SNR is small. Under this regime, we have the following result:
Result 2.
In the small SNR scenario, the Fisher-information-based approximation of the error exponent for the CPD is given by
$$\mu\left(\tfrac{1}{2}\right) \underset{\mathrm{SNR}\ll 1}{\approx} -\frac{(\mathrm{SNR})^2}{16}\times c\,(1+c).$$
Proof. 
Using Lemma 2, we can notice that
J F ( 0 ) N = 1 2 R N 1 R Tr ( Φ ( Φ ) T ) 2
and that
1 R Tr ( Φ ( Φ ) T ) 2
converges a.s towards the second moment of the Marchenko-Pastur distribution which is 1 + c (see for instance [43]). ◻
Note that μ 1 2 is the error exponent related to the Bhattacharyya divergence.

4.2.2. Large SNR Deviation Scenario

Result 3.
In the case of large SNR, the minimizer of the Chernoff information is given by
$$s^\star \underset{\mathrm{SNR}\gg 1}{\approx} 1 - \frac{1}{\log\mathrm{SNR} - 1 - \frac{1-c}{c}\log(1-c)}.$$
Proof. 
It is straightforward to notice that
$$\frac{1}{K}\sum_{n=1}^{K}\log(\lambda_n) \to \int_{0}^{+\infty}\log(\lambda)\,d\nu_c(\lambda) = -1 - \frac{1-c}{c}\log(1-c).$$
The last equality can be obtained as in [50]. Using Lemma 3, we immediately get Equation (14). ◻
Remark 3.
It is interesting to note that for $c\to 0$ or $c\to 1$, the optimal $s$-value follows the same approximate relation, given by
$$s^\star \underset{\mathrm{SNR}\gg 1}{\approx} 1 - \frac{1}{\log\mathrm{SNR}},$$
as long as $\mathrm{SNR} \gg \exp[1]$, or equivalently an SNR in dB much larger than about 4 dB.
Proof. 
It is straightforward to note that
$$\frac{1-c}{c}\log(1-c) \xrightarrow{c\to 1} 0 \quad \text{and} \quad \frac{1-c}{c}\log(1-c) \xrightarrow{c\to 0} -1.$$
Using Equation (14) and the condition $\mathrm{SNR} \gg \exp[1]$, the desired result is proved. ◻

4.2.3. Approximated Analytical Expressions for c 1 and Any SNR

In the case of a low-rank CPD, where the rank $R$ is assumed to be small compared to $N$, it is realistic to assume $c \ll 1$ since $R \ll N$.
Result 4.
Under this regime, the error exponent can be approximated as follows:
$$\mu(s) \underset{c\ll 1}{\approx} \frac{c}{2}\Big[(1-s)\log(1+\mathrm{SNR}) - \log\big(1+(1-s)\,\mathrm{SNR}\big)\Big].$$
Proof. 
See Appendix E. ◻
It is easy to check that the second-order derivative of $\mu(s)$ is strictly positive; therefore, $\mu(s)$ is a strictly convex function over the interval $(0,1)$ and admits at most one global minimum. We denote by $s^\star$ the global minimizer, obtained by zeroing the first-order derivative of the error exponent. This optimal value is expressed as
$$s^\star \underset{c\ll 1}{\approx} 1 + \frac{1}{\mathrm{SNR}} - \frac{1}{\log(1+\mathrm{SNR})}.$$
The two following scenarios can be considered (a numerical check of these approximations is sketched after this list):
• At low SNR, the error exponent $\mu(s^\star)$ associated with the tightest CUB coincides with the error exponent associated with the BUB. To see this, when $c\ll 1$, we derive the second-order approximation of the optimal value $s^\star$:
$$s^\star \approx 1 + \frac{1}{\mathrm{SNR}} - \frac{1}{\mathrm{SNR}}\left(1 + \frac{\mathrm{SNR}}{2}\right) = \frac{1}{2}.$$
Result 1 and the above approximation allow us to get the best error exponent at low SNR and $c\ll 1$:
$$\mu\left(\tfrac{1}{2}\right) \underset{\mathrm{SNR}\ll 1}{\approx} \frac{1}{4}\Psi_{c\ll 1}(\mathrm{SNR}) - \frac{1}{2}\Psi_{c\ll 1}\left(\frac{\mathrm{SNR}}{2}\right) = \frac{c}{2}\log\frac{\sqrt{1+\mathrm{SNR}}}{1+\frac{\mathrm{SNR}}{2}}.$$
• Conversely, when $\mathrm{SNR}\to\infty$, $s^\star\to 1$. As a consequence, the optimal error exponent in this regime is no longer the BUB. Assuming that $\frac{\log\mathrm{SNR}}{\mathrm{SNR}}\to 0$, Equation (15) in Result 4 provides the following approximation of the optimal error exponent for large SNR:
$$\mu(s^\star) \underset{\mathrm{SNR}\gg 1}{\approx} \frac{c}{2}\Big(1 - \log\mathrm{SNR} + \log\log(1+\mathrm{SNR})\Big).$$
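As announced above, here is a short NumPy sketch (the SNR values are illustrative) of the low-rank ($c\ll 1$) CPD approximations: the minimizer of $\mu(s)$ from Result 4 is located numerically and compared with the closed form $s^\star \approx 1 + 1/\mathrm{SNR} - 1/\log(1+\mathrm{SNR})$; at low SNR both tend to $1/2$ (the BUB), while at high SNR they approach 1.
```python
import numpy as np

def mu_over_c(s, snr):
    # Result 4 up to the positive factor c/2, which does not affect the minimizer.
    return (1 - s) * np.log1p(snr) - np.log1p((1 - s) * snr)

grid = np.linspace(1e-5, 1 - 1e-5, 200000)
for snr_db in (-20, 0, 20, 40):
    snr = 10 ** (snr_db / 10)
    s_num = grid[np.argmin(mu_over_c(grid, snr))]
    s_cf = 1 + 1 / snr - 1 / np.log1p(snr)
    print(snr_db, round(s_num, 4), round(s_cf, 4))   # ~0.5 at low SNR, -> 1 at high SNR
```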

4.3. The TKD Case

In the TKD case, we recall that $\boldsymbol{\Phi} = \boldsymbol{\Phi}^{(Q)}\otimes\cdots\otimes\boldsymbol{\Phi}^{(1)}$, where the $(\boldsymbol{\Phi}^{(q)})_{1\le q\le Q}$ are $N_q\times M_q$ matrices. We still assume that the matrices $\boldsymbol{\Phi}^{(q)}$, $q=1,\dots,Q$, are random matrices with i.i.d. Gaussian $\mathcal{N}(0,\frac{1}{N_q})$ entries.
Result 5.
In the asymptotic regime where $M_q < N_q$, $1\le q\le Q$, and $M_q, N_q$ converge towards $+\infty$ at the same rate such that $\frac{M_q}{N_q}\to c_q$ with $0 < c_q < 1$, it holds that
$$\frac{\mu_N(s)}{N} \xrightarrow{a.s.} \mu(s) = c_1\cdots c_Q\left[\frac{1-s}{2}\int_{0}^{+\infty}\!\!\!\cdots\!\int_{0}^{+\infty}\log\big(1+\mathrm{SNR}\,\lambda_1\cdots\lambda_Q\big)\,d\nu_{c_1}(\lambda_1)\cdots d\nu_{c_Q}(\lambda_Q) - \frac{1}{2}\int_{0}^{+\infty}\!\!\!\cdots\!\int_{0}^{+\infty}\log\big(1+(1-s)\,\mathrm{SNR}\,\lambda_1\cdots\lambda_Q\big)\,d\nu_{c_1}(\lambda_1)\cdots d\nu_{c_Q}(\lambda_Q)\right]$$
where the $\nu_{c_q}$ are Marchenko-Pastur distributions of parameters $(c_q, 1)$ defined as in Equation (1).
Proof. 
See Appendix F. ◻
Remark 4.
We can notice that for $Q = 1$, Result 5 reduces to Result 1. However, when $Q \ge 2$, the integrals in Equation (16) are not tractable in closed form. For instance, for $Q = 2$, consider the integral
$$\int\!\!\int \log(1+\mathrm{SNR}\,\lambda_1\lambda_2)\,\nu_{c_1}(d\lambda_1)\,\nu_{c_2}(d\lambda_2) = \int_{\lambda_{c_1}^{-}}^{\lambda_{c_1}^{+}}\!\int_{\lambda_{c_2}^{-}}^{\lambda_{c_2}^{+}} \log(1+\mathrm{SNR}\,\lambda_1\lambda_2)\, \frac{\sqrt{(\lambda_1-\lambda_{c_1}^{-})(\lambda_{c_1}^{+}-\lambda_1)}}{2\pi c_1 \lambda_1}\, \frac{\sqrt{(\lambda_2-\lambda_{c_2}^{-})(\lambda_{c_2}^{+}-\lambda_2)}}{2\pi c_2 \lambda_2}\, d\lambda_1\, d\lambda_2$$
where $\lambda_{c_i}^{\pm} = (1\pm\sqrt{c_i})^2$, $i = 1, 2$. This integral involves elliptic integrals (see e.g., [51]); as a consequence, it cannot be expressed in closed form. However, numerical computation can be exploited to solve the minimization problem of Equation (7) efficiently.
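One simple numerical route is Monte-Carlo integration. The sketch below (NumPy; the dimensions, $c_q$ values, SNR and $s$ are arbitrary choices) estimates the $Q$-fold Marchenko-Pastur integrals of Result 5 by sampling eigenvalues of independent large Wishart-type matrices and averaging over random pairings.
```python
import numpy as np

rng = np.random.default_rng(2)
Q, c_q, snr, s = 2, (0.25, 0.5), 10.0, 0.7
n_dim = 2000                                     # proxy for N_q -> infinity

def mp_samples(c, n):
    # Non-zero eigenvalues of a Wishart-type matrix, i.e., approximate draws
    # from the Marchenko-Pastur law of parameters (c, 1).
    m = int(c * n)
    F = rng.standard_normal((n, m)) / np.sqrt(n)
    return np.linalg.eigvalsh(F.T @ F)

lams = [mp_samples(c, n_dim) for c in c_q]       # one spectrum per mode
# Products lambda_1 * lambda_2 over random pairings approximate the product measure.
prod = np.prod([rng.choice(l, size=200000) for l in lams], axis=0)
I1 = np.mean(np.log1p(snr * prod))
I2 = np.mean(np.log1p((1 - s) * snr * prod))
mu_s = np.prod(c_q) * (0.5 * (1 - s) * I1 - 0.5 * I2)
print(mu_s)                                      # Monte-Carlo estimate of mu(s) in Result 5
```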

4.3.1. Large SNR Deviation Scenario

Result 6.
In the case of large SNR, the minimizer of the Chernoff information for the TKD is given by
$$s^\star \underset{\mathrm{SNR}\gg 1}{\approx} 1 - \frac{1}{\log\mathrm{SNR} - Q - \sum_{q=1}^{Q}\frac{1-c_q}{c_q}\log(1-c_q)}.$$
Proof. 
We have that
$$\frac{1}{M}\sum_{n=1}^{M}\log(\lambda_n) \to \sum_{q=1}^{Q}\int_{0}^{+\infty}\log(\lambda_q)\,d\nu_{c_q}(\lambda_q) = \sum_{q=1}^{Q}\left(-1-\frac{1-c_q}{c_q}\log(1-c_q)\right) = -Q - \sum_{q=1}^{Q}\frac{1-c_q}{c_q}\log(1-c_q).$$
Using Lemma 3, we immediately get Equation (17). ◻

4.3.2. Small SNR Deviation Scenario

Under this regime, we have the following results
Result 7.
For a small SNR deviation, the Chernoff information for the TKD is given by
$$\mu\left(\tfrac{1}{2}\right) \underset{\delta_{\mathrm{SNR}}\ll 1}{\approx} -\frac{(\delta_{\mathrm{SNR}})^2}{16}\prod_{q=1}^{Q} c_q\,(1+c_q).$$
Proof. 
Using Lemma 2, we can notice that
$$\frac{J_F(0)}{N} = \frac{1}{2}\,\frac{M}{N}\,\frac{1}{M}\,\mathrm{Tr}\big((\boldsymbol{\Phi}\boldsymbol{\Phi}^T)^2\big) = \frac{1}{2}\,\frac{M}{N}\prod_{q=1}^{Q}\frac{\mathrm{Tr}\big((\boldsymbol{\Phi}^{(q)}\boldsymbol{\Phi}^{(q)T})^2\big)}{M_q}.$$
Each term in the product converges a.s. towards the second moment of the Marchenko-Pastur distribution $\nu_{c_q}$, which is $1+c_q$, and $\frac{M}{N}$ converges to $\prod_{q=1}^{Q}c_q$. This proves the desired result. ◻
Remark 5.
Contrary to Remark 3, it is interesting to note that for $c_1 = c_2 = \cdots = c_Q = c$ with $c\to 0$ or $c\to 1$, the optimal $s$-value follows two different approximate relations, namely
$$s^\star \underset{c\to 0,\ \mathrm{SNR}\gg 1}{\approx} 1 - \frac{1}{\log\mathrm{SNR}},$$
which does not depend on $Q$, and
$$s^\star \underset{c\to 1,\ \mathrm{SNR}\gg 1}{\approx} 1 - \frac{1}{\log\mathrm{SNR} - Q},$$
which depends on $Q$.
In practice, when $c$ is close to 1, we have to check carefully whether $Q$ is in the neighbourhood of $\log(\mathrm{SNR})$: when $\log\mathrm{SNR} - Q < 0$ or $0 < \log\mathrm{SNR} - Q < 1$, the above approximation yields $s^\star \notin [0,1]$.
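To illustrate Result 6, the following NumPy sketch (the mode dimensions, the seed and the 50 dB SNR are arbitrary choices, not from the paper) builds the spectrum of a Kronecker-structured $\boldsymbol{\Phi}$ from its per-mode spectra, locates the maximizer of the $s$-divergence of Lemma 1 numerically, and compares it with the closed form of Equation (17).
```python
import numpy as np

rng = np.random.default_rng(4)
Nq, Mq = (30, 40, 50), (15, 10, 40)
c_q = [m / n for m, n in zip(Mq, Nq)]
snr = 10 ** (50 / 10)                                 # 50 dB

# Non-zero eigenvalues of Phi Phi^T are the products of the per-mode eigenvalues.
lam_q = []
for n, m in zip(Nq, Mq):
    F = rng.standard_normal((n, m)) / np.sqrt(n)
    lam_q.append(np.linalg.eigvalsh(F.T @ F))
lam = lam_q[0]
for l in lam_q[1:]:
    lam = np.outer(lam, l).ravel()

def s_div(s):
    # Lemma 1 (the zero eigenvalues of Phi Phi^T contribute nothing).
    return -0.5*(1 - s)*np.sum(np.log1p(snr*lam)) + 0.5*np.sum(np.log1p((1 - s)*snr*lam))

grid = np.linspace(1e-4, 1 - 1e-4, 5000)
s_num = grid[np.argmax([s_div(s) for s in grid])]
Q = len(Nq)
s_cf = 1 - 1/(np.log(snr) - Q - sum((1 - c)/c*np.log(1 - c) for c in c_q))
print(round(s_num, 4), round(s_cf, 4))                # should be close at large SNR
```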

5. Numerical Illustrations

In this section, we consider third-order tensors ($Q = 3$) with $N_1 = 10$, $N_2 = 20$, $N_3 = 30$, $R = 3000$ for the CPD, and $M_1 = 100$, $M_2 = 120$, $M_3 = 140$, $N_1 = N_2 = N_3 = 200$ for the TKD, respectively.
Firstly, for the CPD model, the parameter $s^\star$ is drawn in Figure 4 with respect to the SNR in dB. The parameter $s^\star$ is obtained by three different methods. The first one is based on the brute-force/exhaustive computation of the CUB by minimizing the expression in Equation (8) with the Khatri-Rao structured matrix $\boldsymbol{\Phi}$. This approach has a very high computational cost, especially in our asymptotic regime (for a standard computer with an Intel Xeon E5-2630 at 2.3 GHz and 32 GB RAM, 183 h are required for 10,000 simulations). The second approach is based on the numerical optimization of the closed-form expression of $\mu(s)$ given in Result 4. In this scenario, the computational cost is largely mitigated since it consists of the minimization of a univariate regular function. Finally, under the hypothesis that the SNR is large, typically above 30 dB, the optimal $s$-value $s^\star$ is given by the analytic expression of Equation (15). We can check that the proposed semi-analytic and analytic expressions are in good agreement with the brute-force method, at a much lower computational cost. Moreover, we compute the mean square relative error $\frac{1}{L}\sum_{l=1}^{L}\big(\frac{\hat{s}_l - s^\star}{s^\star}\big)^2$, where $L = 10{,}000$ is the number of Monte-Carlo samples, $\hat{s}_l = \arg\min_{s\in[0,1]}\mu_{N,l}(s)$ and $s^\star = \arg\min_{s\in[0,1]}\mu(s)$. It turns out that the mean square relative errors are on average of the order of $-40$ dB. We can conclude that $\hat{s}$ is a consistent estimator of $s^\star$.
In Figure 5, we draw various $s$-divergences: $\mu\left(\tfrac{1}{2}\right)$, $\mu(s^\star)$, $\frac{1}{N}\mu_N\left(\tfrac{1}{2}\right)$ and $\frac{1}{N}\mu_N(\hat{s})$. We can observe the good agreement with the proposed theoretical results. The $s$-divergence obtained by fixing $s = \tfrac{1}{2}$ is accurate only at small SNR and degrades when the SNR grows large.
In Figure 6, we fix $\mathrm{SNR} = 45$ dB and draw $s^\star$ obtained from Equation (14) versus values of $c \in \{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 0.25, 0.5, 0.75, 0.9, 0.99\}$, together with the expression obtained from Equation (15). The two curves approach each other as $c$ goes to zero, as predicted by our theoretical analysis.
For the TKD scenario, we follow the same methodology as above for the CPD; Figure 7 and Figure 8 both agree with the analysis provided in Section 4.3.
For the TKD scenario, the mean square relative error is on average of the order of $-40$ dB, so we numerically verify the consistency of the estimator of the optimal $s$-value.
We can also notice that the convergence of $\frac{\mu_N(s)}{N}$ towards its deterministic equivalent $\mu(s)$ is faster in the TKD case than in the CPD case, since the dimension of the matrix $\boldsymbol{\Phi}$ is $(200\cdot 200\cdot 200)\times(100\cdot 120\cdot 140)$, i.e., $N = 200^3$, which is much larger than the dimension $6000\times 3000$ of $\boldsymbol{\Phi}$ in the CPD case ($N = 6000$).

6. Conclusions

In this work, we derived and studied the limit performance in terms of minimal Bayes’ error probability for the binary classification of high-dimensional random tensors using both the tools of Information Geometry (IG) and of Random Matrix Theory (RMT). The main results on Chernoff Bounds and Fisher Information are illustrated by Monte–Carlo simulations that corroborated our theoretical analysis.
For future work, we would like to study the rate of convergence and the fluctuations of the statistics $\frac{\mu_N(s)}{N}$ and $\hat{s}$.

Acknowledgments

The authors would like to thank Philippe Loubaton (UPEM, France) for the fruitful discussions. This research was partially supported by Labex DigiCosme (project ANR-11-LABEX-0045-DIGICOSME) operated by ANR (The French National Research Agency) as part of the program “Investissement d’Avenir” Idex Paris-Saclay (ANR-11-IDEX-0003-02).

Author Contributions

Gia-Thuy Pham, Rémy Boyer and Frank Nielsen contributed to the research results presented in this paper. Gia-Thuy Pham and Rémy Boyer performed the numerical experiments. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Lemma 1

The $s$-divergence in Equation (6) for the binary hypothesis test
$$\mathcal{H}_0: \mathbf{y}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_0), \qquad \mathcal{H}_1: \mathbf{y}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_1)$$
is given by [15]:
$$\tilde{\mu}_N(s) = \frac{1}{2}\log\frac{\det\big(s\boldsymbol{\Sigma}_0+(1-s)\boldsymbol{\Sigma}_1\big)}{[\det\boldsymbol{\Sigma}_0]^s\,[\det\boldsymbol{\Sigma}_1]^{1-s}}.$$
Using the expressions of the covariance matrices $\boldsymbol{\Sigma}_0$ and $\boldsymbol{\Sigma}_1$, the logarithm of the numerator in Equation (A1) is given by
$$N\log\sigma^2 + \log\det\big(\mathrm{SNR}\times(1-s)\,\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\mathbf{I}\big)$$
and the two terms of the denominator satisfy $\log[\det\boldsymbol{\Sigma}_0]^s = sN\log\sigma^2$ and
$$\log[\det\boldsymbol{\Sigma}_1]^{1-s} = (1-s)N\log\sigma^2 + (1-s)\log\det\big(\mathrm{SNR}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\mathbf{I}\big).$$
Using the above expressions, $\tilde{\mu}_N(s)$ is given by Equation (8).

Appendix B. Proof of Lemma 2

Denoting $d\boldsymbol{\Sigma}(\mathrm{SNR}) = \left.\frac{\partial\boldsymbol{\Sigma}(x)}{\partial x}\right|_{x=\mathrm{SNR}}$, the following expression holds:
$$\boldsymbol{\Sigma}(\delta_{\mathrm{SNR}}) = \boldsymbol{\Sigma}(0) + \delta_{\mathrm{SNR}}\times d\boldsymbol{\Sigma}(0) = \mathbf{I} + \delta_{\mathrm{SNR}}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T.$$
Using the above expression, the $s$-divergence is given by
$$\mu_N(s) = \frac{1-s}{2}\log\det\big(\mathbf{I}+\delta_{\mathrm{SNR}}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T\big) - \frac{1}{2}\log\det\big(\mathbf{I}+\delta_{\mathrm{SNR}}\times(1-s)\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T\big).$$
Now, using Equation (8) and the second-order approximation
$$\frac{1}{N}\log\det(\mathbf{I}+x\mathbf{A}) = \frac{1}{N}\mathrm{Tr}\log(\mathbf{I}+x\mathbf{A}) \approx x\times\frac{1}{N}\mathrm{Tr}\,\mathbf{A} - \frac{x^2}{2}\times\frac{1}{N}\mathrm{Tr}\,\mathbf{A}^2,$$
we obtain
$$\frac{\mu_N(s)}{N} \approx (s-1)\,s\times\frac{(\delta_{\mathrm{SNR}})^2}{2}\times\frac{J_F(0)}{N}$$
where the Fisher information for $\mathbf{y}\,|\,\delta_{\mathrm{SNR}} \sim \mathcal{N}\big(\mathbf{0},\boldsymbol{\Sigma}(\delta_{\mathrm{SNR}})\big)$ is given by [3]:
$$J_F(\delta_{\mathrm{SNR}}) = -\mathbb{E}\left[\frac{\partial^2\log p(\mathbf{y}\,|\,\delta_{\mathrm{SNR}})}{\partial(\delta_{\mathrm{SNR}})^2}\right] = \frac{1}{2}\mathrm{Tr}\Big\{\boldsymbol{\Sigma}(\delta_{\mathrm{SNR}})^{-1}\,d\boldsymbol{\Sigma}(\delta_{\mathrm{SNR}})\,\boldsymbol{\Sigma}(\delta_{\mathrm{SNR}})^{-1}\,d\boldsymbol{\Sigma}(\delta_{\mathrm{SNR}})\Big\} = \frac{1}{2}\mathrm{Tr}\Big(\big(\mathbf{I}+\delta_{\mathrm{SNR}}\boldsymbol{\Phi}\boldsymbol{\Phi}^T\big)^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T\big(\mathbf{I}+\delta_{\mathrm{SNR}}\boldsymbol{\Phi}\boldsymbol{\Phi}^T\big)^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T\Big).$$

Appendix C. Proof of Lemma 3

The first step of the proof is based on the derivation of an alternative expression of $\tilde{\mu}_N(s)$, given by Equation (A1), involving the inverses of the covariance matrices $\boldsymbol{\Sigma}_0$ and $\boldsymbol{\Sigma}_1$. Specifically, we have
$$\tilde{\mu}_N(s) = \frac{1}{2}\log\frac{(\det\boldsymbol{\Sigma}_0)(\det\boldsymbol{\Sigma}_1)\det\big((1-s)\boldsymbol{\Sigma}_0^{-1}+s\boldsymbol{\Sigma}_1^{-1}\big)}{[\det\boldsymbol{\Sigma}_0]^s\,[\det\boldsymbol{\Sigma}_1]^{1-s}} = -\frac{1}{2}\log\frac{\det\big[(1-s)\boldsymbol{\Sigma}_0^{-1}+s\boldsymbol{\Sigma}_1^{-1}\big]^{-1}}{[\det\boldsymbol{\Sigma}_0]^{1-s}\,[\det\boldsymbol{\Sigma}_1]^{s}}.$$
The second step is to derive a closed-form expression in the high SNR regime, using the approximation (see [52] for instance) $\big(x\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\mathbf{I}\big)^{-1} \underset{x\gg 1}{\approx} \boldsymbol{\Pi}_{\boldsymbol{\Phi}}^{\perp} = \mathbf{I}_N - \boldsymbol{\Phi}\boldsymbol{\Phi}^{+}$, where $\boldsymbol{\Pi}_{\boldsymbol{\Phi}}^{\perp}$ is the orthogonal projector such that $\boldsymbol{\Pi}_{\boldsymbol{\Phi}}^{\perp}\boldsymbol{\Phi}=\mathbf{0}$ and $\boldsymbol{\Phi}^{+} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T$. The numerator in Equation (A2) is then approximated as
$$\big[(1-s)\boldsymbol{\Sigma}_0^{-1}+s\boldsymbol{\Sigma}_1^{-1}\big]^{-1} \underset{\mathrm{SNR}\gg 1}{\approx} \sigma^2\big[\mathbf{I}_N - s\,\mathbf{I}_N + s\,\boldsymbol{\Pi}_{\boldsymbol{\Phi}}^{\perp}\big]^{-1} = \sigma^2\big[\mathbf{I}_N - s\,\boldsymbol{\Phi}\boldsymbol{\Phi}^{+}\big]^{-1}.$$
As $s\,\boldsymbol{\Phi}\boldsymbol{\Phi}^{+}$ is a rank-$K$ projector matrix scaled by the factor $s>0$, its eigen-spectrum is $\{\underbrace{s,\dots,s}_{K},\underbrace{0,\dots,0}_{N-K}\}$. In addition, as the identity matrix and the scaled projector $s\,\boldsymbol{\Phi}\boldsymbol{\Phi}^{+}$ can be diagonalized in the same orthonormal basis, the $n$-th eigenvalue of the inverse of the matrix $\mathbf{I}_N - s\,\boldsymbol{\Phi}\boldsymbol{\Phi}^{+}$ is given by
$$\lambda_n\Big(\big[\mathbf{I}_N - s\,\boldsymbol{\Phi}\boldsymbol{\Phi}^{+}\big]^{-1}\Big) = \frac{1}{\lambda_n(\mathbf{I}_N) - s\,\lambda_n(\boldsymbol{\Phi}\boldsymbol{\Phi}^{+})} = \begin{cases}\frac{1}{1-s}, & 1\le n\le K,\\ 1, & K+1\le n\le N,\end{cases}$$
with $s\in(0,1)$. Using the above property, we obtain
$$\log\det\big[\mathbf{I}_N - s\,\boldsymbol{\Phi}\boldsymbol{\Phi}^{+}\big]^{-1} = \log\prod_{n=1}^{N}\lambda_n\Big(\big[\mathbf{I}_N - s\,\boldsymbol{\Phi}\boldsymbol{\Phi}^{+}\big]^{-1}\Big) = -K\log(1-s).$$
In addition, we have
$$\log\det\big(\mathrm{SNR}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\mathbf{I}\big) \underset{\mathrm{SNR}\gg 1}{\approx} \mathrm{Tr}\log\big(\mathrm{SNR}\times\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big) = K\log\mathrm{SNR} + \sum_{n=1}^{K}\log\lambda_n.$$
Finally, thanks to Equation (A2), we have
$$\frac{\tilde{\mu}_N(s)}{N} \underset{\mathrm{SNR}\gg 1}{\approx} \frac{1}{2}\frac{K}{N}\left[\log(1-s) + s\log\mathrm{SNR} + \frac{s}{K}\sum_{n=1}^{K}\log\lambda_n\right],$$
and $s^\star$ in Equation (9) is obtained by solving $\frac{\partial\tilde{\mu}_N(s)}{\partial s} = 0$.

Appendix D. Proof of Result 1

The asymptotic behavior of $\frac{\mu_N(s)}{N}$, when $N_q\to+\infty$ for each $q=1,\dots,Q$ and $R\to+\infty$ in such a way that $\frac{R^{1/Q}}{N_q}$ converges towards a non-zero constant for each $q$, can be obtained thanks to large random matrix theory. We suppose that $N_1,\dots,N_Q$ converge towards $+\infty$ at the same rate (i.e., $\frac{N_q}{N_p}$ converges towards a non-zero constant for each pair $(p,q)$), and that $c_R = \frac{R}{N}$ converges towards a constant $c>0$. Under this regime, the empirical eigenvalue distribution of the covariance-type matrix $\boldsymbol{\Phi}\boldsymbol{\Phi}^T$ is known to converge towards the so-called Marchenko-Pastur distribution. Following Section 2.2, we recall that the Marchenko-Pastur distribution $\nu_c(d\lambda)$ is defined as
$$\nu_c(d\lambda) = \delta(\lambda)\,[1-c]^{+} + \frac{\sqrt{(\lambda-\lambda_c^{-})(\lambda_c^{+}-\lambda)}}{2\pi\lambda}\,\mathbb{1}_{[\lambda_c^{-},\lambda_c^{+}]}(\lambda)\,d\lambda$$
where $\lambda_c^{-} = (1-\sqrt{c})^2$ and $\lambda_c^{+} = (1+\sqrt{c})^2$. We define $t_c(z) = \int_{\mathbb{R}^{+}}\frac{\nu_c(d\lambda)}{\lambda-z}$, the Stieltjes transform of $\nu_c$, which satisfies the equation
$$t_c(z) = \left(-z + \frac{c}{1+t_c(z)}\right)^{-1}.$$
When $z\in\mathbb{R}^{-*}$, i.e., $z=-\rho$ with $\rho>0$, it is well known that $t_c(-\rho)$ is given by
$$t_c(-\rho) = \frac{2}{\rho-(1-c)+\sqrt{(\rho+\lambda_c^{-})(\rho+\lambda_c^{+})}}.$$
It was established for the first time in [45] that if $\mathbf{X}$ is a $K\times P$ random matrix with zero-mean, variance-$\frac{1}{K}$ i.i.d. entries, and if $(\lambda_k)_{k=1,\dots,K}$ are the eigenvalues of $\mathbf{X}\mathbf{X}^T$ arranged in decreasing order, then the empirical eigenvalue distribution $\frac{1}{K}\sum_{k=1}^{K}\delta(\lambda-\lambda_k)$ of $\mathbf{X}\mathbf{X}^T$ converges weakly almost surely towards $\nu_c$ in the regime $K\to+\infty$, $P\to+\infty$, $\frac{P}{K}\to c$. In addition, for each continuous function $f(\lambda)$, we have
$$\frac{1}{K}\sum_{k=1}^{K}f(\lambda_k) \xrightarrow{a.s.} \int_{\mathbb{R}^{+}}f(\lambda)\,\nu_c(d\lambda).$$
Practically, when $K$ and $P$ are large enough, the histogram of the eigenvalues of each realization of $\mathbf{X}\mathbf{X}^T$ accumulates around the graph of the probability density of $\nu_c$.
The columns $(\boldsymbol{\phi}_r)_{r=1,\dots,R}$ of $\boldsymbol{\Phi}$ are the vectors $(\boldsymbol{\phi}_r^{(Q)}\otimes\cdots\otimes\boldsymbol{\phi}_r^{(1)})_{r=1,\dots,R}$, which are mutually independent, identically distributed, and satisfy $\mathbb{E}(\boldsymbol{\phi}_r\boldsymbol{\phi}_r^T)=\frac{\mathbf{I}_N}{N}$. However, since the components of each column $\boldsymbol{\phi}_r$ are not independent, the entries of $\boldsymbol{\Phi}$ are not mutually independent. Applying the results of [53] (see also [54]), we can establish that the empirical eigenvalue distribution of $\boldsymbol{\Phi}\boldsymbol{\Phi}^T$ still converges almost surely towards $\nu_c$ under the asymptotic regime $\frac{R}{N}\to c$. For the continuous function $f(\lambda)=\log(1+\lambda/\rho)$, we apply Equation (A4); the integral $\int_{\mathbb{R}^{+}}\log(1+\lambda/\rho)\,\nu_c(d\lambda)$ can be expressed in terms of $t_c(-\rho)$ given by Equation (A3) (see e.g., [50]), which finishes the proof.

Appendix E. Proof of Result 4

We have $u(x) \underset{c\ll 1}{\approx} \frac{1}{x}+\sqrt{\left(\frac{1}{x}+1\right)^2} = \frac{2}{x}+1$, and hence $u(x)+(1-c) \underset{c\ll 1}{\approx} 2\left(\frac{1}{x}+1\right)$, $u(x)-(1-c) \underset{c\ll 1}{\approx} \frac{2}{x}$ and $u(x)^2-(1-c)^2 \underset{c\ll 1}{\approx} \frac{4}{x}\left(\frac{1}{x}+1\right)$. Using these first-order approximations, Equation (13) becomes
$$\Psi_{c\ll 1}(x) \approx \frac{c\,x}{1+x} + c\log(1+x) - \frac{c\,x}{1+x} = c\log(1+x).$$
Using the above approximation and Equation (12), we obtain Result 4.

Appendix F. Proof of Result 5

We first denote by $\lambda_1^{(q)} \ge \lambda_2^{(q)} \ge \cdots \ge \lambda_{N_q}^{(q)}$ the eigenvalues of $\boldsymbol{\Phi}^{(q)}(\boldsymbol{\Phi}^{(q)})^T$, for $1\le q\le Q$. We can notice that the eigenvalues of $\boldsymbol{\Phi}\boldsymbol{\Phi}^T$ are the products $\lambda_{n_1}^{(1)}\cdots\lambda_{n_Q}^{(Q)}$. Moreover, in the asymptotic regime where $M_q\to+\infty$, $N_q\to+\infty$ such that $\frac{M_q}{N_q}\to c_q$, $0<c_q<1$, for all $1\le q\le Q$, we have $\lambda_{n_q}^{(q)}=0$ for $M_q+1\le n_q\le N_q$, and the empirical distribution of the eigenvalues $(\lambda_{n_q}^{(q)})_{1\le n_q\le M_q}$ behaves as the Marchenko-Pastur distribution $\nu_{c_q}$ of parameters $(c_q,1)$. Recalling that $M=M_1\cdots M_Q$ and $N=N_1\cdots N_Q$, we obtain immediately that
$$\frac{1}{N}\log\det\big(\mathrm{SNR}\times\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\mathbf{I}\big) = \frac{1}{N}\sum_{n_1=1}^{N_1}\cdots\sum_{n_Q=1}^{N_Q}\log\big(\mathrm{SNR}\times\lambda_{n_1}^{(1)}\cdots\lambda_{n_Q}^{(Q)}+1\big) = \frac{M}{N}\,\frac{1}{M}\sum_{n_1=1}^{M_1}\cdots\sum_{n_Q=1}^{M_Q}\log\big(\mathrm{SNR}\times\lambda_{n_1}^{(1)}\cdots\lambda_{n_Q}^{(Q)}+1\big)$$
and that
$$\frac{1}{M}\sum_{n_1=1}^{M_1}\cdots\sum_{n_Q=1}^{M_Q}\log\big(\mathrm{SNR}\times\lambda_{n_1}^{(1)}\cdots\lambda_{n_Q}^{(Q)}+1\big) \xrightarrow{a.s.} \int_{0}^{+\infty}\!\!\!\cdots\!\int_{0}^{+\infty}\log\big(1+\mathrm{SNR}\,\lambda_1\cdots\lambda_Q\big)\,d\nu_{c_1}(\lambda_1)\cdots d\nu_{c_Q}(\lambda_Q).$$
Similarly, we have
$$\frac{1}{M}\log\det\big(\mathrm{SNR}\times(1-s)\,\boldsymbol{\Phi}\boldsymbol{\Phi}^T+\mathbf{I}\big) \xrightarrow{a.s.} \int_{0}^{+\infty}\!\!\!\cdots\!\int_{0}^{+\infty}\log\big(1+\mathrm{SNR}\times(1-s)\,\lambda_1\cdots\lambda_Q\big)\,d\nu_{c_1}(\lambda_1)\cdots d\nu_{c_Q}(\lambda_Q).$$
Combining these limits, we obtain Result 5.

References

  1. Besson, O.; Scharf, L.L. CFAR matched direction detector. IEEE Trans. Signal Process. 2006, 54, 2840–2844. [Google Scholar] [CrossRef]
  2. Bianchi, P.; Debbah, M.; Maida, M.; Najim, J. Performance of Statistical Tests for Source Detection using Random Matrix Theory. IEEE Trans. Inf. Theory 2011, 57, 2400–2419. [Google Scholar] [CrossRef] [Green Version]
  3. Kay, S.M. Fundamentals of Statistical Signal Processing, Volume II: Detection Theory; PTR Prentice-Hall: Englewood Cliffs, NJ, USA, 1993. [Google Scholar]
  4. Loubaton, P.; Vallet, P. Almost Sure Localization of the Eigenvalues in a Gaussian Information Plus Noise Model. Application to the Spiked Models. Electron. J. Probab. 2011, 16, 1934–1959. [Google Scholar] [CrossRef]
  5. Mestre, X. Improved Estimation of Eigenvalues and Eigenvectors of Covariance Matrices Using Their Sample Estimates. IEEE Trans. Inf. Theory 2008, 54, 5113–5129. [Google Scholar] [CrossRef]
  6. Baik, J.; Silverstein, J. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivar. Anal. 2006, 97, 1382–1408. [Google Scholar] [CrossRef]
  7. Silverstein, J.W.; Combettes, P.L. Signal detection via spectral theory of large dimensional random matrices. IEEE Trans. Signal Process. 1992, 40, 2100–2105. [Google Scholar] [CrossRef]
  8. Cheng, Y.; Hua, X.; Wang, H.; Qin, Y.; Li, X. The Geometry of Signal Detection with Applications to Radar Signal Processing. Entropy 2016, 18, 381. [Google Scholar] [CrossRef]
  9. Ali, S.M.; Silvey, S.D. A General Class of Coefficients of Divergence of One Distribution from Another. J. R. Stat. Soc. Ser. B (Methodol.) 1966, 28, 131–142. [Google Scholar]
  10. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012. [Google Scholar]
  11. Kailath, T. The Divergence and Bhattacharyya Distance Measures in Signal Selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60. [Google Scholar]
  12. Nielsen, F. Hypothesis Testing, Information Divergence and Computational Geometry; Geometric Science of Information; Springer: Berlin, Germany, 2013; pp. 241–248. [Google Scholar]
  13. Sinanovic, S.; Johnson, D.H. Toward a theory of information processing. Signal Process. 2007, 87, 1326–1344. [Google Scholar] [CrossRef]
  14. Chernoff, H. A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the sum of Observations. Ann. Math. Stat. 1952, 23, 493–507. [Google Scholar] [CrossRef]
  15. Nielsen, F. Chernoff information of exponential families. arXiv, 2011; arXiv:1102.2684. [Google Scholar]
  16. Chepuri, S.P.; Leus, G. Sparse sensing for distributed Gaussian detection. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015. [Google Scholar]
  17. Tang, G.; Nehorai, A. Performance Analysis for Sparse Support Recovery. IEEE Trans. Inf. Theory 2010, 56, 1383–1399. [Google Scholar] [CrossRef]
  18. Lee, Y.; Sung, Y. Generalized Chernoff Information for Mismatched Bayesian Detection and Its Application to Energy Detection. IEEE Signal Process. Lett. 2012, 19, 753–756. [Google Scholar]
  19. Grossi, E.; Lops, M. Space-time code design for MIMO detection based on Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2012, 58, 3989–4004. [Google Scholar] [CrossRef]
  20. Sen, S.; Nehorai, A. Sparsity-Based Multi-Target Tracking Using OFDM Radar. IEEE Trans. Signal Process. 2011, 59, 1902–1906. [Google Scholar] [CrossRef]
  21. Boyer, R.; Delpha, C. Relative-entropy based beamforming for secret key transmission. In Proceedings of the 2012 IEEE 7th Sensor Array and Multichannel Signal Processing Workshop (SAM), Hoboken, NJ, USA, 17–20 June 2012. [Google Scholar]
  22. Tran, N.D.; Boyer, R.; Marcos, S.; Larzabal, P. Angular resolution limit for array processing: Estimation and information theory approaches. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 27–31 August 2012. [Google Scholar]
  23. Katz, G.; Piantanida, P.; Couillet, R.; Debbah, M. Joint estimation and detection against independence. In Proceedings of the Annual Conference on Communication Control and Computing (Allerton), Monticello, IL, USA, 30 September–3 October 2014; pp. 1220–1227. [Google Scholar]
  24. Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett. 2013, 20, 269–272. [Google Scholar] [CrossRef]
  25. Cichocki, A.; Mandic, D.; De Lathauwer, L.; Zhou, G.; Zhao, Q.; Caiafa, C.; Phan, H.A. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Process. Mag. 2015, 32, 145–163. [Google Scholar] [CrossRef]
  26. Comon, P. Tensors: A brief introduction. IEEE Signal Process. Mag. 2014, 31, 44–53. [Google Scholar] [CrossRef]
  27. De Lathauwer, L.; Moor, B.D.; Vandewalle, J. A Multilinear Singular Value Decomposition. SIAM J. Matrix Anal. Appl. 2000, 21, 1253–1278. [Google Scholar] [CrossRef]
  28. Tucker, L.R. Some mathematical notes on three-mode factor analysis. Psychometrika 1966, 31, 279–311. [Google Scholar] [CrossRef] [PubMed]
  29. Comon, P.; Berge, J.T.; De Lathauwer, L.; Castaing, J. Generic and Typical Ranks of Multi-Way Arrays. Linear Algebra Appl. 2009, 430, 2997–3007. [Google Scholar] [CrossRef] [Green Version]
  30. De Lathauwer, L. A survey of tensor methods. In Proceedings of the IEEE International Symposium on Circuits and Systems, ISCAS 2009, Taipei, Taiwan, 24–27 May 2009. [Google Scholar]
  31. Comon, P.; Luciani, X.; De Almeida, A.L.F. Tensor decompositions, alternating least squares and other tales. J. Chemom. 2009, 23, 393–405. [Google Scholar] [CrossRef] [Green Version]
  32. Goulart, J.H.D.M.; Boizard, M.; Boyer, R.; Favier, G.; Comon, P. Tensor CP Decomposition with Structured Factor Matrices: Algorithms and Performance. IEEE J. Sel. Top. Signal Process. 2016, 10, 757–769. [Google Scholar] [CrossRef]
  33. Eckart, C.; Young, G. The approximation of one matrix by another of lower rank. Psychometrika 1936, 1, 211–218. [Google Scholar]
  34. Badeau, R.; Richard, G.; David, B. Fast and stable YAST algorithm for principal and minor subspace tracking. IEEE Trans. Signal Process. 2008, 56, 3437–3446. [Google Scholar] [CrossRef]
  35. Boyer, R.; Badeau, R. Adaptive multilinear SVD for structured tensors. In Proceedings of the 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’06), Toulouse, France, 14–19 May 2006. [Google Scholar]
  36. Boizard, M.; Ginolhac, G.; Pascal, F.; Forster, P. Low-rank filter and detector for multidimensional data based on an alternative unfolding HOSVD: Application to polarimetric STAP. EURASIP J. Adv. Signal Process. 2014, 2014, 119. [Google Scholar] [CrossRef]
  37. Bouleux, G.; Boyer, R. Sparse-Based Estimation Performance for Partially Known Overcomplete Large-Systems. Signal Process. 2017, 139, 70–74. [Google Scholar] [CrossRef]
  38. Boyer, R.; Couillet, R.; Fleury, B.-H.; Larzabal, P. Large-System Estimation Performance in Noisy Compressed Sensing with Random Support—A Bayesian Analysis. IEEE Trans. Signal Process. 2016, 64, 5525–5535. [Google Scholar] [CrossRef]
  39. Ollier, V.; Boyer, R.; El Korso, M.N.; Larzabal, P. Bayesian Lower Bounds for Dense or Sparse (Outlier) Noise in the RMT Framework. In Proceedings of the 2016 IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM 16), Rio de Janerio, Brazil, 10–13 July 2016. [Google Scholar]
  40. Wishart, J. The generalized product moment distribution in samples. Biometrika 1928, 20A, 32–52. [Google Scholar] [CrossRef]
  41. Wigner, E.P. On the statistical distribution of the widths and spacings of nuclear resonance levels. Proc. Camb. Philos. Soc. 1951, 47, 790–798. [Google Scholar] [CrossRef]
  42. Wigner, E.P. Characteristic vectors of bordered matrices with infinite dimensions. Ann. Math. 1955, 62, 548–564. [Google Scholar]
  43. Bai, Z.D.; Silverstein, J.W. Spectral Analysis of Large Dimensional Random Matrices, 2nd ed.; Springer Series in Statistics; Springer: Berlin, Germany, 2010. [Google Scholar]
  44. Girko, V.L. Theory of Random Determinants; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1990. [Google Scholar]
  45. Marchenko, V.A.; Pastur, L.A. Distribution of eigenvalues for some sets of random matrices. Math. Sb. (N.S.) 1967, 72, 507–536. [Google Scholar]
  46. Voiculescu, D. Limit laws for random matrices and free products. Invent. Math. 1991, 104, 201–220. [Google Scholar] [CrossRef]
  47. Boyer, R.; Nielsen, F. Information Geometry Metric for Random Signal Detection in Large Random Sensing Systems. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017. [Google Scholar]
  48. Boyer, R.; Loubaton, P. Large deviation analysis of the CPD detection problem based on random tensor theory. In Proceedings of the 2017 25th European Association for Signal Processing (EUSIPCO), Kos, Greece, 28 August–2 September 2017. [Google Scholar]
  49. Lytova, A. Central Limit Theorem for Linear Eigenvalue Statistics for a Tensor Product Version of Sample Covariance Matrices. J. Theor. Prob. 2017, 1–34. [Google Scholar] [CrossRef]
  50. Tulino, A.M.; Verdu, S. Random Matrix Theory and Wireless Communications; Now Publishers Inc.: Hanover, MA, USA, 2004; Volume 1. [Google Scholar]
  51. Milne-Thomson, L.M. “Elliptic Integrals” (Chapter 17). In Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing; Abramowitz, M., Stegun, I.A., Eds.; Dover Publications: New York, NY, USA, 1972; pp. 587–607. [Google Scholar]
  52. Behrens, R.T.; Scharf, L.L. Signal processing applications of oblique projection operators. IEEE Trans. Signal Process. 1994, 42, 1413–1424. [Google Scholar] [CrossRef]
  53. Pajor, A.; Pastur, L.A. On the Limiting Empirical Measure of the sum of rank one matrices with log-concave distribution. Stud. Math. 2009, 195, 11–29. [Google Scholar] [CrossRef]
  54. Ambainis, A.; Harrow, A.W.; Hastings, M.B. Random matrix theory: Extending random matrix theory to mixtures of random product states. Commun. Math. Phys. 2012, 310, 25–74. [Google Scholar] [CrossRef]
Figure 1. Canonical Polyadic Decomposition (CPD).
Figure 2. Histogram of the eigenvalues of $\frac{\mathbf{W}_N\mathbf{W}_N^T}{N}$ (with $M=256$, $c_N=\frac{M}{N}=\frac{1}{256}$, $\sigma^2=1$).
Figure 3. Histogram of the eigenvalues of $\frac{\mathbf{W}_N\mathbf{W}_N^T}{N}$ (with $M=256$, $c_N=\frac{M}{N}=\frac{1}{4}$, $\sigma^2=1$).
Figure 4. Canonical Polyadic Decomposition (CPD) scenario: Optimal $s$-parameter versus Signal to Noise Ratio (SNR) in dB.
Figure 5. CPD scenario: $s$-divergence vs. SNR in dB.
Figure 6. CPD scenario: $s^\star$ vs. $c$, $\mathrm{SNR}=45$ dB.
Figure 7. TucKer Decomposition (TKD) scenario: Optimal $s$-parameter vs. SNR in dB.
Figure 8. TKD scenario: $s$-divergence vs. SNR in dB.

