Rank Minimization over Finite Fields: Fundamental Limits and Coding-Theoretic Interpretations

Vincent Y. F. Tan, Laura Balzano, Student Member, IEEE, Stark C. Draper, Member, IEEE

arXiv:1104.4302v4 [cs.IT] 1 Dec 2011

Abstract—This paper establishes information-theoretic limits for estimating a finite field low-rank matrix given random linear measurements of it. These linear measurements are obtained by taking inner products of the low-rank matrix with random sensing matrices. Necessary and sufficient conditions on the number of measurements required are provided. It is shown that these conditions are sharp and the minimum-rank decoder is asymptotically optimal. The reliability function of this decoder is also derived by appealing to de Caen's lower bound on the probability of a union. The sufficient condition also holds when the sensing matrices are sparse – a scenario that may be amenable to efficient decoding. More precisely, it is shown that if the n×n sensing matrices contain, on average, Ω(n log n) entries, the number of measurements required is the same as that when the sensing matrices are dense and contain entries drawn uniformly at random from the field. Analogies are drawn between the above results and rank-metric codes in the coding theory literature. In fact, we are also strongly motivated by understanding when minimum rank distance decoding of random rank-metric codes succeeds. To this end, we derive minimum distance properties of equiprobable and sparse rank-metric codes. These distance properties provide a precise geometric interpretation of the fact that the sparse ensemble requires as few measurements as the dense one.

Index Terms—Rank minimization, Finite fields, Reliability function, Sparse parity-check matrices, Rank-metric codes, Minimum rank distance properties

I. INTRODUCTION

This paper considers the problem of rank minimization over finite fields. Our work attempts to connect two seemingly disparate areas of study that have, by themselves, become popular in the information theory community in recent years: (i) the theory of matrix completion [2]–[4] and rank minimization [5], [6] over the reals and (ii) rank-metric codes [7]–[12], which are the rank distance analogs of binary block codes endowed with the Hamming metric. The work herein provides a starting point for investigating the potential impact of the low-rank assumption on information and coding theory. We provide a brief review of these two areas of study.

This work is supported in part by the Air Force Office of Scientific Research under grant FA9550-09-1-0140 and by the National Science Foundation under grant CCF 0963834. V. Y. F. Tan is also supported by A*STAR Singapore. This paper was presented in part at the IEEE International Symposium on Information Theory (ISIT), St. Petersburg, Russia, August 2011 [1]. The authors are with the Department of Electrical and Computer Engineering (ECE), University of Wisconsin, Madison, WI, 53706, USA (emails: vtan@wisc.edu; sunbeam@ece.wisc.edu; sdraper@ece.wisc.edu). The first author is also affiliated to the Laboratory for Information and Decision Systems (LIDS), Massachusetts Institute of Technology (MIT), Cambridge, MA, 02139, USA (email: vtan@mit.edu). Copyright (c) 2011 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
The problem of matrix completion [2]–[4] can be stated as follows: One is given a subset of noiseless or noisy entries of a low-rank matrix (with entries over the reals), and is then required to estimate all the remaining entries. This problem has a variety of applications from collaborative filtering (e.g., Netflix prize [13]) to obtaining the minimal realization of a linear dynamical system [14]. Algorithms based on the nuclear norm (sum of singular values) convex relaxation of the rank function [14], [15] have enjoyed tremendous successes. A generalization of the matrix completion problem is the rank minimization problem [5], [6] where, instead of being given entries of the low-rank matrix, one is given arbitrary linear measurements of it. These linear measurements are obtained by taking inner products of the unknown matrix with sensing matrices. The nuclear norm heuristic has also been shown to be extremely effective in estimating the unknown lowrank matrix. Theoretical results [5], [6] are typically of the following flavour: If the number of measurements (also known as the measurement complexity) exceeds a small multiple of the product of the dimension of the matrix and its rank, then optimizing the nuclear-norm heuristic yields the same (optimal) solution as the rank minimization problem under certain conditions on the sensing matrices. Note that in the case of real matrices, if the observations (or the entries) are noisy, perfect reconstruction is impossible. As we shall see in Section V, this is not the case in the finite field setting. We can recover the underlying matrix exactly albeit at the cost of a higher measurement complexity. Rank-metric codes [7]–[12] are subsets of finite field matrices endowed with the rank-metric. We will be concerned with linear rank-metric codes, which may be characterized by a family of parity-check matrices, which are equivalent to the sensing matrices in the rank minimization problem. A. Motivations Besides analyzing the measurement complexity for rank minimization over finite fields, this paper is also motivated by two applications in coding. The first is index coding with side information [16]. In brief, a sender wants to communicate the l-th coordinate of a length-L bit string to the l-th of L receivers. Furthermore, each of the L receivers knows a subset of the coordinates of the bit string. These subsets can be represented by (the neighbourhoods of) a graph. Bar-Yossef et al. [16] showed that the linear version of this problem reduces to a rank minimization problem. In previous works, the graph is deterministic. Our work, and in particular the rank minimization problem considered herein, can be cast as IEEE TRANSACTIONS ON INFORMATION THEORY 2 the solution of a linear index coding problem with a random side information graph. Second, we are interested in properties of the rank-metric coding problem [10]. Here, we are given a set of matrix-valued codewords that form a linear rank-metric code C . A codeword C∗ ∈ C is transmitted across a noisy finite field matrix-valued channel which induces an additive error matrix X. This error matrix X is assumed to be low rank. For example, X could be a matrix induced by the crisscross error model in data arrays [17]. In the crisscross error model, X is a sparse low rank matrix in which the non-zero elements are restricted to a small number of rows and columns. The received matrix is R := C∗ + X. The minimum distance decoding problem is given by the following: Ĉ := arg min rank(R − C). 
(1) C∈C We would like to study when problem (1) succeeds (i.e., uniquely recovers the true codeword C∗ ) with high probability1 (w.h.p.) given that C is a random code characterized by either dense or sparse random parity-check matrices and X is a deterministic error matrix. But why analyze random codes? Our study of random (instead of deterministic) codes is motivated by the fact that data arrays that arise in applications are often corrupted by crisscross error patterns [17]. Decoding techniques used in the rank-metric literature such as error trapping [11], [18] are unfortunately not able to correct such error patterns because they are highly structured and hence the “error traps” would miss (or not be able to correct) a non-trivial subset of errors. Indeed, the success such an error trapping strategy hinges strongly on the assumption that the underlying low-rank error matrix X is drawn uniformly at random over all matrices whose rank is r [18, Sec. IV] (so subspaces can be trapped). The decoding technique in [17] is specific to correcting crisscross error patterns. In contrast, in this work, we are able to derive distance properties of random rank-metric codes and to show that given sufficiently many constraints on the codewords, all error patterns of rank no greater than r can be successfully corrected. Although our derivations are similar in spirit to those in Barg and Forney [19], our starting point is rather different. In particular, we combine the use of techniques from [20] and those in [19]. We are also motivated by the fact that error exponentlike results for matrix-valued finite field channels are, to the best of the authors’ knowledge, not available in the literature. Such channels have been popularized by the seminal work in [21]. Capacity results for specific channel models such as the uniform given rank (u.g.r.) multiplicative noise model [22] have recently been derived. In this work, we derive the error exponent for the minimum-rank decoder E(R) (for the additive noise model). This fills an important gap in the literature. B. Main Contributions We summarize our four main contributions in this work. Firstly, by using a standard converse technique (Fano’s inequality), we derive a necessary condition on the number 1 Here and in the following, with high probability means with probability tending to one as the problem size tends to infinity. of measurements required for estimating a low-rank matrix. Furthermore, under the assumption that the linear measurements are obtained by taking inner products of the unknown matrix with sensing matrices containing independent entries that are equiprobable (in Fq ), we demonstrate an achievability procedure, called the min-rank decoder, that matches the information-theoretic lower bound on the number of measurements required. Hence, the sufficient condition is sharp. Extensions to the noisy case are also discussed. Note that in this paper, we are not as concerned with the computational complexity of recovering the unknown low-rank matrix as compared to the fundamental limits of doing so. Secondly, we derive the reliability function (error exponent) E(R) of the min-rank decoder by using de Caen’s lower bound on the probability of a union [23]. The use of de Caen’s bound to obtain estimates of the reliability function (or probability of error) is not new. See the works by Séguin [24] and Cohen and Merhav [25] for example. 
However, by exploiting pairwise independence of constituent error events, we not only derive upper and lower bounds on E(R), we show that these bounds are, in fact, tight for all rates (for the min-rank decoder). We derive the corresponding error exponents for codes in [7] and [18] and make comparisons between the error exponents. Thirdly, we show that if the fraction of non-zero entries of the sensing or measurement matrices scales (on average) as Ω( logn n ) (where the matrix is of size n × n), the min-rank decoder achieves the information-theoretic lower bound. Thus, if the average number of entries in each sparse sensing matrix is Ω(n log n) (which is much fewer than n2 ), we can show that, very surprisingly, the number of linear measurements required for reliable reconstruction of the unknown low-rank matrix is exactly the same as that for the equiprobable (dense) case. This main result of ours opens the possibility for the development of efficient, message-passing decoding algorithms based on sparse parity-check matrices [26]. Finally, we draw analogies between the above results and rank-metric codes [7]–[12] in the coding theory literature. We derive minimum (rank) distance properties of the equiprobable random ensemble and the sparse random ensemble. Using elementary techniques, we derive an analog of the GilbertVarshamov distance for the random rank-metric code. We also compare and contrast our result to classical binary linear block codes with the Hamming metric [19]. From our analyses in this section, we obtain geometric intuitions to explain why minimum rank decoding performs well even when the sensing matrices are sparse. We also use these geometric intuitions to guide our derivation of strong recovery guarantees along the lines of the recent work by Eldar et al. [27]. C. Related Work There is a wealth of literature on rank minimization to which we will not be able to do justice here. See for example the seminal works by Fazel et al. [14], [15] and the subsequent works by other authors [2]–[4] (and the references therein). However, all these works focus on the case where the unknown matrix is over the reals. We are interested in the finite field setting because such a problem has many connections with IEEE TRANSACTIONS ON INFORMATION THEORY 3 TABLE I C OMPARISON OF OUR WORK (T AN -BALZANO -DRAPER ) TO EXISTING CODING - THEORETIC TECHNIQUES FOR RANK MINIMIZATION TABLE II C OMPARISONS BETWEEN THE RESULTS IN VARIOUS SECTIONS OF THIS Paper Gabidulin [7] SKK [10] MU [11] SKK [18] GLS [33] TBD Parity-check matrix Ha Random, dense Deterministic, dense Random, sparse Deterministic, sparse Code Structure Algebraic Algebraic Factor Graph Error Trapping Perfect Graph See Table II Decoding Technique Berlekamp-Massey Extended Berlekamp-Massey Error Trapping & Message Passing Error Trapping Semidefinite Program (Ellipsoid) Min-Rank Decoder (Section VIII) and applications to coding and information theory [16], [17], [28]. The analogous problem for the reals was considered by Eldar et al. [27]. The results in [27], developed for dense sensing matrices with i.i.d. Gaussian entries, mirror those in this paper but only achievability results (sufficient conditions) are provided. We additionally analyze the sparse setting. Our work is partially inspired by [29] where fundamental limits for compressed sensing over finite fields were derived. 
To the best of our knowledge, Vishwanath’s work [30] is the only one that employs information-theoretic techniques to derive necessary and sufficient conditions on the number of measurements required for reliable matrix completion (or rank minimization). It was shown using typicality arguments that the number of measurements required is within a logarithmic factor of the lower bound. Our setting is different because we assume that we have linear measurements instead of randomly sampled entries. We are able to show that the achievability and converse match for a family of random sensing matrices. Emad and Milenkovic [31] recently extended the analyses in the conference version [1] of this paper to the tensor case, where the rank, the order of the tensor and the number of measurements grow simultaneously with the size of the matrix. We compare and contrast our decoder and analysis for the noisy case to that in [31]. Another recent related work is that by Kakhaki et al. [32] where the authors considered the binary erasure channel (BEC) and binary symmetric channel (BSC) and empirically studied the error exponents for codes whose generator matrices are random and sparse. For the BEC, the authors showed that there exist capacity-achieving codes with generator matrices whose sparsity factor (density) is O( logn n ) (similar to this work). However, motivated by the fact that sparse parity-check matrices may make decoding amenable to lower complexity message-passing type decoders, we analyze the scenario where the parity-check matrices are sparse. The family of codes known as rank-metric codes [7]–[12], which are the the rank-distance analog of binary block codes equipped with the Hamming metric, bears a striking similarity to the rank minimization problem over finite fields. Comparisons between this work and related works in the coding theory literature are summarized in Table I. Our contributions in the various sections of this paper, and other pertinent references, are summarized in Table II. We will further elaborate on these comparisons in Section IX-A. D. Outline of Paper Section II details our notational choices, describes the measurement models and states the problem. In Section III, we PAPER AND OTHER RELATED WORKS Random low-rank matrix X Section IV Section IV, [18] Section VI Section VI, [11], [18] Deterministic low-rank matrix X Section IV Section VII-C, [7], [10] Section VI Section VII-C use Fano’s inequality to derive a lower bound on the number of measurements for reconstructing the unknown low-rank matrix. In Section IV, we consider the uniformly at random (or equiprobable) model where the entries of the measurement matrices are selected independently and uniformly at random from Fq . We derive a sufficient condition for reliable recovery and the reliability function of the min-rank decoder using de Caen’s lower bound. The results are then extended to the noisy scenario in Section V. Section VI, which contains our main result, considers the case where the measurement matrices are sparse. We derive a sufficient condition on the sparsity factor (density) as well as the number of measurements for reliable recovery. Section VII is devoted to understanding and interpreting the above results from a coding-theoretic perspective. In Section VIII, we provide a procedure to search for the low-rank matrix by exploiting indeterminacies in the problem. Discussions and conclusions are provided in Section IX. The lengthier proofs are deferred to the appendices. II. 
P ROBLEM S ETUP AND M ODEL In this section, we state our notational conventions, describe the system model and state the problem. We also distinguish between the two related notions of weak and strong recovery. A. Notation In this paper we adopt the following set of notations: Serif font and san-serif font denote deterministic and random quantities respectively. Bold-face upper-case and bold-face lower-case denote matrices and (column) vectors respectively. Thus, y, y, X and X denote a deterministic scalar, a scalarvalued random variable, a deterministic matrix and a random matrix respectively. Random functions will also be denoted in san-serif font. Sets (and events) are denoted with calligraphic font (e.g., U or C ). The cardinality of a finite set U is denoted as |U|. For a prime power q, we denote the finite (Galois) field with q elements as Fq . If q is prime, one can identify Fq with Zq = {0, . . . , q − 1}, the set of the integers modulo q. The set of m × n matrices with entries in Fq is . For simplicity, we let [k] := {1, . . . , k} denoted as Fm×n q and yk := (y1 , . . . , yk ). For a matrix M, the notations kMk0 and rank(M) respectively denote the number of non-zero elements in M (the Hamming weight) and the rank of M in Fq . For a matrix M ∈ Fm×n , we also use the notation q vec(M) ∈ Fmn to denote vectorization of M with its columns q stacked on top of one another. For a real number b, the notation |b|+ is defined as max{b, 0}. Asymptotic notation such as O( · ), Ω( · ) and o( · ) will be used throughout. See [34, IEEE TRANSACTIONS ON INFORMATION THEORY 4 TABLE III TABLE OF SYMBOLS USED IN THIS PAPER Notation k r/n → γ σ = kwk0 /n2 α = k/n2 p = Ekwk0 /k δ = EkHa k0 /n2 NC (r) d(C ) Definition Number of measurements Rank-dimension ratio Deterministic noise parameter Measurement scaling parameter Random noise parameter Sparsity factor Num. of matrices of rank r in C Minimum rank distance of C we also analyze the scenario where Ha is relatively sparse. Our setting is more similar in spirit to the rank minimization problems analyzed in Recht et al. [5], Meka et al. [6] and Eldar et al. [27]. However, these works focus on problems in the reals whereas our focus is the finite field setting. Section Section II-B Section II-B Section V-A Section V-B Section V-B Section VI Section VII Section VII C. Problem Statement Sec. I.3] for definitions. For the reader’s convenience, we have summarized the symbols used in this paper in Table III. B. System Model We are interested in the following model: Let X be an unknown (deterministic or random) square2 matrix in Fqn×n whose rank is less than or equal to r, i.e., rank(X) ≤ r. The upper bound on the rank r is allowed to be a function of n, i.e., r = rn . We assume that r/n → γ and we say that the limit γ ∈ [0, 1] is the rank-dimension ratio.3 We would like to recover or estimate X from k linear measurements X [Ha ]i,j [X]i,j a ∈ [k], (2) ya = hHa , Xi := (i,j)∈[n]2 i.e., ya is the trace of Ha XT . In (2), the sensing or measurement matrices Ha ∈ Fqn×n , a ∈ [k], are random matrices chosen according to some probability mass function (pmf). The k scalar measurements ya ∈ Fq , a ∈ [k], are available for estimating X. We will operate in the so-called highdimensional setting and allow the number of measurements k to depend on n, i.e., k = kn . Multiplication and addition in (2) are performed in Fq . 
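To make the measurement model in (2) concrete, the following minimal Python sketch (an illustration only; it assumes q is prime so that F_q arithmetic is arithmetic modulo q, and all function and variable names are ours) generates a rank-r matrix X over F_q as a product of two random factors, draws k sensing matrices with i.i.d. uniform entries, and forms the k scalar measurements y_a = <H_a, X>.

import numpy as np

def random_low_rank(n, r, q, rng):
    # X = A B over F_q has rank at most r.
    A = rng.integers(0, q, size=(n, r))
    B = rng.integers(0, q, size=(r, n))
    return (A @ B) % q

def measure(X, H_list, q):
    # y_a = <H_a, X> = sum_{i,j} [H_a]_{ij} [X]_{ij} mod q, i.e. the trace of H_a X^T.
    return np.array([int(np.sum(H * X)) % q for H in H_list])

rng = np.random.default_rng(0)
n, r, q, k = 8, 2, 5, 30                                        # toy sizes; q must be prime here
X = random_low_rank(n, r, q, rng)
H_list = [rng.integers(0, q, size=(n, n)) for _ in range(k)]    # dense, equiprobable entries
y = measure(X, H_list, q)
print(y)

The same sketch covers the noisy model (3) by adding a noise vector to y entrywise modulo q.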
In the subsequent sections, we will also be interested in a generalization of the model in (2) where the measurements ya , a ∈ [k], may not be noiseless, i.e., ya = hHa , Xi + wa , a ∈ [k], (3) where wa , a ∈ [k], represents random or deterministic noise. We will specify precise noise models in Section V. The measurement models we are concerned with in this paper, (2) and (3), are somewhat different from the matrix completion problem [2]–[4]. In the matrix completion setup, a subset of entries Ω ⊂ [n]2 in the matrix X is observed and one would like to “fill in” the rest of the entries assuming the matrix is low-rank. This model can be captured by (2) by choosing each sensing matrix Ha to be non-zero only in a single position. Assuming Ha 6= Ha′ for all a 6= a′ , the number of measurements is k = |Ω|. In contrast, our measurement models in (2) and (3) do not assume that kHa k0 = 1. The sensing matrices are, in general, dense although in Section VI, 2 Our results are not restricted to the case where X is square but for the most part in this paper, we assume that X is square for ease of exposition. 3 Our results also include the regime where r = o(n) but the case where r = Θ(n) (and γ is the proportionality constant) is of greater interest and significance. This is because the rank r grows as rapidly as possible and hence this regime is the most challenging. Note that if r/n → γ = 1, then we would need n2 measurements to recover X since we are not making any low rank assumptions on it. This is corroborated by the converse in Proposition 2. Our objective is to estimate the unknown low-rank matrix X given yk (and the measurement matrices Ha , a ∈ [k]). In general, given the measurement model in (2) and without any assumptions on X, the problem is ill-posed and it is not possible to recover X if k < n2 . However, because X is assumed to have rank no larger than r (and r/n → γ), we can exploit this additional information to estimate X with k < n2 measurements. Our goal in this paper is to characterize necessary and sufficient conditions on the number of measurements k as n becomes large assuming a particular pmf governing the sensing matrices Ha , a ∈ [k] and under various (random and deterministic) models on X. D. Weak Versus Strong Recovery In this paper, we will focus (in Sections III to VI) on the so-called weak recovery problem where the unknown low-rank matrix X is fixed and we ask how many measurements k are sufficient to recover X (and what the procedure is for doing so). However, there is also a companion problem known as the strong recovery problem, where one would like to recover all matrices in Fqn×n with rank no larger than r. A familiar version of this distinction also arises in compressed sensing.4 More precisely, given k sensing matrices Ha , a ∈ [k], we define the linear operator H : Fqn×n → Fkq as H(X) := [hH1 , Xi, hH2 , Xi, . . . , hHk , Xi]T . (4) Then, a necessary and sufficient condition for strong recovery is that the operator H is injective when restricted to the set of all matrices of rank-2r (or less). In other words, there are no rank-2r (or less) matrices in the nullspace of the operator H [27, Sec. 2]. 
This can be observed by noting that for two matrices X1 and X2 of rank-r (or less) that generate the same linear observations (i.e., H(X1 ) = H(X2 )), their difference X1 − X2 has rank at most 2r by the triangle inequality.5 We would thus like to find conditions on k (via, for example, the geometry of the random code) such that the following subset of Fqn×n (n) R2r := {X ∈ Fqn×n : rank(X) ≤ 2r} (5) is disjoint from the nullspace of H with probability tending to one as n grows. As mentioned in Section II-B, we allow r to 4 Analogously in compressed sensing, consider the combinatorial ℓ -norm 0 optimization problem minx̃∈Fn {kx̃k0 : Ax̃ = y}, where the field F can either be the reals R [27] or a finite field Fq [29]. It can be seen that if we want to recover fixed but unknown s-sparse vector x (weak recovery), s + 1 linear measurements suffice w.h.p. However, for strong recovery where we would like to guarantee recovery for all s-sparse vectors, we need to ensure that the nullspace of the measurement matrix A is disjoint from the set of 2s-sparse vectors. Thus, w.h.p., 2s measurements are required for strong recovery [27], [29]. 5 Note that (A, B) 7→ rank(A − B) is a metric on the space of matrices. IEEE TRANSACTIONS ON INFORMATION THEORY 5 grow linearly with n (with proportionality constant γ). Under (n) the condition that R2r ∩ nullspace(H) = ∅, the solution to the rank minimization problem [stated precisely in (12) below] is unique and correct for all low-rank matrices with probability tending to one as n grows. As we shall see in Section VII-C, the conditions on k for strong recovery are more stringent than those for weak recovery. See the recent paper by Eldar et al. [27, Sec. 2] for further discussions on weak versus strong recovery in the real field setting. for recovery of X to be reliable, i.e., for the probability of Ẽn to tend to zero as n grows. From a linear algebraic perspective, this means we need at least as many measurements as there are degrees of freedom in the unknown object X. Clearly, the bound in (9) applies to both the noisy and the noiseless models introduced in Section II-B. The proof involves an elementary application of Fano’s inequality [35, Sec. 2.10]. Proof: Consider the following lower bounds on the probability of error P(Ẽn ): (a) E. Bounds on the number of low-rank matrices In the sequel, we will find it useful to leverage the following lemma, which is a combination of results stated in [21, Lemma 4], [9, Proposition 1] and [12, Lemma 5]. Lemma 1 (Bounds on the number of low-rank matrices). Let Φq (n, r) and Ψq (n, r) respectively be the number of matrices in Fqn×n of rank exactly r and the number of matrices in Fqn×n of rank less than or equal to r. Note that Ψq (n, r) = P r l=0 Φq (n, l). The following bounds hold: 2 2 2 2 q (2n−2)r−r ≤ Φq (n, r) ≤ 4q 2nr−r , q 2nr−r ≤ Ψq (n, r) ≤ 4q 2nr−r . (6) (7) In other words, we have from (7) and the fact that r/n → γ that | n12 logq Ψq (n, r) − 2γ(1 − γ/2)| → 0. III. A N ECESSARY C ONDITION FOR R ECOVERY This section presents a necessary condition on the scaling of k with n for the matrix X to be recovered reliably, i.e., for the error probability in estimating X to tend to zero as n grows. As with most other converse statements in information theory, it is necessary to assume a statistical model on the unknown object, in this case X. Hence, in this section, we denote the unknown low-rank matrix as X (a random variable). 
We also assume that X is drawn uniformly at random from the set of matrices in Fqn×n of rank less than or equal to r. For an estimator (deterministic or random function) X̂ : Fkq ×(Fqn×n )k → Fqn×n whose range is the set of all Fqn×n -matrices whose rank is less than or equal to r, we define the error event: Ẽn := {X̂(yk , Hk ) 6= X}. (8) This is the event that the estimate X̂(yk , Hk ) is not equal to the true low-rank matrix X. We emphasize that the estimator can either be deterministic or random. In addition, the arguments (yk , Hk ) are random so X̂(yk , Hk ) in the definition of Ẽn is a random matrix. We can demonstrate the following: Proposition 2 (Converse). Fix ε > 0 and assume that X is drawn uniformly at random from all matrices of rank less than or equal to r. Also, assume X is independent of Hk . If, k < (2 − ε)γ (1 − γ/2) n2 (9) then for any estimator X̂ whose range is the set of Fqn×n matrices whose rank is less than or equal to r, P(Ẽn ) ≥ ε/4 > 0 for all n sufficiently large. Proposition 2 states that the number of measurements k must exceed 2nr−r2 (which is approximately 2γ(1−γ/2)n2) P(X̂ 6= X) ≥ (b) = (c) ≥ H(X|yk , Hk )−1 H(X)−I(X; yk , Hk )−1 = logq Ψq (n, r) logq Ψq (n, r) H(X) − H(yk |Hk ) − 1 H(X) − I(X; yk |Hk ) − 1 = logq Ψq (n, r) logq Ψq (n, r) k H(X) − k − 1 (d) − o(1), = 1− logq Ψq (n, r) logq Ψq (n, r) (10) where (a) is by Fano’s inequality (estimating X given yk and Hk ), (b) is because Hk is independent of X so I(X; yk , Hk ) = I(X; yk |Hk ) + I(X; Hk ) = I(X; yk |Hk ). Inequality (c) is due to the fact that ya is q-ary for all a ∈ [k] so H(yk |Hk ) ≤ H(yk ) ≤ kH(y1 ) ≤ k logq q = k, (11) and finally, (d) is due to the uniformity of X. It can be easily verified that if k satisfies (9) for some ε > 0, then k/logq Ψq (n, r) ≤ 1−ε/3 for n sufficiently large by the lower bound in (7) and the convergence r/n → γ. Hence, (10) is larger than ε/4 for all n sufficiently large. We emphasize that the assumption that the sensing matrices Ha , a ∈ [k] are statistically independent of the unknown lowrank matrix X is important. This is to ensure the validity of equality (b) in (10). This assumption is not a restrictive one in practice since the sensing mechanism is usually independent of the unknown matrix. IV. U NIFORMLY R ANDOM S ENSING M ATRICES : T HE N OISELESS C ASE In this section, we assume the noiseless linear model in (2) and provide sufficient conditions for the recovery of a fixed X (a deterministic low-rank matrix) given yk , where rank(X) ≤ r. We will also provide the functional form of the reliability function (error exponent) for this recovery problem. To do so we first consider the following optimization problem: minimize rank(X̃) subject to hHa , X̃i = ya , a ∈ [k] (12) The optimization variable is X̃ ∈ Fqn×n . Thus among all the matrices that satisfy the linear constraints in (2), we select one whose rank is the smallest. We call the optimization problem in (12) the min-rank decoder, denoting the set of minimizers as S ⊂ Fqn×n . If S is a singleton set, we also denote the unique optimizer to (12), a random quantity, as X∗ . We analyze the error probability that either S is not a singleton set or X∗ does not equal the true matrix X, i.e., the error event En := {|S| > 1} ∪ ({|S| = 1} ∩ { X∗ 6= X}). (13) IEEE TRANSACTIONS ON INFORMATION THEORY 6 The optimization in (12) is, in general, intractable (in fact NP-hard) unless there is additional structure on the sensing matrices Ha (See discussions in Section IX). 
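Although (12) is intractable in general, for very small n and q the min-rank decoder can be carried out by exhaustive search, which makes the definition concrete. The sketch below (our own illustration, not the search procedure of Section VIII; q is assumed prime, and rank_gf and min_rank_decode are hypothetical helper names) enumerates all of F_q^{n×n}, keeps the matrices consistent with the measurements, and returns the set S of minimizers of (12).

import itertools
import numpy as np

def rank_gf(M, q):
    # Rank of M over F_q (q prime) via Gaussian elimination.
    M = M.copy() % q
    rank = 0
    for c in range(M.shape[1]):
        piv = next((i for i in range(rank, M.shape[0]) if M[i, c] % q != 0), None)
        if piv is None:
            continue
        M[[rank, piv]] = M[[piv, rank]]
        M[rank] = (M[rank] * pow(int(M[rank, c]), q - 2, q)) % q   # scale pivot row to 1
        for i in range(M.shape[0]):
            if i != rank and M[i, c] % q:
                M[i] = (M[i] - M[i, c] * M[rank]) % q
        rank += 1
    return rank

def min_rank_decode(y, H_list, n, q):
    # Brute-force version of (12): feasible only for tiny n since it scans q^(n^2) candidates.
    best, S = n + 1, []
    for entries in itertools.product(range(q), repeat=n * n):
        Z = np.array(entries).reshape(n, n)
        if all(int(np.sum(H * Z)) % q == int(ya) for H, ya in zip(H_list, y)):
            rz = rank_gf(Z, q)
            if rz < best:
                best, S = rz, [Z]
            elif rz == best:
                S.append(Z)
    return S

For instance, with q = 2 and n = 3 only 2^9 = 512 candidates are scanned; with k comfortably above the threshold in (15) below, S should typically reduce to the single matrix X.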
Our focus, in this paper, is on the information-theoretic limits for solving (12) and its variants. We remark that the minimization problem is reminiscent of Csiszár’s so-called α-decoder for linear codes [36]. In [36], Csiszár analyzed the error exponent of the decoder that minimizes a function α( · ) [e.g., the entropy H( · )] of the type (or empirical distribution) of a sequence subject to the sequence satisfying a set of linear constraints. For this section and Section V, we assume that each element in each sensing matrix is drawn independently and uniformly at random from Fq , i.e., from the pmf ∀ h ∈ Fq . Ph (h; q) = 1/q, (14) We call this the uniform or equiprobable measurement model. For simplicity, throughout this section, we use the notation P to denote the probability measure associated to the equiprobable measurement model. less than or equal to the rank of X. Furthermore, we claim that P(AZ ) = q −k for every Z 6= X. This follows because P(AZ ) = P(hZ − X, Ha i = 0, a ∈ [k]) (a) In this subsection, we assume the noiseless linear model in (2). We can now exploit ideas from [29] to demonstrate the following achievability (weak recovery) result. Recall that X is non-random and fixed, and we are asking how many measurements y1 , . . . , yk are sufficient for recovering X. Proposition 3 (Achievability). Fix ε > 0. Under the uniform measurement model as in (14), if k > (2 + ε)γ (1 − γ/2) n2 (15) (18) where (a) follows from the fact that the Ha are i.i.d. matrices and (b) from the fact Z − X 6= 0 and every non-zero element in a finite field has a (unique) multiplicative inverse so P(hZ − X, H1 i = 0) = q −1 [29], [36]. More precisely, this is because hZ − X, H1 i has distribution Ph by independence and uniformity of the elements in H1 . Since r/n → γ, for any fixed η ′ > 0, |r/n − γ| ≤ η ′ for all n sufficiently large. By the uniform continuity of the function t 7→ 2t−t2 on t ∈ [0, 1], for any η > 0, |(2nr − r2 )/n2 − 2γ(1 − γ/2)| ≤ η for all n ≥ Nη (an integer just depending on η). Now by combining (18) with the union of events bound, X (c) q −k ≤ Ψq (n, r) q −k P(En ) ≤ Z:Z6=X,rank(Z)≤rank(X) (d) A. A Sufficient Condition for Recovery in the Noiseless Case (b) = P(hZ − X, H1 i = 0)k = q −k , ≤ 4q 2nr−r 2 −k (e) 2 ≤ 4q −n [−2γ(1−γ/2)−η+k/n2 ] , (19) where (c) follows because rank(X) ≤ r, (d) follows from the upper bound in (7) and (e) follows for all n sufficiently large as argued above. Thus, we see that if k satisfies (15), the exponent in (19) is positive if we choose η ′ sufficiently small so that η < εγ(1 − γ/2). Hence P(En ) → 0 as desired. Remark: Here and in the following, we can, without loss of generality, assume that r = ⌊γn⌋ (in place of r/n → γ). In this way, we can remove the effect of the small positive constant η as in the above argument. This simplification does not affect the precision of any of the arguments in the sequel. then P(En ) → 0 as n → ∞. Note that the number of measurements stipulated by Proposition 3 matches the information-theoretic lower bound in (9). In this sense, the min-rank decoder prescribed by the optimization problem in (12) is asymptotically optimal, i.e., the bounds are sharp. Note also that in the converse (Proposition 2), the range of the decoder X̂( · ) is constrained to be the set of matrices whose rank does not exceed r. Hence, the decoder in the converse has additional side information – namely the upper bound on the rank. For the min-rank decoder in (12), no such knowledge of the rank is required and yet it meets the lower bound. 
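The crux of the proof is that hZ − X, H1i is uniformly distributed on F_q whenever Z − X ≠ 0 and the entries of H1 are i.i.d. uniform, so that each event A_Z has probability q^{−k}. This single-measurement claim is easy to check empirically; the following sketch (our own sanity check, with a randomly chosen nonzero difference matrix D standing in for Z − X) estimates the distribution of hD, H1i by Monte Carlo.

import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
n, q, trials = 6, 3, 200_000
D = rng.integers(0, q, size=(n, n))                 # a fixed nonzero difference matrix D = Z - X
assert np.any(D % q)
counts = Counter(int(np.sum(rng.integers(0, q, size=(n, n)) * D)) % q
                 for _ in range(trials))
print({v: round(c / trials, 3) for v, c in sorted(counts.items())})   # each value ~ 1/q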
We remark that the packing-like achievability proof is much simpler than the typicality-based argument presented by Vishwanath in [30] (albeit in a different setting). Proof: To each matrix Z ∈ Fqn×n that is not equal to X and whose rank is no greater than rank(X), define the event AZ := {hZ, Ha i = hX, Ha i, ∀ a ∈ [k]}. (16)  We have shown in the previous section that the min-rank decoder is asymptotically optimal in the sense that the number of measurements required for it to decode X reliably with P(En ) → 0 matches the lower bound (necessary condition) on k (Proposition 2). It is also interesting to analyze the rate of decay of P(En ) for the min-rank decoder. For this purpose, we define the rate R of the measurement model. Definition 1. The rate of (a sequence of) linear measurement models as in (2) is defined as n2 − k k = lim 1 − 2 2 n→∞ n→∞ n n assuming the limit exists. Note that R ∈ [0, 1]. R := lim (20) The use of the term rate is in direct analogy to the use of the term in coding theory. The rate of the linear code C := {C ∈ Fqn×n : hC, Ha i = 0, a ∈ [k]} Then we note that  B. The Reliability Function (21) (17) is Rn := 1 − dim(span{vec(H1 ), . . . , vec(Hk )})/n2 , which is lower bounded6 by 1 − k/n2 for every k = 0, 1, . . . , n2 . since an error occurs if and only if there exists a matrix Z 6= X such that (i) Z satisfies the linear constraints, (ii) its rank is 6 The lower bound is achieved when the vectors vec(H ), . . . , vec(H ) are 1 k linearly independent in Fq . See Section VII, and in particular Proposition 14, for details when the sensing matrices are random. P(En ) = P  [ Z:Z6=X,rank(Z)≤rank(X) AZ  IEEE TRANSACTIONS ON INFORMATION THEORY 7 We revisit the connection of the rank minimization problem to coding theory (and in particular to rank-metric codes) in detail in Section VII. Definition 2. If the limit exists, the reliability function or error exponent of the min-rank decoder (12) is defined as E(R) := lim − n→∞ 1 logq P(En ). n2 Proposition 4 (Upper bound on E(R)). Assume that rank(X)/n → γ̃ as n → ∞. Under the uniform measurement model in (14) and assuming the min-rank decoder is used, + lim sup − n→∞ k 1 logq P(En ) ≤ −2γ̃ (1 − γ̃/2) + η + lim 2 , n→∞ n n2 (25) 2 (22) We show in Corollary 7 that the limit in (22) indeed exists. Unlike the usual definition of the reliability function [37, Eq. (5.8.8)], the normalization in (22) is 1/n2 since X is an n × n matrix.7 Also, we restrict our attention to the min-rank decoder. The following proposition provides an upper bound on the reliability function of the min-rank decoder when there is no noise in the measurements as in (2). E(R) ≤ |(1 − R) − 2γ̃ (1 − γ̃/2)| . 1 − R > 2γ̃ (1 − γ̃/2), the normalized logarithm of the error probability can now be simplified as Corollary 7 (Reliability function). Under the assumptions of Proposition 4, the error exponent of the min-rank decoder is E(R) = |(1 − R) − 2γ̃ (1 − γ̃/2)|+ . Lemma 5 (de Caen [23]). Let (Ω, F , Q) be a probability space. For a finite number events B1 , . . . , BM ∈ F , the probability of their union can be lower bounded as ! M M X [ Q(Bm )2 Q . 
(24) Bm ≥ PM m=1 m=1 m′ =1 Q(Bm ∩ Bm′ ) (26) Proof: The lower bound on E(R) follows from the achievability in (19), which may be strengthened as follows: + 2 2 P(En ) ≤ 4q −n |−2γ̃(1−γ̃/2)−η+k/n | , (23) The proof of this result hinges on the pairwise independence of the events AZ and de Caen’s inequality [23], which for the reader’s convenience, we restate here: 2 where we used the fact that 4q n [2γ̃(1−γ̃/2)+η−k/n ] → 0 for sufficiently small η > 0. The case where 1−R ≤ 2γ̃ (1 − γ̃/2) results in E(R) = 0 because P(En ) fails to converge to zero as n → ∞. The proof of the upper bound of the reliability function is completed by appealing to the definition of R in (20) and the arbitrariness of η > 0. (27) since P(En ) can also be upper bounded by unity. Now, because | · |+ is continuous, the lower limit of the normalized logarithm of the bound in (27) can be expressed as follows: + 1 k logq P(En ) ≥ −2γ̃ (1 − γ̃/2) − η + lim 2 . n→∞ n→∞ n n2 (28) Combining the upper bound in Proposition 4 and the lower bound in (28) and noting that η > 0 is arbitrary yields the reliability function in (26). We observe that pairwise independence of the events AZ We now prove Proposition 4. (Lemma 6) is essential in the proof of Proposition 4. Pairwise Proof: In order to apply (24) to analyze the error proba- independence is a consequence of the linear measurement bility in (17), we need to compute the probabilities P(AZ ) and model in (2) and the uniformity assumption in (14). Note that P(AZ ∩ AZ′ ). The former is q −k as argued in (18). The latter the events AZ are not jointly (nor triple-wise) independent. But uses the following lemma which is proved in Appendix A. the beauty of de Caen’s bound allows us to exploit the pairwise Lemma 6 (Pairwise Independence). For any two distinct independence to lower bound P(En ) and thus to obtain a matrices Z and Z′ , neither of which is equal to X, the events tight upper bound on E(R). To draw an analogy, just as only pairwise independence is required to show that linear codes AZ and AZ′ (defined in (16)) are independent. achieve capacity in symmetric DMCs, de Caen’s inequality As a result of this lemma, P(AZ ∩AZ′ ) = P(AZ )P(AZ′ ) = allows us to move the exploitation of pairwise independence q −2k if Z 6= Z′ and P(AZ ∩ AZ′ ) = P(AZ ) = q −k if Z = Z′ . into the error exponent domain to make statements about the Now, we apply the lower bound (24) to P(En ) noting from (17) error exponent behavior of ensembles of linear codes. that En is the union of all AZ such that Z 6= X and rank(Z) ≤ A natural question arises: Is E(R) given in (26) the largest r̃ := rank(X). Then, for a fixed η > 0, we have possible exponent over all decoders X̂( · ) for the model in which Ha follows the uniform pmf? We conjecture that this X q −2k ! is indeed the case, but a proof remains elusive. P(En ) ≥ P Z:Z6=X 1) Comparison of error exponents to existing works [38]: −k 1 + q −k Z′ :Z′ 6=X,Z rank(Z)≤rank(X) q ′ As mentioned in the Introduction, the preceding results can rank(Z )≤rank(X) 2 2 2 be interpreted from a coding-theoretic perspective. This is (a) (q 2nr̃−r̃ − 1)q −k (b) q n [2γ̃(1−γ̃/2)−η−k/n ] − q −k indeed what we will do in Section VII. In this subsection, ≥ , ≥ 1 + 4q 2nr̃−r̃2 −k 1 + 4q n2 [2γ̃(1−γ̃/2)+η−k/n2 ] we compare the reliability function derived in Corollary 7 where (a) is from the upper and lower bounds in (7) and with three other coding techniques present in the literature. (b) holds for all n sufficiently large since r̃/n → γ̃. 
See First, we have the well-known construction of maximum rank argument justifying inequality (c) in (19). Assuming that distance (MRD) codes by Gabidulin [7]. Second, we have the error trapping technique [18] alluded to in Section I-A. 7 The “block-length” of the code C in (21) is n2 . Third, we have a combination of the two preceding code lim inf − IEEE TRANSACTIONS ON INFORMATION THEORY 8 elements of Ha are i.i.d. and uniform in Fq . The noise w is first assumed in Section V-A to be deterministic but unknown. We then extend our results to the situation where w is a random vector in Section V-B. constructions which is discussed in [18, Section VI.E]. To perform this comparison, we define another reliability function E1 (R) that is “normalized by n”. This is simply the quantity in (22) where the normalization is 1/n instead of 1/n2 . We now denote the reliability function normalized by n2 as in (22) by E2 (R). We also use various superscripts on E1 and E2 to denote different coding schemes. Hence, for our encoding and decoding strategy using random sensing and min-rank decoding (RSMR), E1RSMR (R) = ∞ for all R ≤ (1 − γ)2 and E2RSMR (R) is given by (26). Since Gabidulin codes are MRD, they achieve the Singleton bound [12, Section III]) for rank-metric codes given by n2 − k ≤ n(n − dR + 1), where dR is the minimum rank distance of the code in (21) [See exact definitions in (48) and (49)]. Thus, it can be verified that for j = 1, 2,  ∞ R ≤ 1 − 2γ EjGab (R) = . (29) 0 else In the deterministic setting, we assume that kwk0 = ⌊σn2 ⌋ for some noise level σ ∈ (0, k/n2 ]. Instead of using the minimum entropy decoder as in [29] (also see [36]), we consider the following generalization of the min-rank decoder: From [18, Section IV.B, Eq. (12)], it can also be checked that for the error trapping coding strategy, assuming the low-rank error matrix is uniformly distributed over those of rank r, √ + E2ET (R) = 0. (30) E1ET (R) = 1 − γ − R , Proposition 8 (Achievability under deterministic noisy measurement model). Fix ε > 0 and choose λ = 1/n. Assume the uniform measurement model and that kwk0 = ⌊σn2 ⌋. If Finally, from [18, Section VI.E], for the combination of Gabidulin coding and error trapping, under the same condition of uniformity, E1GabET (R) = 1 − γ − R 1−γ + , E2GabET (R) = 0. (31) Note that for the error exponents in (29), (30) and (31), the randomness is over the low-rank error matrix X and not the code construction, which is deterministic. In contrast, our coding strategy RSMR involves a random encoding scheme. It can be seen from (29) to (31) that there is a non-trivial interval of rates R := [1 − 2γ, (1 − γ)2 ] in which our reliability functions E1RSMR (R) and E2RSMR (R) are the best (largest). Indeed, in the interval R, E1RSMR (R) = ∞ and our result in (22) implies that E2RSMR (R) > 0 whereas all the abovementioned coding schemes give E2 (R) = 0. Thus, using both a random code for encoding and min-rank decoding is advantageous from a reliability function standpoint in the regime R ∈ R. Furthermore, as we shall see from (40) in Section VI which deals with the sparse sensing setting (SRSMR), E1SRSMR (R) = ∞ and E2SRSMR (R) = 0 for all R ≤ (1 − γ)2 . Such an encoding scheme using sparse parity-check matrices may be amenable for the design of low-complexity decoding strategies that also have good error exponent properties. 
In general though, our min-rank decoder requires exhaustive search (though Section VIII proposes techniques to reduce the search space), while all the preceding techniques have polynomial-time decoding complexity. V. U NIFORMLY R ANDOM S ENSING M ATRICES : T HE N OISY C ASE We now generalize the noiseless model and the accompanying results in Section IV to the case where the measurements yk are noisy as in (3). As in Section IV, we assume that the A. Deterministic Noise minimize rank(X̃) + λkw̃k0 subject to hHa , X̃i + w̃a = ya , a ∈ [k] (32) The optimization variables are X̃ ∈ Fqn×n and w̃ ∈ Fkq . The parameter λ = λn > 0 governs the tradeoff between the rank of the matrix X and the sparsity of the vector w. Let Hq (p) := −p logq (p) − (1 − p) logq (p) be the (base-q) binary entropy. k> (3 + ε)(γ + σ)[1 − (γ + σ)/3] 2 n , 1 − H2 [1/(3 − (γ + σ))] logq 2 (33) then P(En ) → 0 as n → ∞. The proof of this proposition is provided in Appendix B. Since the prefactor in (33) is a monotonically increasing function in the noise level σ, the number of measurements increases as σ increases, agreeing with intuition. Note that the regularization parameter λ is chosen to be 1/n and is thus independent of σ. Hence, the decoder does not need to know the true value of the noise level σ. The factor of 3 (instead of 2) in (33) arises in part due to the uncertainty in the locations of the non-zero elements of the noise vector w. We remark that Proposition 8 does not reduce to the noiseless case (σ = 0) in Proposition 3 because we assumed a different measurement model in (3), and employed a different bounding technique. The measurement complexity in (33) is suboptimal, i.e., it does not match the converse in (9). This is because the decoder in (32) estimates both the matrix X and the noise w whereas in the derivation of the converse, we are only concerned with reconstructing the unknown matrix X. By decoding (X, w) jointly, the analysis proceeds along the lines of the proof of Proposition 3. It is unclear whether a better parameter-free decoding strategy exists in the presence of noise and whether such a strategy is also amenable to analysis. The noisy setting was also analyzed in [31] but, as in our work, the number of measurements for achievability does not match the converse. B. Random Noise We now consider the case where the noise in (3) is random, i.e., w = (w1 , . . . , wk ) ∈ Fkq is a random vector. We assume the noise vector w is i.i.d. and each component is distributed according to any pmf for which Pw (w; p) = 1 − p if w = 0. (34) IEEE TRANSACTIONS ON INFORMATION THEORY 9 Plot of the critical α against p for q = 2 0.8 Corollary 9 (Converse under random noise model). Assume the setup in Proposition 2 and consider the noisy measurement model given by (3) and (34). Additionally, assume that X, Hk and w are jointly independent. If, (2 − ε)γ (1 − γ/2) 2 n 1 − Hq (p) Note that the probability of error P(Ẽn ) above is computed over both the randomness in the sensing matrices Ha and in the noise w. The proof is given in Appendix C. From (35), the number of measurements necessarily has to increase by a factor of 1/(1 − Hq (p)) for reliable recovery. As expected, for a fixed q, the larger the crossover probability p ∈ (0, 1/2), the more measurements are required. The converse is illustrated for different parameter settings in Figs. 1 and 2. 
To present our achievability result compactly, we assume that k = ⌈αn2 ⌉ for some scaling parameter α ∈ (0, 1), i.e., the number of observations is proportional to n2 and the constant of proportionality is α. We would like to find the range of values of the scaling parameter α such that reliable recovery is possible. Recall that the upper bound on the rank is r and the noise vector has expected weight pk ≈ pαn2 . Corollary 10 (Achievability under random noisy measurement model). Fix ε > 0 and choose λ = 1/n. Assume the uniform measurement model and that k = ⌈αn2 ⌉. Define the function   g(α; p, γ) := α 1 − (logq 2)H2 (p + γ/α) − 2p(1 − γ) +α2 p2 . (36) If the tuple (α, p, γ) satisfies the following inequality: g(α; p, γ) ≥ (2 + ε)γ(1 − γ/2), 0.2 (35) then for any estimator, P(Ẽn ) ≥ ε/4 > 0 for all n sufficiently large, where Ẽn is defined in (8). 0.6 0.4 0 0 0.02 0.04 0.06 Crossover probability p 0.08 0.1 Fig. 1. Plot of αcrit against p for q = 2. Both αcrit for the converse (con) in (35) the achievability (ach) in (37) are shown. All α’s below the converse curves are not achievable. Plot of the critical α against p for q = 256 0.22 0.2 0.18 0.16 αcrit k< γ = 0.075 (ach) γ = 0.050 (ach) γ = 0.025 (ach) γ = 0.075 (con) γ = 0.050 (con) γ = 0.025 (con) 1 αcrit This pmf represents a noisy channel where every symbol is changed to some other (different) symbol independently with crossover probability p ∈ (0, 1/2). We can ask how many measurements are necessary and sufficient for recovering a fixed X in the presence of the additive stochastic noise w. Also, we are interested to know how this measurement complexity depends on p. We leverage on Propositions 2 and 8 to derive a converse result and an achievability result respectively. We start with the converse, which is partially inspired by Theorem 3 in [31]. 0.14 0.12 0.1 0.08 0.06 0.04 Fig. 2. 0 0.02 0.04 0.06 Crossover probability p 0.08 0.1 Plot of αcrit against p for q = 256. See Fig. 1 for the legend. (37) then P(En ) → 0 as n → ∞. The proof of this corollary uses typicality arguments and is presented in Appendix D. As in the deterministic noise setting, the sufficient condition in (37) does not reduce to the noiseless case (p = 0) in Proposition 3. It also does not match the converse in (35). This is due to the different bounding technique employed to prove Corollary 10 [both X and w are decoded in (32)]. In addition, the inequality in (37) does not admit an analytical solution for α. Hence, we search for the critical α [the minimum one satisfying (37)] numerically for some parameter settings. See Figs. 1 and 2 for illustrations of how the critical α varies with (p, γ) when the field size is small (q = 2) and when it is large (q = 256). From Fig. 1, we observe that the noise results in a significant increase in the critical value of the scaling parameter α when q = 2. We see that for a rank-dimension ratio of γ = 0.05 and with a crossover probability of p = 0.02, the critical scaling parameter is αcrit ≈ 0.32. Contrast this to the noiseless case (Proposition 3) and the converse result for the noisy case (Corollary 9) which stipulate that the critical scaling parameters are 2γ(1 − γ/2) ≈ 0.098 and 2γ(1 − γ/2)/(1 − H2 (p)) ≈ 0.114 respectively. Hence, we incur roughly a threefold increase in the number of measurements to tolerate a noise level of p = 2%. This phenomenon is due to our incognizance of the locations of the non-zero elements of w (and hence knowledge of which measurements ya are reliable). 
In contrast to the reals, in the IEEE TRANSACTIONS ON INFORMATION THEORY 10 Comparison of αcrit between TBD and EM For example, in low-density parity-check (LDPC) codes, the parity-check matrix (analogous to the set of Ha matrices) is sparse. The sparsity aids in decoding via the sum-product algorithm [39] as the resulting Tanner (factor) graph is sparse [26]. In [32], the authors considered the case where the generator matrices are sparse and random but their setting is restricted to the BSC and BEC channel models. In this section, we revisit the noiseless model in (2) and analyze the scenario where the sensing matrices are sparse on average. More precisely, each element of Ha , a ∈ [k], is assumed to be an i.i.d. random variable with associated pmf  1−δ h=0 Ph (h; δ, q) := . (39) δ/(q − 1) h ∈ Fq \ {0} 0.55 p = 0.05 (TBD) p = 0.10 (TBD) p = 0.05 (EM) p = 0.10 (EM) p = 0.05 (con) p = 0.10 (con) 0.5 0.45 αcrit 0.4 0.35 0.3 0.25 0.2 0.15 0.1 1 2 3 4 5 6 7 8 log2 (q) Fig. 3. Plot of αcrit against log2 (q) for our work (TBD Corollary 10), the converse in Corollary 9 and Emad and Milenkovic (EM) [31]. finite field setting, there is no notion of the “size” of the noise (per measurement). Hence, estimation performance in the presence of noise does not degrade as gracefully as in the reals (cf. [6, Theorem 1.2]). However, when the field size is large (more likened to the reals), the degradation is not as severe. This is depicted in Fig. 2. Under the same settings as above, αcrit ≈ 0.114, which is not too far from the converse (2γ(1 − γ/2)/(1 − H256 (p)) ≈ 0.099). As a final remark, we compare the decoders for the noisy model in (32) and that in [31]. In [31], the authors considered the (analog of) following decoder (for tensors): minimize rank(X̃) subject to k yX̃ − y k0 ≤ τ, (38) where yX̃ := [hH1 , X̃i . . . hHk , X̃i]T and y = yk is the noisy observation vector in (3). However, the threshold τ that constrains the Hamming distance between yX̃ and y is not straightforward to choose.8 Our decoder, in contrast, is parameter-free because the regularization constant λ in (32) can be chosen to be 1/n, independent of all other parameters. In addition, Fig. 3 shows that at high q, our decoder and analysis result in a better (smaller) αcrit than that in [31]. Our decoding scheme gives a bound that is closer to the converse at high q while the decoding scheme in [31] is farther. The slight disadvantage of our decoder is that the number of measurements in (37) cannot be expressed in closed-form. VI. S PARSE R ANDOM S ENSING M ATRICES In the previous two sections, we focused exclusively on the case where the elements of the sensing matrices Ha , a ∈ [k], are drawn uniformly from Fq . However, there is substantial motivation to consider other ensembles of sensing matrices. 8 In fact, the achievability result of Theorem 4 in [31] says that τ = ηk where η ∈ (p, (q − 1)/q) but for our optimization program in (32), the decoder does not need to know the crossover probability p. Note that if δ is small, then the probability that an entry in Ha is zero is close to unity. The problem of deriving a sufficient condition for reliable recovery is more challenging as compared to the equiprobable case since (18) no longer holds (compare to Lemma 21). Roughly speaking, the matrix X is not sensed as much as in the equiprobable case and the measurements yk are not as informative because Ha , a ∈ [k], are sparse. 
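A small sketch (our own; the constant c and all names are arbitrary) shows what the sparse ensemble in (39) looks like in practice: with sparsity factor δ = c·log(n)/n, each sensing matrix has on average δn² = c·n·log n non-zero entries, far fewer than the n² of the dense ensemble. Measurements are then formed from these sparse H_a exactly as in (2).

import numpy as np

def sparse_sensing_matrix(n, q, delta, rng):
    # Entry is 0 w.p. 1 - delta and uniform on F_q \ {0} w.p. delta, as in the pmf (39).
    mask = rng.random((n, n)) < delta
    return mask * rng.integers(1, q, size=(n, n))

rng = np.random.default_rng(2)
n, q, c = 200, 2, 3.0
delta = c * np.log(n) / n                            # sparsity factor of order log(n)/n
H = sparse_sensing_matrix(n, q, delta, rng)
print(np.count_nonzero(H), int(delta * n * n))       # observed vs expected ~ c * n * log n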
In the rest of this section, we allow the sparsity factor δ to depend on n but we do not make the dependence of δ on n explicit for ease of exposition. The question we would like to answer is: How fast can δ decay with n such that the min-rank decoder is still reliable for weak recovery? Theorem 11 (Achievability under sparse measurement model). Fix ε > 0 and let δ be any sequence in Ω( logn n ) ∩ o(1). Under the sparse measurement model as in (39), if the number of measurements k satisfies (15) for all n > Nε,δ , then P(En ) → 0 as n → ∞. The proof of Theorem 11, our main result, is detailed in Appendix E. It utilizes a “splitting” technique to partition the set of misleading matrices {Z 6= X : rank(Z) ≤ rank(X)} into those with low Hamming distance from X and those with high Hamming distance from X. Observe that the sparsity-factor δ is allowed to tend to zero albeit at a controlled rate of Ω( logn n ). Thus, each Ha is allowed to have, on average, Ω(n log n) non-zero entries (out of n2 entries). The scaling rate is reminiscent of the number of trials required for success in the so-called coupon collector’s problem. Indeed, it seems plausible that we need at least one entry in each row and one entry in each column of X to be sensed (by a sensing matrix Ha ) for the min-rank decoder to succeed. It can easily be seen that if δ = o( logn n ), there will be at least one row and one column in Ha of zero Hamming weight w.h.p. Really surprisingly, the number of measurements required in the δ = Ω( logn n )-sparse sensing case is exactly the same as in the case where the elements of Ha are drawn uniformly at random from Fq in Proposition 3. In fact it also matches the information-theoretic lower bound in Proposition 2 and hence is asymptotically optimal. We will analyze this weak recovery sparse setting (and understand why it works) in greater detail by studying minimum distance properties of sparse paritycheck rank-metric codes in Section VII-B. The sparse scenario may be extended to the noisy case by combining the proof techniques in Proposition 8 and Theorem 11. IEEE TRANSACTIONS ON INFORMATION THEORY 11 There are two natural questions at this point: Firstly, can the reliability function be computed for the min-rank decoder assuming the sparse measurement model? The events AZ , defined in (16), are no longer pairwise independent. Thus, it is not straightforward to compute P(AZ ∩AZ′ ) as in the proof of Proposition 4. Further, de Caen’s lower bound may not be tight as in the case where the entries of the sensing matrices are drawn uniformly at random from Fq . Our bounding technique for Theorem 11 only ensures that 1 logq P(En ) ≤ −C (40) lim sup n→∞ n log n for some non-trivial C ∈ (0, ∞). Thus, instead of having a speed9 of n2 in the large-deviations upper bound, we have a speed of n log n. This is because δ is allowed to decay to zero. Whether the speed n log n is optimal is open. Secondly, is δ = Ω( logn n ) the best (smallest) possible sparsity factor? Is there a fundamental tradeoff between the sparsity factor δ and (a bound on) the number of measurements k? We leave these for further research. VII. C ODING -T HEORETIC I NTERPRETATIONS AND M INIMUM R ANK D ISTANCE P ROPERTIES This section is devoted to understand the coding-theoretic interpretations and analogs of the rank minimization problem in (12). 
In particular, we would like to understand the geometry of the random linear rank-metric codes that underpin the optimization problem in (12), for both the equiprobable ensemble in (14) and the sparse ensemble in (39).

As mentioned in the Introduction, there is a natural correspondence between the rank minimization problem and rank-metric decoding [7]–[12]. In the former, we solve a problem of the form (12). In the latter, the code C typically consists of length-n vectors^10 whose elements belong to the extension field F_{q^n}, and these vectors in F_{q^n}^n belong to the kernel of some linear operator H. A particular vector codeword c ∈ C is transmitted. The received word is r = c + x, where x is assumed to be a low-rank "error" vector. (By the rank of a vector we mean the following: fix a basis of F_{q^n} over F_q; the rank of a vector a ∈ F_{q^n}^n is then defined as the rank of the matrix A ∈ F_q^{n×n} whose elements are the coefficients of a in that basis. See [10, Sec. VI.A] for details of this isomorphic map.) The optimization problem for decoding c given r is then

  minimize rank(r − c) subject to c ∈ C,    (41)

which is identical to the min-rank problem in (12) with the identification of the low-rank error vector x ≡ r − c. Note that the matrix version of the vector r (assuming a fixed basis), denoted as R, satisfies the linear constraints in (2). Since the assignment (A, B) ↦ rank(A − B) is a metric on the space of matrices [10, Sec. II.B], the problem in (41) can be interpreted as a minimum (rank) distance decoder.

^9 The term speed is in direct analogy to the theory of large deviations [40], where P_n is said to satisfy a large-deviations upper bound with speed a_n and rate function J(·) if lim sup_{n→∞} a_n^{-1} log P_n(E) ≤ −inf_{x ∈ cl(E)} J(x).

^10 We abuse notation by using a common symbol C to denote both a code consisting of vectors with elements in F_{q^n} and a code consisting of matrices with elements in F_q.

A. Distance Properties of Equiprobable Rank-Metric Codes

We formalize the notion of an equiprobable linear code and analyze its rank distance properties in this section. The results we derive here are the rank-metric analogs of the results in Barg and Forney [19] and will prove useful in shedding light on the geometry involved in the sufficient condition for recovering the unknown low-rank matrix X in Proposition 3.

Definition 3. A rank-metric code is a non-empty subset of F_q^{n×n} endowed with the rank distance (A, B) ↦ rank(A − B).

Definition 4. We say that C ⊂ F_q^{n×n} is an equiprobable linear rank-metric code if

  C := {C ∈ F_q^{n×n} : ⟨C, H_a⟩ = 0, a ∈ [k]},    (42)

where H_a, a ∈ [k], are random matrices in which each entry is statistically independent of the other entries and equiprobable in F_q, i.e., with pmf given in (14). Each matrix C ∈ C is called a codeword. Each matrix H_a is said to be a parity-check matrix.

Recall that the inner product is defined as ⟨C, H_a⟩ = Tr(C H_a^T). We reiterate that in the coding theory literature [7]–[12], rank-metric codes usually consist of length-n vectors c ∈ C whose elements belong to the extension field F_{q^n}. We refrain from adopting this approach here as we would like to make direct comparisons to the rank minimization problem, where the measurements are generated as in (2).^11 Hence, the term codewords will always refer to matrices in C.

Definition 5. The number of codewords in the code C of rank r (r = 0, 1, . . . , n) is denoted as N_C(r).

Note that N_C(r) is a random variable since C ⊂ F_q^{n×n} is a random subspace. (A toy numerical illustration of N_C(r) for a small equiprobable code over F_2 is given below.)
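As that toy illustration of Definitions 4 and 5 (our own sketch; the dimensions n = 4, k = 6, the choice q = 2, and the helper routines are ours, not from the paper), the following code samples an equiprobable linear rank-metric code over F_2 by drawing k uniform parity-check matrices, enumerates the codewords as the nullspace of the stacked vectorized checks, and tallies N_C(r) for r = 0, ..., n:

import itertools
import numpy as np

def gf2_row_reduce(A):
    # Reduced row echelon form over GF(2); returns (R, pivot column indices).
    A = A.copy().astype(np.uint8) % 2
    pivots, row = [], 0
    for col in range(A.shape[1]):
        hits = np.nonzero(A[row:, col])[0]
        if hits.size == 0:
            continue
        A[[row, hits[0] + row]] = A[[hits[0] + row, row]]
        for rr in range(A.shape[0]):
            if rr != row and A[rr, col]:
                A[rr] ^= A[row]
        pivots.append(col)
        row += 1
        if row == A.shape[0]:
            break
    return A, pivots

def gf2_rank(A):
    return len(gf2_row_reduce(A)[1])

def gf2_nullspace_basis(H):
    # Basis (as rows) of {v : H v = 0} over GF(2).
    R, pivots = gf2_row_reduce(H)
    ncols = H.shape[1]
    free = [c for c in range(ncols) if c not in pivots]
    basis = []
    for f in free:
        v = np.zeros(ncols, dtype=np.uint8)
        v[f] = 1
        for i, p in enumerate(pivots):       # back-substitute the pivot variables
            v[p] = R[i, f]
        basis.append(v)
    return np.array(basis, dtype=np.uint8)

rng = np.random.default_rng(0)
n, k = 4, 6                                  # toy sizes, q = 2
H = rng.integers(0, 2, size=(k, n * n), dtype=np.uint8)    # vectorized parity checks
B = gf2_nullspace_basis(H)                   # the code C is the span of these rows

counts = np.zeros(n + 1, dtype=int)          # counts[r] will hold N_C(r)
for coeffs in itertools.product([0, 1], repeat=B.shape[0]):
    c = np.zeros(n * n, dtype=np.uint8)
    for t, b in zip(coeffs, B):
        if t:
            c ^= b
    counts[gf2_rank(c.reshape(n, n))] += 1

print("empirical N_C(r) for r = 0, ..., n:", counts)

Rerunning with fresh parity checks gives a different realization of the random variables N_C(r); averaging over many draws approximates E N_C(r), whose moments are characterized in Lemma 12 below.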
This quantity can also be expressed as X NC (r) := I{M ∈ C }, (43) M∈Fn×n :rank(M)=r q where I{M ∈ C } is the (indicator) random variable which takes on the value one if M ∈ C and zero otherwise. Note that the matrix M is deterministic, while the code C is random. We remark that the decomposition of NC (r) in (43) is different 11 The usual approach to defining linear rank-metric codes [7], [8] is the following: Every codeword in the codebook, c ∈ Fn , is required to satisfy qN Pn the m parity-check constraints i=1 ha,i ci = 0 ∈ Fq N for a ∈ [m] and where ha,i ∈ FqN and ci ∈ FqN are, respectively, the i-th elements of ha and c. Note that in the paper we focus on the case N = n, but make the distinction here to connect directly with the coding literature. We can reexpress each of these m constraints as N matrix trace constraints in Fq , per (42), as follows. Consider any basis B for FqN over Fq , B = {b1 , . . . , bN }, where P bj ∈ FqN . We represent ha,i and ci in this basis as ha,i = N j=1 ha,i,j bj PN and ci = k=1 ci,k bk , respectively. Let H̃a be the n × N matrix whose (i, j)-th entry is the coefficient ha,i,j ∈ Fq and C be similarly defined by the ci,k ∈ Fq . Now define ωj,k,l as the coefficients in Fq of the representation PN of bj bk , i.e., bj bk = l=1 ωj,k,l bl . Define Ωl to be the symmetric N × N matrix whose (j, k)-th entry is ωj,k,l . By substituting the expansions for ha and c into the standard parity-check definition and making use of the fact that the basis elements bj are linearly independent, we discover the P following: the constraint n i=1 ha,i ci = 0 is equivalent to the N constraints Tr(CΩl H̃T a ) = 0 ∈ Fq for l ∈ [N ]. If we define H̃a Ωl for each a ∈ [m], l ∈ [N ] to be one of the constraints in (42), we get that the set of C matrices C satisfying (42) is the rank-metric codes defined by the ha , a ∈ [m]. A simple relation between the Ωl matrices holds if the basis is chosen to be a normal basis [41, Def. 2.32]. IEEE TRANSACTIONS ON INFORMATION THEORY 12 from that in Barg and Forney [19, Eq. (2.3)] where the authors considered and analyzed the analog of the sum X I{rank(Cj ) = r}, (44) ÑC (r) := j∈{1,...,|C |} : Cj 6=0 where j ∈ {1, . . . , |C |} indexes the (random) codewords in C . Note that ÑC (r) = NC (r) for all r ≥ 1 but they differ when r = 0 (ÑC (0) = 0 while NC (0) = 1). It turns out that the sum in (43) is more amenable to analysis given that our parity-check (sensing) matrices Ha , a ∈ [k], are random (as in Gallager’s work in [20, Theorem 2.1]) whereas in [19, Sec. II.C], the generators are random.12 Recall the rankdimension ratio γ is the limit of the ratio r/n as n → ∞. Using (43), we can show the following: Lemma 12 (Moments of NC (r)). For r = 0, NC (r) = 1. For 1 ≤ r ≤ n, the mean of NC (r) satisfies q −k+2rn−r 2 −2r 2 ≤ ENC (r) ≤ 4q −k+2rn−r . (45) Furthermore, the variance of NC (r) satisfies var(NC (r)) ≤ ENC (r). (46) The proof of Lemma 12 is provided in Appendix F. Observe from (45) that the average number of codewords with rank r, namely ENC (r), is exponentially large (in n2 ) if k < (2 − ε)γ(1−γ/2)n2 (compare to the converse in Proposition 2) and exponentially small if k > (2 + ε)γ(1 − γ/2)n2 (compare to the achievability in Proposition 3). By Chebyshev’s inequality, an immediate corollary of Lemma 12 is the following: Corollary 13 (Concentration of number of codewords of rank r). Let fn be any sequence such that limn→∞ fn = ∞. Then,   p lim P |NC (r) − ENC (r)| ≥ fn ENC (r) = 0. 
(47) n→∞ Thus, NC (r) concentrates to its mean in the sense of (47). A similar result for the random generator case was developed in [9, Corollary 1]. Also, our derivations based on Lemma 12 are cleaner and require fewer assumptions. We now define the notion of the minimum rank distance of a rank-metric code. Definition 6. The minimum rank distance of a rank-metric code C is defined as dR (C ) := min C1 ,C2 ∈C :C1 6=C2 rank(C1 − C2 ). (48) By linearity of the code C , it can be seen that the minimum rank distance in (48) can also be written as dR (C ) := min C∈C :C6=0 rank(C). (49) Thus, the minimum rank distance of a linear code is equal to the minimum rank over all non-zero matrix codewords. Definition 7. The relative minimum rank distance of a code C ⊂ Fqn×n is defined as dR (C )/n. Note that the relative minimum rank distance is a random variable taking on values in the unit interval. In this section, 12 Indeed, if the generators are random, it is easier to derive the statistics of the number of codewords of rank r using (44) instead of (43). we assume there exists some α ∈ (0, 1) such that k/n2 → α (cf. Section V-B). This is the scaling regime of interest. Proposition 14 (Asymptotic linear independence). Assume that each random matrix Ha ∈ Fqn×n consists of independent entries that are drawn according to the pmf in (39). Let m := dim(span{vec(H1 ), . . . , vec(Hk )}). If δ ∈ Ω( logn n ), then m/k → 1 almost surely (a.s.). The proof of this proposition is a consequence of a result by Blömer et al. [42]. We provide the details in Appendix G. We would now like to define the notion of the rate of a random code. Strictly speaking, since C is a random linear code, the rate of the code should be defined as the random variable R̃n := 1 − m/n2 . However, a consequence of Proposition 14 is that R̃n /(1 − k/n2 ) → 1 a.s. if δ ∈ Ω( logn n ). Note that this prescribed rate of decay of δ subsumes the equiprobable model (of interest in this section) as a special case. (Take δ = (q − 1)/q to be constant.) In light of Proposition 14, we adopt the following definition: Definition 8. The rate of the linear rank-metric code C [as in (42)] is defined as Rn := k n2 − k = 1 − 2. n2 n (50) The limit of Rn in (50) is denoted as R ∈ [0, 1]. Note also that R̃n /R → 1 a.s. Proposition 15 (Lower bound on relative minimum distance). Fix ε > 0. For any R ∈ [0, 1], the probability that the equiprobable linear code √ in (42) has relative minimum rank distance less than 1 − R − ε goes to zero as n → ∞. Proof: Assume13 ε ∈ (0, 2(1 − γ)) and define the positive constant ε′ := 2ε(1 − γ) − √ ε2 . Consider a sequence of ranks r such that r/n → γ ≤ 1 − R − ε. Fix η = ε′ /2 > 0. Then, by Markov’s inequality and (45), we have 2 k P(NC (r) ≥ 1) ≤ ENC (r) ≤ 4q −n [ n2 −2γ(1−γ/2)−η] , (51) √ for all n > Nε′ . Since γ ≤ 1 − R − ε, we may assert by invoking the definition of R that k ≥ (2γ(1 − γ/2) + ε′ )n2 . Hence, the exponent in square parentheses in (51) is no smaller than ε′ /2. This implies that P(NC (r) ≥ 1) → 0 or equivalently, P(NC (r) = 0) → 1. In other words, there are no matrices of rank r in the equiprobable linear code C with ′ 2 probability at least 1 − 4q −ε n /2 for all n > Nε′ . We now introduce some additional notation. We say that two positive sequences {an }n∈N and {bn }n∈N are equal to .. second order in the exponent (denoted an = bn ) if lim n→∞ an 1 logq = 0. n2 bn (52) Proposition 16 (Concentration of relative minimum distance). Fix ε > 0. 
For any R ∈ [0, √ 1], if r is a sequence of ranks such that r/n → γ ≥ 1 − R + ε, then the probability that 2 .. NC (r) = q −k+2γ(1−γ/2)n goes to one as n → ∞. 13 The restriction that ε < 2(1−γ) is not a serious one since the validity of the claim in Proposition 15 for some ε0 > 0 implies the same for all ε > ε0 . IEEE TRANSACTIONS ON INFORMATION THEORY 13 Proof: √ If the sequence of ranks r is such that r/n → γ ≥ 1 − R + ε, then the average number of matrices in the code of rank r, namely ENC (r), is exponentially large. By Markov’s inequality and the triangle inequality, Definition 9. We say that C is a δ-sparse linear rank-metric code if C is as in (42) and where Ha , a ∈ [k] are random matrices where each entry is statistically independent and drawn from the pmf Ph ( · ; δ, q) defined in (39). E|NC (r) − ENC (r)| t 2ENC (r) ≤ . (53) t To analyze the number of matrices of rank r in this random ensemble NC (r), we partition the sum in (43) into subsets of matrices based on their Hamming weight, i.e., P(|NC (r) − ENC (r)| ≥ t) ≤ 2 2 Choose t := q −k+(2γ(1−γ/2)+η)n +n , where η is given in the proof of Proposition 15. Then, applying (45) to (53) yields P(|NC (r) − ENC (r)| ≥ t) ≤ 8q −n → 0. (54) Hence, NC (r) ∈ (ENC (r) − t, ENC (r) + t) with probability exceeding 1 − 8q −n . Furthermore, it is easy to verify that 2 .. ENC (r) ± t = q −k+2γ(1−γ/2)n , as desired. Propositions 15 and 16 allow us to conclude that with probability approaching one (exponentially fast) as n → ∞, the relative minimum rank distance of the equiprobable linear √ code √ in (42) is contained in the interval (1 − R − ε, 1 − R + ε) for all R ∈ [0, 1]. The analog of the Gilbert-Varshamov (GV) distance [19, Sec. II.C] is thus √ (55) γGV (R) := 1 − R. Indeed, by substituting the definition of R into NC (r) in Proposition 16, we see that a typical (in the sense of [19]) equiprobable linear rank-metric code has distance distribution:  .. 2 2 = q n [R−(1−γ) ] γ ≥ γGV (R) + ε, Ntyp (r) (56) = 0 γ ≤ γGV (R) − ε. We again remark that Loidreau in [9, Sec. 5] also derived results for uniformly random linear codes in the rank-metric that are somewhat similar to Propositions 15 and 16. However, our derivations are more straightforward and require fewer assumptions. As mentioned above, we assume that the paritycheck matrices Ha , a ∈ [k], are random (akin to [20, Theorem 2.1]), while the assumption in [9, Sec. 5] is that the generators are random and linearly independent. Furthermore, to the best of our knowledge, there are no previous studies on the minimum distance properties for the sparse parity-check matrix setting. We do this in Section VII-B. From the rank distance properties, we can re-derive the achievability (weak recovery) result in Proposition 3 by using the definition of R and solving the following inequality for k: √ 1 − R − ε ≥ γ. (57) This provides geometric intuition as to why the min-rank decoder succeeds on average; the typical relative minimum rank distance of the code should exceed the rank-dimension ratio for successful error correction. We derive a stronger condition (known as the strong recovery condition) in Section VII-C. B. Distance Properties of Sparse Rank-Metric Codes In this section, we derive the analog of Proposition 15 for the case where the code C is characterized by sparse sensing (or measurement or parity-check) matrices Ha , a ∈ [k]. NC (r) = n X X d=0 M∈Fn×n :rank(M)=r,kMk0 =d q I{M ∈ C }. (58) Define θ(d; δ, q, k) := [q −1 + (1 − q −1 )(1 − δ/(1 − q −1 ))d ]k . 
As shown in Lemma 21 in Appendix E, this is the probability that a non-zero matrix M of Hamming weight d belongs to the δ-sparse code C . We can demonstrate the following important bound for the δ-sparse linear rank-metric code: Lemma 17 (Mean of NC (r) for sparse codes). For r = 0, NC (r) = 1. If 1 ≤ r ≤ n and η > 0, 2 ENC (r) ≤ 2n H2 (β) 2 (q − 1)βn (1 − δ)k + 2 1 + 4n2 q n [2γ(1−γ/2)+η+ n2 logq θ(⌈βn2 ⌉; δ,q,k)] , (59) for all β ∈ [0, 1/2] and all n ≥ Nη . By using the sum in (58), one sees that this lemma can be justified in exactly the same way as Theorem 11 (See steps leading to (81) and (82) in Appendix E). Hence, we omit its proof. Lemma 17 allows us to find a tight upper bound on the expectation of NC (r) for the sparse linear rank-metric code by optimizing over the free parameter β ∈ [0, 1/2]. It turns out β = Θ( logδ n ) is optimum. In analogy to Proposition 15 for the equiprobable linear rank-metric code, we can demonstrate the following for the sparse linear rank-metric code. Proposition 18 (Lower bound on relative minimum distance for sparse codes). Fix ε > 0 assume that δ = Ω( logn n ) ∩ o(1). For any R ∈ [0, 1], the probability that the sparse √ linear code has relative minimum distance less than 1 − R − ε goes to zero as n → ∞. Proof: The condition on the minimum distance implies that k > (2 + ε̃)γ(1 − γ/2)n2 for some ε̃ > 0 (for sufficiently small ε). See detailed argument in proof of Proposition 15. This implies from Theorem 11, Lemma 17 and Markov’s inequality that P(NC (r) ≥ 1) → 0. Proposition 18 asserts that the relative minimum rank distance of a√δ = Ω( logn n )-sparse linear rank-metric code is at least 1 − R − ε w.h.p. Remarkably, this property is exactly the same as that of a (dense) linear code (cf. Proposition 15) in which the entries in the parity-check matrices Ha are statistically independent and equiprobable in Fq . The fact that the (lower bounds on the) minimum distances of both ensembles of codes coincide explains why the min-rank decoder matches the information-theoretic lower bound (Proposition 2) in the sparse setting (Theorem 11) just as in the dense one (Proposition 3). Note that only an upper bound of ENC (r) as in (59) is required to make this claim. IEEE TRANSACTIONS ON INFORMATION THEORY 14 C. Strong Recovery We now utilize the insights gleaned from this section to derive results for strong recovery (See Section II-D and also [27, Sec. 2] for definitions) of low-rank matrices from linear measurements. Recall that in strong recovery, we are interested in recovering all matrices whose ranks are no larger than r. We contrast this to weak recovery where a matrix X (of low rank) is fixed and we ask how many random measurements are needed to estimate X reliably. Proposition 19 (Strong recovery for uniform measurement model). Fix ε > 0. Under the uniform measurement model, the min-rank decoder recovers all matrices of rank less than or equal to r with probability approaching one as n → ∞ if k > (4 + ε)γ(1 − γ)n2 . (60) We contrast this to the weak achievability result (Proposition 3) in which X with rank(X) ≤ r was fixed and we showed that if k > (2 + ε)γ(1 − γ/2)n2, the min-rank decoder recovers X w.h.p. Thus, Proposition 19 says that if γ is small, roughly twice as many measurements are needed for strong recovery vis-à-vis weak recovery. 
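To make the "roughly twice" comparison concrete, one can take the ratio of the two sufficient measurement counts in Propositions 19 and 3 (a short worked comparison; the ratio is ours, although both bounds are from the text):

\[
  \frac{k_{\mathrm{strong}}}{k_{\mathrm{weak}}}
  \approx \frac{4\gamma(1-\gamma)n^2}{2\gamma(1-\gamma/2)n^2}
  = \frac{2(1-\gamma)}{1-\gamma/2}
  \longrightarrow 2 \quad \text{as } \gamma \to 0 .
\]

At γ = 1/2, by contrast, the ratio is only 4/3, so the relative penalty for strong recovery shrinks as the rank-dimension ratio grows.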
These fundamental limits (and the increase by a factor of 2 for strong recovery) are exactly analogous to those developed by Draper and Malekpour in [29] in the context of compressed sensing over finite fields and by Eldar et al. [27] for the problem of rank minimization over the reals. Given our derivations in the preceding subsections, the proof of this result is straightforward.

Proof: We showed in Proposition 15 that, with probability approaching one (exponentially fast), the relative minimum distance of C is no smaller than 1 − √R − ε̃ for any ε̃ > 0. As such, to guarantee strong recovery, we need the decoding regions (associated to each codeword in C) to be disjoint. In other words, the rank distance between any two distinct codewords C_1, C_2 ∈ C must exceed 2r. See Fig. 4 for an illustration. In terms of the relative minimum rank distance 1 − √R − ε̃, this requirement translates to^14

  1 − √R − ε̃ ≥ 2γ.    (61)

Rearranging this inequality and using the definition of R [the limit of R_n in (50), so R = 1 − lim_n k/n²] as we did in Proposition 15, we obtain √(1 − k/n²) ≤ 1 − 2γ (up to the slack ε̃), i.e., k ≥ (1 − (1 − 2γ)²)n² = 4γ(1 − γ)n², which yields the prescribed number of measurements.

In analogy to Proposition 19, we can show the following for the sparse model.

Proposition 20 (Strong recovery for sparse measurement model). Fix ε > 0. Under the δ = Ω((log n)/n)-sparse measurement model, the min-rank decoder recovers all matrices of rank less than or equal to r with probability approaching one as n → ∞ if (60) holds.

Proof: The proof uses Proposition 18 and follows along the exact same lines as that of Proposition 19.

Fig. 4. For strong recovery, the decoding regions (balls of rank-radius r around the three codewords C_1, C_2, C_3 shown) associated to each codeword C ∈ C have to be disjoint, resulting in the criterion in (61).

^14 The strong recovery requirement in (61) is analogous to the well-known fact that, in the binary Hamming case, in order to correct any vector r = c + e corrupted by t errors (i.e., ||e||_0 = t) using minimum distance decoding, we must use a code with minimum distance at least 2t + 1.

VIII. REDUCTION IN THE COMPLEXITY OF THE MIN-RANK DECODER

In this section, we devise a procedure to reduce the complexity of min-rank decoding (vis-à-vis exhaustive search). This procedure is inspired by techniques in the cryptography literature [43], [44]; we adapt those techniques to our problem, which is somewhat different. As we mentioned in Section VII, the codewords in this paper are matrices rather than vectors whose elements belong to an extension field [43], [44].

Recall that in min-rank decoding (12), we search for a matrix X ∈ F_q^{N×n} of minimum rank that satisfies the linear constraints. In this section, for clarity of exposition, we differentiate between the number of rows (N) and the number of columns (n) in X. The vector y^k is known as the syndrome. We first suppose that the minimum rank in (12) is known to be equal to some integer r ≤ min{N, n}. Since our proposed algorithm requires exponentially many elementary operations (addition and multiplication) in F_q, this assumption does not affect the time complexity significantly. Then the problem in (12) reduces to a satisfiability problem: Given an integer r, a collection of parity-check matrices H_a, a ∈ [k], and a syndrome vector y^k, find (if possible) a matrix X ∈ F_q^{N×n} of rank exactly equal to r that satisfies the linear constraints in (12). Note that the constraints in (12) are equivalent to ⟨vec(H_a), vec(X)⟩ = y_a, a ∈ [k]. We first claim that we can, without loss of generality, assume that y^k = 0^k, i.e., the constraints in (12) read ⟨H_a, X⟩ = 0, a ∈ [k]
(62) We justify this claim as follows: Consider the new syndromen+1 for every a ∈ [k]. augmented vectors [vec(Ha ); ya ]T ∈ FN q ′ Then, every solution vec(X ) of the system of equations h[vec(Ha ); ya ], vec(X′ )i = 0, a ∈ [k] ′ (63) can be partitioned into two parts, vec(X ) = [vec(X1 ); x2 ] n where vec(X1 ) ∈ FN and x2 ∈ Fq . Thus, every solution q of (63) satisfies one of two conditions: • x2 = 0. In this case X1 is a solution to the linear equations in (12). • x2 6= 0. In this case X1 solves hHa , X1 i = x2 ya . Thus, x−1 2 X1 solves (12). This is also known as coset decoding. Now, observe that since it is known that X has rank equal to r (which is assumed IEEE TRANSACTIONS ON INFORMATION THEORY 15 known), it can be written as X= r X ul vlT = UVT (64) l=1 n where each of the vectors ul ∈ FN q and vl ∈ Fq . The matrices n×r N ×r and V ∈ Fq are of (full) rank r and are referred U ∈ Fq to as the basis matrix and the coefficient matrix respectively. The linear system of equations in (62) can be expanded as r X N X n X [Ha ]i,j ul,i vl,j = 0, l=1 i=1 j=1 a ∈ [k] discussion on the indeterminacies in the decomposition of the low rank matrix X, we observe that the complexity involved ×r in the enumeration of all FN matrices in step 2 in the q naı̈ve implementation can be reduced by only enumerating the different equivalence classes induced by ∼. More precisely, we find (if possible) coefficients V for a basis U from each equivalence class, e.g., U1 ∈ [U1 ], . . . , Um ∈ [Um ]. Note that the number of equivalence classes (by Lagrange’s theorem) is m= (65) where ul = [ul,1 , . . . , ul,N ]T and vl = [vl,1 , . . . , vl,n ]T . Thus, we need to solve a system of quadratic equations in the basis elements ul,i and the coefficients vl,j . A. Naı̈ve Implementation A naı̈ve way to find a consistent U and V for (65) is to employ the following algorithm: 1) Start with r = 1. 2) Enumerate all bases U = {ul,i : i ∈ [N ], l ∈ [r]}. 3) For each basis, solve (if possible) the resulting linear system of equations in V = {vl,j : j ∈ [n], l ∈ [r]}. 4) If a consistent set of coefficients V exists (i.e., (65) is satisfied), terminate and set X = UVT . Else increment r ← r + 1 and go to step 2. The second step can be solved easily if the number of equations is less than or equal to the number of unknowns, i.e., if nr ≥ k. However, this is usually not the case since for successful recovery, k has to satisfy (15) so, in general, there are more equations (linear constraints) than unknowns. We attempt to solve for (if possible) a consistent V, otherwise we increment the guessed rank r. The computational complexity of this naı̈ve approach (assuming r is known and so no iterations over r are needed) is O((nr)3 q N r ) since there are q N r distinct bases and solving the linear system via Gaussian elimination requires at most O((nr)3 ) operations in Fq . B. Simple Observations to Reduce the Search for the Basis U We now use ideas from [43], [44] and make two simple observations to dramatically reduce the search for the basis in step 2 of the above naı̈ve implementation. Observation (A): Note that if X̃ solves (62), so does ρX̃ for any ρ ∈ Fq . Hence, without loss of generality, we may assume that the we can scale the (1,1) element of U to be equal to 1. The number of bases we need to enumerate may thus be reduced by a factor of q. Observation (B): Note that the decomposition X = UVT is not unique. 
Indeed if X = UVT , we may also decompose X as X = ŨṼT , where Ũ = UT and Ṽ = VT−T and T is any invertible r × r matrix over Fq . We say that two bases U, Ũ are equivalent, denoted U ∼ Ũ, if there exists an invertible matrix T such that U = ŨT. The equivalence ×r relation ∼ induces a partition of the set of FN matrices. q N ×r Let [U] := {Ũ ∈ Fq : Ũ ∼ U} be the equivalence class of matrices containing the matrix U. From the preceding qN r ≤ 4q r(N −r) , Φq (r, r) (66) where recall from Section II-E that Φq (r, r) is the number of non-singular matrices in Fr×r . The inequality arises from q 2 the fact that Φq (r, r) ≥ 14 q r , a simple consequence of [43, Cor. 4]. Algorithmically, we can enumerate the equivalence classes by first considering all matrices of the form   Ir×r U= , (67) Q where Ir×r is the identity matrix of size r, and Q takes on (N −r)×r all possible values in Fq . Note that if Q and Q̃ are distinct, the corresponding U = [I; QT ]T and Ũ = [I; Q̃T ]T belong to different equivalence classes. However, the top r rows of U may not be linearly independent so we have yet to consider all equivalence classes. Hence, we subsequently permute the rows of each previously considered U to ensure every equivalence class is considered. From the considerations in (A) and (B), the computational complexity can be reduced from O((nr)3 q N r ) to O((nr)3 q r(N −r)−1 ). By further noting that there is symmetry between the basis matrix U and the coefficient matrix V, we see that the resulting computational complexity is O((max{n, N }r)3 q r(min{n,N }−r)−1 ). Finally, to incorporate the fact that r is unknown, we start the procedure assuming r = 1, proceed to r ← r + 1 if there does not exist a consistent solution and so on, until a consistent (U, V) pair is found. The resulting computational complexity is thus O(r(max{n, N }r)3 q r(min{n,N }−r)−1 ). IX. D ISCUSSION AND C ONCLUSION In this section, we elaborate on connections of our work to the related works mentioned the introduction and in Tables I and II. We will also conclude the paper by summarizing our main contributions and suggesting avenues for future research. A. Comparison to existing coding-theoretic techniques for rank minimization over finite fields In general, solving the min-rank decoding problem (41) is intractable (NP-hard). However, it is known that if the linear operator H (in (4) characterizing the code C ) admits a favorable algebraic structure, then one can estimate a sufficiently low-rank (vector with elements in the extension field Fqn or matrix with elements in Fq ) x and thus the codeword c from the received word r efficiently (i.e., in polynomial time). For instance, the class of Gabidulin codes [7], [8], which are rankmetric analogs of Reed-Solomon codes, not only achieves the Singleton bound and thus has maximum rank distance IEEE TRANSACTIONS ON INFORMATION THEORY ✛ ✲ n ✻ 16 r r r n r r r r r r r r ❄ r Fig. 5. Probabilistic crisscross error patterns [17]: The figure shows an error matrix X. The non-zero values (indicated as black dots) are restricted to two columns and one row. Thus, the rank of the error matrix X is at most three. (MRD), but decoding can be achieved using a modified form of the Berlekamp-Massey algorithm (See [45] for example). However, the algebraic structure of the codes (and in particular the mutual dependence between the equivalent Ha matrices) does not permit the line of analysis we adopted. 
Thus it is unclear how many linear measurements would be required in order to guarantee recovery using the suggested code structure. Silva, Kschischang and Kötter [10] extended the BerlekampMassey-based algorithm to handle errors and erasures for the purpose of error control in linear random network coding. In both these cases, the underlying error matrix is assumed to be deterministic and the algebraic structure on the parity check matrix permitted efficient decoding based on error locators. In another related work, Montanari and Urbanke [11] assumed that the error matrix X is drawn uniformly at random from all matrices of known rank r. The authors then constructed a sparse parity check code (based on a sparse factor graph). Using an “error-trapping” strategy by constraining codewords to have rows that are have zero Hamming weight without any loss of rate, they first learned the rowspace of X before adopting a (subspace) message passing strategy to complete the reconstruction. However, the dependence across rows of the parity check matrix (caused by lifting) violates the independence assumptions needed for our analyses to hold. The ideas in [11] were subsequently extended by Silva, Kschischang and Kötter [18] where the authors computed the information capacity of various (additive and/or multiplicative) matrix-valued channels over finite fields. They also devised “error-trapping” codes to achieve capacity. However, unlike this work, it is assumed in [18] that the underlying low-rank error matrix is chosen uniformly. As such, their guarantees do not apply to so-called crisscross error patterns [17], [45] (see Fig. 5), which are of interest in data storage applications. Our work in this paper is focused primarily on understanding the fundamental limits of rank-metric codes that are random. More precisely, the codes are characterized by either dense or sparse sensing (parity-check) matrices. This is in contrast to the literature on rank-metric codes (except [9, Sec. 5]), in which deterministic constructions predominate. The codes presented in Section VII are random. However, in analogy to the random coding argument for channel coding [35, Sec. 7.7], if the ensemble of random codes has low average error probability, there exists a deterministic code that has low error probability. In addition, the strong recovery results in Section VII-C allow us to conclude that our analyses apply to all low-rank matrices X in both equiprobable and sparse settings. This completes all remaining entries in Table II. Yet another line of research on rank minimization over finite fields (in particular over F2 ) has been conducted by the combinatorial optimization and graph theory communities. In [33, Sec. 6] and [46, Sec. 1] for example, it was demonstrated that if the code (or set of linear constraints) is characterized by a perfect graph,15 then the rank minimization problem can be solved exactly and in polynomial time by the ellipsoid method (since the problem can be stated as a semidefinite program). In fact, the rank minimization problem is also intimately related to Lovász’s θ function [47, Theorem 4], which characterizes the Shannon capacity of a graph. B. Conclusion and Future Directions In this paper, we derive information-theoretic limits for recovering a low-rank matrix with elements over a finite field given noiseless or noisy linear measurements. 
We show that even if the random sensing (or parity-check) matrices are very sparse, decoding can be done with exactly the same number of measurements as when the sensing matrices are dense. We then adopt a coding-theoretic approach and derived minimum rank distance properties of sparse random rank-metric codes. These results provide geometric insights as to how and why decoding succeeds when sufficiently many measurements are available. The work herein could potentially lead to the design of low-complexity sparse codes for rank-metric channels. It is also of interest to analyze whether the sparsity factor of Θ( logn n ) is the smallest possible and whether there is a fundamental tradeoff between this sparsity factor and the number of measurements required for reliable recovery of the low-rank matrix. Additionally, in many of the applications that motivate this problem, the sensing matrices fixed by the application and will not be random; take for example deterministic parity-check matrices that might define a rankmetric code. In rank minimization in the real field there are properties about the sensing matrices, and about the underlying matrix being estimated, that can be checked (for example the restricted isometry property [6, Eq. (1)], or random point sampling joint with incoherence of the low-rank matrix) that, if they are satisfied, guarantee that the true matrix of interest can be recovered using convex programming. It is of interest to identify an analog in the finite field, that is, a necessary (or sufficient) condition on the sensing matrices and the underlying matrix such that recovery is guaranteed. We would like to develop tractable algorithms along the lines of those in Table I or in the work by Baron et al. [26] to solve the minrank optimization problem approximately for particular classes of sensing matrices such as the sparse random ensemble. Finally, Dimakis and Vontobel [48] make an intriguing connection between linear programming (LP) decoding for channel coding and LP decoding for compressed sensing. They reach known compressed sensing results via a new path 15 A perfect graph G is one in which each induced subgraph H ⊂ G has a chromatic number χ(H) that is the same as its clique number ω(H). IEEE TRANSACTIONS ON INFORMATION THEORY 17 – channel coding. Analogously, we wonder whether known rank minimization results can be derived using rank-metric coding tools, thereby providing novel interpretations. And just as in [48], the reverse direction is also open. That is, whether the growing literature and understanding of rank minimization problems could be leveraged to design more tractable and interesting decoding approaches for rank-metric codes. in (69) into two disjoint subsets each, obtaining P(hM′ − M, H1 i = 0 | hM, H1 i = 0)  X X [H1 ]i,j = 0 [H1 ]i,j + =P (i,j)∈L∩K (i,j)∈K\L X =P Acknowledgements We would like to thank Associate Editor Erdal Arıkan and the reviewers for their suggestions to improve the paper and to acknowledge discussions with Ron Roth, Natalia Silberstein and especially Danilo Silva, who made the insightful points in Section IV-B1 [38]. We would also like to thank Ying Liu and Huili Guo for detailed comments and help in generating Fig. 4 respectively.  X [H1 ]i,j = (i,j)∈L\K [H1 ]i,j = − Proof: It suffices to show that the conditional probability P(AZ′ |AZ ) = P(AZ′ ) = q −k for Z 6= Z′ . We define the non-zero matrices M := X − Z and M′ := X − Z′ . Let K := supp(M′ − M) and L := supp(M). 
The idea of the proof is to partition the joint support K ∪ L into disjoint sets. More precisely, consider (a) P(AZ′ |AZ ) = P(hM′ , H1 i = 0 | hM, H1 i = 0)k ′ k where (a) is from the definition of AZ := {hX − Z, Ha i = 0, ∀ a ∈ [k]} and the independence of the random matrices Ha , a ∈ [k] and (b) by linearity. It suffices to show that the probability in (68) is q −1 . Indeed, [M]i,j [H1 ]i,j = 0 (d) = P  X (i,j)∈K [H1 ]i,j = 0 X (i,j)∈L (i,j)∈L∩K  (f ) = q −1 , A PPENDIX B OF P ROPOSITION 8 Ennoisy:= {|S noisy | > 1}∪({|S noisy | = 1}∩{(X∗ , w∗ ) 6= (X, w)}). Note that (Ennoisy )c occurs, both the matrix X and the noise vector w are recovered so, in fact, we are decoding two objects when we are only interested in X. Clearly, En ⊂ Ennoisy so it suffices to upper bound P(Ennoisy ) to obtain an upper bound of P(En ). For this purpose consider the event (70) defined for each matrix-vector pair (Z, v) ∈ Fqn×n × Fkq such that rank(Z) + λkvk0 ≤ rank(X) + λkwk0 . The error event Ennoisy occurs if and only if there exists a pair (Z, v) 6= (X, w) such that (i) rank(Z) + λkvk0 ≤ rank(X) + λkwk0 and (ii) the event Anoisy Z,v occurs. By the union of events bound, the error probability can be bounded as: X P(Anoisy P(Ennoisy ) ≤ Z,v ) (Z,v):rank(Z)+λkvk0 ≤rank(X)+λkwk0 (a) = X q −k (Z,v):rank(Z)+λkvk0 ≤rank(X)+λkwk0 (i,j)∈K (i,j)∈L [H1 ]i,j Anoisy Z,v := {hZ, Ha i = hX, Ha i + va , ∀ a ∈ [k]}, = P(hM −M, H1 i = 0 | hM, H1 i = 0) , (68) X X Proof: Recall the optimization problem for the noisy case in (32) where the optimization variables are X̃ and w̃. Let S noisy ⊂ Fqn×n × Fkq be the set of optimizers. In analogy to (13), we define the “noisy” error event A PPENDIX A P ROOF OF L EMMA 6 P(hM′ − M, H1 i = 0 | hM, H1 i = 0)  X (c) [M′ − M]i,j [H1 ]i,j = 0 =P [H1 ]i,j P Equality (e) is by using the condition (i,j)∈L\K [H1 ]i,j = P − (i,j)∈L∩K [H1 ]i,j and finally (f ) from the fact that the sets K\L, L\K and L∩K are mutually disjoint so the probability is q −1 by independence and uniformity of [H1 ]i,j , (i, j) ∈ [n]2 . P ROOF (b) X  (i,j)∈K\L (i,j)∈L\K X [H1 ]i,j = 0 (i,j)∈L∩K (i,j)∈L\K (e) X [H1 ]i,j +   [H1 ]i,j = 0 , (b) ≤ q −k |Ur,s |, (69) where (c) is from the definition of the inner product and the sets K and L, (d) from the fact that [M]i,j [H1 ]i,j has the same distribution as [H1 ]i,j since [M]i,j 6= 0 and [H1 ]i,j is uniformly distributed in Fq . Now, we split the sets K and L (71) where (a) is from the same argument as the noiseless case [See (18)] and in (b), we defined the set Ur,s := {(Z, v) : rank(Z) + λkvk0 ≤ rank(X) + λkwk0 }, where the subscripts r and s index respectively the upper bound on the rank of X and sparsity of w. Note that s = kwk0 = ⌊σn2 ⌋ ≤ σn2 . It remains to bound the cardinality of Ur,s . In the following, we partition the counting argument into disjoint subsets by fixing the sparsity of the vector v to be equal to l for all possible IEEE TRANSACTIONS ON INFORMATION THEORY l’s. Note that 0 ≤ l ≤ (kvk0 )max := Ur,s is bounded as follows: r λ 18 + s. The cardinality of (kvk0 )max X |Ur,s | = l=0 |{v ∈ Fkq : kvk0 = l}|× where (a) follows by bounding the number of vectors which are non-zero in l positions and the number of matrices whose rank is no greater than r + λ(s − l) (Lemma 1), (b) follows by first noting that the assignment r 7→ 2nr − r2 is monotonically increasing in r = 0, 1, . . . , n and second by upper bounding the summands by their largest possible values. 
Observe that (33) ensures that λr + s ≤ k2 , which is needed to upper bound the binomial coefficient since l 7→ kl is monotonically increasing iff l ≤ k/2. Inequality (c) uses the fact that the binomial coefficient is upper bounded by a function of the binary entropy [35, Theorem 11.1.3]. Now, note that since r/n → γ, for every η > 0, |r/n − γ| < η for n sufficiently large. Define γ̃η := γ + η + σ. From (c) above, |Ur,s | can be further upper bounded as (d) γ̃η n2  2 2 2 2 ≤ 4 γ̃η n2 + 1 2kH2 ( k ) q γ̃η n q 2γ̃η n −γ̃η n ≤ O(n2 )2 1 kH2 ( 3−γ̃ η ) γ̃η n +2γ̃η n q 2 2 −γ̃η2 n2 . (72) (73) Inequality (d) follows from the problem assumption that rank(X) ≤ r ≤ (γ +η)n for n sufficiently large, kwk0 = s ≤ σn2 and the choice of the regularization parameter λ = 1/n. Inequality (e) follows from the fact that since k satisfies (33), k > 3γ̃η (1 − γ̃η /3)n2 and hence the binary entropy term in (72) can be upper bounded as in (73). By combining (71) and (73), we observe that the error probability P(Ennoisy ) can be upper bounded as   2 2 k 1 noisy 2 −n n2 (1−(logq 2)H2 ( 3−γ̃η )−3γ̃η +γ̃η P(En ) ≤ O(n )q . (74) Now, again by using the assumption that k satisfies (33), the exponent in (74) is positive for η sufficiently small (γ̃η → γ+σ as η → 0) and hence P(Ennoisy ) → 0 as n → ∞. A PPENDIX C P ROOF OF C OROLLARY 9 Proof: Fano’s inequality can be applied to obtain inequality (a) as in (10). We lower bound the term H(X|yk , Hk ) in (10) differently taking into account the stochastic noise. It can be expressed as k k k k (b) (a) × |{Z ∈ Fqn×n : rank(Z) ≤ r + λ(s − l)}|  (kvk0 )max   X (a) 2 k ≤ (q − 1)l 4q 2n[r+λ(s−l)]−[r+λ(s−l)] l l=0  k  r (b)  r 2 q λ +s 4q 2n(r+λs)−(r+λs) +s+1 ≤ r + s λ λ  r +s (c)  r 2 r kH2 ( λk ) λ q +s 4q 2n(r+λs)−(r+λs) , ≤ +s+1 2 λ (e) The second term can be upper bounded as H(yk |Hk ) ≤ k by (11). The third term, which is zero in the noiseless case, can be (more tightly) lower bounded as follows: k k H(X|y , H ) = H(X) − H(y |H ) + H(y |H , X). (75) H(yk |Hk , X) = kH(y1 |H1 , X) = kH(w1 ) ≥ kHq (p), (76) where (a) follows by the independence of (X, H1 ) and w1 and (b) follows from the fact that the entropy of w with pmf in (34) is lower bounded by putting all the remaining probability mass p on a single symbol in Fq \ {0} (i.e., a Bern(p) distribution). Note that logarithms are to the base q. The result in (35) follows by uniting (75), (76) and the lower bound in (7). P ROOF A PPENDIX D OF C OROLLARY 10 Proof: The main idea in the proof is to reduce the problem to the deterministic case and apply Proposition 8. For this purpose, we define the ζ-typical set (for the length-k = ⌈αn2 ⌉ noise vector w) as   kwk0 − p ≤ ζ . Tζ = Tζ (w) := w ∈ Fkq : αn2 We choose ζ to be dependent on n in the following way (cf. the Delta-convention [49]): ζn → 0 and nζn → ∞ (e.g., ζn = n−1/2 ). By Chebyshev’s inequality, P(w ∈ / Tζn ) → 0 as n → ∞. We now bound the probability of error that the estimated matrix is not the same as the true one by using the law of total probability to condition the error event Ennoisy on the event {w ∈ Tζn } and its complement: / Tζn ). P(Ennoisy ) ≤ P(Ennoisy |w ∈ Tζn ) + P(w ∈ (77) Since the second term in (77) converges to zero, it suffices to prove that the first term also converges to zero. For this purpose, we can follow the steps of the proof in Proposition 8 and in particular the steps leading to (72) and (74). 
Doing so and defining pζ := p + ζ, we arrive at the upper bound P(Ennoisy |w ∈ Tζn ) ≤ O(n2 )2kH2 ( 2 ≤ O(n2 )q −n 2 = O(n2 )q −n  γn2 +pζ αn2 n αn2 ×q ) ( q γn2 +pζ αn2 n αn2 ) × 2n2 (γ+pζn α)−(γn+pζn αn)2 −αn2 γ )−2αpζn (1−γ)+α2 p2ζn −2γ+γ 2 α−α(logq 2)H2 (pζn + α [ g(α;pζn ,γ)−2γ(1−γ/2) ] , (78) Since ζn → 0 and g defined in (36) is continuous in the second argument, g(α; pζn , γ) → g(α; p, γ). Thus, if α satisfies (37), the exponent in (78) is positive. Hence, P(Ennoisy ) → 0 as n → ∞ as desired. A PPENDIX E P ROOF OF T HEOREM 11 Proof: We first state a lemma which will be proven as the end of this section.  IEEE TRANSACTIONS ON INFORMATION THEORY 19 Lemma 21. Define d := kX − Zk0 . The probability of AZ , defined in (16), under the δ-sparse measurement model, denoted as θ(d; δ, q, k), is a function of d and is given as " d #k  δ −1 −1 . (79) θ(d; δ, q, k) := q + (1 − q ) 1 − 1 − q −1 Lemma 21 says that the probability P(AZ ) is only a function of X though the number of entries it differs from Z, namely d. Furthermore, it is easy to check that the probability in (79) satisfies the following two properties: 1) θ(d; δ, q, k) ≤ (1 − δ)k ≤ exp(−kδ) for all d ∈ [n2 ], 2) θ(d; δ, q, k) is a monotonically decreasing function in d. We upper bound the probability in (17). To do so, we partition all possibly misleading matrices Z into subsets based on their Hamming distance from X. Our idea is to separately bound those partitions with low Hamming distance (which are few and so for which a loose upper bound on θ(d; δ, q, k) suffices) and those further from X (which are many, but for which we can get a tight upper bound on θ(d; δ, q, k), a bound that is only a function of the Hamming distance ⌈βn2 ⌉). Then we optimize the split over the free parameter β: 2 P(En ) ≤ (a) = n X X P(AZ ) d=1 Z:Z6=X,rank(Z)≤rank(X) kX−Zk0 =d ⌊βn2 ⌋ X X θ(d; δ, q, k)+ d=1 Z:Z6=X,rank(Z)≤rank(X) kX−Zk0 =d 2 n X + X θ(d; δ, q, k) d=⌈βn2 ⌉ Z:Z6=X,rank(Z)≤rank(X) kX−Zk0 =d (b) ≤ ⌊βn2 ⌋ X X exp(−kδ)+ d=1 Z:Z6=X,rank(Z)≤rank(X) kX−Zk0 =d 2 + n X X d=⌈βn2 ⌉ Z:Z6=X,rank(Z)≤rank(X) kX−Zk0 =d θ(⌈βn2 ⌉; δ, q, k) (c) ≤ |{Z : kZ − Xk0 ≤ ⌊βn2 ⌋} exp(−kδ)+ + n2 |{Z : rank(Z) ≤ rank(X)}|θ(⌈βn2 ⌉; δ, q, k). (80) In (a), we used the definition of θ(d; δ, q, k) in Lemma 21. The fractional parameter β, which we choose later, may depend on n. In (b), we used the fact that θ(d; δ, q, k) ≤ exp(−kδ) and that θ(d; δ, q, k) is monotonically decreasing in d so θ(d; δ, q, k) ≤ θ(⌈βn2 ⌉; δ, q, k) for all d ≥ ⌈βn2 ⌉. In (c), we upper bounded the cardinality of the set {Z 6= X : rank(Z) ≤ rank(X), kX − Zk0 ≤ ⌊βn2 ⌋} by the cardinality of the set of matrices that differ from X in no more than ⌊βn2 ⌋ locations (neglecting the rank constraint). For the second term, we upper bounded the cardinality of each set Md := {Z 6= X : rank(Z) ≤ rank(X), kX − Zk0 = d} by the cardinality of the set of matrices whose rank no more than rank(X) (neglecting the Hamming weight constraint). We denote the first and second terms in (80) as An and Bn respectively. Now, An := |{Z : kZ − Xk0 ≤ ⌊βn2 ⌋}| exp(−kδ) (a) 2 ≤ 2n H2 (β) 2 (q − 1)βn exp(−kδ) 2 k ≤ 2n [H2 (β)+β log2 (q−1)− n2 δ log2 (e)] , (81) where (a) used the fact that the number of matrices that differ from X by less than or equal to ⌊βn2 ⌋ positions is upper 2 2 bounded by 2n H2 (β) (q − 1)βn . Note that this upper bound is independent of X. 
Now fix η > 0 and consider Bn : Bn := n2 |{Z : rank(Z) ≤ rank(X)}|θ(⌈βn2 ⌉; δ, q, k) (a) 2 ≤ 4n2 q (2γ(1−γ/2)+η)n θ(⌈βn2 ⌉; δ, q, k) (b) h  i 2 −1 −1 ⌈βn2 ⌉ k δ 2 n 2γ(1−γ/2)+η+ n2 logq q +(1−q )(1− 1−q−1 ) = 4n q (82) In (a), we used the fact that the number of matrices of rank 2 no greater than r is bounded above by 4q (2γ(1−γ/2)+η)n (Lemma 1) for n sufficiently large (depending on η by the convergence of r/n to γ). Equality (b) is obtained by applying (79) in Lemma 21. Our objective in the rest of the proof is to find sufficient conditions on k and β so that (81) and (82) both converge to zero. We start with Bn . From (82) we observe that if for every ε > 0, there exists an N1,ε ∈ N such that  ε 2γ(1 − γ/2)n2  k > 1+ ⌈βn2 ⌉  ,  5 δ −1 −1 − logq q + (1 − q ) 1 − 1−q−1 (83) for all n > N1,ε , then Bn → 0 since the exponent in (82) is negative (for η sufficiently small). Now, we claim that if limn→∞ ⌈βn2 ⌉δ = +∞ then the denominator in (83) tends to 1 from below. This is justified as follows: Consider the term,  ⌈βn2 ⌉   δ ⌈βn2 ⌉δ n→∞ 1− ≤ exp − −→ 0, 1 − q −1 1 − q −1 so the argument of the logarithm in (83) tends to q −1 from above if limn→∞ ⌈βn2 ⌉δ = +∞. Since δ ∈ Ω( logn n ), by definition, there exists a constant C ∈ (0, ∞) and an integer Nδ ∈ N such that log2 (n) , n for all n > Nδ . Let β be defined as δ = δn ≥ C β = βn := 2γ(1 − γ/2) log2 (e)δ . log2 (n) (84) (85) Then ⌈βn2 ⌉δ ≥ 2γ(1 − γ/2) log2 (e)C 2 log2 (n) = Θ(log n) and so the condition limn→∞ ⌈βn2 ⌉δ = +∞ is satisfied. Thus, for sufficiently large n, the denominator in (83) exceeds 1/(1+ ε/5) < 1. As such, the condition in (83) can be equivalently written as: Given the choice of β in (85), if there exists an N2,ε ∈ N such that  ε 2 γ(1 − γ/2)n2 (86) k >2 1+ 5 . IEEE TRANSACTIONS ON INFORMATION THEORY 20 for all n > N2,ε , then Bn → 0. We now revisit the upper bound on An in (81). The inequality says that, for every ε > 0, if there exists an N3,ε ∈ N such that  ε  H2 (β) + β log2 (q − 1) 2 k > 1+ n , 5 δ log2 (e) (87) for all n > N3,ε , then An → 0 since the exponent in (81) is negative. Note that H2 (β)/(−β log2 β) ↓ 1 as β ↓ 0. Hence, if β is chosen as in (85), then by using (84), we obtain lim n→∞ H2 (β) + β log2 (q − 1) ≤ 2γ(1 − γ/2). δ log2 (e) (88) In particular, for n sufficiently large, the terms in the sequence in (88) and its limit (which exists) differ by less than 2γ(1 − γ/2)ε/5. Hence (87) is equivalent to the following: Given the choice of β in (85), if there exists an N4,ε ∈ N such that  ε 2 k >2 1+ γ(1 − γ/2)n2 (89) 5 for all n > N4,ε , the sequence An → 0. The choice of β in (85) “balances” the two sums An and Bn in (80). Also note that 2(1 + ε/5)2 < 2 + ε for all ε ∈ (0, 5/2). Hence, if the number of measurements k satisfies (15) for all n > Nε,δ := max{N1,ε , N2,ε , N3,ε , N4,ε , Nδ }, both (86) and (89) will also be satisfied and consequently, P(En ) ≤ An + Bn → 0 as n → ∞ as desired. We remark that the restriction of ε ∈ (0, 5/2) is not a serious one, since the validity of the claim in Theorem 11 for some ε0 > 0 implies the same for all ε > ε0 . This completes the proof of Theorem 11. It now remains to prove Lemma 21. Proof: Recall that d = kX − Zk0 and θ(d; δ, q, k) = P(hHa , Xi = hHa , Zi, a ∈ [k]). By the i.i.d. nature of the random matrices Ha , a ∈ [k], it is true that θ(d; δ, q, k) = P(hH1 , Xi = hH1 , Zi)k . δ P(hH1 , Xi = hH1 , Zi) = q −1 + (1 − q −1 ) 1 − 1 − q −1 δ/(q − 1) 0 Let the first and second vectors above be p1 and p2 respectively. 
Then, by linearity of the DFT, Fp = Fp1 + Fp2 where     1 − δ − δ/(q − 1) qδ/(q − 1)  1 − δ − δ/(q − 1)  0     Fp1 =  .  , Fp2 =  .. ..     . . 1 − δ − δ/(q − 1) 0 Summing these up yields   1 1 − δ/(1 − q −1 )   Fp =  . ..   . 1 − δ/(1 − q −1 ) Raising Fp to the d-th power yields   1 (1 − δ/(1 − q −1 ))d    (Fp).d =  . ..   . (1 − δ/(1 − q −1 ))d Now using the same splitting technique, (Fp).d can be decomposed into     (1 − δ/(1 − q −1 ))d 1−(1 − δ/(1 − q −1 ))d (1 − δ/(1 − q −1 ))d    0     (Fp).d =  + . .. ..     . . (1 − δ/(1 − q −1 ))d It thus remains to demonstrate that  of the vector v is raised to the d-th power.) We split p into two vectors whose DFTs can be evaluated in closed-form:     δ/(q − 1) 1 − δ − δ/(q − 1)  δ/(q − 1)  0     p= . + .. ..     . . d . (90) This may be proved using induction on d but we prove it using more direct transform-domain ideas. Note that (90) is simply the d-fold q-point circular convolution of the δ-sparse pmf in (39). Let F ∈ Cq×q and F−1 ∈ Cq×q be the discrete Fourier transform (DFT) and the inverse DFT matrices respectively. We use the convention in [50]. Let   1−δ δ/(q − 1)   p := Ph ( · ; δ, q) =   ..   . δ/(q − 1) be the vector of probabilities defined in (39). Then, by properties of the DFT, (90) is simply given by F−1 [(Fp).d ] evaluated at the vector’s first element. (The notation v.d := d [v0d . . . vq−1 ]T denotes the vector in which each component 0 Let s1 and s2 denote each vector on the right hand side above. Define ϕ := (1 − δ/(1 − q −1 ))d . Then, the inverse DFTs of s1 and s2 can be evaluated analytically as   −1   q (1 − ϕ) ϕ q −1 (1 − ϕ) 0     F−1 s2 =  F−1 s1 =  .  , . ..    ..  . 0 q −1 (1 − ϕ) Summing the first elements of F−1 s1 and F−1 s2 completes the proof of (90) and hence of Lemma 21. A PPENDIX F P ROOF OF L EMMA 12 Proof: The only matrix for which the rank r = 0 is the zero matrix which is in C , since C is a linear code (i.e., a subspace). Hence, the sum in (43) consists only of a single term, which is one. Now for 1 ≤ r ≤ n, we start from (43) IEEE TRANSACTIONS ON INFORMATION THEORY 21 where for (a) recall that k ∈ Θ(n2 ) and δ ∈ Ω( logn n ). These facts imply that δ (as a sequence in n) belongs to the interval Ik for all sufficiently large n [because any function in Ω( logn n ) dominates the lower bound logke k for k ∈ Θ(n2 )] so the hypothesis of Theorem 22 is satisfied and we can apply (91) (with l = ǫk) to get inequality (a). Since (92) is a summable sequence, by the Borel-Cantelli lemma, the sequence of random variables m/k → 1 a.s. and by the linearity of expectation, we have X E I{M ∈ C } ENC (r) = M∈Fn×n :rank(M)=r q = X P(M ∈ C ) X q −k = Φq (n, r)q −k , M∈Fn×n :rank(M)=r q (a) = M∈Fn×n :rank(M)=r q where (a) is because M 6= 0 (since 1 ≤ r ≤ n). Hence, as in (18), P(M ∈ C ) = q −k . The proof is completed by appealing to (6), which provides upper and lower bounds on the number of matrices of rank exactly r. For the variance, note that the random variables in the set {I{M ∈ C } : rank(M) = r} are pairwise independent (See Lemma 6). As a result, the variance of the sum in (43) is a sum of variances, i.e., X var(NC (r)) = var(I{M ∈ C }) M∈Fn×n :rank(M)=r q X = M∈Fn×n :rank(M)=r q ≤ X :rank(M)=r M∈Fn×n q   E I{M ∈ C }2 − [E I{M ∈ C }]2 E I{M ∈ C } = ENC (r), as desired. P ROOF A PPENDIX G OF P ROPOSITION 14 Proof: We first restate a beautiful result from [42]. 
For each positive integer k, define the interval Ik := [ logke k , q−1 q ]. Theorem 22 (Corollary 2.4 in [42]). Let M be a random k×k matrix over the finite field Fq , where each element is drawn independently from the pmf in (39) with δ, a sequence in k, belonging to Ik for each k ∈ N. Then, for every l ≤ k, P(k − rank(M) ≥ l) ≤ Aq −l , (91) and A is a constant. Moreover, if A is considered as a function of δ then it is monotonically decreasing as a function in the interval Ik . To prove the Proposition 14, first define N := n2 and let ha := vec(Ha ) ∈ FN q be the vectorized versions of the random ×k sensing matrices. Also let H := [h1 . . . hk ] ∈ FN be the q k×k matrix with columns ha . Finally, let H[k×k] ∈ Fq be the square sub-matrix of H consisting only of its top k rows. Clearly, the dimension of the column span of H, denoted as m ≥ rank(H[k×k] ). Note that m is a sequence of random variables and k is a sequence of integers but we suppress their dependences on n. Fix 0 < ǫ < 1 and consider   m  m −1 ≥ǫ =P ≤ 1−ǫ P k  k rank(H[k×k] ) ≤1−ǫ ≤P k  = P k − rank(H[k×k] ) ≥ ǫk (a) ≤ Aq −ǫk , (92) R EFERENCES [1] V. Y. F. Tan, L. Balzano, and S. C. Draper, “Rank minimization over finite fields,” in Intl. Symp. Inf. Th., (St Petersburg, Russia), Aug 2011. [2] E. J. Candès and T. Tao, “The power of convex relaxation: near-optimal matrix completion,” IEEE Trans. on Inf. Th., vol. 56, pp. 2053–2080, May 2010. [3] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009. [4] B. Recht, “A simpler approach to matrix completion,” To appear in J. Mach. Learn. Research, 2009. arXiv:0910.0651v2. [5] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Rev., vol. 2, no. 52, pp. 471–501, 2009. [6] R. Meka, P. Jain, and I. S. Dhillon, “Guaranteed rank minimization via singular value projection,” in Proc. of Neural Information Processing Systems, 2010. arXiv:0909.5457. [7] E. M. Gabidulin, “Theory of codes with maximum rank distance,” Probl. Inform. Transm., vol. 21, no. 1, pp. 1–12, 1985. [8] R. M. Roth, “Maximum-rank array codes and their application to crisscross error correction,” IEEE Trans. on Inf. Th., vol. 37, pp. 328– 336, Feb 1991. [9] P. Loidreau, “Properties of codes in rank metric,” 2006. arXiv:0610057. [10] D. Silva, F. R. Kschischang, and R. Kötter, “A rank-metric approach to error control in random network coding,” IEEE Trans. on Inf. Th., vol. 54, pp. 3951 – 3967, Sep 2008. [11] A. Montanari and R. Urbanke, “Coding for network coding,” 2007. arXiv:0711.3935. [12] M. Gadouleau and Z. Yan, “Packing and covering properties of rank metric codes,” IEEE Trans. on Inf. Th., vol. 54, pp. 3873–3883, Sep 2008. [13] ACM SIGKDD and Netflix, Proceedings of KDD Cup and Workshop, (San Jose, CA), Aug 2007. Proceedings available online at http://www.cs.uic.edu/∼ liub/KDD-cup-2007/proceedings.html. [14] M. Fazel, H. Hindi, and S. P. Boyd, “A Rank Minimization Heuristic with Application to Minimum Order System Approximation,” in American Control Conference, 2001. [15] M. Fazel, H. Hindi, and S. P. Boyd, “Log-det heuristic for matrix rank minimization with applications with applications to Hankel and Euclidean distance metrics,” in American Control Conference, 2003. [16] Z. Bar-Yossef, Y. Birk, T. S. Jayram, and T. Kol, “Index coding with side information,” IEEE Trans. on Inf. Th., vol. 57, pp. 1479 – 1494, Mar 2011. [17] R. M. 
Roth, “Probabilistic crisscross error correction,” IEEE Trans. on Inf. Th., vol. 43, pp. 1425–1438, May 1997. [18] D. Silva, F. R. Kschischang, and R. Kötter, “Communication over finitefield matrix channels,” IEEE Trans. on Inf. Th., vol. 56, pp. 1296 – 1305, Mar 2010. [19] A. Barg and G. D. Forney, “Random codes: Minimum distances and error exponents,” IEEE Trans. on Inf. Th., vol. 48, pp. 2568–2573, Sep 2002. [20] R. G. Gallager, Low density parity check codes. MIT Press, 1963. [21] R. Kötter and F. R. Kschischang, “Coding for errors and erasures in random network coding,” IEEE Trans. on Inf. Th., vol. 54, pp. 3579 – 3591, Aug 2008. [22] R. W. Nóbrega, B. F. Uchôa-Filho, and D. Silva, “On the capacity of multiplicative finite-field matrix channels,” in Intl. Symp. Inf. Th., (St Petersburg, Russia), Aug 2011. [23] D. de Caen, “A lower bound on the probability of a union,” Discrete Math., vol. 69, pp. 217–220, May 1997. [24] G. E. Séguin, “A lower bound on the error probability for signals in white Gaussian noise,” IEEE Trans. on Inf. Th., vol. 44, pp. 3168–3175, Jul 1998. IEEE TRANSACTIONS ON INFORMATION THEORY [25] A. Cohen and N. Merhav, “Lower bounds on the error probability of block codes based on improvements on de Caen’s inequality,” IEEE Trans. on Inf. Th., vol. 50, pp. 290–310, Feb 2004. [26] D. Baron, S. Sarvotham, and R. G. Baraniuk, “Bayesian compressive sensing via belief propagation,” IEEE Trans. on Sig. Proc., vol. 51, pp. 269 – 280, Jan 2010. [27] Y. C. Eldar, D. Needell, and Y. Plan, “Unicity conditions for low-rank matrix recovery,” Preprint, Apr 2011. arXiv:1103.5479 (Submitted to SIAM Journal on Mathematical Analysis). [28] D. S. Papailiopoulos and A. G. Dimakis, “Distributed storage codes meet multiple-access wiretap channels,” in Proc. of Allerton, 2010. [29] S. C. Draper and S. Malekpour, “Compressed sensing over finite fields,” in Intl. Symp. Inf. Th., (Seoul, Korea), July 2009. [30] S. Vishwanath, “Information theoretic bounds for low-rank matrix completion,” in Intl. Symp. Inf. Th., (Austin, TX), July 2010. [31] A. Emad and O. Milenkovic, “Information theoretic bounds for tensor rank minimization,” in Proc. of Globecomm, Dec 2011. arXiv:1103.4435. [32] A. Kakhaki, H. K. Abadi, P. Pad, H. Saeedi, K. Alishahi, and F. Marvasti, “Capacity achieving random sparse linear codes,” Preprint, Aug 2011. arXiv:1102.4099v3. [33] M. Grötchel, L. Lovász, and A. Schrijver, “The ellipsoid method and its consequences in combinatorial optimization,” Combinatorica, vol. 1, no. 2, pp. 169–197, 1981. [34] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms. McGraw-Hill Science/Engineering/Math, 2nd ed., 2003. [35] T. M. Cover and J. A. Thomas, Elements of Information Theory. WileyInterscience, 2nd ed., 2006. [36] I. Csiszár, “Linear codes for sources and source networks: Error exponents, universal coding,” IEEE Trans. on Inf. Th., vol. 28, pp. 585–592, Apr 1982. [37] R. G. Gallager, Information Theory and Reliable Communication. Wiley, 1968. [38] D. Silva. Personal communication, Sep 2011. [39] F. Kschischang, B. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Trans. on Inf. Th., vol. 47, pp. 498–519, Feb 2001. [40] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications. Springer, 2nd ed., 1998. [41] R. Lidl and H. Niederreiter, Introduction to Finite Fields and their Applications. Cambridge University Press, 1994. [42] J. Blömer, R. Karp and E. 
[43] F. Chabaud and J. Stern, "The cryptographic security of the syndrome decoding problem for rank distance codes," in ASIACRYPT, pp. 368–381, 1996.
[44] A. V. Ourivski and T. Johansson, "New technique for decoding codes in the rank metric and its cryptography applications," Probl. Inf. Transm., vol. 38, pp. 237–246, July 2002.
[45] G. Richter and S. Plass, "Error and erasure decoding of rank-codes with a modified Berlekamp–Massey algorithm," in Proceedings of ITG Conference on Source and Channel Coding, Jan 2004.
[46] R. Peeters, "Orthogonal representations over finite fields and the chromatic number of graphs," Combinatorica, vol. 16, no. 3, pp. 417–431, 1996.
[47] L. Lovász, "On the Shannon capacity of a graph," IEEE Trans. on Inf. Th., vol. IT-25, pp. 1–7, Jan 1981.
[48] A. G. Dimakis and P. O. Vontobel, "LP decoding meets LP decoding: A connection between channel coding and compressed sensing," in Proc. of Allerton, 2009.
[49] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Akadémiai Kiadó, 1997.
[50] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing. Prentice Hall, 1999.

Vincent Y. F. Tan received the B.A. and M.Eng. degrees in Electrical and Information Engineering from Sidney Sussex College, Cambridge University, in 2005. He received the Ph.D. degree in Electrical Engineering and Computer Science (EECS) from the Massachusetts Institute of Technology (MIT) in 2011. He is currently a postdoctoral researcher in the Electrical and Computer Engineering Department at the University of Wisconsin (UW), Madison, as well as a research affiliate at the Laboratory for Information and Decision Systems (LIDS) at MIT. He has held summer research internships at Microsoft Research in 2008 and 2009. His research is supported by A*STAR, Singapore. His research interests include network information theory, detection and estimation, and learning and inference of graphical models.
Dr. Tan is a recipient of the 2005 Charles Lamb Prize, a Cambridge University Engineering Department prize awarded annually to the top candidate in Electrical and Information Engineering. He also received the 2011 MIT EECS Jin-Au Kong outstanding doctoral thesis prize. He has served as a reviewer for the IEEE Transactions on Signal Processing, the IEEE Transactions on Information Theory, and the Journal of Machine Learning Research.

Laura Balzano is a Ph.D. candidate in Electrical and Computer Engineering, working with Professor Robert Nowak at the University of Wisconsin (UW), Madison, with the degree expected in May 2012. She received her B.S. in Electrical Engineering from Rice University in 2002 and her M.S. in Electrical Engineering from the University of California, Los Angeles, in 2007. She received the Outstanding M.S. Degree of the Year award from UCLA. She has worked as a software engineer at Applied Signal Technology, Inc. Her Ph.D. studies are supported by a 3M fellowship. Her main research focus is low-rank modeling for inference and learning with highly incomplete or corrupted data, and its applications to communication, biological, and sensor networks, and to collaborative filtering.

Stark C. Draper (S'99–M'03) is an Assistant Professor of Electrical and Computer Engineering at the University of Wisconsin (UW), Madison.
He received the M.S. and Ph.D. degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology (MIT), and the B.S. and B.A. degrees in Electrical Engineering and History, respectively, from Stanford University. Before moving to Wisconsin, Dr. Draper worked at the Mitsubishi Electric Research Laboratories (MERL) in Cambridge, MA. He held postdoctoral positions in the Wireless Foundations, University of California, Berkeley, and in the Information Processing Laboratory, University of Toronto, Canada. He has worked at Arraycomm, San Jose, CA, the C. S. Draper Laboratory, Cambridge, MA, and Ktaadn, Newton, MA. His research interests include communication and information theory, error-correction coding, statistical signal processing and optimization, security, and the application of these disciplines to computer architecture and semiconductor device design. Dr. Draper has received an NSF CAREER Award, the UW ECE Gerald Holdridge Teaching Award, the MIT Carlton E. Tucker Teaching Award, an Intel Graduate Fellowship, Stanford's Frederick E. Terman Engineering Scholastic Award, and a U.S. State Department Fulbright Fellowship.