SIAM J. MATRIX ANAL. APPL.
Vol. 0, No. 0, pp. 000–000
© XXXX Society for Industrial and Applied Mathematics

A SUPERFAST ALGORITHM FOR TOEPLITZ SYSTEMS OF LINEAR EQUATIONS*

S. CHANDRASEKARAN†, M. GU‡, X. SUN§, J. XIA¶, AND J. ZHU‡

*Received by the editors October 19, 2004; accepted for publication (in revised form) by D. A. Bini May 8, 2007; published electronically DATE. http://www.siam.org/journals/simax/x-x/61720.html
†Department of Electrical and Computer Engineering, University of California at Santa Barbara, Santa Barbara, CA 93106 (shiv@ece.ucsb.edu).
‡Department of Mathematics, University of California at Berkeley, Berkeley, CA 94720 (mgu@math.berkeley.edu, zhujiang.cal@gmail.com).
§Department of Mechanical Engineering, University of California at Berkeley, Berkeley, CA 94720 (sunxt@cal.berkeley.edu).
¶Department of Mathematics, University of California at Los Angeles, Los Angeles, CA 90095 (jxia@math.ucla.edu).

Abstract. In this paper we develop a new superfast solver for Toeplitz systems of linear equations. A common approach to solving Toeplitz systems is through displacement equation methods: with displacement structures, Toeplitz matrices can be transformed into Cauchy-like matrices using the FFT or other trigonometric transformations. These Cauchy-like matrices have a special property: their off-diagonal blocks have small numerical ranks. This low-rank property plays a central role in our superfast Toeplitz solver. It enables us to quickly approximate the Cauchy-like matrices by structured matrices called sequentially semiseparable (SSS) matrices. The major work in constructing these SSS forms can be done in precomputations (independent of the Toeplitz matrix entries). The SSS representations are compact because of the low-rank property, and the SSS Cauchy-like systems can be solved in linear time with linear storage. Excluding precomputations, the main operations are the FFT and the SSS system solver, both of which are very efficient. Our new Toeplitz solver is stable in practice. Numerical examples are presented to illustrate the efficiency and the practical stability.

Key words. displacement equation, SSS structure, superfast algorithm, Toeplitz matrix

AMS subject classifications. 15A06, 65F05, 65G05

DOI. 10.1137/040617200

1. Introduction. Toeplitz systems of linear equations arise in many applications, including PDE solving, signal processing, time series analysis, orthogonal polynomials, and many others. A Toeplitz system is a linear system

(1.1)    T x = b

whose coefficient matrix is a Toeplitz matrix

(1.2)    T = \begin{pmatrix} t_0 & t_{-1} & t_{-2} & \cdots & t_{-(N-1)} \\ t_1 & t_0 & t_{-1} & \cdots & t_{-(N-2)} \\ t_2 & t_1 & t_0 & \cdots & t_{-(N-3)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ t_{N-1} & t_{N-2} & t_{N-3} & \cdots & t_0 \end{pmatrix},

that is, its entries are constant along every diagonal (a matrix whose entries are constant along every antidiagonal is called a Hankel matrix). The vector t = (t_{-(N-1)}, ..., t_{-1}, t_0, t_1, ..., t_{N-1}) is called the Toeplitz vector that generates T.
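For concreteness, here is a minimal sketch (in Python with NumPy/SciPy, not the authors' Fortran 90 code) of how a Toeplitz matrix is generated from its Toeplitz vector t; the variable names are illustrative only.

```python
import numpy as np
from scipy.linalg import toeplitz

# Build T from the generating vector t = (t_{-(N-1)}, ..., t_0, ..., t_{N-1}),
# stored here as an array of length 2N - 1 with t_0 at index N - 1.
N = 5
t = np.random.randn(2*N - 1)
col = t[N-1:]            # first column: t_0, t_1, ..., t_{N-1}
row = t[N-1::-1]         # first row:    t_0, t_{-1}, ..., t_{-(N-1)}
T = toeplitz(col, row)   # T[i, j] = t_{i-j}
```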
There are both direct and iterative methods for solving (1.1). Direct solvers are said to be fast if they cost O(N^2) operations; examples include Schur-type methods, Levinson-type methods, and others [32]. An important class of direct solvers consists of displacement equation–type fast solvers based on Gaussian elimination; examples are the methods of Heinig [28], GKO [22], and Gu [26], all with O(N^2) complexity. Methods with complexity below O(N^2) are called superfast. In this paper we present a displacement equation–type superfast algorithm.

1.1. Fast and superfast methods. Many fast and superfast methods that have been developed are numerically unstable [5, 15, 16, 7, 37]. References [5] and [15] showed that the Schur algorithm and the Levinson algorithm are weakly stable in some cases, but both may be highly unstable for indefinite and nonsymmetric matrices. Stable generalized Schur algorithms [32] and look-ahead algorithms [9, 10] were developed, and high-performance look-ahead Schur algorithms were presented in [20]. Many other solvers use the FFT or other trigonometric transforms to convert Toeplitz (or even Hankel or Toeplitz-plus-Hankel) matrices into generalized Cauchy or Vandermonde matrices, which can be done stably in O(N log N) operations. This is also the approach we take in this paper, with the aid of a displacement structure.

The concept of displacement structure was first introduced in [31]. The Sylvester-type displacement equation for a matrix Ĉ ∈ R^{N×N} [29] is

(1.3)    Ω Ĉ − Ĉ Λ = U V,

where Ω, Λ ∈ R^{N×N}, U ∈ R^{N×α}, and V ∈ R^{α×N}. If rank(UV) = α, then α is called the displacement rank of Ĉ with respect to Ω and Λ, and Ĉ is said to possess a displacement structure with respect to Ω and Λ if α ≪ N. With displacement structures it was shown in [19, 24, 36, 22, 28] that Toeplitz and Hankel matrices can be transformed into Cauchy-like matrices of the form

Ĉ = \left( \frac{u_i^T v_j}{\eta_i − \lambda_j} \right)_{1 \le i,j \le N}    (u_i, v_j ∈ R^α),

where we assume η_i ≠ λ_j for 1 ≤ i, j ≤ N. Equivalently, a Cauchy-like matrix is the unique solution of the displacement equation (1.3) with

Ω = diag(η_1, ..., η_N),  Λ = diag(λ_1, ..., λ_N),  U = \begin{pmatrix} u_1^T \\ \vdots \\ u_N^T \end{pmatrix},  and  V = (v_1, ..., v_N).

In particular, Ĉ is a Cauchy matrix if u_i^T v_j = 1 for all i and j. The displacement rank of Ĉ is at most α.

To solve Toeplitz systems through Cauchy-like matrices, many authors have used matrix factorizations. Gohberg and Olshevsky [23] presented a fast variant of the straightforward Gaussian elimination with partial pivoting (GEPP) procedure that solves a Cauchy-like linear system in O(N^2) operations. Among other results, Gohberg, Kailath, and Olshevsky [22] developed the GKO algorithm, an improved version of Heinig's algorithm [28], and demonstrated numerically that it is stable; in their algorithm, Hankel and Toeplitz-plus-Hankel matrices are also transformed via fast trigonometric transforms into Cauchy-like matrices. Gu presented a modified algorithm in [26] to avoid extra error growth; it is numerically stable provided that the element growth in the computed factorization is not large, takes O(N^2) operations, and is thus a fast stable method.

Superfast algorithms appeared in [34, 4, 6, 17, 35, 1, 2, 21] and many other works. Superfast algorithms use divide-and-conquer strategies; the first such idea was developed by Morf [34]. These methods are unstable for nonsymmetric systems, as they cannot deal with nearly singular leading principal submatrices. Van Barel and Kravanja presented a superfast method for rational interpolation at roots of unity [39]; a similar idea was then applied to Toeplitz systems [38], providing an explicit formula for the inverse of a Toeplitz matrix. Additional techniques such as iterative refinement and downdating were still required to stabilize their algorithm.
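The Cauchy-like structure above is easy to check numerically. The following sketch (with made-up nodes and generators, α = 2) builds a Cauchy-like matrix and verifies that it solves the displacement equation (1.3) with diagonal Ω and Λ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha = 8, 2
eta = rng.standard_normal(N) + 2.0        # nodes eta_i
lam = rng.standard_normal(N) - 2.0        # nodes lambda_j, so eta_i != lambda_j
U = rng.standard_normal((N, alpha))       # rows u_i^T
V = rng.standard_normal((alpha, N))       # columns v_j
C = (U @ V) / (eta[:, None] - lam[None, :])   # C_ij = u_i^T v_j / (eta_i - lambda_j)

# Omega C - C Lambda should equal U V up to roundoff.
res = np.diag(eta) @ C - C @ np.diag(lam) - U @ V
print(np.linalg.norm(res))                # ~ machine precision
```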
1.2. Main results. Our new Toeplitz solver is also of the displacement equation type. Given a Toeplitz linear system, we first use the FFT to transform the associated Toeplitz matrix into a Cauchy-like matrix. Then, instead of using matrix factorizations, which often cost O(N^2) or more, we exploit a special low-rank property of Cauchy-like matrices: every off-diagonal block of a Cauchy-like matrix has low numerical rank. Using this property, we approximate the Cauchy-like matrix by a low-rank matrix structure called a sequentially semiseparable (SSS) matrix, proposed by Chandrasekaran et al. [12, 13]. A system whose coefficient matrix is in compact SSS form can be solved with only O(p^2 N) operations, where N is the matrix dimension and p is the complexity of the semiseparable description. The SSS solver is practically stable in our numerical tests and in those of [12, 13].

The SSS structure was developed to capture the low-rank property of the off-diagonal blocks of a matrix while maintaining stability or practical stability. It is a matrix analog of the semiseparable integral kernels in Kailath's paper [30]. Matrix operations with compact SSS representations are very efficient, provided that such compact representations exist or can be computed easily. This turns out to be true in our case, as the Cauchy-like matrices are transformed from Toeplitz matrices. We use a recursive compression scheme with a shifting strategy to construct compact SSS forms for these Cauchy-like matrices. The major work of the compressions can be precomputed on certain Cauchy matrices that are independent of the actual Toeplitz matrix entries.

The overall algorithm thus has the following stages (see the sketch at the end of this subsection):
(1) Precompute compressions of off-diagonal blocks of Cauchy matrices.
(2) Transform the Toeplitz matrix into a Cauchy-like matrix in O(N log N) operations.
(3) Construct a compact SSS representation from the precomputed compressions and solve the Cauchy-like system in O(p^2 N) operations.
(4) Recover the solution of the Toeplitz system in O(N log N) operations.

Each stage is either stable or practically stable, and our numerical results indicate that the overall algorithm is stable in practice. The Toeplitz matrix does not have to be symmetric or positive definite, and no extra stabilizing step is necessary. After the precomputations, the total cost of the algorithm is O(N log N) + O(p^2 N), so the entire algorithm is superfast.
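As a minimal end-to-end illustration of stages (2)–(4), the sketch below transforms a small Toeplitz system into a Cauchy-like system and back, using the transformation made precise in Proposition 2.1 below. A dense LU solve stands in for the SSS stage (3), and the dense products with F stand in for O(N log N) FFTs; the recovery formula x = D_0^{-1} F^H x̃ is our reading of the proposition, not code from the paper.

```python
import numpy as np
from scipy.linalg import toeplitz

N = 64
k = np.arange(N)
w = np.exp(1j*np.pi/N)                                # omega = e^{pi i / N}
F = np.exp(2j*np.pi*np.outer(k, k)/N) / np.sqrt(N)    # normalized inverse DFT matrix
D0 = w**k                                             # diag(1, w, ..., w^{N-1})

t = np.random.randn(2*N - 1)
T = toeplitz(t[N-1:], t[N-1::-1])
x_true = np.random.randn(N)
b = T @ x_true

C = F @ T @ ((1/D0)[:, None] * F.conj().T)   # stage (2): C = F T D0^{-1} F^H
b_tilde = F @ b                              # transformed right-hand side
x_tilde = np.linalg.solve(C, b_tilde)        # stand-in for the SSS solve, stage (3)
x = (F.conj().T @ x_tilde) / D0              # stage (4): x = D0^{-1} F^H x_tilde
print(np.linalg.norm(T @ x - b))             # small residual, up to conditioning
```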
We also point out that similar techniques are used in [33], where the low-rank property is exploited through the block columns without diagonals (called neutered block columns in [33]), in contrast with the off-diagonal blocks here. Compressions of either the neutered blocks or the off-diagonal blocks give data-sparse representations that enable fast factorizations of the Cauchy-like matrices. In fact, corresponding to neutered block rows or columns there are also matrix representations called hierarchically semiseparable (HSS) matrices [14], which are usually more complicated structures than SSS matrices.

1.3. Overview. We discuss the displacement structure and the transformation from a Toeplitz problem to a Cauchy-like problem in section 2, where the low-rank property of the Cauchy-like problem is also exploited. Section 3 gives a linear-complexity solver using the SSS structure. In section 4 we present an algorithm for the fast construction of SSS structures. We then analyze the complexity in section 5 and use numerical experiments to demonstrate the efficiency and the practical stability; all algorithms have been implemented in Fortran 90. Section 6 draws some conclusions.

2. Displacement structures and low-rank property.

2.1. Cauchy-like systems. Given a Toeplitz system (1.1), we can use a displacement structure to transform it into a Cauchy-like system. Define

Z_\delta = \begin{pmatrix} 0 & 0 & \cdots & 0 & \delta \\ 1 & 0 & \cdots & \cdots & 0 \\ 0 & 1 & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & 0 \end{pmatrix},

and let Ω = Z_1 and Λ = Z_{−1} in (1.3). Kailath, Kung, and Morf [31] showed that every Toeplitz matrix satisfies the displacement equation (1.3) with Ω = Z_1, Λ = Z_{−1}, and a right-hand side UV that has nonzero entries only in its first row and last column and has rank at most 2. Hence the displacement rank of a Toeplitz matrix with respect to Z_1 and Z_{−1} is at most 2. The following result can be found in [28].

Proposition 2.1. Let Ĉ ∈ R^{N×N} be a matrix satisfying the displacement equation

(2.1)    Z_1 Ĉ − Ĉ Z_{−1} = U V,

where U ∈ R^{N×α} and V ∈ R^{α×N}. Then F Ĉ D_0^{−1} F^H is a Cauchy-like matrix satisfying

(2.2)    D_1 (F Ĉ D_0^{−1} F^H) − (F Ĉ D_0^{−1} F^H) D_{−1} = (F U)(V D_0^H F^H),

where F = \frac{1}{\sqrt{N}} \left( \omega^{2(k−1)(j−1)} \right)_{1 \le k,j \le N} is the normalized inverse discrete Fourier transform matrix, ω = e^{πi/N}, and

D_1 = diag(1, ω^2, ..., ω^{2(N−1)}),  D_{−1} = diag(ω, ω^3, ..., ω^{2N−1}),  D_0 = diag(1, ω, ..., ω^{N−1}).

Here α ≤ 2.

This proposition suggests that a Toeplitz matrix T can be converted into the Cauchy-like matrix in (2.2). Therefore the Toeplitz system (1.1) can be readily transformed into a new system

(2.3)    C x̃ = b̃,

where C has the form

(2.4)    C = \left( \frac{u_i^T v_j}{\omega^{2i} − \omega^{2j+1}} \right)_{1 \le i,j \le N}    (u_i, v_j ∈ R^α).

In section 3 we present a fast solver for (2.3). After obtaining x̃ we recover x with an FFT again. All the stages involving the FFT are stable and cost O(N log N). The solver for (2.3) has linear complexity and turns out to be practically stable. Thus the total cost of our algorithm is bounded by O(N log N) + O(N p^2), where p is a parameter described below. Hence our method is superfast with practical stability.
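A quick numerical check of Proposition 2.1 (again a Python sketch, not the paper's code): for a random Toeplitz T, the displacement D_1 C − C D_{−1} of C = F T D_0^{−1} F^H should have numerical rank at most 2.

```python
import numpy as np
from scipy.linalg import toeplitz

N = 64
k = np.arange(N)
w = np.exp(1j*np.pi/N)
F = np.exp(2j*np.pi*np.outer(k, k)/N) / np.sqrt(N)
D1, Dm1, D0 = w**(2*k), w**(2*k + 1), w**k

t = np.random.randn(2*N - 1)
T = toeplitz(t[N-1:], t[N-1::-1])
C = F @ T @ ((1/D0)[:, None] * F.conj().T)    # C = F T D0^{-1} F^H

disp = D1[:, None]*C - C*Dm1[None, :]         # D1 C - C D_{-1}
s = np.linalg.svd(disp, compute_uv=False)
print(s[:4])                                  # only the first two are nonnegligible
```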
2.2. Low-rank property of Cauchy-like matrices. In this section we show a low-rank property of C: every off-diagonal block of C has low numerical rank. This property is the basis of the superfast SSS solver in section 3.

First, a simple numerical experiment gives an idea of the low-rank property of C. To find numerical ranks we can use one of the following tools:
(1) τ-accurate SVD: singular values less than τ are dropped if τ is an absolute tolerance, or singular values less than τ times the largest singular value are dropped if τ is a relative tolerance.
(2) τ-accurate QR: A ≈ QR, where A is m × n, Q is m × k, R is k × n, and k ≤ l ≡ min(m, n), obtained as follows. Compute the exact QR factorization A = Q̂R̂, where Q̂ is m × l and R̂ is l × n with diagonal entries satisfying |R̂_{11}| ≥ |R̂_{22}| ≥ ... ≥ |R̂_{ll}|. Then obtain R by dropping all rows of R̂ whose diagonal entries are less than τ if τ is an absolute tolerance, or less than τ|R̂_{11}| if τ is a relative tolerance; drop the corresponding columns of Q̂ to obtain Q.
Hereafter, by ranks we mean numerical ranks.

We take random Toeplitz matrices of different sizes, transform them into Cauchy-like matrices C, and compute the numerical ranks of their off-diagonal blocks. For simplicity, we compute the ranks of the blocks C(1 : d, d+1 : N), C(1 : 2d, 2d+1 : N), ..., C(1 : kd, kd+1 : N), ..., where d is a fixed integer; these blocks are numbered 1, 2, ..., k as shown in Figure 2.1. Here k = 8 off-diagonal blocks are considered for each of three N × N Cauchy-like matrices. Table 2.1 reports the numerical ranks, computed with the τ-accurate SVD, where τ is an absolute tolerance. The numerical ranks are small relative to the block sizes, and when we double the dimension of the matrix the numerical ranks increase only slightly. This effect is even more pronounced when a larger τ is used.

Fig. 2.1. Off-diagonal blocks (numbered as 1, 2, 3). Upper triangular part only.

Table 2.1: Off-diagonal numerical ranks of Cauchy-like matrices transformed from some Toeplitz matrices, with absolute tolerance τ = 10^{−9}.

|         | N = 640    |      | N = 1280    |      | N = 2560    |      |
| Block # | Block size | Rank | Block size  | Rank | Block size  | Rank |
| 1       | 80 × 560   | 37   | 160 × 1120  | 44   | 320 × 2240  | 52   |
| 2       | 160 × 480  | 43   | 320 × 960   | 50   | 640 × 1920  | 57   |
| 3       | 240 × 400  | 45   | 480 × 800   | 53   | 960 × 1600  | 60   |
| 4       | 320 × 320  | 46   | 640 × 640   | 53   | 1280 × 1280 | 61   |
| 5       | 400 × 240  | 46   | 800 × 480   | 52   | 1600 × 960  | 60   |
| 6       | 480 × 160  | 43   | 960 × 320   | 50   | 1920 × 640  | 58   |
| 7       | 560 × 80   | 37   | 1120 × 160  | 44   | 2240 × 320  | 52   |

The low-rank property can be verified theoretically as follows. We first consider the special case of (2.4) in which all u_i^T v_j = 1, and show that the Cauchy matrix

(2.5)    C_0 = \left( \frac{1}{\omega^{2i} − \omega^{2j+1}} \right)_{1 \le i,j \le N}

has low-rank off-diagonal blocks. The central idea is similar to that of the fast multipole method [25, 8]: an interaction matrix on well-separated sets of points (see, e.g., Figure 2.2) is numerically of low rank.

We introduce two sets of points {λ_k}_{k=1}^N ≡ {ω^{2k}}_{k=1}^N and {η_k}_{k=1}^N ≡ {ω^{2k+1}}_{k=1}^N on the unit circle, and consider an off-diagonal block of C_0,

G = \left( \frac{1}{\lambda_i − \eta_j} \right)_{1 \le i \le p,\ p+1 \le j \le N}.

We can show that G is (numerically) of low rank. Indeed, G corresponds to two well-separated sets {λ_i}_{i=1}^p and {η_j}_{j=p+1}^N; that is, there exist a point c ∈ C and positive constants d, e such that

(2.6)    |λ_i − c| > d + e,  i = 1, ..., p,    |η_j − c| < d,  j = p+1, ..., N.

Consider the expansion

(2.7)    \frac{1}{\lambda_i − \eta_j} = \frac{1}{\lambda_i − c} \cdot \frac{1}{1 − \frac{\eta_j − c}{\lambda_i − c}}

(2.8)    = \sum_{k=0}^{r} \frac{(\eta_j − c)^k}{(\lambda_i − c)^{k+1}} + O\!\left( \left( \frac{\eta_j − c}{\lambda_i − c} \right)^{r+1} \right) \equiv \sum_{k=0}^{r} \frac{(\eta_j − c)^k}{(\lambda_i − c)^{k+1}} + \varepsilon,

where r is chosen so that the error term |ε| = |O(((η_j − c)/(λ_i − c))^{r+1})| is bounded by a given tolerance. We have the estimate

\left| \frac{\eta_j − c}{\lambda_i − c} \right|^{r+1} < \left( \frac{d}{d+e} \right)^{r+1},

which enables us to choose an appropriate r for the tolerance. Thus

(2.9)    G = \left( \sum_{k=0}^{r} \frac{(\eta_j − c)^k}{(\lambda_i − c)^{k+1}} \right)_{1 \le i \le p,\ p+1 \le j \le N} + \hat{\varepsilon} = \begin{pmatrix} \frac{1}{\lambda_1−c} & \frac{1}{(\lambda_1−c)^2} & \cdots & \frac{1}{(\lambda_1−c)^{r+1}} \\ \frac{1}{\lambda_2−c} & \frac{1}{(\lambda_2−c)^2} & \cdots & \frac{1}{(\lambda_2−c)^{r+1}} \\ \vdots & \vdots & & \vdots \\ \frac{1}{\lambda_p−c} & \frac{1}{(\lambda_p−c)^2} & \cdots & \frac{1}{(\lambda_p−c)^{r+1}} \end{pmatrix} \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \eta_{p+1}−c & \eta_{p+2}−c & \cdots & \eta_N−c \\ \vdots & \vdots & & \vdots \\ (\eta_{p+1}−c)^r & (\eta_{p+2}−c)^r & \cdots & (\eta_N−c)^r \end{pmatrix} + \hat{\varepsilon}.

Therefore the numerical rank of G is at most r + 1, up to the error ε̂.

Fig. 2.2. Well-separated sets in the plane.
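The experiment behind Table 2.1 is easy to reproduce in outline. The sketch below computes τ-accurate SVD ranks of the off-diagonal blocks of the Cauchy matrix C_0 in (2.5); we use C_0 here rather than a Cauchy-like C from an actual Toeplitz matrix, which by Theorem 4.1 below increases the ranks by at most a factor of 2, so the exact values of Table 2.1 are not reproduced.

```python
import numpy as np

def numrank(A, tau):
    """tau-accurate SVD rank with absolute tolerance tau."""
    s = np.linalg.svd(A, compute_uv=False)
    return int((s > tau).sum())

N, d, tau = 640, 80, 1e-9
i = np.arange(1, N + 1)
w = np.exp(1j*np.pi/N)
lam, eta = w**(2*i), w**(2*i + 1)
C0 = 1.0 / (lam[:, None] - eta[None, :])    # the Cauchy matrix (2.5)

for blk in range(1, N//d):                  # blocks numbered as in Figure 2.1
    m = blk * d
    print(blk, (m, N - m), numrank(C0[:m, m:], tau))
```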
Now we return to the Cauchy-like matrix (2.4). A similar argument shows that any off-diagonal block of C satisfies

\hat{G} ≈ \left( u_i^T v_j \sum_{k=0}^{r} \frac{(\eta_j − c)^k}{(\lambda_i − c)^{k+1}} \right)_{1 \le i \le p,\ p+1 \le j \le N} = \begin{pmatrix} \frac{u_1^T}{\lambda_1−c} & \frac{u_1^T}{(\lambda_1−c)^2} & \cdots & \frac{u_1^T}{(\lambda_1−c)^{r+1}} \\ \vdots & \vdots & & \vdots \\ \frac{u_p^T}{\lambda_p−c} & \frac{u_p^T}{(\lambda_p−c)^2} & \cdots & \frac{u_p^T}{(\lambda_p−c)^{r+1}} \end{pmatrix} \begin{pmatrix} v_{p+1} & v_{p+2} & \cdots & v_N \\ (\eta_{p+1}−c) v_{p+1} & (\eta_{p+2}−c) v_{p+2} & \cdots & (\eta_N−c) v_N \\ \vdots & \vdots & & \vdots \\ (\eta_{p+1}−c)^r v_{p+1} & (\eta_{p+2}−c)^r v_{p+2} & \cdots & (\eta_N−c)^r v_N \end{pmatrix};

that is, we replace the entries of the two matrix factors in (2.9) by appropriate vectors. Thus the numerical rank of Ĝ is at most α(r + 1), which is small relative to N.

3. SSS structures and superfast SSS solver.

3.1. SSS representations. To take advantage of the low-rank property of the Cauchy-like matrix C, we use the SSS structures introduced by Chandrasekaran et al. [12, 13]. The SSS structure captures the ranks of off-diagonal blocks of a matrix such as those shown in Figure 2.1. A matrix A ∈ C^{M×M̃} is in SSS form if there exist 2n positive integers m_1, ..., m_n and m̃_1, ..., m̃_n with M = m_1 + ... + m_n and M̃ = m̃_1 + ... + m̃_n that block-partition A as A = (A_{ij})_{n×n}, where A_{ij} ∈ C^{m_i×m̃_j} satisfies

(3.1)    A_{ij} = \begin{cases} D_i & \text{if } i = j, \\ U_i W_{i+1} \cdots W_{j−1} V_j^H & \text{if } i < j, \\ P_i R_{i−1} \cdots R_{j+1} Q_j^H & \text{if } i > j. \end{cases}

Here the superscript H denotes the Hermitian transpose, and empty products are defined to be identity matrices. The matrices {U_i}_{i=1}^{n−1}, {V_i}_{i=2}^{n}, {W_i}_{i=2}^{n−1}, {P_i}_{i=2}^{n}, {Q_i}_{i=1}^{n−1}, {R_i}_{i=2}^{n−1}, and {D_i}_{i=1}^{n} are called the generators of the SSS structure; their dimensions are given in Table 3.1.

Table 3.1: Dimensions of the matrices in (3.1).

| Matrix     | U_i       | V_i             | W_i             | P_i       | Q_i             | R_i             |
| Dimensions | m_i × k_i | m̃_i × k_{i−1}  | k_{i−1} × k_i   | m_i × l_i | m̃_i × l_{i+1}  | l_{i+1} × l_i   |

As an example, a matrix A with n = 4 has the form

(3.2)    A = \begin{pmatrix} D_1 & U_1 V_2^H & U_1 W_2 V_3^H & U_1 W_2 W_3 V_4^H \\ P_2 Q_1^H & D_2 & U_2 V_3^H & U_2 W_3 V_4^H \\ P_3 R_2 Q_1^H & P_3 Q_2^H & D_3 & U_3 V_4^H \\ P_4 R_3 R_2 Q_1^H & P_4 R_3 Q_2^H & P_4 Q_3^H & D_4 \end{pmatrix}.

The SSS representation (3.2) is related to the off-diagonal blocks in Figure 2.1 as follows: the upper off-diagonal blocks numbered 1, 2, and 3 are

U_1 \left( V_2^H \;\; W_2 V_3^H \;\; W_2 W_3 V_4^H \right),    \begin{pmatrix} U_1 W_2 \\ U_2 \end{pmatrix} \left( V_3^H \;\; W_3 V_4^H \right),    \begin{pmatrix} U_1 W_2 W_3 \\ U_2 W_3 \\ U_3 \end{pmatrix} V_4^H.

Appropriate row and column bases of the off-diagonal blocks are clearly visible. The SSS structure depends on the sequences {m_i} and {m̃_i} and on the SSS generation scheme. If A is square (M = M̃), we can take the simpler choice m_i = m̃_i, i = 1, ..., n. SSS matrices are closed under addition, multiplication, inversion, etc., although the sizes of the generators may grow. While any matrix can be represented in this form for sufficiently large k_i's and l_i's (the column dimensions of the U_i's and P_i's, respectively), our main focus is on SSS matrices whose off-diagonal blocks have low rank and whose generators have k_i's and l_i's close to those ranks. We call such SSS matrices compact. Cauchy-like matrices, in particular, can have compact SSS forms.
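To make the notation concrete, the following sketch assembles a dense matrix from given SSS generators, exactly per (3.1); the generator lists are 0-based, and the random generators are used only for illustration.

```python
import numpy as np

def sss_to_dense(D, U, W, V, P, R, Q):
    """Assemble A from SSS generators following (3.1); lists are 0-based,
    so e.g. W[k] plays the role of W_{k+1} in the paper's 1-based notation."""
    n = len(D)
    blocks = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                blocks[i][j] = D[i]
            elif i < j:
                M = U[i]
                for k in range(i + 1, j):        # U_i W_{i+1} ... W_{j-1} V_j^H
                    M = M @ W[k]
                blocks[i][j] = M @ V[j].conj().T
            else:
                M = P[i]
                for k in range(i - 1, j, -1):    # P_i R_{i-1} ... R_{j+1} Q_j^H
                    M = M @ R[k]
                blocks[i][j] = M @ Q[j].conj().T
    return np.block(blocks)

# Random generators for n = 4 blocks of size m = 3 and ranks k_i = l_i = 2.
rng = np.random.default_rng(1)
n, m, p = 4, 3, 2
gen = lambda r, c: [rng.standard_normal((r, c)) for _ in range(n)]
A = sss_to_dense(gen(m, m), gen(m, p), gen(p, p), gen(m, p),
                 gen(m, p), gen(p, p), gen(m, p))
print(A.shape)   # (12, 12), with the block pattern of (3.2)
```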
Using SSS structures, we can take advantage of the superfast SSS system solver in [12, 13] to solve the Cauchy-like systems. The solver is efficient when the SSS form is compact, and it is practically stable. It shares ideas with the solver for banded plus semiseparable systems in [11]. Here we briefly describe the main ideas of the solver in [12, 13]. We consider solving the linear system Ax = b, where A ∈ C^{N×N} satisfies (3.1) and b is unstructured. The solver computes an implicit U L V^H decomposition of A, where U and V are orthogonal matrices. Before presenting the formal algorithm, we demonstrate the key ideas on a 4 × 4 block matrix example.

3.2. SSS solver: 4 × 4 example. Let the initial system Ax = b be partitioned as

(3.3)    \begin{pmatrix} D_1 & U_1 V_2^H & U_1 W_2 V_3^H & U_1 W_2 W_3 V_4^H \\ P_2 Q_1^H & D_2 & U_2 V_3^H & U_2 W_3 V_4^H \\ P_3 R_2 Q_1^H & P_3 Q_2^H & D_3 & U_3 V_4^H \\ P_4 R_3 R_2 Q_1^H & P_4 R_3 Q_2^H & P_4 Q_3^H & D_4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix} − \begin{pmatrix} 0 \\ P_2 \\ P_3 R_2 \\ P_4 R_3 R_2 \end{pmatrix} \xi,

where the dimensions of the generators follow Table 3.1 with m_i = m̃_i, and the vector ξ = 0. The extra zero vector ξ on the right-hand side of (3.3) is introduced to obtain a general recursive pattern. The algorithm has two main stages, compression (or elimination) and merging, applied according to the relationship between k_i (l_i) and m_i during the recursion.

3.2.1. Compression. At the beginning, k_1 < m_1 because of the low-rank property described earlier. We apply a unitary transformation q_1^H to U_1 so that the first m_1 − k_1 rows of U_1 become zero:

(3.4)    q_1^H U_1 = \begin{pmatrix} 0 \\ \hat{U}_1 \end{pmatrix} \begin{matrix} \scriptstyle m_1 − k_1 \\ \scriptstyle k_1 \end{matrix}.

Multiplying the first m_1 equations of the system by q_1^H gives

\begin{pmatrix} q_1^H & 0 \\ 0 & I \end{pmatrix} A x = \begin{pmatrix} q_1^H & 0 \\ 0 & I \end{pmatrix} b − \begin{pmatrix} 0 \\ P_2 \\ P_3 R_2 \\ P_4 R_3 R_2 \end{pmatrix} \xi.

We pick another unitary transformation w_1^H to lower-triangularize q_1^H D_1, the (1,1) diagonal block of the transformed A:

q_1^H D_1 w_1^H = \begin{pmatrix} D_{11} & 0 \\ D_{21} & D_{22} \end{pmatrix} \begin{matrix} \scriptstyle m_1 − k_1 \\ \scriptstyle k_1 \end{matrix}.

System (3.3) then becomes

\begin{pmatrix} q_1^H & 0 \\ 0 & I \end{pmatrix} A \begin{pmatrix} w_1^H & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} w_1 & 0 \\ 0 & I \end{pmatrix} x = \begin{pmatrix} q_1^H & 0 \\ 0 & I \end{pmatrix} b − \begin{pmatrix} 0 \\ P_2 \\ P_3 R_2 \\ P_4 R_3 R_2 \end{pmatrix} \xi,

which can be rewritten as

\begin{pmatrix} D_{11} & 0 & 0 & 0 & 0 \\ D_{21} & D_{22} & \hat{U}_1 V_2^H & \hat{U}_1 W_2 V_3^H & \hat{U}_1 W_2 W_3 V_4^H \\ P_2 Q_{11}^H & P_2 \hat{Q}_1^H & D_2 & U_2 V_3^H & U_2 W_3 V_4^H \\ P_3 R_2 Q_{11}^H & P_3 R_2 \hat{Q}_1^H & P_3 Q_2^H & D_3 & U_3 V_4^H \\ P_4 R_3 R_2 Q_{11}^H & P_4 R_3 R_2 \hat{Q}_1^H & P_4 R_3 Q_2^H & P_4 Q_3^H & D_4 \end{pmatrix} \begin{pmatrix} z_1 \\ \hat{x}_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ \gamma_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix} − \begin{pmatrix} 0 \\ 0 \\ P_2 \\ P_3 R_2 \\ P_4 R_3 R_2 \end{pmatrix} \xi,

where we have used the partitions

w_1 x_1 = \begin{pmatrix} z_1 \\ \hat{x}_1 \end{pmatrix} \begin{matrix} \scriptstyle m_1 − k_1 \\ \scriptstyle k_1 \end{matrix},  q_1^H b_1 = \begin{pmatrix} \beta_1 \\ \gamma_1 \end{pmatrix} \begin{matrix} \scriptstyle m_1 − k_1 \\ \scriptstyle k_1 \end{matrix},  and  w_1 Q_1 = \begin{pmatrix} Q_{11} \\ \hat{Q}_1 \end{pmatrix} \begin{matrix} \scriptstyle m_1 − k_1 \\ \scriptstyle k_1 \end{matrix}.

At this point we can solve for z_1 from the triangular system D_{11} z_1 = β_1. We also subtract D_{21} z_1 from the right-hand side to obtain b̂_1 = γ_1 − D_{21} z_1. Then we discard the first m_1 − k_1 rows and columns of the coefficient matrix to obtain

\begin{pmatrix} D_{22} & \hat{U}_1 V_2^H & \hat{U}_1 W_2 V_3^H & \hat{U}_1 W_2 W_3 V_4^H \\ P_2 \hat{Q}_1^H & D_2 & U_2 V_3^H & U_2 W_3 V_4^H \\ P_3 R_2 \hat{Q}_1^H & P_3 Q_2^H & D_3 & U_3 V_4^H \\ P_4 R_3 R_2 \hat{Q}_1^H & P_4 R_3 Q_2^H & P_4 Q_3^H & D_4 \end{pmatrix} \begin{pmatrix} \hat{x}_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} \hat{b}_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix} − \begin{pmatrix} 0 \\ P_2 \\ P_3 R_2 \\ P_4 R_3 R_2 \end{pmatrix} \hat{\xi},

where ξ̂ = ξ + Q_{11}^H z_1. This new system has a structure similar to the original one but with smaller dimension, and we can continue to solve it by recursion if further compressions such as (3.4) are possible. Note that the actual solution component x_1 can be recovered as

x_1 = w_1^H \begin{pmatrix} z_1 \\ \hat{x}_1 \end{pmatrix}.
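The transformation q_1 in (3.4) can be realized with a single full QR factorization; a minimal sketch follows (the w_1 step, an LQ-type factorization of the leading rows of q_1^H D_1, is analogous and omitted).

```python
import numpy as np

rng = np.random.default_rng(2)
m1, k1 = 6, 2
U1 = rng.standard_normal((m1, k1))

# Full QR: U1 = Qf @ Rf, with Rf zero below row k1. Putting the orthogonal
# complement first gives q1 with  q1^H U1 = [0; U1_hat]  as in (3.4).
Qf, Rf = np.linalg.qr(U1, mode='complete')
q1 = np.hstack([Qf[:, k1:], Qf[:, :k1]])
print(np.round(q1.conj().T @ U1, 12))   # first m1 - k1 rows are zero to roundoff
```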
3.2.2. Merging. During the recursive elimination there are situations where k_i is no longer smaller than m_i and no further compression is possible; we then cannot introduce more zeros into the system. In that case we merge appropriate block rows and columns of the matrix. For example, merging the first two block rows and columns rewrites the system as

\begin{pmatrix} \begin{pmatrix} D_1 & U_1 V_2^H \\ P_2 Q_1^H & D_2 \end{pmatrix} & \begin{pmatrix} U_1 W_2 \\ U_2 \end{pmatrix} V_3^H & \begin{pmatrix} U_1 W_2 \\ U_2 \end{pmatrix} W_3 V_4^H \\ P_3 \begin{pmatrix} Q_1 R_2^H \\ Q_2 \end{pmatrix}^H & D_3 & U_3 V_4^H \\ P_4 R_3 \begin{pmatrix} Q_1 R_2^H \\ Q_2 \end{pmatrix}^H & P_4 Q_3^H & D_4 \end{pmatrix} \begin{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} \begin{pmatrix} b_1 \\ b_2 − P_2 \hat{\xi} \end{pmatrix} \\ b_3 \\ b_4 \end{pmatrix} − \begin{pmatrix} 0 \\ P_3 \\ P_4 R_3 \end{pmatrix} (R_2 \hat{\xi}).

Hence the system becomes

\begin{pmatrix} \hat{D}_1 & \hat{U}_1 V_3^H & \hat{U}_1 W_3 V_4^H \\ P_3 \hat{Q}_1^H & D_3 & U_3 V_4^H \\ P_4 R_3 \hat{Q}_1^H & P_4 Q_3^H & D_4 \end{pmatrix} \begin{pmatrix} \hat{x}_1 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} \hat{b}_1 \\ b_3 \\ b_4 \end{pmatrix} − \begin{pmatrix} 0 \\ P_3 \\ P_4 R_3 \end{pmatrix} \tilde{\xi},

where

\hat{D}_1 = \begin{pmatrix} D_1 & U_1 V_2^H \\ P_2 Q_1^H & D_2 \end{pmatrix},  \hat{U}_1 = \begin{pmatrix} U_1 W_2 \\ U_2 \end{pmatrix},  \hat{Q}_1 = \begin{pmatrix} Q_1 R_2^H \\ Q_2 \end{pmatrix},  \hat{x}_1 = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix},  \hat{b}_1 = \begin{pmatrix} b_1 \\ b_2 − P_2 \hat{\xi} \end{pmatrix},  \tilde{\xi} = R_2 \hat{\xi}.

The number of block rows/columns is reduced by one, further compressions become possible, and we can proceed to solve the system recursively. In the case n = 1 we have the system D_1 x_1 = b_1 − 0·ξ, which is solved directly.
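The merging step is pure block assembly; a sketch of the formulas for D̂_1, Û_1, and Q̂_1 above (0-based generator lists, as in the earlier sketch):

```python
import numpy as np

def merge_first_two(D, U, W, V, P, R, Q):
    """Merge the first two block rows/columns of an SSS representation,
    returning the merged leading generators per subsection 3.2.2."""
    D1h = np.block([[D[0],                  U[0] @ V[1].conj().T],
                    [P[1] @ Q[0].conj().T,  D[1]]])
    U1h = np.vstack([U[0] @ W[1], U[1]])              # (U1 W2; U2)
    Q1h = np.vstack([Q[0] @ R[1].conj().T, Q[1]])     # (Q1 R2^H; Q2)
    return D1h, U1h, Q1h
```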
3.3. General solve algorithm. We now give a short description of the general algorithm; the procedure in the 4 × 4 example extends directly to a general system. We assume that the matrix A is in compact SSS form, represented by the generators {U_i}_{i=1}^{n−1}, {V_i}_{i=2}^{n}, {W_i}_{i=2}^{n−1}, {P_i}_{i=2}^{n}, {Q_i}_{i=1}^{n−1}, {R_i}_{i=2}^{n−1}, and {D_i}_{i=1}^{n} as in (3.1). We also partition x = (x_i) and b = (b_i) such that x_i and b_i have m_i rows. As in the 4 × 4 example, each step of the recursion has two stages. In the compression stage, we perform orthogonal eliminations on both sides of A to create an (m_1 − k_1) × (m_1 − k_1) lower triangular submatrix in the top left corner of A; we then solve a small triangular system and obtain the first few components of the solution vector, leaving a new system with fewer unknowns on which we recurse. In the merging stage, we merge the first two block rows and columns of A while maintaining the SSS structure, reducing the number of block rows and columns by one. Combining these two stages, we proceed recursively until n = 1, at which point the linear system is solved directly with standard solvers.

The SSS solver has complexity O(N p^2) [12, 13], where p is the maximum numerical rank of the off-diagonal blocks of A, as compared with the traditional O(N^3) cost for a general dense N × N matrix. Only orthogonal transformations and a single substitution are used in the SSS solver. Although a formal proof of backward stability is not yet available, the solver behaves stably in practice; the reader is referred to [12, 13] for further discussion of stability.

4. Fast construction of SSS representations for C. According to section 3, a system in compact SSS form can be solved very efficiently. We can thus use that algorithm to solve the Cauchy-like system (2.3), provided that C can be quickly written in SSS form; we therefore need an efficient construction scheme. We provide a divide-and-conquer SSS construction scheme using the fast merging and splitting strategy of [13]. If, in addition, the matrix has low-rank off-diagonal blocks and those blocks are easy to compress, the construction can be superfast. We concentrate on this situation, since Cauchy-like matrices have the low-rank property. We first present a general divide-and-conquer construction algorithm and then describe a fast shifting strategy to compress the off-diagonal blocks of C.

4.1. General divide-and-conquer construction algorithm. In this subsection we discuss the construction of the SSS structure of a matrix A for a given partition sequence {m_i}_{i=1}^n. The general construction methods apply to any unstructured matrix, thus proving that every matrix has an SSS structure. (Of course, k_i and l_i will usually be large in that case, precluding any speed-ups.) These methods can be viewed as specific ways to make the realization algorithm of [18] more efficient.

Suppose we are given an N × N matrix A and a partition sequence {m_i}_{i=1}^n with \sum_{i=1}^n m_i = N. Starting with an appropriate sequence {m̃_1, m̃_2}, where m̃_1 = \sum_{i=1}^k m_i and m̃_2 = \sum_{i=k+1}^n m_i, we first partition A into a 2 × 2 block matrix and construct a simple SSS form

(4.1)    A = \begin{pmatrix} D_1 & B \\ E & D_2 \end{pmatrix} = \begin{pmatrix} D_1 & U_1 V_2^H \\ P_2 Q_1^H & D_2 \end{pmatrix},

where

(4.2)    B = U_1 V_2^H,    E = P_2 Q_1^H

are low-rank SVDs of the off-diagonal blocks B and E (the singular value factors being absorbed into one of the two factors in each case). Note that if the compressions are done by τ-accurate SVD approximations or rank-revealing QR decompositions, the "=" signs should be replaced by "≈" signs.

We now split either the (1,1) block or the (2,2) block to obtain a 3 × 3 block SSS matrix. For instance, we can split the (1,1) block according to an appropriate new sequence {m̂_1, m̂_2, m̂_3} with m̂_1 + m̂_2 = m̃_1 and m̂_3 = m̃_2, so that

D_1 = \begin{pmatrix} D_1^{new} & U_1^{new} (V_2^{new})^H \\ P_2^{new} (Q_1^{new})^H & D_2^{new} \end{pmatrix},  U_1 = \begin{pmatrix} U_1^{new} W_2^{new} \\ U_2^{new} \end{pmatrix},  (V_3^{new})^H = V_2^H,  \left( R_2^{new} (Q_1^{new})^H \;\; (Q_2^{new})^H \right) = Q_1^H,  P_3^{new} = P_2,  D_3^{new} = D_2,

where the new generators (marked by the superscript "new") are determined as follows. First, partition the matrices of the old first block conformally with the two new blocks:

D_1 = \begin{pmatrix} D_1^{11} & D_1^{12} \\ D_1^{21} & D_1^{22} \end{pmatrix},  U_1 = \begin{pmatrix} U_1^1 \\ U_1^2 \end{pmatrix},  Q_1^H = \left( (Q_1^1)^H \;\; (Q_1^2)^H \right).

We can then identify

D_1^{new} = D_1^{11},  D_2^{new} = D_1^{22},  U_2^{new} = U_1^2,  V_3^{new} = V_2,  Q_2^{new} = Q_1^2,  P_3^{new} = P_2,  D_3^{new} = D_2.

The remaining matrices satisfy

(4.3)    \left( D_1^{12} \;\; U_1^1 \right) = U_1^{new} \left( (V_2^{new})^H \;\; W_2^{new} \right),    \begin{pmatrix} D_1^{21} \\ (Q_1^1)^H \end{pmatrix} = \begin{pmatrix} P_2^{new} \\ R_2^{new} \end{pmatrix} (Q_1^{new})^H.

By factorizing the left-hand sides with numerical tools such as the SVD or rank-revealing QR, these two equations yield the remaining new generators. A is thus in the new form

A = \begin{pmatrix} D_1^{new} & U_1^{new} (V_2^{new})^H & U_1^{new} W_2^{new} (V_3^{new})^H \\ P_2^{new} (Q_1^{new})^H & D_2^{new} & U_2^{new} (V_3^{new})^H \\ P_3^{new} R_2^{new} (Q_1^{new})^H & P_3^{new} (Q_2^{new})^H & D_3^{new} \end{pmatrix}.

Similar techniques apply when splitting the second block row and column of (4.1). We can continue by splitting the first or last block row and column with the above techniques, or any middle block row and column similarly, until the desired SSS representation for the given sequence {m_i, i = 1, 2, ..., n} is obtained. The general construction can be organized with bisection. The major cost lies in compressions of off-diagonal blocks of the form

(4.4)    D_i^{12} = X_i Y_i^H,

where X_i and Y_i are tall and thin, and D_i^{12} is an off-diagonal block of

D_i = \begin{pmatrix} D_i^{11} & D_i^{12} \\ D_i^{21} & D_i^{22} \end{pmatrix}.

The compression (4.4) can be achieved by a τ-accurate QR factorization. The construction is also practically stable in our implementation.
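A sketch of the first step (4.1)–(4.2) of the divide-and-conquer construction, with the compression done by a τ-accurate SVD (absolute tolerance; the singular values are absorbed into the left factors):

```python
import numpy as np

def lowrank(M, tau):
    """tau-accurate SVD: M ~= X @ Y.conj().T, dropping sigma_i <= tau."""
    X, s, Yh = np.linalg.svd(M, full_matrices=False)
    r = max(int((s > tau).sum()), 1)
    return X[:, :r] * s[:r], Yh[:r, :].conj().T

def construct_2x2(A, m1, tau=1e-9):
    """2 x 2 block SSS form (4.1): returns D1, U1, V2, P2, Q1, D2."""
    D1, D2 = A[:m1, :m1], A[m1:, m1:]
    U1, V2 = lowrank(A[:m1, m1:], tau)   # B = U1 V2^H
    P2, Q1 = lowrank(A[m1:, :m1], tau)   # E = P2 Q1^H
    return D1, U1, V2, P2, Q1, D2
```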
4.2. Compression of off-diagonal blocks. For a general matrix with low-rank off-diagonal blocks, the SSS construction can cost O(N^2 p), since a compression such as (4.2), (4.3), or (4.4) can take O(K^2 p) operations, where K is the dimension of the block being compressed. However, for the Cauchy-like matrix C in (2.3) the compressions can be precomputed.

Theorem 4.1. A compression of the off-diagonal block

(4.5)    G = \left( \frac{u_i^T v_j}{\lambda_i − \eta_j} \right)_{1 \le i \le p,\ q \le j \le N} = X Y^H

of C can be obtained from a compression of the corresponding off-diagonal block

G_0 = \left( \frac{1}{\lambda_i − \eta_j} \right)_{1 \le i \le p,\ q \le j \le N} = X_0 Y_0^H

of C_0, where the column dimension of X is at most twice the column dimension of X_0.

Proof. Assume the off-diagonal block G_0 is in compressed form X_0 Y_0^H, with (i, j) entry 1/(λ_i − η_j) = X_{i,:} Y_{j,:}^H, where X_{i,:} (Y_{j,:}) denotes the ith row of X_0 (the jth row of Y_0). Since α ≤ 2, for simplicity we fix α = 2 in (2.1). Then the corresponding off-diagonal block of C is

G = \left( \frac{u_i^T v_j}{\lambda_i − \eta_j} \right)_{1 \le i \le p,\ q \le j \le N} = \left( (u_{i1} v_{j1} + u_{i2} v_{j2}) X_{i,:} Y_{j,:}^H \right)_{1 \le i \le p,\ q \le j \le N} = \left( \left( u_{i1} X_{i,:} \;\; u_{i2} X_{i,:} \right) \begin{pmatrix} v_{j1} Y_{j,:}^H \\ v_{j2} Y_{j,:}^H \end{pmatrix} \right)_{1 \le i \le p,\ q \le j \le N} = \begin{pmatrix} u_{11} X_{1,:} & u_{12} X_{1,:} \\ \vdots & \vdots \\ u_{p1} X_{p,:} & u_{p2} X_{p,:} \end{pmatrix} \begin{pmatrix} v_{q1} Y_{q,:}^H & \cdots & v_{N1} Y_{N,:}^H \\ v_{q2} Y_{q,:}^H & \cdots & v_{N2} Y_{N,:}^H \end{pmatrix} ≡ X Y^H;

that is, we obtain a compression of G.

Theorem 4.1 shows that compressions of the off-diagonal blocks of C can be converted into compressions of the corresponding blocks of C_0, which is independent of the actual entries of the Toeplitz matrix T. Hence the compressions of the off-diagonal blocks of C_0 can be precomputed. The precomputation can be done in O(N^2 p) flops by a rank-revealing QR factorization such as that in [27]. It is possible to reduce this cost to O(N log N), since the compression of a large off-diagonal block can be obtained from those of small ones; this can be seen implicitly from the following subsection.
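The construction in the proof is entirely elementwise. The sketch below (synthetic X_0, Y_0, and generators, α = 2) verifies that scaling the rows of X_0 and Y_0 by the generator entries reproduces G exactly, with the rank at most doubled.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, r0 = 30, 40, 5
X0 = rng.standard_normal((p, r0))
Y0 = rng.standard_normal((q, r0))
G0 = X0 @ Y0.T                       # stands in for (1/(lambda_i - eta_j))
u = rng.standard_normal((p, 2))      # generator rows u_i^T (alpha = 2)
v = rng.standard_normal((q, 2))      # generator columns v_j

X = np.hstack([u[:, :1] * X0, u[:, 1:] * X0])   # p x 2 r0
Y = np.hstack([v[:, :1] * Y0, v[:, 1:] * Y0])   # q x 2 r0
G = (u @ v.T) * G0                   # (u_i . v_j) / (lambda_i - eta_j)
print(np.linalg.norm(G - X @ Y.T))   # ~ 0: X Y^H compresses G
```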
4.3. Compressions of off-diagonal blocks in precomputation. We further present a shifting strategy to reduce the cost of the compressions in the precomputation. Its significance is that it relates the compressions of large off-diagonal blocks of C_0 to those of small ones; that is, in different splitting stages of the SSS construction of C, the compressions of off-diagonal blocks of different sizes are related.

For simplicity, consider partitioning C_0 into 4 × 4 blocks. Assume that in the first cut we partition C_0 into 2 × 2 blocks of equal dimensions,

C_0 = \left( \frac{1}{\omega^{2i} − \omega^{2j+1}} \right)_{1 \le i,j \le N} = \begin{pmatrix} C_{0;1,1} & C_{0;1,2} \\ C_{0;2,1} & C_{0;2,2} \end{pmatrix},

and set k = N/4; without loss of generality, we consider only the block upper triangular part. Assume that we have obtained a compression X_1 Y_1^H of C_{0;1,2}, so that

C_0 = \begin{pmatrix} C_{0;1,1} & X_1 Y_1^H \\ C_{0;2,1} & C_{0;2,2} \end{pmatrix},

where X_1 and Y_1 are tall and thin and can be computed with a τ-accurate SVD or rank-revealing QR factorization.

Next, we split C_{0;1,1}, the (1,1) block of C_0, into 2 × 2 blocks and compress its off-diagonal block. Suppose the resulting off-diagonal block of C_{0;1,1} has size k × (N/2 − l). We compress it by shifting certain parts of X_1 and Y_1. That is, we partition X_1 and Y_1 conformally as

X_1 = \begin{pmatrix} X_{1,1} \\ X_{1,2} \end{pmatrix},  Y_1 = \begin{pmatrix} Y_{1,1} \\ Y_{1,2} \end{pmatrix},

and pick the k × (N/2 − l) block X_{1,2} Y_{1,1}^H from the lower left corner of C_{0;1,2} = X_1 Y_1^H. Writing X_2 Y_2^H for the (as yet unknown) compression of the upper right off-diagonal block of C_{0;1,1}, the two blocks are

X_2 Y_2^H = \begin{pmatrix} \frac{1}{\omega^{2} − \omega^{2l+3}} & \cdots & \frac{1}{\omega^{2} − \omega^{N+1}} \\ \vdots & & \vdots \\ \frac{1}{\omega^{2k} − \omega^{2l+3}} & \cdots & \frac{1}{\omega^{2k} − \omega^{N+1}} \end{pmatrix},    X_{1,2} Y_{1,1}^H = \begin{pmatrix} \frac{1}{\omega^{2k+2} − \omega^{N+3}} & \cdots & \frac{1}{\omega^{2k+2} − \omega^{2(N−l)+1}} \\ \vdots & & \vdots \\ \frac{1}{\omega^{4k} − \omega^{N+3}} & \cdots & \frac{1}{\omega^{4k} − \omega^{2(N−l)+1}} \end{pmatrix} = \frac{1}{\omega^{2k}} X_2 Y_2^H.

At this point we do not need another factorization to obtain X_2 Y_2^H; we can derive it directly from X_{1,2} and Y_{1,1}, that is, by shifting a subblock of X_1 Y_1^H. A similar situation holds for the splitting of the (2,2) block of C_0 after the first splitting, for the successive compressions in later splittings, and for splittings with general block sizes.

The shifting scheme shows that, across the levels of the divide-and-conquer SSS construction, the compressions of large blocks and small blocks are related, which can be used to further reduce the compression cost.
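The identity underlying the shifting strategy is simply that C_0 is invariant, up to the scalar ω^{−2k}, under shifting both indices by k; a quick check:

```python
import numpy as np

N, k = 64, 16
idx = np.arange(1, N + 1)
w = np.exp(1j*np.pi/N)
C0 = 1.0 / (w**(2*idx)[:, None] - w**(2*idx + 1)[None, :])

A = C0[0:16, 24:40]                 # an off-diagonal block
B = C0[16:32, 40:56]                # the same block shifted by k along the diagonal
print(np.linalg.norm(B - A / w**(2*k)))   # ~ 0:  B = omega^{-2k} A
```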
5. Performance and numerical experiments. It is well known that the FFT transformation of the Toeplitz matrix T to a Cauchy-like matrix C, and the recovery of the solution x of the Toeplitz system from the solution x̃ of the Cauchy-like system, are stable. In addition, the fast divide-and-conquer construction and the SSS system solver of section 3 are both practically stable, as our numerical results confirm. Thus the overall algorithm is stable in practice, and no extra stabilizing steps are needed, as required in some other superfast methods [38].

After the precomputations discussed in subsections 4.2 and 4.3 are finished, the cost of each SSS construction is only O(N p^2), where p is the maximum of the off-diagonal ranks of the Cauchy matrix C_0. The SSS solver also costs O(N p^2). The total cost is thus no more than O(N log N) + O(N p^2) flops, and the total storage requirement is O(N p). For convenience of coding, we use an O(N^2 p) precomputation routine as discussed in subsection 4.2.

A preliminary implementation of our superfast solver in Fortran 90 is available at http://www.math.ucla.edu/~jxia/work/toep. We ran experiments on an Itanium 2 1.4 GHz SGI Altix 350 server with a 64-bit Linux operating system and Intel MKL BLAS. Our method (denoted NEW) is compared with Gaussian elimination with partial pivoting (denoted GEPP, via the LAPACK linear system solver ZGESV [3]). We consider real N × N Toeplitz matrices T whose entries are random and uniformly distributed in [0, 1]. The right-hand sides b are obtained as b = Tx, where the exact solution x has random entries uniformly distributed in [−1, 1]. Matrices of size N = 2^k × 100, k = 2, 3, ..., are considered. These matrices are moderately ill-conditioned; for example, for N = 2^k × 100 with k = 2, ..., 7, the one-norm condition numbers increase from about 10^3 to about 10^6.

For the NEW solver two important parameters are involved: the SSS block size d (Figure 2.1) and the tolerance τ in the compressions of off-diagonal blocks. Here we use the τ-accurate QR factorization with τ a relative tolerance (subsection 2.2). Figure 5.1 shows the execution time (in seconds) for GEPP and for the NEW solver with different d and τ. For the NEW solver only the time for solving the Cauchy-like system (2.3) is reported, as the compressions of off-diagonal blocks can be done in the precomputation, as described in subsection 4.3, and the SSS construction time is nearly the precomputation time. The precomputation time is shown in Figure 5.2; note that the precomputation needs to be done only once for a given matrix size N.

Fig. 5.1. Computation time (in seconds) versus N = 2^k × 100. For the NEW solver, the time for the precomputation is excluded.

Fig. 5.2. Time (in seconds) for precomputations in the NEW solver.

We also consider the time scaling factor, that is, the factor by which the time is multiplied when the matrix size doubles; see Figure 5.3. The time scaling factors for the NEW solver are near 2; that is, the NEW solver is close to being a linear-time solver.

Fig. 5.3. Time scaling factors.

Figures 5.4 and 5.5 present the errors

ε_1 = \frac{\| \hat{x} − x \|_2}{\| x \|_2}  and  ε_2 = \frac{\| T \hat{x} − b \|_2}{\|\, |T|\,|\hat{x}| + |b| \,\|_2},

respectively, where x̂ denotes the numerical solution.

Fig. 5.4. ε_1 versus N = 2^k × 100.

Fig. 5.5. ε_2 versus N = 2^k × 100.

A significant point of the new solver is that we can use a relatively large tolerance τ in the precomputation and solution stages and then use iterative refinement to improve the accuracy. A relatively large τ leads to relatively small off-diagonal ranks (and thus a small p in the operation counts above). Iterative refinement is very cheap because no precomputation is needed and the solver itself is very fast (Figure 5.1). For τ = 10^{−4} and d = 50, the accuracy after 2 to 8 steps of iterative refinement is displayed in Figure 5.6, and the number of iterative refinement steps required to reach ε_2 < 10^{−13} is listed in Table 5.1. It is thus also clear that our NEW solver can perform well as a preconditioner.

Fig. 5.6. ε_2: initial values (before iterative refinement) and after several steps of iterative refinement.
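Iterative refinement around the fast low-accuracy solve is the textbook loop; a minimal sketch, where `solve` is any approximate solver (e.g., the SSS-based solve with a large τ; here just a stand-in callable) and the stopping test uses the residual measure ε_2:

```python
import numpy as np

def eps2(T, x, b):
    """Componentwise-scaled residual  ||T x - b|| / || |T||x| + |b| ||."""
    return np.linalg.norm(T @ x - b) / np.linalg.norm(np.abs(T) @ np.abs(x) + np.abs(b))

def refine(solve, T, b, steps=8, tol=1e-13):
    """Classical iterative refinement; `solve` maps a right-hand side to an
    approximate solution (a stand-in for the superfast solver)."""
    x = solve(b)
    for _ in range(steps):
        if eps2(T, x, b) < tol:
            break
        x = x + solve(b - T @ x)   # one cheap extra solve per step
    return x
```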
Table 5.1: Number of iterative refinement steps required to reach ε_2 < 10^{−13} for different matrix dimensions N = 2^k × 100.

| k               | 2 | 3 | 4 | 5 | 6 | 7  | 8 | 9  |
| Number of steps | 4 | 4 | 5 | 6 | 7 | 15 | 9 | 21 |

The block size d also affects the performance. Figure 5.7 indicates that, for a given N, if d = 2^j × 25 is too large or too small, then both the solver time and the precomputation time can be relatively large. In fact, by counting the operations in the algorithm it is possible to determine an optimal d; see [12] for the derivation of the optimal choice of d in the SSS solver, and similar derivations can be done for both the SSS construction and the SSS solver. We simply point out that when τ gets larger, the off-diagonal ranks decrease and a smaller d should be chosen. In the experiments above we used two sets of parameters: (τ = 10^{−9}, d = 100) and (τ = 10^{−4}, d = 50).

Fig. 5.7. Computation time of the NEW solver with N = 12800, τ = 10^{−9}, and different block sizes d = 2^j × 25.

Finally, it turns out that our current preliminary implementation of the solver is slower than the implementation of the algorithm in [38], likely due to our inefficient data structures and nonoptimized code for SSS matrices. The complicated nature of SSS matrices requires more careful coding and memory management; an improved software implementation is under construction.

6. Conclusions and future work. In this paper we have presented a superfast and practically stable solver for Toeplitz systems of linear equations. A Toeplitz matrix is first transformed into a Cauchy-like matrix, which has a nice low-rank property, and the Cauchy-like system is then solved by a superfast solver that exploits this low-rank property through an SSS representation of the Cauchy-like matrix. A fast construction procedure for SSS structures is presented. After a one-time precomputation the solver is very efficient (O(N log N) + O(N p^2) complexity), and only linear storage is required. In future work we hope to further reduce the precomputation cost and to finish a better-designed version of the current Fortran 90 code, in both the computations and the coding.

Acknowledgments. Many thanks to the referees for their valuable comments, which have greatly improved the development and presentation of this paper. The authors are also grateful to the editors and the editorial assistant for their patience and thoughtful suggestions.

REFERENCES

[1] G. S. Ammar and W. B. Gragg, Superfast solution of real positive definite Toeplitz systems, SIAM J. Matrix Anal. Appl., 9 (1988), pp. 61–76.
[2] G. S. Ammar and W. B. Gragg, Numerical experience with a superfast real Toeplitz solver, Linear Algebra Appl., 121 (1989), pp. 185–206.
[3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, 2nd ed., SIAM, Philadelphia, PA, 1994.
[4] R. R. Bitmead and B. D. O. Anderson, Asymptotically fast solution of Toeplitz and related systems of linear equations, Linear Algebra Appl., 34 (1980), pp. 103–116.
[5] A. W. Bojanczyk, R. P. Brent, F. R. de Hoog, and D. R. Sweet, On the stability of the Bareiss and related Toeplitz factorization algorithms, SIAM J. Matrix Anal. Appl., 16 (1995), pp. 40–57.
[6] R. P. Brent, F. G. Gustavson, and D. Y. Y. Yun, Fast solution of Toeplitz systems of equations and computation of Padé approximants, J. Algorithms, 1 (1980), pp. 259–295.
[7] J. R. Bunch, Stability of methods for solving Toeplitz systems of equations, SIAM J. Sci. Statist. Comput., 6 (1985), pp. 349–364.
[8] J. Carrier, L. Greengard, and V. Rokhlin, A fast adaptive multipole algorithm for particle simulations, SIAM J. Sci. Statist. Comput., 9 (1988), pp. 669–686.
[9] T. F. Chan and P. C. Hansen, A look-ahead Levinson algorithm for general Toeplitz systems, IEEE Trans. Signal Process., 40 (1992), pp. 1079–1090.
[10] T. F. Chan and P. C. Hansen, A look-ahead Levinson algorithm for indefinite Toeplitz systems, SIAM J. Matrix Anal. Appl., 13 (1992), pp. 490–506.
[11] S. Chandrasekaran and M. Gu, Fast and stable algorithms for banded plus semiseparable systems of linear equations, SIAM J. Matrix Anal. Appl., 25 (2003), pp. 373–384.
[12] S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, X. Sun, A.-J. van der Veen, and D. White, Fast Stable Solvers for Sequentially Semi-Separable Linear Systems of Equations and Least Squares Problems, Technical report, University of California, Berkeley, CA, 2003.
[13] S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, X. Sun, A.-J. van der Veen, and D. White, Some fast algorithms for sequentially semiseparable representations, SIAM J. Matrix Anal. Appl., 27 (2005), pp. 341–364.
[14] S. Chandrasekaran, M. Gu, and W. Lyons, A Fast and Stable Adaptive Solver for Hierarchically Semi-Separable Representations, Technical report UCSB Math 2004-20, University of California, Santa Barbara, CA, 2004.
[15] G. Cybenko, The numerical stability of the Levinson-Durbin algorithm for Toeplitz systems of equations, SIAM J. Sci. Statist. Comput., 1 (1980), pp. 303–319.
[16] G. Cybenko, Error Analysis of Some Signal Processing Algorithms, Ph.D. thesis, Princeton University, Princeton, NJ, 1978.
[17] F. R. de Hoog, On the solution of Toeplitz systems, Linear Algebra Appl., 88/89 (1987), pp. 123–138.
[18] P. Dewilde and A. van der Veen, Time-Varying Systems and Computations, Kluwer Academic Publishers, Boston, MA, 1998.
[19] M. Fiedler, Hankel and Loewner matrices, Linear Algebra Appl., 58 (1984), pp. 75–95.
[20] K. A. Gallivan, S. Thirumalai, P. Van Dooren, and V. Vermaut, High performance algorithms for Toeplitz and block Toeplitz matrices, Linear Algebra Appl., 241/243 (1996), pp. 343–388.
[21] L. Gemignani, Schur complements of Bezoutians and the inversion of block Hankel and block Toeplitz matrices, Linear Algebra Appl., 253 (1997), pp. 39–59.
[22] I. Gohberg, T. Kailath, and V. Olshevsky, Fast Gaussian elimination with partial pivoting for matrices with displacement structure, Math. Comp., 64 (1995), pp. 1557–1576.
[23] I. Gohberg and V. Olshevsky, Fast state space algorithms for matrix Nehari and Nehari-Takagi interpolation problems, Integral Equations Operator Theory, 20 (1994), pp. 44–83.
[24] I. Gohberg and V. Olshevsky, Complexity of multiplication with vectors for structured matrices, Linear Algebra Appl., 202 (1994), pp. 163–192.
[25] L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, J. Comput. Phys., 73 (1987), pp. 325–348.
[26] M. Gu, Stable and efficient algorithms for structured systems of linear equations, SIAM J. Matrix Anal. Appl., 19 (1998), pp. 279–306.
[27] M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM J. Sci. Comput., 17 (1996), pp. 848–869.
[28] G. Heinig, Inversion of generalized Cauchy matrices and other classes of structured matrices, in Linear Algebra for Signal Processing, IMA Vol. Math. Appl. 69, Springer, New York, 1995, pp. 63–81.
[29] G. Heinig and K. Rost, Algebraic Methods for Toeplitz-like Matrices and Operators, Oper. Theory Adv. Appl. 13, Birkhäuser Verlag, Basel, 1984, pp. 109–127.
[30] T. Kailath, Fredholm resolvents, Wiener-Hopf equations, and Riccati differential equations, IEEE Trans. Inform. Theory, 15 (1969), pp. 665–672.
[31] T. Kailath, S. Kung, and M. Morf, Displacement ranks of matrices and linear equations, J. Math. Anal. Appl., 68 (1979), pp. 395–407.
[32] T. Kailath and A. H. Sayed, eds., Fast Reliable Algorithms for Matrices with Structure, SIAM, Philadelphia, 1999.
[33] P. G. Martinsson, V. Rokhlin, and M. Tygert, A fast algorithm for the inversion of general Toeplitz matrices, Comput. Math. Appl., 50 (2005), pp. 741–752.
[34] M. Morf, Fast Algorithms for Multivariable Systems, Ph.D. thesis, Department of Electrical Engineering, Stanford University, Stanford, CA, 1974.
[35] B. R. Musicus, Levinson and Fast Choleski Algorithms for Toeplitz and Almost Toeplitz Matrices, Technical report, Res. Lab. of Electronics, M.I.T., Cambridge, MA, 1984.
[36] V. Pan, On computations with dense structured matrices, Math. Comp., 55 (1990), pp. 179–190.
[37] D. R. Sweet, The use of pivoting to improve the numerical performance of algorithms for Toeplitz matrices, SIAM J. Matrix Anal. Appl., 14 (1993), pp. 468–493.
[38] M. Van Barel, G. Heinig, and P. Kravanja, A stabilized superfast solver for nonsymmetric Toeplitz systems, SIAM J. Matrix Anal. Appl., 23 (2001), pp. 494–510.
[39] M. Van Barel and P. Kravanja, A stabilized superfast solver for indefinite Hankel systems, Linear Algebra Appl., 284 (1998), pp. 335–355.