SIAM J. MATRIX ANAL. APPL.
Vol. 0, No. 0, pp. 000–000
© XXXX Society for Industrial and Applied Mathematics
A SUPERFAST ALGORITHM FOR TOEPLITZ SYSTEMS OF
LINEAR EQUATIONS∗
S. CHANDRASEKARAN† , M. GU‡ , X. SUN§ , J. XIA¶, AND J. ZHU‡
Abstract. In this paper we develop a new superfast solver for Toeplitz systems of linear equations. A common approach to solving Toeplitz systems is through displacement equation methods: using displacement structures, Toeplitz matrices can be transformed into Cauchy-like matrices via the FFT or other trigonometric transformations. These Cauchy-like matrices have a special property: their off-diagonal blocks have small numerical ranks. This low-rank property plays a central role in our superfast Toeplitz solver, as it enables us to quickly approximate the Cauchy-like matrices by structured matrices called sequentially semiseparable (SSS) matrices. The major work in constructing these SSS forms can be done in a precomputation (independent of the Toeplitz matrix entries), and the SSS representations are compact because of the low-rank property. The resulting SSS Cauchy-like systems can be solved in linear time with linear storage. Excluding the precomputation, the main operations are the FFT and the SSS system solver, both of which are very efficient. Our new Toeplitz solver is stable in practice. Numerical examples are presented to illustrate its efficiency and practical stability.
Key words. displacement equation, SSS structure, superfast algorithm, Toeplitz matrix
AMS subject classifications. 15A06, 65F05, 65G05
DOI. 10.1137/040617200
1. Introduction. Toeplitz systems of linear equations arise in many applications, including PDE solving, signal processing, time series analysis, orthogonal polynomials, and many others. A Toeplitz system is a linear system

(1.1)    T x = b

whose coefficient matrix is a Toeplitz matrix
(1.2)
\[
T = \begin{pmatrix}
t_0 & t_{-1} & t_{-2} & \cdots & t_{-(N-1)} \\
t_1 & t_0 & t_{-1} & \cdots & t_{-(N-2)} \\
t_2 & t_1 & t_0 & \cdots & t_{-(N-3)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
t_{N-1} & t_{N-2} & t_{N-3} & \cdots & t_0
\end{pmatrix},
\]
that is, its entries are constant along every diagonal (a matrix whose entries are constant along every antidiagonal is called a Hankel matrix). The vector t = (t_{−(N−1)}, …, t_{−1}, t_0, t_1, …, t_{N−1}) is called the Toeplitz vector that generates T.
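As a concrete illustration of the definition, the mapping from the generating vector t to the matrix T can be sketched as follows. This is an illustrative NumPy sketch, not part of the original paper; the helper name `toeplitz_from_vector` is ours.

```python
import numpy as np

def toeplitz_from_vector(t, N):
    """Build the N x N Toeplitz matrix T with T[i, j] = t_{i-j},
    where t has length 2N-1 and t[k + (N-1)] stores t_k
    for k = -(N-1), ..., N-1."""
    t = np.asarray(t)
    return np.array([[t[(i - j) + (N - 1)] for j in range(N)]
                     for i in range(N)])

N = 4
t = np.arange(-(N - 1), N)            # a sample generating vector t_{-3}, ..., t_3
T = toeplitz_from_vector(t, N)

# entries are constant along every diagonal: T[i+1, j+1] == T[i, j]
constant_diagonals = all(T[i + 1, j + 1] == T[i, j]
                         for i in range(N - 1) for j in range(N - 1))
```

The same check applied along antidiagonals would characterize a Hankel matrix instead.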
There are both direct and iterative methods for solving (1.1). Direct solvers are said to be fast if they cost O(N²) operations; examples include Schur-type methods,
∗ Received by the editors October 19, 2004; accepted for publication (in revised form) by D. A.
Bini May 8, 2007; published electronically DATE.
http://www.siam.org/journals/simax/x-x/61720.html
† Department of Electrical and Computer Engineering, University of California at Santa Barbara,
Santa Barbara, CA 93106 (shiv@ece.ucsb.edu).
‡ Department of Mathematics, University of California at Berkeley, Berkeley, CA 94720 (mgu@
math.berkeley.edu, zhujiang.cal@gmail.com).
§ Department of Mechanical Engineering, University of California at Berkeley, Berkeley, CA 94720
(sunxt@cal.berkeley.edu).
¶ Department of Mathematics, University of California at Los Angeles, Los Angeles, CA 90095
(jxia@math.ucla.edu).
Levinson-type methods, and others [32]. An important class of direct solvers is the displacement equation–type fast solvers based on Gaussian elimination. Some known displacement equation–type methods are the Heinig [28], GKO [22], and Gu [26] methods. These methods have complexity O(N²). Methods with complexity less than O(N²) are called superfast. In this paper we will present a displacement equation–type superfast algorithm.
1.1. Fast and superfast methods. Many fast and superfast methods that have
been developed are numerically unstable [5, 15, 16, 7, 37]. References [5] and [15]
showed that the Schur algorithm and the Levinson algorithm are weakly stable in
some cases, but both may be highly unstable for indefinite and nonsymmetric matrices. Stable generalized Schur algorithms [32] and look-ahead algorithms were developed in [9, 10]. High-performance look-ahead Schur algorithms were presented in [20].
Many other solvers use the FFT or other trigonometric transforms to convert the
Toeplitz (or even Hankel or Toeplitz-plus-Hankel) matrices into generalized Cauchy
or Vandermonde matrices, which can be done stably in O(N log N ) operations. This
is also the approach that we will use in this paper, with the aid of a displacement
structure.
The concept of displacement structure was first introduced in [31]. The Sylvester-type displacement equation for a matrix Ĉ ∈ ℝ^{N×N} [29] is
(1.3)
ΩĈ − ĈΛ = U V,
where Ω, Λ ∈ RN ×N , U ∈ RN ×α , V ∈ Rα×N , and α ≤ N is the displacement rank
with respect to Ω and Λ if rank(U V ) = α. The matrix Ĉ is considered to possess a
displacement structure with respect to Ω and Λ if α ≪ N .
With displacement structures it was shown in [19, 24, 36, 22, 28] that Toeplitz
and Hankel matrices can be transformed into Cauchy-like matrices of the following
form:
\[
\hat{C} = \left( \frac{u_i^T v_j}{\eta_i - \lambda_j} \right)_{1 \le i,j \le N} \qquad (u_i, v_j \in \mathbb{R}^{\alpha}),
\]
where we assume that η_i ≠ λ_j for 1 ≤ i, j ≤ N. Equivalently, a Cauchy-like matrix is
the unique solution to the displacement equation (1.3) with
\[
\Omega = \mathrm{diag}(\eta_1, \ldots, \eta_N), \quad
\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N), \quad
U = \begin{pmatrix} u_1^T \\ \vdots \\ u_N^T \end{pmatrix}, \quad
V = (v_1, \ldots, v_N).
\]
In particular, Ĉ is a Cauchy matrix if u_i^T v_j = 1 for all i and j. The displacement rank of Ĉ is at most α.
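To make the definition concrete, the following hedged NumPy sketch (ours, not from the paper; all variable names are illustrative) builds a small Cauchy-like matrix from generators and checks that it satisfies the Sylvester displacement equation (1.3) with diagonal Ω and Λ:

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha = 6, 2
eta = np.exp(2j * np.pi * rng.random(N))        # nodes eta_i on the unit circle
lam = 0.5 * np.exp(2j * np.pi * rng.random(N))  # lambda_j, kept away from the eta_i
U = rng.standard_normal((N, alpha))             # rows u_i^T
V = rng.standard_normal((alpha, N))             # columns v_j

# Cauchy-like matrix C[i, j] = u_i^T v_j / (eta_i - lam_j)
C = (U @ V) / (eta[:, None] - lam[None, :])

# C solves the displacement equation  Omega C - C Lambda = U V
Omega, Lam = np.diag(eta), np.diag(lam)
residual = np.linalg.norm(Omega @ C - C @ Lam - U @ V)
```

Since (Ω C − C Λ)_{ij} = (η_i − λ_j) C_{ij}, the residual vanishes up to roundoff, and the displacement rank is at most α = 2 here.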
To solve Toeplitz systems through Cauchy-like matrices, many researchers have utilized matrix factorizations. Gohberg and Olshevsky [23] presented a fast variation of the straightforward Gaussian elimination with partial pivoting (GEPP) procedure to solve a Cauchy-like linear system of equations in O(N²) operations. Among other results, Gohberg, Kailath, and Olshevsky [22] developed the algorithm GKO, an improved version of Heinig's algorithm [28], and demonstrated numerically that it is stable.
In their algorithm, the Hankel matrix and the Toeplitz-plus-Hankel matrix are also
transformed via fast trigonometric transforms into Cauchy-like matrices.
Gu presented a modified algorithm in [26] to avoid extra error growth. This
algorithm is numerically stable, provided that the element growth in the computed
factorization is not large. The algorithm takes O(N²) operations and is a fast stable method.
Superfast algorithms appeared in [34, 4, 6, 17, 35, 1, 2, 21] and many other works. Superfast algorithms use divide-and-conquer strategies; Morf developed the first such idea in [34]. These methods are unstable for nonsymmetric systems, as they cannot deal with nearly singular leading principal submatrices.
Van Barel and Kravanja presented a superfast method for rational interpolation
at roots of unity [39]. A similar idea was then applied to Toeplitz systems [38]. It
provided an explicit formula for the inverse of a Toeplitz matrix. Additional techniques such as iterative refinement and downdating were still required to stabilize
their algorithm.
1.2. Main results. Our new Toeplitz solver is also of the displacement equation
type. Given a Toeplitz linear system, we first use the FFT to transform the associated
Toeplitz matrix into a Cauchy-like matrix. Then, instead of using matrix factorizations, which often cost O(N²) or more, we exploit a special low-rank property of Cauchy-like matrices: every off-diagonal block of a Cauchy-like matrix has a low numerical rank. Using this low-rank property, we then approximate the Cauchy-like matrix by a low-rank matrix structure called the sequentially semiseparable (SSS) matrix, proposed by Chandrasekaran et al. [12, 13]. A system with the coefficient matrix in compact SSS
form can be solved with only O(p²N) operations, where N is the matrix dimension and p is the complexity of the semiseparable description. The SSS solver is practically stable in our numerical tests and in those of [12, 13].
The SSS structure was developed to capture the low-rank property of the off-diagonal blocks of a matrix while maintaining stability or practical stability. It is a matrix analog of the semiseparable integral kernels in Kailath's paper [30]. Matrix operations with compact SSS representations are very efficient,
provided that such compact representations exist or can be easily computed. This
turns out to be true for our case, as the Cauchy-like matrices are transformed from
Toeplitz matrices. We use a recursive compression scheme with a shifting strategy to
construct compact SSS forms for those Cauchy-like matrices. The major work of the
compressions can be precomputed on some Cauchy matrices which are independent
of the actual Toeplitz matrix entries.
The overall algorithm thus has the following stages:
(1) Precompute compressions of off-diagonal blocks of Cauchy matrices.
(2) Transform the Toeplitz matrix into a Cauchy-like matrix in O(N log N) operations.
(3) Construct a compact SSS representation from the precomputed compressions and solve the Cauchy-like system in O(p²N) operations.
(4) Recover the solution of the Toeplitz system in O(N log N) operations.
The stages above are either stable or practically stable. Our numerical results indicate that the overall algorithm is stable in practice. The Toeplitz matrix does not have to be symmetric or positive definite, and no extra stabilizing step is necessary. After the precomputations, the total cost of the algorithm is O(N log N) + O(p²N). This indicates that the entire algorithm is superfast.
We also point out that similar techniques are used in [33], where the low-rank
property is exploited through the block columns without diagonals (called neutered
block columns in [33]), in contrast with the off-diagonal blocks here. The compressions of either the neutered blocks or the off-diagonal blocks give data-sparse representations which enable fast factorizations of the Cauchy-like matrices. In fact, corresponding to neutered block rows or columns, there are also matrix representations called hierarchically semiseparable (HSS) matrices [14], which usually have more complicated structures than SSS matrices.
1.3. Overview. We will discuss the displacement structure and the transformation from a Toeplitz problem to a Cauchy-like problem in section 2. The low-rank
property of this Cauchy-like problem is then exploited. Section 3 then gives a linear
complexity solver using the SSS structure. In section 4, we will present an algorithm
for fast construction of SSS structures. We will then analyze the complexity in section 5 and use some numerical experiments to demonstrate the efficiency and the
practical stability. All algorithms have been implemented in Fortran 90. Section 6
draws some conclusions.
2. Displacement structures and low-rank property.
2.1. Cauchy-like systems. Given a Toeplitz system (1.1), we can use a displacement structure to transform it into a Cauchy-like system. Define
\[
Z_\delta = \begin{pmatrix}
0 & 0 & \cdots & 0 & \delta \\
1 & 0 & \cdots & \cdots & 0 \\
0 & 1 & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & 0
\end{pmatrix},
\]
and let Ω = Z₁ and Λ = Z₋₁ in (1.3). Kailath, Kung, and Morf [31] have shown that every Toeplitz matrix satisfies the displacement equation (1.3) with a right-hand side U V that has nonzero entries only in its first row and last column and is thus a matrix of rank at most 2. Hence the displacement rank of a Toeplitz matrix is at most 2 with respect to Z₁ and Z₋₁. The following result can be found in [28].
Proposition 2.1. Let Ĉ ∈ ℝ^{N×N} be a matrix satisfying the displacement equation

(2.1)    Z₁ Ĉ − Ĉ Z₋₁ = U V,

where U ∈ ℝ^{N×α} and V ∈ ℝ^{α×N}. Then F Ĉ D₀⁻¹ F^H is a Cauchy-like matrix satisfying

(2.2)    D₁ (F Ĉ D₀⁻¹ F^H) − (F Ĉ D₀⁻¹ F^H) D₋₁ = (F U)(V D₀^H F^H),

where F = (1/√N) (ω^{2(k−1)(j−1)})_{1≤k,j≤N} is the normalized inverse discrete Fourier transform matrix, ω = e^{πi/N}, and

\[
D_1 = \mathrm{diag}(1, \omega^2, \ldots, \omega^{2(N-1)}), \quad
D_{-1} = \mathrm{diag}(\omega, \omega^3, \ldots, \omega^{2N-1}), \quad
D_0 = \mathrm{diag}(1, \omega, \ldots, \omega^{N-1}).
\]
Here α ≤ 2. This proposition suggests that for a Toeplitz matrix T, one can
convert it into the Cauchy-like matrix in (2.2). Therefore the Toeplitz system (1.1)
can be readily transformed into a new system
(2.3)
C x̃ = b̃,
where C has the form

(2.4)
\[
C = \left( \frac{u_i^T v_j}{\omega^{2i} - \omega^{2j+1}} \right)_{1 \le i,j \le N} \qquad (u_i, v_j \in \mathbb{R}^{\alpha}).
\]
In section 3 we will present a fast solver for (2.3). After obtaining x̃ we will
then recover x with an FFT again. All the stages involving FFT are stable and
cost O(N log N ). The solver for (2.3) has a linear complexity and turns out to be
practically stable. Thus the total cost of our algorithm is bounded by O(N log N) + O(Np²), where p is a parameter that will be described below. This indicates that our method is superfast and practically stable.
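The transformation of Proposition 2.1 can be checked numerically. The following hedged NumPy sketch (ours, not from the paper; all names are illustrative) builds a random Toeplitz matrix, verifies that its displacement rank with respect to Z₁ and Z₋₁ is at most 2, and confirms that the transformed matrix C = F T D₀⁻¹ F^H has displacement rank at most 2 with respect to D₁ and D₋₁:

```python
import numpy as np

N = 8
rng = np.random.default_rng(1)
t = rng.standard_normal(2 * N - 1)
T = np.array([[t[(i - j) + (N - 1)] for j in range(N)] for i in range(N)])

def Z(delta, N):
    """Down-shift matrix with delta in the top-right corner."""
    M = np.diag(np.ones(N - 1), -1).astype(complex)
    M[0, -1] = delta
    return M

# Toeplitz displacement: Z1 T - T Z_{-1} has rank at most 2
D_T = Z(1, N) @ T - T @ Z(-1, N)
rank_T = np.linalg.matrix_rank(D_T, tol=1e-10)

# transform to a Cauchy-like matrix: C = F T D0^{-1} F^H
omega = np.exp(1j * np.pi / N)
k = np.arange(N)
F = omega ** (2 * np.outer(k, k)) / np.sqrt(N)   # normalized inverse DFT matrix
D0 = np.diag(omega ** k)
C = F @ T @ np.linalg.inv(D0) @ F.conj().T

# C satisfies the diagonal displacement equation with low-rank right-hand side
D1  = np.diag(omega ** (2 * k))
Dm1 = np.diag(omega ** (2 * k + 1))
rank_C = np.linalg.matrix_rank(D1 @ C - C @ Dm1, tol=1e-10)
```

The check works because F diagonalizes Z₁ and D₀⁻¹ F^H diagonalizes Z₋₁, so the Toeplitz displacement structure is carried over to the diagonal displacement (2.2).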
2.2. Low-rank property of Cauchy-like matrices. In this section, we will
show a low-rank property of C, i.e., every off-diagonal block of C has a low numerical
rank. This property is the basis of the superfast SSS solver in section 3.
First, a simple numerical experiment can give us an idea of the low-rank property
of C. To find out the numerical ranks we can use one of the following tools:
(1) τ -accurate SVD: singular values less than τ are dropped if τ is an absolute
tolerance, or singular values less than τ times the largest singular value are
dropped if τ is a relative tolerance.
(2) τ-accurate QR:

    A ≈ QR,  A: m × n,  Q: m × k,  R: k × n,  k ≤ l ≡ min(m, n),

which is obtained in the following way. Compute the exact QR factorization A = Q̂R̂, where Q̂ is m × l and R̂ is l × n with diagonal entries satisfying R̂₁₁ ≥ R̂₂₂ ≥ · · · ≥ R̂_ll. Then obtain R by dropping all rows of R̂ whose diagonal entries are less than τ if τ is an absolute tolerance, or less than τ R̂₁₁ if τ is a relative tolerance. Drop the corresponding columns of Q̂ to obtain Q.
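The τ-accurate SVD described above can be sketched in a few lines. This is an illustrative NumPy sketch (ours; the helper name `tau_svd` and the test matrix are arbitrary choices):

```python
import numpy as np

def tau_svd(A, tau, relative=True):
    """tau-accurate SVD: drop singular values below the tolerance and
    return a rank-k approximation A ~ U @ np.diag(s) @ Vh."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    cutoff = tau * s[0] if relative else tau
    k = int(np.sum(s > cutoff))          # numerical rank
    return U[:, :k], s[:k], Vh[:k, :], k

rng = np.random.default_rng(2)
# a 40 x 30 test matrix of exact rank 5, plus tiny noise
A = rng.standard_normal((40, 5)) @ rng.standard_normal((5, 30))
A += 1e-12 * rng.standard_normal((40, 30))

U, s, Vh, k = tau_svd(A, tau=1e-9, relative=True)
err = np.linalg.norm(A - U @ np.diag(s) @ Vh)
```

A τ-accurate QR would analogously truncate a rank-revealing QR factorization instead of the SVD.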
In what follows, by ranks we mean numerical ranks. Here we take some random Toeplitz matrices of different sizes, transform them into Cauchy-like matrices C, and compute the numerical ranks of their off-diagonal blocks. For simplicity, we compute the ranks of the blocks C(1 : d, d + 1 : N), C(1 : 2d, 2d + 1 : N), …, C(1 : kd, kd + 1 : N), …, where d is a fixed integer; these blocks are numbered 1, 2, …, k as shown in Figure 2.1.
Here, k = 7 off-diagonal blocks for each of three N × N Cauchy-like matrices are considered. See Table 2.1 for the numerical ranks, where we use the τ-accurate SVD with τ an absolute tolerance. We can see that the numerical ranks are relatively small compared to the block sizes, and when we double the dimension of the matrix, the numerical ranks do not increase much. This effect is more significant when a larger τ is used.
Fig. 2.1. Off-diagonal blocks (numbered as 1, 2, 3). Upper triangular part only.
Table 2.1
Off-diagonal numerical ranks of Cauchy-like matrices transformed from some Toeplitz matrices with absolute tolerance τ = 10⁻⁹.

Block #   N = 640             N = 1280             N = 2560
          Block size   Rank   Block size    Rank   Block size     Rank
1         80 × 560     37     160 × 1120    44     320 × 2240     52
2         160 × 480    43     320 × 960     50     640 × 1920     57
3         240 × 400    45     480 × 800     53     960 × 1600     60
4         320 × 320    46     640 × 640     53     1280 × 1280    61
5         400 × 240    46     800 × 480     52     1600 × 960     60
6         480 × 160    43     960 × 320     50     1920 × 640     58
7         560 × 80     37     1120 × 160    44     2240 × 320     52
The low-rank property can be verified theoretically in the following way. We first consider a special case of (2.4) in which all u_i^T v_j = 1. We show that the following Cauchy matrix has low-rank off-diagonal blocks:

(2.5)
\[
C_0 = \left( \frac{1}{\omega^{2i} - \omega^{2j+1}} \right)_{1 \le i,j \le N}.
\]
The central idea is similar to that in the fast multipole method [25, 8], which implies that with well-separated points (see, e.g., Figure 2.2) an interaction matrix is numerically low-rank.
Here we introduce two sets of points {λ_k}_{k=1}^N ≡ {ω^{2k}}_{k=1}^N and {η_k}_{k=1}^N ≡ {ω^{2k+1}}_{k=1}^N on the unit circle. When we consider an off-diagonal block of C₀ of the form

\[
G = \left( \frac{1}{\lambda_i - \eta_j} \right)_{1 \le i \le p,\; p+1 \le j \le N},
\]

we can show that G is (numerically) low-rank. In fact, G corresponds to two well-separated sets {λ_i}_{i=1}^p and {η_j}_{j=p+1}^N; that is, there exists a point c ∈ ℂ such that

(2.6)
\[
|\lambda_i - c| > d + e, \quad i = 1, \ldots, p, \qquad
|\eta_j - c| < d, \quad j = p + 1, \ldots, N,
\]

where d and e are positive constants. Consider the expansion
(2.7)
\[
\frac{1}{\lambda_i - \eta_j}
= \frac{1}{\lambda_i - c} \cdot \frac{1}{1 - \frac{\eta_j - c}{\lambda_i - c}}
\]
(2.8)
\[
= \sum_{k=0}^{r} \frac{(\eta_j - c)^k}{(\lambda_i - c)^{k+1}}
+ O\!\left( \left( \frac{\eta_j - c}{\lambda_i - c} \right)^{r+1} \right)
= \sum_{k=0}^{r} \frac{(\eta_j - c)^k}{(\lambda_i - c)^{k+1}} + \varepsilon,
\]

Fig. 2.2. Well-separated sets in the plane.
where r is a number such that the error term |ε| = |O(((η_j − c)/(λ_i − c))^{r+1})| is bounded by a given tolerance. We have the estimate

\[
\left| \frac{\eta_j - c}{\lambda_i - c} \right|^{r+1} < \left( \frac{d}{d+e} \right)^{r+1},
\]

which enables us to find an appropriate r according to the tolerance. Thus

\[
G = \left( \sum_{k=0}^{r} \frac{(\eta_j - c)^k}{(\lambda_i - c)^{k+1}} \right)_{1 \le i \le p,\; p+1 \le j \le N} + \hat{\varepsilon}
\]
(2.9)
\[
= \begin{pmatrix}
\frac{1}{\lambda_1 - c} & \frac{1}{(\lambda_1 - c)^2} & \cdots & \frac{1}{(\lambda_1 - c)^{r+1}} \\
\frac{1}{\lambda_2 - c} & \frac{1}{(\lambda_2 - c)^2} & \cdots & \frac{1}{(\lambda_2 - c)^{r+1}} \\
\vdots & \vdots & & \vdots \\
\frac{1}{\lambda_p - c} & \frac{1}{(\lambda_p - c)^2} & \cdots & \frac{1}{(\lambda_p - c)^{r+1}}
\end{pmatrix}
\begin{pmatrix}
1 & 1 & \cdots & 1 \\
\eta_{p+1} - c & \eta_{p+2} - c & \cdots & \eta_N - c \\
\vdots & \vdots & & \vdots \\
(\eta_{p+1} - c)^r & (\eta_{p+2} - c)^r & \cdots & (\eta_N - c)^r
\end{pmatrix}
+ \hat{\varepsilon}.
\]

Therefore the numerical rank of G is at most r + 1, up to an error ε̂.
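The separation-based rank bound can be observed numerically. This illustrative NumPy sketch (ours; the size N and tolerance are arbitrary choices) forms an off-diagonal block of the Cauchy matrix C₀ in (2.5) and computes its numerical rank:

```python
import numpy as np

N = 256
omega = np.exp(1j * np.pi / N)
i = np.arange(1, N + 1)
lam = omega ** (2 * i)          # lambda_k = omega^{2k} on the unit circle
eta = omega ** (2 * i + 1)      # eta_k = omega^{2k+1}

# off-diagonal block G = (1 / (lambda_i - eta_j)) for i <= p < j
p = N // 2
G = 1.0 / (lam[:p, None] - eta[None, p:])

# numerical rank at a relative tolerance of 1e-9
s = np.linalg.svd(G, compute_uv=False)
numrank = int(np.sum(s > 1e-9 * s[0]))
```

In our runs the numerical rank is far below min(p, N − p), consistent with the r + 1 bound and with Table 2.1.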
Now we can return to the Cauchy-like matrix (2.4). A similar argument shows that any off-diagonal block of C satisfies

\[
\hat{G} \approx \left( \sum_{k=0}^{r} \frac{(\eta_j - c)^k}{(\lambda_i - c)^{k+1}} \, (u_i^T v_j) \right)_{1 \le i \le p,\; p+1 \le j \le N}
\]
\[
= \begin{pmatrix}
\frac{u_1^T}{\lambda_1 - c} & \frac{u_1^T}{(\lambda_1 - c)^2} & \cdots & \frac{u_1^T}{(\lambda_1 - c)^{r+1}} \\
\frac{u_2^T}{\lambda_2 - c} & \frac{u_2^T}{(\lambda_2 - c)^2} & \cdots & \frac{u_2^T}{(\lambda_2 - c)^{r+1}} \\
\vdots & \vdots & & \vdots \\
\frac{u_p^T}{\lambda_p - c} & \frac{u_p^T}{(\lambda_p - c)^2} & \cdots & \frac{u_p^T}{(\lambda_p - c)^{r+1}}
\end{pmatrix}
\begin{pmatrix}
v_{p+1} & v_{p+2} & \cdots & v_N \\
(\eta_{p+1} - c) v_{p+1} & (\eta_{p+2} - c) v_{p+2} & \cdots & (\eta_N - c) v_N \\
\vdots & \vdots & & \vdots \\
(\eta_{p+1} - c)^r v_{p+1} & (\eta_{p+2} - c)^r v_{p+2} & \cdots & (\eta_N - c)^r v_N
\end{pmatrix}.
\]

That is, we replace the entries of the two matrix factors in (2.9) with appropriate vectors. Thus the numerical rank of Ĝ will be no larger than α(r + 1), which is relatively small compared to N.
3. SSS structures and superfast SSS solver.
3.1. SSS representations. To take advantage of the low-rank property of the
Cauchy-like matrix C, we can use SSS structures introduced by Chandrasekaran
et al. [12, 13]. The SSS structure nicely captures the ranks of off-diagonal blocks
of a matrix such as shown in Figure 2.1.
A matrix A ∈ ℂ^{M×M̃} is said to be in SSS form if there exist 2n positive integers m₁, …, m_n and m̃₁, …, m̃_n with M = m₁ + · · · + m_n and M̃ = m̃₁ + · · · + m̃_n such that, when A is block-partitioned as A = (A_{ij})_{n×n}, the blocks A_{ij} ∈ ℂ^{m_i×m̃_j} satisfy
(3.1)
\[
A_{ij} = \begin{cases}
D_i & \text{if } i = j, \\
U_i W_{i+1} \cdots W_{j-1} V_j^H & \text{if } i < j, \\
P_i R_{i-1} \cdots R_{j+1} Q_j^H & \text{if } i > j.
\end{cases}
\]
Here the superscript H denotes the Hermitian transpose, and empty products are defined to be identity matrices. The matrices {U_i}_{i=1}^{n−1}, {V_i}_{i=2}^{n}, {W_i}_{i=2}^{n−1}, {P_i}_{i=2}^{n}, {Q_i}_{i=1}^{n−1}, {R_i}_{i=2}^{n−1}, and {D_i}_{i=1}^{n} are called generators of the SSS structure, and their dimensions are defined in Table 3.1.

Table 3.1
Dimensions of matrices in (3.1).

Matrix      U_i        V_i             W_i             P_i        Q_i             R_i
Dimensions  m_i × k_i  m̃_i × k_{i−1}   k_{i−1} × k_i   m_i × l_i  m̃_i × l_{i+1}   l_{i+1} × l_i
As an example, for n = 4 the matrix A has the form

(3.2)
\[
A = \begin{pmatrix}
D_1 & U_1 V_2^H & U_1 W_2 V_3^H & U_1 W_2 W_3 V_4^H \\
P_2 Q_1^H & D_2 & U_2 V_3^H & U_2 W_3 V_4^H \\
P_3 R_2 Q_1^H & P_3 Q_2^H & D_3 & U_3 V_4^H \\
P_4 R_3 R_2 Q_1^H & P_4 R_3 Q_2^H & P_4 Q_3^H & D_4
\end{pmatrix}.
\]
The SSS representation (3.2) is related to the off-diagonal blocks in Figure 2.1 in that the upper off-diagonal blocks numbered 1, 2, and 3 are

\[
U_1 \begin{pmatrix} V_2^H & W_2 V_3^H & W_2 W_3 V_4^H \end{pmatrix}, \quad
\begin{pmatrix} U_1 W_2 \\ U_2 \end{pmatrix} \begin{pmatrix} V_3^H & W_3 V_4^H \end{pmatrix}, \quad
\begin{pmatrix} U_1 W_2 W_3 \\ U_2 W_3 \\ U_3 \end{pmatrix} V_4^H.
\]

Appropriate row and column bases of the off-diagonal blocks are clearly reflected.
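The structure (3.1)–(3.2) can be made concrete with a short sketch that assembles a dense matrix from randomly chosen SSS generators. This is an illustrative NumPy sketch (ours; real generators are used so that V^H reduces to a transpose, and all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k, l = 4, 3, 2, 2        # block count, block size, generator ranks
D = [rng.standard_normal((m, m)) for _ in range(n)]
U = [rng.standard_normal((m, k)) for _ in range(n)]
V = [rng.standard_normal((m, k)) for _ in range(n)]
W = [rng.standard_normal((k, k)) for _ in range(n)]
P = [rng.standard_normal((m, l)) for _ in range(n)]
Q = [rng.standard_normal((m, l)) for _ in range(n)]
R = [rng.standard_normal((l, l)) for _ in range(n)]

def sss_block(i, j):
    """Block A_{ij} of (3.1); lists are 0-indexed, the paper is 1-indexed."""
    if i == j:
        return D[i]
    if i < j:
        prod = U[i]
        for t in range(i + 1, j):      # U_i W_{i+1} ... W_{j-1} V_j^H
            prod = prod @ W[t]
        return prod @ V[j].T
    prod = P[i]
    for t in range(i - 1, j, -1):      # P_i R_{i-1} ... R_{j+1} Q_j^H
        prod = prod @ R[t]
    return prod @ Q[j].T

A = np.block([[sss_block(i, j) for j in range(n)] for i in range(n)])

# e.g. block (1,4) of (3.2) should equal U1 W2 W3 V4^H
check = np.linalg.norm(A[:m, 3 * m:] - U[0] @ W[1] @ W[2] @ V[3].T)
```

For a compact SSS form of a Cauchy-like matrix, the ranks k and l would be taken close to the off-diagonal numerical ranks rather than fixed in advance.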
The SSS structure depends on the sequences {m_i} and {m̃_i} and on the SSS generation scheme. If A is a square matrix (M = M̃), then we can take m_i = m̃_i, i = 1, …, n. SSS matrices are closed under addition, multiplication, inversion, etc., although the sizes of the generators may increase.
While any matrix can be represented in this form for sufficiently large k_i's and l_i's (the column dimensions of the U_i's and P_i's, respectively), our main focus is on SSS matrices which have low-rank off-diagonal blocks and whose generators have k_i's and l_i's close to those ranks. We say such SSS matrices are compact. In particular, Cauchy-like matrices can have compact SSS forms. Using SSS structures, we can take advantage of the superfast SSS system solver in [12, 13] to solve the Cauchy-like systems. The solver is efficient when the SSS form is compact, and it is practically stable. It shares similar ideas with the solver for banded plus semiseparable systems in [11].
Here we briefly describe the main ideas of the solver in [12, 13]. We consider solving the linear system Ax = b, where A ∈ ℂ^{N×N} satisfies (3.1) and b is unstructured. The solver computes an implicit ULV^H decomposition of A, where U and V are unitary matrices.
Before we present the formal algorithm, we demonstrate the key ideas on a 4 × 4
block matrix example.
3.2. SSS solver: 4 × 4 example. Let the initial system Ax = b be partitioned as follows:

(3.3)
\[
\begin{pmatrix}
D_1 & U_1 V_2^H & U_1 W_2 V_3^H & U_1 W_2 W_3 V_4^H \\
P_2 Q_1^H & D_2 & U_2 V_3^H & U_2 W_3 V_4^H \\
P_3 R_2 Q_1^H & P_3 Q_2^H & D_3 & U_3 V_4^H \\
P_4 R_3 R_2 Q_1^H & P_4 R_3 Q_2^H & P_4 Q_3^H & D_4
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}
-
\begin{pmatrix} 0 \\ P_2 \\ P_3 R_2 \\ P_4 R_3 R_2 \end{pmatrix} \xi,
\]
where the dimensions of the generators follow those in Table 3.1 with m_i = m̃_i, and the vector ξ = 0. The extra zero vector ξ on the right-hand side of (3.3) has been added to set up a general recursive pattern.
The algorithm has two main stages, compression (or elimination) and merging, which are applied depending on the relationship between k_i (l_i) and m_i during the intermediate procedure.
3.2.1. Compression. At the beginning, k₁ < m₁ because of the low-rank property described earlier. We apply a unitary transformation q₁^H to U₁ so that the first m₁ − k₁ rows of U₁ become zeros:

(3.4)
\[
q_1^H U_1 = \begin{pmatrix} 0 \\ \hat{U}_1 \end{pmatrix},
\]

where the zero block has m₁ − k₁ rows and Û₁ has k₁ rows.
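One concrete way to realize the unitary transformation in (3.4) is a Householder QR of U₁ followed by a row reversal, which moves the zero rows to the top. This is an illustrative NumPy sketch of that single step (ours, not the paper's Fortran implementation; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
m1, k1 = 6, 2                      # block size and off-diagonal rank, k1 < m1
U1 = rng.standard_normal((m1, k1))

# full QR gives Qf with Qf^T U1 = (R; 0); reversing the rows moves the
# zeros to the top, as required by (3.4): q1^H U1 = (0; U1_hat)
Qf, Rf = np.linalg.qr(U1, mode="complete")
flip = np.eye(m1)[::-1]            # row-reversal permutation (unitary)
q1H = flip @ Qf.T                  # q1^H as an explicit unitary matrix

Z = q1H @ U1
top_zero = np.linalg.norm(Z[: m1 - k1, :])   # first m1 - k1 rows vanish
U1_hat = Z[m1 - k1:, :]                      # k1 x k1 block of (3.4)
```

Any unitary q₁^H with this zeroing property works; the QR-plus-reversal choice is just one convenient realization.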
Now we multiply the first m₁ equations of the system by q₁^H:

\[
\begin{pmatrix} q_1^H & 0 \\ 0 & I \end{pmatrix} A x
= \begin{pmatrix} q_1^H & 0 \\ 0 & I \end{pmatrix} b
- \begin{pmatrix} 0 \\ P_2 \\ P_3 R_2 \\ P_4 R_3 R_2 \end{pmatrix} \xi.
\]
We pick another unitary transformation w₁^H to lower-triangularize q₁^H D₁, the (1, 1) diagonal block of A, i.e.,

\[
q_1^H D_1 w_1^H = \begin{pmatrix} D_{11} & 0 \\ D_{21} & D_{22} \end{pmatrix},
\]

where D₁₁ is (m₁ − k₁) × (m₁ − k₁) and D₂₂ is k₁ × k₁.
Then system (3.3) becomes

\[
\begin{pmatrix} q_1^H & 0 \\ 0 & I \end{pmatrix} A
\begin{pmatrix} w_1^H & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} w_1 & 0 \\ 0 & I \end{pmatrix} x
= \begin{pmatrix} q_1^H & 0 \\ 0 & I \end{pmatrix} b
- \begin{pmatrix} 0 \\ P_2 \\ P_3 R_2 \\ P_4 R_3 R_2 \end{pmatrix} \xi,
\]

which can be rewritten as

\[
\begin{pmatrix}
D_{11} & 0 & 0 & 0 & 0 \\
D_{21} & D_{22} & \hat{U}_1 V_2^H & \hat{U}_1 W_2 V_3^H & \hat{U}_1 W_2 W_3 V_4^H \\
P_2 Q_{11}^H & P_2 \hat{Q}_1^H & D_2 & U_2 V_3^H & U_2 W_3 V_4^H \\
P_3 R_2 Q_{11}^H & P_3 R_2 \hat{Q}_1^H & P_3 Q_2^H & D_3 & U_3 V_4^H \\
P_4 R_3 R_2 Q_{11}^H & P_4 R_3 R_2 \hat{Q}_1^H & P_4 R_3 Q_2^H & P_4 Q_3^H & D_4
\end{pmatrix}
\begin{pmatrix} z_1 \\ \hat{x}_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix} \beta_1 \\ \gamma_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}
-
\begin{pmatrix} 0 \\ 0 \\ P_2 \\ P_3 R_2 \\ P_4 R_3 R_2 \end{pmatrix} \xi,
\]
where we have used the partitions

\[
w_1 x_1 = \begin{pmatrix} z_1 \\ \hat{x}_1 \end{pmatrix}, \quad
q_1^H b_1 = \begin{pmatrix} \beta_1 \\ \gamma_1 \end{pmatrix}, \quad
w_1 Q_1 = \begin{pmatrix} Q_{11} \\ \hat{Q}_1 \end{pmatrix},
\]

in which z₁, β₁, and Q₁₁ have m₁ − k₁ rows and x̂₁, γ₁, and Q̂₁ have k₁ rows.
At this point, we can solve for z₁ from the system of equations D₁₁ z₁ = β₁. We also subtract D₂₁ z₁ from the right-hand side to obtain b̂₁ = γ₁ − D₂₁ z₁. Then we can discard the first m₁ − k₁ rows and columns of the coefficient matrix of the system to obtain

\[
\begin{pmatrix}
D_{22} & \hat{U}_1 V_2^H & \hat{U}_1 W_2 V_3^H & \hat{U}_1 W_2 W_3 V_4^H \\
P_2 \hat{Q}_1^H & D_2 & U_2 V_3^H & U_2 W_3 V_4^H \\
P_3 R_2 \hat{Q}_1^H & P_3 Q_2^H & D_3 & U_3 V_4^H \\
P_4 R_3 R_2 \hat{Q}_1^H & P_4 R_3 Q_2^H & P_4 Q_3^H & D_4
\end{pmatrix}
\begin{pmatrix} \hat{x}_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix} \hat{b}_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}
-
\begin{pmatrix} 0 \\ P_2 \\ P_3 R_2 \\ P_4 R_3 R_2 \end{pmatrix} \hat{\xi},
\]

where ξ̂ = ξ + Q₁₁^H z₁. This new system has a structure similar to the original one but with smaller dimension. We can continue to solve it by recursion if further compressions of blocks such as (3.4) are possible. Note that the actual solution component x₁ can be recovered as

\[
x_1 = w_1^H \begin{pmatrix} z_1 \\ \hat{x}_1 \end{pmatrix}.
\]
3.2.2. Merging. During the recursive eliminations there are situations where k_i is no longer smaller than m_i and no further compression is possible; we are then unable to introduce more zeros into the system. In that case we proceed by merging appropriate block rows and columns of the matrix. As an example, we can merge the first two block rows and columns and rewrite the system of equations as

\[
\begin{pmatrix}
\begin{pmatrix} D_1 & U_1 V_2^H \\ P_2 Q_1^H & D_2 \end{pmatrix}
& \begin{pmatrix} U_1 W_2 \\ U_2 \end{pmatrix} V_3^H
& \begin{pmatrix} U_1 W_2 \\ U_2 \end{pmatrix} W_3 V_4^H \\[6pt]
P_3 \begin{pmatrix} Q_1 R_2^H \\ Q_2 \end{pmatrix}^{\!H} & D_3 & U_3 V_4^H \\[6pt]
P_4 R_3 \begin{pmatrix} Q_1 R_2^H \\ Q_2 \end{pmatrix}^{\!H} & P_4 Q_3^H & D_4
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 - P_2 \hat{\xi} \\ b_3 \\ b_4 \end{pmatrix}
-
\begin{pmatrix} 0 \\ 0 \\ P_3 \\ P_4 R_3 \end{pmatrix} (R_2 \hat{\xi}).
\]
Hence the system becomes

\[
\begin{pmatrix}
\hat{D}_1 & \hat{U}_1 V_3^H & \hat{U}_1 W_3 V_4^H \\
P_3 \hat{Q}_1^H & D_3 & U_3 V_4^H \\
P_4 R_3 \hat{Q}_1^H & P_4 Q_3^H & D_4
\end{pmatrix}
\begin{pmatrix} \hat{x}_1 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix} \hat{b}_1 \\ b_3 \\ b_4 \end{pmatrix}
-
\begin{pmatrix} 0 \\ P_3 \\ P_4 R_3 \end{pmatrix} \tilde{\xi},
\]
where

\[
\hat{D}_1 = \begin{pmatrix} D_1 & U_1 V_2^H \\ P_2 Q_1^H & D_2 \end{pmatrix}, \quad
\hat{U}_1 = \begin{pmatrix} U_1 W_2 \\ U_2 \end{pmatrix}, \quad
\hat{Q}_1 = \begin{pmatrix} Q_1 R_2^H \\ Q_2 \end{pmatrix}, \quad
\hat{x}_1 = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad
\hat{b}_1 = \begin{pmatrix} b_1 \\ b_2 - P_2 \hat{\xi} \end{pmatrix}, \quad
\tilde{\xi} = R_2 \hat{\xi}.
\]
The number of block rows/columns is reduced by one. Further compressions become
possible and we can proceed to solve the system recursively. In the case n = 1, we
have the system D1 x1 = b1 − 0ξ, which is solved directly.
3.3. General solve algorithm. We now present a short description of the general algorithm. The procedure in the 4 × 4 example extends directly to a general system. We assume that the matrix A is in compact SSS form, represented by the generators {U_i}_{i=1}^{n−1}, {V_i}_{i=2}^{n}, {W_i}_{i=2}^{n−1}, {P_i}_{i=2}^{n}, {Q_i}_{i=1}^{n−1}, {R_i}_{i=2}^{n−1}, and {D_i}_{i=1}^{n} as in (3.1). We also partition x = (x_i) and b = (b_i) such that x_i and b_i have m_i rows.
As in the 4 × 4 example, there are two stages at each step of the recursion.
In the compression stage, we perform orthogonal eliminations on both sides of A to create an (m₁ − k₁) × (m₁ − k₁) lower triangular submatrix at the top left corner of A. Then we solve a small triangular system and obtain the first few components of the solution vector. At this stage, we are left with a new system with fewer unknowns; hence we can carry out a recursion.
In the merging stage, we merge the first two block rows and columns of A while
still maintaining the SSS structure. The numbers of block rows and columns are
reduced by one.
Combining these two stages, we can proceed with recursion to solve the system.
When n = 1 we can solve the linear system directly with standard solvers.
The SSS solver has complexity O(Np²) [12, 13], where p is the maximum numerical rank of the off-diagonal blocks of A, as compared to the traditional O(N³) cost for a general dense N × N matrix. We use only orthogonal transformations and a single substitution in the SSS solver. Although a formal proof of backward stability is not yet available, the solver has been shown to be practically stable. The reader is referred to [12, 13] for more discussion of the stability.
4. Fast construction of SSS representation for C. According to section 3,
a system in compact SSS form can be solved very efficiently. We can thus use that
algorithm to solve the Cauchy-like system (2.3), provided that C can be quickly
written in SSS form. Therefore we try to find an efficient construction scheme. Here we
provide a divide-and-conquer SSS construction scheme by using the fast merging and
splitting strategy in [13]. If we know further that the matrix has low-rank off-diagonal
blocks and it is easy to compress the off-diagonal blocks, then the construction can
be superfast. Here we will concentrate on this situation as Cauchy-like matrices have
this low-rank property.
We first present a general divide-and-conquer construction algorithm and then
describe a fast shifting strategy to compress the off-diagonal blocks of C.
4.1. General divide-and-conquer construction algorithm. In this section, we discuss the construction of the SSS structure of a matrix A when the partition sequence {m_i}_{i=1}^n is given. The general construction methods can be applied to any unstructured matrix, thus proving that any matrix has an SSS structure. (Of course, k_i and l_i will usually be large in this case, precluding any speed-ups.) These methods can be viewed as specific ways to make the realization algorithm of [18] more efficient.
Suppose we are given an N × N matrix A and a partition sequence {m_i}_{i=1}^n with Σ_{i=1}^n m_i = N. Starting with an appropriate sequence {m̃₁, m̃₂}, where Σ_{i=1}^k m_i = m̃₁ and Σ_{i=k+1}^n m_i = m̃₂, we can first partition A into a 2 × 2 block matrix and then construct a simple SSS form
(4.1)
\[
A = \begin{pmatrix} D_1 & B \\ E & D_2 \end{pmatrix}
= \begin{pmatrix} D_1 & U_1 V_2^H \\ P_2 Q_1^H & D_2 \end{pmatrix},
\]
where D₁ is m̃₁ × m̃₁, D₂ is m̃₂ × m̃₂, and

(4.2)
\[
B = U_1 V_2^H, \qquad E = P_2 Q_1^H
\]
are the low-rank SVDs of the off-diagonal blocks B and E. Note that if the compressions are done by τ-accurate SVD approximations or rank-revealing QR decompositions, then the appropriate "=" signs should be replaced by "≈" signs. We now split either the (1, 1) block or the (2, 2) block to obtain a 3 × 3 block SSS matrix. For instance, we can split the (1, 1) block according to an appropriate new sequence {m̂₁, m̂₂, m̂₃} as follows, where m̂₁ + m̂₂ = m̃₁ and m̂₃ = m̃₂:
\[
D_1 = \begin{pmatrix} D_1^{\mathrm{new}} & U_1^{\mathrm{new}} (V_2^{\mathrm{new}})^H \\ P_2^{\mathrm{new}} (Q_1^{\mathrm{new}})^H & D_2^{\mathrm{new}} \end{pmatrix}, \quad
\begin{pmatrix} U_1^{\mathrm{new}} W_2^{\mathrm{new}} \\ U_2^{\mathrm{new}} \end{pmatrix} = U_1, \quad
(V_3^{\mathrm{new}})^H = V_2^H,
\]
\[
\left( R_2^{\mathrm{new}} (Q_1^{\mathrm{new}})^H \;\; (Q_2^{\mathrm{new}})^H \right) = Q_1^H, \quad
P_3^{\mathrm{new}} = P_2, \quad D_3^{\mathrm{new}} = D_2,
\]
where the new generators (marked by the superscript new) introduced based on (4.1)
can be determined in the following way. First, we partition the matrices for the old
first block conformally with the two new blocks as
\[
D_1 = \begin{pmatrix} D_1^{11} & D_1^{12} \\ D_1^{21} & D_1^{22} \end{pmatrix}, \quad
U_1 = \begin{pmatrix} U_1^{1} \\ U_1^{2} \end{pmatrix}, \quad
\left( (Q_1^{1})^H \;\; (Q_1^{2})^H \right) = Q_1^H.
\]
We can then identify from these and the previous equations that

\[
D_1^{\mathrm{new}} = D_1^{11}, \quad D_2^{\mathrm{new}} = D_1^{22}, \quad U_2^{\mathrm{new}} = U_1^{2}, \quad V_3^{\mathrm{new}} = V_2, \quad Q_2^{\mathrm{new}} = Q_1^{2}, \quad P_3^{\mathrm{new}} = P_2, \quad D_3^{\mathrm{new}} = D_2.
\]
The remaining matrices satisfy

(4.3)
\[
\left( D_1^{12} \;\; U_1^{1} \right) = U_1^{\mathrm{new}} \left( (V_2^{\mathrm{new}})^H \;\; W_2^{\mathrm{new}} \right), \qquad
\begin{pmatrix} D_1^{21} \\ (Q_1^{1})^H \end{pmatrix}
= \begin{pmatrix} P_2^{\mathrm{new}} \\ R_2^{\mathrm{new}} \end{pmatrix} (Q_1^{\mathrm{new}})^H.
\]

By factorizing the left-hand side matrices using numerical tools such as the SVD or rank-revealing QR, these two equations allow us to compute the remaining matrices for the new blocks.
A is thus in the new form

\[
A = \begin{pmatrix}
D_1^{\mathrm{new}} & U_1^{\mathrm{new}} (V_2^{\mathrm{new}})^H & U_1^{\mathrm{new}} W_2^{\mathrm{new}} (V_3^{\mathrm{new}})^H \\
P_2^{\mathrm{new}} (Q_1^{\mathrm{new}})^H & D_2^{\mathrm{new}} & U_2^{\mathrm{new}} (V_3^{\mathrm{new}})^H \\
P_3^{\mathrm{new}} R_2^{\mathrm{new}} (Q_1^{\mathrm{new}})^H & P_3^{\mathrm{new}} (Q_2^{\mathrm{new}})^H & D_3^{\mathrm{new}}
\end{pmatrix},
\]

with block sizes m̂₁, m̂₂, m̂₃.
We can use similar techniques if we want to split the second block row and column of (4.1). We can continue by splitting either the first or the last block row and column using the above techniques, or by splitting any middle block row and column similarly. In this way we can construct the desired SSS representation according to the given sequence {m_i, i = 1, 2, …, n}.
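For the 2 × 2 starting point (4.1), the construction amounts to compressing the two off-diagonal blocks. The following is a minimal sketch, assuming NumPy and a truncated SVD for the compression; all names and the rank-2 test matrix are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
N, half, r = 8, 4, 2
# a test matrix whose off-diagonal blocks have exact rank r
A = np.zeros((N, N))
A[:half, :half] = rng.standard_normal((half, half))
A[half:, half:] = rng.standard_normal((half, half))
A[:half, half:] = rng.standard_normal((half, r)) @ rng.standard_normal((r, half))
A[half:, :half] = rng.standard_normal((half, r)) @ rng.standard_normal((r, half))

def truncated_svd(B, tau=1e-12):
    """Return X, Y with B ~ X @ Y.T, truncating tiny singular values."""
    U, s, Vh = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > tau * s[0]))
    return U[:, :k] * s[:k], Vh[:k, :].T

# the 2 x 2 SSS form (4.1): D1, D2 on the diagonal, B = U1 V2^H, E = P2 Q1^H
D1, D2 = A[:half, :half], A[half:, half:]
U1, V2 = truncated_svd(A[:half, half:])
P2, Q1 = truncated_svd(A[half:, :half])

err = np.linalg.norm(A - np.block([[D1, U1 @ V2.T], [P2 @ Q1.T, D2]]))
```

Repeated splitting of the diagonal blocks, as in (4.3), then refines this 2 × 2 form into an n-block SSS representation.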
The general construction can be organized with bisection. The major cost is in compressions of off-diagonal blocks of the form

(4.4)    D_i^{12} = X_i Y_i^H,

where X_i and Y_i are tall and thin, and D_i^{12} is an off-diagonal block of

\[
D_i = \begin{pmatrix} D_i^{11} & D_i^{12} \\ D_i^{21} & D_i^{22} \end{pmatrix}.
\]

The compression (4.4) can be achieved by a τ-accurate QR factorization. The construction is also practically stable in our implementation.
4.2. Compression of off-diagonal blocks. For a general matrix with low-rank off-diagonal blocks, the SSS construction can cost O(N²p), as a compression such as (4.2), (4.3), or (4.4) can take O(K²p) operations, where K is the dimension of the block being compressed. However, for the Cauchy-like matrix C in (2.3) the compressions can be precomputed.
Theorem 4.1. The compression of the off-diagonal block

(4.5)
\[
G = \left( \frac{u_i^T v_j}{\lambda_i - \eta_j} \right)_{1 \le i \le p,\; q \le j \le N} = X Y^H
\]

of C can be obtained from the compression of the corresponding off-diagonal block

\[
G_0 = \left( \frac{1}{\lambda_i - \eta_j} \right)_{1 \le i \le p,\; q \le j \le N} = X_0 Y_0^H
\]

of C₀, where the column dimension of X is no larger than twice the column dimension of X₀.
Proof. Assume the off-diagonal block G0 = (1/(λi − ηj))_{1≤i≤p, q≤j≤N} is in compressed
form X0 Y0^H, with (i, j) entry 1/(λi − ηj) = X_{i,:} Y_{j,:}^H, where X_{i,:} (Y_{j,:}) denotes the ith
(jth) row of X0 (Y0). As α ≤ 2, for simplicity we fix α = 2 in (2.1). Then the corresponding
off-diagonal block in C is

\[
G = \left( \frac{u_i^T \cdot v_j}{\lambda_i - \eta_j} \right)_{1\le i\le p,\; q\le j\le N}
  = \left( \frac{u_{i1} v_{j1} + u_{i2} v_{j2}}{\lambda_i - \eta_j} \right)_{1\le i\le p,\; q\le j\le N}
  = \Bigl( (u_{i1} v_{j1} + u_{i2} v_{j2})\, X_{i,:} Y_{j,:}^H \Bigr)_{1\le i\le p,\; q\le j\le N}
\]
\[
  = \left( \bigl( u_{i1} X_{i,:} \;\; u_{i2} X_{i,:} \bigr)
    \begin{pmatrix} v_{j1} Y_{j,:}^H \\ v_{j2} Y_{j,:}^H \end{pmatrix}
    \right)_{1\le i\le p,\; q\le j\le N}
  = \begin{pmatrix} u_{11} X_{1,:} & u_{12} X_{1,:} \\ \vdots & \vdots \\ u_{p1} X_{p,:} & u_{p2} X_{p,:} \end{pmatrix}
    \begin{pmatrix} v_{q1} Y_{q,:}^H & \cdots & v_{N1} Y_{N,:}^H \\ v_{q2} Y_{q,:}^H & \cdots & v_{N2} Y_{N,:}^H \end{pmatrix}
  \equiv X Y^H.
\]

That is, we get a compression of G.
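The construction in the proof translates directly into code. The following sketch (with α = 2 and randomly generated generators u, v; all names are illustrative) builds X and Y for G from a compression X0, Y0 of G0, doubling the column dimension as Theorem 4.1 states:

```python
import numpy as np

def lift_compression(X0, Y0, u, v):
    """Given G0 = X0 @ Y0.conj().T with G0[i, j] = 1/(lam[i] - eta[j]),
    return X, Y with G = X @ Y.conj().T, where
    G[i, j] = (u[i, 0]*v[j, 0] + u[i, 1]*v[j, 1]) / (lam[i] - eta[j])."""
    # Scale the rows of X0 (Y0) by the two generator columns and stack them:
    # the column dimension at most doubles, as stated in Theorem 4.1.
    X = np.hstack([u[:, :1] * X0, u[:, 1:2] * X0])
    Y = np.hstack([v[:, :1].conj() * Y0, v[:, 1:2].conj() * Y0])
    return X, Y
```

No factorization of G is required: the lifted factors are obtained purely by row scaling and stacking.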
Theorem 4.1 indicates that we can convert the compressions of the off-diagonal
blocks of C into compressions of those of C0, which are independent of the actual
entries of the Toeplitz matrix T. This means the compressions of off-diagonal blocks
of C0 can be precomputed. The precomputation can be done in O(N^2 p) flops by a
rank-revealing QR factorization such as in [27]. It is possible to reduce the cost to
O(N log N) due to the fact that the compression of a large off-diagonal block can
be obtained from those of small ones. This can be seen implicitly from the following
subsection.
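As a quick numerical sanity check of this low-rank property (a standalone illustration; the nodes ω^{2i} and ω^{2j+1} are those appearing in the displays of the next subsection), one can measure the numerical rank of an off-diagonal block of C0 directly:

```python
import numpy as np

# Numerical rank of an off-diagonal block of C0 = (1/(w^(2i) - w^(2j+1))),
# where w is a primitive 2N-th root of unity: the singular values decay
# rapidly, so the numerical rank is far below the block dimension N/2.
N = 256
w = np.exp(1j * np.pi / N)                    # primitive 2N-th root of unity
i = np.arange(1, N + 1)
C0 = 1.0 / (w ** (2 * i)[:, None] - w ** (2 * i + 1)[None, :])
block = C0[: N // 2, N // 2:]                 # leading off-diagonal block
s = np.linalg.svd(block, compute_uv=False)
rank = int(np.sum(s > 1e-9 * s[0]))           # numerical rank at tol 1e-9
```

The measured rank stays far below the block dimension N/2, which is what makes the O(N^2 p) precomputation bound, and the O(Np^2) solve, meaningful.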
4.3. Compressions of off-diagonal blocks in precomputation. We further
present a shifting strategy to reduce the cost of the compressions in the precomputation.
The significance of this shifting strategy is to relate the compressions of large
off-diagonal blocks of C0 to those of small ones. That is, in different splitting stages
of the SSS construction of C, the compressions of off-diagonal blocks with different
sizes can be related.
For simplicity, we look at the example of partitioning C0 into 4 × 4 blocks. Assume
in the first cut that we can partition C0 into 2 × 2 blocks with equal dimensions as in
the following:

\[
C_0 = \left( \frac{1}{\omega^{2i} - \omega^{2j+1}} \right)_{1 \le i,\, j \le N}
    = \begin{pmatrix} C_{0;1,1} & C_{0;1,2} \\ C_{0;2,1} & C_{0;2,2} \end{pmatrix}
    = \begin{pmatrix}
        \frac{1}{\omega^{2}-\omega^{3}} & \cdots & \frac{1}{\omega^{2}-\omega^{4k+1}} &
        \frac{1}{\omega^{2}-\omega^{4k+3}} & \cdots & \frac{1}{\omega^{2}-\omega^{2N+1}} \\
        \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
        \frac{1}{\omega^{4k}-\omega^{3}} & \cdots & \frac{1}{\omega^{4k}-\omega^{4k+1}} &
        \frac{1}{\omega^{4k}-\omega^{4k+3}} & \cdots & \frac{1}{\omega^{4k}-\omega^{2N+1}} \\
        & C_{0;2,1} & & & C_{0;2,2} &
      \end{pmatrix},
\]

where k = N/4, and without loss of generality we consider only the block upper triangular
part. Assume that we have obtained a compression X1 Y1H of C0;1,2 such that

\[
C_0 = \begin{pmatrix} C_{0;1,1} & X_1 Y_1^H \\ C_{0;2,1} & C_{0;2,2} \end{pmatrix},
\]
where X1 and Y1 are tall and thin and can be computed with a τ-accurate SVD or
rank-revealing QR factorization. Next, we split C0;1,1, the (1, 1) block of C0, into
2 × 2 blocks and compress its off-diagonal block. Suppose the resulting off-diagonal
block of C0;1,1 has size k × (N/2 − l). We will compress it by shifting certain parts of
X1 and Y1. That is, we partition X1 and Y1 conformally as

\[
X_1 = \begin{pmatrix} X_{1,1} \\ X_{1,2} \end{pmatrix}, \qquad
Y_1 = \begin{pmatrix} Y_{1,1} \\ Y_{1,2} \end{pmatrix},
\]

and also pick a k × (N/2 − l) block from the lower left corner of C0;1,2 = X1 Y1H (that
is, X1,2 Y1,1H):
\[
C_0 = \begin{pmatrix} C_{0;1,1} & X_1 Y_1^H \\ C_{0;2,1} & C_{0;2,2} \end{pmatrix}
    = \begin{pmatrix}
        & \frac{1}{\omega^{2}-\omega^{2l+3}} & \cdots & \frac{1}{\omega^{2}-\omega^{N+1}} & & & \\
        & \vdots & \ddots & \vdots & & & \\
        & \frac{1}{\omega^{2k}-\omega^{2l+3}} & \cdots & \frac{1}{\omega^{2k}-\omega^{N+1}} & & & \\
        & & & & \frac{1}{\omega^{2k+2}-\omega^{N+3}} & \cdots & \frac{1}{\omega^{2k+2}-\omega^{2(N-l)+1}} \\
        & & & & \vdots & \ddots & \vdots \\
        & & & & \frac{1}{\omega^{4k}-\omega^{N+3}} & \cdots & \frac{1}{\omega^{4k}-\omega^{2(N-l)+1}} \\
        C_{0;2,1} & & & & & C_{0;2,2} &
      \end{pmatrix}
\]
\[
    = \begin{pmatrix}
        \begin{pmatrix} \; & X_2 Y_2^H \\ \; & \; \end{pmatrix} &
        \begin{pmatrix} X_{1,1} Y_{1,1}^H & X_{1,1} Y_{1,2}^H \\ X_{1,2} Y_{1,1}^H & X_{1,2} Y_{1,2}^H \end{pmatrix} \\
        C_{0;2,1} & C_{0;2,2}
      \end{pmatrix},
\]
where X2 Y2H is an as yet unknown compression of the upper right submatrix of C0;1,1, and
the blocks that do not concern us are left blank. At this point we do not need another
factorization to get X2 Y2H; instead we can derive it directly from X1,2 and Y1,1.
Clearly we have
\[
X_2 Y_2^H = \begin{pmatrix}
  \frac{1}{\omega^{2}-\omega^{2l+3}} & \cdots & \frac{1}{\omega^{2}-\omega^{N+1}} \\
  \vdots & \ddots & \vdots \\
  \frac{1}{\omega^{2k}-\omega^{2l+3}} & \cdots & \frac{1}{\omega^{2k}-\omega^{N+1}}
\end{pmatrix},
\]
\[
X_{1,2} Y_{1,1}^H = \begin{pmatrix}
  \frac{1}{\omega^{2k+2}-\omega^{N+3}} & \cdots & \frac{1}{\omega^{2k+2}-\omega^{2(N-l)+1}} \\
  \vdots & \ddots & \vdots \\
  \frac{1}{\omega^{4k}-\omega^{N+3}} & \cdots & \frac{1}{\omega^{4k}-\omega^{2(N-l)+1}}
\end{pmatrix}
= \frac{1}{\omega^{2k}} X_2 Y_2^H.
\]
That means we can get X2 Y2H by shifting a subblock of X1 Y1H . A similar situation
holds for the splitting of the (2, 2) block of C after the first splitting. For the successive
compressions in later splittings, a similar shifting can be used. For splitting with
general block sizes the shifting is also similar.
The shifting scheme indicates that, in different levels of divide-and-conquer SSS
constructions, the compressions of large blocks and small blocks are related. This can
be used to further save the compression cost.
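The mechanism behind the shift is elementary: all nodes are powers of a common ω, and multiplying every row node and column node by the same power of ω rescales a Cauchy block by the reciprocal of that power. A small numerical check (the block sizes here are illustrative, not the paper's):

```python
import numpy as np

# Multiplying all nodes of a Cauchy block by w**(2*k) rescales the block by
# w**(-2*k): 1/(w^(2k)*x - w^(2k)*y) = w^(-2k) / (x - y). Hence a compression
# of one block yields a compression of the shifted block for free.
N = 16
k = N // 4
w = np.exp(1j * np.pi / N)                       # 2N-th root of unity
x = w ** (2 * np.arange(1, k + 1))               # row nodes
y = w ** (2 * np.arange(k + 1, 2 * k + 1) + 1)   # column nodes
A = 1.0 / (x[:, None] - y[None, :])
B = 1.0 / ((w**(2 * k) * x)[:, None] - (w**(2 * k) * y)[None, :])
assert np.allclose(B, A / w**(2 * k))
```

So once one block is compressed, the shifted block's factors follow by a scalar rescaling, with no further factorization.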
5. Performance and numerical experiments. It is well known that the FFT
transformation of the Toeplitz matrix T to a Cauchy-like matrix C, and the recovery
from the solution x̃ of the Cauchy-like system to the solution x of the Toeplitz system,
are stable. In addition, the fast divide-and-conquer construction and the SSS system
solver in section 3 are both practically stable as we will see in our numerical results.
Thus our overall algorithm is stable in practice. Here no extra steps are needed to
stabilize the algorithm as required in some other superfast methods [38].
After the precomputations discussed in sections 4.2 and 4.3 are finished, the cost
of each SSS construction is only O(Np^2), where p is the maximum of the off-diagonal
ranks of the Cauchy matrix C0. The SSS solver also costs O(Np^2). The total cost
is thus no more than O(N log N) + O(Np^2) flops. The total storage requirement
is O(Np). For the convenience of coding, we use an O(N^2 p)-cost precomputation
routine as discussed in subsection 4.2.
A preliminary implementation of our superfast solver in Fortran 90 is available
at http://www.math.ucla.edu/~jxia/work/toep. We did experiments on an Itanium 2
1.4 GHz SGI Altix 350 server with a 64-bit Linux operating system and Intel MKL
BLAS. Our method (denoted by NEW) is compared to Gaussian elimination with partial
pivoting (denoted by GEPP) (via the LAPACK linear system solver ZGESV [3]). We
consider real N × N Toeplitz matrices T whose entries are random and uniformly
distributed in [0, 1]. The right-hand sides b are obtained by b = T x, where the
exact solution x has random entries uniformly distributed in [−1, 1]. Matrices of
size N = 2^k × 100, k = 2, 3, . . . , are considered. These matrices are moderately
ill-conditioned. For example, for N = 2^k × 100 with k = 2, . . . , 7, the one-norm condition
numbers of the matrices increase from the order of about 10^3 to 10^6.
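A test problem of this kind can be generated in a few lines (a sketch using scipy.linalg.toeplitz; the size shown corresponds to k = 2):

```python
import numpy as np
from scipy.linalg import toeplitz

# A random N x N Toeplitz test matrix with entries uniform on [0, 1], and a
# right-hand side manufactured from a known exact solution x.
rng = np.random.default_rng(0)
N = 400                              # N = 2**k * 100 with k = 2
col = rng.uniform(0.0, 1.0, N)       # first column: t_0, t_1, ..., t_{N-1}
row = np.concatenate(([col[0]], rng.uniform(0.0, 1.0, N - 1)))  # first row
T = toeplitz(col, row)
x = rng.uniform(-1.0, 1.0, N)        # exact solution, uniform on [-1, 1]
b = T @ x                            # manufactured right-hand side
```

Manufacturing b from a known x allows both the forward error and the residual of any solver to be measured directly.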
For the NEW solver two important parameters are involved: the SSS block size d
(Figure 2.1) and the tolerance τ in the compressions of off-diagonal blocks. Here we
use the τ-accurate QR factorization with τ as a relative tolerance (section 2.2).
Figure 5.1 shows the execution time (in seconds) for GEPP and our NEW solver
with different d and τ . For the NEW solver only the time for solving the Cauchy-like
[Figure 5.1 here: log-scale execution time curves for GEPP, NEW(τ = 10^-9, d = 100), and NEW(τ = 10^-4, d = 50).]
Fig. 5.1. Computation time (in seconds) versus N = 2^k × 100. For the NEW solver, time for the
precomputation is excluded.
[Figure 5.2 here: log-scale precomputation time curves for NEW(τ = 10^-9, d = 100) and NEW(τ = 10^-4, d = 50).]
Fig. 5.2. Time (in seconds) for precomputations in the NEW solver.
system (2.3) is reported, as the compressions of off-diagonal blocks can be done in the
precomputation, as described in section 4.3, and the SSS construction time is nearly
the precomputation time. The precomputation time is shown in Figure 5.2, although
the precomputation only needs to be done once for a particular matrix size N .
We also consider the time scaling factor, that is, the factor by which the time is
multiplied when the matrix size doubles; see Figure 5.3. We observe that the time
scaling factors for the NEW solver are near 2; that is to say, the NEW solver is close to
being a linear-time solver.
Figures 5.4 and 5.5 present the errors ε1 = ||x̂ − x||2 / ||x||2 and
ε2 = ||T x̂ − b||2 / || |T| |x̂| + |b| ||2, respectively, where x̂ denotes the numerical solution.
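In code, the two measures read as follows (a direct transcription; T, b, x, and the computed solution x_hat are assumed given):

```python
import numpy as np

def error_measures(T, b, x, x_hat):
    """eps1: forward relative error; eps2: residual scaled componentwise
    by |T| |x_hat| + |b|, as in Figures 5.4 and 5.5."""
    eps1 = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
    eps2 = np.linalg.norm(T @ x_hat - b) / np.linalg.norm(
        np.abs(T) @ np.abs(x_hat) + np.abs(b))
    return eps1, eps2
```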
A significant point of the new solver is that we can use a relatively large tolerance
τ in the precomputation and solving, and then use iterative refinement to improve the
accuracy. A relatively large τ leads to relatively small off-diagonal ranks (and thus
[Figure 5.3 here: time scaling factors versus k for GEPP, NEW(τ = 10^-9, d = 100), and NEW(τ = 10^-4, d = 50).]
Fig. 5.3. Time scaling factors.
[Figure 5.4 here: relative error ε1 versus k for NEW(τ = 10^-9, d = 100) and GEPP.]
Fig. 5.4. ε1 = ||x̂ − x||2 / ||x||2 versus N = 2^k × 100.
[Figure 5.5 here: relative residual error ε2 versus k for NEW(τ = 10^-9, d = 100) and GEPP.]
Fig. 5.5. ε2 = ||T x̂ − b||2 / || |T| |x̂| + |b| ||2 versus N = 2^k × 100.
small p in the operation counts above). Iterative refinement is very cheap, due to the
facts that no precomputation is needed and that the solver itself is very fast (Figure 5.1).
For τ = 10^-4 and d = 50, the accuracy results after 2 to 8 steps of iterative refinement
are displayed in Figure 5.6. The number of iterative refinement steps required
to reach ε2 < 10^-13 is listed in Table 5.1.
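The refinement loop itself is generic; the sketch below uses an arbitrary approximate solver in place of the SSS solver, so it only illustrates the structure of the iteration, not our implementation:

```python
import numpy as np

def iterative_refinement(T, b, approx_solve, steps=8):
    """Classical iterative refinement: repeatedly correct the iterate by
    solving for the residual with the same cheap, approximate solver."""
    x = approx_solve(b)
    for _ in range(steps):
        r = b - T @ x            # residual of the current iterate
        x = x + approx_solve(r)  # cheap correction step
    return x
```

Each step costs one multiplication by T plus one approximate solve, so with a fast solver the refinement is cheap; it converges as long as the approximate solver reduces the error by a fixed factor.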
Thus it is clear that our NEW solver can also perform well as a preconditioner.
The block size d also affects the performance. Figure 5.7 indicates that for a
[Figure 5.6 here: relative residual error ε2 versus k, initially and after 2, 4, 6, and 8 steps of iterative refinement.]
Fig. 5.6. ε2 = ||T x̂ − b||2 / || |T| |x̂| + |b| ||2, initial values (before iterative refinement)
and after some steps of iterative refinement.
[Figure 5.7 here: computation time (precomputation + solver, and solver only) versus j for block size d = 2^j × 25.]
Fig. 5.7. Computation time of the NEW solver with N = 12800, τ = 10^-9, and different block
sizes d = 2^j × 25.
Table 5.1
Number of iterative refinement steps required to reach ε2 < 10^-13 for different matrix dimensions
N = 2^k × 100.

k                  2    3    4    5    6    7    8    9
Number of steps    4    4    5    6    7   15    9   21
given N, if d = 2^j × 25 is too large or too small, then both the solver time and the
precomputation time can be relatively large. In fact, by counting the operations in the
algorithm it is possible to determine an optimal d; see [12] for the derivation of the
optimal choice of d in the SSS solver. We can do similar derivations for both the SSS
construction and the SSS solver. We just point out that when τ gets larger, the off-diagonal
ranks decrease and a smaller d should be chosen. In the previous experiments
we used two sets of parameters: (τ = 10^-9, d = 100) and (τ = 10^-4, d = 50).
Finally, it turns out that our current preliminary implementation of the solver is
slower than the implementation of the algorithm in [38], likely due to our inefficient
data structure and nonoptimized codes for SSS matrices. The complicated nature
of SSS matrices needs more careful coding and memory management. An improved
software implementation is under construction.
6. Conclusions and future work. In this paper we have presented a superfast
and practically stable solver for Toeplitz systems of linear equations. A Toeplitz
matrix is first transformed into a Cauchy-like matrix, which has a nice low-rank property,
and then the Cauchy-like system is solved by a superfast solver. This superfast
solver utilizes the low-rank property and makes use of an SSS representation of the
Cauchy-like matrix. A fast construction procedure for SSS structures is presented.
After a one-time precomputation the solver is very efficient (O(N log N) + O(Np^2)
complexity). The algorithm is also efficient in that only linear storage is required. In
future work we hope to further reduce the precomputation cost and to finish a
better-designed version of the current Fortran 90 codes, in both the computations and the
coding.
Acknowledgments. Many thanks to the referees for their valuable comments
which have greatly improved the development and presentation of this paper. The
authors are also grateful to the editors and the editorial assistant for their patience
and thoughtful suggestions.
REFERENCES
[1] G. S. Ammar and W. B. Gragg, Superfast solution of real positive definite Toeplitz systems,
SIAM J. Matrix Anal. Appl., 9 (1988), pp. 61–76.
[2] G. S. Ammar and W. B. Gragg, Numerical experience with a superfast real Toeplitz solver,
Linear Algebra Appl., 121 (1989), pp. 185–206.
[3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum,
S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users’
Guide, 2nd ed., SIAM, Philadelphia, PA, 1994.
[4] R. R. Bitmead and B. D. O. Anderson, Asymptotically fast solution of Toeplitz and related
systems of linear equations, Linear Algebra Appl., 34 (1980), pp. 103–116.
[5] A. W. Bojanczyk, R. P. Brent, F. R. de Hoog, and D. R. Sweet, On the stability of
the Bareiss and related Toeplitz factorization algorithms, SIAM J. Matrix Anal. Appl., 16
(1995), pp. 40–57.
[6] R. P. Brent, F. G. Gustavson, and D. Y. Y. Yun, Fast solution of Toeplitz systems of
equations and computation of Padé approximants, J. Algorithms, 1 (1980), pp. 259–295.
[7] J. R. Bunch, Stability of methods for solving Toeplitz systems of equations, SIAM J. Sci.
Statist. Comput., 6 (1985), pp. 349–364.
[8] J. Carrier, L. Greengard, and V. Rokhlin, A fast adaptive multipole algorithm for particle
simulations, SIAM J. Sci. Statist. Comput., 9 (1988), pp. 669–686.
[9] T. F. Chan and P. C. Hansen, A look-ahead Levinson algorithm for general Toeplitz systems,
IEEE Trans. Signal Process., 40 (1992), pp. 1079–1090.
[10] T. F. Chan and P. C. Hansen, A look-ahead Levinson algorithm for indefinite Toeplitz systems, SIAM J. Matrix Anal. Appl., 13 (1992), pp. 490–506.
[11] S. Chandrasekaran and M. Gu, Fast and stable algorithms for banded plus semiseparable
matrices, SIAM J. Matrix Anal. Appl., 25 (2003), pp. 373–384.
[12] S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, X. Sun, A.-J. van der Veen, and
D. White, Fast Stable Solvers for Sequentially Semi-Separable Linear Systems of Equations and Least Squares Problems, Technical report, University of California, Berkeley,
CA, 2003.
[13] S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, X. Sun, A.-J. van der Veen, and
D. White, Some fast algorithms for sequentially semiseparable representations, SIAM J.
Matrix Anal. Appl., 27 (2005), pp. 341–364.
[14] S. Chandrasekaran, M. Gu, and W. Lyons, A Fast and Stable Adaptive Solver for Hierarchically Semi-Separable Representations, Technical report, UCSB Math 2004-20, University
of California, Santa Barbara, CA, 2004.
[15] G. Cybenko, The numerical stability of the Levinson-Durbin algorithm for Toeplitz systems
of equations, SIAM J. Sci. Statist. Comput., 1 (1980), pp. 303–319.
[16] G. Cybenko, Error Analysis of Some Signal Processing Algorithms, Ph.D. thesis, Princeton
University, Princeton, NJ, 1978.
[17] F. R. de Hoog, On the solution of Toeplitz systems, Linear Algebra Appl., 88/89 (1987),
pp. 123–138.
[18] P. Dewilde and A. van der Veen, Time-Varying Systems and Computations, Kluwer
Academic Publishers, Boston, MA, 1998.
[19] M. Fiedler, Hankel and Loewner matrices, Linear Algebra Appl., 58 (1984), pp. 75–95.
[20] K. A. Gallivan, S. Thirumalai, P. Van Dooren, and V. Vermaut, High performance
algorithms for Toeplitz and block Toeplitz matrices, Linear Algebra Appl., 241/243 (1996),
pp. 343–388.
[21] L. Gemignani, Schur complements of Bezoutians and the inversion of block Hankel and block
Toeplitz matrices, Linear Algebra Appl., 253 (1997), pp. 39–59.
[22] I. Gohberg, T. Kailath, and V. Olshevsky, Fast Gaussian elimination with partial pivoting
for matrices with displacement structure, Math. Comp., 64 (1995), pp. 1557–1576.
[23] I. Gohberg and V. Olshevsky, Fast state space algorithms for matrix Nehari and Nehari-Takagi interpolation problems, Integral Equations Operator Theory, 20 (1994), pp. 44–83.
[24] I. Gohberg and V. Olshevsky, Complexity of multiplication with vectors for structured matrices, Linear Algebra Appl., 202 (1994), pp. 163–192.
[25] L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, J. Comput. Phys.,
73 (1987), pp. 325–348.
[26] M. Gu, Stable and efficient algorithms for structured systems of linear equations, SIAM J.
Matrix Anal. Appl., 19 (1998), pp. 279–306.
[27] M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR
factorization, SIAM J. Sci. Comput., 17 (1996), pp. 848–869.
[28] G. Heinig, Inversion of generalized Cauchy matrices and other classes of structured matrices,
in Linear Algebra for Signal Processing, IMA Vol. Math. Appl. 69, Springer, New York,
1995, pp. 63–81.
[29] G. Heinig and K. Rost, Algebraic Methods for Toeplitz-like Matrices and Operators, Oper.
Theory Adv. Appl. 13, Birkhäuser Verlag, Basel, 1984, pp. 109–127.
[30] T. Kailath, Fredholm resolvents, Wiener-Hopf equations, and Riccati differential equations,
IEEE Trans. Inform. Theory, 15 (1969), pp. 665–672.
[31] T. Kailath, S. Kung, and M. Morf, Displacement ranks of matrices and linear equations,
J. Math. Anal. Appl., 68 (1979), pp. 395–407.
[32] T. Kailath and A. H. Sayed, eds., Fast Reliable Algorithms for Matrices with Structure,
SIAM, Philadelphia, 1999.
[33] P. G. Martinsson, V. Rokhlin, and M. Tygert, A fast algorithm for the inversion of general
Toeplitz matrices, Comput. Math. Appl., 50 (2005), pp. 741–752.
[34] M. Morf, Fast Algorithms for Multivariable Systems, Ph.D. thesis, Department of Electrical
Engineering, Stanford University, Stanford, CA, 1974.
[35] B. R. Musicus, Levinson and Fast Choleski Algorithms for Toeplitz and Almost Toeplitz
Matrices, Technical report, Res. Lab. of Electronics, M.I.T., Cambridge, MA, 1984.
[36] V. Pan, On computations with dense structured matrices, Math. Comp., 55 (1990),
pp. 179–190.
[37] D. R. Sweet, The use of pivoting to improve the numerical performance of algorithms for
Toeplitz matrices, SIAM J. Matrix Anal. Appl., 14 (1993), pp. 468–493.
[38] M. Van Barel, G. Heinig, and P. Kravanja, A stabilized superfast solver for nonsymmetric
Toeplitz systems, SIAM J. Matrix Anal. Appl., 23 (2001), pp. 494–510.
[39] M. Van Barel and P. Kravanja, A stabilized superfast solver for indefinite Hankel systems,
Linear Algebra Appl., 284 (1998), pp. 335–355.