MA398 Script
MA398 Script
Acknowledgement
We are grateful to Menelaos Karavelas, Ian Mitchell and Stuart Price for assistance in the type-
setting of this material. We are grateful to a variety of students at Stanford University (CS237A)
and at Warwick University (MA398) for many helpful comments which have signicantly im-
proved the notes.
3
4
Contents
1 Vector and Matrix Analysis 7
1.1 Vector Norms and Inner Products . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Dual Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5 Structured Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Matrix Factorisations 23
2.1 Diagonalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Jordan Canonical Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 QR Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 LU Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6 Cholesky Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Complexity of Algorithms 47
4.1 Computational Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Matrix-Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 Bidiagonal and Hessenberg Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6 Iterative Methods 73
6.1 Linear Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2 The Jacobi Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.3 The Gauss-Seidel and SOR Methods . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.4 Nonlinear Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.5 The Steepest Descent Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.6 The Conjugate Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5
6 CONTENTS
Examples.
the p-norm for 1 p < :
n
X 1/p
kxkp = |xj |p x Cn ;
j=1
Theorem 1.2. All norms on Cn are equivalent: for each pair of norms k ka and k kb on Cn
there are constants 0 < c1 c2 < with
c1 kxka kxkb c2 kxka x Cn .
7
8 CHAPTER 1. VECTOR AND MATRIX ANALYSIS
Proof. Using property b) from the dention of a vector norm it suces to consider vectors
x S = x Cn kxk2 = 1 . Since k ka is non-zero on all of S we can dene f : S R
by f (x) = kxkb /kxka . Because the function f is continuous and the set S is compact there are
x1 , x2 S with f (x1 ) f (x) f (x2 ) for all x S . Setting c1 = f (x1 ) > 0 and c2 = f (x2 )
completes the proof.
Remarks.
1. The same result holds for norms on Rn . The proof transfers to this situation without
change.
Remark. Conditions c) and d) above state that h , i is linear in the second component. Using
the rules for inner products we get
and
hx, yi = hx, yi for all C, x, y Cn .
The inner product is said to be anti-linear in the rst component.
n
X
hx, yi = xj yj x, y Cn . (1.1)
j=1
Denition 1.4. Two vectors x, y are orthogonal with respect to an inner product h, i i
hx, yi = 0.
Lemma 1.5 (Cauchy-Schwarz inequality). Let h , i : Cn Cn C be an inner product. Then
hx, yi2 hx, xi hy, yi
(1.2)
for every x, y Cn and equality holds if and only if x and y are linearly dependent.
Proof. For every C we have
hx, yi2
hx, yihy, xi hy, xihx, yi hy, xihx, yi
0 hx, xi + = hx, xi
hy, yi hy, yi hy, yi hy, yi
and multiplying the result by hy, yi gives (1.2).
If equality holds in (1.2) then xy in (1.3) must be 0 and thus x and y are linearly dependent.
If on the other hand x and y are linearly dependent, say x = y , then = hy, yi/hy, yi =
and x y = 0 giving equality in (1.3) and thus in (1.2).
1.1. VECTOR NORMS AND INNER PRODUCTS 9
is a vector norm.
Proof.
p
a) Since h, i is an inner product we have hx, xi 0 for all x Cn , i.e. hx, xi is real
and positive. Also we get
kxk = 0 hx, xi = 0 x = 0.
b) We have
p p
kxk = hx, xi = hx, xi = || kxk.
c) Using the Cauchy-Schwarz inequality
x, y Cn
hx, yi kxkkyk
kx + yk2 = hx + y, x + yi
= hx, xi + hx, yi + hy, xi + hy, yi
kxk2 + 2hx, yi + kyk2
Remark. The angle between two vectors x and y is the unique value [0, ] with
When considering the Euclidean norm and inner product on Rn , this denition of angle coincides
with the usual, geometric meaning of angles. In any case, two vectors are orthogonal, if and only
if they have angle /2.
hx, yi = x y.
for all x Cn , y Cm and allA Cmn . Unless otherwise specied, we will use h , i to
denote the standard inner product (1.1) and k k2 to denote the corresponding Euclidean norm.
10 CHAPTER 1. VECTOR AND MATRIX ANALYSIS
A matrix Q is unitary, if and only if the columns of Q are orthonormal with respect to the
standard inner-product. In particular unitary matrices cannot have more columns than
rows.
The standard inner product and norm are invariant under multiplication by a unitary matrix:
Theorem 1.9. Let h , i denote the standard inner product. Then for any unitary Q Cmn
and any x, y Cn we have hQx, Qyi = hx, yi and kQxk2 = kxk2 .
Proof. The rst claim follows from hQx, Qyi = hx, Q Qyi = hx, yi and using the relation kxk2 =
hx, xi gives the second claim.
Other inner products with appropriate properties can give rise to other norms; for example,
for matrices A which are Hermitian and positive-denite,
When x is an eigenvector of A, then for every 6= 0 the vector x is an eigenvector for the
same eigenvalue, since both sides of (1.6) are linear in x. Sometimes it is convenient to normalise
n
x by choosing kxk2 = 1. Then the eigenvalue problem is to nd (x, ) C C satisfying
Ax = x and kxk2 = 1.
y A = y and y 6= 0
AX = X.
By invertibility of X we have
A = XX 1 . (1.7)
These are both similarity transformations which reveal the eigenvalues of A on the diagonals
of J and T, respectively. The Jordan Canonical Form is not stable to perturbations, but the
Schur Factorization is. Hence Schur Factorization will form the basis of good algorithms while
the Jordan Canonical Form is useful for more theoretical purposes, such as dening the matrix
exponential eAt .
Theorem 1.16 (Similarity and Eigenvalues). If B is similar to A, then B and A have the same
eigenvalues with the same algebraic and geometric multiplicities.
Proof. Exercise 2-4.
Proof. (Sketch) We prove this in the case where A has n linearly independent eigenvalues
xi all of which correspond to simple eigenvalues i . Then A may be factorized as in (1.7) with
= diag{1 , 2 , . . . , n }. Hence
(A I)2 = XX 1 ,
12 CHAPTER 1. VECTOR AND MATRIX ANALYSIS
where = diag{(1 )2 , (2 )2 , . . . , (n )2 }.
Without loss of generality let = 1 , noting that j 6= by simplicity. ker() is one
2
dimensional, spanned by e1 . Hence ker(A I) is one dimensional by Theorem 1.16
The general case can be established by use of the Jordan form (see Theorem 2.7), using the
fact that the Jordan block corresponding to a simple eigenvalue is diagonal.
Theorem 1.18 (Eigenvalue Multiplicities). For any eigenvalue of A Rnn the algebraic and
geometric multiplicities q and r respectively satisfy
1 r q.
Proof. Let be the eigenvalue. Since (A I) is non-invertible, its null-space U has dimension
r 1. Let V Cnr have r columns comprising an orthonormal basis for U ; then extend V to
V Cnn by adding orthonormal columns so that V is unitary. Now
I C
B = V AV = ,
0 D
Thus B has algebraic multiplicity r for B and hence A has algebraic multiplicity r, by
Theorem 1.16.
being normal in the sense of the denition above. Sometimes this alternative condition is used
to dene when a matrix is normal. As a consequence of this equivalence, every Hermitian matrix
is also normal.
Let now A be Hermitian and positive denite. Then Ax = x implies
xi.
hx, xi = hx, xi = hx, Axi = hAx, xi = hx, xi = hx,
Thus, all eigenvalues of A are real and we can arrange them in increasing order min = 1
n = max . The following lemma uses min and max to estimate the values of the norm
k kA from (1.5) by k k.
Lemma 1.21. Let min and max be the smallest and largest eigenvalues of a Hermitian, positive
denite matrix A Cnn . Then
min kxk2 kxk2A max kxk2 x Cn .
1.3. DUAL SPACES 13
and
n
X
kxk2A = i i2 .
i=1
This gives the upper bound
n
X
kxk2A max i2 max kxk2
i=1
Denition 1.22. Given a norm k k on Cn , the pair (Cn , k k) is a Banach space B . The Banach
space B , the dual of B , is the pair (C , k kB ), where
0 n 0
See Exercise 1-5 to deduce that the preceeding denition satises the norm axioms.
Theorem 1.23. The spaces (Cn , k k1 ) and (Cn , k k ) are the duals of one another.
Proof. Firstly, we must show
kxk1 = max |hx, yi|.
kyk =1
and therefore
max |hx, yi| kxk1 .
kyk =1
We need to show that this upper-bound is achieved. If yj = xj /|xj | (with the convention that
this is 0 when xj is 0) then kyk = 1 (since x 6= 0) and
n
X n
X
hx, yi = |xj |2 /|xj | = |xj | = kxk1 .
j=1 j=1
We have
If x = 0 we have equality; if not then, for some k such that |xk | = kxk > 0, choose yj =
jk xk /|xk |. Then
hx, yi = |xk | = kxk and kyk1 = 1.
Thus maxkyk1 =1 |hx, yi| = kxk .
14 CHAPTER 1. VECTOR AND MATRIX ANALYSIS
Theorem 1.24. If p, q (1, ) with p1 + q1 = 1 then the Banach spaces (Cn , k kp ) and
(Cn , k kq ) are the duals of one another.
Since we can consider the space Cmn of all mn-matrices to be a copy of the mn-dimensional
mn
vector space C , we can use all vector norms on Cmn as vector norms on the matrices Cmn .
Examples of vector norms on the space of matrices include
kAk(m,
n) = max kAxkm
kxkn
=1
where k km
is a norm on Cm , and k kn is a norm on Cn . Note that, for any operator
norm,
kAxkm
kAk(m,
n) = max kAxkm
= max kAxkm
= max .
kxkn
1 kxkn
=1 xCn \{0} kxkn
Sometimes it is helpful to consider special vector norms on a space of matrices, which are
compatible with the matrix-matrix multiplication.
Remark. Conditions a), b) and c) state that kk is a vector norm on the vector space Cnn .
Condition d) only makes sense for matrices, since general vectors spaces are not equipped with
a product.
The vector operator norm from Cn into Cm reduces to the p-operator norm if n=m and
the p-norm is chosen in the range and image spaces.
Denition 1.26. Given a vector norm k kv on Cn we dene the induced norm k km on Cnn
by
kAxkv
kAkm = max
x6=0 kxkv
nn
for all AC .
Theorem 1.27. The induced norm k km of a vector norm k kv is a matrix norm with
kIkm = 1
and
kAxkv kAkm kxkv
for all A Cnn and x Cn .
Proof. a) kAkm R and kAkm 0 for all A Cnn is obvious from the denition. Also from
the denition we get
kAxkv
kAkm = 0 = 0 x 6= 0
kxkv
kAxkv = 0 x 6= 0 Ax = 0 x 6= 0 A = 0.
kAxkv ||kAxkv
kAkm = max = max = ||kAkm
x6=0 kxkv x6=0 kxkv
kAx + Bxkv
kA + Bkm = max
x6=0 kxkv
kAxkv + kBxkv
max
x6=0 kxkv
kAxkv kBxkv
max + max = kAkm + kBkm .
x6=0 kxkv x6=0 kxkv
kIxkv kxkv
kIkm = max = max =1
x6=0 kxkv x6=0 kxkv
and
kAykv kAxkv
kAkm = max x Cn \ {0}
y6=0 kykv kxkv
which gives
kAxkv kAkm kxkv x Cn .
d) Using this estimate we nd
kABxkv
kABkm = max
kxkv
x6=0
kAkm kBxkv
max = kAkm kBkm .
x6=0 kxkv
Remarks.
1. Usually one denotes the induced matrix norm with the same symbol as the corresponding
vector norm. For the remainder of this text we will follow this convention.
2. As a consequence of theorem 1.27 we can see that not every matrix norm is an induced
norm: If k km is a matrix norm, then it is easy to check that k k0m = 2k km is a matrix
norm, too. But at most one of these two norms can equal 1 for the identity matrix, and
thus the other one cannot be an induced matrix norm.
16 CHAPTER 1. VECTOR AND MATRIX ANALYSIS
1
kxk kxkA kAkkxk.
kA1 k
Theorem 1.28. The matrix norm induced by the innity norm is the maximum row sum:
n
X
kAk = max |aij |.
1in
j=1
n
X n
X
kAxk = max |(Ax)i | = max | aij xj | max |aij |kxk
1in 1in 1in
j=1 j=1
which gives
n
kAxk X
max |aij |
kxk 1in
j=1
Pn
for all x Cn kAk max1in j=1 |aij |.
and thus
For the lower bound choose k {1, 2, . . . , n} such that
n
X n
X
max |aij | = |akj |
1in
j=1 j=1
and dene x Cn by xj = akj /|akj | for all j = 1, . . . , n (with the convention that this is 0 when
akj is 0). Then we have kxk = 1 and
kAxk
kAk
kxk
n
X akj
= max | aij |
1in
j=1
|akj |
n
X akj
| akj |
j=1
|akj |
n
X
= |akj |
j=1
n
X
= max |aij |.
1in
j=1
Proof. (Sketch)
= kA k1 .
(A) = max || is eigenvalue of A .
Theorem 1.30. For any matrix norm k k, any matrix A Cnn and any k N we have
(A)k (Ak ) kAk k kAkk .
Dividing by kXk gives (B) kBk. The nal inequality follows from property d) in the denition
of a matrix norm.
and get
n
X
kxk22 = |j |2 .
j=1
Similarly we nd
n
X n
X
Ax = j j xj and kAxk22 = |j j |2 .
j=1 j=1
18 CHAPTER 1. VECTOR AND MATRIX ANALYSIS
This shows
2 1/2
Pn
kAxk2 j=1 |j j |
=
kxk2 2 1/2
Pn
j=1 |j |
Pn |j |2 |1 |2 1/2
j=1
P n 2
j=1 |j |
= |1 | = (A) x Cn
Similar methods to those used in the proof of the previous result yield the following theorem.
The matrix 2-norm has the special property that it is invariant under multplication by a
unitary matrix. This is the analog of Theorem 1.9 for vector norms.
Theorem 1.33. For all matrices A Cmn and unitary matrices U Cmm , V Cnn
Proof. The rst result follows from the previous theorem, after noting that (U A) (U A) =
A U U A = A A. Because (AV ) (AV ) = V (A A)V and because V (A A)V is a similarity
transformation of A A the second result also follows from the previous theorem.
Let A, B Cmn . In the following it will be useful to employ the notation |A| to denote the
matrix with entries (|A|)ij = |aij | and the notation |A| |B| as shorthand for |aij | |bij | for
all i, j .
Lemma 1.34. If two matrices A, B Cmn satisfy |A| |B| then kAk kBk and kAk1
kBk1 . Furthermore k|AB|k k|A||B|k .
Proof. For the rst two observations, it suces to prove the rst result since kAk1 = kA k and
|A| |B| implies that |A | |B |. The rst result is a direct consequence of the representation
of the -norm and 1-norm from theorems 1.28 and 1.29. To prove the last result note that
X
(|AB|)ij = | Aik Bkj |
k
X
|Aik ||Bkj |
k
= (|A||B|)ij .
Denition 1.36. The outer product of two vectors a, b Cn is the matrix a b Cnn dened
by
S = {x Cn | hx, yi = 0 y S}.
The orthogonal projection onto S, P , can be dened as follows: let {yi }ki=1 be an orthonormal
basis for S, then
k
X k
X k
X
Px = hyj , xiyj = yj yj x = yj yj x.
j=1 j=1 j=1
n
X
x= hyj , xiyj ,
j=1
and so
k
X
Px = hyj , xiyj ,
j=1
found by truncating to k terms. Clearly truncating again leaves the expression unchanged:
k
X
P 2x = hyj , xiyj = P x, x Cn .
j=1
Pn
Now (I P )x = P x = j=k+1 hyj , xiyj , proving the second result.
diagonal if aij = 0 i 6= j
(strictly) upper-triangular if aij = 0 i > j ()
(strictly) lower-triangular if aij = 0 i < j ()
upper Hessenberg if aij = 0 i > j + 1
upper bidiagonal if aij = 0 i > j & i < j 1
tridiagonal if aij = 0 i > j + 1 & i < j 1
Denition 1.40. A matrix P Rnn is called a permutation matrix if every row and every
column contains n1 zeros and 1 one.
20 CHAPTER 1. VECTOR AND MATRIX ANALYSIS
Remarks.
1. If P is a permutation matrix, then we have
n
X
(P T P )ij = pki pkj = ij
k=1
(
1 if j = (i) and
pij =
0 else
4. We get
n
X
(P A)ij = pik akj = a(i),j
k=1
for all i, j {1, . . . , n}. This shows that multiplying a permutation matrix from the left
reorders the rows of A. Furthermore we have
n
X
(AP )ij = aik pkj = ai,1 (j)
k=1
and hence multiplying a permutation matrix from the right reorders the columns of A.
Bibliography
Excellent treatments of matrix analysis may be found in [Bha97] and [HJ85]. More advanced
treatment of the subject includes [Lax97]. Theorem 1.16 and Lemma 1.17 are proved in [Ner71].
The proof of Theorem 1.24 may be found in [Rob01]. Theorem 1.32 is proved in the solutions
for instructors.
Exercises
Exercise 1-1.
Show that the following relations hold for all x Cn :
a) kxk2 kxk1 nkxk2 ,
b) kxk kxk2 nkxk and
c) kxk kxk1 nkxk .
Exercise 1-2. Prove that, for Hermitian positive denite A, equations (1.4) and (1.5) dene an
inner-product and norm, respectively.
kAkmax kAkF mnkAkmax .
1.5. STRUCTURED MATRICES 21
Exercise 1-5. Prove that the norm k kB 0 appearing in Denition 1.22 is indeed a norm.
Exercise 1-7. Show that kAkmax = maxi,j |aij | for all A Cnn denes a vector norm on the
space of n n-matrices, but not a matrix norm.
Exercise 1-9. Show that kAk22 = (AT A) for every matrix A Rnn (this is the real version
of Theorem 1.32).
Exercise 1-10. For A Rnn recall the denition of |A|, namely (|A|)ij = |aij |. Show that
k|A|k = kAk
holds in the Frobenius, innity and 1-norms. Is the result true in the Euclidean norm? Justify
your assertion.
Exercise 1-11. Let kk be an operator norm. Prove that if kXk < 1, then
I X is invertible,
P
the series i=0 Xi converges, and
P
(I X)1 = i=0 X i.
Exercise 1-12. Let K be a matrix in Rnn with non-negative entries and let f, g be two vectors
n
in R with strictly positive entries which satisfy
Exercise 1-14. Prove that the Frobenius norm of a matrix is unchanged by multplication by
unitary matrices. This is an analogue of Theorem 1.33.
Exercise 1-15. Show that for every vector norm kk on Rnn there is a number >0 such
that
kAk = kAk A Rnn
denes a matrix norm.
22 CHAPTER 1. VECTOR AND MATRIX ANALYSIS
Chapter 2
Matrix Factorisations
In this chapter we present various matrix factorisations. These are of interest in their own right,
and also because they form the basis for many useful numerical algorithms.
There are two groups of results presented in this chapter. The rst kind of results factorises
a matrix A Cnn as
1
A = S AS
where, typically, A is of a simpler form than the original matrix A. Results of this type are
matrix diagonalisations and the Jordan canonical form. These factorisations are useful, because
properties of A are often easier to understand than properties of A and often questions about A
can be reduced to questions about A. For example, since 1 x) = (S 1 x),
Ax = x implies A(S
the matrices A and A have the same eigenvalues (but dierent eigenvectors). These factorisations
are typically used in proofs.
The second group of results, including the QR factorisation and LU factorisation, just splits
a matrix A into two, simpler parts:
A = BC
where B and C are matrices of a simpler form than A is, for example triangular matrices. These
factorisations typically form the basis of numerical algorithms, because they allow to split one
complicated problem into two simpler ones. This strategy will be used extensively in the later
chapters of this text.
2.1 Diagonalisation
A matrix A Cnn is diagonalised by nding a unitary matrix Q Cnn and a diagonal matrix
nn
DC such that
A = QDQ .
Since this implies AQ = DQ we see, by considering the individual columns of this matrix
equation, that the diagonal elements of D are the eigenvalues of A and the (orthonormal) columns
of Q are the corresponding eigenvectors. This insight has several consequences: Firstly, a matrix
can be diagonalised if and only if it has a complete, orthonormal system of eigenvectors. And,
secondly, there can be no direct algoritms to diagonalise a matrix, since the eigenvalues of a
matrix in general cannot be found exactly in nite time (see the discussion around Theorem 8.2).
Thus, diagonalisation will be mostly useful as a tool in our proofs and not as part of an algorithm.
The basic result in this section is the Schur triangularisation of a matrix; diagonalisation will
follow from this. The next lemma is key in proving the Schur factorisation.
Lemma 2.1. For all A Cnn satisfying dim(range(A)) = k n, there is an orthonormal set
{y1 . . . , yk } range(A) with the property that Ayl span{y1 , . . . , yl } for l = 1, . . . , k .
23
24 CHAPTER 2. MATRIX FACTORISATIONS
For induction assume that we have the result for some k < n. Let A satisfy dim(range(A)) =
k+1. y1 to be an eigenvector of A with ky1 k2 = 1. Let P denote the orthogonal projection
Choose
onto span{y1 } and P = I P . Dene A = P A and note that dim(range(A )) = k . By the
inductive hypothesis, there is an orthonormal set {y2 , . . . , yk+1 } range(A ) with the property
A yl span{y2 , . . . , yl } l = 2, . . . , k + 1.
Furthermore we have that y1 is orthogonal to span{y2 , . . . , yk }. Consider the set {y1 , . . . , yk+1 }.
Note that Ay1 = y1 . Also
Ayl span{y1 , . . . , yl },
as required.
Theorem 2.2 (Schur Factorisation). For any A Cnn , there is a unitary Q Cnn and an
upper triangular T Cnn such that
A = QT Q .
Proof. Let k = dim(range(A)), and construct orthonormal vectors {y1 , . . . , yk } as in Lemma 2.1.
dim range(A) = n k , we can nd an orthonormal basis {yk+1 , . . . , yn } of range(A) .
Since
Then {yq , . . . , yn } is an orthonormal basis of Cn and
l
X
Ayl = tjl yj , l = 1, . . . , n.
j=1
Q = y1 yn
Letting and dening T by (T )ij = tij for ij and (T )ij = 0 for i>j we obtain
AQ = QT as required.
A = QT Q ,
so that
X i
X
(T T )ii = |tki |2 = |tki |2 .
k k=1
Similarly,
n
X
(T T )ii = |tik |2 .
k=i
2.1. DIAGONALISATION 25
We now prove that T is diagonal by equating these expressions and using induction.
n
X
i=1: |t11 |2 = |t1k |2 ,
k=1
and so t1k = 0 k = 2, . . . , n.
for
Assume for induction in m that tlk = 0 for l = 1, . . . , m 1 and all k 6= l. Note that we have
proved this for m = 2. Then
m
X
(T T )mm = |tkm |2 = |tmm |2 (by induction hyp.)
k=1
Xn n
X
(T T )mm = |tmk |2 = |tmm |2 + |tmk |2 ,
k=m k=m+1
tmk = 0 k 6= m
and tlk = 0 l = 1, . . . , m, k 6= l,
and the induction is complete.
Remark. In the situation of the preceeding theorem, the diagonal elements of D are the
eigenvalues of A and the columns of Q are corresponding eigenvectors. Since Q is unitary, the
eigenvectors are orthogonal and thus the matrix A is normal. When combined with the discussion
after denition 1.20 this shows that a matrix A is normal if and only if A A = AA .
AQ = Q,
and hence, if q1 , . . . , q n are the columns of Q, we get Aqi = i qi and kqi k = 1. This implies
Lemma 2.5. Let A Cnn be Hermitian and positive denite. Then there is a Hermitian,
positive denite matrix A1/2 Cnn , the square root of A, such that A = A1/2 A1/2 .
Proof. Since A Cnn is positive-denite, we have
i kxi k22 = hxi , Axi i > 0
for all eigenpairs (xi , i ) and thus all eigenvalues are positive.
By Theorem 2.4 we have
A = QQ
with = diag(1 , . . . , n ). Since all i 0, we may dene 1/2 = diag( 1 , . . . , n ) and this
is real. Now dene
A1/2 = Q1/2 Q . (2.1)
1/2 1/2 1/2 1/2
Then A A = Q Q = QQ = A as required and, since i > 0 for all i = 1, . . . , n,
the matrix A1/2 is Hermitian, positive denite.
26 CHAPTER 2. MATRIX FACTORISATIONS
Remarks.
A real, positive number has two distinct square roots, and . Similarly, a Hermi-
nn n
tian, positive denite matrix A C has 2 distinct square roots, obtained by choosing
1/2
all possible combination of signs in front of the square roots on the diagonal of in (2.1).
The square root constructed in the lemma is the only positive one.
The same principle used to construct the square root of a matrix here, can be used to
construct many dierent functions of a Hermitian matrix: one diagonalises the matrix and
applies the function to the eigenvalues on the diagonal.
Denition 2.6. A Jordan block Jn () Cnn for C is the matrix satisfying Jk ()ii = ,
Jk ()i,i+1 = 1, and Jk ()ij = 0 i, j = 1, . . . , n, i.e. a matrix of the form
else, for
1
.. ..
. .
Jk () = .
..
. 1
A Jordan matrix is a block diagonal matrix J Cnn of the form
Jn1 (1 )
Jn2 (2 )
J =
..
.
Jnk (k )
Pk
where j=1 nj = n.
The following factorisation is of central theoretical importance.
Theorem 2.7 (Jordan Canonical Form). For any A Cnn there is an invertible S Cnn
and a Jordan matrix J Cnn satisfying
A = SJS 1
where the diagonal elements 1 , . . . , k of the Jordan blocks are the eigenvalues of A.
Remarks.
1. Clearly both the normal and Hermitian diagonalisation results reveal the eigenvalues of A:
they are simply the diagonal entries of D and . This is also true of the Jordan and Schur
factorisations. The following lemma shows that triangular matrices reveal their eigenvalues
as diagonal entries. Since both the Jordan Canonical Form and the Schur Factorisation pro-
vide similarity transformations of A which reduce it to triangular form, and since similarity
transformations leave the eigenvalues unchanged, this establishes the desired properties.
Thus all the preceding factorisations are eigenvalue revealing factorisations.
2. An eigenvalue revealing factorisation cannot be achieved in a nite number of arithmetic
steps, in dimension n 5, since it implies factorisation of a polynomial equation of degree
n. See Chapter 8.
b a C, b, 0 Cj1 ,
a
Tj = ,
0 Tj1 Tj1 C(j1)(j1) upper triangular.
n
Y
det(T ) = Tii .
i=1
Eigenvalues of T are such that det(T I) = 0. Now T I is triangular with diagonal entries
Tii , therefore
n
Y
det(T I) = (Tii ).
i=1
Denition 2.9. A -Jordan block Jn () Cnn for C is the matrix satisfying Jk ()ii = ,
Jk ()i,i+1 = , and Jk ()ij = 0 else, for i, j = 1, . . . , n. A -Jordan matrix is a block diagonal
matrix J Cnn of the form
Jn1 (1 )
Jn 2 (2 )
J =
..
.
Jn k (k )
Pk
where j=1 nj = n.
Lemma 2.10. Let A Cnn and > 0. Then there is a vector norm k kS on Cn such that
the induced matrix norm satises (A) kAkS (A) + .
Proof. From Theorem 1.30 we already know (A) kAk for every matrix norm k k. Thus we
only have to show the second inequality of the claim.
Let J = S 1 AS be the Jordan Canonical Form of A and D = diag(1, , 2 , . . . , n1 ). Then
(SD )1 A(SD ) = D1 JD = J .
kxkS =
(SD )1 x
kAxkS
kAkS = max
x6=0 kxkS
k(SD )1 Axk
= max
x6=0 k(SD )1 xk
Since we know the k k -matrix norm from Theorem 1.28 and we have calculated the explicit
form of the matrix (SD )1 A(SD ) above, this is easy to evaluate. We get kAk maxi |i |+ =
(A) + . This completes the proof.
28 CHAPTER 2. MATRIX FACTORISATIONS
Remark. In general, ( ) is not a norm. But note that if the Jordan matrix J is diagonal then
=0 and we can deduce the existence of a norm in which kAkS = (A). This situation arises
whenever A is diagonalisable.
The singular value decomposition is based on the fact that, for any matrix A, it is possible to
nd a set of real positive i and vectors ui , vi such that
Avi = i ui .
The i are known as singular values and, in some applications, are more useful than eigenvalues.
This is because the singular values exist even for non-square matrices, because they are always
real, and because the {ui } and {vi } always can be chosen orthogonal. Furthermore, the singular
value decomposition is robust to perturbations, unlike the Jordan canonical form.
A = U V
A = U V
Theorem 2.12 (SVD). Every matrix has a singular value decomposition and the singular values
are uniquely determined.
Proof. A Cmn . We prove existence of the SVD by induction over p = min(m, n). If p = 0
Let
the matrices U , V , and are just the appropriately shaped empty matrices (one dimension is
zero) and there is nothing to show.
Assume p>0 and that the existence of the SVD is already known for matrices where one
dimension is smaller than 1 = kAk2 = maxx6=0 kAxk
min(m, n). Let
kxk2 = maxkxk2 =1 kAxk2 .
2
n
Since the map v 7 Av is continuous and the set { x | kxk2 = 1 } C is compact, the image
{ Ax | kxk2 = 1 } C is also compact. Since k k2 : C R is continuous there is a v1 Cn
m n
U1 = (u1 , . . . , um ) Cmm
and
V1 = (v1 , . . . , vn ) Cnn .
Then the product U1 AV1 is of the form
w
1
S = U1 AV1 =
0 B
2.3. SINGULAR VALUE DECOMPOSITION 29
+ w w
2
1
1
1/2
=
1
12 + w w = 12 + w w
w 2 Bw w 2
S
2
and thus kSk2 (12 + w w)1/2 . Thus we conclude that w = 0 and thus
1 0
A = U1 SV1 = U1 V1 .
0 B
B = U2 2 V2 .
Then
1 0 1 0 1 0
A = U1 V1
0 U2 0 2 0 V2
is a SVD of A and existence of the SVD is proved.
Uniqueness of the largest singular value 1 holds, since 1 is uniquely determined by the
relation
kU V xk2 kxk2
kAk2 = max = max = 1 .
x6=0 kxk2 x6=0 kxk2
The penultimate line of the proof shows that, with the ordering of singular values as dened,
we have the following:
A = U V
4. From the proof we see that we can get the k k2 -norm of a matrix from its SVD: we have
kAk2 = 1 .
30 CHAPTER 2. MATRIX FACTORISATIONS
2. The eigenvalues of A A are i2 and the eigenvectors of A A are the right singular vectors
vi .
V
A=U = A A = V
2 V Rnn
3. We have
A = U V = AA = U U
where Rm(mn)
U is any matrix such that [U ] Rmm
U is orthogonal. The result
then follows since
2
0
= .
0 0
For the rest of this section let A Cmn be a matrix with singular value decomposition
A = U V and singular values 1 r > 0 = = 0. To illustrate the usefulness of the
SVD we prove several fundamental results about it.
This shows
We also have
ker(A) = span{vr+1 , . . . , vn } Cn .
2.4. QR FACTORISATION 31
Theorem 2.17 (The SVD and Eigenvalues). Let A Rnn be invertible with SVD A = U V T .
If
AT
0
H= R2n2n
A 0
and
U = (u1 un )
V = (v1 vn )
= diag(1 , . . . , n )
then H has
2n eigenvalues {i }ni=1
vi
eigenvectors
1
i = 1, . . . , n
2 ui
AT z = y
Ay = z.
Hence
AT (z) = 2 y
and AT Ay = 2 y.
Thus 2 {12 , . . . , n2 } and so the 2n eigenvalues of H are drawn from the set
{1 , . . . , n }.
Note that
AT U = V , AV = U
and so
Avi = i ui , AT ui = i vi .
Ay = z, AT z = y.
1
(y T , z T ) = (vi , ui ).
2
2.4 QR Factorisation
The SVD factorisation, like the four preceding it, reveals eigenvalues; hence it cannot be achieved
in a nite number of steps. The next three factorisations, QR, LU and Cholesky do not reveal
eigenvalues and, as we will show in later chapters, can be achieved in a polynomial number
of operations, with respect to dimension n. Recall the following classical algorithm for the
construction of an orthonormal basis from the columns of a matrix A.
32 CHAPTER 2. MATRIX FACTORISATIONS
1: R = 0
2: for j=1,. . . ,nPdo
j1
3: qj = aj k=1 rkj qk with rkj = hqk , aj i
4: rjj = k qj k2
5: if rjj > 0 then
6: qj = qj /rjj
7: else
8: let qj be an arbitrary normalised vector orthogonal to q1 , . . . , qj1
9: end if
10: end for
11: choose qn+1 , . . . , qm to make q1 , . . . , qm an orthonormal basis.
12: let q1 , . . . , qm Cm be the columns of Q; let (R)ij = rij , i j , (R)ij = 0 otherwise.
m A = Q R
Theorem 2.18 (QR factorisation). Every matrix A Cmn with m n can be written as
A = QR where Q Cmm is unitary and R Cmn is upper triangular.
Proof. The Gram-Schmidt algorithm calculates matrices Q and R with
j
X j1
X
(QR)ij = qk rkj = qk rkj + qj = (aj )i
i i
k=1 k=1
Thus induction shows that the columns of Q are orthonormal and hence that Q is unitary.
Remarks.
1. The factorisation in the theorem is called full QR factorisation. Since all entries below the
diagonal of R are 0, the columns n + 1, . . . , m of Q do not contribute to the product QR.
2.5. LU FACTORISATION 33
m A = Q R
mn
n n mn
2.5 LU Factorisation
Denition 2.19. A triangular matrix is said to be unit if all diagonal entries are equal to 1.
Denition 2.20. The j th principal sub-matrix of a matrix A Cnn is the matrix Aj Cjj
with (Aj )kl = akl for 1 k, l j .
Theorem 2.21 (LU Factorisation). a) Let A Cnn be a matrix such that Aj is invertible
for j = 1, . . . , n. Then there is a unique factorisation A = LU where L Cnn is unit lower
triangular and U Cnn is non-singular upper triangular. b) If Aj is singular for one j
{1, . . . , n} then there is no such factorisation.
The following picture gives a graphical representation of the LU factorisation.
A = L U
where An1 is the (n 1)th principal sub-matrix of A, and b, c C(n1) and ann C are the
remaining blocks. We are looking for a factorisation of the form
L 0 U u LU Lu
A= = (2.3)
` 1 0 ` U ` u +
34 CHAPTER 2. MATRIX FACTORISATIONS
To illustrate the failure of LU factorisation when the principal submatrices are not invertible,
consider the following matrix:
0 0 1
A = 1 1 0
0 2 1
This matrix is clearly non-singular: det(A) = 2. However, both principal sub-matrices are
singular:
A1 = (0)
0 0
A2 =
1 1
1 1 0
A0 = 0 2 1
0 0 1
A01 = (1)
0 1 1
A2 =
0 2
Because a non-singular matrix may not have an LU factorisation, while that same matrix
with its rows interchanged may be factorisable, it is of interest to study the eect of permutations
on LU factorisation.
Theorem 2.22 (LU Factorisation with Permutations). If A Cnn is invertible, then there
exists a permutation matrix P Cnn , unit lower triangular L Cnn , and non-singular upper
triangular U Cnn such that P A = LU .
Proof. We prove the result by induction. Note that the base case n = 1 is straightforward:
P = L = 1 and U = A 6= 0. Now assume the theorem is true for the (n 1) (n 1) case. Let
A Cnn be invertible. Choose a permutation P1 such that
11 := a 6= 0.
(P1 A)
2.6. CHOLESKY FACTORISATION 35
This is possible since A invertible implies that the rst column of A has a non-zero entry and
P1 permutes rows. Now as follows:
we can factorise P1 A
a u u
1 0 a
P1 A := =
l B l/a I 0 A
= det(P1 A)
0 6= det(A) = a det(A)
a u
1 0 1 0
P1 A =
l/a I 0 P2 L 0 U
1 0 a u
=
l/a P2 L 0 U
a u
1 0 1 0
=
0 P2 P2T l/a L 0 U
= P3 LU ,
and therefore
P3T P1 A = L
U .
Note that
L is unit lower triangular and that det(U ) = a det(U ) 6= 0 so that
U is non-singular
upper trangular. Since P = P3T P1 is a permutation the result follows.
Theorem 2.23 (Cholesky Factorisation). If A Cnn is positive denite then there exists a
unique upper triangular R Cnn with positive diagonal elements such that A = R R.
A = R R
Proof. We use induction. The claim is clearly true if n = 1: A = R+ , R = .
Assume the claim holds true for An1 C(n1)(n1) :
An1 = Rn1 Rn1 , (Rn1 )ii > 0 i = 1, . . . , n 1.
Rn1 r=c
2 = krk22 .
Since Rn1 is non-singular (positive diagonal elements), r and are uniquely dened.
It remains to show R+ . Note that, since A has positive eigenvalues det(A) > 0 and so
can choose R+ .
Bibliography
The books [Bha97], [HJ85] and [Ner71] all present matrix factorisations in a theoretical context.
The books [Dem97], [GL96] and [TB97] all present related material, focussed on applications to
computational algorithms. Theorem 2.7 is proved in [Ner71].
Exercises
Exercise 2-1. By following the proof of the existence of a singular value decomposition, nd
an SVD for the following matrices:
2 0 1
1 2
A= , B= 2 0 1 .
3 1
0 18 0
Exercise 2-2. Show that a symmetric positive denite matrix A Rnn can be written in the
form
zT 1 z T
a 0 1 0
A= = 1 .
z A1 z I 0 A1 a1 zz T 0 I
where = a. Based on this observation nd an alternative proof of Theorem 2.23, for the
Cholesky factorisation of real symmetric positive denite matrices.
Exercise 2-3. Let A Rmn have singular values ( i | i = 1, . . . , n ). Show that kAk2 = max
and, if m=n and A is invertible, kA1 k1
2 = min .
p q p
for some G: C C C . Here y is the solution that we want to nd, and the data which
denes the problem. For (SLE) and (LSQ) the problem is linear and y = x (see below for a
proof of this for (LSQ)) and for (EVP) it is nonlinear and y = (x, ). The parameter is thus
dened by the matrix A and, for (SLE) and (LSQ), the matrix A and the vector b. Backward
error analysis will enable us to show that the computed solution y
solves
G(
y , ) = 0.
A problem is called well conditioned if small changes in the problem only lead to small changes
in the solution and badly conditioned if small changes in the problem can lead to large changes
in the solution. In this context the following denition is central.
Denition 3.2. The condition number (A) of a matrix A Cnn in the norm kk is the
number (
kAk kA1 k if A is invertible and
(A) =
+ else.
37
38 CHAPTER 3. STABILITY AND CONDITIONING
For this section x a vector norm kk and an induced matrix norm k k. By Theorem 1.27
these norms then satisfy
Remark. We always have kIk = kAA1 k kAkkA1 k = (A). For induced matrix norms this
implies (A) 1 for every A Cnn .
Then we have kAk2 = max . Since the matrix A1 has eigenvalues 1/1 , . . . , 1/n we nd
1 1
kA k2 = min and thus the condition number of A in the 2-norm is
max
(A) = . (3.2)
min
Since A1 b = x we get
The previous proposition gave an upper bound on how much the solution of the equation
Ax = b can change if the right hand side is perturbed. The result shows that the problem is well
conditioned if the condition number (A) is small. Proposition 3.5 below gives a similar result
for perturbation of the matrix A instead of the vector b. For the proof we will need the following
lemma.
Lemma 3.4. Assume that A Cnn satises kAk < 1 in some induced matrix norm. Then
I + A is invertible and
k(I + A)1 k (1 kAk)1 .
Proof. With the triangle inequality we get
and thus
k(I + A)xk 1 kAk kxk
for every x Cn . Since this implies (I + A)x 6= 0 for every x 6= 0, and thus the matrix I +A is
invertible.
Now let b 6= 0 and x = (I + A)1 b. Then
k(I + A)1 bk 1
k(I + A)1 k = sup .
b6=0 kbk 1 kAk
Assume that A is invertible with condition number (A) in some induced matrix norm k k.
Then we have, for (A)kAk < kAk,
kxk (A) kAk
kAk
.
kxk 1 (A) kAk kAk
x = (I + A1 A)1 A1 A x
and we get
kA1 Ak
kxk k(I + A1 A)1 kkA1 Akkxk kxk.
1 kA1 Ak
Using (3.4) and the fact that the map x 7 x/(1 x) is increasing on the interval [0, 1) we
get
Refer to Exercise 3-2 for a result which combines the statements of propositions 3.3 and 3.5.
In this section we study the conditioning of the following problem: given A Cmn with m n,
m n
rank(A) = n and bC , nd xC kAx bk2 .
which minimizes
For expository purposes consider the case where A and b are real. Notice that then x solving
n
(LSQ) minimizes : R R given by
1 1
(x) := hAx, Axi hAx, bi + kbk22 .
2 2
This is equivalent to minimizing
1 1
hx, A Axi hx, A bi + kbk22 .
2 2
40 CHAPTER 3. STABILITY AND CONDITIONING
(Note that A = AT in this real case). This quadratic form is positive-denite since A A is
Hermitian and has positive eigenvalues under (3.5). The quadratic form is minimized where the
gradient is zero, namely where
A Ax = A b.
Although we have derived this in the case of real A, b, the nal result holds as stated in the
complex case; see Chapter 7.
Consider the reduced SVD A = U V
where U Cmn ,
Rnn with ii = i , and
nn
V C . The assumption on rank(A) implies, by Theorem 2.15,
1 2 n > 0. (3.5)
Denition 3.6. For A Cmn , the matrix A = (A A)1 A Cnm is called the pseudo-
inverse (or Moore-Penrose inverse) of A.
With this notation the solution of (LSQ) is
x = A b.
Denition 3.7. For A Cmn with m n and n positive singular values, dene the condition
number of A in the norm k k to be
(
kAk kA k if rank(A) = n and
(A) =
+ else.
Remark. Since for square, invertible matrices A we have A = A1 , the denition of (A) is
consistent with the previous one for square matrices. As before the condition number depends
on the chosen matrix norm.
Lemma 3.8. Let A Cmn with m n have n positive singular values satisfying (3.5) . Then
the condition number in the k k2 -norm is
1
(A) = .
n
Proof. Let V
A=U be a reduced SVD of A. Then 2V
A A = V and thus
1 U
A = V . (3.6)
This equation is a reduced SVD for A (modulo ordering of the singular values) and it can be
extended to a full SVD by adding m n orthogonal columns to
U and zeros to 1 .
Doing this
we nd, by Corollary 2.13,
1
(A) = kAk2 kA k2 = 1 .
n
Theorem 3.9. Assume that x solves (LSQ) for (A, b) and x + x solves (LSQ) for (A, b + b).
Dene = kAk2 kxk2 /kAxk2 1. Then we have
kxk2 (A) kbk2
kxk2 cos() kbk2
Remark. The constant (A)/ cos() becomes large if either (A) is large or /2. In either
of these cases the problem is badly conditioned. The rst case involves only the singular values
of A. The second case involves a relationship between A and b. In particular cos() is small if
the range of A is nearly orthogonal to b. Then
1 1
(x) kAxk22 + kbk22
2 2
so that simply setting x=0 gives a good approximation to the minimizer; in this case small
changes in b can induce large relative changes in x.
Theorem 3.10. Let and as above. Assume that x solves (LSQ) for (A, b) and x + x solves
(LSQ) for (A + A, b). Then
kxk2 (A)2 tan() kAk2
(A) +
kxk2 kAk2
Theorem 3.11 (Conditioning of LSQ). Assume that x solves (LSQ) for (A, b) and x + x
solves (LSQ) for (A + A, b + b). Let r = b Ax and
kAk kbk
2 2
= max , .
kAk2 kbk2
Then
kxk2 (A) krk2
2 + ((A) + 1)
kxk2 1 (A) cos()kbk2
where (A) is the condition number of A in k k2 -norm.
In this section we will discuss the conditioning of the eigenvalue problem (EVP), i.e. we will
discuss how much the eigenvalues of a matrix can change, if the matrix is changed by a small
amount. A preliminiary result is given in the following theorem: the eigenvalues change con-
tinuously when the matrix is changed. A more detailed analysis is presented in the rest of the
section.
42 CHAPTER 3. STABILITY AND CONDITIONING
Theorem 3.12 (Continuity of Eigenvalues). Let (A) denote the vector of eigenvalues of the
matrix A Cnn , ordered in decreasing absolute value, and repeated according to algebraic
multiplicities. Then : Cnn Cn is continuous.
Proof. This follows since the eigenvalues are the roots of the characteristic polynomial of A; this
has coecients continuous in A.
Before discussing the conditioning of eigenvalue problems we make a brief diversion to state
a version of the implicit function theorem (IFT) which we will then use.
Theorem 3.13 (IFT). Let G : Cl Cm Cm be two times dierentiable and assume that
G(x, y) = 0 for some (x, y) Cl Cm . If Dy G(x, y) is invertible then, for all suciently small
x, there is a unique solution y of the equation
G(x + x, y + y) = 0
Denition 3.14. For A Cnn , the eigenvalue condition number for an eigenvalue C of a
nn
matrix AC is (
|hx, yi|1 , if hx, yi =
6 0 and
A () =
, else,
where x and y are the normalised right and left eigenvectors for eigenvalue .
Ax x
G(A, x, ) = 1 2 1 .
2 kxk2 2
G(A + A, x + x, + ) = 0
x
= O kAk2 .
DA G(A, x, ) A + D(x,) G(A, x, )
A I x
D(x,) G(A, x, ) = C(n+1)(n+1)
x 0
To show that D(x,) G(A, x, ) is invertible assume D(x,) G(A, x, ) (y, ) = 0 for y Cn and
C. It suces to show that this implies (y, ) = 0. From
(A I)y x = 0
x y = 0
we get
(A I)2 y = 0 and hx, yi = 0.
This implies (A I) has a two dimensional null-space, contradicting simplicity by Lemma 1.17,
unless y = 0. Hence it follows that y = 0, = 0, and so D(x,) G is invertible.
Finally,
A x A I x x
= + .
0 x 0
satises
kk2 + || = O kAk22 .
Corollary 3.16. Let be a simple eigenvalue of A corresponding to right and left normalized
eigenvectors x and y. Then for all suciently small A Cnn the matrix A + A has an
eigenvalue + with
|| A () kAk2 + O kAk22
.
The stability of an algorithm measures how susceptible the result is to rounding errors occurring
during the computation. Consider the general framework for all our problems, namely equation
(3.1), where y denotes the solution we wish to compute and the input data. Then we may
view a locally unique family of solutions to the problem as a mapping from the input data to
the solution: y = g(). For example, for (SLE) we have y = x, = (A, b) and g() = A1 b. We
assume that the algorithm returns the computed 6= y which can be represented as
result y the
exact image y = g(
) of a dierent input value .
Denition 3.17. The quantity y = y y is called the forward error and = is called a
backward error. If is not uniquely dened then we make the choice of which results in minimal
kk. The relative backward error is kk/kk and the relative forward error is kyk/kyk.
exact y
uted
comp y
exact y
input data
results
44 CHAPTER 3. STABILITY AND CONDITIONING
Assumption 3.19. There is a parameter m > 0 ( machine epsilon) such that the following
conditions hold.
fl(x) = x (1 + ).
x ~ y = (x y) (1 + )
Denition 3.20. An algorithm, is called backward stable if the relative backward error satises
kk
= O(m ).
kk
Remark. Some algorithms which are backward stable by the denition are termed unstable in
practice. This occurs when the constant in the O(m ) term depends badly on some measure of
the problem dimension for instance grows exponentially in n for the (SLE). For this reason,
rather than just stating a backward stability result, we will often try to identify a norm k k and
constant C(n) such that
kk
C(n)m .
kk
This, of course, establishes backward stability. But it gives more detailed information on exactly
when the backward error is small, as a function of problem size. Note, however, that statements
about the dependence of the constants in the backward error are norm-dependent. The denition
of backward stable itself is not.
The typical way to use backward stability is in a two-step procedure as follows. In the rst
step, backward error analysis, one shows that the algorithm in question is backward stable, i.e.
that the inuence of rounding errors can be represented as a small perturbation x of the original
problem. (This is a property of the algorithm used.) In the second step one uses results like
Theorem 3.18 about the conditioning of the problem (which does not depend on the algorithm
used) to show that the forward error is also small. Together these steps show that the calculated
result is close the the exact result.
We give a simple example of a backward stability calculation.
Proof. The exact result of a subtraction is given by g(x1 , x2 ) = x1 x2 , the computed result is
g(x1 , x2 ) = fl(x1 ) fl(x2 ). Using Assumption A1 we get
Remarks.
1. The above proof is a case where the x
from the denition of the backward error is not
uniquely determined. But since we are only interested in the x
which minimizes the
backward error, we can choose any x
which gives the result kxk2 O(m )kxk2 . The
optimal x
can only be better.
2. Similar proofs show that the operations , and are also backward stable.
3. Proofs of backward stability have to analyse the inuence of rounding errors and thus are
typically based on our Assumptions A1 and A2 about computer arithmetic. Since they
tend to be long and involved we omit most of these proofs in this book. We frequently
study the statements of backward stability results, however, as they give useful practical
insight into the ecacy of algorithms.
Bibliography
Conditioning for the basic problems of numerical linear algebra is a central part of the pre-
sentation in [Dem97]. Numerical stability is introduced, and thoroughly studied, in a range of
contexts arising in numerical linear algebra, in [Hig02]. Theorem 3.10 may be found in [Dem97]
and Theorem 3.11 in [Hig02]. The Implicit Function Theorem 3.13 is proved in [CH96].
Exercises
Exercise 3-1. Choose a matrix norm and calculate the condition number of the matrix
1
A=
1 1
in this norm.
Exercise 3-2. Assume that A is invertible with (A)kAk/kAk < 1 in some induced matrix
norm. Let x be a solution of Ax = b and let x be a solution of (A + A)x = b + b. Show that
k
x xk (A) kAk kbk
kAk
+ .
kxk 1 (A) kAk kbk
kAk
46 CHAPTER 3. STABILITY AND CONDITIONING
Exercise 3-5. Show that the calculated arithmetic operations , and are backward stable.
Chapter 4
Complexity of Algorithms
In this chapter we study methods which quantify how long it takes to solve a numerical problem
on a computer. Specically we focus on the question how the run time, as measured by the
number of algebraic operations, depends on some measure of the problem size.
The computational cost of an algorithm is the amount of resources it takes to perform this algo-
rithm on a computer. For simplicity here we just count the number of oating point operations
(additions, subtractions, multiplications, divisions) performed during one run of the algorithm,
together with a count of the number of square roots taken; on one occasion we also count the
number of comparisons between real numbers. A more detailed analysis would also take factors
like memory usage into account.
where n is the size of the input data (e.g. the number of equations etc.).
The following denition provides the notation we will use to describe the asyptotic compu-
tation cost of an algorithm, this is the behaviour of the cost C(n) for n . (In the case of
the notation O it is closely related to the notation used in Denition 3.1 from the last chapter,
except that there we consider the order of magnitude of small quantities. Here our concern with
cost leads to consideration of the order of magnitude of large quantities. However, the denitions
below can be adapted to the case x0 and we will sometimes use this without comment.)
g(x)
g(x) = O f (x) if lim sup < ,
x f (x)
g(x)
g(x) = f (x) if lim inf > 0,
x f (x)
g(x) = f (x) if g(x) = (f (x)) and g(x) = O(f (x)),
g(x)
g(x) f (x) if lim = 1.
x f (x)
Example.
From the denition we can see that the property g(x) f (x) implies g(x) = f (x) .
Example. Using this notation we can write 5n2 + 2n 3 5n2 = (n2 ), n2 = O(n3 ) and
2
n = (n).
47
48 CHAPTER 4. COMPLEXITY OF ALGORITHMS
1: s = x
1 y1
2: for i = 2, . . . , n do
3: s=s+x i yi
4: end for
5: return s
Theorem 4.3. The standard inner-product algorithm on Cn has computational cost C(n) =
(n). Any algorithm for the inner product has C(n) = (n).
Proof. The standard inner-product algorithm above has n multiplications and n 1 additions,
i.e. C(n) = 2n 1 = (n). Sketch of the proof for C(n) = (n): since each of the products xi yi
is independent of the others, we have to calculate all n of them.
Remark. Giving a real proof for the lower bound in the above theorem would require a detailed
model about what an algorithm actually is. One would for example need to be able to work in
a framework where guessing the result in just one operation and returning it is not a proper
algorithm. We avoid these diculties here by only giving sketches for lower bounds.
1: for i = 1, . . . , m do
2: let yi = hai , xi using the standard inner product algorithm
3: end for
4: return y
Theorem 4.4. The computational complexity of the standard method for Cnn -matrix-vector
multiplication satises C(n) = (n2 ). Any method has C(n) = (n).
Proof. See Exercise 4-2.
1: for i = 1, . . . , l do
2: for j = 1, . . . , n do
3: let cij = hai , bj i using the standard inner product algorithm
4: end for
5: end for
6: return C
Theorem 4.5. The standard method for Cnn -matrix-matrix multiplication satises C(n) =
(n3 ). Any method has C(n) = (n2 ).
Proof. We have to calculate n2 inner products. Thus the asymptotic computational cost is
2 3
C(n) = n (n) = (n ). Sketch for the lower bound: we have to calculate n2 entries of the
2
resulting matrix and thus C(n) n .
4.2. MATRIX-MATRIX MULTIPLICATION 49
In Theorem 4.5 there is a gap between the order (n3 ) of the standard method for multiplying
2
matrices and the lower bound (n ). The purpose of this section is to show that there are
actually algorithms with an asymptotic order which is better than (n3 ).
nn
For A, B C , n even and D = AB write
A11 A12 B11 B12 D11 D12
A= ,B = ,D =
A21 A22 B21 B22 D21 D22
The above method calculates the product of two n n-matrices using eight multiplications
of (n/2) (n/2)-matrices. Using this idea recursively leads to an algorithm for nn matrix
multplication which builds the answer from the multplication of small matrices. The cost then
satises
C(n) = 8C(n/2) + n2
with the n2 contribution coming from the 4 additions of (n/2) (n/2) matrices. Exercise 4-4
shows that this leads to a cost of (n3 ) and is thus no better than the standard algorithm.
There is, however, another way to calculate the entries of the matrix D, which is algebraically
more complicated, but, crucially, only uses seven multiplications of (n/2) (n/2)-matrices. It
will transpire that this fact can be utilised to get an asymptotically faster method of multiplying
matrices. Using
we can write
D11 = P1 + P4 P5 + P7
D12 = P3 + P5
D21 = P2 + P4
D22 = P1 + P3 P2 + P6 .
1: if n = 1 then
2: return AB
3: else
4: calculate P1 , . . . , P7 (using recursion)
5: calculate D11 , D12 , D21 and D22
6: return D
7: end if
50 CHAPTER 4. COMPLEXITY OF ALGORITHMS
Remark. Recursive algorithms of this kind are called divide and conquer algorithms.
n n
Using the Strassen-multiplication we can calculate D with 7 multiplications of 2 2 -matrices
n n
and 18 additions of
2 2 -matrices. Thus we nd
Lemma 4.6. The Strassen-multiplication has computational cost C(2k ) = 7 7k 6 4k for all
k N0 .
k
7k = 2log2 (7 )
= 2k log2 7 = (2k )log2 7 = nlog2 7
and
4k = (2k )2 = n2 .
Using the lemma we get
AB 012
AB =
021 022
Thus we can nd the product of the n n-matrices A and B by multiplying the 2k 2k -matrices
A and B
k
with the Strassen-algorithm. Since we have n 2 2n, the extended matrices are at
most double the size of the original ones, and because (2n) = (n ) for every > 0 the result
k log2 7
for n = 2 implies C(n) = (n ) as required.
Remark. By ideas similar to (but more involved than) those used by Strassen it is possible to
construct an algorithm which multiplies matrices in O(n2.376... ). It remains an open question
whether the exponent can be made arbitrarily close to 2.
4.2. MATRIX-MATRIX MULTIPLICATION 51
Often the rst method encountered for matrix inversion is Cramer's rule:
adj(A)
A1 = .
det(A)
Using the naive way to calculate det(A) takes (n!) operations, and so for Cramer's rule we
would get asymptotic computational cost
Although of theoretical importance, this is a totally inecient method to invert a matrix. The
next result indicates why.
A1
012
A1 =
021 I22
for every x 6= 0 and thus A A is positive denite and therefore invertible. We can write
A1 = (A A)1 A .
This allows us to invert A with cost C(n) = D(n) + O(n ) where D is the cost of inverting a
Hermitian, positive denite matrix and O(n ) is the cost for matrix-matrix multiplication. So
we can restrict ourselves to the case of Hermitian, positive denite matrices.
3) To determine the cost function D B be
let Hermitian and positive denite:
B11 B12
B=
B12 B22
n n 1
where the Bjk 2 2 -matrices. Let
are S = B22 B12 B11 B12 , known as the Schur complement.
A direct calculation shows that
1 1 1 1
B12 S 1 B12
B12 S 1
B11 + B11 B11 B11
B 1 = 1 .
S 1 B12
B11 S 1
where O(n ) is the cost for the multiplications and O(n2 ) is the cost for the additions and
subtractions.
From Theorem 4.5 we already know 2, so we can simplify the above estimate to
D(n) 2D(n/2) + cn
for some constant c > 0. With an induction argument (see Exercise 4-5 below) one can conclude
c
D(2k ) (2k )
1 21
and thus we get
D(2k ) = O (2k )
The previous section shows that the inner-product based algorithm for matrix-matrix multipli-
cation can be substantially improved by use of a recursive divide and conquer approach. We now
attempt a similar improvement of matrix-vector multiplication, over the inner-product based al-
gorithm. However, in contrast to the matrix-matrix case, the idea we present here works only for
a restricted class of matrices the discrete Fourier transform matrices. However other, similar,
ideas have also been developed for other discrete transforms. The technique we present here is
prototypical of a whole range of results available for matrix-vector multiplications of structured
matrices.
As an abbreviation write
n = e2i/n
and dene the vectors l Cn for l = 0, . . . , n 1 by
1
l = (e2il/n )0 , (e2il/n )1 , . . . , (e2il/n )n1
n
1
= 1, nl , n2l , . . . , n(n1)l .
n
Lemma 4.9. hl , m i = lm .
Proof.
n1
1 X (lm)j
hl , m i = .
n j=0 n
(lm)n
1 n 1
.
n n(lm) 1
(lm)n
But n = e2i(lm) = 1 implying hl , n i = 0.
n1
X n1
X
hm , ui = al hm , l i = al ml = am .
l=0 l=0
Thus am = hm , ui.
Writing this out, with n denoted by for simplicity, gives
a0
1 1 1 ... 1
a1 1 2 ... n1
..
. = 1 2 4 2(n1)
1 ...
u
.
.
n .. ..
.. . .
2
n1 2(n1)
an1 1 ... (n1)
| {z }
:=An
Thus to expand arbitrary u in the basis {k } we need to be able to perform matrix vector
multiplication by An Cnn .
Theorem 4.10 (Complexity of FFT Matrix-Vector Multiply). Let n = 2k . Matrix-vector mul-
tiplication by An Cnn can be performed in O(n log n) time.
Proof. We let
Fk = nAn with n = 2k .
Note that Fk1 Cn/2n/2 and is built from n/2 . We prove that multiplication by Fk can be
k
eected in O(k2 ) time, which gives the desired result.
Let
k
x = (x0 , x1 , . . . , xn1 )T C2
k1
xe = (x0 , x2 , . . . , xn2 )T C2
k1
xo = (x1 , x3 , . . . , xn1 )T C2 .
Knowledge of xo , xe gives x. Let
y = Fk x, y = ((y t ) , (y b ) ) .
Then, for j = 0, . . . , 2k 1,
k
2X 1
yj = 2jlk xl
l=0
2k1
X1 2k1
X1 j(2l+1)
= 22jl
k x2l + 2k x2l+1 .
l=0 l=0
k k1 j(2l+1)
But 22jl
k = e
2i(2jl)/2
= e2i(jl)/2 = 2jlk1 , and so 2k = 2j k 22jl j jl
k = 2k 2k1 . Hence,
2k1
X1 2k1
X1
yj = 2jlk1 xel + 2j k 2jlk1 xol .
l=0 l=0
jl (j+M )l j j+M
Now M = M and 2M = 2M , and so
2k1
X1 2k1
X1
yj = 2jlk1 xel + 2j k 2jlk1 xol , j = 0, . . . , 2k1 1
l=0 l=0
2k1
X1 2k1
X1
y2k1 +j = 2jlk1 xel 2j k 2jlk1 xol , j = 0, . . . , 2k1 1.
l=0 l=0
54 CHAPTER 4. COMPLEXITY OF ALGORITHMS
In other words,
$$y^t = F_{k-1}x^e + DF_{k-1}x^o = y^e + Dy^o, \qquad
y^b = F_{k-1}x^e - DF_{k-1}x^o = y^e - Dy^o,$$
where
$$D = \operatorname{diag}\{\omega_{2^k}^0, \omega_{2^k}^1, \dots, \omega_{2^k}^{2^{k-1}-1}\}.$$
Thus, if $y^e = F_{k-1}x^e$ and $y^o = F_{k-1}x^o$ are known, then $y$ is found through $O(2^k)$ multiplications and additions (since $D$ is diagonal). If $L_k = C(2^k)$, $C(n)$ being the cost of multiplication by $A_n$, then
$$L_k \le 2L_{k-1} + a2^k.$$
For induction, assume $L_k \le (ak + L_0)2^k$. For the inductive step $l-1 \mapsto l$ assume $L_{l-1} \le (a(l-1)+L_0)2^{l-1}$; then
$$L_l \le 2L_{l-1} + a2^l \le (a(l-1)+L_0)2^l + a2^l = (al + L_0)2^l,$$
which completes the induction. Hence $L_k = O(k\,2^k)$, i.e. multiplication by $A_n$ costs $O(n\log n)$.
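A Matlab sketch of the resulting divide and conquer multiplication by $F_k$; the function name fk_apply is ours and the code follows the even/odd splitting of the proof rather than any library FFT. Up to rounding, the result can be compared against n*ifft(x), which uses the same sign convention in the exponent.

function y = fk_apply(x)
% Multiply x in C^n, n = 2^k (n is assumed to be a power of two), by F_k
% (entries omega_n^{jl}, omega_n = exp(2*pi*1i/n)) using the even/odd
% splitting from the proof; cost O(n log n).
x = x(:);
n = length(x);
if n == 1
    y = x;                           % F_0 = (1)
    return
end
ye = fk_apply(x(1:2:end));           % F_{k-1} x^e
yo = fk_apply(x(2:2:end));           % F_{k-1} x^o
d  = exp(2*pi*1i*(0:n/2-1)'/n);      % diagonal of D: omega_n^j, j = 0,...,n/2-1
y  = [ye + d.*yo; ye - d.*yo];       % y^t and y^b
end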
We conclude with the statement of a pair of matrix factorization results. Their importance is
intricately bound up with the complexity of the algorithms to eect them. It is for this reason
that they appear in this chapter.
Lemma 4.11. For any $A \in \mathbb{C}^{n\times n}$ there exist unitary $U, V$ and upper bidiagonal $B$ such that $A = UBV^*$.
Lemma 4.12. For any $A \in \mathbb{C}^{n\times n}$ there exists unitary $Z$ and upper Hessenberg $T$ such that $A = ZTZ^*$.
Bibliography
Exercises
Exercise 4-1. Let (i) $f(n) = n^2(1 + \sin(n))$ and (ii) $f(n) = n + n^2$. In each case, which of the following are true:
$f(n) = O(1)$;
$f(n) = O(n)$;
$f(n) = O(n^2)$;
$f(n) = \Theta(1)$;
$f(n) = \Theta(n)$;
$f(n) = \Theta(n^2)$.
Exercise 4-2. Show that the standard method for matrix-vector multiplication has asymptotic computational cost $C(n) = \Theta(n^2)$.
Exercise 4-3. Show that the matrices $S$ and $B_{11}$ in the proof of Theorem 4.8 are invertible.
Exercise 4-4. Prove that if $C(n) = 8C(n/2) + \Theta(n^2)$ then $C(n) = \Theta(n^3)$. This shows that a divide and conquer approach to matrix multiplication which requires 8 matrix multiplications of the smaller sub-matrices cannot beat the complexity of standard matrix-matrix multiplication; in contrast, the Strassen method, using 7 multiplications, does.
Exercise 4-5. Show that if $D(n) \le 2D(n/2) + cn^\omega$ with $\omega \ge 2$, then
$$D(2^k) \le \frac{c}{1-2^{1-\omega}}\,(2^k)^\omega$$
for all $k \in \mathbb{N}_0$.
Exercise 4-6. Let $A \in \mathbb{R}^{n\times n}$ where $n = 2^k$. Noting that the LU factorisation of a matrix $A$ can be written as
$$A = LU = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}\begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix},$$
design a divide and conquer strategy which results in LU factorisation of $A$ in $O(n^a)$ operations, where $O(n^a)$ is the cost of matrix multiplication.
Chapter 5
Systems of Linear Equations
Gaussian elimination is the most commonly used method to solve systems of linear equations. The method is based on a systematisation of the elimination procedure used to solve simultaneous linear equations by hand. The mathematical background of the algorithm is the LU factorisation described in Theorems 2.21 and 2.22.
The Method
Under the conditions of Theorem 2.21 the matrix A can be converted into upper triangular shape
by multiplying lower triangular matrices from the left. We introduce the subsequent theoretical
developments with an example.
Example. For
$$A = \begin{pmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{pmatrix},\qquad
L_1 = \begin{pmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ -4 & 0 & 1 \end{pmatrix},\qquad
L_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -3 & 1 \end{pmatrix}$$
we find
$$L_2 L_1 A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -3 & 1 \end{pmatrix}
\begin{pmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 3 & 5 \end{pmatrix}
= \begin{pmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 2 \end{pmatrix} =: U.$$
We have found lower triangular matrices $L_1$ and $L_2$ and an upper triangular matrix $U$ with $A = (L_2L_1)^{-1}U$. It turns out that $L_2L_1$ is unit lower triangular with an inverse that is easily calculable. Hence we have found an LU factorisation of $A$ during the process of elimination.
The ideas in this example generalise to arbitrary dimension. The following lemma helps to calculate $(L_k \cdots L_2 L_1)^{-1} = L_1^{-1}L_2^{-1}\cdots L_k^{-1}$ in general.
Lemma 5.1. a) Let $L = (\ell_{ij})$ be unit lower triangular with non-zero entries below the diagonal only in column $k$. Then $L^{-1}$ is also unit lower triangular with non-zero entries below the diagonal only in column $k$ and we have $(L^{-1})_{ik} = -\ell_{ik}$ for all $i > k$.
b) Let A = (aij ) and B = (bij ) be unit lower triangular n n matrices where A has non-
zero entries below the diagonal only in columns 1, . . . , k and B has non-zero entries below the
diagonal only in columns k + 1, . . . , n. Then AB is unit lower triangular with (AB)ij = aij for
j {1, . . . , k} and (AB)ij = bij for j {k + 1, . . . , n}.
Proof. a) Multiplying L with the suggested inverse gives the identity. b) Direct calculation.
Example. For the matrices $L_1$ and $L_2$ from the previous example we get
$$(L_2L_1)^{-1} = L_1^{-1}L_2^{-1}
= \begin{pmatrix} 1 & & \\ 2 & 1 & \\ 4 & & 1 \end{pmatrix}
\begin{pmatrix} 1 & & \\ & 1 & \\ & 3 & 1 \end{pmatrix}
= \begin{pmatrix} 1 & & \\ 2 & 1 & \\ 4 & 3 & 1 \end{pmatrix}.$$
Thus we have found the LU factorisation
$$\begin{pmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{pmatrix}
= \begin{pmatrix} 1 & & \\ 2 & 1 & \\ 4 & 3 & 1 \end{pmatrix}
\begin{pmatrix} 2 & 1 & 1 \\ & 1 & 1 \\ & & 2 \end{pmatrix}.$$
Recall the notation for principal submatrix from Definition 2.20. The technique to convert A
into an upper triangular matrix by multiplying lower triangular matrices leads to the following
algorithm:
1: L = I , U = A
2: for k = 1, . . . , n 1 do
3: for j = k + 1, . . . , n do
4: ljk = ujk /ukk
5: $(u_{j,k}, \dots, u_{j,n}) = (u_{j,k}, \dots, u_{j,n}) - l_{j,k}\,(u_{k,k}, \dots, u_{k,n})$
6: end for
7: end for
Remarks. Line 5 of the algorithm subtracts a multiple of row $k$ from row $j$, causing $u_{jk} = 0$ without changing columns $1, \dots, k-1$. This corresponds to multiplication with a lower triangular matrix $L_k$ as in the example above. Thus after the loop ending in line 6 is finished, the current value of the matrix $U$ is $L_k \cdots L_1 A$ and it has zeros below the diagonal in columns $1, \dots, k$.
Since the principal sub-matrices $A_j$ are non-singular and the matrices $L_j$ are unit triangular we get, for $\tilde A = L_k \cdots L_1 A$,
$$\det \tilde A_{k+1} = \det A_{k+1} \neq 0$$
and thus, since $\tilde A$ has zeros below the diagonal in column $k$, we must have $u_{kk} \neq 0$ in line 4.
Lemma 5.1 shows that the algorithm calculates the correct entry $l_{jk}$ for the matrix $L = (L_{n-1}\cdots L_1)^{-1}$.
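For illustration, a direct Matlab transcription of the factorisation algorithm above (no pivoting, so every principal sub-matrix is assumed non-singular); the function name mylu is ours.

function [L,U] = mylu(A)
% LU factorisation without pivoting, following the algorithm above.
% Assumes every principal sub-matrix of A is non-singular.
n = size(A,1);
L = eye(n);  U = A;
for k = 1:n-1
    for j = k+1:n
        L(j,k)   = U(j,k)/U(k,k);
        U(j,k:n) = U(j,k:n) - L(j,k)*U(k,k:n);
    end
end
end

For the worked example above, [L,U] = mylu([2 1 1; 4 3 3; 8 7 9]) reproduces the factors computed by hand.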
The last missing building block for the Gaussian elimination method is the following algorithm
to solve systems of linear equations when the coecient matrix is triangular.
1: for $j = n, \dots, 1$ do
2:    $x_j = \dfrac{1}{u_{jj}}\Bigl(b_j - \sum_{k=j+1}^{n} u_{jk}x_k\Bigr)$
3: end for
Remarks.
1. Since $U$ is triangular we get
$$(Ux)_i = \sum_{j=i}^{n} u_{ij}x_j
= u_{ii}\cdot\frac{1}{u_{ii}}\Bigl(b_i - \sum_{k=i+1}^{n} u_{ik}x_k\Bigr) + \sum_{j=i+1}^{n} u_{ij}x_j = b_i,$$
so the computed vector indeed solves $Ux = b$.
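A Matlab version of the back substitution loop, included here as a sketch; the forward substitution routine asked for in Exercise 5-2 is analogous.

function x = backsub(U, b)
% Solve U*x = b for upper triangular, invertible U by back substitution.
n = length(b);
x = zeros(n,1);
for j = n:-1:1
    x(j) = (b(j) - U(j,j+1:n)*x(j+1:n)) / U(j,j);
end
end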
Combining all our preparations we get the Gaussian elimination algorithm (GE) to solve the problem (SLE):
1: compute the LU factorisation $A = LU$
2: solve $Ly = b$ by forward substitution
3: solve $Ux = y$ by back substitution
Remarks.
1. The result of this algorithm is an $x \in \mathbb{C}^n$ with $Ax = LUx = Ly = b$ and thus the algorithm gives the correct result.
2. The vector $y$ may be calculated at the same time as the LU factorisation of $A$, by adding $b$ as an additional column in $A$.
Computational Complexity
Lemma 5.2. The LU factorisation algorithm has computational cost
$$C(n) = \frac23 n^3 + \frac12 n^2 - \frac76 n.$$
Proof. We have to count the number of floating point operations in the LU factorisation algorithm. Line 5 requires $(n-k+1)$ multiplications and $(n-k+1)$ subtractions, i.e. $2(n-k+1)$ operations. Line 4 contributes one division. Thus the loop starting at line 3 needs $(n-k)\bigl(1 + 2(n-k+1)\bigr)$ operations. Considering the outer loop the total number of operations is
$$C(n) = \sum_{k=1}^{n-1}(n-k)\bigl(1 + 2(n-k+1)\bigr)
= \sum_{k=1}^{n-1}\bigl(2(n-k)^2 + 3(n-k)\bigr)
= 2\sum_{l=1}^{n-1} l^2 + 3\sum_{l=1}^{n-1} l.$$
Evaluating the sums gives the claim.
Remark. Using the $\sim$-notation the claim of the lemma becomes $C(n) \sim \frac23 n^3$.
Lemma 5.3. Forward substitution and back substitution have computational cost $C(n) \sim n^2$.
Proof. Calculating $x_j$ in the back substitution algorithm needs $2(n-j)+1$ operations. Thus the total cost is
$$C(n) = \sum_{j=1}^{n}\bigl(2(n-j)+1\bigr) = 2\sum_{k=0}^{n-1}k + n = n(n-1) + n = n^2.$$
Error Analysis
The following theorem is proved by using the Assumptions (A1) and (A2) concerning computer
arithmetic. The proof is somewhat involved and we omit it here. We include the statement,
however, because it is an important building block in the overall understanding of backward
error analysis for Gaussian elimination. We use the notation || for matrices, introduced just
prior to Lemma 1.34.
Theorem 5.5. The forward and back substitution algorithms are backward stable: the computed solution $\hat x$ of a triangular system of equations $Tx = b$ satisfies $(T + \Delta T)\hat x = b$ for some triangular matrix $\Delta T \in \mathbb{C}^{n\times n}$ with, for $n\varepsilon_m$ sufficiently small,
$$|\Delta T| \le \frac{n\varepsilon_m}{1 - n\varepsilon_m}\,|T|.$$
Note that this theorem implies that $\Delta T$ has the same triangular structure as $T$ itself. The result shows that, for $n\varepsilon_m$ sufficiently small, there is a constant $c$ independent of $n$ such that, by Lemma 1.34,
$$\|\Delta T\|_\infty \le c\,n\varepsilon_m\|T\|_\infty.$$
Applying this result to back substitution we find that the computed solution $\hat x$ of $Ux = y$ solves
$$(U + \Delta U)\hat x = y$$
and that
$$\|\Delta U\|_\infty \le c\,n\varepsilon_m\|U\|_\infty.$$
Using Proposition 3.5 about the conditioning of (SLE) we get an upper bound on the error in the computed result of back substitution:
$$\frac{\|\hat x - x\|_\infty}{\|x\|_\infty} \lesssim c\,\kappa_\infty(U)\,n\varepsilon_m.$$
The proof of this result is similar to that of the previous theorem, and we omit it too. Combining the two previous theorems gives the following backward error analysis result for the solution of (SLE) by Gaussian elimination.
Theorem 5.7. Let $\hat x$ be the approximate solution of (SLE) as computed by algorithm GE. Then
$$(A + \Delta A)\hat x = b.$$
Example. Let
$$A = \begin{pmatrix} \varepsilon & 1 \\ 1 & 1 \end{pmatrix},\qquad
L = \begin{pmatrix} 1 & 0 \\ \varepsilon^{-1} & 1 \end{pmatrix},\qquad
U = \begin{pmatrix} \varepsilon & 1 \\ 0 & 1 - \varepsilon^{-1} \end{pmatrix},$$
so that $A = LU$. Now assume $\varepsilon \ll 1$. Then $\varepsilon^{-1}$ is a huge number and the representation of these matrices stored in a computer will be rounded. The matrices might be represented as
$$\hat L = \begin{pmatrix} 1 & 0 \\ \varepsilon^{-1} & 1 \end{pmatrix},\qquad
\hat U = \begin{pmatrix} \varepsilon & 1 \\ 0 & -\varepsilon^{-1} \end{pmatrix},$$
and then
$$\hat L\hat U = \begin{pmatrix} \varepsilon & 1 \\ 1 & 0 \end{pmatrix}
= A + \underbrace{\begin{pmatrix} 0 & 0 \\ 0 & -1 \end{pmatrix}}_{\Delta A}.$$
A small rounding error leads to a large backward error $\Delta A$. The example shows that for Gaussian elimination a backward error analysis will, in general, lead to the conclusion that the perturbed problem is not close to the original one: the algorithm is not backward stable.
Discussing the result in terms of the previous theorem, note that $\|A\|_\infty = 2$ and that $\||\hat L|\,|\hat U|\|_\infty$ is of order $\varepsilon^{-1}$. Hence it is not possible to bound $\||\hat L|\,|\hat U|\|_\infty$ in terms of $\|A\|_\infty$, independently of $\varepsilon$.
Note that this problem is not related to the conditioning of the matrix $A$. We have
$$A^{-1} = (1-\varepsilon)^{-1}\begin{pmatrix} -1 & 1 \\ 1 & -\varepsilon \end{pmatrix}$$
and thus $\kappa_\infty(A) = \|A\|_\infty\|A^{-1}\|_\infty \approx 4$ for small $\varepsilon > 0$, and the matrix $A$ is well conditioned.
Because of the instability illustrated in the previous example, the classical Gaussian elimination
method is not used in black box numerical software for (SLE). However, the continuation of the
example illustrates the resolution: permutations can cure this problem.
If we set
$$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},\qquad\text{and}\qquad
A' = PA = \begin{pmatrix} 1 & 1 \\ \varepsilon & 1 \end{pmatrix},$$
then
$$L' = \begin{pmatrix} 1 & 0 \\ \varepsilon & 1 \end{pmatrix},\qquad
U' = \begin{pmatrix} 1 & 1 \\ 0 & 1-\varepsilon \end{pmatrix},\qquad
\hat L' = \begin{pmatrix} 1 & 0 \\ \varepsilon & 1 \end{pmatrix},\qquad
\hat U' = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}.$$
Thus
$$\hat L'\hat U' = \begin{pmatrix} 1 & 1 \\ \varepsilon & 1+\varepsilon \end{pmatrix}
= A' + \underbrace{\begin{pmatrix} 0 & 0 \\ 0 & \varepsilon \end{pmatrix}}_{\Delta A'},$$
so the backward error is now of the size of the rounding error.
Here GEPP denotes Gaussian Elimination with partial pivoting whilst GECP denotes Gaus-
sian Elimination with complete pivoting. GEPP is the most widely used in practice and so we
devote most of our analysis to this algorithm.
The Method
The instability of Gaussian elimination highlighted above is caused by the fact that we divide by
the tiny number ukk = in step 4 of the LU factorisation algorithm. Motivated by the resolution
in the example above, we will avoid this problem in the improved version of the algorithm by
rearranging rows k, . . . , n by multiplication with a permutation matrix at the beginning of the
k th iteration in order to maximise the element ukk . The following argument shows that the
modied algorithm still works correctly.
We want to calculate
$$U = L_{n-1}P_{n-1}\cdots L_1 P_1 A.$$
Multiplying by the permutation matrix $P_k$ from the left exchanges rows $k$ and $i_k$, where $i_k$ is chosen from $\{k, \dots, n\}$ to maximise the element $|u_{i_k k}|$. We can rewrite this as
1: U = A, L = I , P = I
2: for k = 1, . . . , n 1 do
3: choose i {k, . . . , n} which maximises |uik |
4: exchange row (uk,k , . . . , uk,n ) with (ui,k , . . . , ui,n )
5: exchange row (lk,1 , . . . , lk,k1 ) with (li,1 , . . . , li,k1 )
6: exchange row (pk,1 , . . . , pk,n ) with (pi,1 , . . . , pi,n )
7: for j = k + 1, . . . , n do
8: $l_{jk} = u_{jk}/u_{kk}$
9: $(u_{j,k}, \dots, u_{j,n}) = (u_{j,k}, \dots, u_{j,n}) - l_{j,k}\,(u_{k,k}, \dots, u_{k,n})$
10: end for
11: end for
Remark. One can show that, whenever $A$ is non-singular, the element $u_{kk}$ is always non-zero after the row-exchange. Thus we no longer require the condition $\det(A_j) \neq 0$ to avoid a division by zero here.
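A Matlab sketch of the algorithm with partial pivoting, returning $P$, $L$, $U$ with $PA = LU$; the function name gepp_lu is ours.

function [L,U,P] = gepp_lu(A)
% LU factorisation with partial pivoting: P*A = L*U.
n = size(A,1);
U = A;  L = eye(n);  P = eye(n);
for k = 1:n-1
    [~,r] = max(abs(U(k:n,k)));      % pivot row within rows k..n
    i = r + k - 1;
    U([k i],k:n)   = U([i k],k:n);   % exchange rows of U
    L([k i],1:k-1) = L([i k],1:k-1); % exchange already computed part of L
    P([k i],:)     = P([i k],:);     % record the permutation
    for j = k+1:n
        L(j,k)   = U(j,k)/U(k,k);
        U(j,k:n) = U(j,k:n) - L(j,k)*U(k,k:n);
    end
end
end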
Lemma 5.8. The resulting matrices $L$ and $L_k$ satisfy $\|L\|_\infty \le n$ and $\|L_k\|_\infty \le 2$.
Proof. Each matrix $L_k$ has entries $|(L_k)_{ij}| \le 1$ for all $i, j \in \{1, \dots, n\}$. Consequently the same is true of $L$. The result for $\|L\|_\infty$ follows directly. Furthermore, as each row of $L_k$ contains at most two non-zero entries, we have $\|L_k\|_\infty \le 2$.
Remarks.
1. The result of this algorithm is an x Cn with Ax = P 1 LU x = P 1 Ly = P 1 P b = b
and thus the algorithm is correct.
Computational Complexity
The computational complexity is the same as for the LU algorithm given in Theorem 5.4. This follows from the fact that we only added steps to exchange rows in the LU algorithm and such exchanges do not incur computational cost in our model. If we assign unit cost to the comparisons between numbers, made to maximise $|u_{ik}|$ in step 3 of the algorithm, then this added additional cost is of order $\Theta(n^2)$; thus it can be neglected for a $\Theta(n^3)$ algorithm. For GECP the additional cost is of order $\Theta(n^3)$ and hence cannot be neglected.
Error Analysis
To derive a backward error analysis for GEPP is nontrivial. We limit ourselves to a heuristic understanding of the analysis, based on two simplifying assumptions, under which the computed solution $\hat x$ satisfies
$$P(A + \Delta A)\hat x = Pb$$
where
$$\|\Delta A\|_\infty \le \frac{3n\varepsilon_m}{1 - 3n\varepsilon_m}\,\|L\|_\infty\|U\|_\infty.$$
To bound the right-hand side of this expression we use the following definition:
Definition 5.9. Define $g_n(A)$, the growth factor for matrix $A \in \mathbb{C}^{n\times n}$, by
$$g_n(A) = \frac{\|U\|_{\max}}{\|A\|_{\max}}. \tag{5.1}$$
From this it follows, by Lemma 1.35, that
$$\|U\|_\infty \le n\,g_n(A)\,\|A\|_\infty,$$
and hence
$$\|\Delta A\|_\infty \lesssim \frac{3n^3 g_n(A)\,\varepsilon_m}{1 - 3n\varepsilon_m}\,\|A\|_\infty. \tag{5.2}$$
(Here . indicates that approximations have been made in deriving the inequality). To under-
stand the implications of this bound we study the behaviour of the growth factor in further
detail.
Lemma 5.10. The growth factor $g_n(A)$ satisfies $g_n(A) \le 2^{n-1}$ for every $A \in \mathbb{C}^{n\times n}$.
Proof. We have $U = L'_{n-1}\cdots L'_1 PA$. For any $k \in \{1, \dots, n-1\}$ and any matrix $A \in \mathbb{C}^{n\times n}$ we find $\|L'_k A\|_{\max} \le 2\|A\|_{\max}$, since every entry of $L'_k A$ is an entry of $A$ minus a multiple (of modulus at most one) of another entry of $A$. Applying the matrices $L'_1, \dots, L'_{n-1}$ in order we get, since $\|PA\|_{\max} = \|A\|_{\max}$,
$$\|U\|_{\max} \le 2^{n-1}\|A\|_{\max}.$$
This completes the proof.
Placing this upper bound on $g_n(A)$ in (5.2) shows a worryingly large $n$-dependence in the backward error bound for GEPP:
$$\|\Delta A\|_\infty \lesssim \frac{3n^3 2^{n-1}\varepsilon_m}{1 - 3n\varepsilon_m}\,\|A\|_\infty.$$
Furthermore, the following example shows that the upper bound on $g_n(A)$ from the lemma can be attained.
Example. Let
$$A = \begin{pmatrix}
1 & & & & 1 \\
-1 & 1 & & & 1 \\
-1 & -1 & 1 & & 1 \\
\vdots & & \ddots & \ddots & \vdots \\
-1 & -1 & \cdots & -1 & 1
\end{pmatrix} \in \mathbb{R}^{n\times n}.$$
Since all relevant elements have modulus 1 we do not need to use pivoting and LU factorisation gives
$$U = \begin{pmatrix}
1 & & & & 1 \\
& 1 & & & 2 \\
& & 1 & & 4 \\
& & & \ddots & \vdots \\
& & & 1 & 2^{n-2} \\
& & & & 2^{n-1}
\end{pmatrix} \in \mathbb{R}^{n\times n}.$$
Thus for the matrix $A$ as defined above we get
$$g_n(A) = \frac{\|U\|_{\max}}{\|A\|_{\max}} = 2^{n-1}.$$
Despite this worst case analysis, GEPP is widely used in practice; we outline why this is the
case.
Remarks.
1. The implied backward error analysis for GEPP given by (5.2) is indeed very weak: although the algorithm is technically backward stable, for certain problems it is necessary that $2^n\varepsilon_m$ be small to ensure a small backward error.
2. The preceding analysis may also be carried out for the GECP algorithm and results in a bound for the growth factor which scales like $n^{\frac14\log n + \frac12}$; whilst still faster than polynomial in $n$, this bound is considerably slower than $2^n$. Thus GECP has, in a worst case analysis, far better stability properties than GEPP.
3. If comparisons between numbers are assigned unit cost, then GECP is still a $\Theta(n^3)$ algorithm, but the constant is larger than for GEPP.
4. Practical experience, and some statistical analysis, has led to a belief that matrices resulting
in large backward error for GEPP do not arise commonly in practice. Thus the worst case
analysis is thought, typically, to be misleading.
5. For these reasons GEPP is commonly used in practice, because it is cheaper than GECP
if comparisons are assigned a unit computational cost, and because it appears to produce
reasonable backward error on most problems.
The Householder QR factorisation is another method to solve (SLE). QR factorisation avoids the
poor dependence of the stability constant on n that arises for the LU factorisation with partial
pivoting. However the computation takes roughly twice the number of operations.
The Method
The following algorithm solves the problem (SLE) using the QR factorisation from Theorem 2.18.
In order to apply the algorithm we will need a numerically stable method to calculate the QR
factorisation. To avoid a minor technical complication, we restrict ourselves to the case of
real-valued matrices in this section.
Householder Reections
In step 8 of algorithm QR we calculate a product of the form
$$Q_k\begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}
= \begin{pmatrix} R_{11} & R_{12} \\ 0 & H_kR_{22} \end{pmatrix} \tag{5.3}$$
where $R_{11} \in \mathbb{R}^{(k-1)\times(k-1)}$, $H_k \in \mathbb{R}^{(m-k+1)\times(m-k+1)}$ and $R_{22} \in \mathbb{R}^{(m-k+1)\times(n-k+1)}$. The purpose of this subsection is to understand this step of the algorithm.
If $H_k$ as calculated in step 6 of algorithm QR is applied to a vector $x \in \mathbb{R}^{m-k+1}$, the result is
$$H_k x = x - 2v\langle v, x\rangle$$
(where $v$ is normalised so that $\|v\|_2 = 1$). Since the vector $v\langle v, x\rangle$ is the projection of $x$ onto $v$, the value $x - v\langle v, x\rangle$ is the projection of $x$ onto the plane which is orthogonal to $v$, and $x - 2v\langle v, x\rangle$ is the reflection of $x$ at that plane. Reflecting twice at the same plane gives back the original vector and thus we find
$$H_kH_k = H_k^T H_k = I.$$
This shows that the matrices $H_k$, and then also $Q_k$, are orthogonal for every $k \in \{1, \dots, n-1\}$.
The vector which defines the reflection plane is either $v = u - \|u\|_2 e_1$ or $v = u - (-\|u\|_2 e_1)$, depending on the sign of $u_1$. The corresponding reflection maps the vector $u$ to $H_ku = \|u\|_2e_1$ or $H_ku = -\|u\|_2e_1$ respectively. In either case the image is a multiple of $e_1$ and, since $u$ is the first column of the matrix block $R_{22}$, the product $H_kR_{22}$ has zeros below the diagonal in the first column. The first column of $R_{22}$ is the $k$th column of $R$ and thus $Q_kR$ has zeros below the diagonal in columns $1, \dots, k$. For $k = n-1$ we find that $R = Q_{n-1}\cdots Q_1A$ is an upper triangular matrix as required.
Remarks.
1. The matrices $H_k$, and sometimes also $Q_k$, are called Householder reflections.
2. The choice of sign in the definition of $v$ helps to increase the stability of the algorithm in the cases $u \approx \|u\|_2e_1$ and $u \approx -\|u\|_2e_1$. In the complex case, $\operatorname{sign}(x)$ will be replaced by a complex number $\sigma \in \mathbb{C}$ with $|\sigma| = 1$.
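To illustrate how the reflections are assembled into a factorisation, here is a Matlab sketch of Householder QR for a real m x n matrix with m >= n, accumulating Q explicitly; the names house_qr and sign_nz are ours, and the sign convention follows the remark above.

function [Q,R] = house_qr(A)
% Householder QR factorisation A = Q*R for real A with m >= n.
[m,n] = size(A);
Q = eye(m);  R = A;
for k = 1:min(n,m-1)
    u = R(k:m,k);
    v = u;
    v(1) = u(1) + sign_nz(u(1))*norm(u);   % v = u + sign(u1)*||u||*e1
    if norm(v) > 0
        v = v/norm(v);                     % normalise so H = I - 2*v*v'
        R(k:m,k:n) = R(k:m,k:n) - 2*v*(v'*R(k:m,k:n));
        Q(:,k:m)   = Q(:,k:m)   - 2*(Q(:,k:m)*v)*v';
    end
end
end

function s = sign_nz(a)
% sign convention with sign(0) = 1, used for stability as in the remark above
if a >= 0, s = 1; else, s = -1; end
end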
Computational Complexity
We considered two variants of algorithm QR, either calculating the full matrix Q as formulated
in line 9 of the algorithm or only calculating Q b by replacing line 9 with the statement b = Qk b.
We handle the dierent cases by rst analysing the operation count for the algorithm with line 9
omitted.
Lemma 5.11. The computational cost $C(m,n)$ for algorithm QR applied to an $m\times n$-matrix, without calculating $Q$ or $\hat Q$, is asymptotically
$$C(m,n) \sim 2mn^2 - \frac23 n^3 \qquad\text{for } m, n \to \infty \text{ with } m = \Theta(n).$$
Proof. We count the number of operations for the individual steps of algorithm QR. From equation (5.3) we can see that for calculating the product $Q_kR$ in step 8 we only have to calculate $H_kR_{22} = R_{22} - 2\tilde v\tilde v^*R_{22}$. Since $\tilde v = v/\|v\|_2$ and $\|v\|_2^2 = v^*v$ we can calculate this as
$$H_kR_{22} = R_{22} - \frac{v}{v^*v/2}\,\bigl(v^*R_{22}\bigr).$$
Using this formula we get the following operations count:
construction of $v$: $2(m-k+1)+1$ operations (counting $\sqrt{\;}$ as 1);
calculating $v^*v/2$ needs $2(m-k+1)$ operations and dividing $v$ by the result requires another $(m-k+1)$ divisions;
calculating $v^*R_{22}$ and subtracting the resulting rank-one update from $R_{22}$ needs $(n-k+1)\bigl(4(m-k+1)-1\bigr)$ operations.
Summing over the columns gives
$$C(m,n) = \sum_{k=1}^{n}\Bigl(5(m-k+1) + 1 + (n-k+1)\bigl(4(m-k+1)-1\bigr)\Bigr)
= \sum_{l=m-n+1}^{m}\Bigl(5l + 1 + (n-m+l)(4l-1)\Bigr)$$
$$= 2mn^2 - \frac23 n^3 + \text{terms with at most two factors } m, n
\;\sim\; 2mn^2 - \frac23 n^3.$$
For the case $m = n$ this gives
$$C(n) \sim 2n\cdot n^2 - \frac23 n^3 = \frac43 n^3 \qquad\text{for } n \to \infty.$$
This analysis shows that solving (SLE) using Householder QR factorisation requires asymptotically twice as many operations as algorithm GEPP. This is the price we have to pay for the better stability properties of the QR algorithm.
Error Analysis
The backward error analysis for Householder QR is best expressed in terms of the norms of
matrix columns. For any matrix D Rnn we write dj to denote the j th column of D. A
similar convention for upper and lower case letters is used in what follows.
The rst result concerns the form of the triangular factor in the Householder QR factorisation.
Theorem 5.12. Let $\hat R$ denote the computed upper triangular factor in a Householder QR factorisation of $A \in \mathbb{R}^{n\times n}$. Then there exists an orthogonal $Q \in \mathbb{R}^{n\times n}$ and a constant $c \in \mathbb{R}_+$ such that
$$A + \Delta A = Q\hat R.$$
Theorem 5.13. Let $x$ solve $Ax = b$ and let $\hat x$ be the solution computed through Householder QR. Then
$$(A + \Delta A)\hat x = b + \Delta b.$$
This is a considerable improvement over the worst case analysis for GEPP. However, as we
have seen, the worst case analysis does not give a true picture of the stability of GEPP in
practice; furthermore, Householder QR costs twice as much as GEPP. Consequently GEPP is
favoured in practice.
Bibliography
The book [Hig02] is an excellent source of material concerning backward error analysis for nu-
merical algorithms. In particular, proofs of Theorems 5.5, 5.6, 5.12 and 5.13 may all be found
in this book. The fact that the worst case analysis of GEPP does not seem to arise often in
practice is discussed in [TB97].
Exercises
Exercise 5-1. The purpose of this question is to illustrate some of the content of the chapter
by means of some Matlab experiments. We consider the linear system
Ax = b (5.4)
>> A=randn(5)
>> b=randn(5,1)
These two commands will produce a random 5 x 5 matrix and 5-vector where each entry is independent of all others and is distributed normally with mean 0 and variance 1. Now type
>> x=A\b
This x is the (computer approximation to the) solution of (5.4). To verify this type
>> rn=norm(A*x-b)
How big is rn? What does this tell you?
Now we try and explain how Matlab solved (5.4). Type
>> [L,U,P]=lu(A)
and describe what special properties L, U and P have. In particular, what effect does P have when applied to an arbitrary 5 x 5 matrix? Why is $P^{-1} = P^T$? If you type
>> P*A
and then
>> L*U
you see that P A = LU (approximately; use norm to measure the residual). Thus, applying P
to (5.4), we see that x solves
LU x = P b.
Thus
$$x = U^{-1}L^{-1}Pb.$$
What special features do U and L have which make this a good way to find x? (Think about what is involved in solving an arbitrary linear system and compare it to what is involved if the matrix to be inverted has the form of U or L.)
The LU factorisation described above is what Matlab uses to solve (5.4) (unless A is symmetric with positive diagonal entries, in which case Cholesky factorisation is used). But an alternative is to use the QR factorisation which we now outline. Type
>> [Q,R]=qr(A)
and then type
>> rn=norm(Q*R-A)
noting that this shows that QR is an (approximate) factorisation of A. R has a property you have seen on another matrix in this question (what is it?) and is hence easy to invert. What about Q? Let us extract the columns from Q. Type
Exercise 5-2. In analogy to the back substitution algorithm formulate the forward substitution algorithm to solve $Ly = b$ where $L \in \mathbb{C}^{n\times n}$ is a lower triangular, invertible matrix and $b \in \mathbb{C}^n$ is a vector. Show that your algorithm computes the correct result and that it has asymptotic computational cost of order $\Theta(n^2)$.
Exercise 5-4. a) Let $A = ab^*$ for vectors $a, b \in \mathbb{C}^n$. Find all eigenvalues and eigenvectors of $A$. When is $A$ a normal matrix?
b) Let $H \in \mathbb{R}^{n\times n}$ be a Householder reflection. Show that $H$ has a single eigenvalue $-1$ and an eigenvalue $+1$ of multiplicity $(n-1)$. What is the value of $\det(H)$?
Exercise 5-5. Determine the asymptotic computational cost of algorithm QR when calculating
the full matrix Q.
Exercise 5-6. Design algorithms to solve
$$Ux = y, \qquad Ly = b,$$
where $U \in \mathbb{R}^{n\times n}$ (resp. $L \in \mathbb{R}^{n\times n}$) is an upper (resp. lower) triangular matrix. Construct examples $U$ and $y$ where you know the exact solution $x$ (for some large $n$). Compute $\hat x$ using your algorithm and study the error
$$e = \frac{\|x - \hat x\|}{\|x\|},$$
discussing your results.
Now generate a random upper triangular $U \in \mathbb{R}^{50\times 50}$ with non-zero entries uniformly distributed in $[-1, 1]$. Let $x = (1, \dots, 1)^T \in \mathbb{R}^{50}$ and compute $y = Ux$ by use of Matlab's matrix multiply. Solve $U\hat x = y$ (with the just computed $y$) using the algorithm you designed above and compute the relative error
$$e = \frac{\|x - \hat x\|}{\|x\|}.$$
Do this several hundred times and discuss the resulting statistics of $e$.
Exercise 5-7. Using the construction from Exercise 2-2 as the basis for an algorithm, and making maximum use of symmetry, show that Cholesky factorisation of a symmetric positive-definite matrix can be achieved in approximately half the flops of standard Gaussian Elimination.
Exercise 5-8. Let $A$ be a symmetric positive definite matrix that is also strongly (row) diagonally dominant. Let $g_n(A)$ be the growth factor from (5.1). Show that when Gaussian Elimination is applied to $A$ we get $g_n(A) \le 1$.
Exercise 5-9. Banded matrices arise frequently in discretisations of PDEs and it is thus im-
portant to understand them both from a theoretical and a practical point of view.
We consider here $m\times m$ tridiagonal matrices. These have the form
$$\begin{pmatrix}
d_1 & c_1 & & & \\
a_2 & d_2 & c_2 & & \\
& \ddots & \ddots & \ddots & \\
& & a_{m-1} & d_{m-1} & c_{m-1} \\
& & & a_m & d_m
\end{pmatrix}. \tag{5.5}$$
First of all it will be useful to be able to invert tridiagonal matrices. Consider solving
Ax = b, (5.6)
Exercise 5-10. Prove Theorem 5.7 about the backward stability of algorithm GE.
Chapter 6
Iterative Methods
In the previous chapter we considered methods which directly solve (SLE) using $\Theta(n^3)$ operations. In this chapter we will introduce iterative methods. These methods construct a sequence
(xk )kN with
xk x for k ,
where x solves (SLE). These methods approximate the exact solution x by the computed value
xk for large k and thus, even in the absence of rounding errors, the resulting algorithms only
compute approximate solutions to (SLE). In practical applications this is no problem, since one
can choose k large enough such that the approximation error is of the same order of magnitude
as the deviation caused by rounding errors in the exact methods.
The primary motivation for the use of such iterative methods arises for problems where $A \in \mathbb{C}^{n\times n}$ with $n \gg 1$, so that solving the problem by the direct methods of Chapter 5, with cost $\Theta(n^3)$, is prohibitively expensive. If $A$ has some special structure, for example if matrix-
vector multiplication with A is cheap to calculate, then iterative methods can have a substantial
advantage over direct methods.
For the methods considered in this chapter, the sequence $(x_k)$ is constructed iteratively, i.e. starting with a value $x_0$ the value $x_k$, for each $k$, is computed from $x_{k-1}$. The map $x_{k-1} \mapsto x_k$
may be either linear or nonlinear; we discuss methods of both types. The two competing criteria
for the iteration are:
The calculation of xk from xk1 should be cheap, relative to the cost of solving (SLE).
The convergence of xk to x should be fast.
These two criteria are often contradictory to one another and the trade-o between them is very
much dependent upon the matrix A itself. To illustrate this trade o it is useful to have the
following two extreme iterative methods in mind:
$x_k = A^{-1}b$; and
$x_k = x_{k-1}$.
The first is not cheap to implement (each step involves solving (SLE) by a direct method, and our assumption is that this is prohibitively expensive) but converges fast: in one step.
The second is cheap to implement but does not converge at all. Successful iterative methods lie
somewhere between these extremes.
In many cases, the cost of computing xk from xk1 is dominated by the cost of matrix-vector
multiplication for a matrix N which is derived from A. The resulting methods will be ecient if
these multiplications can be computed eciently. The most common structure leading to cheap
matrix-vector multiplication is sparsity : where A has only a small number of non-zero entries.
Then matrix-vector multiplication by A, A or any matrix N where N inherits the sparsity of
A will be cheap; the matrix N might, for example, be formed from the o-diagonals of A.
In this chapter we will focus only on errors incurred through using a nite number of iterations
but we will not discuss the eect of rounding errors. For each section describing a computational
approach there will be three subsections, one dening the method, the second studying error as
a function of the number of iterations k, and the third discussing computational cost. We will
consider the cost of achieving an error of size $\varepsilon$. We will show that, for some important problems, it is possible to compute an approximate solution in time $\Theta(n^a)$ with $a < 3$ as the dimension $n$ of the problem increases, thus beating the cost of the direct methods from the previous chapter.
Iterative methods where xk is computed from xk1 by application of a linear map are called
linear methods. In this section we describe results about linear methods in general. Sections 6.2
and 6.3 below introduce two specic instances of linear methods, namely the Jacobi method and
the successive over relaxation scheme (SOR).
The Method
The basic idea of the methods described in this section is to write the matrix $A$ as $A = M + N$ where the matrix $M$ is easier to invert than $A$. Then, given $x_{k-1}$, we can define $x_k$ by
$$Mx_k = b - Nx_{k-1}. \tag{6.1}$$
If we assume for the moment that $\lim_{k\to\infty}x_k = \bar x$, then the limit $\bar x$ satisfies
$$M\bar x = b - N\bar x$$
and thus we get $A\bar x = b$. This shows that the only possible limit for this sequence is the solution of (SLE). Thus, once we know that the generated sequence converges for a given matrix $A$ and a given initial value $x_0$, we can iteratively compute $x_k$ for a big value of $k$ to get an approximation for $x$. In the following sections we will give sufficient criteria for convergence of the sequence $(x_k)_{k\in\mathbb{N}}$.
This leads to the following algorithm.
1: for k = 1, 2, 3, . . . do
2: compute yk = b N xk1
3: solve M xk = yk
4: end for
To actually implement this method one needs to choose M and N such that the method
converges (we discuss dierent choices below). One also needs a stopping criterion to decide
when to quit iterating and to return the resulting approximation xk .
Thus, the remaining problem is to choose a stopping criterion for the iteration. We consider the error
$$e_k = x - x_k$$
and the residual vector
$$r_k = b - Ax_k = Ae_k,$$
where $x$ is the exact solution of (SLE) and $x_k$ the approximation at step $k$. We would like to choose $k$ big enough such that $\|e_k\| \le \varepsilon$ for a given $\varepsilon > 0$ but, since $e_k$ cannot be computed without knowledge of the exact solution $x$, this is not a practical stopping criterion. Instead, we will stop the iteration once $\|r_k\| \le \varepsilon_r$ for some $\varepsilon_r > 0$. The residual error $r_k$ can easily be computed during the iteration and from the estimate $\|e_k\| = \|A^{-1}r_k\| \le \|A^{-1}\|\,\|r_k\| \le \|A^{-1}\|\varepsilon_r$ we can get estimates for the error if needed.
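A Matlab sketch of the generic linear iteration (6.1) together with the residual-based stopping rule just described; the function name splitting_solve, the argument names and the tolerance tol_r are ours.

function x = splitting_solve(M, N, b, x, tol_r, kmax)
% Linear iterative method M*x_k = b - N*x_{k-1} with residual stopping rule.
A = M + N;
for k = 1:kmax
    x = M \ (b - N*x);           % one iteration step
    if norm(b - A*x) <= tol_r    % stop once the residual is small enough
        return
    end
end
warning('splitting_solve: maximum number of iterations reached');
end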
Error Analysis
Since $M\bar x = b - N\bar x$ we get the relation
$$e_k = -M^{-1}Ne_{k-1}.$$
Remarks.
1. The lemma shows that the linear iterative method defined by (6.1) converges for every initial condition $x_0 \in \mathbb{C}^n$ if and only if $\rho(R) < 1$ for $R = M^{-1}N$. The convergence is fast if $\rho(R)$ is small.
Computational Complexity
When measuring computational cost for iterative methods it is of interest to know how many
iterations are required to reduce the error to size . As before we will consider the relative error
$\|x_k - x\|/\|x\| = \|e_k\|/\|x\|$. By Proposition 3.3 we have
$$\frac{\|e_k\|}{\|x\|} \le \kappa(A)\,\frac{\|r_k\|}{\|b\|}. \tag{6.2}$$
Finding $k$ to achieve
$$\frac{\|r_k\|}{\|b\|} \le \varepsilon, \tag{6.3}$$
and thus
$$\frac{\|e_k\|}{\|x\|} \le \varepsilon\,\kappa(A), \tag{6.4}$$
will lead to an estimate of the computational cost of using the iterative method.
Using the relation $r_k = Ae_k = AR^ke_0$ we can estimate the left hand side of (6.3) to get, from (6.2),
$$\frac{\|e_k\|}{\|x\|} \le \kappa(A)\,\frac{\|A\|\,\|R\|^k\,\|e_0\|}{\|b\|}$$
for any vector norm $\|\cdot\|$. If $\|\cdot\|$ is chosen so that $\|R\| < 1$ then choosing $k$ to be the smallest integer greater than
$$k^\sharp = \frac{\ln\varepsilon^{-1} + \ln\|A\| + \ln\|e_0\| - \ln\|b\|}{\ln\|R\|^{-1}} \tag{6.5}$$
guarantees (6.3).
Theorem 6.3. Under Assumption 6.2 the computational cost to achieve (6.4) by using the linear iterative method is bounded above by $c\,n^{\alpha+\beta}\ln\varepsilon^{-1}$, for some constant $c$ independent of $n$ and $\varepsilon$.
Proof. The first item of the assumption ensures that each step of the iteration costs $\Theta(n^\alpha)$. The second item of the assumption implies that $\ln\|R\|^{-1} = \Theta(n^{-\beta})$. Hence combining the second and third items and using (6.5) gives the desired result.
Remarks.
1. For the theorem to be practically useful, the norm in (6.3) must be readily computable. In some cases $\|R\| < 1$ in some computable norm such as $\|\cdot\|_\infty$, $\|\cdot\|_2$ or $\|\cdot\|_1$. Then the preceding theorem applies directly.
In other cases we only know that $\|R\|_S < 1$; as the construction of $\|\cdot\|_S$ requires knowledge of the Jordan Canonical form, this is not a readily computable norm. However, by norm equivalence, we know that there is a constant $c_2 \in (0, \infty)$ such that
$$\frac{\|a\|}{\|b\|} \le c_2\,\frac{\|a\|_S}{\|b\|_S}.$$
The constant $c_2$ may depend on $n$, but in many important applications this occurs only in a polynomial fashion. In this case (6.4) can be achieved at cost of order
$$n^{\alpha+\beta}\bigl(\ln\varepsilon^{-1} + \ln c_2(n)\bigr)$$
by the methods given in the preceding theorem, if Assumption 6.2 holds in the norm $\|\cdot\|_S$. Thus there is an increase in cost by only a logarithmic factor in $n$ when $c_2(n)$ is polynomial in $n$.
2. In many applications (A), kAk and kbk do depend upon n, but only polynomially. Again
this leads to an increase in computational cost over the above, but only by a factor which
is logarithmic in n.
In this and the next section we will consider specific methods which are obtained by choosing the matrices $M$ and $N$ in (6.1). For a matrix $A \in \mathbb{C}^{n\times n}$ define the three $n\times n$-matrices
$$L = \begin{pmatrix}
0 \\
a_{21} & 0 \\
a_{31} & a_{32} & 0 \\
\vdots & & \ddots & \ddots \\
a_{n1} & a_{n2} & \dots & a_{n,n-1} & 0
\end{pmatrix},\qquad
D = \operatorname{diag}(a_{11}, a_{22}, \dots, a_{nn}),$$
$$U = \begin{pmatrix}
0 & a_{12} & a_{13} & \dots & a_{1n} \\
& 0 & a_{23} & \dots & a_{2n} \\
& & \ddots & \ddots & \vdots \\
& & & 0 & a_{n-1,n} \\
& & & & 0
\end{pmatrix}. \tag{6.7}$$
Then we have $A = L + D + U$. The matrices $L$ and $U$ are, respectively, strictly lower and strictly upper triangular.
The Method
The iterative method obtained by choosing $M = D$ and $N = L + U$ in (6.1) is called the Jacobi method. This choice of $M$ and $N$ leads to the iteration
$$x_k = D^{-1}\bigl(b - (L+U)x_{k-1}\bigr).$$
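Written out in Matlab, one obtains the following sketch of the Jacobi iteration, a special case of the generic splitting iteration above with M = D and N = L + U; the function name jacobi is ours and, for simplicity, a fixed number of steps is taken instead of a residual test.

function x = jacobi(A, b, x, kmax)
% Jacobi method: x_k = D^{-1}( b - (L+U) x_{k-1} ).
d = diag(A);                 % diagonal of A (assumed non-zero)
R = A - diag(d);             % L + U, the off-diagonal part of A
for k = 1:kmax
    x = (b - R*x) ./ d;
end
end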
Error Analysis
In order to study convergence of the Jacobi method using Lemma 6.1 we have to consider the matrix $R = M^{-1}N = D^{-1}(L+U)$. The method converges if and only if $\rho(R) < 1$. The following theorems give sufficient criteria for this to happen.
Theorem 6.4. a) The Jacobi method is convergent for all matrices $A$ with
$$|a_{ii}| > \sum_{j\neq i}|a_{ij}| \tag{6.8}$$
for $i = 1, \dots, n$ (the strong row sum criterion).
Proof. a) The entries of $R = D^{-1}(L+U)$ are
$$r_{ij} = \begin{cases} a_{ij}/a_{ii} & \text{if } j \neq i, \\ 0 & \text{else.} \end{cases}$$
Thus the strong row sum criterion gives $\|R\|_\infty < 1$, which implies $\rho(R) < 1$ and the method converges.
b) If the strong column sum criterion (6.9) holds for $A$, then the strong row sum criterion (6.8) holds for $A^*$ and thus the method converges for $A^*$. From Lemma 6.1 we then know $\rho\bigl((D^*)^{-1}(L+U)^*\bigr) < 1$. Since for every matrix $R$ the matrices $R$, $R^*$ and also $D^{-1}RD$ have the same eigenvalues we get
$$\rho\bigl(D^{-1}(L+U)\bigr)
= \rho\bigl(D^{-1}\,(L+U)D^{-1}\,D\bigr)
= \rho\bigl((L+U)D^{-1}\bigr)
= \rho\bigl((D^*)^{-1}(L+U)^*\bigr) < 1,$$
so the Jacobi method converges for $A$ as well.
Remark. As well as showing convergence, the proof of the theorem estimates kRk < 1. This
estimate can be used in Theorem 6.3 to analyse computational complexity.
There is an alternative description of irreducibility, which is often easier to check than the definition given. To the matrix $A$ we associate the oriented graph $G(A)$ with vertices $1, \dots, n$ and edges $i \to j$ for all $i, j \in \{1, \dots, n\}$ with $a_{ij} \neq 0$. Then the matrix $A$ is irreducible if and only if the graph $G(A)$ is connected, i.e. if you can reach any vertex $j$ from any vertex $i$ by following edges.
The matrix $A$ is not irreducible (indeed $P = I$ in the definition is enough to see this) and, since there is no path from 3 to 1, the graph $G(A)$ is not connected. Now consider instead
$$A = \begin{pmatrix} 1 & 1 & \\ 1 & 2 & 1 \\ & 1 & 1 \end{pmatrix}.$$
Now the graph $G(A)$ is connected and the matrix is thus irreducible.
Remarks.
The proof of equivalence of the two characterisations of irreducibility is based on the following
observations.
1. For any permutation P the graphs G(A) and G(P T AP ) are isomorphic, only the vertices
are numbered in a dierent way.
2. The block A22 in the denition of irreducibility corresponds to a set of states from where
there is no path into the states corresponding to the block A11 .
The characterisation of irreducibility through G(A) shows that the diagonal entries of A are
irrelevant to determining irreducibility. Any matrix which has non-zero o-diagonal entries if
and only if A does so, will have the same graph. Hence R = D1 (L + U ) is irreducible if and
only if A is.
for $i = 1, \dots, n$, but
$$|a_{kk}| > \sum_{j\neq k}|a_{kj}| \tag{6.11}$$
for one index $k$, then the Jacobi method converges. (This condition is called the weak row sum criterion.)
Proof. We have to show that $\rho(R) < 1$ where $R = D^{-1}(L+U)$. Define the matrix $|R| = (|r_{ij}|)_{ij} \in \mathbb{R}^{n\times n}$ and the vector $e = (1, 1, \dots, 1)^T \in \mathbb{R}^n$. Then we have
$$(|R|e)_i = \sum_{j=1}^{n}|r_{ij}| = \sum_{j\neq i}\frac{|a_{ij}|}{|a_{ii}|} \le 1 = e_i.$$
Thus $|R|e \le e$, where this and some of the following inequalities between vectors are to be read componentwise. Therefore we get
$$|R|^{l+1}e \le |R|^le \le e$$
for all $l \in \mathbb{N}$.
Let $t^{(l)} = e - |R|^le \ge 0$. Then the vectors $t^{(l)}$ are componentwise increasing. Let $\gamma_l$ be the number of non-vanishing components of $t^{(l)}$. We will show that $\gamma_l$ is strictly increasing until it reaches the value $n$. Assume, for contradiction, that $\gamma_{l+1} = \gamma_l = k < n$. Since one row of $A$ satisfies the strict inequality (6.11) we have $|R|e \neq e$ and thus $k > 0$. Then without loss of generality (since we can reorder the rows and columns of $A$) we have
$$t^{(l)} = \begin{pmatrix} a \\ 0 \end{pmatrix}, \qquad t^{(l+1)} = \begin{pmatrix} b \\ 0 \end{pmatrix}$$
where $a, b \in \mathbb{R}^k$ and $a, b > 0$. We can conclude
$$\begin{pmatrix} b \\ 0 \end{pmatrix} = t^{(l+1)} = e - |R|^{l+1}e
\ge |R|e - |R|^{l+1}e = |R|\,t^{(l)}
= \begin{pmatrix} |R_{11}| & |R_{12}| \\ |R_{21}| & |R_{22}| \end{pmatrix}\begin{pmatrix} a \\ 0 \end{pmatrix}.$$
From $a > 0$ we have $|R_{21}| = 0$. Hence $R$ is not irreducible. This implies that $A$ is not irreducible, and we have found the required contradiction. Thus we can conclude that $\gamma_{l+1} > \gamma_l$ whenever $\gamma_l < n$. This proves $t^{(n)} > 0$ componentwise.
Hence $e > |R|^ne$ and we get, using Lemma 1.34, $\rho(R)^n = \rho(R^n) \le \||R|^n\|_\infty < 1$, so that $\rho(R) < 1$ and the Jacobi method converges.
Computational Complexity
The sequence $(x_k)_{k\in\mathbb{N}}$ from the Jacobi method is defined by the relation
$$x_k = D^{-1}\bigl(b - (L+U)x_{k-1}\bigr).$$
Example. Assume that each row of $A$ contains at most $\ell > 0$ non-zero entries outside the diagonal. Then computing $(L+U)x_{k-1}$ requires $\Theta(n)$ operations and performing one step of the Jacobi method has cost $\Theta(n)$ as $n \to \infty$. In this case we get $\alpha = 1$ in Theorem 6.3.
The Method
The method obtained by using $M = L + \frac1\omega D$ and $N = U + \bigl(1 - \frac1\omega\bigr)D$ in (6.1), for the matrices $L$, $D$ and $U$ from (6.7) and some $\omega \in \mathbb{R}$, is called the Successive Over Relaxation (SOR) scheme. The parameter $\omega$ can be chosen to accelerate convergence. The special case $\omega = 1$, i.e. $M = L + D$ and $N = U$, is called the Gauss-Seidel method. For the SOR scheme we get the iteration
$$\Bigl(L + \tfrac1\omega D\Bigr)x_k = b - \Bigl(U + \bigl(1 - \tfrac1\omega\bigr)D\Bigr)x_{k-1}.$$
Since $L + \frac1\omega D$ is lower triangular we can use forward substitution to calculate $x_k$ in each step.
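A Matlab sketch of the SOR iteration, assuming the splitting M = L + D/omega, N = U + (1 - 1/omega)D as above; the function name sor is ours, the lower triangular solve is left to Matlab's backslash, and omega = 1 recovers Gauss-Seidel.

function x = sor(A, b, x, omega, kmax)
% SOR iteration with M = L + D/omega and N = U + (1 - 1/omega)*D.
D = diag(diag(A));
L = tril(A,-1);  U = triu(A,1);
M = L + D/omega;
N = U + (1 - 1/omega)*D;
for k = 1:kmax
    x = M \ (b - N*x);   % M is lower triangular: a forward substitution
end
end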
Error Analysis
The error analysis of the SOR and Gauss-Seidel methods is similar to the analysis for the Jacobi method from the previous section. The convergence properties of the SOR scheme are determined by the matrix
$$R_\omega = \Bigl(L + \tfrac1\omega D\Bigr)^{-1}\Bigl(U + \bigl(1 - \tfrac1\omega\bigr)D\Bigr).$$
The following summarises some sucient criteria for convergence of the Gauss-Seidel method.
Computational Complexity
The speed of convergence and thus the cost of the method depends on the choice of the parameter $\omega$. For linear systems arising from discretisation of certain elliptic PDEs, optimising $\omega$ can lead to an order of magnitude improvement in efficiency, when compared with the Gauss-Seidel and Jacobi methods. This arises through decreasing $\beta$ in Theorem 6.3 for an appropriate choice of $\omega$. In the same context Gauss-Seidel improves over Jacobi, but not by an order of magnitude.
In the remaining part of this chapter we will consider nonlinear methods for solving (SLE), i.e. methods where the map from $x_k$ to $x_{k+1}$ is no longer linear. We restrict ourselves to the case where the matrix $A$ is Hermitian and positive definite. This section explains the general approach. The remaining two sections of the chapter examine two specific instances of nonlinear methods in detail.
The fundamental observation for the methods we consider here is that the vector equation $Ax = b$ is equivalent to $\|Ax - b\| = 0$ for any vector norm $\|\cdot\|$. Since we assume that $A$ is Hermitian and positive definite, $\langle x, y\rangle_A = \langle x, Ay\rangle$ defines an inner product and $\|x\|_A^2 = \langle x, Ax\rangle$ defines the associated vector norm. The matrix $A^{-1}$ is again Hermitian and positive definite and $\|x\|_{A^{-1}}^2 = \langle x, A^{-1}x\rangle$ defines the associated vector norm.
Lemma 6.8. Let $A \in \mathbb{C}^{n\times n}$ be Hermitian and positive definite and $b \in \mathbb{C}^n$. Then $x \in \mathbb{C}^n$ solves $Ax = b$ if and only if $x$ is the minimiser of the function
$$g(y) = \tfrac12\|Ay - b\|_{A^{-1}}^2, \qquad y \in \mathbb{C}^n. \tag{6.12}$$
In this section we will consider iterative methods which solve (SLE) by constructing sequences $(x_k)_{k\in\mathbb{N}}$ which converge to the unique minimum of the function $g$. Using the residual $r_k = b - Ax_k$ and the error $e_k = x - x_k$, where $x$ is the exact solution of $Ax = b$, we can write
$$g(x_k) = \tfrac12\|r_k\|_{A^{-1}}^2 = \tfrac12\|e_k\|_A^2.$$
Thus minimising $g$ corresponds to minimising the length of the residual vector in the $\|\cdot\|_{A^{-1}}$-norm or, equivalently, to minimising the error for the exact solution in the $\|\cdot\|_A$-norm.
The Method
The two methods that we study below both take the form
$$x_k = x_{k-1} + \alpha_{k-1}d_{k-1}, \tag{6.13}$$
where $d_{k-1} \in \mathbb{C}^n\setminus\{0\}$ is called the search direction and the scalar $\alpha_{k-1} \in \mathbb{C}$ is called the step length. The step length is chosen so that, given $x_{k-1}$ and $d_{k-1}$,
$$\alpha_{k-1} = \operatorname*{argmin}_{\alpha\in\mathbb{C}}\,g(x_{k-1} + \alpha d_{k-1}).$$
A short calculation gives
$$\frac{d}{d\varepsilon}g(y + \varepsilon z)\Big|_{\varepsilon=0} = \operatorname{Re}\langle z, Ay - b\rangle. \tag{6.15}$$
At the minimum we find
$$0 = \frac{d}{d\varepsilon}g\bigl(x_{k-1} + \alpha_{k-1}d_{k-1} + \varepsilon d_{k-1}\bigr)\Big|_{\varepsilon=0}
= \operatorname{Re}\bigl\langle d_{k-1}, A(x_{k-1} + \alpha_{k-1}d_{k-1}) - b\bigr\rangle
= \operatorname{Re}\bigl\langle d_{k-1}, -r_{k-1} + \alpha_{k-1}Ad_{k-1}\bigr\rangle.$$
Since the same holds with $d_{k-1}$ replaced by $\varepsilon d_{k-1}$ for every $\varepsilon \in \mathbb{C}$, we conclude $\langle d_{k-1}, -r_{k-1} + \alpha_{k-1}Ad_{k-1}\rangle = 0$ and thus
$$\alpha_{k-1} = \frac{\langle d_{k-1}, r_{k-1}\rangle}{\|d_{k-1}\|_A^2}.$$
Hence we get the following algorithm.
1: for $k = 1, 2, 3, \dots$ do
2:    compute $d_{k-1}$
3:    $\alpha_{k-1} = \dfrac{\langle d_{k-1}, r_{k-1}\rangle}{\|d_{k-1}\|_A^2}$
4:    $x_k = x_{k-1} + \alpha_{k-1}d_{k-1}$
5: end for
The steepest descent method and the conjugate gradient method, discussed below, are in-
stances of this algorithm, employing dierent choices for the search directions dk .
Error Analysis
The error analysis for nonlinear methods is done in two steps. We rst study how fast the
value kek k2A = 2g(xk ) decays. This analysis depends on the specic choice of search directions dk
and is presented for the two methods below, separately. In a second step we can use this result
to estimate the error of the approximate solution xk in the Euclidean norm:
Lemma 6.9. Assume $\|e_k\|_A \le c\,q^k\|e_0\|_A$ for all $k \in \mathbb{N}$, for some constants $q, c > 0$. Then
$$\|e_k\|_2 \le \sqrt{\kappa(A)}\;c\,q^k\,\|e_0\|_2, \qquad k \in \mathbb{N}.$$
Proof. Denoting by $\lambda_{\min}$ and $\lambda_{\max}$ the extreme eigenvalues of $A$, we have
$$\|e_k\|_2^2 \le \frac{1}{\lambda_{\min}}\|e_k\|_A^2 \le \frac{c^2q^{2k}}{\lambda_{\min}}\|e_0\|_A^2 \le \frac{\lambda_{\max}}{\lambda_{\min}}\,c^2q^{2k}\,\|e_0\|_2^2.$$
This completes the proof.
For future reference we state the following relation, which describes how the residual error changes in each step:
$$r_k = r_{k-1} - \alpha_{k-1}Ad_{k-1}, \qquad k \in \mathbb{N}. \tag{6.16}$$
The Method
The steepest descent method chooses a search direction which makes a step in the direction of the (locally) steepest descent for $g$. From equation (6.14) we get
$$g(x_k + \alpha d_k) = g(x_k) - \alpha\operatorname{Re}\langle d_k, r_k\rangle + \frac{\alpha^2}{2}\|d_k\|_A^2 \tag{6.17}$$
for all $\alpha \in \mathbb{R}$. Thus, the direction of steepest descent corresponds to the direction $d_k$ which maximises $\operatorname{Re}\langle d_k, r_k\rangle$ while keeping the length of $d_k$ fixed. Since $d_k$ is only determined up to a scalar factor, we can choose
$$d_k = r_k, \qquad k \in \mathbb{N}_0,$$
in (6.13). With this choice, the step length becomes
$$\alpha_k = \frac{\|r_k\|^2}{\|r_k\|_A^2}$$
and we get the following algorithm.
1: for $k = 1, 2, \dots$ do
2:    $r_{k-1} = b - Ax_{k-1}$
3:    if $\|r_{k-1}\| \le \varepsilon_r$ then
4:        return $x_{k-1}$
5:    end if
6:    $\alpha_{k-1} = \|r_{k-1}\|^2/\|r_{k-1}\|_A^2$
7:    $x_k = x_{k-1} + \alpha_{k-1}r_{k-1}$
8: end for
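The steepest descent loop, transcribed into Matlab for Hermitian positive definite A; the function name steepest_descent is ours.

function x = steepest_descent(A, b, x, tol_r, kmax)
% Steepest descent for Hermitian positive definite A.
for k = 1:kmax
    r = b - A*x;                 % residual, which is also the search direction
    if norm(r) <= tol_r
        return
    end
    alpha = (r'*r) / (r'*(A*r)); % alpha_k = ||r||^2 / ||r||_A^2
    x = x + alpha*r;
end
end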
Error Analysis
The following result establishes the convergence of the method.
Theorem 6.10. Let $A$ be Hermitian and positive definite. Then the rate of convergence of the steepest descent algorithm for solving (SLE) is given by
$$\|x_k - x\|_A \le \bigl(1 - 1/\kappa(A)\bigr)^{k/2}\,\|x_0 - x\|_A, \qquad k \in \mathbb{N},$$
where $\kappa(A)$ is the condition number of $A$ in the 2-norm.
Proof. By substituting $\alpha = \alpha_{k-1}$ into equation (6.17) we get
$$g(x_k) \le \Bigl(1 - \frac{\|r_{k-1}\|^4}{\lambda_{\max}\|r_{k-1}\|^2\cdot\lambda_{\min}^{-1}\|r_{k-1}\|^2}\Bigr)g(x_{k-1})
= \Bigl(1 - \frac{1}{\kappa(A)}\Bigr)g(x_{k-1}).$$
Remarks.
1. If the matrix $A$ is ill-conditioned then the method converges slowly. Ill-conditioning occurs in the 2-norm when the eigenvalues of $A$ are very different: $\lambda_{\max} \gg \lambda_{\min}$. In this case the steepest descent direction is almost orthogonal to the direction in which the solution lies.
2. Since the steepest descent method uses $d_k = r_k$, the search directions satisfy $r_k = r_{k-1} - \alpha_{k-1}Ar_{k-1}$ by (6.16). Using induction we get
$$r_k \in \operatorname{span}\{r_0, Ar_0, \dots, A^kr_0\} \tag{6.18}$$
for all $k \in \mathbb{N}$.
Computational Complexity
We can achieve
$$\|x_k - x\|_A \le \varepsilon \tag{6.19}$$
by taking $k$ to be the smallest integer greater than
$$k^\sharp = \frac{\ln\varepsilon^{-2} + \ln\|x_0 - x\|_A^2}{\ln\bigl(1 - \kappa(A)^{-1}\bigr)^{-1}}. \tag{6.20}$$
Theorem 6.10 shows that the steepest descent method can behave quite poorly on ill-conditioned
problems. The conjugate gradient method, and its pre-conditioned variants, go a long way to
ameliorating this diculty, at little extra cost when compared to the method of steepest descents.
Again, we assume that A is Hermitian and positive denite. The conjugate gradient method
is similar to the steepest descent method but instead of making steps in direction of the steepest
descent it restricts the increments xk xk1 to lie in the Krylov subspace
Kk (A, r0 ) = span{r0 , Ar0 , A2 r0 , . . . , Ak1 r0 }.
Since the spaces Kk (A, r0 ) are increasing, this leads to xk x0 + Kk (A, r0 ) for all k and, as we
will see, the algorithm even chooses xk to minimise the function g from (6.12) on this space.
An underlying, general approach to the design of iterative methods for systems of linear
equations (and also for the eigenvalue problem) is as follows: seek approximate solutions which
at the k th step of the iteration use only increments in Kk (A, r0 ). Such methods are natural when
matrix-vector multiplication by A is cheap to compute. Equation (6.18) shows that the steepest
descent method follows this principle. The conjugate gradient method can be implemented al-
most as easily as the method of steepest descents, but since it nds the optimal approximation to
x from Kk (A, r0 ) we expect it to perform better. Indeed, the CG method converges considerably
faster on ill-conditioned problems. Many other iterative methods also use the idea of nding
approximations from the Krylov subspace.
The Method
Definition 6.13. Given $A$ Hermitian positive definite, we say that the set $\{d_0, \dots, d_k\}$ is $A$-orthogonal (or conjugate) if
$$\langle d_i, d_j\rangle_A = 0 \qquad\text{for all } i \neq j.$$
The following lemma shows that choosing the search directions $d_k$ to be $A$-orthogonal is advantageous for non-linear iterative methods.
Lemma 6.14. Assume that the search directions $\{d_0, \dots, d_{k-1}\}$ form an $A$-orthogonal set. Then $x_k$, given by (6.13), minimises $g$ over $x_0 + \operatorname{span}\{d_0, \dots, d_{k-1}\}$.
Proof. Define $G : \mathbb{C}^k \to \mathbb{R}$ by
$$G(\alpha) = g\Bigl(x_0 + \sum_{l=0}^{k-1}\alpha_ld_l\Bigr), \qquad \alpha \in \mathbb{C}^k.$$
Then $G$ is a convex quadratic function and has a unique minimum where all directional derivatives vanish. Using (6.15) with $y = x_0 + \sum_{l=0}^{k-1}\alpha_ld_l$ and $z = d_m$ we find
$$\operatorname{Re}\Bigl\langle d_m,\;Ax_0 + \sum_{l=0}^{k-1}\alpha_lAd_l - b\Bigr\rangle = 0$$
for $m = 0, \dots, k-1$, and thus, by $A$-orthogonality, the minimising coefficients satisfy $\alpha_m\|d_m\|_A^2 = \langle d_m, r_0\rangle = \langle d_m, r_m\rangle$. This shows that the minimum is, indeed, attained for the step lengths $\alpha_m$ used in (6.13).
The steepest descent method also chooses $x_k \in x_0 + \operatorname{span}\{d_0, \dots, d_{k-1}\}$, but minimises $g$ only along a line instead of over all of the affine sub-space. Thus, any method with $A$-orthogonal search directions will be at least as good as the steepest descent method.
In general it is expensive to construct orthogonal sets: something like a Gram-Schmidt orthogonalisation is required. Thus to make a viable method it remains to try and construct an iteration which iteratively generates an $A$-orthogonal set in a cheap fashion. The idea of the conjugate gradient method is to construct the $A$-orthogonal set by using the iteration
$$d_k = r_k + \beta_kd_{k-1}, \qquad \beta_k = -\frac{\langle d_{k-1}, r_k\rangle_A}{\|d_{k-1}\|_A^2}.$$
This clearly ensures that $d_k$ is $A$-orthogonal to $d_{k-1}$; we will see that it is possible to choose $d_0$ such that the entire set $\{d_0, \dots, d_k\}$ is an $A$-orthogonal set.
A preliminary form of the resulting algorithm is given by the following procedure: choose $x_0, d_0 \in \mathbb{C}^n$ and let $r_0 = b - Ax_0$. Then, for $k = 1, 2, \dots$ define, in the given order,
$$\alpha_{k-1} = \frac{\langle d_{k-1}, r_{k-1}\rangle}{\|d_{k-1}\|_A^2}, \qquad
x_k = x_{k-1} + \alpha_{k-1}d_{k-1},$$
$$r_k = r_{k-1} - \alpha_{k-1}Ad_{k-1}, \qquad
\beta_k = -\frac{\langle d_{k-1}, r_k\rangle_A}{\|d_{k-1}\|_A^2}, \qquad
d_k = r_k + \beta_kd_{k-1}. \tag{6.21}$$
Here we have used (6.16) to compute $r_k$ inductively. This allows us to compute all required quantities with only one matrix-vector multiplication per step (to calculate $Ad_{k-1}$).
Lemma 6.15. Let $x_1, \dots, x_k$ be given by the conjugate gradient method (6.21) with $d_0 = r_0$. Assume $r_0, \dots, r_k \neq 0$ and $d_0, \dots, d_{k-1} \neq 0$. Then $d_k \neq 0$ and $\langle d_{k-1}, r_k\rangle = \langle r_{k-1}, r_k\rangle = 0$. Furthermore
$$\alpha_{k-1} = \frac{\|r_{k-1}\|^2}{\|d_{k-1}\|_A^2} > 0, \qquad \beta_k = \frac{\|r_k\|^2}{\|r_{k-1}\|^2} > 0.$$
Proof. From the definition of $r_k$ and the definition of $\alpha_{k-1}$ we get
$$\langle d_{k-1}, r_k\rangle = \langle d_{k-1}, r_{k-1}\rangle - \alpha_{k-1}\langle d_{k-1}, Ad_{k-1}\rangle = 0,$$
so the first orthogonality relation holds; moreover $\langle d_k, r_k\rangle = \|r_k\|^2 + \beta_k\langle d_{k-1}, r_k\rangle = \|r_k\|^2 \neq 0$ and thus $d_k \neq 0$. For $k = 1$ we have $\langle r_{k-1}, d_{k-1}\rangle_A = \langle d_0, d_0\rangle_A = \|d_{k-1}\|_A^2$. For $k > 1$ we can solve the equation defining $d_{k-1}$ for $r_{k-1}$ to get $\langle r_{k-1}, d_{k-1}\rangle_A = \langle d_{k-1} - \beta_{k-1}d_{k-2}, d_{k-1}\rangle_A = \|d_{k-1}\|_A^2$. Hence
$$\langle r_{k-1}, r_k\rangle = \langle r_{k-1}, r_{k-1}\rangle - \alpha_{k-1}\langle r_{k-1}, Ad_{k-1}\rangle = 0,$$
which is the second orthogonality relation from the claim.
For $k = 1$ we have $\langle d_{k-1}, r_{k-1}\rangle = \langle r_0, r_0\rangle = \|r_{k-1}\|^2$. For $k > 1$ we can use the definition of $d_{k-1}$ to get $\langle d_{k-1}, r_{k-1}\rangle = \langle r_{k-1} + \beta_{k-1}d_{k-2}, r_{k-1}\rangle = \|r_{k-1}\|^2$. Thus we get
$$\alpha_{k-1} = \frac{\langle d_{k-1}, r_{k-1}\rangle}{\|d_{k-1}\|_A^2} = \frac{\|r_{k-1}\|^2}{\|d_{k-1}\|_A^2} > 0.$$
Finally, since $\alpha_{k-1} > 0$, we can solve the definition of $r_k$ for $Ad_{k-1}$ to get
$$\beta_k = -\frac{\langle Ad_{k-1}, r_k\rangle}{\|d_{k-1}\|_A^2}
= -\frac{\langle r_{k-1} - r_k, r_k\rangle}{\alpha_{k-1}\|d_{k-1}\|_A^2}
= \frac{\|r_k\|^2}{\|r_{k-1}\|^2} > 0.$$
A consequence of Lemma 6.15 is that, assuming we have not already found the exact solution, we will have $d_k \neq 0$ and thus all the expressions in (6.21) make sense. We can now prove that the conjugate gradient method does indeed produce an $A$-orthogonal set of search directions.
Lemma 6.16. Let $x_1, \dots, x_k$ be given by the conjugate gradient method (6.21) with $d_0 = r_0$. Then $\{d_0, \dots, d_{k-1}\} \subseteq \mathcal{K}_k(A, r_0)$ is an $A$-orthogonal set.
Proof. We first prove by induction that
$$\operatorname{span}\{d_0, \dots, d_l\} = \operatorname{span}\{r_0, \dots, r_l\} = \operatorname{span}\{r_0, Ar_0, \dots, A^lr_0\} \tag{6.22}$$
for $l = 0, \dots, k-1$. Since $d_0 = r_0$, the statement holds for $l = 0$. Now let $l > 0$ and assume that the statement holds for $l-1$. Then, since $\alpha_{l-1} \neq 0$, we can solve the definition of $r_l$ for $Ad_{l-1}$ to get $Ad_{l-1} \in \operatorname{span}\{r_{l-1}, r_l\}$ and, conversely, $r_l \in \operatorname{span}\{r_{l-1}, Ad_{l-1}\}$. Together with $d_l = r_l + \beta_ld_{l-1}$ and the induction hypothesis this gives $\operatorname{span}\{d_0, d_1, \dots, d_l\} = \operatorname{span}\{r_0, \dots, r_l\} \subseteq \operatorname{span}\{r_0, Ar_0, \dots, A^lr_0\}$; the reverse inclusion follows since $A^lr_0 = A(A^{l-1}r_0) \in A\operatorname{span}\{d_0, \dots, d_{l-1}\} \subseteq \operatorname{span}\{r_0, \dots, r_l\}$. This completes the proof of (6.22).
Assume for induction that $\{d_0, \dots, d_{k-1}\}$ is an $A$-orthogonal set. This is true for $k = 1$. Now let $k > 1$. From the definition of $r_{j+1}$ we get
$$\langle d_i, r_{j+1}\rangle = \langle d_i, r_j\rangle - \alpha_j\langle d_i, Ad_j\rangle = \langle d_i, r_j\rangle$$
for $j = i+1, \dots, k-1$ and thus, by induction, $\langle d_i, r_j\rangle = \langle d_i, r_{i+1}\rangle = 0$ for all $0 \le i < j \le k$, where the last identity comes from Lemma 6.15. Since $Ad_i \in \operatorname{span}\{r_i, r_{i+1}\} \subseteq \operatorname{span}\{d_0, \dots, d_{i+1}\}$, we get
$$\langle d_i, r_k\rangle_A = \langle Ad_i, r_k\rangle = 0$$
for $i = 0, \dots, k-2$. Using this result we can prove that $d_k$ is $A$-orthogonal to $d_0, \dots, d_{k-1}$: for $i < k-1$ we get
$$\langle d_i, d_k\rangle_A = \langle d_i, r_k\rangle_A + \beta_k\langle d_i, d_{k-1}\rangle_A = \langle d_i, r_k\rangle_A = 0,$$
and $\langle d_{k-1}, d_k\rangle_A = 0$ by the construction of $\beta_k$. This completes the proof.
The CG-algorithm is normally implemented in a form which exploits the expressions for $\alpha_k$, $\beta_k$ as derived in Lemma 6.15. This allows the required number of arithmetic operations to be kept small.
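The listing itself is not reproduced here; the following Matlab sketch shows one such implementation, using the expressions alpha_{k-1} = ||r_{k-1}||^2/||d_{k-1}||_A^2 and beta_k = ||r_k||^2/||r_{k-1}||^2 from Lemma 6.15 and the residual stopping rule used earlier in the chapter; the function name cg is ours.

function x = cg(A, b, x, tol_r, kmax)
% Conjugate gradient method for Hermitian positive definite A.
r = b - A*x;
d = r;
rho = r'*r;
for k = 1:kmax
    if sqrt(rho) <= tol_r
        return
    end
    Ad    = A*d;
    alpha = rho / (d'*Ad);       % alpha_{k-1} = ||r_{k-1}||^2 / ||d_{k-1}||_A^2
    x     = x + alpha*d;
    r     = r - alpha*Ad;
    rho_new = r'*r;
    beta  = rho_new/rho;         % beta_k = ||r_k||^2 / ||r_{k-1}||^2
    d     = r + beta*d;
    rho   = rho_new;
end
end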
Error Analysis
We have already seen, in Lemma 6.14, that any method with A-orthogonal search directions will
have smaller errors than the steepest descent method. The following theorem shows a surprising
result: these methods, even though iterative by construction, obtain the exact solution of (SLE)
after at most n iterations.
Theorem 6.17. If $\{d_0, \dots, d_{n-1}\}$ form an $A$-orthogonal set in $\mathbb{C}^n$ then, for any starting vector $x_0$, the sequence $(x_k)$ given by (6.13) reaches the exact solution $x$ of (SLE) in at most $n$ steps.
Proof. The set $\{d_0, \dots, d_{n-1}\}$ forms an $A$-orthogonal basis for $\mathbb{C}^n$ and so we may write
$$x - x_0 = \sum_{k=0}^{n-1}\delta_kd_k, \qquad \delta_k = \frac{\langle d_k, x - x_0\rangle_A}{\|d_k\|_A^2}.$$
Conjugacy gives
$$\langle d_k, x_k - x_0\rangle_A = 0,$$
so that $\delta_k = \langle d_k, x - x_k\rangle_A/\|d_k\|_A^2 = \langle d_k, r_k\rangle/\|d_k\|_A^2 = \alpha_k$, and thus
$$x - x_0 = \sum_{k=0}^{n-1}\alpha_kd_k = x_n - x_0,$$
so that $x = x_n$ as required.
Corollary 6.18. The conjugate gradient algorithm, for any starting vector x0 , reaches the exact
solution of (SLE) in at most n steps.
Proof. Assume that the algorithm has not converged in n 1 steps; otherwise the proof is
complete. Then the set {d0 , , dn1 } is A-orthogonal by Lemma 6.16. Hence the result follows
from Theorem 6.17.
Remark. When the computation is performed on a computer, rounding errors will cause $\hat r_n \neq 0$, where $\hat r_n$ is the calculated value for the residual error $r_n$. In practice one just treats the method as an iterative method and continues the iteration until the residual error $\|\hat r_k\|$ is small enough. Typically the exit criterion will even be reached for $k < n$.
In order to understand how the conjugate gradient method performs iteratively, we describe an error analysis based around the theory of polynomial approximation. In the following we let
$$\sigma(A) := \{\lambda \in \mathbb{C} \mid \lambda \text{ is an eigenvalue of } A\}$$
and we let $\mathbb{P}_k$ denote the set of all polynomials $p$ of degree $\le k$ with $p(0) = 1$.
Theorem 6.19. If the conjugate gradient algorithm has not converged at step $k$, then
$$\|x_k - x\|_A = \inf_{p\in\mathbb{P}_k}\|p(A)(x_0 - x)\|_A
\le \inf_{p\in\mathbb{P}_k}\,\max_{\lambda\in\sigma(A)}|p(\lambda)|\;\|x_0 - x\|_A.$$
Proof. Since $x_k \in x_0 + \mathcal{K}_k(A, r_0)$ we may write $x_k = x_0 + \sum_{j=0}^{k-1}\gamma_jA^jr_0$ for some coefficients $\gamma_j$, and hence
$$e_k = x - x_k = x - x_0 - \sum_{j=0}^{k-1}\gamma_jA^jr_0 = e_0 - \sum_{j=0}^{k-1}\gamma_jA^{j+1}e_0.$$
Defining $p \in \mathbb{P}_k$ by
$$p(\lambda) = 1 - \sum_{j=1}^{k}\gamma_{j-1}\lambda^j,$$
we get $e_k = p(A)e_0$. Expanding $e_0 = \sum_{j=1}^{n}a_jv_j$ in an orthonormal basis of eigenvectors $v_j$ of $A$ with eigenvalues $\lambda_j$, we obtain
$$\|e_k\|_A^2 \le \max_{\lambda\in\sigma(A)}|p(\lambda)|^2\sum_{j=1}^{n}|a_j|^2\lambda_j
= \max_{\lambda\in\sigma(A)}|p(\lambda)|^2\,\|e_0\|_A^2.$$
Minimising over the choice of coefficients, and hence over $p \in \mathbb{P}_k$, gives the claim.
By choosing an appropriately scaled and shifted Chebyshev polynomial and applying the
previous result, the following may be shown:
Theorem 6.20. Let $\kappa$ be the condition number of a symmetric positive definite matrix $A$ in the Euclidean norm. Then the error in step $k$ of the conjugate gradient algorithm for solving (SLE) satisfies
$$\|x_k - x\|_A \le 2\left[\left(\frac{\sqrt\kappa+1}{\sqrt\kappa-1}\right)^{k} + \left(\frac{\sqrt\kappa+1}{\sqrt\kappa-1}\right)^{-k}\right]^{-1}\|x_0 - x\|_A.$$
Proof. See Exercise 6-6.
Computational Complexity
We make the same assumptions as for the steepest descent method, and show the decrease in
computational cost that can be obtained by using the conjugate gradient method.
Theorem 6.21. Under Assumption 6.11 the computational cost of using the conjugate gradient method to achieve (6.19) is bounded above by $c\,n^{\max\{\alpha,1\}+\beta/2}\ln\varepsilon^{-1}$, for some constant $c$ independent of $n$ and $\varepsilon$.
Proof. The first item of the assumption ensures that each step of the iteration costs $\Theta(n^{\max\{\alpha,1\}})$. If $\kappa \gg 1$ then Theorem 6.20 gives
$$\|x_k - x\|_A \lesssim 2\Bigl(1 - \frac{2}{\sqrt\kappa}\Bigr)^k\|x_0 - x\|_A.$$
Combining the second and third items and using this expression gives the desired result.
Remark. In an applied context the most important form of the conjugate gradient method is
its preconditioned form. Roughly speaking the method is applied to the linear system
M Ax = M b
where the preconditioner M is chosen to try and improve the conditioning of MA over A, yet
have M easy to invert. In practical terms the algorithm diers from standard conjugate gradient
only by the addition of one multiplication with M per step.
Bibliography
Linear iterative methods for (SLE) are described in the book [Saa97]. Theorem 6.7 is proved in
[SB02]. For discussion of Theorem 6.20 see [TB97]; for related results see [NW06]. The SOR
method is analysed in detail in [LT88].
The nonlinear iterative methods that we discuss are constructed through minimisation or
optimisation techniques. As such they are strongly related to the topic of optimisation in general.
This topic is well overviewed in [NW06]; the exposition of the conjugate gradient method there
forms the basis for ours. A good discussion of preconditioned versions of the conjugate gradient
method may also be found in this book.
Exercises
Exercise 6-1. The purpose of this exercise is to introduce you to the idea of iteration. Many
problems are simply too large to solve by the methods we have looked at so far too much
computer time or memory is required if the dimension n of the matrix is large. For linear systems
Ax = b, A Rnn , b Rn (6.23)
this will lead to the idea of generating approximating sequences {xk } which hopefully converge
to x as k . For eigenvalue problems which are large we will study methods which generate
approximations to a subset of the eigenvalues, and not to all of them. (Note that for eigenvalue
problems it is necessary to use iterative methods in all cases, provided the matrix has dimension
bigger than 4; the main dierence when the problem is large is that we solve for only a subset
of the eigenvalues).
(i) To show why iterative methods are necessary construct a matrix A with n = 1000 as follows:
type
>> A=0.0005*rand(1000);
>> b=rand(1000,1);
>> for i=1:1000; A(i,i)=1; end;
This creates a matrix A with 1 on the diagonal and random entries uniformly distributed in $[0, 5\times10^{-4}]$ on the off-diagonals. Now try solving (6.23). Type
>> x=A\b;
Almost certainly your computer cannot handle a problem of this size. So type ctrl c to stop it from trying. (If you are lucky enough to be working on a machine where this problem is solved easily then repeat the exercise replacing 1000 by 10000, or by some matrix sufficiently large that Matlab fails.) Keep the (original 1000 x 1000) matrix A for use in (iii) where we will use a more successful approach to solve this problem.
(ii) We look at the simplest iterative method, known as Jacobi. Any matrix A can be written uniquely as
$$A = D + L + U$$
where $D$ is diagonal, $L$ is zero on and above the diagonal and $U$ is zero on and below the diagonal. Assuming that $D$ is invertible, consider a sequence $\{x_k\}_{k=1}^{\infty}$ satisfying
$$x_{k+1} = D^{-1}\bigl[b - Lx_k - Ux_k\bigr].$$
>> L=tril(A,-1);
>> U=triu(A,1);
>> O=L+U;
This creates L and U as defined in (ii) (a Matlab question: why is this so?). Type
>> j=1;
>> y=ones(1000,1);
>> n(j)=2*j;
>> for i=1:n(j); y=b-O*y; end
>> rn(j)=norm(A*y-b)
What does this tell you? Repeat the above 5 lines of code with j = 2, 3, . . . , 5 and type:
>> plot(n,rn)
What do you observe? Also type
>> plot(n,log(rn))
(iv) Under what conditions on A will the Jacobi iteration converge to the correct solution of
(6.23)?
Exercise 6-2. Prove that the Jacobi iteration that you implemented in Exercise 6-1 converges
for the matrix A given there.
Exercise 6-3. Which of the following matrices are irreducible? For which of these matrices
does the Jacobi-method converge?
4 2 1 1 3 8
2 1 , 3 25 4 , 4 2
1 4 1 2 1 1
Exercise 6-4. Show that the Jacobi method for the solution of
1 1 x1 1
1 10 2 x2 = 2
2 1 x3 3
converges and, for the iteration starting with x0 = 0, give an upper bound on the number of
steps required to get the relative error of the result below 106 .
where b = (1, )T . Calculate x1 from the steepest descents algorithm, starting from x0 = 0.
Which component of the solution is better approximated after one step? Why is this?
Exercise 6-6. Prove Theorem 6.20 by choosing a Chebyshev polynomial scaled and shifted so
that its min/max properties hold on the interval [min (A), max (A)]. Then prove that
where Pk is the set of all polynomials p of degree less than or equal to k with p(0) = 1,
y Rn , and (A) is the set of eigenvalues of A.
Chapter 7
Least Squares Problems
Theorem 7.1. A vector $x \in \mathbb{C}^n$ minimises $\|Ax - b\|_2$, for $A \in \mathbb{C}^{m\times n}$ and $b \in \mathbb{C}^m$, if and only if $Ax - b$ is orthogonal to $\operatorname{range}(A)$.
Proof. Let $g(x) = \frac12\|Ax - b\|_2^2$. Then minimising $\|Ax - b\|_2$ is equivalent to minimising the function $g$.
Assume that $Ax - b$ is orthogonal to $\operatorname{range}(A)$ and let $y \in \mathbb{C}^n$. Then $Ay - Ax = A(y - x)$ is orthogonal to $Ax - b$ and Pythagoras' theorem gives
$$\|Ay - b\|_2^2 = \|A(y - x)\|_2^2 + \|Ax - b\|_2^2 \ge \|Ax - b\|_2^2,$$
so $x$ is a minimiser.
Conversely, assume that $x$ minimises $g$ and let $y \in \mathbb{C}^n$. Then
$$0 = \frac{d}{d\varepsilon}g(x + \varepsilon y)\Big|_{\varepsilon=0}
= \frac12\bigl(\langle Ay, Ax - b\rangle + \langle Ax - b, Ay\rangle\bigr)
= \operatorname{Re}\langle Ax - b, Ay\rangle$$
and
$$0 = \frac{d}{d\varepsilon}g(x + i\varepsilon y)\Big|_{\varepsilon=0}
= \frac12\bigl(-i\langle Ay, Ax - b\rangle + i\langle Ax - b, Ay\rangle\bigr)
= -\operatorname{Im}\langle Ax - b, Ay\rangle.$$
This shows $\langle Ax - b, Ay\rangle = 0$ and thus $Ax - b \perp Ay$ for all $y \in \mathbb{C}^n$.
Corollary 7.2. A vector $x \in \mathbb{C}^n$ solves (LSQ) if and only if
\[
A^*Ax = A^*b. \tag{7.1}
\]
Proof. From the theorem we know that $x$ solves (LSQ) if and only if $Ax - b \perp \mathrm{range}(A)$. This in turn is equivalent to $Ax - b \perp a_i$ for every column $a_i$ of $A$, i.e. to $A^*(Ax - b) = 0$.
It is hence of interest to know whether $A^*A$ is invertible. We assume throughout the remainder of this chapter that the singular values of $A$ satisfy
\[
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n > 0,
\]
and in this case $A^*A$ is indeed invertible. Under this assumption, (LSQ) is solved by
\[
x = A^{\dagger} b,
\]
where the pseudo-inverse $A^{\dagger} = (A^*A)^{-1}A^*$ is defined in Definition 3.6. The system (7.1) is called the normal equations for (LSQ).
Recall the definition of the 2-norm condition number for (LSQ) from Chapter 3. Note that this is large when $A$ is close to being rank-deficient, so that $\sigma_n$ is small. In many practical applications of (LSQ), such as arise naturally in statistics, $A$ is indeed close to being rank-deficient. We will comment on this issue when we discuss the stability of the algorithms.
The Method
A natural approach to the solution of the normal equations is as follows:
1: calculate $A^*A$ and $A^*b$
2: solve $(A^*A)x = A^*b$
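As an illustration, a minimal Matlab sketch of this approach might read as follows; it assumes $A$ and $b$ are given and uses the fact that $A^*A$ is Hermitian positive definite under the standing assumption $\sigma_n > 0$.
>> G = A'*A;            % Hermitian positive definite since sigma_n > 0
>> c = A'*b;
>> R = chol(G);         % Cholesky factor: G = R'*R with R upper triangular
>> x = R \ (R' \ c);    % two triangular solves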
Computational Complexity
The method is usually implemented via standard algorithms for the matrix-matrix multiplica-
tion and matrix-vector multiplication, together with Cholesky factorisation for solution of the
symmetric linear system.
Theorem 7.3 (Cost of Normal Equations for LSQ). The cost of solving (LSQ) by the Normal Equations approach is, in the usual implementation described above,
\[
C(m,n) \sim mn^2 + \tfrac13 n^3
\]
as $m, n \to \infty$.
Proof. Calculating $A^*A$ naively requires asymptotically $2mn^2$ operations as $m, n \to \infty$; since $A^*A$ is Hermitian we only need to calculate half of the entries, which can be done in $\sim mn^2$ operations. Calculation of $A^*b$ costs $\Theta(mn)$. Solving $(A^*A)x = A^*b$ with Cholesky factorisation requires $\sim \tfrac13 n^3$ operations. Combining these terms gives the result.
Error Analysis
One way that the error analysis proceeds is by proving that the computed solution $\hat{x}$ solves
\[
(A^*A + \Delta A)\,\hat{x} = A^*b + \Delta c
\]
and then using the backward error analysis for (SLE). Because $\kappa(A^*A) = \kappa(A)^2$ in the 2-norm it follows that
\[
\frac{\|\hat{x} - x\|_2}{\|x\|_2} = O\big(\kappa(A)^2\,\varepsilon_m\big).
\]
In practice this method can perform very badly when the matrix $A$ is close to being rank-deficient (so that $\sigma_n$ is small by Lemma 3.8) and the 2-norm condition number of $A$ is large.
The Method
To increase stability, at an increase in computational cost, the following QR based algorithm is
often used in practice.
1: compute the reduced QR factorisation $A = \hat{Q}\hat{R}$
2: form $\hat{Q}^*b$
3: solve $\hat{R}x = \hat{Q}^*b$ by back substitution
This computes the solution of (LSQ) because, using $A = \hat{Q}\hat{R}$ and $\hat{Q}^*\hat{Q} = I$,
\[
A^*Ax = \hat{R}^*\hat{Q}^*\hat{Q}\hat{R}x = \hat{R}^*\hat{R}x
\qquad\text{and}\qquad
A^*b = \hat{R}^*\hat{Q}^*b,
\]
so the normal equations $A^*Ax = A^*b$ are equivalent to $\hat{R}x = \hat{Q}^*b$, $\hat{R}^*$ being invertible since $\sigma_n > 0$.
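In Matlab the same computation might be sketched in two lines (qr(A,0) returns the reduced factorisation; this is only an illustration, not the Householder-based production code discussed below):
>> [Qhat,Rhat] = qr(A,0);   % reduced QR factorisation A = Qhat*Rhat
>> x = Rhat \ (Qhat'*b);    % back substitution for Rhat*x = Qhat'*b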
Computational Complexity
The method is usually implemented by means of Householder based QR factorisation, standard matrix-vector multiplication for $\hat{Q}^*b$ and standard triangular inversion of $\hat{R}$. As for the normal equations approach, we assume that there is a positive $a$, independent of $n$, such that $m \approx an$ as $n \to \infty$.
Theorem 7.4 (Cost of QR for LSQ). Let $\sigma_n(A) > 0$. The cost of solving (LSQ) by the QR approach is, in the usual implementation,
\[
2mn^2 - \tfrac23 n^3 + \Theta(mn + n^2).
\]
Proof. Steps 1 and 2 together have asymptotic cost $C_1(m,n) \sim 2mn^2 - \tfrac23 n^3 + \Theta(mn)$, and step 3 has asymptotic cost $C_2(m,n) = \Theta(n^2)$. Thus we get the desired result.
If $a \gg 1$ then the cost is dominated by $2mn^2$ and is roughly twice that of the normal equations approach.
Error Analysis
The Householder QR approach has considerably improved backward stability properties when
compared with the normal equations based solution method. In particular, the backward error
is small in the following sense.
Theorem 7.5. Let $x$ solve the (LSQ) problem for $(A, b)$ and let $\hat{x}$ be the computed solution via Householder QR factorisation. Then $\hat{x}$ minimises $\|(A + \Delta A)x - (b + \Delta b)\|_2$ for a matrix $\Delta A$ with columns $\Delta a_j$ and a vector $\Delta b$ satisfying, for $\varepsilon_m$ sufficiently small,
\[
\|\Delta a_j\|_2 \le \frac{Cmn\,\varepsilon_m}{1 - Cmn\,\varepsilon_m}\,\|a_j\|_2, \quad j = 1, \dots, n,
\qquad
\|\Delta b\|_2 \le \frac{Cmn\,\varepsilon_m}{1 - Cmn\,\varepsilon_m}\,\|b\|_2.
\]
In contrast to the normal equations method, this bound on the backward error is independent
of the condition number of A. However, practitioners are often interested in the implied error in
estimating the residual $r = b - Ax$. This can still behave poorly for the QR method, particularly
when the matrix A is ill-conditioned.
The Method
The most stable solution of (LSQ), especially useful for problems where $A$ is close to being rank-deficient and $\sigma_n$ is close to zero, is SVD based.
Write the reduced SVD $A = U\Sigma V^*$ and set $x = Vy$. Then the normal equations become
\[
A^*Ax = V\Sigma U^*U\Sigma V^*Vy = V\Sigma^2 y
\qquad\text{and}\qquad
A^*b = V\Sigma U^*b,
\]
so that $\Sigma y = U^*b$, i.e. $y = \Sigma^{-1}U^*b$ and $x = V\Sigma^{-1}U^*b$.
In practice the SVD itself is computed from a QR factorisation followed by reduction to bidiagonal form: if $A = Q\hat{R}$, $\hat{R} = U_1 B V_1^*$ with $B$ bidiagonal, and $B = U_2\Sigma V_2^*$, then $A = (QU_1U_2)\,\Sigma\,(V_1V_2)^*$, and hence $A = U\Sigma V^*$ with orthogonal $U$, $V$ given by $U = QU_1U_2$, $V = V_1V_2$. (Note that the composition of orthogonal matrices is itself orthogonal.) It may be shown that the cost of this algorithm is $2mn^2 + 11n^3 + \Theta(mn + n^2)$.
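A corresponding Matlab sketch of the solution step is given below; it illustrates only the formula $x = V\Sigma^{-1}U^*b$, not the bidiagonalisation-based computation of the SVD itself.
>> [U,S,V] = svd(A,0);      % reduced SVD: A = U*S*V'
>> x = V*(S\(U'*b));        % x = V*inv(S)*U'*b, well defined since sigma_n > 0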
Computational Complexity
These ideas form the basis of the standard implementation of SVD based approaches to (LSQ). As for the normal equations approach, we assume that there is a positive $a$, independent of $n$, such that $m \approx an$ as $n \to \infty$.
Theorem 7.6 (Cost of SVD for LSQ). Let $\sigma_n(A) > 0$. The cost of solving (LSQ) by the SVD approach is, in the usual implementation,
\[
2mn^2 + 11n^3 + \Theta(mn + n^2).
\]
Proof. (Sketch) Computing the SVD costs $2mn^2 + 11n^3 + \Theta(mn + n^2)$; forming $c = U^*b$ costs $\Theta(mn)$; solving $\Sigma w = c$ costs $\Theta(n)$; forming $x = Vw$ costs $\Theta(n^2)$.
If $a \gg 1$ the cost is similar to that of the QR based method. But, for moderate values of $a$, the additional $11n^3$ term means that this method can be considerably more time-consuming than the QR method.
Error Analysis
The SVD approach is by far the most stable of the three methods for (LSQ). This is because it reveals the singular values which are close to zero, and hence possibly reflect rank-deficiency of $A$. Exploiting this fact enables the method to perform well even in situations which are close to rank-deficient and, in particular, where $\sigma_n$ is very close to zero.
Bibliography
The books [TB97] and [Dem97] both contain clear discussions of algorithms for (LSQ); the
statements about complexity of the SVD, and SVD based methods for (LSQ) may be found in
[TB97] for example. More details concerning the error analysis, and Theorem 7.5 in particular,
can be found in [Hig02].
Exercises
Exercise 7-1. The purpose of this question is to introduce you to the content of this chapter.
It is broken into three parts: (i) formulate a least squares problem; (ii) solve it blindly using
Matlab; (iii) introduce the SVD, a major tool used in solving least squares problems.
(i) Let $\{(u^{(i)}, v^{(i)})\}_{i=1}^m$ be a set of points in the $(u,v)$ plane. We wish to find the best straight line fit
\[
v = \alpha u + \beta
\]
to this data. The least squares approach to this problem is to find $(\alpha, \beta) \in \mathbb{R}^2$ minimising
\[
G(\alpha, \beta) = \sum_{i=1}^m |\alpha u^{(i)} + \beta - v^{(i)}|^2.
\]
Writing $x = (\alpha, \beta)^T$, this is a least squares problem $\min_x\|Ax - b\|_2$ with $A \in \mathbb{R}^{m\times 2}$ and $b \in \mathbb{R}^m$ given by $A_{i1} = u^{(i)}$, $A_{i2} = 1$ and $b_i = v^{(i)}$.
(ii) In Matlab, choose a small positive value of delta and type
>> u=rand(100,1);
>> v=2*u+ones(100,1)+delta*rand(100,1);
Construct A and b as defined in (i) and then form
>> x=A\b
This is the least squares solution. What would you expect the solution to be if delta is very
small? Test your hypothesis by repeating the previous construction of A and b for a variety of
small delta.
(iii) As we shall see, there are several competing methods to solve a least squares problem. One
is based on the QR decomposition. A second is based on the SVD (singular value decomposition). To make the matrices shorter type A=A(1:5,:). We will perform operations on this new shortened version of A (corresponding to 5 data points rather than 100). Type
>> [U,X,V]=svd(A)
and discuss the properties of U, X and V (think about columns!). Now type
>> rn=norm(U*X*V'-A)
What does this tell you? Now type
>> [W,S,Z]=svd(A,0)
and compare W, S and Z with U, X and V. Again type
>> rn=norm(W*S*Z'-A)
What does this tell you?
The two factorisations are known (respectively) as the full SVD and the reduced SVD. In our theoretical development we will mainly use the reduced SVD, and will refer to this as the SVD for short.
Exercise 7-2. You are given $m$ data points $(u^{(i)}, v^{(i)})$ where $u^{(i)} \in \mathbb{R}^{n-1}$ and $v^{(i)} \in \mathbb{R}$ for $i = 1, \dots, m$. We would like to find $\alpha \in \mathbb{R}^{n-1}$ and $\beta \in \mathbb{R}$ to minimise
\[
\sum_{i=1}^m \big|\alpha^T u^{(i)} + \beta - v^{(i)}\big|^2.
\]
Show that this problem may be reformulated as a standard least squares problem by specifying appropriate choices of $A \in \mathbb{R}^{m\times n}$ and $b \in \mathbb{R}^m$.
Now let $n = 2$ and generate $m = 50$ random data points of the form
\[
v^{(i)} = u^{(i)} + 10^{-2}\,W^{(i)},
\]
where the $u^{(i)}$ are chosen at random uniformly from $[0, 100]$ and the $W^{(i)}$ are chosen at random from the standard normal distribution (with mean zero and variance one). Solve the resulting least squares problem by the QR method and by using the SVD (make use of Matlab's qr and svd routines). What do you expect the answer to be? Compare the efficiency and accuracy of the two different methods.
Exercise 7-3. Vandermonde matrices again: remember the command vander in Matlab.
a) Consider the polynomial interpolation problem. For each of the cases $l = 2, 4, 6, 8, 10$, generate a square Vandermonde matrix $B^{(l)}$ with the data points $\{u_j\}_{j=1}^l$ chosen from the set $\{1, 2, \dots, l\}$. Estimate the condition numbers of the $B^{(l)}$ with the Matlab routine cond, and plot them as a function of $l$.
b) Now consider the least squares problem of finding the polynomial
\[
v(u) = \sum_{j=1}^{n} \alpha_j u^{j-1}
\]
which best fits the data points $\{(u_j, v_j)\}_{j=1}^m$ in the sense of minimising
\[
\sum_{i=1}^{m} |v(u_i) - v_i|^2.
\]
The matrix $B$ is a rectangular Vandermonde matrix. For each of the cases $l = 2, 4, 6, 8, 10$, generate the rectangular $B^{(l)}$ with $m = l$ and $n = l/2$. Choose the data points $\{u_j\}_{j=1}^l$ from the set $\{1, 2, \dots, l\}$. Estimate the condition numbers of the $B^{(l)}$ with the Matlab routine cond, and plot them as a function of $l$. Discuss what you observe in relation to the first part of the question.
c) Now consider the data points
\[
u_j = j, \quad j = 1, \dots, 9, \qquad u_{10} = 9.0000001,
\qquad
v_j = 1 + u_j + u_j^4 + 10^{-2}\,W^{(j)},
\]
where the $W^{(j)}$ are independently chosen at random from the standard normal distribution (mean zero and variance one). Find the degree nine polynomial interpolant to this data set, using QR factorisation. Then find the best fit polynomial of degree four, using QR to solve the resulting least squares problem. Compare the results you obtain from interpolation and best fit. Use graphical output to illustrate your discussion.
Exercise 7-4. Let $A \in \mathbb{R}^{n\times n}$ be invertible, with singular values
\[
0 < \sigma_1 \le \sigma_2 \le \dots \le \sigma_n.
\]
1. Show that $\|A\|_2 = \sigma_n$ and $\|A^{-1}\|_2 = \sigma_1^{-1}$.
2. Show that
\[
\frac{\sigma_1}{\sigma_n} = \min\Big\{ \frac{\|\Delta A\|_2}{\|A\|_2} : A + \Delta A \text{ is singular} \Big\}.
\]
3. Consider now the least squares problem for a rectangular matrix $B$ and right-hand side $b$, where $b \in \mathbb{R}^m$.
4. If $B$ has singular values $\{\sigma_i\}_{i=1}^n$ then what is the Euclidean norm condition number of the matrix which must be inverted to solve the normal equations?
Exercise 7-5. Show that the least squares solution will yield $Ax = b$ if and only if $b = AA^{\dagger}b$, where $A^{\dagger}$ is the Moore–Penrose inverse of $A$.
Chapter 8
Eigenvalue Problems
In this chapter we will study methods for solution of the eigenvalue problem (EVP): given $A \in \mathbb{C}^{n\times n}$, compute the eigenvalues and eigenvectors of $A$.
A first insight is that finding eigenvalues and finding eigenvectors are equivalent problems. To find an eigenvalue from an eigenvector we proceed as follows: given an eigenvector $x \in \mathbb{C}^n$ we try to find the $\lambda \in \mathbb{C}$ which minimises $\|\lambda x - Ax\|_2$. Since $x$ is an eigenvector, the minimum of $0$ is attained for the corresponding eigenvalue. It is easy to check, for example using Corollary 7.2, that the optimal choice of $\lambda$ is
\[
\lambda = (x^*x)^{-1}x^*(Ax) = \frac{\langle x, Ax\rangle}{\langle x, x\rangle}.
\]
Note that solvability of the normal equations in this case is guaranteed, provided that $x \ne 0$. The preceding considerations show that the following definition provides a natural way to estimate an eigenvalue from an estimate of an eigenvector: the Rayleigh quotient of $A$ is defined by
\[
r_A(x) = \frac{\langle x, Ax\rangle}{\langle x, x\rangle}
\]
for all $x \in \mathbb{C}^n$, $x \ne 0$.
Using this notation, we get that
\[
Ax = r_A(x)\,x
\]
for all eigenvectors $x$ of $A$.
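In Matlab the Rayleigh quotient of a nonzero vector x is computed directly from this formula (x' is the conjugate transpose, matching the inner product used here):
>> r = (x'*A*x)/(x'*x);     % Rayleigh quotient r_A(x); equals the eigenvalue when x is an eigenvector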
Calculating an eigenvector $x$ from an eigenvalue $\lambda$ can be achieved by fixing one component of the vector $x$, say $x_l$, to be $1$ and then solving $n-1$ of the linear equations $Ax = \lambda x$, with $\lambda$ and $x_l$ given. If the eigenvectors corresponding to $\lambda$ have $x_l = 0$ then this system of linear equations will have no solution but, since $x \ne 0$, there will be an $l \in \{1, \dots, n\}$ such that the system can be solved and the solution will be an eigenvector with eigenvalue $\lambda$.
Theorem 1.12 shows that, if we can find the roots of arbitrary polynomials, we can also find the eigenvalues of arbitrary matrices. In fact the two problems are equivalent: for every polynomial we can find a matrix with this polynomial as its characteristic polynomial. To see this let $p(z) = (-1)^n\big(c_0 + c_1 z + \dots + c_{n-1}z^{n-1} + z^n\big)$ be given. Define the companion matrix $C \in \mathbb{C}^{n\times n}$ of the polynomial $p$ by
\[
C = \begin{pmatrix}
0 & & & & -c_0 \\
1 & \ddots & & & -c_1 \\
& \ddots & \ddots & & \vdots \\
& & \ddots & 0 & -c_{n-2} \\
& & & 1 & -c_{n-1}
\end{pmatrix}. \tag{8.1}
\]
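As an illustrative numerical check (the coefficient values below are arbitrary), one can verify in Matlab that the eigenvalues of this companion matrix agree with the roots of the corresponding monic polynomial:
>> c = [2; -3; 1];                                     % illustrative coefficients c_0, c_1, c_2, so n = 3
>> n = length(c);
>> C = [[zeros(1,n-1); eye(n-1)], -c];                 % companion matrix as in (8.1)
>> norm(sort(eig(C)) - sort(roots([1; flipud(c)])))    % of the order of machine precision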
One can show that $\chi_C(z) = \det(C - zI) = p(z)$ (see Exercise 8-8). Hence finding eigenvalues of $C$ is equivalent to the problem of finding the roots of the polynomial $p$. This establishes equivalence between the eigenvalue problem and root finding for polynomials. Regarding this latter problem we know the following:
Theorem 8.2 (Abel, 1824). For every $n \ge 5$ there is a polynomial $p$ of degree $n$ with rational coefficients that has a real root which cannot be expressed by only using rational numbers, addition, subtraction, multiplication, division and taking $k$th roots.
Any possible algorithm will be based on the operations mentioned in the theorem and hence it is not possible to find an algorithm which calculates eigenvalues exactly after a finite number of steps. The conclusion to be drawn from this is that any eigenvalue solver must be iterative: aiming to approximate the solution of the eigenvalue problem in a finite number of steps. We describe a variety of iterative methods for (EVP) in the following sections.
In the remainder of this chapter we will consider the Hermitian eigenvalue problem. As we have seen, if $A \in \mathbb{C}^{n\times n}$ is Hermitian, the matrix has an orthonormal system of eigenvectors and the corresponding eigenvalues are real. In this chapter $x_1, \dots, x_n \in \mathbb{C}^n$ will always denote an orthonormal system of eigenvectors of $A$ and $\lambda_1, \dots, \lambda_n \in \mathbb{R}$ will be the corresponding eigenvalues. We order the eigenvalues such that
\[
|\lambda_1| \ge |\lambda_2| \ge \dots \ge |\lambda_n| \ge 0.
\]
Since our methods will be iterative, the algorithms will only compute approximate eigenvectors and eigenvalues. The following theorem shows that the Rayleigh quotient of an approximate eigenvector is a good approximation for the corresponding eigenvalue.
Theorem 8.3. Let $A \in \mathbb{C}^{n\times n}$ be Hermitian and let $x$ be an eigenvector of $A$. Then a) $\nabla r_A(x) = 0$, and b) $r_A(y) = r_A(x) + O\big(\|y - x\|_2^2\big)$ as $y \to x$.
Proof. a) The gradient of $r_A$ is
\[
\nabla r_A(x) = \frac{2}{\|x\|_2^2}\big(Ax - r_A(x)\,x\big).
\]
If $x$ is an eigenvector with eigenvalue $\lambda$ then $r_A(x) = \lambda$ and hence
\[
\nabla r_A(x) = \frac{2}{\|x\|_2^2}\big(\lambda x - \lambda x\big) = 0.
\]
b) Taylor expansion of $r_A$ around $x$, using a), gives
\[
r_A(y) = r_A(x) + \nabla r_A(x)\cdot(y - x) + O\big(\|y - x\|_2^2\big) = r_A(x) + O\big(\|y - x\|_2^2\big).
\]
Many of the methods for finding eigenvalues rely on repeated matrix-vector multiplication by $A$. Thus they will be considerably cheaper if $A$ is preprocessed so that it has as many zeros as possible, but is similar to the original matrix. This can be achieved by reducing $A$ to Hessenberg form using Lemma 4.12. When $A$ is Hermitian, the Hessenberg form is necessarily tridiagonal. Hence tridiagonal matrices play an important role in what follows.
The Method
The key idea of the power method is that, assuming some minor technical conditions, repeated
application of A to an arbitrary starting vector will lead to a sequence of vectors which eventually
align with the eigenvector associated with the largest eigenvalue.
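A minimal Matlab sketch of one possible implementation of this idea is given below; the starting vector and the iteration count K are illustrative choices, the vector is normalised at every step, and the eigenvalue is estimated by the Rayleigh quotient.
n = size(A,1);
z = randn(n,1);  z = z/norm(z);        % starting vector z^(0), assumed not orthogonal to x_1
for k = 1:K                            % K: illustrative number of iterations
    w = A*z;
    z = w/norm(w);                     % z^(k) = A^k z^(0) / ||A^k z^(0)||_2
    lambda = z'*A*z;                   % eigenvalue estimate lambda^(k) = r_A(z^(k)), since ||z||_2 = 1
end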
Remarks.
1. As with all iterative methods, the method must be stopped at some point where the result is close enough to the exact one.
2. The algorithm calculates an approximate eigenvector as $z^{(k)} = A^k z^{(0)} / \|A^k z^{(0)}\|_2$. To avoid overflow/underflow errors the vector $z^{(k)}$ is normalised in every step of the iteration.
3. The method is based on the following idea: if we express $z^{(0)}$ in the basis $x_1, \dots, x_n$ we get
\[
z^{(0)} = \sum_{i=1}^n \alpha_i x_i
\]
and
\[
A^k z^{(0)} = \sum_{i=1}^n \alpha_i A^k x_i = \sum_{i=1}^n \alpha_i \lambda_i^k x_i.
\]
For large $k$ this expression is dominated by the term corresponding to the eigenvalue with the largest modulus.
Error Analysis
Since eigenvectors are only determined up to scalar factors, we cannot expect the approximation $z^{(k)}$ to be close to the vector $x_1$ we chose. Instead, we would expect $z^{(k)}$ to be close to $\sigma^{(k)} x_1$, where $\sigma^{(k)}$ is a factor of modulus one which might depend on $k$. We get the following result.
Theorem 8.4. Let $|\lambda_1| > |\lambda_2| \ge \dots \ge |\lambda_n|$ and $\langle x_1, z^{(0)}\rangle \ne 0$. Then there is a sequence $(\sigma^{(k)})_{k\in\mathbb{N}}$ in $\mathbb{C}$ with $|\sigma^{(k)}| = 1$ for all $k \in \mathbb{N}$ such that the sequences $(z^{(k)})$ and $(\lambda^{(k)})$ from the power iteration algorithm satisfy
\[
\big\|z^{(k)} - \sigma^{(k)} x_1\big\|_2 = O\Big(\Big|\frac{\lambda_2}{\lambda_1}\Big|^k\Big) \tag{8.2}
\]
and
\[
\big|\lambda^{(k)} - \lambda_1\big| = O\Big(\Big|\frac{\lambda_2}{\lambda_1}\Big|^{2k}\Big) \tag{8.3}
\]
as $k \to \infty$.
Proof. a) Write
\[
z^{(0)} = \sum_{i=1}^n \alpha_i x_i, \qquad \alpha_i = \langle x_i, z^{(0)}\rangle,
\]
and note that $\alpha_1 \ne 0$ by assumption. Then
\[
A^k z^{(0)} = \sum_{i=1}^n \alpha_i \lambda_i^k x_i
= \alpha_1\lambda_1^k\Big(x_1 + \sum_{i=2}^n \frac{\alpha_i}{\alpha_1}\Big(\frac{\lambda_i}{\lambda_1}\Big)^k x_i\Big)
\]
and, by orthonormality of the $x_i$,
\[
\big\|A^k z^{(0)}\big\|_2^2
= |\alpha_1\lambda_1^k|^2\Big(1 + \sum_{i=2}^n \Big|\frac{\alpha_i}{\alpha_1}\Big|^2\Big|\frac{\lambda_i}{\lambda_1}\Big|^{2k}\Big)
= |\alpha_1\lambda_1^k|^2\,(1 + \delta_k)
\]
where
\[
\delta_k = \sum_{i=2}^n \Big|\frac{\alpha_i}{\alpha_1}\Big|^2\Big|\frac{\lambda_i}{\lambda_1}\Big|^{2k}
\le \frac{\|z^{(0)}\|_2^2}{|\alpha_1|^2}\Big|\frac{\lambda_2}{\lambda_1}\Big|^{2k}.
\]
Thus we can conclude
\[
z^{(k)} = \frac{A^k z^{(0)}}{\|A^k z^{(0)}\|_2}
= \frac{\alpha_1\lambda_1^k}{|\alpha_1\lambda_1^k|\sqrt{1+\delta_k}}
\Big(x_1 + \sum_{i=2}^n \frac{\alpha_i}{\alpha_1}\Big(\frac{\lambda_i}{\lambda_1}\Big)^k x_i\Big).
\]
Set $\sigma^{(k)} = \alpha_1\lambda_1^k/|\alpha_1\lambda_1^k|$, so that $|\sigma^{(k)}| = 1$. Then
\[
\Big\|\frac{A^k z^{(0)}}{|\alpha_1\lambda_1^k|} - \sigma^{(k)} x_1\Big\|_2^2
= \sum_{i=2}^n \Big|\frac{\alpha_i}{\alpha_1}\Big|^2\Big|\frac{\lambda_i}{\lambda_1}\Big|^{2k} = \delta_k
\]
and thus, by the triangle inequality,
\[
\big\|z^{(k)} - \sigma^{(k)} x_1\big\|_2
\le \Big\|\frac{A^k z^{(0)}}{\|A^k z^{(0)}\|_2} - \frac{A^k z^{(0)}}{|\alpha_1\lambda_1^k|}\Big\|_2
+ \Big\|\frac{A^k z^{(0)}}{|\alpha_1\lambda_1^k|} - \sigma^{(k)} x_1\Big\|_2
= \big(\sqrt{1+\delta_k} - 1\big) + \sqrt{\delta_k}
\le 2\sqrt{\delta_k}
\le 2\,\frac{\|z^{(0)}\|_2}{|\alpha_1|}\Big|\frac{\lambda_2}{\lambda_1}\Big|^k.
\]
This completes the proof of (8.2).
b) Using part b) of Theorem 8.3 and (8.2) we get
\[
\lambda^{(k)} = r_A\big(z^{(k)}\big)
= r_A\big(\sigma^{(k)} x_1\big) + O\big(\|z^{(k)} - \sigma^{(k)} x_1\|_2^2\big)
= \lambda_1 + O\Big(\Big|\frac{\lambda_2}{\lambda_1}\Big|^{2k}\Big),
\]
which is (8.3).
Computational Complexity
For simplicity assume that our primary interest is in the accuracy of eigenvectors. We wish to iterate until
\[
\big\|z^{(k)} - \sigma^{(k)} x_1\big\|_2 \le \epsilon. \tag{8.4}
\]
Inspection of the preceding proof shows that this inequality will be attained provided
\[
2\,\frac{\|z^{(0)}\|_2}{|\alpha_1|}\Big|\frac{\lambda_2}{\lambda_1}\Big|^k \le \epsilon. \tag{8.5}
\]
Hence it suffices to choose $k$ to be the smallest integer greater than
\[
k^{\#} = \frac{\ln\epsilon^{-1} + \ln\|z^{(0)}\|_2 + \ln 2 - \ln|\alpha_1|}{\ln\big(|\lambda_1|/|\lambda_2|\big)}.
\]
Theorem 8.6. Under Assumption 8.5 and the assumptions of Theorem 8.4, the computational cost of using the power method to achieve (8.4) is bounded above by $c\big(n^{1+a}\ln\epsilon^{-1} + n^3\big)$, for constants $c$ and $a$ independent of $n$ and $\epsilon$.
The Method
The power iteration algorithm helps to find the eigenvalue with the largest modulus and the corresponding eigenvector. The following simple idea leads to a modification of the power iteration method to find different eigenvalues of $A$.
Given $\mu \in \mathbb{R}$, the matrix $(A - \mu I)^{-1}$ has eigenvalues $\nu_i = 1/(\lambda_i - \mu)$ for $i = 1, \dots, n$. The power iteration method applied to this matrix will thus return an approximation for the eigenvector corresponding to the $\nu_j$ with the biggest modulus. This is the $\nu_j$ for which $\lambda_j$ is the eigenvalue closest to $\mu$. The resulting algorithm is called the inverse iteration method.
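A minimal Matlab sketch of this idea follows; the shift mu and the iteration count K are assumed given, and in practice one would factorise A - mu*I once rather than re-solving from scratch at every step.
n = size(A,1);
z = randn(n,1);  z = z/norm(z);
for k = 1:K
    w = (A - mu*eye(n)) \ z;           % one power step for (A - mu*I)^(-1)
    z = w/norm(w);
    lambda = z'*A*z;                   % estimate of the eigenvalue closest to mu
end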
Remark. For practical usage the method is stopped at some point where the result is close
enough to the real one.
Error Analysis
The following convergence result is a direct consequence of Theorem 8.4.
Computational Complexity
The computational cost is analogous to that of the power iteration, under appropriate assump-
tions, found by translating the theorem for the power method, where we iterate A, to the inverse
iteration situation where we iterate (A I)1 . Because, after preprocessing, A is tridiagonal
the cost of each step of inverse iteration is the same as for the power method.
The Method
For inverse iteration it is necessary to estimate the eigenvalue in advance. An alternative is to
estimate the eigenvalue during the course of the algorithm:
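The idea can be sketched in Matlab as follows; this is only an indicative implementation of Rayleigh quotient iteration, with a random starting vector and iteration count K chosen for illustration.
n = size(A,1);
z = randn(n,1);  z = z/norm(z);
lambda = z'*A*z;                       % initial shift taken from the starting vector
for k = 1:K
    w = (A - lambda*eye(n)) \ z;       % inverse iteration step with the current shift
    z = w/norm(w);
    lambda = z'*A*z;                   % update the shift by the Rayleigh quotient
end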
Error Analysis
The previous two algorithms both converged linearly: asymptotically, the error decreases by a constant factor at each step. In contrast, when the Rayleigh quotient iteration converges to an eigenvalue, say $\lambda_j$, it does so cubically:
\[
|\lambda^{(k+1)} - \lambda_j| = O\big(|\lambda^{(k)} - \lambda_j|^3\big).
\]
A similar cubic rate of convergence is also obtained for the eigenvector. Such methods can hence be very efficient.
Computational Complexity
However, with iterations of this type, where the convergence is not linear (as with the Newton method for example), it is difficult to predict when it will converge, and to which eigenvalue/eigenvector pair it will converge. Much of the computational complexity of such algorithms stems from choosing starting points which will lead to convergence to the desired pair.
The Method
Now imagine that we wish to calculate all the eigenvectors. A natural idea is to take a matrix $Z^{(0)} \in \mathbb{R}^{n\times n}$ and iterate columnwise: each column $z_i^{(k)}$ of $Z^{(k)}$ is calculated as
\[
z_i^{(k+1)} = A z_i^{(k)}.
\]
Let
\[
z_i^{(0)} = \sum_{j=1}^n \alpha_{j,i}\,x_j.
\]
Now
\[
z_i^{(k)} = \sum_{j=1}^n \alpha_{j,i}\,\lambda_j^k\,x_j.
\]
If
\[
|\lambda_1| > |\lambda_2| > \dots > |\lambda_{n-1}| > |\lambda_n| \tag{8.7}
\]
then
\[
\begin{aligned}
\mathrm{span}\{z_1^{(k)}\} &= \mathrm{span}\{y_1\}, & \|y_1\|_2 &= 1, & y_1 &= x_1 + e_1,\\
\mathrm{span}\{z_1^{(k)}, z_2^{(k)}\} &= \mathrm{span}\{y_1, y_2\}, & \|y_2\|_2 &= 1, & y_2 &= x_2 + e_2,\\
&\;\;\vdots & & & &\\
\mathrm{span}\{z_1^{(k)}, z_2^{(k)}, \dots, z_n^{(k)}\} &= \mathrm{span}\{y_1, y_2, \dots, y_n\}, & \|y_n\|_2 &= 1, & y_n &= x_n + e_n,
\end{aligned}
\]
where
\[
e_j = O\Big(\sum_{l=1}^{\min\{j, n-1\}} \Big|\frac{\lambda_{l+1}}{\lambda_l}\Big|^k\Big) = O(\eta^k)
\]
for
\[
\eta := \max_{1\le l\le n-1} \Big|\frac{\lambda_{l+1}}{\lambda_l}\Big|.
\]
However, in practice this is a poor algorithm, not only because of overflow/underflow if $|\lambda_1| \ne 1$, but also because the approximate basis $\{z_1^{(k)}, z_2^{(k)}, \dots, z_n^{(k)}\}$ for $\{x_1, \dots, x_n\}$ is highly ill-conditioned, here meaning that the vectors nearly all point in the same direction, $x_1$.
To overcome these problems the algorithm used in practice forms an orthonormal basis from the approximate basis for $\{x_1, \dots, x_n\}$, by means of QR. We write
\[
Z^{(k)} = (z_1^{(k)}, \dots, z_n^{(k)}), \qquad W^{(k)} = (w_1^{(k)}, \dots, w_n^{(k)}).
\]
We show below that the diagonal entries of $\Lambda^{(k)}$ are $r_A\big(z_i^{(k-1)}\big)$.
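A minimal Matlab sketch of this orthonormalised (simultaneous) iteration is shown below; the random orthonormal starting matrix and the iteration count K are illustrative, and qr(.,0) re-orthonormalises the columns at every step.
n = size(A,1);
[Z,~] = qr(randn(n),0);                % random orthonormal Z^(0)
for k = 1:K
    [Z,R] = qr(A*Z,0);                 % form A*Z^(k-1) and orthonormalise: A*Z^(k-1) = Z^(k)*R^(k)
    Lam = Z'*A*Z;                      % diagonal entries approximate the eigenvalues
end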
Error Analysis
Theorem 8.8 (Convergence of Simultaneous Iteration). Let $A \in \mathbb{R}^{n\times n}$ be symmetric, satisfying (8.7) so that $\eta < 1$. Under assumption (8.6) there are, for $j = 1, \dots, n$, sequences $\{s_j^{(k)}\}_{k\in\mathbb{Z}_+}$ with $|s_j^{(k)}| = 1$ such that
\[
\big\|w_j^{(k)} - s_j^{(k)} x_j\big\|_2 = O(\eta^k)
\quad\text{and}\quad
\big|\Lambda^{(k)}_{jj} - \lambda_j\big| = O(\eta^{2k})
\]
as $k \to \infty$.
Computational Complexity
In practice the simultaneous iteration is usually implemented in the form known as the QR
algorithm, which we discuss in the next section. Hence we discuss complexity in that context.
The Method
In practice the eigenvalues of $A$ can be found directly by a method related to simultaneous iteration, which we now derive. The resulting algorithm appears almost magical upon first sight, but when related to simultaneous iteration its mechanism is clear. To establish this relation let $Z^{(k-1)}$, $Z^{(k)}$, $R^{(k)}$, $W^{(k)}$ and $\Lambda^{(k)}$ be as in the simultaneous iteration algorithm and define
\[
Q^{(k)} := (Z^{(k-1)})^T Z^{(k)}.
\]
This is an orthogonal matrix by construction. Recall that
\[
\Lambda^{(k)} = (Z^{(k-1)})^T A Z^{(k-1)}
\]
and that one step of simultaneous iteration computes $W^{(k)} = AZ^{(k-1)}$ together with its QR factorisation $W^{(k)} = Z^{(k)}R^{(k)}$. Thus
\[
\Lambda^{(k)} = (Z^{(k-1)})^T Z^{(k)} R^{(k)} = Q^{(k)} R^{(k)},
\]
which is a QR factorisation of $\Lambda^{(k)}$, while, since $Z^{(k-1)}$ is orthogonal,
\[
\Lambda^{(k+1)} = (Z^{(k)})^T A Z^{(k)} = (Z^{(k)})^T A Z^{(k-1)}\,(Z^{(k-1)})^T Z^{(k)} = R^{(k)} Q^{(k)}.
\]
Hence the matrices $\Lambda^{(k)}$ can be generated directly, without forming the $Z^{(k)}$:
1: Set $\Lambda^{(0)} = A$
2: for $k = 1, 2, \dots$ do
3:   calculate the QR factorisation $\Lambda^{(k-1)} = Q^{(k-1)}R^{(k-1)}$
4:   $\Lambda^{(k)} = R^{(k-1)}Q^{(k-1)}$
5: end for
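As an illustration, the iteration above can be written in a few lines of Matlab; the preliminary reduction by hess (discussed below) and the iteration count K are illustrative additions, and no shifts or deflation are used.
Lam = hess(A);                         % Lambda^(0): tridiagonal when A is symmetric
for k = 1:K
    [Q,R] = qr(Lam);                   % Lambda^(k-1) = Q*R
    Lam = R*Q;                         % Lambda^(k) = R*Q
end
ev = diag(Lam);                        % approximate eigenvalues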
Error Analysis
Theorem 8.9 (Convergence of QR). Under the assumptions of Theorem 8.8, the QR eigenvalue algorithm satisfies (possibly after reordering of eigenvalues)
\[
\big|\Lambda^{(k)}_{ii} - \lambda_i\big| = O\big(\eta^{2k}\big), \qquad i = 1, \dots, n.
\]
Proof. The algorithm produces the same $\Lambda^{(k)}$ as simultaneous iteration. So the diagonal behaviour follows from Theorem 8.8.
Computational Complexity
Making the error in the QR method of size $\epsilon$ gives the number of iterations $k = O(\ln\epsilon/\ln\eta)$. Under the assumptions of Theorem 8.8, the QR algorithm as stated would cost
\[
O\Big(\frac{\ln\epsilon}{\ln\eta}\,n^3\Big)
\]
work to obtain estimates of all eigenvalues to within $O(\epsilon)$. However, this can be improved to
\[
O\Big(\frac{\ln\epsilon}{\ln\eta}\,n^2\Big) + O(n^3),
\]
with the second term independent of $\epsilon$, as follows.
with second term independent of , as follows.
Recall from Definition 1.8 that a matrix $A \in \mathbb{C}^{n\times n}$ is upper Hessenberg if
\[
a_{ij} = 0 \quad\text{for } i > j + 1.
\]
Lemma 8.10. If $\Lambda^{(0)}$ is upper Hessenberg, the QR Algorithm generates $\Lambda^{(k)}$ which are upper Hessenberg.
Proof. We have
\[
\Lambda^{(k)} = R^{(k-1)}Q^{(k-1)} = R^{(k-1)}\Lambda^{(k-1)}\big(R^{(k-1)}\big)^{-1}.
\]
But $R^{(k-1)}$ and its inverse are upper triangular, and pre- or post-multiplication by upper triangular matrices preserves the upper Hessenberg form.
Theorem 8.11 (Cost of QR). Under the assumptions of Theorem 8.8 the QR algorithm can be implemented to give eigenvalue approximations with error of size $\epsilon$ at cost
\[
C(n,\epsilon) \sim k_1\,\frac{\ln\epsilon}{\ln\eta}\,n^2 + k_2\,n^3,
\]
where $k_1$, $k_2$ are independent of $n$, $\epsilon$ and $\eta$.
Proof. (Sketch) First find $T$ as in Lemma 4.12, at cost $\Theta(n^3)$. Then apply the QR algorithm to $T$. Since each QR factorisation is of an upper Hessenberg matrix, the cost per iteration reduces to $\Theta(n^2)$. The number of iterations required to reduce the error to size $\epsilon$ is bounded by $O(\ln\epsilon/\ln\eta)$ and the result follows.
We conclude the chapter with a somewhat different method for eigenvalue calculation, specifically constructed for symmetric tridiagonal systems. It uses a divide and conquer approach, as introduced for the Strassen algorithm.
The Method
If $A \in \mathbb{R}^{n\times n}$ is symmetric then the reduction to upper Hessenberg form of Lemma 4.12 yields a tridiagonal $T$ in $\Theta(n^3)$ operations. Thus we consider finding the eigenvalues of a symmetric tridiagonal $T$. By expressing $T$ in terms of two smaller tridiagonal matrices of half the size, we create the basis for a divide and conquer approach for the symmetric eigenvalue problem. Consider a tridiagonal $T$ in the form:
Consider a tridiagonal T in the form:
a1 b1
.. ..
b1 . .
.. ..
. . bn/21
bn/21 an/2 bn/2
T = .
bn/2 an/2+1 bn/2+1
.. ..
bn/2+1 . .
.. ..
. . bn1
bn1 an
The basis of the divide and conquer approach is to split $T$ into two separate tridiagonal matrices, each of whose eigenvalues and eigenvectors determine those of $T$. This can be done exactly if $b_{n/2}$ is zero, and the following splitting of $T$ tries to build on this fact.
Let $n = 2^k$, $b = b_{n/2}$ and write
\[
T = \begin{pmatrix} T_1 & 0 \\ 0 & T_2 \end{pmatrix} + b\,vv^T, \tag{8.8}
\]
where $v = e_{n/2} + e_{n/2+1}$ and $T_1$, $T_2$ are the two diagonal blocks of $T$ with the entries $a_{n/2}$ and $a_{n/2+1}$ replaced by $a_{n/2} - b$ and $a_{n/2+1} - b$ respectively.
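The splitting (8.8), with $v$ and the modified corner entries as described above, is easy to check numerically; a Matlab sketch for a small illustrative example (random entries, $n = 6$) is:
n = 6;                                     % illustrative even dimension
a = randn(n,1);  bv = randn(n-1,1);
T = diag(a) + diag(bv,1) + diag(bv,-1);    % symmetric tridiagonal
b = bv(n/2);
v = zeros(n,1);  v(n/2) = 1;  v(n/2+1) = 1;            % v = e_{n/2} + e_{n/2+1}
T1 = T(1:n/2,1:n/2);       T1(end,end) = T1(end,end) - b;
T2 = T(n/2+1:n,n/2+1:n);   T2(1,1)     = T2(1,1)     - b;
norm(blkdiag(T1,T2) + b*(v*v') - T)        % essentially zero (up to rounding)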
How do we find the eigenvalues of $T$ from those of $T_1$ and $T_2$? Since $T_1$, $T_2$ are real symmetric,
\[
T_i = Q_i\Lambda_i Q_i^T, \qquad i = 1, 2.
\]
Let
\[
D = \begin{pmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{pmatrix} = \mathrm{diag}(d_1, \dots, d_n)
\]
and
\[
u = \begin{pmatrix} Q_1^T & 0 \\ 0 & Q_2^T \end{pmatrix} v
= \begin{pmatrix} \text{last column of } Q_1^T \\ \text{first column of } Q_2^T \end{pmatrix}.
\]
Theorem 8.13. Assume that the $\{d_i\}$ are distinct and that $u_i \ne 0$ for all $i$. Then the eigenvalues of $T$ are the real roots of the secular equation
\[
f(\lambda) := 1 + b\,\langle u, (D - \lambda I)^{-1}u\rangle = 0.
\]
Proof. We have
\[
T = \begin{pmatrix} Q_1\Lambda_1 Q_1^T & 0 \\ 0 & Q_2\Lambda_2 Q_2^T \end{pmatrix} + b\,vv^T
= \begin{pmatrix} Q_1 & 0 \\ 0 & Q_2 \end{pmatrix}
\Big(\begin{pmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{pmatrix} + b\,uu^T\Big)
\begin{pmatrix} Q_1^T & 0 \\ 0 & Q_2^T \end{pmatrix}.
\]
Thus $T$ has the same eigenvalues as $D + b\,uu^T$ by similarity. Now,
\[
\det(D + b\,uu^T - \lambda I) = \det\big((D - \lambda I)\,(I + b\,(D - \lambda I)^{-1}uu^T)\big).
\]
If $\det(D + b\,uu^T - \lambda I) = 0$ and $\lambda$ is not an eigenvalue of $D$ then, using $\det(I + xy^T) = 1 + y^Tx$,
\[
1 + b\,\langle u, (D - \lambda I)^{-1}u\rangle = f(\lambda) = 0.
\]
Recall that we assume the $\{d_i\}$ are distinct and that $u_i \ne 0$. Let $d_0 = -\infty$ and $d_{n+1} = +\infty$. Then
\[
f(\lambda) = 1 + b\sum_{i=1}^n \frac{u_i^2}{d_i - \lambda}.
\]
Between consecutive poles $d_i$ the function $f$ is strictly monotone, and inspection of its limits at the poles and at $\pm\infty$ shows that $f$ has exactly $n$ real roots, one in each of $n$ of the intervals $(d_i, d_{i+1})$, $0 \le i \le n$. Hence all $n$ eigenvalues of $D + b\,uu^T$ are accounted for by roots of the secular equation and, in particular, no eigenvalue of $D + b\,uu^T$ coincides with an eigenvalue of $D$, so the assumption above involves no loss of generality.
Theorem 8.13 is the basis of a divide and conquer algorithm for the eigenvalue problem.
Error Analysis
The primary source of error, in real arithmetic, arises from the solution of the secular equation. However, because the roots which we seek are real, and because they lie in known intervals, root-finding methods with great accuracy and efficiency can be applied (a simple example being bisection). In practice the errors arising from this step of the algorithm are negligible compared with those arising from finite precision arithmetic.
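For example, a root of $f$ can be located by bisection on an interval between two consecutive poles; the Matlab sketch below assumes d, u and b are as above, that $f$ changes sign on the chosen interval, and uses an illustrative number of bisection steps.
f = @(lam) 1 + b*sum(u.^2 ./ (d - lam));   % secular function
lo = d(1) + 1e-10;  hi = d(2) - 1e-10;     % a bracketing interval between two poles
for k = 1:60                               % each bisection step halves the interval
    mid = (lo + hi)/2;
    if sign(f(mid)) == sign(f(lo))
        lo = mid;
    else
        hi = mid;
    end
end
root = (lo + hi)/2;                        % approximate eigenvalue of T in (d_1, d_2)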
Computational Complexity
Let $L_k = C(2^k)$, where $C(n)$ is the cost of finding the eigenvalues and eigenvectors of a symmetric tridiagonal matrix in $\mathbb{R}^{n\times n}$. If we use a recursive algorithm, then the cost $L_k$ needs to be expressed in terms of the cost $L_{k-1}$ of finding the eigenvalues and eigenvectors of symmetric tridiagonal matrices in $\mathbb{R}^{n/2\times n/2}$. Finding the eigenvalues and eigenvectors in $\mathbb{R}^{n\times n}$ has cost $L_k$ where
\[
L_k = 2L_{k-1} + m_k,
\]
where $m_k$ is the cost of constructing, and solving, the secular equation for $T \in \mathbb{R}^{n\times n}$. This cost comprises:
constructing $u$: $\Theta(n)$ to find the last column of $Q_1^T$ and the first column of $Q_2^T$;
solving the secular equation: $\Theta(n)$, assuming unit cost for each root using, for example, bisection;
finding eigenvectors from these eigenvalues: this costs $\Theta(n)$ for each eigenvalue since the matrix is tridiagonal, giving a total of $\Theta(n^2)$.
To state a theorem about the cost of this iteration we need the following definition.
Bibliography
The book [Par80] contains a very complete overview of the symmetric eigenvalue problem, the
subject that dominates this chapter. The book [TB97] has a good introduction to the subject.
Exercises
Exercise 8-1. The purpose of this question is to introduce you to ideas associated with eigenvalue problems and their solution in Matlab. The basic problem is, given $A \in \mathbb{R}^{n\times n}$, to find $x \in \mathbb{C}^n$ and $\lambda \in \mathbb{C}$ such that
\[
Ax = \lambda x. \tag{8.9}
\]
\[
p_1\lambda^3 + p_2\lambda^2 + p_3\lambda + p_4 = 0
\]
and find $p = (p_1, p_2, p_3, p_4)$. In principle you can solve this cubic exactly so there is no need to resort to numerical solution. (In general we obtain
\[
\sum_{i=0}^{n} \lambda^{n-i} p_{i+1} = 0.
\]
What does Galois tell us about our need for applying numerical solutions to this problem in Matlab?)
(v) One way to find the eigenvalues of $A$ is to use the command roots. Type
>> roots(p)
to find the roots of our cubic, and hence the eigenvalues of A. We can do this more generally. Type
>> A=randn(20);
>> A=0.5*[A+A'];
>> p=poly(A)'
This creates a $20 \times 20$ matrix with normally distributed random entries, and then makes a symmetric matrix from A. The vector p contains the coefficients of the characteristic polynomial of A so that, if you type sort(roots(p)), you will get the eigenvalues of A (sort just orders the roots). Do this.
(vi) Thus one way to find the eigenvalues of A in Matlab is to form the characteristic polynomial and find its roots. But Matlab does not do this, as more efficient methods are available. Type
>> sort(eig(A))
and you should get the same as you did in (v). But the method employed was very different.
Here we show you the two key ideas used in eig. Set
>> A=[-1,2,3,0,4;2,1,6,2,2;3,6,3,3,0;0,2,3,8,2;4,2,0,2,-1]
>> [Q,H]=hess(A)
>> rn=norm(Q*H*Q'-A)
What can you say about the relationship between the eigenvalues of H and those of A?
Now type
>> [W,D]=schur(H)
>> rn=norm(W*D*W'-H)
What can you say about the relationship between the eigenvalues of D and those of H?
Finally, what can you say about the relationship between the eigenvalues of A and the
diagonal elements of D? Test your hypothesis by typing eig(A) and comparing with D .
Exercise 8-2. Give a proof by induction which shows that the matrix $C$ from (8.1) really has $\det(C - zI) = (-1)^n p(z)$ where $p(z) = a_0 + a_1 z + \dots + a_{n-1}z^{n-1} + z^n$.
Exercise 8-3. Let $B$ be a constant matrix and let $T(t)$ solve the matrix differential equation
\[
\frac{dT}{dt} = BT - TB, \qquad T(0) = T_0.
\]
Assuming that the solution of this equation exists for all $t \in \mathbb{R}$, show that $T(t)$ has the same eigenvalues as $T_0$, for all $t \in \mathbb{R}$. If $B$ is skew-symmetric, show that $\|T(t)\|_2 = \|T_0\|_2$ for all $t \in \mathbb{R}$.
Exercise 8-5. Construct an example showing that the Jordan canonical form of a matrix can
be discontinuous with respect to small perturbations.
where $dG(y)$ is the Jacobian of $G$ evaluated at $y$. Find the explicit formula for $y_{k+1}$ in terms of $y_k$ when $m = 1$. What is the order of convergence of the method?
2. The eigenvalue problem for $A \in \mathbb{R}^{m\times m}$ can be regarded as a nonlinear system of equations for $y = (x, \lambda)$ satisfying
\[
Ax - \lambda x = 0, \qquad \tfrac12\big(x^Tx - 1\big) = 0.
\]
Apply Newton's method to this system and show that it gives the iteration
\[
\lambda_{k+1} = \lambda_k + \frac{\tfrac12\big(\|x_k\|_2^2 + 1\big)}{x_k^T(A - \lambda_k I)^{-1}x_k},
\qquad
x_{k+1} = (\lambda_{k+1} - \lambda_k)\,(A - \lambda_k I)^{-1}x_k.
\]
\[
\lambda_{k+1} = h(\lambda_k).
\]
Show that this iteration is Newton's method applied to a scalar equation and identify that equation. Using this fact, prove that the full method on $\mathbb{R}^m$ is second order convergent for this specific choice of initial data.
Exercise 8-7. Prove the following theorem concerning eigenvalues of tridiagonal matrices.
Theorem 8.16. Consider a matrix $A$ of the form (5.5). Let the diagonal entries be real and constant, with equal off-diagonal entries: $a_i = a$, $c_i = a$ and $d_i = d$ for all $i = 1, \dots, m$, where $a$ and $d$ are real. Then the eigenvalues $\lambda_k$ of the matrix are given by
\[
\lambda_k = d + 2a\cos\Big(\frac{k\pi}{m+1}\Big), \qquad k = 1, \dots, m.
\]
Exercise 8-8. Consider the companion matrix $C$ from (8.1). Prove that the eigenvalues $\lambda_k$ of $C$ are the roots of the polynomial
\[
\sum_{j=0}^{n-1} c_j\lambda^j + \lambda^n = 0,
\]
and that a corresponding eigenvector of $C^T$ is $x = [x_1, \dots, x_n]^T$ where $x_j = \lambda_k^{\,j-1}$.
Bibliography
[Bha97] R. Bhatia. Matrix Analysis. Springer, 1997.
[CH96] S.-N. Chow and J.K. Hale. Methods of Bifurcation Theory. Springer, 1996.
[CLRS01] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. McGraw Hill, 2001.
[Dem97] J.W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[GL96] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins, third edition,
1996.
[Hig02] N.J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2002.
[HJ85] R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[LT88] R.J. LeVeque and L.N. Trefethen. Fourier analysis of the SOR iteration. IMA Journal of Numerical Analysis, 8(3):273-279, 1988.
[Ner71] E.D. Nering. Linear Algebra and Matrix Theory. John Wiley, 1971.
[Par80] B.N. Parlett. The Symmetric Eigenvalue Problem. Prentice-Hall, 1980.
[Saa97] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS, 1997.
[SB02] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer, third edition,
2002.
[TB97] L.N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, 1997.