Parallel Computing
Numerical linear algebra
Thorsten Grahs, 22.06.2015
Overview
Direct methods
Iterative methods
Splitting methods
Krylov subspace methods
Preconditioner
Efficient storage formats
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 2
Overview. Applications
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 3
Solution algorithms Classification
Dense matrices
Few/no zero entries
Direct algorithms (Gauss or splitting algorithms)
Band matrices
Special algorithms: e.g. Thomas algorithm
Sparse matrices
Special storage formats
(store only non-zeros)
Iterative algorithms
(Jacobi, Gauss-Seidel, Krylov methods)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 4
Dense matrix algorithms
Gaussian elimination
Solve system of equations Ax = b
Reduction
a11 a12
0 a
22
..
..
.
.
0
0
x1
b1
a1n
x2
b2
2n
a
.. .. = ..
..
.
. . .
nn
a
xn
bn
Substitution
xk =
1
bk
akk
n
X
j=k+1
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 5
akj xj
LU/LR decomposition
LU (Lower/Upper) or LR (Left/Right) decomposition
Solve stem of equations Ax = b
LU decomposition
l11 0
a11 a12 a1n
a21 a22 a2n l21 l22
..
.. . .
.. = ..
..
.
. . .
.
.
an1 an2 ann
ln1 ln2
0
u11 u12
0 0 u22
.. ..
..
. .
.
lnn
0
0
..
.
Solve
Ax = b
LUx = L(Ux) = Ly = b
Ly = b Ux = y
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 6
u1n
u2n
..
..
. .
unn
Single Value Decomposition
Single Value Decomposition(SVD)
Solve system of equations Ax = b
A = UVT with
1 0 0
0 1 0
UT U = VT V = I = .. .. . . .. ,
. .
. .
Decompose
0 0 1
Solve
Ax = V1 UT b
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 7
11 0
0 22
= ..
..
.
.
0
0
..
.
0
0
..
.
nn
Q-R Algorithm
QR decomposition
Solve system of equations Ax = b
Decompose A = QR with
R upper diagonal matrix
Q orthogonal matrix with
1 0
0 1
QT Q = I = .. .. . .
. .
.
0 0
0
0
..
.
1
Solve
Ax = b
QRx = b
QT QRx = QT b
Rx = QT b
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 8
Parallel algorithms?
Example Gaussian Elimination (GE)
Standard sequential algorithm for system of equations
n
P
aij xj = bi , i = 1, . . . , n
j=1
1
2
3
4
5
6
7
8
9
10
for k=1,n
for i =k+1,n
l ( i ,k) = a( i ,k) /a(k,k)
endfor( i )
for j =k+1,n
for i =k+1,n
a( i , j ) = a( i , j ) l ( i ,k) a(k, j )
endfor( i )
endfor( j )
endfor(k)
GE example from L.R. Scott Scientific Parallel Computing
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 9
Iteration space f. Gaussian elimination
The iteration space for the standard sequential algorithm
for Gaussian elimination forms a trapezoidal region with
square cross-section in the i, j plane.
Within each square (with k fixed) there are no
dependencies
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 10
What Gaussian elimination does
Gaussian elimination reduces the original to a triangular
system
n
P
lij yj = bi , i = 1, . . . , n
j=1
Matrix L is the lower-triangular matrix computed in GE
example (which determines the values below the
diagonal) with ones placed on the diagonal
Thus, Gaussian elimination performs a matrix factorization
(LU decomposition)
Ax = LUx = b
Solve triangular systems
Ly = b
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 11
Ux = y
Properties Gaussian elimination
The dominant part of the computation is the factorization
(GE example) in which L and U are determined.
The triangular system solves require less computation
There are no loop-carried dependences in the inner-most
two loops (i, j-loops) in GE because i, j > k
Therefore these loops can be decomposed in any
desired fashion
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 12
Standard parallelization of GE
1
2
3
4
5
6
7
8
9
10
11
12
13
14
for k=1,n
if ( " I own column k " )
for i =k+1,n
l ( i ,k) = a( i ,k) /a(k,k)
endfor( i )
"broadcast" l (k+1 : n)
else "receive" l (k+1 : n)
endif
for j =k+1,n ("modulo owning column j")
for i =k+1,n
a( i , j ) = a( i , j ) l ( i ,k) a(k, j )
endfor( i )
endfor( j )
endfor(k)
G. A. Geist and C. H. Romine. LU factorization algorithms on distributed-memory
multiprocessor architectures. SIAM J. Sci. Stat. Comput., 9:639649, 1988.
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 13
Parallel complexity
For each value of k, n k divisions are performed in
computing the multipliers l(i, k)
Subsequently, these multipliers are broadcast to all other
processors.
T1 = (n k)(Ta + Tk )
Once these are received, (n k)2 multiply-add pairs are
executed, all of which can be done in parallel
T2 = 2(n k)2 Ta /P
For each K
T = T1 + T2 = (n k)(Ta + Tk ) +
computation time is used
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 14
2(n k)2 Ta
P
Parallel complexity
Summing over k, the total time of execution is
n1
X
2(n k)2 Ta
T =
(n k)(Ta + Tk ) +
P
k=1
1 2
2n3 Ta
n (Ta + Tk ) +
2
3P
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 15
Overlapped Gaussian elimination
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
first processor computes l(2:n,1) & "broadcast"
for k=1,n
unless(" I own column k ") "recieve" l(k+1:n,k)
if ( " I own column k+1 " )
for i =k+1,n
a( i ,k+1) = a( i ,k+1) l ( i ,k) a(k,k+1)
l ( i ,k+1) = a( i ,k+1)/a(k+1,k+1)
endfor( i )
"broadcast" l (k+1 : n)
endif
for j =k+2,n ("modulo owning column j")
for i =k+1,n
a( i , j ) = a( i , j ) l ( i ,k) a(k, j )
endfor( i )
endfor( j )
endfor(k)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 16
Overlapped GE Results
Idea of overlapped Gaussian elimination
Compute multipliers l(i, k) as soon as possible
Broadcast them to all other processors
Efficiencies for Gaussian elimination for factoring full n n
matrices on an Intel iPSC.
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 17
Problems
Storage consumption
Gauss elimination is very memory consuming
GEmemory O(n2 )
Since unknowns from industrial application could be around
N 109
GE O(1018 ) i.e. 1 Exa Bytes.
time consuming
Number of arithmetic operations is O(n3 )
(dont think about communication)
GE is slow (compared to other approaches)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 18
GE timing Laplace (2D)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 19
GE timing Laplace (3D)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 20
Iterative solvers
Iterative algorithms
In general for sparse matrices
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 21
Iterative Solvers
In contrast to direct solvers, iterative solvers
Compute approximate solutions uk = A1 f .
Needs only matrix-vector product yk = Ax
One iteration is cheap to compute, in 2D
Based on Fix-point iteration
Calculates successively better approximation uk+1 in an
iterative process
Improved convergence by pre-conditioning
Types of methods
Splitting methods
Krylov methods
Multigrid
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 22
Fix-point iteration
f is contractive when kf (x) f (y)k < kx yk.
There exist an unique Fix-point x = f (x )
k
For x k+1 = f (x k ),
x k x
If kf (x) f (y)k < kx yk, < 1 x, y
kx k x k < kx x k
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 23
Splitting methods
Iterative methods with a splitting of the form
A = M N,
A, M, N Rnn
M easy to invert (f.i. diagonal matrix)
and iteration rule
Mx k+1 = Nx k + b
x k+1 = Cx k + d
.
with
C := M1 N,
d := M1 b
called splitting methods
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 24
Matrix-Splitting
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 25
Convergence
Guaranteed for splitting methods with
Spectral radius (M1 N) < 1
M 1 acts as a preconditioning matrix
Preconditioner helps the method to converge faster
Best preconditioning matrix for convergence: M1 = A1
Simplest preconditioning matrix: M1 = I
Realistic preconditioning matrix between
Jacobi:
M=D
Gau-Seidel: M = (D L)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 26
Jacobi methods
or total-step iteration
Splitting
A=DLR
D, L, R Rnn , D diagonal matrix
iteration rule
Dx k+1 = (L + R)x k + b
x k+1 = Cx k + d,
xik+1 =
1
aii
C := D1 (L + R) d := D1 b
!
n
P
bi
aij xjk , i = 1, . . . , n.
j=1,j6=i
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 27
Gau-Seidel method
Single-step method
Splitting
A=DLR
D, L, R Rnn ,
D diagonal matrix
iteration rule
(D L)x k+1 = Rx k + b
x k+1 = Cx k + d, C := (D L)1 (R), d := (D L)1 b
xik+1 =
1
aii
bi
i1
P
j=1
aij xjk+1
n
P
j=i+1
!
aij xjk
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 28
, i = 1, . . . , n.
SOR (Successive Over Relaxation)
or relaxed Gau-Seidel method
(with relaxation parameter (0, 2))
Splitting
1
A=
D L R 1
D, D, L, R Rnn ,
Iteration rule
(D L)x k+1 = (1 )Dx k + Rx k + b
x k+1 = Cx k + d,
C := (D L)1 [(1 )D + R]
xik+1 =
aii
bi
i1
P
j=1
aij xjk+1
n
P
j=i+1
D diag.matrix
e.g..
d := (D L)1 b
!
aij xjk
i = 1, . .22.06.2015
. , n. Thorsten Grahs Parallel Computing I SS 2015 Seite 29
+ (1 )xik ,
Krylov methods
Matrix-vector multiplication with A
Krylov subspace
Kk (A, r0 ) := [r0 , Ar0 , A2 r0 , . . . , Ak1 r0 ]
Select best solution in Krylov subspace
1
2
3
r0 = b - A*x0
Compute
best
xk = x0 + uk
uk in KN(A, r0)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 30
Krylov methods
The question arises in which sense
best
is meant.
Algorithms
GMRES
CG
BiCG
BiCGStab
...
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 31
GMRES
Generalized Minimum Residual Method
(Saad, Schultz 1986)
Best = smallest residual
1
2
3
4
5
Compute
uk in KN(A, r0)
such that
|| b - A (x0 + uk) ||
is minimized
Cost and memory grows linear with N
Pick small k (e.g. k = 20)
Restart with x 0 restart = x k (compute new basis)
Restarted GMRES GMRES(k)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 32
CG
Conjugated Gradient method
(Hestenes/Stiefel 1952)
Starting point
Minimizing the functional
1
E(x) := hAx, yi hb, xi
2
is equivalent with solving the linear system Ax = b.
Idea
Minimizing the functional E over a subspace with
search/basis vectors dk+1 := rk + k dk , d0 = r0 .
The Gradient of E at position xk is
E|xk = Axx b := rk
which is easy to compute
The dk s are A-conjugated, i.e. hAdi , dj i = 0, i 6= j
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 33
CG
Best = smallest residual in the A-norm
i.e. krk kA < tol
1
2
3
4
5
6
Compute
uk in KN(A, r0)
such that
|| b - A (x0 + uk) ||A
is minimized, where
|| v ||A = vT A v
Matrix must be S.P.D.
Symmetric and Positive-Definite
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 34
CG | Prerequisites
Given:
Matrix (Dimension n n) :
Vectors (length n) :
Scalars:
A
x,b
kmax (Max. iteration)
eps (prescribed tolerance)
Local variable
Vectors (length n) : r, d, Ad
Scalars :
rNew, rOld, alpha, beta
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 35
CG | Pseudo code
1
2
3
4
5
6
7
8
9
10
d = r = b - A*x
rOld = r*r
For k = 1, ..., kMax
If modulus rOld smaller than eps is, than stop
else
Ad = A*d
alpha = rOld/(d*Ad)
x = x + alpha*d
r = r - alpha*Ad
rNew = r*r
11
12
beta = rNew / rOld
13
14
d = beta*d + r
15
16
rOld = rNew
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 36
CG
Finite termination/convergence
of the CG method after N
steps/iterations
A RNN
in exact arithmetic
Name stems from the
conjugated search directions di
with
hAdi , dj i = 0
CG compared with gradient descent method.
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 37
Convergence of iterative methods
Splitting methods(Jacobi/Gauss-Seidel)
vs.
Krylov methods (GMRES/CG)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 38
Preconditioner
What does a preconditioner?
The preconditioner approximates the inverse of A
What is this good for?
Reduces the condition number of the matrix
i.e. reduces the spectral radius of the matrix.
Helps to steer the iteration in the right direction
Better approximation Faster Convergence
M A1
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 39
Diagonal Preconditioner
Simplest preconditioner
Inverse of matrix diagonal
Always cheap, sometimes effective
1
0
5
0
7
2
0
6
0
8
3
0
A
0
0
9
4
1 0
0
0
0 1/2 0
0
0 0 1/3 0
0 0
0 1/4
M
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 40
Preconditioner
Solving with no preconditioner
Solver will continue until residual norm 5e-06
Iteration Number | Residual Norm
0
5.000000e+00
1
1.195035e+01
2
1.147768e+01
3
1.544747e+01
4
1.735989e+01
5
1.160256e+01
6
1.550610e+01
7
2.534706e+01
...
89
4.568771e-05
90
3.513509e-05
91
3.462404e-05
92
3.977090e-05
93
2.056327e-06
Successfully converged after 93 iterations.
Solving with diagonal preconditioner (M = D^-1)
Solver will continue until residual norm 5e-06
Iteration Number | Residual Norm
0
5.000000e+00
1
5.148771e+01
2
4.341677e+01
3
3.299794e+01
4
1.074329e+01
5
2.807501e+00
6
1.739602e+00
7
1.538450e+00
8
4.079266e-01
9
7.029972e-02
10
2.436168e-02
11
6.656054e-03
12
1.284295e-03
13
1.432453e-06
Successfully converged after 13 iterations.
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 41
Preconditioner
Can be extremely efficient
Solving with no preconditioner
Solver will continue until residual norm 2.5e-04
Iteration Number | Residual Norm
0
5.000000e+00
1
1.195035e+01
2
1.147768e+01
3
1.544747e+01
4
1.735989e+01
5
1.160256e+01
6
1.550610e+01
7
2.534706e+01
...
512
3.064404e-04
513
2.943765e-04
514
2.796355e-04
515
2.690365e-04
516
2.472412e-04
Successfully converged after 516 iterations.
Solving with smoothed aggregation preconditioner.
Solver will continue until residual norm 2.5e-04
Iteration Number | Residual Norm
0
2.560000e+02
1
4.373383e+03
2
2.359818e+03
3
1.211452e+03
4
6.210880e+02
5
3.169577e+02
6
1.569394e+02
7
8.138102e+01
...
20
6.011106e-03
21
2.555795e-03
22
1.083367e-03
23
4.799830e-04
24
2.213949e-04
Successfully converged after 24 iterations.
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 42
Storing in two dim. arrays
Dense or banded matrix
It is natural to use a 2D array to store a dense or banded
matrix. Unfortunately, there are a couple of significant issues
that complicate this seemingly simple approach.
Row-major vs. column-major storage pattern is language
dependent.
It is not possible to dynamically allocate two-dimensional
arrays in C and C++.
(at least not without pointer storage and manipulation
overhead).
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 43
Row-major storage
Languages like C and C++ use what is often called a
row-major storage pattern for 2D matrices.
In C and C++, the last index in a multidimensional array
indexes contiguous memory locations. Thus a[i][j] and
a[i][j + 1] are adjacent in memory.
Example
The stride between adjacent elements in the same row is
1.
The stride between adjacent elements in the same column
is the row length (5 in this example).
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 44
Column-major storage
In Fortran 2D matrices are stored in column-major form.
The first index in a multidimensional array indexes
contiguous memory locations. Thus a(i, j) and a(i + 1, j)
are adjacent in memory.
Example
The stride between adjacent elements in the same row is
the column length (2 in this example) while the stride
between adjacent elements in the same column is 1.
Notice that if C is used to read a matrix stored in Fortran
(or vice-versa), the transpose matrix will be read.
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 45
Significance of array ordering
There are two main reasons why HPC programmers need to
be aware of this issue:
Memory access patterns can have a dramatic impact on
performance, especially on modern systems with a
complicated memory hierarchy.
These code segments access the same elements of an
array, but the order of accesses is different.
Many important numerical libraries (e.g. LAPACK) are
written in Fortran. To use them with C or C++ the
programmer must often work with a transposed matrix
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 46
Dense storage formats
Efficient storage formats for sparse matrices
Obviously, there is no need to store the zeros for an
sparse matrix.
There are several ways to store a sparse matrix efficiently
Common sparse storage formats:
Coordinate
Diagonal
Compressed Sparse Row (CSR)
ELLPACK
Hybrid
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 47
Storage format COO
Coordinate (COO)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 48
Storage format CSR
Compressed Sparse Row (CSR)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 49
Storage format DIA
Diagonal (DIA)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 50
Storage format ELL
ELLPACK (ELL)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 51
Storage format Hyb
Hybrid (HYB) ELL+COO
(CUSP only)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 52
Use of formats
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 53
Formats comparison | structured
Structured matrix
Bytes per non-zero entry
Format
COO
CSR
DIA
ELL
HYB
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 54
float
12.00
8.45
4.05
8.11
8.11
double
16.00
12.45
8.10
12.16
12.16
Formats comparison | unstructured
Unstructured matrix
Bytes per non-zero entry
Format
COO
CSR
DIA
ELL
HYB
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 55
float double
12.00
16.00
8.62
12.62
164.11 328.22
11.06
16.60
9.00
13.44
Formats comparison | random
Random matrix
Bytes per non-zero entry
Format
COO
CSR
DIA
ELL
HYB
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 56
float
12.00
8.42
76.83
14.20
9.60
double
16.00
12.42
153.65
21.29
14.20
Formats comparison | power-law
Power-law matrix
Bytes per non-zero entry
Format
COO
CSR
DIA
ELL
HYB
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 57
float double
12.00
16.00
8.74
12.73
118.83 237.66
49.88
74.82
13.50
19.46
Literature
L.R. Scott, T. Clark, & B. Bagheri
Scientific Parallel Computing
Princton Universit press, 2005.
Y. Saad
Iterative methods for sparse linear system
Society for Industrial and Applied Mathematics (SIAM),
2003
A. Meister
Numerik linearer Gleichungssysteme
Springer-Spektrum, 2014
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 58