Cache Friendly Sparse Matrix-Vector Multiplication
[Extended Abstract]

Sardar Anisul Haque
University of Western Ontario, ON N6A 5B7, Canada
shaque4@csd.uwo.ca

Shahadat Hossain
University of Lethbridge, AB T1K 3M4, Canada
shahadat.hossain@uleth.ca

Marc Moreno Maza
University of Western Ontario, ON N6A 5B7, Canada
moreno@csd.uwo.ca
1. INTRODUCTION
Sparse matrix-vector multiplication, or SpMXV, is an important kernel in scientific computing. For example, the conjugate gradient method (CG) is an iterative linear system solver in which the multiplication of the coefficient matrix A by a dense vector x is the main computational step, accounting for as much as 90% of the overall running time. Though the total number of arithmetic operations (involving nonzero entries only) needed to compute Ax is fixed, reducing the probability of cache misses per operation, for instance by reordering the matrix in a preprocessing step, remains a challenging research problem. Such preprocessing is done once, and its cost is amortized by repeated multiplications. Computers that employ cache memory to improve the speed of data access rely on reuse of the data brought into the cache. The challenge is to exploit data locality, especially for unstructured problems, where modeling data locality is hard.
Pinar and Heath [8] propose column reordering to make the nonzero entries in each row contiguous. However, column reordering that arranges the nonzero entries in contiguous locations is NP-hard [8]. In a considerable volume of work [2, 6, 8, 9, 10] on the performance of SpMXV on modern processors, researchers propose optimization techniques such as reordering the columns or rows of A to reduce indirect accesses and improve data locality, and blocking to reduce memory loads and loop overhead.
In this extended abstract, we present a new column ordering algorithm, based on binary reflected Gray codes, that runs in linear time with respect to the number of nonzero entries. We analyze the cache complexity of SpMXV when the sparse matrix is ordered by our technique. The results of numerical experiments with very large test matrices clearly demonstrate the performance gains delivered by the proposed technique.
Categories and Subject Descriptors
F.2.1 [Numerical Algorithms and Problems]: Computations on matrices; G.1.3 [Numerical Linear Algebra]: Sparse, structured, and very large systems (direct and iterative methods)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PASCO 2010, 21–23 July 2010, Grenoble, France.
Copyright 2010 ACM 978-1-4503-0067-4/10/0007 ...$10.00.
General Terms
Algorithms
Keywords
Sparse Matrix, Cache Complexity, Binary Reflected Gray Code
2. BINARY REFLECTED GRAY CODE ORDERING
We develop a new column ordering algorithm for sparse matrices based on the binary reflected Gray code (BRGC for short); we call it BRGC ordering. A p-bit binary reflected Gray code [3] is a Gray code, denoted by G^p, defined by G^1 = [0, 1] and

    G^p = [0G^{p-1}_0, ..., 0G^{p-1}_{2^{p-1}-1}, 1G^{p-1}_{2^{p-1}-1}, ..., 1G^{p-1}_0],

where G^p_i is the i-th binary string of G^p. We call i the rank of G^p_i in G^p. For example, the rank of 011 in G^3 is 2. We regard each column of an m × n sparse matrix A as a binary string of length m, where each nonzero is treated as 1. Hence we have n binary strings of length m, say {b_0, b_1, ..., b_{n-1}}. Let Π be the permutation of these strings satisfying the following property: for any pair of indices i, j with i ≠ j, the rank of b_{Π(j)} in G^m is less than that of b_{Π(i)} if and only if Π(i) < Π(j) holds. We write A_brgc for the sparse matrix A after its columns have been permuted by Π. One can check that the BRGC ordering sorts the columns of A according to their ranks in G^m in descending order.
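As an illustration, the rank of a column in G^m can be recovered from its bit pattern by the standard Gray-to-binary prefix-XOR conversion. The following Python sketch (illustrative only: it scans all m bits of every column, unlike the linear-time algorithm described in this section) computes ranks and the resulting descending-rank column order:

```python
def brgc_rank(bits):
    """Rank of a binary string (most significant bit first) in the
    binary reflected Gray code, via the Gray-to-binary prefix XOR."""
    rank, acc = 0, 0
    for g in bits:
        acc ^= g                   # running XOR of the Gray bits
        rank = (rank << 1) | acc   # accumulate the binary rank
    return rank

def brgc_column_order(columns, m):
    """Columns given as lists of nonzero row indices; returns column
    indices sorted by descending rank in G^m (the BRGC ordering)."""
    def rank_of(j):
        nz = set(columns[j])
        return brgc_rank([1 if i in nz else 0 for i in range(m)])
    return sorted(range(len(columns)), key=rank_of, reverse=True)

# The rank of 011 in G^3 is 2, matching the example in the text.
print(brgc_rank([0, 1, 1]))  # -> 2
```

For a 3-row matrix with columns 011, 100, 110, 001 (given as row-index lists), `brgc_column_order([[1, 2], [0], [0, 1], [2]], 3)` returns `[1, 2, 0, 3]`, i.e. the columns in descending rank order 7, 4, 2, 1.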
On average, our sorting algorithm proceeds in ρ successive phases, where ρ is the average number of nonzeros per column; the phases are described below. During the first phase, we sort the columns by increasing position of their first nonzero entry from above, creating equivalence classes in which any two columns are incomparable for this ordering. During the second phase, within each equivalence class, we sort the columns by decreasing position of their second nonzero entry from above, thus refining the equivalence classes of the first phase into new classes in which, again, any two columns are incomparable for this second ordering. More generally, during the k-th phase, within each equivalence class obtained at the (k − 1)-th phase, we sort the columns by increasing position (if k is odd) or decreasing position (if k is even) of their k-th nonzero entry from above, again refining the equivalence classes. Continuing in this manner, we obtain the desired sorted matrix. Observe that whenever an equivalence class becomes a singleton, it no longer participates in the subsequent sorting phases.
Based on the above procedure and the counting sort algorithm [4], the matrix A_brgc is obtained from A using O(τ) integer comparisons (on average) and O(n + τ) data-structure updates, where τ is the total number of nonzero entries in A [7].
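The phased refinement above can be sketched as follows. This is an illustrative Python version that uses the built-in sort within each class rather than counting sort, so it does not achieve the O(n + τ) bound. A column whose nonzeros are exhausted is keyed as +∞, which places it first in descending phases and last in ascending ones, consistent with the reflected structure of the code.

```python
import math

def brgc_phase_order(columns):
    """Multi-phase refinement sort sketched from the description above.
    columns: list of ascending lists of nonzero row indices.
    Returns column indices in BRGC order (descending rank)."""
    order = list(range(len(columns)))
    classes = [(0, len(columns))]   # unresolved slices of `order`
    k = 0                           # 0-based phase index
    while classes:
        refined = []
        for (s, e) in classes:
            def key(j):
                # position of the (k+1)-th nonzero, +inf if exhausted
                return columns[j][k] if k < len(columns[j]) else math.inf
            # odd-numbered phases (k even here) sort ascending,
            # even-numbered phases sort descending
            order[s:e] = sorted(order[s:e], key=key, reverse=(k % 2 == 1))
            i = s
            while i < e:            # split into classes of equal key
                j = i
                while j < e and key(order[j]) == key(order[i]):
                    j += 1
                # a class survives only if some column can still refine it
                if j - i > 1 and any(k + 1 < len(columns[order[t]])
                                     for t in range(i, j)):
                    refined.append((i, j))
                i = j
        classes, k = refined, k + 1
    return order

# Columns of a 3-row matrix: 011, 100, 110, 001 (as row-index lists)
print(brgc_phase_order([[1, 2], [0], [0, 1], [2]]))  # -> [1, 2, 0, 3]
```

The returned order matches sorting by descending BRGC rank (7, 4, 2, 1 for this example), and singleton or exhausted classes drop out of later phases as described.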
Let C be an equivalence class obtained after the ℓ-th phase and before the (ℓ + 1)-th phase. The nonzero stream at level (ℓ + 1) in C is the set of the (ℓ + 1)-th nonzero entries from above in the columns of C. Within the nonzero stream at level (ℓ + 1) in C, a set of nonzeros having the same row index is called a step.
3. CACHE COMPLEXITY
Consider an ideal cache [5] of Z words with cache lines of L words. Assume that n is large enough that the vector x does not fit into the cache. During SpMXV, the total number of accesses to x is τ, and these accesses are usually irregular. Note that Z/L of those accesses are cold misses, which fill the cache. Since no spatial locality should be expected in accessing x, each of the other τ − Z/L accesses causes a cache miss with probability (n − Z/L)/n. Therefore, the total expected number of cache misses in accessing x is

    Q1 = Z/L + (τ − Z/L) · (n − Z/L)/n.
We claim that A_brgc has nonzero streams at levels 1 and 2 at least. Indeed, each column has at least one nonzero, which implies that A_brgc has a nonzero stream at level 1. Observe that each step of the nonzero stream at level 1 is expected to have ρ entries. Moreover, we assume ρ ≥ 2, which leads to the formation of the nonzero stream at level 2. Therefore, the total number of nonzeros in the nonzero streams at levels 1 and 2 is 2n. Due to the LRU replacement policy, one can expect that the n multiplications with the nonzeros in the nonzero stream at level 1 incur the same number of cache misses as if x were scanned in a regular manner during SpMXV. Next, we observe that each of the accesses to x for multiplying with the nonzeros in the nonzero stream at level 2 causes a cache miss with probability (n/ρ − Z/L)/(n/ρ). More generally, each of the other accesses to x causes a cache miss with probability (cn/ρ − Z/L)/(cn/ρ), where c is the average number of nonzero streams under one step of the level-1 nonzero stream and 1 ≤ c ≤ ρ. Therefore, the expected number of cache misses in accessing x is

    Q2 = n/L + Z/L + (n − Z/L) · (n/ρ − Z/L)/(n/ρ) + (τ − 2n) · (cn/ρ − Z/L)/(cn/ρ).
We apply the computer algebra system Maple to analyze the difference between Q1 and Q2. For the large matrices of [1], the estimate n = O(Z^2) holds for the level-2 cache, and our calculations show that Q1 − Q2 ≈ n.
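As a rough numerical illustration, the two estimates can be compared directly. The cache parameters below (an 8 MB cache of 8-byte words with 64-byte lines) and the value of c are assumptions chosen for illustration, not figures from our Maple computation; the matrix dimensions are those of wikipedia-20070206 from Table 1.

```python
def q1(tau, n, Z, L):
    """Expected cache misses on x without reordering (formula Q1)."""
    return Z / L + (tau - Z / L) * (n - Z / L) / n

def q2(tau, n, rho, c, Z, L):
    """Expected cache misses on x under BRGC ordering (formula Q2)."""
    return (n / L + Z / L
            + (n - Z / L) * (n / rho - Z / L) / (n / rho)
            + (tau - 2 * n) * (c * n / rho - Z / L) / (c * n / rho))

# Assumed cache: 8 MB of 8-byte words, 64-byte lines -> Z = 2^20, L = 8
Z, L = 2**20, 8
# wikipedia-20070206 from Table 1; c = 4 is an arbitrary value in [1, rho]
n, tau = 3566907, 45030389
rho = tau / n
print(q1(tau, n, Z, L) > q2(tau, n, rho, 4, Z, L))  # -> True
```

For these (assumed) parameters Q1 exceeds Q2 by several million misses, on the order of n, consistent with the Maple analysis.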
4. EXPERIMENTAL RESULTS
We selected 10 matrices from [1] for our experiments. The basic information for each test matrix is given in Table 1. We ran our experiments on an Intel Core 2 Quad processor Q6600, which has an 8 MB L2 cache and a 2.40 GHz CPU frequency [11]. We measured the CPU time (in seconds) for 1000 SpMXVs for the three variants reported in Table 2: with BRGC ordering, without any preprocessing, and after a random re-ordering of the columns.
Matrix name          |       m |       n |        τ
fome21               |   67748 |  216350 |   465294
lp ken 18            |  105127 |  154699 |   358171
barrier2-10          |  115625 |  115625 |  3897557
rajat23              |  110355 |  110355 |   556938
hcircuit             |  105676 |  105676 |   513072
GL7d24               |   21074 |  105054 |   593892
GL7d17               | 1548650 |  955128 | 25978098
GL7d19               | 1911130 | 1955309 | 37322725
wikipedia-20051105   | 1634989 | 1634989 | 19753078
wikipedia-20070206   | 3566907 | 3566907 | 45030389

Table 1: Test matrices.
Matrix name          | BRGC ordering | no ordering | random ordering
fome21               |           3.6 |         3.9 |             4.8
lp ken 18            |           2.7 |         3.1 |             3.3
barrier2-10          |          19.0 |        19.1 |            23.2
rajat23              |           3.0 |         3.0 |             3.4
hcircuit             |           2.6 |         2.5 |             2.9
GL7d24               |           3.0 |         3.2 |             3.1
GL7d17               |         484.6 |       625.0 |           580.7
GL7d19               |         784.6 |       799.0 |           899.2
wikipedia-20051105   |         258.9 |       321.0 |           411.5
wikipedia-20070206   |         731.5 |       859.0 |          1046.0

Table 2: CPU time for 1000 SpMXVs.
Table 2 shows that the cost of BRGC ordering is amortized by 1000 SpMXVs for all of the matrices. Our experimental results also show that the cost of the BRGC ordering algorithm, as a preprocessing step, can be much less than that of √n SpMXVs and thus can improve the performance of CG-type algorithms in practice. Other column ordering algorithms, reported in [8], are compared with the BRGC ordering algorithm in [6]. As reported in [6], the BRGC algorithm outperforms these other column ordering algorithms on three different computer architectures.
5. REFERENCES
[1] T. Davis. The University of Florida Sparse Matrix Collection. http://www.cise.ufl.edu/research/sparse/
[2] E. Im. Optimizing the Performance of Sparse Matrix-Vector Multiplication. PhD thesis, University of California, Berkeley, 2000.
[3] D. Kreher and D. Stinson. Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1999.
[4] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. 2nd edition, McGraw-Hill, 2001.
[5] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In FOCS '99: Proc. of the 40th Annual Symp. on Foundations of Computer Science, 1999.
[6] S. Haque. A Computational Study of Sparse Matrix Storage Schemes. M.Sc. thesis, University of Lethbridge, 2008.
[7] S. Haque and M. Moreno Maza. Algorithms for sorting large objects. Tech. report, University of Western Ontario, 2010.
[8] A. Pinar and M. Heath. Improving performance of sparse matrix-vector multiplication. In Supercomputing '99: Proc. of the 1999 ACM/IEEE Conf. on Supercomputing (CDROM), New York, USA, 1999.
[9] S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM J. Res. Dev., 41(6), 1997.
[10] R. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, 2003.
[11] Intel. Intel Core 2 Quad Processor Q6600. http://ark.intel.com/Product.aspx?id=29765