Parallel Computing Platforms and Memory System Performance
John Mellor-Crummey
Department of Computer Science Rice University johnmc@cs.rice.edu
COMP 422
Lecture 9
8-10 February 2011
Topics for Today
SIMD and MIMD control structure
Communication models for parallel platforms
Memory hierarchy and performance
Parallel Computing Platforms
A parallel computing platform must specify
concurrency = control structure
interaction between concurrent tasks = communication model
Control Structure of Parallel Platforms
Parallelism ranges from instructions to processes
Processor control structure alternatives
operate under the centralized control of a single control unit, or work independently
SIMD
Single Instruction stream
single control unit dispatches the same instruction to processors
Multiple Data streams
processors work on their own data
MIMD
Multiple Instruction streams
each processor has its own control unit
each processor can execute different instructions
Multiple Data streams
processors work on their own data
SIMD and MIMD Processors
SIMD architecture
MIMD architecture
PE = Processing Element
SIMD Control
SIMD relies on the regular structure of computations
media processing, scientific kernels (e.g. linear algebra, FFT)
Activity mask
per PE predicated execution: turn off operations on certain PEs
each PE tests own conditional and sets own activity mask
PE can conditionally perform operation predicated on mask value
Executing a Conditional on SIMD Processors
if (B == 0) then C = A else C = A/B
[Figure: per-PE execution of the conditional — each PE holds its initial values of A and B; PEs whose activity mask is set execute the then branch, the remaining PEs execute the else branch]
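A minimal sketch of the same predicated execution using SSE intrinsics (the 4-wide float layout and function name are illustrative; both outcomes are computed and the mask selects the result per lane):

#include <xmmintrin.h>   /* SSE intrinsics */

/* Sketch: branchless evaluation of  if (B == 0) C = A; else C = A / B;
   assumes n is a multiple of 4 and default (non-trapping) FP behavior. */
void cond_divide(const float *A, const float *B, float *C, int n)
{
    __m128 zero = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 a    = _mm_loadu_ps(&A[i]);
        __m128 b    = _mm_loadu_ps(&B[i]);
        __m128 mask = _mm_cmpeq_ps(b, zero);   /* all-ones lanes where B == 0 */
        __m128 quot = _mm_div_ps(a, b);        /* computed in every lane; lanes with
                                                  B == 0 are discarded by the mask */
        __m128 c    = _mm_or_ps(_mm_and_ps(mask, a),        /* take A where B == 0 */
                                _mm_andnot_ps(mask, quot)); /* take A/B elsewhere   */
        _mm_storeu_ps(&C[i], c);
    }
}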
SIMD Examples
Many early parallel computers
Illiac IV, MPP, DAP, Connection Machine CM-1/2, and MasPar MP-1/2
Today
vector units: SSE, SSE2, Altivec (Velocity Engine, VMX)
128-bit vector registers
16 8-bit chars, 8 16-bit short ints, 4 32-bit ints, or 4 32-bit floats
SSE2 also operates on 2 64-bit double-precision values
co-processors: ClearSpeed array processor (control PE + array of 96 PEs)
http://www.clearspeed.com
nVidia G80 GPGPU
SSE/SSE2 as examples of SIMD vector units
Scalar processing
traditional mode: one operation produces one result
SIMD vector units
with SSE/SSE2, one operation produces multiple results
[Figure: a scalar add computes one result X + Y per operation; an SSE vector add on x0..x3 and y0..y3 computes x0+y0, x1+y1, x2+y2, x3+y3 in a single operation]
Slide Credit: Alex Klimovitski & Dean Macri, Intel Corporation
SSE / SSE2 SIMD Processing
SSE2 data types: anything that fits into 16 bytes, e.g.,
4x floats, 2x doubles, 16x bytes
Instructions operate in parallel on data in this 16 byte register
add, multiply etc.
Data bytes must be contiguous in memory and aligned
Additional instructions needed for
masking data
moving data from one part of a register to another
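A minimal SSE sketch of such a 16-byte register operation, a 4-wide float add (alignment of the arrays and n being a multiple of 4 are assumed and left to the caller):

#include <xmmintrin.h>   /* SSE intrinsics */

/* Sketch: z[i] = x[i] + y[i], four floats at a time.
   x, y, z are assumed 16-byte aligned (e.g. from posix_memalign). */
void vec_add(const float *x, const float *y, float *z, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 xv = _mm_load_ps(&x[i]);              /* aligned 16-byte load */
        __m128 yv = _mm_load_ps(&y[i]);
        _mm_store_ps(&z[i], _mm_add_ps(xv, yv));     /* one op, four results */
    }
}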
SIMD: ClearSpeed MTAP Co-processor
MTAP processor
Features
hardware multi-threading
asynchronous, overlapped I/O
extensible instruction set
SIMD core
poly controller
poly execution unit
array of 96 PEs
64- and 32-bit floating point
250 MHz (key to low power)
128 million transistors
low power: ~10 Watts
Figure credit:http://www.clearspeed.com/images/arch_mtap.jpg
The Subtlety of Short Vectors
Consider the following:
Stream alignment conflict!
A solution: data layout transformation
Figure credit: P. Sadayappan. See Henretty et al. [CC11]
Dimension-lifted Transformation
(a) 1D array in memory
(b) 2D view of same array
(c) Transposed 2D array brings non-interacting elements into contiguous vectors
(d) New 1D layout after transformation
Figure credit: P. Sadayappan. See Henretty et al. [CC11]
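A rough sketch of the idea in (b)-(d), under the assumption that the 1D array is viewed as V rows of length n/V, with n a multiple of the vector width V (boundary handling omitted):

/* Sketch: dimension-lifted transposition. After the copy, out[k*V .. k*V+V-1]
   holds the original elements k, k + n/V, ..., k + (V-1)*n/V, which do not
   interact in a nearest-neighbor stencil and can be loaded as one vector. */
#define V 4   /* vector width in elements, e.g. 4 floats for SSE */

void dimension_lift(const float *in, float *out, int n)
{
    int cols = n / V;                            /* length of each of the V rows */
    for (int k = 0; k < cols; k++)
        for (int j = 0; j < V; j++)
            out[k * V + j] = in[j * cols + k];   /* transpose the V x (n/V) view */
}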
MIMD Processors
Execute different programs on different processors
Platforms include current generation systems
shared memory
multicore laptop
workstation with multiple quad-core processors
SGI Altix UV (up to 32K sockets, each with an 8-core processor)
Legacy: Cray X1 (up to 8K processors)
distributed memory
clusters (e.g. sugar.rice.edu, stic.rice.edu, ada.rice.edu)
Cray XT, IBM Blue Gene
SPMD programming paradigm
Single Program, Multiple Data streams
same program on different PEs, behavior conditional on thread id (see the MPI sketch below)
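A minimal SPMD sketch using MPI (illustrative only): every process runs this same program, and behavior is conditional on the process rank.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    if (rank == 0)
        printf("rank 0 of %d: coordinating\n", size);
    else
        printf("rank %d: working on my part of the data\n", rank);
    MPI_Finalize();
    return 0;
}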
SIMD vs. MIMD
SIMD platforms
special purpose: not well-suited for all applications
custom designed with long design cycles
less hardware: single control unit
need less memory: only 1 copy of program
today: SIMD common only for accelerators and vector units
MIMD platforms
suitable for broad range of applications
inexpensive: off-the-shelf components + short design cycle
need more memory: program and OS on each processor
Communication Models for Parallel Platforms
Two primary forms of data exchange between parallel tasks
accessing a shared data space
exchanging messages
Platforms that provide a shared data space =
shared-memory platforms
AKA multiprocessors
Platforms that support messaging =
message-passing platforms
AKA multicomputers
Shared Memory Platforms
Components
set of processors
part (or all) of memory is accessible to all processors
Processor interactions
modify data objects stored in shared memory
Flavors of shared address space platforms
UMA: uniform memory access
time taken by a processor to access any memory word is identical
NUMA: non-uniform memory access
time taken by a processor to access a memory word may vary with the word's location
NUMA and UMA Platforms
UMA shared address space platform with cache (Sequent Symmetry, 1988)
NUMA shared address space platform (BBN Butterfly, 1988)
UMA shared address space platform (BBN Monarch, 1990)
SGI Origin 2000: NUMA Platform (1997)
J. Laudon and D. Lenoski. The SGI Origin: a ccNUMA highly scalable server. Proc. of the 24th Annual Intl. Symp. on Computer Architecture, Denver, 241-251, 1997.
Contemporary NUMA Multiprocessor
Figure credit: AMD (http://bit.ly/dT7Zxd)
SGI Origin Network Topology
Example: bristled hypercube for 64 processors
Figure credit: http://web.cecs.pdx.edu/~alaa/ece588/papers/laudon_isca_1997.pdf
NUMA and UMA Platforms
NUMA and UMA platforms differ with respect to locality
algorithms must exploit locality for performance on NUMA platforms
Programming these platforms
easy communication: reads and writes are implicitly visible to other processors
tricky coordination: read-write operations on shared data must be coordinated
e.g. Pthreads mutexes, OpenMP critical sections (see the mutex sketch at the end of this slide)
Cache coherent vs. non cache-coherent architectures
non-cache coherent shared-address space architectures
provides an address map, but not coordinated sharing
processors must explicitly flush data to memory before sharing
examples: BBN Butterfly, Cray T3E
cache coherent architectures: caches coordinate access to multiple copies
hardware support to keep copies consistent
up-to-date values can be retrieved from cache by remote processors
examples: SGI Origin, SGI Altix
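A minimal sketch of the coordination issue above, assuming a Pthreads shared-memory program: loads and stores to shared_counter are implicitly visible to all threads, but the read-modify-write must be coordinated.

#include <pthread.h>

static long shared_counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock);    /* coordinate the read-modify-write  */
        shared_counter++;             /* the update itself is just a load,
                                         an add, and a store to shared data */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}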
Shared Memory vs. Shared Address Space
Shared memory
access shared data with load/store
Shared address space
name shared data with a global address
can't necessarily access all shared data with load/store
e.g. Cray T3E: GET/PUT
Can provide shared address space abstraction on distributed memory multicomputers using software
e.g. Unified Parallel C, Co-array Fortran
Message-Passing Multicomputers
Components
set of processors
each processor has its own exclusive memory
Examples
clustered workstations
non-shared-address-space multicomputers
Cray XT, IBM Blue Gene, many others
Communication model
exchange data using send and receive primitives
Common message passing library: MPI
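A minimal sketch of the send/receive model with MPI (the message contents and tag are illustrative): rank 0 sends a value that rank 1 receives into its own, separate memory.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double x = 3.14;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&x, 1, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %g\n", x);
    }
    MPI_Finalize();
    return 0;
}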
Important Aspects of Communication
Latency: How long does a single operation take?
measured in microseconds
Bandwidth: What data rate can be sustained?
measured in Mbytes/second
These terms can be applied to
memory access
messaging
Bandwidth vs. Latency in a Pipeline
Dave Patterson's laundry example: 4 people doing laundry
wash (30 min) + dry (40 min) + fold (20 min) = 90 min per load
[Figure: pipelined laundry timeline from 6 PM to ~9:30 PM — loads A, B, C, D overlap their wash (30 min), dry (40 min), and fold (20 min) stages]
In this example:
sequential execution: 4 * 90 min = 6 hours
pipelined execution: 30 + 4 * 40 + 20 min = 3.5 hours
Bandwidth = loads/hour
BW = 4/6 loads/hour without pipelining
BW = 4/3.5 loads/hour with pipelining
BW <= 1.5 loads/hour with pipelining and more total loads
Pipelining helps bandwidth but not latency (90 min per load)
Bandwidth is limited by the slowest pipeline stage
Potential speedup = number of pipeline stages
Memory Systems and Performance
Consider architecture model X
a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns, no cache
Assume that the processor can execute 1 FLOP per cycle
FLOP = floating point operation
peak performance = 1 GFLOPS
Consider adding two vectors on this architecture
each FLOP requires two data accesses
peak speed of this computation:
one addition (i.e. 1 FLOP) every 200 ns = speed of 5 MFLOPS
achieves small fraction of the peak processor performance
Memory system is the rate limiter for performance
A Modern Memory Hierarchy (Itanium 2)
[Figure: Itanium 2 memory hierarchy — processor (with FP unit and TLBs), L1 I-cache and D-cache, write buffers, L2 cache, L3 cache, memory controller, and memory banks]
L1: 16 KB I + 16 KB D, 1 cycle
L2: 256 KB, 5 cycles (6 FP cycles)
L3: 3 MB, 13.3 cycles (13.1 FP cycles)
memory: 209.6 ns
http://www.devx.com/Intel/Article/20521
Consider the Impact of Adding a Cache
Add a cache of size 32 KB with a latency of 1 ns (one cycle)
Use this setup to multiply n x n matrices: C = A x B
let n = 32: carefully chosen so all matrices fit in cache
Observations
fetching the two input matrices into the cache: 2n² words
fetch 2K words: 100 ns x 2K = ~200 µs
multiplying two n x n matrices takes 2n³ operations
64K operations total: 64K cycles = ~64 µs
total time for the computation is therefore approximately the sum of
time = time for load/store operations + time for the computation
time = 200 µs + 64 µs = 264 µs
peak computation rate = (64K FLOPs) / (264 µs) ≈ 248 MFLOPS
Memory Bandwidth
Limited by both
the bandwidth of the memory bus
the bandwidth of the memory modules
Can be improved by increasing the size of memory blocks
Memory system takes l time units to deliver b units of data
l is the latency of the system b is the block size
Consider the Impact of Non-unit Block Size
Again, consider architecture model X, except
block size is 4 words instead of 1 word
Repeat the vector-add computation analysis
assume vectors are laid out contiguously in memory
a single memory access fetches four consecutive words
four additions can be performed every 200 cycles (= 200 ns)
performance = 1 FLOP / 50 ns; peak speed = 20 MFLOPS
Reusing Data in the Memory Hierarchy
Spatial reuse: using more than one word in a multi-word line
using multiple words in a cache line
Temporal reuse: using a word repeatedly
accessing the same word in a cache line more than once
Applies at every level of the memory hierarchy
e.g. TLB
spatial reuse: access multiple cache lines in a page
temporal reuse: access data on the same page repeatedly
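A minimal sketch of one common way to create such reuse, loop tiling for matrix multiply (the block size BS is an illustrative choice, picked so three BS x BS tiles fit in cache; C is assumed zero-initialized):

#define BS 32   /* illustrative tile size */

void matmul_tiled(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
      for (int jj = 0; jj < n; jj += BS)
        for (int kk = 0; kk < n; kk += BS)
          /* one BS x BS block of work: these tiles stay cache-resident */
          for (int i = ii; i < ii + BS && i < n; i++)
            for (int k = kk; k < kk + BS && k < n; k++) {
              double aik = A[i * n + k];               /* temporal reuse of A      */
              for (int j = jj; j < jj + BS && j < n; j++)
                C[i * n + j] += aik * B[k * n + j];    /* spatial reuse of B and C */
            }
}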
Experimental Study of Memory (membench)
Microbenchmark for memory system performance
for array A of length L from 4 KB to 8 MB by 2x
for stride s from 4 bytes (1 word) to L/2 by 2x
time the following loop (repeat many times and average)
for i from 0 to L by s
load A[i] from memory (4 bytes)
each timed loop for one (L, s) pair is 1 experiment
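A rough C sketch of the same microbenchmark (array size, strides, and the repetition count are illustrative; a real harness would also subtract loop overhead):

#include <stdio.h>
#include <time.h>

#define MAXLEN (8 * 1024 * 1024 / sizeof(int))   /* 8 MB of 4-byte words */
static int A[MAXLEN];

/* Average nanoseconds per load for one (length, stride) experiment.
   len and stride are in array elements (the slide's 4-byte words). */
double time_strided(size_t len, size_t stride, int reps)
{
    volatile int sink = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < len; i += stride)
            sink += A[i];                            /* one load per iteration */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    size_t loads = (size_t)reps * ((len + stride - 1) / stride);
    (void)sink;
    return 1e9 * secs / loads;                       /* average ns per load */
}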
Membench: What to Expect
[Figure: expected profile of average cost per access vs. stride s — a plateau at the cache hit time while the total size is smaller than L1, rising toward the memory time when the size exceeds L1]
Consider the average cost per load
plot one line for each array length, time vs. stride
unit stride is best: if a cache line holds 4 words, at most 1/4 of accesses miss
if the array is smaller than a cache, all accesses will hit after the first run
time for the first run is negligible with enough repetitions
the figure above assumes only one level of cache
the performance profile is more complicated on modern systems
Memory Hierarchy on a Sun Ultra-2i
Sun Ultra-2i, 333 MHz
[Figure: measured average access time vs. stride, one curve per array length]
Mem: 396 ns (132 cycles)
L2: 2 MB, 12 cycles (36 ns), 64-byte line
L1: 16 KB, 2 cycles (6 ns), 16-byte line
8 KB pages, 32 TLB entries
See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details
Memory Hierarchy on a Pentium III
Katmai processor on Millennium, 550 MHz
[Figure: measured average access time vs. stride, one curve per array size]
L2: 512 KB, 60 ns
L1: 64 KB, 5 ns, 4-way?
L1 line: 32 bytes?
Memory System Performance: Summary
Examples here illustrate the following concepts:
exploiting spatial and temporal locality is critical for
amortizing memory latency
increasing effective memory bandwidth
ratio of the number of operations to the number of memory accesses
good indicator of anticipated tolerance of limited memory bandwidth
memory layout and computation organization significantly affect spatial and temporal locality
Multithreading for Latency Hiding
A thread is a single stream of control in the flow of a program.
We illustrate threads with a dense matrix-vector multiply:
for (i = 0; i < n; i++) c[i] = dot_product(get_row(a, i), b);
Each dot-product is independent of others
thus, can execute concurrently
Can rewrite the above code segment using threads
for (i = 0; i < n; i++) c[i] = create_thread(dot_product, get_row(a, i), b);
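A self-contained Pthreads sketch of the create_thread idea (one thread per row and the tiny problem size are illustrative; a real code would bound the number of threads):

#include <pthread.h>

enum { N = 4 };                       /* illustrative problem size */
static double a[N][N], b[N], c[N];

static void *dot_worker(void *p)
{
    long i = (long)p;                 /* row index passed as the argument      */
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        sum += a[i][j] * b[j];        /* each row's dot product is independent */
    c[i] = sum;
    return NULL;
}

int main(void)
{
    pthread_t t[N];
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, dot_worker, (void *)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return 0;
}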
Multithreading for Latency Hiding (contd)
Consider how the code executes
first thread accesses a pair of vector elements and waits for them
second thread can access two other vector elements in the next cycle
...
After l units of time
(l is the latency of the memory system)
first thread gets its data from memory and performs its multiply-add
Next cycle
data items for the next function instance arrive
... Every clock cycle, we can perform a computation
Multithreading for Latency Hiding (contd)
Previous example makes two hardware assumptions
memory system can service multiple outstanding requests
processor is capable of switching threads at every cycle
Also requires the program to have explicit threaded concurrency
Machines such as the HEP, Tera, and Sun T2000 (Niagara-2) rely on multithreaded processors
can switch the context of execution in every cycle
are able to hide latency effectively
Sun T2000, 64-bit SPARC v9 processor @1200MHz
organization: 8 cores, 4 strands per core
8 KB data cache and 16 KB instruction cache per core
L2 cache: unified, 12-way, 3 MB
RAM: 32 GB
Prefetching for Latency Hiding
Misses on loads cause programs to stall; why not load data before it is needed?
by the time it is actually needed, it will be there!
Drawback: need space to store early loads
may overwrite other necessary data in the cache
if early loads are overwritten, we are little worse off than before!
Prefetching support
software only, e.g. Itanium 2
hardware and software, e.g. Opteron
Hardware prefetching requires
predictable access pattern
limited number of independent streams
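A minimal sketch of software prefetching with the GCC/Clang builtin (the prefetch distance DIST is an illustrative tuning parameter):

#define DIST 16   /* illustrative prefetch distance, in elements */

double sum_with_prefetch(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/1);  /* request early */
        s += a[i];                    /* by now a[i] is (hopefully) already in cache */
    }
    return s;
}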
Tradeoffs in Multithreading and Prefetching
Multithreaded systems
bandwidth requirements
may increase very significantly because of reduced cache per thread
can become bandwidth bound instead of latency bound
Multithreading and prefetching
only address latency
may often exacerbate bandwidth needs
have significantly larger data footprint; need hardware for that
References
Adapted from slides "Parallel Programming Platforms" by Ananth Grama accompanying the course textbook
Vivek Sarkar (Rice), COMP 422 slides from Spring 2008
Jack Dongarra (U. Tenn.), CS 594 slides from Spring 2008, http://www.cs.utk.edu/%7Edongarra/WEB-PAGES/cs594-2008.htm
Kathy Yelick (UC Berkeley), CS 267 slides from Spring 2007, http://www.eecs.berkeley.edu/~yelick/cs267_sp07/lectures
Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam and P. Sadayappan. Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures. In ETAPS Intl. Conf. on Compiler Construction (CC'2011), Springer Verlag, Saarbrücken, Germany, March 2011. To appear.