Chapter 6
Parallel Processors from Client to Cloud
(5th Edition)
Multiprocessors
Scalability, availability, power efficiency
6.1 Introduction
Multicore microprocessors
Hardware
Software
Synchronization
Subword Parallelism
Cache Coherence
Parallel Programming
Difficulties
Partitioning
Coordination
Communications overhead
Amdahl's Law
Example: 100 processors, target speedup of 90
Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
Solving: Fparallelizable = 0.999, i.e., the sequential part can be at most 0.1% of the original time
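To make the arithmetic concrete, here is a minimal C sketch (the function name and the sample fractions are illustrative, not from the text) that evaluates Amdahl's Law:

#include <stdio.h>

/* Amdahl's Law: overall speedup on n processors when a fraction f of the
   original execution time is parallelizable. */
static double amdahl_speedup(double f, int n) {
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void) {
    /* The slide's case: 99.9% parallelizable on 100 processors (~90x). */
    printf("f = 0.999, n = 100 -> %.1fx\n", amdahl_speedup(0.999, 100));
    /* Even 1% of sequential work cuts the speedup roughly in half. */
    printf("f = 0.990, n = 100 -> %.1fx\n", amdahl_speedup(0.990, 100));
    return 0;
}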
Scaling Example: strong vs. weak scaling
Strong scaling: problem size fixed; 100 processors, as in the previous example
Weak scaling: problem size grows in proportion to the number of processors
10 processors, 10 × 10 matrix: Time = 20 × tadd
100 processors, 32 × 32 matrix (~1000 elements): Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
Constant performance in this case
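The same bookkeeping in a small C sketch (the function name and the unit of tadd are illustrative): time, in units of tadd, to sum 10 scalars sequentially plus an n × n matrix whose additions are split evenly across p processors:

#include <stdio.h>

/* Execution time in units of t_add: 10 sequential scalar additions plus
   n*n matrix additions divided evenly among p processors. */
static int time_in_tadd(int n, int p) {
    return 10 + (n * n) / p;
}

int main(void) {
    printf("10 procs,  10x10: %d t_add\n", time_in_tadd(10, 10));   /* 20 */
    printf("100 procs, 10x10: %d t_add\n", time_in_tadd(10, 100));  /* 11: strong scaling */
    printf("100 procs, 32x32: %d t_add\n", time_in_tadd(32, 100));  /* 20: weak scaling */
    return 0;
}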
Instruction and Data Streams

                                       Data Streams
                           Single                       Multiple
Instruction   Single       SISD: Intel Pentium 4        SIMD: SSE instructions of x86
Streams       Multiple     MISD: no examples today      MIMD: Intel Xeon e5345
Vector Processors
Example: DAXPY (Y = a × X + Y)

Conventional MIPS code:
        l.d    $f0,a($sp)       ;load scalar a
        addiu  r4,$s0,#512      ;upper bound of what to load
  loop: l.d    $f2,0($s0)       ;load x(i)
        mul.d  $f2,$f2,$f0      ;a × x(i)
        l.d    $f4,0($s1)       ;load y(i)
        add.d  $f4,$f4,$f2      ;a × x(i) + y(i)
        s.d    $f4,0($s1)       ;store into y(i)
        addiu  $s0,$s0,#8       ;increment index to x
        addiu  $s1,$s1,#8       ;increment index to y
        subu   $t0,r4,$s0       ;compute bound
        bne    $t0,$zero,loop   ;check if done

Vector MIPS code:
        l.d     $f0,a($sp)      ;load scalar a
        lv      $v1,0($s0)      ;load vector x (64 elements)
        mulvs.d $v2,$v1,$f0     ;vector-scalar multiply
        lv      $v3,0($s1)      ;load vector y
        addv.d  $v4,$v2,$v3     ;add y to product
        sv      $v4,0($s1)      ;store the result
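For reference, a plain C version of the loop that both sequences implement; the 64-element length matches the vector register length assumed above (names are illustrative):

/* DAXPY: Y = a*X + Y over 64 double-precision elements. */
void daxpy(double a, const double x[64], double y[64]) {
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}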
SIMD
Simplifies synchronization
Reduced instruction control hardware
Works best for highly data-parallel applications, such as media applications
SIMD Extensions
Implementations: x86 SIMD extensions (MMX, SSE, AVX)
SIMD Implementations
        L.D     F0,a           ;load scalar a
        MOV     F1,F0          ;copy a into F1 for SIMD MUL
        MOV     F2,F0          ;copy a into F2 for SIMD MUL
        MOV     F3,F0          ;copy a into F3 for SIMD MUL
        DADDIU  R4,Rx,#512     ;last address to load
  Loop: L.4D    F4,0[Rx]       ;load X[i], X[i+1], X[i+2], X[i+3]
        MUL.4D  F4,F4,F0       ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
                               ;(F4*F0, F5*F1, F6*F2, F7*F3)
        L.4D    F8,0[Ry]       ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
        ADD.4D  F8,F8,F4       ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
        S.4D    0[Ry],F8       ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
        DADDIU  Rx,Rx,#32      ;increment index to X
        DADDIU  Ry,Ry,#32      ;increment index to Y
        DSUBU   R20,R4,Rx      ;compute bound
        BNEZ    R20,Loop       ;check if done
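On today's x86 parts the same pattern is usually written with SIMD intrinsics rather than hand assembly. A minimal sketch using AVX (an x86 extension beyond the SSE mentioned earlier; it assumes AVX support and that n is a multiple of 4):

#include <immintrin.h>   /* AVX intrinsics */

/* DAXPY with 4-wide double-precision SIMD, mirroring the L.4D/MUL.4D/ADD.4D
   sequence above.  Assumes n is a multiple of 4. */
void daxpy_avx(int n, double a, const double *x, double *y) {
    __m256d va = _mm256_set1_pd(a);                     /* broadcast a to 4 lanes */
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);            /* X[i..i+3] */
        __m256d vy = _mm256_loadu_pd(&y[i]);            /* Y[i..i+3] */
        vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);  /* a*X + Y */
        _mm256_storeu_pd(&y[i], vy);                    /* store Y[i..i+3] */
    }
}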
Multithreading
Fine-grain multithreading: switch between threads on every instruction/cycle
Coarse-grain multithreading: switch threads only on long stalls (e.g., a last-level cache miss)
Simultaneous multithreading (SMT): on a multiple-issue, dynamically scheduled processor, issue instructions from several threads in the same cycle
Multithreading Example
[Figure: issue-slot utilization over time, comparing a superscalar core running a single thread, multiprocessing across cores, and simultaneous multithreading; shading distinguishes Threads 1-5 and idle slots]
Processors
Caches
Prefetching
Multithreading
Shared Memory
half = 100;
repeat /* parallel */
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
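A runnable sketch of the same tree reduction using OpenMP, with one thread standing in for each processor Pn and a barrier playing the role of synch() (it assumes the runtime really provides 100 threads and that sum[0..99] already holds the partial sums):

#include <omp.h>

/* Tree reduction of 100 partial sums; sum[0] holds the total on return. */
void reduce100(double sum[100]) {
    #pragma omp parallel num_threads(100)    /* one thread per "processor" Pn */
    {
        int Pn = omp_get_thread_num();
        int half = 100;
        do {
            #pragma omp barrier              /* synch(): wait for previous round */
            if (half % 2 != 0 && Pn == 0)    /* odd count: P0 picks up the */
                sum[0] += sum[half - 1];     /* element that has no partner */
            half = half / 2;                 /* dividing line on who sums */
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half != 1);
    }
}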
3D graphics processing
History of GPUs
GPU Architectures
Programming languages/APIs
DirectX, OpenGL
C for Graphics (Cg), High Level Shader Language (HLSL)
Compute Unified Device Architecture (CUDA)
OpenCL
The CPU devotes significant chip area to caches and control logic for general-purpose applications; the GPU instead replicates a large number of simple cores for data-parallel, thread-parallel computation, and therefore requires higher DRAM bandwidth.
[Figure: GPU block diagram — the Host feeds an Input Assembler and Thread Execution Manager; streaming processors (SP) are grouped with texture filtering units (TF), per-group Parallel Data Caches and L1 caches; shared L2 caches and frame-buffer (FB) partitions sit behind them, and load/store units access Global Memory]
Architecture generations: G80, GT200, Fermi, Kepler
Streaming Multiprocessors (SMs)
8 streaming processors per SM
[Figure: GPU execution model — host (CPU) execution alternates with kernel launches (kernel 0, kernel 1); each kernel runs as a grid of thread blocks (Block 0 ... Block N), and each block contains many threads]
Streaming Processors
Executed in parallel, SIMD style: the 8 SPs of an SM execute a 32-thread warp over 4 clock cycles
Hardware contexts for 24 warps per SM: registers, PCs, ... (24 × 32 = 768 thread contexts)
Example GPU specifications:

                       GTX 680         GTX 580         GTX 560 Ti      GTX 480
Architecture           Kepler          Fermi           Fermi           Fermi
Stream Processors      1536            512             384             480
Texture Units          128             64              64              60
ROPs                   32              48              32              48
Core Clock             1006MHz         772MHz          822MHz          700MHz
Shader Clock           N/A             1544MHz         1644MHz         1401MHz
Boost Clock            1058MHz         N/A             N/A             N/A
Memory Clock           6.008GHz GDDR5  4.008GHz GDDR5  4.008GHz GDDR5  3.696GHz GDDR5
Memory Bus Width       256-bit         384-bit         256-bit         384-bit
Memory Size            2GB             1.5GB           1GB             1.5GB
FP64                   1/24 FP32       1/8 FP32        1/12 FP32       1/12 FP32
TDP                    195W            244W            170W            250W
Transistor Count       3.5B            3B              1.95B           3B
Manufacturing Process  TSMC 28nm       TSMC 40nm       TSMC 40nm       TSMC 40nm
Launch Price           $499            $499            $249            $499
Classifying GPUs

                                  Static: Discovered       Dynamic: Discovered
                                  at Compile Time          at Runtime
Instruction-Level Parallelism     VLIW                     Superscalar
Data-Level Parallelism            SIMD or Vector           Tesla Multiprocessor
[Figure: CUDA memory hierarchy — a grid contains thread blocks (e.g., Block (0, 0), Block (1, 0)); each block has its own Shared Memory and per-thread Registers; the Host and all blocks access Global Memory]
Multicore with SIMD vs. GPU

Feature                                            Multicore with SIMD   GPU
SIMD processors                                    4 to 8                8 to 16
SIMD lanes/processor                               2 to 4                8 to 16
Multithreading hardware support for SIMD threads   2 to 4                16 to 32
Ratio of single- to double-precision performance   2:1                   2:1
Largest cache size                                 8 MB                  0.75 MB
Size of memory address                             64-bit                64-bit
Size of main memory                                8 GB to 256 GB        4 GB to 6 GB
Memory protection at level of page                 Yes                   Yes
Demand paging                                      Yes                   No
Integrated scalar processor/vector processor       Yes                   No
Cache coherent                                     Yes                   No
Reduction
half = 100;
repeat /* parallel */
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
Grid Computing
Network topologies
Bus
Ring
N-cube (N = 3)
2D Mesh
Fully connected
Multistage Networks
Network Characteristics
Performance
Link bandwidth
Total network bandwidth
Bisection bandwidth (worked example below)
Cost
Power
Routability in silicon
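To make the first three measures concrete, a small C sketch comparing two topologies for p nodes with per-link bandwidth b (the formulas follow the usual definitions: total network bandwidth = link bandwidth × number of links; bisection bandwidth = bandwidth crossing a worst-case split into two halves; function names are illustrative):

#include <stdio.h>

/* Ring: p links total; cutting it into two halves severs 2 links. */
static void ring_bw(int p, double b, double *total, double *bisection) {
    *total     = p * b;
    *bisection = 2 * b;
}

/* Fully connected: a link between every pair; (p/2)^2 links cross the split. */
static void full_bw(int p, double b, double *total, double *bisection) {
    *total     = b * p * (p - 1) / 2.0;
    *bisection = b * (p / 2.0) * (p / 2.0);
}

int main(void) {
    double t, bi;
    ring_bw(64, 1.0, &t, &bi);
    printf("ring (p=64):            total = %6.0f  bisection = %6.0f\n", t, bi);
    full_bw(64, 1.0, &t, &bi);
    printf("fully connected (p=64): total = %6.0f  bisection = %6.0f\n", t, bi);
    return 0;
}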
Job-level parallelism
Parallel Benchmarks
Code or Applications?
Traditional benchmarks
Modeling Performance
Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
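A minimal C sketch of this roofline calculation (the peak numbers below are placeholders, not measurements from the text):

#include <stdio.h>

/* Roofline: performance is capped either by peak floating-point throughput
   or by memory bandwidth times arithmetic intensity, whichever is lower. */
static double attainable_gflops(double peak_gflops, double peak_bw_gbs,
                                double arithmetic_intensity) {
    double memory_bound = peak_bw_gbs * arithmetic_intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    /* Placeholder machine: 100 GFLOP/s peak FP, 25 GB/s peak memory bandwidth. */
    for (double ai = 0.25; ai <= 16.0; ai *= 2.0)
        printf("AI = %5.2f FLOP/byte -> %6.2f GFLOP/s attainable\n",
               ai, attainable_gflops(100.0, 25.0, ai));
    return 0;
}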
Comparing Systems
Optimizing Performance
Optimize FP performance
Software prefetch
Memory affinity
Optimizing Performance
Arithmetic intensity is not always fixed: it may scale with problem size
Caching reduces memory accesses and so increases arithmetic intensity
Rooflines
Benchmarks
Performance Summary
Multithreading DGEMM: use OpenMP to parallelize the outermost loop (see the sketch below)
Multithreaded DGEMM
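A minimal sketch of a triple-loop DGEMM with its outermost loop parallelized by OpenMP, in the spirit of the chapter's example (the unblocked loop nest and column-major indexing are choices of this sketch; compile with OpenMP enabled, e.g. -fopenmp):

#include <omp.h>

/* DGEMM: C = C + A*B for n x n matrices stored in column-major order.
   The outer loop is split across threads; each thread writes a disjoint
   set of C elements, so no synchronization is needed inside the loop. */
void dgemm_omp(int n, const double *A, const double *B, double *C) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double cij = C[i + j * n];               /* C[i][j] */
            for (int k = 0; k < n; ++k)
                cij += A[i + k * n] * B[k + j * n];  /* A[i][k] * B[k][j] */
            C[i + j * n] = cij;
        }
}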
Fallacies
Pitfalls
Concluding Remarks