Parallel Architecture and Systems
January 2012
Kenichi Miura, Ph.D.
Professor, National Institute of Informatics
Fellow, Fujitsu Laboratories Limited
Course Outline
1. Introduction (What is parallel processing? Why is it needed?)
2. HPC Architecture
   2.1. History of supercomputers and trends
   2.2. Classification of parallel architecture (from CPU to system)
   2.3. Memory architecture (shared, distributed)
3. Computational Models
4. Parallel Algorithms
   4.1. Serial vs. parallel algorithms
   4.2. Hardware realization and examples of special-purpose processors
5. Parallel Programming Languages
   5.1. Relations between parallel languages and architecture
   5.2. Parallel language for shared-memory architecture: OpenMP
   5.3. Parallel languages for distributed-memory architecture: Message Passing Interface (MPI)
6. Application Areas for Large-Scale Scientific Computations/Simulations
7. Grid Computing and Cloud
There are THREE things which are inevitable in this world: Death, Taxes, and Parallelism!
John Riganati, Sarnoff Lab
Origin of Digital Computers
Digital computers were developed during and soon after World War II for military purposes.
USA: ENIAC (ballistic tables); EDVAC, JOHNNIAC, MANIAC (nuclear weapon development)
U.K.: Colossus (code breaking)
ENIAC (c. 1946)
Colossus (UK, 1944)
Source: K. Miura, at Bletchley Park
Alan M. Turing (1912-1954)
John von Neumann (1903-1957)
D. H. Lehmer's Criticism
"This (ENIAC) was a highly parallel machine, until von Neumann spoiled it."
D. H. Lehmer, "A History of the Sieve Process," in N. Metropolis et al. (eds.), A History of Computing in the Twentieth Century (1980)
Von Neumann Model
[Diagram: Input Device -> Central Processing Unit (CPU) -> Output Device, with a Memory unit (MSU) holding both Program and Data]
Von Neumann Bottleneck
[Diagram: the same structure, with a Program Counter inside the CPU; all program and data traffic passes over the single CPU-Memory path, which is the bottleneck]
Performance Trends of Supercomputers in USA (plus Fujitsu)
[Chart: peak performance (Flop/s) vs. year, 1950-2000, log scale from kilo to tera. Machines plotted include ENIAC, UNIVAC, Mark I, IBM 704, CDC 1604, CDC 6600, CDC 7600, STRETCH, LARC, ILLIAC IV, CRAY-1, CRAY X-MP, CRAY Y-MP, CRAY-2, CRAY C90, CRAY T90, FACOM 230-75/APU, VP200, VP400, VP2000, NWT, VPP800, VPP5000, ASCI Red, ASCI Blue, and ASCI White. Trend: roughly a factor of 10^10 in 50 years, about 1.58x per year.]
All Rights Reserved, Copyright FUJITSU LIMITED 1999
Source: L. Smarr
What is Parallel Processing? Why is it necessary?
- To solve problems in a shorter time (a requirement on CPU performance)
- To solve larger problems (a requirement on memory capacity)
Issues: Is it easy to use? How reliable is the system? Can software keep up with parallelism?
- Correctness of programs
- Scalability of programs
- Ease of programming
- Portability of programs across multiple parallel systems
Parallel Architecture - Discussion Items
- Classification of hardware parallelism
  - Processor design (vector, scalar)
  - Memory design (shared, distributed)
- Computational models and parallelism
  - Correspondence with systems (vector vs. scalar parallel; SIMD, MIMD, SPMD)
- Applications and parallelism
  - Data parallel vs. control parallel
- Numerical algorithms and parallelism
How to make computers run faster (1)
Higher Clock Frequency
- Miniaturization of circuits and higher levels of integration (Moore's Law)
- Shorter wire lengths within and across chips
- Efficient cooling technology to remove heat
Generation of Computer Hardware
1st Gen.: Vacuum tubes
2nd Gen.: Discrete transistors
3rd Gen.: Integrated circuits (IC)
4th Gen.: Large-scale integration (LSI, VLSI, ULSI, etc.)
Moore's Law
The number of transistors that can be mounted on a chip doubles every 18 to 24 months.
Linpack Temperature Measurements (1)
Configuration: x86 Linux cluster with Myrinet interconnect.
[Chart: temperatures in Celsius, measured at six locations on the microprocessor and motherboard (detailed on the next slide), up to the point where the computation terminated.]
Source: D. Reed, University of North Carolina
Linpack Temperature Measurements (2)
Thermal dynamics matter: for reliability and fault management, and for power and economics.
[Chart: temperatures in Celsius, up to the point where the computation terminated.]
Why? The Arrhenius equation has direct temperature implications: the mean time to catastrophic failure of commercial silicon halves (2x) for every 10 C above 70 C.
Source: D. Reed, University of North Carolina
CPU Power Trend
Moore's Second Law
Times are changing
Source: Burton Smith, Microsoft
How to make computers run faster (2)
Pipelining
Multiple operations are executed concurrently, with a time delay between stages (example: an automobile assembly line).
- Instruction stream (fetch, decode, issue, ...)
- Data stream (segmented arithmetic units, chaining, ...)
Cray-1 (1976): 160 Mflop/s
Seymour Cray
Cray-1
Sourced from http://www.thocp.net/hardware/cray_1.htm
CRAY 1 Architecture
VP100/200/400 Architecture
How to make computers run faster (3)
Parallelism
Multiple operations are executed simultaneously on physically replicated hardware resources (example: ticket gates at train stations).
- Instruction stream (superscalar, VLIW, MTA, ...)
- Data stream (multiple arithmetic units, striped arithmetic units, ...)
- System architecture (SIMD, MIMD, cc-NUMA, clusters, MPP, ...)
Analogy of Parallel Processing
ILLIAC IV:
The first SIMD supercomputer system (1975-1982, at NASA Ames Research Center)
Computer History Museum, Mountain View, CA
D. L. Slotnick
ILLIAC IV Architecture
- SIMD: 64-way parallelism
- ECL logic (7 gates/chip, by TI), 16 MHz clock
- First semiconductor memory (by Fairchild)
- Designed at the University of Illinois; built by Burroughs Corporation
Number of Processors vs. System Performance
[Chart: system performance (Flop/s, 100 M to 1 T) vs. number of processors (10^1 to 10^5). Vector SMPs reach high performance with few, fast processors; scalar SMPs, V-SMP clusters, SPP/SMP clusters, and MPP/clusters trade lower per-processor performance (100 M to 10 G Flop/s) for much larger processor counts.]
How to make computers run faster (4)
Data Caching
[Diagram: Input and Output Devices and the Memory Unit connect to the Central Processing Unit (CPU, with registers) through a cache hierarchy (L1, L2, L3); Memory holds Program and Data]
Importance of Memory Performance which Matches CPU Performance
The performance of a processor is determined by the speed at which data can be moved to and from the memory unit.
- Registers (architectural, rename, register windows, reservation stations)
- Cache memory (I-cache, D-cache; L1, L2, L3)
- Interleaved memory banks
- Memory technology choices (SRAM vs. SDRAM)
More complexity is introduced in order to cope with the mismatch between CPU and memory speeds!
Interleaved Memory
- A cache-avoidance technique
- Cyclic use of multiple memory banks hides the cycle time of each bank
- Used in vector processing systems
[Diagram: CPU registers connected to an MSU composed of memory banks Bank 0, Bank 1, ..., Bank n-1]
Memory Bank Conflict and How to Avoid it (1)
Vector: Array padding
[Diagram: Matrix A (256 x 256, elements a0 ... a1023 shown) stored across 256 memory banks B0 ... B255. With DIMENSION A(256,256), every element of a stride-256 access falls in the same bank; with DIMENSION A(257,256), the extra padding row shifts each matrix column by one bank, so the same accesses cycle through different banks (at the cost of unused zero elements).]
Memory Bank Conflict and How to Avoid it (2)
Parallel: Skewed Storage by Routing
[Diagram: Matrix A (256 x 256) with DIMENSION A(256,256), stored across banks B0 ... B255. Each row of the matrix is rotated by a different routing amount (Route 0, Route 1, Route 2, Route 3, ...) before being spread over the banks, so both row and column accesses are distributed over distinct banks without any padding.]
Memory Bank Conflict and How to Avoid it (3)
Parallel: Prime-Numbered Memory Banks
[Diagram: Matrix A (16 x 16) with DIMENSION A(16,16). With 16 memory banks B0 ... B15, the stride-16 accesses a0, a16, a32, a48, ... all fall in the same bank; with 17 (a prime number of) banks B0 ... B16, the same accesses cycle through different banks.]
Intel Microprocessor
IBM Power7
Power Wall
25 MW to the building; 12.5 MW to the computers.
Lawrence Livermore National Lab.
Memory Architecture of Parallel Systems and Data Organization
(1) Shared-memory model: uniform (SMP) or non-uniform (NUMA, cc-NUMA)
(2) Distributed-memory model (cluster, MPP)
(3) Hierarchical model (SMP clusters)
[Diagrams: shared memory connected via a memory bus; distributed memory connected via a system interconnect]
Variations of Shared-Memory Architecture
False Sharing of Data on SMP
Shared data may be logically independent but happen to lie on the same cache line. When the data are updated, the cache line thrashes back and forth between processors, resulting in serious performance degradation. Note that read-only data are not affected.
[Diagram: CPU 0 and CPU 1, each with its own cache, both caching the same line from memory]
Distributed Memory Architecture - Interconnect Topology
3D Torus Interconnect of IBM BlueGene (64 node case)
Tofu: 6D Torus Interconnect of Fujitsu Next Generation System
4 x 4 Crossbar Network
[Diagram: nodes connected by cables through a 4 x 4 crossbar]
Scalar vs Vector vs Multithread
- Pipelining of hardware
- Physical replication of units (intra-CPU, inter-CPU)
- Concurrency = depth of pipeline x number of pipelines
Key issue: how to keep the pipelines filled for a longer time!
- Degree of parallelism in the application software
- Locality of datasets
- Memory-latency hiding and wider memory bandwidth
Multi-threading
Denelcor HEP, Cray MTA/XMT:
- Multiple instruction counters (one per thread)
- Pipelines everywhere
- Full/empty bits on every memory word
Concept of Multi-thread Architecture (B.Smith)
HEP Architecture
Cray XMT (Multitasking)