Parallel Architecture and Systems
January 2012
Kenichi Miura, Ph.D.
Professor, National Institute of Informatics
Fellow, Fujitsu Laboratories Limited
Course Outline
1. Introduction (What is parallel processing? Why is it needed?)
2. HPC Architecture
   2.1. History of supercomputers and trends
   2.2. Classification of parallel architecture (from CPU to system)
   2.3. Memory architecture (shared, distributed)
3. Computational Models
4. Parallel Algorithms
   4.1. Serial vs. parallel algorithms
   4.2. Hardware realization and examples of special-purpose processors
5. Parallel Programming Languages
   5.1. Relations between parallel languages and architecture
   5.2. Parallel language for shared-memory architecture: OpenMP
   5.3. Parallel languages for distributed-memory architecture: Message Passing Interface (MPI)
6. Application Areas for Large-Scale Scientific Computations/Simulations
7. Grid Computing and Cloud
There are THREE things which are inevitable in this world: Death, Taxes, and Parallelism!
John Riganati, Sarnoff Lab
Origin of Digital Computers
Digital computers were developed during and soon after World War II for military purposes.
USA: ENIAC (ballistic tables); EDVAC, JOHNNIAC, MANIAC (nuclear weapon development)
U.K.: Colossus (code breaking)
ENIAC (c. 1946)
Colossus (UK, 1944)
Source: K. Miura, at Bletchley Park
Alan M. Turing (1912-1954)
John von Neumann (1903-1957)
D. H. Lehmer's Criticism
"This (ENIAC) was a highly parallel machine, until von Neumann spoiled it."
D. H. Lehmer, "A History of the Sieve Process," in N. Metropolis et al. (eds.), A History of Computing in the Twentieth Century (1980)
Von Neumann Model
[Diagram: Input Device -> Central Processing Unit (CPU) -> Output Device, with a Memory unit (MSU) holding both Program and Data]
Von Neumann Bottleneck
[Diagram: the same structure, with a Program Counter inside the CPU; all program and data traffic passes over the single CPU-Memory path, which is the bottleneck]
Performance Trends of Supercomputers in USA (plus Fujitsu)
[Chart: peak performance (Flop/s) vs. year, 1950-2000, log scale from kilo to tera. Machines plotted include ENIAC, UNIVAC, Mark I, IBM 704, CDC 1604, CDC 6600, CDC 7600, STRETCH, LARC, ILLIAC IV, CRAY-1, CRAY X-MP, CRAY Y-MP, CRAY-2, CRAY C90, CRAY T90, FACOM 230-75/APU, VP200, VP400, VP2000, NWT, VPP800, VPP5000, ASCI Red, ASCI Blue, and ASCI White. Trend: roughly a factor of 10^10 in 50 years, about 1.58x per year.]
All Rights Reserved, Copyright FUJITSU LIMITED 1999
Source: L. Smarr
What is Parallel Processing? Why is it necessary?
- To solve problems in a shorter time (a requirement on CPU performance)
- To solve larger problems (a requirement on memory capacity)
Issues: Is it easy to use? How reliable is the system? Can software keep up with parallelism?
- Correctness of programs
- Scalability of programs
- Ease of programming
- Portability of programs across multiple parallel systems
Parallel Architecture - Discussion Items
- Classification of hardware parallelism
  - Processor design (vector, scalar)
  - Memory design (shared, distributed)
- Computational models and parallelism
  - Correspondence with systems (vector vs. scalar parallel; SIMD, MIMD, SPMD)
- Applications and parallelism
  - Data parallel vs. control parallel
- Numerical algorithms and parallelism
How to make computers run faster (1)
Higher Clock Frequency
- Miniaturization of circuits and higher levels of integration (Moore's Law)
- Shorter wire lengths within and across chips
- Efficient cooling technology to remove heat
Generation of Computer Hardware
1st Gen.: Vacuum tubes
2nd Gen.: Discrete transistors
3rd Gen.: Integrated circuits (IC)
4th Gen.: Large-scale integration (LSI, VLSI, ULSI, etc.)
Moore's Law
The number of transistors that can be mounted on a chip doubles every 18 to 24 months.
Linpack Temperature Measurements (1)
Configuration: x86 Linux cluster with Myrinet interconnect.
[Chart: temperatures in Celsius, measured at six locations on the microprocessor and motherboard (detailed on the next slide), up to the point where the computation terminated.]
Source: D. Reed, University of North Carolina
Linpack Temperature Measurements (2)
Thermal dynamics matter: for reliability and fault management, and for power and economics.
[Chart: temperatures in Celsius, up to the point where the computation terminated.]
Why? The Arrhenius equation has direct temperature implications: the mean time to catastrophic failure of commercial silicon halves (2x) for every 10 C above 70 C.
Source: D. Reed, University of North Carolina
CPU Power Trend
Moore's Second Law
Times are changing
Source: Burton Smith, Microsoft
How to make computers run faster (2)
Pipelining
Multiple operations are executed concurrently, with a time delay between stages (example: an automobile assembly line).
- Instruction stream (fetch, decode, issue, ...)
- Data stream (segmented arithmetic units, chaining, ...)
Cray-1 (1976): 160 Mflop/s
Seymour Cray
Cray-1
Sourced from http://www.thocp.net/hardware/cray_1.htm
CRAY 1 Architecture
VP100/200/400 Architecture
How to make computers run faster (3)
Parallelism
Multiple operations are executed simultaneously on physically replicated hardware resources (example: ticket gates at train stations).
- Instruction stream (superscalar, VLIW, MTA, ...)
- Data stream (multiple arithmetic units, striped arithmetic units, ...)
- System architecture (SIMD, MIMD, cc-NUMA, clusters, MPP, ...)
Analogy of Parallel Processing
ILLIAC IV:
The first SIMD supercomputer system (1975-1982, at NASA Ames Research Center)
Computer History Museum, Mountain View, CA
D. L. Slotnick
ILLIAC IV Architecture
- SIMD: 64-way parallelism
- ECL logic (7 gates/chip, by TI), 16 MHz clock
- First semiconductor memory (by Fairchild)
- Designed at the University of Illinois; built by Burroughs Corporation
Number of Processors vs. System Performance
[Chart: system performance (Flop/s, 100 M to 1 T) vs. number of processors (10^1 to 10^5). Vector SMPs reach high performance with few, fast processors; scalar SMPs, V-SMP clusters, SPP/SMP clusters, and MPP/clusters trade lower per-processor performance (100 M to 10 G Flop/s) for much larger processor counts.]
How to make computers run faster (4)
Data Caching
[Diagram: Input and Output Devices and the Memory Unit connect to the Central Processing Unit (CPU, with registers) through a cache hierarchy (L1, L2, L3); Memory holds Program and Data]
Importance of Memory Performance which Matches CPU Performance
The performance of a processor is determined by the speed at which data can be moved to and from the memory unit.
- Registers (architectural, rename, register windows, reservation stations)
- Cache memory (I-cache, D-cache; L1, L2, L3)
- Interleaved memory banks
- Memory technology choices (SRAM vs. SDRAM)
More complexity is introduced in order to cope with the mismatch between CPU and memory speeds!
Interleaved Memory
- A cache-avoidance technique
- Cyclic use of multiple memory banks hides the cycle time of each bank
- Used in vector processing systems
[Diagram: CPU registers connected to an MSU composed of memory banks Bank 0, Bank 1, ..., Bank n-1]
Memory Bank Conflict and How to Avoid it (1)
Vector: Array padding
[Diagram: Matrix A (256 x 256, elements a0 ... a1023 shown) stored across 256 memory banks B0 ... B255. With DIMENSION A(256,256), every element of a stride-256 access falls in the same bank; with DIMENSION A(257,256), the extra padding row shifts each matrix column by one bank, so the same accesses cycle through different banks (at the cost of unused zero elements).]
Memory Bank Conflict and How to Avoid it (2)
Parallel: Skewed Storage by Routing
[Diagram: Matrix A (256 x 256) with DIMENSION A(256,256), stored across banks B0 ... B255. Each row of the matrix is rotated by a different routing amount (Route 0, Route 1, Route 2, Route 3, ...) before being spread over the banks, so both row and column accesses are distributed over distinct banks without any padding.]
Memory Bank Conflict and How to Avoid it (3)
Parallel: Prime-Numbered Memory Banks
[Diagram: Matrix A (16 x 16) with DIMENSION A(16,16). With 16 memory banks B0 ... B15, the stride-16 accesses a0, a16, a32, a48, ... all fall in the same bank; with 17 (a prime number of) banks B0 ... B16, the same accesses cycle through different banks.]
Intel Microprocessor
IBM Power7
Power Wall
25 MW to the building; 12.5 MW to the computers.
Lawrence Livermore National Lab.
Memory Architecture of Parallel Systems and Data Organization
(1) Shared-memory model: uniform (SMP) or non-uniform (NUMA, cc-NUMA)
(2) Distributed-memory model (cluster, MPP)
(3) Hierarchical model (SMP clusters)
[Diagrams: shared memory connected via a memory bus; distributed memory connected via a system interconnect]
Variations of Shared-Memory Architecture
False Sharing of Data on SMP
Shared data may be logically independent but happen to lie on the same cache line. When the data are updated, the cache line thrashes back and forth between processors, resulting in serious performance degradation. Note that read-only data are not affected.
[Diagram: CPU 0 and CPU 1, each with its own cache, both caching the same line from memory]
Distributed Memory Architecture - Interconnect Topology
3D Torus Interconnect of IBM BlueGene (64 node case)
Tofu: 6D Torus Interconnect of Fujitsu Next Generation System
4 x 4 Crossbar Network
[Diagram: nodes connected by cables through a 4 x 4 crossbar]
Scalar vs Vector vs Multithread
- Pipelining of hardware
- Physical replication of units (intra-CPU, inter-CPU)
- Concurrency = depth of pipeline x number of pipelines
Key issue: how to keep the pipelines filled for a longer time!
- Degree of parallelism in the application software
- Locality of datasets
- Memory-latency hiding and wider memory bandwidth
Multi-threading
Denelcor HEP, Cray MTA/XMT:
- Multiple instruction counters (one per thread)
- Pipelines everywhere
- Full/empty bits on every memory word
Concept of Multi-thread Architecture (B.Smith)
HEP Architecture
Cray XMT (Multitasking)