COMPUTER ORGANIZATION AND DESIGN, 5th Edition
The Hardware/Software Interface
Chapter 6: Parallel Processors from Client to Cloud
§6.1 Introduction
Introduction
Multiprocessor
Goal: connecting multiple computers to get higher performance
Scalability, availability, power efficiency
Multicore processors
Chips with multiple processors (cores)
Task-level (process-level) parallelism
High throughput for independent jobs
Parallel program (parallel software)
A single program run on multiple processors
Challenges: partitioning, coordination, communications overhead
Amdahl’s Law
Sequential part can limit speedup
Example: 100 processors, 90× speedup?
Tnew = Tparallelizable/100 + Tsequential
Speedup = 1 / ((1 - Fparallelizable) + Fparallelizable/100) = 90
Solving: Fparallelizable = 0.999
Need the sequential part to be only 0.1% of the original time
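As a quick check of this arithmetic, the following small C sketch (an illustration added here, not from the slides) evaluates the Amdahl's Law speedup formula for a given parallelizable fraction f and processor count p:

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - f) + f/p), where f is the fraction of the
   original execution time that is parallelizable and p is the processor count. */
static double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    printf("f = 0.999, p = 100: speedup = %.1f\n", amdahl_speedup(0.999, 100)); /* about 91 */
    printf("f = 0.990, p = 100: speedup = %.1f\n", amdahl_speedup(0.990, 100)); /* about 50 */
    return 0;
}

With f = 0.99 the speedup already drops to about 50, which is why the sequential part must shrink to roughly 0.1% of the original time to approach 90×.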
Scaling Example
Workload
sum of 10 scalars (sequential)
sum of a pair of 10×10 matrices (parallel)
What is the speedup going from 1 to 10 and to 100 processors?
Single processor:
Time = (10 + 100) × tadd
10 processors
Time = 10 × tadd + 100/10 × tadd = 20 × tadd
Speedup = 110/20 = 5.5 (55% of potential)
100 processors
Time = 10 × tadd + 100/100 × tadd = 11 × tadd
Speedup = 110/11 = 10 (10% of potential)
Assumes the load can be balanced across processors
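The arithmetic on this slide and the next follows one pattern, sketched below in C (an illustration, not from the book): n is the number of matrix elements, p the processor count, and time is measured in units of tadd.

#include <stdio.h>

/* 10 sequential scalar additions plus n matrix additions spread evenly
   over p processors (assumes a perfectly balanced load). */
static double time_in_tadd(int n, int p) {
    return 10.0 + (double)n / p;
}

int main(void) {
    int n = 100;                                   /* 10 x 10 matrix */
    double t1 = time_in_tadd(n, 1);                /* 110 tadd on one processor */
    printf("p = 10:  speedup = %.1f\n", t1 / time_in_tadd(n, 10));   /* 5.5 */
    printf("p = 100: speedup = %.1f\n", t1 / time_in_tadd(n, 100));  /* 10.0 */
    return 0;
}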
Scaling Example (cont)
What if matrix size is 100 × 100?
Single processor:
Time = (10 + 10000) × tadd
10 processors
Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
Speedup = 10010/1010 = 9.9 (99% of potential)
100 processors
Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
Speedup = 10010/110 = 91 (91% of potential)
Assumes the load can be balanced across processors
Strong vs Weak Scaling
Strong scaling: problem size fixed
Measure the speedup achieved on a multiprocessor while keeping the problem size fixed
Weak scaling: problem size proportional to the number of processors
Measure the speedup achieved on a multiprocessor while growing the problem size in proportion to the increase in the number of processors
Weak-scaling example (constant performance as processors and problem size grow together):
10 processors, 10 × 10 matrix
Time = 10 × tadd + 100/10 × tadd = 20 × tadd
100 processors, 32 × 32 matrix
Time = 10 × tadd + 1024/100 × tadd ≈ 20 × tadd
§6.3 SISD, MIMD, SIMD, SPMD, and Vector
Instruction and Data Streams
An alternative classification of parallel hardware
Single instruction stream, single data stream (SISD): e.g., Intel Pentium 4
Single instruction stream, multiple data streams (SIMD, vector processor): e.g., SSE instructions of x86
Multiple instruction streams, single data stream (MISD): no example yet
Multiple instruction streams, multiple data streams (MIMD): e.g., Intel Xeon e5345
Vector architecture
Highly pipelined functional units
Stream data between vector registers and the functional units
Data collected from memory into vector registers
Results stored from vector registers back to memory
Example: Vector extension to MIPS
32 vector registers, each has 64 64-bit elements
Vector instructions
lv, sv: load/store vector
addv.d: add vectors of double
addvs.d: add scalar to each element of vector of double
Vector architecture
Single add pipeline: completes one addition per cycle
Four add pipelines (an array of parallel functional units): complete four additions per cycle
Example: DAXPY (Y = a × X + Y)
Conventional MIPS code
      l.d    $f0,a($sp)      ;load scalar a
      addiu  $t1,$s0,#512    ;upper bound of what to load
loop: l.d    $f2,0($s0)      ;load x(i)
      mul.d  $f2,$f2,$f0     ;a × x(i)
      l.d    $f4,0($s1)      ;load y(i)
      add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
      s.d    $f4,0($s1)      ;store into y(i)
      addiu  $s0,$s0,#8      ;increment index to x
      addiu  $s1,$s1,#8      ;increment index to y
      subu   $t0,$t1,$s0     ;compute remaining bound
      bne    $t0,$zero,loop  ;loop until x is done
Vector MIPS code
l.d $f0,a($sp) ;load scalar a
lv $v1,0($s0) ;load vector x
mulvs.d $v2,$v1,$f0 ;vector-scalar multiply
lv $v3,0($s1) ;load vector y
addv.d $v4,$v2,$v3 ;add y to product
sv $v4,0($s1) ;store the result
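For reference, here is the same DAXPY loop in plain C (a sketch added for clarity, not part of the original slides); calling it with n = 64 corresponds to one pass of the 64-element vector code above.

/* DAXPY: Y = a*X + Y over n double-precision elements. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}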
Vector vs. Scalar architecture
Reduced instruction fetch and decode
A single vector instruction is equivalent to executing an entire loop.
Reduced data hazard checking
Hazards are checked only between vector instructions, not for every element within a vector (the computation of each element within the same vector is independent).
No loop control hazards
Since an entire loop is replaced by a single vector instruction, the loop branch (which would cause control hazards) disappears.
Efficient memory access
If a vector's elements are adjacent in memory, they can be fetched efficiently from interleaved memory banks.
§6.4 Hardware Multithreading
Hardware Multithreading
Multiple hardware threads on one processor (replicated registers, PC, etc.)
Fast context switching between threads
1. Fine-grained multithreading
Switch threads after each instruction
If one thread stalls (long or short stall), others are executed
Slows down the execution of individual threads, especially those without stalls
2. Coarse-grained multithreading
Switch only on a long stall (e.g., L2-cache miss)
Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)
Simultaneous Multithreading (SMT)
3. Simultaneous multithreading (SMT)
Used in a multiple-issue, dynamically scheduled, pipelined CPU
Motivation
Multiple-issue processors often have more functional units available than a single thread can use.
Schedule instructions from multiple threads
Instructions from independent threads execute whenever functional units are available
Hides the throughput loss from both short and long stalls
Multithreading Example
Figure legend: Coarse MT = coarse-grained multithreading; Fine MT = fine-grained multithreading; SMT = simultaneous multithreading
§6.5 Multicore and Other Shared Memory Multiprocessors
Shared Memory Multiprocessor
SMP: Symmetric Multi-Processing
Hardware provides a single physical address space for all processors
Synchronize shared variables using locks
Memory access time
UMA (uniform) vs. NUMA (nonuniform)
Figure: in UMA, all processors (each with a cache) share the memories through one interconnection network; in NUMA, each processor has its own local memory, and the interconnection network links the processor/memory nodes
Example: Sum Reduction
Sum 100,000 numbers on a 100-processor UMA machine
Each processor has an ID: 0 ≤ Pn ≤ 99
Partition: 1000 numbers per processor
Initial summation on each processor
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];   /* each processor sums its own 1000 numbers */
Now need to add these partial sums
Reduction: divide and conquer
Half the processors add pairs, then quarter, …
Need to synchronize between reduction steps
Example: Sum Reduction
half = 100;                     /* 100 processors */
repeat
    synch();                    /* wait for all partial sums to complete */
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
                                /* when half is odd, processor 0 picks up
                                   the element that would otherwise be lost */
    half = half/2;              /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);              /* exit with the final sum in sum[0] */
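The code above is pseudocode (repeat/until and synch() are not C). Below is a minimal runnable C sketch of the same tree reduction, assuming OpenMP is available; the barrier plays the role of synch(), and the thread count and data values are only illustrative (the sketch assumes the runtime grants all P threads).

#include <stdio.h>
#include <omp.h>

#define P 100                    /* number of "processors" (threads): an assumption */
#define N 100000                 /* numbers to sum, as in the slide */

double A[N], sum[P];

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1.0;          /* sample data */
    omp_set_dynamic(0);                              /* ask for exactly P threads */

    #pragma omp parallel num_threads(P)
    {
        int Pn = omp_get_thread_num();               /* this thread's ID */
        sum[Pn] = 0;
        for (int i = (N/P)*Pn; i < (N/P)*(Pn+1); i++)
            sum[Pn] += A[i];                         /* local partial sum */

        int half = P;
        do {
            #pragma omp barrier                      /* plays the role of synch() */
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half-1];               /* odd count: P0 absorbs the extra element */
            half /= 2;                               /* dividing line on who sums */
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half > 1);
    }
    printf("total = %.1f\n", sum[0]);                /* expect 100000.0 */
    return 0;
}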
§6.7 Clusters, WSC, and Other Message-Passing MPs
Loosely Coupled Clusters
Network of independent computers
Each has private memory and OS
Connected using I/O system
E.g., Ethernet/switch, Internet
Suitable for applications with independent tasks
Web servers, databases, simulations, …
High availability, scalable, affordable
Problems
Administration cost
Low interconnect bandwidth
compared with the processor/memory bandwidth of an SMP
Message Passing
Each processor has a private physical address space
Hardware sends/receives messages between processors
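As an illustration of explicit message passing (a sketch added here, not part of the slides), the following C program uses MPI, assuming an MPI installation; rank 1 sends its value to rank 0, which combines it with its own. Run it with at least two processes, e.g. mpirun -np 2 ./a.out.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double partial = rank + 1.0;     /* stand-in for a locally computed partial sum */
    if (rank == 1) {
        /* explicit send: no shared memory, the data crosses the network */
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        double other;
        MPI_Recv(&other, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("combined = %.1f\n", partial + other);   /* 1.0 + 2.0 = 3.0 */
    }

    MPI_Finalize();
    return 0;
}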
Interconnection Networks
Network topologies
Arrangements of processors, switches, and links
Examples (see figure): bus, ring, 2D mesh, N-cube (N = 3), fully connected network
§6.14 Concluding Remarks
Concluding Remarks
Higher performance by using multiple processors
Difficulties
Developing parallel software
Devising appropriate architectures
SIMD and vector operations match multimedia applications and are easy to program
Higher disk performance by using RAID