
Parallel and Distributed Programming
Dr. Muhammad Naveed Akhtar
Lecture – 02b
Parallel Hardware



Flynn’s Taxonomy

• SISD (classic von Neumann): single instruction stream, single data stream.
• SIMD: single instruction stream, multiple data streams.
• MISD: multiple instruction streams, single data stream. (Not covered.)
• MIMD: multiple instruction streams, multiple data streams.


Computer Architecture Classifications


SIMD
• Parallelism achieved by dividing data among the processors.
• Applies the same instruction to multiple data items.
• Called data parallelism.

for (i = 0; i < n; i++)
    x[i] += y[i];

(Diagram: a single control unit broadcasts each instruction to n ALUs, one per data item x[1], x[2], …, x[n].)


SIMD
• What if we don’t have as many ALUs as data items?
• Divide the work and process iteratively (a code sketch follows the table below).
• Ex. m = 4 ALUs and n = 15 data items.

  Round   ALU1    ALU2    ALU3    ALU4
    1     X[0]    X[1]    X[2]    X[3]
    2     X[4]    X[5]    X[6]    X[7]
    3     X[8]    X[9]    X[10]   X[11]
    4     X[12]   X[13]   X[14]   (idle)
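The same round-by-round division can be written out in plain C. The sketch below simulates the m = 4 ALUs with an inner loop; it is a serial simulation of the idea, not actual SIMD hardware:

#include <stdio.h>

#define M 4    /* number of ALUs (m in the example)       */
#define N 15   /* number of data items (n in the example) */

int main(void) {
    int x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1; }

    /* Each round the control unit issues one add; each of the M ALUs
       handles one element, and an ALU with no element stays idle.   */
    for (int round = 0; round * M < N; round++) {
        for (int alu = 0; alu < M; alu++) {
            int i = round * M + alu;
            if (i < N)           /* this ALU has an element this round */
                x[i] += y[i];
            /* else: idle, as ALU4 is in round 4 of the table above    */
        }
    }

    printf("x[14] = %d\n", x[14]);   /* 14 + 1 = 15 */
    return 0;
}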
SIMD drawbacks
• All ALUs are required to execute the same instruction, or remain idle.
• In classic design, they must also operate synchronously.
• The ALUs have no instruction storage.
• Efficient for large data-parallel problems, but not for other, more complex types of parallel problems.


Vector Processors
• Operate on arrays or vectors of data, while conventional CPUs operate on individual data elements or scalars.
• Vector registers.
• Capable of storing a vector of operands and operating simultaneously on their contents.

• Vectorized and pipelined functional units.
• The same operation is applied to each element in the vector (or to pairs of elements).

• Vector instructions.
• Operate on vectors rather than scalars (a vectorizable loop is sketched below).
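As a concrete illustration, a loop of the following shape is what vector registers and vector instructions are designed for; whether a compiler actually turns it into vector instructions depends on the compiler and its flags (e.g. gcc -O3):

/* The same add is applied element-wise across whole vector registers. */
void vadd(const float *restrict a, const float *restrict b,
          float *restrict c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* conceptually one vector add per chunk */
}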


Vector Processors (contd.)
• Interleaved memory.
• Multiple “banks” of memory, which can be accessed more or less independently.
• Distribute elements of a vector across multiple banks to reduce or eliminate the delay in loading/storing successive elements.

• Strided memory access and hardware scatter/gather.
• The program accesses elements of a vector located at fixed intervals (the stride); scatter/gather handles irregularly (index-) addressed elements in hardware. Both patterns are sketched in code below.
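A minimal C sketch of the two access patterns (the function names column_sum and gather_sum are invented for this illustration):

/* Strided access: elements at a fixed interval. Here, summing one
   column of a row-major n x n matrix gives a stride of n elements. */
float column_sum(const float *a, int n, int col) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i * n + col];       /* stride-n access pattern */
    return s;
}

/* Gather: elements at arbitrary positions given by an index vector. */
float gather_sum(const float *a, const int *idx, int m) {
    float s = 0.0f;
    for (int i = 0; i < m; i++)
        s += a[idx[i]];            /* indexed (gather) access */
    return s;
}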


Vector processors – Pros & Cons
Pros
• Fast.
• Easy to use.
• Vectorizing compilers are good at identifying code to exploit.
• Compilers can also provide information about code that cannot be vectorized, which helps the programmer re-evaluate code.
• High memory bandwidth.
• Uses every item in a cache line.

Cons
• They don’t handle irregular data structures as well as other parallel architectures.
• A very finite limit to their ability to handle ever larger problems (scalability).


Graphics Processing Units (GPU)
• Real-time graphics application programming interfaces (APIs) use points, lines, and triangles to internally represent the surface of an object.
• A graphics processing pipeline converts the internal representation into an array of pixels that can be sent to a computer screen.
• Several stages of this pipeline (called shader functions) are programmable.
• Typically just a few lines of C code.
• Shader functions are also implicitly parallel, since they can be applied to multiple elements in the graphics stream (a sketch of the idea follows).
• GPUs can often optimize performance by using SIMD parallelism.
• The current generation of GPUs use SIMD parallelism, although they are not pure SIMD systems.
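The following is not a real shader language, only a C-style sketch of why a shader function is implicitly data-parallel; the names Pixel, darken, and run_shader are invented here:

typedef struct { float r, g, b; } Pixel;

/* The "shader function": a small routine applied to one element. */
static Pixel darken(Pixel p, float factor) {
    p.r *= factor;  p.g *= factor;  p.b *= factor;
    return p;
}

/* Applying it independently to every pixel is what makes it
   implicitly parallel and a good fit for SIMD hardware. */
void run_shader(Pixel *image, int n_pixels, float factor) {
    for (int i = 0; i < n_pixels; i++)      /* conceptually parallel */
        image[i] = darken(image[i], factor);
}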


Typical GPU Architecture

• Special Function Units (SFUs)
• Handle sin, cosine, reciprocal, and square root.
• Each SFU executes one instruction per thread.

• Warp
• A set of 32 threads within a thread block.
• All threads in a warp execute the same instruction.
MIMD Systems
• Supports multiple simultaneous instruction streams operating on multiple data streams.
• Typically consist of a collection of fully independent processing units or cores, each of which has its
own control unit and its own ALU.
• Two types
• Shared Memory System
• Distributed Memory System


Shared Memory System
• A collection of processors, connected to a memory system via an interconnection network.
• Each processor can access each memory location.
• The processors usually communicate implicitly by accessing shared data structures.
• Most widely available shared memory systems use one or more multicore processors (multiple CPUs or cores on a single chip).


Types of Shared Memory Systems

• UMA (uniform memory access) multicore system: the time to access all memory locations is the same for all the cores.
• NUMA (non-uniform memory access) multicore system: a memory location a core is directly connected to can be accessed faster than a memory location that must be accessed through another chip.
Distributed Memory System
• Clusters (most popular)
• A collection of commodity systems.
• Connected by a commodity interconnection network.

• Nodes of a cluster are individual computation units joined by a communication network.


Interconnection Networks
• Affects performance of both distributed and shared memory systems.
• Two categories:
• Shared memory interconnects
• Bus interconnect
• Switched interconnect
• Distributed memory interconnects
• Direct interconnect
• Indirect interconnect


Shared Memory Interconnects
• Bus interconnect
• A collection of parallel communication wires together with some hardware that controls access to the bus.
• Communication wires are shared by the devices that are connected to it.
• As the number of devices connected to the bus increases, contention for use of the bus increases, and
performance decreases.

• Switched interconnect
• Uses switches to control the routing of data among the connected devices.
• Crossbar
• Allows simultaneous communication among different devices.
• Faster than buses, but the cost of the switches and links is relatively high.
• Traditionally used in old telephone circuit-switching exchanges.


Crossbar Interconnect

(Figures: the configuration of internal switches in a crossbar, showing the closed and open states; a crossbar switch connecting 4 processors (Pi) and 4 memory modules (Mj); and simultaneous memory accesses by the processors: P1→M4, P2→M3, P3→M1, P4→M2.)
Distributed Memory Interconnects

• Direct interconnect: each switch is directly connected to a processor-memory pair, and the switches are connected to each other.
• Indirect interconnect: switches may not be directly connected to a processor-memory pair.


Direct interconnect
(Figures: common direct-interconnect topologies: a linear array with no wrap-around; a ring, i.e. a linear array with wrap-around (1-D ring/torus); a 2-D mesh with and without wraparound (the toroidal mesh); and a 3-D mesh with no wraparound.)

• The number of nodes is always taken to be a power of 2.


Bisection Width
• A measure of “number of simultaneous communications” or “connectivity”.
• How many simultaneous communications can take place “across the divide” between the halves?

• Bisection width of a ring = 2.
• Bisection width of a toroidal mesh = 2√p (here p = 16, so the bisection width is 8).
• Bisection width of a fully connected (completely connected) network = p²/4 (here p = 6, so the bisection width is 9).


Definitions
• Bandwidth
• The rate at which a link can transmit data. (Usually given in megabits or megabytes per second.)
• Diameter
• The distance between the farthest two nodes in the network. Metric for worst-case latency.
• Arc Connectivity:
• The minimum number of wires you must cut to partition the network into two disconnected (not necessarily equal) parts.
• Bisection bandwidth
• A measure of network quality.
• Instead of counting the number of links joining the halves, it sums the bandwidth of the links.
• Cost
• The number of links or switches (whichever is asymptotically higher) is an important contributor to cost.


Hypercube
• Highly connected direct interconnect.
• Built inductively:
• A one-dimensional hypercube is a fully-connected system with two processors.
• A two-dimensional hypercube is built from two one-dimensional hypercubes by joining “corresponding”
switches.
• Similarly, a three-dimensional hypercube is built from two two-dimensional hypercubes, and so on (a node-labelling sketch in code follows).
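A short sketch of the standard node labelling that goes with this construction (an assumption beyond the slide: give the 2^d nodes d-bit IDs, and connect nodes whose IDs differ in exactly one bit; the link that flips the new, highest bit is exactly the join between "corresponding" switches):

#include <stdio.h>

/* Print the neighbours of each node in a d-dimensional hypercube. */
void hypercube_neighbours(int d) {
    int p = 1 << d;                        /* number of nodes = 2^d */
    for (int i = 0; i < p; i++) {
        printf("node %d:", i);
        for (int k = 0; k < d; k++)
            printf(" %d", i ^ (1 << k));   /* flip bit k            */
        printf("\n");
    }
}

int main(void) {
    hypercube_neighbours(3);               /* three-dimensional case */
    return 0;
}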


Hypercubes and their Construction


Indirect interconnects
• Simple examples of indirect networks:
• Crossbar
• Omega network

• Often shown with unidirectional links and a collection of processors, each of which has an outgoing and an incoming link, and a switching network.

(Figure: a generic indirect network connecting processor-memory pairs through a switching network.)


Interconnect for distributed memory
(Figures: a crossbar and an omega network connecting processor-memory pairs; each switch in the omega network is a 2×2 crossbar.)


Multistage interconnection networks (MINs)
• Crossbars have excellent performance scalability but poor cost scalability.
• Buses have excellent cost scalability, but poor performance scalability.
• An intermediate class of networks, called multistage interconnection networks (MINs), strikes a compromise between these extremes.


Blocking Vs. Non-blocking
• Is the network blocking or non-blocking?

• Simultaneous Message – A
• 0 to 1
• 3 to 7

• Simultaneous Message – B
• 1 to 6
• 3 to 7


Multi Stage Switch
Problem
• Design a three-stage 200 × 200 switch: N = 200, with k = 4 and n = 20.

Solution
• First stage: N/n = 10 crossbars (size 20 × 4).
• Second stage: k = 4 crossbars (size 10 × 10).
• Third stage: N/n = 10 crossbars (size 4 × 20).
• In a three-stage switch, the total number of crosspoints is 2kN + k(N/n)², which is much smaller than the N² crosspoints of a single-stage switch.
• Here, 2kN + k(N/n)² = 2(4)(200) + 4(200/20)² = 1600 + 400 = 2000 crosspoints.
• This is 5% of the 40,000 (= 200 × 200) crosspoints in a single-stage switch. (A quick arithmetic check in code follows.)
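A throwaway C check of this arithmetic (the helper name crosspoints is chosen here for illustration, computed stage by stage to mirror the solution above):

#include <stdio.h>

/* Crosspoints in a three-stage switch: 2kN + k(N/n)^2. */
long crosspoints(long N, long n, long k) {
    long stage1 = (N / n) * (n * k);      /* N/n crossbars of size n x k   */
    long stage2 = k * (N / n) * (N / n);  /* k crossbars of size N/n x N/n */
    long stage3 = (N / n) * (k * n);      /* N/n crossbars of size k x n   */
    return stage1 + stage2 + stage3;      /* = 2kN + k(N/n)^2              */
}

int main(void) {
    printf("three-stage : %ld crosspoints\n", crosspoints(200, 20, 4)); /* 2000  */
    printf("single-stage: %ld crosspoints\n", 200L * 200L);             /* 40000 */
    return 0;
}

Running it prints 2000 and 40000, matching the 5% figure above.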


More definitions
• Any time data is transmitted, we’re interested in how long it will take for the data to reach its
destination.
• Latency
• The time that elapses between the source’s beginning to transmit the data and the destination’s starting
to receive the first byte.

• Bandwidth
• The rate at which the destination receives data after it has started to receive the first byte.

• Message transmission time: MTT = l + n/b (a short worked example follows), where
• l = latency (seconds)
• n = length of the message (bytes)
• b = bandwidth (bytes per second)
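A quick worked example (the latency, message length, and bandwidth values below are illustrative, not taken from the slides):

#include <stdio.h>

/* Message transmission time: MTT = l + n / b. */
int main(void) {
    double l = 5e-6;            /* latency: 5 microseconds */
    double n = 1e6;             /* message length: 1 MB    */
    double b = 1e9;             /* bandwidth: 1 GB/s       */
    double mtt = l + n / b;
    printf("MTT = %.6f s\n", mtt);   /* 0.001005 s, about 1 ms */
    return 0;
}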


Cache Coherence
• Cache-memory coherence is maintained using two write policies: write-through and write-back.


Cache Coherence in Multiprocessor Systems

After the value of a variable (x) changes, all of its copies must either be invalidated or updated.


Cache coherence
• Programmers have no control over caches and when they get updated.
• y0 privately owned by Core 0
• y1 and z1 privately owned by Core 1
• x = 2; /* shared variable */

• y0 eventually ends up = 2
• y1 eventually ends up = 6
• z1 = ???
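The values above are consistent with a two-core scenario like the sketch below; the per-core statements are an assumption reconstructed from those values, not text from the slide:

int x = 2;                /* shared variable, cached by both cores      */
int y0, y1, z1;

void core0(void) {
    y0 = x;               /* y0 = 2                                     */
    x  = 7;               /* updates Core 0's cached copy of x          */
}

void core1(void) {        /* runs concurrently with core0               */
    y1 = 3 * x;           /* still sees the cached value 2, so y1 = 6   */
    z1 = 4 * x;           /* 8 if Core 1 still sees x = 2, 28 if the    */
                          /* update has reached it: without cache       */
                          /* coherence the outcome is unpredictable     */
}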


Cache Coherency
• The goal of cache coherence is to ensure that every cache sees the same value for a referenced
location, which means making sure that any shared operand that is changed is updated throughout
the system.
• This brings us to the issue of false sharing, which reduces cache performance when two operands that are not shared between processes nevertheless share the same cache line. The situation is sketched in the code below: each process invalidates the other’s cache line when writing its own data, without any real need, unless the compiler prevents this.
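A minimal C sketch of the layout problem and a common padding fix (the 64-byte line size and the struct names are assumptions for illustration):

#define CACHE_LINE 64              /* typical, not universal, line size  */

struct falsely_shared {            /* prone to false sharing             */
    long a;                        /* written only by core 0             */
    long b;                        /* written only by core 1: same line  */
};

struct padded {                    /* one variable per cache line        */
    long a;
    char pad[CACHE_LINE - sizeof(long)];
    long b;
};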


Cache line has 4 states
• Each line of a cache has two bits associated with it, giving four states:
• Modified: the line in this cache has been modified and is only valid in this cache.
• Exclusive: the line in this cache is the same as that in memory (unmodified) and is not present in any other cache.
• Shared: the line in this cache is the same as that in memory (unmodified) and may also be present in another cache.
• Invalid: the line in this cache does not contain valid data.

• The MESI protocol is used by Intel; the MOESI protocol is used by AMD.


Snoopy Protocols – Implementations
• Performance of these two protocols depends on the number of caches and the pattern of reads/writes.
• Some systems use adaptive protocols that employ both methods.
• Write-invalidate is the most common, used in Pentium 4 and PowerPC systems.

• The cores or processors share a bus.
• Any signal transmitted on the bus can be “seen” by all cores connected to the bus.
• When Core 0 updates the copy of x stored in its cache, it also broadcasts this information across the bus.
• Core 1 is “snooping” the bus: it will see that x has been updated and mark its copy of x as invalid.


Snooping Cache Coherence
• How are invalidates sent to the right core or processor?
• In snoopy caches, there is a broadcast medium that listens to all invalidates and read requests and performs the appropriate coherence operations locally.

(Figure: a simple snoopy bus-based cache coherence system.)


Directory Based Cache Coherence
• Uses a data structure called a directory that stores the status of each cache line.
• When a variable is updated, the directory is consulted, and the copies held by the cores that have that variable’s cache line in their caches are invalidated.
• The figures show an example of a centralized directory (left) and a distributed directory (right). (A sketch of a possible directory entry follows.)
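A possible shape for one directory entry, sketched in C (an illustrative layout, not any specific machine’s format):

#include <stdint.h>

/* One entry per cache line: a state plus one "presence" bit per core
   recording which caches currently hold a copy of that line. */
enum line_state { UNCACHED, SHARED, MODIFIED };

struct dir_entry {
    enum line_state state;
    uint64_t sharers;              /* bit i set => core i caches the line */
};

/* On a write by core w, these are the copies that must be invalidated. */
static inline uint64_t copies_to_invalidate(struct dir_entry e, int w) {
    return e.sharers & ~(1ULL << w);
}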


Questions and comments?
