
Parallel and Distributed Programming
Dr. Muhammad Naveed Akhtar
Lecture – 02b
Parallel Hardware



Flynn’s Taxonomy

• SISD (classic von Neumann): single instruction stream, single data stream.
• SIMD: single instruction stream, multiple data streams.
• MISD: multiple instruction streams, single data stream. (Not covered.)
• MIMD: multiple instruction streams, multiple data streams.


Computer Architecture Classifications


SIMD
• Parallelism achieved by dividing data among the processors.
• Applies the same instruction to multiple data items.
• Called data parallelism.

for (i = 0; i < n; i++)
    x[i] += y[i];

(Diagram: a single control unit broadcasts each instruction to n ALUs, one per data item x[1], x[2], …, x[n].)


SIMD
• What if we don’t have as many ALUs as data items?
• Divide the work and process iteratively (a code sketch follows the table below).
• Ex. m = 4 ALUs and n = 15 data items.

  Round   ALU1    ALU2    ALU3    ALU4
    1     X[0]    X[1]    X[2]    X[3]
    2     X[4]    X[5]    X[6]    X[7]
    3     X[8]    X[9]    X[10]   X[11]
    4     X[12]   X[13]   X[14]   (idle)
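The same round-by-round division can be written out in plain C. The sketch below simulates the m = 4 ALUs with an inner loop; it is a serial simulation of the idea, not actual SIMD hardware:

#include <stdio.h>

#define M 4    /* number of ALUs (m in the example)       */
#define N 15   /* number of data items (n in the example) */

int main(void) {
    int x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1; }

    /* Each round the control unit issues one add; each of the M ALUs
       handles one element, and an ALU with no element stays idle.   */
    for (int round = 0; round * M < N; round++) {
        for (int alu = 0; alu < M; alu++) {
            int i = round * M + alu;
            if (i < N)           /* this ALU has an element this round */
                x[i] += y[i];
            /* else: idle, as ALU4 is in round 4 of the table above    */
        }
    }

    printf("x[14] = %d\n", x[14]);   /* 14 + 1 = 15 */
    return 0;
}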
SIMD drawbacks
• All ALUs are required to execute the same instruction, or remain idle.
• In classic design, they must also operate synchronously.
• The ALUs have no instruction storage.
• Efficient for large data-parallel problems, but not for other, more complex types of parallel problems.


Vector Processors
• Operate on arrays or vectors of data, while conventional CPUs operate on individual data elements or scalars.
• Vector registers.
• Capable of storing a vector of operands and operating simultaneously on their contents.

• Vectorized and pipelined functional units.
• The same operation is applied to each element in the vector (or to pairs of elements).

• Vector instructions.
• Operate on vectors rather than scalars (a vectorizable loop is sketched below).
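As a concrete illustration, a loop of the following shape is what vector registers and vector instructions are designed for; whether a compiler actually turns it into vector instructions depends on the compiler and its flags (e.g. gcc -O3):

/* The same add is applied element-wise across whole vector registers. */
void vadd(const float *restrict a, const float *restrict b,
          float *restrict c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* conceptually one vector add per chunk */
}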


Vector Processors (contd.)
• Interleaved memory.
• Multiple “banks” of memory, which can be accessed more or less independently.
• Distribute elements of a vector across multiple banks to reduce or eliminate the delay in loading/storing successive elements.

• Strided memory access and hardware scatter/gather.
• The program accesses elements of a vector located at fixed intervals (the stride); scatter/gather handles irregularly (index-) addressed elements in hardware. Both patterns are sketched in code below.
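A minimal C sketch of the two access patterns (the function names column_sum and gather_sum are invented for this illustration):

/* Strided access: elements at a fixed interval. Here, summing one
   column of a row-major n x n matrix gives a stride of n elements. */
float column_sum(const float *a, int n, int col) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i * n + col];       /* stride-n access pattern */
    return s;
}

/* Gather: elements at arbitrary positions given by an index vector. */
float gather_sum(const float *a, const int *idx, int m) {
    float s = 0.0f;
    for (int i = 0; i < m; i++)
        s += a[idx[i]];            /* indexed (gather) access */
    return s;
}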


Vector processors – Pros & Cons
Pros
• Fast.
• Easy to use.
• Vectorizing compilers are good at identifying code to exploit.
• Compilers can also provide information about code that cannot be vectorized, which helps the programmer re-evaluate code.
• High memory bandwidth.
• Uses every item in a cache line.

Cons
• They don’t handle irregular data structures as well as other parallel architectures.
• A very finite limit to their ability to handle ever larger problems (scalability).


Graphics Processing Units (GPU)
• Real-time graphics application programming interfaces (APIs) use points, lines, and triangles to internally represent the surface of an object.
• A graphics processing pipeline converts the internal representation into an array of pixels that can be sent to a computer screen.
• Several stages of this pipeline (called shader functions) are programmable.
• Typically just a few lines of C code.
• Shader functions are also implicitly parallel, since they can be applied to multiple elements in the graphics stream (a sketch of the idea follows).
• GPUs can often optimize performance by using SIMD parallelism.
• The current generation of GPUs use SIMD parallelism, although they are not pure SIMD systems.
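The following is not a real shader language, only a C-style sketch of why a shader function is implicitly data-parallel; the names Pixel, darken, and run_shader are invented here:

typedef struct { float r, g, b; } Pixel;

/* The "shader function": a small routine applied to one element. */
static Pixel darken(Pixel p, float factor) {
    p.r *= factor;  p.g *= factor;  p.b *= factor;
    return p;
}

/* Applying it independently to every pixel is what makes it
   implicitly parallel and a good fit for SIMD hardware. */
void run_shader(Pixel *image, int n_pixels, float factor) {
    for (int i = 0; i < n_pixels; i++)      /* conceptually parallel */
        image[i] = darken(image[i], factor);
}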


Typical GPU Architecture

• Special Function Units (SFUs)
• Handle sin, cosine, reciprocal, and square root.
• Each SFU executes one instruction per thread.

• Warp
• A set of 32 threads within a thread block.
• All threads in a warp execute the same instruction.
MIMD Systems
• Supports multiple simultaneous instruction streams operating on multiple data streams.
• Typically consist of a collection of fully independent processing units or cores, each of which has its
own control unit and its own ALU.
• Two types
• Shared Memory System
• Distributed Memory System


Shared Memory System
• A collection of processors, connected to a memory system via an interconnection network.
• Each processor can access each memory location.
• The processors usually communicate implicitly by accessing shared data structures.
• Most widely available shared memory systems use one or more multicore processors (multiple CPUs or cores on a single chip).


Types of Shared Memory Systems

• UMA (uniform memory access) multicore system: the time to access all memory locations is the same for all the cores.
• NUMA (non-uniform memory access) multicore system: a memory location a core is directly connected to can be accessed faster than a memory location that must be accessed through another chip.
Distributed Memory System
• Clusters (most popular)
• A collection of commodity systems.
• Connected by a commodity interconnection network.

• Nodes of a cluster are individual computation units joined by a communication network.


Interconnection Networks
• Affects performance of both distributed and shared memory systems.
• Two categories:
• Shared memory interconnects
• Bus interconnect
• Switched interconnect
• Distributed memory interconnects
• Direct interconnect
• Indirect interconnect


Shared Memory Interconnects
• Bus interconnect
• A collection of parallel communication wires together with some hardware that controls access to the bus.
• Communication wires are shared by the devices that are connected to it.
• As the number of devices connected to the bus increases, contention for use of the bus increases, and
performance decreases.

• Switched interconnect
• Uses switches to control the routing of data among the connected devices.
• Crossbar
• Allows simultaneous communication among different devices.
• Faster than buses, but the cost of the switches and links is relatively high.
• Traditionally used in old telephone circuit-switching exchanges.


Crossbar Interconnect

(Figures: the configuration of internal switches in a crossbar, showing the closed and open states; a crossbar switch connecting 4 processors (Pi) and 4 memory modules (Mj); and simultaneous memory accesses by the processors: P1→M4, P2→M3, P3→M1, P4→M2.)
Distributed Memory Interconnects

• Direct interconnect: each switch is directly connected to a processor-memory pair, and the switches are connected to each other.
• Indirect interconnect: switches may not be directly connected to a processor-memory pair.


Direct interconnect
(Figures: common direct-interconnect topologies: a linear array with no wrap-around; a ring, i.e. a linear array with wrap-around (1-D ring/torus); a 2-D mesh with and without wraparound (the toroidal mesh); and a 3-D mesh with no wraparound.)

• The number of nodes is always taken to be a power of 2.


Bisection Width
• A measure of “number of simultaneous communications” or “connectivity”.
• How many simultaneous communications can take place “across the divide” between the halves?

• Bisection width of a ring = 2.
• Bisection width of a toroidal mesh = 2√p (here p = 16, so the bisection width is 8).
• Bisection width of a fully connected (completely connected) network = p²/4 (here p = 6, so the bisection width is 9).


Definitions
• Bandwidth
• The rate at which a link can transmit data. (Usually given in megabits or megabytes per second.)
• Diameter
• The distance between the farthest two nodes in the network. Metric for worst-case latency.
• Arc Connectivity:
• The minimum number of wires you must cut to partition the network into two disconnected (not necessarily equal) parts.
• Bisection bandwidth
• A measure of network quality.
• Instead of counting the number of links joining the halves, it sums the bandwidth of the links.
• Cost
• The number of links or switches (whichever is asymptotically higher) is an important contributor to cost.


Hypercube
• Highly connected direct interconnect.
• Built inductively:
• A one-dimensional hypercube is a fully-connected system with two processors.
• A two-dimensional hypercube is built from two one-dimensional hypercubes by joining “corresponding”
switches.
• Similarly, a three-dimensional hypercube is built from two two-dimensional hypercubes, and so on (a node-labelling sketch in code follows).
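A short sketch of the standard node labelling that goes with this construction (an assumption beyond the slide: give the 2^d nodes d-bit IDs, and connect nodes whose IDs differ in exactly one bit; the link that flips the new, highest bit is exactly the join between "corresponding" switches):

#include <stdio.h>

/* Print the neighbours of each node in a d-dimensional hypercube. */
void hypercube_neighbours(int d) {
    int p = 1 << d;                        /* number of nodes = 2^d */
    for (int i = 0; i < p; i++) {
        printf("node %d:", i);
        for (int k = 0; k < d; k++)
            printf(" %d", i ^ (1 << k));   /* flip bit k            */
        printf("\n");
    }
}

int main(void) {
    hypercube_neighbours(3);               /* three-dimensional case */
    return 0;
}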


Hypercubes and their Construction


Indirect interconnects
• Simple examples of indirect networks:
• Crossbar
• Omega network

• Often shown with unidirectional links and a collection of processors, each of which has an outgoing and an incoming link, and a switching network.

(Figure: a generic indirect network connecting processor-memory pairs through a switching network.)


Interconnect for distributed memory
(Figures: a crossbar and an omega network connecting processor-memory pairs; each switch in the omega network is a 2×2 crossbar.)


Multistage interconnection networks (MINs)
• Crossbars have excellent performance scalability but poor cost scalability.
• Buses have excellent cost scalability, but poor performance scalability.
• An intermediate class of networks, called multistage interconnection networks (MINs), strikes a compromise between these extremes.


Blocking Vs. Non-blocking
• Is the network blocking or non-blocking?

• Simultaneous Message – A
• 0 to 1
• 3 to 7

• Simultaneous Message – B
• 1 to 6
• 3 to 7


Multi Stage Switch
Problem
• Design a three-stage 200 × 200 switch: N = 200, with k = 4 and n = 20.

Solution
• First stage: N/n = 10 crossbars (size 20 × 4).
• Second stage: k = 4 crossbars (size 10 × 10).
• Third stage: N/n = 10 crossbars (size 4 × 20).
• In a three-stage switch, the total number of crosspoints is 2kN + k(N/n)², which is much smaller than the N² crosspoints of a single-stage switch.
• Here, 2kN + k(N/n)² = 2(4)(200) + 4(200/20)² = 1600 + 400 = 2000 crosspoints.
• This is 5% of the 40,000 (= 200 × 200) crosspoints in a single-stage switch. (A quick arithmetic check in code follows.)
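A throwaway C check of this arithmetic (the helper name crosspoints is chosen here for illustration, computed stage by stage to mirror the solution above):

#include <stdio.h>

/* Crosspoints in a three-stage switch: 2kN + k(N/n)^2. */
long crosspoints(long N, long n, long k) {
    long stage1 = (N / n) * (n * k);      /* N/n crossbars of size n x k   */
    long stage2 = k * (N / n) * (N / n);  /* k crossbars of size N/n x N/n */
    long stage3 = (N / n) * (k * n);      /* N/n crossbars of size k x n   */
    return stage1 + stage2 + stage3;      /* = 2kN + k(N/n)^2              */
}

int main(void) {
    printf("three-stage : %ld crosspoints\n", crosspoints(200, 20, 4)); /* 2000  */
    printf("single-stage: %ld crosspoints\n", 200L * 200L);             /* 40000 */
    return 0;
}

Running it prints 2000 and 40000, matching the 5% figure above.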


More definitions
• Any time data is transmitted, we’re interested in how long it will take for the data to reach its
destination.
• Latency
• The time that elapses between the source’s beginning to transmit the data and the destination’s starting
to receive the first byte.

• Bandwidth
• The rate at which the destination receives data after it has started to receive the first byte.

• Message transmission time: MTT = l + n/b (a short worked example follows), where
• l = latency (seconds)
• n = length of the message (bytes)
• b = bandwidth (bytes per second)
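A quick worked example (the latency, message length, and bandwidth values below are illustrative, not taken from the slides):

#include <stdio.h>

/* Message transmission time: MTT = l + n / b. */
int main(void) {
    double l = 5e-6;            /* latency: 5 microseconds */
    double n = 1e6;             /* message length: 1 MB    */
    double b = 1e9;             /* bandwidth: 1 GB/s       */
    double mtt = l + n / b;
    printf("MTT = %.6f s\n", mtt);   /* 0.001005 s, about 1 ms */
    return 0;
}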


Cache Coherence
• Cache-memory coherence is maintained using two write policies: write-through and write-back.


Cache Coherence in Multiprocessor Systems

After the value of a variable (x) changes, all of its copies must either be invalidated or updated.


Cache coherence
• Programmers have no control over caches and when they get updated.
• y0 privately owned by Core 0
• y1 and z1 privately owned by Core 1
• x = 2; /* shared variable */

• y0 eventually ends up = 2
• y1 eventually ends up = 6
• z1 = ???
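The values above are consistent with a two-core scenario like the sketch below; the per-core statements are an assumption reconstructed from those values, not text from the slide:

int x = 2;                /* shared variable, cached by both cores      */
int y0, y1, z1;

void core0(void) {
    y0 = x;               /* y0 = 2                                     */
    x  = 7;               /* updates Core 0's cached copy of x          */
}

void core1(void) {        /* runs concurrently with core0               */
    y1 = 3 * x;           /* still sees the cached value 2, so y1 = 6   */
    z1 = 4 * x;           /* 8 if Core 1 still sees x = 2, 28 if the    */
                          /* update has reached it: without cache       */
                          /* coherence the outcome is unpredictable     */
}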


Cache Coherency
• The goal of cache coherence is to ensure that every cache sees the same value for a referenced
location, which means making sure that any shared operand that is changed is updated throughout
the system.
• This brings us to the issue of false sharing, which reduces cache performance when two operands that are not shared between processes nevertheless share the same cache line. The situation is sketched in the code below: each process invalidates the other’s cache line when writing its own data, without any real need, unless the compiler prevents this.
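A minimal C sketch of the layout problem and a common padding fix (the 64-byte line size and the struct names are assumptions for illustration):

#define CACHE_LINE 64              /* typical, not universal, line size  */

struct falsely_shared {            /* prone to false sharing             */
    long a;                        /* written only by core 0             */
    long b;                        /* written only by core 1: same line  */
};

struct padded {                    /* one variable per cache line        */
    long a;
    char pad[CACHE_LINE - sizeof(long)];
    long b;
};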


Cache line has 4 states
• Each line of a cache has two bits associated with it, giving four states:
• Modified: the line in this cache has been modified and is only valid in this cache.
• Exclusive: the line in this cache is the same as that in memory (unmodified) and is not present in any other cache.
• Shared: the line in this cache is the same as that in memory (unmodified) and may also be present in another cache.
• Invalid: the line in this cache does not contain valid data.

• The MESI protocol is used by Intel; the MOESI protocol is used by AMD.


Snoopy Protocols – Implementations
• Performance of these two protocols depends on the number of caches and the pattern of reads/writes.
• Some systems use adaptive protocols that employ both methods.
• Write-invalidate is the most common, used in Pentium 4 and PowerPC systems.

• The cores or processors share a bus.
• Any signal transmitted on the bus can be “seen” by all cores connected to the bus.
• When Core 0 updates the copy of x stored in its cache, it also broadcasts this information across the bus.
• Core 1 is “snooping” the bus: it will see that x has been updated and mark its copy of x as invalid.


Snooping Cache Coherence
• How are invalidates sent to the right core or processor?
• In snoopy caches, there is a broadcast medium that listens to all invalidates and read requests and performs the appropriate coherence operations locally.

(Figure: a simple snoopy bus-based cache coherence system.)


Directory Based Cache Coherence
• Uses a data structure called a directory that stores the status of each cache line.
• When a variable is updated, the directory is consulted, and the copies held by the cores that have that variable’s cache line in their caches are invalidated.
• The figures show an example of a centralized directory (left) and a distributed directory (right). (A sketch of a possible directory entry follows.)
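A possible shape for one directory entry, sketched in C (an illustrative layout, not any specific machine’s format):

#include <stdint.h>

/* One entry per cache line: a state plus one "presence" bit per core
   recording which caches currently hold a copy of that line. */
enum line_state { UNCACHED, SHARED, MODIFIED };

struct dir_entry {
    enum line_state state;
    uint64_t sharers;              /* bit i set => core i caches the line */
};

/* On a write by core w, these are the copies that must be invalidated. */
static inline uint64_t copies_to_invalidate(struct dir_entry e, int w) {
    return e.sharers & ~(1ULL << w);
}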


Questions and comments?
