Multiple Issue (2)
Static multiple issue - functional units are scheduled at compile time.
Dynamic multiple issue - functional units are scheduled at run time.
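For example, the two statements in this minimal sketch are independent, so a multiple issue processor can execute them simultaneously on separate functional units, whether the scheduling happens at compile time or at run time (the function name is illustrative):

double multi_issue(double x, double y, double a, double b) {
    double z = x + y;   /* can go to the adder */
    double w = a * b;   /* no dependence on z: can go to the multiplier at the same time */
    return z + w;
}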
Speculation (1)
In order to make use of multiple issue, the
system must find instructions that can be
executed simultaneously.
In speculation, the compiler or
the processor makes a guess
about an instruction, and then
executes the instruction on the
basis of the guess.
Speculation (2)
z = x + y;
if (z > 0)    /* speculation: z will be positive */
    w = x;
else
    w = y;
If the system speculates incorrectly,
it must go back and recalculate w = y.
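The same guess-and-recover idea can be sketched directly in C (the function name speculate is illustrative):

double speculate(double x, double y) {
    double z = x + y;
    double w = x;   /* guess: z will be positive, so take the true branch */
    if (z <= 0)
        w = y;      /* wrong guess: go back and recalculate w = y */
    return w;
}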
Hardware multithreading (1)
There aren’t always good opportunities for
simultaneous execution of different
threads.
Hardware multithreading provides a means for systems to continue doing useful work when the currently executing task has stalled.
Hardware multithreading (2)
Fine-grained - the processor switches
between threads after each instruction,
skipping threads that are stalled.
Pros: potential to avoid wasted machine time
due to stalls.
Cons: a thread that’s ready to execute a long
sequence of instructions may have to wait to
execute every instruction.
Hardware multithreading (3)
Simultaneous multithreading (SMT) - a
variation on fine-grained multithreading.
Allows multiple threads to make use of the
multiple functional units.
PARALLEL HARDWARE
Flynn’s Taxonomy
SISD (the classic von Neumann machine): single instruction stream, single data stream.
SIMD: single instruction stream, multiple data streams.
MISD (not covered here): multiple instruction streams, single data stream.
MIMD: multiple instruction streams, multiple data streams.
SIMD
Parallelism achieved by dividing data
among the processors.
Applies the same instruction to multiple
data items.
Called data parallelism.
SIMD example
[Figure: a single control unit drives n ALUs, one per data item x[1] ... x[n].]

for (i = 0; i < n; i++)
    x[i] += y[i];
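The same loop written with x86 AVX intrinsics makes the SIMD step explicit: one instruction adds eight floats at once. This is a hedged sketch, assuming an AVX-capable processor and a compiler flag such as -mavx; the function name add_avx is illustrative.

#include <immintrin.h>

/* x[i] += y[i] in SIMD fashion: each _mm256_add_ps applies one
   instruction to eight data items at once. */
void add_avx(float *x, const float *y, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(&x[i]);
        __m256 vy = _mm256_loadu_ps(&y[i]);
        _mm256_storeu_ps(&x[i], _mm256_add_ps(vx, vy));
    }
    for (; i < n; i++)   /* any leftover items are handled serially */
        x[i] += y[i];
}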
SIMD
What if we don’t have as many ALUs as
data items?
Divide the work and process iteratively.
Ex. m = 4 ALUs and n = 15 data items.
Round   ALU1    ALU2    ALU3    ALU4
1       x[0]    x[1]    x[2]    x[3]
2       x[4]    x[5]    x[6]    x[7]
3       x[8]    x[9]    x[10]   x[11]
4       x[12]   x[13]   x[14]   (idle)
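In plain C, this round-by-round scheme is a strip-mined loop. A minimal sketch (the function name add_in_rounds is illustrative; the inner loop stands in for one simultaneous step across the m ALUs):

void add_in_rounds(float *x, const float *y, int n, int m) {
    for (int i = 0; i < n; i += m) {             /* one round per iteration */
        int active = (n - i < m) ? (n - i) : m;  /* last round may leave ALUs idle */
        for (int j = 0; j < active; j++)
            x[i + j] += y[i + j];                /* conceptually one SIMD instruction */
    }
}

With m = 4 and n = 15, this performs the four rounds shown in the table, with one ALU idle in the last round.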
Graphics Processing Units (GPU)
Real-time graphics application programming interfaces (APIs) use points, lines, and triangles to internally represent the surface of an object.
GPUs
A graphics processing pipeline converts
the internal representation into an array of
pixels that can be sent to a computer
screen.
Several stages of this pipeline
(called shader functions) are
programmable.
Typically just a few lines of C code.
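To make the per-element model concrete, here is a hedged sketch in plain C (real shaders are written in C-like GPU languages such as GLSL; Pixel, darken, and apply_shader are hypothetical names):

#include <stddef.h>

typedef struct { float r, g, b; } Pixel;

/* A shader-style function: conceptually, the GPU applies this to every
   element of the graphics stream in parallel. */
static Pixel darken(Pixel in) {
    Pixel out = { 0.5f * in.r, 0.5f * in.g, 0.5f * in.b };
    return out;
}

void apply_shader(Pixel *buf, size_t n) {
    for (size_t i = 0; i < n; i++)   /* iterations are independent */
        buf[i] = darken(buf[i]);
}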
GPUs
Shader functions are also implicitly
parallel, since they can be applied to
multiple elements in the graphics stream.
GPUs can often optimize performance by using SIMD parallelism.
The current generation of GPUs uses SIMD parallelism, although they are not pure SIMD systems.
MIMD
Supports multiple simultaneous instruction
streams operating on multiple data
streams.
Typically consists of a collection of fully
independent processing units or cores,
each of which has its own control unit and
its own ALU.
Shared Memory System (1)
A collection of autonomous processors is
connected to a memory system via an
interconnection network.
Each processor can access each memory
location.
The processors usually communicate
implicitly by accessing shared data
structures.
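For example, with POSIX threads, several threads can communicate simply by updating the same variable, with a mutex serializing the updates. A minimal sketch (counter and work are illustrative names):

#include <pthread.h>
#include <stdio.h>

int counter = 0;   /* shared data structure: visible to every thread */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *work(void *arg) {
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                 /* implicit communication through shared memory */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %d\n", counter);   /* prints 4000 */
    return 0;
}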
Shared Memory System (2)
Most widely available shared memory
systems use one or more multicore
processors.
Shared Memory System (Figure 2.3)
Distributed Memory System
Clusters (most popular)
A collection of commodity systems.
Connected by a commodity interconnection
network.
Nodes of a cluster are individual computation units joined by a communication network.
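Since the nodes share no address space, they must communicate explicitly over the network. A minimal sketch using MPI message passing (run with at least two processes, e.g. mpiexec -n 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* node 0 sends a value over the network */
        int msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {     /* node 1 receives it: explicit communication */
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node 1 received %d\n", msg);
    }
    MPI_Finalize();
    return 0;
}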
Distributed Memory System (Figure 2.4)
Interconnection networks
Affects performance of both distributed
and shared memory systems.
Two categories:
Shared memory interconnects
Distributed memory interconnects
Shared memory interconnects
Bus interconnect
A collection of parallel communication wires
together with some hardware that controls
access to the bus.
Communication wires are shared by the devices that are connected to the bus.
As the number of devices connected to the
bus increases, contention for use of the bus
increases, and performance decreases.
Shared memory interconnects
Switched interconnect
Uses switches to control the routing of data
among the connected devices.
Crossbar - allows simultaneous communication among different devices.
Faster than buses, but the cost of the switches and links is relatively high.
Distributed memory interconnects
Two groups
Direct interconnect
Each switch is directly connected to a processor-memory pair, and the switches are connected to each other.
Indirect interconnect
Switches may not be directly connected to a
processor.
Bisection width
A measure of “number of simultaneous
communications” or “connectivity”.
How many simultaneous communications
can take place “across the divide” between
the halves?
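For example, a ring of p nodes can be bisected by cutting just two links, so its bisection width is 2; a fully connected network of p nodes (p even) has bisection width p^2/4, since each of the p/2 nodes in one half links directly to each of the p/2 nodes in the other.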
Two bisections of a ring (Figure 2.9)
Fully connected network
Each switch is directly connected to every
other switch.
(Figure 2.11)