DST4030A: Parallel Computing
Introduction to Parallel Processor Machines
Dr Mike Asiyo
School of Science & Technology
Dr Asiyo (USIU) DST4030A: Parallel Computing 1 / 31
Contents
1 Processes, multitasking, and threads
2 Instruction-level parallelism
3 Hardware multithreading
4 Parallel hardware
Dr Asiyo (USIU) DST4030A: Parallel Computing 2 / 31
Processes, multitasking, and threads
Processes, multitasking, and threads
Dr Asiyo (USIU) DST4030A: Parallel Computing 3 / 31
Processes, multitasking, and threads
When a user runs a program, the operating system creates a
process—an instance of a computer program that is being executed.
A process consists of several entities:
The executable machine language program.
A block of memory, which will include the executable code, a call
stack that keeps track of active functions, a heap that can be used for
memory explicitly allocated by the user program, and some other
memory locations.
Descriptors of resources that the operating system has allocated to
the process, for example, file descriptors.
Security information—for example, information specifying which
hardware and software resources the process can access.
Information about the state of the process, such as whether the
process is ready to run or is waiting on some resource, the content of
the registers, and information about the process’s memory.
Dr Asiyo (USIU) DST4030A: Parallel Computing 4 / 31
Processes, multitasking, and threads
Most modern operating systems are multitasking. This means that
the operating system provides support for the apparent
simultaneous execution of multiple programs.
In a multitasking OS, if a process needs to wait for a resource—for
example, it needs to read data from external storage—it will block.
This means that it will stop executing and the operating system can
run another process.
Threading provides a mechanism for programmers to divide their
programs into more or less independent tasks, with the property
that when one thread is blocked, another thread can be run.
Threads are contained within processes, so they can use the same
executable, and they usually share the same memory and the same
I/O devices. In fact, two threads belonging to one process can share
most of the process’s resources.
Dr Asiyo (USIU) DST4030A: Parallel Computing 5 / 31
Processes, multitasking, and threads
If a process is the “master” thread of execution and threads are
started and stopped by the process, then we can envision the
process and its subsidiary threads as lines: when a thread is started,
it forks off the process; when a thread terminates, it joins the
process.
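This fork/join pattern can be made concrete with POSIX threads (pthreads). The sketch below is illustrative only: the choice of pthreads, the count NUM_THREADS, and the function thread_work are assumptions made for this example, not something prescribed by these slides.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4                /* illustrative thread count */

/* Work done by each subsidiary thread. */
void *thread_work(void *arg) {
    long rank = (long) arg;
    printf("Thread %ld running\n", rank);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];

    /* Fork: the process starts its subsidiary threads. */
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, thread_work, (void *) t);

    /* Join: each thread rejoins the process when it terminates. */
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    return 0;
}

Compile with a pthreads-aware compiler, e.g. gcc -pthread.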
Dr Asiyo (USIU) DST4030A: Parallel Computing 6 / 31
Instruction-level parallelism
Instruction-level parallelism
Dr Asiyo (USIU) DST4030A: Parallel Computing 7 / 31
Instruction-level parallelism
Instruction-level parallelism, or ILP, attempts to improve processor
performance by having multiple processor components or functional
units simultaneously executing instructions.
There are two main approaches to ILP:
pipelining, in which functional units are arranged in stages; and
multiple issue, in which multiple instructions can be simultaneously
initiated.
Dr Asiyo (USIU) DST4030A: Parallel Computing 8 / 31
Instruction-level parallelism Pipelining
The principle of pipelining is similar to a factory assembly line:
while one team is bolting a car’s engine to the chassis, another team
can connect the transmission to the engine and the driveshaft of a
car that’s already been processed by the first team, and a third
team can bolt the body to the chassis in a car that’s been processed
by the first two teams.
Pipelines improve performance by taking individual pieces of
hardware or functional units and connecting them in sequence.
In general, a pipeline with k stages won't get a k-fold improvement
in performance. For example, if the times required by the various
functional units are different, then the stages will effectively run at
the speed of the slowest functional unit. Furthermore, delays—such
as waiting for an operand to become available—can cause the
pipeline to stall.
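A rough timing model shows why. Assume, as an idealization, that each of the k stages takes the same time t, that a non-pipelined unit needs the full k*t per operation, and that there are no stalls. For n operations,

    T_serial    = n * k * t
    T_pipelined = (k + n - 1) * t    (fill the pipeline, then one result per stage time)

so the speedup is n*k / (n + k - 1), which approaches k only when n is much larger than k, and it is smaller still once stages have unequal times or stalls occur.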
Dr Asiyo (USIU) DST4030A: Parallel Computing 9 / 31
Instruction-level parallelism Multiple Issue
Multiple issue processors replicate functional units and try to simultaneously
execute different instructions in a program. For example, if we have two
complete floating point adders, we can approximately halve the time it takes to
execute the loop
for (i = 0; i < 1000; i++) {
    z[i] = x[i] + y[i];
}
While the first adder is computing z[0], the second can compute z[1]; while the
first is computing z[2], the second can compute z[3]; and so on.
If the functional units are scheduled at compile time, the multiple issue system
is said to use static multiple issue. If they’re scheduled at run-time, the system
is said to use dynamic multiple issue. A processor that supports dynamic
multiple issue is sometimes said to be superscalar.
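One way to make the two-adder picture explicit in source code is to unroll the loop so that the two independent additions in each iteration could be issued to the two adders. This is only a sketch of the kind of schedule a static multiple issue compiler might produce, not a claim about any particular compiler.

/* Unrolled by 2: the two additions are independent of each other,
   so two floating point adders can execute them simultaneously.
   Assumes the trip count (1000) is even. */
for (i = 0; i < 1000; i += 2) {
    z[i]   = x[i]   + y[i];     /* adder 1 */
    z[i+1] = x[i+1] + y[i+1];   /* adder 2 */
}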
Dr Asiyo (USIU) DST4030A: Parallel Computing 10 / 31
Hardware multithreading
Hardware multithreading
Dr Asiyo (USIU) DST4030A: Parallel Computing 11 / 31
Hardware multithreading
ILP can be very difficult to exploit: a program with a long sequence of
dependent statements offers few opportunities. For example, in a direct
calculation of the Fibonacci numbers
f[0] = f[1] = 1;
for (i = 2; i <= n; i++) {
    f[i] = f[i-1] + f[i-2];
}
there’s essentially no opportunity for simultaneous execution of instructions.
Thread-level parallelism, or TLP, attempts to provide parallelism through the
simultaneous execution of different threads. It is a coarser-grained parallelism
than ILP: the program units being simultaneously executed (threads) are larger,
or coarser, than the finer-grained units (individual instructions).
Dr Asiyo (USIU) DST4030A: Parallel Computing 12 / 31
Hardware multithreading
Hardware multithreading provides a means for systems to continue doing useful
work when the task being currently executed has stalled—for example, if the
current task has to wait for data to be loaded from memory. Instead of looking
for parallelism in the currently executing thread, it may make sense to simply
run another thread. Of course, for this to be useful, the system must support
very rapid switching between threads.
In fine-grained multithreading, the processor switches between threads after
each instruction, skipping threads that are stalled. While this approach has the
potential to avoid wasted machine time due to stalls, it has the drawback that a
thread that’s ready to execute a long sequence of instructions may have to wait
to execute every instruction.
Coarse-grained multithreading attempts to avoid this problem by only
switching threads that are stalled waiting for a time-consuming operation to
complete (e.g., a load from main memory). This has the virtue that switching
threads doesn’t need to be nearly instantaneous. However, the processor can be
idled on shorter stalls, and thread switching will also cause delays.
Dr Asiyo (USIU) DST4030A: Parallel Computing 13 / 31
Hardware multithreading
Simultaneous multithreading, or SMT, is a variation on
fine-grained multithreading. It attempts to exploit superscalar
processors by allowing multiple threads to make use of the multiple
functional units. If we designate “preferred” threads—threads that
have many instructions ready to execute—we can somewhat reduce
the problem of thread slowdown.
Dr Asiyo (USIU) DST4030A: Parallel Computing 14 / 31
Parallel hardware
Parallel hardware
Dr Asiyo (USIU) DST4030A: Parallel Computing 15 / 31
Parallel hardware Classifications of parallel computers
We employ two independent classifications of parallel computers:
Flynn’s taxonomy classifies a parallel computer according to the
number of instruction streams and the number of data streams (or
datapaths) it can simultaneously manage.
The alternative classification has to do with how the cores access
memory.
Dr Asiyo (USIU) DST4030A: Parallel Computing 16 / 31
Parallel hardware Flynn’s Taxonomy
1 SIMD systems
Single instruction, multiple data, or SIMD, systems are parallel systems.
They operate on multiple data streams by applying the same instruction to
multiple data items, so an abstract SIMD system can be thought of as
having a single control unit and multiple datapaths.
An instruction is broadcast from the control unit to the datapaths, and
each datapath either applies the instruction to the current data item, or it
is idle.
SIMD systems are ideal for parallelizing simple loops that operate on large
arrays of data.
Parallelism that’s obtained by dividing data among the processors and
having the processors all apply (more or less) the same instructions to
their subsets of the data is called data-parallelism (see the sketch after this list).
SIMD parallelism can be very efficient on large data parallel problems, but
SIMD systems often don’t do very well on other types of parallel problems.
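As a concrete picture of data-parallelism, the vector addition loop from earlier can be written so that each datapath (SIMD lane) applies the same add to its own elements. The omp simd pragma below is just one possible way to request SIMD code generation; the slides do not prescribe a particular tool.

/* Every SIMD lane applies the same add instruction to its own
   subset of the data: single instruction, multiple data. */
void vector_add(int n, const double *x, const double *y, double *z) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}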
Dr Asiyo (USIU) DST4030A: Parallel Computing 17 / 31
Parallel hardware Flynn’s Taxonomy
Figure 1: SISD and SIMD architecture
Dr Asiyo (USIU) DST4030A: Parallel Computing 18 / 31
Parallel hardware Flynn’s Taxonomy
2 MIMD systems
Multiple instruction, multiple data, or MIMD, systems support
multiple simultaneous instruction streams operating on multiple data
streams.
MIMD systems typically consist of a collection of fully independent
processing units or cores, each of which has its own control unit and
its own datapath.
MIMD systems are usually asynchronous, that is, the processors can
operate at their own pace.
Dr Asiyo (USIU) DST4030A: Parallel Computing 19 / 31
Parallel hardware Flynn’s Taxonomy
Figure 2: MIMD Architecture
Dr Asiyo (USIU) DST4030A: Parallel Computing 20 / 31
Parallel hardware Flynn’s Taxonomy
There are two principal types of MIMD systems: shared-memory
systems and distributed-memory systems.
In a shared-memory system a collection of autonomous processors is
connected to a memory system via an interconnection network, and
each processor can access each memory location.
In a shared-memory system, the processors usually communicate
implicitly by accessing shared data structures.
In a distributed-memory system, each processor is paired with its
own private memory, and the processor-memory pairs communicate
over an interconnection network.
So in distributed-memory systems, the processors usually
communicate explicitly by sending messages or by using special
functions that provide access to the memory of another processor.
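For the distributed-memory case, explicit communication might look like the MPI sketch below, in which process 0 sends an array to process 1. MPI is used here only as a familiar example of such message-passing functions; the slides do not name a specific library.

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    double data[100] = {0};          /* each process has its own private copy */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Process 0 explicitly sends its local data to process 1. */
        MPI_Send(data, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Process 1 explicitly receives the message into its own memory. */
        MPI_Recv(data, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}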
Dr Asiyo (USIU) DST4030A: Parallel Computing 21 / 31
Parallel hardware Cores Access Memory
3 Shared Memory: Involves all processors sharing the same address space, as
shown in Figure 3
Figure 3: Shared Memory Parallel Computer Architecture
Dr Asiyo (USIU) DST4030A: Parallel Computing 22 / 31
Parallel hardware Cores Access Memory
Advantages of Shared Memory:
A global address space provides a user-friendly programming
perspective on memory
Data sharing between tasks is both fast and uniform due to the
proximity of memory to CPUs
Disadvantages of Shared Memory:
Lack of scalability between memory and CPUs.
Programmer responsibility for synchronization constructs that
ensure "correct" access of global memory (see the sketch below).
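The sketch below illustrates that responsibility with a pthreads mutex protecting a shared counter; the variable names and the choice of pthreads are illustrative assumptions, not part of the slide material.

#include <pthread.h>

long shared_count = 0;                      /* lives in the shared address space */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Run by several threads (created as in the earlier pthreads sketch). */
void *increment(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);          /* serialize access so that ...      */
        shared_count++;                     /* ... updates to global memory are  */
        pthread_mutex_unlock(&lock);        /* "correct" (no lost increments)    */
    }
    return NULL;
}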
Dr Asiyo (USIU) DST4030A: Parallel Computing 23 / 31
Parallel hardware Cores Access Memory
4 Distributed Memory: Involves a communication network to
connect inter-processor memory, as shown in Figure 4
Figure 4: Distributed Memory Parallel Computer Architecture
Dr Asiyo (USIU) DST4030A: Parallel Computing 24 / 31
Parallel hardware Cores Access Memory
Processors have their own local memory.
Memory addresses in one processor do not map to another
processor, so there is no concept of global address space across all
processors.
Changes a processor makes to its local memory have no effect on the
memory of other processors.
It is the task of the programmer to explicitly define how and when
data is communicated.
Synchronization between tasks is likewise the programmer’s
responsibility.
Dr Asiyo (USIU) DST4030A: Parallel Computing 25 / 31
Parallel hardware Cores Access Memory
Advantages of Distributed Memory:
Scalable memory: total memory scales with the number of processors.
Each processor can rapidly access its own memory.
Cost-effective: can use commodity, off-the-shelf processors and
networking.
Disadvantages of Distributed Memory:
Programmer is responsible for details associated with data
communication between processors.
It may be difficult to map existing data structures, based on global
memory, to this memory organization.
Non-uniform memory access times: data residing on a remote node
takes longer to access than node-local data.
Dr Asiyo (USIU) DST4030A: Parallel Computing 26 / 31
Parallel hardware Cores Access Memory
5 Hybrid Distributed-Shared Memory: The largest and fastest
computers in the world today employ both shared and distributed
memory architectures (Figure 5).
Figure 5: Hybrid Memory Parallel Computer Architecture
Dr Asiyo (USIU) DST4030A: Parallel Computing 27 / 31
Parallel hardware Cores Access Memory
Figure 6: Hybrid Systems with Multicore CPUs
Dr Asiyo (USIU) DST4030A: Parallel Computing 28 / 31
Parallel hardware Cores Access Memory
The shared memory component can be a shared memory machine
and/or graphics processing units (GPUs).
The distributed memory component is the networking of multiple
shared memory/GPU machines, which know only about their own
memory, not the memory on another machine.
Network communications are required to move data from one
machine to another (see the sketch after this list).
Trends indicate that this type of memory architecture will prevail
and increase at the high end of computing for the foreseeable future.
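Programming such machines commonly layers a shared-memory model within each node on top of a message-passing model between nodes. The fragment below combines MPI with OpenMP purely as an illustration; the slides describe the architecture, not a particular programming model.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided, rank;

    /* MPI handles communication between distributed-memory nodes. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP threads share memory within a node. */
    #pragma omp parallel
    printf("process %d, thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}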
Dr Asiyo (USIU) DST4030A: Parallel Computing 29 / 31
Parallel hardware Cores Access Memory
Advantages & Disadvantages of Hybrid
The hybrid model inherits whatever is common to both shared and
distributed memory architectures
Increased scalability is an important advantage
Increased programming complexity is a major disadvantage
Dr Asiyo (USIU) DST4030A: Parallel Computing 30 / 31
Thank You!
Dr Asiyo (USIU) DST4030A: Parallel Computing 31 / 31