CS4403 - CS9535: An Overview of Parallel Computing
Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
January 10, 2017
Plan
1 Hardware
2 Types of Parallelism
3 Concurrency Platforms: Three Examples
Julia
Cilk
CUDA
MPI
Hardware
Plan
1 Hardware
2 Types of Parallelism
3 Concurrency Platforms: Three Examples
Julia
Cilk
CUDA
MPI
Hardware
von Neumann Architecture
In 1945, the Hungarian mathematician John von Neumann proposed
the above organization for computer hardware.
The Control Unit fetches instructions/data from memory, decodes the
instructions and then sequentially coordinates operations to
accomplish the programmed task.
The Arithmetic Unit performs basic arithmetic operations, while
Input/Output is the interface to the human operator.
Hardware
The Pentium Family
Hardware
Parallel computer hardware
Most computers today (including tablets, smartphones, etc.) are
equipped with several processing units (control+arithmetic units).
Various characteristics determine the types of computations: shared
memory vs distributed memory, single-core processors vs multicore
processors, data-centric parallelism vs task-centric parallelism.
Historically, shared memory machines have been classified as UMA
and NUMA, based upon memory access times.
Hardware
Uniform memory access (UMA)
Identical processors, equal access and access times to memory.
In the presence of cache memories, cache coherency is accomplished
at the hardware level: if one processor updates a location in shared
memory, then all the other processors know about the update.
UMA architectures were first represented by Symmetric
Multiprocessor (SMP) machines.
Multicore processors follow the same architecture and, in addition,
integrate the cores onto a single circuit die.
Hardware
Non-uniform memory access (NUMA)
Often made by physically linking two or more SMPs (or multicore
processors).
Global address space provides a user-friendly programming
perspective to memory, that is, it feels like there is a single large
memory where all data reside.
However, not all processors have equal access time to all memories,
since memory access across the link is slower.
In fact, memory contention (that is, a traffic jam) often limits the
ability of these architectures to scale.
Hardware
Multicore processors
Hardware
Multicore processors
Figure: four cores, each with private L1 instruction and data caches; the cores share two L2 caches and the main memory.
Hardware
Once upon a time, everything was slow in a computer...
Hardware
Graphics processing units (GPUs)
Hardware
Graphics processing units (GPUs)
A GPU consists of several streaming multiprocessors (SMs) with a
large shared memory. In addition, each SM has a small local (and
private) memory. Thus, GPUs cannot be classified as UMA or NUMA.
Hardware
Graphics processing units (GPUs)
In a GPU, the small local memories have much smaller access time
than the large shared memory.
Thus, as much as possible, cores access data in the local memories
while the shared memory should essentially be used for data exchange
between SMs.
Hardware
Distributed Memory
Distributed memory systems require a communication network to
connect inter-processor memory.
Processors have their own local memory and operate independently.
Memory addresses in one processor do not map to another processor,
so there is no concept of global address space across all processors.
Data exchange between processors is managed by the programmer,
not by the hardware.
Hardware
Hybrid Distributed-Shared Memory
The largest and fastest computers in the world today employ both
shared and distributed memory architectures.
Current trends seem to indicate that this type of memory architecture
will continue to prevail.
While this model allows applications to scale, it increases the
complexity of writing computer programs.
Types of Parallelism
Plan
1 Hardware
2 Types of Parallelism
3 Concurrency Platforms: Three Examples
Julia
Cilk
CUDA
MPI
Types of Parallelism
Pipelining
Pipelining is a common way to organize work with the objective of
optimizing throughput.
It turns out that this is also a way to execute concurrently several
tasks (that is, work units) processable by the same pipeline.
Types of Parallelism
Instruction pipeline
Above is a generic pipeline with four stages: Fetch, Decode, Execute,
Write-back.
The top gray box is the list of instructions waiting to be executed; the
bottom gray box is the list of instructions that have been completed;
and the middle white box is the pipeline.
Types of Parallelism
Data parallelism
The data set is typically organized into a common structure, such as
an array.
A set of tasks work collectively on that structure, however, each task
works on a different region.
Tasks perform the same operation on their region of work, for
example, "multiply every array element by some value".
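As an illustration (this sketch is not from the slides; the function and variable names are made up), each task could run the same routine on its own region of a shared array:

/* Hypothetical sketch: each of ntasks tasks applies the same operation
 * ("multiply every array element by some value") to its own region. */
void scale_region(double *a, int n, int ntasks, int my_rank, double c)
{
    int chunk = (n + ntasks - 1) / ntasks;      /* region size (ceiling division) */
    int lo = my_rank * chunk;                   /* first index of my region */
    int hi = (lo + chunk < n) ? lo + chunk : n; /* one past the last index */
    for (int i = lo; i < hi; i++)
        a[i] *= c;                              /* same operation, different region */
}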
Types of Parallelism
Data parallelism (2/2)
Illustration of a data-centric parallel program.
Types of Parallelism
Task parallelism (1/4)
program:
...
if CPU="a" then
do task "A"
else if CPU="b" then
do task "B"
end if
...
end program
Task parallelism is achieved when each processor executes a different
thread (or process) on the same or different data.
The threads may execute the same or different code.
Types of Parallelism
Task parallelism (2/4)
Code executed by CPU "a":
program:
...
do task "A"
...
end program
Code executed by CPU "b":
program:
...
do task "B"
...
end program
In the general case, different execution threads communicate with one
another as they work.
Communication usually takes place by passing data from one thread to
the next as part of a work-flow.
Types of Parallelism
Task parallelism (3/4)
Task parallelism can be regarded as a more general scheme than data
parallelism.
It applies both to situations where the work can be decomposed evenly
and to situations where the decomposition of the work is not predictable.
Types of Parallelism
Task parallelism (4/4)
In some situations, one may feel that the work can be decomposed evenly.
However, as time progresses, some tasks may finish before others.
Then, some processors become idle and should be put back to work, if
other tasks can be launched. This mapping of tasks onto hardware
resources is called scheduling.
In data-centric parallel applications, scheduling can be done off-line
(that is, by the programmer) or by the hardware (as on GPUs).
For task-centric parallel applications, it is desirable that scheduling is
done on-line (that is, dynamically) so as to cover cases where tasks
consume unpredictable amounts of resources.
Types of Parallelism
Patterns in task or data distribution
Exchanging data among processors in a parallel fashion provides
fundamental examples of concurrent programs.
Above, a master processor broadcasts or scatters data or tasks to
slave processors.
The same master processor gathers or reduces data from slave
processors.
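For a concrete flavour of these patterns (a hedged sketch, not part of the slides; it uses the MPI library presented later in this lecture), the master scatters chunks of an array to the workers and a reduction collects the partial sums back:

/* Sketch: rank 0 scatters one chunk per process, each process sums its
 * chunk, and the partial sums are reduced back to rank 0. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 4

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *data = NULL;
    if (rank == 0) {                             /* master owns the full array */
        data = malloc(size * CHUNK * sizeof(double));
        for (int i = 0; i < size * CHUNK; i++) data[i] = i;
    }

    double chunk[CHUNK];
    MPI_Scatter(data, CHUNK, MPI_DOUBLE,         /* master scatters ...       */
                chunk, CHUNK, MPI_DOUBLE,        /* ... one chunk per process */
                0, MPI_COMM_WORLD);

    double local = 0.0, total = 0.0;
    for (int i = 0; i < CHUNK; i++) local += chunk[i];

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        printf("total = %g\n", total);
        free(data);
    }
    MPI_Finalize();
    return 0;
}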
Types of Parallelism
Stencil computations
In scientific computing, stencil computations are very common.
Typically, a procedure updates array elements according to some fixed
pattern, called a stencil.
In the above, a 2D array of 100 × 100 elements is updated by the
stencil T.
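A minimal serial sketch (not from the slides; the averaging stencil is an assumption, since the slides only show the pattern T) of one sweep over an N × N array:

#define N 100

/* Each interior cell becomes the average of its four neighbours,
 * as in a Jacobi step for the heat-equation example below. */
void stencil_sweep(double out[N][N], const double in[N][N])
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j]
                              + in[i][j-1] + in[i][j+1]);
}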
Types of Parallelism
Stencil computations (2/3)
The above picture illustrates the dissipation of heat over a 2D grid.
A differential equation governs this phenomenon.
Once discretized, through the finite element method, it leads to a
stencil computation.
Types of Parallelism
Stencil computations (3/3)
The above picture illustrates the dissipation of heat over a 2D grid.
A differential equation governs this phenomenon.
Once discretized, through the finite element method, it leads to a
stencil computation.
Types of Parallelism
Pascal triangle construction: another stencil computation
   0   0   0   0   0   0   0   0
   1   1   1   1   1   1   1   1   1
   1   2   3   4   5   6   7   8
   1   3   6  10  15  21  28
   1   4  10  20  35  56
   1   5  15  35  70
   1   6  21  56
   1   7  28
   1   8
Construction of the Pascal Triangle: nearly the simplest stencil
computation!
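A minimal sketch of this construction (indexing conventions assumed, not taken from the slides): every entry is the sum of its upper and left neighbours, with the first row and first column set to 1.

/* Serial Pascal-triangle stencil on an n x n grid stored row-major. */
void pascal(int *P, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            P[i * n + j] = (i == 0 || j == 0)
                         ? 1
                         : P[(i - 1) * n + j] + P[i * n + (j - 1)];
}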
Types of Parallelism
Divide and conquer: principle
Figure: recursive decomposition of the triangular region; the labels I, II, III indicate successive phases, regions with the same label being computed concurrently.
Each triangle region can be computed as a square region followed by two
(concurrent) triangle regions.
Each square region can also be computed in a divide and conquer
manner.
Types of Parallelism
Blocking strategy: principle
Figure: the grid (entries a0, ..., a7 along the left edge, below a boundary row of zeros) is cut into blocks; the numbers 1 to 4 indicate the order in which successive bands of blocks are processed.
Let B be the order of a block and n be the number of elements.
Each block is processed serially (as a task) and the set of all blocks is
computed concurrently.
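A hedged sketch of the blocking strategy (block traversal order inferred from the figure, not the slides' own code): blocks on the same anti-diagonal band depend only on blocks of the previous band, so they could be handed to concurrent tasks; here each band is walked serially for clarity.

void pascal_blocked(int *P, int n, int B)
{
    int nb = (n + B - 1) / B;                        /* blocks per dimension */
    for (int band = 0; band < 2 * nb - 1; band++)    /* anti-diagonal bands  */
        for (int bi = 0; bi < nb; bi++) {
            int bj = band - bi;
            if (bj < 0 || bj >= nb) continue;        /* outside the grid     */
            /* process block (bi, bj) serially, as one task */
            for (int i = bi * B; i < n && i < (bi + 1) * B; i++)
                for (int j = bj * B; j < n && j < (bj + 1) * B; j++)
                    P[i * n + j] = (i == 0 || j == 0)
                                 ? 1
                                 : P[(i - 1) * n + j] + P[i * n + (j - 1)];
        }
}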
Concurrency Platforms: Three Examples
Plan
1 Hardware
2 Types of Parallelism
3 Concurrency Platforms: Three Examples
Julia
Cilk
CUDA
MPI
Concurrency Platforms: Three Examples Julia
Distributed arrays and parallel reduction (1/4)
[moreno@compute-0-3 ~]$ julia -p 5
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-prerelease+3622
 _/ |\__'_|_|_|\__'_|  |  Commit c9bb96c 2013-09-04 15:34:41 UTC
|__/                   |  x86_64-redhat-linux
julia> da = @parallel [2i for i = 1:10]
10-element DArray{Int64,1,Array{Int64,1}}:
2
4
6
8
10
12
14
16
18
20
Concurrency Platforms: Three Examples Julia
Distributed arrays and parallel reduction (2/4)
julia> procs(da)
4-element Array{Int64,1}:
2
3
4
5
julia> da.chunks
4-element Array{RemoteRef,1}:
RemoteRef(2,1,1)
RemoteRef(3,1,2)
RemoteRef(4,1,3)
RemoteRef(5,1,4)
julia>
julia> da.indexes
4-element Array{(Range1{Int64},),1}:
(1:3,)
(4:5,)
(6:8,)
(9:10,)
julia> da[3]
6
julia> da[3:5]
3-element SubArray{Int64,1,DArray{Int64,1,Array{Int64,1}},(Range1{Int64},)}:
6
8
10
Concurrency Platforms: Three Examples Julia
Distributed arrays and parallel reduction (3/4)
julia> fetch(@spawnat 2 da[3])
6
julia>
julia> { (@spawnat p sum(localpart(da))) for p=procs(da) }
4-element Array{Any,1}:
RemoteRef(2,1,71)
RemoteRef(3,1,72)
RemoteRef(4,1,73)
RemoteRef(5,1,74)
julia>
julia> map(fetch, { (@spawnat p sum(localpart(da))) for p=procs(da) })
4-element Array{Any,1}:
12
18
42
38
julia>
julia> sum(da)
110
Concurrency Platforms: Three Examples Julia
Distributed arrays and parallel reduction (4/4)
julia> reduce(+, map(fetch,
{ (@spawnat p sum(localpart(da))) for p=procs(da) }))
110
julia>
julia> preduce(f,d) = reduce(f,
map(fetch,
{ (@spawnat p f(localpart(d))) for p=procs(d) }))
# methods for generic function preduce
preduce(f,d) at none:1
julia> function Base.minimum(x::Int64, y::Int64)
min(x,y)
end
minimum (generic function with 10 methods)
julia> preduce(minimum, da)
2
Concurrency Platforms: Three Examples Cilk
From Cilk to Cilk++ and Cilk Plus
Cilk has been developed since 1994 at the MIT Laboratory for
Computer Science by Prof. Charles E. Leiserson and his group, in
particular by Matteo Frigo.
Besides being used for research and teaching, Cilk was the system
used to code the three world-class chess programs: *Tech, *Socrates,
and Cilkchess.
Over the years, the implementations of Cilk have run on computers
ranging from networks of Linux laptops to an 1824-node Intel
Paragon.
From 2007 to 2009, Cilk led to Cilk++, developed by Cilk Arts,
an MIT spin-off, which was acquired by Intel in July 2009 and
became Cilk Plus, see http://www.cilk.com/
Cilk++ can be freely downloaded at
http://software.intel.com/en-us/articles/download-intel-ci
Cilk is still developed at MIT
http://supertech.csail.mit.edu/cilk/
Concurrency Platforms: Three Examples Cilk
CilkPlus (and Cilk Plus)
CilkPlus (resp. Cilk) is a small set of linguistic extensions to C++
(resp. C) supporting task parallelism, using fork & join constructs.
Both Cilk and CilkPlus feature a provably efficient work-stealing
scheduler.
CilkPlus provides a hyperobject library for performing reductions for
data aggregation.
CilkPlus includes the Cilkscreen race detector and the Cilkview
performance analyzer.
Concurrency Platforms: Three Examples Cilk
Task Parallelism in CilkPlus
int fib(int n)
{
if (n < 2) return n;
int x, y;
x = cilk_spawn fib(n-1);
y = fib(n-2);
cilk_sync;
return x+y;
}
The named child function call, cilk_spawn fib(n-1), may execute in
parallel with its parent caller.
The CilkPlus keywords cilk_spawn and cilk_sync grant permission
for parallel execution. They do not command parallel execution.
Concurrency Platforms: Three Examples Cilk
The fork-join parallelism model
Example: fib(4)

int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = cilk_spawn fib(n-1);
        y = fib(n-2);
        cilk_sync;
        return (x+y);
    }
}

Figure: the spawn tree of fib(4), unfolding through fib(3), fib(2), fib(1) and fib(0).
The code is "processor oblivious": it makes no reference to the number of processors.
At run time, the computation DAG unfolds dynamically.
Concurrency Platforms: Three Examples Cilk
Loop Parallelism in CilkPlus
a11 a12 ⋯ a1n a11 a21 ⋯ an1
a21 a22 ⋯ a2n a12 a22 ⋯ an2
⋮ ⋮ ⋱ ⋮ ⋮ ⋮ ⋱ ⋮
an1 an2 ⋯ ann a1n a2n ⋯ ann
A AT
// indices run from 0, not 1
cilk_for (int i=1; i<n; ++i) {
    for (int j=0; j<i; ++j) {
        double temp = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = temp;
    }
}
The iterations of a cilk_for loop may execute in parallel.
Concurrency Platforms: Three Examples Cilk
Serial Semantics (1/2)
Cilk (resp. CilkPlus) is a multithreaded language for parallel
programming that generalizes the semantics of C (resp. C++) by
introducing linguistic constructs for parallel control.
Cilk (resp. CilkPlus) is a faithful extension of C (resp. C++):
• The C (resp. C++) elision of a Cilk (resp. CilkPlus) program is a
correct implementation of the semantics of the program.
• Moreover, on one processor, a parallel Cilk (resp. CilkPlus) program
scales down to run nearly as fast as its C (resp. C++) elision.
To obtain the serialization of a CilkPlus program
#define cilk_for for
#define cilk_spawn
#define cilk_sync
Concurrency Platforms: Three Examples Cilk
Serial Semantics (2/2)
int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = cilk_spawn fib(n-1);
        y = fib(n-2);
        cilk_sync;
        return (x+y);
    }
}
Cilk++ source

int fib (int n) {
    if (n<2) return (n);
    else {
        int x,y;
        x = fib(n-1);
        y = fib(n-2);
        return (x+y);
    }
}
Serialization
Concurrency Platforms: Three Examples Cilk
Scheduling
Figure: several processors P, each with a private cache ($), connected through a network to shared memory and I/O.
A scheduler’s job is to map a computation to particular processors. Such
a mapping is called a schedule.
If decisions are made at runtime, the scheduler is online; otherwise, it
is offline.
CilkPlus’s scheduler maps strands onto processors dynamically at
runtime.
Concurrency Platforms: Three Examples Cilk
The CilkPlus Platform
Figure: the CilkPlus tool chain, illustrated on the fib example. The Cilk++ source is processed by the Cilk++ compiler together with the hyperobject library and linked into a binary that runs on the runtime system; the Cilkview scalability analyzer and the Cilkscreen race detector operate on this binary, which is also exercised by parallel regression tests. The serialization of the same source is built with a conventional compiler and checked by conventional regression tests. The goal: reliable single-threaded and multi-threaded code with exceptional performance.
Concurrency Platforms: Three Examples Cilk
Benchmarks for parallel divide-and-conquer matrix multiplication
Multiplying a 4000x8000 matrix by an 8000x4000 matrix
on 32 cores = 8 sockets x 4 cores (Quad Core AMD Opteron 8354)
per socket.
The 32 cores share an L3 32-way set-associative cache of 2 Mbytes.
#core Elision (s) Parallel (s) speedup
8 420.906 51.365 8.19
16 432.419 25.845 16.73
24 413.681 17.361 23.83
32 389.300 13.051 29.83
Concurrency Platforms: Three Examples Cilk
Using Cilkview
Concurrency Platforms: Three Examples CUDA
CUDA design goals
Enable heterogeneous systems (i.e., CPU+GPU)
Scale to 100’s of cores, 1000’s of parallel threads
Use C/C++ with minimal extensions
Let programmers focus on parallel algorithms (as much as possible).
Concurrency Platforms: Three Examples CUDA
Heterogeneous programming (1/3)
A CUDA program is a serial program with parallel kernels, all in C.
The serial C code executes in a host (= CPU) thread
The parallel kernel C code executes in many device threads across
multiple GPU processing elements, called streaming processors (SP).
Concurrency Platforms: Three Examples CUDA
Heterogeneous programming (2/3)
Thus, the parallel code (kernel) is launched and executed on a device
by many threads.
Threads are grouped into thread blocks.
One kernel is executed at a time on the device.
Many threads execute each kernel.
Concurrency Platforms: Three Examples CUDA
Heterogeneous programming (3/3)
The parallel code is written for a thread
• Each thread is free to execute a unique code path
• Built-in thread and block ID variables are used to map each thread
to a specific data tile (see next slide).
Thus, each thread executes the same code on different data based on
its thread and block ID.
Concurrency Platforms: Three Examples CUDA
Example: increment array elements (1/2)
See our example number 4 in /usr/local/cs4402/examples/4
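The referenced code is not reproduced in these notes. A minimal sketch of an "increment array elements" kernel (kernel name, variable names and block size are illustrative, not the course's code):

// Each CUDA thread increments one array element by b.
__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < N)                                      // guard the tail
        a[idx] = a[idx] + b;
}

// Host-side launch: enough blocks of blockSize threads to cover N elements.
// increment_gpu<<<(N + blockSize - 1) / blockSize, blockSize>>>(d_a, b, N);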
Concurrency Platforms: Three Examples CUDA
Example: increment array elements (2/2)
Concurrency Platforms: Three Examples CUDA
Blocks run on multiprocessors
Concurrency Platforms: Three Examples CUDA
Streaming processors and multiprocessors
Concurrency Platforms: Three Examples CUDA
Hardware multithreading
Hardware allocates resources to blocks:
• blocks need: thread slots, registers, shared memory
• blocks don’t run until resources are available
Hardware schedules threads:
• threads have their own registers
• any thread not waiting for something can run
• context switching is free every cycle
Hardware relies on threads to hide latency:
• thus high parallelism is necessary for performance.
Concurrency Platforms: Three Examples CUDA
A Common programming strategy
Partition data into subsets that fit into shared memory
Concurrency Platforms: Three Examples CUDA
A Common programming strategy
Handle each data subset with one thread block
Concurrency Platforms: Three Examples CUDA
A Common programming strategy
Load the subset from global memory to shared memory, using multiple
threads to exploit memory-level parallelism.
Concurrency Platforms: Three Examples CUDA
A Common programming strategy
Perform the computation on the subset from shared memory.
Concurrency Platforms: Three Examples CUDA
A Common programming strategy
Copy the result from shared memory back to global memory.
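As a sketch of the five steps above (not taken from the slides; tile size and names are illustrative), a tiled matrix transpose loads one tile per thread block into shared memory and writes the transposed tile back to global memory:

#define TILE 16

__global__ void transpose_tiled(const float *in, float *out, int n)
{
    __shared__ float tile[TILE][TILE];          // the data subset in shared memory

    int x = blockIdx.x * TILE + threadIdx.x;    // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;    // row in the input
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // global -> shared
    __syncthreads();                            // whole tile loaded

    x = blockIdx.y * TILE + threadIdx.x;        // column in the output
    y = blockIdx.x * TILE + threadIdx.y;        // row in the output
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // shared -> global
}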
Concurrency Platforms: Three Examples MPI
What is the Messaging Passing Interface (MPI)?
A language-independent communication protocol for parallel computers
Run the same code on a number of nodes (different hardware threads,
servers)
Explicit message passing
Dominant model for high performance computing
Concurrency Platforms: Three Examples MPI
High Level Presentation of MPI
MPI is a type of SPMD (single program, multiple data)
Idea: to have multiple instances of the same program all working on
different data
The program could be running on the same machine, or on a cluster of
machines
Allows simple communication of data between processes
Concurrency Platforms: Three Examples MPI
MPI Functions
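The slide's table of functions is not reproduced here; the calls used in the rest of this section have the following standard prototypes:

int MPI_Init(int *argc, char ***argv);
int MPI_Comm_size(MPI_Comm comm, int *size);
int MPI_Comm_rank(MPI_Comm comm, int *rank);
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);
int MPI_Finalize(void);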
Concurrency Platforms: Three Examples MPI
MPI Function Notes
MPI_Datatype is just an enum; MPI_Comm is commonly
MPI_COMM_WORLD for the global communication channel
dest/source are the rank of the process to send the message
to/receive the message from:
• You may use MPI_ANY_SOURCE in MPI_Recv
Both MPI_Send and MPI_Recv are blocking calls
You can use man MPI_Send or man MPI_Recv for good
documentation
The tag allows you to organize your messages, so you can receive
only a specific tag
Concurrency Platforms: Three Examples MPI
Example
Here’s a common example:
Have the master (rank 0) process create some strings and send them
to the worker processes
The worker processes modify the string and send it back to the master
Concurrency Platforms: Three Examples MPI
Example Code (1)
Concurrency Platforms: Three Examples MPI
Example Code (2)
Concurrency Platforms: Three Examples MPI
Example Code (3)
Concurrency Platforms: Three Examples MPI
Example Code (4)
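The four listings referenced above are not reproduced in these notes; a minimal self-contained sketch of the master/worker string exchange (buffer size and tag chosen arbitrarily) could look as follows:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define TAG 0
#define BUFSIZE 128

int main(int argc, char *argv[])
{
    int rank, size;
    char buf[BUFSIZE];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                        /* master */
        for (int i = 1; i < size; i++) {
            snprintf(buf, BUFSIZE, "hello worker %d", i);
            MPI_Send(buf, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
        }
        for (int i = 1; i < size; i++) {
            MPI_Recv(buf, BUFSIZE, MPI_CHAR, MPI_ANY_SOURCE, TAG,
                     MPI_COMM_WORLD, &status);
            printf("master received: %s\n", buf);
        }
    } else {                                /* worker */
        MPI_Recv(buf, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &status);
        strncat(buf, " -- processed", BUFSIZE - strlen(buf) - 1);
        MPI_Send(buf, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}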
Concurrency Platforms: Three Examples MPI
Compiling
Concurrency Platforms: Three Examples MPI
Compiling and Running
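The screenshots for these two slides are not reproduced; with a typical MPI installation (file names here are illustrative), the workflow is:

mpicc example.c -o example      # compile with the MPI wrapper compiler
mpirun -np 4 ./example          # launch 4 processes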
Concurrency Platforms: Three Examples MPI
Other Things MPI Can Do
We can use nodes on a network (by using a hostfile)
We can even use MPMD, for multiple programs, multiple data.
mpirun -np 2 a.out : -np 2 b.out
All in the same MPI_COMM_WORLD
Ranks 0 and 1 are instances of a.out
Ranks 2 and 3 are instances of b.out
You could also use the app flag with an appfile instead of typing out
everything
Concurrency Platforms: Three Examples MPI
Performance Considerations and concluding remarks
Your bottleneck for performance here is messages
Keep the communication to a minimum
The more machines, the slower the communication in general
MPI is a powerful tool for highly parallel computing across multiple
machines
Programming is similar to a more powerful version of fork/join