
An Introduction to Parallel Programming

Peter Pacheco

Chapter 5
Shared Memory Programming
with OpenMP

Copyright © 2010, Elsevier Inc. All rights Reserved


Roadmap
 Writing programs that use OpenMP.
 Using OpenMP to parallelize many serial for loops with only small changes to the source code.
 Task parallelism.
 Explicit thread synchronization.
 Standard problems in shared-memory programming.



OpenMP
 An API for shared-memory parallel programming.
 MP = multiprocessing.
 Designed for systems in which each thread or process can potentially have access to all available memory.
 The system is viewed as a collection of cores or CPUs, all of which have access to main memory.



A shared memory system



Pragmas
 Special preprocessor instructions.
 Typically added to a system to allow behaviors that aren’t part of the basic C specification.
 Compilers that don’t support the pragmas ignore them.

#pragma



compiling:
gcc −g −Wall −fopenmp −o omp_hello omp_hello.c

running with 4 threads:
./omp_hello 4

possible outcomes (the threads’ lines can appear in any interleaving), e.g.:

Hello from thread 0 of 4      Hello from thread 1 of 4      Hello from thread 3 of 4
Hello from thread 1 of 4      Hello from thread 2 of 4      Hello from thread 2 of 4
Hello from thread 2 of 4      Hello from thread 3 of 4      Hello from thread 1 of 4
Hello from thread 3 of 4      Hello from thread 0 of 4      Hello from thread 0 of 4
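A minimal sketch of omp_hello.c consistent with this output (details may differ from the book’s listing):

#include <stdio.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

void Hello(void) {
#ifdef _OPENMP
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
#else
   int my_rank = 0;
   int thread_count = 1;
#endif
   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}

int main(int argc, char* argv[]) {
   int thread_count = strtol(argv[1], NULL, 10);  /* thread count from the command line */

#  pragma omp parallel num_threads(thread_count)
   Hello();

   return 0;
}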



OpenMP pragmas
 # pragma omp parallel

 Most basic parallel directive.
 The number of threads that run the following structured block of code is determined by the run-time system.



A process forking and joining
two threads



clause
 Text that modifies a directive.
 The num_threads clause can be added to a parallel directive.
 It allows the programmer to specify the number of threads that should execute the following block.

# pragma omp parallel num_threads(thread_count)



Of note…
 There may be system-defined limitations on the number of threads that a program can start.
 The OpenMP standard doesn’t guarantee that this will actually start thread_count threads.
 Most current systems can start hundreds or even thousands of threads.
 Unless we’re trying to start a lot of threads, we will almost always get the desired number of threads.



Some terminology
 In OpenMP parlance the collection of threads executing the parallel block — the original thread and the new threads — is called a team, the original thread is called the master, and the additional threads are called slaves.



In case the compiler doesn’t
support OpenMP

Instead of unconditionally writing

# include <omp.h>

guard the include:

#ifdef _OPENMP
# include <omp.h>
#endif



In case the compiler doesn’t
support OpenMP

# ifdef _OPENMP
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
# else
   int my_rank = 0;
   int thread_count = 1;
# endif



THE TRAPEZOIDAL RULE



The trapezoidal rule
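With n trapezoids of width h = (b − a)/n, the rule approximates the integral as

\[ \int_a^b f(x)\,dx \approx h\left[\frac{f(x_0)}{2} + f(x_1) + \cdots + f(x_{n-1}) + \frac{f(x_n)}{2}\right], \qquad x_i = a + ih. \]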



Serial algorithm

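The slide showed the serial code; a sketch along these lines (f is the integrand, assumed defined elsewhere):

/* Serial trapezoidal rule: integral of f from a to b with n trapezoids */
double Trap(double a, double b, int n) {
   double h = (b - a) / n;
   double approx = (f(a) + f(b)) / 2.0;
   for (int i = 1; i <= n - 1; i++)
      approx += f(a + i*h);
   return h * approx;
}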


A First OpenMP Version
1) We identified two types of tasks:
   a) computation of the areas of individual trapezoids, and
   b) adding the areas of trapezoids.
2) There is no communication among the tasks in the first collection, but each task in the first collection communicates with task 1b.



A First OpenMP Version
3) We assumed that there would be many more trapezoids than cores.
 So we aggregated tasks by assigning a contiguous block of trapezoids to each thread (and a single thread to each core).


Assignment of trapezoids to threads



Unpredictable results when two (or more) threads attempt to simultaneously execute:

global_result += my_result ;



Mutual exclusion

# pragma omp critical
global_result += my_result;

Only one thread can execute the following structured block at a time.



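The slides here showed the first OpenMP version of the trapezoidal program; a minimal sketch (f assumed as before):

/* Each thread computes its block of trapezoids, then adds its result
   to *global_result_p inside a critical section.                     */
void Trap(double a, double b, int n, double* global_result_p) {
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   double h = (b - a) / n;
   int local_n = n / thread_count;             /* trapezoids per thread  */
   double local_a = a + my_rank * local_n * h; /* this thread's interval */
   double local_b = local_a + local_n * h;

   double my_result = (f(local_a) + f(local_b)) / 2.0;
   for (int i = 1; i <= local_n - 1; i++)
      my_result += f(local_a + i*h);
   my_result = my_result * h;

#  pragma omp critical
   *global_result_p += my_result;
}

Called as:

   global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   Trap(a, b, n, &global_result);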
SCOPE OF VARIABLES



Scope
 In serial programming, the scope of a variable consists of those parts of a program in which the variable can be used.
 In OpenMP, the scope of a variable refers to the set of threads that can access the variable in a parallel block.



Scope in OpenMP
 A variable that can be accessed by all the threads in the team has shared scope.
 A variable that can only be accessed by a single thread has private scope.
 The default scope for variables declared before a parallel block is shared.
THE REDUCTION CLAUSE



We need this more complex version to add each thread’s local calculation to get global_result.

Although we’d prefer this.



If we use this, there’s no critical section!

If we fix it like this…

… we force the threads to execute sequentially.



We can avoid this problem by declaring a private variable inside the parallel block and moving the critical section after the function call.
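A sketch of that fix (Local_trap computes one thread’s share of the sum; the name is assumed for illustration):

   double global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
   {
      double my_result = 0.0;            /* private: declared inside the block   */
      my_result += Local_trap(a, b, n);  /* no critical section inside the call  */
#     pragma omp critical
      global_result += my_result;        /* serialized only for the final add    */
   }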



[Cartoon: the threads react to this solution: “I don’t like it.” “Neither do I.” “I think we can do better.”]


Reduction operators
 A reduction operator is a binary operation (such as addition or multiplication).
 A reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in order to get a single result.
 All of the intermediate results of the operation should be stored in the same variable: the reduction variable.



A reduction clause can be added to a parallel directive.

+, *, -, &, |, ^, &&, ||
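The syntax is reduction(<operator>: <variable list>). For the trapezoidal program (a sketch, reusing the assumed Local_trap from above):

   double global_result = 0.0;
#  pragma omp parallel num_threads(thread_count) reduction(+: global_result)
   global_result += Local_trap(a, b, n);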



THE “PARALLEL FOR”
DIRECTIVE



Parallel for
 Forks a team of threads to execute the following structured block.
 However, the structured block following the parallel for directive must be a for loop.
 Furthermore, with the parallel for directive the system parallelizes the for loop by dividing the iterations of the loop among the threads.
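For example, the trapezoidal rule loop can be handed to parallel for (a sketch):

   h = (b - a) / n;
   approx = (f(a) + f(b)) / 2.0;
#  pragma omp parallel for num_threads(thread_count) reduction(+: approx)
   for (i = 1; i <= n - 1; i++)
      approx += f(a + i*h);
   approx = h * approx;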



Legal forms for parallelizable for statements
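Roughly, the legal (“canonical”) form is (a paraphrase of the OpenMP rules, not the exact slide):

   for (index = start;
        index < end;      /* or <=, >=, > */
        index++)          /* or ++index, index--, --index,
                             index += incr, index -= incr,
                             index = index + incr, index = incr + index,
                             index = index - incr */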



Caveats
 The variable index must have integer or pointer type (e.g., it can’t be a float).
 The expressions start, end, and incr must have a compatible type. For example, if index is a pointer, then incr must have integer type.



Caveats
 The expressions start, end, and incr must not change during execution of the loop.
 During execution of the loop, the variable index can only be modified by the “increment expression” in the for statement.



Data dependencies
fibo[0] = fibo[1] = 1;
for (i = 2; i < n; i++)
   fibo[i] = fibo[i-1] + fibo[i-2];

note: 2 threads

fibo[0] = fibo[1] = 1;
# pragma omp parallel for num_threads(2)
for (i = 2; i < n; i++)
   fibo[i] = fibo[i-1] + fibo[i-2];

this is correct:           1 1 2 3 5 8 13 21 34 55
but sometimes we get this: 1 1 2 3 5 8 0 0 0 0


What happened?
1. OpenMP compilers don’t check for dependences among iterations in a loop that’s being parallelized with a parallel for directive.
2. A loop in which the results of one or more iterations depend on other iterations cannot, in general, be correctly parallelized by OpenMP.



Estimating π
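The estimate uses the alternating series

\[ \pi = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right) = 4\sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}. \]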



OpenMP solution #1

loop dependency



OpenMP solution #2

Ensures factor has private scope.
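A sketch of solution #2; factor is recomputed from i in every iteration, so each thread’s private copy is always well defined (variable names assumed):

   double factor, sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) private(factor)
   for (i = 0; i < n; i++) {
      factor = (i % 2 == 0) ? 1.0 : -1.0;   /* no dependence on iteration i-1 */
      sum += factor / (2*i + 1);
   }
   pi_approx = 4.0 * sum;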



The default clause
 Lets the programmer specify the scope of each variable in a block.
 With this clause the compiler will require that we specify the scope of each variable we use in the block and that has been declared outside the block.
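A sketch with default(none), building on the π loop above:

   double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) default(none) \
      reduction(+: sum) private(i, factor) shared(n)
   for (i = 0; i < n; i++) {
      factor = (i % 2 == 0) ? 1.0 : -1.0;
      sum += factor / (2*i + 1);
   }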





MORE ABOUT LOOPS IN
OPENMP: SORTING



Bubble Sort
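The slide showed the serial code; a sketch:

   /* serial bubble sort of a[0..n-1] */
   for (list_length = n; list_length >= 2; list_length--)
      for (i = 0; i < list_length - 1; i++)
         if (a[i] > a[i+1]) {   /* compare-swap adjacent elements */
            tmp = a[i];
            a[i] = a[i+1];
            a[i+1] = tmp;
         }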



Serial Odd-Even Transposition Sort

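A sketch of the serial code (Swap is an assumed helper):

   for (phase = 0; phase < n; phase++) {
      if (phase % 2 == 0) {              /* even phase: pairs (a[i-1], a[i]) */
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) Swap(&a[i-1], &a[i]);
      } else {                           /* odd phase: pairs (a[i], a[i+1]) */
         for (i = 1; i < n - 1; i += 2)
            if (a[i] > a[i+1]) Swap(&a[i], &a[i+1]);
      }
   }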




First OpenMP Odd-Even Sort
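A sketch of the first version: a new team of threads is forked for every phase (swaps written out with tmp):

   for (phase = 0; phase < n; phase++) {
      if (phase % 2 == 0) {
#        pragma omp parallel for num_threads(thread_count) private(tmp)
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) { tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp; }
      } else {
#        pragma omp parallel for num_threads(thread_count) private(tmp)
         for (i = 1; i < n - 1; i += 2)
            if (a[i] > a[i+1]) { tmp = a[i]; a[i] = a[i+1]; a[i+1] = tmp; }
      }
   }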



Second OpenMP Odd-Even Sort
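In the second version the team is forked once and the inner for directives reuse it (a sketch; phase is private because every thread runs the phase loop, and the implicit barrier at the end of each for directive keeps the phases in step):

#  pragma omp parallel num_threads(thread_count) \
      default(none) shared(a, n) private(i, tmp, phase)
   for (phase = 0; phase < n; phase++) {
      if (phase % 2 == 0) {
#        pragma omp for
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) { tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp; }
      } else {
#        pragma omp for
         for (i = 1; i < n - 1; i += 2)
            if (a[i] > a[i+1]) { tmp = a[i]; a[i] = a[i+1]; a[i+1] = tmp; }
      }
   }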



Odd-even sort with two parallel for directives and two for directives (times are in seconds).



SCHEDULING LOOPS



We want to parallelize this loop.

Assignment of work using cyclic partitioning.
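The loop in question (a sketch; f is defined on the next slide):

   sum = 0.0;
   for (i = 0; i <= n; i++)
      sum += f(i);

With cyclic partitioning, thread q is assigned iterations q, q + thread_count, q + 2*thread_count, and so on.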



Our definition of function f.
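A sketch consistent with the description on the next slide (exact indices assumed; requires #include <math.h>):

   double f(int i) {
      int j;
      int start = i*(i+1)/2;       /* iteration cost grows linearly with i */
      int finish = start + i;
      double return_val = 0.0;
      for (j = start; j <= finish; j++)
         return_val += sin(j);
      return return_val;
   }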



Results
 f(i) calls the sin function i times.
 Assume that executing f(2i) takes approximately twice as long as executing f(i).

 n = 10,000
 one thread
 run-time = 3.67 seconds.



Results
 n = 10,000
 two threads
 default assignment
 run-time = 2.76 seconds
 speedup = 1.33
 n = 10,000
 two threads
 cyclic assignment
 run-time = 1.84 seconds
 speedup = 1.99
The Schedule Clause
 Default schedule:

 Cyclic schedule:
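Sketches of the two versions:

   /* default schedule (typically block partitioning) */
#  pragma omp parallel for num_threads(thread_count) reduction(+: sum)
   for (i = 0; i <= n; i++)
      sum += f(i);

   /* cyclic schedule */
#  pragma omp parallel for num_threads(thread_count) reduction(+: sum) \
      schedule(static, 1)
   for (i = 0; i <= n; i++)
      sum += f(i);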



schedule ( type , chunksize )
 Type can be:
    static: the iterations can be assigned to the threads before the loop is executed.
    dynamic or guided: the iterations are assigned to the threads while the loop is executing.
    auto: the compiler and/or the run-time system determine the schedule.
    runtime: the schedule is determined at run-time.
 The chunksize is a positive integer.
The Static Schedule Type
twelve iterations, 0, 1, . . . , 11, and three threads
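The slides illustrated chunksizes 1, 2, and 4. With twelve iterations and three threads, the static schedule assigns:

schedule(static, 1):  Thread 0: 0, 3, 6, 9     Thread 1: 1, 4, 7, 10    Thread 2: 2, 5, 8, 11
schedule(static, 2):  Thread 0: 0, 1, 6, 7     Thread 1: 2, 3, 8, 9     Thread 2: 4, 5, 10, 11
schedule(static, 4):  Thread 0: 0, 1, 2, 3     Thread 1: 4, 5, 6, 7     Thread 2: 8, 9, 10, 11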


The Dynamic Schedule Type
 The iterations are also broken up into chunks of chunksize consecutive iterations.
 Each thread executes a chunk, and when a thread finishes a chunk, it requests another one from the run-time system.
 This continues until all the iterations are completed.
 The chunksize can be omitted. When it is omitted, a chunksize of 1 is used.



The Guided Schedule Type
 Each thread also executes a chunk, and when a thread finishes a chunk, it requests another one.
 However, in a guided schedule, as chunks are completed the size of the new chunks decreases.
 If no chunksize is specified, the size of the chunks decreases down to 1.
 If chunksize is specified, it decreases down to chunksize, with the exception that the very last chunk can be smaller than chunksize.



Assignment of trapezoidal rule iterations 1–9999 using a guided schedule with two threads.



The Runtime Schedule Type
 The system uses the environment variable OMP_SCHEDULE to determine at run-time how to schedule the loop.
 The OMP_SCHEDULE environment variable can take on any of the values that can be used for a static, dynamic, or guided schedule.
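A sketch:

#  pragma omp parallel for num_threads(thread_count) reduction(+: sum) \
      schedule(runtime)
   for (i = 0; i <= n; i++)
      sum += f(i);

   /* before running, e.g. in bash:  export OMP_SCHEDULE="guided,4" */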



PRODUCERS AND
CONSUMERS



Queues
 Can be viewed as an abstraction of a line of customers waiting to pay for their groceries in a supermarket.
 A natural data structure to use in many multithreaded applications.
 For example, suppose we have several “producer” threads and several “consumer” threads.
    Producer threads might “produce” requests for data.
    Consumer threads might “consume” the request by finding or generating the requested data.



Message-Passing
 Each thread could have a shared message queue, and when one thread wants to “send a message” to another thread, it could enqueue the message in the destination thread’s queue.
 A thread could receive a message by dequeuing the message at the head of its message queue.





Sending Messages
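The slide showed the send code; a sketch (Enqueue and msg_queues are assumed names):

   mesg = random();                     /* the "message" is just a number */
   dest = random() % thread_count;      /* pick a destination thread      */
#  pragma omp critical                  /* another sender may be touching */
   Enqueue(msg_queues[dest], my_rank, mesg);   /* the same queue          */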



Receiving Messages
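A sketch of the receive side (struct fields assumed). Only the owner dequeues from its own queue, so synchronization is needed only when a sender might be updating the queue at the same time:

   queue_size = queue_p->enqueued - queue_p->dequeued;
   if (queue_size == 0)
      return;                           /* nothing to receive               */
   else if (queue_size == 1)
#     pragma omp critical               /* a sender may be enqueuing now    */
      Dequeue(queue_p, &src, &mesg);
   else
      Dequeue(queue_p, &src, &mesg);    /* >= 2 messages: no conflict possible */
   Print_message(src, mesg);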



Termination Detection

The figure showed a shared counter that each thread increments after completing its for loop of sends.
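A sketch, with the counter named done_sending (the name is assumed for illustration):

   /* after a thread finishes its send loop (done_sending: shared int) */
#  pragma omp atomic
   done_sending++;

   /* a thread can stop receiving once its own queue is empty
      and done_sending == thread_count                          */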



Startup (1)
 When the program begins execution, a single thread, the master thread, will get command line arguments and allocate an array of message queues: one for each thread.
 This array needs to be shared among the threads, since any thread can send to any other thread, and hence any thread can enqueue a message in any of the queues.



Startup (2)
 One or more threads may finish allocating their queues before some other threads.
 We need an explicit barrier so that when a thread encounters the barrier, it blocks until all the threads in the team have reached the barrier.
 After all the threads have reached the barrier, all the threads in the team can proceed.
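The explicit barrier directive:

#  pragma omp barrier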



The Atomic Directive (1)
 Unlike the critical directive, it can only protect critical sections that consist of a single C assignment statement.
 Further, the statement must have one of the following forms:
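The allowed forms are (with <expression> not referencing x):

   x <op>= <expression>;
   x++;
   ++x;
   x--;
   --x;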



The Atomic Directive (2)
 Here <op> can be one of the binary operators +, *, -, /, &, ^, |, <<, or >>.
 Many processors provide a special load-modify-store instruction.
 A critical section that only does a load-modify-store can be protected much more efficiently by using this special instruction rather than the constructs that are used to protect more general critical sections.



Critical Sections
 OpenMP provides the option of adding a name to a critical directive:
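The syntax is:

# pragma omp critical(name)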

 When we do this, two blocks protected with critical directives with different names can be executed simultaneously.
 However, the names are set during compilation, and we want a different critical section for each thread’s queue.
Locks
 A lock consists of a data structure and functions that allow the programmer to explicitly enforce mutual exclusion in a critical section.



Locks

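The simple lock API (standard OpenMP; presumably what the slide listed):

   omp_lock_t lock;

   omp_init_lock(&lock);     /* initialize before first use       */
   omp_set_lock(&lock);      /* block until the lock is acquired  */
   /* ... critical section ... */
   omp_unset_lock(&lock);    /* release the lock                  */
   omp_destroy_lock(&lock);  /* free resources when finished      */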


Using Locks in the Message-Passing Program



Using Locks in the Message-Passing Program

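A sketch: each queue carries its own lock (struct layout assumed), replacing the critical directives:

   /* sending */
   omp_set_lock(&queue_p->lock);
   Enqueue(queue_p, my_rank, mesg);
   omp_unset_lock(&queue_p->lock);

   /* receiving */
   omp_set_lock(&queue_p->lock);
   Dequeue(queue_p, &src, &mesg);
   omp_unset_lock(&queue_p->lock);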


Some Caveats
1. You shouldn’t mix the different types of mutual exclusion for a single critical section.
2. There is no guarantee of fairness in mutual exclusion constructs.
3. It can be dangerous to “nest” mutual exclusion constructs.



Matrix-vector multiplication
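The slide showed the code; a sketch (A stored row-major in a one-dimensional array; layout assumed):

#  pragma omp parallel for num_threads(thread_count) \
      default(none) private(i, j) shared(A, x, y, m, n)
   for (i = 0; i < m; i++) {
      y[i] = 0.0;
      for (j = 0; j < n; j++)
         y[i] += A[i*n + j] * x[j];
   }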



Matrix-vector multiplication

Run-times and efficiencies of matrix-vector multiplication (times are in seconds).



Thread-Safety



Concluding Remarks (1)
 OpenMP is a standard for programming shared-memory systems.
 OpenMP uses both special functions and preprocessor directives called pragmas.
 OpenMP programs start multiple threads rather than multiple processes.
 Many OpenMP directives can be modified by clauses.



Concluding Remarks (2)
 A major problem in the development of shared memory programs is the possibility of race conditions.
 OpenMP provides several mechanisms for ensuring mutual exclusion in critical sections:
    Critical directives
    Named critical directives
    Atomic directives
    Simple locks
Concluding Remarks (3)
 By default most systems use a block-partitioning of the iterations in a parallelized for loop.
 OpenMP offers a variety of scheduling options.
 In OpenMP the scope of a variable is the collection of threads to which the variable is accessible.



Concluding Remarks (4)
 A reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in order to get a single result.

