About OpenMP
• Background
• OpenMP
Parallel Hardware
[Diagram: processors (P) and memories (M) connected by an interconnection network]
Parallel Hardware
Distributed Memory Machines: private memory for each processor, only
accessible by that processor, so no synchronization for memory accesses
is needed. Information is exchanged by sending data from one processor
to another via an interconnection network, using explicit communication
operations.
[Diagrams: shared-memory machines: processors (P) with caches (C) on a bus
to shared memory, and distributed shared memory with memories (M) reached
through an interconnection network]
(The SGI Altix at NASA Ames had 10,240 Itanium 2 CPUs in nodes connected by
Infiniband; it was ranked 84 on the June 2010 list and 3rd in 2008. Expensive!)
Parallel Programming Models
• Data
• What data is private or shared?
• How is logically shared data accessed or communicated?
• Synchronization
• What operations are used to coordinate parallelism?
• What operations are atomic (indivisible)?
Shared Memory Programming Model
• shared variables
• private variables
• Library functions
• Environment variables
The goal is standardization, ease of use, and portability. OpenMP allows an
incremental approach: significant parallelism is possible with just 3 or 4
directives. It works on SMPs and DSMs.
It allows both fine- and coarse-grained parallelism: loop-level parallelism
as well as explicit assignment of work to threads, as in SPMD.
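As a small illustration of the ingredients listed above (a runtime library
routine plus the standard OMP_NUM_THREADS environment variable); a sketch,
assuming compilation with OpenMP enabled:
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* thread count set from the environment, e.g.  export OMP_NUM_THREADS=4 */
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)   /* library routine */
            printf("team has %d threads\n", omp_get_num_threads());
    }
    return 0;
}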
What is OpenMP?
• http://www.openmp.org
• Maintained by the OpenMP Architecture Review Board
(ARB), a non-profit group of organizations that interprets and
updates OpenMP, writes new specs, etc. Members include
Compaq/Digital, HP, Intel, IBM, KAI, SGI, Sun, and DOE.
Endorsed by software and application vendors.
• Individuals also participate through cOMPunity, which
participates in ARB, organizes workshops, etc.
• Started in 1997. OpenMP 3.0 just recently released.
• For Fortran (77,90,95), C and C++, on Unix, Windows NT
and other platforms.
OpenMP = Open specifications for MultiProcessing
Basic Idea
Explicit programmer control of parallelization using fork-join
model of parallel execution
• all OpenMP programs begin as a single process, the master
thread, which executes until a parallel region construct is
encountered
• FORK: the master thread creates a team of parallel threads
• JOIN: when the threads complete the statements in the parallel
region construct, they synchronize and terminate, leaving
only the master thread (similar to the fork-join of Pthreads)
Compile line:
gcc -fopenmp helloWorld.c
icc -openmp helloWorld.c
Simple Example
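The program listing itself is not reproduced here; a minimal version
consistent with the compile lines above and the output below might look like:
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        printf("Hello world from thread %d\n", omp_get_thread_num());
    }
    return 0;
}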
Sample Output:
MacBook-Pro% a.out
Hello world from thread 1
Hello world from thread 0
Hello world from thread 2
Hello world from thread 3
MacBook-Pro% a.out
Hello world from thread 0
Hello world from thread 3
Hello world from thread 2
Hello world from thread 1
int main(){
  int var1, var2, var3;
  /* ... serial code ... */
  #pragma omp parallel private(var1,var2) shared(var3)
  {
    /* ... parallel section ... */
  }
}
Parallel Directives
• When a thread reaches a PARALLEL directive, it becomes
the master and has thread number 0.
• All threads execute the same code in the parallel region
(Possibly redundant, or use work-sharing constructs to
distribute the work)
• There is an implied barrier∗ at the end of a parallel section.
Only the master thread continues past this point (see the sketch below).
• If a thread terminates within a parallel region, all threads
will terminate, and the result is undefined.
• Cannot branch into or out of a parallel region.
barrier - all threads wait for each other; no thread proceeds until all threads
have reached that point
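A minimal sketch of this behaviour; the statement after the parallel region
is executed by the master thread only:
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        printf("thread %d works here\n", omp_get_thread_num());
    }   /* implied barrier: all threads synchronize and join here */

    /* only the master thread continues past the region */
    printf("back to serial execution\n");
    return 0;
}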
Parallel Directives
Directives are:
• Case sensitive (not for Fortran)
• Only one directive-name per statement
• Directives apply to at most one succeeding statement,
which must be a structured block.
• Continue on succeeding lines with backslash ( "\" )
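For illustration, a small sketch showing a clause list continued with a
backslash and a directive applied to the single structured block that
follows (the array and its size are made up):
#include <omp.h>

#define N 100
double a[N];

int main(void) {
    int i;
    /* one directive-name per pragma; clauses continued with "\" */
    #pragma omp parallel for default(none) shared(a) \
                             private(i)
    for (i = 0; i < N; i++)
        a[i] = 0.;
    return 0;
}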
FOR directive
#pragma omp for [clause ...]
schedule (type [,chunk])
private (list)
firstprivate(list)
lastprivate(list)
shared (list)
reduction (operator: list)
nowait
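A sketch of the directive in use inside a parallel region (the schedule
choices and array names are only illustrative):
#include <omp.h>

#define N 1000
double a[N], b[N];

int main(void) {
    int i;
    #pragma omp parallel shared(a,b) private(i)
    {
        /* iterations handed out in chunks of 100 */
        #pragma omp for schedule(static,100)
        for (i = 0; i < N; i++)
            a[i] = i * 1.0;

        /* nowait: no implied barrier after this loop, which is safe here
           because the two loops touch different arrays */
        #pragma omp for schedule(dynamic) nowait
        for (i = 0; i < N; i++)
            b[i] = 2.0 * i;
    }
    return 0;
}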
SINGLE directive
• SINGLE directive says only one thread in the team executes the
enclosed code
• useful for code that isn’t thread-safe (e.g. I/O)
• rest of threads wait at the end of enclosed code block (unless
NOWAIT clause specified)
• no branching in or out of SINGLE block
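A sketch of a typical use: one thread prints a report while the rest wait
at the end of the block:
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* only one thread prints; the others wait here (no NOWAIT clause) */
        #pragma omp single
        printf("report printed by one thread out of %d\n",
               omp_get_num_threads());

        /* ... all threads continue here ... */
    }
    return 0;
}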
firstprivate example
What is wrong with this code?
dx = 1./n;
#pragma omp parallel for private(x,dx)
for (i=0;i<n;i++){
  x = i*dx;
  y[i] = exp(x) * cos(x) * sqrt(6*x+5);
}
firstprivate example
With private(dx), each thread’s copy of dx is uninitialized inside the
parallel region. Instead use firstprivate(dx), which initializes each
thread’s copy with the value dx had before the region:
dx = 1./n;
#pragma omp parallel for private(x) firstprivate(dx)
for (i=0;i<n;i++){
  x = i*dx;
  y[i] = exp(x) * cos(x) * sqrt(6*x+5);
}
int sum = 0;
#pragma omp parallel for shared(n,a) \
        reduction(+:sum)
for (i=0; i<n; i++){
  sum += a[i];   /* private partial sums are combined at the end */
}
Locks
Locks control access to shared resources. Up to
implementation to use spin locks (busy waiting) or not.
• Lock variables must be accessed only through locking
routines:
omp_init_lock omp_destroy_lock
omp_set_lock omp_unset_lock omp_test_lock
• In C, a lock has type omp_lock_t or omp_nest_lock_t
(in Fortran the lock variable is an integer)
• initial state of a lock is unlocked.
• omp_set_lock(omp_lock_t *lock) forces the calling
thread to wait until the specified lock is available.
(The non-blocking version is omp_test_lock.)
Examining and setting a lock must be an uninterruptible operation.
Lock Example
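The example listing is not reproduced here; a minimal sketch using the
routines above to protect an update of a shared variable (names are
illustrative):
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_lock_t lck;
    int sum = 0;

    omp_init_lock(&lck);             /* lock starts out unlocked */
    #pragma omp parallel shared(sum, lck)
    {
        int id = omp_get_thread_num();

        omp_set_lock(&lck);          /* wait until the lock is available */
        sum += id;                   /* update of shared data is protected */
        omp_unset_lock(&lck);
    }
    omp_destroy_lock(&lck);
    printf("sum of thread ids = %d\n", sum);
    return 0;
}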
Deadlock
Runtime situation that occurs when a thread is waiting for a resource
that will never be available. Common situation is when two (or more)
actions are each waiting for the other to finish (for example, 2 threads
acquire 2 locks in different order)
Nested Loops
Which is better (assuming m ≈ n)?
# pragma omp parallel for
for (j=0;j<m;j++)
  for (i=0;i<n;i++)
    a[j][i] = 0.;
or
for (j=0;j<m;j++)
  # pragma omp parallel for
  for (i=0;i<n;i++)
    a[j][i] = 0.;
[Plot: overhead (time in microseconds) of the parallel, for, parallel for,
barrier, atomic, and reduction constructs vs. number of threads]
Conditional Compilation
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_thread_num() 0
#endif
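For instance, a sketch of a small program that builds and runs correctly
whether or not OpenMP is enabled, using the stub above:
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_thread_num() 0   /* serial fallback */
#endif

int main(void) {
    /* without OpenMP the pragma is ignored and the stub returns 0 */
    #pragma omp parallel
    printf("hello from thread %d\n", omp_get_thread_num());
    return 0;
}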
Computing π by integrating 4/(1 + x²) over [0,1] with the midpoint rule
(this loop sits inside a parallel region, hence the closing brace):
h = 1.0/(double)n;
sum = 0.0;
#pragma omp for reduction(+:sum) private(x)
for (i=1; i<=n; i++){
  x = h*((double)i - 0.5);
  sum += 4.0/(1.0+x*x);
}
pi = h * sum;
} /* end parallel region */
The index variable of a parallel for loop is private by default, but loop
variables at deeper nesting levels are not.
int i,j;
#pragma omp parallel for
for (i=0;i<n;i++)
  for (j=0;j<m;j++){
    a[i][j] = compute(i,j);
  }
Here j is shared, so the threads race on it.
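One possible fix is to make j private (or declare it inside the loop); a
self-contained sketch, with compute() as a stand-in for the real work:
#include <omp.h>

#define N 100
#define M 200
double a[N][M];

static double compute(int i, int j) { return i + 0.5 * j; }

int main(void) {
    int i, j;
    /* i is private by default as the parallel-for index;
       j must be made private explicitly */
    #pragma omp parallel for private(j)
    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++)
            a[i][j] = compute(i, j);
    return 0;
}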
Typical Bugs
int i;
#pragma omp parallel for
for (i='a'; i<='z'; i++)
  printf("%c",i);
Each letter is printed exactly once, but in a nondeterministic order.
Typical Bugs
int i;
#pragma omp parallel
for (i='a'; i<='z'; i++)
  printf("%c",i);
Without the for work-sharing construct, every thread executes the entire
loop, so the alphabet is printed once per thread, and all threads race on
the shared variable i.
Memory Model
When are new values of shared data guaranteed to be available to
other threads besides the one that did the update? (i.e. when should
the updating thread write back to memory or invalidate other copies
of data). In what order should the updates be performed?
Processor P1              Processor P2
x = new;                  y = new;
y_copy = y;               x_copy = x;
(On a machine that reorders these operations, x_copy and y_copy may both
end up holding the old values.)
OpenMP model: each thread has its own temporary view of the
values of shared data between explicit or implicit barrier
synchronization points. If shared variables are accessed between
these points, the user must take care to keep the threads’
temporary view of memory consistent.
Memory Model
A (broken) attempt to signal between threads with a shared flag: thread 0
writes newdata and then sets signal, while the other threads spin-wait on
signal. Without flushes there is no guarantee the waiting threads ever see
the update to signal, or that they see newdata after it.
signal = 0;
#pragma omp parallel default(none) shared(signal,newdata) \
private(TID,localdata)
{
TID = omp_get_thread_num();
if (TID == 0) {
newdata = 10;
signal = 1;
}
else {
while (signal == 0) {}
localdata = newdata;
}
} /* end parallel */
Memory Model
Wrong use of flush to synchronize! Compiler can reorder flush statements.
signal = 0;
#pragma omp parallel shared(signal,newdata) private(TID,localdata)
{
TID = omp_get_thread_num();
if (TID == 0) {
newdata = 10;
#pragma omp flush(newdata)
signal = 1;
#pragma omp flush(signal)
}
else {
#pragma omp flush(signal)
while (signal == 0) {
#pragma omp flush(signal)
}
#pragma omp flush(newdata)
localdata = newdata;
}
} /* end parallel */
Memory Model
A flush of both items is needed to prevent interchange of the assignments.
signal = 0;
#pragma omp parallel shared(signal,newdata) private(TID,localdata)
{
TID = omp_get_thread_num();
if (TID == 0) {
newdata = 10;
#pragma omp flush(newdata,signal)
signal = 1;
#pragma omp flush(signal)
}
else {
#pragma omp flush(signal)
while (signal == 0) {
#pragma omp flush(signal)
}
#pragma omp flush(newdata)
localdata = newdata;
}
} /* end parallel */
Memory Model
program main
!$OMP PARALLEL DEFAULT(NONE) PRIVATE(...)
call mysubroutine()
!$OMP END PARALLEL
stop
end
subroutine mysubroutine()
implicit none
real, save :: a
real :: a_inv
logical, save :: first
data first/.true./
if (first) then
a = 10.0
first = .false.
end if
a_inv = 1.0/a
return
end
The SAVE variables a and first are shared between the threads, so a thread
may see first == .false. before another thread’s write of a becomes visible,
and compute a_inv from an undefined a.
Speedup and Efficiency
• Speedup S(p) = T1 /Tp ; efficiency E = S(p)/p
• Ideal speedup S = T1 /Tp = T1 /(T1 /p) = p has 100% efficiency
• Amdahl’s law: if f is the serial fraction of the code (and T1 = 1), then
Tp = f + (1 − f )/p, so S < 1/f and
E = S/p = 1/(f p + 1 − f ) → 0 as p → ∞.
For example, with f = 0.05 the speedup can never exceed 20.
Scaled Speedup
Strong scaling: as above, run the same code on an increasing number of
processors; the problem size stays constant.
Weak scaling: increase the size of the problem as the number of cpus increases.
• Originally due to Gustafson, to get around Amdahl’s law. Keep the
problem size n per processor constant, so the total problem size grows
linearly with p.
• A problem arises if the time complexity of the algorithm grows
superlinearly with the problem size. For example, for data size n let
T1 = n²; for data size pn the parallel time is Tp = (pn)²/p = p·n² = p·T1 ,
so the parallel run time grows linearly with p even with ideal scaling.
• Hard to measure fairly, since the whole problem doesn’t fit on 1 cpu.
References
• http://computing.llnl.gov/tutorials/openMP/
a very complete description of OpenMP for Fortran and C
• Using OpenMP: Portable Shared Memory Parallel Programming,
by Chapman, Jost, and van der Pas
• http://www.openmp.org