Sample Code (Parallel): CSE 6230 Fa14, Lecture 04 (OpenMP)
๏ Also: https://computing.llnl.gov/tutorials/openMP/
Simple example
#include <stdio.h>
#include <omp.h>

int main ()
{
  omp_set_num_threads (16); // OPTIONAL; can also use the
                            // OMP_NUM_THREADS environment variable

  #pragma omp parallel
  {
    printf ("hello, world!\n"); // Executed in parallel by every thread
  } // Implicit barrier/join
  return 0;
}

Compiling:
  gcc -fopenmp …
  icc -openmp …

Output (one line per thread):
  hello, world!
  hello, world!
  hello, world!
  …
Parallel loops
#pragma omp parallel for default (none) shared (a,n) private (i)
for (i = 0; i < n; ++i) {
a[i] += foo (i);
} // Implicit barrier/join
Combining omp parallel and omp for is just a convenient shorthand for a common idiom; the expanded form is sketched below.
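For reference, here is the same loop with the parallel region and the work-sharing loop written separately (a minimal sketch; the wrapper function update and the stub foo are hypothetical, standing in for the slide's a, n, and foo):

#include <omp.h>

static double foo (int i) { return (double) i; }  /* hypothetical stand-in */

void update (double *a, int n)                    /* hypothetical wrapper */
{
  int i;
  #pragma omp parallel default (none) shared (a, n) private (i)
  {
    #pragma omp for
    for (i = 0; i < n; ++i) {
      a[i] += foo (i);
    } /* implicit barrier at the end of the for */
  }   /* implicit barrier/join at the end of the region */
}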
“If” clause
const int B = …;
#pragma omp parallel for if (n>B) default (none) shared (a,n) private (i)
for (i = 0; i < n; ++i) {
a[i] += foo (i);
} // Implicit barrier/join
Parallel loops
๏ You must check dependencies
s = 0;
for (i = 0; i < n; ++i)
  s += x[i];
Every iteration updates s, so this loop carries a dependence and cannot be parallelized as written.
If such a loop is placed in a single construct inside a parallel region, only one thread from the team will execute it; adding nowait lets the other threads proceed while that one thread executes the loop.
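For this particular dependence, OpenMP's reduction clause is the standard fix; a minimal sketch (the wrapper function sum is hypothetical; s, x, n follow the slide):

/* Hypothetical wrapper around the slide's loop. */
double sum (const double *x, int n)
{
  double s = 0;
  int i;
  #pragma omp parallel for default (none) shared (x, n) reduction (+:s)
  for (i = 0; i < n; ++i)
    s += x[i];   /* each thread accumulates a private s; copies are combined at the barrier */
  return s;
}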
Master thread
omp_set_lock (l);
Explicit locks May require flushing …
omp_unset_lock (l);
21
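For completeness, the full lock lifecycle from the OpenMP runtime API, as a minimal sketch (the shared counter is a made-up example, not from the slides):

#include <omp.h>

int counter = 0;                     /* hypothetical shared data guarded by the lock */
omp_lock_t lock;

void count_in_parallel (void)
{
  omp_init_lock (&lock);             /* create the lock before handing it to threads */
  #pragma omp parallel
  {
    omp_set_lock (&lock);            /* blocks until the lock is acquired */
    ++counter;                       /* mutually exclusive update */
    omp_unset_lock (&lock);
  }
  omp_destroy_lock (&lock);          /* free the lock once no thread needs it */
}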
Self-scheduling trade-off
Unit of work to grab: balance vs. contention
[Figure: worker threads pulling chunks of work from a shared queue]
Some variations:
Guided self-scheduling
Tapering
Work queue
[Figure: a queue of 23 work units scheduled onto P = 3 processors. Ideal finish time: 23 / 3 ≈ 7.67. With a fixed chunk size k = 1 the schedule finishes at 12.5; with fixed k = 3 at 11; the remaining variants shown finish at 11 and 10.5.]
Summary: Loop scheduling
Static: k iterations per thread, assigned statically
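In OpenMP, these policies are selected with the schedule clause on a loop construct; a minimal sketch (work, n, and the chunk size k are placeholders, not from the slides):

// static: iterations split into chunks of k, dealt out round-robin up front
#pragma omp parallel for schedule(static, k)
for (i = 0; i < n; ++i) work (i);

// dynamic: each idle thread grabs the next chunk of k iterations (self-scheduling)
#pragma omp parallel for schedule(dynamic, k)
for (i = 0; i < n; ++i) work (i);

// guided: like dynamic, but chunks start large and shrink toward k
#pragma omp parallel for schedule(guided, k)
for (i = 0; i < n; ++i) work (i);

// runtime: policy taken from the OMP_SCHEDULE environment variable
#pragma omp parallel for schedule(runtime)
for (i = 0; i < n; ++i) work (i);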
Tasking (OpenMP 3.0+)
(Shown here in Cilk's spawn/sync notation for comparison; an OpenMP-task version is sketched below.)
int fib (int n) {
  // G == tuning parameter
  if (n <= G) return fib_seq (n);
  int f1, f2;
  f1 = _Cilk_spawn fib (n-1);
  f2 = fib (n-2);
  _Cilk_sync;
  return f1 + f2;
}
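A corresponding OpenMP-task sketch of the same recursion (not from the slides; fib_seq, the cutoff G, and the driver fib_par are assumptions):

int fib (int n) {
  if (n <= G) return fib_seq (n);   // G == tuning parameter, as above
  int f1, f2;
  #pragma omp task shared(f1)       // child task computes fib(n-1)
  f1 = fib (n-1);
  f2 = fib (n-2);                   // parent computes fib(n-2) itself
  #pragma omp taskwait              // wait for the child before combining
  return f1 + f2;
}

// Hypothetical driver: create the team once; one thread starts the recursion,
// the others pick up tasks at the region's closing barrier.
int fib_par (int n) {
  int result;
  #pragma omp parallel
  #pragma omp single nowait
  result = fib (n);
  return result;
}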
[Figure: scatter plot of measured run times for the parallel fib code (f1 = fib (n-1); f2 = fib (n-2); return f1 + f2;) over many repeated trials; the measurements spread widely.]
Note: Parallel run-times are highly variable! For your assignments and projects, you should gather and report suitable statistics.
Tasking (OpenMP 3.0+)
// Sequential version
// Computes: f2 (A (), f1 (B (), C ()))
int a, b, c, x, y;

a = A();
b = B();
c = C();
x = f1(b, c);
y = f2(a, x);

// Task version (assume a parallel region)
#pragma omp task shared(a)
a = A();
#pragma omp task if (0) shared(b, c, x)
{
  #pragma omp task shared(b)
  b = B();
  #pragma omp task shared(c)
  c = C();
  #pragma omp taskwait
  x = f1 (b, c);
}
#pragma omp taskwait
y = f2 (a, x);
void bar ()
{
  #pragma omp for
  for (int i = 1; i < 100; ++i)
    boo ();
}
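The omp for in bar() has no lexically enclosing parallel region, so it binds to whichever region is active when bar() is called (an orphaned work-sharing construct). A minimal sketch of such a caller, with boo() stubbed out as a placeholder:

#include <omp.h>

void boo (void) { /* placeholder for the real per-iteration work */ }

void bar (void)
{
  #pragma omp for
  for (int i = 1; i < 100; ++i)
    boo ();
}

int main (void)
{
  #pragma omp parallel   // bar()'s omp for shares its iterations across this team
  bar ();
  return 0;              // called outside a parallel region, bar() would run sequentially
}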
Example: Two quad-core CPUs with logically shared but physically distributed memory
jinx9 $ /nethome/rvuduc3/local/jinx/likwid/2.2.1-ICC/bin/likwid-topology -g
[output condensed]
CPU type: Intel Core Westmere processor
Hardware thread topology: 2 sockets, 6 cores per socket, 2 threads per core (24 hardware threads)
Socket 0: ( 0 12 8 20 4 16 2 14 10 22 6 18 )
Socket 1: ( 1 13 9 21 5 17 3 15 11 23 7 19 )
NUMA domains: 2
Domain 0: processors 0 2 4 6 8 10 12 14 16 18 20 22; memory 10988.6 MB free of 12277.8 MB total
Domain 1: processors 1 3 5 7 9 11 13 15 17 19 21 23; memory 10986.1 MB free of 12288 MB total
Graphical view: each core has 32 kB L1 and 256 kB L2 caches; each socket shares a 12 MB L3.
Exploiting NUMA: Linux “first-touch” policy
a = /* … allocate buffer … */;
#pragma omp parallel for … schedule(static)
for (i = 0; i < n; ++i) {
  a[i] = /* … initial value … */ ;
}
Consider a 2-socket x 6-core system in which the main thread initializes the data and p OpenMP threads then operate on it: under first-touch, every page of a lands on the main thread's NUMA domain, and the other threads pay for remote accesses. Initializing in parallel with the same schedule(static), as above, places each page near the thread that will later use it.
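A fuller sketch of the first-touch idiom (malloc-based allocation and the two hypothetical functions are assumptions, not from the slides):

#include <stdlib.h>
#include <omp.h>

double *alloc_and_touch (size_t n)
{
  double *a = malloc (n * sizeof *a);   /* pages are not physically placed yet */
  size_t i;
  /* First touch: each thread zeroes the block it will later work on, so with a
     static schedule the pages land on that thread's NUMA domain. */
  #pragma omp parallel for schedule(static)
  for (i = 0; i < n; ++i)
    a[i] = 0.0;
  return a;
}

void stream_update (double *a, size_t n)
{
  size_t i;
  /* Same static schedule: each thread revisits the memory it first touched. */
  #pragma omp parallel for schedule(static)
  for (i = 0; i < n; ++i)
    a[i] += 1.0;
}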
“Triad:” c[i] ← a[i] + s*b[i]
[Figure: measured effective bandwidth (GB/s) of the triad kernel across many trials; values span roughly 10-25 GB/s and scatter widely.]
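A minimal OpenMP sketch of the triad kernel measured in that plot (the wrapper function and its arguments are assumptions; the update itself follows the slide):

/* c[i] <- a[i] + s*b[i], parallelized across iterations */
void triad (double *c, const double *a, const double *b, double s, int n)
{
  int i;
  #pragma omp parallel for schedule(static)
  for (i = 0; i < n; ++i)
    c[i] = a[i] + s * b[i];
}

For bandwidth measurements of this kind, n is chosen large enough that the three arrays do not fit in cache, and the kernel is timed over many repetitions.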
Expressing task parallelism (II): Tasks (new in OpenMP 3.0)
Like Cilk's spawn construct

void foo () {
  // A & B independent
  A ();
  B ();
}

// Idiom for tasks
void foo () {
  #pragma omp parallel
  {
    #pragma omp single nowait
    {
      #pragma omp task
      A ();
      #pragma omp task
      B ();
    }
  }
}

// Variant: spawn only A as a task
void foo () {
  #pragma omp parallel
  {
    #pragma omp single nowait
    {
      #pragma omp task
      A ();
      // Or, let parent run B
      B ();
    }
  }
}

http://wikis.sun.com/display/openmp/Using+the+Tasking+Feature
Variation 1: Fixed chunk size
Kruskal and Weiss (1985) give a model for computing the optimal chunk size
Assumes independent subtasks
Limitation: you must know the task-time distribution, though 'n / p' works (~0.8x optimal for large n/p)
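In OpenMP terms, fixed-size chunking is the dynamic schedule with an explicit chunk size; a minimal sketch (the wrapper update_chunked is hypothetical, and k = n/p just mirrors the heuristic above, not a tuned value):

#include <omp.h>

double foo (int i);                          /* as on the earlier slides */

void update_chunked (double *a, int n)       /* hypothetical wrapper */
{
  int p = omp_get_max_threads ();
  int k = (n + p - 1) / p;                   /* fixed chunk size: the n/p heuristic */
  int i;
  #pragma omp parallel for schedule(dynamic, k)
  for (i = 0; i < n; ++i)
    a[i] += foo (i);
}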
Variation 2: Guided self-scheduling
Idea: each idle thread grabs a decreasing fraction of the remaining work, roughly ceil(R/p) of the R iterations still unassigned, so chunks start large and taper off.
Polychronopoulos & Kuck (1987): “Guided self-scheduling: A practical scheduling scheme for parallel supercomputers”
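A tiny sketch (not from the slides) that prints the guided chunk sequence for the earlier work-queue example of 23 iterations on p = 3 processors:

#include <stdio.h>

int main (void)
{
  int remaining = 23, p = 3;
  /* Guided self-scheduling: the next idle processor takes ceil(remaining / p) iterations. */
  while (remaining > 0) {
    int chunk = (remaining + p - 1) / p;
    printf ("%d ", chunk);
    remaining -= chunk;
  }
  printf ("\n");   /* prints: 8 5 4 2 2 1 1 */
  return 0;
}

OpenMP's schedule(guided) clause provides this family of decreasing-chunk schedules.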
Variation 3: Tapering
Idea
Variation 4: Weighted factoring
What if the hardware is heterogeneous?
Ref: Hummel, Schmidt, Uma, Wein (1996). “Load-sharing in heterogeneous systems using weighted factoring.” In SPAA.
When self-scheduling is useful
Task costs are unknown
Works best for tasks without dependencies; it can be used when dependencies exist, but most analyses ignore them