OpenMP + NUMA

CSE 6230: HPC Tools & Apps


Fall 2014 — September 5

Based in part on the LLNL tutorial @ https://computing.llnl.gov/tutorials/openMP/


See also the textbook, Chapters 6 & 7
OpenMP
๏ Programmer identifies serial and parallel regions, not threads

๏ Library + directives (requires compiler support)

๏ Official website: http://www.openmp.org

๏ Also: https://computing.llnl.gov/tutorials/openMP/
Simple example

int main ()
{
  printf ("hello, world!\n"); // Goal: execute this in parallel

  return 0;
}
Simple example
#include <omp.h>

int main ()
{
  omp_set_num_threads (16); // OPTIONAL; can also use the
                            // OMP_NUM_THREADS environment variable

  #pragma omp parallel
  {
    printf ("hello, world!\n"); // Execute in parallel
  } // Implicit barrier/join
  return 0;
}
Simple example
#include <omp.h>

int main ()
{
  omp_set_num_threads (16); // OPTIONAL; can also use the
                            // OMP_NUM_THREADS environment variable

  #pragma omp parallel num_threads(8) // Restrict team size locally
  {
    printf ("hello, world!\n"); // Execute in parallel
  } // Implicit barrier/join
  return 0;
}
Simple example
#include <omp.h>

int main ()
{
  omp_set_num_threads (16); // OPTIONAL; can also use the
                            // OMP_NUM_THREADS environment variable

  #pragma omp parallel
  {
    printf ("hello, world!\n"); // Execute in parallel
  } // Implicit barrier/join
  return 0;
}

Compiling:
  gcc -fopenmp …
  icc -openmp …
Simple example
#include <omp.h>

int main ()
{
  omp_set_num_threads (16); // OPTIONAL; can also use the
                            // OMP_NUM_THREADS environment variable

  #pragma omp parallel
  {
    printf ("hello, world!\n"); // Execute in parallel
  } // Implicit barrier/join
  return 0;
}

Output:
  hello, world!
  hello, world!
  hello, world!
  …
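Each thread in the team can identify itself with omp_get_thread_num() and query the team size with omp_get_num_threads(); a minimal sketch:

#include <omp.h>
#include <stdio.h>

int main ()
{
  #pragma omp parallel
  {
    int tid = omp_get_thread_num ();  // This thread's index within the team
    int nth = omp_get_num_threads (); // Size of the current team
    printf ("hello from thread %d of %d\n", tid, nth);
  } // Implicit barrier/join
  return 0;
}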
Parallel loops

for (i = 0; i < n; ++i) {


a[i] += foo (i);
}
Parallel loops

for (i = 0; i < n; ++i) {


a[i] += foo (i);
}
#pragma omp parallel shared (a,n) // Activates the team of threads
{
  #pragma omp for private (i) // Declares the work-sharing loop
  for (i = 0; i < n; ++i) {
    a[i] += foo (i);
  } // Implicit barrier at the end of the for construct
} // Implicit barrier/join at the end of the parallel region
Parallel loops

for (i = 0; i < n; ++i) {
  a[i] += foo (i);
}

// At the call site:
#pragma omp parallel shared (a,n)
{
  foo (a, n);
} // Implicit barrier/join

void foo (item* a, int n) {
  int i;
  #pragma omp for private (i)
  for (i = 0; i < n; ++i) {
    a[i] += foo (i);
  } // Implicit barrier
}

Note: the omp for inside foo() is an "orphaned" directive (it is not lexically enclosed in a parallel construct). If foo() is called from outside any parallel region, the loop simply executes sequentially.
Parallel loops

for (i = 0; i < n; ++i) {


a[i] += foo (i);
}

#pragma omp parallel for default (none) shared (a,n) private (i)
for (i = 0; i < n; ++i) {
a[i] += foo (i);
} // Implicit barrier/join

Combining omp parallel and omp for is just a convenient shorthand for a common idiom.
“If” clause

for (i = 0; i < n; ++i) {


a[i] += foo (i);
}

const int B = …;
#pragma omp parallel for if (n>B) default (none) shared (a,n) private (i)
for (i = 0; i < n; ++i) {
a[i] += foo (i);
} // Implicit barrier/join
Parallel loops
๏ You must check dependencies

s = 0;
for (i = 0; i < n; ++i)
s += x[i];
Parallel loops
๏ You must check dependencies

s = 0;
for (i = 0; i < n; ++i)
s += x[i];

#pragma omp parallel for shared(s)


for (i = 0; i < n; ++i)
s += x[i]; // Data race!
Parallel loops
๏ You must check dependencies

s = 0;
for (i = 0; i < n; ++i)
  s += x[i];

// Option 1: protect the shared update with a critical section
#pragma omp parallel for shared(s)
for (i = 0; i < n; ++i)
  #pragma omp critical
  s += x[i];

// Option 2 (usually much faster): use a reduction
#pragma omp parallel for reduction (+:s)
for (i = 0; i < n; ++i)
  s += x[i];
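For concreteness, here is a self-contained sketch of the reduction version; the array size and contents are arbitrary placeholders:

#include <omp.h>
#include <stdio.h>

int main ()
{
  enum { N = 1000000 };
  static double x[N];
  double s = 0.0;
  int i;

  for (i = 0; i < N; ++i)
    x[i] = 1.0; // Arbitrary data, so the expected sum is N

  #pragma omp parallel for reduction (+:s)
  for (i = 0; i < N; ++i)
    s += x[i]; // Each thread accumulates a private copy; copies are combined at the end

  printf ("sum = %g (expected %d)\n", s, N);
  return 0;
}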
Removing implicit barriers: nowait

#pragma omp parallel default (none) shared (a,b,n) private (i)


{
#pragma omp for nowait
for (i = 0; i < n; ++i)
a[i] = foo (i);
#pragma omp for nowait
for (i = 0; i < n; ++i)
b[i] = bar (i);
}

Contrast with _Cilk_for, which does not have such a “feature.”


Single thread

#pragma omp parallel default (none) shared (a,b,n) private (i)


{
#pragma omp single [nowait]
for (i = 0; i < n; ++i) {
  a[i] = foo (i);
} // Implied barrier unless "nowait" is specified

#pragma omp for
for (i = 0; i < n; ++i)
b[i] = bar (i);
}

Only one thread in the team executes the first loop. Adding nowait lets the other threads proceed to the second loop instead of waiting at the end of the single construct.
Master thread

#pragma omp parallel default (none) shared (a,b,n) private (i)


{
#pragma omp master
for (i = 0; i < n; ++i) {
a[i] = foo (i);
} // No implied barrier
#pragma omp for
for (i = 0; i < n; ++i)
b[i] = bar (i);
}
Synchronization primitives

Critical sections (no explicit locks needed):
  #pragma omp critical
  { … }

Barriers:
  #pragma omp barrier

Explicit locks (may require flushing):
  omp_set_lock (l);
  …
  omp_unset_lock (l);

Single-thread regions (inside parallel regions):
  #pragma omp single
  { /* executed once */ }
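The lock calls above are only fragments; a minimal sketch of the full lock life cycle, where the shared counter is just an illustrative example:

#include <omp.h>
#include <stdio.h>

int main ()
{
  omp_lock_t lock;
  int counter = 0;

  omp_init_lock (&lock); // Locks must be initialized before use

  #pragma omp parallel
  {
    omp_set_lock (&lock);   // Acquire: blocks until the lock is available
    ++counter;              // Protected update
    omp_unset_lock (&lock); // Release
  }

  omp_destroy_lock (&lock); // Release lock resources
  printf ("counter = %d\n", counter);
  return 0;
}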
Loop scheduling
Static: chunks of k iterations per thread, assigned statically

  #pragma omp parallel for schedule(static, k) …

Dynamic: chunks of k iterations handed out on demand from a logical work queue

  #pragma omp parallel for schedule(dynamic, k) …

Guided: chunk size starts large and shrinks with each allocation, down to a minimum of k

  #pragma omp parallel for schedule(guided, k) …

Run-time: schedule(runtime) uses the value of the environment variable OMP_SCHEDULE

What are all these scheduling things?
Loop scheduling strategies for load balance

Centralized scheduling (task queue): a dynamic, on-line approach

  Good for a small number of workers

  Independent tasks, number known in advance

  For loops: self-scheduling
    A task is a subset of the iterations
    Useful when the loop body has unpredictable running time

  Tang & Yew (ICPP '86)

[Figure: worker threads pulling tasks from a centralized task queue]
Self-scheduling trade-off
Unit of work to grab: balance vs. contention

Some variations:

  Grab a fixed-size chunk

  Guided self-scheduling

  Tapering

  Weighted factoring, adaptive factoring, distributed trapezoid

  Self-adapting, gap-aware, …

[Figure: worker threads and a shared task queue]
Work queue

Example: a queue of 23 tasks on P = 3 processors. Ideal finish time: 23 / 3 ≈ 7.67. Finish times for different chunking strategies:

  Fixed k=1:       12.5
  Fixed k=3:       11
  Tapered, k0=3:   11
  Tapered, k0=4:   10.5
Summary: Loop scheduling
Static: chunks of k iterations per thread, assigned statically

  #pragma omp parallel for schedule(static, k) …

Dynamic: chunks of k iterations handed out on demand from a logical work queue

  #pragma omp parallel for schedule(dynamic, k) …

Guided: chunk size starts large and shrinks with each allocation, down to a minimum of k

  #pragma omp parallel for schedule(guided, k) …

Run-time: schedule(runtime) uses the value of the environment variable OMP_SCHEDULE
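One convenient way to experiment with these policies without recompiling is schedule(runtime) plus the OMP_SCHEDULE variable; a small sketch, where the triangular inner loop is an arbitrary stand-in for irregular work:

#include <omp.h>
#include <stdio.h>

int main ()
{
  enum { N = 1000 };
  double total = 0.0;
  int i;

  // Schedule chosen at run time from OMP_SCHEDULE,
  // e.g.  env OMP_SCHEDULE="dynamic,4" ./a.out
  #pragma omp parallel for schedule(runtime) reduction(+:total)
  for (i = 0; i < N; ++i) {
    double work = 0.0;
    for (int j = 0; j < i; ++j)   // Iteration cost grows with i (irregular work)
      work += 1.0 / (1.0 + j);
    total += work;
  }

  printf ("total = %g\n", total);
  return 0;
}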
Tasking (OpenMP 3.0+)
// For comparison, the Cilk version:
int fib (int n) {
  // G == tuning parameter
  if (n <= G) return fib__seq (n);
  int f1, f2;
  f1 = _Cilk_spawn fib (n-1);
  f2 = fib (n-2);
  _Cilk_sync;
  return f1 + f2;
}

See also: https://iwomp.zih.tu-dresden.de/downloads/omp30-tasks.pdf


Tasking (OpenMP 3.0+)
// The same function using OpenMP tasks:
int fib (int n) {
  // G == tuning parameter
  if (n <= G) return fib__seq (n);
  int f1, f2;
  #pragma omp task default(none) shared(n, f1)
  f1 = fib (n-1);
  f2 = fib (n-2);
  #pragma omp taskwait
  return f1 + f2;
}

See also: https://iwomp.zih.tu-dresden.de/downloads/omp30-tasks.pdf


Tasking (OpenMP 3.0+)

// At the call site:
#pragma omp parallel
#pragma omp single nowait
answer = fib (n);

// … with fib() defined as above.

See also: https://iwomp.zih.tu-dresden.de/downloads/omp30-tasks.pdf
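Putting the call site and the function together, a self-contained sketch might look like this; G, the test value of n, and fib__seq below are placeholder choices made only so the example compiles and runs (build with gcc -fopenmp):

#include <omp.h>
#include <stdio.h>

#define G 20 // Sequential cutoff (tuning parameter; this value is only a guess)

static int fib__seq (int n) {
  return (n < 2) ? n : fib__seq (n-1) + fib__seq (n-2);
}

int fib (int n) {
  if (n <= G) return fib__seq (n);
  int f1, f2;
  #pragma omp task default(none) shared(n, f1)
  f1 = fib (n-1);        // Child task; may run on another thread
  f2 = fib (n-2);        // The parent computes the other half itself
  #pragma omp taskwait   // Wait for the child before combining
  return f1 + f2;
}

int main (void) {
  int n = 35;            // Arbitrary test input
  int answer = 0;
  #pragma omp parallel
  #pragma omp single nowait // One thread generates tasks; the others execute them
  answer = fib (n);
  printf ("fib(%d) = %d\n", n, answer);
  return 0;
}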


[Figure: speedup relative to mean sequential time for the tasked fib code, comparing Sequential, Cilk Plus, and OpenMP Tasking runs; the axis shown ranges from 1 to 14, and the measured speedups scatter widely across repeated runs.]
Note: Parallel run-times are highly variable! For your assignments and projects, you should gather and report suitable statistics.
Tasking (OpenMP 3.0+)

// Computes: f2 (A (), f1 (B (), C ()))
int a, b, c, x, y;

a = A();
b = B();
c = C();
x = f1(b, c);
y = f2(a, x);

// Tasked version (assume we are already inside a parallel region):
#pragma omp task shared(a)
a = A();
#pragma omp task if (0) shared (b, c, x) // Undeferred task: gives the inner taskwait a local scope
{
  #pragma omp task shared(b)
  b = B();
  #pragma omp task shared(c)
  c = C();
  #pragma omp taskwait // Waits only for the B and C child tasks, not for the A task
  x = f1 (b, c);
}
#pragma omp taskwait
y = f2 (a, x);

See also: https://iwomp.zih.tu-dresden.de/downloads/omp30-tasks.pdf
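A self-contained way to exercise this pattern; the A/B/C/f1/f2 functions below are placeholder stubs chosen only to make the example runnable:

#include <omp.h>
#include <stdio.h>

static int A (void) { return 1; }
static int B (void) { return 2; }
static int C (void) { return 3; }
static int f1 (int b, int c) { return b + c; }
static int f2 (int a, int x) { return a * x; }

int main (void) {
  int a, b, c, x, y;
  #pragma omp parallel
  #pragma omp single nowait
  {
    #pragma omp task shared(a)
    a = A();
    #pragma omp task if (0) shared(b, c, x)
    {
      #pragma omp task shared(b)
      b = B();
      #pragma omp task shared(c)
      c = C();
      #pragma omp taskwait // B and C are ready here
      x = f1 (b, c);
    }
    #pragma omp taskwait   // A (and the undeferred task) are done here
    y = f2 (a, x);
    printf ("y = %d (expected %d)\n", y, f2 (A (), f1 (B (), C ())));
  }
  return 0;
}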


Check: What does this code do?

// At some call site:
#pragma omp parallel
#pragma omp single nowait
foo ();

foo ()
{
  #pragma omp task
  bar ();
  baz ();
}

bar ()
{
  #pragma omp for
  for (int i = 1; i < 100; ++i)
    boo ();
}

See also: https://iwomp.zih.tu-dresden.de/downloads/omp30-tasks.pdf


Performance tuning tip:
Controlling the number of threads

#pragma omp parallel num_threads(4) default (none) shared (a,n) private (i)
{
  #pragma omp for
  for (i = 0; i < n; ++i) {
    a[i] += foo (i);
  }
}
Performance tuning tip:
Exploit non-uniform memory access (NUMA)

[Figure: two sockets (Socket-0, Socket-1), each with four cores and its own locally attached DRAM; memory is logically shared but physically distributed.]

Example: Two quad-core CPUs with logically shared but physically distributed memory
likwid-topology output for a jinx node (two-column output merged and condensed; the per-hardware-thread table is summarized by the socket lists below):

jinx9 $ /nethome/rvuduc3/local/jinx/likwid/2.2.1-ICC/bin/likwid-topology -g

CPU type: Intel Core Westmere processor
Sockets: 2; Cores per socket: 6; Threads per core: 2

NUMA domains: 2
  Domain 0: processors 0 2 4 6 8 10 12 14 16 18 20 22; memory 10988.6 MB free of 12277.8 MB
  Domain 1: processors 1 3 5 7 9 11 13 15 17 19 21 23; memory 10986.1 MB free of 12288 MB

Socket 0: ( 0 12 8 20 4 16 2 14 10 22 6 18 )
Socket 1: ( 1 13 9 21 5 17 3 15 11 23 7 19 )

Caches: 32 kB L1 and 256 kB L2 per core; 12 MB L3 shared per socket.
Exploiting NUMA: Linux “first-touch” policy
a = /* … allocate buffer … */;
#pragma omp parallel for … schedule(static)
for (i = 0; i < n; ++i) {
a[i] = /* … initial value … */ ;
}

#pragma omp parallel for …


for (i = 0; i < n; ++i) {
a[i] += foo (i);
}
Exploiting NUMA: Linux “first-touch” policy
a = /* … allocate buffer … */;
#pragma omp parallel for … schedule(static)
for (i = 0; i < n; ++i) {
a[i] = /* … initial value … */ ;
}

#pragma omp parallel for … schedule(static)
for (i = 0; i < n; ++i) {
  a[i] += foo (i);
}

Using the same static schedule in both loops keeps each thread working on the pages it first touched (and therefore allocated in its own NUMA domain) during initialization.
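A fuller sketch of the first-touch idiom, assuming a plain malloc'd array and a placeholder per-element function foo() (the array size is arbitrary):

#include <omp.h>
#include <stdlib.h>
#include <stdio.h>

static double foo (long i) { return (double) i; } // Stand-in for the real per-element work

int main (void) {
  long n = 10L * 1000 * 1000;             // Arbitrary size for the sketch
  double *a = malloc (n * sizeof (*a));   // Pages are not placed yet: first touch decides
  if (!a) return 1;
  long i;

  // Initialize in parallel with a static schedule: each thread "first-touches"
  // its own block of a[], so those pages land in that thread's NUMA domain.
  #pragma omp parallel for schedule(static)
  for (i = 0; i < n; ++i)
    a[i] = 0.0;

  // Compute with the SAME static schedule, so each thread reuses its local pages.
  #pragma omp parallel for schedule(static)
  for (i = 0; i < n; ++i)
    a[i] += foo (i);

  printf ("a[n-1] = %g\n", a[n-1]);
  free (a);
  return 0;
}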
Thread binding
Key environment variables

  OMP_NUM_THREADS: number of OpenMP threads

  GOMP_CPU_AFFINITY: thread-to-core binding (GCC/libgomp)

Consider the 2-socket x 6-core system above, where the main thread initializes the data and p OpenMP threads then operate on it:

  env OMP_NUM_THREADS=6 GOMP_CPU_AFFINITY="0 2 4 6 … 22" ./run-program …
  (shorthand: GOMP_CPU_AFFINITY="0-22:2")

  env OMP_NUM_THREADS=6 GOMP_CPU_AFFINITY="1 3 5 7 … 23" ./run-program …
  (shorthand: GOMP_CPU_AFFINITY="1-23:2")
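One way to check where threads actually land is to have each thread report the CPU it is running on; a Linux/glibc-specific sketch using sched_getcpu():

#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>   // sched_getcpu (glibc / Linux)
#include <stdio.h>

int main (void) {
  #pragma omp parallel
  {
    // Report which hardware CPU this OpenMP thread is currently running on
    printf ("OpenMP thread %d is on CPU %d\n",
            omp_get_thread_num (), sched_getcpu ());
  }
  return 0;
}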
[Figure: "Triad": c[i] ← a[i] + s*b[i]. Effective bandwidth (GB/s, axis tick marks from 10 to 25) measured for five configurations: Sequential; OpenMP x 6, master initializes, read from socket 0; OpenMP x 6, master initializes, read from socket 1; OpenMP x 12, master initializes, read from both sockets; OpenMP x 12, first touch.]
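A rough sketch of how such a triad measurement might be structured; the array length, the scalar, the single-trial timing, and the bandwidth formula are simplifying assumptions rather than the actual benchmark:

#include <omp.h>
#include <stdlib.h>
#include <stdio.h>

int main (void) {
  long n = 20L * 1000 * 1000;   // Large enough to exceed the caches (assumed size)
  double s = 3.0;
  double *a = malloc (n * sizeof (double));
  double *b = malloc (n * sizeof (double));
  double *c = malloc (n * sizeof (double));
  if (!a || !b || !c) return 1;
  long i;

  // First-touch initialization with the same static schedule as the timed loop
  #pragma omp parallel for schedule(static)
  for (i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

  double t0 = omp_get_wtime ();
  #pragma omp parallel for schedule(static)
  for (i = 0; i < n; ++i)
    c[i] = a[i] + s * b[i];   // Triad kernel
  double t1 = omp_get_wtime ();

  // Two reads plus one write of 8-byte doubles per iteration
  double gbytes = 3.0 * n * sizeof (double) / 1e9;
  printf ("Triad: %.2f GB/s\n", gbytes / (t1 - t0));

  free (a); free (b); free (c);
  return 0;
}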
Backup
Expressing task parallelism (I)
#pragma omp parallel sections [optional clauses]
{
  #pragma omp section
  foo ();

  #pragma omp section
  bar ();
}
Expressing task parallelism (II):
Tasks (new in OpenMP 3.0)
Like Cilk's spawn construct

void foo () {
  // A & B independent
  A ();
  B ();
}

// Idiom for tasks
void foo () {
  #pragma omp parallel
  {
    #pragma omp single nowait
    {
      #pragma omp task
      A ();
      #pragma omp task
      B ();
    }
  }
}

http://wikis.sun.com/display/openmp/Using+the+Tasking+Feature
Expressing task parallelism (II):
Tasks (new in OpenMP 3.0)
Like Cilk's spawn construct

void foo () {
  // A & B independent
  A ();
  B ();
}

// Idiom for tasks
void foo () {
  #pragma omp parallel
  {
    #pragma omp single nowait
    {
      #pragma omp task
      A ();
      // Or, let the parent run B directly
      B ();
    }
  }
}

http://wikis.sun.com/display/openmp/Using+the+Tasking+Feature
Variation 1: Fixed chunk size
Kruskal and Weiss (1985) give a model for computing the optimal chunk size, assuming:

  Independent subtasks

  Known distributions of the running time of each subtask (e.g., IFR)

  Random overhead for extracting a task

Limitation: the running-time distribution must be known, though a chunk size of n / p works reasonably well (~0.8x optimal for large n / p).

Ref: Kruskal & Weiss, "Allocating independent subtasks on parallel processors"
Variation 2: Guided self-scheduling
Idea:

  Large chunks at first, to avoid scheduling overhead

  Small chunks near the end, to even out finish times

  Chunk size Ki = ceil(Ri / p), where Ri = number of remaining tasks

Ref: Polychronopoulos & Kuck (1987), "Guided self-scheduling: A practical scheduling scheme for parallel supercomputers"
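Continuing the 23-task, p = 3 example from the work-queue slides, the guided rule yields chunk sizes 8, 5, 4, 2, 2, 1, 1; a tiny sketch that prints the sequence:

#include <stdio.h>

int main (void) {
  int R = 23, p = 3;            // Remaining tasks and number of processors
  while (R > 0) {
    int K = (R + p - 1) / p;    // K_i = ceil(R_i / p)
    printf ("chunk of %d (%d remaining)\n", K, R - K);
    R -= K;
  }
  return 0;
}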
Variation 3: Tapering
Idea:

  Chunk size Ki = f(Ri; μ, σ), subject to a minimum chunk size

  h = selection overhead

  (μ, σ) estimated from history

  Roughly, Ki = f(Ri / p, σ / μ, h): high variance (large σ/μ) calls for small chunks; with low variance, larger chunks are OK

  A little better than guided self-scheduling

Ref: S. Lucco (1994), "Adaptive parallel programs." PhD thesis.
Variation 4: Weighted factoring
What if the hardware is heterogeneous?

  Idea: divide each task's cost by the computational power of the requesting node

Ref: Hummel, Schmidt, Uma, Wein (1996), "Load-sharing in heterogeneous systems using weighted factoring." In SPAA
When self-scheduling is useful
  Task costs are unknown

  Locality is not important

  Shared memory, or "small" numbers of processors

  Tasks have no dependencies (self-scheduling can be used with dependencies, but most analyses ignore them)
