
Intermediate MPI

M. D. Jones, Ph.D.
Center for Computational Research, University at Buffalo, State University of New York

High Performance Computing I, 2012


Why MPI?

The Message Passing Model

A parallel calculation in which each process (out of a specified number of processes) works on a local copy of the data, with local variables; no process is allowed to directly access the memory (available data) of another process. The mechanism by which individual processes share information (data) is explicit sending (and receiving) of data between the processes. The general assumption is a one-to-one mapping of processes to processors, although this is not necessarily always the case.


Upside of MPI

Advantages:
- Very general model (message passing).
- Applicable to the widest variety of hardware platforms (SMPs, NOWs, etc.).
- Allows great control over data location and flow in a program.
- Programs can usually achieve a higher performance level (scalability).


Downside of MPI

Disadvantages:
- The programmer has to work hard(er) to implement it.
- The best performance gains can involve re-engineering the code.
- The MPI standard does not specify a mechanism for launching parallel tasks (a task launcher); this is implementation dependent and can be a bit of a pain.


The MPI Standard(s)


MPI-1
- 1.0 released in 1994.
- 1.1 mostly corrections & clarifications, 1995.
- 1.2 clarifications (& the MPI_GET_VERSION function!), 1997.
- 1.3 clarifications/corrections, 2008.

MPI-2
- 2.0 1997, significant enhancements to MPI-1, including C++ bindings, replacing deprecated MPI-1 functions. Only in the last year or so has MPI-2 become more widely adopted.
- 2.1 2008, mostly clarifications/corrections.
- 2.2 2009, more clarifications/corrections.

MPI-3
- 3.0 2012 (expected, draft available), a major update.


MPI-1

Major MPI-1 features:


1. Point-to-point Communications
2. Collective Operations
3. Process Groups
4. Communication Domains
5. Process Topologies
6. Environmental Management & Inquiry
7. Profiling Interface
8. FORTRAN and C Bindings


MPI-2

MPI-2 Enhancements (not fully implemented by all vendors!):


1. Dynamic Process Management (pretty available)
2. Input/Output (supporting hardware is hardest to find)
3. One-sided Operations (hardest to find, but generally available)
4. C++ Bindings (generally available, but deprecated!)


MPI References

Using MPI: Portable Parallel Programming with the Message-Passing Interface, second edition, W. Gropp, E. Lusk, and A. Skjellum (MIT Press, Cambridge, 1999).

MPI: The Complete Reference, Vol. 1, The MPI Core, M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra (MIT Press, Cambridge, 1998).

MPI: The Complete Reference, Vol. 2, The MPI Extensions, W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, M. Snir, and J. Dongarra (MIT Press, Cambridge, 1998).


More MPI References

The MPI Forum, http://www.mpi-forum.org.


http://www.netlib.org/utk/papers/mpi-book/mpi-book.html, the first edition of MPI: The Complete Reference, also available as a PostScript file.

A useful online reference to all of the routines and their bindings: http://www-unix.mcs.anl.gov/mpi/www/www3. Note that this is for MPICH 1.2, but it's quite handy.


Introduction

MPI: Large and Small

MPI is Large
MPI 1.2 has 128 functions. MPI 2.0 has 152 functions.

MPI is Small
Many programs need to use only about 6 MPI functions.

MPI is the right size.


It offers enough flexibility that users don't need to master all 150+ functions to use it properly.


Some Available MPI Implementations


Some of the more common MPI implementations, and supported network hardware:
- MPICH, from ANL - has many available devices, but the most common is ch_p4 (using TCP/IP)
- MPICH-GM/MX, Myricom's port of MPICH to use their low-level network APIs
- LAM, many device ports, including TCP/IP and GM (now in maintenance mode)
- OpenMPI, the latest from the LAM and other (FT-MPI, LA-MPI, PACX-MPI) developers; includes TCP/IP, GM/MX, and IB (InfiniBand) support
... and those are just some of the more common free ones.


Appropriate Times to Use MPI

- When you need a portable parallel API
- When you are writing a parallel library
- When you have data processing that is not conducive to a data parallel approach
- When you care about parallel performance


Appropriate Times NOT to Use MPI

- When you can just use a parallel library (which may itself be written in MPI).
- When you need only simple threading on data-parallel tasks.
- When you don't need large (many-processor) parallel speedup.


MPI Fundamentals

Basic Features of Message Passing

Message passing codes run the same (usually serial) code on multiple processors, which communicate with one another via library calls that fall into a few general categories:
- Calls to initialize, manage, and terminate communications
- Calls to communicate between two individual processors (point-to-point)
- Calls to communicate among a group of processors (collective)
- Calls to create custom datatypes
I will briefly cover the first three, and present a few concrete examples.


Outline of a Program Using MPI

General outline of any program using MPI:


1. Include MPI header files
2. Declare variables & data structures
3. Initialize MPI
4. Main program - message passing enabled
5. Terminate MPI
6. End program


MPI Header Files


All MPI programs need to include the MPI header file to define the necessary datatypes.

In C/C++:

    #include "mpi.h"
    #include <stdio.h>
    #include <math.h>

In FORTRAN 77:

    program main
    implicit none
    include 'mpif.h'

In Fortran 90/95:

    program main
    use mpi
    implicit none


MPI Naming Conventions


MPI functions are designed to be as language independent as possible. Routine names all begin with MPI_.

FORTRAN names are typically upper case:

    call MPI_XXXXXXX(param1, param2, ..., IERR)

C functions use mixed case:

    ierr = MPI_Xxxxxxx(param1, param2, ...);

MPI constants are all upper case in both C and FORTRAN:

    MPI_COMM_WORLD, MPI_REAL, MPI_DOUBLE, ...


MPI Routines & Their Return Values

Generally the MPI routines return an error code (the function return value in C), which can be tested against a predefined success value:

    int ierr;
    ...
    ierr = MPI_Init(&argc, &argv);
    if (ierr != MPI_SUCCESS) {
      /* ... exit with an error ... */
    }
    ...


and in FORTRAN the error code is passed back as the last argument in the MPI subroutine call:
    integer :: ierr

    call MPI_INIT(ierr)
    if (ierr .ne. MPI_SUCCESS) stop 'MPI_INIT failed.'


MPI Handles

MPI defines its own data structures, which can be referenced by the user through handles. Handles can be returned by MPI routines and used as arguments to other MPI routines. Some examples:
- MPI_SUCCESS - used to test MPI error codes. An integer in both C and FORTRAN.
- MPI_COMM_WORLD - a (pre-defined) communicator consisting of all of the processes. An integer in FORTRAN, and an MPI_Comm object in C.


MPI Datatypes

MPI defines its own datatypes that correspond to the typical datatypes in C and FORTRAN. This allows for automatic translation between different representations in a heterogeneous parallel environment. You can build your own datatypes from the basic MPI building blocks. The actual representation is implementation dependent. Convention: program variables are usually declared as normal C or FORTRAN types, and then calls to MPI routines use the MPI type names as needed.


MPI Datatypes in C
In C, the basic datatypes (and their ISO C equivalents) are:
    MPI Datatype         C Type
    MPI_FLOAT            float
    MPI_DOUBLE           double
    MPI_LONG_DOUBLE      long double
    MPI_INT              signed int
    MPI_LONG             signed long int
    MPI_SHORT            signed short int
    MPI_UNSIGNED_SHORT   unsigned short int
    MPI_UNSIGNED_LONG    unsigned long int
    MPI_UNSIGNED         unsigned int
    MPI_CHAR             signed char
    MPI_UNSIGNED_CHAR    unsigned char
    MPI_BYTE             (none)
    MPI_PACKED           (none)


MPI Datatypes in FORTRAN

In FORTRAN, the basic datatypes (and their FORTRAN equivalents) are:


    MPI Datatype           FORTRAN Type
    MPI_INTEGER            integer
    MPI_REAL               real
    MPI_DOUBLE_PRECISION   double precision
    MPI_COMPLEX            complex
    MPI_DOUBLE_COMPLEX     double complex
    MPI_LOGICAL            logical
    MPI_CHARACTER          character*1
    MPI_BYTE               (none)
    MPI_PACKED             (none)


Initializing & Terminating MPI


The first MPI routine called by any MPI program must be MPI_INIT, called once and only once per program.

C:

    int ierr;
    ierr = MPI_Init(&argc, &argv);
    ...

FORTRAN:

    integer ierr
    call MPI_INIT(ierr)
    ...


MPI Communicators

Definition (MPI Communicator): a communicator is a group of processors that can communicate with each other.
- There can be many communicators.
- A given processor can be a member of multiple communicators.
- Within a communicator, the rank of a processor is the number (starting at 0) uniquely identifying it within that communicator.


- A processor's rank is used to specify the source and destination in message passing calls.
- A processor's rank can be different in different communicators.
- MPI_COMM_WORLD is a pre-defined communicator encompassing all of the processes. Additional communicators can be defined for subsets of this group.


More on MPI Communicators


Typically a program executes two MPI calls immediately after MPI_INIT to determine each processor's rank:

C:

    int MPI_Comm_rank(MPI_Comm comm, int *rank);
    int MPI_Comm_size(MPI_Comm comm, int *size);

FORTRAN:

    MPI_COMM_RANK(comm, rank, ierr)
    MPI_COMM_SIZE(comm, size, ierr)

where rank and size are integers returned with (obviously) the rank and extent (0 : number of processors - 1).


Simple MPI Program in C

We have already covered enough material to write the simplest of MPI programs: here is one in C:
    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int ierr, myid, numprocs;

      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);
      printf("Hello World, I am Process %d of %d\n", myid, numprocs);
      MPI_Finalize();
      return 0;
    }


Six Function MPI

Many MPI codes can get away with using only the six most frequently used routines:
- MPI_INIT - initialization
- MPI_COMM_SIZE - size of communicator
- MPI_COMM_RANK - rank in communicator
- MPI_SEND - send a message
- MPI_RECV - receive a message
- MPI_FINALIZE - shut down communications


Point to Point Communications

Basic P2P in MPI

Basic features:
- In MPI 1.2, only two-sided communications are allowed, requiring an explicit send and receive (2.0 allows one-sided communications, i.e. get and put).
- Point-to-point (or P2P) communication is explicitly two-sided, and the message will not be sent without the active participation of both processes.
- A message generically consists of an envelope (tags indicating source and destination) and a body (the data being transferred).
- Fundamental - almost all of the MPI communications are built around point-to-point operations.


Message Bodies

MPI Message Bodies


MPI uses three pieces of information to describe a message body:
1. buffer: the starting location in memory where the data is to be found. In C this is the actual address of an array element; in FORTRAN it is the name of the array element.
2. datatype: the type of data to be sent. Commonly one of the predefined types, e.g. MPI_REAL. It can also be a user-defined datatype, allowing great flexibility in defining message content for more advanced applications.
3. count: the number of items being sent.

MPI standardizes the elementary datatypes, so the developer does not have to worry about numerical representation.


Message Envelopes

MPI Message Envelopes

MPI message envelopes have the following general attributes:
- communicator - the group of processes to which the sending and receiving processes belong
- source - the originating process
- destination - the receiving process
- tag - a message identifier, which allows the program to label classes of messages (e.g. one for name data, another for place data, status, etc.)


Blocking vs. Non-Blocking

- A blocking routine does not return until the operation is complete. Blocking sends, for example, ensure that it is safe to overwrite the sent data; with blocking receives, the data is here and ready for use.
- A nonblocking routine returns immediately, with no information about completion. You can test later for the success/failure of the operation; in the interim, the process is free to go on to other tasks.


Send Modes

Point-to-point Semantics

For MPI sends, there are four available modes:
- standard - no guarantee that the receive has started.
- synchronous - complete when receipt has been acknowledged.
- buffered - complete when the data has been copied to a local buffer; no implication about receipt.
- ready - the user asserts that the matching receive has been posted (allows the user to gain performance).

MPI receives are easier - they are complete when the data has arrived and is ready for use.


Sends & Receives

Blocking Send

MPI_SEND
MPI_SEND(buff, count, datatype, dest, tag, comm)
- buff (IN), initial address of message buffer
- count (IN), number of entries to send (int)
- datatype (IN), datatype of each entry (handle)
- dest (IN), rank of destination (int)
- tag (IN), message tag (int)
- comm (IN), communicator (handle)


Blocking Receive

MPI_RECV
MPI_RECV(buff, count, datatype, source, tag, comm, status)
- buff (OUT), initial address of receive buffer
- count (IN), maximum number of entries to receive (int)
- datatype (IN), datatype of each entry (handle)
- source (IN), rank of source (int)
- tag (IN), message tag (int)
- comm (IN), communicator (handle)
- status (OUT), return status (Status)


Blocking Send/Receive Restrictions

- source, tag, and comm must match those of a pending message for the message to be received.
- Wildcards can be used for source and tag, but not for the communicator.
- An error will be returned if the message exceeds the size allowed for by the receive buffer.
- It is the user's responsibility to ensure that the send/receive datatypes agree - if they do not, the results are undefined.


Status of a Receive

More information about message reception is available by examining the status returned by the call to MPI_RECV.

C: status is a structure of type MPI_Status that contains at minimum the three fields:
- MPI_SOURCE
- MPI_TAG
- MPI_ERROR

FORTRAN: status is an integer array of length MPI_STATUS_SIZE. MPI_SOURCE, MPI_TAG, and MPI_ERROR are the indices of the entries that store the source, tag, and error fields.


MPI_GET_COUNT

The routine MPI_GET_COUNT is an auxiliary routine that allows you to test the amount of data received:

MPI_GET_COUNT
MPI_GET_COUNT(status, datatype, count)
- status (IN), return status of receive (Status)
- datatype (IN), datatype of each receive buffer entry (handle)
- count (OUT), number of entries received (int)

MPI_UNDEFINED will be returned in count in the event of an error.


A Simple Send/Receive Example

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int i, ierr, rank, size, dest, source, from, to, count, tag;
      int stat_count, stat_source, stat_tag;
      float data[100];
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      printf("I am process %d of %d\n", rank, size);
      dest = size - 1;
      source = 0;
      if (rank == source) {   /* Initialize and Send Data */
        to = dest;
        count = 100;
        tag = 11;
        for (i = 0; i <= 99; i++) data[i] = i;
        ierr = MPI_Send(data, count, MPI_FLOAT, to, tag, MPI_COMM_WORLD);
      }


      else if (rank == dest) {   /* Receive & Check Data */
        tag = MPI_ANY_TAG;         /* wildcard */
        count = 100;
        from = MPI_ANY_SOURCE;     /* another wildcard */
        ierr = MPI_Recv(data, count, MPI_FLOAT, from, tag, MPI_COMM_WORLD, &status);
        ierr = MPI_Get_count(&status, MPI_FLOAT, &stat_count);
        stat_source = status.MPI_SOURCE;
        stat_tag = status.MPI_TAG;
        printf("Status of receive: dest=%d, source=%d, tag=%d, count=%d\n",
               rank, stat_source, stat_tag, stat_count);
      }
      ierr = MPI_Finalize();
      return 0;
    }


Semantics of Blocking Point-to-point

For MPI_RECV completion is easy - the data is here, and can now be used. It is a bit trickier for MPI_SEND - it completes when the data has been stored away such that the program is free to overwrite the send buffer. This can be non-local: the data could be copied directly into the receive buffer, or it could be stored in a local buffer, in which case the send could return before the receive is initiated (thereby allowing even a single-threaded sending process to continue).


Perils of Buffering

Message Buffering

- Buffering decouples the send and receive operations.
- It entails added memory-to-memory copying (additional overhead).
- The amount of buffering is application and implementation dependent:
  - applications can choose communication modes, and gain finer control (with additional hazards) over messaging behavior;
  - the standard mode is implementation dependent.


More on Message Buffering

A properly coded program will not fail if the buffer throttles back the sends, thereby causing blocking (imagine an assembly line controlled by the rate at which the final inspector signs off on each item). An improperly coded program can deadlock ...


Deadlock

"Safe" MPI programs do not rely on system buffering for success. Any system will eventually run out of buffer space as message buffer sizes are increased. Users are free to take advantage of knowledge of an implementation's buffering policy to increase performance, but they do so by reducing the margin of safety (as well as decreasing portability, of course).


Deadlock Examples

Safe code (no buffering requirements):


    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF (rank .eq. 0) THEN
       CALL MPI_SEND(sbuff, count, MPI_REAL, 1, tag, comm, ierr)
       CALL MPI_RECV(rbuff, count, MPI_REAL, 1, tag, comm, status, ierr)
    ELSE IF (rank .eq. 1) THEN
       CALL MPI_RECV(rbuff, count, MPI_REAL, 0, tag, comm, status, ierr)
       CALL MPI_SEND(sbuff, count, MPI_REAL, 0, tag, comm, ierr)
    END IF


Complete & total deadlock (oops!):


    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF (rank .eq. 0) THEN
       CALL MPI_RECV(rbuff, count, MPI_REAL, 1, tag, comm, status, ierr)
       CALL MPI_SEND(sbuff, count, MPI_REAL, 1, tag, comm, ierr)
    ELSE IF (rank .eq. 1) THEN
       CALL MPI_RECV(rbuff, count, MPI_REAL, 0, tag, comm, status, ierr)
       CALL MPI_SEND(sbuff, count, MPI_REAL, 0, tag, comm, ierr)
    END IF


Buffering dependent:
    CALL MPI_COMM_RANK(comm, rank, ierr)
    IF (rank .eq. 0) THEN
       CALL MPI_SEND(sbuff, count, MPI_REAL, 1, tag, comm, ierr)
       CALL MPI_RECV(rbuff, count, MPI_REAL, 1, tag, comm, status, ierr)
    ELSE IF (rank .eq. 1) THEN
       CALL MPI_SEND(sbuff, count, MPI_REAL, 0, tag, comm, ierr)
       CALL MPI_RECV(rbuff, count, MPI_REAL, 0, tag, comm, status, ierr)
    END IF

For this last buffer-dependent example, one of the sends must buffer and return - if the buffer cannot hold count reals, deadlock occurs. Non-blocking communications can be used to avoid buffering, and possibly increase performance.


Non-blocking Sends & Receives

Advantages:
1. It is easier to write code that doesn't deadlock.
2. Latency can be masked in high-latency environments by posting receives early (this requires careful attention to detail).

Disadvantages:
1. It makes code quite a bit more complex.
2. The code is harder to debug and maintain.


Non-blocking Send Syntax

MPI_ISEND
MPI_ISEND(buff, count, datatype, dest, tag, comm, request)
- buff (IN), initial address of message buffer
- count (IN), number of entries to send (int)
- datatype (IN), datatype of each entry (handle)
- dest (IN), rank of destination (int)
- tag (IN), message tag (int)
- comm (IN), communicator (handle)
- request (OUT), request handle (handle)


Non-blocking Receive Syntax

MPI_IRECV
MPI_IRECV(buff, count, datatype, source, tag, comm, request)
- buff (OUT), initial address of receive buffer
- count (IN), maximum number of entries to receive (int)
- datatype (IN), datatype of each entry (handle)
- source (IN), rank of source (int)
- tag (IN), message tag (int)
- comm (IN), communicator (handle)
- request (OUT), request handle (handle)


Non-blocking Send/Receive Details

The request handle is used to query the status of the communication or to wait for its completion. The user must not overwrite the send buffer until the send is complete, nor use elements of the receiving buffer before the receive is complete (intuitively obvious, but worth stating explicitly).


Non-blocking Send/Receive Completion Operations

MPI_WAIT
MPI_WAIT(request, status)
- request (INOUT), request handle (handle)
- status (OUT), status object (Status)

MPI_TEST
MPI_TEST(request, flag, status)
- request (INOUT), request handle (handle)
- flag (OUT), true if operation complete (logical)
- status (OUT), status object (Status)


Completion Operations Details

- The request handle should identify a previously posted send or receive.
- MPI_WAIT returns when the operation is complete, and the status is returned for a receive (for a send, it may contain a separate error code for the send operation).
- MPI_TEST returns immediately, with flag = true if the posted operation corresponding to the request handle is complete (with status output similar to MPI_WAIT). A small polling sketch follows.
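To make the MPI_TEST semantics concrete, here is a minimal sketch (not from the original slides) in which rank 0 posts a non-blocking receive and polls MPI_Test, counting iterations of stand-in work until the message from rank 1 arrives; it assumes at least two processes.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int i, rank, flag = 0, work = 0;
      float a[100], b[100];
      MPI_Request request;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 1) {
        for (i = 0; i < 100; i++) a[i] = i;
        MPI_Send(a, 100, MPI_FLOAT, 0, 19, MPI_COMM_WORLD);
      } else if (rank == 0) {
        MPI_Irecv(b, 100, MPI_FLOAT, 1, 19, MPI_COMM_WORLD, &request);
        while (!flag) {
          MPI_Test(&request, &flag, &status);  /* returns immediately */
          if (!flag) work++;                   /* stand-in for useful work */
        }
        printf("receive completed after %d polling iterations\n", work);
      }
      MPI_Finalize();
      return 0;
    }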


A Non-blocking Send/Recv Example


    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int rank, nprocs, ierr, stat_count;
      MPI_Request request;
      MPI_Status status;
      float a[100], b[100];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      if (rank == 0) {
        MPI_Irecv(b, 100, MPI_FLOAT, 1, 19, MPI_COMM_WORLD, &request);
        MPI_Send(a, 100, MPI_FLOAT, 1, 17, MPI_COMM_WORLD);
        MPI_Wait(&request, &status);
      } else if (rank == 1) {
        MPI_Irecv(b, 100, MPI_FLOAT, 0, 17, MPI_COMM_WORLD, &request);
        MPI_Send(a, 100, MPI_FLOAT, 0, 19, MPI_COMM_WORLD);
        MPI_Wait(&request, &status);
      }
      MPI_Get_count(&status, MPI_FLOAT, &stat_count);
      printf("Exchange complete: process %d of %d\n", rank, nprocs);
      printf("source %d, tag %d, count %d\n", status.MPI_SOURCE, status.MPI_TAG,
             stat_count);
      MPI_Finalize();
    }


More About Send Modes

There is one receive mode, but four send modes:
1. standard - used thus far; implementation-dependent choice of asynchronous buffered transfer or synchronous direct transfer (the rationale being that MPI can make a better low-level choice).
2. synchronous - synchronizes the sending and receiving processes; when a synchronous send completes, the user can assume that the receive has begun.
3. ready - the matching receive must already have been posted, else the result is undefined. Can save time and overhead, but requires very precise knowledge of the algorithm and its execution.
4. buffered - forces buffering; the user is also responsible for maintaining the buffer, and the result is undefined if the buffer is insufficient (see MPI_BUFFER_ATTACH and MPI_BUFFER_DETACH).


Send Routines for Different Modes

    Mode          Blocking     Non-blocking
    Standard      MPI_SEND     MPI_ISEND
    Synchronous   MPI_SSEND    MPI_ISSEND
    Ready         MPI_RSEND    MPI_IRSEND
    Buffered      MPI_BSEND    MPI_IBSEND

Call syntax is the same as for MPI_SEND and MPI_ISEND. A buffered-mode sketch follows.
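Specifically, here is a minimal sketch (not from the original slides) of buffered mode using MPI_BUFFER_ATTACH with MPI_BSEND; it assumes at least two processes, and the user-supplied buffer is sized to include MPI_BSEND_OVERHEAD.

    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int i, rank, bufsize;
      float a[100], b[100];
      void *buf, *oldbuf;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      /* the user-supplied buffer must hold the message plus MPI_BSEND_OVERHEAD */
      bufsize = 100 * sizeof(float) + MPI_BSEND_OVERHEAD;
      buf = malloc(bufsize);
      MPI_Buffer_attach(buf, bufsize);
      if (rank == 0) {
        for (i = 0; i < 100; i++) a[i] = i;
        MPI_Bsend(a, 100, MPI_FLOAT, 1, 42, MPI_COMM_WORLD);  /* completes locally */
      } else if (rank == 1) {
        MPI_Recv(b, 100, MPI_FLOAT, 0, 42, MPI_COMM_WORLD, &status);
      }
      MPI_Buffer_detach(&oldbuf, &bufsize);  /* blocks until buffered data is delivered */
      free(oldbuf);
      MPI_Finalize();
      return 0;
    }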


Collective Communications

MPI Collective Communications


Collective routines allow groups of processes to communicate (e.g. one-to-many or many-to-one). Although they can usually be built from point-to-point calls, the intrinsic collective routines allow for:
- simplified code - one routine replacing many point-to-point calls
- optimized forms - the implementation can take advantage of faster algorithms

Categories: barrier synchronization, broadcast, gather, scatter, and reduction.


Barrier

Barrier Synchronization

A very simple MPI routine provides the ability to block the calling process until all processes have called it:

MPI_BARRIER
MPI_BARRIER(comm)
- comm (IN), communicator (handle)

It returns only when all group members have entered the call.
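A common use (a sketch, not from the original slides) is bracketing a timed region so that all processes enter and leave it together; MPI_WTIME, described later, supplies the timer.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int rank;
      double t0, t1;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Barrier(MPI_COMM_WORLD);   /* everyone starts the timed region together */
      t0 = MPI_Wtime();
      /* ... work to be timed goes here ... */
      MPI_Barrier(MPI_COMM_WORLD);   /* everyone has finished before the clock stops */
      t1 = MPI_Wtime();
      if (rank == 0) printf("timed region took %f seconds\n", t1 - t0);
      MPI_Finalize();
      return 0;
    }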


Broadcast

Figure: Broadcast in action - 5 data elements on 5 processes.


Broadcast

MPI_BCAST
MPI_BCAST(buffer, count, datatype, root, comm)
- buffer (INOUT), starting address of buffer (choice)
- count (IN), number of entries in buffer (int)
- datatype (IN), data type of buffer (handle)
- root (IN), rank of broadcasting process (int)
- comm (IN), communicator (handle)


Broadcast Details

- Broadcasts a message from the root process to all members of the group (including itself).
- root must have an identical value on all processes.
- comm must be the same intra-group communication domain on all processes.
A small sketch follows.
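A minimal sketch (not from the original slides): the root holds a value that every process needs, and MPI_BCAST distributes it.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int rank, root = 0, nvals = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == root) nvals = 100;   /* only the root knows the value initially */
      MPI_Bcast(&nvals, 1, MPI_INT, root, MPI_COMM_WORLD);
      printf("process %d now has nvals = %d\n", rank, nvals);
      MPI_Finalize();
      return 0;
    }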


Gather

Figure: Scatter/Gather in action - 5 data elements on 5 processes.


Gather

MPI_GATHER
MPI_GATHER(sendbuffer, sendcount, sendtype, recvbuffer, recvcount, recvtype, root, comm)
- sendbuffer (IN), starting address of send buffer (choice)
- sendcount (IN), number of entries in send buffer (int)
- sendtype (IN), data type of send buffer elements (handle)
- recvbuffer (OUT), starting address of receive buffer (choice)
- recvcount (IN), number of entries in any single receive (int)
- recvtype (IN), data type of receive buffer elements (handle)
- root (IN), rank of receiving process (int)
- comm (IN), communicator (handle)


Gather Details

- Each process sends the contents of its send buffer to the root.
- The root stores the received data in rank order (as if there were N posted receives of the sends from each process).


Scatter

MPI_SCATTER
MPI_SCATTER(sendbuffer, sendcount, sendtype, recvbuffer, recvcount, recvtype, root, comm)
- sendbuffer (IN), starting address of send buffer (choice)
- sendcount (IN), number of entries sent to each process (int)
- sendtype (IN), data type of send buffer elements (handle)
- recvbuffer (OUT), starting address of receive buffer (choice)
- recvcount (IN), number of entries in any single receive (int)
- recvtype (IN), data type of receive buffer elements (handle)
- root (IN), rank of sending process (int)
- comm (IN), communicator (handle)


Scatter Details

- Basically the reverse operation to MPI_GATHER.
- A one-to-all operation in which each recipient gets a different chunk of the send buffer (see the sketch below).
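A minimal sketch (not from the original slides), mirroring the gather example that follows: the root distributes 100 integers to every process, including itself.

    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int i, myrank, nprocs, root = 0;
      int *sbuff = NULL, ichunk[100];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      if (myrank == root) {   /* only the root needs the full send buffer */
        sbuff = (int *) malloc(nprocs * 100 * sizeof(int));
        for (i = 0; i < nprocs * 100; i++) sbuff[i] = i;
      }
      MPI_Scatter(sbuff, 100, MPI_INT, ichunk, 100, MPI_INT, root, MPI_COMM_WORLD);
      printf("process %d received elements starting at %d\n", myrank, ichunk[0]);
      if (myrank == root) free(sbuff);
      MPI_Finalize();
      return 0;
    }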


Gather Example

    MPI_Comm comm;
    int *rbuff, myrank, nprocs, root, iarray[100];
    ...
    MPI_Comm_rank(comm, &myrank);
    if (myrank == root) {
      MPI_Comm_size(comm, &nprocs);
      rbuff = (int *) malloc(nprocs * 100 * sizeof(int));
    }
    MPI_Gather(iarray, 100, MPI_INT, rbuff, 100, MPI_INT, root, comm);
    ...


Reduction

MPI_REDUCE
MPI_REDUCE(sendbuffer, recvbuffer, count, datatype, op, root, comm)
- sendbuffer (IN), starting address of send buffer (choice)
- recvbuffer (OUT), starting address of receive buffer (choice)
- count (IN), number of entries in buffer (int)
- datatype (IN), data type of buffer (handle)
- op (IN), reduce operation (handle)
- root (IN), rank of root process (int)
- comm (IN), communicator (handle)


Reduce Details

Combines the elements provided in the sendbuffer of each process using the operation op, and returns the combined value in the recvbuffer of the root process. A short sketch follows.
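A minimal sketch (not from the original slides): every process contributes its rank, and the root receives the sum.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int rank, nprocs, root = 0;
      int local, global;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      local = rank;   /* each process contributes its own rank */
      MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
      if (rank == root)
        printf("sum of ranks 0..%d = %d\n", nprocs - 1, global);
      MPI_Finalize();
      return 0;
    }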


Predefined Reduction Operations

    MPI_MAX      maximum
    MPI_MIN      minimum
    MPI_SUM      sum
    MPI_PROD     product
    MPI_LAND     logical and
    MPI_BAND     bit-wise and
    MPI_LOR      logical or
    MPI_BOR      bit-wise or
    MPI_LXOR     logical xor
    MPI_BXOR     bit-wise xor
    MPI_MINLOC   min value and location
    MPI_MAXLOC   max value and location


More (Advanced) Collective Ops


- MPI_ALLGATHER - gather + broadcast.
- MPI_ALLTOALL - each process sends a different subset of its data to each receiver.
- MPI_ALLREDUCE - combine the elements of each input buffer and store the result in the receive buffer of all group members (a sketch follows).
- User-defined reduction ops - you can define your own reduction operations.
- Gather/scatter vector ops - allow a varying count of data from or to each process in a gather or scatter operation (MPI_GATHERV/MPI_SCATTERV).
- MPI_SCAN - prefix reduction on data across the communicator; each process receives the reduction of the values of the processes up to and including itself.
- MPI_REDUCE_SCATTER - combination of MPI_REDUCE and MPI_SCATTERV.
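As an illustration, here is a minimal MPI_ALLREDUCE sketch (not from the original slides); it is simply MPI_REDUCE with the result delivered to every process.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int rank;
      double local, global;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      local = (double) rank;
      /* like MPI_Reduce, but every process receives the combined result */
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
      printf("process %d sees the global maximum %g\n", rank, global);
      MPI_Finalize();
      return 0;
    }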

Environmental Tools & Utility Routines

Process Startup


- The single most confusing aspect of MPI for most new users.
- Implementation dependent! With many implementation-specific options, flags, etc.
- Consult the documentation for the MPI implementation that you are using.


Some Examples Using MPI Task Launchers


SGI Origin/Altix (intra-machine):

    mpirun -np <np> [options] <progname> [progname options]

MPICH-1 ch_p4 device:

    mpirun -machinefile <filename> -np <np> [options] <progname> [args]

Sun HPC Tools:

    mprun -l nodename[nproc][,nodename[nproc],...] [options] <executable> [args]

IBM AIX POE:

    poe ./a.out -nodes [nnodes] -tasks_per_node [ntasks] [options]

OSC's PBS/Torque based mpiexec:

    mpiexec [-pernode] [-kill] [options] <executable> [args]


Intel MPI (also MPICH2/MVAPICH2)


    NNODES=`cat $PBS_NODEFILE | uniq | wc -l`
    NPROCS=`cat $PBS_NODEFILE | wc -l`
    mpdboot -n $NNODES -f $PBS_NODEFILE -v
    mpdtrace
    mpiexec -np $NPROCS -envall ./my_executable
    mpdallexit


Inquiry Routines

Getting Implementation Info from MPI

MPI_GET_VERSION
MPI_GET_VERSION(version, subversion)
- version (OUT), version number (int)
- subversion (OUT), subversion number (int)

Not exactly critical for programming, but a nice function for determining what version of MPI you are using (especially when the documentation for your machine is poor). A small example follows.
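A small usage sketch (not from the original slides):

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int version, subversion;

      MPI_Init(&argc, &argv);
      MPI_Get_version(&version, &subversion);
      printf("This library implements the MPI %d.%d standard\n", version, subversion);
      MPI_Finalize();
      return 0;
    }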


Where am I running?

MPI_GET_PROCESSOR_NAME
MPI_GET_PROCESSOR_NAME(name, resultlen)
- name (OUT), a unique specifier for the actual node (string)
- resultlen (OUT), length (in printable chars) of the result in name (int)

Returns the name of the processor on which it was called at the moment of the call. name should have storage that is at least MPI_MAX_PROCESSOR_NAME characters long.
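A small usage sketch (not from the original slides), with the storage sized by MPI_MAX_PROCESSOR_NAME as noted above:

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int rank, resultlen;
      char name[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_processor_name(name, &resultlen);
      printf("process %d is running on %s\n", rank, name);
      MPI_Finalize();
      return 0;
    }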


Timing & Synchronization

MPI_WTIME
MPI_WTIME()

Returns a double precision value representing elapsed wall-clock time from some point in the past (the origin is guaranteed not to change during the process's execution). A portable timing function (try finding another!) - it can be high resolution, provided there is some hardware support.


Testing the resolution of MPI_WTIME:

MPI_WTICK
MPI_WTICK()

Returns a double precision value which is the resolution of MPI_WTIME in seconds. Hardware dependent, of course - if a high resolution timer is available, it should be accessible through MPI_WTIME.


Common MPI_Wtime usage:


    double time0, time1;
    ...
    time0 = MPI_Wtime();
    ...   /* code to be timed */
    time1 = MPI_Wtime();
    printf("Time interval = %f seconds\n", time1 - time0);


MPI Error Codes

More About MPI Error Codes


MPI_ERROR_STRING
MPI_ERROR_STRING(errorcode, string, resultlen)
- errorcode (IN), error code returned by an MPI routine (int)
- string (OUT), text that corresponds to errorcode (string)
- resultlen (OUT), length (in printable chars) of the result returned in string (int)

Most error codes in MPI are implementation dependent; MPI_ERROR_STRING provides information on the type of MPI exception that occurred. The argument string must have storage that is at least MPI_MAX_ERROR_STRING characters.
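A usage sketch (not from the original slides): since most implementations abort on error by default, the sketch first installs the MPI_ERRORS_RETURN error handler (MPI_Comm_set_errhandler in MPI-2; MPI_Errhandler_set in MPI-1) and then deliberately provokes an error with an invalid destination rank.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
      int ierr, resultlen, dummy = 0;
      char msg[MPI_MAX_ERROR_STRING];

      MPI_Init(&argc, &argv);
      /* return error codes instead of aborting */
      MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
      /* deliberately invalid destination rank */
      ierr = MPI_Send(&dummy, 1, MPI_INT, -99, 0, MPI_COMM_WORLD);
      if (ierr != MPI_SUCCESS) {
        MPI_Error_string(ierr, msg, &resultlen);
        printf("MPI error: %s\n", msg);
      }
      MPI_Finalize();
      return 0;
    }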


Profiling

MPI Profiling Hooks

The MPI profiling interface is designed for authors of profiling tools, so that they do not need access to a particular implementation's source code (which a vendor may not wish to release). Many profiling tools exist:
1. Vampir (Intel, formerly Pallas), now called Intel Trace Analyzer and Visualizer
2. HPMCount (IBM AIX)
3. jumpshot (MPICH)
4. SpeedShop, cvperf (SGI)

Consult your profiling tools of choice for detailed usage.
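Under the hood, the profiling interface works by name shifting: every MPI routine can also be called as PMPI_<name>, so a tool supplies its own version of an MPI routine that wraps the real one. A minimal sketch (not from the original slides) of such a wrapper, counting calls to MPI_Send, might look like this (the MPI-2 era binding with a non-const buffer argument is assumed):

    #include <stdio.h>
    #include "mpi.h"

    /* A profiling library defines its own MPI_Send; linked ahead of the MPI
       library, it intercepts every call and forwards to the real PMPI_Send. */
    static int nsends = 0;

    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
      nsends++;   /* collect whatever statistics the tool needs */
      return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

    int MPI_Finalize(void)
    {
      int rank;
      PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("rank %d called MPI_Send %d times\n", rank, nsends);
      return PMPI_Finalize();
    }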


More Advanced MPI Topics

Advanced MPI topics not covered thus far:
- User-defined data types
- Communicators and groups
- Process topologies
- MPI-2 features:
  - MPI-I/O
  - Dynamic process management (MPI_Spawn)
  - One-sided communications (get/put)

