MPI Python Workshop Day 1, Fall 2024
MPI Overview
November 17, 2024
Presented by:
Nicholas A. Danes, PhD
Computational Scientist
Research Computing Group, Mines IT
Preliminaries
• HPC Experience (one of these):
  • Know the basics of:
    • Linux Shell
    • Python 3
    • Scientific Computing
  • Active HPC User
    • Mines specific: Wendian, Mio
    • Off-premise: Cloud, NSF ACCESS, CU Boulder Alpine, etc.
  • Previously taken our “Intro to HPC” workshop
    • Offered once per semester
Review of Parallel vs Serial Computing
• When a program uses a single process (“task”) with 1 core (“CPU”), we say it is a serial program.
• When a program uses multiple cores, we say it is a parallel program.
• Typically, we try to optimize a serial program before rewriting it in parallel.
• For this workshop, we’ll assume the serial version of the code is already in good shape.
Parallel Programming Models
• Shared vs Distributed Memory Programming
  • Shared memory (e.g. OpenMP)
    • All CPU cores have access to the same pool of memory (RAM)
    • Typically, all CPU cores are on the same CPU node
    • Ideal for multi-threaded loops
  • Distributed memory (e.g. MPI)
    • Each CPU core is given access to a specific pool of memory, which may or may not be shared
    • A “communicator” designates how each CPU core can talk to another CPU core
    • CPU cores do not have to live on the same CPU node
[Figure: Shared memory parallelism (1 task, 4 threads): CPU cores #1-#4 all attached to one memory (RAM) pool. Distributed memory parallelism (4 tasks, 1 thread per task): CPU cores #1-#4, each with its own memory partition (Part #1-#4).]
Overview of MPI
• MPI stands for Message Passing Interface, a standard provided as a library for exchanging data (called messages) between processes.
• Several libraries implement the MPI standard (see the check below):
  • OpenMPI
  • MPICH
  • Intel MPI
• Typically used with C, C++ and Fortran
• The objects that send messages (ranks) are separated by memory
  • They can be entire CPU nodes, individual CPU cores, or even a GPU!
  • Because each task has its own memory, a rank can in principle send messages anywhere, as long as there is a layer of network communication connecting the ranks
• MPI most commonly uses InfiniBand for node-to-node communication
  • Intra-node communication goes through the node’s shared memory (the vader BTL in OpenMPI)
  • There are many moving parts involving networking for MPI
  • For more information: easybuild_tech_talks_01_OpenMPI_part2_20200708.pdf (open-mpi.org)
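For a quick check of which implementation your mpi4py build is linked against, the library version string can be queried; a minimal sketch (the script name and run command are illustrative):

```python
# check_mpi.py - print which MPI implementation mpi4py is using
# run with e.g.: mpiexec -n 2 python check_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    # Reports e.g. an Open MPI, MPICH, or Intel MPI version string,
    # depending on which library mpi4py was built against
    print(MPI.Get_library_version())
    print("MPI standard version:", MPI.Get_version())
```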
Heuristics for writing MPI Programs: Overview
• Typically, MPI programs take a single program, multiple data (SPMD) approach
  • Single program: encapsulate all desired functions and routines in one program
  • Multiple data: the single program is duplicated across processes, and each copy runs on its own portion of the data
• Think about your largest data structure and how it can be broken up into smaller chunks (see the sketch below)
• The processes then communicate (i.e., share data) using MPI library functions called from user code
• MPI data communication should be kept to a minimum, as it can slow down performance significantly
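Below is a minimal sketch of the SPMD pattern with mpi4py: every rank runs the same script and uses its rank to pick which slice of a global problem it works on. The global size, chunking scheme, and sum-of-squares workload are illustrative assumptions, not part of the workshop materials.

```python
# spmd_chunks.py - each rank processes its own slice of a global problem
# run with e.g.: mpiexec -n 4 python spmd_chunks.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_global = 1_000_000                      # illustrative global problem size
counts = [n_global // size + (1 if r < n_global % size else 0)
          for r in range(size)]           # nearly equal chunk sizes
start = sum(counts[:rank])
stop = start + counts[rank]

# Each rank builds and works on only its own piece of the data
local_x = np.arange(start, stop, dtype=np.float64)
local_sum = np.square(local_x).sum()

# A single collective combines the partial results
total = comm.allreduce(local_sum, op=MPI.SUM)
if rank == 0:
    print(f"sum of squares over {n_global} elements: {total:.3e}")
```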
Common Use Case #1: Perfectly Parallel Computations
• Perfectly (or trivially) parallel programs are ones that do not require any MPI communication functions
• MPI is still useful here, since it lets the program run across more than one computer/compute node (see the sketch below)
• Examples include:
• Matrix/Vector Addition
• Markov Chain Monte Carlo (MCMC) Simulations
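As an illustration of the perfectly parallel pattern, here is a minimal sketch in which each rank computes an independent Monte Carlo estimate with its own random seed; no communication is needed for the work itself, and the final gather is only used for printing. The estimator and sample count are assumptions for illustration.

```python
# perfectly_parallel.py - independent Monte Carlo estimates of pi, one per rank
# run with e.g.: mpiexec -n 4 python perfectly_parallel.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

rng = np.random.default_rng(seed=rank)    # independent random stream per rank
n = 1_000_000                             # illustrative sample count
pts = rng.random((n, 2))
inside = np.count_nonzero((pts ** 2).sum(axis=1) < 1.0)
pi_est = 4.0 * inside / n

# No communication is required to do the work; a gather only collects
# the independent results so rank 0 can print them.
all_estimates = comm.gather(pi_est, root=0)
if rank == 0:
    print("per-rank estimates of pi:", all_estimates)
```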
Common Use Case #2: Domain Decomposition for Partial Differential Equations
• Solving a spatial partial differential equation:
  • The domain is a 1-3D mesh with many grid points/cells that can be broken up using domain decomposition
  • Each processor holds a subset of the domain’s mesh and solves the numerical problem for the differential equation on that subdomain
  • Derivatives are typically approximated with finite difference/volume/element methods, which require knowing values of the function around the evaluated grid point
    • This can require data from other processors
  • MPI can be used to send grid data on the edges of the decomposed domain to the other processors (see the sketch below)
    • These are commonly referred to as “ghost” cells/nodes/volumes
  • Popular frameworks track these ghost points within the mesh object for you
    • parMETIS, SCOTCH, PETSc, Ansys Fluent
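Below is a minimal 1D ghost-cell exchange sketch, assuming a periodic domain split evenly across ranks; each rank swaps one boundary value with each neighbor via Sendrecv before applying a finite-difference stencil. The grid size and dummy data are illustrative.

```python
# ghost_exchange.py - 1D domain decomposition with one ghost cell per side
# run with e.g.: mpiexec -n 4 python ghost_exchange.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_local = 8                                   # illustrative local grid size
left = (rank - 1) % size                      # periodic neighbors
right = (rank + 1) % size

# Local array with one ghost cell on each end: u[0] and u[-1]
u = np.zeros(n_local + 2)
u[1:-1] = rank                                # fill interior with dummy data

# Exchange boundary values with neighbors (blocking, but deadlock-free)
comm.Sendrecv(sendbuf=u[1:2], dest=left, recvbuf=u[-1:], source=right)
comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[0:1], source=left)

# With ghosts filled, a centered stencil can be applied to the interior
lap = u[:-2] - 2.0 * u[1:-1] + u[2:]
print(f"rank {rank}: ghost values = ({u[0]}, {u[-1]})")
```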
Important MPI concepts
• Initialize – MPI must be explicitly started in the code
  • Helps MPI identify what resources were requested
• Rank – how each process is labeled/tracked
  • Common practice: number of ranks = # of CPU cores requested
  • Other practices: 1 compute node per rank, 1 GPU card per rank
• Size – total number of ranks
  • In most MPI-only programs, size = number of processors requested
• Finalize – close MPI within the program (see the sketch below)
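A minimal sketch tying these concepts together. Note that mpi4py calls MPI_Init automatically when the module is imported and registers MPI_Finalize to run at interpreter exit, so the script only needs to query its rank and the communicator size.

```python
# hello_mpi.py - rank, size, and implicit initialize/finalize with mpi4py
# run with e.g.: mpiexec -n 4 python hello_mpi.py
from mpi4py import MPI   # MPI_Init is called automatically on import

comm = MPI.COMM_WORLD
rank = comm.Get_rank()    # this process's label within the communicator
size = comm.Get_size()    # total number of ranks

print(f"Hello from rank {rank} of {size}")
# MPI_Finalize is registered to run automatically when Python exits
```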
Important MPI concepts
• Communicator – how ranks know their relation to others
  • “MPI_COMM_WORLD” – every rank knows every other rank
  • “MPI_COMM_SELF” – every rank knows only itself
• Communication types
  • Point-to-point – synchronized MPI function between ranks
    • Send/Receive – every send must have a matching receive
    • Calls can be blocking or non-blocking
  • Collective – MPI function on all ranks (see the sketch below)
    • Broadcast – one rank sends data to all other ranks
    • Scatter – one rank sends a chunk of data to each rank
    • Gather – one rank receives data from all other ranks
  • One-sided
    • Not covering this
[Figure: ranks 0, 1, and 2 all connected to each other under MPI_COMM_WORLD; each rank connected only to itself under MPI_COMM_SELF.]
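A minimal sketch of these communication types on MPI_COMM_WORLD: one point-to-point send/recv pair plus bcast, scatter, and gather collectives. The payloads and tags are illustrative.

```python
# comm_types.py - point-to-point and collective communication with mpi4py
# run with e.g.: mpiexec -n 4 python comm_types.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Point-to-point: rank 0 sends a Python object to rank 1
# (every send must be matched by a receive)
if rank == 0 and size > 1:
    comm.send({"msg": "hello from 0"}, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    print("rank 1 received:", data)

# Collective: broadcast one value from rank 0 to all ranks
value = comm.bcast(42 if rank == 0 else None, root=0)

# Collective: scatter one chunk per rank, then gather results back to rank 0
chunks = [[i, i * i] for i in range(size)] if rank == 0 else None
my_chunk = comm.scatter(chunks, root=0)
result = comm.gather(sum(my_chunk), root=0)
if rank == 0:
    print("broadcast value:", value, "| gathered sums:", result)
```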
MPI with Python: mpi4py
• mpi4py is a Python library that provides MPI bindings for Python in an object-oriented way, modeled on the MPI-2 C++ bindings
• Supports communicating various Python objects (see the sketch below):
  • NumPy arrays (fast, via the buffer interface)
  • Pickled objects (lists, dictionaries, etc.)
• Documentation: https://mpi4py.readthedocs.io/en/stable/
• We will be using mpi4py for this entire workshop!
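A minimal sketch of mpi4py’s two interfaces: uppercase methods (Send/Recv) communicate buffer-like objects such as NumPy arrays, while lowercase methods (send/recv) pickle general Python objects. The array contents and metadata are illustrative.

```python
# buffers_vs_pickle.py - mpi4py's two communication interfaces
# run with e.g.: mpiexec -n 2 python buffers_vs_pickle.py (needs >= 2 ranks)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Uppercase: fast, buffer-based send of a NumPy array
    arr = np.arange(10, dtype=np.float64)
    comm.Send(arr, dest=1, tag=0)
    # Lowercase: pickle-based send of a generic Python object
    comm.send({"label": "metadata", "n": 10}, dest=1, tag=1)
elif rank == 1:
    arr = np.empty(10, dtype=np.float64)   # receive buffer must be pre-allocated
    comm.Recv(arr, source=0, tag=0)
    meta = comm.recv(source=0, tag=1)
    print("rank 1 got array:", arr, "and object:", meta)
```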
mpi4py vs Other Parallel Python Options
• mpi4py alternatives – also implement the MPI standard in Python:
  • PyPar: https://github.com/daleroberts/pypar
  • ScientificPython: https://github.com/khinsen/ScientificPython/
  • pyMPI: https://sourceforge.net/projects/pympi/
• mpi4py.futures: mpi4py.futures — MPI for Python 4.0.1 documentation
  • Based on concurrent.futures (standard Python) for pooling workers; mpi4py.futures lets the pool span multiple nodes (see the sketch below)
• multiprocessing – spawns multiple processes (called workers) which can distribute work for a function
  • Easier to implement, but limited to a single machine/node
  • There are some communication options: multiprocessing — Process-based parallelism — Python 3.12.2 documentation
• Dask – provides a full parallel job scheduler framework in Python
  • More high-level; communication is more implicit
  • Task scheduling and works well with Jupyter Notebooks
  • Can be used in combination with MPI (Dask-MPI)
  • More details: https://www.dask.org/
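For comparison, a minimal sketch of pooling workers with mpi4py.futures, which follows the concurrent.futures interface; the worker function and inputs are illustrative.

```python
# futures_pool.py - distribute a function over MPI worker processes
# run with e.g.: mpiexec -n 4 python -m mpi4py.futures futures_pool.py
from mpi4py.futures import MPIPoolExecutor

def square(x):
    # stand-in for a more expensive computation
    return x * x

if __name__ == "__main__":
    with MPIPoolExecutor() as executor:
        results = list(executor.map(square, range(16)))
    print(results)
```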
Lab #1 (15-20 min):
1. Setting up the mpi4py Anaconda environment
2. Running our first programs
Today’s files:
/sw/examples/MPI_Workshop_Nov172024.tar.gz
Basic Parallel Computing Theory
• We use parallelization to improve the performance of scientific codes
  • How do we measure that?
  • Can we predict performance based on various factors?
    • Serial performance
    • Hardware
    • Problem size
  • Can we determine how the problem scales as we increase compute resources? (see the timing sketch below)
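One common way to collect such measurements is to time the parallel region with MPI.Wtime and repeat the run with different rank counts; a minimal sketch (the workload is illustrative):

```python
# timing.py - wall-clock timing of a parallel region with MPI.Wtime
# run with e.g.: mpiexec -n 4 python timing.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_global = 4_000_000                       # illustrative total work
n_local = n_global // size

comm.Barrier()                             # line up ranks before timing
t0 = MPI.Wtime()
local = np.sin(np.arange(n_local, dtype=np.float64)).sum()
total = comm.allreduce(local, op=MPI.SUM)
t1 = MPI.Wtime()

# Report the slowest rank's time, since that bounds the parallel runtime
elapsed = comm.reduce(t1 - t0, op=MPI.MAX, root=0)
if rank == 0:
    print(f"{size} ranks: {elapsed:.4f} s (result={total:.3e})")
```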
Measuring Parallel Performance
Variable Description