MPI Python Workshop Day 1, Fall 2024
MPI Overview
November 17, 2024
Presented by:
Nicholas A. Danes, PhD
Computational Scientist
Research Computing Group, Mines IT
Preliminaries
• HPC Experience (one of these):
  • Know the basics of:
    • Linux Shell
    • Python 3
    • Scientific Computing
  • Active HPC User
    • Mines specific: Wendian, Mio
    • Off-premise: Cloud, NSF ACCESS, CU Boulder Alpine, etc.
  • Previously taken our “Intro to HPC” workshop
    • Offered once per semester
Review of Parallel vs Serial Computing
• When a program uses a single process (“task”) with 1 core (“CPU”), we say it is a serial program.
• When a program uses multiple cores, we say it is a parallel program.
• Typically, we try to optimize a serial program before rewriting it in parallel.
• For this workshop, we’ll assume the serial version of the code is already in good shape.
Parallel Programming Models
• Shared vs Distributed Memory Programming
  • Shared memory (e.g. OpenMP)
    • All CPU cores have access to the same pool of memory (RAM)
    • Typically, all CPU cores are on the same CPU node
    • Ideal for multi-threaded loops
  • Distributed memory (e.g. MPI)
    • Each CPU core is given access to a specific pool of memory, which may or may not be shared
    • A “communicator” designates how each CPU core can talk to another CPU core
    • CPU cores do not have to live on the same CPU node
[Figure: Shared memory parallelism (1 task, 4 threads): CPU cores #1-#4 all attached to one memory (RAM) pool. Distributed memory parallelism (4 tasks, 1 thread per task): CPU cores #1-#4, each with its own memory partition (Part #1-#4).]
Overview of MPI
• MPI stands for Message Passing Interface, a standard provided as a library for exchanging data (called messages) between processes.
• Several libraries implement the MPI standard (see the check below):
  • OpenMPI
  • MPICH
  • Intel MPI
• Typically used with C, C++ and Fortran
• The objects that send messages (ranks) are separated by memory
  • They can be entire CPU nodes, individual CPU cores, or even a GPU!
  • Because each task has its own memory, a rank can in principle send messages anywhere, as long as there is a layer of network communication connecting the ranks
• MPI most commonly uses InfiniBand for node-to-node communication
  • Intra-node communication goes through the node’s shared memory (the vader BTL in OpenMPI)
  • There are many moving parts involving networking for MPI
  • For more information: easybuild_tech_talks_01_OpenMPI_part2_20200708.pdf (open-mpi.org)
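For a quick check of which implementation your mpi4py build is linked against, the library version string can be queried; a minimal sketch (the script name and run command are illustrative):

```python
# check_mpi.py - print which MPI implementation mpi4py is using
# run with e.g.: mpiexec -n 2 python check_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    # Reports e.g. an Open MPI, MPICH, or Intel MPI version string,
    # depending on which library mpi4py was built against
    print(MPI.Get_library_version())
    print("MPI standard version:", MPI.Get_version())
```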
Heuristics for writing MPI Programs: Overview
• Typically, MPI programs take a single program, multiple data (SPMD) approach
  • Single program: encapsulate all desired functions and routines in one program
  • Multiple data: the single program is duplicated across processes, and each copy runs on its own portion of the data
• Think about your largest data structure and how it can be broken up into smaller chunks (see the sketch below)
• The processes then communicate (i.e., share data) using MPI library functions called from user code
• MPI data communication should be kept to a minimum, as it can slow down performance significantly
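Below is a minimal sketch of the SPMD pattern with mpi4py: every rank runs the same script and uses its rank to pick which slice of a global problem it works on. The global size, chunking scheme, and sum-of-squares workload are illustrative assumptions, not part of the workshop materials.

```python
# spmd_chunks.py - each rank processes its own slice of a global problem
# run with e.g.: mpiexec -n 4 python spmd_chunks.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_global = 1_000_000                      # illustrative global problem size
counts = [n_global // size + (1 if r < n_global % size else 0)
          for r in range(size)]           # nearly equal chunk sizes
start = sum(counts[:rank])
stop = start + counts[rank]

# Each rank builds and works on only its own piece of the data
local_x = np.arange(start, stop, dtype=np.float64)
local_sum = np.square(local_x).sum()

# A single collective combines the partial results
total = comm.allreduce(local_sum, op=MPI.SUM)
if rank == 0:
    print(f"sum of squares over {n_global} elements: {total:.3e}")
```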
Common Use Case #1: Perfectly Parallel Computations
• Perfectly (or trivially) parallel programs are ones that do not require any MPI communication functions
• MPI is still useful here, since it lets the program run across more than one computer/compute node (see the sketch below)
• Examples include:
• Matrix/Vector Addition
• Markov Chain Monte Carlo (MCMC) Simulations
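As an illustration of the perfectly parallel pattern, here is a minimal sketch in which each rank computes an independent Monte Carlo estimate with its own random seed; no communication is needed for the work itself, and the final gather is only used for printing. The estimator and sample count are assumptions for illustration.

```python
# perfectly_parallel.py - independent Monte Carlo estimates of pi, one per rank
# run with e.g.: mpiexec -n 4 python perfectly_parallel.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

rng = np.random.default_rng(seed=rank)    # independent random stream per rank
n = 1_000_000                             # illustrative sample count
pts = rng.random((n, 2))
inside = np.count_nonzero((pts ** 2).sum(axis=1) < 1.0)
pi_est = 4.0 * inside / n

# No communication is required to do the work; a gather only collects
# the independent results so rank 0 can print them.
all_estimates = comm.gather(pi_est, root=0)
if rank == 0:
    print("per-rank estimates of pi:", all_estimates)
```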
Common Use Case #2: Domain Decomposition for Partial Differential Equations
• Solving a spatial partial differential equation:
  • The domain is a 1-3D mesh with many grid points/cells that can be broken up using domain decomposition
  • Each processor holds a subset of the domain’s mesh and solves the numerical problem for the differential equation on that subdomain
  • Derivatives are typically approximated with finite difference/volume/element methods, which require knowing values of the function around the evaluated grid point
    • This can require data from other processors
  • MPI can be used to send grid data on the edges of the decomposed domain to the other processors (see the sketch below)
    • These are commonly referred to as “ghost” cells/nodes/volumes
  • Popular frameworks track these ghost points within the mesh object for you
    • parMETIS, SCOTCH, PETSc, Ansys Fluent
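Below is a minimal 1D ghost-cell exchange sketch, assuming a periodic domain split evenly across ranks; each rank swaps one boundary value with each neighbor via Sendrecv before applying a finite-difference stencil. The grid size and dummy data are illustrative.

```python
# ghost_exchange.py - 1D domain decomposition with one ghost cell per side
# run with e.g.: mpiexec -n 4 python ghost_exchange.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_local = 8                                   # illustrative local grid size
left = (rank - 1) % size                      # periodic neighbors
right = (rank + 1) % size

# Local array with one ghost cell on each end: u[0] and u[-1]
u = np.zeros(n_local + 2)
u[1:-1] = rank                                # fill interior with dummy data

# Exchange boundary values with neighbors (blocking, but deadlock-free)
comm.Sendrecv(sendbuf=u[1:2], dest=left, recvbuf=u[-1:], source=right)
comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[0:1], source=left)

# With ghosts filled, a centered stencil can be applied to the interior
lap = u[:-2] - 2.0 * u[1:-1] + u[2:]
print(f"rank {rank}: ghost values = ({u[0]}, {u[-1]})")
```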
Important MPI concepts
• Initialize – MPI must be explicitly started in the code
  • Helps MPI identify what resources were requested
• Rank – how each process is labeled/tracked
  • Common practice: number of ranks = # of CPU cores requested
  • Other practices: 1 compute node per rank, 1 GPU card per rank
• Size – total number of ranks
  • In most MPI-only programs, size = number of processors requested
• Finalize – close MPI within the program (see the sketch below)
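A minimal sketch tying these concepts together. Note that mpi4py calls MPI_Init automatically when the module is imported and registers MPI_Finalize to run at interpreter exit, so the script only needs to query its rank and the communicator size.

```python
# hello_mpi.py - rank, size, and implicit initialize/finalize with mpi4py
# run with e.g.: mpiexec -n 4 python hello_mpi.py
from mpi4py import MPI   # MPI_Init is called automatically on import

comm = MPI.COMM_WORLD
rank = comm.Get_rank()    # this process's label within the communicator
size = comm.Get_size()    # total number of ranks

print(f"Hello from rank {rank} of {size}")
# MPI_Finalize is registered to run automatically when Python exits
```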
Important MPI concepts
• Communicator – how ranks know their relation to others
  • “MPI_COMM_WORLD” – every rank knows every other rank
  • “MPI_COMM_SELF” – every rank knows only itself
• Communication types
  • Point-to-point – synchronized MPI function between ranks
    • Send/Receive – every send must have a matching receive
    • Calls can be blocking or non-blocking
  • Collective – MPI function on all ranks (see the sketch below)
    • Broadcast – one rank sends data to all other ranks
    • Scatter – one rank sends a chunk of data to each rank
    • Gather – one rank receives data from all other ranks
  • One-sided
    • Not covering this
[Figure: ranks 0, 1, and 2 all connected to each other under MPI_COMM_WORLD; each rank connected only to itself under MPI_COMM_SELF.]
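A minimal sketch of these communication types on MPI_COMM_WORLD: one point-to-point send/recv pair plus bcast, scatter, and gather collectives. The payloads and tags are illustrative.

```python
# comm_types.py - point-to-point and collective communication with mpi4py
# run with e.g.: mpiexec -n 4 python comm_types.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Point-to-point: rank 0 sends a Python object to rank 1
# (every send must be matched by a receive)
if rank == 0 and size > 1:
    comm.send({"msg": "hello from 0"}, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    print("rank 1 received:", data)

# Collective: broadcast one value from rank 0 to all ranks
value = comm.bcast(42 if rank == 0 else None, root=0)

# Collective: scatter one chunk per rank, then gather results back to rank 0
chunks = [[i, i * i] for i in range(size)] if rank == 0 else None
my_chunk = comm.scatter(chunks, root=0)
result = comm.gather(sum(my_chunk), root=0)
if rank == 0:
    print("broadcast value:", value, "| gathered sums:", result)
```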
MPI with Python: mpi4py
• mpi4py is a Python library that provides MPI bindings for Python in an object-oriented way, modeled on the MPI-2 C++ bindings
• Supports communicating various Python objects (see the sketch below):
  • NumPy arrays (fast, via the buffer interface)
  • Pickled objects (lists, dictionaries, etc.)
• Documentation: https://mpi4py.readthedocs.io/en/stable/
• We will be using mpi4py for this entire workshop!
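A minimal sketch of mpi4py’s two interfaces: uppercase methods (Send/Recv) communicate buffer-like objects such as NumPy arrays, while lowercase methods (send/recv) pickle general Python objects. The array contents and metadata are illustrative.

```python
# buffers_vs_pickle.py - mpi4py's two communication interfaces
# run with e.g.: mpiexec -n 2 python buffers_vs_pickle.py (needs >= 2 ranks)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Uppercase: fast, buffer-based send of a NumPy array
    arr = np.arange(10, dtype=np.float64)
    comm.Send(arr, dest=1, tag=0)
    # Lowercase: pickle-based send of a generic Python object
    comm.send({"label": "metadata", "n": 10}, dest=1, tag=1)
elif rank == 1:
    arr = np.empty(10, dtype=np.float64)   # receive buffer must be pre-allocated
    comm.Recv(arr, source=0, tag=0)
    meta = comm.recv(source=0, tag=1)
    print("rank 1 got array:", arr, "and object:", meta)
```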
mpi4py vs Other Parallel Python Options
• mpi4py alternatives – also implement the MPI standard in Python:
  • PyPar: https://github.com/daleroberts/pypar
  • ScientificPython: https://github.com/khinsen/ScientificPython/
  • pyMPI: https://sourceforge.net/projects/pympi/
• mpi4py.futures: mpi4py.futures — MPI for Python 4.0.1 documentation
  • Based on concurrent.futures (standard Python) for pooling workers; mpi4py.futures lets the pool span multiple nodes (see the sketch below)
• multiprocessing – spawns multiple processes (called workers) which can distribute work for a function
  • Easier to implement, but limited to a single machine/node
  • There are some communication options: multiprocessing — Process-based parallelism — Python 3.12.2 documentation
• Dask – provides a full parallel job scheduler framework in Python
  • More high-level; communication is more implicit
  • Task scheduling and works well with Jupyter Notebooks
  • Can be used in combination with MPI (Dask-MPI)
  • More details: https://www.dask.org/
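For comparison, a minimal sketch of pooling workers with mpi4py.futures, which follows the concurrent.futures interface; the worker function and inputs are illustrative.

```python
# futures_pool.py - distribute a function over MPI worker processes
# run with e.g.: mpiexec -n 4 python -m mpi4py.futures futures_pool.py
from mpi4py.futures import MPIPoolExecutor

def square(x):
    # stand-in for a more expensive computation
    return x * x

if __name__ == "__main__":
    with MPIPoolExecutor() as executor:
        results = list(executor.map(square, range(16)))
    print(results)
```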
Lab #1 (15-20 min):
1. Setting up the mpi4py Anaconda environment
2. Running our first programs
Today’s files:
/sw/examples/MPI_Workshop_Nov172024.tar.gz
Basic Parallel Computing Theory
• We use parallelization to improve the performance of scientific codes
  • How do we measure that?
  • Can we predict performance based on various factors?
    • Serial performance
    • Hardware
    • Problem size
  • Can we determine how the problem scales as we increase compute resources? (see the timing sketch below)
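One common way to collect such measurements is to time the parallel region with MPI.Wtime and repeat the run with different rank counts; a minimal sketch (the workload is illustrative):

```python
# timing.py - wall-clock timing of a parallel region with MPI.Wtime
# run with e.g.: mpiexec -n 4 python timing.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_global = 4_000_000                       # illustrative total work
n_local = n_global // size

comm.Barrier()                             # line up ranks before timing
t0 = MPI.Wtime()
local = np.sin(np.arange(n_local, dtype=np.float64)).sum()
total = comm.allreduce(local, op=MPI.SUM)
t1 = MPI.Wtime()

# Report the slowest rank's time, since that bounds the parallel runtime
elapsed = comm.reduce(t1 - t0, op=MPI.MAX, root=0)
if rank == 0:
    print(f"{size} ranks: {elapsed:.4f} s (result={total:.3e})")
```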
Measuring Parallel Performance
Variable Description