Module 1 & 2


Indian Institute of Information Technology Sri City

High Performance Computing

Dr. Bheemappa H
IIIT Sri City
bheemappa.h@iiit.in


Outline
2

● Course Details
● Syllabus
● Evaluation Plan
● Reference Materials
● SUPERCOMPUTERS


● What is a computer? How is data represented?

● Program
  ○ Algorithm and data structure.
  ○ Programming language: a standard notation for writing programs.
    ■ Example: C, Java, Intel assembly language.
  ○ Program translators: compiler - GCC
    ■ What does GCC do?
● Data representation
  ■ Binary
    ● Base 2 / 0-1 / bit
  ● Machine language (IA32).


Improving Computer Performance:

4

● Innovative manufacturing and advances in VLSI.
● Exploitation of some form of parallelism:
  ● ILP (instruction-level parallelism)
    ○ Pipelining (instruction execution: Fetch/Decode/Execute/Memory)
    ○ Out-of-order execution
    ○ Superscalar execution
  ● Thread-level parallelism
    ● Multiple threads on one processor
    ● SMT: hardware multithreading
    ● Multithreading on multiple processors
    ● Example: a loop (see the sketch below)
  ● Process-level parallelism
    ○ Different processes can be executed in parallel on multiple processors
      ■ SMP
      ■ DSM
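A minimal sketch of the loop example in C with OpenMP (OpenMP is introduced later, in Module 4); the array size N and the vector-add body are illustrative assumptions:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* The iterations are independent, so OpenMP can split them
         * across the available threads (thread-level parallelism). */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %f (max threads: %d)\n", c[N - 1], omp_get_max_threads());
        return 0;
    }

Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.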


Syllabus
5

Module 1: Introduction: Motivating Parallelism, Scope of Parallel Computing, Introduction to HPC: Parallel Programming
Platforms; Implicit Parallelism: Trends in Microprocessor Architectures, Limitations of Memory System Performance, Dichotomy
of Parallel Computing Platforms, Physical Organization of Parallel Platforms. (6 Hrs)
Module 2: Parallel Computer Memory Architectures: Measures of Parallel Algorithms; Analytical Modeling of Parallel Programs:
Sources of Overhead in Parallel Programs, Performance Metrics for Parallel Systems, the Effect of Granularity on Performance;
Parallel Platforms: Models (SIMD, MIMD, SPMD), Communication (Shared Address Space vs. Message Passing). (7 Hrs)
Module 3: Thread Basics: Why Threads? The POSIX Thread API, Thread Creation and Termination, Synchronization Primitives in
Pthreads, Controlling Thread and Synchronization Attributes, Thread Cancellation, Composite Synchronization Constructs. (6 Hrs)
Module 4: Distributed Memory Parallel Programming – Tips for Designing Asynchronous Programs, OpenMP: A Standard for
Directive-Based Parallel Programming. (6 Hrs)
Module 5: The Building Blocks: Send and Receive Operations, MPI: the Message Passing Interface, Topologies and Embedding,
Overlapping Communication with Computation, Collective Communication and Computation Operations, Groups and
Communicators. (7 Hrs)
Module 6: The Age of Parallel Processing, Central Processing Units, The Rise of GPU Computing, A Brief History of GPUs, Early
GPU Computing; CUDA: What is the CUDA architecture, Using the CUDA architecture, Applications of CUDA, Medical Imaging,
Computational Fluid Dynamics, Environmental Science; Introduction to CUDA C: A First Program, Hello World, A Kernel Call,
Passing Parameters, Querying Devices, Using Device Properties; Parallel Programming in CUDA C: CUDA Parallel Programming,
Summing Vectors, A Fun Example. (7 Hrs)

TEXT BOOKS
6

● John Hennessy and David Patterson, Computer Architecture: A Quantitative
  Approach, 5th ed., Morgan Kaufmann.
● Ananth Grama, Vipin Kumar, Anshul Gupta, George Karypis, Introduction to
  Parallel Computing, Addison-Wesley, 2003.
● Wen-mei W. Hwu, David B. Kirk, Programming Massively Parallel Processors: A
  Hands-on Approach, 3rd ed., Morgan Kaufmann.
● William Dally and Brian Towles, Principles and Practices of Interconnection
  Networks, Morgan Kaufmann, 2004.


EVALUATION PLAN
7

Tentative Evaluation Plan:

● Mid Exam = 25%
● End Semester Exam = 45%
● Project = 20%
● Scheduled Quizzes = 10%


Week 1: Class 2


M1:
9

● Introduction
● Introduction to HPC
● Motivating Parallelism
● Scope of Parallel Computing
● Parallel Programming Platforms
○ Implicit Parallelism: Trends in Microprocessor Architectures
○ Limitations of Memory System Performance
● Dichotomy of Parallel Computing Platforms
○ Control Structure of Parallel Platforms
○ Communication Model of Parallel Platforms
● Physical Organization of Parallel Platforms
○ Architecture of an Ideal Parallel Computer
○ Interconnection Networks for Parallel Computers


M1:
10

● Introduction to HPC:
● Motivating Parallelism
● Scope of Parallel Computing
● Parallel Programming Platforms
○ Implicit Parallelism: Trends in Microprocessor Architectures
○ Limitations of Memory System Performance
● Dichotomy of Parallel Computing Platforms
○ Control Structure of Parallel Platforms
○ Communication Model of Parallel Platforms
● Physical Organization of Parallel Platforms
○ Architecture of an Ideal Parallel Computer
○ Interconnection Networks for Parallel Computers


Historical Background
11

● Two Major Stages of Development

  ○ Mechanical: before 1945
  ○ Electronic: after 1945
    ■ 1st Gen (1945-54): vacuum tubes / assembly language / IBM 701
    ■ 2nd Gen (1955-64): transistors / compilers / IBM 7090
    ■ 3rd Gen (1965-74): ICs (SSI, MSI) / multiprogramming, time sharing / IBM 360/370
    ■ 4th Gen (1975-90): ICs (VLSI) / multiprocessors / HLLs / parallel processing (home)
    ■ 5th Gen (1991-present): ICs (ULSI) / parallel processing (heterogeneous) / Cray MPP


Introduction to HPC
12

● Elements of a Modern Computer

  ○ Hardware
  ○ System Software
  ○ Application Programs
    ■ The functionality of a processor is characterised by its instruction set.
    ■ What are architecture and organization?
      ● Architecture: short form for ISA (the programmer's view of the processor).
        ○ Data transfer, add, mul, sub, etc.; registers; instruction set.
      ● Organization: the high-level design: caches, pipelines, control unit.


13

● What is high performance computing (HPC)?

  ○ The use of the most efficient algorithms on computers capable of the highest
    performance to solve the most demanding problems.
  ○ The main HPC resources are: CPU, RAM, GPU.
● Why HPC?
  ○ Large problems - spatially/temporally
    ● Weather forecasting; protein folding; turbulence simulations/CFD;
      aerospace structures; full-body simulation; earthquake simulation.


Key characteristics of HPC


14

● Parallel Processing
● Handling Large Datasets
● Complex Simulations
“HPC refers to techniques, algorithms, and methodologies used to achieve high
computational performance”
● HPC systems include:
  ● supercomputers, distributed computing, and specialized hardware.


SUPERCOMPUTERS
15

● The term is commonly applied to the fastest high-performance systems
  available at any given time.
● Such computers are used primarily for scientific and engineering work requiring
  exceedingly high-speed computations.
● Top 500 supercomputers: https://en.wikipedia.org/wiki/TOP500
● Floating-point operations per second (FLOPS, flops or flop/s) is a measure of
  computer performance; supercomputer speed is commonly measured in terms
  of FLOPS.
● Today's leading systems are measured in quadrillions of 64-bit floating-point
  operations per second.


How do HPC systems differ from a regular computer?

16

● HPC
  ○ Major components: CPU, RAM, GPU.
  ○ Several computers connected together in a network form an HPC cluster;
    each of these computers is referred to as a node in the network.
  ○ HPC clusters are used to run resource-intensive and/or parallel tasks.
● Nodes in HPC
  ○ There are two types of nodes on a cluster:
    ■ login nodes (also known as head or submit nodes)
    ■ compute nodes (also known as worker nodes)
● Job Scheduler
  ○ Users do not have direct access to the compute nodes; instead they submit
    jobs via a job scheduler.
  ○ The scheduler’s role is to manage all these jobs.


High Performance Computers


17

● In the 1980s
  ○ 1x10^6 floating-point ops/sec (Mflop/s)
  ○ Scalar based
● In the 1990s
  ○ 1x10^9 floating-point ops/sec (Gflop/s)
  ○ Vector & shared-memory computing
● Today
  ○ 1x10^12 floating-point ops/sec (Tflop/s)
  ○ Highly parallel, distributed processing, message passing


Computer Technology
18

● Performance improvements:
○ Improvements in semiconductor technology
■ Feature size, clock speed
○ Improvements in computer architectures
■ Enabled by HLL compilers, UNIX
■ Led to RISC architectures
○ Together have enabled:
■ Lightweight computers
■ Productivity-based managed/interpreted programming languages
■ SaaS, Virtualization, Cloud
○ Applications evolution:
■ Speech, sound, images, video, “augmented/extended reality”, “big data”

Moore’s Law
19

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor


density of semiconductor chips would double roughly every 18 months.


M1:
20

● Introduction to HPC:
● Motivating Parallelism, Scope of Parallel Computing,
● Parallel Programming Platforms
○ Implicit Parallelism: Trends in Microprocessor Architectures
○ Limitations of Memory System Performance
● Dichotomy of Parallel Computing Platforms
○ Control Structure of Parallel Platforms
○ Communication Model of Parallel Platforms
● Physical Organization of Parallel Platforms
○ Architecture of an Ideal Parallel Computer
○ Interconnection Networks for Parallel Computers


Scope of Parallel Computing Applications


21

● Applications in Engineering and Design.


● Scientific Applications
○ evolution of galaxies, thermonuclear processes, and the analysis of
extremely large datasets from telescopes
● Commercial Applications
○ optimizing business and marketing decisions
● Applications in Computer Systems
○ Network intrusion detection, cryptography
● Applications in Data Science
○ Google File System, Map-reduce


Introduction to Parallel Computing


22

● What is Parallel Computing?


○ “Processing of multiple tasks simultaneously on multiple processors is
  called parallel processing”.
● Concurrency is when two tasks can start, run, and complete in
overlapping time periods.
● Parallelism refers to techniques to make programs faster by
performing several computations at the same time.

● Advantages of Parallel Computing


Motivating Parallelism
23

● The role of parallelism in accelerating computing speeds has been


recognized for several decades.
● Development of parallel software has traditionally been thought of
as time and effort intensive.
● Its role in providing multiplicity of datapaths and increased access
to storage elements has been significant in commercial applications


Motivating Parallelism
24

● Computational Speed Argument


○ The logical recourse is to rely on parallelism, both implicit and explicit
○ Most serial (or seemingly serial) processors rely extensively on implicit
parallelism.
● Memory/Disk Speed Argument
○ Mismatch in speeds causes significant performance bottlenecks.
○ Parallel platforms provide increased bandwidth to the memory system
■ Principles of locality of data reference and bulk access,
● Data Communication Argument :
○ Vision of the Internet as one large computing platform has emerged.
○ Any analyses on this data must be performed over the network using
parallel techniques
The emergence of standardized parallel programming environments, libraries, and hardware
has significantly reduced the time to (parallel) solution.


Implicit Parallelism: Trends in Microprocessor Architectures


25

● Conventional architectures coarsely comprise a processor, a memory system,
  and the datapath.
  ○ Each presents significant performance bottlenecks.
● Microprocessor clock-speed gains and high levels of device integration.
● How best to utilize these resources?
  ○ Multiple functional units that execute multiple instructions in the same cycle:
    ○ Pipelining
    ○ VLIW processors
    ○ etc.

All microprocessors currently available rely on parallelism to varying degrees;
this implicit parallelism is generally hidden from the programmer.

A program is implicitly parallel if its compiler or interpreter can recognize
opportunities for parallelization and implement them without being told to do so.

Pipelining
26

1. Pipelining overlaps the various stages of instruction execution to achieve
   higher performance.

Superscalar Execution
27


Superscalar Execution
28

• True Data Dependency: the result of one operation is an input to the next.
• Resource Dependency: two operations require the same resource.
• Branch Dependency: scheduling instructions across conditional branch statements
  cannot be done deterministically a priori.
• The scheduler, a piece of hardware, looks at a large number of instructions in an
  instruction queue and selects an appropriate number of instructions to execute
  concurrently based on these factors (see the sketch below).
• The complexity of this hardware is an important constraint on superscalar processors.

• Due to limited parallelism in typical instruction traces, dependencies, or the inability of the
  scheduler to extract parallelism, the performance of superscalar processors is eventually
  limited.
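A small C illustration of the dependency kinds listed above; the function names are hypothetical. The first loop forms a chain of true data dependencies, while the second keeps two independent accumulators that a superscalar scheduler could issue in the same cycle:

    /* Dependent chain: every add needs the previous value of s,
     * so consecutive adds cannot issue in the same cycle. */
    static int dependent_sum(const int *a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s = s + a[i];              /* true data dependency on s */
        return s;
    }

    /* Two independent accumulators: the adds to s0 and s1 do not
     * depend on each other, so they can execute concurrently. */
    static int independent_sums(const int *a, int n) {
        int s0 = 0, s1 = 0;
        for (int i = 0; i + 1 < n; i += 2) {
            s0 += a[i];                /* independent of s1 */
            s1 += a[i + 1];            /* independent of s0 */
        }
        if (n % 2) s0 += a[n - 1];     /* leftover element for odd n */
        return s0 + s1;
    }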


29
Very Long Instruction Word (VLIW) Processors
1. VLIW processors rely on compile time analysis to identify and bundle
together instructions that can be executed concurrently
2. The instructions are packed and dispatched together, and thus the name
very long instruction word (Intel IA64).
3. Compiler has a bigger context from which to select co-scheduled
instructions.
4. Compilers, however, do not have runtime information such as cache
misses. Scheduling is, therefore, inherently conservative
VLIW performance is highly dependent on the compiler.


Current Trends in Architecture


30

● Cannot continue to leverage Instruction-Level parallelism (ILP)


○ Single processor performance improvement ended in 2003
● New models for performance:
○ Data-level parallelism (DLP)
○ Thread-level parallelism (TLP)
○ Request-level parallelism (RLP)
● These require explicit restructuring of the application


Limitations of Memory System Performance


31

Typical computations run at only 10-50% of peak CPU utilization
because of memory bottlenecks. The key question here is how to
connect a memory with 50 ns latency to a processor that runs on a 0.5 ns clock.

Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM
with a latency of 100 ns (no caches). Assume that the processor has two
multiply-add units and is capable of executing four instructions in each cycle
of 1 ns.
The following observations follow:
• The peak CPU rating is 4 GFLOPS (floating-point operations per second).
• What is the peak speed of this computation?
  10 MFLOPS: since the memory latency is 100 cycles, the processor waits 100 ns
  for each operand it fetches from memory, so it completes at most one
  floating-point operation every 100 ns (see the calculation below).
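The arithmetic behind these two numbers, written out (a short check using only the figures stated above):

    \begin{align*}
    \text{Peak rate} &= 4\ \text{FLOPs/cycle} \times 10^{9}\ \text{cycles/s} = 4\ \text{GFLOPS},\\
    \text{Memory-bound rate} &= \frac{1\ \text{FLOP}}{100\ \text{ns}} = 10^{7}\ \text{FLOPs/s} = 10\ \text{MFLOPS}.
    \end{align*}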

32


Limitations of Memory System Performance


33

● Limitations of Memory System Performance
  ○ Bandwidth and latency

● Improving Effective Memory Latency Using Caches
  ○ The cache acts as low-latency, high-bandwidth storage.
  ○ If a piece of data is repeatedly used, the effective latency of the memory system
    is reduced by the cache.
  ○ The cache hit ratio achieved by a code on a memory system often determines its
    performance.
  ○ Exploiting spatial and temporal locality in applications is critical for amortizing
    memory latency and increasing effective memory bandwidth (see the sketch below).
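A minimal C sketch of spatial locality: C stores the matrix row-major, so the row-order loop walks memory sequentially and reuses each cache line, while the column-order loop jumps N doubles per access and misses far more often. The size N is an arbitrary illustrative choice:

    #include <stdio.h>

    #define N 1024

    static double a[N][N];

    /* Row-major traversal: consecutive accesses fall in the same
     * cache line, so spatial locality is exploited. */
    static double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-order traversal of the same data: each access jumps
     * N doubles ahead, so nearly every access misses in the cache. */
    static double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }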


HW
34

• Find the historical trends in processor clock speeds.


• Plot the data and compare the observed growth to the limits imposed by thermal and physical
constraints.
• Investigate the relationship between the number of cores in processors and their peak FLOPS rate
over time. Plot and analyze the trend, discussing its implications for parallel computing.
• Compare the computational requirements of modern machine learning models, such as GPT or
image generation systems, over the past five years. Plot the growth in FLOPS required for training
and inference.
• Explore the trends in power consumption of supercomputers over the past decade. Compare the
energy efficiency improvements of the Green500 leaders with those of the Top500 leaders.
• Investigate the challenges of cooling in high-performance computing systems. How do modern
  techniques like liquid cooling and immersion cooling compare in terms of efficiency and cost?


35

1. Compare the capabilities of existing quantum computing platforms


(e.g., IBM Quantum, Google Sycamore, D-Wave) in terms of qubits,
error rates, and computational performance


M1:
36

● Motivating Parallelism, Scope of Parallel Computing,


● Introduction to HPC:
● Parallel Programming Platforms
○ Implicit Parallelism: Trends in Microprocessor Architectures
○ Limitations of Memory System Performance
● Dichotomy of Parallel Computing Platforms
○ Control Structure of Parallel Platforms
○ Communication Model of Parallel Platforms
● Physical Organization of Parallel Platforms
○ Architecture of an Ideal Parallel Computer
○ Interconnection Networks for Parallel Computers


Explicitly Parallel Platforms


37

● Dichotomy of Parallel Computing Platforms


○ An explicitly parallel program must specify concurrency and interaction
between concurrent subtasks.
■ Control structure
■ Parallelism can be expressed at various levels of granularity - from instruction
level to processes
● Control Structure of Parallel Programs
○ Processor units either operate under the centralized control of a single
control unit or work independently.
○ Single control unit that dispatches the same instruction to various
processors (that work on different data)


Parallelism
38

● Classes of parallelism in applications:


○ Data-Level Parallelism (DLP)
○ Task-Level Parallelism (TLP)
● Classes of architectural parallelism:
○ Instruction-Level Parallelism (ILP)
○ Vector architectures/Graphic Processor Units (GPUs)
○ Thread-Level Parallelism
○ Request-Level Parallelism


Control Structure of Parallel Programs


39

○ Single instruction stream, multiple data stream (SIMD)
  ■ A single control unit dispatches the same instruction to the various
    processors (which work on different data).
○ Multiple instruction stream, multiple data stream (MIMD)
  ■ Each processor has its own control unit and can execute different
    instructions on different data items.


40

Parallelism from a single instruction on multiple processors (SIMD)

41

➢ A single control unit dispatches instructions to each processing unit.
➢ The same instruction is executed synchronously by all processing units.
➢ Examples: MMX units in Intel processors and DSP chips.
➢ Example: an add instruction is dispatched to all processors and executed
  concurrently by them (see the sketch below).
➢ In structured computations on parallel data structures such as arrays, it is often
  necessary to selectively turn off operations on certain data items.
➢ Conditional execution can be detrimental to the performance of SIMD processors
  and therefore must be used with care.
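A minimal C sketch of the data-parallel "add" example above: the same operation is applied to every array element, which a vectorizing compiler can map onto SIMD units. The second loop shows per-element conditional execution, which SIMD hardware typically handles by evaluating both branches under a mask, one reason conditionals can hurt SIMD performance:

    #include <stddef.h>

    /* Same "add" applied to every element: ideal SIMD work. */
    void vec_add(const float *a, const float *b, float *c, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];          /* one instruction, many data items */
    }

    /* Per-element condition: on SIMD hardware both branches are typically
     * evaluated and the result selected with a mask, so heavy use of
     * conditionals reduces the benefit of SIMD execution. */
    void vec_safe_div(const float *a, const float *b, float *c, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = (b[i] != 0.0f) ? a[i] / b[i] : a[i];
    }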

Multiple Instruction stream, Multiple Data stream (MIMD)


42

➢ Each processing element is capable of executing a different program
  independent of the other processing elements.
➢ Each of the multiple programs can be inserted into one large if-else block
  with conditions specified by the task identifiers (see the sketch below).
  ■ A simple variant of this model is called the single program,
    multiple data (SPMD) model.
  ■ An example of an MIMD system is the Intel Xeon Phi.
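A minimal SPMD sketch in C using MPI (MPI itself is covered in Module 5): a single program runs on every processing element, and the process rank acts as the task identifier selecting which branch of the if-else block each copy executes:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* task identifier */
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* one copy runs the coordinator branch... */
            printf("rank 0 of %d: coordinating\n", size);
        } else {
            /* ...all other copies run the worker branch. */
            printf("rank %d of %d: working\n", rank, size);
        }

        MPI_Finalize();
        return 0;
    }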


43

● SIMD computers require less hardware than MIMD computers
  because they have only one global control unit.
● SIMD computers require less memory because only one copy of the
  program needs to be stored.
● MIMD computers are more complex than SIMD computers.
● MIMD computers are more efficient in terms of performance than SIMD computers.


M1:
44

● Motivating Parallelism, Scope of Parallel Computing,


● Introduction to HPC:
● Parallel Programming Platforms
○ Implicit Parallelism: Trends in Microprocessor Architectures
○ Limitations of Memory System Performance
● Dichotomy of Parallel Computing Platforms
○ Control Structure of Parallel Platforms
○ Communication Model of Parallel Platforms
● Physical Organization of Parallel Platforms
○ Architecture of an Ideal Parallel Computer
○ Interconnection Networks for Parallel Computers


Communication Model of Parallel Platforms


45

● Data exchange between parallel tasks:
  ○ accessing a shared data space (shared-address-space machines or
    multiprocessors)
  ○ exchanging messages (message-passing platforms or multicomputers)
● Shared-Address-Space Platforms
  ○ A common data space is accessible to all processors.
  ○ Processors interact by modifying data objects stored in this shared address
    space.
  ○ If the time taken by a processor to access any memory word in the system
    (global or local) is identical, the platform is classified as a uniform memory
    access (UMA) machine; otherwise it is a non-uniform memory access (NUMA) machine.


46


NUMA and UMA Shared-Address-Space Platforms


47

● NUMA and UMA architectures are defined only in terms of memory access
  times, not cache access times.
● Sun Ultra HPC servers are a class of NUMA multiprocessors.


NUMA and UMA Shared-Address-Space Platforms


48

Presence of global memory (RAM):
● The presence of a global memory space makes (parallel) programming such
  platforms much easier.
● Operations on shared data require mutual exclusion for concurrent accesses
  (see the sketch below).
● Read/write interactions are, however, harder to program than read-only
  interactions.

Presence of caches on processors:
● Caches raise the issue of multiple copies of a single memory word being
  manipulated by two or more processors at the same time.
● Ensuring that concurrent operations on multiple copies of the same memory
  word have well-defined semantics is the cache coherence problem.
● An address translation mechanism is needed to locate a memory word in the
  system.
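A minimal Pthreads sketch (Pthreads are covered in Module 3) of the mutual exclusion mentioned above: two threads update a shared counter in the shared address space, and a mutex serializes the concurrent read-modify-write:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                       /* shared data object */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);             /* enter critical section */
            counter++;                             /* read-modify-write on a shared word */
            pthread_mutex_unlock(&lock);           /* leave critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);        /* 2000000 with the mutex */
        return 0;
    }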


Message-Passing Platforms
49

● These platforms comprise a set of processing nodes, each with its own
  (exclusive) memory.
● Instances of such a view come naturally from clustered workstations and
  non-shared-address-space multicomputers.
● These platforms are programmed using (variants of) send and receive
  primitives (see the sketch below).
● Libraries such as MPI and PVM provide such primitives.
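A minimal sketch of the send and receive primitives using MPI: rank 0 sends one integer to rank 1, which receives it. The tag value and the data sent are arbitrary illustrative choices; run with at least two processes:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Send one int to rank 1 with message tag 0. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int received;
            MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", received);
        }

        MPI_Finalize();
        return 0;
    }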


Message Passing vs. Shared Address Space Platforms


50

● Message passing requires little hardware support, other than a


network.
● Shared address space platforms can easily emulate message
passing. The reverse is more difficult to do (in an efficient manner).


M1:
51

● Motivating Parallelism, Scope of Parallel Computing,


● Introduction to HPC:
● Parallel Programming Platforms
○ Implicit Parallelism: Trends in Microprocessor Architectures
○ Limitations of Memory System Performance
● Dichotomy of Parallel Computing Platforms
○ Control Structure of Parallel Platforms
○ Communication Model of Parallel Platforms
● Physical Organization of Parallel Platforms
○ Architecture of an Ideal Parallel Computer
○ Interconnection Networks for Parallel Computers


Physical Organization of Parallel Platforms


52

Architecture of an Ideal Parallel Computer

• Parallel Random Access Machine (PRAM)
  • A PRAM consists of p processors and a global memory of unbounded size
    that is uniformly accessible to all processors.
  • Processors share a common clock but may execute different instructions
    in each cycle.
  • Depending on how simultaneous memory accesses are handled, PRAMs can
    be divided into four subclasses:
    • Exclusive-read, exclusive-write (EREW) PRAM
    • Concurrent-read, exclusive-write (CREW) PRAM
    • Exclusive-read, concurrent-write (ERCW) PRAM
    • Concurrent-read, concurrent-write (CRCW) PRAM

Concurrent read access does not create any semantic discrepancies, but
concurrent write access to a memory location requires arbitration.

Architecture of an Ideal Parallel Computer


53

What does concurrent write mean?

• The most frequently used protocols are as follows:


• Common, in which the concurrent write is allowed if all the values that the
processors are attempting to write are identical.
• Arbitrary, in which an arbitrary processor is allowed to proceed with the write
operation and the rest fail.
• Priority, in which all processors are organized into a predefined prioritized
list, and the processor with the highest priority succeeds and the rest fail.
• Sum, in which the sum of all the quantities is written (the sum-based write
conflict resolution model can be extended to any associative operator defined
on the quantities being written)


Architectural Complexity of the Ideal Model


54

• Processors and memories are connected via switches.
• For a system of p processors and m memory words, the switch complexity is O(mp).
• For a reasonable memory size, constructing a switching network of this
  complexity is very expensive.
• Thus, PRAM models of computation are impossible to realize in practice.
• Solution:
  • Interconnection networks for parallel computers.
  • A black-box view of an interconnection network consists of n inputs and m outputs.


M1:
55

● Motivating Parallelism, Scope of Parallel Computing,


● Introduction to HPC:
● Parallel Programming Platforms
○ Implicit Parallelism: Trends in Microprocessor Architectures
○ Limitations of Memory System Performance
● Dichotomy of Parallel Computing Platforms
○ Control Structure of Parallel Platforms
○ Communication Model of Parallel Platforms
● Physical Organization of Parallel Platforms
○ Architecture of an Ideal Parallel Computer
○ Interconnection Networks for Parallel Computers


Interconnection Networks for Parallel Computers


56
• Interconnects are made of switches and links.
• Interconnects are classified as static or dynamic.
• Static networks consist of point-to-point communication links among
  processing nodes and are also referred to as direct networks.
• Dynamic networks are built using switches and communication links;
  dynamic networks are also referred to as indirect networks.


Interconnection Networks Components


57

• Switches:
  • A switch maps a fixed number of inputs to outputs; the degree of the switch is
    the total number of its ports.
• Network Interfaces:
  • Processors talk to the network via a network interface.
  • The interface provides connectivity between the nodes and the network.
• Network Topologies:
  • Bus-Based Networks:
    • Buses are also ideal for broadcasting information among nodes.
    • Since the transmission medium is shared, there is little overhead associated with a
      broadcast compared to point-to-point message transfer.
    • Caches can be used to reduce the load on the shared bus bandwidth.


Interconnection Networks Components


58

• Network Topologies:
  • Crossbar Networks:
    • A non-blocking network: the connection of a processing node to a
      memory bank does not block the connection of any other processing
      node to any other memory bank.
  • Multistage Networks
  • Tree-Based Networks


59


60

M2: Measures of Parallel Algorithms


M2 : Measures of Parallel Algorithms


61

• Analytical Modeling of Parallel Programs:


• Sources of Overhead in Parallel Programs
• Performance Metrics for Parallel Systems,
• The Effect of Granularity on Performance;
• Parallel Platforms: Models (SIMD,MIMD, SPMD)
• Communication (Shared Address Space vs. Message Passing)


M2 : Measures of Parallel Algorithms


62

• Analytical Modeling of Parallel Programs:


• Sources of Overhead in Parallel Programs
• Performance Metrics for Parallel Systems,
• The Effect of Granularity on Performance;
• Parallel Platforms: Models (SIMD,MIMD, SPMD)
• Communication (Shared Address Space vs. Message Passing)


Sources of Overhead in Parallel Programs


63

• If I use two processors, shouldn't my program run twice as fast?
  • No: a number of overheads, including wasted computation, communication, idling, and
    contention, cause degradation in performance.

Total time collectively spent by all processing elements: T_all = p * T_P

• Performance Metrics for Parallel Systems: Execution Time
  • The time that elapses from the moment the first processor starts to the moment the last
    processor finishes execution.

M2 : Measures of Parallel Algorithms


64

• Analytical Modeling of Parallel Programs:


• Sources of Overhead in Parallel Programs
• Performance Metrics for Parallel Systems,
• The Effect of Granularity on Performance;
• Parallel Platforms: Models (SIMD,MIMD, SPMD)
• Communication (Shared Address Space vs. Message Passing)


Five Performance Metrics for Parallel Systems


65

• Execution Time
• Overhead
• Speedup
• Efficiency
• Cost.
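The slides that follow define these metrics one by one; collected here for reference, with T_S the serial time, T_P the parallel time, and p the number of processing elements (the overhead formula follows the course textbook by Grama et al.):

    \begin{align*}
    \text{Overhead:}   \quad & T_o = p\,T_P - T_S \\
    \text{Speedup:}    \quad & S = \frac{T_S}{T_P} \\
    \text{Efficiency:} \quad & E = \frac{S}{p} \\
    \text{Cost:}       \quad & C = p\,T_P
    \end{align*}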


Measures of Parallel Algorithms


66

• Execution Time
  • Sequential algorithm: evaluated in terms of its execution time, expressed as
    a function of the size of its input.
  • Parallel algorithm: execution time depends not only on the input size but also
    on the number of processing elements used, and on their relative computation
    and interprocess communication speeds.
  • An algorithm must therefore be analyzed in the context of the underlying platform.
• Interprocess Interaction
  • Processing elements need to interact and communicate data (e.g., intermediate results).
• Idling
  • Due to many reasons such as load imbalance, synchronization, and the presence
    of serial components in a program.
• Excess Computation
  • Some computations are performed multiple times on different processing elements.

A parallel system is a combination of a parallel algorithm and an underlying platform.


Sources of Overhead in Parallel Programs


67

• A number of overheads, including wasted computation, communication, idling, and
  contention, cause degradation in performance.

Total time collectively spent by all processing elements: T_all = p * T_P


Performance Metrics for Parallel Systems


68

• Performance Metric #3: Speedup
  • What is the benefit from parallelism?
  • Speedup (S) is the ratio of the time taken to solve a problem on a
    single processor to the time required to solve the same problem on a
    parallel computer with p identical processing elements.
  • When computing speedup, we always consider the best sequential program
    as the baseline.


69

• Speedup Example: Parallel bubble sort
  • The serial time for bubble sort is 150 seconds.
  • The parallel time for odd-even sort (an efficient parallelization of bubble sort)
    is 40 seconds.
  • The speedup would appear to be 150/40 = 3.75. (?)
  • But is this really a fair assessment of the system?

  • What if serial quicksort only took 30 seconds? In that case, the speedup is
    30/40 = 0.75. This is a more realistic assessment of the system.


Performance Metrics for Parallel Systems


70

• Performance Metrics #4:Efficiency


• Efficiency is a measure of the fraction of time for which a processing
element is usefully employed; it is defined as the ratio of speedup to
the number of processing elements.
• In an ideal parallel system, speedup is equal to p and efficiency is
equal to one.


71

• Performance Metric #5: Cost
  • The product of parallel execution time and the number of PEs.
  • The total amount of time spent by all PEs to solve the problem.
  • Cost-optimal: parallel cost ≅ serial cost.

Tutorial: Edge detection on images

72

1. Given an n x n pixel image, the problem of detecting edges
   corresponds to applying a 3 x 3 template to each pixel (see the sketch below).
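A minimal serial C sketch of the computation to be parallelized: each output pixel is a weighted sum of its 3 x 3 neighbourhood. The particular template values (a simple Laplacian-style edge mask) and the border handling are illustrative assumptions, not specified on the slide:

    /* Apply a 3x3 edge-detection template to an n x n image.
     * img and out are n*n arrays in row-major order; the one-pixel
     * border is left untouched for simplicity. */
    void apply_template(const float *img, float *out, int n) {
        static const float t[3][3] = {
            { -1.0f, -1.0f, -1.0f },
            { -1.0f,  8.0f, -1.0f },
            { -1.0f, -1.0f, -1.0f },
        };
        for (int i = 1; i < n - 1; i++) {
            for (int j = 1; j < n - 1; j++) {
                float acc = 0.0f;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        acc += t[di + 1][dj + 1] * img[(i + di) * n + (j + dj)];
                out[i * n + j] = acc;   /* each pixel is independent */
            }
        }
    }

Because every output pixel depends only on its input neighbourhood, the n x n pixels can be distributed across processing elements independently.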


M2 : Measures of Parallel Algorithms


73

• Analytical Modeling of Parallel Programs:


• Sources of Overhead in Parallel Programs
• Performance Metrics for Parallel Systems,
• The Effect of Granularity on Performance;
• Parallel Platforms: Models (SIMD,MIMD, SPMD)
• Communication (Shared Address Space vs. Message Passing)


What is Granularity
74

• The number and size of the tasks into which a problem is decomposed
  determine the granularity of the decomposition.
  • Smaller granularity
  • Larger granularity
  • Scaling down

What is the cost of adding n numbers serially?
What is the cost of adding n numbers in parallel?


The Effect of Granularity on Performance


75

• Scaling down:
  • Often, using fewer processors improves the performance of parallel
    systems.
  • Using fewer than the maximum possible number of processing
    elements to execute a parallel algorithm is called scaling down.
  • If p physical processing elements simulate n virtual processing elements, the
    number of processing elements decreases by a factor of n/p and the
    computation at each processing element increases by a factor of n/p.


Adding n numbers on p processing elements


76

• Adding n numbers on p processing elements, for n = 16 and p = 4.
• Virtual processing element i is simulated by the physical processing
  element labeled i mod p.


77

1. The first log p of the log n steps of the original algorithm are simulated
   in (n/p) log p steps on p processing elements.


78

Each addition and communication step takes constant time (t_s + t_w).

Hence, on the n-processor machine, the parallel time is T_P = Θ(log n).


79

What are the speedup, cost, and efficiency?

S = T_S / T_P
E = S / p
Cost = p * T_P
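Evaluating these for the case just analyzed (adding n numbers on n processing elements, with T_S = Θ(n) and T_P = Θ(log n)), following the course textbook's analysis:

    \begin{align*}
    S &= \frac{T_S}{T_P} = \Theta\!\left(\frac{n}{\log n}\right), \\
    E &= \frac{S}{n} = \Theta\!\left(\frac{1}{\log n}\right), \\
    \text{Cost} &= n \cdot T_P = \Theta(n \log n) > \Theta(n) = T_S
      \quad\Rightarrow\quad \text{not cost-optimal.}
    \end{align*}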


Cost: An Example
80


81

1. Adding n numbers cost-optimally:
   • The parallel runtime is Θ(n/p + log p), so the cost is Θ(n + p log p). As long
     as n = Ω(p log p), the cost is Θ(n), which is the same as the serial runtime.
     Hence, this parallel system is cost-optimal.


Scaling Characteristics of Parallel Programs


82

• For a given problem size (i.e., the value of T_S remains constant), as
  we increase the number of processing elements, T_o increases.
• The overall efficiency of the parallel program goes down. This is the
  case for all parallel programs.


83

1. Granularity of Task Decomposition


• Fine-grained decomposition: large number of small tasks
• Coarse-grained decomposition: small number of large tasks
• Degree of Concurrency: # of tasks that can execute in parallel


84


85

• Fine-grained parallelism
  • A program is broken down into a large number of small tasks.
  • Tasks are assigned individually to many processors.
  • The work is evenly distributed among the processors.
  • This increases the communication and synchronization overhead.
  • Fine-grained parallelism is best exploited in architectures that support fast
    communication.
  • Shared-memory architectures are most suitable for fine-grained parallelism.
    Why?


86

• Coarse-grained parallelism
  • A program is split into large tasks; a large amount of computation takes
    place in each processor.
  • Load imbalance: certain tasks process the bulk of the data while others
    might be idle.
  • Low communication and synchronization overhead.


87

• Example: Consider a 10 x 10 image that needs to be processed, given that the
  processing of the 100 pixels is independent of each other.
  • Fine-grained parallelism:
    • 100 processors are responsible for processing the 10 x 10 image.
    • The 100 processors can process the image in 1 clock cycle.
    • Each processor works on 1 pixel of the image.
  • Medium-grained parallelism:
    • 25 processors process the 10 x 10 image.
    • The processing of the image now takes 4 clock cycles.
  • Coarse-grained parallelism:
    • If we reduce the number of processors to 2, processing takes 50 clock cycles.


88

Finding the best grain size depends on a number of factors and varies
greatly from problem to problem.

89

• Building Granularity: Example


• Consider the problem of adding n numbers on p processing elements
such that p < n and both n and p are powers of 2.


M2 : Measures of Parallel Algorithms


90

• Analytical Modeling of Parallel Programs:


• Sources of Overhead in Parallel Programs
• Performance Metrics for Parallel Systems,
• The Effect of Granularity on Performance;
• Taxonomy Parallel Platforms: Models (SIMD,MIMD, SPMD)
• Communication (Shared Address Space vs. Message Passing)


Taxonomy of Parallel computer


91

• SISD: single instruction and single data stream; a uniprocessor.
• MISD: no commercial multiprocessor; imagine data going through a
  pipeline of execution engines.
• SIMD: vector architectures; lower flexibility.
• MIMD: most multiprocessors today; easy to construct with off-the-shelf
  computers, most flexibility.


92

• MIMD machines are broadly categorized based on the way PEs are coupled to the
  main memory.
  • Shared-memory MIMD
    • Tightly coupled multiprocessor systems.
    • All PEs are connected to a single global memory and they all have access to it.
    • Symmetric Multi-Processing (SMP).
    • Less likely to scale.
  • Distributed-memory MIMD
    • Loosely coupled multiprocessor systems.
    • All PEs have a local memory.
    • Communication between PEs in this model takes place through the interconnection
      network.

93


M2 : Measures of Parallel Algorithms


94

• Analytical Modeling of Parallel Programs:


• Sources of Overhead in Parallel Programs
• Performance Metrics for Parallel Systems,
• The Effect of Granularity on Performance;
• Taxonomy Parallel Platforms: Models (SIMD,MIMD, SPMD)
• Communication (Shared Address Space vs. Message Passing)


Shared Memory(UMA/NUMA)
95

• A global address space provides a user-friendly programming
  perspective on memory.
• Data sharing between tasks is both fast and uniform due to the
  proximity of memory to CPUs.
• The primary disadvantage is the lack of scalability between memory
  and CPUs.
• Cache-coherent systems see geometrically increasing traffic associated
  with cache/memory management as processors are added.
• The programmer is responsible for synchronization constructs that
  ensure "correct" access to global memory.

Images: https://cvw.cac.cornell.edu/parallel/shared

Distributed Memory
96

• Requires a communication network to connect inter-processor memory.
• There is no concept of a global address space across all processors.
• Changes a processor makes to its local memory have no effect on the memory
  of other processors.
• Memory is scalable with the number of processors.
• The programmer is responsible for many of the details associated
  with data communication between processors.
• Non-uniform memory access (NUMA) times.


Scalability of Parallel Systems


97

• Programs are designed and tested for smaller problems on fewer
  processing elements. However, the real problems these programs
  are intended to solve are much larger, and the machines contain a
  larger number of processing elements.

Scalability is a reflection of the system's ability to efficiently utilize
increased processing resources.


Scaling Characteristics of Parallel Programs


98

• Scaling Characteristics of Parallel Programs
  • For a given problem size (i.e., the value of T_S remains constant), as we
    increase the number of processing elements, T_o increases.
  • The overall efficiency of the parallel program goes down. This is the case
    for all parallel programs.

• The total overhead function T_o is an increasing function of p.


Example
99
Consider the problem of adding n numbers on p processing
elements.

Assuming unit time for adding two numbers:
• The first phase (local summations) of the algorithm takes roughly n/p time.
• The second phase involves log p steps, with a communication and an addition
  at each step. If a single communication takes unit time as well, the time for
  this phase is 2 log p.
Therefore, we can derive the parallel time, speedup, and efficiency as follows:
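Reconstructing the formulas implied by the two phases above (the original slide showed them as an image):

    \begin{align*}
    T_P &= \frac{n}{p} + 2\log p, \\
    S &= \frac{T_S}{T_P} = \frac{n}{\frac{n}{p} + 2\log p}, \\
    E &= \frac{S}{p} = \frac{1}{1 + \frac{2p\log p}{n}}.
    \end{align*}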


Speedup and Efficiency of adding n numbers on p processing elements

100

What are the speedup and efficiency?


101

• A parallel architecture is said to be scalable if it can be


expanded (reduced) to a larger (smaller) system with a linear
increase (decrease) in its performance (cost)

A scalable system is capable of increasing its speed in proportion to the increase


in the number of processors

Effect of increasing the problem size keeping the number of processing elements constant

102

• For a given problem size, as we increase the number of processing elements,
  the overall efficiency of the parallel system goes down.

• Observation 1: efficiency increases if the problem size is increased while
  keeping the number of processing elements constant.

• The total overhead function T_o is a function of both the problem size T_S
  and the number of processing elements p.

• The ability to maintain efficiency at a fixed value by simultaneously increasing
  the number of processing elements and the size of the problem is exhibited by
  many parallel systems. We call such systems scalable parallel systems.

Observation on Scaling Characteristics of Parallel Programs


103

• For a given problem size, as we increase the number of processing


elements, the overall efficiency of the parallel system goes down. This
phenomenon is common to all parallel systems.
• In many cases, the efficiency of a parallel system increases if the problem
size is increased while keeping the number of processing elements
constant.
• Scalability of a parallel system is a measure of its capacity to increase
speedup in proportion to the number of processing elements


T-02: Speedup and Efficiency


104

1. Suppose a task has:
   • f = 0.75 (75% of the task is parallelizable),
   • P = 4 processors.
   What is the speedup? Answer: 2.29 (see the calculation below).
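The calculation behind the 2.29 figure uses Amdahl's law (named on a later slide):

    S(P) = \frac{1}{(1 - f) + f/P} = \frac{1}{0.25 + 0.75/4} = \frac{1}{0.4375} \approx 2.29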


105

1. Example 1: Effect of Increasing Processors
   • A task is 80% parallelizable and there are P = 2, 4, and 8 processors.
     Calculate the speedup for each case.
2. Maximum Speedup
   • A task is 90% parallelizable. What is the maximum possible speedup?
3. Parallel Efficiency
4. Identify the Bottleneck
   • A task shows a speedup of 3.23 when P = 4. Determine what fraction of the
     task is parallelizable.


106


• Amdahl’s Law
107


108
