Module 1&2
Dr. Bheemappa H
IIIT, Sricity
bheemappa.h@iiit.in
https://bheemhh.github.io/
HPC-25|IIITS
Outline
● Course Details
● Syllabus
● Evaluation Plan
● Reference Materials
● SUPERCOMPUTERS
Syllabus
Module 1: Introduction: Motivating Parallelism, Scope of Parallel Computing; Introduction to HPC: Parallel Programming Platforms; Implicit Parallelism: Trends in Microprocessor Architectures; Limitations of Memory System Performance; Dichotomy of Parallel Computing Platforms; Physical Organization of Parallel Platforms. (6 Hrs)
Module 2: Parallel Computer Memory Architectures: Measures of Parallel Algorithms; Analytical Modeling of Parallel Programs: Sources of Overhead in Parallel Programs, Performance Metrics for Parallel Systems, the Effect of Granularity on Performance; Parallel Platforms: Models (SIMD, MIMD, SPMD), Communication (Shared Address Space vs. Message Passing). (7 Hrs)
Module 3: Thread Basics: Why Threads? The POSIX Thread API, Thread Creation and Termination, Synchronization Primitives in Pthreads, Controlling Thread and Synchronization Attributes, Thread Cancellation, Composite Synchronization Constructs. (6 Hrs)
Module 4: Distributed Memory Parallel Programming – Tips for Designing Asynchronous Programs; OpenMP: A Standard for Directive-Based Parallel Programming. (6 Hrs)
Module 5: The Building Blocks: Send and Receive Operations; MPI: the Message Passing Interface; Topologies and Embedding; Overlapping Communication with Computation; Collective Communication and Computation Operations; Groups and Communicators. (7 Hrs)
Module 6: The Age of Parallel Processing; Central Processing Units; The Rise of GPU Computing: A Brief History of GPUs, Early GPU Computing; CUDA: What Is the CUDA Architecture, Using the CUDA Architecture, Applications of CUDA (Medical Imaging, Computational Fluid Dynamics, Environmental Science); Introduction to CUDA C: A First Program, Hello World, A Kernel Call, Passing Parameters, Querying Devices, Using Device Properties; Parallel Programming in CUDA C: CUDA Parallel Programming, Summing Vectors, A Fun Example. (7 Hrs)
TEXT BOOKS
EVALUATION PLAN
Week 1: Class 2
M1:
● Introduction
● Introduction to HPC
● Motivating Parallelism
● Scope of Parallel Computing
● Parallel Programming Platforms
○ Implicit Parallelism: Trends in Microprocessor Architectures
○ Limitations of Memory System Performance
● Dichotomy of Parallel Computing Platforms
○ Control Structure of Parallel Platforms
○ Communication Model of Parallel Platforms
● Physical Organization of Parallel Platforms
○ Architecture of an Ideal Parallel Computer
○ Interconnection Networks for Parallel Computers
Historical Background
Introduction to HPC
● Parallel Processing
● Handling Large Datasets
● Complex Simulations
“HPC refers to techniques, algorithms, and methodologies used to achieve high computational performance.”
● HPC systems include:
○ Supercomputers, distributed computing, specialized hardware.
SUPERCOMPUTERS
● In the 1980s
○ 1×10⁶ Floating Point Ops/sec (Mflop/s)
○ Scalar based
● In the 1990s
○ 1×10⁹ Floating Point Ops/sec (Gflop/s)
○ Vector & shared memory computing
● Today
○ 1×10¹² Floating Point Ops/sec (Tflop/s)
○ Highly parallel, distributed processing, message passing
Computer Technology
● Performance improvements:
○ Improvements in semiconductor technology
■ Feature size, clock speed
○ Improvements in computer architectures
■ Enabled by HLL compilers, UNIX
■ Led to RISC architectures
○ Together have enabled:
■ Lightweight computers
■ Productivity-based managed/interpreted programming languages
■ SaaS, Virtualization, Cloud
○ Applications evolution:
■ Speech, sound, images, video, “augmented/extended reality”, “big data”
Moore’s Law
Motivating Parallelism
Pipelining
Superscalar Execution
• Due to limited parallelism in typical instruction traces, dependencies, or the inability of the scheduler to extract parallelism, the performance of superscalar processors is eventually limited.
Very Long Instruction Word (VLIW) Processors
1. VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
2. The instructions are packed and dispatched together, hence the name very long instruction word (e.g., Intel IA-64).
3. The compiler has a bigger context from which to select co-scheduled instructions.
4. Compilers, however, do not have runtime information such as cache misses; scheduling is therefore inherently conservative.
VLIW performance is highly dependent on the compiler.
HW
Parallelism
Parallelism from a single instruction on multiple processors (SIMD)
Message-Passing Platforms
• Switches:
• Map a fixed number of inputs to outputs; the degree of the switch is its total number of ports.
• Network Interfaces:
• Processors talk to the network via a network interface.
• The interface provides connectivity between the nodes and the network.
• Network Topologies:
• Bus-Based Networks:
• Buses are ideal for broadcasting information among nodes.
• Because the transmission medium is shared, there is little overhead associated with broadcast compared to point-to-point message transfer.
• Shared-bus bandwidth demand can be reduced using caches.
• Network Topologies:
• Crossbar Networks:
• A non-blocking network: the connection of a processing node to a memory bank does not block the connection of any other processing node to any other memory bank.
• Multistage Networks
• Tree-Based Networks
M2: Measures of Parallel Algorithms
• Execution Time
• Overhead
• Speedup
• Efficiency
• Cost.
• What if the serial quicksort took only 30 seconds? In that case, the speedup is 30/40 = 0.75, which is a more realistic assessment of the system.
Tutorial: Edge detection on images
What is Granularity?
• The number and size of the tasks into which a problem is decomposed determine the granularity of the decomposition.
• Fine (small) granularity
• Coarse (large) granularity
• Scaling down
• Scaling down:
• Often, using fewer processors improves the performance of parallel systems.
• Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down.
• The number of processing elements decreases by a factor of n/p, and the computation at each processing element increases by a factor of n/p.
• Adding n numbers on p processing elements, for n = 16 and p = 4.
• Virtual processing element i is simulated by the physical processing element labeled i mod p.
S = Ts / Tp
E = S / p
Cost: An Example
• Fine-grained parallelism
• A program is broken down into a large number of small tasks.
• Tasks are assigned individually to many processors.
• The work is evenly distributed among the processors.
• Increases the communication and synchronization overhead.
• Fine-grained parallelism is best exploited in architectures that support fast communication.
• Why are shared memory architectures most suitable for fine-grained parallelism?
• Coarse-grained parallelism
• A program is split into large tasks; a large amount of computation takes place within each processor.
• Load imbalance: certain tasks process the bulk of the data while others might be idle.
• Low communication and synchronization overhead.
Finding the best grain size depends on a number of factors and varies greatly from problem to problem.
• MIMD machines are broadly categorized based on the way PEs are coupled to the main memory.
• Shared-memory MIMD
• Tightly coupled multiprocessor systems.
• All PEs are connected to a single global memory, and they all have access to it.
• Also known as Symmetric Multi-Processing.
• Less likely to scale.
• Distributed-memory MIMD
• Loosely coupled multiprocessor systems.
• All PEs have a local memory.
• Communication between PEs in this model takes place through the interconnection network.
Shared Memory (UMA/NUMA)
Distributed Memory
Example
Consider the problem of adding n numbers on p processing
elements.
Effect of increasing the problem size keeping the number of processing elements constant.
• Amdahl’s Law