Multiprocessing,
multithreading and
vectorization
Sparsh Mittal
IIT Hyderabad, India
Some slides courtesy of S. R. Sarangi and others
Background
Strong and weak scaling
• Strong scaling: how solution time varies with the
number of cores for a fixed total problem size.
– Use 2X machines for a task => solve it in half the
time
• Weak scaling: how solution time varies with the
number of cores for a fixed problem size per core.
– If the dataset is twice as big, use 2X machines to solve
the task in constant time.
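A rough way to state the two notions (the notation here is assumed, not given on the slides): let $T(N, p)$ be the time to solve a problem of size $N$ on $p$ cores.

$$S(p) = \frac{T(N, 1)}{T(N, p)} \qquad \text{(strong-scaling speedup; ideally } S(p) = p\text{)}$$

$$E(p) = \frac{T(N, 1)}{T(pN, p)} \qquad \text{(weak-scaling efficiency; ideally } E(p) = 1\text{)}$$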
Shared Memory vs Message Passing
Shared Memory
All the threads share the virtual address space.
They can communicate with each other by reading and
writing values from/to shared memory.
● Application ensures no data corruption (Lock/Unlock)
● Example language: OpenMP, CUDA
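As a minimal sketch of the shared-memory model (illustrative code, not from the slides), the OpenMP fragment below has all threads update a single shared variable; the reduction clause takes the place of explicit Lock/Unlock and prevents data corruption:

#include <stdio.h>

int main(void) {
    long sum = 0;
    /* All threads share the address space, so they all see 'sum'.
       reduction(+:sum) gives each thread a private copy and safely
       combines the copies at the end (no data corruption). */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000; i++)
        sum += i;
    printf("sum = %ld\n", sum);   /* prints 500500 */
    return 0;
}

Compile with, e.g., cc -fopenmp; without the flag the pragma is ignored and the loop simply runs sequentially.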
Message Passing
Programs communicate with each other by sending and
receiving messages (e.g., sending emails).
They do not share memory addresses.
Example language: MPI
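For message passing, a minimal MPI sketch (illustrative; run with at least 2 processes, e.g., mpirun -np 2): the processes share no memory, so rank 0 must explicitly send the value to rank 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;   /* lives only in rank 0's address space */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;        /* rank 1 has its own, separate memory  */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}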
Types of Parallelism
• Instruction Level Parallelism
– Different instructions within a stream can be executed in parallel
– Pipelining, out-of-order execution, speculative execution, VLIW
– Dataflow
• Data Parallelism
– Different pieces of data can be operated on in parallel
– SIMD: Vector processing, array processing
– Systolic arrays, streaming processors
• Task Level Parallelism
– Different “tasks/threads” can be executed in parallel
– Multithreading
– Multiprocessing (multi-core)
Flynn's Taxonomy
Flynn's Classification
Instruction stream → Set of
instructions that are executed
Data stream → Data values that the
instructions process
Four types of machines: SISD,
SIMD, MISD, MIMD
SISD and SIMD
SISD → Standard uniprocessor
SIMD → One instruction operates on
multiple pieces of data. Vector processors
have one instruction that operates on
many pieces of data in parallel. For
example, one instruction can compute the
sin⁻¹ of 4 values in parallel.
MISD
MISD → Multiple Instruction Single Data
Very rare in practice
Consider an aircraft that has a MIPS, an ARM, and
an X86 processor operating on the same data
(multiple instruction streams)
We have different instructions operating on the
same data
The final outcome is decided on the basis of a
majority vote.
MIMD
MIMD → Multiple instruction, multiple
data (two types, SPMD, MPMD)
SPMD → Single program, multiple data.
Examples: OpenMP or MPI programs. We
typically have multiple processes or threads
executing the same program with different
inputs.
MPMD → A master program delegates work to
multiple slave programs. The programs are
different.
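A minimal SPMD sketch in MPI (illustrative): every process executes the same program, and each one uses its rank to decide which part of the data to work on:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes?  */

    /* Single program, multiple data: the code is identical
       everywhere, but each rank processes its own slice. */
    printf("rank %d of %d: working on slice %d\n", rank, size, rank);

    MPI_Finalize();
    return 0;
}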
Summary
• SISD: Single instruction operates on single data element
• SIMD: Single instruction operates on multiple data
elements
– Array processor
– Vector processor
• MISD: Multiple instructions operate on single data
element
– Closest form: systolic array processor, streaming
processor
• MIMD: Multiple instructions operate on multiple data
elements (multiple instruction streams)
– Multiprocessor
– Multithreaded processor
Multiprocessing
The term multiprocessing refers to multiple processors
working in parallel.
This is a generic definition, and it can refer to multiple
processors in the same chip, or processors across different
chips.
A multicore processor is a specific type of multiprocessor that
contains all of its constituent processors in the same chip.
Each such processor is known as a core.
Multithreading
The Notion of Threads
We spawn a set of separate threads
Properties of threads
A thread shares its address space with other
threads
It has its own program counter, set of registers,
and stack
A process contains multiple threads
Threads communicate with each other by
writing values to memory or via
synchronization operations
Operation of the Program
[Figure: fork-join execution over time. The parent thread runs an
initialisation (sequential) section, spawns child threads that execute
in parallel, and a thread join operation ends the parallel phase,
after which execution is sequential again.]
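The fork-join pattern in the figure can be written concretely with POSIX threads (a minimal sketch; the worker function and results array are invented for illustration):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Each thread has its own stack, but all threads share the process
   address space, so they can all write to this global array. */
static int results[NTHREADS];

static void *worker(void *arg) {
    int id = (int)(long)arg;   /* thread-private identifier */
    results[id] = id * id;     /* communicate by writing to shared memory */
    return NULL;
}

int main(void) {
    pthread_t tids[NTHREADS];

    /* Parent thread: initialisation, then spawn child threads. */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);

    /* Thread join operation: wait for all children to finish. */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);

    /* Sequential section. */
    for (int i = 0; i < NTHREADS; i++)
        printf("results[%d] = %d\n", i, results[i]);
    return 0;
}

Compile with, e.g., cc -pthread.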
Multithreading
Multithreading → A design paradigm that
proposes to run multiple threads on the
same pipeline.
Three types
Coarse grained
Fine grained
Simultaneous
Analogy
• Consider a car that can be shared by 4 people (A,
B, C, D) in the following way:
• Option 1: All four people ride the car together
almost all the time to go to their destinations.
• Option 2: A uses the car for 15 days, B for 14 days, C
for 18 days and D for 16 days (and repeat).
• Option 3: A uses the car for 8 hours, B for 9 hours,
C for 5 hours and D for 7 hours (and repeat).
• Match the above three options to three types of
multithreading
Coarse Grained Multithreading
Assume that we want to run 4 threads on
a pipeline
Run thread 1 for n cycles, run thread 2 for
n cycles, ….
Implementation
Steps to minimize the context switch
overhead
For a 4-way coarse grained MT machine
4 program counters
4 register files
4 flags registers
A context register that contains a thread id.
Zero overhead context switching → Change the thread
id in the context register
Advantages
Assume that thread 1 has an L2 miss
Wait for 200 cycles
Schedule thread 2
Now let us say that thread 2 has an L2 miss
Schedule thread 3
We can have a sophisticated algorithm that
switches every n cycles, or when there is a long
latency event such as an L2 miss.
Minimises idle cycles for the entire system
Fine Grained Multithreading
The switching granularity is very small
1-2 cycles
Advantages:
Can hide even low-latency events such as a division
or an L1 cache miss
Minimises idle cycles to an even greater extent
Correctness Issues
Instructions of 2 threads can be in the pipeline
simultaneously.
We never forward values or interlock between instructions
of different threads.
Simultaneous Multithreading
Most modern processors have multiple
issue slots
Can issue multiple instructions to the functional
units
For example, a 3-issue processor can fetch, decode,
and execute up to 3 instructions per cycle
If a benchmark has low ILP (instruction level
parallelism), then fine and coarse grained
multithreading cannot really help.
Simultaneous Multithreading
Main Idea
Partition the issue slots across threads
Scenario: In the same cycle
Issue 2 instructions for thread 1
and, issue 1 instruction for thread 2
and, issue 1 instruction for thread 3
Support required
Need smart instruction selection logic.
Balance fairness and throughput
Summary
[Figure: occupancy of issue slots over time for coarse grained,
fine grained, and simultaneous multithreading; threads 1-4 fill
the slots in different patterns under each scheme.]
Vectorization (and vector
processor)
BIG PICTURE
Vectorization
[Figure. Source: https://colfaxresearch.com/knl-avx512/]
Some of the SIMD instruction sets used in
industry

Register size    Instruction set
64 bits          MMX
128 bits         SSE1/SSE2 etc.
256 bits         AVX/AVX2
512 bits         AVX-512
Vector Processors
A vector instruction operates on arrays of
data
Example: There are vector instructions to add or
multiply two arrays of data, and produce an array as
output
Advantage: Can be used to perform all kinds of
array, matrix, and linear algebra operations. These
operations form the core of many scientific
programs, high intensity graphics, and data analytics
applications.
Software Interface
Let us define a vector register
Example: 128-bit registers in the SSE instruction set
→ XMM0 … XMM15
Can hold 4 floating point values, or 8 2-byte short
integers
Addition of vector registers is equivalent to pairwise
addition of each of the individual elements.
The result is saved in a vector register of the same
size.
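For instance, with x86 SSE intrinsics (a sketch; the array contents are arbitrary example data), adding two 128-bit vector registers performs four pairwise float additions at once:

#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = { 1.0f,  2.0f,  3.0f,  4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);    /* load 4 floats into a 128-bit XMM register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb); /* pairwise addition of the 4 lanes */
    _mm_storeu_ps(c, vc);           /* result is a vector of the same size */

    for (int i = 0; i < 4; i++) printf("%.1f ", c[i]);  /* 11 22 33 44 */
    printf("\n");
    return 0;
}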
Example of Vector Addition
Let us define 8 128-bit vector registers in SimpleRisc: vr0 ... vr7
[Figure: element-wise addition of vr1 and vr2, producing vr3.]
Loading Vector Registers
There are two options:
Option 1 : We assume that the data elements are
stored in contiguous locations
Let us define the v.ld instruction that uses this
assumption.
Instruction          Semantics
v.ld vr1, 12[r1]     vr1 ← ([r1+12], [r1+16], [r1+20], [r1+24])
Option 2: Assume that the elements are not saved in
contiguous locations.
For this, there are scatter-gather instructions
Scatter Gather Operation
The data is scattered in memory
The load operation needs to gather the data and
save it in a vector register.
Let us define a scatter gather version of the load
instruction → v.sg.ld
It uses another vector register that contains the
addresses of each of the elements.
Instruction          Semantics
v.sg.ld vr1, vr2     vr1 ← ([vr2[0]], [vr2[1]], [vr2[2]], [vr2[3]])
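Real ISAs offer similar support. For example, AVX2 provides a gather intrinsic; unlike v.sg.ld above, it takes a base pointer plus a vector of indices rather than a vector of full addresses (a sketch with invented example data; compile with -mavx2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float table[16];
    for (int i = 0; i < 16; i++) table[i] = 10.0f * (float)i;

    /* The elements we want are scattered through 'table'. */
    __m256i idx = _mm256_setr_epi32(0, 3, 5, 7, 8, 11, 13, 15);

    /* Gather: v[k] = table[idx[k]]; the last argument is the
       scale in bytes (4 = sizeof(float)). */
    __m256 v = _mm256_i32gather_ps(table, idx, 4);

    float out[8];
    _mm256_storeu_ps(out, v);
    for (int k = 0; k < 8; k++) printf("%.1f ", out[k]);
    printf("\n");
    return 0;
}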
Vector Store Operation
We can similarly define two vector store
operations
Instruction          Semantics
v.sg.st vr1, vr2     [vr2[0]] ← vr1[0]
                     [vr2[1]] ← vr1[1]
                     [vr2[2]] ← vr1[2]
                     [vr2[3]] ← vr1[3]

Instruction          Semantics
v.st vr1, 12[r1]     [r1+12] ← vr1[0]
                     [r1+16] ← vr1[1]
                     [r1+20] ← vr1[2]
                     [r1+24] ← vr1[3]
Vector Operations
We can now define custom operations on vector
registers
v.add → Adds two vector registers
v.mul → Multiplies two vector registers
We can even have operations that have a vector
operand and a scalar operand → Multiply a vector
with a scalar.
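With SSE intrinsics, a vector-scalar operation is typically expressed by first broadcasting the scalar into every lane of a vector register (a sketch with arbitrary example values):

#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4];

    __m128 vx = _mm_loadu_ps(x);
    __m128 vs = _mm_set1_ps(2.5f);   /* broadcast the scalar 2.5 to all 4 lanes */
    __m128 vy = _mm_mul_ps(vx, vs);  /* element-wise vector * scalar multiply */
    _mm_storeu_ps(y, vy);

    for (int i = 0; i < 4; i++) printf("%.1f ", y[i]);  /* 2.5 5.0 7.5 10.0 */
    printf("\n");
    return 0;
}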
Design of a Vector Processor
Salient Points
We have a vector register file and a scalar register file
There are scalar and vector functional units
Unless we are converting a vector to a scalar or vice
versa, we generally do not forward values between
vector and scalar instructions
The memory unit needs support for regular operations,
vector operations, and possibly scatter-gather
operations.