Multiprocessing,
multithreading and
vectorization
Sparsh Mittal
IIT Hyderabad, India
Some slides courtesy of S. R. Sarangi and others
Background
Strong and weak scaling
• Strong scaling: how solution time varies with the
number of cores for a fixed total problem size.
– Use 2X machines for a task => solve it in half the
time
• Weak scaling: how solution time varies with the
number of cores for a fixed problem size per core.
– If the dataset is twice as big, use 2X machines to solve
the task in constant time.
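A rough way to state the two notions (the notation here is assumed, not given on the slides): let $T(N, p)$ be the time to solve a problem of size $N$ on $p$ cores.

$$S(p) = \frac{T(N, 1)}{T(N, p)} \qquad \text{(strong-scaling speedup; ideally } S(p) = p\text{)}$$

$$E(p) = \frac{T(N, 1)}{T(pN, p)} \qquad \text{(weak-scaling efficiency; ideally } E(p) = 1\text{)}$$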
Shared Memory vs Message Passing
Shared Memory
All the threads share the virtual address space.
They can communicate with each other by reading and
writing values from/to shared memory.
● Application ensures no data corruption (Lock/Unlock)
● Example language: OpenMP, CUDA
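As a minimal sketch of the shared-memory model (illustrative code, not from the slides), the OpenMP fragment below has all threads update a single shared variable; the reduction clause takes the place of explicit Lock/Unlock and prevents data corruption:

#include <stdio.h>

int main(void) {
    long sum = 0;
    /* All threads share the address space, so they all see 'sum'.
       reduction(+:sum) gives each thread a private copy and safely
       combines the copies at the end (no data corruption). */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000; i++)
        sum += i;
    printf("sum = %ld\n", sum);   /* prints 500500 */
    return 0;
}

Compile with, e.g., cc -fopenmp; without the flag the pragma is ignored and the loop simply runs sequentially.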
Message Passing
Programs communicate with each other by sending and
receiving messages (e.g., sending emails).
They do not share memory addresses.
Example language: MPI
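For message passing, a minimal MPI sketch (illustrative; run with at least 2 processes, e.g., mpirun -np 2): the processes share no memory, so rank 0 must explicitly send the value to rank 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;   /* lives only in rank 0's address space */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;        /* rank 1 has its own, separate memory  */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}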
Types of Parallelism
• Instruction Level Parallelism
– Different instructions within a stream can be executed in parallel
– Pipelining, out-of-order execution, speculative execution, VLIW
– Dataflow
• Data Parallelism
– Different pieces of data can be operated on in parallel
– SIMD: Vector processing, array processing
– Systolic arrays, streaming processors
• Task Level Parallelism
– Different “tasks/threads” can be executed in parallel
– Multithreading
– Multiprocessing (multi-core)
Flynn's Taxonomy
Flynn's Classification
Instruction stream → Set of
instructions that are executed
Data stream → Data values that the
instructions process
Four types of machines: SISD,
SIMD, MISD, MIMD
SISD and SIMD
SISD → Standard uniprocessor
SIMD → One instruction operates on
multiple pieces of data. Vector processors
have one instruction that operates on
many pieces of data in parallel. For
example, one instruction can compute the
sin⁻¹ of 4 values in parallel.
MISD
MISD → Multiple Instruction Single Data
Very rare in practice
Consider an aircraft that has a MIPS, an ARM, and
an X86 processor operating on the same data
(multiple instruction streams)
We have different instructions operating on the
same data
The final outcome is decided on the basis of a
majority vote.
MIMD
MIMD → Multiple instruction, multiple
data (two types, SPMD, MPMD)
SPMD → Single program, multiple data.
Examples: OpenMP or MPI programs. We
typically have multiple processes or threads
executing the same program with different
inputs.
MPMD → A master program delegates work to
multiple slave programs. The programs are
different.
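A minimal SPMD sketch in MPI (illustrative): every process executes the same program, and each one uses its rank to decide which part of the data to work on:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes?  */

    /* Single program, multiple data: the code is identical
       everywhere, but each rank processes its own slice. */
    printf("rank %d of %d: working on slice %d\n", rank, size, rank);

    MPI_Finalize();
    return 0;
}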
Summary
• SISD: Single instruction operates on single data element
• SIMD: Single instruction operates on multiple data
elements
– Array processor
– Vector processor
• MISD: Multiple instructions operate on single data
element
– Closest form: systolic array processor, streaming
processor
• MIMD: Multiple instructions operate on multiple data
elements (multiple instruction streams)
– Multiprocessor
– Multithreaded processor
Multiprocessing
The term multiprocessing refers to multiple processors
working in parallel.
This is a generic definition, and it can refer to multiple
processors in the same chip, or processors across different
chips.
A multicore processor is a specific type of multiprocessor that
contains all of its constituent processors in the same chip.
Each such processor is known as a core.
Multithreading
The Notion of Threads
We spawn a set of separate threads
Properties of threads
A thread shares its address space with other
threads
It has its own program counter, set of registers,
and stack
A process contains multiple threads
Threads communicate with each other by
writing values to memory or via
synchronization operations
Operation of the Program
[Figure: fork-join execution over time. The parent thread runs an
initialisation (sequential) section, spawns child threads that execute
in parallel, and a thread join operation ends the parallel phase,
after which execution is sequential again.]
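The fork-join pattern in the figure can be written concretely with POSIX threads (a minimal sketch; the worker function and results array are invented for illustration):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Each thread has its own stack, but all threads share the process
   address space, so they can all write to this global array. */
static int results[NTHREADS];

static void *worker(void *arg) {
    int id = (int)(long)arg;   /* thread-private identifier */
    results[id] = id * id;     /* communicate by writing to shared memory */
    return NULL;
}

int main(void) {
    pthread_t tids[NTHREADS];

    /* Parent thread: initialisation, then spawn child threads. */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);

    /* Thread join operation: wait for all children to finish. */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);

    /* Sequential section. */
    for (int i = 0; i < NTHREADS; i++)
        printf("results[%d] = %d\n", i, results[i]);
    return 0;
}

Compile with, e.g., cc -pthread.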
Multithreading
Multithreading → A design paradigm that
proposes to run multiple threads on the
same pipeline.
Three types
Coarse grained
Fine grained
Simultaneous
Analogy
• Consider a car that can be shared by 4 people (A,
B, C, D) in the following way:
• Option 1: All four people ride the car together
almost all the time to go to their destinations.
• Option 2: A uses the car for 15 days, B for 14 days, C
for 18 days and D for 16 days (and repeat).
• Option 3: A uses the car for 8 hours, B for 9 hours,
C for 5 hours and D for 7 hours (and repeat).
• Match the above three options to three types of
multithreading
Coarse Grained Multithreading
Assume that we want to run 4 threads on
a pipeline
Run thread 1 for n cycles, run thread 2 for
n cycles, ….
Implementation
Steps to minimize the context switch
overhead
For a 4-way coarse grained MT machine
4 program counters
4 register files
4 flags registers
A context register that contains a thread id.
Zero overhead context switching → Change the thread
id in the context register
Advantages
Assume that thread 1 has an L2 miss
Wait for 200 cycles
Schedule thread 2
Now let us say that thread 2 has an L2 miss
Schedule thread 3
We can have a sophisticated algorithm that
switches every n cycles, or when there is a long
latency event such as an L2 miss.
Minimises idle cycles for the entire system
Fine Grained Multithreading
The switching granularity is very small
1-2 cycles
Advantages:
Can hide even low-latency events such as a division
or an L1 cache miss
Minimises idle cycles to an even greater extent
Correctness Issues
Instructions of 2 threads can be in the pipeline
simultaneously.
We never forward values or interlock between instructions
of different threads.
Simultaneous Multithreading
Most modern processors have multiple
issue slots
Can issue multiple instructions to the functional
units
For example, a 3-issue processor can fetch, decode,
and execute up to 3 instructions per cycle
If a benchmark has low ILP (instruction level
parallelism), then fine and coarse grained
multithreading cannot really help.
Simultaneous Multithreading
Main Idea
Partition the issue slots across threads
Scenario: In the same cycle
Issue 2 instructions for thread 1
and, issue 1 instruction for thread 2
and, issue 1 instruction for thread 3
Support required
Need smart instruction selection logic.
Balance fairness and throughput
Summary
[Figure: occupancy of issue slots over time for coarse grained,
fine grained, and simultaneous multithreading; threads 1-4 fill
the slots in different patterns under each scheme.]
Vectorization (and vector
processor)
BIG PICTURE
Vectorization
[Figure. Source: https://colfaxresearch.com/knl-avx512/]
Some of the SIMD instruction sets used in
industry

Register size    Instruction set
64 bits          MMX
128 bits         SSE1/SSE2 etc.
256 bits         AVX/AVX2
512 bits         AVX-512
Vector Processors
A vector instruction operates on arrays of
data
Example: There are vector instructions to add or
multiply two arrays of data, and produce an array as
output
Advantage: Can be used to perform all kinds of
array, matrix, and linear algebra operations. These
operations form the core of many scientific
programs, high intensity graphics, and data analytics
applications.
Software Interface
Let us define a vector register
Example: 128-bit registers in the SSE instruction set
→ XMM0 … XMM15
Can hold 4 floating point values, or 8 2-byte short
integers
Addition of vector registers is equivalent to pairwise
addition of each of the individual elements.
The result is saved in a vector register of the same
size.
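For instance, with x86 SSE intrinsics (a sketch; the array contents are arbitrary example data), adding two 128-bit vector registers performs four pairwise float additions at once:

#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = { 1.0f,  2.0f,  3.0f,  4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);    /* load 4 floats into a 128-bit XMM register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb); /* pairwise addition of the 4 lanes */
    _mm_storeu_ps(c, vc);           /* result is a vector of the same size */

    for (int i = 0; i < 4; i++) printf("%.1f ", c[i]);  /* 11 22 33 44 */
    printf("\n");
    return 0;
}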
Example of Vector Addition
Let us define 8 128-bit vector registers in SimpleRisc: vr0 ... vr7
[Figure: element-wise addition of vr1 and vr2, producing vr3.]
Loading Vector Registers
There are two options:
Option 1 : We assume that the data elements are
stored in contiguous locations
Let us define the v.ld instruction that uses this
assumption.
Instruction          Semantics
v.ld vr1, 12[r1]     vr1 ← ([r1+12], [r1+16], [r1+20], [r1+24])
Option 2: Assume that the elements are not saved in
contiguous locations.
For this, there are scatter-gather instructions
Scatter Gather Operation
The data is scattered in memory
The load operation needs to gather the data and
save it in a vector register.
Let us define a scatter gather version of the load
instruction → v.sg.ld
It uses another vector register that contains the
addresses of each of the elements.
Instruction          Semantics
v.sg.ld vr1, vr2     vr1 ← ([vr2[0]], [vr2[1]], [vr2[2]], [vr2[3]])
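Real ISAs offer similar support. For example, AVX2 provides a gather intrinsic; unlike v.sg.ld above, it takes a base pointer plus a vector of indices rather than a vector of full addresses (a sketch with invented example data; compile with -mavx2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float table[16];
    for (int i = 0; i < 16; i++) table[i] = 10.0f * (float)i;

    /* The elements we want are scattered through 'table'. */
    __m256i idx = _mm256_setr_epi32(0, 3, 5, 7, 8, 11, 13, 15);

    /* Gather: v[k] = table[idx[k]]; the last argument is the
       scale in bytes (4 = sizeof(float)). */
    __m256 v = _mm256_i32gather_ps(table, idx, 4);

    float out[8];
    _mm256_storeu_ps(out, v);
    for (int k = 0; k < 8; k++) printf("%.1f ", out[k]);
    printf("\n");
    return 0;
}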
Vector Store Operation
We can similarly define two vector store
operations
Instruction          Semantics
v.sg.st vr1, vr2     [vr2[0]] ← vr1[0]
                     [vr2[1]] ← vr1[1]
                     [vr2[2]] ← vr1[2]
                     [vr2[3]] ← vr1[3]

Instruction          Semantics
v.st vr1, 12[r1]     [r1+12] ← vr1[0]
                     [r1+16] ← vr1[1]
                     [r1+20] ← vr1[2]
                     [r1+24] ← vr1[3]
Vector Operations
We can now define custom operations on vector
registers
v.add → Adds two vector registers
v.mul → Multiplies two vector registers
We can even have operations that have a vector
operand and a scalar operand → Multiply a vector
with a scalar.
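With SSE intrinsics, a vector-scalar operation is typically expressed by first broadcasting the scalar into every lane of a vector register (a sketch with arbitrary example values):

#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4];

    __m128 vx = _mm_loadu_ps(x);
    __m128 vs = _mm_set1_ps(2.5f);   /* broadcast the scalar 2.5 to all 4 lanes */
    __m128 vy = _mm_mul_ps(vx, vs);  /* element-wise vector * scalar multiply */
    _mm_storeu_ps(y, vy);

    for (int i = 0; i < 4; i++) printf("%.1f ", y[i]);  /* 2.5 5.0 7.5 10.0 */
    printf("\n");
    return 0;
}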
Design of a Vector Processor
Salient Points
We have a vector register file and a scalar register file
There are scalar and vector functional units
Unless we are converting a vector to a scalar or vice
versa, we generally do not forward values between
vector and scalar instructions
The memory unit needs support for regular operations,
vector operations, and possibly scatter-gather
operations.