Lecture 9: Multithreading
▪ Memory latencies, and even latencies to lower-level caches, are becoming longer relative to processor cycle times
▪ There are three basic ways to hide/tolerate such latencies by overlapping computation with the memory access
– Dynamic out-of-order scheduling
– Prefetching
– Multithreading
▪ OOO execution and prefetching allow overlap of computation and
memory access within the same thread (these were covered in CS3
Computer Architecture)
▪ Multithreading allows overlap of memory access of one thread/
process with computation by another thread/process
Blocked Multithreading
▪ Basic idea:
– Recall multi-tasking: on I/O a process is context-switched out of the processor by
the OS
[Figure: process 1 runs until a system call for I/O, the OS interrupt handler runs and switches to process 2, which runs until the I/O completion interrupt, after which process 1 runs again]
– With multithreading a thread/process is context-switched out of the pipeline by
the hardware on longer-latency operations
[Figure: process 1 runs until a long-latency operation, a hardware context switch brings in process 2, which runs until its own long-latency operation, and another hardware context switch resumes process 1]
Blocked Multithreading
▪ Basic idea:
– Unlike in multi-tasking, context is still kept in the processor and OS is not aware of
any changes
– Context switch overhead is minimal (usually only a few cycles)
– Unlike in multi-tasking, the completion of the long-latency operation does not
trigger a context switch (the blocked thread is simply marked as ready)
– Usually the long-latency operation is an L1 cache miss, but it can also be others, such as a floating-point or integer division (which takes 20 to 30 cycles and is unpipelined)
▪ Context of a thread in the processor:
– Registers
– Program counter
– Stack pointer
– Other processor status words
▪ Note: the term “multithreading” is commonly used to mean
simply the fact that the system supports multiple threads
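To make the blocked-multithreading mechanism concrete, here is a minimal Python sketch (not part of the course material; the names HWContext and BlockedMTCore are invented for illustration) of per-thread hardware contexts and a switch-on-long-latency policy in which completion of the operation only marks the blocked thread ready, without forcing a switch back:

```python
from dataclasses import dataclass, field

@dataclass
class HWContext:
    """Per-thread state kept inside the processor (illustrative model)."""
    pc: int = 0
    sp: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)
    status: str = "ready"            # "ready", "running" or "blocked"

class BlockedMTCore:
    """Blocked multithreading: switch only on long-latency operations."""
    def __init__(self, n_threads):
        self.ctx = [HWContext() for _ in range(n_threads)]
        self.current = 0
        self.ctx[0].status = "running"

    def on_long_latency_op(self):
        """E.g. an L1 miss: block the current thread and pick a ready one."""
        self.ctx[self.current].status = "blocked"
        for tid, c in enumerate(self.ctx):
            if c.status == "ready":
                self.current = tid   # hardware context switch (a few cycles)
                c.status = "running"
                return
        # no ready thread: the pipeline simply stalls until one completes

    def on_completion(self, tid):
        """Completion does NOT trigger a switch: just mark the thread ready."""
        if self.ctx[tid].status == "blocked":
            self.ctx[tid].status = "ready"
```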
Blocked Multithreading
▪ Latency hiding example:
[Figure (Culler and Singh, Fig. 11.27): threads A-D each run until a long-latency operation and the hardware switches to the next thread; pipeline and memory latencies appear as context switch overhead and idle (stall) cycles]
Blocked Multithreading
▪ Hardware mechanisms:
– Keeping multiple contexts and supporting fast switch
▪ One register file per context
▪ One set of special registers (including PC) per context
– Flushing instructions from the previous context from the pipeline after a context
switch
▪ Note that such squashed instructions add to the context switch overhead
▪ Note that keeping instructions from two different threads in the pipeline
increases the complexity of the interlocking mechanism and requires that
instructions be tagged with context ID throughout the pipeline
– Possibly replicating other microarchitectural structures (e.g., branch prediction
tables)
▪ Employed in the Sun T1 and T2 systems (a.k.a. Niagara)
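As a small illustration of the flushing and tagging points above, the sketch below (hypothetical Python, not from the course material) tags every in-flight instruction with a context ID so that, on a context switch, only the instructions of the thread being switched out are squashed; those squashed instructions add to the effective context switch overhead:

```python
from dataclasses import dataclass

@dataclass
class PipelineEntry:
    """An in-flight instruction tagged with its context ID (illustrative)."""
    ctx_id: int
    op: str

def flush_on_context_switch(pipeline, old_ctx):
    """Squash only the instructions belonging to the outgoing context."""
    kept = [e for e in pipeline if e.ctx_id != old_ctx]
    squashed = len(pipeline) - len(kept)   # wasted work = extra switch overhead
    return kept, squashed

# Example: instructions from two threads in flight; context 0 is switched out
pipe = [PipelineEntry(0, "ld"), PipelineEntry(0, "add"), PipelineEntry(1, "mul")]
pipe, overhead = flush_on_context_switch(pipe, old_ctx=0)
print(overhead)   # -> 2 squashed instructions
```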
Blocked Multithreading
▪ Simple analytical performance model:
– Parameters:
▪ Number of threads (N): the number of threads supported in the hardware
▪ Busy time (R): time processor spends computing between context switch
points
▪ Switching time (C): time processor spends with each context switch
▪ Latency (L): time required by the operation that triggers the switch
– To completely hide all L we need enough N such that (N-1)*R + N*C = L
▪ Fewer threads mean we can’t hide all L
▪ More threads are unnecessary
[Timeline: the R and C segments of the N threads run back to back (R C R C ... R C) to cover one thread's L]
– Note: these are only average numbers and ideally N should be bigger to
accommodate variation
Blocked Multithreading
▪ Simple analytical performance model:
– The minimum value of N is referred to as the saturation point (Nsat)
Nsat = (R + L) / (R + C)
– Thus, there are two regions of operation:
▪ Before saturation, adding more threads increases processor utilization linearly
▪ After saturation, processor utilization does not improve with more threads, but
is limited by the switching overhead
Usat = R / (R + C)
– E.g., for R=40, L=200, and C=10: Nsat = (40+200)/(40+10) = 4.8 and Usat = 40/(40+10) = 0.8
[Plot (Culler and Singh, Fig. 11.25): processor utilization vs. number of threads (0 to 8); utilization rises roughly linearly and flattens at 0.8 once about 5 threads are available]
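The model above can be evaluated with a short Python sketch (an illustration of the formulas on this slide, not course code; blocked_mt_utilization is a made-up helper name):

```python
def blocked_mt_utilization(n_threads, R, C, L):
    """Utilization under the simple blocked-multithreading model.

    Before saturation each thread's work R is issued once per period R + L,
    so utilization grows linearly with the number of threads; after
    saturation it is capped by the switch overhead at R / (R + C).
    """
    linear = n_threads * R / (R + L)    # below saturation
    u_sat = R / (R + C)                 # at and above saturation
    return min(linear, u_sat)

# Example from the slide: R=40, C=10, L=200 -> Nsat = 4.8, Usat = 0.8
for n in range(1, 9):
    print(n, round(blocked_mt_utilization(n, R=40, C=10, L=200), 2))
```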
Fine-grain or Interleaved Multithreading
▪ Basic idea:
– Instead of waiting for long-latency operation, context switch on every cycle
– Threads waiting for a long-latency operation are marked not ready and are not considered for execution
– With enough threads no two instructions from the same thread are in the pipeline
at the same time → no need for pipeline interlock at all
▪ Advantages and disadvantages over blocked multithreading:
+ No context switch overhead (no pipeline flush)
+ Better at handling short pipeline latencies/bubbles
– Possibly poor single thread performance (each thread only gets the processor once
every N cycles)
– Requires more threads to completely hide long latencies
– Slightly more complex hardware than blocked multithreading (if we want to permit
multiple instructions from the same thread in the pipeline)
▪ Some machines have taken this idea to the extreme and
eliminated caches altogether (e.g., Cray MTA-2, with 128 threads
per processor)
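A minimal sketch of the cycle-by-cycle selection described under "Basic idea" above (hypothetical Python; next_thread is an invented helper, not a real implementation):

```python
def next_thread(status, last):
    """Interleaved multithreading: rotate round-robin every cycle, skipping
    threads marked not ready because they wait on a long-latency operation.
    `status` maps thread id -> "ready" or "blocked"."""
    n = len(status)
    for step in range(1, n + 1):
        tid = (last + step) % n
        if status[tid] == "ready":
            return tid
    return None   # every thread is blocked: the pipeline issues a bubble

# Example: thread 1 is still waiting on memory, so it is skipped
print(next_thread(["ready", "blocked", "ready", "ready"], last=0))  # -> 2
```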
Fine-grain or Interleaved Multithreading
▪ Simple analytical performance model
▪ Assumption: no caches, 1 in 2 instructions is a memory access
– Parameters:
▪ Number of threads (N) and Latency (L)
▪ Busy time (R) is now 1 and switching time (C) is now 0
– To completely hide all L we need enough N such that N - 1 = L (the blocked-model condition (N-1)*R + N*C = L with R = 1 and C = 0)
– The minimum value of N (i.e., N=L+1) is the saturation point (Nsat)
– Again, there are two regions of operation:
▪ Before saturation, adding more threads increases processor utilization linearly
▪ After saturation, processor utilization does not improve with more threads, but
is 100% (i.e., Usat = 1)
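The two regions can again be captured in a few lines of Python (an illustration of this slide's model; interleaved_utilization is an invented name, and L=3 is just an example value):

```python
def interleaved_utilization(n_threads, L):
    """Interleaved model with R = 1 and C = 0: each thread issues one
    instruction and then waits L cycles, so N threads keep the pipeline
    N/(L+1) busy, saturating at 100% once N >= L + 1 (= Nsat)."""
    return min(n_threads / (L + 1), 1.0)

# Example with L = 3: utilization grows linearly and saturates at 4 threads
print([round(interleaved_utilization(n, L=3), 2) for n in range(1, 6)])
# -> [0.25, 0.5, 0.75, 1.0, 1.0]
```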
Fine-grain or Interleaved Multithreading
▪ Latency hiding example:
[Figure (Culler and Singh, Fig. 11.28): threads A-F issue one instruction each in round-robin order; a thread still blocked on its memory access (e.g., A or E) is skipped; idle (stall) cycles appear only when too few threads are ready to cover the pipeline and memory latencies]
Simultaneous Multithreading (SMT)
▪ Basic idea:
– Don’t actually context switch, but on a superscalar processor fetch and issue
instructions from different threads/processes simultaneously
– E.g., 4-issue processor:
[Figure: issue-slot occupancy per cycle under no multithreading, blocked, interleaved, and SMT; without multithreading a cache miss leaves all slots idle, whereas SMT fills slots with instructions from different threads in the same cycle]
▪ Advantages:
+ Can handle not only long latencies and pipeline bubbles but also unused issue slots
+ Full performance in single-thread mode
– Most complex hardware of all multithreading schemes
Simultaneous Multithreading (SMT)
▪ Fetch policies:
– Non-multithreaded fetch: only fetch instructions from one thread in each cycle, in
a round-robin alternation
– Partitioned fetch: divide the total fetch bandwidth equally between some of the
available threads (requires more complex fetch unit to fetch from multiple I-cache
lines; see Lecture 3)
– Priority fetch: fetch more instructions for specific threads (e.g., those not in control speculation, or those with the fewest instructions in the issue queue); a sketch of such a heuristic follows below
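As a sketch of the priority fetch idea just mentioned (hypothetical Python; pick_fetch_thread is an invented helper), a heuristic that prefers non-speculative threads with the fewest instructions already in the issue queue could look like this:

```python
def pick_fetch_thread(issue_queue_counts, speculative):
    """Priority fetch heuristic: avoid control-speculative threads and,
    among the rest, favour the thread with the emptiest issue queue."""
    tids = range(len(issue_queue_counts))
    return min(tids, key=lambda t: (speculative[t], issue_queue_counts[t]))

# Example: thread 2 is not speculating and has the fewest queued instructions
print(pick_fetch_thread([5, 3, 1, 4], [False, True, False, False]))  # -> 2
```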
▪ Issue policies:
– Round-robin: select one ready instruction from each ready thread in turn until all issue slots are full or there are no more ready instructions
(note: should remember which thread was the last to have an instruction selected
and start from there in the next cycle)
– Priority issue:
▪ E.g., threads with older instructions in the issue queue are tried first
▪ E.g., threads in control speculative mode are tried last
▪ E.g., issue all pending branches first
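Below is a minimal Python sketch of the round-robin issue policy, including the note about remembering the last served thread (round_robin_issue is an invented helper, not a real scheduler):

```python
def round_robin_issue(ready, issue_width, last_thread):
    """Take one ready instruction from each ready thread in turn, starting
    after the thread served last in the previous cycle, until the issue
    slots are full or no ready instructions remain.
    `ready` maps thread id -> list of ready instructions."""
    n = len(ready)
    issued, start = [], (last_thread + 1) % n
    while len(issued) < issue_width:
        progress = False
        for i in range(n):
            tid = (start + i) % n
            if ready[tid] and len(issued) < issue_width:
                issued.append((tid, ready[tid].pop(0)))
                last_thread = tid       # remembered for the next cycle
                progress = True
        if not progress:
            break                       # no more ready instructions anywhere
    return issued, last_thread

# Example: a 4-issue cycle with two threads holding ready instructions
queues = {0: ["add", "mul"], 1: ["ld"]}
print(round_robin_issue(queues, issue_width=4, last_thread=1))
# -> ([(0, 'add'), (1, 'ld'), (0, 'mul')], 0)
```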