
18-447

Computer Architecture
Lecture 22: Memory Controllers

Prof. Onur Mutlu


Carnegie Mellon University
Spring 2015, 3/25/2015
Lab 4 Grades
[Histogram: Lab 4 grade distribution, number of students vs. grade (roughly 40 to 90+).]

 Mean: 89.7
 Median: 94.3
 Standard Deviation: 16.2
2
Lab 4 Extra Credit
 Pete Ehrett (fastest) – 2%
 Navneet Saini (2nd fastest) – 1%

3
Announcements (I)
 No office hours today
 Hosting a seminar in this room right after this lecture
 Swarun Kumar, MIT, “Pushing the Limits of Wireless
Networks: Interference Management and Indoor Positioning”
 March 25, 2:30-3:30pm, HH 1107

 From talk abstract:


(…) perhaps our biggest expectation from modern wireless networks is faster
communication speeds. However, state-of-the-art Wi-Fi networks continue to
struggle in crowded environments — airports and hotel lobbies. The core
reason is interference — Wi-Fi access points today avoid transmitting at the
same time on the same frequency, since they would otherwise interfere with
each other. I describe OpenRF, a novel system that enables today’s Wi-Fi
access points to directly combat this interference and demonstrate significantly
faster data-rates for real applications.

4
Today’s Seminar on Flash Memory (4-5pm)
 March 25, Wednesday, CIC Panther Hollow Room, 4-5pm
 Yixin Luo, PhD Student, CMU
 Data Retention in MLC NAND Flash Memory:
Characterization, Optimization and Recovery

 Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu,
"Data Retention in MLC NAND Flash Memory:
Characterization, Optimization and Recovery"
Proceedings of the 21st International Symposium on High-
Performance Computer Architecture (HPCA), Bay Area, CA,
February 2015.
[Slides (pptx) (pdf)]
Best paper session.
 http://users.ece.cmu.edu/~omutlu/pub/flash-memory-data-retention_hpca15.pdf
5
Flash Memory (SSD) Controllers
 Similar to DRAM memory controllers, except:
 They are flash memory specific
 They do much more: error correction, garbage collection,
page remapping, …

Cai+, “Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime,” ICCD 2012.
6
Where We Are in Lecture Schedule
 The memory hierarchy
 Caches, caches, more caches
 Virtualizing the memory hierarchy: Virtual Memory
 Main memory: DRAM
 Main memory control, scheduling
 Memory latency tolerance techniques
 Non-volatile memory

 Multiprocessors
 Coherence and consistency
 Interconnection networks
 Multi-core issues
7
Required Reading (for the Next Few Lectures)
 Onur Mutlu, Justin Meza, and Lavanya Subramanian,
"The Main Memory System: Challenges and
Opportunities"
Invited Article in Communications of the Korean Institute of
Information Scientists and Engineers (KIISE), 2015.

http://users.ece.cmu.edu/~omutlu/pub/main-memory-system_kiise15.pdf

8
Required Readings on DRAM
 DRAM Organization and Operation Basics
 Sections 1 and 2 of: Lee et al., “Tiered-Latency DRAM: A Low
Latency and Low Cost DRAM Architecture,” HPCA 2013.
http://users.ece.cmu.edu/~omutlu/pub/tldram_hpca13.pdf

 Sections 1 and 2 of Kim et al., “A Case for Subarray-Level


Parallelism (SALP) in DRAM,” ISCA 2012.
http://users.ece.cmu.edu/~omutlu/pub/salp-dram_isca12.pdf

 DRAM Refresh Basics


 Sections 1 and 2 of Liu et al., “RAIDR: Retention-Aware
Intelligent DRAM Refresh,” ISCA 2012.
http://users.ece.cmu.edu/~omutlu/pub/raidr-dram-refresh_isca12.pdf
9
Memory Controllers
DRAM versus Other Types of Memories

 Long latency memories have similar characteristics that need to be controlled.

 The following discussion will use DRAM as an example, but many scheduling and control issues are similar in the design of controllers for other types of memories
 Flash memory
 Other emerging memory technologies
 Phase Change Memory
 Spin-Transfer Torque Magnetic Memory
 These other technologies can place other demands on the
controller

11
DRAM Types
 DRAM has different types with different interfaces optimized
for different purposes
 Commodity: DDR, DDR2, DDR3, DDR4, …
 Low power (for mobile): LPDDR1, …, LPDDR5, …
 High bandwidth (for graphics): GDDR2, …, GDDR5, …
 Low latency: eDRAM, RLDRAM, …
 3D stacked: WIO, HBM, HMC, …
 …
 Underlying microarchitecture is fundamentally the same
 A flexible memory controller can support various DRAM types
 This complicates the memory controller
 Difficult to support all types (and upgrades)

12
DRAM Types (II)

Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator,” IEEE Comp Arch Letters 2015.
13
DRAM Controller: Functions
 Ensure correct operation of DRAM (refresh and timing)

 Service DRAM requests while obeying timing constraints of DRAM chips
 Constraints: resource conflicts (bank, bus, channel), minimum write-to-read delays
 Translate requests to DRAM command sequences

 Buffer and schedule requests for high performance + QoS
 Reordering, row-buffer, bank, rank, bus management

 Manage power consumption and thermals in DRAM
 Turn on/off DRAM chips, manage power modes
14
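A minimal sketch (not from the lecture; the struct, the function, and the example tWTR/tRTW values are all illustrative) of how a controller might check one such timing constraint, the write-to-read delay, before issuing a column command:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative timing parameters in DRAM clock cycles; real values come
   from the DRAM datasheet. */
enum { T_WTR = 6,    /* minimum write-to-read turnaround */
       T_RTW = 8 };  /* minimum read-to-write turnaround */

typedef struct {
    uint64_t last_write_end;   /* cycle when the last WRITE burst finished */
    uint64_t last_read_end;    /* cycle when the last READ burst finished */
} bus_state_t;

/* Can a READ (is_read) or WRITE (!is_read) column command legally issue now? */
bool can_issue_column_cmd(const bus_state_t *bus, bool is_read, uint64_t now)
{
    if (is_read)
        return now >= bus->last_write_end + T_WTR;  /* obey write-to-read delay */
    return now >= bus->last_read_end + T_RTW;       /* obey read-to-write delay */
}

A real controller checks many such constraints per bank, rank, and channel, and only considers requests whose next command is legal in the current cycle.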
DRAM Controller: Where to Place
 In chipset
+ More flexibility to plug different DRAM types into the system
+ Less power density in the CPU chip

 On CPU chip
+ Reduced latency for main memory access
+ Higher bandwidth between cores and controller
 More information can be communicated (e.g. request’s
importance in the processing core)

15
A Modern DRAM Controller (I)

16
A Modern DRAM Controller (II)

17
DRAM Scheduling Policies (I)
 FCFS (first come first served)
 Oldest request first

 FR-FCFS (first ready, first come first served)
1. Row-hit first
2. Oldest first
Goal: Maximize row buffer hit rate → maximize DRAM throughput

 Actually, scheduling is done at the command level
 Column commands (read/write) prioritized over row commands (activate/precharge)
 Within each group, older commands prioritized over younger ones

18
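A minimal sketch of FR-FCFS written as a request comparator (the struct and field names are illustrative, not from the lecture):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     row_hit;   /* request targets the currently open row in its bank */
    uint64_t arrival;   /* arrival time; smaller means older */
} request_t;

/* Returns true if request a should be serviced before request b under FR-FCFS. */
bool frfcfs_before(const request_t *a, const request_t *b)
{
    if (a->row_hit != b->row_hit)
        return a->row_hit;           /* 1. row-hit first */
    return a->arrival < b->arrival;  /* 2. oldest first */
}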
DRAM Scheduling Policies (II)
 A scheduling policy is a request prioritization order

 Prioritization can be based on


 Request age
 Row buffer hit/miss status
 Request type (prefetch, read, write)
 Requestor type (load miss or store miss)
 Request criticality
 Oldest miss in the core?
 How many instructions in core are dependent on it?
 Will it stall the processor?
 Interference caused to other cores
 …
19
Row Buffer Management Policies
 Open row
 Keep the row open after an access
+ Next access might need the same row → row hit
-- Next access might need a different row → row conflict, wasted energy

 Closed row
 Close the row after an access (if no other requests already in the request buffer need the same row)
+ Next access might need a different row → avoid a row conflict
-- Next access might need the same row → extra activate latency

 Adaptive policies
 Predict whether or not the next access to the bank will be to the same row

20
Open vs. Closed Row Policies

Policy | First access | Next access | Commands needed for next access
Open row | Row 0 | Row 0 (row hit) | Read
Open row | Row 0 | Row 1 (row conflict) | Precharge + Activate Row 1 + Read
Closed row | Row 0 | Row 0, access already in request buffer (row hit) | Read
Closed row | Row 0 | Row 0, access not in request buffer (row closed) | Activate Row 0 + Read + Precharge
Closed row | Row 0 | Row 1 (row closed) | Activate Row 1 + Read + Precharge

21
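The table can also be read as a small decision procedure. Below is a sketch (illustrative names; it assumes read requests and that a row is currently open under the open-row policy) that prints the command sequence the next access needs:

#include <stdbool.h>
#include <stdio.h>

typedef enum { POLICY_OPEN_ROW, POLICY_CLOSED_ROW } row_policy_t;

void commands_for_next_access(row_policy_t policy,
                              int open_row,            /* row most recently activated */
                              int next_row,            /* row the next access needs */
                              bool next_in_req_buffer) /* was it already in the request buffer? */
{
    if (policy == POLICY_OPEN_ROW) {
        if (next_row == open_row)
            printf("Read\n");                                          /* row hit */
        else
            printf("Precharge + Activate Row %d + Read\n", next_row); /* row conflict */
    } else {
        /* Closed-row: the row is kept open only for requests already buffered. */
        if (next_row == open_row && next_in_req_buffer)
            printf("Read\n");                                          /* row hit */
        else
            printf("Activate Row %d + Read + Precharge\n", next_row); /* row closed */
    }
}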
Memory Interference and Scheduling
in Multi-Core Systems
Review: A Modern DRAM Controller

23
Review: DRAM Bank Operation
Access Address:
(Row 0, Column 0)
(Row 0, Column 1)
(Row 0, Column 85)
(Row 1, Column 0)
[Figure: a DRAM bank with a row decoder selecting rows, a row buffer, and a column mux. The first access brings Row 0 into the empty row buffer; the next two accesses to Row 0 are row hits; the access to Row 1 is a row conflict that requires closing Row 0 and activating Row 1.]

24
Scheduling Policy for Single-Core Systems
 A row-conflict memory access takes significantly longer than a
row-hit access
 Current controllers take advantage of the row buffer

 FR-FCFS (first ready, first come first served) scheduling policy


1. Row-hit first
2. Oldest first

Goal 1: Maximize row buffer hit rate → maximize DRAM throughput
Goal 2: Prioritize older requests → ensure forward progress

 Is this a good policy in a multi-core system?

25
Trend: Many Cores on Chip
 Simpler and lower power than a single large core
 Large scale parallelism on chip

AMD Barcelona: 4 cores
Intel Core i7: 8 cores
IBM Cell BE: 8+1 cores
IBM POWER7: 8 cores
Sun Niagara II: 8 cores
Nvidia Fermi: 448 “cores”
Intel SCC: 48 cores, networked
Tilera TILE Gx: 100 cores, networked

26
Many Cores on Chip
 What we want:
 N times the system performance with N times the cores

 What do we get today?

27
(Un)expected Slowdowns in Multi-Core
[Figure: slowdowns of a high-priority application (Core 0) and a low-priority application (Core 1) that share the memory system.]
Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service
in multi-core systems,” USENIX Security 2007.
28
Uncontrolled Interference: An Example

[Figure: a multi-core chip in which one core runs stream and another runs random. Each core has a private L2 cache; both share the interconnect, the DRAM memory controller, and DRAM Banks 0-3. The unfairness arises in the shared DRAM memory system.]

29
A Memory Performance Hog
STREAM (streaming access):
// initialize large arrays A, B
for (j=0; j<N; j++) {
  index = j*linesize;   // streaming
  A[index] = B[index];
  …
}

RANDOM (random access):
// initialize large arrays A, B
for (j=0; j<N; j++) {
  index = rand();       // random
  A[index] = B[index];
  …
}

STREAM
- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive

RANDOM
- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

30
What Does the Memory Hog Do?

[Figure: the memory request buffer in front of a DRAM bank. T0 (STREAM) keeps inserting requests to Row 0 while T1 (RANDOM) inserts requests to Rows 5, 111, 16, and so on; Row 0 is in the row buffer. Row size: 8KB, cache block size: 64B, so T0 has 128 (8KB/64B) row-hit requests to Row 0, and a row-hit-first scheduler services all of them before T1’s requests.]

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

31
Effect of the Memory Performance Hog
[Bar chart: slowdowns of the applications sharing the memory system. The memory performance hog (STREAM) slows down by only 1.18X, while a co-running application slows down by 2.82X.]
Results on Intel Pentium D running Windows XP
(Similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

32
Problems due to Uncontrolled Interference
Main memory is the only shared resource
[Figure: the low-priority memory performance hog slows down the high-priority applications; the other cores make very slow progress.]
 Unfair slowdown of different threads


 Low system performance
 Vulnerability to denial of service
 Priority inversion: unable to enforce priorities/SLAs

33
Problems due to Uncontrolled Interference

 Unfair slowdown of different threads


 Low system performance
 Vulnerability to denial of service
 Priority inversion: unable to enforce priorities/SLAs
 Poor performance predictability (no performance isolation)
Uncontrollable, unpredictable system
34
Recap: Inter-Thread Interference in Memory
 Memory controllers, pins, and memory banks are shared

 Pin bandwidth is not increasing as fast as number of cores
 Bandwidth per core reducing

 Different threads executing on different cores interfere with each other in the main memory system

 Threads delay each other by causing resource contention:
 Bank, bus, row-buffer conflicts → reduced DRAM throughput
 Threads can also destroy each other’s DRAM bank parallelism
 Otherwise parallel requests can become serialized
35
Effects of Inter-Thread Interference in DRAM
 Queueing/contention delays
 Bank conflict, bus conflict, channel conflict, …

 Additional delays due to DRAM constraints


 Called “protocol overhead”
 Examples
 Row conflicts
 Read-to-write and write-to-read delays

 Loss of intra-thread parallelism


 A thread’s concurrent requests are serviced serially instead of
in parallel

36
Problem: QoS-Unaware Memory Control
 Existing DRAM controllers are unaware of inter-thread
interference in DRAM system

 They simply aim to maximize DRAM throughput


 Thread-unaware and thread-unfair
 No intent to service each thread’s requests in parallel
 FR-FCFS policy: 1) row-hit first, 2) oldest first
 Unfairly prioritizes threads with high row-buffer locality
 Unfairly prioritizes threads that are memory intensive (many outstanding
memory accesses)

37
Solution: QoS-Aware Memory Request Scheduling

Resolves memory contention by scheduling requests
[Figure: multiple cores send requests to a shared memory controller, which schedules them to memory.]

 How to schedule requests to provide


 High system performance
 High fairness to applications
 Configurability to system software

 Memory controller needs to be aware of threads

38
Stall-Time Fair Memory Scheduling

Onur Mutlu and Thomas Moscibroda,


"Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors"
40th International Symposium on Microarchitecture (MICRO),
pages 146-158, Chicago, IL, December 2007. Slides (ppt)

STFM Micro 2007 Talk


The Problem: Unfairness

 Unfair slowdown of different threads


 Low system performance
 Vulnerability to denial of service
 Priority inversion: unable to enforce priorities/SLAs
 Poor performance predictability (no performance isolation)
Uncontrollable, unpredictable system
40
How Do We Solve the Problem?
 Stall-time fair memory scheduling [Mutlu+ MICRO’07]

 Goal: Threads sharing main memory should experience similar slowdowns compared to when they are run alone → fair scheduling
 Also improves overall system performance by ensuring cores make “proportional” progress

 Idea: Memory controller estimates each thread’s slowdown due to interference and schedules requests in a way to balance the slowdowns

 Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007.
41
Stall-Time Fairness in Shared DRAM Systems

 A DRAM system is fair if it equalizes the slowdown of equal-priority threads relative to when each thread is run alone on the same system

 DRAM-related stall-time: The time a thread spends waiting for DRAM memory
 STshared: DRAM-related stall-time when the thread runs with other threads
 STalone: DRAM-related stall-time when the thread runs alone
 Memory-slowdown = STshared / STalone
 Relative increase in stall-time

 Stall-Time Fair Memory scheduler (STFM) aims to equalize Memory-slowdown for interfering threads, without sacrificing performance
 Considers inherent DRAM performance of each thread
 Aims to allow proportional progress of threads

42
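As an illustrative example with made-up numbers: if a thread’s DRAM-related stall time is 2M cycles when it runs alone (STalone) and 5M cycles when it shares the memory system (STshared), its Memory-slowdown is 5M / 2M = 2.5. If another equal-priority thread’s Memory-slowdown is 1.25, unfairness is 2.5 / 1.25 = 2.0, and a stall-time fair scheduler tries to drive this ratio back toward 1.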
STFM Scheduling Algorithm [MICRO’07]
 For each thread, the DRAM controller
 Tracks STshared

 Estimates STalone

 Each cycle, the DRAM controller
 Computes Slowdown = STshared / STalone for threads with legal requests
 Computes unfairness = MAX Slowdown / MIN Slowdown

 If unfairness < α
 Use DRAM throughput oriented scheduling policy

 If unfairness ≥ α
 Use fairness-oriented scheduling policy
 (1) requests from thread with MAX Slowdown first
 (2) row-hit first, (3) oldest-first

43
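A minimal sketch of the per-cycle decision above (field names and the α value are illustrative; estimating STalone online, which is the hard part of STFM, is not shown):

#define NTHREADS 4
#define ALPHA    1.05    /* unfairness threshold; the value is illustrative */

typedef struct {
    double st_shared;    /* measured DRAM-related stall time with other threads */
    double st_alone;     /* estimated DRAM-related stall time if run alone */
    int    has_request;  /* does this thread have a schedulable request? */
} thread_state_t;

/* Returns -1 to keep the throughput-oriented policy (e.g., FR-FCFS), or the id
   of the thread with maximum slowdown, whose requests should be served first. */
int stfm_pick_prioritized_thread(const thread_state_t t[NTHREADS])
{
    double max_slow = 0.0, min_slow = 1e30;
    int max_thread = -1;
    for (int i = 0; i < NTHREADS; i++) {
        if (!t[i].has_request)
            continue;
        double slow = t[i].st_shared / t[i].st_alone;   /* memory slowdown */
        if (slow > max_slow) { max_slow = slow; max_thread = i; }
        if (slow < min_slow) min_slow = slow;
    }
    if (max_thread < 0 || max_slow / min_slow < ALPHA)
        return -1;         /* fair enough: stay throughput-oriented */
    return max_thread;     /* unfair: prioritize the most slowed-down thread */
}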
How Does STFM Prevent Unfairness?

[Figure: a shared bank receives interleaved requests from T0 (all to Row 0) and T1 (to Rows 5, 111, and 16) while Row 0 is open. The controller continuously updates each thread’s estimated slowdown (T0 ranging over roughly 1.00-1.10, T1 over 1.00-1.14) and the resulting unfairness. With the unfairness threshold α = 1.05, STFM services T1’s row-conflict requests ahead of T0’s row hits whenever unfairness exceeds 1.05, keeping the two slowdowns balanced.]

44
STFM Pros and Cons
 Upsides:
 First algorithm for fair multi-core memory scheduling
 Provides a mechanism to estimate memory slowdown of a
thread
 Good at providing fairness
 Being fair can improve performance

 Downsides:
 Does not handle all types of interference
 (Somewhat) complex to implement
 Slowdown estimations can be incorrect

45
Parallelism-Aware Batch Scheduling

Onur Mutlu and Thomas Moscibroda,


"Parallelism-Aware Batch Scheduling: Enhancing both
Performance and Fairness of Shared DRAM Systems”
35th International Symposium on Computer Architecture (ISCA),
pages 63-74, Beijing, China, June 2008. Slides (ppt)
Another Problem due to Interference

 Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests
 Memory-Level Parallelism (MLP)
 Out-of-order execution, non-blocking caches, runahead execution

 Effective only if the DRAM controller actually services the multiple requests in parallel in DRAM banks

 Multiple threads share the DRAM controller
 DRAM controllers are not aware of a thread’s MLP
 Can service each thread’s outstanding requests serially, not in parallel

47
Bank Parallelism of a Thread
Single Thread:
[Timeline: Thread A issues 2 DRAM requests, one to Bank 0, Row 1 and one to Bank 1, Row 1; its execution is Compute, Stall, Compute.]
 Bank access latencies of the two requests overlapped
 Thread stalls for ~ONE bank access latency

48
Bank Parallelism Interference in DRAM
Baseline Scheduler:
[Timeline: Thread A issues 2 DRAM requests (Bank 0, Row 1 and Bank 1, Row 1) and Thread B issues 2 DRAM requests (Bank 0, Row 99 and Bank 1, Row 99). The baseline scheduler interleaves the two threads’ requests across the banks, so each thread’s two accesses are serviced one after the other.]
 Bank access latencies of each thread serialized
 Each thread stalls for ~TWO bank access latencies

49
Parallelism-Aware Scheduler
Baseline Scheduler:
[Timeline: as on the previous slide, each thread stalls for about two bank access latencies.]

Parallelism-aware Scheduler:
[Timeline: each thread’s 2 DRAM requests are serviced back to back in the two banks, so each thread’s bank access latencies overlap. Thread A stalls for about one bank access latency (saved cycles); average stall-time drops to ~1.5 bank access latencies.]

50
Parallelism-Aware Batch Scheduling (PAR-BS)
 Principle 1: Parallelism-awareness
 Schedule requests from a thread (to different banks) back to back
 Preserves each thread’s bank parallelism
 But, this can cause starvation…

 Principle 2: Request Batching
 Group a fixed number of oldest requests from each thread into a “batch”
 Service the batch before all other requests
 Form a new batch when the current one is done
 Eliminates starvation, provides fairness
 Allows parallelism-awareness within a batch

[Figure: per-bank queues (Bank 0, Bank 1) holding requests from threads T0-T3; the oldest requests from each thread are grouped into the current batch.]

Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling,” ISCA 2008.


51
PAR-BS Components

 Request batching

 Within-batch scheduling
 Parallelism aware

52
Request Batching

 Each memory request has a bit (marked) associated with it

 Batch formation:
 Mark up to Marking-Cap oldest requests per bank for each thread
 Marked requests constitute the batch
 Form a new batch when no marked requests are left

 Marked requests are prioritized over unmarked ones


 No reordering of requests across batches: no starvation, high fairness

 How to prioritize requests within a batch?

53
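A sketch of batch formation under these rules (the request-buffer layout and the Marking-Cap value are illustrative; the buffer is assumed to be ordered oldest-first, and thread/bank counts are assumed to fit the small arrays used here):

#include <stdbool.h>
#include <stddef.h>

#define MARKING_CAP 5     /* illustrative value */
#define MAX_THREADS 16
#define MAX_BANKS   16

typedef struct {
    int  thread_id;
    int  bank;
    bool marked;          /* true once the request is part of the current batch */
} req_t;

/* Mark up to MARKING_CAP oldest requests per bank for each thread. */
void form_new_batch(req_t *reqs, size_t nreqs)
{
    int marked_count[MAX_THREADS][MAX_BANKS] = {{0}};
    for (size_t i = 0; i < nreqs; i++) {      /* oldest-first order assumed */
        int t = reqs[i].thread_id, b = reqs[i].bank;
        if (marked_count[t][b] < MARKING_CAP) {
            reqs[i].marked = true;            /* request joins the batch */
            marked_count[t][b]++;
        } else {
            reqs[i].marked = false;
        }
    }
}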
Within-Batch Scheduling
 Can use any DRAM scheduling policy
 FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality
 But, we also want to preserve intra-thread bank parallelism
 Service each thread’s requests back to back

HOW?
 Scheduler computes a ranking of threads when the batch is
formed
 Higher-ranked threads are prioritized over lower-ranked ones
 Improves the likelihood that requests from a thread are serviced in
parallel by different banks
 Different threads prioritized in the same order across ALL banks

54
How to Rank Threads within a Batch
 Ranking scheme affects system throughput and fairness

 Maximize system throughput


 Minimize average stall-time of threads within the batch
 Minimize unfairness (Equalize the slowdown of threads)
 Service threads with inherently low stall-time early in the batch
 Insight: delaying memory non-intensive threads results in high
slowdown

 Shortest stall-time first (shortest job first) ranking


 Provides optimal system throughput [Smith, 1956]*
 Controller estimates each thread’s stall-time within the batch
 Ranks threads with shorter stall-time higher
* W.E. Smith, “Various optimizers for single stage production,” Naval Research Logistics Quarterly, 1956.
55
Shortest Stall-Time First Ranking
 Maximum number of marked requests to any bank (max-bank-load)
 Rank thread with lower max-bank-load higher (~ low stall-time)
 Total number of marked requests (total-load)
 Breaks ties: rank thread with lower total-load higher

[Figure: marked requests from threads T0-T3 queued at Bank 0 through Bank 3.]
max-bank-load / total-load per thread:
T0: 1 / 3
T1: 2 / 4
T2: 2 / 6
T3: 5 / 9
Ranking: T0 > T1 > T2 > T3

56
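A sketch of the ranking computation (array sizes and names are illustrative): derive each thread’s max-bank-load and total-load from its marked-request counts, then sort so that out[0] is the highest-ranked thread:

#include <stdlib.h>

#define NTHREADS 4
#define NBANKS   4

typedef struct {
    int id;
    int max_bank_load;   /* largest number of marked requests to any single bank */
    int total_load;      /* total number of marked requests */
} rank_info_t;

static int by_load(const void *pa, const void *pb)
{
    const rank_info_t *a = pa, *b = pb;
    if (a->max_bank_load != b->max_bank_load)
        return a->max_bank_load - b->max_bank_load;   /* lower load ranks higher */
    return a->total_load - b->total_load;             /* tie-break on total load */
}

/* marked[t][b] = number of marked requests of thread t to bank b. */
void rank_threads(const int marked[NTHREADS][NBANKS], rank_info_t out[NTHREADS])
{
    for (int t = 0; t < NTHREADS; t++) {
        out[t].id = t;
        out[t].max_bank_load = 0;
        out[t].total_load = 0;
        for (int b = 0; b < NBANKS; b++) {
            out[t].total_load += marked[t][b];
            if (marked[t][b] > out[t].max_bank_load)
                out[t].max_bank_load = marked[t][b];
        }
    }
    qsort(out, NTHREADS, sizeof out[0], by_load);
}

For the example above this yields T0 > T1 > T2 > T3, since T0 has the smallest max-bank-load and T1 beats T2 on total-load.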
Example Within-Batch Scheduling Order
[Figure: service order of the batch in Banks 0-3 under the two policies.]
Baseline scheduling order (arrival order): thread stall times are T0 = 4, T1 = 4, T2 = 5, T3 = 7 bank access latencies; average 5 bank access latencies.
PAR-BS scheduling order (ranking: T0 > T1 > T2 > T3): each thread’s requests are serviced back to back across banks; stall times are T0 = 1, T1 = 2, T2 = 4, T3 = 7; average 3.5 bank access latencies.

57
Putting It Together: PAR-BS Scheduling Policy
 PAR-BS Scheduling Policy
(1) Marked requests first
(2) Row-hit requests first
(3) Higher-rank thread first (shortest stall-time first)
(4) Oldest first
Rule (1) implements batching; rules (2)-(4) implement parallelism-aware within-batch scheduling.

 Three properties:
 Exploits row-buffer locality and intra-thread bank parallelism
 Work-conserving: does not waste bandwidth when it can be used
 Services unmarked requests to banks without marked requests
 Marking-Cap is important
 Too small cap: destroys row-buffer locality
 Too large cap: penalizes memory non-intensive threads

 Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling,” ISCA 2008.


58
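The four rules can be combined into a single request comparator, sketched below (field names are illustrative; thread_rank comes from the shortest stall-time first ranking, with a smaller value meaning a higher-ranked thread):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     marked;       /* part of the current batch? */
    bool     row_hit;      /* hits the open row in its bank? */
    int      thread_rank;  /* smaller = higher-ranked thread */
    uint64_t arrival;
} parbs_req_t;

/* Returns true if request a should be serviced before request b under PAR-BS. */
bool parbs_before(const parbs_req_t *a, const parbs_req_t *b)
{
    if (a->marked != b->marked)
        return a->marked;                       /* (1) marked requests first */
    if (a->row_hit != b->row_hit)
        return a->row_hit;                      /* (2) row-hit first */
    if (a->thread_rank != b->thread_rank)
        return a->thread_rank < b->thread_rank; /* (3) higher-rank thread first */
    return a->arrival < b->arrival;             /* (4) oldest first */
}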
Hardware Cost
 <1.5KB storage cost for
 8-core system with 128-entry memory request buffer

 No complex operations (e.g., divisions)

 Not on the critical path


 Scheduler makes a decision only every DRAM cycle

59
Unfairness on 4-, 8-, 16-core Systems
Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]
[Bar chart: unfairness (lower is better) of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-core, 8-core, and 16-core systems; PAR-BS provides the lowest unfairness at all three system sizes.]

60
System Performance
[Bar chart: normalized hmean speedup (higher is better) of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-core, 8-core, and 16-core systems; PAR-BS provides the highest system performance.]

61
PAR-BS Pros and Cons
 Upsides:
 First scheduler to address bank parallelism destruction across
multiple threads
 Simple mechanism (vs. STFM)
 Batching provides fairness
 Ranking enables parallelism awareness

 Downsides:
 Does not always prioritize the latency-sensitive applications

62
TCM:
Thread Cluster Memory Scheduling

Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter,


"Thread Cluster Memory Scheduling:
Exploiting Differences in Memory Access Behavior"
43rd International Symposium on Microarchitecture (MICRO),
pages 65-76, Atlanta, GA, December 2010. Slides (pptx) (pdf)

TCM Micro 2010 Talk


Throughput vs. Fairness
24 cores, 4 memory controllers, 96 workloads
[Scatter plot: maximum slowdown (lower means better fairness) vs. weighted speedup (higher means better system throughput) for FCFS, FRFCFS, STFM, PAR-BS, and ATLAS. Some schedulers (e.g., FRFCFS, ATLAS) are biased toward system throughput, others (e.g., STFM, PAR-BS) toward fairness.]
No previous memory scheduling algorithm provides both the best fairness and system throughput
64
Throughput vs. Fairness
Throughput-biased approach: prioritize less memory-intensive threads
+ Good for throughput
-- Memory-intensive threads are not prioritized → starvation, unfairness

Fairness-biased approach: threads take turns accessing memory
+ Does not starve any thread
-- Less memory-intensive threads get no higher priority → reduced throughput

Single policy for all threads is insufficient


65
Achieving the Best of Both Worlds
For Throughput: prioritize memory-non-intensive threads (give them higher priority)

For Fairness:
 Unfairness is caused by memory-intensive threads being prioritized over each other
• Shuffle thread ranking
 Memory-intensive threads have different vulnerability to interference
• Shuffle asymmetrically
66
Thread Cluster Memory Scheduling [Kim+ MICRO’10]
1. Group threads into two clusters
2. Prioritize the non-intensive cluster (higher priority)
3. Use different policies for each cluster
 Non-intensive cluster (memory-non-intensive threads): optimized for throughput
 Intensive cluster (memory-intensive threads): optimized for fairness
[Figure: the threads in the system are split into a prioritized memory-non-intensive cluster and a memory-intensive cluster.]


67
Clustering Threads
Step 1: Sort threads by MPKI (misses per kiloinstruction)
Step 2: Memory bandwidth usage αT divides the clusters
 T = total memory bandwidth usage, α = ClusterThreshold (α < 10%)
 Starting from the least memory-intensive thread, threads whose cumulative bandwidth usage stays below αT form the non-intensive cluster; the rest form the intensive cluster

68
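A sketch of the clustering step under the description above (field names and the fixed thread count are illustrative; MPKI and bandwidth usage are assumed to have been measured over the previous quantum):

#include <stdbool.h>
#include <stdlib.h>

#define NTHREADS 8
#define CLUSTER_THRESHOLD 0.10   /* alpha < 10%, per the slide */

typedef struct {
    int    id;
    double mpki;        /* misses per kiloinstruction in the last quantum */
    double bw_usage;    /* memory bandwidth used in the last quantum */
    bool   intensive;   /* output: cluster assignment */
} tcm_thread_t;

static int by_mpki(const void *pa, const void *pb)
{
    const tcm_thread_t *a = pa, *b = pb;
    return (a->mpki > b->mpki) - (a->mpki < b->mpki);   /* ascending MPKI */
}

void tcm_cluster(tcm_thread_t t[NTHREADS])
{
    double total_bw = 0.0;
    for (int i = 0; i < NTHREADS; i++)
        total_bw += t[i].bw_usage;

    qsort(t, NTHREADS, sizeof t[0], by_mpki);   /* Step 1: sort by MPKI */

    /* Step 2: the least intensive threads whose cumulative bandwidth usage
       stays within alpha * T form the non-intensive cluster. */
    double used = 0.0;
    for (int i = 0; i < NTHREADS; i++) {
        used += t[i].bw_usage;
        t[i].intensive = (used > CLUSTER_THRESHOLD * total_bw);
    }
}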
TCM: Quantum-Based Operation
[Timeline: execution is divided into quanta of ~1M cycles; within a quantum, the intensive-cluster ranking is shuffled every ~1K cycles (shuffle interval).]
During a quantum, monitor each thread’s behavior:
1. Memory intensity
2. Bank-level parallelism
3. Row-buffer locality
At the beginning of the next quantum:
• Perform clustering
• Compute niceness of intensive threads

69
TCM: Scheduling Algorithm
1. Highest-rank: Requests from higher-ranked threads are prioritized
• Non-intensive cluster > intensive cluster
• Non-intensive cluster: lower intensity → higher rank
• Intensive cluster: rank shuffling

2. Row-hit: Row-buffer hit requests are prioritized

3. Oldest: Older requests are prioritized

70
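A sketch of this prioritization as a request comparator (field names are illustrative; thread_rank is assumed to already encode the per-cluster policy, i.e., intensity ordering in the non-intensive cluster and periodic shuffling in the intensive cluster):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     intensive_cluster;  /* issuing thread is in the intensive cluster */
    int      thread_rank;        /* smaller = higher rank */
    bool     row_hit;
    uint64_t arrival;
} tcm_req_t;

/* Returns true if request a should be serviced before request b under TCM. */
bool tcm_before(const tcm_req_t *a, const tcm_req_t *b)
{
    if (a->intensive_cluster != b->intensive_cluster)
        return !a->intensive_cluster;            /* non-intensive cluster first */
    if (a->thread_rank != b->thread_rank)
        return a->thread_rank < b->thread_rank;  /* 1. highest-rank first */
    if (a->row_hit != b->row_hit)
        return a->row_hit;                       /* 2. row-hit first */
    return a->arrival < b->arrival;              /* 3. oldest first */
}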
TCM: Throughput and Fairness
24 cores, 4 memory controllers, 96 workloads
[Scatter plot: maximum slowdown vs. weighted speedup for FRFCFS, ATLAS, STFM, PAR-BS, and TCM; TCM sits in the best corner of the plot, with both better fairness and better throughput than the other schedulers.]
TCM, a heterogeneous scheduling policy, provides the best fairness and system throughput
71
TCM: Fairness-Throughput Tradeoff
When the configuration parameter (ClusterThreshold) is varied…
[Scatter plot: maximum slowdown vs. weighted speedup. Adjusting ClusterThreshold moves TCM along a tradeoff curve that provides better fairness at comparable or better throughput than FRFCFS, STFM, PAR-BS, and ATLAS, which each appear as single operating points.]

TCM allows robust fairness-throughput tradeoff


72
TCM Pros and Cons
 Upsides:
 Provides both high fairness and high performance
 Caters to the needs of different types of threads (latency vs. bandwidth sensitive)
 (Relatively) simple

 Downsides:
 Scalability to large buffer sizes?
 Robustness of clustering and shuffling algorithms?

73
