Computer Architecture
Lecture 22: Memory Controllers
[Histogram: score distribution (x-axis 40-90). Mean: 89.7, Median: 94.3, Standard Deviation: 16.2]
Lab 4 Extra Credit
Pete Ehrett (fastest) – 2%
Navneet Saini (2nd fastest) – 1%
Announcements (I)
No office hours today
Hosting a seminar in this room right after this lecture
Swarun Kumar, MIT, “Pushing the Limits of Wireless
Networks: Interference Management and Indoor Positioning”
March 25, 2:30-3:30pm, HH 1107
Today’s Seminar on Flash Memory (4-5pm)
March 25, Wednesday, CIC Panther Hollow Room, 4-5pm
Yixin Luo, PhD Student, CMU
Data Retention in MLC NAND Flash Memory:
Characterization, Optimization and Recovery
Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu,
"Data Retention in MLC NAND Flash Memory:
Characterization, Optimization and Recovery"
Proceedings of the 21st International Symposium on High-
Performance Computer Architecture (HPCA), Bay Area, CA,
February 2015.
[Slides (pptx) (pdf)]
Best paper session.
http://users.ece.cmu.edu/~omutlu/pub/flash-memory-data-
retention_hpca15.pdf
5
Flash Memory (SSD) Controllers
Similar to DRAM memory controllers, except:
They are flash memory specific
They do much more: error correction, garbage collection, page remapping, …
Cai+, "Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime," ICCD 2012.
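As a concrete contrast with DRAM controllers, here is a minimal sketch of one flash-specific duty, page remapping (all names are illustrative; real FTLs are far more elaborate, and garbage collection would reclaim the invalidated pages):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical flash translation layer (FTL) state; names are illustrative.
struct FlashController {
    std::unordered_map<uint64_t, uint64_t> l2p;  // logical -> physical page
    std::vector<bool> page_valid;                // per physical page
    uint64_t next_free = 0;                      // naive free-page allocator

    // NAND pages cannot be overwritten in place: a write goes to a fresh
    // page and the old copy is merely marked invalid, to be reclaimed
    // later by garbage collection.
    void write(uint64_t logical_page) {
        auto it = l2p.find(logical_page);
        if (it != l2p.end()) page_valid[it->second] = false;  // invalidate old copy
        uint64_t phys = next_free++;                          // out-of-place write
        if (phys >= page_valid.size()) page_valid.resize(phys + 1);
        page_valid[phys] = true;
        l2p[logical_page] = phys;                             // remap the page
    }
};
```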
Where We Are in Lecture Schedule
The memory hierarchy
Caches, caches, more caches
Virtualizing the memory hierarchy: Virtual Memory
Main memory: DRAM
Main memory control, scheduling
Memory latency tolerance techniques
Non-volatile memory
Multiprocessors
Coherence and consistency
Interconnection networks
Multi-core issues
Required Reading (for the Next Few Lectures)
Onur Mutlu, Justin Meza, and Lavanya Subramanian,
"The Main Memory System: Challenges and
Opportunities"
Invited Article in Communications of the Korean Institute of
Information Scientists and Engineers (KIISE), 2015.
http://users.ece.cmu.edu/~omutlu/pub/main-memory-system_kiise15.pdf
Required Readings on DRAM
DRAM Organization and Operation Basics
Sections 1 and 2 of: Lee et al., “Tiered-Latency DRAM: A Low
Latency and Low Cost DRAM Architecture,” HPCA 2013.
http://users.ece.cmu.edu/~omutlu/pub/tldram_hpca13.pdf
DRAM Types
DRAM has different types with different interfaces optimized
for different purposes
Commodity: DDR, DDR2, DDR3, DDR4, …
Low power (for mobile): LPDDR1, …, LPDDR5, …
High bandwidth (for graphics): GDDR2, …, GDDR5, …
Low latency: eDRAM, RLDRAM, …
3D stacked: WIO, HBM, HMC, …
…
Underlying microarchitecture is fundamentally the same
A flexible memory controller can support various DRAM types
This complicates the memory controller
Difficult to support all types (and upgrades)
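One way to see the complication: to the controller, each DRAM type is a distinct set of organization and timing parameters it must honor. A hypothetical sketch of such a parameter set (illustrative fields only; Ramulator, cited below, shows what a complete, real parameterization looks like):

```cpp
#include <string>

// Illustrative sketch: a type-flexible controller is written against
// per-standard parameters like these instead of hard-coded constants.
struct DramSpec {
    std::string name;             // e.g., "DDR3", "LPDDR4", "GDDR5", "HBM"
    int channels, ranks, banks;   // organization
    int tCK_ps;                   // clock period, in picoseconds
    int tRCD, tRP, tCL, tRAS;     // core timing constraints, in cycles
    int tREFI, tRFC;              // refresh interval and latency, in cycles
};
```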
DRAM Types (II)
Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator,” IEEE Comp Arch Letters 2015.
DRAM Controller: Functions
Ensure correct operation of DRAM (refresh and timing)
On-chip memory controllers (on the CPU die):
+ Reduced latency for main memory access
+ Higher bandwidth between cores and controller
+ More information can be communicated (e.g., a request's importance in the processing core)
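Returning to the first function above, a minimal sketch of a controller loop that refreshes on time and respects a single timing constraint between commands (constants and names here are illustrative, not from any real device):

```cpp
#include <cstdint>
#include <deque>

// Sketch of the controller's two correctness duties from this slide:
// refresh every row often enough, and never violate device timing.
struct Request { uint64_t row; uint64_t arrival_cycle; };

struct Controller {
    std::deque<Request> queue;    // pending requests, oldest first
    uint64_t now = 0;
    uint64_t last_refresh = 0;
    uint64_t next_ready = 0;      // earliest cycle a command may issue
    static constexpr uint64_t tREFI = 7800;  // refresh interval (illustrative)
    static constexpr uint64_t tRC = 50;      // row cycle time (illustrative)

    void tick() {
        ++now;
        if (now - last_refresh >= tREFI) {   // correctness first: refresh
            last_refresh = now;
            next_ready = now + tRC;          // bank busy during refresh
        } else if (!queue.empty() && now >= next_ready) {
            queue.pop_front();               // issue oldest request (row
            next_ready = now + tRC;          // management omitted for brevity)
        }
    }
};
```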
A Modern DRAM Controller (I)
[Diagram]
A Modern DRAM Controller (II)
[Diagram]
DRAM Scheduling Policies (I)
FCFS (first come first served)
Oldest request first
FR-FCFS (first ready, first come first served)
Row-hit requests first, then oldest first; maximizes row-buffer hit rate and hence DRAM throughput (both are sketched below)
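A sketch of both rules as request-picking functions; FR-FCFS is the row-hit-first, then oldest-first policy referred to throughout this lecture (field names are illustrative):

```cpp
#include <cstdint>
#include <vector>

struct Req { uint64_t row; uint64_t arrival; };

// FCFS: the oldest request wins.
const Req* pick_fcfs(const std::vector<Req>& q) {
    const Req* best = nullptr;
    for (const auto& r : q)
        if (!best || r.arrival < best->arrival) best = &r;
    return best;
}

// FR-FCFS: row-hit requests first, then oldest first.
const Req* pick_frfcfs(const std::vector<Req>& q, uint64_t open_row) {
    const Req* best = nullptr;
    for (const auto& r : q) {
        if (!best) { best = &r; continue; }
        bool r_hit = (r.row == open_row), b_hit = (best->row == open_row);
        if (r_hit != b_hit) { if (r_hit) best = &r; }       // row hit wins
        else if (r.arrival < best->arrival) best = &r;      // then oldest
    }
    return best;
}
```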
DRAM Scheduling Policies (II)
A scheduling policy is a request prioritization order
Row buffer management policies:
Open row
Keep the row open after an access
+ Next access to the same row is a row hit (no activate needed)
-- Next access to a different row is a row conflict (precharge + activate needed)
Closed row
Close the row after an access (if no other requests already in the request buffer need the same row; see the sketch below)
+ Next access might need a different row → avoids a row conflict
-- Next access might need the same row → extra activate latency
Adaptive policies
Predict whether or not the next access to the bank will be to the same row
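A minimal sketch of the closed-row decision above; an adaptive policy would replace the queue scan with a predictor of the next access (names illustrative):

```cpp
#include <cstdint>
#include <vector>

// Closed-row rule: after servicing a request, close (precharge) the row
// only if no other queued request needs it.
bool should_close_row(const std::vector<uint64_t>& queued_rows,
                      uint64_t open_row) {
    for (uint64_t row : queued_rows)
        if (row == open_row) return false;  // keep open: a guaranteed row hit
    return true;  // close now; hides the precharge latency of a future conflict
}
```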
Open vs. Closed Row Policies
[Table: example command sequences for open-row vs. closed-row policies on row hits and row conflicts]
Memory Interference and Scheduling in Multi-Core Systems
Review: A Modern DRAM Controller
[Diagram]
Review: DRAM Bank Operation
[Diagram: DRAM bank operation. The first access, (Row 0, Column 0), finds an empty (closed) row buffer, so the row decoder activates Row 0 into the row buffer. The next accesses, (Row 0, Column 1) and (Row 0, Column 85), are row hits served through the column mux. A following access to (Row 1, Column 0) is a row conflict: Row 0 must be closed and Row 1 activated before the column access]
Scheduling Policy for Single-Core Systems
A row-conflict memory access takes significantly longer than a row-hit access (with typical DDR3 timings, a hit costs roughly tCL ≈ 14 ns while a conflict costs tRP + tRCD + tCL ≈ 41 ns, about 3x)
Current controllers take advantage of the row buffer
Trend: Many Cores on Chip
Simpler and lower power than a single large core
Large scale parallelism on chip
Many Cores on Chip
What we want:
N times the system performance with N times the cores
(Un)expected Slowdowns in Multi-Core
[Chart: slowdowns of two applications on a dual-core system; unexpectedly, the high-priority application (Core 0) is slowed down far more than the low-priority application (Core 1)]
Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service
in multi-core systems,” USENIX Security 2007.
Uncontrolled Interference: An Example
[Diagram: a multi-core chip with two cores, one running stream and one running random; each core has a private L2 cache, and both share the interconnect, the DRAM memory controller, and the DRAM memory system, which is where the unfairness arises]
A Memory Performance Hog
Both programs first initialize large arrays A and B ("// initialize large arrays A, B"), then:
STREAM: sequential memory access; very high row buffer locality (96% hit rate); memory intensive
RANDOM: random memory access; very low row buffer locality (3% hit rate); similarly memory intensive
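The slide retains only the initialization comment; a plausible reconstruction of the two kernels (an illustrative sketch, not the original benchmark code):

```cpp
#include <cstdlib>

enum { N = 1 << 22 };   // "large arrays": 16 MB each, far bigger than caches
static int A[N], B[N];

// STREAM: sequential accesses; almost every access hits the open row.
void stream_kernel() {
    for (int j = 0; j < N; ++j)
        A[j] = B[j] * 2;
}

// RANDOM: each access goes to an effectively random row; mostly conflicts.
void random_kernel() {
    for (int j = 0; j < N; ++j) {
        int idx = rand() % N;
        A[idx] = B[idx] * 2;
    }
}
```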
What Does the Memory Hog Do?
[Diagram: the memory request buffer fills with T0's (stream's) requests to Row 0 while T1's (random's) requests to Rows 5, 111, and 16 wait; since Row 0 is in the row buffer, a row-hit-first policy keeps servicing T0's requests and T1 is effectively starved]
Effect of the Memory Performance Hog
[Chart: slowdown of each application when run together with the hog; up to a 2.82X slowdown (applications shown include STREAM, gcc, and Virtual PC)]
Results on Intel Pentium D running Windows XP
(Similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)
Problems due to Uncontrolled Interference
Main memory is the only shared resource
[Chart: slowdowns of high- and low-priority threads sharing memory with a memory performance hog; the hog makes rapid progress while the other cores make very slow progress]
Problem: QoS-Unaware Memory Control
Existing DRAM controllers are unaware of inter-thread
interference in DRAM system
Solution: QoS-Aware Memory Request Scheduling
Stall-Time Fair Memory Scheduling
DRAM-related stall-time: The time a thread spends waiting for DRAM memory
ST_shared: DRAM-related stall-time when the thread runs with other threads
ST_alone: DRAM-related stall-time when the thread runs alone
Memory-slowdown = ST_shared / ST_alone
The relative increase in stall-time (e.g., ST_shared = 2M cycles and ST_alone = 1M cycles give a memory-slowdown of 2)
STFM Scheduling Algorithm [MICRO’07]
For each thread, the DRAM controller
Tracks ST_shared
Estimates ST_alone
If unfairness < α:
Use DRAM-throughput-oriented scheduling policy
If unfairness ≥ α:
Use fairness-oriented scheduling policy (prioritize requests of the thread with the highest slowdown); the switch is sketched below
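A sketch of this switch, computing unfairness as MAX memory-slowdown / MIN memory-slowdown (the definition used later in this lecture); α and all names are illustrative:

```cpp
#include <algorithm>
#include <vector>

// Per-thread state the controller maintains (per the STFM slide above).
struct ThreadStats { double st_shared; double st_alone_estimate; };

double slowdown(const ThreadStats& t) {
    return t.st_shared / t.st_alone_estimate;
}

// Returns true when the fairness-oriented policy should be used.
bool use_fairness_policy(const std::vector<ThreadStats>& threads,
                         double alpha /* unfairness threshold */) {
    double max_s = 0, min_s = 1e30;
    for (const auto& t : threads) {
        max_s = std::max(max_s, slowdown(t));
        min_s = std::min(min_s, slowdown(t));
    }
    double unfairness = max_s / min_s;
    return unfairness >= alpha;  // else: throughput-oriented (FR-FCFS-like)
}
```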
How Does STFM Prevent Unfairness?
[Animation: the request buffer holds T0's row-hit requests to Row 0 and T1's requests to Rows 5, 111, and 16. As requests are serviced, each thread's estimated slowdown evolves (T0 from 1.00 up to about 1.10, T1 from 1.00 up to about 1.14) along with the unfairness value; once unfairness exceeds the threshold, STFM overrides row-hit-first and services the more-slowed-down thread's requests]
STFM Pros and Cons
Upsides:
First algorithm for fair multi-core memory scheduling
Provides a mechanism to estimate memory slowdown of a
thread
Good at providing fairness
Being fair can improve performance
Downsides:
Does not handle all types of interference
(Somewhat) complex to implement
Slowdown estimations can be incorrect
Parallelism-Aware Batch Scheduling
Bank Parallelism of a Thread
[Timeline: a single thread issues 2 DRAM requests, one to Bank 0, Row 1 and one to Bank 1, Row 1; the two banks serve them in parallel, so the thread's execution is Compute, a single overlapped stall, then Compute]
Bank Parallelism Interference in DRAM
[Timeline: with two threads, a baseline scheduler interleaves their requests within each bank; each thread's 2 DRAM requests are effectively serialized, so both threads stall for roughly two bank access latencies]
Parallelism-Aware Scheduler
[Timeline: a parallelism-aware scheduler instead services each thread's 2 DRAM requests back to back in different banks, overlapping their latencies; average stall time drops without lengthening either thread's worst case]
Parallelism-Aware Batch Scheduling (PAR-BS)
Principle 1: Parallelism-awareness
Schedule requests from a thread (to different banks) back to back
Preserves each thread's bank parallelism
But, this can cause starvation…
Principle 2: Request batching
Group the oldest requests into a batch and service the batch before all other requests; within-batch scheduling is parallelism-aware
[Diagram: a request buffer for two banks holding requests from threads T0-T3, with the oldest requests marked as the current batch]
Request Batching
Batch formation:
Mark up to Marking-Cap oldest requests per bank for each thread
Marked requests constitute the batch
Form a new batch when no marked requests are left
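A sketch of batch formation (assuming the queue is kept sorted oldest-first; names illustrative):

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct BReq { int thread, bank; uint64_t arrival; bool marked = false; };

// Mark up to marking_cap oldest requests per bank for each thread;
// the marked requests constitute the batch.
void form_batch(std::vector<BReq>& queue, int marking_cap) {
    std::map<std::pair<int, int>, int> marked_count;  // (thread, bank) -> count
    // Assumes `queue` is ordered oldest-first.
    for (auto& r : queue) {
        int& c = marked_count[{r.thread, r.bank}];
        if (c < marking_cap) { r.marked = true; ++c; }
    }
}
```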
Within-Batch Scheduling
Can use any DRAM scheduling policy
FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality
But, we also want to preserve intra-thread bank parallelism
Service each thread’s requests back to back
HOW?
Scheduler computes a ranking of threads when the batch is
formed
Higher-ranked threads are prioritized over lower-ranked ones
Improves the likelihood that requests from a thread are serviced in
parallel by different banks
Different threads prioritized in the same order across ALL banks
How to Rank Threads within a Batch
Ranking scheme affects system throughput and fairness
Shortest stall-time first (shortest job first) ranking, using two per-thread metrics:
max-bank-load: number of marked requests to the thread's most heavily loaded bank
total-load: total number of marked requests (tie-breaker)

thread  max-bank-load  total-load
T0      1              3
T1      2              4
T2      2              6
T3      5              9

Resulting rank: T0 > T1 > T2 > T3 (lighter threads ranked higher)
[Diagram: the per-bank grid of marked requests from which these loads are computed]
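A sketch of this ranking computed from per-thread loads like those in the table above (names illustrative):

```cpp
#include <algorithm>
#include <vector>

// Shortest-job-first ranking: order threads by max-bank-load, breaking
// ties with total-load; lighter threads get higher rank.
struct Load { int thread, max_bank_load, total_load; };

std::vector<int> rank_threads(std::vector<Load> loads) {
    std::sort(loads.begin(), loads.end(), [](const Load& a, const Load& b) {
        if (a.max_bank_load != b.max_bank_load)
            return a.max_bank_load < b.max_bank_load;
        return a.total_load < b.total_load;
    });
    std::vector<int> order;                   // highest-ranked thread first
    for (const auto& l : loads) order.push_back(l.thread);
    return order;  // for the table above: T0, T1, T2, T3
}
```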
Example Within-Batch Scheduling Order
[Diagram: requests from T0-T3 queued at four banks, serviced over time under the two orders]
Baseline (arrival order) stall times: T0 = 4, T1 = 4, T2 = 5, T3 = 7; average 5 bank access latencies
PAR-BS scheduling order stall times: T0 = 1, T1 = 2, T2 = 4, T3 = 7; average 3.5 bank access latencies
Putting It Together: PAR-BS Scheduling Policy
(1) Marked requests first (request batching)
(2) Row-hit requests first
(3) Higher-rank thread first (shortest stall-time first)
(4) Oldest first
Rules (2)-(4) form the parallelism-aware within-batch scheduling (composed into a single comparator in the sketch below)
Three properties:
Exploits row-buffer locality and intra-thread bank parallelism
Work-conserving: does not waste bandwidth when it can be used
Services unmarked requests to banks without marked requests
Marking-Cap is important
Too small cap: destroys row-buffer locality
Too large cap: penalizes memory non-intensive threads
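A sketch of the four rules composed into one priority comparator (field names illustrative):

```cpp
#include <cstdint>

struct PReq {
    bool marked, row_hit;
    int rank;           // smaller value = higher-ranked thread
    uint64_t arrival;
};

// Returns true if request a should be serviced before request b.
bool higher_priority(const PReq& a, const PReq& b) {
    if (a.marked  != b.marked)  return a.marked;       // (1) batching
    if (a.row_hit != b.row_hit) return a.row_hit;      // (2) row-buffer locality
    if (a.rank    != b.rank)    return a.rank < b.rank; // (3) shortest job first
    return a.arrival < b.arrival;                      // (4) oldest first
}
```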
Unfairness on 4-, 8-, 16-core Systems
Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]
[Chart: unfairness (lower is better) of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems; PAR-BS achieves the lowest unfairness at every core count]
System Performance
[Chart: normalized hmean speedup of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems; PAR-BS provides the highest system performance]
PAR-BS Pros and Cons
Upsides:
First scheduler to address bank parallelism destruction across
multiple threads
Simple mechanism (vs. STFM)
Batching provides fairness
Ranking enables parallelism awareness
Downsides:
Does not always prioritize the latency-sensitive applications
TCM:
Thread Cluster Memory Scheduling
Unfairness is caused by memory-intensive threads being prioritized over one another → shuffle the thread ranking
Memory-intensive threads have different vulnerability to interference → shuffle asymmetrically
Thread Cluster Memory Scheduling [Kim+ MICRO’10]
1. Group threads into two clusters by memory intensity (MPKI)
2. Prioritize the non-intensive cluster over the intensive cluster
3. Use a different policy within each cluster
Non-intensive cluster (memory-non-intensive threads): prioritized higher, scheduled for throughput
Intensive cluster (memory-intensive threads): lower priority
The non-intensive cluster is sized to account for at most a fraction α of the total memory bandwidth usage T (α < 10%, set by the ClusterThreshold parameter)
[Diagram: threads sorted by MPKI and split at αT into non-intensive and intensive clusters]
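A sketch of the clustering step under these definitions (names illustrative; the actual TCM algorithm in Kim+ MICRO'10 differs in details such as how bandwidth usage is measured):

```cpp
#include <algorithm>
#include <vector>

struct TThread { int id; double mpki; double bw_usage; };

// Put the least memory-intensive threads into the non-intensive cluster
// until they account for an alpha fraction of total bandwidth usage T.
std::vector<int> non_intensive_cluster(std::vector<TThread> ts, double alpha) {
    double total = 0;
    for (const auto& t : ts) total += t.bw_usage;        // T
    std::sort(ts.begin(), ts.end(),
              [](const TThread& a, const TThread& b) { return a.mpki < b.mpki; });
    std::vector<int> cluster;
    double used = 0;
    for (const auto& t : ts) {
        if (used + t.bw_usage > alpha * total) break;    // cluster capped at alpha*T
        used += t.bw_usage;
        cluster.push_back(t.id);                         // prioritized over the rest
    }
    return cluster;
}
```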
TCM: Quantum-Based Operation
Time is divided into quanta (~1M cycles each)
During a quantum, monitor each thread's behavior:
1. Memory intensity
2. Bank-level parallelism
3. Row-buffer locality
At the beginning of each quantum:
• Perform clustering
• Compute niceness of intensive threads
Within the quantum, the ranking of intensive threads is shuffled every shuffle interval (~1K cycles)
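A skeleton of this timing structure (interval constants from the slide; the stub bodies stand in for the mechanisms described above and elsewhere in this lecture):

```cpp
#include <cstdint>

constexpr uint64_t kQuantum = 1'000'000;  // ~1M cycles
constexpr uint64_t kShuffle = 1'000;      // ~1K cycles

void perform_clustering()       { /* sort by MPKI, split at ClusterThreshold */ }
void shuffle_intensive_ranks()  { /* asymmetric shuffle based on niceness */ }
void monitor_thread_behavior()  { /* sample intensity, BLP, row locality */ }

void tcm_tick(uint64_t cycle) {
    if (cycle % kQuantum == 0) perform_clustering();   // beginning of quantum
    if (cycle % kShuffle == 0) shuffle_intensive_ranks();
    monitor_thread_behavior();   // sampled continuously during the quantum
}
```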
TCM: Scheduling Algorithm
1. Highest-rank: requests from higher-ranked threads are prioritized
• Non-intensive cluster > intensive cluster
• Non-intensive cluster: lower intensity → higher rank
• Intensive cluster: rank determined by shuffling
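A sketch of rule 1 as a comparator (field names illustrative; the age tie-breaker is my assumption, in line with the other schedulers in this lecture):

```cpp
#include <cstdint>

struct TReq {
    bool intensive_cluster;
    int rank;           // smaller = higher rank (from intensity or shuffling)
    uint64_t arrival;
};

// Returns true if request a should be serviced before request b.
bool tcm_higher_priority(const TReq& a, const TReq& b) {
    if (a.intensive_cluster != b.intensive_cluster)
        return !a.intensive_cluster;           // non-intensive cluster first
    if (a.rank != b.rank) return a.rank < b.rank;
    return a.arrival < b.arrival;              // fall back to oldest first
}
```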
TCM: Throughput and Fairness
24 cores, 4 memory controllers, 96 workloads
[Plot: maximum slowdown (lower is better fairness) vs. weighted speedup (higher is better system throughput) for FRFCFS, ATLAS, STFM, PAR-BS, and TCM]
TCM, a heterogeneous scheduling policy, provides the best fairness and system throughput
TCM: Fairness-Throughput Tradeoff
When the configuration parameter (ClusterThreshold) is varied…
[Plot: adjusting ClusterThreshold traces TCM's operating curve in the maximum-slowdown vs. weighted-speedup plane; the curve dominates the FRFCFS, STFM, ATLAS, and PAR-BS operating points, exposing a controllable fairness-throughput tradeoff]
TCM Pros and Cons
Upsides:
Provides both high fairness and high system throughput (see above)
Downsides:
Scalability to large buffer sizes?
Robustness of clustering and shuffling algorithms?