9/14/15
Memory: Chapter 2 & Appendix B, Part 3
Gregory D. Peterson
gdp@utk.edu
Introduction
Memory Hierarchy Basics
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

Memory Hierarchy Basics
• Six basic cache optimizations:
  – Larger block size
    • Reduces compulsory misses
    • Increases capacity and conflict misses, increases miss penalty
  – Larger total cache capacity to reduce miss rate
    • Increases hit time, increases power consumption
  – Higher associativity
    • Reduces conflict misses
    • Increases hit time, increases power consumption
  – Higher number of cache levels
    • Reduces overall memory access time
  – Giving priority to read misses over writes
    • Reduces miss penalty
  – Avoiding address translation in cache indexing
    • Reduces hit time
Copyright © 2012, Elsevier Inc. All rights reserved.
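Q1 and Q2 have a concrete answer for a direct-mapped cache: the block address is split into tag, index, and offset fields. A minimal sketch, using illustrative sizes (2-word blocks, 8 sets) that are assumptions rather than values from the slides:

```python
BLOCK_SIZE = 2    # words per block (assumption)
NUM_SETS = 8      # direct-mapped: one block per set (assumption)

def decompose(word_addr):
    """Return (tag, index, offset) for a word address."""
    offset = word_addr % BLOCK_SIZE       # word within the block
    block_addr = word_addr // BLOCK_SIZE  # which memory block
    index = block_addr % NUM_SETS         # Q1: the one set it can occupy
    tag = block_addr // NUM_SETS          # Q2: stored tag identifies the block
    return tag, index, offset
```

For example, word addresses 4 and 36 produce different tags but the same index, so in a direct-mapped cache they conflict for the same set.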
Advanced Optimizations

Ten Advanced Optimizations

L1 Size and Associativity
• Small and simple first level caches
  – Critical timing path:
    • addressing tag memory, then
    • comparing tags, then
    • selecting correct set
  – Direct-mapped caches can overlap tag compare and transmission of data
  – Lower associativity reduces power because fewer cache lines are accessed
Figure: access time vs. size and associativity
L1 Size and Associativity
Figure: energy per read vs. size and associativity

Cache Example
• Assume a direct-mapped cache with 16 words and a block size of 2 words.
• Which of these accesses hit and which miss, and what are the final contents, after the following word addresses?
• 4, 36, 4, 13, 7, 12, 15, 11, 8, 56, 27, 21, 12
• What if the cache is 2-way set associative?
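One way to check the exercise is to simulate it. The sketch below assumes word addressing and LRU replacement; `ways=1` gives the direct-mapped case, `ways=2` the 2-way case:

```python
def simulate(trace, total_words=16, block_words=2, ways=1):
    """Count (hits, misses) for a word-address trace on a small cache."""
    num_sets = total_words // block_words // ways
    sets = [[] for _ in range(num_sets)]   # each set: list of tags, MRU last
    hits = 0
    for addr in trace:
        block = addr // block_words
        idx, tag = block % num_sets, block // num_sets
        s = sets[idx]
        if tag in s:
            hits += 1
            s.remove(tag)       # refresh LRU position
        elif len(s) == ways:
            s.pop(0)            # evict the least recently used tag
        s.append(tag)
    return hits, len(trace) - hits

trace = [4, 36, 4, 13, 7, 12, 15, 11, 8, 56, 27, 21, 12]
```

Under these assumptions the direct-mapped cache hits only on the two accesses to address 12 (2 hits, 11 misses), while 2-way associativity also turns the second access to address 4 into a hit (3 hits, 10 misses), since 4 and 36 no longer evict each other.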
Way Prediction
• To improve hit time, predict the way to pre-set the mux
  – Mis-prediction gives longer hit time
  – Prediction accuracy
    • > 90% for two-way
    • > 80% for four-way
    • I-cache has better accuracy than D-cache
  – First used on MIPS R10000 in mid-90s
  – Used on ARM Cortex-A8
• Extend to predict block as well
  – “Way selection”
  – Increases mis-prediction penalty

Pipelining Cache
• Pipeline cache access to improve bandwidth
  – Examples:
    • Pentium: 1 cycle
    • Pentium Pro – Pentium III: 2 cycles
    • Pentium 4 – Core i7: 4 cycles
• Increases branch mis-prediction penalty
• Makes it easier to increase associativity
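A minimal sketch of the idea, assuming a simple most-recently-used predictor per set (real predictors typically hash address or PC bits; this is an illustration, not any specific machine's design):

```python
class WayPredictor:
    """Predict which way of a set-associative cache will hit next."""

    def __init__(self, num_sets):
        # Assumption: predict the most recently used way of each set.
        self.last_way = [0] * num_sets

    def predict(self, set_index):
        return self.last_way[set_index]

    def update(self, set_index, actual_way):
        self.last_way[set_index] = actual_way
```

On a correct prediction only the predicted way's tag and data arrays are read, saving power and the way-select mux delay; a mis-prediction costs extra time to probe the remaining ways, which is why the prediction accuracies above matter.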
Nonblocking Caches
• Allow hits before previous misses complete
  – “Hit under miss”
  – “Hit under multiple miss”
• L2 must support this
• In general, processors can hide L1 miss penalty but not L2 miss penalty

Multibanked Caches
• Organize cache as independent banks to support simultaneous access
  – ARM Cortex-A8 supports 1-4 banks for L2
  – Intel i7 supports 4 banks for L1 and 8 banks for L2
• Interleave banks according to block address
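Block-address interleaving can be sketched in one line: consecutive blocks are striped round-robin across banks, so a sequential stream keeps all banks busy at once. The bank count and block size below are illustrative assumptions:

```python
NUM_BANKS = 4      # assumption: 4 banks, as in the i7 L1 above
BLOCK_BYTES = 64   # assumption: 64-byte cache blocks

def bank_of(byte_addr):
    """Map a byte address to its cache bank by block address."""
    block_addr = byte_addr // BLOCK_BYTES
    return block_addr % NUM_BANKS
```

Consecutive blocks at byte addresses 0, 64, 128, 192 land in banks 0, 1, 2, 3, so four sequential block accesses can proceed simultaneously.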
Critical Word First, Early Restart
• Critical word first
  – Request missed word from memory first
  – Send it to the processor as soon as it arrives
• Early restart
  – Request words in normal order
  – Send missed word to the processor as soon as it arrives
• Effectiveness of these strategies depends on block size and likelihood of another access to the portion of the block that has not yet been fetched

Merging Write Buffer
• When storing to a block that is already pending in the write buffer, update the write buffer
• Reduces stalls due to a full write buffer
• Do not apply to I/O addresses
Figure: write buffer contents without and with write merging
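The merging rule can be sketched as a map from block address to pending words; a store to an already-pending block merges into the existing entry instead of consuming a new one. Word-granularity writes, 4-word blocks, and an unbounded buffer are simplifying assumptions here (real buffers have a handful of entries):

```python
BLOCK_WORDS = 4   # assumption: 4-word write buffer entries

class MergingWriteBuffer:
    def __init__(self):
        self.entries = {}   # block address -> {word offset: value}

    def write(self, word_addr, value):
        block = word_addr // BLOCK_WORDS
        offset = word_addr % BLOCK_WORDS
        # A store to a block already pending merges into that entry
        # instead of allocating a new one.
        self.entries.setdefault(block, {})[offset] = value

    def occupancy(self):
        return len(self.entries)
```

Four stores to consecutive addresses in one block occupy a single entry rather than four, which is exactly why the buffer fills more slowly and stalls less.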
Compiler Optimizations
• Loop Interchange
  – Swap nested loops to access memory in sequential order
• Blocking
  – Instead of accessing entire rows or columns, subdivide matrices into blocks
  – Requires more memory accesses but improves locality of accesses

Hardware Prefetching
• Fetch two blocks on a miss (include the next sequential block)
Figure: Pentium 4 pre-fetching
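Both compiler transformations can be sketched directly. This is illustrative Python (the cache benefit really shows up in compiled code over row-major arrays, but the access patterns are the same):

```python
# Loop interchange: make the innermost loop walk along rows (stride-1
# in row-major storage) instead of down columns.
def scale_rowwise(x, n):
    for i in range(n):          # was: for j ... for i ... x[i][j]
        for j in range(n):
            x[i][j] *= 2

# Blocking: multiply in bs-by-bs tiles so each tile of A, B, and C
# stays cache-resident while it is reused.
def blocked_matmul(A, B, n, bs=2):
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for j in range(jj, min(jj + bs, n)):
                        for k in range(kk, min(kk + bs, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The blocked version performs the same arithmetic as a naive triple loop, only in a different order, so the result is identical while each element is reused while still cached.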
Compiler Prefetching
• Insert prefetch instructions before data is needed
• Non-faulting: prefetch doesn’t cause exceptions
• Register prefetch
  – Loads data into register
• Cache prefetch
  – Loads data into cache
• Combine with loop unrolling and software pipelining
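The cache-prefetch payoff can be illustrated with a toy direct-mapped cache and a streaming loop that issues a non-faulting prefetch one block ahead of each demand access. All parameters here are illustrative assumptions; in C this would use a compiler intrinsic such as GCC's `__builtin_prefetch`:

```python
BLOCK_WORDS = 4     # assumed block size, in words
NUM_SETS = 64       # assumed cache size (large enough to avoid conflicts)

class ToyCache:
    def __init__(self):
        self.tags = [None] * NUM_SETS
        self.demand_misses = 0

    def _lookup(self, word_addr, is_prefetch):
        block = word_addr // BLOCK_WORDS
        idx, tag = block % NUM_SETS, block // NUM_SETS
        if self.tags[idx] != tag:
            if not is_prefetch:
                self.demand_misses += 1   # a prefetch miss does not stall
            self.tags[idx] = tag          # fill the block either way

    def access(self, addr):
        self._lookup(addr, is_prefetch=False)

    def prefetch(self, addr):
        self._lookup(addr, is_prefetch=True)

def stream_sum(n, use_prefetch):
    """Sequentially touch words 0..n-1; return demand misses."""
    cache = ToyCache()
    for a in range(n):
        if use_prefetch:
            cache.prefetch(a + BLOCK_WORDS)   # one block ahead
        cache.access(a)
    return cache.demand_misses
```

Over a 64-word stream the plain loop takes a demand miss on every block (16 misses), while the prefetching loop misses only on the very first block, since every later block is fetched before it is needed.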
Summary