Implementation of a Set-Associative Cache
[Figure: 4-way set-associative cache built from four comparators and one 4-to-1 multiplexor. The 32-bit address is split into a 22-bit tag (bits 31-10) and an 8-bit index (bits 9-2); the index selects one of 256 sets (0-255), each holding four (V, Tag, Data) entries whose tags are compared in parallel, and the multiplexor steers out the matching way's data along with the Hit signal. Size of cache is 1K blocks = 256 sets * 4-block set size.]
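As a rough illustration, here is a minimal C sketch of the lookup path for this organization (cache_t, line_t, and cache_lookup are illustrative names, not from the figure):

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 256                      /* 8-bit index */
#define NUM_WAYS 4                        /* 4-way set-associative */

typedef struct {
    bool     valid;                       /* V bit */
    uint32_t tag;                         /* 22-bit tag */
    uint32_t data;                        /* one data word per block */
} line_t;

typedef struct {
    line_t sets[NUM_SETS][NUM_WAYS];
} cache_t;

/* Index selects the set; all four tags are compared (in hardware,
 * in parallel), and a 4-to-1 multiplexor steers out the hit data. */
bool cache_lookup(const cache_t *c, uint32_t addr, uint32_t *data_out)
{
    uint32_t index = (addr >> 2) & (NUM_SETS - 1);   /* bits 9..2   */
    uint32_t tag   = addr >> 10;                     /* bits 31..10 */

    for (int way = 0; way < NUM_WAYS; way++) {
        const line_t *l = &c->sets[index][way];
        if (l->valid && l->tag == tag) {
            *data_out = l->data;                     /* Hit */
            return true;
        }
    }
    return false;                                    /* Miss */
}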
Way-Predicting Caches
• Use processor address to index into way prediction table
• Look in predicted way at given index, then (sketched in C below):
– HIT: return copy of data from cache
– MISS: look in the other way:
– SLOW HIT: return the data and change the entry in the prediction table
– MISS: read block of data from next level of cache
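A compact C sketch of this flow, assuming a 2-way cache with one predicted way per set (all identifiers are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 256
#define NUM_WAYS 2

typedef struct { bool valid; uint32_t tag; uint32_t data; } line_t;

static line_t  cache[NUM_SETS][NUM_WAYS];
static uint8_t pred_way[NUM_SETS];          /* way-prediction table */

/* Stub standing in for the miss path to the next cache level. */
static uint32_t next_level_read(uint32_t addr) { (void)addr; return 0; }

static bool probe(uint32_t set, uint32_t way, uint32_t tag, uint32_t *out)
{
    line_t *l = &cache[set][way];
    if (l->valid && l->tag == tag) { *out = l->data; return true; }
    return false;
}

uint32_t cache_access(uint32_t addr)
{
    uint32_t set   = (addr >> 2) & (NUM_SETS - 1);
    uint32_t tag   = addr >> 10;
    uint32_t guess = pred_way[set];
    uint32_t data;

    if (probe(set, guess, tag, &data))        /* HIT in predicted way */
        return data;
    if (probe(set, guess ^ 1, tag, &data)) {  /* SLOW HIT in other way */
        pred_way[set] = guess ^ 1;            /* change prediction-table entry */
        return data;
    }
    return next_level_read(addr);             /* MISS: next level of cache */
}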
Reducing Miss Rates by Compiler Optimizations
Merging Arrays
Before (two independent arrays):

int val[SIZE];
int key[SIZE];

for (i=0; i<SIZE; i++){
  key[i] = newkey;
  val[i]++;
}

After (merged into one array of structs):

struct record{
  int val;
  int key;
};
struct record records[SIZE];

for (i=0; i<SIZE; i++){
  records[i].key = newkey;
  records[i].val++;
}
• Reduces conflicts between val & key and improves spatial
locality
Loop Fusion
Before (split loops):

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    a[i][j] = 1/b[i][j] * c[i][j];

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    d[i][j] = a[i][j] + c[i][j];

After (fused loop; the reference to a[i][j] in the second statement can come directly from a register):

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++){
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
Split loops: every access to a and c misses. Fused loops: only the first access misses. Improves temporal locality
Summary of Compiler Optimizations
to Reduce Cache Misses
[Figure: Performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking on vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]
Reducing Misses by Hardware Prefetching of Instructions and Data
Hardware Instruction Prefetching
• Instruction prefetch in Alpha AXP 21064
– Fetch two blocks on a miss; the requested block (i) and the
next consecutive block (i+1)
– Requested block placed in cache, and next block in
instruction stream buffer
– If miss in cache but hit in stream buffer, move stream buffer
block into cache and prefetch next block (i+2)
[Figure: The CPU (with register file) requests instruction blocks from the L1 instruction cache; on a miss, the requested block is fetched from the unified L2 cache into L1, while the prefetched next block is placed in the stream buffer.]
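A simplified C model of this single-block stream-buffer scheme (the helper names are assumptions; real hardware holds more state):

#include <stdint.h>
#include <stdbool.h>

typedef struct { bool valid; uint32_t block; } stream_buf_t;

static stream_buf_t sbuf;                    /* one-block stream buffer */

/* Stubs standing in for the real I-cache and unified L2 interfaces. */
static bool icache_lookup(uint32_t block) { (void)block; return false; }
static void icache_fill(uint32_t block)   { (void)block; }
static void l2_fetch(uint32_t block)      { (void)block; }

void instruction_fetch(uint32_t block)
{
    if (icache_lookup(block))
        return;                              /* I-cache hit */

    if (sbuf.valid && sbuf.block == block) {
        icache_fill(block);                  /* move stream-buffer block into cache */
    } else {
        l2_fetch(block);                     /* demand-fetch requested block i */
        icache_fill(block);
    }

    l2_fetch(block + 1);                     /* prefetch next consecutive block */
    sbuf.block = block + 1;
    sbuf.valid = true;
}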
Hardware Data Prefetching
• Prefetch-on-miss:
– Prefetch b + 1 upon miss on b
• One Block Lookahead (OBL) scheme
– Initiate prefetch for block b + 1 when block b is accessed
– Can extend to N block lookahead
• Strided prefetch
– If the observed sequence of block accesses is b, b+N, b+2N, then prefetch b+3N, and so on (see the sketch below)
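A small C sketch of this strided scheme: confirm a constant stride N from accesses b, b+N, b+2N, then prefetch ahead (the entry layout and names are illustrative):

#include <stdint.h>

typedef struct {
    uint32_t last_block;    /* last block address seen  */
    int32_t  stride;        /* last observed stride     */
    int      confirmed;     /* same stride seen twice?  */
} stride_entry_t;

/* Stub standing in for issuing a prefetch to the memory system. */
static void prefetch(uint32_t block) { (void)block; }

void on_block_access(stride_entry_t *e, uint32_t block)
{
    int32_t stride = (int32_t)(block - e->last_block);

    /* b, b+N, b+2N observed => the stride is confirmed */
    e->confirmed = (stride != 0 && stride == e->stride);

    if (e->confirmed)
        prefetch(block + stride);   /* prefetch b+3N, then b+4N, ... */

    e->stride     = stride;
    e->last_block = block;
}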
Performance Impact of Prefetching
[Figure: Performance improvement from hardware prefetching across SPECint2000 and SPECfp2000 benchmarks; speedups range from 1.16 to 1.97.]
Translation Lookaside Buffer
[Figure: translation lookaside buffer organization with 256 entries; the diagram's labels include 28-bit fields and 8 KB / 4 MB sizes.]
Inclusive Cache
Consider a CPU with two levels of cache memory, and suppose a block X is requested. If the block is found in the L1 cache, the data is read from L1 and consumed by the CPU core. If the block is not in the L1 cache but is present in L2, it is fetched from L2 and placed in L1. If the L1 cache is full, a block is evicted from L1 to make room for the newer block, while the L2 cache is unchanged. If the block is found in neither L1 nor L2, it is fetched from memory and placed in both cache levels. In that case, if the L2 cache is full and a block is evicted to make room for the new data, the L2 cache sends an invalidation request to the L1 cache so that the evicted block is removed from there as well, preserving inclusion. Due to this invalidation procedure, an inclusive cache is slightly slower than a non-inclusive or exclusive cache.
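The policy above can be sketched in C as follows (the two-level helpers such as l2_evict and l1_invalidate are assumptions for illustration, and eviction is shown unconditionally for brevity):

#include <stdint.h>
#include <stdbool.h>

/* Stubs standing in for the two cache levels and memory. */
static bool     l1_lookup(uint32_t blk)     { (void)blk; return false; }
static bool     l2_lookup(uint32_t blk)     { (void)blk; return false; }
static uint32_t l1_evict(void)              { return 0; }  /* returns victim */
static uint32_t l2_evict(void)              { return 0; }  /* returns victim */
static void     l1_fill(uint32_t blk)       { (void)blk; }
static void     l2_fill(uint32_t blk)       { (void)blk; }
static void     l1_invalidate(uint32_t blk) { (void)blk; }
static void     mem_fetch(uint32_t blk)     { (void)blk; }

void inclusive_access(uint32_t blk)
{
    if (l1_lookup(blk))
        return;                          /* L1 hit */

    if (!l2_lookup(blk)) {               /* miss in both levels */
        mem_fetch(blk);
        uint32_t victim = l2_evict();    /* make room in L2 */
        l1_invalidate(victim);           /* back-invalidate: keep L1 a subset of L2 */
        l2_fill(blk);
    }
    l1_evict();                          /* make room in L1; L2 unchanged */
    l1_fill(blk);                        /* block now lives in both levels */
}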
Exclusive Cache
Now consider the same example with a non-inclusive or exclusive cache. Suppose the CPU core sends a request for block X. If block X is found in L1, it is read and consumed by the core from that location. If block X is not found in L1 but is present in L2, it is moved from L2 to L1. If there is no room in L1, one block is evicted from L1 and stored in L2; this is the only way the L2 cache is populated, so it acts as a victim cache. If block X is found in neither L1 nor L2, it is fetched from memory and placed in L1 only.
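And the exclusive counterpart, reusing the same assumed helpers plus an l2_remove for the move (not copy) from L2 to L1:

/* Assumed helper: drop a block from L2 when it moves up to L1. */
static void l2_remove(uint32_t blk) { (void)blk; }

void exclusive_access(uint32_t blk)
{
    if (l1_lookup(blk))
        return;                          /* L1 hit */

    if (l2_lookup(blk))
        l2_remove(blk);                  /* move, not copy: block leaves L2 */
    else
        mem_fetch(blk);                  /* miss in both: fill L1 only */

    uint32_t victim = l1_evict();        /* make room in L1...                  */
    l2_fill(victim);                     /* ...victim goes to L2 (victim cache) */
    l1_fill(blk);
}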
Pseudo-Associative / Column-Associative Cache