Compiler
Optimizations
and Prefetching
By: ADITYA P. P. PRASETYO, S. Kom., MT.
Ten Advanced Optimizations
– Small and simple first-level caches
– Critical timing path:
– addressing tag memory, then
– comparing tags, then
– selecting the correct data item (way)
– Direct-mapped caches can overlap tag compare and transmission of data
– Lower associativity reduces power because fewer cache lines are accessed
L1 Size and Associativity
[Figure: access time vs. size and associativity]
L1 Size and Associativity
[Figure: energy per read vs. size and associativity]
Way Prediction
– To improve hit time, predict the way to pre-set mux
– Mis-prediction gives longer hit time
– Prediction accuracy
– > 90% for two-way
– > 80% for four-way
– I-cache has better accuracy than D-cache
– First used on the MIPS R10000 in the mid-1990s
– Used on the ARM Cortex-A8
– Extend to predict block as well
– “Way selection”
– Increases mis-prediction penalty
Pipelining Cache
– Pipeline cache access to improve bandwidth
– Examples:
– Pentium: 1 cycle
– Pentium Pro – Pentium III: 2 cycles
– Pentium 4 – Core i7: 4 cycles
– Increases branch mis-prediction penalty
– Makes it easier to increase associativity
Nonblocking Caches
– Allow hits before previous misses complete
– “Hit under miss”
– “Hit under multiple miss”
– L2 must support this
– In general, processors can hide L1 miss penalty but not L2 miss penalty
Multibanked Caches
– Organize cache as independent banks to support simultaneous access
– ARM Cortex-A8 supports 1-4 banks for L2
– Intel i7 supports 4 banks for L1 and 8 banks for L2
– Interleave banks according to block address
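A minimal sketch (C) of the bank-interleaving idea, assuming a hypothetical four-bank cache with 64-byte blocks: the bank index is simply the block address modulo the number of banks, so consecutive blocks map to consecutive banks.

#include <stdint.h>

#define BLOCK_SIZE 64   /* bytes per cache block (assumption for this sketch) */
#define NUM_BANKS   4   /* number of cache banks (assumption for this sketch) */

/* Sequential interleaving: block addresses 0,1,2,3,4,... map to banks 0,1,2,3,0,... */
static unsigned bank_of(uint64_t byte_addr) {
    uint64_t block_addr = byte_addr / BLOCK_SIZE;
    return (unsigned)(block_addr % NUM_BANKS);
}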
Critical Word First, Early Restart
– Critical word first
– Request missed word from memory first
– Send it to the processor as soon as it arrives
– Early restart
– Request words in normal order
– Send the missed word to the processor as soon as it arrives
– Effectiveness of these strategies depends on block size and likelihood of another access to the portion of the block that has not yet been fetched
Merging Write Buffer
– When storing to a block that is already pending in the write buffer, update the write buffer
– Reduces stalls due to full write buffer
– Do not apply to I/O addresses
[Figure: no write buffering vs. write buffering]
Compiler Optimizations
– Loop Interchange
– Swap nested loops to access memory in sequential order (see the sketch after this list)
– Blocking
– Instead of accessing entire rows or columns, subdivide matrices into blocks
– Requires more memory accesses but improves locality of accesses
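A minimal loop-interchange sketch in C, assuming a row-major x[5000][100] array (the array name and sizes are illustrative, not from the slides):

#define ROWS 5000
#define COLS  100
static int x[ROWS][COLS];   /* row-major: x[i][j] and x[i][j+1] are adjacent in memory */

/* Before: the inner loop varies i, so successive accesses are COLS words
   apart and almost every access touches a new cache block. */
void scale_before(void) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop varies j, so accesses are sequential and
   every word of a fetched block is used before the block is evicted. */
void scale_after(void) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}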
Reducing Cache Misses:
5. Compiler Optimizations
– Blocking: improve temporal and spatial locality
a) Multiple arrays are accessed in both ways (i.e., row-major and column-major), namely, orthogonal accesses that cannot be helped by the earlier methods
b) Concentrate on submatrices, or blocks
c) All N×N elements of Y and Z are accessed N times and each element of X is accessed once. Thus, there are N³ operations and 2N³ + N² reads. Capacity misses are a function of N and cache size in this case. (An unblocked sketch follows below.)
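For reference, an unblocked X = Y * Z sketch in C that produces the access pattern described in c); the dimension N is an assumption:

#define N 512   /* matrix dimension (assumption for this sketch) */

/* Unblocked matrix multiply: Y is walked along rows, Z along columns
   (orthogonal accesses). Every element of Y and Z is read N times and each
   element of X is written once: N^3 operations, about 2N^3 + N^2 reads. */
void matmul(double X[N][N], double Y[N][N], double Z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += Y[i][k] * Z[k][j];
            X[i][j] = r;
        }
}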
Reducing Cache Misses:
5. Compiler Optimizations (cont’d)
– Blocking: improve temporal and spatial locality
a) To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B×B, where B is called the blocking factor.
b) The total number of memory words accessed is 2N³/B + N²
c) Blocking exploits a combination of spatial (Y) and temporal (Z) locality. (A blocked sketch follows below.)
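A blocked version of the same multiply (sketch, with N as in the previous sketch); the blocking factor B is an assumption and should be chosen so the submatrices being reused fit in the cache:

#define B 32    /* blocking factor (assumption); assumes B divides N */

/* Blocked matrix multiply: the j and k loops are restricted to B-wide strips,
   so the touched pieces of Y and Z stay cache-resident and are reused,
   cutting the words accessed to about 2N^3/B + N^2.  X must be zeroed first
   because it is accumulated across the kk strips. */
void matmul_blocked(double X[N][N], double Y[N][N], double Z[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += Y[i][k] * Z[k][j];
                    X[i][j] += r;
                }
}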
Hardware Prefetching
– Fetch two blocks on a miss (the requested block plus the next sequential block): overlap memory access with execution by fetching data items before the processor requests them.
[Figure: Pentium 4 prefetching]
Compiler Prefetching
– Insert prefetch instructions before data is needed
– Non-faulting: prefetch doesn’t cause exceptions
– Register prefetch
– Loads data into register
– Cache prefetch
– Loads data into cache
– Combine with loop unrolling and software pipelining (a minimal sketch follows below)
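A minimal cache-prefetch sketch in C, using GCC/Clang's non-faulting __builtin_prefetch intrinsic to stand in for the prefetch instruction a compiler would insert; the lookahead distance is an assumed tuning parameter:

#define LOOKAHEAD 8   /* prefetch distance in iterations (assumption) */

double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        /* Non-faulting cache prefetch: bring a[i + LOOKAHEAD] toward the
           cache while the current elements are being summed. */
        if (i + LOOKAHEAD < n)
            __builtin_prefetch(&a[i + LOOKAHEAD], 0, 3);
        s += a[i];
    }
    return s;
}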
Reducing Cache Miss Penalty:
Compiler-Controlled Prefetching
Compiler inserts prefetch instructions
An Example
for (i := 0; i < 3; i := i+1)
    for (j := 0; j < 100; j := j+1)
        a[i][j] := b[j][0] * b[j+1][0];
16-byte blocks, 8 KB direct-mapped (1-way) write-back cache, 8-byte elements;
What kind of locality, if any, exists for a and b?
a. 3 rows of 100 elements (100 columns) are visited; spatial locality: even-indexed elements miss and odd-indexed elements hit, leading to 3*100/2 = 150 misses
b. 101 rows and 3 columns visited; no spatial locality, but there is temporal locality: the same element is used in the jth and (j+1)st iterations, and the same elements are accessed in every iteration of i (the outer loop). 100 misses for b[j+1][0] when i = 0, plus 1 miss for b[0][0] at j = 0, for a total of 101 misses
Assuming a large miss penalty (100 cycles), prefetches must be issued at least 7 iterations in advance. Splitting the loop into two, we have:
Reducing Cache Miss Penalty:
3. Compiler-Controlled Prefetching
Assuming that each iteration of the pre-split loop consumes 7 cycles and there are no conflict or capacity misses, it consumes a total of 7*300 iteration cycles + 251*100 cache-miss cycles = 27,200 cycles;
With prefetch instructions inserted:
for (j := 0; j < 100; j := j+1) {
    prefetch(b[j+7][0]);
    prefetch(a[0][j+7]);
    a[0][j] := b[j][0] * b[j+1][0];
};
for (i := 1; i < 3; i := i+1)
    for (j := 0; j < 100; j := j+1) {
        prefetch(a[i][j+7]);
        a[i][j] := b[j][0] * b[j+1][0];
    }
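For reference, the split loops above as a runnable C sketch, with __builtin_prefetch standing in for the generic prefetch instruction; the bounds checks are added only so the lookahead addresses stay inside the arrays:

double a[3][100], b[101][3];

void compute_with_prefetch(void) {
    for (int j = 0; j < 100; j++) {                        /* first split: i == 0 */
        if (j + 7 < 101) __builtin_prefetch(&b[j + 7][0]); /* b needed 7 iterations ahead */
        if (j + 7 < 100) __builtin_prefetch(&a[0][j + 7]); /* a[0] needed 7 iterations ahead */
        a[0][j] = b[j][0] * b[j + 1][0];
    }
    for (int i = 1; i < 3; i++)                            /* remaining rows: b already cached */
        for (int j = 0; j < 100; j++) {
            if (j + 7 < 100) __builtin_prefetch(&a[i][j + 7]);
            a[i][j] = b[j][0] * b[j + 1][0];
        }
}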
Reducing Cache Miss Penalty:
3. Compiler-Controlled Prefetching (cont’d)
An Example (continued)
– the first loop consumes 9 cycles per iteration (due to the two prefetch instructions) and iterates 100 times, for a total of 900 cycles,
– the second loop consumes 8 cycles per iteration (due to the single prefetch instruction) and iterates 200 times, for a total of 1,600 cycles,
– during the first 7 iterations of the first loop, array a incurs 4 cache misses and array b incurs 7 cache misses, for a total of (4+7)*100 = 1,100 cache-miss cycles,
– during the first 7 iterations of the second loop for i = 1 and i = 2, array a incurs 4 cache misses each, for a total of (4+4)*100 = 800 cache-miss cycles; array b does not incur any cache misses in the second split!
Total cycles consumed: 900 + 1,600 + 1,100 + 800 = 4,400
Prefetching improves performance: 27,200/4,400 = 6.2-fold!
Summary
THANKS