Compiler
Optimizations
and Prefetching
By: ADITYA P. P. PRASETYO, S. Kom., MT.
Ten Advanced Optimizations
– Small and simple first-level caches
– Critical timing path:
– addressing tag memory, then
– comparing tags, then
– selecting the correct data item (way)
– Direct-mapped caches can overlap tag compare and transmission of data
– Lower associativity reduces power because fewer cache lines are accessed
L1 Size and Associativity
[Figure: access time vs. size and associativity]
L1 Size and Associativity
[Figure: energy per read vs. size and associativity]
Way Prediction
– To improve hit time, predict the way to pre-set mux
– Mis-prediction gives longer hit time
– Prediction accuracy
– > 90% for two-way
– > 80% for four-way
– I-cache has better accuracy than D-cache
– First used on the MIPS R10000 in the mid-1990s
– Used on the ARM Cortex-A8
– Extend to predict block as well
– “Way selection”
– Increases mis-prediction penalty
Pipelining Cache
– Pipeline cache access to improve bandwidth
– Examples:
– Pentium: 1 cycle
– Pentium Pro – Pentium III: 2 cycles
– Pentium 4 – Core i7: 4 cycles
– Increases branch mis-prediction penalty
– Makes it easier to increase associativity
Nonblocking Caches
– Allow hits before previous misses complete
– “Hit under miss”
– “Hit under multiple miss”
– L2 must support this
– In general, processors can hide L1 miss penalty but not L2 miss penalty
Multibanked Caches
– Organize cache as independent banks to support simultaneous access
– ARM Cortex-A8 supports 1-4 banks for L2
– Intel i7 supports 4 banks for L1 and 8 banks for L2
– Interleave banks according to block address
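A minimal sketch (C) of the bank-interleaving idea, assuming a hypothetical four-bank cache with 64-byte blocks: the bank index is simply the block address modulo the number of banks, so consecutive blocks map to consecutive banks.

#include <stdint.h>

#define BLOCK_SIZE 64   /* bytes per cache block (assumption for this sketch) */
#define NUM_BANKS   4   /* number of cache banks (assumption for this sketch) */

/* Sequential interleaving: block addresses 0,1,2,3,4,... map to banks 0,1,2,3,0,... */
static unsigned bank_of(uint64_t byte_addr) {
    uint64_t block_addr = byte_addr / BLOCK_SIZE;
    return (unsigned)(block_addr % NUM_BANKS);
}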
Critical Word First, Early Restart
– Critical word first
– Request missed word from memory first
– Send it to the processor as soon as it arrives
– Early restart
– Request words in normal order
– Send the missed word to the processor as soon as it arrives
– Effectiveness of these strategies depends on block size and likelihood of another access to the portion of the block that has not yet been fetched
Merging Write Buffer
– When storing to a block that is already pending in the write buffer, update the write buffer
– Reduces stalls due to full write buffer
– Do not apply to I/O addresses
[Figure: no write buffering vs. write buffering]
Compiler Optimizations
– Loop Interchange
– Swap nested loops to access memory in sequential order (see the sketch after this list)
– Blocking
– Instead of accessing entire rows or columns, subdivide matrices into blocks
– Requires more memory accesses but improves locality of accesses
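A minimal loop-interchange sketch in C, assuming a row-major x[5000][100] array (the array name and sizes are illustrative, not from the slides):

#define ROWS 5000
#define COLS  100
static int x[ROWS][COLS];   /* row-major: x[i][j] and x[i][j+1] are adjacent in memory */

/* Before: the inner loop varies i, so successive accesses are COLS words
   apart and almost every access touches a new cache block. */
void scale_before(void) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop varies j, so accesses are sequential and
   every word of a fetched block is used before the block is evicted. */
void scale_after(void) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}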
Reducing Cache Misses:
5. Compiler Optimizations
– Blocking: improve temporal and spatial locality
a) Multiple arrays are accessed in both ways (i.e., row-major and column-major), namely, orthogonal accesses that cannot be helped by the earlier methods
b) Concentrate on submatrices, or blocks
c) All N×N elements of Y and Z are accessed N times and each element of X is accessed once. Thus, there are N³ operations and 2N³ + N² reads. Capacity misses are a function of N and cache size in this case. (An unblocked sketch follows below.)
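For reference, an unblocked X = Y * Z sketch in C that produces the access pattern described in c); the dimension N is an assumption:

#define N 512   /* matrix dimension (assumption for this sketch) */

/* Unblocked matrix multiply: Y is walked along rows, Z along columns
   (orthogonal accesses). Every element of Y and Z is read N times and each
   element of X is written once: N^3 operations, about 2N^3 + N^2 reads. */
void matmul(double X[N][N], double Y[N][N], double Z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += Y[i][k] * Z[k][j];
            X[i][j] = r;
        }
}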
Reducing Cache Misses:
5. Compiler Optimizations (cont’d)
– Blocking: improve temporal and spatial locality
a) To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B×B, where B is called the blocking factor.
b) The total number of memory words accessed is 2N³/B + N²
c) Blocking exploits a combination of spatial (Y) and temporal (Z) locality. (A blocked sketch follows below.)
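A blocked version of the same multiply (sketch, with N as in the previous sketch); the blocking factor B is an assumption and should be chosen so the submatrices being reused fit in the cache:

#define B 32    /* blocking factor (assumption); assumes B divides N */

/* Blocked matrix multiply: the j and k loops are restricted to B-wide strips,
   so the touched pieces of Y and Z stay cache-resident and are reused,
   cutting the words accessed to about 2N^3/B + N^2.  X must be zeroed first
   because it is accumulated across the kk strips. */
void matmul_blocked(double X[N][N], double Y[N][N], double Z[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += Y[i][k] * Z[k][j];
                    X[i][j] += r;
                }
}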
Hardware Prefetching
– Fetch two blocks on a miss (the requested block plus the next sequential block): overlap memory access with execution by fetching data items before the processor requests them.
[Figure: Pentium 4 prefetching]
Compiler Prefetching
– Insert prefetch instructions before data is needed
– Non-faulting: prefetch doesn’t cause exceptions
– Register prefetch
– Loads data into register
– Cache prefetch
– Loads data into cache
– Combine with loop unrolling and software pipelining (a minimal sketch follows below)
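A minimal cache-prefetch sketch in C, using GCC/Clang's non-faulting __builtin_prefetch intrinsic to stand in for the prefetch instruction a compiler would insert; the lookahead distance is an assumed tuning parameter:

#define LOOKAHEAD 8   /* prefetch distance in iterations (assumption) */

double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        /* Non-faulting cache prefetch: bring a[i + LOOKAHEAD] toward the
           cache while the current elements are being summed. */
        if (i + LOOKAHEAD < n)
            __builtin_prefetch(&a[i + LOOKAHEAD], 0, 3);
        s += a[i];
    }
    return s;
}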
Reducing Cache Miss Penalty:
Compiler-Controlled Prefetching
Compiler inserts prefetch instructions
An Example
for (i := 0; i < 3; i := i+1)
    for (j := 0; j < 100; j := j+1)
        a[i][j] := b[j][0] * b[j+1][0];
16-byte blocks, 8 KB direct-mapped (1-way) write-back cache, 8-byte elements;
What kind of locality, if any, exists for a and b?
a. 3 rows of 100 elements (100 columns) are visited; spatial locality: even-indexed elements miss and odd-indexed elements hit, leading to 3*100/2 = 150 misses
b. 101 rows and 3 columns visited; no spatial locality, but there is temporal locality: the same element is used in the jth and (j+1)st iterations, and the same elements are accessed in every iteration of i (the outer loop). 100 misses for b[j+1][0] when i = 0, plus 1 miss for b[0][0] at j = 0, for a total of 101 misses
Assuming a large miss penalty (100 cycles), prefetches must be issued at least 7 iterations in advance. Splitting the loop into two, we have:
Reducing Cache Miss Penalty:
3. Compiler-Controlled Prefetching
Assuming that each iteration of the pre-split loop consumes 7 cycles and there are no conflict or capacity misses, it consumes a total of 7*300 iteration cycles + 251*100 cache-miss cycles = 27,200 cycles;
With prefetch instructions inserted:
for (j := 0; j < 100; j := j+1) {
    prefetch(b[j+7][0]);
    prefetch(a[0][j+7]);
    a[0][j] := b[j][0] * b[j+1][0];
};
for (i := 1; i < 3; i := i+1)
    for (j := 0; j < 100; j := j+1) {
        prefetch(a[i][j+7]);
        a[i][j] := b[j][0] * b[j+1][0];
    }
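For reference, the split loops above as a runnable C sketch, with __builtin_prefetch standing in for the generic prefetch instruction; the bounds checks are added only so the lookahead addresses stay inside the arrays:

double a[3][100], b[101][3];

void compute_with_prefetch(void) {
    for (int j = 0; j < 100; j++) {                        /* first split: i == 0 */
        if (j + 7 < 101) __builtin_prefetch(&b[j + 7][0]); /* b needed 7 iterations ahead */
        if (j + 7 < 100) __builtin_prefetch(&a[0][j + 7]); /* a[0] needed 7 iterations ahead */
        a[0][j] = b[j][0] * b[j + 1][0];
    }
    for (int i = 1; i < 3; i++)                            /* remaining rows: b already cached */
        for (int j = 0; j < 100; j++) {
            if (j + 7 < 100) __builtin_prefetch(&a[i][j + 7]);
            a[i][j] = b[j][0] * b[j + 1][0];
        }
}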
Reducing Cache Miss Penalty:
3. Compiler-Controlled Prefetching (cont’d)
An Example (continued)
– the first loop consumes 9 cycles per iteration (due to the two prefetch instructions) and iterates 100 times, for a total of 900 cycles,
– the second loop consumes 8 cycles per iteration (due to the single prefetch instruction) and iterates 200 times, for a total of 1,600 cycles,
– during the first 7 iterations of the first loop, array a incurs 4 cache misses and array b incurs 7 cache misses, for a total of (4+7)*100 = 1,100 cache-miss cycles,
– during the first 7 iterations of the second loop for i = 1 and i = 2, array a incurs 4 cache misses each, for a total of (4+4)*100 = 800 cache-miss cycles; array b does not incur any cache misses in the second split!
Total cycles consumed: 900 + 1,600 + 1,100 + 800 = 4,400
Prefetching improves performance: 27,200/4,400 = 6.2-fold!
Summary
THANKS