More Cache-Aware Programming on Multicores
Contents of Lecture 11
Cache Misses
Reduce Communication
Improve Locality
Data Prefetching
Jonas Skeppstedt, Lecture 11, 2022
Cache memories
Faster but smaller memories than normal RAM
When a variable is in the cache (a cache hit), reading it is fast
At a cache miss, a block with e.g. 128 bytes is copied from RAM
A cache miss can take hundreds of clock cycles
Except in Sequential Consistency, writing to the cache is also fast
In SC it depends on whether the cache already owns the cache block
Recall cache block ownership in cache coherence protocols
The time it takes to copy data from RAM is called the cache miss
latency
Locality of references
Temporal locality: After a variable X has been used, it is likely it will
be used again soon
Spatial locality: After a variable at address &X has been used, it is
likely a variable at address &X+1 will be used soon
Caches are for programs with locality of references
Fast programs need to have locality of references
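As a minimal illustration (the array a and its length n are assumptions):
/* Temporal locality: sum is reused in every iteration.
   Spatial locality: a[i] and a[i+1] usually lie in the same cache block. */
int sum = 0;
for (int i = 0; i < n; i++)
        sum += a[i];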
Cache misses in multicores
Misses in uniprocessors:
compulsory misses (cold misses),
capacity misses, and
conflict misses
In addition to those found in sequential programs, we also have:
True sharing miss: essential miss since it communicates data
False sharing miss: non-essential miss.
False sharing misses are due to using a large cache block size
If only one variable at a time were copied from RAM, they would
disappear
But that would be inefficient
False Sharing Miss
Assume a cache block size of two words.
Access    Processor 1    Processor 2    Comment
1         Load 0                        Cold miss
2                        Load 1         Cold miss
3                        Store 1        Invalidation
4         Load 0                        False sharing miss
Effects of larger cache block size:
Increased benefit from spatial locality (prefetching within block)
Increased risk of suffering from false sharing.
True Sharing Miss
Access    Processor 1    Processor 2    Comment
1         Load 0                        Cold miss
2                        Load 1         Cold miss
3                        Store 1        Invalidation
4         Load 0                        True sharing miss
5         Load 1                        Reads a new value
5 Load 1 Reads a new value
While we cannot know it at the time of Access 4, that miss is a true
sharing miss (which we realize at Access 5).
Reducing false sharing
Suppose each thread should count something.
The following will result in false sharing
int count[NUM_THREADS];
/* ... */
count[thread->index] += 1;
It is better to collect the variables a thread should use in a struct that
only that thread will modify.
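A minimal sketch, assuming a 64-byte cache block and GCC's aligned attribute
(both assumptions), with NUM_THREADS and thread->index as above:
#define CACHE_BLOCK_SIZE 64     /* assumption: check the real block size. */

struct thread_data {
        long count;             /* written only by its own thread. */
        /* other per-thread variables go here. */
} __attribute__((aligned(CACHE_BLOCK_SIZE)));

static struct thread_data data[NUM_THREADS];

/* in the thread function: */
data[thread->index].count += 1;
Each array element now occupies its own cache block, so a store by one thread
no longer invalidates the counters of the other threads.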
Reduce also true sharing
Ideally, each thread should work on its own data and no other should
be involved. No communication and no true sharing.
This is not completely achievable for most algorithms, though.
True sharing can be reduced with clever decisions about which thread
should work on which data, as sketched below.
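For example, giving each thread a contiguous chunk of an array keeps both the
communication and the number of shared cache blocks small; a sketch where n,
nthreads, t, data, and process are assumptions:
size_t chunk = (n + nthreads - 1) / nthreads;
size_t begin = t * chunk;               /* t is this thread's index. */
size_t end = begin + chunk < n ? begin + chunk : n;

for (size_t i = begin; i < end; i++)
        process(&data[i]);      /* only this thread touches these elements. */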
Examples of tricks to exploit caches better
Use smaller data structures: an int instead of a pointer.
Use arrays instead of linked-lists if possible
If a node’s neighbors never change you can do:
struct node_t {
        edge_t* a;      /* array of edges. */
        int     n;      /* number of neighbors. */
};

struct edge_t {
        int     v;      /* the other node. */
        int     i;      /* edge number. */
        int     b;      /* direction, from lab0. */
};
Keep track of the capacities and flows somewhere else.
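One way to do that, sketched here with two plain arrays indexed by the edge
number e->i (cap and flow are hypothetical names):
int* cap;       /* cap[e->i]:  capacity of the edge. */
int* flow;      /* flow[e->i]: current flow of the edge. */

/* residual capacity of edge e in the direction given by b: */
int residual = cap[e->i] - b * flow[e->i];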
Examples of tricks to exploit caches better
Pad structs to fit cache blocks better — to avoid multiple cache
misses per struct
This can be done with a char array of a suitable size if you know
the cache block size.
Put struct fields used at nearly the same time near each other
Avoid putting smaller and larger struct fields next to each other in a
struct to avoid padding between them.
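A minimal sketch of both points, assuming a 64-byte cache block (the struct and
its fields are made up for illustration):
#define CACHE_BLOCK_SIZE 64

struct item {
        /* hot fields, used together in the inner loop: */
        int     key;
        int     value;
        /* cold field, used rarely: */
        char*   name;
        /* pad the struct to a full cache block: */
        char    pad[CACHE_BLOCK_SIZE - 2 * sizeof(int) - sizeof(char*)];
};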
Cachegrind
valgrind --tool=cachegrind ./a.out < 4huge.in
f = 9924
==2250753==
==2250753== I refs: 182,135,320
==2250753== I1 misses: 2,006
==2250753== LLi misses: 1,916
==2250753== I1 miss rate: 0.00%
==2250753== LLi miss rate: 0.00%
==2250753==
==2250753== D refs: 79,372,178 (51,287,248 rd + 28,084,930 wr)
==2250753== D1 misses: 1,690,859 ( 1,510,713 rd + 180,146 wr)
==2250753== LLd misses: 1,416,910 ( 1,239,883 rd + 177,027 wr)
==2250753== D1 miss rate: 2.1% ( 2.9% + 0.6% )
==2250753== LLd miss rate: 1.8% ( 2.4% + 0.6% )
==2250753==
==2250753== LL refs: 1,692,865 ( 1,512,719 rd + 180,146 wr)
==2250753== LL misses: 1,418,826 ( 1,241,799 rd + 177,027 wr)
==2250753== LL miss rate: 0.5% ( 0.5% + 0.6% )
operf on Power
ophelp lists all events that can be sampled
operf -e PM_LD_MISS_L1:100000 ./a.out < big/002.in
opannotate -s a.out
83 0.8820 : while (p != NULL) {
551 5.8555 : e = p->edge;
5625 59.7768 : p = p->next;
:
455 4.8353 : if (u == e->u) {
576 6.1211 : v = e->v;
: b = 1;
: } else {
773 8.2147 : v = e->u;
: b = -1;
: }
:
1221 12.9756 : if (u->h > v->h && b * e->f < e->c)
: break;
: else
63 0.6695 : v = NULL;
Data Prefetching
The purpose is to fetch data so that it is available in the cache when
it’s needed.
Compilers and hardware can do this for matrix codes.
This is very difficult on recursive data structures such as lists or trees.
Suppose we have a loop which traverses a list or tree.
To prefetch a node needed e.g. three iterations ahead, we need to
dereference multiple pointers, where each dereference can itself result in
a cache miss (see the sketch below).
In a superscalar processor with out-of-order execution of load
instructions (i.e. a relaxed memory consistency model), this can
possibly be useful.
In a processor with a blocking cache, the pipeline will halt at the first
cache miss and make the prefetching almost useless.
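A naive sketch of this for a list, using GCC's __builtin_prefetch described on
a later slide (head, work, and the node layout are assumptions):
/* Prefetch the node three iterations ahead.  Each ->next below may itself
   miss in the cache, so with a blocking cache this helps very little. */
for (p = head; p != NULL; p = p->next) {
        if (p->next != NULL && p->next->next != NULL
                            && p->next->next->next != NULL)
                __builtin_prefetch(p->next->next->next, 0, 1);
        work(p);
}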
An Approach to Prefetching Nodes
A problem with lists and trees is that we usually do not know the
address of a node needed in the future.
This is true if we allocate memory with standard methods such as
malloc
However, assume the size of a data structure is fixed for some time.
Then we can put pointers to the nodes in an array in the expected
order of traversal, and then we may be able to prefetch nodes
sufficiently in advance.
This can be useful if we will traverse a data structure multiple times.
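A sketch of the idea, where order, nnodes, node_t, work, and the distance
PREFETCH_AHEAD are assumptions:
#define PREFETCH_AHEAD 8        /* how far ahead to prefetch; tune by measuring. */

void traverse(node_t** order, size_t nnodes)
{
        /* order[k] is the k:th node in the expected traversal order. */
        for (size_t k = 0; k < nnodes; k++) {
                if (k + PREFETCH_AHEAD < nnodes)
                        __builtin_prefetch(order[k + PREFETCH_AHEAD], 0, 1);
                work(order[k]);
        }
}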
More difficulties
For shared data we intend to modify, it can be useful to prefetch it in
exclusive mode, meaning that we request ownership of the cache block.
The effects of this are:
Reduced write penalty in a sequentially consistent machine.
Reduced write traffic in all machines.
However, with the ownership requests, there is a risk that we
introduce additional cache misses!
Measurements are needed, but note that they depend both on the
input data, and on
machine parameters such as the number of processors, cache sizes, and
latencies.
Prefetch with GCC
void __builtin_prefetch(const void *addr, int write, int loc);
/* j is the prefetch distance in elements, chosen so the data arrives in time. */
for (i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        __builtin_prefetch(&a[i+j], 1, 1);      /* write = 1: will be written. */
        __builtin_prefetch(&b[i+j], 0, 1);      /* write = 0: only read. */
}
The loc argument takes values in 0..3, where 0 means no temporal locality
and 3 the most temporal locality
Some CPUs have extra buffers where such temporary data can be kept
instead of polluting the cache
Data prefetch does not generate a segmentation fault if the address is
invalid.
The expression computing the address obviously must be valid.
Data Prefetching on Power
Several processors, including Power, do prefetching of array references
in hardware
Of course, the CPU does not know that it is accessing arrays
They work by discovering a constant stride (i.e. the distance between used
addresses) and then predicting which blocks will be required.
Modern processors (including Power) have prefetch instructions: dcbt
and dcbtst
Power also supports software programmable prefetch engines.
Software Controlled Stream Prefetch on Power
Four data streams can be prefetched concurrently
The basic instruction is dst — data stream touch
One of the instruction fields is a two-bit stream selector
Other parameters:
Prefetch unit size S in 16-byte blocks: 0..31 where 0 means 32.
Number of units to prefetch
Distance D in bytes between two units (i.e. stride)
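As an illustration only, this is roughly how one stream could be programmed
through the AltiVec/VMX vec_dst intrinsic; the packing of the control word
below is an assumption, so check the compiler and processor documentation:
#include <altivec.h>            /* requires compiling with -maltivec. */

/* Pack unit size S (in 16-byte blocks), unit count, and byte distance D into
   the dst control word (assumed layout). */
#define DST_CTRL(size, count, stride) \
        (((size) << 24) | ((count) << 16) | ((stride) & 0xffff))

void touch(const unsigned char* a)
{
        /* stream 0: 4 units of 2 x 16 bytes each, 128 bytes apart. */
        vec_dst(a, DST_CTRL(2, 4, 128), 0);
}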
Cache-miss initiated software controlled prefetch engines
Hardware knows what is happening now, and the compiler knows what will
happen in the near future
Treat L2 cache misses as lightweight exceptions; there will soon not
be much for the processor to do anyway.
Such exceptions do not involve the OS kernel but simply jump to a
special place in the program.
For certain references in certain loops, the compiler has created an
exception handler which will program a prefetch engine.
The exception handler is part of the function’s control flow graph, so
it has access to all local variables; registers are allocated jointly for
the function and the exception handler.
Therefore the exception handler can compute what to prefetch while
the L2 cache miss is being serviced.
The instruction overhead of always prefetching is removed.
Knowing whether to insert prefetch instructions or not can be
impossible, e.g. for memcpy.
Storing zeroes
Consider a directed graph where each node has a set X, represented
as a bitvector
In each iteration of a certain loop the union of successor nodes’ X is
computed
No member is ever removed from a set.
X = ⋃ Xi, the union over all successor nodes i
Implemented as X = X ∪ Xi in a loop
Why can it be better to start with setting X to zeroes?
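A minimal sketch of the loop in question, where the bitvector is an array of
NWORDS unsigned longs and node_t, succ, and nsucc are assumptions:
#define NWORDS 32               /* assumption: words per bitvector. */

void successor_union(unsigned long* x, node_t** succ, int nsucc)
{
        /* optionally first: memset(x, 0, NWORDS * sizeof x[0]);
           -- the variant the question above asks about. */
        for (int s = 0; s < nsucc; s++)
                for (int w = 0; w < NWORDS; w++)
                        x[w] |= succ[s]->x[w];
}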