
Chapter 5

Memory Hierarchy
Part 1
§5.1 Introduction
Principle of Locality
• Programs access a small proportion of their address space at any time
• Temporal locality
  – Items accessed recently are likely to be accessed again soon
  – e.g., instructions in a loop, induction variables
• Spatial locality
  – Items near those accessed recently are likely to be accessed soon
  – e.g., sequential instruction access, array data
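
A short C sketch makes both kinds of locality concrete (the array a and its size N are illustrative names, not from the slides):

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];   /* zero-initialized array */

    /* Temporal locality: sum and i are reused on every iteration.
       Spatial locality: a[0], a[1], ... are touched sequentially,
       so neighboring elements share the same cache block. */
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %d\n", sum);
    return 0;
}
```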
Memory hierarchy
Memory hierarchy: a structure that uses multiple levels of memories; as the distance from the processor increases, the size of the memories and the access time both increase.
Memory Hierarchy Levels
• Block (aka line): unit of copying
  – May be multiple words
• If accessed data is present in upper level
  – Hit: access satisfied by upper level
  – Hit ratio: hits/accesses
• If accessed data is absent
  – Miss: block copied from lower level
  – Time taken: miss penalty
  – Miss ratio: misses/accesses = 1 – hit ratio
  – Then accessed data supplied from upper level
Figure 5.2
§5.2 Memory Technologies
Memory Technology
• Static RAM (SRAM)
  – 0.5ns – 2.5ns, $2000 – $5000 per GB (2012)
• Dynamic RAM (DRAM)
  – 50ns – 70ns, $20 – $75 per GB (2012)
• Magnetic disk
  – 5ms – 20ms, $0.20 – $2 per GB (2012)
• Flash semiconductor memory
  – 5,000ns – 50,000ns, $0.75 – $1.00 per GB (2012)
• Ideal memory
  – Access time of SRAM
  – Capacity and cost/GB of disk
SRAM Technology
DRAM Technology
• Data stored as a charge in a capacitor
• Single transistor used to access the charge
• Must periodically be refreshed
  – Read contents and write back
  – Performed on a DRAM “row”
FIGURE 5.4
Advanced DRAM Organization
• Bits in a DRAM are organized as a rectangular array
  – DRAM accesses an entire row
  – Burst mode: supply successive words from a row with reduced latency
• Double data rate (DDR) DRAM
  – Transfer on rising and falling clock edges
• Quad data rate (QDR) DRAM
  – Separate DDR inputs and outputs
DRAM Performance Factors
• Row buffer
  – Allows several words to be read and refreshed in parallel
• Synchronous DRAM
  – Allows for consecutive accesses in bursts without needing to send each address
  – Improves bandwidth
• DRAM banking
  – Allows simultaneous access to multiple DRAMs
  – Improves bandwidth
• Dual inline memory modules (DIMMs)
§6.4 Flash Storage
Flash Storage
• Nonvolatile semiconductor storage
  – 100× – 1000× faster than disk
  – Smaller, lower power, more robust
  – But more $/GB (between disk and DRAM)
• Flash is a type of EEPROM (electrically erasable programmable read-only memory)
§6.3 Disk Storage
Disk Storage

• Nonvolatile, rotating magnetic storage
Disk Sectors and Access
• Each sector records
  – Sector ID
  – Data (512 bytes; 4096 bytes proposed)
  – Error correcting code (ECC)
    • Used to hide defects and recording errors
  – Synchronization fields and gaps
• Access to a sector involves
  – Queuing delay if other accesses are pending
  – Seek: move the heads
  – Rotational latency
  – Data transfer
  – Controller overhead

Rotational latency (also called rotational delay): the time required for the desired sector of a disk to rotate under the read/write head; usually assumed to be half the rotation time.

The average rotational latency at 5400 RPM is

0.5 rotation / (5400/60 rotations per second) = 0.5/90 s ≈ 0.0056 s = 5.6 ms

Transfer time: the time to move the data, i.e., sector size / transfer rate.
Disk Access Example
Given: 512 B sector, 15,000 rpm, 4 ms average seek time, 100 MB/s transfer rate, 0.2 ms controller overhead, idle disk.

Average read time
= 4 ms (average seek)
+ 0.5 / (15,000/60) = 2 ms (rotational latency)
+ 512 B / (100 MB/s) ≈ 0.005 ms (transfer time)
+ 0.2 ms (controller delay)
= 6.2 ms

If the actual average seek time is 1 ms, the average read time is 3.2 ms.
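
The same arithmetic can be checked with a minimal C sketch (the variable names are illustrative):

```c
#include <stdio.h>

int main(void) {
    double seek_ms       = 4.0;                          /* average seek time      */
    double rpm           = 15000.0;
    double rotation_ms   = 0.5 / (rpm / 60.0) * 1000.0;  /* half a rotation: 2 ms  */
    double transfer_ms   = 512.0 / 100e6 * 1000.0;       /* 512 B at 100 MB/s      */
    double controller_ms = 0.2;

    double total = seek_ms + rotation_ms + transfer_ms + controller_ms;
    printf("average read time = %.3f ms\n", total);      /* prints 6.205 ms */
    return 0;
}
```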
§5.3 The Basics of Caches

Cache Memory
• Cache memory
  – The level of the memory hierarchy closest to the CPU
• Given accesses X1, …, Xn–1, Xn
  – How do we know if the data is present?
  – Where do we look?
FIGURE 5.7 The cache just before and just after a reference to a word Xn that is not initially in the cache
Direct Mapped Cache
• Location determined by address
• Direct mapped: only one choice
  – (Block address) modulo (#Blocks in cache)
• #Blocks is a power of 2
• Use low-order address bits
FIGURE 5.8 A direct-mapped cache with eight entries showing the addresses of memory words between 0 and 31 that map to the same cache locations.
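
In C, the index and tag computations can be written directly; a minimal sketch, assuming an eight-block cache with one-word blocks (NUM_BLOCKS and the function names are illustrative):

```c
#include <stdint.h>

#define NUM_BLOCKS 8u   /* must be a power of 2 */

/* Cache index: (block address) modulo (#blocks). With a power-of-2
   block count this is just the low-order bits of the block address. */
static unsigned cache_index(uint32_t block_addr) {
    return block_addr & (NUM_BLOCKS - 1);   /* same as block_addr % NUM_BLOCKS */
}

/* Tag: the remaining upper bits of the block address. */
static uint32_t cache_tag(uint32_t block_addr) {
    return block_addr / NUM_BLOCKS;
}
```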
Cache Example
Below is a sequence of nine memory references to an empty eight-block cache, including the action for each reference. Since there are eight blocks in the cache, the low-order 3 bits of an address give the block number:
Cache Example
Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 N
111 N

Figure 5.9a: The initial state of the cache after power-on.

The cache is initially empty, with all valid bits (the V entry in the cache) turned off (N). The tag field will contain only the upper portion of the address.

Figures 5.9b through 5.9f show the cache contents after each miss in the sequence has been handled.
Cache Example
Word addr Binary addr Hit/miss Cache block
22 10 110 Miss 110

Index V Tag Data


000 N
001 N
010 N
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N

b. After handling a miss of address 10110 (binary)


Cache Example
Word addr Binary addr Hit/miss Cache block
26 11 010 Miss 010

Index V Tag Data


000 N
001 N
010 Y 11 Mem[11010]
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N

c. After handling a miss of address 11010 (binary)


Cache Example
Word addr Binary addr Hit/miss Cache block
16 10 000 Miss 000

Index V Tag Data


000 Y 10 Mem[10000]
001 N
010 Y 11 Mem[11010]
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N

d. After handling a miss of address 10000 (binary)


Cache Example
Word addr Binary addr Hit/miss Cache block
3 00 011 Miss 011

Index V Tag Data


000 Y 10 Mem[10000]
001 N
010 Y 11 Mem[11010]
011 Y 00 Mem[00011]
100 N
101 N
110 Y 10 Mem[10110]
111 N

e. After handling a miss of address 00011 (binary)


Cache Example
Word addr Binary addr Hit/miss Cache block
18 10 010 Miss 010

Index V Tag Data


000 Y 10 Mem[10000]
001 N
010 Y 10 Mem[10010]
011 Y 00 Mem[00011]
100 N
101 N
110 Y 10 Mem[10110]
111 N

f. After handling a miss of address 10010 (binary)
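
A minimal C simulation replaying the five miss addresses from the tables above reproduces this behavior (a sketch; the names are illustrative, and only the misses shown in the figures are replayed):

```c
#include <stdio.h>

#define NUM_BLOCKS 8

int main(void) {
    int      valid[NUM_BLOCKS] = {0};
    unsigned tag[NUM_BLOCKS]   = {0};

    /* The five miss addresses from the tables above (word addresses).
       Note that address 18 evicts address 26: both map to block 010. */
    unsigned refs[] = {22, 26, 16, 3, 18};

    for (int i = 0; i < (int)(sizeof refs / sizeof refs[0]); i++) {
        unsigned addr  = refs[i];
        unsigned index = addr % NUM_BLOCKS;   /* low-order 3 bits   */
        unsigned t     = addr / NUM_BLOCKS;   /* upper address bits */

        if (valid[index] && tag[index] == t) {
            printf("addr %2u: hit  in block %u\n", addr, index);
        } else {
            printf("addr %2u: miss, block %u gets tag %u\n", addr, index, t);
            valid[index] = 1;   /* fetch the block and record its tag */
            tag[index]   = t;
        }
    }
    return 0;
}
```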


Address Subdivision

FIGURE 5.10 This cache holds 1024 words or 4 KiB.


EXAMPLE Bits in a Cache

How many total bits are required for a direct-mapped cache with 16 KiB of data and four-word blocks, assuming a 64-bit address?

16 KiB = 4096 (2^12) words. With a block size of four words (2^2), there are 1024 (2^10) blocks. Each block has 4 × 32 = 128 bits of data, plus a tag of 64 − 10 − 2 − 2 bits, plus a valid bit. Thus, the complete cache size is

2^10 × (4 × 32 + (64 − 10 − 2 − 2) + 1) = 2^10 × 179 = 179 Kibits = 22.375 KiB

For this cache, the total number of bits in the cache is about 1.4 times as many as needed just for the storage of the data.
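
The same count in C (a sketch of the arithmetic only; the variable names are illustrative):

```c
#include <stdio.h>

int main(void) {
    long addr_bits   = 64;
    long index_bits  = 10;       /* 2^10 = 1024 blocks                */
    long offset_bits = 2 + 2;    /* word-in-block + byte-offset bits  */
    long data_bits   = 4 * 32;   /* four 32-bit words per block       */
    long tag_bits    = addr_bits - index_bits - offset_bits;   /* 50  */
    long valid_bits  = 1;

    long total = 1024 * (data_bits + tag_bits + valid_bits);
    printf("%ld Kibits = %.3f KiB\n", total / 1024, total / 1024.0 / 8.0);
    /* prints: 179 Kibits = 22.375 KiB */
    return 0;
}
```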
EXAMPLE Mapping an Address to a Multiword Cache Block

Consider a cache with 64 blocks and a block size of 16 bytes. To what block number does byte address 1200 map?

The block number is given by

(Block address) modulo (Number of blocks in the cache)

where

Block address = ⌊Byte address / Bytes per block⌋

Notice that this block address is the block containing all addresses between

⌊Byte address / Bytes per block⌋ × Bytes per block

and

⌊Byte address / Bytes per block⌋ × Bytes per block + (Bytes per block − 1)

Thus, with 16 bytes per block, byte address 1200 is block address ⌊1200/16⌋ = 75, which maps to cache block number (75 modulo 64) = 11. In fact, this block contains all addresses between 1200 and 1215.
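
In C (a sketch; the variable names are illustrative):

```c
#include <stdio.h>

int main(void) {
    unsigned byte_addr       = 1200;
    unsigned bytes_per_block = 16;
    unsigned num_blocks      = 64;

    unsigned block_addr = byte_addr / bytes_per_block;       /* 1200/16 = 75   */
    unsigned block_num  = block_addr % num_blocks;           /* 75 mod 64 = 11 */
    unsigned first_byte = block_addr * bytes_per_block;      /* 1200           */
    unsigned last_byte  = first_byte + bytes_per_block - 1;  /* 1215           */

    printf("byte %u -> block address %u -> cache block %u (bytes %u..%u)\n",
           byte_addr, block_addr, block_num, first_byte, last_byte);
    return 0;
}
```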
Block Size Considerations
• Larger blocks should reduce miss rate
  – Due to spatial locality
• But in a fixed-sized cache
  – Larger blocks → fewer of them
    • More competition → increased miss rate
  – Larger blocks → pollution
• Larger miss penalty
  – Can override benefit of reduced miss rate
  – Early restart and critical-word-first can help
Cache Misses
• On cache hit, CPU proceeds normally
• On cache miss
  – Stall the CPU pipeline
  – Fetch block from next level of hierarchy
  – Instruction cache miss: restart instruction fetch
  – Data cache miss: complete data access
Write-Through
• On data-write hit, could just update the block in cache
  – But then cache and memory would be inconsistent
• Write through: also update memory
• But this makes writes take longer
  – e.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles
    • Effective CPI = 1 + 0.1 × 100 = 11
• Solution: write buffer
  – Holds data waiting to be written to memory
  – CPU continues immediately
    • Only stalls on write if write buffer is already full
Write-Back
• Alternative: on data-write hit, just update the block in cache
  – Keep track of whether each block is dirty
• When a dirty block is replaced
  – Write it back to memory
  – Can use a write buffer to allow the replacing block to be read first
Split cache: a scheme in which a level of the memory hierarchy is composed of two independent caches that operate in parallel with each other, with one handling instructions and one handling data.
§5.4 Measuring and Improving Cache Performance
Measuring Cache Performance
• Components of CPU time
  – Program execution cycles
    • Includes cache hit time
  – Memory stall cycles
    • Mainly from cache misses
• With simplifying assumptions:

Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                    = (Instructions / Program) × (Misses / Instruction) × Miss penalty
Example: Cache Performance
Assume the miss rate of an instruction cache (I-cache) is 2% and the miss rate of the data cache (D-cache) is 4%. If a processor has a CPI of 2 without any memory stalls, and the miss penalty is 100 cycles for all misses, determine how much faster the processor would run with a perfect cache that never missed. Assume the frequency of all loads and stores is 36%.

Miss cycles per instruction (for I instructions):
  Instruction miss cycles = I × 0.02 × 100 = 2 × I
  Data miss cycles = I × 0.36 × 0.04 × 100 = 1.44 × I
Total memory-stall cycles = 2 × I + 1.44 × I = 3.44 × I
Actual CPI = 2 + 2 + 1.44 = 5.44

CPU time with stalls / CPU time with perfect cache
  = (I × CPIstall × Clock cycle) / (I × CPIperfect × Clock cycle)
  = CPIstall / CPIperfect
  = 5.44 / 2 = 2.72 times faster
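
The calculation as a minimal C sketch (the variable names are illustrative):

```c
#include <stdio.h>

int main(void) {
    double base_cpi     = 2.0;
    double i_miss_rate  = 0.02;
    double d_miss_rate  = 0.04;
    double mem_ops      = 0.36;    /* loads and stores per instruction */
    double miss_penalty = 100.0;

    double stall_cpi = i_miss_rate * miss_penalty             /* 2.00 */
                     + mem_ops * d_miss_rate * miss_penalty;  /* 1.44 */
    double cpi_stall = base_cpi + stall_cpi;                  /* 5.44 */

    printf("CPI with stalls = %.2f, speedup with perfect cache = %.2fx\n",
           cpi_stall, cpi_stall / base_cpi);                  /* 2.72x */
    return 0;
}
```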
Average Access Time
• Hit time is also important for performance
• Average memory access time (AMAT)
  – AMAT = Hit time + Miss rate × Miss penalty

Example: Find the AMAT for a processor with a 1 ns clock cycle time, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls.

AMAT = 1 + 0.05 × 20 = 2 clock cycles, or 2 ns
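
And the same formula instantiated in C (a sketch; the names are illustrative):

```c
#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;    /* cycles, includes hit detection */
    double miss_rate    = 0.05;
    double miss_penalty = 20.0;   /* cycles */
    double cycle_ns     = 1.0;

    double amat = hit_time + miss_rate * miss_penalty;   /* 2 cycles */
    printf("AMAT = %.1f cycles = %.1f ns\n", amat, amat * cycle_ns);
    return 0;
}
```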


Performance Summary
• When CPU performance increases
  – Miss penalty becomes more significant
• Decreasing base CPI
  – Greater proportion of time spent on memory stalls
• Increasing clock rate
  – Memory stalls account for more CPU cycles
• Can’t neglect cache behavior when evaluating system performance