hw6 Circuits
CS470
Homework 6
2012.04.12
1.Exercises 5.2.1 through 5.2.4, parts a and b
"In this exercise we look at memory locality properties of matrix computation. The following code is written in C,
where elements within the same row are stored contiguously."
Assume that the elements of both the arrays A and B contain single-precision floating-point values, declared with
the codes below:
a.
b.
Hint: Note that these codes differ slightly from those provided in the textbook exercise.
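The (a) and (b) codes themselves did not survive in this write-up. A hypothetical reconstruction that is consistent with the answers below (B's width and the statement body are guesses; only the loop order matters for the locality arguments) is:

    float A[4000][8];
    float B[8][8];                        /* width of B is an assumption */

    void code_a(void) {                   /* (a): inner loop walks along a row of A */
        for (int I = 0; I < 4000; I++)
            for (int J = 0; J < 8; J++)
                A[I][J] = B[J][0];
    }

    void code_b(void) {                   /* (b): B[J][0] is reused 4000 times in a row */
        for (int J = 0; J < 8; J++)
            for (int I = 0; I < 4000; I++)
                A[I][J] = B[J][0];
    }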
5.2.1 How many single-precision floating-point values can be stored in a 16-byte cache line? Show your work!
Single-precision floats are 32 bits each, which is 4 bytes (8 bits per byte). So each float takes 4 bytes,
meaning we can fit 4 floats into a 16-byte cache line (4 floats * 4 bytes = 16 bytes).
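As a quick sanity check, the same arithmetic in C (the 16-byte line size is the value given in the problem):

    #include <stdio.h>

    int main(void) {
        size_t line_bytes = 16;                      /* cache line size from the problem */
        printf("floats per line = %zu\n", line_bytes / sizeof(float));   /* prints 4 */
        return 0;
    }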
5.2.1.EXTRA.A How many cachelines total would the array A need, if it could be stored completely in cache? Show
your work!
In both (a) and (b) we use A indices up to A[3999][7], i.e. the equivalent of an A[4000][8] array. That is
4000 * 8 = 32,000 floats, which is 32,000 / 4 = 8,000 cache lines.
5.2.1.EXTRA.C How many cachelines would the array B require, if it could be stored completely in cache? Show
your work!
In both (a) and (b) we use B[0..7][0], which means we are using 8 * 1 = 8 floats. 8 floats / 4
per cache line = 2 cache lines needed.
5.2.2 References to which variables exhibit temporal locality? Why?
In both (a) and (b) the loop indexes I and J exhibit temporal locality, since they are accessed over and
over again (they would most likely be kept in registers).
In (b), B[J][0] also has temporal locality, since it is accessed over and over (4000 times in a row).
5.2.3 References to which variables exhibit spatial locality? Why?
In (a), A[I][J] has spatial locality. As J is incremented, nearby values of the array are
accessed.
5.2.4 (Using the code provided for exercises 5.2.1 through 5.2.3 above)
How many 16-byte cachelines are needed to store all 32-bit matrix elements being referenced -- as the code
executes? Show your work!
For this question only, assume that only one cacheline from the array A and only one cacheline from the
array B can be in the cache at the same time.
Hint: There will be a difference between the two sets of codes, as one of the codes will use all four floats in a
cacheline from one of the two arrays consecutively and the other code will not.
Note that only the first 8 elements of each row in A are used and only the first element of each of the eight
rows of B is used.
For (a), we need 2 cache lines to hold a whole row of A (8 floats, at 4 floats per cache line).
For (b), we need a cache line for each separate row of B, so 8 cache lines.
5.2.4.EXTRA.E
Since fetching values from memory (to put into the cache) is far more expensive than fetching values directly from
the cache, which of the two codes above is more efficient? Why?
Code (a) is significantly more efficient. It accesses consecutive locations, so all four floats in each
fetched cache line get used before the line is replaced, whereas (b) strides across the arrays and keeps
taking cache misses.
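As a rough illustration (these are not the assignment's exact loops), the sketch below walks a small float matrix while simulating a cache that can hold only one 16-byte line, and counts how often a new line must be fetched for a consecutive walk versus a strided walk:

    #include <stdio.h>
    #include <stdint.h>

    #define ROWS 8
    #define COLS 8
    #define LINE_BYTES 16

    static _Alignas(16) float A[ROWS][COLS];

    /* Count line fetches when visiting every element in the given order. */
    static int count_fetches(int row_major) {
        uintptr_t current_line = 0;
        int have_line = 0, fetches = 0;
        for (int x = 0; x < ROWS; x++) {
            for (int y = 0; y < COLS; y++) {
                int i = row_major ? x : y;
                int j = row_major ? y : x;
                uintptr_t line = (uintptr_t)&A[i][j] / LINE_BYTES;
                if (!have_line || line != current_line) {   /* new line -> one fetch */
                    fetches++;
                    current_line = line;
                    have_line = 1;
                }
            }
        }
        return fetches;
    }

    int main(void) {
        printf("consecutive walk: %d fetches\n", count_fetches(1));  /* 16: each line reused 4 times */
        printf("strided walk:     %d fetches\n", count_fetches(0));  /* 64: a new line every access  */
        return 0;
    }

In the consecutive walk every fetched line has all four of its floats used, which is exactly why (a) wins.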
2.Exercise 5.3.1, using the list of references printed in red below:
"Caches are important to providing a high performance memory hierarchy to processors. Below is a list of 32-bit
memory address references, given as word addresses.
5, 215, 174, 7, 82, 215, 66, 175, 65, 104, 87, 214, 5
5.3.1 "For each of these references,
a) identify the binary address,
b) identify the tag, and
c) identify the index given a direct-mapped cache with 16 one-word blocks.
d) Also list if each reference is a hit or a miss, assuming the cache is initially empty."
Word addresses mean a block size of 4 bytes to hold the word. Since we have 16 one-word
blocks, we need 4 bits for the index (and since these are word addresses, there is no byte
offset to worry about). So the tag is the remaining 32 - 4 = 28 bits, taken from the high end of
the address.
Reference   Binary address   Tag    Index   Hit/Miss
5           0000 0101        0000   0101    miss
215         1101 0111        1101   0111    miss
174         1010 1110        1010   1110    miss
7           0000 0111        0000   0111    miss
82          0101 0010        0101   0010    miss
215         1101 0111        1101   0111    miss
66          0100 0010        0100   0010    miss
175         1010 1111        1010   1111    miss
65          0100 0001        0100   0001    miss
104         0110 1000        0110   1000    miss
87          0101 0111        0101   0111    miss
214         1101 0110        1101   0110    miss
5           0000 0101        0000   0101    hit
(Only the low 8 bits of each 32-bit address are shown; the upper 24 bits of every tag here are zero.)
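The whole table can be reproduced with a short direct-mapped cache simulation (16 one-word blocks; index = low 4 bits, tag = the remaining high bits, exactly as worked out above):

    #include <stdio.h>

    int main(void) {
        int refs[] = {5, 215, 174, 7, 82, 215, 66, 175, 65, 104, 87, 214, 5};
        int n = sizeof(refs) / sizeof(refs[0]);
        int tags[16];
        int valid[16] = {0};

        for (int i = 0; i < n; i++) {
            int index = refs[i] & 0xF;     /* low 4 bits pick one of the 16 blocks */
            int tag   = refs[i] >> 4;      /* remaining high bits form the tag     */
            int hit   = valid[index] && tags[index] == tag;
            printf("%3d  tag=%2d  index=%2d  %s\n", refs[i], tag, index, hit ? "hit" : "miss");
            valid[index] = 1;
            tags[index]  = tag;
        }
        return 0;
    }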
3.Exercise 5.3.4
"There are many different design parameters that are important to a cache's overall performance. The table below
lists parameters for different ONE direct-mapped cache designs."
This can be determined from the index. In (a) the index is 6 bits, so the cache is 2^6 = 64
blocks. In (b) the index is 7 bits, so the cache is 2^7 = 128 blocks.
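The same arithmetic as a short C check:

    #include <stdio.h>

    int main(void) {
        printf("(a) 6 index bits -> %d blocks\n", 1 << 6);   /* 64  */
        printf("(b) 7 index bits -> %d blocks\n", 1 << 7);   /* 128 */
        return 0;
    }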
Address   Binary address    Line ID   Hit/Miss   Replace?
1024      100 0000 0000     0         M          N
16        1 0000            2         M          N
132       1000 0100         16        M          N
232       1110 1000         29        M          N
160       1010 0000         20        M          N
30        1 1110            3         M          N
4         100               0         M          Y
0         0                 0         H          N
140       1000 1100         17        M          N
3100      1100 0001 1100    3         M          Y
180       1011 0100         22        M          N
2180      1000 1000 0100    16        M          Y
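The table above can be reproduced with a short simulation. The cache geometry used here (8-byte blocks and a 6-bit index, i.e. 64 lines) is an assumption inferred from the Line ID column and from design (a)'s index width, since the problem statement for this table is not reproduced:

    #include <stdio.h>

    int main(void) {
        int addrs[] = {1024, 16, 132, 232, 160, 30, 4, 0, 140, 3100, 180, 2180};
        int n = sizeof(addrs) / sizeof(addrs[0]);
        long held_block[64];                 /* which block each line currently holds */
        int  valid[64] = {0};

        for (int i = 0; i < n; i++) {
            long block = addrs[i] / 8;       /* 8-byte blocks           */
            int  line  = (int)(block % 64);  /* 6-bit index -> 64 lines */
            int  hit   = valid[line] && held_block[line] == block;
            int  repl  = valid[line] && !hit;       /* a miss to an occupied line evicts */
            printf("%5d  line=%2d  %s  replace=%s\n",
                   addrs[i], line, hit ? "H" : "M", repl ? "Y" : "N");
            valid[line]      = 1;
            held_block[line] = block;
        }
        return 0;
    }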
4.Exercise 5.5.2
L1: Write-back, write allocate
L2: Write-through, non-write allocate
5.5.2 "Describe the procedure of handling an L1 write miss, considering the component involved and the possibility
of replacing a dirty block."
First we check whether the block currently occupying the L1 line is dirty. If it is, we write that dirty block
back to memory. Next we retrieve the target block from memory into the L1 line (replacing the block that was in
our way). Finally we perform the write into the L1 block, which marks it dirty. Later, when a different memory
location that maps to the same line must be brought in, our block gets written back.
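A minimal sketch of those steps in C, modeling a single write-back, write-allocate L1 line (the structure and helper names are invented for illustration, not taken from the textbook):

    #include <stdio.h>
    #include <string.h>

    #define BLOCK_WORDS 4

    struct l1_line {
        int valid, dirty;
        unsigned tag;
        unsigned data[BLOCK_WORDS];
    };

    /* Stand-ins for the next level of the hierarchy; contents are illustrative only. */
    static void fetch_block(unsigned tag, unsigned *dst) {
        (void)tag;
        memset(dst, 0, BLOCK_WORDS * sizeof *dst);
    }

    static void write_back_block(unsigned tag, const unsigned *src) {
        (void)src;
        printf("writing back dirty block with tag %u\n", tag);
    }

    static void handle_write(struct l1_line *line, unsigned tag, unsigned offset, unsigned value) {
        if (!line->valid || line->tag != tag) {            /* write miss                 */
            if (line->valid && line->dirty)
                write_back_block(line->tag, line->data);   /* 1. flush the dirty victim  */
            fetch_block(tag, line->data);                  /* 2. allocate the new block  */
            line->tag   = tag;
            line->valid = 1;
            line->dirty = 0;
        }
        line->data[offset] = value;                        /* 3. write into the L1 block */
        line->dirty = 1;                                   /* written back on a later eviction */
    }

    int main(void) {
        struct l1_line line = {0};
        handle_write(&line, 7, 0, 42);   /* cold miss: allocate, then write                         */
        handle_write(&line, 9, 1, 17);   /* conflict miss: the victim is dirty, so write back first */
        return 0;
    }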