CSE203 Computer Architecture
13. Additional Materials for
Final Exam (incl. HW2 Sol)
Prof. Seong Tae Kim (st.kim@khu.ac.kr)
Augmented Intelligence Lab. (ailab.khu.ac.kr)
School of Computing, Kyung Hee University
Assignment 1
• Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 1 GHz. Please assume a main memory access time of 100 ns, including all the miss handling.
a) Suppose the miss rate per instruction at the primary cache is 4%. Please calculate the effective CPI with one level of caching.
The miss penalty to main memory is
100 ns / (1 ns per clock cycle) = 100 clock cycles
The effective CPI with one level of caching is given by
Total CPI = Base CPI + Memory-stall cycles per instruction
= 1.0 + 4% × 100 = 5
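As a quick cross-check, a minimal Python sketch of the same calculation (variable names are illustrative only):

```python
# Effective CPI with one level of caching.
clock_ns = 1.0                       # 1 GHz clock -> 1 ns per cycle
mem_access_ns = 100.0                # main memory access time, incl. miss handling

base_cpi = 1.0
misses_per_instr = 0.04              # 4% misses per instruction in the primary cache
miss_penalty = mem_access_ns / clock_ns   # 100 clock cycles

total_cpi = base_cpi + misses_per_instr * miss_penalty
print(total_cpi)                     # 5.0
```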
Assignment 1
• Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 1 GHz. Please assume a main memory access time of 100 ns, including all the miss handling.
b) To make the processor faster, we add a secondary cache that has a 10 ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 1%. Please calculate the effective CPI with two levels of caching:
With two levels of caching, a miss in the primary cache can be satisfied either by the secondary cache
or by main memory. The miss penalty for an access to the second-level cache is
10 ns / (1 ns per clock cycle) = 10 clock cycles
Total CPI = Base CPI + Primary stalls per instruction + Secondary stalls per instruction
= 1.0 + 4% × 10 + 1% × 100 = 2.4
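The two-level case follows the same pattern; a minimal Python sketch under the same assumptions (illustrative names):

```python
# Effective CPI with two levels of caching.
clock_ns = 1.0                       # 1 GHz clock -> 1 ns per cycle
l2_access_ns = 10.0                  # secondary cache access time (hit or miss)
mem_access_ns = 100.0                # main memory access time

base_cpi = 1.0
l1_misses_per_instr = 0.04           # primary-cache misses per instruction
mem_misses_per_instr = 0.01          # misses per instruction that reach memory

l2_penalty = l2_access_ns / clock_ns     # 10 clock cycles
mem_penalty = mem_access_ns / clock_ns   # 100 clock cycles

total_cpi = base_cpi + l1_misses_per_instr * l2_penalty + mem_misses_per_instr * mem_penalty
print(total_cpi)                     # ≈ 2.4
```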
Assignment 2
• Assume a pipelined processor (4-stage pipeline: instruction fetch / decode / execution / write back) that has two functional units (i.e., a 3-cycle adder and a 5-cycle multiplier). There is an instruction sequence as follows:
ADD R3 ← R1, R2
ADD R5 ← R3, R4
MUL R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
MUL R5 ← R5, R11
a) Please calculate the number of cycles to complete the instruction sequence in a non-pipelined machine: 3 × 6 + 3 × 8 = 42 cycles
b) Please calculate the number of cycles to complete the instruction sequence with in-order dispatch without forwarding: 29 cycles
c) Please calculate the number of cycles to complete the instruction sequence with in-order dispatch with forwarding: 22 cycles
d) Please calculate the number of cycles to complete the instruction sequence with out-of-order dispatch with forwarding: 20 cycles
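For part a), a minimal Python sketch of the count, assuming each instruction in the non-pipelined machine occupies fetch, decode, execute (3 cycles for ADD, 5 for MUL), and write back strictly one after another:

```python
# Non-pipelined cycle count for the six-instruction sequence (part a).
# Per instruction: 1 (fetch) + 1 (decode) + execute latency + 1 (write back).
EXEC_LATENCY = {"ADD": 3, "MUL": 5}
sequence = ["ADD", "ADD", "MUL", "ADD", "MUL", "MUL"]

total_cycles = sum(1 + 1 + EXEC_LATENCY[op] + 1 for op in sequence)
print(total_cycles)   # 3*6 + 3*8 = 42
```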
Assignment 2
Pipeline timing diagrams (cycles 1-29) for cases b) in-order dispatch without forwarding, c) in-order dispatch with forwarding, and d) out-of-order dispatch with forwarding: each row traces one instruction through F (fetch), D (decode/dispatch, repeated while the instruction stalls), the numbered execute cycles of the 3-cycle adder or 5-cycle multiplier, and W (write back).
Assignment 3
Consider a 16-way set-associative cache:
- Data words are 1 byte long
- Byte addressing (one address defines a byte)
- The cache holds 256 KB of data
- Each block holds 32 data words
- Physical addresses are 32 bits long
How many bits of tag, index, and offset are needed to support references to this cache?
Block offset: 5 bits (because there are 32 bytes in each block)
# of blocks in the cache: 256 KB / 32 B = 8K
# of sets: 8K blocks / 16 ways = 512
Index: 9 bits (2^9 = 512)
Tag: 32 - 5 - 9 = 18 bits
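A minimal Python sketch of the bit-width calculation (parameter names are illustrative):

```python
# Tag/index/offset widths for a 256 KB, 16-way set-associative cache
# with 32-byte blocks and 32-bit physical addresses.
from math import log2

addr_bits   = 32
cache_bytes = 256 * 1024        # 256 KB of data
block_bytes = 32                # 32 one-byte words per block
ways        = 16

offset_bits = int(log2(block_bytes))                 # 5
num_blocks  = cache_bytes // block_bytes             # 8K blocks
num_sets    = num_blocks // ways                     # 512 sets
index_bits  = int(log2(num_sets))                    # 9
tag_bits    = addr_bits - index_bits - offset_bits   # 18

print(offset_bits, index_bits, tag_bits)             # 5 9 18
```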
Assignment 4
Assume a superscalar processor with the following specifications:
• Issue width: 4 instructions per cycle
• Functional units: 2 ALUs, 1 Load/Store unit, 1 Branch unit
You are given the following instruction stream:
I1: R1 = R2 + R3 (ALU)
I2: R4 = MEM[R1] (Load)
I3: R5 = R1 + R6 (ALU)
I4: if (R5 == 0) jump (Branch)
I5: R7 = R8 + R9 (ALU)
(a) Assuming no data hazards and ideal conditions, how many cycles will it take to issue all 5 instructions?
• Cycle 1: Issue I1 (ALU), I2 (Load), I4 (Branch), I5 (ALU) → 4 instructions issued
• Cycle 2: Issue I3 (ALU)
• Total: 2 cycles needed to issue all 5 instructions.
(b) If true data dependencies and structural hazards are considered, what kind of scheduling or reordering might help improve performance?
• I3 depends on the result of I1 (R1), creating a data hazard.
• I4 depends on I3, adding another hazard.
• I2 and I3 both use R1, so a load-use hazard may delay I3.
✓ Out-of-order execution can help issue I5 earlier while waiting on dependencies.
✓ Dynamic scheduling (e.g., Tomasulo's algorithm) can resolve hazards at runtime.
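A minimal Python sketch of part (a): a greedy issue loop under the stated issue width and functional-unit mix, ignoring data hazards as the question allows. The particular instructions grouped into each cycle may differ from the grouping shown above, but the cycle count is the same.

```python
# Count the cycles needed to issue all five instructions with issue width 4
# and a unit mix of 2 ALUs, 1 load/store unit, and 1 branch unit.
FU_COUNT = {"ALU": 2, "LS": 1, "BR": 1}
ISSUE_WIDTH = 4
program = ["ALU", "LS", "ALU", "BR", "ALU"]   # I1..I5 by required unit type

cycles = 0
remaining = list(range(len(program)))         # indices of not-yet-issued instructions
while remaining:
    cycles += 1
    free = dict(FU_COUNT)                     # functional units free this cycle
    issued = []
    for i in remaining:
        if len(issued) == ISSUE_WIDTH:
            break
        unit = program[i]
        if free[unit] > 0:                    # issue only if a matching unit is free
            free[unit] -= 1
            issued.append(i)
    remaining = [i for i in remaining if i not in issued]

print(cycles)                                 # 2
```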
*Example Assignment
• Please assume that you have 4 GB of main memory at your disposal.
• 512 MB of the 4 GB has been reserved for process page table storage.
• Each page table entry consists of:
  • A physical frame number
  • 1 valid bit
  • 1 dirty bit
  • 1 LRU status bit
• Virtual addresses are 32 bits
• Physical addresses are 29 bits
• The page size is 16 KB
How many process page tables can fit in the 512 MB space?

We can first calculate the number of page table entries associated with each process.
- 16 KB pages imply that the offset of a VA/PA is 14 bits (2^14 = 16 KB)
- Thus, the remaining 18 bits of the VA represent the VPN and serve as the index into the page table
- Thus, each page table has 2^18 entries
Next, each PA is 29 bits:
- 14 bits of the PA come from the offset; the other 15 come from a PT lookup
- Given that each PT entry consists of a PFN, a valid bit, a dirty bit, and an LRU bit, each PT entry will be 2 bytes (16 bits)
Therefore:
Size of one page table = 2^18 entries × 2 bytes = 2^19 bytes (512 KB)
Maximum # of page tables in 512 MB = 2^29 B / 2^19 B = 2^10 = 1,024
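A minimal Python sketch of the sizing arithmetic, following the slide's assumption that each page table entry is stored in 2 bytes:

```python
# Page-table sizing: how many per-process page tables fit in the reserved space.
page_bytes  = 16 * 1024                        # 16 KB pages
va_bits     = 32
offset_bits = page_bytes.bit_length() - 1      # 14 (2^14 = 16 KB)
vpn_bits    = va_bits - offset_bits            # 18 -> 2^18 page table entries

pte_bytes   = 2                                # PFN + valid + dirty + LRU, per the slide
pt_bytes    = (1 << vpn_bits) * pte_bytes      # 2^19 bytes = 512 KB per page table

reserved_bytes = 512 * 1024 * 1024             # 512 MB reserved for page tables
print(pt_bytes // 1024)                        # 512 (KB per page table)
print(reserved_bytes // pt_bytes)              # 1024 page tables
```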
*Example Assignment
• What is the average memory access time when you have the following memory hierarchy? Assume that (i) the cache uses physical addresses, (ii) the CPU stalls until the data is delivered, (iii) everything fits in memory (i.e., no page faults), and (iv) the hardware performs the page table access and updates the TLB on a TLB miss.
Unit                            | Additional Access Latency | Local Hit Rate
TLB                             | 1 cycle                   | 95%
L1                              | 1 cycle                   | 90%
L2                              | 10 cycles                 | 95%
L3                              | 50 cycles                 | 98%
Memory                          | 100 cycles                | 100%
Page table access & TLB update  | 200 cycles                | 100%
L3 miss penalty: 100 cycles
L2 miss penalty: 0.98 × 50 + 0.02 × 100 = 51 cycles
L1 miss penalty: 0.95 × 10 + 0.05 × 51 = 12.05 cycles
TLB/page table access: 0.95 × 1 + 0.05 × (1 + 200) = 11 cycles
Data memory access: 0.9 × 1 + 0.1 × 12.05 = 2.105 cycles
Total AMAT: 11 + 2.105 = 13.105 cycles
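A minimal Python sketch reproducing the calculation above, assuming (as the slide does) that a miss at one level costs the average access time of the next level, and that a TLB miss costs the TLB probe plus the 200-cycle page table walk:

```python
# Average memory access time for the TLB + L1/L2/L3 + memory hierarchy.
def level_cost(hit_rate, latency, miss_cost):
    """Average cost of an access that hits with hit_rate or otherwise pays miss_cost."""
    return hit_rate * latency + (1 - hit_rate) * miss_cost

l3_miss_penalty = 100                                    # memory behind L3
l2_miss_penalty = level_cost(0.98, 50, l3_miss_penalty)  # 51
l1_miss_penalty = level_cost(0.95, 10, l2_miss_penalty)  # 12.05
data_access     = level_cost(0.90, 1, l1_miss_penalty)   # 2.105
tlb_access      = level_cost(0.95, 1, 1 + 200)           # 11

print(tlb_access + data_access)                          # ≈ 13.105
```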