COMPUTER ARCHITECTURE
Pipeline Architecture
Memory
Vector Processor
Flynn's Taxonomy of Computer Architecture
RISC & CISC Architectures
Interprocess Communication
PIPELINE ARCHITECTURE
Chapter at a Glance
Review of Pipeline Architecture
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. Each step in the pipeline (called a pipe stage) completes a part of an instruction. Because all stages proceed at the same time, the length of a processor cycle is determined by the time required for the slowest pipe stage. The designer's goal is balancing the length of each pipeline stage. If the stages are perfectly balanced, the time per instruction on the pipelined processor is:

Time per instruction on un-pipelined machine / Number of pipe stages
The Speedup from pipelining is equal to the number of pipe stages.
Ideal Pipeline Technique

Clock cycle:      1    2    3    4    5    6    7    8    9
Instruction 1:    IF   ID   EX   MEM  WB
Instruction 2:         IF   ID   EX   MEM  WB
Instruction 3:              IF   ID   EX   MEM  WB
Instruction 4:                   IF   ID   EX   MEM  WB
Instruction 5:                        IF   ID   EX   MEM  WB

Fig: 1 Pipeline technique
* Pipeline throughput
The throughput of a pipeline is defined as the number of instructions completed per time unit. All pipe stages operate in a synchronized fashion: they all start at the same time (this simplifies hardware design). The time required for moving one instruction down the pipeline is called a processor cycle (not to be confused with a clock cycle). Because all pipe stages must be ready to proceed synchronously, the processor cycle is determined by the slowest stage.
* Performance of the pipeline
Pipelining achieves a reduction of the average execution time per instruction, in the sense that one can perform more instructions per clock cycle. This can be viewed in two ways:
1. Decreasing the CPI. This is the typical way in which people view the performance increase.
2. Decreasing the cycle time (i.e., increasing the clock rate).
The good news is that pipelining is typically invisible to the programmer.
Speed-up, Efficiency and Throughput of Pipeline
A pipeline of k stages can process n tasks in k + (n - 1) clock cycles: k cycles are used to complete the execution of the very first task, and the remaining n - 1 tasks require n - 1 more cycles. Thus the total time required is T_k = [k + (n - 1)]τ, where τ is the clock period.
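The relation above can be checked numerically. A minimal sketch (Python; the stage count, task count and clock period below are illustrative values, not taken from the text):

```python
def pipeline_time(k: int, n: int, tau: float) -> float:
    """Total time for n tasks on a k-stage pipeline: T_k = [k + (n - 1)] * tau."""
    return (k + (n - 1)) * tau

def sequential_time(k: int, n: int, tau: float) -> float:
    """Non-pipelined time: each task passes through all k stages before the next starts."""
    return n * k * tau

# Example: 5-stage pipeline, 100 tasks, 10 ns clock period
tau = 10e-9
print(pipeline_time(5, 100, tau))    # (5 + 99) * 10 ns
print(sequential_time(5, 100, tau))  # 100 * 5 * 10 ns
```

Note how the pipelined time grows with n + k - 1 rather than n * k, which is where the speedup in the next section comes from.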
Pipeline Hazards
In a pipeline, each instruction is supposed to start executing at a given clock cycle. Unfortunately, there are cases in which an instruction cannot execute at its allotted clock cycle. These situations are called pipeline hazards. A hazard is a situation which prevents the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining.
There are three classes of hazards:
* Data hazards
* Structural hazards
* Control hazards
Hazards can make it necessary to stall the pipeline. When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled, and no new instructions are fetched during the stall.
1. Pipeline stall
When hazards occur, the typical approach is to stall the pipeline: delay an instruction in the pipeline until another instruction completes execution. When an instruction is stalled, all instructions issued after the stalled instruction are also stalled, while all instructions issued before it are allowed to continue. No new instruction is fetched during a stall.
2. Pipeline bubble
A pipeline bubble (a technique also known as a pipeline break or pipeline stall) is a method for preventing data, structural, and branch hazards from occurring. As instructions are fetched, control logic determines whether a hazard will occur. If so, the control logic inserts NOPs into the pipeline. Thus, before the next instruction (which would cause the hazard) is executed, the previous one will have had sufficient time to complete and prevent the hazard. If the number of NOPs equals the number of stages in the pipeline, the processor has been cleared of all instructions and can proceed free from hazards; this is called flushing the pipeline. All forms of stalling introduce a delay before the processor can resume execution.
Data Hazards
A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining, or other reordering of instructions, would change the order of access to the operand involved in the dependence. There are three types of data hazards:
* Read After Write (RAW) hazards:
RAW hazard is the most common type. It appears when the next instruction tries to read from a source before the previous instruction writes to it, so the next instruction gets the incorrect old value: an operand is modified and read soon after. Because the first instruction may not have finished writing to the operand, the second instruction may use incorrect data.
* Write After Read (WAR) hazards:
A WAR hazard appears when the next instruction writes to a destination before the previous instruction reads it. In this case, the previous instruction incorrectly gets a new value: an operand is read and written soon after. Because the write may have finished before the read, the read instruction may incorrectly get the new written value.
* Write After Write (WAW) hazards:
A WAW data hazard is a situation in which the next instruction tries to write to a destination before a previous instruction writes to it, resulting in the changes being made in the wrong order: two instructions that write to the same operand are performed, and the first one issued may finish second, leaving the operand with an incorrect data value. The results of WAW hazards are: (i) pipeline latency and (ii) instruction effects not complete before the next operation begins.
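The three cases can be stated compactly in terms of each instruction's read set and write set. A minimal sketch (Python; the encoding of an instruction as a pair of register sets is an assumption for illustration, matching the D/R set notation used later in this chapter):

```python
def classify_hazards(d_i, r_i, d_j, r_j):
    """Classify data hazards between instruction I and a later instruction J.
    d_* = registers read (domain), r_* = registers written (range)."""
    hazards = []
    if r_i & d_j:      # J reads what I writes -> RAW
        hazards.append("RAW")
    if d_i & r_j:      # J writes what I reads -> WAR
        hazards.append("WAR")
    if r_i & r_j:      # both write the same register -> WAW
        hazards.append("WAW")
    return hazards

# I: R2 <- R1 + R3,  J: R4 <- R2 + R3  (J reads R2 before I writes it back)
print(classify_hazards(d_i={"R1", "R3"}, r_i={"R2"},
                       d_j={"R2", "R3"}, r_j={"R4"}))  # ['RAW']
```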
Structural Hazards
A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A structural hazard might occur, for instance, if a program were to execute a branch instruction followed by a computation instruction.
Control Hazards
Control hazards occur when the processor is told to branch, i.e., if a certain condition is true, then jump from one part of the instruction stream to another, not necessarily to the next instruction sequentially. In such a case, the processor cannot tell in advance whether it should process the next instruction (when it may instead have to move to a distant instruction).
Multiple Choice Type Questions
1. The number of cycles required to complete n tasks in a k-stage pipeline is
[WBUT 2007, 2016, 2017]
a) k + (n - 1) b) … c) … d) none of these
Answer: (a)
2. A 4-ary 3-cube hypercube architecture has [WBUT 2007]
a) 3 dimensions with 4 nodes along each dimension
b) 4 dimensions with 3 nodes along each dimension
c) both (a) and (b) d) none of these
Answer: (a)
3. Which of these are examples of 2-dimensional topologies in static networks?
[WBUT 2007, 2010]
a) Mesh b) SCCC networks c) Linear array d) None of these
Answer: (a)
4. The seek time of a disk is 30 ms. It rotates at the rate of 30 rotations/second. The capacity of each track is 300 words. The access time is approximately
[WBUT 2007, 2008, 2010]
a) 62 ms b) 60 ms c) 47 ms d) none of these
Answer: (c) (seek time 30 ms plus an average rotational delay of half a rotation, about 16.67 ms, gives roughly 47 ms)
5. For two instructions I and J, a WAR hazard occurs if [WBUT 2007, 2010, 2014]
a) R(I) ∩ D(J) ≠ φ b) R(I) ∩ R(J) ≠ φ
c) D(I) ∩ R(J) ≠ φ d) none of these
Answer: (c)
6. The performance of a pipelined processor suffers if
[WBUT 2008, 2009, 2011, 2013]
a) the pipeline stages have different delays
b) consecutive instructions are dependent on each other
c) the pipeline stages share hardware resources
d) all of these
Answer: (d)
7. A single bus structure is primarily found in [WBUT 2008]
a) Main frames b) High performance machines
c) Mini and Micro-computers d) Supercomputers
Answer: (c)
8. What will be the speed up for a four-stage linear pipeline, when the number of
instruction n = 64? [WBUT 2008, 2009]
a) 4.5 b) 7.4 c) 6.5 d) None of these
Answer: (d)
9. Dynamic pipeline allows [WBUT 2008, 2009, 2011, 2014, 2016, 2017]
a) multiple functions to evaluate b) only streamline connection
c) to perform fixed function d) none of these
Answer: (a)
10. The division of stages of a pipeline into sub-stages is the basis for
[WBUT 2008, 2014]
a) pipelining b) super-pipelining
c) superscalar d) VLIW processor
Answer: (b)
11. A pipeline processor [WBUT 2012, 2014, 2018, 2019]
a) is a sequential circuit b) is a combinational circuit
c) consists of both sequential and combinational circuits
d) none of these
Answer: (c)
12. Utilization pattern of successive stages of a synchronous pipeline can be specified by [WBUT 2012, 2015, 2017, 2018, 2019]
a) Truth table b) Excitation table
c) Reservation table d) Periodic table
Answer: (c)
13. SPARC stands for [WBUT 2012, 2018]
a) Scalable Processor Architecture
b) Superscalar Processor A RISC Computer
c) Scalable Processor A RISC Computer
d) Scalable Pipeline Architecture
Answer: (a)
14. Portability is definitely an issue for which of the following architectures?
[WBUT 2012, 2018]
a) VLIW processor b) Super Scalar processor
c) Super pipelined d) none of these
Answer: (a)
15. Which of the following is not the cause of possible data hazard?
[WBUT 2012, 2018, 2019]
a) RAR b) RAW c) WAR d) WAW
Answer: (a)
16. What will be the speedup for a 4-segment linear pipeline when the number of instructions n = …? [WBUT 2013, 2014, 2019]
Answer: (b)
17. Which type of data hazard is not possible? [WBUT 2015]
a) WAR b) RAW c) RAR
Answer: (c)
18. MIPS means
a) Multiple Instruction Per Second
b) Millions of Instructions Per Second
c) Multi-Instruction Performed System
d) none of these
Answer: (b)
19. … [WBUT 2014, 2016]
a) … b) structural hazard c) control hazard d) enhancing the speed of pipeline
Answer: (b)
20. Suppose the time delays of the four stages of a pipeline are t1 = 60 ns, t2 = 50 ns, t3 = … ns, t4 = 80 ns respectively, and the interface latch has a delay tl = 10 ns; then the maximum clock frequency for the pipeline is [WBUT 2016]
a) 100 ns b) 90 ns c) 190 ns d) 30 ns
Answer: (b) (the clock period must cover the slowest stage plus the latch delay: 80 + 10 = 90 ns)
21. Pipelining uses [WBUT 2017]
a) data parallelism b) temporal parallelism
c) spatial parallelism d) none of these
Answer: (b)
22. An n-dimensional hypercube has [WBUT 2017]
a) n^2 nodes b) n^n nodes
c) 2^n nodes d) none of these
Answer: (c)
23. The throughput of a super scalar processor is [WBUT 2019]
a) less than 1 b) 1
c) more than 1 d) not known
Answer: (c)
1. Define Speed-up. Deduce that the maximum speed-up in a k-stage pipeline processor is k. Is this maximum speed-up always achievable? Explain. [WBUT 2006]
OR,
If there are no stalls (waits), then prove that the speedup is equal to the pipeline depth, i.e., the number of pipeline stages. [WBUT 2016]
OR,
Show that the maximum speedup of a pipeline is equal to its stages. [WBUT 2016]
Answer:
Suppose we consider that:
k = no. of stages in the pipeline
n = no. of processes to be executed
τ = time delay for each stage of the pipeline
and S = the speed-up.
Then,
S = (Time required for non-pipelined process) / (Time required for pipelined process) = nkτ / [k + (n - 1)]τ = nk / [k + (n - 1)]
The maximum speed-up is S ≈ k when n >> k.
The maximum speed-up is never fully achievable because of data dependencies between instructions, interrupts, program branches, etc. Many pipeline cycles may be wasted in a waiting state caused by out-of-sequence instruction executions.
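The limit S → k can be checked numerically. A small sketch (Python; k = 5 and the task counts are illustrative values):

```python
def speedup(k: int, n: int) -> float:
    """S = nk / (k + n - 1): speedup of a k-stage pipeline over a
    non-pipelined processor for n tasks."""
    return n * k / (k + n - 1)

# As the number of tasks n grows, S approaches the stage count k
for n in (5, 50, 500, 50_000):
    print(n, round(speedup(5, n), 3))
```

For small n the speedup is far below k, which is why the maximum is only reached asymptotically.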
2. What are the different factors that can affect the performance of a pipelined system? Differentiate between WAR and RAW with a suitable example.
[WBUT 2007]
Answer:
Pipelining achieves a reduction of the average execution time per instruction, in the sense that the pipeline can perform more instructions per clock cycle. This can be viewed in two ways:
* Decreasing the CPI. This is the typical way in which people view the performance increase.
* Decreasing the cycle time (i.e., increasing the clock rate).
Pipelining increases the CPU instruction throughput. Pipelining does not decrease the execution time of an individual instruction; it actually increases it, due to overhead (clock skew and pipeline register delay) in the control of the pipeline.
Data hazards occur when data is modified. Ignoring potential data hazards can result in race conditions.
There are two situations a data hazard can occur in:
Read after Write (RAW):
An operand is modified and read soon after. Because the first instruction may not have
finished writing to the operand, the second instruction may use incorrect data.
Write after Read (WAR):
Read an operand and write soon after to that same operand. Because the write may have
finished before the read, the read instruction may incorrectly get the new written value.
A RAW Data Hazard refers to a situation where we refer to a result that has not yet been
calculated, for example:
i1. R2 ← R1 + R3
i2. R4 ← R2 + R3

The same situation arises for a sequence of instructions that all read R1 after DADD writes it:

              DADD        DSUB        AND         OR          XOR
            R1,R2,R3    R4,R1,R5    R6,R1,R7    R8,R1,R9   R10,R1,R11

Fetch:       DADD   DSUB   AND    OR     XOR
Decode:             DADD   DSUB   AND    OR     XOR
Read:                      DADD   DSUB   AND    OR     XOR
Execution:                        DADD   DSUB   AND    OR     XOR
Write back:                              DADD   DSUB   AND    OR     XOR
We have considered that the above pipeline technique is a five-stage pipeline. The stages are fetch, decode, read, execution and write back. The figure also shows in which clock pulse each operation is performed.
Now, a data hazard will occur in the 4th clock cycle, where both the read and execution operations are performed on register R1 at the same time.
Result forwarding is a technique to minimize stalls in pipeline processing. After execution of one instruction, the result is transferred to the next instruction directly, bypassing the write-back stage; i.e., without writing the result into the register file, the pipeline forwards the result to the next instruction.
Result forwarding may not work in the case of a branch instruction in a pipelined execution. So, if there is a branch instruction in the above instruction set, then result forwarding will not work.
6. Explain DMA working principle. [WBUT 2008]
Answer:
The main idea of direct memory access (DMA) is to transfer data between peripheral devices and memory while bypassing the CPU. It allows peripheral devices to transfer data directly from and to memory without the intervention of the CPU. Having
peripheral devices access memory directly would allow the CPU to do other work, which would lead to improved performance, especially in the case of large transfers. The DMA controller controls one or more peripheral devices. It allows devices to transfer data to or from the system's memory without the help of the processor. Both the DMA controller and the CPU use the memory bus, and only one or the other can use the memory at the same time. The DMA controller sends a request to the CPU asking its permission to use the bus. The CPU returns an acknowledgment to the DMA controller, granting it bus access. The DMA controller can now take control of the bus to independently conduct the memory transfer. When the transfer is complete, the DMA controller relinquishes its control of the bus to the CPU. Processors that support DMA provide one or more input signals that the bus requester can assert to gain control of the bus, and one or more output signals that the CPU asserts to indicate it has relinquished the bus. The figure below shows how the DMA controller shares the CPU's memory bus.
[Figure: the DMA controller is connected to the CPU by DMA Request and DMA Acknowledgement lines and to the peripheral device by control signals; the DMA controller, CPU and memory share the address bus.]
Fig: DMA controller shares the CPU's memory bus
A DMA controller has an address register, a word count register, and a control register. The address register contains an address that specifies the memory location of the data to be transferred. It is typically possible to have the DMA controller automatically increment the address register after each word transfer, so that the next transfer will be from the next memory location. The word count register holds the number of words to be transferred; the word count is decremented by one after each word transfer. The control register specifies the transfer mode. Direct memory access data transfer can be performed
in burst mode or single-cycle mode. In burst mode, the DMA controller keeps control of the bus until all the data has been transferred to (from) memory from (to) the peripheral device. This mode of transfer is needed for fast devices, where the data transfer cannot be stopped until the entire transfer is done. In single-cycle mode (cycle stealing), the DMA controller relinquishes the bus after each transfer of one data word. This minimizes the amount of time that the DMA controller keeps the CPU from controlling the bus, but it requires that the bus request/acknowledge sequence be performed for every single transfer. This overhead can result in a degradation of performance. The single-cycle mode is preferred if the system cannot tolerate more than a few cycles of added interrupt latency, or if the peripheral devices can buffer very large amounts of data, causing the DMA controller to tie up the bus for an excessive amount of time.
7. What are the different pipeline hazards and what are the remedies? [WBUT 2009]
OR,
Discuss data hazards briefly. [WBUT 2015]
OR,
What do you mean by hazards in pipeline? Describe the different types of hazards.
[WBUT 2018]
OR,
How are data hazards detected and prevented? [WBUT 2019]
Answer:
In computer architecture, a hazard is a potential problem that can happen in a pipelined processor. There are typically three types of hazards: data hazards, branching hazards, and structural hazards.
Instructions in a pipelined processor are performed in several stages, so that at any given time several instructions are being executed, and instructions may not be completed in the desired order. A hazard occurs when two or more of these simultaneous (possibly out of order) instructions conflict.
i. Data hazards
Data hazards occur when data is modified. Ignoring potential data hazards can result in
race conditions. There are three situations a data hazard can occur in:
* Read after Write (RAW): An operand is modified and read soon after. Because the
first instruction may not have finished writing to the operand, the second instruction
may use incorrect data.
* Write after Read (WAR): Read an operand and write soon after to that same
operand. Because the write may have finished before the read, the read instruction
may incorrectly get the new written value.
* Write after Write (WAW): Two instructions that write to the same operand are
performed. The first one issued may finish second, and therefore leave the operand
with an incorrect data value.
The operands involved in data hazards can reside in memory or in a register.
ii. Structural hazards
A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A structural hazard might occur, for instance, if a program were to execute a branch instruction followed by a computation instruction. Because they are executed in parallel, and because branching is typically slow (requiring a comparison, program-counter-related computation, and writing to registers), it is quite possible (depending on the architecture) that the computation instruction and the branch instruction will both require the ALU at the same time.
iii. Branch hazards
Branching hazards (also known as control hazards) occur when the processor is told to branch, i.e., if a certain condition is true, then jump from one part of the instruction stream to another, not necessarily to the next instruction sequentially. In such a case, the processor cannot tell in advance whether it should process the next instruction (when it may instead have to move to a distant instruction). This can result in the processor doing unwanted actions.
8. Use 8-bit 2's complement integers to perform -43 + (-13). [WBUT 2003]
Answer:
-43 = 11010101
-13 = 11110011
-43 + (-13) = 1 11001000
Discarding the carry, the result is 11001000, i.e., the 2's complement representation of -56.
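The arithmetic above can be reproduced in a few lines. A minimal sketch (Python) of 8-bit two's-complement addition with the carry discarded:

```python
def twos_complement(value: int, bits: int = 8) -> str:
    """Binary string of `value` in `bits`-bit two's-complement form."""
    return format(value & ((1 << bits) - 1), f"0{bits}b")

a = twos_complement(-43)                 # 11010101
b = twos_complement(-13)                 # 11110011
total = (int(a, 2) + int(b, 2)) & 0xFF   # add, then discard the carry out
print(a, b, format(total, "08b"))        # the sum is 11001000, i.e. -56
```

The same helper with `bits=16` covers the 16-bit question later in this chapter.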
9. What do you mean by pipeline processing? [WBUT 2009]
Answer:
Pipelining refers to the technique in which a given task is divided into a number of
subtasks that need to be performed in sequence. Each subtask is performed by a given
functional unit. The units are connected in a serial fashion and all of them operate
simultaneously. The use of pipelining improves the performance compared to the
traditional sequential execution of tasks. The figure below shows an illustration of the
basic difference between executing four subtasks of a given instruction (in this case
fetching F, decoding D, execution E, and writing the results W) using pipelining and
sequential processing.
[Figure: (a) sequential processing executes the four subtasks F, D, E, W of each instruction one instruction after another; (b) pipelining overlaps them, so that while instruction 1 is being decoded, instruction 2 is being fetched, and so on.]
Fig: Pipeline vs sequential processing
10. What are instruction pipeline and arithmetic pipeline? [WBUT 2009]
Answer:
An instruction pipeline is a technique used in the design of computer systems to increase their instruction throughput. The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. The principles used in instruction pipelining can also be used to improve the performance of computers in performing arithmetic operations such as add, subtract, and multiply.
In a program there are several instructions. There are five steps to execute an instruction, and the steps are:
i. Fetch
ii. Decode
iii. Operand fetch
iv. Execute
v. Write back
[Figure: the execution cycle as a sequence of pipeline stages, beginning with IF (instruction fetch) and ID (instruction decode), followed by the instruction execution phase.]
Fig: Instruction pipeline stages
The streams of instructions are executed in the pipeline in an overlapped manner.
Clock cycle:  1    2    3    4    5    6    7    8
I1:           IF   ID   OF   IE   WB
I2:                IF   ID   OF   IE   WB
I3:                     IF   ID   OF   IE   WB
I4:                          IF   ID   OF   IE   WB
Pipelining can be applied to arithmetic operations. As an example, we show a floating-point add pipeline below. The floating-point add unit has several stages:

Unpack → Align → Add → Normalize
Fig: Floating-point add pipeline stages
1. Unpack: The unpack stage partitions each floating-point number into three fields: the sign field, the exponent field, and the mantissa field. Any special cases, such as not-a-number (NaN), zero, and infinities, are detected during this stage.
2. Align: This stage aligns the binary points of the two mantissas by right-shifting the mantissa with the smaller exponent.
3. Add: This stage adds the two aligned mantissas.
4. Normalize: This stage packs the three fields of the result, after normalization and rounding, into the IEEE 754 floating-point format. Any output exceptions are detected during this stage.
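The four stages can be traced with a toy model (Python). Here `math.frexp`/`math.ldexp` stand in for the hardware unpack and pack steps, and rounding and special cases are ignored, so this only illustrates the data flow, not a real adder:

```python
import math

def fp_add_pipeline(a: float, b: float) -> float:
    """Toy walk-through of the four floating-point add pipeline stages."""
    # 1. Unpack: split each operand into mantissa and exponent (a = ma * 2**ea)
    ma, ea = math.frexp(a)
    mb, eb = math.frexp(b)
    # 2. Align: right-shift the mantissa with the smaller exponent
    if ea < eb:
        ma, ea = ma * 2.0 ** (ea - eb), eb
    else:
        mb, eb = mb * 2.0 ** (eb - ea), ea
    # 3. Add: add the two aligned mantissas
    m = ma + mb
    # 4. Normalize: repack mantissa and exponent into a float
    return math.ldexp(m, ea)

print(fp_add_pipeline(1.5, 2.25))   # 3.75
```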
11. Find the 2's complement of (14B)₁₆ represented in 16-bit format. [WBUT 2009]
Answer:
(14B)₁₆ = (0000 0001 0100 1011)₂
The 2's complement of (0000 0001 0100 1011)₂ is
(1111 1110 1011 0101)₂
12. What are the different factors that can affect the performance of a pipelined system? Differentiate between WAR and RAW with a suitable example. [WBUT 2010]
Answer:
Pipelining achieves a reduction of the average execution time per instruction, in the sense that the pipeline can perform more instructions per clock cycle. This can be viewed in two ways:
* Decreasing the CPI. This is the typical way in which people view the performance increase.
* Decreasing the cycle time (i.e., increasing the clock rate).
Pipelining increases the CPU instruction throughput. Pipelining does not decrease the execution time of an individual instruction; it actually increases it, due to overhead (clock skew and pipeline register delay) in the control of the pipeline.
WAR hazards:
1. J tries to write a destination before it is read by I, so I incorrectly gets the new value.
2. A WAR hazard is eliminated by register renaming of all the destination registers, including those with a pending read or write for an earlier instruction.
3. WAR hazard if D(I) ∩ R(J) ≠ φ.
4. For example:
   i1. R4 ← R1 + R3
   i2. R3 ← R1 + R2
If there is a chance that i2 may be completed before i1 (i.e., with concurrent execution), we must ensure that we do not store the result into register R3 before i1 has had a chance to fetch its operands.

RAW hazards:
1. J tries to read a source before I writes it, so J incorrectly gets the old value.
2. RAW hazards are avoided by executing an instruction only when its operands are available.
3. RAW hazard if R(I) ∩ D(J) ≠ φ.
4. For example:
   i1. R2 ← R1 + R3
   i2. R4 ← R2 + R3
The first instruction calculates a value to be saved in register R2, and the second is going to use this value to compute a result for register R4. However, in a pipeline, when we fetch the operands for the second operation, the results from the first will not yet have been saved, and hence we have a data dependency.
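The renaming idea from the WAR discussion can be sketched as follows (Python; the physical register names `P0`, `P1`, … and the `(dest, sources)` instruction encoding are made up for illustration):

```python
def rename_registers(instrs):
    """Minimal register-renaming sketch: give every write a fresh physical
    register and patch later reads. instrs: list of (dest, [sources])."""
    mapping = {}                              # architectural -> current physical name
    fresh = iter(f"P{i}" for i in range(100))
    renamed = []
    for dest, srcs in instrs:
        new_srcs = [mapping.get(s, s) for s in srcs]   # reads see the latest writer
        mapping[dest] = next(fresh)           # fresh name removes WAR/WAW on dest
        renamed.append((mapping[dest], new_srcs))
    return renamed

# i1: R4 <- R1 + R3 ; i2: R3 <- R1 + R2  (the WAR on R3 disappears after renaming)
print(rename_registers([("R4", ["R1", "R3"]), ("R3", ["R1", "R2"])]))
```

After renaming, i1 still reads the architectural R3 while i2 writes a fresh name, so the two can execute concurrently; true RAW dependences are preserved through the mapping.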
13. "Instruction throughput increases in proportion with the number of pipeline stages." Justify your statement. [WBUT …, 2017]
a 1 ivi int of
a in which a given task is dh ae ed by a giv
Answer: 0 the technique in whit btask is perform given
a jul
Pipelining refers ¢ to be performed in sequence. fam fastion and all of them operate
ab ie units ai ance compared to the
functional unit. f m tasks (instructions)
is . The use of pil
simultaneously. Th J n tasks (instruct
traditional sequential execution © t units. hen the
its) pipeline.
ing n-stages (units) pipe!
ut U(n) is, m/( ntme-l) xt
pelining improves
f tasks. Consider tl
We assume that dl
he execution 0}
he unit time 7
from the above equation, if we increase
ted per unit time. So, the throughput of the pipeline.
i no, of tasks execu . *
eee ipeline it also increase
the number of stages in a pil
14. For the code segment given below, explain how delayed branching can help:
I2: Dec R3, 1
I3: BrZero R3, 15
I4: Add R2, R4
I5: Sub R5, R6
I6: Store R5, B
[WBUT 2013]
Answer:
Instruction I2 performs "Dec R3, 1" and I3 performs "BrZero R3, 15", so both I2 and I3 access register R3 close together in the pipeline. With delayed branching, the decrement "Dec R3, 1" is first allowed to update R3, and only then does "BrZero R3, 15" test the updated value of R3.
15. For the following code, show how loop unrolling can help improve instruction-level parallelism (ILP) performance:
Loop: I1: Load R0, A(R1)   ; A is the starting address of the array;
                           ; R1 holds the initial address of the element
      I2: Add R0, R2       ; R0 ← R0 + R2, R2 is a scalar
      I3: Store R0, A(R1)
      I4: Add R1, -8       ; go to the next word in the array of doubles,
                           ; whose address is 8 bytes earlier
      I5: BNE Loop
[WBUT 2013]
Answer:
Pipelined processors overlap the execution of instructions when they are independent of one another. The potential overlap among instructions is called instruction-level parallelism, since the instructions can be evaluated in parallel. The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among the iterations of a loop. A loop is parallel unless there is a cycle in its dependencies, since the absence of a cycle means that the dependencies give a partial ordering on the statements. Here, each iteration uses the value of R1 assigned in the previous iteration (by I4), so there is a loop-carried dependency. Despite this dependency, the loop can be made parallel because the dependency is not circular.
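The effect can be imitated in a high-level language. A sketch (Python, standing in for the assembly above): `scale_add` mirrors the rolled loop, while the unrolled version performs four independent updates per iteration, reducing loop overhead and exposing independent operations the hardware can overlap:

```python
def scale_add(a, s):
    """Rolled loop: one element updated per iteration."""
    for i in range(len(a)):
        a[i] += s

def scale_add_unrolled(a, s):
    """The same loop unrolled by 4: fewer branch/index-update steps per
    element, and four independent updates per iteration."""
    n = len(a)
    i = 0
    while i + 4 <= n:
        a[i] += s
        a[i + 1] += s
        a[i + 2] += s
        a[i + 3] += s
        i += 4
    while i < n:            # clean-up loop for the leftover elements
        a[i] += s
        i += 1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
scale_add_unrolled(x, 10.0)
print(x)
```

In Python the gain is not observable, but the transformation is exactly what a compiler does to the assembly loop in the question.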
16. What is pipeline chaining? [WBUT 2013]
OR,
What do you mean by pipelined chaining? [WBUT 2005, 2010]
Answer:
Pipeline chaining is a linking process that occurs when results obtained from one pipeline unit are directly fed into the operand registers of another functional pipe. In other words, intermediate results do not have to be stored back into memory before the vector operation is completed. Chaining permits successive operations to be issued as soon as the first result becomes available as an operand. The desired functional pipes and operand registers must be properly reserved; otherwise, chaining operations have to be suspended until the demanded resources become available.
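The idea can be imitated with streaming generators (Python; the element-by-element streaming is only an analogy for vector-register chaining, and the vectors are made-up values):

```python
def vmul(a, b):
    """Multiply pipe: yields one product per cycle."""
    for x, y in zip(a, b):
        yield x * y

def vadd(s, c):
    """Add pipe: consumes each incoming element as soon as it is produced."""
    for x, y in zip(s, c):
        yield x + y

# Chaining: results of the multiply pipe stream straight into the add pipe,
# element by element, with no intermediate vector stored in memory.
out = list(vadd(vmul([1, 2, 3], [4, 5, 6]), [10, 20, 30]))
print(out)   # [14, 30, 48]
```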
17. Compare between Control-Flow, Data-Flow and Demand-Driven mechanisms.
[WBUT 2013]
Answer:
The Control-Flow Architecture is the Von Neumann, or control-flow, computing model. Here a program is a series of addressable instructions, each of which either specifies an operation along with the memory locations of its operands, or specifies a conditional transfer of control to some other instruction. The next instruction to be executed depends on what happened during the execution of the current instruction. The next instruction to be executed is pointed to and triggered by the PC. The instruction is executed even if some of its operands are not available yet.
In the Dataflow model, by contrast, execution is driven only by the availability of operands. There is no Program Counter and no global updateable store: the two features of the von Neumann model that become bottlenecks in exploiting parallelism are missing in the Dataflow architecture.
[Figure: a memory unit holds the instructions; enabled instructions enter an instruction queue and are consumed by the processing unit (processors), whose result tokens update the memory unit, enabling further instructions.]
Fig: A static Data flow Computer
‘A demand driven access mechanism comprises logic apparatus at each program capable
of seizing use of a shared communication channel for enabling access to a selected one
program. The logic apparatus receives status signals from all programs so that if one
program seeks access to the channel it will be enabled immediately upon an inactive
status to the channel. If two or more programs simultaneously seek access to the channel,
the logic apparatus establishes a priority order between them, thereby enabling access to
only one program at a time. The priority ordering is based, in part, by the identity of the
program last enabled, to thereby assure priority order allocation among the programs.
18. Draw the pipeline execution diagram during the execution of the following instructions:
MUL R1, R2, R3
ADD R2, R3, R4
INC R4
SUB R6, R3, R7
Find the delay in pipeline execution due to data dependency of the above instructions.
Answer:
We have considered that the above pipeline technique is a five-stage pipeline. The stages are fetch, decode, read, execution and write back. The diagram shows in which clock pulse each operation is performed.
The operations of the instructions are:
R1 ← R2 × R3
R2 ← R3 + R4
R4 ← R4 + 1
R6 ← R3 - R7

Fetch:       MUL   ADD   INC   SUB
Decode:            MUL   ADD   INC   SUB
Read:                    MUL   ADD   INC   SUB
Execution:                     MUL   ADD   INC   SUB
Write back:                          MUL   ADD   INC   SUB
49, How “Reservation Table” helps to study the performance of pipeline.
[WBUT 2016]
‘Answer:
‘There are two types of pipelines: static and dynamic. A static pipeline can perform only
one function at a time, whereas a dynamic pipeline can perform more than one function at
‘a time. A pipeline reservation table shows when stages of a pipeline are in use for a
particular function. Each stage of the pipeline is represented by a row in the reservation
table. Each row of the reservation table is in turn broken into columns, one per clock
cycle. The number of columns indicates the total number of time units required for the
pipeline to perform a particular function. To indicate that some stage S is in use at some
time ty, an X is placed at the intersection of the row and column in the table
corresponding to that stage and time. Figure 1 represents a reservation table for a static
pipeline with three stages. When scheduling a static pipeline, only collisions between
different input data for a particular function had to be avoided. With a dynamic pipeline,
it is possible for different input data requiring different functions to be present in the
pipeline at the same time. Therefore, collisions between these data must be considered as
well. As with the static pipeline, however, dynamic pipeline scheduling begins with the
compilation of a set of forbidden lists from function reservation tables. Next the collision
vectors are obtained, and finally the state diagram is drawn.
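The forbidden list and collision vector can be computed mechanically from a reservation table. The sketch below uses an assumed 3-stage, 4-column example table (S1 busy at times 1 and 4, S2 at time 2, S3 at time 3); the table values themselves are an illustration, not taken from this question.

```python
# Reservation table: one row per stage, one column per clock cycle;
# True means the stage is in use at that time step.
table = [
    [True,  False, False, True ],   # S1
    [False, True,  False, False],   # S2
    [False, False, True,  False],   # S3
]

def forbidden_latencies(table):
    """A latency d is forbidden if some stage is used at two times d apart."""
    forbidden = set()
    for row in table:
        used = [t for t, busy in enumerate(row) if busy]
        for i in range(len(used)):
            for j in range(i + 1, len(used)):
                forbidden.add(used[j] - used[i])
    return forbidden

def collision_vector(table):
    """Bit string (c_m ... c_1): c_d = 1 iff latency d is forbidden,
    where m = number of columns - 1 (largest possible forbidden latency)."""
    m = len(table[0]) - 1
    forb = forbidden_latencies(table)
    return "".join("1" if d in forb else "0" for d in range(m, 0, -1))

print(forbidden_latencies(table))  # {3}
print(collision_vector(table))     # 100
```

With this table only row S1 uses a stage twice, 3 cycles apart, so 3 is the only forbidden latency and the collision vector is (100).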
20. Consider the execution of a program of 15,000 instructions by a linear pipeline
processor. The clock rate of the pipeline is 25 MHz. The pipeline has five stages and one
instruction is issued per clock cycle. Neglect penalties due to branch instructions
and out-of-sequence execution:
(i) Calculate the speedup of program execution by pipeline as compared with that by a
non-pipelined processor.
(ii) What are the efficiency and throughput of the pipeline processor? [WBUT 2016]
Answer:
Information we get:
n = 15,000 instructions or tasks
f = 25 MHz
k = 5 stages

(i) The speedup S = nk / (k + (n - 1)) = (15,000 x 5) / (5 + (15,000 - 1))
  = 75,000 / 15,004 = 4.999

(ii) Efficiency E = S / k = 4.999 / 5 = 0.999
Throughput H = nf / (k + (n - 1)) = (15,000 x 25) / 15,004 = 375,000 / 15,004
  = 24.99 MIPS
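The three formulas above can be checked with a short calculation using the same numbers (n = 15,000, k = 5, f = 25 MHz):

```python
# Quick check of S = nk/(k+(n-1)), E = S/k, H = nf/(k+(n-1)).
n = 15_000          # instructions
k = 5               # pipeline stages
f = 25e6            # clock rate: 25 MHz

cycles = k + (n - 1)            # total cycles to finish n instructions
speedup = (n * k) / cycles      # vs. an equivalent non-pipelined processor
efficiency = speedup / k
throughput = n * f / cycles     # instructions per second

print(round(speedup, 3))            # 4.999
print(round(efficiency, 4))         # 0.9997
print(round(throughput / 1e6, 2))   # 24.99 (MIPS)
```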
21. Write down Amdahl’s law of parallel processing. [WBUT 2017)
Answer: :
Refer to Question No. 19 (d) of Long Answer Type Questions.
22. Differentiate between 3-address, 2-address, 1-address and O-address
instructions with suitable example. [WBUT 2019)
Answer:
To illustrate the influence of the number of addresses on computer programs, we will
evaluate the arithmetic statement X = (A + B) * (C + D). We will use the symbols ADD,
SUB, MUL, and DIV for the four arithmetic operations and MOV for the transfer-type
operation. We will assume that the operands are in memory addresses A, B, C, and D,
and the result must be stored in memory at address X.
Three-Address Instructions: Computers with three-address instruction formats can use
each address field to specify either a processor register or a memory operand. The
Program in assembly language that evaluates X = (A+B)*(C+ D) is shown below,
together with comments that explain the register transfer operation of each instruction.
ADD R1, A, B    R1 <- M[A] + M[B]
ADD R2, C, D    R2 <- M[C] + M[D]
MUL X, R1, R2   M[X] <- R1 * R2
It is assumed that the computer has two processor registers, R1 and R2. The symbol M
[A] denotes the operand at memory address symbolized by A.
Two-Address Instructions:
MOVRI1,A RI—M[A]
ADDRI,B RI-RI+M[B}
MOV R2,C R2<—M[C]
ADD R2, D R2-R2+M[D]
CA-20MUL RI, R2 RIG RIR2
MOVX, RI M[X]—RI
MOV instruction mi
eno Nn Moves or transfers the operands to and from memory and processor
One-Address Instructions: One-address instructions use an implied accumulator (AC)
register for all data manipulation. For multiplication and division there is a need for a
second register. However, here we will neglect the second register and assume that the
AC contains the result of all operations. The program to evaluate X = (A + B) * (C + D) is
LOAD A    AC <- M[A]
ADD B     AC <- AC + M[B]
STORE T   M[T] <- AC
LOAD C    AC <- M[C]
ADD D     AC <- AC + M[D]
MUL T     AC <- AC * M[T]
STORE X   M[X] <- AC
All operations are done between the AC register and a memory operand. T is the address
of a temporary memory location required for storing the intermediate result.
Zero-Address Instructions: A stack-organized computer does not use an address field for
the instructions ADD and MUL. The PUSH and POP instructions, however, need an
address field to specify the operand that communicates with the stack. The following
program shows how X = (A + B) * (C + D) will be written for a stack-organized
computer. (TOS stands for top of stack)
PUSH A    TOS <- A
PUSH B    TOS <- B
ADD       TOS <- (A + B)
PUSH C    TOS <- C
PUSH D    TOS <- D
ADD       TOS <- (C + D)
MUL       TOS <- (C + D) * (A + B)
POP X     M[X] <- TOS
To evaluate arithmetic expressions in a stack computer, it is necessary to convert the
expression into reverse Polish notation. The name "zero-address" is given to this type of
computer because of the absence of an address field in the computational instructions.
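The stack program above can be traced with a minimal sketch of a zero-address machine. PUSH and POP carry an address; ADD and MUL take their operands implicitly from the top of the stack. The memory values chosen for A, B, C, D here are assumptions for illustration.

```python
# A toy zero-address (stack) machine running the program above.
memory = {"A": 2, "B": 3, "C": 4, "D": 5}
stack = []

program = [
    ("PUSH", "A"), ("PUSH", "B"), ("ADD", None),
    ("PUSH", "C"), ("PUSH", "D"), ("ADD", None),
    ("MUL", None), ("POP", "X"),
]

for op, addr in program:
    if op == "PUSH":
        stack.append(memory[addr])       # TOS <- M[addr]
    elif op == "POP":
        memory[addr] = stack.pop()       # M[addr] <- TOS
    elif op == "ADD":
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)              # TOS <- a + b
    elif op == "MUL":
        b, a = stack.pop(), stack.pop()
        stack.append(a * b)              # TOS <- a * b

print(memory["X"])  # (2 + 3) * (4 + 5) = 45
```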
23. What is instruction cycle? Compare and contrast hardwired vs. micro-programmed
control unit. [WBUT 2019]
Answer:
1" Part:
Instructions are processed under the direction of the control unit in a step-by-step manner.
There are four fundamental steps in the instruction cycle:
1. Fetch the instruction: The next instruction is fetched from the memory address that is
currently stored in the Program Counter (PC), and stored in the Instruction Register (IR).
At the end of the fetch operation, the PC points to the next instruction that will be read at
the next cycle.
2. Decode the instruction: During this cycle the instruction inside the IR (instruction
register) gets decoded by the decoder.
3. Execute the instruction: The Control Unit of the CPU passes the decoded information
as a sequence of control signals to the relevant function units of the CPU to perform the
actions required by the instruction, such as reading values from registers, passing them to
the ALU to perform mathematical or logic functions on them, and writing the result back
to a register. If the ALU is involved, it sends a condition signal back to the CU.
4. Store result: The result generated by the operation is stored in the main memory, or
sent to an output device. Based on the condition of any feedback from the ALU, the
Program Counter may be updated to a different address from which the next instruction
will be fetched.
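The four-step instruction cycle can be sketched as a tiny simulator: each loop iteration fetches from memory at the PC into the IR, decodes, executes, and stores the result. The three-field instruction format and the opcodes used here are assumptions for clarity, not a real ISA.

```python
# A toy fetch-decode-execute loop illustrating the instruction cycle.
memory = {0: ("LOAD", "R1", 7), 1: ("ADD", "R1", 3), 2: ("HALT", None, None)}
regs = {"R1": 0}
pc = 0

while True:
    ir = memory[pc]                 # 1. fetch: instruction at PC goes into IR
    pc += 1                         #    PC now points at the next instruction
    opcode, dest, operand = ir      # 2. decode: split IR into its fields
    if opcode == "HALT":
        break
    elif opcode == "LOAD":          # 3. execute / 4. store result in a register
        regs[dest] = operand
    elif opcode == "ADD":
        regs[dest] = regs[dest] + operand

print(regs["R1"])  # 10
```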
2nd Part:
To execute an instruction, there are two types of control units: the hardwired control
unit and the micro-programmed control unit.
1. Hardwired control units are generally faster than micro-programmed designs. In
hardwired control, we saw how all the control signals required inside the CPU can be
generated using a state counter and a PLA circuit. A micro-programmed control unit
is a relatively simple logic circuit that is capable of (1) sequencing through. -
microinstructions and (2) generating control signals to execute each microinstruction.
2. Hardwired control unit generates the control signals needed for the processor using
logic circuits. Micro-programmed control unit generates the control signals with the
help of micro instructions stored in control memory.
3. Hardwired control unit is faster when compared to micro-programmed control unit, as
   the required control signals are generated with the help of hardware.
4. It is difficult to modify as the control signals that need to be generated are hard-wired.
   But in micro-programmed control, it is easy to modify as the modification needs to be
   done only at the instruction level.
5. Hardwired control is costlier as everything has to be realized in terms of logic gates;
   micro-programmed control is less costly as only micro-instructions are used for
   generating control signals.
6. The hardwired control cannot handle complex instructions as the circuit design for it
becomes complex.
7. In hardwired control unit only limited numbers of instructions are used due to the
hardware implementation. The control signals for many instructions can be generated
in Micro-programmed control unit.
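The micro-programmed idea from point 2 can be sketched in a few lines: control signals come from microinstructions stored in a control memory indexed by the opcode, rather than from hardwired logic. The opcode names and signal names below are invented for illustration.

```python
# Control memory: each macro-instruction maps to a sequence of
# microinstructions, each of which asserts some control signals.
control_memory = {
    "ADD":  [("RegRead",), ("ALU_ADD",), ("RegWrite",)],
    "LOAD": [("MemRead",), ("RegWrite",)],
}

def control_signals(opcode):
    """Sequence through the microinstructions for one macro-instruction."""
    signals = []
    for microinstruction in control_memory[opcode]:
        signals.extend(microinstruction)
    return signals

print(control_signals("ADD"))   # ['RegRead', 'ALU_ADD', 'RegWrite']
```

Modifying the control unit (point 4) then amounts to editing the table, whereas a hardwired design would require changing the logic circuit itself.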
Long Answer Type Questions
1. What is a pipeline?
Consider the following reservation table:

        1    2    3    4
S1      X              X
S2           X
S3                X

Write down the forbidden latencies and initial collision vector. Draw the state
diagram for scheduling the pipeline. Find out the simple and greedy cycles and the
MAL. If the pipeline clock rate is 25 MHz, then what is the throughput of the
pipeline? What are the bounds on MAL? [WBUT 2007, 2011]
Answer:
1st Part:
Pipelining is a technique of splitting a sequential process into suboperations, each being
executed in a dedicated segment that operates concurrently with all other segments. An
instruction pipeline is a technique used in the design of computers to increase their
instruction throughput (the number of instructions that can be executed in a unit of time).
Pipelining assumes that successive instructions in a program sequence will overlap in
execution.
A non-pipelined architecture is inefficient because some CPU components are idle while
another module is active during the instruction cycle. Pipelining does not completely
cancel out idle time in a CPU, but making those modules work in parallel improves
program execution significantly.
Processors with pipelining are organized inside into stages which can semi-independently
work on separate jobs. Each stage is organized and linked into a 'chain' so each stage's
output is inputted to another stage until the job is done. This organization of the processor
allows overall processing time to be significantly reduced.
Unfortunately, not all instructions are independent. In a simple pipeline, completing an
instruction may require 5 stages. To operate at full performance, this pipeline will need to
run 4 subsequent independent instructions while the first is completing. If 4 instructions
that do not depend on the output of the first instruction are not available, the pipeline
control logic must insert a stall or wasted clock cycle into the pipeline until the
dependency is resolved. Fortunately, techniques such as forwarding can significantly
reduce the cases where stalling is required. While pipelining can in theory increase
performance over an unpipelined core by a factor of the number of stages (assuming the
clock frequency also scales with the number of stages), in reality, most code does not
allow for ideal execution.
2nd Part:
Forbidden latencies are the latencies that cause collisions. Here the only forbidden
latency is 3.
The initial collision vector is (100).
The state diagram is given below:
[Figure: state diagram for collision vector (100)]
Simple cycles: (2), (4), (1, 4), (1, 1, 4) and (2, 4)
Greedy cycle: (2)
MAL = 2
Throughput = clock rate / MAL = 25 / 2 = 12.5 MIPS
Upper bound on MAL = 2
Lower bound on MAL = 2
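The cycles and the MAL above can be verified mechanically. Starting from the collision vector (100), a move of latency d shifts the state right by d and ORs the collision vector back in; a cycle is valid if no move ever uses a forbidden latency:

```python
M = 3               # largest forbidden latency in the reservation table
initial = 0b100     # collision vector (100): bit d-1 set <=> latency d forbidden

def next_state(state, latency):
    """Shift right by the chosen latency, then OR in the collision vector."""
    return (state >> latency) | initial

def is_valid_cycle(cycle, repeats=3):
    """Repeat the latency cycle and check no forbidden latency is ever used."""
    state = initial
    for d in list(cycle) * repeats:
        if d <= M and (state >> (d - 1)) & 1:
            return False                    # collision at this latency
        state = next_state(state, d)
    return True

cycles = [(2,), (4,), (1, 4), (1, 1, 4), (2, 4)]
assert all(is_valid_cycle(c) for c in cycles)
assert not is_valid_cycle((3,))             # latency 3 is forbidden

mal = min(sum(c) / len(c) for c in cycles)
print(mal)   # 2.0
```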
2. a) What do you mean by "Data flow Computer"?
b) With simple diagram, explain Data flow architecture and compare it with control
flow architecture. Draw data flow graphs to represent the following computations:
[WBUT 2008, 2014]
a) A data flow computer is a large, very powerful computer that has a number of processors
all physically wired together with a large amount of memory and backing storage. Such
machines can process many units of data at the same time. Data flow computers are used
to execute processor-intensive applications such as those associated with areas like
molecular biology and simulation. Numerical calculations for the simulation of natural
phenomena were conducted using a data-flow-type parallel processing computer. The
computing time in the data-flow computer was approximately three-to-five times shorter
than that of the usual medium-size computer of computing speed 3 MIPS (million
instructions per second). Dynamic visualization of the computing process was realized
using an image display directly connected to the memory of the data-flow computer.
b) Refer to Question No. 18 of Short Answer Type Questions.
Fig: Data Flow graph for the above computation
3. What is floating point arithmetic operation? Explain all (addition, difference,
multiplication, division) operations with example. [WBUT 2009]
Answer:
A floating-point (FP) number can be represented in the following form: +-m * b^e,
where m is the mantissa and represents the fraction part of the number (normally a
signed binary fraction), e represents the exponent, and b represents the base (radix) of
the exponent.

  31        30 ... 23         22 ... 0
  s       exponent (e)      mantissa (m)
Fig: Representation of floating point number
Floating-Point Arithmetic Addition/Subtraction:
The difficulty in adding two FP numbers stems from the fact that they may have different
exponents. Therefore, before adding two FP numbers, their exponents must be equalized,
that is, the mantissa of the number that has smaller magnitude of exponent must be
aligned.
Steps required to add/subtract two floating-point numbers:
1. Compare the magnitude of the two exponents and make suitable alignment to the
number with the smaller magnitude of exponent.
CA-25POPULAR PUBLICATIONS
2. Perform the addition/subtraction. : at :
3. Perform normalization by shifting the resulting mantissa and adjusting the resulting
exponent.
Example: Consider adding the two FP numbers
1.1100 * 2^4 and 1.1000 * 2^2.
1. Alignment: 1.1000 * 2^2 has to be aligned to 0.0110 * 2^4.
2. Addition: Add the two numbers to get 10.0010 * 2^4.
3. Normalization: The final normalized result is 1.0001 * 2^5 (assuming 4 bits are allowed
after the radix point).
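The three steps above can be checked with exact arithmetic. The sketch below uses Python fractions so the binary values stay exact; the `fp` helper is ours, interpreting a mantissa string such as '1.1100' in binary:

```python
from fractions import Fraction

def fp(mantissa_bits, exponent):
    """Value of a binary mantissa string like '1.1100' times 2^exponent."""
    whole, frac = mantissa_bits.split(".")
    value = Fraction(int(whole, 2))
    for i, bit in enumerate(frac, start=1):
        value += Fraction(int(bit), 2 ** i)
    return value * Fraction(2) ** exponent

x = fp("1.1100", 4)     # 1.75 * 16 = 28
y = fp("1.1000", 2)     # 1.5  *  4 = 6

# Steps 1-2 (align and add) are implicit with exact fractions;
# step 3 (normalize): find e such that 1 <= s / 2^e < 2.
s = x + y
e = 0
while s / Fraction(2) ** e >= 2:
    e += 1
print(s, e)   # 34 5  ->  34 = 1.0625 * 2^5, i.e. 1.0001 * 2^5 in binary
```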
Floating-Point Arithmetic Multiplication : :
A general algorithm for multiplication of FP numbers consists of three basic steps. These
are:
1, Compute the exponent of the product by adding the exponents together.
2. Multiply the two mantissas.
3. Normalize and round the final product.
Example: Consider multiplying the two FP numbers
X = 1.000 * 2^-2 and Y = -1.010 * 2^-1
1. Add exponents: -2 + (-1) = -3.
2. Multiply mantissas: 1.000 * -1.010 = -1.010000.
The product is -1.0100 * 2^-3.
Floating-Point Arithmetic Division
A general algorithm for division of FP numbers consists of three basic steps:
1, Compute the exponent of the result by subtracting the exponents.
2. Divide the mantissa and determine the sign of the result,
3. Normalize and round the resulting value, if necessary.
Consider the division of the two FP numbers
X = 1.0000 * 2^-2 and Y = -1.0100 * 2^-1
1. Subtract exponents: -2 - (-1) = -1.
2. Divide the mantissas: (1.0000) / (-1.0100) = -0.1101.
3. The result is -0.1101 * 2^-1.
4. Consider the four-stage pipelined processor specified by the following diagram:
[Figure: four-stage pipeline with input, stages S1-S4 and output]
The pipeline has a total evaluation time of six clock cycles. All successor stages
must be used after each clock cycle.
i) Specify the reservation table for the above pipelined processor with six columns
and four rows.
ii) What are the forbidden latencies and the initial collision vector? Draw the state
transition diagram.
iii) Determine all simple cycles, greedy cycle and MAL.
iv) Determine the throughput of this pipelined processor. Given clock period as
20 ns. [WBUT 2010]
Answer:
i)
[Reservation table with stages S1-S4 and six columns; X marks the cycles in which
each stage is in use]
ii) The forbidden latencies are (0, 4).
The pipeline collision vector is (100010).
State 1 : 100010
Reaches State 2 ( 100110 ) after 1 cycle.
Reaches State 3 ( 101010 ) after 2 cycles.
Reaches State 4 ( 110010 ) after 3 cycles.
Reaches State 1 ( 100010 ) after 5 cycles.
Reaches State 1 ( 100010 ) after 6 or more cycles.
State 2 : 100110
Reaches State 5 ( 101110 ) after 1 cycle.
Reaches State 6 ( 111010 ) after 2 cycles.
Reaches State 1 ( 100010 ) after 5 cycles.
Reaches State 1 ( 100010 ) after 6 or more cycles.
State 3 : 101010
Reaches State 7 ( 110110 ) after 1 cycle.
Reaches State 4 ( 110010 ) after 3 cycles.
Reaches State 1 ( 100010 ) after 5 cycles.
Reaches State 1 ( 100010 ) after 6 or more cycles.
State 4 : 110010
Reaches State 3 ( 101010 ) after 2 cycles.
Reaches State 4 ( 110010 ) after 3 cycles.
Reaches State 1 ( 100010 ) after 5 cycles.
Reaches State 1 ( 100010 ) after 6 or more cycles.
State 5 : 101110
Reaches State 8 ( 111110 ) after 1 cycle.
Reaches State 1 ( 100010 ) after 5 cycles.
Reaches State 1 ( 100010 ) after 6 or more cycles.
State 6 : 111010
Reaches State 4 ( 110010 ) after 3 cycles.
Reaches State 1 ( 100010 ) after 5 cycles.
Reaches State 1 ( 100010 ) after 6 or more cycles.
State 7 : 110110
Reaches State 6 ( 111010 ) after 2 cycles.
Reaches State 1 ( 100010 ) after 5 cycles.
Reaches State 1 ( 100010 ) after 6 or more cycles.
State 8 : 111110
Reaches State 1 ( 100010 ) after 5 cycles.
Reaches State 1 ( 100010 ) after 6 or more cycles.
There are 8 states in the state diagram.
iii) Greedy cycle: (1, 1, 1, 5)
MAL = 2
iv) The throughput of this pipelined processor = (1 / 20 ns) x (1 / MAL)
= 0.05 x 0.5 = 0.025 instructions per ns, i.e. 25 MIPS.
5. a) What do you mean by MMX? Differentiate a data flow computer from a control
flow computer.
b) List some potential problems with data flow computer implementation.
c) With simple diagram explain data flow architecture.
d) Draw data flow graphs to represent the following computations:
i) P = X + Y
ii) Q = P + Y
iii) R = X * P
iv) S = R - Q
v) T = R * P
vi) U = S + T [WBUT 2011, 2013, 2014]
Answer:
a) MMX technology is an extension to the Intel Architecture (IA) designed to improve
Performance of multimedia and communication algorithms. The Pentium processor with
MMX Technology is the first microprocessor to implement the new instruction set. All
MMX chips have a larger internal L1 cache than their non-MMX counterparts. The
Pentium processor's MMX implementation included the design of a new, dedicated,
high-performance MMX pipeline, which was able to execute two MMX instructions with
minimal logic changes in the existing units. Although adding a pipeline stage improves
frequency, it decreases CPI performance, i.e., the longer the pipeline, the more work done
speculatively by the machine and therefore the more work thrown away in the case of
branch mis-prediction.
The control flow computers are either uniprocessor or parallel processor architectures. In
a uniprocessor system the instructions are executed sequentially and this is called a
control-driven mechanism. In a parallel processor system control flow computers use
shared memory. So, parallel executed instructions may cause side effects on other
instructions and data due to shared memory. In a control flow computer the sequence of
execution of instructions is controlled by the program counter register.
The data flow computers are based on a data-driven mechanism. The fundamental
difference is that instruction execution in a conventional computer is under program-flow
control, whereas that in a data flow computer is driven by the data (operand) availability.
b) The data-driven mechanism does not require any shared memory, program counter or
control sequencer. It just checks the data availability of needy instructions and enables a
chain reaction of asynchronous instruction execution.
c) Suppose there are some instructions given below. Now we execute these instructions
by a data flow machine.
Input: d, e, f
c(0) = 0
for i = 1 to 8 do
    a(i) = d(i) / e(i)
    b(i) = a(i) * f(i)
    c(i) = b(i) + c(i-1)
end
Output: a, b, c
In the above example, there are three instructions in the "for loop" and the "for loop" is
executed 8 times. So, in total, 24 instructions will be executed. Suppose each add,
multiply and divide operation requires 1, 2 and 3 clock cycles respectively to complete.
The data flow graph for the above instructions is shown in the figure below.
[Figure: data flow graph for the instructions a1-a8, b1-b8, c1-c8]
Fig: data flow graph
The above instructions can be executed within 14 clock cycles as shown in the figure
below.
[Figure: schedule of the instructions a1-a8, b1-b8, c1-c8 over clock cycles 1-14]
Fig: Data driven execution on a 4-processor dataflow computer in 14 cycles
The dataflow multiprocessor completes the execution in 14 cycles. If all the external
inputs are available, instructions a1, a2, a3 and a4 are all ready for execution in the first
three cycles; the results produced then trigger the execution of a5, b1 and the other ready
instructions starting from cycle 4. The data-driven chain reactions are shown in the figure
above. The output c8 is the last one to be produced, due to its dependence on all the
previous ci's.
The theoretical minimum time is 13 cycles along the critical path a1 b1 c1 c2 ... c8. The
chain reaction control in dataflow is more difficult to implement and may result in longer
overhead, as compared with the uniform operations performed by all processors.
d) [Figure: data flow graphs for computations (i)-(vi)]
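The six computations in part (d) can be traced with a small data-driven sketch: each node fires as soon as its input operands are available, which is exactly the firing rule the data flow graph expresses. The initial values chosen for X and Y are assumptions for illustration.

```python
values = {"X": 3, "Y": 2}
ops = {                                   # node -> (function, operand names)
    "P": (lambda x, y: x + y, ("X", "Y")),
    "Q": (lambda p, y: p + y, ("P", "Y")),
    "R": (lambda x, p: x * p, ("X", "P")),
    "S": (lambda r, q: r - q, ("R", "Q")),
    "T": (lambda r, p: r * p, ("R", "P")),
    "U": (lambda s, t: s + t, ("S", "T")),
}

fired = []
while len(values) < len(ops) + 2:         # +2 for the external inputs X, Y
    for node, (fn, args) in ops.items():
        # data-driven firing rule: execute when all operands are present
        if node not in values and all(a in values for a in args):
            values[node] = fn(*(values[a] for a in args))
            fired.append(node)

print(fired)         # ['P', 'Q', 'R', 'S', 'T', 'U']
print(values["U"])   # 83 for X = 3, Y = 2
```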
6. a) What is the difference between Computer Organization and Computer
Architecture? [WBUT 2012]
b) Explain how the CPU-time of a program can be expressed. [WBUT 2012]
c) A 30% enhancement in ...
d) What are the different approaches to handle branch instructions? Briefly
illustrate any one. [WBUT 2012]
What is branch hazard? Briefly discuss two methods to handle branch hazard.
[WBUT 2014, 2017]
Answer:
a) Difference between Computer Organization and Computer Architecture:
Computer organization
- Deals with all physical components of computer systems that interact with each
  other to perform various functionalities.
- The lower level of computer organization is known as microarchitecture, which is
  more detailed and concrete.
- Examples of organizational attributes include hardware details transparent to
  the programmer, such as control signals and peripherals.
Computer architecture
- Refers to a set of attributes of a system as seen by the programmer.
- Examples of architectural attributes include the instruction set, the number of bits
  used to represent the data types, input/output mechanisms and techniques for
  addressing memory.
b) Performance analysis should help answering questions such as: how fast can a
program be executed using a given computer? In order to answer such a question, we
need to determine the time taken by a computer to execute a given job. We define the
clock cycle time as the time between two consecutive rising (trailing) edges of a periodic
clock signal. Clock cycles allow counting unit computations, because the storage of
computation results is synchronized with rising (trailing) clock edges. The time required
to execute a job by a computer is often expressed in terms of clock cycles. We denote the
number of CPU clock cycles for executing a job to be the cycle count (CC), the cycle
time by CT, and the clock frequency by f = 1/CT. The time taken by the CPU to execute
a job can be expressed as CPU time = CC x CT = CC / f.
It may be easier to count the number of instructions executed in a given program as
compared to counting the number of CPU clock cycles needed for executing that
program. Therefore, the average number of clock cycles per instruction (CPI) has been
used as an alternate performance measure. The following equations show how to
compute the CPI:
CPI = CPU clock cycles for the program / Instruction count
CPU time = Instruction count x CPI x Clock cycle time
c) Say the enhancement must be used for a fraction x of the time to achieve the overall
speedup.
So, x * (30/50) = 10 => x = 16.6
d) One of the major problems in designing an instruction pipeline is assuring a steady
flow of instructions to the initial stages of the pipeline. However, 15-20% of the
instructions in an assembly-level stream are (conditional) branches. Of these, 60-70%
take the branch to a target address. Until the instruction is actually executed, it is
impossible to determine whether the branch will be taken or not.
A number of techniques can be used to minimize the impact of the branch instruction
(the branch penalty).
branch penalty).
• Multiple streams:
— Replicate the initial portions of the pipeline and fetch both possible next instructions
— Increases chance of memory contention
— Must support multiple streams for each instruction in the pipeline
© Prefetch branch target
~ When the branch instruction is decoded, begin to fetch the branch target instruction
and place in a second prefetch buffer
— If the branch is not taken, the sequential instructions are already in the pipe, so there
is no loss of performance
— If the branch is taken, the next instruction has been prefetched and results in minimal
branch penalty (we don't have to incur a memory read operation at the end of the branch
to fetch the instruction)
• Loop buffer: look-ahead, look-behind buffer
— Many conditional branch operations are used for loop control
— Expand the prefetch buffer so as to buffer the last few instructions executed in addition
to the ones that are waiting to be executed
— If the buffer is big enough, the entire loop can be held in it; this can reduce the branch
penalty.
© Branch prediction '
= Make a good guess as to which instruction will be executed next and start that one
down the pipeline.
— Static guesses: make the guess without considering the runtime history of the
program: branch never taken, branch always taken, predict based on the opcode
~ Dynamic guesses: track the history of conditional branches in the program.
• Delayed branch
— Minimize the branch penalty by finding valid instructions to execute in the pipeline
while the branch address is being resolved.
— It is possible to improve performance by automatically rearranging instructions within
a program so that branch instructions occur later than actually desired.
— The compiler is tasked with analyzing the instruction sequence to find enough
independent instructions to feed into the pipeline after the branch so that the branch
penalty is hidden.
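The "dynamic guesses" technique above is commonly realized as a 2-bit saturating counter per branch: the counter tracks recent outcomes and the predictor guesses taken when the counter is in its upper half. The outcome trace below is made up for illustration.

```python
# 2-bit saturating counter branch predictor (a sketch).
counter = 0                    # 0,1 -> predict not taken; 2,3 -> predict taken

def predict():
    return counter >= 2

def update(taken):
    """Saturating increment on taken, decrement on not taken."""
    global counter
    counter = min(counter + 1, 3) if taken else max(counter - 1, 0)

outcomes = [True, True, True, False, True, True]   # e.g. a loop branch
correct = 0
for taken in outcomes:
    if predict() == taken:
        correct += 1
    update(taken)

print(correct, "of", len(outcomes), "predicted correctly")   # 3 of 6
```

After warming up, the counter stays near its taken end, so a single not-taken outcome (the loop exit) does not flip the prediction for the next run of the loop.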
7. a) What are the major hurdles to achieving the ideal speedup of a pipeline?
b) Discuss data hazard briefly.
c) Discuss briefly two approaches to handle branch hazards.
d) Consider a 4-stage pipeline that consists of Instruction Fetch (IF), Instruction
Decode (ID), Execute (Ex) and Write Back (WB) stages. The times taken by these
stages are 50 ns, 60 ns, 110 ns and 80 ns respectively. The pipeline registers are
Fequired after every pipeline stage, and each of these pipeline register consumes
40 ns delay. What is the speedup of the pipeline under ideal conditions compare to
the corresponding non-pipelined implementation? [WBUT 2012]
Answer:
a) We define the speedup of a k-stage linear pipeline processor over an equivalent
non-pipelined processor as
S = nk / (k + (n - 1))
It should be noted that the maximum speedup is S -> k, for n >> k. In other words, the
maximum speedup that a linear pipeline can provide us is k, where k is the number of
stages in the pipe. The maximum speedup is never fully achievable because of data
dependencies between instructions, interrupts, and other factors.
b) A data hazard is created whenever there is a dependence between instructions, and
they are close enough that the overlap caused by pipelining, or other reordering of
instructions, would change the order of access to the operand involved in the dependence.
Because of the dependence, we must preserve what is called program order, that is, the
order in which the instructions would execute if executed sequentially one at a time as
determined by the original source program. Three types of data hazards can occur:
• Read After Write (RAW) hazards:
RAW data hazard is the most common type. It appears when the next instruction tries to
read from a source before the previous instruction writes to it. So the next instruction
gets the incorrect old value, such as when an operand is modified and read soon after.
Because the first instruction may not have finished writing to the operand, the second
instruction may use incorrect data.
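Detecting the RAW case described above amounts to checking whether an instruction reads a register that a still-in-flight earlier instruction writes. The instruction encoding as (destination, sources) tuples and the overlap `window` parameter are assumptions for this sketch:

```python
# Illustrative instruction list: (destination register, source registers).
instrs = [
    ("R1", ("R2", "R3")),   # MUL R1, R2, R3
    ("R2", ("R1", "R4")),   # ADD R2, R1, R4  -> reads R1 (RAW on MUL)
    ("R5", ("R2", "R6")),   # SUB R5, R2, R6  -> reads R2 (RAW on ADD)
]

def raw_hazards(instrs, window=2):
    """Report (i, j) when instruction j reads what instruction i wrote,
    within the assumed pipeline overlap window."""
    hazards = []
    for j, (_, sources) in enumerate(instrs):
        for i in range(max(0, j - window), j):
            if instrs[i][0] in sources:
                hazards.append((i, j))
    return hazards

print(raw_hazards(instrs))   # [(0, 1), (1, 2)]
```

Each reported pair is a point where the pipeline must stall or forward to preserve program order.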
• Write After Read (WAR) hazards:
WAR hazard appears when the next instruction writes to a destination before the previous
instruction reads it. In this case, the previous instruction incorrectly gets a new value,
such as when an operand is read and soon after that same operand is written. Because the
write may