Computer Architecture Unit III
3.1.1 Operation
The program counter gives the instruction address to the instruction memory.
After the instruction is fetched, the register operands required by an instruction are specified
by fields of that instruction.
Once the register operands have been fetched, they can be used to compute a memory address (for a load or store), to compute an arithmetic result (for an integer arithmetic-logical instruction), or to perform a comparison (for a branch).
If the instruction is an arithmetic-logical instruction, the result from the ALU must be
written to a register.
If the operation is a load or store, the ALU result is used as an address to either store a
value from the registers or load a value from memory into the registers. The result from the
ALU or memory is written back into the register file.
Branches require the use of the ALU output to determine the next instruction address,
which comes either from the ALU (where the PC and branch offset are summed) or from an
adder that increments the current PC by 4.
Fig. 3.1 shows that the data going to a particular unit can come from two different
sources. For example, the value written into the PC can come from one of two adders, the
data written into the register file can come from either the ALU or the data memory, and the
second input to the ALU can come from a register or the immediate field of the instruction.
The selection of the appropriate source is done using a multiplexer (data selector). The multiplexer
selects from among several inputs based on the setting of its control lines. The control lines
are set based primarily on information taken from the instruction being executed. This is
illustrated in Fig. 3.2.
Fig. 3.2 also shows the control unit, which takes the instruction as an input and determines the control signals for the functional units and two of the multiplexers.
The third multiplexer, which determines whether PC+4 or the branch destination
address is written into the PC, is set based on the Zero output of the ALU, which is used to
perform the comparison of a beq instruction.
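As an illustration of how this third multiplexer behaves, the following Python sketch (function and signal names are ours, not the textbook's) selects the next PC value from the Branch control signal and the ALU Zero output:

def next_pc(pc, branch_target, branch, zero):
    # PCSrc mux: choose the branch target only when the instruction is a
    # branch (Branch = 1) and the ALU comparison produced zero (Zero = 1).
    pc_plus_4 = pc + 4
    if branch and zero:
        return branch_target
    return pc_plus_4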
3.2. Building a Data Path
As shown in Fig. 3.4, the MIPS implementation includes the datapath elements (units used to operate on or hold data within the processor), such as the instruction and data memories, the register file, the ALU, and adders.
Fig. 3.3 shows the combination of the three elements (instruction memory, program
counter and adder) from Fig. 3.4 to form a datapath that fetches instructions and increments
the PC to obtain the address of the next sequential instruction.
The instruction memory stores the instructions of a program and outputs the instruction corresponding to the address supplied by the program counter. The adder increments the PC by 4 to form the address of the next instruction.
Since the instruction memory is only read, the output at any time reflects the contents of the location specified by the address input, and no read control signal is needed.
The program counter is a 32-bit register that is written at the end of every clock cycle and thus does not need a write control signal.
The adder always adds its two 32-bit inputs and places the sum on its output.
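A minimal Python sketch of this fetch segment, assuming a dictionary keyed by byte address stands in for the instruction memory, is shown below:

def fetch(pc, instruction_memory):
    # The instruction memory needs no read control signal: it simply outputs
    # the instruction stored at the address on its input.
    instruction = instruction_memory[pc]
    # The dedicated adder always produces PC + 4, the next sequential address.
    next_sequential_pc = pc + 4
    return instruction, next_sequential_pc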
3.2.1. Datapath Segment for Arithmetic - Logic Instructions
The arithmetic-logic instructions read operands from two registers, perform an ALU
operation on the contents of the registers, and write the result to a register. We call these
instructions R-type instructions. This instruction class includes add, sub, AND, OR, and
slt. For example, OR $t1, $t2, $t3 reads $t2 and $t3, performs a logical OR operation, and saves
the result in $t1.
The processor's 32 general-purpose registers are stored in a structure called a register
file. A register file is a collection of registers in which any register can be read or written by
specifying the number of the register in the file. The register file contains the register state of
the computer.
Fig. 3.4 shows the multiport register file (two read ports and one write port) and the ALU. We know that the R-format instructions have three register operands: two source operands and one destination operand.
For each data word to be read from the register file, we need to specify the register
number to the register file. On the other hand, to write a data word, we need two inputs: One
to specify the register number to be written and one to supply the data to be written into the
register.
The register file always outputs the contents of whatever register numbers are on the Read register inputs. Write operations, however, are controlled by the write control (RegWrite) signal, which must be asserted for a write to occur at the clock edge.
Since writes to the register file are edge-triggered, it is possible to perform read and
write operation for the same register within a clock cycle: The read operation gives the value
written in an earlier clock cycle, while the value written will be available to a read in a
subsequent clock cycle.
As shown in Fig. 3.4, the register number inputs are 5 bits wide to specify one of 32
registers, whereas the data input and two data output buses are each 32 bits wide.
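The behaviour described above can be sketched in Python as follows (an illustrative model, not the hardware itself); note that reads are combinational while the write takes effect only at the clock edge when RegWrite is asserted:

class RegisterFile:
    def __init__(self):
        self.regs = [0] * 32                      # 32 general-purpose 32-bit registers

    def read(self, read_reg1, read_reg2):
        # Two read ports: the outputs always follow the 5-bit register numbers.
        return self.regs[read_reg1], self.regs[read_reg2]

    def clock_edge(self, write_reg, write_data, reg_write):
        # One write port: the register is updated only when RegWrite = 1.
        if reg_write and write_reg != 0:          # $zero is hard-wired to 0 in MIPS
            self.regs[write_reg] = write_data & 0xFFFFFFFF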
3.2.2. Datapath Segment for Load Word and Store Word Instructions
Now, consider the MIPS load word and store word instructions, which have the
general form lw $t1, offset_value($t2) or sw $t1, offset_value ($t2).
In these instructions $t1 is a data register and $t2 is a base register. The memory address is computed by adding the contents of the base register ($t2) to the 16-bit signed offset value specified in the instruction.
In case of store instruction, the value from the data register ($t1) must be read and in
case of load instruction, the value read from memory must be written into the data register
($t1). Thus, we will need both the register file and the ALU from Fig. 3.4.
We know that the offset value is 16 bits and the base register contents are 32 bits. Thus, we need a sign-extend unit to convert the 16-bit offset field in the instruction to a 32-bit signed value so that it can be added to the base register.
In addition to sign extend unit, we need a data memory unit to read from or write to.
The data memory has read and write control signals to control the read and write operations.
It also has an address input, and an input for the data to be written into memory. Fig. 3.5
shows these two elements.
Sign extension is implemented by replicating the high-order sign bit of the original
data item in the high-order bits of the larger, destination data item.
Therefore, the two units needed to implement loads and stores, in addition to the register file and ALU of Fig. 3.4, are the data memory unit and the sign-extension unit.
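A small Python sketch of the sign-extension unit and of the effective-address calculation it feeds (function names are illustrative) is given below:

def sign_extend_16_to_32(offset16):
    # Replicate the high-order (sign) bit of the 16-bit field
    # into the upper 16 bits of the 32-bit result.
    offset16 &= 0xFFFF
    if offset16 & 0x8000:
        return offset16 | 0xFFFF0000
    return offset16

def lw_sw_address(base_reg_value, offset16):
    # Effective address for lw/sw: base register plus sign-extended offset.
    return (base_reg_value + sign_extend_16_to_32(offset16)) & 0xFFFFFFFF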
3.2.3. Datapath Segment for Branch Instruction
The beq instruction has three operands: two registers that are compared for equality, and a 16-bit offset that is used to compute the branch target address relative to the branch instruction address. It has the general form beq $t1, $t2, offset.
To implement this instruction, it is necessary to compute the branch target address by
adding the sign-extended offset field of the instruction to the PC. The two important things in
the definition of branch instructions which need careful attention are:
The instruction set architecture specifies that the base for the branch address calculation is the address of the instruction following the branch (i.e., PC + 4, the address of the next instruction).
The architecture also states that the offset field is shifted left 2 bits so that it is a word
offset; this shift increases the effective range of the offset field by a factor of 4.
Therefore, the branch target address is given by
Branch target address = (PC + 4) + (sign-extended offset shifted left by 2 bits)
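Using the sign-extension sketch from the previous subsection, this calculation can be written out as follows (illustrative Python, with a small worked example in the comment):

def branch_target(pc, offset16):
    # Sign-extend the 16-bit offset, shift it left by 2 (word offset),
    # and add it to PC + 4.
    offset = sign_extend_16_to_32(offset16)
    return (pc + 4 + (offset << 2)) & 0xFFFFFFFF

# Example: a beq at 0x00400000 with offset 3 targets 0x00400004 + 12 = 0x00400010.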
In addition to computing the branch target address, we must also see whether the two
operands are equal or not. If two operands are not equal the next instruction is the instruction
that follows sequentially (PC= PC+4); in this case, we say that the branch is not taken. On the
other hand, if two operands are equal (i.e., condition is true), the branch target address
becomes the new PC, and we say that the branch is taken.
Thus, the branch datapath must perform two operations: compute the branch target address and compare the register contents.
Fig. 3.7 shows the structure of the datapath segment that handles branches.
To compute the branch target address, the branch datapath includes a sign extension
unit, shifter and an adder.
To perform the compare, we need to use the register file and the ALU shown in Fig.
3.4.
Since the ALU provides a Zero signal that indicates whether the result is 0, we can send the two register operands to the ALU with the control set to perform a subtract operation. If the Zero signal is asserted, we know that the two values are equal.
For the jump instruction, the lower 28 bits of the PC are replaced by the lower 26 bits of the instruction shifted left by 2 bits, which makes the two least significant bits 0. This can be implemented by simply concatenating 00 to the low-order end of the jump target field.
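A hedged Python sketch of this jump-address formation (assuming, as in MIPS, that the upper 4 bits of the new PC are taken from PC + 4) is:

def jump_target(pc, instruction):
    target26 = instruction & 0x03FFFFFF        # 26-bit target field of the j instruction
    # Shift left by 2 (append 00) and keep the upper 4 bits of PC + 4.
    return ((pc + 4) & 0xF0000000) | (target26 << 2)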
In the MIPS instruction set, branches are delayed, meaning that the instruction
immediately following the branch is always executed, independent of whether the branch
condition is true or false. When the condition is false, the execution looks like a normal
branch. When the condition is true, a delayed branch first executes the instruction
immediately following the branch in sequential instruction order before jumping to the
specified branch target address.
3.2.4. Creating a Single Datapath
We can combine the datapath components needed for the individual instruction
classes into a single datapath and add the control to complete the implementation.
This simplest datapath will attempt to execute all instructions in one clock cycle. This
means that no datapath resource can be used more than once per instruction, so any element
needed more than once must be duplicated. We therefore need a memory for instructions
separate from one for data. Although some of the functional units will need to be duplicated, many of the elements can be shared by different instruction flows.
To share a datapath element between two different instruction classes, we allow multiple connections to the input of an element and use a multiplexer and a control signal to select among the multiple inputs.
Example 3.1 Show how to build a datapath for the operational portion of the memory-
reference and arithmetic-logical instructions that uses a single register file and a single ALU
to handle both types of instructions, adding any necessary multiplexers.
Solution : To create a datapath with only a single register file and a single ALU, we must
support two different sources for the second ALU input, as well as two different sources for
the data stored into the register file. Thus, one multiplexer is placed at the ALU input and
another at the data input to the register file. Fig. 3.8 shows the operational portion of the
combined datapath.
We can build a simple datapath for the core MIPS architecture by combining the datapath for instruction fetch, the datapath for R-type and memory instructions, and the datapath for branches, as shown in Fig. 3.9.
3.3. Designing a Control Unit
Here, we restrict ourselves to implementing load word (lw), store word (sw), branch equal (beq), and the arithmetic-logical instructions add, sub, AND, OR, and set on less than. We will also extend the design to include a jump instruction (j).
3.3.1. The ALU Control
The MIPS ALU defines the following six combinations of the four control inputs, shown in Table 3.1:
Table 3.1 Combination of Four control inputs
Depending on the instruction class, the ALU will need to perform one of the first five functions. (The NOR function is needed for other parts of the MIPS instruction set; it is not included in the subset we are implementing.)
In case of load word and store word instructions, we use the ALU to compute the
memory address by addition.
In case of the R-type instructions, the ALU needs to perform one of the five actions -
AND, OR, subtract, add, or set on less than.
In case of branch equal, the ALU must perform a subtraction.
We can control the operation of the ALU with a 4-bit ALU control input, which is derived from a 2-bit ALUOp field. The 2-bit ALUOp field is interpreted as shown in Table 3.2.
Table 3.2 ALUOp field
Table 3.3 shows how to set the ALU control inputs based on the 2-bit ALUOp control and the 6-bit function code.
Once the truth table has been constructed, it can be optimized and implemented using logic gates.
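Although Table 3.3 is ultimately realised in gates, its content can be sketched as a small lookup function in Python. The encodings below (0010 add, 0110 subtract, 0000 AND, 0001 OR, 0111 set on less than) follow the standard MIPS subset and should be read as illustrative:

def alu_control(alu_op, funct):
    if alu_op == 0b00:                # lw/sw: address calculation
        return 0b0010                 # add
    if alu_op == 0b01:                # beq: comparison
        return 0b0110                 # subtract
    # alu_op == 0b10: R-type instruction, so decode the funct field
    return {0b100000: 0b0010,         # add
            0b100010: 0b0110,         # sub
            0b100100: 0b0000,         # AND
            0b100101: 0b0001,         # OR
            0b101010: 0b0111}[funct]  # slt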
3.4. Designing the Main Control Unit
Before looking at the rest of the control design, it is useful to review the formats
of the three instruction classes: The R-type, branch and load-store instructions. Fig. 3.10
shows these formats.
Table 3.6
For all R-format instructions (add, sub, AND, OR, and slt) the source register fields are rs and
rt, and the destination register field is rd; this defines how the signals ALUSrc and RegDst
are set (See first row of Table 3.6).
An R-type instruction also writes a register (RegWrite = 1), but it neither reads nor writes data memory.
For all R-format instructions, the PC should be unconditionally replaced with PC + 4; thus the Branch control signal is 0. (When Branch is 1, the PC is replaced by the branch target only if the Zero output of the ALU is also high.)
The ALUOp field for R-type instructions is set to 10 to indicate that the ALU control should be generated from the funct field.
The second and third rows of Table 3.6 give the control signal settings for lw and sw. The ALUSrc and ALUOp fields are set to perform the address calculation.
MemRead or MemWrite is set to perform the memory access. Finally, RegDst and RegWrite are set for a load to cause the result to be stored into the rt register.
The branch instruction sends the rs and rt registers to the ALU. The ALUOp field for branch is set for a subtract (ALU control 01), which is used to test for equality.
It is important to note that the MemtoReg field is irrelevant when the RegWrite signal is 0: since the register is not being written, the value of the data on the register write data port is not used. Thus, the MemtoReg entry in the last two rows of Table 3.6 is replaced with X for don't care. Don't cares can also be added to RegDst when RegWrite is 0.
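The main control settings just described can be summarised as a lookup keyed by the opcode. The values below are the standard single-cycle MIPS settings, given only as a sketch to be checked against Table 3.6 (None stands for a don't-care X):

# (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp)
MAIN_CONTROL = {
    0b000000: (1, 0, 0, 1, 0, 0, 0, 0b10),        # R-format
    0b100011: (0, 1, 1, 1, 1, 0, 0, 0b00),        # lw
    0b101011: (None, 1, None, 0, 0, 1, 0, 0b00),  # sw
    0b000100: (None, 0, None, 0, 0, 0, 1, 0b01),  # beq
}

def main_control(opcode):
    # Decode the 6-bit opcode into the control signal settings.
    return MAIN_CONTROL[opcode]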
3.4.2. Operation of the Datapath
3.4.2.1 Datapath for an R-type Instruction
Fig. 3.13 shows the operation of the datapath for an R-type instruction. As an example we have considered the instruction add $t1, $t2, $t3. The asserted control signals and active datapath elements are highlighted. Although everything occurs in one clock cycle, the execution of the instruction can be divided into sequential steps according to the flow of information, as given below:
1. Fetch the instruction and increment the PC.
2. Read data from the two registers, $t2 and $t3, and compute the settings of the control lines using the main control unit.
3. Generate the ALU function using the function code for the addition operation (bits 5:0 of the instruction, the funct field) and perform the addition on the data read from the register file.
4. Write the result from the ALU into the register file, using bits 15:11 of the instruction to select the destination register ($t1).
Note The control lines, datapath units, and connections that are active are highlighted.
Fig 3.13 Datapath in operation for an R-type instruction (Example: add $t1, $t2, $t3)
3.4.2.2 Datapath for a Load Instruction
Fig. 3.14 shows the operation of the datapath for the load word instruction. As an example we have considered the instruction lw $t1, offset($t2). The execution of the load instruction can be divided into sequential steps according to the flow of information, as given below:
1. Fetch the instruction and increment the PC.
2. Read data from register $t2 of register file.
3. Compute the sum of the value read from the register file and the sign-extended lower 16 bits of the instruction (the offset) using the ALU.
4. Use the sum from the ALU as the address for the data memory.
5. Write the data from the memory unit into the register file; the destination register is given by bits 20:16 of the instruction ($t1).
Note The control lines, datapath units, and connections that are active are highlighted.
Fig 3.14 Datapath in operation for a load instruction
A store instruction operates very similarly to a load. The main differences are that the memory control signals indicate a write rather than a read, the value of the second register is used as the data to store, and the operation of writing the data memory value into the register file is not needed.
3.4.3 Datapath for Branch-On-Equal Instruction
Fig. 3.15 shows the operation of the datapath for the branch-on-equal instruction. As an example we have considered the instruction beq $t1, $t2, offset. It operates much like an R-format instruction, but the ALU output is used to determine whether the PC is written with PC + 4 or the branch target address. The execution of the branch-on-equal instruction can be divided into sequential steps according to the flow of information, as given below:
1. Fetch the instruction and increment the PC.
2. Read data from the two registers, $t1 and $t2.
3. Set the ALU control for a subtract operation (ALUOp = 01, since beq has no funct field) and perform subtraction on the data read from the register file. At the same time, add the value of PC + 4 to the sign-extended lower 16 bits of the instruction (the offset) shifted left by two to get the branch target address.
4. Use the Zero result from the ALU to decide which adder result to store into the PC.
Fig 3.15 Datapath in operation for a branch-on-equal instruction
Note The control lines, datapath units and connections that are active are highlighted.
3.4.4. Finalizing Control
The control function can be precisely defined using the contents of Table 3.7. The outputs are the control lines, and the input is the 6-bit opcode field, Op[5:0], i.e., bits 31:26 of the instruction. Thus, we can create a truth table for each of the outputs based on the binary encoding of the opcodes, as shown in Table 3.7.
Table 3.7. Truth table for output control lines and ALUOp
Example 3.7.1 Convert the following code segment in C to MIPS instructions, assuming all
variables are in memory and are addressable as offsets from $t0:
a = b + e;
c = b+f;
Solution : (Assume b, e, f, a, and c are stored at offsets 0, 4, 8, 12, and 16 from $t0, respectively.)
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1,$t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1,$t4
sw $t5, 16($t0)
Example 3.7.2 Find the hazards in the code segment of the previous example and reorder the
instructions to avoid any pipeline stalls.
Solution: Both add instructions have a hazard because of their respective dependence on the
immediately preceding lw instruction. It is important to note that bypassing eliminates several
other potential hazards, including the dependence of the first add on the first lw and any
hazards for store instructions. Moving up the third lw instruction to become the third
instruction eliminates both hazards:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1,$t2
sw $t3, 12($t0)
add $t5, $t1,$t4
sw $t5, 16($t0)
Side Effects
When the destination register of one instruction is a source register of a subsequent instruction, there is a data dependency. Such a data dependency is explicit and is identified by register name. However, there are some instructions that change the contents of a register other than the one named as the destination; for example, instructions (such as stack push or pop) that use an autoincrement or autodecrement addressing mode.
In autoincrement or autodecrement addressing mode, the instruction changes the
contents of a source register used to access one of its operands. In such cases, we need to
check data dependencies for registers affected by an autoincrement or autodecrement
operation along with the destination register.
When a location other than one explicitly named in an instruction as a destination
operand is affected, the instruction is said to have a side effect.
Another possible side effect involves the condition code flags, which are used by instructions such as conditional branches and add-with-carry. Let us assume that R1 and R2 hold one double-precision integer number and R3 and R4 hold another double-precision integer number.
The addition of these two numbers may be accomplished as follows:
ADD R1, R3
ADD with Carry R2, R4
Even though the register names are different, an implicit dependency exists between these two instructions through the carry flag. This flag is set or reset by the first instruction and used by the second instruction, which performs the operation
R4 ← [R2] + [R4] + Carry
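The implicit dependency can be made concrete with a short Python sketch of the 64-bit addition performed by the two instructions (the function name and argument order are illustrative):

def add64(r1, r2, r3, r4):
    # First ADD: low-order words held in R1 and R3; it sets or resets the carry flag.
    low = r1 + r3
    carry = 1 if low > 0xFFFFFFFF else 0
    # Second ADD-with-carry: high-order words in R2 and R4; it must wait for the carry above.
    high = r2 + r4 + carry
    return low & 0xFFFFFFFF, high & 0xFFFFFFFF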
Important Note :
Instructions that have side effects give rise to multiple data dependencies, which lead
to a substantial increase in the complexity of the hardware or software needed to resolve
them. For this reason, instructions designed for execution on pipelined hardware should have few side effects.
Example 3.7.3 The following sequence of instructions is executed in the basic 5-stage pipelined processor.
or r1, r2, r3
or r2, r1, r4
or r1, r1, r2
a) Indicate dependences and their type.
b) Assume there is no forwarding in this pipelined processor. Indicate hazards and add NOP
instructions to eliminate them.
c) Assume there is full forwarding. Indicate hazards and add NOP instructions to eliminate
them.
Solution: a) Dependences and their type
• Read After Write (RAW) dependency in r1 between Instructions 1, 2 and 3.
• Read After Write (RAW) dependency in r2 between Instructions 2 and 3.
• Write After Read (WAR) in r2 from Instruction 1 to 2.
• Write After Read (WAR) in r1 from Instruction 2 to 3.
• Write After Write (WAW) in r1 between Instructions 1 and 3.
b) The write-after-read and write-after-write dependences do not cause hazards in the 5-stage pipeline, because registers are read and written in program order. The read-after-write dependences cause data hazards, which are removed by inserting two NOPs before each dependent instruction (a sketch that automates this padding follows part (c)):
or r1, r2, r3
NOP
NOP
or r2, r1, r4
NOP
NOP
or r1, r1, r2
c) With full forwarding, the data hazards above are eliminated, so there is no need for NOP instructions.
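As a cross-check of part (b), the following Python sketch inserts the NOPs automatically. It assumes no forwarding but a register file that writes in the first half of a cycle and reads in the second half, so a consumer must be at least three instructions behind its producer; instructions are modelled as (dest, src1, src2) tuples, and all names are illustrative.

def insert_nops(program):
    NOP = (None, None, None)
    scheduled = []
    for dest, src1, src2 in program:
        # Look back over the last two issued slots for a RAW dependence.
        for distance, prev in enumerate(reversed(scheduled[-2:]), start=1):
            if prev[0] is not None and prev[0] in (src1, src2):
                scheduled.extend([NOP] * (3 - distance))   # pad the gap to distance 3
                break
        scheduled.append((dest, src1, src2))
    return scheduled

# or r1,r2,r3 / or r2,r1,r4 / or r1,r1,r2 -> two NOPs before each dependent or
print(insert_nops([("r1", "r2", "r3"), ("r2", "r1", "r4"), ("r1", "r1", "r2")]))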
3.8. Handling Control Hazards
3.8.1 Instruction Queue and Prefetching
To reduce the effect of cache misses and branch penalties, many processors employ sophisticated fetch units that can fetch instructions before they are needed and place them in a queue. This is illustrated in Fig. 3.33.
Fig 3.33 Use of instruction queue
A separate unit called the dispatch unit takes instructions from the front of the queue and sends them to the execution unit. It also performs the decoding function.
The fetch unit attempts to keep the instruction queue filled at all times to reduce the impact of occasional delays when fetching instructions during a cache miss.
In case of a data hazard, the dispatch unit is not able to issue instructions from the instruction queue. However, the fetch unit continues to fetch instructions and add them to the queue.
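The decoupling of fetching and dispatching can be sketched with a simple FIFO in Python (an illustrative model; the capacity and method names are assumptions):

from collections import deque

class InstructionQueue:
    def __init__(self, capacity=8):
        self.queue = deque()
        self.capacity = capacity

    def fetch(self, instruction):
        # Fetch unit: keep adding instructions while there is room in the queue.
        if len(self.queue) < self.capacity:
            self.queue.append(instruction)

    def dispatch(self, stalled):
        # Dispatch unit: issue from the front of the queue unless a hazard stalls it.
        if not stalled and self.queue:
            return self.queue.popleft()
        return None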
3.8.2. Use of instruction queue during branch instruction
Fig. 3.34 shows the instruction timeline. It also shows how the queue length changes over the clock cycles. Every fetch operation adds one instruction to the queue and every dispatch operation reduces the queue length by one. Hence, the queue length remains the same for the first four clock cycles.
One of the instructions then incurs a 2-cycle stall. In these two cycles the fetch unit adds two instructions but the dispatch unit does not issue any instruction. Due to this, the queue length rises to 3 in clock cycle 6.
Fig 3.34 Use of instruction queue during branch instruction
Since I5 is a branch instruction, instruction I6 is discarded and the target instruction of I5, Ik, is fetched in cycle 7. Since I6 is discarded, normally there would be a stall in cycle 7; however, here instruction I4 is dispatched from the queue to the decoding stage.
After discarding I6, the queue length drops to 1 in cycle 8 and remains 1 until another stall is encountered. In this example, instructions I1, I2, I3, I4 and Ik complete execution in successive clock cycles. Hence, the branch instruction does not increase the overall execution time.
3.8.3 Branch Folding
The technique in which the instruction fetch unit executes the branch instruction (by computing the branch address) concurrently with the execution of other instructions is called branch folding.
Branch folding occurs only if there exists at least one instruction in the queue other
than the branch instruction, at the time a branch instruction is encountered.
In case of a cache miss, the dispatch unit can continue to send instructions for execution as long as the instruction queue is not empty. Thus, the instruction queue also hides some of the delay that may occur due to a cache miss.
3.8.4 Approaches to Deal with Conditional Branching
The conditional branching is a major factor that affects the performance of instruction
pipelining. There are several approaches to deal with conditional branching.
These are:
• Multiple streams
• Prefetch branch target
• Loop buffer
• Branch prediction.
3.8.5. Multiple streams
We know that a simple pipeline suffers a penalty for a branch instruction. To avoid this, this approach uses two instruction streams to hold fetched instructions. One stream holds the instructions that sequentially follow the conditional branch instruction and is used when the branch is not taken.
The other stream holds the instructions starting at the branch target address and is used when the branch is taken.
Drawbacks :
Due to the multiple streams there are contention delays for access to registers and to memory.
Each additional branch instruction entering the pipeline needs a further stream.
3.8.6. Prefetch branch target
In this approach, when a conditional branch is recognized, the target of the branch is prefetched in addition to the instructions following the branch. The prefetched target is used if the branch is taken.
3.8.6.1. Loop buffer
A loop buffer is a small, very high-speed memory. It is used to store recently prefetched instructions in sequence. When a conditional branch is taken, the hardware first checks whether the branch target is within the buffer. If so, the next instructions are fetched from the buffer instead of memory, avoiding a memory access.
Advantages:
When a taken branch has its target in the buffer, instructions are fetched from the buffer, saving memory access time. This is useful for loop sequences.
If a branch occurs to a target just a few locations ahead of the address of the branch instruction, the target may already be in the buffer. This is useful for IF-THEN-ELSE-ENDIF sequences.
3.8.6.2. Delayed branch
The location following a branch instruction is called a branch delay slot. There may
be more than one branch delay slot, depending on the time it takes to execute a branch
instruction.
There are three ways to fill the delay slot :
1. The delay slot is filled with an independent instruction from before the branch. In this case performance always improves.
2. The delay slot is filled with an instruction from the branch target. Performance improves only if the branch is taken.
3. The delay slot is filled with one of the fall-through instructions. In this case performance improves only if the branch is not taken.
These techniques are collectively called delayed branching.
3.8.7. Branch Prediction
Prediction techniques can be used to guess whether a branch will be taken or not taken. These techniques reduce the branch penalty.
The common prediction techniques are:
• Predict never taken
• Predict always taken
• Predict by opcode
• Taken/Not taken switch
• Branch history table
Fig 3.35 Branch Prediction
In the first two approaches, a wrong prediction means that instructions have been prefetched down the wrong path, which may even cause a page fault or protection violation. The processor then halts prefetching and fetches the instruction from the correct address.
In the third prediction technique, the prediction decision is based on the opcode of the branch instruction. The processor assumes that the branch will be taken for certain branch opcodes and not for others.
The fourth and fifth prediction techniques are dynamic; they depend on the execution history of previously executed conditional branch instructions.
3.8.8. Branch Prediction Strategies
There are two types of branch prediction strategies :
• Static branch strategy
• Dynamic branch strategy.
3.8.8.1. Static Branch Strategy: In this strategy the branch is predicted statically, based on the branch instruction type. This means that the probability of a branch being taken for a particular branch instruction type is used to predict the branch. This strategy may not produce accurate results every time.
3.8.8.2. Dynamic Branch Strategy: This strategy uses recent branch history during program execution to predict whether or not the branch will be taken the next time it occurs. The recent branch information includes branch prediction statistics such as:
T: Branch taken
N: Not taken
NN: Last two branches not taken
NT: Last branch not taken and previous one taken
TT: Last two branches taken
TN: Last branch taken and previous one not taken
• The recent branch information is stored in a buffer called the Branch Target Buffer (BTB).
• Along with the above information, the branch target buffer also stores the address of the branch target.
Fig. 3.36 shows the organization of the branch target buffer.
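As a rough illustration of how such history might be kept, the Python sketch below pairs each branch address with its target and a 2-bit saturating counter standing in for the taken/not-taken statistics listed above (the actual BTB organization is the one shown in Fig. 3.36; this model and its names are assumptions):

class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}                       # branch PC -> (2-bit counter, target address)

    def predict(self, pc):
        counter, target = self.entries.get(pc, (1, None))
        return counter >= 2, target             # predict taken when the counter is 2 or 3

    def update(self, pc, taken, target):
        counter, _ = self.entries.get(pc, (1, None))
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.entries[pc] = (counter, target)    # record recent history and the branch target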