
S. F. Anderson
J. G. Earle
R. E. Goldschmidt
D. M. Powers

The IBM System/360 Model 91:

Floating-Point Execution Unit

Abstract: The principal requirement for the Model 91 floating-point execution unit was that it be designed to support the instruction-issuing rate of the processor. The chosen solution was to develop separate, instruction-oriented algorithms for the add, multiply, and divide functions. Linked together by the floating-point instruction unit, the multiple execution units provide concurrent instruction execution at the burst rate of one instruction per cycle.

Introduction

The instruction unit of the IBM System/360 Model 91 is designed to issue instructions at a burst rate of one instruction per cycle, and the performance of floating-point execution must support this rate. However, conventional execution unit designs cannot support this level of performance. The Model 91 Floating-Point Execution Unit departs from convention and is instruction-oriented to provide fast, concurrent instruction execution.

The objectives of this paper are to describe the floating-point execution unit. Particular attention is given to the design of the instruction-oriented units to reveal the techniques which were employed to match the burst instruction rate of one instruction per cycle. These objectives can best be accomplished by dividing the paper into four sections: General design considerations, Floating-point terminology, Floating-point add unit, and Floating-point multiply/divide unit.

The first section explains how the desire for concurrent execution of instructions has led to the design of multiple execution units linked together by the floating-point instruction unit. Then the concept of instruction-oriented units is discussed, and its impact on the multiplicity of units is pointed out. It is shown that, with the instruction-oriented units as building blocks and the floating-point instruction unit as the "cement," an execution unit evolves which rises to the desired performance level.

The section on floating-point terminology briefly reviews the System/360 data formats and floating-point definitions. The next two sections describe the design of the instruction-oriented units. The first of these is the floating-point add unit description, which is divided into two sub-sections, Algorithm and Implementation. In the algorithm sub-section, the complete algorithm for execution of a floating add/subtract is considered with emphasis on the difficulties inherent in the implementation. Since the add unit is instruction-oriented (i.e., only add-type instructions must be considered), it is possible to overcome the inherent difficulties by merging the several steps of the algorithm into three hardware areas. The implementation section describes these three areas, namely, characteristic comparison and pre-shifting, fraction adder, and post-normalization.

The last section describes the floating-point multiply/divide unit. This section describes the multiply algorithm and its implementation first, and then the divide algorithm and its implementation. The emphasis of the multiply algorithm sub-section is on recoding the multiplier and the usefulness of carry-save adders. In the implementation sub-section the emphasis is on the iterative hardware which is the heart of the multiply operation. An arrangement of carry-save adders is shown which, when pipelined by adding temporary storage platforms, has an iteration repetition rate of fifty Mc/sec. The divide algorithm is described next with emphasis on using multiplication, instead of subtraction, as the iterative operator. The discussion of divide implementation shows how the existing multiply hardware, plus a small amount of additional circuitry, is used to perform the divide operation.

IBM JOURNAL · JANUARY 1967

Authorized licensed use limited to: University of Southern California. Downloaded on February 16,2023 at 03:19:26 UTC from IEEE Xplore. Restrictions apply.
Figure 1 Floating-point execution unit capable of concurrent execution. [Diagram: storage and the instruction unit feed the floating-point instruction unit (FLIU), which contains FLIU controls, floating-point registers (FLR), floating-point buffers (FLB), and instruction buffers; FLR, FLB, and instruction buses connect the FLIU to two identical floating-point execution units, which share a common result bus.]

General design considerations


The programs considered "typical" by the user of high-performance computers are floating-point oriented. Therefore, the prime concern in designing the floating-point execution unit is to develop an overall organization which will match the performance of the instruction unit. However, the execution time of floating-point instructions is long compared with the issuing rate of these instructions by the instruction unit. The most obvious approach is to apply a faster technology and with special design techniques reduce the execution time for floating-point. But a study of many "typical" floating-point programs revealed that the execution time per instruction would have to be 1 to 2 cycles in order to match the performance of the instruction unit.* Conventional execution unit design, even with state-of-the-art algorithms, will not provide these execution times.

Another approach considered was to provide execution concurrency among instructions; this obviously would require two complete floating-point execution units.† An attendant requirement would be a floating-point instruction unit. This unit is necessary to sequence the operands from storage to the proper execution unit; it must buffer the instructions and assign each instruction to a non-busy execution unit. Also, since the execution time is not the same for all instructions, the possibility now exists for out-of-sequence execution, and the floating-point instruction unit must insure that executing out of sequence does not produce incorrect results.‡ The organization for an execution unit capable of concurrent execution is shown in Fig. 1. Buffering and sequence control of all instructions, storage operands, and floating-point accumulators are the responsibility of the floating-point execution unit. Each of the execution units is capable of executing all floating-point instructions.

One might be led to believe that this organization is a suitable solution in itself. If multiply can be executed in seven cycles and two multiplies are executed simultaneously, then the effective execution time is 3.5 cycles. Similarly, for add the execution time would go from three cycles to 1.5 cycles. However, the operating delay of the floating-point instruction unit must be considered, and it is not always possible to execute concurrently because of the dependence among instructions. When these problems are considered the effective execution time is close to three cycles per instruction, which is not sufficient. A third execution unit would not help because the complexity of the floating-point instruction unit increases, and the amount of hardware becomes prohibitive.

The next solution to be considered was to improve the execution time of each instruction by employing faster algorithms in the design of each execution unit.

* Even though the burst rate of the instruction unit is one instruction per cycle, it is not necessary to execute at the same rate.
† Since two complete execution units are necessary for concurrent execution, the cost-performance factor is important. Analysis showed that execution times of three cycles for add and seven cycles for multiply were reasonable expectations.
‡ Dependence among instructions must be controlled. If instruction n + 1 is dependent on the result of instruction n, instruction n + 1 must not be allowed to start until instruction n is completed.

MODEL 91 FLOATING-POINT EXECUTION

Table 1 Floating-point instructions executed by floating-point execution unit.

Type     Instruction                   Condition code   Floating-point exceptions*   Arithmetic unit
RR-RX    Load (S/L)                    NO                                            FLIU
RR       Load and Test (S/L)           YES                                           FLIU
RX       Store (S/L)                   NO                                            FLIU
RR       Load Complement (S/L)         YES                                           ADD
RR       Load Positive (S/L)           YES                                           ADD
RR       Load Negative (S/L)           YES                                           ADD
RR-RX    Add Normalized (S/L)          YES              U,E,LS                       ADD
RR-RX    Add Unnormalized (S/L)        YES              E,LS                         ADD
RR-RX    Subtract Normalized (S/L)     YES              U,E,LS                       ADD
RR-RX    Subtract Unnormalized (S/L)   YES              E,LS                         ADD
RR-RX    Compare (S/L)                 YES                                           ADD
RR       Halve (S/L)                   NO                                            ADD
RR-RX    Multiply                      NO               U,E                          M/D
RR-RX    Divide                        NO               U,E,FK                       M/D

* Exceptions:
U - Exponent-underflow exception
E - Exponent-overflow exception
LS - Significance exception
FK - Floating-point divide exception

Obviously this would increase the hardware, but since the circuit delay is a function not only of the circuit speed but also of the number of loads on the input net and the length of the interconnection wiring, more hardware may not make the unit faster. These two factors, the desire for faster execution of each instruction and the size sensitivity of the circuit delay, have produced a concept which is unique to the organization of floating-point execution units, and which was adopted for the Model 91: the concept of using separate execution units for different instruction types.

Faster execution of each instruction can be achieved if the conventional execution unit is separated into arithmetic units designed to execute a subset of the floating-point instructions instead of the entire set. This conclusion may not be obvious, but a unit designed exclusively for a class of similar instructions can execute those instructions faster than a unit designed to accommodate all floating-point instructions. The control sequences are shorter and less complex; the data flow path has fewer logic levels and requires less hardware because the designer has more freedom in combining serial operations to eliminate circuit levels; the circuit delay per level is faster because less hardware is required in the smaller, autonomous units. To implement the concept in the Model 91, the floating-point instruction set was separated into two subsets: add and multiply/divide. Table 1 shows a list of the instructions and identifies the unit in which each instruction is executed. With this separation, an add unit which executed all add class instructions in two cycles, and a multiply/divide unit which executed multiply in six cycles and divide in eighteen cycles, were designed.

The use of this concept somewhat changes the character of concurrent execution. It is possible to have concurrent execution with one execution unit, i.e., two arithmetic units, add and multiply/divide. The performance is not quite as good as that attainable using two execution units, but less hardware is required for the implementation. Therefore, more arithmetic units can be added to improve the performance. First, two add units and two multiply/divide units were considered. But the floating-point instruction unit can assign only one instruction per cycle. Therefore, since an add operation is two cycles long, two add units could be replaced by one add unit if a new add class instruction could be started every cycle. This would introduce still another example of concurrent execution: concurrent execution within an arithmetic unit.

Such concurrency within a unit is facilitated by the technique of pipelining. If a section of combinatorial logic, such as the logic to execute an add, could be designed with equal delay in all parallel paths through the logic, the rate at which new inputs could enter this section of logic would be independent of the total delay through the logic. However, delay is never equal; skew is always present and the interval between input signals must be greater than the total skew of the logic section. But temporary storage platforms can be inserted which will separate the section of combinatorial logic into smaller synchronous stages. Now the total skew has been divided into smaller pieces; only the skew between stages has to be considered. The interval between inputs has decreased and now depends on the skew between temporary storage platforms. Essentially the temporary storage platform is used to separate one complete job, such as an add, into several pieces; then several jobs can be executed simultaneously. Thus, inputs can be applied at a predetermined rate and once the pipeline is full the outputs will match this rate.
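The throughput effect of inserting temporary storage platforms can be illustrated with a small model. This is a sketch for illustration only; the function and the operation counts below are assumptions, not taken from the paper, though the two-cycle add latency matches the unit described later.

```python
# Toy model of a pipelined unit: a new operation may enter every
# issue_interval cycles, and each operation takes latency cycles end to end.

def finish_times(n_ops, latency, issue_interval):
    """Completion cycle of each of n_ops back-to-back operations."""
    return [i * issue_interval + latency for i in range(n_ops)]

# Without temporary storage platforms the next add must wait for the
# previous one to finish (interval = latency).
serial = finish_times(10, latency=2, issue_interval=2)

# With a two-stage pipeline a new add may start every cycle.
piped = finish_times(10, latency=2, issue_interval=1)

# Effective time per add approaches the issue interval, not the latency.
print(serial[-1] / 10)  # 2.0 cycles per add
print(piped[-1] / 10)   # 1.1 cycles per add
```

Once the pipeline is full, throughput is set by the interval between inputs rather than by the total delay through the logic, which is exactly the point of the skew argument above.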

ANDERSON, EARLE, GOLDSCHMIDT AND POWERS

The technique of pipelining does have practical limits, and these limits differ for each application. In general the rate at which new inputs can be applied is limited by the logic preceding the pipeline (e.g., add is limited to one instruction per cycle by the floating-point instruction unit) or by the rate at which outputs can be accepted. Also, both the rate of new inputs and the length of the pipeline are limited by dependencies among stages of the pipeline or between the output and successive inputs (e.g., the output of one add can become an input for the next).

The add unit requires two cycles for execution and is limited to one new input per cycle. Thus pipelining allows two instructions to be in execution concurrently, thereby increasing the efficiency with a small increase in hardware. Further study of pipelining techniques would indicate that a three-cycle multiply and a twelve-cycle divide are possible. Here the technique of pipelining is used to speed up the iterative section of the multiply which is critical to multiply/divide execution. (This is discussed in detail in the section on the multiply/divide unit.)

The execution unit would consist at this point of a floating-point instruction unit, an add unit which could start an instruction every cycle, and a multiply/divide unit which would execute multiply in three cycles and divide in twelve cycles. However the performance still would not match the instruction unit. The execution times would be adequate but the units would spend considerable time waiting for operands. Therefore, instead of duplicating the arithmetic unit (which is expensive), extra input buffer registers have been added to collect the operands and necessary instruction control information. When both operands are available, the control information is processed and a request made to use an arithmetic unit. These registers are referred to as "reservation stations." They can be and are treated as independent units.

The final organization is shown in Fig. 2. It consists of three parts: the floating-point instruction unit; the floating-point add unit; and the floating-point multiply/divide unit. Another paper in this series explains the floating-point instruction unit in detail. The problems involved and both the considered solutions and the implemented solutions are discussed. The floating-point add unit has three reservation stations and, as stated above, is treated as three separate add units, A1, A2 and A3. The floating-point multiply/divide unit has two reservation stations, M/D1 and M/D2. The last two sections of this paper describe the design of these two units in detail.

Figure 2 Overall organization of floating-point unit. [Diagram: the floating-point instruction unit (FLIU) contains the floating-point operand stack (FLOS, 8 x 14), the floating-point registers (FLR, 4 x 72), and the floating-point buffers (FLB, 6 x 72), fed from the instruction unit and storage; the FLR, FLB, and common data buses carry operands to the add unit (a two-stage floating-point add pipeline with reservation stations) and to the M/D unit (with reservation stations M/D 1 and M/D 2); both units return results on a common result bus, which also feeds storage via the store data buffers and the fixed-point unit.]

Floating-point terminology

The reader is assumed to be familiar with System/360 architecture and terminology. However, the floating-point data format and terminology will be briefly reviewed here. Floating-point data occupy a fixed-length format, which may be either a full-word short format or a double-word format:

Short Floating-Point Binary Format

    Sign    Characteristic    Fraction
    0       1 ----------- 7   8 ----------- 31

Long Floating-Point Binary Format

    Sign    Characteristic    Fraction
    0       1 ----------- 7   8 ----------- 63

The first bit in either format is the sign bit. The subsequent seven bit positions are occupied by the characteristic.

Figure 3 Floating-point add data flow. [Diagram: operands arrive on the FLR, FLB, and common data buses; characteristic comparison and pre-shifting feed the fraction adder through a latch; the adder output passes through a zero-digit check and post-normalization, producing the characteristic and fraction that are gated to the common result bus.]

The fraction consists of six hexadecimal digits for the short format or 14 hexadecimal digits for the long. The radix point of the fraction is assumed to be immediately to the left of the high-order fraction digit. To provide the proper magnitude for the floating-point number, the fraction is considered to be multiplied by a power of 16. The characteristic portion, bits 1-7 of both floating-point formats, indicates this power. The characteristic is treated as an excess 64 number with a range from -64 through +63 corresponding to the binary expression of the values 0-127.

Both positive and negative quantities have a true fraction, the difference in sign being indicated by the sign bit. The number is positive or negative accordingly as the sign bit is zero or one.

A normalized floating-point number has a non-zero high-order hexadecimal fraction digit. To preserve maximum precision in subsequent operation, addition, subtraction, multiplication, and division are performed with normalized results. (Addition and subtraction may also be programmed to be performed with unnormalized results. The operands for any floating-point operation can be either normalized or unnormalized.)

Floating-point add unit

The challenge in the design of the add unit was to minimize the number of logical levels in the longest delay path. However, the sequence of operations necessary for the execution of a floating-point add impedes the design goal.
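As a concrete illustration of the format just described, the following sketch decodes a short-format word. The function name and the use of Python floats are my own; only the bit layout and the excess-64 interpretation come from the text above.

```python
# Decode an IBM System/360 short floating-point word (32 bits):
# bit 0 = sign, bits 1-7 = excess-64 characteristic,
# bits 8-31 = six-hex-digit fraction with the radix point immediately
# to the left of the high-order digit.
# Value = (-1)^sign * fraction * 16^(characteristic - 64).

def decode_short(word):
    sign = (word >> 31) & 1
    characteristic = (word >> 24) & 0x7F
    fraction_bits = word & 0x00FFFFFF          # 24 bits = 6 hex digits
    fraction = fraction_bits / float(1 << 24)  # fraction in [0, 1)
    return (-1) ** sign * fraction * 16.0 ** (characteristic - 64)

# +1.0 is characteristic 65 (i.e., 16^1) with fraction 0.1 hexadecimal:
print(decode_short(0x41100000))  # 1.0
```

Note that 0x41100000 is the normalized representation of 1.0: the high-order hexadecimal fraction digit is non-zero, as the normalization rule above requires.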

Consider the following operations:

(a) Since the radix point must be aligned before an add can proceed, the characteristics of the two operands must be compared and the difference between them established.
(b) This difference must be decoded into the shift amount, and the fraction with the smaller characteristic shifted right a sufficient number of positions to make the characteristics equal.
(c) Since subtraction is to be performed by forming the two's complement of one of the fractions and then adding the two fractions in the fraction adder, one of the fractions must pass through true/complement logic.
(d) The two operand fractions are added in a parallel adder. The carries must propagate from the low order end to the high order end.
(e) Because of subtraction, the output must provide for both the true sum and the complement sum, depending on the high-order carry.
(f) If the system architecture calls for left justification or normalized operation, the result out of the adder must be checked for high-order zeros and shifted left to remove these zeros.
(g) The characteristic must be reduced by the amount of left shift necessary to normalize the resultant fraction.
(h) The resultant operand must be stored in the proper accumulator.

The above sequence of operations implies a series of sequential execution stages, each of which is dependent on the output of the previous stage. The problem, then, is to arrange, change and merge these operations to provide fast, efficient execution for a floating-point add.

None of the steps can be eliminated. Each step is required in order to execute add; but the steps can be merged so that the interface between them is eliminated,* and each step can be changed to provide only the necessary information to the next stage. For example, the long data format consists of 14 hexadecimal digits; therefore any difference between the two characteristics which is greater than 14 will result in an all zero fraction. This means that the characteristic difference adder need not generate a sum for the high-order three bits. Instead, if the difference is greater than 14, a shift of 15 is forced. As a result, the characteristic difference adder is faster and less expensive.

The add unit algorithm is separated into three parts: characteristic comparison and pre-shifting, fraction adder, and post-normalization (Fig. 3). The first section, the characteristic comparison and pre-shifting operation, merges the first three operations from the sequence given above; the second section, the fraction adder, merges the next two operations; the final section, post-normalization, merges the final three operations. The hardware implementation of each of these three sections is discussed below.

* Levels used to encode the output of one step, which is subsequently decoded in the next step. Merging the two steps will eliminate these levels.

Implementation

Characteristic comparison and pre-shifting

The first stage of execution for all two-operand instructions (floating-point add, subtract, and compare) is to compare the characteristics and establish the magnitude of the difference between them. The characteristic (CB) of one operand is always subtracted from the characteristic (CA) of the other operand (CA - CB). Characteristic B is always complemented as it is gated in at the reservation station.

If the output of the characteristic difference adder is the true sum or the complement of the true sum, the output can be decoded directly at the pre-shifter. But the adder always subtracts CB from CA, and if CB > CA the sum would be negative. Therefore, to eliminate the possibility of having to add a 1 in the low order position and complement when CB is greater than CA, an "end-around-carry" adder is used. This is shown by the examples in Fig. 4.

Figure 4 Examples of exponent arithmetic. [Worked 7-bit binary examples: CA - CB is formed by adding CA to the one's complement of CB; a carry out of the high order position (CA ≥ CB) is fed back in as an end-around "hot one" to give the true difference, while the absence of a carry (CB > CA) means the result is the complement of the difference and must be recomplemented.]

The characteristic comparison can result in two states: CA ≥ CB or CB > CA. If CA ≥ CB there is a carry out of the high order position of the characteristic difference adder, and the carry is used to gate the fraction of operand B to the pre-shifter. The true sum output of the characteristic difference adder is the amount that the fraction must be shifted right to make the characteristics equal.
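The end-around-carry trick described above can be sketched in a few lines. This is illustrative only; the names and the 7-bit integer style are assumptions, not the Model 91 logic, and where the text treats a carry as signalling CA ≥ CB, this sketch sends the equal case down the no-carry path, which still yields the correct shift of zero.

```python
# Characteristic difference via an end-around-carry adder: CB is gated in
# as its one's complement, so CA + ~CB in 7 bits gives |CA - CB| directly,
# whichever operand is larger.

def char_difference(ca, cb):
    """Return (shift_amount, b_is_smaller) for two 7-bit characteristics."""
    s = ca + (~cb & 0x7F)               # add CA to one's complement of CB
    carry = s >> 7                      # high-order carry out
    if carry:
        # End-around carry ("hot one"): true difference CA - CB.
        return ((s + 1) & 0x7F, True)
    # No carry: the sum is the complement of the difference CB - CA.
    return ((~s & 0x7F), False)

print(char_difference(0x45, 0x41))  # (4, True)  -> pre-shift operand B's fraction
print(char_difference(0x41, 0x45))  # (4, False) -> pre-shift operand A's fraction
```

Either branch decodes the adder output directly at the pre-shifter, which is the point of the scheme: no extra add-one-and-complement pass is needed when CB is the larger characteristic.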

If CB > CA, there is no carry out of the high order position of the characteristic difference adder, and the absence of a carry is used to gate the fraction of operand A to the pre-shifter. In this case the complement of the sum output of the characteristic difference adder is the amount that the fraction must be shifted right to make the characteristics equal. In both cases the second operand fraction (the one with the larger characteristic) is gated to the true/complement input of the fraction adder.

The characteristic of the unshifted fraction becomes the resultant characteristic. It is gated to the characteristic-update adder, and after updating, if necessary, it is gated to the accumulator specified by the instruction.

The output of the characteristic difference adder is decoded by the pre-shifter and the proper fraction shifted right the necessary number of positions. The pre-shifter is a parallel digit-shifter which shifts each of the 14 digits right any amount from zero to fifteen. The decode of the shift amount is designed into each level, thereby eliminating serial logic levels for decoding.

The pre-shifter consists of two circuit levels. The first level shifts a digit right by 0, 1, 2 or 3 digit positions. The second level shifts a digit right by 0, 4, 8, or 12 digit positions. Thus, by the proper combination of these amounts any right digit shift between and including 0 and 15 can be executed. Figure 5 shows an example of the pre-shifter.

Figure 5 Digit pre-shifter. [Diagram: the 14 input digits pass through two gating levels; the first level shifts each digit right 0, 1, 2, or 3 digit positions and the second level shifts right 0, 4, 8, or 12, so the combination produces any right digit shift from 0 through 15.]

The un-shifted fraction is gated to the true/complement gates of the adder. Here the fraction is gated unchanged if the effective operation is ADD and complemented if the effective operation is SUBTRACT. The true/complement gating is overlapped with the pre-shifter on a time basis. The output of both the true/complement logic and the pre-shifter are the inputs to the fraction adder.

Fraction adder

Most of the time required for binary adders is carry propagation time. Two operands must be combined and the carries allowed to ripple from right (low order) to left (high order). The usual method of finding the sum is to combine the half sum* of bit n (higher order) with the carry from bit n-1 (Sn = An ⊕ Bn ⊕ Cn).† The carry (Cn) into bit position n is also a three term expression which includes the carry into bit position n-1:

    Cn = An-1·Bn-1 ∨ An-1·Cn-1 ∨ Bn-1·Cn-1.

If the carry term is rearranged to read

    Cn = An-1·Bn-1 ∨ (An-1 ∨ Bn-1)·Cn-1,

two new terms can be defined which separate the carry into two parts: generated carry and propagated carry. The generated carry (Gn-1) is defined as An-1·Bn-1, and the carry propagate function (often abbreviated to simply propagate, or Pn-1) is defined as An-1 ∨ Bn-1.

* The half sum is the exclusive OR of the two input bits, (An ⊕ Bn).
† The two operand fractions are designated as A, B and the bits as An, Bn, An-1, Bn-1, etc. Cn is the carry into bit position n, which is the carry out from bit n-1.
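The generate and propagate terms just defined can be checked with a short behavioral sketch. The serial loop below is for clarity only; the Model 91 of course evaluates these terms with parallel look-ahead logic, and the function name is my own.

```python
# Bit-level generate/propagate terms and the carry recurrence
# Cn = Gn-1 OR (Pn-1 AND Cn-1), evaluated serially for a small adder.

def ripple_with_gp(a_bits, b_bits, carry_in=0):
    """a_bits/b_bits: lists of bits, index 0 = low order.
    Returns (sum_bits, carry_out)."""
    c = carry_in
    sums = []
    for a, b in zip(a_bits, b_bits):
        g = a & b                 # generate: this position produces a carry
        p = a | b                 # propagate: this position passes one along
        sums.append(a ^ b ^ c)    # half sum XORed with the incoming carry
        c = g | (p & c)           # Cn = Gn-1 OR (Pn-1 AND Cn-1)
    return sums, c

# 5 + 3 in four bits: 0101 + 0011 = 1000
print(ripple_with_gp([1, 0, 1, 0], [1, 1, 0, 0]))  # ([0, 0, 0, 1], 0)
```

Look-ahead removes the serial dependence of this loop by expanding the recurrence, which is exactly what the rewritten carry expressions that follow show.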

Now the carry expression can be rewritten as:

    Cn = Gn-1 ∨ Pn-1·Cn-1
    Cn = Gn-1 ∨ Pn-1·Gn-2 ∨ Pn-1·Pn-2·Cn-2
    Cn = Gn-1 ∨ Pn-1·Gn-2 ∨ Pn-1·Pn-2·Gn-3 ∨ Pn-1·Pn-2·Pn-3·Cn-3

The expansion can continue as far as one desires, and one could conceive of Cn being generated by one large OR block preceded by several AND blocks (in fact n AND blocks, one for each stage). But it is obvious that the limiting factor would be the circuit fan-in. Only a limited number of circuit stages can be connected together in this manner. This technique is defined as carry look-ahead, and by cascading different levels of look-ahead the technique can be made to fit the circuit fan-in and fan-out limitations.

For example, assume that four bits can be arranged in this manner, and that each four bits form a "group." The adder is now divided into groups and the carries and propagates can be arranged for carry look-ahead between groups just as they were for look-ahead between bits. It is possible to carry the concept even further and define a section as consisting of one or more groups. Now the adder has three levels of carry look-ahead: the bit level of look-ahead, the group level, and the section level.

The fraction adder of the floating-point add unit is a carry look-ahead adder. A group is made up of four bits (one digit) and two groups form a section. Since it must be capable of adding 56 bits, the fraction adder consists of seven sections and 14 groups. Each pair of input bits generates the three bit functions: half-sum (A ⊕ B), bit carry generate (A·B) and bit propagate (A ∨ B). These functions are combined to form the group generate and propagate, which in turn are combined to form the section generate and propagate. A typical group is shown in Fig. 6 and the group and section look-ahead are shown in Fig. 7.

The high-order sum consists of nine bits to include the end-around carry for subtraction and the overflow bit for addition. The end-around carry is needed for subtraction because the fraction which is complemented may not be the subtrahend. This is illustrated by the example given in the description of the characteristic comparison. If the effective sign of the instruction is minus (the exclusive OR of the sign of the two fractions and the instruction is

Post-normalization

Normalization or post-shifting takes place when the intermediate arithmetic result out of the adder is changed to the final result. The output of the fraction adder is checked for high-order zero digits and the fraction is left-shifted until the high-order digit is non-zero.

The output of the fraction adder is gated to the zero-digit checker. The zero-digit checker is simply a large decoder, which detects the number of leading zero digits, and provides the shift amount to the post-shifter. Since this same amount must be subtracted from the characteristic, the zero-digit checker also must encode the shift amount for the characteristic update adder.

The implementation of the digit post-shifter is the same as the digit pre-shifter except for the fact that the post-shift is a left-shift. The first level of the post-shifter shifts each of the 14 digits left 0, 1, 2 or 3 and the second level shifts each digit 0, 4, 8, or 12. The output of the second level is gated into the add unit fraction result register, from which the resultant fraction is routed to the proper floating-point accumulator.

The characteristic update is executed in parallel with the fraction shift. The zero-digit checker provides the characteristic update adder with the two's complement of the amount by which the characteristic must be reduced. Since it is not possible to have a post-shift greater than 13, the high-order three bits of the characteristic can only be changed by carries which ripple from the low order four bits. The update adder makes use of this fact to reduce the necessary hardware and speed up the operation.

Floating-point multiply/divide unit

Multiply and divide are complicated operations. However, two of the original design goals were to select an algorithm for each operation such that (1) both operations could use common hardware, and (2) improvement in execution time could be achieved which would be comparable to that achieved in the floating-point add unit. Several algorithms exist for each instruction which make the first design goal attainable. Unfortunately, the best of the algorithms generally used for divide are not capable of providing an improvement in execution comparable to the improvement achievable by those used for multiply. The algorithm developed for divide in the Model 91 uses multiplication as the basic operator. Thus, common hardware is used, and comparable improvement in the execution time is achieved.

In order to give a clear, consistent treatment to both the instructions, this section discusses the multiply algorithm
effective sign) the effective operation is subtract. Also, and hardware implementation first. Then the divide algo-
the high-order bit (ninth bit of the high order section) is rithm is discussed separately. Finally, it is shown how
set to a one, thus conditioning it for an end-around-carry. divide utilizes the multiply execution hardware and the
If there is no end-around-carry when the effective sign hardware which is unique to the execution of divide is
is minus the adder output is complemented. described. 41
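As a rough functional sketch of the two-level digit post-shifter described in this section (the first level shifts each digit position left 0 to 3, the second level 0, 4, 8, or 12), assuming a list-of-digits representation that is ours, not the hardware's:

```python
# Illustrative model only: a fraction is a list of 14 four-bit digits,
# and a left shift of 0-13 digit positions is decomposed into the two
# shifter levels exactly as the text describes.
def post_shift(fraction_digits, shift):
    assert 0 <= shift <= 13          # the post-shift never exceeds 13 digits
    level1 = shift % 4               # first level: shift left 0, 1, 2 or 3
    level2 = shift - level1          # second level: shift left 0, 4, 8 or 12
    out = list(fraction_digits)
    for amount in (level1, level2):  # vacated low-order digits fill with zeros
        out = out[amount:] + [0] * amount
    return out

digits = [0, 0, 0, 5, 9, 2] + [0] * 8   # an unnormalized 14-digit fraction
print(post_shift(digits, 3))            # leading zero digits shifted out
```

Any shift of 0 through 13 is reachable as the composition of the two levels, which is why only two shifter stages are needed.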

MODEL 91 FLOATING-POINT EXECUTION

Authorized licensed use limited to: University of Southern California. Downloaded on February 16,2023 at 03:19:26 UTC from IEEE Xplore. Restrictions apply.
Figure 6 Fraction adder, section 1 (high-order). (Logic diagram: bit half-sums, bit generates (G) and propagates (P), formed from the pre-shifter outputs and the true/complement gates, are combined into group generates (GG) and group propagates (PG); 4 bits = 1 group, 8 bits = 1 section.)
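The bit-level generate/propagate terms of Fig. 6 combine into group terms as sketched below; the indexing and function names are ours, and this is a functional model, not the circuit:

```python
# Functional model of one look-ahead group: bit generate g = A AND B,
# bit propagate p = A OR B; the group generate/propagate predict the
# group's carry-out as gg OR (gp AND carry-in) with no rippling.
def group_terms(a_bits, b_bits):
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a | b for a, b in zip(a_bits, b_bits)]
    gg = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    gp = p[3] & p[2] & p[1] & p[0]
    return gg, gp

def group_carry_out(a_bits, b_bits, cin):
    """Reference: carry out of a real 4-bit addition (bit 0 = low order)."""
    a = sum(bit << i for i, bit in enumerate(a_bits))
    b = sum(bit << i for i, bit in enumerate(b_bits))
    return (a + b + cin) >> 4

# Exhaustive check that the look-ahead terms match an actual addition.
for a in range(16):
    for b in range(16):
        for cin in (0, 1):
            a_bits = [(a >> i) & 1 for i in range(4)]
            b_bits = [(b >> i) & 1 for i in range(4)]
            gg, gp = group_terms(a_bits, b_bits)
            assert gg | (gp & cin) == group_carry_out(a_bits, b_bits, cin)
print("group look-ahead verified")
```

The same construction repeats one level up: group gg/gp terms combine into section generates and propagates, exactly as in Fig. 7.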

• Multiply algorithm

Computers usually execute multiply by repetitive addition, and the time required is dependent on the number of additions required.1,6 A zero bit in the multiplier results in adding a zero word to the partial product. Therefore, because shifting is a faster operation than add, the execution time can be decreased by shifting over a zero or a string of zeros. Any improvement in the multiply execution beyond this point is not obvious. However, certain properties of the binary number system, combined with complementing to allow subtraction as well as addition, can be used to reduce the number of necessary additions.

An integer in any number system may be written in the following form:

    N = a(n) b^n + a(n-1) b^(n-1) + ... + a(1) b + a(0),

where 0 <= a(i) <= b - 1, and b = base of the number system.

ANDERSON, EARLE, GOLDSCHMIDT AND POWERS

Figure 7 Fraction adder, carry look-ahead. (Logic diagram: group generates (GG) and propagates (PG) form the section generates (GS) and propagates (PS), which in turn produce the carry into each of the seven sections, the fraction overflow bit, the end carry to the sign control, and the gate-true/gate-complement controls for the adder sum latch; SUB denotes an effective subtract operation.)
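The end-around-carry rule described above (complement one fraction, add, and recomplement the sum if no carry emerges) can be illustrated with a small one's-complement model; the 8-bit width and function name are ours, not the 56-bit hardware's:

```python
# Toy model of effective subtraction with an end-around carry: one
# operand is always complemented; if the add produces a carry out, it
# wraps around (+1) and the sum is true; otherwise the sum itself must
# be complemented, exactly as the text describes.
WIDTH = 8
MASK = (1 << WIDTH) - 1

def eac_subtract(a, b):
    total = a + (~b & MASK)       # add the one's complement of b
    if total >> WIDTH:            # end-around carry present: a >= b
        return (total + 1) & MASK, "+"
    return ~total & MASK, "-"     # no carry: recomplement, a < b

print(eac_subtract(0b01100000, 0b00010000))  # positive difference
print(eac_subtract(0b00010000, 0b01100000))  # same magnitude, negative
```

The recomplement on the no-carry path corresponds to the adder's gate-complement output, which is why the high-order sum carries the extra ninth bit.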

One of the properties of numbering systems which is particularly interesting in multiply is that an integer can be rewritten as shown below:

    d(j) b^j + d(j-1) b^(j-1) + ... + d(i) b^i = b^(j+1) - b^i,

where d(k) = b - 1 for every k in the string. In the binary number system d(k) can take only the values 0 and 1. Thus, using the above property, a string of 1's can be skipped by subtracting at the start of the string and adding at the end of the string:

    112 (decimal) = 1110000 (binary) = 10000000 (binary) - 10000 (binary).

Therefore, a string of 1's in the multiplier can be reduced from an addition for each 1 in the string to a subtraction for the first 1 in the string, a shift of the partial product one position for each 1 in the string, and an addition for the last 1 in the string.
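The string-skipping identity can be checked numerically; the routine below is our illustration of the principle, not the Model 91 recoder:

```python
# 112 = 1110000 in binary: the string of three 1's is handled by one
# subtraction at its low-order end and one addition past its high end.
assert 0b1110000 == 2**7 - 2**4 == 112

def multiply_by_string_skipping(multiplicand, multiplier):
    """Replace every maximal string of 1's in the multiplier with one
    subtraction (start of string) and one addition (end of string)."""
    product, i, bits = 0, 0, multiplier
    while bits:
        if bits & 1:
            start = i
            while bits & 1:            # skip over the string of 1's
                bits >>= 1
                i += 1
            product -= multiplicand << start
            product += multiplicand << i
        else:
            bits >>= 1
            i += 1
    return product

assert multiply_by_string_skipping(37, 112) == 37 * 112
print("string skipping verified")
```

Each string of 1's, however long, now costs exactly two additions, which is the property the recoding algorithm below exploits with uniform shifts.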

Figure 8 Scanning pattern for multiplier. (The total scan is 29 overlapping patterns of three bits each; each pattern generates one multiple. Bits 0 and 57 are always zero.)

However, the method described above requires a variable shift and thus does not permit one to predict the exact number of cycles required to execute multiply. Furthermore, it does not permit the use of carry-save adders in the implementation. (Carry-save adders will be discussed later.)

A multiplier recoding algorithm, which is based on the property described above, but which uses uniform shifts, is used in the Model 91. The multiplier is divided into uniform groups of k bits each. These k bits are recoded to generate a multiple of the multiplicand, which is added to or subtracted from the partial product. The multiples are generated by shifting the position of the multiplicand in relation to the normal position at which it would enter the adder for a k equal to one. After adding the generated multiple to the partial product, the partial product is shifted k positions and the next group of k bits is considered.

The correct choice for k is important, since an average of 1/2^k of the generated multiples will have a value of zero, and increasing k (over k equal to one) reduces the amount of operand reduction capability that is used inefficiently. However, if k is greater than two, carry-propagate addition is necessary to generate the needed multiplicand multiples (shifting can only be used to generate multiples which are a power of two). In the context of a fast multiply, the carry-propagate adder increases the start-up time, which is undesirable. The Model 91 uses a k equal to two.

The technique used to scan the multiplier is shown in Fig. 8. Overlapping the high-order bit of one group and the low-order bit of the next group insures that the beginning and end of a string of 1's is detected once and only once. Table 2 shows which multiples are selected for all possible combinations of the two new bits and the overlapped bit.

Since the objective is fast multiply execution, six groups of multiplier bits are recoded at one time, and the resultant six multiples are added to the partial product. Five iterations are sufficient to assimilate the full 56 bits of the multiplier fraction. Figure 9 shows how the multiplier fraction is separated for each iteration and how each iteration is separated for the six generated multiples.

A tree of carry-save adders is used to reduce the generated multiples from six to two. A carry-save adder, which can be used whenever successive addition of several operands is necessary, requires less hardware, has less data skew and has less delay than a carry-propagate adder. The individual carry-save adder takes three input operands and generates the resulting sum and carry. However, instead of connecting the carries to the next higher-order bits and allowing them to ripple, they are treated as independent outputs. In accordance with the customary rules for addition, the carries will be added to the next higher-order bits as separate inputs to the next carry-save adder down the tree.

Figure 10 illustrates a tree of carry-save adders which will reduce six input operands to two, thereby retiring 12 bits of the multiplier on each iteration. Note that the final output of the carry-save adder tree is two operands, sum and carry, which are shifted right 12 positions and loop back to become input operands. Thus, the partial product is accumulated as a partial sum and a partial carry. After the multiplier has been assimilated, these two operands, sum and carry, are added in a carry-propagate adder to form the final product.

• Implementation

A block diagram of the data flow for the execution of a multiply is shown in Fig. 11. This data flow can be separated into two parts, the iterative hardware and the peripheral hardware (that hardware which is peripheral to the iterative hardware). The latter includes the input reservation stations, the pre-normalizer, the post-normalizer, the propagate adder, the result register, and the characteristic arithmetic. The peripheral hardware is described first, but since the iterative hardware is the heart of multiply execution, the major part of this section is devoted to a discussion of this hardware.

Input peripheral hardware
The input hardware includes the reservation stations, pre-normalizer, and the characteristic arithmetic. As was stated earlier, the multiply unit has two reservation stations and appears to the floating-point instruction unit for assignment

purposes as two distinct multiply units. If both units have been selected for a multiply operation, the first unit to receive both operands is given priority to begin execution. In the case where both units receive their second operand simultaneously, the unit which was selected by the floating-point execution unit first is given priority for execution.

The system architecture specifies that multiply is a normalized operation. Thus, if the input operands are unnormalized, they must be gated to the pre-normalizer, normalized, and then returned to the originating reservation station. In some cases, one additional machine cycle is added to the execution time for each unnormalized operand. However, normalization takes place as soon as the first operand enters the reservation station, provided there is not an operation in execution. Thus, normalizing can take place while the unit is waiting for the second operand.

The design of the zero-digit detector and the left-shifter are similar to those described earlier for the add unit. If the zero-digit detector detects an all-zero fraction, the multiply is executed normally, but the outgate of the result to the floating-point accumulator is inhibited. Thus the required result, an all-zero fraction, is stored.

The amount of left shifting necessary to normalize an operand is gated to the characteristic arithmetic logic, where the characteristic is updated for this shift. Characteristic arithmetic for multiply simply requires the two characteristics to be added, but this operation can be overlapped with the execution of the multiply. Thus, the implementation is simple and straightforward.

It remains only to update the characteristic because of post-normalization. The post-shift can never be more than one digit because the input operands are normalized. Therefore, in order to eliminate logic levels at the end of multiply execution, two characteristics are generated: the normal resultant characteristic and the normal characteristic minus one. Subsequent to post-normalization the correct characteristic is outgated.

Figure 9 Iterations and multiple generation for multiply. (Each of the five iterations takes a thirteen-bit slice of the multiplier, e.g., bits 24 through 36, and generates the six multiples M1 through M6.)

Output peripheral hardware
The output peripheral hardware includes the carry-propagate adder, the result register and the post-normalizer. Since the product is accumulated as two operands (sum and carry), the output of the iterative hardware is gated to a carry-propagate adder to form the final product. The design of the carry-propagate adder is similar to the one used in the add unit, with the exception that multiply does not require an end-around carry adder. A result register is created by latching the last level of the carry-propagate adder. The output of the result register is gated to the common data bus via the post-normalizer. Detection of the need for post-normalization is done in parallel with the carry-propagate adder, and the result is gated to the common data bus either shifted left one digit or unshifted.

Figure 10 Carry-save adder tree. (Multiples of the multiplicand enter a tree of carry-save adders (CSA); the final sum (S) and carry (C) are shifted right 12 positions on each pass, and go to the carry-propagate adder after five iterations.)

Iterative hardware
The multiply execution area has conflicting design goals. The execution time must be short but the amount of hardware necessary for implementation has a practical upper limit. One could design a multiply unit which would take two cycles for execution. A large tree of twenty-eight carry-save adders could be interconnected so that the multiplicand and the multiplier would be the input to the tree and the output would be the product. The performance of this multiply unit would be acceptable but
Figure 11 Floating-point multiply/divide data flow. (Reservation stations feed the pre-normalizing bit and digit shifters; the multiplier recoder and multiple gates drive the carry-save adder tree, whose sum and carry loop through CSA-E and CSA-F to the propagate adder, result register, and post-shifter. MPR = multiplier, MCAND = multiplicand, DIV = divide, CSA = carry-save adder, TLU = table look-up, CDB = common data bus, FLRB = floating register bus, FLBB = floating buffer bus.)

the amount of hardware necessary for implementation is much too high.

The adopted alternative approach was to select a subset of the carry-save adder tree such that one iteration through the tree retires 12 bits of the multiplier. This iteration is repeated until the full 56 bits of the multiplier have been exhausted. If each iteration is fast enough, the multiply execution time for this method approaches that for the large tree of carry-save adders. In fact, if each iteration can be 20 nanoseconds the second method can execute a multiply in three cycles, and the iterative hardware can be reduced to 20% of that required for the first method. Thus, with an iterative loop, the primary design problem is to design the carry-save adder tree so that the iteration period is minimized. The faster the repetition rate of the iterative hardware, the better the cost-performance ratio of the multiply area.

There are several ways to arrange the carry-save adders, and each method affects the iteration period differently. For example, if they are arranged as shown in Fig. 12, the feedback loop (the partial product) is from the output back to the input. In this case, the iteration period becomes the time required to make one complete pass through the tree. However, the adopted arrangement, shown in Fig. 13, allows the iteration period to approach the delay through the last carry-save adders (these two carry-save adders are accumulating the partial product). But the delay through the path leading to the last two carry-save adders (the multiplier recoding, multiple generation and the first four carry-save adders) is much longer than the delay through the adders. If, however, temporary storage platforms are inserted in the iterative loop, the concept of pipelining, explained earlier, can be put to use here. Temporary storage platforms are inserted in the iterative hardware for deskewing, so that the rate of inserting new inputs (twelve bits of the multiplier) and the rate of accumulating the partial product may safely be made equal. Therefore, by pipelining the carry-save adder tree, the second arrangement can be used and the iteration period is equal to the delay through the last two carry-save adders.

Figure 12 CSA tree with feedback loop from output.

Figure 13 CSA tree with accumulating loop at output.

In order to explain the pipelined tree, the path is abstracted in Fig. 14. Each block represents the logic associated with the stages of the pipeline, and the first level of each block represents the temporary storage platform. The period of the clock is set by the logic delay of the accumulating loop. In the abstract design the logic delay of all paths between stages of the pipeline is assumed to be the same as the clock period.

Figure 14 Abstract drawing of "pipelined" iteration. (Four stages, each entered through a temporary storage platform (TSP); the accumulating loop is at stage four.)

Figure 15 is a timing diagram for the abstracted iterative hardware. At clock time zero, the first input, I1, is gated into the temporary storage in stage one. At clock time one, I1, after being operated on by the logic in stage one, is gated in at stage two and I2 is gated in at stage one. This process continues until at clock time three the original input, I1, is entering stage four. During this clock time the pipeline is filled, i.e., each stage of the pipeline now contains data in various forms of completion. At clock time four, the last input, I5, enters stage one, and the partial product starts to accumulate at stage four. The next three clock times are used to drain the pipeline and accumulate the full partial product. Thus the total iterative loop time is that necessary to fill up the carry-save adder tree plus five passes around the accumulating loop, or eight clock periods. If the feedback loop were from output to input, as shown in Fig. 12, the total iterative loop time would be twenty clock periods. Therefore the iterative loop time has been reduced by a factor of 2.5, with only a small increase in hardware. (This is described later.)

Figure 15 Timing diagram for abstracted iterative hardware. (I1 through I5 are the inputs to stage one; five inputs are needed to complete a multiply.)

The actual implementation of the pipeline is not simple. First, the temporary storage platforms require extra hardware and add delay to the path. Second, the placement of the temporary storage platforms is important for two reasons: (1) the purpose of the temporary storage platform is to deskew the logic (the difference between fast and slow logic paths), and the logic delay is not ideally distributed; and (2) the placement can affect the amount of hardware necessary for implementation.

The solution to the first problem led to a design in which the logic function was designed into the temporary platform, e.g., a latched carry-save adder or a latched multiple gate. The extra hardware is only that required for the feedback loop which latches the logic function; the added delay is eliminated because the logic function is designed into the temporary storage. The solution to the second problem was more complex. First, the clock used to control the temporary storage platform ingate was designed as a series clock. All of the pulses of an iteration are initiated by a single oscillator pulse and then delayed to drive the ingates of the successive pipeline stages. The clock delay between successive temporary storage ingates is equal to the long-path circuit and wiring delay of the logic between these ingates. The time between iterations (the oscillator period) is still the delay of the accumulating loop, but the time between pipeline stages is not equal to the clock period. This allows the placement of temporary storage to vary without being dependent on the clock.

The relationship between the logic skew and clock period can be expressed as

    short path > (long path - clock period) + gate width,

where short path is the shortest logic delay between two temporary storage platforms, long path is the longest logic delay between two temporary storage platforms, and gate width is the time necessary to set and latch the temporary storage platform.

The temporary storage platforms were placed to minimize the hardware; then a careful data path analysis was made to determine the logic skew. The above relationship was next applied and the short paths "padded" with additional delay to satisfy the relationship. The result is shown in Fig. 16. The temporary storage platforms are at the multiplier recoder, the multiple gates, carry-save adder C and the accumulating loop, and carry-save adders E and F.

Since the design goal was to make the iteration period as short as possible, the design of the last two carry-save adders required a minimum number of levels and was constrained to account for the "short path around the loop." Carry-save adders E and F are each designed as a temporary storage platform and are orthogonal, i.e., are not ingated simultaneously. The first, carry-save adder E, is ingated on the first half of the clock period and the second, carry-save adder F, is ingated on the second half of the clock period.

The low-order thirteen bits of the multiplier are gated into the latched multiplier recoder at clock time zero and recoded to six control lines. Every clock period (20 nanoseconds) a new set of bits is gated into the multiplier recoder until the full word (56 bits) is exhausted. The next step in the pipeline is the latched multiple gates. Six multiples are generated by shifting the multiplicand, under control of the output from the multiplier recoder. These six multiples are reduced to four (two sums and two carries) by carry-save adders (CSA) A and B. Carry-save adder C takes three of these outputs and reduces them to two latched outputs. The sum from CSA-B is latched in parallel with CSA-C and combines with the two outputs from CSA-C to provide CSA-D with three inputs. At the output of CSA-D, the sum and carry are the result of multiplying twelve bits of the multiplier and the full multiplicand. The next two latched carry-save adders are used to accumulate the partial product. Each iteration adds the latest sum and carry from CSA-D to the previous results. After five iterations of the accumulating loop the output of CSA-F is the bit product in carry-save form. Now the sum and carry operands are gated to the carry-propagate adder and the carries allowed to ripple to form the final product.

• Divide algorithm

Several division algorithms exist, of varying complexity, cost and performance, which could be used to execute the divide instruction in the Model 91. But because of the relatively complex and iterative nature of divide
algorithms, the execution time is out of balance with other processor functions. Even the higher-performing conventional algorithms contain a shortcoming which requires that successive subtractions be separated by a performance-degrading decode interval.* The Model 91, however, utilizes a unique divide algorithm which is based on quadratic convergence. A major advantage is that the number of required iterations is reduced (proportional to log2 of the fraction length), which reduces the number of data-control interactions. Another important advantage is that multiply is the basic iterative operator. This both reduces the cost, by exploiting existing hardware, and enhances the execution time, because in the Model 91 multiply is extremely fast.

* Conventional refers to previous division algorithms which use subtraction as the iterative operator. The faster algorithms generate more than one quotient bit in parallel through the use of pre-wired multiples. However, the selection of the multiples for the next iteration is dependent upon a decode of the partial remainder of the previous iteration.

The divisor and dividend are considered to be the denominator and numerator of a fraction. On each iteration a factor, Rk, multiplies both numerator and denominator so that the resultant denominator converges quadratically toward one (1) and the resultant numerator converges quadratically toward the desired quotient:

    N/D = (N x R x R1 x R2 x ... x Rn) / (D x R x R1 x R2 x ... x Rn) => N R R1 R2 ... Rn = Quotient,

where N = numerator = dividend, D = denominator = divisor, and D R R1 R2 ... Rn => 1.

The selection of the factor Rk is the essential part of the procedure and is based on the following: The divisor can be expressed as

    D = 1 - x,

where x <= 1/2, since D is a bit-normalized, binary floating-point fraction of the form 0.1xxx... .

Now, if the factor R is set equal to 1 + x and the denominator is multiplied by R,

    D1 = DR = (1 - x)(1 + x) = 1 - x^2,

where x^2 <= 1/4, since x <= 1/2. The new denominator is guaranteed to have the form 0.11xxxx... . Likewise, selecting R1 = 1 + x^2 will double the number of leading 1's on the next iteration to yield

    D2 = D1 R1 = (1 - x^2)(1 + x^2) = 1 - x^4 = 0.1111xxxx... ,

where x^4 <= 1/16, since x <= 1/2. In general, the residual term is squared on each iteration: if Dk = 1 - xk with xk <= 1/2, then Dk+1 = 1 - xk+1 with xk+1 = xk^2 < 1/2. Thus, by continuing the multiplication until xk is less than the least significant bit of the denominator (divisor fraction), the desired result, namely a denominator equivalent to one (0.1111111...111), is obtained.

It is important to note that the multiplier for each iteration is the two's complement of the denominator,

    Rk+1 = 2 - Dk = 2 - (1 - xk) = 1 + xk.

Thus the multiplier for iteration k is formed by taking the two's complement of the result of iteration (k - 1).

However, in this form the algorithm is still not fast enough. For a 56-bit fraction, eleven multiplies are required, with a two's complement inserted between six of the multiplies:

    Q = N R R1 R2 R3 R4 R5.

Figure 16 Multiply iterative loop showing temporary storage. (Multiplier register and ingates, latched multiplier recoder, multiplicand and latched multiple gates, and the carry-save adder tree; a latch is a temporary storage platform.)
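Numerically, the convergence scheme above can be sketched as follows; this is a floating-point model of the algorithm, not the Model 91's truncated, table-lookup-seeded implementation:

```python
# Both numerator and denominator are multiplied by R = 2 - D on each
# pass; the denominator converges quadratically toward 1 and the
# numerator toward the quotient.
def converging_divide(n, d, iterations=5):
    assert 0.5 <= d < 1.0            # bit-normalized divisor, 0.1xxx...
    for _ in range(iterations):
        r = 2.0 - d                  # two's complement of the denominator
        n, d = n * r, d * r
    return n

q = converging_divide(0.4375, 0.625)
assert abs(q - 0.4375 / 0.625) < 1e-12
print(q)
```

Because the residual squares on each pass, five iterations drive the error term x^32 below the precision of a 56-bit fraction for any bit-normalized divisor with x well below 1/2; the hardware's table-lookup seed guarantees the same bound in fewer effective bits per multiply.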
Table 2 Multiplier recoder rules.

  Input             Output      Reason
  n* (n+1) (n+2)    multiple

  0   0   0          0          No string
  0   0   1         +2          End of string
  0   1   0         +2          Beginning and end
  0   1   1         +4          End of string
  1   0   0         -4          Beginning of string
  1   0   1         -2          Beginning and end
  1   1   0         -2          Beginning of string
  1   1   1          0          Center of string

  * Bit n is the high-order position.

But if the number of bits in the multiplier could be reduced, the time for each multiply would be decreased. If, in order to obtain n bits of convergence, the multiplier is truncated to n bits [1 + x_T, where (x_T - x) < 2^-n], it can be shown that the resultant denominator is equivalent to

    (1 + x_T)(1 - x) = 1 - x^2 + T(1 - x),

where 0 < T (which is due to truncation) < 2^-n. Because the additional term T is always positive, the resultant denominator can now have two forms:

    D_k = { 0.11111 xxxxx ...
          { 1.00000 xxxxx ...

The denominator can converge toward unity from above or below, but it will converge, so no additional problems are encountered.

Therefore, the number of bits in the multiplier can be reduced to the string bits (all 0 or all 1) and the number of bits of convergence desired. The string bits, since they are all 0 or all 1, can be skipped in the multiply. Thus the multiply time has been improved considerably and so, consequently, has the divide time. To improve the initial minimum string length, thus reducing the number of iterations, the first multiplier, R, is generated by a table-lookup which inspects the first seven bits of the divisor. The first multiply guarantees a result which has seven similar bits to the right of the binary point (1 ± x has the form a.aaaaaaa ... etc.).

The following sequence outlines the operations which result in the execution of a divide:

1. Bit normalize the divisor and shift the dividend accordingly.
2. Determine the first multiplier, R, by a table-lookup.
3. Multiply D by R forming D_1.
4. Multiply N by R forming N_1.
5. Truncate D_k and complement to form R_k.
6. Multiply D_k by R_k forming D_(k+1).
7. Multiply N_k by R_k forming N_(k+1).
8. Iterate on 5, 6 and 7 until D_(k+n) ≈ 1; then N_(k+n) = quotient.

• Divide implementation

Each iteration of divide execution consists of three operations, as shown above. The problem in implementation is to accomplish these three operations utilizing the multiply hardware described previously, and to accomplish them in the minimum amount of time. But there are three points which create difficulty. First, the multiplier is a variable-length operand, the length being different on each iteration. The first multiplier, determined by table-lookup, is ten bits and yields a minimum string length of seven; the second multiplier is fourteen bits; the third multiplier is twenty-eight bits, etc. In other words, the minimum string length can be doubled on each iteration after the first. Second, the result of one iteration is the multiplicand for the next iteration. Since the output of the multiply iterative hardware is two operands, carry and sum, the carry propagate adder must be included in the divide loop. Third, two multiplies are required in each iteration: one determines what to do on the next iteration (multiplier × denominator) and one converges the numerator towards the quotient (multiplier × numerator).

When all three of these points are considered simultaneously they present a dilemma. Since two multiplies are necessary it is desirable to overlap the two and save time, but any multiply for which the multiplier is greater than twelve bits requires that the carry-save adder loop be used. Also, the fact that the carry propagate adder must be included in the loop lengthens the time for each iteration. Several design iterations were required before arriving at the correct solution.

First consider the entries in Table 2 and note that the leading string of 1's or 0's in the multiplier can be skipped, since those bits result in a zero multiple out of the multiplier recoder. Also, if the input of the multiplier recoder is complemented, the sign of the output changes but the magnitude remains the same. Thus, this property can be used to produce ∓x_k at the output of the recoder.

Next consider a multiplier (complement of truncated denominator) such as the following:

    1.0000 0000 0[000 xxxx xxx1]
    0.1111 1111 1[111 xxxx xxx1]

If all positions were recoded, a bit of value 1 would be recoded from the high-order end and a set of bits of value ∓x_k from the right end (1 ± x_k).
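The recoder rules of Table 2 can be sketched directly. In the following Python sketch the triplet-to-multiple mapping is Table 2's; the bit-list representation, the pair-wise scan, and the round-trip check are illustrative assumptions rather than a model of the Model 91 hardware.

```python
# Radix-4 recoding per Table 2: scan a two's-complement multiplier in
# pairs (high-order first), with one look-ahead bit, producing multiples
# in {0, +/-2, +/-4}.  A zero multiple falls out for every pair inside a
# run ("string") of identical bits, which is why leading strings of 1's
# or 0's cost nothing in the multiply.

RECODE = {  # (bit n, bit n+1, bit n+2) -> multiple
    (0, 0, 0): 0,   # no string
    (0, 0, 1): +2,  # end of string
    (0, 1, 0): +2,  # beginning and end
    (0, 1, 1): +4,  # end of string
    (1, 0, 0): -4,  # beginning of string
    (1, 0, 1): -2,  # beginning and end
    (1, 1, 0): -2,  # beginning of string
    (1, 1, 1): 0,   # center of string
}

def recode(bits):
    """Multiples for an even-length bit list (most significant bit first)."""
    padded = list(bits) + [0]          # look-ahead past the low-order end
    return [RECODE[tuple(padded[i:i + 3])] for i in range(0, len(bits), 2)]

def recoded_value(bits):
    """Reassemble the two's-complement value from the recoded multiples."""
    L = len(bits)
    total = sum(m * 2 ** (L - 2 - i)
                for i, m in zip(range(0, L, 2), recode(bits)))
    return total // 2                  # total is exactly twice the value

# Every 8-bit two's-complement value survives the recode/decode round trip.
for v in range(-128, 128):
    bits = [(v >> (7 - j)) & 1 for j in range(8)]
    assert recoded_value(bits) == v
```

The zero multiples for the 000 and 111 triplets are what make a leading string free: every pair wholly inside the string recodes to nothing, so only the string boundary and the live low-order bits generate operands.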

ANDERSON, EARLE, GOLDSCHMIDT AND POWERS

Table 3 Formats of the denominators and their multipliers.

D                  0.1xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx 0000 0000

R                  01.xxxx xxxx 0                          Determined by table lookup of denominator

D × R = D_1        { 0.1111 111x xxxx xxxx xxxx xxxx ...
                   { 1.0000 000x xxxx xxxx xxxx xxxx ...

R_1                { 1.0000 000x xxxx xx11                 Determined by complementing denominator
                   { 0.1111 111x xxxx xx11

D_1 × R_1 = D_2    { 0.1111 1111 1111 11xx xxxx xxxx ...
                   { 1.0000 0000 0000 00xx xxxx xxxx ...

R_2                { 1.0000 0000 0000 00xx xxxx xx1
                   { 0.1111 1111 1111 11xx xxxx xx1

D_2 × R_2 = D_3    { 0.1111 1111 1111 1111 1111 111x xxxx ...
                   { 1.0000 0000 0000 0000 0000 000x xxxx ...

R_3                { 1.0000 0000 0000 0000 0000 000x xxxx xx1
                   { 0.1111 1111 1111 1111 1111 111x xxxx xx1

D_3 × R_3 = D_4    { 0.1111 1111 1111 1111 1111 1111 1111 1111 xxxx xxxx ...
                   { 1.0000 0000 0000 0000 0000 0000 0000 0000 xxxx xxxx ...

R_4                { 1.0000 0000 0000 0000 0000 0000 0000 0000 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx
                   { 0.1111 1111 1111 1111 1111 1111 1111 1111 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx

D_4 × R_4 = D_5    { 0.1111 1111 1111 ... 1111 1111        (not formed)
                   { 1.0000 0000 0000 ... 0000 0000

Short precision divide result is N_3 = N·R·R_1·R_2.

Long precision divide result is N_5 = N_4·R_4.
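The multiplication chain that Table 3 summarizes can be sketched in exact rational arithmetic. In this sketch the truncation widths are illustrative stand-ins, and the table-lookup first multiplier is replaced by a plain complement, so the convergence is slower than the hardware's but the structure of steps 5 through 8 is the same.

```python
from fractions import Fraction

def trunc(x, bits):
    """Truncate a positive Fraction to `bits` fractional bits."""
    scale = 1 << bits
    return Fraction(int(x * scale), scale)

def goldschmidt(n, d, widths=(10, 14, 28, 56, 60)):
    """Divide by convergence: multiply numerator and denominator by
    R_k = 2 - trunc(D_k) (truncate and complement) until D_k is near 1;
    the numerator is then the quotient.  Assumes d is bit-normalized,
    1/2 <= d < 1.  The widths are illustrative, not the Model 91's."""
    N, D = Fraction(n), Fraction(d)
    for bits in widths:
        R = 2 - trunc(D, bits)   # R_k from the truncated denominator
        D, N = D * R, N * R      # D converges quadratically toward 1
    return N

# 0.6/0.7 = 6/7; after five iterations the error is far below 2**-40.
q = goldschmidt(Fraction(6, 10), Fraction(7, 10))
assert abs(q - Fraction(6, 7)) < Fraction(1, 2 ** 40)
```

Because truncation makes each R_k slightly large, intermediate D_k values oscillate just above and below unity, exactly the two forms shown in the table; the quotient still converges since the error term squares on each pass.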

However, if only the portion in brackets is gated to the recoder, the output will have value ∓2^-12·x_k.* The bits in the bracket are chosen such that the left-most three bits are identical. Thus, multiple six (refer to Fig. 9) is not used because a zero multiple is always recoded, and the product ∓D_k·x_k or ∓N_k·x_k is accomplished by the five operands gated to multiple gates one through five. If the unshifted multiplicand is gated simultaneously into the sixth multiple gate, the sum of all six operands is D_k + (∓D_k·x_k), or D_k(1 ∓ x_k), which is the desired result. The result which is generated by adding the carry and sum out of the carry-save adder tree (refer to Fig. 11) is the following:

    D_(k+1) = { 0.1111 1111 1111 1111 1111 111x xxxx
              { 1.0000 0000 0000 0000 0000 000x xxxx

Thus, without using the carry-save adder loop, the leading string has been increased by nine bits.

* The multiplicand is shifted right twelve positions to compensate for the 2^-12 factor.

Table 3 presents the format of the multipliers and their denominators. Notice that the first multiplier is ten bits and the second is seven. These are fixed and cannot be changed without making the table-lookup decoder larger. Thus the third multiplier is the first one capable of using more than nine bits. But if a multiplier of more than nine bits is used, the carry-save adder loop must be included in the divide loop. Since this is undesirable (concurrency among multiplies is discussed below), the multiplier for the third and fourth iterations is chosen to be nine bits, thereby increasing the string length by nine each time. Thus, D_4 has 32 leading 1's or 0's. Now if D_4 is multiplied by multiplier four, R_4, the result will have 64 leading 1's or 0's, which is equivalent to unity within the desired accuracy. Therefore, since it is not necessary to calculate multiplier five, R_5, this multiply is not done, and since only the numerator is going to be multiplied by multiplier four, the carry-save adder loop is used to speed up this last operation. (This is discussed more fully below.)

The second difficulty, which was that the carry propagate adder must be included in the path, was used to solve the third difficulty. Consider Fig. 17, which is the divide loop. To begin the execution of a divide, the divisor is multiplied by the first multiplier (R), and the first denominator (D_1) is generated at the output of the CSA tree. These two outputs are added in the carry propagate adder; the output loops back to the input and becomes the new multiplicand; the truncated and complemented output forms the new multiplier. Note that the complete loop contains two temporary storage platforms, one at CSA-C and one at

the output of the propagate adder, the result latch. Thus, as soon as R × D is gated into CSA-C, the next multiply, R × N, can be started. Now R × D advances to the result latch and loops back to start the next multiply, R_1 × D_1. At this time R × N, which is latched in CSA-C, advances through the adder to the result latch. So the two multiplies follow each other around the divide loop. The first determines what the second should be multiplied by to converge eventually to the quotient.

This chain continues until multiplier four has been calculated. Since denominator five is equivalent to one, the multiply is not done. The 32-bit multiplier is gated into the reservation station and then gated to the multiplier twelve bits at a time as shown in Table 3. The result of this multiply, N_4·R_4, is the final quotient. The diagram in Fig. 18 shows the concurrency in the divide loop. The multiplier recoder latch is changed each time a denominator multiply is completed. Notice that two multiplies are always in execution, one in the first half of the divide loop (from input to CSA-C) and one in the second half of the divide loop (from CSA-C to the result latch).
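The chase of the two multiplies around the loop can be sketched as a two-stage pipeline. The operation labels and single-cycle stages below are schematic assumptions (the real loop stages take several clocks), not a cycle-accurate model of Fig. 18.

```python
# Two multiplies chase each other around the divide loop: stage 1 is
# input -> CSA-C, stage 2 is CSA-C -> result latch.  Each denominator
# multiply is immediately followed by the matching numerator multiply.
ops = ["DxR", "NxR", "D1xR1", "N1xR1", "D2xR2", "N2xR2", "D3xR3", "N3xR3"]

schedule = []
for t in range(len(ops) + 1):
    stage1 = ops[t] if t < len(ops) else None      # first half of loop
    stage2 = ops[t - 1] if t >= 1 else None        # second half of loop
    schedule.append((stage1, stage2))

for t, (s1, s2) in enumerate(schedule):
    print(f"step {t}: first half = {s1}, second half = {s2}")

# From the second step until the chain drains, both halves are busy,
# which is the two-multiplies-always-in-execution property of the text.
assert all(s1 and s2 for s1, s2 in schedule[1:-1])
```

The point of the sketch is the offset: a denominator multiply in the second half of the loop overlaps the numerator multiply behind it in the first half, so the carry propagate adder latency is hidden rather than added to every iteration.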

Figure 17 Divide loop.

Conclusions

The prime effort during the design of the floating-point execution unit was to develop an organization which would achieve a balance between instruction execution and preparation. Early in the design phase it appeared that an organization which would achieve this result would have a poor cost-performance ratio.

Figure 18 Timing diagram showing concurrency in divide loop. [Diagram: across clock time, divide iterations DIV 1 through DIV 5. The multiplier recoder holds R, then R_1, R_2, R_3 in turn. The first half of the divide loop executes D × R, N × R, D_1 × R_1, N_1 × R_1, D_2 × R_2, N_2 × R_2, D_3 × R_3, N_3 × R_3 back to back; the second half of the divide loop follows one stage behind; the final multiply uses the CSA loop and delivers the quotient.]

Concurrency, obviously, had to be the key to high performance, but the connotation of concurrency in computers is parallel execution of different instructions. Thus the early organizations exhibited more than one execution unit and a high cost. In the final organization, concurrency is the key to the high performance, but this organization exhibits several levels of concurrency:

1. Concurrent execution among instruction classes.
2. Concurrent execution among instructions in the same class (add unit).
3. Concurrent execution within an instruction (multiply iterative hardware and divide loop).

The concepts of instruction-oriented units and reservation stations were used to keep the performance level sufficiently high but reduce the cost. These two concepts yield the same performance as several units without the cost of several units. The instruction-oriented units allow the design to be hand-tailored for faster execution and permit the use of a unique algorithm to execute divide.

Acknowledgments

The design of a computer unit such as this, containing nearly as many logical decisions as IBM's previous largest central processor, requires a great deal of decision making. The authors gratefully acknowledge the logical and engineering design contributions made by the following individuals: Mr. W. D. Silkman for the floating-point instruction unit; Messrs. J. J. DeMacedo, J. G. Gasparini, L. Grosman, R. C. Letteney and R. M. Wade for the multiply/divide unit; Messrs. M. Litwak, K. J. Pockett and K. G. Tan for the add unit; and Mr. E. C. Layden for the processor clock.

Acknowledgment is also made for the early planning efforts of Mr. R. J. Litwiller.

References

1. W. Buchholz et al., Planning a Computer System, McGraw-Hill Publishing Co., New York, 1962.
2. G. M. Amdahl, G. A. Blaauw and F. P. Brooks, Jr., "Architecture of the IBM System/360," IBM Journal 8, 87 (1964).
3. D. W. Anderson et al., "Model 91 Machine Philosophy and Instruction Handling," IBM Journal 11, 8 (1967) (this issue).
4. R. M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal 11, 25 (1967) (this issue).
5. R. F. Sechler, A. K. Strube and J. R. Turnbull, "ASLT Circuit Design," IBM Journal 11, 74 (1967) (this issue).
6. O. L. MacSorley, "High Speed Arithmetic in Binary Computers," Proc. IRE 49, 67 (1961).
7. R. E. Goldschmidt, "Applications of Division by Convergence," Master's Thesis, MIT, June 1964.
8. C. S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Trans. EC-13, 14-17 (1964).
9. M. V. Wilkes et al., Preparation of Programs for an Electronic Digital Computer, Addison-Wesley Publishing Co., Cambridge, Mass., 1951.
10. T. C. Chen, "Fast Division Scheme," private communication, November 4, 1963.

Received November 1, 1965.
