0 ratings 0% found this document useful (0 votes) 108 views 52 pages Unit 2
This chapter discusses the design of bit-level architectures for addition and multiplication in DSP algorithms, focusing on three implementation styles: bit-parallel, bit-serial, and digit-serial systems. It covers the design methodologies for bit-parallel and bit-serial multipliers, digital filter design, and vector-vector multiplication schemes, emphasizing the use of two's complement representation. The chapter also explores various multiplication techniques, including carry-ripple, carry-save, Baugh-Wooley, and modified Booth recoding to enhance performance and efficiency.
AI-enhanced title and description
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content,
claim it here .
Available Formats
Download as PDF or read online on Scribd
Go to previous items Go to next items
13
Bit-Level Arithmetic
Architectures
13.1 INTRODUCTION
‘This chapter addresses the design of bit-level architectures for addition and
multiplication frequently encountered in DSP algorithms. Three implementa-
tion styles, bit-parallel, bit serial, and digit-serial are addressed. Bit-paralet
systems process one whole word of the input sample each clock cycle and are
ideal for high-speed applications. Bit-serial systems process 1 bit of the in-
put sample every clock cycle. These systems can be synthesized using integer
linear programming based scheduling approach (see Appendix F). Bit-serial
systems are area-efficient and suitable for low-speed applications [1] ~(3). Bit-
serial arithmetic is used for the implementation of data-flow algorithms of
medium complexity and low to medium data rate, whereas bit-parallel op-
erators may be used for the implementation of data-flow algorithms of low
complexity and high data rate. Digit-serial systems [4J,5] process multiple
number of bits (referred to as digit-size) every clock cycle and are best suited
for applications requiring moderate sample rate, where area and power con-
sumption are critical
This chapter considers the design of bit-parallel and bit-serial multipliers,
bit-serial digital filter design and implementations, and bit-level implemen-
tation schemes for vector-vector multiplications, ‘The major emphasis is on
architecture design based on design methodologies for mapping algorithms to
arithmetic architectures at bit-evel. Basic knowledge of computer arithmetic,
such as number representations, and addition and multiplication algorithms
are assumed, as they are covered in many VLSI design and computer arith-
‘metic books [6}-[8]. In this chapter, all numbers are assumed to be repre-
ar478 BIT-LEVEL ARITHMETIC ARCHITECTURES
sented in Bxed-point two's complement representation. A W-bt number A is
represented as:
A= ay-14y-2°--0109 (aay
where the bits aj, 0 < i < W — 1, are either 0 or 1, and the most significant
bit (msb) aw-. is the sign bit with value of 0 denoting positive number and
1 denoting negative number. ‘The value of this number is in the range of
[-1,1- 2-1] and is given by:
4 (13.2)
‘A main advantage of two's complement arithmetic is the ability to generate
8 correct final result in spite of intermediate overflow. ‘Thus, in a chain of
additions or subtractions, the final result is correct if itis known to lie in the
range [~(1~2-*1), 1-2-1] even though intermediate operations lead to
overfiow. For bit serial implementations, constant wordlength multipliers are
considered, where a constant wordlength of W-bit is maintained for all the
signals in the architecture. Although a W x W-bit multiplication generates
(2W ~1)-bit product, only the W’ most-significant bits are retained.
This chapter is organized as follows: Section 13.2 discusses bit-parallel
multiplier design, including carry-ripple array multiplication, carry-save ar-
ray multiplication, Baugh-Wooley multiplication, and Booth-recoded multi-
plications. The two efficient layout strategies, interleaved floor-plan and bit-
plane, for digital filters are discussed in Section 13.3. Design of bit-serial
multipliers using systolic mapping as well as Horner's rule are addressed in
Section 13.4, Section 13.5 addresses the design methodologies for bit-level
pipelined bit-serial FIR and IIR. filters with constant coefficients. This sec-
tion also addresses issues related to delay management and synchronization
in bit-serial implementations. Section 13.6 addresses the design of low-cost,
high-speed constant multipliers using the canonic signed digit (CSD) repre
sentation. Section 13.7 addresses the bit-level implementation schemes for
vector-vector multiplication based on the distributed arithmetic approach.
13.2 PARALLEL MULTIPLIERS
‘This section considers multipliers that can perform two's complement multi-
plication in time O(W7) using regular structures, including multiplication with
sign extension (derived from Horner’s rule) and Baugh-Wooley multiplication
[9]. Two regular implementation styles, carry-ripple and and carry-save array,
are introduced.
It has been proved that multiplications cannot be performed in a time
smaller than O(log,W). Such a bound is attainable by the binary-tree andPARALLEL MULTIPLIERS 479
Wallace-tree multipliers (see Appendix E) {10]. However, their corresponding
architectures are very iregular.
Let the multiplicand and multiplier be A and B:
wa
A= aW=[Link]-2°* 4109 = -aw-1 + D> aw-
B= bwar dena s-biby = ~bwait D> bw,
a
where -1 < A,B <1 and the case when both A and B is equal to ~1 is
excluded. The value of their product: P= A x B is given by:
(13.3)
‘where the radix point is to the right of the msb pay-2. Since the product of
2 numbers within [1,1 is also in this range, all higher order bits tothe let
of bit-position 2W/— 2 in the result are ignored
In constant wordlength multiplication, the W ~ 1 lower order bits in the
product P are discarded, and the product is denoted as X «= P = Ax B,
where
w.
Xa swat Dawa
(13.4)
‘The product X is used when a constant wordlength multiplier is considered
and P is used when a full-precision product is required.
13.2.1 Parallel Multiplication with Sign Extension
Using Horner's rule, multiplication of A and B can be written as
P= Ax (-bwart YD bw)
=~ Asbwan $[A-bwaa + [Abas + ft (Ash
Asda to". otart, (13.5)
where 2! denotes a scaling operation. In two's complement system, negating
‘a number is equivalent to taking its one’s complement and then adding a 1 to480 BIT-LEVEL ARITHMETIC ARCHITECTURES
eat ei
ibe aab9 abo dodo
ash, ay: agby
aap aby ab dob
—Tyby Tbs Tbs
Popes ae ent
Fig. 13.1. Tabular form of bit-level array multiplication
ed
the least significant bit (Isb) position, as shown below:
wet wa
= awart Slaw 2
ot mm
wa
= awit (1 awa2t— 142-74
t
a CSC
arta ar WH,
(13.6)
‘where (1 — a;) is the one’s complement of a; and denoted as @, for 0 Brongn
ves aout
> Singlet arate:
‘Coat ¢ Sout = intin «Sin Cin
Fig. 13.4 Parallel cazry-ripple array multiplier.
to be added, but the partial sum and partial carry. The addition of partial
sum and partial carry is performed by a vector merging adder (VMA). The
dependence graph of bit-level carry-save multiplication is shown in Fig. 13.5
A parallel carry-save array multiplier can be designed from this dependence
graph, as shown in Fig. 13.6, which can be pipelined at bit-level by placing
delay elements across the cutsets shown in dotted lines in the figure.
‘The VMA can be implemented either as a 4x4 carry-save array as shown in.
Fig. 13.7(a) which contains only half-adders and has a more regular and easier-
to-pipeline structure, but requires more half adders; or as a 4-bit carry-ripple
adder as shown in Fig. 13.7(b). When bit-level pipelining is not required,
the carry-ripple and carry-save principles can also be combined into a single
structure as shown in Fig. 13.7(¢), where the carry output can ripple through
‘at most 1 bit in the same row, and the resulting carry from each carry-ripple
portion is passed on to the next row like the carry-save architecture. The
number of adders required in this combined vector merging part is between,
that of the carry-save and the carry-ripple vector merging structures.484 BIT-LEVEL ARITHMETIC ARCHITECTURES
Vector Merging Adder
TT df
po fed a 0)
Fig. 13.5 Dependence graph for the 4 x 4-bitearry-save array multiplication
13.2.2. Baugh-Wooley Multipliers
‘The difficulty of two's complement multiplication lies in handling the sign
bits of the multiplicand and multiple. An efficient way to overcome this
problem is provided by the Baugh-Wooley multiplication algorithm [9]. This
‘multiplication algorithm is illustrated by the multiplication table shown in
Fig. 13.8 for a 44-bit multiplication, where all the bits have positive weight,
and can be accumulated directly. The derivation of this algorithm is left as
an exercise at the end of this chapter (see Problem 4).
‘The Baugh-Wooley multiplication table in Fig. 13.8 can be implemented
either as a carry-ripple array or a carry-save array. A carry-save Baugh-
Wooley multiplier is shown in Fig. 13.9, where P = [Link] is the
product with full precision. Note that the product of two 4-bit numbers in
the range (~1,1) lies in the same range. Since the radix point is to the
Tight of the bit ps, any higher order bits are ignored. The truncated result is
represented as X'= 23.2720.
13.2.3 Parallel Multipliers with Modified Booth Recoding,
Multiplication involves two basic operations: the generation of partial prod-
ucts and their accumulation. Consequently, there are two ways to speed up
the multiplication: reduce the number of partial products or accelerate their
accumulation. Carry-saveaddition can be used to accelerate the accumulationale
PARALLEL MULTIPLIERS 485
> Broads signas
outbis ata
> Slagle Palade
[Link]+ Sou =sinthin + Sin+Cin
Fig. 13.6. Parallel carry-save array multiplies.486 BIT-LEVEL ARITHMETIC ARCHITECTURES
ication: (a) carry-
‘save vector merging, (b) cerryripple vector merging, (c) combined carry-save and
carry-ripple vector merging,
a
fy by
GiB aby aby anby
GE ashy ab; agby
Gb ashy aide aba
azby abs axby obs
1
Fig. 13.8 Tabular form of bit-level Baugh-Wooley multiplication.PARALLEL MULTIPLIERS 487,
og 85
Fig. 12.9 A4 x 4-bit Baugh-Wooley carry-save multiplier.
of the partial products, To reduce the number of partial products, a straight-
forward approach is to examine 2 or more bits of the multiplier at a time.
‘The reduction in the number of partial products can reduce the latency of the
multiplication operation. However, this requires the generation of multiples
A, 2A, 3A, ete., where A is the multiplicand.
‘The number of partial products can be reduced by the modified Booth
recoding. This algorithm is based on the fact that fewer partial products need
to be generated for groups of consecutive zeros and ones. For a group of
consecutive zeros, there is no need to generate any new partial products. For
‘a group of, say m, consecutive ones in the multiplier, ie.,
{11-1}. = +++ 1(00++-0}0---—--0{00---1}0-++
= e100 -TYO---
instead of m partial products, only 2 partial products need to be generated
(13.8)488 BIT-LEVEL ARITHMETIC ARCHITECTURES
Table 13.1. Radix-4 Modified Booth Recoding Algorithm
Sint bas | rims [| Operation | Comments
Ogata |e Onn 0 40 airing of zeros
o of 1 fal +4 end of 1's
o 1] o fal +4 single 1
o 1] 1 fa] 42a end of I's
1 0} 0 2) ~24 — | beginning of 1's
1 of 1 fal -a single 0
1 1] 0 ft} -A | beginning of 1's
eee) ieee) 0 string of 1's
if signed-digit representation is used. Hence, in this multiplication scheme,
the multiplier bits are first recoded into signed-digit representation with fewer
number of nonzero digits; the partial products are then generated using the
recoded multiplier digits and accumulated.
In the modified Booth recoding algorithm, the signed-digit set {~2,~1,0,1,2}
is used, and itis, therefore, also referred to as 5-level Booth recoding. In this
recoding algorithm, the multiplier bits ésis1 and ba; are recoded into signed
digit Of, with bye-1 serving as a reference bit. For example, an &-bit number
can be recoded as follow
br bs by by by by by by bs
=< 2 (13.9)
wooo
ee
is appended after the Isb of this multiplier as a reference bit.
‘The modified Booth recoding algorithm to generate 6 from byi41, bri, and
byi-1 is shown in Table 13.1
{A simple way to describe the recoding operation isto calculate
: B= Dhan + bv inn (43.10)
‘The reader can verify that
= eer) (13.11)
‘The Booth recoded multiplier consists of 3 parts, the recoding (control ct-
cuitry for multiplier bits, the partial product generation, and accumulation.
Design of fast multipliers using S-level modified Booth recoding is left as anINTERLEAVED FLOOR-PLAN AND BIT-PLANE-BASED DIGITAL FILTERS 489
exercise at the end of this chapter (see Problem 11). Conventional Booth re-
coding circuitry has race problem due to the unbalanced paths from inputs to
outputs and leads to higher power consumption. A low-power Booth-reooding,
circuitry is presented in {11}, which reduces glitches and power consumpti
by using a redundant recoding scheme. Design of low-power Booth recoding,
circuitry is explored in Problem 12.
13.3 INTERLEAVED FLOOR-PLAN AND BI
DIGITAL FILTERS
PLANE-BASED.
‘This section is concerned with efficient implementation and floor-plan tech-
niques for bit-parallel FIR digital filters. These techniques can be adapted for
implementations of IR filters as wel.
Consider the constant-coefiicient FIR filtering operation
un) = a(n) + f-2(n-1)+9-2(n-2), (13.12)
where 2(n) is the input signal, and f and g are filter coefficients. Assume the
signal and coefficient wordlengths to be 6 and 4 bits, respectively, and both.
are represented in two's complement format.
An interleaved approach (12] for high-speed filtering is presented. ‘The main
idea behind the interleaved approach is to perform the computation and ac-
‘cumulation of partial products associated with f and g simultaneously, which
leads to dramatic reduction in the routing complexity. Another advantage
is that since the truncation is performed only at the final step, it also leads
to better accuracy. The multiplication chart, based on the Baugh-Wooley
algorithm, for computing the output y(n) using this interleaved approach is
shown in Fig. 13.10, which is the result of interleaving of 2 parallel Baugh-
Wooley multiplications. Note that only the most significant 4 bits of 2(n) are
considered since a 6-bit result is desired at the output
‘The final interleaved architecture is shown in Fig, 13.11, where the vector
merging portion is a combination of carry-ripple and carry-save implemen:
tations, which permits carry-ripple operation under the constraint that the
architecture can be pipelined at 2-bit level (the pipelining latches are not
shown in Fig. 13.11):- As a result, ater carry has rippled through 2 adders, it
is saved and passed to the adder in next row.
If the coefficients are interleaved in such a way that their partial product
terms are computed in different rows, the resulting architecture is referred to
a8 the bit-plane architecture (13] and is shown in Fig. 13.12, where a row of
flip-flops has been added after every row of adders in the array thereby en-
abling processing at high speed. In order to permit the interleaving operation,
the direction of the sum and carry-out signals alternate. If the digital Slter
implementation needs to be pipelined at p-bit-level, then each bit-plane in the490 BIT-LEVEL ARITHMETIC ARCHITECTURES
ae ate eal — bit postion
feed) erg) fess
OKT OKS OKT
fxs x,
WxS OVXE,
ta
Xe
tae
9xS OKT
1
Yee Ye Ve We Me
Fig. 13.10 Multiplication chart for the illustration of the interleaved approach. Here,
‘2c represents 24(n), 2; represents 2y(n ~ 1), and 2) represents x(n ~ 2)
Dit-plane architecture can be replaced by a modified bit-plane where each bit
is replaced by a group of p consecutive bits. For p = 2, the modified bit-plane
architecture can be designed using the multiplication chart of Fig. 13.13.
13.4 BIT-SERIAL MULTIPLIERS
‘This section addresses the derivation of Lyon’s bit-serial multiplier [14] us-
ing Horner’s rule. Several other bit-serial multipliers are then derived using
systolic mapping techniques.
13.4.1 Design of Lyon's Bit-Serial Multipliers Using Horner's Rule
‘The multiplication rule in (18.5) can be used to derive bit-serial multipli-
ers. The architecture for_a 4 x 4-bit bit-serial multiplication is shown in
Fig. 13.14(a), where the [2-*]is a bit-serial zero-latency scaling operator and
its functionality is illustrated in Fig. 13.14(b). For a bit-serial zero-latency
system, the first output bit needs to be generated in the same clock cycle as the
first input bit entering the system. For the scaling operator, the first output
bit a; should be generated at the same time instance when the first input ay
enters the operator. Since input a; has not entered the system yet, the scaling
operator is a noncausal or advance operation, and cannot be implemented in
hardware.[BIT-SERIAL MULTIPUERS 491
to) dn) 10) 5 apt!
wo)
se)
x0)
io)
vio)
vale)
yo
veto)
Fig. 13.11 Aschitecture for the multiplication chart shown in Fig. 18.10.492 BIT-LEVEL ARITHMETIC ARCHITECTURES
to
Ye
92
fs
vy
Falh [ral [ralh [Fal
Ye
Ye
Fig. 13.12 Bit-plane architecture for the multiplication chart shown in Fig. 13.10.BIT-SERIAL MULTIPLIERS 493,
— bit postion
tax ek
ty
XS 9X7
OX, ONT GiXT ORE
Te fax axa text
ti Te Te Te
ORT ONT XT 0X7
Voces Yee
@
soup Fy] eeail
»
Fig, 13.16 (a) Bit-eerial two's complement multiplication derived from Horner's rule.
‘This architecture is not implementable due to the presence of the scaling operation.
(b) The scaling operator is zero-latency advance operation.496 BIT-LEVEL ARITHMETIC ARCHITECTURES
witches
Fig. 13.15. Derivation of implementable bit-serial two's complement multiplier.
It is possible to implement the architecture in Fig, 13.14(a) using pipelining
and placing delay elements at the feed-forward cutsets shown in Fig. 13.15.
This combination of a delay element and a zero-delay scaling operator can
now be realized in hardware. In Fig. 13.15, switching elements $0, S1, and
‘$2 replace the combined scaling operators and delay elements at, thelr cor-
responding positions, respectively. Their switching time instances can be
found by scheduling the operations in the array multiplication as shown in
Fig. 13.16, where the number in the box to the right of each computation
indicates the time instance when this computation can be performed. Note
that the switches eliminate the Isb of the current partial remainder and ex-
tend the sign of the previous partial remainder at the same time. Therefore,
scaling operators are also referred to as bt-repeatersas these repeat the msb.
‘The multiplier bits are input at time O to the corresponding AND gates; the
rmultiplicand bits are input bit-seriallyIsb-first. ‘The fist bit (Isb) of the 4
bit product is generated at time instance 3. The complete diagram of the
bit-sorial two's complement multiplier is shown in Fig. 13.17. This multiplier
is also refered to as Lyon's multiplier (14). This system has a latency of 3
clock cycles. If the time for AND and multiplexing operations are ignored,
‘the eritical path computation time of this bit-serial multiplier is equal to 37,
‘where Ty is the time for single-bit addition. Generally, a W x W-bit bit-BIT-SERIAL MULTIPLIERS 495
a
wa] orks] res) a
tabs{6] aabs[5]
as) es] ala] ada]
Fig. 13.16 Scheduling instances for operations in bit-srial two's complement array
multiplication.
weet ae) SST el
Fig. 13.17 Lyon's bit-erial two's complement multiplier.
serial multiplier has a latency equal to W — 1 clock cycles with the cycle time
Tat 2 (W - 1)Tq. Pipelining can be used to reduce the critical path and
hence the clock period. For example, by placing delay elements across the
feed-forward cutsets (shown by dashed lines) in Fig. 13.17, the critical path of
the 4 x4 bit-serial multiplier is reduced to 1 single-bit addition at the expense
of an increase in the latency. ‘The critical path of a W-bit Lyon's multiplier
can be reduced to 27, without introducing extra pipelining latches, by using
associativity and rétiming,(15] (see Problem 18)
13.4.2 Design of Bit-Serial Multipliers Using Systolic Mappings
Bit-serial multipliers can be designed by using systolic mapping methodology
(see Chapter 7) of the 2D DGs to one-dimension (1D) array,
Example 13.4.1 This example considers the design of Lyon's bit-serial mul-
tiplier by systolic mapping using the DG of ripple-carry multiplication in496 BIT-LEVEL ARITHMETIC ARCHITECTURES
302) Py
Fig, 13.18. Bit-serialcarry-ripple multiplier in Example 13.4.2.
Fig. 13.9. Using the projection vector a”
sf = [1 I], and the processor space vector
ing edge mappings:
1 0}, the scheduling vector
(0 1], we have the follow-
pre [ate
a@ 1)
PG.)
carry (1, O)
i |
With additional sign-eztension cireutry, the resulting bit-seril multiplier in
Fig. 13.17 is obtained. Bit-level pipelined version of Lyon's multiplier can
also be designed directly from the ripple-carry multiplication DG by systolic
mapping techniques (see Problem 10).
Hest
Example 18.4.2 Consider the DG for the carry-ripple array multiplication
in Fig. 13.9. Using the projection vector, processor space vector, and schedul-
ing vector
a
a-({).
we have the following edge mappings:
10),
01), (1313)
e pre [s7e
Oi | op7
b, 0) | 7 [0
carry (I, ) [7 TO
z(-4) | -1Pr
The resulting bit-serial corry-ripple multiplier is shown in Fig, 13.18, where
the switching instances for the multiplezers are omitted for clarity.
‘Example 13.4.8 Consider the DG for the carry-save array multiplication in
Fig. 13.5. Suppose a carry-ripple adder is used in the vector merging portion.BIT-SERIAL MULTIPLIERS 497,
Fig, 13.19. Bit-serial carry-save multiplier in Example 13.4.3.
Using the projection vector, processor vector, and scheduling vector
(11), (13.4)
‘we have the following edge mappings:
Tepe Tee |
ae
carry 0, T)
6 (ZL, 0)
z(t Y)
The resulting bit-serial carry-save multiplier is shoun in Fig. 13.19. The
control signals of the multiplezers in this figure are omitted for clarity. @
Example 18.4.4 Consider the multiplication table for Baugh-Wooley mult
plication shown in Fig. 13.8. Suppose a carry-save addition is used to ac-
cumulate the partial products and a carry-ripple adder is used in the vector
merging portion. The DG for this computation is shown in Fig. 19.20.
The projection with the following projection vector, processor space vector
and scheduling vector
a=(2),
leads to the following edge mappings
(1 0), 01) (13.15)
e pre|sve
=, alae
any 0, [ott
@(1, 0) ipo
=) 7
carom TOP TT 0498 BIT-LEVEL ARITHMETIC ARCHITECTURES
Fig. 13.20 Dependence graph for carry save Baugh-Wooley multiplication with carry-
ripple vector merging
where carry—vm denotes the carry outputs in the vector merging portion. The
resulting bit-serial carry-save Baugh-Wooley multiplier is shown in Fig. 13.21,
‘where blocks [C] perform inversion operations at some clock cycles to generate
the appropriate partial products and these control details have been omitted for
simplicity. Note that in this design, the computation at node (i, j) is scheduled
at time j, for j =0,1,2,3,4. Therefore, the result bit zo is generated at time
5143 and za, x2, and 2, are generated at time 5! +4 in parallel format.
A paralle-to-serial converter is used such that these result bits are output in
‘serial at time instance 51, 5I-+1, 51 +2, 5I-+3, and 5l+4, respectively. The
control signal Ty = 1 at time 5! when a new multiplication starts, and Ty = 0
otherwise. Hence the architecture has latency equal to 5 cycles and is able
to generate 1 output word every 5 cycles. In addition to inherent hardware
‘utilization inefficiency, a major drawback of this design is the long critical
path (shown in dashed line in Fig. 19.24) proportional to the wordlength.
Another design can be obtained by using the modified DG shown in Fig. 18.22,
with the same projection vector, processor vector, and scheduling vector as in
(13.15). In this case, the carry-save array and the vector merging portion are
treated as two separate planes during systolic mapping, and the carry-ripple
addition in vector merging portion is performed by a separate adder and resultBBIT-SERIAL FILTER DESIGN AND IMPLEMENTATION 499
Fig. 1221 Bit-serial Baugh-Wooley multiplier 1 in Example 13.4.4
bits are generated in serial. This bit-serial design is shown in Pig. 13.99. It
‘has a latency equal to four clock cycles and is able to generate one output
every four clock cyeles. Note that the critical path of this design is limited by
‘one binary adder delay. m
13.5 BIT-SERIAL FILTER DESIGN AND IMPLEMENTATION
FIR and IIR filters are the two most basic structures used in DSP, This
section considers design of bit-serial FIR and IIR digital filters with fixed
constant coefficients, where the multiplications with constant coefficients are
decomposed and implemented using bit-serial shifts and adds.
I FIR Filter
Consider the implementation of the FIR filter
y(n)
Ty at
Ja) + 520-1), (13.16)
with a signal wordlength of 8. Equation (13.16) can be rewritten in terms of
shifts and adds as follows:
u(r) = ~2(n) + 2(n)2-* + a(n —1)2, (3.7)
‘The word-level signal flow graph of the shif-add based FIR filter is shown in
Fig. 13.24(a). Due to the presence of noncausal scaling operators, this is not &
feasible design. Pipelining cutsets, shown in dashed lines in Fig. 13.24(a), can
be used to delay the advance scaling operations. By placing delay elements500 BIT-LEVEL ARITHMETIC ARCHITECTURES
Fig. 13.23. Bit-serial Baugh-Wooley multiplier # 2 in Example 13.44‘BIT-SERIAL FILTER DESIGN AND IMPLEMENTATION 501,
o
Fig. 13.24 Bit-level pipelined bit-serial FIR filter where constant coeficient multi-
plications are implemented as shifts and adds. (3) Filter architecture with scaling
operators; (b) feasible bit level pipelined architecture.
along the cutsets and replacing the delayed scaling operators with switches,
a feasible bit-level pipelined bit-serial FIR iter can be derived, as shown in
Fig. 13.24(b). Note that a word-level delay is equivalent to W bit-level delays
for a signal wordlength of W. Therefore, the one word delay in Fig. 13.24(a)
is replaced by 8 bit-level delays in Fig. 13.24(b). The switching time instances
can be derived by scheduling the bit-level computations.
13.5.2 Bit-Serial IIR Filter
Consider the implementation of the IIR filter
1
u(n) = —Zyln—1) + Eu(n 1) +200), (assy502 BIT-LEVEL ARITHMETIC ARCHITECTURES
Fig. 13.25 MIR filter can be obtained from a FIR section as shown in this figure.
where the signal wordlength is assumed to be 8. This IIR filter equation can
be rewritten as
w(n) ~fun-0) + fue-,
y(n)
which can be implemented as an FIR section (computing w(n) from y(n—1))
with an addition and a feedback loop, as shown in Fig, 13.25. A bit-level
pipelined bit-serial IIR filter architecture can be derived by going through the
following steps.
First, a bitlovel pipelined bit-serial implementation of the FIR section
neds to be derived. (For this example, this has already been carried out in
the previous subsection.) The 2nd step is an intermediate design stage. In
this step, the input signal 2(n) is added to the output of the bit-serial FIR
section w(n), The resulting signal y(n) is connected with the signal y(n ~ 1).
‘The number of delay elements on this edge, marked by [?D ] in Fig. 13.26(a),
is yet to be determined. The resulting structure is shown in Fi
which contains 2 loops shown in dashed lines. This completes all the inter-
connections in the bit-serial system. However, for systems containing loops,
the total number of delay elements in the loops should be consistent with that
in the original signal-fiow graph (SFG), in order to maintain synchronization
as well a8 correct functionality. In this case, the number of bit-level delay
elements in the bit-serial loops should equal W x Np, where W is the signal
wordlength and Np denotes the number of (word-level) delay elements in the
loops of the original word-level SFG. Matching the number of word-level loop
delay elements and that in the bit-serial architecture is referred to as loop de-
lay synchronization. For this IIR filtering example, originally, the inner loop
contains 1 word delay and the outer loop contains 2 word delays. Hence, in
the resulting architecture, the inner loop and outer loop should contain 8 and
16 delay elements, respectively. ‘To compute the total number of delays in
the bit-level architecture, the paths with the largest number of delay elements
in the switching elements should be counted. For this example, there are 6
delay elements and 14 delay elements in the inner loop and outer loop of the
w(n) +2(n),BIT-SERIAL FILTER DESIGN AND IMPLEMENTATION 508
intermediate archivctuo, respectively. Thereore, 2 synchronization delays
are added tothe common part of ine and gute loop ad the Boal architec.
ture is shown in Fig. 1826(0). Five input synehroniing delay elements are
ted to the input, which ar also referred to as shinning delays or skewing
delays, Thi bilevel pipelined bit serial I filter has a ctical path conainr
ing 1 single-it adder, and has a throughpat of 1 output word ovry 8 clock
cgelea. Note that its aso posable thet the lops inthe intermediate bit level
pipelined architecture may contain more than W x Np number of bit-level
Aay clement. In this case, the wordength noods to be increise.
Note that the architecture without the 2 loop synchronizing delays (in
shaded region) ean function correctly with a signal wordlength of 6, whlch is
the minimum wordlength foc the bitlovl pipelined bit-seralarchivctue in
Fig. 18.26). (In the ease of signal wordlength 6, the [8D delays in th outer
this case, the throughput corresponds to 1 output (word) every 6 clock eyctes.
‘The IIR filter architecture in Fig. 13.26(b) is derived from the signal-ow
graph shown in Fig. 13.27(a). Its minimum wordlength of 6 is constrained by
the loop iteration period of one-muliply-two-add (inner loop) time in the
signal flow graph of Fig. 18.27(a). This iteration bound (see Chapter 2)
can be reduced to one-multiply-add by using associativity which leads to the
structure shown in Pig. 13.27(b). A bit-serial implementation of the structure
in Fig. 13.27(b) is shown in Fig. 13.28, which requires a minimum feasible
wordlength of 5.
Example 13.5-1 Consider the implementation of the equation:
a(n) =
Zain) 4 S2(n-2) +00). (3.19)
Assume the signal wordlength to be 8 (ie., the wordlengths for x and w are 8
bits). Assume the coefficient ~7/32 = —1 + 1/2 + 1/4 + 1/32 is encoded as
1.11001 and 3/4 = 1/2+1/4 is encoded as 0.11 in two's complement format.
A bit-serial architecture based on the block diagram of Fig. 13.27(b) is
shown in Fig. 13.29, where the notation (81 +i) represents the time instance
{81+i} and 7/8 represents the remaining 7 time slots in a period of periodicity
8. Note that the number of loop delay elements equals 8 in the inner loop and
16 in the outer loop (uhile counting the number of loop delay elements, the
maximum number of delay elements in the switches should be included), which
is consistent with the word-level architecture.506 BIT-LEVEL ARITHMETIC ARCHITECTURES
®
Fig, 13.26 Design of bit-level pipelined bit-eral IR filter. (a) Bitlevel pipelined bit-
setil architecture, without synchronization delay elements, (b) Bitserial IR filter.
[Note that this implementation requires a minimum feasible wordlength of 6
@ ®
Fig. 13.27 Loop iteration bound of HR filter can be reduced from one-multply-two-
add to one-multipy-add by associative transformation.CANOMIC SIGNED DIGIT ARITHMETIC 505
Fig. 13,28 Bit-cerial IR Alter ater associ
requires 8 minimum feaible wordlength of 5.
ve transformation, ‘Thi implementation
13.6 CANONIC SIGNED DIGIT ARITHMETIC
‘The number of add operations required in a constant coefficient multiplication
equals one less than the number of nonzero bits in the constant coefficient.
In order to further reduce the area and power consumption, the constant co-
efficient can be encoded such that it contains the fewest number of nonzero
bits, which can be accomplished using canonic signed digit (CSD) represen-
tation. Quantization of FIR digital filters using a specified number of signed
power-of-two (SPT) terms is described in Appendix G.
This section addresses the CSD numaber representation and its applications
for the design of bit-serial and bit-parallel constant multipliers.
13.6.1 CSD Representation
Consider the number A = aw-1.0w-2°+-a1a9, where each aj (W—1 > i > 0)
is in the set {—1,0,1}. Two's complement representation may be considered
‘sa special case, in which the bit aw—1 Is equal to 0 or ~1, whereas
1 for 0 ) (tina +S) tewaj24) (13.22)
One War Noa
= Deut DCS errs)
B jet 8
Define
Cwars =D citewa-s (G #0), Cw.
Seeyw-- (18.28)
‘Then,
Y= PV Cwaye?, (13.24)
Therefore, by interchanging the summing order of i and j, the initial mul-
tiplications in (13.20) ate now distributed to another computation pattern
(19), 20}.
Since the term C depends on the 2,5 values and has only 2¥ possible val-
ues, it is possible o precompute them and store them in a read only memory
(ROM). An input set of N bits (205, 217.---, 2-1) is used as an address to
retrieve the corresponding C, values. These intermediate results are accumu-
lated in W clock cycles to produce one Y value. This leads to a mulkipier-free
realization of vector multiplication. Table 13.2 shows the content of the ROM
for N=4, Fig. 13.36 shows a typical architecuure for the computation of the
inner product of two length-NV vectors. The hift-accumulator isa bit-parallel
carry-propagate adder that adds the ROM content to the previous accumu-
lated result. The inverter and the MUX are used for inverting the output
of the ROM in order to compute Ciy-1 and the control signal $ is 1 when
j= W —1 and 0 otherwise. The computation runs from j= Oto j = W~1
and the result is available in bit-parallel format after W clock cycles. This
approach corresponds to a bit-serial distributed arithmetic. ‘The speed of @
-serial distributed arithmetic implementation ean be limited for
ime applications such as DOT implementation for video com-
pression in a digital TV system. ‘The speed of these systems can be improved
by a digit-serial distributed arithmetic [21},22] where a digit containing mul
tiple bits is processed in a clock eycle. For example, if J consecutive bits areDISTRIBUTED ARITHMETIC 513
Table 13.2 Content of the ROM (N=4)
iP
Fig _z2j #24 | Content of the ROM
o
@
er
ate
a
ate
CE
abate
@
Ore
oto
erate
eo +e
erate
eo ter te
@tatate
processed in a single clock cycle using J ROMSs, then the input words are pro-
cessed in W/.J clock cycles. A mult-input shift-accumulator adds the contents
of J ROMs and the previous accumulated result, and generates the output in
bit-parallel format. ‘The detailed design of the multi-input shift-accumulator
is explored in Problem 24.
13.7.2 Distributed Arithmetic with Offset-Binary Coding
In this section, the offset-binary coding (OBC) [18] is introduced that can
reduce the ROM size by a factor of 2 to 2¥-"
Rewrite (13.21) as:
1
2 = fle (-20)
= Few wa) + Lewy awe 2,
(13.25)514 BIT-LEVEL ARITHMETIC ARCHITECTURES
rey
Meee
sbiesecumaaon un
Fig. 13.36 Architecture of computing i
distributed arithmetic.
where
wa
= mwa + Do mw +2".
Define
{ ij — Fj» for AW -1
~Giw a1 Fw), for j= W=1
and diy € {1,41}. Equation (13.25) can be rewritten as:
wo
(3 diwang2 4 —2-0"-0),
=
Using (13.27), (13.20) can be written as
way wa
Y= YD del dwar yad 2-0
aris
Wes wot 1
= LS fades youre
we
Now define ca
D; Ye feats, for0 |
(co
I Peas
Fig. 13.37 Architecture of computing inner product of 2 length-N vectors using dise
tributed arithmetic with OBC coding.DISTRIBUTED ARITHMETIC 517
13.7.3 ROM Decomposition for Distributed Arithmetic
‘The ROM size of the conventional distributed arithmetic increases exponen-
tially with N. Generally, ROM aovess time can be a bottleneck for speed of
the whole system, especially when ROM size is large. Therefore, reducing the
ROM size is very important and is of great practical concern. Due to the
linearity of equation (13.24), one possible solution is to divide the NV address
bits of the ROM into N/K groups of K bits. Hence itis possible to decompose
the ROM of size 2¥ into N/K ROMs of size 2% and add the outputs of these
ROMS using a multiinput accumulator. Fig. 13.38 illustrates the architecture
for computing an N-input inner product using conventional distributed arith-
metic with ROM decomposition. ‘The total size of storage is now reduced
Mali-input
y
Fig. 13.38 Architecture for computing inner product of two length-N vectors using
distributed arithmetic with ROM decomposition.
from 2% to (N/K)2¥ which increases linearly with N. The ROM access time
is also reduced along with the ROM size. This reduction of the storage size is
balanced by a linear increase of the computation complexity of the accumula
tor. Carry-save arithmetic can be used to realize the multiinput. accumulator
to minimize the computation time. Note that ROM decomposition technique
is applicable to OBC-coding-based distributed arithmetic as well.‘518 BIT-LEVEL ARITHMETIC ARCHITECTURES
13.8 CONCLUSIONS
This chapter has presented the design of bit-level arithmetic architectures,
Design of bit-parallel multipliers, including carry-ripple array, carry-save ar-
ray, Baugh-Wooley, and Booth-recoded multiplications has been introduced,
‘The two efficent layout strategies, interleaved floor-plan and bit-plane, for dig-
ital filters have been discussed. Design of bit-serial multipliers using Horner's
rule as well as using systolic mapping has been addressed. This chapter has
also addressed the design methodologies for bit-level pipelined bit-serial FIR
and IIR filters with constant coefficients. ‘The CSD representation has been
presented for the design of low-cost, high-speed constant multipliers. Latency
reduction in serial and parallel computations using associativity and tree-
height reduction has been discussed. Finally, the bit-level implementation
schemes for vector-vector multiplication based on the distributed arithmetic
approach has been addressed. Residue arithmetic [23], which is often used for
implementation of FIR digital filters and transforms, is beyond the scope of,
this book
Digit-serial architectures are also attractive for discrete wavelet transforms,
where higher levels of wavelet can be implemented using smaller digit sizes
[24]. For example, in a Slevel wavelet, digitsizes W/2, W/4 and W/8 can be
used for levels 1, 2, and 3, respectively. Higher levels require less computa-
tion rates and can be implemented using fewer hardware, assuming use of a
single clock in the system. Traditionally, digit-serial multipliers are designed
by folding bit-parallel multipliers, or unfolding bit-serial multipliers. Some
resulting digit-serial architectures cannot be pipelined at the bit-level. A
novel cell replacement transformation technique presented in (25] can be used
to design low-power bit-evel pipelined digit-serial multipliers from bit-serial
multipliers.
13.9 PROBLEMS
1. (a) Verify that the circuit in Fig. 13.39 implements a binary full-adder.
(b) The adder circuit in Fig, 13.39 uses 32 transistors. Derive an op-
‘timized architecture using CMOS-transmission gates with only 24
transistors (Hint: Try to share hardware of sum and carry parts).
(©) What is the number of transistors in the critical path of an &-bit
carry-ripple adder?
2. Consider the addition of 2 two's complement &-bit numbers X and Y in
a ripple-carry manner as shown in Fig. 13.40.
Derive a truth table and logic circuit for detecting overflow for this adder
using the bits er, yr, cr, cg, and sr. (Hint: If zr # yr, ie, the sign bitsPROBLEMS 519
x
ms
Fig. 1339 Schematic of a complementary pass-transistor logic based CMOS fall-
adder.
|. This problem considers the derivation of the Baugh-Wooley multi
aie
Pa ye yt ye
a! fiddiath
ey
7 Ss
Fig, 13.40. Bight-bit ripple carry adder in Problem 2.
of X and ¥ are different, then there can never be overflow. If x7 = yr,
then overflow oceurs if 27 = yr # #1.)
Design bit-parallel architectures for computation of
un) = Dain) (3.32)
3
using
(a) carry-save arithmetic and
(b) treerheight reduction technique,
using half and full adders and a VMA. Assume a wordlength of 8. Com-
pute the latencies of these architectures without including the latency
of the VMA.
ica
tion algorithm shown in Fig. 13.8. Consider the 4 x 4-bit multiplication
‘operation X = Ax B, where A = a3.ay009 = ~a3 +0227? +12"? +
G92, B= [Link] = ~bs + by2- + b2-* + bo2®.520 BIT-LEVEL ARITHMETIC ARCHITECTURES
Multiplying A and Bis equivalent to perform
P= (03 40,2 + 0,2? + a2-8)(—by + Do21 +12 + B92)
(asbs + (a2 + 012-? + a92"9)(bo2"! + b12-? + bo27%))
~ (02? + 612°? + B92-9) + (a2! + 0,2"? + 092%).
(13.33)
(a) Prove that
~ (05:27 + 12°? + B22) + (aa! + 012°? + a2)
= (agba + agby)2-* + (Guby + ayy + 1)2-? + (Gabo + Gods) 2-*,
(1334)
where @ = 1 — a, for a = 0 or 1. Hint: The product of 2 numbers
‘within the range [—1,1) should still be within this range. Hence,
the nonzero bits at higher bit positions can be ignored,
(b) Using equations (13.33) and (13.34), and associativity property to
rearrange all the terms involved in this multiplication, derive the
Baugh-Wooley multiplication table shown in Fig. 13.8
5. The Baugh-Wooley multiplication algorithm can also be implemented
as follows:
P= (~05 40,71 +0;2°? + ag2-*)(—by + ba? + 512°? + By)
= aby + (az27! + 012°? + a92-*)(bg2* + 12°? + bo2™*)
#05 + (agba2™! + a9b;2"? + asBo2™*) + 032°
+a + (Gab? + abg2-* + W592) + by2->
(13.35)
by rewriting (13.34) as
~(as(22°? + 6:2°? + bo) + (a2! + 012"? + ag 2*}ba)
= 43(—bg2-! — by2-* = by 2) + bg(—a22-! - a1 2°? ~ ag2-*),
(1336)
Prove the Baugh-Wooley multiplication algorithm in (13.35) and derive
the multiplication table for this algorithm. (Hint: Use the fact that
Dhan ~Paaidt = Lh (= dF 142%)
6. Consider the computation of
x(n) + fa(n —1) + ge(n— 2). (13.37)
Using Baugh-Wooley multiplication algorithm and two's complement10.
rt
PROBLEMS — 521
reptetentation, draw the interleaved bit-parallel floor-plan architecture
for this computation with
(2) bit-level pipelining, and
(b) 2-bit-Level pipelining
for signals and coefficients of wordlength of 4,
Using the multiplication chart in Fig. 13.13, design a 2-bit-level pipelined
bit-parallel architecture using modified bit-plane technique for the com-
putation in Problem 6.
Show detailed schedules for the bit-serial multiplier in Fig. 13.19 and
verify its correctness.
|. Show detailed schedules for the bit-serial Baugh-Wooley multiplier in
Fig. 13.23 and verify the correctness of this architecture.
Find appropriate projection vector, scheduling vector, and processor
space vector to design a bit-eval pipelined version of Lyon's bit-serial
carry-ripple multiplier from the DG in Fig. 13.3. This multiplier should
be a pipelined version of the multiplier in Fig, 13.17. Complete the
details of the design and show the switching instances ofall multiplexers
in the multiplier.
‘This problem is concerned with design of complete architecture for a bit-
serial 5-level recoding-based modified Booth multiplier for a wordlength
of 12 using Horner’s rule expansion. Use the notation where the radix
point is to the right of the msb (or sign bit) and bit number 11 is the
msb and bit number 0 is the Isb.
(0) To obtain the recoded signal
Wes Mania bt pets (13.38)
wwe need to derive three control signals Az, Bi, and Ci. Cy = 1
implies y is negative, and C; = O implies yf is nonnegative. By = 1
implies jy| = 2 and j = 1 implies [yf] = 1. Draw the Karmaugh
map for Aj, Bjy and C, and prove that
wi ® ia
(uaier © vai) (92i ® vai
vans (13.39)
where the symbol © denotes exclusive OR, and the overbar indi-
cates complementation. Using As, By and Ci, design the modified
Booth recoding unit.522 IT-LEVEL ARITHMETIC ARCHITECTURES
Table 13.5 Truth Table for Low-Power Booth Recoding Circuitry in Problem 12
ipa [oy [a [me ao
o[o0, 0 po, 0,opiyi
axa ees [OTs EO
0 T 0 7 ofvijyolt*
a
T 0 0 27 1 o;iso
i] aa eof [ga
T T 0 =i1f 1 ijoly*
Tea oat
(b) Draw the complete modified Booth recoded bit-serial multiplier
architecture and show all switching instances.
12. This problem considers design of a Booth recoding circuit for low power.
‘The control circuitry for Booth recoding in Problem 11 contains only
‘two XOR gates and one inverter. However, the generation of the con-
trol signals Aj, Bi, and C; through different logic levels with different
propagation delay time leads to an increase in the glitching activity in
the partial product generation and accumulation circuit, hence leading
to high power dissipation. Glitching and power consumption can be re
duced by using a redundant recoding scheme, which balances the paths
from each input to the outputs of the control circuitry (11).
Consider the multiplication A x Y, and denote the recoded multiplier
bits as yj. Four control signals are used in the new race-free Booth re-
coder, neg, #1, 22 and zp. neg = 1 implies yj is negative, and neg =
implies y; is nonnegative. 21 = 1 implies |y}| = 1; 22 = ZI. zp distin.
sguishes |y}| = 2 from ly}| = 0. The new recoding scheme is summarized
in the Table 13.5, where * denotes a don't-care condition. One bit-slice
of the partial product generator obtained using this recoding scheme is
shown in Fig. 13.41.
(a) Prove that the circuit in Fig. 13.41 generates the correct partial
products for the Booth multiplication
(b) Derive the Booth recoder (control) circuit such that the propaga-
tion delay of all paths from each input bit, including vajsay ways
tnj-1, 0; and aj-1, to each bit of the partial product, Ay}, are
equal and are limited by one XOR/XNOR gate delay, one AND
gate delay, and one NOR gate delay. The don't care conditions in
zp signal can be used for logic minimization. Show those zp values
that you use to substitute the don’t-care conditions.PROBLEMS 523
neg—
al 2
Fig. 13.41 One bit slice partial product generator for the [Link] Booth multiplier
in Problem 12.
13. Obtain a 2-unfolded carry-ripple digit-serial multiplier by unfolding the
Lyon’s multiplier in Fig. 13.17.
14, This problem is concerned with fixed-point bit-serial and digit-seral
implementation of an all-pole second-order recursive filter
vin) = Fyn 1) + Fain —2) +210)
Assume the signal wordlength to be 8 (i.e, the wordlengths for y and
are 8 bits). Assume the coefficient ~7/8 is encoded as 1.001 and 3/4 is
encoded as 0.11. In the multiplication with respect to coefficients, only
‘multiplication with respect to nonzero bits is carried out,
(@) Draw the block diagram of the computation by exploiting associa-
tivity. The inner loop bound of your architecture should be limited
by 1 multiply-add time.
(b) Design a:functionally correct bit-level pipelined bit-serial architec-
‘ture for the structure in part (a).
(©) The loop latency in part (b) imposes a constraint on the minimum
‘wordlength for the signals y and z. What is the minimum feasible
signal wordlength for the system in part (b)?
15. This problem considers bit-serial implementation of the 3rd-order fixed-
coefficient IIR filter shown in Fig. 13.42. Assume the data wordlength to
be 8, filter coefficient wordlength = 4, and filter coefficients a = —1/4,524
16.
Wy,
{IT-LEVEL ARITHMETIC ARCHITECTURES
yo)
Fig, 1342 UR filter diagram for Problem 15.
b= 1/8, c= -3/4. Assume all numbers to be represented using two's
complement representation.
(2) Obtain an equivalent structure by using associativity such that the
inner loop bound is limited by 1 multiply-add time.
(b) Complete the bit-level pipelined bit-serial design of the structure
in part (a). Treat minimization of delay elements in the design as
a secondary objective.
‘This problem addresses a bit-level pipelined bit-serial implementation
of the recursive computation
vin) = Zn 1)+ Sin —2)+ Bin)
using two's complement representation and a signal wordlength of 10.
‘The bit-erial design must use Horner's rule to improve accuracy in the
precision avalable Asocitivity must be exploited so that the Loop
bound of the inner loop is limited by 1 multiply and 1 add time at the
word level. Design the bit-seral architecture fora signal wordlength of
10. What is the minimum feasible wordlength for this computation?
Consider the bit-serial implementation of the computation
25y(n 1) + a(n) (13.40)
y(n) =
shown in Fig. 13.43.
‘What wordlength has been used in the circuit? For this wordlength,
complete the missing switching instances. Unfold this bit-serial struc
ture by a factor of 3 to obtain a digit-serial implementation with digit
size 3,PROBLEMS — 525
x(a) 20}
Fig. 13.43 Bit-oerial implementation of the TR filter in Problem 17,
18. This problem considers the design of a low-latency bit-serial multiplier
by transforming the ripple-carry Lyon's multiplier using associativity
and retiming.
@)
(b)
o
@)
Show that the architecture in Fig, 13.44(a) can be transformed to
that in Fig. 13.44(b) using associativity and retiming. Compare
the critical path and latencies of these 2 architectures.
Design a low-latency bit-serial multiplier by applying similar tran
formations as in part (a) to the Lyon’s ripple-carry bit-serial mul-
tiplier or by recasting the Homer's rule multiplication equation
into an appropriate form. Assume a wordlength of 12 for signal
sample and the coefficient. Show all switching instances in your
architecture.
Design alow latency constant bit-serial multiplier to compute y(n) =
a(n), where a = 001011100101, by considering multiplication
with respect to nonzero bits only using the architecture in part
(b). Show all the switching instances in this architecture. Assume
‘signal wordlength of 12.
Based on the bit-serial multiplier derived in part (c), design a bit-
serial implementation of the recursive computation y(n) = ay(n ~
1)-+2(n) for the same value of ain (c). Assume a signal wordlength
of 12
19. Obtain the CSD representation of the following two's complement num-
bers:
000010110, ex = 0.01001100, ¢2 =
00010110111
191100110,526 BIT-LEVEL ARITHMETIC ARCHITECTURES
»
Fig, 13.44 Bit-serial multipliers in Problem 18.
20, Obtain the CSD representation of the following two's complement num-
bers:
= LAL, ¢ = 111110010, ¢ = 0.01100110,
es = 0.01001100111.
21. ‘This problem considers a bit-serial implementation of the IIR filter
u(n) = ary(n—1) +azy(n 2) + 2(0), (13.41)
‘where a) = 0,00011011100 and ay = 0.01001111110
(a) Represent the coefficients a; and a2 in CSD representation.
(©) Using CSD representation and the data-flow graph in Fig. 13.27(b),
obtain a bit-level pipelined bit-serial IIR filter architecture for a
signal wordlength of 8. What is the minimum feasible wordlength
of this architecture?
(©) Apply the Horner's rule and tree height reduction techniques to
improve your design in part (b). What is the minimum feasible
wwordlength of this design?
22. Extend the OBC concept to implement a distributed arithmetic archi
tecture using a ROM containing 2/4 words.
23. Consider ROM decomposition with N=16, K=4. Calculate the reduc-
tion factor in ROM size. Realize the distributed arithmetic architecture
‘using two-input accumulators and carry-save arithmetic.REFERENCES 527
24, This problem considers design of digit-serial distributed arithmetic ar-
chitecture. Implement the computation
wa
v= San (03.42)
aS
using a 3-bit digit-serial distributed arithmetic assuming a wordlength
of 9. The bits zz.se41 should be processed in the I-th ROM (1 = 0, 1,2)
in k-th clock eycle (k = 0, 1,2). Using 3 ROMs and a multi-input shift-
accumulator, design the 3-bit digit-serial distributed arithmetic archi-
tecture. This architecture processes 3 consecutive input bits in a clock
cycle and all input words are processed in 3 clock cycles, Assume the
ROM contents to be represented using 16 bits. Design two forms of
shift-accumulators using bit-paralle (i) carry-ripple and (ii) carry-save
adders. The output of the shift-accumulator is represented using 16
bits. The critical path of the carry-ripple design should be I7tr4 and
that of the carry-save design should be 3tp.4, where tr. is the propaga-
tion delay of a full adder. ‘The carry-save design requires an additional
carry-propagate adder at the end.
(a) Calculate the latencies of both designs in terms of tp.
(b) Which design is more suitable for high speed? Why?
(©) Which design is more suitable for low power? Why?
REFERENCES
1. L, B, Jackson, J. F. Kaiser, and H. $. McDonald, “An approach to imple-
mentation of digital filters,” EBB Trans. on Audio Electroacoustics, vol. 16,
pp. 413-421, 1968.
2. P.B. Denyer and D. Renshaw, VLSI Signal Processing: A Bit-Serial Approach.
‘Addison-Wesley, 1986.
3. Jain et al, “Custom design of a VLSI POM-FDM transmultiplexor fom
system specification to circuit layout using a computer aided design system,”
IBBE J. Solid State Circuits, vol. 21, pp. 73-85, Feb. 1986,
4. K. K, Parhi, “A systematic approach for design of digit-srial processing archi-
tectures;” IEBE Trans. on Cireuits and Systems, vol. 38, no. 4, pp. 358-375,
April 1991.
5. R. I. Hartley and K. K. Parhi, Digit-Serial Computation. Kluwer, 1998.
6..N. B. Weste and K. Eshraghian, Principles of CMOS VLSI design: A Systems
Perspective, Addison-Wesley, 1993.
7.K. Hwang, Computer Arithmetic: Prine
1979.
Architecture and Design. Wiley,528 BIT-LEVEL ARITHMETIC ARCHITECTURES
8.1. Koren, Computer Arithmetic Algorithms. Prentice Hall, 1998.
8. C. R. Baugh and B. Wooley, *A two's complement parallel array multiplication
algorithm,” 1BBE Trans. on Computers, vl. C-22, no. 12, pp. 145-1047, Dee.
1973.
10. C. 8. Wallace, “A suggestion fora fast multiplier” IBBE Trans. on Computers,
vol. BC-13, pp. 14=17, Feb, 1964,
11 R. Fried, *Minimiaing energy dissipation in high-speed multipliers
ISLPBD-97, (Monterey, CA), pp. 214-219, Aug. 1997.
12M, Hatamian and K, K. Pashi, “An 85-MHl2 fourth-order programmable TIR
digital filter chip,” IBBB Journal of Solid State Cireits, vol. 27, pp 175-188,
Feb. 1992.
13, 7. G. Nol, *Semi-systolic maximum rate transversal Stars with programmable
coeficients,” in Proc. of 1986 International Conference on Systolic Arrays,
(Oxford, U.K.), pp. 5-13, July 1986.
14. R. F, Lyon, “Two's complement pipelined multipliers,” IBEB Trans. on Com-
munications, vol 24, pp. 418-424, April 1976.
15. D. Ait-Boudaoud, M. K. brahim, and B.R. Hayes-ill, “Novel cell architecture
for bit level systolic arrays muldplication,” IBB Proceedings-B, vol. 138, Jan
1901
16. 8. W. Reitwiesner, “Binary arithmetic;” Advances in Computer, pp. 231-308,
1966.
17, MeT. Sua, TC, Chen, and AM. Dottieb, “VLSI implementation of @ 16% 16
discrete cosine transform chip.” IBEB Trans. on Circuits ond Sytems, vol. 36,
zo. 4, pp. 610-617, Apail 1989,
18, N, Demassieux and F. Jutand, “Orthogonal transforms,” in VEST Implementa-
tions for Image Communications (P. Pirsch, ed), pp. 217-250, Elsevier, 1888,
19. A. Peled and B, Liu, “A new hardware realization of digital filters” TEBE Trans
‘Acoustics, Spech, and Signal Processing, vol. ASSP-22, no. 6, pp. 456-462, Dec
1974
20. ©. 8, Burrus, “Digital filter structures described by distributed arithmetic;
IBBB Trans. on Circuits ond Systems, Dec. 1977
21, 8. Uramoto et al, “A 100-MHz 2-D diserete cosine transform core processor”
IBBE Journal of Solid-State Cireuits, vol. 27, no. 4, pp. 402-499, April 1992.
Y. Katayama, 1. Tamitani, A, Taniguchi and Y. Ooi, “A singlechip MPEG1
audio/video decoder using macrocore and call-based implementation,” in IEBE
VES! Signal Processing, VITI, (Osaka, Japan), pp. 431~440, Oct. 1995.
28. , J. Taylor, “Residue atithmetic: a tutorial with examples," IEEE Tra
Computers, May 1984.
24. K. K. Pashi and T. Nisitani, “VEST architectures for disrete wavelet trans-
forms,” IBBB Trans. on VLSI Systems, vol. 1,20. 2, pp. 191~202, June 1998.
25. Y-N. Chang, J. H. Satyanarayana, and K. K, Path, “Design and implementa-
tion of low-power digit-serial multipliers” in Proc. 1997 IBEE International
Gonference on Computer Design (ICCD), (Austin, TX), pp. 186-195, Oct.
1907.
in Proc. of