[go: up one dir, main page]

0% found this document useful (0 votes)
108 views52 pages

Unit 2

This chapter discusses the design of bit-level architectures for addition and multiplication in DSP algorithms, focusing on three implementation styles: bit-parallel, bit-serial, and digit-serial systems. It covers the design methodologies for bit-parallel and bit-serial multipliers, digital filter design, and vector-vector multiplication schemes, emphasizing the use of two's complement representation. The chapter also explores various multiplication techniques, including carry-ripple, carry-save, Baugh-Wooley, and modified Booth recoding to enhance performance and efficiency.

Uploaded by

saranrakshu27
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
108 views52 pages

Unit 2

This chapter discusses the design of bit-level architectures for addition and multiplication in DSP algorithms, focusing on three implementation styles: bit-parallel, bit-serial, and digit-serial systems. It covers the design methodologies for bit-parallel and bit-serial multipliers, digital filter design, and vector-vector multiplication schemes, emphasizing the use of two's complement representation. The chapter also explores various multiplication techniques, including carry-ripple, carry-save, Baugh-Wooley, and modified Booth recoding to enhance performance and efficiency.

Uploaded by

saranrakshu27
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
13 Bit-Level Arithmetic Architectures 13.1 INTRODUCTION ‘This chapter addresses the design of bit-level architectures for addition and multiplication frequently encountered in DSP algorithms. Three implementa- tion styles, bit-parallel, bit serial, and digit-serial are addressed. Bit-paralet systems process one whole word of the input sample each clock cycle and are ideal for high-speed applications. Bit-serial systems process 1 bit of the in- put sample every clock cycle. These systems can be synthesized using integer linear programming based scheduling approach (see Appendix F). Bit-serial systems are area-efficient and suitable for low-speed applications [1] ~(3). Bit- serial arithmetic is used for the implementation of data-flow algorithms of medium complexity and low to medium data rate, whereas bit-parallel op- erators may be used for the implementation of data-flow algorithms of low complexity and high data rate. Digit-serial systems [4J,5] process multiple number of bits (referred to as digit-size) every clock cycle and are best suited for applications requiring moderate sample rate, where area and power con- sumption are critical This chapter considers the design of bit-parallel and bit-serial multipliers, bit-serial digital filter design and implementations, and bit-level implemen- tation schemes for vector-vector multiplications, ‘The major emphasis is on architecture design based on design methodologies for mapping algorithms to arithmetic architectures at bit-evel. Basic knowledge of computer arithmetic, such as number representations, and addition and multiplication algorithms are assumed, as they are covered in many VLSI design and computer arith- ‘metic books [6}-[8]. In this chapter, all numbers are assumed to be repre- ar 478 BIT-LEVEL ARITHMETIC ARCHITECTURES sented in Bxed-point two's complement representation. A W-bt number A is represented as: A= ay-14y-2°--0109 (aay where the bits aj, 0 < i < W — 1, are either 0 or 1, and the most significant bit (msb) aw-. is the sign bit with value of 0 denoting positive number and 1 denoting negative number. ‘The value of this number is in the range of [-1,1- 2-1] and is given by: 4 (13.2) ‘A main advantage of two's complement arithmetic is the ability to generate 8 correct final result in spite of intermediate overflow. ‘Thus, in a chain of additions or subtractions, the final result is correct if itis known to lie in the range [~(1~2-*1), 1-2-1] even though intermediate operations lead to overfiow. For bit serial implementations, constant wordlength multipliers are considered, where a constant wordlength of W-bit is maintained for all the signals in the architecture. Although a W x W-bit multiplication generates (2W ~1)-bit product, only the W’ most-significant bits are retained. This chapter is organized as follows: Section 13.2 discusses bit-parallel multiplier design, including carry-ripple array multiplication, carry-save ar- ray multiplication, Baugh-Wooley multiplication, and Booth-recoded multi- plications. The two efficient layout strategies, interleaved floor-plan and bit- plane, for digital filters are discussed in Section 13.3. Design of bit-serial multipliers using systolic mapping as well as Horner's rule are addressed in Section 13.4, Section 13.5 addresses the design methodologies for bit-level pipelined bit-serial FIR and IIR. filters with constant coefficients. This sec- tion also addresses issues related to delay management and synchronization in bit-serial implementations. Section 13.6 addresses the design of low-cost, high-speed constant multipliers using the canonic signed digit (CSD) repre sentation. Section 13.7 addresses the bit-level implementation schemes for vector-vector multiplication based on the distributed arithmetic approach. 13.2 PARALLEL MULTIPLIERS ‘This section considers multipliers that can perform two's complement multi- plication in time O(W7) using regular structures, including multiplication with sign extension (derived from Horner’s rule) and Baugh-Wooley multiplication [9]. Two regular implementation styles, carry-ripple and and carry-save array, are introduced. It has been proved that multiplications cannot be performed in a time smaller than O(log,W). Such a bound is attainable by the binary-tree and PARALLEL MULTIPLIERS 479 Wallace-tree multipliers (see Appendix E) {10]. However, their corresponding architectures are very iregular. Let the multiplicand and multiplier be A and B: wa A= aW=[Link]-2°* 4109 = -aw-1 + D> aw- B= bwar dena s-biby = ~bwait D> bw, a where -1 < A,B <1 and the case when both A and B is equal to ~1 is excluded. The value of their product: P= A x B is given by: (13.3) ‘where the radix point is to the right of the msb pay-2. Since the product of 2 numbers within [1,1 is also in this range, all higher order bits tothe let of bit-position 2W/— 2 in the result are ignored In constant wordlength multiplication, the W ~ 1 lower order bits in the product P are discarded, and the product is denoted as X «= P = Ax B, where w. Xa swat Dawa (13.4) ‘The product X is used when a constant wordlength multiplier is considered and P is used when a full-precision product is required. 13.2.1 Parallel Multiplication with Sign Extension Using Horner's rule, multiplication of A and B can be written as P= Ax (-bwart YD bw) =~ Asbwan $[A-bwaa + [Abas + ft (Ash Asda to". otart, (13.5) where 2! denotes a scaling operation. In two's complement system, negating ‘a number is equivalent to taking its one’s complement and then adding a 1 to 480 BIT-LEVEL ARITHMETIC ARCHITECTURES eat ei ibe aab9 abo dodo ash, ay: agby aap aby ab dob —Tyby Tbs Tbs Popes ae ent Fig. 13.1. Tabular form of bit-level array multiplication ed the least significant bit (Isb) position, as shown below: wet wa = awart Slaw 2 ot mm wa = awit (1 awa2t— 142-74 t a CSC arta ar WH, (13.6) ‘where (1 — a;) is the one’s complement of a; and denoted as @, for 0 Brongn ves aout > Singlet arate: ‘Coat ¢ Sout = intin «Sin Cin Fig. 13.4 Parallel cazry-ripple array multiplier. to be added, but the partial sum and partial carry. The addition of partial sum and partial carry is performed by a vector merging adder (VMA). The dependence graph of bit-level carry-save multiplication is shown in Fig. 13.5 A parallel carry-save array multiplier can be designed from this dependence graph, as shown in Fig. 13.6, which can be pipelined at bit-level by placing delay elements across the cutsets shown in dotted lines in the figure. ‘The VMA can be implemented either as a 4x4 carry-save array as shown in. Fig. 13.7(a) which contains only half-adders and has a more regular and easier- to-pipeline structure, but requires more half adders; or as a 4-bit carry-ripple adder as shown in Fig. 13.7(b). When bit-level pipelining is not required, the carry-ripple and carry-save principles can also be combined into a single structure as shown in Fig. 13.7(¢), where the carry output can ripple through ‘at most 1 bit in the same row, and the resulting carry from each carry-ripple portion is passed on to the next row like the carry-save architecture. The number of adders required in this combined vector merging part is between, that of the carry-save and the carry-ripple vector merging structures. 484 BIT-LEVEL ARITHMETIC ARCHITECTURES Vector Merging Adder TT df po fed a 0) Fig. 13.5 Dependence graph for the 4 x 4-bitearry-save array multiplication 13.2.2. Baugh-Wooley Multipliers ‘The difficulty of two's complement multiplication lies in handling the sign bits of the multiplicand and multiple. An efficient way to overcome this problem is provided by the Baugh-Wooley multiplication algorithm [9]. This ‘multiplication algorithm is illustrated by the multiplication table shown in Fig. 13.8 for a 44-bit multiplication, where all the bits have positive weight, and can be accumulated directly. The derivation of this algorithm is left as an exercise at the end of this chapter (see Problem 4). ‘The Baugh-Wooley multiplication table in Fig. 13.8 can be implemented either as a carry-ripple array or a carry-save array. A carry-save Baugh- Wooley multiplier is shown in Fig. 13.9, where P = [Link] is the product with full precision. Note that the product of two 4-bit numbers in the range (~1,1) lies in the same range. Since the radix point is to the Tight of the bit ps, any higher order bits are ignored. The truncated result is represented as X'= 23.2720. 13.2.3 Parallel Multipliers with Modified Booth Recoding, Multiplication involves two basic operations: the generation of partial prod- ucts and their accumulation. Consequently, there are two ways to speed up the multiplication: reduce the number of partial products or accelerate their accumulation. Carry-saveaddition can be used to accelerate the accumulation ale PARALLEL MULTIPLIERS 485 > Broads signas outbis ata > Slagle Palade [Link]+ Sou =sinthin + Sin+Cin Fig. 13.6. Parallel carry-save array multiplies. 486 BIT-LEVEL ARITHMETIC ARCHITECTURES ication: (a) carry- ‘save vector merging, (b) cerryripple vector merging, (c) combined carry-save and carry-ripple vector merging, a fy by GiB aby aby anby GE ashy ab; agby Gb ashy aide aba azby abs axby obs 1 Fig. 13.8 Tabular form of bit-level Baugh-Wooley multiplication. PARALLEL MULTIPLIERS 487, og 85 Fig. 12.9 A4 x 4-bit Baugh-Wooley carry-save multiplier. of the partial products, To reduce the number of partial products, a straight- forward approach is to examine 2 or more bits of the multiplier at a time. ‘The reduction in the number of partial products can reduce the latency of the multiplication operation. However, this requires the generation of multiples A, 2A, 3A, ete., where A is the multiplicand. ‘The number of partial products can be reduced by the modified Booth recoding. This algorithm is based on the fact that fewer partial products need to be generated for groups of consecutive zeros and ones. For a group of consecutive zeros, there is no need to generate any new partial products. For ‘a group of, say m, consecutive ones in the multiplier, ie., {11-1}. = +++ 1(00++-0}0---—--0{00---1}0-++ = e100 -TYO--- instead of m partial products, only 2 partial products need to be generated (13.8) 488 BIT-LEVEL ARITHMETIC ARCHITECTURES Table 13.1. Radix-4 Modified Booth Recoding Algorithm Sint bas | rims [| Operation | Comments Ogata |e Onn 0 40 airing of zeros o of 1 fal +4 end of 1's o 1] o fal +4 single 1 o 1] 1 fa] 42a end of I's 1 0} 0 2) ~24 — | beginning of 1's 1 of 1 fal -a single 0 1 1] 0 ft} -A | beginning of 1's eee) ieee) 0 string of 1's if signed-digit representation is used. Hence, in this multiplication scheme, the multiplier bits are first recoded into signed-digit representation with fewer number of nonzero digits; the partial products are then generated using the recoded multiplier digits and accumulated. In the modified Booth recoding algorithm, the signed-digit set {~2,~1,0,1,2} is used, and itis, therefore, also referred to as 5-level Booth recoding. In this recoding algorithm, the multiplier bits ésis1 and ba; are recoded into signed digit Of, with bye-1 serving as a reference bit. For example, an &-bit number can be recoded as follow br bs by by by by by by bs =< 2 (13.9) wooo ee is appended after the Isb of this multiplier as a reference bit. ‘The modified Booth recoding algorithm to generate 6 from byi41, bri, and byi-1 is shown in Table 13.1 {A simple way to describe the recoding operation isto calculate : B= Dhan + bv inn (43.10) ‘The reader can verify that = eer) (13.11) ‘The Booth recoded multiplier consists of 3 parts, the recoding (control ct- cuitry for multiplier bits, the partial product generation, and accumulation. Design of fast multipliers using S-level modified Booth recoding is left as an INTERLEAVED FLOOR-PLAN AND BIT-PLANE-BASED DIGITAL FILTERS 489 exercise at the end of this chapter (see Problem 11). Conventional Booth re- coding circuitry has race problem due to the unbalanced paths from inputs to outputs and leads to higher power consumption. A low-power Booth-reooding, circuitry is presented in {11}, which reduces glitches and power consumpti by using a redundant recoding scheme. Design of low-power Booth recoding, circuitry is explored in Problem 12. 13.3 INTERLEAVED FLOOR-PLAN AND BI DIGITAL FILTERS PLANE-BASED. ‘This section is concerned with efficient implementation and floor-plan tech- niques for bit-parallel FIR digital filters. These techniques can be adapted for implementations of IR filters as wel. Consider the constant-coefiicient FIR filtering operation un) = a(n) + f-2(n-1)+9-2(n-2), (13.12) where 2(n) is the input signal, and f and g are filter coefficients. Assume the signal and coefficient wordlengths to be 6 and 4 bits, respectively, and both. are represented in two's complement format. An interleaved approach (12] for high-speed filtering is presented. ‘The main idea behind the interleaved approach is to perform the computation and ac- ‘cumulation of partial products associated with f and g simultaneously, which leads to dramatic reduction in the routing complexity. Another advantage is that since the truncation is performed only at the final step, it also leads to better accuracy. The multiplication chart, based on the Baugh-Wooley algorithm, for computing the output y(n) using this interleaved approach is shown in Fig. 13.10, which is the result of interleaving of 2 parallel Baugh- Wooley multiplications. Note that only the most significant 4 bits of 2(n) are considered since a 6-bit result is desired at the output ‘The final interleaved architecture is shown in Fig, 13.11, where the vector merging portion is a combination of carry-ripple and carry-save implemen: tations, which permits carry-ripple operation under the constraint that the architecture can be pipelined at 2-bit level (the pipelining latches are not shown in Fig. 13.11):- As a result, ater carry has rippled through 2 adders, it is saved and passed to the adder in next row. If the coefficients are interleaved in such a way that their partial product terms are computed in different rows, the resulting architecture is referred to a8 the bit-plane architecture (13] and is shown in Fig. 13.12, where a row of flip-flops has been added after every row of adders in the array thereby en- abling processing at high speed. In order to permit the interleaving operation, the direction of the sum and carry-out signals alternate. If the digital Slter implementation needs to be pipelined at p-bit-level, then each bit-plane in the 490 BIT-LEVEL ARITHMETIC ARCHITECTURES ae ate eal — bit postion feed) erg) fess OKT OKS OKT fxs x, WxS OVXE, ta Xe tae 9xS OKT 1 Yee Ye Ve We Me Fig. 13.10 Multiplication chart for the illustration of the interleaved approach. Here, ‘2c represents 24(n), 2; represents 2y(n ~ 1), and 2) represents x(n ~ 2) Dit-plane architecture can be replaced by a modified bit-plane where each bit is replaced by a group of p consecutive bits. For p = 2, the modified bit-plane architecture can be designed using the multiplication chart of Fig. 13.13. 13.4 BIT-SERIAL MULTIPLIERS ‘This section addresses the derivation of Lyon’s bit-serial multiplier [14] us- ing Horner’s rule. Several other bit-serial multipliers are then derived using systolic mapping techniques. 13.4.1 Design of Lyon's Bit-Serial Multipliers Using Horner's Rule ‘The multiplication rule in (18.5) can be used to derive bit-serial multipli- ers. The architecture for_a 4 x 4-bit bit-serial multiplication is shown in Fig. 13.14(a), where the [2-*]is a bit-serial zero-latency scaling operator and its functionality is illustrated in Fig. 13.14(b). For a bit-serial zero-latency system, the first output bit needs to be generated in the same clock cycle as the first input bit entering the system. For the scaling operator, the first output bit a; should be generated at the same time instance when the first input ay enters the operator. Since input a; has not entered the system yet, the scaling operator is a noncausal or advance operation, and cannot be implemented in hardware. [BIT-SERIAL MULTIPUERS 491 to) dn) 10) 5 apt! wo) se) x0) io) vio) vale) yo veto) Fig. 13.11 Aschitecture for the multiplication chart shown in Fig. 18.10. 492 BIT-LEVEL ARITHMETIC ARCHITECTURES to Ye 92 fs vy Falh [ral [ralh [Fal Ye Ye Fig. 13.12 Bit-plane architecture for the multiplication chart shown in Fig. 13.10. BIT-SERIAL MULTIPLIERS 493, — bit postion tax ek ty XS 9X7 OX, ONT GiXT ORE Te fax axa text ti Te Te Te ORT ONT XT 0X7 Voces Yee @ soup Fy] eeail » Fig, 13.16 (a) Bit-eerial two's complement multiplication derived from Horner's rule. ‘This architecture is not implementable due to the presence of the scaling operation. (b) The scaling operator is zero-latency advance operation. 496 BIT-LEVEL ARITHMETIC ARCHITECTURES witches Fig. 13.15. Derivation of implementable bit-serial two's complement multiplier. It is possible to implement the architecture in Fig, 13.14(a) using pipelining and placing delay elements at the feed-forward cutsets shown in Fig. 13.15. This combination of a delay element and a zero-delay scaling operator can now be realized in hardware. In Fig. 13.15, switching elements $0, S1, and ‘$2 replace the combined scaling operators and delay elements at, thelr cor- responding positions, respectively. Their switching time instances can be found by scheduling the operations in the array multiplication as shown in Fig. 13.16, where the number in the box to the right of each computation indicates the time instance when this computation can be performed. Note that the switches eliminate the Isb of the current partial remainder and ex- tend the sign of the previous partial remainder at the same time. Therefore, scaling operators are also referred to as bt-repeatersas these repeat the msb. ‘The multiplier bits are input at time O to the corresponding AND gates; the rmultiplicand bits are input bit-seriallyIsb-first. ‘The fist bit (Isb) of the 4 bit product is generated at time instance 3. The complete diagram of the bit-sorial two's complement multiplier is shown in Fig. 13.17. This multiplier is also refered to as Lyon's multiplier (14). This system has a latency of 3 clock cycles. If the time for AND and multiplexing operations are ignored, ‘the eritical path computation time of this bit-serial multiplier is equal to 37, ‘where Ty is the time for single-bit addition. Generally, a W x W-bit bit- BIT-SERIAL MULTIPLIERS 495 a wa] orks] res) a tabs{6] aabs[5] as) es] ala] ada] Fig. 13.16 Scheduling instances for operations in bit-srial two's complement array multiplication. weet ae) SST el Fig. 13.17 Lyon's bit-erial two's complement multiplier. serial multiplier has a latency equal to W — 1 clock cycles with the cycle time Tat 2 (W - 1)Tq. Pipelining can be used to reduce the critical path and hence the clock period. For example, by placing delay elements across the feed-forward cutsets (shown by dashed lines) in Fig. 13.17, the critical path of the 4 x4 bit-serial multiplier is reduced to 1 single-bit addition at the expense of an increase in the latency. ‘The critical path of a W-bit Lyon's multiplier can be reduced to 27, without introducing extra pipelining latches, by using associativity and rétiming,(15] (see Problem 18) 13.4.2 Design of Bit-Serial Multipliers Using Systolic Mappings Bit-serial multipliers can be designed by using systolic mapping methodology (see Chapter 7) of the 2D DGs to one-dimension (1D) array, Example 13.4.1 This example considers the design of Lyon's bit-serial mul- tiplier by systolic mapping using the DG of ripple-carry multiplication in 496 BIT-LEVEL ARITHMETIC ARCHITECTURES 302) Py Fig, 13.18. Bit-serialcarry-ripple multiplier in Example 13.4.2. Fig. 13.9. Using the projection vector a” sf = [1 I], and the processor space vector ing edge mappings: 1 0}, the scheduling vector (0 1], we have the follow- pre [ate a@ 1) PG.) carry (1, O) i | With additional sign-eztension cireutry, the resulting bit-seril multiplier in Fig. 13.17 is obtained. Bit-level pipelined version of Lyon's multiplier can also be designed directly from the ripple-carry multiplication DG by systolic mapping techniques (see Problem 10). Hest Example 18.4.2 Consider the DG for the carry-ripple array multiplication in Fig. 13.9. Using the projection vector, processor space vector, and schedul- ing vector a a-({). we have the following edge mappings: 10), 01), (1313) e pre [s7e Oi | op7 b, 0) | 7 [0 carry (I, ) [7 TO z(-4) | -1Pr The resulting bit-serial corry-ripple multiplier is shown in Fig, 13.18, where the switching instances for the multiplezers are omitted for clarity. ‘Example 13.4.8 Consider the DG for the carry-save array multiplication in Fig. 13.5. Suppose a carry-ripple adder is used in the vector merging portion. BIT-SERIAL MULTIPLIERS 497, Fig, 13.19. Bit-serial carry-save multiplier in Example 13.4.3. Using the projection vector, processor vector, and scheduling vector (11), (13.4) ‘we have the following edge mappings: Tepe Tee | ae carry 0, T) 6 (ZL, 0) z(t Y) The resulting bit-serial carry-save multiplier is shoun in Fig. 13.19. The control signals of the multiplezers in this figure are omitted for clarity. @ Example 18.4.4 Consider the multiplication table for Baugh-Wooley mult plication shown in Fig. 13.8. Suppose a carry-save addition is used to ac- cumulate the partial products and a carry-ripple adder is used in the vector merging portion. The DG for this computation is shown in Fig. 19.20. The projection with the following projection vector, processor space vector and scheduling vector a=(2), leads to the following edge mappings (1 0), 01) (13.15) e pre|sve =, alae any 0, [ott @(1, 0) ipo =) 7 carom TOP TT 0 498 BIT-LEVEL ARITHMETIC ARCHITECTURES Fig. 13.20 Dependence graph for carry save Baugh-Wooley multiplication with carry- ripple vector merging where carry—vm denotes the carry outputs in the vector merging portion. The resulting bit-serial carry-save Baugh-Wooley multiplier is shown in Fig. 13.21, ‘where blocks [C] perform inversion operations at some clock cycles to generate the appropriate partial products and these control details have been omitted for simplicity. Note that in this design, the computation at node (i, j) is scheduled at time j, for j =0,1,2,3,4. Therefore, the result bit zo is generated at time 5143 and za, x2, and 2, are generated at time 5! +4 in parallel format. A paralle-to-serial converter is used such that these result bits are output in ‘serial at time instance 51, 5I-+1, 51 +2, 5I-+3, and 5l+4, respectively. The control signal Ty = 1 at time 5! when a new multiplication starts, and Ty = 0 otherwise. Hence the architecture has latency equal to 5 cycles and is able to generate 1 output word every 5 cycles. In addition to inherent hardware ‘utilization inefficiency, a major drawback of this design is the long critical path (shown in dashed line in Fig. 19.24) proportional to the wordlength. Another design can be obtained by using the modified DG shown in Fig. 18.22, with the same projection vector, processor vector, and scheduling vector as in (13.15). In this case, the carry-save array and the vector merging portion are treated as two separate planes during systolic mapping, and the carry-ripple addition in vector merging portion is performed by a separate adder and result BBIT-SERIAL FILTER DESIGN AND IMPLEMENTATION 499 Fig. 1221 Bit-serial Baugh-Wooley multiplier 1 in Example 13.4.4 bits are generated in serial. This bit-serial design is shown in Pig. 13.99. It ‘has a latency equal to four clock cycles and is able to generate one output every four clock cyeles. Note that the critical path of this design is limited by ‘one binary adder delay. m 13.5 BIT-SERIAL FILTER DESIGN AND IMPLEMENTATION FIR and IIR filters are the two most basic structures used in DSP, This section considers design of bit-serial FIR and IIR digital filters with fixed constant coefficients, where the multiplications with constant coefficients are decomposed and implemented using bit-serial shifts and adds. I FIR Filter Consider the implementation of the FIR filter y(n) Ty at Ja) + 520-1), (13.16) with a signal wordlength of 8. Equation (13.16) can be rewritten in terms of shifts and adds as follows: u(r) = ~2(n) + 2(n)2-* + a(n —1)2, (3.7) ‘The word-level signal flow graph of the shif-add based FIR filter is shown in Fig. 13.24(a). Due to the presence of noncausal scaling operators, this is not & feasible design. Pipelining cutsets, shown in dashed lines in Fig. 13.24(a), can be used to delay the advance scaling operations. By placing delay elements 500 BIT-LEVEL ARITHMETIC ARCHITECTURES Fig. 13.23. Bit-serial Baugh-Wooley multiplier # 2 in Example 13.44 ‘BIT-SERIAL FILTER DESIGN AND IMPLEMENTATION 501, o Fig. 13.24 Bit-level pipelined bit-serial FIR filter where constant coeficient multi- plications are implemented as shifts and adds. (3) Filter architecture with scaling operators; (b) feasible bit level pipelined architecture. along the cutsets and replacing the delayed scaling operators with switches, a feasible bit-level pipelined bit-serial FIR iter can be derived, as shown in Fig. 13.24(b). Note that a word-level delay is equivalent to W bit-level delays for a signal wordlength of W. Therefore, the one word delay in Fig. 13.24(a) is replaced by 8 bit-level delays in Fig. 13.24(b). The switching time instances can be derived by scheduling the bit-level computations. 13.5.2 Bit-Serial IIR Filter Consider the implementation of the IIR filter 1 u(n) = —Zyln—1) + Eu(n 1) +200), (assy 502 BIT-LEVEL ARITHMETIC ARCHITECTURES Fig. 13.25 MIR filter can be obtained from a FIR section as shown in this figure. where the signal wordlength is assumed to be 8. This IIR filter equation can be rewritten as w(n) ~fun-0) + fue-, y(n) which can be implemented as an FIR section (computing w(n) from y(n—1)) with an addition and a feedback loop, as shown in Fig, 13.25. A bit-level pipelined bit-serial IIR filter architecture can be derived by going through the following steps. First, a bitlovel pipelined bit-serial implementation of the FIR section neds to be derived. (For this example, this has already been carried out in the previous subsection.) The 2nd step is an intermediate design stage. In this step, the input signal 2(n) is added to the output of the bit-serial FIR section w(n), The resulting signal y(n) is connected with the signal y(n ~ 1). ‘The number of delay elements on this edge, marked by [?D ] in Fig. 13.26(a), is yet to be determined. The resulting structure is shown in Fi which contains 2 loops shown in dashed lines. This completes all the inter- connections in the bit-serial system. However, for systems containing loops, the total number of delay elements in the loops should be consistent with that in the original signal-fiow graph (SFG), in order to maintain synchronization as well a8 correct functionality. In this case, the number of bit-level delay elements in the bit-serial loops should equal W x Np, where W is the signal wordlength and Np denotes the number of (word-level) delay elements in the loops of the original word-level SFG. Matching the number of word-level loop delay elements and that in the bit-serial architecture is referred to as loop de- lay synchronization. For this IIR filtering example, originally, the inner loop contains 1 word delay and the outer loop contains 2 word delays. Hence, in the resulting architecture, the inner loop and outer loop should contain 8 and 16 delay elements, respectively. ‘To compute the total number of delays in the bit-level architecture, the paths with the largest number of delay elements in the switching elements should be counted. For this example, there are 6 delay elements and 14 delay elements in the inner loop and outer loop of the w(n) +2(n), BIT-SERIAL FILTER DESIGN AND IMPLEMENTATION 508 intermediate archivctuo, respectively. Thereore, 2 synchronization delays are added tothe common part of ine and gute loop ad the Boal architec. ture is shown in Fig. 1826(0). Five input synehroniing delay elements are ted to the input, which ar also referred to as shinning delays or skewing delays, Thi bilevel pipelined bit serial I filter has a ctical path conainr ing 1 single-it adder, and has a throughpat of 1 output word ovry 8 clock cgelea. Note that its aso posable thet the lops inthe intermediate bit level pipelined architecture may contain more than W x Np number of bit-level Aay clement. In this case, the wordength noods to be increise. Note that the architecture without the 2 loop synchronizing delays (in shaded region) ean function correctly with a signal wordlength of 6, whlch is the minimum wordlength foc the bitlovl pipelined bit-seralarchivctue in Fig. 18.26). (In the ease of signal wordlength 6, the [8D delays in th outer this case, the throughput corresponds to 1 output (word) every 6 clock eyctes. ‘The IIR filter architecture in Fig. 13.26(b) is derived from the signal-ow graph shown in Fig. 13.27(a). Its minimum wordlength of 6 is constrained by the loop iteration period of one-muliply-two-add (inner loop) time in the signal flow graph of Fig. 18.27(a). This iteration bound (see Chapter 2) can be reduced to one-multiply-add by using associativity which leads to the structure shown in Pig. 13.27(b). A bit-serial implementation of the structure in Fig. 13.27(b) is shown in Fig. 13.28, which requires a minimum feasible wordlength of 5. Example 13.5-1 Consider the implementation of the equation: a(n) = Zain) 4 S2(n-2) +00). (3.19) Assume the signal wordlength to be 8 (ie., the wordlengths for x and w are 8 bits). Assume the coefficient ~7/32 = —1 + 1/2 + 1/4 + 1/32 is encoded as 1.11001 and 3/4 = 1/2+1/4 is encoded as 0.11 in two's complement format. A bit-serial architecture based on the block diagram of Fig. 13.27(b) is shown in Fig. 13.29, where the notation (81 +i) represents the time instance {81+i} and 7/8 represents the remaining 7 time slots in a period of periodicity 8. Note that the number of loop delay elements equals 8 in the inner loop and 16 in the outer loop (uhile counting the number of loop delay elements, the maximum number of delay elements in the switches should be included), which is consistent with the word-level architecture. 506 BIT-LEVEL ARITHMETIC ARCHITECTURES ® Fig, 13.26 Design of bit-level pipelined bit-eral IR filter. (a) Bitlevel pipelined bit- setil architecture, without synchronization delay elements, (b) Bitserial IR filter. [Note that this implementation requires a minimum feasible wordlength of 6 @ ® Fig. 13.27 Loop iteration bound of HR filter can be reduced from one-multply-two- add to one-multipy-add by associative transformation. CANOMIC SIGNED DIGIT ARITHMETIC 505 Fig. 13,28 Bit-cerial IR Alter ater associ requires 8 minimum feaible wordlength of 5. ve transformation, ‘Thi implementation 13.6 CANONIC SIGNED DIGIT ARITHMETIC ‘The number of add operations required in a constant coefficient multiplication equals one less than the number of nonzero bits in the constant coefficient. In order to further reduce the area and power consumption, the constant co- efficient can be encoded such that it contains the fewest number of nonzero bits, which can be accomplished using canonic signed digit (CSD) represen- tation. Quantization of FIR digital filters using a specified number of signed power-of-two (SPT) terms is described in Appendix G. This section addresses the CSD numaber representation and its applications for the design of bit-serial and bit-parallel constant multipliers. 13.6.1 CSD Representation Consider the number A = aw-1.0w-2°+-a1a9, where each aj (W—1 > i > 0) is in the set {—1,0,1}. Two's complement representation may be considered ‘sa special case, in which the bit aw—1 Is equal to 0 or ~1, whereas 1 for 0 ) (tina +S) tewaj24) (13.22) One War Noa = Deut DCS errs) B jet 8 Define Cwars =D citewa-s (G #0), Cw. Seeyw-- (18.28) ‘Then, Y= PV Cwaye?, (13.24) Therefore, by interchanging the summing order of i and j, the initial mul- tiplications in (13.20) ate now distributed to another computation pattern (19), 20}. Since the term C depends on the 2,5 values and has only 2¥ possible val- ues, it is possible o precompute them and store them in a read only memory (ROM). An input set of N bits (205, 217.---, 2-1) is used as an address to retrieve the corresponding C, values. These intermediate results are accumu- lated in W clock cycles to produce one Y value. This leads to a mulkipier-free realization of vector multiplication. Table 13.2 shows the content of the ROM for N=4, Fig. 13.36 shows a typical architecuure for the computation of the inner product of two length-NV vectors. The hift-accumulator isa bit-parallel carry-propagate adder that adds the ROM content to the previous accumu- lated result. The inverter and the MUX are used for inverting the output of the ROM in order to compute Ciy-1 and the control signal $ is 1 when j= W —1 and 0 otherwise. The computation runs from j= Oto j = W~1 and the result is available in bit-parallel format after W clock cycles. This approach corresponds to a bit-serial distributed arithmetic. ‘The speed of @ -serial distributed arithmetic implementation ean be limited for ime applications such as DOT implementation for video com- pression in a digital TV system. ‘The speed of these systems can be improved by a digit-serial distributed arithmetic [21},22] where a digit containing mul tiple bits is processed in a clock eycle. For example, if J consecutive bits are DISTRIBUTED ARITHMETIC 513 Table 13.2 Content of the ROM (N=4) iP Fig _z2j #24 | Content of the ROM o @ er ate a ate CE abate @ Ore oto erate eo +e erate eo ter te @tatate processed in a single clock cycle using J ROMSs, then the input words are pro- cessed in W/.J clock cycles. A mult-input shift-accumulator adds the contents of J ROMs and the previous accumulated result, and generates the output in bit-parallel format. ‘The detailed design of the multi-input shift-accumulator is explored in Problem 24. 13.7.2 Distributed Arithmetic with Offset-Binary Coding In this section, the offset-binary coding (OBC) [18] is introduced that can reduce the ROM size by a factor of 2 to 2¥-" Rewrite (13.21) as: 1 2 = fle (-20) = Few wa) + Lewy awe 2, (13.25) 514 BIT-LEVEL ARITHMETIC ARCHITECTURES rey Meee sbiesecumaaon un Fig. 13.36 Architecture of computing i distributed arithmetic. where wa = mwa + Do mw +2". Define { ij — Fj» for AW -1 ~Giw a1 Fw), for j= W=1 and diy € {1,41}. Equation (13.25) can be rewritten as: wo (3 diwang2 4 —2-0"-0), = Using (13.27), (13.20) can be written as way wa Y= YD del dwar yad 2-0 aris Wes wot 1 = LS fades youre we Now define ca D; Ye feats, for0 | (co I Peas Fig. 13.37 Architecture of computing inner product of 2 length-N vectors using dise tributed arithmetic with OBC coding. DISTRIBUTED ARITHMETIC 517 13.7.3 ROM Decomposition for Distributed Arithmetic ‘The ROM size of the conventional distributed arithmetic increases exponen- tially with N. Generally, ROM aovess time can be a bottleneck for speed of the whole system, especially when ROM size is large. Therefore, reducing the ROM size is very important and is of great practical concern. Due to the linearity of equation (13.24), one possible solution is to divide the NV address bits of the ROM into N/K groups of K bits. Hence itis possible to decompose the ROM of size 2¥ into N/K ROMs of size 2% and add the outputs of these ROMS using a multiinput accumulator. Fig. 13.38 illustrates the architecture for computing an N-input inner product using conventional distributed arith- metic with ROM decomposition. ‘The total size of storage is now reduced Mali-input y Fig. 13.38 Architecture for computing inner product of two length-N vectors using distributed arithmetic with ROM decomposition. from 2% to (N/K)2¥ which increases linearly with N. The ROM access time is also reduced along with the ROM size. This reduction of the storage size is balanced by a linear increase of the computation complexity of the accumula tor. Carry-save arithmetic can be used to realize the multiinput. accumulator to minimize the computation time. Note that ROM decomposition technique is applicable to OBC-coding-based distributed arithmetic as well. ‘518 BIT-LEVEL ARITHMETIC ARCHITECTURES 13.8 CONCLUSIONS This chapter has presented the design of bit-level arithmetic architectures, Design of bit-parallel multipliers, including carry-ripple array, carry-save ar- ray, Baugh-Wooley, and Booth-recoded multiplications has been introduced, ‘The two efficent layout strategies, interleaved floor-plan and bit-plane, for dig- ital filters have been discussed. Design of bit-serial multipliers using Horner's rule as well as using systolic mapping has been addressed. This chapter has also addressed the design methodologies for bit-level pipelined bit-serial FIR and IIR filters with constant coefficients. ‘The CSD representation has been presented for the design of low-cost, high-speed constant multipliers. Latency reduction in serial and parallel computations using associativity and tree- height reduction has been discussed. Finally, the bit-level implementation schemes for vector-vector multiplication based on the distributed arithmetic approach has been addressed. Residue arithmetic [23], which is often used for implementation of FIR digital filters and transforms, is beyond the scope of, this book Digit-serial architectures are also attractive for discrete wavelet transforms, where higher levels of wavelet can be implemented using smaller digit sizes [24]. For example, in a Slevel wavelet, digitsizes W/2, W/4 and W/8 can be used for levels 1, 2, and 3, respectively. Higher levels require less computa- tion rates and can be implemented using fewer hardware, assuming use of a single clock in the system. Traditionally, digit-serial multipliers are designed by folding bit-parallel multipliers, or unfolding bit-serial multipliers. Some resulting digit-serial architectures cannot be pipelined at the bit-level. A novel cell replacement transformation technique presented in (25] can be used to design low-power bit-evel pipelined digit-serial multipliers from bit-serial multipliers. 13.9 PROBLEMS 1. (a) Verify that the circuit in Fig. 13.39 implements a binary full-adder. (b) The adder circuit in Fig, 13.39 uses 32 transistors. Derive an op- ‘timized architecture using CMOS-transmission gates with only 24 transistors (Hint: Try to share hardware of sum and carry parts). (©) What is the number of transistors in the critical path of an &-bit carry-ripple adder? 2. Consider the addition of 2 two's complement &-bit numbers X and Y in a ripple-carry manner as shown in Fig. 13.40. Derive a truth table and logic circuit for detecting overflow for this adder using the bits er, yr, cr, cg, and sr. (Hint: If zr # yr, ie, the sign bits PROBLEMS 519 x ms Fig. 1339 Schematic of a complementary pass-transistor logic based CMOS fall- adder. |. This problem considers the derivation of the Baugh-Wooley multi aie Pa ye yt ye a! fiddiath ey 7 Ss Fig, 13.40. Bight-bit ripple carry adder in Problem 2. of X and ¥ are different, then there can never be overflow. If x7 = yr, then overflow oceurs if 27 = yr # #1.) Design bit-parallel architectures for computation of un) = Dain) (3.32) 3 using (a) carry-save arithmetic and (b) treerheight reduction technique, using half and full adders and a VMA. Assume a wordlength of 8. Com- pute the latencies of these architectures without including the latency of the VMA. ica tion algorithm shown in Fig. 13.8. Consider the 4 x 4-bit multiplication ‘operation X = Ax B, where A = a3.ay009 = ~a3 +0227? +12"? + G92, B= [Link] = ~bs + by2- + b2-* + bo2®. 520 BIT-LEVEL ARITHMETIC ARCHITECTURES Multiplying A and Bis equivalent to perform P= (03 40,2 + 0,2? + a2-8)(—by + Do21 +12 + B92) (asbs + (a2 + 012-? + a92"9)(bo2"! + b12-? + bo27%)) ~ (02? + 612°? + B92-9) + (a2! + 0,2"? + 092%). (13.33) (a) Prove that ~ (05:27 + 12°? + B22) + (aa! + 012°? + a2) = (agba + agby)2-* + (Guby + ayy + 1)2-? + (Gabo + Gods) 2-*, (1334) where @ = 1 — a, for a = 0 or 1. Hint: The product of 2 numbers ‘within the range [—1,1) should still be within this range. Hence, the nonzero bits at higher bit positions can be ignored, (b) Using equations (13.33) and (13.34), and associativity property to rearrange all the terms involved in this multiplication, derive the Baugh-Wooley multiplication table shown in Fig. 13.8 5. The Baugh-Wooley multiplication algorithm can also be implemented as follows: P= (~05 40,71 +0;2°? + ag2-*)(—by + ba? + 512°? + By) = aby + (az27! + 012°? + a92-*)(bg2* + 12°? + bo2™*) #05 + (agba2™! + a9b;2"? + asBo2™*) + 032° +a + (Gab? + abg2-* + W592) + by2-> (13.35) by rewriting (13.34) as ~(as(22°? + 6:2°? + bo) + (a2! + 012"? + ag 2*}ba) = 43(—bg2-! — by2-* = by 2) + bg(—a22-! - a1 2°? ~ ag2-*), (1336) Prove the Baugh-Wooley multiplication algorithm in (13.35) and derive the multiplication table for this algorithm. (Hint: Use the fact that Dhan ~Paaidt = Lh (= dF 142%) 6. Consider the computation of x(n) + fa(n —1) + ge(n— 2). (13.37) Using Baugh-Wooley multiplication algorithm and two's complement 10. rt PROBLEMS — 521 reptetentation, draw the interleaved bit-parallel floor-plan architecture for this computation with (2) bit-level pipelining, and (b) 2-bit-Level pipelining for signals and coefficients of wordlength of 4, Using the multiplication chart in Fig. 13.13, design a 2-bit-level pipelined bit-parallel architecture using modified bit-plane technique for the com- putation in Problem 6. Show detailed schedules for the bit-serial multiplier in Fig. 13.19 and verify its correctness. |. Show detailed schedules for the bit-serial Baugh-Wooley multiplier in Fig. 13.23 and verify the correctness of this architecture. Find appropriate projection vector, scheduling vector, and processor space vector to design a bit-eval pipelined version of Lyon's bit-serial carry-ripple multiplier from the DG in Fig. 13.3. This multiplier should be a pipelined version of the multiplier in Fig, 13.17. Complete the details of the design and show the switching instances ofall multiplexers in the multiplier. ‘This problem is concerned with design of complete architecture for a bit- serial 5-level recoding-based modified Booth multiplier for a wordlength of 12 using Horner’s rule expansion. Use the notation where the radix point is to the right of the msb (or sign bit) and bit number 11 is the msb and bit number 0 is the Isb. (0) To obtain the recoded signal Wes Mania bt pets (13.38) wwe need to derive three control signals Az, Bi, and Ci. Cy = 1 implies y is negative, and C; = O implies yf is nonnegative. By = 1 implies jy| = 2 and j = 1 implies [yf] = 1. Draw the Karmaugh map for Aj, Bjy and C, and prove that wi ® ia (uaier © vai) (92i ® vai vans (13.39) where the symbol © denotes exclusive OR, and the overbar indi- cates complementation. Using As, By and Ci, design the modified Booth recoding unit. 522 IT-LEVEL ARITHMETIC ARCHITECTURES Table 13.5 Truth Table for Low-Power Booth Recoding Circuitry in Problem 12 ipa [oy [a [me ao o[o0, 0 po, 0,opiyi axa ees [OTs EO 0 T 0 7 ofvijyolt* a T 0 0 27 1 o;iso i] aa eof [ga T T 0 =i1f 1 ijoly* Tea oat (b) Draw the complete modified Booth recoded bit-serial multiplier architecture and show all switching instances. 12. This problem considers design of a Booth recoding circuit for low power. ‘The control circuitry for Booth recoding in Problem 11 contains only ‘two XOR gates and one inverter. However, the generation of the con- trol signals Aj, Bi, and C; through different logic levels with different propagation delay time leads to an increase in the glitching activity in the partial product generation and accumulation circuit, hence leading to high power dissipation. Glitching and power consumption can be re duced by using a redundant recoding scheme, which balances the paths from each input to the outputs of the control circuitry (11). Consider the multiplication A x Y, and denote the recoded multiplier bits as yj. Four control signals are used in the new race-free Booth re- coder, neg, #1, 22 and zp. neg = 1 implies yj is negative, and neg = implies y; is nonnegative. 21 = 1 implies |y}| = 1; 22 = ZI. zp distin. sguishes |y}| = 2 from ly}| = 0. The new recoding scheme is summarized in the Table 13.5, where * denotes a don't-care condition. One bit-slice of the partial product generator obtained using this recoding scheme is shown in Fig. 13.41. (a) Prove that the circuit in Fig. 13.41 generates the correct partial products for the Booth multiplication (b) Derive the Booth recoder (control) circuit such that the propaga- tion delay of all paths from each input bit, including vajsay ways tnj-1, 0; and aj-1, to each bit of the partial product, Ay}, are equal and are limited by one XOR/XNOR gate delay, one AND gate delay, and one NOR gate delay. The don't care conditions in zp signal can be used for logic minimization. Show those zp values that you use to substitute the don’t-care conditions. PROBLEMS 523 neg— al 2 Fig. 13.41 One bit slice partial product generator for the [Link] Booth multiplier in Problem 12. 13. Obtain a 2-unfolded carry-ripple digit-serial multiplier by unfolding the Lyon’s multiplier in Fig. 13.17. 14, This problem is concerned with fixed-point bit-serial and digit-seral implementation of an all-pole second-order recursive filter vin) = Fyn 1) + Fain —2) +210) Assume the signal wordlength to be 8 (i.e, the wordlengths for y and are 8 bits). Assume the coefficient ~7/8 is encoded as 1.001 and 3/4 is encoded as 0.11. In the multiplication with respect to coefficients, only ‘multiplication with respect to nonzero bits is carried out, (@) Draw the block diagram of the computation by exploiting associa- tivity. The inner loop bound of your architecture should be limited by 1 multiply-add time. (b) Design a:functionally correct bit-level pipelined bit-serial architec- ‘ture for the structure in part (a). (©) The loop latency in part (b) imposes a constraint on the minimum ‘wordlength for the signals y and z. What is the minimum feasible signal wordlength for the system in part (b)? 15. This problem considers bit-serial implementation of the 3rd-order fixed- coefficient IIR filter shown in Fig. 13.42. Assume the data wordlength to be 8, filter coefficient wordlength = 4, and filter coefficients a = —1/4, 524 16. Wy, {IT-LEVEL ARITHMETIC ARCHITECTURES yo) Fig, 1342 UR filter diagram for Problem 15. b= 1/8, c= -3/4. Assume all numbers to be represented using two's complement representation. (2) Obtain an equivalent structure by using associativity such that the inner loop bound is limited by 1 multiply-add time. (b) Complete the bit-level pipelined bit-serial design of the structure in part (a). Treat minimization of delay elements in the design as a secondary objective. ‘This problem addresses a bit-level pipelined bit-serial implementation of the recursive computation vin) = Zn 1)+ Sin —2)+ Bin) using two's complement representation and a signal wordlength of 10. ‘The bit-erial design must use Horner's rule to improve accuracy in the precision avalable Asocitivity must be exploited so that the Loop bound of the inner loop is limited by 1 multiply and 1 add time at the word level. Design the bit-seral architecture fora signal wordlength of 10. What is the minimum feasible wordlength for this computation? Consider the bit-serial implementation of the computation 25y(n 1) + a(n) (13.40) y(n) = shown in Fig. 13.43. ‘What wordlength has been used in the circuit? For this wordlength, complete the missing switching instances. Unfold this bit-serial struc ture by a factor of 3 to obtain a digit-serial implementation with digit size 3, PROBLEMS — 525 x(a) 20} Fig. 13.43 Bit-oerial implementation of the TR filter in Problem 17, 18. This problem considers the design of a low-latency bit-serial multiplier by transforming the ripple-carry Lyon's multiplier using associativity and retiming. @) (b) o @) Show that the architecture in Fig, 13.44(a) can be transformed to that in Fig. 13.44(b) using associativity and retiming. Compare the critical path and latencies of these 2 architectures. Design a low-latency bit-serial multiplier by applying similar tran formations as in part (a) to the Lyon’s ripple-carry bit-serial mul- tiplier or by recasting the Homer's rule multiplication equation into an appropriate form. Assume a wordlength of 12 for signal sample and the coefficient. Show all switching instances in your architecture. Design alow latency constant bit-serial multiplier to compute y(n) = a(n), where a = 001011100101, by considering multiplication with respect to nonzero bits only using the architecture in part (b). Show all the switching instances in this architecture. Assume ‘signal wordlength of 12. Based on the bit-serial multiplier derived in part (c), design a bit- serial implementation of the recursive computation y(n) = ay(n ~ 1)-+2(n) for the same value of ain (c). Assume a signal wordlength of 12 19. Obtain the CSD representation of the following two's complement num- bers: 000010110, ex = 0.01001100, ¢2 = 00010110111 191100110, 526 BIT-LEVEL ARITHMETIC ARCHITECTURES » Fig, 13.44 Bit-serial multipliers in Problem 18. 20, Obtain the CSD representation of the following two's complement num- bers: = LAL, ¢ = 111110010, ¢ = 0.01100110, es = 0.01001100111. 21. ‘This problem considers a bit-serial implementation of the IIR filter u(n) = ary(n—1) +azy(n 2) + 2(0), (13.41) ‘where a) = 0,00011011100 and ay = 0.01001111110 (a) Represent the coefficients a; and a2 in CSD representation. (©) Using CSD representation and the data-flow graph in Fig. 13.27(b), obtain a bit-level pipelined bit-serial IIR filter architecture for a signal wordlength of 8. What is the minimum feasible wordlength of this architecture? (©) Apply the Horner's rule and tree height reduction techniques to improve your design in part (b). What is the minimum feasible wwordlength of this design? 22. Extend the OBC concept to implement a distributed arithmetic archi tecture using a ROM containing 2/4 words. 23. Consider ROM decomposition with N=16, K=4. Calculate the reduc- tion factor in ROM size. Realize the distributed arithmetic architecture ‘using two-input accumulators and carry-save arithmetic. REFERENCES 527 24, This problem considers design of digit-serial distributed arithmetic ar- chitecture. Implement the computation wa v= San (03.42) aS using a 3-bit digit-serial distributed arithmetic assuming a wordlength of 9. The bits zz.se41 should be processed in the I-th ROM (1 = 0, 1,2) in k-th clock eycle (k = 0, 1,2). Using 3 ROMs and a multi-input shift- accumulator, design the 3-bit digit-serial distributed arithmetic archi- tecture. This architecture processes 3 consecutive input bits in a clock cycle and all input words are processed in 3 clock cycles, Assume the ROM contents to be represented using 16 bits. Design two forms of shift-accumulators using bit-paralle (i) carry-ripple and (ii) carry-save adders. The output of the shift-accumulator is represented using 16 bits. The critical path of the carry-ripple design should be I7tr4 and that of the carry-save design should be 3tp.4, where tr. is the propaga- tion delay of a full adder. ‘The carry-save design requires an additional carry-propagate adder at the end. (a) Calculate the latencies of both designs in terms of tp. (b) Which design is more suitable for high speed? Why? (©) Which design is more suitable for low power? Why? REFERENCES 1. L, B, Jackson, J. F. Kaiser, and H. $. McDonald, “An approach to imple- mentation of digital filters,” EBB Trans. on Audio Electroacoustics, vol. 16, pp. 413-421, 1968. 2. P.B. Denyer and D. Renshaw, VLSI Signal Processing: A Bit-Serial Approach. ‘Addison-Wesley, 1986. 3. Jain et al, “Custom design of a VLSI POM-FDM transmultiplexor fom system specification to circuit layout using a computer aided design system,” IBBE J. Solid State Circuits, vol. 21, pp. 73-85, Feb. 1986, 4. K. K, Parhi, “A systematic approach for design of digit-srial processing archi- tectures;” IEBE Trans. on Cireuits and Systems, vol. 38, no. 4, pp. 358-375, April 1991. 5. R. I. Hartley and K. K. Parhi, Digit-Serial Computation. Kluwer, 1998. 6..N. B. Weste and K. Eshraghian, Principles of CMOS VLSI design: A Systems Perspective, Addison-Wesley, 1993. 7.K. Hwang, Computer Arithmetic: Prine 1979. Architecture and Design. Wiley, 528 BIT-LEVEL ARITHMETIC ARCHITECTURES 8.1. Koren, Computer Arithmetic Algorithms. Prentice Hall, 1998. 8. C. R. Baugh and B. Wooley, *A two's complement parallel array multiplication algorithm,” 1BBE Trans. on Computers, vl. C-22, no. 12, pp. 145-1047, Dee. 1973. 10. C. 8. Wallace, “A suggestion fora fast multiplier” IBBE Trans. on Computers, vol. BC-13, pp. 14=17, Feb, 1964, 11 R. Fried, *Minimiaing energy dissipation in high-speed multipliers ISLPBD-97, (Monterey, CA), pp. 214-219, Aug. 1997. 12M, Hatamian and K, K. Pashi, “An 85-MHl2 fourth-order programmable TIR digital filter chip,” IBBB Journal of Solid State Cireits, vol. 27, pp 175-188, Feb. 1992. 13, 7. G. Nol, *Semi-systolic maximum rate transversal Stars with programmable coeficients,” in Proc. of 1986 International Conference on Systolic Arrays, (Oxford, U.K.), pp. 5-13, July 1986. 14. R. F, Lyon, “Two's complement pipelined multipliers,” IBEB Trans. on Com- munications, vol 24, pp. 418-424, April 1976. 15. D. Ait-Boudaoud, M. K. brahim, and B.R. Hayes-ill, “Novel cell architecture for bit level systolic arrays muldplication,” IBB Proceedings-B, vol. 138, Jan 1901 16. 8. W. Reitwiesner, “Binary arithmetic;” Advances in Computer, pp. 231-308, 1966. 17, MeT. Sua, TC, Chen, and AM. Dottieb, “VLSI implementation of @ 16% 16 discrete cosine transform chip.” IBEB Trans. on Circuits ond Sytems, vol. 36, zo. 4, pp. 610-617, Apail 1989, 18, N, Demassieux and F. Jutand, “Orthogonal transforms,” in VEST Implementa- tions for Image Communications (P. Pirsch, ed), pp. 217-250, Elsevier, 1888, 19. A. Peled and B, Liu, “A new hardware realization of digital filters” TEBE Trans ‘Acoustics, Spech, and Signal Processing, vol. ASSP-22, no. 6, pp. 456-462, Dec 1974 20. ©. 8, Burrus, “Digital filter structures described by distributed arithmetic; IBBB Trans. on Circuits ond Systems, Dec. 1977 21, 8. Uramoto et al, “A 100-MHz 2-D diserete cosine transform core processor” IBBE Journal of Solid-State Cireuits, vol. 27, no. 4, pp. 402-499, April 1992. Y. Katayama, 1. Tamitani, A, Taniguchi and Y. Ooi, “A singlechip MPEG1 audio/video decoder using macrocore and call-based implementation,” in IEBE VES! Signal Processing, VITI, (Osaka, Japan), pp. 431~440, Oct. 1995. 28. , J. Taylor, “Residue atithmetic: a tutorial with examples," IEEE Tra Computers, May 1984. 24. K. K. Pashi and T. Nisitani, “VEST architectures for disrete wavelet trans- forms,” IBBB Trans. on VLSI Systems, vol. 1,20. 2, pp. 191~202, June 1998. 25. Y-N. Chang, J. H. Satyanarayana, and K. K, Path, “Design and implementa- tion of low-power digit-serial multipliers” in Proc. 1997 IBEE International Gonference on Computer Design (ICCD), (Austin, TX), pp. 186-195, Oct. 1907. in Proc. of

You might also like