Shanthala 2009

Second International Conference on Emerging Trends in Engineering and Technology, ICETET-09
Design and VLSI Implementation of Pipelined Multiply Accumulate Unit
Shanthala S, Cyril Prasanna Raj, Dr. S. Y. Kulkarni
Shanthala S, Research Scholar, NMAM Institute of Technology, Nitte and Assistant Professor, BIT, Bangalore. email: shanthala_wg@yahoo.com
Cyril Prasanna Raj, Program Manager, M. S. Ramaiha School of Advanced Studies, email: .cyrilnazrth@yahoo.co.in
Dr. S. Y. Kulkarni, Principal, NMAM Institute of Technology, Nitte. email: sy_kul@yahoo.com
978-0-7695-3884-6/09 $26.00 © 2009 IEEE

Abstract: In the majority of digital signal processing (DSP) applications, the critical operations involve many multiplications and/or accumulations. For real-time signal processing, a high-throughput multiplier-accumulator (MAC) is therefore a key element in achieving high performance. In the last few years, the main consideration in MAC design has been speed, because speed and throughput rate are always concerns in digital signal processing systems. With the growth of portable electronic products, however, low-power design has become another major consideration, because the limited battery energy of these products restricts the power consumption of the system. The main motivation of this work is therefore to investigate pipelined MAC architectures and circuit design techniques that are suitable for implementing high-throughput signal processing algorithms. The goal of this project was the design and VLSI implementation of a pipelined MAC for high-speed DSP applications in 180 nm technology. In designing the pipelined MAC, various multiplier architectures and one-bit full adders were considered, and the static and dynamic one-bit full adder was implemented as the basic block. To check the functionality of the whole system, SPICE code was written in HSPICE, with every block in the circuit defined as a subcircuit. Schematic capture was then done with the Virtuoso schematic composer, from the bottom level to the top level. Finally, the layout of the complete MAC was done in Virtuoso.

Keywords: DSP, MAC, CMOS, Pipeline, static and dynamic

1. INTRODUCTION
In the majority of digital signal processing (DSP) applications, the critical operations involve many multiplications and/or accumulations. For real-time signal processing, a high-speed, high-throughput Multiplier-Accumulator (MAC) is therefore always a key to achieving a high-performance digital signal processing system. In the last few years, the main consideration in MAC design has been to enhance its speed, because speed and throughput rate are always concerns in digital signal processing systems. In the era of personal communication, however, low-power design has become another main design consideration, because the battery energy available in portable products limits the power consumption of the system. The main motivation of this research is therefore to investigate pipelined multiplier/accumulator architectures and circuit design techniques that are suitable for implementing high-throughput signal processing algorithms while achieving low power consumption.

2. ONE BIT FULL ADDER ARCHITECTURES
The major cell of the multiplier and the accumulator is the 1-bit full adder, which determines the operational speed, power dissipation and area of the MAC. That is, the speed of the full adder between two pipeline stages determines the system clock rate. A fully pipelined full adder designed with True Single Phase Clock (TSPC) logic has been considered, but its transistor count and power dissipation are large. Designs based on Complementary Pass Transistor Logic (CPL) have shown low power dissipation and rather high speed of operation, but their transistor count in a pipelined design is also high because both the outputs and their complements must be latched. In this section, we propose a new circuit design that has a high operation speed, the smallest transistor count and the lowest power/speed ratio. Our design is based on the Quasi-Domino dynamic full adder circuit, but several modifications have been made.

Fig. 1: Quasi-Domino Full Adder Circuit

The circuit diagram of the Quasi-Domino dynamic full adder [Kang and Leblebici] is shown in Fig. 1. The main difference between the Quasi-Domino and the standard N-P domino circuit is that it removes a PMOS transistor controlled by the clock
signal in the P-block and an NMOS transistor controlled by the clock signal in the N-block. In addition, a resistive NMOS is added in series with the N-switch under the P-block to deliberately reduce the pull-down voltage swing during the precharge phase without enlarging the N-switch gate length, which avoids increasing the capacitive load on the clock signals. A complementary single-phase clock scheme is used in this pipeline system: clk and clkb are interchanged from one pipeline stage to the next. During the precharge phase, depending on the input pattern, nodes 1 and 2 are not always fully driven to VSS and VDD, respectively. This is because, if a conduction path exists between VDD and VSS, the output node may stay halfway between VDD and VSS. However, the preceding stages are in the evaluation phase, so by the end of the precharge phase the inputs are correct. Thus, during the following evaluation phase, the discharging (charging) switch is turned off and the PMOS logic tree (NMOS logic tree) pulls node 1 (2) back to VDD (VSS). The existence of a DC path in the dynamic gates does not affect the correctness of the full adder, but it can reduce the discharging or charging time in the evaluation phase. This is a merit of the original circuit. The worst-case delay of the quasi-domino full adder is the time to evaluate node 1 (through the PMOS logic tree) and node 2 (through the NMOS logic tree). We observe that during the precharge phase the preceding pipeline stage is in its evaluation phase, and at the end of that phase the output signals of the preceding stage are already correct. So, if we change the carry block of the cell to static logic, the carry block can be partially evaluated during the precharge phase, and the delay is shortened. In the new full adder design, we therefore use static pseudo-NMOS logic instead of the conventional PMOS-type domino logic shown in figure 2. The merit of the change is twofold: an increase in speed, and a decrease in power consumption and area overhead.

Fig 2: Conventional CMOS logic

The increase in speed has been described above. With regard to the decrease in power consumption and area overhead: a PMOS device is slower than an NMOS device, so the PMOS is usually sized two to three times larger than the NMOS. The larger PMOS also increases the parasitic capacitance and, therefore, the setup time and power dissipation. If complementary static CMOS logic is used, this condition still exists. If static pseudo-NMOS logic is used, however, the weakness of the PMOS can be avoided. It also removes one transistor and one clock input, so the layout area and power dissipation are reduced. On the other hand, pseudo-NMOS logic increases the static power consumption, so the size of the PMOS transistor should be chosen carefully as a compromise between power consumption and speed. Note that it is the dynamic power (charging and discharging of the load capacitors) that dominates the overall power consumption.

Fig 3: Pseudo-NMOS logic

Fig 4: Standard N-P domino logic

For the sum block, we use the original NMOS logic tree dynamic circuit, because node 1 is not yet correct at the beginning of the evaluation phase, so there is no advantage in changing the sum block to static logic. The NMOS logic tree of the quasi-domino structure is therefore retained to achieve high speed and low power dissipation. One modification to the sum block is that we place the NMOS device controlled by the last-arriving carry signal nearest the output of the sum block. With this scheme, the early signals in effect discharge the internal nodes, and the last-arriving signal only has to switch a transistor with minimum body effect. The new full adder is a
combination of static and dynamic logic design [Jou, Chen, Chung and Su, 2000], so we name it the S&D full adder.

Fig 5: Complementary pass transistor logic

One technique that has been shown to be effective in reducing the delay of gates consisting of a series of transistors is to change the size of each transistor according to its position in the series structure. This is called transistor sizing. Here we increase the size of the transistors by about 30% successively, from the device connected to the output to the device connected to VSS.

Table 1 compares the simulation results of all the pipelined adders in 180 nm technology with a 1.8 V power supply. According to the table, the static and dynamic full adder is the fastest and one of the smallest. Although it has the largest power dissipation at its maximum operating frequency, its power/speed ratio is the smallest. As for the static power dissipation in the carry block: if the S&D adder is operated at its maximum speed, the steady-state period is very short, which keeps the static power dissipation very small.

Table 1: Comparison of full adder implementations
Type                            Power dissipation   Transistor count   Max delay
Conventional adder              1.30 mW             32                 2.5u
Pseudo-NMOS logic               1.65 mW             22                 2 ns
N-P domino logic                1.78 mW             23                 1.2 ns
Static and dynamic full adder   1.90 mW             22                 800 ps

3. MULTIPLIER ARCHITECTURES
The multiplier is a fairly large block of a computing system [Çiftçi, 2003]. The amount of circuitry involved is proportional to the square of its resolution; i.e., a multiplier of size n bits has n² gates. For multiplication algorithms performed in DSP applications, latency and throughput are the two major constraints from a delay perspective.

Fig 6: Array Multiplier

Fig 7: Carry Save Multiplier

Fig 8: Wallace Tree Multiplier
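The quadratic growth in circuitry can be illustrated with a small behavioural model. The Python sketch below is not the circuit of Fig 6: it assumes unsigned operands, forms the n × n partial-product bits with one AND gate each, and sums the rows with a ripple of 1-bit full adders. The names `full_adder` and `array_multiply` are hypothetical and chosen only for illustration.

```python
def full_adder(a, b, cin):
    """1-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def array_multiply(x, y, n):
    """Multiply two unsigned n-bit integers bit by bit.

    Returns (product, and_gate_count). Each partial-product bit costs
    one AND gate, so the count is n * n.
    """
    xb = [(x >> i) & 1 for i in range(n)]  # bits of x, LSB first
    yb = [(y >> i) & 1 for i in range(n)]  # bits of y, LSB first
    acc = [0] * (2 * n)                    # running sum, LSB first
    and_gates = 0
    for j in range(n):
        # Partial-product row j: (x AND y[j]) shifted left by j bits.
        row = [0] * (2 * n)
        for i in range(n):
            row[i + j] = xb[i] & yb[j]
            and_gates += 1
        # Add the row into the running sum with a chain of full adders.
        carry = 0
        for k in range(2 * n):
            acc[k], carry = full_adder(acc[k], row[k], carry)
    product = sum(bit << k for k, bit in enumerate(acc))
    return product, and_gates


if __name__ == "__main__":
    print(array_multiply(13, 11, 4))
```

For 8-bit operands the model counts 64 AND gates, which is n² for n = 8. A hardware array multiplier arranges the adders in a two-dimensional grid rather than reusing one ripple chain, so this sketch models the arithmetic and the partial-product count, not the timing.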
Fig 9: Inoue Multiplier

Latency is the real delay of computing a function: a measure of how long after the inputs to a device are stable the final result is available on the outputs. Throughput is a measure of how many multiplications can be performed in a given period of time. The multiplier is not only a high-delay block but also a significant source of power dissipation. That is why, if one also aims to minimize power consumption, it is of great interest to identify techniques that reduce delay through various delay optimizations.

Figures 6-9 show various multiplier architectures [Meier, Rutenbar and Carley, 1996]. The selection of a particular multiplier depends on factors such as speed, area, power and cost. Speed is the main factor to be considered in DSP applications.

Table 2: Comparison of Multipliers
Type           Speed   Complexity   Area   Power   Cost
Array          L       SI           S      H       L
Carry save     M       SI           S      H       A
Booths         H       M            M      M       M
Wallace tree   H       M            M      M       A
Dadda          V.H     M            M      H       A
Hitachi        V.H     M            LG     V.H     H
Inoue          V.H     M            LG     V.H     H
L - low, M - medium, H - high, V.H - very high, SI - simple, S - small, LG - larger, A - average

Table 2 compares the multipliers with respect to speed, area, power, cost and complexity. It shows that the Hitachi and Inoue multipliers have very high speed, but their power, cost and area are also very high. The Wallace tree multiplier offers considerably high speed, while its area, power and cost are moderate.

4. MULTIPLY ACCUMULATE UNIT

Fig 10: Multiply Accumulate Unit

Figure 10 shows a single MAC unit with multiplier, adder, and accumulator. The most typical feature that differentiates a DSP from any general-purpose processor (GPP) is the multiply-and-accumulate unit. All DSP algorithms require some form of the multiplication and accumulation operation. This is the most important block in DSP systems [Suvakovic, Andre, Salama, 2000]. It is composed of an adder, a multiplier and an accumulator. The adders implemented in DSPs are usually ripple carry, carry-select or carry-save adders, as speed is of utmost importance in a DSP. The multiplier multiplies the inputs and passes the result to the adder, which adds it to the previously accumulated result. This operation eases the computation of the most important formula, i.e., b(n)x(n-k), which is needed in filters, Fourier analyzers, etc. The inputs to the MAC are fetched from memory and fed to the multiplier block of the MAC, which performs the multiplication and passes the result to the adder; the adder accumulates the result and, if needed, stores it in a memory location. This entire process is to be achieved in a single clock cycle.
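The role of the MAC in filtering can be sketched in a few lines of Python. The example assumes that the formula in the text refers to the usual FIR sum y(n) = Σ_k b(k) x(n-k), with samples before x(0) taken as zero; that reading, and the function names `mac` and `fir`, are assumptions made for illustration and are not taken from the paper.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: returns acc + a * b."""
    return acc + a * b


def fir(b, x):
    """FIR filter built from repeated MAC steps.

    Computes y[n] = sum over k of b[k] * x[n - k], treating samples
    before x[0] as zero (an assumption of this sketch).
    """
    y = []
    for n in range(len(x)):
        acc = 0  # the accumulator is cleared for each output sample
        for k in range(len(b)):
            if n - k >= 0:
                acc = mac(acc, b[k], x[n - k])
        y.append(acc)
    return y


if __name__ == "__main__":
    print(fir([1, 2, 3], [1, 0, 0, 1]))
```

Each output sample costs one MAC step per filter coefficient, which is why a MAC that completes one step per clock cycle sets the achievable sample rate of such a filter.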
5. RESULTS

Fig 11: SPICE simulation of pipelined MAC

The inputs applied to the MAC are 0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 0 and 1 1 1 1 1 0 0 1 0 1 0 0 1 0 1 1, and the outputs of the pipelined MAC are 0 0 0 1 1 0 1 0 1 1 0 1 0 0 0 0. The simulation results shown are the outputs of the fourth stage of the pipelined MAC.

5.1 LAYOUTS

Fig 12: Layout of 8×8 Wallace Tree Multiplier

Fig 13: Layout of MAC unit

Fig 14: Layout of Pipeline MAC unit

5.2 PERFORMANCE SUMMARY
Table 3 shows the performance summary of the pipelined MAC.

Table 3: Performance Summary
Input bits                8 × 8 bit
Technology                0.18 micron
Power supply              1.8 V
Transistor count          3,200
Power dissipation         50.26 mW
Latency                   6 cycles
Max operating frequency   83.3 MHz
Area (without I/O pads)   3 × 1.05 mm²
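A few derived quantities follow from the reported figures of 83.3 MHz, 6 cycles of latency and 50.26 mW. The Python sketch below works them out under two assumptions that the paper does not state explicitly: that the pipeline accepts one new operand pair per clock cycle once it is full, and that the reported power is dissipated at the maximum operating frequency. All function names are hypothetical.

```python
F_MAX_HZ = 83.3e6       # maximum operating frequency (reported)
LATENCY_CYCLES = 6      # pipeline latency in clock cycles (reported)
POWER_W = 50.26e-3      # power dissipation (reported)


def clock_period_s(f_hz=F_MAX_HZ):
    """Clock period in seconds."""
    return 1.0 / f_hz


def latency_s(cycles=LATENCY_CYCLES, f_hz=F_MAX_HZ):
    """Time from stable inputs to the first valid result, in seconds."""
    return cycles * clock_period_s(f_hz)


def cycles_for(n_ops, latency_cycles=LATENCY_CYCLES):
    """Cycles needed for n_ops back-to-back operations.

    Assumes one new operation enters the pipeline per cycle, so the
    first result takes latency_cycles and each later one adds a cycle.
    """
    if n_ops <= 0:
        return 0
    return latency_cycles + n_ops - 1


def energy_per_cycle_j(power_w=POWER_W, f_hz=F_MAX_HZ):
    """Average energy per clock cycle, assuming power is quoted at f_hz."""
    return power_w / f_hz


if __name__ == "__main__":
    print(f"clock period : {clock_period_s() * 1e9:.2f} ns")
    print(f"latency      : {latency_s() * 1e9:.2f} ns")
    print(f"100 MACs     : {cycles_for(100)} cycles")
    print(f"energy/cycle : {energy_per_cycle_j() * 1e9:.3f} nJ")
```

Under these assumptions the clock period is about 12 ns, the latency about 72 ns, and the average energy roughly 0.6 nJ per clock cycle.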
6. CONCLUSION
Initially, different MAC architectures were analyzed to determine the most suitable topology for the given power and speed specifications. Second, the exact implementation of the chosen architecture was investigated in an effort to maximize speed. After a complete analysis, the block was simulated for functional verification. Once correct operation was confirmed, a complete layout was done in order to optimize the area. SPICE simulation results demonstrate that the MAC achieves all required performance specifications in terms of accuracy and of performance parameters such as delay and power. In terms of power and area, the design dissipates only 50.26 mW and occupies 3 × 1.05 mm². The latency of the design is 6 clock cycles.

REFERENCES
[1] Kihak Shin, Ik Kyun Oh, Sang Min, Beom Seom Ryu, Kie Young Lee and Tae Won Cho, “A Multi-Level Approach to Low Power MAC Design”, IEEE Trans. VLSI Systems, vol. 48, pp. 361-763, 1999.
[2] Ichiro Kuroda, Eri Murata, Kouhei Nadehara, Kazumasa Suzuki, Tomohisa Arai and Atsushi Okamura, “A 16-bit Parallel MAC Architecture for a Multimedia RISC Processor”, IEEE Trans. VLSI Systems, vol. 83, no. 83, pp. 103-112, 1995.
[3] Jae Sung Lee, Young Seop Jeon, and Myung H. Sunwoo, “Design of New DSP Instructions and Their Hardware Architecture for High-Speed FFT”, IEEE Trans. VLSI Systems, pp. 80-90, 2001.
[4] Dusan Suvakovic, C. Andre, Salama, “A Pipelined Multiply-Accumulate Unit Design for Energy Recovery DSP Systems”, IEEE International Symposium on Circuits and Systems, May 28-31, 2000.
[5] Shyh-Jye Jou, Chang-Yu Chen, En-Chung and Chau-Chin Su, “A Pipeline Multiplier-Accumulator Using a High Speed Low-Power Static and Dynamic Full Adder”, Journal of Solid State Circuits, vol. 32, no. 1, January 2000.
[6] G. Goto, et al., “A 54x54-b regularly structured tree multiplier”, IEEE J. Solid-State Circuits, vol. 27, no. 9, Sept. 1992.
[7] Pascal C.H. Meier, Rob A. Rutenbar and Richard Carley, “Exploring multiplier architecture and layout for low power”, IEEE Custom Integrated Circuits Conference, 1996.
[8] John Kim, Earl E. Swartzlander, “Improving the Recursive Multiplier”, IEEE Trans. VLSI Systems, vol. 5, pp. 2-5, 2000.
[9] Beril Seda Çiftçi, “Design and Realization of a High Speed 64 x 64-bit Multiplier for Low Power Applications”, Sabancı University, Spring 2003.
[10] S. Shah, A. J. Al-Khabb, D. Al-Khabb, “Comparison of 32-bit Multipliers for Various Performance Measures”, The 12th International Conference on Microelectronics, Tehran, Oct. 31 - Nov. 2, 2000.
[11] Sung-Mo Kang and Yusuf Leblebici, “CMOS Digital Integrated Circuits”, Tata McGraw-Hill Publishing Company Limited, 2003.
[12] Sung-Mo Kang and Yusuf Leblebici, “CMOS Digital Integrated Circuits”, Third Edition, Tata McGraw-Hill Publishing Company Limited, 2003.
[13] Jan M. Rabaey, Anantha Chandrakasan and Borivoje Nikolic, “Digital Integrated Circuits”, Second Edition, Prentice Hall Electronics and VLSI series, 2004.
[14] M.Tech. Credit Seminar Report, Electronics Systems Group, EE Dept, IIT Bombay, submitted November ’02-2000, “DSP Architectures For System Design”, Vi (02307910), Supervisor: Prof. A. N. Chandorkar.