Kumm 2013
Kumm 2013
Multiplier Block
Abstract—This work compares two finite impulse response
(FIR) filter architectures for FPGAs for which the coefficients
can be reconfigured during run-time. One is a recently proposed
filter architecture based on distributed arithmetic (DA) and the
other is based on a LUT multiplication scheme. Instead of
using the common internal configuration access port (ICAP) for
reconfiguration which is able to change the logic as well as the
routing, it is sufficient to reconfigure only the logic in the regarded Fig. 1. FIR filter in transposed form
architectures. This is realized by using the configurable look-up
table (CFGLUT) primitive of Xilinx that allows reconfiguration
times which are orders of magnitudes faster than using ICAP. configurations. Furthermore, todays FPGAs provide standard
The resulting FIR filter architectures achieves reconfiguration interfaces for reconfiguration like the internal configuration
times of typically less than 100 ns. They can be reconfigured with access port (ICAP) of Xilinx FPGAs. With ICAP, a recon-
arbitrary coefficients that are only limited by their length and figurable logic region can be defined whose functionality
word size. As their resource consumptions depend on different can be exchanged during run-time by any circuit, i. e., the
parameters of the filter, a detailed comparison is done. It turned
out that if the input word size is greater than approximately half logic functions (LUT content) as well as the routing can be
the number of coefficients, the LUT based multiplication scheme completely exchanged. However, in many applications like in
needs less resources than the DA architecture and vice versa. the FIR filter case, the functionality of different configurations
is very similar and only a few parameters are exchanged.
I. I NTRODUCTION A possibility to exchange only the logic without changing
The finite impulse response (FIR) filter is one of the the complete routing on modern Xilinx FPGAs, namely the
most fundamental components in digital signal processing. Its Virtex 5-7, Spartan 6, Kintex 7 and Artix 7, is given by the
block schematic is shown in Fig. 1. Due to the high amount use of a configurable LUT (CFGLUT). It is very similar to
of multiply-accumulate (MAC) operations, the computational the 16 bit shift register LUTs (SRL16) of older Xilinx FPGAs
power of many real-time applications can only by realized by but has an input word size of five bit instead of four. One
using the parallel nature of of application specific integrated advantage of exchanging the logic only is, of course, the
circuits (ASICs) like field programmable gate arrays (FPGAs). reduced reconfiguration memory. But much more important
To reduce the performance gap between ASICs and FPGAs, is the fact that there is no bandwidth limitation of a single
digital filters were one of the driving forces to push embedded reconfiguration port like ICAP, as the reconfiguration using
multipliers or DSP blocks into the FPGA fabric. The price CFGLUTs can be done in parallel. Each CFGLUT can be
for those fixed coarse-grained blocks are their inflexibility reconfigured in 32 clock cycles where the configuration clock
and limited quantity. However, in many applications like for is only limited by routing delays of the FPGA fabric. As
digital filters, the multiplications have to be performed only each CFGLUT can be reconfigured in parallel, e. g., sourced
with constants that may be only reconfigured from time to from many block RAM resources, reconfiguration times in
time which can be used to reduce the complexity. Examples the order of 100 ns can be realized. Compared to ICAP, which
are multi-stage filters for decimation or interpolation like has a single 32 bit port operating with up to 100 MHz [7],
polyphase FIR filters [1] or frequency variable filters as needed reconfiguration times are often in the order of microseconds
in telecommunications, digital audio, medical, radar, sonar and or even milliseconds.
instruments [2]. A very generic tool flow for mapping parametrizable circuit
On ASICs, one has to realize the reconfiguration on a very configurations to tunable LUTs (TLUTs) like the CFGLUT or
low-level, e. g., by introducing multiplexers in the circuit to SRL16, which is called dynamic circuit specialization, was
select between different configurations. Several methods for proposed by Bruneel et al. [8], [9]. Here, the FPGA routing
reconfigurable multiplier blocks (ReMB) for reconfigurable of the circuit is fixed but parameters of the circuit, like the
FIR filters have been proposed in the last decade [1], [3]–[6]. constants of an FIR filter, can be exchanged at run-time by
Of course, these circuits can be also mapped to FPGAs [4], performing a dynamic reconfiguration of the LUT content.
[6], but they are very limited in the number of different filter However, this work concentrates on hand optimized low-level
The interface of the run-time reconfigurable CFGLUT is = 2b cn xn,b −2Bx −1 cn xn,Bx −1 (4)
b=0 n=0 n=0
shown in Fig. 2. Besides the LUT inputs I0. . . I4 and outputs | {z } | {z }
O5 and O6, it provides a reconfiguration interface consisting of =f (x̃N
b ) =f (x̃N
Bx −1 )
the signals configuration data in (CDI), configuration data out Now, the inner product can be represented as a sum of bit
(CDO), clock enable (CE) and configuration clock (CCLK). shifted results of the function
It can be used as a single 5-input LUT using O6 only or as
N −1
two 4-input LUTs with shared inputs by setting I4 = 1 and X
f (x̃N
b )= cn xn,b . (5)
using O5 and O6. In order to perform a change of the output
n=0
function(s), 32 bit of configuration data have to be clocked
into CDI while CE is set to high. The previous configuration Function f (x̃N N
b ) only depends on the N -bit vector x̃b =
T
is shifted out at CDO which can be used to cascade a series of (x0,b , x1,b , . . . , xN −1,b ) which contains the b’th bit of the
CFGLUTs. However, each CDI input can be programmed in (time shifted) elements of x. It can be precomputed and stored
parallel leading to a reconfiguration time of 32 clock cycles. in a single LUT with N inputs.
The speed of the configuration clock is only limited by the The storage requirement of the LUT is BfN · 2N bit, where
FPGA routing fabric. The CFGLUT can be mapped into the Bf denotes the output word size of the N -input LUT f (x̃N
N
b ).
FPGA by using the HDL primitive CFGLUT5 [18] which This output word size depends on the coefficient word size
CFGLUT5
I0
O5
I4
RLUT
CDI CDI
CE CE O6
CCLK CDO
RLUT CFGLUT5
I0
O5
I4
CDI
CE O6
CDO
RLUT
CFGLUT5
I0
O5
CDI I4
Filter Select Reconf. SEL
CCLK
Circuit CDI
CE O6
CDO
It can be seen that the required CFGLUTs are very similar for CFGLUT5
Controller
I0
both architectures. To estimate the architecture with the least O5
I4
CFGLUTs, we assume that M and Bx are both dividable by
CDI
CE O6
four, and Bc is dividable by two. Then, the LUT multiplier CCLK CDO
architecture needs less CFGLUTs compared to the DA archi-
tecture when the following unequation is true:
NCFGLUT,LUTM < NCFGLUT,DA
Fig. 6. Control architecture for the reconfiguration of CFGLUTs
M Bx /4(Bc /2 + 2) < (Bx + 1)M/4(Bc /2 + 1)
Bx < Bc /2 + 1 (15)
Hence, if the input word size Bx is less than approximately For the LUT multiplier based architecture, all partial product
half the coefficient word size Bc , the LUT multiplier archi- LUTs except the MSB LUT can use the same configuration
tecture needs less CFGLUTs and vice versa. data, as they all represent the multiplication of one 4 bit
Besides the used CFGLUTs a large amount of adders input with the same coefficient. Only the MSB LUT has to
are necessary. For the DA FIR architecture M adders are be reconfigured with a different configuration because of the
necessary in the pre-processing stage as well as an adder tree sign bit for signed multiplication. Each partial product has an
with Bx adders in the post-processing (see Fig. 3). Each of output word size of BgL , so the storage requirement is
the Bx + 1 RLUTs consist of dM/Le adders, resulting in SLUTM = 64 · M dBc /2 + 2e bit . (20)
NADD,DA = M + Bx + (Bx + 1) dM/4e . (16) As it can be seen, the required storage is independent of the
For the LUT multiplier architecture, N −1 structural adders input word size Bx and the influence of the coefficient word
are needed to compute the filter output from the multiplier size is Bc /2 for both architectures. But for the LUT multiplier
results (see Fig. 1). Each of the M reconfigurable multipliers based architecture, the memory requirement is approximately
consists of dBx /Le adders, resulting in eight times higher than for distributed arithmetic. Thus, the
reconfigurable DA architecture is always the best in terms of
NADD,LUT = N − 1 + M dBx /4e . (17) memory requirements.
To give an estimate for the architecture with the least adders, VI. R ECONFIGURATION C IRCUIT
we assume that M and Bx are both dividable by 4 and N =
The reconfiguration circuit can be built from a block RAM
2M . Then, the LUT multiplier architecture needs less adders
or distributed RAM and a simple controller. It is shown
when the following unequation is true:
in Fig. 6, together with two CFGLUTs. An input vector
NADD,LUT < NADD,DA filt_sel is used which can arbitrarily select a filter co-
2M − 1 + M Bx /4 < M + Bx + (Bx + 1)M/4 efficient set and an enable signal rec_en is used to start
the reconfiguration. Then, a 5 bit counter is used to address
3M − 4 < 4Bx (18) a sequence of 32 reconfiguration words. The RAM address
As each adder typically consists of several slices, where can then be composed of the filt_sel which addresses the
each slice contains up to four CFGLUTs, the adder estima- higher bits together with the five output bits of the counter to
tion is much more significant than the CFGLUT estimation. address the lower bits by a simple concatenation.
Hence, as a rule of thumb, if the input word size is greater Note that during the reconfiguration process, the output of
than approximately half the number of coefficients, the LUT the multiplier is not valid. This problem can be fixed by a
multiplier architecture needs less resources and vice versa. clock enable signal at the output register of the multiplier,
which leads to a neglect of the input during reconfiguration.
B. Configuration Memory Alternatively each CFGLUT can be doubled as it was proposed
With configuration memory we mean the storage require- in previous work [10]. It can be reconfigured with the new
ments for a single configuration. Thus, it has to be multiplied coefficient while the other CFGLUT is processing data. A
by the number of different configurations to get the total mem- multiplexer is used to switch between the two CFGLUTs to
ory requirement. For the DA architecture all RLUTs shown in achieve a glitch free data processing.
Fig. 3 have the same content. As each of the CFGLUT needs
VII. R ESULTS
32 bit of configuration data, the storage requirement is
Two synthesis experiments were made to evaluate the re-
SDA = 32 · dM/4e dBc /2 + 1e bit . (19) source usage, speed and reconfiguration times of the different
TABLE II
architectures. In the first experiment, the LUT based multiplier S YNTHESIS RESULTS FOR THE PROPOSED RECONFIGURABLE LUT
architecture as well as the DA architecture are synthesized for MULTIPLIER FIR FILTER IN COMPARISON WITH RECONFIGURABLE DA
different filter lengths and input word sizes. In the second FIR FILTER [10]
experiment, we used ICAP for the partial reconfiguration Reconf. FIR DA [10] Reconf. FIR LUT (proposed)
of highly optimized filter instances to compare the resource Bx N S Slices fclk Trec S Slices fclk Trec
overhead and configuration times. A VHDL code generator [bit] [MHz] [ns] [bit] [MHz] [ns]
was developed in Matlab for the different architectures. All 8 6 320 145 570.5 56.1 2112 93 647.3 49.4
synthesis results were obtained after place & route using Xilinx 8 10 640 199 492.4 65.0 3520 143 575.0 55.6
ISE v13.4 with the design goal ’speed’ for a Virtex 6 FPGA 8 13 640 191 601.3 53.2 4928 188 608.6 52.6
8 20 960 365 496.3 64.5 7040 297 534.8 59.8
(XC6VLX75T-FF784). 8 28 1280 374 525.2 60.9 9856 406 551.9 58.0
8 41 1920 627 522.7 61.2 14784 595 573.4 55.8
A. Comparison of CFGLUT based Architectures 8 61 2560 835 543.2 58.9 21824 837 499.5 64.1
8 119 4800 1487 499.5 64.1 42240 1668 504.0 63.5
To compare the two discussed architectures, several different 8 151 6080 1813 395.4 80.9 53504 2156 415.3 77.1
reconfigurable filters were synthesized. To ease the comparison 16 6 320 253 622.7 51.4 2112 179 576.7 55.5
with other work, we used a benchmark set of nine filters with 16 10 640 372 552.8 57.9 3520 262 572.4 55.9
N = 6 up to N = 151 coefficients which were already used 16 13 640 353 545.3 58.7 4928 370 522.2 61.3
16 20 960 635 467.7 68.4 7040 544 518.4 61.7
in previous publications [10], [19]–[21]. The coefficients have 16 28 1280 707 525.8 60.9 9856 782 540.3 59.2
a maximum word size of Bc = 17 bit and are available online 16 41 1920 1071 521.9 61.3 14784 1108 487.8 65.6
as MIRZAEI10 N [22]. The input word size was varied from 16 61 2560 1341 413.7 77.3 21824 1575 463.8 69.0
16 119 4800 2594 348.3 91.9 42240 3257 480.3 66.6
Bx = 8 . . . 32 bit in steps of 8 bit. Of course, any filter of the 16 151 6080 4256 410.5 78.0 53504 4149 428.8 74.6
same symmetry up to Bc , Bx and N can be reconfigured.
24 6 320 371 550.7 58.1 2112 303 536.2 59.7
Lower word sizes have to be sign extended while unused 24 10 640 522 556.2 57.5 3520 431 498.3 64.2
coefficients have to be set to zero. The same clock was used 24 13 640 535 470.8 68.0 4928 587 545.6 58.7
for the filter as well as for reconfiguration. 24 20 960 939 573.7 55.8 7040 884 531.6 60.2
24 28 1280 1029 416.5 76.8 9856 1172 482.9 66.3
Detailed synthesis results are listed in Table II, which 24 41 1920 1558 383.6 83.4 14784 1691 467.1 68.5
shows the configuration memory per instance (S), the number 24 61 2560 1958 372.0 86.0 21824 2332 435.0 73.6
of Slices, the maximum clock frequency fclk as well as 24 119 4800 4047 355.4 90.0 42240 5047 393.1 81.4
24 151 6080 5802 385.2 83.1 53504 6515 380.7 84.1
the reconfiguration time Trec = 32/fclk for both methods.
On average, both architectures achieve nearly equally low 32 6 320 490 524.9 61.0 2112 320 512.3 62.5
32 10 640 734 480.3 66.6 3520 531 485.4 65.9
reconfiguration times of 69.94 ns and 65.94 ns and fast clock 32 13 640 704 507.1 63.1 4928 771 495.8 64.5
frequencies of 472.4 MHz and 494.2 MHz, respectively. As 32 20 960 1130 517.9 61.8 7040 1026 473.9 67.5
expected from Section V-B, the reconfiguration memory of 32 28 1280 1419 430.7 74.3 9856 1482 442.9 72.3
32 41 1920 1965 376.5 85.0 14784 2091 446.0 71.7
the LUT multiplier architecture is approximately eight times 32 61 2560 2570 358.6 89.2 21824 3601 401.6 79.7
higher compared to the DA based architecture. A comparison 32 119 4800 4889 302.9 105.6 42240 6523 377.2 84.8
of the resources is visualized in Fig. 7 which shows the 32 151 6080 7170 391.1 81.8 53504 8016 386.7 82.8
percentage slice improvement of the LUT multiplier based
architecture over the DA based architecture. Hence, in case
positive values are shown, the LUT architecture is the better architectures, both logic and routing can be replaced. On the
choice and vice versa. As expected from Section V-A, the LUT one hand, each configuration can be optimized like a static
multiplier based architecture performs better for smaller filter design which is very resource efficient. On the other hand, the
instances when dN/2e < Bx . Exceptions from that rule can be storage requirement is much larger which also leads to longer
explained by the other parts of the filter, e. g., an unbalanced reconfiguration times due to the sequential ICAP interface. The
adder tree with 2n−1 < x < 2n inputs may need as many ICAP interface has a word size of 32 bit and can be operated
slices as a balanced one with x = 2n inputs due to pipelining. with up to 100 MHz [7]. However, this speed is a theoretical
Up to 35% slice reductions can be achieved by using the LUT maximum which can only be achieved with some effort [23].
multiplier architecture for small filters but the DA architecture Reconfiguration clocks of 94.5 MHz were reported by using a
is still advantageous for larger filter instances requiring up to block RAM cache that was closely located at the ICAP [24].
40% less slices. To get the maximum performance out of the ICAP approach,
we used the reduced pipelined adder graphs (RPAG) algorithm
B. Comparison with ICAP [25] which is a state-of-the-art optimization method for fixed
The standard partial reconfiguration scheme supported by coefficient multiplier blocks as needed for FIR filters. All
the Xilinx ISE tools provides that a reconfigurable region multiplications by constants are mapped to add/subtract and bit
of the FPGA is reserved which can be exchanged by any shift operations in a pipelined way. It was shown in previous
circuit in run-time using the ICAP interface. This implies work [20], [21], [26], [27], that this method outperforms
that the reconfigurable region is large enough to include the DA based filters as well as LUT based multipliers in the
largest configuration. In contrast to the discussed CFGLUT case that input word sizes of more than 12 bit are used. As
40 40
0 0
−20 −20
−40 −40
6 10 13 20 28 41 61 119 151 6 10 13 20 28 41 61 119 151
Filter length N Filter length N
(a) Input word size Bx = 8 bit (b) Input word size Bx = 16 bit
40 40
Slice improvement [%]
0 0
−20 −20
−40 −40
6 10 13 20 28 41 61 119 151 6 10 13 20 28 41 61 119 151
Filter length N Filter length N
(c) Input word size Bx = 24 bit (d) Input word size Bx = 32 bit
Fig. 7. Slice improvement of the reconfigurable FIR filter based on LUT multipliers in comparison with the reconfigurable DA FIR filter
TABLE III
C OMPARISON OF A SINGLE FILTER MIRZAEI10 41 WITH Bx = 16 BIT recently proposed method based on distributed arithmetic [10],
USING ICAP RECONFIGURATION AND THE CFGLUT METHODS the second one uses several instances of a reconfigurable LUT
Method S [bit] Slices fclk [MHz] Trec [ns] multiplier [14] to build a reconfigurable multiplier block as
needed in the FIR filter. Similarities between the different
RPAG [25] with ICAP 746496 502. . . 569 386.7. . . 448.8 233280
Reconf. FIR DA [10] 1920 1071 521.9 61.3 approaches were derived as both methods uses similar arith-
Reconf. FIR LUT 14784 1108 487.8 65.6 metic transformations to map large LUTs to several smaller
LUTs by the use of additional adders. It turned out that less
CFGLUTs, and, in most of the cases, less slices are needed for
the optimization heavily depends on the numeric coefficient the LUT based multiplier architecture in the case that the input
values, ten different filters were designed with the same length word size is greater than approximately half the number of
as the mid size benchmark filter MIRZAEI10 41 and an input coefficients and vice versa. Both methods have reconfiguration
word size of 16 bit. These served as realistic configurations times and memory requirements which are about four orders
which can be reconfigured via ICAP. of magnitudes faster than using partial reconfiguration via
The results are summarized in Table III. The number of the ICAP interface which is paid by approximately twice the
slices using RPAG optimized FIR filters varied for the differ- amount of slices.
ent filter instances from 502. . . 569. Hence, a reconfiguration
region with a capacity of 569 slices has to be reserved. The R EFERENCES
reconfiguration is organized in frames of 80 slices, thus, eight [1] M. Faust, O. Gustafsson, and C.-H. Chang, “Reconfigurable Multiple
frames have to be reserved where each frame contributes with Constant Multiplication Using Minimum Adder Depth,” in Conference
Record of the Forty Fourth Asilomar Conference on Signals, Systems
93312 bit, leading to a reconfiguration memory requirement and Computers (ASILOMAR), 2010, pp. 1297–1301.
of SICAP = 746496 bit per filter instance. Compared to the [2] Lowenborg and Johansson, “Minimax Design of Adjustable-bandwidth
CFGLUT-based methods, a factor of 388 and 50 more recon- Linear-phase FIR Filters,” IEEE Transactions on Circuits and Systems
I: Regular Papers, vol. 53, no. 2, pp. 431–439, 2006.
figuration memory is necessary, respectively. Assuming that [3] S. S. Demirsoy, A. Dempster, and I. Kale, “Design Guidelines for
the full performance of ICAP can be used, the reconfiguration Reconfigurable Multiplier Blocks,” in IEEE International Symposium
takes Trec = SICAP /32 · 10 ns = 233 µs. Thus, compared on Circuits and Systems (ISCAS), 2003.
[4] S. S. Demirsoy, I. Kale, and A. Dempster, “Efficient Implementation
to the slowest CFGLUT methods with 65.5 ns, the ICAP of Digital Filters Using Novel Reconfigurable Multiplier Blocks,” in
reconfiguration is a factor of 3556 slower. The price for these Conference Record of the Thirty-Eighth Asilomar Conference on Signals,
fast reconfiguration times and low memory requirements is Systems and Computers, 2004, pp. 461–464.
[5] P. Tummeltshammer, J. Hoe, and M. Puschel, “Time-Multiplexed
paid by a slice overhead of 88% and 95%, respectively. Multiple-Constant Multiplication,” IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, vol. 26, no. 9, pp.
VIII. C ONCLUSION 1551–1563, Sep. 2007.
We analyzed two reconfigurable FIR filter architectures [6] R. Gutierrez, J. Valls, and A. Perez-Pascual, “FPGA-Implementation of
Time-Multiplexed Multiple Constant Multiplication based on Carry-Save
based on the CFGLUT primitives which can be mapped to Arithmetic,” in International Conference on Field Programmable Logic
all modern FPGAs of Xilinx. The first one is based on a and Applications (FPL), 2009, pp. 609–612.
[7] Xilinx, Inc., Partial Reconfiguration User Guide, Oct. 2010. [18] Xilinx, Inc., Xilinx Virtex-5 Libraries Guide for HDL Designs, 2009.
[8] K. Bruneel, W. Heirman, and D. Stroobandt, “Dynamic Data Folding [19] S. Mirzaei, R. Kastner, and A. Hosangadi, “Layout Aware Optimization
with Parameterizable FPGA Configurations,” Transactions on Design of High Speed Fixed Coefficient FIR Filters for FPGAs,” Int. Journal
Automation of Electronic Systems, vol. 16, no. 4, pp. 43:1–43:29, Oct. of Reconfigurable Computing, vol. 2010, pp. 1–17, 2010.
2011. [20] M. Kumm and P. Zipf, “High Speed Low Complexity FPGA-based FIR
[9] K. Bruneel, “Efficient Circuit Specialization for Dynamic Reconfigura- Filters Using Pipelined Adder Graphs,” in International Conference on
tion of FPGAs,” Ph.D. dissertation, Universiteit Gent, 2011. Field-Programmable Technology (FPT), 2011, pp. 1–4.
[10] M. Kumm, K. Möller, and P. Zipf, “Reconfigurable FIR Filter Using [21] U. Meyer-Baese, G. Botella, D. Romero, and M. Kumm, “Optimization
Distributed Arithmetic on FPGAs,” in IEEE International Symposium of High Speed Pipelining in FPGA-based FIR Filter Design Using
on Circuits and Systems (ISCAS), 2013, pp. 2058–2061. Genetic Algorithm,” in Proceedings of SPIE, 2012.
[11] A. Crosisier, D. J. Esteban, M. E. Levilio, and V. Riso, “Digital Filter [22] FIRsuite, “Suite of constant coefficient FIR filters,” 2013. [Online].
for PCM Encoded Signals,” United States Patent No. 3777130, 1973. Available: http://www.firsuite.net
[12] S. Zohar, “New Hardware Realizations of Nonrecursive Digital Filters,” [23] S. Liu, R. N. Pittman, A. Forin, and J. L. Gaudiot, “On Energy Efficiency
IEEE Transactions on Computers, vol. 22, no. 4, pp. 328–338, 1973. of Reconfigurable Systems With Run-time Partial Reconfiguration,” in
[13] S. A. White, “Applications of Distributed Arithmetic to Digital Signal IEEE International Conference on Application-specific Systems Archi-
Processing: A Tutorial Review,” IEEE ASSP Mag., vol. 6, no. 3, 1989. tectures and Processors (ASAP), 2010, pp. 265–272.
[14] K. Wiatr and E. Jamro, “Implementation of Multipliers in FPGA [24] M. Liu, W. Kuehn, Z. Lu, and A. Jantsch, “Run-time Partial Reconfigu-
Structures,” International Symposium on Quality Electronic Design, pp. ration Speed Investigation and Architectural Design Space Exploration,”
415–420, 2001. in International Conference on Field Programmable Logic and Appli-
[15] K. D. Chapman, “Fast Integer Multipliers Fit in FPGAs,” Electronic cations (FPL), 2009, pp. 498–502.
Design News, 1994. [25] M. Kumm, P. Zipf, M. Faust, and C.-H. Chang, “Pipelined Adder Graph
[16] K. Chapman, “Constant Coefficient Multipliers for the XC4000E,” Xilinx Optimization for High Speed Multiple Constant Multiplication,” in IEEE
Application Note, 1996. Int. Symposium on Circuits and Systems (ISCAS), 2012, pp. 49–52.
[17] M. J. Wirthlin, “Constant Coefficient Multiplication Using Look-Up [26] U. Meyer-Baese, J. Chen, C. H. Chang, and A. G. Dempster, “A
Tables,” Journal of VLSI Signal Processing, vol. 36, no. 1, pp. 7–15, Comparison of Pipelined RAG-n and DA FPGA-based Multiplierless
Jan. 2004. Filters,” in IEEE Asia Pacific Conference on Circuits and Systems
(APCCAS), 2006, pp. 1555–1558.
[27] M. Kumm, D. Fanghänel, K. Möller, P. Zipf, and U. Meyer-Baese, “FIR
Filter Optimization for Video Processing on FPGAs,” EURASIP Journal
on Advances in Signal Processing, vol. 2013, no. 1, p. 111, 2013.
[Online]. Available: http://asp.eurasipjournals.com/content/2013/1/111