[go: up one dir, main page]

0% found this document useful (0 votes)
25 views8 pages

Kumm 2013

This document compares two reconfigurable finite impulse response (FIR) filter architectures for FPGAs. One architecture is based on distributed arithmetic, while the other uses a lookup table (LUT) multiplication scheme. Both architectures allow fast reconfiguration of filter coefficients by reconfiguring logic functions instead of full routing. This can be done in less than 100 nanoseconds using configurable LUTs, much faster than the internal configuration access port. The architectures are compared based on their resource usage and suitability for different filter parameters like input word size and number of coefficients.

Uploaded by

KRISHNARAJ R
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views8 pages

Kumm 2013

This document compares two reconfigurable finite impulse response (FIR) filter architectures for FPGAs. One architecture is based on distributed arithmetic, while the other uses a lookup table (LUT) multiplication scheme. Both architectures allow fast reconfiguration of filter coefficients by reconfiguring logic functions instead of full routing. This can be done in less than 100 nanoseconds using configurable LUTs, much faster than the internal configuration access port. The architectures are compared based on their resource usage and suitability for different filter parameters like input word size and number of coefficients.

Uploaded by

KRISHNARAJ R
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

Dynamically Reconfigurable FIR Filter

Architectures with Fast Reconfiguration


Martin Kumm, Konrad Möller and Peter Zipf
Digital Technology Group
University of Kassel, Germany
Email: {kumm, konrad.moeller, zipf}@uni-kassel.de

Multiplier Block
Abstract—This work compares two finite impulse response
(FIR) filter architectures for FPGAs for which the coefficients
can be reconfigured during run-time. One is a recently proposed
filter architecture based on distributed arithmetic (DA) and the
other is based on a LUT multiplication scheme. Instead of
using the common internal configuration access port (ICAP) for
reconfiguration which is able to change the logic as well as the
routing, it is sufficient to reconfigure only the logic in the regarded Fig. 1. FIR filter in transposed form
architectures. This is realized by using the configurable look-up
table (CFGLUT) primitive of Xilinx that allows reconfiguration
times which are orders of magnitudes faster than using ICAP. configurations. Furthermore, todays FPGAs provide standard
The resulting FIR filter architectures achieves reconfiguration interfaces for reconfiguration like the internal configuration
times of typically less than 100 ns. They can be reconfigured with access port (ICAP) of Xilinx FPGAs. With ICAP, a recon-
arbitrary coefficients that are only limited by their length and figurable logic region can be defined whose functionality
word size. As their resource consumptions depend on different can be exchanged during run-time by any circuit, i. e., the
parameters of the filter, a detailed comparison is done. It turned
out that if the input word size is greater than approximately half logic functions (LUT content) as well as the routing can be
the number of coefficients, the LUT based multiplication scheme completely exchanged. However, in many applications like in
needs less resources than the DA architecture and vice versa. the FIR filter case, the functionality of different configurations
is very similar and only a few parameters are exchanged.
I. I NTRODUCTION A possibility to exchange only the logic without changing
The finite impulse response (FIR) filter is one of the the complete routing on modern Xilinx FPGAs, namely the
most fundamental components in digital signal processing. Its Virtex 5-7, Spartan 6, Kintex 7 and Artix 7, is given by the
block schematic is shown in Fig. 1. Due to the high amount use of a configurable LUT (CFGLUT). It is very similar to
of multiply-accumulate (MAC) operations, the computational the 16 bit shift register LUTs (SRL16) of older Xilinx FPGAs
power of many real-time applications can only by realized by but has an input word size of five bit instead of four. One
using the parallel nature of of application specific integrated advantage of exchanging the logic only is, of course, the
circuits (ASICs) like field programmable gate arrays (FPGAs). reduced reconfiguration memory. But much more important
To reduce the performance gap between ASICs and FPGAs, is the fact that there is no bandwidth limitation of a single
digital filters were one of the driving forces to push embedded reconfiguration port like ICAP, as the reconfiguration using
multipliers or DSP blocks into the FPGA fabric. The price CFGLUTs can be done in parallel. Each CFGLUT can be
for those fixed coarse-grained blocks are their inflexibility reconfigured in 32 clock cycles where the configuration clock
and limited quantity. However, in many applications like for is only limited by routing delays of the FPGA fabric. As
digital filters, the multiplications have to be performed only each CFGLUT can be reconfigured in parallel, e. g., sourced
with constants that may be only reconfigured from time to from many block RAM resources, reconfiguration times in
time which can be used to reduce the complexity. Examples the order of 100 ns can be realized. Compared to ICAP, which
are multi-stage filters for decimation or interpolation like has a single 32 bit port operating with up to 100 MHz [7],
polyphase FIR filters [1] or frequency variable filters as needed reconfiguration times are often in the order of microseconds
in telecommunications, digital audio, medical, radar, sonar and or even milliseconds.
instruments [2]. A very generic tool flow for mapping parametrizable circuit
On ASICs, one has to realize the reconfiguration on a very configurations to tunable LUTs (TLUTs) like the CFGLUT or
low-level, e. g., by introducing multiplexers in the circuit to SRL16, which is called dynamic circuit specialization, was
select between different configurations. Several methods for proposed by Bruneel et al. [8], [9]. Here, the FPGA routing
reconfigurable multiplier blocks (ReMB) for reconfigurable of the circuit is fixed but parameters of the circuit, like the
FIR filters have been proposed in the last decade [1], [3]–[6]. constants of an FIR filter, can be exchanged at run-time by
Of course, these circuits can be also mapped to FPGAs [4], performing a dynamic reconfiguration of the LUT content.
[6], but they are very limited in the number of different filter However, this work concentrates on hand optimized low-level

978-1-4673-6180-4/13/$31.00 ©2013 IEEE


CFGLUT5
implementations.
I0
A reconfigurable FIR filter that uses CFGLUTs was pro- O5
posed in a recent work of our group [10]. In that work, the I4
FIR filter was realized using distributed arithmetic (DA) [11], CDI
CE O6
[12], as it is naturally realized using LUTs. To map these LUTs CCLK CDO
to the typically smaller CFGLUTs, a LUT reduction technique
[13] was used. Surprisingly, the reconfigurable FIR filters were
Fig. 2. Block diagramm of the CFGLUT5 primitive
even more hardware efficient than the fixed coefficient (non-
reconfigurable) DA filters generated by Xilinx Coregen (16%
less slices on average). The achieved reconfiguration times utilize the same slice resources as a standard 6-input LUT.
were 80 ns on average and 106 ns in the worst case. Note that the clock of the output flip flop of that slice (if
A totally different reconfigurable FIR filter can be built used) is in the same clock domain as CCLK.
by replacing each constant multiplier of Fig. 1 with a re- Back portability to older Xilinx FPGA architectures can
configurable one. A reconfigurable FPGA multiplier using be achieved by using the SRL16 primitive (and adjusting
distributed RAM or block RAM was proposed by Wiatr et al. the input word size). Alternatively, block RAMs which are
[14]. It is based on a LUT based multiplication scheme which available on nearly every modern FPGA can also be used as
was introduced by Chapman [15], [16] and later extended by reconfigurable LUTs. Then, additional resources are needed
Wirthlin [17]. for address counter and data multiplexers (if a single port
In both methods, any number of coefficient sets can be RAM is used) [14].
configured which are only constrained by the configuration
memory. However, both methods are based on totally different III. D ISTRIBUTED A RITHMETIC FIR A RCHITECTURE
architectures. Their resource consumptions depend on the The fundamental operation of an FIR filter is the inner
input word size, the coefficient word size as well the number of product of two vectors,
coefficients in a different manner. Thus, the main contribution
N −1
of this work is a detailed comparison of a reconfigurable FIR X
yn = c · x = cn xn , (1)
filter built from reconfigurable LUT multipliers [14] realized
n=0
with state-of-the-art CFGLUTs, with the DA based reconfigu-
ration scheme [10]. In addition to that, a detailed comparison where vector x consists of time shifted scalars of the filter
with partial reconfiguration using the ICAP interface is done. input and vector c consists of the coefficients. When each
The remainder of the paper is organized as follows. In input sample xn is coded in two’s complement format
Section II, the CFGLUT is introduced as base for the discussed x −2
BX
architectures. Then, the reconfigurable DA filter architecture xn = 2b xn,b − 2Bx −1 xn,Bx −1 (2)
[10] is introduced in Section III as needed for further compar- b=0
isons. The reconfigurable LUT based multiplication scheme where xn,b denotes the b’th bit of xn , we can insert (2) in (1)
as well as the corresponding filter architecture are presented and exchange the two resulting sums:
in Section IV. A detailed comparison of both architectures is
−1 x −2
N BX
!
done in Section V, followed by the experimental results and X
concluding remarks. yn = cn 2b xn,b − 2Bx −1 xn,Bx −1 (3)
n=0 b=0
II. RUN - TIME R ECONFIGURABLE LUT S x −2
BX N
X −1 N
X −1

The interface of the run-time reconfigurable CFGLUT is = 2b cn xn,b −2Bx −1 cn xn,Bx −1 (4)
b=0 n=0 n=0
shown in Fig. 2. Besides the LUT inputs I0. . . I4 and outputs | {z } | {z }
O5 and O6, it provides a reconfiguration interface consisting of =f (x̃N
b ) =f (x̃N
Bx −1 )

the signals configuration data in (CDI), configuration data out Now, the inner product can be represented as a sum of bit
(CDO), clock enable (CE) and configuration clock (CCLK). shifted results of the function
It can be used as a single 5-input LUT using O6 only or as
N −1
two 4-input LUTs with shared inputs by setting I4 = 1 and X
f (x̃N
b )= cn xn,b . (5)
using O5 and O6. In order to perform a change of the output
n=0
function(s), 32 bit of configuration data have to be clocked
into CDI while CE is set to high. The previous configuration Function f (x̃N N
b ) only depends on the N -bit vector x̃b =
T
is shifted out at CDO which can be used to cascade a series of (x0,b , x1,b , . . . , xN −1,b ) which contains the b’th bit of the
CFGLUTs. However, each CDI input can be programmed in (time shifted) elements of x. It can be precomputed and stored
parallel leading to a reconfiguration time of 32 clock cycles. in a single LUT with N inputs.
The speed of the configuration clock is only limited by the The storage requirement of the LUT is BfN · 2N bit, where
FPGA routing fabric. The CFGLUT can be mapped into the Bf denotes the output word size of the N -input LUT f (x̃N
N
b ).
FPGA by using the HDL primitive CFGLUT5 [18] which This output word size depends on the coefficient word size
CFGLUT5
I0
O5
I4
RLUT
CDI CDI
CE CE O6
CCLK CDO

RLUT CFGLUT5
I0
O5
I4
CDI
CE O6
CDO
RLUT

CFGLUT5
I0
O5
CDI I4
Filter Select Reconf. SEL
CCLK
Circuit CDI
CE O6
CDO

Fig. 3. Architecture of the reconfigurable distributed arithmetic FIR filter


Fig. 4. A reconfigurable LUT realization of f (x̃N
b ) using single CFGLUT5
of the given filter. For a maximum coefficient word size of primitives
Bc , the maximum LUT output word size can be obtained by
evaluating (5) with xn,b = 1 ∀ n, leading to
inputs for xn are reduced to M = N2 inputs of zn , thus, the
 
BfN = Bc + dlog2 (N )e . (6)
storage requirements are approximately halved by introducing
bN/2c adder.
A. Mapping the LUT into smaller partial LUTs All adders in Fig. 3 are followed by pipeline registers (not
The input size and, therefore, the memory requirements of shown) forming a pipelined adder tree. Furthermore, most of
the LUT can be reduced by splitting the sum in (5) into several the left shift operations can be moved towards the output of the
smaller sums output adder tree for word size reduction. The reconfigurable
bN/Lc−1 (l+1)L−1 N −1 LUTs (RLUTs) of Fig. 3 already contain CFGLUTs for
reconfiguration which are shown in detail in Fig. 4. Each
X X X
f (x̃N
b )= cn xn,b + cn xn,b (7)
l=0 n=lL n=N −L0
CFGLUT5 computes two bits of f (x̃L b ) which are further
| {z } | {z } processed in a pipelined adder tree (pipeline registers not
=fl (x̃L 0
b ) =fbN/Lc (x̃L
b ) shown) according to (7).
with L < N . Each of the functions
(l+1)L−1
IV. LUT M ULTIPLIER BASED FIR A RCHITECTURE
X
fl (x̃L
b)= cn xn,b (8) A LUT based realization of (1) can also be achieved by
n=lL simply mapping each product cn · xn to a Bx -input LUT.
can be realized as an L-input LUT. If N is not dividable by L, Then, the storage requirement would be Bc · 2Bx for each
one additional LUT of input size L0 = N mod L is necessary, LUT. Like for the DA, the LUT content can be represented
which is represented with the last term in (7). as a sum of partial products. Hence, a similar LUT reduction
Now, L can be chosen to fit into the input word size method as used in (7) can be applied, leading to the constant
of the FPGA LUT. Choosing L = 4 or L = 5 allows a multiplication scheme of Chapman [16].
direct mapping to CFGLUTs. Furthermore, the LUT storage Let us consider one ofthe Bx ×Bc multiplications of (1). It
requirement for the N -input LUT f (x̃N b 0) is reduced from can be divided into K = BLx smaller products by rearranging
0
BfN · 2N bits to bN/Lc · BfL · 2L + BfL · 2L bits. This memory the partial sum terms:
reduction is paid by bN/Lc additional adders. Note that for a
x −2
BX
!
fixed L, this realization style grows linear with the number of b Bx −1
cn · xn = cn 2 xn,b − 2 xn,Bx −1
filter taps N in contrast to (5) which grows exponentially. An | {z }
Bc ×Bx mult. b=0
analytic comparison of the LUT input size revealed that it is
L−1 2L−1
always advantageous to use a CFGLUT as two 4-input LUTs X X
instead of a single 5-input LUT [10]. = cn 2b xn,b + cn 2b xn,b + . . .
b=0 b=L
The resulting reconfigurable DA architecture [10] is shown  
KL−2
in Fig. 3. Here, an additional pre-processing stage of registers X
b KL−1
and adders is used to exploit the inherent coefficient symmetry + cn  2 xn,b − 2 xn,KL−1 
of common linear phase FIR filters. By doing that, the N LUT b=(K−1)L
L−1 L−1 CFGLUT5
I0
X X
b L b
= cn 2 xn,b +2 cn 2 xn,b+L + . . . O5
b=0 b=0 I4
CDI CDI
O6
| {z } | {z }
=g(x̂L =g(x̂L CE CE
n,l ) n,l ) CCLK CDO
L−2
!
X
... + 2 (K−1)L
· cn b
2 xn,b+(K−1)L − 2 L−1
xn,KL−1 CFGLUT5
I0
b=0 O5
I4
| {z }
=g 0 (x̂L
n,l ) CDI
CE O6
K−2
X CDO
(K−1)L 0 L
= 2lL g(x̂L
n,l ) + 2 g (x̂n,l ) (9)
l=0

Each of the K − 1 function terms g(x̂L CFGLUT5


l ) represents a I0
Bc × L unsigned multiplication with input vector x̂L
n,l = O5
T I4
(xn,lL , xn,lL+1 , . . . , xn,lL+L−1 ) : CDI
CE O6
CDO
g(x̂L
n,l ) = cn · x̂L
n,l (10)
The function term g 0 (x̂L l ) for the MSB represents the same Fig. 5. LUT based constant multiplication with a pipelined adder graph
multiplication but with a signed interpretation of the the input
vector x̂L
n,lin case a singed multiplication is performed. Thus, TABLE I
Bx
K = L multiplications of size Bc × L are necessary to LUT CONTENTS OF THE CFGLUT MULTIPLIER USING L = 4
perform the Bc × Bx multiplication. In case that Bx is not
dividable by L, xn has to be sign extended to KL bits with LUT address g 0 (x̂L
n,l ) g(x̂L
n,l )
K = dBx /Le. Now, each partial product can be realized (MSB LUT) (other LUTs)
by an L-input LUT with 2L precomputed products. Hence, 0000 0 · cn 0 · cn
0001 1 · cn 1 · cn
L can be chosen to fit the LUT input size of the specific ... ... ...
FPGA. In the reconfigurable multiplier of Wiatr [14], L was
0111 7 · cn 7 · cn
chosen to fit the address word size of distributed RAM or block 1000 −8 · cn 8 · cn
RAM, in this work, L is chosen to the input word size of the 1001 −7 · cn 9 · cn
CFGLUT. Finally, the bit shifted partial products have to be ... ... ...
added. The resulting reconfigurable constant LUT multiplier 1111 −1 · cn 15 · cn
is shown in Fig. 5. A pipelined adder tree is used to improve
the performance (pipeline registers not shown in Fig. 5). The
applied pipelining does not lead to much extra resources, as V. C OMPARISON OF FIR A RCHITECTURES
most of the flip-flops (FFs) can be mapped to the otherwise
unused FFs of slices realizing full adders. Note that the bit A. Resource Consumption
shifts after the LUTs can also be moved to later stages of the
adder tree to reduce the word size. To estimate the FPGA resource consumption the required
The LUT storage CFGLUTs and adders are evaluated for both architectures. It
  requirements are now reduced from
Bc · 2Bx bits to BLx BgL 2L bits, where can be taken as indicator for selecting the architecture for a
given filter, which is specified with the number of taps (N ),
BgL = Bc + L . (11) the input word size (Bx ) and the coefficient word size (Bc ).
denotes the output word size of the partial LUTs. As for For the reconfigurable DA architecture Bx + 1 M -input LUTs
the reconfigurable DA architecture there are many CFGLUTs are required, where each of them contains dM/Le L-input
with identical inputs, so it is again better to configure the LUTs with an output word size of BfL (see Fig. 3 and Fig. 4).
CFGLUT to two independent 4-input LUTs. The precomputed Using L = 4, two output bits can be computed with a single
LUT contents are listed in Table I. CFGLUT, leading to a total number of CFGLUTs of
Now, several reconfigurable multipliers can be used to build
NCFGLUT,DA = (Bx + 1) dM/4e Bf4 /2
 
a reconfigurable FIR filter. For that, all the multipliers in the
dashed box in Fig. 1 are replaced by reconfigurable ones. Like = (Bx + 1) dM/4e dBc /2 + 1e . (12)
for the reconfigurable DA architecture, coefficient symmetry
is assumed for the proposed design, too. To realize that, only In contrast to the reconfigurable DA filter, the LUT mul-
M = dN/2e of the multipliers are implemented which are tiplier FIR architecture consists of M multipliers, realized
shared between two of the so called structural adders shown as Bx -input LUT. Each of the Bx -input LUT is built from
below the multiplier block in Fig. 1. For negative symmetry, dBx /Le L-input LUTs with an output word size of BgL . When
the corresponding structural adders are replaced by subtractors. two output bits are computed with a single CFGLUT by setting
CFGLUT5
L = 4, the total number of CFGLUTs is I0
O5
I4
NCFGLUT,LUTM = M dBx /4e Bg4 /2
 
(13) CDI
CE O6
= M dBx /4e dBc /2 + 2e . (14) Block RAM CCLK CDO

It can be seen that the required CFGLUTs are very similar for CFGLUT5
Controller
I0
both architectures. To estimate the architecture with the least O5
I4
CFGLUTs, we assume that M and Bx are both dividable by
CDI
CE O6
four, and Bc is dividable by two. Then, the LUT multiplier CCLK CDO
architecture needs less CFGLUTs compared to the DA archi-
tecture when the following unequation is true:
NCFGLUT,LUTM < NCFGLUT,DA
Fig. 6. Control architecture for the reconfiguration of CFGLUTs
M Bx /4(Bc /2 + 2) < (Bx + 1)M/4(Bc /2 + 1)
Bx < Bc /2 + 1 (15)
Hence, if the input word size Bx is less than approximately For the LUT multiplier based architecture, all partial product
half the coefficient word size Bc , the LUT multiplier archi- LUTs except the MSB LUT can use the same configuration
tecture needs less CFGLUTs and vice versa. data, as they all represent the multiplication of one 4 bit
Besides the used CFGLUTs a large amount of adders input with the same coefficient. Only the MSB LUT has to
are necessary. For the DA FIR architecture M adders are be reconfigured with a different configuration because of the
necessary in the pre-processing stage as well as an adder tree sign bit for signed multiplication. Each partial product has an
with Bx adders in the post-processing (see Fig. 3). Each of output word size of BgL , so the storage requirement is
the Bx + 1 RLUTs consist of dM/Le adders, resulting in SLUTM = 64 · M dBc /2 + 2e bit . (20)
NADD,DA = M + Bx + (Bx + 1) dM/4e . (16) As it can be seen, the required storage is independent of the
For the LUT multiplier architecture, N −1 structural adders input word size Bx and the influence of the coefficient word
are needed to compute the filter output from the multiplier size is Bc /2 for both architectures. But for the LUT multiplier
results (see Fig. 1). Each of the M reconfigurable multipliers based architecture, the memory requirement is approximately
consists of dBx /Le adders, resulting in eight times higher than for distributed arithmetic. Thus, the
reconfigurable DA architecture is always the best in terms of
NADD,LUT = N − 1 + M dBx /4e . (17) memory requirements.
To give an estimate for the architecture with the least adders, VI. R ECONFIGURATION C IRCUIT
we assume that M and Bx are both dividable by 4 and N =
The reconfiguration circuit can be built from a block RAM
2M . Then, the LUT multiplier architecture needs less adders
or distributed RAM and a simple controller. It is shown
when the following unequation is true:
in Fig. 6, together with two CFGLUTs. An input vector
NADD,LUT < NADD,DA filt_sel is used which can arbitrarily select a filter co-
2M − 1 + M Bx /4 < M + Bx + (Bx + 1)M/4 efficient set and an enable signal rec_en is used to start
the reconfiguration. Then, a 5 bit counter is used to address
3M − 4 < 4Bx (18) a sequence of 32 reconfiguration words. The RAM address
As each adder typically consists of several slices, where can then be composed of the filt_sel which addresses the
each slice contains up to four CFGLUTs, the adder estima- higher bits together with the five output bits of the counter to
tion is much more significant than the CFGLUT estimation. address the lower bits by a simple concatenation.
Hence, as a rule of thumb, if the input word size is greater Note that during the reconfiguration process, the output of
than approximately half the number of coefficients, the LUT the multiplier is not valid. This problem can be fixed by a
multiplier architecture needs less resources and vice versa. clock enable signal at the output register of the multiplier,
which leads to a neglect of the input during reconfiguration.
B. Configuration Memory Alternatively each CFGLUT can be doubled as it was proposed
With configuration memory we mean the storage require- in previous work [10]. It can be reconfigured with the new
ments for a single configuration. Thus, it has to be multiplied coefficient while the other CFGLUT is processing data. A
by the number of different configurations to get the total mem- multiplexer is used to switch between the two CFGLUTs to
ory requirement. For the DA architecture all RLUTs shown in achieve a glitch free data processing.
Fig. 3 have the same content. As each of the CFGLUT needs
VII. R ESULTS
32 bit of configuration data, the storage requirement is
Two synthesis experiments were made to evaluate the re-
SDA = 32 · dM/4e dBc /2 + 1e bit . (19) source usage, speed and reconfiguration times of the different
TABLE II
architectures. In the first experiment, the LUT based multiplier S YNTHESIS RESULTS FOR THE PROPOSED RECONFIGURABLE LUT
architecture as well as the DA architecture are synthesized for MULTIPLIER FIR FILTER IN COMPARISON WITH RECONFIGURABLE DA
different filter lengths and input word sizes. In the second FIR FILTER [10]
experiment, we used ICAP for the partial reconfiguration Reconf. FIR DA [10] Reconf. FIR LUT (proposed)
of highly optimized filter instances to compare the resource Bx N S Slices fclk Trec S Slices fclk Trec
overhead and configuration times. A VHDL code generator [bit] [MHz] [ns] [bit] [MHz] [ns]
was developed in Matlab for the different architectures. All 8 6 320 145 570.5 56.1 2112 93 647.3 49.4
synthesis results were obtained after place & route using Xilinx 8 10 640 199 492.4 65.0 3520 143 575.0 55.6
ISE v13.4 with the design goal ’speed’ for a Virtex 6 FPGA 8 13 640 191 601.3 53.2 4928 188 608.6 52.6
8 20 960 365 496.3 64.5 7040 297 534.8 59.8
(XC6VLX75T-FF784). 8 28 1280 374 525.2 60.9 9856 406 551.9 58.0
8 41 1920 627 522.7 61.2 14784 595 573.4 55.8
A. Comparison of CFGLUT based Architectures 8 61 2560 835 543.2 58.9 21824 837 499.5 64.1
8 119 4800 1487 499.5 64.1 42240 1668 504.0 63.5
To compare the two discussed architectures, several different 8 151 6080 1813 395.4 80.9 53504 2156 415.3 77.1
reconfigurable filters were synthesized. To ease the comparison 16 6 320 253 622.7 51.4 2112 179 576.7 55.5
with other work, we used a benchmark set of nine filters with 16 10 640 372 552.8 57.9 3520 262 572.4 55.9
N = 6 up to N = 151 coefficients which were already used 16 13 640 353 545.3 58.7 4928 370 522.2 61.3
16 20 960 635 467.7 68.4 7040 544 518.4 61.7
in previous publications [10], [19]–[21]. The coefficients have 16 28 1280 707 525.8 60.9 9856 782 540.3 59.2
a maximum word size of Bc = 17 bit and are available online 16 41 1920 1071 521.9 61.3 14784 1108 487.8 65.6
as MIRZAEI10 N [22]. The input word size was varied from 16 61 2560 1341 413.7 77.3 21824 1575 463.8 69.0
16 119 4800 2594 348.3 91.9 42240 3257 480.3 66.6
Bx = 8 . . . 32 bit in steps of 8 bit. Of course, any filter of the 16 151 6080 4256 410.5 78.0 53504 4149 428.8 74.6
same symmetry up to Bc , Bx and N can be reconfigured.
24 6 320 371 550.7 58.1 2112 303 536.2 59.7
Lower word sizes have to be sign extended while unused 24 10 640 522 556.2 57.5 3520 431 498.3 64.2
coefficients have to be set to zero. The same clock was used 24 13 640 535 470.8 68.0 4928 587 545.6 58.7
for the filter as well as for reconfiguration. 24 20 960 939 573.7 55.8 7040 884 531.6 60.2
24 28 1280 1029 416.5 76.8 9856 1172 482.9 66.3
Detailed synthesis results are listed in Table II, which 24 41 1920 1558 383.6 83.4 14784 1691 467.1 68.5
shows the configuration memory per instance (S), the number 24 61 2560 1958 372.0 86.0 21824 2332 435.0 73.6
of Slices, the maximum clock frequency fclk as well as 24 119 4800 4047 355.4 90.0 42240 5047 393.1 81.4
24 151 6080 5802 385.2 83.1 53504 6515 380.7 84.1
the reconfiguration time Trec = 32/fclk for both methods.
On average, both architectures achieve nearly equally low 32 6 320 490 524.9 61.0 2112 320 512.3 62.5
32 10 640 734 480.3 66.6 3520 531 485.4 65.9
reconfiguration times of 69.94 ns and 65.94 ns and fast clock 32 13 640 704 507.1 63.1 4928 771 495.8 64.5
frequencies of 472.4 MHz and 494.2 MHz, respectively. As 32 20 960 1130 517.9 61.8 7040 1026 473.9 67.5
expected from Section V-B, the reconfiguration memory of 32 28 1280 1419 430.7 74.3 9856 1482 442.9 72.3
32 41 1920 1965 376.5 85.0 14784 2091 446.0 71.7
the LUT multiplier architecture is approximately eight times 32 61 2560 2570 358.6 89.2 21824 3601 401.6 79.7
higher compared to the DA based architecture. A comparison 32 119 4800 4889 302.9 105.6 42240 6523 377.2 84.8
of the resources is visualized in Fig. 7 which shows the 32 151 6080 7170 391.1 81.8 53504 8016 386.7 82.8
percentage slice improvement of the LUT multiplier based
architecture over the DA based architecture. Hence, in case
positive values are shown, the LUT architecture is the better architectures, both logic and routing can be replaced. On the
choice and vice versa. As expected from Section V-A, the LUT one hand, each configuration can be optimized like a static
multiplier based architecture performs better for smaller filter design which is very resource efficient. On the other hand, the
instances when dN/2e < Bx . Exceptions from that rule can be storage requirement is much larger which also leads to longer
explained by the other parts of the filter, e. g., an unbalanced reconfiguration times due to the sequential ICAP interface. The
adder tree with 2n−1 < x < 2n inputs may need as many ICAP interface has a word size of 32 bit and can be operated
slices as a balanced one with x = 2n inputs due to pipelining. with up to 100 MHz [7]. However, this speed is a theoretical
Up to 35% slice reductions can be achieved by using the LUT maximum which can only be achieved with some effort [23].
multiplier architecture for small filters but the DA architecture Reconfiguration clocks of 94.5 MHz were reported by using a
is still advantageous for larger filter instances requiring up to block RAM cache that was closely located at the ICAP [24].
40% less slices. To get the maximum performance out of the ICAP approach,
we used the reduced pipelined adder graphs (RPAG) algorithm
B. Comparison with ICAP [25] which is a state-of-the-art optimization method for fixed
The standard partial reconfiguration scheme supported by coefficient multiplier blocks as needed for FIR filters. All
the Xilinx ISE tools provides that a reconfigurable region multiplications by constants are mapped to add/subtract and bit
of the FPGA is reserved which can be exchanged by any shift operations in a pipelined way. It was shown in previous
circuit in run-time using the ICAP interface. This implies work [20], [21], [26], [27], that this method outperforms
that the reconfigurable region is large enough to include the DA based filters as well as LUT based multipliers in the
largest configuration. In contrast to the discussed CFGLUT case that input word sizes of more than 12 bit are used. As
40 40

Slice improvement [%]

Slice improvement [%]


20 20

0 0

−20 −20

−40 −40
6 10 13 20 28 41 61 119 151 6 10 13 20 28 41 61 119 151
Filter length N Filter length N
(a) Input word size Bx = 8 bit (b) Input word size Bx = 16 bit
40 40
Slice improvement [%]

Slice improvement [%]


20 20

0 0

−20 −20

−40 −40
6 10 13 20 28 41 61 119 151 6 10 13 20 28 41 61 119 151
Filter length N Filter length N
(c) Input word size Bx = 24 bit (d) Input word size Bx = 32 bit

Fig. 7. Slice improvement of the reconfigurable FIR filter based on LUT multipliers in comparison with the reconfigurable DA FIR filter

TABLE III
C OMPARISON OF A SINGLE FILTER MIRZAEI10 41 WITH Bx = 16 BIT recently proposed method based on distributed arithmetic [10],
USING ICAP RECONFIGURATION AND THE CFGLUT METHODS the second one uses several instances of a reconfigurable LUT
Method S [bit] Slices fclk [MHz] Trec [ns] multiplier [14] to build a reconfigurable multiplier block as
needed in the FIR filter. Similarities between the different
RPAG [25] with ICAP 746496 502. . . 569 386.7. . . 448.8 233280
Reconf. FIR DA [10] 1920 1071 521.9 61.3 approaches were derived as both methods uses similar arith-
Reconf. FIR LUT 14784 1108 487.8 65.6 metic transformations to map large LUTs to several smaller
LUTs by the use of additional adders. It turned out that less
CFGLUTs, and, in most of the cases, less slices are needed for
the optimization heavily depends on the numeric coefficient the LUT based multiplier architecture in the case that the input
values, ten different filters were designed with the same length word size is greater than approximately half the number of
as the mid size benchmark filter MIRZAEI10 41 and an input coefficients and vice versa. Both methods have reconfiguration
word size of 16 bit. These served as realistic configurations times and memory requirements which are about four orders
which can be reconfigured via ICAP. of magnitudes faster than using partial reconfiguration via
The results are summarized in Table III. The number of the ICAP interface which is paid by approximately twice the
slices using RPAG optimized FIR filters varied for the differ- amount of slices.
ent filter instances from 502. . . 569. Hence, a reconfiguration
region with a capacity of 569 slices has to be reserved. The R EFERENCES
reconfiguration is organized in frames of 80 slices, thus, eight [1] M. Faust, O. Gustafsson, and C.-H. Chang, “Reconfigurable Multiple
frames have to be reserved where each frame contributes with Constant Multiplication Using Minimum Adder Depth,” in Conference
Record of the Forty Fourth Asilomar Conference on Signals, Systems
93312 bit, leading to a reconfiguration memory requirement and Computers (ASILOMAR), 2010, pp. 1297–1301.
of SICAP = 746496 bit per filter instance. Compared to the [2] Lowenborg and Johansson, “Minimax Design of Adjustable-bandwidth
CFGLUT-based methods, a factor of 388 and 50 more recon- Linear-phase FIR Filters,” IEEE Transactions on Circuits and Systems
I: Regular Papers, vol. 53, no. 2, pp. 431–439, 2006.
figuration memory is necessary, respectively. Assuming that [3] S. S. Demirsoy, A. Dempster, and I. Kale, “Design Guidelines for
the full performance of ICAP can be used, the reconfiguration Reconfigurable Multiplier Blocks,” in IEEE International Symposium
takes Trec = SICAP /32 · 10 ns = 233 µs. Thus, compared on Circuits and Systems (ISCAS), 2003.
[4] S. S. Demirsoy, I. Kale, and A. Dempster, “Efficient Implementation
to the slowest CFGLUT methods with 65.5 ns, the ICAP of Digital Filters Using Novel Reconfigurable Multiplier Blocks,” in
reconfiguration is a factor of 3556 slower. The price for these Conference Record of the Thirty-Eighth Asilomar Conference on Signals,
fast reconfiguration times and low memory requirements is Systems and Computers, 2004, pp. 461–464.
[5] P. Tummeltshammer, J. Hoe, and M. Puschel, “Time-Multiplexed
paid by a slice overhead of 88% and 95%, respectively. Multiple-Constant Multiplication,” IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, vol. 26, no. 9, pp.
VIII. C ONCLUSION 1551–1563, Sep. 2007.
We analyzed two reconfigurable FIR filter architectures [6] R. Gutierrez, J. Valls, and A. Perez-Pascual, “FPGA-Implementation of
Time-Multiplexed Multiple Constant Multiplication based on Carry-Save
based on the CFGLUT primitives which can be mapped to Arithmetic,” in International Conference on Field Programmable Logic
all modern FPGAs of Xilinx. The first one is based on a and Applications (FPL), 2009, pp. 609–612.
[7] Xilinx, Inc., Partial Reconfiguration User Guide, Oct. 2010. [18] Xilinx, Inc., Xilinx Virtex-5 Libraries Guide for HDL Designs, 2009.
[8] K. Bruneel, W. Heirman, and D. Stroobandt, “Dynamic Data Folding [19] S. Mirzaei, R. Kastner, and A. Hosangadi, “Layout Aware Optimization
with Parameterizable FPGA Configurations,” Transactions on Design of High Speed Fixed Coefficient FIR Filters for FPGAs,” Int. Journal
Automation of Electronic Systems, vol. 16, no. 4, pp. 43:1–43:29, Oct. of Reconfigurable Computing, vol. 2010, pp. 1–17, 2010.
2011. [20] M. Kumm and P. Zipf, “High Speed Low Complexity FPGA-based FIR
[9] K. Bruneel, “Efficient Circuit Specialization for Dynamic Reconfigura- Filters Using Pipelined Adder Graphs,” in International Conference on
tion of FPGAs,” Ph.D. dissertation, Universiteit Gent, 2011. Field-Programmable Technology (FPT), 2011, pp. 1–4.
[10] M. Kumm, K. Möller, and P. Zipf, “Reconfigurable FIR Filter Using [21] U. Meyer-Baese, G. Botella, D. Romero, and M. Kumm, “Optimization
Distributed Arithmetic on FPGAs,” in IEEE International Symposium of High Speed Pipelining in FPGA-based FIR Filter Design Using
on Circuits and Systems (ISCAS), 2013, pp. 2058–2061. Genetic Algorithm,” in Proceedings of SPIE, 2012.
[11] A. Crosisier, D. J. Esteban, M. E. Levilio, and V. Riso, “Digital Filter [22] FIRsuite, “Suite of constant coefficient FIR filters,” 2013. [Online].
for PCM Encoded Signals,” United States Patent No. 3777130, 1973. Available: http://www.firsuite.net
[12] S. Zohar, “New Hardware Realizations of Nonrecursive Digital Filters,” [23] S. Liu, R. N. Pittman, A. Forin, and J. L. Gaudiot, “On Energy Efficiency
IEEE Transactions on Computers, vol. 22, no. 4, pp. 328–338, 1973. of Reconfigurable Systems With Run-time Partial Reconfiguration,” in
[13] S. A. White, “Applications of Distributed Arithmetic to Digital Signal IEEE International Conference on Application-specific Systems Archi-
Processing: A Tutorial Review,” IEEE ASSP Mag., vol. 6, no. 3, 1989. tectures and Processors (ASAP), 2010, pp. 265–272.
[14] K. Wiatr and E. Jamro, “Implementation of Multipliers in FPGA [24] M. Liu, W. Kuehn, Z. Lu, and A. Jantsch, “Run-time Partial Reconfigu-
Structures,” International Symposium on Quality Electronic Design, pp. ration Speed Investigation and Architectural Design Space Exploration,”
415–420, 2001. in International Conference on Field Programmable Logic and Appli-
[15] K. D. Chapman, “Fast Integer Multipliers Fit in FPGAs,” Electronic cations (FPL), 2009, pp. 498–502.
Design News, 1994. [25] M. Kumm, P. Zipf, M. Faust, and C.-H. Chang, “Pipelined Adder Graph
[16] K. Chapman, “Constant Coefficient Multipliers for the XC4000E,” Xilinx Optimization for High Speed Multiple Constant Multiplication,” in IEEE
Application Note, 1996. Int. Symposium on Circuits and Systems (ISCAS), 2012, pp. 49–52.
[17] M. J. Wirthlin, “Constant Coefficient Multiplication Using Look-Up [26] U. Meyer-Baese, J. Chen, C. H. Chang, and A. G. Dempster, “A
Tables,” Journal of VLSI Signal Processing, vol. 36, no. 1, pp. 7–15, Comparison of Pipelined RAG-n and DA FPGA-based Multiplierless
Jan. 2004. Filters,” in IEEE Asia Pacific Conference on Circuits and Systems
(APCCAS), 2006, pp. 1555–1558.
[27] M. Kumm, D. Fanghänel, K. Möller, P. Zipf, and U. Meyer-Baese, “FIR
Filter Optimization for Video Processing on FPGAs,” EURASIP Journal
on Advances in Signal Processing, vol. 2013, no. 1, p. 111, 2013.
[Online]. Available: http://asp.eurasipjournals.com/content/2013/1/111

You might also like