This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2927108, IEEE Access
D-TCAM: A High-performance
Distributed RAM based TCAM
Architecture on FPGAs
MUHAMMAD IRFAN1, (Student Member, IEEE), ZAHID ULLAH2, (Member, IEEE), AND RAY C. C. CHEUNG1, (Member, IEEE)
1Department of Electrical Engineering, City University of Hong Kong, Kowloon Tong, Hong Kong (e-mail: m.irfan@my.cityu.edu.hk, racheung@cityu.edu.hk)
2Department of Electrical Engineering, CECOS University of IT and Emerging Sciences, Pakistan (e-mail: zahidullah@cecos.edu.pk)
Corresponding author: Muhammad Irfan (e-mail: m.irfan@my.cityu.edu.hk).
This work is supported by City University of Hong Kong Collaborative Research Fund (CRF): 8730043.
ABSTRACT
Ternary content-addressable memory (TCAM) is a high-speed searching device that searches its entire memory in parallel in deterministic time, unlike random-access memory (RAM), which must be searched sequentially. A network router classifies and forwards a data packet with the aid of a TCAM that stores the routing data in a table. FPGAs, with their hardware-like performance and software-like reconfigurability, are widely used in networking systems where TCAM is an essential component. TCAM is not included in modern FPGAs, which leads to the emulation of TCAM using the available resources on the FPGA. Several emulated TCAM designs have been presented, but they lack efficient utilization of the FPGA's hardware resources. In this paper, we present a novel TCAM architecture, distributed RAM based TCAM (D-TCAM), using D-CAM as a building block. One D-CAM block implements a 48-byte TCAM using 64 lookup tables (LUTs); D-CAM blocks are cascaded horizontally and vertically to increase the width and depth of the TCAM, respectively. A sample size of 512×144 is implemented on a Xilinx Virtex-6 FPGA, reducing hardware utilization by 60% compared to the state-of-the-art FPGA-based TCAMs. Similarly, by exploiting the LUT-flip-flop (LUT-FF) pair nature of Xilinx FPGAs, the proposed TCAM architecture improves throughput by 58.8% without any additional hardware cost.
INDEX TERMS Content-addressable memory (CAM), field-programmable gate array (FPGA), lookup
table (LUT), memory, random-access memory (RAM).
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/.
those from Altera (FLEX) [10] and Lattice in the form of partial memory blocks and multi-function blocks (MFBs), but TCAM is not included in the latest release of these devices. What if we need a TCAM for the implementation of a networking system inside an FPGA?

Emulated TCAMs are the alternative proposed by many researchers: they use the hardware resources of the FPGA to build a TCAM inside the device [11], [12]. FPGA vendors, including Altera [13] and Xilinx [11], [14], also provide soft IP (intellectual property) cores for TCAM to support its applications, but the memory utilization in their current versions is inefficient. Xilinx recently released an emulated TCAM to support SDNs [11], which likewise lacks efficient usage of the memory resources on the FPGA.

In this work, we propose a novel TCAM architecture using distributed RAM (D-TCAM) which outperforms the state-of-the-art FPGA-based TCAMs by efficiently utilizing the hardware resources. We exploit the physical structure of FPGA slices to improve the throughput of the TCAM by introducing pipeline registers into the design. The improvements take the form of 60% better hardware utilization and 58.8% higher throughput compared to the state-of-the-art FPGA-based TCAMs [11], [15], [16].

The rest of the paper is arranged as follows: Section II provides the motivations and key contributions of the proposed design. Section III discusses the related work. Section IV describes the proposed TCAM architecture. Section V illustrates the technical feasibility of the proposed design and comparisons with state-of-the-art TCAMs. Section VI concludes this work and highlights future directions.

II. MOTIVATIONS AND CONTRIBUTIONS
A. MOTIVATIONS
TCAM is an essential element in modern networking applications, where it provides the functionality of packet classification and packet forwarding by implementing a routing table. FPGAs, having a considerable amount of memory and logic resources, are extremely favorable for complex reconfigurable systems, e.g., software-defined networks (SDNs).

B. KEY CONTRIBUTIONS
Key contributions of the proposed work are as follows:
• The proposed TCAM design, D-TCAM, outperforms the available FPGA-based TCAMs in terms of hardware utilization on FPGA, showing an improvement of 60% over the available state-of-the-art TCAMs [11].
• The available architectures use either lookup tables (LUTs) [15] or flip-flops (FFs) [18], [19] to implement TCAM; the other element of each LUT-FF pair is inferred by the synthesizer but is not part of the design, so LUT-FF pairs are inefficiently utilized in such designs. D-TCAM exploits the paired nature of LUTs and FFs in modern FPGAs and utilizes them coherently, so that neither goes to waste: the otherwise redundant FFs are used to improve the speed of the design.
• D-TCAM provides better performance in terms of the throughput of the TCAM design, showing an improvement of 58.8% compared to the state-of-the-art FPGA-based TCAMs [11], [15].

III. RELATED WORK
Conventional TCAMs based on application-specific integrated circuits (ASICs) have low storage density, are less reconfigurable, and are challenging to integrate with FPGA-based systems [20], [21]. Emulating TCAM using the FPGA's hardware resources is a better option for integration with FPGA-based searching applications. The focus of this work is on emulated TCAMs, a mature research area comprising a large number of different architectures collectively known as FPGA-based TCAMs.

Based on the memory resources utilized to emulate TCAM functionality, FPGA-based TCAMs are classified into three categories.

1) The first is block RAM (BRAM)-based TCAMs [15], [16], [22]–[25], which use the FPGA's BRAMs to store the information of the TCAM data. HP-TCAM [22] divided the TCAM table into horizontal and vertical blocks and stored them separately in the BRAMs of the FPGA. Implementation on FPGA resources was not possible without this division, because the number of BRAMs required for the architecture presented in patent [26]
increases exponentially with the width of the TCAM. The information about the presence of a rule and its address is stored in different BRAMs, which inefficiently utilizes BRAMs even to emulate a small TCAM (512×36). Z-TCAM [27] and E-TCAM [28] are slightly improved versions of HP-TCAM, but they also suffer from inefficient memory utilization. The major drawback of HP-TCAM, Z-TCAM, and E-TCAM is the requirement that TCAM rules be in ascending or descending order; otherwise, the architecture fails to provide the match address. UE-TCAM [23] reduced the hardware utilization by storing the information about the presence of a TCAM rule and its address in the same BRAM, and its TCAM rules need not be arranged in ascending or descending order.

REST [24] proposed a hardware-efficient architecture that accesses the BRAM multiple times, which affects the throughput of the TCAM. Jiang [15] presented a scalable TCAM architecture in a sophisticated manner and utilized a large number of BRAMs to implement a 1024×150 TCAM; however, it is not entirely clear from the description in the paper how such a large size is achieved. Our proposed architecture utilizes the LUTRAM resources on FPGA according to their structure (LUT-FF pairs), which improves the throughput as well as the hardware utilization. The multi-pumping approach [29] accesses the BRAM memory multiple times in one system clock cycle to reduce hardware utilization, but the speed drops with the multiple accesses. Our proposed design does not compromise on the speed of the TCAM architecture and improves both the hardware utilization and the throughput of the emulated TCAM design.

2) The second type of FPGA-based TCAM is emulation using the distributed-RAM resources on FPGAs [11], [14]–[16]. Most of the previous works initially designed with BRAMs have also been implemented using distributed RAM. In 2012, Xilinx presented a TCAM architecture in its application note using SRL16E, which uses a 4-input LUTRAM as a shift register [14]. One SRL16E implements 2 bits of TCAM; if the same approach is applied to a 6-input LUTRAM, only a 4-bit TCAM can be emulated. Our proposed TCAM architecture utilizes a 6-input LUTRAM to emulate a 6-bit TCAM and combines 64 LUTRAMs to emulate 48 bytes of TCAM. Jiang [15] also proposed a distributed RAM-based TCAM architecture which uses a huge number of slices: 20526 slices to implement a 1024×150 TCAM on a Xilinx Virtex-7 FPGA, while our proposed design uses 4835 slices to implement a 512×144 TCAM on a Xilinx Virtex-6 FPGA. The throughput of Jiang's TCAM design and our proposed D-TCAM are 20.9 and 37.3 Gb/s, respectively.

In 2017, Xilinx released a soft IP core for SDNet which uses 12011 slices and 3 BRAMs (36 Kb) to implement a 512×128 TCAM [11]. The architecture's internal design is not disclosed in the IP data sheet; only implementation results are mentioned. Our proposed TCAM architecture improves on the Xilinx design in slice utilization as well as throughput, by 60% and 58.8%, respectively. In DURE [30], one slice is utilized to implement a 1×18 block of TCAM, and carry-chain logic rolls up the AND logic of the match lines across the subsequent slices. It uses 4 LUTRAMs to emulate 18 bits of TCAM, while our proposed TCAM architecture uses 4 LUTRAMs to emulate 24 bits of TCAM. Our proposed design provides a building block of 64×6 TCAM which can be expanded horizontally and vertically to implement larger TCAMs.

Distributed RAM-based TCAM architectures waste a large number of flip-flops in FPGAs: because only the LUT of each LUT-FF pair is needed to implement the design, the paired FFs go unused, which we call inefficient utilization of memory resources. We have designed the proposed architecture keeping in mind the hardware structure of FPGA slices and their other components, as explained in the upcoming sections.

3) The third type of FPGA-based TCAMs in the literature is register-based TCAMs [18], [19], [31], [32], which use FFs to store the TCAM rules and compare each flip-flop with the input search key in parallel. They are also known as gate-based TCAMs; FFs in FPGA terminology are known as slice registers (SRs). Gate-based TCAMs reduce the hardware resources in terms of slices per TCAM bit but lack scalability because of their high routing complexity. LH-CAM [18] and G-AETCAM [19] are implemented as 64×36 BiCAM and TCAM, respectively, on Xilinx FPGAs. They showed improvements in hardware resources and speed for small TCAMs, but for larger sizes they are not a good option for emulating TCAM on FPGA.

Some other sub-types include hash-based TCAMs [7], algorithmic TCAMs, pseudo-TCAM [2], and cuckoo hashing [33], which have non-deterministic search latency and suffer from collisions as well as bucket overflow. Algorithmic TCAMs perform best for a specific set of rules, as they rely on a specific pattern in the input search key [34]. Updating procedures [35], [36] and soft-error detection [9] are other research directions in FPGA-based TCAMs. The scope of this paper is to propose a novel TCAM design which outperforms the state-of-the-art TCAM architectures in hardware utilization and throughput.

Our proposed architecture is based on the RAM of the FPGA, so only type 1 and type 2 FPGA-based TCAMs are part of our comparisons, as both use the RAM available in FPGAs. To make a fair comparison between BRAMs and distributed RAM, we computed the number of normalized slices for all of the TCAM architectures, as discussed in Section V.

IV. PROPOSED ARCHITECTURE
A. TERMINOLOGY
Table 1 lists the important terminologies used in the proposed design.

B. FPGA STRUCTURE
Modern FPGAs consist of reconfigurable hardware components, to implement any digital system, supported by the
interconnection matrix. The main parts of Xilinx FPGAs are block RAMs (BRAMs), digital signal processing (DSP) blocks, configurable logic blocks (CLBs), and input/output (I/O) blocks (IOBs), as shown in Fig. 2. A CLB consists of two types of slices: SLICEM and SLICEL. Each slice has four lookup tables (LUTs) and eight flip-flops (also known as SRs). LUTs inside a SLICEL can be configured as logic elements, each implementing any function of up to 6 inputs (using one LUT), and are known as function generators. LUTs inside a SLICEM can be configured as logic elements as well as memory elements, and are known as LUTRAMs. Each LUTRAM is a column memory of 64 cells which can be used as a 64-bit RAM. Multiple LUTRAMs combine to form a distributed RAM.

FIGURE 2: Internal structure of an FPGA. (BRAM: block RAM, DSP block: digital signal processing block, CLB: configurable logic block, IOB: input/output block).

FIGURE 3: Converting LUTRAMs into TCAM. (a, b) One 3-input LUTRAM emulating a 1×3 TCAM. (c) Two 3-input LUTRAMs emulating a 2×3 TCAM. (d) Two 3-input LUTRAMs emulating a 1×6 TCAM with a 2-input AND gate. (e) Four 3-input LUTRAMs emulating a 2×6 TCAM with two 2-input AND gates.

C. LUTRAM AS TCAM
A 3-input LUTRAM saves the information of a 1×3 TCAM. The address input of the LUTRAM is treated as the search key (Sk) of the 1×3 TCAM. Only those location(s) whose address corresponds to a rule present in the TCAM store 'logic-1'; otherwise, 'logic-0' is stored. For instance, a 3-input LUTRAM saving the TCAM rule "0 1 1" holds 'logic-1' in the 3rd location of the LUTRAM and 'logic-0' everywhere else, as shown in Fig. 3(a). If the TCAM rule has 'X' bit(s), multiple locations of the LUTRAM are 'logic-1'. For instance, a 3-input LUTRAM saving the TCAM rule "0 X 1" holds 'logic-1' in the 1st and 3rd locations of the LUTRAM, as shown in Fig. 3(b).

Two 3-input LUTRAMs are cascaded to form a 2×3 TCAM; the LUTRAMs in Fig. 3(c) store "1 0 X" and "X 0 1" as two 3-bit TCAM rules. Similarly, two 3-input LUTRAMs are connected in parallel with their outputs ANDed to form a 1×6 TCAM; the LUTRAMs in Fig. 3(d) store "1 0 X 0 1 0" as one
6-input rule. An input search key (Sk) equal to "1 0 0 0 1 0" or "1 0 1 0 1 0" matches the stored rule and provides a match line (ML) equal to 'logic-1' as output.

Four 3-input LUTRAMs are connected, two in parallel and two in cascade, to emulate a 2×6 TCAM. In Fig. 3(e), two TCAM rules of 6 bits each ("1 0 0 X 1 0" and "1 X X 0 0 1") are stored.

To simplify the understanding, we explained the emulation of TCAM using 3-input LUTRAMs. Modern FPGAs mostly have 6-input LUTRAMs, as in Xilinx FPGAs. In the proposed TCAM design we use 6-input LUTRAMs, each of which saves the information of a 1×6 TCAM in a similar way as described for the 3-input LUTRAM in Fig. 3(a). A D-CAM block has 64 6-input LUTRAMs, which implement a 64×6 TCAM. Multiple D-CAM blocks are connected to form larger TCAMs, in cascade and parallel form, as described for multiple 3-input LUTRAMs in Fig. 3(e).

FIGURE 4: A building block of D-TCAM. (Sk: search key, MLs: match lines, PE: priority encoder). (a) SRAM memory for one D-CAM block. (b) 64×6 TCAM emulated using a 64×64 SRAM memory made of 64 LUTRAMs and a priority encoder.

D. ONE D-CAM BLOCK
The D-CAM block is the basic building block of the proposed architecture, D-TCAM. One D-CAM block consists of 64 6-input LUTs. These LUTs come from the SLICEM slices of Xilinx FPGAs and are also known as LUTRAMs. Fig. 4(a) shows a D-CAM block, which is a 64×64 distributed memory consisting of 64 LUTRAMs. All of the memory locations can be read and written using Address and Data_in as inputs, with Data_out as the output of the whole memory block (D-CAM). The same memory of Fig. 4(a) is used in Fig. 4(b) with a priority encoder, changing it into a TCAM memory.

Address in Fig. 4(a) represents the search key (Sk) of the TCAM memory in Fig. 4(b), and Data_out in Fig. 4(a) represents its match lines (MLs). The final output of a TCAM is always the address where the search key is found. A priority encoder (PE) is normally used to translate the match lines into an address, but the PE has different implementation designs of its own and is not part of the proposed design; Fig. 4(b) shows a PE only to ease the understanding of a typical TCAM data flow.

E. MODULAR STRUCTURE OF D-TCAM
Each D-CAM block implements 48 bytes of TCAM with configuration 64×6, where 64 is the depth (D) and 6 is the width (W) of the TCAM. Increasing the width of the TCAM from 6 bits to 12 bits requires an additional D-CAM block while keeping the depth of the TCAM constant, as shown in Fig. 5(a). Similarly, increasing the depth of the TCAM from 64 to 128 requires an additional D-CAM block while keeping the width constant, as shown in Fig. 5(b). Horizontal expansion of D-CAM blocks increases the width, while vertical expansion increases the depth of the emulated D-TCAM. Expansion in both directions creates the desired size configuration of TCAM on FPGAs. Fig. 6 shows the formation of a 128×12 D-TCAM from four D-CAM blocks.
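The behaviour described above (LUTRAM contents derived from ternary rules, per-rule match lines ANDed across 6-bit slices of the key, and a priority encoder) can be sketched as a small software model. This is our own illustrative Python sketch, not the hardware design; all function names are ours:

```python
# Software model of the D-CAM/D-TCAM search scheme (illustrative only).
# A 6-input LUTRAM is modelled as a 64-entry bit list: a ternary rule
# ('0', '1', 'X') sets every address that matches it to logic-1.
from itertools import product

LUT_INPUTS = 6  # modern Xilinx FPGAs have 6-input LUTRAMs

def lutram_from_rule(rule6):
    """Build the 64-bit LUTRAM contents for one 6-bit ternary rule."""
    assert len(rule6) == LUT_INPUTS
    bits = [0] * (1 << LUT_INPUTS)
    for key in product('01', repeat=LUT_INPUTS):
        # 'X' matches either key bit; '0'/'1' must match exactly
        if all(r in ('X', k) for r, k in zip(rule6, key)):
            bits[int(''.join(key), 2)] = 1
    return bits

def dtcam_search(table, key):
    """Search a D-TCAM: 6-bit slices of the key address the LUTRAMs of
    one row; the per-slice outputs are ANDed into one match line per
    stored rule, and a priority encoder picks the lowest match."""
    n = len(table[0]) // LUT_INPUTS                  # blocks per row
    slices = [key[i*6:(i+1)*6] for i in range(n)]    # split search key
    mls = []
    for rule in table:                               # one ML per rule
        parts = [rule[i*6:(i+1)*6] for i in range(n)]
        mls.append(all(lutram_from_rule(p)[int(s, 2)]
                       for p, s in zip(parts, slices)))
    # priority encoder: first (lowest) matching address, else None
    return next((i for i, m in enumerate(mls) if m), None)

table = ['10X010' + 'XXXXXX',   # rule 0: 12-bit rule with don't-cares
         '011011' + '000000']   # rule 1
print(dtcam_search(table, '101010' + '111111'))  # matches rule 0 -> 0
```

In the hardware, each column of `lutram_from_rule` output is one 64-cell LUTRAM and the `all(...)` reduction is the AND gate (or carry chain) over horizontally cascaded D-CAM blocks; the model only mirrors that data flow.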
FIGURE 6: Formation of a 128×12 D-TCAM from cascaded/parallel D-CAM blocks, with the match lines of horizontally cascaded blocks ANDed and registered in FFs.
Algorithm 1 Populating Algorithm (D×W TCAM)
Input: TCAM table (D×W)
Output: D-TCAM table (m·n sub-tables)
1: for t ← 0 to m·n do
2:     for i ← 0 to m do
3:         for j ← 0 to n do
4:             D-CAM[i][j] ⇐ TCAM sub-table[t]
5:         end for
6:     end for
7: end for
m × n: total number of sub-tables in the TCAM, which is equal to the number of D-CAM blocks in the D-TCAM.
m: number of sub-tables in a column.
n: number of sub-tables in a row.

TABLE 2: Speed (in MHz) of D-TCAM for different sizes of TCAM. (W: width of TCAM, D: depth of TCAM. The left value is the speed of D-TCAM without pipeline registers; the right value is the speed of D-TCAM with pipeline registers.)

D \ W      18          36          72          144
64         513 — 793   459 — 717   368 — 526   353 — 492
128        425 — 685   459 — 701   345 — 459   258 — 332
256        369 — 450   376 — 510   238 — 351   276 — 370
512        362 — 444   326 — 460   188 — 214   207 — 259

TABLE 3: Number of slices used by D-TCAM for different sizes of TCAM. (W: width of TCAM, D: depth of TCAM. The left value is the slices used by D-TCAM without pipeline registers; the right value is the slices used by D-TCAM with pipeline registers.)

D \ W      18          36          72           144
64         76 — 68     123 — 119   280 — 288    556 — 562
128        128 — 128   254 — 254   587 — 583    1078 — 1067
256        271 — 258   458 — 462   1205 — 1168  2435 — 2395
512        538 — 536   930 — 968   2387 — 2357  4842 — 4835

If we apply Algorithm 1 to Fig. 6, t is 4. Theoretically, a total of m·n·64 LUTRAMs is required to implement a D×W D-TCAM because each D-CAM block consists of 64 LUTRAMs.

The update latency of RAM-based TCAMs is proportional to the depth of the utilized RAM blocks. BRAM-based TCAMs have a 513-clock-cycle update latency, while distributed-RAM based TCAMs have an update latency of 65 clock cycles. Our proposed design is based on LUTRAMs and can correspondingly be updated in 65 clock cycles, like a typical distributed RAM-based TCAM.
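Algorithm 1 amounts to slicing the rule table into 64×6 tiles, one per D-CAM block. A minimal Python sketch of this partitioning (our own illustrative code; the names follow the algorithm's m, n, and D-CAM[i][j]):

```python
# Partition a D x W TCAM table into m*n sub-tables of 64 rules x 6 bits,
# one per D-CAM block, mirroring Algorithm 1 (illustrative sketch).

def populate_dtcam(tcam_table):
    D, W = len(tcam_table), len(tcam_table[0])
    assert D % 64 == 0 and W % 6 == 0, "pad the table to 64x6 multiples"
    m, n = D // 64, W // 6          # sub-tables per column / per row
    dcam = [[None] * n for _ in range(m)]
    for i in range(m):              # vertical position of the block
        for j in range(n):          # horizontal position of the block
            dcam[i][j] = [row[j*6:(j+1)*6]
                          for row in tcam_table[i*64:(i+1)*64]]
    return dcam                     # dcam[i][j] configures one D-CAM block

# 128x12 TCAM (as in Fig. 6) -> m = 2, n = 2, i.e. t = 4 D-CAM blocks
table = ['0' * 12 for _ in range(128)]
blocks = populate_dtcam(table)
print(len(blocks) * len(blocks[0]))  # -> 4
```

Note that writing one rule into a block means rewriting a full 64-deep LUTRAM column, which is consistent with the 64 + 1 = 65-cycle update latency stated above (and 512 + 1 = 513 cycles for a 512-deep BRAM).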
FIGURE 8: The effect of using pipeline registers. (a) Number of slices used vs. TCAM depth without pipelining. (b) Number of slices used vs. TCAM depth with pipelining. (c) Speed of the design vs. TCAM depth without pipelining. (d) Speed of the design vs. TCAM depth with pipelining. The four lines in each graph correspond to different TCAM widths (18, 36, 72, and 144 bits).
For instance, HP-TCAM is implemented using BRAM, and its utilization is published in the form of BRAMs, so we converted the BRAMs to normalized slices using (1), which is provided in the comparison table as the number of slices (Slices′). Most of the TCAM designs under comparison are implemented using Virtex-6; we also implemented D-TCAM on Virtex-6 to make the comparisons fair. Some of the architectures are implemented on different devices which use a different technology, e.g., Virtex-7 is used by Jiang [15] and Xilinx [11], while Kintex-7 is used by REST [24]. Virtex-6 is a 40 nm device while Virtex-7 and Kintex-7 are 28 nm devices, which also differ in speed. For the comparisons, the normalized speed is calculated using (2), where Speed′ is the normalized speed on the target device and Vdd is always equal to 1.0 V.

Speed′ = Speed × [Device (nm) / Virtex-6 (nm)]    (2)

The throughput (in Gbit/s) of the proposed design is found using (3), where Speed is the maximum clock rate of the emulated TCAM and W is the width of the emulated TCAM.

Throughput = Speed × W    (3)

C. IMPACT OF PIPELINING
The TCAM architectures that use LUTRAMs for emulating TCAM do not use the FFs of the LUT-FF pairs in the FPGA's slices. We have used those redundant FFs as pipeline registers to improve the speed of the proposed TCAM architecture. Pipeline registers are FFs which would ordinarily require additional slices in a design, but in our case almost no additional slices are used while the speed improvement is prominent. To visualize this more clearly, we have plotted four graphs, Fig. 8(a)-(d), showing the change in slice usage and the change in speed of the proposed design with and without pipeline registers.
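As a quick arithmetic check of (2) and (3) using figures reported in this paper (the helper names are ours, for illustration only):

```python
# Normalized speed, eq. (2), and throughput, eq. (3), with paper values.

def normalized_speed(speed_mhz, device_nm, virtex6_nm=40):
    # Eq. (2): scale a result on another device by the technology ratio
    return speed_mhz * device_nm / virtex6_nm

def throughput_gbps(speed_mhz, width_bits):
    # Eq. (3): Throughput = Speed * W (MHz * bits -> Gbit/s)
    return speed_mhz * width_bits / 1000.0

# 512x144 D-TCAM with pipeline registers runs at 259 MHz (Table 2)
print(round(throughput_gbps(259, 144), 1))  # -> 37.3 Gb/s, as reported

# A result on a 28 nm device (Virtex-7/Kintex-7) normalized to 40 nm
print(normalized_speed(200, 28))            # -> 140.0 MHz
```

The first line reproduces the 37.3 Gb/s figure quoted for D-TCAM; the second shows how a 28 nm result is scaled down for a fair comparison against the 40 nm Virtex-6.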
The horizontal dotted lines in Fig. 8(a) and Fig. 8(b) show that the increase in slices is almost negligible, while those in Fig. 8(c) and Fig. 8(d) show that the increase in speed is prominent. The speed of every emulated D-TCAM design with pipeline registers is greater than the speed of the corresponding design without them. Our proposed architecture thus improves the speed of the TCAM without using additional hardware resources on the FPGA. The ideal case of this implementation can be seen in Tables 2 and 3 for the D-TCAM of size 128×36: the slices used in both cases (with and without pipeline registers) are the same (254), but the speed is improved by 52% (from 459 to 701 MHz).

Table 4 summarizes the state-of-the-art FPGA-based TCAM architectures and shows that our proposed architecture improves the hardware resources by 60% in terms of slices on FPGA. The throughput is improved from 29.8 Gb/s to 37.3 Gb/s by using the redundant FFs of the LUT-FF pairs as pipeline registers. The state-of-the-art TCAM design [11] has a throughput of 15.36 Gb/s, while our proposed architecture significantly improves it to 37.3 Gb/s, a 58.8% improvement, because of the novel TCAM design using LUT-FF pairs.

VI. CONCLUSIONS AND FUTURE WORK
The proposed TCAM design efficiently utilizes the hardware resources on FPGA and exploits the internal structure of the slices inside the FPGA. Pipelining improved the performance of the TCAM, in terms of throughput, without extra hardware cost compared to the state-of-the-art FPGA-based TCAMs. Future work targets further optimization of emulated TCAM for networking applications to utilize the hardware resources on FPGAs even more efficiently.

REFERENCES
[1] R. Karam, R. Puri, S. Ghosh, and S. Bhunia, "Emerging trends in design and applications of memory-based computing and content-addressable memories," Proceedings of the IEEE, vol. 103, no. 8, pp. 1311–1330, 2015.
[2] W. Yu, S. Sivakumar, and D. Pao, "Pseudo-TCAM: SRAM-based architecture for packet classification in one memory access," IEEE Networking Letters, 2019.
[3] X. Chen, H. Huo, J. Huan, and J. S. Vitter, "Efficient graph similarity search in external memory," IEEE Access, vol. 5, pp. 4551–4560, 2017.
[4] E. I. Guerra-Hernandez, A. Espinal, P. Batres-Mendoza, C. Garcia-Capulin, R. Romero-Troncoso, and H. Rostro-Gonzalez, "A FPGA-based neuromorphic locomotion system for multi-legged robots," IEEE Access, 2017.
[5] S. Mittal, "A survey of techniques for architecting TLBs," Concurrency and Computation: Practice and Experience, vol. 29, no. 10, p. e4061, 2017.
[6] O. Mujahid, Z. Ullah, H. Mahmood, and A. Hafeez, "Fast pattern recognition through an LBP driven CAM on FPGA," IEEE Access, vol. 6, pp. 39525–39531, 2018.
[7] Y. Yu, D. Belazzougui, C. Qian, and Q. Zhang, "Memory-efficient and ultra-fast network lookup and forwarding using Othello hashing," IEEE/ACM Transactions on Networking, 2018.
[8] T.-S. Chen, D.-Y. Lee, T.-T. Liu, and A.-Y. Wu, "Dynamic reconfigurable ternary content addressable memory for OpenFlow-compliant low-power packet processing," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 63, no. 10, pp. 1661–1672, 2016.
[9] A. Ullah, P. Reviriego, A. Sánchez-Macián, and J. A. Maestro, "Multiple cell upset injection in BRAMs for Xilinx FPGAs," IEEE Transactions on Device and Materials Reliability, vol. 18, no. 4, pp. 636–638, 2018.
[10] Altera, "FLEX 10K embedded programmable logic device family data sheet," 2003.
[11] Xilinx, "Ternary Content Addressable Memory (TCAM) Search IP for SDNet," Xilinx Product Guide PG190, San Jose, USA, Nov. 2017.
[12] P. Alfke, "Creative uses of block RAM," White Paper: Virtex and Spartan FPGA Families, Xilinx, 2008.
[13] Altera, "Implementing high speed search applications with Altera CAM," Application Note 119, v2.1, July 2001.
[14] K. Locke, "Parameterizable content-addressable memory," Xilinx Application Note XAPP1151, 2011.
[15] W. Jiang, "Scalable ternary content addressable memory implementation using FPGAs," in Proceedings of the Ninth ACM/IEEE Symposium on Architectures for Networking and Communications Systems. IEEE Press, 2013, pp. 71–82.
[16] Z. Qian and M. Margala, "Low power RAM-based hierarchical CAM on FPGA," in ReConFigurable Computing and FPGAs (ReConFig), 2014 International Conference on. IEEE, 2014, pp. 1–4.
[17] A. M. Abdelhadi, G. G. Lemieux, and L. Shannon, "Modular block-RAM-based longest-prefix match ternary content-addressable memories," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 243–2437.
[18] Z. Ullah, "LH-CAM: Logic-based higher performance binary CAM architecture on FPGA," IEEE Embedded Systems Letters, vol. 9, no. 2, pp. 29–32, 2017.
[19] M. Irfan and Z. Ullah, "G-AETCAM: Gate-based area-efficient ternary content-addressable memory on FPGA," IEEE Access, vol. 5, pp. 20785–20790, 2017.
[20] H. M. Kittur et al., "Precharge-free, low-power content-addressable memory," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 8, pp. 2614–2621, 2016.
[21] S. Mishra and A. Dandapat, "Energy-efficient adaptive match-line controller for large-scale associative storage," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 64, no. 6, pp. 710–714, 2017.
[22] Z. Ullah, K. Ilgon, and S. Baeg, "Hybrid partitioned SRAM-based ternary content addressable memory," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, no. 12, pp. 2969–2979, 2012.
[23] Z. Ullah, M. K. Jaiswal, R. C. Cheung, and H. K. So, "UE-TCAM: An ultra efficient SRAM-based TCAM," in TENCON 2015 - 2015 IEEE Region 10 Conference. IEEE, 2015, pp. 1–6.
[24] A. Ahmed, K. Park, and S. Baeg, "Resource-efficient SRAM-based ternary content addressable memory," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 4, pp. 1583–1587, 2017.
[25] C. A. Zerbini and J. M. Finochietto, "Performance evaluation of packet classification on FPGA-based TCAM emulation architectures," in Global Communications Conference (GLOBECOM), 2012 IEEE. IEEE, 2012, pp. 2766–2771.
[26] M. Somasundaram, "Circuits to generate a sequential index for an input number in a pre-defined list of numbers," US Patent 7,155,563, Dec. 26, 2006.
[27] Z. Ullah, M. K. Jaiswal, and R. C. Cheung, "Z-TCAM: An SRAM-based architecture for TCAM," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 2, pp. 402–406, 2015.
[28] ——, "E-TCAM: An efficient SRAM-based architecture for TCAM," Circuits, Systems, and Signal Processing, vol. 33, no. 10, pp. 3123–3144, 2014.
[29] I. Ullah, Z. Ullah, and J.-A. Lee, "Efficient TCAM design based on multipumping-enabled multiported SRAM on FPGA," IEEE Access, vol. 6, pp. 19940–19947, 2018.
[30] I. Ullah, Z. Ullah, U. Afzaal, and J.-A. Lee, "DURE: An energy- and resource-efficient TCAM architecture for FPGAs with dynamic updates," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019.
[31] M. Irfan and A. Ahmad, "Impact of initialization on gate-based area efficient ternary content-addressable memory," in 2018 International Conference on Computing, Electronics & Communications Engineering (iCCECE). IEEE, 2018, pp. 328–332.
[32] H. Mahmood, Z. Ullah, O. Mujahid, I. Ullah, and A. Hafeez, "Beyond the limits of typical strategies: Resources efficient FPGA-based TCAM," IEEE Embedded Systems Letters, 2018.
[33] S. Pontarelli, P. Reviriego, and J. A. Maestro, "Parallel D-pipeline: A cuckoo hashing implementation for increased throughput," IEEE Transactions on Computers, vol. 65, no. 1, pp. 326–331, 2016.
[34] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Hel-Or, "Encoding short ranges in TCAM without expansion: Efficient algorithm and applications," IEEE/ACM Transactions on Networking, vol. 26, no. 2, pp. 835–850, 2018.
VOLUME 4, 2016 9
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/.
TABLE 4: Comparison with other TCAM architectures. (Tech.: technology node in nm; Speed: speed of the design in MHz; Speed′: normalized speed of the design computed using (2); Slices: number of slices used by the design on the target FPGA; Slices′: number of slices computed using (1) by combining BRAMs and slices; Throughput is in Gbits/s, computed using (3).)
TCAM Design        TCAM Size   FPGA Device (Tech.)  BRAMs (36K)  Slices (#)  Slices′ (#)  Speed (MHz)  Speed′ (MHz)  Throughput (Gb/s)
Xilinx Locke [14]  256 × 32    Virtex-5 (65 nm)     0            1406        —            100          163           5.2
HP-TCAM [22]       512 × 36    Virtex-6 (40 nm)     56           1637        2981         118          —             4.25
Z-TCAM [27]        512 × 36    Virtex-6 (40 nm)     40           1116        2076         159          —             5.72
E-TCAM [28]        512 × 36    Virtex-6 (40 nm)     40           1116        2076         164          —             5.77
UE-TCAM [23]       512 × 36    Virtex-6 (40 nm)     32           913         1681         202          —             7.26
REST [24]          72 × 28     Kintex-7 (28 nm)     1            77          101          50           35            0.98
Hierarchical [16]  504 × 180   Virtex-6 (40 nm)     0            10067       —            109          —             19.26
DURE [30]          512 × 36    Virtex-6 (40 nm)     0            1668        —            335          —             12.06
Scalable [15]      1024 × 150  Virtex-7 (28 nm)     0            20526       —            199          139           20.9
Xilinx SDNet [11]  512 × 128   Virtex-7 (28 nm)     3            12011       12083        171          120           15.36
D-TCAM I           512 × 36    Virtex-6 (40 nm)     0            968         —            460          —             16.6
D-TCAM I           512 × 72    Virtex-6 (40 nm)     0            2357        —            214          —             15.4
D-TCAM II          512 × 144   Virtex-6 (40 nm)     0            4835        —            259          —             37.3
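The Throughput column can be cross-checked against the Speed and TCAM-size columns. For the D-TCAM rows, the values are consistent with Eq. (3) taking the form throughput = speed × word width, i.e., one TCAM word is matched per clock cycle; that equation form is inferred here from the table data, not quoted from the text. A minimal sketch:

```python
# Cross-check of Table 4's Throughput column for the D-TCAM rows,
# assuming Eq. (3) is: Throughput (Gb/s) = Speed (MHz) * word width (bits) / 1000.
rows = [
    ("D-TCAM I",  36,  460),  # (design, TCAM width in bits, speed in MHz)
    ("D-TCAM I",  72,  214),
    ("D-TCAM II", 144, 259),
]
for name, width_bits, speed_mhz in rows:
    # One word searched per cycle -> bits per second = f_clk * width.
    throughput_gbps = speed_mhz * width_bits / 1000
    print(f"{name}: {throughput_gbps:.1f} Gb/s")
# Prints 16.6, 15.4, and 37.3 Gb/s, matching the table.
```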
[35] F. Syed, Z. Ullah, and M. K. Jaiswal, "Fast Content Updating Algorithm for an SRAM-based TCAM on FPGA," IEEE Embedded Systems Letters, 2017.
[36] D.-Y. Lee, C.-C. Wang, and A.-Y. Wu, "Bundle-Updatable SRAM-Based TCAM Design for OpenFlow-Compliant Packet Processor," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019.

MUHAMMAD IRFAN received the B.Sc. degree in electrical engineering from the University of Engineering and Technology, Peshawar, Pakistan, in 2013, and the M.S. degree in electrical engineering from Lahore University of Management Sciences, Lahore, Pakistan, in 2016. He taught for two years in the Department of Electrical Engineering, CECOS University of IT & Emerging Sciences, Peshawar, Pakistan, and is currently on leave from there. He is pursuing the Ph.D. degree in electrical engineering at City University of Hong Kong, Hong Kong. His research interests include FPGA-based digital system design, low-power computer architectures, and memory design.

DR. RAY C. C. CHEUNG received the B.Eng. and M.Phil. degrees in computer engineering and computer science and engineering from the Chinese University of Hong Kong (CUHK), Hong Kong, in 1999 and 2001, respectively, and the Ph.D. degree and DIC in computing from Imperial College London, London, United Kingdom, in 2007. After completing the Ph.D., he received the Hong Kong Croucher Foundation Fellowship for postdoctoral study in the Electrical Engineering Department, University of California, Los Angeles (UCLA). In 2009, he was a visiting research fellow in the Department of Electrical Engineering, Princeton University, Princeton, NJ. Currently, he is an associate professor in the Department of Electronic Engineering, City University of Hong Kong (CityU). He is the author of more than 150 journal and conference papers. His research team, the CityU Architecture Lab for Arithmetic and Security (CALAS), focuses on reconfigurable trusted computing, applied cryptography, and high-performance biomedical VLSI design. He is a member of the IEEE.